# Homework 5: Exploratory Data Analysis (EDA)
In this assignment, you are going to perform exploratory data analysis (EDA) on a small dataset of your choice. You can choose any dataset you like, but you are encouraged to pick a dataset that you are interested in. You can use the datasets you have used in the previous assignments or you can choose a new dataset. If you don't have a dataset in mind, you can choose one from the datasets in the `Datasets` folder of the course repository.

### Instructions

1. Follow the instructions on how to set up your Python and Jupyter (or VSCode) environment and how to clone or download our repository. The instructions can be found in the class notes:
   https://filipinascimento.github.io/usable_ai/m00-setup/class
2. Ensure that you have Python and Jupyter Notebook working. (You can also try using Google Colab. This is not the preferred method for this homework, but it is an option.)
3. Load the dataset of your choice into a Pandas dataframe
4. Perform exploratory data analysis (EDA) on the dataset. Your analysis should include the following:
    - Summary statistics of the dataset
    - Data cleaning and preprocessing
    - Data visualization (e.g., histograms, scatterplots, etc.)
    - You should write a brief summary of the insights and conclusions you have drawn from your analysis.
    You can use [exploratory_data_analysis.ipynb](notebook) as a reference.
5. **Important**: Create both code and markdown cells in your notebook to document your analysis.
6. Submit your completed notebook as an HTML export or a PDF file.

### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command:
 - Open the command palette (Ctrl+Shift+P, or Cmd+Shift+P on Mac).
 - Search for `Jupyter: Export to HTML`.
 - Save the HTML file to your computer and submit it via Canvas.

---

> **Using Generative AI Responsibly**
>
> You're welcome to use Generative AI to assist your learning, but focus on understanding the concepts rather than just solving the assignment. For example, instead of copying and pasting the question into the model, ask it to explain the concept in the question. Try asking: `How can I open a file in Python? Can you give me examples?` or `What functions and methods can I use to extract the words of a text file? Can you explain how they work with some examples?`
>
> This way, you will learn how the solution works while building your skills. Remember to give context to the generative AI, so it can better assist you. Talk to the instructor and AIs if you have any questions or need insights.

Create your cells below this one. Hint: start by importing the necessary libraries and loading your dataset.

___
### Source dataset

https://www.kaggle.com/datasets/oracledevrel/formulaaihackathon2022
https://github.com/oracle-devrel/formula-ai-2022-hackathon

---

### Load the dataset of your choice into a Pandas dataframe

In [2]:
# Load modules needed
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import os

# Local directory
print(os.getcwd())

c:\Ricardo\2025-02 SP25 USABLE ARTIFICIAL INTELLIGENCE\GitHub\usable_ai\Homework


### Perform exploratory data analysis (EDA) on the dataset.

In [3]:
df_weather = pd.read_csv("../Datasets/weather.csv")
df_weather.head()



|  | M_PACKET_FORMAT | M_GAME_MAJOR_VERSION | M_GAME_MINOR_VERSION | M_PACKET_VERSION | M_PACKET_ID | M_SESSION_UID | M_SESSION_TIME | M_FRAME_IDENTIFIER | M_PLAYER_CAR_INDEX | M_SECONDARY_PLAYER_CAR_INDEX | ... | M_AI_DIFFICULTY | M_PIT_SPEED_LIMIT | M_NETWORK_GAME | M_TOTAL_LAPS | M_STEERING_ASSIST | M_IS_SPECTATING | M_DYNAMIC_RACING_LINE | M_DRSASSIST | M_NUM_MARSHAL_ZONES | Unnamed: 58 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 1 | 14 | 1 | 1 | 1.30021e+19 | 2803.836 | 82458 | 0 | 255 | ... | 0 | 80 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |  |
| 1 | 2021 | 1 | 14 | 1 | 1 | 1.30021e+19 | 2803.836 | 82458 | 0 | 255 | ... | 0 | 80 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |  |
| 2 | 2021 | 1 | 14 | 1 | 1 | 1.30021e+19 | 2803.836 | 82458 | 0 | 255 | ... | 0 | 80 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |  |
| 3 | 2021 | 1 | 14 | 1 | 1 | 1.30021e+19 | 2803.836 | 82458 | 0 | 255 | ... | 0 | 80 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |  |
| 4 | 2021 | 1 | 14 | 1 | 1 | 1.30021e+19 | 2803.836 | 82458 | 0 | 255 | ... | 0 | 80 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |  |


In [4]:
# Data types
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3572328 entries, 0 to 3572327
Data columns (total 59 columns):
 #   Column                                          Dtype  
---  ------                                          -----  
 0   M_PACKET_FORMAT                                 int64  
 1   M_GAME_MAJOR_VERSION                            int64  
 2   M_GAME_MINOR_VERSION                            int64  
 3   M_PACKET_VERSION                                int64  
 4   M_PACKET_ID                                     int64  
 5   M_SESSION_UID                                   float64
 6   M_SESSION_TIME                                  float64
 7   M_FRAME_IDENTIFIER                              int64  
 8   M_PLAYER_CAR_INDEX                              int64  
 9   M_SECONDARY_PLAYER_CAR_INDEX                    int64  
 10  M_BRAKING_ASSIST                                int64  
 11  M_SESSION_LINK_IDENTIFIER                       int64  
 12  M_PIT_RELEASE_ASSIST        

In [None]:
# Rows x Cols
df_weather.shape

(3572328, 59)

### Summary statistics of the dataset

In [6]:
df_weather.describe().transpose()

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| M_PACKET_FORMAT | 3572328.0 | 2021.0 | 0.0 | 2021.0 | 2021.0 | 2021.0 | 2021.0 | 2021.0 |
| M_GAME_MAJOR_VERSION | 3572328.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| M_GAME_MINOR_VERSION | 3572328.0 | 14.10704 | 0.3091641 | 14.0 | 14.0 | 14.0 | 14.0 | 15.0 |
| M_PACKET_VERSION | 3572328.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| M_PACKET_ID | 3572328.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| M_SESSION_UID | 3572328.0 | 8.239394e+18 | 5.11926e+18 | 2.106082e+16 | 3.591802e+18 | 7.802116e+18 | 1.279207e+19 | 1.826297e+19 |
| M_SESSION_TIME | 3572328.0 | 1019.926 | 1682.487 | 0.004 | 113.8 | 431.924 | 1024.212 | 9686.959 |
| M_FRAME_IDENTIFIER | 3572328.0 | 28574.27 | 54287.73 | 0.0 | 3007.0 | 11749.0 | 29724.0 | 333917.0 |
| M_PLAYER_CAR_INDEX | 3572328.0 | 9.212292 | 9.217495 | 0.0 | 0.0 | 2.0 | 19.0 | 19.0 |
| M_SECONDARY_PLAYER_CAR_INDEX | 3572328.0 | 255.0 | 0.0 | 255.0 | 255.0 | 255.0 | 255.0 | 255.0 |
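
Several rows in the summary above have a standard deviation of 0 and identical min and max values (for example `M_PACKET_FORMAT` and `M_SECONDARY_PLAYER_CAR_INDEX`), which suggests those columns hold a single value. One possible way to list such columns is sketched below. The helper name `constant_columns` and the small `toy` frame are illustrative assumptions, not part of the original notebook; on the real data the same call would be made on `df_weather`.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return names of columns that hold at most one distinct non-missing value."""
    n_unique = df.nunique(dropna=True)
    return n_unique[n_unique <= 1].index.tolist()


# Tiny made-up frame that mimics the pattern seen in the summary statistics
toy = pd.DataFrame({
    "M_PACKET_FORMAT": [2021, 2021, 2021, 2021],
    "M_GAME_MINOR_VERSION": [14, 14, 15, 14],
    "M_SECONDARY_PLAYER_CAR_INDEX": [255, 255, 255, 255],
})

constant = constant_columns(toy)
print(constant)
```

Columns flagged this way carry no information for comparisons or correlations, so they are natural candidates to set aside before plotting.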


### Data cleaning and preprocessing
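
The preview from `df_weather.head()` shows five rows that are identical in every displayed column, and the last column, `Unnamed: 58`, shows no value in those rows. A minimal cleaning sketch under those observations might drop columns that are entirely empty and then remove exact duplicate rows. Whether either pattern holds for the full dataset is an assumption to check first. The helper `basic_clean` and the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns with no data at all, then drop exact duplicate rows."""
    cleaned = df.dropna(axis="columns", how="all")
    cleaned = cleaned.drop_duplicates().reset_index(drop=True)
    return cleaned


# Made-up rows: two exact duplicates plus one entirely empty column
toy = pd.DataFrame({
    "M_SESSION_TIME": [2803.836, 2803.836, 2804.012],
    "M_TOTAL_LAPS": [200.0, 200.0, 200.0],
    "Unnamed: 58": [np.nan, np.nan, np.nan],
})

cleaned = basic_clean(toy)
print(cleaned.shape)
```

Comparing `df_weather.shape` before and after a step like this shows how many rows and columns it actually removed.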

### Data visualization (e.g., histograms, scatterplots, etc.)
- You should write a brief summary of the insights and conclusions you have drawn from your analysis.
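
As one possible starting point for the visualizations, the sketch below draws a histogram of a single numeric column and marks its mean and median. The summary statistics report a mean of about 1020 and a median of about 432 for `M_SESSION_TIME`, which hints at a right-skewed distribution worth plotting. The helper `plot_histogram` is hypothetical, and the `toy` series is randomly generated stand-in data rather than values from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histogram(series: pd.Series, bins: int = 30):
    """Draw a histogram of one numeric column and mark its mean and median."""
    values = series.dropna()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=bins, color="steelblue", edgecolor="white")
    ax.axvline(values.mean(), color="red", linestyle="--", label="mean")
    ax.axvline(values.median(), color="black", linestyle=":", label="median")
    ax.set_xlabel(series.name)
    ax.set_ylabel("Number of rows")
    ax.legend()
    return ax


# Stand-in data: a skewed random sample, NOT taken from weather.csv
rng = np.random.default_rng(0)
toy = pd.Series(rng.exponential(scale=1000.0, size=500), name="M_SESSION_TIME")
ax = plot_histogram(toy)
```

On the real data, the call would be `plot_histogram(df_weather["M_SESSION_TIME"])`, followed by `plt.show()` when running outside a notebook.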


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
---
(likely due to only having a single row, containing non-NaN values for both correlated features)
Affected correlations:['M_WEATHER/M_PACKET_FORMAT', 'M_WEATHER/M_GAME_MAJOR_VERSION', 'M_WEATHER/M_PACKET_VERSION', 'M_WEATHER/M_PACKET_ID', 'M_WEATHER/M_SECONDARY_PLAYER_CAR_INDEX', 'M_WEATHER/M_SLI_PRO_NATIVE_SUPPORT', 'M_WEATHER/M_SAFETY_CAR_STATUS', 'M_PACKET_FORMAT/M_WEATHER', 'M_PACKET_FORMAT/M_GAME_MAJOR_VERSION', 'M_PACKET_FORMAT/M_GAME_MINOR_VERSION', 'M_PACKET_FORMAT/M_PACKET_VERSION', 'M_PACKET_FORMAT/M_PACKET_ID', 'M_PACKET_FORMAT/M_SESSION_UID', 'M_PACKET_FORMAT/M_SESSION_TIME', 'M_PACKET_FORMAT/M_FRAME_IDENTIFIER', 'M_PACKET_FORMAT/M_PLAYER_CAR_INDEX', 'M_PACKET_FORMAT/M_SECONDARY_PLAYER_CAR_INDEX', 'M_PACKET_FORMAT/M_BRAKING_ASSIST', 'M_PACKET_FORMAT/M_SESSION_LINK_IDENTIFIER', 'M_PACKET_FORMAT/M_PIT_RELEASE_ASSIST'