# Homework 5: Exploratory Data Analysis (EDA)
In this assignment, you are going to perform exploratory data analysis (EDA) on a small dataset of your choice. You can choose any dataset you like, but you are encouraged to pick a dataset that you are interested in. You can use the datasets you have used in the previous assignments or you can choose a new dataset. If you don't have a dataset in mind, you can choose one from the datasets in the `Datasets` folder of the course repository.

### Instructions

1. Follow the instructions on how to set up your Python and Jupyter (or VSCode) environment and how to clone or download our repository. Instructions can be found in the class notes:
   https://filipinascimento.github.io/usable_ai/m00-setup/class
2. Ensure that you have Python and Jupyter Notebook working. (You can also try using Google Colab. This is not the preferred method for this homework, but it is an option)
3. Load the dataset of your choice into a Pandas dataframe
4. Perform exploratory data analysis (EDA) on the dataset. Your analysis should include the following:
    - Summary statistics of the dataset
    - Data cleaning and preprocessing
    - Data visualization (e.g., histograms, scatterplots, etc.)
    - You should write a brief summary of the insights and conclusions you have drawn from your analysis.
    You can use the [exploratory_data_analysis.ipynb](notebook) as a reference.
5. **Important**: Create both code and markdown cells in your notebook to document your analysis.
6. Submit your completed notebook as an HTML export or a PDF file.

### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command.
- Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on Mac).
- Search for `Jupyter: Export to HTML`.
- Save the HTML file to your computer and submit it via Canvas.

---

> 
> **Using Generative AI Responsibly**
>
> You're welcome to use Generative AI to assist your learning, but focus on understanding the concepts rather than just solving the assignment. For example, instead of copying and pasting the question into the model, ask it to explain the concept in the question. Try asking: `How can I open a file in Python? Can you give me examples?` or `What functions and methods can I use to extract the words of a text file? Can you explain how they work with some examples?`
>
> This way, you will learn how the solution works while building your skills. Remember to give context to the generative AI, so it can better assist you. Talk to the instructor and AIs if you have any questions or need insights.

Create your cells below this one. Hint: start by importing the necessary libraries and loading your dataset.

___
### Source dataset

I took the data from the **FormulaAI Hackathon 2022**, whose first theme was creating an accurate weather prediction model for the F1 2021 video game.
The data is provided in two formats, CSV and JSON. I used the CSV (716 MB) because it is faster to process on my PC.

``[1]`` https://www.kaggle.com/datasets/oracledevrel/formulaaihackathon2022

``[2]`` https://github.com/oracle-devrel/formula-ai-2022-hackathon

``[3]`` https://blogs.oracle.com/developers/post/formulaai-hackathon-2022

---

### Load the dataset of your choice into a Pandas dataframe

In [None]:
# Load modules needed
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import os

# Local directory
print(os.getcwd())

### Perform exploratory data analysis (EDA) on the dataset.

In [None]:
df_weather = pd.read_csv("../Datasets/weather.csv")
df_weather.head()
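
Because the CSV is large (716 MB), it can help to read only the columns needed and to give them compact types. The sketch below illustrates the idea on a tiny in-memory stand-in for `weather.csv`; the sample rows, the `wanted` list, and the `int8` choices are assumptions for illustration, not part of the original analysis.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for weather.csv (values invented for illustration)
csv_text = """M_SESSION_UID,M_WEATHER,M_AIR_TEMPERATURE,M_TRACK_TEMPERATURE
1.0,0,25,33
1.0,0,25,33
2.0,1,21,28
2.0,5,19,24
"""

# Read only the columns of interest and store them as small integers
wanted = ["M_WEATHER", "M_AIR_TEMPERATURE", "M_TRACK_TEMPERATURE"]
dtypes = {name: "int8" for name in wanted}

# chunksize makes read_csv return an iterator of DataFrames instead of one large frame
chunks = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype=dtypes, chunksize=2)
df_small = pd.concat(chunks, ignore_index=True)

print(df_small.dtypes)
print(df_small.shape)
```

With the real file, the same `usecols`, `dtype`, and `chunksize` arguments would be passed together with the file path. Note that `int8` only suits columns without missing values whose values fit in the range -128 to 127.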

In [None]:
# Data types
df_weather.info()

### Column data types and definitions per ``[2]``

  | Pos | Column | Type | Definition [2] |
  |---|---|---|---|
 0   |M_PACKET_FORMAT                                 |int64  |Header 
 1   |M_GAME_MAJOR_VERSION                            |int64  |Game major version - "X.00"
 2   |M_GAME_MINOR_VERSION                            |int64  |Game minor version - "1.XX"
 3   |M_PACKET_VERSION                                |int64  |Version of this packet type, all start from 1
 4   |M_PACKET_ID                                     |int64  |Identifier for the packet type
 5   |M_SESSION_UID                                   |float64|Unique identifier for the session
 6   |M_SESSION_TIME                                  |float64|Session timestamp
 7   |M_FRAME_IDENTIFIER                              |int64  |Identifier for the frame the data was retrieved on
 8   |M_PLAYER_CAR_INDEX                              |int64  |Index of player's car in the array
 9   |M_SECONDARY_PLAYER_CAR_INDEX                    |int64  |Index of secondary player's car in the array (split screen); 255 if no second player
 10  |M_BRAKING_ASSIST                                |int64  |0 = off, 1 = low, 2 = medium, 3 = high
 11  |M_SESSION_LINK_IDENTIFIER                       |int64  |Identifier for session - persists across saves
 12  |M_PIT_RELEASE_ASSIST                            |int64  |0 = off, 1 = on
 13  |TIMESTAMP                                       |float64|timestamp for when the packet was received
 14  |M_ZONE_START                                    |float64|Fraction (0..1) of way through the lap the marshal zone starts
 15  |M_ZONE_FLAG                                     |float64|-1 = invalid/unknown, 0 = none, 1 = green, 2 = blue, 3 = yellow, 4 = red
 16  |M_PIT_STOP_WINDOW_IDEAL_LAP                     |int64  |Ideal lap to pit on for current strategy (player)
 17  |M_TRACK_TEMPERATURE                             |int64  |Track temp. in degrees Celsius
 18  |M_TRACK_LENGTH                                  |int64  |Track length in metres
 19  |M_GAME_PAUSED                                   |int64  |Whether the game is paused
 20  |M_FORECAST_ACCURACY                             |int64  |0 = Perfect, 1 = Approximate - The accuracy is a configurable in-game setting.
 21  |GAMEHOST                                        |object |unique identifier for the host that captured the data (not relevant) 
 22  |M_AIR_TEMPERATURE                               |int64  |Air temp. in degrees celsius 
 23  |M_NUM_WEATHER_FORECAST_SAMPLES                  |int64  |Number of weather forecast samples to follow
 24  |M_SLI_PRO_NATIVE_SUPPORT                        |int64  |SLI Pro support, 0 = inactive, 1 = active. This refers to external LED devices like the Leo Bodnar SLI Pro and Fanatec steering wheels, which can display telemetry data from the F1 game.
 25  |M_SAFETY_CAR_STATUS                             |int64  |0 = no safety car, 1 = full, 2 = virtual, 3 = formation lap
 26  |M_TRACK_ID                                      |int64  |-1 for unknown, 0-21 for tracks
 27  |M_ERSASSIST                                     |int64  |0 = off, 1 = on
 28  |M_FORMULA                                       |int64  |Formula, 0 = F1 Modern, 1 = F1 Classic, 2 = F2, 3 = F1 Generic 
 29  |M_SEASON_LINK_IDENTIFIER                        |int64  |Identifier for season - persists across saves
 30  |M_PIT_ASSIST                                    |int64  |0 = off, 1 = on
 31  |M_GEARBOX_ASSIST                                |int64  |1 = manual, 2 = manual & suggested gear, 3 = auto
 32  |M_SESSION_TYPE                                  |int64  |0 = unknown, 1 = P1, 2 = P2, 3 = P3, 4 = Short P, 5 = Q1, 6 = Q2, 7 = Q3, 8 = Short Q, 9 = OSQ, 10 = R, 11 = R2, 12 = Time Trial
 33  |M_SPECTATOR_CAR_INDEX                           |int64  |Index of the car being spectated
 34  |M_PIT_STOP_WINDOW_LATEST_LAP                    |int64  |Latest lap to pit on for current strategy (player)
 35  |M_WEEKEND_LINK_IDENTIFIER                       |int64  |Identifier for weekend - persists across saves
 36  |M_DYNAMIC_RACING_LINE_TYPE                      |int64  |0 = 2D, 1 = 3D
 37  |M_SESSION_TIME_LEFT                             |int64  |Time left in session in seconds
 38  |M_SESSION_DURATION                              |int64  |Session duration in seconds
 39  |M_PIT_STOP_REJOIN_POSITION                      |int64  |Predicted position to rejoin at (player)
 40  |M_WEATHER_FORECAST_SAMPLES_M_SESSION_TYPE       |float64|0 = unknown, 1 = P1, 2 = P2, 3 = P3, 4 = Short P, 5 = Q1, 6 = Q2, 7 = Q3, 8 = Short Q, 9 = OSQ, 10 = R, 11 = R2, 12 = Time Trial
 41  |M_TIME_OFFSET                                   |float64|Time in minutes the forecast is for
 42  |M_WEATHER_FORECAST_SAMPLES_M_WEATHER            |float64|Weather - 0 = clear, 1 = light cloud, 2 = overcast, 3 = light rain, 4 = heavy rain, 5 = storm
 43  |M_WEATHER_FORECAST_SAMPLES_M_TRACK_TEMPERATURE  |float64|Track temp. in degrees Celsius
 44  |M_TRACK_TEMPERATURE_CHANGE                      |float64|Track temp. change – 0 = up, 1 = down, 2 = no change
 45  |M_WEATHER_FORECAST_SAMPLES_M_AIR_TEMPERATURE    |float64|Air temp. in degrees celsius
 46  |M_AIR_TEMPERATURE_CHANGE                        |float64|Air temp. change – 0 = up, 1 = down, 2 = no change
 47  |M_RAIN_PERCENTAGE                               |float64|Rain percentage (0-100)
 48  |M_WEATHER                                       |int64  |Weather - 0 = clear, 1 = light cloud, 2 = overcast, 3 = light rain, 4 = heavy rain, 5 = storm
 49  |M_AI_DIFFICULTY                                 |int64  |AI Difficulty rating – 0-110
 50  |M_PIT_SPEED_LIMIT                               |int64  |Pit speed limit in kilometres per hour
 51  |M_NETWORK_GAME                                  |float64|0 = offline, 1 = online
 52  |M_TOTAL_LAPS                                    |float64|Total number of laps in this race
 53  |M_STEERING_ASSIST                               |float64|0 = off, 1 = on
 54  |M_IS_SPECTATING                                 |float64|Whether the player is spectating
 55  |M_DYNAMIC_RACING_LINE                           |float64|0 = off, 1 = corners only, 2 = full
 56  |M_DRSASSIST                                     |float64|0 = off, 1 = on
 57  |M_NUM_MARSHAL_ZONES                             |float64|Number of marshal zones to follow
 58  |Unnamed: 58                                     |float64|?

In [None]:
# Rows x Cols
df_weather.shape

### Summary statistics of the dataset

In [None]:
df_weather.describe().transpose()

### Data cleaning and preprocessing

In [None]:
# Check for duplicated rows
duplicates = df_weather[df_weather.duplicated()]

display(duplicates.shape)
display(df_weather.shape)
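
As a small illustration of what `duplicated()` reports, the sketch below uses a toy frame (the column names `a` and `b` are hypothetical): every repeat after the first occurrence is flagged, and `drop_duplicates()` removes those repeats.

```python
import pandas as pd

# Toy frame: the second row repeats the first one exactly
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

dup_mask = df.duplicated()          # True for each repeat after the first occurrence
n_dups = int(dup_mask.sum())        # number of fully duplicated rows
deduped = df.drop_duplicates(ignore_index=True)

print(n_dups)
print(deduped)
```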


In [None]:
# Start by reviewing columns that are likely irrelevant, e.g. those holding a single value.
# Candidates are the columns with zero standard deviation in the summary statistics above:
# M_PACKET_FORMAT
# M_GAME_MAJOR_VERSION
# M_PACKET_VERSION
# M_PACKET_ID
# M_SECONDARY_PLAYER_CAR_INDEX
# M_SLI_PRO_NATIVE_SUPPORT
# M_SAFETY_CAR_STATUS


# Keep a list of the columns to drop
columns_to_drop = []

display(df_weather['M_PACKET_FORMAT'].describe().transpose())
display(df_weather['M_PACKET_FORMAT'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_PACKET_FORMAT')

# Continue with next column
display(df_weather['M_GAME_MAJOR_VERSION'].describe().transpose())
display(df_weather['M_GAME_MAJOR_VERSION'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_GAME_MAJOR_VERSION')

# Continue with next column
display(df_weather['M_PACKET_VERSION'].describe().transpose())
display(df_weather['M_PACKET_VERSION'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_PACKET_VERSION')

# Continue with next column
display(df_weather['M_PACKET_ID'].describe().transpose())
display(df_weather['M_PACKET_ID'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_PACKET_ID')

# Continue with next column
display(df_weather['M_SECONDARY_PLAYER_CAR_INDEX'].describe().transpose())
display(df_weather['M_SECONDARY_PLAYER_CAR_INDEX'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_SECONDARY_PLAYER_CAR_INDEX')

# Continue with next column
display(df_weather['M_SLI_PRO_NATIVE_SUPPORT'].describe().transpose())
display(df_weather['M_SLI_PRO_NATIVE_SUPPORT'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_SLI_PRO_NATIVE_SUPPORT')

# Continue with next column
display(df_weather['M_SAFETY_CAR_STATUS'].describe().transpose())
display(df_weather['M_SAFETY_CAR_STATUS'].unique())

In [None]:
# This column can be dropped because it is constant across all observations
columns_to_drop.append('M_SAFETY_CAR_STATUS')
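
The column-by-column check above could also be done in one pass. The following is a minimal sketch on a toy frame (the values are invented), assuming a column counts as constant when it holds a single distinct value, with missing values counted as a value of their own.

```python
import pandas as pd

# Toy frame with two constant columns and two varying ones (values invented)
df = pd.DataFrame({
    "M_PACKET_FORMAT": [2021, 2021, 2021],
    "M_PACKET_ID": [1, 1, 1],
    "M_WEATHER": [0, 1, 0],
    "GAMEHOST": ["h1", None, "h1"],
})

# dropna=False counts NaN/None as a distinct value, so a column with one value
# plus missing entries is not reported as constant
n_unique = df.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

print(constant_cols)
```

Inspecting each candidate by hand, as done in the cells above, is still worthwhile before dropping anything, since a constant column can occasionally carry meaning (for example, a version number).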

In [None]:
# The last column, 'Unnamed: 58', has zero non-null values, so it can be dropped as well,
# together with the constant columns collected in columns_to_drop
df_weather = df_weather.drop(columns=columns_to_drop + ['Unnamed: 58'])

In [None]:
# Look at the columns with fewer observations
df_weather.count().sort_values(ascending=True)

There are some columns with fewer than 3572328 non-null values. Some action may be needed for them if any of those columns is used later, e.g. as part of modelling.

For now, these columns are only listed; no action is taken.

|Column| Count|
|---|---|
|M_ZONE_START                                      | 974274
|M_ZONE_FLAG                                       | 974274
|M_WEATHER_FORECAST_SAMPLES_M_AIR_TEMPERATURE      |2598054
|M_WEATHER_FORECAST_SAMPLES_M_SESSION_TYPE         |2598054
|M_TIME_OFFSET                                     |2598054
|M_WEATHER_FORECAST_SAMPLES_M_WEATHER              |2598054
|M_WEATHER_FORECAST_SAMPLES_M_TRACK_TEMPERATURE    |2598054
|M_TRACK_TEMPERATURE_CHANGE                        |2598054
|M_AIR_TEMPERATURE_CHANGE                          |2598054
|M_RAIN_PERCENTAGE                                 |2598054
|GAMEHOST                                          |2663112
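
A table like the one above can also be produced directly, with the share of missing values next to each count. This is a sketch on a toy frame (the values are invented); `missing_pct` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (values invented)
df = pd.DataFrame({
    "M_ZONE_START": [0.1, np.nan, np.nan, np.nan],
    "M_RAIN_PERCENTAGE": [5.0, 10.0, np.nan, 0.0],
    "M_WEATHER": [0, 0, 1, 2],
})

# count() gives non-null values per column; isna().mean() gives the missing fraction
missing = pd.DataFrame({
    "non_null": df.count(),
    "missing_pct": (df.isna().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(missing)
```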

With 3572327 observations:

|Column|Column|Column|
|---|---|---|
|M_DRSASSIST|M_NUM_MARSHAL_ZONES|M_NETWORK_GAME|
|M_TOTAL_LAPS|M_STEERING_ASSIST|M_IS_SPECTATING|
|M_DYNAMIC_RACING_LINE|M_AI_DIFFICULTY|M_PIT_SPEED_LIMIT|
|M_PIT_STOP_REJOIN_POSITION|M_SESSION_DURATION|M_SESSION_TIME_LEFT|
|M_DYNAMIC_RACING_LINE_TYPE|M_WEEKEND_LINK_IDENTIFIER|M_WEATHER|
|M_PIT_STOP_WINDOW_LATEST_LAP|M_GAME_MINOR_VERSION|M_GEARBOX_ASSIST|
|M_SESSION_UID|M_SESSION_TIME|M_FRAME_IDENTIFIER|
|M_PLAYER_CAR_INDEX|M_BRAKING_ASSIST|M_SESSION_LINK_IDENTIFIER|
|M_PIT_RELEASE_ASSIST|TIMESTAMP|M_PIT_STOP_WINDOW_IDEAL_LAP|
|M_SPECTATOR_CAR_INDEX|M_TRACK_TEMPERATURE|M_GAME_PAUSED|
|M_FORECAST_ACCURACY|M_AIR_TEMPERATURE|M_NUM_WEATHER_FORECAST_SAMPLES|
|M_TRACK_ID|M_ERSASSIST|M_FORMULA|
|M_SEASON_LINK_IDENTIFIER|M_PIT_ASSIST|M_TRACK_LENGTH|
|M_SESSION_TYPE| | |

### Data visualization (e.g., histograms, scatterplots, etc.)
- You should write a brief summary of the insights and conclusions you have drawn from your analysis.

|Item|Insights|
|---|---|
|M_WEATHER|- 74.59% of data points are under clear weather.<br/>- No data for categories 3 (light rain) and 4 (heavy rain).<br/>- The categories are ordinal (they have an explicit order), which is important to consider in future analysis.|
|M_AIR_TEMPERATURE |- Integer values, instead of continuous as you may expect.<br/>- When plotting histogram  |
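
One way to keep the ordering of the weather categories explicit is an ordered categorical. The sketch below is only an illustration on invented codes; the label names follow the definition of `M_WEATHER` given earlier.

```python
import pandas as pd

# Labels in severity order, following the M_WEATHER definition (0 = clear ... 5 = storm)
labels = ["Clear", "Light Cloud", "Overcast", "Light Rain", "Heavy Rain", "Storm"]

# Invented sample of weather codes with no light or heavy rain
codes = [0, 0, 1, 2, 5, 0]
weather = pd.Series(pd.Categorical.from_codes(codes, categories=labels, ordered=True))

# sort=False keeps the category order and also lists categories with zero observations
counts = weather.value_counts(sort=False)

# Ordered categoricals support comparisons against a category label
worse_than_overcast = int((weather > "Overcast").sum())

print(counts)
print(worse_than_overcast)
```

Keeping the empty categories visible makes gaps such as the missing rain categories easy to spot in tables and plots.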


In [None]:
# Check that the data aligns with the definition of some categorical columns
# M_WEATHER should be 0 = clear, 1 = light cloud, 2 = overcast, 3 = light rain, 4 = heavy rain, 5 = storm

# Get unique values and sort them
unique_values = np.sort(df_weather['M_WEATHER'].unique())
print("M_WEATHER - categories found:", unique_values)

# Define category mapping
category_mapping = {
        0: "0-Clear",
        1: "1-Light Cloud",
        2: "2-Overcast",
        3: "3-Light Rain",
        4: "4-Heavy Rain",
        5: "5-Storm"
    }

fig = plt.figure(figsize=(5, 3))
# plt.hist(df_weather['M_WEATHER'],bins=100)
# will use bars as it is more clear for a column that is categorical
ax = df_weather['M_WEATHER'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('M_WEATHER')
plt.ylabel('Frequency')
plt.title('Histogram of M_WEATHER')

# Debugging: Get the x-tick labels text 
xticks = ax.get_xticklabels()
# for tick in xticks:
#    print(tick.get_text())

# Create a list of labels using the category mapping
xticklabels = [category_mapping[int(tick.get_text())] for tick in xticks]

# Set the x-tick labels
ax.set_xticklabels(xticklabels)

# Rotate the labels for better readability
plt.xticks(rotation=45)

# Add count labels on top of each bar
for p in ax.patches:
    # Cast to int so the counts are never rendered with a trailing ".0"
    ax.annotate(str(int(p.get_height())), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 5), textcoords='offset points')

plt.tight_layout()
plt.show()


In [None]:
# Apply the category mapping to the 'M_WEATHER' column
df_weather['M_WEATHER_LABEL'] = df_weather['M_WEATHER'].map(category_mapping)

# Create a frequency table with counts and percentages
frequency_table = df_weather['M_WEATHER_LABEL'].value_counts().reset_index()

# Rename columns for clarity
frequency_table.columns = ['M_WEATHER', 'Count']

# Add a percentage column
frequency_table['Percentage'] = (frequency_table['Count'] / len(df_weather["M_WEATHER"])) * 100

# Round the percentage values for readability
frequency_table['Percentage'] = frequency_table['Percentage'].round(2)

# Display the frequency table
display(frequency_table)
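
A shorter route to the same kind of table is `value_counts(normalize=True)`. This sketch uses an invented label series; note that by default `value_counts` ignores missing values, so the percentages are relative to the non-missing rows, which can differ from dividing by `len(df_weather)` when a column has gaps.

```python
import pandas as pd

# Invented sample of weather labels
labels = pd.Series(["0-Clear", "0-Clear", "0-Clear", "1-Light Cloud"], name="M_WEATHER")

# Both Series share the same index (the labels), so they align when combined
freq = pd.DataFrame({
    "Count": labels.value_counts(),
    "Percentage": (labels.value_counts(normalize=True) * 100).round(2),
})

print(freq)
```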


In [None]:
# M_AIR_TEMPERATURE in C

# Get unique values and sort them
unique_values = np.sort(df_weather['M_AIR_TEMPERATURE'].unique())
print("M_AIR_TEMPERATURE - values found:", unique_values)


fig = plt.figure(figsize=(5, 3))
plt.hist(df_weather['M_AIR_TEMPERATURE'],bins=100)
plt.xlabel(r'M_AIR_TEMPERATURE (in $^\circ$C)')  # raw string avoids the invalid "\c" escape
plt.ylabel('Frequency')
plt.title('Histogram of M_AIR_TEMPERATURE')

plt.tight_layout()
plt.show()
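
Since the temperatures are integers, a fixed number of bins such as 100 can leave empty gaps between bars. A possible alternative, sketched here on invented values, is to centre one bin on each integer; the same `edges` array could be passed as `bins=edges` to `plt.hist`.

```python
import numpy as np

# Invented integer temperatures in degrees Celsius
temps = np.array([20, 21, 21, 25, 25, 25, 30])

# Bin edges at half-integers, so each bin is centred on one integer value
edges = np.arange(temps.min() - 0.5, temps.max() + 1.5, 1.0)
counts, _ = np.histogram(temps, bins=edges)

# Pair each integer temperature with its count
for value, count in zip(range(temps.min(), temps.max() + 1), counts):
    print(value, count)
```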

In [None]:
df_weather["M_AIR_TEMPERATURE"].describe().round(3)

In [None]:
fig = plt.figure(figsize=(5, 3))
for weather, group in df_weather.groupby('M_WEATHER_LABEL'):
    plt.hist(group['M_AIR_TEMPERATURE'], bins=10, alpha=0.5, label=weather)
plt.xlabel(r'M_AIR_TEMPERATURE ($^\circ$C)')  # raw string avoids the invalid "\c" escape
plt.ylabel('Frequency')
plt.title('Histogram of M_AIR_TEMPERATURE')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:

# Generate a list of colors (you can choose any colormap here)
colors = plt.cm.viridis(np.linspace(0, 1, len(df_weather['M_WEATHER_LABEL'].unique())))

# Create a figure with subplots
fig, axes = plt.subplots(nrows=len(df_weather['M_WEATHER_LABEL'].unique()), ncols=1, figsize=(5, 3 * len(df_weather['M_WEATHER_LABEL'].unique())))

# Flatten axes if there are multiple subplots
axes = axes.flatten()

# Plot each group in a separate subplot
for i, (weather, group) in enumerate(df_weather.groupby('M_WEATHER_LABEL')):
    axes[i].hist(group['M_AIR_TEMPERATURE'], bins=10, alpha=0.7, color=colors[i])
    axes[i].set_xlabel(r'M_AIR_TEMPERATURE ($^\circ$C)')  # raw string avoids the invalid "\c" escape
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Histogram of M_AIR_TEMPERATURE ({weather})')

# Adjust layout for better display
plt.tight_layout()
plt.show()

In [None]:
# Generate a list of colors
colors = plt.cm.viridis(np.linspace(0, 1, len(df_weather['M_WEATHER_LABEL'].unique())))

# Create a figure with subplots
fig, axes = plt.subplots(nrows=len(df_weather['M_WEATHER_LABEL'].unique()), ncols=1, figsize=(5, 3 * len(df_weather['M_WEATHER_LABEL'].unique())))

# Flatten axes if there are multiple subplots
axes = axes.flatten()

# Plot each group in a separate subplot
for i, (weather, group) in enumerate(df_weather.groupby('M_WEATHER_LABEL')):
    axes[i].hist(group['M_AIR_TEMPERATURE'], bins=10, alpha=0.5, color=colors[i])
    axes[i].set_xlabel(r'M_AIR_TEMPERATURE ($^\circ$C)')  # raw string avoids the invalid "\c" escape
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Histogram of M_AIR_TEMPERATURE ({weather})')

# Adjust layout for better display
plt.tight_layout()
plt.show()