# Formula 1 Dataset Analysis (2001~2024)

## Why data from 2001?
Formula 1 regulations change regularly. The changes are motivated by driver safety, cost control, and closer competition, but they also affect car performance substantially and can reshuffle the competitive order among constructors.
In 2001, a major set of regulation changes was introduced: front and rear wing changes, tighter crash standards, and the reintroduction of driver aids such as traction control.   \
Because the cars differ from year to year, older data from the 20th century may be less relevant. In addition, the oldest driver on the current grid, Fernando Alonso, joined Formula 1 in 2001, which makes 2001 a meaningful starting point.

### Data
- results: race results, including starting grid positions and finishing positions
- races: information about each race
- drivers: information about each driver
- qualifying: results of each qualifying session
- pit_stops: pit stops made during races
- sprint_results: sprint race results (sprints were introduced in 2021)

### Loading packages and data

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

results = pd.read_csv('./data/results.csv', na_values='\\N')
races = pd.read_csv('./data/races.csv', na_values='\\N')
drivers = pd.read_csv('./data/drivers.csv', na_values='\\N')
qualifying = pd.read_csv('./data/qualifying.csv', na_values='\\N')
pit_stops = pd.read_csv('./data/pit_stops.csv', na_values='\\N')
sprint_results = pd.read_csv('./data/sprint_results.csv', na_values='\\N')

### Clean data

#### Filtering years (2001~2024)

In [42]:
# filter to the years 2001~2024 (start with races.csv because it holds the year column)
races = races.loc[races['year'].isin(range(2001, 2025))].sort_values(by=['year', 'round']).reset_index(drop=True)
# races.head(5)

# filter year for other datasets
results = results.loc[results['raceId'].isin(races['raceId'])].sort_values(by=['resultId', 'position']).reset_index(drop=True)
# results.head(5)

drivers = drivers.loc[drivers['driverId'].isin(results['driverId'])].sort_values(by='driverId').reset_index(drop=True)
# drivers.head(5)

qualifying = qualifying.loc[qualifying['raceId'].isin(races['raceId'])].sort_values(by=['raceId', 'position']).reset_index(drop=True)
# qualifying.head(5)

pit_stops['time'] = pd.to_datetime(pit_stops['time'], format='%H:%M:%S').dt.time
# pit_stops.dtypes
pit_stops = pit_stops.loc[pit_stops['raceId'].isin(races['raceId'])].sort_values(by=['raceId', 'time']).reset_index(drop=True)
# pit_stops.head(5)

# sprint_results.dtypes => shows that 'position' is not an integer column, because the \N (NA) values rule out a plain int dtype.
# Convert it before sorting; the nullable 'Int64' dtype (capital I) keeps missing positions as <NA>.
sprint_results['position'] = sprint_results['position'].astype('Int64')
sprint_results = sprint_results.loc[sprint_results['raceId'].isin(races['raceId'])].sort_values(by=['raceId', 'position']).reset_index(drop=True)
# sprint_results.dtypes
# sprint_results.head(10)
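
The dtype handling for `position` can be reproduced on a tiny sample. This is only a sketch: the three rows and their ID values below are made up for illustration and are not taken from `sprint_results.csv`. It shows what `na_values='\\N'` appears to do to an integer column that contains `\N`, and how the nullable `Int64` dtype keeps the missing entry while sorting.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the sprint_results layout;
# the ID values are invented for illustration only.
sample_csv = io.StringIO(
    "raceId,driverId,position\n"
    "1061,830,3\n"
    "1061,1,\\N\n"
    "1061,822,1\n"
)

# '\N' is parsed as NaN, so pandas falls back to a float column.
sample = pd.read_csv(sample_csv, na_values="\\N")
print(sample["position"].dtype)

# Nullable Int64 stores whole numbers and keeps the missing value as <NA>.
converted = sample.copy()
converted["position"] = converted["position"].astype("Int64")
print(converted["position"].dtype)

# Missing positions are placed last by default (na_position='last').
ordered = converted.sort_values(by=["raceId", "position"]).reset_index(drop=True)
print(ordered)
```

Drivers without a classified position end up at the bottom of each race's block, which is usually what is wanted when listing finishing order.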

#### Checking duplicates (Deduplication)

In [44]:
print(races.duplicated().sum())
print(results.duplicated().sum())
print(drivers.duplicated().sum())
print(qualifying.duplicated().sum())
print(pit_stops.duplicated().sum())
print(sprint_results.duplicated().sum())

0
0
0
0
0
0


Great, there are no fully duplicated rows in any of these tables!
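
Called without arguments, `DataFrame.duplicated()` only flags rows that match in every column, so a repeated entry with a different `resultId` would not be counted. A stricter check passes `subset=` with the columns expected to identify a row. The sketch below uses a made-up miniature table, and treating (`raceId`, `driverId`) as a unique key for `results` is an assumption, not something the dataset guarantees.

```python
import pandas as pd

# Hypothetical miniature of `results`: the third row repeats the
# (raceId, driverId) pair of the first row but has its own resultId.
mini_results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "raceId": [100, 100, 100],
    "driverId": [7, 8, 7],
})

# Full-row comparison: no two rows are identical in every column.
full_row_dupes = int(mini_results.duplicated().sum())

# Key-based comparison: assumes (raceId, driverId) should be unique.
key_dupes = int(mini_results.duplicated(subset=["raceId", "driverId"]).sum())

print(full_row_dupes, key_dupes)
```

Here the full-row check reports 0 while the key-based check reports 1. The same call could be run on the real tables with whichever key columns are believed to be unique.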


### EDA