##### F1 Capstone Project

What is the goal?
- For 2024 (and 2025 YTD), predict each driver's finishing results per race using past performance data (and track data?)

Considerations:  
- should the data be decoupled from specific drivers for better generalisation, or does that remove a crucial factor?

Challenges:  
- some years have seen big regulation changes
- best predictors of race results come from pre-race sessions (esp. quali), so how to incorporate that

Useful background resources:  
[https://github.com/ethan-eplee/HorseRacePrediction](https://github.com/ethan-eplee/HorseRacePrediction)  
[https://www.kaggle.com/discussions/general/333090](https://www.kaggle.com/discussions/general/333090)  
[https://medium.com/@fernando.siguenza/building-an-ai-to-predict-f1-race-outcomes-a-data-science-journey-7f55e0d75b1e](https://medium.com/@fernando.siguenza/building-an-ai-to-predict-f1-race-outcomes-a-data-science-journey-7f55e0d75b1e)

In [None]:
import pandas as pd
import numpy as np

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Context and prep:  
- before 1991, only a driver's best results of the season counted towards the championship, so for the end-of-year WDC title the data will start at 1991 at the earliest
- a lot of the necessary data was pulled during the midcourse, with some cleanup + EDA already performed

### will need 2025 data as the season progresses

In [None]:
#races_df: contains info of all GP 1950-2024 w/ geo info
races_df = pd.read_csv('../data/races_df.csv')
#results_df: contains placement of every driver at the conclusion of every GP (does not currently include sprint results for 2021-present), 
#but YTD points should include fastest lap for 2019-2024 <-- check this
results_df = pd.read_csv('../data/results_df.csv', dtype={'number': str})
#results from all 18 sprints, maybe add to model
sprints = pd.read_csv('../data/sprint_results.csv')

In [None]:
#grab 'fastestLapTime' and 'fastestLap', plus 'raceId' and 'driverId' as merge keys for results_df
fast_laps = pd.read_csv('../data/results.csv', usecols=['raceId', 'driverId', 'fastestLapTime', 'fastestLap'])

In [None]:
#finishPosition - '\N' if driver DNFed (not a NaN)
#finishPosNum - same as above, but floats w/ NaNs <-- REMOVE
#positionText - 'R' if driver DNFed
#positionOrder - lists all drivers in finishing order, with numbered positions for DNFs too <- start by using this
results_df.sample(5)

In [None]:
#dropping 'finishPosNum' because want placement to be int (I think?), and adding fastest laps
results_df = results_df.drop(columns = 'finishPosNum')
results_df = pd.merge(results_df, fast_laps, on = ['raceId', 'driverId'])
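
The merge above uses pandas' default inner join, so any result row without a matching `raceId`/`driverId` pair in `fast_laps` would be dropped without warning. A minimal sketch of one way to check for that, using small made-up frames (`left_demo` and `right_demo` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for results_df and fast_laps
left_demo = pd.DataFrame({'raceId': [1, 1, 2],
                          'driverId': [10, 11, 10],
                          'positionOrder': [1, 2, 1]})
right_demo = pd.DataFrame({'raceId': [1, 2],
                           'driverId': [10, 10],
                           'fastestLap': [44, 51]})

# how='left' keeps every result row; indicator=True adds a '_merge' column
merged_demo = pd.merge(left_demo, right_demo, on=['raceId', 'driverId'],
                       how='left', indicator=True)

# rows with no fastest-lap match are marked 'left_only'
unmatched = merged_demo[merged_demo['_merge'] == 'left_only']
print(len(merged_demo), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join and the left join give the same rows.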

In [None]:
results_df.info()

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
#isolating the columns of interest for the model; the 2001-onwards year filter (2001 = first year of oldest active driver in 2024/2025) is commented out below for later
results = results_df[['raceId', #database-specific id
                           'year', 
                           'round', #race # within x year
                           'date',
                           'circuitId', #database-specific id
                           'driverId', #database-specific id
                           'code', #three-letter driver abbr. (for my reference)
                           'constructorId', #database-specific id
                           'grid', #starting position on grid; 0 = pitlane start
                           'positionOrder', #finishing position, includes numbered finishes for retirements
                           #positionText, #strings; 'R' if driver DNFed
                           'points', #pts scored in race towards WDC/WCC
                           'YTDpoints',
                           'WDCposition', #as of race entry
                           'wins', #as of race entry within x year
                           #add: 
                           #% of points won?
                           #total wins on specific circuit as of race -- DONE
                           #performance on circuit (win %) -- DONE
                           #performance on circuit (avg finishing pos) -- DONE
                           #last race finish -- DONE
                           #average of last 4 race finishes -- DONE
                           #YTD avg finish pos? -- DONE
                           #career win% -- DONE
                           #YTD win% -- DONE
                           #top 3 finishes on circuit -- DONE
                           #maybe add wins to-date AND avg placement with constructor? (how to link driver + constructor)
                           #maybe avg constructor performance to date?
                           #whether driver is a rookie
                          ]].copy() #copy so the column assignments below modify an independent frame, not a view of results_df
results['date'] = pd.to_datetime(results['date'])
#results[['circuitId', 'driverId', 'constructorId']] = results[['circuitId', 'driverId', 'constructorId']].astype(str) #bad for models that need numerical data
#results = results[results['year'] > 2000] #for later

In [None]:
#filling in the NaNs
nan_df = results[results.isna().any(axis=1)].copy() #copy so the fills below don't act on a view

In [None]:
#if scored 0 points, then YTD and wins will also be 0
nan_df = nan_df.fillna({'YTDpoints': 0, 'wins': 0})
#stand-in for the missing WDC position: the race finishing position
nan_df['WDCposition'] = nan_df['positionOrder']

In [None]:
results = results.combine_first(nan_df)
results[['WDCposition', 'wins']] = results[['WDCposition', 'wins']].astype('int64')
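
The same fill can probably be done without the intermediate `nan_df` and `combine_first`. A rough sketch on a toy frame (`demo` is hypothetical, only the column names are taken from `results`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as `results`
demo = pd.DataFrame({
    'positionOrder': [1, 15, 18],
    'YTDpoints': [25.0, np.nan, np.nan],
    'WDCposition': [1.0, np.nan, np.nan],
    'wins': [1.0, np.nan, np.nan],
})

# missing YTD points / wins become 0
demo = demo.fillna({'YTDpoints': 0, 'wins': 0})
# missing championship position falls back to the race finishing position
demo['WDCposition'] = demo['WDCposition'].fillna(demo['positionOrder'])
demo[['WDCposition', 'wins']] = demo[['WDCposition', 'wins']].astype('int64')
```

Filling column by column only touches the cells that are actually missing, which is also what `combine_first` ends up doing.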

In [None]:
results['won'] = np.where(results['positionOrder'] == 1, 1, 0)

In [None]:
#adding wins on specific circuit as of year [take #2, don't ask] (to date)
results['prior_wins_on_circuit'] = 0
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    circuit_list = driver_df['circuitId'].unique()
    for circuit in circuit_list:
        circuits_df = driver_df[driver_df['circuitId'] == circuit].sort_values(by=['year','round']) #round too: some circuits host two races in one year
        x=0
        for i in range(circuits_df.shape[0]):
            index = circuits_df.index[i]
            results.loc[index, 'prior_wins_on_circuit'] = x
            if circuits_df['won'].iloc[i] == 1:
                x+=1           
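
The nested loops get slow as the data grows. A possible vectorised alternative, sketched on a toy frame (`demo` is made up; the column names follow `results`): a cumulative sum per driver and circuit, minus the current race, gives the wins before that race.

```python
import pandas as pd

# Hypothetical toy data: two drivers at the same circuit
demo = pd.DataFrame({
    'driverId': [1, 1, 1, 2, 2],
    'circuitId': [5, 5, 5, 5, 5],
    'year': [2019, 2020, 2021, 2020, 2021],
    'round': [3, 4, 2, 4, 2],
    'won': [1, 0, 1, 1, 0],
})

# chronological order first, so the cumulative sum runs through time
demo = demo.sort_values(['year', 'round'])
grp = demo.groupby(['driverId', 'circuitId'])['won']

# cumulative wins including this race, minus this race = wins before it
demo['prior_wins_on_circuit'] = grp.cumsum() - demo['won']
```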

In [None]:
#adding win percentage on circuit (to date, i.e. over earlier races only)
results['win_percentage_on_circuit'] = 0.0 #float column, since percentages are not whole numbers
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    circuit_list = driver_df['circuitId'].unique()
    for circuit in circuit_list:
        #sort by round too: some circuits host two races in one year
        circuits_df = driver_df[driver_df['circuitId'] == circuit].sort_values(by=['year','round'])
        wins = 0
        count = 0 #earlier races on this circuit
        for i in range(circuits_df.shape[0]):
            #0 on a first visit, otherwise wins / earlier races (starting count at 1 understated the percentage)
            win_percentage = round(((wins / count) * 100), 1) if count > 0 else 0.0
            index = circuits_df.index[i]
            results.loc[index, 'win_percentage_on_circuit'] = win_percentage
            if circuits_df['won'].iloc[i] == 1:
                wins +=1
            count +=1
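
A vectorised sketch of a to-date win percentage, where the denominator is the number of earlier races on the circuit and a first visit gets 0. The toy frame `demo` is hypothetical; `cumcount()` numbers each row within its group starting from 0, which is exactly the count of earlier races.

```python
import pandas as pd

# Hypothetical toy data: one driver, three visits to one circuit
demo = pd.DataFrame({
    'driverId': [1, 1, 1],
    'circuitId': [5, 5, 5],
    'year': [2019, 2020, 2021],
    'round': [3, 4, 2],
    'won': [1, 0, 1],
}).sort_values(['year', 'round'])

grp = demo.groupby(['driverId', 'circuitId'])
prior_wins = grp['won'].cumsum() - demo['won']  # wins before this race
prior_starts = grp.cumcount()                   # races before this race

# divide only where there is history; first visits end up as 0.0
pct = (prior_wins / prior_starts.where(prior_starts > 0) * 100).round(1).fillna(0.0)
demo['win_percentage_on_circuit'] = pct
```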

In [None]:
#adding average finishing position on circuit (to date, i.e. over earlier races only)
results['avg_finish_pos_on_circuit'] = 0.0
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    circuit_list = driver_df['circuitId'].unique()
    for circuit in circuit_list:
        #sort by round too: some circuits host two races in one year
        circuits_df = driver_df[driver_df['circuitId'] == circuit].sort_values(by=['year','round'])
        cumulative_finish_pos = 0
        count = 0 #earlier races on this circuit
        for i in range(circuits_df.shape[0]):
            #0 on a first visit (no history yet), otherwise mean of earlier finishes
            avg_finish_pos = round((cumulative_finish_pos / count), 0) if count > 0 else 0.0
            index = circuits_df.index[i]
            results.loc[index, 'avg_finish_pos_on_circuit'] = avg_finish_pos
            cumulative_finish_pos += circuits_df['positionOrder'].iloc[i]
            count +=1

In [None]:
#adding previous race finish
results['previous_finish'] = 0
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    previous_finish = 0
    for i in range(driver_df.shape[0]):
        index = driver_df.index[i]
        results.loc[index, 'previous_finish'] = previous_finish
        previous_finish = driver_df['positionOrder'].iloc[i]        
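
For reference, a previous-race feature like this can also be written as a grouped `shift`. A small sketch on a made-up frame (`demo` is hypothetical), keeping the same convention of 0 for a driver's first race:

```python
import pandas as pd

# Hypothetical toy data: two drivers, two rounds
demo = pd.DataFrame({
    'driverId': [1, 2, 1, 2],
    'year': [2023, 2023, 2023, 2023],
    'round': [1, 1, 2, 2],
    'positionOrder': [3, 7, 1, 12],
}).sort_values(['driverId', 'year', 'round'])

# shift within each driver so one driver's result never bleeds into another's
demo['previous_finish'] = (
    demo.groupby('driverId')['positionOrder'].shift(1).fillna(0).astype(int)
)
```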

In [None]:
#adding average of last 4 (or up to 4) race finishes
results['avg_last_4_finishes'] = np.nan
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    driver_df['avg_last_4_finishes'] = round((driver_df['positionOrder'].rolling(4, min_periods=1).mean()), 0)
    temp_df = driver_df[['avg_last_4_finishes']].shift(periods=1, fill_value=0)
    results = results.combine_first(temp_df)
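
Calling `combine_first` once per driver rebuilds the whole frame on every pass. A possible single-pass version, sketched on toy data (`demo` is hypothetical): shift first so the current race is excluded, then take the rolling mean of up to 4 earlier finishes.

```python
import pandas as pd

# Hypothetical toy data: one driver, six consecutive races
demo = pd.DataFrame({
    'driverId': [1] * 6,
    'year': [2023] * 6,
    'round': [1, 2, 3, 4, 5, 6],
    'positionOrder': [2, 4, 6, 8, 10, 12],
}).sort_values(['driverId', 'year', 'round'])

demo['avg_last_4_finishes'] = (
    demo.groupby('driverId')['positionOrder']
        # shift(1) drops the current race; rolling(4) averages up to 4 earlier ones
        .transform(lambda s: s.shift(1).rolling(4, min_periods=1).mean())
        .round(0)
        .fillna(0)  # first race has no history, same 0 convention as the loop
)
```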

In [None]:
#adding YTD avg finish pos
results['YTD_avg_finish_pos'] = np.nan
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    for year in range(driver_df['year'].min(), driver_df['year'].max()+1):
        one_year_df = driver_df[driver_df['year'] == year].sort_values(by=['round'])
        one_year_df['YTD_avg_finish_pos'] = round((one_year_df['positionOrder'].expanding().mean()), 0).shift(periods=1, fill_value=0)
        temp_df = one_year_df[['YTD_avg_finish_pos']]
        results = results.combine_first(temp_df)
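
The year-to-date average follows the same pattern if the grouping is by driver and year. A minimal sketch with made-up numbers (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one driver across two seasons
demo = pd.DataFrame({
    'driverId': [1, 1, 1, 1],
    'year': [2022, 2022, 2023, 2023],
    'round': [1, 2, 1, 2],
    'positionOrder': [4, 8, 2, 6],
}).sort_values(['driverId', 'year', 'round'])

demo['YTD_avg_finish_pos'] = (
    demo.groupby(['driverId', 'year'])['positionOrder']
        # expanding mean within the season, shifted so the current race is excluded
        .transform(lambda s: s.expanding().mean().shift(1))
        .round(0)
        .fillna(0)  # first race of each season has no YTD history
)
```

Grouping on `year` as well resets the average at the start of each season, which the per-year inner loop does by hand.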

In [None]:
#adding career win% (to date, i.e. over earlier races only)
results['career_win_pct'] = 0.0 #float column, since percentages are not whole numbers
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    wins = 0
    count = 0 #earlier races in the driver's career
    for i in range(driver_df.shape[0]):
        #0 for the first race, otherwise wins / earlier races
        win_percentage = round(((wins / count) * 100), 1) if count > 0 else 0.0
        index = driver_df.index[i]
        results.loc[index, 'career_win_pct'] = win_percentage
        if driver_df['won'].iloc[i] == 1:
            wins +=1
        count +=1

In [None]:
#adding YTD win% (over earlier races in the same season only)
results['YTD_win_pct'] = 0.0 #float column, since percentages are not whole numbers
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    for year in range(driver_df['year'].min(), driver_df['year'].max()+1):
        one_year_df = driver_df[driver_df['year'] == year].sort_values(by=['round'])
        wins = 0
        count = 0 #earlier races this season
        for i in range(one_year_df.shape[0]):
            #0 for the first race of the season, otherwise wins / earlier races
            win_percentage = round(((wins / count) * 100), 1) if count > 0 else 0.0
            index = one_year_df.index[i]
            results.loc[index, 'YTD_win_pct'] = win_percentage
            if one_year_df['won'].iloc[i] == 1:
                wins +=1
            count +=1

In [None]:
#adding top 3 finishes on circuit (to date)
results['top_3_finishes_on_circuit'] = 0
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    circuit_list = driver_df['circuitId'].unique()
    for circuit in circuit_list:
        #sort by round too: some circuits host two races in one year
        circuits_df = driver_df[driver_df['circuitId'] == circuit].sort_values(by=['year','round'])
        x=0
        for i in range(circuits_df.shape[0]):
            index = circuits_df.index[i]
            results.loc[index, 'top_3_finishes_on_circuit'] = x
            if 1 <= circuits_df['positionOrder'].iloc[i] <= 3:
                x+=1

In [None]:
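#adding whether driver is a rookie (first year in the data)
results['rookie'] = 0
driver_list = results['driverId'].unique()
for driver in driver_list:
    driver_df = results[results['driverId'] == driver].sort_values(by=['year','round'])
    rookie_year = driver_df['year'].min()
    #flag every race the driver entered in their first season
    rookie_index = driver_df[driver_df['year'] == rookie_year].index
    results.loc[rookie_index, 'rookie'] = 1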

In [None]:
results.info()

In [None]:
#testing testing
driver = 1
circuit = 1
year = 2020

In [None]:
results[results['driverId'] == driver].sort_values(by=['year','round'])

In [None]:
results[(results['driverId'] == driver) & (results['circuitId'] == circuit)].sort_values(by=['year'])

What features are important and can they be factored into the model?  
- performance in previous race(s) and pre-race sessions <-- DEFINITELY WANT TO ADD 2024 PRE-RACE SESSIONS SINCE DATA EXISTS!
- constructor performance? (good car = good performance)
- within teams, driver vs driver stats?
- team pitstop performance <- need to add
- weather (rain / rain expected)
- tyre performance <-- if I can find this (maybe based on tyre+laptime/stints?)
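
As a starting point for the constructor idea, here is one possible sketch: average the finishes of a team's cars per race, then take an expanding mean over earlier races only. The frame `demo` and the column name `constructor_prior_avg_finish` are hypothetical, not part of the existing data.

```python
import pandas as pd

# Hypothetical toy data: one constructor, two cars, three races
demo = pd.DataFrame({
    'constructorId': [9, 9, 9, 9, 9, 9],
    'year': [2023] * 6,
    'round': [1, 1, 2, 2, 3, 3],
    'positionOrder': [1, 3, 2, 6, 5, 7],
})

# one row per constructor per race: mean finish of its cars
per_race = (
    demo.groupby(['constructorId', 'year', 'round'], as_index=False)['positionOrder']
        .mean()
        .sort_values(['constructorId', 'year', 'round'])
)

# expanding mean over earlier races only (shift keeps the current race out)
per_race['constructor_prior_avg_finish'] = (
    per_race.groupby('constructorId')['positionOrder']
            .transform(lambda s: s.expanding().mean().shift(1))
)

# attach the race-level feature back onto the per-driver rows
demo = demo.merge(
    per_race.drop(columns='positionOrder'),
    on=['constructorId', 'year', 'round'],
    how='left',
)
```

Aggregating to race level before shifting matters here: shifting the per-driver rows directly would let one teammate's result in the current race leak into the other's feature.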

Model building:

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report 

In [None]:
train_data_1 = results[results['year'] < 2024]
test_data_1 = results[results['year'] == 2024]

In [None]:
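#'points' and 'won' are outcomes of the race being predicted, so they are dropped with the target to avoid leakage
#(YTDpoints, WDCposition and wins need checking too: they leak if they are recorded after the race)
drop_cols = ['positionOrder', 'date', 'code', 'points', 'won']
X_train = train_data_1.drop(columns = drop_cols)
y_train = train_data_1['positionOrder']
X_test = test_data_1.drop(columns = drop_cols)
y_test = test_data_1['positionOrder']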

In [None]:
params = {
    'objective': 'multiclass', #predicting placement as a class; one class per finishing position seen in training (can exceed 20 in older seasons)
    'boosting_type': 'gbdt', #default
    'num_leaves': 31, #default
    'learning_rate': 0.1, #default
    'feature_fraction': 1.0 #default; can be reduced to speed up training or reduce overfitting
}
clf = lgb.LGBMClassifier(**params)
clf.fit(X_train, y_train)

In [None]:
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))
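
`classification_report` treats every finishing position as an unrelated label, so predicting 2nd for a driver who finished 3rd is scored the same as predicting 20th. A rough sketch of two order-aware checks that could sit alongside it, on made-up numbers (the frame `demo` and its `actual`/`predicted` columns are hypothetical stand-ins for `y_test` and `predictions`):

```python
import pandas as pd

# Hypothetical stand-in for per-race actual vs predicted finishing positions
demo = pd.DataFrame({
    'raceId': [1, 1, 1, 2, 2, 2],
    'actual': [1, 2, 3, 1, 2, 3],
    'predicted': [1, 3, 2, 3, 2, 1],
})

# mean absolute error, measured in finishing places
mae = (demo['actual'] - demo['predicted']).abs().mean()

def spearman(group):
    # Spearman correlation = Pearson correlation of the ranks
    return group['actual'].rank().corr(group['predicted'].rank())

# one rank correlation per race: 1 = same order, -1 = reversed order
per_race_rho = pd.Series({race_id: spearman(g) for race_id, g in demo.groupby('raceId')})
print(mae, per_race_rho.to_dict())
```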