
Predicting 2023 Formula One Constructors' Championship Standings


F1_ML

The goal of this project is to build a machine learning model to predict the Formula One World Constructors’ Championship Standings for the upcoming 2023 season.

The files listed below contain all of my code and model-building steps. The full project, including the thought and work process behind it, is explained and presented here.


R packages used:

tidyverse, tidymodels, parsnip, kknn, recipes, workflows, glmnet, magrittr, ranger,
naniar, visdat, dplyr, ggplot2, ggthemes, corrplot, vip, themis, kableExtra, ISLR

Some of these packages are necessary for the model-building process, while others support concise, convenient coding and clean visual presentation.




The following files represent my overall workflow. I put raw code in .R script files and saved important arguments or variables for later use in the correspondingly named .rda files.


read_data.R

R script file read_data.R includes code used to read in the CSV files.
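
As a rough illustration of this step, here is a minimal sketch using readr (loaded with tidyverse). The file names and paths are assumptions for illustration only, not the repository's actual data files.

```r
# Minimal sketch of reading the raw CSV files with readr.
# File names/paths are placeholders, not the project's actual data files.
library(readr)

races        <- read_csv("data/races.csv")
results      <- read_csv("data/results.csv")
constructors <- read_csv("data/constructors.csv")
```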


modify_data.R

This modify_data.R file includes code used to manipulate and join the data sets. Initial data cleaning is also executed in this R script file, ranging from converting timestamps into workable numeric variables to streamlining several related variables into one useful parameter.
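
For a sense of what this kind of cleaning looks like, here is a hedged sketch that joins the tables read in above and converts a "m:ss.mmm" lap-time string into numeric seconds. The column names (raceId, constructorId, fastestLapTime) and the joined object are assumptions, not necessarily the project's actual variables.

```r
# Sketch of joining the data sets and converting a timestamp string
# ("m:ss.mmm") into a workable numeric variable (seconds).
# Column names are assumptions for illustration only.
library(dplyr)
library(tidyr)

results_joined <- results %>%
  left_join(races, by = "raceId") %>%
  left_join(constructors, by = "constructorId") %>%
  separate(fastestLapTime, into = c("min", "sec"), sep = ":", convert = TRUE) %>%
  mutate(fastest_lap_sec = 60 * min + sec) %>%   # total lap time in seconds
  select(-min, -sec)
```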


eda.R

Exploratory data analysis code is included in the R script file eda.R. This file contains further cleaning with a focus on missing data, along with some visual exploratory data analysis, mostly looking at surface-level trends and relationships between variables, which provides useful initial insight before considering potential models.
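
As an example of this kind of check, the sketch below uses the naniar, visdat, and corrplot packages listed above to inspect missingness and pairwise correlations; the results_joined object carries over from the hypothetical sketch above.

```r
# Sketch of missing-data and correlation exploration.
library(naniar)
library(visdat)
library(corrplot)
library(dplyr)

vis_miss(results_joined)      # proportion of missing values per variable
gg_miss_var(results_joined)   # count of missing values by variable

# Correlation plot across numeric variables, ignoring missing pairs
results_joined %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  corrplot(type = "lower", diag = FALSE)
```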


model_building2.R

This file includes the steps to set up the machine learning models. This involves splitting the data into training and testing sets and building a recipe with the desired response variable and predictors. Using the tidymodels recipe() function allows us to dummy code categorical predictors and impute missing values in the predictors within the recipe itself (a sketch of this setup follows the model list below). I then set up k-fold cross-validation and apply different machine learning models to the recipe. I developed the following models so that the truly best-fitting one could be identified through a thorough comparison:

  • linear polynomial regression
  • k-nearest neighbors (knn)
  • elastic net linear regression
  • elastic net with lasso regression
  • elastic net with ridge regression
  • random forest
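
The sketch below illustrates the split, recipe, and k-fold cross-validation setup described above. The modeling data frame f1_data, the response points, and the recipe steps shown are assumptions for illustration; the actual recipe in model_building2.R may differ.

```r
# Sketch of the training/testing split, recipe, and k-fold cross-validation.
# Object and variable names are assumptions for illustration only.
library(tidymodels)

set.seed(2023)
f1_split <- initial_split(f1_data, prop = 0.8, strata = points)
f1_train <- training(f1_split)
f1_test  <- testing(f1_split)

f1_recipe <- recipe(points ~ ., data = f1_train) %>%
  step_impute_linear(all_numeric_predictors()) %>%   # impute missing values
  step_dummy(all_nominal_predictors()) %>%           # dummy code categorical predictors
  step_normalize(all_numeric_predictors())

f1_folds <- vfold_cv(f1_train, v = 10)               # k-fold cross validation
```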

To build the models, we use the following steps (a sketch applying them to the random forest model follows the list):

  • set up each model with its tuning parameters, the engine, and regression mode
  • set up a workflow() with each model and the recipe
  • set up a tuning grid with grid_regular() and levels for tuning the parameters
  • tune each model with tune_grid() using the corresponding workflow, k-fold cross validation, and tuning grid
  • collect the root mean squared error (RMSE) metric of the tuned models and find the lowest RMSE for each model
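
The sketch below applies these steps to the random forest model as one example; the tuning ranges, grid levels, and object names are assumptions, not the values used in model_building2.R.

```r
# Sketch of the model set-up, workflow, tuning grid, tuning, and RMSE collection
# for the random forest model. Ranges and levels are assumptions.
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_workflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(f1_recipe)

rf_grid <- grid_regular(
  mtry(range = c(1, 8)),
  trees(range = c(200, 600)),
  min_n(range = c(10, 20)),
  levels = 5
)

rf_tuned <- tune_grid(rf_workflow, resamples = f1_folds, grid = rf_grid)

# Lowest cross-validated RMSE across the grid
collect_metrics(rf_tuned) %>%
  filter(.metric == "rmse") %>%
  arrange(mean)
```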

model_results_final.rda

The corresponding .R script file is not included here, but the results are saved in this .rda file. I analyzed the performance of the more noteworthy models: elastic net, polynomial regression, knn, and random forest. For a thorough explanation and interpretation of the parameters and performance of these models, refer to the completed presentation here.

After analyzing RMSE across the tuning parameters, I conclude that the random forest model with mtry = 5, trees = 400, and min_n = 20 is the best-performing model. I then fit that model on the testing split and once again analyze the RMSE.
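
A minimal sketch of that final step is below, assuming the rf_workflow, f1_train, f1_test, and points objects from the earlier hypothetical sketches; select_best() on the saved tuning results would be the usual alternative to hard-coding the parameters.

```r
# Sketch of finalizing the chosen random forest model and evaluating it on the
# testing split. Object names carry over from the earlier hypothetical sketches.
library(tidymodels)

best_rf <- tibble(mtry = 5, trees = 400, min_n = 20)

final_rf_fit <- rf_workflow %>%
  finalize_workflow(best_rf) %>%   # plug in the chosen tuning parameters
  fit(data = f1_train)             # fit on the full training split

# Predict on the testing split and compute RMSE
augment(final_rf_fit, new_data = f1_test) %>%
  rmse(truth = points, estimate = .pred)
```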
