# Data selection

## Dependencies

The following dependencies are used:


In [1]:
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Merging data

The main source of data is Kaggle, specifically:

_Source: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020_

From this source, the following datasets are used:

- results.csv
- drivers.csv
- constructors.csv
- races.csv
- circuits.csv
- status.csv
- driver_standings.csv
- constructor_standings.csv
- qualifying.csv

The remaining files are left unused:

- constructor_results.csv
- lap_times.csv
- pit_stops.csv
- seasons.csv
- sprint_results.csv

These files were omitted either because they do not appear useful or because their attributes are only known once the race has finished, i.e., they could act as target attributes.

If the trained model performs poorly, these unused datasets will be revisited or, failing that, data from sources other than Kaggle will be sought.


In [2]:
results_df = pd.read_csv("../assets/data/kaggle/results.csv")
drivers_df = pd.read_csv("../assets/data/kaggle/drivers.csv")
constructors_df = pd.read_csv("../assets/data/kaggle/constructors.csv")
races_df = pd.read_csv("../assets/data/kaggle/races.csv")
circuits_df = pd.read_csv("../assets/data/kaggle/circuits.csv")
status_df = pd.read_csv("../assets/data/kaggle/status.csv")
driver_standings_df = pd.read_csv("../assets/data/kaggle/driver_standings.csv")
constructor_standings_df = pd.read_csv(
    "../assets/data/kaggle/constructor_standings.csv"
)
qualifyings_df = pd.read_csv("../assets/data/kaggle/qualifying.csv")

df = pd.merge(results_df, drivers_df, how="left", on="driverId")
df = df.merge(constructors_df, how="left", on="constructorId")
df = df.merge(races_df, how="left", on="raceId")
df = df.merge(circuits_df, how="left", on="circuitId", suffixes=("", "_z"))
df = df.merge(status_df, how="left", on="statusId")
df = df.merge(driver_standings_df, how="left", on=["raceId", "driverId"])
df = df.merge(constructor_standings_df, how="left", on=["raceId", "constructorId"])
df = df.merge(
    qualifyings_df,
    how="left",
    on=["raceId", "driverId", "constructorId"],
    suffixes=("", "_u"),
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26080 entries, 0 to 26079
Data columns (total 72 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   resultId                26080 non-null  int64  
 1   raceId                  26080 non-null  int64  
 2   driverId                26080 non-null  int64  
 3   constructorId           26080 non-null  int64  
 4   number_x                26080 non-null  object 
 5   grid                    26080 non-null  int64  
 6   position_x              26080 non-null  object 
 7   positionText_x          26080 non-null  object 
 8   positionOrder           26080 non-null  int64  
 9   points_x                26080 non-null  float64
 10  laps                    26080 non-null  int64  
 11  time_x                  26080 non-null  object 
 12  milliseconds            26080 non-null  object 
 13  fastestLap              26080 non-null  object 
 14  rank                    26080 non-null
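
The left merges above keep the row count of `results.csv` only if each lookup table has one row per key, and the `_x`/`_y` names in the output come from columns that share a name across tables. The following is a minimal sketch of both points, using small hypothetical frames rather than the Kaggle files; the `validate` argument is an optional safeguard and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy tables standing in for results.csv and drivers.csv.
results = pd.DataFrame(
    {"resultId": [1, 2, 3], "driverId": [10, 11, 10], "number": [44, 5, 44]}
)
drivers = pd.DataFrame(
    {"driverId": [10, 11], "number": [44, 5], "surname": ["A", "B"]}
)

# validate="many_to_one" raises pandas.errors.MergeError if driverId is
# duplicated in the right-hand table, which would otherwise add rows silently.
merged = results.merge(drivers, how="left", on="driverId", validate="many_to_one")

# The shared non-key column "number" is disambiguated with the default
# suffixes: "_x" for the left table and "_y" for the right one.
print(list(merged.columns))

# A left merge against a table with unique keys keeps the left row count.
print(len(merged) == len(results))
```

Passing `suffixes=("", "_z")`, as the circuits merge does, leaves the left-hand names untouched and marks only the right-hand duplicates.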

## Renaming data

Once the datasets have been merged, repeated or useless attributes, such as identifiers, are removed. In addition, some of the attributes are renamed to make them easier to understand.
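
The drop and rename steps refer to columns by name, including the suffixed names produced by the merges. By default, `DataFrame.rename` silently skips mapping keys that match no column, so a stale or misspelled key goes unnoticed. The following is a minimal sketch, using a hypothetical toy frame and mapping, of how `errors="raise"` surfaces such keys.

```python
import pandas as pd

# Hypothetical toy frame: only "grid" exists as a column.
toy = pd.DataFrame({"grid": [1, 2]})
mapping = {"grid": "positionGrid", "statusId": "status"}

# Default behaviour (errors="ignore"): the unmatched key is skipped silently.
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))

# errors="raise" turns the unmatched key into a KeyError instead.
try:
    toy.rename(columns=mapping, errors="raise")
    unmatched_detected = False
except KeyError:
    unmatched_detected = True
print(unmatched_detected)
```

`DataFrame.drop` already behaves this way by default, so a missing name in the drop list fails loudly without extra arguments.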


In [3]:
df = df.drop(
    [
        "resultId",
        "position_x",
        "positionText_x",
        "time_x",
        "driverId",
        "code",
        "forename",
        "surname",
        "url_x",
        "constructorId",
        "name_x",
        "url_y",
        "raceId",
        "url",
        "fp1_date",
        "fp1_time",
        "fp2_date",
        "fp2_time",
        "fp3_date",
        "fp3_time",
        "quali_date",
        "quali_time",
        "sprint_date",
        "sprint_time",
        "circuitId",
        "name",
        "url_z",
        "statusId",
        "driverStandingsId",
        "points_y",
        "position_y",
        "positionText_y",
        "constructorStandingsId",
        "positionText",
        "qualifyId",
        "number",
        "position_u",
    ],
    axis=1,
)

col_name = {
    "number_x": "carNumber",
    "grid": "positionGrid",
    "positionOrder": "positionFinal",
    "points_x": "pointsDriverEarned",
    "laps": "lapsCompleted",
    "milliseconds": "timeTakenInMillisec",
    "rank": "fastestLapRank",
    "fastestLapSpeed": "maxSpeed",
    "statusId": "status",
    "number_y": "driverNumber",
    "dob": "driverBirth",
    "nationality_x": "driverNationality",
    "nationality_y": "constructorNationality",
    "year": "raceYear",
    "round": "raceRound",
    "name_y": "grandPrix",
    "date": "raceDate",
    "time_y": "raceTime",
    "location": "circuitLocation",
    "country": "circuitCountry",
    "lat": "circuitLat",
    "lng": "circuitLng",
    "alt": "circuitAlt",
    "status": "driverStatus",
    "wins_x": "driverWins",
    "wins_y": "constructorWins",
    "points": "pointsConstructorEarned",
    "position": "constructorPosition",
}

df.rename(columns=col_name, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26080 entries, 0 to 26079
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   carNumber                26080 non-null  object 
 1   positionGrid             26080 non-null  int64  
 2   positionFinal            26080 non-null  int64  
 3   pointsDriverEarned       26080 non-null  float64
 4   lapsCompleted            26080 non-null  int64  
 5   timeTakenInMillisec      26080 non-null  object 
 6   fastestLap               26080 non-null  object 
 7   fastestLapRank           26080 non-null  object 
 8   fastestLapTime           26080 non-null  object 
 9   maxSpeed                 26080 non-null  object 
 10  driverRef                26080 non-null  object 
 11  driverNumber             26080 non-null  object 
 12  driverBirth              26080 non-null  object 
 13  driverNationality        26080 non-null  object 
 14  constructorRef        
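
Several numeric-looking columns in the output above, such as `timeTakenInMillisec`, are reported as `object` with no nulls. One possible cause, assumed here and not verified against these files, is that missing values are stored as the literal text `\N`, which pandas does not treat as missing by default. The following is a minimal sketch of the effect on a hypothetical two-row extract.

```python
import io

import pandas as pd

# Hypothetical extract; the literal text \N stands in for a missing value.
raw = "resultId,milliseconds\n1,5690616\n2,\\N\n"

# Read as-is: the placeholder forces the whole column to be read as text.
as_text = pd.read_csv(io.StringIO(raw))

# Declaring the placeholder as missing lets pandas parse the column as numbers.
as_number = pd.read_csv(io.StringIO(raw), na_values=["\\N"])

print(pd.api.types.is_numeric_dtype(as_text["milliseconds"]))
print(pd.api.types.is_numeric_dtype(as_number["milliseconds"]))
print(int(as_number["milliseconds"].isna().sum()))
```

If the assumption holds, passing `na_values` to each `pd.read_csv` call would make the non-null counts reflect the real amount of missing data.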

Finally, we save the result for use in later sections.


In [4]:
df.to_csv("../assets/data/processed/data_selection.csv", index=False)
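
Since later sections depend on this file, a quick round trip can confirm that it was written as expected. The following is a minimal sketch that uses a hypothetical toy frame and a temporary directory instead of the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the processed frame.
toy = pd.DataFrame({"positionGrid": [1, 2], "driverRef": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "data_selection.csv"
    # to_csv does not create missing parent directories, so create them first.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, the reloaded frame has the same shape and column names.
print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```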