# Missing values


## Dependencies


The dependencies used are as follows:


In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Checking nulls


First, we will check the number of nulls in each column.


In [2]:
df = pd.read_csv("../assets/data/processed/data_transformation.csv")
df.isnull().sum()

carNumber                      6
positionGrid                   0
positionFinal                  0
pointsDriverEarned             0
lapsCompleted                  0
timeTakenInMillisec        18830
fastestLap                 18465
fastestLapRank             18249
fastestLapTime             18465
maxSpeed                   18465
driverRef                      0
driverNumber               20257
driverNationality              0
constructorRef                 0
constructorNationality         0
raceYear                       0
raceRound                      0
grandPrix                      0
raceTime                   18469
circuitRef                     0
circuitLocation                0
circuitCountry                 0
circuitLat                     0
circuitLng                     0
circuitAlt                    60
driverStatus                   0
driverWins                   469
pointsConstructorEarned     1867
constructorPosition         1867
constructorWins             1867
q1        

Several columns contain nulls, among them the three qualifying times (`q1`, `q2` and `q3`). Because these times say a lot about how well a driver can perform at a given circuit in the days before the grand prix, we start the treatment of nulls with them.

Many of these nulls have a historical explanation: the qualifying format known today, with three sessions, has only existed since 2006, with minor modifications since then. Before that, a race weekend could have no qualifying sessions of this kind at all, or fewer than three. For a brief history of qualifying and the grid, see the following article from 2016.

_Source: https://www.formula1.com/en/latest/features/2016/3/deciding-the-grid-a-history-of-f1-qualifying-formats.html_

The data corroborates this: the seasons in which all three qualifying times are recorded start in 2006.


In [3]:
first_q_not_null = ~(df["q1"].isnull())
second_q_not_null = ~(df["q2"].isnull())
third_q_not_null = ~(df["q3"].isnull())
mask = first_q_not_null & second_q_not_null & third_q_not_null

df[mask]["raceYear"].drop_duplicates().sort_values().to_numpy().reshape(-1, 1)

array([[2006],
       [2007],
       [2008],
       [2009],
       [2010],
       [2011],
       [2012],
       [2013],
       [2014],
       [2015],
       [2016],
       [2017],
       [2018],
       [2019],
       [2020],
       [2021],
       [2022],
       [2023]], dtype=int64)
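The same check can be written more compactly with `notna().all(axis=1)`. The following is a minimal sketch on a toy frame; the values are invented for illustration and only the column names match the data set.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names match the data set.
toy = pd.DataFrame(
    {
        "raceYear": [2005, 2005, 2006, 2006, 2007],
        "q1": [80000.0, None, 79000.0, 79500.0, 78000.0],
        "q2": [None, None, 78500.0, None, 77500.0],
        "q3": [None, None, 78000.0, None, 77000.0],
    }
)

# True only for rows where all three qualifying times are present.
all_sessions = toy[["q1", "q2", "q3"]].notna().all(axis=1)

# Seasons that contain at least one fully recorded qualifying result.
years_with_all_sessions = sorted(toy.loc[all_sessions, "raceYear"].unique().tolist())
print(years_with_all_sessions)
```

In this toy example only 2006 and 2007 contain a row with all three times, so those are the seasons that survive the filter.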

Therefore, the model will be trained only on data from 2006 onwards.

If the model turns out to perform poorly, we can change the range and study how its behavior changes.


In [4]:
df = df[df["raceYear"] >= 2006]

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                   60
driverStatus                  0
driverWins                   51
pointsConstructorEarned      18
constructorPosition          18
constructorWins              18
q1                          102
q2      

## Qualifiers


Some nulls remain, which is expected, since not all drivers take part in all three qualifying sessions. As the grid position is determined by the lowest time across the three sessions, it is sufficient to copy the existing times into the sessions that are null. If a driver has a time for the first two sessions but not for the third, one of those two times is randomly selected to fill the third.


In [5]:
mask = first_q_not_null & ~second_q_not_null & ~third_q_not_null
df.loc[mask, ["q2", "q3"]] = df.loc[mask, "q1"]

mask = first_q_not_null & second_q_not_null & ~third_q_not_null
df.loc[mask, "q3"] = df.loc[mask, ["q1", "q2"]].sample(axis=1).squeeze()

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                   60
driverStatus                  0
driverWins                   51
pointsConstructorEarned      18
constructorPosition          18
constructorWins              18
q1                          102
q2      
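One detail worth noting: `DataFrame.sample(axis=1)` with default arguments draws a single column for the whole frame, so every affected row is filled from the same session. If a separate draw per row is preferred, a minimal sketch could look like the following. The toy values and the seed are assumptions made for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Toy frame with hypothetical qualifying times in milliseconds.
toy = pd.DataFrame(
    {
        "q1": [80000.0, 81000.0, 82000.0, 83000.0],
        "q2": [79000.0, 80500.0, 81500.0, None],
        "q3": [78500.0, None, None, None],
    }
)

# Rows that have Q1 and Q2 times but no Q3 time.
needs_q3 = toy["q1"].notna() & toy["q2"].notna() & toy["q3"].isna()

# One coin flip per row: take q1 where it is True, q2 otherwise.
take_q1 = rng.random(len(toy)) < 0.5
choice = pd.Series(np.where(take_q1, toy["q1"], toy["q2"]), index=toy.index)

# Only the selected rows are written; everything else stays untouched.
toy.loc[needs_q3, "q3"] = choice[needs_q3]
```

Rows that already have a Q3 time, or that lack a Q2 time, are left as they are.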

Nulls still remain. These probably belong to drivers who, for some reason, could not take part in qualifying and therefore start from the last positions.


In [6]:
df[df["q1"].isnull()]["positionGrid"].drop_duplicates().sort_values().values

array([ 0, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], dtype=int64)

Ignoring the value 0, which is discussed later, we can see that these are indeed positions at the back of the grid.

Therefore, these drivers are assigned the maximum (slowest) qualifying time, computed per race rather than globally.


In [7]:
qualify_by_race = (
    df[["raceYear", "raceRound", "q1"]]
    .groupby(["raceYear", "raceRound"])
    .agg({"q1": "max"})
)

mask = ~first_q_not_null & ~second_q_not_null & ~third_q_not_null

mg = df.loc[mask]
mg = mg.join(qualify_by_race, how="left", on=["raceYear", "raceRound"], rsuffix="Max")
df.loc[mask, ["q1", "q2", "q3"]] = mg["q1Max"]

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                   60
driverStatus                  0
driverWins                   51
pointsConstructorEarned      18
constructorPosition          18
constructorWins              18
q1                            0
q2      
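The per-race maximum can also be obtained without an explicit join, using `groupby(...).transform("max")`, which returns a value aligned with every row of the group. This is a minimal sketch on invented values, offered as an alternative formulation rather than as the method used above.

```python
import pandas as pd

# Toy frame with hypothetical Q1 times in milliseconds.
toy = pd.DataFrame(
    {
        "raceYear": [2010, 2010, 2010, 2011, 2011],
        "raceRound": [1, 1, 1, 1, 1],
        "q1": [90000.0, 91000.0, None, 85000.0, None],
    }
)

# Slowest recorded Q1 time of each race, broadcast back to every row of that race.
slowest_q1 = toy.groupby(["raceYear", "raceRound"])["q1"].transform("max")

# Fill only the missing entries; recorded times stay untouched.
toy["q1"] = toy["q1"].fillna(slowest_q1)
```

Because the maximum ignores nulls, each missing time receives the slowest time actually recorded in the same race.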

## Grid positions


With the qualifying nulls resolved, we move on to the remaining columns.

Grid positions equal to 0 are special cases in which a driver takes part in the race without occupying a regular grid slot, for example after being allowed to race without a qualifying result or when starting from the pit lane. For example, in the third grand prix of the 2023 season, Pérez and Bottas both raced with a grid position of 0, as shown in the following link:

_Source: https://www.formula1.com/en/results.html/2023/races/1143/australia/starting-grid.html_

We can also see this in the data:


In [8]:
race_last_year = df["raceYear"] == 2023
race_round_three = df["raceRound"] == 3
position_grid_zero = df["positionGrid"] == 0
mask = race_last_year & race_round_three & position_grid_zero

df[mask].head()

Unnamed: 0,carNumber,positionGrid,positionFinal,pointsDriverEarned,lapsCompleted,timeTakenInMillisec,fastestLap,fastestLapRank,fastestLapTime,maxSpeed,driverRef,driverNumber,driverNationality,constructorRef,constructorNationality,raceYear,raceRound,grandPrix,raceTime,circuitRef,circuitLocation,circuitCountry,circuitLat,circuitLng,circuitAlt,driverStatus,driverWins,pointsConstructorEarned,constructorPosition,constructorWins,q1,q2,q3,driverAgeToday,driverAgeAtRace
25884,11.0,0,5,11.0,58,9161691.0,53.0,1.0,80235.0,236.814,perez,11.0,Mexican,red_bull,Austrian,2023,3,Australian Grand Prix,18000000.0,albert_park,Melbourne,Australia,-37.8497,144.968,10.0,Finished,1.0,123.0,1.0,3.0,78714.0,78714.0,78714.0,34,33
25890,77.0,0,11,0.0,58,9164884.0,46.0,17.0,82233.0,231.06,bottas,77.0,Finnish,alfa,Swiss,2023,3,Australian Grand Prix,18000000.0,albert_park,Melbourne,Australia,-37.8497,144.968,10.0,Finished,0.0,6.0,8.0,0.0,78714.0,78714.0,78714.0,35,34


Since the grid position from which they actually started is not known, these drivers are placed at the back of the grid, one position behind the last grid position of their race.


In [9]:
last_grid_by_race = (
    df[["raceYear", "raceRound", "positionGrid"]]
    .groupby(["raceYear", "raceRound"])
    .agg({"positionGrid": lambda x: max(x) + 1})
)

mask = position_grid_zero
mg = df[mask]
mg = mg.join(last_grid_by_race, how="left", on=["raceYear", "raceRound"], rsuffix="Max")
df.loc[mask, "positionGrid"] = mg["positionGridMax"]

df["positionGrid"].drop_duplicates().sort_values().to_numpy()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24], dtype=int64)
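A consequence of the `max + 1` rule is that all drivers with a grid position of 0 in the same race end up sharing the same position. The following minimal sketch, on invented values, makes this visible; it uses `transform` and `Series.mask` instead of the join above.

```python
import pandas as pd

# Toy frame: one race in which two drivers have a grid position of 0.
toy = pd.DataFrame(
    {
        "raceYear": [2023, 2023, 2023, 2023],
        "raceRound": [3, 3, 3, 3],
        "positionGrid": [1, 18, 0, 0],
    }
)

# Last occupied grid slot of each race plus one, aligned with every row.
last_slot = toy.groupby(["raceYear", "raceRound"])["positionGrid"].transform("max") + 1

# Replace only the zeros; regular grid positions are kept.
toy["positionGrid"] = toy["positionGrid"].mask(toy["positionGrid"] == 0, last_slot)
print(toy["positionGrid"].tolist())
```

Both drivers receive position 19 here. If distinct positions were required, the tie would have to be broken by some additional rule, which the original data does not provide.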

## Circuits


Regarding the circuits, altitude information is missing for the following circuits:


In [10]:
df[df["circuitAlt"].isnull()]["circuitRef"].drop_duplicates()

25340    losail
25480     miami
Name: circuitRef, dtype: object

We simply look up the altitude of these two circuits manually and enter the values into the data set.


In [11]:
df.loc[df["circuitRef"] == "losail", "circuitAlt"] = 20
df.loc[df["circuitRef"] == "miami", "circuitAlt"] = 2

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                    0
driverStatus                  0
driverWins                   51
pointsConstructorEarned      18
constructorPosition          18
constructorWins              18
q1                            0
q2      

## Driver wins


Driver wins are counted per season, i.e., they restart at 0 each year, so the nulls are simple to compute from the final positions.


In [12]:
df = df.sort_values(by=["raceYear", "driverRef", "raceRound"]).reset_index(drop=True)

drivers = df[df["driverWins"].isnull()]["driverRef"].drop_duplicates().to_numpy()
j = df.columns.get_loc("driverWins")

for driver in drivers:
    mask = (df["driverRef"] == driver) & (df["driverWins"].isnull())
    indexes = df[mask].index
    for i in indexes:
        # After reset_index, labels equal positions, so i - 1 is the previous row.
        pdata, cdata = df.iloc[i - 1], df.iloc[i]
        won = int(cdata["positionFinal"] == 1)
        if (
            pdata["driverRef"] != cdata["driverRef"]
            or pdata["raceYear"] != cdata["raceYear"]
        ):
            # First race of the season for this driver: the count starts from 0.
            df.iloc[i, j] = won
        else:
            # Otherwise carry over the previous count, plus one on a win.
            df.iloc[i, j] = pdata["driverWins"] + won

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                    0
driverStatus                  0
driverWins                    0
pointsConstructorEarned      18
constructorPosition          18
constructorWins              18
q1                            0
q2      
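When every race of a season is present in the data, the same per-season count can be rebuilt from scratch with a grouped cumulative sum. This is a minimal sketch on invented results; unlike the loop above, it recomputes the whole column rather than only the nulls.

```python
import pandas as pd

# Toy results for two hypothetical drivers "a" and "b".
toy = pd.DataFrame(
    {
        "raceYear": [2022, 2022, 2022, 2022, 2022, 2022, 2023],
        "driverRef": ["a", "b", "a", "b", "a", "b", "a"],
        "raceRound": [1, 1, 2, 2, 3, 3, 1],
        "positionFinal": [1, 2, 3, 1, 1, 2, 2],
    }
)
toy = toy.sort_values(["raceYear", "driverRef", "raceRound"]).reset_index(drop=True)

# 1 for a win, 0 otherwise.
won = toy["positionFinal"].eq(1).astype(int)

# Running total of wins, restarted for every driver and season.
toy["driverWins"] = won.groupby([toy["raceYear"], toy["driverRef"]]).cumsum()
```

Driver "a" ends 2022 with two wins and starts 2023 at zero again, which matches the per-season definition.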

Note that this attribute is derived from the driver's final position: when a driver finishes first, the count increases by one in that same race. This is not desired, since it means we would be using information from the future. Therefore, within each driver's season the values are shifted by one race: the last value is removed and a zero is added as the first one.


In [13]:
years = df["raceYear"].drop_duplicates().to_numpy()
drivers = df["driverRef"].drop_duplicates().to_numpy()

for year in years:
    for driver in drivers:
        mask = (df["raceYear"] == year) & (df["driverRef"] == driver)
        races = df.loc[mask, "driverWins"].iloc[:-1]
        races.loc[-1] = 0
        races.index += 1
        races.sort_index(inplace=True)
        races = races.to_numpy()
        df.loc[mask, "driverWins"] = races
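The shift performed by the nested loop can also be expressed with a grouped `shift`. The sketch below uses invented values and a hypothetical output column named `driverWinsBefore`; it assumes the rows are already sorted by round within each season.

```python
import pandas as pd

# Toy season data for a hypothetical driver "a", sorted by round.
toy = pd.DataFrame(
    {
        "raceYear": [2022, 2022, 2022, 2023, 2023],
        "driverRef": ["a", "a", "a", "a", "a"],
        "raceRound": [1, 2, 3, 1, 2],
        "driverWins": [1.0, 1.0, 2.0, 0.0, 1.0],
    }
).sort_values(["raceYear", "driverRef", "raceRound"])

# Move each value one race forward inside its driver-season group.
# The first race of every group has no predecessor, so it becomes 0.
toy["driverWinsBefore"] = (
    toy.groupby(["raceYear", "driverRef"])["driverWins"].shift(1).fillna(0)
)
```

Each row now holds the number of wins known before the race started, so the result of the race itself no longer leaks into the feature.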

## Constructor wins


The three constructor attributes with nulls are a little more difficult to calculate and, since there are few nulls, we will first study them manually.


In [14]:
projection = [
    "raceYear",
    "raceRound",
    "constructorRef",
    "constructorWins",
    "pointsConstructorEarned",
    "constructorPosition",
    "driverRef",
    "driverStatus",
]

nulls = df[df["pointsConstructorEarned"].isnull()]
nulls.sort_values(by=["raceYear", "raceRound", "constructorRef"])[projection]

Unnamed: 0,raceYear,raceRound,constructorRef,constructorWins,pointsConstructorEarned,constructorPosition,driverRef,driverStatus
864,2008,1,force_india,,,,fisichella,Collision
1066,2008,1,force_india,,,,sutil,Hydraulics
788,2008,1,honda,,,,barrichello,Disqualified
824,2008,1,honda,,,,button,Collision
842,2008,1,red_bull,,,,coulthard,Collision
1120,2008,1,red_bull,,,,webber,Collision
860,2008,1,super_aguri,,,,davidson,Collision
1062,2008,1,super_aguri,,,,sato,Transmission
882,2008,1,toyota,,,,glock,Accident
1084,2008,1,toyota,,,,trulli,Electrical


These constructors have no wins or points because the rows correspond to the first race of a season in which neither of the constructor's two drivers was able to finish. Therefore, these two attributes will be filled with zeros.

As for the position, each constructor is placed one position behind the maximum constructor position, computed per race.


In [15]:
mask = ["constructorWins", "pointsConstructorEarned"]
df[mask] = df[mask].fillna(0)

last_ctor_by_race = (
    df[["raceYear", "raceRound", "constructorPosition"]]
    .groupby(["raceYear", "raceRound"])
    .agg({"constructorPosition": lambda x: max(x) + 1})
)

mask = df["constructorPosition"].isnull()

mg = df[mask]
mg = mg.join(last_ctor_by_race, how="left", on=["raceYear", "raceRound"], rsuffix="Max")
df.loc[mask, "constructorPosition"] = mg["constructorPositionMax"]

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber               1724
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                    0
driverStatus                  0
driverWins                    0
pointsConstructorEarned       0
constructorPosition           0
constructorWins               0
q1                            0
q2      

Note that this attribute is also derived from the final position of the drivers involved, i.e., the same leakage occurs as in the previous section, so we proceed in a similar way.


In [16]:
years = df["raceYear"].drop_duplicates().to_numpy()
drivers = df["driverRef"].drop_duplicates().to_numpy()

for year in years:
    for driver in drivers:
        mask = (df["raceYear"] == year) & (df["driverRef"] == driver)
        races = df.loc[mask, "constructorWins"].iloc[:-1]
        races.loc[-1] = 0
        races.index += 1
        races.sort_index(inplace=True)
        races = races.to_numpy()
        df.loc[mask, "constructorWins"] = races

## Driver numbers


Finally, the missing driver numbers remain to be filled in. First, we check which drivers have no number.


In [17]:
df[df["driverNumber"].isnull()]["driverRef"].drop_duplicates().to_numpy()

array(['albers', 'barrichello', 'coulthard', 'doornbos', 'fisichella',
       'heidfeld', 'ide', 'klien', 'liuzzi', 'michael_schumacher',
       'montagny', 'monteiro', 'montoya', 'ralf_schumacher', 'rosa',
       'sato', 'speed', 'trulli', 'villeneuve', 'webber', 'yamamoto',
       'davidson', 'kovalainen', 'markus_winkelhock', 'nakajima', 'wurz',
       'bourdais', 'glock', 'piquet_jr', 'alguersuari', 'badoer', 'buemi',
       'bruno_senna', 'chandhok', 'grassi', 'petrov', 'ambrosio',
       'karthikeyan', 'resta', 'pic', 'garde'], dtype=object)

After a little research, it turns out that these drivers are not only unnumbered in the data set: their numbers cannot be found on the web either. For example, the following link contains a compilation of driver numbers in which none of them appears.

_Source: https://en.wikipedia.org/wiki/List_of_Formula_One_driver_numbers_

For this reason, and bearing in mind that the numbers should not significantly influence the result, each of these drivers is given a random valid number, i.e., one in the range 2 to 99, excluding 17, that is not already in use.


In [18]:
numbers = np.arange(2, 100, dtype=np.float64)
numbers = np.delete(numbers, np.where(numbers == 17.0))
numbers_by_year = {
    year: np.delete(
        numbers,
        np.where(np.isin(numbers, df["driverNumber"].drop_duplicates())),
    ).tolist()
    for year in df["raceYear"].drop_duplicates()
}

drivers = df[df["driverNumber"].isnull()]["driverRef"].drop_duplicates().to_numpy()

for driver in drivers:
    mask = df["driverRef"] == driver
    first_year = df.loc[mask, "raceYear"].head(1).values[0]
    last_year = df.loc[mask, "raceYear"].tail(1).values[0]
    possible_numbers = numbers_by_year[first_year]
    driver_number = np.random.choice(possible_numbers, 1)[0]
    df.loc[mask, "driverNumber"] = driver_number
    while first_year <= last_year:
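        # Convert to an array first: comparing a plain list with a scalar
        # yields a single False, so nothing would be removed.
        ls = np.asarray(numbers_by_year[first_year])
        new_possible_numbers = np.delete(ls, np.where(ls == driver_number))
        numbers_by_year[first_year] = new_possible_numbers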
        first_year += 1

df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber                  0
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                    0
driverStatus                  0
driverWins                    0
pointsConstructorEarned       0
constructorPosition           0
constructorWins               0
q1                            0
q2      
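If tracking the numbers year by year is not needed, a simpler variant is to draw all missing numbers at once without replacement, which guarantees that they are unique. This is only a sketch: the set of used numbers, the driver list, and the seed below are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical inputs: numbers already in use and drivers that need one.
used = {14, 44, 77}
drivers_without_number = ["albers", "ide", "speed"]

# Candidate pool: 2 to 99, without 17 and without the numbers already taken.
pool = [n for n in range(2, 100) if n != 17 and n not in used]

# Sampling without replacement ensures no two drivers get the same number.
drawn = rng.choice(pool, size=len(drivers_without_number), replace=False)
assigned = dict(zip(drivers_without_number, drawn.tolist()))
```

The resulting mapping could then be applied to the data set with `df["driverRef"].map(assigned)` on the rows where `driverNumber` is null.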

## Remaining attributes


Finally, some attributes are only known once the race is over, i.e., they are possible target attributes. As they will not be used for now, they are left unprocessed, and may be revisited in the future if we need to add more attributes to improve the model or change the prediction target.

Possible examples of such processing would be:

- an accumulated history per season, up to the race in question, of podiums or positions
- fastest time per circuit
- max speed per circuit
- fastest time per circuit and driver
- max speed per circuit and driver
- etc.
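As an illustration of one of these ideas, the sketch below computes, for each circuit and driver, the fastest lap time achieved in earlier races only, so that the current race does not leak into the feature. The values and the output column `bestLapBefore` are hypothetical.

```python
import pandas as pd

# Toy data: two hypothetical drivers at the same circuit over several seasons.
toy = pd.DataFrame(
    {
        "raceYear": [2020, 2021, 2022, 2023, 2020, 2021],
        "raceRound": [1, 1, 1, 1, 1, 1],
        "circuitRef": ["x"] * 6,
        "driverRef": ["a", "a", "a", "a", "b", "b"],
        "fastestLapTime": [80.0, 79.0, 81.0, 78.0, 82.0, 81.0],
    }
).sort_values(["raceYear", "raceRound"])

# Shift first so the current race is excluded, then keep the running minimum.
toy["bestLapBefore"] = toy.groupby(["circuitRef", "driverRef"])[
    "fastestLapTime"
].transform(lambda s: s.shift(1).cummin())
```

The first appearance of a driver at a circuit has no history, so the feature is null there and would need its own treatment.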

Finally, we write the results to disk again for use in later sections.


In [19]:
df.to_csv("../assets/data/processed/missing_values.csv", index=False)