# Circuit data


## Dependencies


The dependencies used are as follows:


In [1]:
from sklearn.preprocessing import LabelEncoder, RobustScaler

import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Preprocessing


Next, we enrich the dataset with additional circuit data. The URL of each race is used to scrape some of it, and the rest comes from a compilation of circuits available at the link below.

_Source: https://en.wikipedia.org/wiki/List_of_Formula_One_circuits_


In [2]:
df = pd.read_csv("../assets/data/processed/weather.csv")
circuits_plus_df = pd.read_csv("../assets/data/scraping/circuits_plus.csv")
circuits_plusplus_df = pd.read_csv("../assets/data/scraping/circuits_plusplus.csv")

df = df.merge(circuits_plus_df, how="left", on=["circuitRef"])
df = df.merge(circuits_plusplus_df, how="left", on=["circuitRef"])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7235 entries, 0 to 7234
Data columns (total 46 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   carNumber                7235 non-null   float64
 1   positionGrid             7235 non-null   int64  
 2   positionFinal            7235 non-null   int64  
 3   pointsDriverEarned       7235 non-null   float64
 4   lapsCompleted            7235 non-null   int64  
 5   timeTakenInMillisec      3581 non-null   float64
 6   fastestLap               6914 non-null   float64
 7   fastestLapRank           7130 non-null   float64
 8   fastestLapTime           6914 non-null   float64
 9   maxSpeed                 6914 non-null   float64
 10  driverRef                7235 non-null   object 
 11  driverNumber             7235 non-null   float64
 12  driverNationality        7235 non-null   object 
 13  constructorRef           7235 non-null   object 
 14  constructorNationality  
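A left merge keeps every race row, but it can silently leave nulls when a `circuitRef` has no match, or duplicate rows when the right-hand table repeats a key. The following is a minimal sketch of how that could be checked, using toy frames with made-up values (`races` and `circuits` are hypothetical names, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for the race table and a scraped circuit table (illustrative values).
races = pd.DataFrame(
    {"circuitRef": ["monza", "spa", "indianapolis"], "raceYear": [2020, 2020, 2005]}
)
circuits = pd.DataFrame({"circuitRef": ["monza", "spa"], "circuitLaps": ["53", "44"]})

# validate="many_to_one" raises if circuitRef is duplicated on the right side,
# which would otherwise multiply race rows without warning.
merged = races.merge(
    circuits, how="left", on="circuitRef", validate="many_to_one", indicator=True
)

# Rows marked "left_only" found no match in the circuit table.
unmatched = merged.loc[merged["_merge"] == "left_only", "circuitRef"].tolist()
print(unmatched)
```

An unmatched key is only one possible origin of nulls after the merge; a value that fails numeric conversion is another.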

First, we convert the `circuitLaps` and `circuitDist` columns to numeric types. Values that cannot be parsed become NaN.


In [3]:
numbers = ["circuitLaps", "circuitDist"]
for n in numbers:
    df[n] = pd.to_numeric(df[n], errors="coerce")
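To illustrate what `errors="coerce"` does, here is a small sketch on a made-up series (the values are assumptions, not taken from the scraped files). It also counts how many entries were turned into NaN by the conversion itself:

```python
import pandas as pd

# Made-up raw values: two parsable strings and one that is not a number.
raw = pd.Series(["53", "44", "n/a"])

# Unparsable entries become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

# Nulls introduced by the conversion (not present in the raw data).
n_failed = int(converted.isna().sum() - raw.isna().sum())
print(converted.dtype, n_failed)
```

Counting the introduced nulls this way helps tell scraping gaps apart from parsing failures.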

Now that the datatypes are correct, we check whether any nulls remain.


In [4]:
df.isnull().sum()

carNumber                     0
positionGrid                  0
positionFinal                 0
pointsDriverEarned            0
lapsCompleted                 0
timeTakenInMillisec        3654
fastestLap                  321
fastestLapRank              105
fastestLapTime              321
maxSpeed                    321
driverRef                     0
driverNumber                  0
driverNationality             0
constructorRef                0
constructorNationality        0
raceYear                      0
raceRound                     0
grandPrix                     0
raceTime                      0
circuitRef                    0
circuitLocation               0
circuitCountry                0
circuitLat                    0
circuitLng                    0
circuitAlt                    0
driverStatus                  0
driverWins                    0
pointsConstructorEarned       0
constructorPosition           0
constructorWins               0
q1                            0
q2      
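With many columns, the full null count is long to read. As a possible shortcut, the count can be filtered to the columns that actually contain nulls. The sketch below uses a toy frame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" is complete, "b" and "c" contain nulls.
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1.0, np.nan, np.nan], "c": ["x", None, "z"]}
)

null_counts = toy.isnull().sum()

# Keep only the columns with at least one null, largest count first.
with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(with_nulls)
```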

In [5]:
# Circuits whose lap count is still missing after the conversion
mask = df["circuitLaps"].isnull()
projection = ["circuitRef"]

df.loc[mask, projection].drop_duplicates()

       circuitRef
198  indianapolis


Since only one circuit (`indianapolis`) has nulls, we fill it in manually. The Spanish version of the circuit's Wikipedia page provides the missing data: 73 laps and 306.041 km.

_Source: https://es.wikipedia.org/wiki/Indianapolis_Motor_Speedway_


In [6]:
df.loc[df["circuitLaps"].isnull(), "circuitLaps"] = np.float64(73)
df.loc[df["circuitDist"].isnull(), "circuitDist"] = np.float64(306.041)
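The assignment above fills every remaining null, which is fine while a single circuit is affected. If more circuits were missing, a fill keyed on `circuitRef` would be safer. This is a minimal sketch on a toy frame; the `manual` dictionary is a hypothetical helper, and only the Indianapolis figures come from the text above:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative rows; only indianapolis has missing values.
toy = pd.DataFrame(
    {
        "circuitRef": ["monza", "indianapolis", "indianapolis"],
        "circuitLaps": [53.0, np.nan, np.nan],
        "circuitDist": [306.72, np.nan, np.nan],
    }
)

# Manual corrections per circuit (hypothetical lookup structure).
manual = {"indianapolis": {"circuitLaps": 73.0, "circuitDist": 306.041}}

for ref, values in manual.items():
    for col, value in values.items():
        # Fill only the nulls that belong to this circuit.
        mask = (toy["circuitRef"] == ref) & toy[col].isnull()
        toy.loc[mask, col] = value
```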

## Encoding and normalization


Once the data is preprocessed, we encode and normalize it again. In addition, we drop the columns added in the previous section (the weather features) so that the effect of the circuit data can be compared on its own.


In [7]:
X = df.drop(
    [
        "positionFinal",
        "pointsDriverEarned",
        "lapsCompleted",
        "timeTakenInMillisec",
        "fastestLap",
        "fastestLapRank",
        "fastestLapTime",
        "maxSpeed",
        "driverStatus",
        "pointsConstructorEarned",
        "constructorPosition",
        "weather",
        "weatherWarm",
        "weatherCold",
        "weatherDry",
        "weatherWet",
        "weatherCloudy",
    ],
    axis=1,
)

enc = LabelEncoder()
for c in X.columns:
    if X[c].dtype == "object":
        X[c] = enc.fit_transform(X[c])


scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)
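To make the two steps concrete, here is a small sketch on a toy frame with made-up values. It differs from the cell above in one respect, which is an assumption about what might be useful later: it keeps one `LabelEncoder` per column, so the integer codes can be mapped back to the original labels. `RobustScaler` subtracts the median and divides by the interquartile range, so the outlier in `positionGrid` does not dominate the scale:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Toy frame: one categorical column and one numeric column with an outlier.
toy = pd.DataFrame(
    {
        "driverRef": ["alonso", "hamilton", "alonso", "vettel"],
        "positionGrid": [1, 2, 3, 100],
    }
)

# One encoder per non-numeric column, stored so the mapping can be inverted.
encoders = {}
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        encoders[col] = LabelEncoder()
        toy[col] = encoders[col].fit_transform(toy[col])

# Center each column on its median and scale by its interquartile range.
scaled = pd.DataFrame(
    RobustScaler().fit_transform(toy), index=toy.index, columns=toy.columns
)

# Recover the original labels from the integer codes.
labels = encoders["driverRef"].inverse_transform(toy["driverRef"])
print(list(labels))
```

Label encoding assigns integers in sorted label order, which implies an ordering that the categories may not have; whether that matters depends on the model used in the following sections.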

Finally, we write both dataframes to disk for use in the following sections.


In [8]:
df.to_csv("../assets/data/processed/circuit.csv", index=False)
X.to_csv("../assets/data/processed/circuit_X.csv", index=False)