# Weather data


## Dependencies


The dependencies used are as follows:


In [1]:
from sklearn.preprocessing import LabelEncoder, RobustScaler

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Preprocessing


To extract the weather, we use the Wikipedia URL provided for each race, in an external script that can be found in the GitHub repository. Its code is omitted here since it is not relevant to improving the model itself.


In [2]:
df = pd.read_csv("../assets/data/processed/base_model.csv")
weather_df = pd.read_csv("../assets/data/scraping/weather.csv")
weather_df.rename(columns={"year": "raceYear", "round": "raceRound"}, inplace=True)

df = df.merge(weather_df, how="left", on=["raceYear", "raceRound"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7235 entries, 0 to 7234
Data columns (total 41 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   carNumber                7235 non-null   float64
 1   positionGrid             7235 non-null   int64  
 2   positionFinal            7235 non-null   int64  
 3   pointsDriverEarned       7235 non-null   float64
 4   lapsCompleted            7235 non-null   int64  
 5   timeTakenInMillisec      3581 non-null   float64
 6   fastestLap               6914 non-null   float64
 7   fastestLapRank           7130 non-null   float64
 8   fastestLapTime           6914 non-null   float64
 9   maxSpeed                 6914 non-null   float64
 10  driverRef                7235 non-null   object 
 11  driverNumber             7235 non-null   float64
 12  driverNationality        7235 non-null   object 
 13  constructorRef           7235 non-null   object 
 14  constructorNationality  
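A left merge keeps every row of the left table, so races without a weather record end up with `NaN` rather than raising an error. The following is a minimal sketch of how that could be checked, not the notebook's actual code: the toy frames and their values are made up, and `validate` and `indicator` are optional pandas `merge` arguments that the cell above does not use.

```python
import pandas as pd

# Toy stand-ins for the results and weather tables (values are made up).
results = pd.DataFrame(
    {
        "raceYear": [2006, 2006, 2007],
        "raceRound": [5, 5, 1],
        "driverRef": ["alonso", "massa", "raikkonen"],
    }
)
weather = pd.DataFrame(
    {
        "raceYear": [2006],
        "raceRound": [5],
        "weather": ["not found"],
    }
)

merged = results.merge(
    weather,
    how="left",
    on=["raceYear", "raceRound"],
    validate="many_to_one",  # raises MergeError if a race appears twice in `weather`
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Rows that found no weather record at all (their weather is NaN).
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["raceYear", "raceRound"]].drop_duplicates())

# A many-to-one left merge must not change the number of rows.
assert len(merged) == len(results)
```

With `validate="many_to_one"`, a duplicated race in the weather table fails loudly instead of silently duplicating result rows.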

Since the datatypes are already correct, we now check for missing weather values, which are marked as "not found" in the `weather` column.


In [3]:
mask = df["weather"] == "not found"
projection = ["raceYear", "raceRound", "grandPrix"]

df.loc[mask, projection].drop_duplicates()

    raceYear  raceRound            grandPrix
88      2006          5  European Grand Prix


Since only one race is missing its weather, we fix it manually. The Italian version of the race's Wikipedia page does report the weather, which was sunny.

_Source: https://it.wikipedia.org/wiki/Gran_Premio_d'Europa_2006_


In [4]:
df.loc[mask, "weatherWarm"] = 1
df.loc[mask, "weather"] = "Sunny"
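After a manual fix like this one, it can be worth confirming that no missing weather remains, whether as the "not found" marker or as a real `NaN`. The following is a minimal sketch on a toy frame with made-up values. The helper `missing_weather` and the constant `SENTINEL` are hypothetical names introduced for illustration; only the `weather` and `weatherWarm` columns come from the cells above.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "raceYear": [2006, 2006, 2006],
        "raceRound": [4, 5, 5],
        "weather": ["Rainy", "not found", "not found"],
        "weatherWarm": [0, 0, 0],
    }
)

SENTINEL = "not found"  # marker used in the weather column


def missing_weather(frame):
    """Boolean mask of rows whose weather is the sentinel or a real NaN."""
    return frame["weather"].eq(SENTINEL) | frame["weather"].isna()


# Compute the mask once, before overwriting the column it depends on.
mask = missing_weather(toy)
toy.loc[mask, "weatherWarm"] = 1
toy.loc[mask, "weather"] = "Sunny"

# Count the rows that are still missing after the fix.
remaining = int(missing_weather(toy).sum())
print(remaining)
```

The mask is computed before either assignment, so overwriting `weather` does not change which rows receive the `weatherWarm` flag.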

## Encoding and normalization

With preprocessing done, we will see how the models obtained in the previous section perform with these new attributes. Note that the data has to be re-encoded and re-normalized first.


In [5]:
X = df.drop(
    [
        "positionFinal",
        "pointsDriverEarned",
        "lapsCompleted",
        "timeTakenInMillisec",
        "fastestLap",
        "fastestLapRank",
        "fastestLapTime",
        "maxSpeed",
        "driverStatus",
        "pointsConstructorEarned",
        "constructorPosition",
    ],
    axis=1,
)

enc = LabelEncoder()
for c in X.columns:
    if X[c].dtype == "object":
        X[c] = enc.fit_transform(X[c])

scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)
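Reusing a single `LabelEncoder` across columns works for encoding, but after the loop it only remembers the classes of the last column it was fitted on. If the original labels are needed later, one option is to keep one encoder per column. The following is a minimal sketch on a toy frame with made-up values, not the notebook's actual pipeline: the `encoders` dictionary and the explicit `categorical` list are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Toy frame with made-up rows; column names mirror the notebook.
toy = pd.DataFrame(
    {
        "driverRef": ["alonso", "massa", "alonso", "button"],
        "weather": ["Sunny", "Rainy", "Sunny", "Sunny"],
        "positionGrid": [1, 2, 3, 20],
    }
)

# Assumption: the categorical columns are listed explicitly.
categorical = ["driverRef", "weather"]

# One fitted encoder per column, so each mapping can be inverted later.
encoders = {}
for col in categorical:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# RobustScaler subtracts the median and divides by the interquartile range,
# which makes it less sensitive to outliers such as the grid position 20.
scaler = RobustScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), index=toy.index, columns=toy.columns)

# Recover the original driver labels from the integer codes.
drivers = encoders["driverRef"].inverse_transform(toy["driverRef"])
print(list(drivers))
```

After scaling, each column has a median of zero, which is the centering that `RobustScaler` applies by default.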

Finally, we write both dataframes to disk for use in the following sections.


In [6]:
df.to_csv("../assets/data/processed/weather.csv", index=False)
X.to_csv("../assets/data/processed/weather_X.csv", index=False)