# Model validation


## Dependencies


The dependencies used are as follows:


In [1]:
from sklearn.preprocessing import LabelEncoder, RobustScaler

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Improvements


The final model combines everything done in the previous sections. Specifically:

- all the newly added data,
- the manual outlier treatment, and
- all the models used, including the last three related to XGBoost.

To do this, we start from the data in adding_data.csv and remove the outliers again. Note that the "qid" column will always be created immediately before the ranker model: if it is created before normalizing, it causes problems in that model, and if it is created after normalizing, it affects other models such as KNN and MLP.
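As an illustration of creating "qid" only at the last moment, the following is a minimal sketch rather than the code used later. It assumes that the query groups are races and that a race identifier is available as a Series; the helper `add_qid` and the toy names `X_demo` and `race_demo` are hypothetical. It works on a copy, so the shared normalized features stay free of "qid".

```python
import pandas as pd


def add_qid(X_scaled: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
    """Return a copy of X_scaled with an integer "qid" column, sorted by it.

    `groups` is assumed to hold one group label per row (for example a race
    identifier), indexed like X_scaled. This helper is hypothetical.
    """
    X_rank = X_scaled.copy()  # leave the shared, normalized features untouched
    # factorize maps each distinct label to an integer code 0, 1, 2, ...
    codes, _ = pd.factorize(groups.loc[X_rank.index])
    X_rank["qid"] = codes
    # Rankers typically expect the rows of one query to be contiguous.
    return X_rank.sort_values("qid", kind="stable")


# Toy data: two races with two drivers each (values are made up).
X_demo = pd.DataFrame({"grid": [0.5, -0.5, 1.0, -1.0]}, index=[10, 11, 12, 13])
race_demo = pd.Series(["2021_01", "2021_02", "2021_01", "2021_02"], index=X_demo.index)

X_rank_demo = add_qid(X_demo, race_demo)
print(X_rank_demo)
```

Because "qid" is added after scaling and only to the copy, it is never normalized, and models such as KNN and MLP never see it as a feature.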


## Manual outlier treatment


Here we only replicate the outlier treatment from the previous section.


In [2]:
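df = pd.read_csv("../assets/data/processed/adding_data.csv")
# Keep only rows with a final position of 20 or better.
df = df[df["positionFinal"] <= 20]
# Keep only drivers whose status is "Finished" or contains "Lap".
df = df[(df["driverStatus"] == "Finished") | (df["driverStatus"].str.contains("Lap"))]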
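To make the effect of the two filters concrete, here is a small sketch on made-up rows. The status strings are illustrative assumptions, not values taken from adding_data.csv, and `na=False` is an addition of this sketch so that a missing status counts as "no match".

```python
import pandas as pd

# Toy rows (made up) covering each case the filters distinguish.
toy = pd.DataFrame(
    {
        "positionFinal": [1, 2, 21, 5, 7],
        "driverStatus": ["Finished", "+1 Lap", "Finished", "Engine", "+2 Laps"],
    }
)

in_range = toy["positionFinal"] <= 20
classified = (toy["driverStatus"] == "Finished") | toy["driverStatus"].str.contains(
    "Lap", na=False  # assumption: treat a missing status as not matching
)

kept = toy[in_range & classified]
print(kept)
```

Combining both masks in a single step selects the same rows as applying the two filters one after the other.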

## Encoding and normalization


Next, we re-encode the categorical columns and re-normalize the features.


In [3]:
# Drop the columns that will not be used as model features.
X = df.drop(
    [
        "positionFinal",
        "pointsDriverEarned",
        "lapsCompleted",
        "timeTakenInMillisec",
        "fastestLap",
        "fastestLapRank",
        "fastestLapTime",
        "maxSpeed",
        "driverStatus",
        "pointsConstructorEarned",
        "constructorPosition",
    ],
    axis=1,
)

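# Label-encode every categorical (object) column; fit_transform refits the encoder for each column.
enc = LabelEncoder()
for c in X.columns:
    if X[c].dtype == "object":
        X[c] = enc.fit_transform(X[c])

# Scale with RobustScaler (median and interquartile range), keeping the index and column names.
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)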
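As a sanity check on what these two steps compute, the sketch below repeats them on a toy frame (made-up values, hypothetical column names) and compares one scaled column with a manual calculation. With its default settings, RobustScaler subtracts the median and divides by the interquartile range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Toy frame (made up): one categorical and one numeric column.
toy = pd.DataFrame(
    {
        "constructor": ["ferrari", "mclaren", "ferrari", "williams"],
        "grid": [1.0, 2.0, 3.0, 20.0],
    }
)

encoded = toy.copy()
for col in encoded.columns:
    # Encode every non-numeric column as integer codes (classes sorted alphabetically).
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        encoded[col] = LabelEncoder().fit_transform(encoded[col])

scaled = pd.DataFrame(
    RobustScaler().fit_transform(encoded), index=encoded.index, columns=encoded.columns
)

# Manual equivalent for one column: (x - median) / (Q3 - Q1).
grid = encoded["grid"]
q1, median, q3 = np.percentile(grid, [25, 50, 75])
manual = (grid - median) / (q3 - q1)

print(scaled)
print(np.allclose(scaled["grid"], manual))
```

The sketch tests for non-numeric columns with `pd.api.types.is_numeric_dtype` instead of comparing the dtype with "object"; this is a choice made here so that string columns are also caught when pandas stores them with a dedicated string dtype.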

Finally, we save the results for use in later sections.


In [4]:
df.to_csv("../assets/data/processed/final_model.csv", index=False)
X.to_csv("../assets/data/processed/final_model_X.csv", index=False)