# Manual treatment


## Dependencies


The dependencies used are as follows:


In [1]:
from sklearn.preprocessing import LabelEncoder, RobustScaler

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Preprocessing


In this step we remove the instances that, at first sight, are likely to introduce noise into the model:

- Rows with a final position above 20, since most seasons from 2006 onward have 10 teams, which translates into 20 drivers.
- Drivers who were not classified at the finish, i.e., those who had an accident, a car problem, etc., since such events cannot be predicted.


In [2]:
df = pd.read_csv("../assets/data/processed/base_model.csv")

In [3]:
df = df[df["positionFinal"] <= 20]

In [4]:
df = df[(df["driverStatus"] == "Finished") | (df["driverStatus"].str.contains("Lap"))]
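To make the effect of the two filters above concrete, here is a minimal sketch on a toy frame. The rows and the status strings (`"Engine"`, `"Collision"`, `"+1 Lap"`) are assumptions chosen for illustration and are not taken from `base_model.csv`; `na=False` is an optional safeguard that the original cells do not use.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not from the real dataset.
toy = pd.DataFrame(
    {
        "positionFinal": [1, 5, 12, 21, 18],
        "driverStatus": ["Finished", "+1 Lap", "Engine", "Finished", "Collision"],
    }
)

# Keep only rows inside a 20-car field.
in_field = toy["positionFinal"] <= 20

# Keep drivers who finished or were classified one or more laps down.
# na=False makes a missing status count as "no match".
classified = (toy["driverStatus"] == "Finished") | toy["driverStatus"].str.contains(
    "Lap", na=False
)

filtered = toy[in_field & classified]
print(filtered)
```

Only the first two rows survive: the third and fifth are dropped for their status, and the fourth for its position.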

Note that there is no need to check either data types or nulls, as both were already taken care of by the previous preprocessing.
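Should that assumption ever need to be verified, a quick check could look like the sketch below. `summarize_quality` is a hypothetical helper, not part of this project, and the toy frame deliberately contains one missing value so that the report has something to show.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the number of nulls for each column (hypothetical helper)."""
    return pd.DataFrame(
        {"dtype": frame.dtypes.astype(str), "nulls": frame.isna().sum()}
    )


# Hypothetical toy data with one missing status.
toy = pd.DataFrame(
    {
        "positionFinal": [1, 2, 3],
        "driverStatus": ["Finished", "+1 Lap", None],
    }
)

report = summarize_quality(toy)
print(report)
```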


## Encoding and normalization


Once the data is filtered, we re-encode and re-normalize it. In addition, we drop the columns that were previously added for comparison purposes.


In [5]:
X = df.drop(
    [
        "positionFinal",
        "pointsDriverEarned",
        "lapsCompleted",
        "timeTakenInMillisec",
        "fastestLap",
        "fastestLapRank",
        "fastestLapTime",
        "maxSpeed",
        "driverStatus",
        "pointsConstructorEarned",
        "constructorPosition",
    ],
    axis=1,
)

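# Label-encode every categorical (object) column; fit_transform refits the encoder for each column
enc = LabelEncoder()
for c in X.columns:
    if X[c].dtype == "object":
        X[c] = enc.fit_transform(X[c])

# Scale with RobustScaler, which centers on the median and scales by the interquartile range
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)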
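The sketch below illustrates what the encoding and scaling steps do on a small example. The column names `constructor` and `gridPosition` and all values are assumptions made up for illustration. With its default settings, `LabelEncoder` assigns integer codes following the sorted order of the classes, and `RobustScaler` subtracts the median and divides by the interquartile range, so a single extreme value does not distort the scale of the remaining rows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Hypothetical toy features; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "constructor": ["Ferrari", "McLaren", "Ferrari", "Williams", "McLaren"],
        "gridPosition": [1, 2, 3, 4, 100],  # 100 is a deliberate outlier
    }
)

# Encode the categorical column; classes are sorted alphabetically before coding.
encoder = LabelEncoder()
toy["constructor"] = encoder.fit_transform(toy["constructor"])

# gridPosition has median 3 and interquartile range 4 - 2 = 2,
# so each value becomes (value - 3) / 2.
scaler = RobustScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), index=toy.index, columns=toy.columns)

print(list(encoder.classes_))
print(scaled)
```

Keeping a separate encoder per column (for example in a dictionary) would make it possible to invert the codes later; the loop above reuses a single encoder, which is enough when the mapping is not needed again.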

Finally, we write both dataframes to disk for use in the following sections.


In [6]:
df.to_csv("../assets/data/processed/outliers.csv", index=False)
X.to_csv("../assets/data/processed/outliers_X.csv", index=False)