# Model validation


## Dependencies


The following dependencies are used:


In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import warnings

warnings.simplefilter("ignore")

## Other models

To keep improving on the base model, we test three other models:

- XGBClassifier
- XGBRegressor
- XGBRanker

We pay special attention to XGBRanker, since it behaves quite differently from the other two.


## Ranker model

XGBRanker needs additional grouping information, given in one of two forms: "group", the number of instances in each group, or "qid", a list giving the group of each instance. The latter can also be added as a column of the dataframe instead of being passed explicitly, which is what we do here. Each group corresponds to a single race.
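To make the difference between the two forms concrete, here is a minimal sketch that converts one into the other. The query ids are hypothetical toy values, and the snippet uses only NumPy; it does not call XGBoost.

```python
import numpy as np

# Hypothetical query ids for 7 rows, sorted so that each group is contiguous.
qid = np.array([1, 1, 1, 2, 2, 3, 3])

# "group" form: one entry per group, holding the number of rows in that group.
_, group = np.unique(qid, return_counts=True)

# Going back: repeat each group index by its size to recover a per-row id.
qid_from_group = np.repeat(np.arange(1, len(group) + 1), group)

print(group)           # group sizes
print(qid_from_group)  # per-row ids rebuilt from the sizes
```

Both forms carry the same information as long as the rows of each group are stored contiguously.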


In [2]:
df = pd.read_csv("../assets/data/processed/base_model.csv")
X = pd.read_csv("../assets/data/processed/base_model_X.csv")

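# Cumulative number of races held before each season (seasons sorted by year).
races_per_year = np.cumsum([0] + df.groupby("raceYear")["raceRound"].max().to_list())
# Global race id: round within the season plus all races from earlier seasons.
# The 2006 offset assumes the data starts in 2006 and has no missing seasons.
set_id = lambda y, r: r + (races_per_year[y - 2006])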

df["qid"] = df.apply(lambda x: set_id(x["raceYear"], x["raceRound"]), axis=1)
X["qid"] = df["qid"]
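The id scheme in the cell above can be checked on a small example. The sketch below uses a hypothetical toy dataframe (two seasons, two rows per race) and a vectorized variant that looks up each season's offset by year instead of by position, so it does not depend on the first season being 2006. It is an alternative formulation for illustration, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 3 rounds in 2006, 2 rounds in 2007, two rows per race.
toy = pd.DataFrame({
    "raceYear":  [2006, 2006, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007],
    "raceRound": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2],
})

# Rounds per season, then the number of races held before each season starts.
rounds_per_year = toy.groupby("raceYear")["raceRound"].max()
offsets = rounds_per_year.cumsum().shift(fill_value=0)

# Race id = round within the season + races from all earlier seasons.
toy["qid"] = toy["raceRound"] + toy["raceYear"].map(offsets)

print(toy)
```

Every row of the same race shares one id, and ids keep increasing across seasons.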

Note that qid is added to X without normalization. None is needed, since the XGBoost models we use are tree-based and therefore insensitive to feature scale.
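The reason tree-based models do not need normalized inputs can be illustrated with a single threshold split. The sketch below is a plain NumPy decision stump on hypothetical toy data; `best_split_partition` is an illustrative helper, not XGBoost's split finder. It shows that standardizing a feature leaves the chosen partition of the rows unchanged.

```python
import numpy as np

def best_split_partition(x, y):
    """Return a boolean mask of the rows sent left by the best threshold split."""
    order = np.argsort(x, kind="stable")
    best_err, best_mask = np.inf, None
    for i in range(1, len(x)):
        left, right = order[:i], order[i:]
        # A threshold cannot separate two rows with identical feature values.
        if x[left[-1]] == x[right[0]]:
            continue
        # Squared error when each side predicts its own mean.
        err = ((y[left] - y[left].mean()) ** 2).sum()
        err += ((y[right] - y[right].mean()) ** 2).sum()
        if err < best_err:
            best_err = err
            best_mask = np.zeros(len(x), dtype=bool)
            best_mask[left] = True
    return best_mask

# Hypothetical feature and target values.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.1, 0.2, 0.1, 0.9, 1.0, 0.8])

# Standardize the feature (zero mean, unit variance).
x_scaled = (x - x.mean()) / x.std()

mask_raw = best_split_partition(x, y)
mask_scaled = best_split_partition(x_scaled, y)
print(np.array_equal(mask_raw, mask_scaled))
```

Only the ordering of the feature values matters for where the split falls, and standardization preserves that ordering.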

Finally, we write the results.


In [3]:
df.to_csv("../assets/data/processed/other_models.csv", index=False)
X.to_csv("../assets/data/processed/other_models_X.csv", index=False)