# Assignment

**Goal**: Use the packages presented earlier (pydantic, pandera and optuna) in a realistic scenario: training a model and finding its best hyperparameters.

Using an open-source insurance dataset, we will go through the following steps:
* Validate the dataset with pandera to make sure it is ready for training a model.
* Validate the input parameters of the Bayesian optimization with pydantic.

**Table of contents**:
1. [Validate the dataset using pandera](#1-validate-the-dataset-using-pandera)
2. [Validate the parameters using pydantic](#2-validate-the-parameters-using-pydantic)
3. [Bonus](#3-bonus)

# Imports

In [None]:
from pathlib import Path
import sys
sys.path.insert(0, str(Path.cwd().parent))  # adjust .parent depth so 'src' is findable

## Packages

In [None]:
from typing import Any

import pandas as pd
import pandera.pandas as pa
from sklearn.metrics import root_mean_squared_error
from pydantic import (
    BaseModel,
    Field,
    StrictStr,
    field_validator,
)

In [None]:
from src.train_utils import (
    load_conf_parameters,
    retrieve_data_w_features,
    run_bayesian_optimization,
)

## Options

In [None]:
DATA_PATH = "../data/01_raw"

In [None]:
pd.options.display.max_columns = 150

## Dataset

In [None]:
df = pd.read_parquet(f"{DATA_PATH}/fremotor1prem0304.parquet")

In [None]:
df.head(5)

# 1. Validate the dataset using pandera

**Goal**: The goal of this section is to validate the dataframe we will use to train our model.

Before training the model, make sure the columns satisfy the following rules:
* **Year**: The year is between 2003 and 2004 (inclusive).
* **DrivAge**: The driver's age is plausible (e.g. between 18 and 100).
* **DrivGender**: The gender is either "M" or "F".
* **MaritalStatus**: Possible values are "Cohabiting", "Married", "Single", "Widowed" or "Divorced".
* **BonusMalus**: The value of the bonus / malus is at least 50.
* **LicenceNb**: The licence number is at least 1.
* **JobCode**: Possible values are "Private employee", "Public employee", "Retiree", "Other", "Craftsman", "Farmer" or "Retailer".
* **VehAge**: The vehicle age is plausible (e.g. non-negative).
* **VehGas**: Either "Regular" or "Diesel".
* **Area**: Possible values range from A1 to A12 inclusive.

**Exercise**: Update the schema to check the rules defined above.

In [None]:
schema_df = pa.DataFrameSchema({
    "IDpol": pa.Column(str),
    "PayFreq": pa.Column(str, checks=pa.Check.isin(["Annual", "Half-yearly", "Quarterly", "Monthly"])),
    "VehClass": pa.Column(str, checks=pa.Check.isin([
        "Cheapest", "Cheaper", "Cheap", "Medium low", "Medium", "Medium high", "Expensive", "More expensive", "Most expensive",
    ])),
    "VehPower": pa.Column(str, checks=pa.Check.isin([f"P{i}" for i in range(1, 21)])),
    "VehUsage": pa.Column(str, checks=pa.Check.isin([
        "Private+trip to office", "Professional", "Professional run",
    ])),
    "Garage": pa.Column(str, checks=pa.Check.isin([
        "Closed zbox", "Closed collective parking", "Opened collective parking", "Street",
    ])),
    ######################
    ### YOUR CODE HERE ###
    ######################
})

df_validated = schema_df.validate(df)

However, as you can see, some features contain NaN values. There are several ways to handle this:
* In the pandera schema, allow NaN values by setting `nullable=True` on the columns. We don't recommend this approach: many models can't handle NaN values, and those that can require you to handle them deliberately.
* In the pandera schema, set the option to drop rows with NaN values (`drop_invalid_rows=True`, used together with `validate(..., lazy=True)`).
* Handle them with classical feature engineering techniques (imputation, creating a new category, etc.).
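
The last option can be sketched with pandas alone. The function below is a hypothetical helper, shown on a made-up toy frame: it fills numeric columns with their median and all other columns with their most frequent value. It is one possible strategy among many, not the expected answer.

```python
import pandas as pd


def impute_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where NaN are replaced by the median (numeric columns)
    or the most frequent value (all other columns)."""
    df_out = df.copy()
    for col in df_out.columns:
        if not df_out[col].isna().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(df_out[col]):
            df_out[col] = df_out[col].fillna(df_out[col].median())
        else:
            # mode() returns the most frequent values; [0] takes the first one
            df_out[col] = df_out[col].fillna(df_out[col].mode()[0])
    return df_out


# Hypothetical toy data, not the insurance dataset.
toy_df = pd.DataFrame({
    "age": [30.0, None, 50.0],
    "fuel": ["Diesel", "Diesel", None],
})
toy_imputed = impute_missing_values(toy_df)
```

In a real pipeline you would usually compute the fill values on the train set only and reuse them on the other sets, so that no information leaks from validation or test data.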

**Exercise**: Create a function to handle missing values.

<details>

<summary>Click to reveal tip</summary>

Use `mode()[0]` on a column to get its most frequent category.

</details>

In [None]:
######################
### YOUR CODE HERE ###
######################

To train the model, it is important to split the dataset into train, validation and test sets. You want to make sure each row is assigned to one and only one of the sets. To do so, you can use custom checks from pandera.

**Exercise**: Adapt the schema to check that each row is in one and only one set.
1. Add the right checks for the columns `train_set`, `val_set` and `test_set`.
2. Add a schema check to make sure that each row is assigned to one and only one set.

<details>

<summary>Click to reveal tip</summary>

Use the `checks` parameter of the schema directly.

</details>

In [None]:
######################
### YOUR CODE HERE ###
######################

We also have the `big_train_set` column which indicates rows contained in the train or validation sets.

**Exercise**: Adapt the schema to check that rows in either train or validation sets are in the big train set.

In [None]:
######################
### YOUR CODE HERE ###
######################

# 2. Validate the parameters using pydantic

**Goal**: The goal of this section is to show a realistic use of pydantic validation: checking that the parameters passed to your code are ones you accept.
The validation is applied to the parameters used for the optuna optimization.

**Exercise**: Create a pydantic class that validates the parameters read from a conf file, so that rule violations are caught before your code runs:
1. Define the attribute `search_space`, which represents the search space for the Bayesian optimization.
2. Define the attribute `categorical_feat`, which represents the categorical features.
3. Define the attribute `default_params`, which represents the default combination of hyperparameters to test at the beginning of the Bayesian optimization.
4. Create a custom validator for `search_space`. The search space should be a dict whose keys are parameter names and whose values are dicts containing:
   * `sampling_type`: The type of the parameter. Should be `categorical`, `int` or `float`.
   * If the sampling type is `categorical`, an element `choices` containing the possible values.
   * If the sampling type is either `int` or `float`, elements `min` and `max` containing the minimum and maximum values of the search space.

For example:
```yaml
search_space: {
    "param_1": {
        "sampling_type": "categorical",
        "choices": ["cat_1", "cat_2"],
    },
    "param_2":{
        "sampling_type": "int",
        "min": min_val,
        "max": max_val,
    },
}
```
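
A custom validator for this structure might be sketched as follows with pydantic v2's `field_validator`. The class name `ToySearchConfig` and the exact error messages are hypothetical; the rules mirror the list above, plus an extra `min <= max` check added here as an assumption.

```python
from typing import Any

from pydantic import BaseModel, ValidationError, field_validator


class ToySearchConfig(BaseModel):
    """Hypothetical model that validates only the search space."""

    search_space: dict[str, dict[str, Any]]

    @field_validator("search_space")
    @classmethod
    def check_search_space(cls, value: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
        for name, spec in value.items():
            sampling_type = spec.get("sampling_type")
            if sampling_type == "categorical":
                if not spec.get("choices"):
                    raise ValueError(f"{name}: 'choices' must be a non-empty list")
            elif sampling_type in ("int", "float"):
                if "min" not in spec or "max" not in spec:
                    raise ValueError(f"{name}: 'min' and 'max' are required")
                # Assumption: an empty range is treated as a configuration error.
                if spec["min"] > spec["max"]:
                    raise ValueError(f"{name}: 'min' must not exceed 'max'")
            else:
                raise ValueError(f"{name}: unknown sampling_type {sampling_type!r}")
        return value


valid_config = ToySearchConfig(search_space={
    "param_1": {"sampling_type": "categorical", "choices": ["cat_1", "cat_2"]},
    "param_2": {"sampling_type": "int", "min": 1, "max": 10},
})

try:
    ToySearchConfig(search_space={"param_3": {"sampling_type": "int", "min": 1}})
    missing_max_detected = False
except ValidationError:
    missing_max_detected = True
```

Raising `ValueError` inside the validator lets pydantic collect the message into a `ValidationError`, so the caller sees which parameter is wrong.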

In [None]:
class validate_input_parameters(BaseModel):
    """A pydantic class to validate the input parameters of the process."""
    target_name: StrictStr = Field(
        pattern=r"^[A-Za-z0-9_]+$",
        description="Name of the target column.",
        frozen=True,
    )
    # ========================
    # ==== YOUR CODE HERE ====
    # ========================

In [None]:
params = load_conf_parameters("../conf/parameters.yml")

validated_params = validate_input_parameters(**params)

As you can see, some parameters don't follow the rules you defined in the pydantic class!

**Exercise**: Fix the parameters in the conf file so that they follow the rules you defined.

In [None]:
#####################
### ACTIONS TO DO ###
#####################

Then, we can run the Bayesian optimization:

In [None]:
cols_to_drop = ["IDpol", "Year", "train_set", "val_set", "test_set", "big_train_set"]

X_big_train, y_big_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="big_train_set")
X_train, y_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="train_set")
X_val, y_val = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="val_set")
X_test, y_test = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="test_set")

best_params = run_bayesian_optimization(
    df_train=X_train,
    y_train=y_train,
    df_val=X_val,
    y_val=y_val,
    categorical_features=validated_params.categorical_feat,
    search_params=validated_params.search_space,
    default_params_list=validated_params.default_params,
)

best_params

Now that we have found the best hyperparameters, we can train on the big train set and evaluate on the test set!

**Exercise**: Train a final model on the big train set using the best hyperparameters and evaluate on the test set.

In [None]:
######################
### YOUR CODE HERE ###
######################

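big_train_predictions = ...  # TODO: predictions of the final model on X_big_train
test_predictions = ...  # TODO: predictions of the final model on X_test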

big_train_rmse = root_mean_squared_error(y_true=y_big_train, y_pred=big_train_predictions)
print(f"Big Train RMSE: {big_train_rmse}")
test_rmse = root_mean_squared_error(y_true=y_test, y_pred=test_predictions)
print(f"Test RMSE: {test_rmse}")

# 3. Bonus

You can see that the model overfits!

**Exercise**: Modify the function in the `train_utils.py` file to limit the overfitting of the model.

<details>

<summary>Click to reveal tip 1</summary>

Add a term to the metric being optimized that contains the delta between the train and validation scores.

</details>

<details>

<summary>Click to reveal tip 2</summary>

Use an alpha parameter to control the impact of this delta.

</details>
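
The idea behind the two tips can be written as a small scoring function. This is only a sketch: the name `penalized_score`, the absolute-difference penalty and the default `alpha` are assumptions, and the RMSE values are invented.

```python
def penalized_score(train_rmse: float, val_rmse: float, alpha: float = 0.5) -> float:
    """Validation RMSE plus a penalty proportional to the train/validation gap."""
    gap = abs(val_rmse - train_rmse)
    return val_rmse + alpha * gap


# Two hypothetical candidates with the same validation RMSE:
# the first fits the train set much better than the validation set.
overfit_score = penalized_score(train_rmse=50.0, val_rmse=100.0)
balanced_score = penalized_score(train_rmse=95.0, val_rmse=100.0)
```

With `alpha=0`, the score falls back to the plain validation RMSE. Larger values of `alpha` favor candidates whose train and validation errors are close.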

In [None]:
######################
### YOUR CODE HERE ###
######################