# Preprocessing

In this notebook we encode the categorical features and build our `Pipeline` object.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
import importlib

# Add the parent folder (the project root) to the Python path so `src` can be imported
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

In [3]:
from src.data.load_data import load_data
claims_raw, _ = load_data(raw=True) # returns train and test dataframes

First, let's choose our target variable `y`. We want to try several options, so we create it flexibly:

1. Simply `ClaimNb`.
<br/><br/>
2. `ClaimRate` $= \frac{\text{ClaimNb}}{\text{Exposure}}$.
<br/><br/>
3. `log_ClaimRate` $= \log\left(1 + \frac{\text{ClaimNb}}{\max(\text{Exposure}, \text{threshold})}\right)$. The log accounts for the strong skew, and flooring the exposure at the threshold caps the extreme rates produced by very short exposures. We use 0.01 as the floor, a common choice in insurance pricing because it reduces extreme variance.
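
As a rough illustration of the three definitions above, here is a minimal sketch on a toy frame. `make_target` is a hypothetical helper, not the project's `create_target`, and it assumes the exposure is floored at the threshold (that is, `max(Exposure, threshold)`), which is how the text describes the 0.01 floor.

```python
import numpy as np
import pandas as pd


def make_target(df: pd.DataFrame, target: str, threshold: float = 0.01) -> pd.Series:
    """Hypothetical helper: build one of the three targets described above."""
    if target == "ClaimNb":
        return df["ClaimNb"].rename(target)
    if target == "ClaimRate":
        return (df["ClaimNb"] / df["Exposure"]).rename(target)
    if target == "log_ClaimRate":
        # Floor the exposure so very short policies cannot blow up the rate.
        floored = df["Exposure"].clip(lower=threshold)
        return np.log1p(df["ClaimNb"] / floored).rename(target)
    raise ValueError(f"unknown target: {target!r}")


# Toy data: the last row has an exposure below the 0.01 floor.
toy = pd.DataFrame({"ClaimNb": [0, 1, 2], "Exposure": [1.0, 0.5, 0.001]})
log_rate = make_target(toy, "log_ClaimRate")
```

In the last toy row the raw rate would be 2 / 0.001 = 2000, while the floored version uses 2 / 0.01 = 200 before taking the log.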

In [4]:
from src.data.preprocess import create_target, TargetType
print(TargetType)

typing.Literal['ClaimNb', 'ClaimRate', 'log_ClaimRate']


Now we choose the target variable we want to use:

In [5]:
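# Reload the module to pick up local edits, then re-import so the name
# `create_target` points at the reloaded function rather than the old one.
importlib.reload(sys.modules['src.data.preprocess'])
from src.data.preprocess import create_target

target = "log_ClaimRate"  # <-- choose one of the TargetType options above

y = create_target(claims_raw, target=target)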

Next, the features and the transformations we apply to each:

- `Exposure`: Exposure is only used as a feature when it is not already part of the target (e.g. when the target is `ClaimNb`). We floor it at 0.01 because there are some extreme values, and apply no further transformations.

- `VehPower`: Vehicle power has 11 categories ranging from 4 to 15. We group it into 7 bins (4, 5, 6, 7, 8-9, 10-11, 12+) because higher-power cars are rarer, which makes the feature more monotonic. We then apply ordinal encoding, since the bins have a clear lower/higher ordering.

- `VehAge`: Vehicle age is strongly right-skewed, with the oldest cars reaching 100 years. We try binning it into 8 bins (0-1, 2-3, 4-5, 6-10, 11-15, 16-20, 21-30 and over 30), which distributes the data more evenly, followed by ordinal encoding. We also try using it as a plain numerical feature.

- `DrivAge`: Driver age is close to normally distributed compared with the other features, with only a mild skew (about 0.4). We keep it without transformation.

- `Density`:

- `BonusMalus`:
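
The binning plus ordinal encoding described for `VehPower` and `VehAge` can be sketched with `pd.cut`. This is only an illustration: the bin edges are read off the bins listed above, `bin_to_ordinal` is a hypothetical helper, and the format that `bin_specs` expects in `building_pipeline` is not shown here, so this is not necessarily how the project implements it.

```python
import numpy as np
import pandas as pd

# Assumed edges for right-inclusive intervals, taken from the bins in the text.
VEH_POWER_EDGES = [3, 4, 5, 6, 7, 9, 11, np.inf]       # 4, 5, 6, 7, 8-9, 10-11, 12+
VEH_AGE_EDGES = [-1, 1, 3, 5, 10, 15, 20, 30, np.inf]  # 0-1, 2-3, ..., 21-30, over 30


def bin_to_ordinal(values: pd.Series, edges: list) -> pd.Series:
    """Hypothetical helper: map each value to the integer index of its bin."""
    # labels=False makes pd.cut return the bin position, which is already ordinal.
    return pd.cut(values, bins=edges, labels=False).astype(int)


toy = pd.DataFrame({"VehPower": [4, 8, 9, 15], "VehAge": [0, 2, 30, 100]})
power_codes = bin_to_ordinal(toy["VehPower"], VEH_POWER_EDGES)
age_codes = bin_to_ordinal(toy["VehAge"], VEH_AGE_EDGES)
```

Because the bin index grows with the underlying value, the integer codes preserve the lower/higher ordering without a separate encoder step.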

In [None]:
from src.data.preprocess import building_pipeline

log_features = []

bin_features = ["VehPower", "VehAge"]
bin_specs = {}

cap_features = []

ratio_pairs = []

categorical_features = []

ordinal_features = []

numerical_features = ["DrivAge"]


In [None]:
preprocessor_pipeline = building_pipeline(
    log_features=log_features,
    bin_features=bin_features,
    bin_specs=bin_specs,
    cap_features=cap_features,
    cap_q=.99,
    ratio_pairs=ratio_pairs,
    categorical_features=categorical_features,
    ordinal_features=ordinal_features
    )
preprocessor_pipeline

In [11]:
X_raw = claims_raw.drop(columns=[target])
X = preprocessor_pipeline.fit_transform(X_raw)
X_df = pd.DataFrame(X, columns=preprocessor_pipeline.get_feature_names_out())

X_df.to_parquet("data/processed/claims_processed_train.parquet", index=False)

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'D'
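
The message above is what scikit-learn's `SimpleImputer` raises when a median strategy is applied to string data, so a likely cause is that a string-valued column reaches a numeric imputation step. The sketch below reproduces the error on a toy frame and shows one possible way to route columns by dtype. The column `Area` is only a stand-in for a string column, and the transformer layout is an assumption, not the actual contents of `building_pipeline`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame; "Area" stands in for any string-valued column.
toy = pd.DataFrame({"DrivAge": [30.0, None, 50.0], "Area": ["D", "A", "D"]})

# Reproduce the failure: a median imputer cannot handle strings.
try:
    SimpleImputer(strategy="median").fit(toy)
    failed = False
except ValueError:
    failed = True

# Route columns by dtype so only numeric columns reach the median imputer.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

routed = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), string_cols),
])
X_toy = routed.fit_transform(toy)
```

In the notebook, the equivalent step would be to make sure every string column is listed under the categorical or ordinal features (or dropped) before `fit_transform` is called.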