# Preprocessing

Before modelling, we need to encode the categorical features. We also build the `Pipeline` object that applies these transformations.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os 
import importlib


sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))  # add the parent directory to the Python path so that `src` is importable

In [2]:
from src.data.load_data import load_data
claims_raw, _ = load_data(raw=True) # returns train and test dataframes

First, let's choose our target variable (y). We want to try several options, so we create it flexibly:

1. Simply `ClaimNb`.
<br/><br/>
2. ClaimRate $= \frac{ClaimNb}{Exposure}$.
<br/><br/>
3. log_ClaimRate $= \log\left(1 + \frac{ClaimNb}{\max(Exposure,\ threshold)}\right)$. The logarithm accounts for the strong skew, and flooring the exposure at the threshold caps the extreme rates produced by very short exposures. We use 0.01 as the floor, a common choice in insurance pricing because it reduces extreme variance.
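
The three target definitions can be sketched as follows. This is a minimal illustration, not the actual `create_target` from `src.data.preprocess`: the names `sketch_target` and `EXPOSURE_FLOOR` are hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd

EXPOSURE_FLOOR = 0.01  # floor mentioned in the text; the constant name is hypothetical


def sketch_target(df: pd.DataFrame, target: str = "log_ClaimRate") -> pd.Series:
    """Illustrative target construction for the three options listed above."""
    if target == "ClaimNb":
        return df["ClaimNb"]
    if target == "ClaimRate":
        return df["ClaimNb"] / df["Exposure"]
    if target == "log_ClaimRate":
        # Floor the exposure so very short policies cannot blow up the rate.
        floored = df["Exposure"].clip(lower=EXPOSURE_FLOOR)
        return np.log1p(df["ClaimNb"] / floored)
    raise ValueError(f"Unknown target: {target!r}")


# Toy data (made up): the second policy has an exposure below the floor.
toy = pd.DataFrame({"ClaimNb": [0, 1, 2], "Exposure": [0.5, 0.001, 1.0]})
y_rate = sketch_target(toy, "ClaimRate")
y_log = sketch_target(toy, "log_ClaimRate")
```

For the second toy policy the raw rate is 1 / 0.001 = 1000, while the floored log target uses 1 / 0.01 = 100, which shows how the floor limits extreme values.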

In [3]:
from src.data.preprocess import create_target, TargetType
print(TargetType)

typing.Literal['ClaimNb', 'ClaimRate', 'log_ClaimRate']


Now we choose the target variable to use:

In [4]:
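importlib.reload(sys.modules['src.data.preprocess'])  # pick up edits made to the module
from src.data.preprocess import create_target  # re-import: reload() does not refresh names bound with `from ... import`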

target = "log_ClaimRate" # <-- choose target here from above

y_train = create_target(claims_raw, target=target)    

Here are the features and the transformations we apply to each:

- `Exposure`: Exposure is used as a feature only when it is not already part of the target (e.g. when the target is `ClaimNb`). We floor it at 0.01, as there are some extreme values, and apply no further transformations.

- `VehPower`: Vehicle power has 11 categories ranging from 4 to 15. We group them into 7 bins (4, 5, 6, 7, 8-9, 10-11, 12+) because higher-power cars are rarer, which makes the feature more monotonic. We then apply ordinal encoding, as the bins have a clear low-to-high order.

- `VehAge`: Vehicle age is strongly right-skewed, with some cars as old as 100 years. We try binning it into 8 bins (0-1, 2-3, 4-5, 6-10, 11-15, 16-20, 21-30 and 30+), which distributes the data more evenly, followed by ordinal encoding. We also try using it as a numerical feature.

- `DrivAge`: Driver age is close to normally distributed compared with the other features, with only a mild skew (0.4). We keep it without transformation.

- `Density`: We apply a logarithm transformation to the density feature, which removes most of its skew (0.05 after the transformation).

- `Area`: Area is fully determined by Density ranges: scatterplots against log-density show clear stripes. We therefore drop Area as redundant.

- `BonusMalus`: More than half of the data has a Bonus-Malus score of 50, and we found no transformation that clearly improves the feature, so we keep it as is.

- `VehBrand`: This feature has no ordinal structure, so we cannot bin or ordinal-encode it. We keep the categories as they are and one-hot encode them.

- `VehGas`: There are only two values, regular and diesel, split almost 50/50, so we one-hot encode it into a single binary feature.

- `Region`: Region has many categories, and many of them have very low counts (<1%). We group these into a separate category to reduce dimensionality, then apply one-hot encoding.
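
A few of the transformations in the list above can be sketched with pandas. This is only an illustration under stated assumptions: the helper names `bin_veh_power` and `group_rare`, the label `"Other"`, and the toy values are hypothetical and not taken from `src.data.preprocess`. The logarithm assumes `Density` is strictly positive.

```python
import numpy as np
import pandas as pd


def bin_veh_power(power: pd.Series) -> pd.Series:
    """Bin VehPower into 4, 5, 6, 7, 8-9, 10-11, 12+ and return ordinal codes 0..6."""
    edges = [3, 4, 5, 6, 7, 9, 11, np.inf]  # right-closed intervals: (3, 4], (4, 5], ...
    labels = ["4", "5", "6", "7", "8-9", "10-11", "12+"]
    binned = pd.cut(power, bins=edges, labels=labels)
    return binned.cat.codes  # ordered categories, so the codes follow the bin order


def group_rare(values: pd.Series, min_share: float = 0.01, other: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below min_share with one label."""
    share = values.value_counts(normalize=True)
    rare = share[share < min_share].index
    return values.where(~values.isin(rare), other)


# Toy data (made up) to show the effect of each step.
toy = pd.DataFrame({"VehPower": [4, 8, 9, 11, 15], "Density": [1, 10, 100, 1000, 10000]})
toy["VehPower_bin"] = bin_veh_power(toy["VehPower"])
toy["log_Density"] = np.log(toy["Density"])  # assumes Density > 0

regions = pd.Series(["R24"] * 150 + ["R82"] * 49 + ["R43"] * 1)
regions_grouped = group_rare(regions)  # "R43" has a 0.5% share and is grouped
```

Returning integer codes from the binning step keeps the low-to-high order that the ordinal encoding is meant to preserve.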

First, we apply some basic transformations to the features before creating the pipeline.

In [5]:
from src.data.preprocess import preprocess_manual, build_feature_pipeline, FEATURES


claims_processed = preprocess_manual(claims_raw)
claims_processed.info()
# save the processed data
# claims_processed.to_csv("../data/processed/claims_processed_train.csv", index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 542410 entries, 0 to 542409
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Exposure    542410 non-null  float64
 1   VehPower    542410 non-null  int64  
 2   VehAge      542410 non-null  int64  
 3   DrivAge     542410 non-null  int64  
 4   BonusMalus  542410 non-null  int64  
 5   VehBrand    542410 non-null  object 
 6   VehGas      542410 non-null  object 
 7   Density     542410 non-null  int64  
 8   Region      542410 non-null  object 
dtypes: float64(1), int64(5), object(3)
memory usage: 37.2+ MB


In [6]:


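# Keep Exposure as a feature only when the target is ClaimNb;
# for the rate targets it is already part of the target.
feature_configs_filtered = [
    f for f in FEATURES if f.name != "Exposure" or target == "ClaimNb"
]

# NOTE: feature_configs_filtered is not passed to build_feature_pipeline() here,
# so the filter above only takes effect if the function is given this list.
preprocesser = build_feature_pipeline()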

X_train = preprocesser.fit_transform(claims_processed)
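
For readers without access to `src.data.preprocess`, the sketch below shows one way such a preprocessor could be assembled with scikit-learn. It is an assumption, not the actual `build_feature_pipeline`: the name `sketch_preprocessor`, the choice of `ColumnTransformer`, the column subset, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy frame with a subset of the columns described above (values are made up).
toy = pd.DataFrame({
    "DrivAge": [25, 40, 63, 31],
    "BonusMalus": [50, 50, 85, 60],
    "Density": [10, 1200, 27000, 300],
    "VehBrand": ["B1", "B2", "B1", "B12"],
    "VehGas": ["Regular", "Diesel", "Diesel", "Regular"],
})

sketch_preprocessor = ColumnTransformer(
    transformers=[
        # Kept without transformation.
        ("keep", "passthrough", ["DrivAge", "BonusMalus"]),
        # Log transform; np.log is importable, so this step can be pickled.
        ("log_density", FunctionTransformer(np.log), ["Density"]),
        # One column per brand; unseen brands at predict time become all zeros.
        ("brand", OneHotEncoder(handle_unknown="ignore"), ["VehBrand"]),
        # Two-valued feature collapsed into a single 0/1 column.
        ("gas", OneHotEncoder(drop="if_binary"), ["VehGas"]),
    ],
    remainder="drop",
)

X_toy = sketch_preprocessor.fit_transform(toy)
```

On this toy frame the result has 2 passthrough columns, 1 log-density column, 3 brand columns, and 1 fuel column.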


In [9]:
import joblib
from scipy import sparse

sparse.save_npz(f"../data/processed/X_train_{target}.npz", X_train)
np.save(f"../data/processed/y_train_{target}.npy", y_train.values)


joblib.dump(preprocesser, f"../models/preprocesser_{target}.joblib")

PicklingError: Can't pickle <function make_binner.<locals>.binner at 0x141c81da0>: it's not found as src.data.preprocess.make_binner.<locals>.binner
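
The error arises because `joblib.dump` relies on pickling, and pickle stores functions by reference to an importable name. The `binner` function is defined inside `make_binner`, so it cannot be looked up as `src.data.preprocess.make_binner.<locals>.binner`. One possible fix, sketched below, is to have the factory return a `functools.partial` over an importable function instead of a nested function. The signature of `make_binner` and the bin edges here are assumptions for illustration, not the actual code in `src.data.preprocess`.

```python
import pickle
from functools import partial

import numpy as np


def make_binner(edges):
    """Return a picklable binner: a partial over the importable np.digitize."""
    return partial(np.digitize, bins=np.asarray(edges))


def make_closure_binner(edges):
    """Same logic written as a nested function, which pickle cannot serialize."""
    def binner(x):
        return np.digitize(x, bins=edges)
    return binner


# Hypothetical edges for the VehPower bins 4, 5, 6, 7, 8-9, 10-11, 12+.
veh_power_edges = [5, 6, 7, 8, 10, 12]

# The partial survives a pickle round trip and still bins correctly.
restored = pickle.loads(pickle.dumps(make_binner(veh_power_edges)))
codes = restored(np.array([4, 8, 9, 11, 15]))

# The closure version fails to pickle (the exception type depends on the pickler).
try:
    pickle.dumps(make_closure_binner(veh_power_edges))
    closure_pickled = True
except (pickle.PicklingError, AttributeError):
    closure_pickled = False
```

A module-level class with a `__call__` method that stores the edges as an attribute would work as well; the key point is that everything inside the saved pipeline must be importable by name when it is loaded.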