# Preprocessing

In this notebook we encode the categorical features and build our `Pipeline` object.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os 
import importlib


sys.path.append(os.path.abspath(os.path.join(os.getcwd(), ".."))) # add the parent folder (project root) to the Python path

In [22]:
from src.data.load_data import load_data
claims_raw, claims_raw_test = load_data(raw=True) # returns train and test dataframes

First, let's choose our target variable `y`. We want to try several options, so we create it flexibly.

1. Simply `ClaimNb`.
<br/><br/>
2. `ClaimRate` $= \frac{\text{ClaimNb}}{\text{Exposure}}$.
<br/><br/>
3. `log_ClaimRate` $= \log\left(1 + \frac{\text{ClaimNb}}{\max(\text{Exposure},\ \text{threshold})}\right)$. The log transform accounts for the strong skew, and the floor on `Exposure` caps some extreme values. We use a threshold of 0.01 as the floor, a value widely used in insurance pricing because it reduces extreme variance.

In all cases, `Exposure` is capped at 1.0, as it should not exceed that value. We also clip it at a minimum of 0.01, because smaller values lead to extreme targets.
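The three target definitions above can be sketched as follows. This is only a minimal illustration on toy data: `make_target_sketch` and its `floor` and `cap` parameters are hypothetical names, and the project's actual `create_target` may be implemented differently.

```python
import numpy as np
import pandas as pd


def make_target_sketch(df, target="log_ClaimRate", floor=0.01, cap=1.0):
    """Hypothetical stand-in for create_target, not the project's real code."""
    # Clip Exposure into [floor, cap] before using it as a denominator
    exposure = df["Exposure"].clip(lower=floor, upper=cap)
    if target == "ClaimNb":
        return df["ClaimNb"].astype(float)
    rate = df["ClaimNb"] / exposure
    if target == "ClaimRate":
        return rate
    if target == "log_ClaimRate":
        return np.log1p(rate)  # log(1 + rate)
    raise ValueError(f"Unknown target: {target!r}")


# Toy rows: a normal exposure, one below the floor, one above the cap
toy = pd.DataFrame({"ClaimNb": [0, 1, 2], "Exposure": [0.5, 0.001, 1.3]})
y_rate = make_target_sketch(toy, "ClaimRate")
y_log = make_target_sketch(toy, "log_ClaimRate")
print(y_rate.tolist())
print(y_log.tolist())
```

The second toy row shows why the floor matters: without it, one claim over an exposure of 0.001 would produce a rate of 1000 instead of 100.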

First we make some basic transformations on the features before creating the pipeline.

In [23]:
from src.preprocess.preprocess import preprocess_manual, build_feature_pipeline, FEATURES


claims_processed = preprocess_manual(claims_raw, exposure=False) 
#claims_processed.info()
# save the processed data
#claims_processed.to_csv(f"../data/processed/claims_processed_train.csv", index=False)

In [24]:
from src.preprocess.preprocess import create_target, TargetType
print(TargetType)

typing.Literal['ClaimNb', 'ClaimRate', 'log_ClaimRate']


We can now choose the target variable we want to use:

In [25]:
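# Reload the module to pick up source edits, then re-import the name,
# because `from ... import` bindings are not refreshed by reload()
importlib.reload(sys.modules['src.preprocess.preprocess'])
from src.preprocess.preprocess import create_target

target = "log_ClaimRate" # <-- choose target here from above

y_train = create_target(claims_raw, target=target)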

Here are the features and the transformations we apply to each:

- `Exposure`: Exposure is only used as a feature when it is not already part of the target (e.g. when the target is `ClaimNb`). We floor it at 0.01, as there are some extreme values, and apply no further transformations.

- `VehPower`: Vehicle power has 12 integer levels ranging from 4 to 15. We group them into 7 bins (4, 5, 6, 7, 8-9, 10-11, 12+), because higher-power cars are rarer and merging them gives a more monotonic pattern. We then apply ordinal encoding, as the bins have a clear low-to-high order.

- `VehAge`: Vehicle age is strongly right-skewed, with some cars as old as 100 years. We try binning it into 8 bins (0-1, 2-3, 4-5, 6-10, 11-15, 16-20, 21-30 and 31+), which distributes the data more evenly, followed by ordinal encoding. We also try using it as a plain numerical feature.

- `DrivAge`: Driver age is close to normally distributed compared to the other features, with only a mild skew (0.4). We keep it without transformation.

- `Density`: We apply a logarithm transformation to the density feature, as the log-transformed feature is no longer skewed (skewness of 0.05).

- `Area`: Area is fully determined by ranges of Density: scatterplots against log-density show clearly separated stripes. We therefore drop Area, as it is redundant.

- `BonusMalus`: More than half of the rows have a bonus-malus score of 50, which makes it hard to find a transformation that improves the feature. We keep it as it is.

- `VehBrand`: This feature has no ordinal structure, so we cannot bin or ordinal-encode it. We keep the categories as they are and one-hot encode them.

- `VehGas`: The split between regular and diesel is almost 50/50, and these are the only two values, so we one-hot encode it to make it a binary feature.

- `Region`: Region has many categories, and many of them have very low counts (<1%). We group these into a separate `Other` category to reduce dimensionality, then one-hot encode the result.
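To make the list above concrete, here is a minimal sketch of how such transformations could be expressed with pandas and scikit-learn. It does not reproduce `preprocess_manual` or `build_feature_pipeline`: the helper `bin_and_group`, the `rare_share` parameter, the label lists and the split between the manual step and the `ColumnTransformer` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, OrdinalEncoder

POWER_LABELS = ["4", "5", "6", "7", "8-9", "10-11", "12+"]
AGE_LABELS = ["0-1", "2-3", "4-5", "6-10", "11-15", "16-20", "21-30", "31+"]


def bin_and_group(df: pd.DataFrame, rare_share: float = 0.01) -> pd.DataFrame:
    """Hypothetical manual step: bin VehPower/VehAge, group rare regions, drop Area."""
    out = df.copy()
    # Right-closed intervals: (3, 4] -> "4", ..., (11, inf) -> "12+"
    out["VehPower"] = pd.cut(
        out["VehPower"], bins=[3, 4, 5, 6, 7, 9, 11, np.inf], labels=POWER_LABELS
    ).astype(str)
    out["VehAge"] = pd.cut(
        out["VehAge"], bins=[-1, 1, 3, 5, 10, 15, 20, 30, np.inf], labels=AGE_LABELS
    ).astype(str)
    # Regions whose share of rows is below rare_share are merged into "Other"
    share = out["Region"].value_counts(normalize=True)
    rare = share[share < rare_share].index
    out["Region"] = out["Region"].where(~out["Region"].isin(rare), "Other")
    return out.drop(columns=["Area"], errors="ignore")


# One transformer per feature, named after the feature it handles
sketch_preprocessor = ColumnTransformer([
    ("VehPower", OrdinalEncoder(categories=[POWER_LABELS]), ["VehPower"]),
    ("VehAge", OrdinalEncoder(categories=[AGE_LABELS]), ["VehAge"]),
    ("DrivAge", "passthrough", ["DrivAge"]),
    ("Density", FunctionTransformer(np.log), ["Density"]),
    ("BonusMalus", "passthrough", ["BonusMalus"]),
    ("VehBrand", OneHotEncoder(handle_unknown="ignore"), ["VehBrand"]),
    ("VehGas", OneHotEncoder(handle_unknown="ignore"), ["VehGas"]),
    ("Region", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
])

toy = pd.DataFrame({
    "VehPower": [4, 6, 9, 15],
    "VehAge": [0, 7, 25, 100],
    "DrivAge": [25, 40, 60, 33],
    "Density": [10, 1000, 27000, 300],
    "BonusMalus": [50, 50, 80, 100],
    "VehBrand": ["B1", "B2", "B1", "B12"],
    "VehGas": ["Diesel", "Regular", "Regular", "Diesel"],
    "Region": ["R24", "R24", "R82", "R24"],
    "Area": ["A", "D", "F", "C"],
})
# rare_share is raised to 0.3 only so that the tiny toy set has a "rare" region
prepared = bin_and_group(toy, rare_share=0.3)
X_sketch = sketch_preprocessor.fit_transform(prepared)
print(prepared[["VehPower", "VehAge", "Region"]])
print(X_sketch.shape)
```

Passing the bin labels to `OrdinalEncoder` explicitly keeps the integer codes in the intended low-to-high order, which alphabetical sorting of labels such as `"10-11"` and `"4"` would not guarantee.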

In [26]:


feature_configs_filtered = [
    f for f in FEATURES if f.name != "Exposure" or target == "ClaimNb"
]

preprocessor = build_feature_pipeline(feature_configs_filtered)

X_train = preprocessor.fit_transform(claims_processed)



ValueError: Shape of passed values is (542410, 1), indices imply (542410, 35)
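A `ValueError` of this form is what pandas typically raises when a SciPy sparse matrix is passed straight to `pd.DataFrame` together with a list of column names. The call that triggered it is not shown in the cells of this notebook, so the following is only a sketch of two common ways around it, using a toy matrix and hypothetical column names.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy sparse matrix standing in for the preprocessor output
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]))
names = ["a", "b", "c"]  # hypothetical column names

# Option 1: keep the data sparse inside pandas
df_sparse = pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=names)

# Option 2: densify first (fine for a few dozen columns, costly for very wide data)
df_dense = pd.DataFrame(X_sparse.toarray(), columns=names)

print(df_sparse.shape, df_dense.shape)
```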

In [None]:
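import joblib
from scipy import sparse

# Persist the sparse feature matrix and the target for the chosen target type
sparse.save_npz("../data/processed/X_train.npz", X_train)
np.save(f"../data/processed/y_train_{target}.npy", y_train.values)

# Persist the fitted preprocessor so it can be reused on new data
joblib.dump(preprocessor, f"../models/preprocesser_{target}.joblib")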

['VehPower__VehPower' 'VehAge__VehAge' 'DrivAge__DrivAge'
 'Density__Density' 'BonusMalus__BonusMalus' 'VehBrand__VehBrand_B1'
 'VehBrand__VehBrand_B10' 'VehBrand__VehBrand_B11'
 'VehBrand__VehBrand_B12' 'VehBrand__VehBrand_B13'
 'VehBrand__VehBrand_B14' 'VehBrand__VehBrand_B2' 'VehBrand__VehBrand_B3'
 'VehBrand__VehBrand_B4' 'VehBrand__VehBrand_B5' 'VehBrand__VehBrand_B6'
 'VehGas__VehGas_Diesel' 'VehGas__VehGas_Regular' 'Region__Region_Other'
 'Region__Region_R11' 'Region__Region_R22' 'Region__Region_R23'
 'Region__Region_R24' 'Region__Region_R25' 'Region__Region_R26'
 'Region__Region_R31' 'Region__Region_R41' 'Region__Region_R52'
 'Region__Region_R53' 'Region__Region_R54' 'Region__Region_R72'
 'Region__Region_R73' 'Region__Region_R82' 'Region__Region_R91'
 'Region__Region_R93']


We apply the same steps to the test set to get its processed version as well. The preprocessor fitted on the training data is only applied here, not re-fitted.

In [19]:
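y_test = create_target(claims_raw_test, target=target)
# Transform only: the preprocessor was fitted on the training data
X_test = preprocessor.transform(preprocess_manual(claims_raw_test, exposure=False))

# Assumption: `clean_names` is not defined in the cells shown, so it is derived here
# from the pipeline's output names with the "transformer__" prefix stripped.
clean_names = [name.split("__", 1)[-1] for name in preprocessor.get_feature_names_out()]
# A sparse matrix cannot be passed to pd.DataFrame directly, so densify it first
X_test_df = pd.DataFrame(X_test.toarray(), columns=clean_names)

sparse.save_npz("../data/processed/X_test.npz", X_test)
np.save(f"../data/processed/y_test_{target}.npy", y_test.values)

X_test_df.to_csv("../data/processed/claims_features_test.csv", index=False)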
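For a later modelling notebook, the saved artifacts can be read back with the matching loaders. The sketch below shows the round trip on toy objects in a temporary directory. The file names are placeholders, and a `StandardScaler` stands in for the fitted preprocessor.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-ins for X_train, y_train and the fitted preprocessor
    X = sparse.csr_matrix(np.eye(3))
    y = np.array([0.0, 1.0, 2.0])
    scaler = StandardScaler().fit(y.reshape(-1, 1))

    x_path = os.path.join(tmp, "X_demo.npz")
    y_path = os.path.join(tmp, "y_demo.npy")
    model_path = os.path.join(tmp, "preprocessor_demo.joblib")

    # Save with the same functions used in this notebook
    sparse.save_npz(x_path, X)
    np.save(y_path, y)
    joblib.dump(scaler, model_path)

    # Load each artifact back with its counterpart
    X_loaded = sparse.load_npz(x_path)
    y_loaded = np.load(y_path)
    scaler_loaded = joblib.load(model_path)

print(X_loaded.shape, y_loaded.shape)
```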

In [20]:
# Dense copy for a quick sanity check of each output column
dense = X_train.toarray()
for i in range(dense.shape[1]):
    col = dense[:, i]
    print(f"Feature {i}: mean={col.mean()}, std={col.std()}")

Feature 0: mean=1.4890470308438266, std=1.5364414490919565
Feature 1: mean=2.34988661713464, std=1.621879527772838
Feature 2: mean=-2.000133065569051e-16, std=1.0
Feature 3: mean=-9.18455182629008e-16, std=0.9999999999999999
Feature 4: mean=1.0912079402816383e-16, std=1.0
Feature 5: mean=0.24014306520897477, std=0.427170192594255
Feature 6: mean=0.026216330819859517, std=0.15977807990523357
Feature 7: mean=0.02007521985214137, std=0.1402576393642412
Feature 8: mean=0.24489039656348519, std=0.4300221973741171
Feature 9: mean=0.017846278645305212, std=0.13239255637617747
Feature 10: mean=0.005975184823288657, std=0.07706803481091366
Feature 11: mean=0.23565384119024355, std=0.42440677223923634
Feature 12: mean=0.07868955218377242, std=0.2692536101167258
Feature 13: mean=0.03721539057170775, std=0.18928921067061114
Feature 14: mean=0.05134492358179237, std=0.22070029996393822
Feature 15: mean=0.04194981655942921, std=0.20047451072408048
Feature 16: mean=0.4898268837226452, std=0.499896496