# Feature Engineering with information available at initial quote request

## Overview

- The goal of this notebook is to engineer features from the variables that an underwriter collects at first contact (the initial insurance quote request). These features can be used for risk classification or technical premium calculation. The variables we will work with are:
    - `ID`: Unique identifier for each policyholder
    - `Distribution_channel`: Classifies the channel through which the policy was contracted. 0 for Agent and 1 for Insurance brokers.
    - `Date_birth`: Date of birth of the insured declared in the policy (DD/MM/YYYY)
    - `Date_driving_licence`: Date of issuance of the insured person's driver's license (DD/MM/YYYY)
    - `Premium`: Net premium amount associated with the policy during the current year
    - `Type_risk`: Type of risk associated with the policy. Each value corresponds to a specific risk type: 1 for motorbikes, 2 for vans, 3 for passenger cars and 4 for agricultural vehicles
    - `Area`: Dichotomous variable indicating the area in terms of traffic conditions. 0 for rural and 1 for urban (more than 30,000 inhabitants).
    - `Second_driver`: 1 if there are multiple regular drivers declared, or 0 if only one driver is declared
    - `Year_matriculation`: Year of registration of the vehicle (YYYY)
    - `Power`: Vehicle power measured in horsepower
    - `Cylinder_capacity`: Cylinder capacity of the vehicle
    - `Value_vehicle`: Market value of the vehicle on 31/12/2019
    - `N_doors`: Number of vehicle doors
    - `Type_fuel`: Specific kind of energy source used to power a vehicle. Petrol (P) or Diesel (D)
    - `Length`: Length, in meters, of the vehicle
    - `Weight`: Weight, in kilograms, of the vehicle
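
The coded categorical variables above are easier to read in plots and tables once the codes are mapped to labels. The following is a minimal sketch, assuming the code-to-label mappings given in the variable descriptions; the dictionary names and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Label maps transcribed from the variable descriptions above (names are hypothetical).
TYPE_RISK_LABELS = {1: "motorbike", 2: "van", 3: "passenger car", 4: "agricultural vehicle"}
AREA_LABELS = {0: "rural", 1: "urban"}

# Toy rows for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({"Type_risk": [3, 1, 4], "Area": [1, 0, 0]})

# Keep the original codes and add readable label columns next to them.
toy_labelled = toy.assign(
    Type_risk_label=toy["Type_risk"].map(TYPE_RISK_LABELS),
    Area_label=toy["Area"].map(AREA_LABELS),
)
print(toy_labelled)
```

Keeping the numeric codes alongside the labels leaves the original columns available for modelling.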

## Setup

In [None]:
from datetime import datetime
from typing import Any
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Load the data

In [None]:
insurance_initiation_variables_path = "../../data/output/Insurance_Initiation_Variables.csv"
claims_variables_path = "../../data/input/exp/sample_type_claim.csv"
insurance_df = pd.read_csv(insurance_initiation_variables_path, delimiter=';')
claims_df = pd.read_csv(claims_variables_path, delimiter=';')

## 1. Prepare Dataset

| Step | Action |
|------|--------|
| Aggregate | Sum `Cost_claims_by_type` by (`ID`, `Cost_claims_year`) |
| Merge | Left join on (`ID`, `Cost_claims_year`) |
| Fill | NaN → 0 for no claims |
| Split | 80/20 train-test split |

### 1.1 Aggregate claims by policyholder and year

In [None]:
claim_grouping_columns = ['ID', 'Cost_claims_year']
claim_aggregation_column = 'Cost_claims_by_type'
claims_aggregated = (
    claims_df
    .groupby(claim_grouping_columns, as_index=False)[claim_aggregation_column]
    .sum()
)
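
To make the aggregation concrete, here is a small sketch on made-up rows. The numbers are invented; whatever `Cost_claims_year` holds in the real file, it only acts as part of the grouping key here.

```python
import pandas as pd

# Toy claims: policyholder 1 has two claim types under the same key, so they collapse into one row.
toy_claims = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Cost_claims_year": [150.0, 150.0, 30.0, 75.0],
    "Cost_claims_by_type": [100.0, 50.0, 30.0, 75.0],
})

# Same groupby-sum pattern as the cell above, applied to the toy rows.
toy_aggregated = (
    toy_claims
    .groupby(["ID", "Cost_claims_year"], as_index=False)["Cost_claims_by_type"]
    .sum()
)
print(toy_aggregated)
```

With `as_index=False` the grouping keys stay as ordinary columns, which is what the later merge relies on.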

### 1.2 Merge insurance and claims data

In [None]:
merging_columns = ['ID', 'Cost_claims_year']

dataset = insurance_df.merge(claims_aggregated, on=merging_columns, how='left')
dataset[claim_aggregation_column] = dataset[claim_aggregation_column].fillna(0)
dataset['claims_frequency'] = (dataset[claim_aggregation_column] > 0).astype(int)
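
The merge, fill and flag steps can be checked on toy data. This is a sketch with invented rows; the `validate="many_to_one"` argument is an optional guard that is not used in the cell above, added here on the assumption that the aggregated claims have one row per key.

```python
import pandas as pd

# Toy policy rows (one per key) and toy aggregated claims (only for policies with claims).
toy_policies = pd.DataFrame({"ID": [1, 2, 3], "Cost_claims_year": [150.0, 0.0, 75.0]})
toy_claims_aggregated = pd.DataFrame({
    "ID": [1, 3],
    "Cost_claims_year": [150.0, 75.0],
    "Cost_claims_by_type": [150.0, 75.0],
})

# Left join keeps every policy; validate raises if the right side has duplicate keys.
toy_dataset = toy_policies.merge(
    toy_claims_aggregated,
    on=["ID", "Cost_claims_year"],
    how="left",
    validate="many_to_one",
)

# Policies without a matching claim row get NaN, which is read as "no claims".
toy_dataset["Cost_claims_by_type"] = toy_dataset["Cost_claims_by_type"].fillna(0)
toy_dataset["claims_frequency"] = (toy_dataset["Cost_claims_by_type"] > 0).astype(int)
print(toy_dataset)
```

Note that `claims_frequency`, as defined, is a 0/1 indicator of whether any claim cost was recorded, not a count of claims.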

### 1.3 Split into train and test sets

In [None]:
test_ratio = 0.2
to_shuffle = False
if to_shuffle:
    dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

split_index = int(len(dataset) * (1 - test_ratio))
trainset = dataset.iloc[:split_index].reset_index(drop=True)
testset = dataset.iloc[split_index:].reset_index(drop=True)
print(f"Total records: {len(dataset)}")
print(f"Training set: {len(trainset)} records ({100*(1-test_ratio):.0f}%)")
print(f"Test set: {len(testset)} records ({100*test_ratio:.0f}%)")
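
Because `to_shuffle` is `False`, the split above is purely positional. The sketch below wraps the same logic in a hypothetical helper (`split_train_test` is not part of the notebook) and uses a deliberately ordered toy table to show how an unshuffled split can leave the two sets with different claim rates. Whether the real file is ordered in a way that matters is not established here.

```python
import pandas as pd

def split_train_test(df, test_ratio=0.2, shuffle=False, seed=42):
    """Hypothetical helper: positional split, optionally shuffling the rows first."""
    if shuffle:
        df = df.sample(frac=1, random_state=seed)
    df = df.reset_index(drop=True)
    split_index = int(len(df) * (1 - test_ratio))
    train = df.iloc[:split_index].reset_index(drop=True)
    test = df.iloc[split_index:].reset_index(drop=True)
    return train, test

# Toy table sorted so that all claims sit at the end (an extreme, made-up case).
toy = pd.DataFrame({"ID": range(10), "claims_frequency": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

toy_train, toy_test = split_train_test(toy)
train_rate = toy_train["claims_frequency"].mean()
test_rate = toy_test["claims_frequency"].mean()
print(f"claim rate, train: {train_rate:.2f}, test: {test_rate:.2f}")
```

Comparing the target rate across the two sets is a cheap check before deciding whether to switch shuffling on.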

## 2. Research Baseline Features

While advances in the motor insurance sector now make it possible to use telematics data, I will explore traditional features in this section, mostly based on domain knowledge ([some are captured here](https://www.researchgate.net/publication/338007809_An_Analysis_of_the_Risk_Factors_Determining_Motor_Insurance_Premium_in_a_Small_Island_State_The_Case_of_Malta)). Some key ones that align with this dataset include:

- Type of vehicle: `Exists in the dataset`
- Value of the vehicle: `Exists in the dataset`
- Age of the driver: `To be engineered`
- Vehicle technology equipment: `To be engineered (an easy proxy could be the age of the vehicle, on the assumption that newer vehicles have more sophisticated technology)`
- Geographic location: `Exists in the dataset`
- Repair cost of the vehicle: `Close proxy is the value of the vehicle`
- Occupation of the driver: `Not captured in dataset`
- Medical condition of the driver: `Not captured in dataset`
- Recent performance vehicle modifications (power-to-weight ratio, brake horsepower, etc.): `To be engineered`

## 3. Implement Engineered Features

| Feature | Formula | Unit |
|---------|---------|------|
| Driver_age_years | today - Date_birth | Years |
| Driver_experience_years | today - Date_driving_licence | Years |
| Car_age_years | today_year - Year_matriculation | Years |
| power_to_weight | Power / Weight | HP/kg |
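
As a worked example of the formulas in the table, the sketch below computes each feature for a single made-up policyholder. It assumes a fixed reference date of 2018-12-31 in place of "today" so that the numbers are reproducible; the notebook itself uses the current date.

```python
import pandas as pd

# Assumed fixed reference date (the notebook uses pd.Timestamp.today() instead).
reference_date = pd.Timestamp("2018-12-31")

# Made-up values for one policyholder, in the dataset's DD/MM/YYYY format.
date_birth = pd.to_datetime("15/04/1980", format="%d/%m/%Y")
date_driving_licence = pd.to_datetime("01/06/2000", format="%d/%m/%Y")
year_matriculation = 2010
power_hp = 110
weight_kg = 1375

# Elapsed days divided by 365.25 approximates years, allowing for leap years.
driver_age_years = (reference_date - date_birth).days / 365.25
driver_experience_years = (reference_date - date_driving_licence).days / 365.25
car_age_years = reference_date.year - year_matriculation
power_to_weight = power_hp / weight_kg

print(f"Driver_age_years: {driver_age_years:.2f}")
print(f"Driver_experience_years: {driver_experience_years:.2f}")
print(f"Car_age_years: {car_age_years}")
print(f"power_to_weight: {power_to_weight:.3f} HP/kg")
```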

### 3.1 Define helper functions

In [None]:
def convert_to_datetime(value: object, format: str = "%d/%m/%Y") -> pd.Timestamp:
    """Parse a DD/MM/YYYY date string into a pandas Timestamp."""
    return pd.to_datetime(value, format=format)

def take_datetime_difference_in_years(first_datetime: datetime, second_datetime: datetime, interval: str = 'D') -> float:
    """Return second_datetime - first_datetime in years; `interval` must be 'D' for the 365.25 divisor to be valid."""
    diff_days = (second_datetime - first_datetime) / np.timedelta64(1, interval)
    return diff_days / 365.25

def take_int_difference(first_number: int, second_number: int) -> int:
    """Return the absolute difference between two integers (here, two years)."""
    return abs(first_number - second_number)

### 3.2 Apply transformations to trainset

In [None]:
today_date = pd.Timestamp.today()  # TODO: consider a fixed reference date instead, e.g. the end of data collection (2018-12-31), so that ages are reproducible
today_year = today_date.year
features_trainset = (
    trainset
    .assign(
        Date_birth_dt=trainset['Date_birth'].apply(convert_to_datetime),
        Date_driving_licence_dt=trainset['Date_driving_licence'].apply(convert_to_datetime),
        power_to_weight=trainset['Power'] / trainset['Weight'],
        Car_age_years=trainset['Year_matriculation'].apply(take_int_difference, args=(today_year,)),
    )
    .assign(
        Driver_age_years=lambda df: df['Date_birth_dt'].apply(take_datetime_difference_in_years, args=(today_date, 'D')),
        Driver_experience_years=lambda df: df['Date_driving_licence_dt'].apply(take_datetime_difference_in_years, args=(today_date, 'D')),
    )
)
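
The row-wise `.apply` calls above can also be written with vectorised pandas operations. The following is a sketch under stated assumptions, not a replacement for the cell above: it uses a fixed reference date of 2018-12-31 rather than today's date, and it runs on invented rows.

```python
import pandas as pd

# Assumed fixed reference date; the notebook uses pd.Timestamp.today().
reference_date = pd.Timestamp("2018-12-31")
days_per_year = 365.25

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Date_birth": ["15/04/1980", "02/11/1995"],
    "Date_driving_licence": ["01/06/2000", "20/01/2015"],
    "Year_matriculation": [2010, 2016],
    "Power": [110, 75],
    "Weight": [1375, 980],
})

# Parse whole columns at once instead of one value at a time.
birth = pd.to_datetime(toy["Date_birth"], format="%d/%m/%Y")
licence = pd.to_datetime(toy["Date_driving_licence"], format="%d/%m/%Y")

toy_features = toy.assign(
    Driver_age_years=(reference_date - birth).dt.days / days_per_year,
    Driver_experience_years=(reference_date - licence).dt.days / days_per_year,
    Car_age_years=reference_date.year - toy["Year_matriculation"],
    power_to_weight=toy["Power"] / toy["Weight"],
)

# Basic plausibility check: nobody should have driven for longer than they have lived.
assert (toy_features["Driver_experience_years"] < toy_features["Driver_age_years"]).all()
print(toy_features)
```

Unlike `take_int_difference`, this sketch does not take the absolute value, so a registration year later than the reference year would show up as a negative car age rather than being hidden.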

### 3.3 Resulting variables

After engineering the features on the trainset, the dataset contains the following variables:
- ID
- Date_birth
- Date_driving_licence
- Distribution_channel
- Premium
- Cost_claims_year
- Type_risk
- Area
- Second_driver
- Year_matriculation
- Power
- Cylinder_capacity
- Value_vehicle
- N_doors
- Type_fuel
- Length
- Weight
- Cost_claims_by_type
- claims_frequency
- Date_birth_dt
- Date_driving_licence_dt
- power_to_weight
- Car_age_years
- Driver_age_years
- Driver_experience_years

For all intents and purposes, `Premium`, `Cost_claims_year` and `claims_frequency` are potential target variables.

## 4. Explore Engineered Features

| Check | Purpose |
|-------|---------|
| Null counts | Feature completeness |
| Histograms | Distribution shape |
| Bar plots | Categorical balance |
| Correlation | Multicollinearity |

### 4.1 Check feature completeness

In [None]:
# Feature completeness: count the missing values in each column
features_trainset.isnull().sum()
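
Raw null counts are easier to interpret next to the share of rows they represent. The sketch below shows one way to tabulate both on an invented table; the `completeness` name and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table with some missing values, for illustration only.
toy = pd.DataFrame({
    "Weight": [1200.0, np.nan, 950.0, 1400.0],
    "Length": [4.2, 4.5, np.nan, np.nan],
    "Power": [90, 110, 75, 130],
})

# isnull().sum() counts missing values; isnull().mean() gives the missing share per column.
completeness = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_share": toy.isnull().mean(),
}).sort_values("null_share", ascending=False)
print(completeness)
```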

### 4.2 Binning pattern for histogram

In [None]:
# Feature distribution: manual binning pattern for a histogram
# Step 1: Bin the variable into 10 equal-width intervals
variable = 'Weight'
binned_variable = pd.cut(features_trainset[variable], bins=10)

# Step 2: Join the binned variable to the existing dataset
binned_variable.name = f"binned_{variable}"
binned_df = pd.concat([features_trainset, binned_variable], axis=1)

# Step 3: Group the dataset by the binned variable and count the rows in each bin
groups = []
for group, subset in binned_df.groupby(by=binned_variable.name, observed=False):
    groups.append({
        'Binrange': group,
        'Count': len(subset),
    })
group_df = pd.DataFrame(groups)

# Step 4: Plot the counts as a bar chart (bin ranges are cast to str because matplotlib does not handle Interval objects well)
plt.bar(x=group_df['Binrange'].astype(str), height=group_df['Count'])
plt.xticks(rotation=90)
plt.show()
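
The bin-and-count steps above can be condensed with `value_counts`. This is a sketch of an equivalent counting step on invented weights, shown without the plot; it is an alternative, not what the cell above does.

```python
import pandas as pd

# Toy weights in kilograms, for illustration only.
toy_weights = pd.Series([900, 1000, 1100, 1500, 2000, 2100], name="Weight")

# pd.cut assigns each value to an equal-width interval;
# value_counts(sort=False) keeps the bins in interval order instead of sorting by count.
bin_counts = pd.cut(toy_weights, bins=3).value_counts(sort=False)

# Same two-column layout as group_df in the cell above.
toy_group_df = pd.DataFrame({
    "Binrange": bin_counts.index.astype(str),
    "Count": bin_counts.to_numpy(),
})
print(toy_group_df)
```

The resulting table can be passed to `plt.bar` in the same way as `group_df`.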

### 4.3 Histogram visualisation (scalable)

In [None]:
# Easier pattern for histograms with seaborn that scales to several columns
# Step 1: List the columns whose distributions we want to plot
cols = ['Car_age_years', 'Driver_age_years', 'Driver_experience_years', 'Power', 'power_to_weight']

# Step 2: Define the number of bins (named n_bins to avoid shadowing the built-in bin())
n_bins = 10

# Step 3: Visualize
fig, axes = plt.subplots(1, len(cols), figsize=(18, 3))
for i, col in enumerate(cols):
    sns.histplot(data=features_trainset, x=col, bins=n_bins, kde=False, ax=axes[i])
    axes[i].set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()

### 4.4 Categorical distributions

In [None]:
# Same pattern, this time with count plots for the categorical variables
cols = [
    'Distribution_channel',
    'Type_risk',
    'Type_fuel',
    'Area',
    'Second_driver'
]

fig, axes = plt.subplots(2, 3, figsize=(15, 6))  # 2 rows, 3 columns
axes = axes.flatten()  # turn into 1D array

for i, col in enumerate(cols):
    sns.countplot(data=features_trainset, x=col, ax=axes[i])
    axes[i].set_title(f"Distribution of {col}")
    axes[i].tick_params(axis="x")
fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()

### 4.5 Bivariate correlation

In [None]:
# Simple bivariate relationships: pairwise Pearson correlation between numeric vehicle features
features_trainset[['Power', 'Cylinder_capacity', 'power_to_weight', 'Value_vehicle', 'Length', 'Weight']].corr()
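
Since the correlation matrix is meant as a multicollinearity check, it can help to list only the strongly correlated pairs. The sketch below does this on invented data; the 0.8 threshold is an arbitrary assumption, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy numeric features for illustration only.
toy = pd.DataFrame({
    "Power": [60, 75, 90, 110, 150],
    "Weight": [900, 1000, 1150, 1300, 1600],
    "N_doors": [5, 3, 5, 3, 5],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

threshold = 0.8  # arbitrary cut-off chosen for this sketch
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

In this toy table only `Power` and `Weight` exceed the threshold; on the real trainset the result depends on the data.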