# Feature Engineering with information available at initial quote request

- The goal of this notebook is to engineer features from the dataset variables that an underwriter obtains at first contact (the initial insurance quote request). These features can be used for risk classification or technical premium calculation. The variables we will be working with are:
    - `ID`: Unique identifier for each policyholder
    - `Distribution_channel`: Channel through which the policy was contracted: 0 for agents and 1 for insurance brokers
    - `Date_birth`: Date of birth of the insured declared in the policy (DD/MM/YYYY)
    - `Date_driving_licence`: Date of issuance of the insured person's driver's license (DD/MM/YYYY)
    - `Premium`: Net premium amount associated with the policy during the current year
    - `Type_risk`: Type of risk associated with the policy. Each value corresponds to a specific risk type: 1 for motorbikes, 2 for vans, 3 for passenger cars and 4 for agricultural vehicles
    - `Area`: Dichotomous variable indicating the area in terms of traffic conditions: 0 for rural and 1 for urban (more than 30,000 inhabitants)
    - `Second_driver`: 1 if there are multiple regular drivers declared, or 0 if only one driver is declared
    - `Year_matriculation`: Year of registration of the vehicle (YYYY)
    - `Power`: Vehicle power measured in horsepower
    - `Cylinder_capacity`: Cylinder capacity of the vehicle
    - `Value_vehicle`: Market value of the vehicle on 31/12/2019
    - `N_doors`: Number of vehicle doors
    - `Type_fuel`: Specific kind of energy source used to power a vehicle. Petrol (P) or Diesel (D)
    - `Length`: Length, in meters, of the vehicle
    - `Weight`: Weight, in kilograms, of the vehicle
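
The coded variables above are easier to read during exploration once they are mapped to labels. A minimal sketch, assuming a DataFrame that carries the columns described above. The `CODE_LABELS` dictionary and the `decode_categories` helper are hypothetical names for illustration and are not part of the notebook's `src` package:

```python
import pandas as pd

# Labels copied from the variable descriptions above.
CODE_LABELS = {
    "Distribution_channel": {0: "agent", 1: "broker"},
    "Type_risk": {1: "motorbike", 2: "van", 3: "passenger car", 4: "agricultural vehicle"},
    "Area": {0: "rural", 1: "urban"},
    "Type_fuel": {"P": "petrol", "D": "diesel"},
}

def decode_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a `<column>_label` column for each coded variable that is present."""
    out = df.copy()
    for column, labels in CODE_LABELS.items():
        if column in out.columns:
            # Codes missing from the mapping become NaN rather than raising.
            out[f"{column}_label"] = out[column].map(labels)
    return out

example = pd.DataFrame({"Type_risk": [1, 3], "Area": [0, 1], "Type_fuel": ["P", "D"]})
decoded = decode_categories(example)
```

The original coded columns are kept, so models can still consume the numeric codes while plots and tables use the labels.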


## 001: Create and split the dataset

In [69]:
import pandas as pd
from src.dataset import Dataset

insurance_initiation_variables_path = "../data/input/exp/Insurance_Initiation_Variables.csv"
claims_variables_path = "../data/input/exp/sample_type_claim.csv"

claim_grouping_columns = ['ID', 'Cost_claims_year']
claim_aggregation_column = 'Cost_claims_by_type'
merging_columns = ['ID', 'Cost_claims_year']

dataset = (
    Dataset(data_path=insurance_initiation_variables_path,
            claims_path=claims_variables_path)
    .group_claims(grouping_columns=claim_grouping_columns,
                  aggregation_column=claim_aggregation_column)
    .create_dataset(merge_columns=merging_columns)
)
trainset, testset = dataset.split_dataset(test_ratio=0.2, to_shuffle=False)
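
The output further down shows several rows per `ID`, so a row-level split could place the same policyholder in both sets. The implementation of `Dataset.split_dataset` is not shown in this notebook, so the following is only a sketch of one alternative: an unshuffled split made on `ID` rather than on rows. `split_by_id` is a hypothetical helper, not a method of `Dataset`:

```python
import pandas as pd

def split_by_id(df: pd.DataFrame, test_ratio: float = 0.2, id_column: str = "ID"):
    """Split rows without shuffling so that every ID lands in exactly one set."""
    # drop_duplicates keeps the order of first appearance.
    unique_ids = df[id_column].drop_duplicates().to_numpy()
    n_test = int(round(len(unique_ids) * test_ratio))
    n_train = len(unique_ids) - n_test
    # The first IDs go to the training set, the remaining ones to the test set.
    is_train = df[id_column].isin(unique_ids[:n_train])
    return df[is_train], df[~is_train]

toy = pd.DataFrame({"ID": [1, 1, 2, 3, 3, 4, 5],
                    "Premium": [10, 11, 20, 30, 31, 40, 50]})
train_part, test_part = split_by_id(toy, test_ratio=0.2)
```

Because the ratio applies to IDs rather than rows, the row-level proportions can differ slightly from `test_ratio` when policyholders have different numbers of rows.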

## 002: Research relevant features

While advances in the motor insurance sector now make it possible to use telematics data points, in this segment I will explore traditional features chosen mostly on domain knowledge ([some are captured here](https://www.researchgate.net/publication/338007809_An_Analysis_of_the_Risk_Factors_Determining_Motor_Insurance_Premium_in_a_Small_Island_State_The_Case_of_Malta)). Some key ones that align with this dataset include:

- Type of Vehicle
- Value of the vehicle
- Age of the driver
- Vehicle technology equipment (easy proxy could be age of vehicle to imply that newer vehicles have more sophisticated technology)
- Geographic location
- Repair cost of the vehicle
- Occupation of the driver
- Medical condition of the driver
- Recent performance vehicle modifications (power-to-weight ratio, brake horsepower, etc.)
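
As an illustration of the last item, a power-to-weight ratio can be derived from the `Power` and `Weight` columns. This is a minimal sketch, assuming `Power` is in horsepower and `Weight` in kilograms as described in the variable list. The helper name `add_power_to_weight` and the output column `Power_to_weight` are assumptions made for this example:

```python
import pandas as pd

def add_power_to_weight(df: pd.DataFrame) -> pd.DataFrame:
    """Add horsepower per tonne; rows with a missing or non-positive weight get NaN."""
    # where() masks non-positive weights so the division never hits zero.
    weight_tonnes = df["Weight"].where(df["Weight"] > 0) / 1000.0
    return df.assign(Power_to_weight=df["Power"] / weight_tonnes)

vehicles = pd.DataFrame({"Power": [80, 110, 60], "Weight": [1000, 1375, 0]})
with_ratio = add_power_to_weight(vehicles)
```

Masking invalid weights keeps infinities out of the feature, which matters if it is later scaled or fed to a linear model.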

In [70]:
from datetime import datetime
from typing import Any
import numpy as np

def convert_to_datetime(value:object, format:str="%d/%m/%Y") -> pd.Timestamp:
    # Dates in this dataset are day-first (DD/MM/YYYY), so parse them with an explicit format
    return pd.to_datetime(value, format=format)

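def take_datetime_difference_in_years(first_datetime:datetime, second_datetime:datetime, interval:str='D') -> float:
    # Elapsed time expressed in `interval` units (e.g. 'D' for days, 'h' for hours)
    diff = (second_datetime - first_datetime) / np.timedelta64(1, interval)
    # Number of `interval` units in an average year of 365.25 days (8766 hours)
    units_per_year = np.timedelta64(8766, 'h') / np.timedelta64(1, interval)
    return diff / units_per_year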

def take_int_difference(first_number:int, second_number:int) -> int:
    return int(first_number - second_number)


In [82]:
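# Ages are measured against the run date, so these values change each time the notebook is executed
today_date = pd.Timestamp.today()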
features_trainset = (
    trainset
    .assign(
        Date_birth_dt=trainset['Date_birth'].apply(convert_to_datetime),
        Date_driving_licence_dt=trainset['Date_driving_licence'].apply(convert_to_datetime))
    .assign(
        Driver_age_years=lambda df: df['Date_birth_dt'].apply(take_datetime_difference_in_years, args=(today_date, 'D')),
        Driver_experience_years=lambda df: df['Date_driving_licence_dt'].apply(take_datetime_difference_in_years, args=(today_date, 'D')),
    )
)
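
Because `pd.Timestamp.today()` changes from run to run, the age and experience columns computed above are not reproducible. One hedged alternative is to parse each column in a single vectorized call and measure against a fixed reference date. The date 31/12/2019 is borrowed from the `Value_vehicle` description and is an assumption, as is the helper name `years_between`:

```python
import pandas as pd

# Assumption: the valuation date quoted for Value_vehicle is a sensible reference point.
REFERENCE_DATE = pd.Timestamp("2019-12-31")

def years_between(dates: pd.Series, reference: pd.Timestamp, fmt: str = "%d/%m/%Y") -> pd.Series:
    """Parse day-first date strings and return the elapsed years up to `reference`."""
    parsed = pd.to_datetime(dates, format=fmt)
    # Whole days elapsed, converted with an average year length of 365.25 days.
    return (reference - parsed).dt.days / 365.25

sample = pd.DataFrame({
    "Date_birth": ["15/04/1956", "26/02/1953"],
    "Date_driving_licence": ["20/03/1976", "30/06/1971"],
})
sample = sample.assign(
    Driver_age_years=years_between(sample["Date_birth"], REFERENCE_DATE),
    Driver_experience_years=years_between(sample["Date_driving_licence"], REFERENCE_DATE),
)
```

Parsing the whole column at once also avoids the per-row overhead of `Series.apply`, which can be noticeable on a dataset of this size.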

In [83]:
features_trainset

Unnamed: 0,ID,Date_birth,Date_driving_licence,Distribution_channel,Premium,Cost_claims_year,Type_risk,Area,Second_driver,Year_matriculation,...,Value_vehicle,N_doors,Type_fuel,Length,Weight,claims_frequency,Date_birth_dt,Date_driving_licence_dt,Driver_age_years,Driver_experience_years
0,1,15/04/1956,20/03/1976,0,222.52,0.0,1,0,0,2004,...,7068.00,0,P,,190,,1956-04-15,1976-03-20,69.376307,49.447491
1,1,15/04/1956,20/03/1976,0,213.78,0.0,1,0,0,2004,...,7068.00,0,P,,190,,1956-04-15,1976-03-20,69.376307,49.447491
2,1,15/04/1956,20/03/1976,0,214.84,0.0,1,0,0,2004,...,7068.00,0,P,,190,,1956-04-15,1976-03-20,69.376307,49.447491
3,1,15/04/1956,20/03/1976,0,216.99,0.0,1,0,0,2004,...,7068.00,0,P,,190,,1956-04-15,1976-03-20,69.376307,49.447491
4,2,15/04/1956,20/03/1976,0,213.70,0.0,1,0,0,2004,...,7068.00,0,P,,190,,1956-04-15,1976-03-20,69.376307,49.447491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84439,42325,26/02/1953,30/06/1971,0,288.48,0.0,3,0,0,2010,...,20600.00,5,P,4.256,1186,,1953-02-26,1971-06-30,72.508408,54.170283
84440,42325,26/02/1953,30/06/1971,0,295.70,0.0,3,0,0,2010,...,20600.00,5,P,4.256,1186,,1953-02-26,1971-06-30,72.508408,54.170283
84441,42325,26/02/1953,30/06/1971,0,291.26,0.0,3,0,0,2010,...,20600.00,5,P,4.256,1186,,1953-02-26,1971-06-30,72.508408,54.170283
84442,42325,26/02/1953,30/06/1971,0,288.34,0.0,3,0,0,2010,...,20600.00,5,P,4.256,1186,,1953-02-26,1971-06-30,72.508408,54.170283


In [None]:
# TODO: write a function that takes the integer difference of two years to get the age of the car
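
One possible way to address this TODO, sketched under assumptions: subtract `Year_matriculation` from a reference year. The reference year 2019 (taken from the `Value_vehicle` valuation date), the helper name `add_vehicle_age` and the output column `Vehicle_age_years` are all assumptions rather than part of the notebook:

```python
import pandas as pd

def add_vehicle_age(df: pd.DataFrame, reference_year: int = 2019) -> pd.DataFrame:
    """Add Vehicle_age_years as reference_year minus the registration year, floored at 0."""
    # clip() guards against vehicles registered after the reference year.
    age = (reference_year - df["Year_matriculation"]).clip(lower=0)
    return df.assign(Vehicle_age_years=age)

cars = pd.DataFrame({"Year_matriculation": [2004, 2010, 2019]})
cars_with_age = add_vehicle_age(cars)
```

The same result could be obtained row by row with `take_int_difference`, but the column-wise subtraction is shorter and avoids `apply`.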