# Hastings Direct Takehome

Background:
Insurance companies make pricing decisions based on historical claims experience. The more recent the claims experience, the more predictive it may be of future losses. For many large claims, however, the exact cost is not known at the time of the accident. In fact, some cases take years to develop and settle, so companies sometimes learn that a claim is large several years after the accident took place.
Your Underwriting Director believes it is possible to predict the ultimate value of individual claims well in advance by using FNOL (First Notification Of Loss) characteristics. This is the information recorded when the claim is first notified. If so, it would allow the company to know about future costs earlier and this information could be used to make better pricing decisions.
You are given a historical dataset of a particular type of claim - head-on collisions - and are also told their individual current estimated values (labelled Incurred). Given these claims are now a few years old, you can assume the incurred values are equal to the cost at which the claims will finally settle.

Task breakdown:
1) Using this data, build a model to predict the ultimate individual claim amounts
2) Prepare a 15-minute presentation summarising your model. Your presentation should either be in notebook format or a more traditional slide deck. If you opt for the slide deck approach, please make sure that you provide supporting code.
Your presentation should cover the following aspects:
- Issues identified with the data and how these were addressed
- Data cleansing
- Model specification and justification for selecting this model specification
- Assessment of your model's accuracy and model diagnostics
- Suggestions of how your model could be improved
- Practical challenges for implementing your model

Note: columns beginning with TP_* show the number of third parties involved in an accident (under a given category)

# Modelling

See notebook 01_ for data processing etc

In [27]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import plotly.express as px
from pprint import pprint
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score
import statsmodels.api as sm
import pickle
from catboost import CatBoostRegressor, Pool
import optuna
pd.set_option('display.max_rows', 500)

In [2]:
full_train = pd.read_parquet('data/full_train.parquet')
full_test = pd.read_parquet('data/full_test.parquet')
recent_train = pd.read_parquet('data/recent_train.parquet')
recent_test = pd.read_parquet('data/recent_test.parquet')

In [3]:
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

recent_count_encoders = load_pickle('data/recent_count_encoders.pkl')
full_count_encoders = load_pickle('data/full_count_encoders.pkl')
categories_grouped = load_pickle('data/categories_grouped.pkl')
recent_cat_encoders = load_pickle('data/recent_cat_encoders.pkl')
full_cat_encoders = load_pickle('data/full_cat_encoders.pkl')

A reminder of date-based feature quality and how the full and recent train/test splits are defined:

### Train/test split

Features with bad coverage across dates:

* count of observations: low until 2012, when it starts growing
* weather: spike in n/k after 2015
* main_driver: only use after 2012
* ph_considered_tp_at_fault: 2012+
* tp_type: 2012+
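
One way to check coverage like this is to tabulate, per year of loss, the row count and the share of rows where a feature is missing or recorded as unknown. The sketch below assumes a date column named `date_of_loss` (it appears in the column list) and treats `'n/k'` as the unknown label. `coverage_by_year` and its parameters are hypothetical names, and the actual labels in the data may differ.

```python
import pandas as pd

def coverage_by_year(df, col, date_col='date_of_loss', unknown_labels=('n/k',)):
    """Per year: row count and share of rows where `col` is missing or unknown."""
    year = pd.to_datetime(df[date_col]).dt.year
    # 1.0 where the value is missing or carries an "unknown" label, else 0.0
    is_unknown = (df[col].isna() | df[col].isin(list(unknown_labels))).astype(float)
    out = is_unknown.groupby(year).agg(n_rows='size', share_unknown='mean')
    out.index.name = 'year'
    return out

# Example call (assumes the raw claims DataFrame is available as `claims`):
# coverage_by_year(claims, 'weather_conditions')
```

A feature whose `share_unknown` jumps in particular years is a candidate for restricting to the recent split only.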

Date Splits:

1. full_train - up to 2014
2. full_test - 2014 onwards
3. recent_train - 2012 to 2014 Q3
4. recent_test - 2014 Q3 onwards
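
The date splits above can be sketched as a simple cutoff on the loss date. The exact boundaries are assumptions here: `2014-01-01` for the full split, `2014-07-01` (start of Q3) for the recent split, and `2012-01-01` as the start of the recent window. `split_by_date` is a hypothetical helper, not the code from the processing notebook.

```python
import pandas as pd

def split_by_date(df, cutoff, start=None, date_col='date_of_loss'):
    """Rows before `cutoff` go to train; rows on or after it go to test.

    If `start` is given, rows before `start` are dropped from both parts.
    """
    dates = pd.to_datetime(df[date_col])
    if start is None:
        in_window = pd.Series(True, index=df.index)
    else:
        in_window = dates >= pd.Timestamp(start)
    is_train = dates < pd.Timestamp(cutoff)
    return df[in_window & is_train], df[in_window & ~is_train]

# Example calls (assumed boundaries; `claims` is a hypothetical raw DataFrame):
# full_train, full_test = split_by_date(claims, cutoff='2014-01-01')
# recent_train, recent_test = split_by_date(claims, cutoff='2014-07-01', start='2012-01-01')
```

Splitting on time rather than at random keeps the test set strictly later than the training set, which is closer to how the model would be used on newly notified claims.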

In [4]:
all_cols = list(full_train.columns)

In [5]:
categories_grouped['tp_type']

['tp_type_insd_pass_back',
 'tp_type_driver',
 'tp_type_pass_back',
 'tp_type_pass_front',
 'tp_type_bike',
 'tp_type_cyclist',
 'tp_type_pedestrian',
 'tp_type_other',
 'tp_type_nk']

In [6]:
all_cols

['claim_number',
 'date_of_loss',
 'notifier',
 'notification_period',
 'inception_to_loss',
 'location_of_incident',
 'weather_conditions',
 'vehicle_mobile',
 'time_hour',
 'main_driver',
 'ph_considered_tp_at_fault',
 'injury_details_present',
 'tp_type_insd_pass_back',
 'tp_type_driver',
 'tp_type_pass_back',
 'tp_type_pass_front',
 'tp_type_bike',
 'tp_type_cyclist',
 'tp_type_pedestrian',
 'tp_type_other',
 'tp_type_nk',
 'tp_injury_whiplash',
 'tp_injury_traumatic',
 'tp_injury_fatality',
 'tp_injury_unclear',
 'tp_injury_nk',
 'tp_region_eastang',
 'tp_region_eastmid',
 'tp_region_london',
 'tp_region_north',
 'tp_region_northw',
 'tp_region_outerldn',
 'tp_region_scotland',
 'tp_region_southe',
 'tp_region_southw',
 'tp_region_wales',
 'tp_region_westmid',
 'tp_region_yorkshire',
 'incurred',
 'capped_incurred',
 'incurred_log',
 'capped_incurred_log',
 'ds',
 'missing_target',
 'tp_type',
 'tp_injury',
 'tp_region',
 'tp_total',
 'year',
 'month',
 'day',
 'dayofweek',
 'hour

In [7]:
id_cols = ['claim_number']

In [8]:
cols_with_recent_coverage = ['main_driver', 'ph_considered_tp_at_fault'] + categories_grouped['tp_type'] + ['tp_type']

In [9]:
cat_cols = list(full_train.select_dtypes(include=['object', 'category']).columns)
cat_cols

['notifier',
 'location_of_incident',
 'weather_conditions',
 'vehicle_mobile',
 'main_driver',
 'ph_considered_tp_at_fault',
 'tp_type',
 'tp_injury',
 'tp_region',
 'day_group',
 'hour_bin',
 'dow_bin',
 'month_bin',
 'day_bin']

In [10]:
cols_with_full_coverage = [
    c for c in all_cols
    if c not in cols_with_recent_coverage 
    and c not in id_cols 
    and c not in cat_cols
    ]

In [11]:
cols_with_full_coverage

['date_of_loss',
 'notification_period',
 'inception_to_loss',
 'time_hour',
 'injury_details_present',
 'tp_injury_whiplash',
 'tp_injury_traumatic',
 'tp_injury_fatality',
 'tp_injury_unclear',
 'tp_injury_nk',
 'tp_region_eastang',
 'tp_region_eastmid',
 'tp_region_london',
 'tp_region_north',
 'tp_region_northw',
 'tp_region_outerldn',
 'tp_region_scotland',
 'tp_region_southe',
 'tp_region_southw',
 'tp_region_wales',
 'tp_region_westmid',
 'tp_region_yorkshire',
 'incurred',
 'capped_incurred',
 'incurred_log',
 'capped_incurred_log',
 'ds',
 'missing_target',
 'tp_total',
 'year',
 'month',
 'day',
 'dayofweek',
 'hour',
 'weekofyear',
 'quarter',
 'is_weekend',
 'month_sin',
 'month_cos',
 'dayofweek_sin',
 'dayofweek_cos',
 'hour_sin',
 'hour_cos',
 'days_since_start',
 'is_uk_holiday',
 'notifier_loo',
 'location_of_incident_loo',
 'weather_conditions_loo',
 'vehicle_mobile_loo',
 'main_driver_loo',
 'ph_considered_tp_at_fault_loo',
 'injury_details_present_loo',
 'tp_region_

In [15]:
target = 'capped_incurred'

mask_full_train = full_train[target].notna()
mask_full_test = full_test[target].notna()
mask_recent_train = recent_train[target].notna()
mask_recent_test = recent_test[target].notna()

full_train = full_train.loc[mask_full_train, id_cols + cols_with_full_coverage]
full_test = full_test.loc[mask_full_test, id_cols + cols_with_full_coverage]
recent_train = recent_train.loc[mask_recent_train, id_cols + cols_with_full_coverage + cols_with_recent_coverage]
recent_test = recent_test.loc[mask_recent_test, id_cols + cols_with_full_coverage + cols_with_recent_coverage]

# Random Forest

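I am trying a tree ensemble first because it trains quickly and tends to predict well on tabular data. The implementation below uses CatBoost, which is a gradient-boosted tree model rather than a random forest in the strict sense. If I want a more explainable model later, I can use its feature importances to help narrow down the predictors.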
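
As a sketch of how feature importance could be used to shortlist predictors: CatBoost models expose `get_feature_importance()`, which returns one score per feature in training order. The helper below (`rank_features` is a hypothetical name) pairs those scores with the feature names and sorts them. It only relies on the model having that method, so it is a rough starting point rather than a full feature selection procedure.

```python
import pandas as pd

def rank_features(model, features, top_n=10):
    """Return the `top_n` features sorted by the model's importance scores."""
    # One importance value per feature, in the order the model was trained on
    importances = model.get_feature_importance()
    ranked = pd.Series(importances, index=features, name='importance')
    return ranked.sort_values(ascending=False).head(top_n)

# Example call, once a model and its feature list exist:
# rank_features(model, features, top_n=20)
```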

In [16]:
def train_catboost_rf(
    train_df
  , target='capped_incurred'
  , id_col='claim_number'
  , exclude_cols=None
  , tune=False
):
    if exclude_cols is None:
        exclude_cols = []

    # Prepare feature set
    features = train_df.drop(columns=[target, id_col] + exclude_cols).columns.tolist()
    X = train_df[features]
    y = train_df[target]

    # Identify categorical features (CatBoost can natively handle them).
    # Select category and object columns together so neither kind is missed.
    cat_features = X.select_dtypes(include=['category', 'object']).columns.tolist()

    train_pool = Pool(data=X, label=y, cat_features=cat_features)

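    if tune:
        # Score each trial on a held-out split: in-sample RMSE would favour overfit settings.
        # A random split is used for simplicity; a date-based holdout would mirror the test setup more closely.
        X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
        fit_pool = Pool(data=X_fit, label=y_fit, cat_features=cat_features)
        val_pool = Pool(data=X_val, label=y_val, cat_features=cat_features)

        def objective(trial):
            params = {
                'loss_function': 'RMSE'
              , 'eval_metric': 'RMSE'
              , 'random_strength': trial.suggest_float('random_strength', 0, 10)
              , 'depth': trial.suggest_int('depth', 4, 10)
              , 'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True)
              , 'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10, log=True)
              , 'bagging_temperature': trial.suggest_float('bagging_temperature', 0, 1)
              , 'border_count': trial.suggest_int('border_count', 32, 255)
              , 'iterations': 500
              , 'random_seed': 42
              , 'verbose': 0
              , 'task_type': 'CPU'
            }
            model = CatBoostRegressor(**params)
            model.fit(fit_pool)
            preds = model.predict(val_pool)
            return ((preds - y_val) ** 2).mean() ** 0.5  # validation RMSE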

        study = optuna.create_study(direction='minimize')
        study.optimize(objective, n_trials=30)
        best_params = study.best_params
        best_params.update({
            'loss_function': 'RMSE'
          , 'iterations': 500
          , 'verbose': 0
          , 'random_seed': 42
          , 'task_type': 'CPU'
        })
    else:
        # Default config
        best_params = {
            'loss_function': 'RMSE'
          , 'iterations': 300
          , 'depth': 6
          , 'learning_rate': 0.05
          , 'random_strength': 1
          , 'bagging_temperature': 0.5
          , 'l2_leaf_reg': 3
          , 'verbose': 0
          , 'random_seed': 42
          , 'task_type': 'CPU'
        }

    model = CatBoostRegressor(**best_params)
    model.fit(train_pool)

    return model, features

In [19]:
model, features = train_catboost_rf(
    train_df=full_train
  , target='capped_incurred'
  , id_col='claim_number'
  , exclude_cols=None
  , tune=False
)

In [21]:
model

<catboost.core.CatBoostRegressor at 0x1440fea10>

In [37]:
import matplotlib.pyplot as plt

def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5

# (metric function, display name) pairs used for the summary and the grouped plots
METRICS = [(rmse, 'rmse'), (mean_absolute_error, 'mae'), (r2_score, 'r2')]

# Named evaluate rather than eval to avoid shadowing the Python builtin
def evaluate(preds, test_df, target='capped_incurred', date_col=None, group_cols=None):
    # Score only rows where the target is known
    mask = test_df[target].notna().to_numpy()
    df = test_df.loc[mask].copy()
    df['pred'] = np.asarray(preds)[mask]

    summary = {name: fn(df[target], df['pred']) for fn, name in METRICS}
    print('Overall metrics:', summary)

    # Helper to plot one metric per group (e.g. per month or per category level)
    def plot_grouped_metric(groupby_col, metric_fn, metric_name):
        grouped = pd.Series({
            key: metric_fn(g[target], g['pred'])
            for key, g in df.groupby(groupby_col)
        })

        plt.figure()
        plt.plot(grouped.index, grouped.values, marker='o')
        plt.title(f'{metric_name.upper()} by {groupby_col}')
        plt.xlabel(groupby_col)
        plt.ylabel(metric_name.upper())
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

    plot_cols = []
    if date_col and date_col in df.columns:
        # Bucket by calendar month so the metrics can be tracked over time
        dates = pd.to_datetime(df[date_col], errors='coerce')
        df['__date'] = dates.dt.to_period('M').dt.to_timestamp()
        plot_cols.append('__date')
    plot_cols += [c for c in (group_cols or []) if c in df.columns]

    for col in plot_cols:
        for fn, name in METRICS:
            plot_grouped_metric(col, fn, name)

    return summary

In [38]:
# Predict using only the columns the model was trained on
preds = model.predict(full_test[features])

In [39]:
evaluate(preds, full_test)

Overall metrics: {'rmse': 214.10323094077987, 'mae': 118.39104471623449, 'r2': 0.9995643521383518}


{'rmse': 214.10323094077987,
 'mae': 118.39104471623449,
 'r2': 0.9995643521383518}
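
An R² of 0.9996 is very high for individual claim severity, so target leakage is worth ruling out before trusting it. The printed `cols_with_full_coverage` list includes `incurred`, `incurred_log` and `capped_incurred_log`, and `train_catboost_rf` only drops the target and the ID column, so these appear to reach the model as features. The sketch below flags numeric columns that are near-perfect monotone functions of the target. `find_target_proxies` and the 0.99 threshold are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

def find_target_proxies(df, target, threshold=0.99):
    """Flag numeric columns whose rank correlation with the target is near 1."""
    target_rank = df[target].rank()
    flagged = []
    for col in df.select_dtypes(include='number').columns:
        if col == target:
            continue
        # Pearson correlation of ranks equals Spearman correlation,
        # which also catches monotone transforms such as a log.
        if abs(df[col].rank().corr(target_rank)) >= threshold:
            flagged.append(col)
    return flagged

# Example use (assumes the objects defined earlier in this notebook):
# leaky = find_target_proxies(full_train, 'capped_incurred')
# model, features = train_catboost_rf(full_train, exclude_cols=leaky)
```

Any flagged columns could be passed through `exclude_cols` and the metrics recomputed. The `_loo` encodings may deserve a similar look if they were fitted using the target.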