# Severity Modelling

## Overview

- This notebook covers the second part of the frequency-severity modelling approach, focusing on predicting claim amounts (severity)
- The target variable is `Cost_claims_year` which represents the total claim amount per policy per year
- We use a Gamma GLM since claim amounts are strictly positive and right-skewed
- Recall that the pure premium (the expected claim cost, before expenses and loadings) is `frequency * severity`
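As a quick illustration of the decomposition in the last bullet, the sketch below uses invented portfolio figures (none of them come from the dataset) and checks that frequency times severity reproduces the average claim cost per policy-year.

```python
# Hypothetical portfolio totals for one year (illustrative numbers only).
n_policy_years = 1000        # exposure: number of policy-years
n_claims = 80                # total number of claims
total_claim_cost = 120000.0  # total amount paid on those claims

frequency = n_claims / n_policy_years      # expected claims per policy-year
severity = total_claim_cost / n_claims     # expected cost per claim
pure_premium = frequency * severity        # expected cost per policy-year

print(f"frequency    = {frequency:.3f}")
print(f"severity     = {severity:.2f}")
print(f"pure premium = {pure_premium:.2f}")
```

The product equals `total_claim_cost / n_policy_years`, which is why the two components can be modelled separately and multiplied afterwards.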

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import GammaRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

## 1. Prepare Dataset

| Step | Action |
|------|--------|
| Load | Read insurance and claims data |
| Aggregate | Count claim records by (`ID`, `Cost_claims_year`) |
| Merge | Left join on (`ID`, `Cost_claims_year`) |
| Fill | NaN → 0 for policies with no claims |
| Split | 80/20 train-test split |
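The aggregate, merge and fill steps in the table can be sketched on toy data. The two small frames below are invented stand-ins for the real CSV files; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the insurance and claims files (all values are invented).
insurance = pd.DataFrame({
    "ID": [1, 2, 3],
    "Cost_claims_year": [250.0, 0.0, 900.0],
})
claims = pd.DataFrame({
    "ID": [1, 1, 3],
    "Cost_claims_year": [250.0, 250.0, 900.0],
    "Cost_claims_by_type": [100.0, 150.0, 900.0],
})

# Aggregate: number of claim records per key.
claims_frequency = (
    claims
    .groupby(["ID", "Cost_claims_year"])
    .agg(claims_frequency=("Cost_claims_by_type", "count"))
    .reset_index()
)

# Merge + fill: policies without a matching claim record get a count of 0.
dataset = (
    insurance
    .merge(claims_frequency, how="left", on=["ID", "Cost_claims_year"])
    .fillna({"claims_frequency": 0})
)
print(dataset)
```

The left join keeps every policy row, so the policy with no claim records survives the merge and ends up with `claims_frequency` equal to 0 rather than being dropped.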

### 1.1 Load and merge data

In [None]:
insurance = pd.read_csv('../../data/input/Motor_vehicle_insurance_data.csv', delimiter=';')
claims = pd.read_csv('../../data/input/sample_type_claim.csv', delimiter=';')

claims_frequency = (
    claims
    .groupby(['ID', 'Cost_claims_year'])
    .agg({'Cost_claims_by_type': 'count'})
    .rename(columns={'Cost_claims_by_type': 'claims_frequency'})
    .reset_index()
)
dataset = (
    pd
    .merge(
        left=insurance,
        right=claims_frequency,
        how='left',
        on=['ID', 'Cost_claims_year']
    )
    .fillna(value={'claims_frequency':0})
)
trainset, testset = train_test_split(dataset, test_size=0.2, random_state=42, shuffle=True)

## 2. Engineer Relevant Features

In [None]:
from src.model.freq_sev.feature import main as feature_main
features_trainset = feature_main(trainset)
features_testset = feature_main(testset)

## 3. Analyse Target Variable [on training data]

The response variable is the claim amount, stored as `Cost_claims_year` in the dataset. Unlike frequency modelling, which predicts the number of claims, severity modelling predicts the monetary amount.

Key considerations for severity modelling:
1. Distribution of the response variable (typically right-skewed)
2. Only policies with claims (severity > 0) are used for training
3. Gamma distribution is appropriate for positive continuous outcomes

### 3.1 Distribution of claim severity

In [None]:
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(15, 5))
ax0.set_title('Loss Distribution')
_ = features_trainset['Cost_claims_year'].hist(bins=40, log=True, ax=ax0)

# keep only the central 95% of values so the bulk of the distribution is visible
severity = features_trainset['Cost_claims_year']
p2_5, p97_5 = np.percentile(severity, [2.5, 97.5])
middle_95 = severity[(severity >= p2_5) & (severity <= p97_5)]
ax1.set_title('Middle-95% Loss Distribution (2.5%-97.5%)')
_ = middle_95.hist(bins=40, log=False, ax=ax1)
print("Average loss: {}".format(np.average(severity)))

- The distribution is heavily right-skewed with a long tail of high-value claims
- The middle 95% shows the bulk of claims are concentrated at lower values
- This shape is consistent with a Gamma distribution, which supports (but does not prove) the choice of a Gamma GLM
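A histogram is suggestive rather than conclusive, so a numeric summary can help. The sketch below simulates Gamma draws with hypothetical parameters (not fitted to this dataset) and compares sample skewness and coefficient of variation with their theoretical values, `2 / sqrt(shape)` and `1 / sqrt(shape)`. The helper `sample_skewness` is defined here for illustration only.

```python
import random
import statistics


def sample_skewness(values):
    """Skewness as the mean of cubed standardised deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


rng = random.Random(42)
shape, scale = 2.0, 500.0  # hypothetical Gamma parameters
simulated = [rng.gammavariate(shape, scale) for _ in range(50_000)]

skew = sample_skewness(simulated)
cv = statistics.pstdev(simulated) / statistics.fmean(simulated)

print(f"sample skewness: {skew:.2f} (theory: {2 / shape ** 0.5:.2f})")
print(f"coefficient of variation: {cv:.2f} (theory: {1 / shape ** 0.5:.2f})")
```

Applying the same two statistics to the positive values of `Cost_claims_year` gives a rough sense of whether a single Gamma shape is plausible: for a Gamma distribution the skewness should be close to twice the coefficient of variation.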

## 4. Model Development

| Model | Type | Description |
|-------|------|-------------|
| Baseline | DummyRegressor | Predict mean |
| Gamma | GLM | Severity-specific, handles positive skewed data |

### 4.1 Define training variables and evaluation metrics

In [None]:
train_variables = ['Car_age_years', 'Type_risk', 'Area', 'Value_vehicle', 'Distribution_channel', 'Cylinder_capacity']
target = ['Cost_claims_year']

In [None]:
def model_evaluation_metrics(estimator, df_test, target_variable, training_variables):
    """Print the MSE and MAE of `estimator` on the rows of `df_test`."""
    y_true = df_test[target_variable]
    y_pred = estimator.predict(df_test[training_variables])
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"MSE: {mse:.3f}")
    print(f"MAE: {mae:.3f}")
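MSE and MAE are dominated by a few very large claims when the target is right-skewed. A complementary metric for a Gamma model is the mean Gamma deviance, which scores relative rather than absolute errors. scikit-learn provides `sklearn.metrics.mean_gamma_deviance`; the dependency-free sketch below (the function name `mean_gamma_deviance_sketch` and all numbers are illustrative) shows the formula being averaged.

```python
import math


def mean_gamma_deviance_sketch(y_true, y_pred):
    """Average of the unit Gamma deviance 2 * (log(mu / y) + y / mu - 1)."""
    if any(y <= 0 for y in y_true) or any(mu <= 0 for mu in y_pred):
        raise ValueError("Gamma deviance needs strictly positive values")
    terms = [
        2.0 * (math.log(mu / y) + y / mu - 1.0)
        for y, mu in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


observed = [200.0, 1000.0, 5000.0]                         # invented severities
constant = [sum(observed) / len(observed)] * len(observed)  # mean-only baseline
closer = [250.0, 900.0, 4500.0]                            # invented predictions

print("baseline:", mean_gamma_deviance_sketch(observed, constant))
print("closer:  ", mean_gamma_deviance_sketch(observed, closer))
```

Lower is better, and a perfect prediction scores 0. Because each term depends only on the ratio `y / mu`, predicting 250 for an observed 200 is penalised as much as predicting 5000 for an observed 4000.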

### 4.2 Filter for policies with claims

Because we are fitting a Gamma regressor, we filter out policies where no claims have been made. The Gamma distribution only supports strictly positive values (y > 0).

In [None]:
train_mask = features_trainset['Cost_claims_year'] > 0
updated_features_trainset = features_trainset[train_mask]
test_mask = features_testset['Cost_claims_year'] > 0
updated_features_testset = features_testset[test_mask]
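To see why the filter is needed, the sketch below fits `GammaRegressor` on invented toy data that contains one zero-cost policy. In the scikit-learn versions I am aware of, the fit is rejected with a `ValueError`; after applying the same `> 0` mask, the fit goes through.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

# Invented toy data: one feature, with a single zero-cost policy in the targets.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([120.0, 0.0, 300.0, 450.0, 800.0])

try:
    GammaRegressor().fit(X, y)
    fit_with_zero_failed = False
except ValueError as err:
    fit_with_zero_failed = True
    print(f"Fit rejected: {err}")

# Same idea as the mask above: keep strictly positive targets only.
mask = y > 0
model = GammaRegressor().fit(X[mask], y[mask])
predictions = model.predict(X)
print(predictions)
```

Because `GammaRegressor` uses a log link, its predictions are always strictly positive, including for the row that was excluded from training.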

### 4.3 Model 1: Baseline (Mean Prediction)

In [None]:
from sklearn.dummy import DummyRegressor
dummy = DummyRegressor(strategy="mean")
dummy_regressor = dummy.fit(updated_features_trainset[train_variables], updated_features_trainset[target])
print("Constant mean severity evaluation:")
model_evaluation_metrics(estimator=dummy_regressor, df_test=updated_features_testset, target_variable=target, training_variables=train_variables)

### 4.4 Model 2: Gamma Regressor

In [None]:
gamma = GammaRegressor(alpha=10, solver="newton-cholesky", max_iter=10000)
gamma_regressor = gamma.fit(updated_features_trainset[train_variables], updated_features_trainset[target].values.ravel())
print("Gamma regression evaluation:")
model_evaluation_metrics(estimator=gamma_regressor, df_test=updated_features_testset, target_variable=target, training_variables=train_variables)

## 5. Model Evaluation

| Check | Purpose |
|-------|---------|
| Metrics comparison | Compare MSE/MAE across models |
| Observed vs Predicted | Visual calibration check |

### 5.1 Model comparison

In [None]:
for model in [dummy_regressor, gamma_regressor]:
    print(f"Now evaluating model {model.__class__.__name__}")
    model_evaluation_metrics(estimator=model, df_test=updated_features_testset, target_variable=target, training_variables=train_variables)
    print("-------------")

### 5.2 Observed vs Predicted visualisation

In [None]:
def plot_obs_pred(
    df,
    feature,
    observed,
    predicted,
    y_label=None,
    title=None,
    ax=None,
    fill_legend=False,
):
    # average observed and predicted values per feature level
    # (no exposure weights are used here, so a plain mean replaces the weighted sum)
    df_ = df.loc[:, [feature]].copy()
    df_["observed"] = df[observed]
    df_["predicted"] = predicted
    df_ = df_.groupby([feature])[["observed", "predicted"]].mean()
    ax = df_.loc[:, ["observed", "predicted"]].plot(style=".", ax=ax)
    y_max = df_.loc[:, ["observed", "predicted"]].values.max() * 0.8
    p2 = ax.fill_between(
        df_.index,
        0,
        y_max, #* df_[weight] / df_[weight].values.max(),
        color="g",
        alpha=0.1,
    )
    if fill_legend:
        ax.legend([p2], ["{} distribution".format(feature)])
    ax.set(
        ylabel=y_label,
        title=title if title is not None else "Train: Observed vs Predicted",
    )

In [None]:
feature_col = target[0]
fig, ax = plt.subplots(ncols=1, figsize=(15, 5))
plot_obs_pred(
    df=updated_features_testset,
    feature=feature_col,
    observed=feature_col,
    predicted=gamma_regressor.predict(updated_features_testset[train_variables]),
    y_label="Average claim severity",
    title="Predicted vs Observed",
    ax=ax
)
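The plot can be complemented by a numeric calibration check per feature level. The sketch below uses an invented scored frame; the column `Area` appears in `train_variables`, but its values and the `observed` and `predicted` numbers here are made up for illustration.

```python
import pandas as pd

# Invented test-set rows: a categorical feature, observed costs, model predictions.
scored = pd.DataFrame({
    "Area": [0, 0, 1, 1, 1],
    "observed": [400.0, 600.0, 1000.0, 1400.0, 1200.0],
    "predicted": [450.0, 500.0, 1100.0, 1300.0, 1250.0],
})

# Average observed and predicted severity per level, plus their ratio.
calibration = scored.groupby("Area")[["observed", "predicted"]].mean()
calibration["ratio"] = calibration["predicted"] / calibration["observed"]
print(calibration)
```

A ratio close to 1 in every level suggests the model is well calibrated along that feature; levels with ratios far from 1 point to segments that are systematically over- or under-predicted.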

#TODO: Add more severity models (e.g., GBM with gamma loss), introduce mlflow
