# Modelling

- Let's build a frequency-severity model using both a GLM and a GBM. Recall that the pure premium is the expected claim cost per unit of exposure: `pure premium = frequency * severity`.
- Bonus: along the way, see whether variable importance can identify key rating factors for both frequency and severity that can be used for risk classification.
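
As a quick sanity check on the identity above, here is a minimal worked example. The portfolio figures are made up for illustration and do not come from this dataset.

```python
# Hypothetical portfolio figures, for illustration only.
claim_count = 120            # number of claims observed
policy_years = 1_000         # total exposure in policy-years
total_claim_cost = 90_000.0  # total cost of all claims

frequency = claim_count / policy_years      # claims per policy-year
severity = total_claim_cost / claim_count   # average cost per claim
pure_premium = frequency * severity         # expected cost per policy-year

# The product collapses to total cost divided by total exposure.
print(f"frequency={frequency:.3f}, severity={severity:.2f}, pure premium={pure_premium:.2f}")
```

Modelling the two factors separately is useful because a rating factor can affect how often claims occur and how costly they are in different ways.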

## 001: Create and split the dataset

In [None]:
from src.dataset import Dataset


insurance_initiation_variables_path = "../data/input/exp/Insurance_Initiation_Variables.csv"
claims_variables_path = "../data/input/exp/sample_type_claim.csv"

claim_grouping_columns = ['ID', 'Cost_claims_year']
claim_aggregation_column = 'Cost_claims_by_type'
merging_columns = ['ID', 'Cost_claims_year']

dataset = (
    Dataset(data_path=insurance_initiation_variables_path,
            claims_path=claims_variables_path)
    .group_claims(grouping_columns=claim_grouping_columns,
                  aggregation_column=claim_aggregation_column)
    .create_dataset(merge_columns=merging_columns)
)
trainset, testset = dataset.split_dataset(test_ratio=0.2, to_shuffle=False)

## 002: Engineer relevant features

In [None]:
from src.feature import main as feature_main

features_trainset = feature_main(trainset)
features_testset = feature_main(testset)

In [None]:
# Fill nulls in claims_frequency with 0 - missed during dataset creation step, to be added later
features_trainset.loc[:, 'claims_frequency'] = features_trainset['claims_frequency'].fillna(0)
features_testset.loc[:, 'claims_frequency'] = features_testset['claims_frequency'].fillna(0)

## 003: Frequency modelling - Poisson regression

The response variable is the number of claims per policy, named `claims_frequency` in the dataset. Strictly speaking, claims frequency is a rate: the number of claims per unit of exposure. Because every policy in this dataset has the same exposure (1 year), the raw count already equals the annual rate, so no explicit exposure adjustment is needed.
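
If exposure did vary across policies, one common approach with scikit-learn is to model the claims-per-exposure rate and pass the exposure as `sample_weight`. The sketch below runs on synthetic data; the `exposure` array, the coefficients, and all variable names are assumptions for illustration and are not part of this dataset.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n_policies = 20_000

# Synthetic features and a hypothetical exposure in policy-years.
X_demo = rng.normal(size=(n_policies, 2))
exposure = rng.uniform(0.1, 1.0, size=n_policies)

# True annual claim rate depends on the first feature only.
true_annual_rate = np.exp(-1.0 + 0.3 * X_demo[:, 0])
claim_counts = rng.poisson(true_annual_rate * exposure)

# Model the rate (claims per policy-year), weighting each policy by its exposure.
annual_rate = claim_counts / exposure
rate_model = PoissonRegressor(alpha=1e-12, max_iter=300)
rate_model.fit(X_demo, annual_rate, sample_weight=exposure)

# Expected claim counts are the predicted rate scaled back by exposure.
expected_counts = rate_model.predict(X_demo) * exposure
print("Observed claims:", claim_counts.sum())
print("Expected claims:", round(expected_counts.sum(), 1))
```

With an essentially unpenalised fit, the total expected claims should closely match the total observed claims on the training data, which is a convenient check that the exposure weighting is wired up as intended.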


Let's check a few assumptions before we fit a Poisson regression model:
1. Distribution of the response variable
2. Equidispersion: the mean and variance of the response variable should be roughly equal

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# A single axis is enough: only one plot is drawn in this cell.
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_title('Claims Frequency Distribution')
_ = features_trainset['claims_frequency'].hist(bins=4, log=True, ax=ax)

print(
    "Average claims frequency: {}".format(
        np.average(features_trainset['claims_frequency'])
    )
)

print(
    "Fraction of claims frequency that is zero: {0:.2%}".format(
        # The mean of a boolean Series is the fraction of True values.
        (features_trainset['claims_frequency'] == 0).mean()
    )
)

In [None]:
mean_claims_frequency = features_trainset['claims_frequency'].mean()
var_claims_frequency = features_trainset['claims_frequency'].var()
print(f"Mean of claims_frequency: {mean_claims_frequency :.4f}")
print(f"Variance of claims_frequency: {var_claims_frequency:.4f}")

- From both the histogram and the mean-variance comparison, we can see that the claims frequency distribution is unimodal and right-skewed, and that the mean and variance (after filling nulls with 0) are reasonably close for this dataset. We can therefore proceed to fit a Poisson regression model.
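
Two further quick checks can complement the comparison above: the dispersion ratio (variance divided by mean, roughly 1 under a Poisson assumption) and the observed share of zeros against the share a Poisson distribution with the same mean would imply, `exp(-mean)`. The helper below is a hypothetical sketch, shown on a small made-up array rather than on `features_trainset`.

```python
import numpy as np

def poisson_diagnostics(counts):
    """Summarise how closely a vector of counts resembles a Poisson sample."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    variance = counts.var(ddof=1)  # sample variance, matching the pandas default
    return {
        "mean": mean,
        "variance": variance,
        "dispersion_ratio": variance / mean,  # about 1 if equidispersed
        "observed_zero_fraction": float((counts == 0).mean()),
        "poisson_zero_fraction": float(np.exp(-mean)),  # P(N = 0) for Poisson(mean)
    }

# Made-up claim counts, for illustration only.
demo_counts = np.array([0, 0, 0, 1, 0, 2, 0, 0, 1, 0])
diagnostics = poisson_diagnostics(demo_counts)
for name, value in diagnostics.items():
    print(f"{name}: {value:.4f}")
```

A dispersion ratio well above 1, or many more zeros than `exp(-mean)` suggests, would point towards alternatives such as a negative binomial or zero-inflated model.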

In [None]:
training_variables = ['Car_age_years', 'Type_risk', 'Area', 'Value_vehicle', 'Distribution_channel', 'Cylinder_capacity']
target = ['claims_frequency']

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance, mean_squared_error

def model_evaluation_metrics(estimator, df_test, target_variable=target, training_variables=training_variables):
    """Score an estimator on the test set."""
    # Flatten both arrays to 1-D so that estimators fitted on a 2-D target
    # (which predict 2-D arrays) and on a 1-D target are scored the same way.
    y_true = df_test[target_variable].to_numpy().ravel()
    y_pred = np.asarray(estimator.predict(df_test[training_variables])).ravel()

    print("MSE: %.3f" % mean_squared_error(y_true, y_pred))
    print("MAE: %.3f" % mean_absolute_error(y_true, y_pred))

    # Ignore non-positive predictions, as they are invalid for
    # the Poisson deviance.
    mask = y_pred > 0
    if (~mask).any():
        n_masked, n_samples = (~mask).sum(), mask.shape[0]
        print(
            "WARNING: Estimator yields invalid, non-positive predictions "
            f"for {n_masked} samples out of {n_samples}. These predictions "
            "are ignored when computing the Poisson deviance."
        )

    print(
        "mean Poisson deviance: %.3f"
        % mean_poisson_deviance(y_true[mask], y_pred[mask])
    )


#### Model 1 - Baseline model: predicting the mean

In [None]:
from sklearn.dummy import DummyRegressor
dummy_regressor = DummyRegressor(strategy="mean")
baseline_model = dummy_regressor.fit(features_trainset[training_variables], features_trainset[target])
print("Constant mean frequency evaluation:")
model_evaluation_metrics(estimator=baseline_model, df_test=features_testset, target_variable=target, training_variables=training_variables)

#### Model 2 - Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
ridge_glm = Ridge(alpha=1)
ridge_model = ridge_glm.fit(features_trainset[training_variables], features_trainset[target])
print("Ridge regression evaluation:")
model_evaluation_metrics(estimator=ridge_model, df_test=features_testset, target_variable=target, training_variables=training_variables)

#### Model 3 - Poisson Regression

In [None]:
from sklearn.linear_model import PoissonRegressor
poisson_regressor = PoissonRegressor(alpha=1e-12, solver='newton-cholesky', max_iter=300)
poisson_model = poisson_regressor.fit(features_trainset[training_variables], features_trainset[target].values.ravel())
print("Poisson regression evaluation:")
model_evaluation_metrics(estimator=poisson_model, df_test=features_testset, target_variable=target, training_variables=training_variables)

#### Model 4 - Gradient Boosting Machine


In [None]:
from sklearn.ensemble import HistGradientBoostingRegressor
gbm_regressor = HistGradientBoostingRegressor(loss='poisson', max_leaf_nodes=128)
gbm_model = gbm_regressor.fit(features_trainset[training_variables], features_trainset['claims_frequency'])
print("GBM regression evaluation:")
model_evaluation_metrics(estimator=gbm_model, df_test=features_testset, target_variable=target, training_variables=training_variables)

#TODO: Severity Modelling