# Hastings Direct Takehome

Background:
Insurance companies make pricing decisions based on historical claims experience. The more recent the claims experience, the more predictive it may be of future losses. In the case of many large claims however, the exact cost is not known at the time of the accident. In fact, some cases take years to develop and settle. Companies sometimes learn that a claim is large several years after the accident took place.
Your Underwriting Director believes it is possible to predict the ultimate value of individual claims well in advance by using FNOL (First Notification Of Loss) characteristics. This is the information recorded when the claim is first notified. If so, it would allow the company to know about future costs earlier and this information could be used to make better pricing decisions.
You are given a historical dataset of a particular type of claim - head-on collisions - and are also told their individual current estimated values (labelled Incurred). (Given these claims are now a few years old, you can assume the incurred values are equal to the cost at which the claims will finally settle). 

Task breakdown:
1) Using this data, build a model to predict the ultimate individual claim amounts
2) Prepare a 15 minute presentation summarising your model. Your presentation should either be in notebook format or a more traditional slide deck. If you opt for the slide deck approach, please make sure that you provide supporting code.
Your presentation should cover the following aspects:
- Issues identified with the data and how these were addressed
- Data cleansing
- Model specification and justification for selecting this model specification
- Assessment of your model's accuracy and model diagnostics
- Suggestions of how your model could be improved
- Practical challenges for implementing your model

Note: columns beginning with TP_* show the number of third parties involved in an accident (under a given category)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pandas_profiling import ProfileReport
import plotly.express as px


In [2]:
data = pd.read_csv('data/task_data.csv')

In [3]:
data.head(3).T

Unnamed: 0,0,1,2
Claim Number,1,2,3
date_of_loss,2003-04-15,2003-04-20,2003-04-24
Notifier,PH,CNF,CNF
Loss_code,LD003,LD003,LD003
Loss_description,Head on collision,Head on collision,Head on collision
Notification_period,22,1,5
Inception_to_loss,13,9,17
Location_of_incident,Main Road,Main Road,Main Road
Weather_conditions,NORMAL,WET,WET
Vehicle_mobile,Y,Y,Y


In [4]:
print(f"Dataset shape: {data.shape}")
print(f"Columns: {list(data.columns)}")

Dataset shape: (7691, 47)
Columns: ['Claim Number', 'date_of_loss', 'Notifier', 'Loss_code', 'Loss_description', 'Notification_period', 'Inception_to_loss', 'Location_of_incident', 'Weather_conditions', 'Vehicle_mobile', 'Time_hour', 'Main_driver', 'PH_considered_TP_at_fault', 'Vechile_registration_present', 'Incident_details_present', 'Injury_details_present', 'TP_type_insd_pass_back', 'TP_type_insd_pass_front', 'TP_type_driver', 'TP_type_pass_back', 'TP_type_pass_front', 'TP_type_bike', 'TP_type_cyclist', 'TP_type_pass_multi', 'TP_type_pedestrian', 'TP_type_other', 'TP_type_nk', 'TP_injury_whiplash', 'TP_injury_traumatic', 'TP_injury_fatality', 'TP_injury_unclear', 'TP_injury_nk', 'TP_region_eastang', 'TP_region_eastmid', 'TP_region_london', 'TP_region_north', 'TP_region_northw', 'TP_region_outerldn', 'TP_region_scotland', 'TP_region_southe', 'TP_region_southw', 'TP_region_wales', 'TP_region_westmid', 'TP_region_yorkshire', 'Incurred', 'Capped Incurred', 'Unnamed: 46']


## Data Cleaning

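Many values contain stray tab characters. Strip every non-digit character from the flag, count and currency columns before converting them.

In [5]:
# Every column from 'Vechile_registration_present' onwards should be numeric
cols_to_strip = data.columns[data.columns.get_loc('Vechile_registration_present'):]
# Caution: this pattern also removes decimal points and minus signs
pattern = r'[^0-9]'
data[cols_to_strip] = data[cols_to_strip].apply(lambda col: col.astype(str).str.replace(pattern, '', regex=True))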
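
One caveat with the pattern above: `[^0-9]` deletes decimal points and minus signs as well as tabs, so a value such as `1,250.50` would come out as `125050`. Below is a minimal sketch of a gentler clean that keeps digits, `.` and `-`. The sample strings are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw values with stray tabs, thousands separators and blanks
raw = pd.Series(["\t1", "0\t", "1,250.50", "\t-30", "", None])

# Keep digits, decimal points and minus signs; drop everything else
cleaned = raw.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)

# Anything that still cannot be parsed (e.g. an empty string) becomes NaN
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric.tolist())
```

Whether this matters depends on whether the currency columns actually contain decimals or negative amounts, which is worth checking on the raw file before choosing a pattern.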

Fill missing values in the flag and count columns with 0

In [6]:
flag_cols = data.columns[data.columns.get_loc('Vechile_registration_present'): data.columns.get_loc('Incurred')]
data[flag_cols] = (
    data[flag_cols]
    .replace('', np.nan)
    .fillna(0)
    .astype('int64')
)


Convert currency cols

In [7]:
currency_cols = data.columns[data.columns.get_loc('Incurred'):]
data[currency_cols] = data[currency_cols].replace('', np.nan).astype('float64')


Drop columns that are empty or constant (the same value in every row)

In [8]:
cols_to_drop = [
    'Unnamed: 46'
    ,'Loss_code' # all the same
    ,'Loss_description' # all the same
    ,'TP_type_insd_pass_front' # all zero
    ,'TP_type_pass_multi'
]
data = data.drop(cols_to_drop, axis=1)

In [9]:
data['date_of_loss'] = pd.to_datetime(data['date_of_loss']).dt.date

Weather_conditions has some missing values. Treating them as N/K (not known) seems safe.

In [10]:
data['Weather_conditions'].value_counts(dropna=False)

Weather_conditions
NORMAL          4564
WET             1903
N/K              450
SNOW,ICE,FOG     429
NaN              345
Name: count, dtype: int64

In [11]:
data['Weather_conditions'] = data['Weather_conditions'].fillna('N/K')
data['Weather_conditions'].value_counts(dropna=False)

Weather_conditions
NORMAL          4564
WET             1903
N/K              795
SNOW,ICE,FOG     429
Name: count, dtype: int64

Convert Object to Category columns

In [12]:
cat_cols = data.select_dtypes(include='object').columns.tolist()
data[cat_cols] = data[cat_cols].astype('category')

In [13]:
def standardise_col_names(col):
    return col.lower().replace(' ', '_')

data.columns = [standardise_col_names(c) for c in data.columns]

## EDA

Originally I used Pandas Profiling to find the interesting parts. Here I reproduce the things I noticed.

In [14]:
profile = ProfileReport(
    data, 
    title="Hastings Direct Claims Data - Pandas Profiling Report",
    explorative=True,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": True},
        "phi_k": {"calculate": True},
        "cramers": {"calculate": True}
    },
    missing_diagrams={
        "matrix": True,
        "bar": True,
        "heatmap": True,
        "dendrogram": True
    },
    duplicates={
        "head": 10
    }
)

In [15]:
output_file = "hastings_data_profile.html"
profile.to_file(output_file)

Open questions:
- What does Inception_to_loss mean?

### Targets - Incurred and Capped Incurred 

Check difference between these

In [34]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

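# Melt the relevant numeric columns. 'incurred_log' is only created in a later
# cell, so keep just the columns that exist when this cell runs.
value_vars = [c for c in ['incurred', 'capped_incurred', 'incurred_log'] if c in data.columns]
melted = data.melt(
    value_vars=value_vars,
    var_name='variable',
    value_name='value'
)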

# Drop non-numeric or missing entries
melted['value'] = pd.to_numeric(melted['value'], errors='coerce')
melted = melted.dropna(subset=['value'])

# Get list of unique variables
variables = melted['variable'].unique()

# Create subplots (1 row, n columns)
fig = make_subplots(rows=1, cols=len(variables), subplot_titles=variables)

# Add a histogram to each subplot with 50 bins
for i, var in enumerate(variables, start=1):
    subset = melted[melted['variable'] == var]['value']
    counts, bin_edges = np.histogram(subset, bins=50)
    bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])

    fig.add_trace(
        go.Bar(x=bin_centers, y=counts, name=var, showlegend=False),
        row=1, col=i
    )

fig.update_layout(
    title_text='Distribution of Incurred Values by Type',
    height=400,
    width=300 * len(variables),
)
fig.update_xaxes(matches=None)
fig.show()

In [16]:
fig = px.scatter(data, x='incurred', y='capped_incurred')
fig.show()

So the two columns look identical, except that capped_incurred is capped at £50k.

It is not clear whether the cap is there to limit the influence of outliers (a hint that large claims will skew the predictions) or whether it reflects a real-life limit.

Assuming it is there to help with modelling. It may also be worth log-transforming the target(s) to see how this affects the EDA and the predictions.
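
The scatter plot suggests a pure cap, and that can be checked numerically. A minimal sketch on a toy frame follows; the values are made up, and with the real data `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the claims data (values are made up)
toy = pd.DataFrame({
    "incurred": [1200.0, 50000.0, 87500.0, 0.0, np.nan],
    "capped_incurred": [1200.0, 50000.0, 50000.0, 0.0, np.nan],
})

cap = toy["capped_incurred"].max()           # empirical cap (NaN is skipped)
expected = toy["incurred"].clip(upper=cap)   # what a pure cap would produce

# Compare only rows where both values are present
both = toy["incurred"].notna() & toy["capped_incurred"].notna()
is_pure_cap = np.isclose(expected[both], toy.loc[both, "capped_incurred"]).all()
n_capped = int((toy["incurred"] > cap).sum())
print(cap, is_pure_cap, n_capped)
```

If `is_pure_cap` came out false on the real data, the rows that differ would be worth inspecting before picking a target.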

In [17]:
# Take logs of strictly positive values only; zeros, negatives and missing values become NaN.
# Masking before the log avoids NumPy's "divide by zero encountered in log" warning,
# which np.where triggers because it evaluates np.log on every row first.
data['incurred_log'] = np.log(data['incurred'].where(data['incurred'] > 0))

data['capped_incurred_log'] = np.log(data['capped_incurred'].where(data['capped_incurred'] > 0))

In [18]:
fig = px.scatter(data, x='incurred', y=['incurred_log', 'capped_incurred_log'], 
                 title='Log Transformed vs Original Incurred Values')
fig.show()
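
A plain log drops zero-valued claims, because log(0) is undefined. One possible alternative, shown here only as a sketch on made-up values, is `np.log1p`, which maps zero to zero and has `np.expm1` as its exact inverse.

```python
import numpy as np
import pandas as pd

incurred = pd.Series([0.0, 250.0, 50000.0, np.nan])  # made-up values

# log1p(x) = log(1 + x) is defined at zero, so zero-cost claims are kept
incurred_log1p = np.log1p(incurred.where(incurred >= 0))

# expm1 undoes the transform, e.g. to map predictions back to pounds
recovered = np.expm1(incurred_log1p)
print(incurred_log1p.round(3).tolist())
```

Whether to keep zero-cost claims in the target at all is a separate modelling choice; this only shows that the transform itself does not force them out.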

### Claim Number 

This should not be meaningful unless it is numbered sequentially and tracks date_of_loss

In [19]:
fig = px.scatter(data, x='date_of_loss', y='claim_number')
fig.show()

OK, so it is sequential. We can therefore ignore claim_number as a feature and use the actual date instead.

Interesting to note the non-linear rise in the number of claims per year
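
The same conclusion can be reached without a plot. A minimal sketch on a toy frame (made-up rows using the cleaned column names) checks that dates never go backwards when sorted by claim number, and computes a rank correlation as the Pearson correlation of ranks.

```python
import pandas as pd

# Toy stand-in (made-up rows); two claims share a loss date
toy = pd.DataFrame({
    "claim_number": [1, 2, 3, 4, 5],
    "date_of_loss": pd.to_datetime(
        ["2003-04-15", "2003-04-20", "2003-04-24", "2003-04-24", "2003-05-02"]
    ),
})

# Non-strict monotonicity: ties on the same date are allowed
ordered = toy.sort_values("claim_number")
dates_in_order = ordered["date_of_loss"].is_monotonic_increasing

# Spearman correlation, computed as the Pearson correlation of the ranks
rank_corr = toy["claim_number"].rank().corr(toy["date_of_loss"].rank())
print(dates_in_order, round(rank_corr, 3))
```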

### date_of_loss

Firstly, what is the relationship to the target?

In [26]:
# Filter out invalid values before plotting; take a copy so the assignment
# below does not trigger pandas' SettingWithCopyWarning
plot_data = data.dropna(subset=['date_of_loss', 'incurred', 'capped_incurred', 'incurred_log']).copy()
plot_data['date_of_loss'] = pd.to_datetime(plot_data['date_of_loss'])

fig = px.scatter(plot_data, x='date_of_loss', y=['incurred', 'capped_incurred', 'incurred_log'], 
                 facet_col='variable', 
                 title='Incurred Values Over Time',
                 facet_col_spacing=0.05,
                 facet_row_spacing=0.05,
                 trendline='lowess',
                 trendline_color_override="black"
                 )
fig.update_yaxes(matches=None)
fig.for_each_yaxis(lambda yaxis: yaxis.update(showticklabels=True))
fig.show()

There seems to be a rise in value followed by a general decline. There is no obvious inflation effect that we would need to correct for. A simple sequential date feature, such as a Unix timestamp, probably has a small amount of predictive value.
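
A sketch of what such a date feature could look like, on made-up dates. Days since the earliest loss carries the same ordering as a Unix timestamp on a friendlier scale; the calendar parts are an optional extra in case seasonality matters. All variable names here are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2003-04-15", "2003-04-20", "2004-04-15"]))

# Days since the earliest loss: a sequential feature that grows with time
days_since_start = (dates - dates.min()).dt.days

# Calendar parts, in case season matters more than long-run trend
loss_year = dates.dt.year
loss_month = dates.dt.month
print(days_since_start.tolist())
```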

In [21]:

from prophet import Prophet

df = data[['date_of_loss', 'capped_incurred']].rename(columns={'date_of_loss': 'ds', 'capped_incurred': 'y'})

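# Initialize Prophet with weekly and monthly seasonality. The data holds dates only
# (no time of day), so a within-day (daily) component cannot be estimated.
m = Prophet(weekly_seasonality=True, daily_seasonality=False)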
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)

# Fit
m.fit(df)

# Make future dataframe
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)

# Plot components
fig = m.plot_components(forecast)
fig.show()

11:59:51 - cmdstanpy - INFO - Chain [1] start processing
11:59:52 - cmdstanpy - INFO - Chain [1] done processing


In [22]:
# Plot histogram of date_of_loss to see distribution of claims over time
fig = px.histogram(plot_data, x='date_of_loss', 
                   title='Distribution of Claims by Date of Loss',
                   labels={'count': 'Number of Claims', 'date_of_loss': 'Date of Loss'},
                   template="plotly_white")
fig.update_layout(xaxis_title='Date of Loss', yaxis_title='Number of Claims')
fig.show()



We see a rise in the number of claims over time, presumably as the business grows. This may be handy: the brief suggests that more recent claims are more predictive, and because recent years already contribute more observations, it might not be necessary to weight observations by recency.
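
If explicit recency weighting did turn out to be worth trying, one option is an exponential decay on claim age. This is only a sketch on made-up dates: the five-year half-life is a hypothetical choice that would need tuning. Many scikit-learn estimators accept weights like these through the `sample_weight` argument of `fit`.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2003-06-30", "2008-06-30", "2013-06-30"]))

half_life_years = 5.0  # hypothetical choice, not taken from the brief

# Age of each claim relative to the most recent loss date
age_years = (dates.max() - dates).dt.days / 365.25

# The weight halves for every `half_life_years` of age; the newest claim gets 1
weights = 0.5 ** (age_years / half_life_years)
print(weights.round(3).tolist())
```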

TO DO:
- Weather_conditions needs its missing values filling: use the N/K value
- Time_hour has a suspicious number of values at 0, probably a missing-value code, so treat it as categorical
- Vechile_registration_present has only 2 distinct values; nearly all are 1 except 6 which are -
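
A sketch of the Time_hour item on a toy frame. It assumes that hour 0 mostly means "not recorded" rather than midnight, which should be confirmed against the real distribution first. The column names `time_hour_missing` and `time_hour_cat` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"time_hour": [0, 0, 8, 17, 0, 23]})  # made-up hours

# Assumption: hour 0 is a missing-value code rather than a real midnight loss
toy["time_hour_missing"] = (toy["time_hour"] == 0).astype("int64")

# Treat the hour as a category, relabelling the zeros as N/K (not known)
hour_as_text = toy["time_hour"].astype(str)
toy["time_hour_cat"] = hour_as_text.where(toy["time_hour"] != 0, "N/K").astype("category")

share_zero = toy["time_hour_missing"].mean()
print(share_zero, toy["time_hour_cat"].tolist())
```

Comparing `share_zero` on the real data with the share of claims at neighbouring hours such as 23 or 1 would give a rough sense of how many of the zeros are genuine.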

In [23]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from scipy.stats import pearsonr, spearmanr

def explore_col(df, col, target):
    summary = {
        "col": col,
        "target": target,
        "col_data_type": df[col].dtype.name
    }

    col_data = df[col]
    target_data = df[target]

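    # is_categorical_dtype is deprecated in recent pandas; check the dtype directly
    if col_data.dtype == 'object' or isinstance(col_data.dtype, pd.CategoricalDtype):
        # Label codes follow alphabetical order, so this rank correlation is only a rough guide.
        # nan_policy='omit' skips rows with a missing target instead of returning NaN.
        encoded_col = LabelEncoder().fit_transform(col_data.astype(str))
        corr, _ = spearmanr(encoded_col, target_data, nan_policy='omit')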
        summary["col_target_corr"] = corr
        summary["n_unique"] = col_data.nunique()
        summary["top_categories"] = col_data.value_counts().head(5).to_dict()

        # Visualize mean target per category
        agg_df = df.groupby(col)[target].mean().reset_index().sort_values(by=target, ascending=False)
        fig = px.bar(agg_df, x=col, y=target, title=f"Mean {target} by {col}", template="plotly_white")

    else:
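        # Correlate only rows where both values are present; filling gaps with 0 would bias the result
        valid = col_data.notna() & target_data.notna()
        corr, _ = pearsonr(col_data[valid], target_data[valid])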
        summary["col_target_corr"] = corr
        summary["mean"] = col_data.mean()
        summary["std"] = col_data.std()

        if target_data.nunique() < 10:
            fig = px.box(df, x=target, y=col, points="all", title=f"{col} distribution by {target}", template="plotly_white")
        else:
            fig = px.scatter(df, x=col, y=target, trendline="ols", title=f"{col} vs {target} (Corr: {corr:.2f})", template="plotly_white")

    fig.update_layout(title_font_size=18, height=400, margin=dict(t=40, b=20, l=10, r=10))
    fig.show()

    return summary

In [24]:
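# Column names were lower-cased earlier, so the target is 'incurred', not 'Incurred'
target = 'incurred'
for col in data.columns:
    if col != target:  # skip the trivial target-vs-target comparison
        explore_col(data, col, target)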