# Dataset Preparation: Building the Training Dataset

## Overview

- The claims data records the total claim cost for each year and product assigned to a policyholder.
- That total cost is then split by claim type.
- This notebook prepares the target variable for model training by
    - Aggregating the claim types to obtain the claims frequency for each (ID, year) pair.
    - Merging the result with the insurance data, which holds the features keyed on (ID, year).
    - Filling the claims frequency with 0 for rows that have no claim cost broken down by claim type.
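
The three steps above can be sketched end to end on a toy example. The rows below are invented for illustration; only the column names (`ID`, `Cost_claims_year`, `Cost_claims_by_type`, `claims_frequency`) come from this notebook, and the `*_toy` variable names are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the real input files.
insurance_toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Cost_claims_year": [300.0, 0.0, 120.0],
})
claims_toy = pd.DataFrame({
    "ID": [1, 1, 3],
    "Cost_claims_year": [300.0, 300.0, 120.0],
    "Cost_claims_by_type": [200.0, 100.0, 120.0],
})

# Step 1: count claim-type rows per (ID, Cost_claims_year).
frequency_toy = (
    claims_toy
    .groupby(["ID", "Cost_claims_year"])
    .agg(claims_frequency=("Cost_claims_by_type", "count"))
    .reset_index()
)

# Steps 2 and 3: left join onto the features, then treat "no match" as zero claims.
dataset_toy = (
    insurance_toy
    .merge(frequency_toy, how="left", on=["ID", "Cost_claims_year"])
    .fillna({"claims_frequency": 0})
)
print(dataset_toy)
```

In this toy case, policyholder 2 has no rows in the claims table, so the left join leaves `claims_frequency` missing and the fill step turns it into 0.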


## Setup

In [None]:
import pandas as pd

### Load the data

In [None]:
# Both input files are semicolon-delimited.
insurance = pd.read_csv('../../data/input/Motor_vehicle_insurance_data.csv', delimiter=';')
claims = pd.read_csv('../../data/input/sample_type_claim.csv', delimiter=';')

## 1. Aggregate Claims Frequency

| Column | Description |
|--------|-------------|
| ID | Policyholder identifier |
| Cost_claims_year | Total claim cost for the year |
| claims_frequency | Count of claim types (our target variable) |

### 1.1 Group by (ID, Cost_claims_year) and count claim types

In [None]:
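# Count the claim-type rows for each (ID, Cost_claims_year) pair.
# Note: 'count' ignores rows where Cost_claims_by_type is NaN.
claims_frequency = (
    claims
    .groupby(['ID', 'Cost_claims_year'])
    .agg({'Cost_claims_by_type': 'count'})
    .rename(columns={'Cost_claims_by_type': 'claims_frequency'})
    .reset_index()
)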
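
The aggregation above uses `count`, which skips missing values, so a claim-type row with a missing cost would not be counted. Whether that matters depends on the real claims file, which is not inspected here. The sketch below uses invented rows and hypothetical variable names to show how `count` differs from `size` when a cost is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one claim type for ID 1 has a missing cost.
claims_demo = pd.DataFrame({
    "ID": [1, 1, 2],
    "Cost_claims_year": [300.0, 300.0, 50.0],
    "Cost_claims_by_type": [200.0, np.nan, 50.0],
})

grouped = claims_demo.groupby(["ID", "Cost_claims_year"])

# count() tallies only non-null values in the chosen column.
non_null_counts = grouped["Cost_claims_by_type"].count()
# size() tallies every row in the group, including rows with NaN cost.
row_counts = grouped.size()

print(non_null_counts.tolist())
print(row_counts.tolist())
```

If the two results agree on the real data, the choice between them makes no difference; if they disagree, it is worth deciding explicitly which definition of claims frequency the model should learn.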

## 2. Merge with Insurance Features

| Step | Reason |
|-----------|--------|
| Left join | Keep all insurance records, even those with no claims |
| Fill NaN → 0 | No match means no claims filed |

### 2.1 Left join on (ID, Cost_claims_year)

In [None]:
dataset = (
    pd
    .merge(
        left=insurance,
        right=claims_frequency,
        how='left',
        on=['ID', 'Cost_claims_year']
    )
    .fillna(value={'claims_frequency': 0})
)
dataset
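
A left join can silently duplicate rows if the right-hand table has repeated keys, and filling with 0 leaves the target column as floats. As a hedged sketch on invented data (the `*_demo` and `checked` names are hypothetical), the merge can be guarded with the `validate` and `indicator` parameters of `DataFrame.merge`, and the filled column cast to integers.

```python
import pandas as pd

# Invented rows for illustration only.
insurance_demo = pd.DataFrame({"ID": [1, 2], "Cost_claims_year": [300.0, 0.0]})
frequency_demo = pd.DataFrame(
    {"ID": [1], "Cost_claims_year": [300.0], "claims_frequency": [2]}
)

checked = insurance_demo.merge(
    frequency_demo,
    how="left",
    on=["ID", "Cost_claims_year"],
    validate="many_to_one",  # raises MergeError if the right side has duplicate keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows marked "left_only" had no claims match; give them a frequency of 0.
checked["claims_frequency"] = checked["claims_frequency"].fillna(0).astype(int)

# A left join with unique right-hand keys must preserve the row count.
assert len(checked) == len(insurance_demo)
print(checked["_merge"].value_counts())
```

The `_merge` column is only a diagnostic and would normally be dropped before training.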

