# Insurance Data EDA

Exploratory data analysis (EDA) for AlphaCare Insurance Solutions using `MachineLearningRating_v3.csv`, converted from a pipe-separated .txt file and covering Feb 2014–Aug 2015.

**Guiding Questions**:
- What is the overall Loss Ratio (TotalClaims / TotalPremium)? How does it vary by Province, VehicleType, Gender?
- What are the distributions of TotalPremium, TotalClaims, CustomValueEstimate? Are there outliers?
- Are there temporal trends in claim frequency/severity over the 18-month period?
- Which vehicle makes/models are associated with the highest/lowest claim amounts?

**Notes**:
- `CapitalOutstanding` contained numbers written with commas; these were converted to floats.
- `MaritalStatus`, `Gender`, and `CrossBorder` had mixed types and were cast to the `category` dtype.
- Negative `TotalPremium` and `TotalClaims` values may indicate refunds or errors.
- `CustomValueEstimate` has about 78% missing values (220,456 non-null values in the summary below).
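
The cleaning steps in the notes can be sketched as follows. This is a minimal illustration on toy data, not the project's `load_data` implementation, which is not shown here. The helper name `to_float` and the sample values are hypothetical, and the sketch assumes commas act as thousands separators (pass `comma_is_decimal=True` if they mark decimals instead).

```python
import math

import pandas as pd


def to_float(value, comma_is_decimal=False):
    """Convert a number that may be stored as text with commas into a float."""
    if pd.isna(value):
        return math.nan
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip().replace(" ", "")
    # Assumption: a comma is either a decimal mark or a thousands separator
    text = text.replace(",", "." if comma_is_decimal else "")
    try:
        return float(text)
    except ValueError:
        return math.nan  # unparseable entries become missing values


# Toy rows standing in for the raw file (values are made up)
sample = pd.DataFrame({
    "CapitalOutstanding": ["119,300", "0", None, 2500.0],
    "Gender": ["Male", None, 0, "Female"],
})

sample["CapitalOutstanding"] = sample["CapitalOutstanding"].map(to_float)
# Mixed-type column: stringify non-missing values, then store as a category
sample["Gender"] = (
    sample["Gender"].map(lambda v: v if pd.isna(v) else str(v)).astype("category")
)

print(sample.dtypes)
```

Keeping the conversion in a small function makes the separator assumption explicit and easy to flip once the raw format is confirmed.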

In [1]:
import os
import sys

# Add the parent directory to the import path so the `src` package resolves
sys.path.append(os.path.abspath('..'))

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_loader import load_data
from src.utils import save_plot

%matplotlib inline
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

# Load data
df = load_data('../data/MachineLearningRating_v3.csv')

## Data Summarization

In [2]:
# Descriptive statistics
numerical_cols = ['TotalPremium', 'TotalClaims', 'SumInsured', 'kilowatts', 'CustomValueEstimate', 'CapitalOutstanding']
print("Descriptive Statistics:\n", df[numerical_cols].describe())

# Data types
print("\nData Types:\n", df.dtypes)

# Missing values
print("\nMissing Values:\n", df.isnull().sum())

# Check categorical columns
print("\nMaritalStatus Unique Values:\n", df['MaritalStatus'].value_counts(dropna=False))
print("\nGender Unique Values:\n", df['Gender'].value_counts(dropna=False))
print("\nCrossBorder Unique Values:\n", df['CrossBorder'].value_counts(dropna=False))

# Negative values
print("\nNegative TotalPremium Count:", (df['TotalPremium'] < 0).sum())
print("Negative TotalClaims Count:", (df['TotalClaims'] < 0).sum())

# TransactionMonth
print("\nTransactionMonth Missing Values:", df['TransactionMonth'].isnull().sum())
print("TransactionMonth Range:\n", df['TransactionMonth'].agg(['min', 'max']))

Descriptive Statistics:
        TotalPremium   TotalClaims   SumInsured      kilowatts  \
count  1.000098e+06  1.000098e+06   1000098.00  999546.000000   
mean   6.190550e+01  6.486119e+01    604172.50      97.207916   
std    2.302845e+02  2.384075e+03   1508331.75      19.393255   
min   -7.825768e+02 -1.200241e+04         0.01       0.000000   
25%    0.000000e+00  0.000000e+00      5000.00      75.000000   
50%    2.178333e+00  0.000000e+00      7500.00     111.000000   
75%    2.192982e+01  0.000000e+00    250000.00     111.000000   
max    6.528260e+04  3.930921e+05  12636200.00     309.000000   

       CustomValueEstimate  
count         2.204560e+05  
mean          2.255311e+05  
std           5.645158e+05  
min           2.000000e+04  
25%           1.350000e+05  
50%           2.200000e+05  
75%           2.800000e+05  
max           2.655000e+07  

Data Types:
 UnderwrittenCoverID                  int64
PolicyID                             int64
TransactionMonth            
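
As a quick check of the 78% figure in the notes, the missing share of `CustomValueEstimate` follows from the counts printed above: 220,456 non-null values against 1,000,098 rows. The snippet below only redoes that arithmetic; the variable names are illustrative, and it assumes the `TotalPremium` count equals the total row count.

```python
# Counts taken from the describe() output above
total_rows = 1_000_098  # assumption: TotalPremium has no missing values
non_null_custom_value = 220_456

missing_share = 1 - non_null_custom_value / total_rows
print(f"CustomValueEstimate missing: {missing_share:.1%}")
```

On the full frame, `df['CustomValueEstimate'].isnull().mean()` would give the same share directly.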

## Univariate Analysis

In [3]:
# Histogram: TotalClaims
plt.figure(figsize=(10, 6))
sns.histplot(df['TotalClaims'].dropna(), bins=50, color='teal')
plt.title('Distribution of Total Claims')
plt.xlabel('Total Claims (ZAR)')
plt.ylabel('Frequency')
save_plot('total_claims_hist.png')

# Bar chart: Province
plt.figure(figsize=(10, 6))
df['Province'].value_counts().plot(kind='bar', color='coral')
plt.title('Distribution of Policies by Province')
plt.xticks(rotation=45)
save_plot('province_bar.png')
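
Because the summary statistics show a 75th percentile of zero for `TotalClaims`, a single histogram is dominated by the spike at zero. One possible way to look at the distribution is to report the zero share separately and group only the positive claims by order of magnitude. The sketch below uses made-up values; the `claims` series is an assumption, not project data.

```python
import numpy as np
import pandas as pd

# Made-up claim amounts: mostly zero, a few positive, one negative (reversal)
claims = pd.Series([0, 0, 0, 0, 0, 0, 0, 1200.0, 35000.0, 450.0, -200.0, 0])

zero_share = (claims == 0).mean()
positive = claims[claims > 0]

# Order of magnitude of each positive claim (2 means hundreds, 3 thousands, ...)
decade = np.floor(np.log10(positive)).astype(int)
decade_counts = decade.value_counts().sort_index()

print(f"Share of zero-claim rows: {zero_share:.0%}")
print(decade_counts)
```

For plotting, `sns.histplot(positive, log_scale=True)` is one way to draw the positive claims on a logarithmic axis.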

## Bivariate/Multivariate Analysis

In [None]:
# Scatter plot: Premium vs Claims by PostalCode
plt.figure(figsize=(10, 6))
# `s` sets the marker size; `size` is a semantic grouping variable, not a marker size
sns.scatterplot(data=df, x='TotalPremium', y='TotalClaims', hue='PostalCode', s=10, legend=False, palette='viridis')
plt.title('Premium vs Claims by PostalCode')
save_plot('premium_vs_claims_scatter.png')

# Correlation matrix
corr_matrix = df[numerical_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
save_plot('correlation_matrix.png')

## Loss Ratio Analysis

In [5]:
# Per-row loss ratio; zero premiums become NaN to avoid division by zero
df['LossRatio'] = df['TotalClaims'] / df['TotalPremium'].replace(0, float('nan'))
plt.figure(figsize=(10, 6))
# Assign `hue` alongside `palette` (seaborn deprecates `palette` without `hue`)
sns.barplot(data=df, x='Province', y='LossRatio', hue='Province', estimator='mean', palette='Set2', legend=False)
plt.title('Average Loss Ratio by Province')
plt.xticks(rotation=45)
plt.ylabel('Loss Ratio')
save_plot('loss_ratio_province.png')
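
The bar plot above averages per-row ratios, which is not the same quantity as the overall loss ratio in the guiding questions (total claims divided by total premium). Per-row ratios are also undefined where the premium is zero. A minimal sketch of the aggregate version follows; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [100.0, 300.0, 200.0, 0.0],
    "TotalClaims": [0.0, 200.0, 50.0, 0.0],
})

# Overall loss ratio: sum of claims over sum of premiums
overall = toy["TotalClaims"].sum() / toy["TotalPremium"].sum()

# Same ratio per group: aggregate first, divide afterwards
sums = toy.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
by_province = sums["TotalClaims"] / sums["TotalPremium"]

print(f"Overall loss ratio: {overall:.3f}")
print(by_province)
```

Aggregating before dividing weights each policy by its premium, so a zero-premium row contributes nothing instead of producing a missing ratio.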


## Temporal Trends

In [None]:
df['HasClaim'] = df['TotalClaims'] > 0
claim_freq = df.groupby(df['TransactionMonth'].dt.to_period('M'))['HasClaim'].mean()
if not claim_freq.empty:
    plt.figure(figsize=(12, 6))
    claim_freq.plot(kind='line', color='purple', marker='o')
    plt.title('Claim Frequency Over Time')
    plt.xlabel('Transaction Month')
    plt.ylabel('Claim Frequency')
    save_plot('claim_freq_time.png')
else:
    print("No claim frequency data available to plot.")
    print("TransactionMonth Missing:", df['TransactionMonth'].isnull().sum())
    print("HasClaim Value Counts:\n", df['HasClaim'].value_counts())
    print("TotalClaims > 0 Count:", (df['TotalClaims'] > 0).sum())
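
The guiding questions also ask about claim severity, which the cell above does not compute. One way to sketch it is shown below on toy data, assuming severity means the average claim amount among rows with a positive claim and that `TransactionMonth` is already a datetime column.

```python
import pandas as pd

toy = pd.DataFrame({
    "TransactionMonth": pd.to_datetime(
        ["2014-02-01", "2014-02-01", "2014-03-01", "2014-03-01", "2014-03-01"]
    ),
    "TotalClaims": [0.0, 1000.0, 500.0, 1500.0, 0.0],
})

month = toy["TransactionMonth"].dt.to_period("M")
has_claim = toy["TotalClaims"] > 0

# Frequency: share of rows with a claim; severity: mean amount of those claims
frequency = has_claim.groupby(month).mean()
severity = toy.loc[has_claim, "TotalClaims"].groupby(month[has_claim]).mean()

print(pd.DataFrame({"frequency": frequency, "severity": severity}))
```

Tracking the two series separately helps distinguish months with more claims from months with larger claims.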

## Outlier Detection

In [6]:
plt.figure(figsize=(10, 6))
# Assign `hue` alongside `palette` (seaborn deprecates `palette` without `hue`)
sns.boxplot(data=df, y='TotalClaims', x='VehicleType', hue='VehicleType', palette='Pastel1', legend=False)
plt.title('Box Plot of Total Claims by Vehicle Type')
plt.xticks(rotation=45)
plt.ylabel('Total Claims (ZAR)')
save_plot('total_claims_boxplot.png')
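
Box plots draw their whiskers at 1.5 times the interquartile range (IQR) by default and mark points beyond them as outliers. The same rule can be applied numerically to count flagged rows. The sketch below uses invented values and a hypothetical helper, `iqr_outlier_mask`.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented positive claim amounts with one extreme value
claims = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0])
mask = iqr_outlier_mask(claims)

print(claims[mask])
```

Since the summary statistics show both the 25th and 75th percentiles of `TotalClaims` at zero, the IQR of the full column is zero and every non-zero claim would be flagged; applying the rule to positive claims only may be more informative.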


## Vehicle Make/Model Analysis

In [None]:
claims_by_make = df.groupby('make')['TotalClaims'].mean().sort_values()
plt.figure(figsize=(12, 6))
pd.concat([claims_by_make.head(5), claims_by_make.tail(5)]).plot(kind='bar', color='skyblue')
plt.title('Top/Bottom 5 Makes by Average Claim Amount')
plt.xticks(rotation=45)
plt.ylabel('Average Claim Amount (ZAR)')
save_plot('claims_by_make.png')
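
Average claims per make can be dominated by makes with very few rows. A possible refinement, sketched on toy data with a hypothetical `min_rows` threshold and placeholder make names, reports the row count next to the mean and filters out sparse makes before ranking.

```python
import pandas as pd

toy = pd.DataFrame({
    "make": ["A", "A", "A", "B", "B", "B", "C"],
    "TotalClaims": [0.0, 300.0, 0.0, 0.0, 0.0, 60.0, 9000.0],
})

min_rows = 3  # hypothetical threshold; tune to the data
stats = toy.groupby("make")["TotalClaims"].agg(mean_claim="mean", rows="count")
ranked = stats[stats["rows"] >= min_rows].sort_values("mean_claim")

print(ranked)
```

Here make `C` has the largest mean but only one row, so it is excluded from the ranking instead of topping it.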