# Insurance Data EDA

Exploratory Data Analysis for AlphaCare Insurance Solutions using `MachineLearningRating_v3.csv` (converted from .txt, Feb 2014–Aug 2015).

**Guiding Questions**:
- What is the overall Loss Ratio (TotalClaims / TotalPremium)? How does it vary by Province, VehicleType, Gender?
- What are the distributions of TotalPremium, TotalClaims, CustomValueEstimate? Are there outliers?
- Are there temporal trends in claim frequency/severity over the observation period (Feb 2014–Aug 2015)?
- Which vehicle Make/Model are associated with the highest/lowest claim amounts?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_loader import load_data
from src.utils import save_plot

%matplotlib inline
plt.style.use('seaborn-v0_8')

# Load data (assumes .txt has been converted to .csv)
df = load_data('../data/MachineLearningRating_v3.csv')

## Data Summarization

In [None]:
# Descriptive statistics
numerical_cols = ['TotalPremium', 'TotalClaims', 'SumInsured', 'Kilowatts', 'CustomValueEstimate']
print("Descriptive Statistics:\n", df[numerical_cols].describe())

# Data types
print("\nData Types:\n", df.dtypes)

# Missing values
print("\nMissing Values:\n", df.isnull().sum())
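
Raw missing-value counts are hard to compare across columns, so a percentage view can help decide which columns to drop or impute. The sketch below is one possible helper; `missing_summary` and the toy frame are illustrative names and data, not part of the project code or the real dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary['missing_count'] > 0].sort_values('missing_pct', ascending=False)

# Toy frame for illustration only (hypothetical values, not the real data)
toy = pd.DataFrame({
    'TotalPremium': [100.0, 200.0, 150.0, 120.0],
    'CustomValueEstimate': [np.nan, 50000.0, np.nan, np.nan],
    'Gender': ['M', None, 'F', 'F'],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(df)`.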

## Univariate Analysis

In [None]:
# Histogram: TotalClaims
plt.figure(figsize=(10, 6))
sns.histplot(df['TotalClaims'], bins=50)
plt.title('Distribution of Total Claims')
save_plot('total_claims_hist.png')

# Bar chart: Province
plt.figure(figsize=(10, 6))
df['Province'].value_counts().plot(kind='bar')
plt.title('Distribution of Policies by Province')
plt.xticks(rotation=45)
save_plot('province_bar.png')

## Bivariate/Multivariate Analysis

In [None]:
# Scatter plot: Premium vs Claims by PostalCode
plt.figure(figsize=(10, 6))
# `s` sets a fixed marker size; seaborn's `size` expects a data column, not a number
sns.scatterplot(data=df, x='TotalPremium', y='TotalClaims', hue='PostalCode', s=10, legend=False)
plt.title('Premium vs Claims by PostalCode')
save_plot('premium_vs_claims_scatter.png')

# Correlation matrix
corr_matrix = df[numerical_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
save_plot('correlation_matrix.png')

## Loss Ratio Analysis

Calculate `LossRatio = TotalClaims / TotalPremium` and analyze by Province, VehicleType, Gender.

In [None]:
# Avoid division by zero
df['LossRatio'] = df['TotalClaims'] / df['TotalPremium'].replace(0, float('nan'))
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Province', y='LossRatio', estimator='mean')
plt.title('Average Loss Ratio by Province')
plt.xticks(rotation=45)
save_plot('loss_ratio_province.png')
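
The guiding question defines the loss ratio as total claims over total premium. Averaging a row-level ratio within a group is a different quantity: it weights every row equally regardless of premium size. The sketch below computes the pooled ratio per group and contrasts it with the row-level mean. `loss_ratio_by` and the toy values are illustrative assumptions, not project code or real figures.

```python
import numpy as np
import pandas as pd

def loss_ratio_by(frame: pd.DataFrame, group_col: str) -> pd.Series:
    """Pooled loss ratio per group: sum(TotalClaims) / sum(TotalPremium)."""
    totals = frame.groupby(group_col)[['TotalClaims', 'TotalPremium']].sum()
    # Groups with zero total premium get NaN instead of inf
    ratio = totals['TotalClaims'] / totals['TotalPremium'].replace(0, np.nan)
    return ratio.rename('LossRatio')

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    'Province': ['A', 'A', 'B', 'B'],
    'TotalPremium': [100.0, 900.0, 500.0, 500.0],
    'TotalClaims': [200.0, 0.0, 250.0, 250.0],
})

pooled = loss_ratio_by(toy, 'Province')
# Mean of row-level ratios, which is what a barplot of a per-row ratio shows
row_mean = (toy['TotalClaims'] / toy['TotalPremium']).groupby(toy['Province']).mean()

print(pooled)
print(row_mean)
```

In the toy data, group A has one small-premium row with a large claim, so the two measures diverge for A and agree for B. The same helper could be applied with `'VehicleType'` or `'Gender'` as `group_col`.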

## Temporal Trends

Analyze claim frequency (the share of rows with `TotalClaims > 0`) by TransactionMonth.

In [None]:
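df['HasClaim'] = df['TotalClaims'] > 0
# Parse to datetime so months sort chronologically on the x-axis
df['TransactionMonth'] = pd.to_datetime(df['TransactionMonth'])
claim_freq = df.groupby('TransactionMonth')['HasClaim'].mean()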
plt.figure(figsize=(12, 6))
claim_freq.plot(kind='line')
plt.title('Claim Frequency Over Time')
plt.xlabel('Transaction Month')
plt.ylabel('Claim Frequency')
save_plot('claim_freq_time.png')
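
The guiding questions also ask about claim severity. One common definition, assumed here, is the mean claim amount among rows that have a claim. The sketch below computes frequency and severity per month side by side; `monthly_claim_metrics` and the toy rows are illustrative, not project code or real data.

```python
import pandas as pd

def monthly_claim_metrics(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-month claim frequency and severity (mean claim among rows with a claim)."""
    out = frame.copy()
    out['TransactionMonth'] = pd.to_datetime(out['TransactionMonth'])
    out['HasClaim'] = out['TotalClaims'] > 0
    frequency = out.groupby('TransactionMonth')['HasClaim'].mean()
    # Severity uses only rows with a claim; months without claims end up as NaN
    severity = out[out['HasClaim']].groupby('TransactionMonth')['TotalClaims'].mean()
    return pd.DataFrame({'frequency': frequency, 'severity': severity})

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    'TransactionMonth': ['2015-01-01', '2015-01-01', '2015-02-01', '2015-02-01'],
    'TotalClaims': [0.0, 1000.0, 0.0, 0.0],
})
metrics = monthly_claim_metrics(toy)
print(metrics)
```

On the real data, `monthly_claim_metrics(df)['severity'].plot(kind='line')` would give a severity trend to sit next to the frequency plot.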

## Outlier Detection

In [None]:
# Box plot: TotalClaims by VehicleType
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='TotalClaims', x='VehicleType')
plt.title('Box Plot of Total Claims by Vehicle Type')
plt.xticks(rotation=45)
save_plot('total_claims_boxplot.png')
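
Box plots flag outliers visually; the same rule (points beyond 1.5 × IQR from the quartiles) can be applied numerically to count them. The sketch below is a minimal version of that rule. `iqr_outlier_mask` and the toy series are illustrative assumptions. If most rows have zero claims, the IQR can collapse to zero and every non-zero claim gets flagged, so it may be more informative to apply this to `df.loc[df['TotalClaims'] > 0, 'TotalClaims']`.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy claim amounts for illustration only (hypothetical values)
toy_claims = pd.Series([100.0, 120.0, 130.0, 110.0, 125.0, 5000.0])
mask = iqr_outlier_mask(toy_claims)
print(toy_claims[mask])
```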

## Vehicle Make/Model Analysis

Identify Make/Model with highest/lowest claim amounts.

In [None]:
claims_by_make = df.groupby('Make')['TotalClaims'].mean().sort_values()
plt.figure(figsize=(12, 6))
pd.concat([claims_by_make.head(5), claims_by_make.tail(5)]).plot(kind='bar')
# Bars are ordered ascending: the five lowest averages first, then the five highest
plt.title('Bottom 5 and Top 5 Makes by Average Claim Amount')
plt.xticks(rotation=45)
save_plot('claims_by_make.png')
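
Average claim amounts for makes with very few policies can be dominated by a single large claim. One way to guard against that, sketched below, is to require a minimum number of rows per make before ranking. `mean_claims_by_make`, the `min_policies` threshold, and the toy rows are illustrative assumptions, not project code or real data.

```python
import pandas as pd

def mean_claims_by_make(frame: pd.DataFrame, min_policies: int = 30) -> pd.DataFrame:
    """Mean TotalClaims per Make, keeping only makes with at least min_policies rows."""
    stats = frame.groupby('Make')['TotalClaims'].agg(['mean', 'count'])
    return stats[stats['count'] >= min_policies].sort_values('mean')

# Toy data for illustration only (hypothetical makes and values)
toy = pd.DataFrame({
    'Make': ['X', 'X', 'X', 'Y'],
    'TotalClaims': [0.0, 300.0, 0.0, 9000.0],
})
ranked = mean_claims_by_make(toy, min_policies=2)
print(ranked)
```

In the toy data, make `Y` has a single row with a large claim and is excluded by the threshold. The right value for `min_policies` depends on the data and is a judgment call.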