# Task 4: Statistical Modeling

## Objective
Build predictive models to estimate TotalClaims (insurance risk) using policy and vehicle features.

## Models
- Linear Regression
- Random Forest
- XGBoost

## Evaluation
Models are compared on RMSE, MAE, and R², all computed on a held-out test set. SHAP is used for interpretability.
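
As a reference for what these three metrics measure, here is a minimal, dependency-free sketch. It is not the implementation of `evaluate_model` in `src.modeling` (that code is not shown here); the function name `regression_metrics` and the toy numbers are hypothetical. Only the key names `RMSE`, `MAE`, and `R2` are taken from the notebook, which uses them in the comparison plot later.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when the target is constant; report NaN in that case.
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"RMSE": math.sqrt(ss_res / n), "MAE": mae, "R2": r2}

# Toy values for illustration only, not taken from the dataset.
metrics = regression_metrics([0.0, 0.0, 100.0, 300.0], [10.0, 0.0, 80.0, 250.0])
```

RMSE squares the errors before averaging, so a few large misses on high-value claims dominate it, while MAE weights every policy equally. That difference matters for a target like TotalClaims, where large claims are rare but expensive.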

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Add the project root (parent directory) to the path so that `src` is importable
sys.path.append(os.path.abspath('..'))

from src.loader import load_data
from src.modeling import DataPreprocessor, train_linear_regression, train_random_forest, train_xgboost, evaluate_model

## 1. Load Data

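Only the first 100,000 rows are loaded to keep memory use manageable. This is the head of the file rather than a random sample, so if the file is ordered (for example by date or policy), the results may not generalize to the full dataset.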
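
If the file does turn out to be ordered, one possible alternative is to draw a uniform random sample in a single pass with bounded memory (reservoir sampling). The sketch below uses only the standard library on an in-memory stand-in for the file. The helper `reservoir_sample` and the column names in the fake data are hypothetical, and the notebook itself does not do this.

```python
import csv
import io
import random

def reservoir_sample(lines, k, seed=42):
    """Return the header and a uniform random sample of k data rows (Algorithm R)."""
    rng = random.Random(seed)
    reader = csv.reader(lines, delimiter="|")
    header = next(reader)
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return header, sample

# In-memory stand-in for the real pipe-delimited file; columns are illustrative.
fake_file = io.StringIO(
    "PolicyID|TotalClaims\n" + "\n".join(f"{i}|0" for i in range(1000))
)
header, rows = reservoir_sample(fake_file, k=100)
```

With a real file, the same function could be called on an open file handle, and the sampled rows passed to `pd.DataFrame(rows, columns=header)`. Columns would then arrive as strings and need type conversion.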

In [None]:
# Load data efficiently
filepath = '../data/raw/MachineLearningRating_v3.txt'

# Use nrows to limit dataset size and avoid memory issues
print("Loading data (first 100,000 rows)...")
df = pd.read_csv(filepath, sep='|', low_memory=False, nrows=100000)
print(f"Loaded {len(df):,} rows")
print(f"Shape: {df.shape}")

## 2. Data Preparation

In [None]:
# Select Features and Target
target_col = 'TotalClaims'

# Drop rows where target is missing
df = df.dropna(subset=[target_col])
print(f"After dropping missing target: {len(df):,} rows")

# Feature Selection
categorical_features = ['Gender', 'Province', 'VehicleType', 'make', 'bodytype']
numerical_features = ['RegistrationYear', 'Cylinders', 'cubiccapacity', 'kilowatts', 'NumberOfDoors', 'CapitalOutstanding']

# Check if features exist
all_features = categorical_features + numerical_features
missing_features = [f for f in all_features if f not in df.columns]
if missing_features:
    print(f"Warning: Missing features: {missing_features}")
    # Remove missing features
    categorical_features = [f for f in categorical_features if f in df.columns]
    numerical_features = [f for f in numerical_features if f in df.columns]

print(f"\nCategorical features ({len(categorical_features)}): {categorical_features}")
print(f"Numerical features ({len(numerical_features)}): {numerical_features}")

# Prepare X and y
X = df[categorical_features + numerical_features].copy()
y = df[target_col].copy()

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")

## 3. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

## 4. Preprocessing

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor(categorical_features, numerical_features)

# Fit and transform
print("Fitting preprocessor...")
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"\nProcessed training shape: {X_train_processed.shape}")
print(f"Processed test shape: {X_test_processed.shape}")

## 5. Model Training

### 5.1 Linear Regression

In [None]:
print("Training Linear Regression...")
lr_model = train_linear_regression(X_train_processed, y_train)
lr_eval = evaluate_model(lr_model, X_test_processed, y_test)

print("\n=== Linear Regression Results ===")
for metric, value in lr_eval.items():
    print(f"{metric}: {value:,.4f}")  # 4 decimals so small R2 values stay visible

### 5.2 Random Forest

In [None]:
print("Training Random Forest...")
rf_model = train_random_forest(X_train_processed, y_train, n_estimators=50, random_state=42)
rf_eval = evaluate_model(rf_model, X_test_processed, y_test)

print("\n=== Random Forest Results ===")
for metric, value in rf_eval.items():
    print(f"{metric}: {value:,.4f}")  # 4 decimals so small R2 values stay visible

### 5.3 XGBoost

In [None]:
print("Training XGBoost...")
xgb_model = train_xgboost(X_train_processed, y_train, random_state=42)
xgb_eval = evaluate_model(xgb_model, X_test_processed, y_test)

print("\n=== XGBoost Results ===")
for metric, value in xgb_eval.items():
    print(f"{metric}: {value:,.4f}")  # 4 decimals so small R2 values stay visible

## 6. Model Comparison

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Linear Regression': lr_eval,
    'Random Forest': rf_eval,
    'XGBoost': xgb_eval
}).T

print("\n" + "="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)
print(comparison_df.to_string())
print("="*60)

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['RMSE', 'MAE', 'R2']):
    comparison_df[metric].plot(kind='bar', ax=axes[idx], color=['#3498db', '#2ecc71', '#e74c3c'])
    axes[idx].set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel(metric)
    axes[idx].set_xlabel('Model')
    axes[idx].tick_params(axis='x', rotation=45)
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
# Make sure the output directory exists before saving
os.makedirs('../reports', exist_ok=True)
plt.savefig('../reports/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nModel comparison plot saved to reports/model_comparison.png")

## 7. Model Interpretability (SHAP)

SHAP values attribute each prediction to the input features. This shows both which features matter most overall and how individual predictions are formed.
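
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model: each feature's contribution is its average marginal effect over all orderings in which features are switched from a baseline to the instance's values. Everything here (`exact_shapley`, `toy_model`, the feature meanings and numbers) is hypothetical. `shap.TreeExplainer`, used in the next cell, relies on a tree-specific algorithm rather than this enumeration, which only scales to a handful of features.

```python
from itertools import permutations

def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one instance, averaged over all feature orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start from the baseline instance
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its actual value
            new = predict(current)
            contributions[i] += new - prev
            prev = new
    return [c / len(orderings) for c in contributions]

def toy_model(features):
    # Hypothetical claim model with one interaction term (age x doors).
    age, kilowatts, doors = features
    return 100.0 + 20.0 * age + 0.5 * kilowatts + 2.0 * age * doors

x = [10.0, 80.0, 4.0]         # instance to explain
baseline = [5.0, 60.0, 4.0]   # reference instance
phi = exact_shapley(toy_model, x, baseline)
```

The values in `phi` add up to `toy_model(x) - toy_model(baseline)`. This additivity property is what lets a SHAP plot be read as a decomposition of each prediction. The third feature receives zero because it is identical in `x` and `baseline`.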

In [None]:
import shap

print("Generating SHAP values for XGBoost model...")

# Use TreeExplainer for tree-based models (faster)
explainer = shap.TreeExplainer(xgb_model)

# Calculate SHAP values on a subset of the test set for speed
X_test_sample = X_test_processed[:1000]  # first 1,000 rows
shap_values = explainer(X_test_sample)

# shape[0] works for NumPy arrays, DataFrames, and sparse matrices alike
print(f"SHAP values computed for {X_test_sample.shape[0]} samples")

In [None]:
# Summary plot
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test_sample, show=False)
plt.title('SHAP Summary Plot - Feature Importance', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../reports/shap_summary_plot.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nSHAP summary plot saved to reports/shap_summary_plot.png")

In [None]:
# Bar plot for mean absolute SHAP values
plt.figure(figsize=(10, 6))
shap.plots.bar(shap_values, show=False)
plt.title('SHAP Feature Importance (Mean |SHAP value|)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../reports/shap_feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("SHAP feature importance plot saved to reports/shap_feature_importance.png")

## 8. Business Insights and Recommendations

### Model Performance Analysis

Based on the evaluation metrics:

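1. **Best Performing Model**: Prefer the model with the lowest RMSE and MAE and the highest R² on the test set. RMSE and R² always rank models identically on the same test set, so any disagreement will be between RMSE (sensitive to large claims) and MAE (typical error per policy). Decide which of the two matters more for pricing before choosing.
   - Tree-based models (Random Forest, XGBoost) often outperform Linear Regression on insurance data because they can capture non-linear relationships and feature interactions, but this should be confirmed against the comparison table in Section 6 rather than assumed.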

2. **Feature Importance**: SHAP analysis shows which features most influence the model's predictions. This reflects what the model has learned, not necessarily what causes claims:
   - High-impact features are candidates for closer attention in underwriting decisions
   - Low-impact features may be candidates for removal in future iterations
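
The selection rule in point 1 can be sketched as a small helper that reports the winner per metric and whether the metrics agree. The function `best_by_metric` and all numbers below are made up for illustration and are not results from this notebook. The dictionary layout mirrors the per-model metric dictionaries used to build `comparison_df`.

```python
def best_by_metric(results):
    """Map each metric to the model that wins on it (lower is better, except R2)."""
    higher_is_better = {"R2"}
    winners = {}
    metrics = next(iter(results.values())).keys()
    for metric in metrics:
        pick = max if metric in higher_is_better else min
        winners[metric] = pick(results, key=lambda name: results[name][metric])
    return winners

# Made-up numbers purely for illustration.
toy_results = {
    "Linear Regression": {"RMSE": 2300.0, "MAE": 210.0, "R2": 0.01},
    "Random Forest": {"RMSE": 2250.0, "MAE": 150.0, "R2": 0.05},
    "XGBoost": {"RMSE": 2200.0, "MAE": 160.0, "R2": 0.09},
}
winners = best_by_metric(toy_results)
unanimous = len(set(winners.values())) == 1
```

In this toy case RMSE and R² pick one model and MAE picks another, so `unanimous` is `False`. When that happens, the choice comes down to whether large-claim accuracy or typical per-policy accuracy matters more for pricing.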

### Recommendations for Low-Risk Client Identification

1. **Risk Scoring**: Use the model to generate risk scores for each policy
2. **Threshold Definition**: Set a claims prediction threshold to identify "low-risk" clients
3. **Premium Optimization**: Offer competitive premiums to clients predicted as low-risk
4. **Continuous Monitoring**: Regularly retrain models with new claim data to maintain accuracy
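
Steps 1 and 2 can be sketched as follows, assuming the risk score is simply the model's predicted claim amount and that "low-risk" means the lowest quarter of predictions. Both are illustrative assumptions: the 25% cut-off, the helper `flag_low_risk`, and the sample predictions are hypothetical, and a real threshold would need to be set with the business.

```python
import math

def flag_low_risk(predicted_claims, quantile=0.25):
    """Flag policies whose predicted claims fall at or below the given quantile."""
    if not 0 < quantile < 1:
        raise ValueError("quantile must be strictly between 0 and 1")
    ordered = sorted(predicted_claims)
    # Nearest-rank quantile: smallest value covering `quantile` of the policies.
    rank = max(0, math.ceil(quantile * len(ordered)) - 1)
    threshold = ordered[rank]
    return threshold, [p <= threshold for p in predicted_claims]

# Hypothetical model predictions (expected claim amount per policy).
preds = [1200.0, 50.0, 0.0, 300.0, 75.0, 9000.0, 10.0, 640.0]
threshold, is_low_risk = flag_low_risk(preds, quantile=0.25)
```

In the notebook, `preds` could come from something like `xgb_model.predict(X_test_processed)`. Using a quantile rather than a fixed amount keeps the share of flagged policies stable even if the scale of predictions shifts after retraining.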

### Next Steps

1. **Model Deployment**: Integrate the best model into the pricing pipeline
2. **A/B Testing**: Test premium adjustments on a subset of low-risk clients
3. **Feature Engineering**: Explore additional derived features (e.g., vehicle age, claims history ratios)
4. **Hyperparameter Tuning**: Optimize model parameters for production use

## Summary

✅ Successfully trained three models: Linear Regression, Random Forest, and XGBoost  
✅ Evaluated models using RMSE, MAE, and R² metrics  
✅ Generated SHAP visualizations for model interpretability  
✅ Provided business recommendations based on model insights  

**Task 4 Complete!**