# **Predictive Modeling for Insurance Risk and Premium Optimization**

This notebook implements Task 4 of the challenge: "Build and evaluate predictive models that form the core of a dynamic, risk-based pricing system." We will focus on two primary modeling goals: predicting `Claim Severity` and exploring approaches for `Premium Optimization`.

## **Table of Contents**

1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Data Preparation](#2-data-preparation)
    - 2.1. Initial Data Loading and Cleaning
    - 2.2. Feature Engineering
    - 2.3. Encoding Categorical Data & Scaling Numerical Data
    - 2.4. Train-Test Split
3. [Modeling Goal 1: Claim Severity Prediction (Risk Model)](#3-modeling-goal-1-claim-severity-prediction-risk-model)
    - 3.1. Model Building & Training
    - 3.2. Model Evaluation
    - 3.3. Model Interpretability (SHAP & LIME)
4. [Modeling Goal 2: Premium Optimization (Pricing Framework)](#4-modeling-goal-2-premium-optimization-pricing-framework)
    - 4.1. Conceptual Framework: Risk-Based Premium
    - 4.2. Model Building: Probability of Claim (Classification Model)
    - 4.3. Model Evaluation: Probability of Claim
    - 4.4. Model Interpretability (SHAP & LIME) for Probability Model
    - 4.5. Simplified Premium Prediction (Direct Regression on TotalPremium)
5. [Overall Model Comparison and Interpretation](#5-overall-model-comparison-and-interpretation)
6. [Conclusion and Business Recommendations](#6-conclusion-and-business-recommendations)

## **1. Setup and Data Loading**

We begin by importing all necessary libraries and our custom modular functions for data loading, preprocessing, and modeling. We'll load the processed data from Task 1, which should be available in `data/processed/`.

### Import necessary libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pathlib import Path
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings

### Setp plotting style

In [2]:
# Suppress specific warnings from libraries like shap for cleaner output
warnings.filterwarnings('ignore', category=UserWarning, module='shap')
warnings.filterwarnings('ignore', category=FutureWarning, module='shap')

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['font.family'] = 'Inter'

### Import modules

In [3]:
# Add project root to sys.path to enable importing modular scripts
import sys
project_root = Path.cwd()
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import data handling utilities
# from src.utils.data_loader import load_data # Now handled by DataPreprocessor.load_and_clean_data

# Import new data preparation utilities
from src.utils.data_preparation.preprocessor import DataPreprocessor
from src.utils.data_preparation.feature_engineer import create_time_features, create_risk_ratio_features, create_vehicle_age_feature

# Import modeling utilities
from src.models.model_trainer import ModelTrainer
from src.models.linear_regression_strategy import LinearRegressionStrategy
from src.models.random_forest_strategy import RandomForestStrategy
from src.models.decision_tree_strategy import DecisionTreeStrategy # New
from src.models.xgboost_strategy import XGBoostStrategy
from src.models.model_evaluator import evaluate_regression_model, evaluate_classification_model
from src.models.model_interpreter import ModelInterpreter

# Import metrics calculator from  for HasClaim and Margin
from src.utils.hypothesis_testing.metrics_calculator import calculate_claim_frequency, calculate_margin

### Load and Preprocss Data

In [4]:
# Define path to the processed data file
processed_data_path = project_root / "data" / "processed" / "processed_insurance_data.csv"
raw_data_path = project_root / "data" / "raw" / "temp_extracted_data" / "MachineLearningRating_v3.txt"

# Load and initially clean data using the new preprocessor method
# print(f"Attempting to load and clean data from: {processed_data_path}")
print(f"Attempting to load and clean data from: {raw_data_path}")

# Use comma delimiter as specified
df = DataPreprocessor.load_and_clean_data(raw_data_path, delimiter='|', file_type='txt')

if df.empty:
    raise ValueError(f"DataFrame is empty. Please ensure '{raw_data_path}' exists and is correctly formatted. "
                     "This notebook expects processed data from previous tasks.")

print(f"\nDataFrame shape after loading and initial cleaning: {df.shape}")
print("\nInitial DataFrame Info:")
df.info()

# Add 'HasClaim' and 'Margin' columns from  metrics calculator
df = calculate_claim_frequency(df.copy())
df = calculate_margin(df.copy())

print("\nDataFrame head after loading, initial cleaning, and initial metric calculation:")
print(df.head())


Attempting to load and clean data from: /home/micha/Downloads/course/10-accademy/week-3/Insurance-Risk-Analytics-and-Predictive-Modeling/data/raw/temp_extracted_data/MachineLearningRating_v3.txt
Successfully loaded data from 'MachineLearningRating_v3.txt' using specified delimiter '|'. Shape: (1000098, 52)
Removed 0 duplicate rows.
Data loaded and duplicates removed. Current shape: (1000098, 52)

DataFrame shape after loading and initial cleaning: (1000098, 52)

Initial DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000098 entries, 0 to 1000097
Data columns (total 52 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   UnderwrittenCoverID       1000098 non-null  int64  
 1   PolicyID                  1000098 non-null  int64  
 2   TransactionMonth          1000098 non-null  object 
 3   IsVATRegistered           1000098 non-null  bool   
 4   Citizenship               1000098 non-null  objec

## **2. Data Preparation**

This phase is critical for transforming the raw or initially processed data into a format suitable for machine learning models. It involves handling missing values, creating new informative features, encoding categorical variables, and splitting the data for training and testing.

### **2.1. Initial Data Loading and Cleaning**

This step (already performed in section 1) ensures duplicates are removed and critical financial columns `TotalPremium`, `TotalClaims` are properly loaded and `NaN`s are filled with 0. Further imputation for other features will be handled by the `DataPreprocessor`'s pipeline.

In [5]:
print("\n--- 2.1. Initial Data Loading and Cleaning ---")
print("Initial loading and cleaning (duplicate removal, critical NaN handling) already performed in Setup section.")

print("\nMissing values remaining after initial load and clean:")
print(df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False))

# For any other non-critical numerical/categorical columns that may have NaNs,
# the DataPreprocessor's pipeline will handle their imputation before scaling/encoding.


--- 2.1. Initial Data Loading and Cleaning ---
Initial loading and cleaning (duplicate removal, critical NaN handling) already performed in Setup section.

Missing values remaining after initial load and clean:
NumberOfVehiclesInFleet    1000098
CrossBorder                 999400
CustomValueEstimate         779642
WrittenOff                  641901
Converted                   641901
Rebuilt                     641901
NewVehicle                  153295
Bank                        145961
AccountType                  40232
Gender                        9536
MaritalStatus                 8259
mmcode                         552
VehicleType                    552
make                           552
VehicleIntroDate               552
NumberOfDoors                  552
bodytype                       552
kilowatts                      552
cubiccapacity                  552
Cylinders                      552
Model                          552
CapitalOutstanding               2
dtype: int64


### **2.2. Feature Engineering**

We'll create new features that might enhance model performance by capturing additional information or relationships from existing columns.

In [6]:
print("\n--- 2.2. Feature Engineering ---")

# Create time-based features from 'TransactionMonth'
df_fe = create_time_features(df.copy(), 'TransactionMonth')

# Create risk ratio features (e.g., ClaimPremiumRatio)
df_fe = create_risk_ratio_features(df_fe.copy())

# Create vehicle age feature
df_fe = create_vehicle_age_feature(df_fe.copy(), current_year=2025) # Adjust current_year as needed

print("\nDataFrame head after Feature Engineering:")
print(df_fe.head())
print("\nDataFrame Info after Feature Engineering:")
df_fe.info()


--- 2.2. Feature Engineering ---
DEBUG: Initial 'TransactionMonth' dtype: object
DEBUG: Initial 'TransactionMonth' head:
0    2015-03-01 00:00:00
1    2015-05-01 00:00:00
2    2015-07-01 00:00:00
3    2015-05-01 00:00:00
4    2015-07-01 00:00:00
Name: TransactionMonth, dtype: object


  converted_dates_1 = pd.to_datetime(df_copy[date_col], errors='coerce', infer_datetime_format=True)


DEBUG: NaNs after inferring format: 0 / 1000098
  Using inferred parsing result for 'TransactionMonth'.
Created time-based features from 'TransactionMonth'.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_copy['ClaimPremiumRatio'].replace([np.inf, -np.inf], np.nan, inplace=True) # Modified: `replace` is okay with inplace, but safer to assign


Created 'ClaimPremiumRatio' feature.
Created 'VehicleAge' feature.

DataFrame head after Feature Engineering:
   UnderwrittenCoverID  PolicyID TransactionMonth  IsVATRegistered  \
0               145249     12827       2015-03-01             True   
1               145249     12827       2015-05-01             True   
2               145249     12827       2015-07-01             True   
3               145255     12827       2015-05-01             True   
4               145255     12827       2015-07-01             True   

  Citizenship          LegalType Title Language                 Bank  \
0              Close Corporation    Mr  English  First National Bank   
1              Close Corporation    Mr  English  First National Bank   
2              Close Corporation    Mr  English  First National Bank   
3              Close Corporation    Mr  English  First National Bank   
4              Close Corporation    Mr  English  First National Bank   

       AccountType  ... HasClaim    

### **2.3. Encoding Categorical Data & Scaling Numerical Data**

We'll define the final set of features for our models and then use our `DataPreprocessor` to handle one-hot encoding for categorical features and standard scaling for numerical features. This is crucial as most machine learning models require numerical input and can perform better with scaled data.

In [None]:
print("\n--- 2.3. Encoding Categorical Data & Scaling Numerical Data ---")

# Define numerical and categorical columns for preprocessing *that will be used as features in models*
# Exclude unique identifiers, already processed/target columns, and columns dropped during FE
features_to_exclude = [
    'PolicyID', 'UnderwrittenCoverID', 'TotalClaims', 'TotalPremium',
    'HasClaim', 'Margin', 'TransactionMonth', 'VehicleIntroDate'
]

# Identify potential numerical and categorical features from the DataFrame after Feature Engineering
# Filter to ensure they are present and not in the exclude list
all_current_cols = df_fe.columns.tolist()

# Attempt to infer data types for better selection, or use a predefined list if schema is known
# This list can be more robustly defined based on domain knowledge or a more thorough EDA.
inferred_numerical_features = df_fe.select_dtypes(include=np.number).columns.tolist()
inferred_categorical_features = df_fe.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

# Filter down to the actual feature columns, excluding targets and IDs
final_numerical_features = [
    col for col in inferred_numerical_features
    if col in all_current_cols and col not in features_to_exclude
]
final_categorical_features = [
    col for col in inferred_categorical_features
    if col in all_current_cols and col not in features_to_exclude
]

# Handle `Mmcode` potentially being treated as categorical or numerical ID.
# For modeling, it's often dropped or one-hot encoded if it has few unique values.
# Assuming it's an ID for now and thus excluded. If it's a feature, add to numerical/categorical.
if 'Mmcode' in final_numerical_features:
    final_numerical_features.remove('Mmcode')
if 'Mmcode' in final_categorical_features:
    final_categorical_features.remove('Mmcode')
if 'Mmcode' in features_to_exclude: # Ensure it's in excluded if it was intended to be
    pass
else:
    features_to_exclude.append('Mmcode')


print(f"Features selected for preprocessing: {len(final_numerical_features)} numerical, {len(final_categorical_features)} categorical.")
print("Numerical Features (after FE and exclusion):", final_numerical_features)
print("Categorical Features (after FE and exclusion):", final_categorical_features)

# Initialize the preprocessor with identified features
# For modeling, 'onehot' encoding and 'standard' scaling are good defaults.
preprocessor = DataPreprocessor(
    numerical_cols=final_numerical_features,
    categorical_cols=final_categorical_features
)

# Apply the full preprocessing pipeline (encoding + scaling)
df_processed_for_modeling = preprocessor.preprocess(
    df_fe.copy(), # Use a copy of df_fe to avoid modifying it in place if needed later
    encoder_type='onehot',
    scaler_type='standard'
)

# Display info about the final processed DataFrame for modeling
print("\nDataFrame Info after full preprocessing (for modeling):")
df_processed_for_modeling.info()
print("\nTransformed DataFrame Head:")
print(df_processed_for_modeling.head())

# Store the final feature names for model interpretability (SHAP/LIME)
# The order is important: numerical features first, then one-hot encoded categorical features
final_model_feature_names = [col for col in df_processed_for_modeling.columns if col not in features_to_exclude]
print("\nFinal feature names for models:", final_model_feature_names[:10], "...") # Print first few



--- 2.3. Encoding Categorical Data & Scaling Numerical Data ---
Features selected for preprocessing: 19 numerical, 35 categorical.
Numerical Features (after FE and exclusion): ['PostalCode', 'mmcode', 'RegistrationYear', 'Cylinders', 'cubiccapacity', 'kilowatts', 'NumberOfDoors', 'CustomValueEstimate', 'NumberOfVehiclesInFleet', 'SumInsured', 'CalculatedPremiumPerTerm', 'Month', 'Year', 'DayOfWeek', 'DayOfYear', 'WeekOfYear', 'Quarter', 'ClaimPremiumRatio', 'VehicleAge']
Categorical Features (after FE and exclusion): ['IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'VehicleType', 'make', 'Model', 'bodytype', 'AlarmImmobiliser', 'TrackingDevice', 'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'CrossBorder', 'TermFrequency', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass

### **2.4. Train-Test Split**

We will split the data into training and testing sets. A 70:30 ratio is common. This ensures we evaluate models on unseen data.

In [None]:
print("\n--- 2.4. Train-Test Split ---")

# Define features (X) and targets (y) for modeling
# X includes all processed features (numerical + one-hot encoded)
# y includes the original target columns ('TotalClaims', 'HasClaim', 'TotalPremium')

X = df_processed_for_modeling[final_model_feature_names] # Use the identified final feature names
y_severity = df_processed_for_modeling['TotalClaims'] # Target for Claim Severity
y_probability = df_processed_for_modeling['HasClaim'] # Target for Claim Probability
y_premium = df_processed_for_modeling['TotalPremium'] # Target for Direct Premium Prediction

print(f"Features (X) shape: {X.shape}")
print(f"TotalClaims target (y_severity) shape: {y_severity.shape}")
print(f"HasClaim target (y_probability) shape: {y_probability.shape}")
print(f"TotalPremium target (y_premium) shape: {y_premium.shape}")

# Split the data into training and testing sets
# Use a common random_state for reproducibility
X_train, X_test, y_severity_train, y_severity_test = train_test_split(
    X, y_severity, test_size=0.3, random_state=42
)
_, _, y_probability_train, y_probability_test = train_test_split(
    X, y_probability, test_size=0.3, random_state=42, stratify=y_probability # Stratify for classification target
)
_, _, y_premium_train, y_premium_test = train_test_split(
    X, y_premium, test_size=0.3, random_state=42
)


print(f"\nTrain set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"X_train columns (first 5): {X_train.columns.tolist()[:5]}...")


## **3. Modeling Goal 1: Claim Severity Prediction (Risk Model)**

For policies that have a claim (`TotalClaims > 0`), we build a regression model to predict the `TotalClaims` amount. This model is crucial for estimating the financial liability associated with a policy.

- **Target Variable**: `TotalClaims` (on the subset of data where claims > 0).
- **Evaluation Metric**: Root Mean Squared Error (RMSE) to penalize large prediction errors, and R-squared.

In [None]:
print("\n--- 3. Modeling Goal 1: Claim Severity Prediction ---")

# Filter data for policies with actual claims (Claim Severity is for WHEN a claim occurs)
# Ensure X and y align after filtering
claims_mask_train = (y_severity_train > 0)
X_train_severity = X_train[claims_mask_train]
y_train_severity = y_severity_train[claims_mask_train]

claims_mask_test = (y_severity_test > 0)
X_test_severity = X_test[claims_mask_test]
y_test_severity = y_severity_test[claims_mask_test]

print(f"Claim Severity Training Data Shape (claims > 0): {X_train_severity.shape}, {y_train_severity.shape}")
print(f"Claim Severity Test Data Shape (claims > 0): {X_test_severity.shape}, {y_test_severity.shape}")

if X_train_severity.empty or X_test_severity.empty:
    print("Warning: Insufficient data for Claim Severity prediction (very few or no claims with value > 0). Skipping this section.")
else:
    # Dictionary to store model evaluation results for comparison
    severity_model_results = {}

    # Initialize ModelTrainer
    trainer = ModelTrainer(LinearRegressionStrategy()) # Start with LR

    # --- 3.1. Model Building & Training ---

    # Model 1: Linear Regression for Claim Severity
    print("\n--- Training Linear Regression for Claim Severity ---")
    trainer.train_model(X_train_severity, y_train_severity)
    lr_severity_predictions = trainer.predict_model(X_test_severity)
    lr_severity_model = trainer.get_current_model_object()
    severity_model_results['Linear Regression'] = evaluate_regression_model(y_test_severity, lr_severity_predictions)


    # Model 2: Decision Tree for Claim Severity 
    print("\n--- Training Decision Tree Regressor for Claim Severity ---")
    trainer.set_strategy(DecisionTreeStrategy(model_type='regressor', random_state=42))
    trainer.train_model(X_train_severity, y_train_severity)
    dt_severity_predictions = trainer.predict_model(X_test_severity)
    dt_severity_model = trainer.get_current_model_object()
    severity_model_results['Decision Tree'] = evaluate_regression_model(y_test_severity, dt_severity_predictions)


    # Model 3: Random Forest for Claim Severity
    print("\n--- Training Random Forest for Claim Severity ---")
    trainer.set_strategy(RandomForestStrategy(n_estimators=200, random_state=42))
    trainer.train_model(X_train_severity, y_train_severity)
    rf_severity_predictions = trainer.predict_model(X_test_severity)
    rf_severity_model = trainer.get_current_model_object()
    severity_model_results['Random Forest'] = evaluate_regression_model(y_test_severity, rf_severity_predictions)


    # Model 4: XGBoost for Claim Severity
    print("\n--- Training XGBoost Regressor for Claim Severity ---")
    trainer.set_strategy(XGBoostStrategy(objective='reg:squarederror', n_estimators=200, random_state=42))
    trainer.train_model(X_train_severity, y_train_severity)
    xgb_severity_predictions = trainer.predict_model(X_test_severity)
    xgb_severity_model = trainer.get_current_model_object()
    severity_model_results['XGBoost Regressor'] = evaluate_regression_model(y_test_severity, xgb_severity_predictions)


    # --- 3.2. Model Evaluation (Consolidated) ---
    print("\n--- Claim Severity Model Comparison ---")
    severity_comparison_df = pd.DataFrame(severity_model_results).T
    print(severity_comparison_df.sort_values(by='RMSE')) # Sort by RMSE to find best performing

    # Identify the best performing model based on RMSE
    best_severity_model_name = severity_comparison_df['RMSE'].idxmin()
    print(f"\nBest performing Claim Severity model based on RMSE: {best_severity_model_name}")

    # Select the best model object for interpretability
    best_severity_model = {
        'Linear Regression': lr_severity_model,
        'Decision Tree': dt_severity_model,
        'Random Forest': rf_severity_model,
        'XGBoost Regressor': xgb_severity_model
    }.get(best_severity_model_name)

    # --- 3.3. Model Interpretability (SHAP & LIME) ---
    if best_severity_model and not X_test_severity.empty:
        print(f"\n--- Model Interpretability for {best_severity_model_name} (Claim Severity) ---")

        # Initialize ModelInterpreter with the best model and features for SHAP/LIME
        # Ensure feature_names correspond to X_test_severity columns
        interpreter_severity = ModelInterpreter(
            model=best_severity_model,
            feature_names=X_test_severity.columns.tolist(),
            model_type='regression'
        )

        # SHAP for overall feature importance
        # Use a subset of X_test_severity for SHAP explanation if dataset is large,
        # as KernelExplainer can be computationally intensive. TreeExplainer is faster.
        shap_X_severity = X_test_severity.sample(min(1000, len(X_test_severity)), random_state=42) if len(X_test_severity) > 1000 else X_test_severity

        interpreter_severity.explain_model_shap(shap_X_severity)
        print("\nSHAP Summary Plot (Global Feature Importance for Claim Severity):")
        interpreter_severity.plot_shap_summary(shap_X_severity)

        # LIME for individual instance explanation
        print("\nLIME Explanation for a Sample Prediction (first instance in test set):")
        # For LIME explainer, it's best to initialize once with representative training data
        interpreter_severity.lime_explainer = lime.lime_tabular.LimeTabularExplainer(
            training_data=X_train_severity.values,
            feature_names=X_train_severity.columns.tolist(),
            mode='regression'
        )
        # Explain the first instance from the test set
        interpreter_severity.explain_instance_lime(X_test_severity.iloc[0])

    else:
        print("Skipping model interpretability for Claim Severity due to insufficient data or no best model found.")



## **4. Modeling Goal 2: Premium Optimization (Pricing Framework)**

The ultimate goal is a dynamic, risk-based pricing system. This involves predicting the probability of a claim and its severity.

### **4.1. Conceptual Framework: Risk-Based Premium**

The prompt outlines a sophisticated approach for a Risk-Based Premium:

Premium = (Predicted Probability of Claim * Predicted Claim Severity) + Expense Loading + Profit Margin

This formula suggests a two-stage modeling approach:

1. **Probability of Claim Model**: A classification model predicts `P(Claim = 1)`.
2. **Claim Severity Model**: A regression model (which we just built) predicts `E[TotalClaims | Claim = 1]`.

We then combine these with business-defined "Expense Loading" (costs of operations, sales, etc.) and "Profit Margin" to arrive at a recommended premium.

### **4.2. Model Building: Probability of Claim (Classification Model)**

We will build a binary classification model to predict `HasClaim` (0 or 1).

- **Target Variable**: `HasClaim`.
- **Evaluation Metric**: Accuracy, Precision, Recall, F1-score.

In [None]:
print("\n--- 4.2. Model Building: Probability of Claim (Classification Model) ---")

# For classification, we use y_probability_train and y_probability_test
# Ensure that the target column 'HasClaim' is suitable for classification (binary: 0 or 1)
print(f"Claim Probability Training Target Value Counts:\n{y_probability_train.value_counts()}")
print(f"Claim Probability Test Target Value Counts:\n{y_probability_test.value_counts()}")

# Check for class imbalance which is common in claim data
if not y_probability_train.empty and y_probability_train.value_counts().min() / y_probability_train.value_counts().max() < 0.2:
    print("Warning: Significant class imbalance detected in 'HasClaim'. Consider techniques like SMOTE or class weighting for training.")
else:
    print("Class distribution for 'HasClaim' appears balanced enough or data is too small to assess imbalance.")

# Dictionary to store classification model evaluation results
probability_model_results = {}

# Initialize ModelTrainer
trainer_clf = ModelTrainer(DecisionTreeStrategy(model_type='classifier', random_state=42)) # Start with DT Classifier


### Model 1: Decision Tree Classifier for Claim Probability

In [None]:
# Model 1: Decision Tree Classifier for Claim Probability 
print("\n--- Training Decision Tree Classifier for Claim Probability ---")
trainer_clf.train_model(X_train, y_probability_train)
dt_probability_predictions_proba = trainer_clf.predict_model(X_test)
dt_probability_model = trainer_clf.get_current_model_object()
probability_model_results['Decision Tree Classifier'] = evaluate_classification_model(y_probability_test, dt_probability_predictions_proba)


### Model 2: XGBoost Classifier for Claim Probability

In [None]:
# Model 2: XGBoost Classifier for Claim Probability
print("\n--- Training XGBoost Classifier for Claim Probability ---")
# Use use_label_encoder=False and eval_metric for compatibility with latest XGBoost
trainer_clf.set_strategy(XGBoostStrategy(objective='binary:logistic', n_estimators=200, random_state=42, use_label_encoder=False, eval_metric='logloss'))
trainer_clf.train_model(X_train, y_probability_train) # Using full X_train for probability model
xgb_probability_predictions_proba = trainer_clf.predict_model(X_test) # Get probabilities
xgb_probability_model = trainer_clf.get_current_model_object()
probability_model_results['XGBoost Classifier'] = evaluate_classification_model(y_probability_test, xgb_probability_predictions_proba)


### **4.3. Model Evaluation: Probability of Claim**

In [None]:
# --- 4.3. Model Evaluation: Probability of ClaimModel Evaluation: Probability of Claim (Consolidated) ---
print("\n--- Claim Probability Model Performance Summary ---")
probability_comparison_df = pd.DataFrame(probability_model_results).T
print(probability_comparison_df.sort_values(by='F1-score', ascending=False)) # Sort by F1-score for classification

# Identify the best performing model based on F1-score
best_probability_model_name = probability_comparison_df['F1-score'].idxmax()
print(f"\nBest performing Claim Probability model based on F1-score: {best_probability_model_name}")

# Select the best model object
best_probability_model = {
    'Decision Tree Classifier': dt_probability_model,
    'XGBoost Classifier': xgb_probability_model
}.get(best_probability_model_name)


### **4.4. Model Interpretability (SHAP & LIME) for Probability Model**

Understanding which features drive the prediction of a claim occurring is vital for risk assessment and underwriting.

In [None]:
if best_probability_model and not X_test.empty:
    print(f"\n--- Model Interpretability for {best_probability_model_name} (Claim Probability) ---")

    # Initialize ModelInterpreter for classification
    # Ensure class_names are correct for your binary target (e.g., [0, 1] or ['No Claim', 'Claim'])
    interpreter_prob = ModelInterpreter(
        model=best_probability_model,
        feature_names=X_test.columns.tolist(),
        class_names=['No Claim', 'Claim'], # Assuming 0 for No Claim, 1 for Claim
        model_type='classification'
    )

    # SHAP for overall feature importance
    shap_X_prob = X_test.sample(min(1000, len(X_test)), random_state=42) if len(X_test) > 1000 else X_test
    interpreter_prob.explain_model_shap(shap_X_prob)
    print("\nSHAP Summary Plot (Global Feature Importance for Claim Probability):")
    interpreter_prob.plot_shap_summary(shap_X_prob)

    # LIME for individual instance explanation
    print("\nLIME Explanation for a Sample Probability Prediction (first instance in test set):")
    interpreter_prob.lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X_train.values, # LIME needs the raw training data array
        feature_names=X_train.columns.tolist(),
        class_names=['No Claim', 'Claim'],
        mode='classification'
    )
    interpreter_prob.explain_instance_lime(X_test.iloc[0])
else:
    print("Skipping model interpretability for Claim Probability due to insufficient data or no best model found.")


### **4.5. Simplified Premium Prediction (Direct Regression on TotalPremium)**

While the full "Risk-Based Premium" formula involves combining models, a simpler approach for "predicting an appropriate premium" might be to directly regress on `TotalPremium`. This demonstrates a pricing model without the full two-stage complexity.

- **Target Variable**: `TotalPremium`.
- **Evaluation Metric**: RMSE, R-squared.

In [None]:
print("\n--- 4.5. Simplified Premium Prediction (Direct Regression on TotalPremium) ---")

# Dictionary to store direct premium model evaluation results
premium_model_results = {}

# Initialize ModelTrainer (using full X_train for premium prediction)
trainer_premium = ModelTrainer(LinearRegressionStrategy()) # Start with LR

### Model 1: Linear Regression for TotalPremium

In [None]:
# Model 1: Linear Regression for TotalPremium
print("\n--- Training Linear Regression for TotalPremium ---")
trainer_premium.train_model(X_train, y_premium_train)
lr_premium_predictions = trainer_premium.predict_model(X_test)
lr_premium_model = trainer_premium.get_current_model_object()
premium_model_results['Linear Regression'] = evaluate_regression_model(y_premium_test, lr_premium_predictions)


### Model 2: Decision Tree for TotalPremium

In [None]:
# Model 2: Decision Tree for TotalPremium 
print("\n--- Training Decision Tree Regressor for TotalPremium ---")
trainer_premium.set_strategy(DecisionTreeStrategy(model_type='regressor', random_state=42))
trainer_premium.train_model(X_train, y_premium_train)
dt_premium_predictions = trainer_premium.predict_model(X_test)
dt_premium_model = trainer_premium.get_current_model_object()
premium_model_results['Decision Tree'] = evaluate_regression_model(y_premium_test, dt_premium_predictions)


### Model 3: Random Forest for TotalPremium

In [None]:
# Model 3: Random Forest for TotalPremium
print("\n--- Training Random Forest for TotalPremium ---")
trainer_premium.set_strategy(RandomForestStrategy(n_estimators=200, random_state=42))
trainer_premium.train_model(X_train, y_premium_train)
rf_premium_predictions = trainer_premium.predict_model(X_test)
rf_premium_model = trainer_premium.get_current_model_object()
premium_model_results['Random Forest'] = evaluate_regression_model(y_premium_test, rf_premium_predictions)


### Model 4: XGBoost for TotalPremium

In [None]:
# Model 4: XGBoost for TotalPremium
print("\n--- Training XGBoost Regressor for TotalPremium ---")
trainer_premium.set_strategy(XGBoostStrategy(objective='reg:squarederror', n_estimators=200, random_state=42))
trainer_premium.train_model(X_train, y_premium_train)
xgb_premium_predictions = trainer_premium.predict_model(X_test)
xgb_premium_model = trainer_premium.get_current_model_object()
premium_model_results['XGBoost Regressor'] = evaluate_regression_model(y_premium_test, xgb_premium_predictions)


### Direct Premium Prediction Model Comparison

In [None]:
print("\n--- Direct Premium Prediction Model Comparison ---")
premium_comparison_df = pd.DataFrame(premium_model_results).T
print(premium_comparison_df.sort_values(by='RMSE'))

# Identify the best performing model based on RMSE
best_premium_model_name = premium_comparison_df['RMSE'].idxmin()
print(f"\nBest performing Direct Premium Prediction model based on RMSE: {best_premium_model_name}")

# Select the best model object for interpretability
best_premium_model = {
    'Linear Regression': lr_premium_model,
    'Decision Tree': dt_premium_model,
    'Random Forest': rf_premium_model,
    'XGBoost Regressor': xgb_premium_model
}.get(best_premium_model_name)

# Feature Importance for the best direct premium prediction model
if best_premium_model and not X_test.empty:
    print(f"\n--- Model Interpretability for {best_premium_model_name} (Direct Premium Prediction) ---")
    interpreter_premium = ModelInterpreter(
        model=best_premium_model,
        feature_names=X_test.columns.tolist(),
        model_type='regression'
    )

    shap_X_premium = X_test.sample(min(1000, len(X_test)), random_state=42) if len(X_test) > 1000 else X_test

    interpreter_premium.explain_model_shap(shap_X_premium)
    print("\nSHAP Summary Plot (Global Feature Importance for Direct Premium Prediction):")
    interpreter_premium.plot_shap_summary(shap_X_premium)

    print("\nLIME Explanation for a Sample Premium Prediction (first test instance):")
    interpreter_premium.lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X_train.values,
        feature_names=X_train.columns.tolist(),
        mode='regression'
    )
    interpreter_premium.explain_instance_lime(X_test.iloc[0])
else:
    print("Skipping model interpretability for Direct Premium Prediction due to insufficient data or no best model found.")



## **5. Overall Model Comparison and Interpretation**

Based on the evaluation metrics and feature importance analysis, compare the performance of the developed models for both Claim Severity and Premium Prediction.

*(**Your detailed interpretation and comparison here, based on the outputs from the cells above**)*

- **Claim Severity Model Comparison:**
    - **RMSE & R-squared**: Which model minimized RMSE and maximized R-squared on the `claims > 0` subset? This indicates its accuracy in predicting the actual cost of a claim.
    - **Feature Importance (SHAP & LIME)**: Identify the top 5-10 most influential features for the best Claim Severity model. Explain their impact (e.g., "SHAP analysis reveals that for every year older a vehicle is, the predicted claim amount increases by X Rand, holding other factors constant. This provides quantitative evidence to refine our age-based premium adjustments."). How do global (SHAP) and local (LIME) explanations compare?
- **Premium Probability (Classification) Model Comparison:**
    - **Metrics**: Which model performed best on Accuracy, Precision, Recall, and F1-score for predicting `HasClaim`? Consider the business context: is it more critical to identify all potential claims (high recall) or to only identify claims with high certainty (high precision)?
    - **Feature Importance (SHAP & LIME)**: Identify the top 5-10 most influential features for the Claim Probability model. Explain their impact on the likelihood of a claim. How do global (SHAP) and local (LIME) explanations compare?
- **Direct Premium Prediction Model Comparison:**
    - **RMSE & R-squared**: Which model best predicted `TotalPremium`?
    - **Feature Importance (SHAP & LIME)**: Identify the top 5-10 features for predicting the premium. This can reveal what attributes are currently strongly correlated with the premiums being charged. How do global (SHAP) and local (LIME) explanations compare?

**Interpretation Insights:**

- **Overlap in Important Features**: Are there common influential features across Claim Severity, Claim Probability, and Direct Premium Prediction models? These are likely the most critical risk drivers.
- **Discrepancies**: If current `TotalPremium` drivers differ significantly from actual `TotalClaims` or `HasClaim` drivers, it highlights potential mispricing opportunities.

## **6. Conclusion and Business Recommendations**

Synthesize all findings from model building and evaluation into actionable business recommendations for ACIS's dynamic, risk-based pricing system.

**Key Model Findings Summary:**

- **Claim Severity Model**: [Best Model Name] achieved [RMSE value] and [R-squared value]. The most influential features were [list top 3-5, e.g., 'VehicleAge', 'CustomValueEstimate', 'Make'].
    - *Interpretation*: Provide 1-2 sentences on what these features imply for financial liability (e.g., "Vehicle age and custom value estimate are significant drivers of claim severity, indicating older, more valuable vehicles accrue higher claim costs. For example, LIME analysis showed that for a specific high-claim policy, a high custom value was a strong positive contributor to the predicted severity.").
- **Claim Probability Model**: [Best Model Name] achieved [Accuracy/F1-score]. The most influential features were [list top 3-5, e.g., 'Province_Gauteng', 'VehicleType_SUV', 'PostalCode_2000'].
    - *Interpretation*: Provide 1-2 sentences on what these features imply for claim likelihood (e.g., "Certain vehicle types and postal codes significantly influence the probability of a claim, suggesting localized risk factors. For instance, LIME highlights that living in 'Gauteng' strongly increases the predicted probability of a claim for an individual policy.").
- **Direct Premium Model**: [Best Model Name] achieved [RMSE value] and [R-squared value]. The most influential features were [list top 3-5, e.g., 'SumInsured', 'CoverType', 'TermFrequency'].
    - *Interpretation*: Provide 1-2 sentences on what these features indicate about current pricing (e.g., "Current premiums are heavily influenced by sum insured and cover type, but perhaps less directly by actual claims history or vehicle age, which could be an area for refinement. SHAP values confirm that a higher 'SumInsured' directly leads to a higher predicted premium.").

**Strategic Recommendations for Dynamic, Risk-Based Pricing:**

1. **Develop a Two-Stage Risk-Based Premium Model**: Implement the conceptual framework: `Premium = (Predicted Probability of Claim * Predicted Claim Severity) + Expense Loading + Profit Margin`. This requires deploying both the best-performing Claim Probability (classification) model and Claim Severity (regression) model. This sophisticated approach will ensure premiums are directly tied to the expected loss for each policy.
2. **Refine Feature Engineering**: Leverage the identified feature importance from SHAP and LIME analysis. Focus on collecting and utilizing data for the most influential features. For instance, if 'VehicleAge' is highly important for severity, ensure this is accurately captured and used in pricing algorithms.
3. **Optimize Pricing Segments**: The models provide a granular understanding of risk drivers. ACIS can now move beyond broad segments to dynamic pricing that incorporates multiple predictive features. For example, rather than just province, pricing can factor in `VehicleAge`, `CustomValueEstimate`, and `PostalCode` simultaneously, weighted by their predictive power.
4. **Continuous Model Monitoring & Retraining**: Predictive models degrade over time. Implement a robust MLOps pipeline to continuously monitor model performance (RMSE, R-squared, classification metrics) and retrain models with fresh data to adapt to changing market conditions and claim patterns. Specifically, monitor model drift for key features identified by SHAP/LIME.
5. **Integration with Underwriting**: The feature importance insights can directly inform underwriting rules, allowing for automated decision-making or flagging of high-risk policies that require manual review. LIME can be particularly useful for underwriters to understand *why* a specific policy received a certain risk score.
6. **Business Buy-in and Explainability**: Use SHAP visualizations and quantitative explanations (like "for every year older a vehicle is...") to communicate model insights to business stakeholders, fostering trust and enabling data-driven decisions. LIME provides policy-specific explanations, which can be invaluable for customer service or claims dispute resolution.

By implementing these recommendations, ACIS can transition to a more data-driven, precise, and dynamic risk-based pricing system, leading to improved profitability, fairer premiums for customers, and a competitive advantage in the insurance market.