# Insurance Risk Analysis Notebook

This notebook demonstrates data analysis and modeling techniques for insurance underwriting risk assessment.

In [None]:
# Import necessary libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import xgboost as xgb
import shap

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Add the parent directory to the path so project modules can be imported.
# Notebooks do not define __file__, so resolve relative to the working directory.
sys.path.append(os.path.dirname(os.getcwd()))

## 1. Generate Sample Insurance Data

For demonstration purposes, we'll generate synthetic insurance data that mimics real-world underwriting data.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate property data
def generate_property_data(n_samples=1000):
    property_data = pd.DataFrame({
        'property_id': [f'P{i:04d}' for i in range(1, n_samples + 1)],
        'property_type': np.random.choice(['Residential', 'Commercial', 'Industrial'], n_samples),
        'property_value': np.random.uniform(100000, 1000000, n_samples),
        'property_age': np.random.randint(1, 50, n_samples),
        'property_size': np.random.uniform(1000, 10000, n_samples),
        'flood_risk_score': np.random.uniform(0, 10, n_samples),
        'fire_risk_score': np.random.uniform(0, 10, n_samples),
        'location_risk': np.random.choice(['Low', 'Medium', 'High'], n_samples),
    })
    
    # Create a more realistic target variable with dependencies on features
    # Higher risk scores and property age increase claim probability
    logits = (
        0.05 * property_data['property_value'] / 100000 +
        0.2 * property_data['property_age'] +
        0.4 * property_data['flood_risk_score'] +
        0.3 * property_data['fire_risk_score'] +
        (property_data['location_risk'] == 'High') * 2 +
        (property_data['location_risk'] == 'Medium') * 1
    )
    
    # Normalize and convert to probability
    logits = (logits - logits.mean()) / logits.std()
    probs = 1 / (1 + np.exp(-logits))
    
    # Generate binary outcome
    property_data['claim_filed'] = (np.random.random(n_samples) < probs).astype(int)
    
    return property_data

# Generate the data
property_data = generate_property_data(1000)

# Display the first few rows
property_data.head()
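
Because the linear score is standardized before the sigmoid is applied, the simulated claim probabilities are centered on 0.5, so roughly half of the synthetic properties end up with a claim. The sketch below illustrates this on stand-in scores and shows how an intercept shift would lower the base rate. The function `claim_probabilities` and its `intercept` parameter are hypothetical names used for illustration; they are not part of the generator above.

```python
import numpy as np

def claim_probabilities(raw_scores, intercept=0.0):
    """Standardize raw scores, add an optional intercept, and apply the sigmoid."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    # ddof=1 matches the default of pandas' Series.std()
    z = (raw_scores - raw_scores.mean()) / raw_scores.std(ddof=1)
    return 1.0 / (1.0 + np.exp(-(z + intercept)))

rng = np.random.default_rng(0)
demo_scores = rng.uniform(0, 20, size=5000)  # stand-in for the raw linear score

p_centered = claim_probabilities(demo_scores)                # no shift
p_shifted = claim_probabilities(demo_scores, intercept=-2.0)  # assumed shift, for illustration

print(f'Mean probability without shift: {p_centered.mean():.3f}')
print(f'Mean probability with shift:    {p_shifted.mean():.3f}')
```

If a lower, more insurance-like claim frequency is wanted, a negative intercept of this kind is one simple option; the value -2.0 is arbitrary.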

## 2. Exploratory Data Analysis

Let's explore the data to understand the relationships between different features and the target variable.

In [None]:
# Basic statistics
property_data.describe()

In [None]:
# Distribution of property types
plt.figure(figsize=(10, 6))
sns.countplot(x='property_type', data=property_data)
plt.title('Distribution of Property Types')
plt.xlabel('Property Type')
plt.ylabel('Count')
plt.show()

In [None]:
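# Distribution of claims by property type
# Map the target to readable labels so the legend colors match the hue levels
plot_data = property_data.assign(
    claim_status=property_data['claim_filed'].map({0: 'No Claim', 1: 'Claim Filed'})
)

plt.figure(figsize=(10, 6))
sns.countplot(x='property_type', hue='claim_status', hue_order=['No Claim', 'Claim Filed'], data=plot_data)
plt.title('Claims by Property Type')
plt.xlabel('Property Type')
plt.ylabel('Count')
plt.show()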

In [None]:
# Claim rate by property type
claim_rates = property_data.groupby('property_type')['claim_filed'].mean()

plt.figure(figsize=(10, 6))
sns.barplot(x=claim_rates.index, y=claim_rates.values)
plt.title('Claim Rate by Property Type')
plt.xlabel('Property Type')
plt.ylabel('Claim Rate')
plt.ylim(0, 1)  # Claim rate is a proportion; the synthetic base rate is near 0.5
plt.show()

In [None]:
# Correlation between numeric features
numeric_cols = ['property_value', 'property_age', 'property_size', 'flood_risk_score', 'fire_risk_score', 'claim_filed']
corr = property_data[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Features')
plt.show()

In [None]:
# Distribution of risk scores
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.histplot(property_data, x='flood_risk_score', hue='claim_filed', kde=True, bins=20)
plt.title('Flood Risk Score Distribution')
plt.xlabel('Flood Risk Score')

plt.subplot(1, 2, 2)
sns.histplot(property_data, x='fire_risk_score', hue='claim_filed', kde=True, bins=20)
plt.title('Fire Risk Score Distribution')
plt.xlabel('Fire Risk Score')

plt.tight_layout()
plt.show()

In [None]:
# Relationship between property age and claim rate
# Create age bins
property_data['age_bin'] = pd.cut(property_data['property_age'], bins=[0, 10, 20, 30, 40, 50], labels=['0-10', '11-20', '21-30', '31-40', '41-50'])

# Calculate claim rate by age bin
age_claim_rates = property_data.groupby('age_bin', observed=False)['claim_filed'].mean()

plt.figure(figsize=(10, 6))
sns.barplot(x=age_claim_rates.index, y=age_claim_rates.values)
plt.title('Claim Rate by Property Age')
plt.xlabel('Property Age (years)')
plt.ylabel('Claim Rate')
plt.ylim(0, 1)  # Older age bins can exceed a 0.5 claim rate in this synthetic data
plt.show()

## 3. Feature Engineering

Let's create some additional features that might be useful for predicting insurance claims.

In [None]:
# Create a copy of the data for feature engineering
data_engineered = property_data.copy()

# Create combined risk score
data_engineered['combined_risk_score'] = (data_engineered['flood_risk_score'] + data_engineered['fire_risk_score']) / 2

# Create value per square foot
data_engineered['value_per_sqft'] = data_engineered['property_value'] / data_engineered['property_size']

# Create age-related features
data_engineered['is_new_property'] = (data_engineered['property_age'] < 5).astype(int)
data_engineered['is_old_property'] = (data_engineered['property_age'] > 30).astype(int)

# Create interaction features
data_engineered['age_x_flood_risk'] = data_engineered['property_age'] * data_engineered['flood_risk_score']
data_engineered['age_x_fire_risk'] = data_engineered['property_age'] * data_engineered['fire_risk_score']

# One-hot encode categorical variables
data_engineered = pd.get_dummies(data_engineered, columns=['property_type', 'location_risk'], drop_first=False, dtype=int)

# Display the engineered features
data_engineered.head()
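
One detail worth checking after one-hot encoding is the dtype of the indicator columns. Recent pandas releases emit booleans by default, and a frame that mixes booleans with floats converts to an `object` array, which some estimators handle poorly. The following is a minimal sketch on a made-up frame (`toy` is illustrative, not the notebook's data) that requests integer indicators and then converts to a float matrix:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'property_value': [250000.0, 480000.0, 125000.0],
    'location_risk': ['Low', 'High', 'Medium'],
})

# dtype=int requests 0/1 indicator columns instead of booleans
encoded = pd.get_dummies(toy, columns=['location_risk'], dtype=int)

# Converting with an explicit dtype guarantees a purely numeric matrix
X_toy = encoded.to_numpy(dtype=float)

print(encoded.dtypes)
print(X_toy.dtype, X_toy.shape)
```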

## 4. Model Training and Evaluation

Now let's train machine learning models to predict insurance claims.

In [None]:
# Prepare data for modeling
# Exclude non-feature columns
exclude_cols = ['property_id', 'claim_filed', 'age_bin']
feature_cols = [col for col in data_engineered.columns if col not in exclude_cols]

# Cast to float so boolean one-hot columns do not produce an object-dtype array
X = data_engineered[feature_cols].to_numpy(dtype=float)
y = data_engineered['claim_filed'].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")

In [None]:
# Train multiple models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42)
}

# Dictionary to store results
results = {}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba),
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }
    
    print(f"  Accuracy: {results[name]['accuracy']:.4f}")
    print(f"  ROC AUC: {results[name]['roc_auc']:.4f}")
    print()
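
Accuracy and ROC AUC alone can hide how errors split between missed claims and false alarms. The sketch below shows how `confusion_matrix` and `classification_report` (both imported at the top of the notebook) could be applied. It is self-contained and uses a stand-in dataset from `make_classification` rather than the notebook's data; `demo_model` and the other variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in binary classification data (not the insurance data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_te, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(classification_report(y_te, y_hat, target_names=['No Claim', 'Claim Filed']))
```

In the notebook itself, the same two calls would take `y_test` and `results[name]['y_pred']` for each trained model.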

In [None]:
# Compare model performance
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Build the table from the numeric results so the columns are float, not object, dtype
model_comparison = pd.DataFrame(
    {metric: {name: results[name][metric] for name in models} for metric in metrics}
).astype(float)

model_comparison

In [None]:
# Visualize model comparison
# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one
model_comparison.astype(float).plot(kind='bar', figsize=(12, 8))
plt.title('Model Performance Comparison')
plt.xlabel('Model')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Plot ROC curves for all models
plt.figure(figsize=(10, 8))

for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
    roc_auc = result['roc_auc']
    plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

## 5. Model Interpretation

Let's analyze the best model to understand which features are most important for predicting insurance claims.

In [None]:
# Determine the best model
best_model_name = model_comparison['roc_auc'].astype(float).idxmax()
best_model = results[best_model_name]['model']
print(f"Best model: {best_model_name} with ROC AUC: {model_comparison.loc[best_model_name, 'roc_auc']:.4f}")

In [None]:
# Feature importance for the best model
feature_importance = None  # Stays None if the model exposes no importances
if hasattr(best_model, 'feature_importances_'):
    # Get feature importances
    importances = best_model.feature_importances_
    
    # Create a DataFrame for visualization
    feature_importance = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': importances
    })
    
    # Sort by importance
    feature_importance = feature_importance.sort_values('Importance', ascending=False)
    
    # Plot top 15 features
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=feature_importance.head(15))
    plt.title(f'Top 15 Feature Importances ({best_model_name})')
    plt.tight_layout()
    plt.show()
else:
    print("Feature importances not available for this model type.")
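
Impurity-based `feature_importances_` are computed from the training data and can favor features with many distinct values, so a cross-check on held-out data is often useful. One option is scikit-learn's `permutation_importance`. The sketch below runs it on a stand-in dataset (not the notebook's data); the names `demo_features`, `perm`, and `perm_importance` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the three informative features in the first three columns
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)
demo_features = [f'feature_{i}' for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each held-out column in turn and record the resulting drop in ROC AUC
perm = permutation_importance(
    demo_model, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=0
)

perm_importance = pd.DataFrame({
    'Feature': demo_features,
    'Importance': perm.importances_mean,
    'Std': perm.importances_std,
}).sort_values('Importance', ascending=False)

print(perm_importance)
```

Applied to this notebook, the call would take `best_model`, `X_test`, and `y_test`, with `feature_cols` supplying the feature names.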

In [None]:
# SHAP values for model interpretation
try:
    # Create a SHAP explainer
    if isinstance(best_model, (RandomForestClassifier, GradientBoostingClassifier, xgb.XGBClassifier)):
        explainer = shap.TreeExplainer(best_model)

        # Calculate SHAP values for a subset of test data
        X_sample = np.asarray(X_test[:100], dtype=float)
        shap_values = explainer.shap_values(X_sample)

        # Some classifiers return one set of values per class; keep the positive class
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif np.ndim(shap_values) == 3:
            shap_values = shap_values[:, :, 1]

        # Summary plot (show=False so the title is applied before rendering)
        shap.summary_plot(shap_values, X_sample, feature_names=feature_cols, show=False)
        plt.title(f'SHAP Summary Plot ({best_model_name})')
        plt.show()

        # Dependence plots for top features
        if feature_importance is not None:
            top_features = feature_importance.head(3)['Feature'].values
            for feature in top_features:
                feature_idx = feature_cols.index(feature)
                shap.dependence_plot(feature_idx, shap_values, X_sample, feature_names=feature_cols, show=False)
                plt.title(f'SHAP Dependence Plot for {feature}')
                plt.show()
    else:
        print("SHAP analysis not implemented for this model type.")
except Exception as e:
    print(f"Error in SHAP analysis: {e}")

## 6. Business Insights and Recommendations

Based on our analysis, let's summarize key insights and recommendations for underwriting.

### Key Insights:

1. **Risk Factors**: The most important predictors of insurance claims are [highlight top 3-5 features based on importance].

2. **Property Types**: [Describe differences in claim rates between property types].

3. **Age Impact**: [Describe how property age affects claim probability].

4. **Risk Scores**: [Describe how flood and fire risk scores correlate with claims].
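
The bracketed placeholders above are meant to be filled from the notebook's own outputs. As a rough sketch of how the supporting numbers could be tabulated, the hypothetical helper `claim_rate_table` below computes the claim rate and row count for each level of a column. It is demonstrated on a tiny made-up frame, not on results from this analysis.

```python
import pandas as pd

def claim_rate_table(df, group_col, target_col='claim_filed'):
    """Return claim rate and row count per level of group_col, highest rate first."""
    return (
        df.groupby(group_col, observed=True)[target_col]
          .agg(claim_rate='mean', n_properties='size')
          .sort_values('claim_rate', ascending=False)
    )

# Made-up example rows for illustration only
toy = pd.DataFrame({
    'property_type': ['Residential', 'Residential', 'Commercial',
                      'Commercial', 'Industrial', 'Industrial'],
    'claim_filed': [0, 1, 1, 1, 0, 0],
})

summary = claim_rate_table(toy, 'property_type')
print(summary)
```

With the notebook's data, `claim_rate_table(property_data, 'property_type')` would support the second insight, and the same call with `'age_bin'` or `'location_risk'` would support the others.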

### Recommendations for Underwriting:

1. **Risk Assessment**: Prioritize evaluation of [top risk factors] when assessing new policies.

2. **Premium Adjustment**: Consider adjusting premiums based on [specific risk factors].

3. **Risk Mitigation**: For high-risk properties, recommend specific risk mitigation measures related to [relevant risk factors].

4. **Data Collection**: Improve collection of data related to [important features] to enhance future risk models.

5. **Model Implementation**: Implement the [best model] in the underwriting process to improve risk assessment accuracy.

## 7. Next Steps

1. **Model Refinement**: Further tune hyperparameters and explore ensemble methods to improve model performance.

2. **Additional Data**: Integrate additional third-party data sources such as weather patterns, crime statistics, and economic indicators.

3. **Deployment Strategy**: Develop an API for real-time risk scoring in the underwriting process.

4. **Monitoring Plan**: Establish a monitoring framework to track model performance over time and detect concept drift.

5. **Feedback Loop**: Create a mechanism to incorporate underwriter feedback to continuously improve the model.
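
For the monitoring step, one commonly used drift check is the population stability index (PSI), which compares the binned distribution of a feature or model score at training time against a later sample. The sketch below is one possible implementation, not a prescribed one: the function name `population_stability_index`, the choice of ten quantile bins, and the simulated samples are all assumptions made for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a newer sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges taken from quantiles of the reference sample
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]

    # searchsorted maps each value to a bin index in [0, n_bins - 1];
    # values outside the reference range land in the first or last bin
    exp_counts = np.bincount(np.searchsorted(edges, expected), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(edges, actual), minlength=n_bins)

    # Clip proportions away from zero to avoid log(0) for empty bins
    exp_pct = np.clip(exp_counts / expected.size, eps, None)
    act_pct = np.clip(act_counts / actual.size, eps, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
reference = rng.uniform(0, 10, size=5000)  # e.g. risk scores at training time
same_dist = rng.uniform(0, 10, size=5000)  # later sample without drift
shifted = rng.uniform(2, 12, size=5000)    # later sample with shifted scores

psi_stable = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)

print(f'PSI (no drift): {psi_stable:.4f}')
print(f'PSI (shifted):  {psi_shifted:.4f}')
```

A value near zero indicates similar distributions, and larger values indicate a larger shift. Any alerting threshold would need to be chosen for the specific portfolio.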