# Insurance Claim Prediction Model
## Predicting Building Insurance Claims Based on Building Characteristics

**Objective:** Build a predictive model that determines whether a building will have at least one insurance claim during its insured period.

**Target Variable:**
- 1: Building has at least one claim over the insured period
- 0: Building has no claims over the insured period

---

## 1. Import Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the data
train_data = pd.read_csv('Train_data.csv')
variable_desc = pd.read_csv('Variable_Description.csv')

print("Dataset loaded successfully!")
print(f"\nTraining data shape: {train_data.shape}")
print(f"Number of columns excluding the target: {train_data.shape[1] - 1}")
print(f"Number of samples: {train_data.shape[0]}")

In [None]:
# Display variable descriptions
print("\n=== VARIABLE DESCRIPTIONS ===")
display(variable_desc)

In [None]:
# First look at the data
print("\n=== FIRST 5 ROWS ===")
display(train_data.head())

In [None]:
# Data info
print("\n=== DATA INFORMATION ===")
train_data.info()

In [None]:
# Statistical summary
print("\n=== STATISTICAL SUMMARY ===")
display(train_data.describe())

In [None]:
# Check for missing values
print("\n=== MISSING VALUES ===")
missing_values = train_data.isnull().sum()
missing_percent = (missing_values / len(train_data)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing_values,
    'Percentage': missing_percent
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
display(missing_df)

In [None]:
# Check data types
print("\n=== DATA TYPES ===")
print(train_data.dtypes)

In [None]:
# Check unique values for categorical columns
print("\n=== UNIQUE VALUES IN CATEGORICAL COLUMNS ===")
categorical_cols = train_data.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col}: {train_data[col].nunique()} unique values")
    print(train_data[col].value_counts().head())

## 3. Data Cleaning and Preprocessing

In [None]:
# Create a copy of the data for cleaning
df = train_data.copy()

print("Original shape:", df.shape)

### 3.1 Handle Customer ID

In [None]:
# Customer Id is only an identifier and carries no predictive signal.
# Drop it from the features, but keep a copy in case rows need to be traced later.
customer_ids = df['Customer Id'].copy()
df = df.drop('Customer Id', axis=1)

print("Dropped Customer Id column")
print("Current shape:", df.shape)

### 3.2 Handle NumberOfWindows Missing Values

In [None]:
# Check NumberOfWindows - it appears to have '   .' for missing values
print("NumberOfWindows unique values:")
print(df['NumberOfWindows'].value_counts().head(10))
print(f"\nData type: {df['NumberOfWindows'].dtype}")

In [None]:
# Replace '   .' with NaN and convert to numeric
df['NumberOfWindows'] = df['NumberOfWindows'].replace('   .', np.nan)
df['NumberOfWindows'] = pd.to_numeric(df['NumberOfWindows'], errors='coerce')

print(f"Missing values in NumberOfWindows: {df['NumberOfWindows'].isnull().sum()}")
print(f"Percentage: {(df['NumberOfWindows'].isnull().sum() / len(df)) * 100:.2f}%")
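
The conversion step can be checked on a small example. The sketch below uses a made-up toy column (not the real data) to show that `errors='coerce'` turns every non-numeric string into NaN, not only the `'   .'` placeholder, so it may be worth listing the non-numeric values before coercing them.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the pattern described above (values are made up)
windows_raw = pd.Series(['3', '   .', '4', '   .', '2'])

# Explicit replacement of the placeholder, then numeric conversion
windows_explicit = pd.to_numeric(windows_raw.replace('   .', np.nan), errors='coerce')

# errors='coerce' alone gives the same result, because it maps ANY
# non-numeric string to NaN
windows_coerced = pd.to_numeric(windows_raw, errors='coerce')

# Listing the strings that would be coerced makes the silent conversion visible
non_numeric_values = windows_raw[windows_coerced.isna()].value_counts()

print(non_numeric_values)
print(f"Missing after conversion: {windows_explicit.isna().sum()}")
```

If the listing shows labels other than the placeholder, they would need their own handling rather than being treated as missing.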

### 3.3 Handle Date_of_Occupancy

In [None]:
# Create age of building feature
df['Building_Age'] = df['YearOfObservation'] - df['Date_of_Occupancy']

# Check for any negative or unrealistic values
print(f"Building Age Statistics:")
print(df['Building_Age'].describe())
print(f"\nNegative Building Age: {(df['Building_Age'] < 0).sum()}")
print(f"Very old buildings (>150 years): {(df['Building_Age'] > 150).sum()}")

In [None]:
# Handle negative or unrealistic building ages
# For negative ages, we take the absolute value
# For very old buildings, we cap the age at 200 years
df['Building_Age'] = df['Building_Age'].abs()
df['Building_Age'] = df['Building_Age'].clip(upper=200)

# Drop original date column as we have Building_Age now
df = df.drop('Date_of_Occupancy', axis=1)

print("Created Building_Age feature and dropped Date_of_Occupancy")
print(f"Building Age range: {df['Building_Age'].min()} to {df['Building_Age'].max()} years")

### 3.4 Encode Categorical Variables

In [None]:
# Check categorical variables
print("Categorical variables encoding:")
print(f"\nBuilding_Painted: {df['Building_Painted'].unique()}")
print(f"Building_Fenced: {df['Building_Fenced'].unique()}")
print(f"Garden: {df['Garden'].unique()}")
print(f"Settlement: {df['Settlement'].unique()}")

In [None]:
# Binary encode categorical variables
# Building_Painted: N=Painted, V=Not Painted -> N=1, V=0
df['Building_Painted'] = df['Building_Painted'].map({'N': 1, 'V': 0})

# Building_Fenced: N=Fenced, V=Not Fenced -> N=1, V=0
df['Building_Fenced'] = df['Building_Fenced'].map({'N': 1, 'V': 0})

# Garden: V=has garden, O=no garden -> V=1, O=0
df['Garden'] = df['Garden'].map({'V': 1, 'O': 0})

# Settlement: U=urban, R=rural -> U=1, R=0
df['Settlement'] = df['Settlement'].map({'U': 1, 'R': 0})

print("Categorical variables encoded successfully!")
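
`Series.map` with a dictionary returns NaN for any value that is not a key, so an unexpected code would silently become a missing value. A minimal sketch of a check for this, using a made-up column with one hypothetical stray code (`'X'`):

```python
import pandas as pd

# Toy column with one unexpected code ('X'); values are made up for illustration
painted_raw = pd.Series(['N', 'V', 'N', 'X'])
mapping = {'N': 1, 'V': 0}

painted_encoded = painted_raw.map(mapping)

# Codes that were present before encoding but are NaN afterwards
unmapped = painted_raw[painted_encoded.isna() & painted_raw.notna()].unique()
print(f"Unmapped codes: {list(unmapped)}")
```

Running the same comparison on each encoded column before overwriting it would show whether the mappings cover every code in the data.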

In [None]:
# Check current state of the dataset
print("\n=== CLEANED DATA INFO ===")
df.info()

In [None]:
# Check for any remaining missing values
print("\n=== REMAINING MISSING VALUES ===")
missing = df.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
    print(missing)
else:
    print("No missing values!")

## 4. Exploratory Data Analysis (EDA)

### 4.1 Target Variable Distribution

In [None]:
# Target variable distribution
print("=== TARGET VARIABLE DISTRIBUTION ===")
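# sort_index() keeps the order 0, 1 so the counts line up with the plot labels below
claim_counts = df['Claim'].value_counts().sort_index()
claim_percent = df['Claim'].value_counts(normalize=True).sort_index() * 100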

print("\nClaim Distribution:")
print(f"No Claim (0): {claim_counts[0]} ({claim_percent[0]:.2f}%)")
print(f"Has Claim (1): {claim_counts[1]} ({claim_percent[1]:.2f}%)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=df, x='Claim', ax=axes[0], palette='Set2')
axes[0].set_title('Target Variable Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Claim Status')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No Claim', 'Has Claim'])

# Add value labels on bars
for i, v in enumerate(claim_counts):
    axes[0].text(i, v + 50, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(claim_counts, labels=['No Claim', 'Has Claim'], autopct='%1.1f%%', 
            colors=sns.color_palette('Set2'), startangle=90)
axes[1].set_title('Target Variable Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Check for class imbalance
imbalance_ratio = claim_counts[0] / claim_counts[1]
print(f"\nClass Imbalance Ratio (No Claim / Has Claim): {imbalance_ratio:.2f}")
if imbalance_ratio > 2 or imbalance_ratio < 0.5:
    print("⚠️ Dataset shows class imbalance. Consider using techniques like SMOTE or class weights.")
else:
    print("✓ Dataset is relatively balanced.")
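
The class-weight option mentioned in the message above can be sketched as follows. The target vector is a made-up toy example; `compute_class_weight` and the `class_weight='balanced'` estimator option are scikit-learn features, and the last ratio is the quantity usually passed to XGBoost's `scale_pos_weight` parameter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 8 negatives, 2 positives (made-up values)
y_toy = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)

# The same weighting can be requested directly from an estimator
weighted_lr = LogisticRegression(class_weight='balanced', max_iter=1000)

# Negative-to-positive ratio, the usual choice for XGBoost's scale_pos_weight
scale_pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()
print(scale_pos_weight)
```

Class weights leave the data untouched, which makes them a lighter first step than resampling methods such as SMOTE.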

### 4.2 Numerical Features Analysis

In [None]:
# Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_cols.remove('Claim')  # Remove target variable

print(f"Numerical features: {numerical_cols}")

In [None]:
# Distribution of numerical features
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    if idx < len(axes):
        axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Frequency')
        axes[idx].grid(alpha=0.3)

# Hide unused subplots
for idx in range(len(numerical_cols), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
# Box plots to detect outliers
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    if idx < len(axes):
        sns.boxplot(y=df[col].dropna(), ax=axes[idx], palette='Set2')
        axes[idx].set_title(f'Boxplot of {col}', fontweight='bold')
        axes[idx].set_ylabel(col)

# Hide unused subplots
for idx in range(len(numerical_cols), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

### 4.3 Relationship with Target Variable

In [None]:
# Numerical features vs Target
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    if idx < len(axes):
        df_temp = df[[col, 'Claim']].dropna()
        sns.boxplot(data=df_temp, x='Claim', y=col, ax=axes[idx], palette='Set2')
        axes[idx].set_title(f'{col} vs Claim Status', fontweight='bold')
        axes[idx].set_xticklabels(['No Claim', 'Has Claim'])

# Hide unused subplots
for idx in range(len(numerical_cols), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
# Categorical features vs Target
categorical_features = ['Residential', 'Building_Painted', 'Building_Fenced', 'Garden', 'Settlement']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, col in enumerate(categorical_features):
    if idx < len(axes):
        ct = pd.crosstab(df[col], df['Claim'], normalize='index') * 100
        ct.plot(kind='bar', ax=axes[idx], color=['#66c2a5', '#fc8d62'], width=0.7)
        axes[idx].set_title(f'{col} vs Claim Rate', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Percentage')
        axes[idx].legend(['No Claim', 'Has Claim'])
        axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=0)

# Hide unused subplots
for idx in range(len(categorical_features), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()

### 4.4 Correlation Analysis

In [None]:
# Correlation matrix
plt.figure(figsize=(14, 10))
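# numeric_only=True skips any column that is still non-numeric,
# which would otherwise raise an error in recent pandas versions
correlation_matrix = df.corr(numeric_only=True)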
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix of All Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

In [None]:
# Correlation with target variable
target_corr = correlation_matrix['Claim'].sort_values(ascending=False)
print("\n=== CORRELATION WITH TARGET VARIABLE ===")
print(target_corr)

# Visualize
plt.figure(figsize=(10, 8))
target_corr[target_corr.index != 'Claim'].plot(kind='barh', color='steelblue')
plt.title('Feature Correlation with Claim (Target Variable)', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Features')
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 4.5 Additional Insights

In [None]:
# Claim rate by Building Type
building_type_claim = df.groupby('Building_Type')['Claim'].agg(['sum', 'count', 'mean'])
building_type_claim.columns = ['Total_Claims', 'Total_Buildings', 'Claim_Rate']
building_type_claim['Claim_Rate'] = building_type_claim['Claim_Rate'] * 100
building_type_claim = building_type_claim.sort_values('Claim_Rate', ascending=False)

print("\n=== CLAIM RATE BY BUILDING TYPE ===")
display(building_type_claim)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
building_type_claim['Claim_Rate'].plot(kind='bar', color='coral', ax=ax)
plt.title('Claim Rate by Building Type', fontsize=14, fontweight='bold')
plt.xlabel('Building Type')
plt.ylabel('Claim Rate (%)')
plt.xticks(rotation=0)
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Claim rate by Year of Observation
year_claim = df.groupby('YearOfObservation')['Claim'].agg(['sum', 'count', 'mean'])
year_claim.columns = ['Total_Claims', 'Total_Buildings', 'Claim_Rate']
year_claim['Claim_Rate'] = year_claim['Claim_Rate'] * 100

print("\n=== CLAIM RATE BY YEAR ===")
display(year_claim)

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(year_claim.index, year_claim['Claim_Rate'], marker='o', linewidth=2, markersize=8, color='darkgreen')
plt.title('Claim Rate Trend Over Years', fontsize=14, fontweight='bold')
plt.xlabel('Year of Observation')
plt.ylabel('Claim Rate (%)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Analyze Building Dimension quartiles and claim rate
df['Dimension_Quartile'] = pd.qcut(df['Building Dimension'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
dimension_claim = df.groupby('Dimension_Quartile')['Claim'].mean() * 100

print("\n=== CLAIM RATE BY BUILDING DIMENSION QUARTILE ===")
print(dimension_claim)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
dimension_claim.plot(kind='bar', color='purple', alpha=0.7, ax=ax)
plt.title('Claim Rate by Building Dimension Quartile', fontsize=14, fontweight='bold')
plt.xlabel('Building Dimension Quartile')
plt.ylabel('Claim Rate (%)')
plt.xticks(rotation=0)
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# Drop the temporary column
df = df.drop('Dimension_Quartile', axis=1)

### 4.6 Key Insights Summary

In [None]:
print("\n" + "="*80)
print("KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS")
print("="*80)

# Target distribution
claim_rate = (df['Claim'].sum() / len(df)) * 100
print(f"\n1. Overall claim rate: {claim_rate:.2f}%")

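# Most correlated features, ranked by absolute correlation (sign kept in the output)
feature_corr = target_corr.drop('Claim')
top_corr = feature_corr.reindex(feature_corr.abs().sort_values(ascending=False).index).head(3)
print(f"\n2. Top 3 features correlated with claims:")
for feature, corr in top_corr.items():
    print(f"   - {feature}: {corr:+.3f}")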

# Missing values summary
missing_count = df.isnull().sum().sum()
print(f"\n3. Total missing values to handle: {missing_count}")

# Building types analysis
high_risk_type = building_type_claim.index[0]
print(f"\n4. Highest risk building type: Type {high_risk_type} ({building_type_claim.loc[high_risk_type, 'Claim_Rate']:.2f}% claim rate)")

print("\n" + "="*80)

## 5. Feature Engineering and Preprocessing for Modeling

### 5.1 Create Additional Features

In [None]:
# Create a copy for modeling
df_model = df.copy()

# Feature: Total property features (painted + fenced + garden)
df_model['Total_Features'] = (df_model['Building_Painted'].fillna(0) + 
                               df_model['Building_Fenced'].fillna(0) + 
                               df_model['Garden'].fillna(0))

# Feature: Building dimension categories
# include_lowest=True keeps values equal to the first bin edge (0) inside the first bin;
# labels=False returns the bin index and leaves missing inputs as NaN instead of
# failing on an integer cast
df_model['Dimension_Category'] = pd.cut(df_model['Building Dimension'],
                                         bins=[0, 500, 1000, 2000, float('inf')],
                                         labels=False, include_lowest=True)

# Feature: Building age categories
df_model['Age_Category'] = pd.cut(df_model['Building_Age'],
                                   bins=[0, 20, 50, 100, float('inf')],
                                   labels=False, include_lowest=True)

# Feature: Insured period categories
df_model['InsuredPeriod_Category'] = pd.cut(df_model['Insured_Period'],
                                             bins=[0, 0.5, 0.75, 1.0],
                                             labels=False, include_lowest=True)

print("New features created:")
print("- Total_Features: Sum of painted, fenced, and garden features")
print("- Dimension_Category: Building dimension in categories")
print("- Age_Category: Building age in categories")
print("- InsuredPeriod_Category: Insured period in categories")

print(f"\nTotal features now: {df_model.shape[1]}")
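
One detail of `pd.cut` matters for the binned features: intervals are right-closed by default, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. A small illustration with made-up ages:

```python
import pandas as pd

# Made-up ages, including a building observed in its first year (age 0)
ages = pd.Series([0, 10, 20, 35, 120])
bins = [0, 20, 50, 100, float('inf')]

# Default: first interval is (0, 20], so age 0 gets no category
default_cut = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3])

# include_lowest=True makes the first interval include 0 as well
inclusive_cut = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3], include_lowest=True)

print(f"Uncategorised with default settings: {default_cut.isna().sum()}")
print(f"Categories with include_lowest=True: {inclusive_cut.tolist()}")
```

An uncategorised value becomes NaN, and a later `astype(int)` on that column raises an error.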

### 5.2 Handle Missing Values

In [None]:
# Check for missing values
missing_cols = df_model.isnull().sum()
missing_cols = missing_cols[missing_cols > 0]
print("Columns with missing values:")
print(missing_cols)

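# Impute numeric columns that have missing values (NumberOfWindows in particular)
# with the column median. The median is computed on the full dataset, before the split.
for col in missing_cols.index:
    if pd.api.types.is_numeric_dtype(df_model[col]):
        median_value = df_model[col].median()
        # Assign the result instead of using inplace=True on a column selection
        df_model[col] = df_model[col].fillna(median_value)
        print(f"\nImputed {col} with median: {median_value}")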

# Check if there are any remaining missing values
print(f"\nRemaining missing values: {df_model.isnull().sum().sum()}")
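
The imputation above computes the median over the full dataset, so the rows that later form the test set influence the fill value. One possible alternative, sketched here on synthetic stand-in data rather than the insurance dataset, is to put `SimpleImputer` (imported in Section 1) inside a scikit-learn `Pipeline` so the median is learned from the training rows only. The column names and the choice of logistic regression are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the insurance dataset)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    'NumberOfWindows': rng.integers(1, 10, size=200).astype(float),
    'Building_Age': rng.integers(0, 80, size=200).astype(float),
})
X_toy.loc[rng.choice(200, size=30, replace=False), 'NumberOfWindows'] = np.nan
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Imputation and scaling are fitted on the training rows only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

train_median = pipe.named_steps['impute'].statistics_[0]
print(f"Median learned from training rows only: {train_median}")

# The test rows are transformed with the training statistics
proba = pipe.predict_proba(X_te)[:, 1]
```

The same pipeline object can be passed to `cross_val_score`, in which case the imputer and scaler are re-fitted inside every fold.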

### 5.3 Prepare Features and Target

In [None]:
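# Separate features and target
X = df_model.drop('Claim', axis=1)
y = df_model['Claim']

# The scaler and the models need numeric input, so drop any column
# that is still non-numeric at this point
non_numeric_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()
if non_numeric_cols:
    print(f"Dropping non-numeric columns: {non_numeric_cols}")
    X = X.drop(columns=non_numeric_cols)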

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")

### 5.4 Train-Test Split

In [None]:
# Split the data (80-20 split, stratified to maintain class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nTrain set claim rate: {(y_train.sum() / len(y_train)) * 100:.2f}%")
print(f"Test set claim rate: {(y_test.sum() / len(y_test)) * 100:.2f}%")

### 5.5 Feature Scaling

In [None]:
# Standardize features (important for models like Logistic Regression, SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Features scaled using StandardScaler")
print(f"\nScaled training data shape: {X_train_scaled.shape}")
print(f"Scaled test data shape: {X_test_scaled.shape}")

## 6. Model Building and Training

### 6.1 Define Models

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss', n_jobs=-1),
    'Support Vector Machine': SVC(probability=True, random_state=42),
    'Naive Bayes': GaussianNB()
}

print(f"Initialized {len(models)} different models:")
for i, model_name in enumerate(models.keys(), 1):
    print(f"{i}. {model_name}")

### 6.2 Train Models

In [None]:
# Dictionary to store trained models and results
trained_models = {}
model_results = []

print("Training models...\n")
print("="*80)

for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    
    # Use scaled data for models that benefit from it
    if model_name in ['Logistic Regression', 'Support Vector Machine', 'Naive Bayes']:
        X_train_use = X_train_scaled
        X_test_use = X_test_scaled
    else:
        X_train_use = X_train
        X_test_use = X_test
    
    # Train model
    model.fit(X_train_use, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_use)
    y_pred_proba = model.predict_proba(X_test_use)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    # zero_division=0 returns 0 when a model predicts no positive cases
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    trained_models[model_name] = model
    model_results.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc
    })
    
    print(f"✓ {model_name} trained successfully!")
    print(f"  Accuracy: {accuracy:.4f} | ROC-AUC: {roc_auc:.4f}")

print("\n" + "="*80)
print("All models trained successfully!")

## 7. Model Evaluation and Comparison

### 7.1 Performance Comparison

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(model_results)
results_df = results_df.sort_values('ROC-AUC', ascending=False).reset_index(drop=True)

print("\n" + "="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)
display(results_df.style.background_gradient(cmap='YlGnBu', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']))

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 3, idx % 3]
    results_sorted = results_df.sort_values(metric, ascending=True)
    ax.barh(results_sorted['Model'], results_sorted[metric], color='steelblue')
    ax.set_xlabel(metric, fontweight='bold')
    ax.set_title(f'Model Comparison: {metric}', fontweight='bold')
    ax.set_xlim([0, 1])
    ax.grid(alpha=0.3, axis='x')

# Hide the last subplot
axes[1, 2].set_visible(False)

plt.tight_layout()
plt.show()

### 7.2 Confusion Matrices

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.ravel()

for idx, (model_name, model) in enumerate(trained_models.items()):
    # Use appropriate data
    if model_name in ['Logistic Regression', 'Support Vector Machine', 'Naive Bayes']:
        X_test_use = X_test_scaled
    else:
        X_test_use = X_test
    
    y_pred = model.predict(X_test_use)
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=['No Claim', 'Has Claim'],
                yticklabels=['No Claim', 'Has Claim'])
    axes[idx].set_title(f'{model_name}', fontweight='bold')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

# Hide unused subplot
axes[-1].set_visible(False)

plt.tight_layout()
plt.show()

### 7.3 ROC Curves

In [None]:
# Plot ROC curves for all models
plt.figure(figsize=(12, 8))

for model_name, model in trained_models.items():
    # Use appropriate data
    if model_name in ['Logistic Regression', 'Support Vector Machine', 'Naive Bayes']:
        X_test_use = X_test_scaled
    else:
        X_test_use = X_test
    
    y_pred_proba = model.predict_proba(X_test_use)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})', linewidth=2)

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=2)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
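
When the positive class is the minority, a precision-recall curve can complement the ROC curves, because it focuses on how well the claims themselves are identified. `precision_recall_curve` is imported in Section 1 but not used; the sketch below shows one way to apply it, on synthetic data from `make_classification` rather than the insurance dataset. The F1-based threshold is an illustrative choice, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80% negatives, 20% positives)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds
precision, recall, thresholds = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)
print(f"Average precision: {ap:.3f}")

# F1 at each threshold; the final precision/recall pair has no threshold
denominator = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / denominator
best_threshold = thresholds[np.argmax(f1_per_threshold)]
print(f"Threshold with the highest F1: {best_threshold:.3f}")
```

Plotting `recall` against `precision` with `plt.plot(recall, precision)` gives the curve itself. In practice the threshold would be chosen on a validation fold rather than on the test set.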

### 7.4 Detailed Report for Best Model

In [None]:
# Identify best model based on ROC-AUC
best_model_name = results_df.iloc[0]['Model']
best_model = trained_models[best_model_name]

print("\n" + "="*80)
print(f"BEST MODEL: {best_model_name}")
print("="*80)

# Use appropriate data
if best_model_name in ['Logistic Regression', 'Support Vector Machine', 'Naive Bayes']:
    X_test_use = X_test_scaled
else:
    X_test_use = X_test

y_pred_best = best_model.predict(X_test_use)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best, target_names=['No Claim', 'Has Claim']))

# Confusion matrix
cm_best = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_best, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Claim', 'Has Claim'],
            yticklabels=['No Claim', 'Has Claim'])
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontweight='bold')
plt.xlabel('Predicted Label', fontweight='bold')
plt.tight_layout()
plt.show()

### 7.5 Feature Importance (for tree-based models)

In [None]:
# Feature importance for tree-based models
tree_models = ['Decision Tree', 'Random Forest', 'Gradient Boosting', 'XGBoost']

for model_name in tree_models:
    if model_name in trained_models:
        model = trained_models[model_name]
        
        if hasattr(model, 'feature_importances_'):
            importances = model.feature_importances_
            feature_importance_df = pd.DataFrame({
                'Feature': X_train.columns,
                'Importance': importances
            }).sort_values('Importance', ascending=False)
            
            print(f"\n=== Feature Importance: {model_name} ===")
            display(feature_importance_df.head(10))
            
            # Visualize
            plt.figure(figsize=(10, 6))
            plt.barh(feature_importance_df['Feature'].head(10), 
                    feature_importance_df['Importance'].head(10),
                    color='teal')
            plt.xlabel('Importance', fontweight='bold')
            plt.title(f'Top 10 Feature Importances - {model_name}', 
                     fontsize=14, fontweight='bold')
            plt.gca().invert_yaxis()
            plt.grid(alpha=0.3, axis='x')
            plt.tight_layout()
            plt.show()
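
The `feature_importances_` attribute is computed from the training data and exists only for tree-based models. Permutation importance, available as `sklearn.inspection.permutation_importance`, is a model-agnostic complement that measures how much a held-out score drops when one feature is shuffled. A minimal sketch on synthetic data, with placeholder feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; feature names are placeholders
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_toy.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the ROC-AUC drop
result = permutation_importance(
    rf, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=42
)

perm_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': result.importances_mean,
    'Std': result.importances_std,
}).sort_values('Importance', ascending=False)
print(perm_df)
```

Because it only needs predictions, the same call works for Logistic Regression, SVM, and Naive Bayes, provided they receive the scaled test features.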

### 7.6 Cross-Validation for Best Model

In [None]:
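# Perform cross-validation on the best model
print(f"\nPerforming 5-fold cross-validation on {best_model_name}...")

from sklearn.base import clone
from sklearn.pipeline import make_pipeline

# Cross-validate on the training data only, so the held-out test set stays untouched.
# For scale-sensitive models the scaler goes inside a pipeline: it is re-fitted on
# each training fold, and the `scaler` fitted in Section 5.5 is not overwritten.
if best_model_name in ['Logistic Regression', 'Support Vector Machine', 'Naive Bayes']:
    cv_estimator = make_pipeline(StandardScaler(), clone(best_model))
else:
    cv_estimator = clone(best_model)

cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_estimator, X_train, y_train, cv=cv_folds, scoring='roc_auc')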

print(f"\nCross-Validation ROC-AUC Scores: {cv_scores}")
print(f"Mean ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Visualize CV scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-', linewidth=2, markersize=10)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', label=f'Mean: {cv_scores.mean():.4f}')
plt.xlabel('Fold Number', fontweight='bold')
plt.ylabel('ROC-AUC Score', fontweight='bold')
plt.title(f'Cross-Validation Scores - {best_model_name}', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Summary and Recommendations

In [None]:
print("\n" + "="*80)
print("PROJECT SUMMARY AND RECOMMENDATIONS")
print("="*80)

print(f"\n1. BEST PERFORMING MODEL: {best_model_name}")
print(f"   - Test Set ROC-AUC: {results_df.iloc[0]['ROC-AUC']:.4f}")
print(f"   - Test Set Accuracy: {results_df.iloc[0]['Accuracy']:.4f}")
print(f"   - Cross-Validation ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

print(f"\n2. TOP 3 MODELS:")
for i in range(min(3, len(results_df))):
    print(f"   {i+1}. {results_df.iloc[i]['Model']}: ROC-AUC = {results_df.iloc[i]['ROC-AUC']:.4f}")

print(f"\n3. KEY INSIGHTS:")
print(f"   - Dataset contains {len(df)} buildings with {X.shape[1]} features")
print(f"   - Overall claim rate: {(y.sum() / len(y)) * 100:.2f}%")
print(f"   - Class distribution maintained in train-test split")

print(f"\n4. RECOMMENDATIONS:")
print(f"   - Deploy {best_model_name} for production predictions")
print(f"   - Monitor model performance regularly and retrain with new data")
print(f"   - Consider collecting more data for buildings with claims to improve recall")
print(f"   - Implement feature monitoring to detect data drift")
print(f"   - Consider ensemble methods combining top 3 models for robust predictions")

print("\n" + "="*80)
print("ANALYSIS COMPLETE!")
print("="*80)

## 9. Save Model (Optional)

In [None]:
# Uncomment to save the best model
# import joblib
# joblib.dump(best_model, 'best_insurance_claim_model.pkl')
# joblib.dump(scaler, 'scaler.pkl')
# print(f"Best model ({best_model_name}) saved successfully!")
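
If the chosen model needs scaled inputs, saving the model and the scaler as two files means both must be loaded and applied in the right order at prediction time. One possible alternative is to persist a single fitted pipeline. The sketch below uses synthetic data and a hypothetical file name, and writes to a temporary directory so it leaves nothing behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared training features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaler and classifier are fitted and stored together
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'claim_pipeline.pkl')  # hypothetical file name
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline reproduces the original predictions on raw features
same_predictions = np.array_equal(pipe.predict(X_toy), restored.predict(X_toy))
print(f"Predictions match after reload: {same_predictions}")
```

Files written by `joblib` should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.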

---
## Project Checklist ✓

✅ **Data Cleaning and Preprocessing**
- Handled missing values in NumberOfWindows
- Encoded categorical variables
- Created Building_Age feature from Date_of_Occupancy
- Removed identifier column (Customer Id)

✅ **Exploratory Data Analysis**
- Target variable distribution analysis
- Numerical features distribution and outlier detection
- Correlation analysis
- Feature relationship with target variable
- Claim rate analysis by building type, year, and dimensions

✅ **Preprocessing for Modeling**
- Feature engineering (Total_Features, Dimension_Category, Age_Category, InsuredPeriod_Category)
- Train-test split with stratification
- Feature scaling with StandardScaler
- Median imputation of missing values

✅ **Multiple Model Implementation**
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
- Support Vector Machine
- Naive Bayes

✅ **Model Evaluation**
- Comprehensive metrics (Accuracy, Precision, Recall, F1-Score, ROC-AUC)
- Confusion matrices for all models
- ROC curves comparison
- Feature importance analysis
- Cross-validation for best model
- Detailed classification report

---

**GitHub Repository Structure Recommendation:**
```
insurance-claim-prediction/
│
├── README.md
├── insurance_claim_prediction.ipynb
├── data/
│   ├── Train_data.csv
│   └── Variable_Description.csv
├── models/
│   ├── best_insurance_claim_model.pkl
│   └── scaler.pkl
└── requirements.txt
```

**README.md should include:**
- Project objective and description
- Dataset information
- Installation instructions
- Usage guide
- Model performance summary
- Key findings and insights

---