# **Project Name**    -  "Classification - Max Life Health Insurance Cross Sell Prediction"



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**

In this capstone project, the primary objective was to develop a machine learning classification model to predict which vehicle insurance customers are most likely to opt for a health insurance cross-sell product for Max Life Insurance. The business goal was to identify high-potential customers for targeted marketing, thereby increasing cross-sell conversions and overall revenue.

The dataset provided contained over 380,000 records with customer demographic and policy-related information. The problem was framed as a binary classification task where the target variable was whether a customer opted for the health insurance cross-sell product.

* Data Preparation and Exploration
The initial phase involved extensive data wrangling, including checking for null values, understanding data types, handling outliers, and converting categorical variables into numerical formats using encoding techniques. Exploratory Data Analysis (EDA) was conducted to uncover patterns and relationships among variables such as age, vintage, annual premium, policy sales channel, and vehicle damage status.

Features such as Vintage, Annual Premium, Age, and Vehicle Damage ranked highly in the Decision Tree's feature importances. The dataset also exhibited class imbalance, with only about 12% of customers opting for the cross-sell product. This imbalance was addressed with class weighting (`class_weight='balanced'`) within the models.

* Model Building and Evaluation
Two models were implemented:

Decision Tree Classifier

Random Forest Classifier

Hyperparameter tuning used GridSearchCV for the Decision Tree and RandomizedSearchCV for the Random Forest, the latter to limit computation time. The Decision Tree served as a baseline model, and Random Forest was selected for its robustness, ensemble learning advantage, and higher ROC AUC.

Evaluation metrics considered included Accuracy, Precision, Recall, F1-Score, and ROC AUC Score. However, due to the business focus on maximizing customer acquisition opportunities, Recall and ROC AUC Score were prioritized as key metrics.

The Random Forest model outperformed the Decision Tree on ROC AUC, achieving a score of 0.85 and a Recall of 93% for the positive class, at a positive-class precision of 0.28. The model therefore captures most of the customers likely to purchase the health insurance cross-sell product, at the cost of also flagging many who will not.

*  Business Impact
By deploying this machine learning model, Max Life can significantly improve the effectiveness of its health insurance cross-sell initiatives. The high recall ensures that most of the potential buyers are captured, directly supporting business growth. The model also enables data-driven marketing segmentation, allowing the company to allocate resources efficiently by targeting customers with a high likelihood of conversion.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


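**Problem Statement:** Max Life Health Insurance Cross Sell Prediction. The task is to predict which existing vehicle insurance customers are likely to opt for Max Life's health insurance cross-sell product, framed as binary classification on the `Response` column.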

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('Train_Health.csv')


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull())


### What did you know about your dataset?

Answer Here:-The dataset contained 381109 records with 12 columns.

Each row represents a vehicle insurance customer’s details and whether they opted for a health insurance cross-sell product.

The dataset was a mix of numerical and categorical variables describing customer demographics, policy-related attributes, and their cross-sell response.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

### Variables Description

Answer Here:-
* id is just an identifier — not used for prediction.

* Driving_License is highly imbalanced (almost all customers have a license).

* Region_Code and Policy_Sales_Channel are encoded numerically but are categorical in nature.

* Annual_Premium has significant variance and outliers, hence required outlier treatment.

* Response is highly imbalanced.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# dropping ID
df.drop('id',axis=1,inplace=True)

In [None]:
# Encoding Gender column.
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

# Verify result
print(df['Gender'].unique())


In [None]:
# Convert Vehicle_Age to an ordinal numeric code (older vehicle = higher value)
df['Vehicle_Age'] = df['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})


In [None]:
# Encoding vehicle damage
df['Vehicle_Damage'] = df['Vehicle_Damage'].map({'Yes': 1, 'No': 0})

In [None]:
# check for any duplicate record
print("Duplicate rows:", df.duplicated().sum())
# drop duplicate records
df.drop_duplicates(inplace=True)

### What all manipulations have you done and insights you found?

Answer Here:-
* Cleaned, transformed, and prepared the data for modeling.
* Dropped the irrelevant `id` column.
* Encoded Gender and Vehicle_Damage as binary 0/1 values and Vehicle_Age as an ordinal 0/1/2 value using dictionary mappings.
* Found 269 duplicate records and dropped them.
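
The dictionary-based encodings above silently turn any label that is missing from the mapping into NaN. A minimal sketch of a guard against that, on a small made-up frame (the helper name `encode_with_map` and the toy data are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

def encode_with_map(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers and fail loudly on labels missing from the mapping.

    Assumes the input has no missing values, as is the case for this dataset.
    """
    encoded = series.map(mapping)
    unseen = series[encoded.isna() & series.notna()].unique()
    if len(unseen) > 0:
        raise ValueError(f"Unmapped labels in {series.name!r}: {list(unseen)}")
    return encoded.astype("int64")

# Toy data (hypothetical), using the same label spellings as the notebook
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
})
toy["Gender"] = encode_with_map(toy["Gender"], {"Male": 1, "Female": 0})
toy["Vehicle_Age"] = encode_with_map(
    toy["Vehicle_Age"], {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
)
print(toy)
```

A misspelled or unexpected category then raises a `ValueError` at encoding time instead of surfacing later as NaN values in the model input.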

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code(univariate)
sns.countplot(x='Gender', data=df)
plt.title("Gender Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To visualize counts of gender

##### 2. What is/are the insight(s) found from the chart?

Answer:- The customer base is slightly male-dominated.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.countplot(x='Vehicle_Age', data=df)
plt.title("Vehicle Age Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here. To check distribution of vehicles age.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:- Most vehicles are less than 2 years old ('< 1 Year' or '1-2 Year').

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.countplot(x='Response', data=df)
plt.title("Distribution of Target Variable")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here. To visualize balance of target variable.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:- The target is imbalanced: only about 12% of customers responded positively.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Response vs Gender
sns.countplot(x='Gender', hue='Response', data=df)
plt.title("Response Rate by Gender")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here:- To compare responses across genders.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:- Male customers gave more positive responses than female customers in absolute terms.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Response vs Vehicle damage
sns.countplot(x='Vehicle_Damage', hue='Response', data=df)
plt.title("Response Rate by Vehicle Damage Status")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here:- To show the relationship between Response and Vehicle_Damage.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:Customers whose vehicles have been previously damaged are much more likely to opt for cross-sell health insurance.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here:-to visualize the correlation between different features.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:-Relation between different features.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here:-Customers who have had vehicle damage are more likely to opt for health insurance compared to those who haven't.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here:-

*   H0: There is no association between Vehicle_Damage and Response.
*   H1: There is a significant association between Vehicle_Damage and Response.






#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Create a contingency table
contingency_table = pd.crosstab(df['Vehicle_Damage'], df['Response'])

# Perform Chi-Square Test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
# The p-value is below 0.05, so we reject the null hypothesis

##### Which statistical test have you done to obtain P-Value?

Answer Here:- Chi-Square test of independence

##### Why did you choose the specific statistical test?

Answer Here:-Both variables are categorical and Chi-Square test is appropriate for testing association between two categorical variables.
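
With several hundred thousand rows, a chi-square test will flag even weak associations as significant, so an effect size is a useful companion to the p-value. The sketch below computes Cramér's V from a contingency table; the helper `cramers_v` and the toy tables are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table) -> float:
    """Cramér's V: 0 means no association, 1 means perfect association."""
    table = np.asarray(table)
    # correction=False: use the plain (uncorrected) chi-square statistic
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

independent = [[25, 25], [25, 25]]   # rows and columns unrelated
perfect = [[50, 0], [0, 50]]         # row fully determines column
print(cramers_v(independent), cramers_v(perfect))
```

In the notebook, the same helper could be applied to `pd.crosstab(df['Vehicle_Damage'], df['Response'])` to judge how strong the association is, not only whether it exists.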

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here:-

* H0: There is no association between Previously_Insured and Response.


* H1: There is a significant association between Previously_Insured and Response.




#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Create contingency table
contingency_table2 = pd.crosstab(df['Previously_Insured'], df['Response'])

# Perform Chi-Square Test
chi2, p, dof, expected = chi2_contingency(contingency_table2)

# Output results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

# The p-value is below 0.05, so we reject the null hypothesis

##### Which statistical test have you done to obtain P-Value?

Answer Here:- Chi-Square test of independence

##### Why did you choose the specific statistical test?

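Answer Here:-
The Chi-Square test is well suited to testing the association between two categorical variables.

It makes no assumption about how the variables themselves are distributed; it only requires independent observations and sufficiently large expected cell counts.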

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here:-

*   H0: The mean age of customers who opt for health insurance (Response = 1) is equal to the mean age of those who don’t (Response = 0).
*   H1: The mean age of customers who opt for health insurance is different from those who don't.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Split the data into two groups based on Response
age_opted = df[df['Response'] == 1]['Age']
age_not_opted = df[df['Response'] == 0]['Age']

# Perform Independent T-test
t_stat, p_value = ttest_ind(age_opted, age_not_opted)

# Output results
print("T-Statistic:", t_stat)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Answer Here:- Independent two-sample t-test

##### Why did you choose the specific statistical test?

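Answer Here:- Response splits customers into two independent groups and Age is continuous, so an independent two-sample t-test is a natural way to compare the group means.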
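
`ttest_ind` assumes equal variances by default. Because the two response groups differ greatly in size, Welch's variant (`equal_var=False`) is a common, more robust alternative. A minimal sketch on synthetic ages (the group sizes, means, and spreads below are made up for illustration and are not the dataset's values):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for the Age values of the Response == 1 and Response == 0 groups
age_opted = rng.normal(loc=43, scale=12, size=500)
age_not_opted = rng.normal(loc=38, scale=15, size=4000)

# Welch's t-test: does not assume the two groups share a variance
t_stat, p_value = ttest_ind(age_opted, age_not_opted, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```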

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()
# No missing values in dataset.

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here:There are no missing values so we didn't apply any.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Vintage and Annual_Premium
# detecting outliers (visually)
for col in ['Vintage', 'Annual_Premium']:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[col])
    plt.title(f'Box Plot of {col}')
    plt.tight_layout()
    plt.show()

In [None]:
# IQR method to detect outliers.
for col in ['Annual_Premium', 'Vintage']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers_count = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
    print(f"{col}:")
    print(f"  Total Outliers: {outliers_count}")



In [None]:
# Capping extreme values
print(df['Annual_Premium'].skew()) #Highly right skewed

# Capping Annual_Premium at 1st and 99th percentile
P1 = df['Annual_Premium'].quantile(0.01)
P99 = df['Annual_Premium'].quantile(0.99)

df['Annual_Premium_capped'] = df['Annual_Premium'].clip(lower=P1, upper=P99)

# skewness after capping.
print(df['Annual_Premium_capped'].skew())

##### What all outlier treatment techniques have you used and why did you use those techniques?

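Answer Here:- Annual_Premium was capped at the 1st and 99th percentiles (winsorization). This keeps every row while limiting the influence of extreme values on tree splits, and it leaves values inside that range unchanged, unlike a log transform, which rescales the whole distribution.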
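
The capping step can be wrapped into a small reusable helper. The sketch below uses a synthetic right-skewed series rather than the real `Annual_Premium` column; the function name `cap_at_percentiles` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def cap_at_percentiles(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the [lower, upper] quantile range (winsorization)."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=lo, upper=hi)

rng = np.random.default_rng(42)
# Log-normal draws mimic a right-skewed premium distribution (synthetic data)
premium = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=10_000))
capped = cap_at_percentiles(premium)

print("skew before:", round(premium.skew(), 2))
print("skew after: ", round(capped.skew(), 2))
print("rows kept:  ", len(capped) == len(premium))
```

Unlike dropping outlier rows, clipping preserves the row count, which matters here because the minority class is already scarce.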

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Already done in the Data Wrangling section (Gender, Vehicle_Age, Vehicle_Damage).

#### What all categorical encoding techniques have you used & why did you use those techniques?

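Answer Here:- All three categorical columns were encoded with explicit dictionary mappings:

Gender: binary encoding (Male = 1, Female = 0)

Vehicle_Age: ordinal encoding ('< 1 Year' = 0, '1-2 Year' = 1, '> 2 Years' = 2), which preserves the natural ordering of the categories

Vehicle_Damage: binary encoding (Yes = 1, No = 0)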

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

#### 2. Feature Selection FOR DT

In [None]:
# Select your features wisely to avoid overfitting
features_dt = ['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
               'Vehicle_Damage', 'Vehicle_Age', 'Policy_Sales_Channel',
               'Vintage', 'Annual_Premium_capped']

X_dt = df[features_dt]
y_dt = df['Response']

In [None]:
# Feature importance for DT
# Fit Decision Tree directly on full data
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_dt, y_dt)

# Get feature importances
importances = dt_model.feature_importances_
feature_names = X_dt.columns

# Create dataframe of feature importances
feat_imp_dt = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feat_imp_dt.sort_values(by='Importance', ascending=False, inplace=True)

# Display feature importances
print(feat_imp_dt)

# Plot feature importances
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_dt)
plt.title('Feature Importance - Decision Tree (Full Data)')
plt.show()


##### What all feature selection methods have you used  and why?

Answer Here:-Model-based ranking for Decision Tree / Random Forest

##### Which all features you found important and why?

In [None]:
final_features_dt = ['Gender', 'Age', 'Region_Code', 'Previously_Insured',
                     'Vehicle_Damage', 'Vehicle_Age', 'Policy_Sales_Channel',
                     'Vintage', 'Annual_Premium_capped']

Answer Here:- Vintage, Annual_Premium_capped, Age, Vehicle_Damage, and Region_Code are our primary predictors because they show the highest importance in the Decision Tree. Driving_License was dropped from the final feature list.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Already done during outlier capping: Annual_Premium was winsorized at the 1st and 99th percentiles for the tree-based models.

### 6. Data Scaling

In [None]:
# Scaling your data
# Not applied: Decision Tree and Random Forest models do not need feature scaling.

##### Which method have you used to scale you data and why?

No scaling was applied. Decision Tree and Random Forest models split on feature thresholds, so rescaling a feature does not change the splits they learn.
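
Tree splits compare a feature against a threshold, so only the ordering of values matters, which is why scaling is unnecessary here. A small sketch on synthetic data (the features and labels are made up) fits the same tree on raw and rescaled inputs; multiplying by 1024 stands in for any positive rescaling:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic integer-valued features, loosely in the spirit of Age and Vintage
X = np.column_stack([
    rng.integers(20, 80, size=600),
    rng.integers(10, 300, size=600),
]).astype(float)
y = (X[:, 0] + 0.1 * X[:, 1] > 65).astype(int)

# Same model configuration, fitted once on raw and once on rescaled features
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X * 1024, y)

same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X * 1024))
print("identical predictions:", same)
```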

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here:-We did not apply dimensionality reduction in this project because the dataset had a limited number of well-selected features, and our tree-based models (Decision Tree, Random Forest) naturally manage feature redundancy and importance through split-based decisions. Additionally, preserving original variables was important for business interpretability and actionable insights.


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here:- No need for dimension reduction.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# Final selected features based on earlier feature importance
final_features = ['Gender', 'Age', 'Region_Code', 'Previously_Insured',
                  'Vehicle_Damage', 'Vehicle_Age', 'Policy_Sales_Channel',
                  'Vintage', 'Annual_Premium_capped']

X = df[final_features]
y = df['Response']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

# Check shapes
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")

##### What data splitting ratio have you used and why?

Answer Here:- We used an 80-20 split. With plenty of records, 80% is enough to train the model well, and the remaining 20% gives a reliable estimate of model performance. The split is stratified on Response so both sets keep the same class ratio.
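
The effect of `stratify=y` can be checked directly. The sketch below uses a synthetic target with a 12% positive rate (an assumption chosen to resemble the imbalance described in this notebook) and compares the positive rate in the two resulting sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with a 12% positive rate, similar in spirit to Response
y = np.array([1] * 120 + [0] * 880)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```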

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# check imbalance in data
print(df['Response'].value_counts(normalize=True) * 100)

Answer Here:- Yes, the dataset is imbalanced, as the class percentages printed above make clear.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here:- DecisionTreeClassifier and RandomForestClassifier both support the class_weight='balanced' argument, so the imbalance is handled through class weights during modeling.
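
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count(class))`, so the rarer class receives the larger weight. A minimal sketch with synthetic labels (88 negatives and 12 positives, chosen only for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 88 negatives, 12 positives
y = np.array([0] * 88 + [1] * 12)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same formula written out by hand
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(3))))
```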

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# Fit Decision Tree with class_weight balanced to handle imbalance
dt_model = DecisionTreeClassifier(random_state=42, class_weight='balanced')

# Fit model on training data
dt_model.fit(X_train, y_train)

# Predict class labels on test data
y_pred_dt = dt_model.predict(X_test)

# Predict probabilities for ROC AUC calculation
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))

# Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))

# Accuracy
print("\nAccuracy:", accuracy_score(y_test, y_pred_dt))

# ROC AUC Score
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_dt))


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Base model
dt = DecisionTreeClassifier(random_state=42, class_weight='balanced')

# Hyperparameter grid to search
param_grid = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 5, 10, 20],
    'criterion': ['gini', 'entropy']
}

# Setup GridSearchCV
grid_search_dt = GridSearchCV(estimator=dt,
                              param_grid=param_grid,
                              cv=5,
                              scoring='roc_auc',
                              n_jobs=-1,
                              verbose=1)

# Fit on training data
grid_search_dt.fit(X_train, y_train)

# View best parameters
print("Best Parameters Found:\n", grid_search_dt.best_params_)


# Predict on the model
# Use best model from GridSearch
best_dt_model = grid_search_dt.best_estimator_

# Predict class labels on test data
y_pred_best_dt = best_dt_model.predict(X_test)

# Predict probabilities for ROC AUC
y_pred_proba_best_dt = best_dt_model.predict_proba(X_test)[:, 1]


# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best_dt))

# Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred_best_dt))

# ROC AUC Score
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_best_dt))




##### Which hyperparameter optimization technique have you used and why?

Answer Here:-We used GridSearchCV for hyperparameter optimization of our Decision Tree Classifier.

GridSearchCV was chosen because it searches the specified hyperparameter grid, integrates reliable cross-validation, and is computationally feasible given our dataset size and Decision Tree’s training speed.
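
The size of an exhaustive search follows directly from the grid. The sketch below counts the candidates for the Decision Tree grid used above with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `cv_folds` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Decision Tree tuning cell
param_grid = {
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5
print(n_candidates, "candidates x", cv_folds, "folds =", n_candidates * cv_folds, "fits")
```

This matches the "160 candidates, totalling 800 fits" message recorded in the notebook output.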

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here:- Yes. Hyperparameter tuning with GridSearchCV improved recall, F1-score, and ROC AUC over the untuned tree; the tuned model reaches a positive-class recall of 0.92 and a ROC AUC of 0.847 on the test set.

The tuned Decision Tree is now much better at identifying potential health insurance buyers for Max Life’s cross-sell campaign.

### ML Model - 2(Random-Forest)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
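# Visualizing evaluation Metric Score chart
# Recorded output of the tuned Decision Tree (GridSearchCV) run above,
# kept here as the reference the Random Forest is compared against: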
# Fitting 5 folds for each of 160 candidates, totalling 800 fits
# Best Parameters Found:
#  {'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 20, 'min_samples_split': 2}
# Confusion Matrix:
#  [[44706 22125]
#  [  770  8567]]

# Classification Report:
#                precision    recall  f1-score   support

#            0       0.98      0.67      0.80     66831
#            1       0.28      0.92      0.43      9337

#     accuracy                           0.70     76168
#    macro avg       0.63      0.79      0.61     76168
# weighted avg       0.90      0.70      0.75     76168

# ROC AUC Score: 0.846746375731642

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Random Forest is trained on all remaining columns, so both Annual_Premium and
# Annual_Premium_capped are included (see the feature importance output below)
X = df.drop('Response', axis=1)   # All features
y = df['Response']                # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base Model
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')

# Leaner Hyperparameter Grid
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 5],
    'max_features': ['sqrt'],
    'criterion': ['gini']
}

# RandomizedSearchCV with n_iter=20
random_search_rf = RandomizedSearchCV(estimator=rf_model,
                                      param_distributions=param_grid,
                                      n_iter=20,
                                      cv=5,
                                      scoring='roc_auc',
                                      n_jobs=-1,
                                      verbose=1,
                                      random_state=42)

# Fit Model
random_search_rf.fit(X_train, y_train)

# Best parameters
print("Best Parameters Found:\n", random_search_rf.best_params_)


# predict on test data
best_rf_model = random_search_rf.best_estimator_

# Predictions
y_pred_rf = best_rf_model.predict(X_test)

# Probability predictions for ROC AUC
y_pred_proba_rf = best_rf_model.predict_proba(X_test)[:, 1]






Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters Found:
 {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'gini'}

In [None]:
# Evaluate model
# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

# ROC AUC Score
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_rf))


In [None]:
# Confusion Matrix:
#  [[43909 22922]
#  [  641  8696]]

# Classification Report:
#                precision    recall  f1-score   support

#            0       0.99      0.66      0.79     66831
#            1       0.28      0.93      0.42      9337

#     accuracy                           0.69     76168
#    macro avg       0.63      0.79      0.61     76168
# weighted avg       0.90      0.69      0.74     76168

# ROC AUC Score: 0.8505272443877807

In [None]:
# Feature Importance DataFrame
feat_importance_rf = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importance:\n", feat_importance_rf)

#
# Feature Importance:
#                    Feature  Importance
# 6          Vehicle_Damage    0.415922
# 4      Previously_Insured    0.343814
# 1                     Age    0.086446
# 5             Vehicle_Age    0.060717
# 8    Policy_Sales_Channel    0.057486
# 3             Region_Code    0.012007
# 7          Annual_Premium    0.006951
# 10  Annual_Premium_capped    0.006803
# 9                 Vintage    0.006336
# 0                  Gender    0.002937
# 2         Driving_License    0.000582

##### Which hyperparameter optimization technique have you used and why?

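Answer Here:- For hyperparameter tuning of the Random Forest model, I used RandomizedSearchCV. It evaluates a fixed number of sampled parameter combinations (n_iter=20) instead of the full grid, which keeps tuning time manageable for a model that is slower to train than a single Decision Tree.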
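
To make the saving concrete, the sketch below compares the full size of the Random Forest grid with the number of combinations a randomized search draws from it. It uses scikit-learn's `ParameterSampler`, the same kind of sampling that a randomized search performs; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as in the Random Forest tuning cell
param_grid = {
    "n_estimators": [100, 150, 200],
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt"],
    "criterion": ["gini"],
}
full_grid = len(ParameterGrid(param_grid))

# Draw 20 distinct combinations; with list-valued parameters, sampling is without replacement
sampled = list(ParameterSampler(param_grid, n_iter=20, random_state=42))
print(f"{len(sampled)} of {full_grid} combinations evaluated")
```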

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here:- Yes, a modest improvement over the tuned Decision Tree.

The Random Forest tuned with RandomizedSearchCV raised the test ROC AUC from 0.847 to 0.851 and the positive-class recall from 0.92 to 0.93.

Accuracy (0.70 vs 0.69) and precision (0.28 for both) remained similar, so the gain in ROC AUC and recall is what makes the Random Forest the better choice for this business use case.
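
ROC AUC can be read as the probability that a randomly chosen positive customer receives a higher score than a randomly chosen negative one. The sketch below checks that reading on four made-up scores by comparing a pairwise count with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny made-up example: two negatives, two positives
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos, neg = scores[y_true == 1], scores[y_true == 0]
# Fraction of (positive, negative) pairs where the positive scores higher (ties count half)
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(pairwise, roc_auc_score(y_true, scores))
```

Because the metric depends only on how customers are ranked, it is unaffected by the 0.5 decision threshold that `predict` applies.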

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here:-



The Random Forest model’s high recall and strong ROC AUC allow Max Life’s marketing team to:

* Proactively reach out to 93% of likely health insurance buyers
* Increase cross-sell conversions, driving additional revenue
* Reduce wasted marketing effort by segmenting customers instead of contacting everyone
* Support data-driven, targeted campaign strategies rather than blanket offers




### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here:-
In this business problem, Recall and ROC AUC were the most critical evaluation metrics because:

* The primary business objective is to capture as many potential cross-sell customers as possible.
* The cost of a false positive (targeting an uninterested customer) is relatively low, whereas the cost of a false negative (missing a potential customer) is high in terms of missed revenue.
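
The trade-off between false positives and false negatives can be read straight from the Random Forest confusion matrix recorded earlier in this notebook. The sketch below recomputes recall and precision from those counts and adds two derived figures, `flagged_share` and `lift`, which are names introduced here for illustration:

```python
import numpy as np

# Confusion matrix recorded for the tuned Random Forest (rows: actual, columns: predicted)
cm = np.array([[43909, 22922],
               [  641,  8696]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)                # share of actual buyers the model catches
precision = tp / (tp + fp)             # share of flagged customers who actually buy
flagged_share = (tp + fp) / cm.sum()   # share of all customers the campaign would contact
base_rate = (tp + fn) / cm.sum()       # share of buyers in the test set
lift = precision / base_rate           # improvement over contacting customers at random

print(f"recall={recall:.3f} precision={precision:.3f}")
print(f"flagged share={flagged_share:.3f} lift={lift:.2f}")
```

Read this way, the model trades a large number of low-cost false positives for very few missed buyers, which is the balance the bullets above argue for.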

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here:-
* Final Model Chosen:-Random Forest Classifier
* Reason:- Best balance of recall, ROC AUC Score, robustness, and business interpretability. It effectively identifies potential cross-sell customers while minimizing the risk of missing valuable leads.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here:-Tool used:
Built-in Feature Importance attribute of Random Forest

Why:

It’s fast, easy to interpret, and directly shows the relative importance of each feature in model prediction decisions.

Helpful for both business users and technical teams to understand model behavior and design targeted campaigns.
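
Built-in importances are computed from impurity reductions on the training data, so a permutation-based check on held-out data is a common complement. The sketch below is not part of the original analysis; it runs on a synthetic dataset from `make_classification` and uses scikit-learn's `permutation_importance` to show both side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 6 features, 3 of them informative
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Built-in (impurity-based) importances, derived from the training data
impurity_imp = rf.feature_importances_

# Permutation importance: drop in held-out ROC AUC when one feature is shuffled
perm = permutation_importance(rf, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={impurity_imp[i]:.3f}  permutation={perm.importances_mean[i]:.3f}")
```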



# **Conclusion**

In this project, we successfully developed an end-to-end classification model for predicting potential health insurance cross-sell customers for Max Life Insurance using machine learning techniques.

After thorough data wrangling, exploratory data analysis, feature engineering, outlier handling, and model experimentation, two models (Decision Tree Classifier and Random Forest Classifier) were implemented and evaluated. The tuned Random Forest was selected as the final model, with a test ROC AUC of 0.85 and a positive-class recall of 0.93.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***