# **Project Name**    - Health insurance cross sell prediction



##### **Project Type**    - Classification

# **Hirapara Paras**


# **Project Summary -**

In order to construct a model to determine whether the policyholders (customers) from the previous year will also be interested in the firm's vehicle insurance, our client, an insurance company, needs assistance. This insurance company has previously offered Health Insurance to its customers.

An insurance policy is a contract whereby a business agrees to guarantee compensation in the event of a specific loss, damage, disease, or death in exchange for the payment of a predetermined premium. The amount of money that the customer must consistently pay to an insurance provider in exchange for this assurance is known as a premium.

If you, God forbid, become ill and need to be hospitalized in that year, the insurance provider firm will cover the cost of hospitalization and other expenses up to Rs. 200,000 if you pay a premium of Rs. 5000 annually for a health insurance cover of Rs. 200,000/-. The idea of probabilities enters the scene when you ask how a corporation can afford such large hospitalization costs when it only charges a premium of Rs. 5000. For instance, 100 clients may pay a premium of Rs. 5000 annually, just like you, but only some of them—not all—would end up in the hospital that year.
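
The arithmetic behind this example can be sketched in a few lines. The premium and cover are the figures from the paragraph above; the 2% yearly claim probability is purely an assumption for illustration, not a figure from the dataset.

```python
# Premium and cover come from the example above; p_claim is an ASSUMED value.
n_customers = 100
annual_premium = 5_000   # Rs. paid by each customer per year
cover = 200_000          # Rs. maximum payout per customer
p_claim = 0.02           # assumed probability that a customer claims in a year

premium_income = n_customers * annual_premium
# Pessimistic estimate: every claim is assumed to use the full cover.
expected_payout = n_customers * p_claim * cover
# Claim probability at which premium income exactly equals the expected payout.
break_even_p = annual_premium / cover

print(f"Premium income : Rs. {premium_income:,}")
print(f"Expected payout: Rs. {expected_payout:,.0f}")
print(f"Break-even claim probability: {break_even_p:.1%}")
```

Under these assumed numbers the insurer stays solvent as long as fewer than 2.5% of customers claim the full cover in a year.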



# **GitHub Link -**

https://github.com/parashirapara?tab=repositories

# **Problem Statement**


Building a model to forecast a customer's interest in Vehicle Insurance is very beneficial for the business because it allows it to design its communication strategy to reach out to those clients in the most effective way possible and maximize its business model and revenue.

Now that you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policies (premium, sourcing channel), etc., you can anticipate whether a consumer would be interested in vehicle insurance.

The task is therefore a binary classification problem: for each existing health insurance customer, predict whether they would be interested in Vehicle Insurance (Response = 1) or not (Response = 0), so that the company can focus its outreach on the customers most likely to respond.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import math
from scipy import stats
import missingno as msno
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from mlxtend.plotting import plot_confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

### Dataset Loading

In [None]:
# Load Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading the dataset

file_path = '/content/drive/MyDrive/Data Science/Capston Project/Health insurance cross sell prediction/HEALTH INSURANCE CROSS SELL PREDICTION.csv'
data = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look

data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f'Number of rows : {len(data.axes[0])}')
print(f'Number of columns : {len(data.axes[1])}')

### Dataset Information

In [None]:
# Dataset Info

data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

data.isnull().sum()

In [None]:
# visualization of missing values

msno.matrix(data,figsize=(7,4), fontsize=12)

### What did you know about your dataset?

The dataset has no missing values.

The dataset has 381109 rows and 13 columns.

The dataset has no duplicate values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

data.columns

In [None]:
# Dataset Describe

data.describe()

### Variables Description

id : Unique ID for the customer

Gender : Gender of the customer

Age : Age of the customer

Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL

Region_Code : Unique code for the region of the customer

Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

Vehicle_Age : Age of the Vehicle

Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.

Annual_Premium : The amount customer needs to pay as premium in the year

Policy_Sales_Channel : Anonymized Code for the channel of outreaching to the customer, i.e. Different Agents, Over Mail, Over Phone, In Person, etc.

Vintage : Number of Days, Customer has been associated with the company

Response : 1 : Customer is interested, 0 : Customer is not interested

### Check Unique Values for each variable.

In [None]:
data.columns

In [None]:
# columns whose unique values we want to inspect (all columns, not only the categorical ones)

categorical_variables = ['id', 'Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response']

In [None]:
# Check Unique Values for each variable.

for i in categorical_variables:
  print(f'Unique values for {i} is : {data[i].unique()}')

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Dependent variable 'Response'

sns.set(style="darkgrid")
plt.figure(figsize=(7,4))
total = float(len(data))
ax = sns.countplot(x='Response', data=data)
plt.title('Response Count', fontsize=20)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax.annotate(percentage, (x, y),ha='center')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

The data is imbalanced: 87.7 % of customers are not interested (Response = 0) and only 12.3 % are interested (Response = 1).
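
A minimal sketch of how class shares and an imbalance ratio can be computed for a target column. The toy label list and the helper name `class_shares` are hypothetical; on the notebook's data the equivalent would be `data['Response'].value_counts(normalize=True)`.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: share of total} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy target: 1 = interested, 0 = not interested
toy_response = [0] * 7 + [1] * 1
shares = class_shares(toy_response)

# How many majority-class rows there are per minority-class row
imbalance_ratio = max(shares.values()) / min(shares.values())
print(shares, imbalance_ratio)
```

A ratio well above 1 is a signal that accuracy alone will be a misleading metric and that resampling or class weighting may be needed.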

#### Chart - 2

In [None]:
# Chart - 2 Distribution of 'Age'

plt.figure(figsize=(20,4))
sns.set_theme(style='whitegrid')
sns.countplot(x=data['Age'],data=data)

##### 2. What is/are the insight(s) found from the chart?

Customers aged 21 to 26 are the largest group in the dataset.

This chart only counts customers per age; how interest varies with age is examined in Chart 9.

#### Chart - 3

In [None]:
# Chart - 3 Distribution of 'Previously_Insured'

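plt.figure(figsize=(7,9))
# value_counts() sorts by frequency, so derive the legend labels from its index instead of hard-coding their order
insured_counts = data['Previously_Insured'].value_counts()
plt.pie(insured_counts, autopct='%.0f%%', shadow=True, startangle=200, explode=[0.01,0])
plt.legend(labels=[{1: 'Insured', 0: 'Not insured'}[v] for v in insured_counts.index])
plt.show()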

##### 2. What is/are the insight(s) found from the chart?

About 46 % of the customers in the dataset already have vehicle insurance and about 54 % do not.

The uninsured group is the larger one.

That means we should focus on uninsured customers and try to find out why they have not taken insurance.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

data['Annual_Premium'].hist(figsize=(10,5), bins = 50, density = True, range=[0, 200000])
plt.xlabel('Annual_Premium')
plt.ylabel('density')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.boxplot(x='Annual_Premium',palette="rocket_r", data=data)

##### 2. What is/are the insight(s) found from the chart?

From the distribution plot we can infer that the Annual_Premium variable is right skewed.

From the boxplot we can observe a lot of outliers in the variable.

As the chart shows, most annual premiums lie between 0 and 100000; higher premium amounts are rare.

Later, in the outlier-handling step, I use the interquartile range (IQR) of Annual_Premium to set a lower and an upper bound and treat values outside them as outliers.
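
The IQR rule mentioned above can be sketched as follows. The premium values and the helper name `iqr_bounds` are hypothetical; the `inclusive` quantile method is chosen because it uses the same linear interpolation as the default of pandas' `quantile`.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical premiums with one very small and one very large value
premiums = [2630, 24000, 28000, 31000, 33000, 39000, 45000, 540000]
lower, upper = iqr_bounds(premiums)

# Values outside the fences are flagged as outliers
outliers = [v for v in premiums if v < lower or v > upper]
print(lower, upper, outliers)
```

Whether flagged values are dropped, capped at the fences, or kept is a modelling decision; the rule itself only identifies them.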

#### Chart - 5

In [None]:
# Chart - 5 visualization code for Vehicle_Damage

plt.figure(figsize=(9,4))
ax2 = sns.countplot(x=data["Vehicle_Damage"])
for p in ax2.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax2.annotate(percentage, (x, y),ha='center')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Based on the chart, the split is almost even: 50.5 % of customers have had vehicle damage and 49.5 % have not.

#### Chart - 6

In [None]:
# Chart - 6 visualization code for Vehicle_Age

plt.figure(figsize=(9,4))
ax3 = sns.histplot(data["Vehicle_Age"])
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Most vehicles are 1 to 2 years old (52.6 %), followed by vehicles less than 1 year old (43.2 %); only 4.2 % are more than 2 years old.

#### Chart - 7

In [None]:
# Chart - 7 visualization code for Gender count

plt.figure(figsize=(9,4))
ax4 = sns.countplot(x=data["Gender"])
for p in ax4.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax4.annotate(percentage, (x, y),ha='center')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

The gender split in the dataset is fairly even: 54.1 % of customers are male and 45.9 % are female.

#### Chart - 8

In [None]:
# Chart - 8 visualization code Vehicle_Damage by gender

plt.figure(figsize=(17,5))
ax5 = sns.countplot(data = data, x = "Gender", hue = "Vehicle_Damage")
for p in ax5.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax5.annotate(percentage, (x, y),ha='center')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

As a share of all customers, males with past vehicle damage account for about 29 % and males without damage for 24.5 %, so past damage is the more common case among males.

Females with past vehicle damage account for 20.9 % and females without damage for 25 %, so past damage is the less common case among females.

#### Chart - 9

In [None]:
# Chart - 9 visualization code #Age VS Response

plt.figure(figsize=(16,5))
sns.countplot(data=data, x='Age',hue='Response', palette='CMRmap_r')
plt.xlabel('Age response')
plt.ylabel('count')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Customers aged 31 to 50 are the most likely to respond positively,

while customers below 30 are mostly not interested in vehicle insurance.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(16,5))
ax6 = sns.countplot(data=data, x='Gender',hue='Response', palette='CMRmap_r')
for p in ax6.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax6.annotate(percentage, (x, y),ha='center')
plt.show()

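##### 2. What is/are the insight(s) found from the chart?

The largest bars are the not-interested customers: 46.6 % of all customers for males and 41.2 % for females, which mirrors the slightly larger male group. The chance of buying the insurance is also a little higher for males.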

#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize = (10,4) )
ax7 = sns.countplot(data = data, x = 'Vehicle_Age', hue = 'Response', palette='Dark2_r')
plt.xlabel('Vehicle Age', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.title('Vehicle Age and Customer Response analysis', fontsize = 19)
for p in ax7.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
    y = p.get_height()
    ax7.annotate(percentage, (x, y),ha='center')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Customers whose vehicle is 1-2 years old are more likely to be interested than the other two groups.

Customers whose vehicle is less than 1 year old are very unlikely to buy insurance.

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize = (20, 8))
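# numeric_only=True skips the text columns (Gender, Vehicle_Age, Vehicle_Damage), which corr() cannot handle
sns.heatmap(data.corr(numeric_only=True), annot = True)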

##### 2. What is/are the insight(s) found from the chart?

Based on the heatmap, no pair of variables is strongly correlated, so there is no sign of multicollinearity.

Policy_Sales_Channel and Age are negatively correlated.

The target variable is barely correlated with Vintage, so this least-correlated variable is a candidate for dropping.

#### Chart - 13 - Pair Plot

In [None]:
pairplot_columns = ['Gender', 'Age', 'Vehicle_Age', 'Annual_Premium','Response', 'Policy_Sales_Channel']

In [None]:
# Pair Plot visualization code

sns.pairplot(data[pairplot_columns])

##### 2. What is/are the insight(s) found from the chart?

A correlation coefficient of zero, or very close to it, indicates that there is no linear relationship between two variables.
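
A small sketch of what a correlation coefficient does and does not capture. The data and the helper name `pearson_r` are hypothetical; the formula is the standard Pearson correlation, which is what pandas' `corr()` computes by default.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
linear = [2 * v for v in x]            # perfectly linear in x
u_shaped = [(v - 3) ** 2 for v in x]   # depends on x, but not linearly

r_linear = pearson_r(x, linear)
r_u = pearson_r(x, u_shaped)
print(r_linear, r_u)
```

The second series is fully determined by `x` yet has zero correlation with it, which is why a near-zero coefficient only rules out a linear relationship.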

In [None]:
numeric_features = data.describe().columns
numeric_features

In [None]:
# plot a histogram for each numerical feature (except id), with dashed lines for the mean (magenta) and median (cyan)

for col in numeric_features[1:]:
    fig = plt.figure(figsize=(6, 3))
    ax = fig.gca()
    feature = data[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

In [None]:
for col in numeric_features[1:-1]:
    fig = plt.figure(figsize=(5, 4))
    ax = fig.gca()
    feature = data[col]
    label = data['Response']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Response')
    ax.set_title('Response vs ' + col + '- correlation: ' + str(correlation))
    z = np.polyfit(data[col], data['Response'], 1)
    y_hat = np.poly1d(z)(data[col])

    plt.plot(data[col], y_hat, "r--", lw=1)

plt.show()

Based on the analysis above, the data is ready to use and shows no collinearity between the variables.

In [None]:
# Finding Multicollinearity
def cal_vif(X):
    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

In [None]:
cal_vif(data[[i for i in data.describe().columns if i not in ['Response']]])

In [None]:
data.drop(columns=['Driving_License'],axis=1,inplace=True)

Almost every customer in the dataset holds a driving license, so the Driving_License column carries very little information for the model. The company also would not sell vehicle insurance to someone who is not licensed to drive, so we drop the column.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

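Null hypothesis (H0): the mean Annual_Premium paid by customers is at most 30000.

Alternative hypothesis (H1): the mean Annual_Premium paid by customers is greater than 30000.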

#### 2. Perform an appropriate statistical test.

In [None]:
# extract Annual_Premium column from dataset and try to find shape of the Annual_Premium
Annual_Premium = pd.DataFrame(data, columns = ['Annual_Premium'])

Annual_Premium.shape

In [None]:
# mean of Annual_Premium
Annual_Premium.mean()

In [None]:
# standard deviation of Annual_Premium
Annual_Premium.std()

In [None]:
# convert Annual_Premium data into list format

Annual_Premium_list = data["Annual_Premium"].tolist()

In [None]:
# choose a random sample from the Annual_Premium values (sample size is 50000)

import random
random_sample = random.sample(Annual_Premium_list, 50000)
random_samples = pd.DataFrame(random_sample)

In [None]:
def calculate_z_score(value, random_samples):
    # one-sample z statistic: (sample mean - hypothesised mean) / standard error
    mean = random_samples.mean()
    std_dev = random_samples.std()
    square_root = np.sqrt(len(random_samples))  # size of the sample passed in
    z_score = (mean - value) / (std_dev / square_root)
    return z_score

In [None]:
hypothesized_mean = 30000  # value stated in the hypothesis

z_score = calculate_z_score(hypothesized_mean, random_samples)
print("Z-Score:", z_score)

In [None]:
# We can calculate p-value

prob_z = norm.cdf(z_score)
print(prob_z)

In [None]:
P_value = 1 - prob_z
print(P_value)

Based on the hypothesis test, I reject the null hypothesis that the mean Annual_Premium is at most 30000.
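
The test above can be condensed into one function. This is a sketch under stated assumptions: the sample is synthetic (drawn from a normal distribution with hypothetical parameters), the helper name `one_sample_z_test` is illustrative, and the test is right-tailed, matching the `1 - cdf(z)` p-value used in the cells above.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def one_sample_z_test(sample, mu0):
    """Right-tailed one-sample z-test of H0: mean <= mu0 against H1: mean > mu0."""
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

random.seed(0)
# Hypothetical premiums, NOT the notebook's data
sample = [random.gauss(30500, 17000) for _ in range(5000)]
z, p_value = one_sample_z_test(sample, mu0=30000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The null hypothesis is rejected when the p-value falls below the chosen significance level, commonly 0.05.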

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
data['Annual_Premium'].hist(figsize=(10,5), bins = 50, density = True, range=[0, 200000])

As seen above, Annual_Premium has a right-skewed distribution.

The dataset has no missing values, but as a safeguard we fill any gaps in Annual_Premium with the median, which is less sensitive to the skew than the mean.
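
A short sketch of why the median is the safer fill value for a right-skewed column. The premium values are hypothetical, with `None` standing in for a missing entry.

```python
from statistics import mean, median

# Hypothetical right-skewed premiums; None marks a missing value
premiums = [25000, 28000, None, 31000, 34000, 400000]
observed = [p for p in premiums if p is not None]

fill_value = median(observed)   # robust to the single very large premium
skewed_mean = mean(observed)    # pulled upwards by the large premium

# Replace the gap with the median, leave observed values untouched
filled = [fill_value if p is None else p for p in premiums]
print(fill_value, skewed_mean, filled)
```

Here one extreme value moves the mean far above the typical premium, while the median stays close to it.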

In [None]:
# Handling null values

data['Annual_Premium'] = data['Annual_Premium'].fillna(data['Annual_Premium'].median())

In [None]:
data.info()

Based on the correlation analysis, Age shows the strongest positive correlation and Previously_Insured the strongest negative correlation in our dataset.

### 2. Handling Outliers

In [None]:
#outlier column
outlier_column=['Annual_Premium']
#determining the inter-quartile range for the columns with outliers
Q1 = data[outlier_column].quantile(0.25)
Q3 = data[outlier_column].quantile(0.75)
IQR = Q3-Q1
IQR

In [None]:
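# determining the upper and lower limit for the removal of outliers
upper_limit = Q3 + (1.5*IQR)
lower_limit = Q1 - (1.5*IQR)
# keep only the rows inside the limits; masking the column instead would leave NaN values that break model fitting
within_limits = data['Annual_Premium'].between(lower_limit['Annual_Premium'], upper_limit['Annual_Premium'])
data = data[within_limits].reset_index(drop=True)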

### 3. Categorical Encoding

In [None]:
# changing categorical value to numerical values

data['Gender'] = data['Gender'].map({'Female':1, 'Male':0})
data.head(2)

In [None]:
# similarly for the vehicle age

data['Vehicle_Age']= data['Vehicle_Age'].map({'< 1 Year':0,'1-2 Year':1,'> 2 Years':2})
data.head(2)

In [None]:
# similarly for the vehicle damage

data['Vehicle_Damage']=data['Vehicle_Damage'].map({'Yes':1, 'No':0})
data.head(2)

In [None]:
# correlation of each feature with Response

correlation = data.corr()
correlation['Response'].sort_values(ascending = False)[1:]

### 8. Data Splitting

In [None]:
# Select the feature columns (x) and the target column (y) before splitting into train and test.
x=data[['Gender','Age','Previously_Insured','Vehicle_Age','Vehicle_Damage','Annual_Premium','Vintage']]
y=data['Response']

In [None]:
# check for imbalance in data
data['Response'].value_counts()

ML techniques like decision trees and logistic regression show a bias in favour of the majority class and frequently disregard the minority class. Resampling is the method we employ to solve this problem.
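
One common way to keep duplicated minority rows out of the test set is to hold out the test data first and oversample only the training part. The sketch below illustrates that order with a plain-Python stand-in for random oversampling; the helper `random_oversample` and the toy data are hypothetical and are not the `imblearn` implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical toy data: 20 row ids, 4 of them positive
rows = list(range(20))
labels = [1 if i % 5 == 0 else 0 for i in rows]

# 1) hold out a test set first
rng = random.Random(42)
indices = rows[:]
rng.shuffle(indices)
test_idx, train_idx = indices[:6], indices[6:]
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]

# 2) oversample the training part only
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
print(Counter(train_labels), Counter(bal_labels))
```

With this order every test row is unseen by the model, so scores on the test set are not inflated by copies of training rows.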

In [None]:
# Resampling
ros = RandomOverSampler(random_state=0)
X_new,y_new= ros.fit_resample(x, y)

print("After Random Over Sampling Of Minor Class Total Samples are :", len(y_new))
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))

In [None]:
X_train, X_test ,y_train, y_test=  train_test_split(X_new, y_new, random_state=42, test_size=0.3)
print(X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

In [None]:
# Standardizing the features with StandardScaler (fitted on the training set only, then applied to the test set).
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

## ***7. ML Model Implementation***

### ML Model - Logistic Regression

In [None]:
# ML Model - 1 Implementation Logistic Regression

#Fitting Logistic Regression
logistic_model = LogisticRegression(random_state=30)
logistic_model=logistic_model.fit(X_train,y_train)

#Making prediction
y_pred_lg = logistic_model.predict(X_test)
y_pred_prob_lg = logistic_model.predict_proba(X_test)[:,1]

In [None]:
y_pred_prob_lg

In [None]:
y_pred_lg

In [None]:
# Evaluation
RS_lgt= recall_score(y_test, y_pred_lg)
print("Recall_Score : ", RS_lgt)

PS_lgt= precision_score(y_test, y_pred_lg)
print("Precision_Score :",PS_lgt)

f1S_lgt= f1_score(y_test, y_pred_lg)
print("f1_Score :", f1S_lgt)

AS_lgt= accuracy_score(y_test, y_pred_lg)
print("Accuracy_Score :",AS_lgt)

# roc_auc_score expects the true labels first, then the predictions
acu_lgt = roc_auc_score(y_test, y_pred_lg)
print("ROC_AUC Score:",acu_lgt)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_lg)
plt.title('Logistic Regression ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_logistic = confusion_matrix(y_test, y_pred_lg)
print(cm_logistic)

# chart visualization
fig, ax = plot_confusion_matrix(conf_mat=cm_logistic, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, y_pred_lg))

In [None]:
# coefficients
logistic_model.coef_

In [None]:
coef = pd.Series(data=logistic_model.coef_[0], index=x.columns)
coef = coef.sort_values()
print(coef)

In [None]:
plt.figure(figsize=(14,6))
sns.barplot(x=coef.index, y=coef.values);
plt.ylabel("Coefficient")
plt.show()

In [None]:
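# hyperparameter tuning

# the default lbfgs solver supports only the l2 penalty, so use liblinear, which handles both l1 and l2
# (elasticnet is left out because it needs solver='saga' and an l1_ratio value)
logistic_model_tunning= LogisticRegression(solver='liblinear')
param_logreg = {'C': [1, 0.5, 0.1, 5, 9],'penalty':['l2','l1']}
l_m_t= GridSearchCV(estimator = logistic_model_tunning, param_grid = param_logreg, cv = 3, n_jobs = -1 , verbose = 1, scoring = 'recall')
l_m_t.fit(X_train, y_train)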

In [None]:
# best estimator found by the grid search and its predictions

log_tuned = l_m_t.best_estimator_
y_tuned_log = log_tuned.predict(X_test)
y_tuned_log_prob=log_tuned.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_lgt_tun= recall_score(y_test, y_tuned_log)
print("Recall_Score : ", RS_lgt_tun)

PS_lgt_tun= precision_score(y_test, y_tuned_log)
print("Precision_Score :",PS_lgt_tun)

f1S_lgt_tun= f1_score(y_test, y_tuned_log)
print("f1_Score :", f1S_lgt_tun)

AS_lgt_tun= accuracy_score(y_test, y_tuned_log)
print("Accuracy_Score :",AS_lgt_tun)

# roc_auc_score expects the true labels first, then the predictions
acu_lgt_tun = roc_auc_score(y_test, y_tuned_log)
print("ROC_AUC Score:",acu_lgt_tun)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_tuned_log_prob)
plt.title('Logistic Regression ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_logistic_tun = confusion_matrix(y_test, y_tuned_log)
print(cm_logistic_tun)

In [None]:
# graphical representation of confusion matrix
fig, ax = plot_confusion_matrix(conf_mat=cm_logistic_tun, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, y_tuned_log))

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation Decision Tree

dt_model = DecisionTreeClassifier(random_state=30)
dt_model=dt_model.fit(X_train, y_train)

#Making prediction
dt_pred = dt_model.predict(X_test)
dt_pred_prob = dt_model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_dt= recall_score(y_test, dt_pred)
print("Recall_Score : ", RS_dt)

PS_dt= precision_score(y_test, dt_pred)
print("Precision_Score :",PS_dt)

f1S_dt= f1_score(y_test, dt_pred)
print("f1_Score :", f1S_dt)

AS_dt= accuracy_score(y_test, dt_pred)
print("Accuracy_Score :",AS_dt)

acu_dt = roc_auc_score(y_test, dt_pred)
print("ROC_AUC Score:",acu_dt)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, dt_pred_prob)
plt.title('Decision Tree ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_dt= confusion_matrix(y_test, dt_pred)
print(cm_dt)

In [None]:
# graphical representation of confusion matrix

fig, ax = plot_confusion_matrix(conf_mat=cm_dt, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, dt_pred))

In [None]:
# features importance

plt.figure(figsize=(14,3))
feat_importances = pd.Series(dt_model.feature_importances_, index=X_new.columns)
feat_importances.nlargest(5).plot(kind='bar')

### ML Model - 3

In [None]:
# ML Model - 3 Implementation Random Forest

# Fitting Random Forest
rf_model= RandomForestClassifier(random_state=30)
rf_model= rf_model.fit(X_train, y_train)

# Making prediction
rf_pred= rf_model.predict(X_test)
rf_pred_proba= rf_model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_rf= recall_score(y_test, rf_pred)
print("Recall_Score : ", RS_rf)

PS_rf= precision_score(y_test, rf_pred)
print("Precision_Score :",PS_rf)

f1S_rf= f1_score(y_test, rf_pred)
print("f1_Score :", f1S_rf)

AS_rf= accuracy_score(y_test, rf_pred)
print("Accuracy_Score :",AS_rf)

acu_rf = roc_auc_score(y_test, rf_pred)
print("ROC_AUC Score:",acu_rf)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, rf_pred_proba)
plt.title('Random Forest ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_rf= confusion_matrix(y_test, rf_pred)
print(cm_rf)

In [None]:
fig, ax = plot_confusion_matrix(conf_mat=cm_rf, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, rf_pred))

In [None]:
# features importance

plt.figure(figsize=(14,4))
feat_importances = pd.Series(rf_model.feature_importances_,index=X_new.columns)
feat_importances.nlargest(5).plot(kind='bar')

### ML Model - *4 KNN*

In [None]:
# KNN

knn_model=KNeighborsClassifier()
knn_model=knn_model.fit(X_train,y_train)

#Making prediction
knn_pred = knn_model.predict(X_test)
knn_pred_prob = knn_model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_knn= recall_score(y_test, knn_pred)
print("Recall_Score : ", RS_knn)

PS_knn= precision_score(y_test, knn_pred)
print("Precision_Score :",PS_knn)

f1S_knn= f1_score(y_test, knn_pred)
print("f1_Score :", f1S_knn)

AS_knn= accuracy_score(y_test, knn_pred)
print("Accuracy_Score :",AS_knn)

acu_knn = roc_auc_score(y_test, knn_pred)
print("ROC_AUC Score:",acu_knn)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, knn_pred_prob)
plt.title('KNN ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_knn= confusion_matrix(y_test, knn_pred)
print(cm_knn)

In [None]:
fig, ax = plot_confusion_matrix(conf_mat=cm_knn, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, knn_pred))

In [None]:
# Hyperparameter tuning for KNN

KNN_tuning = KNeighborsClassifier()
param_KNN = {'n_neighbors':[5,7,9],'weights':['uniform','distance'],'p':[2, 1]}
model_KNN_tuned = GridSearchCV(estimator = KNN_tuning, param_grid = param_KNN, cv = 3, n_jobs = -1 , verbose = 1, scoring = 'recall')
model_KNN_tuned.fit(X_train, y_train)

In [None]:
model_KNN_tuned.best_params_

In [None]:
model_KNN_tuned.best_score_

In [None]:
# best parameters and prediction

KNN_tuned = model_KNN_tuned.best_estimator_
y_tuned_KNN = KNN_tuned.predict(X_test)
y_tuned_KNN_prob=KNN_tuned.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_knn_tun= recall_score(y_test, y_tuned_KNN )
print("Recall_Score : ", RS_knn_tun)

PS_knn_tun= precision_score(y_test, y_tuned_KNN )
print("Precision_Score :",PS_knn_tun)

f1S_knn_tun= f1_score(y_test, y_tuned_KNN )
print("f1_Score :", f1S_knn_tun)

AS_knn_tun= accuracy_score(y_test, y_tuned_KNN )
print("Accuracy_Score :",AS_knn_tun)

acu_knn_tun = roc_auc_score(y_test, y_tuned_KNN )
print("ROC_AUC Score:",acu_knn_tun)

In [None]:
#roc curve
fpr, tpr, _ = roc_curve(y_test, y_tuned_KNN_prob)
plt.title('KNN ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#confusion matrix
cm_knn= confusion_matrix(y_test, y_tuned_KNN)
print(cm_knn)

In [None]:
fig, ax = plot_confusion_matrix(conf_mat=cm_knn, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, y_tuned_KNN))

### ML Model - *5 Gradient Boosting*

In [None]:
GB_model=GradientBoostingClassifier(random_state=30)
GB_model=GB_model.fit(X_train,y_train)
#Making prediction
GB_pred = GB_model.predict(X_test)
GB_pred_prob = GB_model.predict_proba(X_test)[:,1]

In [None]:
# Evaluation
RS_GB= recall_score(y_test, GB_pred)
print("Recall_Score : ", RS_GB)

PS_GB= precision_score(y_test, GB_pred)
print("Precision_Score :",PS_GB)

f1S_GB= f1_score(y_test, GB_pred)
print("f1_Score :", f1S_GB)

AS_GB= accuracy_score(y_test, GB_pred)
print("Accuracy_Score :",AS_GB)

acu_GB = roc_auc_score(y_test, GB_pred)
print("ROC_AUC Score:",acu_GB)

In [None]:
#ROC Curve
fpr, tpr, _ = roc_curve(y_test, GB_pred_prob)
plt.title('Gradient Boosting ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

In [None]:
#Confusion matrix
cm_GB= confusion_matrix(y_test, GB_pred)
print(cm_GB)

In [None]:
fig, ax = plot_confusion_matrix(conf_mat=cm_GB, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
print(classification_report(y_test, GB_pred))

In [None]:
# result of all model
list_of_model = ['Logistic Regression','Decision Tree ','Random Forest','KNN','Gradient Boosting']

In [None]:
result_list_RS = [RS_lgt,RS_dt,RS_rf,RS_knn,RS_GB]
result_list_PS = [PS_lgt,PS_dt,PS_rf,PS_knn,PS_GB]
result_list_f1S = [f1S_lgt,f1S_dt,f1S_rf,f1S_knn,f1S_GB]
result_list_AS = [AS_lgt,AS_dt,AS_rf,AS_knn,AS_GB]
relust_list_Acu=[acu_lgt,acu_dt,acu_rf,acu_knn,acu_GB]

In [None]:
# creating an empty dataframe
results_df = pd.DataFrame()

In [None]:
results_df['model_name'] = list_of_model
results_df['Recall_Score'] = result_list_RS
results_df['Precision_Score'] = result_list_PS
results_df['f1_Score'] = result_list_f1S
results_df['Accuracy_Score'] = result_list_AS
results_df['ROC_AUC Score'] = relust_list_Acu

In [None]:
results_df.style.hide(axis='index').background_gradient(cmap='RdYlBu_r').format()

In [None]:
# Result of hyperparameter-tuned models

list_of_model = ['Logistic Regression','KNN']
result_list_RS = [RS_lgt_tun,RS_knn_tun]
result_list_PS = [PS_lgt_tun,PS_knn_tun]
result_list_f1S = [f1S_lgt_tun,f1S_knn_tun]
result_list_AS = [AS_lgt_tun,AS_knn_tun]
relust_list_Acu=[acu_lgt_tun,acu_knn_tun]

In [None]:
# creating an empty dataframe
results_df = pd.DataFrame()
results_df['model_name'] = list_of_model
results_df['Recall_Score'] = result_list_RS
results_df['Precision_Score'] = result_list_PS
results_df['f1_Score'] = result_list_f1S
results_df['Accuracy_Score'] = result_list_AS
results_df['ROC_AUC Score'] = relust_list_Acu

In [None]:
results_df.style.hide(axis='index').background_gradient(cmap='RdYlBu_r').format()

In [None]:
#predictions
y_pred_test = rf_model.predict(X_test)
# collect the actual and predicted responses side by side in one DataFrame
responce_pred = pd.DataFrame({'Response': y_test, 'rf_pred': y_pred_test})

In [None]:
#head
responce_pred.head()

In [None]:
rf_pred

# **Conclusion**

After loading our dataset, the first thing we did was look for duplicates and null values. There were no duplicates or null values, thus there was no need to treat them.

The gender variable in the dataset is spread nearly evenly. The male category is marginally larger than the female category, and males are also slightly more likely to purchase insurance.

The target is imbalanced: far more customers are not interested in purchasing vehicle insurance than are interested.

Interest in vehicle insurance is higher among customers with older vehicles than among those whose vehicle is less than a year old.

By using the inter quartile range, we eliminated outliers and dealt with null data. We split the dataset into train and test splits after feature encoding three columns.

Further, we applied five machine learning algorithms (Logistic Regression, Decision Tree, Random Forest, KNN and Gradient Boosting) to see which customers might be interested in purchasing vehicle insurance, and we tuned the hyperparameters of two of them (Logistic Regression and KNN) to see whether tuning improves the results. Vehicle damage and annual premium are the two most significant features seen in decision trees, while vintage and annual premium are seen in random forests. With 93% and 92% ROC AUC scores, Decision Tree and Random Forest outperform all other models.
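
As a reading aid for the ROC AUC figures, the sketch below computes AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The labels, scores and helper name `roc_auc` are hypothetical. It also shows that AUC computed from hard 0/1 predictions can differ from AUC computed from predicted probabilities.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.7, 0.9]
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]  # thresholded predictions

auc_from_prob = roc_auc(y_true, y_prob)
auc_from_hard = roc_auc(y_true, y_hard)
print(auc_from_prob, auc_from_hard)
```

Thresholding discards the ranking information inside each predicted class, so scores based on probabilities are usually the more informative basis for comparing models.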



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***