![image](http://www.motortrainingschool.co.in/images/course-09.jpg)

<font size="+3" color='#0000FF'><b> Problem Statement</b></font>

Our client is an insurance company that has provided Health Insurance to its customers. It now needs your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the Vehicle Insurance the company offers.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance company will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. For example, there may be 100 customers like you, each paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised in a given year. This way everyone shares the risk of everyone else.
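The arithmetic behind this risk sharing can be sketched in a few lines. The figures come from the example above; the assumption that every claim uses the full cover is a worst case added here for illustration, not something the dataset states.

```python
# Figures from the example in the text; "2-3 claims" is illustrative, not real data.
premium = 5_000      # Rs. paid by each customer per year
customers = 100      # size of the pool
cover = 200_000      # maximum payout per hospitalised customer

premiums_collected = premium * customers

surplus_by_claims = {}
for claims in (2, 3):
    # Worst case (assumption): every claim uses the full cover
    worst_case_payout = claims * cover
    surplus_by_claims[claims] = premiums_collected - worst_case_payout
    print(f"{claims} claims: collected {premiums_collected}, "
          f"paid out up to {worst_case_payout}, "
          f"surplus {surplus_by_claims[claims]}")
```

In this toy setting the pool covers two full claims but not three, which is why the premium has to be set from an estimate of how often claims happen and how large they usually are.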

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance company so that, in case of an unfortunate accident involving the vehicle, the company will provide compensation (called the ‘sum assured’) to the customer.

<font size="+3" color="#513B1C"><b>Business Goal</b></font>

A model that predicts whether a customer would be interested in Vehicle Insurance is helpful for the company because it can plan its communication strategy to reach out to those customers and optimise its business model and revenue accordingly.

To predict whether a customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

<font size="+3" color="#513B1C"><b>What This Notebook Will Cover</b></font>

> <font size="+1" color="brown"><b>Exploratory data analysis</b></font>

> <font size="+1" color="brown"><b>Modeling and evaluation</b></font>

<font size="+2" color=chocolate ><b>Please Upvote my kernel if you like my work.</b></font>

**Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, KFold, GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, confusion_matrix, precision_recall_curve, auc, roc_curve, recall_score, classification_report

> **Reading dataset**

In [None]:
train=pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
train.head()

> **Copying the train data into a working variable so the original stays untouched**

In [None]:
df=train.copy()

In [None]:
df.shape

> **Checking NULL values**

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.info()

> **Number of unique values in each column**

In [None]:
print('columns and number of unique values')
# Loop over the columns and report how many distinct values each one holds
for col in df.columns:
    print(f'{col} --> {df[col].nunique()}')

> **Separating the categorical columns**

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns
categorical_columns

<font size="+2" color="blue"><b>Exploratory Data Analysis</b></font>

> **Count of Gender with respect to the target variable**

> **The positive response count for males is slightly higher than for females**

In [None]:
sns.countplot(data=df,x='Gender',hue='Response')
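A count plot shows absolute counts, so part of any difference between groups can simply reflect group size. A minimal sketch of comparing response rates instead is shown below; the `toy` frame is made-up data for illustration only, and in this notebook the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Toy frame for illustration only (not the competition data)
toy = pd.DataFrame({
    "Gender":   ["Male", "Male", "Male", "Male",
                 "Female", "Female", "Female", "Female"],
    "Response": [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of positive responses per group
response_rate = toy.groupby("Gender")["Response"].mean()
print(response_rate)
```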

> **Count of Gender with respect to Previously Insured**

> **The pattern is similar for males and females**

In [None]:
sns.countplot(data=df,x='Gender',hue='Previously_Insured')

> **AGE vs PREVIOUSLY INSURED**

In [None]:
sns.relplot(data=df,x='Age',y='Previously_Insured',kind='line')

> **VEHICLE AGE vs VEHICLE DAMAGE**

In [None]:
sns.countplot(data=df,x='Vehicle_Age',hue='Vehicle_Damage')

> **AGE DISTRIBUTION**

In [None]:
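# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df['Age'], kde=True)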

> **GENDER vs DRIVING LICENSE**

In [None]:
df=train.groupby(['Gender'])['Driving_License'].count().to_frame().reset_index()
df

In [None]:
# df already holds one row per gender, so plot it directly
# (grouping it a second time would reduce every count to 1)
sns.catplot(x="Gender", y="Driving_License",
                data=df, kind="bar");

> **COUNT FOR VEHICLE AGE vs RESPONSE**

In [None]:
df=train.groupby(['Vehicle_Age','Response'])['id'].count().to_frame().rename(columns={'id':'count'}).reset_index()
df

In [None]:
g = sns.catplot(x="Vehicle_Age", y="count",col="Response",
                data=df, kind="bar",
                height=4, aspect=.7);

> **COUNT FOR VEHICLE DAMAGE vs RESPONSE**

In [None]:
df=train.groupby(['Vehicle_Damage','Response'])['id'].count().to_frame().rename(columns={'id':'count'}).reset_index()
df

In [None]:
g = sns.catplot(x="Vehicle_Damage", y="count",col="Response",
                data=df, kind="bar",aspect=0.7)

> **ANNUAL PREMIUM DISTRIBUTION**

In [None]:
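# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(train['Annual_Premium'], kde=True)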

> **VINTAGE**

In [None]:
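# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(train['Vintage'], kde=True)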

<font size="+2" color="blue"><b>Data Preprocessing</b></font>

> **Converting categorical data into numeric data**

In [None]:
train['Gender']=train['Gender'].replace({'Male':1,'Female':0})
train.head()

In [None]:
train['Vehicle_Age'].unique()

In [None]:
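# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in newer pandas versions
train['Vehicle_Damage']=train['Vehicle_Damage'].replace({'Yes':1,'No':0})
train['Vehicle_Age']=train['Vehicle_Age'].replace({'< 1 Year':1,'1-2 Year':2, '> 2 Years':3})
train.head()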

**Now that the data has been preprocessed, we can look at correlations**

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(train.corr(),annot=True)

> **Columns most strongly correlated with the target column, which are likely to be the most useful predictors**

In [None]:
hig_corr = train.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["Response"]) >= 0.2]
hig_corr_features

> **Separating the dependent variable (target) from the independent variables (features)**

In [None]:
X=train.drop(['Response'],axis=1)
print(X.shape)
y=train['Response']
print(y.shape)

**Splitting the training data into train and test sets (20% held out for testing)**

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
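If the positive class turns out to be rare, a plain random split can leave slightly different class shares in the two parts. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic arrays (`X_toy`, `y_toy` are hypothetical) with an assumed 10% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 1000 rows, 10% positives (an assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

train_share = y_tr.mean()
test_share = y_te.mean()
print("positive share in train:", train_share)
print("positive share in test: ", test_share)
```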
print(X_train.shape)
print(X_test.shape)

<font size="+3" color="blue"><b>Modeling</b></font>

**We are going to use RandomForestClassifier and XGBoost**

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

model=RandomForestClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
# model.score on the held-out split is test accuracy, not a training score
print("Test Score:\n",model.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model.get_params())
print('accuracy score',accuracy_score(y_test,y_pred)*100)

<font color="green"><b>Random Forest Classifier test accuracy is : 87.30</b></font>
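An accuracy figure is easier to read next to a trivial baseline that always predicts the most frequent class. The sketch below shows how such a baseline could be computed with scikit-learn's `DummyClassifier`; the 88/12 class split is a hypothetical example, not a number measured on this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 88 negatives and 12 positives (an assumption)
y_toy = np.array([0] * 88 + [1] * 12)
X_toy = np.zeros((len(y_toy), 1))   # the dummy model ignores the features

# Always predict the majority class and measure the resulting accuracy
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("baseline accuracy:", baseline_acc)
```

If a real model scores close to such a baseline, metrics like recall, F1 and AUC say more about its usefulness than accuracy does.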

In [None]:
y_score = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.title('Random Forest ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))

<font color="green"><b>Random Forest Classifier AUC is : 84.20</b></font>
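The area computed from `fpr` and `tpr` with `auc` can also be obtained in one call with `roc_auc_score`, which is already imported above. A small sketch on made-up labels and scores (`y_true` and `y_prob` are hypothetical) shows that the two agree:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Route 1: build the ROC curve, then integrate it
fpr, tpr, _ = roc_curve(y_true, y_prob)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the same quantity directly
auc_direct = roc_auc_score(y_true, y_prob)

print(auc_from_curve, auc_direct)
```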

**XGBOOST**

In [None]:
from xgboost import XGBClassifier
model1=XGBClassifier()
model1.fit(X_train,y_train)
y_pred=model1.predict(X_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
# model1.score on the held-out split is test accuracy, not a training score
print("Test Score:\n",model1.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model1.get_params())
print('accuracy score',accuracy_score(y_test,y_pred)*100)

<font color="green"><b>XGboost classifier accuracy score is : 88 </b></font>

In [None]:
y_score = model1.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.title('XGBoost ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))

<font color="green"><b>XGboost classifier AUC is : 85.63 </b></font>

**Hence, applying the same preprocessing to the test data should give us similar accuracy**
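One way to keep the encoding identical for train and test data is to wrap the mappings from the preprocessing section in a single function. This is a sketch: `preprocess`, the `*_MAP` constants and the `toy_test` frame are hypothetical names introduced here, while the mapping values are the ones used earlier in the notebook.

```python
import pandas as pd

# Same encodings as in the preprocessing section
GENDER_MAP = {"Male": 1, "Female": 0}
DAMAGE_MAP = {"Yes": 1, "No": 0}
VEHICLE_AGE_MAP = {"< 1 Year": 1, "1-2 Year": 2, "> 2 Years": 3}

def preprocess(frame):
    """Return a copy of `frame` with the categorical columns encoded."""
    out = frame.copy()
    out["Gender"] = out["Gender"].map(GENDER_MAP)
    out["Vehicle_Damage"] = out["Vehicle_Damage"].map(DAMAGE_MAP)
    out["Vehicle_Age"] = out["Vehicle_Age"].map(VEHICLE_AGE_MAP)
    return out

# Two made-up rows standing in for the real test file
toy_test = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Vehicle_Damage": ["No", "Yes"],
    "Vehicle_Age": ["< 1 Year", "> 2 Years"],
})
encoded = preprocess(toy_test)
print(encoded)
```

Calling the same function on both files avoids the train and test encodings drifting apart.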

<font size="+1" color='blue'><b>I hope you enjoyed this kernel. Please don't forget to show your appreciation with an upvote.</b></font>
