**INSURANCE PREDICTION WITH MACHINE LEARNING**

Importing the necessary Python libraries and the dataset:

In [1]:
import pandas as pd
import plotly.express as px
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report

import pickle

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
df = pd.read_csv('/content/drive/MyDrive/TravelInsurancePrediction.csv')
df

Unnamed: 0.1,Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,0,31,Government Sector,Yes,400000,6,1,No,No,0
1,1,31,Private Sector/Self Employed,Yes,1250000,7,0,No,No,0
2,2,34,Private Sector/Self Employed,Yes,500000,4,1,No,No,1
3,3,28,Private Sector/Self Employed,Yes,700000,3,1,No,No,0
4,4,28,Private Sector/Self Employed,Yes,700000,8,1,Yes,No,0
...,...,...,...,...,...,...,...,...,...,...
1982,1982,33,Private Sector/Self Employed,Yes,1500000,4,0,Yes,Yes,1
1983,1983,28,Private Sector/Self Employed,Yes,1750000,5,1,No,Yes,0
1984,1984,28,Private Sector/Self Employed,Yes,1150000,6,1,No,No,0
1985,1985,34,Private Sector/Self Employed,Yes,1000000,6,0,Yes,Yes,1


The `Unnamed: 0` column only repeats the row index and is of no use, so I'll remove it from the data:

In [3]:
df.drop(['Unnamed: 0'],axis=1,inplace=True)
df

Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,Government Sector,Yes,400000,6,1,No,No,0
1,31,Private Sector/Self Employed,Yes,1250000,7,0,No,No,0
2,34,Private Sector/Self Employed,Yes,500000,4,1,No,No,1
3,28,Private Sector/Self Employed,Yes,700000,3,1,No,No,0
4,28,Private Sector/Self Employed,Yes,700000,8,1,Yes,No,0
...,...,...,...,...,...,...,...,...,...
1982,33,Private Sector/Self Employed,Yes,1500000,4,0,Yes,Yes,1
1983,28,Private Sector/Self Employed,Yes,1750000,5,1,No,Yes,0
1984,28,Private Sector/Self Employed,Yes,1150000,6,1,No,No,0
1985,34,Private Sector/Self Employed,Yes,1000000,6,0,Yes,Yes,1
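An alternative to dropping the column afterwards is to have `read_csv` treat the first CSV column as the index. This is a minimal sketch, assuming the first column of the file is just the saved row index (as the preview above suggests). The inline CSV text is a trimmed two-row sample, not the real file, and the names `sample_csv` and `sample_df` are only illustrative:

```python
import io

import pandas as pd

# Trimmed two-row sample mimicking the file layout (leading unnamed index column).
sample_csv = io.StringIO(
    ",Age,AnnualIncome,TravelInsurance\n"
    "0,31,400000,0\n"
    "1,31,1250000,0\n"
)

# index_col=0 makes the unnamed first column the index instead of a data column.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

With this option there is no `Unnamed: 0` column left to drop.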


Now let's check for missing values and look at the column types to get an idea of what kind of data we are working with:

In [4]:
df.isna().sum()

Age                    0
Employment Type        0
GraduateOrNot          0
AnnualIncome           0
FamilyMembers          0
ChronicDiseases        0
FrequentFlyer          0
EverTravelledAbroad    0
TravelInsurance        0
dtype: int64

In [5]:
df.dtypes

Age                     int64
Employment Type        object
GraduateOrNot          object
AnnualIncome            int64
FamilyMembers           int64
ChronicDiseases         int64
FrequentFlyer          object
EverTravelledAbroad    object
TravelInsurance         int64
dtype: object

In this dataset, the labels we want to predict are in the “TravelInsurance” column. The values in this column are encoded as 0 and 1, where 0 means not bought and 1 means bought. To make the charts easier to read, I will convert 0 and 1 to “Not Purchased” and “Purchased”:

In [6]:
df["TravelInsurance"] = df["TravelInsurance"].map({0: "Not Purchased", 1: "Purchased"})

Now let’s start by looking at the age column to see how age affects the purchase of an insurance policy:

In [7]:
figure = px.histogram(df, x = "Age",
                      color = "TravelInsurance",
                      title = "Factors Affecting Purchase of Travel Insurance: Age")
figure.show()

According to the visualization above, people around 34 are more likely to buy an insurance policy, while people around 28 are much less likely to buy one. Now let's see how a person's type of employment affects the purchase of an insurance policy:

In [8]:
figure = px.histogram(df, x = "Employment Type",
                      color = "TravelInsurance",
                      title = "Factors Affecting Purchase of Travel Insurance: Employment Type")
figure.show()

According to the visualization above, people working in the private sector or the self-employed are more likely to have an insurance policy. Now let's see how a person's annual income affects the purchase of an insurance policy:

In [9]:
figure = px.histogram(df, x = "AnnualIncome",
                      color = "TravelInsurance",
                      title = "Factors Affecting Purchase of Travel Insurance: Income")
figure.show()

According to the visualization above, people with an annual income of more than 1400000 are more likely to purchase the insurance policy.
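The histograms give a visual impression; a purchase rate per group puts a number on it. This is a minimal sketch on made-up toy data (the column names follow the dataset, the values do not), so the printed rates say nothing about the real file:

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "Age": [28, 28, 28, 34, 34, 34],
    "TravelInsurance": ["Not Purchased", "Not Purchased", "Purchased",
                        "Purchased", "Purchased", "Not Purchased"],
})

# Share of "Purchased" rows within each age group.
purchase_rate = (
    toy["TravelInsurance"].eq("Purchased")
    .groupby(toy["Age"])
    .mean()
)
print(purchase_rate)
```

The same pattern should work on `df` for any grouping column, such as "Employment Type".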

**Insurance Prediction Model**

I will convert all categorical values to 1 and 0 first because all columns are important for training the insurance prediction model:

In [10]:
df["Employment Type"] = df["Employment Type"].map({"Government Sector": 0, "Private Sector/Self Employed": 1})
df["GraduateOrNot"] = df["GraduateOrNot"].map({"No": 0, "Yes": 1})
df["FrequentFlyer"] = df["FrequentFlyer"].map({"No": 0, "Yes": 1})
df["EverTravelledAbroad"] = df["EverTravelledAbroad"].map({"No": 0, "Yes": 1})
df["TravelInsurance"] = df["TravelInsurance"].map({"Not Purchased":0, "Purchased":1})
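One thing to watch with `map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below illustrates this on a hypothetical series containing an unexpected label ("Maybe", which does not occur in the dataset as far as the preview shows):

```python
import pandas as pd

# Hypothetical column with one unexpected label ("Maybe") to show the failure mode.
flyer = pd.Series(["No", "Yes", "Maybe"], name="FrequentFlyer")

encoded = flyer.map({"No": 0, "Yes": 1})

# Any label missing from the dictionary is turned into NaN by Series.map.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

Running `df.isna().sum()` again after the encoding step is a cheap way to confirm that every category was covered.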

In [11]:
df.corr()

Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
Age,1.0,-0.115134,0.027125,-0.020101,0.027409,0.007359,-0.033159,-0.012779,0.06106
Employment Type,-0.115134,1.0,-0.127133,0.349157,-0.003354,-0.011553,0.14379,0.181098,0.147847
GraduateOrNot,0.027125,-0.127133,1.0,0.108066,0.021201,0.018811,-0.02812,0.062683,0.018934
AnnualIncome,-0.020101,0.349157,0.108066,1.0,-0.015367,-0.001149,0.353087,0.486043,0.396763
FamilyMembers,0.027409,-0.003354,0.021201,-0.015367,1.0,0.028209,-0.023775,-0.020755,0.079909
ChronicDiseases,0.007359,-0.011553,0.018811,-0.001149,0.028209,1.0,-0.04372,0.021238,0.01819
FrequentFlyer,-0.033159,0.14379,-0.02812,0.353087,-0.023775,-0.04372,1.0,0.277334,0.232103
EverTravelledAbroad,-0.012779,0.181098,0.062683,0.486043,-0.020755,0.021238,0.277334,1.0,0.433183
TravelInsurance,0.06106,0.147847,0.018934,0.396763,0.079909,0.01819,0.232103,0.433183,1.0
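The full matrix is easier to read when only the target's row is kept and sorted. The sketch below copies the "TravelInsurance" correlations from the output above into a series; in the notebook itself, something like `df.corr()["TravelInsurance"].drop("TravelInsurance")` would presumably give the same values:

```python
import pandas as pd

# Correlations with TravelInsurance, copied from the last row of the matrix above.
target_corr = pd.Series({
    "Age": 0.06106,
    "Employment Type": 0.147847,
    "GraduateOrNot": 0.018934,
    "AnnualIncome": 0.396763,
    "FamilyMembers": 0.079909,
    "ChronicDiseases": 0.01819,
    "FrequentFlyer": 0.232103,
    "EverTravelledAbroad": 0.433183,
})

# Strongest linear relationships with the target first.
ranked = target_corr.sort_values(ascending=False)
print(ranked.head(3))
```

"EverTravelledAbroad", "AnnualIncome" and "FrequentFlyer" come out on top, which is consistent with the income histogram.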


In [12]:
x = df.iloc[:,:-1]
x

Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad
0,31,0,1,400000,6,1,0,0
1,31,1,1,1250000,7,0,0,0
2,34,1,1,500000,4,1,0,0
3,28,1,1,700000,3,1,0,0
4,28,1,1,700000,8,1,1,0
...,...,...,...,...,...,...,...,...
1982,33,1,1,1500000,4,0,1,1
1983,28,1,1,1750000,5,1,0,1
1984,28,1,1,1150000,6,1,0,0
1985,34,1,1,1000000,6,0,1,1


In [13]:
y = df.iloc[:,-1]
y

0       0
1       0
2       1
3       0
4       0
       ..
1982    1
1983    0
1984    0
1985    1
1986    0
Name: TravelInsurance, Length: 1987, dtype: int64

In [14]:
y.value_counts()

TravelInsurance
0    1277
1     710
Name: count, dtype: int64

In [15]:
scaler = MinMaxScaler()
xscale = scaler.fit_transform(x)
xscale

array([[0.6, 0. , 1. , ..., 1. , 0. , 0. ],
       [0.6, 1. , 1. , ..., 0. , 0. , 0. ],
       [0.9, 1. , 1. , ..., 1. , 0. , 0. ],
       ...,
       [0.3, 1. , 1. , ..., 1. , 0. , 0. ],
       [0.9, 1. , 1. , ..., 0. , 1. , 1. ],
       [0.9, 1. , 1. , ..., 0. , 0. , 0. ]])
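`MinMaxScaler` rescales each column to the range 0 to 1 using (value - min) / (max - min). Here is a small check of that formula by hand. The ages are hypothetical, and the 25 to 35 range is an assumption inferred from the scaled output above (31 becomes 0.6 and 34 becomes 0.9):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages spanning 25 to 35 (assumed range, inferred from the scaled output).
ages = np.array([[25.0], [28.0], [31.0], [34.0], [35.0]])

# Min-max scaling by hand: (value - column min) / (column max - column min).
by_hand = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The same transformation through scikit-learn.
by_sklearn = MinMaxScaler().fit_transform(ages)
print(by_hand.ravel())
```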

Now let's split the data and compare several classification algorithms, including a decision tree, on the same train/test split:

In [16]:
xtrain, xtest, ytrain, ytest = train_test_split(xscale, y, test_size=0.3, random_state=15)

In [17]:
xtrain.shape, xtest.shape

((1390, 8), (597, 8))

In [18]:
ytrain.shape, ytest.shape

((1390,), (597,))
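Two optional refinements to this split are worth considering. Since the classes are imbalanced (1277 vs 710), `stratify` keeps the class ratio the same in both parts, and putting the scaler in a pipeline fits it on the training rows only, so no information from the test rows leaks into the scaling. The sketch below uses synthetic stand-in data and a decision tree purely for illustration; the names `X_demo`, `y_demo` and `pipe` are not from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly the notebook's class balance (about 36% positives).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.36).astype(int)

# stratify=y_demo keeps the class ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15, stratify=y_demo
)

# Inside a pipeline the scaler is fitted on the training rows only.
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
pipe.fit(X_tr, y_tr)

print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```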

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score

knn = KNeighborsClassifier()
sv = SVC()
nbs = GaussianNB()
rb = RandomForestClassifier()
ad = AdaBoostClassifier()
xg = XGBClassifier()
dtc = DecisionTreeClassifier()

models=[knn,sv,nbs,rb,ad,xg,dtc]
for model in models:
  print(model)
  model.fit(xtrain,ytrain)
  ypred=model.predict(xtest)
  print(classification_report(ytest,ypred))
  # print(precision_score(ytest,ypred),',',recall_score(ytest,ypred),',',f1_score(ytest,ypred),',',accuracy_score(ytest,ypred))

KNeighborsClassifier()
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       383
           1       0.75      0.57      0.64       214

    accuracy                           0.78       597
   macro avg       0.77      0.73      0.74       597
weighted avg       0.77      0.78      0.77       597

SVC()
              precision    recall  f1-score   support

           0       0.76      0.96      0.85       383
           1       0.88      0.46      0.61       214

    accuracy                           0.78       597
   macro avg       0.82      0.71      0.73       597
weighted avg       0.80      0.78      0.76       597

GaussianNB()
              precision    recall  f1-score   support

           0       0.76      0.86      0.81       383
           1       0.67      0.51      0.58       214

    accuracy                           0.74       597
   macro avg       0.72      0.69      0.69       597
weighted avg       0.73      0.7
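Long classification reports are hard to compare side by side. One option, in the spirit of the commented-out line in the loop, is to collect the headline metrics into a single table. This is only a sketch on synthetic data from `make_classification` with three scikit-learn models, so the numbers it prints are unrelated to the results above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data (8 features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=15)

rows = []
for clf in [KNeighborsClassifier(), GaussianNB(), DecisionTreeClassifier(random_state=0)]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": type(clf).__name__,
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "accuracy": accuracy_score(y_te, pred),
    })

# One row per model, best F1 score first.
summary = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(summary.round(2))
```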

In [None]:
from sklearn.model_selection import GridSearchCV
params= {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 8],
    'subsample': [0.7, 0.8, 0.9],
    'lambda': [1, 10, 100],
    'alpha': [0, 0.01, 0.1],
}
xgb = XGBClassifier()
gscv = GridSearchCV(xgb,params,cv=10,scoring='accuracy')
gscv.fit(xtrain,ytrain)

In [None]:
gscv.best_params_
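Rather than retyping the best parameters by hand, the fitted search object can hand back the winning model directly. The sketch below swaps in a `DecisionTreeClassifier` and synthetic data so that it runs without xgboost; the same attributes (`best_params_`, `best_estimator_`) exist on any fitted `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a decision tree replaces XGBoost to keep the sketch light.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=15)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 8]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
best_model = search.best_estimator_
test_pred = best_model.predict(X_te)
print(search.best_params_)
```

This avoids copy errors between the search result and the final model.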

In [None]:
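# Refit XGBoost with the selected hyperparameters and evaluate this model,
# not `model`, which still points at the last classifier from the loop above
xgb = XGBClassifier(alpha=0, max_depth=3, n_estimators=100, subsample=0.7)
xgb.fit(xtrain, ytrain)
ypred = xgb.predict(xtest)
print(classification_report(ytest, ypred))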

In [None]:
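# The models were trained on scaled features, so scale the new row with the fitted scaler first
new_row = pd.DataFrame([[33, 1, 1, 1500000, 4, 0, 1, 1]], columns=x.columns)
y_new = xgb.predict(scaler.transform(new_row))
y_new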

The model gives a score of over 80% which is not bad for this kind of problem. So this is how you can train a machine learning model for the task of insurance prediction using Python.

In [None]:
with open('model.sav', 'wb') as f:
    pickle.dump(xgb, f)  # save the tuned XGBoost model, not the loop variable `model`
with open('scaler.sav', 'wb') as f:
    pickle.dump(scaler, f)
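To use the saved files later, both objects have to be loaded, and new rows must go through the scaler before the model. The sketch below is a self-contained round trip on a tiny made-up training set with two features and a decision tree. The file names match the ones above, but they are written to a temporary folder, and everything else (`demo_model`, `demo_scaler`, the data) is hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set with columns [Age, AnnualIncome].
X_demo = np.array([[25, 300000], [28, 700000], [33, 1500000], [35, 1800000]], dtype=float)
y_demo = np.array([0, 0, 1, 1])

demo_scaler = MinMaxScaler().fit(X_demo)
demo_model = DecisionTreeClassifier(random_state=0).fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.sav")
    scaler_path = os.path.join(tmp, "scaler.sav")

    # Save both objects, as in the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(demo_scaler, f)

    # Load them back, as a separate script or app would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# Scale the new row with the loaded scaler before predicting.
new_row = np.array([[34, 1600000]], dtype=float)
loaded_pred = loaded_model.predict(loaded_scaler.transform(new_row))
print(loaded_pred)
```

Pickle files should only be loaded from sources you trust, and ideally with the same library versions that were used to save them.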