## Travel Insurance Data Analysis

Change log
| Date | Name | Changes | To-Do |
| ---- | ---- | ------- | ----- |
| 2022/08/11 | Matthew | FE & Modelling | 1. Reproducible code (use functions) <br> 2. Focus on FE & imbalance rather than modelling <br> 3. File structure (separate notebook and data) |

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import autokeras as ak

from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from autosklearn.classification import AutoSklearnClassifier
from imblearn.over_sampling import SMOTE
from pandas_profiling import ProfileReport

Load & Quick Examination of the Dataset

In [61]:
travel_df = pd.read_csv('travel_insurance_dataset.csv')

In [62]:
travel_df.head()

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Claim,Duration,Destination,Net Sales,Commision (in value),Gender,Age
0,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,81
1,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,71
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,65,AUSTRALIA,-49.5,29.7,,32
3,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,60,AUSTRALIA,-39.6,23.76,,32
4,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,79,ITALY,-19.8,11.88,,41


EDA using Pandas Profiling

In [55]:
prof = ProfileReport(travel_df)
prof.to_file(output_file='travel_eda.html')

Summarize dataset: 100%|██████████| 41/41 [00:50<00:00,  1.23s/it, Completed]                                         
Generate report structure: 100%|██████████| 1/1 [00:09<00:00,  9.23s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.50s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 168.37it/s]


Feature Engineering

In [71]:
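def feature_engineering(df):
    """One-hot encode the categorical columns and KNN-impute missing values."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # Ages of 0 and 118 are treated as placeholder/outlier values and marked missing
    df['Age'] = df['Age'].replace([0, 118], np.nan)
    # drop_first=True turns the 'Claim' column into a single 'Claim_Yes' indicator
    df = pd.get_dummies(df, drop_first=True)
    # KNNImputer returns a NumPy array, so restore the column names afterwards
    imputer = KNNImputer(n_neighbors=2)
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    return df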

cleaned_df = feature_engineering(travel_df)
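
Because `feature_engineering` runs `KNNImputer` over every column after one-hot encoding, the `Claim_Yes` target also takes part in the neighbour search. A minimal sketch of one alternative, assuming a hypothetical helper `impute_features` and a toy frame rather than the real dataset, imputes the feature columns only and leaves the target untouched:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_features(df, target_col, n_neighbors=2):
    """Hypothetical helper: KNN-impute every column except target_col."""
    features = df.drop(columns=[target_col])
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = pd.DataFrame(
        imputer.fit_transform(features),
        columns=features.columns,
        index=features.index,
    )
    # Re-attach the untouched target column
    imputed[target_col] = df[target_col]
    return imputed

# Toy data (not from the real dataset): one missing Age value
toy = pd.DataFrame({
    'Duration': [10.0, 11.0, 200.0],
    'Age': [30.0, np.nan, 60.0],
    'Claim_Yes': [0, 0, 1],
})
toy_imputed = impute_features(toy, 'Claim_Yes', n_neighbors=1)
print(toy_imputed)
```

Whether this changes the results on the real data is untested here; it only keeps the label out of the imputation step.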

Data Splitting

In [74]:
def data_splitting(df, target_col, test_size):
    """Split df into train/test features and target and return the four pieces."""
    X = df.loc[:, df.columns != target_col]
    y = df.loc[:, target_col]
    # Return the splits instead of writing them to module-level globals
    return train_test_split(X, y, test_size=test_size, random_state=0)

X_train, X_test, y_train, y_test = data_splitting(cleaned_df, 'Claim_Yes', 0.3)
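
When the target classes are unevenly represented, a purely random split can leave the test set with a slightly different positive rate than the training set. One option, sketched here on synthetic labels with a hypothetical helper name `stratified_splitting`, is to pass `stratify=y` to `train_test_split` so both splits keep the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_splitting(X, y, test_size, seed=0):
    """Hypothetical variant of data_splitting that preserves the class ratio."""
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

# Synthetic data (an assumption for illustration): 2% positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 980 + [1] * 20)

X_tr, X_te, y_tr, y_te = stratified_splitting(X_demo, y_demo, test_size=0.3)
print(y_tr.mean(), y_te.mean())  # both close to the overall positive rate
```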

Imbalanced Data

In [79]:
cleaned_df['Claim_Yes'].value_counts()

0.0    62399
1.0      927
Name: Claim_Yes, dtype: int64
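
These counts imply that only about 1.5% of policies have a claim, so a model that always predicts "no claim" would already be right on roughly 98.5% of rows. The arithmetic, using only the two counts printed above:

```python
# Class counts copied from the value_counts() output above
no_claim = 62399
claim = 927

total = no_claim + claim
positive_rate = claim / total            # share of policies with a claim
majority_baseline = no_claim / total     # accuracy of always predicting "no claim"
imbalance_ratio = no_claim / claim       # negatives per positive

print(f"positive rate:     {positive_rate:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
```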

In [43]:
# Oversample the training split only, so the test split keeps its original class ratio
sm = SMOTE(random_state=1)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

Modelling (Auto-Sklearn)

In [46]:
model = AutoSklearnClassifier()
model.fit(X_train_res,y_train_res)

AutoSklearnClassifier(per_run_time_limit=360)

In [48]:
y_pred = model.predict(X_test)
testing_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy score {0}".format(testing_accuracy))

print(classification_report(y_test, y_pred))

Test Accuracy score 0.9738393515106853
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99     18748
         1.0       0.06      0.07      0.06       250

    accuracy                           0.97     18998
   macro avg       0.52      0.53      0.53     18998
weighted avg       0.98      0.97      0.97     18998
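
The 0.97 accuracy needs to be read against the class balance of the test set. Using the support counts from the report above, always predicting class 0 would score higher than the fitted model, which is why the class 1 precision and recall are the more informative figures here:

```python
# Support counts and accuracy copied from the report above
support_no_claim = 18748
support_claim = 250
reported_accuracy = 0.9738393515106853

n_test = support_no_claim + support_claim
baseline_accuracy = support_no_claim / n_test  # always predict "no claim"

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {reported_accuracy:.4f}")
print("model beats baseline:", reported_accuracy > baseline_accuracy)
```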



Modelling (Auto-Keras)

In [50]:
# Try up to 3 candidate models (max_trials=3).
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=3)
# Feed the structured data classifier with training data.
clf.fit(X_train_res, y_train_res, epochs=10)
# Predict with the best model.
predicted_y = clf.predict(X_test)
# Evaluate the best model with testing data.
print(clf.evaluate(X_test, y_test))

Trial 3 Complete [00h 10m 28s]
val_accuracy: 1.0

Best val_accuracy So Far: 1.0
Total elapsed time: 00h 31m 37s
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
[3039.328125, 0.015264764428138733]


In [51]:
print(classification_report(y_test, predicted_y))

              precision    recall  f1-score   support

         0.0       1.00      0.00      0.00     18748
         1.0       0.01      1.00      0.03       250

    accuracy                           0.02     18998
   macro avg       0.51      0.50      0.02     18998
weighted avg       0.99      0.02      0.00     18998
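
A validation accuracy of 1.0 next to a test accuracy near 0.015 suggests the search was validated on a single class. One possible cause, stated here as an assumption rather than a confirmed diagnosis: SMOTE appends its synthetic minority rows after the original rows, and a validation split taken from the tail of the training data would then contain only class 1. The sketch below uses synthetic labels and a hypothetical helper `shuffle_together` to show the effect and how shuffling before `clf.fit` would avoid it:

```python
import numpy as np

def shuffle_together(X, y, seed=0):
    """Hypothetical helper: apply one random permutation to X and y."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    return X[order], y[order]

def tail_classes(y, fraction=0.2):
    """Return the distinct labels found in the last `fraction` of y."""
    n_tail = int(len(y) * fraction)
    return set(np.unique(y[-n_tail:]).tolist())

# Synthetic stand-in for oversampled data: originals first, synthetic 1s last
y_ordered = np.array([0] * 500 + [1] * 10 + [1] * 490)
X_ordered = np.arange(len(y_ordered)).reshape(-1, 1)

print(tail_classes(y_ordered))            # tail of the ordered labels
X_shuf, y_shuf = shuffle_together(X_ordered, y_ordered)
print(tail_classes(y_shuf))               # tail after shuffling
```

The helper indexes NumPy arrays directly; with pandas objects, `.iloc[order]` would play the same role.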

