# Assignment 1: Naive Bayes classification {-}

This assignment aims at familiarizing you with training and testing Naive Bayes model. You will have to:

- Load the dataset.
- Analyze the dataset.
- Split the dataset into training, validation and test set.
- Train a Gaussian Naive Bayes (GaussianNB, https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) model and find the best set of hyperparameters (var_smoothing hyperparameter) using validation set and cross validation technique (see GridSearchCV )
- Train a Mixed Naive Bayes (MixedNB, https://pypi.org/project/mixed-naive-bayes/) model.
- Evaluate and compare the model performance of GaussianNB and MixedNB on the test set using the following metrics: precision, recall, f1-score.

The dataset you will be working on is 'travel-insurance.csv'. It is composed of attributes such as age, employment type, etc, to predict if a customer is going to buy a travel insurance.

### Submission {-}
The structure of submission folder should be organized as follows:

- ./\<StudentID>-assignment1-notebook.ipynb: Jupyter notebook containing source code.

The submission folder is named ML4DS-\<StudentID>-Assignment1 (e.g., ML4DS-2012345-Assigment1) and then compressed with the same name.
    
### Evaluation {-}
Assignment evaluation will be conducted on how properly you handle the data for training and testing purpose, build a Naive Bayes classifier and evaluate the model performance. In addition, your code should conform to a Python coding convention such as PEP-8.

### Deadline {-}
Please visit Canvas for details.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("travel-insurance.csv", index_col=0)
df.head()

Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,Government Sector,Yes,400000,6,1,No,No,0
1,31,Private Sector/Self Employed,Yes,1250000,7,0,No,No,0
2,34,Private Sector/Self Employed,Yes,500000,4,1,No,No,1
3,28,Private Sector/Self Employed,Yes,700000,3,1,No,No,0
4,28,Private Sector/Self Employed,Yes,700000,8,1,Yes,No,0


Here follows the list of columns in the dataset:

* Age - Age of the customer
* Employment Type - The sector in which customer is employed
* GraduateOrNot - Whether the customer is college graduate or not
* AnnualIncome - The yearly income of the customer in indian rupees
* FamilyMembers - Number of members in customer's family
* ChronicDisease - Whether the customer suffers from any major disease or conditions like diabetes/high BP or asthama, etc.
* FrequentFlyer - Derived data based on customer's history of booking air tickets on atleast 4 different instances in the last 2 Years (2017-2019).
* EverTravelledAbroad - Has the customer ever travelled to a foreign country.
* TravelInsurance: (label) Did the customer buy travel insurance package during introductory offering held in the year 2019.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 1986
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Age                  1987 non-null   int64 
 1   Employment Type      1987 non-null   object
 2   GraduateOrNot        1987 non-null   object
 3   AnnualIncome         1987 non-null   int64 
 4   FamilyMembers        1987 non-null   int64 
 5   ChronicDiseases      1987 non-null   int64 
 6   FrequentFlyer        1987 non-null   object
 7   EverTravelledAbroad  1987 non-null   object
 8   TravelInsurance      1987 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 155.2+ KB


In [5]:
# You code goes here
df.describe()

Unnamed: 0,Age,AnnualIncome,FamilyMembers,ChronicDiseases,TravelInsurance
count,1987.0,1987.0,1987.0,1987.0,1987.0
mean,29.650226,932763.0,4.752894,0.277806,0.357323
std,2.913308,376855.7,1.60965,0.44803,0.479332
min,25.0,300000.0,2.0,0.0,0.0
25%,28.0,600000.0,4.0,0.0,0.0
50%,29.0,900000.0,5.0,0.0,0.0
75%,32.0,1250000.0,6.0,1.0,1.0
max,35.0,1800000.0,9.0,1.0,1.0


In [6]:
df.isnull().any

<bound method NDFrame._add_numeric_operations.<locals>.any of         Age  Employment Type  GraduateOrNot  AnnualIncome  FamilyMembers  \
0     False            False          False         False          False   
1     False            False          False         False          False   
2     False            False          False         False          False   
3     False            False          False         False          False   
4     False            False          False         False          False   
...     ...              ...            ...           ...            ...   
1982  False            False          False         False          False   
1983  False            False          False         False          False   
1984  False            False          False         False          False   
1985  False            False          False         False          False   
1986  False            False          False         False          False   

      ChronicDiseases  Fr

In [7]:
from sklearn import preprocessing
label_encode = preprocessing.LabelEncoder()
df['GraduateOrNot'] = label_encode.fit_transform(df['GraduateOrNot'])
df['FrequentFlyer'] = label_encode.fit_transform(df['FrequentFlyer'])
df['EverTravelledAbroad'] = label_encode.fit_transform(df['EverTravelledAbroad'])
df

Unnamed: 0,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,Government Sector,1,400000,6,1,0,0,0
1,31,Private Sector/Self Employed,1,1250000,7,0,0,0,0
2,34,Private Sector/Self Employed,1,500000,4,1,0,0,1
3,28,Private Sector/Self Employed,1,700000,3,1,0,0,0
4,28,Private Sector/Self Employed,1,700000,8,1,1,0,0
...,...,...,...,...,...,...,...,...,...
1982,33,Private Sector/Self Employed,1,1500000,4,0,1,1,1
1983,28,Private Sector/Self Employed,1,1750000,5,1,0,1,0
1984,28,Private Sector/Self Employed,1,1150000,6,1,0,0,0
1985,34,Private Sector/Self Employed,1,1000000,6,0,1,1,1


In [8]:
employ = pd.get_dummies(df["Employment Type"], dtype = np.uint8)
data = pd.concat([df, employ], axis=1)
data.drop('Employment Type', inplace=True, axis=1)
data


Unnamed: 0,Age,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance,Government Sector,Private Sector/Self Employed
0,31,1,400000,6,1,0,0,0,1,0
1,31,1,1250000,7,0,0,0,0,0,1
2,34,1,500000,4,1,0,0,1,0,1
3,28,1,700000,3,1,0,0,0,0,1
4,28,1,700000,8,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
1982,33,1,1500000,4,0,1,1,1,0,1
1983,28,1,1750000,5,1,0,1,0,0,1
1984,28,1,1150000,6,1,0,0,0,0,1
1985,34,1,1000000,6,0,1,1,1,0,1


Gaussian Bayes


In [9]:
feature_names = data.columns.tolist()
feature_names.remove("TravelInsurance")
feature_names.remove("EverTravelledAbroad")
X = data[feature_names].values
y = data.TravelInsurance.values


In [10]:
X.shape

(1987, 8)

In [11]:
y.shape

(1987,)

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [13]:
# Standardize the data using Standard scaler
from sklearn.preprocessing import StandardScaler
normalizer = StandardScaler()
X_normal_train = normalizer.fit_transform(X_train)     # Note that we use fit_transform() on training data so that it can learn the scaling parameters of that data.
X_normal_test = normalizer.transform(X_test)           # But we only transform() in test data using the learned scaling parameters.

In [14]:
# Initialize and train Gaussian Naive Bayes model using X_normal_train (data features) and y_train (data label)
from sklearn.naive_bayes import GaussianNB
naive_model = GaussianNB()
naive_model.fit(X_normal_train, y_train)

In [15]:
# Impport libraries to calculate evaluation metrics: precision, recall, f1 score.
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Make prediction on the test data
predicted_label = naive_model.predict(X_normal_test)

# Calculate evaluation metrics by comparing the prediction with the data label y_test
print(precision_score(predicted_label, y_test))
print(recall_score(predicted_label, y_test))
print(f1_score(predicted_label, y_test))
print(classification_report(predicted_label, y_test))

0.5841121495327103
0.6377551020408163
0.6097560975609756
              precision    recall  f1-score   support

           0       0.81      0.78      0.80       401
           1       0.58      0.64      0.61       196

    accuracy                           0.73       597
   macro avg       0.70      0.71      0.70       597
weighted avg       0.74      0.73      0.73       597



GridSearchCV


In [16]:
# Split the data into train/test set using sklearn library
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [17]:
# Standardize the data using Standard scaler
from sklearn.preprocessing import StandardScaler, PowerTransformer
normalizer = StandardScaler()
X_normal_train = normalizer.fit_transform(X_train)     # Note that we use fit_transform() on training data so that it can learn the scaling parameters of that data.
X_normal_test = normalizer.transform(X_test)           # But we only transform() the test data using the learned scaling parameters.

In [31]:
# Randomized CV splitters may return different results for each call of split by using Repeated StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
cross_val = RepeatedStratifiedKFold(n_splits=5, n_repeats=8, random_state=724)
params_nb = {'var_smoothing': np.logspace(0,-8, num=100)}
print(np.logspace(0,-8, num=10))


[1.00000000e+00 1.29154967e-01 1.66810054e-02 2.15443469e-03
 2.78255940e-04 3.59381366e-05 4.64158883e-06 5.99484250e-07
 7.74263683e-08 1.00000000e-08]


In [32]:
from sklearn.model_selection import GridSearchCV
grid_nb = GridSearchCV(estimator=naive_model,
                     param_grid=params_nb,
                     cv= cross_val,
                     verbose=1,
                     scoring='accuracy')

In [33]:
Data_transformed = PowerTransformer().fit_transform(X_normal_train)
Test_transformed = PowerTransformer().fit_transform(X_normal_test)
grid_nb.fit(Data_transformed, y_train);

Fitting 40 folds for each of 100 candidates, totalling 4000 fits


In [34]:
grid_nb.best_params_ # Show the best value for smoothing

{'var_smoothing': 0.47508101621027965}

In [36]:
grid_nb.best_score_ #show the model performance with the above parameter

0.7212230215827338

In [38]:
predict_test = grid_nb.predict(Test_transformed)

In [40]:
# Accuracy Score on test dataset
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy score on test dataset : ', accuracy_test)

accuracy score on test dataset :  0.7353433835845896


MixedNB


In [41]:
# Install the library
!pip install git+https://github.com/remykarem/mixed-naive-bayes#egg=mixed_naive_bayes

Collecting mixed_naive_bayes
  Cloning https://github.com/remykarem/mixed-naive-bayes to /tmp/pip-install-015rli08/mixed-naive-bayes_016b001bfc624e2ea04a34a736bb73fe
  Running command git clone --filter=blob:none --quiet https://github.com/remykarem/mixed-naive-bayes /tmp/pip-install-015rli08/mixed-naive-bayes_016b001bfc624e2ea04a34a736bb73fe
  Resolved https://github.com/remykarem/mixed-naive-bayes to commit 6d90de8adf75dbef032ad51029ad3782190ec577
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: mixed_naive_bayes
  Building wheel for mixed_naive_bayes (setup.py) ... [?25l[?25hdone
  Created wheel for mixed_naive_bayes: filename=mixed_naive_bayes-0.0.4-py3-none-any.whl size=10858 sha256=274ddce131db9a08f4939d739dd4d7855f5cb3aceb15789ec2d21df9c21a6103
  Stored in directory: /tmp/pip-ephem-wheel-cache-w6d5quz8/wheels/2a/a3/2a/c06776fa657161751268d6cb9915f08b208b93a6908a1342d3
Successfully built mixed_naive_bayes
Installing collected packages

In [42]:
# Import mixed Naive Bayes library
from mixed_naive_bayes import MixedNB



In [43]:
data[feature_names]

Unnamed: 0,Age,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,Government Sector,Private Sector/Self Employed
0,31,1,400000,6,1,0,1,0
1,31,1,1250000,7,0,0,0,1
2,34,1,500000,4,1,0,0,1
3,28,1,700000,3,1,0,0,1
4,28,1,700000,8,1,1,0,1
...,...,...,...,...,...,...,...,...
1982,33,1,1500000,4,0,1,0,1
1983,28,1,1750000,5,1,0,0,1
1984,28,1,1150000,6,1,0,0,1
1985,34,1,1000000,6,0,1,0,1


In [44]:
# Split the data into train/test set using sklearn library
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [45]:
clf = MixedNB(categorical_features=[1, 4, 5, 6, 7])
clf.fit(X_train, y_train)

MixedNB(alpha=0.5, var_smoothing=1e-09)

In [46]:
clf.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,