Capstone - Part 4

Topic: Data Pre-Processing Part 2 & Classification Modelling

Created By: Jason

In [1]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [3]:
# Import necessary libraries

import pandas as pd # data analysis
import numpy as np # arrays

In [4]:
path = '/drive/MyDrive/Colab Notebooks/data_tour_cluster.csv' # load new dataset (with "cluster" as the target variable)
data = pd.read_csv(path)

In [5]:
# Check the variable information (Data type etc.)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               115 non-null    int64 
 1   Cluster                  115 non-null    int64 
 2   gender                   115 non-null    object
 3   age                      115 non-null    object
 4   household_income         115 non-null    object
 5   education_level          115 non-null    object
 6   occupation               115 non-null    object
 7   travel_purpose           115 non-null    object
 8   pref_transport_mode      115 non-null    object
 9   pref_length_stay         115 non-null    object
 10  travel_pay_willingness   115 non-null    object
 11  pref_poi                 115 non-null    object
 12  pref_group_size          115 non-null    object
 13  pref_travel_destination  115 non-null    object
dtypes: int64(2), object(12)
memory usage: 12.7

In [6]:
# 'Cluster' is read as an integer (int64), but it is a categorical label,
# so it is converted to a string type
# Map the column name to its new data type
data_types_dict = {'Cluster': str}

# Pass the dictionary to astype to convert the 'Cluster' column to str
data = data.astype(data_types_dict)

# Check the resulting data types
data.dtypes

Unnamed: 0                  int64
Cluster                    object
gender                     object
age                        object
household_income           object
education_level            object
occupation                 object
travel_purpose             object
pref_transport_mode        object
pref_length_stay           object
travel_pay_willingness     object
pref_poi                   object
pref_group_size            object
pref_travel_destination    object
dtype: object

In [7]:
# Remove irrelevant columns
# The first column ('Unnamed: 0') is not useful for further analysis

data = data.drop(['Unnamed: 0'], axis=1)
data.dtypes

Cluster                    object
gender                     object
age                        object
household_income           object
education_level            object
occupation                 object
travel_purpose             object
pref_transport_mode        object
pref_length_stay           object
travel_pay_willingness     object
pref_poi                   object
pref_group_size            object
pref_travel_destination    object
dtype: object
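
Both clean-up steps above (casting Cluster to text and dropping the unnamed first column) can also be handled while the file is read. The sketch below is a minimal illustration on a small in-memory CSV. The sample rows and the name toy_csv are made up for the example, and it assumes the unnamed first column is simply a row index that was saved with the file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics a file saved together with its row index
toy_csv = io.StringIO(
    ",Cluster,gender\n"
    "0,0,Male\n"
    "1,2,Female\n"
)

# index_col=0 turns the unnamed first column into the index;
# dtype reads 'Cluster' as text instead of int64
toy = pd.read_csv(toy_csv, index_col=0, dtype={"Cluster": str})

print(toy.columns.tolist())
print(toy["Cluster"].tolist())
```

With this approach no 'Unnamed: 0' column is created, so the later drop step would not be needed.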

Data Pre-Processing: Label Encoding

In [8]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder

# Categorical variables to label encode
label_cols = ['gender', 'age', 'household_income', 'education_level', 'occupation',
              'pref_transport_mode', 'travel_pay_willingness', 'pref_group_size', 'travel_purpose']

data_encoded = data.copy() # Work on a copy so the original data is kept as a backup
# fit_transform is re-fitted on each column in turn
data_encoded[label_cols] = data_encoded[label_cols].apply(LabelEncoder().fit_transform)
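
Applying LabelEncoder().fit_transform column by column does not keep the fitted label-to-integer mappings. If the integer codes need to be translated back to the original labels later, one option is to keep one fitted encoder per column. The sketch below shows this on a toy DataFrame; the category values and the names toy, encoders and gender_mapping are assumptions made for illustration, not values from the survey data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for two of the survey columns
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": ["18-24", "25-34", "18-24", "35-44"],
})

encoders = {}              # one fitted encoder per column
toy_encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    toy_encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so the position of a label is its integer code
gender_mapping = {str(label): code for code, label in enumerate(encoders["gender"].classes_)}
print(gender_mapping)

# Translate the integer codes back to the original labels
decoded_gender = encoders["gender"].inverse_transform(toy_encoded["gender"])
print(list(decoded_gender))
```

Note that LabelEncoder assigns codes in sorted label order, so the codes carry no ordinal meaning for variables such as income or age bands.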


In [9]:
# Quick look at the dataset to confirm the encoding
data_encoded.head()

,Cluster,gender,age,household_income,education_level,occupation,travel_purpose,pref_transport_mode,pref_length_stay,travel_pay_willingness,pref_poi,pref_group_size,pref_travel_destination
0,0,0,0,1,1,1,0,0,Less than 3 days,0,Shopping Malls & Historical Site,0,East Malaysia (Sabah and Sarawak)
1,1,1,0,1,2,0,1,0,3 Days,0,Shopping Malls & Historical Site,0,"Southern Region (Malacca, Johor and Negeri Sem..."
2,2,0,0,0,2,0,1,0,3 Days,1,Beaches and waterfall,1,"East Coast Region (Kelantan, Terengganu and Pa..."
3,2,1,0,0,1,0,1,1,Less than 3 days,0,Beaches and waterfall,0,"East Coast Region (Kelantan, Terengganu and Pa..."
4,2,0,0,0,1,0,1,0,3 Days,1,"Adventure and Activity (Ex: Hiking, Mountain C...",0,"East Coast Region (Kelantan, Terengganu and Pa..."


In [None]:
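# Feature Selection -> Based on the Boruta algorithm carried out in R
# pref_group_size and education_level are dropped as they are not important
# for predicting the dependent variable 'Cluster'
# Note: this cell has no execution count, and the 12-feature shapes printed
# further down indicate the recorded outputs were produced without this drop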
data_encoded = data_encoded.drop(['pref_group_size', 'education_level'], axis=1)

Data Pre-Processing: One-Hot Encoding

In [10]:
# One-Hot Encoding of the remaining text columns:
# pref_length_stay, pref_poi, pref_travel_destination
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

column_transform_ohe = make_column_transformer(
               (OneHotEncoder(), ['pref_length_stay','pref_poi','pref_travel_destination']), 
               remainder = 'passthrough')

# Preview of the encoded array only; the models below apply the transformer inside a pipeline
data_encoded_2 = column_transform_ohe.fit_transform(data_encoded)

In [11]:
data_encoded_2

array([[0.0, 0.0, 1.0, ..., 0, 0, 0],
       [1.0, 0.0, 0.0, ..., 0, 0, 0],
       [1.0, 0.0, 0.0, ..., 0, 1, 1],
       ...,
       [0.0, 1.0, 0.0, ..., 0, 1, 1],
       [0.0, 0.0, 1.0, ..., 0, 0, 0],
       [0.0, 1.0, 0.0, ..., 0, 0, 0]], dtype=object)

Modelling - Supervised Machine Learning / Classification (Phase 2)

Data Partitioning / Data Splitting

In [12]:
# Identify target variable 
Y = data_encoded["Cluster"]
X = data_encoded.drop("Cluster", axis = 1)

In [13]:
# Data Partitioning 
# Split the data
# 70% train & 30% test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  

(80, 12) (35, 12) (80,) (35,)
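
The split above is purely random, so with a small dataset the class proportions in the test set can drift from those in the full data. One option is the stratify argument of train_test_split, sketched below on a toy balanced target. The arrays X_toy and y_toy are hypothetical stand-ins, not the survey data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 rows, 3 classes with 10 rows each
X_toy = np.arange(60).reshape(30, 2)
y_toy = np.repeat(["0", "1", "2"], 10)

# stratify keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, test_counts)))
```

In this toy case each class contributes three of the nine test rows.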


Model 1: LOGISTIC REGRESSION - Multinomial (Multi-class)

In [14]:
# Build Logistic Regression classifier 
from sklearn.linear_model import LogisticRegression
lrmodel = LogisticRegression(multi_class='multinomial', solver='lbfgs') 
# 'multinomial' is selected as there are 3 categories in the target variable

In [15]:
from sklearn.pipeline import make_pipeline
# make_pipeline is imported to streamline the workflow
# The pipeline one-hot encodes the data before training:
# 1. Data pre-processing -> one-hot encoding
# 2. Apply the LR classifier

lrpipe = make_pipeline(column_transform_ohe, lrmodel)

In [16]:
lrpipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['pref_length_stay',
                                                   'pref_poi',
                                                   'pref_travel_destination'])])),
                ('logisticregression',
                 LogisticRegression(multi_class='multinomial'))])

In [17]:
y_pred = lrpipe.predict(X_test)
y_pred

array(['1', '2', '2', '1', '1', '1', '0', '2', '0', '2', '0', '2', '1',
       '2', '2', '0', '1', '1', '0', '1', '0', '0', '2', '2', '0', '0',
       '1', '1', '1', '0', '1', '0', '2', '1', '2'], dtype=object)

In [18]:
# Model performance - LR (Accuracy)
accuracy = lrpipe.score(X_test, y_test)
print("Accuracy: %.4f" % accuracy)

Accuracy: 0.8571


In [19]:
# Compute Confusion Matrix (Prediction of LR model)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[ 9  0  2]
 [ 0 12  0]
 [ 2  1  9]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82        11
           1       0.92      1.00      0.96        12
           2       0.82      0.75      0.78        12

    accuracy                           0.86        35
   macro avg       0.85      0.86      0.85        35
weighted avg       0.85      0.86      0.85        35
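
The figures in the classification report follow directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. The sketch below recomputes them with NumPy from the matrix printed above; the variable names are chosen for the example.

```python
import numpy as np

# Confusion matrix printed above (rows = true class, columns = predicted class)
cm = np.array([[9, 0, 2],
               [0, 12, 0],
               [2, 1, 9]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()    # 30 correct out of 35

print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
print("accuracy:  %.4f" % accuracy)
```

For example, class 2 has 9 correct predictions out of 12 true members, which gives the recall of 0.75 shown in the report.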



Model 2: ARTIFICIAL NEURAL NETWORK (ANN)

In [20]:
from sklearn.neural_network import MLPClassifier 
# 2 hidden layers
# Each hidden layer has 100 neurons
nnmodel = MLPClassifier(hidden_layer_sizes = [100,100], alpha = 5.0, random_state = 42, solver = 'lbfgs')

In [21]:
from sklearn.pipeline import make_pipeline
# make_pipeline is imported to streamline the workflow
# The pipeline one-hot encodes the data before training:
# 1. Data pre-processing -> one-hot encoding
# 2. Apply the Neural Network (ANN)

nnpipe = make_pipeline(column_transform_ohe, nnmodel)

In [22]:
nnpipe.fit(X_train, y_train) #train the ANN

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['pref_length_stay',
                                                   'pref_poi',
                                                   'pref_travel_destination'])])),
                ('mlpclassifier',
                 MLPClassifier(alpha=5.0, hidden_layer_sizes=[100, 100],
                               random_state=42, solver='lbfgs'))])
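
The message printed during fitting means the lbfgs solver stopped at its iteration limit (the MLPClassifier default is max_iter=200) before meeting its convergence criterion. One common response is to raise max_iter and check whether the warning still appears. The sketch below does this on randomly generated toy data; the data, the value 1000 and the variable names are assumptions for illustration, and whether the survey data converges within that limit is not known.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: 60 rows of binary features, 3 classes derived from two of them
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(60, 6)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1]).astype(int).astype(str)

# Same settings as the notebook, with a higher iteration limit (assumed value)
nn = MLPClassifier(hidden_layer_sizes=[100, 100], alpha=5.0, random_state=42,
                   solver="lbfgs", max_iter=1000)

# Record warnings so the convergence status can be checked in code
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    nn.fit(X_toy, y_toy)

hit_iteration_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("Stopped at the iteration limit:", hit_iteration_limit)
print("Iterations used:", nn.n_iter_)
```

Changing max_iter can change the fitted weights, so the accuracy and confusion matrix would need to be recomputed after such a change.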

In [23]:
y_pred = nnpipe.predict(X_test)
y_pred # view predicted results

array(['1', '2', '2', '1', '1', '1', '0', '2', '0', '2', '0', '2', '1',
       '2', '2', '0', '1', '1', '0', '1', '0', '0', '2', '2', '0', '0',
       '1', '1', '1', '0', '1', '0', '2', '1', '2'], dtype='<U1')

In [24]:
# Model performance - ANN (Accuracy)
accuracy = nnpipe.score(X_test, y_test)
print("Accuracy: %.4f" % accuracy)

Accuracy: 0.8571


In [25]:
# Compute Confusion Matrix (Prediction of ANN model)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[ 9  0  2]
 [ 0 12  0]
 [ 2  1  9]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82        11
           1       0.92      1.00      0.96        12
           2       0.82      0.75      0.78        12

    accuracy                           0.86        35
   macro avg       0.85      0.86      0.85        35
weighted avg       0.85      0.86      0.85        35



Model 3: ENSEMBLE MODEL - RANDOM FOREST (RF)

In [26]:
from sklearn.ensemble import RandomForestClassifier
rfmodel = RandomForestClassifier(n_estimators=20, random_state=42)

In [27]:
from sklearn.pipeline import make_pipeline
# make_pipeline is imported to streamline the workflow
# The pipeline one-hot encodes the data before training:
# 1. Data pre-processing -> one-hot encoding
# 2. Apply the Random Forest (RF)

rfpipe = make_pipeline(column_transform_ohe, rfmodel)

In [28]:
rfpipe.fit(X_train, y_train) #train the RF

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['pref_length_stay',
                                                   'pref_poi',
                                                   'pref_travel_destination'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=20, random_state=42))])

In [29]:
y_pred = rfpipe.predict(X_test)
y_pred # view predicted results

array(['2', '2', '1', '1', '2', '1', '2', '2', '0', '2', '2', '1', '1',
       '2', '2', '0', '1', '1', '0', '1', '0', '0', '2', '2', '0', '2',
       '1', '1', '1', '1', '1', '2', '1', '1', '0'], dtype=object)

In [30]:
# Model performance - RF (Accuracy)
accuracy = rfpipe.score(X_test, y_test)
print("Accuracy: %.4f" % accuracy)

Accuracy: 0.7143


In [31]:
# Compute Confusion Matrix (Prediction of RF model)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[ 7  1  3]
 [ 0 10  2]
 [ 0  4  8]]
              precision    recall  f1-score   support

           0       1.00      0.64      0.78        11
           1       0.67      0.83      0.74        12
           2       0.62      0.67      0.64        12

    accuracy                           0.71        35
   macro avg       0.76      0.71      0.72        35
weighted avg       0.75      0.71      0.72        35



Experimentation: Cross Validation

In [32]:
# MODEL BUILDING (Separate X and Y)
Y = data_encoded["Cluster"]
X = data_encoded.drop("Cluster", axis = 1)

In [33]:
# Split the data
# 50% train & 50% test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.50, random_state = 42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  

(57, 12) (58, 12) (57,) (58,)


In [34]:
# ANN / MLP
from sklearn.neural_network import MLPClassifier 
model = MLPClassifier(hidden_layer_sizes = [100,100], alpha = 5.0, random_state = 42, solver = 'lbfgs')

In [35]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(column_transform_ohe, model)

In [36]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# Mean 5-fold cross-validation accuracy on the training half
cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()

0.740909090909091
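
Cross-validation is often combined with a hyperparameter search. As a minimal sketch, GridSearchCV can cross-validate a pipeline over several values of the MLP regularisation strength alpha. The toy data, the candidate alpha values, max_iter=1000 and the variable names below are assumptions for illustration, not settings taken from this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 60 rows, 3 balanced classes tied to the text column
stay_levels = ["Less than 3 days", "3 Days", "More than 3 days"]
toy_X = pd.DataFrame({
    "pref_length_stay": np.tile(stay_levels, 20),
    "gender": np.tile([0, 1], 30),
})
toy_y = pd.Series(np.tile(["0", "1", "2"], 20))

toy_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["pref_length_stay"]),
        remainder="passthrough",
    ),
    MLPClassifier(hidden_layer_sizes=[100, 100], random_state=42,
                  solver="lbfgs", max_iter=1000),
)

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"mlpclassifier__alpha": [0.1, 1.0, 5.0]}
search = GridSearchCV(toy_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(toy_X, toy_y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy: %.4f" % search.best_score_)
```

Because the encoder sits inside the pipeline, it is re-fitted on the training folds of each split, which avoids leaking information from the validation fold.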

In [37]:
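# Mean 5-fold cross-validation accuracy on the other half
# (new models are fitted on folds of X_test; this does not score a model trained on X_train)
cross_val_score(pipe, X_test, y_test, cv=5, scoring = 'accuracy').mean()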

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

0.8075757575757576
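
Cross-validating each half separately leaves only about a dozen validation rows per fold, so the two means above can differ through sampling noise alone. An alternative, sketched below under stated assumptions, is to cross-validate on all rows with shuffled stratified folds and to report the spread alongside the mean. The toy data, max_iter=1000 and the variable names are hypothetical stand-ins for the notebook's X, Y and pipe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 60 rows, 3 balanced classes tied to the text column
stay_levels = ["Less than 3 days", "3 Days", "More than 3 days"]
toy_X = pd.DataFrame({
    "pref_length_stay": np.tile(stay_levels, 20),
    "gender": np.tile([0, 1], 30),
})
toy_y = pd.Series(np.tile(["0", "1", "2"], 20))

toy_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["pref_length_stay"]),
        remainder="passthrough",
    ),
    MLPClassifier(hidden_layer_sizes=[100, 100], alpha=5.0, random_state=42,
                  solver="lbfgs", max_iter=1000),
)

# Shuffled, stratified folds keep the class balance similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(toy_pipe, toy_X, toy_y, cv=folds, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean: %.4f, Std: %.4f" % (scores.mean(), scores.std()))
```

Reporting the standard deviation makes it easier to judge whether a gap between two mean accuracies is meaningful.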