# Creating a model to predict hotel cancellation

## Importing the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

## Loading the raw data

In [2]:
raw_data = pd.read_csv('hotel_bookings.csv')

In [3]:
raw_data.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,Arrival_Date,Country_Name,Region,Region_Name
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,Transient,0.0,0,0,Check-Out,01/07/2015,01/07/2015,Portugal,Europe,Southern Europe
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,Transient,0.0,0,0,Check-Out,01/07/2015,01/07/2015,Portugal,Europe,Southern Europe
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,Transient,75.0,0,0,Check-Out,02/07/2015,01/07/2015,United Kingdom of Great Britain and Northern I...,Europe,Northern Europe
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,Transient,75.0,0,0,Check-Out,02/07/2015,01/07/2015,United Kingdom of Great Britain and Northern I...,Europe,Northern Europe
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,Transient,98.0,0,1,Check-Out,03/07/2015,01/07/2015,United Kingdom of Great Britain and Northern I...,Europe,Northern Europe


# Data Cleanse

### Establishing key factors for cancelled bookings?


#### This sections will look at each feature with the aim to establish relevance and correlation with target vablue, thus determining whether it would be a good predictor.

In [4]:
data = raw_data.copy()

In [5]:
# remove company field
data = data.drop('company', axis=1)

In [6]:
# let's remove outliers in leadtime by 
data.lead_time[(data.lead_time > data.lead_time.quantile(0.95))].count() / data.lead_time.count()

0.04907446184772594

In [7]:
# it appears that there are outliers situated around higher lead time, and it best to address this...
# some may decide to remove those outliers but these data will be kept here, by updating any value above 95th percentile
data.lead_time = np.where(data.lead_time > data.lead_time.quantile(0.95), data.lead_time.quantile(0.95),data.lead_time)
data.lead_time.describe(include='all')



count    119390.000000
mean        100.362065
std          96.587131
min           0.000000
25%          18.000000
50%          69.000000
75%         160.000000
max         320.000000
Name: lead_time, dtype: float64

In [8]:
# Given the fact that less than 1% require more than 1 parking space, it's worth simplifying this feature
# ...by grouping this feature into two options: those with a parking space requirement and those with none
data.required_car_parking_spaces = np.where(data.required_car_parking_spaces > 0,1,0)
data.groupby("required_car_parking_spaces")["required_car_parking_spaces"].value_counts()

required_car_parking_spaces  required_car_parking_spaces
0                            0                              111974
1                            1                                7416
Name: required_car_parking_spaces, dtype: int64

In [9]:
# let's remove fields that we don't need:
data = data.drop(['arrival_date_year','arrival_date_week_number','arrival_date_day_of_month','agent'], axis=1)

In [10]:
# Get a list of remaining/relevant numerical variables
[var for var in data.columns if data[var].dtypes!='object']

['is_canceled',
 'lead_time',
 'stays_in_weekend_nights',
 'stays_in_week_nights',
 'adults',
 'children',
 'babies',
 'is_repeated_guest',
 'previous_cancellations',
 'previous_bookings_not_canceled',
 'booking_changes',
 'days_in_waiting_list',
 'adr',
 'required_car_parking_spaces',
 'total_of_special_requests']

#### Preprocessing categorical data

Step 1: List variables and remove irrelevant ones.

Step 2: Create dummy variables.

In [11]:
#check for missing value again, in case we've missed any.
data.isna().sum()

hotel                                0
is_canceled                          0
lead_time                            0
arrival_date_month                   0
stays_in_weekend_nights              0
stays_in_week_nights                 0
adults                               0
children                             4
babies                               0
meal                                 0
country                            488
market_segment                       0
distribution_channel                 0
is_repeated_guest                    0
previous_cancellations               0
previous_bookings_not_canceled       0
reserved_room_type                   0
assigned_room_type                   0
booking_changes                      0
deposit_type                         0
days_in_waiting_list                 0
customer_type                        0
adr                                  0
required_car_parking_spaces          0
total_of_special_requests            0
reservation_status       

In [12]:
data.Region = np.where(data.Region.isna(),'Unspecified',data.Region)

In [13]:
data.children = np.where(data.children.isna(),0,data.children)

In [14]:
[var for var in data.columns if data[var].dtypes=='object']

['hotel',
 'arrival_date_month',
 'meal',
 'country',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'assigned_room_type',
 'deposit_type',
 'customer_type',
 'reservation_status',
 'reservation_status_date',
 'Arrival_Date',
 'Country_Name',
 'Region',
 'Region_Name']

In [15]:
data = data.drop(['arrival_date_month','distribution_channel','reservation_status','reservation_status_date','Arrival_Date','Country_Name','country','Region_Name'], axis=1)


In [16]:
#Get a list of remaining/relevant categorical variables
[var for var in data.columns if data[var].dtypes=='object']

['hotel',
 'meal',
 'market_segment',
 'reserved_room_type',
 'assigned_room_type',
 'deposit_type',
 'customer_type',
 'Region']

In [17]:
# create dummy variables for categorical data
data_inc_dummies = pd.get_dummies(data, drop_first=True)

# End of EDA and data cleanse

# Preprocessing

#### Preprocessing cleansed dataset
[356] / [458]

Step 1: Balance the dataset (between cancelled and not cancelled), to get a split of around 50% 

Step 2: Split inputs and targets

Step 3: Standardise (non dummies) inputs 

Step 4: Shuffle the data (inputs & outputs)

Step 5: Split the dataset into train, validation, and test

Step 6: Save the three datasets in *.npz (for neural network)

In [18]:
data_preprocessed = data_inc_dummies.copy()

In [19]:
with pd.option_context('display.max_rows', 15, 'display.max_columns', None): 
    display(data_preprocessed)

Unnamed: 0,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,hotel_Resort Hotel,meal_FB,meal_HB,meal_SC,meal_Undefined,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,market_segment_Undefined,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,reserved_room_type_P,assigned_room_type_B,assigned_room_type_C,assigned_room_type_D,assigned_room_type_E,assigned_room_type_F,assigned_room_type_G,assigned_room_type_H,assigned_room_type_I,assigned_room_type_K,assigned_room_type_L,assigned_room_type_P,deposit_type_Non Refund,deposit_type_Refundable,customer_type_Group,customer_type_Transient,customer_type_Transient-Party,Region_Americas,Region_Asia,Region_Europe,Region_Oceania,Region_Unspecified
0,0,320.0,0,0,2,0.0,0,0,0,0,3,0,0.00,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
1,0,320.0,0,0,2,0.0,0,0,0,0,4,0,0.00,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
2,0,7.0,0,1,1,0.0,0,0,0,0,0,0,75.00,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
3,0,13.0,0,1,1,0.0,0,0,0,0,0,0,75.00,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
4,0,14.0,0,2,2,0.0,0,0,0,0,0,0,98.00,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23.0,2,5,2,0.0,0,0,0,0,0,0,96.14,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
119386,0,102.0,2,5,3,0.0,0,0,0,0,0,0,225.43,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
119387,0,34.0,2,5,2,0.0,0,0,0,0,0,0,157.71,0,4,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
119388,0,109.0,2,5,2,0.0,0,0,0,0,0,0,104.40,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0


### Balance the dataset

###### The aim is to have a "balanced" dataset, therefore some input/target pairs would have to be removed.

In [20]:
# First split data between cancelled and non-cancelled
cancelled_data_all = data_preprocessed[data_preprocessed.is_canceled == 1]
confirmed_data_all = data_preprocessed[data_preprocessed.is_canceled == 0]

In [21]:
# Next, determine the difference between the two subsets
to_remove = cancelled_data_all.shape[0] - confirmed_data_all.shape[0]

In [22]:
# Then, remove some data in non-cancelled data, equal to the difference between the two subsets
confirmed_data_all = confirmed_data_all[:to_remove]

In [23]:
# Finally, it's time to merge the subsets into a balanced dataset
balanced_dataset = pd.concat([cancelled_data_all, confirmed_data_all]
                             #, sort=True
                             #, ignore_index=True
                            )

In [24]:
with pd.option_context('display.max_rows', 10, 'display.max_columns', None): 
    display(balanced_dataset)

Unnamed: 0,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,hotel_Resort Hotel,meal_FB,meal_HB,meal_SC,meal_Undefined,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,market_segment_Undefined,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,reserved_room_type_P,assigned_room_type_B,assigned_room_type_C,assigned_room_type_D,assigned_room_type_E,assigned_room_type_F,assigned_room_type_G,assigned_room_type_H,assigned_room_type_I,assigned_room_type_K,assigned_room_type_L,assigned_room_type_P,deposit_type_Non Refund,deposit_type_Refundable,customer_type_Group,customer_type_Transient,customer_type_Transient-Party,Region_Americas,Region_Asia,Region_Europe,Region_Oceania,Region_Unspecified
8,1,85.0,0,3,2,0.0,0,0,0,0,0,0,82.0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
9,1,75.0,0,3,2,0.0,0,0,0,0,0,0,105.5,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
10,1,23.0,0,4,2,0.0,0,0,0,0,0,0,123.0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
27,1,60.0,2,5,2,0.0,0,0,0,0,0,0,107.0,0,2,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
32,1,96.0,2,8,2,0.0,0,0,0,0,0,0,108.3,0,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88397,0,2.0,0,1,1,0.0,0,1,0,3,0,0,110.0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
88398,0,8.0,0,1,1,0.0,0,1,0,4,0,0,89.0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
88399,0,1.0,0,1,1,0.0,0,1,0,5,0,0,80.0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
88400,0,36.0,2,2,3,0.0,0,0,0,0,0,0,159.3,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0


In [25]:
# Let's confirm this is in fact the case:
balanced_dataset.is_canceled.sum()/balanced_dataset.is_canceled.shape[0]

0.5

###### Split the dataset between inputs and targets

In [26]:
# Let's get a list of all columns, in order to rearrange them
balanced_dataset.columns

Index(['is_canceled', 'lead_time', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes',
       'days_in_waiting_list', 'adr', 'required_car_parking_spaces',
       'total_of_special_requests', 'hotel_Resort Hotel', 'meal_FB', 'meal_HB',
       'meal_SC', 'meal_Undefined', 'market_segment_Complementary',
       'market_segment_Corporate', 'market_segment_Direct',
       'market_segment_Groups', 'market_segment_Offline TA/TO',
       'market_segment_Online TA', 'market_segment_Undefined',
       'reserved_room_type_B', 'reserved_room_type_C', 'reserved_room_type_D',
       'reserved_room_type_E', 'reserved_room_type_F', 'reserved_room_type_G',
       'reserved_room_type_H', 'reserved_room_type_L', 'reserved_room_type_P',
       'assigned_room_type_B', 'assigned_room_type_C', 'assigned_room_type_D',
       'assigned_room_type_E', 'assigned_

In [27]:
# Next, rearrange so that target variable is first in the list
balanced_dataset.columns = ['is_canceled', 'Region_Americas', 'Region_Asia', 'Region_Europe', 'Region_Oceania',
       'Region_Unspecified', 'adr', 'adults', 'assigned_room_type_B',
       'assigned_room_type_C', 'assigned_room_type_D', 'assigned_room_type_E',
       'assigned_room_type_F', 'assigned_room_type_G', 'assigned_room_type_H',
       'assigned_room_type_I', 'assigned_room_type_K', 'assigned_room_type_L',
       'assigned_room_type_P', 'babies', 'booking_changes', 'children',
       'customer_type_Group', 'customer_type_Transient',
       'customer_type_Transient-Party', 'days_in_waiting_list',
       'deposit_type_Non Refund', 'deposit_type_Refundable',
       'hotel_Resort Hotel', 'is_repeated_guest', 'lead_time',
       'market_segment_Complementary', 'market_segment_Corporate',
       'market_segment_Direct', 'market_segment_Groups',
       'market_segment_Offline TA/TO', 'market_segment_Online TA',
       'market_segment_Undefined', 'meal_FB', 'meal_HB', 'meal_SC',
       'meal_Undefined', 'previous_bookings_not_canceled',
       'previous_cancellations', 'required_car_parking_spaces',
       'reserved_room_type_B', 'reserved_room_type_C', 'reserved_room_type_D',
       'reserved_room_type_E', 'reserved_room_type_F', 'reserved_room_type_G',
       'reserved_room_type_H', 'reserved_room_type_L', 'reserved_room_type_P',
       'stays_in_week_nights', 'stays_in_weekend_nights',
       'total_of_special_requests']

In [28]:
# Split inputs and targets data
inputs_unscaled = balanced_dataset.iloc[:,1:]
targets         = balanced_dataset.iloc[:,0]

### Standardize the inputs


Scale non-dummy features separate then merge arrays

in future, scale prior to adding dummy variables

In [29]:
# check what are all columns that we've got
inputs_unscaled.columns.values

array(['Region_Americas', 'Region_Asia', 'Region_Europe',
       'Region_Oceania', 'Region_Unspecified', 'adr', 'adults',
       'assigned_room_type_B', 'assigned_room_type_C',
       'assigned_room_type_D', 'assigned_room_type_E',
       'assigned_room_type_F', 'assigned_room_type_G',
       'assigned_room_type_H', 'assigned_room_type_I',
       'assigned_room_type_K', 'assigned_room_type_L',
       'assigned_room_type_P', 'babies', 'booking_changes', 'children',
       'customer_type_Group', 'customer_type_Transient',
       'customer_type_Transient-Party', 'days_in_waiting_list',
       'deposit_type_Non Refund', 'deposit_type_Refundable',
       'hotel_Resort Hotel', 'is_repeated_guest', 'lead_time',
       'market_segment_Complementary', 'market_segment_Corporate',
       'market_segment_Direct', 'market_segment_Groups',
       'market_segment_Offline TA/TO', 'market_segment_Online TA',
       'market_segment_Undefined', 'meal_FB', 'meal_HB', 'meal_SC',
       'meal_Undefined', 'p

In [30]:
# choose the columns to scale/omit:
    
# select the columns to omit (All dummy variables)
columns_to_omit = ['Region_Americas', 'Region_Asia', 'Region_Europe',
       'Region_Oceania', 'Region_Unspecified',
       'assigned_room_type_B', 'assigned_room_type_C',
       'assigned_room_type_D', 'assigned_room_type_E',
       'assigned_room_type_F', 'assigned_room_type_G',
       'assigned_room_type_H', 'assigned_room_type_I',
       'assigned_room_type_K', 'assigned_room_type_L',
       'assigned_room_type_P',
       'customer_type_Group', 'customer_type_Transient',
       'customer_type_Transient-Party',
       'deposit_type_Non Refund', 'deposit_type_Refundable',
       'hotel_Resort Hotel', 'is_repeated_guest',
       'market_segment_Complementary', 'market_segment_Corporate',
       'market_segment_Direct', 'market_segment_Groups',
       'market_segment_Offline TA/TO', 'market_segment_Online TA',
       'market_segment_Undefined', 'meal_FB', 'meal_HB', 'meal_SC',
       'meal_Undefined', 'required_car_parking_spaces',
       'reserved_room_type_B', 'reserved_room_type_C',
       'reserved_room_type_D', 'reserved_room_type_E',
       'reserved_room_type_F', 'reserved_room_type_G',
       'reserved_room_type_H', 'reserved_room_type_L',
       'reserved_room_type_P']


In [31]:
# create the columns to scale, based on the columns to omit
# use list comprehension to iterate over the list
columns_to_scale = [x for x in inputs_unscaled.columns.values if x not in columns_to_omit]
columns_to_scale

['adr',
 'adults',
 'babies',
 'booking_changes',
 'children',
 'days_in_waiting_list',
 'lead_time',
 'previous_bookings_not_canceled',
 'previous_cancellations',
 'stays_in_week_nights',
 'stays_in_weekend_nights',
 'total_of_special_requests']

In [32]:
# Let's separate features to: 'to scale' (or numerical variables) and 'to omit' or (dummy variable)
inputs_to_omit = inputs_unscaled[columns_to_omit]

In [33]:
inputs_to_scale = inputs_unscaled[columns_to_scale]

In [34]:
inputs_to_scale

Unnamed: 0,adr,adults,babies,booking_changes,children,days_in_waiting_list,lead_time,previous_bookings_not_canceled,previous_cancellations,stays_in_week_nights,stays_in_weekend_nights,total_of_special_requests
8,0,0,0,0,0,1,0,0,0,1,0,0
9,0,0,0,0,0,0,0,0,0,1,0,0
10,0,0,0,0,0,1,1,0,0,1,0,0
27,0,0,0,0,0,1,1,0,0,1,0,0
32,0,0,0,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
88397,0,1,0,0,1,0,0,0,0,1,0,0
88398,0,1,0,0,1,0,0,0,0,1,0,0
88399,0,1,0,0,1,0,0,0,0,1,0,0
88400,0,0,0,0,0,1,0,0,0,1,0,0


In [35]:
# Now to standardise the inputs, since data of different magnitude (scale) can be biased towards high values.
# StandardScaler will be used here, to put all inputs in similar magnitude.
from sklearn.preprocessing import StandardScaler

# documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [36]:
scaler = StandardScaler()

In [37]:
# Scaling just numerical features
scaler.fit(inputs_to_scale)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [38]:
scaled_inputs = scaler.transform(inputs_to_scale)

In [39]:
scaled_inputs

array([[-0.08233262, -0.19251777, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487],
       [-0.08233262, -0.19251777, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487],
       [-0.08233262, -0.19251777, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487],
       ...,
       [-0.08233262,  5.19432562, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487],
       [-0.08233262, -0.19251777, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487],
       [-0.08233262, -0.19251777, -0.11573171, ...,  0.30445454,
        -0.05440239, -0.12837487]])

In [40]:
scaled_inputs.shape

(88448, 12)

In [41]:
untouched_inputs = inputs_to_omit.to_numpy()

In [42]:
untouched_inputs.shape

(88448, 44)

In [43]:
inputs = np.concatenate((scaled_inputs,untouched_inputs), axis = 1)

In [44]:
inputs.shape

(88448, 56)

## Split the data into train & test and shuffle

### Import the relevant module

In [45]:
# import train_test_split so we can split our data into train and test
from sklearn.model_selection import train_test_split

### Split

## Addtional steps for NN

### Splitting data for NN - Train, Validation, and Test

In [46]:
train_test_split(inputs, targets)

[array([[-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        [-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        [-0.08233262,  5.19432562, -0.11573171, ...,  1.        ,
          0.        ,  0.        ],
        ...,
        [-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        [-0.08233262, -0.19251777, -0.11573171, ...,  1.        ,
          0.        ,  0.        ],
        [-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ]]),
 array([[-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        [-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        [-0.08233262, -0.19251777, -0.11573171, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [-0.08233262, -0.19251777, -0.11573171, ...,  

In [47]:
# train/test split
x_train, x_test, y_train, y_test = train_test_split(inputs, targets, #train_size = 0.8, 
                                                                            test_size = 0.1, random_state = 20)

In [48]:
# train/validation split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, #train_size = 0.8, 
                                                                            test_size = 0.1, random_state = 20)

In [49]:
# check the shape of the train inputs and targets
print (x_train.shape, y_train.shape)

(71642, 56) (71642,)


In [50]:
# check the shape of the test inputs and targets
print (x_val.shape, y_val.shape)

(7961, 56) (7961,)


In [51]:
# check the shape of the test inputs and targets
print (x_test.shape, y_test.shape)

(8845, 56) (8845,)


In [56]:
x_train, x_val, x_test = x_train.astype(float), x_val.astype(float), x_test.astype(float)

In [67]:
# Save the three datasets in *.npz.

np.savez('Hotel_data_train', inputs=x_train, targets=y_train)
np.savez('Hotel_data_validation', inputs=x_val, targets=y_val)
np.savez('Hotel_data_test', inputs=x_test, targets=y_test)

In [68]:
npz = np.load('Hotel_data_train.npz')
train_inputs, train_targets = npz['inputs'].astype(np.float), npz['targets'].astype(np.int)

npz = np.load('Hotel_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(np.float), npz['targets'].astype(np.int)

npz = np.load('Hotel_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float), npz['targets'].astype(np.int)

### End of Preprocessing

# Applying Bi-Predict Model

## 2. Neural Network

Outline, optimizers, loss, early stopping and training

In [61]:
import tensorflow as tf

In [83]:
# Set the input and output sizes
input_size = 56
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 500 
    
# Defining the model for the problem 
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 3rd hidden layer 
    tf.keras.layers.Dense(hidden_layer_size, activation='sigmoid'), # 4th hidden layer 
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])


### Choosing the optimizer and the loss function:

# preferred optimizer to use is Adaptive Moment Estimation (adam) , as it's one of the best optimiser available.
# the loss function, 
# and the metrics we are interested in obtaining at each iteration
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


### Training the model:

# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# setting an early stopping mechanism, with patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
# note that this time the train, validation and test data are not iterable
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when a task is completed
          # task here is to check if val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # getting essential information about the training process
          )  

Train on 71642 samples, validate on 7961 samples
Epoch 1/100
71642/71642 - 16s - loss: 0.4826 - accuracy: 0.7647 - val_loss: 0.3819 - val_accuracy: 0.8175
Epoch 2/100
71642/71642 - 16s - loss: 0.3745 - accuracy: 0.8226 - val_loss: 0.3599 - val_accuracy: 0.8245
Epoch 3/100
71642/71642 - 16s - loss: 0.3547 - accuracy: 0.8302 - val_loss: 0.3671 - val_accuracy: 0.8177
Epoch 4/100
71642/71642 - 16s - loss: 0.3454 - accuracy: 0.8352 - val_loss: 0.3424 - val_accuracy: 0.8311
Epoch 5/100
71642/71642 - 16s - loss: 0.3383 - accuracy: 0.8388 - val_loss: 0.3586 - val_accuracy: 0.8255
Epoch 6/100
71642/71642 - 15s - loss: 0.3312 - accuracy: 0.8424 - val_loss: 0.3345 - val_accuracy: 0.8400
Epoch 7/100
71642/71642 - 16s - loss: 0.3256 - accuracy: 0.8462 - val_loss: 0.3282 - val_accuracy: 0.8473
Epoch 8/100
71642/71642 - 13s - loss: 0.3213 - accuracy: 0.8481 - val_loss: 0.3388 - val_accuracy: 0.8396
Epoch 9/100
71642/71642 - 13s - loss: 0.3161 - accuracy: 0.8522 - val_loss: 0.3278 - val_accuracy: 0.84

<tensorflow.python.keras.callbacks.History at 0x1a5997ae50>

After fine tuning the hyperparameters of the model, the validation accuracy was improved to 85 per cent.  

 

## Test the model

The last step here is to test the final prediction power of the model by fitting it on the test dataset that the algorithm isn't familiar. As adjusting the hyperparameters overfits the validation dataset. 

Consequently, this is the absolute final instance, because adjusting the model afterwards will start overfitting the test dataset and defeat its purpose.

In [None]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

In [None]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

The NN model provides test of 85 per cent.