# Use PyTorch to Predict Hotel Cancellations
- Based on Codecademy project: [View Project Page](https://www.codecademy.com/paths/build-deep-learning-models-pytorch/tracks/pytorch-sp-pytorch-for-classification/modules/pytorch-sp-mod-pytorch-for-classification/projects/predicting-hotel-booking-cancellations)

In [1]:
import pandas as pd
import numpy as np

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

In [81]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

## Import and Explore

The file `hotel_bookings.csv` contains a subset of a [real-world dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) with reservation and cancellation data for a resort hotel.

This project will build and train a neural network to predict if a customer will cancel their hotel booking reservation based on data including the booking dates, average daily cost, number of adults/children/babies, duration of stay, and so forth.

In [34]:
hotels = pd.read_csv("hotel_bookings.csv")

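hotels.head()  # first 5 records (not displayed: only the last expression in a cell is shown)
hotels.drop(["hotel"], axis=1, inplace=True)  # remove the `hotel` column
hotels.dropna()  # display rows without missing values; the result is not assigned, so `hotels` is unchanged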

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
2392,0,6,2015,October,42,11,2,0,2,0.0,...,No Deposit,240.0,113.0,0,Transient,82.0,1,1,Check-Out,2015-10-13
2697,0,24,2015,October,44,26,7,15,1,0.0,...,No Deposit,185.0,281.0,0,Transient-Party,52.2,0,0,Check-Out,2015-11-17
2867,0,24,2015,November,45,3,0,3,2,0.0,...,No Deposit,334.0,281.0,0,Transient-Party,48.0,0,0,Check-Out,2015-11-06
2877,0,24,2015,November,45,3,2,10,1,0.0,...,No Deposit,328.0,281.0,0,Transient-Party,40.0,0,0,Check-Out,2015-11-15
2878,0,24,2015,November,45,3,3,10,2,0.0,...,No Deposit,326.0,281.0,0,Transient-Party,48.0,0,0,Check-Out,2015-11-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112499,0,13,2017,May,21,24,0,1,2,0.0,...,No Deposit,9.0,478.0,0,Transient-Party,150.0,0,1,Check-Out,2017-05-25
113046,0,13,2017,May,22,29,1,3,1,0.0,...,No Deposit,290.0,148.0,0,Transient,95.0,0,0,Check-Out,2017-06-02
113082,0,13,2017,May,22,29,1,3,2,0.0,...,No Deposit,290.0,148.0,0,Transient,110.0,0,0,Check-Out,2017-06-02
113627,0,210,2017,June,23,9,0,1,2,0.0,...,No Deposit,14.0,229.0,0,Transient,135.0,0,0,Check-Out,2017-06-10


In [35]:
hotels.info() # look at summary of columns, non-null counts, and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   is_canceled                     119390 non-null  int64  
 1   lead_time                       119390 non-null  int64  
 2   arrival_date_year               119390 non-null  int64  
 3   arrival_date_month              119390 non-null  object 
 4   arrival_date_week_number        119390 non-null  int64  
 5   arrival_date_day_of_month       119390 non-null  int64  
 6   stays_in_weekend_nights         119390 non-null  int64  
 7   stays_in_week_nights            119390 non-null  int64  
 8   adults                          119390 non-null  int64  
 9   children                        119386 non-null  float64
 10  babies                          119390 non-null  int64  
 11  meal                            119390 non-null  object 
 12  country         

### Exploring columns in more detail

The `is_canceled` column contains the data to be predicted. 

In [36]:
print("Number of overall cancellations: \n", hotels["is_canceled"].value_counts())

print("\n% of overall cancellations: \n", hotels["is_canceled"].value_counts(normalize=True))

Number of overall cancellations: 
 0    75166
1    44224
Name: is_canceled, dtype: int64

% of overall cancellations: 
 0    0.629584
1    0.370416
Name: is_canceled, dtype: float64


The `reservation_status` column tells us if the booking was canceled while also telling us if the customer was a no-show.

We need to exclude this column from the training features; otherwise the label is _leaked_ to the model and the measured performance will be unrealistically high.

In [37]:
print("Counts by reservation status: \n", hotels["reservation_status"].value_counts())

print("\nShare by reservation status: \n", hotels["reservation_status"].value_counts(normalize=True))

Counts by reservation status: 
 Check-Out    75166
Canceled     43017
No-Show       1207
Name: reservation_status, dtype: int64

Share by reservation status: 
 Check-Out    0.629584
Canceled     0.360307
No-Show      0.010110
Name: reservation_status, dtype: float64
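The counts above already hint at the leak: 43017 canceled plus 1207 no-show bookings equals the 44224 rows with `is_canceled == 1`. A minimal sketch of how such a check could be automated, using a small hypothetical frame (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical rows) mirroring the two columns discussed above.
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled":        [0, 1, 1, 0, 1],
})

# Cross-tabulate the candidate feature against the label.
leak_table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(leak_table)

# If every status maps to exactly one label value, the column gives the label away.
labels_per_status = toy.groupby("reservation_status")["is_canceled"].nunique()
is_leaky = bool((labels_per_status == 1).all())
print("Leaks the label:", is_leaky)
```

Running the same two steps on `hotels` would be one way to confirm that `reservation_status` must stay out of the feature list.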


### Confounding Factors

Does the timing (i.e. month) affect cancellations?

In [38]:
hotels.groupby("arrival_date_month")["is_canceled"].mean().sort_values(ascending=False)

arrival_date_month
June         0.414572
April        0.407972
May          0.396658
September    0.391702
October      0.380466
August       0.377531
July         0.374536
December     0.349705
February     0.334160
March        0.321523
November     0.312334
January      0.304773
Name: is_canceled, dtype: float64

## Data Cleaning and Preparation

Investigate the categorical data

In [39]:
object_columns = hotels.select_dtypes(include=['object']).columns.to_list()
object_columns = [col for col in object_columns if col != "reservation_status"]

hotels[object_columns].head()

Unnamed: 0,arrival_date_month,meal,country,market_segment,distribution_channel,reserved_room_type,assigned_room_type,deposit_type,customer_type,reservation_status_date
0,July,BB,PRT,Direct,Direct,C,C,No Deposit,Transient,2015-07-01
1,July,BB,PRT,Direct,Direct,C,C,No Deposit,Transient,2015-07-01
2,July,BB,GBR,Direct,Direct,A,C,No Deposit,Transient,2015-07-02
3,July,BB,GBR,Corporate,Corporate,A,A,No Deposit,Transient,2015-07-02
4,July,BB,GBR,Online TA,TA/TO,A,A,No Deposit,Transient,2015-07-03


Drop columns with many missing values or columns that are irrelevant to the prediction task.

These columns were chosen as they may be too specific and lead to overfitting (e.g. week number or day of month of arrival), or there are many missing values (e.g. agent or company). 

In [40]:
cols_to_remove = ["arrival_date_year", "arrival_date_week_number", "arrival_date_day_of_month", "country", "agent", "company", "reservation_status_date"]
hotels.drop(cols_to_remove, axis=1, inplace=True)
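One way to back up the "many missing values" criterion is to compute the fraction of missing entries per column. The sketch below uses a hypothetical toy frame, and the `0.4` cutoff is an arbitrary illustration rather than a value used in this project:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns.
toy = pd.DataFrame({
    "lead_time": [6, 24, 24, 210],
    "agent":     [240.0, np.nan, np.nan, 14.0],
    "company":   [np.nan, np.nan, np.nan, 229.0],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns above the chosen cutoff (0.4 is an assumption for illustration).
mostly_missing = missing_fraction[missing_fraction > 0.4].index.to_list()
print(mostly_missing)
```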

Encode the `meal` column with a meaningful order (# of meals booked) using the following scheme:

- `Undefined` and `SC` to `0`
- `BB` to `1`
- `HB` to `2`
- `FB` to `3` 

In [41]:
hotels["meal"] = hotels["meal"].replace({"Undefined": 0, "SC": 0, "BB": 1, "HB": 2, "FB": 3})

Prepare the rest of the categorical columns using one-hot encoding. 

In [42]:
one_hot_columns = ['arrival_date_month', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'market_segment']

hotels = pd.get_dummies(hotels, columns=one_hot_columns, dtype=int)

hotels.head()

Unnamed: 0,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,is_repeated_guest,previous_cancellations,...,customer_type_Transient,customer_type_Transient-Party,market_segment_Aviation,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,market_segment_Undefined
0,0,342,0,0,2,0.0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
1,0,737,0,0,2,0.0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
2,0,7,0,1,1,0.0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
3,0,13,0,1,1,0.0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
4,0,14,0,2,2,0.0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0


## Create Training and Testing Sets

In [72]:
hotels = hotels.dropna() 

not_train_cols = ["is_canceled", "reservation_status"]
train_features = [col for col in hotels.columns if col not in not_train_cols]

Using the list of training features in `train_features`, create `X` and `y` tensors:

- `X` contains the data values from the `train_features` columns
- `y` contains the binary labels in the `is_canceled` column in `hotels`

In [73]:
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels["is_canceled"].values, dtype=torch.float).view(-1,1)

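print("\nChecking for wonky values")
print(torch.isnan(X).any())  # any NaN in the features?
print(torch.isinf(X).any())  # any +/-inf in the features?
print(X.max(), X.min())
print(torch.isnan(y).any())  # any NaN in the labels?
print(torch.isinf(y).any())  # any +/-inf in the labels?
print(y.max(), y.min())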


Checking for wonky values
tensor(False)
tensor(False)
tensor(5400.) tensor(-6.3800)
tensor(False)
tensor(False)
tensor(1.) tensor(0.)
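The same sanity check can be collapsed into a single call with `torch.isfinite`, which is false for NaN and for both infinities. A minimal sketch, where `has_non_finite` is a hypothetical helper name and the tensors are toy values:

```python
import torch

def has_non_finite(t):
    """Return True if the tensor contains NaN, +inf, or -inf."""
    return bool((~torch.isfinite(t)).any())

# Toy tensors (illustrative values only).
clean = torch.tensor([[1.0, -6.38], [5400.0, 0.0]])
dirty = torch.tensor([[1.0, float("nan")], [float("inf"), 0.0]])

print(has_non_finite(clean))
print(has_non_finite(dirty))
```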


Split our data contained in `X` and `y` into training and testing sets (80/20).

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)

Training Shape: torch.Size([95508, 69])
Testing Shape: torch.Size([23878, 69])


## Train a Neural Network for Binary Classification

Build the neural network architecture using `nn.Sequential` with the following:
- input layer with nodes equal to the number of training features
- first hidden layer with `36` nodes and a ReLU activation
- second hidden layer with `18` nodes and a ReLU activation
- output layer with `1` node and a Sigmoid activation

In [75]:
torch.manual_seed(42)

# build model
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 36),
    nn.ReLU(),
    nn.Linear(36, 18),
    nn.ReLU(), 
    nn.Linear(18, 1), 
    nn.Sigmoid()
)
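The size of this network can be worked out by hand: each `nn.Linear` layer holds `n_in * n_out` weights plus `n_out` biases. The sketch below assumes `69` input features, as in the training shape printed earlier; `linear_params` is a hypothetical helper name.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the width of each subsequent layer.
layer_sizes = [69, 36, 18, 1]

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [2520, 666, 19] 3205
```

With roughly 3,200 trainable parameters and about 95,000 training rows, the model is small relative to the data.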

In [76]:
loss = nn.BCELoss() # define loss with BCE
optimizer = optim.Adam(model.parameters(), lr=0.005) # define optimizer
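For intuition about what binary cross-entropy measures, here is a plain-Python sketch of the mean of `-[y*log(p) + (1-y)*log(1-p)]`. It is an illustration of the formula, not PyTorch's implementation, and the probabilities are made-up values:

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

labels = [1, 0]
confident_right = binary_cross_entropy([0.9, 0.1], labels)
unsure = binary_cross_entropy([0.5, 0.5], labels)
confident_wrong = binary_cross_entropy([0.1, 0.9], labels)

print(f"{confident_right:.4f} {unsure:.4f} {confident_wrong:.4f}")
```

The loss grows as predictions move away from the true labels, and a model that always outputs `0.5` scores `ln(2)`, about `0.693`.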

Train the neural network for `1000` epochs. Keep track of the training performance by printing out the binary cross-entropy loss and accuracy score every `100` epochs. Before calculating accuracy, convert the model's predicted probabilities to binary labels (as integers) using `0.5` as the threshold.

In [78]:
n_epochs = 1000

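for i in range(n_epochs):
    predictions = model(X_train)          # forward pass: predicted probabilities
    BCELoss = loss(predictions, y_train)  # binary cross-entropy on the full training set
    BCELoss.backward()                    # backpropagate gradients
    optimizer.step()                      # update the weights
    optimizer.zero_grad()                 # clear gradients before the next epoch

    if (i+1) % 100 == 0:
        predicted_labels = (predictions >= 0.5).int()  # threshold probabilities at 0.5
        accuracy = accuracy_score(y_train, predicted_labels)
        # accuracy_score returns a float, so no .item() call is needed
        print(f"Epoch {i+1}/{n_epochs}: BCELoss: {BCELoss.item():.4f}, Accuracy: {accuracy:.4f}")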

Epoch 100/1000: BCELoss: 0.4347, Accuracy: 0.8091
Epoch 200/1000: BCELoss: 0.4066, Accuracy: 0.8164
Epoch 300/1000: BCELoss: 0.3966, Accuracy: 0.8184
Epoch 400/1000: BCELoss: 0.3900, Accuracy: 0.8189
Epoch 500/1000: BCELoss: 0.3869, Accuracy: 0.8187
Epoch 600/1000: BCELoss: 0.3808, Accuracy: 0.8213
Epoch 700/1000: BCELoss: 0.3792, Accuracy: 0.8222
Epoch 800/1000: BCELoss: 0.3739, Accuracy: 0.8229
Epoch 900/1000: BCELoss: 0.3730, Accuracy: 0.8224
Epoch 1000/1000: BCELoss: 0.3793, Accuracy: 0.8200


Evaluate the trained neural network on the testing set

In [83]:
model.eval() # set model to evaluation mode
with torch.no_grad():
    test_predictions = model(X_test)
    test_predicted_labels = (test_predictions >= 0.5).int()

In [88]:
accuracy = accuracy_score(y_test, test_predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

class_report = classification_report(y_test, test_predicted_labels)
print("Classification Report:\n", class_report)

Accuracy: 0.8159
Classification Report:
               precision    recall  f1-score   support

         0.0       0.85      0.86      0.85     14973
         1.0       0.76      0.73      0.75      8905

    accuracy                           0.82     23878
   macro avg       0.80      0.80      0.80     23878
weighted avg       0.81      0.82      0.82     23878
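The per-class columns in the report follow from counts of true positives, false positives, and false negatives. A minimal sketch of the arithmetic, using hypothetical counts rather than the actual confusion matrix of this model:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the "canceled" class.
p, r, f = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For the cancellation task, recall on class `1.0` is the share of real cancellations the model catches, which is often the number a hotel cares about most.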



## Train a Neural Network for Multiclass Classification

Predict customers who **no-showed** within the `reservation_status` column.

If a hotel can accurately predict no-shows, it can reach out ahead of time to customers who are at high risk of not showing up for their reservation.

In [89]:
hotels["reservation_status"] = hotels["reservation_status"].replace({"Check-Out":2, "Canceled":1, "No-Show":0})

Using the same list of training features in `train_features`, create the `X` and `y` tensors where:

- `X` contains the data values from the `train_features` columns
- `y` contains the multiclass data values in the `reservation_status` column

In [91]:
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels["reservation_status"].values, dtype=torch.long)

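print("\nChecking for wonky values")
print(torch.isnan(X).any())  # any NaN in the features?
print(torch.isinf(X).any())  # any +/-inf in the features?
print(X.max(), X.min())
print(torch.isnan(y).any())  # any NaN in the labels?
print(torch.isinf(y).any())  # any +/-inf in the labels?
print(y.max(), y.min())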


Checking for wonky values
tensor(False)
tensor(False)
tensor(5400.) tensor(-6.3800)
tensor(False)
tensor(False)
tensor(2) tensor(0)


In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)

Training Shape: torch.Size([95508, 69])
Testing Shape: torch.Size([23878, 69])


Construct the multiclass neural network with the following architecture:

- input layer with `69` nodes (equal to the number of training features)
- first hidden layer with `69` nodes and a ReLU activation
- second hidden layer with `36` nodes and a ReLU activation
- final output layer with `3` nodes corresponding to each of the categories in `reservation_status`

In [93]:
torch.manual_seed(42)

multiclass_model = nn.Sequential(
    nn.Linear(X_train.shape[1], X_train.shape[1]),
    nn.ReLU(),
    nn.Linear(X_train.shape[1], 36),
    nn.ReLU(),
    nn.Linear(36, 3)
)

In [97]:
loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(multiclass_model.parameters(), lr=0.01)

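Train the neural network for `500` epochs. Keep track of the training performance by printing out the cross-entropy loss and accuracy score every `100` epochs. The model has no softmax layer, so its outputs are raw logits; convert them to labels using the `torch.argmax()` function. The value to log is the cross-entropy loss `CELoss`: the constant `0.3793` in the recorded output below is the final loss left over from the binary model, so only the accuracy column reflects this training run.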

In [98]:
n_epochs = 500

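for i in range(n_epochs):
    predictions = multiclass_model(X_train)  # forward pass: raw logits, one per class
    CELoss = loss(predictions, y_train)      # cross-entropy on the full training set
    CELoss.backward()                        # backpropagate gradients
    optimizer.step()                         # update the weights
    optimizer.zero_grad()                    # clear gradients before the next epoch

    if (i+1) % 100 == 0:
        predicted_labels = torch.argmax(predictions, dim=1)  # most likely class per row
        accuracy = accuracy_score(y_train, predicted_labels)
        # log CELoss from this loop, not the stale BCELoss from the binary model
        print(f"Epoch {i+1}/{n_epochs}: CELoss: {CELoss.item():.4f}, Accuracy: {accuracy:.4f}")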

Epoch 100/500: BCELoss: 0.3793, Accuracy: 0.8115
Epoch 200/500: BCELoss: 0.3793, Accuracy: 0.8199
Epoch 300/500: BCELoss: 0.3793, Accuracy: 0.8228
Epoch 400/500: BCELoss: 0.3793, Accuracy: 0.8231
Epoch 500/500: BCELoss: 0.3793, Accuracy: 0.8233
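The training loop applies `torch.argmax` directly to the model outputs. This works because `nn.CrossEntropyLoss` expects raw logits, and softmax preserves their ordering, so the largest logit is also the largest probability. A plain-Python sketch with hypothetical scores (`softmax` and `argmax` here are illustrative helpers, not the PyTorch functions):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical scores for classes 0 (No-Show), 1 (Canceled), 2 (Check-Out).
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)

print(argmax(logits), argmax(probs))
```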


Evaluate the trained neural network on the testing set

In [99]:
multiclass_model.eval()
with torch.no_grad():
    multiclass_predictions = multiclass_model(X_test)
    multiclass_predicted_labels = torch.argmax(multiclass_predictions, dim=1)

In [100]:
accuracy = accuracy_score(y_test, multiclass_predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

class_report = classification_report(y_test, multiclass_predicted_labels)
print("Classification Report:\n", class_report)

Accuracy: 0.8172
Classification Report:
               precision    recall  f1-score   support

           0       0.40      0.01      0.02       232
           1       0.87      0.62      0.72      8673
           2       0.80      0.95      0.87     14973

    accuracy                           0.82     23878
   macro avg       0.69      0.52      0.53     23878
weighted avg       0.82      0.82      0.81     23878
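The gap between the `macro avg` and `weighted avg` rows comes from how the rare no-show class is counted. The sketch below recomputes both averages for the precision column from the rounded values in the report above, so the results only match to two decimals:

```python
# Per-class precision and support, copied from the report above (classes 0, 1, 2).
precision = [0.40, 0.87, 0.80]
support = [232, 8673, 14973]

# Macro average: every class counts equally, regardless of size.
macro = sum(precision) / len(precision)

# Weighted average: each class counts in proportion to its support.
weighted = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(round(macro, 2), round(weighted, 2))  # 0.69 0.82
```

Because class `0` has only 232 of the 23878 test rows, its weak scores barely move the weighted average but pull the macro average down noticeably.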

