# Hotel Booking Cancellation Prediction

!["Image of hotel resort in the mountains with a swimming pool in the foreground"](images/hotel-pool.jpg)

## Overview

This project aims to develop a predictive model that forecasts hotel booking cancellations from historical data. The model is trained on [this dataset](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset) obtained from Kaggle, which contains a large number of hotel reservations made by customers. The primary objective is to build a machine learning model that accurately predicts the likelihood of a hotel booking being cancelled. This information is valuable to hotels because it supports informed decisions about inventory management, staff scheduling, and revenue optimization.

The project will involve several steps, including data cleaning, exploratory data analysis, feature engineering, model selection, and evaluation. The performance of the model will be measured using metrics such as accuracy, precision, recall, and F1-score.

Throughout this project, we will also explore the relationships between different variables and their impact on booking cancellations. This will help us to gain insights into the factors that contribute to cancellations and to identify potential areas for improvement. Overall, this project has the potential to provide valuable insights and practical applications for the hospitality industry. By developing a predictive model that can accurately forecast booking cancellations, hotels can better manage their resources, improve customer satisfaction, and increase revenue.

## Business Understanding

The hospitality industry is a highly competitive market, with hotels constantly striving to optimize their revenue and improve customer satisfaction. One of the major challenges faced by hotels is managing cancellations, which can result in lost revenue and reduced occupancy rates.

According to a [recent study](https://www.hotelmanagement.net/tech/study-cancelation-rate-at-40-as-otas-push-free-change-policy), average hotel cancellations have risen to at least 40% of overall bookings. This highlights the need for a predictive model that can accurately forecast booking cancellations. With such a model, hotels can take proactive measures to minimize the impact of cancellations on their revenue. For example, they can offer discounts or special promotions to customers who are likely to cancel, or they can allocate resources more efficiently by adjusting staff scheduling and inventory management to the predicted cancellation rates.

Overall, the business value of this project lies in its ability to help hotels improve their revenue optimization strategies, increase customer satisfaction, and reduce the financial impact of booking cancellations, which in turn gives them an edge in a highly competitive industry.

## Data Understanding

To achieve this objective, I use the Hotel Reservations [Dataset](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset) obtained from Kaggle. The dataset contains information about hotel bookings made by customers, including the number of adults and children, the type of meal plan, the requirement for a car parking space, the type of room reserved, the lead time, the arrival date, the market segment type, and whether the booking was cancelled.

The dataset consists of 36,275 rows and 19 columns, with each row representing a unique booking. The target variable is the `booking_status` column, which records whether the booking was cancelled. It is stored as the strings `Canceled` and `Not_Canceled`, which are encoded as 1 and 0 respectively during data preparation.

The dataset includes the following features:
- Booking_ID: unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of Children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Day of the month of the arrival date
- market_segment_type: Market segment designation.
- repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
- no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
- booking_status: Flag indicating if the booking was canceled or not.

In [1]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

sns.set_theme(style="whitegrid")
%matplotlib inline

### Reading the data

In [2]:
# Load data
df = pd.read_csv('data/Hotel-Reservations.csv')
df.head(2)

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled


### Tidying the data
1. Check data types and determine which features are numerical and which are categorical.
2. Check for null values.
3. Check for duplicate values.
4. Remove unnecessary columns and missing values.

I check for null values in the dataset. There are none.

In [3]:
# Check for missing values
df.isna().sum()

Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

I then check for duplicates. There are none. 

In [4]:
# Check for duplicates
df.duplicated().sum()

0

In [5]:
# check the data types
df.dtypes

Booking_ID                               object
no_of_adults                              int64
no_of_children                            int64
no_of_weekend_nights                      int64
no_of_week_nights                         int64
type_of_meal_plan                        object
required_car_parking_space                int64
room_type_reserved                       object
lead_time                                 int64
arrival_year                              int64
arrival_month                             int64
arrival_date                              int64
market_segment_type                      object
repeated_guest                            int64
no_of_previous_cancellations              int64
no_of_previous_bookings_not_canceled      int64
avg_price_per_room                      float64
no_of_special_requests                    int64
booking_status                           object
dtype: object

The `repeated_guest` feature is stored as an integer, with 0 indicating that the customer is not a repeated guest and 1 indicating that the customer is a repeated guest. Because it is a binary flag rather than a quantity, I treat it as a categorical variable in the analysis and modeling process. The same applies to the `required_car_parking_space` feature, which is also stored as an integer.

### Data Preparation

1. Convert `booking_status` to numeric - I will convert 'Canceled' to 1 and 'Not_Canceled' to 0.

In [6]:
# convert values in the column 'booking_status' to 0 and 1
df['booking_status'] = df['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1})

2. Convert the rest of the categorical features i.e. `type_of_meal_plan`, `market_segment_type`, `room_type_reserved` to numeric values.

In [7]:
# Check the unique values in type_of_meal_plan, market_segment_type, 
# and room_type_reserved
meal_plans = df['type_of_meal_plan'].unique()
market_segments = df['market_segment_type'].unique()
room_types = df['room_type_reserved'].unique()

# Print unique values
print("Unique values in 'type_of_meal_plan' column:")
print(meal_plans)
print("\nUnique values in 'market_segment_type' column:")
print(market_segments)
print("\nUnique values in 'room_type_reserved' column:")
print(room_types)

Unique values in 'type_of_meal_plan' column:
['Meal Plan 1' 'Not Selected' 'Meal Plan 2' 'Meal Plan 3']

Unique values in 'market_segment_type' column:
['Offline' 'Online' 'Corporate' 'Aviation' 'Complementary']

Unique values in 'room_type_reserved' column:
['Room_Type 1' 'Room_Type 4' 'Room_Type 2' 'Room_Type 6' 'Room_Type 5'
 'Room_Type 7' 'Room_Type 3']


Next, I convert the categorical features in `type_of_meal_plan`, `market_segment_type`, and `room_type_reserved` to integer codes using the `.map()` method. The codes are labels only and do not imply an ordering between categories, which is why these columns are one-hot encoded before modeling.

In [9]:
# Convert values in the type_of_meal_plan to numerical values
df['type_of_meal_plan'] = df['type_of_meal_plan'].map({'Not Selected': 0,
                                                             'Meal Plan 1': 1,
                                                             'Meal Plan 2': 2,
                                                             'Meal Plan 3': 3})

# Convert values in the room_type_reserved to numerical values
df['room_type_reserved'] = df['room_type_reserved'].map({'Room_Type 1': 1,
                                                             'Room_Type 2': 2,
                                                             'Room_Type 3': 3,
                                                             'Room_Type 4': 4,
                                                             'Room_Type 5': 5,
                                                             'Room_Type 6': 6,
                                                             'Room_Type 7': 7})


# Convert values in the market_segment_type to numerical values
df['market_segment_type'] = df['market_segment_type'].map({'Offline': 0,
                                                             'Online': 1,
                                                             'Corporate': 2,
                                                             'Complementary': 3,
                                                             'Aviation': 4})
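
One thing worth checking after the `.map()` calls above: pandas does not raise an error when a label is missing from the mapping dictionary, it silently produces `NaN`. The sketch below illustrates this on a toy column. The label `'Meal Plan 9'` and the variable names are made up for illustration and do not appear in the dataset.

```python
import pandas as pd

# Toy column; 'Meal Plan 9' is a hypothetical label the mapping does not cover
meals = pd.Series(['Meal Plan 1', 'Not Selected', 'Meal Plan 9'])

meal_codes = {'Not Selected': 0, 'Meal Plan 1': 1, 'Meal Plan 2': 2, 'Meal Plan 3': 3}
encoded = meals.map(meal_codes)

# Labels absent from the dictionary become NaN instead of raising an error
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print(f"Unmapped values: {n_unmapped}")
```

Running the same `isna().sum()` check on each mapped column of `df` would confirm that every label was covered.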

### Data Transformation

First, I'll split the data into train and test sets using the `train_test_split` function from the `sklearn.model_selection` module. This will enable me to estimate how well the model will perform on unseen data later on.

In [10]:
# split dataset into features and target
X = df.drop(['booking_status', 'Booking_ID'], axis=1)
y = df['booking_status']

In [11]:
# split dataset into train and test sets
# set shuffle to True to randomize the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=28)
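
The call above relies on the default split, which holds out 25% of the rows for testing. When the target classes are imbalanced, passing `stratify` keeps the share of cancellations the same in both sets. The following is a minimal sketch on synthetic data, not the split used in this notebook; the frame, the 30% positive rate, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target used only for illustration (30% positives)
rng = np.random.default_rng(28)
X_demo = pd.DataFrame({'lead_time': rng.integers(0, 300, size=1000)})
y_demo = pd.Series([1] * 300 + [0] * 700, name='booking_status')

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=28
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```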

Next, I'll use scikit-learn's `OneHotEncoder` to convert `type_of_meal_plan`, `market_segment_type`, `repeated_guest`, `room_type_reserved`, and `required_car_parking_space` into binary indicator columns.

In [12]:
# Create an instance of OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)

# Create dataframe with only the columns that require One Hot Encoding
categorical_train = X_train[['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 
                          'market_segment_type', 'repeated_guest']].copy()

categorical_test = X_test[['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 
                          'market_segment_type', 'repeated_guest']].copy()

# Fit the encoder on the training data only; both sets are transformed below
ohe.fit(categorical_train)

# Create new dataframes with One Hot Encoded columns
categorical_train_ohe = pd.DataFrame(data=ohe.transform(categorical_train),
                                    columns=ohe.get_feature_names_out(),
                                    index=categorical_train.index)

categorical_test_ohe = pd.DataFrame(data=ohe.transform(categorical_test),
                                   columns=ohe.get_feature_names_out(),
                                   index=categorical_test.index)

Next I'll scale the data using scikit-learn's `StandardScaler`.

In [13]:
# create an instance of StandardScaler
scaler = StandardScaler()

# Create dataframe with only quantitative variables
quant_train = X_train[['arrival_year', 'arrival_month', 'arrival_date','no_of_adults', 'no_of_children', 
                   'no_of_weekend_nights', 'no_of_week_nights', 'lead_time', 'no_of_previous_cancellations', 
                   'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']].copy()

quant_test = X_test[['arrival_year', 'arrival_month', 'arrival_date','no_of_adults', 'no_of_children', 
                   'no_of_weekend_nights', 'no_of_week_nights', 'lead_time', 'no_of_previous_cancellations', 
                   'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']].copy()

# Fit the scaler on the training data only; both sets are transformed below
scaler.fit(quant_train)

# Create new dataframes with Scaler columns
quant_train_scaler = pd.DataFrame(data=scaler.transform(quant_train),
                                  columns=quant_train.columns,
                                  index=quant_train.index)

quant_test_scaler = pd.DataFrame(data=scaler.transform(quant_test),
                                  columns=quant_test.columns,
                                  index=quant_test.index)

In [14]:
# Combine the scaled numeric columns with the one-hot encoded columns
X_train_transformed = pd.concat([quant_train_scaler, categorical_train_ohe], axis=1)

X_test_transformed = pd.concat([quant_test_scaler, categorical_test_ohe], axis=1)

# Preview new dataframe
X_train_transformed.head()

Unnamed: 0,arrival_year,arrival_month,arrival_date,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,lead_time,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,...,room_type_reserved_3,room_type_reserved_4,room_type_reserved_5,room_type_reserved_6,room_type_reserved_7,market_segment_type_1,market_segment_type_2,market_segment_type_3,market_segment_type_4,repeated_guest_1
2947,0.46805,-0.136897,-0.642154,0.298216,2.20181,0.218603,-1.550369,-0.991105,-0.062816,-0.086268,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3033,0.46805,-1.758329,1.423318,-1.635357,-0.26098,1.363639,-0.847614,-0.97943,-0.062816,-0.086268,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
30081,0.46805,0.835962,-1.101148,0.298216,4.6646,-0.926433,-0.847614,0.304761,-0.062816,-0.086268,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
21861,0.46805,-1.434042,-1.330645,0.298216,-0.26098,1.363639,-0.14486,-0.66422,-0.062816,-0.086268,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
11680,0.46805,0.835962,-0.9864,0.298216,-0.26098,-0.926433,-0.14486,-0.827662,-0.062816,-0.086268,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


### Modeling

First, I'll create a Logistic Regression model using scikit-learn's `LogisticRegression` class. The model will be trained on the `X_train_transformed` and `y_train` data. This baseline estimates the probability of the target variable `booking_status` as a function of the input features, and its coefficients can provide insights into the relative importance of each feature in predicting the target.
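
To make "probability as a function of the input features" concrete, the sketch below fits a logistic regression on synthetic data and reproduces `predict_proba` by hand: a linear score passed through the sigmoid function. The data and variable names are assumptions for illustration and are unrelated to the hotel dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Linear score z = intercept + X @ coefficients, passed through the sigmoid
z = model.intercept_ + X_demo @ model.coef_.ravel()
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns [P(class 0), P(class 1)] for each row
sklearn_proba = model.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

Because the features are standardized before fitting in this notebook, a larger absolute coefficient corresponds to a larger change in the log-odds per standard deviation of that feature.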

Here are the criteria I will use to explain the baseline model metrics:

1. Accuracy - The proportion of correct predictions made by the model.
2. Precision - The proportion of true positive predictions out of all the positive predictions made by the model.
3. Recall - The proportion of true positive predictions out of all the actual positive instances. 
4. F1 score - The harmonic mean of precision and recall.
5. ROC AUC score - The area under the ROC curve, which measures the model's ability to distinguish between positive and negative instances.

By evaluating the baseline model using these criteria, we can gain insights into its performance and identify areas for improvement.
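
The five criteria listed above map directly onto functions in `sklearn.metrics`. The sketch below computes them on a small set of made-up labels and scores, not on this model's predictions; in the notebook, `y_true` would correspond to `y_test`, `y_pred` to the output of `predict`, and `y_score` to the positive-class column of `predict_proba`.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and scores, for illustration only (1 = canceled)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # predicted P(canceled)

metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'precision': precision_score(y_true, y_pred),
    'recall': recall_score(y_true, y_pred),
    'f1': f1_score(y_true, y_pred),
    # ROC AUC is computed from scores, not from hard predictions
    'roc_auc': roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that ROC AUC takes predicted probabilities (or decision scores) rather than class labels, since it measures how well the model ranks positives above negatives across all thresholds.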

In [15]:
# Fit the logistic regression model using sklearn
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train_transformed, y_train)

y_pred_train = logreg.predict(X_train_transformed)
y_pred_test = logreg.predict(X_test_transformed)

# Add a constant column for the intercept; keep it in a separate variable so
# X_train_transformed still matches the features logreg was trained on
X_train_sm = sm.add_constant(X_train_transformed)

# Coefficients and intercept from the fitted sklearn model (L2-regularized by default)
params = np.append(logreg.intercept_, logreg.coef_)

# Fit a separate, unregularized statsmodels Logit to obtain p-values
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit(disp=False)

# Print the summary to get the p-values
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:         booking_status   No. Observations:                27206
Model:                          Logit   Df Residuals:                    27178
Method:                           MLE   Df Model:                           27
Date:                Tue, 27 Aug 2024   Pseudo R-squ.:                  0.3300
Time:                        16:36:40   Log-Likelihood:                -11492.
converged:                      False   LL-Null:                       -17153.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                   -2.0257      0.079    -25.741      0.000      -2.180      -1.871
arrival_year                             0.1654      

