# __Final Project Proposal__
## Hotel Reservation Cancellation Prediction
- Group Members: Qing Dou, Ruoyu Chen, Zhengnan Li
- Repository: https://github.com/jc000222/Data_Science_DAV6150/tree/main/FinalProject

## 1 Introduction 

In this project, we aim to build a predictive model that accurately predicts whether a hotel reservation will be canceled. This matters to hotels because canceled bookings can severely affect revenue and the operational strategies that depend on it. Accurate information about room reservations is key to reducing losses, such as the lost profit from last-minute cancellations. Whether or not a hotel receives advance notice, a cancellation leaves a vacancy in a room that could have been sold to another guest. When the cancellation arrives at the last minute, the hotel also has little time to re-release and resell the room.

The negative impact of cancellations includes not only the lost revenue from vacant rooms, but also the additional distribution-channel costs of re-booking and the lower profit margins that come from selling rooms at a reduced price at the last minute. By accurately predicting cancellations and acting to reduce their impact, hotels can improve profitability and operational efficiency. We use a dataset containing a wide range of booking-related factors to provide insights for hotel management. Our goal is to build a robust model that helps hotels refine their operational processes and maximize revenue.

__Literature Review:__  
This dataset has been used in many machine learning studies; we select a few to show the scope of prior research.  

__Application of machine learning to cluster hotel booking curves for hotel demand forecasting__  
Link: https://www.sciencedirect.com/science/article/pii/S2352340918315191  
This study proposes a new method for daily hotel demand forecasting, leveraging clusters of stay dates from historical booking data. Results indicate improved accuracy in demand forecasts when generated at the cluster level, aiding data-driven revenue management amidst the COVID-19 pandemic's demand fluctuations.

__Forecasting Hotel Demand Using Machine Learning Approaches__  
Link: https://www.researchgate.net/publication/340134067_Forecasting_Hotel_Demand_Using_Machine_Learning_Approaches  
This study compares traditional pick-up based models with machine learning approaches for forecasting hospitality demand, demonstrating the superior performance of machine learning models, especially with longer booking histories. These findings have practical implications for improving forecast accuracy and revenue optimization in hotel management, while also laying the foundation for future research in refining machine learning models for revenue management.

__Prediction of hotel booking cancellations: Integration of machine learning and probability model based on interpretable feature interaction__  
Link: https://www.sciencedirect.com/science/article/pii/S0167923623000349  
The study aims to improve hotel cancellation prediction by proposing an interpretable feature interaction method and integrating Bayesian networks (BN) with Lasso regression models. The table below summarizes the performance of its model variants.

| Model                                    | Accuracy   | Recall    | Precision  | F1-score  |
|------------------------------------------|------------|-----------|------------|-----------|
| BN                                       | 0.7818     | 0.3890    | 0.6891     | 0.4972    |
| Lasso-original                          | 0.7624     | 0.2225    | 0.7455     | 0.3428    |
| Lasso-original-Bayesian                 | 0.7823     | 0.3936    | 0.6914     | 0.5016    |
| Lasso-original-Bayesian-interaction     | 0.8186     | 0.5097    | 0.7597     | 0.6101    |
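
As a quick sanity check on the table above, the F1-score should equal the harmonic mean of precision and recall. The sketch below recomputes it from the reported values; the helper name `f1_from` is ours and is not taken from the cited study.

```python
# (recall, precision, reported F1) copied from the table above.
reported = {
    "BN": (0.3890, 0.6891, 0.4972),
    "Lasso-original": (0.2225, 0.7455, 0.3428),
    "Lasso-original-Bayesian": (0.3936, 0.6914, 0.5016),
    "Lasso-original-Bayesian-interaction": (0.5097, 0.7597, 0.6101),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for model, (recall, precision, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    print(f"{model}: recomputed F1 = {f1:.4f}, reported = {f1_reported:.4f}")
```

The recomputed values agree with the reported ones up to rounding of the inputs. The low recall of every variant is what pulls F1 well below accuracy.
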

The studies above explore the dataset and the algorithms for predicting the outcomes, but each leaves one or more of the following tasks unaddressed:
1. Exploratory data analysis
2. Feature selection
3. Data transformation
4. Managing the scale of the data
5. Balancing imbalanced classes
6. Applying XGBoost, neural network, and ensemble models
7. Comparing models
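
As an illustration of task 5, the sketch below balances a toy training set by random oversampling, assuming `is_canceled` is the binary target. Random oversampling is only one possible technique and is our choice for this example; the rows are made up and the variable names are hypothetical.

```python
import pandas as pd

# Toy training split (made-up rows): 6 kept bookings, 2 canceled ones.
train = pd.DataFrame({
    "lead_time": [7, 13, 14, 23, 34, 102, 342, 120],
    "is_canceled": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Size of the largest class; every class is resampled up to this size.
target_size = int(train["is_canceled"].value_counts().max())

parts = []
for label, group in train.groupby("is_canceled"):
    # Sample with replacement only when the class is smaller than the target.
    needs_replacement = len(group) < target_size
    parts.append(group.sample(n=target_size, replace=needs_replacement, random_state=0))

# Concatenate and shuffle so the two classes are interleaved.
balanced = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)
print(balanced["is_canceled"].value_counts())
```

Resampling of this kind would normally be applied to the training split only, so that the test set keeps the original class ratio.
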

## 2 Research Questions
1. __What are the key factors affecting hotel booking cancellation, and how can we predict the cancellations?__  
    Research into the factors that influence hotel booking cancellations, together with predictive modeling, can have significant implications for the hotel business. Most online travel agencies offer loose cancellation policies in order to encourage bookings. However, the burden of handling cancellations falls on the hotels, because canceled rooms lead to lost revenue, reduced profits from price cuts, and increased advertising costs. By identifying the key factors in the booking information that contribute to cancellations, such as lead time, deposit type, and special requests, hotels can predict which bookings will be canceled and make informed decisions about overbooking, room pricing, and staffing levels, ultimately improving efficiency, profitability, and customer satisfaction. In addition, we believe that weather conditions are an important factor in hotel reservation cancellations, so we also imported the weather conditions in Lisbon, Portugal from 2015 to 2017.  
    With this predictive model, hotels can proactively identify which guests are likely to cancel and take remedial action in time. For instance, by reaching out in advance to guests with a high likelihood of cancellation, hotels can encourage them to cancel as early as possible, which frees up rooms for resale. Alternatively, hotels can engage with guests who show a tendency to cancel by highlighting the advantages of staying at the hotel and offering incentives or rewards, aiming to persuade them to keep their bookings.
    
2. __Do different deposit types show different patterns when predicting booking cancellations?__  
    Exploring the impact of deposit types on hotel booking cancellations can provide insight into customer behavior and help optimize cancellation prediction models. Different deposit types, such as non-refundable deposits, refundable deposits, or no deposit, may influence cancellation patterns differently because they reflect different levels of guest commitment. Understanding how each deposit type relates to cancellations can help hotels set cancellation policies and pricing strategies that mitigate revenue loss and maximize profitability. For example, if a predictive model indicates that bookings made with non-refundable deposits have a higher likelihood of cancellation, hotels can reach out to these guests ahead of time to encourage them to confirm or modify their bookings.
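
A first look at the second research question could be a per-deposit-type cancellation rate. The sketch below is a minimal illustration on made-up rows; it assumes the `deposit_type` and `is_canceled` columns of the hotel dataset, and the category labels `Non Refund` and `Refundable` are assumed spellings that should be checked against the real data.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the real dataset.
bookings = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund",
                     "Non Refund", "Refundable", "No Deposit"],
    "is_canceled": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 target is the cancellation rate within each group.
rates = (
    bookings.groupby("deposit_type")["is_canceled"]
    .agg(n_bookings="count", cancel_rate="mean")
    .sort_values("cancel_rate", ascending=False)
)
print(rates)
```

Reporting the group size next to the rate helps avoid over-reading a rate that rests on only a handful of bookings.
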


## 3 Data to be Used
The data we will be using is sourced from GitHub: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11 and Visual Crossing: https://www.visualcrossing.com/weather/weather-data-services. To keep the inputs consistent, we download the raw files from these websites, save them to our own repositories, and then use pandas to read the CSV files from there.  
The first dataset contains 119,390 observations (rows) and 32 variables. The second dataset contains daily weather for Lisbon, Portugal from 2015 to 2017 and will be filtered during later analysis.  
__Data Dictionary__

| Variable                        | Description                                                                                                          |
|--------------------------------|----------------------------------------------------------------------------------------------------------------------|
| hotel                          | Hotel (Resort Hotel or City Hotel)                                                                         |
| is_canceled                    | Value indicating if the booking was canceled (1) or not (0)                                                          |
| lead_time                      | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date             |
| arrival_date_year              | Year of arrival date                                                                                                 |
| arrival_date_month             | Month of arrival date                                                                                                |
| arrival_date_week_number       | Week number of year for arrival date                                                                                 |
| arrival_date_day_of_month      | Day of arrival date                                                                                                  |
| stays_in_weekend_nights        | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel                       |
| stays_in_week_nights           | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel                            |
| adults                         | Number of adults                                                                                                     |
| children                       | Number of children                                                                                                   |
| babies                         | Number of babies                                                                                                     |
| meal                           | Type of meal booked. Categories are presented in standard hospitality meal packages                                 |
| country                        | Country of origin. Categories are represented in the ISO 3166-3:2013 format                                         |
| market_segment                 | Market segment designation. In categories, the term "TA" means "Travel Agents" and "TO" means "Tour Operators"      |
| distribution_channel           | Booking distribution channel. The term "TA" means "Travel Agents" and "TO" means "Tour Operators"                   |
| is_repeated_guest              | Value indicating if the booking name was from a repeated guest (1) or not (0)                                       |
| previous_cancellations         | Number of previous bookings that were cancelled by the customer prior to the current booking                          |
| previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking                               |
| reserved_room_type             | Code of room type reserved. Code is presented instead of designation for anonymity reasons                           |
| assigned_room_type             | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons | 
| booking_changes                | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type                   | Indication on if the customer made a deposit to guarantee the booking                                                |
| agent                          | ID of the travel agency that made the booking                                                                        |
| company                        | ID of the company/entity that made the booking or responsible for paying the booking                                  |
| days_in_waiting_list           | Number of days the booking was in the waiting list before it was confirmed to the customer                           |
| customer_type                  | Type of booking, assuming one of four categories                                                                     |
| adr                            | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces    | Number of car parking spaces required by the customer                                                                |
| total_of_special_requests      | Number of special requests made by the customer (e.g. twin bed or high floor)                                        |
| reservation_status             | Reservation last status, assuming one of three categories                                                            |
| reservation_status_date        | Date at which the last status was set. This variable can be used in conjunction with reservation_status to determine when the booking was canceled or when the customer checked out |


- Dataset 1

In [2]:
import pandas as pd
pd.read_csv("https://raw.githubusercontent.com/jc000222/Data_Science_DAV6150/main/FinalProject/hotel.csv")

| Unnamed: 0.1 | Unnamed: 0 | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | ... | No Deposit |  |  | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | ... | No Deposit |  |  | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | ... | No Deposit |  |  | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | ... | No Deposit | 304.0 |  | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | ... | No Deposit | 240.0 |  | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | ... | No Deposit | 394.0 |  | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 |
| 119386 | 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | ... | No Deposit | 9.0 |  | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 |
| 119387 | 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | ... | No Deposit | 9.0 |  | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 |
| 119388 | 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | ... | No Deposit | 89.0 |  | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 |


- Dataset 2

In [7]:
pd.read_csv('https://raw.githubusercontent.com/Zhengnan817/DAV-6150/main/Final_project/src/Lisbon%2CPortugal%202015-01-01%20to%202017-09-26.csv')

| Unnamed: 0 | name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | ... | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lisbon,Portugal | 2015-01-01 | 14.7 | 2.9 | 8.5 | 14.7 | 2.1 | 7.9 | 1.1 | 61.4 | ... | 10.0 | 5 |  | 2015-01-01T07:54:47 | 2015-01-01T17:25:26 | 0.37 | Clear | Clear conditions throughout the day. | clear-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 1 | Lisbon,Portugal | 2015-01-02 | 12.6 | 2.3 | 7.9 | 12.6 | 0.1 | 6.8 | 3.9 | 77.2 | ... | 10.2 | 5 |  | 2015-01-02T07:54:55 | 2015-01-02T17:26:15 | 0.40 | Partially cloudy | Becoming cloudy in the afternoon. | partly-cloudy-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 2 | Lisbon,Portugal | 2015-01-03 | 13.9 | 4.2 | 8.6 | 13.9 | 1.9 | 7.6 | 2.9 | 69.0 | ... | 10.2 | 5 |  | 2015-01-03T07:55:01 | 2015-01-03T17:27:05 | 0.44 | Clear | Clear conditions throughout the day. | clear-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 3 | Lisbon,Portugal | 2015-01-04 | 14.0 | 3.2 | 8.2 | 14.0 | 3.2 | 7.7 | 4.2 | 77.0 | ... | 10.2 | 5 |  | 2015-01-04T07:55:05 | 2015-01-04T17:27:57 | 0.48 | Clear | Clear conditions throughout the day. | clear-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 4 | Lisbon,Portugal | 2015-01-05 | 12.4 | 2.4 | 7.6 | 12.4 | 1.0 | 7.2 | 5.3 | 86.0 | ... | 8.9 | 5 |  | 2015-01-05T07:55:07 | 2015-01-05T17:28:50 | 0.50 | Clear | Clear conditions throughout the day. | clear-day | 08532099999,08535099999,08534099999,0853609999... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Lisbon,Portugal | 2017-09-22 | 25.1 | 15.0 | 18.4 | 25.1 | 15.0 | 18.4 | 12.3 | 69.3 | ... | 20.2 | 8 |  | 2017-09-22T07:24:16 | 2017-09-22T19:33:36 | 0.07 | Partially cloudy | Becoming cloudy in the afternoon. | partly-cloudy-day | 08532099999,08535099999,08534099999,0853609999... |
| 996 | Lisbon,Portugal | 2017-09-23 | 25.7 | 14.1 | 18.9 | 25.7 | 14.1 | 18.9 | 2.9 | 41.6 | ... | 19.6 | 8 |  | 2017-09-23T07:25:10 | 2017-09-23T19:32:00 | 0.10 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 997 | Lisbon,Portugal | 2017-09-24 | 25.3 | 14.6 | 19.4 | 25.3 | 14.6 | 19.4 | -1.8 | 28.7 | ... | 20.7 | 8 |  | 2017-09-24T07:26:04 | 2017-09-24T19:30:24 | 0.14 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 08535099999,08534099999,08536099999,LPPT,08579... |
| 998 | Lisbon,Portugal | 2017-09-25 | 25.7 | 17.0 | 20.3 | 25.7 | 17.0 | 20.3 | 4.9 | 42.4 | ... | 19.9 | 8 |  | 2017-09-25T07:26:58 | 2017-09-25T19:28:48 | 0.17 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 08535099999,08532099999,08534099999,0853609999... |
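
One way the two datasets could be combined is to join each booking to the weather on its arrival date. The proposal does not fix the join key, so treating the arrival date as the key is an assumption of this sketch; the column names follow the two previews above, and the weather values are made up.

```python
import pandas as pd

# Toy rows for illustration; column names follow the previews above.
bookings = pd.DataFrame({
    "arrival_date_year": [2015, 2015],
    "arrival_date_month": ["July", "July"],
    "arrival_date_day_of_month": [1, 2],
    "is_canceled": [0, 1],
})
weather = pd.DataFrame({
    "datetime": ["2015-07-01", "2015-07-02"],
    "tempmax": [27.0, 30.5],  # made-up values
    "conditions": ["Clear", "Partially cloudy"],
})

# Build a proper date from the three arrival columns (month is a full English name).
date_text = (
    bookings["arrival_date_year"].astype(str) + "-"
    + bookings["arrival_date_month"] + "-"
    + bookings["arrival_date_day_of_month"].astype(str)
)
bookings["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")
weather["datetime"] = pd.to_datetime(weather["datetime"])

# A left join keeps every booking even when no weather row matches.
merged = bookings.merge(weather, how="left", left_on="arrival_date", right_on="datetime")
print(merged[["arrival_date", "tempmax", "conditions", "is_canceled"]])
```

Because a guest may cancel long before arrival, the weather on the cancellation date (`reservation_status_date`) could be an alternative key worth comparing.
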


## 4 Approach
To address the research questions regarding the key factors affecting hotel booking cancellations and the impact of deposit types on cancellation predictions, the following plan outlines the management, analysis, and modeling approaches for the dataset.
#### Data Management
Because the dataset is large, it will be stored and managed in a relational database management system (RDBMS) such as PostgreSQL for efficient querying, analysis, and storage. This approach handles the data's structure and relationships effectively and makes it easy to retrieve and manipulate records for analysis.
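
The sketch below illustrates the store-then-query workflow. To stay runnable without a database server it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL; the table layout, the `booking_id` key, and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bookings (
        booking_id   INTEGER PRIMARY KEY,  -- hypothetical surrogate key
        hotel        TEXT NOT NULL,
        deposit_type TEXT NOT NULL,
        lead_time    INTEGER,
        is_canceled  INTEGER NOT NULL CHECK (is_canceled IN (0, 1))
    )
    """
)

# Toy rows for illustration only.
rows = [
    (1, "Resort Hotel", "No Deposit", 342, 0),
    (2, "City Hotel", "No Deposit", 23, 0),
    (3, "City Hotel", "Non Refund", 120, 1),
]
conn.executemany("INSERT INTO bookings VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Aggregate in SQL: number of bookings and cancellation rate per hotel.
query = """
    SELECT hotel, COUNT(*) AS n_bookings, AVG(is_canceled) AS cancel_rate
    FROM bookings
    GROUP BY hotel
    ORDER BY hotel
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The same SQL should carry over to PostgreSQL largely unchanged; only the connection setup and the parameter placeholder style would differ with a PostgreSQL driver.
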
#### Statistical Analysis
To explore the dataset and identify potential factors influencing booking cancellations, initial exploratory data analysis (EDA) will be conducted using Python libraries such as Pandas for data manipulation and Matplotlib and Seaborn for visualization. This will include summary statistics, correlation analysis, and visualization techniques to identify trends, patterns, and anomalies in the data.
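
A minimal sketch of the numeric part of this EDA is shown below, using made-up rows with column names from the data dictionary. Plotting with Matplotlib or Seaborn is left out so the example stays short.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the real dataset.
df = pd.DataFrame({
    "lead_time": [342, 7, 13, 120, 200, 5],
    "total_of_special_requests": [0, 0, 1, 0, 0, 2],
    "adr": [0.0, 75.0, 98.0, 60.0, 80.0, 110.0],
    "is_canceled": [1, 0, 0, 1, 1, 0],
})

# Summary statistics (count, mean, std, quartiles) for every numeric column.
summary = df.describe()

# Number of missing values per column.
missing = df.isna().sum()

# Pearson correlation of each feature with the target, weakest to strongest.
target_corr = df.corr()["is_canceled"].drop("is_canceled").sort_values()

print(summary)
print(missing)
print(target_corr)
```

The full matrix returned by `df.corr()` is also the natural input for a correlation heatmap.
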
#### Task Allocation
Qing Dou:
- Introduction: Briefly describe the purpose and motivation behind the project, introduce the dataset, and clearly state the research question.
- Prepped Data Overview: Conduct a secondary EDA on the prepared data to verify its readiness for modeling.
- Model Selection: Evaluate model performance using metrics such as confusion matrix, accuracy, recall, F1 score, and ROC curve.

Ruoyu Chen:
- Abstract: Summarize the problem, methodology, and major outcomes.  
- Exploratory Data Analysis: Conduct initial exploration to understand the dataset's characteristics, such as distributions of key variables, presence of missing values, and potential outliers.
- Data Preparation: Assist in cleaning and preprocessing the data to enhance its quality and suitability for modeling.

Zhengnan Li:
- Machine Learning Models: Build the models using logistic regression, neural network, and decision tree models to identify the most effective approach for the research question.
- Ensemble Model: Explore and implement ensemble approaches such as simple voting ensemble or stacking to combine predictions from individual models and improve overall accuracy.
- Conclusions: Summarize the project's findings, the efficacy of the selected model, and suggest avenues for future research building on the current work.


#### Project Report Structure
__1. Abstract:__ Summarize our problem, methodology, and major outcomes.  
__2. Introduction:__ Briefly describe the purpose and motivation behind the project; introduce the dataset, including its origin, size, and the types of variables it contains; and clearly state the question or problem our project aims to address.  
__3. Exploratory Data Analysis:__ Conduct an initial exploration to understand the dataset's characteristics, such as distributions of key variables, presence of missing values, and potential outliers. The following diagrams may be used:  
- Correlation Heatmaps: To visualize the relationships between different variables and identify those most correlated with booking cancellations.
- Bar Charts and Histograms: To illustrate the distribution of categorical variables like hotel type, deposit type, and market segment.
- Box Plots: To explore the distribution of numerical variables and identify outliers or anomalies that could affect cancellation predictions.  

__4. Data Preparation:__ Clean and preprocess the data to enhance its quality and suitability for modeling, including addressing missing values, outliers, and feature engineering.  
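
The preparation steps named above could look roughly like the following sketch on made-up rows. The imputation choices (0 for a missing `children` count, 0 as a "no agent" marker) and the `ADR_CAP` ceiling are illustrative assumptions, not decisions the project has made.

```python
import pandas as pd

# Toy rows for illustration only, including missing values and one extreme rate.
df = pd.DataFrame({
    "children": [0.0, None, 2.0, 0.0],
    "agent": [304.0, None, 240.0, None],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "Refundable"],
    "adr": [75.0, 98.0, 5400.0, 60.0],
})

# Missing values: assume no children, and use 0 to mean "booked without an agent".
df["children"] = df["children"].fillna(0).astype(int)
df["agent"] = df["agent"].fillna(0).astype(int)

# Outliers: cap extreme room rates at a hypothetical ceiling.
ADR_CAP = 1000.0
df["adr"] = df["adr"].clip(upper=ADR_CAP)

# Feature engineering: one-hot encode the categorical deposit type.
prepared = pd.get_dummies(df, columns=["deposit_type"], prefix="deposit")
print(prepared)
```
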
__5. Prepped Data Overview:__ Conduct a secondary EDA on the prepared data to verify its readiness for modeling and to refine the analysis strategy.  
__6. Machine Learning Models:__ Build the models to identify the most effective approach for the research question. These are the models we plan to use:  
- Logistic Regression: Logistic regression is a statistical method widely used for binary classification; it predicts the probability that an event occurs. In the context of hotel booking cancellations, it can estimate the probability that a booking is canceled. The model is simple and highly interpretable, which makes it suitable as a baseline.
- Neural Network Model: Neural networks are loosely inspired by the structure of the human brain and are suited to complex nonlinear problems. With a multi-layered structure, they can capture and learn deep features and patterns in the data. They have strong fitting capacity and are particularly effective on large, high-dimensional datasets.
- Decision Tree Model: Decision trees split the data through a series of rules, with each split forming a decision node, until a prediction is reached. The model is easy to understand and interpret because the tree shows the decision-making process and its logic directly.
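
The three model families above can be fitted side by side as in the sketch below. It assumes scikit-learn, which the proposal does not name explicitly, and uses synthetic data from `make_classification` in place of the prepared hotel data; the hyperparameters are placeholders rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared bookings.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inputs for the two models that are sensitive to feature scale.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Fit each model on the training split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Wrapping the scaler in a pipeline keeps the scaling statistics tied to the training split, which avoids leaking test-set information.
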

__7. Model Selection:__ Evaluate model performance using metrics such as the confusion matrix, accuracy, recall, F1 score, and ROC curve. Compare the performance of the different models to identify the most suitable one for predicting hotel booking cancellations.  
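
The metrics listed for model selection can be computed as in this small sketch, which assumes scikit-learn's `sklearn.metrics` module. The labels, predicted probabilities, and the 0.5 threshold are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth (1 = canceled) and predicted cancellation probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]

# Turn probabilities into class labels with an illustrative threshold.
THRESHOLD = 0.5
y_pred = [int(score >= THRESHOLD) for score in y_score]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC AUC is threshold-free, so it takes the scores rather than the labels.
    "roc_auc": roc_auc_score(y_true, y_score),
}
print(cm)
print(metrics)
```

Accuracy, recall, and F1 depend on the chosen threshold, while ROC AUC summarizes ranking quality across all thresholds, so the two kinds of metric can rank models differently.
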
__8. Ensemble Model:__  In addition to these individual models, an ensemble approach will be considered to combine the predictions from each model to improve overall accuracy. The ensemble method could be a simple voting ensemble, where predictions from each model are combined to make a final decision, or a more complex stacking approach, where the outputs of individual models serve as inputs to a final model that makes the ultimate prediction.  
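
Both ensemble styles described above can be sketched with scikit-learn's `VotingClassifier` and `StackingClassifier`. Using scikit-learn, the synthetic data, and a logistic regression as the final stacking model are assumptions of this example; `base_models` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared bookings.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def base_models():
    """Return fresh (name, estimator) pairs so each ensemble trains its own copies."""
    return [
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
        )),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ]


# Simple voting: each base model casts one vote and the majority class wins.
voting = VotingClassifier(estimators=base_models(), voting="hard")

# Stacking: cross-validated base-model outputs become inputs to a final model.
stacking = StackingClassifier(
    estimators=base_models(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

ensemble_scores = {}
for name, model in {"voting": voting, "stacking": stacking}.items():
    model.fit(X_train, y_train)
    ensemble_scores[name] = model.score(X_test, y_test)

print(ensemble_scores)
```

An ensemble is only worth keeping if it beats the best single model on held-out data, so these scores should be compared against the individual models' scores.
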
__9. Conclusions:__  Summarize the project's findings, the efficacy of the selected model, and suggest avenues for future research building on the current work.

## References
__Dataset source__: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11

__Domain Knowledge__:  
Antonio, N., de Almeida, A., and Nunes, L. (2019). Hotel booking demand datasets: https://www.sciencedirect.com/science/article/pii/S2352340918315191  
