# Research project on hotel bookings
### Contributors: Kenneth Bansah, Aeron Cable, Jacob Hartmann, Joshua Johnson, Joseph McEvoy, Logan Willson

## Introduction

The hospitality industry plays a significant role in the economic development of every nation, providing
revenue to both national and local governments and employing many people who derive their
livelihoods and incomes from it. As Lawson (2015) stated, hotels, a major player in the hospitality
industry, serve the needs of travelers, business visitors, and tourists. They also serve individuals who
need temporary accommodation for a variety of other reasons. Data show that the U.S. hotel industry
employs about 2.3 million people and, in 2018, contributed $659.4 billion to the U.S. gross domestic
product. Thus, it is crucial to ensure the sustainability of the hotel industry so that it continues to serve
the needs of the people and to provide revenue and incomes.

A few scholarly studies have examined factors that could affect the sustainability
of the hotel industry. For example, Wilkins, Merrilees, and Herington (2007) used exploratory and
confirmatory factor analysis and structural modeling and concluded that customers rated quality of
service as an important indicator for choosing a hotel, while Saleh and Ryan (1992, p.1) observed that
‘availability of a restaurant, convenient parking, interior decor and exterior aesthetics’ were factors
customers consider when selecting a hotel. Su (2004) found that customers were not satisfied
with their hotel services, and Poon and Low (2005) observed that satisfaction levels differed between
Asian and Western travelers.

These studies typically used social science research techniques such as focus groups, surveys, interviews,
and thematic and content analyses to draw conclusions. The sample sizes in the studies were relatively
small, mostly ranging from 50 to 600 study participants. In this study, by contrast, we aim to perform a
larger-scale quantitative analysis using a hotel dataset with 32 variables and more than 119,000 observations.

According to a study sponsored by the American Hotel & Lodging Association and the American Hotel &
Lodging Educational Foundation, the U.S. hotel industry generated $42.4 billion in direct capital
investment in 2018 and supported more than 329,000 direct jobs from the construction and renovation
of hotel projects. Customers are critical in sustaining the hotel industry. Without customers to frequent
the hotels, the huge capital investment in hotel construction could have debilitating consequences for
investors. In addition, governments would lose revenue, jobs would decline, and livelihoods would be
significantly affected. We contribute to the scholarly discourse on the sustainability of the hotel industry
by attempting to answer the following questions:

● Do Asian or Western travelers prefer city hotels or resort hotels?

● Are adult customers with children more likely to cancel their reservations?

● Do guests with families have longer reservation lead times than guests without families?

● Which country’s citizens most prefer resort hotels?

● What are the most preferred meals for guests?

● What is the probability that the room a customer reserved will be assigned to them?

● What is the meal preference for guests from the countries with the most reservations?

● Is car parking space a factor in choosing a hotel?

● Which hotel type (resort or city) is most preferred by guests?

● Which type of guest is likely to make special requests?

● How does the average daily rate impact the choice of a hotel?

Answering these questions could help identify important factors that contribute to customer satisfaction and the
sustainability of hotels. The findings of this study will be disseminated via journal and conference
publications to contribute to the existing body of knowledge and to help improve the hotel industry in
the United States and other countries that recognize hotels and the hospitality industry as an important
economic sector for development.

## Data Source and Collection

The data for this proposed study were retrieved from kaggle.com and comprise customer hotel
reservations from July 2015 to August 2017. There are 32 variables and 119,390 observations. The variables
include details such as type of hotel, date of arrival and departure, meals, nationality of guests, and
average daily rates.
The CSV file containing the dataset has a combination of numeric and string data and will be relied on in
this study to answer the research questions. Python 3 will be used for modeling and prediction, and the
results and findings will be interpreted and discussed following sound scientific principles. Additional
existing scholarly literature will be reviewed and used to support our analysis, interpretation, and
discussion.

## Methods

This research project will use a variety of common data science and business analytics tools to
analyze the data and extract information that offers insight into the conditions of
the hotel industry. We will use methods such as data visualization, regression analysis, and
model fitting to identify and remove outliers, show trends, and enable prediction of data
elements based on the given observations. We will apply many of these methods through
pandas, which is well suited to analyzing and manipulating the dataset. Data
manipulation will be important for this dataset because there are non-numerical variables we
will have to convert, and some of the numerical variables correspond to non-numerical values. We
will convert those variables to forms that allow better analysis.
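
The conversion step described above could be sketched as follows. This is a minimal illustration, not the project's final preprocessing: the three-row `sample` frame, the meal labels, and the choice of encoding for each column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical three-row sample; column names come from the dataset,
# but the meal labels are placeholder values.
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "Resort Hotel"],
    "meal": ["BB", "HB", "BB"],
    "lead_time": [342, 23, 7],
})

# Two-valued column: map each label to an integer code.
# Categories are sorted alphabetically, so "City Hotel" -> 0 and "Resort Hotel" -> 1.
sample["hotel_code"] = sample["hotel"].astype("category").cat.codes

# Multi-valued column: replace it with one indicator column per category.
encoded = pd.get_dummies(sample, columns=["meal"], prefix="meal")

print(encoded.columns.tolist())
```

Integer codes keep a binary variable in one column, while indicator columns avoid implying an order among categories that have none.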

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [14]:
df = pd.read_csv("hotel_bookings.csv")
df

| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | No Deposit | 394.0 | | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | No Deposit | 9.0 | | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 9.0 | | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 89.0 | | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 |


In [15]:
for col in df.columns:
    print(col)

hotel
is_canceled
lead_time
arrival_date_year
arrival_date_month
arrival_date_week_number
arrival_date_day_of_month
stays_in_weekend_nights
stays_in_week_nights
adults
children
babies
meal
country
market_segment
distribution_channel
is_repeated_guest
previous_cancellations
previous_bookings_not_canceled
reserved_room_type
assigned_room_type
booking_changes
deposit_type
agent
company
days_in_waiting_list
customer_type
adr
required_car_parking_spaces
total_of_special_requests
reservation_status
reservation_status_date
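
The column list above includes both `reserved_room_type` and `assigned_room_type`, so the research question about room assignment can be approached by comparing the two columns row by row. The sketch below assumes that an exact label match means the guest received the reserved room type. The four-row `sample` frame and its room labels are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the room labels are placeholders, not values from the dataset.
sample = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "C"],
    "assigned_room_type": ["A", "D", "D", "C"],
})

# Row-wise comparison gives a boolean Series; its mean is the share of
# bookings where the assigned room type equals the reserved room type.
match_rate = (sample["reserved_room_type"] == sample["assigned_room_type"]).mean()

print(f"{match_rate:.2f}")
```

On the full dataset the same expression would give an empirical estimate of the probability, which could then be broken down by hotel type with a `groupby`.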


In [16]:
df.dtypes

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             
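
In the dtypes listing above, `children` is float64 while the other guest-count columns are int64. In pandas this often happens when an integer column contains missing values, though whether that is the cause here is an assumption to verify. A minimal sketch of the check and one possible fix, using a hypothetical three-row `sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in "children".
sample = pd.DataFrame({"adults": [2, 1, 2], "children": [0.0, np.nan, 1.0]})

# Count missing entries before deciding how to handle them.
missing = sample["children"].isna().sum()
print(missing)

# Assumption: a missing count means no children on the booking.
# Dropping those rows instead would be an equally defensible choice.
sample["children"] = sample["children"].fillna(0).astype("int64")
print(sample.dtypes)
```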

In [17]:
df["arrival_date_month"]

0           July
1           July
2           July
3           July
4           July
           ...  
119385    August
119386    August
119387    August
119388    August
119389    August
Name: arrival_date_month, Length: 119390, dtype: object

In [18]:
# Count of arrivals by month, after dropping duplicate rows
df.drop_duplicates().arrival_date_month.value_counts()

August       11257
July         10057
May           8355
April         7908
June          7765
March         7513
October       6934
September     6690
February      6098
December      5131
November      4995
January       4693
Name: arrival_date_month, dtype: int64
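
`value_counts()` sorts by frequency, which hides the seasonal pattern. One way to view the same counts in calendar order is to reindex them, as sketched below. The counts are copied from the output above. The sketch assumes the default English month names from Python's `calendar` module.

```python
import calendar
import pandas as pd

# Monthly arrival counts copied from the output above.
counts = pd.Series({
    "August": 11257, "July": 10057, "May": 8355, "April": 7908,
    "June": 7765, "March": 7513, "October": 6934, "September": 6690,
    "February": 6098, "December": 5131, "November": 4995, "January": 4693,
})

# calendar.month_name[0] is an empty string, so skip it.
month_order = list(calendar.month_name)[1:]

# Reorder the counts from January through December.
by_calendar = counts.reindex(month_order)

print(by_calendar)
```

The reordered series can be passed directly to a bar or line plot to show how arrivals rise toward July and August.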

In [19]:
# Count of non-canceled (0) and canceled (1) bookings, after dropping duplicate rows
df.drop_duplicates().is_canceled.value_counts()

0    63371
1    24025
Name: is_canceled, dtype: int64
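
The two counts above can be turned into a cancellation rate by dividing by their total. The sketch below does this with the printed counts. On the DataFrame itself, `value_counts(normalize=True)` would give the same shares.

```python
import pandas as pd

# Counts copied from the output above (duplicate rows already dropped).
counts = pd.Series({0: 63371, 1: 24025}, name="is_canceled")

# Share of each outcome among the de-duplicated bookings.
shares = counts / counts.sum()

# Label 1 marks canceled bookings.
cancel_rate = shares.loc[1]

print(f"{cancel_rate:.4f}")
```

With these counts, roughly 27.5% of the de-duplicated bookings were canceled.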