# Hotel booking


The dataset contains data from two hotels: one Resort hotel and one City hotel.

From the publication (https://www.sciencedirect.com/science/article/pii/S2352340918315191) we know that both hotels are located in Portugal (southern Europe): "H1 at the resort region of Algarve and H2 at the city of Lisbon". The two locations are ca. 280 km apart by car, and both border the North Atlantic.  

The data contains "bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017".  

Column descriptions:

   - `is_canceled` - Value indicating if the booking was canceled (1) or not (0)
   - `lead_time` - Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
   - `arrival_date_year` - Year of arrival date
   - `arrival_date_month` - Month of arrival date
   - `arrival_date_week_number` - Week number of year for arrival date
   - `arrival_date_day_of_month` - Day of arrival date
   - `stays_in_weekend_nights` - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
   - `stays_in_week_nights` - Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
   - `adults` - Number of adults
   - `children` - Number of children
   - `babies` - Number of babies
   - `meal` - Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
   - `country` - Country of origin. Categories are represented in the ISO 3166 country-code format
   - `market_segment` - Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
   - `distribution_channel` - Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
   - `is_repeated_guest` - Value indicating if the booking name was from a repeated guest (1) or not (0)
   - `previous_cancellations` - Number of previous bookings that were cancelled by the customer prior to the current booking
   - `previous_bookings_not_canceled` - Number of previous bookings not cancelled by the customer prior to the current booking
   - `reserved_room_type` - Code of room type reserved. Code is presented instead of designation for anonymity reasons.
   - `assigned_room_type` - Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
   - `booking_changes` - Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
   - `deposit_type` - Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
   - `agent` - ID of the travel agency that made the booking
   - `company` - ID of the company/entity that made the booking or is responsible for paying for the booking. ID is presented instead of designation for anonymity reasons
   - `days_in_waiting_list` - Number of days the booking was in the waiting list before it was confirmed to the customer
   - `customer_type` - Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking.
   - `adr` - Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
   - `required_car_parking_spaces` - Number of car parking spaces required by the customer
   - `total_of_special_requests` - Number of special requests made by the customer (e.g. twin bed or high floor)  
   - `reservation_status` - Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
   - `reservation_status_date` - Date at which the last status was set. This variable can be used in conjunction with `reservation_status` to understand when the booking was canceled or when the customer checked out of the hotel
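
The three `arrival_date_*` columns can be combined into a single timestamp, which makes date ranges and stay lengths easier to check. The following is a minimal sketch on a hand-made toy frame: the values are invented for illustration, and `month_numbers`, `arrival_date` and `total_nights` are names introduced here, not columns of the dataset.

```python
import pandas as pd

# Toy rows that mimic the date columns described above (values are invented).
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2017],
    "arrival_date_month": ["July", "August"],
    "arrival_date_day_of_month": [1, 31],
    "stays_in_weekend_nights": [0, 2],
    "stays_in_week_nights": [2, 5],
})

# Map English month names to month numbers (January -> 1, ..., December -> 12).
ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December"]
month_numbers = {name: number for number, name in enumerate(ordered_months, start=1)}

# pd.to_datetime assembles timestamps from a frame with year/month/day columns.
toy["arrival_date"] = pd.to_datetime(pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_numbers),
    "day": toy["arrival_date_day_of_month"],
}))

# Total length of stay in nights (hypothetical helper column).
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
print(toy[["arrival_date", "total_nights"]])
```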

### 2. Predicting Cancelations
The goal of this task is to predict whether a customer will show up or cancel the booking. This can help a hotel plan things like staff, supplies, and pricing. It is a classical binary `classification` problem.
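
Before training any classifier it helps to know how balanced the target is, because a model only adds value if it beats the trivial "always predict the majority class" rule. Here is a minimal sketch with invented labels; `toy_target` and `baseline_accuracy` are illustrative names, and the numbers say nothing about the real data.

```python
import pandas as pd

# Invented labels: 1 = canceled, 0 = not canceled.
toy_target = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="is_canceled")

# Share of each class in the target.
class_share = toy_target.value_counts(normalize=True)

# Accuracy of a predictor that always outputs the most frequent class.
baseline_accuracy = class_share.max()

print(class_share)
print("Majority-class baseline accuracy: {:.2f}".format(baseline_accuracy))
```

On the real data the same two lines can be applied to `df["is_canceled"]`.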

### 3. Evaluate Feature importance
Which features are most important to predict cancelations?  
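
One common way to approach this question is permutation importance: shuffle one feature at a time and measure how much the model's score drops. The sketch below uses synthetic data and a random forest purely as an illustration. It is not necessarily the method used later in this notebook, and in practice the importance should be measured on held-out data rather than on the training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: one feature determines the label, the other is pure noise.
rng = np.random.default_rng(0)
n_rows = 500
informative = rng.normal(size=n_rows)
noise = rng.normal(size=n_rows)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean drop in accuracy when each column is shuffled (5 repeats per column).
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = dict(zip(["informative", "noise"], result.importances_mean))
print(importances)
```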

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Use 3 decimal places in output display
pd.set_option("display.precision", 3)
# Don't wrap repr(DataFrame) across additional lines
pd.set_option("display.expand_frame_repr", False)
# Set max rows displayed in output to 250
pd.set_option("display.max_rows", 250)
# Set max columns displayed in output to 250
pd.set_option("display.max_columns", 250)

# Import label encoder 
from sklearn import preprocessing 
from categotical_features_encoding import to_label_encode_cat_data, to_one_hot_encode_cat_data, get_list_of_cat_features
from evaluation import cv_roc_auc_acc, model_evaluation_classification, cv_rmse_mae, model_evaluation_regression
from preprocessing import time_features_encoding

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold, KFold, train_test_split, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, r2_score, median_absolute_error, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
import xgboost as xgb
import lightgbm as lgb


import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

## Intro: Preprocessing

In [None]:
df = pd.read_csv("../Hotel_Booking_Demand/data/hotel_bookings.csv")
print("Data shape: rows {} and columns {}".format(df.shape[0], df.shape[1]))

In [None]:
df.head().T

In [None]:
df.describe().T

Interestingly, the minimum value for `adults` is zero, meaning that we have some bookings without adults, which does not make much sense. We are going to explore this later.

In [None]:
df.info()

In [None]:
# Null values
df.isnull().sum()

In [None]:
# Room type
print(df.reserved_room_type.unique())
print("Number of unique values in 'reserved_room_type' column is {}".format(len(df.reserved_room_type.unique())))

In [None]:
# 'company' has the most missing values, so let's have a look
print(df.company.unique())
print("Number of unique values in 'company' column is {}".format(len(df.company.unique())))

In [None]:
# 'agent' has the second most missing values, so let's have a look
print(df.agent.unique())
print("Number of unique values in 'agent' column is {}".format(len(df.agent.unique())))

In [None]:
# Country
print(df.country.unique())
print("Number of unique values in 'country' column is {}".format(len(df.country.unique())))

We have missing values in 4 columns. The imputation strategy for the NaNs is the following:

1. `agent` is replaced by 0: if no agency is given, we assume the booking was made without one.
2. `company` is also replaced by 0: if none is given, the booking was most likely private.
3. `country` is replaced by "Unknown".
4. `children` is replaced by 0.

In [None]:
nan_replace_dict = {"company": 0, "country": "Unknown", "agent": 0, "children": 0}

df = df.fillna(nan_replace_dict)

In [None]:
# Let's have a more detailed look at other columns, for example 'meal'

print(df.meal.unique())
print("Number of unique values in 'meal' column is {}".format(len(df.meal.unique())))

'Undefined' means the same as 'SC' (no meal package was ordered), so we can merge the two categories.

In [None]:
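# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify df in newer pandas versions
df["meal"] = df["meal"].replace("Undefined", "SC")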

Now we go back to the zeros in the `adults` column.

In [None]:
df.loc[df["adults"] == 0].T

We can see that some rows have zeros in `adults`, `children` and `babies` all at once. Let's drop these rows, since they do not seem to be correct.

In [None]:
ghost_bookings = df.loc[(df["adults"] + df["children"] + df["babies"]) == 0].index
print("Number of ghost booking rows: {}".format(len(ghost_bookings)))

In [None]:
# ghost_bookings already holds index labels, so pass them to drop() directly
df.drop(ghost_bookings, inplace=True)
df.shape

## EDA

In [None]:
# Let's separate the Resort and City hotels for a while for easier analysis/plotting.
# To get the actual visitor numbers, only bookings that were not canceled are included.
# .copy() gives independent frames, so columns can be added later without a SettingWithCopyWarning.
rh = df.loc[(df["hotel"] == "Resort Hotel") & (df["is_canceled"] == 0)].copy()
ch = df.loc[(df["hotel"] == "City Hotel") & (df["is_canceled"] == 0)].copy()

### Let's answer the question: 1) Where do the guests come from?

In [None]:
# Count non-canceled bookings per country; naming the Series explicitly keeps the
# column name stable across pandas versions
country_data = df.loc[df["is_canceled"] == 0, "country"].value_counts().rename("Number of Guests").to_frame()
total_guests = country_data["Number of Guests"].sum()
country_data["Guests in percentages"] = round(country_data["Number of Guests"] / total_guests * 100, 2)
country_data["country"] = country_data.index

#Plot
fig = px.pie(country_data,
             values="Number of Guests",
             names="country",
             title="Home country of Bookings",
             template="seaborn")

fig.update_traces(textposition="inside", textinfo="value+percent+label")
fig.show()

In [None]:
# display on the map

bookings_map = px.choropleth(
                            country_data,
                            locations=country_data.index,
                            color=country_data["Guests in percentages"],
                            hover_name=country_data.index,
                            color_continuous_scale=px.colors.sequential.Plasma,
                            title="Home country of Bookings"
                            )
bookings_map.show()

We can see that most of the bookings come from Western European countries.

###  2) What is the price per night?

In [None]:
# The dataset provides the Average Daily Rate (adr), so let's look at
# the ADR per person (adults and children only; babies are excluded)
df["adr_pp"] = df["adr"] / (df["adults"] + df["children"])
rh["adr_pp"] = rh["adr"] / (rh["adults"] + rh["children"])
ch["adr_pp"] = ch["adr"] / (ch["adults"] + ch["children"])

In [None]:
print("""From all non-cancelled bookings, across all room types and meals, the average prices are: Resort hotel: {:.2f} euro 
per night and person. City hotel: {:.2f} euro per night and person.""".format(rh["adr_pp"].mean(), ch["adr_pp"].mean()))

In [None]:
# Let's keep only actual (non-canceled) bookings
df_bookings = df.loc[df["is_canceled"] == 0]
room_prices = df_bookings[["hotel", "reserved_room_type", "adr_pp"]].sort_values("reserved_room_type")

plt.figure(figsize=(12, 8))
sns.boxplot(x="reserved_room_type",
            y="adr_pp",
            hue="hotel",
            data=room_prices,
            hue_order=["City Hotel", "Resort Hotel"])

plt.title("Price of room types per night and person", fontsize=16)
plt.xlabel("Room type", fontsize=16)
plt.ylabel("Price [EUR] per person", fontsize=16)
plt.legend(loc="upper right")
plt.ylim(0, 160)
plt.show()

### 3) How does the price per night vary over the year?  
Let's have a look at the average price per night and person, regardless of the room type and meal.  

In [None]:
room_price_monthly = df[["hotel", "arrival_date_month", "adr_pp"]].sort_values("arrival_date_month")

# order by month:
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]

# See https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Categorical.html
room_price_monthly["arrival_date_month"] = pd.Categorical(room_price_monthly["arrival_date_month"], categories=ordered_months, ordered=True)

# Line plot with the standard deviation as the error band
plt.figure(figsize=(12,8))
sns.lineplot(x="arrival_date_month", y="adr_pp", hue="hotel", data=room_price_monthly,  
             hue_order = ["City Hotel", "Resort Hotel"], ci="sd", size="hotel", sizes=(2.5, 2.5))
plt.title("Room price per night (per person) over the year", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Price [EUR] per person", fontsize=16)
plt.show()

This clearly shows that the prices in the Resort hotel are much higher during the summer (no surprise here).   
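
The seasonal pattern can also be read off as numbers rather than from the plot. Below is a minimal sketch on an invented toy frame that shares the column names of `room_price_monthly`. The prices are made up, and `monthly_mean` is a name introduced here for illustration.

```python
import pandas as pd

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December"]

# Invented prices per person; only the column names follow the notebook.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "Resort Hotel",
              "City Hotel", "City Hotel", "City Hotel"],
    "arrival_date_month": ["August", "January", "August",
                           "August", "January", "January"],
    "adr_pp": [90.0, 30.0, 110.0, 50.0, 40.0, 44.0],
})

# An ordered categorical makes the months sort by calendar order, not alphabetically.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=ordered_months, ordered=True
)

# Mean price per hotel and month, with one column per hotel.
monthly_mean = (
    toy.groupby(["hotel", "arrival_date_month"], observed=True)["adr_pp"]
       .mean()
       .unstack("hotel")
)
print(monthly_mean)
```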

### 4) Which months are the busiest?

In [None]:
resort_guest_monthly = rh.groupby("arrival_date_month")["hotel"].count()
city_guest_monthly = ch.groupby("arrival_date_month")["hotel"].count()

resort_guest_data = pd.DataFrame({"Month": list(resort_guest_monthly.index),
                                  "Hotel": "Resort Hotel",
                                  "Guests": list(resort_guest_monthly.values)
                                 })

city_guest_data = pd.DataFrame({"Month": list(city_guest_monthly.index),
                                  "Hotel": "City Hotel",
                                  "Guests": list(city_guest_monthly.values)
                                 })
guest_data = pd.concat([resort_guest_data, city_guest_data], ignore_index=True)

# it would actually make sense to order by month
ordered_months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]

guest_data["Month"] = pd.Categorical(guest_data["Month"], categories=ordered_months, ordered=True)

# Normalize the counts: July and August occur in 3 years of data, the other months in 2
guest_data["Guests"] = guest_data["Guests"].astype(float)
summer_mask = guest_data["Month"].isin(["July", "August"])
guest_data.loc[summer_mask, "Guests"] /= 3
guest_data.loc[~summer_mask, "Guests"] /= 2

# show figure:
plt.figure(figsize=(12,8))
sns.lineplot(x="Month", y="Guests", hue="Hotel", data=guest_data,
            hue_order = ["City Hotel", "Resort Hotel"], size="Hotel", sizes=(2.5, 2.5))
plt.title("Average number of hotel guests per month", fontsize=16)
plt.xlabel("Month", fontsize=16)
plt.xticks(rotation=45)
plt.ylabel("Number of guests", fontsize=16)
plt.show()

The City hotel has more guests during spring and autumn, when its prices are also higher. In July and August it has fewer visitors (most probably they go to the resort instead), although prices are lower. Guest numbers for the Resort hotel go down slightly from June to September, which is also when the prices are highest.  
Both hotels have the fewest guests during the winter.
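
The per-month averages depend on how many years each month is covered by the data. Instead of hard-coding those counts, they can be derived from the data itself. The following is a minimal sketch on invented rows; `years_per_month` and `avg_bookings_per_month` are illustrative names, and note that each row is a booking, not an individual guest.

```python
import pandas as pd

# Invented non-canceled bookings: July appears in three years, March in two.
toy = pd.DataFrame({
    "arrival_date_year":  [2015, 2016, 2016, 2017, 2016, 2017, 2017],
    "arrival_date_month": ["July", "July", "July", "July", "March", "March", "March"],
})

# Total bookings per calendar month across all years.
bookings_per_month = toy.groupby("arrival_date_month").size()

# Number of distinct years in which each month occurs.
years_per_month = toy.groupby("arrival_date_month")["arrival_date_year"].nunique()

# Average bookings per month and year; both Series share the same month index.
avg_bookings_per_month = bookings_per_month / years_per_month
print(avg_bookings_per_month)
```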