# Hotel Cancellation Forecast - Project
The accommodation industry was a 4.1 trillion dollar industry in 2021.
In today's fast-paced world, consumers expect more flexibility in their stays. "Free Cancellation" offers help large booking websites like Booking.com and Hotels.com stay competitive by giving consumers that flexibility.
However, these offers bring a familiar problem back to the table: booking cancellations.
In this project we aim to forecast booking cancellations accurately, so that hotels and booking websites can anticipate them and act to prevent losses and maximize capacity.
##### By Oriel Perets & Dafna Meron


-------

#### Project setup
1. Importing dependencies
    * NumPy
    * pandas
2. Importing data
    * CSV --> DataFrame


In [5]:
import numpy as np
import pandas as pd

df = pd.read_csv('hotel_bookings.csv')

#### Visualizing data

----------

### Converting values 
* lead_time -> intervals
* customer_type -> integer

#### CustomerType

In [None]:
customer_type = df['customer_type']
interval = []
for cus in customer_type:
    if cus == 'Transient':
        interval.append(0)
    elif cus == 'Transient-Party':
        interval.append(1)
    elif cus == 'Contract':
        interval.append(2)
    elif cus == 'Group':
        interval.append(3)
df['t_CustomerType'] = interval
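
The same encoding can also be written as a dictionary lookup. The sketch below runs on a small made-up frame (`sample` and `customer_codes` are illustrative names, not part of the project data) and counts any category the dictionary does not cover, since those would otherwise turn into missing values.

```python
import pandas as pd

# Toy stand-in for df['customer_type']; the values mirror the categories above.
sample = pd.DataFrame(
    {'customer_type': ['Transient', 'Group', 'Contract', 'Transient-Party']}
)

# Same category -> integer codes as in the loop above.
customer_codes = {'Transient': 0, 'Transient-Party': 1, 'Contract': 2, 'Group': 3}
sample['t_CustomerType'] = sample['customer_type'].map(customer_codes)

# Categories missing from the dictionary become NaN, so count them explicitly.
n_unmapped = int(sample['t_CustomerType'].isna().sum())
print(sample)
print('unmapped rows:', n_unmapped)
```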

#### LeadTime

In [None]:
df['lead_time'].hist()

In [None]:
lead_time = df['lead_time']
converted = []
# Bucket lead time (days) into three intervals: 0-100, 101-199, 200+
for lt in lead_time:
    if lt >= 0 and lt <= 100:
        converted.append(0)
    elif lt > 100 and lt < 200:
        converted.append(1)
    elif lt >= 200:
        converted.append(2)

df['t_LeadTime'] = converted
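
A more compact way to bucket lead times is `pd.cut`. This is a minimal sketch, assuming the three intervals (0-100, 101-199, 200+) are meant to receive the distinct codes 0, 1 and 2, and that lead times are whole days; `sample` is a made-up frame used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy lead times (in days) covering the edges of each interval.
sample = pd.DataFrame({'lead_time': [0, 100, 101, 199, 200, 500]})

# Right-closed bins: (-1, 100], (100, 199], (199, inf).
# For integer lead times this matches 0-100, 101-199 and 200+.
sample['t_LeadTime'] = pd.cut(
    sample['lead_time'],
    bins=[-1, 100, 199, np.inf],
    labels=[0, 1, 2],
).astype(int)

print(sample)
```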

#### Months into integer & seasons

In [None]:
month = df['arrival_date_month']
# months to numbers map
dct = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
t_month = list(map(dct.get, month))

# add to dataframe
df['t_ArrivalMonth'] = t_month

# months to seasons map: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
season_dct = dict(zip(range(1,13), seasons))
t_seasons = list(map(season_dct.get, t_month))
df['t_ArrivalSeasons'] = t_seasons

#### Dist Channel to Integer

In [None]:
df['distribution_channel'].hist()

In [None]:
channel = df['distribution_channel']
# 'Undefined' is folded into TA/TO (code 1) for size considerations
dct = {'TA/TO': 1, 'Undefined': 1, 'Corporate': 2, 'Direct': 3, 'GDS': 4}
t_channel = list(map(dct.get, channel))
df['t_Dist'] = t_channel
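
A dict literal that lists the same key twice silently keeps only the last value, which matters when a category such as `'Undefined'` is meant to share a code with another one. The sketch below shows that behavior and then applies a mapping with unique keys to a made-up series (`channel` and `channel_codes` here are illustrative, not the project data).

```python
import pandas as pd

# A repeated key is not an error: the last value wins.
dup = {'Undefined': 1, 'Undefined': 5}
print(dup)

# Mapping with unique keys, merging 'Undefined' into the TA/TO code.
channel_codes = {'TA/TO': 1, 'Undefined': 1, 'Corporate': 2, 'Direct': 3, 'GDS': 4}
channel = pd.Series(['TA/TO', 'Direct', 'Undefined', 'GDS', 'Corporate'])
encoded = channel.map(channel_codes)
print(encoded.tolist())
```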

#### Previous Cancellations

In [None]:
prev_cancel = df['previous_cancellations']
converted = []
for i in prev_cancel:
    if i == 0:
        converted.append(0)
    else:
        converted.append(1)
df['t_PrevCancellations'] = converted

#### Resort / City Hotels

In [None]:
# Hotel into binary 0/1
hotel = df['hotel']
dct = {'Resort Hotel': 0, 'City Hotel': 1}
t_hotel = list(map(dct.get, hotel))
df['t_Hotel'] = t_hotel

In [None]:
import seaborn as sns
data = df['t_Hotel']
sns.histplot(data)

#### Deposit type

In [None]:
# Histogram of the data
df['deposit_type'].hist()

In [None]:
# Converting the deposit type to integers
deposit = df['deposit_type']
dct = {'No Deposit': 0, 'Refundable': 1, 'Non Refund': 2}
t_deposit = list(map(dct.get, deposit))
df['t_DepositType'] = t_deposit

#### Repeated Guests
##### Incorporated as is (0/1 values); the classes are unbalanced.

#### Family/Party size in total

In [None]:
# Missing values in children: NaN != NaN, so `i != i` detects them
children = df['children']
new_children = []
for i in children:
    if i != i:
        new_children.append(0)
    else:
        new_children.append(i)

# sum of party (adults, children, babies)
party = df['adults'] + new_children + df['babies']
t_party = []
for i in party:
    t_party.append(int(i))

df['t_Party'] = t_party
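
The party-size computation can also be expressed with `fillna`. This is a minimal sketch on a made-up frame (`sample` is illustrative), treating a missing `children` value as zero, as the loop above does.

```python
import numpy as np
import pandas as pd

# Toy bookings; the second row has a missing children count.
sample = pd.DataFrame({
    'adults': [2, 1, 2],
    'children': [0.0, np.nan, 2.0],
    'babies': [0, 0, 1],
})

# Replace missing children with 0, then sum the three columns per booking.
sample['t_Party'] = (
    sample['adults'] + sample['children'].fillna(0) + sample['babies']
).astype(int)

print(sample)
```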

## Model 

In [None]:
# General dependencies + Scikit learn RF classifier
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Splitting data into train and test datasets
part = np.random.rand(len(df)) < 0.8
train = df[part]
test = df[~part]
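
The random mask above gives an approximately 80/20 split that changes on every run. As an alternative sketch, not the split used in this project, scikit-learn's `train_test_split` gives an exact, reproducible split and can keep the cancellation rate equal in both parts via `stratify`. The frame `toy` and its `feature` column are synthetic and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bookings table.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'is_canceled': rng.integers(0, 2, size=1000),
})

# Exact 80/20 split, reproducible via random_state,
# with the share of cancellations preserved in both parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['is_canceled']
)

train_rate = train_toy['is_canceled'].mean()
test_rate = test_toy['is_canceled'].mean()
print(len(train_toy), len(test_toy), train_rate, test_rate)
```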

In [None]:
df.info()

In [None]:
# Preparing training and test data
cols = ['t_LeadTime', 't_CustomerType','t_ArrivalMonth','t_Dist','t_PrevCancellations', 't_Hotel', 't_ArrivalSeasons', 't_DepositType', 'is_repeated_guest', 't_Party']
x_train = train[cols]
y = train['is_canceled']
x_test = test[cols]

In [None]:
# Random Forest (classifier, since is_canceled is a binary label)
from sklearn.ensemble import RandomForestClassifier
m = RandomForestClassifier(n_estimators=200, random_state=0)

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
m = GaussianNB(priors=None, var_smoothing=1e-09)

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression  
m = LogisticRegression(penalty='l2')

In [None]:
# Evaluating the model (m is whichever model cell above was run last)
scores = cross_val_score(m, x_train, y, cv = 10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
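
Because each model cell reassigns `m`, only one model is scored per run. The sketch below shows one way to score all three classifiers in a single loop. It uses synthetic data from `make_classification` rather than the hotel bookings, and the smaller `n_estimators`, `cv=5` and `max_iter=1000` settings are assumptions chosen to keep the example fast, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (stand-in for x_train / y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

# Cross-validated accuracy for each model, stored as (mean, std).
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = (scores.mean(), scores.std())
    print("%s accuracy: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))
```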