# Predicting Hotel Cancellations

## 🏨 Background

You are supporting a hotel on a project that aims to increase revenue from room bookings. The hotel believes it can use data science to reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

## The Data

They have provided you with their bookings data in a file called `hotel_bookings.csv`, which contains the following:

| Column     | Description              |
|------------|--------------------------|
| `Booking_ID` | Unique identifier of the booking. |
| `no_of_adults` | The number of adults. |
| `no_of_children` | The number of children. |
| `no_of_weekend_nights` | Number of weekend nights (Saturday or Sunday). |
| `no_of_week_nights` | Number of week nights (Monday to Friday). |
| `type_of_meal_plan` | Type of meal plan included in the booking. |
| `required_car_parking_space` | Whether a car parking space is required. |
| `room_type_reserved` | The type of room reserved. |
| `lead_time` | Number of days before the arrival date the booking was made. |
| `arrival_year` | Year of arrival. |
| `arrival_month` | Month of arrival. |
| `arrival_date` | Date of the month for arrival. |
| `market_segment_type` | How the booking was made. |
| `repeated_guest` | Whether the guest has previously stayed at the hotel. |
| `no_of_previous_cancellations` | Number of previous cancellations. |
| `no_of_previous_bookings_not_canceled` | Number of previous bookings that were not canceled. |
| `avg_price_per_room` | Average price per day of the booking. |
| `no_of_special_requests` | Count of special requests made as part of the booking. |
| `booking_status` | Whether the booking was cancelled or not. |

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

## The Challenge

* Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.

In [1]:
import pandas as pd
hotels = pd.read_csv('data/hotel_bookings.csv')
hotels.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,,,,,,,,,,,,,,,,,,Not_Canceled
1,INN00002,2.0,0.0,2.0,3.0,Not Selected,0.0,Room_Type 1,5.0,2018.0,11.0,6.0,Online,0.0,0.0,0.0,106.68,1.0,Not_Canceled
2,INN00003,1.0,0.0,2.0,1.0,Meal Plan 1,0.0,Room_Type 1,1.0,2018.0,2.0,28.0,Online,0.0,0.0,0.0,60.0,0.0,Canceled
3,INN00004,2.0,0.0,0.0,2.0,Meal Plan 1,0.0,Room_Type 1,211.0,2018.0,5.0,20.0,Online,0.0,0.0,0.0,100.0,0.0,Canceled
4,INN00005,2.0,0.0,1.0,1.0,Not Selected,0.0,Room_Type 1,48.0,2018.0,4.0,11.0,Online,0.0,0.0,0.0,94.5,0.0,Canceled


In [2]:
hotels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          35862 non-null  float64
 2   no_of_children                        35951 non-null  float64
 3   no_of_weekend_nights                  35908 non-null  float64
 4   no_of_week_nights                     35468 non-null  float64
 5   type_of_meal_plan                     35749 non-null  object 
 6   required_car_parking_space            33683 non-null  float64
 7   room_type_reserved                    35104 non-null  object 
 8   lead_time                             35803 non-null  float64
 9   arrival_year                          35897 non-null  float64
 10  arrival_month                         35771 non-null  float64
 11  arrival_date   

In [3]:
hotels['type_of_meal_plan'].value_counts()

Meal Plan 1     27421
Not Selected     5057
Meal Plan 2      3266
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64

In [4]:
def meal_plan_convert(row):
    if row.type_of_meal_plan == 'Not Selected':
        return 0
    elif row.type_of_meal_plan == 'Meal Plan 1':
        return 1
    elif row.type_of_meal_plan == 'Meal Plan 2':
        return 2
    else:
        return 3

hotels['type_of_meal_plan'] = hotels.apply(meal_plan_convert, axis=1)
hotels.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,,,,,3,,,,,,,,,,,,,Not_Canceled
1,INN00002,2.0,0.0,2.0,3.0,0,0.0,Room_Type 1,5.0,2018.0,11.0,6.0,Online,0.0,0.0,0.0,106.68,1.0,Not_Canceled
2,INN00003,1.0,0.0,2.0,1.0,1,0.0,Room_Type 1,1.0,2018.0,2.0,28.0,Online,0.0,0.0,0.0,60.0,0.0,Canceled
3,INN00004,2.0,0.0,0.0,2.0,1,0.0,Room_Type 1,211.0,2018.0,5.0,20.0,Online,0.0,0.0,0.0,100.0,0.0,Canceled
4,INN00005,2.0,0.0,1.0,1.0,0,0.0,Room_Type 1,48.0,2018.0,4.0,11.0,Online,0.0,0.0,0.0,94.5,0.0,Canceled
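Note that the `else` branch in `meal_plan_convert` also catches missing values, which is why the first row (all `NaN`) is now coded as 3, the same code as `Meal Plan 3`. A minimal sketch of an alternative, assuming missing meal plans should stay missing until the imputation step; the toy Series and the name `meal_plan_codes` are illustrative, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotels['type_of_meal_plan'], including one missing value
meal_plan = pd.Series(['Not Selected', 'Meal Plan 1', np.nan, 'Meal Plan 3'])

# Explicit mapping: anything not listed here (including NaN) stays NaN
meal_plan_codes = {'Not Selected': 0, 'Meal Plan 1': 1, 'Meal Plan 2': 2, 'Meal Plan 3': 3}
encoded = meal_plan.map(meal_plan_codes)

print(encoded.tolist())
```

With this approach the missing meal plans would still show up in `isna().sum()` and could be handled together with the other columns.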


In [5]:
hotels['room_type_reserved'].value_counts()

Room_Type 1    27234
Room_Type 4     5851
Room_Type 6      939
Room_Type 2      664
Room_Type 5      256
Room_Type 7      154
Room_Type 3        6
Name: room_type_reserved, dtype: int64

In [6]:
room_type_reserved_mapping = {'Room_Type 1':1, 'Room_Type 2':2, 'Room_Type 3':3, 'Room_Type 4':4, 'Room_Type 5':5, 'Room_Type 6':6,'Room_Type 7':7, }
hotels['room_type_reserved'] = hotels['room_type_reserved'].map(room_type_reserved_mapping)
hotels.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,,,,,3,,,,,,,,,,,,,Not_Canceled
1,INN00002,2.0,0.0,2.0,3.0,0,0.0,1.0,5.0,2018.0,11.0,6.0,Online,0.0,0.0,0.0,106.68,1.0,Not_Canceled
2,INN00003,1.0,0.0,2.0,1.0,1,0.0,1.0,1.0,2018.0,2.0,28.0,Online,0.0,0.0,0.0,60.0,0.0,Canceled
3,INN00004,2.0,0.0,0.0,2.0,1,0.0,1.0,211.0,2018.0,5.0,20.0,Online,0.0,0.0,0.0,100.0,0.0,Canceled
4,INN00005,2.0,0.0,1.0,1.0,0,0.0,1.0,48.0,2018.0,4.0,11.0,Online,0.0,0.0,0.0,94.5,0.0,Canceled


In [7]:
hotels['market_segment_type'].value_counts()

Online           22264
Offline          10076
Corporate         1926
Complementary      375
Aviation           122
Name: market_segment_type, dtype: int64

In [8]:
market_segment_type_mapping = {'Online':1, 'Offline':2, 'Corporate':3, 'Complementary':4, 'Aviation':5}
hotels['market_segment_type'] = hotels['market_segment_type'].map(market_segment_type_mapping)
hotels['market_segment_type'].value_counts()

1.0    22264
2.0    10076
3.0     1926
4.0      375
5.0      122
Name: market_segment_type, dtype: int64
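Integer codes for `room_type_reserved` and `market_segment_type` imply an ordering (Corporate coded 3 reads as "more" than Online coded 1) that linear and distance-based models will treat as meaningful. One possible alternative is one-hot encoding. A minimal sketch on a toy frame, assuming the original string labels are still available; the `segment` prefix is an arbitrary choice:

```python
import pandas as pd

# Toy frame with one nominal column and one numeric column
toy = pd.DataFrame({
    'market_segment_type': ['Online', 'Offline', 'Corporate', 'Online'],
    'lead_time': [5, 1, 211, 48],
})

# One indicator column per category; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['market_segment_type'], prefix='segment')
print(encoded.columns.tolist())
```

Tree-based models are usually less affected by the arbitrary ordering, so this matters mostly for the linear, KNN and SVC models.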

In [9]:
hotels['booking_status'].value_counts()

Not_Canceled    24390
Canceled        11885
Name: booking_status, dtype: int64

In [10]:
booking_status_mapping = {'Not_Canceled':0, 'Canceled':1}
hotels['booking_status'] = hotels['booking_status'].map(booking_status_mapping)
hotels['booking_status'].value_counts()

0    24390
1    11885
Name: booking_status, dtype: int64

In [11]:
hotels.isna().sum()

Booking_ID                                 0
no_of_adults                             413
no_of_children                           324
no_of_weekend_nights                     367
no_of_week_nights                        807
type_of_meal_plan                          0
required_car_parking_space              2592
room_type_reserved                      1171
lead_time                                472
arrival_year                             378
arrival_month                            504
arrival_date                             981
market_segment_type                     1512
repeated_guest                           586
no_of_previous_cancellations             497
no_of_previous_bookings_not_canceled     550
avg_price_per_room                       460
no_of_special_requests                   789
booking_status                             0
dtype: int64

In [12]:
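# Fill missing values with column means; numeric_only skips the text column Booking_ID,
# which recent pandas versions would otherwise reject when computing the mean
hotels.fillna(hotels.mean(numeric_only=True), inplace=True)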
hotels = hotels.round()
hotels.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2.0,0.0,1.0,2.0,3,0.0,2.0,85.0,2018.0,7.0,16.0,1.0,0.0,0.0,0.0,103.0,1.0,0
1,INN00002,2.0,0.0,2.0,3.0,0,0.0,1.0,5.0,2018.0,11.0,6.0,1.0,0.0,0.0,0.0,107.0,1.0,0
2,INN00003,1.0,0.0,2.0,1.0,1,0.0,1.0,1.0,2018.0,2.0,28.0,1.0,0.0,0.0,0.0,60.0,0.0,1
3,INN00004,2.0,0.0,0.0,2.0,1,0.0,1.0,211.0,2018.0,5.0,20.0,1.0,0.0,0.0,0.0,100.0,0.0,1
4,INN00005,2.0,0.0,1.0,1.0,0,0.0,1.0,48.0,2018.0,4.0,11.0,1.0,0.0,0.0,0.0,94.0,0.0,1
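Mean imputation followed by rounding is reasonable for counts, but for coded categories the rounded mean need not be a typical value: the first row above receives `room_type_reserved` 2.0 even though Room_Type 1 is by far the most common. A hedged alternative, assuming median for numeric columns and the most frequent value for coded categories is acceptable; the column split below is illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: one numeric column and one integer-coded category, each with a gap
toy = pd.DataFrame({
    'lead_time': [5.0, 1.0, np.nan, 48.0],
    'room_type_reserved': [1.0, 1.0, 4.0, np.nan],
})

numeric_cols = ['lead_time']            # assumed numeric features
category_cols = ['room_type_reserved']  # assumed coded categories

# Median for numeric columns, most frequent value for coded categories
toy[numeric_cols] = SimpleImputer(strategy='median').fit_transform(toy[numeric_cols])
toy[category_cols] = SimpleImputer(strategy='most_frequent').fit_transform(toy[category_cols])
print(toy)
```

To keep test data out of the fill values, the imputers could also be fitted on the training split only and then applied to the test split.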


In [13]:
hotels = hotels.drop('Booking_ID', axis=1)

In [14]:
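# Features are every column except the target; booking_status is the target
X = hotels.drop(columns='booking_status')
y = hotels['booking_status']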

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 0)
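About a third of the bookings are cancelled (11,885 of 36,275), so a purely random split can shift that ratio slightly between train and test. A small sketch of a stratified split, using a synthetic target with roughly the same class balance; `stratify` is a standard `train_test_split` argument, while the toy arrays are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a target with about 33% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 330 + [0] * 670)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```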

In [21]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred)
f1_lr = f1_score(y_test, y_pred)

print(accuracy_lr, f1_lr)

0.7853533033171001 0.6406153846153846
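Logistic regression, like the KNN, SVC and SGD models that follow, is sensitive to feature scale, and columns such as `lead_time` and `arrival_year` sit on very different ranges. A minimal sketch of standardizing inside a pipeline, on synthetic data from `make_classification`; whether it would change the scores reported in this notebook is untested:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hotel features and target
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# The scaler is fitted on the training split only, then reused on the test split
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
y_hat = scaled_lr.predict(X_te)

print(accuracy_score(y_te, y_hat), f1_score(y_te, y_hat))
```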


In [23]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy_knn = accuracy_score(y_test, y_pred)
f1_knn = f1_score(y_test, y_pred)

print(accuracy_knn, f1_knn)

0.7960121290085455 0.6625114016418364


In [26]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

accuracy_svc = accuracy_score(y_test, y_pred)
f1_svc = f1_score(y_test, y_pred)

print(accuracy_svc, f1_svc)

0.762933014793715 0.532608695652174


In [33]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred)
f1_dt = f1_score(y_test, y_pred)

print(accuracy_dt, f1_dt)

0.8537168060277497 0.7764044943820224


In [28]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred)
f1_rf = f1_score(y_test, y_pred)

print(accuracy_rf, f1_rf)

0.892492878801801 0.8269742679680568


In [29]:
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier()
gbt.fit(X_train, y_train)
y_pred = gbt.predict(X_test)

accuracy_gbt = accuracy_score(y_test, y_pred)
f1_gbt = f1_score(y_test, y_pred)

print(accuracy_gbt, f1_gbt)

0.8441606174767987 0.7412267317668599


In [30]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(X_train, y_train)
y_pred = sgd.predict(X_test)

accuracy_sgd = accuracy_score(y_test, y_pred)
f1_sgd = f1_score(y_test, y_pred)

print(accuracy_sgd, f1_sgd)

0.7579711476614904 0.49655963302752293


In [31]:
from xgboost import XGBClassifier

# Use a separate name so the estimator does not shadow the xgboost module
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)

accuracy_xgb = accuracy_score(y_test, y_pred)
f1_xgb = f1_score(y_test, y_pred)

print(accuracy_xgb, f1_xgb)

0.8852338509602132 0.8174777144527253


In [34]:
models = pd.DataFrame({
    'Model': ['Random Forest', 
              'Knn', 
              'Decision Tree', 
              'Logistic Regression', 
              'SVC',
              'SGD',
              'XGB',
              'GradientBoosting'
                ],
    'F1_Score': [f1_rf,
                 f1_knn,
                 f1_dt,
                 f1_lr,
                 f1_svc,
                 f1_sgd,
                 f1_xgb,
                 f1_gbt
              ],
      'Accuracy':[accuracy_rf,
                 accuracy_knn,
                 accuracy_dt,
                 accuracy_lr,
                 accuracy_svc,
                 accuracy_sgd,
                 accuracy_xgb,
                 accuracy_gbt]})
models.sort_values(by='Accuracy', ascending=False)

Unnamed: 0,Model,F1_Score,Accuracy
0,Random Forest,0.826974,0.892493
6,XGB,0.817478,0.885234
2,Decision Tree,0.776404,0.853717
7,GradientBoosting,0.741227,0.844161
1,Knn,0.662511,0.796012
3,Logistic Regression,0.640615,0.785353
4,SVC,0.532609,0.762933
5,SGD,0.49656,0.757971
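A comparison table like this can also be built in a loop, so that each score stays tied to the model that produced it. A sketch under assumptions: synthetic data and only two candidate models, with `candidates` and `results` as illustrative names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hotel data
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Fit each model once and record both metrics in the same row
rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'Model': name,
        'F1_Score': f1_score(y_te, y_hat),
        'Accuracy': accuracy_score(y_te, y_hat),
    })

results = pd.DataFrame(rows).sort_values(by='Accuracy', ascending=False)
print(results)
```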


In [None]:
## Random Forest has the best accuracy and F1 above, so tune it with a grid search

In [35]:
from sklearn.model_selection import GridSearchCV

param_grid={'n_estimators':[30,50,80,120,200,300],'max_depth':[5,10,15,20,25,30,50]}
grid_search=GridSearchCV(RandomForestClassifier(),param_grid,cv=5, n_jobs=2)
 
grid_search.fit(X_train,y_train)
 
grid_search.best_params_,grid_search.best_score_

({'max_depth': 25, 'n_estimators': 300}, 0.8908319524339383)
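By default `GridSearchCV` ranks parameter combinations with the estimator's own `score` method, which is accuracy for classifiers. Because F1 is reported alongside accuracy throughout, one option is to search on F1 directly. A small sketch on synthetic data with a deliberately tiny grid; the grid values are placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Tiny grid so the sketch runs quickly
toy_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}

# scoring='f1' ranks candidates by F1 on the positive class instead of accuracy
search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid, cv=3, scoring='f1')
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```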

In [37]:
rf = RandomForestClassifier(max_depth=25, n_estimators=300)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred)
f1_rf = f1_score(y_test, y_pred)

print(accuracy_rf, f1_rf)

0.8936874023706699 0.8286687398193395
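Since the challenge asks which factors affect cancellations, a natural next step is to inspect the fitted forest's feature importances. The sketch below uses synthetic data with borrowed column names, so its numbers say nothing about the real bookings; in this notebook the same idea would use the tuned `rf` and `X_train.columns`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; the column names are borrowed for readability only
feature_names = ['lead_time', 'avg_price_per_room', 'no_of_special_requests', 'arrival_month']
X_arr, y_toy = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances show which features the forest relies on, not the direction of the effect, so they are a starting point for recommendations rather than a conclusion.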


## Put predictions into the dataset

In [38]:
y_pred = rf.predict(X_test)
y_pred

array([0, 1, 1, ..., 0, 0, 1])

In [39]:
# Convert the 0/1 predictions back to the original text labels
y_pred_text = ['Not_Canceled' if p == 0 else 'Canceled' for p in y_pred]
y_pred_text

['Not_Canceled',
 'Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Not_Canceled',
 'Not_Canceled',
 'Canceled',
 'Canceled',
 'Not_Canc