# Hotel Reservation Cancellation Predictor 

Hotel Reservation Cancellation Predictor is a machine learning solution, based on Logistic Regression, that predicts the probability of guests cancelling their hotel reservations from their booking information. Cancellation of scheduled hotel stays is a significant challenge for hospitality companies. Cancellations lead to erroneous demand estimation, room pricing, and revenue management. The solution helps hospitality companies maximize occupancy and revenue per available room.

## Contents

1. Prerequisites
2. Data Dictionary
3. Import Libraries
4. Load Input Data
5. Create Model
6. Predict Test Datapoints
7. Saving Prediction

## Prerequisites

To run this notebook, you need to have the following packages installed:

- `pandas` to read and save CSV files.
- `sklearn` (installed as `scikit-learn`) to train the model and generate predictions.

## Data Dictionary

- The input must be a `.csv` file with UTF-8 encoding.
- PLEASE NOTE: If your input `.csv` file is not UTF-8 encoded, the model will not perform as expected.
- Required Features: `hotel`, `is_canceled`, `lead_time`, `arrival_date_year`, `arrival_date_month`, `arrival_date_week_number`, `arrival_date_day_of_month`, `stays_in_weekend_nights`, `stays_in_week_nights`, `adults`, `children`, `babies`, `meal`, `country`, `market_segment`, `distribution_channel`, `is_repeated_guest`, `previous_cancellations`, `previous_bookings_not_canceled`, `reserved_room_type`, `assigned_room_type`, `booking_changes`, `deposit_type`, `days_in_waiting_list`, `customer_type`, `adr`, `required_car_parking_spaces`, `total_of_special_requests`, `reservation_status_date`.
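
The input requirements above can be checked before loading the data. The following is a minimal sketch, not part of the original notebook: the helper name `check_input_csv` is hypothetical, and `REQUIRED_COLUMNS` is shortened to a few of the listed features for illustration.

```python
import csv

# Shortened for illustration; extend this to the full list of required features.
REQUIRED_COLUMNS = ["hotel", "is_canceled", "lead_time", "adr"]


def check_input_csv(path, required=REQUIRED_COLUMNS):
    """Return the required columns that are missing from the CSV header.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    # encoding="utf-8" makes Python fail loudly on bytes that are not UTF-8
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
        # Read the remaining lines so encoding errors anywhere in the file surface
        for _ in handle:
            pass
    return [name for name in required if name not in header]
```

An empty list from `check_input_csv('train.csv')` would mean every column in the checked list is present.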

## Import Libraries

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

## Load Input Data

In [3]:
# load the dataset
train_df = pd.read_csv('train.csv')
train_df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,reserved_room_type,assigned_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,C,C,3,No Deposit,0,Transient,0.0,0,0,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,C,C,4,No Deposit,0,Transient,0.0,0,0,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,A,C,0,No Deposit,0,Transient,75.0,0,0,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,A,A,0,No Deposit,0,Transient,75.0,0,0,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,A,A,0,No Deposit,0,Transient,98.0,0,1,2015-07-03


In [4]:
# check for missing values
train_df.isnull().sum()

hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests   

## Create Model

To decide on the best model, I will look at precision and recall for class 1 (canceled bookings) and overall accuracy as model metrics, as well as interpretability.
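
The notebook does not show how these metrics are computed, so here is a minimal plain-Python sketch of their definitions. The function name and the example labels are made up for illustration. In practice, `sklearn.metrics` offers `precision_score`, `recall_score`, and `accuracy_score` for the same purpose.

```python
def positive_class_metrics(y_true, y_pred):
    """Precision and recall for class 1 (canceled), plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # canceled, predicted canceled
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # kept, predicted canceled
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # canceled, predicted kept
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Made-up labels for illustration only: 1 = canceled, 0 = not canceled
example = positive_class_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```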

In [5]:
# Categorical columns, including the target `is_canceled`
cat_cols = ['is_canceled', 'arrival_date_month', 'meal', 'market_segment', 'distribution_channel',
            'reserved_room_type', 'is_repeated_guest', 'deposit_type', 'customer_type']

# Cast to the category dtype so that get_dummies one-hot encodes these columns
train_df[cat_cols] = train_df[cat_cols].astype('category')

# Numeric columns
num_cols = ['lead_time', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights',
            'stays_in_week_nights', 'adults', 'children', 'babies', 'previous_cancellations',
            'previous_bookings_not_canceled', 'required_car_parking_spaces', 'total_of_special_requests', 'adr']

# Keep only the columns used for modeling
model_df = train_df[cat_cols + num_cols]

In [6]:
X = pd.get_dummies(model_df.drop(columns=['is_canceled']))
y = model_df['is_canceled']
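
To illustrate what `pd.get_dummies` does to a categorical column, here is a plain-Python sketch of one-hot encoding. It is not the pandas implementation: the helper `one_hot` is hypothetical, and it emits 0/1 integers, whereas the dtype pandas returns may differ by version. The two rows reuse `lead_time` and `reserved_room_type` values from the preview above.

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            # Same naming pattern as get_dummies: <column>_<category>
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


rows = [
    {"lead_time": 342, "reserved_room_type": "C"},
    {"lead_time": 7, "reserved_room_type": "A"},
]
encoded = one_hot(rows, "reserved_room_type")
```

Because the indicator columns depend on the categories present in the data, new data encoded separately may end up with a different set of columns than the training data.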

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()

sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
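
The scaler above is fit on the training split only and then applied to both splits, so no statistics from the test data leak into training. As a rough sketch of the arithmetic for a single column (not scikit-learn's implementation), the example below uses the `lead_time` values from the preview rows as training data and two made-up test values.

```python
import math


def fit_scaler(train_values):
    """Return (mean, std) computed from the training values only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = math.sqrt(variance)  # population standard deviation (divides by n)
    # Assumption: fall back to 1.0 for a constant column to avoid dividing by zero
    return mean, (std if std else 1.0)


def transform(values, mean, std):
    """Standardize values with statistics learned from the training data."""
    return [(v - mean) / std for v in values]


train_lead_time = [342, 737, 7, 13, 14]  # lead_time values from the preview rows
test_lead_time = [100, 20]               # made-up values for illustration

mean, std = fit_scaler(train_lead_time)
train_std = transform(train_lead_time, mean, std)
test_std = transform(test_lead_time, mean, std)  # reuses the training mean and std
```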

### Logistic Regression

In [8]:
lr = LogisticRegression()
lr.fit(X_train_std, y_train)

LogisticRegression()

## Predict Test Datapoints

In [9]:
predictions = lr.predict(X_test_std)
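
Note that `lr.predict` returns hard class labels (0 or 1). If the cancellation probability itself is needed, scikit-learn's `LogisticRegression` also provides `predict_proba`. The sketch below shows the underlying calculation for a binary logistic model in plain Python. The weights, intercept, and feature values are hypothetical and are not taken from the fitted model.

```python
import math


def cancel_probability(features, weights, intercept):
    """Probability of class 1 (canceled) under a binary logistic model."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid maps the score into (0, 1)


def predict_label(probability, threshold=0.5):
    """Turn a probability into a hard 0/1 label."""
    return 1 if probability > threshold else 0


# Hypothetical coefficients and standardized feature values, for illustration only
weights = [0.8, -0.5]
intercept = -0.2
probability = cancel_probability([1.5, 0.0], weights, intercept)
label = predict_label(probability)
```

Keeping the probability rather than only the label lets the threshold be adjusted later, for example to trade precision against recall for canceled bookings.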

## Saving Prediction

In [10]:
# Keep the feature names so the columns in the saved file are identifiable
test_df = pd.DataFrame(X_test_std, columns=X_test.columns)

In [11]:
test_df['is_cancelled_prediction'] = predictions

In [12]:
test_df.to_csv("output.csv", index=False)