                        Case Study: Predictive Analytics in Hotel Booking Management

Objective: 
This case study aims to equip you with practical skills in data science, focusing on predicting customer 
behaviors and booking cancellations in the hotel industry. You will perform exploratory data analysis (EDA), 
apply KNN and Decision Tree algorithms, and learn to handle class imbalance using SMOTE.

Dataset overview: 

The dataset has several features (independent variables), such as: 

Booking_ID: unique identifier of each booking
no_of_adults: Number of adults
no_of_children: Number of Children
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or 
booked to stay at the hotel
no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to 
stay at the hotel
type_of_meal_plan: Type of meal plan booked by the customer
required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
room_type_reserved: Type of room reserved by the customer. The values are ciphered 
(encoded) by INN Hotels.
lead_time: Number of days between the date of booking and the arrival date
arrival_year: Year of arrival date
arrival_month: Month of arrival date
arrival_date: Day of the month of the arrival date
market_segment_type: Market segment designation.
repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
no_of_previous_cancellations: Number of previous bookings that were canceled by the 
customer prior to the current booking
no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the 
customer prior to the current booking
avg_price_per_room: Average price per day of the reservation; prices of the rooms are 
dynamic. (in euros)
no_of_special_requests: Total number of special requests made by the customer (e.g. high 
floor, view from the room, etc)
booking_status: Flag indicating if the booking was canceled or not. This is the target variable.

Importing libraries 

In [14]:
import pandas as pd 
import numpy as np 
from ydata_profiling import ProfileReport
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


Loading Dataset

In [5]:
hotel_data= pd.read_csv(r"C:\Users\HP.Com\Desktop\HAMZA\atomcamp_work_hamza\ML_module\session4\Hotel Reservations.csv")
hotel_data.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled


In [20]:
hotel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date   

In [6]:
profile = ProfileReport(hotel_data)
profile.to_notebook_iframe()


### Data Preprocessing:

##### Handle missing or anomalous data.
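
A quick way to check for missing values and duplicate rows is sketched below. The small `sample` frame is a made-up stand-in for `hotel_data` (its values are illustrative only, and one gap and one duplicate are planted on purpose); the same two calls can be run on the real DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for hotel_data; values are illustrative only
sample = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003", "INN00003"],
    "lead_time": [224, np.nan, 1, 1],
    "avg_price_per_room": [65.0, 106.68, 60.0, 60.0],
})

# Count missing values per column
missing_per_column = sample.isnull().sum()

# Count fully duplicated rows (every column identical to an earlier row)
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On the real data, both counts being zero would mean no imputation or de-duplication is needed.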

##### Convert categorical variables into numerical formats using encoding techniques (e.g., one-hot encoding, label encoding).

1) type_of_meal_plan -- one-hot encoding
2) room_type_reserved -- one-hot encoding
3) market_segment_type -- one-hot encoding
4) booking_status -- label encoding
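
To make the idea of one-hot encoding concrete before applying it to the full dataset, here is a minimal sketch on a toy column. It uses `pd.get_dummies` purely to keep the illustration short; the notebook itself uses scikit-learn's `OneHotEncoder`. The `toy` frame is hypothetical and only borrows category names from `market_segment_type`.

```python
import pandas as pd

# Toy column using category names from the dataset; rows are illustrative only
toy = pd.DataFrame({"market_segment_type": ["Online", "Offline", "Online", "Corporate"]})

# One-hot encoding: one indicator column per category, named <column>_<category>
one_hot = pd.get_dummies(toy, columns=["market_segment_type"], dtype=float)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one indicator set to 1.0, so no artificial ordering is imposed on the categories.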

In [23]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Preparing the encoders
one_hot_encoder = OneHotEncoder()
label_encoder = LabelEncoder()

# Applying one-hot encoding to nominal columns
nominal_columns = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoded_nominal_data = pd.DataFrame()

for col in nominal_columns:
    # Transform and integrate into the new DataFrame
    encoded_data = one_hot_encoder.fit_transform(hotel_data[[col]]).toarray()
    encoded_col_names = one_hot_encoder.get_feature_names_out([col])
    encoded_nominal_data = pd.concat([encoded_nominal_data, pd.DataFrame(encoded_data, columns=encoded_col_names)], axis=1)

# Applying label encoding to the binary column
encoded_binary_column = label_encoder.fit_transform(hotel_data['booking_status'])

# Adding the encoded binary column to the DataFrame
encoded_nominal_data['booking_status'] = encoded_binary_column

# Display the last few rows of the transformed data
encoded_nominal_data.tail()


Unnamed: 0,type_of_meal_plan_Meal Plan 1,type_of_meal_plan_Meal Plan 2,type_of_meal_plan_Meal Plan 3,type_of_meal_plan_Not Selected,room_type_reserved_Room_Type 1,room_type_reserved_Room_Type 2,room_type_reserved_Room_Type 3,room_type_reserved_Room_Type 4,room_type_reserved_Room_Type 5,room_type_reserved_Room_Type 6,room_type_reserved_Room_Type 7,market_segment_type_Aviation,market_segment_type_Complementary,market_segment_type_Corporate,market_segment_type_Offline,market_segment_type_Online,booking_status
36270,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
36271,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
36272,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
36273,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
36274,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
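
When reading the 0/1 values in `booking_status`, it helps to know which label received which integer. `LabelEncoder` sorts the classes and numbers them in that order, so the mapping can be inspected as in the sketch below. The `statuses` list is a small illustrative sample, not the real column.

```python
from sklearn.preprocessing import LabelEncoder

# The two labels that appear in booking_status (illustrative sample)
statuses = ["Not_Canceled", "Canceled", "Canceled", "Not_Canceled"]

encoder = LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ is sorted, and each label's code is its position in classes_
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

With these two labels, `Canceled` maps to 0 and `Not_Canceled` maps to 1, which matters later when interpreting precision and recall for the cancellation class.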


#### Drop categorical data from the original data

In [28]:
hotel_data1 = hotel_data.drop(['type_of_meal_plan', 'room_type_reserved', 'market_segment_type','booking_status'], axis=1)

In [30]:
hotel_data1.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,required_car_parking_space,lead_time,arrival_year,arrival_month,arrival_date,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
0,INN00001,2,0,1,2,0,224,2017,10,2,0,0,0,65.0,0
1,INN00002,2,0,2,3,0,5,2018,11,6,0,0,0,106.68,1
2,INN00003,1,0,2,1,0,1,2018,2,28,0,0,0,60.0,0
3,INN00004,2,0,0,2,0,211,2018,5,20,0,0,0,100.0,0
4,INN00005,2,0,1,1,0,48,2018,4,11,0,0,0,94.5,0
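
Instead of typing the column names by hand, the non-numeric columns could also be found programmatically. The sketch below assumes a hypothetical miniature frame `mini` with a few of the dataset's columns; `Booking_ID` is excluded from the drop list to match the cell above, where it is kept.

```python
import pandas as pd

# Hypothetical miniature of hotel_data with a few of its columns
mini = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002"],
    "no_of_adults": [2, 2],
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected"],
    "avg_price_per_room": [65.0, 106.68],
    "booking_status": ["Not_Canceled", "Not_Canceled"],
})

# Non-numeric columns are the candidates for encoding
text_columns = mini.select_dtypes(exclude="number").columns.tolist()

# Booking_ID is an identifier rather than a category, so it is not dropped here
columns_to_drop = [c for c in text_columns if c != "Booking_ID"]
numeric_part = mini.drop(columns=columns_to_drop)

print(columns_to_drop)
print(numeric_part.columns.tolist())
```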


Merging hotel_data1 (the original data with the categorical columns removed) with the DataFrame that holds the encoded columns

In [33]:
data = pd.concat([hotel_data1,encoded_nominal_data], axis=1)
data.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,required_car_parking_space,lead_time,arrival_year,arrival_month,arrival_date,...,room_type_reserved_Room_Type 4,room_type_reserved_Room_Type 5,room_type_reserved_Room_Type 6,room_type_reserved_Room_Type 7,market_segment_type_Aviation,market_segment_type_Complementary,market_segment_type_Corporate,market_segment_type_Offline,market_segment_type_Online,booking_status
0,INN00001,2,0,1,2,0,224,2017,10,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
1,INN00002,2,0,2,3,0,5,2018,11,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,INN00003,1,0,2,1,0,1,2018,2,28,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,INN00004,2,0,0,2,0,211,2018,5,20,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
4,INN00005,2,0,1,1,0,48,2018,4,11,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
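
`pd.concat` with `axis=1` aligns rows on the index rather than on position, so the merge above is safe only because both frames share the same default index. The following sketch, using two small hypothetical frames, shows a check that could be run before merging and what happens when the indexes differ.

```python
import pandas as pd

# Two hypothetical frames sharing a default 0..n-1 index,
# playing the roles of hotel_data1 and encoded_nominal_data
left = pd.DataFrame({"lead_time": [224, 5, 1]})
right = pd.DataFrame({"booking_status": [1, 1, 0]})

# Sanity check: the indexes should match exactly before a column-wise concat
indexes_match = left.index.equals(right.index)
merged = pd.concat([left, right], axis=1)

# If the indexes differ, unmatched rows are filled with NaN
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(indexes_match, merged.shape, int(merged.isnull().sum().sum()))
print(misaligned.shape, int(misaligned.isnull().sum().sum()))
```

A merged frame with more rows than either input, or with unexpected NaN values, is a sign that the indexes were not aligned.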


In our final "data", there were no null values and no duplicates to remove, and we did not apply any normalization or standardization. Our pre-processing step consisted only of encoding the categorical columns.

## Exploratory Data Analysis 