
# Project: Hotel Booking Cancellation Prediction



## Column Descriptions


1. **Booking_ID**: Unique identifier for each booking.
2. **number of adults**: The number of adults included in the booking.
3. **number of children**: The number of children included in the booking.
4. **number of weekend nights**: The number of weekend nights (Friday and Saturday) included in the booking.
5. **number of week nights**: The number of week nights (Monday to Thursday) included in the booking.
6. **type of meal**: The meal plan selected by the guest (e.g., Meal Plan 1, Not Selected).
7. **car parking space**: Whether a car parking space was requested or included in the booking (1 = Yes, 0 = No).
8. **room type**: The type of room booked by the guest (e.g., Room_Type 1).
9. **lead time**: The number of days between the booking date and the check-in date.
10. **market segment type**: The channel through which the booking was made (e.g., Online, Offline).
11. **repeated**: Whether the guest is a repeat customer (1 = Yes, 0 = No).
12. **P-C**: The number of previous bookings the customer canceled before the current booking.
13. **P-not-C**: The number of previous bookings the customer did not cancel before the current booking.
14. **average price**: The average price per night for the booking.
15. **special requests**: The number of special requests made by the guest (e.g., room preferences, amenities).
16. **date of reservation**: The date the reservation was made (MM/DD/YYYY format).
17. **booking status**: Whether the booking was canceled (Canceled or Not_Canceled).

Source: https://www.kaggle.com/datasets/youssefaboelwafa/hotel-booking-cancellation-prediction/data
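
The descriptions above note that the reservation date is stored as MM/DD/YYYY text and that booking status is a two-valued label, so both usually need converting before modelling. Below is a minimal sketch of one way to do that, assuming a small hand-made frame whose values are copied from the sample rows of the dataset. The `reservation_date` and `is_canceled` column names are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Toy frame with values taken from the first rows of the dataset.
sample = pd.DataFrame({
    "date of reservation": ["10/2/2015", "11/6/2018", "2/28/2018"],
    "booking status": ["Not_Canceled", "Not_Canceled", "Canceled"],
})

# Parse month/day/year strings; errors="coerce" turns any entry that
# does not match the format into NaT instead of raising.
sample["reservation_date"] = pd.to_datetime(
    sample["date of reservation"], format="%m/%d/%Y", errors="coerce"
)

# Encode the target as 1 (Canceled) / 0 (Not_Canceled).
# "is_canceled" is a hypothetical name chosen for this sketch.
sample["is_canceled"] = (sample["booking status"] == "Canceled").astype(int)

print(sample[["reservation_date", "is_canceled"]])
```

Using `errors="coerce"` is a design choice for this sketch: it lets malformed dates surface as missing values that can be counted and handled explicitly.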

# Data Cleaning and Feature Engineering

In [9]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, classification_report
import warnings

warnings.filterwarnings('ignore')

In [10]:
# Load data
data = pd.read_csv('/content/sample_data/booking.csv')
data.head()

| | Booking_ID | number of adults | number of children | number of weekend nights | number of week nights | type of meal | car parking space | room type | lead time | market segment type | repeated | P-C | P-not-C | average price | special requests | date of reservation | booking status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 1 | 1 | 2 | 5 | Meal Plan 1 | 0 | Room_Type 1 | 224 | Offline | 0 | 0 | 0 | 88.0 | 0 | 10/2/2015 | Not_Canceled |
| 1 | INN00002 | 1 | 0 | 1 | 3 | Not Selected | 0 | Room_Type 1 | 5 | Online | 0 | 0 | 0 | 106.68 | 1 | 11/6/2018 | Not_Canceled |
| 2 | INN00003 | 2 | 1 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 1 | Online | 0 | 0 | 0 | 50.0 | 0 | 2/28/2018 | Canceled |
| 3 | INN00004 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | Online | 0 | 0 | 0 | 100.0 | 1 | 5/20/2017 | Canceled |
| 4 | INN00005 | 1 | 0 | 1 | 2 | Not Selected | 0 | Room_Type 1 | 48 | Online | 0 | 0 | 0 | 77.0 | 0 | 4/11/2018 | Canceled |


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36285 entries, 0 to 36284
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Booking_ID                36285 non-null  object 
 1   number of adults          36285 non-null  int64  
 2   number of children        36285 non-null  int64  
 3   number of weekend nights  36285 non-null  int64  
 4   number of week nights     36285 non-null  int64  
 5   type of meal              36285 non-null  object 
 6   car parking space         36285 non-null  int64  
 7   room type                 36285 non-null  object 
 8   lead time                 36285 non-null  int64  
 9   market segment type       36285 non-null  object 
 10  repeated                  36285 non-null  int64  
 11  P-C                       36285 non-null  int64  
 12  P-not-C                   36285 non-null  int64  
 13  average price             36285 non-null  float64
 14  specia
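
The `info()` summary mixes text columns (such as the meal plan and room type) with numeric ones, and the two groups usually need different preprocessing. Below is a minimal sketch of separating them with `select_dtypes`, assuming a toy frame with a few of the dataset's columns. The `categorical_cols` and `numeric_cols` names are hypothetical.

```python
import pandas as pd

# Toy frame with a handful of the dataset's columns and sample values.
toy = pd.DataFrame({
    "type of meal": ["Meal Plan 1", "Not Selected"],
    "lead time": [224, 5],
    "average price": [88.0, 106.68],
    "booking status": ["Not_Canceled", "Not_Canceled"],
})

# Numeric columns are the int64/float64 ones; everything else is
# treated as categorical text in this sketch.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```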

In [12]:
# Drop the Booking_ID column: it is a unique identifier and provides no meaningful information for analysis or modeling
data.drop(['Booking_ID'], axis=1, inplace=True)
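
Before discarding an identifier column, it can be worth confirming that it really is unique per row, since repeated IDs would point to duplicate bookings. Below is a minimal sketch of that check, assuming a toy frame with made-up rows. The `id_is_unique` and `features` names are hypothetical, and the sketch returns a new frame rather than using `inplace=True`.

```python
import pandas as pd

# Toy frame standing in for the loaded booking data.
toy = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "lead time": [224, 5, 1],
})

# True when every booking ID appears exactly once.
id_is_unique = toy["Booking_ID"].is_unique
print(f"Booking_ID unique per row: {id_is_unique}")

# drop(columns=...) returns a new frame and leaves `toy` untouched.
features = toy.drop(columns=["Booking_ID"])
print(features.columns.tolist())
```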

In [13]:
# Standardise the column names: trim, lower-case, and replace runs of whitespace or hyphens with underscores
data.columns = data.columns.str.strip().str.lower().str.replace(r'[\s-]+', '_', regex=True)
print(f"Column names after standardisation: {data.columns}")

Column names after standardisation: Index(['number_of_adults', 'number_of_children', 'number_of_weekend_nights',
       'number_of_week_nights', 'type_of_meal', 'car_parking_space',
       'room_type', 'lead_time', 'market_segment_type', 'repeated', 'p_c',
       'p_not_c', 'average_price', 'special_requests', 'date_of_reservation',
       'booking_status'],
      dtype='object')
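
To see what each step of the standardisation does, the same chain can be run on a few column names in isolation. This is a minimal sketch, assuming a hand-made `pd.Index`. The stray leading and trailing spaces are added here for illustration and are not taken from the dataset.

```python
import pandas as pd

# A few raw names; the surrounding spaces are hypothetical.
raw = pd.Index([" Booking_ID", "number of adults", "P-C", "P-not-C ", "date of reservation"])

clean = (
    raw.str.strip()                                 # remove leading/trailing whitespace
       .str.lower()                                 # lower-case everything
       .str.replace(r"[\s-]+", "_", regex=True)     # whitespace/hyphen runs -> "_"
)

print(list(clean))

# Normalisation can merge distinct names (e.g. "P C" and "P-C"),
# so check that the result is still unique.
print("unique after cleaning:", clean.is_unique)
```

The raw-string prefix (`r"..."`) keeps `\s` from being read as a string escape, which newer Python versions warn about.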
