# <p style = "font-size : 42px; color : #000000 ; font-family : 'Oregon'; text-align : center; background-color : #dba514; border-radius: 5px 5px;"><strong>Hotel Booking Cancellation Prediction</strong></p>

<p align="center">
  <img style = "border:5px solid #ffb037;" src="https://5.imimg.com/data5/EF/GO/MY-17287433/hotel-bookings-500x500.jpg" alt="Project Banner" width="1000"/>
</p>

### <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>Problem Statement</strong></p> 

<p style = "font-size : 16px; color : #ff9900; font-family : 'Comic Sans MS';">
    <strong>
        Problem: The client wants to reduce hotel booking cancellations.
        <br>Solution: Use EDA techniques to find relationships between features, hidden patterns, and actionable insights in the data.
        <br>Then use machine learning algorithms to predict cancellations so that targeted marketing strategies can be implemented.
    </strong>
</p> 

### <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>About Data</strong></p> 

<p style = "font-size : 16px; color : #ff9900; font-family : 'Comic Sans MS';">
    <strong>
        The data includes detailed information on hotel bookings, covering customer demographics, booking patterns, and reservation specifics. 
        <br>Key attributes include booking status, stay duration, guest count, booking channel, room assignment, and any special requests. 
        <br>It is suitable for analyzing booking trends, customer behaviors, and factors influencing cancellations and modifications.
    </strong>
</p> 

## <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>Data Collection</strong></p> 

<p style = "font-size : 16px; color : #ff9900; font-family : 'Comic Sans MS';">
    <strong>
        The hotel booking data is stored in a MySQL database; we fetch it from there with a SQL query.
    </strong>
</p> 
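
The connection string passed to `create_engine` in this notebook embeds the user name and password directly. As a minimal sketch, assuming the credentials are instead supplied through environment variables, the URL could be assembled as shown here. The helper `build_mysql_url` and the `HOTEL_DB_*` variable names are hypothetical and not part of the original notebook; the defaults simply mirror the host, port, and database name used in the connection string.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Assemble a SQLAlchemy-style MySQL URL from environment variables.

    The variable names (HOTEL_DB_USER, ...) are hypothetical; the defaults
    mirror the host, port, and database name used in this notebook.
    """
    user = env["HOTEL_DB_USER"]
    # quote_plus escapes characters such as '@' or '/' that would otherwise
    # break the URL if they appeared in the password
    password = quote_plus(env["HOTEL_DB_PASSWORD"])
    host = env.get("HOTEL_DB_HOST", "127.0.0.1")
    port = env.get("HOTEL_DB_PORT", "3306")
    database = env.get("HOTEL_DB_NAME", "projects_db")
    return f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{database}"
```

The returned string can be passed to `create_engine` in place of the literal URL, which keeps the password out of the notebook and its saved outputs.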

### <a id = '0.1'></a>
<p style = "font-size : 40px; color : #f9858b ; font-family : 'Calibri'; text-align : center; background-color : #bdfff6; border-radius: 5px 5px;"><strong>Importing Libraries</strong></p> 

In [16]:
import warnings
warnings.filterwarnings("ignore")

# database connection
from sqlalchemy import create_engine

# data manipulation
import pandas as pd
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from folium.plugins import HeatMap

# style
plt.style.use('fivethirtyeight')
%matplotlib inline
pd.set_option('display.max_columns', 40)

### <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>Load Dataset</strong></p> 

In [3]:
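# create a SQLAlchemy engine for the local MySQL database
engine = create_engine("mysql+mysqlconnector://projects:AIMLprojects1@127.0.0.1:3306/projects_db")
# open a connection up front so a bad host or bad credentials fail here
conn = engine.connect()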

In [4]:
query = "SELECT * FROM hotel_booking"

In [9]:
df = pd.read_sql(query, engine)
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,name,email,phone-number,credit_card
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,Ernest Barnes,Ernest.Barnes31@outlook.com,669-792-1661,************4322
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,Andrea Baker,Andrea_Baker94@aol.com,858-637-6955,************9157
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,Rebecca Parker,Rebecca_Parker@comcast.net,652-885-2745,************3734
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,Laura Murray,Laura_M@gmail.com,364-656-8427,************5677
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,Linda Hines,LHines@verizon.com,713-226-5883,************5498


### <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>Look into Data</strong></p> 

In [20]:
df.shape

(119386, 36)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119386 entries, 0 to 119385
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119386 non-null  object 
 1   is_canceled                     119386 non-null  int64  
 2   lead_time                       119386 non-null  int64  
 3   arrival_date_year               119386 non-null  int64  
 4   arrival_date_month              119386 non-null  object 
 5   arrival_date_week_number        119386 non-null  int64  
 6   arrival_date_day_of_month       119386 non-null  int64  
 7   stays_in_weekend_nights         119386 non-null  int64  
 8   stays_in_week_nights            119386 non-null  int64  
 9   adults                          119386 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119386 non-null  int64  
 12  meal            

In [18]:
# Summary of numerical features
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
is_canceled,119386.0,0.370395,0.482913,0.0,0.0,0.0,1.0,1.0
lead_time,119386.0,104.014801,106.863286,0.0,18.0,69.0,160.0,737.0
arrival_date_year,119386.0,2016.156593,0.707456,2015.0,2016.0,2016.0,2017.0,2017.0
arrival_date_week_number,119386.0,27.165003,13.605334,1.0,16.0,28.0,38.0,53.0
arrival_date_day_of_month,119386.0,15.798553,8.780783,1.0,8.0,16.0,23.0,31.0
stays_in_weekend_nights,119386.0,0.927605,0.998618,0.0,0.0,1.0,2.0,19.0
stays_in_week_nights,119386.0,2.50031,1.908289,0.0,1.0,2.0,3.0,50.0
adults,119386.0,1.85639,0.579261,0.0,2.0,2.0,2.0,55.0
children,119386.0,0.10389,0.398561,0.0,0.0,0.0,0.0,10.0
babies,119386.0,0.007949,0.097438,0.0,0.0,0.0,0.0,10.0


In [19]:
# Summary of categorical features
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
hotel,119386,2,City Hotel,79326
arrival_date_month,119386,12,August,13873
meal,119386,5,BB,92306
country,119386,178,PRT,48586
market_segment,119386,7,Online TA,56476
distribution_channel,119386,5,TA/TO,97870
reserved_room_type,119386,10,A,85994
assigned_room_type,119386,12,A,74053
deposit_type,119386,3,No Deposit,104637
agent,119386,334,9.0,31960


### <p style = "font-size : 25px; color : #ff0099; font-family : 'Comic Sans MS'; "><strong>Explore Data</strong></p> 

In [21]:
# define numerical & categorical columns
numeric_features = [col for col in df.columns if df[col].dtype != 'O']
categorical_features = [col for col in df.columns if df[col].dtype == 'O']

# print columns
print(f'We have {len(numeric_features)} numerical features : {numeric_features}')
print(f'\nWe have {len(categorical_features)} categorical features : {categorical_features}')

We have 18 numerical features : ['is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces', 'total_of_special_requests']

We have 18 categorical features : ['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'agent', 'company', 'customer_type', 'reservation_status', 'reservation_status_date', 'name', 'email', 'phone-number', 'credit_card']
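
The dtype-based split above can also be written with `DataFrame.select_dtypes`, which selects columns by dtype family instead of comparing each column's dtype to `'O'`. The sketch below runs on a small toy frame with made-up values (only the column names are borrowed from the dataset), so it illustrates the technique without reproducing the notebook's actual result. Note that either approach classifies by storage type only: columns such as `agent` and `company` hold numeric IDs but are stored as objects here, so they still end up in the categorical list.

```python
import pandas as pd

# toy frame with made-up values; column names are borrowed from the dataset
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel"],
    "is_canceled": [0, 1],
    "lead_time": [7, 342],
    "meal": ["BB", "HB"],
    "adr": [75.0, 98.0],
})

# "number" matches every integer and float dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# everything else (here: the string columns) is treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```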


In [22]:
# proportion of count data on categorical columns
for col in categorical_features:
    print(df[col].value_counts(normalize=True) * 100)
    print('- ' * 50)

hotel
City Hotel      66.444977
Resort Hotel    33.555023
Name: proportion, dtype: float64
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
arrival_date_month
August       11.620290
July         10.605096
May           9.876367
October       9.347830
April         9.288359
June          9.162716
September     8.801702
March         8.203642
February      6.757911
November      5.690785
December      5.679058
January       4.966244
Name: proportion, dtype: float64
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
meal
BB           77.317273
HB           12.114486
SC            8.920644
Undefined     0.979177
FB            0.668420
Name: proportion, dtype: float64
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
country
PRT    40.696564
GBR    10.159483
FRA     8.723803
ESP     7.176721
DEU     6.103731
         ...    
DJI     0.00083

<p style = "font-size : 16px; color : #ff9900; font-family : 'Comic Sans MS';">
    <strong>
        Insights:
    </strong>
</p> 