
# **HOTEL BOOKING ANALYSIS**




##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 -** Dharmendra Yadav
##### **Team Member 2 -** Kratika Jawariya
##### **Team Member 3 -** Pranita Tiwari


# **Project Summary**

**Introduction**

The hospitality industry is highly dynamic, with customer preferences, market trends, and the competitive landscape constantly evolving. In this context, data-driven decision-making becomes crucial for hotels to stay competitive and optimize their operations effectively. Exploratory Data Analysis (EDA) on hotel booking datasets can provide valuable insights into customer behavior, revenue generation, and seasonal trends, empowering hotels to make informed decisions and gain a competitive advantage. This notebook discusses the key aspects of conducting EDA on a hotel booking dataset and its potential impact on decision-making in the hospitality industry.

**Data Wrangling: Preparing the Dataset**

The first step in EDA involves data wrangling tasks to ensure the dataset is in a suitable format for analysis. Data preprocessing techniques are applied to handle missing values, remove duplicate rows, and change data types. By cleaning the dataset, we ensure the quality and integrity of the data used for analysis. These data wrangling steps set the foundation for a reliable and meaningful exploration of hotel bookings.

**Customer Demographics: Understanding Customer Segments**

An essential aspect of EDA is analyzing customer demographics, such as nationality, market segments, and customer types (transient, contract, group). By identifying patterns in customer behavior, hotels can tailor marketing strategies and target specific customer segments effectively. Insights into customer preferences and behavior enable hotels to enhance guest experiences and build customer loyalty, leading to improved customer retention and positive word-of-mouth.

**Booking Patterns: Uncovering Behavioral Insights**

EDA also focuses on understanding booking patterns, including lead time, booking duration, and booking changes. These insights provide a deeper understanding of customer preferences, enabling hotels to optimize pricing strategies and enhance the booking process. By analyzing lead time and booking duration, hotels can predict demand and allocate resources efficiently. Moreover, understanding booking changes allows hotels to identify potential pain points in the booking process and improve customer satisfaction.

**Distribution Channels: Optimizing Revenue Streams**

The analysis delves into different distribution channels, such as online travel agents (OTA), direct bookings, and global distribution systems (GDS). By exploring revenue generation and booking cancellations from various channels, hotels can optimize their distribution strategies. Strengthening partnerships with high-performing channels and addressing cancellation issues can lead to improved revenue streams and increased bookings. Understanding channel performance also aids in effective marketing strategies, as hotels can focus their efforts on channels that yield better results.

**Cancellations: Minimizing Revenue Loss**

Cancellation analysis is crucial for identifying reasons behind booking cancellations and factors contributing to revenue loss. Factors such as waiting time, lead time, and room type allocation can influence cancellation percentages. By recognizing significant cancellation rates for specific distribution channels (e.g., OTA), hotels can take corrective measures to reduce cancellations and potential revenue loss. Implementing effective cancellation policies and targeted communication with customers can enhance booking retention.

**Revenue Generation: Optimizing Average Daily Rates (ADR)**

EDA allows hotels to compare Average Daily Rates (ADR) for different hotels, distribution channels, and months. Understanding revenue generation patterns helps hotels implement pricing strategies that maximize revenue potential. Analyzing ADR across months helps identify seasonal trends, allowing hotels to capitalize on peak months by adjusting pricing and managing inventory effectively.

**Seasonal Trends: Effective Resource Allocation**

Exploring seasonal trends in guest numbers, revenue, and ADR provides insights that help hotels allocate resources efficiently. Identifying peak months enables hotels to manage staff levels, tailor marketing campaigns, and optimize overall operational efficiency during high-demand periods. By preparing for seasonal variations, hotels can improve service quality and enhance guest satisfaction.

**Conclusion**

Exploratory Data Analysis on hotel booking datasets offers valuable insights to enhance decision-making in the hospitality industry. By understanding customer demographics, booking patterns, distribution channels, cancellations, revenue generation, and seasonal trends, hotels can make data-driven decisions that optimize pricing strategies, improve customer satisfaction, and maximize revenue potential. EDA empowers hotels to stay competitive, capitalize on market opportunities, and continuously adapt to evolving customer preferences and market dynamics. The comprehensive report resulting from the EDA process serves as a guide for hotel management to implement effective strategies, resulting in enhanced operational efficiency and overall success in the competitive hospitality industry.






# **GitHub link**

**Pranita Tiwari** - https://github.com/pranitatiwari29/EDA-Hotel-booking-analysis---Pranita.git

# **Problem Statement**


**BIVARIATE ANALYSIS**

Q.1 Does the length of stay have an impact on the ADR?

Q.2 Which hotel has the higher lead time?

Q.3 Which hotel has the longer waiting time?

Q.4 Which hotel has the higher booking cancellation rate?

Q.5 Which hotel has a higher chance that its customers will return for another stay?

Q.6 Which is the most common channel for booking hotels?

Q.7 Which channel is mostly used for early booking of hotels?

Q.8 Which channel has the longest average waiting time?

Q.9 Which distribution channel brings better revenue-generating deals for hotels?

Q.10 Does not getting the same room as reserved affect the ADR?

Q.11 How does the booking percentage vary among the three categories of customer: single, couple, and family/friends?

Q.12 Can we predict whether a hotel is likely to receive a disproportionately high number of special requests?

Q.13 How do special requests vary with the number of kids?

**Univariate analysis**

Q.14 Which agent makes the most bookings?

Q.15 Which room type is in most demand, and which room type generates the highest ADR?

Q.16 Which meal type is most preferred by customers?

Q.17 What is the percentage of bookings in each hotel?

Q.18 What is the preferred stay length in each hotel?

Q.19 Which are the busiest months?

Q.20 Which month yields the highest revenue?

Q.21 How do monthly bookings compare across both types of hotel?

Q.22 What is the yearly booking trend in both types of hotel?

Q.23 From which country have the most guests arrived?

Q.24 How long do people stay at the hotel?







#### **Define Your Business Objective?**

The objective of this project is to perform an EDA on the hotel booking dataset to gain insights into the patterns and trends in bookings, cancellations, revenue, customer segmentation, and competitive positioning.
These insights will help improve operational efficiency and financial management, increase revenue, and support data-driven decision-making.

# **General Guidelines**

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart, the following format is followed.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]

# ***Let's Begin !***

## ***1. Know your Data***


Import libraries


In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Dataset loading


In [None]:
# mounting drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# reading the Dataset
path = "/content/drive/MyDrive/capstone project data/Hotel Bookings.csv"
hotel_df = pd.read_csv(path)

Dataset first look

In [None]:
hotel_df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


Dataset Information

In [None]:
# Dataset Info
hotel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

 Null values

In [None]:
# null values
null_values = hotel_df.isnull().sum()
print(null_values)

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

We see that there are 32 columns in the dataframe and some columns like 'children', 'company', 'country', and 'agent' have null values.
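To judge how serious each gap is, the raw null counts can be converted into percentages of the total row count. The following is a minimal sketch on a small stand-in frame; `toy_df` and its values are hypothetical and are not taken from the hotel dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for hotel_df, just large enough to show the idea.
toy_df = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "adr": [75.0, 98.0, 0.0, 120.5],
})

# isnull() gives a boolean frame; its column-wise mean is the null fraction.
null_pct = toy_df.isnull().mean().mul(100).round(2)

# Keep only the columns that actually contain nulls, largest share first.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```

Applied to `hotel_df`, the same two lines would show at a glance which columns are mostly empty and which are missing only a handful of values.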

## **2. Understanding your variables**

In [None]:
# Dataset description
hotel_df.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0


### Check unique values for each variable

In [None]:
# creating copy of Dataframe
df1 = hotel_df.copy()

In [None]:
# unique values
df1['hotel'].unique()

array(['Resort Hotel', 'City Hotel'], dtype=object)

In [None]:
df1['is_canceled'].unique()

array([0, 1])

In [None]:
df1['arrival_date_year'].unique()

array([2015, 2016, 2017])

In [None]:
df1['meal'].unique()

array(['BB', 'FB', 'HB', 'SC', 'Undefined'], dtype=object)

In [None]:
df1['market_segment'].unique()

array(['Direct', 'Corporate', 'Online TA', 'Offline TA/TO',
       'Complementary', 'Groups', 'Undefined', 'Aviation'], dtype=object)

In [None]:
df1['distribution_channel'].unique()

array(['Direct', 'Corporate', 'TA/TO', 'Undefined', 'GDS'], dtype=object)

In [None]:
df1['children'].unique()

array([ 0.,  1.,  2., 10.,  3., nan])
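Instead of calling `unique()` one column at a time, the non-numeric columns can be inspected in a single pass. This is only a sketch on a hypothetical `toy_df`; the column names mirror the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df1 with two categorical columns and one numeric.
toy_df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "meal": ["BB", "HB", "BB"],
    "lead_time": [342, 7, 13],
})

# Select every non-numeric column and collect its unique values.
categorical_cols = toy_df.select_dtypes(exclude="number").columns
unique_values = {col: toy_df[col].unique().tolist() for col in categorical_cols}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```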

### Variables Description

1. **hotel**: Categorical - Resort hotel or city hotel.
2. **is_canceled**: ‘1’ if the booking was cancelled and ‘0’ if it was not cancelled.
3.	**lead_time**: Period between time of booking and checking in (considered in days here).
4.	**arrival_date_month**: Arrival month
5.	**country**: Country of origin. List of 158 countries.
6.	**days_in_waiting_list**: Number of waiting days.
7.	**Deposit_type**: Categorical - No-deposit, Non-Refund, Refundable.
8.	**Adr**: Average Daily rate as defined by the average rental revenue earned for an occupied room per day.
9.	**Adults, Babies, Children**: Number of adults, babies and children.
10.	**Assigned Room Type**: Code for the type of room assigned to the booking.
11.	**Booking Changes**: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
12.	**Distribution_channel**: Booking distribution channel.
13. **Is_repeated_guest**: Categorical - repeated guest (1) or not (0).
14.	**Company**: ID of the company/entity that made the booking or responsible for paying the booking.
15. **Customer Type**: Categorical:
- Contract – when the booking has an allotment or other type of contract associated with it;
- Group – when the booking is associated with a group;
- Transient – when the booking is not part of a group or contract, and is not associated with any other transient booking;
- Transient party – when the booking is transient, but is associated with at least one other transient booking.
16.	**Market_segment**: Market segment designation.
17.	**Previous_cancellations**: Number of previous bookings that were cancelled by the customer prior to the current booking.
18.	**Required_car_parking_spaces**: Number of car parking spaces required by the customer.
19.	**Reservation_status**:Categorical:
- Cancelled – booking was cancelled by the customer.
- Check Out – customer has checked in but already departed.
- No Show – customer did not check in and did inform the hotel of the reason why.
20. **Reservation_status_date**: Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when the booking was cancelled or when the customer checked out of the hotel.
21.	**Reserved_room_type**: Code of room type reserved.
22. **Total_of_special_requests**: Number of special requests made by the customer (e.g. twin bed or high floor).
23.	**Stays_in_weekend_nights**, **Stays_in_week_nights**: Number of weekend nights and week nights the guest stayed or booked to stay at the hotel.


First of all we will try to understand the meaning of all columns of the dataframe.
For this we will see the unique values attained by each column whose meaning we are unable to understand.

## **3. Data wrangling**

Data wrangling is a crucial step before EDA, as it removes ambiguous data that can affect the outcome of the analysis.
1. Remove duplicate rows.
2. Handle missing values.
3. Convert columns to appropriate datatypes.
4. Add important columns.

#### Step 1: Removing duplicate rows if any

In [None]:
# Shape of the duplicated rows: (number of duplicate rows, number of columns)
df1[df1.duplicated()].shape

(31994, 32)

In [None]:
# Dropping duplicate values
df1.drop_duplicates(inplace = True)

In [None]:
df1.shape

(87396, 32)
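The row counts above are consistent: 119390 original rows minus 31994 duplicates leaves 87396. The same bookkeeping can be checked in code. The sketch below uses a hypothetical four-row `toy_df` and the non-inplace form of `drop_duplicates`, which returns a new frame rather than modifying the original.

```python
import pandas as pd

# Hypothetical stand-in for df1 where rows 0, 1 and 3 are identical.
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [7, 7, 13, 7],
})

# duplicated() flags every repeat of an earlier row (the first copy is kept).
n_duplicates = int(toy_df.duplicated().sum())

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = toy_df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

Resetting the index is optional. The notebook keeps the original index, which is why later steps drop rows by index label.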

#### Step 2: Handling missing values

In [None]:
# columns having missing values
df1.isnull().sum().sort_values(ascending = False)[:6]

company               82137
agent                 12193
country                 452
children                  4
reserved_room_type        0
assigned_room_type        0
dtype: int64

The company and agent columns hold company and agent ID numbers. A customer may book a hotel without going through any agent or company, in which case the value in these columns is null.
We will replace the null values in these columns with 0.

In [None]:
df1[['company','agent']] = df1[['company','agent']].fillna(0)

In [None]:
df1['children'].unique()

array([ 0.,  1.,  2., 10.,  3., nan])

The 'children' column already uses 0 to mean that no children were present in the group of customers who made the booking.
So the 'nan' values are genuinely missing, most likely due to an error in recording the data.
We will replace the null values in this column with the mean value of 'children'.

In [None]:
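# assign the result back instead of using inplace on a column selection,
# which does not reliably update df1 in recent pandas versions
df1['children'] = df1['children'].fillna(df1['children'].mean())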

The next column with missing values is 'country'. This column represents the customer's country of origin.
Since this column holds strings, we will replace the missing values with the placeholder label 'others'.

In [None]:
# fill missing countries with a placeholder label and assign the result back
df1['country'] = df1['country'].fillna('others')

In [None]:
#checking if all null values are removed
df1.isnull().sum().sort_values(ascending = False)[:6]

hotel                          0
is_canceled                    0
reservation_status             0
total_of_special_requests      0
required_car_parking_spaces    0
adr                            0
dtype: int64
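The three fills used in this step (0 for 'company' and 'agent', the mean for 'children', and 'others' for 'country') can also be expressed as one `fillna` call with a per-column dictionary. This is a sketch on a hypothetical `toy_df`, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 with the four columns that contain nulls.
toy_df = pd.DataFrame({
    "company": [np.nan, 40.0, np.nan, np.nan],
    "agent": [304.0, np.nan, 240.0, np.nan],
    "children": [0.0, 2.0, np.nan, 1.0],
    "country": ["PRT", None, "GBR", "PRT"],
})

# One fill value per column, mirroring the choices made in this step.
fill_values = {
    "company": 0,
    "agent": 0,
    "children": toy_df["children"].mean(),  # mean() skips NaN by default
    "country": "others",
}

# DataFrame.fillna accepts a dict keyed by column name and returns a new frame.
filled = toy_df.fillna(value=fill_values)

print(filled.isnull().sum())
```

Keeping the fill rules in one dictionary makes the imputation choices easy to review and change in a single place.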

There are some rows where the total number of adults, children, and babies is zero. Such bookings have no guests, so we will remove these rows.

In [None]:
df1[df1['adults']+df1['babies']+df1['children'] == 0].shape

(166, 32)

In [None]:
df1.drop(df1[df1['adults']+df1['babies']+df1['children'] == 0].index, inplace = True)
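An equivalent way to remove the zero-guest bookings is to build the guest total once and keep only the rows where it is positive. The sketch below assumes a hypothetical `toy_df` in which the guest counts are never negative.

```python
import pandas as pd

# Hypothetical stand-in for df1; row 1 has no guests at all.
toy_df = pd.DataFrame({
    "adults": [2, 0, 1, 0],
    "children": [0, 0, 1, 2],
    "babies": [0, 0, 0, 0],
})

# Row-wise guest total across the three guest-count columns.
total_guests = toy_df[["adults", "children", "babies"]].sum(axis=1)

# Keep bookings with at least one guest; copy() yields an independent frame.
cleaned = toy_df[total_guests > 0].copy()

print(cleaned)
```

Filtering with a boolean mask avoids computing the same condition twice and leaves the original frame untouched.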

#### Step 3: Converting columns to appropriate datatypes

In [None]:
# converting datatype of columns 'children', 'company', and 'agent' from float to int.
df1[['children', 'company', 'agent']] = df1[['children', 'company', 'agent']].astype('int64')

In [None]:
df1['reservation_status_date'] = pd.to_datetime(df1['reservation_status_date'], format= '%Y-%m-%d')
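After converting types it is worth confirming that the columns ended up with the intended dtypes. The following is a minimal sketch on a hypothetical `toy_df`; it uses the dictionary form of `astype` and the same date format as the cell above.

```python
import pandas as pd

# Hypothetical stand-in for df1 after missing values have been filled.
toy_df = pd.DataFrame({
    "children": [0.0, 2.0],
    "agent": [304.0, 0.0],
    "reservation_status_date": ["2015-07-01", "2015-07-03"],
})

# Cast several columns at once with a column -> dtype mapping.
toy_df = toy_df.astype({"children": "int64", "agent": "int64"})

# Parse the date strings using an explicit format.
toy_df["reservation_status_date"] = pd.to_datetime(
    toy_df["reservation_status_date"], format="%Y-%m-%d"
)

print(toy_df.dtypes)
```

Note that `astype('int64')` truncates floats toward zero, so a fractional value introduced by mean imputation (for example 0.1) becomes 0 after the cast.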

#### Step 4: Adding important columns

In [None]:
# adding total stay length as a column, i.e. total stay = weekend nights + week nights
df1['total_stay'] = df1['stays_in_weekend_nights'] + df1['stays_in_week_nights']

In [None]:
# adding total people num as column, i.e. total people num = num of adults + children + babies
df1['total_people'] = df1['adults'] + df1['children'] + df1['babies']


We are adding the 'total_stay' column so that we can analyse the length of stay at hotels.
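Both derived columns can also be created in one `assign` call, which returns a new frame and keeps the two definitions side by side. This is a sketch on a hypothetical `toy_df` whose column names follow the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df1 with only the columns needed here.
toy_df = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2],
    "stays_in_week_nights": [1, 5],
    "adults": [2, 2],
    "children": [0, 1],
    "babies": [0, 0],
})

# assign() evaluates each lambda on the frame and adds the result as a column.
toy_df = toy_df.assign(
    total_stay=lambda d: d["stays_in_weekend_nights"] + d["stays_in_week_nights"],
    total_people=lambda d: d["adults"] + d["children"] + d["babies"],
)

print(toy_df[["total_stay", "total_people"]])
```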