<a href="https://colab.research.google.com/github/omi82/Hotel-Booking-analysis/blob/main/Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Omendra Puri


# **Project Summary -**

This project involves the analysis of a hotel booking dataset to gain insights into customer booking behavior and channel preferences. The project aims to:

1. Define the business objectives of the study.
2. Clean and prepare the data by handling missing values, outliers, etc.
3. Perform exploratory data analysis (EDA) to investigate relationships between features, generate new variables, and present the data in a comprehensible format.
4. Provide observations and recommendations based on the EDA findings.

**Key Activities:**

- Data Cleaning:
    - Handling missing values
    - Treating outliers
    - Checking for data consistency
- Exploratory Data Analysis:
    - Analyzing booking trends based on various factors such as hotel type, arrival month, lead time, etc.
    - Identifying relationships between features using statistical methods and visualizations.
    - Creating new variables to capture additional insights.
- Presentation of Findings:
    - Summarizing key observations from the EDA.
    - Providing recommendations for improving hotel booking strategies.

**Expected Outcomes:**

- Improved understanding of customer booking behavior and preferences.
- Identification of key factors influencing hotel bookings.
- Recommendations for optimizing hotel revenue and occupancy.

# **GitHub Link -**

https://github.com/omi82/Hotel-Booking-analysis/blob/main/Hotel_Booking_analysis1.ipynb

# **Problem Statement**


* <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

 <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

 <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

#### **Define Your Business Objective?**


The objective of this analysis is to gain insights into hotel booking patterns and cancellations, and to identify the factors that influence these patterns. The goal is to develop predictive models that can accurately predict booking cancellations and to identify potential areas for improvement in hotel policies and practices. Ultimately, the objective is to increase revenue and profitability for hotels by reducing booking cancellations and improving overall customer satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import folium

### Dataset Loading

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
hotel_booking_df = pd.read_csv('/content/drive/MyDrive/almabetter/capstone project/EDA/Hotel Bookings.csv')

# copy original dataset to new dataset
df = hotel_booking_df.copy()
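The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. This is only a minimal sketch: the function name `load_bookings` is hypothetical and not part of the original notebook, and the path is whatever the reader uses on their own drive.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read a bookings CSV and fail with a clear message if it is missing or has no rows.

    `load_bookings` is a hypothetical helper name used only for this illustration.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path}")
    frame = pd.read_csv(path)
    if frame.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return frame
```

With such a helper, a wrong Drive path stops the notebook at the first cell with a readable message instead of failing later in the analysis.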

### Dataset First View

In [None]:
# Dataset First Look
df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


In [None]:
# Dataset Rows & Columns count
print("Dataset_Row_count:    ",df.shape[0])
print("Dataset_Column_count: ",df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()           # True means duplicated rows

In [None]:
# Visualizing the duplicate values
plt.figure(figsize=(5,4))
sns.countplot(x=df.duplicated())

So the dataset contains 31994 duplicate rows.
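To make clear what this count means, here is a small sketch on made-up rows (the toy values are assumptions, not taken from the dataset). By default `duplicated()` flags only the later repeats of a row, so the count is the number of rows that `drop_duplicates()` would remove.

```python
import pandas as pd

# Toy data: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 3, 10],
})

# keep="first" (the default) marks only the later repeats as duplicates
n_repeats = int(toy.duplicated().sum())

# keep=False marks every row that belongs to a repeated group
n_rows_in_repeated_groups = int(toy.duplicated(keep=False).sum())

print(n_repeats, n_rows_in_repeated_groups)
```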

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Missing_Values = df.isnull().sum().sort_values(ascending=False)
Missing_Values[:5]

In [None]:
import missingno as msno

# Visualize the number of non-null values in each column as a bar chart
msno.bar(df, figsize = (15,4))

# Display the plot
plt.show()

### What did you know about your dataset?

- This dataset contains 119390 rows and 32 columns.
- All the columns fall into three dtypes (object, float64 and int64).
- This dataset has duplicate rows as well as missing values. There are 31994 duplicate rows, and four columns have missing values.
- The columns with missing values are company, agent, country and children, with 112593, 16340, 488 and 4 missing values respectively.
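The missing-value counts and their percentages can be collected in one table. The sketch below uses a hypothetical helper `summarize_missing` and made-up toy values, so the numbers it prints are not the dataset's numbers.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame):
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts * 100 / len(frame)).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the bookings table
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 14.0],
    "adults": [2, 1, 2, 2],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```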

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(df.columns)

In [None]:
# Dataset Describe
df.describe().T

### Variables Description

1. **hotel** : *Resort Hotel or City Hotel*

2. **is_canceled** : *Value indicating if the booking was canceled (1) or not (0)*

3. **lead_time** : *The number of days between the booking date and the arrival date*

4. **arrival_date_year** : *Year of arrival date*

5. **arrival_date_month** : *Month of arrival date*

6. **arrival_date_week_number** : *Week number of year for arrival date*

7. **arrival_date_day_of_month** : *Day of arrival date*

8. **stays_in_weekend_nights** : *Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel*

9. **stays_in_week_nights** : *Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel*

10. **adults** : *Number of adults*

11. **children** : *Number of children*

12. **babies** : *Number of babies*

13. **meal** : *The type of meal booked (e.g., Bed & Breakfast, Half board)*

14. **country** : *Country of origin.*

15. **market_segment** : *Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

16. **distribution_channel** : *Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”*

17. **is_repeated_guest** : *Value indicating if the booking name was from a repeated guest (1) or not (0)*

18. **previous_cancellations** : *Number of previous bookings that were cancelled by the customer prior to the current booking*

19. **previous_bookings_not_canceled** : *Number of previous bookings not cancelled by the customer prior to the current booking*

20. **reserved_room_type** : *Code of room type reserved. Code is presented instead of designation for anonymity reasons.*

21. **assigned_room_type** : *Code for the type of room assigned to the booking.*

22. **booking_changes** : *Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation*

23. **deposit_type** : *Indication on if the customer made a deposit to guarantee the booking.*

24. **agent** : *ID of the travel agency that made the booking*

25. **company** : *ID of the company/entity that made the booking or responsible for paying the booking.*

26. **days_in_waiting_list** : *Number of days the booking was in the waiting list before it was confirmed to the customer*

27. **customer_type** : *Type of booking, assuming one of four categories*


28. **adr** : *The average daily rate (i.e., the sum of all lodging transactions divided by the total number of staying nights)*

29. **required_car_parking_spaces** : *Number of car parking spaces required by the customer*

30. **total_of_special_requests** : *Number of special requests made by the customer (e.g. twin bed or high floor)*

31. **reservation_status** : *Reservation last status, assuming one of three categories*
* Canceled – booking was canceled by the customer
* Check-Out – customer has checked in but already departed
* No-Show – customer did not check-in and did inform the hotel of the reason why





32. **reservation_status_date** : *Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel*

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

**Data Cleaning**

In [None]:
# Before filling the null values, check which columns have them (the counts are already stored in Missing_Values)
Missing_Values[:5]

In [None]:
# let's check, what is the percentage of null value in each column
percent_missing = Missing_Values * 100 / len(df)
percent_missing[:5]

In [None]:
# Drop the company column: the vast majority of its values are missing
df.drop(['company'], axis=1, inplace=True)

In [None]:
# Replacing null values of agent and children with value 0
df[['agent','children']] = df[['agent', 'children']].fillna(0)

In [None]:
# Replacing null values of country column with other
df[['country']] = df[['country']].fillna('other')

In [None]:
#Checking
df.isnull().sum().sort_values(ascending=False)[:4]

In [None]:
# Drop the duplicate value
df=df.drop_duplicates()
df

In [None]:
# Checking how many rows have adults, children and babies summing to 0
df[df['adults']+df['children']+df['babies']==0].shape

In [None]:
# Checking the shape of updated data set
df.shape

In [None]:
# Dropping the rows where adults, babies and children sum to 0, because such a booking has no guests
df.drop(df[df['adults']+df['babies']+df['children']==0].index,inplace=True)

In [None]:
df.shape      # checking row is drop

In [None]:
# Number of rows dropped (rows before - rows after)

87389-87223

In [None]:
# Converting column 'reservation_status_date' from object to datetime
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'],format='%Y-%m-%d')

In [None]:
# Changing agent and children data type float64 into int64
df[['agent','children']]=df[['agent','children']].astype('int64')

In [None]:
df.info()  # For checking the changes in the reservation_status_date, agent and children data types

### *adding some important columns*

In [None]:
# Adding total stay day in hotel
df['total_stay'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']

In [None]:
# Adding total number of people
df['total_people']=df['adults']+df['babies']+df['children']

In [None]:
# Checking the shape of the dataset
df.shape

### What all manipulations have you done and insights you found?

- The company, agent, country and children columns have missing values. The company column has more than 94% missing values, so we drop it. The agent column has more than 13% missing values, while the country and children columns have less than 1%. Missing values in agent and children are replaced with 0, and missing values in country are replaced with 'other'.
- Dropped the duplicate rows.
- Dropped the rows where adults, babies and children sum to 0, because such a booking has no guests.
- Added a new column total_stay (stays_in_week_nights + stays_in_weekend_nights).
- Added a new column total_people (adults + children + babies).
- The cleaned dataset has 87223 rows and 33 columns.
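The steps listed above can also be collected into one function so that the whole cleaning runs in a single call. This is a sketch under assumptions: the name `clean_bookings` is hypothetical, the toy rows are made up, and the function only mirrors the steps described in this section.

```python
import numpy as np
import pandas as pd


def clean_bookings(raw):
    """Apply the cleaning steps of this section to a copy of the raw bookings table."""
    cleaned = raw.drop(columns=["company"])            # mostly missing, so dropped
    for column in ("agent", "children"):               # missing -> 0, then integer type
        cleaned[column] = cleaned[column].fillna(0).astype("int64")
    cleaned["country"] = cleaned["country"].fillna("other")
    cleaned = cleaned.drop_duplicates()
    guests = cleaned["adults"] + cleaned["children"] + cleaned["babies"]
    cleaned = cleaned[guests > 0].copy()               # a booking needs at least one guest
    cleaned["total_stay"] = cleaned["stays_in_week_nights"] + cleaned["stays_in_weekend_nights"]
    cleaned["total_people"] = cleaned["adults"] + cleaned["children"] + cleaned["babies"]
    return cleaned


# Toy data: row 1 duplicates row 0, row 2 has no guests, row 3 has missing values
raw = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "adults": [2, 2, 0, 1],
    "children": [0.0, 0.0, 0.0, np.nan],
    "babies": [0, 0, 0, 0],
    "agent": [9.0, 9.0, 240.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "country": ["PRT", "PRT", "GBR", None],
    "stays_in_week_nights": [2, 2, 1, 5],
    "stays_in_weekend_nights": [1, 1, 0, 2],
})
tidy = clean_bookings(raw)
print(tidy)
```

Keeping the steps in one function makes it easier to rerun the notebook from the top without leaving `df` in a half-cleaned state.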


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Most Preferred Hotel

In [None]:
# Chart - 1 visualization code
# create a pie chart
df['hotel'].value_counts().plot.pie(figsize=(5,7),fontsize=25, explode=[0.05,0.05], autopct='%1.1f%%', shadow=True)

# add a title
plt.title('Pie chart for the most preferred hotel')

# show the chart
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see which hotel is preferred more by the customers.

##### 2. What is/are the insight(s) found from the chart?

- The City Hotel is more preferred than the Resort Hotel.
- Around 61.1% of the customers prefer City Hotel, while only 38.9% prefer Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this chart shows a positive business impact for both hotels.

- The hotel management can focus more on improving the facilities and services of the City Hotel to attract more customers.
- The hotel management can also consider expanding the City Hotel or opening more City Hotels in other locations.
- The hotel management can also consider offering special discounts or packages for the Resort Hotel to attract more customers.

Therefore, the gained insights from this chart will help the hotel business to increase its revenue and profitability.

#### Chart - 2 Repeated Guests

In [None]:
# Chart - 2 visualization code
# create a pie chart
df['is_repeated_guest'].value_counts().plot.pie(figsize=(5,7),fontsize=25, explode=[0.05,0.05], autopct='%1.1f%%', shadow=False)

# add a title
plt.title('pie chart for repeated guest ')

# show the chart
plt.show()

In [None]:
# is_repeated_guest: 1 = repeated guest, 0 = not a repeated guest
# Count the repeated guests for each hotel
repeated_guests_df = df[df['is_repeated_guest']==1].groupby('hotel').size().reset_index(name='number_of_repeated_guests')

# set the plot size and draw the bar plot
plt.figure(figsize=(5,4))
sns.barplot(x=repeated_guests_df['hotel'],y=repeated_guests_df['number_of_repeated_guests'])

# set labels
plt.xlabel('Hotel type')
plt.ylabel('count of repeated guests')
plt.title("Repeated guests for each hotel")

##### 1. Why did you pick the specific chart?

- The pie chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see what share of customers are repeated guests.
- Bar plot is used to compare the number of repeated guests for each hotel.

##### 2. What is/are the insight(s) found from the chart?

- From the pie chart, we can see that around 3.9% of the customers are repeated guests.
- From the bar plot, we can see that the City Hotel and the Resort Hotel have almost equal numbers of repeated guests.
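Because the two hotels have different numbers of bookings, equal counts do not have to mean equal repeat rates. The sketch below shows one way to compare rates instead of counts; the toy values are made up for illustration and are not the dataset's figures.

```python
import pandas as pd

# Toy data: both hotels have 2 repeated guests, but different booking totals
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 6 + ["Resort Hotel"] * 4,
    "is_repeated_guest": [1, 0, 0, 0, 0, 1, 1, 1, 0, 0],
})

# Count repeated guests and all bookings per hotel, then take the ratio
repeat_summary = toy.groupby("hotel")["is_repeated_guest"].agg(
    repeated_guests="sum",
    total_bookings="count",
)
repeat_summary["repeat_rate_percent"] = (
    repeat_summary["repeated_guests"] / repeat_summary["total_bookings"] * 100
).round(1)
print(repeat_summary)
```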

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights from this chart will help the hotel business to increase its revenue and profitability.

- The hotel management can focus on improving the facilities and services of both hotels to attract more repeated guests.
- The hotel management can also consider offering loyalty programs or discounts to repeated guests.
- The hotel management can also consider conducting surveys or gathering feedback from repeated guests to understand their needs and preferences better.

Therefore, the gained insights from this chart will help the hotel business to improve its customer satisfaction and retention, which will ultimately lead to increased revenue and profitability.

#### Chart - 3 Required Car Parking Spaces

In [None]:
# Chart - 3 visualization code
# Count the bookings per number of parking spaces once, sorted by the number of
# spaces so the counts line up with the bar order used by seaborn
parking_counts = df['required_car_parking_spaces'].value_counts().sort_index()

# Create a bar chart
plt.figure(figsize=(8, 6))
sns.barplot(x=parking_counts.index, y=parking_counts.values)

# Add labels and title
plt.xlabel('Required Car Parking Spaces')
plt.ylabel('Number of Bookings')
plt.title('Required Car Parking Spaces Distribution')

# Add percentage values above each bar
for i, count in enumerate(parking_counts.values):
    plt.text(i, count, f'{round(count / parking_counts.sum() * 100, 2)}%', ha='center', va='bottom')

# Show the chart
plt.show()


##### 1. Why did you pick the specific chart?


The bar chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see how many customers required different numbers of car parking spaces.

##### 2. What is/are the insight(s) found from the chart?

- Most of the customers (around 91.62%) did not require any car parking spaces.
- A small percentage of customers (around 8.34%) required 1 car parking space.
- Only a very small percentage of customers (less than 0.03%) required 2 or more car parking spaces.
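Percentages like these can also be read directly from `value_counts(normalize=True)` without going through the chart labels. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy data standing in for df['required_car_parking_spaces']
toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 2], name="required_car_parking_spaces")

# normalize=True returns fractions, so multiply by 100 to get percentages
parking_share = toy.value_counts(normalize=True).mul(100).round(2)
print(parking_share)
```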

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from this chart will help the hotel business to create a positive business impact.


- The hotel management can consider offering discounts or special rates to customers who require car parking spaces.
- The hotel management can also consider partnering with car rental companies to offer rental cars to customers who do not have their own cars.

Therefore, the gained insights from this chart will help the hotel business to improve its customer satisfaction and convenience, which will ultimately lead to increased revenue and profitability.

#### Chart - 4 Most Preferred Meal Type

In [None]:
# Chart - 4 visualization code
# Create a bar chart
plt.figure(figsize=(8, 6))
ax = df.meal.value_counts().plot(kind='bar')

# Add labels and title
plt.xlabel('Meal Type')
plt.ylabel('Number of Bookings')
plt.title('Most Preferred Meal Type')

# Add each meal type's share of all bookings (in %) inside its bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width / 2, y + height / 2,
            '{:.01f}%'.format(height / len(df) * 100),
            horizontalalignment='center',
            verticalalignment='center')

# Show the chart
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see which meal type is preferred more by the customers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most preferred meal type is BB (77.8%), followed by SC (10.8%), HB (10.4%), Undefined (0.6%) and FB (0.4%).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on improving the quality and variety of the BB meal type to attract more customers.
- The hotel management can also consider offering special discounts or promotions on the BB meal type to encourage more customers to choose it.
- Additionally, the hotel management can consider expanding the selection of BB meal options to cater to a wider range of customer preferences.

By taking these steps, the hotel business can increase customer satisfaction and revenue.

#### Chart - 5 ADR of each Hotel type and ADR across Distribution Channel

In [None]:
# Chart - 5 visualization code
group_by_hotel = df.groupby('hotel')

In [None]:
# group by hotel adr
highest_adr = group_by_hotel['adr'].mean().reset_index()
plt.figure(figsize=(5,4))
plt.xlabel('hotel type', fontsize=10)
plt.ylabel('adr', fontsize=10)
plt.title("ADR of each hotel type", fontsize=10)
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])

In [None]:
# Using groupby distribution channel
distribution_channel_df=df.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()
# plot bar chart
plt.figure(figsize=(12,5))
sns.barplot(x='distribution_channel', y='adr',data=distribution_channel_df,hue='hotel')
plt.title('ADR across Distribution channel')

##### 1. Why did you pick the specific chart?

The bar chart is an effective way to compare a numeric value across categories. In this case, it helps us to see which hotel type has the higher average ADR and how the average ADR varies across different distribution channels.

##### 2. What is/are the insight(s) found from the chart?

- From the first chart, we can see that the City Hotel has a higher ADR than the Resort Hotel.

- From the second chart, the Resort Hotel has no ADR for the Global Distribution System (GDS) channel, which suggests it received no bookings through GDS. The TA/TO and Corporate channels show a higher ADR for the City Hotel, while the Direct and Undefined channels show a higher ADR for the Resort Hotel.
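A missing bar in a grouped bar chart can be checked by putting the mean ADR and the booking counts side by side. The sketch below uses made-up toy values; in it, the empty GDS and Resort Hotel cell comes from having no bookings, not from an ADR of zero.

```python
import pandas as pd

# Toy data: no Resort Hotel booking arrives through GDS
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO", "TA/TO", "Direct", "Direct", "GDS", "GDS"],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "adr": [110.0, 90.0, 100.0, 120.0, 95.0, 105.0],
})

# Mean ADR per channel and hotel; combinations without bookings show up as NaN
adr_by_channel = toy.pivot_table(
    index="distribution_channel", columns="hotel", values="adr", aggfunc="mean"
)

# Number of bookings per channel and hotel; combinations without bookings show up as 0
bookings_by_channel = pd.crosstab(toy["distribution_channel"], toy["hotel"])

print(adr_by_channel)
print(bookings_by_channel)
```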

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on improving the facilities and services of the City Hotel to attract more customers and increase ADR.
- The hotel management can also consider expanding the City Hotel or opening more City Hotels in other locations.
- The hotel management can also consider offering special discounts or packages for the Resort Hotel to attract more customers and increase ADR.
- The hotel management can also review its distribution channels to understand why the Resort Hotel has no ADR for GDS, why the TA/TO and Corporate channels have a higher ADR for the City Hotel, and why the Direct and Undefined channels have a higher ADR for the Resort Hotel.

By taking these steps, the hotel business can increase customer satisfaction, revenue, and profitability.

#### Chart - 6 Which agent made the most bookings

In [None]:
# Chart - 6 visualization code
# Count the bookings made by each agent, highest first
highest_bookings = df.groupby('agent').size().reset_index(name='Most_Bookings').sort_values(by='Most_Bookings', ascending=False)

# Agent 0 is the placeholder we used for missing agent values (no agent recorded), so drop it
highest_bookings.drop(highest_bookings[highest_bookings['agent']==0].index,inplace=True)

# taking top 10 bookings made by agent
top_ten_highest_bookings=highest_bookings[:10]

top_ten_highest_bookings

In [None]:
# Create a bar chart (DataFrame.plot opens its own figure, so the size is passed here)
ax = top_ten_highest_bookings.plot(kind='bar', x='agent', y='Most_Bookings', figsize=(12,6))

# Add labels and title
plt.xlabel('Agent')
plt.ylabel('Number of Bookings')
plt.title('Top 10 Agents with Highest Bookings')

# Add each agent's share of all bookings (in %) inside its bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width / 2, y + height / 2,
            '{:.01f}%'.format(height / len(df) * 100),
            horizontalalignment='center',
            verticalalignment='center')

# Show the chart
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see which agents made the most bookings.

##### 2. What is/are the insight(s) found from the chart?

- From the chart, we can see that the top 10 agents made a significant number of bookings.
- Agent 9 made the highest number of bookings, followed by Agent 240 and Agent 14.
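How much of the agent-made business the top agents account for can be put into a single number. The sketch below assumes, as in the cleaning step, that agent 0 stands for "no agent recorded"; the toy agent IDs are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df['agent']; 0 means no agent was recorded
toy_agents = pd.Series([9, 9, 9, 240, 240, 14, 0, 0, 7, 9], name="agent")

# Keep only bookings that were made through an agent
with_agent = toy_agents[toy_agents != 0]

# Bookings of the two busiest agents and their share of all agent-made bookings
top_counts = with_agent.value_counts().head(2)
top_share_percent = top_counts.sum() / len(with_agent) * 100
print(top_counts)
print(top_share_percent)
```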

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the chart will help create a positive business impact.

- The hotel management can focus on providing these top 10 agents with special incentives or rewards to encourage them to continue making more bookings.
- The hotel management can also consider working with these agents more closely to understand their needs and preferences better and to develop new strategies to attract more customers.
- Additionally, the hotel management can consider training other agents to improve their booking performance and to increase overall bookings.

By taking these steps, the hotel business can increase its revenue and profitability.

#### Chart - 7 Distribution of Customer Type

In [None]:
# Chart - 7 visualization
df['customer_type'].value_counts().plot.pie(explode=[0.09]*4,shadow=True,autopct='%1.2f%%',figsize=(12,8),fontsize=15,labels=None)

labels=df['customer_type'].value_counts().index.tolist()
plt.title('% Distribution of Customer Type')
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)


##### 1. Why did you pick the specific chart?

The pie chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see the percentage of each customer type in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can see that the majority of customers are of the Transient type (82.38%), followed by Transient-Party (13.4%), Contract (3.59%) and Group (0.62%).




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the above plot will help create a positive business impact.

- The hotel management can focus on attracting more transient and transient-party customers, as these two customer types make up the majority of the hotel's bookings.
- The hotel management can also consider offering special discounts or promotions to these two customer types to encourage them to book more often.
- Additionally, the hotel management can consider developing new marketing campaigns that target these two customer types.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 8 Booking by month and Optimal Stay length in hotels

In [None]:
# Chart - 8 visualization code
# groupby arrival_date_month and taking the hotel count
bookings_by_months_df=df.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts"})
# Create list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# creating df which will map the order of above months list without changing its values.
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)
# sorting by arrival_date_month
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:

# set plot size
plt.figure(figsize=(15,6))

# plotting lineplot on x- months & y- booking counts
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['Counts'])

# set title for the plot
plt.title('Number of bookings across each month')
#set x label
plt.xlabel('Month')
#set y label
plt.ylabel('Number of bookings')

In [None]:
# Count the bookings for each combination of total_stay and hotel
stay = df.groupby(['total_stay', 'hotel']).size().reset_index(name='Number of stays')

In [None]:
# Setting plot size for bar chart
plt.figure(figsize=(20,10))
sns.barplot(x='total_stay', y='Number of stays', hue='hotel',data=stay)
# Set labels
plt.title('Optimal Stay Length in Both Hotel types', fontsize=15)
plt.ylabel('Count of Stay',fontsize=15)
plt.xlabel('Total stay(days)',fontsize=15)


##### 1. Why did you pick the specific chart?

- **Number of bookings across each month**: This plot shows the number of bookings for each month. This information can be used to identify the peak and off-peak seasons for the hotel, which can be used to develop pricing and marketing strategies.
- **Optimal stay length in both types of hotel**: This plot shows how many bookings were made for each length of stay in each hotel type. This information can be used to find the most common stay lengths for different types of guests, which can be used to develop targeted marketing campaigns.

##### 2. What is/are the insight(s) found from the chart?

In the first chart, we found that July and August had the most bookings.

In the second chart, we found that most stays in both hotel types are shorter than 7 days. Beyond that, the number of stays declines.
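The reading of the second chart can be backed with a number: the share of stays that are at most a week long, per hotel. This is a sketch on made-up values, and the cutoff `max_nights = 7` is a choice made for this illustration.

```python
import pandas as pd

# Toy data standing in for the hotel and total_stay columns
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "total_stay": [1, 2, 3, 3, 10, 2, 7, 7, 8, 14],
})

max_nights = 7  # cutoff chosen for this illustration

# Flag short stays, then average the flags per hotel to get the share in %
share_within_week = (
    toy["total_stay"].le(max_nights)
    .groupby(toy["hotel"])
    .mean()
    .mul(100)
)
print(share_within_week)
```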

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on attracting more customers during the peak season (July and August) by offering special discounts or promotions.
- The hotel management can also consider developing new marketing campaigns that target customers who are looking for short stays (less than 7 days).
- Additionally, the hotel management can consider offering special amenities or services to guests who are staying for longer periods of time.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 9 Which year had the most bookings

In [None]:
# Chart - 9 visualization code
# set plot size
plt.figure(figsize=(12,5))

#  plot with countplot
sns.countplot(x= df['arrival_date_year'],hue=df['hotel'])
plt.title("Year Wise bookings")


##### 1. Why did you pick the specific chart?

The specific graph, a countplot with hue, was chosen to visualize the number of bookings for each year, while also differentiating between the two hotel types. This allows for a clear comparison of the booking trends for each hotel across different years.

##### 2. What is/are the insight(s) found from the chart?

- The year 2016 had the highest number of bookings for both hotels combined.
- The City Hotel had a higher number of bookings than the Resort Hotel in all years except for 2015.
- The Resort Hotel had a significant increase in bookings from 2015 to 2016, but this increase was not sustained in subsequent years.
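Year totals are only comparable if each year covers the same number of months, so it is worth checking how many distinct arrival months each year has before reading too much into the counts. The sketch below uses made-up rows in which the raw totals differ but the bookings per covered month are the same.

```python
import pandas as pd

# Toy data: 2015 covers 2 months, 2016 covers 4 months
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016],
    "arrival_date_month": ["July", "July", "August", "January", "February",
                           "March", "March", "April", "April"],
})

# Bookings and number of distinct months per year, then bookings per covered month
yearly = toy.groupby("arrival_date_year").agg(
    bookings=("arrival_date_month", "size"),
    months_covered=("arrival_date_month", "nunique"),
)
yearly["bookings_per_month"] = yearly["bookings"] / yearly["months_covered"]
print(yearly)
```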

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from this chart can help create a positive business impact.

- The hotel management can focus on understanding the factors that contributed to the high number of bookings in 2016 and replicate those factors in future years.
- The hotel management can also consider developing marketing campaigns that target potential guests who are more likely to book, as they did during the peak year (2016).
- Additionally, the hotel management can consider investing in renovations or upgrades to the Resort Hotel to attract more guests and increase its occupancy rate.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 10 From which countries are most guests coming?

In [None]:
# Chart - 10 visualization code
guest_country = df[df['is_canceled'] == 0]['country'].value_counts().reset_index()
guest_country.columns = ['Country', 'No of guests']
guest_country

In [None]:
basemap = folium.Map()
# px.choropleth returns a plotly Figure, so name it accordingly
fig = px.choropleth(guest_country, locations = guest_country['Country'],
                           color = guest_country['No of guests'], hover_name = guest_country['Country'])
fig.show()

In [None]:
# Counting the guests from various countries and keeping the ten most common.
# rename_axis/reset_index(name=...) gives the same column names on every pandas version.
country_df = (df['country'].value_counts().head(10)
              .rename_axis('country').reset_index(name='count of guests'))

# Create a bar chart (figsize is passed to plot(), which creates its own figure)
ax = country_df.plot(kind='bar', x='country', y='count of guests', figsize=(12,6))

# Add labels and title
plt.xlabel('Country')
plt.ylabel('Number of guests',fontsize = 12)
plt.title("Top 10 Number of guests from different Countries")
print("\n\nPRT = Portugal\nGBR = Great Britain & Northern Ireland\nFRA = France\nESP = Spain\nDEU = Germany\nITA = Italy\nIRL = Ireland\nBRA = Brazil\nBEL = Belgium\nNLD = Netherlands")

# Add percentage values (share of all bookings) at the center of each bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width / 2, y + height / 2,
            '{:.01f}%'.format(height / len(df) * 100),
            horizontalalignment='center',
            verticalalignment='center')

# Show the chart
plt.show()

##### 1. Why did you pick the specific chart?

The specific graph, a Choropleth map, was chosen to visualize the number of guests from different countries on a world map. This allows for a clear understanding of the geographical distribution of guests and to identify the countries that are sending the most guests to the hotel.

The bar plot was chosen to visualize the top 10 countries with the most guests. This allows for a more detailed comparison of the number of guests from each country and to identify any trends or patterns.

##### 2. What is/are the insight(s) found from the chart?

- The majority of guests come from Portugal, followed by Great Britain & Northern Ireland, France, Spain, Germany, Italy, Ireland, Brazil, Belgium, and the Netherlands.
- These top 10 countries account for over 80% of all guests.
- The Choropleth map shows that guests are coming from all over the world, with a concentration in Europe and South America.
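
The combined share of the top countries can be checked directly rather than read off the bar labels. The following is a minimal sketch, assuming the `country` column used in the chart code; `top_country_share` is a hypothetical helper name and `sample` is made-up data, not the real dataset.

```python
import pandas as pd

def top_country_share(df: pd.DataFrame, n: int = 10) -> float:
    """Percentage of bookings (with a known country) that come from the n most common countries."""
    counts = df['country'].value_counts()  # missing countries are excluded by default
    return counts.head(n).sum() / counts.sum() * 100

# Tiny made-up sample: 5 + 3 of 10 bookings come from the two most common countries
sample = pd.DataFrame({'country': ['PRT'] * 5 + ['GBR'] * 3 + ['FRA'] + ['ESP']})
share_top2 = top_country_share(sample, n=2)
print(f"Top 2 countries: {share_top2:.1f}% of bookings")
```

On the notebook's data, `top_country_share(df, n=10)` would give the figure to compare against the statement about the top 10 countries.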

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on marketing and advertising efforts in the top 10 countries that are sending the most guests.
- The hotel management can also consider developing targeted marketing campaigns for each of these countries, taking into account the specific needs and preferences of guests from that country.
- Additionally, the hotel management can consider offering special discounts or promotions to guests from these countries to encourage them to book more often.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 11 Which distribution channel had the highest bookings and cancellations?

In [None]:
# Chart - 11 visualization code
# creating new df of distribution channel; naming the index and the count column
# explicitly keeps the column names identical across pandas versions
distribution_channel_df = (df['distribution_channel'].value_counts()
                           .rename_axis('distribution_channel').reset_index(name='count'))

# Creating labels
labels = distribution_channel_df['distribution_channel'].tolist()

# adding percentage column to the distribution_channel_df
distribution_channel_df['percentage']=round(distribution_channel_df['count']*100/df.shape[0],1)

# Creating list of percentages
sizes=distribution_channel_df['percentage'].values.tolist()

# plotting the pie chart from the same DataFrame (df) used for the labels and percentages;
# explode gets one entry per channel
df['distribution_channel'].value_counts().plot.pie(explode=[0.05]*len(labels), shadow=False, figsize=(15,8),fontsize=10,labels=None)

# setting legends with the percentage values
labels = [f'{l}, {s}%' for l, s in zip(labels, sizes)]
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
plt.title('Most Used Distribution Channels for Hotel Bookings')

In [None]:
canceled_df=df[df['is_canceled']==1] # 1= canceled

#group by distribution channel
canceled_df=canceled_df.groupby(['distribution_channel','hotel']).size().reset_index().rename(columns={0:'Counts'})
canceled_df

#set plot size and plot barchart
plt.figure(figsize=(8,5))
sns.barplot(x='distribution_channel',y='Counts',hue="hotel",data=canceled_df)

# set labels
plt.xlabel('Distribution channel')
plt.ylabel('counts')
plt.title('Cancelled Bookings Vs Distribution channel')

##### 1. Why did you pick the specific chart?

The pie chart is an effective way to visualize the distribution of categorical data. In this case, it helps us to see which distribution channel had the highest percentage of bookings.

The bar chart is an effective way to compare a count across two categorical variables. In this case, it helps us to see which distribution channels had the most cancelled bookings for each hotel type.

##### 2. What is/are the insight(s) found from the chart?

- The majority of bookings come through TA/TO (Travel Agents/Tour Operators), followed by Direct and Corporate.
- TA/TO also has the highest number of cancelled bookings for both hotels.
- This suggests that TA/TO may be a less reliable source of bookings. Because the chart shows counts and TA/TO also handles the most bookings, this should be confirmed by comparing the cancellation rate per channel.
- The hotel management may want to consider working with TA/TO to improve the quality of their bookings and to reduce the cancellation rate.
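
A channel with many bookings will also have many cancellations, so a rate per channel is a fairer comparison than a raw count. The snippet below is a minimal sketch of that calculation, assuming the `distribution_channel` and `is_canceled` (1 = cancelled) columns used above; `cancellation_rate_by` is a hypothetical helper name and `sample` is made-up data.

```python
import pandas as pd

def cancellation_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Bookings, cancellations and cancellation rate (%) per value of `column`."""
    summary = df.groupby(column)['is_canceled'].agg(bookings='size', cancellations='sum')
    summary['cancellation_rate_pct'] = (
        summary['cancellations'] / summary['bookings'] * 100
    ).round(1)
    # Highest cancellation rate first
    return summary.sort_values('cancellation_rate_pct', ascending=False)

# Tiny made-up sample (not the real dataset)
sample = pd.DataFrame({
    'distribution_channel': ['TA/TO'] * 4 + ['Direct'] * 4,
    'is_canceled':          [1, 1, 0, 0,    1, 0, 0, 0],
})
channel_rates = cancellation_rate_by(sample, 'distribution_channel')
print(channel_rates)
```

On the notebook's data this would be called as `cancellation_rate_by(df, 'distribution_channel')`.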

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on attracting more bookings through the Direct and Corporate distribution channels, as these channels have a lower cancellation rate.
- The hotel management can also consider offering special discounts or promotions to guests who book through these channels to encourage them to book more often.
- Additionally, the hotel management can consider working with TA/TO to improve the quality of their bookings and to reduce the cancellation rate.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 12 Relationship between repeated guests and previous booking not cancelled

In [None]:
# Chart - 12 visualization code
repeated_guests_df=df[df['is_repeated_guest']==1]
repeated_guests_df_1=df[df['is_repeated_guest']==0]
plt.figure(figsize=(8,5))
sns.barplot(x=df['is_repeated_guest'],y= df['previous_bookings_not_canceled'])
plt.xticks([0,1],['Not_repeated_guests','repeated_guests'],fontsize=15)
plt.title('Relationship Between repeated guests and previous bookings not cancelled.')
plt.show()

**Percentage of booking cancellation**

In [None]:
# booking canceled=1
# booking not canceled= 0

# creating new DataFrame where bookings are cancelled.
canceled_df=df[df['is_canceled']==1]

# Grouping by hotel
canceled_df=canceled_df.groupby('hotel').size().reset_index().rename(columns={0: "no_of_cancelled_bookings"})

# adding 'total_bookings' column for calculating the percentage
# (both groupby results are sorted by hotel, so the rows line up)
canceled_df['total_bookings']=df.groupby('hotel').size().values
canceled_df

# plotting the bar chart
plt.figure(figsize=(8,5))
sns.barplot(x=canceled_df['hotel'],y=canceled_df['no_of_cancelled_bookings']*100/canceled_df['total_bookings'])

#set labels
plt.xlabel('Hotel type')
plt.ylabel('Percentage(%)')
plt.title("Percentage of booking cancellation")

##### 1. Why did you pick the specific chart?

The bar chart is an effective way to visualize the relationship between two categorical variables. In this case, it helps us to see the relationship between repeated guests and previous bookings not cancelled.

##### 2. What is/are the insight(s) found from the chart?

- The first chart shows that repeated guests are more likely to have previous bookings that were not cancelled.
- This suggests that repeated guests are more likely to be satisfied with their stay at the hotel and are more likely to return in the future.
- The second chart shows that the City Hotel has a higher percentage of booking cancellations than the Resort Hotel.
- This suggests that guests who stay at the City Hotel may be more likely to cancel their reservations than guests who stay at the Resort Hotel.
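
By default, `sns.barplot` draws the mean of the y variable for each x category, so the bar heights in the first chart can be reproduced as a small table. This is a minimal sketch, assuming the `is_repeated_guest` and `previous_bookings_not_canceled` columns used in the chart code; `mean_prior_stays_by_guest_type` is a hypothetical helper name and `sample` is made-up data.

```python
import pandas as pd

def mean_prior_stays_by_guest_type(df: pd.DataFrame) -> pd.Series:
    """Mean number of previous non-cancelled bookings for new vs repeated guests."""
    return (df.groupby('is_repeated_guest')['previous_bookings_not_canceled']
              .mean()
              .rename(index={0: 'Not repeated', 1: 'Repeated'}))

# Tiny made-up sample (not the real dataset)
sample = pd.DataFrame({
    'is_repeated_guest':              [0, 0, 0, 1, 1],
    'previous_bookings_not_canceled': [0, 0, 0, 2, 4],
})
means = mean_prior_stays_by_guest_type(sample)
print(means)
```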

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights will help create a positive business impact.

- The hotel management can focus on attracting more repeated guests, as these guests are more likely to be satisfied with their stay and to return in the future.
- The hotel management can also consider offering special discounts or promotions to repeated guests to encourage them to book more often.
- The hotel management can consider making changes to the City Hotel to make it more appealing to guests and to reduce the cancellation rate.

By taking these steps, the hotel business can increase its revenue and profitability.

There are no insights from the above plot that would lead to negative growth.

#### Chart - 13  Plotting Histogram

In [None]:
# Chart - 13 visualization code
df.hist(figsize=(24,18))
plt.show()

##### 1. Why did you pick the specific chart?

I used histograms to get a clear overview of the data. A histogram summarizes discrete or continuous data measured on an interval scale and shows the main features of a distribution in a convenient form. It also works well with large data sets and can help detect unusual observations (outliers) or gaps in the data. Here the histograms show how each numeric variable is distributed across the whole dataset and whether that distribution is symmetric.

##### 2. What is/are the insight(s) found from the chart?

Some insights from the chart are as follows:

-  The highest number of guests arrived in 2016.

-  Arrivals peak around week number 30.

-  Most arrivals happen towards the end of the month.

-  Most guests come without children.

-  Very few guests require a car parking space.
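
Readings taken from histogram bars can be backed up with exact figures. The snippet below is a minimal sketch that reports the most frequent value of a column and its share of rows; `most_common_value` is a hypothetical helper name, `sample` is made-up data, and the column names are assumed to match those plotted in the histograms.

```python
import pandas as pd

def most_common_value(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Most frequent value of each column and the percentage of rows that have it."""
    rows = []
    for col in columns:
        shares = df[col].value_counts(normalize=True)  # sorted, most frequent first
        rows.append({
            'column': col,
            'most_common': shares.index[0],
            'share_pct': round(shares.iloc[0] * 100, 1),
        })
    return pd.DataFrame(rows).set_index('column')

# Tiny made-up sample (not the real dataset)
sample = pd.DataFrame({
    'arrival_date_year':           [2016, 2016, 2015, 2017],
    'children':                    [0, 0, 0, 1],
    'required_car_parking_spaces': [0, 0, 0, 0],
})
summary = most_common_value(sample, list(sample.columns))
print(summary)
```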

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A histogram on its own does not indicate business impact; it only shows how the values of each column are distributed across the dataset.

#### Chart - 14 - Correlation Heatmap

In [None]:
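# Correlation Heatmap visualization code
plt.figure(figsize=(18,10))
# Correlation is only defined for numeric columns; selecting them explicitly
# avoids an error on text columns in recent pandas versions
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)
plt.title('Correlation of the columns')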

##### 1. Why did you pick the specific chart?

The correlation heatmap is a visual representation of the correlation coefficients between all pairs of columns in a data set. It is used to identify variables that are positively or negatively correlated, as well as the strength of those correlations.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that the highest correlation between two different columns is 0.95 and the lowest is -0.51.
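
To see which column pairs produce the extreme values in the heatmap, the correlation matrix can be flattened into a ranked list of pairs. The snippet below is a minimal sketch: `ranked_correlations` is a hypothetical helper name and `sample` is made-up data with columns `a`, `b`, `c` chosen only so the result is easy to check.

```python
import numpy as np
import pandas as pd

def ranked_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of every pair of distinct numeric columns, highest first."""
    corr = df.select_dtypes(include='number').corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False)

# Tiny made-up sample: b rises with a, c falls with a; the text column is ignored
sample = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],
    'c': [4, 3, 2, 1],
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel', 'Resort Hotel'],
})
pairs = ranked_correlations(sample)
print(pairs)
```

On the notebook's data, `ranked_correlations(df).head()` and `.tail()` would list the most positively and most negatively correlated pairs.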

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select the columns for the pair plot
columns_for_pairplot = ['lead_time', 'adr', 'total_stay', 'is_repeated_guest', 'previous_cancellations', 'booking_changes', 'total_of_special_requests','is_canceled']

# Create a new DataFrame with only the selected columns
df_selected = df[columns_for_pairplot]

# Create the pair plot
sns.pairplot(df_selected)

# Display the plot
plt.show()



##### 1. Why did you pick the specific chart?

The pair plot is a useful tool for visualizing the relationships between multiple variables in a data set. It can help to identify patterns and trends in the data, as well as to identify any outliers.

##### 2. What is/are the insight(s) found from the chart?

This graph shows the pairwise relationships between lead_time, adr, total_stay, is_repeated_guest, previous_cancellations, booking_changes, total_of_special_requests and is_canceled.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

-->  Based on the insights gained from the data analysis, the following suggestions are made to the client to achieve their business objective:

1. Increase bookings:
    - Target marketing campaigns to countries with the highest number of guests.
    - Offer discounts and promotions to attract new guests.
    - Work with travel agents and online booking platforms to increase visibility.

2. Decrease cancellations:
    - Improve communication with guests to ensure they have all the information they need before booking.
    - Offer flexible cancellation policies.
    - Work with travel agents and online booking platforms to improve the booking process.

3. Increase customer retention:
    - Implement a loyalty program to reward repeat guests.
    - Offer special discounts and promotions to repeat guests.
    - Provide excellent customer service to ensure guests have a positive experience.

4. Extend stays:
    - Offer discounts for longer stays.
    - Create packages that include activities and attractions in the local area.
    - Provide amenities and services that encourage guests to stay longer.



# **Conclusion**

This project analyzed hotel booking data to gain insights and make recommendations to improve business performance. The analysis revealed several key findings:

1. City hotels are more popular than resort hotels, but have a higher cancellation rate.
2. Repeat guests are more likely to have previous bookings that were not canceled.
3. The majority of bookings come through TA/TO (Travel Agents/Tour Operators), followed by Direct and Corporate.
4. The percentage of booking cancellation is higher for TA/TO than for Direct and Corporate bookings.
5. The average lead time for resort hotels is higher than for city hotels.
6. The average ADR (Average Daily Rate) for city hotels is higher than for resort hotels.
7. The majority of guests are transient (82.4%).
8. The optimal stay in both types of hotels is less than 7 days.

Based on these findings, several recommendations were made to the client to achieve their business objectives:

1. Increase bookings by targeting marketing campaigns to countries with the highest number of guests and offering discounts and promotions.
2. Decrease cancellations by improving communication with guests and offering flexible cancellation policies.
3. Increase customer retention by implementing a loyalty program and providing excellent customer service.
4. Extend stays by offering discounts for longer stays and creating packages that include activities and attractions in the local area.

By implementing these recommendations, the hotel can improve its business performance and achieve its objectives.

