<a href="https://colab.research.google.com/github/pratiktamgadge/project-2/blob/main/EDA_Hotel_Booking_Project_By_Pratik.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Name**    - **Hotel Booking Analysis**

---




##### **Project Type**    - EDA
##### **Contribution**    - Individual/(Pratik Tamgadge)


# **Project Summary -**

This project aims to conduct an exploratory data analysis (EDA) on a hotel bookings dataset. This dataset includes information on bookings at two hotels over a specific period, detailing guest profiles, booking details, hotel specifics, and information on cancellations and no-shows.

The EDA will seek to provide insights and answer the following questions:

* Which type of hotel is most preferred by guests?
* How does lead time for booking vary across different arrival years, and what are the associated cancellation trends?
* Which country has the highest number of guests?
* What is the distribution of preferred meal types among hotel guests?
* What is the relationship between the average daily rate (ADR) and the total number of people?
* Which year recorded the highest number of bookings?
* Which distribution channel contributes most to ADR, thus increasing income?
* How does ADR vary across different months?
* Which distribution channel has the highest cancellation rate?
* Which distribution channel has the highest count of repeat guests?
* What is the distribution of market segments and distribution channels in hotel bookings?
* How are guests with babies distributed across different hotel types, and what are their booking cancellation rates?
* What is the distribution of hotel arrivals by year and month?

The EDA process will begin with data cleaning to address any missing or erroneous data. Following this, the data will be explored using a variety of visualizations, such as scatter plots, histograms, and heatmaps, to identify patterns and relationships.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. Explore and analyse the data to discover important factors that govern the bookings.

#### **Define Your Business Objective?**

The business objective is to understand and visualize the dataset from both the hotel's and the customer's point of view:

1) Reasons for booking cancellations

2) Best time to book a hotel

3) Peak season

4) Suggestions to reduce cancellations

5) Ways to increase hotel revenue

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import data analysis Libraries
import numpy as np
import pandas as pd
import warnings

# for visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#for plotting
import plotly.express as px

### Dataset Loading

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path = "/content/drive/MyDrive/Hotel Bookings (3).csv"
hotel_booking_df = pd.read_csv(path)

### Dataset First View

In [None]:
# first 5 rows of the dataset
hotel_booking_df.head()

In [None]:
# view of last 5 rows of the data
hotel_booking_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Total number of rows : {hotel_booking_df.shape[0]}\nTotal number of columns :{hotel_booking_df.shape[1]}')


### Dataset Information

In [None]:
# Dataset Info
hotel_booking_df.info()

#### Duplicate Values

In [None]:
# Count of duplicate values
print(f'Total number of duplicate values : {hotel_booking_df.duplicated().sum()}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_value_count = hotel_booking_df.isna().sum().sort_values(ascending = False)

# Print only columns with missing values
null_value_count = null_value_count[null_value_count>0]
print('Number of null values in each column that has missing data:\n', null_value_count)

In [None]:
# Visualizing the missing values
# create heatmap
plt.figure(figsize=(15, 4))

sns.heatmap(hotel_booking_df.isnull())
plt.show()

### What did you know about your dataset?

This dataset contains a single file which compares various booking information between two hotels: a city hotel and a resort hotel. It includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. The dataset contains a total of 119390 rows and 32 columns. It contains 31994 duplicate rows, which are removed later. It also has null values in the company, agent, country and children columns. The company column has an extremely large number of null values compared to the other columns.
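The duplicate and missing-value checks above can be bundled into one small helper. The following is only a sketch: the function name `summarize_quality` and the toy frame are illustrative assumptions, not part of the original notebook, and the toy values are made up.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row count, duplicate-row count and per-column missing-value percentages."""
    null_pct = df.isna().mean().mul(100).round(2)
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # keep only columns that actually have missing values, largest first
        "null_pct": null_pct[null_pct > 0].sort_values(ascending=False).to_dict(),
    }

# Toy frame standing in for hotel_booking_df (hypothetical values)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "children": [0.0, 0.0, None, 1.0],
    "company": [None, None, None, 40.0],
})
quality = summarize_quality(toy_df)
print(quality)
```

Reporting missing values as percentages rather than raw counts makes it easier to see at a glance which column is a candidate for dropping.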

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(hotel_booking_df.columns)

In [None]:
# Dataset Describe
hotel_booking_df.describe()

### Variables Description

***The columns and the data they represent are listed below:***
        
1. **hotel :** Name of the hotel (Resort Hotel or City Hotel)

2. **is_canceled :** If the booking was canceled (1) or not (0)

3. **lead_time :** Number of days between the booking date and the arrival date

4. **arrival_date_year :** Year of arrival date

5. **arrival_date_month :** Month of arrival date

6. **arrival_date_week_number :** Week number of year for arrival date

7. **arrival_date_day_of_month :** Day of arrival date

8. **stays_in_weekend_nights :** Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

9. **stays_in_week_nights :** Number of weeknights (Monday to Friday) spent at the hotel by the guests.

10. **adults :** Number of adults among guests

11. **children :** Number of children among guests

12. **babies :** Number of babies among guests

13. **meal :** Type of meal booked

14. **country :** Country of guests

15. **market_segment :** Designation of market segment

16. **distribution_channel :** Name of booking distribution channel

17. **is_repeated_guest :** If the booking was from a repeated guest (1) or not (0)

18. **previous_cancellations :** Number of previous bookings that were cancelled by the customer prior to the current booking

19. **previous_bookings_not_canceled :** Number of previous bookings not cancelled by the customer prior to the current booking

20. **reserved_room_type :** Code of room type reserved

21. **assigned_room_type :** Code of room type assigned

22. **booking_changes :** Number of changes/amendments made to the booking

23. **deposit_type :** Type of the deposit made by the guest

24. **agent :** ID of travel agent who made the booking

25. **company :** ID of the company that made the booking

26. **days_in_waiting_list :** Number of days the booking was in the waiting list

27. **customer_type :** Type of customer, assuming one of four categories

28. **adr :** Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

29. **required_car_parking_spaces :** Number of car parking spaces required by the customer

30. **total_of_special_requests :** Number of special requests made by the customer

31. **reservation_status :** Reservation status (Canceled, Check-Out or No-Show)

32. **reservation_status_date :** Date at which the last reservation status was set










### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(hotel_booking_df.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

Dropping duplicates

In [None]:
# Dropping the duplicate values
hotel_booking_df.drop_duplicates(inplace = True)

In [None]:
# after removing duplicate , count of total rows
print(f'After removing duplicates , number of rows are :  {hotel_booking_df.shape[0]}')

 Replacing null values

In [None]:
# dropping the company column as it has the highest number of null values
hotel_booking_df.drop(['company'], axis=1, inplace=True)

# replacing null values in the children column with 0 as there are only 4 missing values, assuming that those families had 0 children
# replacing null values in the agent column with 0, assuming those rooms were booked without a company/agent
hotel_booking_df[['children','agent']] = hotel_booking_df[['children','agent']].fillna(0)

In [None]:
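# replacing null values in the country column with 'Others', assuming that the user did not find their country name
# (assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about)
hotel_booking_df['country'] = hotel_booking_df['country'].fillna('Others')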

In [None]:
# checking that now there are no null values
hotel_booking_df.isna().sum().sort_values(ascending=False)

In [None]:
# Some rows have a combined total of adults, children and babies equal to zero.
# This means no guests were actually booked,
# so we can remove such rows.

In [None]:
hotel_booking_df.drop(hotel_booking_df[hotel_booking_df['adults']+hotel_booking_df['babies']+hotel_booking_df['children'] == 0].index, inplace = True)

In [None]:
# verify: number of rows that still have 0 in the adults, babies and children columns combined
zero_entry=hotel_booking_df[hotel_booking_df['adults']+hotel_booking_df['babies']+hotel_booking_df['children'] == 0].shape[0]
print(f"There are {zero_entry} rows which has zero value")

### Changing data type

In [None]:
#showing the info of the data to check datatype
hotel_booking_df.info()

In [None]:
# The children & agent columns have a float datatype but contain only integer values, so change the datatype to int64
hotel_booking_df[['children', 'agent']] = hotel_booking_df[['children', 'agent']].astype('int64')
hotel_booking_df[['children', 'agent']]

In [None]:
# for readability, replace the values (0, 1) in the 'is_canceled' column with 'not canceled' and 'is canceled'
hotel_booking_df['is_canceled'] = hotel_booking_df['is_canceled'].replace([0,1], ['not canceled', 'is canceled'])
hotel_booking_df['is_canceled']

In [None]:
#Same for 'is_repeated_guest' column
hotel_booking_df['is_repeated_guest'] = hotel_booking_df['is_repeated_guest'].replace([0,1], ['not repeated', 'repeated'])
hotel_booking_df['is_repeated_guest']

## Addition of new columns

In [None]:
#total stay in nights
# We have created a col for total stays in nights by adding week night & weekend nights stay col.
hotel_booking_df['total_stay_in_nights'] = hotel_booking_df ['stays_in_week_nights'] + hotel_booking_df ['stays_in_weekend_nights']
hotel_booking_df['total_stay_in_nights']

In [None]:
# We have created a col for revenue using total stay * adr
hotel_booking_df['revenue'] = hotel_booking_df['total_stay_in_nights'] *hotel_booking_df['adr']
hotel_booking_df['revenue']

In [None]:
# Also, for information, we will add a column with total number of guests coming for each booking
hotel_booking_df['total_guest'] = hotel_booking_df['adults'] + hotel_booking_df['children'] + hotel_booking_df['babies']
hotel_booking_df['total_guest']

### What all manipulations have you done and insights you found?

**1**) Dropping duplicates: The dataset contained 31994 duplicate rows. These were dropped first to clean the data, leaving 87396 rows.



**2**) Handling null values: The columns 'children', 'country', 'agent' and 'company' contained null values. The 'company' column was dropped because it has the highest number of null values. For the remaining columns:


A)   Null values in the 'agent' column were replaced with 0, assuming those rooms were booked without a company/agent.

B)   Null values in the 'children' column were replaced with 0, as there are only 4 missing values, assuming that those families had 0 children.

C)   Null values in the 'country' column were replaced with 'Others' to avoid confusion.




**3**) Some rows had a combined total of adults, children and babies equal to zero, which means no guests were actually booked. These 166 rows were removed.

**4**) The 'children' and 'agent' columns had a float datatype but contain only integer values, so their datatype was changed to int64.

**5**) For readability, the values (0, 1) in the 'is_canceled' column were replaced with 'not canceled' and 'is canceled', and the same was done for the 'is_repeated_guest' column ('not repeated' and 'repeated').

**6)** A few columns needed for the analysis were created from the given columns.

Total Stay in Nights: This column is the sum of the week nights and weekend nights stayed.

Total Guests: This column helps to evaluate the volume of guests and the revenue as well. It is the sum of the number of adults, children and babies.

Revenue: This column is calculated by multiplying adr by the total stay in nights. It is used to analyse the profit and growth of each hotel.
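The manipulations listed above can also be collected into a single function, which makes the cleaning repeatable on a fresh copy of the data. This is a minimal sketch under stated assumptions: the name `clean_bookings` and the toy frame are hypothetical, and the final `reset_index` is an extra convenience that the notebook itself does not apply.

```python
import pandas as pd

def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above and return a new frame."""
    out = df.drop_duplicates().copy()
    # 'company' is mostly null, so it is dropped (ignored if already absent)
    out = out.drop(columns=["company"], errors="ignore")
    # missing children/agent are assumed to mean 0, then stored as integers
    for col in ["children", "agent"]:
        out[col] = out[col].fillna(0).astype("int64")
    out["country"] = out["country"].fillna("Others")
    # derived columns
    out["total_guest"] = out["adults"] + out["children"] + out["babies"]
    out["total_stay_in_nights"] = out["stays_in_week_nights"] + out["stays_in_weekend_nights"]
    out["revenue"] = out["total_stay_in_nights"] * out["adr"]
    # bookings with no guests at all are removed
    out = out[out["total_guest"] > 0]
    return out.reset_index(drop=True)

# Toy data with one duplicate row and one zero-guest row (hypothetical values)
raw = pd.DataFrame({
    "adults": [2, 2, 0, 1],
    "children": [0.0, 0.0, 0.0, None],
    "babies": [0, 0, 0, 0],
    "agent": [9.0, 9.0, None, None],
    "company": [None, None, None, None],
    "country": ["PRT", "PRT", "GBR", None],
    "stays_in_week_nights": [2, 2, 1, 3],
    "stays_in_weekend_nights": [1, 1, 0, 2],
    "adr": [100.0, 100.0, 80.0, 50.0],
})
cleaned = clean_bookings(raw)
print(cleaned[["country", "total_guest", "total_stay_in_nights", "revenue"]])
```

Returning a new frame instead of modifying `hotel_booking_df` in place keeps the raw data available for comparison, at the cost of some extra memory.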

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Which type of hotel is preferred by guests: resort or city?**

In [None]:
# count occurence of each hotel type in hotel column
hotel_counts = hotel_booking_df['hotel'].value_counts()

In [None]:
# visualize the data
# Plot a pie chart
colors = ['#FFFF00', '#FF0000',]
plt.figure(figsize = (6,7))
plt.pie(hotel_counts, labels=hotel_counts.index, autopct='%1.2f%%', startangle=50, explode=[0.05, 0.05], textprops={'fontsize': 14}, colors=colors)
plt.title('Preference for Hotel Types', fontsize=15, fontweight='bold')
plt.legend(title='Hotel Type',bbox_to_anchor=(1.3, 0.7))

# display the pie chart
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows each category as a fraction of the whole, which makes it easy to compare the share of bookings between the two hotel types.

##### 2. What is/are the insight(s) found from the chart?

The visualization clearly shows that the City Hotel has the higher share of bookings, i.e. 61.07%, while the Resort Hotel has only 38.93%.
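The percentages printed on the pie slices can also be computed directly, which is handy for quoting them in text. The sketch below uses a hypothetical helper `share_table` and a made-up toy series rather than the real `hotel` column.

```python
import pandas as pd

def share_table(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category."""
    counts = series.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy stand-in for hotel_booking_df['hotel'] (hypothetical values)
toy_hotels = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")
hotel_share = share_table(toy_hotels)
print(hotel_share)
```

The same helper would apply to any categorical column, for example the repeated-guest or distribution-channel columns.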

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight supports a positive business impact: the City Hotel can provide more services to increase its revenue, while the Resort Hotel can improve its services to attract more customers.

#### Chart - 2

**How many guests are repeat guests?**

In [None]:
# count occurrences of repeated and non-repeated guests in the is_repeated_guest column
guest_repeat = hotel_booking_df['is_repeated_guest'].value_counts()

In [None]:
# Chart - 2
# visualize the data
# Plotting a pie chart
colors = ['#FFFF00', '#FF0000']
plt.figure(figsize = (7,7))
# pass the colors list so the palette defined above is actually used
plt.pie(guest_repeat, labels=guest_repeat.index, autopct='%1.2f%%', startangle=140, explode=[0.2, 0.09], textprops={'fontsize': 14}, colors=colors)
plt.title('Repeated Customers vs Non Repeated Customers', fontsize=15, fontweight='bold')
plt.legend()
# display the pie chart
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows the percentage share of repeated and non-repeated guests, which makes the comparison easy.

##### 2. What is/are the insight(s) found from the chart?

The number of repeated guests is very low compared to overall guests: only 3.86%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Taking feedback from guests can help the hotel improve its services. Seasonal discounts and offers can attract non-repeating customers to visit the hotel again, which will have a positive impact on business.

#### Chart - 3

**What is the percentage of cancellation?**

In [None]:
# count of canceled and non-canceled bookings
cancel_count = hotel_booking_df['is_canceled'].value_counts()

In [None]:
# Chart - 3
# visualize the data
# Plotting a pie chart
colors =['#800080','#FFB6C1']
plt.figure(figsize = (6,7))
plt.pie(cancel_count, labels=cancel_count.index, autopct='%1.1f%%', startangle=20, explode=[0.05, 0.05], textprops={'fontsize': 12}, colors=colors)
plt.title('Cancellation by guests', fontsize=15, fontweight='bold')
plt.legend()
# display the pie chart
plt.show()

##### 1. Why did you pick the specific chart?

This chart presents the cancellation rate of the hotel bookings.

##### 2. What is/are the insight(s) found from the chart?

Guests who cancelled are around 27.5% and those who did not cancel are 72.5%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Around 72.5% of guests did not cancel their booking, which is positive for the hotel. The cancellation rate is 27.5%, which is quite high from a business perspective. We need to find out the reasons for cancellation by taking feedback from guests.
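As a first step towards finding where cancellations come from, the cancellation rate can be broken down by a grouping column such as hotel type. This is only a sketch on made-up toy data, using the string labels introduced in the data wrangling section; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for hotel_booking_df after the wrangling step (hypothetical values)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel"],
    "is_canceled": ["is canceled", "not canceled", "is canceled", "not canceled",
                    "not canceled", "not canceled"],
})

# Boolean flag: True where the booking was canceled
canceled_flag = toy["is_canceled"].eq("is canceled")

# The mean of a boolean flag is the share of True values
overall_rate = round(float(canceled_flag.mean() * 100), 1)
rate_by_hotel = canceled_flag.groupby(toy["hotel"]).mean().mul(100).round(1)

print(f"Overall cancellation rate: {overall_rate}%")
print(rate_by_hotel)
```

Replacing `toy["hotel"]` with another column, such as the distribution channel, would give the same breakdown along a different dimension.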

#### Chart - 4

**Which type of food is mostly preferred by the guests?**

In [None]:
# Chart - 4
# count occurrences of each meal type
meal_count=hotel_booking_df['meal'].value_counts()
meal_count

In [None]:
#visualize the data
plt.figure(figsize=(10,6))
sns.countplot(x='meal', data=hotel_booking_df, order=meal_count.index, hue='meal', palette="muted", legend=False)
plt.title("Most preferred Food",fontsize=15, fontweight='bold')
plt.xlabel('Type of the food', fontsize = 15)
plt.ylabel('Food type count', fontsize = 15)
plt.show()

Types of meal in hotels:

* BB - (Bed and Breakfast)
* HB - (Half Board)
* FB - (Full Board)
* SC - (Self Catering)

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the most preferred food because it displays the count of observations in each category, and here we want to find which food is most preferred.

##### 2. What is/are the insight(s) found from the chart?

Bed and Breakfast is the most preferred food, chosen by 67907 guests.

Self Catering and Half Board are preferred about equally, but far less than BB.

Full Board is the least preferred food, i.e. only 360 guests ordered it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The BB meal type is the most preferred, which makes a positive impact on business.
The Undefined and FB meal types are less preferred, which makes a negative impact on business.

#### Chart - 5

**What is most preferred room type by guests?**

In [None]:
# count of room type booked by guest
room_count=hotel_booking_df['assigned_room_type'].value_counts()
room_count

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,6))

sns.countplot(x='assigned_room_type', data=hotel_booking_df, order=room_count.index, hue='assigned_room_type', palette='muted', legend=False)

plt.title("Most preferred Room type", fontsize=15, fontweight='bold')
plt.xlabel('Type of the Room', fontsize=15)
plt.ylabel('Room type count', fontsize=15)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot to visualize the most preferred room type because a countplot displays the count of observations in each category, and here we want to represent room type vs room type count.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that A type rooms are the most preferred, with a count of 46283, followed by D type rooms, with a count of 22419.

The least preferred rooms are K and L. K type rooms were preferred by only 185 guests, and only 1 guest booked an L type room.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A type rooms are the most preferred rooms. This makes a positive impact on business.
H, I, K and L type rooms are less preferred, which makes a negative impact.
This is because type A rooms have 46283 bookings and the type L room has only one booking.

#### Chart - 6

**Which agent made the most bookings?**

In [None]:
# count the top 10 agents with most bookings
# name the columns explicitly so the result is the same across pandas versions
top_agent = hotel_booking_df['agent'].value_counts().rename_axis('agent').reset_index(name='num_of_bookings').head(10)
top_agent

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10,7))
sns.countplot(x='agent', data=hotel_booking_df, order=hotel_booking_df['agent'].value_counts().index[:10], hue='agent', palette='muted', legend=False)

plt.title('Top 10 agents with most bookings', fontsize=15, fontweight='bold')
plt.ylabel('Number of bookings', fontsize=15)
plt.xlabel('Agent number', fontsize=15)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot here because it presents the data in pictorial form, which makes comparing the agents easy.

##### 2. What is/are the insight(s) found from the chart?

The insight found here is that agent no. 9 made the most bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, agents no. 9 and 240 have the most bookings, which makes a positive impact.
* Agents no. 1 and 6 have fewer bookings, which makes a negative impact.
* Bookings made by agents no. 1 and 6 are about 4.27% of those made by agent no. 9, which has the highest number of bookings.

#### Chart - 7

**Which country do most of the guests come from?**


In [None]:
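# Count occurrences of the top 10 countries in the country column
# name the columns explicitly so that 'country' and 'count' exist across pandas versions
top_ten_country = hotel_booking_df['country'].value_counts().rename_axis('country').reset_index(name='count').head(10)
top_ten_country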

In [None]:
# Visualizing by  plotting the graph
plt.figure(figsize=(10,6))

sns.barplot(x='country', y='count', data=top_ten_country, palette='muted', hue='country', dodge=False)
plt.legend([],[], frameon=False)  # Disable the legend

plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of guests', fontsize=12)
plt.title("Top 10 countries by guests", fontsize=15, fontweight='bold')

plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are effective for comparing numerical values, such as the number of guests in this case


##### 2. What is/are the insight(s) found from the chart?

Most of the guests come from Portugal, i.e. around 25000 guests are from Portugal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This data can be used to attract more guests from Portugal and other neighbouring countries. Some schemes may be introduced, or social media awareness and advertising may be increased, to get more customers from nearby areas.

Feedback gathered from these guests may be used to increase guests from other countries too, and strategies may be planned accordingly.

Abbreviations for countries:

* PRT - Portugal
* GBR - United Kingdom
* FRA - France
* ESP - Spain
* DEU - Germany
* ITA - Italy
* IRL - Ireland
* BEL - Belgium
* BRA - Brazil
* NLD - Netherlands


#### Chart - 8

**Which distribution channel is mostly used for hotel booking?**

In [None]:
# count of occurences of each distribution channel
distribution_channel = hotel_booking_df['distribution_channel'].value_counts()
distribution_channel

In [None]:
# Chart - 8 visualization
plt.figure(figsize=(10,7))

sns.countplot(x='distribution_channel', data=hotel_booking_df, order=distribution_channel.index, hue='distribution_channel', palette='muted', legend=False)

plt.title("Mostly used distribution Channels", fontsize=15)
plt.xlabel('Distribution Channel', fontsize=15)
plt.ylabel('Count of Booking by distribution channel', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows the volume of bookings made through each distribution channel, with the channels ordered from most used to least used.

##### 2. What is/are the insight(s) found from the chart?

The most used distribution channel is TA/TO, with a total booking count of 69028.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* TA/TO is the most used distribution channel, which makes a positive impact.
* GDS and Undefined are the least used distribution channels, which makes a negative impact.
* Use of TA/TO is 79.13% and use of GDS is 0.21%.
* Other channels can provide the facilities that are provided by the TA/TO channel.

#### Chart - 9

**Which year has highest number of bookings?**

In [None]:
# Chart - 9 visualization code
custom_palette = ['#FF9999', '#66B3FF']
# set plot size
plt.figure(figsize=(10,8))
# plot with countplot
sns.countplot(x=hotel_booking_df['arrival_date_year'],hue=hotel_booking_df['hotel'],palette=custom_palette)
plt.title("Year Wise bookings")
plt.show()

##### 1. Why did you pick the specific chart?

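A countplot with a hue shows the number of bookings per arrival year for each hotel type side by side, which makes it easy to understand and compare.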

##### 2. What is/are the insight(s) found from the chart?

2016 had the highest number of bookings and 2015 had the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

For the Resort Hotel, 2015 had the fewest bookings. Bookings increased in 2016 but decreased again in 2017, so the hotel should find out the reasons for this using guest feedback.
For the City Hotel, bookings were lowest in 2015, peaked in 2016 and decreased again in 2017.

#### Chart - 10

**What is the ADR across different months?**

In [None]:
# Chart - 10 visualization code
# Using the groupby function to get the mean ADR per month and hotel
bookings_by_months_df = hotel_booking_df.groupby(['arrival_date_month', 'hotel'])['adr'].mean().reset_index()

# Create month list
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# It will take the order of the month list in the dataframe along with values
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'], categories = months, ordered = True)

# Sorting values
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')

In [None]:
# Setting the chart size
plt.figure(figsize=(15,5))

# Plotting the values in a line chart
sns.lineplot(x=bookings_by_months_df['arrival_date_month'],y=bookings_by_months_df['adr'],hue=bookings_by_months_df['hotel'])

# Setting the labels and title
plt.title('ADR across each month', fontsize=20)
plt.xlabel('Month Name', fontsize=12)
plt.ylabel('ADR', fontsize=12)

# Show chart
plt.show()

##### 1. Why did you pick the specific chart?

I used a line plot in this case because it effectively shows the trend of ADR over time.
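A line plot over months only reads as a trend if the months are in calendar order, which is why the month column is converted to an ordered categorical before sorting. The sketch below illustrates the effect on made-up toy data: a plain groupby sorts month names alphabetically, while the ordered categorical restores calendar order.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Toy stand-in for the booking data (hypothetical values)
toy = pd.DataFrame({
    "arrival_date_month": ["July", "January", "March", "July"],
    "adr": [120.0, 60.0, 80.0, 140.0],
})

# groupby on plain strings returns the months in alphabetical order
monthly_adr = toy.groupby("arrival_date_month")["adr"].mean().reset_index()
alphabetical_order = monthly_adr["arrival_date_month"].tolist()

# An ordered categorical makes sort_values follow the calendar instead
monthly_adr["arrival_date_month"] = pd.Categorical(
    monthly_adr["arrival_date_month"], categories=months, ordered=True
)
monthly_adr = monthly_adr.sort_values("arrival_date_month").reset_index(drop=True)

print(alphabetical_order)
print(monthly_adr)
```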

##### 2. What is/are the insight(s) found from the chart?

City Hotel : The ADR of the City Hotel is highest in May, so the hotel earns the most per booked night in that month compared to other months.

Resort Hotel : The ADR of the Resort Hotel is highest between July and August.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With the knowledge that the City Hotel's ADR peaks in May and the Resort Hotel's ADR peaks between July and August, hotels can focus marketing, promotions and special offers to attract more guests during these peak months.

#### Chart - 11

**Which month has the most bookings in each hotel type?**

In [None]:
# Chart - 11 visualization code
custom_palette = ['#6A5ACD', '#B22222']
plt.figure(figsize=(15,5))
# `months` is the calendar-ordered list created for Chart 10; it keeps the x-axis in calendar order
sns.countplot(x=hotel_booking_df['arrival_date_month'], hue=hotel_booking_df['hotel'], order=months, palette=custom_palette)
plt.title("Number of booking across months", fontsize = 25)
plt.show()

##### 1. Why did you pick the specific chart?

Count plots are used when you want to compare the counts of different categories. In this case, it allows us to visually compare the number of bookings made for each month by hotel type

##### 2. What is/are the insight(s) found from the chart?

The highest number of bookings appears to be in July and August.
December, January and February appear to be the months with the fewest bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The low bookings in December, January and February are a negative insight, but hotels can plan around this pattern: they can arrange everything in advance for the July and August peak to welcome their guests in the best way possible, and they can run promotional offers to attract more guests.

#### Chart - 12

**Which distribution channel has the highest ADR?**

In [None]:
# Chart - 12 visualization code
# Mean ADR for each distribution channel and hotel type
dist_channel_adr = hotel_booking_df.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()

# Visualization by using barplot
sns.barplot(x='distribution_channel', y='adr', data=dist_channel_adr, hue='hotel')
plt.title('ADR across Distribution channel', fontsize=20)
plt.xlabel('Distribution channel', fontsize=15)
plt.ylabel('ADR', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

I used a barplot here to visualise ADR across distribution channels because it gives an easy-to-understand view of a large amount of data.

##### 2. What is/are the insight(s) found from the chart?

The insight from the above chart is that the GDS channel has the highest ADR in the City Hotel, while Direct and TA/TO have a nearly equal ADR in both hotel types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


- The GDS distribution channel contributed more to ADR for the City Hotel.

- The Undefined distribution channel contributed more to ADR for the Resort Hotel, which makes a positive impact.

- The GDS distribution channel has no contribution to ADR for the Resort Hotel, and the Undefined distribution channel contributed less to ADR for the City Hotel, which makes a negative impact.

#### Chart - 13

**How does lead time vary across different hotel types?**

In [None]:
# Chart - 13 visualization code
sns.boxplot(data=hotel_booking_df, x='hotel', y='lead_time')
plt.xlabel('Hotel Type')
plt.ylabel('Lead Time')
plt.title('Lead Time Variation Across Hotel Types')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot shows the median, the interquartile range and the outliers of lead time (the duration between booking and arrival) for each hotel type, which makes the variation across hotel types easy to compare.
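The quantities a boxplot draws can also be computed as numbers, which helps when describing the chart. The sketch below uses made-up toy lead times and assumes the common 1.5 × IQR rule for the upper whisker, which is the default in seaborn and matplotlib; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the hotel and lead_time columns (hypothetical values)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "lead_time": [0, 10, 20, 30, 40, 0, 5, 10, 15, 200],
})

# Quartiles per hotel type, one column per quantile
lead_time_summary = (
    toy.groupby("hotel")["lead_time"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)
lead_time_summary["iqr"] = lead_time_summary["q3"] - lead_time_summary["q1"]

# Points above q3 + 1.5 * IQR are the ones drawn as outliers above the whisker
upper_limit = lead_time_summary["q3"] + 1.5 * lead_time_summary["iqr"]
toy["is_outlier"] = toy["lead_time"] > toy["hotel"].map(upper_limit)

print(lead_time_summary)
print(toy[toy["is_outlier"]])
```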

##### 2. What is/are the insight(s) found from the chart?

The boxplot compares the typical (median) lead time and its spread for each hotel type, which indicates how far in advance guests of each hotel tend to book.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By understanding the typical lead time for each hotel type, the business can set realistic expectations for guests, particularly about the time needed for booking confirmation and preparation before arrival.
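The numbers a boxplot draws can also be computed directly, which helps when "average" and "median" lead time differ. The sketch below uses a made-up DataFrame (`toy_df` and its values are hypothetical) and assumes the default whisker rule of 1.5 times the interquartile range.

```python
import pandas as pd

# Hypothetical toy data standing in for hotel_booking_df (values are made up)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 5 + ['Resort Hotel'] * 5,
    'lead_time': [0, 10, 20, 30, 200, 5, 15, 25, 35, 45],
})

# Quartiles shown by the box, plus the mean for comparison
lead_time_summary = (toy_df.groupby('hotel')['lead_time']
                     .describe()[['25%', '50%', '75%', 'mean']])

def count_upper_outliers(s):
    """Count values above Q3 + 1.5 * IQR, which a boxplot draws as outlier points."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return int((s > q3 + 1.5 * (q3 - q1)).sum())

outliers = toy_df.groupby('hotel')['lead_time'].apply(count_upper_outliers)

print(lead_time_summary)
print(outliers)
```

In the toy data a single very long lead time pulls the City Hotel mean well above its median, which is why the median is usually the safer figure to quote for skewed data such as lead time.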

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_columns = hotel_booking_df.select_dtypes(include=['int64', 'float64']).columns

# Calculating the correlation matrix
correlation_matrix = hotel_booking_df[numeric_columns].corr()

# Plotting the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap to visualize the correlation matrix because it is an effective way to represent the relationships between multiple numeric variables in a dataset. Heatmaps provide a clear, intuitive visual representation of how variables correlate with each other, with color gradients indicating the strength and direction (positive or negative) of these correlations.

##### 2. What is/are the insight(s) found from the chart?

- Total stay in nights is a central variable, highly correlated with both stays in week nights and stays in weekend nights, and strongly influencing revenue.

- Revenue is significantly influenced by ADR, total stay in nights, and total guests.

- Variables like lead time and previous cancellations show little to no correlation with most other variables, indicating their limited impact on other factors.
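With many numeric columns, an annotated heatmap becomes hard to read, so it can help to list the strongest pairs explicitly. The sketch below is one possible way to do that on a made-up DataFrame. The values are hypothetical, and the column names (`stays_in_week_nights`, `stays_in_weekend_nights`, `total_stay`) are assumptions about how the notebook names these fields.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up; column names are assumptions)
toy_df = pd.DataFrame({
    'stays_in_week_nights': [1, 2, 3, 4, 5, 2],
    'stays_in_weekend_nights': [0, 1, 1, 2, 2, 0],
    'lead_time': [30, 5, 60, 10, 45, 20],
})
toy_df['total_stay'] = toy_df['stays_in_week_nights'] + toy_df['stays_in_weekend_nights']

corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Long format, sorted by absolute correlation (strongest pairs first)
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs.head())
```

A derived column such as a total stay is built from the week and weekend night counts, so a high correlation between them is expected by construction rather than being a finding about guest behaviour.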

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
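# sns.pairplot creates its own figure, so a separate plt.figure call would only add an empty one
sns.set_style('whitegrid')
# Keep bookings with ADR below 500 to exclude extreme outliers
sns.pairplot(data=hotel_booking_df[hotel_booking_df['adr'] < 500][['hotel', 'adr', 'total_guest']], height=3, aspect=0.8, hue='hotel')
plt.show()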

##### 1. Why did you pick the specific chart?

A pair plot allows us to see both the distribution of single variables and the relationships between two variables. It is also a great method for identifying trends for follow-up analysis.

##### 2. What is/are the insight(s) found from the chart?

Here we can see that ADR reaches higher values for the City Hotel than for the Resort Hotel, and that ADR is higher when the number of guests is small (<= 5). As the number of guests increases, they tend to book the Resort Hotel.
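A visual impression from a pair plot can be checked against a grouped summary. The following sketch splits bookings at the group size of 5 mentioned above and compares mean ADR per hotel. `toy_df` and its values are made up for illustration, and the `group_size` labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for hotel_booking_df (values are made up)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'City Hotel',
              'Resort Hotel', 'Resort Hotel', 'Resort Hotel'],
    'total_guest': [1, 2, 6, 2, 7, 8],
    'adr': [110.0, 130.0, 90.0, 100.0, 70.0, 80.0],
})

# Same ADR filter as the pair plot
filtered = toy_df[toy_df['adr'] < 500].copy()

# Label each booking by whether the party has at most 5 guests
filtered['group_size'] = filtered['total_guest'].apply(
    lambda n: 'up to 5' if n <= 5 else 'more than 5')

# Mean ADR and number of bookings for each group size and hotel
summary = filtered.groupby(['group_size', 'hotel'])['adr'].agg(['mean', 'count'])

print(summary)
```

Including the booking count next to the mean makes it clear when a group is too small for its mean ADR to be meaningful.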

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Reasons for Booking Cancellations:

Hotels should look at why guests cancel, such as changed plans, better deals elsewhere, or dissatisfaction with some aspect of the booking. By asking guests directly through surveys, hotels can learn exactly why people cancel. This helps them make changes that retain more bookings and make guests happier.

Best Time to Book Hotel:

Hotels should look at when most people book and how far in advance they do it. By using data to predict when demand will be higher, hotels can adjust prices to attract more guests. They can also offer discounts and deals during periods when fewer people usually book.

Peak Season:

By knowing when the peak season falls, hotels can plan to have enough staff and rooms available when many people want to stay. They can also advertise more during these busy periods to increase revenue.

Suggestions to Reduce Cancellations:

Hotels can do things to make fewer people cancel their bookings. They can let guests change or cancel their bookings easily without extra fees. Hotels also need to communicate well with guests before and after they book to make sure guests know what to expect. This helps guests feel more comfortable and less likely to cancel.

Increase Revenue of Hotels:

Hotels can increase revenue by adjusting room prices based on demand and on what competing hotels are charging. They can also sell more to guests, such as upgraded rooms or special packages. By working with local businesses, hotels can offer unique deals that attract more guests and bring in more money.

# **Conclusion**

These are some of the conclusions drawn from the EDA:

- The City Hotel is the hotel most preferred by guests.

- The percentage of repeat guests is low, at 3.86%.

- The cancellation rate is 27%.

- The most preferred meal type is BB (Bed & Breakfast).

- Room type A is the most preferred room type.

- Agent no. 9 made the most bookings.

- Most guests come from Portugal.

- TA/TO is the most used distribution channel, at 79.13%.

- 2016 had the highest number of bookings for both the City Hotel and the Resort Hotel.

- The City Hotel generates more revenue in May, and the Resort Hotel generates more revenue between July and August.

- The GDS channel contributed the most to ADR for the City Hotel, while Direct and TA/TO made nearly equal contributions to ADR in both hotel types.
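Headline percentages like the ones above can each be computed in a single line, which makes them easy to re-check if the cleaning steps change. The sketch below runs on a made-up DataFrame, so its numbers are not the dataset's results. The column names `is_canceled` and `is_repeated_guest` are assumptions about the dataset's schema.

```python
import pandas as pd

# Hypothetical toy data (values are made up; column names are assumptions)
toy_df = pd.DataFrame({
    'is_canceled': [0, 1, 0, 0, 1, 0, 0, 0],
    'is_repeated_guest': [0, 0, 0, 1, 0, 0, 0, 0],
    'distribution_channel': ['TA/TO', 'TA/TO', 'Direct', 'TA/TO',
                             'TA/TO', 'GDS', 'TA/TO', 'TA/TO'],
})

# The mean of a 0/1 flag is the share of rows where the flag is 1
cancellation_rate = toy_df['is_canceled'].mean() * 100
repeat_guest_share = toy_df['is_repeated_guest'].mean() * 100

# Share of bookings per distribution channel, in percent
channel_share = toy_df['distribution_channel'].value_counts(normalize=True) * 100

print(f"Cancellation rate: {cancellation_rate:.2f}%")
print(f"Repeat guests: {repeat_guest_share:.2f}%")
print(channel_share.round(2))
```

Because these figures depend on which rows remain after duplicates and invalid records are dropped, they are best computed on the cleaned DataFrame used for the charts.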







### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***