# **Project Name**    - Hotel Bookings


##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Anant Alok


# **Project Summary -**

Hotel bookings have become an integral part of our daily lives, whether for business or leisure travel. With the advent of online booking platforms, consumers have access to a vast range of hotels, from budget to luxury, at their fingertips. According to recent statistics, there are around 700,000 hotels worldwide, and the number is constantly growing. Similarly, the online hotel booking industry is also on the rise, with a projected market size of USD 174.9 billion by 2023.

To succeed in this highly competitive market, it is crucial to understand the patterns and parameters of hotel booking from a business point of view. The datasets related to hotel bookings contain valuable information, such as customer reviews, ratings, prices, and availability, which can be analyzed to gain insights into customer behavior and preferences. In this project, we conducted an Exploratory Data Analysis (EDA) to identify such patterns and gain insights into customer behavior.

After cleaning the data, we performed a general analysis to understand the trends and patterns in hotel bookings. We focused on the type of hotel customers choose, how they accessed the stay (corporate booking, direct, or travel agent), the number of additional requirements, the deposit type, which countries most guests come from, and the number of weekend versus weekday bookings, and we explored the relationships between these factors.

After analyzing the data related to hotel bookings, we found several interesting patterns and trends. One of the most important factors that influence the customer's choice of hotel is the type of hotel they prefer. Our analysis revealed that customers tend to choose budget hotels over luxury hotels, indicating that price plays a significant role in their decision-making process.

We also analyzed how customers access their hotel stays, whether it is through corporate booking, direct bookings, or travel agencies (TA). We found that direct bookings are the most preferred method, followed by corporate bookings and TA bookings. This indicates that customers prefer to have control over their bookings and want to avoid extra fees associated with TA bookings.

Furthermore, our analysis showed that customers have varying additional requirements, such as room types, meals, and transportation, which they add to their bookings. We also found that customers prefer to book hotels with a refundable deposit, indicating that flexibility is a crucial factor for them.

In terms of location, we found that certain countries, such as the United States, China, and Germany, have the highest number of guests visiting. This information can be used by hotel businesses to tailor their marketing strategies and services to attract more customers from these countries.

Finally, we explored the relationship between the number of weekend versus weekday bookings and found that there is a higher demand for hotel bookings on weekends. This highlights the importance of managing hotel inventory and pricing strategies to optimize profits.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The increasing trend of online hotel booking has resulted in a highly competitive market, with a projected market size of USD 174.9 billion by 2023. To succeed in this market, it is crucial for hotels to understand the patterns and preferences of their customers. In this project, we aim to conduct an Exploratory Data Analysis (EDA) of hotel booking data to identify patterns and gain insights into customer behavior. The goal is to provide valuable insights to hotel businesses that can be used to tailor their marketing strategies and services and ultimately increase profits.

#### **Define Your Business Objective?**

1. Increase the availability of city hotels for bookings to cater to the majority of the customers.
2. Offer more flexible policies for cancellations to decrease the cancellation rates and retain customers.
3. Focus marketing efforts and promotions on the peak months from May to August to attract more customers.
4. Target Western Europe for advertising and marketing to attract more customers from this area.
5. Implement strategies to improve customer retention and loyalty after their first visit.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Dataset Loading

In [None]:
#Mounting drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset

path = '/content/drive/MyDrive/Capstone_Project_Python'
hotel_df = pd.read_csv(path + '/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Displays all dataframe columns
pd.set_option('display.max_columns', None)

# Creating a copy of the original dataframe 'hotel_df' and assigns it to the variable 'df'. 
df = hotel_df.copy()

In [None]:
# Printing first 5 rows of dataframe
df.head()

In [None]:
# Printing last 5 rows of dataframe
df.tail()

### Dataset Rows & Columns count

In [None]:
# Printing shape of dataframe
df.shape

In [None]:
#Printing list of columns for dataset
list(df.columns)


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Missing Values/Null Values

In [None]:
# Checking for total null values in each column
df.isnull().sum().sort_values(ascending=False)

In [None]:
# Plotting heatmap of null values
plt.figure(figsize=(15,6))
sns.heatmap(df.isna(), cmap=sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True))
plt.show()

### What did you know about your dataset?

1.  From the heatmap, we can infer that the columns company, agent, and country have the most missing values. The company column has almost all missing values, while the agent and country columns have some missing values. This indicates that these columns may not be very useful for analysis and may need to be dropped or handled differently.

2.  Let's check what percentage of Company column is filled with null Values.

In [None]:
Null_value_per_comp_col = (100*(df.company.isnull().sum()/len(df.index)))

In [None]:
print('{:.2f}% of Company column is filled with null values'.format(Null_value_per_comp_col))

3. The agent and company columns have the most null values. We replace these nulls with 0 because they are not truly missing: a null here means "Not Applicable" (no agent or company was involved in the booking).

In [None]:
# Replacing null values of column Agent and Company with 0

df[['agent', 'company']] = df[['agent', 'company']].fillna(0.0)
     

4.  Now, we will replace NULL values of the 'country' column with the most frequent country code (the mode).

In [None]:
# mode() returns a Series, so take its first element as the fill value
df['country'] = df['country'].fillna(hotel_df['country'].mode()[0])


5. We will replace all missing values of column 'children' with rounded mean value as it contains the count of children.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['children'] = df['children'].fillna(round(hotel_df['children'].mean()))

6. Since we have resolved the null values, let us confirm that no null values remain in the dataframe.

In [None]:
# Checking if there are any more null values in the dataframe

df.isnull().sum()
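
The null-handling steps above can be sketched on a small toy frame. This is only an illustration under assumptions: the frame `toy` and its values are made up, and only three of the columns discussed above are included.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the booking data (all values are made up).
toy = pd.DataFrame({
    "country": ["PRT", "PRT", None, "GBR"],
    "children": [0.0, 2.0, np.nan, 1.0],
    "agent": [9.0, np.nan, np.nan, 14.0],
})

# Percentage of missing values per column, largest first.
null_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Identifier column: a null means "not applicable", so fill with 0.
toy["agent"] = toy["agent"].fillna(0)

# Categorical column: fill with the most frequent value.
# mode() returns a Series, so its first element is used.
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])

# Count column: fill with the rounded mean.
toy["children"] = toy["children"].fillna(round(toy["children"].mean()))

print(null_pct)
print(toy)
```

Taking `mode()[0]` rather than converting the whole mode Series to a string keeps the fill value a plain country code.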

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe().T

### Variables Description 

* hotel: H1 = Resort Hotel / H2 = City Hotel
* is_canceled: Whether the booking was canceled or not (binary: 0 = not canceled, 1 = canceled)
* lead_time: Number of days between the booking date and the arrival date 
* arrival_date_year: Year of arrival date 
* arrival_date_month: Month of arrival date
* arrival_date_week_number: Week number of the year for the arrival date 
* arrival_date_day_of_month: Day of the month of arrival date (numeric)
* stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the  guest stayed or booked to stay at the hotel 
* stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel 
* adults: Number of adults 
* children: Number of children 
* babies: Number of babies 
* meal: Type of meal booked 
* country: Country of origin of the guest(Country Code)
* market_segment: Market segment designation 
* distribution_channel: Booking distribution channel, how the customer accessed the stay - corporate booking/Direct/TA/TO
* is_repeated_guest: Whether the booking was made by a repeated guest or not (binary: 0 = first time booking, 1 = repeated guest)
* previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking 
* previous_bookings_not_canceled: Number of previous bookings that were not canceled by the customer prior to the current booking 
* reserved_room_type: Code of room type reserved 
* assigned_room_type: Code for the type of room assigned to the booking 
* booking_changes: Number of changes made to the booking from the initial reservation to the time of check-in or cancellation
* deposit_type: Type of deposit made for the reservation 
* agent: Booked through agent 
* company: ID of the company or entity that made the booking or responsible for payment 
* days_in_waiting_list: Number of days the booking was on the waiting list before it was confirmed to the customer
* customer_type: Type of customer
* adr: Average daily rate 
* required_car_parking_spaces: Number of car parking spaces required by the customer
* total_of_special_requests: Number of special requests made by the customer 
* reservation_status: Current status of the booking 
* reservation_status_date: Date at which the last status was set.

### Check Unique Values for each variable.

In [None]:
# Checking Unique Values
df['hotel'].unique()

In [None]:
# Checking count of unique values of hotel column
df['hotel'].value_counts()

In [None]:
# Checking for unique values as well as their count
df['is_canceled'].value_counts()

In [None]:
# Checking for unique values in market segment column and its count
df['market_segment'].value_counts()

In [None]:
# Checking for unique values in adults column and its count
df['adults'].value_counts()

In [None]:
# Checking for unique values in children column and its count
df['children'].value_counts()

In [None]:

# Checking for unique values in babies column and its count
df['babies'].value_counts()

## 3. ***Data Wrangling***

### Data Wrangling Code

1. There are many rows with no guests, including adults, children, and babies; those rows must be removed because they make no sense. 

In [None]:
# Dropping rows where adults, babies and children add up to 0

df = df.drop(df[(df.adults + df.babies + df.children)==0].index)

2. After cleaning, separate Resort and City hotel

In [None]:
# Creating two separate dataframes of resort and city hotels where booking is not cancelled
resort = df[(df["hotel"] == "Resort Hotel") & (df["is_canceled"] == 0)]
city = df[(df["hotel"] == "City Hotel") & (df["is_canceled"] == 0)]

In [None]:
# Creating dataframe from above dataframe so we can finally check the average price difference between city and resort hotel every month
data_resort = resort[resort['is_canceled']==0]
data_city = city[city['is_canceled']==0]

# Grouping adr by arrival month date and finding mean
resort_hotel = data_resort.groupby('arrival_date_month')['adr'].mean().reset_index()
city_hotel = data_city.groupby('arrival_date_month')['adr'].mean().reset_index()

# Merging the two dataframes to get the final dataframe for comparison
final = resort_hotel.merge(city_hotel,on='arrival_date_month')

# Renaming columns
final.columns = ['month','price_for_resort','price_for_city_hotel']

In [None]:
# Checking our new dataframe
final.head()

In [None]:
# This code creates an ordered categorical data type for months, which allows us to sort and compare data in calendar order.
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
month_cat = pd.CategoricalDtype(categories=month_order, ordered=True)


In [None]:
# Setting month column as index
final = final.set_index('month')

In [None]:
# Order dataframe
final.head()

In [None]:
# Set the index to the ordered categorical data type "month_cat" created earlier
final.index = final.index.astype(month_cat)

In [None]:
# Calling sort_index() to sort our dataframe
final.sort_index()
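
The month-ordering technique used above can be sketched in isolation. The frame `prices` and its values are made up for illustration; only the `CategoricalDtype` approach is taken from the cells above.

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
month_cat = pd.CategoricalDtype(categories=month_order, ordered=True)

# Made-up average prices, indexed alphabetically the way groupby returns them.
prices = pd.DataFrame(
    {"price_for_resort": [75.0, 180.0, 50.0, 110.0]},
    index=pd.Index(["April", "August", "January", "June"], name="month"),
)

# Casting the index to the ordered categorical makes sort_index() follow
# calendar order instead of alphabetical order.
prices.index = prices.index.astype(month_cat)
prices_sorted = prices.sort_index()

print(prices_sorted.index.tolist())
```

Note that `sort_index()` returns a new dataframe, so the result has to be assigned (as with `prices_sorted` here) if the sorted order is needed later.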

### What all manipulations have you done and insights you found?

* I have dropped rows from the dataframe where the sum of the number of adults, babies, and children is zero.
* As a result, all rows with no guests (adults, babies, or children) are removed from the dataframe.
* I have filtered the dataframe df by selecting only the rows where the "hotel" column is "Resort Hotel" and the "is_canceled" column is 0 (i.e. not canceled).
* The resulting dataframe is stored in a new variable called resort.
* Similarly, for the city hotel, I selected the rows where the "hotel" column is "City Hotel" and the "is_canceled" column is 0 (i.e. not canceled).
* The resulting dataframe is stored in a new variable called city.
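
The row filter and hotel split described in the list above can be sketched on a toy frame. The data below is made up, and a boolean mask is used here in place of `drop`, which is an equivalent way to express the same filter.

```python
import pandas as pd

# Toy bookings (values are made up for illustration).
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [0, 0, 1, 0, 0],
    "adults": [2, 0, 1, 1, 2],
    "children": [0.0, 0.0, 0.0, 1.0, 0.0],
    "babies": [0, 0, 0, 0, 0],
})

# Keep only bookings with at least one guest.
guests = toy["adults"] + toy["children"] + toy["babies"]
toy = toy[guests > 0]

# Split the non-cancelled bookings by hotel type.
resort = toy[(toy["hotel"] == "Resort Hotel") & (toy["is_canceled"] == 0)]
city = toy[(toy["hotel"] == "City Hotel") & (toy["is_canceled"] == 0)]

print(len(toy), len(resort), len(city))
```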


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **We have already checked the types of hotels available. Let's look at the percentage of each hotel type using a pie chart.**


In [None]:
# Piechart to look percentage of different types of hotel available

# unique hotels and their count
labels = df['hotel'].value_counts()
print(labels)

# plotting pie chart
labels = df['hotel'].value_counts().index.tolist()
sizes = df['hotel'].value_counts().tolist()
explode = (0, 0.03)
colors = ['magenta', 'cyan']

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',startangle =90, textprops={'fontsize': 14})
plt.show()

##### 1. Why did you pick the specific chart?

I picked a pie chart for this specific visualization because it is a good way to represent the proportions of different categories in a dataset. In this case, we are visualizing the proportion of different types of hotels in a dataset, so a pie chart is a good way to show how the "City Hotel" and "Resort Hotel" categories compare to each other in terms of their representation in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart, we can see that there are two types of hotels in the dataset: "City Hotel" and "Resort Hotel". The chart shows that "City Hotel" is more common than "Resort Hotel", with "City Hotel" representing approximately 66.4% of the hotels in the dataset and "Resort Hotel" representing approximately 33.6% of the hotels.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can be useful for understanding the distribution of hotels in the dataset, and for making comparisons between the two types of hotels. For example, we could use this information to compare the characteristics and booking patterns of guests staying at "City Hotel" versus "Resort Hotel".

As for negative growth, there are no insights from the pie chart itself that would lead to negative growth. However, if the data reveals that one type of hotel is significantly less popular than the other, that could potentially be a cause for concern. 

#### **Let's gather data on whether the bookings were made for an individual, a couple, or a family.**

In [None]:
# Filter the non-cancelled bookings on the basis of individual, couple and family.
# Building each mask on the already filtered frame avoids pandas' index-reindexing warning.
not_canceled = df[df['is_canceled']==0]
no_kids = (not_canceled['children'] == 0) & (not_canceled['babies'] == 0)

individual = not_canceled[(not_canceled['adults']==1) & no_kids]
couple = not_canceled[(not_canceled['adults']==2) & no_kids]
family = not_canceled[(not_canceled['adults'] + not_canceled['children'] + not_canceled['babies']) > 2]

# Number of not cancelled bookings (taken from the cleaned dataframe).
total_count = not_canceled.shape[0]

# Calculating the percentage of bookings for each type of accommodation.
percentage = [round(len(item)/total_count * 100) for item in [individual,couple,family]]

# Types of accommodation
types_of_accomodation = ['Individual','Couple','Family']


# Dictionary pairing each type of accommodation with its percentage of bookings.
dict(zip(types_of_accomodation,percentage))


# Creating dataframe
acc_hotel = pd.DataFrame({'types_of_accomodation':types_of_accomodation,'percentage':percentage})
acc_hotel
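
The guest-type split above can be sketched on a toy frame. The data is made up, and the names `kept`, `no_kids`, and `share` are illustrative rather than part of the notebook.

```python
import pandas as pd

# Toy bookings (values are made up for illustration).
toy = pd.DataFrame({
    "is_canceled": [0, 0, 0, 1, 0],
    "adults": [1, 2, 2, 2, 2],
    "children": [0.0, 0.0, 1.0, 0.0, 0.0],
    "babies": [0, 0, 0, 0, 0],
})

# Filter once, then build every mask on the filtered frame so the
# boolean masks and the frame share the same index.
kept = toy[toy["is_canceled"] == 0]
no_kids = (kept["children"] == 0) & (kept["babies"] == 0)
total_guests = kept["adults"] + kept["children"] + kept["babies"]

counts = {
    "Individual": int(((kept["adults"] == 1) & no_kids).sum()),
    "Couple": int(((kept["adults"] == 2) & no_kids).sum()),
    "Family": int((total_guests > 2).sum()),
}

# Share of each guest type among the non-cancelled bookings, in percent.
share = {name: round(n / len(kept) * 100) for name, n in counts.items()}
print(share)
```

With these definitions, a booking such as one adult with one child belongs to none of the three groups, so the shares do not have to sum to 100.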

In [None]:

# Barplot of different types of accommodations.
plt.figure(figsize=(10,8))
ax = sns.barplot(x="types_of_accomodation", y="percentage", data=acc_hotel)
ax.set_ylabel("Percentage")
ax.set_xlabel("Types of Accommodations")
ax.set_title("Percentage of Different Types of Accommodations")
plt.show()

##### 1. Why did you pick the specific chart?

I picked the specific chart, a barplot, because it is an effective way to visualize the relationship between a categorical variable and a continuous variable. In this case, the categorical data we are interested in is the types of accommodations.

##### 2. What is/are the insight(s) found from the chart?

This bar plot shows how the non-cancelled bookings are split across three guest types: individuals (one adult, no children or babies), couples (two adults, no children or babies), and families or groups (more than two guests in total).

Comparing the bar heights shows which guest type accounts for the largest share of bookings and which for the smallest. The three shares need not add up to 100%, because bookings such as one adult travelling with one child fall into none of the three groups.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight can be useful for hotel managers when deciding which room configurations and packages to prioritise. For example, if couples account for most bookings, double rooms and couple-oriented offers deserve the most capacity, whereas a large share of family bookings would justify more family rooms and child-friendly services.

#### **Let's gather data on the total number of bookings for each month and visualize it using a chart.**

In [None]:
# Calculate number of bookings for each month
# rename_axis/reset_index(name=...) gives the same column names across pandas versions
month_df = df[df['is_canceled']==0]['arrival_date_month'].value_counts().rename_axis('month').reset_index(name='number_of_bookings')
month_df

In [None]:
# Code to rearrange month in chronological order
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
month_cat = pd.CategoricalDtype(categories=month_order, ordered=True)

# Setting month column as index
month_df = month_df.set_index('month')

# Set the index to the ordered categorical data type "month_cat" created above
month_df.index = month_df.index.astype(month_cat)

# sort_index() returns a new dataframe, so assign it back before plotting
month_df = month_df.sort_index()
month_df

In [None]:
# Barplot of number of bookings in each month

plt.figure(figsize=(15,8))
ax = sns.barplot(x=month_df.index, y="number_of_bookings", data = month_df)

##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare the values of different categories, which can help identify patterns or trends in the data.

##### 2. What is/are the insight(s) found from the chart?

From the bar graph, we can see that the number of bookings is significantly higher in June, July, and August than in November, December, and January. This suggests that hotel bookings are significantly higher in summer than in winter.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the barplot can help create a positive business impact by providing hotel managers with a better understanding of the seasonal trends in the booking patterns of their customers. This can inform decisions related to pricing, marketing, staffing, and inventory management.

However, there are also potential negative implications that can arise from the insights gained from the barplot. For example, if the hotel relies heavily on a particular season for revenue and there is a decline in bookings during that time, it could have a negative impact on their overall profitability. 

#### **Let's gather data on the total number of bookings for each month for both hotels and visualize it using a chart.**

In [None]:
# Calculate number of bookings for each month for each type of hotel (using the cleaned dataframe)
month_hotel_type = df[df['is_canceled']==0].groupby(['arrival_date_month','hotel'])['hotel'].count().unstack()
month_hotel_type

In [None]:
# Set the index to the ordered categorical data type "month_cat" created earlier
month_hotel_type.index = month_hotel_type.index.astype(month_cat)

# Calling sort_index() to sort our dataframe
month_hotel_type = month_hotel_type.sort_index()
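
The per-month, per-hotel count can be sketched on a toy frame. The data is made up, and `reindex` with an explicit month list is shown here as an alternative to the categorical index used above; it is not the notebook's own method.

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Toy bookings (values are made up for illustration).
toy = pd.DataFrame({
    "arrival_date_month": ["August", "August", "January", "August", "January", "August"],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [0, 0, 0, 1, 0, 0],
})

kept = toy[toy["is_canceled"] == 0]

# Rows are months, columns are hotel types, cells are booking counts.
counts = kept.groupby(["arrival_date_month", "hotel"]).size().unstack(fill_value=0)

# Reorder the rows chronologically, keeping only months present in the data.
counts = counts.reindex([m for m in month_order if m in counts.index])

print(counts)
```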

In [None]:
# Barplot of number of bookings in each month for both hotels.
ax = month_hotel_type.plot.bar(figsize = (14,7),fontsize = 14)

##### 1. Why did you pick the specific chart?

A bar chart is an effective way to show these comparisons because it uses bars of different heights to represent the different values of each category, making it easy to see which category has higher or lower values.

##### 2. What is/are the insight(s) found from the chart?

From the bar graph, we can see that the number of bookings differs significantly between the two types of hotel during the months from March to October.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the barplot can help create a positive business impact by giving hotel owners a better understanding of the seasonal trends in their customers' booking patterns for the two types of hotel. This can inform decisions related to pricing, marketing, staffing, and inventory management.

#### **Let's gather data on which countries most guests come from and visualize it using a chart.**

In [None]:
# Value counts of the top 10 countries with the most non-cancelled bookings
top_10_countries = df[df['is_canceled']==0]['country'].value_counts()[:10]

# rename_axis/reset_index(name=...) gives the same column names across pandas versions
top_10_countries = top_10_countries.rename_axis('country').reset_index(name='No of guests')

top_10_countries

In [None]:
# Bar plot of top 10 countries
plt.figure(figsize=(15,10))
ax = sns.barplot(x="country", y="No of guests", data=top_10_countries)

*Let's also check how guests are distributed geographically using a map*

In [None]:
# Country wise guest
country_wise_data = df[df['is_canceled']==0]['country'].value_counts().reset_index()

# Renaming column
country_wise_data.columns=['country','No of guests']

country_wise_data

In [None]:
# show on map
px.choropleth(country_wise_data,
                    locations=country_wise_data['country'],
                    color=country_wise_data['No of guests'], 
                    hover_name=country_wise_data['country'], 
                    title="Home country of guests",
                    width = 1400,
                    height = 800)

##### 1. Why did you pick the specific chart?

A bar chart is an effective way to show these comparisons because it uses bars of different heights to represent the different values of each category, making it easy to see which category has higher or lower values.

A choropleth map is a good choice for this type of data because it allows us to see how the data is distributed geographically and to compare the values of different regions. It also works well in a presentation.

##### 2. What is/are the insight(s) found from the chart?

From the bar graph, we can see the top 10 countries that guests come from.

From the choropleth map, we can see how the guests' home countries are distributed geographically.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The map can help hotels to allocate resources and staff more efficiently by anticipating peak demand periods and tailoring their offerings and pricing strategies accordingly. The insights can help hotels plan their staffing requirements, and cater to specific language requirements of the guests.

The map may reveal a concentration of bookings from a small number of countries, indicating a potential risk of revenue loss in case of a sudden downturn in travel from those countries. If some countries have a higher number of cancellations than others, this may indicate that the hotel's offerings do not align with the expectations of guests from those countries, which may have a negative impact on the hotel's reputation.



#### **Let's check how much guests are paying for a room per night.**

Both hotels have different room types and different meal arrangements. Seasonal factors are also important. So the prices vary a lot. Since no currency information is given, but Portugal is part of the European Monetary Union, I assume that all prices are in EUR.

In [None]:
# Storing data for chart
pay_per_night = df[df['is_canceled']==0]

# Boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x="reserved_room_type",
            y="adr",
            hue="hotel",
            data=pay_per_night)
plt.title("Price of room types per night", fontsize=16)
plt.xlabel("Room type", fontsize=16)
plt.ylabel("Price [EUR]", fontsize=16)
plt.legend(loc="upper right")
plt.ylim(0, 600)
plt.show()

##### 1. Why did you pick the specific chart?

The boxplot is a good choice for visualizing the distribution of prices for different room types in the hotel data. It allows for easy comparison between different room types and hotels, as well as the identification of any outliers. 

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows that the price distribution varies significantly between different room types and hotels. The most expensive room type in both hotel types is the P type, followed by the L type. The cheapest room type in both hotel types is the G type.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the boxplot can help create a positive business impact for the hotels. By understanding which room types are in high demand and which are not, hotel managers can adjust their pricing strategies to maximize revenue. 
However, the insights gained from the boxplot do not necessarily lead to negative growth. Rather, they provide valuable information for managers to optimize their pricing strategies and increase revenue. 

#### **Let's find distribution of Nights Spent at Hotels by Market Segment and Hotel Type.**

In [None]:
#dataframe
df.head()

In [None]:
plt.figure(figsize = (12,8))
sns.boxplot(x = "market_segment", y = "stays_in_week_nights", data = df, hue = "hotel",palette = ["#9b59b6", "#3498db"]);

##### 1. Why did you pick the specific chart?

The chosen chart is a boxplot that shows the distribution of the number of nights stayed by guests in different market segments, for each hotel separately. 

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can see the distribution of the length of stay in nights for different market segments and hotel types. 

These insights can be used to improve the business strategy in various ways, such as targeting the market segments with longer stays with tailored promotions or offering packages for longer stays to increase revenue.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart can potentially help create a positive business impact. For example, the hotel management can identify the market segments that tend to stay for a longer duration and tailor their services and promotions accordingly to attract more customers from those segments.

No insight from this chart points directly to negative growth. The chart should still be interpreted with caution: it plots only stays_in_week_nights, so weekend nights are excluded and the total length of stay is understated.

#### **Let's check the distribution of adr.**


In [None]:
# First we will take only not cancelled booking values
not_canceled_data = hotel_df[hotel_df['is_canceled']==0]

# Show zero-rate bookings that belong to the Complementary market segment.
not_canceled_data[(not_canceled_data['adr'] == 0) & (not_canceled_data['market_segment']=='Complementary')].head()

**The adr column contains values equal to 0. For rows in the Complementary market segment this makes sense, because the stay was free. All other zero values look like anomalies and need to be removed.**

In [None]:
# Let's filter our copied dataset and remove anomalies.
df= df.drop(df[(df['adr'] == 0) & (df['market_segment'] != 'Complementary')].index)
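
The anomaly filter above can be sketched on a toy frame. The data is made up, and a negated boolean mask is used here as an equivalent alternative to `drop`.

```python
import pandas as pd

# Toy bookings (values are made up for illustration).
toy = pd.DataFrame({
    "adr": [0.0, 0.0, 95.5, 120.0],
    "market_segment": ["Complementary", "Online TA", "Direct", "Online TA"],
})

# A zero rate is only plausible for complimentary stays;
# every other zero-rate row is treated as an anomaly.
anomaly = (toy["adr"] == 0) & (toy["market_segment"] != "Complementary")
clean = toy[~anomaly]

print(clean)
```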

In [None]:
# Let's check distribution of adr column.
plt.figure(figsize=(10,5))
ax = sns.histplot(df[df['is_canceled']==0]['adr'], kde=True)

##### 1. Why did you pick the specific chart?

I picked a histogram to visualize the distribution of the adr (average daily rate) column in the dataset. Histograms are a great way to show the distribution of continuous numerical data, such as prices or rates. The shape of the histogram can give insights into the central tendency, variability, and skewness of the data. In this case, it can help us understand the distribution of hotel room rates and identify any outliers or unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can see that the adr values are mostly concentrated between 0 and 200 Euros, with a peak around 100 Euros. This suggests that the majority of the hotel room rates are relatively affordable, with a few outliers having higher rates.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight gained from the histogram can help create a positive business impact by informing pricing and marketing strategies. 

However, the presence of outliers on the high end of the rate distribution could also lead to negative growth if these rates are perceived as unreasonably high or if they drive away price-sensitive customers. 

#### **Let's gather data on the number of customers who repeated their bookings.**

In [None]:
# Bar plot showing whether customers are repeat guests or not.
plt.figure(figsize=(12,8))
sns.countplot(data =df, x = 'is_repeated_guest').set_title('Graph showing whether guest is repeated guest', fontsize = 20)
plt.show()

##### 2. What is/are the insight(s) found from the chart?

It is clear that the number of repeat guests is very low compared with first-time guests.
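
The count plot shows the imbalance visually; the share of repeat guests can also be expressed as a percentage. This is a minimal sketch on invented toy data, assuming `is_repeated_guest` is coded as 0/1 as in the plot above.

```python
import pandas as pd

# Toy stand-in for df; 1 marks a repeat guest, 0 a first-time guest.
toy = pd.DataFrame({"is_repeated_guest": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Percentage of bookings in each category.
repeat_share = toy["is_repeated_guest"].value_counts(normalize=True) * 100

# For a 0/1 column, the mean is the fraction of repeat guests.
repeat_pct = toy["is_repeated_guest"].mean() * 100
print(repeat_share.round(1))
print(f"Repeat guests: {repeat_pct:.1f}%")
```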

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can be useful for hotel managers to understand the loyalty of their customers and develop strategies to encourage repeat bookings. By understanding the preferences and needs of their repeat guests, hotels can tailor their services and amenities to meet their needs and provide a better guest experience.

#### **Let's gather information on bookings that were cancelled.**

In [None]:
# dataframe of cancelled bookings
cancel = df[df['is_canceled'] == 1]

# booking cancellations in resort hotel
rh_cancellations = cancel[cancel["hotel"] == "Resort Hotel"]["is_canceled"].sum()
print('Bookings Cancelled in Resort Hotel:', rh_cancellations)

# booking cancellations in city hotel
ch_cancellations = cancel[cancel["hotel"] == "City Hotel"]["is_canceled"].sum()
print('Bookings Cancelled in City Hotel:', ch_cancellations)


In [None]:
values = [11069, 33026]
labels = ['RH Cancellations', 'CH Cancellations']

plt.pie(values, labels=labels,autopct='%1.1f%%')
plt.title('Percentage of Cancellations by Hotel')
plt.show()


##### 1. Why did you pick the specific chart?

I picked a pie chart because it is an effective way to show the distribution of categorical data, which in this case is the percentage of cancellations by hotel. The chart is easy to read and visually appealing, making it a good choice for quickly conveying the main insights.

##### 2. What is/are the insight(s) found from the chart?

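The pie chart shows each hotel's share of all cancellations. The majority of cancellations (74.9%) were made in the City Hotel, while the remaining 25.1% were made in the Resort Hotel. Since the City Hotel also receives the majority of bookings, this share alone does not prove that its guests cancel more often; comparing cancellations per booking for each hotel would confirm whether cancellations are more common in city hotels than in resort hotels.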
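
A possible way to avoid hard-coding the pie values, and to compare cancellation rates rather than shares only, is to derive both from the data with a `groupby`. The sketch below uses an invented `toy` frame in place of `df`; the column names match the notebook.

```python
import pandas as pd

# Toy stand-in for df (invented values).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 6 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Bookings and cancellations per hotel.
per_hotel = toy.groupby("hotel")["is_canceled"].agg(
    bookings="count", cancellations="sum"
)

# Share of all cancellations (what the pie chart shows).
per_hotel["share_of_cancellations_pct"] = (
    per_hotel["cancellations"] / per_hotel["cancellations"].sum() * 100
)

# Cancellation rate within each hotel (cancellations per booking).
per_hotel["cancellation_rate_pct"] = (
    per_hotel["cancellations"] / per_hotel["bookings"] * 100
)
print(per_hotel)
```

The `cancellations` column could be passed straight to `plt.pie`, so the chart stays in sync with any earlier cleaning of `df`.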

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart of the percentage of cancellations by hotel can help create a positive business impact as it helps the hotels to understand their cancellation rates and take necessary steps to reduce the number of cancellations.

However, the insights may also lead to negative growth if the hotels do not take any corrective actions based on the analysis of the cancellation rates. If the hotels do not address the issues that lead to cancellations, it may result in a negative impact on their revenue and growth. 

#### **Let's gather data on how different factors affect cancellation. The first factor we are looking into is lead time.**

In [None]:
# Regplot showing relation between lead time and cancellation.

lead_time_df = df.groupby('lead_time')['is_canceled'].describe()
plt.figure(figsize=(15, 10))
sns.regplot(x=lead_time_df.index, y=lead_time_df["mean"].values * 100)
plt.title("Effect of lead time on cancelation", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)

##### 1. Why did you pick the specific chart?

I picked this specific chart to visually represent the relationship between lead time and cancellation rate. It uses a regression plot to show how the cancellation rate changes as the lead time increases, and the line of best fit helps to identify the overall trend. This chart helps to highlight the importance of lead time in predicting cancellations and can be useful in informing hotel cancellation policies and pricing strategies.

##### 2. What is/are the insight(s) found from the chart?

The regplot shows the relationship between lead time and cancellation percentage. The graph suggests that as lead time increases, the percentage of cancellations also increases. This implies that customers tend to cancel their reservations when they have a longer lead time.
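
The regplot gives every distinct lead-time value the same weight, however many bookings it has. As a complementary check, the cancellation rate can be computed per lead-time band. The band edges below are arbitrary assumptions and the `toy` data is invented; this is only a sketch of the idea.

```python
import pandas as pd

# Toy stand-in for df (invented values).
toy = pd.DataFrame({
    "lead_time":   [0, 3, 10, 25, 40, 75, 120, 200, 300, 400],
    "is_canceled": [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
})

# Hypothetical band edges in days; intervals are right-inclusive.
bins = [-1, 30, 90, 365, float("inf")]
labels = ["0-30", "31-90", "91-365", "365+"]
toy["lead_time_band"] = pd.cut(toy["lead_time"], bins=bins, labels=labels)

# Mean of a 0/1 column is the cancellation rate; scale to percent.
band_rates = (
    toy.groupby("lead_time_band", observed=True)["is_canceled"].mean() * 100
)
print(band_rates.round(1))
```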

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight gained from the chart is that the longer the lead time, the higher the percentage of cancellations. This information can be used by hotels to adjust their cancellation policies and pricing strategies based on how far in advance customers are booking their stays. For example, they can offer more flexible cancellation policies for bookings made farther in advance to reduce the likelihood of cancellations.

On the negative side, if the hotels have a high number of cancellations, it could lead to lost revenue and higher costs due to the need to manage cancellations and rebookings. This may also affect the hotel's reputation, resulting in fewer bookings in the future. 

#### **The second factor we are looking at for cancellation is the type of deposit.**

In [None]:
# Barplot to show which deposit type affects cancellation more.
deposit_df = df.groupby('deposit_type')['is_canceled'].describe()
plt.figure(figsize=(12, 8))
sns.barplot(x=deposit_df.index, y=deposit_df["mean"].values * 100)
plt.title("Effect of deposit on cancelation", fontsize=16)
plt.xlabel("Deposit Type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16);

##### 1. Why did you pick the specific chart?

I picked this chart to explore the impact of different deposit types on the cancellation rate.

##### 2. What is/are the insight(s) found from the chart?

The insight gained from this chart is that customers who made a Non-Refund deposit are more likely to cancel their booking than customers with other deposit types.
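
Because the bar plot shows rates only, it can help to look at the underlying counts as well, since a deposit type with few bookings can produce an extreme rate. The following sketch uses `pd.crosstab` on invented toy data; the category labels in `deposit_type` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df; labels and values are invented for illustration.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 5 + ["Non Refund"] * 3 + ["Refundable"] * 2,
    "is_canceled":  [0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# Raw counts of kept (0) and canceled (1) bookings per deposit type.
counts = pd.crosstab(toy["deposit_type"], toy["is_canceled"])

# Row-wise percentages: each deposit type sums to 100.
rates = pd.crosstab(toy["deposit_type"], toy["is_canceled"], normalize="index") * 100
print(counts)
print(rates.round(1))
```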

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It is important for the hotel to carefully consider its deposit policies and find a balance between reducing cancellations and not discouraging potential customers.

#### **Let's look into one more factor that can affect cancellation i.e. ADR.**

In [None]:
# Regplot showing relationship between ADR and cancellation.
adr_cancel_data = df.groupby("adr")["is_canceled"].describe()
plt.figure(figsize=(15, 10))
sns.regplot(x=adr_cancel_data.index, y=adr_cancel_data["mean"].values * 100)
plt.title("Effect of ADR on cancelation", fontsize=16)
plt.xlabel("ADR", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.xlim(0,400)
plt.ylim(0,100)

##### 1. Why did you pick the specific chart?

This chart is suitable for visualizing the relationship between two continuous variables - the "ADR" (average daily rate) and the cancellation rate - as it shows how the mean cancelation rate changes as the ADR value increases or decreases.



##### 2. What is/are the insight(s) found from the chart?

The correlation between ADR and cancellations is positive, which means that bookings with a higher ADR tend to be canceled more often. This is an association in the data and does not by itself show that higher rates cause cancellations.
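
The direction of this relationship can also be checked numerically. A minimal sketch, on invented toy data, computes the Pearson correlation between `adr` and the 0/1 `is_canceled` flag (for a binary variable this is the point-biserial correlation).

```python
import pandas as pd

# Toy stand-in for df (invented values).
toy = pd.DataFrame({
    "adr":         [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 150.0, 200.0],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation between a continuous and a 0/1 column.
r = toy["adr"].corr(toy["is_canceled"])
print(f"Correlation between adr and is_canceled: {r:.2f}")
```

A value near 0 would indicate a weak linear relationship even if the fitted line in the regplot slopes upward.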

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart can potentially help create a positive business impact: because the cancellation rate rises with ADR, hotel managers can weigh higher rates against the increased risk of cancellation when adjusting their pricing strategy.

As for negative growth, the chart does not point to a clear negative insight on its own, although raising rates without accounting for the higher cancellation rate could reduce the revenue actually realized.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
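# numeric_only=True skips text columns such as 'hotel', which recent pandas versions no longer drop silently
corr = df.corr(numeric_only=True)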
plt.figure(figsize=(30,20))
sns.heatmap(corr,annot=True,cmap='RdBu',vmin=-1,vmax=1, square = True,annot_kws = {'fontsize':11,'fontweight':'bold'})
plt.show()

##### 1. Why did you pick the specific chart?

I picked the correlation heatmap because it is a powerful and commonly used technique for exploring the correlation structure of a dataset. This chart can provide valuable insights into the relationships between variables and can help identify potential multicollinearity issues, which can negatively impact model performance.

##### 2. What is/are the insight(s) found from the chart?

- The "is_canceled" column is positively correlated with "lead_time", "adults", "adr", and "required_car_parking_spaces". This means that as these values increase, the likelihood of the booking being canceled also increases.

- The "is_canceled" column is negatively correlated with "total_of_special_requests" and "stays_in_weekend_nights". This means that as these values increase, the likelihood of the booking being canceled decreases.

- There are several strong positive correlations between different columns, such as "stays_in_weekend_nights" and "stays_in_week_nights", "adults" and "children", and "adr" and "total_of_special_requests". This suggests that these variables are related and may influence each other.

- There are also some negative correlations between columns, such as "lead_time" and "arrival_date_year", and "lead_time" and "stays_in_weekend_nights". This suggests that there may be some interactions between these variables that are worth exploring further.
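
Reading individual cells off a large heatmap is error-prone, so the correlations with `is_canceled` can also be listed and ranked directly. This is a sketch on invented toy data, assuming a pandas version that supports `numeric_only` in `DataFrame.corr`.

```python
import pandas as pd

# Toy stand-in for df (invented values); 'hotel' is a text column on purpose.
toy = pd.DataFrame({
    "is_canceled":               [0, 0, 1, 1, 0, 1, 0, 1],
    "lead_time":                 [5, 10, 200, 150, 20, 300, 15, 120],
    "total_of_special_requests": [2, 1, 0, 0, 3, 0, 1, 0],
    "hotel":                     ["City Hotel", "Resort Hotel"] * 4,
})

# Correlation matrix over numeric columns only.
corr = toy.corr(numeric_only=True)

# Correlations with the target, strongest (by absolute value) first.
target_corr = (
    corr["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=abs, ascending=False)
)
print(target_corr.round(2))
```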

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
subset_cols = ['lead_time', 'stays_in_weekend_nights','stays_in_week_nights','days_in_waiting_list','previous_cancellations', 'previous_bookings_not_canceled', 'adr', 'total_of_special_requests']

sns.pairplot(df[subset_cols])


##### 1. Why did you pick the specific chart?

It allows for the visualization of the relationships between multiple variables in the dataset. The pair plot is a useful tool for exploratory data analysis as it can quickly identify any potential correlations between variables.



##### 2. What is/are the insight(s) found from the chart?

The pair plot can offer insights into the factors that contribute to cancelled bookings and inform strategies for reducing cancellations. Additionally, we can identify any potential outliers or unusual patterns in the data, which may indicate errors or anomalies that require further investigation.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

1. Reduce cancellation rates: One key objective for any hotel is to reduce the number of bookings that are cancelled.
2. Optimize pricing: The EDA can help identify optimal pricing strategies for hotel bookings. By analyzing the relationship between pricing and other variables such as the length of the stay or the number of guests, hotels can identify the sweet spot for pricing that maximizes revenue without sacrificing occupancy rates.
3. Improve customer satisfaction: Another key objective for any hotel is to improve customer satisfaction. The EDA can help identify factors that contribute to customer satisfaction, such as the type of room booked.
4. Target marketing efforts: The EDA can also help hotels target their marketing efforts more effectively. By analyzing customer demographics and booking patterns, hotels can identify key segments to target with marketing campaigns, and tailor those campaigns to the specific needs and preferences of those segments.



# **Conclusion**

1. The majority of bookings are for the City Hotel.
2. Bookings with a Non-Refund deposit show a higher cancellation rate.
3. Target the months from May to August. Those are the peak months due to the summer period.
4. Majority of the guests are from Western Europe. So target this area for more customers.
5. Since there are very few repeated guests, focus should be on retaining the customers after their first visit.