# **Project Name**    - **Hotel** **Booking** **Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
# Name - Najani khatoon

# **Project Summary -**


This project aims to perform Exploratory Data Analysis (EDA) on a dataset containing hotel booking information to gain insights and understand patterns in customer behaviour, booking trends, and other relevant factors that impact the hospitality industry. By conducting a comprehensive EDA, we intend to extract meaningful information that can guide decision-making, improve customer experience, and optimise hotel management strategies.




# Objectives:


Data Collection and Description:- The project begins with the collection of a comprehensive dataset from various sources, such as hotel reservation systems, customer databases, and online booking platforms. The dataset includes a wide range of attributes, such as customer demographics, booking dates, room types, booking channels, length of stay, cancellations, special requests, and historical hotel performance metrics.

Data Cleaning and Preprocessing:- The collected data undergoes a rigorous preprocessing phase to handle missing values, outliers, and inconsistencies. Data cleaning techniques are applied to ensure data integrity and accuracy. Irrelevant or redundant features may be removed, and categorical variables are encoded appropriately for analysis.

Exploratory Data Analysis (EDA):- The heart of the project lies in conducting an extensive EDA to extract meaningful insights and patterns from the dataset. The EDA process includes various analytical techniques and visualizations, such as statistical summaries, histograms, box plots, scatter plots, heatmaps, and time-series analyses.

Customer Segmentation:- Based on the analysis, we will attempt to segment customers into different groups based on their booking behaviors, demographics, and preferences. This can help in tailoring marketing strategies and service offerings to specific customer segments.

# **GitHub Link -**


Provide your GitHub Link here.


# **Problem Statement**



We explore a hotel booking dataset, which contains booking information for a city hotel and a resort hotel, to discover the important factors that govern bookings. We will analyze key aspects of hotel bookings that help us identify major loopholes and give us insights for running a profitable hotel business. The questions we address are as follows:

What is the best time of year to book a hotel room?

What is the optimal length of stay to get the best daily rate?

Can we predict whether a hotel is likely to receive a disproportionately high number of special requests?

#### **Define Your Business Objective?**

In the hotel industry, the cancellation rate and the Average Daily Rate (ADR) are two important factors that help run the business effectively.

By understanding the factors that determine whether a booking is cancelled, hotels can take the necessary precautions to reduce the cancellation rate.

By understanding the patterns in ADR against different variables, hotels can prepare in advance to generate more revenue and run a profitable business.

Our goal here is to understand such factors in the given dataset by performing Exploratory Data Analysis.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import required libraries
# (numpy.math was only an alias of the standard math module and is removed in NumPy 2.0)
import math

import numpy as np
import pandas as pd

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
# Load Dataset from google drive by connecting the google drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
url = ('/content/drive/MyDrive/almabetter data/Hotel Bookings (2) project.csv')

dataset = pd.read_csv(url)

### Dataset First View

In [None]:

# View top 5 rows of the dataset
dataset.head()

In [None]:
# view bottom 5 rows of the dataset
dataset.tail()

### Dataset Rows & Columns count

In [None]:

# Dataset Rows & Columns count
dataset.shape
print('the number of rows',dataset.shape[0])
print('the number of columns',dataset.shape[1])

### Dataset Information

In [None]:
# Dataset Info
# Detailed information of the dataset
dataset.info()

#### Duplicate Values

In [None]:

# Checking duplicated rows count
dataset[dataset.duplicated()].shape

There are 31,994 duplicate rows in the dataset.
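
The duplicate check can be tried out on a small example first. The sketch below uses a made-up toy table (not the project data) to show how `duplicated()` flags repeated rows and how `drop_duplicates()` keeps only the first occurrence of each.

```python
import pandas as pd

# Toy bookings table with made-up values (an assumption for illustration only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 10],
    "adr": [80.0, 80.0, 120.0, 95.0],
})

# duplicated() marks every row that fully repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("rows before:", len(toy), "rows after:", len(deduplicated))
```

Only rows that match in every column count as duplicates, which is why the last toy row (same hotel and lead time, different ADR) is kept.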

In [None]:
# removing the duplicate values
dataset.drop_duplicates(inplace =True)

In [None]:
# checking for duplicate values
dataset[dataset.duplicated()].shape

In [None]:
# number of observations after removing duplicates
dataset.shape

#### Missing Values/Null Values

In [None]:
dataset.isnull().sum().sort_values()

We can observe that four columns (children, country, agent, company) have missing values.

There are 4 null values in the "children" column. It is likely that these entries were left blank for customers with no children, so we fill these null values with 0.

In [None]:
# Imputing the 4 null values in the 'children' column with 0
dataset['children'] = dataset['children'].fillna(0)

In [None]:
# The country column has 452 missing values; we replace them with 'others' for categorization
dataset['country'] = dataset['country'].fillna('others')

The agent and company columns identify the medium through which the customer booked, so a missing value most likely means that the customer booked directly, without any medium. We therefore replace the missing values with 0.
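
As a compact illustration of the imputation choices described in this section, the sketch below applies them in one `fillna` call on a made-up toy frame. The values are assumptions for illustration. The notebook itself fills the columns one cell at a time.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimics the four columns with missing data
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR"],
    "agent": [9.0, np.nan, np.nan],
    "company": [np.nan, np.nan, 40.0],
})

# Count the nulls per column before imputing
nulls_before = toy.isnull().sum()

# One fill value per column: 0 for counts and IDs, 'others' for the country code
filled = toy.fillna({"children": 0, "country": "others", "agent": 0, "company": 0})

nulls_after = filled.isnull().sum()
print(nulls_before.to_dict())
print(nulls_after.to_dict())
```

Passing a dictionary to `fillna` keeps each column's fill value next to its name, which makes the imputation rules easy to review.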

In [None]:
dataset[['agent','company']] = dataset[['agent','company']].fillna(0)

In [None]:
# Checking the missing values after imputation
dataset.isnull().sum().sort_values()

We can also observe that in some rows the babies, children, and adults columns are all 0. Such a booking has no guests at all, so it is invalid for analysis.
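
A minimal sketch of this filter on made-up toy data: a boolean mask marks the bookings whose guest total is zero, and its negation selects the valid rows.

```python
import pandas as pd

# Toy guest counts with made-up values (illustration only)
toy = pd.DataFrame({
    "adults": [2, 0, 1, 0],
    "children": [0, 0, 1, 2],
    "babies": [0, 0, 0, 0],
})

# Total guests per booking
guests = toy["adults"] + toy["children"] + toy["babies"]

# True where the booking has no guests at all
no_guests = guests == 0

# Keep only the bookings with at least one guest
valid = toy[~no_guests]

print("invalid bookings:", int(no_guests.sum()))
print("valid bookings:", len(valid))
```

Note that a booking with 0 adults but 2 children is still kept, because only the sum of the three columns is tested.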

In [None]:
# Checking the number of rows in which the total number of guests is 0
dataset[dataset['babies']+dataset['children'] + dataset['adults'] == 0].shape

In [None]:
# dropping the above 166 rows, which have no guests
dataset.drop(dataset[dataset['babies']+dataset['children']+dataset['adults'] == 0].index,inplace = True)

### What did you know about your dataset?

This dataset contains information about bookings and revenue for city and resort hotel properties from July 2015 to August 2017. After removing duplicate rows, the dataset consists of 87,396 rows and 32 columns, providing comprehensive details for analysis.

Our objective is to analyze the data and gain insights into the best time for hotel bookings and the optimal length of stay to obtain the best rates. Additionally, we will explore other key aspects of the dataset to uncover valuable information.

By conducting a thorough analysis of this dataset, we aim to provide meaningful insights and recommendations to improve the booking strategies and revenue optimization for resort and hotel properties.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Describe
# Summary statistics of the numerical columns
dataset.describe()

In [None]:
# Dataset Columns
dataset.columns

### Variables Description


hotel : type of the hotel (categorical)

is_canceled : whether the booking was cancelled (cancelled = 1, not cancelled = 0) (numerical)

lead_time : number of days elapsed between the booking and the arrival date of the guests (numerical)

arrival_date_year : year of the arrival (numerical)

arrival_date_month : month of the arrival (categorical)

arrival_date_week_number : week of the arrival (numerical)

arrival_date_day_of_the_month : day of the arrival (numerical)

stays_in_weekend_nights : number of weekend nights stayed (numerical)

stays_in_week_nights : number of week nights stayed (numerical)

adults : number of adults (numerical)

children : number of children (numerical)

babies : number of babies (numerical)

meal : type of the meal (categorical)

country : country of the guest (categorical)

market_segment : market segment the customer belongs to (categorical)

distribution_channel : channel through which the guest made the booking (categorical)

is_repeated_guest : whether the guest is a repeated guest (repeated = 1, not repeated = 0) (categorical)

previous_cancellations : number of previous cancellations by the guest (numerical)

previous_booking : number of completed bookings of the guest (numerical)

reserved_room_type : type of room the guest booked (categorical)

assigned_room_type : type of room assigned to the guest for the booking (categorical)

booking_changes : number of changes made to the booking (numerical)

deposit_type : type of deposit the guest made (categorical)

agent : ID of the agent (categorical)

company : ID of the company (categorical)

days_in_waiting_list : number of days the booking spent on the waiting list (numerical)

customer_type : type of the customer (categorical)

adr : average daily rate (ADR) (numerical)

required_car_parking : number of car parking spaces required by the guest (numerical)

total_of_special_requests : number of special requests made by the guest (numerical)

reservation_status : status of the reservation (categorical)

reservation_status_date : date of the reservation status (date)

### Check Unique Values for each variable.

In [None]:

# Check Unique Values for each variable using for loop
for i in dataset.columns.tolist():
  print("number of unique values in",i,"is",dataset[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

# Lets create a copy of dataset to do data wrangling
df = dataset.copy()

In [None]:
# observing the subcategories in market_segment and distribution channel columns to get insights
df.groupby('market_segment')['distribution_channel'].value_counts()

From the above analysis, we can observe that the market_segment and distribution_channel columns overlap, but each has its own subcategories, so the two columns cannot simply be merged or dropped. However, we can handle the "Undefined" values in the market_segment column by replacing them with "Online TA", since that category accounts for a far larger share of bookings than any other.

This replacement keeps the market_segment column consistent for the rest of the analysis.
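
The same idea can be sketched on a made-up toy frame: cross-tabulate the two columns, find the most frequent defined segment, and use it to replace the "Undefined" label. Deriving the replacement with `mode()` is an illustrative variation; the notebook hard-codes 'Online TA'.

```python
import pandas as pd

# Toy data with made-up values (illustration only)
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Direct", "Undefined", "Online TA"],
    "distribution_channel": ["TA/TO", "TA/TO", "Direct", "TA/TO", "TA/TO"],
})

# How the two columns line up with each other
pairs = toy.groupby("market_segment")["distribution_channel"].value_counts()

# Most frequent segment among the defined values
defined = toy.loc[toy["market_segment"] != "Undefined", "market_segment"]
most_common = defined.mode()[0]

# Replace the 'Undefined' label with that segment
cleaned = toy["market_segment"].replace("Undefined", most_common)

print(pairs)
print(cleaned.value_counts().to_dict())
```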

In [None]:
# replacing the undefined value in market segment as online TA

df['market_segment'] = df['market_segment'].replace(to_replace = 'Undefined', value = 'Online TA')

In [None]:
# creating a new column as total_nights by adding stays_in_weekend_nights and stays_in_week_nights

df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

In [None]:
# creating a new column for lead_time in terms of month

df['lead_time_months'] = df['lead_time']//30

In [None]:
# creating a new column that flags guests with at least one previous cancellation
# (1 = previously cancelled, 0 = never cancelled); vectorized equivalent of a row-wise apply

df['is_previously_cancelled'] = (df['previous_cancellations'] != 0).astype(int)

In [None]:
# creating new column of non-adults by adding kids and babies columns

df['kids'] = df['children'] + df['babies']

In [None]:
# creating new column of total number of people by adding children,babies and adults columns

df['total_number_of_people'] = df['children'] + df['babies'] + df['adults']

In [None]:
# converting the datatypes from float to integer 64 of below columns

df[['children','agent','company','kids','total_number_of_people']] = df[['children','agent','company','kids','total_number_of_people']].astype('int64')

In [None]:
# converting the reservation_status_date column from string to datetime format

df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'],format = '%Y-%m-%d')

### What all manipulations have you done and insights you found?


In the previous steps, we performed data wrangling by adding new columns and modifying existing columns to enhance our analysis. We made changes to the datatypes of selected columns to facilitate better operations and functions. These transformations allow us to obtain more meaningful insights from the data.

By converting the 'reservation_status_date' column to a different datatype, such as datetime, we can extract information based on quarters, months, or years. This allows us to analyze the booking patterns over different time periods.

Additionally, we introduced new columns to capture the total number of people involved in each booking. By aggregating the 'children', 'babies', and 'adults' columns into a new column called 'total_number_of_people', we can analyze the bookings based on the total number of individuals.

Another new column we added is 'total_nights', which represents the total number of nights for each booking. This allows us to examine the bookings from the perspective of the duration of stay.

Furthermore, we recognized the importance of visual representations in our analysis. By plotting graphs and visualizations, we can gain deeper insights and effectively communicate our findings to stakeholders.
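
To make the wrangling steps concrete, here is a minimal sketch on two made-up bookings. It derives the stay length and lead time in months as described above, then uses the pandas `.dt` accessor to pull the year, quarter, and month out of the converted date. The `status_*` column names are hypothetical and are not created in the notebook.

```python
import pandas as pd

# Two made-up bookings (illustration only)
toy = pd.DataFrame({
    "stays_in_weekend_nights": [2, 0],
    "stays_in_week_nights": [3, 1],
    "lead_time": [95, 12],
    "reservation_status_date": ["2016-08-14", "2017-01-03"],
})

# Derived columns, as in the wrangling section
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
toy["lead_time_months"] = toy["lead_time"] // 30

# String -> datetime, then split into calendar parts
toy["reservation_status_date"] = pd.to_datetime(
    toy["reservation_status_date"], format="%Y-%m-%d"
)
toy["status_year"] = toy["reservation_status_date"].dt.year        # hypothetical name
toy["status_quarter"] = toy["reservation_status_date"].dt.quarter  # hypothetical name
toy["status_month"] = toy["reservation_status_date"].dt.month      # hypothetical name

print(toy[["total_nights", "lead_time_months", "status_year",
           "status_quarter", "status_month"]])
```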




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Chart - 1 Visualization on 'is_canceled' column (Univariate Analysis)

In [None]:
# plotting for 'is_canceled' column to check the difference between the bookings
df1 = df['is_canceled'].value_counts()
x = df1.index.values
y = df1.values

plt.figure(figsize=(10,5))
plots = sns.barplot(x = x, y = y/sum(y)*100)
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                    (bar.get_x() + bar.get_width()/2,
                    bar.get_height()), ha='center',va ='center',
                    size = 12,xytext = (0,8),
                    textcoords = 'offset points')

plt.xlabel('Booking Cancelled (Booking cancelled = 1, not cancelled = 0)')
plt.ylabel('Percentage of bookings')
plt.title('Booking info(Cancelled & Not Cancelled)')
plt.show()

no_of_bookings_not_cancelled = df1[0]
no_of_bookings_cancelled = df1[1]
total_bookings = no_of_bookings_not_cancelled + no_of_bookings_cancelled

print("bookings not cancelled is",(no_of_bookings_not_cancelled))
print("bookings cancelled is",(no_of_bookings_cancelled ))
print('Total bookings are', total_bookings)

##### 1. Why did you pick the specific chart?

We chose bar charts to visualize the percentage of cancelled bookings and not cancelled bookings because bar charts are effective for comparing the sizes or frequencies of different categories or groups of data. In this case, we wanted to compare the percentage of cancelled and not cancelled bookings across different categories.

By using bar charts, we can easily compare the relative percentages of cancelled and not cancelled bookings within each category or group. The length of the bars represents the percentage, allowing for a quick visual understanding of the distribution. This helps us identify any significant differences or patterns among the categories or groups.
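
Because the same percentage bar chart recurs for several columns, the pattern can be wrapped in a small helper. The sketch below is one possible way to do that and is not part of the original notebook: the function name `plot_percentage_bars` is hypothetical, it uses plain matplotlib instead of seaborn to stay short, and `Axes.bar_label` requires matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_percentage_bars(series, title, xlabel):
    """Draw the share of each category in `series` as an annotated bar chart."""
    # value_counts(normalize=True) returns fractions that sum to 1
    shares = series.value_counts(normalize=True) * 100
    fig, ax = plt.subplots(figsize=(10, 5))
    bars = ax.bar(shares.index.astype(str), shares.values)
    # Write the percentage above each bar
    ax.bar_label(bars, fmt="%.2f%%", padding=3)
    ax.set(title=title, xlabel=xlabel, ylabel="Percentage of bookings")
    return shares, ax


# Made-up status flags: three bookings kept, one cancelled
toy = pd.Series([0, 0, 0, 1])
shares, ax = plot_percentage_bars(toy, "Booking status", "is_canceled")
print(shares.to_dict())
```

In a notebook, calling `plt.show()` after the helper displays the figure. With the real data the call would look like `plot_percentage_bars(df['is_canceled'], ...)`.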

##### 2. What is/are the insight(s) found from the chart?

The chart depicts a total of 87,230 bookings, of which 24,009 were cancelled and 63,221 were not cancelled. This means that 72.48% of the bookings were not cancelled, while 27.52% were cancelled.
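
The percentages follow directly from the counts. As a quick arithmetic check, the sketch below starts from the total and the cancelled count quoted above and derives the remaining figures.

```python
# Counts quoted in the text above
total_bookings = 87_230
cancelled = 24_009

# The bookings that were not cancelled are the remainder
not_cancelled = total_bookings - cancelled

# Shares in percent, rounded to two decimals as in the chart labels
pct_cancelled = round(cancelled / total_bookings * 100, 2)
pct_not_cancelled = round(not_cancelled / total_bookings * 100, 2)

print(not_cancelled, pct_not_cancelled, pct_cancelled)
```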

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this analysis can help create a positive business impact. By analyzing the factors related to booking cancellations, such as the type of hotel and ADR (Average Daily Rate), we can identify areas for improvement and implement strategies to reduce cancellations. This can lead to increased customer satisfaction, improved revenue, and overall business growth.

There are no specific insights mentioned that would lead to negative growth. The focus of the analysis is to identify areas for improvement and reduce booking cancellations, which is a positive objective for the business.
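
One way to start relating cancellations to hotel type and ADR is a simple group-by. The sketch below uses made-up toy values and is only an illustration of the approach, not a result from the dataset. Because `is_canceled` is coded 0/1, its mean within a group is that group's cancellation rate.

```python
import pandas as pd

# Made-up bookings (illustration only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 1, 0, 0],
    "adr": [100.0, 80.0, 120.0, 60.0, 90.0],
})

# Mean of a 0/1 flag = share of cancelled bookings, here in percent
cancel_rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean() * 100

# Average daily rate for kept (0) versus cancelled (1) bookings
mean_adr_by_status = toy.groupby("is_canceled")["adr"].mean()

print(cancel_rate_by_hotel.round(2).to_dict())
print(mean_adr_by_status.round(2).to_dict())
```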

Chart - 2 Visualization the 'hotel' column (Univariate Analysis )

In [None]:
# Plotting graph on 'hotel' column
df2 = df['hotel'].value_counts()
x = df2.index
y = df2.values

plt.figure(figsize= (10,8))
plots = sns.barplot(x=x, y=y/sum(y)*100)

for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".1f")}%',
                 (bar.get_x() + bar.get_width()/2,bar.get_height()),
                 size = 12, xytext = (0,8), ha ='center',va= 'center',
                 textcoords = 'offset points')

plt.title('Bookings basis on hotel type')
plt.xlabel('Type of the hotel')
plt.ylabel('Percentage of the bookings')
plt.show()

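# value_counts() sorts in descending order, so the city hotel (the larger group) comes first;
# .iloc makes the positional access explicit, since df2 is indexed by hotel name
city_hotel_bookings = df2.iloc[0]
resort_hotel_bookings = df2.iloc[1]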

print('the city hotel bookings are',city_hotel_bookings)
print('the resort hotel bookings are',resort_hotel_bookings)
print('the total bookings are',total_bookings)

##### 1. Why did you pick the specific chart?

Bar charts are used to compare the size or frequency of different categories or groups of data. They are useful for comparing data across categories and can display a large amount of data in a small space. Here, we used a bar chart to show the percentage of bookings for the city hotel and the resort hotel.

##### 2. What is/are the insight(s) found from the chart?


From the observation, it can be inferred that the dataset contains two types of hotels: city hotels and resort hotels. The total number of bookings in the dataset is 87,230, out of which 53,274 bookings are for city hotels and 33,956 bookings are for resort hotels. This means that city hotels account for 61.1% of the bookings, while resort hotels account for 38.9%. The share of bookings for resort hotels is noticeably lower than that for city hotels.

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact. By understanding how bookings are split between city hotels and resort hotels, businesses can develop targeted marketing strategies and promotional offers to attract more guests to the hotel type with fewer bookings, here the resort hotels. By identifying the reasons behind the lower share of resort bookings, businesses can address any potential issues or challenges and take appropriate measures to improve the situation.

Overall, the gained insights can provide valuable information for making informed business decisions, optimizing marketing strategies, and improving customer satisfaction, all of which can contribute to a positive business impact.



Chart - 3 Visualization on Country column (Univariate analysis)

In [None]:
# Plotting graph on country column to know the proportion of bookings per country (shares are relative to the top 15 countries)
df3 = df['country'].value_counts().head(15)
x = df3.index
y = df3.values
plt.figure(figsize = (10,5))
plots = sns.barplot(x =x, y= y/sum(y)* 100)

for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha= 'center',va = 'center',
                 textcoords = 'offset points')

plt.title('Bookings of the guests from different countries')
plt.xlabel('Country')
plt.ylabel('percentage of the bookings')
plt.show()

for i,j in df3.items():
  print("The country",i,'having',j,'bookings')

##### 1. Why did you pick the specific chart?

We chose to use a bar chart to display the percentage of bookings from the top 15 countries because bar charts are effective in comparing the frequency or size of different categories or groups of data. In this case, we wanted to compare the booking percentages across different countries and highlight the distribution of bookings among the top 15 countries.

By using a bar chart, we can easily visualize and compare the booking percentages of each country, allowing us to identify the countries with the highest booking rates. This information can be valuable for business decisions, such as targeting specific markets or tailoring marketing strategies to attract guests from these countries.

##### 2. What is/are the insight(s) found from the chart?

From the above visualization, we analyzed the bookings of guests from different countries, focusing on the top 15 countries. The insights obtained are as follows:

Highest Booking Percentage: The country with the highest booking percentage is PRT, accounting for 35.25% of the bookings from the top 15 countries (the chart normalises by the top-15 total, not by all bookings). This indicates a significant number of bookings coming from this country compared to others in the top 15 list.

Distribution of Bookings: Each of the remaining countries in the top 15 list accounts for 13.43% or less. Bookings are therefore spread across many countries, each with a much smaller share than PRT.

Low Booking Percentages: Beyond the top 15 countries, the remaining countries have booking percentages of less than or equal to 1%. This indicates that these countries contribute a minimal portion of the total bookings.

These insights help us understand the booking patterns and the significance of different countries in terms of generating bookings. It allows us to identify the countries with the highest potential and focus our marketing efforts accordingly.
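
When reading these percentages, it matters which total they are divided by. The sketch below, on made-up toy data, contrasts a share computed within the top N countries with a share computed over all bookings.

```python
import pandas as pd

# Made-up country codes: 10 bookings in total (illustration only)
countries = pd.Series(["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 1 + ["ESP"] * 1)

counts = countries.value_counts()
top = counts.head(2)  # top 2 here; the notebook uses the top 15

# Share within the top-N group only (what y / sum(y) computes after head())
share_within_top = top / top.sum() * 100

# Share of all bookings
share_of_all = top / counts.sum() * 100

print(share_within_top.to_dict())
print(share_of_all.to_dict())
```

The within-group share is always at least as large as the overall share, so the two should not be mixed in the same statement.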

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. By analyzing bookings country by country, we can identify the key parameters that are important for increasing bookings. This information can be used to devise strategies and make improvements in areas that may lead to increased bookings and overall business growth.

There are no insights from the given analysis that indicate negative growth. The focus is on identifying opportunities for improvement and maximizing bookings, which can have a positive impact on the business.

Chart - 4 Market Segment Analysis(Univariate analysis)

In [None]:
# visualization Market segment column to know the insights of the bookings(Univariate analysis)
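fig, ax = plt.subplots(figsize=(10, 5))
# share of each market segment in percent, ordered alphabetically by segment name
segment_counts = df.groupby('market_segment').size()
segment_share = segment_counts / segment_counts.sum() * 100
segment_share.plot(kind='pie', autopct='%0.2f%%', explode=[0, 0.5, 0.5, 0, 0, 0, 0], ax=ax)
plt.show()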


for i ,j in df['market_segment'].value_counts().items():
  print('The Market segment is',i,'and the bookings are',j)

##### 1. Why did you pick the specific chart?

We chose to use a pie chart for the market segment analysis because it effectively displays the proportions of different market segments. Pie charts are particularly useful when presenting data that has already been calculated as a percentage of the whole. By using a pie chart, we were able to visualize the percentage comparison of market segments for the bookings in a clear and concise manner.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the chart are as follows:

- The majority of the bookings, accounting for 59.10%, are made through Online Travel Agents (TA).
- Offline Travel Agents/Tour Operators (TA/TO) account for 15.88% of the bookings.
- Direct bookings, made directly with the hotel, represent 13.50% of the total bookings.
- The other market segments (groups, corporate, complementary, and aviation) each make up less than 5% of the bookings.


These insights provide valuable information about the distribution of bookings among different market segments, highlighting the significant contribution of Online TA and the importance of direct bookings as well.
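
When several segments are very small, the slices of a pie chart become hard to read. One optional variation, which is not used in the original notebook, is to lump every segment below a threshold into a single "Other" slice before plotting. The counts below are made up for illustration and the 5% threshold is an assumption.

```python
import pandas as pd

# Made-up segment counts that sum to 100 (illustration only)
segment_counts = pd.Series({
    "Online TA": 60, "Offline TA/TO": 16, "Direct": 13,
    "Groups": 6, "Corporate": 4, "Aviation": 1,
})

shares = segment_counts / segment_counts.sum() * 100

# Segments below the (assumed) 5% threshold are merged into one slice
small = shares < 5
grouped = shares[~small].copy()
grouped["Other"] = shares[small].sum()

print(grouped.round(2).to_dict())
```

The `grouped` series can then be passed to `.plot(kind='pie', autopct='%0.2f%%')` in the same way as the ungrouped shares.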

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. Understanding the distribution of bookings among different market segments allows the hotel to focus its marketing and promotional efforts accordingly. By knowing that the majority of bookings come from Online TA, the hotel can allocate resources towards strengthening partnerships with online travel agencies and improving their online presence to attract more bookings. Additionally, identifying the market segments with lower percentages can help the hotel develop targeted strategies to increase bookings from those segments.

As for insights leading to negative growth, there are no specific insights from this chart that suggest negative growth. The focus is on understanding and maximizing bookings from different market segments, which can contribute to overall business growth. However, it is important to consider other factors and conduct a comprehensive analysis to identify any potential challenges or areas for improvement that could affect business growth.

Chart - 5 Analysis on 'Reserved room type' (Univariate Analysis)

In [None]:
# Visualization on reserved room type column to show its bookings percentage
df4 = df['reserved_room_type'].value_counts()
x = df4.index
y = df4.values
plt.figure(figsize = (10,5))
plots = sns.barplot(x= x,y= y/sum(y)*100)

for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 12, xytext = (0,8),ha = 'center',va = 'center',
                 textcoords = 'offset points')

plt.title('Bookings for reserved room type')
plt.xlabel('Room type')
plt.ylabel('Percentage of bookings')
plt.show()

for i,j in df4.items():
  print('The reserved room type is',i,'and the bookings are',j)

##### 1. Why did you pick the specific chart?

To show the percentage of the reserved room types and their distribution in the bookings, we used bar charts. Bar charts are effective for comparing the size or frequency of different categories or groups of data. They allow us to easily compare data across different room types and provide a visual representation of the distribution of bookings among each type. By using bar charts, we can present the information in a clear and concise manner, making it easier for viewers to interpret and understand the data.

##### 2. What is/are the insight(s) found from the chart?

From the visualization, we can gather several insights:

The majority of guests prefer Room Category A, which accounts for 64.70% of the bookings.

Room Category D is the second most popular choice among guests, representing 19.92% of the bookings.

The remaining room categories (B, C, E, F, G, H, and L) have relatively lower percentages, each accounting for less than 6% of the bookings.

These insights provide an understanding of the distribution of room types preferred by guests. Hotel managers can use this information to assess the popularity of different room categories and make informed decisions regarding room availability, pricing strategies, and marketing efforts to maximize bookings and customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the analysis can potentially create a positive business impact. By understanding that guests predominantly prefer Room Categories A and D, hotel managers can focus on enhancing these categories to attract more bookings and increase revenue. This could involve improving the amenities, aesthetics, and overall experience of these room types to align with customer preferences.

Regarding the other room categories with lower percentages, it provides an opportunity for further investigation. Analyzing the reasons behind the lower demand for these categories can help identify any shortcomings or areas for improvement. This analysis can inform decision-making processes related to room renovations, pricing strategies, or targeted marketing campaigns to boost the popularity and occupancy rates of these less preferred room types.

Overall, the insights gained from the analysis offer valuable information that can guide business strategies and initiatives to drive positive growth and improve the overall guest experience.

Chart - 6 Analysis on 'arrival month'(Univariate Analysis)

In [None]:
# Visualization on arrival_date_month

df6 = df['arrival_date_month'].value_counts()
x = df6.index
y = df6.values
plt.figure(figsize = (13,7))
plots = sns.barplot(x =x,y = y/sum(y)* 100)

for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".1f")}%',
                 (bar.get_x()+ bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha= 'center',
                 va ='center',textcoords = 'offset points')

plt.title('Bookings by arrival month')
plt.xlabel('Arrival month')
plt.ylabel('Percentage of bookings')
plt.show()

for i , j in df6.items():
  print('The arrival month of the guest is',i,'and its bookings are',j)

##### 1. Why did you pick the specific chart?

We chose a bar chart for this analysis because it allows us to compare the size or frequency of different categories or groups of data effectively. Bar charts are especially useful when displaying a large amount of data in a limited space.

In this case, we utilized a bar chart to represent the percentage of bookings based on the arrival month of the guests. By using this chart, we can easily compare the booking patterns across different months and identify any significant trends or variations in guest arrivals.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the chart are as follows:

The month of August has the highest percentage of guest arrivals, accounting for 12.9% of the total bookings. On the other hand, the month of January has the lowest percentage of guest arrivals, with only 5.4% of the total bookings. This indicates that there may be certain factors or trends that influence the booking patterns in different months. Further analysis can be performed to identify the reasons behind the lower bookings in January and to explore strategies for increasing the share of bookings in the quieter months.
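
Since `value_counts()` sorts by frequency, the bars appear from busiest to quietest month. To read the seasonal pattern, it can help to put the months in calendar order. The sketch below shows one way to do that with `reindex` on made-up toy arrivals; the notebook's chart itself keeps the frequency order.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Made-up arrival months (illustration only)
arrivals = pd.Series(["August", "January", "August", "July", "August", "July"])

# Share of bookings per month in percent, sorted by frequency
share = arrivals.value_counts(normalize=True) * 100

# Same values in calendar order; months without bookings are dropped
share_by_calendar = share.reindex(month_order).dropna()

busiest = share.idxmax()
quietest = share.idxmin()

print(share_by_calendar.round(2).to_dict())
print("busiest:", busiest, "quietest:", quietest)
```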

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact. By understanding the booking patterns and the variations in guest arrivals by month, businesses can develop targeted marketing strategies and promotional offers to attract more guests during the months with lower bookings, such as January. By identifying the reasons behind the lower bookings in certain months, businesses can address any potential issues or challenges that may be impacting guest arrivals and take appropriate measures to improve the situation.

Overall, the gained insights can provide valuable information for making informed business decisions, optimizing marketing strategies, and improving customer satisfaction, all of which can contribute to a positive business impact.

Chart - 7 Analysis on 'total_number_of_people' (Univariate Analysis)

In [None]:
# Visualization on the total_number_of_people column
df7 = df['total_number_of_people'].value_counts()
x = df7.index
y = df7.values

plt.figure(figsize  =(10,5))
plots = sns.barplot(x= x, y =y/sum(y) * 100)
plt.title("Total number of persons for single booking")
plt.xlabel("Number of persons")
plt.ylabel("Percentage of bookings")
plt.show()

for i,j in df7.items():
  percentage = round(j/sum(df7.values)*100,2)
  print(f'The number of people for a booking is {i} and the number of bookings are {j} i.e {percentage}%')

##### 1. Why did you pick the specific chart?

Bar charts are chosen to compare the size or frequency of different categories or groups of data. They are particularly useful for comparing data across categories and displaying a large amount of data in a compact manner.

In this case, we used a bar chart to depict the percentage of the number of persons arriving for a single booking.
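The percentage share per group size can also be obtained directly with `value_counts(normalize=True)`. The sketch below is a minimal illustration on a small made-up series; `people` is a hypothetical stand-in for `df['total_number_of_people']`, and the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for df['total_number_of_people']
people = pd.Series([2, 2, 1, 2, 3, 1, 2, 4], name="total_number_of_people")

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
share = people.value_counts(normalize=True).mul(100).round(2)

for size, pct in share.items():
    print(f"{size} person(s) per booking: {pct}% of bookings")
```

Because the shares are computed in one step, the same series can feed both the bar plot and the printed summary.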

2. What is/are the insight(s) found from the chart?

From the above visualization, we can gather the following insights:

The majority of bookings consist of 2 persons. Single-person bookings also make up a considerable proportion, indicating a sizeable segment of solo travelers. Bookings for larger groups or bulk reservations (e.g., 6, 40, 50, or 55 people) are rare and represent only a small percentage of the total bookings.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. By understanding the distribution of the number of persons per booking, the hotel can tailor their offerings and services to accommodate the needs of different types of guests. For example, they can focus on providing amenities and packages suitable for double occupancy bookings, as it is the most common category. Additionally, they can explore strategies to attract group or bulk bookings, such as offering discounts or creating special packages for larger groups. These efforts can lead to increased bookings and revenue.

There are no insights from the chart that directly indicate negative growth. However, it's important to monitor the trends and patterns in different categories of bookings over time to identify any potential areas of concern. For instance, if there is a significant decline in single-person bookings, it may indicate a need to reassess marketing strategies or amenities targeting solo travelers. Regular analysis and adaptation to changing market dynamics can help mitigate any negative impacts and ensure continued business growth.

## Chart - 8 Analysis on total nights stay column (Univariate Analysis)

In [None]:
# Visualization on total_nights_stay for the bookings
df8 = df['total_nights'].value_counts().sort_values(ascending = False).head(27)
x = df8.index
y = df8.values

plt.figure(figsize = (15,8))
plots = sns.barplot(x =x ,y =y/sum(y) * 100 )

for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha ='center', va ='center',
                 textcoords = 'offset points')

plt.title("Total night stays for a booking")
plt.xlabel("Number of nights stayed")
plt.ylabel("Percentage of the bookings")
plt.show()

for i, j in df8.items():
  percentage = round(j / sum(df8.values)*100,2)
  print(f'{j} i.e {percentage}% of bookings are having the total nights of stay - {i}')

1. Why did you pick the specific chart?

Bar charts are commonly used to compare the size or frequency of different categories or groups of data. They are effective in visualizing data across different categories and can display a large amount of information in a compact manner.

Using a bar chart to show the percentage of total nights of stay for each booking allows for easy comparison and identification of patterns or trends. It provides a clear visual representation of the distribution of booking durations and helps understand the typical length of stay for guests.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the chart are as follows:

The largest share of bookings (20.45%) has a duration of 3 nights, indicating that guests tend to prefer a short stay.
Bookings with a duration of 1 night account for 19.69% of the total, suggesting that many guests opt for a quick overnight stay.
Similarly, bookings with a duration of 2 nights make up 18.05% of the total, indicating a preference for a slightly longer stay.
It is worth noting that there are a few bookings with an extended duration of 30 nights, although they represent a very small percentage (0.01%) of the total bookings.
These insights provide valuable information about the distribution of booking durations and can help inform decision-making related to room availability, pricing strategies, and resource allocation for different lengths of stay.
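One detail worth checking: the chart code keeps only the 27 most frequent stay lengths and divides by the sum of those counts. If `total_nights` has more distinct values than that, the plotted percentages are slightly larger than each value's share of all bookings. The sketch below shows the difference on a made-up series (`nights` is a hypothetical stand-in for `df['total_nights']`, and the top 3 values are kept instead of 27).

```python
import pandas as pd

# Hypothetical sample standing in for df['total_nights']
nights = pd.Series([3, 3, 3, 1, 1, 2, 2, 7, 14, 30], name="total_nights")

counts = nights.value_counts()
top = counts.head(3)                      # the 3 most common stay lengths

share_of_top = top / top.sum() * 100      # denominator: only the rows kept
share_of_all = top / counts.sum() * 100   # denominator: every booking

print(pd.DataFrame({"share_of_top_%": share_of_top.round(2),
                    "share_of_all_%": share_of_all.round(2)}))
```

Dividing by `counts.sum()` keeps the percentages comparable with figures computed on the whole dataset.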

3. Will the gained insights help creating a positive business impact?


Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can indeed help create a positive business impact.

By understanding the preferred durations of guests' stays, the hotel can tailor its offerings and services accordingly. For instance, the hotel can design packages or promotional deals specifically targeted at guests looking for stays of 1 to 5 nights, which are the most common durations. This can help attract more bookings from these segments and increase the overall occupancy rate.

Additionally, identifying the guests who prefer longer durations, such as corporate clients or extended-stay tourists, presents an opportunity to develop special rates or packages tailored to their needs. This can help attract and retain these valuable customers, leading to increased revenue and positive business growth.

Chart - 9 Hotel and cancellation analysis to find the cancellation percentage for each hotel(Bivariate Analysis)

In [None]:
#visualization on the Hotel analysis
# Selecting and counting total cancelled booking for each hotel
df_cancel = df[df['is_canceled'] == 1]
cancel_grp = df_cancel.groupby('hotel')
df_canceled = pd.DataFrame(cancel_grp.size()).rename(columns = {0:'total_cancelled_bookings'})

# counting total number of bookings for each hotel
df_total_bookings = pd.DataFrame(df.groupby('hotel').size()).rename(columns = {0 : 'total_bookings'})
df_final = pd.concat([df_canceled,df_total_bookings],axis = 1)

# calculating the percentage of canceled bookings and adding it as a new column to the dataframe 'df_final'
df_final['canceled %'] = round(df_final['total_cancelled_bookings']/df_final['total_bookings'] * 100,2)
print(df_final)

# plotting bar chart to represent the cancelled % of each hotel
plt.figure(figsize = (8,5))
plots = sns.barplot(x = df_final.index, y = df_final['canceled %'])
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8),ha = 'center', va = 'center',
                 textcoords = 'offset points')
plt.title('Hotel and its canceled %')
plt.xlabel('type of the hotel')
plt.ylabel('percentage of the cancellation of bookings')
plt.show()

1. Why did you pick the specific chart?

To represent the percentage of cancelled bookings for each type of hotel, we used a bar chart. Bar charts are an effective way to compare data across different categories or groups, making it easier to understand the distribution and proportion of cancelled bookings for each type of hotel.

2. What is/are the insight(s) found from the chart?

From the visualization, we can gather the following insights: The percentage of cancelled bookings is higher for city hotels, with 30.10% of the total bookings being cancelled. In contrast, for resort hotels, the percentage of cancellations is 23.48%. This indicates that city hotels experience a relatively higher rate of cancellations compared to resort hotels.
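Since `is_canceled` is a 0/1 flag, its mean within a group is the cancellation rate, so the filter, group, and concatenate steps can be condensed into a single aggregation. The following is a minimal sketch on made-up rows; `sample` and its values are hypothetical and only reuse the notebook's column names.

```python
import pandas as pd

# Hypothetical sample with the same column names as the notebook's df
sample = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Named aggregation: the mean of a 0/1 column is the fraction of cancelled bookings
summary = sample.groupby("hotel")["is_canceled"].agg(
    total_bookings="size",
    total_cancelled_bookings="sum",
    canceled_pct="mean",
)
summary["canceled_pct"] = (summary["canceled_pct"] * 100).round(2)
print(summary)
```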

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights regarding the higher cancellation rate for city hotels compared to resort hotels can have a positive business impact. By identifying the factors contributing to the higher cancellation rate in city hotels, the hotel management can take proactive measures to address those issues. This may include improving the booking policies, enhancing customer service, or offering incentives to encourage guests to follow through with their reservations.

Reducing the cancellation rate can lead to improved occupancy rates, increased revenue, and better customer satisfaction. It helps in optimizing resource allocation, such as staff scheduling and inventory management, as well as reducing the negative impact on other potential bookings that could have taken those canceled slots.

Overall, the insights gained from analyzing the cancellation rates can provide valuable information for implementing strategies to minimize cancellations and maximize positive business impact.

## Chart - 10 Analysis on reserved rooms and assigned rooms(Bivariate Analysis)

In [None]:
# filtering the columns of reserved rooms and assigned rooms of True condition and creating new column as 'same room assigned'  for plotting a chart for analysis between them
df['same_room_assigned'] = df['reserved_room_type']==df['assigned_room_type']

plt.figure(figsize = (12,8))
sns.countplot(x= df['total_nights'], hue = df['same_room_assigned'])
plt.show()

In [None]:
# total nights of stay when same room is alloted which is reserved while booking
same_room_assigned = df[df['reserved_room_type'] == df['assigned_room_type']]
plt.figure(figsize = (10,8))
sns.countplot(x= same_room_assigned['total_nights'], hue = same_room_assigned['hotel'])
plt.show()

for i, j in same_room_assigned['total_nights'].value_counts().items():
  print(f'The total nights of stay {i} are having the bookings of {j}')

In [None]:
# total nights of stay when same room is not alloted which is reserved while booking the hotel
same_room_not_assigned = df[df['reserved_room_type'] != df['assigned_room_type']]
plt.figure(figsize = (10,8))
sns.countplot(x= same_room_not_assigned['total_nights'], hue = same_room_not_assigned['hotel'])
plt.show()

for i, j in same_room_not_assigned['total_nights'].value_counts().items():
  print(f'The total nights of stay {i} is having the bookings of {j}')

##### 1. Why did you pick the specific chart?

Count plots are bar plots that show the count or frequency of each category of a variable. They are particularly useful when we want to analyze the distribution of categorical data and compare the counts across different categories or discrete numerical values. The `hue` parameter adds a second variable to the same chart, which makes the comparison easier to read.

We used count plots to compare the number of bookings for each value of total nights stayed, split by whether the guest was allotted the same room type that was reserved.

##### 2. What is/are the insight(s) found from the chart?

Based on the insights from the chart, it can be observed that there is a significant difference in the number of bookings where the same room is assigned and reserved while booking compared to cases where the same room is not assigned.

For the "same room assigned" analysis, there are 15,500 bookings in the city hotel type and 5,000 bookings in the resort hotel type. This indicates that a substantial number of guests prefer to have the same room they reserved during the booking process, especially in city hotels.

On the other hand, for the "same room not assigned" analysis, there are 1,700 bookings in the city hotel type and 2,500 bookings in the resort hotel type. This suggests that a smaller proportion of guests are flexible or open to being assigned a different room than the one they reserved while making the booking.

These insights provide an understanding of customer preferences regarding room assignments and can be useful for hotel management in terms of managing room allocation and addressing customer expectations.
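Raw counts are hard to compare between hotel types because the two hotels have different numbers of bookings. A row-normalized cross-tabulation puts both on the same scale. This is a sketch on made-up rows; `sample`, the room codes, and the `room_match` column are hypothetical.

```python
import pandas as pd

# Hypothetical sample reusing the notebook's column names
sample = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "reserved_room_type": ["A", "A", "D", "A", "A", "D", "E", "A"],
    "assigned_room_type": ["A", "A", "D", "B", "A", "E", "E", "C"],
})

# Label each booking by whether the assigned room type matches the reserved one
same = sample["reserved_room_type"] == sample["assigned_room_type"]
sample["room_match"] = same.map({True: "same", False: "different"})

# normalize="index" turns each row (hotel) into shares that sum to 1
rates = pd.crosstab(sample["hotel"], sample["room_match"], normalize="index") * 100
print(rates.round(2))
```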

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the analysis can indeed help create a positive business impact. By understanding that guests prefer having the same room they reserved during the booking process, hotel management can focus on optimizing their room assignment systems and processes to meet this expectation. This can lead to increased customer satisfaction, improved guest experiences, and potentially higher customer loyalty and repeat bookings.

On the other hand, the analysis does not directly provide insights that lead to negative growth. However, if the hotel fails to meet the expectation of assigning the same room as reserved, it may result in customer dissatisfaction and potentially negative reviews or reduced customer loyalty. Therefore, it is important for hotels to take these insights into account and strive to provide consistent room assignments to align with customer expectations.

# Chart - 11 Analysis on total nights stay for not cancelled bookings(Bivariate analysis)

In [None]:
# Chart - 11 visualization code
# keeping only the not cancelled bookings with fewer than 15 total nights, split by hotel type
temp_df = df[df['is_canceled'] == 0 ]
temp_15 = temp_df[temp_df['total_nights'] < 15]
plt.figure(figsize = (10,5))
# x and hue are taken from the same filtered dataframe so that every row has a hotel label
sns.histplot(data = temp_15, x = 'total_nights', bins = 15, alpha = 0.7, hue = 'hotel')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is a graphical representation of the distribution of a dataset. It is commonly used to visualize the frequency or count of values falling within specific intervals or bins. Histograms are especially useful for understanding the shape, central tendency, and variability of the data. By organizing data into bins and representing the count of values in each bin, a histogram allows us to observe patterns, trends, and outliers. We chose this plot to show how frequently each value of total nights occurs among the not cancelled bookings.

##### 2. What is/are the insight(s) found from the chart?

From the analysis of the chart, we can observe that most of the not cancelled bookings have a stay duration of 1, 2, 3, or 4 nights. These durations appear to be the most common choices for guests who do not cancel their bookings. Additionally, this insight can guide revenue management strategies, such as optimizing pricing and availability for the popular stay durations, and ensuring that the hotel is prepared to accommodate guests during those periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. By identifying the preferred durations of stay for not cancelled bookings, the hotel can tailor their offerings and promotions to align with these popular durations. This can attract more guests and increase the overall occupancy rate, leading to positive business growth.

There are no specific insights that indicate negative growth. However, it is important for the hotel to also consider other factors such as room availability, pricing, and customer preferences to ensure a well-rounded strategy for maximizing bookings and revenue.

Chart - 12 Analysis to find the waiting period of the bookings(Bivariate Analysis)

In [None]:
# getting waiting list for each hotel type
temp_df = df.groupby('hotel')['days_in_waiting_list'].mean().reset_index().rename(columns = {'days_in_waiting_list':'avg_waiting_period'})
print(temp_df)
plt.figure(figsize = (5,5))
plots = sns.barplot(x = temp_df['hotel'], y = temp_df['avg_waiting_period'] )
for bar in plots.patches:
  plots.annotate(format(bar.get_height(),".1f"),
                        (bar.get_x()+ bar.get_width()/2,bar.get_height()),
                        size = 10, xytext = (0,8), ha = 'center', va = 'center',
                          textcoords = 'offset points')

Analysis on is_repeated_guest to find the cumulative business (Bivariate Analysis)

In [None]:
# calculating the number of repeated guests to the each hotel type by using the 'is_repeated_guest' column and filtering it
repeated_guest = df[df['is_repeated_guest']== 1]
temp_repeated_guest = pd.DataFrame(repeated_guest.groupby('hotel').size()).rename(columns = {0: 'total repeated guest'})

# calculating the total number of bookings for each type of the hotel
total_bookings = pd.DataFrame(df.groupby('hotel').size()).rename(columns = {0:'total bookings'})

# concatenating the two dataframes for plotting the graph
repeated_guest_to_hotel = pd.concat([temp_repeated_guest,total_bookings], axis = 1)

# calculating the percentage of the guests returned to each type of the hotel
repeated_guest_to_hotel['return %'] = (repeated_guest_to_hotel['total repeated guest']/repeated_guest_to_hotel['total bookings']) * 100

print(repeated_guest_to_hotel)

# plotting the graph for the above dataframe
plt.figure(figsize = (8,5))
sns.barplot(x = repeated_guest_to_hotel.index, y = repeated_guest_to_hotel['return %'])

##### 1. Why did you pick the specific chart?

We chose to use a bar chart to display the average waiting period in days for each type of hotel and the percentage of repeated guests for each hotel. Bar charts are effective in comparing the size or frequency of different categories or groups of data, making them suitable for visualizing these metrics. Additionally, bar charts allow us to present a large amount of data in a compact format.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the chart are as follows:

The average waiting period for guests in the city type hotel is 1 day, while for the resort type hotel, it is 0.3 days. This indicates that guests in the resort hotel tend to have shorter waiting periods compared to guests in the city hotel.
The percentage of repeated guests in the city hotel is 3%, with 1657 out of 53274 bookings being from repeated guests. In contrast, the resort hotel has a higher percentage of repeated guests at 5%, with 1707 out of 33956 bookings coming from repeat visitors. This suggests that the resort hotel may have a higher level of guest loyalty and satisfaction, leading to a higher rate of repeat bookings.


Overall, these insights provide information on the average waiting period and guest loyalty for each type of hotel, which can be used to make informed business decisions and improve customer satisfaction.
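Both metrics in this chart, the average waiting period and the share of repeated guests, can be computed in one grouped aggregation. This is a minimal sketch on made-up rows; `sample`, `per_hotel`, and the `return_pct` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample reusing the notebook's column names
sample = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "days_in_waiting_list": [0, 4, 0, 0, 0, 0, 2, 0],
    "is_repeated_guest": [0, 0, 0, 1, 1, 0, 1, 0],
})

# One pass over the groups: (column, function) pairs name each output column
per_hotel = sample.groupby("hotel").agg(
    avg_waiting_period=("days_in_waiting_list", "mean"),
    total_bookings=("is_repeated_guest", "size"),
    total_repeated_guest=("is_repeated_guest", "sum"),
)
per_hotel["return_pct"] = (
    per_hotel["total_repeated_guest"] / per_hotel["total_bookings"] * 100
)
print(per_hotel.round(2))
```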

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. By focusing on reducing the waiting period and providing prompt room allocation, the hotel can enhance the guest experience and satisfaction. This can lead to positive word-of-mouth, increased customer loyalty, and ultimately, a positive impact on the hotel's reputation and business growth.

The higher percentage of repeated guests in the resort hotel compared to the city hotel is also a positive insight. It indicates that the resort hotel has successfully built a loyal customer base, which can contribute to repeat bookings and increased revenue.

#### Chart - 13 Analysis on Average Daily Rate(ADR)(Bivariate Analysis)

In [None]:

#Visualization on the ADR column by calculating mean of adr
temp = df.groupby('hotel')['adr'].mean().reset_index().rename(columns = {'adr' :'avg_adr'})
print(temp)
plt.figure(figsize = (8,5))
sns.barplot(x = temp['hotel'], y = temp['avg_adr'])

In [None]:
# Filtering the hotel and is_cancelled columns and storing them in separate variables
df_city = df[(df['hotel'] == 'City Hotel') & (df['is_canceled'] == 0)]
df_resort = df[(df['hotel'] == 'Resort Hotel') & (df['is_canceled'] == 0)]

# calculating the monthly mean adr for df_city and df_resort and storing the results in new variables
city = df_city.groupby('arrival_date_month')['adr'].mean().reset_index()
resort = df_resort.groupby('arrival_date_month')['adr'].mean().reset_index()

# merging both variables on the shared arrival_date_month column
hotel  = city.merge(resort,on='arrival_date_month')

# renaming the columns in the hotel variable
hotel.columns = ['month','price_for_city','price_for_resort']

# list of month names in calendar order
months = ['January','February','March','April','May','June','July','August','September','October','November','December']

# converting the month column to an ordered categorical so that sorting follows the calendar
hotel['month'] = pd.Categorical(hotel['month'],categories = months,ordered = True)
hotel = hotel.sort_values('month').reset_index(drop = True)
plt.figure(figsize = (15,8))
print(hotel)
# plotting the line chart for the comparison of the city and resort hotel type with adr monthly wise
sns.lineplot(data = hotel, x ='month', y ='price_for_city', label = 'City hotel')
sns.lineplot(data = hotel, x ='month', y ='price_for_resort', label = 'Resort hotel')
plt.ylabel('adr')
plt.legend()
plt.show()

In [None]:
# visualization of average adr on monthly basis
temp_df = df.groupby('arrival_date_month',as_index =False)['adr'].mean()
plt.figure(figsize = (10,8))
print(temp_df.sort_values(by = 'adr',ascending = False))
ax = sns.barplot(data = temp_df , x = 'arrival_date_month', y = 'adr')
ax.set_title('Month and avg_adr')
plt.show()

In [None]:
# visualization of average adr with total nights of stay
temp_df = df.groupby('total_nights', as_index = False)['adr'].mean().head(20)
# plotting graph for the 20 total nights of stay for the booking
print(temp_df.sort_values(by = 'adr',ascending = False))
plt.figure(figsize = (10,8))
ax = sns.barplot(data = temp_df , x ='total_nights', y = 'adr')
ax.set_title('Total nights and average adr')

In [None]:
#visualization of average adr and the country of the bookings
temp_df = df.groupby('country', as_index = False)['adr'].mean()
temp_df.sort_values(by = 'adr',inplace = True,ascending = False)
print(temp_df)
# plotting graph for the top 15 countries with adr
plt.figure(figsize = (10,8))
ax = sns.barplot(data = temp_df.head(15), x = 'country', y = 'adr')
ax.set_title('Country and average adr')

In [None]:
# visualization of average adr and the total number of people for a booking
temp_df = df.groupby('total_number_of_people',as_index = False)['adr'].mean()
temp_df.sort_values(by ='adr',ascending = False, inplace = True)
print(temp_df)
# plotting line graph for the adr and total persons for a booking (the 10 group sizes with the highest average adr)
ax = sns.lineplot(data = temp_df.head(10), x = 'total_number_of_people', y = 'adr')
ax.set_title('total persons per booking to the average adr')
ax.set_ylabel('avg_adr')

In [None]:
# Analysis on the yearly sum of adr for each hotel type
temp_df = df.groupby(['arrival_date_year','hotel'])['adr'].sum().reset_index()
print(temp_df)
plt.figure(figsize = (10,8))
sns.barplot(data = temp_df, x = 'arrival_date_year', y = 'adr', hue = 'hotel')

##### 1. Why did you pick the specific chart?

Bar charts are used to compare the size or frequency of different categories or groups of data, and they can display a large amount of data in a small space. We used bar charts to compare the average ADR by hotel type, by month, by country, and by total nights of stay.

Line plots, also known as line charts or line graphs, are used to visualize the relationship between two continuous variables. They are commonly used to show trends over time or to display the continuous relationship between two variables. We used this chart with average adr on monthly basis with respect to hotel type and number of people to the avg adr.

##### 2. What is/are the insight(s) found from the chart?

The insights from the analysis are as follows:

The average daily rate (ADR) is higher for city hotels (111.2) compared to resort hotels (99.05).
In the monthly comparison, the highest ADR, about 180, occurs in August on the `price_for_resort` line, that is, for resort hotels, while city hotel ADR stays near its peak of about 120 from May to September.
The overall highest monthly ADR is 140, regardless of the hotel type. When considering the total nights of a booking, the maximum ADR of 120 is observed for bookings with a duration of 6 nights.
The country with the highest ADR is DJI, with an average rate of 270.
The ADR varies based on the number of people in a booking, and the highest ADR of 210 is observed for bookings with 5 people.
For the yearly figures, which are sums of ADR over all bookings rather than averages, the city hotel type reached its highest total of 2,885,999.28 in 2016, and the resort hotel type peaked at 1,417,106.56 in the same year.
These insights provide valuable information about the average daily rates and their variations based on different factors such as hotel type, month, total nights of stay, country, number of people, and year. This information can be used to make strategic decisions related to pricing, marketing, and resource allocation to maximize business impact.
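The monthly comparison by hotel type can also be built as a pivot table with the months in calendar order. The sketch below uses made-up rows; `sample`, `monthly`, and all ADR values are hypothetical. It takes the month names from the standard `calendar` module, which assumes the default English locale.

```python
import calendar
import pandas as pd

# Hypothetical sample: (hotel, arrival_date_month, adr, is_canceled)
rows = [
    ("City Hotel", "August", 110.0, 0),
    ("City Hotel", "August", 130.0, 0),
    ("Resort Hotel", "August", 150.0, 0),
    ("City Hotel", "January", 80.0, 0),
    ("Resort Hotel", "January", 50.0, 0),
    ("City Hotel", "February", 90.0, 0),
    ("Resort Hotel", "February", 60.0, 0),
    ("Resort Hotel", "February", 999.0, 1),  # cancelled, excluded below
]
sample = pd.DataFrame(rows, columns=["hotel", "arrival_date_month", "adr", "is_canceled"])

# Month names in calendar order; element 0 of calendar.month_name is an empty string
months = list(calendar.month_name)[1:]
sample["arrival_date_month"] = pd.Categorical(
    sample["arrival_date_month"], categories=months, ordered=True
)
# A value missing from `categories` silently becomes NaN, so spelling matters
assert sample["arrival_date_month"].notna().all()

# Mean ADR of not cancelled bookings: one row per month, one column per hotel
monthly = sample[sample["is_canceled"] == 0].pivot_table(
    index="arrival_date_month", columns="hotel", values="adr",
    aggfunc="mean", observed=True,
)
print(monthly)
```

Keeping one column per hotel type makes it harder to mix up the two series when labelling a plot.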

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can indeed help create a positive business impact. Here's how:

Targeting high-demand months: By identifying that the months of June, July, August, and September have higher business potential, the company can focus its marketing efforts and allocate resources to maximize revenue during these periods. This can include offering special promotions, discounts, or packages tailored to attract more customers during these peak months.

Developing low-demand months: Identifying the months of November, December, January, and February as lower business periods allows the company to strategize and plan marketing campaigns or initiatives to boost occupancy and revenue during these off-peak seasons. This could involve offering winter-specific packages, holiday-themed promotions, or incentives for specific events or occasions during these months.

Maximizing revenue from different group sizes: Recognizing that the average daily rate is higher for bookings with 1-5 members and 5-10 members, the company can tailor its pricing and marketing strategies to attract more bookings from these group sizes. This could involve creating customized packages, offering discounts for group bookings, or providing additional amenities or services tailored to the needs of these groups.

However, it's important to note that the analysis does not reveal any insights that would directly lead to negative growth. Rather, the insights provide opportunities for targeted marketing, pricing optimization, and resource allocation to enhance business performance and capture untapped potential during specific months or with different group sizes.

Cancellation Analysis (Bivariate Analysis)

In [None]:
# Visualization of cancelled bookings with respect to market segment

# filtering the cancelled bookings from the df with market segment basis and storing them as a dataframe in new variable
cancelled_data = df[df['is_canceled'] == 1]
cancelled_bookings= pd.DataFrame(cancelled_data.groupby('market_segment').size()).rename(columns = {0:'cancelled bookings'})

# filtering the total bookings on market segment basis
total_bookings = pd.DataFrame(df.groupby('market_segment').size()).rename(columns= {0:'total bookings'})

# concatenating the two variables
market_segment_bookings = pd.concat([cancelled_bookings,total_bookings],axis = 1)

# getting the percentage cancelled bookings on the basis of market segment
market_segment_bookings['cancelled %'] = round((market_segment_bookings['cancelled bookings']/(market_segment_bookings['total bookings']))*100,3)
print(market_segment_bookings)
# plotting a bar chart of the cancelled % for each market segment
plt.figure(figsize= (8,5))
plots = sns.barplot(data = market_segment_bookings, x =market_segment_bookings.index , y = 'cancelled %')
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha = 'center', va= 'center', textcoords = 'offset points')
plt.show()

In [None]:
# visualization on cancelled bookings with respect to lead time months
# filtering the cancelled bookings
cancelled_data= df[df['is_canceled'] == 1]
cancelled_bookings = pd.DataFrame(cancelled_data.groupby('lead_time_months').size()).rename(columns = {0:'cancelled bookings'})

# counting the total number of bookings with respect to the lead time month
total_bookings = pd.DataFrame(df.groupby('lead_time_months').size()).rename(columns = {0 : 'total bookings'})

# concatenating the two dataframes
data = pd.concat([cancelled_bookings,total_bookings], axis = 1 )

# calculating the cancelled percentage of the bookings
data['canceled %'] = round((data['cancelled bookings'] / data['total bookings']) * 100,2)
print(data)
# plotting the graph for the data
plt.figure(figsize = (10,5))
sns.lineplot(data =data ,x = data.index , y = data['canceled %'])
plt.show()

In [None]:
# visualization on cancelled bookings with respect to deposit type
# filtering the cancelled bookings
cancelled_data= df[df['is_canceled'] == 1]
cancelled_bookings = pd.DataFrame(cancelled_data.groupby('deposit_type').size()).rename(columns = {0:'cancelled bookings'})

# counting the total number of bookings with respect to the deposit type
total_bookings = pd.DataFrame(df.groupby('deposit_type').size()).rename(columns = {0 : 'total bookings'})

# concatenating the two dataframes
data = pd.concat([cancelled_bookings,total_bookings], axis = 1 )

# calculating the cancelled percentage of the bookings
data['canceled %'] = round((data['cancelled bookings'] / data['total bookings']) * 100,2)
print(data)

# plotting the graph for the data
plt.figure(figsize = (8,5))
plots = sns.barplot(data =data ,x = data.index , y = data['canceled %'])
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha = 'center', va= 'center', textcoords = 'offset points')

In [None]:
# visualization on cancelled bookings with respect to previous cancelled bookings
# filtering the cancelled bookings
cancelled_data= df[df['is_canceled'] == 1]
cancelled_bookings = pd.DataFrame(cancelled_data.groupby('is_previously_cancelled').size()).rename(columns = {0:'cancelled bookings'})

# counting the total number of bookings with respect to the previously cancelled bookings
total_bookings = pd.DataFrame(df.groupby('is_previously_cancelled').size()).rename(columns = {0 : 'total bookings'})

# concatenating the two dataframes
data = pd.concat([cancelled_bookings,total_bookings], axis = 1 )

# calculating the cancelled percentage of the bookings
data['canceled %'] = round((data['cancelled bookings'] / data['total bookings']) * 100,2)
print(data)
# plotting the graph for the data
plt.figure(figsize = (10,5))
plots = sns.barplot(data =data ,x = data.index , y = data['canceled %'])
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha ='center', va= 'center', textcoords = 'offset points')
plt.show()

In [None]:
# visualization on cancelled bookings with respect to the repeated guest
# filtering the cancelled bookings
cancelled_data= df[df['is_canceled'] == 1]
cancelled_bookings = pd.DataFrame(cancelled_data.groupby('is_repeated_guest').size()).rename(columns = {0:'cancelled bookings'})

# counting the total number of bookings with respect to the repeated guest
total_bookings = pd.DataFrame(df.groupby('is_repeated_guest').size()).rename(columns = {0 : 'total bookings'})

# concatenating the two dataframes
data = pd.concat([cancelled_bookings,total_bookings], axis = 1 )

# calculating the cancelled percentage of the bookings
data['canceled %'] = round((data['cancelled bookings'] / data['total bookings']) * 100,2)
print(data)
# plotting the graph for the data
plt.figure(figsize = (10,5))
plots = sns.barplot(data =data ,x = data.index , y = data['canceled %'])
for bar in plots.patches:
  plots.annotate(f'{format(bar.get_height(),".2f")}%',
                 (bar.get_x() + bar.get_width()/2, bar.get_height()),
                 size = 10, xytext = (0,8), ha = 'center', va= 'center', textcoords = 'offset points')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts compare the size or frequency of different categories or groups of data, and they can display a large amount of data in a small space. We used bar charts to analyse the cancellation percentage against individual features: market segment, deposit type, whether the booking has previous cancellations, and whether the guest is a repeated guest.

Line plots, also known as line charts or line graphs, are used to visualize the relationship between two continuous variables. They are commonly used to show trends over time. We used a line plot to see how the cancellation rate trends with lead time, measured in months.

##### 2. What is/are the insight(s) found from the chart?

Based on the insights derived from the chart, we can identify several factors that contribute to the cancellations of bookings. These insights can help inform business strategies and initiatives:

Market Segment: The analysis shows that the online travel agency (TA) market segment has the highest percentage of cancellations at 35.3%. This suggests that there may be specific challenges or factors influencing cancellations from this segment. Businesses can focus on understanding the reasons behind these cancellations and develop targeted measures to reduce them.

Deposit Type: The data reveals that bookings with non-refundable deposit types have a significantly higher cancellation rate of 94.70%. On the other hand, bookings with refundable deposit types have a lower cancellation rate of 24.30%. This indicates that the deposit policy plays a crucial role in cancellation behavior. Businesses may consider adjusting their deposit policies to incentivize guests to keep their bookings.

Previous Cancellation History: Bookings with a previous cancellation history have a higher cancellation rate of 68.00%, while those without a previous cancellation history have a lower rate of 26.73%. This suggests that past cancellation behavior can be an indicator of future cancellations. Businesses can leverage this insight to implement targeted retention strategies for guests with a history of cancellations.

Repeated Guests: The analysis shows that repeated guests have a lower cancellation rate of 7.73% compared to 28.3% for non-repeated guests. This indicates that guest loyalty and familiarity with the establishment may play a role in reducing cancellations. Businesses can focus on nurturing and rewarding repeat guests to enhance customer loyalty and reduce cancellations.

Lead Time: The lead time in months for bookings also affects the cancellation rate. Bookings with a lead time greater than 16 months have a 100% cancellation rate, while those with a lead time between 6-15 months have a 66% cancellation rate. Bookings with a lead time less than 6 months have a lower cancellation rate of 35%. This suggests that longer lead times may increase the likelihood of cancellations. Businesses can optimize their booking policies, marketing strategies, and pricing to mitigate cancellations associated with longer lead times.
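
The lead-time figures above depend on how bookings are grouped into month ranges, and very long lead times are usually rare, so it helps to show the number of bookings next to each percentage. The following is a minimal sketch on made-up data, not the notebook's actual cell: the `toy` frame, the 30-day month approximation, and the bucket edges (in particular where a lead time of exactly 16 months falls) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook this would be the cleaned `df`.
toy = pd.DataFrame({
    "lead_time": [10, 45, 200, 250, 400, 500, 30, 190],  # days before arrival
    "is_canceled": [0, 0, 1, 0, 1, 1, 1, 1],
})

# Approximate a month as 30 days (an assumption, not the notebook's definition).
toy["lead_time_months"] = toy["lead_time"] // 30

# Assumed bucket edges: 0-5 months, 6-15 months, 16 months and above.
bins = [-1, 5, 15, np.inf]
labels = ["< 6 months", "6-15 months", "16+ months"]
toy["lead_bucket"] = pd.cut(toy["lead_time_months"], bins=bins, labels=labels)

# Count bookings and cancellations per bucket, then derive the percentage.
summary = toy.groupby("lead_bucket", observed=False)["is_canceled"].agg(["count", "sum"])
summary = summary.rename(columns={"count": "total bookings", "sum": "cancelled bookings"})
summary["canceled %"] = (summary["cancelled bookings"] / summary["total bookings"] * 100).round(2)
print(summary)
```

Reading the `total bookings` column alongside `canceled %` makes it easier to judge whether an extreme rate rests on many bookings or only a handful.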

Overall, these insights provide valuable information for businesses to understand the patterns and drivers of cancellations. By addressing the factors contributing to cancellations, businesses can implement targeted measures and strategies to reduce cancellations, improve customer satisfaction, and ultimately have a positive impact on the bottom line.
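
The per-category cancellation rates discussed in this section all follow the same recipe: count the bookings in each category, count the cancelled ones, and divide. A possible way to avoid repeating that recipe for every feature is a small helper like the one sketched below. The function name `cancellation_rate_by` and the `toy` frame are hypothetical; only the column names `is_canceled` and `is_repeated_guest` come from the notebook.

```python
import pandas as pd

def cancellation_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return total bookings, cancelled bookings and cancellation % per category."""
    # `is_canceled` is 0/1, so its count is the number of bookings
    # and its sum is the number of cancelled bookings.
    out = frame.groupby(column)["is_canceled"].agg(["count", "sum"])
    out = out.rename(columns={"count": "total bookings", "sum": "cancelled bookings"})
    out["canceled %"] = (out["cancelled bookings"] / out["total bookings"] * 100).round(2)
    return out

# Hypothetical toy data standing in for the notebook's `df`.
toy = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1],
    "is_canceled":       [1, 1, 0, 0, 0, 0],
})
rates = cancellation_rate_by(toy, "is_repeated_guest")
print(rates)
```

One design point worth noting: if a category happens to have no cancelled bookings at all, summing `is_canceled` reports 0 for it, whereas filtering the cancelled rows first and concatenating the two counts would leave a missing value for that category.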

##### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.


Yes, the gained insights can help create a positive business impact. By understanding the factors contributing to cancellations and taking appropriate actions, businesses can reduce cancellations and improve overall customer satisfaction and profitability. Here are some specific examples:

Restricting Online TA Bookings: Since the analysis shows that the online travel agency (TA) market segment has a high cancellation rate, businesses can implement measures to restrict or regulate bookings from this segment. This can include setting specific terms and conditions for online TA bookings or requiring additional guarantees to reduce the likelihood of cancellations.

Addressing Non-Refundable Bookings: Non-refundable bookings have a significantly higher cancellation rate. To mitigate this, businesses can focus on improving communication and transparency with guests regarding the reasons for potential date changes, room availability, or location preferences. By providing a clear understanding of these factors, guests may be more likely to honor their bookings and avoid cancellations.

Managing Bookings with Longer Lead Times: Bookings with longer lead times, particularly those exceeding 6 months, have higher cancellation rates. Businesses can explore implementing advance payment or full payment options for these bookings. Requiring guests to make a substantial financial commitment upfront can reduce the likelihood of cancellations and increase the commitment to stay.

While these insights provide opportunities for positive business impact, it's important to note that the insights also highlight areas that may lead to negative growth if not addressed. For example, a high cancellation rate from certain market segments or deposit types can result in lost revenue and potential negative reviews or reputation impact. Therefore, it's crucial for businesses to take proactive steps to address these insights and implement strategies that minimize cancellations while maximizing customer satisfaction and overall business growth.

## Chart - 14 - Correlation Heatmap

In [None]:

# Correlation Heatmap visualization code
# storing the columns in a variable
num_df = df[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_number_of_people']]

# using the correlation
corrmat = num_df.corr()
f, ax = plt.subplots(figsize = (12,10))
sns.heatmap(corrmat,annot = True,fmt = ".2f",annot_kws ={'size': 10},vmin = -1, vmax = 1, square = True)

##### 1. Why did you pick the specific chart?

##### 2. What is/are the insight(s) found from the chart?

From the correlation heatmap, we can observe the following insights:

Lead time and booking changes have a slight positive correlation. This suggests that as the lead time increases, there may be a slightly higher likelihood of changes being made to the booking.

Lead time is also slightly related to the number of days on the waiting list. This implies that longer lead times may result in a slightly higher chance of being on the waiting list for more days.

Average Daily Rate (ADR) shows a slight positive correlation with the number of people for a booking. This indicates that bookings with more people may have slightly higher ADR.

Total special requests are slightly related to ADR. This suggests that bookings with more special requests may have slightly higher ADR.

The number of people for a booking is slightly related to the lead time. This implies that bookings with more people may have slightly longer lead times.

Overall, the correlations observed in the heatmap are relatively weak, with coefficients close to zero. This suggests that the variables in the dataset may not have strong linear relationships with each other.
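
Reading the strongest relationships off a heatmap by eye can be error-prone when most coefficients are small. As a complement, the correlation matrix can be flattened into a ranked list of variable pairs. This is a minimal sketch on synthetic data: the helper name `strongest_pairs` and the generated `toy` frame are assumptions for illustration, and the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame: pd.DataFrame, top: int = 5) -> pd.Series:
    """Rank variable pairs by the absolute value of their Pearson correlation."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Synthetic data: `booking_changes` is built to depend weakly on `lead_time`.
rng = np.random.default_rng(0)
lead_time = rng.integers(0, 400, size=200)
toy = pd.DataFrame({
    "lead_time": lead_time,
    "booking_changes": lead_time * 0.01 + rng.normal(0, 2, size=200),
    "adr": rng.normal(100, 30, size=200),
})
top = strongest_pairs(toy, top=3)
print(top)
```

In the notebook, the same helper could be applied to `num_df` in place of `toy`.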

## Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
final_data = sns.pairplot(num_df, hue = 'previous_cancellations', palette = 'Set2')

##### 1. Why did you pick the specific chart?

A pairplot, also known as a scatterplot matrix, is a visualization that allows you to visualize the relationships between all pairs of variables in a dataset. It is a useful tool for data exploration because it allows you to quickly see how all of the variables in a dataset are related to one another.

Thus, we used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys similar information to the correlation heatmap, but shows the underlying scatter of points for each pair of variables instead of a single coefficient.

##### 2. What is/are the insight(s) found from the chart?

Based on the pairplot analysis, the variables in the dataset appear to have weak or no significant linear relationships with each other. There are no clear patterns or strong correlations between the variables. This suggests little linear dependence among them, although a pair plot alone cannot rule out non-linear relationships.
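
A weak linear (Pearson) correlation does not by itself show that two variables are unrelated, because Pearson's coefficient understates relationships that are monotonic but non-linear. The sketch below illustrates this on synthetic data by comparing Pearson correlation with a rank-based (Spearman) correlation, computed here as Pearson correlation on ranks. The variables `x` and `y` are made up for the example and are not columns from the hotel dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(0, 10, size=500))
y = np.exp(x)  # strictly increasing in x, but strongly non-linear

pearson = x.corr(y)                  # measures linear association
spearman = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman correlation

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two coefficients differ noticeably for a pair of features in `num_df`, that pair may deserve a closer look than the heatmap alone would suggest.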

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis, here are some suggestions to achieve the business objectives:

1. Cancellation Policies: Implement stricter rules and policies for cancellations, especially for bookings made through Online Travel Agencies (OTAs). Consider releasing booking commissions only after successful completion of the booking to discourage cancellations.

2. Deposit Type: Investigate the reasons behind cancellations for non-refundable deposits and take necessary precautions. Provide clear information about the refund policies for different deposit types to set proper expectations.

3. Previously Cancelled Bookings: Develop policies or incentives to discourage cancellations for customers who have a history of cancelling bookings. This could include offering special discounts or benefits for those who commit to their bookings.

4. Off-Peak Months: Implement additional marketing and business strategies to boost business during the months of November, December, January, and February. Consider offering seasonal promotions, targeted advertising, or partnerships with local events or attractions.

5. ADR for Resort Hotel: Take actions to improve the Average Daily Rate (ADR) for the resort hotel. This could involve enhancing facilities, services, or amenities to provide a higher value proposition for guests.

6. Bulk/Group Bookings: Develop plans to attract and increase bulk/group bookings, as they tend to have better ADR. Consider offering special group rates, customized packages, or additional incentives to attract larger bookings.

By implementing these strategies, the business can work towards reducing cancellations, increasing ADR, and maximizing revenue during both peak and off-peak seasons.

# **Conclusion**

In conclusion, based on the exploratory data analysis (EDA), several insights and recommendations have been identified to improve the business:

Hotel Type: The city hotel type is busier compared to the resort hotel. The resort hotel should invest in improving facilities and services to attract more bookings and increase the number of repeated guests.

Market Segment: The majority of bookings (59.88%) are from the Online Travel Agency (OTA) market segment. To diversify and reduce dependence on a single segment, the hotel should explore strategies to attract bookings from corporate clients and groups.

Seasonality: August is the busiest month for bookings, and the period from May to October shows higher business activity. To maximize revenue, the hotel should focus on boosting business during the remaining months by implementing targeted marketing campaigns and offering special promotions.

Average Waiting Time: The city hotel has an average waiting time of 1 day, while the resort hotel has a shorter waiting time of 0.3 days. The city hotel should work on reducing the waiting time to enhance guest satisfaction and potentially increase bookings.

Lead Time and Cancellations: There is a slight positive correlation between lead time and cancellations. The hotel should analyze why bookings with longer lead times are cancelled more often and consider measures to minimize those cancellations, such as offering incentives for non-refundable bookings or providing flexible booking options.

Occupancy and Number of People: Bookings with a higher average daily rate (ADR) are associated with a total people count of 4-5. The hotel should focus on attracting family/group bookings and provide suitable facilities and services to cater to their needs.

By implementing these recommendations, the hotel can improve its occupancy rates, increase revenue, and enhance overall guest satisfaction. Regular monitoring and analysis of data trends will also help in adapting strategies to meet changing market demands and business objectives.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***