<a href="https://colab.research.google.com/github/mithun-mith/ExploringHotelBookingData/blob/Hotel_Booking_Data_1/Mithun_Waghmare_of_EDA_project_Hotel_Booking_Submission_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA / project / Exploring Hotel Booking Data
##### **Contribution**    - Individual
##### **Team Member 1**  - Mithun Waghmare


# **Project Summary -**

The Hotel Booking EDA project aims to analyze and explore a dataset containing hotel booking information in order to gain valuable insights and make informed decisions. The project focuses on conducting exploratory data analysis to understand the patterns, trends, and relationships within the dataset, with the ultimate goal of extracting meaningful information to optimize the hotel booking process.
The dataset used in this project consists of various attributes related to hotel bookings, including customer demographics, booking details, hotel information, and other relevant variables. The primary objectives of the project are as follows:
* 		Data Cleaning and Preprocessing: The initial step involves cleaning the dataset to handle missing values, removing duplicates, and correcting any inconsistencies. The preprocessing stage also includes transforming the data into a suitable format for analysis, ensuring data integrity, and preparing it for further exploration.
* 		Descriptive Statistics: The project delves into the descriptive analysis of the dataset to uncover key statistical measures such as mean, median, mode, standard deviation, and range for numerical variables. This analysis provides a comprehensive understanding of the dataset's central tendencies and dispersions, enabling better decision-making.
* 		Data Visualization: Utilizing various visualization techniques, such as bar charts, histograms, scatter plots, and heatmaps, the project visualizes the data to reveal patterns, trends, and correlations. Visual representations make it easier to grasp complex relationships within the dataset and present the findings in a clear and concise manner.
* 		Customer Segmentation: By analyzing customer demographics and booking patterns, the project aims to identify distinct customer segments. This segmentation enables personalized marketing strategies, tailored promotions, and improved customer satisfaction.
* 		Booking Patterns and Trends: The project investigates booking trends over time, such as seasonal variations, peak booking periods, and customer preferences. Understanding these patterns helps hotels optimize their operations, allocate resources effectively, and offer competitive pricing.
* 		Cancellation Analysis: Analyzing booking cancellations is crucial in understanding customer behavior and identifying potential areas for improvement. The project explores cancellation rates, reasons for cancellations, and their impact on revenue.
* 		Feature Importance: Through feature importance analysis, the project determines which attributes have the most significant impact on booking decisions. This knowledge aids in prioritizing resources and optimizing marketing strategies.
* 		Recommendations and Insights: The final step involves extracting actionable insights and recommendations from the analysis. These insights can be used by hotel management to improve the booking process, enhance customer experiences, and increase overall profitability.
By conducting a thorough exploratory data analysis of the hotel booking dataset, this project provides valuable insights into customer behavior, booking patterns, and operational optimizations. The outcomes of this analysis can guide decision-making processes and assist hotels in improving their service offerings to meet customer expectations and increase their competitive advantage.




# **GitHub Link -**

GitHub Link - https://github.com/mithun-mith/ExploringHotelBookingData/tree/Hotel_Booking_Data_1

# **Problem Statement**



The hotel industry relies heavily on effective booking management to optimize revenue, occupancy rates, and overall customer satisfaction. However, without a deep understanding of the underlying data and factors influencing booking decisions, hotels may struggle to make informed decisions and improve their booking processes.
The problem at hand is the lack of comprehensive insights into hotel booking data, hindering the ability of hotel management to identify trends, patterns, and customer preferences. Without this understanding, hotels face challenges in optimizing their operations, allocating resources efficiently, and developing targeted marketing strategies.
Therefore, there is a need for an exploratory data analysis (EDA) project focused on hotel booking data. This project aims to analyze the available dataset, extract meaningful insights, and provide actionable recommendations to improve the hotel booking process. By conducting a thorough analysis, the project seeks to address the following key questions:

1. What are the key factors influencing hotel booking decisions?
2. Are there any discernible patterns or trends in booking behavior over time?
3. How do customer demographics and preferences impact booking patterns?
4. What are the most common reasons for booking cancellations, and how can they be minimized?
5. How can hotels effectively segment their customer base for personalized marketing strategies?
6. Which attributes have the most significant impact on booking outcomes?
7. How can hotels optimize pricing strategies based on demand patterns and customer behavior?
8. What actionable insights and recommendations can be derived from the analysis to enhance the hotel booking process?

#### **Define Your Business Objective?**

The primary business objective of the Hotel Booking EDA project is to leverage data analysis and exploration to optimize the hotel booking process and enhance overall business performance. The project aims to provide actionable insights and recommendations to address key challenges faced by hotels in managing their bookings and improving customer satisfaction. The specific business objectives include:

1. Improve Revenue Optimization: By analyzing booking patterns, customer preferences, and market trends, the project aims to identify opportunities for revenue optimization. This includes understanding peak booking periods, pricing strategies, and upselling opportunities to maximize revenue generation.
2. Enhance Customer Satisfaction: Through customer segmentation and analysis of booking behavior, the project aims to gain insights into customer preferences, allowing hotels to provide personalized experiences and tailored services. This will result in higher customer satisfaction and loyalty.
3. Reduce Booking Cancellations: Understanding the reasons behind booking cancellations is crucial for minimizing revenue loss. The project aims to identify patterns and factors contributing to cancellations, enabling hotels to implement measures to reduce cancellations and mitigate their impact.
4. Optimize Resource Allocation: By analyzing booking trends and demand patterns, the project seeks to assist hotels in optimizing their resource allocation. This includes identifying periods of high demand, optimizing staff scheduling, and efficiently allocating resources to maximize operational efficiency.
5. Inform Marketing Strategies: Through customer segmentation and analysis of booking patterns, the project aims to provide insights that can inform targeted marketing strategies. This includes identifying customer segments with specific preferences and tailoring promotional activities to effectively reach and attract the desired target audience.
6. Gain Competitive Advantage: By leveraging data analysis and extracting valuable insights, the project aims to provide hotels with a competitive edge in the market. Understanding market trends, customer behavior, and optimizing operations will allow hotels to differentiate themselves and attract more bookings.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Import statement to mount Google Drive in Google Colab.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
filepath = "/content/drive/MyDrive/Hotel Bookings.csv"
hotel_df = pd.read_csv(filepath)
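
The guidelines above ask for exception handling, so the load step could be wrapped as in the sketch below. This is only an illustration: the helper name `load_bookings` and the non-existent demo path are assumptions, not part of the original notebook.

```python
import pandas as pd


def load_bookings(path):
    """Return the bookings DataFrame, or None if the file cannot be found."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        # A missing file usually means Drive is not mounted or the path is wrong.
        print(f"File not found: {path}. Check that Google Drive is mounted.")
        return None


# Hypothetical path, used only to demonstrate the failure branch.
missing = load_bookings("/nonexistent/Hotel Bookings.csv")
```

In the notebook itself the call would be `hotel_df = load_bookings(filepath)`, followed by a check that the result is not `None` before continuing.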

### Dataset First View

In [None]:
# Dataset First Look rows and columns
hotel_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count total 119390 rows and 32 columns
hotel_df.shape


The output (119390, 32) indicates that hotel_df has 119,390 rows and 32 columns.

In [None]:
# all columns Name
hotel_df.columns


hotel_df.columns lists the names of all 32 columns in the DataFrame.

In [None]:
hotel_df.nunique()


Here are the unique value counts for each column.
These unique value counts provide insights into the diversity and variability of values present in each column of the DataFrame.

### Dataset Information

In [None]:
# Dataset Info
hotel_df.info()


The info() function summarizes the DataFrame hotel_df: the column names, the data type of each column, and the number of non-null values per column.

# Observation
The children column is stored as float in this dataset, so it is a candidate for conversion to an integer type.
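
The notebook does not show this conversion, so here is a minimal sketch of two ways it could be done. It uses a small made-up frame rather than hotel_df, and the column names `children_nullable` and `children_int` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real 'children' column (values are made up).
toy = pd.DataFrame({"children": [0.0, 2.0, np.nan, 1.0]})

# Option 1: pandas' nullable integer dtype keeps missing entries as <NA>.
toy["children_nullable"] = toy["children"].astype("Int64")

# Option 2: treat a missing count as zero, then cast to a plain integer.
toy["children_int"] = toy["children"].fillna(0).astype(int)

print(toy.dtypes)
```

Option 2 assumes that a missing value means "no children", which is a modelling choice rather than something the data guarantees.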

#### Duplicate Values

In [None]:
# Find duplicate rows based on all columns
duplicate_rows = hotel_df[hotel_df.duplicated()]


In [None]:
duplicate_rows

duplicated() flags every row that exactly repeats an earlier row across all columns. The resulting duplicate_rows DataFrame has 31,994 rows, so the dataset contains 31,994 duplicate rows.
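
As a quick sanity check on how the duplicate count relates to drop_duplicates(), the sketch below uses a four-row toy frame (made-up values): the number of rows flagged by duplicated() equals the number of rows that drop_duplicates() removes.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "adults": [2, 2, 1, 2],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is kept.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() removes exactly those flagged rows.
deduped = toy.drop_duplicates()

print(n_dupes, len(toy) - len(deduped))
```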

#### Missing Values/Null Values

In [None]:
# Check for missing values using isnull()
hotel_df.isnull().sum()

isnull() marks each missing value in the DataFrame as True, and sum() then counts the True values per column, giving the number of missing values in each column.
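
Raw counts are easier to judge next to percentages. The sketch below builds such a summary on made-up data; the frame `toy` and the column labels `missing` and `percent` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of three columns (values are made up).
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [9.0, np.nan, np.nan, 14.0],
    "adults": [2, 1, 2, 2],
})

# Count of missing values and their share of all rows, per column.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```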

### What did you know about your dataset?

1. The dataset contains 119,390 rows and 32 columns.

2. Missing/Null Values: There are missing or null values present in the dataset, particularly in the columns 'country', 'agent', and 'company'. These columns may have some entries that are empty or not filled in.

3. Duplicate Rows: The dataset contains 31,994 duplicate rows, collected in the "duplicate_rows" DataFrame.

4. Unique Value Counts: The number of unique values in each column shows how varied that column is. These counts help distinguish categorical columns with a few distinct values from columns with many distinct values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_df.columns




In [None]:
# Dataset Describe
hotel_df.describe()

The describe() function calculates summary statistics for each numeric column: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.
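
By default describe() covers only numeric columns. A minimal sketch of getting a comparable summary for the text columns is shown below, on a made-up frame; selecting the non-numeric columns first is one possible approach among several.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (values are made up).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [100.0, 80.0, 60.0],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# Non-numeric summary: count, number of unique values, most frequent value, its frequency.
categorical_summary = toy.select_dtypes(exclude="number").describe()

print(categorical_summary)
```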

### Variables Description

1. hotel: The type of hotel (Resort Hotel or City Hotel).
2. is_canceled: Whether the booking was canceled (0 = not canceled, 1 = canceled).
3. lead_time: The number of days between the booking date and the arrival date.
4. arrival_date_year: The year of the arrival date.
5. arrival_date_month: The month of the arrival date.
6. arrival_date_week_number: The week number of the arrival date.
7. arrival_date_day_of_month: The day of the month of the arrival date.
8. stays_in_weekend_nights: The number of weekend nights (Saturday or Sunday) the guest stayed.
9. stays_in_week_nights: The number of weekday nights (Monday to Friday) the guest stayed.
10. adults: The number of adults included in the booking.
11. children: The number of children included in the booking.
12. babies: The number of babies included in the booking.
13. meal: The type of meal booked (e.g., Undefined/SC = no meal package, BB = Bed & Breakfast).
14. country: The country of origin of the guest.
15. market_segment: The market segment designation (e.g., Online Travel Agents, Offline TA/TO).
16. distribution_channel: The booking distribution channel (e.g., Direct, Corporate).
17. is_repeated_guest: Whether the booking was made by a repeated guest (0 = not repeated, 1 = repeated).
18. previous_cancellations: The number of previous bookings canceled by the customer.
19. previous_bookings_not_canceled: The number of previous bookings not canceled by the customer.
20. reserved_room_type: The code for the type of room reserved.
21. assigned_room_type: The code for the type of room assigned at check-in.
22. booking_changes: The number of changes made to the booking.
23. deposit_type: The type of deposit made for the reservation.
24. agent: The ID of the travel agency that made the booking.
25. company: The ID of the company or entity that made the booking or is responsible for payment.
26. days_in_waiting_list: The number of days the booking was on the waiting list.
27. customer_type: The type of booking (e.g., Contract, Group, Transient).
28. adr: The average daily rate (average price per night) of the booking.
29. required_car_parking_spaces: The number of car parking spaces requested by the guest.
30. total_of_special_requests: The total number of special requests made by the guest.
31. reservation_status: The status of the reservation (e.g., Canceled, Check-Out).
32. reservation_status_date: The date when the reservation status was last updated.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
hotel_df.nunique()


Here are the unique value counts for each column.
These unique value counts provide insights into the diversity and variability of values present in each column of the DataFrame.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# The drop_duplicates() function used to remove duplicate rows from a DataFrame. It returns a new DataFrame with duplicate rows removed.

hotel_df = hotel_df.drop_duplicates()

In [None]:
hotel_df

In [None]:
hotel_df.shape

In [None]:
# Summarize the DataFrame to see whether any NaN or missing values are present.
hotel_df.info()

In [None]:
# Count the null values in each column of the DataFrame
hotel_df.isnull().sum()

isnull().sum() returns the number of missing values in each column:
1. country has 452 NaN values
2. agent has 12193 NaN values
3. company has 82137 NaN values

In [None]:
# This code will replace all the null values in the DataFrame hotel_df with the string '0'
hotel_df.fillna('0',inplace=True)


In [None]:
# Recheck the null counts: the NaN values have been replaced with '0'.

hotel_df.isnull().sum()

After this cell runs, every missing value in the DataFrame has been replaced with the string '0'.

To summarize the cleaning steps used above:

1. info() gives a summary of the dataset: column names, data types, and non-null counts.
2. isnull().sum() returns a Series with the number of missing values in each column.
3. fillna('0') replaces every missing value in the dataset with the string '0'.
4. drop_duplicates() returns a new DataFrame with the duplicate rows removed.
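
Filling every gap with the string '0' typically turns numeric columns that had gaps into text-typed columns. One alternative, sketched below on made-up data, is to pass fillna() a per-column mapping so that each column keeps a sensible type. The fill values chosen here (0 for IDs and counts, "Unknown" for country) are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the columns that had gaps (values are made up).
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0],
    "agent": [9.0, np.nan, 240.0],
    "country": ["PRT", None, "GBR"],
})

# Hypothetical per-column fill values: numeric 0 for numeric columns,
# a readable label for the text column.
fill_values = {"children": 0, "agent": 0, "country": "Unknown"}
filled = toy.fillna(value=fill_values)

print(filled.dtypes)
```

With this approach `agent` and `children` stay numeric, and the placeholder in `country` is self-explanatory when it shows up in a chart.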




In [None]:
hotel_df

In [None]:
hotel_df.shape

# Observations
Before removing duplicates and data cleaning, the DataFrame had 119,390 rows and 32 columns. After dropping the duplicates, the DataFrame now contains 87,396 rows and still maintains the same 32 columns. Therefore, a total of 31,994 duplicate rows were identified and removed from the DataFrame.

In [None]:
hotel_df.nunique()

The nunique() function is helpful for understanding the uniqueness of values in your dataset, identifying categorical variables, or detecting columns with a low number of distinct values. You can use this information to guide further data exploration, preprocessing, or analysis tasks.



The cells below call value_counts() on individual columns. The result lists each unique value in the column as the index and the number of times it occurs as the value, which is especially helpful for understanding the distribution of categorical variables. Let's check the value counts for the most important columns.

In [None]:
hotel_df['hotel'].value_counts()

In [None]:
hotel_df['is_canceled'].value_counts()

In [None]:
hotel_df['arrival_date_year'].value_counts()

In [None]:
hotel_df['adults'].value_counts()

In [None]:
hotel_df['children'].value_counts()

In [None]:
hotel_df['babies'].value_counts()

In [None]:
hotel_df['meal'].value_counts()

In [None]:
# In the meal column, "Undefined" means the same as "SC" (no meal package), so merge the two labels.
# The replacement is restricted to the meal column so the same label in any other column is left untouched.
hotel_df['meal'] = hotel_df['meal'].replace('Undefined', 'SC')

# Observation
The meal column contains the value "Undefined", which means the same as "SC" (no meal package). Instead of keeping two labels for the same thing, "Undefined" is replaced with "SC". After merging, there are four meal types, as shown below, and the total count for "SC" is 9973.
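
The difference between replacing a label in one column and replacing it across the whole DataFrame can be seen in a small sketch. The toy frame below is made up; whether other columns of the real dataset contain the same label should be checked with value_counts() rather than assumed.

```python
import pandas as pd

# Toy frame in which the same placeholder label appears in two columns.
toy = pd.DataFrame({
    "meal": ["BB", "Undefined", "SC"],
    "market_segment": ["Online TA", "Undefined", "Direct"],
})

# Column-scoped replacement: only 'meal' is changed.
toy["meal"] = toy["meal"].replace("Undefined", "SC")

print(toy)
```

A DataFrame-wide `toy.replace("Undefined", "SC")` would also have rewritten the `market_segment` entry, which is usually not intended.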

In [None]:
hotel_df['meal'].unique()

In [None]:
# Value counts for meal after merging "Undefined" into "SC"
hotel_df['meal'].value_counts()

In [None]:
hotel_df['market_segment'].value_counts()

In [None]:
hotel_df['distribution_channel'].value_counts()

In [None]:
hotel_df['is_repeated_guest'].value_counts()

In [None]:
hotel_df['deposit_type'].value_counts()

In [None]:
hotel_df['customer_type'].value_counts()

In [None]:
hotel_df['required_car_parking_spaces'].value_counts()

In [None]:
hotel_df['reservation_status'].value_counts()

In [None]:
hotel_df['total_of_special_requests'].value_counts()

In [None]:

# Calculate the total number of weekend nights across all bookings
total_weekend_nights = hotel_df['stays_in_weekend_nights'].sum()

# Calculate the total number of weekday nights across all bookings
total_week_nights = hotel_df['stays_in_week_nights'].sum()

print(f"Total weekend nights booked: {total_weekend_nights}")
print(f"Total weekday nights booked: {total_week_nights}")


In [None]:
# Calculate the total cancellations for each hotel type
cancellations_by_hotel = hotel_df[hotel_df['is_canceled']==1].groupby('hotel').size()

print(f"Total cancellations by hotel type:{cancellations_by_hotel}")



In [None]:
# Calculate the total number of bookings
total_bookings = len(hotel_df)

# Calculate the total number of canceled bookings
total_canceled_bookings = len(hotel_df[hotel_df['is_canceled'] == 1])

# Calculate the cancellation rate
cancellation_rate = total_canceled_bookings / total_bookings

print(f"Overall cancellation rate:{cancellation_rate}")


In [None]:
# Calculate the ADR per hotel
adr_per_hotel = hotel_df.groupby('hotel')['adr'].mean()

print(f"ADR per hotel: {adr_per_hotel}")



In [None]:
# the average number of adults per booking
average_adults = hotel_df['adults'].mean()

# the average number of babies per booking
average_babies = hotel_df['babies'].mean()

print(f"Average number of adults per booking: {average_adults}")
print(f"Average number of babies per booking: {average_babies}")


In [None]:
# which room types are booked most
num_reserved_room_types = hotel_df['reserved_room_type'].value_counts()
num_reserved_room_types

In [None]:
# Calculate the ADR per room
adr_per_room = hotel_df.groupby('reserved_room_type')['adr'].mean()

print(f"ADR per room:{adr_per_room}")


In [None]:
# Calculate the average number of special requests per booking
average_special_requests = hotel_df['total_of_special_requests'].mean()

print(f"Average number of special requests per booking:{average_special_requests}")


In [None]:
# What are the different deposit types used for bookings.
deposit_types = hotel_df['deposit_type'].unique()

print(f"Different deposit types used for bookings:{deposit_types}")

In [None]:
# average daily rate (ADR) for each customer type
adr_by_customer_type = hotel_df.groupby('customer_type')['adr'].mean()

print(f"Average Daily Rate (ADR) by Customer Type:{adr_by_customer_type}")


### What all manipulations have you done and insights you found?

1. Handling Null Values: Several columns contained null values, such as 'children', 'country', 'agent', and 'company'. These were filled with the string '0'. The appropriate way to handle missing values depends on the nature of the data and the context of the problem, so this choice should be kept in mind when reading later results.
2. Dropping Duplicates: Duplicate rows were dropped, reducing the dataset from 119,390 to 87,396 rows. This removes redundant records, although it is worth investigating why the duplicates exist before deciding that dropping them is the right approach for every analysis.
3. Value Counts: value_counts() was applied to several columns, such as 'hotel', 'required_car_parking_spaces', 'market_segment', 'distribution_channel', 'deposit_type', 'customer_type', and 'reservation_status'. This shows the distribution and frequency of the values within each column, for example the most common hotel type, the number of car parking spaces required, and the prevalence of each market segment, distribution channel, deposit type, customer type, and reservation status.
4. Unique Value Counts: nunique() was used to count the distinct values in each column, for example the number of unique countries, room types, agents, and companies.
5. Together, these manipulations give a first picture of the dataset: how values are distributed, which patterns stand out, and where further analysis is worthwhile.








## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1. distribution of hotel types in the dataset.

In [None]:
# Chart - 1 visualization code
# countplot
sns.countplot(data= hotel_df,x= 'hotel')
plt.show()

##### 1. Why did you pick the specific chart?
A countplot on the 'hotel' column shows the distribution of hotel types in the dataset and makes it easy to compare the number of bookings for each type, i.e. Resort Hotel and City Hotel. Here City Hotel occurs more often than Resort Hotel.

##### 2. What is/are the insight(s) found from the chart?

City Hotel Occurrence: Based on the countplot, it appears that the City Hotel has a higher occurrence compared to the Resort Hotel. This suggests that the dataset contains a larger number of records or instances associated with the City Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The countplot reveals the distribution and frequency of hotel types, indicating the popularity and demand for each type. This information can be used to understand customer preferences and tailor marketing strategies accordingly. It allows businesses to allocate resources effectively and provide services that align with customer needs.

#### Chart - 2  the majority of guests fall into the category of "no repeated guest"

In [None]:
# Chart - 2 visualization code
sns.countplot(data = hotel_df, x = 'is_canceled',hue = 'is_repeated_guest')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot reveals a distinct pattern where the majority of guests fall into the category of "no repeated guest" (the value 0), while a significantly smaller number of guests are classified as "is repeated guest" (the value 1). Additionally, a considerable portion of the bookings has been canceled.

##### 2. What is/are the insight(s) found from the chart?

Comparison of Occurrences: It presents the information in a straightforward manner, making it easy to identify any disparities in the frequencies of different categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Marketing and Communication: Differentiating between repeated and non-repeated guests in terms of their cancellation behavior allows the business to make marketing and communication strategies accordingly.
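
Because the two guest groups differ greatly in size, raw counts can hide differences in cancellation behavior. The sketch below, on made-up data, shows how pd.crosstab can give both the counts that the countplot draws and the within-group shares.

```python
import pandas as pd

# Toy data: four non-repeated guests and four repeated guests (values are made up).
toy = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_canceled":       [1, 1, 0, 0, 0, 0, 0, 1],
})

# Raw counts, the same numbers a countplot with hue would display.
counts = pd.crosstab(toy["is_repeated_guest"], toy["is_canceled"])

# Row-normalized shares: the cancellation rate within each guest group.
shares = pd.crosstab(toy["is_repeated_guest"], toy["is_canceled"], normalize="index")

print(counts)
print(shares)
```

Applied to hotel_df, the column for `is_canceled == 1` in `shares` would give the cancellation rate for repeated and non-repeated guests directly.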

#### Chart - 3. the distribution of canceled and non-canceled bookings across different hotel types

In [None]:
# Chart - 3 visualization code
sns.countplot(data = hotel_df, x= 'hotel', hue= 'is_canceled')
plt.show()

##### 1. Why did you pick the specific chart?

These counts provide insights into the distribution of canceled and non-canceled bookings across different hotel types. From the chart, we can observe that the count of non-canceled bookings is higher for both 'city hotel' and 'resort hotel' compared to the count of canceled bookings. However, we can also observe that the count of canceled bookings is higher for the 'city hotel' category, compared to the 'resort hotel' category.

##### 2. What is/are the insight(s) found from the chart?

The countplot enables us to observe if there are any variations in cancellation patterns across different types of hotels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business can focus on implementing targeted strategies to reduce cancellations for those specific hotel types. This may include offering personalized incentives, improving the booking experience, or providing additional value-added services to encourage guests to commit to their bookings. These targeted efforts can help increase customer satisfaction, loyalty, and ultimately lead to higher bookings and revenue.

#### Chart - 4. Top 20, Bookings per Country

In [None]:
# Chart - 4 visualization code
# Group the data by 'country' and count the number of occurrences
country_counts = hotel_df['country'].value_counts().head(20)

# Plotting the bar chart
sns.barplot(x=country_counts.index, y=country_counts.values)
plt.xlabel('Country')
plt.ylabel('Number of Bookings')
plt.title(' Top 20 Number of Bookings per Country')
plt.xticks(rotation=90)  # Rotating the x-axis labels for better visibility
plt.show()



##### 1. Why did you pick the specific chart?

The bar chart shows the number of bookings for each country, so the country with the highest number of bookings can be identified from the height of the bars. Comparing bar heights also shows the relative number of bookings for the other countries.

##### 2. What is/are the insight(s) found from the chart?

Top 20 Countries: The chart specifically focuses on the top 20 countries with the highest number of bookings. By highlighting the top countries, the chart provides insights into the countries that contribute the most to the overall booking volume. This information can help prioritize marketing efforts, allocate resources, and target specific regions for business expansion.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Marketing: By understanding the top 20 countries with high booking volume, the business can focus its marketing efforts and allocate resources effectively. Targeted marketing campaigns tailored to those countries can lead to increased brand awareness, customer engagement, and bookings.
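
To judge how much of the total volume the top countries account for, the counts can be turned into shares. The sketch below uses a made-up Series and a small, illustrative `top_n`; on the real data the same calls would be applied to `hotel_df['country']`.

```python
import pandas as pd

# Toy country column (values are made up).
countries = pd.Series(
    ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "ESP", "DEU"], name="country"
)

top_n = 2  # illustrative; the chart above uses 20

# Absolute counts and share of all bookings for the top countries.
top_counts = countries.value_counts().head(top_n)
top_share = countries.value_counts(normalize=True).head(top_n)

# Fraction of all bookings covered by the top countries together.
coverage = top_share.sum()
print(top_share, coverage)
```

Note that rows whose missing country was filled with '0' earlier in the notebook would show up as their own category in these counts.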

#### Chart - 5. How does the lead time vary across different hotel types?

In [None]:
# Chart - 5 visualization code
sns.boxplot(data=hotel_df, x='hotel', y='lead_time')
plt.xlabel('Hotel Type')
plt.ylabel('Lead Time')
plt.title('Lead Time Variation Across Hotel Types')
plt.show()



##### 1. Why did you pick the specific chart?

The boxplot chart illustrates the variation in lead time (the duration between booking and arrival) across different hotel types.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart can guide setting appropriate guest expectations. By understanding the typical (median) lead time for different hotel types, which the boxplot shows as the line inside each box, the business can communicate realistic expectations to guests, especially regarding the time required for booking confirmation and preparation before arrival.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By understanding the typical lead time for different hotel types, the business can communicate realistic expectations to guests, especially regarding the time required for booking confirmation and preparation before arrival.
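A boxplot shows the median rather than the mean, and the two can differ noticeably when a few bookings are made very far in advance. A minimal sketch on made-up numbers (`toy_df` is an assumption standing in for `hotel_df`) shows how both can be read per hotel type:

```python
import pandas as pd

# Toy stand-in for hotel_df; the lead times are made up for illustration only
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'lead_time': [10, 40, 90, 300, 0, 5, 60, 120],
})

# Median (the line inside each box) next to the mean, per hotel type
lead_time_stats = toy_df.groupby('hotel')['lead_time'].agg(['median', 'mean'])
print(lead_time_stats)
```

In this toy data one long lead time pulls the mean well above the median, which is why the median is usually the safer figure to quote for skewed variables such as lead time.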

#### Chart - 6 What is the cancellation rate for each hotel type?

In [None]:
# Chart - 6 visualization code

# Calculate the cancellation rate for each hotel type
cancellation_rates = hotel_df.groupby('hotel')['is_canceled'].mean()

# Plotting the bar chart (one bar per hotel type)
sns.barplot(x=cancellation_rates.index, y=cancellation_rates.values)
plt.xlabel('Hotel Type')
plt.ylabel('Cancellation Rate')
plt.title('Cancellation Rate by Hotel Type')
plt.show()


##### 1. Why did you pick the specific chart?

The bar chart shows the cancellation rate for each hotel type. The city hotel has a higher cancellation rate than the resort hotel, and we can compare the heights of the bars to see the relative cancellation rates for the two hotel types.

##### 2. What is/are the insight(s) found from the chart?

The cancellation rates can serve as a performance indicator for each hotel type. Higher cancellation rates might indicate potential issues in terms of customer satisfaction, pricing, service quality, or booking policies. Identifying hotel types with higher cancellation rates can trigger further analysis and targeted improvements to address the underlying causes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the cancellation rates can guide strategic decision-making processes. The business can allocate resources, prioritize improvements, and develop targeted marketing campaigns based on the cancellation behavior observed across different hotel types.
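A rate on its own hides how many bookings it is based on. As a hedged sketch, the rate can be reported together with the booking and cancellation counts; the frame below is made up for illustration and only mirrors the `hotel` and `is_canceled` columns used above:

```python
import pandas as pd

# Toy stand-in for hotel_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 5 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Since is_canceled is 0/1, its mean is the cancellation rate and its sum the number of cancellations
cancellation_summary = toy_df.groupby('hotel')['is_canceled'].agg(
    bookings='count',
    cancellations='sum',
    rate='mean',
)
print(cancellation_summary)
```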

#### Chart - 7  Which month has the highest number of bookings? (line plot)


In [None]:
# Chart - 7 visualization code
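# Group the data by 'arrival_date_month' and count the number of occurrences.
# Reindex in calendar order, since sort_index() would sort the month names alphabetically
# (assumes the column holds full English month names such as 'January')
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_bookings = hotel_df['arrival_date_month'].value_counts().reindex(month_order)
# Plotting the line plot; sort=False keeps the calendar order instead of re-sorting the x values
sns.lineplot(x=monthly_bookings.index, y=monthly_bookings.values, sort=False)
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.title('Number of Bookings by Month')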
plt.xticks(rotation = 90)
plt.show()

##### 1. Why did you pick the specific chart?

A line plot shows the trend in the number of bookings across months. By examining the line, we can identify the months with the highest number of bookings from its peaks.

##### 2. What is/are the insight(s) found from the chart?

The line plot effectively displays the trend of booking counts over the months. It allows for a visual examination of the variations and patterns in booking volumes over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the line plot enables the business to plan and allocate resources accordingly. If certain months consistently show higher booking counts, the business can focus marketing efforts, promotional activities, and operational planning during those months to maximize revenue and customer satisfaction.
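Peak months can also be picked out programmatically instead of by eye. The sketch below uses a small made-up frame (`toy_df` is an assumption, not the real data) and assumes the month column holds full English month names:

```python
import calendar

import pandas as pd

# Toy stand-in for hotel_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    'arrival_date_month': ['August', 'August', 'August', 'July', 'July', 'January', 'April']
})

# Calendar order of the month names (calendar.month_name[0] is an empty string)
month_order = list(calendar.month_name)[1:]

# Count bookings per month in calendar order; months without bookings get 0
monthly_bookings = (
    toy_df['arrival_date_month'].value_counts().reindex(month_order, fill_value=0)
)

# The two busiest months in the toy data
top_months = monthly_bookings.nlargest(2)
print(top_months)
```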

#### Chart - 8. Number of Bookings by Month by using barplot

In [None]:
# Chart - 8 visualization code

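# Group the data by 'arrival_date_month' and count the number of occurrences.
# Reindex in calendar order, since sort_index() would sort the month names alphabetically
# (assumes the column holds full English month names such as 'January')
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_bookings = hotel_df['arrival_date_month'].value_counts().reindex(month_order)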

# Plotting the bar plot
sns.barplot(x=monthly_bookings.index, y=monthly_bookings.values)
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.title('Number of Bookings by Month')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

This chart shows the same data as the line plot in Chart 7. By visualizing it as a bar plot, we can easily compare the number of bookings across different months and identify any variations or trends in the booking patterns throughout the year.

##### 2. What is/are the insight(s) found from the chart?

The bar plot effectively displays the trend of booking counts over the months. It allows for a visual examination of the variations and patterns in booking volumes over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Planning: Analyzing the bar plot enables the business to plan and allocate resources accordingly. If certain months consistently show higher booking counts, the business can focus marketing efforts, promotional activities, and operational planning during those months to maximize revenue and customer satisfaction.

#### Chart - 9 Distribution of stays in weekend nights, compared between city hotels and resort hotels

In [None]:
# Chart - 9 visualization code

sns.boxplot(data = hotel_df, x = 'hotel',y= 'stays_in_weekend_nights')
plt.xlabel('Hotel type')
plt.ylabel('Weekend night')
plt.title('Distribution of Stays in Weekend Nights by Hotel Type')
plt.show()

##### 1. Why did you pick the specific chart?

By comparing the boxplots for city hotels and resort hotels, we can visually analyze the differences in the distribution of stays in weekend nights between the two types of hotels. The distribution is almost the same for both types of hotels.

##### 2. What is/are the insight(s) found from the chart?

By comparing the heights and shapes of the boxes, as well as the whiskers, you can assess the overall distribution of stays in weekend nights between city hotels and resort hotels. If the boxes are similar in height and shape and the whiskers are similar in length, it suggests that the distributions are relatively similar.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the box plot can assist in resource allocation. If there are significant differences in stays in weekend nights, the business can allocate resources such as staff, inventory, and services accordingly to meet the demands and expectations of guests in each hotel type.
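The box heights and positions that are compared visually correspond to the quartiles of each group, so the comparison can be backed by numbers. A minimal sketch on made-up values (`toy_df` is an assumption standing in for `hotel_df`):

```python
import pandas as pd

# Toy stand-in for hotel_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 5 + ['Resort Hotel'] * 5,
    'stays_in_weekend_nights': [0, 0, 1, 2, 2, 0, 1, 2, 2, 4],
})

# Quartiles per hotel type: the bottom, middle line, and top of each box
box_stats = (
    toy_df.groupby('hotel')['stays_in_weekend_nights']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ['q1', 'median', 'q3']

# Interquartile range: the height of each box
box_stats['iqr'] = box_stats['q3'] - box_stats['q1']
print(box_stats)
```

Similar quartiles and interquartile ranges across the two rows would support the reading that the distributions are alike.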

#### Chart - 10. Distribution of Stays in Week Nights by Hotel Type

In [None]:
# Chart - 10 visualization code
sns.boxplot(data = hotel_df, x = 'hotel',y= 'stays_in_week_nights')
plt.xlabel('Hotel type')
plt.ylabel('Week nights')
plt.title('Distribution of Stays in Week Nights by Hotel Type')
plt.show()


##### 1. Why did you pick the specific chart?

Comparing the boxplots for city hotels and resort hotels shows that stays in week nights are distributed over higher values for the resort hotels than for the city hotel.

##### 2. What is/are the insight(s) found from the chart?

By comparing the heights and shapes of the boxes, as well as the whiskers, you can assess the overall distribution of stays in week nights between city hotels and resort hotels. If there are notable differences, it indicates distinct patterns in stays in week nights between the two hotel types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can optimize staffing, inventory management, and service availability to meet the specific demands of stays in week nights in each hotel type.

#### Chart - 11. What are the different market segments represented in the dataset, and how do they compare in terms of booking counts?

In [None]:
# Chart - 11 visualization code
# Calculate the booking counts for each market segment
market_segment_counts=hotel_df['market_segment'].value_counts()
sns.barplot(x=market_segment_counts.index,y=market_segment_counts.values)
plt.xlabel('Market segment')
plt.ylabel('Booking counts')
plt.title('Booking Counts by Market Segment')
plt.xticks(rotation = 90)
plt.show()


##### 1. Why did you pick the specific chart?

Each bar represents a specific market segment, and the height of the bar shows its booking count. Online TA has the highest booking count, followed by Offline TA/TO and then the other market segments.

##### 2. What is/are the insight(s) found from the chart?

The bar plot provides insights into the popularity of each market segment based on the booking counts. Higher bars indicate market segments with higher booking volumes, suggesting greater demand and popularity among guests. Lower bars indicate segments with comparatively lower booking volumes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying segments with lower booking counts presents potential growth opportunities. These segments may be relatively untapped or underrepresented in the current marketing efforts. Businesses can develop targeted strategies to attract more bookings from these segments and expand their customer base.

#### Chart - 12  Which distribution channels are most commonly used for hotel bookings?

In [None]:
# Chart - 12 visualization code

# Calculate the booking counts for each distribution channel

distribution_channel_counts = hotel_df['distribution_channel'].value_counts()
sns.barplot(x = distribution_channel_counts.index,y= distribution_channel_counts.values)
plt.xlabel('Distribution Channel')
plt.ylabel('Booking Counts')
plt.title('Booking Counts by Distribution Channel')
plt.xticks(rotation = 90)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart shows the booking counts for each distribution channel. The TA/TO bar shows that it is the most significant distribution channel, while the Direct and Corporate bars show correspondingly low booking counts.

##### 2. What is/are the insight(s) found from the chart?

By comparing the heights of the bars, you can visually assess the relative popularity and usage of different distribution channels. This insight helps in understanding which channels have a larger share of the market and attract a higher volume of bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying the most commonly used distribution channels opens up opportunities for partnerships and collaborations. Businesses can explore partnerships with popular channels to enhance their visibility, expand their customer base, and attract more bookings.

#### Chart - 13 What is the distribution of ADR values in the dataset

In [None]:
# Chart - 13 visualization code

# Plotting the histogram
sns.histplot(data=hotel_df, x='adr', kde=True)
plt.xlabel('ADR')
plt.ylabel('Frequency')
plt.title('Distribution of ADR Values')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram provides a clear and concise representation of the distribution of a numerical variable, such as ADR values.

##### 2. What is/are the insight(s) found from the chart?

Higher bars indicate a higher frequency of ADR values falling within that range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of ADR values allows businesses to optimize their pricing strategies. By analyzing the concentration of ADR values and the range of prices, businesses can adjust their pricing tiers, identify opportunities for upselling or bundling, and set competitive rates. This can lead to improved revenue management, increased profitability, and a positive impact on the business.
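The concentration and range of ADR values can also be summarized numerically, which helps when a few extreme rates stretch the histogram. The sketch below applies the common 1.5 × IQR rule to made-up values; `toy_adr` and the rule itself are assumptions for illustration, not something the analysis above used:

```python
import pandas as pd

# Toy ADR values; made up for illustration only
toy_adr = pd.Series([50.0, 75.0, 80.0, 95.0, 100.0, 110.0, 120.0, 5400.0], name='adr')

# Quartiles describe where most rates are concentrated
q1, q3 = toy_adr.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy_adr[(toy_adr < lower) | (toy_adr > upper)]

print(f"Middle 50% of rates: {q1:.2f} to {q3:.2f}")
print(outliers)
```

Flagged values are worth inspecting before drawing pricing conclusions, since a single extreme rate can dominate the x-axis of a histogram.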

#### Chart - 14 - Correlation Heatmap

In [None]:
# Display all columns to check the column names and decide which columns to use in the correlation heatmap
pd.set_option('display.max_columns', 32)
hotel_df.head(2)

In [None]:
# Correlation Heatmap visualization code

# Select the relevant columns for correlation analysis
correlation_cols = hotel_df[['is_canceled', 'lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights']]

# Calculate the correlation matrix
correlation_matrix = correlation_cols.corr()

# Plotting the correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True,cmap ='Blues')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

We picked the correlation heatmap because it is an effective visual tool for understanding the relationships between multiple variables in a dataset.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap allows for the simultaneous comparison of multiple variables. In this case, the heatmap shows the correlations between 'is_canceled', 'lead_time', 'stays_in_weekend_nights', and 'stays_in_week_nights'. By visualizing these correlations together, it becomes easier to identify any patterns or relationships between these variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Correlation analysis can assist in customer segmentation efforts. By identifying correlations between variables and customer behavior, businesses can create targeted marketing strategies and personalized experiences for different customer segments. For example, if 'stays_in_week_nights' has a positive correlation with 'lead_time', it indicates that guests who book longer stays in advance may have distinct preferences.
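Individual coefficients can be read from the correlation matrix directly rather than from the heatmap colors. A minimal sketch on made-up values (`toy_df` is an assumption standing in for the selected columns of `hotel_df`):

```python
import pandas as pd

# Toy stand-in for the selected hotel_df columns; values are made up for illustration only
toy_df = pd.DataFrame({
    'is_canceled': [0, 0, 0, 1, 1, 1],
    'lead_time': [5, 10, 20, 200, 250, 300],
    'stays_in_week_nights': [1, 2, 2, 3, 5, 4],
})

# Pearson correlation matrix (the pandas default)
correlation_matrix = toy_df.corr()

# Correlation of every other variable with is_canceled, strongest first
corr_with_cancel = (
    correlation_matrix['is_canceled']
    .drop('is_canceled')
    .sort_values(ascending=False)
)
print(corr_with_cancel)
```

Because `is_canceled` is a 0/1 variable, its Pearson coefficient with a numeric column measures how strongly that column's mean differs between canceled and non-canceled bookings; it does not establish causation.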

#### Chart - 15 - Pair Plot How does 'lead_time' correlate with 'is_canceled' and 'arrival_date_year'? Does a longer lead time contribute to a higher cancellation rate or vary across different years?

In [None]:
# Pair Plot visualization code
# Select the columns for the pair plot
columns = ['lead_time','is_canceled','arrival_date_year']
sns.pairplot(data = hotel_df[columns])
plt.show()

##### 1. Why did you pick the specific chart?

We picked the pair plot because it allows for the visualization of pairwise relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

'lead_time' and 'is_canceled': By examining the scatter plots, we can assess the relationship between 'lead_time' and 'is_canceled'. If there is no clear trend or only a weak correlation, it indicates that lead time may not significantly impact the cancellation rate.
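Scatter plots against a 0/1 variable such as 'is_canceled' are hard to read, so one alternative is to bucket the lead time and compare the cancellation rate per bucket. The sketch below is an assumption-laden illustration: `toy_df`, the bucket edges, and the labels are all made up and would need to be adapted to the real data.

```python
import pandas as pd

# Toy stand-in for hotel_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    'lead_time': [3, 20, 45, 100, 150, 200, 400, 500],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical lead-time buckets in days; the edges are an arbitrary choice
bins = [-1, 30, 90, 365, float('inf')]
labels = ['0-30', '31-90', '91-365', '365+']
toy_df['lead_time_bucket'] = pd.cut(toy_df['lead_time'], bins=bins, labels=labels)

# Cancellation rate per bucket (mean of the 0/1 column)
rate_by_bucket = toy_df.groupby('lead_time_bucket', observed=True)['is_canceled'].mean()
print(rate_by_bucket)
```

A rate that rises steadily from the shortest to the longest bucket would point to a relationship between lead time and cancellations, while roughly equal rates would support the "no clear trend" reading.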

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
1. Since the city hotel has higher occurrences and cancellations compared to the resort hotel, the client can focus on implementing revenue management strategies in the city hotel to maximize revenue.
2. Given that the majority of guests who canceled bookings are categorized as "no repeated guests," the client should focus on improving customer loyalty and retention.
3. Among the market segments represented in the dataset, online travel agencies (OTA) are the most common booking channel, so the client can collaborate closely with OTAs to optimize their presence and promotions. The client can also tailor their marketing strategies to attract and retain customers from those segments.
4. Since longer lead times contribute to higher cancellation rates, the client should focus on improving booking conversions for customers with longer lead times. Offering early bird discounts, flexible cancellation policies for early bookings, or exclusive benefits for guests who book well in advance can help encourage customers to commit to their reservations.
5. The client should pay attention to the seasonal booking patterns, with August and January being the highest booking months. They can plan their marketing and staffing strategies accordingly to accommodate the increased demand during these periods. Additionally, the significant bookings in April and May indicate a potential opportunity to promote special offers or packages during those months to attract more guests.
6. With the ADR (Average Daily Rate) values falling between 2000 - 2500, the client can assess their pricing strategy to optimize revenue. They can consider conducting a pricing analysis based on market demand.
7. Analyzing the reasons for cancellations and comparing the cancellation rates between the city hotel and the resort hotel can help identify potential areas for improvement. The client can focus on reducing cancellation rates by offering attractive non-refundable rates, providing clear and transparent communication about cancellation policies, and implementing customer feedback systems to address any potential issues.
8. Portugal (PRT), the country with the highest number of bookings, should be a key focus for the client. They can invest in targeted marketing campaigns and promotional offers to further capitalize on the demand from this market.
9. For countries like the United Kingdom (GBR) and France (FRA), which have similar booking numbers, the client should consider understanding the specific customer segments from these countries and design marketing strategies to attract and retain them.



# **Conclusion**

1. Revenue Optimization: Focus on revenue management strategies in the city hotel, considering its higher occurrences and cancellations compared to the resort hotel.
2. Customer Retention: Improve customer loyalty and retention by implementing personalized marketing campaigns, targeted offers, and loyalty programs.
3. Market Segmentation: Analyze different market segments represented in the dataset to tailor marketing strategies and attract profitable customer segments.
4. Lead Time Analysis: Improve booking conversions for customers with longer lead times by offering early bird discounts and exclusive benefits for early bookings.
5. Seasonal Demand: Plan marketing and staffing strategies around seasonal booking patterns, with a focus on high-demand months like August and January.
6. ADR Optimization: Assess pricing strategy to optimize Average Daily Rate (ADR), considering market demand, competitor rates, and guest preferences.
7. Cancellation Rate: Reduce cancellation rates by offering attractive non-refundable rates, clear communication about cancellation policies, and addressing customer feedback.
8. Country-Specific Focus: Focus on countries with high booking numbers, such as Portugal (PRT), the United Kingdom (GBR), France (FRA), Spain (ESP), Germany (DEU), Italy (ITA), and Ireland (IRL).



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***