<a href="https://colab.research.google.com/github/mohammadanas7777/EDA-Exploratory-Data-Analysis/blob/main/Capstone_Project_Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

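This project performs exploratory data analysis (EDA) on a hotel bookings dataset covering City Hotels and Resort Hotels. The raw data contains 119,390 rows and 32 columns. After removing duplicate rows and handling missing values, the notebook derives features such as total stay, total kids, total cost, and arrival date. It then uses charts to examine bookings by hotel type, customer type, country, month, and weekday, along with cancellations, average daily rate (ADR), length of stay, and revenue. The goal is to surface patterns that can inform pricing, staffing, and marketing decisions.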


# **GitHub Link -**

https://github.com/mohammadanas7777/EDA-Exploratory-Data-Analysis

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

To analyse booking trends across different sectors, make informed decisions, and get a clear idea of which sectors to invest in and which to improve.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was only an alias of the standard library module and is gone in NumPy 2.0
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

# !pip install pymysql
# import pymysql
# from sqlalchemy import create_engine
# from sqlalchemy.pool import NullPool

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
filepath = '/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Hotel Bookings.csv'
hotel_bookings_df = pd.read_csv(filepath)
hotel_df = hotel_bookings_df.copy()

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_df.shape

### Dataset Information

In [None]:
# Dataset Info
hotel_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
no_of_duplicate_values = hotel_df.duplicated().sum()
print("No. of duplicate rows in the dataset is", no_of_duplicate_values)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(hotel_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(hotel_df.isnull())


**Choice of Heatmap -**
I selected a heatmap to visualize missing values because it offers a clear and intuitive representation of their presence and distribution within the dataset. Heatmaps enable quick identification of patterns and clusters of missing data across different columns and rows, providing valuable insights into the dataset's completeness.

**Insights from the Chart -**
The heatmap highlights the location of missing values: with seaborn's default colormap, light cells mark missing data and dark cells mark non-missing data. By analyzing the heatmap, we can determine which columns and rows have the highest concentration of missing values and identify any potential patterns or correlations between missing values in different columns.

### What did you know about your dataset?

**Number of Rows:** The dataset contains **119,390** entries or rows.

**Number of Columns:** There are **32** columns in the dataset, each representing different attributes or features related to hotel bookings.

**Data Types:** The dataset contains a mix of data types, including integers (int64), floats (float64), and objects (object). This suggests that the dataset includes both numerical and categorical variables.

**Duplicate Rows:**
The dataset contains **31,994** duplicate entries, indicating potential replication of information across various records.

**Missing Values:** While the majority of columns contain complete data, several columns exhibit notable numbers of missing values. Specifically, the **company** column shows a substantial count of **112,593** missing values, while the **agent** column has **16,340** missing values. Additionally, the **country** and **children** columns contain **488** and **4** missing values, respectively.
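
The counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch of such a summary; `toy_df` and `missing_summary` are hypothetical names introduced here for illustration, and the toy values are made up rather than taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_df (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "children": [0.0, np.nan, 1.0, 0.0],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})


def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(hotel_df)` on the real data should list the same four columns discussed above, with a percentage next to each count.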

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_df.columns

In [None]:
# Dataset Describe
hotel_df.describe(include = 'all')

### Variables Description

**hotel:** Type of hotel (e.g., resort hotel, city hotel).

**is_canceled:** Binary indicator of whether the booking was canceled (1 = canceled, 0 = not canceled).

**lead_time:** Number of days between booking date and arrival date.

**arrival_date_year:** Year of arrival date.

**arrival_date_month:** Month of arrival date.

**arrival_date_week_number:** Week number of arrival date.

**arrival_date_day_of_month:** Day of the month of arrival date.

**stays_in_weekend_nights:** Number of weekend nights stayed.

**stays_in_week_nights:** Number of week nights stayed.

**adults:** Number of adults included in the booking.

**children:** Number of children included in the booking.

**babies:** Number of babies included in the booking.

**meal:** Type of meal booked (e.g., Undefined/SC – no meal package; BB – Bed & Breakfast).

**country:** Country of origin.

**market_segment:** Market segment designation (e.g., Online Travel Agents, Offline Travel Agents).

**distribution_channel:** Booking distribution channel (e.g., Direct, Corporate).

**is_repeated_guest:** Binary indicator of whether the guest is a repeated guest (1 = yes, 0 = no).

**previous_cancellations:** Number of previous cancellations by the guest.

**previous_bookings_not_canceled:** Number of previous bookings not canceled by the guest.

**reserved_room_type:** Code of room type reserved.

**assigned_room_type:** Code of room type assigned.

**booking_changes:** Number of changes made to the booking.

**deposit_type:** Type of deposit made for the booking (e.g., No Deposit, Non Refund, Refundable).

**agent:** ID of the travel agency that made the booking.

**company:** ID of the company/entity that made the booking.

**days_in_waiting_list:** Number of days the booking was on the waiting list before confirmed.

**customer_type:** Type of booking customer (e.g., Transient, Contract, Group).

**adr:** Average Daily Rate (average rental income per paid occupied room).

**required_car_parking_spaces:** Number of car parking spaces required by the guest.

**total_of_special_requests:** Number of special requests made by the guest.

**reservation_status:** Reservation status (e.g., Canceled, Check-Out).

**reservation_status_date:** Date at which the reservation status was last updated.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in hotel_df.columns:
  print(f"No. of unique values in {col} is {hotel_df[col].nunique()}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove exact duplicate rows
hotel_df.drop_duplicates(inplace = True)

In [None]:
# Dealing with missing values.
# Dropping the 'company' column because it contains too many null values.
hotel_df = hotel_df.drop('company', axis = 1)

In [None]:
hotel_df.head()

In [None]:
# Drop rows with missing values in the 'children' column; only a few rows are affected, so little data is lost.
hotel_df.dropna(subset=['children'], inplace=True)

In [None]:
# Fill missing values in the 'country' column with the placeholder string 'NA'
hotel_df['country'] = hotel_df['country'].fillna('NA')

In [None]:
# Fill missing values in the 'agent' column with 0.0 since agent column contains float values
hotel_df['agent'] = hotel_df['agent'].fillna(0.0)
hotel_df['agent'] = hotel_df['agent'].astype('float64')

In [None]:
# Calculate total stay by summing up the number of weekend nights and week nights
hotel_df['Total_stay'] = hotel_df['stays_in_weekend_nights'] + hotel_df['stays_in_week_nights']

In [None]:
# Calculate total number of kids by summing up the number of babies and children
hotel_df['Total_kids'] = hotel_df['babies'] + hotel_df['children']

In [None]:
# Calculate total cost by multiplying the average daily rate (adr) by the total stay
hotel_df['total_cost'] = hotel_df['adr']*hotel_df['Total_stay']

In [None]:
# Create a new date column from the year, month name, and day-of-month columns.
hotel_df['arrival_date'] = pd.to_datetime(hotel_df['arrival_date_year'].astype(str) + '-' + hotel_df['arrival_date_month'] + '-' + hotel_df['arrival_date_day_of_month'].astype(str))

In [None]:
hotel_df

In [None]:
hotel_df.describe()

In [None]:
hotel_df.info()

In [None]:
# Get a list of column names in the hotel_df DataFrame
hotel_df.columns.to_list()

### What all manipulations have you done and insights you found?

Removed duplicate rows from the hotel_df DataFrame.

Dropped the 'company' column due to its high number of null values.

Dropped the rows with missing values in the 'children' column.

Filled missing values in the 'country' column with 'NA' to handle missing categorical data.

Filled missing values in the 'agent' column with 0.0, on the assumption that a missing agent ID means the customer did not book through an agent.

Created a 'Total_stay' column by summing the weekend nights and week nights columns.

Created a 'Total_kids' column by summing the babies and children columns.

Created a 'total_cost' column by multiplying the average daily rate (adr) by the total stay.

Created an 'arrival_date' column by concatenating arrival_date_year, arrival_date_month and arrival_date_day_of_month and converting the result to a datetime.
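
The steps listed above can also be bundled into a single function, so the whole cleaning pass is repeatable on a fresh copy of the raw data. This is only a sketch under assumptions: `clean_bookings` and `toy_raw` are hypothetical names, the toy rows are invented, and the explicit `format="%Y-%B-%d"` assumes English month names such as "July".

```python
import numpy as np
import pandas as pd


def clean_bookings(df):
    """Apply the wrangling steps from this section to a copy of the input."""
    out = df.drop_duplicates().copy()          # remove exact duplicate rows
    out = out.drop(columns="company")          # mostly null, so dropped
    out = out.dropna(subset=["children"])      # only a handful of rows
    out["country"] = out["country"].fillna("NA")
    out["agent"] = out["agent"].fillna(0.0)    # 0.0 stands for "no agent"
    out["Total_stay"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    out["Total_kids"] = out["babies"] + out["children"]
    out["total_cost"] = out["adr"] * out["Total_stay"]
    out["arrival_date"] = pd.to_datetime(
        out["arrival_date_year"].astype(str) + "-"
        + out["arrival_date_month"] + "-"
        + out["arrival_date_day_of_month"].astype(str),
        format="%Y-%B-%d",
    )
    return out


# Toy rows (hypothetical): row 1 duplicates row 0, row 3 lacks 'children'
toy_raw = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016],
    "arrival_date_month": ["July", "July", "August", "May"],
    "arrival_date_day_of_month": [1, 1, 15, 3],
    "stays_in_weekend_nights": [0, 0, 2, 1],
    "stays_in_week_nights": [2, 2, 3, 1],
    "children": [0.0, 0.0, 1.0, np.nan],
    "babies": [0, 0, 1, 0],
    "country": ["PRT", "PRT", None, "GBR"],
    "agent": [np.nan, np.nan, 9.0, 240.0],
    "company": [np.nan, np.nan, np.nan, np.nan],
    "adr": [75.0, 75.0, 120.0, 60.0],
})

cleaned = clean_bookings(toy_raw)
print(cleaned[["country", "agent", "Total_stay", "Total_kids", "total_cost", "arrival_date"]])
```

Returning a new DataFrame instead of modifying in place keeps the raw copy available for comparison, which is the same idea as keeping `hotel_bookings_df` untouched in the loading cell.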


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Booking percentage based on hotel types**

In [None]:
# Chart - 1 visualization code
hotel_types = hotel_df['hotel'].value_counts()
print(hotel_types)  # a bare expression mid-cell displays nothing, so print explicitly
plt.pie(hotel_types, labels = hotel_types.index, autopct='%1.1f%%')
plt.title('Hotel Types')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is a suitable choice for visualizing the proportion of bookings for each hotel type because it effectively displays the relative sizes of different categories within a whole. This allows for a quick and intuitive understanding of the distribution of bookings across hotel types.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that City Hotels account for the majority of bookings, followed by Resort Hotels. This suggests that City Hotels are more popular among customers compared to Resort Hotels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There are no direct insights from this chart that suggest negative growth. However, it's important to consider the reasons behind the lower demand for Resort Hotels and explore ways to improve their appeal and attract more bookings. This could involve enhancing amenities, offering special packages, or targeting specific customer segments.


#### Chart - 2

**Booking percentage based on customer type**

In [None]:
# Chart - 2 visualization code
hotel_customer_type = hotel_df['customer_type'].value_counts()
print(hotel_customer_type)  # a bare expression mid-cell displays nothing, so print explicitly
plt.pie(hotel_customer_type, labels = hotel_customer_type.index, autopct='%1.1f%%')
plt.title('Customer Type')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is chosen to represent the distribution of customer types because it effectively illustrates the proportion of each customer type within the total bookings. This allows for a clear visual comparison of the relative sizes of different customer segments.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that Transient customers constitute the largest segment, followed by Transient-Party customers. This indicates that a significant portion of bookings comes from individual travelers or small groups who are not part of larger contracts or groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While the chart doesn't directly indicate negative growth, it highlights the importance of diversifying customer acquisition strategies. Relying heavily on Transient customers might make the business vulnerable to fluctuations in individual travel demand. Exploring ways to attract more Contract and Group bookings could provide a more stable revenue stream and mitigate potential risks associated with overdependence on a single customer segment.

#### Chart - 3

**Top 10 countries with the highest number of bookings**

In [None]:
# Chart - 3 visualization code
top_country = hotel_df['country'].value_counts().reset_index()
top_country.columns = ['country', 'booking_count']
top_country = top_country.head(10)
plt.figure(figsize=(6, 4))  # Adjust figsize as needed
plt.bar('country', 'booking_count', data= top_country, color='skyblue')
plt.title('Booking Counts of top 10 Countries')
plt.xlabel('Country')
plt.ylabel('Booking Count')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is an effective choice for visualizing the top 10 countries by booking count because it allows for a clear comparison across countries. The bars' heights directly correspond to the number of bookings, making it easy to see how the leading countries rank against each other.

##### 2. What is/are the insight(s) found from the chart?

The bar chart reveals that Portugal (PRT) has the highest number of bookings, followed by Great Britain (GBR) and France (FRA). This suggests that these countries are the primary sources of hotel bookings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the top contributing countries can help hotels tailor their marketing and service offerings to better cater to the preferences and needs of guests from these regions. This can lead to increased customer satisfaction and loyalty, ultimately driving positive business impact.

While the chart doesn't directly indicate negative growth, it highlights the importance of diversifying marketing efforts beyond the top contributing countries. Relying heavily on a few markets might make the business vulnerable to fluctuations in travel demand from those specific regions. Exploring ways to attract bookings from other countries could provide a more stable revenue stream and mitigate potential risks associated with overdependence on a limited number of markets.


#### Chart - 4

**Number of cancellations per month each year**

In [None]:
# Chart - 4 visualization code
cancel_by_year = hotel_df.pivot_table(index='arrival_date_month', columns='arrival_date_year', values='is_canceled', aggfunc='sum')
print(cancel_by_year)
sns.heatmap(cancel_by_year, cmap='Blues', annot = True, fmt = '.0f')
plt.title('Cancellation by Month and Year')
plt.xlabel('Year')
plt.ylabel('Month')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is chosen to visualize the number of cancellations per month each year because it effectively displays the distribution of cancellations across different months and years in a concise and intuitive manner. The color intensity of each cell represents the number of cancellations, allowing for quick identification of patterns and trends in cancellation behavior.


##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals that cancellations tend to be higher during peak travel seasons, such as summer months (June, July, August) and holiday periods (December). Additionally, there might be year-specific variations in cancellation patterns, potentially influenced by external factors like economic conditions or events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this chart can help in creating a positive business impact by enabling hotels to anticipate and manage cancellations more effectively. By understanding the seasonal and yearly trends in cancellations, hotels can implement strategies like adjusting pricing, offering flexible cancellation policies, or overbooking to mitigate the impact of cancellations on revenue.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of proactively addressing cancellation trends to avoid potential revenue loss and maintain customer satisfaction.
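
Raw cancellation counts rise and fall with overall booking volume, so a month can look bad simply because it is busy. One way to separate the two effects is to look at the cancellation rate instead, and to put the months in calendar order rather than the alphabetical order a pivot table produces. The sketch below assumes a toy DataFrame (`toy_df`, hypothetical values) with the same three columns used in the chart.

```python
import calendar

import pandas as pd

# Toy stand-in for hotel_df (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "arrival_date_year": [2016, 2016, 2016, 2016, 2017, 2017],
    "arrival_date_month": ["August", "August", "January", "January", "August", "January"],
    "is_canceled": [1, 0, 0, 0, 1, 1],
})

# Calendar order: January, February, ..., December
month_order = list(calendar.month_name)[1:]

# The mean of a 0/1 flag is the share of bookings that were canceled
cancel_rate = toy_df.pivot_table(
    index="arrival_date_month",
    columns="arrival_date_year",
    values="is_canceled",
    aggfunc="mean",
)

# Reorder the rows, keeping only the months present in the data
cancel_rate = cancel_rate.reindex([m for m in month_order if m in cancel_rate.index])
print(cancel_rate)
```

The resulting table can be passed to `sns.heatmap` in the same way as the count table, with `fmt='.2f'` so the rates are readable.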

#### Chart - 5

**Number of bookings by month and year**

In [None]:
# Chart - 5 visualization code
# Print the booking counts behind the chart (a bare expression mid-cell displays nothing)
print(hotel_df.groupby(['arrival_date_year', 'arrival_date_month']).size())

plt.figure(figsize=(12, 6))
sns.countplot(data=hotel_df, x='arrival_date_month', hue='arrival_date_year', palette='viridis')
plt.title('Distribution of Bookings by Month and Year')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.legend(title='Year', loc=(1, 0.81))
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is chosen to visualize the number of bookings by month and year because it effectively displays the distribution of bookings across different months and years in a clear and concise manner. The height of each bar represents the number of bookings for a specific month and year, allowing for quick comparison and identification of trends.


##### 2. What is/are the insight(s) found from the chart?

The countplot reveals that bookings tend to be higher during peak travel seasons, such as summer months (June, July, August) and holiday periods (December). Additionally, there might be year-specific variations in booking patterns, potentially influenced by external factors like economic conditions or events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this chart can help in creating a positive business impact by enabling hotels to anticipate and manage demand more effectively. By understanding the seasonal and yearly trends in bookings, hotels can implement strategies like adjusting pricing, offering promotions, or managing inventory to optimize revenue and occupancy rates.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of proactively addressing booking trends to capitalize on peak seasons and mitigate potential revenue loss during slower periods.


#### Chart - 6

**Number of total arrivals on each day of the week**

In [None]:
# Chart - 6 visualization code
# Count the arrival of customers for each day of the week based on arrival dates
arrival_day_count = hotel_df['arrival_date'].dt.day_name().value_counts()
arrival_day_count

In [None]:
plt.figure(figsize=(8, 6))
plt.barh(arrival_day_count.index, arrival_day_count.values)
plt.title('Total arrivals per day of the week')
plt.ylabel('Day of the week')
plt.xlabel('Number of arrivals')
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.gca().invert_yaxis()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is chosen to visualize the number of total arrivals on each day of the week because it allows for a clear comparison of arrival counts across different days. The horizontal orientation provides ample space for displaying the day names along the y-axis, enhancing readability.



##### 2. What is/are the insight(s) found from the chart?


The chart reveals that Friday is the most popular day for arrivals, followed by Thursday and Monday. This suggests that a significant portion of guests prefer to check in towards the end of the week or at the beginning of the week.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from this chart can help in creating a positive business impact by enabling hotels to optimize staffing levels, resource allocation, and service offerings based on arrival patterns. By understanding the peak arrival days, hotels can ensure adequate staffing and resources to handle increased check-ins and provide a seamless guest experience.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of efficiently managing resources and staffing levels to avoid potential bottlenecks or service delays on peak arrival days, which could negatively impact customer satisfaction.
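
`value_counts()` sorts days by frequency, which is handy for ranking but hides the shape of the week. As an alternative view, the counts can be reindexed into Monday-to-Sunday order. This is a small sketch with invented dates; `toy_dates` and `weekday_order` are hypothetical names.

```python
import pandas as pd

# Toy arrival dates (hypothetical); in the notebook this would be hotel_df['arrival_date']
toy_dates = pd.Series(pd.to_datetime([
    "2016-07-01", "2016-07-01", "2016-07-04", "2016-07-07", "2016-07-08",
]))

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Count arrivals per weekday, then force calendar order; absent days get 0
arrival_day_count = (
    toy_dates.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
print(arrival_day_count)
```

Plotting this series with `plt.barh` gives bars in weekday order, which makes a weekend versus weekday pattern easier to spot than a frequency-sorted chart.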

#### Chart - 7

**Analyzing the variation of average daily rate based on hotel type**

In [None]:
# Let's first check for the outliers in adr values
plt.figure(figsize=(6, 4))
sns.scatterplot(x='hotel', y='adr', data=hotel_df)
plt.title('ADR Distribution for Each Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()

The scatter plot shows that nearly all ADR values lie below 1000, so we use 1000 as a threshold to filter out the outliers.
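
A fixed threshold read off a plot works, but it is a manual choice. A common data-driven alternative, not used in this notebook, is the interquartile range (IQR) rule, which flags values above Q3 + 1.5 × IQR. The sketch below shows it on invented ADR values; `iqr_upper_bound` and `toy_adr` are hypothetical names.

```python
import pandas as pd


def iqr_upper_bound(values, k=1.5):
    """Upper outlier fence: third quartile plus k times the interquartile range."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Toy ADR values (hypothetical) with one extreme rate
toy_adr = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 5400.0])

upper = iqr_upper_bound(toy_adr)

# Keep positive rates that fall at or below the fence
filtered = toy_adr[(toy_adr > 0) & (toy_adr <= upper)]
print("Upper fence:", upper)
print("Rows kept:", len(filtered), "of", len(toy_adr))
```

The IQR rule is usually much stricter than a cutoff of 1000, so it would remove more rows on the real data. Which rule fits depends on whether very high rates are errors or genuine premium bookings.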

In [None]:
# Chart - 7 visualization code
# Filter the DataFrame to ADR values strictly between 0 and 1000 (this also drops zero and negative rates)
filtered_hotel_df = hotel_df[(hotel_df['adr'] > 0) & (hotel_df['adr'] < 1000)]

# Create a box plot to visualize ADR distribution for each hotel type
plt.figure(figsize=(8, 6))
sns.boxplot(x='hotel', y='adr', data=filtered_hotel_df)
plt.title('ADR Distribution for Each Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is chosen to analyze the variation of average daily rate (ADR) based on hotel type because it effectively displays the distribution of ADR for each hotel category, including the median, quartiles, and potential outliers. This allows for a clear comparison of the central tendency and spread of ADR across different hotel types.


##### 2. What is/are the insight(s) found from the chart?

The box plot reveals that City Hotels generally have a higher median ADR compared to Resort Hotels, indicating that City Hotels tend to charge higher rates on average. Additionally, the box plot shows a wider range of ADR for City Hotels, suggesting greater variability in pricing for this category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the variation of ADR based on hotel type can help hotels optimize pricing strategies and tailor their services to specific customer segments. For instance, Resort Hotels might consider offering promotional packages or discounts during off-peak seasons to increase occupancy and revenue.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of understanding the competitive landscape and adjusting pricing strategies accordingly to maximize revenue and occupancy rates for each hotel type.

#### Chart - 8

**Average ADR for Each Room Type by Hotel Type**

In [None]:
# Chart - 8 visualization code
# Group the data by hotel type and assigned room type, and calculate the mean ADR for each group
average_adr_by_room_type = hotel_df.groupby(['hotel', 'assigned_room_type'])['adr'].mean().reset_index()

# Create a bar plot to visualize the average ADR for each room type, grouped by hotel type
plt.figure(figsize=(8, 5))
sns.barplot(x='assigned_room_type', y='adr', hue='hotel', data=average_adr_by_room_type)
plt.title('Average ADR for Each Room Type by Hotel Type')
plt.xlabel('Assigned Room Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.xticks(rotation=45)
plt.legend(title='Hotel Type')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot to visualize the average ADR for each room type, grouped by hotel type, because it provides a clear and concise way to compare the average ADR across different room types and hotel types. The use of different colors for each hotel type enhances the visual distinction between the groups.

##### 2. What is/are the insight(s) found from the chart?

The bar plot reveals that the average ADR varies significantly across different room types and hotel types. Because room types are recorded only as codes, the chart shows which codes command a higher ADR rather than which named room categories do. Additionally, city hotels tend to have a higher average ADR compared to resort hotels for most room types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by informing pricing strategies and inventory management decisions. By understanding the average ADR for different room types and hotel types, hotels can optimize their pricing to maximize revenue. Additionally, hotels can adjust their inventory allocation based on the demand for different room types to improve occupancy rates.

There are no direct insights from this chart that suggest negative growth. However, it is important to note that the average ADR is just one factor that influences booking decisions. Other factors, such as seasonality, location, and amenities, also play a significant role. Therefore, hotels should consider a holistic approach when making pricing and inventory decisions.


#### Chart - 9

**Average total stays for each hotel type**

In [None]:
# Chart - 9 visualization code
# Calculate the average total stay for each hotel type
average_stay_by_hotel = hotel_df.groupby('hotel')['Total_stay'].mean().reset_index()
print(average_stay_by_hotel)

# Create a bar plot to visualize the average total stay for each hotel type
plt.figure(figsize=(4, 6))
sns.barplot(x='hotel', y='Total_stay', data=average_stay_by_hotel)
plt.title('Average Total Stay by Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('Average Total Stay (Nights)')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is chosen to visualize the average total stay for each hotel type because it provides a clear and concise way to compare the average length of stay across different hotel categories. The height of each bar represents the average number of nights guests stay at each hotel type, allowing for quick comparison and identification of trends.


##### 2. What is/are the insight(s) found from the chart?

The bar plot reveals that guests tend to stay longer at Resort Hotels compared to City Hotels. This suggests that Resort Hotels might cater to leisure travelers who typically plan longer vacations, while City Hotels might attract more business travelers with shorter stays.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the average total stay for each hotel type can help hotels tailor their services and marketing strategies to specific customer segments. For instance, Resort Hotels can focus on offering amenities and activities that cater to longer stays, while City Hotels can emphasize convenience and efficiency for shorter stays.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of understanding the typical length of stay for each hotel type to optimize pricing, inventory management, and service offerings. Failing to cater to the specific needs of each customer segment could lead to lower occupancy rates and revenue.


#### Chart - 10

**Number of bookings with kids and without kids**

In [None]:
# Chart - 10 visualization code
# Count bookings with and without kids
booking_with_kids = hotel_df[hotel_df['Total_kids'] > 0]['Total_kids'].count()
booking_without_kids = hotel_df[hotel_df['Total_kids'] == 0]['Total_kids'].count()
print('Number of bookings with kids is', booking_with_kids, 'and without kids is', booking_without_kids)
print('\n')
# Create a bar plot
plt.figure(figsize=(4, 6))
plt.bar(['With Kids', 'Without Kids'], [booking_with_kids, booking_without_kids])
plt.title('Number of Bookings With and Without Kids')
plt.xlabel('Booking Type')
plt.ylabel('Number of Bookings')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is chosen to visualize the number of bookings with and without kids because it provides a clear and direct comparison between the two categories. The height of each bar represents the number of bookings, allowing for quick assessment of the proportion of bookings with and without children.


##### 2. What is/are the insight(s) found from the chart?

The bar chart reveals that the number of bookings without kids significantly exceeds the number of bookings with kids. This suggests that a larger portion of hotel guests are couples or individuals traveling without children.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the proportion of bookings with and without kids can help hotels tailor their services and marketing strategies to specific customer segments. For instance, hotels can offer family-friendly amenities and packages to attract more bookings with children or focus on creating a more adult-oriented atmosphere to cater to couples and individuals traveling without kids.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of understanding the demographics of hotel guests to optimize service offerings and marketing efforts. Failing to cater to the needs of specific customer segments could lead to missed opportunities for increased bookings and revenue.

#### Chart - 11

**Total revenue generated by hotel type**

In [None]:
# Chart - 11 visualization code
# Calculate the total revenue generated by each hotel type
revenue_by_hotel = hotel_df.groupby('hotel')['total_cost'].sum()
print(revenue_by_hotel)

# Create a pie chart to visualize each hotel type's share of total revenue
print('\n')
plt.pie(revenue_by_hotel, labels=revenue_by_hotel.index, autopct='%1.2f%%', startangle=270)
plt.title('Total Revenue Generated by Hotel Type')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is chosen to visualize the total revenue generated by each hotel type because it effectively displays the proportion of revenue contributed by each category. The size of each slice represents the percentage of total revenue, allowing for quick comparison and identification of the dominant revenue source.


##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that City Hotels generate a significantly larger proportion of total revenue compared to Resort Hotels. This suggests that City Hotels might have higher occupancy rates, higher average daily rates, or a combination of both, leading to greater revenue generation.
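Whether the difference comes from booking volume or from revenue per booking can be checked by splitting total revenue into its parts. This is a rough sketch, assuming the `hotel`, `total_cost`, and `adr` columns used in this notebook; the `revenue_breakdown` helper and the toy data are hypothetical.

```python
import pandas as pd

def revenue_breakdown(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize bookings, average revenue, and revenue share per hotel type."""
    summary = df.groupby('hotel').agg(
        bookings=('total_cost', 'count'),            # bookings with a recorded cost
        avg_cost_per_booking=('total_cost', 'mean'),
        avg_adr=('adr', 'mean'),
        total_revenue=('total_cost', 'sum'),
    )
    # Share of overall revenue, matching the percentages a pie chart would show
    summary['revenue_share_pct'] = (
        100 * summary['total_revenue'] / summary['total_revenue'].sum()
    )
    return summary

# Toy data for illustration only (not taken from the hotel dataset)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel'],
    'total_cost': [100.0, 200.0, 300.0, 400.0],
    'adr': [50.0, 100.0, 100.0, 100.0],
})
summary = revenue_breakdown(toy_df)
print(summary)
```

A hotel type with many bookings but a low average cost per booking points to volume as the driver, while the reverse points to pricing or stay length.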


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the revenue contribution of each hotel type can help hotels make informed decisions regarding resource allocation, marketing strategies, and expansion plans. For instance, hotels might consider investing more in City Hotels due to their higher revenue potential or explore ways to increase occupancy and revenue for Resort Hotels.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of optimizing revenue generation for both hotel types to maximize overall profitability. Focusing solely on City Hotels while neglecting Resort Hotels could lead to missed opportunities for growth and revenue diversification.


#### Chart - 12

**Average ADR by Hotel Type and Assigned Room Type**

In [None]:
# Chart - 12 visualization code
# Calculate the average ADR by hotel type and assigned room type
average_adr = hotel_df.groupby(['hotel', 'assigned_room_type'])['adr'].mean().unstack()

# Create a heatmap
plt.figure(figsize=(14, 3))
sns.heatmap(average_adr, annot=True, cmap='plasma', fmt='.2f')
plt.title('Average ADR by Hotel Type and Assigned Room Type')
plt.xlabel('Assigned Room Type')
plt.ylabel('Hotel Type')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is chosen to visualize the average ADR by hotel type and assigned room type because it provides a clear and concise way to compare the average ADR across different room types and hotel types. The use of color intensity to represent the average ADR allows for quick identification of trends and patterns.


##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals that the average ADR varies significantly across different room types and hotel types. For instance, suite rooms generally command a higher ADR compared to other room types, regardless of the hotel type. Additionally, city hotels tend to have a higher average ADR compared to resort hotels for most room types.
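An average in a heatmap cell can rest on very few bookings, so it may help to check the sample size behind each cell before comparing values. The following is a minimal sketch under that assumption; the `adr_with_support` helper, the `min_bookings` threshold, and the toy data are hypothetical.

```python
import pandas as pd

def adr_with_support(df: pd.DataFrame, min_bookings: int = 30):
    """Return (mean ADR with sparse cells blanked out, booking counts per cell)."""
    grouped = df.groupby(['hotel', 'assigned_room_type'])['adr']
    mean_adr = grouped.mean().unstack()
    # Combinations that never occur become 0 bookings
    counts = grouped.count().unstack().fillna(0)
    # Keep a mean only where it is backed by at least min_bookings rows
    return mean_adr.where(counts >= min_bookings), counts

# Toy data for illustration only (not taken from the hotel dataset)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 2,
    'assigned_room_type': ['A', 'A', 'A', 'B', 'A', 'A'],
    'adr': [100.0, 110.0, 120.0, 300.0, 80.0, 90.0],
})
masked_adr, counts = adr_with_support(toy_df, min_bookings=2)
print(masked_adr)
print(counts)
```

The masked table can be passed to `sns.heatmap` in place of the raw averages, which leaves sparse cells empty instead of showing a value based on a handful of bookings.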


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by informing pricing strategies and inventory management decisions. By understanding the average ADR for different room types and hotel types, hotels can optimize their pricing to maximize revenue. Additionally, hotels can adjust their inventory allocation based on the demand for different room types to improve occupancy rates.

There are no direct insights from this chart that suggest negative growth. However, it is important to note that the average ADR is just one factor that influences booking decisions. Other factors, such as seasonality, location, and amenities, also play a significant role. Therefore, hotels should consider a holistic approach when making pricing and inventory decisions.


#### Chart - 13

**ADR vs. Total Stay and Distribution of Total Stays by Hotel Type**

In [None]:
# Chart - 13 visualization code
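# Keep only bookings with a positive ADR below 1000 so that zero-rate rows
# and extreme outliers do not distort the scatter plot
filtered_hotel_df = hotel_df[(hotel_df['adr'] > 0) & (hotel_df['adr'] < 1000)]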
plt.figure(figsize=(8, 6))
sns.scatterplot(y='Total_stay', x='adr', data=filtered_hotel_df, hue='hotel', alpha=0.7)
plt.title('Scatter Plot of ADR vs. Total Stay')
plt.ylabel('Total Stay (Nights)')
plt.xlabel('Average Daily Rate (ADR)')
plt.legend(title='Hotel Type')
plt.grid(True)
plt.show()

In [None]:
# Gaussian curves of total stays by hotel type

# Calculate the mean and standard deviation of total stays for each hotel type
city_stay_mean = hotel_df[hotel_df['hotel'] == 'City Hotel']['Total_stay'].mean()
city_stay_std = hotel_df[hotel_df['hotel'] == 'City Hotel']['Total_stay'].std()

resort_stay_mean = hotel_df[hotel_df['hotel'] == 'Resort Hotel']['Total_stay'].mean()
resort_stay_std = hotel_df[hotel_df['hotel'] == 'Resort Hotel']['Total_stay'].std()

# Generate x-values for the curves
x_values = np.linspace(0, 15, 200)  # Adjust range as needed

# Calculate y-values for the Gaussian curves
city_curve = np.exp(-(x_values - city_stay_mean)**2 / (2 * city_stay_std**2)) / (city_stay_std * np.sqrt(2 * np.pi))
resort_curve = np.exp(-(x_values - resort_stay_mean)**2 / (2 * resort_stay_std**2)) / (resort_stay_std * np.sqrt(2 * np.pi))

# Plot the curves
plt.plot(x_values, city_curve, label='City Hotel')
plt.plot(x_values, resort_curve, label='Resort Hotel')

plt.title('Gaussian Distribution of Total Stays by Hotel Type')
plt.xlabel('Total Stay (Nights)')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is chosen to visualize the relationship between ADR and total stay because it shows individual bookings and makes potential trends or patterns visible. Color-coding by hotel type allows direct comparison between the two categories.

Gaussian curves are chosen to visualize the distribution of total stays by hotel type because they summarize the central tendency and spread of stay lengths for City Hotels and Resort Hotels in a single comparable shape. Note that each curve is a normal approximation built only from the sample mean and standard deviation; it is not the empirical distribution of stay lengths.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot reveals a general trend of longer stays being associated with lower ADRs, particularly for Resort Hotels. This suggests that guests staying for extended periods might be offered discounted rates or packages. Additionally, City Hotels exhibit a wider range of ADRs for similar lengths of stay, indicating greater variability in pricing.

The Gaussian curves are consistent with the earlier observation that Resort Hotels generally have longer average stays than City Hotels. The Resort Hotel curve is also wider, reflecting a larger standard deviation in stay length.
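Because the Gaussian curves are drawn only from each group's mean and standard deviation, they cannot show skew in the underlying stay lengths. A minimal sketch of a complementary check follows, assuming the `hotel` and `Total_stay` columns used above; the `stay_summary` helper and the toy data are hypothetical.

```python
import pandas as pd

def stay_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize stay length per hotel type, including median and skewness."""
    # A mean well above the median, or a large positive skew, signals a long
    # right tail that a symmetric Gaussian curve would hide.
    return df.groupby('hotel')['Total_stay'].agg(['mean', 'median', 'std', 'skew'])

# Toy data for illustration only (not taken from the hotel dataset)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'Total_stay': [1, 2, 2, 3, 1, 1, 2, 10],
})
summary = stay_summary(toy_df)
print(summary)
```

If the real data shows strong skew, a histogram or kernel density plot of `Total_stay` per hotel type would be a more faithful picture than the fitted curves.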


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of total stays by hotel type can help hotels tailor their services and marketing strategies to specific customer segments. For instance, Resort Hotels can focus on offering amenities and activities that cater to longer stays, while City Hotels can emphasize convenience and efficiency for shorter stays.

There are no direct insights from this chart that suggest negative growth. However, it highlights the importance of understanding the typical length of stay for each hotel type to optimize pricing, inventory management, and service offerings. Failing to cater to the specific needs of each customer segment could lead to lower occupancy rates and revenue.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Calculate correlation matrix
correlation_matrix = hotel_df.corr(numeric_only=True)

# Plot correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".1f", linewidths=0.8)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is chosen to visualize the relationships between numerical variables in the dataset because it provides a comprehensive overview of the correlations between all pairs of variables. The use of color intensity to represent the strength and direction of correlations allows for quick identification of potential relationships and patterns.


##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap reveals several interesting insights:
- There is a strong positive correlation between lead time and the number of days in waiting list, suggesting that bookings made further in advance are more likely to be placed on a waiting list.
- There is a moderate positive correlation between the number of adults and the number of children, indicating that larger groups often include both adults and children.
- There is a weak negative correlation between lead time and the average daily rate (ADR), implying that bookings made closer to the arrival date might have slightly higher ADRs.
- There is a weak positive correlation between the total number of special requests and the total stay, suggesting that guests with longer stays tend to make more special requests.


The insights gained from the correlation heatmap can also create a positive business impact by informing several aspects of hotel management:
- **Pricing and Inventory Management:** Understanding the relationship between lead time and ADR can help hotels optimize pricing strategies based on booking lead times. The correlation between lead time and waiting list days can also inform inventory management decisions to minimize wait times.
- **Targeted Marketing and Services:** The correlation between adults and children can help hotels tailor marketing campaigns and services to families. The correlation between special requests and total stay can guide hotels in anticipating and fulfilling the needs of long-stay guests.
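Reading correlation strengths off a large annotated heatmap is error-prone, so it can help to list the strongest pairs numerically. This is a small sketch, assuming the same `corr(numeric_only=True)` call as in the heatmap cell; the `top_correlations` helper and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n variable pairs with the largest absolute correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 excludes the diagonal) so that each
    # pair appears once and self-correlations are dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value so strong negative correlations rank high too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only (not taken from the hotel dataset)
toy_df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [-2, -4, -6, -8, -10],
    'c': [1, 3, 2, 5, 4],
})
top = top_correlations(toy_df, n=3)
print(top)
```

Calling `top_correlations(hotel_df, n=10)` would give a ranked list against which the strong, moderate, and weak labels in the bullet points can be checked.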


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select numerical columns for pair plot
numeric_columns = ['lead_time', 'is_canceled', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults',
                   'children', 'babies', 'previous_cancellations', 'previous_bookings_not_canceled',
                   'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces',
                   'total_of_special_requests', 'total_cost' ]

# Filter the data to include only numerical columns
numeric_data = hotel_df[numeric_columns]

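# Create the pair plot (corner=True draws only the lower triangle)
pair_grid = sns.pairplot(numeric_data, corner=True)
# Set a figure-level title; plt.title would only label the last subplot
pair_grid.fig.suptitle('Pair Plot of Numerical Variables', y=1.02)

plt.show()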

##### 1. Why did you pick the specific chart?

A pair plot is chosen to visualize the relationships between pairs of numerical variables because it provides a comprehensive overview of the distributions and correlations between all possible pairs. Each diagonal subplot displays the distribution of a single variable, while the off-diagonal subplots show scatter plots of pairs of variables, allowing for quick identification of potential relationships and patterns.


##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals several interesting insights:
- **Lead Time and Cancellations:** There is a positive correlation between lead time and cancellations, indicating that bookings made further in advance are more likely to be canceled.
- **Stay Duration and Weekends:** There is a positive correlation between the number of weekend nights and the total length of stay, suggesting that guests often include weekends in their bookings.
- **Adults and Children:** There is a positive correlation between the number of adults and the number of children, indicating that larger groups often include both adults and children.
- **ADR and Total Cost:** There is a strong positive correlation between the average daily rate (ADR) and the total cost, as expected.

The insights gained from the pair plot can also create a positive business impact by informing several aspects of hotel management:
- **Cancellation Management:** Understanding the relationship between lead time and cancellations can help hotels implement policies and strategies to reduce cancellations, such as offering non-refundable rates or requiring deposits for bookings made far in advance.
- **Pricing and Promotions:** The correlation between stay duration and weekends can inform pricing and promotional strategies, such as offering weekend packages or discounts for longer stays.
- **Family-Friendly Services:** The correlation between adults and children can help hotels tailor marketing campaigns and services to families, such as offering family rooms or kid-friendly amenities.
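The relationship between lead time and cancellations is easier to act on when expressed as a cancellation rate per lead-time bucket. The following is a minimal sketch, assuming the `lead_time` and `is_canceled` columns; the `cancel_rate_by_lead_time` helper, the bucket edges, and the toy data are hypothetical choices.

```python
import pandas as pd

def cancel_rate_by_lead_time(
    df: pd.DataFrame,
    bins=(0, 7, 30, 90, 180, 365, float('inf')),
) -> pd.DataFrame:
    """Return cancellation rate and booking count per lead-time bucket (days)."""
    # right=False makes buckets like [0, 7), so a lead time of 0 is included
    buckets = pd.cut(df['lead_time'], bins=list(bins), right=False)
    return df.groupby(buckets, observed=True)['is_canceled'].agg(
        cancel_rate='mean',   # is_canceled is 0/1, so the mean is the rate
        bookings='count',
    )

# Toy data for illustration only (not taken from the hotel dataset)
toy_df = pd.DataFrame({
    'lead_time': [1, 3, 10, 20, 100, 200],
    'is_canceled': [0, 0, 0, 1, 1, 1],
})
rates = cancel_rate_by_lead_time(toy_df)
print(rates)
```

A table like this indicates from which lead time onward a deposit or non-refundable rate might be worth testing.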


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Based on the analysis conducted, here are some suggestions for the client to achieve their business objectives:

**1. Optimize Pricing and Inventory Management:**

* Implement dynamic pricing strategies based on demand, seasonality, and booking lead time. Consider offering discounts for longer stays and non-refundable rates for bookings made far in advance to reduce cancellations.
* Utilize revenue management systems to forecast demand and adjust inventory allocation accordingly. This can help maximize occupancy rates and revenue.

**2. Enhance Customer Experience and Loyalty:**

* Personalize the guest experience by tailoring services and amenities to specific customer segments. For instance, offer family-friendly activities and packages for families with children, and provide business travelers with convenient amenities and services.
* Implement a loyalty program to reward repeat guests and encourage direct bookings. This can help reduce reliance on third-party booking platforms and increase customer retention.

**3. Target Marketing and Promotions:**

* Develop targeted marketing campaigns based on customer demographics, booking behavior, and preferences. Utilize social media, email marketing, and search engine optimization to reach potential guests.
* Offer special promotions and packages to attract specific customer segments, such as weekend getaways for couples or family vacations during school holidays.

**4. Improve Operational Efficiency:**

* Streamline check-in and check-out processes to reduce wait times and enhance guest satisfaction. Utilize technology to automate tasks and improve communication between staff.
* Analyze guest feedback and address any recurring issues promptly. This can help identify areas for improvement and enhance the overall guest experience.

**5. Monitor and Adapt:**

* Continuously monitor key performance indicators (KPIs) such as occupancy rates, ADR, revenue per available room (RevPAR), and guest satisfaction scores.
* Adapt strategies and tactics based on market trends, competitor analysis, and guest feedback to stay competitive and achieve sustainable growth.
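The KPIs named above follow standard definitions: occupancy is room-nights sold divided by room-nights available, ADR is room revenue divided by room-nights sold, and RevPAR is room revenue divided by room-nights available (equivalently, ADR times occupancy). A minimal sketch follows, assuming the number of available room-nights is supplied separately; the `hotel_kpis` function and the example figures are hypothetical.

```python
def hotel_kpis(room_revenue: float, room_nights_sold: int,
               room_nights_available: int) -> dict:
    """Compute occupancy, ADR, and RevPAR for one reporting period."""
    if room_nights_available <= 0:
        raise ValueError('room_nights_available must be positive')
    occupancy = room_nights_sold / room_nights_available
    # ADR is undefined when nothing was sold; report 0.0 in that case
    adr = room_revenue / room_nights_sold if room_nights_sold else 0.0
    revpar = room_revenue / room_nights_available
    return {'occupancy': occupancy, 'adr': adr, 'revpar': revpar}

# Hypothetical figures for illustration only
kpis = hotel_kpis(room_revenue=12000.0, room_nights_sold=100,
                  room_nights_available=150)
print(kpis)
```

Tracking these three values per month and per hotel type gives a compact view of whether changes in pricing or occupancy are driving revenue.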

By implementing these recommendations, the client can optimize their operations, enhance the guest experience, and achieve their business objectives of increasing revenue and profitability.


# **Conclusion**

This analysis has provided valuable insights into the factors that influence hotel bookings and revenue. By understanding these factors, hotels can make informed decisions to optimize their pricing, inventory management, marketing, and customer service strategies.

The key takeaways from this analysis include:

* The importance of lead time in predicting cancellations and adjusting pricing.
* The impact of seasonality and booking channels on occupancy rates and ADR.
* The need to tailor services and amenities to specific customer segments.
* The value of data-driven decision-making in the hospitality industry.

By leveraging these insights and implementing the recommended strategies, hotels can enhance the guest experience, increase revenue, and achieve sustainable growth in a competitive market.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***