<a href="https://colab.research.google.com/github/prajapatimohit/EDA-Hotel_Booking_Analysis/blob/main/Hotel_Booking_Analysis_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual (Mohit Prajapati)


# **Project Summary -**

The hotel booking dataset contains information about bookings for a city hotel and a resort hotel, including the booking date, length of stay, number of guests, and parking availability. This data can be used to answer questions about the best time to book a hotel room, the optimal length of stay, and the likelihood of receiving special requests. Personal information has been removed to protect privacy. By analyzing this data, important factors that impact hotel bookings can be discovered.

# **Column information**


**Hotel**
*  H1: Resort hotel

*  H2: City hotel

**is_canceled**

*   1: Canceled
*   0: Not canceled

**lead_time**

Number of days that elapsed between the booking being entered into the property management system (PMS) and the arrival date

**arrival_date_year**

Year of arrival date (2015-2017)

**arrival_date_month**

Month of arrival date (Jan - Dec)

**arrival_date_week_number**

Week number of year for arrival date (1-53)

**arrival_date_day_of_month**

Day of arrival date

**stays_in_weekend_nights**

Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

**stays_in_week_nights**

Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel


**adults**, **children**, **babies**

Number of adults, children, and babies included in the booking

**meal**

Type of meal booked. Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)

**country**

**market_segment** 
(a group of people who share one or more common characteristics, lumped together for marketing purposes)

  

*   TA: Travel agents
*   TO: Tour operators


  
**distribution_channel**

(A distribution channel is a chain of businesses or intermediaries through which a good or service passes until it reaches the final buyer or the end consumer)

*   TA: Travel agents
*   TO: Tour operators

**is_repeated_guest** 

(value indicating whether the booking was made by a repeated guest)

*   1: Yes
*   0: No

**previous_cancellations**

Number of previous bookings that were cancelled by the customer prior to the current booking

**previous_bookings_not_canceled**

Number of previous bookings not cancelled by the customer prior to the current booking

**reserved_room_type**

Code of room type reserved. Code is presented instead of designation for anonymity reasons.

**assigned_room_type**

Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.

**booking_changes**

Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

**deposit_type**

Indicates whether the customer made a deposit to guarantee the booking. This variable can take one of three values:

*   No Deposit – no deposit was made
*   Non Refund – a deposit was made in the value of the total stay cost
*   Refundable – a deposit was made with a value under the total cost of stay

**agent**

ID of the travel agency that made the booking

**company**

ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

**days_in_waiting_list**

Number of days the booking was in the waiting list before it was confirmed to the customer

**customer_type**

*   Contract – when the booking has an allotment or other type of contract associated with it
*   Group – when the booking is associated with a group
*   Transient – when the booking is not part of a group or contract, and is not associated with another transient booking
*   Transient-party – when the booking is transient, but is associated with at least one other transient booking

**adr**

Average daily rate: the sum of all lodging transactions divided by the total number of staying nights
 
**required_car_parking_spaces**

Number of car parking spaces required by the customer

**total_of_special_requests**

Number of special requests made by the customer (e.g. twin bed or high floor)

**reservation_status**

*   Canceled – booking was canceled by the customer
*   Check-Out – customer has checked in but already departed
*   No-Show – customer did not check in and did inform the hotel of the reason why

**reservation_status_date**

Date at which the last status was set. This variable can be used together with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel





# **GitHub Link -**

https://github.com/prajapatimohit/EDA-Hotel_Booking_Analysis.git

# **Problem Statement**


Our main objective is to perform EDA on the given dataset and draw useful conclusions about general trends in hotel bookings and how the factors governing hotel bookings interact with each other.

#### **Define Your Business Objective?**

Generate a report for management and the new marketing manager so that they can derive a strategy for the marketing team. As a newly hired data analyst, I have been assigned this task because the majority of my colleagues (except some senior staff) were quarantined after the coronavirus spread through the office. I need to win them back! I will work with data from 2015 to 2017.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
## mounting drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Capstone_1(EDA)/Hotel Bookings.csv')
data.head()
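
The guidelines above ask for exception handling, and the hard-coded Drive path is the most likely point of failure in this notebook. A minimal sketch of a guarded loader is given here. The function name `load_bookings` and the file name `does_not_exist.csv` are hypothetical and introduced only for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV, failing early with a clear message if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Bookings file not found: {path}")
    return pd.read_csv(path)


# Demonstration with a path that does not exist (hypothetical file name)
try:
    bookings = load_bookings("does_not_exist.csv")
except FileNotFoundError as err:
    bookings = None
    message = str(err)
    print(message)
```

In the notebook itself, the same function could be called with the Drive path used in the cell above.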

### Dataset First View

In [None]:
# First view of the dataset: show the first five rows
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_of_Rows = data.shape[0]
num_of_Columns = data.shape[1]
print(f"Total no. of rows: {num_of_Rows}")
print(f"Total no. of columns: {num_of_Columns}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()



*   Our dataset contains a total of 119,390 rows, of which 31,994 are duplicates.
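
The notebook counts duplicates but does not remove them. One possible follow-up, shown here only as a sketch on a made-up toy frame, is to drop exact duplicates before the analysis. Whether that is appropriate is a judgment call: because identifying information has been removed, two identical rows may still be two genuine bookings.

```python
import pandas as pd

# Toy data standing in for the bookings table (values are invented)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 35, 10],
    "adr": [80.0, 80.0, 120.0, 95.0],
})

# duplicated() marks every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s); {len(deduplicated)} rows remain")
```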





#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = data.isnull().sum().sort_values(ascending = False)[:4]
null_values

Our dataset contains many null values. We need to handle them in order to get accurate results.
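
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary table of missing counts and percentages. The toy frame and its values are invented; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (values are invented)
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 14.0, np.nan],
    "country": ["PRT", "GBR", None, "PRT"],
    "adults": [2, 2, 1, 2],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

# Keep only the columns that actually have missing values
missing = missing[missing["n_missing"] > 0]
print(missing)
```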

In [None]:
# visualize missing values
plt.figure(figsize=(30,12))
sns.heatmap(
    data.isna().transpose(),               # True where a cell is null or missing; transposed so each column becomes a row
    cmap="YlGnBu",                         # color map used for the heatmap
    cbar_kws={'label': 'Missing Data'},    # label for the color bar on the right of the heatmap
)

### What did you know about your dataset?

The dataset has 119,390 rows; the column count is printed by the rows and columns cell above. The company column has the highest number of missing values, followed by agent, country, and children.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description 

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in data.columns:
    unique_vals = data[col].unique()
    print('Unique values for column', col, ':', unique_vals[:7])

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
data = data.drop(columns = ['agent', 'company']) # removing the columns that contain unimportant attributes

After dropping these two columns, only 488 rows still contain NaN values. That is a very small proportion of the total of 119,390 rows, so removing those rows is a practical approach.

In [None]:
data = data.dropna(axis = 0)
# Check to see if there are any more NaN data 
data.isnull().sum()

### What all manipulations have you done and insights you found?

The "company" and "agent" columns contain a significant number of NaN values. These missing values are unlikely to have any impact on the analysis, so both columns were removed altogether.

Removing rows with NaN values before dropping these columns was not feasible, as it would have discarded 112,593 of the 119,390 rows. Dropping the two columns first, and only then removing the few remaining rows with NaN values, keeps almost all of the data.
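
The trade-off between dropping rows and dropping columns can be checked directly by counting how many rows `dropna` would discard in each case. The sketch below does this on an invented toy frame; the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data: agent and company are mostly empty, other columns are nearly complete
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 14.0, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "country": ["PRT", "GBR", None, "PRT", "ESP"],
    "children": [0.0, 0.0, 1.0, 0.0, np.nan],
})

# Rows lost if we drop every row that has any NaN
rows_lost_before = len(toy) - len(toy.dropna())

# Rows lost if the two sparse columns are removed first
reduced = toy.drop(columns=["agent", "company"])
rows_lost_after = len(reduced) - len(reduced.dropna())

print(f"Rows lost without dropping columns: {rows_lost_before}")
print(f"Rows lost after dropping columns:   {rows_lost_after}")
```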

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**chart-1 Lets compare bookings trend for two Hotels**

In [None]:
sns.set(style="darkgrid")
sns.countplot(x="hotel", data=data, palette=["r", "y"])
plt.title("Booking counts for two different hotels")
plt.show()

##### 1. Why did you pick the specific chart?

The hotel column has only two categories, and a count plot shows directly how the bookings are split between them.

##### 2. What is/are the insight(s) found from the chart?

A much larger share of bookings belongs to the City hotel.

##### 3. Will the gained insights help creating a positive business impact? 

The majority of bookings go to the City hotel. We can suggest promotional activities to boost bookings for the Resort hotel.
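
A count plot shows absolute counts. If the percentage split is wanted as numbers, `value_counts(normalize=True)` gives it directly. This is a small sketch on invented data, not the notebook's actual result.

```python
import pandas as pd

# Toy stand-in for data['hotel'] (values are invented)
hotel = pd.Series(
    ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"], name="hotel"
)

# normalize=True returns fractions instead of counts; scale to percent
share = (hotel.value_counts(normalize=True) * 100).round(1)
print(share)
```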

**Chart-2 Let's get an overview of the total number of individuals who made hotel reservations.**

In [None]:
data['adults'].groupby(data['hotel']).sum()

In [None]:
# group the data by the 'hotel' column and sum the 'adults' column
grouped_data = data.groupby('hotel')['adults'].sum().reset_index()

# draw the bar chart using seaborn's barplot function
sns.barplot(x='hotel', y='adults', data=grouped_data)

# add appropriate labels to the x-axis, y-axis, and the plot title
plt.xlabel('Hotel')
plt.ylabel('Number of Adults')
plt.title('Number of Adults Who Booked Two Different Hotels')

In [None]:
# Looking into children. 
# Using groupby to group according to hotel types only.
data['children'].groupby(data['hotel']).sum()

In [None]:
# group the data by the 'hotel' column and sum the 'children' column
grouped_data = data.groupby('hotel')['children'].sum().reset_index()

# draw the bar chart using seaborn's barplot function
sns.barplot(x='hotel', y='children', data=grouped_data)

# add appropriate labels to the x-axis, y-axis, and the plot title
plt.xlabel('Hotel')
plt.ylabel('Number of children')
plt.title('Number of children Who Booked Two Different Hotels')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are effective for categorical data, where each category represents a discrete entity. In this case, the categories are the two hotels, and the number of adults and children who booked them is a numerical quantity.

##### 2. What is/are the insight(s) found from the chart?

It seems that the Resort hotel is the preferred choice for guests with children, whereas adults mostly choose the City hotel.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Decision making: The bar chart can be useful for decision-making purposes. For example, if the goal is to increase the number of adults who book Resort Hotel, the chart shows that there is a significant gap between the two hotels, and strategies can be implemented to increase bookings at Resort Hotel.
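
Totals per hotel depend heavily on how many bookings each hotel has, so it can help to look at the average per booking alongside the sum. The sketch below computes both on an invented toy frame; only the column names come from the dataset.

```python
import pandas as pd

# Toy bookings (values are invented)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "adults": [2, 2, 1, 2],
    "children": [0.0, 0.0, 0.0, 1.0],
})

# Sum and mean of adults and children for each hotel.
# The result has two-level column labels such as ("adults", "sum").
per_hotel = toy.groupby("hotel")[["adults", "children"]].agg(["sum", "mean"])
print(per_hotel)
```

In this toy example the City hotel has more adults in total, while the Resort hotel has more children per booking, which is the kind of difference a sum-only chart can hide.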

**Chart - 3 Overview of canceled bookings**

In [None]:
sns.set(style="whitegrid")
# hue colors the bars by another column (is_canceled here); palette sets the color of each hue level
sns.catplot(x="hotel", kind="count", data=data, hue="is_canceled", palette=["r", "y"], height=6, aspect=1.5)
plt.title("Canceled vs. not canceled booking counts for the two hotels")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A grouped count plot is a suitable chart to visualize the distribution of categorical data. In this case, the categorical variable is "is_canceled", which has two possible values: 1 (canceled) and 0 (not canceled).

Here are some reasons why a count plot is a good choice for this kind of visualization:

Easy to interpret: Count plots are simple to interpret and can effectively communicate the distribution of categorical data.

Shows the frequency of each category: A count plot shows the number of occurrences of each category, making it easy to see the relative frequency of 'canceled' and 'not_canceled' bookings.

##### 2. What is/are the insight(s) found from the chart?

From the count plot of the "is_canceled" variable, we can gain the following insights:

Distribution of bookings: The count plot shows that the majority of bookings were not canceled, with over 70,000 bookings not canceled (is_canceled = 0). The number of canceled bookings is approximately half the number of bookings that were not canceled.

Imbalance in the dataset: The count plot highlights an imbalance in the dataset, with significantly more 'not_canceled' bookings than 'canceled' bookings. This imbalance may need to be taken into account when analyzing the data or building predictive models.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The count plot can provide insights into the distribution of the 'is_canceled' variable in the dataset. For example, it can show how many bookings were canceled vs not canceled. If the goal of the analysis is to understand booking cancellations and potentially reduce them, this information could be useful in identifying patterns or factors associated with cancellations.

**Chart - 4 Let's look into cancellation rate among different type of hotel.**

In [None]:
# Count bookings per hotel and cancellation status
canceled_hotel = data.groupby(['hotel', 'is_canceled']).size().reset_index(name='count')
# Express each count as a percentage of that hotel's total bookings
canceled_hotel['percent'] = canceled_hotel['count'] / canceled_hotel.groupby('hotel')['count'].transform('sum') * 100
# Keep only canceled bookings; is_canceled is stored as 1 (canceled) or 0 (not canceled)
canceled_hotel = canceled_hotel[canceled_hotel['is_canceled'] == 1]

sns.barplot(data=canceled_hotel, x='hotel', y='percent').set_title('Cancellation rates in city and resort hotels')

##### 2. What is/are the insight(s) found from the chart?

The bar chart above shows the cancellation percentage for the two hotels. Slightly more than 40% of City Hotel bookings were canceled, whereas the Resort Hotel had a cancellation rate of nearly 25%.
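
Because `is_canceled` is coded as 0 and 1, its mean within a group is that group's cancellation rate. The sketch below uses this shortcut on an invented toy frame as a quick cross-check for the percentages; the values are not taken from the dataset.

```python
import pandas as pd

# Toy bookings: 1 = canceled, 0 = not canceled (values are invented)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones; scale to percent
cancel_rate = toy.groupby("hotel")["is_canceled"].mean().mul(100)
print(cancel_rate)
```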

**Chart - 5 Overview of arrival period**

In [None]:
lst3 = ['hotel', 'arrival_date_year', 'arrival_date_month','arrival_date_day_of_month' ]
period_arrival = data[lst3]
sns.countplot(data = period_arrival, x = 'arrival_date_year', hue = 'hotel')

In [None]:
plt.figure(figsize=(20,5))

sns.countplot(data = period_arrival, x = 'arrival_date_month', hue = 'hotel', order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']).set_title('Graph showing number of arrival per month',fontsize=20)
plt.xlabel('Month')
plt.ylabel('Count')

In [None]:
plt.figure(figsize=(20,5))

sns.countplot(data = period_arrival, x = 'arrival_date_day_of_month', hue = 'hotel').set_title('Graph showing number of arrival per day', fontsize = 20)

##### What is/are the insight(s) found from the chart?

Based on the dataset, it appears that the year 2016 had the highest number of hotel bookings. Additionally, there is an upward trend in bookings in the middle of the year, with August having the highest number of bookings. As August marks the end of summer and the beginning of autumn, it suggests that the summer season is a peak period for hotel bookings.
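
The arrival date is spread over three columns. Combining them into one datetime column makes time-based grouping easier, for example counting arrivals per calendar month across years. This is a sketch on invented rows; the new column name `arrival_date` is an assumption, and the `%B` format code assumes English month names.

```python
import pandas as pd

# Toy rows using the dataset's three arrival-date columns (values are invented)
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016, 2017],
    "arrival_date_month": ["July", "August", "January"],
    "arrival_date_day_of_month": [1, 15, 31],
})

# Build a "2015-July-1" style string, then parse it (%B = full month name)
date_text = (
    toy["arrival_date_year"].astype(str)
    + "-" + toy["arrival_date_month"]
    + "-" + toy["arrival_date_day_of_month"].astype(str)
)
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")

# Arrivals per calendar month, in chronological order
monthly = toy["arrival_date"].dt.to_period("M").value_counts().sort_index()
print(monthly)
```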






**Chart - 6 Let's explore whether there is any difference in the number of bookings made for weekdays versus weekends.**

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(data = data, x = 'stays_in_weekend_nights').set_title('Number of stays on weekend nights', fontsize = 20)


In [None]:
plt.figure(figsize=(20,10))
sns.countplot(data = data, x = 'stays_in_week_nights' ).set_title('Number of stays on weekday night' , fontsize = 20)


##### 2. What is/are the insight(s) found from the chart?

Most stays fall on week nights, which is expected because a week has five week nights and only two weekend nights. The distribution of stays across the days of the month shows no significant pattern and may be random.
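
To compare weekday and weekend stays per booking rather than across two separate charts, the two columns can be combined into a total length of stay and a weekend share. The sketch below uses invented rows; `total_nights` and `weekend_share` are hypothetical column names. Bookings with zero nights get a missing share instead of a division by zero.

```python
import pandas as pd

# Toy rows using the dataset's two stay columns (values are invented)
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1, 0],
    "stays_in_week_nights": [2, 5, 0, 0],
})

total = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
toy["total_nights"] = total

# where() replaces zero totals with NaN so the share is NaN for zero-night bookings
toy["weekend_share"] = toy["stays_in_weekend_nights"] / total.where(total > 0)
print(toy)
```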

**Chart - 7 Type of visitors**

In [None]:
sns.countplot(data = data, x = 'adults', hue = 'hotel').set_title("Number of adults", fontsize = 20)

In [None]:
sns.countplot(data = data, x = 'children', hue = 'hotel').set_title("Number of children", fontsize = 20)

In [None]:
sns.countplot(data = data, x = 'babies', hue = 'hotel').set_title("Number of babies", fontsize = 20)

##### What is/are the insight(s) found from the chart?

It appears that the majority of visitors travel in pairs. Visitors who travel with children show no particular preference for the type of hotel they stay at. However, visitors who bring babies along tend to favor resort hotels.
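
The three guest columns can also be combined to describe each booking as a whole. The sketch below derives a party size and a flag for bookings with children or babies, then compares the share of such bookings per hotel. The data is invented, and `total_guests` and `has_kids` are hypothetical column names.

```python
import pandas as pd

# Toy bookings (values are invented)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "adults": [2, 1, 2, 2, 2, 1],
    "children": [0, 0, 1, 0, 0, 0],
    "babies": [0, 0, 0, 1, 0, 0],
})

# Party size per booking
toy["total_guests"] = toy["adults"] + toy["children"] + toy["babies"]

# True when the booking includes at least one child or baby
toy["has_kids"] = (toy["children"] + toy["babies"]) > 0

# Percentage of bookings with children or babies, per hotel
kids_share = toy.groupby("hotel")["has_kids"].mean().mul(100)
print(kids_share)
```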

**Chart - 8 Looking into market segments and distribution channel**

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(data = data, x = 'market_segment').set_title('Types of market segment', fontsize = 20)

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(data = data, x = 'distribution_channel').set_title('Types of distribution channel', fontsize = 20)

##### 2. What is/are the insight(s) found from the chart?

The majority of bookings come through distribution channels and market segments that involve travel agencies (online or offline).


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We can focus our marketing on these travel agencies' websites and work with them, since the majority of visitors book through them.
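
A cross-tabulation shows how market segments map onto distribution channels, which the two separate count plots cannot. This is a sketch on invented rows; the category labels used here (such as `Online TA` and `TA/TO`) are assumed to match the dataset's labels and should be checked against the unique values printed earlier.

```python
import pandas as pd

# Toy bookings (values are invented; labels are assumed to match the dataset)
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Offline TA/TO", "Direct", "Corporate"],
    "distribution_channel": ["TA/TO", "TA/TO", "TA/TO", "Direct", "Corporate"],
})

# Rows: market segment, columns: distribution channel, cells: booking counts
segment_by_channel = pd.crosstab(toy["market_segment"], toy["distribution_channel"])
print(segment_by_channel)

# Share of bookings that arrived through travel agents or tour operators
agency_share = (toy["distribution_channel"] == "TA/TO").mean() * 100
print(f"TA/TO share: {agency_share:.1f}%")
```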

#### Chart - 9

In [None]:
plt.figure(figsize = (10,5))

top_countries = data['country'].value_counts().head(10)   # ten most frequent countries of origin
sns.barplot(x=top_countries.index, y=top_countries.values)
plt.title("Number of bookings country wise",fontweight="bold", size=20)


#####  What is/are the insight(s) found from the chart?

Most guests come from Portugal and other European countries.

**Chart - 10 Most preferred Room type**

In [None]:
plt.figure(figsize = (20,7))
sns.countplot( x = data['assigned_room_type'])
plt.title('Preferred room types',fontweight="bold", size=20)

##### 2. What is/are the insight(s) found from the chart?

Room types A and D were the most preferred room choices.
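
The chart above counts assigned room types, which can differ from what the guest reserved (see the description of `assigned_room_type`). A quick way to measure how often that happens is to compare the two columns row by row. The sketch below does so on invented data; `room_changed` is a hypothetical column name.

```python
import pandas as pd

# Toy bookings (values are invented)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "E"],
    "assigned_room_type": ["A", "D", "D", "E"],
})

# True when the guest was assigned a different room type than reserved
toy["room_changed"] = toy["reserved_room_type"] != toy["assigned_room_type"]

# Percentage of bookings where the room type changed
changed_share = toy["room_changed"].mean() * 100
print(f"Room type changed in {changed_share:.1f}% of bookings")
```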

#### Chart - 11

In [None]:
plt.figure(figsize=(15,5))
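# Order months chronologically; as plain strings they would be sorted alphabetically
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
adr_data = data.assign(arrival_date_month = pd.Categorical(data['arrival_date_month'], categories = month_order, ordered = True))
sns.lineplot(data = adr_data, x = 'arrival_date_month', y = 'adr', hue = 'hotel',sort = True, marker ='o')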
plt.title('Average daily rate month wise',fontweight ='bold',size =20)

##### 1. Why did you pick the specific chart?

A line plot is handy for visualizing a trend over time.

##### 2. What is/are the insight(s) found from the chart?

The line plot above shows the average daily rate per month for the City Hotel and the Resort Hotel. The Resort Hotel's average daily rate remains low throughout the year except for two months, July and August. Weather may be a factor behind the higher number of visitors at the Resort Hotel: these months are very hot, and people want to spend their time at cooler places, so the resort can be the place for them.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Based on this trend, we can suggest that the City hotel management team offer discounts and water sports activities to attract more tourists to the City hotel.

#### Chart - 14 - Correlation Heatmap

In [None]:
plt.figure(figsize=(20,10))
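# numeric_only=True restricts the correlation to numeric columns; text columns such as hotel cannot be correlated
sns.heatmap(data.corr(numeric_only=True),cmap='magma',linecolor='white',linewidths=1,annot=True)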

##### 1. Why did you pick the specific chart?

A correlation heatmap is a useful tool for exploring the relationships between variables in a dataset, identifying patterns and anomalies, and informing feature selection in predictive modeling.

##### 2. What is/are the insight(s) found from the chart?

The heatmap above shows the pairwise correlation between the numeric variables in our dataset.
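
A full heatmap is dense, so it can help to pull out a single column of the correlation matrix and rank it. The sketch below ranks variables by the strength of their correlation with `is_canceled` on an invented toy frame. It assumes pandas 1.5 or later for the `numeric_only` argument, and the printed values say nothing about the real dataset.

```python
import pandas as pd

# Toy bookings (values are invented)
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time": [5, 20, 150, 200, 10, 90],
    "total_of_special_requests": [2, 1, 0, 0, 3, 0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "Resort Hotel", "City Hotel"],
})

# Correlate numeric columns only, take the is_canceled column,
# drop its self-correlation, and sort by absolute strength
corr_with_target = (
    toy.corr(numeric_only=True)["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```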

#### Chart - 15 - Pair Plot 

In [None]:
sns.pairplot(data,palette='coolwarm')

##### 1. Why did you pick the specific chart?

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns).

##### 2. What is/are the insight(s) found from the chart?

A pairplot is a plot that shows the pairwise relationships between variables in a dataset. It is a combination of scatterplots and histograms, where the diagonal shows the distribution of each variable and the off-diagonal plots show the scatterplots of each pair of variables.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***