<a href="https://colab.research.google.com/github/gunjanjoshi-0798/EDAproject/blob/main/HotelAnalysisFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Gunjan Joshi


# **Project Summary -**

Booking records from a city hotel and a resort hotel have been provided. The goal is to explore and analyze the factors that govern guests' booking choices, so that the hotels can act on them to improve the guest experience and increase profit.

# **GitHub Link -**

HotelAnalysisFinal.ipynb

# **Problem Statement**


**Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions!**
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. Explore and analyze the data to discover important factors that govern the bookings.

#### **Define Your Business Objective?**

The project aims to gain insights into customers' behaviour when booking a hotel. Demand may differ between customer segments, which makes forecasting harder because each segment may need to be treated separately. These insights can guide hotels in adjusting their customer strategies and preparing for the unknown.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
hotel_df = pd.read_csv('/content/drive/MyDrive/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head()

In [None]:
hotel_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Number of rows: {len(hotel_df.axes[0])}")
print(f"Number of columns: {len(hotel_df.axes[1])}")

### Dataset Information

In [None]:
# Dataset Info
hotel_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
(hotel_df.duplicated()).value_counts()

In [None]:
hotel_df.drop_duplicates(inplace = True)

In [None]:
hotel_df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotel_df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize= (10,5))
sns.heatmap(hotel_df.isna().transpose(), cmap = 'coolwarm')
plt.title('Missing Values')
plt.show()

### What did you know about your dataset?

Four columns contain null values, and the company column is null for most rows. The country column also has some nulls; because it holds object (string) data, they have to be replaced with a string placeholder rather than a number. For a reliable analysis or model the dataset should be complete, so these null values need to be handled.
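The null-value check above can be summarized as a count and a percentage per column. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, not the real bookings data); the same calls should apply to `hotel_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR", "FRA"],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adr": [75.0, 98.5, 120.0, 60.0],
})

# Count and percentage of missing values per column
missing_count = toy_df.isnull().sum()
missing_pct = (toy_df.isnull().mean() * 100).round(2)

# Keep only the columns that actually have gaps, worst first
missing_summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .query("missing > 0")
    .sort_values("percent", ascending=False)
)
print(missing_summary)
```

A table like this makes it easier to decide, column by column, whether to fill the gaps or drop the column.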

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_df.columns

In [None]:
# Dataset Describe
hotel_df.describe()

### Variables Description

hotel : city hotel or resort hotel

is_canceled : booking was cancelled(1) or not(0)

lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

arrival_date_year : Year of arrival date

arrival_date_month : Month of arrival date

arrival_date_week_number : Week number of year for arrival date

arrival_date_day_of_month : Day of arrival date

stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

adults : Number of adults

children : Number of children

babies : Number of babies

meal : Type of meal booked. Categories are presented in standard hospitality meal packages

country : Country of origin.

market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)

previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.

assigned_room_type : Code for the type of room assigned to the booking.

booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

deposit_type : Indication on if the customer made a deposit to guarantee the booking.

agent : ID of the travel agency that made the booking

company : ID of the company/entity that made the booking or responsible for paying the booking.

days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer

customer_type : Type of booking, assuming one of four categories

adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

required_car_parking_spaces : Number of car parking spaces required by the customer

total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)

reservation_status : Reservation last status, assuming one of three categories

Canceled : booking was canceled by the customer

Check-Out : customer has checked in but already departed

No-Show : customer did not check-in and did inform the hotel of the reason why

reservation_status_date : Date at which the last status was set

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
pd.Series({col:hotel_df[col].unique() for col in hotel_df})

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
hotel_df_new = hotel_df.copy()

In [None]:
hotel_df_new.info()

In [None]:
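# Assign the result back instead of calling fillna(inplace=True) on a column,
# which can act on a temporary copy in recent pandas versions
hotel_df_new['children'] = hotel_df_new['children'].fillna(0).astype(int)
# Placeholder for missing values in the remaining columns
hotel_df_new['country'] = hotel_df_new['country'].fillna('N/A')
hotel_df_new['agent'] = hotel_df_new['agent'].fillna('N/A')
hotel_df_new['company'] = hotel_df_new['company'].fillna('N/A')
hotel_df_new['days_in_waiting_list'] = hotel_df_new['days_in_waiting_list'].fillna('N/A')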

In [None]:
hotel_df_new.isnull().sum()

In [None]:
hotel_df_new.info()

In [None]:
hotel_df_new.drop(['company'], axis =1, inplace =True)

In [None]:
no_guest = hotel_df_new[hotel_df_new['adults']+hotel_df_new['babies'] + hotel_df_new['children'] == 0]
hotel_df_new.drop(no_guest.index, inplace = True)

### What all manipulations have you done and insights you found?

The checks above showed that data cleaning was required. I first made a copy of the dataset so that the original data remains undisturbed. I filled missing values in children with 0, and in country, agent, company and days_in_waiting_list with 'N/A'. I then dropped the company column because it had the most missing values and was of no use, and dropped the rows in which there were no guests. After this manipulation, the insights I found were:

1. There are two types of hotels which guests could book.
2. Guests come from different countries
3. Guests can book hotel directly or through different channels that are available.
4. Guests can cancel their booking and there are repeated guests also.
5. Guests can choose rooms of their liking while booking.
6. 'adr' could be used to analyse hotel's performance on the basis of revenue.
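The cleaning steps described above can also be gathered into one function that returns a cleaned copy and leaves its input untouched. This is a sketch under assumptions: `clean_bookings` is a hypothetical helper name, the toy rows are made up, and only a subset of the real columns is shown.

```python
import numpy as np
import pandas as pd

def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a bookings frame without changing the input."""
    cleaned = df.drop_duplicates().copy()
    # Missing children counts are treated as zero children
    cleaned["children"] = cleaned["children"].fillna(0).astype(int)
    # Missing country codes get an explicit placeholder
    cleaned["country"] = cleaned["country"].fillna("N/A")
    # The company column is mostly empty, so it is dropped if present
    cleaned = cleaned.drop(columns=["company"], errors="ignore")
    # Remove bookings that list no guests at all
    guests = cleaned["adults"] + cleaned["children"] + cleaned["babies"]
    return cleaned[guests > 0].reset_index(drop=True)

# Made-up rows: one exact duplicate, one zero-guest booking, one with gaps
toy_df = pd.DataFrame({
    "adults":   [2, 2, 0, 1],
    "children": [0.0, 0.0, 0.0, np.nan],
    "babies":   [0, 0, 0, 0],
    "country":  ["PRT", "PRT", "GBR", None],
    "company":  [40.0, 40.0, np.nan, np.nan],
})
clean_df = clean_bookings(toy_df)
print(clean_df)
```

Keeping the steps in a function makes it possible to re-run the whole notebook from the raw file and get the same cleaned frame each time.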

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Preferred hotels by guests

In [None]:
# Chart - 1 visualization code
hotel_name = hotel_df_new.hotel.unique()

In [None]:
hotel_name_df = hotel_df_new.hotel.value_counts()

In [None]:
hotel_name_df.plot.pie(figsize=(9,7), autopct='%1.2f%%', fontsize=15,startangle=50)
plt.title('Hotel Booking Percentage')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

To show the proportions of the two hotel types, with the size of each slice representing each type's share of bookings.

##### 2. What is/are the insight(s) found from the chart?

I found that guests book City Hotel more often than Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No insight here points to negative growth, but stakeholders can focus more on Resort Hotel to get more bookings and increase the overall revenue.

#### Chart - 2 No. of booking(yearly)

In [None]:
# Chart - 2 visualization code
sns.countplot (x= 'arrival_date_year', data=hotel_df_new , hue= 'hotel').set_title ('yearly_bookings')

##### 1. Why did you pick the specific chart?

To understand the booking pattern in hotels on yearly basis.

##### 2. What is/are the insight(s) found from the chart?

From the chart above, I found that the hotels were booked most often in 2016.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that the number of bookings declined after 2016. Hotel owners can investigate what went wrong after 2016 and fix it to increase the number of bookings. One way to do this is to ask guests for feedback.

#### Chart - 3 No. of booking(monthly)

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(15,5))
sns.countplot(x=hotel_df_new['arrival_date_month'],hue= hotel_df_new['hotel'])
plt.title("Number of booking across months")
plt.show()

##### 1. Why did you pick the specific chart?

I had to compare counts across the months, and a bar chart is one of the best choices for that.

##### 2. What is/are the insight(s) found from the chart?

July and August are the busiest months compared with the others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can use this insight to prepare in advance and welcome their guests in the best possible way. They can also run promotional offers in the months with fewer guests to attract more bookings.
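Month names sort alphabetically by default, which can make a monthly chart harder to read. The sketch below shows one way to put the counts into calendar order; `MONTHS` and the sample `arrivals` values are assumptions made up for illustration.

```python
import pandas as pd

# Calendar order for month names
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Made-up arrival months standing in for hotel_df_new['arrival_date_month']
arrivals = pd.Series(["August", "July", "January", "August", "July", "August"])

# Count bookings per month, then reindex into calendar order (0 for absent months)
monthly_counts = arrivals.value_counts().reindex(MONTHS, fill_value=0)

# Label of the month with the highest count
busiest = monthly_counts.idxmax()
print(monthly_counts[monthly_counts > 0])
print("Busiest month:", busiest)
```

In the notebook, the same list could be passed as `order=MONTHS` to `sns.countplot` so that the bars appear from January to December.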

#### Chart - 4 Hotel booking cancellation

In [None]:
# Chart - 4 visualization code
hotel_cancelled = hotel_df_new.is_canceled.value_counts()

In [None]:
hotel_cancelled.plot.pie(figsize = (10,7), autopct = '%1.2f%%', fontsize = 15, startangle = 50)
plt.title('Percentage of Hotel cancellation and non-cancellation')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

To show the share of cancelled and non-cancelled bookings using a pie chart.

##### 2. What is/are the insight(s) found from the chart?

Around 72.48% of bookings are not cancelled and 27.52% are cancelled by guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can offer rescheduling instead of cancellation and set a flexible cancellation policy to reduce booking cancellations.
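Because `is_canceled` is a 0/1 flag, its mean is the share of cancelled bookings, which also makes it easy to compare the two hotel types. A minimal sketch on made-up rows (the numbers are illustrative assumptions, not results from the dataset):

```python
import pandas as pd

# Made-up bookings; 1 means the booking was cancelled
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 1, 0, 0],
})

# Mean of a 0/1 flag is the fraction of cancelled bookings
overall_rate = toy_df["is_canceled"].mean() * 100

# Same calculation per hotel type, as a percentage
rate_by_hotel = toy_df.groupby("hotel")["is_canceled"].mean().mul(100).round(2)

print(f"Overall cancellation rate: {overall_rate:.2f}%")
print(rate_by_hotel)
```

Splitting the rate by hotel type shows whether a cancellation policy change is needed at both hotels or mainly at one.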

#### Chart - 5 Stays in weekdays and weekends

In [None]:
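# Chart - 5 visualization code
# Use the real stay columns instead of random numbers
stay_cols = ['stays_in_week_nights', 'stays_in_weekend_nights']

In [None]:
# A separate variable keeps hotel_df_new intact for the later charts
stay_df = hotel_df_new[stay_cols]

In [None]:
stay_df.plot(style=['o','rx'])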

##### 1. Why did you pick the specific chart?

To study the relationship between stays on week nights and stays on weekend nights.

##### 2. What is/are the insight(s) found from the chart?

Most of the time, people prefer to book weekend nights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No negative impact can be seen, because people usually tend to go on vacation during weekends rather than on weekdays. Owners could run offers to attract guests on weekdays as well.
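One way to check a weekday versus weekend preference numerically is to compare booked nights per available night type. The sketch below uses made-up rows, and dividing by 5 week nights and 2 weekend nights is an assumption about how to make the two columns comparable.

```python
import pandas as pd

# Made-up stays standing in for the two stay columns of hotel_df_new
toy_df = pd.DataFrame({
    "stays_in_week_nights":    [2, 5, 0, 3],
    "stays_in_weekend_nights": [1, 2, 2, 0],
})

# Total nights booked of each kind
night_totals = toy_df[["stays_in_week_nights", "stays_in_weekend_nights"]].sum()

# Assumption: normalize by how many nights of each kind a week contains
nights_per_week = pd.Series({"stays_in_week_nights": 5,
                             "stays_in_weekend_nights": 2})
per_night = night_totals / nights_per_week

# Share of bookings that cover weekend nights only
weekend_only = (toy_df["stays_in_week_nights"] == 0) & (toy_df["stays_in_weekend_nights"] > 0)
weekend_only_share = weekend_only.mean()

print(per_night)
print("Weekend-only share:", weekend_only_share)
```

If the per-night figure for weekends is higher than for week nights on the real data, that would support running weekday offers.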

#### Chart - 6 Number of bookings made by individuals from different countries

In [None]:
# Chart - 6 visualization code
sns.barplot(y= list(hotel_df_new.country.value_counts().head(10)), x= list(hotel_df_new.country.value_counts().head(10).index))

In [None]:
ax= hotel_df_new.country.value_counts().head(10).plot (kind= 'bar');
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() , p.get_height()))

##### 1. Why did you pick the specific chart?

Here I compared the guests coming from different countries.

##### 2. What is/are the insight(s) found from the chart?

From the above chart I found that most guests come from PRT (Portugal), around 27453.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight. Knowing that most guests come from Portugal, hotels can add more offerings tailored to Portuguese guests.

#### Chart - 7 Meal preferred by guests

In [None]:
# Chart - 7 visualization code
meal_count = hotel_df_new.meal.value_counts()

In [None]:
# Take the names from the value_counts index so each name lines up with its own count
# (unique() returns order of appearance, which does not match the sorted counts)
meal_name = meal_count.index

In [None]:
meal_df = pd.DataFrame(zip(meal_name, meal_count), columns = ['meal name', 'meal count'])
m = sns.barplot(data = meal_df, x = 'meal name', y = 'meal count')
plt.title('Most preferred meal type')
plt.show()

##### 1. Why did you pick the specific chart?

There were 4 values to compare in meal, and bar graphs are used to compare values between groups, which is why I used this chart.

##### 2. What is/are the insight(s) found from the chart?

The chart above shows that BB (Bed and Breakfast) is the most preferred meal type among guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Hotel owners now know that BB (Bed and Breakfast) is the most preferred meal type, so they can arrange raw materials for this meal in advance and serve it without delay.

#### Chart - 8 Preferred Room Type by Guests

In [None]:
# Chart - 8 visualization code

sns.countplot(x=hotel_df_new['reserved_room_type'],order=hotel_df_new['reserved_room_type'].value_counts().index)
plt.title('Preferred Room Type by Guests')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars whose heights are proportional to the values they represent. Here it shows which room types are most preferred by guests.

##### 2. What is/are the insight(s) found from the chart?

The chart above shows that room type A is the most preferred by guests when booking.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since room type A is clearly the most used, the hotel should increase the number of type A rooms to maximize revenue.

#### Chart - 9 Top 5 agents in terms of booking

In [None]:
# Chart - 9 visualization code
# Count bookings per agent and sort from most to fewest
agents = hotel_df_new.groupby('agent').size().reset_index(name='Booking Count').sort_values(by = 'Booking Count', ascending = False)
top_5 = agents[:5]


In [None]:
explode = (0.02,0.02,0.02,0.02,0.02)
colors = ( "orange", "cyan", "brown", "indigo", "beige")


In [None]:
def func(pct, allvalues):
    absolute = int(pct / 100.*np.sum(allvalues))
    return "{:.1f}%\n({:d} bookings)".format(pct, absolute)
fig, ax = plt.subplots(figsize =(15, 7))
wedges, texts, autotexts = ax.pie(top_5['Booking Count'], autopct = lambda pct: func(pct, top_5['Booking Count']),explode = explode, shadow = False,colors = colors,startangle = 50)

# Adding legend
ax.legend(wedges, top_5['agent'],title ="agents",loc ="upper left",bbox_to_anchor =(1, 0, 0.5, 1))

plt.setp(autotexts, size = 8, weight ="bold")
ax.set_title("Top 5 agents in terms of booking", fontsize = 17)

plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart here helps organize and show data of top 5 agents in terms of booking.

##### 2. What is/are the insight(s) found from the chart?

Agent number 9 has made the most bookings, followed by agents 240, 0, 14 and 7.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The hotel can offer these agents a bonus to reward and motivate them, which should help increase revenue.

#### Chart - 10 Percentage of Repeated Guests

In [None]:
# Chart - 10 visualization code
rep_guests = hotel_df_new['is_repeated_guest'].value_counts()

rep_guests.plot.pie(autopct='%1.2f%%', explode=(0.00,0.09), figsize=(15,6), shadow=False)

plt.title('Percentage of Repeated Guests')

plt.axis('equal')

plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is used to show the share of guests who return.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that 3.86% of guests are repeated guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The number of repeated guests is very low, which is a negative sign for the hotels' growth. Hotels can offer loyalty discounts to increase the number of repeated guests.

#### Chart - 11 Market segment share in booking

In [None]:
# Chart - 11 visualization code
plt.figure(figsize = (15,5))
sns.countplot(x=hotel_df_new['market_segment'], order = hotel_df_new['market_segment'].value_counts().index)
plt.title('Market segment share in booking')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars whose heights are proportional to the values they represent. Here it shows the number of bookings for each market segment.

##### 2. What is/are the insight(s) found from the chart?

Online TA accounts for the largest number of bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No negative insights were found. People usually prefer to book online before going to a hotel, so hotels should offer an online booking facility for guests.

#### Chart - 12 Most used deposit type

In [None]:
# Chart - 12 visualization code
# Order the bars from most to least used deposit type
deposit_order = hotel_df_new['deposit_type'].value_counts().index

sns.countplot(x=hotel_df_new['deposit_type'], order= deposit_order)
plt.title('Most used deposit type')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows which deposit type is most used by guests.

##### 2. What is/are the insight(s) found from the chart?

Bookings with no deposit are the most common, which suggests guests prefer hotels with a no-deposit policy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should have guest-friendly deposit policies to attract more guests.

#### Chart - 13 Percentage of daily revenue by each hotel type

In [None]:
# Chart - 13 visualization code
# Sum adr per hotel type; count() would only give the number of bookings, not revenue
most_rev = hotel_df_new.groupby('hotel')['adr'].sum()

most_rev.plot.pie(autopct='%1.2f%%')

plt.title('Percentage of daily revenue by each hotel type')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

A pie plot is used to show which hotel type generates more revenue.

##### 2. What is/are the insight(s) found from the chart?

From the chart above, it is clear that City Hotel has a larger share of revenue than Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Owners could improve the service at Resort Hotel so that more people stay there, which would increase its revenue.
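Since adr is a per-night rate, a closer estimate of revenue multiplies it by the number of nights and leaves out cancelled bookings. The sketch below illustrates that idea; `est_revenue` is a hypothetical column, the rows are made up, and treating adr times nights as revenue is an assumption.

```python
import pandas as pd

# Made-up bookings with the columns needed for a revenue estimate
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [100.0, 80.0, 120.0],
    "stays_in_week_nights": [2, 1, 3],
    "stays_in_weekend_nights": [1, 0, 2],
    "is_canceled": [0, 1, 0],
})

# Assumption: revenue of a booking is its daily rate times its total nights
nights = toy_df["stays_in_week_nights"] + toy_df["stays_in_weekend_nights"]
toy_df["est_revenue"] = toy_df["adr"] * nights

# Cancelled bookings are excluded because they bring in no lodging revenue here
realised = toy_df[toy_df["is_canceled"] == 0]
revenue_by_hotel = realised.groupby("hotel")["est_revenue"].sum()
revenue_share = (revenue_by_hotel / revenue_by_hotel.sum() * 100).round(2)
print(revenue_share)
```

On the real data, `revenue_share.plot.pie(autopct='%1.2f%%')` would draw the same kind of pie chart from this estimate.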

#### Chart - 14 ADR across different months

In [None]:
bookings_months=hotel_df_new.groupby(['arrival_date_month','hotel'])['adr'].mean().reset_index()

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

bookings_months['arrival_date_month']=pd.Categorical(bookings_months['arrival_date_month'],categories=months,ordered=True)

bookings_months=bookings_months.sort_values('arrival_date_month')
bookings_months

In [None]:
plt.figure(figsize=(15,5))

sns.lineplot(x=bookings_months['arrival_date_month'],y=bookings_months['adr'],hue=bookings_months['hotel'])

plt.title('ADR across each month')
plt.xlabel('Month Name')
plt.ylabel('ADR')
plt.show()

City Hotel : City Hotel's ADR is highest in May compared with other months.

Resort Hotel : Resort Hotel's ADR is highest between July and August.

Stakeholders could prepare in advance for these peak months, as they bring in the most revenue per night.

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,10))

sns.heatmap(hotel_df_new.corr(numeric_only=True),annot=True)
plt.title('Correlation of the columns')

plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was used to find potential relationships between variables and to understand the strength of these relationships.

##### 2. What is/are the insight(s) found from the chart?

1. lead_time and total_stay are positively correlated. That means bookings with longer stays tend to be made further in advance.
2. adults, children and babies are correlated with each other. That means the more people in a booking, the higher the adr.
3. is_repeated_guest and previous_bookings_not_canceled are strongly correlated. That means repeated guests tend not to cancel their bookings.
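The insights above mention total stay and total people, which are not columns in the raw dataset. A minimal sketch of how such columns could be derived and correlated follows; the names `total_stay` and `total_people` and all sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the raw columns needed for the derived features
toy_df = pd.DataFrame({
    "lead_time": [10, 40, 90, 5, 60],
    "stays_in_week_nights": [1, 3, 5, 1, 4],
    "stays_in_weekend_nights": [0, 1, 2, 0, 2],
    "adults": [1, 2, 2, 1, 2],
    "children": [0, 0, 2, 0, 1],
    "babies": [0, 0, 0, 0, 1],
    "adr": [60.0, 95.0, 150.0, 55.0, 130.0],
})

# assign() returns a new frame, so toy_df itself is left unchanged
derived = toy_df.assign(
    total_stay=toy_df["stays_in_week_nights"] + toy_df["stays_in_weekend_nights"],
    total_people=toy_df["adults"] + toy_df["children"] + toy_df["babies"],
)

# Pearson correlation between the derived features, lead time and adr
corr = derived[["lead_time", "total_stay", "total_people", "adr"]].corr()
print(corr.round(2))
```

Adding the derived columns before calling `corr()` lets the heatmap show them directly instead of their separate components.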

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(hotel_df_new)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot allows us to see both the distribution of single variables and the relationships between two variables. The chart above shows the relationship of every column with every other column.

##### 2. What is/are the insight(s) found from the chart?

From the above pair plot we can see that as cancellations increase, total stay decreases. As the total number of people increases, adr also increases, so adr and total people are positively related.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. City Hotel is the most preferred, so owners can offer discounts on Resort Hotel to increase its bookings.
2. Around 27.52% of bookings are cancelled, so hotels can offer a loyalty discount to guests who don't cancel their booking.
3. Hotels can stock raw materials for the BB meal type in advance to avoid delays, as BB (Bed and Breakfast) is the most preferred meal.
4. Hotels should increase the number of rooms in City Hotel to decrease the waiting time.
5. Online TA has the most bookings of all market segments, so hotels could run offers to get more bookings from other segments.
6. Room type A is the most preferred by guests, so hotels should increase the number of type A rooms.
7. The number of repeated guests is low, which indicates that there is something guests don't like about the hotels; that needs to be fixed to increase the number of repeated guests.
8. The maximum number of guests were from Portugal.

# **Conclusion**

In order to achieve the business objective, I would suggest that the owners make pricing dynamic and introduce offers and packages to attract new guests. To retain existing guests and encourage repeat visits, the owners should introduce a loyalty points program whose points can be redeemed on the guests' next bookings.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***