<a href="https://colab.research.google.com/github/justmonis/Hotel-Booking-EDA-Project./blob/main/Hotel_Booking_EDA_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PROJECT NAME**    - **HOTEL BOOKING ANALYSIS**



##### **Project Type**    - EDA
##### **Contribution**    - **Individual**
Name: **Monis Ahmad**

# **Project Summary -**

1. The Hotel Bookings Analysis project entails examining a real-world dataset spanning from 2015 to 2017, containing booking data from both city and resort hotels. The process involves data cleaning, analysis, manipulation, and visualization. Initially, I imported necessary libraries with specific aliases in Google Colab, then mounted my Google Drive with the Colab notebook. Subsequently, I imported the downloaded hotel booking CSV file and utilized it as a DataFrame using the Pandas library. Data analysis and visualization will employ various Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn.

2. The dataset comprises 119,390 rows and 32 columns with object, integer, and float data types. It also includes 31,994 duplicate rows, which are removed, leaving 87,396 rows. Column-by-column analysis shows that 'company', 'agent', 'country', and 'children' contain null values. The 'company' and 'agent' columns have the most null values and are dropped. Null values in the 'country' and 'children' columns are replaced with the mode and the mean, respectively. Further analysis describes each variable and identifies its unique values. Data wrangling follows: outliers in 'lead_time', 'adr', and 'days_in_waiting_list' are capped at the interquartile-range bounds. Additional manipulation adds a 'total_stay' column from 'stays_in_weekend_nights' and 'stays_in_week_nights', and a 'total_guests' column from 'adults', 'children', and 'babies'.

3. Subsequently, visualization techniques will be employed to understand the relationships between variables. Univariate, Bivariate, and Multivariate analyses will be conducted using bar, count, pie, and other charts.

4. Finally, the correlation between different variables will be determined using a heatmap, revealing the interrelationships within the dataset.

# **GitHub Link -**

______

# **Problem Statement**


Exploring the optimal time of year to book a hotel room, or determining the best length of stay for the most favorable daily rate, are questions many have considered. What if you wanted to forecast whether a hotel might experience an unusually high number of special requests? This hotel booking dataset provides a window into these inquiries! It encompasses booking details for both a city hotel and a resort hotel, including information such as booking dates, length of stay, count of adults, children, and/or infants, and availability of parking spaces, among other variables. Notably, all personally identifiable information has been removed from the dataset. Dive into the data and analyze it to uncover the key factors influencing hotel bookings.



#### **Define Your Business Objective?**


We are tasked with analyzing the provided hotel booking dataset to extract valuable insights for the business. These insights hold the potential to foster business growth, increase revenue generation, and mitigate losses.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
#mounting google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path='/content/drive/MyDrive/Hotel Bookings.csv'
df=pd.read_csv(path)
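
The guidelines above ask for exception handling, so the load step could be wrapped as in the sketch below. This is only an illustration: `load_bookings` is a hypothetical helper name, not part of the original notebook, and it assumes the file is a plain CSV.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV and fail with a clear message if it is missing or empty.

    `load_bookings` is a hypothetical helper used for illustration only.
    """
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        # A missing file usually means Drive is not mounted or the name is wrong.
        raise FileNotFoundError(
            f"Dataset not found at {csv_path}; check the Drive mount and the file name."
        )
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # read_csv raises EmptyDataError when the file has no columns to parse.
        raise ValueError(f"{csv_path} contains no data.") from exc
```

In the notebook this would be called as `df = load_bookings(path)` in place of the direct `pd.read_csv(path)` call.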


### Dataset First View

In [None]:
# Dataset First Look
#get top 5 rows
df.head()

In [None]:
#get last 5 rows
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

The given dataset has **119390 Rows** and **32 Columns**

### Dataset Information

In [None]:
# Dataset Info
df.info()

The info table shows that several columns contain null/missing values.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()

The dataset contains 31994 duplicate rows

In [None]:
#creating a copy of data set
df1=df.copy()

In [None]:
#dropping all duplicate rows
df1.drop_duplicates(inplace=True)


In [None]:
#check shape after dropping duplicate rows
df1.shape

After dropping all duplicate rows, **87396 Rows** and **32 Columns** remain. Next, we look for missing/null values.
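
A quick consistency check can confirm that the number of rows dropped equals the number of duplicates counted. The sketch below uses a small made-up frame rather than the real dataset, and `drop_duplicates_checked` is a hypothetical helper name.

```python
import pandas as pd


def drop_duplicates_checked(frame):
    """Drop duplicate rows and verify the row counts add up (hypothetical helper)."""
    n_duplicates = int(frame.duplicated().sum())
    deduped = frame.drop_duplicates()
    # Rows before minus duplicates must equal rows after.
    assert len(frame) - n_duplicates == len(deduped)
    return deduped, n_duplicates


# Toy data: rows 1 and 3 repeat row 0.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 5, 10],
})
deduped, n_duplicates = drop_duplicates_checked(toy)
```

Applied to this notebook, the same check would be `len(df) - df.duplicated().sum() == len(df1)`.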

#### Missing Values/Null Values

In [None]:
#check whether contains null value or not
df1.isnull().any()

This means that there are null values present in our data frame

In [None]:
# Missing Values/Null Values Count
print('Null Value Count in Each Column')
print('-'*40)
print(df1.isnull().sum().sort_values(ascending=False).head(5))
print('-'*40)
print('Percentage of null values')
print('-'*40)
(df1.isnull().sum()*100/len(df1)).sort_values(ascending=False).head(5)

The company, agent, country, and children columns contain null values. The company and agent columns have the most, so we will drop these two columns in later steps and replace the null values in the country and children columns.
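
The count and the percentage of nulls can also be shown side by side in one table. The following is a minimal sketch on made-up data; `missing_summary` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return null count and null percentage for every column that has nulls."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts * 100 / len(frame)).round(2),
    })
    # Keep only columns with at least one null, worst first.
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)


# Toy data standing in for the bookings frame.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "children": [0.0, np.nan, 1.0, 2.0],
    "adults": [2, 2, 1, 2],
})
summary = missing_summary(toy)
```

Calling `missing_summary(df1)` would give the same information as the two printed series above in a single frame.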


In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,7))
sns.heatmap(df1.isnull(),yticklabels=False,cbar=False,cmap='crest')
plt.show()

The heatmap visualises the missing values. The large amount of missing data in the agent and company columns is clearly visible.

### What did you know about your dataset?

The dataset gives booking information for two types of hotel:
1.   City Hotel
2.   Resort Hotel
* It gives booking details such as when the customer booked, the year, month, and week number of arrival, the total length of stay, the type of meal chosen, and the type of room allotted.
* It also gives some customer information, such as the customer type, the customer's country, whether they travelled alone or in a group, and how many children/babies were with them.
* It also records how the customer booked the hotel and whether they made any special requests.
* It also provides information about the revenue generated by the hotel.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description

1. hotel: Type of hotel (City Hotel or Resort Hotel)
2. is_canceled: Whether the booking was canceled (1) or not (0)
3. lead_time: Number of days in advance the customer booked the hotel
4. arrival_date_year: Year of arrival date
5. arrival_date_month: Month of arrival date
6. arrival_date_week_number: Week number of arrival date
7. arrival_date_day_of_month: Day of arrival date
8. stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
9. stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
10. adults: Number of adults
11. children: Number of children
12. babies: Number of babies
13. meal: Kind of meal the customer opted for at the hotel
    * BB: Bed & Breakfast
    * FB: Full Board (Breakfast, Lunch and Dinner)
    * HB: Half Board (Breakfast and Dinner normally)
    * SC/Undefined: no meal opted
14. country: Code of the customer's country
15. market_segment: Market segment the customer belongs to
16. distribution_channel: How the customer made the booking (direct/TA/TO)
17. is_repeated_guest: Whether the guest is coming for the first time (0) or not (1)
18. previous_cancellations: Number of bookings canceled by the customer prior to the current booking
19. previous_bookings_not_canceled: Number of previous bookings the customer did not cancel
20. reserved_room_type: Type of room reserved by the customer
21. assigned_room_type: Type of room assigned to the customer
22. booking_changes: Number of changes made to the booking
23. deposit_type: Deposit type opted for by the customer
24. agent: ID of the travel agent who made the booking
25. company: ID of the company that made the booking
26. customer_type: Type of customer
    * Transient: the booking is not part of a group or contract, and is not associated with other transient bookings
    * Contract: the booking has some type of contract associated with it
    * Transient_party: the booking is transient but is associated with at least one other transient booking
    * Group: the booking is associated with a group
27. days_in_waiting_list: Number of days the customer had to wait for the booking to be confirmed
28. adr: A hotel’s ADR (Average Daily Rate) is the average rate paid per occupied room at the property
29. required_car_parking_spaces: Number of car parking spaces required
30. total_of_special_requests: Number of additional special requests made by the customer
31. reservation_status: Reservation status, i.e. whether the customer checked in, canceled, or did not show
32. reservation_status_date: Date on which the last reservation status was set





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df1.columns:
  print(f'\033[1m Unique values of {col} .\033[0m ',df1[col].unique())

In [None]:
# Checking Unique Values for hotel column.
df1['hotel'].unique()

Two types of hotel in dataset

In [None]:
#unique value of is_canceled column
list(df1['is_canceled'].unique())

1 means that booking was cancelled and 0 means not cancelled

In [None]:
#unique values of arrival_date_year column
df1['arrival_date_year'].unique()

The data covers three years of bookings (2015 to 2017)

In [None]:
#unique values of stays_in_week_nights column
df1['stays_in_week_nights'].unique()

Number of week nights Monday to Friday the guest stayed or booked to stay at the hotel

In [None]:
#unique values of stays_in_weekend_nights column
df1['stays_in_weekend_nights'].unique()

Number of weekend nights (Saturday/Sunday) the guest stayed or booked to stay at the hotel

In [None]:
#unique values of adults column
df1['adults'].unique()

Number of adults

In [None]:
#unique values of children column
df1['children'].unique()

**The children column contains null values, so we will replace them with the mean value**

In [None]:
# only a few values in children are NaN, so we replace them with the column mean
df1['children']=df1['children'].fillna(df1['children'].mean())

In [None]:
#checking again nan value in children column
df1['children'].isnull().value_counts()

In [None]:
#unique values of babies column
df1['babies'].unique()

Number of Babies


In [None]:
#unique values of meal column
df1['meal'].unique()

Type of meal opted by the customer

In [None]:
#unique values of country column
df1['country'].unique()

Code of the customer's country. The column also contains **nan/missing values**

In [None]:
#replacing all null values present in country column with mode
df1['country']=df1['country'].fillna(df1['country'].mode()[0])
#checking again for null values
df1.country.isnull().value_counts()
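
The two imputation rules used here (mean for a numeric column, mode for a categorical one) can be expressed as one small function. This is a sketch on toy data under that assumption; `impute_column` is a hypothetical helper, and other strategies such as the median may suit count-like columns better.

```python
import numpy as np
import pandas as pd


def impute_column(series, strategy):
    """Fill nulls with the column mean or mode (hypothetical helper)."""
    if strategy == "mean":
        fill_value = series.mean()
    elif strategy == "mode":
        # mode() returns a Series because ties are possible; take the first value.
        fill_value = series.mode()[0]
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    # Assign the result back instead of using inplace=True on a column selection.
    return series.fillna(fill_value)


toy = pd.DataFrame({
    "children": [0.0, 2.0, np.nan, 1.0],
    "country": ["PRT", "GBR", "PRT", None],
})
toy["children"] = impute_column(toy["children"], "mean")
toy["country"] = impute_column(toy["country"], "mode")
```

Assigning the filled column back (`df1[col] = ...`) avoids relying on in-place modification of a selected column, which newer pandas versions warn about.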

In [None]:
#unique values of market_segment column
df1['market_segment'].unique()

In [None]:
#unique values of distribution_channel column
df1['distribution_channel'].unique()

In [None]:

#unique values of is_repeated_guest column
df1['is_repeated_guest'].unique()

Whether the guest has stayed before: 0 means first visit, 1 means repeated customer

In [None]:
#unique values of previous_cancellations column
df1['previous_cancellations'].unique()

Number of bookings cancelled by the customer prior to the current booking

In [None]:
#unique values of previous_bookings_not_canceled column
df1['previous_bookings_not_canceled'].unique()

Number of previous bookings the customer did not cancel

In [None]:
#unique values of reserved_room_type column
df1['reserved_room_type'].unique()

Types of room reserved by the customer

In [None]:
#unique values of assigned_room_type column
df1['assigned_room_type'].unique()

Type of room assigned to the customer

In [None]:
#unique values of booking_changes column
df1['booking_changes'].unique()

Number of times the customer changed the booking

In [None]:
#unique values of deposit_type column
df1['deposit_type'].unique()

**The agent and company columns contain too many null values, so we will drop these two columns from the dataset**

In [None]:
#dropping agent and company columns
df1.drop(['agent','company'],axis=1,inplace=True)

In [None]:
#unique values of adr column
df1['adr'].unique()

In [None]:
#unique values of customer_type column
df1['customer_type'].unique()

* Transient :when the booking is not part of a group or contract, and is not associated to other transient booking
* Contract:when the booking has any type of contract associated with it
* Transient_party:when the booking is transient but is associated with at least another transient booking
* Group: when the booking is associated with a group


In [None]:
#unique values of days_in_waiting_list column
df1['days_in_waiting_list'].unique()

Number of days the customer had to wait for the booking to be confirmed

In [None]:
#unique values of required_car_parking_spaces column
df1['required_car_parking_spaces'].unique()

Number of parking spaces required

In [None]:
#unique values of total_of_special_requests column
df1['total_of_special_requests'].unique()

Number of additional special requests made

In [None]:
#unique values of reservation_status column
df1['reservation_status'].unique()

In [None]:
#after filling the null values and dropping the agent and company columns, we visualize
#the remaining null values using a heatmap
sns.heatmap(df1.isnull(),cbar=False,cmap='RdGy',yticklabels=False)
plt.show()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
df1['lead_time'].dtype

In [None]:
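# split the columns into categorical, discrete and continuous groups
categorical_col=[]
discrete_col=[]
continous_col=[]
for col in df1.columns:
  if df1[col].dtype=='object':          # text columns are stored with the object dtype
    categorical_col.append(col)
  elif len(df1[col].unique())<=10:      # numeric columns with at most 10 distinct values
    discrete_col.append(col)
  else:
    continous_col.append(col)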
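
The same grouping can be sketched with pandas' dtype helpers instead of a string comparison on `dtype`. This is a variation rather than the notebook's exact rule: it treats every non-numeric column as categorical, and `classify_columns` and `max_discrete_levels` are hypothetical names.

```python
import pandas as pd


def classify_columns(frame, max_discrete_levels=10):
    """Group column names into categorical, discrete and continuous lists."""
    groups = {"categorical": [], "discrete": [], "continuous": []}
    for col in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            groups["categorical"].append(col)
        elif frame[col].nunique() <= max_discrete_levels:
            # nunique() ignores NaN, unlike len(unique()).
            groups["discrete"].append(col)
        else:
            groups["continuous"].append(col)
    return groups


# Toy frame with one column of each kind.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel"] * 10,
    "is_canceled": [0, 1] * 10,
    "lead_time": list(range(20)),
})
groups = classify_columns(toy)
```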



In [None]:
# Write your code to make your dataset analysis ready.
#finding out outlier in each column
plt.rc('font', size=15)
plt.figure(figsize=(30,15))
sns.boxplot(data=df1[['is_canceled','lead_time','arrival_date_week_number','adults','babies','previous_cancellations','booking_changes','days_in_waiting_list','adr','total_of_special_requests']])
plt.show()

In [None]:
#The boxplots above show pronounced outliers mainly in these columns:
#lead_time, adr, days_in_waiting_list
plt.rc('font', size=20)
plt.figure(figsize=(30,15))
sns.boxplot(data=df1[['lead_time','adr','days_in_waiting_list']])
plt.show()

In [None]:
#compute the IQR-based lower and upper bounds used to cap outliers
def removing_outlier(col):
  q1,q3=col.quantile([0.25,0.75])  # first and third quartiles (25th and 75th percentiles)
  IQR=q3-q1                          # interquartile range
  lwr_bound=q1-(1.5*IQR)
  upr_bound=q3+(1.5*IQR)
  return lwr_bound,upr_bound

In [None]:
#cap outliers in lead_time, adr and days_in_waiting_list at the IQR bounds
#(values beyond a bound are replaced by the bound; no rows are dropped)
for col in ['lead_time','adr','days_in_waiting_list']:
  low,high=removing_outlier(df1[col])
  df1[col]=np.where(df1[col]> high,high,df1[col])
  df1[col]=np.where(df1[col]< low,low,df1[col])
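
The two `np.where` calls per column cap values at the IQR bounds rather than dropping rows. As a sketch, the same capping can be written with `Series.clip`; `cap_outliers_iqr` and the multiplier `k` are hypothetical names, and the data is a toy series.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; rows are kept, only values change."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy series: Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(toy)
```

Here the value 100 is pulled down to the upper bound 7 while the other four values are unchanged.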

In [None]:
#boxplots after capping the outliers
sns.set_style('whitegrid')
plt.rc('font',size=20)
plt.figure(figsize=(10,5))
sns.boxplot(data=df1[['lead_time','adr','days_in_waiting_list']])
plt.show()

In [None]:
# Adding 'total_stay' column using 'stays_in_weekend_nights' & 'stays_in_week_nights' columns
df1['total_stay'] = df1['stays_in_weekend_nights'] + df1['stays_in_week_nights']

In [None]:
#adding 'total_guests' column using 'adults','children' and 'babies' columns
df1['total_guests']=df1['adults']+df1['children']+df1['babies']

In [None]:
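#adding revenue column (length of stay multiplied by average daily rate)
#note: adr is read from the original df, i.e. before outlier capping; rows are matched on the index
df1['revenue']=df1['total_stay']*df['adr']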

In [None]:
#meal contains an 'Undefined' category, which is the same as 'SC', so we combine them
df1['meal']=df1['meal'].replace('Undefined','SC')

In [None]:
#find out most common meal type in percentage
df1.meal.value_counts(normalize=True)*100

Bed & Breakfast is the most preferred meal type

In [None]:
#most common hotel booked in %age
df1.hotel.value_counts(normalize=True)*100

The City Hotel is booked more often than the Resort Hotel

In [None]:
# year in which the most customers arrived / peak year for booking
df1.arrival_date_year.value_counts(normalize=True)*100

So 2016 is the peak year of booking

In [None]:
# peak month of booking, when the most customers arrived
df1.arrival_date_month.value_counts()

August is the peak month of booking

In [None]:
#from which country maximum guest arrived
df1.country.value_counts()

Portugal, Great Britain, France, and Spain are the countries from which the most customers arrived

In [None]:
#most common customer type
df1.customer_type.value_counts(normalize=True)*100

Transient customers made the most bookings

In [None]:
#is repeated guest or not
df1.is_repeated_guest.value_counts(normalize=True)*100

Most guests are visiting the hotel for the first time

In [None]:
#which room is mostly assigned to customers
df1.assigned_room_type.value_counts(normalize=True)*100

Room type A is the one most often assigned to customers

In [None]:
#which hotel has highest booking cancelation
cancel=df1[df1['is_canceled']==1].groupby('hotel')
x1=pd.DataFrame(cancel.size()).rename(columns={0:'canceled_booking'})
total_booking=df1.groupby('hotel')
x2=pd.DataFrame(total_booking.size()).rename(columns={0:'total_booking'})
result=pd.concat([x1,x2],axis=1)
result['%age_cancelation']=round((result['canceled_booking']/result['total_booking'])*100,2)
result
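
Because `is_canceled` is coded as 0/1, the same cancellation table can be built in one grouped aggregation. The sketch below uses toy data, and the output column names are chosen for illustration.

```python
import pandas as pd

# Toy data: 2 of 4 City Hotel bookings and 1 of 4 Resort Hotel bookings are canceled.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 0, 1, 0, 0, 0, 0, 1],
})

# Summing a 0/1 flag counts the canceled bookings; count gives all bookings.
summary = toy.groupby("hotel")["is_canceled"].agg(
    canceled_booking="sum",
    total_booking="count",
)
summary["pct_cancelation"] = (
    summary["canceled_booking"] / summary["total_booking"] * 100
).round(2)
```

The percentage column is a within-hotel cancellation rate, which is a different quantity from each hotel's share of all cancellations.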

In [None]:
# average adr of each hotel type
avg_adr = total_booking['adr'].mean().reset_index().rename(columns = {'adr':'avg_adr'})
avg_adr

### What all manipulations have you done and insights you found?



* First, I looked for outliers in each column and found that the lead_time, adr, and days_in_waiting_list columns have the largest outliers. I then capped the outliers in each of these columns at the IQR bounds.
* I added a 'total_stay' column using the 'stays_in_weekend_nights' & 'stays_in_week_nights' columns and a 'total_guests' column using the 'adults', 'children' and 'babies' columns.
* Then I analysed the columns and found that:
    * Around 77% of customers opted for the BB meal type
    * Around 61% of customers booked the City Hotel, which also has the higher average adr
    * 2016 and August were the peak year and month for bookings, respectively
    * The largest number of customers came from Portugal, followed by Great Britain, France and Spain
    * Around 72% of customers were of the Transient type
    * Around 97% of customers were booking the hotel for the first time
    * Room type A was assigned to around 52% of customers
    * The City Hotel had the higher booking cancellation rate (around 30%)



## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 **Find out Which Type Hotel Is Booked by the Customer**

In [None]:
# Chart - 1 visualization code
#using histogram to plot 'Hotel Booking'
sns.set_style('whitegrid')
plt.rc('font',size=10)
plt.figure(figsize=(5,5))
df1['hotel'].hist()
plt.title('Type of Hotel Booked',fontsize=15)
plt.show()
#using pie chart to plot percentage of booking
plt.pie(x=df1.hotel.value_counts(),explode=[0.01,0],labels=['City Hotel','Resort Hotel'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Type of Hotel Booked',fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

**I picked a histogram and a pie chart to see which type of hotel has the most bookings and by what percentage**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel is booked more often: around 61% of customers booked it over the 3-year period**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.
* **From the above observation, I think the Resort Hotel might be a little more expensive than the City Hotel. If that is true, the Resort Hotel could revise its prices or add other offers to its bookings in order to attract more customers**  
* **The City Hotel can also provide more offers to encourage repeat bookings**  





#### Chart -2 **Preferred Meal Type**

In [None]:
# Chart - 2 visualization code
#univariate analysis
#using seaborn countplot to plot Meal
sns.set_style('whitegrid')
plt.figure(figsize=(10,7))
sns.countplot(x=df1['meal'])
plt.xlabel('Meal Type',fontsize=20)
plt.title('Preferred Meal Type',fontsize=20)
plt.show()


In [None]:
#using pie chart to show percentage distribution
plt.figure(figsize=(7,7))
plt.pie(x=df1.meal.value_counts(),explode=[0,0,0,0],labels=['BB','SC','HB','FB'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Preferred Meal Type',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

**I picked a seaborn count plot and a pie chart to show the distribution of meal types and which one customers choose most often**

##### 2. What is/are the insight(s) found from the chart?

**The countplot and pie chart show that Bed & Breakfast is the meal most often chosen by customers: around 78% opted for it**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Since BB is the most popular option, the hotel management can make attractive offers to encourage BB customers to upgrade to HB or FB, which would add to revenue. For customers who do not opt for any meal, the management should try to convince them to at least take BB by offering discounts or other special offers**

#### Chart-3 **Peak Year of Booking**

In [None]:
# Chart - 3 visualization code
# To find out peak year of booking
#bivariate analysis
plt.figure(figsize=(15,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.countplot(x='arrival_date_year',data=df1,hue='hotel')
plt.legend()
plt.title('Peak year of Booking',fontsize=20)
plt.xlabel('year',fontsize=20)
plt.ylabel('No. of bookings',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?


**I selected a count plot to find the peak year of booking**

##### 2. What is/are the insight(s) found from the chart?

**2016 was the peak year for both hotels**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

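**Both hotels had more bookings in 2016 than in 2017. However, the dataset only covers arrivals from July 2015 to August 2017, so 2015 and 2017 are partial years and their totals are not directly comparable with 2016. The hotel management should compare like-for-like months before concluding that bookings declined**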
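
Before comparing yearly totals, it may help to check how many distinct arrival months each year covers and to normalise by that. The sketch below uses toy data with the notebook's column names; the per-month average is only a rough yardstick, since months differ in seasonality.

```python
import pandas as pd

# Toy data: 2015 and 2017 cover fewer months than 2016.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017],
    "arrival_date_month": ["July", "July", "August",
                           "January", "June", "December",
                           "March", "March"],
})

# Number of distinct months with arrivals in each year.
coverage = toy.groupby("arrival_date_year")["arrival_date_month"].nunique()
# Total bookings per year.
bookings = toy.groupby("arrival_date_year").size()
# Average bookings per covered month, which is comparable across partial years.
per_month = bookings / coverage
```

In this toy example 2017 has the fewest bookings in total but the highest average per covered month, which is the kind of reversal a raw yearly count can hide.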

#### Chart - 4 **Peak Month of Booking**

In [None]:
# Chart - 4 visualization code
#bivariate analysis
# To find out peak month of non cancelled booking
df2=df1[df1['is_canceled']==0]
plt.figure(figsize=(15,8))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.set(font_scale=1.20)
sns.countplot(x=df2['arrival_date_month'],hue=df2['hotel'],order=['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December'])
plt.legend()
plt.title('Peak Month of Booking',fontsize=20)
plt.xlabel('Month',fontsize=20)
plt.ylabel('Non cancelled bookings',fontsize=20)
plt.show()


##### 1. Why did you pick the specific chart?

**I selected a countplot to find the peak month of booking**

##### 2. What is/are the insight(s) found from the chart?

**July and August are the peak months of booking**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**July and August have the most bookings for both types of hotel. The hotel management can focus its advertising on these months and offer special discounts to attract more customers**

#### Chart - 5 **Booking Cancellation Year-wise**

In [None]:
# Chart - 5 visualization code
# To find out year-wise cancellation
#bivariate analysis
df3=df1[df1['is_canceled']==1]
plt.figure(figsize=(15,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.countplot(x=df3['arrival_date_year'],hue=df3['hotel'])
plt.legend()
plt.title('Year wise Booking Cancelation',fontsize=20)
plt.xlabel('year',fontsize=20)
plt.ylabel('No. of bookings Cancelation',fontsize=20)
plt.show()
#using pie chart to find out %age of Booking cancelation over 3 years
plt.pie(x=df3.hotel.value_counts(),labels=['City hotel','Resort hotel'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Hotel Booking Cancelation')
plt.show()

##### 1. Why did you pick the specific chart?

**I selected a countplot and a pie chart to find out which hotel has the most cancellations and in which year**

##### 2. What is/are the insight(s) found from the chart?

* **The City Hotel accounts for around 67% of all booking cancellations over the 3 years, with the most cancellations in 2016**
* **The Resort Hotel accounts for around 33% of booking cancellations, also with the most in 2016**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Hotel management needs to find out why there were so many booking cancellations in 2016**


#### Chart - 6 **Preferred length of stay**

In [None]:
# Chart - 6 visualization code
# bivariate analysis
# find out the preferred length of stay
stay = df2[df2['total_stay'] < 10]   # consider only stays shorter than 10 nights
plt.figure(figsize = (10,7))
sns.countplot(x = stay['total_stay'], hue = stay['hotel'])
plt.show()

##### 1. Why did you pick the specific chart?

**I have selected count plot to analyze preferred length of stay by the customers**

##### 2. What is/are the insight(s) found from the chart?

* **For the Resort Hotel, most customers prefer a 1-day stay**
* **For the City Hotel, most customers prefer a 3-day stay**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**The typical length of stay at both hotels is 2-3 days, so the hotel management can provide special offers for 2-3 day stays to attract more customers**

#### Chart - 7 **Top 15 Visiting Countries**

In [None]:
# Chart - 7 visualization code
#Bivariate Analysis
#to find out Top 15 visiting countries
guest_country=df2['country'].value_counts().rename_axis('country_name').reset_index(name='No. of guests')
g1=guest_country.head(15)  #value_counts sorts in descending order, so the first 15 rows are the top 15 countries
plt.figure(figsize = (15,10))
sns.set(font_scale=1.5)
sns.barplot(x = g1['country_name'], y=g1['No. of guests'])
plt.title('Top 15 visiting countries',fontsize=20)
plt.xlabel('Country',fontsize=20)
plt.ylabel('No. of Guests',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

**I picked a seaborn barplot to visualize the number of guests from each country**

##### 2. What is/are the insight(s) found from the chart?

**Customers come to these two hotels from all over the world, but more than 50% of them are from Portugal, Great Britain and France**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Since most guests come from Portugal and other European countries, the hotel management should focus its advertising on these countries**
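
A claim such as "more than 50% of customers come from three countries" can be checked with a cumulative share. The following is a sketch on toy country codes; `top_countries` is a hypothetical helper, and the shares are relative to all rows passed in.

```python
import pandas as pd


def top_countries(countries, n=15):
    """Return the n most frequent countries with their share and cumulative share."""
    counts = countries.value_counts().head(n)  # value_counts sorts descending
    result = counts.rename_axis("country_name").reset_index(name="guests")
    result["share_pct"] = (result["guests"] * 100 / len(countries)).round(1)
    result["cumulative_pct"] = result["share_pct"].cumsum().round(1)
    return result


# Toy data: 5 guests from PRT, 3 from GBR, 2 from FRA.
toy = pd.Series(["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 2)
top2 = top_countries(toy, n=2)
```

With the real data, `top_countries(df2['country'], n=3)` would show in its last `cumulative_pct` value whether the top three countries exceed 50%.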

#### Chart - 8 **Distribution Of Booking**

In [None]:
# Chart - 8 visualization code
#find distribution of booking through different market segment
plt.figure(figsize=(18,7))
sns.set_style('whitegrid')
sns.set(font_scale=1.5)
plt.rc('font',size=10)
sns.countplot(x='market_segment',data=df1)
plt.title('Distribution Of booking through different market segment',fontsize=20)
plt.xlabel('Type of market segment',fontsize=20)
plt.ylabel('No. of bookings',fontsize=20)
plt.show()

In [None]:
#using pie chart to find out percentage
segment_counts=df1.market_segment.value_counts()
plt.figure(figsize=(7,7))
plt.pie(segment_counts,labels=segment_counts.index,autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1.5,1))
plt.title('Distribution of booking through different market segments')
plt.show()

##### 1. Why did you pick the specific chart?

**I have selected countplot and pie chart to analyze distribution of booking through different market segment**

##### 2. What is/are the insight(s) found from the chart?

**Around 59% of bookings are made through online travel agents, followed by offline travel agents/tour operators at 16%**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**The hotel management should make an effort to attract customers to book directly through the company website**

#### Chart - 9 **Preferred Booking Channel**

In [None]:
# Chart - 9 visualization code
# To know most preferred distribution channel for booking
plt.figure(figsize=(17,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.countplot(x='distribution_channel',data=df1)
plt.title('Distribution through different channels',fontsize=20)
plt.xlabel('Type of distribution channel',fontsize=20)
plt.ylabel('No. of bookings',fontsize=20)
plt.show()

In [None]:
#using pie chart to find out percentage distribution among different channels
#labels are taken from the value_counts index so that each wedge gets the matching channel name
channel_counts=df1.distribution_channel.value_counts()
plt.figure(figsize=(7,7))
plt.pie(channel_counts,labels=channel_counts.index,autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Distribution through different channels')
plt.show()

##### 1. Why did you pick the specific chart?

**I picked a seaborn countplot to see which distribution channel customers use most for booking, and a pie chart to find the percentage share of each channel.**


##### 2. What is/are the insight(s) found from the chart?

**About 79% of hotel bookings are made through travel agents/tour operators (TA/TO), followed by direct bookings at about 15%.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**The hotel management can offer discounts, complimentary services and offers on direct bookings as the hotel doesn’t pay commissions to third parties and maintains a direct relationship with the customer when a customer books the hotel directly.**

#### Chart - 10 **Guest Repeating Status**

In [None]:
# Chart - 10 visualization code
#univariate analysis
#To check what percentage of customers make repeat bookings
plt.figure(figsize=(7,7))
plt.pie(df1.is_repeated_guest.value_counts(),labels=['Not_repeated','Repeated'],explode=[0,0.3],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Guest repeating Status')
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a pie chart to find out what percentage of guests are repeat guests and what percentage are first-time visitors.**


##### 2. What is/are the insight(s) found from the chart?

**Only about 4% of guests made repeat bookings over these 3 years.**
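
Since `is_repeated_guest` is a 0/1 flag, the repeat rate can be checked without a chart by taking its mean. This is a minimal sketch on a made-up `toy` frame (not the real `df1`), with a hypothetical split by hotel added for comparison.

```python
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration only.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "is_repeated_guest": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 flag is the fraction of rows where the flag is 1.
repeat_rate = toy["is_repeated_guest"].mean() * 100
repeat_rate_by_hotel = toy.groupby("hotel")["is_repeated_guest"].mean() * 100

print(f"Overall repeat rate: {repeat_rate:.1f}%")
print(repeat_rate_by_hotel)
```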

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Hotel management can start membership plans and give special offers to repeat customers.**

#### Chart - 11 **Room assigned to customer**

In [None]:
# Chart - 11 visualization code
#bivariate analysis
#To find out Type of room assigned to the Guests w.r.t hotel type
plt.figure(figsize=(17,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.countplot(x='assigned_room_type',data=df1,hue='hotel')
plt.legend(bbox_to_anchor=(.5,1))
plt.title('Type of Room assigned to Guests w.r.t Hotel Type',fontsize=20)
plt.xlabel('Room Type',fontsize=20)
plt.ylabel('No. of bookings',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a countplot to find out which room type is most often assigned to guests.**

##### 2. What is/are the insight(s) found from the chart?

**In both hotels, room type A is assigned most often, followed by room type D.**
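
Raw counts can be hard to compare when one hotel has many more bookings than the other. One option is to look at the share of each room type within each hotel. The sketch below uses `pd.crosstab` on a made-up `toy` frame, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration only.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "assigned_room_type": ["A", "A", "A", "D", "A", "A", "D", "E"],
})

# normalize="index" turns each hotel's row into fractions that sum to 1.
room_share = pd.crosstab(
    toy["hotel"], toy["assigned_room_type"], normalize="index"
)

# Most frequently assigned room type for each hotel.
most_assigned = room_share.idxmax(axis=1)

print(room_share)
print(most_assigned)
```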

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Room types A and D are the most commonly assigned, possibly because they are economical and offer better service.**

#### Chart - 12 **Customer Type**

In [None]:
# Chart - 12 visualization code
#univariate Analysis
#Find out which type of customer has highest booking
plt.figure(figsize=(7,7))
plt.pie(df1.customer_type.value_counts(),labels=['Transient','Transient-Party','Contract','Group'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Customer Type')
plt.show()

##### 1. Why did you pick the specific chart?

**I selected a pie chart to find out the most common customer type.**

##### 2. What is/are the insight(s) found from the chart?

**Transient customers made around 82% of bookings over these 3 years.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Since very few customers come in groups, hotel management can introduce family/group packages to attract more customers.**

#### Chart - 13 **Deposit Type**

In [None]:
# Chart - 13 visualization code
#find out percentage using pie chart
plt.figure(figsize=(7,7))
plt.pie(df1.deposit_type.value_counts(),explode=[0,0.5,0.5],labels=['No Deposit','Non Refund','Refundable'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Deposit Type')
plt.show()

##### 1. Why did you pick the specific chart?

**I selected a pie chart to find out the share of each deposit type.**

##### 2. What is/are the insight(s) found from the chart?

**Around 99% of customers chose No Deposit.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Hotel management should make a refundable deposit compulsory.**

#### Chart - 14 **Special Request by the Customer**

In [None]:
# Chart - 14 visualization code
#Find out How many special Requests by the Guests
plt.figure(figsize=(17,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.countplot(x='total_of_special_requests',data=df1,hue='hotel')
plt.legend(bbox_to_anchor=(.5,1))
plt.title('Special Request by the customer',fontsize=20)
plt.xlabel('No. of Requests',fontsize=20)
plt.show()
#find out percentage using pie chart
plt.figure(figsize=(7,7))
plt.pie(df1.total_of_special_requests.value_counts(),labels=['0-Request','1-Request','2-Requests','3-Requests','4-Requests','5-Requests'],autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(1,1))
plt.title('Special Request')
plt.show()


##### 1. Why did you pick the specific chart?

**I selected a countplot and a pie chart to find the number and percentage of special requests made by customers.**

##### 2. What is/are the insight(s) found from the chart?

**About 50% of customers made no special request and about 33% made one.**


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Since most customers do not make any special request, the hotels appear to be meeting guests' basic needs already.**

#### Chart - 15 **ADR of both Hotels**


In [None]:
# Chart - 15 visualization
# total_booking is assumed to be df1 grouped by 'hotel' (defined earlier in the notebook)
d3 = total_booking['adr'].agg('mean').reset_index().rename(columns = {'adr':'avg_adr'})
sns.set_style('whitegrid')
plt.figure(figsize=(10,7))
sns.barplot(x = d3['hotel'], y = d3['avg_adr'] )
plt.show()



##### 1. Why did you pick the specific chart?

**I selected a barplot to compare the average ADR (average daily rate) of the two hotels.**

##### 2. What is/are the insight(s) found from the chart?

**The City Hotel has a higher ADR, meaning it generates more revenue per occupied room night.**

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

**Resort hotel management could introduce special offers to attract more customers.**

#### Chart - 16 **Revenue Generated by Room**

In [None]:
#barplot of average revenue for each assigned room type (sns.barplot plots the mean by default)
plt.figure(figsize=(15,7))
sns.set_style('whitegrid')
plt.rc('font',size=15)
sns.barplot(y='revenue',x='assigned_room_type',data=df1)
#no plt.legend() here: the plot has no hue, so there are no labelled artists to show
plt.title('Revenue Generated by Room',fontsize=20)
plt.xlabel('Room Type',fontsize=20)
plt.ylabel('Revenue',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

**I selected a barplot to visualize revenue against assigned room type.**

##### 2. What is/are the insight(s) found from the chart?

**Room types G and H generate the highest average revenue per booking, so these rooms may be more expensive and offer more luxury facilities.**
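
One caveat: `sns.barplot` shows the mean of `revenue` per room type, which is not the same as total revenue. A room type with few but expensive bookings can rank first by mean and still bring in less overall. The sketch below shows the difference on a made-up `toy` frame. I am assuming `revenue` is a per-booking amount, since its definition is not shown in this section.

```python
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration only.
# Assumption: 'revenue' holds one amount per booking.
toy = pd.DataFrame({
    "assigned_room_type": ["A"] * 6 + ["G"],
    "revenue": [100.0] * 6 + [300.0],
})

# Named aggregation gives mean, total and booking count side by side.
summary = toy.groupby("assigned_room_type").agg(
    mean_revenue=("revenue", "mean"),
    total_revenue=("revenue", "sum"),
    bookings=("revenue", "size"),
)
print(summary)
```

In this toy data room G ranks first by mean revenue while room A ranks first by total revenue, so it is worth checking both before drawing a conclusion.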

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

**Hotel management could offer off-season discounts to attract more customers to these rooms and generate more revenue.**

#### Chart - 16 - **Correlation Heatmap**

In [None]:
#np.object was removed in NumPy 1.24, so the dtype is selected by name instead
categorical_features=df1.select_dtypes(include=['object'])
categorical_features.columns


These columns are categorical, so they are left out of the correlation analysis.

In [None]:
# Correlation Heatmap visualization code
dcorr = df1[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_guests']]
df_corr = dcorr.corr()
f, ax = plt.subplots(figsize = (18,9))
sns.heatmap(df_corr, annot = True, fmt='.2f', annot_kws =  {'size': 15}, vmax = .99, square = True)
plt.show()

##### 1. Why did you pick the specific chart?

**To find the correlation between the numerical variables.**

##### 2. What is/are the insight(s) found from the chart?



1.   **Total stay and lead time are positively correlated, which suggests that customers who plan to stay longer tend to book further in advance.**
2.  **ADR is positively correlated with total_guests; more guests means more revenue.**
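
The heatmap can be backed up with a ranked list of correlated pairs. The sketch below is one possible way to do it: it keeps only the numeric columns, computes the correlation matrix, and lists each pair once, strongest first. `toy` is a made-up stand-in for `df1`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration only.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [5, 40, 10, 90, 60],
    "total_stay": [1, 4, 2, 7, 5],
    "adr": [120.0, 80.0, 110.0, 60.0, 75.0],
})

# Keep numeric columns only; object columns such as 'hotel' are dropped.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Mask everything except the upper triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```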









#### Chart - 17 - **Pair Plot**


In [None]:
# Pair Plot visualization code
#scatter plot between adr and total_stay
plt.figure(figsize = (18,9))
sns.scatterplot(x= 'adr', y = 'total_stay', data = df1)
plt.show()

##### 1. Why did you pick the specific chart?

**To examine the relationship between ADR and total stay.**

##### 2. What is/are the insight(s) found from the chart?

**ADR tends to decrease as the length of stay increases, so guests who stay longer get a better deal.**
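
With many overlapping points a scatter plot can be hard to read, so a quick numeric check can help. The sketch below takes the median ADR for each stay length on a made-up `toy` frame (not the real `df1`) and tests whether it falls as the stay gets longer.

```python
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration only.
toy = pd.DataFrame({
    "total_stay": [1, 1, 2, 2, 7, 7],
    "adr": [120.0, 130.0, 110.0, 100.0, 70.0, 80.0],
})

# Median ADR per stay length; the median is less sensitive to outliers than the mean.
median_adr = toy.groupby("total_stay")["adr"].median()
print(median_adr)

# True when the median ADR never rises as total_stay increases.
print(median_adr.is_monotonic_decreasing)
```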

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Here are the suggested steps to achieve business objectives:

1. Hotel management can introduce special packages including meals and other amenities to attract more travelers.
2. Implementing a membership plan with exclusive discounts can help in customer retention.
3. Investigate reasons for booking cancellations and minimize them effectively.
4. Utilize social media for targeted advertising, especially during peak seasons.
5. Collaborate with online/offline travel agencies and booking partners to increase customer acquisition.
6. Ensure maximum bookings are made through the company website or app.
7. Introduce attractive offers for group bookings and extended stays.
8. Dedicate efforts to thoroughly analyze customer feedback to enhance facilities and services.
9. Offer special discounts during offseasons to attract customers during slower periods.
10. Consider making deposits mandatory to secure bookings effectively.



# **Conclusion**


1. The City Hotel garners higher customer bookings, resulting in increased revenue.
2. Customers predominantly opt for Bed & Breakfast meal plans.
3. 2016 witnessed the peak of bookings for both hotels.
4. July and August mark the peak months for bookings in both hotels.
5. The average preferred stay length is 2-4 days.
6. Portugal, Great Britain, and France contribute the most guests to the hotels.
7. Room Type A is the most preferred accommodation.
8. Rooms G and H generate significant revenue.
9. The majority of guests are first-time visitors.
10. Approximately 25-30% of bookings get canceled, with more cancellations from the City Hotel.
11. Transient customers make the most bookings.
12. Most customers opt for no deposit.
13. The majority of guests make no special requests.
14. Online travel agents dominate the market segment for bookings.
15. Total stay and lead time exhibit a positive correlation.
16. Longer-stay customers receive better deals.

