

## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

In [1]:
# Importing libraries

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Importing the dataset from its raw GitHub URL
# (the /blob/ page URL returns HTML, which pd.read_csv cannot parse)

url = "https://raw.githubusercontent.com/pisaybharath/Hotel-Booking-Analysis/main/Hotel%20Bookings.csv"
df = pd.read_csv(url)
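
`pd.read_csv` needs the raw file rather than GitHub's HTML viewer page; a `/blob/` URL returns HTML, which typically fails with a `ParserError`. The sketch below shows one way to derive the raw-content URL from a viewer URL. The helper name `to_raw_github_url` is hypothetical and not part of pandas or any GitHub library, and the sketch assumes the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` layout.

```python
def to_raw_github_url(blob_url: str) -> str:
    """Hypothetical helper: turn a github.com 'blob' URL into its raw-content URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a github.com blob URL")
    # Split "<owner>/<repo>" from "<branch>/<path>" at the first "/blob/"
    owner_repo, path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{path}"


blob_url = "https://github.com/pisaybharath/Hotel-Booking-Analysis/blob/main/Hotel%20Bookings.csv"
raw_url = to_raw_github_url(blob_url)
print(raw_url)
```

The resulting string can then be passed to `pd.read_csv`; the conversion itself needs no network access.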

In [None]:
#Reading head of Dataframe

df.head()

In [None]:
#checking Shape of Dataframe

df.shape

In [None]:
#checking columns of dataframe
df.columns

In [None]:
# Checking information of the DataFrame

df.info()

In [None]:
#checking for null values
df.isna().sum().sort_values(ascending=False)



 **STEP 1. Data Cleaning**



In [None]:
# Replacing null values with zero for convenience

df.fillna(0, inplace=True)

In [None]:
#again checking for null value
df.isna().sum()

**Now we can see that there are no null values left in the DataFrame**
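
Filling every column with 0 is convenient, but it also writes 0 into text columns. As a possible alternative, the sketch below fills each column with its own placeholder. The toy frame and the `"Unknown"` label are assumptions made for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the hotel data (values are made up)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR"],
    "agent": [9.0, np.nan, np.nan],
})

# Column-specific placeholders: 0 for numeric fields, "Unknown" for text
fill_values = {"children": 0, "country": "Unknown", "agent": 0}
filled = toy.fillna(value=fill_values)

remaining_nulls = int(filled.isna().sum().sum())
print(filled)
print("remaining nulls:", remaining_nulls)
```

Passing a dictionary to `fillna` leaves any column that is not listed untouched, so the mapping has to cover every column that contains nulls.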

In [None]:
#checking value of children
df['children'].unique()

In [None]:
#checking value of adults
df['adults'].unique()

In [None]:
# Checking values of babies
df['babies'].unique()

**From the values above, adults, children and babies cannot all be zero for the same booking**

In [None]:
# Build a mask for bookings where adults, children and babies are all 0

checking_value_0_of_ad_ch_baby = (df['adults']==0) & (df['children']==0) & (df['babies']==0)

# Display the rows where adults, children and babies are all 0
df[checking_value_0_of_ad_ch_baby]

 **A booking cannot have zero adults, zero children and zero babies at the same time, so these rows are invalid entries and have to be removed** 
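
The zero-guest check described above can also be expressed through a total guest count. This is a minimal sketch on a made-up toy frame; it matches the three-way equality test only as long as the counts are non-negative, which is assumed here.

```python
import pandas as pd

# Toy bookings (made-up values); row 1 has no guests at all
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0, 0, 1, 2],
    "babies":   [0, 0, 0, 0],
})

# Total number of guests per booking
total_guests = toy[["adults", "children", "babies"]].sum(axis=1)

# Keep only bookings with at least one guest
valid = toy[total_guests > 0]
print(valid)
```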

In [None]:
# Keeping only the rows where at least one of adults, children or babies is non-zero

hotel_df = df[~checking_value_0_of_ad_ch_baby]

In [None]:
# reading hotel_df dataframe
hotel_df

**STEP 2 : Data Analysis**


 **How many bookings does each hotel type have?**
                             

In [None]:
# value_counts() gives the number of bookings per hotel type;
# its index holds the hotel type names
hotel_list = hotel_df['hotel'].value_counts()
hotel_count = hotel_list.index

In [None]:
# Setting the size of the figure
plt.figure(figsize = (5, 5))
# Creating the bar graph
plt.bar(hotel_count, hotel_list, color =['skyblue','orange'],width = 0.4)
plt.xlabel("Hotel Type")
plt.ylabel("Count")
# Showing the graph
plt.show()


*The graph above shows two hotel types: City Hotel with approximately 79,000 bookings and Resort Hotel with approximately 40,000.*




**Where do guests come from?**

In [None]:
# Taking the top 10 countries to see where the largest number of guests come from (non-canceled bookings only)

top_10_countries = hotel_df[hotel_df['is_canceled']==0]['country'].value_counts()[:10]
top_10_countries

In [None]:
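# Turning the Series into a DataFrame with named columns
# (rename_axis + reset_index(name=...) does not depend on how pandas names the value_counts() result)

top_10_countries = top_10_countries.rename_axis('country').reset_index(name='number_of_bookings')

# Adding a percentage column (share within the top 10 countries, not of all bookings)
top_10_countries['percentage'] = (top_10_countries['number_of_bookings']/top_10_countries['number_of_bookings'].sum())*100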


In [None]:
top_10_countries

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(x="country", y="percentage", data=top_10_countries).set(title='Percentage of guests per country (top 10)')
plt.show()


**The graph above shows that most guests come from PRT, i.e. Portugal**
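
The percentages plotted above are shares within the top 10 countries only. To express a country's share of all non-canceled bookings instead, `value_counts(normalize=True)` can be used. The sketch below runs on a made-up toy frame; the column names follow the dataset, the values do not.

```python
import pandas as pd

# Toy bookings (made-up values); the last PRT booking was canceled
toy = pd.DataFrame({
    "country":     ["PRT", "PRT", "PRT", "GBR", "FRA", "PRT"],
    "is_canceled": [0, 0, 0, 0, 0, 1],
})

# Keep bookings that were not canceled
arrived = toy[toy["is_canceled"] == 0]

# Share of each country among all non-canceled bookings, in percent
share = (arrived["country"].value_counts(normalize=True) * 100).round(1)
print(share)
```

Applied to the real data, `share.head(10)` would give the top 10 countries with percentages relative to every non-canceled booking.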



**How much do guests pay for a room per night?**
                                          

In [None]:
hotel_df.head()

In [None]:
# Creating a new DataFrame so the analysis covers non-canceled bookings only

room_type = hotel_df[hotel_df['is_canceled']==0]

In [None]:
# Plotting a box plot for the analysis
# (adr is the average daily rate of the booking, not a per-person price)
plt.figure(figsize=(12,8))
sns.boxplot(x='reserved_room_type', y ='adr',data = room_type,hue='hotel')
plt.title('Price of room types per night')
plt.xlabel('Room Type')
plt.ylabel('Price (adr)')
plt.show()


**The figure above shows that room type "A" has the highest single price, which appears as an outlier**

**whereas type "G" rooms of the City Hotel are generally much costlier than the other room types**
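
Reading typical prices off a box plot is approximate. A grouped median gives the same comparison as numbers. This is a minimal sketch on a made-up toy frame, assuming the `hotel`, `reserved_room_type` and `adr` columns used in the plot.

```python
import pandas as pd

# Toy non-canceled bookings (made-up prices)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel",
              "Resort Hotel", "City Hotel", "City Hotel"],
    "reserved_room_type": ["A", "A", "A", "A", "G", "G"],
    "adr": [90.0, 110.0, 70.0, 80.0, 200.0, 220.0],
})

# Median adr per room type (rows) and hotel (columns)
median_adr = (
    toy.groupby(["reserved_room_type", "hotel"])["adr"]
       .median()
       .unstack()
)
print(median_adr)
```

The median is used here because it is less sensitive to outliers than the mean; combinations with no bookings show up as `NaN`.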


**Which months are the busiest?**
              


In [None]:
plt.figure(figsize=(20,5))

# Calendar order for the x axis (the default would be order of appearance)
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']
sns.countplot(data = hotel_df, x = 'arrival_date_month', hue = 'hotel', order = month_order).set_title('Number of arrivals per month',fontsize=20)
plt.show()



**According to the graph above, the busiest month is August**
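
The busiest month can also be computed directly instead of being read from the plot. The sketch below converts the month names into an ordered categorical so that counts come out in calendar order. The toy values are made up for illustration.

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']

# Toy arrivals (made-up values)
toy = pd.DataFrame({
    "arrival_date_month": ["August", "July", "August", "January", "August", "July"],
})

# An ordered categorical keeps calendar order and lists months with zero arrivals
months = pd.Categorical(toy["arrival_date_month"], categories=month_order, ordered=True)
arrivals = pd.Series(months).value_counts().sort_index()

busiest = arrivals.idxmax()
print(arrivals)
print("busiest month:", busiest)
```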


**Which hotel type has the highest number of cancellations?**

In [None]:
canceled_room=hotel_df[hotel_df['is_canceled']==1]

sns.countplot(data=canceled_room,x='is_canceled', hue='hotel', palette='bright').set_title('Number of canceled Bookings')
plt.show()



**The plot above shows that the City Hotel has more cancellations than the Resort Hotel**
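
Because the City Hotel also has roughly twice as many bookings, raw cancellation counts are hard to compare between the two hotels. A cancellation rate per hotel puts them on the same scale. This is a minimal sketch on a made-up toy frame; since `is_canceled` is 0 or 1, its mean is the cancellation rate.

```python
import pandas as pd

# Toy bookings (made-up values)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Count, total and rate of cancellations per hotel
summary = toy.groupby("hotel")["is_canceled"].agg(
    cancellations="sum",
    bookings="count",
    rate="mean",
)
print(summary)
```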


**How do bookings and cancellations vary by market segment?**

In [None]:
plt.figure(figsize=(18,10))
plt.subplot(211)
sns.countplot(data=hotel_df,x='deposit_type',hue='market_segment')
plt.title('Deposit Type for Market Segment')

plt.subplot(212)
sns.countplot(data=hotel_df,x='is_canceled',hue='market_segment')
plt.title('Cancellation for Market Segment')
plt.show()



**Graph 1 above shows that most bookings are made through the Online TA segment,**

**and graph 2 shows that most cancellations also come from the Online TA segment**.
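
A segment with the most bookings will usually have the most cancellations as well. To see which segment cancels most often relative to its size, a row-normalized cross-tabulation is one option. The sketch below uses made-up toy values; the segment names are examples only.

```python
import pandas as pd

# Toy bookings (made-up values)
toy = pd.DataFrame({
    "market_segment": ["Online TA"] * 4 + ["Direct"] * 2 + ["Groups"] * 2,
    "is_canceled":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Each row sums to 1: the share of kept (0) and canceled (1) bookings per segment
rates = pd.crosstab(toy["market_segment"], toy["is_canceled"], normalize="index")
print(rates)
```

Column `1` of the result holds the cancellation rate of each segment.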




**Which meal types are most preferred?**

In [None]:
# Enlarging the pie chart
plt.rcParams['figure.figsize'] = 10,10

# Labels: the meal types, ordered by frequency
labels = hotel_df['meal'].value_counts().index

# Sizes: the number of bookings per meal type, as a list
sizes = hotel_df['meal'].value_counts().tolist()

# autopct prints each wedge's percentage using Python string formatting;
# '%0.1f%%' shows one decimal place followed by a percent sign
plt.pie(sizes,labels=labels,autopct='%0.1f%%')
plt.show()


**The pie chart above shows that the most preferred meal is BB, i.e. Bed and Breakfast (approx. 77.4%)**


**How long do guests stay at the hotel on weekdays and weekends?**

In [None]:
plt.figure(figsize=(15, 15))
plt.subplot(1, 2, 1)
sns.countplot(data = hotel_df, x = 'stays_in_week_nights',hue='hotel' ,palette='cool')
plt.title("Number of stays on week nights",fontweight="bold", size=20)
plt.grid()
plt.subplot(1, 2, 2)
sns.countplot(data = hotel_df, x = 'stays_in_week_nights', hue='is_canceled', palette='rocket')
plt.title('Week-night stays vs cancellation',fontweight="bold", size=20)
plt.grid()

plt.show()


**The first graph shows that most guests stay for 2 week nights**

**The second graph shows that 2-night bookings also have the most cancellations, while 1-night bookings are mostly not canceled**

In [None]:
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.countplot(data = hotel_df, x = 'stays_in_weekend_nights',hue='hotel' ,palette='cool')
plt.title("Number of stays on weekend nights",fontweight="bold", size=20)
plt.grid()
plt.subplot(1, 2, 2)
sns.countplot(data = hotel_df, x = 'stays_in_weekend_nights', hue='is_canceled', palette='rocket')
plt.title('Weekend-night stays vs cancellation',fontweight="bold", size=20)
plt.grid()

plt.show()

**Conclusion**

***FIRST GRAPH***

**More guests stay at the City Hotel for 0, 1 or 2 weekend nights**

**Guests staying more than 2 weekend nights tend to prefer the Resort Hotel**

***SECOND GRAPH***

**Weekend stays show a lower number of booking cancellations**
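
Week nights and weekend nights are plotted separately above, so neither plot shows the full length of a stay. Adding the two columns gives the total number of nights per booking. This is a minimal sketch on a made-up toy frame; `total_nights` is a new, hypothetical column name.

```python
import pandas as pd

# Toy bookings (made-up values)
toy = pd.DataFrame({
    "stays_in_week_nights":    [2, 5, 1, 0],
    "stays_in_weekend_nights": [0, 2, 1, 2],
})

# Total length of stay per booking
toy["total_nights"] = toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]

# Most common stay length
most_common = toy["total_nights"].mode().iloc[0]
print(toy)
print("most common stay length:", most_common)
```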



 **How many repeated guests are there?**
 

In [None]:
hotel_df.info()

In [None]:
# Creating a new DataFrame for repeated guests

repeated_guest_df=hotel_df[hotel_df['is_repeated_guest']==1]
sns.countplot(data=repeated_guest_df,x='is_repeated_guest', hue='hotel', palette='cool').set_title('Number of Repeated Guest')
plt.show()


 **There are very few repeated guests; among them, the City Hotel has more repeated guests than the Resort Hotel** 


In [None]:
df.tail()