
# **Project Name**    - **Hotel Booking Analysis**


##### **Project Type**    - EDA
##### **Contribution**    - Individual - Pranjali Kathait

# **Project Summary -**

In this project, the dataset covers two types of hotel: a City Hotel and a Resort Hotel. It contains 119390 rows and 32 columns.
To extract what we need from this dataset, I divided the data manipulation workflow into three steps:

1. Data Collection
2. Data Cleaning and manipulation
3. EDA (Exploratory Data Analysis)

In the first step, I used methods and attributes such as head(), tail(), info(), describe() and columns to examine the data and understand the columns in the dataset, such as hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month and stays_in_weekend_nights.

Then I found the unique values in each column and listed them in tabular form. I also checked the datatype of each variable and found that some columns did not have an appropriate datatype; these are converted to the desired datatype in later steps. Duplicate rows were removed using drop_duplicates().
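
The per-column check described above can be sketched as a single summary table. This is only an illustration on a small made-up DataFrame: `toy` and the helper `column_overview` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings data (values are made up for illustration)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "children": [0.0, 1.0, np.nan, 0.0],
    "reservation_status_date": ["2015-07-01", "2015-07-02", "2015-07-02", "2015-07-03"],
})

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of unique values, number of nulls."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(dropna=True),
        "n_null": frame.isnull().sum(),
    })

overview = column_overview(toy)
print(overview)
```

A table like this makes it easy to spot columns stored with an unsuitable datatype, such as a date kept as text or a whole-number count kept as float.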

Before visualizing the data and building charts from it, we have to perform data wrangling. For that, I checked each column for null or missing values, dropped the column with the largest number of null values (the 'company' column), and filled the remaining null or missing values using .fillna().

Then, using libraries such as numpy, pandas, matplotlib and seaborn, a number of charts were created for a better understanding of the data.



# **GitHub Link -**

https://github.com/pranjalikathait/Numerical-Programming-in-python.git

# **Problem Statement**


**Have you ever considered the best time of year to make a hotel reservation? Or have you ever wondered what the perfect length of stay is to get the greatest daily rate? What if you wanted to predict whether a hotel will receive an exceptionally large volume of special requests? This dataset on hotel bookings provides useful information for answering these questions. It includes reservation information for both city hotels and resort hotels, such as reservation dates, length of stay, number of guests, children, babies, and available parking spaces, among other criteria. It's worth noting that all personally identifiable information has been deleted from the dataset to ensure privacy and security.
Explore and analyse the data to identify key elements that influence bookings.**

#### **Define Your Business Objective?**

Analyzing the hotel dataset and digging out hidden and valuable insights that would have a positive influence on the booking rate.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]
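
As a rough illustration of the "UBM" rule, the sketch below builds the table behind one univariate, one bivariate and one multivariate view. The DataFrame `toy` and its values are made up; only the column names are borrowed from the hotel dataset.

```python
import pandas as pd

# Made-up rows that reuse a few column names from the hotel dataset
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "market_segment": ["Online TA", "Online TA", "Direct", "Online TA", "Direct"],
    "adr": [100.0, 120.0, 80.0, 90.0, 110.0],
})

# U: one variable at a time (count of bookings per hotel type)
univariate = toy["hotel"].value_counts()

# B: numerical vs categorical (mean adr per hotel type)
bivariate = toy.groupby("hotel")["adr"].mean()

# M: three variables together (mean adr per hotel type and market segment)
multivariate = toy.pivot_table(
    index="hotel", columns="market_segment", values="adr", aggfunc="mean"
)

print(univariate, bivariate, multivariate, sep="\n\n")
```

Each of these tables can be passed to a plotting call (for example a bar chart for the first two and a heatmap for the third) once the aggregation looks right.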





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime

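# Set the maximum number of columns displayed in a DataFrame to 36
pd.set_option("display.max_columns", 36)
# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')

# Setting font sizes, font weight and label weight for labels and titles.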
plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["axes.titlesize"] = 25
plt.rcParams["axes.titleweight"] = 'bold'
plt.rcParams['xtick.labelsize']=15
plt.rcParams['ytick.labelsize']=15
plt.rcParams["axes.labelsize"] = 20
plt.rcParams["legend.fontsize"] = 15
plt.rcParams["legend.title_fontsize"] = 15

### Dataset Loading

In [None]:
# Loading Dataset
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_csv('/content/drive/MyDrive/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
df

In [None]:
#First 5 rows of the dataset
df.head()

In [None]:
#Last 5 rows of the dataset
df.tail()

# **Data Exploration**

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_columns= df.shape
print(f"The dataset has {num_rows} rows and {num_columns} columns.")

print(df.index)
print(df.columns)


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
num_duplicates=df.duplicated().sum()
print("Total number of duplicate rows :",num_duplicates)

df.drop_duplicates(inplace=True)
unique_rows=df.shape[0]
print("Total number of unique rows :",unique_rows)

In [None]:
#viewing the de-duplicated data with a fresh 0..n-1 index (display only; df itself is not modified)
df.reset_index(drop=True)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count=df.isnull().sum()
print('Missing values count in each column ', missing_values_count)
null_values_count=df.isnull().sum().sum()
print(f'The dataset has {null_values_count} missing values.')

# pandas already represents missing values as NaN, so no replacement is needed at this stage
df

In [None]:
miss_values=df.isnull().sum().sort_values(ascending=False)
miss_values

In [None]:
# Visualizing the missing values
# using heat map
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Null Values distribution using Heatmap')
plt.show()

# Calculate the percentage of missing values for each column
missing_percentage = (df.isnull().mean() * 100).round(2)
# Create a bar chart to visualize missing value percentages
plt.figure(figsize=(12, 6))
missing_percentage.plot(kind='bar', color='skyblue')
plt.title('Percentage of Missing Values in Each Column')
plt.ylabel('Percentage')
plt.xlabel('Columns')
plt.xticks(rotation=45, ha='right')
plt.show()

### What did you know about your dataset?

This dataset compiles booking information for two types of hotels: a city hotel and a resort hotel. It records when the booking was made, the length of the stay, the number of adults, children and babies, parking spaces, whether the booking was cancelled, the country, the meal and many other characteristics. The dataset has a total of 119390 rows and 32 columns. I found that it contains 31944 duplicate rows, which were removed. The dataset has several data types, but some columns do not have an appropriate datatype, so I will convert them later. I then visualized the null values using a heatmap and a bar chart.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(list(df.columns))

In [None]:
#looking at the min, max, mean values etc. NaN for mean, 25%, 50%, 75% and max indicates a categorical column
df.describe(include='all')

### Variables Description

### **The column name with its specifications is listed below:**
1. hotel: Type of the hotel (Resort Hotel or City Hotel)
2. is_canceled: If the booking was canceled (1) or not (0)
3. lead_time: Number of days between the booking date and the arrival date
4. arrival_date_year: Year of arrival date
5. arrival_date_month: Month of arrival date
6. arrival_date_week_number: Week number of year for arrival date
7. arrival_date_day_of_month : Day of arrival date
8. stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.
9. stays_in_week_nights : Number of weeknights (Monday to Friday) spent at the hotel by the guests.
10. adults: Number of adults among guests
11. children : Number of children among guests
12. babies : Number of babies among guests
13. meal : Type of meal booked
14. country : country of guests
15. market_segment : Market segment of the booking (e.g., online, offline, corporate or direct)
16. distribution_channel : Distribution channel of the booking
17. is_repeated_guest : Whether the booking came from a repeated guest (1) or not (0)
18. previous_cancellations : Number of previous bookings cancelled by the guest
19. previous_bookings_not_canceled : Number of previous bookings not cancelled by the guest
20. reserved_room_type : Code of room type reserved
21. assigned_room_type: Code of room type assigned
22. booking_changes: Number of changes/amendments made to the booking
23. deposit_type: Type of the deposit made by the guest
24. agent: ID of travel agent who made the booking
25. company: ID of the company that made the booking
26. days_in_waiting_list: Number of days the booking was in the waiting list
27. customer_type: Type of customer, assuming one of four categories
28. adr: Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights
29. required_car_parking_spaces: Number of car parking spaces required by the customer
30. total_of_special_requests: Number of special requests made by the customer
31. reservation_status: Reservation status (Canceled, Check-Out or No-Show)
32. reservation_status_date : Date at which the last reservation status was updated
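
To make the `adr` definition concrete, here is a small worked sketch. The column `lodging_total` is a hypothetical name for "the sum of all lodging transactions" (the dataset itself only provides `adr`), and the numbers are made up.

```python
import pandas as pd

# Hypothetical bookings: total lodging charge and nights stayed (made-up numbers)
toy = pd.DataFrame({
    "lodging_total": [300.0, 450.0, 80.0],   # assumed column, not in the dataset
    "stays_in_weekend_nights": [1, 2, 0],
    "stays_in_week_nights": [2, 1, 1],
})

# Total staying nights is the sum of weekend and week nights
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# adr = sum of lodging transactions / total staying nights
toy["adr"] = toy["lodging_total"] / toy["total_nights"]

# Multiplying adr by the nights stayed recovers the lodging total for the booking
toy["revenue"] = toy["adr"] * toy["total_nights"]

print(toy[["total_nights", "adr", "revenue"]])
```

This is why multiplying `adr` by the total number of nights gives a reasonable per-booking revenue estimate.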



### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable
print(df.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

## **Data Cleaning**

In [None]:
#Checking which of the columns has null values, we have already stored the sum of all the null values in each of the column in variable named-'miss_values'
miss_values[:4]

In [None]:
#checking percentage of null values in each column

# Look each column up by name rather than by position, so the result does not depend on the sort order
#1. 'company' column
company_null_percentage=miss_values['company']*100/unique_rows
print("Percentage of null values in company column is ",company_null_percentage)

#2. 'agent' column
agent_null_percentage=miss_values['agent']*100/unique_rows
print("Percentage of null values in agent column is ",agent_null_percentage)

#3. 'country' column
country_null_percentage=miss_values['country']*100/unique_rows
print("Percentage of null values in country column is ",country_null_percentage)

#4. 'children' column
children_null_percentage=miss_values['children']*100/unique_rows
print("Percentage of null values in children column is ",children_null_percentage)

In [None]:
# Write your code to make your dataset analysis ready.
# It is better to drop company column as the percentage of null values present is very high
df.drop(['company'], axis=1, inplace=True)

In [None]:
# the null values count in the agent column is not large, so replacing the nulls with 0
# assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['agent'] = df['agent'].fillna(0)
#rechecking whether the 'agent' column still has any null values
df['agent'].isnull().sum()

In [None]:
# the null values count in the country column is not large, so replacing the nulls with 'others' as country name
df['country'] = df['country'].fillna('others')
#rechecking whether the 'country' column still has any null values
df['country'].isnull().sum()

In [None]:
# the null values count in the children column is not large, so filling the nulls with 0
df['children'] = df['children'].fillna(0)
#rechecking whether the 'children' column still has any null values
df['children'].isnull().sum()

In [None]:
#checking whether any of the column is left with nullvalues
df.isnull().sum()
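
The cleaning steps above can also be expressed as one reusable helper. This is a sketch on made-up data: `clean_missing` and the 50% `drop_threshold` are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the four columns that contain nulls
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "agent": [9.0, np.nan, 240.0, 9.0, 14.0],
    "country": ["PRT", None, "GBR", "PRT", "ESP"],
    "children": [0.0, 1.0, np.nan, 0.0, 2.0],
})

def clean_missing(frame, fill_values, drop_threshold=0.5):
    """Drop columns whose null share exceeds drop_threshold, then fill the rest."""
    null_share = frame.isnull().mean()
    to_drop = null_share[null_share > drop_threshold].index.tolist()
    cleaned = frame.drop(columns=to_drop)
    # Only fill columns that survived the drop
    fills = {col: val for col, val in fill_values.items() if col in cleaned.columns}
    return cleaned.fillna(fills), to_drop

cleaned, dropped = clean_missing(
    toy, {"agent": 0, "country": "others", "children": 0}
)
print("Dropped columns:", dropped)
print(cleaned)
```

Keeping the fill values in one dictionary documents every imputation decision in a single place.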

## **Changing datatypes as per the data**

In [None]:
#checking datatype
df.info()

In [None]:
#We can see that the 'children' and 'agent' columns hold integer values but their datatype is float, so converting them to int64
df[['children','agent']]= df[['children','agent']].astype('int64')

In [None]:
# converting object type to datetime
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format = '%Y-%m-%d')

In [None]:
df.info()
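
Before casting float columns to integers, it can help to confirm that the cast loses nothing. The helper below is a hypothetical sketch (`to_int_if_whole` is not part of the notebook) that refuses to convert a column that still has nulls or fractional values.

```python
import pandas as pd

def to_int_if_whole(series: pd.Series) -> pd.Series:
    """Cast a float Series to int64 only when it has no nulls and no fractional part."""
    if series.isnull().any():
        raise ValueError(f"{series.name!r} still contains missing values")
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name!r} contains non-integer values")
    return series.astype("int64")

# Made-up example: whole-number floats convert cleanly
children = pd.Series([0.0, 1.0, 2.0], name="children")
converted = to_int_if_whole(children)
print(converted.dtype)

# A fractional value is rejected instead of being silently truncated
try:
    to_int_if_whole(pd.Series([1.5, 2.0], name="agent"))
    rejected = False
except ValueError:
    rejected = True
print("Fractional column rejected:", rejected)
```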

### **Addition of new columns as per requirement**

In [None]:
# Adding a column total_night_stays
df['total_night_stays']= df['stays_in_weekend_nights']+df['stays_in_week_nights']
df['total_night_stays']

In [None]:
# Adding a revenue generated column by multiplying total_night_stays with adr
df['revenue']= df['total_night_stays']*df['adr']
df['revenue']

In [None]:
# Finding total number of guests involved in a booking
df['total_guests']=df['adults']+df['children']+df['babies']
df['total_guests']

In [None]:
#Making some of the columns data more readable
#1. in column 'is_canceled' , replacing (0,1) with ('is not canceled','is canceled')
df['is_canceled']= df['is_canceled'].replace([0,1],['is not canceled','is canceled'])
df['is_canceled']

In [None]:
#2. in column 'is_repeated_guest', replacing (0,1) with ('not repeated one','repeated one')
df['is_repeated_guest']= df['is_repeated_guest'].replace([0,1],['not repeated one','repeated one'])
df['is_repeated_guest']

In [None]:
#estimating hotel wise revenue
hotel_wise_revenue=df.groupby('hotel')['revenue'].sum()
hotel_wise_revenue

In [None]:
df[['hotel','revenue']]

In [None]:
#checking for those rows where 'total_guests' count is 0
null_guests_count=df[df['total_guests']==0]
null_guests_count                                          #a booking with no guests is not a valid stay, so these rows are dropped

In [None]:
#dropping all the 166 rows having a guest count of 0
df=df[df['total_guests']!=0].copy()   # .copy() so later column assignments act on an independent DataFrame

In [None]:
#Checking final shape of the dataset
df.shape

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***