<a href="https://colab.research.google.com/github/prati25/Hotel-Booking-Analysis/blob/main/Copy_of_Hotel_Booking_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

## **DATASET**

The dataset contains information on bookings for two hotels in Portugal (a resort hotel and a city hotel) that were due to arrive between July 1, 2015 and August 31, 2017.

For both hotels, the same information was collected: 31 variables describing 40,060 observations for the resort and 79,330 observations for the city hotel. That is, the dataset contains information on 119,390 hotel reservations, including those that were canceled. This is real information, so all elements that could identify hotels or customers were removed.

### Here are some questions for analysis:

1. What is the month with the most guest arrivals?

2. How long do guests tend to stay at the hotel?

3. How many reservations were made by repeated guests?

4. What is the Average Daily Rate (ADR) throughout the year?

5. How many reservations were cancelled out of the total?

6. What is the most frequent deposit type for cancelled reservations?

7. Which countries do customers come from?

8. What types of customers are most common in each hotel?

9. What is their preferred meal plan?

10. Which hotel is preferred by adults with children?

11. What is the strongest market segment and distribution channel?

## **Suppressing warnings**

In [None]:
import warnings
warnings.filterwarnings('ignore')             #suppressing warnings

## **Importing libraries**

In [None]:
#importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns                         
%matplotlib inline 

## **Reading and inspecting data**

Let's look at the contents of the dataset:

In [None]:
pk = pd.read_csv()

In [None]:
pk.head()

In [None]:
print(pk.shape)


There are 119,390 observations and 32 columns in the dataset.

Check the data types:

In [None]:
pk.info()

As can be seen, 'reservation_status_date' has an object data type, when it should have a datetime data type.

Also, there are missing values in 'children', 'country', 'agent' and 'company' columns. This will be explored in the next section.
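The conversion of a date column stored as text can be sketched as follows. This is only an illustration on a small made-up frame (`toy` is a hypothetical name, and the ISO `YYYY-MM-DD` date format is an assumption about how the column is stored); the same call would apply to `pk['reservation_status_date']`.

```python
import pandas as pd

# Toy frame standing in for the booking data (values are illustrative only)
toy = pd.DataFrame(
    {"reservation_status_date": ["2015-07-01", "2016-02-29", "2017-08-31"]}
)

# Parse the strings into datetime64 values
toy["reservation_status_date"] = pd.to_datetime(toy["reservation_status_date"])

# The column now has a datetime dtype instead of object
print(toy.dtypes)
```

Once the column is a datetime, the `.dt` accessor (for example `.dt.year` or `.dt.month`) becomes available for date-based grouping.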

## **Data cleaning**

## **Missing values**

Let's see how many values are missing:

In [None]:
# Number of missing values according to column

pk.isnull().sum().sort_values(ascending = False)

In [None]:
# Percentage of missing values per column

round(pk.isnull().sum().sort_values(ascending=False) * 100 / len(pk), 2)
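The count and the percentage of missing values can also be reported side by side in one table. A minimal sketch, using a small made-up frame (`toy`, a hypothetical stand-in for `pk`) and illustrative column names `missing_count` and `missing_pct`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the booking data (values are illustrative only)
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 9.0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
})

# isnull().sum() counts the missing cells per column;
# isnull().mean() gives the fraction of missing cells per column
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand.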

The **'company'** and **'agent'** columns have a very high proportion of missing values (94.31% and 13.69% respectively). One option would be to drop those columns. However, the original article states the following:



---
In some categorical variables like agent or company, "NULL" is presented as one of the categories. This should not be considered a missing value, but rather as "not applicable".
For example: if a booking "Agent" is defined as "NULL" it means that the booking did not come from a travel agent.

---


Therefore, I will replace the null values in those columns with 0 (both columns have float64 data types, since the personal information was removed).

On the other hand, the column **'country'** has 0.41% missing values (488 rows affected). Since it is a categorical variable, I have chosen to replace the null values with the mode.

The column **'children'** also has some missing values, but they account for less than 0.01% of the dataset. Therefore, I have chosen to delete the affected rows (4 rows).

In [None]:
# Replace NULL values in the 'company' and 'agent' columns with 0

values = {'company': 0, 'agent': 0}
pk.fillna(value=values, inplace=True)


# Replace NULL values in the 'country' column with the most frequent value
# (assigning the result back avoids an in-place call on a column selection)

pk['country'] = pk['country'].fillna(pk['country'].mode()[0])


# Remove the rows with NULL values in the 'children' column

pk.dropna(subset=['children'], inplace=True)


In [None]:
#Rechecking for NULL values in the data set

pk.isnull().sum()

Now there are no missing values in the dataset.

## **Inconsistent Data**

In this part I'll look for inconsistent data in the dataset; that is, I'll check that the unique values of the categorical columns are correct.

In [None]:
# Categorical columns

categ_columns = ['hotel','is_canceled','meal','country','market_segment','distribution_channel','is_repeated_guest','reserved_room_type','assigned_room_type','deposit_type','customer_type','reservation_status']


# Unique values in each categorical column
# (the loop variable has its own name so it does not overwrite the list)

print("UNIQUE VALUES BY CATEGORICAL COLUMNS\n")
for column in categ_columns:
    unique_values = pk[column].unique()
    print(f"\n{column}: \n{unique_values}\n")
    print('-' * 70)

In the column **'meal'** there are five possible values: ['BB' 'FB' 'HB' 'SC' 'Undefined']

The category **'Undefined'** actually corresponds to **'SC'** (self-catering, i.e. no meal is included), as defined in the original article. Therefore, I'll replace its value with **'SC'**.


In [None]:
# Replace the 'Undefined' meal category with 'SC' (the match is case-sensitive)

pk['meal'] = pk['meal'].replace(to_replace='Undefined', value='SC')

In [None]:
#Rechecking unique values in meal column

pk['meal'].unique()

There is no more inconsistent data in the dataset.

## **Invalid Data**

Now, let's check whether there are any illogical values in the dataset:

In [None]:
pk.describe()

At first glance, it seems that there are outliers in the dataset. For example, we can see that the column 'previous_cancellations' has a maximum value of 26 cancellations, which would imply that some customer made 26 cancellations, which is unlikely.

On the other hand, the column 'adults' has a maximum of 55 and a minimum of 0 people. The minimum is especially interesting because it would imply that there are hotel reservations for 0 adults, which is not possible since there must be at least 1 adult per reservation (children cannot book hotel rooms). Therefore, I will eliminate the rows where the number of adults equals 0.

In [None]:
#Dropping rows with 0 adults :

pk.drop(pk[pk['adults']== 0].index,inplace = True)

In [None]:
#Verification of dropping rows with 0 adults :

len(pk[pk['adults']==0])

Rows where 'adults' was equal to 0 have been eliminated. I'll now check the outliers.
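One common way to flag outliers in a numeric column is the interquartile range (IQR) rule. The sketch below is only an illustration of that rule on made-up values; using the IQR rule here is an assumption, and `toy` is a hypothetical series standing in for a column such as `pk['previous_cancellations']`.

```python
import pandas as pd

# Toy values standing in for a numeric column (illustrative only)
toy = pd.Series([0, 0, 0, 1, 0, 2, 0, 1, 26], name="previous_cancellations")

# First and third quartiles, and the interquartile range between them
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy < lower) | (toy > upper)]

print(outliers)
```

The 1.5 multiplier is a convention rather than a fixed rule. For heavily skewed count columns it can flag many legitimate values, so flagged rows are worth inspecting before they are dropped.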