
## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The first step is to import the necessary libraries.

In [None]:
## Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sbrn
import matplotlib.pyplot as plt

Load the dataset from Drive into a pandas DataFrame.

In [None]:
file_path = '/content/drive/MyDrive/AlmaBetter/Capstone Projects/EDA/Hotel Bookings.csv'
hotel_def = pd.read_csv(file_path)  

Let's find the shape of the dataset.

In [None]:
hotel_def.shape

(119390, 32)

Great! The dataset has 119390 rows and 32 columns.

The describe() method returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns. Some of these statistics led to very useful findings, which come later in the notebook.

In [None]:
hotel_def.describe()

,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0
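When scanning a wide describe() table for anomalies, it can help to look only at the extremes. The snippet below is a minimal sketch on a small made-up frame (the `toy` data is invented for illustration and is not the hotel dataset); it keeps just the `min` and `max` rows so that implausible values stand out.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "adults": [2, 0, 2, 55],
    "adr": [94.5, -6.38, 126.0, 69.29],
})

# describe() returns a DataFrame indexed by statistic name
summary = toy.describe()

# Keep only the extremes, where impossible values tend to show up
extremes = summary.loc[["min", "max"]]
print(extremes)
```

A zero in the `adults` minimum or a negative minimum daily rate would be the kind of value worth a closer look.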


## Cleaning the Dataset

### Handling Null Values

Now let's find out how many values are missing in each column of our dataset.

In [None]:
hotel_def.isnull().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

The percentage of null values in each column gives a better picture and will help us decide whether any columns have to be removed.

In [None]:
hotel_def.isnull().sum()/len(hotel_def.index)*100

hotel                              0.000000
is_canceled                        0.000000
lead_time                          0.000000
arrival_date_year                  0.000000
arrival_date_month                 0.000000
arrival_date_week_number           0.000000
arrival_date_day_of_month          0.000000
stays_in_weekend_nights            0.000000
stays_in_week_nights               0.000000
adults                             0.000000
children                           0.003350
babies                             0.000000
meal                               0.000000
country                            0.408744
market_segment                     0.000000
distribution_channel               0.000000
is_repeated_guest                  0.000000
previous_cancellations             0.000000
previous_bookings_not_canceled     0.000000
reserved_room_type                 0.000000
assigned_room_type                 0.000000
booking_changes                    0.000000
deposit_type                    

The 'company' column has around 94% null values! The 'agent' column has far fewer, but still a non-negligible share (around 14%). The 'country' and 'children' columns have a negligible number of null values.
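The same percentage can be computed a little more directly, because the mean of a boolean mask is the fraction of `True` values. The following is a small sketch on invented data (the `toy` frame and the 50% cut-off are assumptions chosen for illustration, not values used in this analysis).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "agent": [240.0, np.nan, 9.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 62.0],
    "adults": [2, 2, 1, 2],
})

# Mean of the boolean null mask = fraction of nulls; scale to a percentage
null_pct = toy.isnull().mean() * 100

# Columns whose null share exceeds an arbitrary 50% threshold
mostly_null = null_pct[null_pct > 50].index.tolist()

print(null_pct)
print(mostly_null)
```

Listing the columns above a threshold makes the decision about which columns to drop explicit rather than read off by eye.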

In [None]:
hotel_def['agent'][10]

240.0

The 'agent' column consists of float values. A little research about the dataset revealed that this column holds the agent's ID. Because the values are identifiers rather than quantities, the nulls cannot be meaningfully replaced by a value (such as a mean or median) calculated from the rest of the column. Given that a good share of the values are missing and the IDs are not needed for this analysis, it is better to remove the column entirely.

The 'company' column is an easy call, since most of its values are null. Those values are also just company IDs, so removing the column does not affect the analysis.

In [None]:
# 'columns=' already identifies the axis, so 'axis=1' is redundant
hotel_def = hotel_def.drop(columns=['company', 'agent'])

The 'children' and 'country' columns have a few null values. Since so few rows are affected, the rows that contain them can simply be removed from the analysis.

In [None]:
# dropping the rows with null values completely
hotel_def = hotel_def.dropna(axis=0)
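As a variation, `dropna` accepts a `subset` argument that limits the null check to named columns, which guards against accidentally dropping rows because of nulls elsewhere. The sketch below uses an invented `toy` frame (not the hotel data) and counts how many rows are removed.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0],
    "country": ["PRT", "GBR", None, "ESP"],
    "adults": [2, 2, 1, 2],
})

rows_before = len(toy)

# Drop only rows that have a null in one of the listed columns
cleaned = toy.dropna(subset=["children", "country"])

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Comparing the row count before and after is a quick sanity check that the number of dropped rows is as small as expected.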

### Handling Data That Doesn't Make Sense!

The describe() output earlier showed something fishy: the minimum value of the 'adults' column is zero. While bookings with only children are possible, it is better to check those rows carefully.
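One way such a check might look is sketched below on an invented `toy` frame (not the hotel data). It separates bookings with no adults from bookings with no guests at all, since only the latter are clearly invalid.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "adults": [2, 0, 0, 1],
    "children": [0.0, 2.0, 0.0, 0.0],
    "babies": [0, 0, 0, 1],
})

# Bookings without any adult: possible, but worth inspecting
no_adults = toy[toy["adults"] == 0]

# Bookings with no guests of any kind: these cannot be real stays
total_guests = toy["adults"] + toy["children"] + toy["babies"]
no_guests = toy[total_guests == 0]

print(len(no_adults), "bookings without adults")
print(len(no_guests), "bookings without any guests")
```

Whether to drop only the zero-guest rows or all zero-adult rows is a judgment call that depends on what the inspection turns up.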