## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

**IMPORT LIBRARIES**

**Importing all the important libraries.**

In [None]:
#import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as pg
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

These libraries provide the numerical, data-handling, and plotting functionality used in the rest of the notebook.

**Reading the dataset into a pandas DataFrame**

In [None]:
# Mount your drive 
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Read the "Hotel Bookings" data set into the variable data
data= pd.read_csv("/content/drive/MyDrive/Hotel Bookings.csv")

This loads the file "Hotel Bookings.csv" into a DataFrame.

In [None]:
#creating a copy of dataset
df=data.copy()

Copying the whole dataset from the variable "data" to "df" with the copy function gives an independent copy, so changes made to "df" do not affect "data".
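
As a minimal sketch of what "independent" means here, the toy DataFrame below (made-up values, not the real bookings data) is copied and then modified. The names `original` and `working` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the bookings data (values are made up)
original = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel"], "adults": [2, 1]})

# copy() defaults to deep=True, so the underlying data is duplicated
working = original.copy()
working.loc[0, "adults"] = 99

print(original.loc[0, "adults"])  # the original value is unchanged
print(working.loc[0, "adults"])   # only the copy holds the new value
```

Working on a copy keeps the raw data available for comparison after cleaning.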

In [None]:
#Printing the data
df


This displays the DataFrame after the CSV file has been loaded into it.

In [None]:
#To find out the shape of data
df.shape

For a better understanding, we print the shape of the data, which shows the number of rows and columns. We have a total of 119390 rows and 32 columns.

In [None]:
#Print head of dataframe
df.head()

In [None]:
#Print tail of dataframe
df.tail()

Here we have printed the top 5 and bottom 5 rows of the data using the `head()` and `tail()` methods.

In [None]:
#To show all the columns of the dataset
df.columns

This lists all the column names of the pandas DataFrame.

In [None]:
#To show the information of all the important aspects of dataset 
df.info()

This prints a concise summary of the DataFrame: column names, non-null counts, and data types.

In [None]:
#Describe summary of the dataset
df.describe(include="all")

This function gives a descriptive summary of the data set. With `include="all"`, the summary covers both numeric and non-numeric columns.

In [None]:
#To transpose data
df.describe().T

The `T` property transposes the output of the describe function, so that columns become rows and rows become columns.

In [None]:
#To find out duplicate values 
df[df.duplicated()].shape

In [None]:
#To count the duplicated values 
df.duplicated().value_counts()

In [None]:
#Drop duplicate values 
df= df.drop_duplicates()
df.shape

The three steps above find the duplicate rows, count them, and remove them. Removing duplicates keeps the data accurate for the analysis that follows.
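
The same three steps can be sketched on a toy DataFrame, which makes it easier to see which rows `duplicated()` flags. The data below is made up for illustration. By default pandas keeps the first occurrence and flags only the later repeats.

```python
import pandas as pd

# Toy frame: rows 0, 1 and 3 are identical (values are made up)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 10],
})

# duplicated() uses keep="first" by default: the first occurrence is not flagged
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() removes exactly the rows flagged above
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated.shape)
```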

In [None]:
#To find the size of the data
df.size

In [None]:
#Shape of unique values of data set (after removing duplicate values)
df.shape

In [None]:
#To find null values
df.isnull().sum().sort_values(ascending=False)

In [None]:
#Removing null values: drop the 'company' column and fill missing 'agent' values with 0
df=df.drop(['company'],axis=1)
df["agent"] = df["agent"].fillna(0)

In [None]:
df = df.dropna(axis = 0)

In [None]:
#Detecting null values 
df.isnull().sum()

With these steps we get rid of the null values that existed within some columns. One approach is to eliminate a whole column, as done here for `company`. Where a column still holds a sizable amount of existing data, as with `agent`, that data remains important, so the missing entries are filled with 0 instead. Any rows that still contain null values are then dropped.
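
One way to decide between dropping a column and filling it is to look at the share of missing values per column. The sketch below uses a made-up toy DataFrame, and the 50% cut-off (`DROP_THRESHOLD`) is a hypothetical choice for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data (values are made up)
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 9.0, 240.0],
    "country": ["PRT", "GBR", None, "PRT"],
})

# Fraction of missing values in each column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: drop columns that are more than half empty
DROP_THRESHOLD = 0.5
mostly_empty = missing_share[missing_share > DROP_THRESHOLD].index.tolist()
cleaned = toy.drop(columns=mostly_empty)

# Fill the partially missing 'agent' column, then drop rows that still have nulls
cleaned["agent"] = cleaned["agent"].fillna(0)
cleaned = cleaned.dropna(axis=0)
print(cleaned.shape)
```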

In [None]:
#Dataset after removing null values
df.shape

This is the shape of the deduplicated data after removing null values.

# **Here we can see some outliers.**

**Let's build boxplots to see them better.**
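
Boxplots in seaborn draw their whiskers at 1.5 times the interquartile range (IQR) by default, and points beyond the whiskers are shown as outliers. As a rough numerical counterpart, the sketch below counts the values outside those fences. The helper `iqr_outlier_count` and the sample values are hypothetical, and the quartiles here use pandas' default interpolation, so the counts may differ slightly from what a plot shows.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up lead times with one extreme value
toy_lead_time = pd.Series([5, 12, 20, 33, 41, 58, 700])
print(iqr_outlier_count(toy_lead_time))
```

Applied to the real data, the same helper could be called per column, for example `iqr_outlier_count(df['lead_time'])`.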

In [None]:
#Check outliers in numerical columns with seaborn boxplots
columns = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'required_car_parking_spaces', 'adr', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes']
plt.figure(figsize=(20,15))
#One subplot per column in a 4x4 grid; x= is passed explicitly for newer seaborn versions
for m, col in enumerate(columns, start=1):
  plt.subplot(4,4,m)
  sns.boxplot(x=df[col])

In [None]:
#Operations to clean up outliers.
#Cap long-tailed columns at an upper limit
df.loc[df.lead_time > 500, 'lead_time'] = 500
df.loc[df.stays_in_weekend_nights >= 5, 'stays_in_weekend_nights'] = 5
df.loc[df.stays_in_week_nights > 10, 'stays_in_week_nights'] = 10
df.loc[df.adults > 4, 'adults'] = 4
df.loc[df.booking_changes > 5, 'booking_changes'] = 5
df.loc[df.adr >= 1000, 'adr'] = 800
#Reduce booking-history counts to 0/1 flags
df.loc[df.previous_bookings_not_canceled > 0, 'previous_bookings_not_canceled'] = 1
df.loc[df.previous_cancellations > 0, 'previous_cancellations'] = 1
#Reset extreme counts to 0
df.loc[df.babies > 8, 'babies'] = 0
df.loc[df.required_car_parking_spaces > 5, 'required_car_parking_spaces'] = 0
df.loc[df.children > 8, 'children'] = 0
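
Where a value above a limit is replaced by the limit itself (as for `lead_time` at 500 and `adults` at 4), the same capping can be written with `Series.clip`. The sketch below uses a made-up toy DataFrame, and the `upper_caps` dictionary is a hypothetical helper. It does not cover the rules that set values to a different number, such as `adr` being set to 800 or counts being reset to 0.

```python
import pandas as pd

# Toy frame with a few extreme values (values are made up)
toy = pd.DataFrame({"lead_time": [10, 250, 737], "adults": [2, 4, 26]})

# Hypothetical mapping of column name to upper cap
upper_caps = {"lead_time": 500, "adults": 4}

capped = toy.copy()
for column, cap in upper_caps.items():
    # clip(upper=cap) replaces every value above cap with cap itself
    capped[column] = capped[column].clip(upper=cap)

print(capped)
```

Keeping the caps in one dictionary makes the chosen limits easy to review and change in a single place.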

In [None]:
#Rechecking whether outliers still exist.
columns = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'required_car_parking_spaces', 'adr', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes']
plt.figure(figsize=(20,15))
#One subplot per column in a 4x4 grid; x= is passed explicitly for newer seaborn versions
for m, col in enumerate(columns, start=1):
  plt.subplot(4,4,m)
  sns.boxplot(x=df[col])