## An Analysis of Hotel Booking Data
### Exploratory Data Analysis

The dataset contains several columns that are not useful for this analysis, so they are dropped from the data frame. The data type of each remaining column is then checked.

In [None]:
import pandas as pd

In [None]:
raw_data = pd.read_csv('hotel_bookings.csv')

In [None]:
raw_data.columns

The next stage is to analyse the unique values of the columns that contain categorical (factor) data. Pandas stores these categorical variables, which typically take one of a fixed set of string values, with the `object` data type.

Beginning with the type of hotel, there are two unique values: _Resort Hotel_ and _City Hotel_. The dataset contains data from all twelve months of the year, as would be expected. 177 countries are represented in this dataset. The reserved room types are labeled 'C', 'A', 'D', 'E', 'G', 'F', 'H', 'L', 'P' and 'B'. The assigned room types use the same kind of label, with even more distinct values. Since these labels make very little sense without metadata of some sort (which was not provided with the dataset), it is appropriate to drop both room type columns from the data frame. Finally, the deposit type of the booking can be _No Deposit_, _Refundable_ or _Non Refund_ (non-refundable).

This information was gathered from the analysis below:

In [None]:
raw_data.drop(columns=['is_canceled',
                       'arrival_date_year',
                       'arrival_date_week_number',
                       'arrival_date_day_of_month',
                       'agent',
                       'company',
                       'days_in_waiting_list',
                       'customer_type',
                       'adr',
                       'required_car_parking_spaces',
                       'total_of_special_requests',
                       'reservation_status',
                       'reservation_status_date',
                       'distribution_channel',
                       'meal',
                       'market_segment'], inplace=True)

In [None]:
print("Hotel types:", raw_data['hotel'].unique())
print("Months in the data:", raw_data['arrival_date_month'].unique())
print("Number of unique country values in the data:", raw_data['country'].unique().size)
print("Reserved room types:", raw_data['reserved_room_type'].unique())
print("Assigned room types:", raw_data['assigned_room_type'].unique())
print("Deposit types:", raw_data['deposit_type'].unique())
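
As a minimal sketch of how the categorical columns could be located by data type rather than by name, the snippet below builds a small toy frame, selects its `object` columns with `select_dtypes`, and counts distinct values. The toy values and the variable names are invented for illustration and are not taken from the dataset. It also shows a detail worth keeping in mind when counting values in a column such as `country`: `unique()` keeps a missing value as one of its entries, whereas `nunique()` ignores missing values by default, so the two counts can differ by one.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not rows from hotel_bookings.csv.
# The string columns are given dtype=object explicitly so the sketch does
# not depend on how a particular pandas version infers string columns.
toy = pd.DataFrame({
    'hotel': pd.Series(['Resort Hotel', 'City Hotel', 'City Hotel'], dtype=object),
    'country': pd.Series(['PRT', np.nan, 'GBR'], dtype=object),
    'adults': [2, 1, 2],
})

# Columns stored as 'object' are the candidates for categorical data.
categorical_columns = toy.select_dtypes(include='object').columns.tolist()

# unique() reports NaN as a value; nunique() skips it unless dropna=False.
country_with_nan = toy['country'].unique().size
country_without_nan = toy['country'].nunique()

print(categorical_columns)
print(country_with_nan, country_without_nan)
```

If missing values should not be counted as a category, `nunique()` is probably the safer choice of the two.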

In [None]:
raw_data.drop(columns = ['reserved_room_type', 'assigned_room_type'], inplace = True)

In [None]:
raw_data.dtypes

This dataset contains some missing values (represented as `NaN`). We would like to check how many missing values exist in the data.

The output of the code below shows that most columns have no missing values. Where a record does have a missing value, however, we drop the entire record because it is incomplete. Finally, we check that all the `NaN` values have been removed.

In [None]:
raw_data.isna().sum()

In [None]:
raw_data = raw_data.dropna(how = 'any')
raw_data.isna().sum()
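
The check, drop and re-check pattern used here can be sketched on a small toy frame. The values below are invented for illustration and the variable names are hypothetical; only the pandas calls match the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'children': [0.0, np.nan, 2.0, 1.0],
    'country': ['PRT', 'GBR', None, 'ESP'],
})

# Number of missing values in each column before cleaning.
missing_before = toy.isna().sum()

# Drop every row that has at least one missing value.
cleaned = toy.dropna(how='any')

# Re-check: every per-column count should now be zero.
missing_after = cleaned.isna().sum()

print(missing_before.to_dict())
print(len(toy), '->', len(cleaned))
```

Note that `how='any'` is the default behaviour of `dropna`. If only some columns matter for the analysis, the `subset` parameter can restrict the check to those columns so that fewer records are discarded.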

In [None]:
raw_data['children'] = raw_data['children'].astype(int)
raw_data['children'].dtype
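
The cast to `int` relies on the missing values having been dropped first: a numeric column that contains `NaN` is held as a float column, and pandas refuses to convert `NaN` to an integer. A minimal sketch with a toy column (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration: the NaN forces a float dtype.
children = pd.Series([0.0, np.nan, 2.0], name='children')

try:
    children.astype(int)   # raises while a NaN is still present
    cast_failed = False
except ValueError:
    cast_failed = True

# Once the missing entry is removed, the cast succeeds.
children_int = children.dropna().astype(int)

print(cast_failed, children_int.tolist())
```

This is why the order of the cells matters here: running the `astype(int)` cell before the `dropna` cell would be expected to fail if `children` still holds missing values.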

In [None]:
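# Count bookings per arrival month (left disabled).
# Assumes the data frame in use is raw_data; the original referred to an undefined name `hotel`.
# grouped_month = raw_data.groupby('arrival_date_month')
# month_count = grouped_month['hotel'].count()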