# Expedia Hotel Recommendations

Written by: Lais Amorim Menezes

Contact Email: laisamorimmenezes@gmail.com

Date Filled: Oct 16, 2023

### Table of Contents:

1. Introduction
2. Key Questions
3. Assumptions & Limitations
4. Methods
5. Finding & Analysis

* [Analysis for column `Date_time`](#1)
* [Analysis for column `Site_name`](#2)
* [Analysis for column `Posa_continents`](#3)
* [Analysis for column `User_location_country`](#4)
* [Analysis for column `User_location_region` and `User_location_city`](#5)
* [Analysis for column `Orig_destination_distance`](#6)
* [Analysis for column `User_id`](#7)
* [Analysis for column `Is_mobile`](#8)
* [Analysis for column `Is_package`](#9)
* [Analysis for column `Channel`](#10)
* [Analysis for column `Srch_ci`, `Srch_co` and `number_of_days`](#11)
* [Analysis for column `Srch_adults_cnt`, `Srch_children_cnt` and `Srch_rm_cnt`](#12)
* [Analysis for column `Srch_destination_id` and `Srch_destination_type_id`](#13)
* [Analysis for column `Is_booking`](#14)
* [Analysis for column `Cnt`](#15)
* [Analysis for column `Hotel_continent`, `Hotel_country` and `Hotel_market`](#16)
* [Analysis for column `Hotel_cluster`](#17)
* [Some graphics](#18)


## Introduction:

This business report aims to analyze the customer interactions on Expedia's website. The dataset comprises a selection of records from Expedia's vast collection. Due to the dataset's substantial size, a sample has been taken to facilitate computational efficiency.

The primary objective of this analysis is to predict the hotel cluster that a user is likely to book. Expedia provides hotel clusters, produced by an in-house algorithm that groups similar hotels based on factors such as historical pricing, customer ratings, and proximity to city centers. These clusters are invaluable for predicting user preferences when booking hotels. The dataset contains 100 distinct hotel clusters. The predictive goal is to anticipate a user's booking outcome (hotel cluster) from their search and related event attributes. The training data covers the period from 2013 to July 2014, with the test data spanning August to December 2014.

## Key Questions:

1. How can we effectively sample and clean the data to prepare it for analysis?
2. What insights can we gather from each column to gain a comprehensive understanding of their significance?
3. How can we draw meaningful conclusions and identify the target variable, along with selecting suitable models for predictive analysis?

## Assumptions & Limitations:

Assumptions:

1. Data Accuracy: It is assumed that the provided data accurately represents a subset of Expedia's extensive dataset.
2. Relevance of Clusters: The assumption is made that hotel clusters are significant identifiers of user booking preferences.

Limitations:

1. Data Encoding: The dataset lacks categorical columns, as all variables are stored numerically. This necessitates the non-standard treatment of numerical columns.
2. Temporal Constraints: The data covers a limited timeframe and does not capture long-term trends or seasonal variations.

## Methods:

The analysis begins with the acquisition of a representative sample from Expedia's customer interaction dataset, accessible on the Kaggle website. The report proceeds by splitting the data into training and test sets based on each interaction's datetime, so that prediction accuracy can later be assessed on interactions that occur after the training period. Data cleaning is undertaken to ensure data quality, followed by a comprehensive exploratory data analysis aimed at unveiling the underlying characteristics and interpretations of each column.

## Finding & Analysis:

In the year 2013, the months with the highest website interactions were March, July, and October. This pattern continues into 2014, with July being the peak month. Unfortunately, our data only goes up to July 2014. The year 2014 shows an increase in interactions compared to 2013, indicating growing website engagement.

We observed that each site name (e.g., Expedia.com, Expedia.co.uk) corresponds to only one continent, although multiple site names may link to the same continent. Interactions are concentrated in a few sites, which suggests that users in certain locations are more engaged in searches, possibly where Expedia is more popular.

The median distance between the origin and destination points is approximately 1168.39 miles, roughly a 2-hour flight. This gives us an idea of the typical travel distance customers are interested in.

Most users access the website from non-mobile devices. Additionally, they often search for hotel options without combining them with flight packages, implying that a significant portion of the users might be primarily interested in hotels.

The majority of customers searching the website are likely looking for a one-day hotel stay, typically for two people, likely a couple, booking only one room and traveling without children. They also show a preference for starting their hotel stays on Sundays and tend to conduct searches primarily on weekdays.

In terms of destination types, there are 11 options available. More than half of the destinations are classified as type 1. This suggests a high demand for a particular type of destination.

Only 8.71% of the interactions result in actual bookings.

Destinations in different continents vary in the number of associated countries. Some continents have many countries to explore, while others have as few as 2.

The target variable for our predictive modeling is the hotel cluster.

These findings provide valuable insights into user behavior and preferences on the Expedia website, which will be crucial for any predictive modeling or recommendations for users.

## EDA

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow.parquet as pq
import plotly.express as px
from scipy import stats
from scipy.stats import norm

In [None]:
# Loading the data
hotel_data = pd.read_csv('/Users/laisamorim/Desktop/Brain Station Course/Capstone/Notebooks/Data/train.csv')

Because the full dataset is very large, I drew a random sample of 4,670,293 rows and fixed `random_state` so that the sample can be reproduced if needed.

In [None]:
hotel_df = hotel_data.sample(n = 4670293, random_state = 42)

In [None]:
# Save the sampled data to a CSV file
hotel_df.to_csv('hotel_df.csv', index=False)

In [None]:
# Loading the sampled data
hotel_df = pd.read_csv('/Users/laisamorim/Desktop/Brain Station Course/Capstone/Notebooks/Data/hotel_df.csv')

In [None]:
hotel_df.head()

In [None]:
hotel_df.columns

In [None]:
hotel_df.info()

In [None]:
hotel_df['date_time'] = pd.to_datetime(hotel_df['date_time'])

In [None]:
hotel_df.info()

Because the `date_time` column was stored as an object, I converted it to datetime. Next, I split the data by date: the training data covers 2013 through July 2014, and the test data covers August through the end of 2014.

In [None]:
# Take the test rows first, while hotel_df still contains August-December 2014
test = hotel_df[((hotel_df['date_time'].dt.year == 2014) & (hotel_df['date_time'].dt.month >= 8))]
hotel_df = hotel_df[((hotel_df['date_time'].dt.year == 2013) | ((hotel_df['date_time'].dt.year == 2014) & (hotel_df['date_time'].dt.month < 8)))]
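The date-based split described above can be sketched on a small made-up frame. This is only an illustration: `toy`, `train_toy`, and `test_toy` are hypothetical names, the values are invented, and the sketch assumes the data contains only 2013 and 2014 so that "not test" is the same as "train". The mask is built once from the unfiltered frame so that the test slice is not taken from rows that were already removed.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sampled data (values are made up)
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2013-03-01 10:00:00", "2014-07-31 23:59:59",
        "2014-08-01 00:00:00", "2014-12-15 08:30:00",
    ]),
    "hotel_cluster": [5, 17, 42, 99],
})

# Build the test mask once, from the unfiltered frame
is_test = (toy["date_time"].dt.year == 2014) & (toy["date_time"].dt.month >= 8)

# Assumption: only 2013 and 2014 are present, so everything else is training data
train_toy = toy[~is_test]
test_toy = toy[is_test]

print(len(train_toy), len(test_toy))
```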

In [None]:
hotel_df.head()

In [None]:
print(f'There are {hotel_df.shape[0]} rows and {hotel_df.shape[1]} columns')

In [None]:
hotel_df.info()

In [None]:
hotel_df.duplicated().sum()

In [None]:
duplicate_rows = hotel_df[hotel_df.duplicated(keep=False)]

# Sort in descending order
duplicate_rows = duplicate_rows.sort_values(by='date_time', ascending=False)

duplicate_rows

In [None]:
duplicate_percentage = (hotel_df.duplicated().sum() / len(hotel_df)) * 100
duplicate_percentage

In [None]:
hotel_df = hotel_df.drop_duplicates()

In [None]:
hotel_df.duplicated().sum()

Upon closer inspection of the data, I identified 11 duplicate rows. After further examination, and considering the small percentage of the data they represent, I decided to remove these rows from the dataset.
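As a minimal sketch of how the two duplicate counts used above differ, the made-up frame below compares `duplicated()` (which flags only the extra copies) with `duplicated(keep=False)` (which flags every row that has a twin). The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows 0/1 are identical, and rows 3/4 are identical
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "hotel_cluster": [10, 10, 20, 30, 30, 31],
})

# Extra copies only: these are the rows drop_duplicates() would remove
extra_copies = int(toy.duplicated().sum())

# Every row involved in a duplicate group, including the first occurrence
all_involved = int(toy.duplicated(keep=False).sum())

deduped = toy.drop_duplicates()
print(extra_copies, all_involved, len(deduped))
```

The first number is the one that matters when computing what share of the data is lost by deduplication.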

In [None]:
hotel_df.isna().sum()

In [None]:
hotel_df[hotel_df['orig_destination_distance'].isna()]

In [None]:
percentage_nan = (hotel_df['orig_destination_distance'].isna().sum() / len(hotel_df)) * 100
percentage_nan

In [None]:
hotel_df.dropna(subset=['orig_destination_distance'], inplace=True)

In [None]:
hotel_df['orig_destination_distance'].isna().sum()

I decided to drop the rows where `orig_destination_distance` is missing. The column contains a significant number of NaN values, exceeding 20% of the rows.

In [None]:
percentage_nan = (hotel_df['srch_ci'].isna().sum() / len(hotel_df)) * 100
percentage_nan

In [None]:
hotel_df.dropna(subset=['srch_ci'], inplace=True)

In [None]:
hotel_df['srch_ci'].isna().sum()

In [None]:
hotel_df['srch_co'].isna().sum()

I decided to drop the NaN values in the `srch_ci` column, as they accounted for just 0.159% of the data. Upon further investigation, I found that the same rows with NaN values in `srch_ci` also had NaN values in `srch_co`. Consequently, I have successfully eliminated NaN values in both columns.
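One possible way to check both statements at once (the share of missing values per column, and whether `srch_ci` and `srch_co` are missing on the same rows) is sketched below. The frame is made up and the names `nan_share` and `same_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (values are made up)
toy = pd.DataFrame({
    "srch_ci": ["2014-01-03", None, "2014-02-10", None],
    "srch_co": ["2014-01-05", None, "2014-02-11", None],
    "orig_destination_distance": [120.5, np.nan, np.nan, 88.0],
})

# Percentage of missing values per column (mean of a boolean mask)
nan_share = toy.isna().mean() * 100

# True only if srch_ci and srch_co are missing on exactly the same rows
same_rows = bool((toy["srch_ci"].isna() == toy["srch_co"].isna()).all())

print(nan_share)
print(same_rows)
```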

In [None]:
hotel_df.isna().sum()

In [None]:
# Keep the cleaned data under a new name (copy, so later column changes do not touch hotel_df)
hotel_clean = hotel_df.copy()

In [None]:
# Checking the data
hotel_clean.head(5)

#### Making an analysis in column `Date_time` <a class= 'anchor' id = '1'></a> 

In [None]:
hotel_clean.info()

I already verified that the `Date_time` column was initially in object format, and to facilitate data manipulation within this column, I converted it to a datetime format.

In [None]:
hotel_clean

In [None]:
hotel_clean['date_time'].dt.year.unique()

In [None]:
hotel_clean['date_time'].dt.month.unique()

In [None]:
# Check that no data from August 2014 onward remains
hotel_clean[(hotel_clean['date_time'].dt.year == 2014) & (hotel_clean['date_time'].dt.month >= 8)]

In [None]:
# Filter the data for the year 2013
data_2013 = hotel_clean[hotel_clean['date_time'].dt.year == 2013]

# Plot the number of interactions (clicks and bookings) per month in 2013
plt.figure(figsize=(12, 6))
data_2013['date_time'].dt.month.value_counts().sort_index().plot()
plt.xlabel('Month')
plt.ylabel('Number of Interactions')
plt.title('Website Interactions per Month in 2013')
plt.xticks(range(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

The data reveals distinct patterns in website traffic throughout the year 2013. At the beginning and end of the year, there were notably lower levels of activity. However, during the summer vacation month of July, as well as in March and October, there were significant peaks in website traffic.

In [None]:
# Filter the data for the year 2014
data_2014 = hotel_clean[hotel_clean['date_time'].dt.year == 2014]

# Plot the number of interactions (clicks and bookings) per month in 2014
plt.figure(figsize=(12, 6))
data_2014['date_time'].dt.month.value_counts().sort_index().plot()
plt.xlabel('Month')
plt.ylabel('Number of Interactions')
plt.title('Website Interactions per Month in 2014')
plt.xticks(range(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

In 2014, website traffic increased significantly compared with 2013. July, the latest month for which we have data, was the busiest month. The year began with relatively low activity in January and February, and we again observed distinct spikes in traffic during March and July, reminiscent of the patterns seen in 2013.

#### Making an analysis in column `site_name` <a class= 'anchor' id = '2'></a> 

In [None]:
website_count = hotel_clean['site_name'].value_counts()
total_count = len(hotel_clean)

percentages = (website_count / total_count) * 100
percentages

In [None]:
hotel_clean['site_name'].nunique()

In [None]:
percentage_less_than_1 = percentages <= 1
percentage_less_than_1.value_counts()

Site 2 accounts for 75.14% of customer interactions, while 33 of the 39 sites (e.g., Expedia.com, Expedia.co.uk, Expedia.co.jp) each receive no more than 1% of customer interactions.
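The share calculation above can also be written with `value_counts(normalize=True)`. The sketch below uses a made-up series and a hypothetical `threshold` of 15% (the notebook uses 1%) so that the toy numbers stay readable.

```python
import pandas as pd

# Hypothetical toy site IDs (values are made up)
site_name = pd.Series([2, 2, 2, 2, 2, 2, 11, 11, 24, 37], name="site_name")

# Share of interactions per site, in percent
share = site_name.value_counts(normalize=True) * 100

# Count the sites at or below the threshold (15% here, 1% in the notebook)
threshold = 15
small_sites = int((share <= threshold).sum())

print(share)
print(small_sites)
```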

In [None]:
site_counts = hotel_clean['site_name'].value_counts()

# Sort the site by frequency in descending order
site_counts = site_counts.sort_values(ascending=False)

# Create a bar plot
plt.figure(figsize=(12, 6))
site_counts.plot(kind='bar')
plt.xlabel('Site name')
plt.ylabel('Frequency')
plt.title('Frequency of Each Site Name in the Dataset')
plt.xticks(rotation=45) 

# Show the plot
plt.show()

#### Making an analysis in column `posa_continent` <a class= 'anchor' id = '3'></a> 

In [None]:
hotel_clean['posa_continent'].unique()

In [None]:
filtered_data = hotel_clean[hotel_clean['site_name'] == 2]

# Print the filtered data
print(filtered_data['posa_continent'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['site_name'] == 40]

# Print the filtered data
print(filtered_data['posa_continent'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['posa_continent'] == 3]

# Print the filtered data
print(filtered_data['site_name'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['posa_continent'] == 1]

# Print the filtered data
print(filtered_data['site_name'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['posa_continent'] == 4]

# Print the filtered data
print(filtered_data['site_name'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['posa_continent'] == 0]

# Print the filtered data
print(filtered_data['site_name'].unique())

In [None]:
filtered_data = hotel_clean[hotel_clean['posa_continent'] == 2]

# Print the filtered data
print(filtered_data['site_name'].unique())

There are five continents associated with the site names, labeled 3, 1, 4, 0, and 2. Each continent corresponds to a set of site names through which users can access the Expedia website, and each site belongs to exactly one continent.
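Rather than filtering one continent at a time, the same relationship can be checked in a single step with `groupby(...).nunique()`. This is a minimal sketch on made-up data. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy mapping of sites to continents (values are made up)
toy = pd.DataFrame({
    "site_name": [2, 2, 11, 24, 24, 37],
    "posa_continent": [3, 3, 3, 1, 1, 2],
})

# How many distinct continents does each site appear in?
continents_per_site = toy.groupby("site_name")["posa_continent"].nunique()

# How many distinct sites does each continent contain?
sites_per_continent = toy.groupby("posa_continent")["site_name"].nunique()

# True if every site is tied to exactly one continent
each_site_has_one_continent = bool((continents_per_site == 1).all())

print(each_site_has_one_continent)
print(sites_per_continent)
```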

In [None]:
# Group the data by 'posa_continent' and 'site_name' and count the occurrences
grouped_data = hotel_clean.groupby(['posa_continent', 'site_name']).size().unstack(fill_value=0)

# Create a stacked bar plot
grouped_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Customize the plot with labels and titles
plt.xlabel('Continent')
plt.ylabel('Count')
plt.title('Distribution of Site Names by Continent')

# Show the plot
plt.legend(title='Site Names', loc='upper right')
plt.show()

In [None]:
# Group the data by 'posa_continent' and 'site_name' and count the occurrences
grouped_data = hotel_clean.groupby(['posa_continent', 'site_name']).size().unstack(fill_value=0)

# Calculate the total count for each 'posa_continent'
total_counts = grouped_data.sum(axis=1)

# Calculate the percentage for each 'site_name'
percentage_data = grouped_data.divide(total_counts, axis=0) * 100

# Create a stacked bar plot with percentages
percentage_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Customize the plot with labels and titles
plt.xlabel('Continent')
plt.ylabel('Percentage')
plt.title('Percentage Distribution of Site Names by Continent')

# Show the plot
plt.legend(title='Site Names', loc='upper right')
plt.show()

In [None]:
# getting the percentage 
df_filtered = percentage_data[percentage_data > 0]
df_filtered.head(5)

This graphic shows the percentage share of each site within each continent. The highest interaction volume comes from the continent that hosts site 2, since that site alone accounts for 75.14% of all interactions.

#### Making an analysis in column `user_location_country` <a class= 'anchor' id = '4'></a> 

In [None]:
hotel_clean['user_location_country'].nunique()

In [None]:
hotel_clean['user_location_country'].unique()

In [None]:
# Counting the countries
country_counts = hotel_clean['user_location_country'].value_counts()

# Sort the countries by frequency in descending order
country_counts = country_counts.sort_values(ascending=False)

# Create a bar plot
plt.figure(figsize=(12, 6))
country_counts.plot(kind='bar')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.title('Frequency of Each Country in the Dataset')
plt.xticks(rotation=45) 

# Show the plot
plt.show()

This graphic shows that the dataset contains 21 countries and that country 66 has the highest customer engagement.

In [None]:
hotel_clean.head(10)

#### Making an analysis in columns `user_location_region` and `user_location_city` <a class= 'anchor' id = '5'></a> 

In [None]:
hotel_clean['user_location_region'].nunique()

In [None]:
hotel_clean['user_location_city'].nunique()

In [None]:
hotel_clean['user_location_region'].unique()

This dataset contains 218 distinct regions and a total of 8,456 unique cities.

#### Making an analysis in column `orig_destination_distance` <a class= 'anchor' id = '6'></a> 

In [None]:
hotel_clean['orig_destination_distance'].nunique()

In [None]:
hotel_clean['orig_destination_distance'].median()

In [None]:
# Set the float format to display numbers without scientific notation
pd.options.display.float_format = '{:.2f}'.format

# Describe
column_description = hotel_clean['orig_destination_distance'].describe()
print(column_description)

In [None]:
plt.figure(figsize=(8, 6))
plt.boxplot(hotel_clean['orig_destination_distance'], vert=False)  
plt.title('Distribution of Distance')
plt.xlabel('Distance')
plt.show()

The majority of distances between customers and the hotels they search for are below 3000, with an average distance of approximately 1975.54. The range is very wide: the maximum distance recorded is 12280.48, while the minimum is a mere 0.01. The median shows that half of the searches are for hotels within approximately 1168.39 of the customer's location.
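The mean reported above (1975.54) is well above the median (1168.39), which is what a right-skewed distribution looks like: a few very long trips pull the mean up. A minimal sketch of that comparison on made-up distances follows. The series and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy distances with one very long trip (values are made up)
distance = pd.Series([0.5, 40.0, 300.0, 1200.0, 2500.0, 9000.0])

# Quartiles describe the bulk of the data without being pulled by extremes
quartiles = distance.quantile([0.25, 0.5, 0.75])

# A positive gap between mean and median indicates a long right tail
mean_minus_median = distance.mean() - distance.median()

print(quartiles)
print(round(mean_minus_median, 2))
```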

#### Making an analysis in column `user_id` <a class= 'anchor' id = '7'></a> 

In [None]:
hotel_clean['user_id'].nunique()

In [None]:
customer_frequency = hotel_clean['user_id'].value_counts()

# Calculate the average frequency
average_frequency = customer_frequency.mean()

print(f'Average Frequency of Customer Interactions: {average_frequency:.2f}')

The mean interaction frequency per customer is approximately 3.89, which gives a baseline measure of customer engagement.

#### Making an analysis in column `is_mobile` <a class= 'anchor' id = '8'></a> 

In [None]:
hotel_clean['is_mobile'].value_counts() / len(hotel_clean) * 100

In this dataset, the vast majority of customers, accounting for 87.05%, accessed the website from non-mobile devices, while the remaining 12.95% accessed it from mobile devices.

#### Making an analysis in column `is_package` <a class= 'anchor' id = '9'></a> 

In [None]:
hotel_clean['is_package'].value_counts() / len(hotel_clean) * 100

Within this dataset, the majority of clicks/bookings, totaling 75.39%, were for hotels alone, not combined with a flight package. In contrast, 24.61% of the interactions involved such combined packages.

#### Making an analysis in column `channel` <a class= 'anchor' id = '10'></a> 

In [None]:
hotel_clean['channel'].unique()

In [None]:
# Counting the channel
channel_counts = hotel_clean['channel'].value_counts()

# Sort the channel by frequency in descending order
channel_counts = channel_counts.sort_values(ascending=False)

# Calculate the percentages
percentages = (channel_counts / len(hotel_clean) * 100).round(2)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = channel_counts.plot(kind='bar')
plt.xlabel('Channel')
plt.ylabel('Frequency')
plt.title('Frequency of Each Channel in the Dataset')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(channel_counts, percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

Among the 11 channel IDs, Channel 9 accounts for a clear majority of interactions (60.22%). Additionally, four channels each contribute less than 1% of the dataset.

#### Making an analysis in columns `srch_ci`, `srch_co` and `number_of_days` <a class= 'anchor' id = '11'></a> 

In [None]:
hotel_clean['srch_ci'].head(10)

In [None]:
hotel_clean['srch_co'].head(10)

In [None]:
hotel_clean.info()

In [None]:
# Convert date columns to datetime objects
hotel_clean['srch_ci'] = pd.to_datetime(hotel_clean['srch_ci'])
hotel_clean['srch_co'] = pd.to_datetime(hotel_clean['srch_co'])

In [None]:
hotel_clean.info()

In [None]:
hotel_clean['number_of_days'] = (hotel_clean['srch_co'] - hotel_clean['srch_ci']).dt.days
hotel_clean['number_of_days'] 

In [None]:
hotel_clean['number_of_days'].nunique()

In [None]:
hotel_clean['number_of_days'].value_counts() / len(hotel_clean) * 100

In [None]:
# Counting the days
days_counts = hotel_clean['number_of_days'].value_counts()

# Sort the days by frequency in descending order
days_counts = days_counts.sort_values(ascending=False)

# Calculate the percentages
days_percentages = (days_counts / len(hotel_clean) * 100).round(2)

# Select the top 10 records
top_10_days_counts = days_counts.head(10)
top_10_days_percentages = days_percentages.head(10)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = top_10_days_counts.plot(kind='bar')
plt.xlabel('Number of days')
plt.ylabel('Frequency')
plt.title('Top 10 Frequencies of Days in the Dataset')
plt.xticks(rotation=0)

# Adding percentage labels to the bars
for i, (v, p) in enumerate(zip(top_10_days_counts, top_10_days_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

This graphic highlights the 10 most common stay lengths in the searches. Notably, around 27.8% of searches are for a single-day hotel stay. In contrast, searches for extended stays of over ten days are a small minority, amounting to less than 1% of the searches.
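The stay-length calculation can be sketched on a few made-up rows as follows. The use of `errors="coerce"` is an assumption added here, not something the notebook does: it turns an unparseable date string into `NaT` instead of raising, so the affected row ends up with a missing stay length.

```python
import pandas as pd

# Hypothetical toy check-in/check-out strings (values are made up)
toy = pd.DataFrame({
    "srch_ci": ["2014-05-02", "2014-05-10", "2014-06-01", "not a date"],
    "srch_co": ["2014-05-03", "2014-05-12", "2014-06-02", "2014-06-05"],
})

# Assumption: coerce malformed strings to NaT rather than failing
ci = pd.to_datetime(toy["srch_ci"], errors="coerce")
co = pd.to_datetime(toy["srch_co"], errors="coerce")

# Stay length in days (NaN where either date could not be parsed)
nights = (co - ci).dt.days

# Share of each stay length in percent, keeping the missing bucket visible
share = nights.value_counts(normalize=True, dropna=False) * 100

print(nights.tolist())
print(share)
```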

In [None]:
hotel_day = hotel_clean['date_time'].dt.day_name()

# Then, count the occurrences of each day
day_counts = hotel_day.value_counts()

# Sort the days by frequency in descending order
day_counts = day_counts.sort_values(ascending=False)

# Calculate the percentages
day_percentages = (day_counts / len(hotel_clean) * 100).round(2)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = day_counts.plot(kind='bar')
plt.xlabel('Day of the Week')
plt.ylabel('Frequency')
plt.title('Frequency of Each Day in the Dataset')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(day_counts, day_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()


Wednesday, closely followed by Monday and Tuesday, is the day of the week when users are most active in terms of clicking and making bookings on the website. Conversely, activity is at its lowest on weekends, indicating reduced user engagement during that time.
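When comparing weekdays with weekends, it can help to list the days in calendar order rather than by frequency. A minimal sketch, using made-up timestamps and a hypothetical `weekday_order` list, reindexes the counts so that days with no interactions still appear with a count of 0.

```python
import pandas as pd

# Hypothetical toy timestamps (values are made up)
stamps = pd.Series(pd.to_datetime([
    "2014-01-06", "2014-01-07", "2014-01-08", "2014-01-08", "2014-01-11",
]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count interactions per day name, then put the days in calendar order
day_counts = (
    stamps.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)

print(day_counts)
```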

In [None]:
hotel_clean['srch_co'].head(10)

In [None]:
# Use the check-in date (srch_ci) to get the check-in day of the week
checkin_day = hotel_clean['srch_ci'].dt.day_name()

# Then, count the occurrences of each day
checkin_counts = checkin_day.value_counts()

# Sort the days by frequency in descending order
checkin_counts = checkin_counts.sort_values(ascending=False)

# Calculate the percentages
checkin_percentages = (checkin_counts / len(hotel_clean) * 100).round(2)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = checkin_counts.plot(kind='bar')
plt.xlabel('Day of the Week')
plt.ylabel('Frequency')
plt.title('Frequency of Check-in Day in the Dataset')
plt.xticks(rotation=0)


# Adding labels to the top of the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(checkin_counts, checkin_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')


# Show the plot
plt.show()

Sunday is the most popular check-in day of the week in customers' clicks/bookings.

#### Making an analysis in columns `srch_adults_cnt`, `srch_children_cnt` and `srch_rm_cnt` <a class= 'anchor' id = '12'></a> 

In [None]:
hotel_clean['srch_adults_cnt']

In [None]:
sorted(hotel_clean['srch_adults_cnt'].unique())

In [None]:
(hotel_clean['srch_adults_cnt'] == 0).sum()


In [None]:
# Then, count the occurrences of adults
adults_counts = (hotel_clean['srch_adults_cnt'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
adults_percentages = (adults_counts / len(hotel_clean) * 100).round(2)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = adults_counts.plot(kind='bar')
plt.xlabel('Number of Adults')
plt.ylabel('Frequency')
plt.title('Frequency of Adults in the Dataset')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(adults_counts, adults_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

The majority of clicks and bookings are made for 2 adults. Furthermore, bookings for 5 or more adults are relatively rare. This suggests that the most common scenario involves couples booking hotels. Now, let's shift our focus to the presence of children.

In [None]:
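# Count the occurrences of each unique value in 'srch_adults_cnt'
adults_counts = hotel_clean['srch_adults_cnt'].value_counts().sort_index()

# Count the occurrences of each unique value in 'srch_children_cnt'
children_counts = hotel_clean['srch_children_cnt'].value_counts().sort_index()

# Align both series on the same guest counts so the paired bars line up
guest_values = adults_counts.index.union(children_counts.index)
adults_counts = adults_counts.reindex(guest_values, fill_value=0)
children_counts = children_counts.reindex(guest_values, fill_value=0)

# Create a bar plot
plt.figure(figsize=(10, 6))
width = 0.35
x = range(len(adults_counts))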

plt.bar(x, adults_counts, width, label='Adults Count')
plt.bar([i + width for i in x], children_counts, width, label='Children Count')

# Customize the plot with labels and titles
plt.xlabel('Number of Guests')
plt.ylabel('Frequency')
plt.title('Frequency of Adults and Children Counts in Searches')
plt.xticks([i + width/2 for i in x], adults_counts.index)

# Show the plot
plt.legend()
plt.show()


The frequency decreases as the number of children increases.

In [None]:
sorted(hotel_clean['srch_children_cnt'].unique())

In [None]:
# Count occurrences of 0 in the column
count_of_zeros = (hotel_clean['srch_children_cnt'] == 0).sum()

# Count occurrences of non-zero values in the column
count_of_non_zeros = (hotel_clean['srch_children_cnt'] != 0).sum()

# Data for the pie chart
labels = ['No Children', 'Have Children']
sizes = [count_of_zeros, count_of_non_zeros]
colors = ['#ff9999', '#66b3ff']  # Custom colors

# Explode the 'Have Children' slice
explode = (0, 0.1)

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle

plt.title('Proportion of Searches with or without Children')
plt.show()

In the graphic above, we observe that a significant portion of searches on the website do not include children.

In [None]:
hotel_clean['srch_rm_cnt'].unique()

In [None]:
# Count the occurrences of each room count
room_counts = (hotel_clean['srch_rm_cnt'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
room_percentages = (room_counts / len(hotel_clean) * 100).round(3)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = room_counts.plot(kind='bar')
plt.xlabel('Number of Rooms')
plt.ylabel('Frequency')
plt.title('Frequency of Rooms in the Dataset')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(room_counts, room_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

Given that a substantial portion of clicks/bookings is for couples (two adults), it's logical to observe that most searches correspond to a single room. The frequency of searches falls steadily as the number of rooms requested increases.

#### Making an analysis in columns `srch_destination_id` and `srch_destination_type_id` <a class= 'anchor' id = '13'></a> 

In [None]:
hotel_clean['srch_destination_id'].nunique()

In [None]:
# Counting 
destination_counts = (hotel_clean['srch_destination_id'].value_counts()).sort_values(ascending = False)

# Calculate the percentages
destination_percentages = (destination_counts / len(hotel_clean) * 100).round(2)

# Select the top 10 records
top_10_destination_counts = destination_counts.head(10)
top_10_percentages = destination_percentages.head(10)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = top_10_destination_counts.plot(kind='bar')
plt.xlabel('Destinations')
plt.ylabel('Frequency')
plt.title('Top 10 Destinations in the Dataset')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(top_10_destination_counts, top_10_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()


This dataset encompasses 26,025 destinations, and the graphic highlights the 10 most frequently searched destinations on the website. With so many destination options, even the top choice accounts for only 4.26% of the total search frequency.

In [None]:
hotel_clean['srch_destination_type_id'].unique()

In [None]:
destination_type_counts = (hotel_clean['srch_destination_type_id'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
destination_type_percentages = (destination_type_counts / len(hotel_clean) * 100).round(5)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = destination_type_counts.plot(kind='bar')
plt.xlabel('Type of destination')
plt.ylabel('Frequency')
plt.title('Frequency of type of destination')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(destination_type_counts, destination_type_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

This graphic illustrates that the majority of destinations fall under type 1, while types 7 and 9 represent only a small fraction of the total destinations.

#### Making an analysis in column `is_booking` <a class= 'anchor' id = '14'></a> 

In [None]:
# Count occurrences of 0 in the column
count_click = (hotel_clean['is_booking'] == 0).sum()

# Count occurrences of 1 in the column
count_booking = (hotel_clean['is_booking'] == 1).sum()

# Calculate the percentages
total = count_click + count_booking
percentage_click = (count_click / total) * 100
percentage_booking = (count_booking / total) * 100

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(['Booking', 'Click'], [count_booking, count_click])
plt.xlabel('Interaction in the website')
plt.ylabel('Count')
plt.title('Number of booking and clicking in the website')

# Show the percentages on top of the bars
plt.text(0, count_booking, f'{percentage_booking:.2f}%', ha='center', va='bottom')
plt.text(1, count_click, f'{percentage_click:.2f}%', ha='center', va='bottom')

plt.show()

Observing the data, it's evident that a substantial majority, amounting to 91.29%, of interactions on the website are merely clicks, whereas a more modest 8.71% represent actual bookings.
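Because `is_booking` is a 0/1 flag, its mean is the booking rate, which makes it easy to compare rates across groups. The sketch below uses made-up rows and groups by `is_mobile` purely as an illustration. The notebook itself does not report booking rates by device.

```python
import pandas as pd

# Hypothetical toy interactions (values are made up)
toy = pd.DataFrame({
    "is_mobile": [0, 0, 0, 0, 1, 1],
    "is_booking": [1, 0, 0, 0, 0, 1],
})

# Overall booking rate in percent: the mean of a 0/1 column
overall_rate = toy["is_booking"].mean() * 100

# Booking rate per device type, in percent
rate_by_device = toy.groupby("is_mobile")["is_booking"].mean() * 100

print(round(overall_rate, 2))
print(rate_by_device)
```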

#### Making an analysis in column `cnt` <a class= 'anchor' id = '15'></a> 

In [None]:
hotel_clean['cnt'].unique()

In [None]:
# Count the occurrences of each 
cnt_counts = (hotel_clean['cnt'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
cnt_percentages = (cnt_counts / len(hotel_clean) * 100).round(2)

# Select the top 10 records
top_10_cnt_counts = cnt_counts.head(10)
top_10_percentages = cnt_percentages.head(10)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = top_10_cnt_counts.plot(kind='bar')
plt.xlabel('Number of similar events for the same user session')
plt.ylabel('Frequency')
plt.title('Frequency of similar events')
plt.xticks(rotation=0)

# Adding labels to the bars with frequencies and percentages
for i, (v, p) in enumerate(zip(top_10_cnt_counts, top_10_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

The graph displays the top 10 similar events within the context of the same user session. It's evident that the percentage decreases as the number of similar interactions increases. Specifically, only 15.59% of events occur twice, while those occurring more than six times represent less than 1% of the total.

#### Making an analysis in columns `hotel_continent`, `hotel_country` and `hotel_market` <a class= 'anchor' id = '16'></a> 

In [None]:
hotel_clean['hotel_continent'].unique()

In [None]:
hotel_clean['hotel_country'].unique()

In [None]:
# Group the data by 'hotel_continent' and 'hotel_country' and count the occurrences
grouped_data = hotel_clean.groupby(['hotel_continent', 'hotel_country']).size().unstack(fill_value=0)

# Create a stacked bar plot
grouped_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Customize the plot with labels and titles
plt.xlabel('Hotel Continent')
plt.ylabel('Distribution')
plt.title('Distribution of Hotel Country by Continent')

# Show the plot
plt.legend(title='Hotel Country', loc='upper right')
plt.show()

In [None]:
# Group the data by 'hotel_continent' and 'hotel_country' and count the occurrences
grouped_data = hotel_clean.groupby(['hotel_continent', 'hotel_country']).size().unstack(fill_value=0)

# Calculate the total count for each 'hotel_continent'
total_counts = grouped_data.sum(axis=1)

# Calculate the percentage of each 'hotel_country' within its continent
percentage_data = grouped_data.divide(total_counts, axis=0) * 100

# Create a stacked bar plot with percentages
percentage_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Customize the plot with labels and titles
plt.xlabel('Hotel Continent')
plt.ylabel('Percentage')
plt.title('Percentage of Hotel Country by Continent')

# Show the plot
plt.legend(title='Hotel Country', loc='upper right')
plt.show()

In [None]:
# Count the occurrences of each hotel continent
continents_counts = (hotel_clean['hotel_continent'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
continents_percentages = (continents_counts / len(hotel_clean) * 100).round(3)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = continents_counts.plot(kind='bar')
plt.xlabel('Hotel Continents')
plt.ylabel('Frequency')
plt.title('Frequency of Hotel Continents')
plt.xticks(rotation=0)

# Label each bar with its percentage of all events
for i, (v, p) in enumerate(zip(continents_counts, continents_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

In [None]:
# Keep only the events for hotels in continent 2
filtered_data = hotel_clean[hotel_clean['hotel_continent'] == 2]

# Print the distinct hotel countries in that continent
print(filtered_data['hotel_country'].unique())

In [None]:
# Keep only the events for hotels in continent 6
filtered_data = hotel_clean[hotel_clean['hotel_continent'] == 6]

# Print the distinct hotel countries in that continent
print(filtered_data['hotel_country'].unique())

Most website interactions involve hotels in continent 2, whereas continents 0, 5, and 1 receive considerably fewer. Within continent 2, interactions are concentrated in two countries, 50 and 198. Apart from continents 1 and 2, the remaining continents show a greater diversity of countries.
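The concentration of countries within each continent can also be summarised in a small table rather than read off the stacked bars. This is a minimal sketch on hypothetical toy rows: the country IDs 50 and 198 come from the analysis above, while the other IDs and all row counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for hotel_clean
toy = pd.DataFrame({
    'hotel_continent': [2, 2, 2, 2, 6, 6, 6, 3],
    'hotel_country':   [50, 50, 50, 198, 70, 105, 204, 182],
})

grouped = toy.groupby('hotel_continent')['hotel_country']

# Number of distinct countries seen in each continent
n_countries = grouped.nunique()

# Percentage of each continent's events that go to its most frequent country
top_share = grouped.agg(lambda s: s.value_counts(normalize=True).max() * 100)

summary = pd.DataFrame({
    'n_countries': n_countries,
    'top_country_share_pct': top_share.round(2),
})
print(summary)
```

A continent with few distinct countries and a high top-country share is one where interactions are concentrated, as described for continent 2.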

In [None]:
hotel_clean['hotel_market'].nunique()

#### Making an analysis in column `hotel_cluster` <a class= 'anchor' id = '17'></a> 

In [None]:
hotel_clean['hotel_cluster'].nunique()

In [None]:
# Count the occurrences of each hotel cluster
hotel_cluster_counts = (hotel_clean['hotel_cluster'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
hotel_cluster_percentages = (hotel_cluster_counts / len(hotel_clean) * 100).round(2)

# Select the top 20 records
top_20_counts = hotel_cluster_counts.head(20)
top_20_percentages = hotel_cluster_percentages.head(20)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = top_20_counts.plot(kind='bar')
plt.xlabel('Hotel Cluster')
plt.ylabel('Frequency')
plt.title('Frequency of Hotel Cluster')
plt.xticks(rotation=0)

# Label each bar with its percentage of all events
for i, (v, p) in enumerate(zip(top_20_counts, top_20_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

In [None]:
# Count the occurrences of each hotel cluster
hotel_cluster_counts = (hotel_clean['hotel_cluster'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
hotel_cluster_percentages = (hotel_cluster_counts / len(hotel_clean) * 100).round(2)

# Select the 20 least frequent clusters
bottom_20_counts = hotel_cluster_counts.tail(20)
bottom_20_percentages = hotel_cluster_percentages.tail(20)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = bottom_20_counts.plot(kind='bar')
plt.xlabel('Hotel Cluster')
plt.ylabel('Frequency')
plt.title('Frequency of the 20 Least Common Hotel Clusters')
plt.xticks(rotation=0)

# Label each bar with its percentage of all events
for i, (v, p) in enumerate(zip(bottom_20_counts, bottom_20_percentages)):
    ax.text(i, v, f'{p}%', ha='center', va='bottom')

# Show the plot
plt.show()

In [None]:
# Count the occurrences of each hotel cluster
hotel_cluster_counts = (hotel_clean['hotel_cluster'].value_counts()).sort_values(ascending=False)

# Calculate the percentages
hotel_cluster_percentages = (hotel_cluster_counts / len(hotel_clean) * 100).round(2)

# Create a bar plot
plt.figure(figsize=(12, 6))
ax = hotel_cluster_counts.plot(kind='bar')
plt.xlabel('Hotel Cluster')
# Remove X-axis tick labels
plt.gca().set_xticklabels([])
plt.ylabel('Frequency')
plt.title('Frequency of Hotel Cluster')
plt.xticks(rotation=0)

# Show the plot
plt.show()

The `hotel_cluster` column is the target variable: the goal is to predict which of the 100 possible hotel clusters is associated with a user event, based on the attributes linked to that event. The third graphic gives an overview of the distribution across all clusters.
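To put a number on how evenly the target is spread, one option is to count how many of the most frequent clusters are needed to cover half of the events. The sketch below runs on a hypothetical toy Series (the cluster IDs and counts are illustrative assumptions); it is one possible summary, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy values standing in for hotel_clean['hotel_cluster']
toy_clusters = pd.Series([91, 91, 91, 91, 41, 41, 41, 48, 48, 64],
                         name='hotel_cluster')

# Share of events per cluster, most frequent first
shares = toy_clusters.value_counts(normalize=True)

# Running total of the shares, from the most to the least frequent cluster
coverage = shares.cumsum()

# Smallest number of top clusters whose combined share reaches at least 50%
k_half = int((coverage < 0.5).sum() + 1)

print(f'{k_half} clusters cover at least half of the events')
```

With 100 clusters, a value of `k_half` close to 50 would indicate a nearly uniform target, while a much smaller value would indicate that a few clusters dominate.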

#### Some graphics <a class= 'anchor' id = '18'></a> 

In [None]:
# Draw one histogram per column on a 5 x 5 grid (assumes at most 25 columns)
plt.subplots(5, 5, figsize=(20, 20))

for count, col in enumerate(hotel_clean.columns, start=1):
    plt.subplot(5, 5, count)
    plt.hist(hotel_clean[col])
    plt.title(col)

plt.tight_layout()
plt.show()

In [None]:
# Compute the pairwise correlations between the variables
corr = hotel_clean.corr()

# Mask the upper triangle (including the diagonal) so each pair appears once
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True, mask=mask, cmap='coolwarm')
plt.show()

The heatmap presents a clear depiction of the dataset's correlations. Notably, there exists a strong positive correlation between the variables `srch_ci` and `srch_co` indicating that as one date shifts, the other follows suit. Furthermore, both `srch_ci` and `srch_co` exhibit a high correlation with `date_time`, which signifies the temporal aspect of the customers' interactions.

Another notable positive correlation appears between `srch_rm_cnt` and `srch_adults_cnt`: searches with more adults tend to request more rooms, which is a logical relationship.

Most of the heatmap shows negative correlations. This likely stems from categorical variables being stored as numbers: their values are labels rather than quantities, so treating them numerically can produce correlations, negative ones included, that carry no real meaning.
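The effect of integer-coded categories on a correlation coefficient can be illustrated with a small, hypothetical example (all arrays below are made-up values, not taken from the dataset). The same three groups are labelled in two different ways, and only the labelling changes between the two calculations.

```python
import numpy as np

# Hypothetical categorical feature with three levels, coded as integers
codes_a = np.array([0, 0, 1, 1, 2, 2])

# Hypothetical numeric outcome observed for the same six rows
outcome = np.array([1.0, 1.2, 3.0, 3.1, 5.0, 5.2])

# Relabel the same categories: 0 -> 2, 1 -> 0, 2 -> 1 (group membership is unchanged)
relabel = {0: 2, 1: 0, 2: 1}
codes_b = np.array([relabel[c] for c in codes_a])

# Pearson correlation under each labelling
r_a = np.corrcoef(codes_a, outcome)[0, 1]
r_b = np.corrcoef(codes_b, outcome)[0, 1]

print(f'original labels: r = {r_a:.2f}, relabelled: r = {r_b:.2f}')
```

The first labelling yields a strong positive coefficient and the second a negative one, even though the grouping is identical. For ID-like columns such as `hotel_country` or `srch_destination_id`, the sign and size of a Pearson coefficient therefore depend on how the IDs happen to be numbered.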

In [None]:
# Save the data to a CSV file
hotel_clean.to_csv('hotel_clean.csv', index=False)