# Hotel dataset analysis
<p style="font-weight: 600; text-align: center;">
Datascience Tools, February 2025 <br>
Luka Skeledžija
</p>

<style>
.MJXc-display{
    display: inline-block !important;
    width: 100%;
}
@media print {
    .pagebreak { page-break-before: always; } /* page-break-after works, as well */
}

img{
    width: 100%;
    max-width: 600px !important;
    margin: auto !important;
}

body {
    overflow: hidden;
    max-width: 600px;
    margin: auto;
}

::-webkit-scrollbar {
  width: 0px;
}

table{
    width: 100%;
}

td {
    text-align: left!important;
}

th {
    text-align: left!important;
    text-transform: capitalize; 
}

h1 {
    text-transform: uppercase;
    text-align: center;
    background: #222222;
    color: white;
    padding: 8px;
}

blockquote {
    margin-left: 0em!important;
    margin-right: 0em!important;

}

.jp-RenderedHTMLCommon pre, .jp-RenderedHTMLCommon code {

    background-color: var(--jp-layout-color2)!important;
}

.jp-RenderedHTMLCommon pre{
    margin: 0.5em 0em!important;
    padding: 0em 1.5em!important;
}

body {
    counter-reset: h2counter;
}
h1 {
    counter-reset: h2counter;
}
h2:before {
    content: counter(h2counter) ".\0000a0\0000a0";
    counter-increment: h2counter;
    counter-reset: h3counter;
}
h3:before {
    counter-increment: h3counter;
    content: counter(h2counter) "." counter(h3counter) ".\0000a0\0000a0";
  
}

.jp-RenderedHTMLCommon table {
    table-layout: auto;
}



</style>


---

## Introduction

This analysis explores a dataset containing 119,390 hotel bookings recorded between July 2015 and August 2017 from two hotels: a City Hotel and a Resort Hotel. The data includes both successful stays and cancellations, providing insights into booking patterns and hotel operations. While customer identification details were removed for privacy, the dataset includes synthetic personal information to maintain data structure. Through this analysis, we aim to uncover meaningful patterns and seasonal trends.

> You can download the dataset yourself from Kaggle → [🔗 Hotel Booking Dataset](https://www.kaggle.com/datasets/mojtaba142/hotel-booking/data)

## What do we want to achieve?

1. Can we see any signs of seasonality in the dataset?

2. Visualize the distribution of the total length of the stays.

3. Which reservation that was not canceled had the largest number of guests?

4. How does the cancellation rate of bookings vary as a function of lead time?



In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# For pretty HTML rendering
import markdown
from IPython.display import display, HTML

def centerHTML(html, text=""):
    """Center an HTML fragment and render `text` (Markdown) as a small caption below it."""
    text = markdown.markdown(text)
    return '<div style="display: flex; align-items: center; flex-direction: column;">' + html + f'</div><div style="display: flex; align-items: center; flex-direction: column;padding-top: 15px;"><small style="max-width: 600px">{text}</small></div>'

def insertHTMLVideo(filename, text=""):
    """Return centered HTML for a video player with an optional Markdown caption."""
    return centerHTML(f'<video controls src="{filename}" style="max-width: 600px;width:100%"></video>', text)

def insertHTMLAudio(filename, text=""):
    """Return centered HTML for an audio player with an optional Markdown caption."""
    return centerHTML(f'<audio controls src="{filename}" style="max-width: 600px;width:100%"></audio>', text)

In [7]:
# Load the data
df = pd.read_parquet('./hotels.parquet')

## Structure of data

First, we will look into the column structure of our dataset. 

In [42]:
# Count the rows in the dataset
count = df.shape[0]
print(f"Number of rows: {count} \n")

# Render the first 10 rows as an HTML table; max_rows=6 truncates the
# display to the first 3 and last 3 of those rows (it does not add scrolling)
display(HTML(df.head(10).to_html(max_rows=6)))

Number of rows: 119390 



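| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | C | C |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | C | C |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | A | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7 | Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | C | C |
| 8 | Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0.0 | 0 | 0 | 0 | 0 | A | A |
| 9 | Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0.0 | 0 | 0 | 0 | 0 | D | D |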
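
Note that `to_html(max_rows=6)` shortens the preview rather than making it scrollable. If a scrollable table is preferred, one option is to wrap the generated HTML in a container that scrolls on overflow. The following is a minimal sketch, not part of the original notebook: `scrollable` and its `max_height` parameter are hypothetical names introduced here for illustration.

```python
def scrollable(html: str, max_height: str = "300px") -> str:
    """Wrap an HTML fragment in a div that scrolls when its content overflows.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    style = f"max-height: {max_height}; overflow: auto;"
    return f'<div style="{style}">{html}</div>'


# Stand-in for the string returned by DataFrame.to_html()
table_html = "<table><tr><td>demo</td></tr></table>"
wrapped = scrollable(table_html)
print(wrapped)

# In the notebook this might be used as (assumes df, display and HTML exist):
# display(HTML(scrollable(df.head(10).to_html())))
```

Because the notebook's stylesheet sets the WebKit scrollbar width to `0px`, such a container would still scroll but without a visible scrollbar in WebKit-based browsers.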


In [25]:
# Show every column and its data type in an HTML table
# TODO: add a more informative description for each column
type_df = df.dtypes.to_frame(name='Data type').reset_index().rename(columns={'index': 'Column name'})
HTML(type_df.to_html(index=False, classes='table-style').replace('<table', '<table style="text-align: left"'))

| Column name | Data type |
|---|---|
| hotel | object |
| is_canceled | int64 |
| lead_time | int64 |
| arrival_date_year | int64 |
| arrival_date_month | object |
| arrival_date_week_number | int64 |
| arrival_date_day_of_month | int64 |
| stays_in_weekend_nights | int64 |
| stays_in_week_nights | int64 |
| adults | int64 |
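
A data-type listing can be made more informative by adding the number of missing and distinct values per column. The following is a minimal sketch on a small made-up frame: `summarize_columns` is a hypothetical helper, and the toy values are not taken from the dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: data type, missing count, distinct count.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    summary = pd.DataFrame({
        "Data type": frame.dtypes.astype(str),
        "Missing": frame.isna().sum(),
        "Unique values": frame.nunique(),  # NaN is not counted as a value
    })
    return summary.rename_axis("Column name").reset_index()


# Toy stand-in for df; the values are invented for this example.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 7],
    "children": [0.0, None, 1.0],
})

summary = summarize_columns(toy)
print(summary.to_string(index=False))
```

In the notebook, `summarize_columns(df)` could be passed to `to_html(index=False)` in the same way as `type_df`.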


> More detailed descriptions on Kaggle → [Hotel Booking Dataset](https://www.kaggle.com/datasets/mojtaba142/hotel-booking/data)

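Interestingly, the number of children is stored as `float64` 🤔. Grouping by this column shows that every value is a whole number, so the float type does not reflect fractional data. A likely explanation: the grouped counts below add up to 119,386, four fewer than the 119,390 rows in the dataset, which suggests four missing values (`groupby` drops `NaN` keys by default). A NumPy integer column cannot hold `NaN`, so pandas falls back to floats.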

In [29]:
by_children = df.groupby('children').size().reset_index(name='count')
# display in a table
HTML(by_children.to_html(index=False, classes='table-style').replace('<table', '<table style="text-align: left"'))

| children | count |
|---|---|
| 0.0 | 110796 |
| 1.0 | 4861 |
| 2.0 | 3652 |
| 3.0 | 76 |
| 10.0 | 1 |
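
The float-typed `children` column can be checked more directly by counting missing entries and testing that the remaining values are whole numbers. The following is a minimal sketch on a made-up series standing in for `df['children']`; the conversion to pandas' nullable `Int64` dtype is a suggestion, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['children']; the values are invented for illustration.
children = pd.Series([0.0, 1.0, 2.0, np.nan, 0.0], name="children")

# Number of missing entries (these are what force a float dtype).
n_missing = int(children.isna().sum())

# True when every non-missing value has no fractional part.
all_whole = bool((children.dropna() % 1 == 0).all())

# The nullable Int64 dtype stores integers and keeps missing entries as <NA>.
children_int = children.astype("Int64")

# dropna=False keeps the missing group, which is dropped by default.
counts = children_int.value_counts(dropna=False)

print(f"missing: {n_missing}, all whole numbers: {all_whole}")
print(f"dtype after conversion: {children_int.dtype}")
print(counts)
```

On the real data, the same checks would be applied to `df['children']`; passing `dropna=False` to `groupby` would likewise make any missing group visible in the table.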
