# **1. Introduction**

Over the last decade, hotel booking platforms such as Booking and Trivago have become some of the most widely used online tools. Besides booking through them, users have started leaving reviews for future customers. Property managers therefore want to understand how guests rate their stay, and in particular which services are more or less appreciated, so that they can identify aspects of the property to improve and gain an advantage in bookings.

Using data collected from hotel guests, this project compares different classification models that predict, from a set of attributes, whether a customer was satisfied with the service.

# **2. Dataset**

The dataset was downloaded from the following source:
https://www.kaggle.com/datasets/ishansingh88/europe-hotel-satisfaction-score/code
It consists of about 100,000 reviews written by guests of several European hotels.

## 2.1 Dataset overview

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('dataset.csv')
print(f"Shape: {df.shape}")
df.head()

Shape: (103904, 17)


Unnamed: 0,id,Gender,Age,purpose_of_travel,Type of Travel,Type Of Booking,Hotel wifi service,Departure/Arrival convenience,Ease of Online booking,Hotel location,Food and drink,Stay comfort,Common Room entertainment,Checkin/Checkout service,Other service,Cleanliness,satisfaction
0,70172,Male,13,aviation,Personal Travel,Not defined,3,4,3,1,5,5,5,4,5,5,neutral or dissatisfied
1,5047,Male,25,tourism,Group Travel,Group bookings,3,2,3,3,1,1,1,1,4,1,neutral or dissatisfied
2,110028,Female,26,tourism,Group Travel,Group bookings,2,2,2,2,5,5,5,4,4,5,satisfied
3,24026,Female,25,tourism,Group Travel,Group bookings,2,5,5,5,2,2,2,1,4,2,neutral or dissatisfied
4,119299,Male,61,aviation,Group Travel,Group bookings,3,3,3,3,4,5,3,3,3,3,satisfied


The dataset has the following structure:
<ul>
    <li>103904 records</li>
    <li>17 features</li>
</ul>

The feature <b>satisfaction</b> is the target and serves as the label for the training data.
Special characters in the feature names are replaced with spaces, since they could cause problems in the analysis. Examples:
<ul>
    <li> "_" in the <b>purpose_of_travel</b> feature </li>
    <li> "/" in the <b>Departure/Arrival convenience</b> feature </li>
</ul>

In [2]:
# Replace "/" and "_" in the column names with a space
df.columns = df.columns.str.replace("/"," ")
df.columns = df.columns.str.replace("_"," ")
display(df.head())

Unnamed: 0,id,Gender,Age,purpose of travel,Type of Travel,Type Of Booking,Hotel wifi service,Departure Arrival convenience,Ease of Online booking,Hotel location,Food and drink,Stay comfort,Common Room entertainment,Checkin Checkout service,Other service,Cleanliness,satisfaction
0,70172,Male,13,aviation,Personal Travel,Not defined,3,4,3,1,5,5,5,4,5,5,neutral or dissatisfied
1,5047,Male,25,tourism,Group Travel,Group bookings,3,2,3,3,1,1,1,1,4,1,neutral or dissatisfied
2,110028,Female,26,tourism,Group Travel,Group bookings,2,2,2,2,5,5,5,4,4,5,satisfied
3,24026,Female,25,tourism,Group Travel,Group bookings,2,5,5,5,2,2,2,1,4,2,neutral or dissatisfied
4,119299,Male,61,aviation,Group Travel,Group bookings,3,3,3,3,4,5,3,3,3,3,satisfied
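
The `df.info()` summary further below prints one column name with two consecutive spaces, which suggests that replacing a character with a space can leave uneven whitespace. The following is a minimal sketch, not part of the original notebook, of a slightly more defensive renaming step. The helper `normalize_column_name` and the toy frame are hypothetical and only mimic the dataset's headers.

```python
import re

import pandas as pd


def normalize_column_name(name: str) -> str:
    """Replace '/' and '_' with a space, then collapse repeated whitespace."""
    replaced = re.sub(r"[/_]", " ", name)
    return re.sub(r"\s+", " ", replaced).strip()


# Toy frame whose headers mimic the dataset's (values are made up).
toy = pd.DataFrame(
    [[3, 4, "tourism"]],
    columns=[
        "Checkin/Checkout service",
        "Departure/Arrival  convenience",  # assumed double space, for illustration
        "purpose_of_travel",
    ],
)

# rename accepts a callable that is applied to every column label.
toy = toy.rename(columns=normalize_column_name)
print(list(toy.columns))
```

Applying the same callable to the real `df` would be `df = df.rename(columns=normalize_column_name)`, assuming the headers look like the ones shown above.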


Now let's view the dataset summary:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   id                              103904 non-null  int64 
 1   Gender                          103904 non-null  object
 2   Age                             103904 non-null  int64 
 3   purpose of travel               103904 non-null  object
 4   Type of Travel                  103904 non-null  object
 5   Type Of Booking                 103904 non-null  object
 6   Hotel wifi service              103904 non-null  int64 
 7   Departure Arrival  convenience  103904 non-null  int64 
 8   Ease of Online booking          103904 non-null  int64 
 9   Hotel location                  103904 non-null  int64 
 10  Food and drink                  103904 non-null  int64 
 11  Stay comfort                    103904 non-null  int64 
 12  Common Room entertainment     

From the summary we can see that the dataset consists of:
<ul>
    <li> 12 attributes of type <i>int64</i> </li>
    <li> 5 attributes of type <i>object</i> </li>
</ul>
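
The counts above can also be obtained programmatically instead of reading them off the summary. The sketch below uses a made-up toy frame (not the real data) to show two standard pandas calls for this: `dtypes.value_counts()` to count columns per dtype and `select_dtypes` to list numeric and non-numeric columns.

```python
import pandas as pd

# Toy frame with a mix of numeric and text columns (values are made up).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "Age": [13, 25, 26],
        "Gender": ["Male", "Male", "Female"],
    }
)

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List the column names of each kind.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

On the real `df`, the same calls should reproduce the 12 numeric and 5 non-numeric attributes listed above.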

## 2.2 Missing values analysis

The dataset contains no explicitly missing values. To verify this, let's compute the fraction of missing data for each feature:

In [4]:
pd.DataFrame(df.isnull().sum()/df.shape[0])

Unnamed: 0,0
id,0.0
Gender,0.0
Age,0.0
purpose of travel,0.0
Type of Travel,0.0
Type Of Booking,0.0
Hotel wifi service,0.0
Departure Arrival convenience,0.0
Ease of Online booking,0.0
Hotel location,0.0


Although no values are explicitly missing, a value of 0 in a numerical rating feature should be considered missing. We therefore replace these zeros with <i>NaN</i>:

In [5]:
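# Rating columns at positions 6-14 (note: the slice stops before "Cleanliness" at position 15)
service_columns = df.columns[6:15]
# Treat a rating of 0 (stored as an integer or as a string) as missing
df[service_columns] = df[service_columns].replace({'0': np.nan, 0: np.nan})
df.info()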

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              103904 non-null  int64  
 1   Gender                          103904 non-null  object 
 2   Age                             103904 non-null  int64  
 3   purpose of travel               103904 non-null  object 
 4   Type of Travel                  103904 non-null  object 
 5   Type Of Booking                 103904 non-null  object 
 6   Hotel wifi service              100801 non-null  float64
 7   Departure Arrival  convenience  98604 non-null   float64
 8   Ease of Online booking          99417 non-null   float64
 9   Hotel location                  103903 non-null  float64
 10  Food and drink                  103797 non-null  float64
 11  Stay comfort                    103903 non-null  float64
 12  Common Room ente
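
The rating columns switch from `int64` to `float64` in the summary above because `NaN` is a floating-point value. The sketch below illustrates this on a made-up toy frame (the values are not from the dataset) and also shows why the zeros matter: kept as real scores, they would pull the average rating down.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up): 0 stands for "not rated", 1-5 are real scores.
toy = pd.DataFrame({"Hotel wifi service": [3, 0, 5], "Food and drink": [0, 4, 4]})
print(toy.dtypes)  # integer columns

# Replace the zeros with NaN; the columns are upcast to float64.
cleaned = toy.replace(0, np.nan)
print(cleaned.dtypes)
print(cleaned.isna().sum())  # one missing value per column

# mean() skips NaN by default, so the zero no longer counts as a score.
mean_with_zeros = toy["Hotel wifi service"].mean()
mean_without_zeros = cleaned["Hotel wifi service"].mean()
print(mean_with_zeros, mean_without_zeros)
```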

In [6]:
pd.DataFrame(df.isnull().sum()/df.shape[0]*100)

Unnamed: 0,0
id,0.0
Gender,0.0
Age,0.0
purpose of travel,0.0
Type of Travel,0.0
Type Of Booking,0.0
Hotel wifi service,2.986411
Departure Arrival convenience,5.100862
Ease of Online booking,4.318409
Hotel location,0.000962


The attributes with the highest percentage of missing data are <b>Departure Arrival convenience</b> and <b>Ease of Online booking</b>, both numerical.
Let's visualize the missing data with a bar chart:

In [12]:
import seaborn as sns
import matplotlib.pyplot as plt

# Determine the fraction of missing data for each feature
missing_data_fraction = df.isnull().sum() / df.shape[0]

# Put the values inside a dataframe
missing_fraction_df = pd.DataFrame(missing_data_fraction, columns=["Missing Fraction"])

# Create the bar chart (one bar per feature)
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
sns.barplot(x=missing_fraction_df.index, y='Missing Fraction', data=missing_fraction_df, palette="viridis")
plt.xticks(rotation=90)
plt.xlabel('Feature', fontweight='bold')
plt.ylabel('Fraction of missing values', fontweight='bold')
plt.title('Fraction of missing values for each feature', fontweight='bold')
plt.show()

ImportError: cannot import name '_c_internal_utils' from partially initialized module 'matplotlib' (most likely due to a circular import) (/opt/homebrew/opt/graph-tool/libexec/lib/python3.12/site-packages/matplotlib/__init__.py)
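
The plotting cell depends on a working matplotlib installation. As a plot-free complement, the same information can be tabulated with pandas alone. The sketch below is an assumption-laden illustration rather than part of the original analysis: the helper `missing_summary` is hypothetical and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {"missing_count": counts, "missing_pct": counts / len(frame) * 100}
    )
    # Keep only columns with at least one missing value, largest share first.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame (made-up values) standing in for the real data.
toy = pd.DataFrame(
    {
        "Age": [13, 25, 26, 61],
        "Hotel wifi service": [3.0, np.nan, 2.0, 3.0],
        "Ease of Online booking": [np.nan, np.nan, 2.0, 5.0],
    }
)
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df)` on the real data would list only the rating features that contain `NaN`, ordered by their share of missing values.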

In [11]:
!pip install seaborn

Defaulting to user installation because normal site-packages is not writeable
