<img src='https://upload.wikimedia.org/wikipedia/fr/thumb/e/ed/Logo_Universit%C3%A9_du_Maine.svg/1280px-Logo_Universit%C3%A9_du_Maine.svg.png' width="300" height="500">

# Hotel Reservation Dataset

## Dataset overview

### Purpose of the dataset

This dataset contains the online booking records of a hotel's customers. **The goal is to predict whether a customer will honor their booking or cancel it.**

### Data description

* **Booking_ID** : unique identifier of each booking
* **no_of_adults** : Number of adults
* **no_of_children** : Number of Children
* **no_of_weekend_nights** : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **no_of_week_nights** : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **type_of_meal_plan** : Type of meal plan booked by the customer
* **required_car_parking_space** : Does the customer require a car parking space? (0 - No, 1 - Yes)
* **room_type_reserved** : Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
* **lead_time** : Number of days between the date of booking and the arrival date
* **arrival_year** : Year of arrival date
* **arrival_month** : Month of arrival date
* **arrival_date** : Date of the month
* **market_segment_type** : Market segment designation.
* **repeated_guest** : Is the customer a repeated guest? (0 - No, 1 - Yes)
* **no_of_previous_cancellations** : Number of previous bookings that were canceled by the customer prior to the current booking
* **no_of_previous_bookings_not_canceled** : Number of previous bookings not canceled by the customer prior to the current booking
* **avg_price_per_room** : Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
* **no_of_special_requests** : Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
* **booking_status** : Flag indicating if the booking was canceled or not.



In [1]:
import pandas as pd 
import numpy as np


dataset = pd.read_csv('archive/Hotel Reservations.csv')
dataset

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,INN36271,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,INN36272,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,INN36273,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,INN36274,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


In [2]:
dataset.values.shape

(36275, 19)

# Data analysis
We check whether any values are NaN.
<br>
The booking ID should be dropped: it is not relevant for prediction.
<br>
Variables to discretize: avg_price_per_room, lead_time, no_of_previous_bookings_not_canceled
<br>
Watch out for variables whose values are mostly zero.
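
The last remark can be checked numerically. The following is a minimal sketch, not part of the original notebook: `zero_share` is a hypothetical helper and the toy frame contains made-up values that only mimic the column names of the dataset.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Return, for each numeric column, the fraction of rows equal to 0."""
    numeric = df.select_dtypes(include="number")
    # Comparing to 0 gives booleans; their mean is the share of zeros.
    return (numeric == 0).mean().sort_values(ascending=False)


# Toy data (assumption: made-up values, not taken from the real file).
toy = pd.DataFrame({
    "no_of_children": [0, 0, 0, 1],
    "repeated_guest": [0, 0, 0, 0],
    "lead_time": [224, 5, 1, 211],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled", "Canceled"],
})

shares = zero_share(toy)
print(shares)
```

On the real data, `zero_share(dataset)` would rank the columns by how often they are zero, which helps decide which ones carry little information.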

In [3]:
def verificationNan(dataset):
    return dataset.isna().sum()

In [4]:
verificationNan(dataset)

Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

In [5]:
def dataAnalysis(dataset):
    for series_name, series in dataset.items():
        #print(series_name)
        print("**************\n")
        print(series.value_counts())
        print("**************\n")
    

In [6]:
dataAnalysis(dataset)

**************

INN00001    1
INN24187    1
INN24181    1
INN24182    1
INN24183    1
           ..
INN12086    1
INN12085    1
INN12084    1
INN12083    1
INN36275    1
Name: Booking_ID, Length: 36275, dtype: int64
**************

**************

2    26108
1     7695
3     2317
0      139
4       16
Name: no_of_adults, dtype: int64
**************

**************

0     33577
1      1618
2      1058
3        19
9         2
10        1
Name: no_of_children, dtype: int64
**************

**************

0    16872
1     9995
2     9071
3      153
4      129
5       34
6       20
7        1
Name: no_of_weekend_nights, dtype: int64
**************

**************

2     11444
1      9488
3      7839
4      2990
0      2387
5      1614
6       189
7       113
10       62
8        62
9        34
11       17
15       10
12        9
14        7
13        5
17        3
16        2
Name: no_of_week_nights, dtype: int64
**************

**************

Meal Plan 1     27835
Not Selected     5130
Me

In [7]:
# to execute only once
dataset = dataset.drop('Booking_ID',axis=1)
dataset

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


In [8]:
# Quartile decomposition

def quartile(feature):
    """Return the first, second (median) and third quartiles of a Series."""
    prem_quartile = feature.quantile(0.25)
    de_quartile = feature.quantile(0.5)
    trois_quartile = feature.quantile(0.75)
    return (prem_quartile, de_quartile, trois_quartile)


In [9]:
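# Discretization of the data


def discretisation(features, dataset):
    """Replace each feature, in place, by its quartile bin (1 to 4)."""
    for feature in features:
        pQ, dQ, tQ = quartile(feature)
        # Caveat: `feature` may share memory with the column being overwritten,
        # so a later mask can see bins written by an earlier line. This is the
        # likely reason a column whose quartiles are all 0 ends up entirely in bin 4.
        dataset.loc[(feature <= pQ), feature.name] = 1
        dataset.loc[(feature > pQ) & (feature <= dQ), feature.name] = 2
        dataset.loc[(feature > dQ) & (feature <= tQ), feature.name] = 3
        dataset.loc[(feature > tQ), feature.name] = 4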
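
One way to bin by quartile without depending on the order of in-place writes is to compute every mask from the original values first. The sketch below is an alternative, not the notebook's method: `quartile_bins` is a hypothetical helper, the toy values are made up, and it relies on `numpy.select`, which takes the first matching condition.

```python
import numpy as np
import pandas as pd


def quartile_bins(column: pd.Series) -> pd.Series:
    """Map each value to a bin 1..4 using thresholds and masks from the original values only."""
    q1, q2, q3 = column.quantile([0.25, 0.5, 0.75])
    # Masks are built before anything is written, so no bin can be reassigned later.
    conditions = [
        (column <= q1).to_numpy(),
        (column <= q2).to_numpy(),
        (column <= q3).to_numpy(),
    ]
    bins = np.select(conditions, [1, 2, 3], default=4)
    return pd.Series(bins, index=column.index, name=column.name)


# Toy data (assumption: made-up values that mimic two columns of the dataset).
toy = pd.DataFrame({
    "lead_time": [1, 5, 48, 63, 85, 148, 211, 224],
    "no_of_previous_bookings_not_canceled": [0, 0, 0, 0, 0, 0, 0, 3],
})

binned = toy.apply(quartile_bins)
print(binned)
```

With this version, a column whose three quartiles are all 0 keeps its zero rows in bin 1 and sends only the strictly positive rows to bin 4.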


In [10]:
continuFeatures = [dataset.avg_price_per_room,dataset.lead_time, dataset.no_of_previous_bookings_not_canceled]
discretisation(continuFeatures,dataset)

In [11]:
dataset

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,4,2017,10,2,Offline,0,0,4,1.0,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,1,2018,11,6,Online,0,0,4,3.0,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,4,1.0,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,4,2018,5,20,Online,0,0,4,3.0,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,2,2018,4,11,Online,0,0,4,2.0,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3,0,2,6,Meal Plan 1,0,Room_Type 4,3,2018,8,3,Online,0,0,4,4.0,1,Not_Canceled
36271,2,0,1,3,Meal Plan 1,0,Room_Type 1,4,2018,10,17,Online,0,0,4,2.0,2,Canceled
36272,2,0,2,6,Meal Plan 1,0,Room_Type 1,4,2018,7,1,Online,0,0,4,2.0,2,Not_Canceled
36273,2,0,0,3,Not Selected,0,Room_Type 1,3,2018,4,21,Online,0,0,4,2.0,0,Canceled


In [12]:
pd.qcut(dataset['no_of_previous_bookings_not_canceled'], 1 , labels=["peu"])

0        peu
1        peu
2        peu
3        peu
4        peu
        ... 
36270    peu
36271    peu
36272    peu
36273    peu
36274    peu
Name: no_of_previous_bookings_not_canceled, Length: 36275, dtype: category
Categories (1, object): ['peu']

In [13]:
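# Pre-processing of the data


# Shuffle the rows. Note: `dataframe` is not used below; the split is done on `dataset`.
dataframe = dataset.sample(frac=1).reset_index(drop=True)



def train_test_dev_split(dataset):
    """Random 70/30 train/test split (no dev set is produced, despite the name)."""
    x_train = dataset.sample(frac=0.7)

    # x_train and x_test keep the 'booking_status' column; later cells cross-tabulate against it.
    y_train = x_train['booking_status']

    x_test = dataset.drop(x_train.index, axis=0)

    y_test = x_test['booking_status']

    return (x_train, y_train, x_test, y_test)


x_train, y_train, x_test, y_test = train_test_dev_split(dataset)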
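
The split above is random and unseeded, so each run gives different rows. The sketch below shows a reproducible variant that also separates the target from the features. It is an illustration under assumptions: `split_features_target`, its parameters and the toy data are hypothetical, not part of the notebook.

```python
import pandas as pd


def split_features_target(df, target="booking_status", train_frac=0.7, seed=0):
    """Seeded random split returning (X_train, y_train, X_test, y_test)."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)  # every row not drawn for training
    return (
        train.drop(columns=target), train[target],
        test.drop(columns=target), test[target],
    )


# Toy data (assumption: made-up values).
toy = pd.DataFrame({
    "lead_time": range(10),
    "booking_status": ["Canceled", "Not_Canceled"] * 5,
})

X_train, y_train, X_test, y_test = split_features_target(toy)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the frequency tables computed later comparable from one run to the next.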







In [14]:
# Detect zeros, in preparation for the naive Bayes classifier
def zeroExistence(df):
    """Return True if any cell of the DataFrame equals 0."""
    return 0 in df.values

In [15]:
zeroExistence(dataset)

True

In [16]:
# If zeros are present: add 1 to all numeric data
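
The note above proposes shifting the data values. A common alternative for naive Bayes, named here explicitly as a different technique, is add-one (Laplace) smoothing applied to the frequency counts, so that no conditional probability is exactly zero. The sketch below assumes a hypothetical helper `smoothed_likelihoods` and made-up toy data.

```python
import pandas as pd


def smoothed_likelihoods(feature, target, alpha=1.0):
    """Estimate P(feature value | class) with add-alpha smoothing on the counts."""
    counts = pd.crosstab(feature, target)
    smoothed = counts + alpha
    # Normalise each class column so its probabilities sum to 1.
    return smoothed / smoothed.sum(axis=0)


# Toy data (assumption: made-up values; 10 children never occurs with 'Canceled').
toy = pd.DataFrame({
    "no_of_children": [0, 0, 1, 10, 0, 1],
    "booking_status": ["Canceled", "Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

likelihoods = smoothed_likelihoods(toy["no_of_children"], toy["booking_status"])
print(likelihoods)
```

The unseen combination (10 children, Canceled) gets a small positive probability instead of zero, which keeps the product of likelihoods from collapsing.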

In [17]:
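# Discretization of the data around the mean

stats = x_train['avg_price_per_room'].describe()
# 103.4 is hard-coded here (presumably the mean price from an earlier run).
# Note: avg_price_per_room was already replaced by quartile bins (1 to 4) in cell 10,
# so this filter currently keeps every row.
slice = x_train.loc[x_train['avg_price_per_room'] <= 103.4]
slice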

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
7983,2,0,0,1,Meal Plan 1,0,Room_Type 1,2,2018,6,15,Online,0,0,4,3.0,0,Canceled
7766,2,0,1,3,Meal Plan 1,0,Room_Type 1,4,2018,10,3,Offline,0,0,4,3.0,0,Canceled
15091,2,0,0,1,Meal Plan 2,0,Room_Type 1,2,2017,9,4,Offline,0,0,4,3.0,0,Not_Canceled
709,2,0,2,1,Meal Plan 1,0,Room_Type 1,3,2017,7,11,Online,0,0,4,1.0,0,Canceled
20263,2,0,0,1,Meal Plan 1,0,Room_Type 1,2,2018,10,20,Online,0,0,4,3.0,2,Not_Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16835,1,0,1,1,Meal Plan 2,0,Room_Type 1,4,2018,6,6,Offline,0,0,4,2.0,1,Not_Canceled
5544,1,1,2,1,Not Selected,0,Room_Type 1,1,2018,10,29,Online,0,0,4,2.0,1,Not_Canceled
23434,1,0,1,0,Meal Plan 1,0,Room_Type 1,1,2018,6,20,Online,0,0,4,3.0,0,Not_Canceled
9606,2,0,0,3,Meal Plan 1,0,Room_Type 4,1,2018,6,14,Corporate,0,0,4,2.0,0,Canceled


In [18]:
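slice.keys  # note: without parentheses this displays the bound method; slice.keys() would list the column labels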

<bound method NDFrame.keys of        no_of_adults  no_of_children  no_of_weekend_nights  no_of_week_nights  \
7983              2               0                     0                  1   
7766              2               0                     1                  3   
15091             2               0                     0                  1   
709               2               0                     2                  1   
20263             2               0                     0                  1   
...             ...             ...                   ...                ...   
16835             1               0                     1                  1   
5544              1               1                     2                  1   
23434             1               0                     1                  0   
9606              2               0                     0                  3   
11127             1               0                     1                  2   

      typ

## ZeroR

In [19]:
def zeroR(target):
    res =  target.value_counts()
    
    return res.idxmax()

In [20]:
test = zeroR(x_train.loc[:,'booking_status'])
test

'Not_Canceled'
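
ZeroR always predicts the majority class of the training set, so its accuracy on held-out data is a baseline that any other classifier should beat. The sketch below shows that evaluation under assumptions: `accuracy` is a hypothetical helper and the two label series are made-up toy data.

```python
import pandas as pd


def zeroR(target):
    """Return the most frequent class of the target."""
    return target.value_counts().idxmax()


def accuracy(predictions, truth):
    """Fraction of positions where the prediction equals the true label."""
    return (predictions == truth).mean()


# Toy labels (assumption: made-up values).
y_train_toy = pd.Series(["Not_Canceled", "Not_Canceled", "Canceled"])
y_test_toy = pd.Series(["Not_Canceled", "Canceled", "Not_Canceled", "Not_Canceled"])

majority = zeroR(y_train_toy)
# ZeroR predicts the same class for every test row.
baseline = accuracy(pd.Series(majority, index=y_test_toy.index), y_test_toy)
print(majority, baseline)
```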

# Building the frequency tables for OneR


In [21]:
x_train['required_car_parking_space'].value_counts()

0    24605
1      787
Name: required_car_parking_space, dtype: int64

In [22]:
freq_table = pd.crosstab(x_train['no_of_children'], x_train['booking_status']) 
freq_table

booking_status,Canceled,Not_Canceled
no_of_children,Unnamed: 1_level_1,Unnamed: 2_level_1
0,7533,15959
1,381,762
2,319,422
3,4,9
9,1,1
10,0,1


In [23]:
freq_table[['Canceled']]

booking_status,Canceled
no_of_children,Unnamed: 1_level_1
0,7533
1,381
2,319
3,4
9,1
10,0


In [24]:
res = freq_table['Canceled'] + freq_table['Not_Canceled']
freq_table['Total'] = res
res

no_of_children
0     23492
1      1143
2       741
3        13
9         2
10        1
dtype: int64

In [25]:
freq_table

booking_status,Canceled,Not_Canceled,Total
no_of_children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,7533,15959,23492
1,381,762,1143
2,319,422,741
3,4,9,13
9,1,1,2
10,0,1,1


In [42]:

def frequency_table(data):
    """Build one frequency table (feature value x booking_status) per feature."""
    freq_tables = []
    for series_name in data.columns:
        # Skip the target itself: cross-tabulating it against itself is meaningless.
        if series_name == "booking_status":
            continue
        freq_table = pd.crosstab(data[series_name], data['booking_status'])
        freq_tables.append(freq_table)
    return freq_tables

 

In [43]:
freq = frequency_table(x_train)
for table in freq:
    print(table)
    

booking_status  Canceled  Not_Canceled
no_of_adults                          
0                     29            68
1                   1305          4104
2                   6315         11953
3                    587          1020
4                      2             9
booking_status  Canceled  Not_Canceled
no_of_children                        
0                   7533         15959
1                    381           762
2                    319           422
3                      4             9
9                      1             1
10                     0             1
booking_status        Canceled  Not_Canceled
no_of_weekend_nights                        
0                         3534          8266
1                         2387          4661
2                         2184          4131
3                           50            56
4                           52            34
5                           21             2
6                           10             4
booking_st

In [49]:
freq[0].loc[0].idxmax()

'Not_Canceled'

In [48]:
def ruleOver(freqs):
    for freq in freqs:
        index_labels = freq.index.values
        for idx_label in index_labels:
            print(freq.loc[idx_label].idxmax())
        

In [50]:
ruleOver(freq)

Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Canceled
Canceled
Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Canceled
Canceled
Canceled
Canceled
Canceled
Canceled
Canceled
Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
Not_Canceled
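
The output above lists the majority class for each feature value. To complete OneR, the errors of each feature's rule set can be counted and the feature with the fewest errors kept. The sketch below is one possible way to do this, not the notebook's implementation: `one_r` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd


def one_r(data, target="booking_status"):
    """Return (feature, rule, errors) for the feature whose rules misclassify the fewest rows."""
    best = None
    for column in data.columns.drop(target):
        table = pd.crosstab(data[column], data[target])
        # Rule: each feature value predicts its majority class.
        rule = table.idxmax(axis=1)
        # Errors: rows that do not belong to the majority class of their value.
        errors = int(table.sum().sum() - table.max(axis=1).sum())
        if best is None or errors < best[2]:
            best = (column, rule.to_dict(), errors)
    return best


# Toy data (assumption: made-up values).
toy = pd.DataFrame({
    "no_of_special_requests": [0, 0, 0, 1, 1, 2],
    "repeated_guest": [0, 0, 0, 0, 1, 1],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

feature, rule, errors = one_r(toy)
print(feature, rule, errors)
```

Counting errors on the training set follows the usual description of OneR; measuring the chosen rule on the test split would give a fairer estimate of its accuracy.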