<img src='https://upload.wikimedia.org/wikipedia/fr/thumb/e/ed/Logo_Universit%C3%A9_du_Maine.svg/1280px-Logo_Universit%C3%A9_du_Maine.svg.png' width="300" height="500">

# Hotel Reservation Dataset

## Présentation du dataset

### Objectif du dataset

Le dataset contient les données de réservations en ligne des clients d'un hotel. **Le but de ce jeu de données est d'arriver à prédir si le client va respecter sa réservation ou l'annuler**.

### Description des données

* **Booking_ID** : unique identifier of each booking
* **no_of_adults** : Number of adults
* **no_of_children** : Number of Children
* **no_of_weekend_nights** : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **no_of_week_nights** : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **type_of_meal_plan** : Type of meal plan booked by the customer:
* **required_car_parking_space** : Does the customer require a car parking space? (0 - No, 1- Yes)
* **room_type_reserved** : Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
* **lead_time**: Number of days between the date of booking and the arrival date
* **arrival_year** : Year of arrival date
* **arrival_month** : Month of arrival date
* **arrival_date** : Date of the month
* **market_segment_type** : Market segment designation.
* **repeated_guest** : Is the customer a repeated guest? (0 - No, 1- Yes)
* **no_of_previous_cancellations** : Number of previous bookings that were canceled by the customer prior to the current booking
* **no_of_previous_bookings_not_canceled** : Number of previous bookings not canceled by the customer prior to the current booking
* **avg_price_per_room** : Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
* **no_of_special_requests** : Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
* **booking_status** : Flag indicating if the booking was canceled or not.



In [56]:
import pandas as pd 
import numpy as np


dataset = pd.read_csv('archive/Hotel Reservations.csv')
dataset

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,INN36271,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,INN36272,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,INN36273,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,INN36274,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


In [57]:
dataset.values.shape

(36275, 19)

# Analyse des données
On regarde si des valeurs sont égales à Nan
<br>
ID à supprimer : pas pertinenent
<br>
Des variables à discrétiser : avg_price_per_room, lead_time, no_of_previous_bookings_not_canceled
<br>
Attention à certaines variables qui sont majoritairement nulles

In [58]:
def verificationNan(dataset):
    return dataset.isna().sum()

In [59]:
verificationNan(dataset)

Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

In [49]:
def dataAnalysis(dataset):
    for series_name, series in dataset.items():
        #print(series_name)
        print("**************\n")
        print(series.value_counts())
        print("**************\n")
    

In [60]:
dataAnalysis(dataset)

**************

INN00001    1
INN24187    1
INN24181    1
INN24182    1
INN24183    1
           ..
INN12086    1
INN12085    1
INN12084    1
INN12083    1
INN36275    1
Name: Booking_ID, Length: 36275, dtype: int64
**************

**************

2    26108
1     7695
3     2317
0      139
4       16
Name: no_of_adults, dtype: int64
**************

**************

0     33577
1      1618
2      1058
3        19
9         2
10        1
Name: no_of_children, dtype: int64
**************

**************

0    16872
1     9995
2     9071
3      153
4      129
5       34
6       20
7        1
Name: no_of_weekend_nights, dtype: int64
**************

**************

2     11444
1      9488
3      7839
4      2990
0      2387
5      1614
6       189
7       113
10       62
8        62
9        34
11       17
15       10
12        9
14        7
13        5
17        3
16        2
Name: no_of_week_nights, dtype: int64
**************

**************

Meal Plan 1     27835
Not Selected     5130
Me

In [61]:
# to execute only once
dataset = dataset.drop('Booking_ID',axis=1)
dataset

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.00,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.00,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.00,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.50,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.80,1,Not_Canceled
36271,2,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36272,2,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36273,2,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.50,0,Canceled


In [62]:
# Decomposition en quartile

def quartile(feature):
    prem_quartile = feature.quantile(0.25)
    de_quartile = feature.quantile(0.5)
    trois_quartile = feature.quantile(0.75)
    return (prem_quartile,de_quartile,trois_quartile)






# df.loc[df.C <= df.B, 'B':'E']
# quarter['avg_price_per_room'] = prem_quartile
# quarter





In [63]:
# Discrétistation des données


def discretisation(features,dataset):
    for feature in features:
        pQ,dQ,tQ = quartile(feature)
        dataset.loc[(feature <= pQ), feature.name] = 1
        dataset.loc[(feature > pQ) & (feature <= dQ), feature.name] = 2
        dataset.loc[(feature > dQ) & (feature <= tQ), feature.name] = 3
        dataset.loc[(feature > tQ), feature.name] = 4


In [64]:
continuFeatures = [dataset.avg_price_per_room,dataset.lead_time, dataset.no_of_previous_bookings_not_canceled]
discretisation(continuFeatures,dataset)

In [65]:
dataset

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,4,2017,10,2,Offline,0,0,4,1.0,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,1,2018,11,6,Online,0,0,4,3.0,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,4,1.0,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,4,2018,5,20,Online,0,0,4,3.0,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,2,2018,4,11,Online,0,0,4,2.0,0,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36270,3,0,2,6,Meal Plan 1,0,Room_Type 4,3,2018,8,3,Online,0,0,4,4.0,1,Not_Canceled
36271,2,0,1,3,Meal Plan 1,0,Room_Type 1,4,2018,10,17,Online,0,0,4,2.0,2,Canceled
36272,2,0,2,6,Meal Plan 1,0,Room_Type 1,4,2018,7,1,Online,0,0,4,2.0,2,Not_Canceled
36273,2,0,0,3,Not Selected,0,Room_Type 1,3,2018,4,21,Online,0,0,4,2.0,0,Canceled


In [42]:
pd.qcut(dataset['no_of_previous_bookings_not_canceled'], 1 , labels=["peu"])

0        peu
1        peu
2        peu
3        peu
4        peu
        ... 
36270    peu
36271    peu
36272    peu
36273    peu
36274    peu
Name: no_of_previous_bookings_not_canceled, Length: 36275, dtype: category
Categories (1, object): ['peu']

In [66]:
# Pre-process des données 


dataframe = dataset.sample(frac=1).reset_index(drop=True)



def train_test_dev_split(dataset):
    x_train = dataset.sample(frac=0.7)

    y_train = x_train['booking_status']
    
    x_test = dataset.drop(x_train.index,axis=0)
    
    y_test = x_test['booking_status']

    return (x_train,y_train,x_test,y_test)


x_train, y_train,x_test,y_test = train_test_dev_split(dataset)







In [12]:
# Detection des zeros pour préparer au classifieur bayesien naif
def zeroExistence(df):
    if (0 in df.values):
        return True
    return False

In [13]:
zeroExistence(dataset)

True

In [14]:
# Si zeros : +1 à toutes les données numériques

In [15]:
# Discretisation des données par la moyenne 

stats = x_train['avg_price_per_room'].describe()
slice = x_train.loc[x_train['avg_price_per_room'] <= 103.4]
slice

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
28262,2,0,0,2,Meal Plan 1,0,Room_Type 4,3,2018,6,28,Online,0,0,4,4.0,1,Not_Canceled
1033,2,0,1,0,Meal Plan 1,0,Room_Type 1,3,2018,6,26,Online,0,0,4,3.0,1,Canceled
15043,2,0,1,3,Meal Plan 1,0,Room_Type 1,4,2018,9,26,Offline,0,0,4,2.0,0,Canceled
5952,2,0,0,1,Meal Plan 1,0,Room_Type 6,1,2018,5,19,Online,0,0,4,4.0,2,Not_Canceled
12315,2,0,0,1,Not Selected,0,Room_Type 1,2,2018,10,19,Online,0,0,4,3.0,1,Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8196,2,0,0,1,Not Selected,1,Room_Type 1,1,2017,10,9,Online,0,0,4,4.0,1,Not_Canceled
26279,2,0,2,2,Meal Plan 1,0,Room_Type 1,2,2018,11,12,Offline,0,0,4,1.0,1,Not_Canceled
2723,2,2,0,2,Meal Plan 1,0,Room_Type 6,4,2018,5,27,Online,0,0,4,4.0,2,Canceled
3343,2,0,0,2,Meal Plan 1,0,Room_Type 1,1,2018,6,8,Online,0,0,4,4.0,1,Canceled


In [16]:
slice.keys

<bound method NDFrame.keys of        no_of_adults  no_of_children  no_of_weekend_nights  no_of_week_nights  \
28262             2               0                     0                  2   
1033              2               0                     1                  0   
15043             2               0                     1                  3   
5952              2               0                     0                  1   
12315             2               0                     0                  1   
...             ...             ...                   ...                ...   
8196              2               0                     0                  1   
26279             2               0                     2                  2   
2723              2               2                     0                  2   
3343              2               0                     0                  2   
3046              2               0                     0                  2   

      typ

KeyError: "['Booking_ID'] not in index"

## ZeroR

In [53]:
def zeroR(target):
    res =  target.value_counts()
    
    return res.idxmax()

In [32]:
test = zeroR(x_train.loc[:,'booking_status'])
test

'Not_Canceled'

# Création des tables de fréquences pour OneR


In [11]:
x_train['required_car_parking_space'].value_counts()

0    24625
1      767
Name: required_car_parking_space, dtype: int64

In [47]:
freq_table = pd.crosstab(x_train['no_of_children'], x_train['booking_status']) 
freq_table

booking_status,Canceled,Not_Canceled
no_of_children,Unnamed: 1_level_1,Unnamed: 2_level_1
0,7713,15829
1,373,747
2,306,409
3,2,11
9,1,0
10,0,1


In [56]:
freq_table[['Canceled']]

booking_status,Canceled
no_of_children,Unnamed: 1_level_1
0,7713
1,373
2,306
3,2
9,1
10,0


In [57]:
res = freq_table['Canceled'] + freq_table['Not_Canceled']
freq_table['Total'] = res
res

no_of_children
0     23542
1      1120
2       715
3        13
9         1
10        1
dtype: int64

In [58]:
freq_table

booking_status,Canceled,Not_Canceled,Total
no_of_children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,7713,15829,23542
1,373,747,1120
2,306,409,715
3,2,11,13
9,1,0,1
10,0,1,1


In [72]:

def frequency_table(data):
    freq_tables = []
    for series_name, series in data.items():
        if (series_name == "Booking_status"):
            pass
        freq_table = pd.crosstab(data[series_name], data['booking_status'])
        freq_tables.append(freq_table)
    return freq_tables

 

In [74]:
freq = frequency_table(x_train)
for table in freq_tables:
    print(table)
    

booking_status  Canceled  Not_Canceled
no_of_adults                          
0                     29            71
1                   1282          4146
2                   6310         11940
3                    597          1009
4                      2             6
booking_status  Canceled  Not_Canceled
no_of_children                        
0                   7512         15979
1                    376           744
2                    326           436
3                      5            11
9                      1             1
10                     0             1
booking_status        Canceled  Not_Canceled
no_of_weekend_nights                        
0                         3518          8271
1                         2368          4667
2                         2194          4139
3                           48            52
4                           60            35
5                           22             5
6                           10             3
booking_st