# Sunset Pier Hotels and Resorts - Data Wrangling

## 1. Introduction
This notebook contains the data wrangling step to solve the following problem:
```
How can Sunset Pier Hotels and Resorts implement data backed risk mitigation strategies for the next hotel season that
(a) reduce their loss in revenues due to cancellation to sub 10% and (b) do not dissuade the clients from booking,
therefore increasing overall revenues by 5%?
```

## 2. Project Links
* Dataset Source: https://www.kaggle.com/competitions/99-dapt-sao-ih-hotel-booking/data
* Project Proposal and Problem Statement Worksheet: http://localhost:8888/files/Capstone%202%20Project%20Proposal.pdf
* Github Repository: https://github.com/lojames/springboard-capstone-project-2

## 3. Imports and Configurations

In [83]:
import pandas as pd
import numpy as np
import pickle as pkl

In [2]:
pd.options.display.max_columns = None

## 4. Data Collection

Data sources:
* tb_hotel_traintest.csv - the provided training set
* tb_hotel_feat_valid_2.csv - the provided validation set

In [3]:
bookings_a = pd.read_csv('tb_hotel_traintest.csv')
bookings_b = pd.read_csv('tb_hotel_feat_valid_2.csv')

In [4]:
bookings_a.head(2)

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
0,Resort Hotel,0,342,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,2015-07-01,2015-07-01,0
1,Resort Hotel,0,737,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,2015-07-01,2015-07-01,1


In [5]:
bookings_b.head(2)

Unnamed: 0,hotel,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
0,Resort Hotel,113,2,5,2,0.0,0,BB,NOR,Offline TA/TO,TA/TO,0,0,0,E,E,0,No Deposit,156.0,,0,Transient-Party,82.88,0,2,2015-03-11,2015-07-02,47
1,Resort Hotel,5,1,0,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,240.0,,0,Transient,97.0,0,0,2015-06-30,2015-07-05,132


In [6]:
set(bookings_a.columns)-set(bookings_b.columns)

{'is_cancelled'}

In [7]:
set(bookings_b.columns)-set(bookings_a.columns)

set()

As expected, the validation set does not have an `is_cancelled` column. Only the data from set a (`bookings_a`) will be used to train and test models.

In [8]:
bookings = bookings_a
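
The two set differences above can be folded into one helper. This is a minimal sketch, not part of the original notebook: `column_diff` is a hypothetical name, and the two tiny frames are made-up stand-ins for the train/test and validation sets.

```python
import pandas as pd

def column_diff(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        'only_in_left': sorted(left_cols - right_cols),
        'only_in_right': sorted(right_cols - left_cols),
    }

# Toy frames standing in for bookings_a and bookings_b (illustrative values only).
train = pd.DataFrame({'lead_time': [342], 'is_cancelled': [0]})
valid = pd.DataFrame({'lead_time': [113]})
diff = column_diff(train, valid)
print(diff)
```

Returning both directions at once makes it harder to forget the second check.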

## 5. Data Definition

### 5.1. Column Names and Data Types

In [9]:
bookings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113409 entries, 0 to 113408
Data columns (total 29 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           113409 non-null  object 
 1   is_cancelled                    113409 non-null  int64  
 2   lead_time                       113409 non-null  int64  
 3   stays_in_weekend_nights         113409 non-null  int64  
 4   stays_in_week_nights            113409 non-null  int64  
 5   adults                          113409 non-null  int64  
 6   children                        113406 non-null  float64
 7   babies                          113409 non-null  int64  
 8   meal                            113409 non-null  object 
 9   country                         112951 non-null  object 
 10  market_segment                  113409 non-null  object 
 11  distribution_channel            113409 non-null  object 
 12  is_repeated_gues

In [10]:
# Helper code to create object to change data types
'''
sorted_bookings_cols = bookings.columns.sort_values()
temp = [print (f'    \'{b}\': ') for b in sorted_bookings_cols]
'''

"\nsorted_bookings_cols = bookings.columns.sort_values()\ntemp = [print (f'    '{b}': ') for b in sorted_bookings_cols]\n"

In [11]:
new_data_types = {
    'agent': 'category',
    'arrival_date': 'datetime64',
    'assigned_room_type': 'category',
    'company': 'category',
    'country': 'category',
    'customer_type': 'category',
    'deposit_type': 'category',
    'distribution_channel': 'category',
    'is_cancelled': 'category',
    'is_repeated_guest': 'category',
    'market_segment': 'category',
    'meal': 'category', 
    'reservation_status_date': 'datetime64',
    'reserved_room_type': 'category',
    'hotel': 'string'
}

bookings = bookings.astype(new_data_types)
bookings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113409 entries, 0 to 113408
Data columns (total 29 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   hotel                           113409 non-null  string        
 1   is_cancelled                    113409 non-null  category      
 2   lead_time                       113409 non-null  int64         
 3   stays_in_weekend_nights         113409 non-null  int64         
 4   stays_in_week_nights            113409 non-null  int64         
 5   adults                          113409 non-null  int64         
 6   children                        113406 non-null  float64       
 7   babies                          113409 non-null  int64         
 8   meal                            113409 non-null  category      
 9   country                         112951 non-null  category      
 10  market_segment                  113409 non-null  categor

### 5.2. Column Descriptions

In [12]:
description = bookings.describe()
description.T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
lead_time,113409.0,104.109074,106.894825,0.0,18.0,69.0,161.0,737.0
stays_in_weekend_nights,113409.0,0.927907,0.998723,0.0,0.0,1.0,2.0,19.0
stays_in_week_nights,113409.0,2.500498,1.90667,0.0,1.0,2.0,3.0,50.0
adults,113409.0,1.857304,0.583753,0.0,2.0,2.0,2.0,55.0
children,113406.0,0.104227,0.398976,0.0,0.0,0.0,0.0,10.0
babies,113409.0,0.00798,0.098027,0.0,0.0,0.0,0.0,10.0
previous_cancellations,113409.0,0.087101,0.844538,0.0,0.0,0.0,0.0,26.0
previous_bookings_not_canceled,113409.0,0.13633,1.497662,0.0,0.0,0.0,0.0,72.0
booking_changes,113409.0,0.220917,0.649771,0.0,0.0,0.0,0.0,21.0
days_in_waiting_list,113409.0,2.3262,17.613897,0.0,0.0,0.0,0.0,391.0


### 5.3. Counts and Percents of Unique Values

In [13]:
num_unique = bookings.nunique()
print (f'Number of Unique Values Per Column\n{num_unique}')

Number of Unique Values Per Column
hotel                                  2
is_cancelled                           2
lead_time                            478
stays_in_weekend_nights               17
stays_in_week_nights                  35
adults                                14
children                               5
babies                                 5
meal                                   5
country                              174
market_segment                         8
distribution_channel                   5
is_repeated_guest                      2
previous_cancellations                15
previous_bookings_not_canceled        72
reserved_room_type                    10
assigned_room_type                    12
booking_changes                       20
deposit_type                           3
agent                                327
company                              348
days_in_waiting_list                 127
customer_type                          4
adr                   

In [14]:
percent_unique = num_unique/len(bookings)*100
print (f'Percent Unique Values By Column\n{percent_unique}')

Percent Unique Values By Column
hotel                               0.001764
is_cancelled                        0.001764
lead_time                           0.421483
stays_in_weekend_nights             0.014990
stays_in_week_nights                0.030862
adults                              0.012345
children                            0.004409
babies                              0.004409
meal                                0.004409
country                             0.153427
market_segment                      0.007054
distribution_channel                0.004409
is_repeated_guest                   0.001764
previous_cancellations              0.013226
previous_bookings_not_canceled      0.063487
reserved_room_type                  0.008818
assigned_room_type                  0.010581
booking_changes                     0.017635
deposit_type                        0.002645
agent                               0.288337
company                             0.306854
days_in_waiting_list   

### 5.4. Range of Values

In [15]:
description.loc[['min','max']].T

Unnamed: 0,min,max
lead_time,0.0,737.0
stays_in_weekend_nights,0.0,19.0
stays_in_week_nights,0.0,50.0
adults,0.0,55.0
children,0.0,10.0
babies,0.0,10.0
previous_cancellations,0.0,26.0
previous_bookings_not_canceled,0.0,72.0
booking_changes,0.0,21.0
days_in_waiting_list,0.0,391.0


### 5.5. Considerations

***Do your column names correspond to what those columns store?***

Yes. A quick examination of the first 20 rows (see Section 4) seems to indicate so. Further examination will be done later in the data cleaning step.

***Check the data types of your columns. Are they sensible?***

Yes.  Datatypes were changed accordingly above in 5.1.

***Calculate summary statistics for each of your columns, such
as mean, median, mode, standard deviation, range, and
number of unique values. What does this tell you about your
data?***

These summary statistics reveal the central tendency, dispersion, and shape of the dataset's distributions.  Furthermore, the validity of the data can be analyzed through these statistics.
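
`describe()` does not report the mode, range, or number of unique values that the question asks for. The sketch below shows one way they could be collected in a single table. `extended_summary` is a hypothetical helper, the toy frame is invented, and when a column has several modes only the smallest is kept.

```python
import pandas as pd

def extended_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Median, (first) mode, range and unique count for each numeric column."""
    numeric = df.select_dtypes(include='number')
    return pd.DataFrame({
        'median': numeric.median(),
        'mode': numeric.mode().iloc[0],   # first row holds the smallest mode per column
        'range': numeric.max() - numeric.min(),
        'n_unique': numeric.nunique(),
    })

# Invented values, only to show the shape of the result.
toy = pd.DataFrame({'adults': [2, 2, 1, 3], 'children': [0.0, 0.0, 1.0, 0.0]})
summary = extended_summary(toy)
print(summary)
```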

***What do you now need to investigate?***

Missing values, duplicates, and outliers.

## 6. Data Cleaning

In [16]:
bookings_cleaned = bookings.copy(deep=True)

### 6.1. Missing Values

In [17]:
bookings_cleaned.isna().sum()

hotel                                  0
is_cancelled                           0
lead_time                              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               3
babies                                 0
meal                                   0
country                              458
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              15491
company                           106972
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_par

The following columns have missing values that need to be addressed:
* `children`
* `country`
* `agent`
* `company`

#### 6.1.1. `children`
Only 3 reservations have missing values in the children column.

In [18]:
bookings_cleaned['children'].value_counts()

0.0     105210
1.0       4652
2.0       3471
3.0         72
10.0         1
Name: children, dtype: int64

The mode of the column is 0. Since about 92.77% of all reservations have 0 children, imputing the 3 missing values with 0 is a reasonable quick approach.

In [19]:
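# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not propagate to the DataFrame.
bookings_cleaned['children'] = bookings_cleaned['children'].fillna(0)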
bookings_cleaned['children'].value_counts()

0.0     105213
1.0       4652
2.0       3471
3.0         72
10.0         1
Name: children, dtype: int64
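
The arithmetic behind the mode share can be reproduced from the counts printed above. This is a sketch under stated assumptions: the series is rebuilt from the `value_counts()` output plus the 3 missing rows, not loaded from the real file, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Counts copied from the first value_counts() output above; 3 rows are missing.
counts = {0.0: 105210, 1.0: 4652, 2.0: 3471, 3.0: 72, 10.0: 1}
values = np.repeat(list(counts.keys()), list(counts.values()))
children = pd.Series(np.concatenate([values, [np.nan] * 3]), name='children')

mode_value = children.mode().iloc[0]
# Share of all rows (missing rows included in the denominator) equal to the mode.
mode_share = (children == mode_value).mean() * 100
imputed = children.fillna(mode_value)

print(f'mode={mode_value}, share={mode_share:.2f}%, missing after={imputed.isna().sum()}')
```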

#### 6.1.2. `country`
Only 458 rows are missing a `country` value. That is few enough that the rows could simply be dropped, but they can be kept by assigning the missing values to an unused 3-letter code.

In [20]:
bookings_cleaned['country'].value_counts()

PRT    46213
GBR    11487
FRA     9890
ESP     8162
DEU     6919
       ...  
MDG        1
MMR        1
SMR        1
MRT        1
PYF        1
Name: country, Length: 174, dtype: int64

While the documentation specifies that the country values follow the ISO 3155–3:2013 standard, this standard is not readily viewable online. The ISO 3166-1 alpha-3 code standard (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) will be referenced instead. The codes have been saved to the text file `country_codes.txt`.

A quick sanity check:

In [21]:
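# Build a dictionary with the ISO 3166-1 alpha-3 code as key and the country name as value.
# Each line of country_codes.txt holds the code in characters 0-2 and the name from character 5 on.
country_codes = []
country_names = []
with open('country_codes.txt', encoding='utf8') as file:
    for line in file:
        country_codes.append(line[0:3])
        # Strip only the newline so a final line without one keeps its last character.
        country_names.append(line[5:].rstrip('\n'))
countries_dictionary = dict(zip(country_codes, country_names))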
countries_dictionary['PRT']

'Portugal'

In [22]:
country_codes_set = set(country_codes)
bookings_cleaned[~bookings_cleaned['country'].isin(country_codes_set)]['country'].unique()

[NaN, 'CN', 'TMP']
Categories (174, object): ['ABW', 'AGO', 'AIA', 'ALB', ..., 'VGB', 'VNM', 'ZAF', 'ZWE']

Unfortunately, the two standards are clearly not identical (`CN` and `TMP` do not appear in the reference list); however, they appear to be similar. Since the ISO 3155–3:2013 standard is not readily available, we have to make do with what we have.

With regard to missing values, the ISO 3166-1 alpha-3 standard reserves several ranges of 3-letter codes for user assignment, including `ZZZ`. `ZZZ` will be used to mark missing values.

In [23]:
bookings_cleaned['country'] = bookings_cleaned['country'].cat.add_categories('ZZZ')
bookings_cleaned['country'] = bookings_cleaned['country'].fillna('ZZZ')
bookings_cleaned['country'].isna().sum()

0

In [24]:
bookings_cleaned['country'].value_counts()['ZZZ']

458
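
Whether a code such as `ZZZ` is user-assignable can be checked mechanically. The sketch below encodes the user-assigned alpha-3 ranges as listed on the Wikipedia page referenced above (AAA to AAZ, QMA to QZZ, XAA to XZZ, ZZA to ZZZ). `is_user_assigned_alpha3` is a hypothetical helper, not part of the notebook.

```python
def is_user_assigned_alpha3(code: str) -> bool:
    """True if `code` lies in a user-assigned ISO 3166-1 alpha-3 range."""
    code = code.upper()
    # Reject anything that is not exactly three ASCII letters.
    if len(code) != 3 or not code.isalpha() or not code.isascii():
        return False
    return (
        'AAA' <= code <= 'AAZ'
        or 'QMA' <= code <= 'QZZ'
        or 'XAA' <= code <= 'XZZ'
        or 'ZZA' <= code <= 'ZZZ'
    )

for candidate in ['ZZZ', 'PRT', 'CN']:
    print(candidate, is_user_assigned_alpha3(candidate))
```

A code in one of these ranges will not be assigned to a country by the standard, so it is a safe sentinel.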

#### 6.1.3. `agent`
There are 15491 missing values for `agent`.

In [25]:
print (f'{15491/len(bookings)*100:.2f}% of values are missing.')

13.66% of values are missing.


While we can opt to drop the reservations with missing agent values, it may be better to assign the missing values to an unused value.

In [26]:
np.array(bookings['agent'].cat.categories).min()

1.0

Since the lowest value used for the agent category is 1, we can safely designate 0 as the value for missing entries.

In [27]:
bookings_cleaned['agent'] = bookings_cleaned['agent'].cat.add_categories(0)

In [28]:
bookings_cleaned['agent'] = bookings_cleaned['agent'].fillna(0)
bookings_cleaned['agent'].isna().sum()

0

In [29]:
bookings_cleaned['agent'].value_counts()[0]

15491

#### 6.1.4. `company`
There are 106972 missing values for company.

In [30]:
print (f'{106972/len(bookings)*100:.2f}% of values are missing.')

94.32% of values are missing.


Since the vast majority of rows are missing a `company` value, dropping those rows is not an option. The missing values will be reassigned to an unused value instead.

In [31]:
np.array(bookings['company'].cat.categories).min()

6.0

Since the lowest value used for the company category is 6, 0 is a safe value to assign to missing entries.

In [32]:
bookings_cleaned['company'] = bookings_cleaned['company'].cat.add_categories(0)

In [33]:
bookings_cleaned['company'] = bookings_cleaned['company'].fillna(0)
bookings_cleaned['company'].isna().sum()

0

In [34]:
bookings_cleaned['company'].value_counts()[0]

106972
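
The same add-a-category-then-fill pattern was used for `country`, `agent`, and `company`, so it could be factored into one helper. This is a sketch only: `fill_missing_category` is a hypothetical name, and the guard against reusing an existing category is an added assumption, not something the notebook does.

```python
import pandas as pd

def fill_missing_category(series: pd.Series, sentinel) -> pd.Series:
    """Return a copy of a categorical series with NaN replaced by `sentinel`."""
    if sentinel in series.cat.categories:
        # Reusing a real category would silently merge missing and observed values.
        raise ValueError(f'{sentinel!r} is already a category')
    return series.cat.add_categories([sentinel]).fillna(sentinel)

# Toy agent column with two missing entries (illustrative values only).
agent = pd.Series([1.0, None, 240.0, None]).astype('category')
filled = fill_missing_category(agent, 0)
print(filled.value_counts())
```

Usage on the real data would look like `bookings_cleaned['agent'] = fill_missing_category(bookings_cleaned['agent'], 0)`.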

### 6.2. Duplicated Values

In [35]:
bookings_cleaned[bookings_cleaned.duplicated()]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking


Note that the dataset is curated.

In [36]:
len(bookings_cleaned['id_booking'].unique())==len(bookings_cleaned)

True

For this dataset, we have to trust that `id_booking` was generated correctly for each unique booking. Without this column, it is virtually impossible to tell definitively whether two bookings are erroneous duplicates of each other: separate bookings can share identical values in every remaining column and still refer to different reservations. For example, one individual may make reservations for 4 different couples, all originating from the same country and following the same itinerary.

**Due to this fact and the fact that this is a curated dataset, no rows will be culled from this dataset.**
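
To see how many bookings are identical apart from their id, the id column can be left out of the duplicate check. This is a minimal sketch on made-up rows; `count_lookalikes` is a hypothetical helper, and a non-zero count only flags candidates, it does not prove that any row is an erroneous duplicate.

```python
import pandas as pd

def count_lookalikes(df: pd.DataFrame, id_col: str = 'id_booking') -> int:
    """Count rows that match at least one other row on every column except `id_col`."""
    return int(df.drop(columns=id_col).duplicated(keep=False).sum())

# Illustrative rows: the first two differ only in id_booking.
toy = pd.DataFrame({
    'country': ['PRT', 'PRT', 'GBR'],
    'adults': [2, 2, 2],
    'id_booking': [14779, 14780, 14781],
})
n_lookalikes = count_lookalikes(toy)
print(n_lookalikes, int(toy.duplicated().sum()))
```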

### 6.3. Outliers

In [37]:
bookings_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113409 entries, 0 to 113408
Data columns (total 29 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   hotel                           113409 non-null  string        
 1   is_cancelled                    113409 non-null  category      
 2   lead_time                       113409 non-null  int64         
 3   stays_in_weekend_nights         113409 non-null  int64         
 4   stays_in_week_nights            113409 non-null  int64         
 5   adults                          113409 non-null  int64         
 6   children                        113409 non-null  float64       
 7   babies                          113409 non-null  int64         
 8   meal                            113409 non-null  category      
 9   country                         113409 non-null  category      
 10  market_segment                  113409 non-null  categor

In [38]:
bookings_cleaned.describe()

Unnamed: 0,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,id_booking
count,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0,113409.0
mean,104.109074,0.927907,2.500498,1.857304,0.104225,0.00798,0.087101,0.13633,0.220917,2.3262,101.882431,0.062367,0.571612,59714.795969
std,106.894825,0.998723,1.90667,0.583753,0.398971,0.098027,0.844538,1.497662,0.649771,17.613897,50.626711,0.24519,0.792979,34464.577528
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0,0.0
25%,18.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,69.4,0.0,0.0,29879.0
50%,69.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,94.9,0.0,0.0,59708.0
75%,161.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,1.0,89584.0
max,737.0,19.0,50.0,55.0,10.0,10.0,26.0,72.0,21.0,391.0,5400.0,8.0,5.0,119389.0


#### 6.3.1. `lead_time`

The range of `lead_time` values is reasonable.  It's not unheard of to make reservations years in advance.

#### 6.3.2. `stays_in_weekend_nights` and `stays_in_week_nights`

A max of 19 `stays_in_weekend_nights` may seem high, but it checks out with a value of 50 `stays_in_week_nights`. The following table shows bookings with more than 8 weekend nights. In every case, the number of weekend nights is consistent with the number of week nights.

In [39]:
bookings_cleaned[bookings_cleaned['stays_in_weekend_nights']>8][['stays_in_weekend_nights','stays_in_week_nights']]

Unnamed: 0,stays_in_weekend_nights,stays_in_week_nights
1578,13,33
3630,12,30
3660,12,30
5072,9,24
8853,9,21
8854,9,21
9335,16,40
13325,18,42
13326,19,50
13327,9,21


#### 6.3.3. `adults`

Values of 40, 50, and 55 seem suspect.

In [40]:
bookings_cleaned['adults'].value_counts()

2     85212
1     21817
3      5917
0       385
4        62
26        5
27        2
20        2
5         2
40        1
50        1
55        1
6         1
10        1
Name: adults, dtype: int64

In [41]:
bookings_cleaned[bookings_cleaned['adults']>4]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
1467,Resort Hotel,1,304,0,3,40,0.0,0,BB,PRT,Direct,Direct,0,0,0,A,A,0,No Deposit,0.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-03,1539
1514,Resort Hotel,1,333,2,5,26,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,96.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-05,1587
1568,Resort Hotel,1,336,1,2,50,0.0,0,BB,PRT,Direct,Direct,0,0,0,A,A,0,No Deposit,0.0,0.0,0,Group,0.0,0,0,2015-01-18,2015-09-07,1643
1674,Resort Hotel,1,340,2,5,26,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,96.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-12,1752
1801,Resort Hotel,1,347,2,5,26,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,96.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-19,1884
1834,Resort Hotel,1,349,1,3,27,0.0,0,HB,PRT,Direct,Direct,0,0,0,A,A,0,No Deposit,0.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-21,1917
1878,Resort Hotel,1,352,1,3,27,0.0,0,HB,PRT,Direct,Direct,0,0,0,A,A,0,No Deposit,0.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-24,1962
1917,Resort Hotel,1,354,2,5,26,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,96.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-09-26,2003
2067,Resort Hotel,1,361,2,5,26,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,0,No Deposit,96.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-10-03,2164
2074,Resort Hotel,1,338,2,0,55,0.0,0,HB,PRT,Direct,Direct,0,0,0,A,A,0,No Deposit,0.0,0.0,0,Group,0.0,0,0,2015-01-02,2015-10-04,2173


In [42]:
def cancellation_percentage(series):
    return sum(series)/len(series)*100

bookings_cleaned.groupby('adults').agg({'is_cancelled': [cancellation_percentage, 'count']})

Unnamed: 0_level_0,is_cancelled,is_cancelled
Unnamed: 0_level_1,cancellation_percentage,count
adults,Unnamed: 1_level_2,Unnamed: 2_level_2
0,26.753247,385
1,29.10116,21817
2,39.314885,85212
3,34.662836,5917
4,25.806452,62
5,100.0,2
6,100.0,1
10,100.0,1
20,100.0,2
26,100.0,5


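While large group reservations of 40, 50, and 55 adults seem somewhat unlikely, these outliers should still be kept since they could inform our model. Every booking with 5 or more adults shown in the tables above was cancelled, which suggests that large groups tend to cancel.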
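
Because the large-party groups are tiny, it helps to read each cancellation percentage next to its count. The sketch below rebuilds two rows of the table above (62 bookings with 4 adults, of which 16 were cancelled, and 5 bookings with 26 adults, all cancelled) and computes the rate as the mean of a 0/1 column. It is an alternative formulation, not the notebook's own `cancellation_percentage` function.

```python
import pandas as pd

# Toy frame rebuilt from two rows of the table above.
toy = pd.DataFrame({
    'adults': [4] * 62 + [26] * 5,
    'is_cancelled': [1] * 16 + [0] * 46 + [1] * 5,
})
toy['is_cancelled'] = toy['is_cancelled'].astype('category')

rates = (
    toy.assign(cancelled=toy['is_cancelled'].astype(int))  # category -> 0/1 integers
       .groupby('adults')['cancelled']
       .agg(cancellation_percentage=lambda s: s.mean() * 100, count='count')
)
print(rates)
```

A 100% rate from 5 bookings is much weaker evidence than a 39% rate from tens of thousands.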

#### 6.3.4. `children`, `babies`, `previous_cancellations`, `previous_bookings_not_canceled`, `booking_changes`, `days_in_waiting_list`, `required_car_parking_spaces`, and `total_of_special_requests`

These columns do have outliers, but the max values are still in the realm of possibility.

In [58]:
cols_to_analyze = ['children', 'babies', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list', 'required_car_parking_spaces', 'total_of_special_requests']

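def analyze(cols, df=None):
    # Resolve the default at call time so that later reassignments of
    # bookings_cleaned (e.g. after dropping rows) are picked up.
    if df is None:
        df = bookings_cleaned
    for col in cols:
        print(f"'{col}'")
        print(df[col].value_counts())
        print('\n')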

analyze(cols_to_analyze)

'children'
0.0     105212
1.0       4652
2.0       3471
3.0         72
10.0         1
Name: children, dtype: int64


'babies'
0     112535
1        856
2         15
10         1
9          1
Name: babies, dtype: int64


'previous_cancellations'
0     107238
1       5759
2        114
3         60
24        47
11        31
4         28
26        26
25        22
6         22
5         18
19        17
14        13
13        12
21         1
Name: previous_cancellations, dtype: int64


'previous_bookings_not_canceled'
0     109980
1       1461
2        553
3        315
4        211
       ...  
47         1
49         1
50         1
51         1
72         1
Name: previous_bookings_not_canceled, Length: 72, dtype: int64


'booking_changes'
0     96243
1     12058
2      3617
3       884
4       352
5       114
6        61
7        27
8        16
9         8
10        6
13        5
14        5
16        2
17        2
12        2
15        2
11        2
21        1
18        1
Name: booking_ch

In [44]:
bookings_cleaned[(bookings_cleaned['previous_cancellations']==26)]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
14030,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14779
14031,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14780
14032,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14781
14033,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14782
14034,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14783
14035,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14784
14036,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14785
14037,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14786
14038,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14787
14039,Resort Hotel,1,275,2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,26,0,A,A,0,Non Refund,208.0,0.0,0,Transient,50.0,0,0,2015-01-30,2015-10-04,14788


Higher `previous_cancellations` seems to correspond to one entity making multiple reservations and cancelling them. Note that the TA/TO designation refers to Travel Agents and Tour Operators, so these reservations are likely legitimate.

A Google search of "ISO 3155–3:2013" leads to https://www.sciencedirect.com/science/article/pii/S2352340918315191 which seems to correspond to this dataset. The abstract mentions a total of 119,390 observations. Combining the train-test set with the validation set also gives a total of 119,390 observations.

In [45]:
len(bookings_a) + len(bookings_b)

119390

In [46]:
bookings_cleaned[(bookings_cleaned['previous_cancellations']==25)]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
14080,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14829
14081,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14830
14082,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14831
14083,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14832
14084,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14833
14085,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14834
14086,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14835
14087,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14836
14088,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14837
14089,Resort Hotel,1,222,1,5,2,0.0,0,FB,PRT,Groups,Corporate,0,25,0,A,A,0,Non Refund,252.0,0.0,0,Transient,49.95,0,0,2015-03-03,2015-09-15,14839


In [47]:
bookings_cleaned[(bookings_cleaned['previous_cancellations']==24)]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
14217,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14972
14218,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14973
14219,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14974
14220,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14975
14221,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14976
14222,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14977
14223,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14978
14224,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14979
14225,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14980
14226,Resort Hotel,1,166,0,2,2,0.0,0,FB,PRT,Groups,TA/TO,0,24,0,A,A,0,Non Refund,0.0,0.0,0,Transient,121.5,0,0,2015-04-28,2015-07-15,14981


#### 6.3.5. `adr`

In [48]:
bookings_cleaned['adr'].describe()

count    113409.000000
mean        101.882431
std          50.626711
min          -6.380000
25%          69.400000
50%          94.900000
75%         126.000000
max        5400.000000
Name: adr, dtype: float64

In [49]:
bookings_cleaned[bookings_cleaned['adr']>550]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
46046,City Hotel,1,35,0,1,2,0.0,0,BB,PRT,Offline TA/TO,TA/TO,0,0,0,A,A,1,Non Refund,12.0,0.0,0,Transient,5400.0,0,0,2016-02-19,2016-03-25,48515


There seems to be an error in the `adr` value for the booking with an `adr` of 5400. This row will be dropped.

In [50]:
bookings_cleaned = bookings_cleaned.drop(index=46046, axis=0)

In [51]:
bookings_cleaned[bookings_cleaned['adr']>550]

Unnamed: 0,hotel,is_cancelled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date,id_booking
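
The 550 cutoff above was chosen by inspection. For comparison, the sketch below applies Tukey's 1.5 × IQR rule to the `adr` quartiles reported by `describe()`. This rule is not the method used in this notebook, and `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Lower and upper outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the adr describe() output above.
lower, upper = iqr_fences(69.4, 126.0)
print(f'fences: {lower:.1f} to {upper:.1f}')
print('5400 flagged:', 5400.0 > upper)
print('-6.38 flagged:', -6.38 < lower)
```

Under this rule the 5400 value lies far above the upper fence and the minimum of -6.38 lies inside the lower fence. The upper fence is well below 550, so the rule would also flag many high but plausible rates. That is one reason to prefer a manual threshold here.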


#### 6.3.6. Categorical Data

In [52]:
categorical_columns = bookings_cleaned.select_dtypes(include='category').columns
categorical_columns

Index(['is_cancelled', 'meal', 'country', 'market_segment',
       'distribution_channel', 'is_repeated_guest', 'reserved_room_type',
       'assigned_room_type', 'deposit_type', 'agent', 'company',
       'customer_type'],
      dtype='object')

In [59]:
analyze(categorical_columns)

'is_cancelled'
0    71373
1    42035
Name: is_cancelled, dtype: int64


'meal'
BB           87666
HB           13769
SC           10113
Undefined     1111
FB             749
Name: meal, dtype: int64


'country'
PRT    46212
GBR    11487
FRA     9890
ESP     8162
DEU     6919
       ...  
SDN        1
NCL        1
NIC        1
NPL        1
TGO        1
Name: country, Length: 175, dtype: int64


'market_segment'
Online TA        53678
Offline TA/TO    22993
Groups           18814
Direct           11990
Corporate         5000
Complementary      705
Aviation           226
Undefined            2
Name: market_segment, dtype: int64


'distribution_channel'
TA/TO        92980
Direct       13918
Corporate     6325
GDS            181
Undefined        4
Name: distribution_channel, dtype: int64


'is_repeated_guest'
0    109816
1      3592
Name: is_repeated_guest, dtype: int64


'reserved_room_type'
A    81681
D    18255
E     6173
F     2759
G     1988
B     1070
C      889
H      576
P       11


Upon manual inspection, all values are within categorical parameters.
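
The manual inspection can be complemented by counting placeholder labels such as `Undefined`, which appears in the `meal`, `market_segment`, and `distribution_channel` outputs above. This is a sketch on a made-up frame, and `count_placeholder` is a hypothetical helper.

```python
import pandas as pd

def count_placeholder(df: pd.DataFrame, label: str = 'Undefined') -> dict:
    """Count occurrences of `label` in every categorical column that contains it."""
    counts = {}
    for col in df.select_dtypes(include='category').columns:
        # Compare as plain objects so numeric categories never raise on a string label.
        n = int((df[col].astype(object) == label).sum())
        if n:
            counts[col] = n
    return counts

# Illustrative values only.
toy = pd.DataFrame({
    'meal': pd.Categorical(['BB', 'Undefined', 'HB', 'Undefined']),
    'deposit_type': pd.Categorical(['No Deposit', 'Non Refund', 'No Deposit', 'No Deposit']),
    'adults': [2, 2, 1, 2],
})
placeholder_counts = count_placeholder(toy)
print(placeholder_counts)
```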

#### 6.3.7. Datetime Data

In [57]:
datetime_columns = bookings_cleaned.select_dtypes(include='datetime64').columns
datetime_columns

Index(['reservation_status_date', 'arrival_date'], dtype='object')

In [63]:
bookings_cleaned[datetime_columns].describe(datetime_is_numeric=True)

Unnamed: 0,reservation_status_date,arrival_date
count,113408,113408
mean,2016-07-29 17:39:06.755078912,2016-08-28 11:31:33.453724928
min,2014-10-17 00:00:00,2015-07-01 00:00:00
25%,2016-02-01 00:00:00,2016-03-13 00:00:00
50%,2016-08-06 00:00:00,2016-09-05 00:00:00
75%,2017-02-08 00:00:00,2017-03-18 00:00:00
max,2017-09-14 00:00:00,2017-08-31 00:00:00


All dates are within bounds.
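
"Within bounds" can be turned into an explicit check. In the sketch below the bounds `2014-01-01` and `2017-12-31` are assumptions chosen for illustration, as the notebook does not state the bounds it had in mind. `dates_within` is a hypothetical helper, and the toy frame holds only the min and max dates from the table above.

```python
import pandas as pd

def dates_within(df: pd.DataFrame, columns, earliest, latest) -> dict:
    """For each column, report whether every date lies in [earliest, latest]."""
    earliest, latest = pd.Timestamp(earliest), pd.Timestamp(latest)
    return {col: bool(df[col].between(earliest, latest).all()) for col in columns}

# Min and max dates copied from the describe() output above.
toy = pd.DataFrame({
    'reservation_status_date': pd.to_datetime(['2014-10-17', '2017-09-14']),
    'arrival_date': pd.to_datetime(['2015-07-01', '2017-08-31']),
})
in_bounds = dates_within(toy, list(toy.columns), '2014-01-01', '2017-12-31')
print(in_bounds)
```

Missing dates (`NaT`) fail the check, since `between` returns `False` for them.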

## 7. Exporting Data

In [73]:
bookings_cleaned.to_csv('bookings_cleaned.csv')

In [84]:
with open('bookings_cleaned.pkl', 'wb') as file:
    pkl.dump(new_data_types, file)

In [85]:
with open('bookings_cleaned.pkl', 'rb') as file:
    data_types = pkl.load(file)
print (data_types)

{'agent': 'category', 'arrival_date': 'datetime64', 'assigned_room_type': 'category', 'company': 'category', 'country': 'category', 'customer_type': 'category', 'deposit_type': 'category', 'distribution_channel': 'category', 'is_cancelled': 'category', 'is_repeated_guest': 'category', 'market_segment': 'category', 'meal': 'category', 'reservation_status_date': 'datetime64', 'reserved_room_type': 'category', 'hotel': 'string'}
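
The pickled mapping is presumably meant to restore the column types when the CSV is read back in a later notebook. A sketch of how that could work is below. The mapping is split because `read_csv` takes datetime columns through `parse_dates`, not `dtype`. `split_dtypes`, the shortened mapping, and the in-memory CSV are all illustrative assumptions.

```python
import io
import pandas as pd

def split_dtypes(data_types: dict) -> tuple:
    """Separate datetime columns (for parse_dates) from the rest (for dtype)."""
    date_cols = [col for col, kind in data_types.items() if kind.startswith('datetime64')]
    other = {col: kind for col, kind in data_types.items() if col not in date_cols}
    return date_cols, other

# Shortened stand-ins for the pickled mapping and the exported file.
data_types = {'arrival_date': 'datetime64', 'meal': 'category', 'hotel': 'string'}
csv_text = 'hotel,meal,arrival_date\nResort Hotel,BB,2015-07-01\nCity Hotel,HB,2016-03-25\n'

date_cols, other = split_dtypes(data_types)
reloaded = pd.read_csv(io.StringIO(csv_text), dtype=other, parse_dates=date_cols)
print(reloaded.dtypes)
```

Since `bookings_cleaned.csv` was written with its index, reading the real file would likely also need `index_col=0`.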
