## EDA in Python On Hotel Cancellation Rates

In this project, we'll be doing Exploratory Data Analysis, or EDA, on a dataset that consists of hotel booking data. It includes many details about the bookings, including room specifications, the length of stay, the time between the booking and the stay, whether the booking was canceled, and how the booking was made. The data was gathered between July 2015 and August 2017. You can consult the appendices at the bottom of the notebook for citations and an overview of all variables.


In [1]:
# Import the required packages
import pandas as pd
import plotly.express as px


### Import the data


In [2]:
# Import hotel_bookings_clean_v2.csv
df = pd.read_csv('hotel_bookings_clean_v2.csv')
df

Unnamed: 0,is_canceled,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,required_car_parking_spaces,total_of_special_requests,avg_daily_rate,booked_by_company,booked_by_agent,hotel_City,hotel_Resort,meal_BB,meal_FB,meal_HB,meal_No_meal,market_segment_Aviation,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline_TA_TO,market_segment_Online_TA,market_segment_Undefined,distribution_channel_Corporate,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA_TO,distribution_channel_Undefined,reserved_room_type_A,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party,deposit_type
0,0,342,27,1,7,0,0,2,0.0,0,0,0,0,0,0,0.00,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
1,0,737,27,1,7,0,0,2,0.0,0,0,0,0,0,0,0.00,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
2,0,7,27,1,7,0,1,1,0.0,0,0,0,0,0,0,75.00,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
3,0,13,27,1,7,0,1,1,0.0,0,0,0,0,0,0,75.00,0,1,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
4,0,14,27,1,7,0,2,2,0.0,0,0,0,0,0,1,98.00,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119205,0,23,35,30,8,2,5,2,0.0,0,0,0,0,0,0,96.14,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
119206,0,102,35,31,8,2,5,3,0.0,0,0,0,0,0,2,225.43,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,no deposit made
119207,0,34,35,31,8,2,5,2,0.0,0,0,0,0,0,4,157.71,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made
119208,0,109,35,31,8,2,5,2,0.0,0,0,0,0,0,0,104.40,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,no deposit made


### Basic exploration

In [3]:
# Show dimensions
df.shape

(119210, 54)

In [4]:
# Are there missing values?
df.isnull().sum()


is_canceled                       0
lead_time                         0
arrival_date_week_number          0
arrival_date_day_of_month         0
arrival_date_month                0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          4
babies                            0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
required_car_parking_spaces       0
total_of_special_requests         0
avg_daily_rate                    0
booked_by_company                 0
booked_by_agent                   0
hotel_City                        0
hotel_Resort                      0
meal_BB                           0
meal_FB                           0
meal_HB                           0
meal_No_meal                      0
market_segment_Aviation           0
market_segment_Complementary      0
market_segment_Corporate          0
market_segment_Direct       

In [5]:
# Describe with summary statistics
df.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,required_car_parking_spaces,total_of_special_requests,avg_daily_rate,booked_by_company,booked_by_agent,hotel_City,hotel_Resort,meal_BB,meal_FB,meal_HB,meal_No_meal,market_segment_Aviation,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline_TA_TO,market_segment_Online_TA,market_segment_Undefined,distribution_channel_Corporate,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA_TO,distribution_channel_Undefined,reserved_room_type_A,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
count,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119206.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0
mean,0.370766,104.109227,27.163376,15.798717,6.552051,0.927053,2.499195,1.859206,0.104047,0.007961,0.031499,0.087191,0.137094,0.062553,0.571504,101.969092,0.056774,0.863434,0.664063,0.335937,0.773727,0.006694,0.121282,0.098297,0.001971,0.006107,0.044308,0.105545,0.166018,0.202852,0.473182,1.7e-05,0.055792,0.122565,0.001619,0.819982,4.2e-05,0.720351,0.009353,0.00781,0.160884,0.054685,0.024276,0.017549,0.005042,5e-05,0.876277,0.122364,0.001359,0.034158,0.004815,0.750575,0.210452
std,0.483012,106.87545,13.601107,8.78107,3.089796,0.995117,1.897106,0.575186,0.398842,0.097509,0.174663,0.844918,1.498137,0.24536,0.792876,50.434007,0.231411,0.34339,0.472319,0.472319,0.41842,0.081543,0.326456,0.297717,0.044356,0.077908,0.20578,0.307255,0.372098,0.402125,0.499282,0.004096,0.229521,0.327939,0.040204,0.384204,0.006476,0.448829,0.096259,0.088027,0.367426,0.227365,0.153907,0.131305,0.070825,0.007094,0.329266,0.327707,0.036839,0.181636,0.069223,0.432682,0.407631
min,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,18.0,16.0,8.0,4.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69.5,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,0.0,69.0,28.0,16.0,7.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.95,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,1.0,161.0,38.0,23.0,9.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,126.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
max,1.0,737.0,53.0,31.0,12.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,8.0,5.0,5400.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# How many bookings were canceled?
n_cancellations = df['is_canceled'].sum()
cancellation_rate = df['is_canceled'].mean()
print(f'{n_cancellations} bookings were cancelled, which is {cancellation_rate * 100} % of all bookings')

44199 bookings were cancelled, which is 37.0765875346028 % of all bookings


### Are the cancellation rates different during different times of the year?

In [7]:
# Calculate and plot cancellations every month
cancellations = df\
    .filter(['arrival_date_month', 'is_canceled'])\
    .groupby(by = 'arrival_date_month', as_index=False)\
    .sum()

# Create bar chart of cancellations per month
px.bar(cancellations, x= 'arrival_date_month', y= 'is_canceled')

In [8]:
# Calculate total bookings every month
total_bookings = df\
    .filter(['arrival_date_month', 'is_canceled'])\
    .groupby(by = 'arrival_date_month', as_index=False)\
    .count().rename(columns={'is_canceled' : 'total_bookings'})

# Show the table of total bookings per month
total_bookings

Unnamed: 0,arrival_date_month,total_bookings
0,1,5921
1,2,8052
2,3,9768
3,4,11078
4,5,11780
5,6,10929
6,7,12644
7,8,13861
8,9,10500
9,10,11147


In [9]:
# Calculate cancellation rates every month
merged = pd.merge(total_bookings, cancellations,  on ='arrival_date_month' )
merged['cancellation_rate'] = merged['is_canceled'] / merged['total_bookings']
 

# Create bar chart of cancellation rate every month
px.bar(merged, x='arrival_date_month', y='cancellation_rate')

There doesn't appear to be a clear connection between the time of year and the cancellation rate.
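The sum, count, and merge steps above can be collapsed into one step: because `is_canceled` only takes the values 0 and 1, its mean within a group is the cancellation rate of that group. Here is a minimal sketch on a small made-up frame (the `toy` values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'arrival_date_month': [1, 1, 1, 1, 2, 2],
    'is_canceled':        [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column within each group is the share of 1s,
# i.e. the cancellation rate for that month
rate_by_month = (
    toy.groupby('arrival_date_month', as_index=False)['is_canceled']
       .mean()
       .rename(columns={'is_canceled': 'cancellation_rate'})
)
print(rate_by_month)
```

On the real data, the same pattern should work with `df` in place of `toy`, and the result can be passed to `px.bar` as before.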

### Does the number of nights influence the cancellation rate?

In [10]:
# Prepare the data
df_sel = df.assign(stays = lambda x: x['stays_in_week_nights'] + x['stays_in_weekend_nights']).query('stays < 15')


In [11]:
# Attempt 1: create a histogram
px.histogram(df_sel, x='stays', color = 'is_canceled', barmode = 'group')

In [12]:
# Attempt 2: Calculate cancellation rate per number of nights stayed
total_bookings = df_sel\
    .filter(['stays', 'is_canceled'])\
    .groupby(by = 'stays', as_index=False)\
    .count()\
    .rename(columns = {'is_canceled': 'total_bookings'})
cancellations = df_sel\
    .filter(['stays', 'is_canceled'])\
    .groupby(by = 'stays', as_index=False)\
    .sum()
merged = pd.merge(total_bookings, cancellations, on='stays')
merged['ratio_canceled'] = merged['is_canceled'] / merged['total_bookings']

# Show on bar chart
px.bar(merged, x= 'stays', y = 'ratio_canceled')

In [13]:
# Attempt 3: Box plot
px.box(df_sel, y= 'stays', color = 'is_canceled')

There doesn't appear to be a clear connection between the number of nights booked and the cancellation rate.
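When reading a rate per length of stay, it helps to keep the number of bookings behind each rate in view, since long stays are rare and their rates are noisy. The sketch below computes both in one pass with named aggregation. The `toy` frame and the output column names `n_bookings` and `cancellation_rate` are assumptions made for illustration:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'stays':       [1, 1, 1, 1, 2, 2, 14],
    'is_canceled': [0, 0, 1, 0, 1, 0, 1],
})

# For each length of stay: how many bookings there are,
# and what share of them was canceled
summary = (
    toy.groupby('stays')['is_canceled']
       .agg(n_bookings='size', cancellation_rate='mean')
       .reset_index()
)
print(summary)
```

A rate based on a handful of bookings, like the single 14-night stay here, should carry little weight when judging whether there is a trend.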

### Relationship between daily rate and cancellation

In [14]:
# Box plot
df_sel2 = df.query('avg_daily_rate < 1000')
px.box(df_sel2, y= 'avg_daily_rate', color = 'is_canceled')

The daily rate doesn't appear to tell us much about whether a booking is canceled.

### Taking a more systematic approach

In [15]:
# Build correlation matrix (numeric columns only; the text column deposit_type is excluded)
df.select_dtypes('number').corr()


Unnamed: 0,is_canceled,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,required_car_parking_spaces,total_of_special_requests,avg_daily_rate,booked_by_company,booked_by_agent,hotel_City,hotel_Resort,meal_BB,meal_FB,meal_HB,meal_No_meal,market_segment_Aviation,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline_TA_TO,market_segment_Online_TA,market_segment_Undefined,distribution_channel_Corporate,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA_TO,distribution_channel_Undefined,reserved_room_type_A,reserved_room_type_B,reserved_room_type_C,reserved_room_type_D,reserved_room_type_E,reserved_room_type_F,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
is_canceled,1.0,0.292876,0.008315,-0.005948,0.011179,-0.001323,0.025542,0.058182,0.004862,-0.032569,-0.083745,0.110139,-0.057365,-0.195701,-0.234877,0.046492,-0.099692,0.102218,0.137082,-0.137082,0.012786,0.03879,-0.020085,-0.006571,-0.013755,-0.04033,-0.081645,-0.154366,0.22199,-0.028671,-0.006232,0.005336,-0.075589,-0.151583,-0.014928,0.176005,0.005755,0.069117,-0.008372,-0.007336,-0.047736,-0.038882,-0.021778,-0.001672,0.005436,-0.00055,-0.477957,0.481507,-0.011345,-0.02369,-0.038842,0.133235,-0.124271
lead_time,0.292876,1.0,0.127046,0.002306,0.131603,0.085985,0.166892,0.117575,-0.037886,-0.021003,-0.123209,0.086025,-0.073599,-0.116624,-0.095949,-0.065018,-0.125951,0.179563,0.07597,-0.07597,-0.039154,0.009646,0.136377,-0.097156,-0.041433,-0.066418,-0.165143,-0.174242,0.346418,0.146264,-0.186607,-0.003933,-0.1344,-0.161542,-0.031422,0.221546,-0.004915,0.104115,0.008561,-0.019798,-0.070515,-0.029329,-0.052407,-0.029744,-0.017132,-0.006911,-0.380173,0.38012,0.016564,0.068627,-0.031759,-0.17403,0.159538
arrival_date_week_number,0.008315,0.127046,1.0,0.066572,0.995101,0.018629,0.016047,0.026567,0.005559,0.010417,-0.031125,0.035493,-0.021009,0.00198,0.026202,0.076281,-0.025516,0.032239,0.001241,-0.001241,-0.000361,0.021425,0.038426,-0.047495,-0.006805,0.007846,-0.017845,-0.016997,0.00294,0.068223,-0.039955,0.001457,0.007821,-0.016003,-0.003981,0.00937,0.002017,0.00849,0.007617,0.005359,-0.010173,-0.007899,0.002586,-0.002047,0.005354,0.001393,-0.005903,0.007831,-0.016901,0.090342,0.011246,-0.079507,0.042228
arrival_date_day_of_month,-0.005948,0.002306,0.066572,1.0,-0.026335,-0.016225,-0.028362,-0.001754,0.014541,-0.000235,-0.006471,-0.027027,-0.000306,0.008569,0.003026,0.030291,-0.002831,-0.00392,-0.001678,0.001678,0.008278,-0.00257,-0.005223,-0.005203,-0.000984,-0.002985,-0.002046,0.004075,0.01186,-0.01106,-0.000997,-0.005504,-0.013528,0.011903,-0.001144,-0.001839,-0.00708,-0.017273,0.007001,0.001328,0.006459,0.01145,0.007613,0.004861,0.002873,-0.004012,0.005003,-0.008643,0.032171,-0.012178,-0.001704,-0.000426,0.006168
arrival_date_month,0.011179,0.131603,0.995101,-0.026335,1.0,0.018851,0.019739,0.029239,0.005483,0.010193,-0.031709,0.037473,-0.021745,0.000325,0.028086,0.079828,-0.026905,0.034048,0.00177,-0.00177,-0.002757,0.021823,0.040014,-0.045978,-0.006961,0.007879,-0.019697,-0.017274,0.002607,0.067672,-0.038325,0.001919,0.006841,-0.01727,-0.004291,0.011059,0.002616,0.008491,0.006274,0.006447,-0.00987,-0.008189,0.002405,-0.001879,0.005222,0.001794,-0.006471,0.008809,-0.02052,0.091687,0.011457,-0.079496,0.041581
stays_in_weekend_nights,-0.001323,0.085985,0.018629,-0.016225,0.018851,1.0,0.494175,0.094759,0.046134,0.018607,-0.086009,-0.012769,-0.042859,-0.01852,0.073124,0.05067,-0.108566,0.127345,-0.187816,0.187816,-0.065954,0.017596,0.105888,-0.028236,0.007819,-0.044784,-0.107029,-0.024834,-0.062126,0.06421,0.060289,-0.001758,-0.087222,-0.036917,-0.013612,0.085033,0.000475,-0.1498,0.003182,0.033126,0.094204,0.092265,0.017916,0.036697,0.012479,-0.005421,0.113828,-0.114571,0.001789,0.102708,-0.007566,0.020028,-0.06574
stays_in_week_nights,0.025542,0.166892,0.016047,-0.028362,0.019739,0.494175,1.0,0.096214,0.044651,0.020373,-0.095302,-0.013976,-0.048873,-0.024933,0.068738,0.066847,-0.082634,0.118685,-0.235955,0.235955,-0.061608,0.015435,0.122563,-0.052036,0.000866,-0.048948,-0.097443,-0.02773,-0.069501,0.09367,0.041168,-0.003237,-0.088431,-0.027395,-0.024014,0.07873,-0.000339,-0.178293,-0.003197,0.041002,0.113461,0.115166,0.015093,0.04131,0.016107,-0.005607,0.079174,-0.080321,0.006857,0.134339,-0.016898,0.007839,-0.065311
adults,0.058182,0.117575,0.026567,-0.001754,0.029239,0.094759,0.096214,1.0,0.029416,0.01789,-0.140973,-0.00707,-0.108856,0.014438,0.123353,0.224253,-0.235272,0.18568,-0.010571,0.010571,-0.042028,0.014193,0.048051,0.002489,-0.065403,-0.046706,-0.232132,0.011984,-0.046938,-0.032617,0.162611,0.004563,-0.225547,-0.006709,-0.053987,0.146052,0.003837,-0.210939,-0.048182,0.028178,0.176684,0.052652,0.037094,0.051264,0.105558,0.003792,0.030537,-0.03103,0.003091,0.020339,0.060427,0.091906,-0.116878
children,0.004862,-0.037886,0.005559,0.014541,0.005483,0.046134,0.044651,0.029416,1.0,0.023999,-0.032477,-0.024755,-0.021079,0.056245,0.081756,0.325058,-0.05446,0.045553,-0.044008,0.044008,0.037075,-0.001297,0.015636,-0.068896,-0.011594,-0.003981,-0.050755,0.06428,-0.113117,-0.097446,0.145802,,-0.058008,0.050268,-0.010505,-0.007203,0.006506,-0.279298,0.109534,0.259993,-0.069466,-0.007242,0.366779,0.392982,0.156345,-0.001851,0.097132,-0.096833,-0.006769,-0.018143,-0.0066,0.09622,-0.092929
babies,-0.032569,-0.021003,0.010417,-0.000235,0.010193,0.018607,0.020373,0.01789,0.023999,1.0,-0.008813,-0.007509,-0.006552,0.037389,0.097939,0.029043,-0.013338,-0.006864,-0.043386,0.043386,-0.009101,0.018618,0.017631,-0.011641,-0.003628,0.017894,-0.01089,0.052873,-0.03342,-0.003103,-0.00311,-0.000334,-0.011974,0.051335,-0.003288,-0.036311,-0.000529,-0.038069,0.004579,0.057259,0.006397,0.012147,0.012835,0.031676,0.005121,-0.000579,0.030677,-0.030484,-0.003012,-0.000197,0.000535,0.021613,-0.022945


In [16]:
# Plot the correlation matrix of the numeric columns as a heat map
px.imshow(df.select_dtypes('number').corr(), width = 900, height= 900)

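Looking through the correlation matrix and heat map, we can see that `is_canceled` correlates most strongly with the deposit columns (about 0.48 with `deposit_type_Non_Refund` and about -0.48 with `deposit_type_No_Deposit`) and, more moderately, with `lead_time` (about 0.29). Let's dig deeper into these.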
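Scanning a large matrix by eye is error-prone, so one option is to pull out the `is_canceled` column of the matrix and sort it by absolute value. This is a minimal sketch on a made-up frame; the `toy` values are hypothetical, and only the column names mirror the dataset:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'is_canceled':  [0, 0, 1, 1, 0, 1],
    'lead_time':    [5, 10, 200, 150, 20, 300],
    'babies':       [0, 1, 0, 0, 1, 0],
    'deposit_type': ['no deposit made'] * 6,  # text column, dropped below
})

# Correlate numeric columns only, keep the column for the target,
# remove the target's correlation with itself, and rank by strength
corr_with_target = (
    toy.select_dtypes('number')
       .corr()['is_canceled']
       .drop('is_canceled')
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top, next to strong positive ones.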

In [19]:
# Boxplot of lead time vs cancellations
px.box(df_sel2, y='lead_time', color='is_canceled')

The box plot shows a clear pattern. For bookings that were canceled, the typical time between booking and arrival is about 113 days. For bookings that weren't canceled, it is much shorter, around 45 days.

This suggests that the longer the gap between making a booking and arriving, the more likely guests are to change their plans and cancel. In other words, a longer "lead time" is associated with a higher chance of cancellation, although this analysis alone doesn't show that it causes it.
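The comparison read off the box plot can also be checked numerically. The sketch below computes the median lead time per outcome and the cancellation rate within lead-time bands. The `toy` values and the band edges are assumptions chosen for illustration; 737 is the maximum `lead_time` reported by `df.describe()` above:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'lead_time':   [3, 10, 25, 40, 90, 120, 200, 365],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Median lead time for kept (0) and canceled (1) bookings
median_lead = toy.groupby('is_canceled')['lead_time'].median()

# Cancellation rate within lead-time bands (band edges are arbitrary)
bands = pd.cut(toy['lead_time'], bins=[0, 30, 180, 737], include_lowest=True)
rate_by_band = toy.groupby(bands, observed=True)['is_canceled'].mean()

print(median_lead)
print(rate_by_band)
```

If the rate rises steadily from one band to the next on the real data, that would support the reading of the box plot.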

In [28]:
# Frequency table of cancellation vs deposit_type
freqtable = pd.crosstab(df['is_canceled'], df['deposit_type'], normalize= True)
print(freqtable)

deposit_type  full deposit made  no deposit made  partial deposit made
is_canceled                                                           
0                      0.000780         0.627397              0.001057
1                      0.121584         0.248880              0.000302


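Interestingly, the data reveals a somewhat unexpected trend: bookings with a full upfront deposit are the ones canceled most often. The table is normalized over all bookings, so the comparison has to be made within each column. Full-deposit bookings make up about 12.2% of all bookings, and roughly 99% of them were canceled, compared with about 28% of bookings with no deposit.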
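The within-type cancellation rates follow from the table above by dividing each cell by its column total. This sketch re-enters the printed proportions by hand (the variable names `joint`, `within_type`, and `cancel_rate` are illustrative) rather than reading the CSV:

```python
import pandas as pd

# Joint proportions copied from the crosstab output above
joint = pd.DataFrame(
    {
        'full deposit made':    [0.000780, 0.121584],
        'no deposit made':      [0.627397, 0.248880],
        'partial deposit made': [0.001057, 0.000302],
    },
    index=pd.Index([0, 1], name='is_canceled'),
)

# Divide each column by its total to get shares within each deposit type
within_type = joint / joint.sum(axis=0)

# Row 1 holds the share of canceled bookings per deposit type
cancel_rate = within_type.loc[1]
print(cancel_rate.round(3))
```

On the full dataset, passing `normalize='columns'` to `pd.crosstab` should give the same within-column shares directly.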

## Appendix 1: Citation

[Source](https://www.kaggle.com/jessemostipak/hotel-booking-demand/) and [license](https://creativecommons.org/licenses/by/4.0/) of data. The data is originally from an article called [Hotel booking demand datasets](https://www.sciencedirect.com/science/article/pii/S2352340918315191) by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was cleaned by Thomas Mock and Antoine Bichat for [#TidyTuesday during the week of February 11th, 2020](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

## Appendix 2: Data Dictionary

_Note: For binary variables: `1` = true and `0` = false._

| Column | Explanation |
|--------|-------------|
| is_canceled | Binary variable indicating whether a booking was canceled |
| lead_time | Number of days between booking date and arrival date |
| arrival_date_week_number, arrival_date_day_of_month, arrival_date_month | Week number, day of the month, and month number of arrival date |
| stays_in_weekend_nights, stays_in_week_nights | Number of weekend nights (Saturday and Sunday) and weeknights (Monday to Friday) the customer booked |
| adults, children, babies | Number of adults, children, babies booked for the stay |
| is_repeated_guest | Binary variable indicating whether the customer was a repeat guest |
| previous_cancellations | Number of prior bookings that were canceled by the customer |
| previous_bookings_not_canceled | Number of prior bookings that were not canceled by the customer |
| required_car_parking_spaces | Number of parking spaces requested by the customer |
| total_of_special_requests | Number of special requests made by the customer |
| avg_daily_rate | Average daily rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| booked_by_company | Binary variable indicating whether a company booked the booking |
| booked_by_agent | Binary variable indicating whether an agent booked the booking |
| hotel_City | Binary variable indicating whether the booked hotel is a "City Hotel" |
| hotel_Resort | Binary variable indicating whether the booked hotel is a "Resort Hotel" |
| meal_BB | Binary variable indicating whether a bed & breakfast meal was booked |
| meal_HB | Binary variable indicating whether a half board meal was booked |
| meal_FB | Binary variable indicating whether a full board meal was booked |
| meal_No_meal | Binary variable indicating whether there was no meal package booked |
| market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_Undefined | Indicates market segment designation with a value of 1. "TA" = travel agent, "TO" = tour operators |
| distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_Undefined | Indicates booking distribution channel with a value of 1. "TA" = travel agent, "TO" = tour operators, "GDS" = Global Distribution System |
| reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_L | Indicates code of room type reserved with a value of 1. Code is presented instead of designation for anonymity reasons |
| deposit_type_No_Deposit | Binary variable indicating whether no deposit was made |
| deposit_type_Non_Refund | Binary variable indicating whether a deposit was made in the value of the total stay cost |
| deposit_type_Refundable | Binary variable indicating whether a deposit was made with a value under the total stay cost |
| deposit_type | Text label for the deposit made: "no deposit made", "full deposit made", or "partial deposit made" |
| customer_type_Contract | Binary variable indicating whether the booking has an allotment or other type of contract associated with it |
| customer_type_Group | Binary variable indicating whether the booking is associated with a group |
| customer_type_Transient | Binary variable indicating whether the booking is not part of a group or contract, and is not associated with another transient booking |
| customer_type_Transient-Party | Binary variable indicating whether the booking is transient, but is associated with at least one other transient booking |