<h1 style="text-align:center;">Cameron Hicks - Capstone Project Two</h1>
<h1 style="text-align:center;">Hotel Reservation Cancellation Risk Model</h1>
<h2 style="text-align:center;">Preprocessing</h2>

In this notebook I complete the preprocessing steps for the Hotel Reservation Cancellation Risk Model capstone. After importing the data, I drop two columns that would leak the target into the model and build lists of the categorical and numeric columns. Using those lists, I create dummy variables for the categorical data and scale the numeric data. Finally, I split the data into training and test sets for modeling.

## Import Packages and Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Import the cleaned data from the EDA step.

In [2]:
df = pd.read_csv("../Data/hotel_weather_cleaned.csv")

## Inspect Data
In this step I inspect the data before preprocessing and make any changes that were not made in previous steps.

In [3]:
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,reservation_status,reservation_status_date,hotel_city,arrival_date,forecast_date,temp_max,temp_min,precip_sum,weathercode,weather_description
0,Resort Hotel,0,342,2015,JULY,27,1,0,0,2,...,Check-Out,2015-07-01,"Faro, Portugal",2015-07-01,2015-05-27,26.755,19.305,0.0,0.0,Clear sky
1,Resort Hotel,0,737,2015,JULY,27,1,0,0,2,...,Check-Out,2015-07-01,"Faro, Portugal",2015-07-01,2015-05-27,26.755,19.305,0.0,0.0,Clear sky
2,Resort Hotel,0,7,2015,JULY,27,1,0,1,1,...,Check-Out,2015-07-02,"Faro, Portugal",2015-07-01,2015-05-27,26.755,19.305,0.0,0.0,Clear sky
3,Resort Hotel,0,13,2015,JULY,27,1,0,1,1,...,Check-Out,2015-07-02,"Faro, Portugal",2015-07-01,2015-05-27,26.755,19.305,0.0,0.0,Clear sky
4,Resort Hotel,0,14,2015,JULY,27,1,0,2,2,...,Check-Out,2015-07-03,"Faro, Portugal",2015-07-01,2015-05-27,26.755,19.305,0.0,0.0,Clear sky


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87396 entries, 0 to 87395
Data columns (total 40 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   hotel                           87396 non-null  object 
 1   is_canceled                     87396 non-null  int64  
 2   lead_time                       87396 non-null  int64  
 3   arrival_date_year               87396 non-null  int64  
 4   arrival_date_month              87396 non-null  object 
 5   arrival_date_week_number        87396 non-null  int64  
 6   arrival_date_day_of_month       87396 non-null  int64  
 7   stays_in_weekend_nights         87396 non-null  int64  
 8   stays_in_week_nights            87396 non-null  int64  
 9   adults                          87396 non-null  int64  
 10  children                        87396 non-null  int64  
 11  babies                          87396 non-null  int64  
 12  meal                            

The columns "reservation_status" and "reservation_status_date" record the outcome of each booking, so keeping them would leak the target into the machine learning model. It is best to drop those columns.

In [6]:
df = df.drop(columns=['reservation_status', 'reservation_status_date'], errors='ignore')
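One way to confirm that a column leaks the target is to cross-tabulate it against `is_canceled`. The sketch below uses a small made-up frame rather than the real data. The status values other than "Check-Out" are assumptions for illustration, and `toy`, `leak_table`, and `is_leaky` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the bookings table (values are invented for illustration).
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "reservation_status": [
        "Check-Out", "Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled",
    ],
})

# Cross-tabulate the candidate feature against the target.
leak_table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(leak_table)

# If every status value maps to exactly one target value,
# the column determines the target and should not be used as a feature.
targets_per_status = toy.groupby("reservation_status")["is_canceled"].nunique()
is_leaky = bool((targets_per_status == 1).all())
print(is_leaky)
```

Running the same two checks on `df` before the drop would show whether the real column behaves this way.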

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87396 entries, 0 to 87395
Data columns (total 38 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   hotel                           87396 non-null  object 
 1   is_canceled                     87396 non-null  int64  
 2   lead_time                       87396 non-null  int64  
 3   arrival_date_year               87396 non-null  int64  
 4   arrival_date_month              87396 non-null  object 
 5   arrival_date_week_number        87396 non-null  int64  
 6   arrival_date_day_of_month       87396 non-null  int64  
 7   stays_in_weekend_nights         87396 non-null  int64  
 8   stays_in_week_nights            87396 non-null  int64  
 9   adults                          87396 non-null  int64  
 10  children                        87396 non-null  int64  
 11  babies                          87396 non-null  int64  
 12  meal                            

### Separate Categorical and Numeric Columns

In [10]:
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [9]:
categorical_cols

['hotel',
 'arrival_date_month',
 'meal',
 'country',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'assigned_room_type',
 'deposit_type',
 'customer_type',
 'hotel_city',
 'arrival_date',
 'forecast_date',
 'weather_description']

In [11]:
numeric_cols

['is_canceled',
 'lead_time',
 'arrival_date_year',
 'arrival_date_week_number',
 'arrival_date_day_of_month',
 'stays_in_weekend_nights',
 'stays_in_week_nights',
 'adults',
 'children',
 'babies',
 'is_repeated_guest',
 'previous_cancellations',
 'previous_bookings_not_canceled',
 'booking_changes',
 'agent',
 'company',
 'days_in_waiting_list',
 'adr',
 'required_car_parking_spaces',
 'total_of_special_requests',
 'temp_max',
 'temp_min',
 'precip_sum',
 'weathercode']

The target variable 'is_canceled' needs to be removed from the numeric list so it is not scaled.

In [12]:
numeric_cols = [col for col in numeric_cols if col != 'is_canceled']

In [13]:
numeric_cols

['lead_time',
 'arrival_date_year',
 'arrival_date_week_number',
 'arrival_date_day_of_month',
 'stays_in_weekend_nights',
 'stays_in_week_nights',
 'adults',
 'children',
 'babies',
 'is_repeated_guest',
 'previous_cancellations',
 'previous_bookings_not_canceled',
 'booking_changes',
 'agent',
 'company',
 'days_in_waiting_list',
 'adr',
 'required_car_parking_spaces',
 'total_of_special_requests',
 'temp_max',
 'temp_min',
 'precip_sum',
 'weathercode']
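The dtype-based selection above can be sketched on a small frame. This version is an alternative, not the cell used in this notebook: it selects with the generic `"number"` alias, which also catches integer and float widths other than 64-bit, and removes the target in the same step. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one text column, the target, and two numeric features.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "is_canceled": [0, 1],
    "lead_time": [342, 7],
    "adr": [0.0, 75.5],
})

# Everything that is not numeric is treated as categorical here.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# All numeric columns, minus the target so it is never scaled.
num_cols = toy.select_dtypes(include="number").columns.drop("is_canceled").tolist()

print(cat_cols)
print(num_cols)
```

Note that `exclude="number"` would also pick up boolean or datetime columns, so this shortcut assumes the frame only holds text and numeric data at this point.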

## Create Dummy Features for Categorical Variables

In [16]:
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df_encoded.head()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,...,forecast_date_2017-08-31,weather_description_Dense drizzle,weather_description_Heavy rain,weather_description_Light drizzle,weather_description_Mainly clear,weather_description_Moderate drizzle,weather_description_Moderate rain,weather_description_Overcast,weather_description_Partly cloudy,weather_description_Slight rain
0,0,342,2015,27,1,0,0,2,0,0,...,False,False,False,False,False,False,False,False,False,False
1,0,737,2015,27,1,0,0,2,0,0,...,False,False,False,False,False,False,False,False,False,False
2,0,7,2015,27,1,0,1,1,0,0,...,False,False,False,False,False,False,False,False,False,False
3,0,13,2015,27,1,0,1,1,0,0,...,False,False,False,False,False,False,False,False,False,False
4,0,14,2015,27,1,0,2,2,0,0,...,False,False,False,False,False,False,False,False,False,False
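Because `pd.get_dummies` creates one column per category level (minus one with `drop_first=True`), it can help to count the levels of each categorical column first. Date-like columns such as `arrival_date` and `forecast_date` can produce a very wide frame. The sketch below shows the mechanics on a made-up frame. The `meal` values are illustrative assumptions, and `toy` and `encoded` are hypothetical names.

```python
import pandas as pd

# Toy frame: one categorical column with three levels, one numeric column.
toy = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC"],
    "lead_time": [342, 737, 7, 13],
})

# Number of distinct levels per categorical column.
print(toy[["meal"]].nunique())

# drop_first=True drops the first level (in sorted order) as the baseline,
# so three levels become two dummy columns.
encoded = pd.get_dummies(toy, columns=["meal"], drop_first=True)
print(encoded.columns.tolist())
```

On the real data, `df[categorical_cols].nunique()` gives the same count per column before encoding.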


## Scale Numeric Data with StandardScaler

In [18]:
scaler = StandardScaler()

df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
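As a quick illustration of what `StandardScaler` does, the sketch below compares its output with the manual formula `(x - mean) / std` on a handful of lead-time values. The array `values` is only an example input, and `manual` is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single numeric feature as a 2D column, as scikit-learn expects.
values = np.array([[342.0], [737.0], [7.0], [13.0], [14.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Manual z-score. StandardScaler uses the population standard deviation
# (ddof=0), which is also NumPy's default for .std().
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and unit variance, which keeps large-valued features such as `lead_time` from dominating distance-based or regularized models.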

## Train Test Split

In [19]:
X = df_encoded.drop('is_canceled', axis=1)
y = df_encoded['is_canceled']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
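In this notebook the scaler is fit on the full dataset before the split, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data. It is not the workflow used above, and all names ending in `_toy`, `_tr`, or `_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a 70/30 target.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 400, size=100),
    "adr": rng.uniform(0, 200, size=100),
    "is_canceled": [0] * 70 + [1] * 30,
})
X_toy = toy.drop(columns="is_canceled")
y_toy = toy["is_canceled"]

# Split first; stratify keeps the cancellation rate equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

# Cancellation rate in each part.
print(y_tr.mean(), y_te.mean())
```

With this ordering the test set stays unseen until evaluation, at the cost of having to keep the fitted scaler around for later use.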

## Save Data

In [22]:
hotel_weather_preprocessed_df = pd.concat([X, y], axis=1)
hotel_weather_preprocessed_df.to_csv('hotel_weather_preprocessed_df.csv', index=False)