# Advanced Python for Data Science

## Assignment 5 - Machine Learning

In this assignment you will use Pandas to prepare a dataset for machine learning classification. You will then build a decision tree model, separating the data into training and validation sets, and calculate the accuracy of the model.

Some helpful resources:

- [Pandas Website](https://pandas.pydata.org/)
- [Kaggle Hotel Booking Cancellation Dataset](https://www.kaggle.com/vinayakashastri/hotel-booking-cancellation-dataset?select=hotel_bookings.csv)

## Tasks

1. Download and load `hotel_bookings.csv` from Kaggle
2. Use Pandas to explore and preprocess the data. Identify columns which need preprocessing (convert categorical data to numeric values, fill empty values, drop columns which are not likely to help with classification)
3. Split the data into training and validation datasets
4. Train a Decision Tree classifier using the training data
5. Predict classifications for the validation data
6. Score the model's accuracy

While not required, extra credit will be available to those who work to tune their model to improve its accuracy.

As you build out this notebook create lots of cells -- small snippets of code followed or prefaced by comments in "markdown" cells.   A complete solution will be well structured and well documented.

## Rubric

The rubric for this assignment:

- 40 Methodical approach to preprocessing data is evident
- 30 Decision tree successfully trained and validated
- 20 Approach is well structured
- 10 Notebook is well documented
- 10 Extra credit points for models which achieve validation accuracy above 85%


In [1]:
# Import our typical libraries

import numpy as np
import pandas as pd


## About the Data

`hotel_bookings.csv` contains 119390 records representing individual reservations at a collection of hotels.  Each record is classified to show if the reservation was cancelled or not.   Your challenge is to create a machine learning model which can accurately predict whether or not a reservation will be cancelled based on the available data.

One wrinkle in the data: there are two columns which represent whether or not a reservation was cancelled. They are **'is_canceled'** and **'reservation_status'**, and if you use one of them as the target and feed the other to the model as an attribute, the model will easily achieve 100% accuracy. So to properly create the model you should drop one, and I recommend dropping **'reservation_status'**.
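
A quick way to see why the two columns are redundant is to cross-tabulate them. The sketch below uses a small made-up DataFrame rather than the real file; the status labels `Canceled`, `No-Show`, and `Check-Out` are assumed here for illustration and should be checked against the actual data.

```python
import pandas as pd

# Toy stand-in for hotel_bookings.csv (values are illustrative assumptions)
toy = pd.DataFrame({
    'is_canceled':        [0, 1, 1, 0, 1, 0],
    'reservation_status': ['Check-Out', 'Canceled', 'No-Show',
                           'Check-Out', 'Canceled', 'Check-Out'],
})

# Count how often each status value occurs with each is_canceled value
leak_table = pd.crosstab(toy['reservation_status'], toy['is_canceled'])
print(leak_table)

# True when every status value maps to exactly one is_canceled value,
# meaning the status column alone is enough to recover the label
fully_determined = bool(((leak_table > 0).sum(axis=1) == 1).all())
print(fully_determined)
```

If the same check comes back `True` on the real data, leaving both columns in place would let the model read the answer directly from its inputs.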


## 1. Download and load hotel_bookings.csv

In [2]:
df = pd.read_csv('hotel_bookings.csv', low_memory=False)

## 2. Use Pandas to explore and preprocess the data


### Deal with Null Values

Here we remove a few attributes that either have too many null values or don't provide a substantial benefit to the model.

In [3]:
# Removed because it gives nearly the same information as 'is_canceled'.
df.drop('reservation_status', axis=1, inplace=True)
# Very few company entries were filled in, so we drop the column.
df.drop('company', axis=1, inplace=True)
# Enough agent values are missing that we drop this column too. The number is
# effectively a categorical ID, so a median or mean does not make sense, and
# there are too many gaps to fill randomly.
df.drop('agent', axis=1, inplace=True)
# This only gives the date of the latest status change for the reservation
# (placing it or canceling it), so it may not help, and I am concerned it
# could contribute to overfitting.
df.drop('reservation_status_date', axis=1, inplace=True)
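
Before dropping a column it helps to quantify how much of it is missing. A minimal sketch on made-up data follows; the 50% cutoff is an arbitrary assumption for illustration, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names but invented values
toy = pd.DataFrame({
    'company':   [np.nan, np.nan, np.nan, 40.0, np.nan],
    'agent':     [9.0, np.nan, 240.0, np.nan, 9.0],
    'lead_time': [342, 737, 7, 13, 14],
})

# Fraction of missing values per column, largest first
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Columns whose missing share exceeds the (assumed) cutoff
cutoff = 0.5
mostly_empty = null_fraction[null_fraction > cutoff].index.tolist()
print(mostly_empty)
```

Running `df.isnull().mean()` on the real DataFrame before the drops would give the numbers behind "very few" and "enough" in the comments above.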

Here we fill the NaN values in the country attribute with the most common value, 'PRT'. Missing countries are a small share of the rows, so this should hopefully not distort the column much.

In [4]:
df['country'] = df['country'].fillna('PRT')
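
The fill value is typed in by hand above. A small sketch, on made-up data, of looking up the most common value with `Series.mode()` so that it is derived from the column instead:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'country': ['PRT', 'GBR', np.nan, 'PRT', 'FRA', np.nan]})

# mode() ignores NaN by default and returns a Series (ties give several
# entries), so take the first one
most_common = toy['country'].mode()[0]
toy['country'] = toy['country'].fillna(most_common)

print(most_common)
print(toy['country'].tolist())
```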

Since there are only 4 observations with null values for children we're going to fill them randomly based on what options were picked by other observations (0 to 2 since they are by far the most prominent). 4 observations shouldn't hurt things too much.

In [5]:
df['children'] = df['children'].fillna(-1)
np.random.seed(123456)                    
df.loc[df['children'] == -1,'children'] = df['children'].apply(lambda x: np.random.randint(0,3))
df['children'].value_counts()              

0.0     110798
1.0       4861
2.0       3654
3.0         76
10.0         1
Name: children, dtype: int64
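
The cell above draws a random number for every row and keeps only the ones that line up with the placeholder rows. An alternative sketch, shown on made-up data, draws exactly as many values as there are gaps. It uses NumPy's `default_rng` generator, so it would not reproduce the same draws as the original cell.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'children': [0.0, np.nan, 2.0, np.nan, 1.0, 0.0]})

rng = np.random.default_rng(123456)   # seeded for repeatability
missing = toy['children'].isnull()    # boolean mask of the gaps

# integers(0, 3) draws from {0, 1, 2}; the upper bound is exclusive
toy.loc[missing, 'children'] = rng.integers(0, 3, size=int(missing.sum()))
print(toy['children'].tolist())
```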

Null values are dealt with.

In [6]:
df.isnull().sum()

hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
dtype: int64

### Convert non-numeric (categorical) data to numeric values


Most of these are fairly straightforward reassignments from string to numerical.

In [7]:
df['arrival_date_month']=df['arrival_date_month'].map({'January': 1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 
                                                       'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 
                                                       'December':12})
df['hotel'] = df['hotel'].map({'City Hotel': 0, 'Resort Hotel': 1})
df['assigned_room_type'] = df['assigned_room_type'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 
                                                         'I': 9, 'K': 11, 'L': 12, 'P': 16})
df['deposit_type'] = df['deposit_type'].map({'No Deposit': 0, 'Non Refund': 1, 'Refundable': 2})
df['customer_type'] = df['customer_type'].map({'Transient': 0, 'Transient-Party': 1, 'Contract': 2, 'Group': 3})
df['meal'] = df['meal'].map({'BB': 0, 'HB': 1, 'SC': 2, 'Undefined': 3, 'FB': 4})
df['market_segment'] = df['market_segment'].map({'Online TA': 0, 'Offline TA/TO': 1, 'Groups': 2, 'Direct': 3, 'Corporate': 4, 'Complementary': 5, 'Aviation': 6, 'Undefined': 7})
df['distribution_channel'] = df['distribution_channel'].map({'TA/TO': 0, 'Direct': 1, 'Corporate': 2, 'GDS': 3, 'Undefined': 4})
df['reserved_room_type'] = df['reserved_room_type'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 
                                                        'L': 12, 'P': 16})
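
`Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN` without raising an error, so it is worth checking each mapped column afterwards. A minimal sketch on made-up data, where `'XX'` is a deliberately unmapped value:

```python
import pandas as pd

meal = pd.Series(['BB', 'HB', 'SC', 'XX', 'BB'])
meal_codes = {'BB': 0, 'HB': 1, 'SC': 2, 'Undefined': 3, 'FB': 4}

mapped = meal.map(meal_codes)

# Values that were present before mapping but became NaN afterwards
unmapped = sorted(meal[mapped.isnull()].unique())
print(unmapped)
```

On the real DataFrame, re-running `df.isnull().sum()` after the mapping cell serves the same purpose: any non-zero count points to a category that the dictionary missed.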

There were too many countries to map by hand, so I build the mapping programmatically: each unique country code is assigned an integer in order of first appearance.

In [8]:
# unique() is computed once; enumerate pairs each country with its integer code
country_dict = {country: code for code, country in enumerate(df['country'].unique())}
df['country'] = df['country'].map(country_dict)
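
pandas also offers `pd.factorize`, which assigns integer codes in order of first appearance, the same numbering the dictionary approach produces. A brief sketch on made-up data:

```python
import pandas as pd

country = pd.Series(['PRT', 'GBR', 'PRT', 'USA', 'GBR'])

# codes: one integer label per row; uniques: the value each label stands for
codes, uniques = pd.factorize(country)
print(codes.tolist())
print(list(uniques))
```

Either way the integers are arbitrary labels, so the tree sees an ordering between countries that has no real meaning.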

## 3. Split the data into training and validation datasets


I used the same code from the lecture, swapping out the variables used in class with mine.

In [9]:
from sklearn.model_selection import train_test_split

x = df.drop('is_canceled', axis=1)
y = df.is_canceled

x_train, x_validate, y_train, y_validate = train_test_split(x, y, test_size=0.2, random_state=123456)
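
To see what `test_size=0.2` does, the sketch below splits a small made-up dataset and prints the resulting sizes. The `stratify=y` argument is an optional addition that the cell above does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up rows with a 70/30 class balance
x = pd.DataFrame({'lead_time': np.arange(100)})
y = pd.Series([0] * 70 + [1] * 30)

x_train, x_validate, y_train, y_validate = train_test_split(
    x, y, test_size=0.2, random_state=123456, stratify=y)

# Sizes of the two parts, then the share of class 1 in each
print(len(x_train), len(x_validate))
print(y_train.mean(), y_validate.mean())
```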

## 4. Train a Decision Tree classifier using the training data

Much like the last cell this was taken from the lecture. The seed was set to keep everything repeatable for checking work.

In [10]:
from sklearn.tree import DecisionTreeClassifier

HotelDT = DecisionTreeClassifier(random_state=123456)
HotelDT.fit(x_train, y_train)

DecisionTreeClassifier(random_state=123456)

## 5. Predict classifications for the validation data

In [11]:
# Predict cancellation labels for the validation set
y_predict = HotelDT.predict(x_validate)

## 6. Score the model's accuracy

It's funny that the validation accuracy of 84.66% is merely 0.34 percentage points short of the 85% extra-credit threshold. I considered taking out less relevant attributes but I didn't see any I felt comfortable removing.

In [12]:
from sklearn.metrics import classification_report


print(f'Overall accuracy against training data is {HotelDT.score(x_train, y_train):5.2%}')
print(f'Overall accuracy against validation data is {HotelDT.score(x_validate, y_validate):5.2%}')
print(classification_report(y_validate, y_predict))


Overall accuracy against training data is 99.62%
Overall accuracy against validation data is 84.66%
              precision    recall  f1-score   support

           0       0.88      0.87      0.88     14973
           1       0.79      0.81      0.80      8905

    accuracy                           0.85     23878
   macro avg       0.84      0.84      0.84     23878
weighted avg       0.85      0.85      0.85     23878
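
The gap between 99.62% on the training data and 84.66% on the validation data suggests the unconstrained tree has memorized much of the training set. The per-class figures in the report are derived from a confusion matrix; a minimal sketch with made-up labels shows how accuracy, precision, and recall for class 1 relate to it:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = canceled)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted cancellations, how many were real
recall = tp / (tp + fn)      # of the real cancellations, how many were found
print(accuracy, precision, recall)
```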



## 7. Extra Credit

For those of you with an interest in machine learning, here's a chance to explore ways to tune a decision tree model for improved performance. You can use various strategies: limit the depth of the tree, or require a minimum number of samples for a split. You can even create a *Random Forest*, a collection of small decision trees which work together to classify data.

If you can create a classifier with accuracy above 85% you have done well, and you will be awarded 10 extra credit points for your effort.
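
A hedged sketch of these strategies, run on synthetic data from `make_classification` rather than the hotel dataset. The parameter values (`max_depth=5`, `min_samples_split=20`, `n_estimators=50`) are arbitrary starting points, not recommended settings, and the scores it prints say nothing about what these settings would achieve on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 80/20 like the notebook
x, y = make_classification(n_samples=1000, n_features=10, random_state=123456)
x_train, x_validate, y_train, y_validate = train_test_split(
    x, y, test_size=0.2, random_state=123456)

models = {
    'unconstrained tree': DecisionTreeClassifier(random_state=123456),
    'limited tree': DecisionTreeClassifier(
        max_depth=5, min_samples_split=20, random_state=123456),
    'random forest': RandomForestClassifier(
        n_estimators=50, random_state=123456),
}

# Fit each model and record its accuracy on the validation part
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_validate, y_validate)
    print(f'{name}: {scores[name]:5.2%}')
```

To try this on the notebook's data, the same `models` dictionary can be fitted on the existing `x_train` and `y_train` and scored on `x_validate` and `y_validate`.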