### Lab -- Data Prep & Gradient Boosting

Welcome to today's lab!  Today we're going to shift our attention to a more demanding dataset -- the restaurants data.  With a quarter million rows, dates, and categorical columns, it is a more interesting and realistic use case for boosting.

The point of today's lab will be to experiment with different encoding methods and model parameters.

**Step 1:**  Load in your dataset

In [4]:
# your code here
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import datetime as dt
df = pd.read_csv('/Users/harleyhoffmann/dat-02-22/ClassMaterial/Unit2/data/master.csv', parse_dates=['visit_date', 'calendar_date'])
df['reserve_visitors'] = df['reserve_visitors'].fillna(0)
df

index,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,0.0
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


**Step 2:** Create a training and test set.

Make the test set the **last 15 observations for each restaurant**.

Split the training and test sets into `X_train, y_train` and `X_test, y_test`, respectively.

**Hint:**  This harkens back to our grouping lab -- check this if you forget how to do it.

In [5]:
#extract time pieces from the calendar date column
df['month'] = pd.DatetimeIndex(df['calendar_date']).month
df['year'] = pd.DatetimeIndex(df['calendar_date']).year
df['day_of_month'] = pd.DatetimeIndex(df['calendar_date']).day

In [6]:
# your code here
df.sort_values(by=['id','visit_date'], ascending=True, inplace=True)

train = df.groupby('id').apply(lambda x: x.iloc[:-15])
test = df.groupby('id').apply(lambda x: x.iloc[-15:])

#we're dropping the dates because we moved them into new columns
train.drop('visit_date', axis=1, inplace=True)
test.drop('visit_date', axis=1, inplace=True)

train.drop('calendar_date', axis=1, inplace=True)
test.drop('calendar_date', axis=1, inplace=True)

#we're predicting visitors so that becomes the y variable
X_train, y_train = train.drop('visitors', axis=1), train['visitors']
X_test, y_test   = test.drop('visitors', axis=1), test['visitors']
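
The per-restaurant split can also be written with `groupby().tail()`, which keeps the original flat index instead of the nested one that `apply` produces. This is a minimal sketch on a made-up toy frame; the values and the `n_test` name are assumptions for illustration, and the lab itself holds out 15 rows.

```python
import pandas as pd

# Toy data: two restaurants with 4 and 3 visits (illustrative values only).
toy = pd.DataFrame({
    "id": ["a", "a", "a", "a", "b", "b", "b"],
    "visit_date": pd.to_datetime([
        "2016-01-03", "2016-01-01", "2016-01-02", "2016-01-04",
        "2016-01-02", "2016-01-01", "2016-01-03",
    ]),
    "visitors": [30, 10, 20, 40, 6, 5, 7],
})

n_test = 2  # hypothetical hold-out size; the lab uses 15

# Sort so that "last" means "most recent" within each restaurant.
toy = toy.sort_values(["id", "visit_date"])

# tail() keeps the last n_test rows of every group; the rest is training data.
toy_test = toy.groupby("id").tail(n_test)
toy_train = toy.drop(toy_test.index)

print(toy_train["visitors"].tolist())
print(toy_test["visitors"].tolist())
```

One thing to watch with either approach: a restaurant with `n_test` rows or fewer ends up entirely in the test set.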

**Step 3:** Experiment with different encoding methods

Let's do a quick check to see how different encoding methods work out of the box on our dataset.

You're going to repeat the same process for each of `OrdinalEncoder`, `TargetEncoder`, and `OneHotEncoder` and see which one gives you the best results on our data.

**3a:** Use an `OrdinalEncoder` to transform your training set with the `fit_transform` method.  Then use the `transform` method to transform your test set.  

**Important:** The test set is transformed according to the values learned from your training set, not its own values.  

If you are confused about how the transformation is happening, try inspecting the `mapping` attribute of your fitted category encoder to get the hang of what's going on.

In [7]:
# your code here
import category_encoders as ce

In [8]:
#ordinal encoder - numbers
ore = ce.OrdinalEncoder()
X_train_ore = ore.fit_transform(X_train)
X_test_ore  = ore.transform(X_test)
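
To make "fit on train, transform test" concrete, here is a library-free sketch of the idea behind ordinal encoding. It is not the `category_encoders` implementation; the `Cafe` and `Karaoke` categories and the `UNSEEN` code of `-1` are assumptions chosen for this illustration.

```python
# Simplified illustration: learn the mapping from training data only.
train_genres = ["Dining bar", "Italian/French", "Dining bar", "Cafe"]
test_genres = ["Italian/French", "Karaoke"]  # "Karaoke" never appears in training

# "Fit": assign integers in order of first appearance in the training data.
mapping = {}
for genre in train_genres:
    if genre not in mapping:
        mapping[genre] = len(mapping) + 1

UNSEEN = -1  # placeholder code for categories missing from the mapping (assumption)

# "Transform": both sets are encoded with the mapping learned from training.
train_encoded = [mapping[g] for g in train_genres]
test_encoded = [mapping.get(g, UNSEEN) for g in test_genres]

print(mapping)
print(train_encoded)
print(test_encoded)
```

The integers carry no meaning beyond identity, so a tree has to carve out individual categories with repeated threshold splits.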

**3b:** Initialize a `GradientBoostingRegressor` with the default parameters, fit it on your training set, and score it on your test set.

In [9]:
# your code here
gbm = GradientBoostingRegressor()
gbm.fit(X_train_ore, y_train)
gbm.score(X_train_ore, y_train), gbm.score(X_test_ore, y_test)

(0.17427166552896078, 0.15682702382071445)
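
For a regressor, `score` returns the coefficient of determination (R²). The sketch below computes it by hand so the two numbers above are easier to read; the `r_squared` helper is a hypothetical name, and the sample values are the first five visitor counts from the Step 1 table.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [25, 32, 29, 22, 6]  # first five visitor counts shown in Step 1

# Baseline: predict the average everywhere.
always_mean = [sum(y_true) / len(y_true)] * len(y_true)

print(r_squared(y_true, always_mean))  # the mean-only baseline
print(r_squared(y_true, y_true))       # perfect predictions
```

A score near 0.16 therefore means the model explains only a small share of the variance beyond what predicting the mean would.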

**3c:** Repeat these same steps for the `TargetEncoder` and the `OneHotEncoder`

**Important:** The `OneHotEncoder` can take a while to fit.  If nothing happens in around 4 minutes, just cancel the process and try it again later on when you have more time.

In [10]:
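#target encoder - categories become floats derived from the training target
te = ce.TargetEncoder()
X_train_te = te.fit_transform(X_train, y_train)
#transform the test set without its target so no test labels reach the encoder
X_test_te  = te.transform(X_test)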

In [11]:
gbm = GradientBoostingRegressor()
gbm.fit(X_train_te, y_train)
gbm.score(X_train_te, y_train), gbm.score(X_test_te, y_test)

(0.47645123730015393, 0.4649839906056512)
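
The core idea of target encoding is to replace each category with the average target seen for it in the training data. The sketch below is a simplified version, assuming no smoothing and a global-mean fallback for unseen categories; the library encoder also smooths toward the global mean, which is left out here. The sample rows are taken from the Step 1 table.

```python
from collections import defaultdict

# A handful of (day_of_week, visitors) pairs from the Step 1 table.
train_days = ["Friday", "Friday", "Monday", "Saturday", "Monday"]
train_visitors = [29, 49, 6, 22, 31]
test_days = ["Friday", "Sunday"]  # "Sunday" is absent from this training sample

# "Fit": accumulate the mean target per category from the training rows.
totals = defaultdict(float)
counts = defaultdict(int)
for day, visitors in zip(train_days, train_visitors):
    totals[day] += visitors
    counts[day] += 1
category_means = {day: totals[day] / counts[day] for day in totals}

# Fallback for categories the training data never contained (assumption).
global_mean = sum(train_visitors) / len(train_visitors)

# "Transform": the test target is never consulted.
test_encoded = [category_means.get(day, global_mean) for day in test_days]

print(category_means)
print(test_encoded)
```

Applied to `id`, this kind of encoding turns each restaurant into something close to its average training-set visitor count, which may explain why it helps so much on this dataset.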

In [12]:
#one-hot encoder - one binary column per category
ohe = ce.OneHotEncoder()
X_train_ohe = ohe.fit_transform(X_train, y_train)
#transform takes only the features; the test target is not needed
X_test_ohe  = ohe.transform(X_test)

In [13]:
gbm = GradientBoostingRegressor()
gbm.fit(X_train_ohe, y_train)
gbm.score(X_train_ohe, y_train), gbm.score(X_test_ohe, y_test)

(0.16940288952649096, 0.17191414257277793)
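
One-hot encoding creates one binary column per category seen in training. This is a library-free sketch of that idea, not the `category_encoders` implementation; the `one_hot` helper and the all-zeros treatment of unseen categories are assumptions for illustration.

```python
train_days = ["Friday", "Monday", "Saturday"]
test_days = ["Monday", "Sunday"]  # "Sunday" is not in the training sample

# "Fit": one output column per category seen in training, in sorted order.
columns = sorted(set(train_days))

def one_hot(values, columns):
    """Return one row of 0/1 flags per value, one flag per known column."""
    return [[1 if value == column else 0 for column in columns] for value in values]

train_ohe = one_hot(train_days, columns)
test_ohe = one_hot(test_days, columns)  # unseen categories become all zeros here

print(columns)
print(train_ohe)
print(test_ohe)
```

The number of columns grows with the number of distinct categories, which is likely why a column with as many values as `id` makes this encoder slow to fit.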

**Step 4:** Look at your most important features

Similar to the previous lab, take your model's most important features and load them into a dataframe to see what's driving your results.

In [14]:
#using the target encoder because it had the best test score
te = ce.TargetEncoder()
X_train_te = te.fit_transform(X_train, y_train)
#transform the test set without its target
X_test_te  = te.transform(X_test)
#fitting and scoring
gbm = GradientBoostingRegressor()
gbm.fit(X_train_te, y_train)
gbm.score(X_train_te, y_train), gbm.score(X_test_te, y_test)

(0.47645123730015393, 0.4649839906056512)

In [15]:
# your code here for feature importances
feats = pd.DataFrame({'Feature':X_train_te.columns, 'Importance':gbm.feature_importances_}).sort_values(by='Importance', ascending=False)
feats


index,Feature,Importance
0,id,0.86496
1,day_of_week,0.100339
8,month,0.011479
2,holiday,0.006557
10,day_of_month,0.005928
6,longitude,0.004103
5,latitude,0.003187
7,reserve_visitors,0.001052
3,genre,0.001028
4,area,0.000813
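
A running total makes it easier to see how concentrated the importances are. The sketch below uses only the values printed in the table above; the `cumulative` name is introduced for this illustration.

```python
# Importances copied from the table above.
importances = {
    "id": 0.86496, "day_of_week": 0.100339, "month": 0.011479,
    "holiday": 0.006557, "day_of_month": 0.005928, "longitude": 0.004103,
    "latitude": 0.003187, "reserve_visitors": 0.001052, "genre": 0.001028,
    "area": 0.000813,
}

# Walk the features from most to least important, keeping a running total.
running = 0.0
cumulative = {}
for feature, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    running += value
    cumulative[feature] = running
    print(f"{feature:<18}{value:>10.4f}{running:>10.4f}")
```

Here `id` and `day_of_week` together account for roughly 96.5% of the total importance, so the model is mostly learning each restaurant's typical level plus a weekday effect.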


**Step 5:** Can model parameters improve your score?  

Take the **best** version of your encoding method and try changing some parameters with your model to see if it improves your score.  

You won't have a ton of time to do this, but try some of the following:

 - Try increasing the number of trees your model uses -- 250, 500, or perhaps more trees if time permits
 - Try experimenting with differing values for tree depth -- the default is 3, but perhaps 4, 5 or 6 works better
 - Try improving fitting time by introducing some **randomness** into your model with the following two parameters:
   - `subsample`: this dictates what proportion of your rows will be used for each tree.  A value of `0.7` means 70% of your rows will be used for a particular tree, chosen at random
   - `max_features`: this is the number of columns considered at each individual split.  If you enter an integer the model will randomly select that number of columns; if you enter a decimal it will randomly select that fraction of columns.
   - It can be very useful to find the sparsest model that will still give you comparable results.  I.e., if you find a gbm with 500 trees and a `max_depth` of 4 gives you the best results, it can be very beneficial if you can get those same results with a `subsample` value of 0.6 and a `max_features` value of 0.7, because your model will fit roughly 50% faster.
   
This step is open-ended, so we will likely have to end class in the middle of it.
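
To get a feel for what `subsample` and `max_features` do, the sketch below works out how many rows and columns each tree and split would see, then lists a few parameter combinations from the suggestions above. `n_rows` is a hypothetical training-set size, not the lab's actual count; `n_features` is the 11 columns left in `X_train` after Step 2.

```python
from itertools import product

n_rows = 200_000   # hypothetical training-set size (assumption)
n_features = 11    # columns in X_train after dropping the dates and the target

subsample = 0.6
max_features = 0.7

# A fractional subsample picks this many rows at random for each tree.
rows_per_tree = int(subsample * n_rows)

# A fractional max_features is truncated to a whole number of columns (at least 1).
features_per_split = max(1, int(max_features * n_features))

print(rows_per_tree, features_per_split)

# A small grid of candidate settings to try, one dict per model.
candidates = [
    {"n_estimators": n, "max_depth": d,
     "subsample": subsample, "max_features": max_features}
    for n, d in product([250, 500], [4, 5, 6])
]
for params in candidates:
    print(params)
```

Each dictionary can be unpacked into the model, e.g. `GradientBoostingRegressor(**params)`, and scored on the training and test sets in turn.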

In [16]:
# your code here
gbm.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'ls',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [17]:
#using the target encoder because it had the best test score
te = ce.TargetEncoder()
X_train_te = te.fit_transform(X_train, y_train)
#transform the test set without its target
X_test_te  = te.transform(X_test)
#fitting and scoring with more, deeper trees plus row and column subsampling
gbm = GradientBoostingRegressor(n_estimators=500, max_depth=5, subsample=0.6, max_features=0.7)
gbm.fit(X_train_te, y_train)
gbm.score(X_train_te, y_train), gbm.score(X_test_te, y_test)

(0.5697688176016562, 0.5262781000975406)