### Lab:  Model Validation With Decision Trees

Welcome to this evening's lab!  It's going to be a fun one.  For today's class, we're going to take a crack at model building in a holistic way.  

Specifically, we're going to try to do three different things:

 - Try out different versions of our data, and use our validation scores to see whether each change was an improvement
 - Adjust model parameters to help curb overfitting
 - Find the model parameters that maximize our score on our dataset
 
The idea is to do a mini-walkthrough of what it's like to build and validate a model, and then try to improve its results.

**Step 1:** Using the suggestions from the homework prompt given previously, try to add 3-4 different features (columns) to your data, and use your validation score to determine whether they improved your results.  For now, just stick with a tree that is 6 levels deep.

This is meant to be open-ended, and to give you a chance to rediscover material from previous labs.

In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt
pd.options.plotting.backend = 'plotly'
from sklearn.pipeline import make_pipeline
import category_encoders as ce

In [3]:
df = pd.read_csv('/Users/mcs275/dat-class-repo/master.csv', parse_dates = ['visit_date'])

In [4]:
df = df.sort_values(['id','visit_date']).fillna(0)

In [6]:
def get_val_scores(df, estimator):
    
    df = df.drop('visit_date', axis = 1)
    
    # hold out the last 15 rows of each id for validation and train on the rest
    # (rows are already sorted by id and visit_date)
    train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
    val   = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)
    
    # split each set into features (X) and target (y)
    X_train, y_train = train.drop('visitors', axis = 1), train['visitors']
    X_val, y_val     = val.drop('visitors', axis = 1), val['visitors']
    
    estimator.fit(X_train, y_train)
    
    # score on the validation data (R^2 for a regressor)
    return estimator.score(X_val, y_val)
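If the per-id holdout inside `get_val_scores` feels opaque, here is a minimal sketch of the same idea on toy data. It is an alternative formulation using `cumcount`, not the function's actual code; the `toy` frame, its values, and `n_holdout = 2` are made up for illustration, and it assumes rows are already in date order within each id.

```python
import pandas as pd

# Toy data (hypothetical): two ids with five ordered visits each.
toy = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 5,
    "visitors": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

n_holdout = 2  # the lab holds out 15 rows per id; 2 keeps the toy readable

# Count each row's distance from the end of its group: 0 is the latest visit.
rows_from_end = toy.groupby("id").cumcount(ascending=False)

toy_train = toy[rows_from_end >= n_holdout]   # everything but the last rows
toy_val = toy[rows_from_end < n_holdout]      # the last n_holdout rows per id

print(toy_train)
print(toy_val)
```

Every id contributes its most recent rows to validation, so the model is always scored on dates that come after the ones it trained on.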

In [27]:
# hold out the last 15 rows of each id as the final test set (used in Step 4)
train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test  = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)

In [28]:
pipe = make_pipeline(ce.TargetEncoder(), DecisionTreeRegressor(max_depth = 6))

In [29]:
get_val_scores(train, pipe)

0.4795716486578089

In [30]:
train['month'] = train['visit_date'].dt.month
train['year'] = train['visit_date'].dt.year
train['passage_of_time'] = (train['visit_date'] - train['visit_date'].min()).dt.days

In [31]:
get_val_scores(train, pipe)

0.4794210840577293

In [32]:
train = train.drop(['month','year','passage_of_time'], axis = 1)

In [24]:
get_val_scores(train, pipe)

0.4795716486578089

In [40]:
# lag features: each row gets the visitor count from 1, 2, 3, 7, and 30 rows earlier
# for the same id (i.e. yesterday's value, assuming one row per day)
train['yesterday'] = train.groupby('id')['visitors'].shift()
train['2dayago'] = train.groupby('id')['visitors'].shift(2)
train['3dayago'] = train.groupby('id')['visitors'].shift(3)
train['7dayago'] = train.groupby('id')['visitors'].shift(7)
train['30dayago'] = train.groupby('id')['visitors'].shift(30)
# the earliest rows of each id have no history yet, so backfill the gaps
train = train.bfill()

In [41]:
get_val_scores(train, pipe)

0.4912491788355623

In [55]:
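# rolling means of visitors over the previous 7, 30, and 90 rows of each id
# NOTE: .shift() runs on the concatenated result rather than per id, so the first
# row of each id picks up the last rolling value of the previous id
train['rolling7'] = train.groupby('id')['visitors'].rolling(7).mean().shift().values
train['rolling30'] = train.groupby('id')['visitors'].rolling(30).mean().shift().values
train['rolling90'] = train.groupby('id')['visitors'].rolling(90).mean().shift().values
train = train.bfill()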

In [56]:
get_val_scores(train, pipe)


0.5102922382918247
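One thing to watch with the rolling features: calling `.shift()` after a grouped `.rolling(...).mean()` appears to shift the whole stacked result, so values can cross from one id into the next. A minimal sketch of a per-id alternative, using `transform` on toy data, is below. The `toy` frame and the window of 2 are made up for illustration, and swapping this into the lab would likely change the score shown above.

```python
import pandas as pd

# Toy data (hypothetical): two ids with very different visitor levels.
toy = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "visitors": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
})

# Shift inside each id first so a row only sees earlier rows of the same id,
# then take the rolling mean of those earlier values.
toy["rolling2"] = toy.groupby("id")["visitors"].transform(
    lambda s: s.shift().rolling(2).mean()
)

print(toy)
```

The first rows of id `b` stay empty instead of inheriting numbers from id `a`, which keeps each id's history separate.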

**Step 2:** Now let's try out different model parameters  

The goal here is two-fold:  see if you can narrow the gap between in-sample and out-of-sample results (using training & validation sets), while simultaneously **not** decreasing your model scores (or at least not by very much).  The closer these two scores are, the more reliable your results are likely to be.

Some knobs you can turn:
 - `min_samples_leaf`: a parameter in the category encoder that sets the cutoff for using the local vs. global average for a category.  (A decent rule of thumb is to try to have at least 10 samples on a leaf, but feel free to try different values)
 - `max_features`: what portion of columns to consider at each split.  The tree randomly samples columns at each split, which reduces the chance that random patterns within them will have a disproportionately large impact on your model.  Should be a fraction between 0 and 1, or the number of columns you want to include.  
 - You can also try the following:  remove any sort of `max_depth` on your tree, and just use the tree's own `min_samples_leaf` as a way to prune unnecessary splits
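To get a feel for how these knobs move the train/validation gap, here is a minimal sketch on synthetic data. It is not the lab's restaurant dataset: the arrays, the split sizes, and the candidate values `[1, 10, 50]` are assumptions made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical): signal in two columns plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=1.0, size=600)

X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

results = {}
for leaf in [1, 10, 50]:
    # No max_depth: min_samples_leaf alone decides when splitting stops.
    tree = DecisionTreeRegressor(
        min_samples_leaf=leaf, max_features=0.8, random_state=0
    )
    tree.fit(X_train, y_train)
    train_score = tree.score(X_train, y_train)
    val_score = tree.score(X_val, y_val)
    results[leaf] = (train_score, val_score)
    print(f"min_samples_leaf={leaf:>2}  train={train_score:.3f}  "
          f"val={val_score:.3f}  gap={train_score - val_score:.3f}")
```

In the lab's pipeline, `make_pipeline` names each step after its lowercased class name, so something like `pipe.set_params(decisiontreeregressor__min_samples_leaf=10)` should reach the tree and `targetencoder__min_samples_leaf` the encoder. Print `pipe.get_params().keys()` to confirm the exact names in your environment.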

In [None]:
# your code here

**Step 3:** Take the best version of your model & your data, and fit it on **all** of your training + validation data.  Now that we've found the best version of what we have to work with, we want to give it as many training samples as possible.  
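A minimal sketch of the refit, again on synthetic data: the arrays and the `best` estimator with `min_samples_leaf=10` are placeholders for whatever you settled on in Step 2, not a recommended configuration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins (hypothetical) for the training and validation sets.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Placeholder for the best settings found during validation.
best = DecisionTreeRegressor(min_samples_leaf=10, random_state=0)

# clone() copies the hyperparameters but none of the fitted state,
# so the final model starts fresh on the combined data.
final_model = clone(best)
final_model.fit(
    np.vstack([X_train, X_val]),
    np.concatenate([y_train, y_val]),
)

print(final_model.tree_.n_node_samples[0])  # rows seen at the root of the tree
```

With DataFrames, `pd.concat([train, val])` plays the role of `np.vstack` here. In the lab the `train` frame already contains the validation rows (they are only carved out inside `get_val_scores`), so fitting on all of `train` may be all that is needed.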

In [None]:
# your code here

**Step 4:** Score your model on your test set.

Note how your validation and test scores compare to one another.
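As a rough sketch of that comparison, the snippet below records a validation score, refits on training + validation data, and only then scores on the held-out test rows. The data is synthetic and the model settings are placeholders, so the numbers say nothing about the lab's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical), split in order: train, validation, test.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:400], y[300:400]
X_test, y_test = X[400:], y[400:]

model = DecisionTreeRegressor(min_samples_leaf=10, random_state=0)

# Validation score: fit on train only, score on validation.
model.fit(X_train, y_train)
val_score = model.score(X_val, y_val)

# Test score: refit on train + validation, then score on test exactly once.
model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_score = model.score(X_test, y_test)

print(f"validation R^2 = {val_score:.3f}")
print(f"test R^2       = {test_score:.3f}")
print(f"difference     = {val_score - test_score:+.3f}")
```

If the test score lands far below the validation score, that is usually a hint that the validation set was consulted so often during tuning that the model settings started to fit it.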

In [None]:
# your code here