### Lab:  Model Validation With Decision Trees

Welcome to this evening's lab!  It's going to be a fun one.  For today's class, we're going to try and take a crack at model building in a wholistic way.  

Specifically, we're going to try and do three different things:

 - Try out different versions of our data, and use our validation scores to see if something was an improvement or not
 - We're going to adjust model parameters to try and adjust our results to help curb overfitting
 - We're going to try and find model parameters that maximize our score for our dataset
 
The idea is that we'll be able to do a mini-walkthrough to test what it's like to build and validate a model and try and improve our results.

**Step 1:** Using the suggestions from the homework prompt given previously, try and add 3-4 different features ( columns ) to your data, and use your validation score to determine if they improved your results.  For now just stick with a tree that is 6 levels deep.

This is meant to be open ended, and to allow you a chance to re-discover material from previous labs.

In [3]:
# your code here
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
import category_encoders as ce

tree = DecisionTreeRegressor(max_depth = 6)
pipe = make_pipeline(ce.TargetEncoder(), tree)

# function to automatically check for validation score -- useful for 
def get_val_scores(df, estimator):
    
    df = df.drop('visit_date', axis = 1)
    
    # create training and validation set
    train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
    val   = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)
    
    # create a validation & test set
    X_train, y_train = train.drop('visitors', axis = 1), train['visitors']
    X_val, y_val     = val.drop('visitors', axis = 1), val['visitors']
    
    estimator.fit(X_train, y_train)
    
    # score on the test data
    return estimator.score(X_val, y_val)

# Load in the data set
df = pd.read_csv('/Users/mcs275/dat-class-repo/master.csv', 
                 parse_dates = ['visit_date'])

df = df.sort_values(by=['id', 'visit_date'])

# fill in missing values
df = df.fillna(0)

In [33]:
# create training and test sets
train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test  = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)

In [34]:
val_score = get_val_scores(train, pipe)

  elif pd.api.types.is_categorical(cols):


In [36]:
# initial validation score
get_val_scores(train, pipe)

  elif pd.api.types.is_categorical(cols):


0.4795716486578089

In [37]:
# we'll try encoding some different date parts
df['year'] = df['visit_date'].dt.month
df['month'] = df['visit_date'].dt.month
df['week'] = df['visit_date'].dt.week
df['time'] = (df['visit_date'] - df['visit_date'].min()).dt.days

# recreate training and test
train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test  = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)

# and get the new validation score
get_val_scores(train, pipe)

  df['week'] = df['visit_date'].dt.week
  elif pd.api.types.is_categorical(cols):


0.47519515118681543

This had no change, so we'll drop these values for now.

In [38]:
# drop the columns
df = df.drop(['year', 'month', 'week', 'time'], axis = 1)

In [39]:
# try different auto regression values
# look at previous 3 values + week + 1 month ago
df['yesterday'] = df.groupby('id')['visitors'].shift()
df['two_days_ago'] = df.groupby('id')['visitors'].shift(2)
df['three_days_ago'] = df.groupby('id')['visitors'].shift(3)
df['one_week_ago'] = df.groupby('id')['visitors'].shift(7)
df['one_month_ago'] = df.groupby('id')['visitors'].shift(30)

# fill in missing values
df = df.bfill() # note why we're doing this

# recreate training and test
train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test  = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)

# and get the new validation score
get_val_scores(train, pipe)

  elif pd.api.types.is_categorical(cols):


0.4912491788355623

A modest improvement, so we'll keep it.

In [40]:
# we'll now try different rolling statistics
df['one_week_rolling'] = df.groupby('id')['visitors'].rolling(7).mean().shift().values
df['one_month_rolling'] = df.groupby('id')['visitors'].rolling(30).mean().shift().values

# back fill data
df = df.bfill()

# create training and test set
train = df.groupby('id').apply(lambda x: x.iloc[:-15]).reset_index(drop = True)
test  = df.groupby('id').apply(lambda x: x.iloc[-15:]).reset_index(drop = True)

# and get the new validation score
get_val_scores(train, pipe)

  elif pd.api.types.is_categorical(cols):


0.5129720401754606

Better still, so we'll go ahead and keep it.

**Step 2:** Let's now try out different types of model parameters  

The idea here is two-fold:  see if you can narrow the gap between in-sample and out-of-sample results (using training & validation sets), while simultaneously **not** decreasing your model scores (or at least not by very much).  The idea being that the closer these two are, the more reliable your results are likely to be.

Some knobs you can turn:
 - `min_samples_leaf`: parameter in the category encoder that determines what cutoff point you can use for using the local vs. global average for the category.  (A decent rule of thumb is to try and have at least 10 samples on a leaf, but feel free to try different values)
 - `max_features`: what portion of columns to use at each split.  This parameter will randomly sample columns at each split, which reduces the chance that random patterns within them will have a disproportionately large impact on your model.  Should be a fraction between 0 and 1 or the number of columns you want to include.  
 - You can also try the following:  remove and sort of max_depth on your tree, and just use `min_samples_leaf` as a way to prune unnecessary splits

In [41]:
# your code here

min_samples_leaf = [5, 10, 20, 30]
max_features     = [0.6, 0.8, 1.0]
max_depth        = [5, 6, 7]
cv_scores        = []

# a training loop to search for different validation combinations
for sample in min_samples_leaf:
    for feature in max_features:
        for depth in max_depth:
            print(f"Fitting a tree for leaf size: {sample}, max feats: {feature}, depth: {depth}")
            pipe[-1].set_params(min_samples_leaf = sample, max_depth = depth, max_features = feature)
            val_score = get_val_scores(train, pipe)
            cv_scores.append((val_score, sample, feature, depth))

Fitting a tree for leaf size: 5, max feats: 0.6, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 0.6, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 0.6, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 0.8, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 0.8, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 0.8, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 1.0, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 1.0, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 5, max feats: 1.0, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.6, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.6, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.6, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.8, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.8, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 0.8, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 1.0, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 1.0, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 10, max feats: 1.0, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.6, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.6, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.6, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.8, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.8, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 0.8, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 1.0, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 1.0, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 20, max feats: 1.0, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.6, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.6, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.6, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.8, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.8, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 0.8, depth: 7


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 1.0, depth: 5


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 1.0, depth: 6


  elif pd.api.types.is_categorical(cols):


Fitting a tree for leaf size: 30, max feats: 1.0, depth: 7


  elif pd.api.types.is_categorical(cols):


**Step 3:** Take the best version of your model & your data, and fit it on **all** of your training + validation data.  The idea is that now that we've found the best version of what we have to work with, we want to give it as much training samples as possible.  

In [42]:
# your code here
cv_scores

[(0.4903522256472562, 5, 0.6, 5),
 (0.5020397294389618, 5, 0.6, 6),
 (0.5240358439409751, 5, 0.6, 7),
 (0.48832861716590903, 5, 0.8, 5),
 (0.5133094455836702, 5, 0.8, 6),
 (0.5161592163150333, 5, 0.8, 7),
 (0.4879663829077875, 5, 1.0, 5),
 (0.5133983883190941, 5, 1.0, 6),
 (0.5169589241496577, 5, 1.0, 7),
 (0.485932318602449, 10, 0.6, 5),
 (0.5091583947520302, 10, 0.6, 6),
 (0.5104326575878118, 10, 0.6, 7),
 (0.4898966203656294, 10, 0.8, 5),
 (0.5233972466524043, 10, 0.8, 6),
 (0.5186276480458202, 10, 0.8, 7),
 (0.4879663829077875, 10, 1.0, 5),
 (0.5133981064470319, 10, 1.0, 6),
 (0.5223462176324639, 10, 1.0, 7),
 (0.5089853318778007, 20, 0.6, 5),
 (0.5187079480371192, 20, 0.6, 6),
 (0.5195564493120243, 20, 0.6, 7),
 (0.48934420529319256, 20, 0.8, 5),
 (0.5132069163148621, 20, 0.8, 6),
 (0.518724427107579, 20, 0.8, 7),
 (0.4879663829077875, 20, 1.0, 5),
 (0.5133341993019256, 20, 1.0, 6),
 (0.5221544731749446, 20, 1.0, 7),
 (0.4903730724371058, 30, 0.6, 5),
 (0.4983284374204664, 30, 0.6

In [43]:
# get the one with the best validation score
max(cv_scores)

(0.5240358439409751, 5, 0.6, 7)

In [44]:
# set the model for your best parameters
pipe[-1].set_params(min_samples_leaf = 5, max_features = 0.6, max_depth = 7)

DecisionTreeRegressor(max_depth=7, max_features=0.6, min_samples_leaf=5)

In [46]:
# fit on all of training data
X_train, y_train = train.drop(['visit_date', 'visitors'], axis = 1), train['visitors']
X_test, y_test   = test.drop(['visit_date', 'visitors'], axis = 1), test['visitors']

**Step 4:** Score your model on your test set.

Note how your validation + test scores compared to one another.

In [47]:
# your code here
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

  elif pd.api.types.is_categorical(cols):


0.48109001135648977