## Notebook 5 - Fit Regression Models on the Tweet Metadata
This notebook trains regression models on the tweets' non-text (metadata) features for the first layer of the model stack.

In [1]:
import pandas as pd
import numpy as np
import re
import joblib  # standalone package; sklearn.externals.joblib was removed from newer scikit-learn releases

### Train models with the following features:
1. Year tweet was created
1. Month tweet was created
1. Day of week tweet was created
1. Hour tweet was created
1. Tweet has a hashtag
1. Tweet has a mention
1. Tweet has a URL
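
The features themselves were built in an earlier notebook, which is not shown here. As a rough illustration only, the sketch below derives the same seven columns from a raw tweet table. The input column names `created_at` and `text`, the sample tweets, and the regular expressions are all assumptions, not the original preprocessing.

```python
import pandas as pd

# Hypothetical raw tweets; `created_at` and `text` are assumed column names.
tweets = pd.DataFrame({
    'created_at': pd.to_datetime(['2016-03-14 09:30:00', '2017-11-05 22:05:00']),
    'text': ['Big news today! #launch https://example.com',
             'Thanks @alice for the help'],
})

created = tweets['created_at']
features = pd.DataFrame({
    'year': created.dt.year,
    'month': created.dt.month,
    'hour': created.dt.hour,
    'weekday': created.dt.weekday,  # Monday=0 ... Sunday=6
    # Simple pattern checks (assumed definitions of the three flags)
    'has_hashtag': tweets['text'].str.contains(r'#\w+').astype(int),
    'has_mention': tweets['text'].str.contains(r'@\w+').astype(int),
    'has_url': tweets['text'].str.contains(r'https?://').astype(int),
})
print(features)
```

The column names in `features` match the `cols` list used later in this notebook, so a frame built this way could be sliced the same way.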

In [2]:
# import the layer1 training and test sets
df_layer1_train = joblib.load('../data/df_layer1_train.pkl')
df_layer1_test = joblib.load('../data/df_layer1_test.pkl')
y_layer1_train = joblib.load('../data/y_layer1_train.pkl')
y_layer1_test = joblib.load('../data/y_layer1_test.pkl')

In [3]:
# get the desired features 
cols = ['year', 'month', 'hour', 'weekday', 'has_hashtag', 'has_mention', 'has_url']
X_layer1_train = df_layer1_train[cols]
X_layer1_test = df_layer1_test[cols]

In [4]:
from sklearn.linear_model import Lasso, BayesianRidge, SGDRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

In [5]:
# function to fit and score a regressor
def fit_and_score(data, regr, return_regr=False):
    '''
    Fits the regressor to the training data, then prints the train and test scores.
    For scikit-learn regressors, score() returns R^2 (the coefficient of determination).

    Parameters:
        data - sequence of (X_train, X_test, y_train, y_test)
        regr - instantiated regressor
        return_regr - boolean, whether to return the fitted regressor (default: False)

    Returns: the fitted regressor if return_regr is True, otherwise None
    '''
    # unpack once instead of indexing into data by position
    X_train, X_test, y_train, y_test = data

    print('Regressor: {}'.format(regr))

    regr.fit(X_train, y_train)
    print('Train score: {}'.format(regr.score(X_train, y_train)))
    print('Test score: {}'.format(regr.score(X_test, y_test)))

    if return_regr:
        return regr
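
For reference, the score printed by `fit_and_score` is R^2, which scikit-learn regressors return from `score()`. The sketch below recomputes it by hand on made-up numbers to show its range: 1.0 for perfect predictions, 0.0 for always predicting the mean, and unboundedly negative for predictions that are far off. The helper name `r2_score_manual` and the sample values are illustrative only.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)                    # predictions match exactly
baseline = r2_score_manual(y_true, [2.5] * 4)                # always predict the mean
diverged = r2_score_manual(y_true, [1e6, -1e6, 1e6, -1e6])   # wildly wrong predictions

print(perfect, baseline, diverged)
```

A large gap between the train and test values of this score is a sign of overfitting, and a hugely negative value on both usually means the fit itself failed rather than that the model is merely weak.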

In [6]:
data = (X_layer1_train, X_layer1_test, y_layer1_train, y_layer1_test)

In [7]:
sgd = SGDRegressor(n_iter=5000, verbose=0)  # newer scikit-learn releases use max_iter instead of n_iter
rfr = RandomForestRegressor()
knr = KNeighborsRegressor(n_jobs=-1)
bayes = BayesianRidge(verbose=True)
gbr = GradientBoostingRegressor()
models = [sgd, rfr, knr, bayes, gbr]

In [8]:
for model in models:
    fit_and_score(data, model)
    print()

Regressor: SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5000, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
Train score: -1.636113059385431e+28
Test score: -1.6668625552937147e+28

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.6984760028369565
Test score: 0.2064341551631157

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
          weights='uniform')
Train score: 0.5030714365324973
Test 

### Tune / Boost the Top Regressors
All tuning attempts are recorded below, but only the winning configurations are re-run in the code cells that follow.

#### Record of Tuning / Boosting Attempts and Results
1. `RandomForestRegressor`
    1. `n_estimators`: 5, 15, 20 - led to more overfitting
    1. `warm_start`: True - slight decrease in test score
    1. `min_samples_split`: 50, 100, 125, 150, 200 - results peaked at 100  
    **Winner:** 10 estimators, no warm_start, at least 100 samples to split  
1. `KNeighborsRegressor`
    1. `n_neighbors`: 3, 7, 9 - 7 gave a 0.03 bump in test score, so 7 was used going forward
    1. `weights`: distance - no improvement
    1. `p`: 1 (city block / Manhattan distance) - 0.002 improvement in test score  
    **Winner:** 7 neighbors, uniform weights, city block (Manhattan) distance metric
1. `BayesianRidge`
    1. `AdaBoostRegressor` did not improve the scores
1. `GradientBoostingRegressor`
    1. `n_estimators`: 100 (default), 500, 1000 - results peaked at 500  
    **Winner:** 500 estimators
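
The attempts recorded above were run by hand. A loop along the following lines is one way such a sweep could be scripted; it is a minimal sketch on synthetic data from `make_regression`, not the tweet metadata, and the names `results` and `best_split` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tweet metadata (assumption: 7 numeric features)
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for split in [50, 100, 125, 150, 200]:
    regr = RandomForestRegressor(n_estimators=10, min_samples_split=split,
                                 random_state=0)
    regr.fit(X_train, y_train)
    # store (train R^2, test R^2) for each candidate value
    results[split] = (regr.score(X_train, y_train), regr.score(X_test, y_test))

# pick the value with the highest test score, mirroring the record above
best_split = max(results, key=lambda s: results[s][1])

for split, (train_r2, test_r2) in results.items():
    print(split, round(train_r2, 3), round(test_r2, 3))
print('best min_samples_split:', best_split)
```

Choosing on the test score follows the comparison used in this notebook. A separate validation split or `GridSearchCV` would keep the test set untouched, at the cost of extra fitting time.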
    

In [10]:
rfr_split_100 = RandomForestRegressor(min_samples_split=100)
fit_and_score(data, rfr_split_100)

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=100, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.40578954758179997
Test score: 0.3401445723084252


In [9]:
knr_7_city = KNeighborsRegressor(n_neighbors=7, p=1, n_jobs=-1)
fit_and_score(data, knr_7_city)

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=7, p=1,
          weights='uniform')
Train score: 0.46560186856534225
Test score: 0.2623040074689298


In [11]:
gbr_500 = GradientBoostingRegressor(n_estimators=500)
fit_and_score(data, gbr_500)

Regressor: GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=500,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
Train score: 0.35670574224175045
Test score: 0.3214409514256279


### Export the Best Models
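Export the fitted models for use in the blender. The `BayesianRidge` model is exported as originally fit, since boosting did not improve its scores.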

In [12]:
joblib.dump(rfr_split_100, '../data/5-meta_rfr.pkl')
joblib.dump(knr_7_city, '../data/5-meta_knr.pkl')
joblib.dump(bayes, '../data/5-meta_bayes.pkl')
joblib.dump(gbr_500, '../data/5-meta_gbr.pkl')

['../data/5-meta_gbr.pkl']
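
For context, `joblib.dump` returns the list of files it wrote, which is why only the last path is echoed above. The sketch below shows the round trip a later notebook might use to load an exported model and predict with it. It runs on toy data and writes to a temporary directory; the file name `meta_bayes.pkl` and the data are placeholders, not the project's actual files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import BayesianRidge

# Toy data standing in for the layer-1 metadata features
rng = np.random.RandomState(0)
X = rng.rand(50, 7)
y = X.sum(axis=1)

model = BayesianRidge().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'meta_bayes.pkl')  # placeholder file name
    saved = joblib.dump(model, path)   # list of file names written
    restored = joblib.load(path)       # fitted model, ready to predict

# the restored model should reproduce the original predictions
preds_match = np.allclose(model.predict(X), restored.predict(X))
print(preds_match)
```

Pickled models are generally only safe to reload with the same scikit-learn version that wrote them, so the blender should run in the same environment as this notebook.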