## Notebook 6 - Fit Regression Models on the Hashtags and Mentions in the Tweets 
The purpose of this notebook is to train regression models on the hashtags and mentions in the tweets for the first layer of the model stack.

#### Future Consideration:
Fit the vectorizer on the random tweets and the SPCA tweets together instead of just the SPCA tweets.
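
A minimal sketch of what that change could look like, assuming the random tweets are available as a second Series of hashtags-and-mentions sentences. The names `spca_tags` and `random_tags` and their contents are toy stand-ins, not the project's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the two corpora (hypothetical data, for illustration only)
spca_tags = pd.Series(['adoptdontshop spca', 'rescuedog spca'])
random_tags = pd.Series(['mondaymotivation coffee', 'rescuedog kittens'])

# Fit a single vocabulary on both corpora so that tokens seen only in the
# random tweets are not silently dropped at transform time
combined = pd.concat([spca_tags, random_tags], ignore_index=True)
combined_cv = CountVectorizer(stop_words='english').fit(combined)

# Vocabulary fit on the SPCA tweets alone, for comparison
spca_only_cv = CountVectorizer(stop_words='english').fit(spca_tags)

print(len(spca_only_cv.vocabulary_), len(combined_cv.vocabulary_))
```

A vectorizer fit only on SPCA tweets ignores any token it has not seen, so the combined fit matters mainly if the downstream models are expected to score random tweets as well.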

In [1]:
import pandas as pd
import numpy as np
import re
import joblib  # formerly `from sklearn.externals import joblib`, which newer scikit-learn releases no longer provide

### Steps to Work Through
1. Fit a count vectorizer on the hashtags-and-mentions sentences of the entire SPCA data set, transform them, then fit the SVD on the result
1. Try different regression models on the transformed sentences to identify the best options
1. Tune / boost the best options

### 1. Count Vectorize and do SVD on the Sentences

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

In [3]:
# get all the SPCA data for the purpose of fitting count vectorizer and svd
df_spca = pd.read_pickle('../data/3-post_eda.p')

In [4]:
# fit count vectorizer and svd on all SPCA data
X = df_spca.hashtags_and_mentions
tags_cv = CountVectorizer(stop_words='english')
tags_sparse = tags_cv.fit_transform(X)
display(tags_sparse.shape)
tags_svd = TruncatedSVD(500)
tags_svd.fit(tags_sparse)

(80836, 27454)

TruncatedSVD(algorithm='randomized', n_components=500, n_iter=5,
       random_state=None, tol=0.0)
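
To make the two-stage transformation concrete, here is a small sketch on toy sentences (hypothetical data). It shows the sparse document-by-vocabulary matrix produced by the count vectorizer and the dense, lower-dimensional matrix produced by the SVD. Checking `explained_variance_ratio_` is one possible way to judge whether a given number of components is enough; the notebook itself does not do this.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy hashtags-and-mentions "sentences" (hypothetical data)
toy_sentences = [
    'adoptdontshop spca rescuedog',
    'spca kittens adoptdontshop',
    'rescuedog shelterpets',
    'kittens caturday shelterpets',
]

toy_cv = CountVectorizer(stop_words='english')
toy_sparse = toy_cv.fit_transform(toy_sentences)  # sparse: documents x vocabulary

# n_components cannot exceed the vocabulary size; 2 is used here only
# because the toy vocabulary is tiny (the real data uses 500)
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_dense = toy_svd.fit_transform(toy_sparse)     # dense: documents x components

print(toy_sparse.shape, toy_dense.shape)
print(toy_svd.explained_variance_ratio_.sum())
```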

In [5]:
# pickle the cv and svd for later use
joblib.dump(tags_cv, '../data/6-tags_cv.pkl')
joblib.dump(tags_svd, '../data/6-tags_svd.pkl')

['../data/6-tags_svd.pkl']
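
For later notebooks, the pickled objects can be restored with `joblib.load`. The sketch below round-trips a small vectorizer through a temporary directory to show that the fitted vocabulary survives; the file name `demo_cv.pkl` and the toy sentences are assumptions for illustration.

```python
import os
import tempfile

import joblib  # standalone package; older scikit-learn exposed it as sklearn.externals.joblib
from sklearn.feature_extraction.text import CountVectorizer

# Small fitted vectorizer standing in for the real one (hypothetical data)
demo_cv = CountVectorizer().fit(['spca rescuedog', 'kittens spca'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_cv.pkl')  # hypothetical file name
    joblib.dump(demo_cv, path)
    restored_cv = joblib.load(path)

# The restored vectorizer carries the same fitted vocabulary
print(restored_cv.vocabulary_ == demo_cv.vocabulary_)
```

Pickles are generally only safe to load with the same library versions that wrote them, so it is worth recording the scikit-learn version alongside the files.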

### 2. Train Various Regressors on the Hashtags and Mentions to Identify Best Options for Layer 1
Train several regressors to see which are most appropriate for the first layer. I will try the following regressors with default parameters, with the intention of ultimately tuning / boosting those that give the best results:  
   1. `Lasso`
   1. `RandomForestRegressor`
   1. `KNeighborsRegressor`
   1. `BayesianRidge`
   1. `GradientBoostingRegressor`

In [6]:
# import the layer1 training and test sets
df_layer1_train = joblib.load('../data/df_layer1_train.pkl')
df_layer1_test = joblib.load('../data/df_layer1_test.pkl')
y_layer1_train = joblib.load('../data/y_layer1_train.pkl')
y_layer1_test = joblib.load('../data/y_layer1_test.pkl')

In [7]:
# get hashtags and mentions from the training and test sets
X_tags_train = df_layer1_train.hashtags_and_mentions
X_tags_test = df_layer1_test.hashtags_and_mentions

In [8]:
from sklearn.linear_model import Lasso, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor

In [9]:
# function to fit and score a regressor
def fit_and_score_tags(data, regr, return_regr=False):
    '''
    Transforms the hashtags and mentions sentences using the fitted cv and svd, fits the regressor 
    to the transformed training data and then prints train and test scores.
    
    Parameters:
        data - iterable containing X_train, X_test, y_train, y_test
        regr - instantiated regressor
        return_regr - boolean, option to return the fit regressor (default: False)
    
    Returns: optional, regressor fit to the transformed training data
    '''
    
    print('Regressor: {}'.format(regr))
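    # vectorize the training sentences, then reduce them with the fitted cv and svd
    train_sparse = tags_cv.transform(data[0])
    X_train = tags_svd.transform(train_sparse)
    
    regr.fit(X_train, data[2])
    # regr.score returns the R^2 coefficient of determination
    print('Train score: {}'.format(regr.score(X_train, data[2])))
    
    test_sparse = tags_cv.transform(data[1])
    X_test = tags_svd.transform(test_sparse)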
    
    print('Test score: {}'.format(regr.score(X_test, data[3])))
    
    if return_regr:
        return regr
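
The same vectorize, reduce, fit, score sequence can also be expressed as a scikit-learn `Pipeline`. The sketch below is an alternative under stated assumptions, not what this notebook does: the pipeline refits the vectorizer and SVD on the training text only, whereas the function above reuses a cv and svd that were fit once on all SPCA data. The sentences and targets are toy values.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import BayesianRidge

# Toy training and test data (hypothetical sentences and targets)
train_text = [
    'adoptdontshop spca rescuedog',
    'spca kittens adoptdontshop',
    'rescuedog shelterpets',
    'kittens caturday shelterpets',
    'spca shelterpets',
    'caturday kittens',
]
y_train = np.array([3.0, 5.0, 1.0, 4.0, 2.0, 6.0])
test_text = ['spca rescuedog', 'kittens unseenhashtag']
y_test = np.array([2.0, 5.0])

# Chain the three steps; fit() fits each step on the training text in turn
pipe = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    BayesianRidge(),
)
pipe.fit(train_text, y_train)

# score() returns R^2, the same metric printed by fit_and_score_tags
print('Train score: {}'.format(pipe.score(train_text, y_train)))
print('Test score: {}'.format(pipe.score(test_text, y_test)))
```

Tokens that were not in the training vocabulary, such as `unseenhashtag` here, are ignored at transform time rather than raising an error.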

In [10]:
data = (X_tags_train, X_tags_test, y_layer1_train, y_layer1_test)

In [11]:
lasso = Lasso()
rfr = RandomForestRegressor(n_jobs=-1)
knr = KNeighborsRegressor(n_jobs=-1)
bayes = BayesianRidge()
gbr = GradientBoostingRegressor()
models = [lasso, rfr, knr, bayes, gbr]

In [12]:
for model in models:
    fit_and_score_tags(data, model)
    print()

Regressor: Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
Train score: 0.0
Test score: -0.00027090916794425546

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.5106353315805702
Test score: 0.297496539336915

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
          weights='uniform')
Train score: 0.37709008773048036
Test score: 0.21891598446317803

Regressor: BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=Fals

In [13]:
rfr_50_50 = RandomForestRegressor(n_estimators=50, min_samples_split=50, n_jobs=-1)
fit_and_score_tags(data, rfr_50_50)

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=50, min_weight_fraction_leaf=0.0,
           n_estimators=50, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.452209613957352
Test score: 0.33014432220611045


In [14]:
knr_9 = KNeighborsRegressor(n_neighbors=9, n_jobs=-1)
fit_and_score_tags(data, knr_9)

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=9, p=2,
          weights='uniform')
Train score: 0.33601431474104726
Test score: 0.2387399564012711


In [15]:
gbr_500 = GradientBoostingRegressor(n_estimators=500)
fit_and_score_tags(data, gbr_500)

Regressor: GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=500,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
Train score: 0.37664554243644355
Test score: 0.31815056740618075


### 3. Tune / Boost the Top Regressors
(All attempts are recorded below, but only the runs deemed the "winners" are shown as code above.)

#### Record of Tuning / Boosting Attempts and Results
1. `RandomForestRegressor`
    1. `n_estimators`: 5, 15, 25, 50, 75, 100 - continued improvement but negligible from 50 to 100 
    1. `min_samples_split`: 25, 50, 75 - results peaked at 50  
    **Winner:** 50 estimators, 50 samples to split  
1. `KNeighborsRegressor`
    1. `n_neighbors`: 3, 7, 9 - 9 gave the largest gain in test score  
1. `BayesianRidge`
    1. `AdaBoostRegressor` did not improve the scores
1. `GradientBoostingRegressor`
    1. `n_estimators`: 100, 500, 1000 - results peaked at 500
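
The attempts above were run one setting at a time. As a hedged alternative, a sweep like this could be automated with `GridSearchCV`, which selects parameters by cross-validation on the training data instead of by test score. The sketch uses synthetic data from `make_regression` and a deliberately small grid; neither reflects the real feature matrix or the values tried above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the SVD-transformed sentences
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Small illustrative grid over the two parameters tuned in the record above
param_grid = {
    'n_estimators': [5, 25],
    'min_samples_split': [2, 25],
}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best setting
```

Selecting on cross-validated scores keeps the test set untouched until the final evaluation, which makes the reported test score a less optimistic estimate.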
    

### Export the Best Models
Export the fitted models for use in the blender.

In [16]:
joblib.dump(rfr_50_50, '../data/6-tags_rfr.pkl')
joblib.dump(knr_9, '../data/6-tags_knr.pkl')
joblib.dump(bayes, '../data/6-tags_bayes.pkl')
joblib.dump(gbr_500, '../data/6-tags_gbr.pkl')

['../data/6-tags_gbr.pkl']
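
How the blender consumes these models is not shown in this notebook. One common arrangement, sketched here purely as an assumption, is to stack each first-layer model's predictions as one column of a feature matrix for the next layer. The data is synthetic and the two models are stand-ins for the exported ones.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the transformed hashtags and mentions
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=5.0, random_state=0
)

# Stand-ins for the exported first-layer models
layer1_models = [BayesianRidge(), KNeighborsRegressor(n_neighbors=9)]
for model in layer1_models:
    model.fit(X_demo, y_demo)

# One column of predictions per first-layer model
blend_features = np.column_stack(
    [model.predict(X_demo) for model in layer1_models]
)
print(blend_features.shape)
```

In a real stack the predictions fed to the blender would normally come from data the first-layer models were not trained on, to avoid leaking the training targets into the second layer.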