## Notebook 6 - Fit Regression Models on the Hashtags and Mentions in the Tweets 
The purpose of this notebook is to train regression models on the hashtags and mentions in the tweets for the first layer of the model stack.

#### Future Consideration:
Fit the vectorizer on the random tweets and the SPCA tweets together instead of just the SPCA tweets.

In [None]:
import pandas as pd
import numpy as np
import re
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23

### Steps to Work Through
1. Using the SPCA dataset, combine each tweet's hashtags and mentions into one sentence **(write a function for this since it will also be used on the training and test sets)**
1. Fit and transform the sentences using the count vectorizer, then fit the SVD
1. Combine hashtags and mentions for the training and test sets
1. Transform the sentences in the training and test sets using the fit count vectorizer and svd.
1. Try different regression models to identify best options
1. Boost the best options

### 1. Combine Hashtags and Mentions for SPCA Dataset

In [None]:
# get all the SPCA data for the purpose of fitting count vectorizer and svd
df_spca = pd.read_pickle('../data/3-post_eda.p')

In [None]:
# Write function to combine hashtags and mentions 
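
A minimal sketch of what this function could look like. The column names `hashtags` and `mentions`, the assumption that each cell holds a list of strings (or a whitespace-delimited string), and the toy values are all hypothetical; the actual layout of `df_spca` is not shown in this notebook and may differ.

```python
import pandas as pd


def combine_hashtags_mentions(df, hashtag_col='hashtags', mention_col='mentions'):
    '''
    Joins each row's hashtags and mentions into one space-separated sentence.

    NOTE: the default column names are assumptions for illustration only.
    '''
    def to_tokens(value):
        # Accept lists/tuples of tokens or a whitespace-delimited string;
        # anything else (None, NaN) is treated as "no tokens".
        if isinstance(value, (list, tuple)):
            return [str(v) for v in value]
        if isinstance(value, str):
            return value.split()
        return []

    return df.apply(
        lambda row: ' '.join(to_tokens(row[hashtag_col]) + to_tokens(row[mention_col])),
        axis=1,
    )


# Made-up rows standing in for the real tweets
toy = pd.DataFrame({
    'hashtags': [['adoptdontshop', 'dogs'], [], None],
    'mentions': [['spca'], ['shelter', 'rescue'], None],
})
sentences = combine_hashtags_mentions(toy)
print(sentences.tolist())
```

Returning a `Series` of strings keeps the output directly usable as input to a scikit-learn text vectorizer, and the same function can then be applied unchanged to the training and test sets.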

In [None]:
# Run the above function on the SPCA data

### 2. Count Vectorize and do SVD on the Sentences

In [None]:
# Need code here to train count vectorizer and svd on all SPCA data
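
A rough sketch of this step on made-up sentences. It uses `TfidfVectorizer` and the names `tfidf` and `svd` only because those are what the imports and `fit_and_score` in section 5 refer to; `CountVectorizer` could be swapped in with the same `fit_transform` call. The toy sentences and `n_components=2` are placeholders, not tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Stand-in for the combined hashtag/mention sentences from the SPCA data
toy_sentences = [
    'adoptdontshop dogs spca',
    'cats adoptdontshop',
    'dogs rescue spca',
    'rescue cats',
]

# Fit the vectorizer and keep the sparse document-term matrix
tfidf = TfidfVectorizer()
sparse = tfidf.fit_transform(toy_sentences)

# Fit the SVD on the sparse matrix to get dense, low-dimensional features
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(sparse)

print(sparse.shape, reduced.shape)
```

One thing to keep in mind: the default `token_pattern` only matches runs of word characters, so a leading `#` or `@` is dropped and a hashtag and a mention with the same text map to the same token. A custom `token_pattern` would be needed to keep them distinct.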

### 3. Combine Hashtags and Mentions for Training and Test Sets

In [None]:
# import the layer1 training and test sets
df_layer1_train = joblib.load('../data/df_layer1_train.pkl')
df_layer1_test = joblib.load('../data/df_layer1_test.pkl')
y_layer1_train = joblib.load('../data/y_layer1_train.pkl')
y_layer1_test = joblib.load('../data/y_layer1_test.pkl')

In [None]:
# Run combining function on training and test sets
# use X_hash_train, X_hash_test as names

### 4. Transform the Sentences for Training and Test Sets

In [None]:
# Transform the training and test sets

### 5. Train Various Regressors on the Hashtags and Mentions to Identify Best Options for Layer 1
Train various regressors to see which are most appropriate for the first layer. I will try the following regressors with default parameters and ultimately boost those that give the best results:  
   1. `Lasso`
   1. `DecisionTreeRegressor`
   1. `KNeighborsRegressor`
   1. `BayesianRidge`

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Lasso, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

In [None]:
# function to fit and score a regressor
def fit_and_score(data, regr, return_regr=False):
    '''
    Transforms the data using the pre-fit tfidf and svd (module-level objects that must already be fit), 
    fits the regressor to the transformed training data and then prints train and test R^2 scores.
    
    Parameters:
        data - iterable containing X_train, X_test, y_train, y_test
        regr - instantiated regressor
        return_regr - boolean, option to return the fit regressor (default: False)
    
    Returns: optional, regressor fit to the transformed training data
    '''
    
    print('Regressor: {}'.format(regr))
    train_sparse = tfidf.transform(data[0])
    X_train = svd.transform(train_sparse)
    
    regr.fit(X_train, data[2])
    print('Train score: {}'.format(regr.score(X_train, data[2])))
    
    test_sparse = tfidf.transform(data[1])
    X_test = svd.transform(test_sparse)
    
    print('Test score: {}'.format(regr.score(X_test, data[3])))
    
    if return_regr:
        return regr

In [None]:
data = (X_hash_train, X_hash_test, y_layer1_train, y_layer1_test)

In [None]:
lasso = Lasso()
dtr = DecisionTreeRegressor()
knr = KNeighborsRegressor(n_jobs=-1)
bayes = BayesianRidge()
models = [lasso, dtr, knr, bayes]

In [None]:
for model in models:
    fit_and_score(data, model)
    print()

### 6. Boost the Top Regressors
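
The notebook does not yet say which boosting method will be used. As one possible approach, the sketch below wraps a base regressor in `AdaBoostRegressor`. The synthetic data stands in for the SVD-reduced hashtag and mention features, and the choice of `DecisionTreeRegressor(max_depth=3)` and `n_estimators=50` is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the transformed layer 1 features and targets
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn releases (`base_estimator` in older, `estimator` in newer)
boosted = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=3),
    n_estimators=50,
    random_state=42,
)
boosted.fit(X_train, y_train)

# Same train/test R^2 report as in section 5
print('Train score: {}'.format(boosted.score(X_train, y_train)))
print('Test score: {}'.format(boosted.score(X_test, y_test)))
```

Any of the regressors from section 5 could be substituted as the base estimator, so the top performers can be compared with and without boosting.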