## Notebook 4 - Fit Regression Models on the Tweet Text
The purpose of this notebook is to process the text portion of the tweets (feature extraction and dimensionality reduction) and train regression models for the first layer of the model stack.

In [1]:
import pandas as pd
import numpy as np
import re
import joblib  # sklearn.externals.joblib was deprecated and later removed from scikit-learn

### Train Text Encoder
I gathered over one million random tweets from around the US and Canada to fit the text encoder, using the Twitter Stream API on two AWS instances. To avoid duplicates, one instance pulled tweets from the western half of the US and Canada (labeled 'west') and the other pulled from the eastern half (labeled 'east'). Consult the notebook titled `1-get-tweets-streaming` in the `ipynb` folder for the code. The random tweets are all stored in pickle files (roughly 10,000 tweets per file, with some exceptions). Here are the steps needed to train the encoder:  
1. Import text from tweets and create a huge dataframe.
1. Clean the text using the cleaner.
1. Fit the tf-idf vectorizer using the text.
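
The first step can be sketched as follows. This is a minimal illustration, assuming each pickle file holds a dataframe with a `text` column; the helper name `load_tweet_pickles` and the glob pattern are hypothetical and do not come from the original notebooks.

```python
import glob

import pandas as pd


def load_tweet_pickles(pattern):
    """Read every pickle matching `pattern` and stack them into one dataframe."""
    # sorted() keeps the file order deterministic (east-00.p, east-01.p, ...)
    frames = [pd.read_pickle(path) for path in sorted(glob.glob(pattern))]
    if not frames:
        # hypothetical fallback: return an empty dataframe when nothing matches
        return pd.DataFrame()
    # a single concat avoids repeatedly copying a growing dataframe
    return pd.concat(frames, ignore_index=True)


# hypothetical usage, assuming the files live in ../data/
# df = load_tweet_pickles('../data/east-*.p')
```

Collecting the frames in a list and concatenating once is generally cheaper than appending inside the loop, which matters with roughly a hundred files of 10,000 tweets each.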


In [None]:
# # one-time code to import the files that start with east (68 in all)
# df = pd.DataFrame()
# path = '../data/'
# for i in range(69):
#     print(i, end='\r')
#     # zero-pad single-digit indices, e.g. east-07.p
#     filename = path + 'east-{:02d}.p'.format(i)
#     df = pd.concat([df, pd.read_pickle(filename)], ignore_index=True)

In [None]:
# # one-time code to import the files that start with west (35 in all) and append to the previous dataframe
# path = '../data/'
# for i in range(36):
#     print(i, end='\r')
#     # zero-pad single-digit indices, e.g. west-07.p
#     filename = path + 'west-{:02d}.p'.format(i)
#     df = pd.concat([df, pd.read_pickle(filename)], ignore_index=True)

In [None]:
# # drop rows with empty tweets
# df.dropna(subset=['text'], inplace=True)
# df.shape

In [None]:
# # write to pickle so that all the individual files can be deleted
# df.to_pickle('../data/big_tweets.p')

In [26]:
# df_big = pd.read_pickle('../data/big_tweets.p')
df_spca = pd.read_pickle('../data/3-post_eda.p')

**Note:**  
Taking a look at some of the text data to see what sort of cleaning needs to be done.

In [None]:
print('\n\n'.join(df_spca.text.sample(10).values))

### List of considerations for text cleaning
1. **Hashtags and At Symbols:** removing just the symbols themselves, but keeping the phrases attached. Hashtags and mentions are also going to be in a separate data set on which to fit a model and ensemble with other data sets.
1. **URLs:** I will remove them entirely. Note that some begin with "http" and some do not. Email addresses should get the same treatment.
1. **Punctuation:** I will remove all punctuation, which will capture the hashtags and at symbols mentioned above.
1. **Capital Letters:** I will convert everything to lower case.
1. **Numbers:** I will replace stand-alone numbers with the string "NUMBER", but numbers that are part of a longer string will remain. For instance, 280 will become NUMBER but kourtneeybell3 will stay the same.
1. **Whitespace:** All white space will be replaced with a single space.
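
The number rule can be illustrated in isolation. The sketch below is not the cleaner used in this notebook; it is a hypothetical variant (`replace_standalone_numbers`) that uses word boundaries, and it assumes punctuation has already been stripped so that only letters, digits and spaces remain.

```python
import re


def replace_standalone_numbers(text):
    """Replace digit runs that are not attached to letters with 'NUMBER'."""
    # \b\d+\b only matches when the digits are bounded by spaces or string edges,
    # so digits glued to letters (e.g. a username) are left alone
    return re.sub(r'\b\d+\b', 'NUMBER', text)


for sample in ['i have 280 characters', 'kourtneeybell3 says hi', '2017 was a year']:
    print(replace_standalone_numbers(sample))
```

A word-boundary pattern also catches numbers at the very start or end of a tweet, which a pattern requiring whitespace on both sides would skip.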

In [27]:
# function to clean the input text
def cleaner(message):
    # raw strings keep the regex escapes intact and avoid invalid-escape warnings
    message = re.sub(r'https?:/{2}\s?[^\s]*', '', message) # remove http url's
    message = re.sub(r'[^\s]+/[^\s]+', '', message) # remove some random strings with /'s
    message = re.sub(r'[^\s]*\.com[^\s]*', '', message) # remove .com that doesn't start with http
    message = re.sub(r'[^\s]*\.net[^\s]*', '', message) # remove .net that doesn't start with http
    message = re.sub(r'\.+', ' ', message) # replace dots with space
    message = re.sub(r'[^a-z0-9 ]', '', message.lower())  # convert to lowercase and remove punctuation
    message = re.sub(r'\s+\d+\s+', ' NUMBER ', message) # replace stand-alone numbers with the string "NUMBER"
    message = re.sub(r'\s+', ' ', message) # replace whitespace with a single space
    return message

In [None]:
# compare pre-processed text to clean text to validate the cleaner function
some_tweet_text = df_spca.text.sample(10).values
for t in some_tweet_text:
    print(t)
    print(cleaner(t), '\n')

**Note:**  
Based on several 10-tweet samples, the cleaner appears to be working quite well. One issue I noticed is with URLs that contain spaces. I don't believe there's a way to eliminate those without also eliminating relevant text that follows a URL. Here's an example:  

```
RT @aerocar: @aerocar & @HighendLimo are proud supporters of @pawsforacause ! http:// support.spca.bc.ca/site/TR?pg=ent ry&fr_id=1424 …

rt aerocar aerocar highendlimo are proud supporters of pawsforacause ryfrid1424
```

Overall, I believe the cleaner works well enough, so it is time to clean the text for both dataframes.

In [None]:
df_big['text'] = df_big.text.apply(cleaner)

In [28]:
df_spca['text'] = df_spca.text.apply(cleaner)

### Data Split
Because I intend to use stacking, I have to split the approximately 80k rows of data as follows:
- 50k to be used to train (40k) and test (10k) the models in the first layer of the stack.  
- 30k to be used to train (24k) and test (6k) the blender.

In [29]:
from sklearn.model_selection import train_test_split

In [30]:
# split the data into layer1 set (63%) and blender set (37%) 
df_layer1, df_blender, y_layer1, y_blender = train_test_split(df_spca.drop('retweets', axis=1), df_spca.retweets, 
                                                              test_size=.37, random_state=55)

# split each of the layer1 set and blender set into train and test sets
df_layer1_train, df_layer1_test, y_layer1_train, y_layer1_test = train_test_split(df_layer1, y_layer1, 
                                                                                  test_size=.2, random_state=55)
df_blender_train, df_blender_test, y_blender_train, y_blender_test = train_test_split(df_blender, y_blender, 
                                                                                      test_size=.2, random_state=55)

print(df_layer1_train.shape, df_layer1_test.shape, y_layer1_train.shape, y_layer1_test.shape)
print(df_blender_train.shape, df_blender_test.shape, y_blender_train.shape, y_blender_test.shape)

(40740, 19) (10186, 19) (40740,) (10186,)
(23928, 19) (5982, 19) (23928,) (5982,)
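
The printed shapes can be cross-checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set; the helper `split_sizes` is hypothetical, and the total row count is simply the sum of the four shapes printed above.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# total rows, taken from the shapes printed above
n_total = 40740 + 10186 + 23928 + 5982

# first split: layer 1 set (63%) versus blender set (37%)
n_layer1, n_blender = split_sizes(n_total, 0.37)
print(n_layer1, n_blender)

# second split: 80/20 train/test inside each set
print(split_sizes(n_layer1, 0.2))
print(split_sizes(n_blender, 0.2))
```

Under that rounding assumption the sizes line up with the output above: roughly 50k rows for the first layer and 30k for the blender, each divided 80/20.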


In [31]:
# output the split data to pickle files for use in other notebooks
joblib.dump(df_layer1_train, '../data/df_layer1_train.pkl')
joblib.dump(df_layer1_test, '../data/df_layer1_test.pkl')
joblib.dump(y_layer1_train, '../data/y_layer1_train.pkl')
joblib.dump(y_layer1_test, '../data/y_layer1_test.pkl')
joblib.dump(df_blender_train, '../data/df_blender_train.pkl')
joblib.dump(df_blender_test, '../data/df_blender_test.pkl')
joblib.dump(y_blender_train, '../data/y_blender_train.pkl')
joblib.dump(y_blender_test, '../data/y_blender_test.pkl')

['../data/y_blender_test.pkl']

### Train Various Regressors on the Text Data to Identify Best Options for Layer 1
1. Fit the `TfidfVectorizer` to the random tweets with `stop_words='english'` and all other parameters at their defaults for the first run. Repeat with different values for `ngram_range`, `min_df` and `max_features`.
1. Grab just the SPCA text data as the predictor.  
1. Transform the SPCA training text using the fitted tfidf vectorizer and svd.
1. Train various regressors to see which are most appropriate to use for the first layer. I will try the following regressors with default parameters, and the intention is to ultimately tune / boost those that lead to best results:  
    1. `Lasso`
    1. `RandomForestRegressor`
    1. `KNeighborsRegressor`
    1. `BayesianRidge`
    1. `GradientBoostingRegressor`
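
The steps above can be sketched end to end on a toy corpus. Everything in this example is made up for illustration: the sentences, the retweet counts, the variable names prefixed with `toy_`, and the choice of 3 SVD components (the notebook uses 500 on the real data). It only shows the order of operations: fit the encoder on background text, then transform the target text and fit a regressor.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import BayesianRidge

# stand-in for the random background tweets used to fit the encoder
toy_background = [
    'coffee shop opens downtown today',
    'traffic jam on the highway again',
    'new puppy adoption event this weekend',
    'rain expected across the coast tonight',
    'local shelter needs dog food donations',
    'concert tickets on sale tomorrow morning',
]

# stand-in for the cleaned SPCA tweets and their retweet counts
toy_text = [
    'adopt a puppy from the shelter this weekend',
    'shelter dog needs a home',
    'donations of food help every dog',
    'adoption event downtown',
]
toy_y = [12, 3, 7, 5]

# steps 1: fit tf-idf and svd on the background text only
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_svd = TruncatedSVD(n_components=3, random_state=0)
toy_svd.fit(toy_tfidf.fit_transform(toy_background))

# steps 2-3: transform the target text with the already fitted encoder
X_toy = toy_svd.transform(toy_tfidf.transform(toy_text))

# step 4: fit a regressor on the dense, reduced features
toy_regr = BayesianRidge().fit(X_toy, toy_y)
toy_preds = toy_regr.predict(X_toy)
print(X_toy.shape, toy_preds.shape)
```

Fitting the encoder on a separate, larger corpus means the vocabulary and SVD components are not tuned to the labeled data, so the same fitted objects can be reused unchanged for the test and blender sets.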

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Lasso, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor

In [None]:
# only need to run this once
# fit random twitter text data with tfidf and svd
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,1), min_df=1, max_features=None)
big_sparse = tfidf.fit_transform(df_big.text)
display(big_sparse.shape)
svd = TruncatedSVD(500)
svd.fit(big_sparse)

#### Importing TF-IDF, SVD and Data Instead of Splitting and Fitting Again

In [3]:
# load the prefit tfidf and svd 
tfidf = joblib.load('../data/4-text_tfidf.pkl')
svd = joblib.load('../data/4-text_svd.pkl')

# import the layer1 training and test sets
df_layer1_train = joblib.load('../data/df_layer1_train.pkl')
df_layer1_test = joblib.load('../data/df_layer1_test.pkl')
y_layer1_train = joblib.load('../data/y_layer1_train.pkl')
y_layer1_test = joblib.load('../data/y_layer1_test.pkl')

In [4]:
# get the text data from the train and test sets for first layer
X_layer1_train = df_layer1_train.text
X_layer1_test = df_layer1_test.text

In [5]:
# function to fit and score a regressor
def fit_and_score_text(data, regr, return_regr=False):
    '''
    Transforms the data using the fitted tfidf and svd, fits the regressor to the transformed 
    training data and then prints train and test scores.
    
    Parameters:
        data - iterable containing X_train, X_test, y_train, y_test
        regr - instantiated regressor
        return_regr - boolean, option to return the fit regressor (default: False)
    
    Returns: optional, regressor fit to the transformed training data
    '''
    
    print('Regressor: {}'.format(regr))
    train_sparse = tfidf.transform(data[0])
    X_train = svd.transform(train_sparse)
    
    regr.fit(X_train, data[2])
    print('Train score: {}'.format(regr.score(X_train, data[2])))
    
    test_sparse = tfidf.transform(data[1])
    X_test = svd.transform(test_sparse)
    
    print('Test score: {}'.format(regr.score(X_test, data[3])))
    
    if return_regr:
        return regr

In [6]:
data = (X_layer1_train, X_layer1_test, y_layer1_train, y_layer1_test)

In [7]:
lasso = Lasso()
rfr = RandomForestRegressor(n_jobs=-1)
knr = KNeighborsRegressor(n_jobs=-1)
bayes = BayesianRidge()
gbr = GradientBoostingRegressor()
models = [lasso, rfr, knr, bayes, gbr]

In [8]:
for model in models:
    fit_and_score_text(data, model)
    print()

Regressor: Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
Train score: 0.0
Test score: -0.00027090916794425546

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.816431248796844
Test score: 0.38446032879585323

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
          weights='uniform')
Train score: 0.5455633208870001
Test score: 0.3532456296611669

Regressor: BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,

### Tuning TF-IDF and SVD 
I used the base regressor models to compare scores after each of the following changes to the tfidf and svd:
1. `TfidfVectorizer(min_df = 10)` - ever so slightly worse
1. `TfidfVectorizer(max_features = 50000)` - ever so slightly worse
1. `TfidfVectorizer(ngram_range = (2,2))` - significantly worse
1. `TruncatedSVD(n_components=300)` - slightly worse

So I will not be making any changes to the tfidf or svd.

In [None]:
# pickle the tfidf and svd for later use
joblib.dump(tfidf, '../data/4-text_tfidf.pkl')
joblib.dump(svd, '../data/4-text_svd.pkl')

### Tune / Boost the Top Regressors
(All attempts are documented below, but I'm showing only those that were deemed the "winners".)

#### Record of Tuning / Boosting Attempts and Results
1. `RandomForestRegressor`
    - `n_estimators`: 5, 15, 25, 50, 75 - optimal at 50 (75 was negligibly better but cost more)  **Winner:** 50 estimators
1. `KNeighborsRegressor`
    - `n_neighbors`: 3, 7, 9, 11, 13 - results peaked at 11  **Winner:** 11 neighbors
1. `BayesianRidge`
    - `AdaBoostRegressor` did not improve the scores
1. `GradientBoostingRegressor`
    - `n_estimators`: 100, 500, 750, 1000 - big jump from 500 to 750, small from 750 to 1000  **Winner:** 1000 estimators
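
A sweep like the ones recorded above can be sketched as a small loop. This is an illustration on synthetic data, not the code used to produce the results in this notebook; the helper `sweep`, the synthetic dataset and the variable names are all assumptions. As in the comparisons above, the candidate with the best held-out score is picked.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for the svd-transformed text features
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=55)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=.2, random_state=55)


def sweep(make_regr, values):
    """Fit one regressor per candidate value and collect (train, test) R^2 scores."""
    scores = {}
    for value in values:
        regr = make_regr(value).fit(X_tr, y_tr)
        scores[value] = (regr.score(X_tr, y_tr), regr.score(X_te, y_te))
    return scores


# same candidate values as the KNeighborsRegressor record above
knn_scores = sweep(lambda k: KNeighborsRegressor(n_neighbors=k), [3, 7, 9, 11, 13])
best_k = max(knn_scores, key=lambda k: knn_scores[k][1])

for k, (train_score, test_score) in knn_scores.items():
    print('n_neighbors={}: train {:.3f}, test {:.3f}'.format(k, train_score, test_score))
print('best by test score:', best_k)
```

Recording both train and test scores for each candidate makes it easier to spot settings that only improve the fit to the training data.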

In [14]:
rfr_50 = RandomForestRegressor(n_estimators=50, n_jobs=-1)
fit_and_score_text(data, rfr_50)

Regressor: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Train score: 0.8393992619203053
Test score: 0.43068263192153317


In [35]:
knr_11 = KNeighborsRegressor(n_neighbors=11, n_jobs=-1)
fit_and_score_text(data, knr_11)

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=11, p=2,
          weights='uniform')
Train score: 0.47126003469565186
Test score: 0.3805232310087995


In [43]:
gbr_1000 = GradientBoostingRegressor(n_estimators=1000)
fit_and_score_text(data, gbr_1000)

Regressor: GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
Train score: 0.5915463933550908
Test score: 0.4284726172114104


### Export the Best Models
Exporting the models to be used in the blender.

In [45]:
joblib.dump(rfr_50, '../data/4-text_rfr.pkl')
joblib.dump(knr_11, '../data/4-text_knr.pkl')
joblib.dump(bayes, '../data/4-text_bayes.pkl')
joblib.dump(gbr_1000, '../data/4-text_gbr.pkl')

['../data/4-text_gbr.pkl']