## Notebook 4 - Fit Models on the Tweet Text
The purpose of this notebook is to process the text portion of the tweets (feature extraction and dimensionality reduction) and train models.

In [None]:
import pandas as pd
import numpy as np
import re

### Train Text Encoder
I gathered over one million random tweets from across the US and Canada to use for fitting the text encoder. I did this using the Twitter Streaming API and two different AWS instances. To avoid duplicates, one instance pulled tweets from the western half of the US and Canada (labeled 'west') and the other pulled from the eastern half (labeled 'east'). See the notebook titled `1-get-tweets-streaming` in the `ipynb` folder for the code. The random tweets are all stored in pickle files (roughly 10,000 tweets per file, with some exceptions). Here are the steps needed to train the encoder:  
1. Import text from tweets and create a huge dataframe.
1. Clean the text using the cleaner.
1. Fit the tf-idf vectorizer using the text.
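
Step 1 above could be sketched roughly as follows. This is only an illustration, not the code used for this project: the helper name `load_tweet_text` and its `prefixes` argument are hypothetical, and it assumes every pickle holds a dataframe with a `text` column.

```python
import glob
import os

import pandas as pd


def load_tweet_text(data_dir, prefixes=('east', 'west')):
    """Concatenate the 'text' column of every matching pickle into one dataframe."""
    frames = []
    for prefix in prefixes:
        # sorted() keeps the file order deterministic
        pattern = os.path.join(data_dir, '{}-*.p'.format(prefix))
        for filename in sorted(glob.glob(pattern)):
            frames.append(pd.read_pickle(filename)[['text']])
    # a single concat avoids re-copying the dataframe on every loop iteration
    return pd.concat(frames, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` if no files match, which is a useful signal that the path or prefix is wrong.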


In [None]:
# # one-time code to import the files that start with east (68 in all)
# df = pd.DataFrame()
# path = '../data/'
# for i in range(69):
#     print(i, end='\r')
#     # zero-pad the file index to two digits (east-00.p, east-01.p, ...)
#     filename = path + 'east-{:02d}.p'.format(i)
#     # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
#     df = pd.concat([df, pd.read_pickle(filename)[['text']]], ignore_index=True)

In [None]:
# # one-time code to import the files that start with west (35 in all) and append to the dataframe
# path = '../data/'
# for i in range(36):
#     print(i, end='\r')
#     # zero-pad the file index to two digits (west-00.p, west-01.p, ...)
#     filename = path + 'west-{:02d}.p'.format(i)
#     # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
#     df = pd.concat([df, pd.read_pickle(filename)[['text']]], ignore_index=True)

In [None]:
# # write to pickle so that all the individual files can be deleted
# df.to_pickle('../data/big_tweets.p')

In [None]:
df_big = pd.read_pickle('../data/big_tweets.p')
df = pd.read_pickle('../data/3-post_eda.p')
df_big.shape, df.shape

**Note:**  
A look at a sample of the raw text shows what sort of cleaning needs to be done.

In [None]:
# print('\n\n'.join(df.text.sample(10).values))

### List of considerations for text cleaning
1. **Hashtags and At Symbols:** I will remove just the symbols themselves and keep the words attached to them. Hashtags and mentions will also go into a separate data set, on which I will fit a model to ensemble with the models from the other data sets.
1. **URLs:** I will get rid of them entirely. Note that some begin with "http" and some do not. Email addresses should get the same treatment.
1. **Punctuation:** I will remove all punctuation, which also takes care of the hashtags and at symbols mentioned above.
1. **Capital Letters:** I will convert everything to lower case.
1. **Numbers:** I'm going to replace stand-alone numbers with the string "NUMBER", but numbers that are part of a longer string will remain. For instance, 280 will become NUMBER but kourtneeybell3 will stay the same.
1. **Whitespace:** Every run of whitespace will be replaced with a single space.
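
The number rule (item 5) is the least obvious of these, so here is a minimal sketch of it in isolation. The helper name `replace_standalone_numbers` is hypothetical, and the pattern is one possible way to express "stand-alone": digits with whitespace or the edge of the string on both sides.

```python
import re


def replace_standalone_numbers(text):
    # a number is stand-alone when it has whitespace (or the string edge) on both sides
    return re.sub(r'(?<!\S)\d+(?!\S)', 'NUMBER', text)


print(replace_standalone_numbers('280 characters from kourtneeybell3'))
# → NUMBER characters from kourtneeybell3
```

Because the lookarounds do not consume the surrounding whitespace, back-to-back numbers such as `1 2 3` are each replaced.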

In [None]:
# function to clean the input text
def cleaner(message):
    message = re.sub(r'https?://\s?\S*', '', message)  # remove http(s) URLs
    message = re.sub(r'\S+/\S+', '', message)  # remove remaining strings that contain a slash
    message = re.sub(r'\S*\.com\S*', '', message)  # remove .com addresses that don't start with http
    message = re.sub(r'\S*\.net\S*', '', message)  # remove .net addresses that don't start with http
    message = re.sub(r'\.+', ' ', message)  # replace dots with a space
    # convert to lowercase and remove punctuation (keep all whitespace so words split by newlines are not merged)
    message = re.sub(r'[^a-z0-9\s]', '', message.lower())
    # replace stand-alone numbers, including those at the start or end of the text, with "NUMBER"
    message = re.sub(r'(?<!\S)\d+(?!\S)', 'NUMBER', message)
    message = re.sub(r'\s+', ' ', message)  # replace each run of whitespace with a single space
    return message

In [None]:
# compare pre-processed text to clean text to validate the cleaner function
some_tweet_text = df.text.sample(10).values
for t in some_tweet_text:
    print(t)
    print(cleaner(t), '\n')

**Note:**  
Based on several 10-tweet samples, the cleaner appears to be working quite well. One issue I notice is with URLs that contain multiple spaces. I don't believe there's a way to eliminate those without also eliminating relevant text that follows a URL. Here's an example:  

```
RT @aerocar: @aerocar & @HighendLimo are proud supporters of @pawsforacause ! http:// support.spca.bc.ca/site/TR?pg=ent ry&fr_id=1424 …

rt aerocar aerocar highendlimo are proud supporters of pawsforacause ryfrid1424
```

Overall, I believe the cleaner works well enough, so it's time to clean the text in both dataframes.

In [None]:
df_big['text'] = df_big.text.apply(cleaner)

In [None]:
df['text'] = df.text.apply(cleaner)

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

In [None]:
# split the data into train and test sets
df_train, df_test, y_train, y_test = train_test_split(df.drop('retweets', axis=1), df.retweets, 
                                                      test_size=.3, random_state=55)

### Train Various Regressors on the Text Data to Identify Best Options 
1. Fit the `TfidfVectorizer` to the random tweets with `stop_words='english'` and all other parameters left at their defaults.
1. Grab just the SPCA text data as the predictor.  
1. Transform the SPCA training text using the fitted tf-idf vectorizer. Check the dimensions of the sparse matrix to see if adjustments need to be made to the vectorizer (e.g. `ngram_range`, `min_df`, `max_features`).
1. Perform dimensionality reduction using `TruncatedSVD` with all components.
1. Train various regressors to see which are most appropriate to try to tune. I will try the following regressors:  
    1. `Lasso`
    1. `DecisionTreeRegressor`
    1. `KNeighborsRegressor`
    1. `BayesianRidge`
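
The plan above can be sketched end to end on toy data. Everything in this example is made up for illustration (the two small corpora, the retweet counts, `n_components=3`, `n_neighbors=2`), and it only shows the order of the steps, not the settings used on the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import BayesianRidge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# toy stand-ins for the random tweets and the SPCA tweets
big_corpus = [
    'adopt a rescue dog today', 'cats and dogs need homes', 'donate to the shelter',
    'rainy day in vancouver', 'coffee before work', 'puppy photos make my day',
]
spca_text = ['adopt a dog', 'donate today', 'cats need homes',
             'puppy photos', 'shelter needs coffee', 'rainy day dogs']
retweets = [5, 2, 3, 8, 1, 4]

# steps 1-3: fit the vectorizer on the big corpus, then transform the smaller data set
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(big_corpus)
sparse = tfidf.transform(spca_text)

# step 4: reduce dimensionality (n_components must not exceed the number of features)
svd = TruncatedSVD(n_components=3, random_state=55)
X = svd.fit_transform(sparse)

# step 5: fit each candidate and record its R^2 on the training data
scores = {}
candidates = (Lasso(), DecisionTreeRegressor(random_state=55),
              KNeighborsRegressor(n_neighbors=2), BayesianRidge())
for regr in candidates:
    regr.fit(X, retweets)
    scores[type(regr).__name__] = regr.score(X, retweets)
print(scores)
```

Training scores alone say little about generalization, which is why the comparison also needs a score on held-out test data.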

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression, Lasso, BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

In [None]:
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,1), min_df=1, max_features=None)
tfidf.fit(df_big.text)

In [None]:
# get the text data from the train and test sets
X_train = df_train.text
X_test = df_test.text

In [None]:
sparse_train = tfidf.transform(X_train)
sparse_train.shape
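
The number of columns in that shape is the size of the vectorizer's vocabulary, so it is the figure to watch when adjusting `ngram_range`, `min_df` or `max_features`. Here is a small sketch on a made-up corpus showing how each setting changes the column count. The corpus and the specific values are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus, made up for illustration
corpus = ['adopt a dog', 'adopt a cat', 'dog park today', 'cat photos today', 'adopt today']

settings = {
    'default': {},
    'min_df=2': {'min_df': 2},
    'ngram_range=(1, 2)': {'ngram_range': (1, 2)},
    'max_features=3': {'max_features': 3},
}

shapes = {}
for label, kwargs in settings.items():
    vec = TfidfVectorizer(**kwargs).fit(corpus)
    # the number of columns in the transformed matrix equals the vocabulary size
    shapes[label] = vec.transform(corpus).shape
    print(label, shapes[label])
```

Raising `min_df` or capping `max_features` shrinks the matrix, while a wider `ngram_range` grows it quickly.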

In [None]:
# function to train and score various regressors with varying ngram ranges
def try_regressor(data, regr, ngram):
    '''
    Fits tfidf, svd and regressor to the training data then prints train and test score.
    
    Parameters:
        data - iterable containing X_train, X_test, y_train, y_test
        regr - instantiated regressor
        ngram - tuple for the ngram_range to be used in TfidfVectorizer
    
    Returns: regressor fit to the training data
    '''
    tfidf = TfidfVectorizer(stop_words='english', ngram_range=ngram)
    sparse_train = tfidf.fit_transform(data[0])
    svd = TruncatedSVD(1000)  # ideally able to use max number of features
    X_train = svd.fit_transform(sparse_train)
    
    print('Training {} with ngram = {}.'.format(regr, ngram))
    # fit regressor to training data
    regr.fit(X_train, data[2])
    
    # score the model on the training data
    print('Train score: {}'.format(regr.score(X_train, data[2])))
    
    # transform the test data
    sparse_test = tfidf.transform(data[1])
    X_test = svd.transform(sparse_test)
    
    # score the model on the test data
    print('Test score: {}'.format(regr.score(X_test, data[3])))
    
    return regr

In [None]:
# prep data to pass into fitting function
data = (X_train, X_test, y_train, y_test)

In [None]:
lasso = Lasso()
try_regressor(data, regr=lasso, ngram=(1,1))

In [None]:
# create the pipeline
linreg_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD()),
    ('regr', LinearRegression())
])

In [None]:
# create the parameters dictionary
linreg_params = {
    'tfidf__norm': ['l1', 'l2'],
    'tfidf__ngram_range': [(1,1), (1,2), (1,3)],
    'tfidf__min_df': [1, 3, 5],
    'svd__n_components': [2, 10, 20]
}

In [None]:
# perform the grid search
linreg_gs = GridSearchCV(linreg_pipe, linreg_params, n_jobs=-1, verbose=1)
linreg_gs.fit(X_train, y_train)

In [None]:
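from pprint import pprint

def get_gs_results(params, model, xtest, ytest):
    '''Prints the parameter grid, the best parameters and the scores of a fitted grid search.'''
    print('Parameters used:')
    pprint(params)  # use the argument rather than the global linreg_params
    best_parameters = model.best_estimator_.get_params()
    print('Best parameters:')
    for p_name in sorted(params.keys()):
        print("\t{}: {}".format(p_name, best_parameters[p_name]))
    # best_score_ is the mean cross-validated score of the best parameter combination
    print('Train score (mean CV): {}'.format(model.best_score_))
    print('Test score: {}'.format(model.score(xtest, ytest)))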

In [None]:
get_gs_results(linreg_params, linreg_gs, X_test, y_test)

In [None]:
pd.DataFrame(linreg_gs.cv_results_).sort_values('rank_test_score')
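
The full `cv_results_` table has many columns, so a trimmed view can be easier to read. The sketch below runs a small grid search on made-up data and keeps only a few columns. The data, the grid and `cv=3` are assumptions for illustration. With a regressor and no `scoring` argument, `GridSearchCV` scores each fold with the estimator's own `score` method, which is R^2 here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy data, made up for illustration
rng = np.random.RandomState(55)
words = ['adopt', 'dog', 'cat', 'shelter', 'donate', 'puppy', 'kitten', 'rescue']
texts = [' '.join(rng.choice(words, size=4)) for _ in range(30)]
y = rng.poisson(3, size=30)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(random_state=55)),
    ('regr', LinearRegression()),
])
params = {'tfidf__norm': ['l1', 'l2'], 'svd__n_components': [2, 4]}
gs = GridSearchCV(pipe, params, cv=3).fit(texts, y)

# keep only the columns needed to compare parameter combinations
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(gs.cv_results_)[cols].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

A large `std_test_score` relative to the gaps between `mean_test_score` values suggests the ranking of parameter combinations is not very stable.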