## Notebook 4 - Text Processing
The purpose of this notebook is to process the text portion of the tweets in preparation for training models.

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
df = pd.read_pickle('../data/3-post_eda.p')

**Note:**  
Taking a look at some of the text data to see what sort of cleaning needs to be done.

In [None]:
print('\n\n'.join(df.text.sample(10).values))

### List of considerations for text cleaning
1. **Hashtags and At Symbols:** I will remove just the symbols themselves but keep the phrases attached to them. Hashtags and mentions will also go into a separate data set on which to fit a model and ensemble with other data sets.
1. **URLs:** I will get rid of them entirely. Note that some begin with "http" and some do not. Email addresses should get the same treatment.
1. **Punctuation:** I will remove all punctuation, which will capture the hashtags and at symbols mentioned above.
1. **Capital Letters:** I will convert everything to lower case.
1. **Numbers:** I'm going to replace stand-alone numbers with the string "NUMBER", but numbers that are part of a longer string will remain. For instance, 280 will become NUMBER but kourtneeybell3 will stay the same.
1. **Whitespace:** All white space will be replaced with a single space.
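
The **Numbers** rule can be sketched with regex word boundaries. This is a minimal illustration, assuming the text has already been lowercased and stripped of punctuation; the helper name `replace_standalone_numbers` is hypothetical and not part of the notebook. For comparison, it also shows that a whitespace-delimited pattern such as `\s+\d+\s+` misses numbers at the edges of the string and every other number in a run.

```python
import re

def replace_standalone_numbers(text):
    # \b\d+\b matches a run of digits only when it is not attached to letters,
    # so "280" matches but the "3" in "kourtneeybell3" does not.
    return re.sub(r'\b\d+\b', 'NUMBER', text)

def replace_whitespace_delimited(text):
    # Requires whitespace on both sides, so it skips numbers at the start or
    # end of the string and the second of two adjacent numbers.
    return re.sub(r'\s+\d+\s+', ' NUMBER ', text)

for sample in ['280 characters', 'follow kourtneeybell3', 'top 10 20 30']:
    print(replace_standalone_numbers(sample), '|', replace_whitespace_delimited(sample))
# NUMBER characters | 280 characters
# follow kourtneeybell3 | follow kourtneeybell3
# top NUMBER NUMBER NUMBER | top NUMBER 20 30
```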

In [None]:
# function to clean the input text
def cleaner(message):
    message = re.sub(r'https?://\s?[^\s]*', '', message) # remove http URLs
    message = re.sub(r'[^\s]+/[^\s]+', '', message) # remove remaining strings that contain a slash
    message = re.sub(r'[^\s]*\.com[^\s]*', '', message) # remove .com addresses that don't start with http
    message = re.sub(r'[^\s]*\.net[^\s]*', '', message) # remove .net addresses that don't start with http
    message = re.sub(r'\.+', ' ', message) # replace dots with a space
    message = re.sub(r'[^a-z0-9\s]', '', message.lower()) # lowercase, then drop everything except letters, digits and whitespace
    message = re.sub(r'\b\d+\b', 'NUMBER', message) # replace stand-alone numbers with the string "NUMBER"
    message = re.sub(r'\s+', ' ', message) # replace any run of whitespace with a single space
    return message

In [None]:
some_tweet_text = df.text.sample(10).values
for t in some_tweet_text:
    print(t)
    print(cleaner(t), '\n')

**Note:**  
Based on several 10-tweet samples, the cleaner appears to work quite well. One issue I notice is with URLs that contain multiple spaces. I don't believe there's a way to eliminate those without also eliminating relevant text that follows a URL. Here's an example:  

```
RT @aerocar: @aerocar & @HighendLimo are proud supporters of @pawsforacause ! http:// support.spca.bc.ca/site/TR?pg=ent ry&fr_id=1424 …

rt aerocar aerocar highendlimo are proud supporters of pawsforacause ryfrid1424
```
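
To see where the leftover fragment comes from, the URL step can be run on its own. The sketch below uses a pattern equivalent to the first substitution in `cleaner`; the variable names are illustrative only. The pattern tolerates at most one whitespace character after `http://`, so everything after the second space in the URL survives.

```python
import re

# The tweet from the example above, whose URL is broken up by two spaces.
tweet = ("RT @aerocar: @aerocar & @HighendLimo are proud supporters of "
         "@pawsforacause ! http:// support.spca.bc.ca/site/TR?pg=ent ry&fr_id=1424 …")

# "http(s)://", at most one whitespace character, then everything up to
# the next whitespace.
url_pattern = re.compile(r'https?://\s?[^\s]*')

after_url_step = url_pattern.sub('', tweet)

# Whatever follows the "!" is what the URL step failed to remove.
leftover = after_url_step.split('!')[-1].strip()
print(repr(leftover))
```

The later punctuation step then reduces `ry&fr_id=1424` to `ryfrid1424`, which is the stray token in the cleaned output.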

Overall, I believe the cleaner works well enough, so I'm going to apply it to the text column and write the result to a pickle.

In [None]:
df['text'] = df.text.apply(cleaner)

In [None]:
df = df.loc[:,['text', 'retweets']]
df.to_pickle('../data/4-clean_text.p')

### Fit a Linear Regression Model 
Split the data into training and test sets, being sure to set a random seed for reproducibility. Then build a pipeline and grid search over the hyperparameters of the following steps:  
1. Feature extraction using `TfidfVectorizer`.
1. Dimensionality reduction using `TruncatedSVD`.
1. Regression using `LinearRegression`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

In [None]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.text, df.retweets, test_size=.3, random_state=55)

In [None]:
# create the pipeline
linreg_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD(random_state=55)), # fixed seed, since the default solver is randomized
    ('regr', LinearRegression())
])

In [None]:
# create the parameters dictionary
linreg_params = {
    'tfidf__norm': ['l1', 'l2'],
    'tfidf__ngram_range': [(1,1), (1,2), (1,3)],
    'tfidf__min_df': [1, 3, 5],
    'svd__n_components': [2, 10, 20]
}

In [None]:
# perform the grid search
linreg_gs = GridSearchCV(linreg_pipe, linreg_params, n_jobs=-1, verbose=1)
linreg_gs.fit(X_train, y_train)
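
The size of this search can be estimated ahead of time. The sketch below counts the candidates in the grid using only the standard library. The 5-fold figure is an assumption based on the default `cv` of recent scikit-learn releases (older releases defaulted to 3 folds), and `param_grid` is simply a copy of `linreg_params`.

```python
from itertools import product

# Copy of the grid defined above.
param_grid = {
    'tfidf__norm': ['l1', 'l2'],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__min_df': [1, 3, 5],
    'svd__n_components': [2, 10, 20],
}

# Every combination of one value per parameter is one candidate pipeline.
n_candidates = len(list(product(*param_grid.values())))

# Assumption: default 5-fold cross-validation.
n_folds = 5
total_fits = n_candidates * n_folds

print('{} candidates x {} folds = {} fits'.format(n_candidates, n_folds, total_fits))
# 54 candidates x 5 folds = 270 fits
```

Since the wider n-gram ranges make each fit more expensive, trimming `tfidf__ngram_range` is one way to shorten the search if it runs too long.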

In [None]:
from pprint import pprint

def get_gs_results(params, model, xtest, ytest):
    # print the searched grid, the winning parameter values, and the scores
    print('Parameters used:')
    pprint(params)
    best_parameters = model.best_estimator_.get_params()
    print('Best parameters:')
    for p_name in sorted(params.keys()):
        print("\t{}: {}".format(p_name, best_parameters[p_name]))
    # best_score_ is the mean cross-validated score of the best candidate (R^2 by default for a regressor)
    print('Best CV score: {}'.format(model.best_score_))
    print('Test score: {}'.format(model.score(xtest,ytest)))

In [None]:
get_gs_results(linreg_params, linreg_gs, X_test, y_test)

In [None]:
pd.DataFrame(linreg_gs.cv_results_).sort_values('rank_test_score')