## ToDos

* Review term frequency in major and minor classes
* Pre-compute DTM and sampling for training data sets
* LDA will take a long time. Can I fit on a smaller subset and then update the model with `partial_fit`?
* Figure out AWS
* Use spaCy to transform documents into vectorized form. This will bypass count/tf-idf -> PCA/LDA pipeline
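
For the first item, a minimal sketch of a per-class term-frequency review, using only the standard library. The toy documents, the labels, and the helper name `term_freq_by_class` are assumptions for illustration; the token pattern mirrors the one used by the tokenizer later in this notebook, without stemming.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b[a-z][a-z]+\b")  # lowercase alphabetic tokens, 2+ letters

def term_freq_by_class(docs, labels):
    """Return {label: Counter of token counts} over the documents in each class."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(TOKEN.findall(doc.lower()))
    return counts

# Toy reviews; label 1 stands in for the minority ("trend") class.
docs = ["great toy great price", "broke after one day", "my kids love this toy"]
labels = [0, 0, 1]

tf = term_freq_by_class(docs, labels)
print(tf[0].most_common(3))
print(tf[1].most_common(3))
```

Comparing the two counters side by side shows which terms are shared and which appear in only one class.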


## Fixed Data
1. CountVectorizer
2. tf-idf
3. spaCy word embeddings
4. Sampling

Build out all of the above data structures, then pickle the class. I can then reload the class and run the model pipelines with the right training data sets.

How do I maintain the test data?
* Create models to transform the text later.
* Transform and save the data
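
Both options depend on fitting the transformer on the training data only. The sketch below illustrates the two choices with a dependency-free stand-in for a vectorizer; `fit_vocab` and `transform` are hypothetical helpers, not scikit-learn calls, and the documents are made up.

```python
import pickle

def fit_vocab(docs):
    """Learn a word -> column index mapping from the training documents only."""
    vocab = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(vocab)}

def transform(index, docs):
    """Count known words per document; words unseen during fitting are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(index)
        for w in d.split():
            if w in index:
                row[index[w]] += 1
        rows.append(row)
    return rows

train_docs = ["fun toy", "toy broke"]
test_docs = ["fun fun puzzle"]

index = fit_vocab(train_docs)  # fitted state learned from the training set

# Option 1: persist the fitted state and transform the test text later.
restored = pickle.loads(pickle.dumps(index))
X_test_later = transform(restored, test_docs)

# Option 2: transform now and persist the resulting matrix.
X_test_saved = pickle.loads(pickle.dumps(transform(index, test_docs)))

print(X_test_later, X_test_saved)
```

Either way the test rows have the same columns as the training matrix, and test-only words such as "puzzle" never become features.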


### Pipelines
1. Count/tf-idf -> PCA / LDA -> Supervised Learning
  * stemming applied
  * English words only
2. Word Embeddings -> Supervised Learning

In [2]:
import AmazonReviews

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
ar = AmazonReviews.AmazonReviews()

PATH = '../data/amazon_reviews_us_Toys_v1_00.tsv'
# ar.load_data(PATH)

# ar.calc_trend_score()

# ar.create_observations()

# # ar.create_train_test_split(train_reduction=.1)
# ar.create_train_test_split()
# ar.dump_models()

ar = ar.load_models()

Read from pickle...


Review the distribution of the first review date. 1/1/2014 is the most common value, at about 3% of rows. May need to move the cutoff date.

2014-01-01    0.030230
2014-01-02    0.029510
2014-01-03    0.026328
2014-01-04    0.016026
2014-01-07    0.014318

In [None]:
# ar.reviews_selected_df.min_review_date.value_counts(normalize=True).sort_values(ascending = False).head()

In [None]:
# ar.product_trend_df[ar.product_trend_df.trend == 1].describe()

## DTM / Sampling Creation

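Create the DTM and restrict features to English-looking tokens. The tokenizer below keeps lowercase alphabetic tokens of two or more letters and stems them; it does not yet filter against `nltk.corpus.words`.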

In [5]:
from nltk.corpus import words, stopwords
from nltk import SnowballStemmer
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import StratifiedKFold

from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

# from xgboost import XGBClassifier

from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score


import numpy as np
import pandas as pd

# skf = StratifiedKFold(n_splits = 10, random_state=ar.RANDOM_STATE)

stemmer = SnowballStemmer('english')

def english_corpus(doc, tkpat=re.compile(r'\b[a-z][a-z]+\b')):
    # Keep lowercase alphabetic tokens of two or more letters, then stem each one.
    return [stemmer.stem(w) for w in tkpat.findall(doc)]
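
To make the token pattern concrete, here is a small standard-library check of what it keeps and drops. Stemming is left out so the snippet runs without NLTK, and the sample strings are made up.

```python
import re

TOKEN = re.compile(r"\b[a-z][a-z]+\b")  # same pattern as english_corpus, no stemming

samples = ["the 2 cats sat on a mat", "Great toy2, don't buy"]

# Digits, single letters, and tokens containing digits or capitals are dropped.
raw_tokens = [TOKEN.findall(text) for text in samples]

# CountVectorizer lowercases text by default before calling a custom tokenizer,
# so capitalised words survive in the real pipeline.
lowered_tokens = TOKEN.findall(samples[1].lower())

print(raw_tokens)
print(lowered_tokens)
```

The pattern filters on token shape only, so misspellings and non-dictionary words still become features.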


In [6]:
## Create each pipe and add it to the data dictionary.
pre_process_pipe = imbPipeline([
                          ('cnt_v', CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)),
                          ('sm', SMOTE(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'cnt_v_1_gram_sm')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2,
                                        ngram_range = (1,2))),
                          ('sm', SMOTE(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'cnt_v_2_gram_sm')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', TfidfVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)),
                          ('sm', SMOTE(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'tf_idf_1_gram_sm')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', TfidfVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2,
                                        ngram_range = (1,2))),
                          ('sm', SMOTE(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'tf_idf_2_gram_sm')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', TfidfVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2,
                                        ngram_range = (1,2))),
                          ('sm', SMOTEENN(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'tf_idf_2_gram_sm_enn')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', TfidfVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)),
                          ('sm', SMOTEENN(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'tf_idf_1_gram_sm_enn')

In [None]:
## Create each pipe and add it to the data dictionary.
pre_process_pipe = imbPipeline([
                          ('cnt_v', CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)),
                          ('sm', SMOTEENN(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'cnt_v_1_gram_sm_enn')

In [None]:
pre_process_pipe = imbPipeline([
                          ('cnt_v', CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2,
                                        ngram_range = (1,2))),
                          ('sm', SMOTEENN(random_state=42, n_jobs=-1))]) 

ar.pre_process_data(pre_process_pipe, 'cnt_v_2_gram_sm_enn')

In [None]:
X_temp, _ = pre_process_pipe.fit_sample(ar.X_train, ar.y_train)

In [None]:
X_temp.shape

In [None]:
ar.X_train.shape

Steps:
1. Create pipeline
2. Pass it to a function to manage the data.
  1. `fit_sample` pipeline
  2. Save X_train, y_train, and model to a dictionary.
  3. Write to disk
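
The steps above could be sketched roughly as follows. This is an assumption about the shape of such a helper, not the actual `ar.pre_process_data` implementation; `preprocess_and_store`, `write_store`, and the stub `DuplicateMinority` class are hypothetical names. Newer imbalanced-learn releases expose `fit_resample` in place of `fit_sample`, so the helper accepts either.

```python
import pickle

def preprocess_and_store(pipe, X, y, name, store):
    """Resample with `pipe` and record the outputs and fitted pipeline under `name`."""
    # Accept either method name, preferring fit_resample when it exists.
    resample = getattr(pipe, "fit_resample", None) or getattr(pipe, "fit_sample")
    X_res, y_res = resample(X, y)
    store[name] = {"X_train": X_res, "y_train": y_res, "model": pipe}
    return store[name]

def write_store(store, path):
    """Pickle the whole dictionary so a later session can reload it."""
    with open(path, "wb") as fh:
        pickle.dump(store, fh)

class DuplicateMinority:
    """Hypothetical stand-in for an imblearn pipeline: repeats class-1 rows once."""

    def fit_resample(self, X, y):
        extra = [(row, label) for row, label in zip(X, y) if label == 1]
        return X + [r for r, _ in extra], y + [l for _, l in extra]

store = {}
entry = preprocess_and_store(
    DuplicateMinority(), [[1, 0], [0, 1], [1, 1]], [0, 0, 1], "demo", store
)
print(len(entry["X_train"]), entry["y_train"])
```

Calling `write_store(store, path)` afterwards covers the final step; keeping the fitted pipeline next to the resampled arrays means the same transformer can later be applied to the test set.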

In [None]:
Xy_train = pd.concat([ar.y_train, ar.X_train], axis=1)

In [None]:
Xy_train[Xy_train.trend==1].head()

In [None]:
cv = CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)
X_train_minor = cv.fit_transform(Xy_train.loc[Xy_train.trend==1, 'review_body'])

In [None]:
lda = LatentDirichletAllocation(         # could I fit on 1 of the 10 folds, and then partial fit
                                        n_jobs=-1, 
                                        learning_method='online', 
                                        random_state=42)
lda.fit(X_train_minor)

In [None]:
ar.X_train.shape

In [None]:
first_pipe = imbPipeline([
                          ('lda', LatentDirichletAllocation(         # could I fit on 1 of the 10 folds, and then partial fit
                                        n_jobs=-1, 
                                        learning_method='online',    
                                        random_state=42)),
                          ('log_transform', FunctionTransformer(np.log)),
                          ('ss', StandardScaler()),
                          ('log_reg', LogisticRegression(random_state=42))])

params = {
    'lda__n_components': Integer(5, 20),
    'lda__learning_decay': Real(0.5, 1),
    'log_reg__C': Categorical([0.001,0.01,0.1,1,10,100])
}

grid = BayesSearchCV(first_pipe, params, n_jobs=-1, n_iter=1, cv=2, scoring='precision')  # cross-validation needs at least 2 folds

In [None]:
ar.run_model(grid, 'mvp')

In [None]:
second_pipe = imbPipeline([
                          ('cnt_v', CountVectorizer(
                                        stop_words='english', 
                                        tokenizer=english_corpus, 
                                        min_df=2)),
                          ('sm', SMOTE(random_state=42, n_jobs=-1)),
                          ('lda', LatentDirichletAllocation(
                                        n_jobs=-1, 
                                        learning_method='online', 
                                        random_state=42)),
                          ('log_transform', FunctionTransformer(np.log)),
                          ('rf', RandomForestClassifier(n_jobs=-1, random_state=42))])

params = {
    'lda__n_components': Integer(5, 20),
    'lda__learning_decay': Real(0.5, 1),
    'rf__n_estimators': Integer(10,100),
    'rf__max_depth': Integer(1,5)
}

second_grid = BayesSearchCV(second_pipe, params, n_jobs=-1, n_iter=5, cv=4, scoring='precision')

In [None]:
ar.run_model(second_grid, 'mvp_rf', t_func=english_corpus)

In [None]:
ar.results

In [None]:
pd.DataFrame(ar.models['mvp']['cv_results']).sort_values('mean_test_score', ascending=False)

In [None]:
ar.models['mvp']['best_model'].steps

In [None]:
ar.conf_matrix('mvp_rf')

In [None]:
ar.dump_models(func=english_corpus)

In [None]:
features = ar.models['count_1_gram']['model'].get_feature_names()

len(features)

In [None]:
features.sort(reverse=True)
features

In [None]:
ar.models['count_1_gram']['X_train'].shape

## Model Running Framework

1. A supervised model or pipeline is created.
2. BayesSearchCV is configured.
3. BayesSearchCV is fitted and scored, and the results are logged.
4. A data frame reports F1, precision, accuracy, and recall (rows) for each model (columns).
5. Preferably, store the confusion matrix so heatmaps can be plotted.
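
A minimal sketch of the logging part of this framework, written without scikit-learn so it stands alone. The names `confusion_counts`, `score`, and `log_model`, and the toy labels, are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def score(tn, fp, fn, tp):
    """Derive the four reported metrics from the confusion counts."""
    total = tn + fp + fn + tp
    return {
        "F1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        "Precision": tp / (tp + fp) if tp + fp else 0.0,
        "Recall": tp / (tp + fn) if tp + fn else 0.0,
        "Accuracy": (tp + tn) / total if total else 0.0,
    }

results = {}   # results[model_name][metric]
matrices = {}  # confusion counts kept per model for later heatmaps

def log_model(name, y_true, y_pred):
    """Record the confusion counts and metrics for one fitted model."""
    matrices[name] = confusion_counts(y_true, y_pred)
    results[name] = score(*matrices[name])

log_model("always_yes", [0, 0, 1, 1], [1, 1, 1, 1])
log_model("perfect", [0, 0, 1, 1], [0, 0, 1, 1])
print(results)
```

Passing `results` to `pd.DataFrame` would give metrics as rows and models as columns, which matches the layout described in step 4.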

In [None]:
grid.fit(ar.models['orig']['X_train'], ar.models['orig']['y_train'])

In [None]:
print(grid.best_estimator_.named_steps)

In [None]:
y_pred = grid.predict(ar.models['orig']['X_train'])

print('F1', f1_score(ar.models['orig']['y_train'], y_pred))
print('Precision',precision_score(ar.models['orig']['y_train'], y_pred))
print('Recall', recall_score(ar.models['orig']['y_train'], y_pred))
print(confusion_matrix(ar.models['orig']['y_train'], y_pred))

In [None]:
results = {
    'F1': f1_score(ar.models['orig']['y_train'], y_pred),
    'Precision': precision_score(ar.models['orig']['y_train'], y_pred),
    'Recall': recall_score(ar.models['orig']['y_train'], y_pred),
    'Accuracy': accuracy_score(ar.models['orig']['y_train'],y_pred)
}

In [None]:
import pandas as pd
# One row per metric, with a single 'score' column.
pd.Series(results, name='score').to_frame()

|Metric|Score|
|---|---|
|F1|0.03751465416178194|
|Precision|0.01920768307322929|
|Recall|0.8|

||Pred No|Pred Yes|
|---|---|---|
|Act No|2425|1634|
|Act Yes|8|32|
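
The scores in the first table follow from the confusion matrix counts (TP = 32, FP = 1634, FN = 8, TN = 2425). Accuracy appears in the `results` dictionary but not in the table, so it is worked out here as well:

$$
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{32}{1666} \approx 0.0192 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{32}{40} = 0.8 \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{64}{1706} \approx 0.0375 \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{2457}{4099} \approx 0.599
\end{aligned}
$$

Recall is high, but with precision under 2% almost every predicted positive is a false alarm.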


In [None]:
ar.models['orig']['X_train'].shape

In [None]:
from sklearn.cluster import KMeans

In [None]:
km = KMeans(n_clusters=10)
X_train_new = km.fit_transform(X_train_new)

In [None]:
X_train_new

In [None]:
ar.models['count_1_gram']['X_train'].shape

In [None]:
'he' in set(stopwords.words('english'))

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.sklearn.prepare(lda, X_train_minor, cv)

In [None]:
import pickle

In [None]:
with open('../data/ar.pkl', 'wb') as f:
    pickle.dump(ar, f)

In [None]:
ar.models['orig']['X_train'].shape

## Pipelines

1. (CountVectorizer, TF-IDF) -> (LDA, PCA, NMF, Word2Vec) -> K-Means -> (Logistic Regression, Random Forest, Gradient Boosting)
2. Sampling due to imbalanced classes (SMOTE, SMOTE -> Tomek, SMOTE -> ENN)
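
For intuition on the sampling step, the sketch below shows the core idea behind SMOTE: a synthetic minority point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_like` and the toy points are assumptions, and the Tomek and ENN cleaning steps are not shown.

```python
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between two existing minority points, oversampling fills in the minority region rather than duplicating rows.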