**Remember to activate the *far_nlp* environment before starting Jupyter Notebook from the command line and running this notebook:**

**OS X, Linux:** `$ source activate far_nlp`

**Windows:** `$ activate far_nlp`

In [None]:
# Verify the right environment is enabled by checking the python path
import sys
print(sys.executable)

# Classifying Text Documents

Imagine the following scenario: Amazon wants to do a better job of promoting new, potentially helpful product reviews so that they are more visible to customers. The algorithm for displaying reviews is based on each review's current helpfulness rating, so new reviews get pushed to the bottom of the list because they have no rating yet.

Our task is to use information from past reviews that have already been rated to predict the helpfulness of new reviews. We'll focus on seeing if we can use the review text to make a prediction.

## Load in the Amazon Reviews

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Binarizer

The data we're using for this is from the Amazon product data curated by Julian McAuley [http://jmcauley.ucsd.edu/data/amazon/](http://jmcauley.ucsd.edu/data/amazon/). We're looking at a random sample of the reviews in the Electronics category. See the link for the full data source and a further explanation of the fields.

In [None]:
# First load in the reviews data set
amazon_reviews = pd.read_csv('data/amazon_electronics_reviews_subset.csv', header=0, index_col=0)

### Try it Out:

Preview a sample of the data set by using either `.head()` or `.sample()`.

In [None]:
# Your code here




#### Hint:

In [None]:
amazon_reviews.sample(10)

## Operationalize the Y Variable

The *helpful* column gives the number of helpful votes a review received and the total number of votes it received. For example, if a particular review was voted helpful 6 times and unhelpful 4 times, it would have a rating of [6, 10].

We'll have to do some work to place the reviews into helpful/not helpful categories. We are only going to consider reviews with 10 or more total votes, because it's too easy for a review with only a few votes to have a high rating. Then we'll divide the number of helpful votes by the total number of votes received. Lastly, if the average for a review is greater than 60% we'll assign the review to the helpful (1) category. Otherwise the review will be assigned to the unhelpful (0) category. Note that the function below gives reviews with fewer than 10 votes an average of 0, so they also land in the unhelpful category.
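
As a quick worked example of this rule, here is a minimal sketch applied to a few made-up vote strings. The helper name `label_review` and the use of `ast.literal_eval` for parsing are illustrative assumptions, not part of the notebook. The 10-vote minimum and the 60% cutoff come from the text above.

```python
import ast

def label_review(help_string, min_votes=10, cutoff=0.60):
    """Return (ratio, label) for a vote string such as '[6, 10]'."""
    helpful_votes, total_votes = ast.literal_eval(help_string)
    # Too few votes: treat the ratio as 0, which puts the review in class 0
    if total_votes < min_votes:
        return 0.0, 0
    ratio = helpful_votes / total_votes
    # Only ratios strictly greater than the cutoff count as helpful
    return ratio, int(ratio > cutoff)

for votes in ["[6, 10]", "[9, 12]", "[3, 4]"]:
    print(votes, label_review(votes))
```

Note that a review sitting at exactly 60%, such as `[6, 10]`, falls in the unhelpful class because the comparison is strict.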

In [None]:
# This function creates our helpfulness average
def helpful_transformer(help_string):
    # Strip the brackets from a string such as "[6, 10]" and split the two counts
    stripped = help_string.strip().replace("[", "").replace("]", "")
    split = stripped.split(",")
    helpful_votes = int(split[0].strip())
    total_votes = int(split[1].strip())
    # Reviews with fewer than 10 total votes (including 0) get an average of 0
    if total_votes < 10:
        return 0
    return helpful_votes / total_votes

In [None]:
# Here we apply the function above to the full dataframe
amazon_reviews['helpful_avg'] = amazon_reviews['helpful'].apply(helpful_transformer)

In [None]:
# How many reviews are greater than 60% (helpful)?
print(amazon_reviews.loc[amazon_reviews['helpful_avg'] > 0.60].shape)

# How many reviews are 60% or below (unhelpful)?
print(amazon_reviews.loc[amazon_reviews['helpful_avg'] <= 0.60].shape)

In [None]:
# We'll use Binarizer to make a binary is_helpful column
helpful_binzrizer = Binarizer(copy=True, threshold=0.60)

In [None]:
# Apply the Binarizer to the data frame
amazon_reviews['is_helpful'] = helpful_binzrizer.fit_transform(amazon_reviews['helpful_avg'].values.reshape(-1,1))
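
To see how the threshold behaves at its boundary, here is a small sketch on made-up averages. The array values are assumptions chosen for illustration. `Binarizer` maps values strictly greater than the threshold to 1 and everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up helpfulness averages, including one exactly at the threshold
averages = np.array([0.0, 0.59, 0.60, 0.61, 1.0]).reshape(-1, 1)

binarizer = Binarizer(threshold=0.60)
# ravel() flattens the (n, 1) result back to a 1-D array
labels = binarizer.fit_transform(averages).ravel()

print(labels)
```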

### Try it Out:

Load a sample of the new data set.

In [None]:
# Your Code Here




#### Hint:

In [None]:
# Let's view the resulting data frame
amazon_reviews.sample(10, random_state=42)

## Construct a Balanced Data Set

Right now our data set is very unbalanced: we have 189,279 unhelpful reviews and only 10,721 helpful reviews. Modeling on a more balanced data set will help our algorithm better distinguish between the two classes. We'll combine our helpful reviews with a random sample of the same number of unhelpful reviews to construct a balanced data set.

In [None]:
helpful_reviews = amazon_reviews.loc[amazon_reviews['is_helpful'] == 1].copy()

In [None]:
# View the shape and sample of the helpful reviews
print(helpful_reviews.shape)
helpful_reviews.sample(5)

In [None]:
not_helpful = amazon_reviews.loc[amazon_reviews['is_helpful'] == 0].sample(n=10721, random_state=17).copy()

In [None]:
# View the shape and sample of the unhelpful reviews
print(not_helpful.shape)
not_helpful.sample(5)

In [None]:
# Join the two subsets to make a balanced sample
reviews_ballanced_sample = pd.concat([helpful_reviews, not_helpful], axis=0)
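
The cells above hard-code the minority class size (10721). As a hedged alternative, the sketch below derives that size from the data and downsamples every class to it. The toy frame and the names `toy`, `n_minority` and `balanced` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for amazon_reviews (values are invented)
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(10)],
    "is_helpful": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Size of the smallest class, computed rather than hard-coded
n_minority = int(toy["is_helpful"].value_counts().min())

# Sample n_minority rows from each class, then shuffle the combined rows
balanced = pd.concat(
    [group.sample(n=n_minority, random_state=17)
     for _, group in toy.groupby("is_helpful")],
    axis=0,
).sample(frac=1, random_state=17)

print(balanced["is_helpful"].value_counts())
```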

### Try it Out:

Preview the new balanced data set. Make sure there are instances of each class in the preview.

In [None]:
# Your Code here




#### Hint:

In [None]:
# Preview our balanced data set
print(reviews_ballanced_sample.shape)
reviews_ballanced_sample.sample(5, random_state=17)

## Manually Vectorize the Text and Predict

In [None]:
# Basic stuff just in case it isn't loaded already
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# For splitting our data into training and test sets
from sklearn.model_selection import train_test_split

# For our vectorizing and modeling pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline #, make_pipeline, make_union
from pprint import pprint
from time import time
from sklearn.model_selection import GridSearchCV
# import logging

# For vectorizing and tokenizing our text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy                        
nlp = spacy.load('en') 

# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Take a sample of the data set for quicker processing.
sample = reviews_ballanced_sample.sample(n=1000, random_state=17)

In [None]:
# Confirm the shape of the sample
sample.shape

### Split out X, y

In [None]:
# Setup training and final test data
# If you want to run on the full set, just replace
# `sample` with the full data set

X = sample
y = sample['is_helpful']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=17)

### Try it Out:

Preview the length/shape of X_train, y_train, X_test and y_test

In [None]:
# Your code here




#### Hint:

In [None]:
print("x train:", X_train.shape)
print("y train:", len(y_train))
print("x test:", X_test.shape)
print("y test:", len(y_test))

### Try it Out: Extract the Review Column from X_train and X_test

In [None]:
# Returns the review text column from the data frame
def review_extractor(dataframe):
    return dataframe['reviewText'].tolist()

Use the *review_extractor* to pull out the *reviewText* column from the data frame for X_train and X_test. You can accomplish this by passing the whole data frame to the function. Assign the results to new variables called *review_text_train* and *review_text_test*. Confirm that each variable is a list, check how long each list is, and display a few items from each list.

In [None]:
# Your code here




#### Hint:

In [None]:
# Pull out the review column from X_train
review_text_train = review_extractor(X_train)
review_text_test = review_extractor(X_test)

# Display the type, the length and the first two reviews of each list
print("Training Data:")
print("type: ", type(review_text_train))
print("length: ", len(review_text_train))
print(review_text_train[0:2])

print("\n=======")

print("\nTest Data:")
print("type: ", type(review_text_test))
print("length: ", len(review_text_test))
print(review_text_test[0:2])

### Try it Out — Tokenize the Text:

Vectorize X_train and X_test using either CountVectorizer or TfidfVectorizer and the spaCy tokenizer below. Pass in *review_text_train* and *review_text_test* from the previous step. Assign the results to new variables *vec_text_train* and *vec_text_test*. When vectorizing the training text use `.fit_transform()`, then use the same instance of your vectorizer with only `.transform()` on the test text. Verify that the resulting sparse matrices are the same width.

**What happens to the shape of *vec_text_train*, and *vec_text_test* if you use `.fit_transform()` on both the train and test text? Why is this a bad idea?**

In [None]:
# Custom tokenizer using spaCy
def spacy_tokenizer(doc_as_string):
    spacy_doc = nlp(doc_as_string)

    tokens = []
    for tok in spacy_doc:
        if tok.like_email == True:
            tokens.append('email')
        elif tok.like_url:
            tokens.append('URL')
        elif tok.lemma_ == "-PRON-":
            tokens.append(tok.lower_)
        elif tok.is_alpha == True:
            tokens.append(tok.lemma_)
    return tokens

In [None]:
# Your code here




#### Hint:

In [None]:
TFIDF = TfidfVectorizer(tokenizer=spacy_tokenizer)
vec_text_train = TFIDF.fit_transform(review_text_train)
vec_text_test = TFIDF.transform(review_text_test)

In [None]:
vec_text_train

In [None]:
vec_text_test

In [None]:
# Second attempt: calling fit_transform on both the train and test text
TFIDF = TfidfVectorizer(tokenizer=spacy_tokenizer)
vec_text_train_2 = TFIDF.fit_transform(review_text_train)
vec_text_test_2 = TFIDF.fit_transform(review_text_test)

In [None]:
vec_text_train_2

In [None]:
vec_text_test_2
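
The effect is easier to see on a tiny corpus. The sketch below uses made-up sentences and a plain `CountVectorizer` (an assumption made to keep the example fast; the notebook uses `TfidfVectorizer` with the spaCy tokenizer). Refitting on the test text builds a new vocabulary, so the test matrix no longer has the columns the classifier was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents for illustration only
train_docs = ["the card works great", "the card failed quickly"]
test_docs = ["great price"]

# Correct: fit the vocabulary on train, reuse it for test
shared = CountVectorizer()
train_matrix = shared.fit_transform(train_docs)
test_matrix = shared.transform(test_docs)
print(train_matrix.shape, test_matrix.shape)

# Problematic: refitting on test creates a different vocabulary
refit = CountVectorizer()
refit_test_matrix = refit.fit_transform(test_docs)
print(refit_test_matrix.shape)
```

With the shared vocabulary both matrices have the same number of columns, and words unseen during training (here "price") are simply ignored at transform time.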

### Train the Model and Predict

We'll use a multinomial Naive Bayes as our baseline model. See the docs for more: [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

In [None]:
# Fit the classifier
clf = MultinomialNB()
clf.fit(vec_text_train, y_train)

# Generate Predictions
y_predictions = clf.predict(vec_text_test)

### Try it Out — View the Predictions:

Print or display `y_predictions` to see the results.

In [None]:
# Your code here



#### Hint:

In [None]:
y_predictions

### Evaluate the Model

We can use the accuracy score to get a broad idea of how well our classifier is doing. We pass in the correct labels and then the predicted labels, and the fraction of accurate predictions is returned. **How well did our classifier do?**

In [None]:
score = accuracy_score(y_test, y_predictions)
print(score)
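
Accuracy is just the share of predictions that match the true labels. A minimal sketch with made-up labels (the arrays below are assumptions for illustration) shows that the manual calculation agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: 3 of the 5 predictions match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

manual = (y_true == y_pred).mean()
library = accuracy_score(y_true, y_pred)
print(manual, library)
```

Keep in mind that on a balanced two-class data set, always guessing the same class already scores about 0.5, so that is the baseline to beat.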

### Try it Out:

Go back through the steps above and make adjustments to the vectorization and the parameters of multinomial Naive Bayes to see if you can improve the accuracy score. **What is the top score you can achieve?**

## Model Pipeline

Thankfully there is a better way. We can use scikit-learn pipelines to efficiently go through the steps above and iterate over different parameters.

Pipeline docs: [scikit learn Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

In [None]:
# Returns the review text column from the data frame
def review_extractor(dataframe):
    return dataframe['reviewText'].tolist()

In [None]:
# Setup training and final test data
# If you want to run on the full set, just replace
# `sample` with the full data set

X = sample
y = sample['is_helpful']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=17)

In [None]:
# Custom tokenizer using spaCy
def spacy_tokenizer(doc_as_string):
    spacy_doc = nlp(doc_as_string)

    tokens = []
    for tok in spacy_doc:
        if tok.like_email == True:
            tokens.append('email')
        elif tok.like_url:
            tokens.append('URL')
        elif tok.lemma_ == "-PRON-":
            tokens.append(tok.lower_)
        elif tok.is_alpha == True:
            tokens.append(tok.lemma_)
    return tokens

In [None]:
pipeline = Pipeline([
    ('extractor', FunctionTransformer(review_extractor, validate=False)),
    ('vect', TfidfVectorizer()),
    ('clf',MultinomialNB())
])

In [None]:
# Adding more parameters gives better exploring power but
# increases processing time
parameters = {
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1,3)),
}
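
The keys in the grid follow the pattern `<step name>__<parameter name>`, which is how a pipeline addresses the parameters of an individual step. The sketch below illustrates the naming on a stand-alone pipeline and counts how many candidates the grid produces. The names `demo_pipeline` and `demo_parameters` are assumptions introduced for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# "vect__ngram_range" reaches the ngram_range parameter of the "vect" step
demo_pipeline.set_params(vect__ngram_range=(1, 2), clf__alpha=0.5)
print(demo_pipeline.named_steps["vect"].ngram_range)
print(demo_pipeline.named_steps["clf"].alpha)

# Same grid values as the cell above
demo_parameters = {
    "vect__max_features": (None, 5000, 10000, 50000),
    "vect__ngram_range": ((1, 1), (1, 2), (1, 3)),
}
n_candidates = len(ParameterGrid(demo_parameters))
print(n_candidates, "candidates,", n_candidates * 5, "fits with cv=5")
```

Every extra parameter multiplies the number of candidates, which is why larger grids take noticeably longer to search.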

In [None]:
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    # Search on the training split only so the test set stays unseen
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

## Model Evaluation

In [None]:
# Generate Predictions

# Get the best performing model from the GridSearch above
best_model = grid_search.best_estimator_

# Fit and predict
best_model.fit(X_train, y_train)
y_predictions = best_model.predict(X_test)

# Score our Predictions
score = accuracy_score(y_test, y_predictions)
print(score)

In [None]:
# What did our predictions look like?
pd.Series(y_predictions).value_counts()

In [None]:
# What did the true classes really look like?
pd.Series(y_test).value_counts()

In [None]:
# A confusion matrix can help us better understand how our classifier is doing.
from utilities import plot_confusion_matrix

conf_matrix = confusion_matrix(y_test, y_predictions)
labels = ['Unhelpful Review - 0', 'Helpful Review - 1']
np.set_printoptions(precision=2)

plt.figure(figsize=(5,5), dpi=150)
plt.grid(False)
plot_confusion_matrix(conf_matrix, labels )
plt.show()

Precision and recall can also help us understand how our predictions are doing.

See the link below for a great explanation of precision and recall. This site is a great general resource for other data science topics as well.
https://chrisalbon.com/machine-learning/precision_recall_and_F1_scores.html
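
As a small worked example, precision, recall and F1 can be computed directly from the four cells of a confusion matrix. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for a two-class problem
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the reviews predicted helpful, how many were
recall = tp / (tp + fn)      # of the truly helpful reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```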

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predictions))

### Try it Out:

Iterate through the steps above to try and improve the model. You can adjust parameters in how the documents are vectorized and parameters for the classifier. You can also try using a different classifier like LinearSVC instead of MultinomialNB by substituting it in the last step of the pipeline above. You can use `pipeline.get_params().keys()` to get a listing of all the parameters available in the current pipeline. This is demonstrated in the cell below.

**What is the best overall classification accuracy that you can achieve?**

In [None]:
pipeline.get_params().keys()
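
One possible way to try a different classifier is to replace the final step by name with `set_params`. The sketch below does this on a small stand-alone pipeline. The texts, labels and the name `demo_pipeline` are invented for illustration, and the parameter names available under `clf__` change with the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented reviews and labels, only to show that the pipeline still fits
texts = [
    "detailed review with measurements and photos",
    "bad",
    "explains the setup steps and the limitations clearly",
    "do not buy",
]
labels = [1, 0, 1, 0]

demo_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Replace the whole "clf" step with a different estimator
demo_pipeline.set_params(clf=LinearSVC())
demo_pipeline.fit(texts, labels)

print(type(demo_pipeline.named_steps["clf"]).__name__)
print(sorted(k for k in demo_pipeline.get_params() if k.startswith("clf__")))
```

After the swap, grid keys such as `clf__alpha` no longer apply, while keys such as `clf__C` become available.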

# Clustering Text Documents

The data set for the next exercise contains reviews for the SanDisk Ultra 64GB MicroSDXC Memory Card. Imagine we work for SanDisk and our task is to sort through the negative reviews to find any discernible patterns. What can we learn from these negative reviews that will give insight on how we can improve our product?

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import spacy
nlp = spacy.load('en')  

from sklearn.cluster import KMeans

from bokeh.plotting import figure, show, output_notebook, ColumnDataSource
from bokeh.palettes import viridis
from bokeh.models import HoverTool

In [None]:
# Import the SanDisk reviews
sandisk_reviews = pd.read_csv("data/sandisk_sd_card_reviews.csv", index_col=0)

### Try it Out:

Preview the data set by using `.head()`. Notice that the asin number is the same for every row. This is because the asin represents the product, and all of these reviews are for the same product.

In [None]:
# Your code here



#### Hint:

In [None]:
sandisk_reviews.head(3)

## Process the Data

In [None]:
# Let's drop any rows where there are missing values in 
# the overall and reviewText columns

print(sandisk_reviews.shape)
sandisk_reviews.dropna(axis=0, how='any', subset=['overall', 'reviewText'], inplace=True)

# See the shape after the drop
sandisk_reviews.shape

In [None]:
sandisk_reviews.info()

So it looks like the rating and review text are not missing for any of these rows, and we're good to go.

### Make a Subset of the Reviews

In [None]:
sandisk_bad_reviews = sandisk_reviews.loc[sandisk_reviews['overall'] <= 2].copy()

In [None]:
sandisk_bad_reviews.shape

In [None]:
# We're going to make another column with a trimmed version of the review.
# Reviews longer than 200 words are cut to their first 75 words.
# We'll end up using this later on when we graph our data.
def review_trimmer(text):
    words = text.split()
    if len(words) > 200:
        words = words[0:75]
    words = ' '.join(words)
    return words

In [None]:
sandisk_bad_reviews['reviewPreview'] = sandisk_bad_reviews['reviewText'].apply(review_trimmer)

In [None]:
sandisk_bad_reviews.head(3)

## Vectorize the Data

Use the same pipeline, `review_extractor` function and `spacy_tokenizer` from the classification example above to vectorize all the review text in the `sandisk_bad_reviews` data frame. However, remove the last step from the pipeline (the classifier) so the pipeline ends with a vectorizer, not a classifier. Use `fit_transform()` on the pipeline and assign the results to the variable tfidf_docs. Note that we don't need to split the data into training and test sets for this task.

In [None]:
# Your Code here




#### Hint:

In [None]:
# Custom Functions to extract the review text from the DF
def review_extractor(dataframe):
    return dataframe['reviewText'].tolist()

In [None]:
# Custom tokenizer using spaCy
def spacy_tokenizer(doc_as_string):
    spacy_doc = nlp(doc_as_string)

    tokens = []
    for tok in spacy_doc:
        if tok.like_email == True:
            tokens.append('email')
        elif tok.like_url:
            tokens.append('URL')
        elif tok.lemma_ == "-PRON-":
            tokens.append(tok.lower_)
        elif tok.is_alpha == True:
            tokens.append(tok.lemma_)
    return tokens

In [None]:
tfidf_pipeline = Pipeline([
    ('extractor', FunctionTransformer(review_extractor, validate=False)),
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1,1), use_idf=False, min_df=1)),
])

In [None]:
tfidf_docs = tfidf_pipeline.fit_transform(sandisk_bad_reviews)

## Cluster, Reduce and Plot the Documents

In [None]:
# Instantiate KMeans
kmeans = KMeans(n_clusters=7, n_jobs=-1, max_iter=700, n_init=15)

In [None]:
# Fit KMeans and assign each document to a cluster (unsupervised labels)
cluster_labels = kmeans.fit_predict(tfidf_docs)
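
The choice of 7 clusters is a starting point rather than a given. One common way to compare candidate values is the silhouette score, sketched below on synthetic blobs. The synthetic data and the names `X_demo` and `scores` are assumptions for illustration. In the notebook, `tfidf_docs` would take the place of `X_demo`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the vectorized reviews
X_demo, _ = make_blobs(n_samples=90, centers=3, random_state=17)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=17).fit_predict(X_demo)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X_demo, labels_k)

for k, score in scores.items():
    print(k, round(float(score), 3))
```

Text data rarely separates as cleanly as synthetic blobs, so treat the score as one signal alongside reading the reviews in each cluster.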

In [None]:
# Reduce the Term Matrix to 2 dimensions for Plotting
TSVD = TruncatedSVD(n_components=2, random_state=17)
reduced_docs = TSVD.fit_transform(tfidf_docs)

In [None]:
# Notice how everything gets reduced to 2 columns
print(reduced_docs.shape)
reduced_docs[0:3]
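
Two dimensions are convenient for plotting but usually keep only part of the information in the term matrix. A rough way to check is the `explained_variance_ratio_` attribute of the fitted `TruncatedSVD`, sketched here on a few made-up documents (the documents and variable names are assumptions for illustration).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents standing in for the negative reviews
docs = [
    "card stopped working after two weeks",
    "card died and lost all my photos",
    "phone does not recognize the card",
    "slow write speed when recording video",
    "much slower than the advertised speed",
    "received a fake card with less capacity",
]

matrix = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=17)
reduced = svd.fit_transform(matrix)

print(reduced.shape)
# Share of variance captured by each of the two components
print(svd.explained_variance_ratio_, svd.explained_variance_ratio_.sum())
```

A low total suggests that points which overlap in the 2-D plot may still be far apart in the original term space.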

In [None]:
# Plot our clusters
n_categories = pd.Series(cluster_labels).nunique()
color_swatches = viridis(n_categories)
categories =  np.unique(cluster_labels).tolist()
colormap = dict(zip(categories, color_swatches))
cat_colors = [colormap[i] for i in cluster_labels]


source = ColumnDataSource(data=dict(
    x = reduced_docs[:,0],
    y = reduced_docs[:,1],
    desc = sandisk_bad_reviews['reviewPreview'].tolist(),
    color=cat_colors,
    label = cluster_labels
))


hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("desc", "@desc")
])

p = figure(title = "Documents by Truncated SVD Values", width=700, height=700, tools=[hover, 'pan', 'box_zoom', 'reset', 'zoom_in','zoom_out'])
p.xaxis.axis_label = 'TSVD Dimension 1'
p.yaxis.axis_label = 'TSVD Dimension 2'

p.circle('x','y', size=8, fill_alpha=0.2, 
        color='color',
        legend='label',
         source=source,)

output_notebook(notebook_type='jupyter')
show(p)

Iterate through the steps above and adjust parameters to see if you can find any natural clusters or trends in the reviews that would shed light on ways we might be able to improve our product.
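
One way to get a feel for what each cluster is about is to list the terms with the largest weights in its centroid. The sketch below does this on a handful of invented reviews. The documents, the `stop_words="english"` setting and the variable names are assumptions for illustration. Older scikit-learn versions expose `get_feature_names()` instead of `get_feature_names_out()`, so the code falls back to whichever is available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented negative reviews for illustration
docs = [
    "card stopped working after two weeks",
    "card died and stopped working in my phone",
    "card failed and stopped working completely",
    "write speed is very slow",
    "slow speed when copying files",
    "transfer speed far too slow",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=17).fit(matrix)

# Use the newer accessor when present, otherwise the older one
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
terms = np.asarray(get_names())

top_terms = {}
for cluster_id, centroid in enumerate(km.cluster_centers_):
    # Indices of the three largest centroid weights
    top_idx = centroid.argsort()[::-1][:3]
    top_terms[cluster_id] = terms[top_idx].tolist()

print(top_terms)
```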


# End of Part 2