# NYAAPOR Text Analytics Tutorial

## Loading in the data

First, download the Kaggle zip file (https://www.kaggle.com/snap/amazon-fine-food-reviews) and unpack it in this repository's root folder.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("../amazon-fine-food-reviews/Reviews.csv")

In [None]:
print(len(df))

Wow, that's a lot of data.  Let's see what's in here.

In [None]:
df.head()

Let's just use a sample for now, so things run faster

In [None]:
sample = df.sample(10000).reset_index()

### Examine the data

Run the cell below a few times to see what our text looks like.  Always take a look at your raw data.

In [None]:
sample.sample(10)['Text'].values

I don't know about you, but I noticed some junk in our data - HTML and URLs.  Let's clear that out first.

In [None]:
import re

def clean_text(text):
    text = re.sub(r'http[a-zA-Z0-9\&\?\=\?\/\:\.]+\b', ' ', text)
    text = re.sub(r'\<[^\<\>]+\>', ' ', text)
    return text

sample['Text'] = sample['Text'].map(clean_text)  # clean the sample we actually analyze below
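
To sanity-check the cleaner, it can help to run it on a made-up string first.  The review text below is invented for illustration, and the whitespace collapse at the end is an optional extra step, not part of this tutorial's pipeline.

```python
import re

def clean_text(text):
    # Same two substitutions as above: drop URLs, then drop HTML tags
    text = re.sub(r'http[a-zA-Z0-9\&\?\=\?\/\:\.]+\b', ' ', text)
    text = re.sub(r'\<[^\<\>]+\>', ' ', text)
    return text

# Hypothetical review text, made up for this example
raw = 'Great tea!<br />See http://example.com/tea for more.'
cleaned = clean_text(raw)

# Each match is replaced by a space, so collapse the leftover runs of whitespace
tidy = ' '.join(cleaned.split())
print(tidy)
```

If the printed text still contains angle brackets or `http`, the patterns need another look before you vectorize.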

## TF-IDF Vectorization (Feature Extraction)

Okay, now let's tokenize our text and turn it into numbers

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9, 
    min_df=5, 
    ngram_range=(1, 1), 
    stop_words='english', 
    max_features=2500
)
tfidf = tfidf_vectorizer.fit_transform(sample['Text'])
ngrams = list(tfidf_vectorizer.get_feature_names_out())  # on scikit-learn < 1.0, use get_feature_names()

In [None]:
tfidf
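
For intuition about what the vectorizer computes, here is a from-scratch sketch on a toy corpus.  It follows the formula scikit-learn documents for its defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document), but skips everything else: stop words, `min_df`/`max_df`, and real tokenization.  The corpus and the `toy_tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows

docs = ["good coffee", "good tea", "bad coffee taste"]
vocab, rows = toy_tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {w: round(x, 2) for w, x in zip(vocab, row) if x > 0})
```

Within "good tea", the rarer word "tea" ends up with a larger weight than "good", which appears in two of the three documents.  That is the whole point of the idf term.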

Because a document-term matrix has one column per word and most of its entries are zero, Scikit-Learn returns a sparse matrix by default.  We can expand the sparse matrix with `.todense()`, wrap it in a dataframe, and compute sums as usual.  Let's check out the top 20 words.

In [None]:
ngram_df = pd.DataFrame(tfidf.todense(), columns=ngrams) 
ngram_df.sum().sort_values(ascending=False)[:20]

## Classification

Let's make an outcome variable.  How about we try to predict 5-star reviews, and then maybe helpfulness?

In [None]:
sample['good_score'] = sample['Score'].map(lambda x: 1 if x == 5 else 0)
sample['was_helpful'] = ((sample['HelpfulnessNumerator'] / sample['HelpfulnessDenominator']).fillna(0.0) > .80).astype(int)
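
The helpfulness flag hinges on how zero-vote reviews are handled: 0/0 is NaN in pandas, `fillna(0.0)` turns it into 0, and so reviews nobody voted on count as not helpful.  Here is a plain-Python sketch of the same rule, with invented vote counts.  The `was_helpful` function is a stand-in for illustration and only covers the no-votes case, not other kinds of bad data.

```python
def was_helpful(numerator, denominator, threshold=0.80):
    # No votes at all: pandas would give NaN for 0/0, which fillna(0.0) maps to 0
    ratio = numerator / denominator if denominator else 0.0
    return int(ratio > threshold)

# (helpful votes, total votes) pairs, made up for this example
votes = [(9, 10), (4, 5), (0, 0), (1, 3)]
print([was_helpful(n, d) for n, d in votes])
```

Note that the comparison is strict, so a review with exactly 80% helpful votes (4 of 5) does not make the cut.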

In [None]:
column_to_predict = 'good_score'

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn import svm
from sklearn import metrics

results = []
kfolds = StratifiedKFold(n_splits=5)

We just created an object that'll split the data into fifths, and then iterate over it five times, holding out one-fifth each time for testing.  Let's do that now.  Each "fold" contains an index for training rows, and one for testing rows.  For each fold, we'll train a basic linear Support Vector Machine, and evaluate its performance.
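
Before running the real loop, here is a toy sketch of what a stratified splitter hands back: pairs of index lists in which every test fold keeps the same class mix.  This is a simplified stand-in written for illustration (`toy_stratified_folds` is a made-up name), not how `StratifiedKFold` assigns rows internally.

```python
from collections import defaultdict

def toy_stratified_folds(labels, n_splits):
    """Yield (train_indices, test_indices) pairs with class balance preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's rows out to the folds in turn, like dealing cards
    test_folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            test_folds[position % n_splits].append(index)

    for test in test_folds:
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)

labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]  # made-up outcomes: 8 ones, 4 zeros
for train, test in toy_stratified_folds(labels, n_splits=4):
    print("train:", train, "test:", test)
```

Every row lands in exactly one test fold, and each test fold holds two ones and one zero, matching the 2:1 ratio of the full list.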

In [None]:
for i, fold in enumerate(kfolds.split(tfidf, sample[column_to_predict])):
    
    train, test = fold 
    print("Running new fold, {} training cases, {} testing cases".format(len(train), len(test)))
    
    clf = svm.LinearSVC(
        max_iter=1000,
        penalty='l2',
        class_weight='balanced',
        loss='squared_hinge'
    )
    # We picked some decent starting parameters, but encourage you to try out different ones
    # http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html 
    # If you're ambitious - check out the Scikit-Learn documentation and test out different models
    # http://scikit-learn.org/stable/supervised_learning.html
    
    training_text = tfidf[train]
    training_outcomes = sample[column_to_predict].loc[train]
    clf.fit(training_text, training_outcomes) # Train the classifier on the training data
    
    test_text = tfidf[test]
    test_outcomes = sample[column_to_predict].loc[test]
    predictions = clf.predict(test_text) # Get predictions for the test data
    
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(
        test_outcomes, # Compare the predictions against the true outcomes
        predictions
    )
    
    results.append({
        "fold": i,
        "outcome": 0,
        "precision": precision[0],
        "recall": recall[0],
        "fscore": fscore[0],
        "support": support[0]
    })
    
    results.append({
        "fold": i,
        "outcome": 1,
        "precision": precision[1],
        "recall": recall[1],
        "fscore": fscore[1],
        "support": support[1]
    })
    
results = pd.DataFrame(results)

How'd we do?

In [None]:
print(results.groupby("outcome").mean()[['precision', 'recall']])
print(results.groupby("outcome").std()[['precision', 'recall']])
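
If precision and recall are new to you, it may help to compute them by hand once.  The sketch below counts true positives, false positives, and false negatives for one class at a time; the labels and predictions are made up, and `precision_recall_f1` is a hypothetical helper, not part of Scikit-Learn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything we labeled positive, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that was truly positive, how much did we find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up true outcomes
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]  # made-up predictions

for outcome in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, positive=outcome)
    print("outcome {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(outcome, p, r, f))
```

This is why the results table has one row per outcome per fold: each class gets its own precision and recall.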

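Now that we know our model is pretty stable and reasonably performant, let's look at the last fold's held-out predictions in more detail, and then fit the classifier on the full sample.

In [None]:
# `test` and `predictions` still hold the last fold's held-out rows and predictions
print(metrics.classification_report(sample[column_to_predict].loc[test], predictions))
print(metrics.confusion_matrix(sample[column_to_predict].loc[test], predictions))
clf.fit(tfidf, sample[column_to_predict])  # Refit on the full sample before inspecting coefficients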

And now we can see what the most predictive features are.

In [None]:
import numpy as np

ngram_coefs = sorted(zip(ngrams, clf.coef_[0]), key=lambda x: x[1], reverse=True)
ngram_coefs[:10]
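
The slice above shows the n-grams that push hardest toward the positive class.  The other end of the same sorted list holds the ones that push hardest toward the negative class, which is often just as informative.  Here is a small sketch with invented words and coefficients (these are not results from the model):

```python
# Hypothetical (ngram, coefficient) pairs, made up for illustration
toy_coefs = [("great", 1.8), ("okay", -0.2), ("awful", -2.1),
             ("love", 1.5), ("stale", -1.4), ("the", 0.0)]

ranked = sorted(toy_coefs, key=lambda pair: pair[1], reverse=True)
top_positive = ranked[:2]         # strongest evidence for outcome 1
top_negative = ranked[-2:][::-1]  # strongest evidence for outcome 0, most negative first
print(top_positive)
print(top_negative)
```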

What happens if you change the outcome column to "was_helpful" and re-run it?  Can you think of ways to improve this?  Add stopwords?  Bigrams?

## Topic Modeling

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #{}: {}".format(
            topic_idx,
            ", ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        ))
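
The slice `[:-n_top_words - 1:-1]` in that function is a little cryptic: it walks the ascending sort order backwards from the end and stops after `n_top_words` items.  Here is the same trick on a plain list, with `sorted` standing in for NumPy's `argsort`; the weights and words are made up.

```python
weights = [0.1, 0.7, 0.3, 0.9]   # made-up topic weights, one per word
words = ["tea", "coffee", "bag", "dog"]
n_top_words = 2

# Pure-Python stand-in for argsort: indices ordered by ascending weight
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the end and stop after n_top_words items
top = ascending[:-n_top_words - 1:-1]
print([words[i] for i in top])
```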

Let's find some topics.  We'll check out non-negative matrix factorization (NMF) first.

In [None]:
nmf = NMF(n_components=10, random_state=42, alpha=.1, l1_ratio=.5).fit(tfidf)
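# Try out different numbers of topics (change n_components)
# Note: scikit-learn 1.2+ replaces `alpha` with `alpha_W` and `alpha_H`, so this value may need retuning there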
# Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
print("\nTopics in NMF model:")
print_top_words(nmf, ngrams, 10)

LDA (Latent Dirichlet Allocation) is another popular topic modeling technique.

In [None]:
lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(tfidf)
# Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
# doc_topic_prior (alpha) - lower alpha means documents will be composed of fewer topics (higher means a more uniform distribution across all topics)
# topic_word_prior (beta) - lower beta means topics will be composed of fewer words (higher means a more uniform distribution across all words)
print("\nTopics in LDA model:")
print_top_words(lda, ngrams, 10)
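
To get a feel for what the priors do, you can sample from a symmetric Dirichlet directly.  The sketch below uses only the standard library and is not part of Scikit-Learn's LDA; the helper names and the number of draws are arbitrary choices for illustration.  It reports how much of a document the single largest topic takes up, on average, for a few values of alpha.

```python
import random

def sample_dirichlet(alpha, size, rng):
    # A symmetric Dirichlet draw is a set of Gamma(alpha, 1) draws normalized to sum to 1
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def average_largest_share(alpha, n_topics=10, n_draws=2000, seed=42):
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_draws)]
    return sum(largest) / n_draws

for alpha in (0.1, 1.0, 10.0):
    print("alpha={}: average share of the largest topic = {:.2f}".format(
        alpha, average_largest_share(alpha)))
```

With a low alpha, one topic tends to dominate each draw; with a high alpha, the shares even out across all ten topics.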

We can use the topic models the same way we did our classifier - everything in Scikit-Learn follows the same fit/transform paradigm.  So, let's get the topics for our documents.

In [None]:
doc_topics = pd.DataFrame(lda.transform(tfidf))

In [None]:
doc_topics.head()

In [None]:
topic_column_names = ["topic_{}".format(c) for c in doc_topics.columns]
doc_topics.columns = topic_column_names

Next we use Pandas to join the topics with the original sample dataframe

In [None]:
sample_with_topics = pd.concat([sample, doc_topics], axis=1)

Let's look for patterns by running some means and correlations

In [None]:
sample_with_topics.groupby("Score").mean(numeric_only=True)  # skip the text columns, which can't be averaged

In [None]:
for topic in topic_column_names:
    print("{}: {}".format(topic, sample_with_topics[topic].corr(sample_with_topics['Score'])))
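
The `.corr()` call computes a Pearson correlation by default.  For reference, here is that calculation written out in plain Python on made-up numbers; `pearson` is a hypothetical helper for illustration.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance divided by the product of the standard deviations
    # (the 1/n factors cancel, so they are left out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

topic_share = [0.9, 0.7, 0.4, 0.2, 0.1]  # made-up topic weights for five reviews
score = [1, 2, 3, 4, 5]                  # made-up star ratings
print(round(pearson(topic_share, score), 3))
```

A strongly negative value like this one would mean the topic shows up mostly in low-star reviews.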

Here's an example of a linear regression

In [None]:
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

training_data = sample_with_topics[topic_column_names[:-1]] # We're leaving a column out to prevent multicollinearity

regression = linear_model.LinearRegression()

# Train the model using the training sets
regression.fit(training_data, sample_with_topics['Score'])
coefficients = regression.coef_
print(list(zip(topic_column_names[:-1], coefficients)))
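
Why leave a topic column out?  Assuming each row of topic weights sums to 1, as a document-topic distribution does, the last column carries no new information: it is always one minus the others.  Together with an intercept, the full set of columns would be perfectly collinear.  Here is a tiny sketch with made-up shares:

```python
# Made-up topic shares for three documents; each row sums to 1
toy_doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]

for row in toy_doc_topics:
    kept = row[:-1]                 # the columns fed to the regression
    implied_last = 1.0 - sum(kept)  # the dropped column is fully determined by the rest
    print(kept, "-> last topic must be", round(implied_last, 2), "(actual: {})".format(row[-1]))
```

The coefficients for the remaining topics are then read relative to the topic that was dropped.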

Sadly Scikit-Learn doesn't make it easy to get p-values or a regression report like you'd normally expect of something like R or Stata.  Scikit-Learn is more about prediction than statistical analysis; for the latter, we can use Statsmodels.  

In [None]:
import statsmodels.api as sm

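# OLS takes the outcome (endog) first and the predictors (exog) second;
# add_constant adds the intercept that Statsmodels leaves out by default
regression = sm.OLS(sample_with_topics['Score'], sm.add_constant(training_data))
results = regression.fit()
print(results.summary())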

## Clustering

We can also check out other unsupervised methods like clustering.  I borrowed/modified some of this code from http://brandonrose.org/clustering

### K-Means Clustering

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, max_iter=50, tol=.01)
# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans.fit(tfidf)
clusters = kmeans.labels_.tolist() # You can merge these back into the data if you want

In [None]:
centroids = kmeans.cluster_centers_.argsort()[:, ::-1] 
for i, closest_ngrams in enumerate(centroids):
    print("Cluster #{}: {}".format(i, np.array(ngrams)[closest_ngrams[:8]]))
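
Under the hood, K-Means just alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points.  Here is a bare-bones sketch on six made-up 2-D points with hand-picked starting centroids.  It is a toy for intuition (`toy_kmeans` is a made-up name), without the smarter initialization and convergence checks Scikit-Learn uses.

```python
def toy_kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels, centroids

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centroids = toy_kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
print(centroids)
```

In the tutorial's case the "points" are reviews in a 2,500-dimensional TF-IDF space, which is why we describe each cluster by the words with the largest centroid weights.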

### Agglomerative/Hierarchical Clustering

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics.pairwise import cosine_similarity

# Uses cosine similarity to get word similarities based on document overlap
# To get this for document similarities in terms of word overlap, just drop the .transpose()!
similarities = cosine_similarity(tfidf.transpose()) 
distances = 1 - similarities # Converts to distances
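# linkage() treats a 2-D array as one observation vector per row, so Ward clustering
# here groups words by their whole profile of distances to every other word
clusters = linkage(distances, method='ward')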
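
For reference, here is cosine similarity written out by hand for a pair of vectors, along with the `1 - similarity` conversion used above.  The word vectors are made up: each holds one word's TF-IDF weight in four imaginary documents.

```python
import math

def cosine_similarity_pair(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up TF-IDF weights for three words across four documents
word_vectors = {
    "coffee": [0.5, 0.4, 0.0, 0.0],
    "espresso": [0.4, 0.5, 0.0, 0.1],
    "dog": [0.0, 0.0, 0.6, 0.5],
}

for word in ("espresso", "dog"):
    similarity = cosine_similarity_pair(word_vectors["coffee"], word_vectors[word])
    print("coffee vs {}: distance = {:.2f}".format(word, 1 - similarity))
```

Words that show up in the same documents end up close together, and words that never co-occur sit at the maximum distance of 1.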

In [None]:
fig, ax = plt.subplots(figsize=(15, min([len(ngrams)/10.0, 300])))
dendrogram(clusters, labels=ngrams, orientation="left", ax=ax)  # draw on the axes created above
plt.tight_layout()