Intro to Topic Modelling


The dataset (Amazon Fine Food Reviews) can be found [here](https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/Reviews.csv/2).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from string import punctuation
from sklearn import svm
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from nltk import ngrams
from itertools import chain

from sklearn.metrics import recall_score, f1_score

In [2]:
corpus = ['Apple Orange Orange Apple',
  'Apple Banana Apple Banana',
  'Banana Apple Banana Banana Banana Apple',
  'Banana Orange Banana Banana Orange Banana',
  'Banana Apple Banana Banana Orange Banana']

# Raw term counts: one row per document, one column per word
print("Using count vectorizer")
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(corpus)
# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
print(pd.DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names()).to_string())

# TF-IDF weights: counts scaled down for words that occur in many documents
print("\nUsing tfidf")
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
print(pd.DataFrame(tfidf.toarray(), columns=tfidf_vec.get_feature_names()).to_string())

Using count vectorizer
   apple  banana  orange
0      2       0       2
1      2       2       0
2      2       4       0
3      0       4       2
4      1       4       1

Using tfidf
      apple    banana    orange
0  0.643744  0.000000  0.765241
1  0.707107  0.707107  0.000000
2  0.447214  0.894427  0.000000
3  0.000000  0.859622  0.510931
4  0.233043  0.932173  0.277026




In [3]:
df = pd.read_csv('Reviews.csv')

In [4]:
df.shape

(568454, 10)

In [5]:
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [6]:
pos = df[df['Score']>3]
neg = df[df['Score']<3]
print(pos.shape, neg.shape)

((443777, 10), (82037, 10))


In [7]:
# Balance the classes: keep only as many positive reviews as there are negative ones
pos = pos.head(neg.shape[0])
print(pos.shape, neg.shape)
df = pd.concat([pos,neg])

((82037, 10), (82037, 10))


In [8]:
df = shuffle(df)
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 256417 | 256418 | B0023V65GW | A3KG1RHOM34CXE | J. Ingram | 1 | 2 | 2 | 1316044800 | They seem old, but the date has not passed | These gummy bears are very chewey. They seem ... |
| 17807 | 17808 | B000EVMNO6 | A28DQF01VXVK41 | K. Bryson | 0 | 0 | 5 | 1261526400 | Frog Gummies | I LOVE MY FROG GUMMY CANDY...I even shared it ... |
| 368661 | 368662 | B005K4Q1W2 | A3JOYNYL458QHP | coleridge | 1 | 1 | 1 | 1323907200 | Doesn't taste like apple cider. | Wanted to like it, but doesn't taste like appl... |
| 13672 | 13673 | B000FLRNK4 | A1899PHB3D4QJ7 | John Nazareno "Hi" | 1 | 2 | 5 | 1289347200 | Good Caramel Popcorn | great not the cheap kind - i bought the 6.5 ga... |
| 58869 | 58870 | B000AOOR3C | A2NNCCH475532G | Amazon Addict "Amazon Addict" | 1 | 2 | 5 | 1336089600 | Our golden loves these! | We can't feed our golden meaty chews (he gets ... |


### Preprocessing
Remove all neutral scores (Score = 3) and separate the remaining reviews into positive = 1 and negative = 0.

Preprocessing usually involves:

* Removing additional white spaces
* Replacing emojis with a word representation, for example :) ==> smile
* Removing links from the corpus
* Removing punctuation
* Removing any HTML tags
* Removing duplicate reviews
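
The cleaning steps listed above could be sketched roughly as follows. This is only an illustration and is not applied anywhere in this notebook; the names `clean_review` and `EMOTICONS` are hypothetical, and the regular expressions are deliberately simple.

```python
import re
from string import punctuation

# Hypothetical emoticon map (an assumption for illustration); extend as needed
EMOTICONS = {":)": "smile", ":(": "frown"}

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags such as <br />
    text = re.sub(r"https?://\S+", " ", text)   # strip links
    for emoticon, word in EMOTICONS.items():    # replace emoticons with words
        text = text.replace(emoticon, " " + word + " ")
    text = text.translate(str.maketrans("", "", punctuation))  # strip punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse additional white space

print(clean_review("Great taffy!<br /><br />See http://example.com  now :)"))
# Great taffy See now smile
```

Duplicate reviews are easier to drop at the DataFrame level, for example with `df.drop_duplicates(subset='Text')`.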


For this exercise we will only tokenize the reviews, that is, change "This is a review" to ['this', 'is', 'a', 'review'].
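
A minimal sketch of that tokenization step, assuming a plain regular-expression split (the helper name `tokenize` is hypothetical; in the notebook the sklearn vectorizers do the tokenizing internally):

```python
import re

def tokenize(text):
    # Lowercase the text, then keep every run of word characters as one token
    return re.findall(r"\w+", text.lower())

print(tokenize("This is a review"))  # ['this', 'is', 'a', 'review']
```

Note that the default token pattern of `CountVectorizer` and `TfidfVectorizer` only keeps tokens of two or more characters, so a one-letter word such as "a" would not become a feature there.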

Once the text is 'clean' we will use sklearn:

1. CountVectorizer - Convert a collection of text documents to a matrix of token counts
2. TfidfVectorizer - Convert a collection of raw documents to a matrix of TF-IDF features
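
To make the difference between the two concrete, the sketch below recomputes the TF-IDF values for the toy fruit corpus from the first cell in plain Python. It assumes the default `TfidfVectorizer` settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and Python 3 division; the function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

corpus = [
    'Apple Orange Orange Apple',
    'Apple Banana Apple Banana',
    'Banana Apple Banana Banana Banana Apple',
    'Banana Orange Banana Banana Orange Banana',
    'Banana Apple Banana Banana Orange Banana',
]

def tfidf_rows(docs):
    # Term counts per document (what CountVectorizer produces)
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set(term for c in counts for term in c))
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        weights = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(corpus)
print(vocab)
print([round(v, 6) for v in rows[0]])
```

Under these assumptions the first row should agree with the `TfidfVectorizer` output printed earlier: "orange" occurs in fewer documents than "apple", so it gets the larger weight even though both words appear twice in that document.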

In [9]:

# Drop neutral reviews, then map 1-2 stars to 0 (negative) and 4-5 stars to 1 (positive)
df = df[df['Score'] !=3]
X = df['Text']
y_map = {1:0, 2:0, 4:1, 5:1}
y = df['Score'].map(y_map)

In [10]:
X.head()

256417    These gummy bears are very chewey.  They seem ...
17807     I LOVE MY FROG GUMMY CANDY...I even shared it ...
368661    Wanted to like it, but doesn't taste like appl...
13672     great not the cheap kind - i bought the 6.5 ga...
58869     We can't feed our golden meaty chews (he gets ...
Name: Text, dtype: object

In [11]:
y.head()

256417    0
17807     1
368661    0
13672     1
58869     1
Name: Score, dtype: int64

### Classification
Once the text is processed, the next step is the actual classification. For this exercise we will use a logistic regression classifier. However, there are many other popular classifiers that may perform better:

1. Support Vector Machine and its variants
2. Naive Bayes and its variants
3. Random Forests and its variants

We define a function `text_fit` that takes the documents `X`, the labels `y`, the vectorizer `model`, the classification algorithm `clf_model`, and a flag `coef_show` that prints the top coefficients when set to 1.

In [12]:

c = CountVectorizer(stop_words = 'english')

def text_fit(X, y, model, clf_model, coef_show=1):
    """Vectorize X with `model`, fit `clf_model`, then print recall and the top coefficients."""
    # Turn the raw documents into a sparse document-term matrix
    X_c = model.fit_transform(X)
    print('# features: {}'.format(X_c.shape[1]))
    # train_test_split holds out 25% of the rows by default
    X_train, X_test, y_train, y_test = train_test_split(X_c, y, random_state=0)
    print('# train records: {}'.format(X_train.shape[0]))
    print('# test records: {}'.format(X_test.shape[0]))
    clf = clf_model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    recall = recall_score(y_test, y_pred)
    # acc = clf.score(X_test, y_test)
    print('Model Recall: {}'.format(recall))

    if coef_show == 1:
        # On scikit-learn >= 1.0, use get_feature_names_out() instead
        w = model.get_feature_names()
        coef = clf.coef_.tolist()[0]
        coeff_df = pd.DataFrame({'Word' : w, 'Coefficient' : coef})
        # Largest coefficients first; ties are broken alphabetically
        coeff_df = coeff_df.sort_values(['Coefficient', 'Word'], ascending=[False, True])
        print('')
        print('-Top 20 positive-')
        print(coeff_df.head(20).to_string(index=False))
        print('')
        print('-Top 20 negative-')
        print(coeff_df.tail(20).to_string(index=False))
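
The function reports recall rather than accuracy. As a reminder of what that number measures, here is a minimal plain-Python sketch of recall for the positive class, TP / (TP + FN). The helper name `recall` and the toy labels are made up for illustration; the notebook itself relies on sklearn's `recall_score`.

```python
def recall(y_true, y_pred, positive=1):
    # True positives: positive examples the model also predicted as positive
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    # False negatives: positive examples the model missed
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(recall(y_true, y_pred))  # 3 of the 4 positive reviews were found: 0.75
```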

### Classification Experiments

1. Logistic regression model on word count
2. Logistic regression model on TFIDF
3. Logistic regression model on TFIDF + ngram
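
In the third experiment, `ngram_range=(1,2)` makes the vectorizer emit pairs of adjacent words in addition to single words. The sketch below shows the idea on a short token list; the helper name `word_ngrams` is hypothetical and only approximates what the vectorizer does internally.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens over the list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["highly", "recommend", "taffy"]))
# ['highly', 'recommend', 'taffy', 'highly recommend', 'recommend taffy']
```

Because every distinct word pair becomes its own column, the feature count grows sharply when bigrams are included.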

In [13]:
text_fit(X, y, c, LogisticRegression())

# features: 68923
# train records: 123055
# test records: 41019
Model Recall: 0.903139188595

-Top 20 positive-
Coefficient           Word
   2.596755     pleasantly
   2.589117         hooked
   2.503858    substitutes
   2.312637       drawback
   2.307803      addicting
   2.276742       downside
   2.120486         intend
   2.118353       terrific
   2.095301         resist
   2.060109          penny
   2.047629     mallowmars
   2.019026        worries
   2.010343  emeraldforest
   2.005211      skeptical
   1.996178        iprozon
   1.955764        trainer
   1.953619      astronaut
   1.953122       tiramisu
   1.948077       soothing
   1.930655          saves

-Top 20 negative-
Coefficient            Word
  -2.185022           ruins
  -2.187378         sounded
  -2.188627       revolting
  -2.210964           280mg
  -2.212824          ripoff
  -2.232607         concept
  -2.238236         expired
  -2.253696            dive
  -2.303428         boredom
  -2.395566           

In [14]:
tfidf = TfidfVectorizer(stop_words = 'english')
text_fit(X, y, tfidf, LogisticRegression())

# features: 68923
# train records: 123055
# test records: 41019
Model Recall: 0.896060147439

-Top 20 positive-
Coefficient       Word
  11.175333      great
   9.943246  delicious
   9.706094       best
   8.956714    perfect
   8.200967  excellent
   7.902644      loves
   7.775129     highly
   7.332294       love
   6.986550  wonderful
   6.587490    amazing
   6.244730       good
   6.154312   favorite
   6.033778    awesome
   5.986136     hooked
   5.926621       nice
   5.645311      yummy
   5.603217    pleased
   5.495396     smooth
   5.409916       easy
   5.292767      happy

-Top 20 negative-
Coefficient            Word
  -5.297115        thinking
  -5.461766           sorry
  -5.615481           waste
  -5.647884           money
  -5.710584           threw
  -5.827271      disgusting
  -5.871758          hoping
  -5.928343          return
  -6.043995         thought
  -6.261503           stale
  -6.323965           bland
  -6.898364            weak
  -7.113479  disappoin

In [15]:
tfidf_n = TfidfVectorizer(ngram_range=(1,2),stop_words = 'english')
text_fit(X, y, tfidf_n, LogisticRegression())

# features: 1932701
# train records: 123055
# test records: 41019
Model Recall: 0.909827661964

-Top 20 positive-
Coefficient              Word
  17.138644             great
  14.235385              best
  13.696116         delicious
  12.037183           perfect
  11.314099             loves
  11.251397              love
  10.542300         excellent
   9.337090         wonderful
   9.210421              good
   9.032193          favorite
   8.896408              nice
   7.896643              easy
   7.819436           amazing
   7.376214            highly
   7.328306             happy
   7.102125           awesome
   6.813112             tasty
   6.789930  highly recommend
   6.724696             yummy
   6.681264            smooth

-Top 20 negative-
Coefficient            Word
  -6.364561           china
  -6.473958           waste
  -6.501778           threw
  -6.762992          hoping
  -6.995502           maybe
  -7.290595  disappointment
  -7.424297          return
  -7.465087  

### Topic Modelling

Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA) and Singular Value Decomposition (SVD) will be used to find topics in a document collection. For each derived topic, the output assigns a numeric label and prints the top words in that topic.

None of these algorithms can determine the number of topics automatically, so this value must be set before running them. Comprehensive documentation on the available parameters is provided for [NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html), [LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) and [SVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
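
To give a feel for what a matrix-factorization topic model computes, here is a tiny plain-Python sketch of NMF using multiplicative updates. It factorizes a document-term matrix `V` into a document-topic matrix `W` and a topic-term matrix `H`, with the number of topics passed in explicitly. This is an illustration only: the helper names are hypothetical, the update rule is one common choice and not necessarily the solver scikit-learn uses, and the code is far too slow for real data.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    # Frobenius norm of V - W*H (the reconstruction error)
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    # Start from small positive random matrices
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Term counts for the toy fruit corpus (columns: apple, banana, orange)
V = [[2, 0, 2], [2, 2, 0], [2, 4, 0], [0, 4, 2], [1, 4, 1]]
W, H = nmf(V, n_topics=2)
print(len(W), len(W[0]), len(H), len(H[0]))  # 5 2 2 3
print(frob_error(V, W, H))
```

Each row of `H` is one topic expressed as weights over the vocabulary, and each row of `W` says how strongly a document belongs to each topic. These are the two matrices that `display_topics` consumes.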

In [23]:
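def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words and the most representative documents for each topic.

    H is the topic-term matrix (model.components_) and W is the
    document-topic matrix (model.transform(...)).
    """
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Indices of the no_top_words largest term weights, largest first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print("\n")
        # Documents with the highest weight for this topic, highest first
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])
            print("\n")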
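
The slice `argsort()[:-no_top_words - 1:-1]` is compact but easy to misread. A plain-Python equivalent, with a hypothetical helper `top_indices` and made-up weights, shows what it selects:

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # Ascending order of indices, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n items
    return order[:-n - 1:-1]

weights = [0.1, 0.7, 0.0, 0.4]
feature_names = ["apple", "banana", "orange", "taffy"]
print([feature_names[i] for i in top_indices(weights, 2)])  # ['banana', 'taffy']
```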

In [17]:
documents = list(X)[0:10000]
print(len(documents))

10000


In [18]:
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA needs raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

In [19]:
no_topics = 5

In [20]:
# Run NMF
# Note: newer scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H`
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

In [21]:
# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

In [24]:
# Run LSI (truncated SVD) on the raw term counts
lsi_model = TruncatedSVD(n_components=no_topics, n_iter=7, random_state=42).fit(tf)
lsi_W = lsi_model.transform(tf)
lsi_H = lsi_model.components_

In [27]:
no_top_words = 10
no_top_documents = 4
print("NMF Topics \n\n")
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)

NMF Topics 


Topic 0:
br ingredients water sugar milk product amazon com natural fat


3/10/11<br /><br />  Delicious variety of chocolate candy!<br /><br />  Will definately buy again!<br /><br />  Seller is A++++++++++!<br /><br />  esantana32


The Mint Milano cookies are wonderful. They have a light and crispy texture with a mild mint chocolate taste. The flavor reminds me a lot of the chocolate mint Girl Scout cookies although the texture of the 2 cookies are very different (the girl scout cookie is much denser).<br /><br />Unfortunately the cookies usually arrive with very few intact and several of mine were even broken into multiple pieces. They still taste pretty good though.<br /><br />Nutritional Facts copied from the actual package (July 2011):<br /><br />Serving size 2 cookies (25g/ 0.9oz)<br /><br />Servings per container about 8<br /><br />Calories per serving 130<br /><br />Calories from fat 70<br /><br />Total Fat 7g<br /><br />Saturated Fat 5g<br /><br />Trans Fat 0g<

In [26]:
print("\n\nLDA Topics \n\n")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)



LDA Topics 


Topic 0:
br like taste coffee flavor good tea just chocolate cup


Wonderful, smooth, satisfying decaf coffee with a fabulously rich flavor.  No chemical aftertaste, like many decafs I've tried.  Apparently, the Swiss Water Process used to decaffeinate this coffee leaves flavor intact, something I've found terribly disappointing in other brands.<br /><br />I've been VERY impressed by the flavor dimensions of this Riviera Sunset blend; served it to some neighbors (who simply won't touch decaf coffee) who were pleasantly surprised to find they actually enjoyed a cup of decaf.  No bitter notes; my husband, who usually complains fervently about the flavor of decaf, has been asking for a cup of this every evening, after dinner.  Believe me, this alone is QUITE an endorsement!<br /><br />I love using my <a href="http://www.amazon.com/gp/product/B0011528S0">Zojirushi CD-WBC30 Micom Electric 3-Liter Water Boiler and Warmer, Champagne Gold</a> for coffee, tea, and cooking.  I br

In [28]:
print("\n\nLSI Topics \n\n")
display_topics(lsi_H, lsi_W, tf_feature_names, documents, no_top_words, no_top_documents)



LSI Topics 


Topic 0:
br food like product just good taste coffee dog flavor


The taste of SodaStream diet cola does not match that of mainstream brands.  This becomes especially noticeable for frequent drinkers of Diet Coke or Diet Pepsi.  Yet that isn't the main problem with the SodaStream offering.  The truth is simply that consumers are paying way too much for C02 refills -- sixteen times (16x) too much.<br /><br />While the general idea of at-home carbonation is solid, consumers should understand that they are grossly overpaying for SodaStream carbon dioxide refills. In fact, the prices charged are sixteen times (16x) wholesale costs or many, many times more expensive than the prices a restaurant would pay. In Europe, frugal consumers are aware of such markups and actually purchase their own restaurant-sized CO2 refills. In the US, we aren't so frugal.<br /><br />Other than the outrageous costs for the CO2 refills, there are other issues with the SodaStream Fountain Jet. One i