In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from string import punctuation
from sklearn import svm
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from nltk import ngrams
from itertools import chain

from sklearn.metrics import recall_score, f1_score

In [26]:
df = pd.read_csv('Reviews.csv')

In [27]:
df.shape

(568454, 10)

In [28]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [29]:
pos = df[df['Score']>3]
neg = df[df['Score']<3]
print(pos.shape, neg.shape)

(443777, 10) (82037, 10)


In [30]:
pos = pos.head(neg.shape[0])
print(pos.shape, neg.shape)
df = pd.concat([pos,neg])

(82037, 10) (82037, 10)


In [31]:
df = shuffle(df)
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
106462,106463,B000F4ISDW,A63IP4HHUQYYD,"Kimn Gollnick ""writermom""",0,1,1,1336694400,"Damaged in shipment, 1 can leaking, 3 dented!",I cannot believe how badly these cans were dam...
349279,349280,B000CQC064,A2V18JN5XI2GOB,C. Woods,1,2,2,1275350400,Not a fan,"Not familiar with what ""chai"" tea is supposed ..."
262767,262768,B002V0XB62,ADWSD45P3NZGU,WakkyWabbit,0,0,1,1331510400,Worst popcorn I have ever eaten,I ordered two types of popcorn at one time. On...
26680,26681,B000YSS7EO,A3B3GI2DXMSCF6,"nycspagirl ""nycspagirl""",15,22,1,1241308800,Read ingredients carefully.,I bought the pack of 6 box. When it arrived a...
168440,168441,B0001ES9F8,AZ996YCGDRLA6,M. Chan,0,0,1,1285891200,Flavor Change? Watery like gas station coffee,I always enjoy Senseo's coffee and have tried ...


## Preprocessing

Remove all neutral scores and separate the remaining reviews into positive = 1 and negative = 0.

Preprocessing usually involves:
1. Removing additional white spaces
2. Replacing emojis with a word representation, for example :) ==> smile
3. Removing links from the corpus
4. Removing punctuation
5. Removing any HTML tags
6. Removing duplicate reviews
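
The steps listed above are not applied in this notebook, but a minimal sketch of them might look like the following. The `EMOTICONS` table and the function names `clean_text` and `drop_duplicates` are hypothetical, introduced only for illustration, and the regular expressions are deliberately simple.

```python
import html
import re

# Hypothetical emoticon-to-word table; extend as needed.
EMOTICONS = {":)": "smile", ":-)": "smile", ":(": "sad", ":-(": "sad"}

def clean_text(text):
    """Apply steps 1-5 from the list to a single review string."""
    text = re.sub(r"<[^>]+>", " ", text)                 # 5. strip HTML tags such as <br />
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 3. drop links
    for emoticon, word in EMOTICONS.items():             # 2. emoticons -> words
        text = text.replace(emoticon, " " + word + " ")
    text = re.sub(r"[^\w\s]", " ", text)                 # 4. remove punctuation
    return re.sub(r"\s+", " ", text).strip()             # 1. collapse extra whitespace

def drop_duplicates(reviews):
    """Step 6: keep the first occurrence of each review, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        if review not in seen:
            seen.add(review)
            unique.append(review)
    return unique

print(clean_text("Great taffy :)<br />See http://example.com  now!"))
```

On a pandas Series such as `df['Text']`, the same function could be applied with `.apply(clean_text)`.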

Here is a good blog on how to [process text](http://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/)

For this exercise we will only tokenize the reviews, that is, change `"This is a review"` to `['this', 'is', 'a', 'review']`.
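
As a rough illustration of tokenization, the sketch below lowercases a string and splits it into word tokens with a regular expression. The function name `tokenize` and the token pattern are assumptions made for this example, not the exact rule the vectorizers use.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("This is a review"))
# ['this', 'is', 'a', 'review']
```

Note that sklearn's vectorizers tokenize internally, and their default token pattern only keeps tokens of two or more characters, so a one-letter word such as `a` would be dropped there.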

Once the text is 'clean' we will use sklearn:
1. CountVectorizer - Convert a collection of text documents to a matrix of token counts
2. TfidfVectorizer - Convert a collection of raw documents to a matrix of TF-IDF features
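
To make the difference between the two vectorizers concrete, here is a small sketch on a toy corpus (the three documents are invented for illustration). Both produce a sparse matrix with one row per document and one column per term; the first holds raw counts, the second holds TF-IDF weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration.
docs = ["great taffy great price", "stale taffy", "great coffee"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)    # sparse matrix, shape (n_docs, n_terms)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)   # same shape; rows are L2-normalised by default

great = count_vec.vocabulary_["great"]    # column index of the term "great"
print(counts.shape)
print(counts[0, great])                   # raw count of "great" in the first document
print(weights[0].toarray().round(2))      # TF-IDF weights of the first document
```

Terms that appear in many documents (here `great` and `taffy`) receive a lower IDF factor than terms that appear in only one.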


In [32]:
df = df[df['Score'] !=3]
X = df['Text']
y_map = {1:0, 2:0, 4:1, 5:1}
y = df['Score'].map(y_map)

In [33]:
X.head()

106462    I cannot believe how badly these cans were dam...
349279    Not familiar with what "chai" tea is supposed ...
262767    I ordered two types of popcorn at one time. On...
26680     I bought the pack of 6 box.  When it arrived a...
168440    I always enjoy Senseo's coffee and have tried ...
Name: Text, dtype: object

In [34]:
y.head()

106462    0
349279    0
262767    0
26680     0
168440    0
Name: Score, dtype: int64

## Classification

Once the text is processed, the next step is to do the actual classification. For this exercise we will be using a Logistic Regression classifier. However, there are many other popular classifiers that may perform better:
1. Support Vector Machine and its variants
2. Naive Bayes and its variants
3. Random Forests and its variants 
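
As a hedged sketch of how one of these alternatives could be tried, the example below fits a Naive Bayes model and a linear SVM on a toy set of labelled reviews. The texts and labels are invented for illustration, and accuracy on the training data is reported only to show that the estimators share the same `fit`/`score` interface.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy labelled reviews, invented for illustration (1 = positive, 0 = negative).
texts = ["great taste love it", "delicious and fresh", "awful stale waste",
         "terrible bland product", "love this great coffee", "waste of money awful"]
labels = [1, 1, 0, 0, 1, 0]

features = CountVectorizer().fit_transform(texts)

# Any estimator exposing fit/predict can be swapped in for LogisticRegression.
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(features, labels)
    train_accuracy = clf.score(features, labels)   # accuracy on the training data only
    print(type(clf).__name__, train_accuracy)
```

Random forests do not expose a `coef_` attribute, so a per-word coefficient table is only available for linear models.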

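We created a function, `text_fit`, that takes the review texts `X`, the labels `y`, the vectorizer `model` and the classification algorithm `clf_model`, as well as a flag `coef_show` that prints the top coefficients when set to 1. The function vectorizes the text, splits the result into training and test sets, fits the classifier and reports recall on the test set.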

In [35]:
c = CountVectorizer(stop_words = 'english')

def text_fit(X, y, model,clf_model,coef_show=1):
    
    X_c = model.fit_transform(X)
    print('# features: {}'.format(X_c.shape[1]))
    X_train, X_test, y_train, y_test = train_test_split(X_c, y, random_state=0)
    print('# train records: {}'.format(X_train.shape[0]))
    print('# test records: {}'.format(X_test.shape[0]))
    clf = clf_model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    recall = recall_score(y_test,y_pred)
#     acc = clf.score(X_test, y_test)
    print ('Model Recall: {}'.format(recall))
    
    if coef_show == 1: 
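        # Note: get_feature_names() was removed in scikit-learn 1.2;
        # newer versions provide get_feature_names_out() instead.
        w = model.get_feature_names()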
        coef = clf.coef_.tolist()[0]
        coeff_df = pd.DataFrame({'Word' : w, 'Coefficient' : coef})
        coeff_df = coeff_df.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
        print('')
        print('-Top 20 positive-')
        print(coeff_df.head(20).to_string(index=False))
        print('')
        print('-Top 20 negative-')        
        print(coeff_df.tail(20).to_string(index=False))

## Classification Experiments

1. Logistic regression model on word count
2. Logistic regression model on TFIDF
3. Logistic regression model on TFIDF + ngram

N-gram definition:
N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). Taken from [here](http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html)
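
The sliding-window idea in the definition can be sketched in a few lines. The helper name `word_ngrams` and the example sentence are assumptions for illustration only.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, sliding one word at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not as good as advertised".split()
print(word_ngrams(tokens, 1))
# ['not', 'as', 'good', 'as', 'advertised']
print(word_ngrams(tokens, 2))
# ['not as', 'as good', 'good as', 'as advertised']
```

Setting `ngram_range=(1,2)` on a vectorizer asks it to use both the unigrams and the bigrams as features, which is why the feature count grows so sharply in the third experiment.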

In [36]:
text_fit(X, y, c, LogisticRegression())

# features: 68923
# train records: 123055
# test records: 41019
Model Recall: 0.907650432585

-Top 20 positive-
Coefficient           Word
   2.694677     pleasantly
   2.586919         hooked
   2.442019       perruche
   2.241181       drawback
   2.213282  deliciousness
   2.207249         resist
   2.176302       downside
   2.138309      skeptical
   2.105337       insisted
   2.076575      addicting
   2.070123        welcome
   2.064151      excellant
   2.029976  emeraldforest
   2.013503          penny
   2.013448        worries
   1.994691        iprozon
   1.969112          foamy
   1.916526      delighted
   1.910513       tiramisu
   1.891085      delicious

-Top 20 negative-
Coefficient            Word
  -2.156156         assumed
  -2.157149          ruined
  -2.192306            mojo
  -2.214745          ripoff
  -2.227076      flavorless
  -2.285248       glorified
  -2.359231           awful
  -2.370907        terrible
  -2.380938       sleepless
  -2.399603          f

In [37]:
tfidf = TfidfVectorizer(stop_words = 'english')
text_fit(X, y, tfidf, LogisticRegression())

# features: 68923
# train records: 123055
# test records: 41019
Model Recall: 0.89963060173

-Top 20 positive-
Coefficient       Word
  11.242273      great
   9.976446  delicious
   9.487620       best
   8.879661    perfect
   8.184955      loves
   8.101420  excellent
   7.296220  wonderful
   7.290039     highly
   7.169392       love
   7.042965    amazing
   6.615147       good
   6.145776       nice
   6.025466    awesome
   5.997930     hooked
   5.935291   favorite
   5.819390      yummy
   5.749754    pleased
   5.576403     smooth
   5.486892      happy
   5.464887       glad

-Top 20 negative-
Coefficient            Word
  -5.082552           hopes
  -5.422343           money
  -5.548397          hoping
  -5.648424           sorry
  -5.847056           threw
  -5.869532          return
  -5.880412         thought
  -5.889446           waste
  -5.909911           stale
  -6.032231      disgusting
  -6.516015           bland
  -7.014679  disappointment
  -7.051950            

In [38]:
tfidf_n = TfidfVectorizer(ngram_range=(1,2),stop_words = 'english')
text_fit(X, y, tfidf_n, LogisticRegression())

# features: 1932701
# train records: 123055
# test records: 41019
Model Recall: 0.913288616701

-Top 20 positive-
Coefficient       Word
  17.175395      great
  14.000333       best
  13.785201  delicious
  11.903692    perfect
  11.450009      loves
  11.026022       love
  10.384251  excellent
   9.648505       good
   9.646492  wonderful
   9.104985       nice
   8.780367   favorite
   8.235120    amazing
   7.572120      happy
   7.540856       easy
   7.091795    awesome
   7.063638     highly
   7.001554      tasty
   6.841660      yummy
   6.835026     smooth
   6.723812    pleased

-Top 20 negative-
Coefficient            Word
  -6.457616      disgusting
  -6.585610          hoping
  -6.691178           threw
  -6.797792           waste
  -6.859246           maybe
  -7.111126  disappointment
  -7.377162          return
  -7.653478             bad
  -7.685543           bland
  -7.726872           money
  -7.893960           stale
  -8.591642            weak
  -8.621319         

## Topic Modelling

Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA) and Singular Value Decomposition (SVD) algorithms will be used to find topics in a document collection. For each derived topic, the output assigns a numeric label and prints the top words in that topic.

The algorithms are not able to automatically determine the number of topics, so this value must be set when running them. Comprehensive documentation on the available parameters is provided for [NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html), [LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) and [SVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
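
Each of these models yields a topic-term matrix (called `H` in this notebook, one row per topic) and a document-topic matrix (`W`, one row per document). The sketch below shows, on small hypothetical matrices, how the top words of a topic and its most representative documents can be read off with `argsort`. The numbers, vocabulary and helper names are invented for illustration.

```python
import numpy as np

# Hypothetical topic-term matrix H: 2 topics over a 5-word vocabulary.
feature_names = ["coffee", "dog", "food", "tea", "treats"]
H = np.array([[0.9, 0.0, 0.1, 0.7, 0.0],
              [0.0, 0.8, 0.6, 0.0, 0.5]])
# Hypothetical document-topic matrix W: 3 documents over the same 2 topics.
W = np.array([[0.2, 0.9],
              [0.8, 0.1],
              [0.5, 0.4]])

def top_words(topic_row, feature_names, n):
    """Names of the n highest-weighted terms in one row of H."""
    top_idx = topic_row.argsort()[::-1][:n]
    return [feature_names[i] for i in top_idx]

def top_documents(W, topic_idx, n):
    """Indices of the n documents that load most heavily on one topic."""
    return np.argsort(W[:, topic_idx])[::-1][:n].tolist()

for k, row in enumerate(H):
    print("Topic %d:" % k, top_words(row, feature_names, 2), top_documents(W, k, 1))
```

The number of rows in `H` is exactly the number of topics requested, which is why that value has to be chosen up front.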




In [39]:
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    # H: topic-term matrix (one row per topic); W: document-topic matrix
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Words with the largest weights in this topic, highest first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print()
        # Documents that load most heavily on this topic
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])

In [40]:
documents = list(X)[0:10000]
print(len(documents))

10000


In [41]:

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# Note: get_feature_names() was removed in scikit-learn 1.2 (get_feature_names_out() replaces it)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA is a probabilistic model of word counts, so it is fitted on raw term counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

In [42]:
no_topics = 5

In [43]:
# Run NMF (note: the `alpha` argument was replaced by alpha_W / alpha_H in newer scikit-learn releases)
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

In [44]:
# Run LDA (the n_topics argument was renamed n_components in scikit-learn 0.19)
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_



In [45]:
# Run truncated SVD (latent semantic indexing, LSI)
lsi_model = TruncatedSVD(n_components=no_topics, n_iter=7, random_state=42).fit(tf)
lsi_W = lsi_model.transform(tf)
lsi_H = lsi_model.components_


In [46]:
no_top_words = 10
no_top_documents = 4
print("NMF Topics \n\n")
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)


NMF Topics 


Topic 0:
like taste product good flavor great just chocolate really love

I decided to try this product because it was cheap and I'm usually a vanilla person. I love vanilla everything and because I had already tried the chocolate version and wanted a change I thought this would be perfect. Turns out it wasn't. The taste is horrible. Even thought there's no aspartame it tastes like there's artificial sweeteners in it. The first time I tried it it made me gag. The only way I was able to finish the box was to put it in my shaker and just take very big gulps without breathing. That way it was over in less than 30sec. I've tried the chocolate and at fist didn't like it, but this stuff makes the chocolate taste like candy. The chocolate tastes just like chocolate (more like hot cocoa), but without  added sugar. This stuff does not resemble vanilla ice cream at all.<br />  I'm not gonna completely bash it however, because how many protein shakes actually taste good? Despite the

In [47]:
print("\n\nLDA Topics \n\n")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)




LDA Topics 


Topic 0:
br plastic salt toy almonds bottle ball magnesium blue filters

As I have problems with sleep at night and tried all sorts of things which are most commonly recommended to improve it, I tought I try this well rated product. I tried it in all the various recommended dosages. But it didn't make the slightest difference for me. Thus I guess that it probably only works if the cause of the unrest is a magnesium deficiency.<br />Reportedly:<br />"With supplements, overdose is possible, however, particularly in people with poor renal function; occasionally, with use of high cathartic doses of magnesium salts, severe hypermagnesemia has been reported to occur even without renal dysfunction".<br />"Safety Issues<br />The US government has set the following upper limits for use of magnesium supplements:<br />Children<br />1-3 years: 65 mg<br />4-8 years: 110 mg<br />Adults: 350 mg<br />Pregnant or Nursing Women: 350 mg"<br />"It has been called nature's calcium channel b

In [48]:
print("\n\nLSI Topics \n\n")
display_topics(lsi_H, lsi_W, tf_feature_names, documents, no_top_words, no_top_documents)



LSI Topics 


Topic 0:
br like food product just taste good flavor coffee really

The taste of SodaStream diet cola does not match that of mainstream brands.  This becomes especially noticeable for frequent drinkers of Diet Coke or Diet Pepsi.  Yet that isn't the main problem with the SodaStream offering.  The truth is simply that consumers are paying way too much for C02 refills -- sixteen times (16x) too much.<br /><br />While the general idea of at-home carbonation is solid, consumers should understand that they are grossly overpaying for SodaStream carbon dioxide refills. In fact, the prices charged are sixteen times (16x) wholesale costs or many, many times more expensive than the prices a restaurant would pay. In Europe, frugal consumers are aware of such markups and actually purchase their own restaurant-sized CO2 refills. In the US, we aren't so frugal.<br /><br />Other than the outrageous costs for the CO2 refills, there are other issues with the SodaStream Fountain Jet. One