# NLP Final Project: Topic Modeling

# Matthew Przybyla

# Table of Contents

Movie Review Data

Hypothesis

Results A

Summary A

Results B

Summary B

Conclusion

# Motivation

The goal is to gain hands-on experience with NMF and/or LDA, and with the Python libraries needed to analyze and visualize the results.

# Business Understanding

What is topic modeling?

A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.

The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. It involves techniques for dimensionality reduction and unsupervised learning such as LDA, SVD, and autoencoders.
Source: Wikipedia
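
One way to picture the framework described above is as a matrix factorization: a document-term matrix is explained by a document-topic matrix multiplied by a topic-term matrix. The sketch below uses hand-picked toy numbers (the vocabulary, the two topics, and every weight are illustrative assumptions, not values from the review data) to show what "each document's balance of topics" means.

```python
import numpy as np

# Hypothetical 5-word vocabulary and two hand-made topics (illustration only).
vocab = ["plot", "story", "actor", "scary", "vampire"]

# topic_term[k, j]: weight of word j in topic k.
topic_term = np.array([
    [4.0, 3.0, 2.0, 0.0, 0.0],   # topic 0: story structure
    [0.0, 0.0, 1.0, 3.0, 4.0],   # topic 1: horror
])

# doc_topic[i, k]: how strongly document i uses topic k.
doc_topic = np.array([
    [2.0, 0.0],   # document 0: only topic 0
    [1.0, 1.0],   # document 1: an even mix
    [0.0, 3.0],   # document 2: only topic 1
])

# The document-term matrix that a topic model tries to explain.
doc_term = doc_topic @ topic_term

# Normalizing each row gives that document's balance of topics.
topic_balance = doc_topic / doc_topic.sum(axis=1, keepdims=True)

def top_words(topic_weights, n=2):
    """Return the n highest-weighted vocabulary words for one topic."""
    order = np.argsort(topic_weights)[::-1][:n]
    return [vocab[j] for j in order]

print(topic_balance)
print([top_words(t) for t in topic_term])
```

A real topic model works in the other direction: it starts from the document-term matrix and estimates the two factors.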

This type of machine learning can be applied across many industries. For example, a marketing team could uncover valuable
insights about its customers, such as groups of customers that share similar attributes.

In this analysis, movie reviews are inspected to generate topics.

In [372]:
# import modules and libraries
# print out the versions of libraries, python version, and environment

import warnings
warnings.filterwarnings('ignore')

# modules whose names are referenced by the version prints below
import os
import string
import urllib.request
import bs4
from collections import Counter

import sys; print("Python", sys.version)
import pandas as pd; print("pandas", pd.__version__)
import numpy as np; print("numpy", np.__version__)
import sklearn; print("sklearn", sklearn.__version__)
import requests; print("requests", requests.__version__)
import nltk; print("nltk", nltk.__version__)
import re; print('re', re.__version__)
from requests import get; print("requests", requests.__version__)
from urllib.parse import urljoin; print("urllib.parse", urllib.request.__version__)
from bs4 import BeautifulSoup; print("bs4", bs4.__version__)
import nltk.stem.snowball; print("nltk.stem.snowball", nltk.__version__)
import spacy; print("spacy", spacy.__version__)
# NOTE: this prints spaCy's version under the pyLDAvis label (pyLDAvis.__version__ gives the real one)
import pyLDAvis.sklearn; print("pyLDAvis", spacy.__version__)
import plotly; print("plotly", plotly.__version__)
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from plotly.offline import init_notebook_mode, iplot
from plotly import tools
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff

# initialise plotly's notebook mode only after it has been imported
init_notebook_mode(connected=True)

Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
pandas 0.20.3
numpy 1.15.4
sklearn 0.20.0
requests 2.14.2
nltk 3.3
re 2.2.1
requests 2.14.2
urllib.parse 3.6
bs4 4.6.3
nltk.stem.snowball 3.3
spacy 2.0.17
pyLDAvis 2.0.17
plotly 3.4.2


# Review Data

In [373]:
hw7data = pd.read_csv('/Users/MatthewPrzybyla/Downloads/hw7reviewdata.csv',encoding="latin-1")

In [374]:
reviewdata = hw7data.drop(['Unnamed: 0','type','label', 'file'],axis=1)
reviewdata.columns = ["review"]
reviewdata.head()

Unnamed: 0,review
0,Once again Mr. Costner has dragged out a movie...
1,This is an example of why the majority of acti...
2,"First of all I hate those moronic rappers, who..."
3,Not even the Beatles could write songs everyon...
4,Brass pictures (movies is not a fitting word f...


In [375]:
moviewReviews = reviewdata[:1000]

In [376]:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [377]:
stemmer = SnowballStemmer("english")

In [378]:
totalvocab_stemmed = []
totalvocab_tokenized = []

for i in moviewReviews['review']:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'review', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [379]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 228823 items in vocab_frame


In [389]:
def tokens(x):
    return x.split(',')

In [390]:
tfidf_vect= TfidfVectorizer( tokenizer=tokens ,use_idf=True, smooth_idf=True, sublinear_tf=False)

In [391]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer()

tfidf_matrix = tfidf_vectorizer.fit_transform(moviewReviews['review']) # fit the vectorizer to the reviews

In [392]:
terms = tfidf_vectorizer.get_feature_names()

In [393]:
dataframe = pd.DataFrame(moviewReviews['review'])

In [394]:
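# astype returns a new Series, so assign it back for the conversion to take effect
dataframe['review'] = dataframe['review'].astype(str)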
dataframe.head()

Unnamed: 0,review
0,Once again Mr. Costner has dragged out a movie...
1,This is an example of why the majority of acti...
2,"First of all I hate those moronic rappers, who..."
3,Not even the Beatles could write songs everyon...
4,Brass pictures (movies is not a fitting word f...


In [395]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)
print("Number of Stop Words wrt to spaCy is: ", len(stopwords))

Number of Stop Words wrt to spaCy is:  305


In [396]:
dataframe['review'] = dataframe['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)
                                                                   and word not in (punctuations)]))

In [397]:
dataframe.head()

Unnamed: 0,review
0,Once Mr. Costner dragged movie far longer nece...
1,This example majority action films same. Gener...
2,"First I hate moronic rappers, could'nt act gun..."
3,"Not Beatles write songs liked, Walter Hill mop..."
4,Brass pictures (movies fitting word them) some...
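
The sample above still contains capitalized stop words such as "Once" and "This", because the filter compares raw tokens against a lowercase stop-word list, and punctuation attached to a word (for example "rappers,") is never stripped. A minimal sketch of a case-insensitive variant follows. The small `STOP` set and the helper name `remove_stop_words` are stand-ins for illustration and are not part of the original notebook, which uses spaCy's `STOP_WORDS`.

```python
import string

# Stand-in stop-word set for illustration; the notebook uses spaCy's STOP_WORDS.
STOP = {"once", "again", "has", "out", "a", "this", "is", "an", "of", "the"}

def remove_stop_words(text, stop_words=STOP):
    """Drop stop words case-insensitively and strip surrounding punctuation."""
    kept = []
    for word in text.split():
        core = word.strip(string.punctuation)  # "Mr." -> "Mr"
        if core and core.lower() not in stop_words:
            kept.append(core)
    return " ".join(kept)

print(remove_stop_words("Once again Mr. Costner has dragged out a movie"))
```

Switching to a filter like this would change the token counts, so the topics reported below would no longer match exactly.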


# Start of Final Project Analysis

# Hypothesis 

The hypothesis is:

H0: LDA = NMF = LSI models.

Ha: LDA ≠ NMF ≠ LSI models.

The null hypothesis is that all three models produce the same topics. The alternative hypothesis is that they produce different topics.
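
The hypothesis above is stated informally, and the comparison that follows is a qualitative one. One hedged way to make "the models are the same" measurable is to compare the top-term sets of their topics. The sketch below computes, for each topic of one model, its best Jaccard overlap with any topic of another model. The function names and the toy term lists are assumptions for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, the highest Jaccard overlap with any topic in topics_b."""
    return [max(jaccard(ta, tb) for tb in topics_b) for ta in topics_a]

# Toy top-term lists standing in for two models' topics.
model_a = [["movie", "watch", "good"], ["film", "director", "acting"]]
model_b = [["movie", "good", "bad"], ["bone", "eater", "vampire"]]

scores = best_matches(model_a, model_b)
print(scores)
```

Scores near 1 would suggest that the two models recover similar topics, and scores near 0 that they do not.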

In [398]:
# Creating a vectorizer
# raw string so that the backslashes in the regular expression are not treated as string escapes
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(dataframe["review"])

In [399]:
NUM_TOPICS = 10

In [400]:
%%time

# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
CPU times: user 4.66 s, sys: 47.6 ms, total: 4.71 s
Wall time: 4.76 s


In [401]:
%%time

# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized)

CPU times: user 785 ms, sys: 286 ms, total: 1.07 s
Wall time: 402 ms


In [402]:
%%time

# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)

CPU times: user 50.5 ms, sys: 49.9 ms, total: 100 ms
Wall time: 28.1 ms


In [403]:
# Function for printing the top keywords of each topic
def selected_topics(model, vectorizer, top_n=10):
    feature_names = vectorizer.get_feature_names()  # look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # indices of the top_n largest weights, largest first
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])
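
The slice `topic.argsort()[:-top_n - 1:-1]` in `selected_topics` is compact but easy to misread: `argsort` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end for `top_n` items. A small sketch with made-up weights (the array and the term names are hypothetical):

```python
import numpy as np

terms = ["plot", "movie", "film", "story", "scene"]   # hypothetical vocabulary
topic = np.array([0.2, 0.9, 0.7, 0.4, 0.1])           # hypothetical topic weights
top_n = 3

ascending = topic.argsort()              # indices from smallest to largest weight
top_idx = ascending[:-top_n - 1:-1]      # last top_n indices, largest weight first

top_terms = [(terms[i], float(topic[i])) for i in top_idx]
print(top_terms)
```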

# Results A

The results below are for the LDA, NMF, and LSI models.

Shown are the top 10 keywords, with their weights, for each topic of each model.

In [404]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 0:
[('film', 316.6333877293193), ('characters', 130.26002853659293), ('story', 117.01883583481029), ('like', 113.23038089420484), ('movie', 101.68122410135862), ('scene', 84.29458305935508), ('character', 77.75584830793295), ('original', 66.58630359650287), ('films', 63.31735638507456), ('plot', 61.59394318392143)]
Topic 1:
[('belushi', 20.512740897375025), ('john', 20.29519094279515), ('randy', 19.393197244815898), ('wayne', 14.556713636237834), ('woodward', 11.354662578855933), ('sally', 10.531874892091), ('gang', 7.660474701480918), ('rides', 4.9706535118789015), ('westerns', 4.385460706067212), ('rogers', 4.181141166141913)]
Topic 2:
[('bishop', 45.62290364079268), ('larry', 30.871996001096058), ('hopper', 21.907662347813705), ('squad', 21.713045398612298), ('mod', 20.338072358796452), ('prison', 17.71837820254181), ('dennis', 17.541729053696777), ('ribisi', 17.090029779072218), ('danes', 17.064820444333193), ('tarantino', 16.63005990583501)]
Topic 3:
[('movie', 18

In [405]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)

NMF Model:
Topic 0:
[('movie', 11.772141033160667), ('watch', 0.6337236337093755), ('good', 0.6260703939568093), ('movies', 0.5379189333072966), ('know', 0.5026188660509258), ('sandra', 0.4528432887439634), ('people', 0.38748370488888173), ('seen', 0.38267118818077905), ('plot', 0.3615352527976532), ('think', 0.34381770150318447)]
Topic 1:
[('film', 7.908390190787888), ('films', 0.9313872939136315), ('time', 0.44510572136533044), ('vampire', 0.40379738335814175), ('director', 0.38570706933865506), ('fact', 0.2984809711856956), ('seen', 0.2950347288977789), ('feel', 0.2744653574802214), ('acting', 0.26258986771175563), ('think', 0.23160157676406895)]
Topic 2:
[('like', 7.420028479824756), ('guy', 1.0192567307783096), ('time', 1.010709534951721), ('look', 0.8880565605681037), ('people', 0.878533065430672), ('scene', 0.7388777741804591), ('way', 0.7070121784954931), ('prison', 0.6879990210115016), ('stupid', 0.5699815984858847), ('thing', 0.5664755308034349)]
Topic 3:
[('bone', 3.93096779

In [406]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)

LSI Model:
Topic 0:
[('movie', 0.6836831252720709), ('film', 0.3261742781421991), ('like', 0.2403776806432977), ('bad', 0.16116795788107138), ('good', 0.13473480762157128), ('don', 0.11103373803501852), ('people', 0.11089690751168849), ('time', 0.10295162050797195), ('story', 0.0960784218661852), ('movies', 0.08910685893954524)]
Topic 1:
[('movie', 0.6016493998621232), ('sandra', 0.035476616163900486), ('bell', 0.013359596074746084), ('bullock', 0.012970223243172239), ('sick', 0.011845231500824439), ('himesh', 0.011145598012606697), ('spartans', 0.009649040651411938), ('scary', 0.009405125126497193), ('drake', 0.009350414289206252), ('stop', 0.00863902207017968)]
Topic 2:
[('like', 0.4327338251732372), ('people', 0.14516013937403333), ('don', 0.12018733450516578), ('way', 0.11550859759731993), ('story', 0.1088372724196813), ('guy', 0.10722501525737184), ('time', 0.09451004463104283), ('bone', 0.09339086973517999), ('eater', 0.08687372264167902), ('bad', 0.08516920113070442)]
Topic 3:
[

# Visual Summary of Results A (LDA)

Use this graph by hovering over a topic number to display its top terms.

The selected circle turns red and its top terms are shown in the bar chart on the right.

Larger circles represent more frequent topics, and the closer two topics are, the more similar they are.

The keywords are selected based on their frequency and on how distinctive they are for the topic.
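
pyLDAvis orders the terms in the bar chart by a relevance score that blends a term's probability within the topic with its lift (that probability divided by the term's overall probability), weighted by the slider value λ. The sketch below reproduces that score on toy numbers. The probabilities are invented for illustration, the helper name `relevance` is an assumption rather than a pyLDAvis function, and its default of 0.6 is an arbitrary choice.

```python
import numpy as np

def relevance(p_w_given_t, p_w, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    p_w_given_t = np.asarray(p_w_given_t, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

terms = ["movie", "vampire", "plot"]
p_w_given_t = [0.50, 0.30, 0.20]   # toy probability of each term within one topic
p_w = [0.40, 0.02, 0.10]           # toy probability of each term in the whole corpus

rankings = {}
for lam in (1.0, 0.0):
    order = np.argsort(relevance(p_w_given_t, p_w, lam))[::-1]
    rankings[lam] = [terms[i] for i in order]
    print(lam, rankings[lam])
```

With λ = 1 the ranking follows raw frequency within the topic, so a generic word like "movie" comes first. With λ = 0 it follows lift, which favors terms that are rare elsewhere, such as "vampire".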

In [407]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash

NMF is shown below.

In [408]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(nmf, data_vectorized, vectorizer, mds='tsne')
dash

Below we visualize the LSI (SVD) model as a scatterplot.

The data is reduced to 2 components (topics).

Each dot is a keyword, and the distance between two dots indicates how similar those keywords are.

In [409]:
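svd_2d = TruncatedSVD(n_components=2)
# transpose so that each row (point) is a term, matching the feature-name labels used in the plots below
data_2d = svd_2d.fit_transform(data_vectorized.T)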
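
Whether each point in the scatterplot is a document or a keyword depends on which side of the document-term matrix is decomposed. The sketch below uses NumPy's SVD on a tiny invented count matrix to show both sets of 2-D coordinates. `TruncatedSVD` is not used here, and all names and numbers are assumptions for illustration.

```python
import numpy as np

# Toy document-term counts: 4 documents (rows) x 3 terms (columns).
X = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # keep two components

doc_coords = U[:, :k] * s[:k]        # one row per document
term_coords = Vt[:k, :].T * s[:k]    # one row per term

print(doc_coords.shape, term_coords.shape)
```

When the points are labeled with feature names, the plotted coordinates need one row per term, which corresponds to decomposing the transposed matrix.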

In [410]:
py.sign_in('PythonAPI', 'ubpiol2cve')

In [411]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names(),
    hovertext = vectorizer.get_feature_names(),
    hoverinfo = 'text' 
)
data = [trace]
iplot(data, filename='scatter-mode')

In this scatterplot, the keyword text itself is drawn in place of the markers.

At first glance the plot looks cluttered, but zooming in makes the individual terms readable.

In [412]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'text',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names()
)
data = [trace]
iplot(data, filename='text-scatter-mode')

In [413]:
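# drop every review whose text contains the substring "br" (case-sensitive), e.g. leftover <br /> tags;
# note that this also matches ordinary words such as "bring"
dataframe = dataframe[~dataframe['review'].str.contains("br")]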

In [414]:
def spacy_bigram_tokenizer(phrase):
    doc = parser(phrase) # create spacy object
    token_not_noun = []
    notnoun_noun_list = []
    noun = ""

    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        if item.pos_ == "NOUN":
            noun = item.text
        
        for notnoun in token_not_noun:
            notnoun_noun_list.append(notnoun + " " + noun)

    return " ".join([i for i in notnoun_noun_list])

In [415]:
bivectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, ngram_range=(1,2))
bigram_vectorized = bivectorizer.fit_transform(dataframe["review"])

In [416]:
bi_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_bi_lda = bi_lda.fit_transform(bigram_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


# Results B

Here we see the Bi-LDA model, which is LDA fit on unigram and bigram counts (ngram_range=(1,2)) instead of unigrams only.
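
With `ngram_range=(1,2)`, the vectorizer counts single words and pairs of adjacent words as separate features. The pure-Python sketch below mimics that idea on one short phrase. It is a simplification: the helper `ngrams` is hypothetical and skips `CountVectorizer`'s stop-word filtering and its `min_df`/`max_df` pruning.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "worst comedy ever".split()
features = ngrams(tokens)
print(features)
```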

In [417]:
print("Bi-LDA Model:")
selected_topics(bi_lda, bivectorizer)

Bi-LDA Model:
Topic 0:
[('movie', 17.367738537025193), ('know', 5.354793127092447), ('expected', 4.4654253750348785), ('book', 4.070176927803961), ('subject', 3.6736629158145906), ('minutes', 3.155111627058195), ('fine', 3.0624037962840527), ('stupid', 2.87438309244078), ('sure', 2.647740249346526), ('family', 2.5690697861850005)]
Topic 1:
[('episode', 4.9877897350590334), ('pretty', 3.179420929085435), ('church', 2.2312864283995935), ('big', 1.8572460254079028), ('self', 1.5821857041672356), ('plays', 1.570240298634208), ('wants', 1.552518122973365), ('john', 1.53672272376347), ('type', 1.4406421595322576), ('like', 1.3235049005823782)]
Topic 2:
[('men', 9.817608972148527), ('women', 8.863726918984014), ('minutes', 6.808435853789321), ('male', 5.054186510234402), ('way', 4.14857576237979), ('like', 3.6518615444879603), ('given', 3.3409116083862016), ('little', 3.3198483651918034), ('shot', 3.2279493273932043), ('felt', 3.1083630391829744)]
Topic 3:
[('story', 20.601534683872906), ('st

# Visual Summary of Bi-LDA Model

Use this graph by hovering over a topic number to display its top terms.

The selected circle turns red and its top terms are shown in the bar chart on the right.

Larger circles represent more frequent topics, and the closer two topics are, the more similar they are.

The keywords are selected based on their frequency and on how distinctive they are for the topic.

In [418]:
bi_dash = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
bi_dash

# Conclusion

In the LDA model, the first two topics were the most prominent in terms of frequency.
The terms in the largest topic included, but were not limited to: movie, film, bad, good, people, know, and watch.
In the second largest topic, terms included, but were not limited to: film, movie, like, bad, acting, scenes, story, and plot.
The first topic appears to be a cluster of terms describing the movie, obviously, but also the audience. In the second topic,
the structure of the movie seemed to be the motif of the review (i.e., scenes, story, and plot).

In the NMF model, most of the top topics had similar frequencies. The largest included terms like film, time, vampire, director, seen, feel, think, and look. Most of these terms seem to describe what a person feels while watching a movie, or what they think the
creator of the movie wanted the audience to feel (i.e., seen, feel, think, and look). One clear outlier from the otherwise cohesive pattern in the
topic is vampire. This is where NMF seems different.

In the LSI model, a different, graphical approach (a 2-D scatterplot) was taken, because LSI relies on truncated SVD rather than a non-negative matrix factorization.
The first topic covers general terms like bad, good, story, and movie, while the second topic is different in that it has the names of people in the movie, like drake, sandra, and bullock; it also includes negative words like sick, scary, and stop, suggesting horror as a genre or a negative feeling toward the movie.

The Bi-LDA model has the biggest separation in topic size. It resembles LDA in its first topic, while the second topic seems to have some
contradictory terms like worst and comedy. It is interesting to note that most of the other top topics are very small;
they include terms like watching, waste, unfortunately, stinks, and leave, which summarize a negative review.

Final Thoughts

The alternative hypothesis is supported: in this qualitative comparison, the models produced noticeably different topics.

Overall, with all of the approaches considered, NMF seems to be the best at producing an accurate, balanced, and interesting
topic spread. At first, I thought LDA was the better methodology, as it is often considered the more robust, machine-learning-style algorithm, but after reviewing the use case, I would personally recommend the NMF
model because it produces the most usable topic summaries and avoids topics with little importance in terms of
frequency. In the visualizations, I would rather have many big bubbles than two big bubbles with the rest small. I think
the reviews are summarized better by several topics that occur frequently. I also thought it was notable that the NMF model picked up on
terms that speak almost directly to sentiment, like 'feel' and 'think'. These are strong words for a reviewer to use. I think both LDA and NMF can be useful in different ways, depending on the business goal.

For future work, I would like to apply these models in my work as a Business Analyst to gain insights into customer behavior,
as well as key topics from phone conversations in optimization and acquisition campaigns.