# Part 2: Topic Modeling w/ LDA, LSA, & NMF

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-Libraries" data-toc-modified-id="Importing-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing Libraries</a></span></li><li><span><a href="#Loading-Lemmatized-Data-(Nouns-and-Adjectives-Only)" data-toc-modified-id="Loading-Lemmatized-Data-(Nouns-and-Adjectives-Only)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Loading Lemmatized Data (Nouns and Adjectives Only)</a></span></li><li><span><a href="#Document-Preprocessing" data-toc-modified-id="Document-Preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Document Preprocessing</a></span></li><li><span><a href="#Latent-Dirichlet-Allocation-(LDA)" data-toc-modified-id="Latent-Dirichlet-Allocation-(LDA)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Latent Dirichlet Allocation (LDA)</a></span></li><li><span><a href="#Latent-Semantic-Analysis-(LSA)" data-toc-modified-id="Latent-Semantic-Analysis-(LSA)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Latent Semantic Analysis (LSA)</a></span></li><li><span><a href="#Non-Negative-Matrix-Factorization-(NMF)" data-toc-modified-id="Non-Negative-Matrix-Factorization-(NMF)-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Non-Negative Matrix Factorization (NMF)</a></span></li></ul></div>

## Importing Libraries

In [68]:
import numpy as np
import pandas as pd
from gensim import corpora, models, similarities, matutils
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
import spacy

## Loading Lemmatized Data (Nouns and Adjectives Only)

In [69]:
# Load the lemmatized Kaggle toxic comments (nouns and adjectives only)
kaggle_toxic_comments_lemmatized_nouns_adj = pd.read_csv('../kaggle_toxic_comments_lemmatized_nouns_adj.csv')

## Document Preprocessing

In [70]:
# Create a CountVectorizer that counts unigrams and bigrams,
# drops English stop words, and keeps only lowercase tokens of 2+ letters
count_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                   stop_words='english',
                                   token_pattern=r"\b[a-z][a-z]+\b")

count_vectorizer.fit(kaggle_toxic_comments_lemmatized_nouns_adj['toxic_comments'].values.astype('U'))

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
                tokenizer=None, vocabulary=None)

In [71]:
# Create the term-document matrix
doc_word = count_vectorizer.transform(kaggle_toxic_comments_lemmatized_nouns_adj['toxic_comments'].values.astype('U')).transpose()

In [72]:
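# Preview the first five terms; slicing before toarray() avoids densifying the full matrix.
# On scikit-learn >= 1.2, use get_feature_names_out() in place of get_feature_names().
pd.DataFrame(doc_word[:5].toarray(), index=count_vectorizer.get_feature_names()[:5])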

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16215,16216,16217,16218,16219,16220,16221,16222,16223,16224
aaa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaa page,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaaaa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaaaa forget,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaany,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


__Convert to Gensim:__

In [74]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(doc_word)

__Map matrix rows to words (tokens):__

In [75]:
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())
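
The two steps above rely on the matrix orientation: gensim's `Sparse2Corpus` treats columns as documents by default (`documents_columns=True`), which is why the matrix is transposed to terms × documents, and inverting `vocabulary_` gives the row-index-to-token lookup. A minimal sketch on a toy corpus (the `toy_*` names and the three documents are hypothetical, not part of the Kaggle data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the Kaggle data.
toy_docs = ["red apple", "green apple", "red car"]

toy_vectorizer = CountVectorizer()
doc_term = toy_vectorizer.fit_transform(toy_docs)  # shape: (documents, terms)
term_doc = doc_term.transpose()                    # shape: (terms, documents)

# vocabulary_ maps token -> column index; inverting it gives index -> token,
# which matches the row order of the transposed matrix.
toy_id2word = {idx: token for token, idx in toy_vectorizer.vocabulary_.items()}

print(doc_term.shape, term_doc.shape)
print(toy_id2word)
```

Row `i` of `term_doc` holds the counts for `toy_id2word[i]` across all documents.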

## Latent Dirichlet Allocation (LDA)

In [77]:
# Fit a 7-topic LDA model, making 10 passes over the corpus
lda = models.LdaModel(corpus=corpus, num_topics=7, id2word=id2word, passes=10)

2020-06-01 14:19:37,207 : INFO : using symmetric alpha at 0.14285714285714285
2020-06-01 14:19:37,208 : INFO : using symmetric eta at 0.14285714285714285
2020-06-01 14:19:37,234 : INFO : using serial LDA version on this node
2020-06-01 14:19:37,350 : INFO : running online (multi-pass) LDA training, 7 topics, 10 passes over the supplied corpus of 16225 documents, updating model once every 2000 documents, evaluating perplexity every 16225 documents, iterating 50x with a convergence threshold of 0.001000
2020-06-01 14:19:37,365 : INFO : PROGRESS: pass 0, at document #2000/16225
2020-06-01 14:19:38,276 : INFO : merging changes from 2000 documents into a model of 16225 documents
2020-06-01 14:19:38,322 : INFO : topic #3 (0.143): 0.042*"nigger" + 0.042*"nigger nigger" + 0.028*"sex" + 0.019*"ass ass" + 0.019*"fuck" + 0.016*"ass" + 0.014*"sexsex" + 0.014*"sex sex" + 0.014*"cunt" + 0.014*"sexsex sex"
2020-06-01 14:19:38,325 : INFO : topic #4 (0.143): 0.026*"fuck" + 0.018*"fuck fuck" + 0.005*"pa

In [78]:
lda.print_topics()

[(0,
  '0.023*"gay" + 0.021*"fat" + 0.019*"jew" + 0.014*"aid" + 0.014*"jew fat" + 0.014*"fat jew" + 0.012*"freedom" + 0.011*"aid aid" + 0.011*"freedom freedom" + 0.009*"dog"'),
 (1,
  '0.044*"vagina" + 0.043*"vagina vagina" + 0.028*"moron" + 0.024*"penis" + 0.022*"die" + 0.021*"moron moron" + 0.017*"nipple" + 0.017*"nipple nipple" + 0.016*"fag" + 0.014*"fucksex"'),
 (2,
  '0.125*"fuck" + 0.041*"fuck fuck" + 0.035*"hate" + 0.030*"bitch" + 0.021*"hate hate" + 0.010*"wikipedia" + 0.008*"pussy" + 0.006*"fuck yourselfgo" + 0.006*"yourselfgo" + 0.006*"yourselfgo fuck"'),
 (3,
  '0.041*"nigger" + 0.034*"ass" + 0.028*"faggot" + 0.025*"cock" + 0.023*"nigger nigger" + 0.021*"cunt" + 0.017*"faggot faggot" + 0.013*"eat" + 0.011*"suck cock" + 0.011*"ass ass"'),
 (4,
  '0.010*"house" + 0.010*"mouth" + 0.009*"mother" + 0.009*"chuck" + 0.009*"seman" + 0.009*"construct" + 0.009*"house saliva" + 0.009*"seman chuck" + 0.009*"mouth seman" + 0.009*"construct house"'),
 (5,
  '0.009*"page" + 0.009*"wikipedi

## Latent Semantic Analysis (LSA)

In [79]:
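# doc_word is terms x documents (it was transposed for gensim above), so this
# fit treats each term as a sample and each document as a feature.
lsa = TruncatedSVD(7)
doc_topic = lsa.fit_transform(doc_word)
lsa.explained_variance_ratio_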

array([0.06410252, 0.0625699 , 0.05138053, 0.04360968, 0.04101299,
       0.03713032, 0.03005918])
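
Summing these ratios gives the share of variance, in the matrix passed to `fit_transform`, that the seven components capture together. A quick check with the values copied from the output above (the variable names are illustrative):

```python
# Ratios copied from the explained_variance_ratio_ output above.
explained_variance_ratio = [0.06410252, 0.0625699, 0.05138053, 0.04360968,
                            0.04101299, 0.03713032, 0.03005918]

# Total share of variance captured by all seven components.
total = sum(explained_variance_ratio)
print(f"Total explained variance: {total:.4f}")  # Total explained variance: 0.3299
```

So the seven components account for roughly a third of the variance.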

In [80]:
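def display_topics(model, feature_names, no_top_words, topic_names=None):
    """Print the no_top_words highest-weighted features for each row of model.components_.

    feature_names must be indexed in the same order as the columns of components_.
    """
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '", topic_names[ix], "'")
        # argsort() is ascending, so the reversed slice yields the largest weights first
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))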
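
The slice `argsort()[:-no_top_words - 1:-1]` is the step that selects the top terms. A small sketch with made-up weights (the `weights` and `terms` values are hypothetical) shows what it returns:

```python
import numpy as np

# Hypothetical weights for one topic over a five-term vocabulary.
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
terms = ["alpha", "beta", "gamma", "delta", "epsilon"]
n_top = 3

# argsort() orders indices from smallest to largest weight; the slice
# [:-n_top - 1:-1] walks backwards from the end and stops after n_top items.
top_idx = weights.argsort()[:-n_top - 1:-1]
top_terms = [terms[i] for i in top_idx]
print(top_terms)  # ['delta', 'beta', 'gamma']
```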

In [81]:
display_topics(lsa, count_vectorizer.get_feature_names(), 10)


Topic  0
afraid terrorist, area government, agent delete, ahaha, answer concern, backstage video, account vikerne, ban child, beetlefart don, bich

Topic  1
bet contribution, africa western, american article, apology gross, article experience, article jesus, apocalypse muslim, ajraddatz, agony witness, belief know

Topic  2
bald skinhead, angry china, appropriate copyright, anybody new, academic scientific, ack, beat hispanic, article admiralty, article editor, area live

Topic  3
article admin, acorn wikipedia, alive fke, accurate change, action behavior, bich pussy, bear drop, article interfere, article warn, accuracy inaccuracy

Topic  4
airplane guy, allah akbar, attack consider, attract game, assume good, account information, bias fuck, attack prove, bad suffering, bias pov

Topic  5
arson, animation dollar, attention huh, bias slur, baby dare, abusive incite, article hint, alabama, allah akbar, bias fuck

Topic  6
academia work, behalf, agree useless, attack article, band brutal
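
A note on orientation: `doc_word` was transposed to terms × documents for gensim, whereas scikit-learn's decomposition estimators expect one row per sample (here, one row per document). Fitted on the transposed matrix, `components_` has one column per document, so `display_topics` would index the vocabulary with document positions. That would be consistent with every term listed above falling early in the alphabet. The sketch below illustrates the shape difference on a toy corpus; the `toy_*` names and documents are hypothetical, and no claim is made about what topics the Kaggle data would yield:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the Kaggle data.
toy_docs = [
    "cat dog pet",
    "dog pet food",
    "stock market trade",
    "market trade price",
]

toy_vectorizer = CountVectorizer()
doc_term = toy_vectorizer.fit_transform(toy_docs)  # (documents, terms)
n_docs, n_terms = doc_term.shape

# Fit on documents x terms: components_ has one column per vocabulary term.
toy_lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topic = toy_lsa.fit_transform(doc_term)        # (documents, topics)

# Fit on terms x documents: components_ has one column per document instead.
transposed_lsa = TruncatedSVD(n_components=2, random_state=0).fit(doc_term.T)

print(toy_lsa.components_.shape)         # (2, n_terms)
print(transposed_lsa.components_.shape)  # (2, n_docs)
```

With the notebook's variables, the documents × terms matrix would be `doc_word.transpose()` (or the untransposed output of `count_vectorizer.transform`).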

## Non-Negative Matrix Factorization (NMF)

In [82]:
# As with LSA, doc_word is terms x documents here, so terms are the samples.
nmf_model = NMF(7)
doc_topic = nmf_model.fit_transform(doc_word)

In [83]:
display_topics(nmf_model, count_vectorizer.get_feature_names(), 10)


Topic  0
afraid terrorist, area government, agent delete, ahaha, answer concern, account vikerne, backstage video, ban child, bich, anonymous wikipedia

Topic  1
bet contribution, africa western, american article, article experience, apology gross, article jesus, apocalypse muslim, ajraddatz, agony witness, belief know

Topic  2
bald skinhead, angry china, appropriate copyright, anybody new, academic scientific, ack, beat hispanic, article admiralty, article editor, area live

Topic  3
article admin, acorn wikipedia, alive fke, accurate change, action behavior, bich pussy, bear drop, article interfere, article warn, accuracy inaccuracy

Topic  4
airplane guy, account information, abusive incite, article hint, alabama, beat head, attention huh, arsehole callin, bastard return, alberta

Topic  5
arson, animation dollar, attention huh, bias slur, baby dare, abusive incite, article hint, articleness, alabama, ban sad

Topic  6
academia work, behalf, agree useless, attack article, band bru