# Topic Modeling

<b>NOTE</b>: This project is taken from "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python" by Susan Li. See the blog post on [towardsdatascience.com](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) for a fuller explanation.

Read the training data and keep only the headline text, adding an `index` column that identifies each document.

In [1]:
import pandas as pd

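# Skip malformed rows; `on_bad_lines` replaces the removed `error_bad_lines` (pandas >= 1.3).
data = pd.read_csv('../data/extracted/abcnews-date-text.csv', on_bad_lines='skip')
# Copy the column so that adding `index` does not write to a view of `data`.
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text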

Peek at the training data.

In [2]:
print(len(documents))
print(documents[:5])
print(documents[-5:])

1103665
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4
                                             headline_text    index
1103660  the ashes smiths warners near miss liven up bo...  1103660
1103661            timelapse: brisbanes new year fireworks  1103661
1103662           what 2017 meant to the kids of australia  1103662
1103663   what the papodopoulos meeting may mean for ausus  1103663
1103664  who is george papadopoulos the former trump ca...  1103664


Load the gensim and NLTK libraries, and download the `wordnet` data through NLTK.

In [3]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(2018)

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/almermendoza/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Define the functions that lemmatize and stem each token.

In [11]:
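# Build the stemmer and lemmatizer once instead of on every call.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb (pos='v'), then reduce the lemma to its stem.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 characters or fewer.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result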
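To make the token filter in `preprocess` concrete, here is a minimal stdlib-only sketch of the same two conditions: drop stopwords, and drop tokens of three characters or fewer. It is a simplified stand-in, not gensim's code. `TINY_STOPWORDS` is a hypothetical four-word list rather than gensim's `STOPWORDS`, the regex tokenizer only approximates `simple_preprocess`, and the stemming and lemmatization step is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list (gensim's STOPWORDS is far larger).
TINY_STOPWORDS = frozenset({"against", "with", "from", "this"})

def filter_tokens(text, stopwords=TINY_STOPWORDS, min_len=4):
    """Lowercase, split on non-letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

# "air", "nz", "in", "for" and "pay" are removed by the length filter alone.
print(filter_tokens("air nz staff in aust strike for pay rise"))
```

For this headline the length filter does all the work, which is why short place names and abbreviations rarely survive preprocessing.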

Check the preprocessing output on a sample document.

In [12]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('Original document: ')
print(doc_sample.split(' '))
print('\n\n Tokenized and lemmatized document: ')
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


 Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


Peek at the output after preprocessing.

In [15]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

Build a dictionary that maps each unique token in the processed headlines to an integer id.

In [33]:
dictionary = gensim.corpora.Dictionary(processed_docs)

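Filter out tokens that appear in fewer than 15 documents or in more than 50% of all documents, then keep at most the 100,000 most frequent of the remaining tokens.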

In [17]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
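The filtering rule can be sketched in plain Python. This is an illustration of the three thresholds under simplifying assumptions, not gensim's implementation: `filter_by_document_frequency` is a hypothetical helper, the toy corpus is made up, and details such as rounding of the upper bound and tie-breaking may differ from `filter_extremes`.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens that survive the three thresholds."""
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically here).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])

docs = [
    ["rain", "help", "bushfir"],
    ["rain", "flood", "help"],
    ["rain", "polic", "crash"],
    ["rain", "polic", "help"],
]
# "rain" is in all 4 documents (too common); "bushfir", "flood", "crash" are in only 1 (too rare).
print(sorted(filter_by_document_frequency(docs, no_below=2, no_above=0.75)))
```

Note that the thresholds count documents, not total occurrences, so a token repeated many times in a single headline is still treated as rare.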

Convert each document into its bag-of-words representation: a list of `(token_id, count)` tuples.

In [18]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
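As a rough picture of what a bag-of-words conversion does, the sketch below counts token ids with `collections.Counter`. The helpers `build_vocab` and `to_bow` are hypothetical names for this illustration; the id assignment (order of first appearance) is an assumption and will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

vocab = build_vocab([["rain", "help", "dampen", "bushfir"], ["polic", "crash"]])
# "googl" is not in the vocabulary, so it is silently dropped.
print(to_bow(["rain", "rain", "googl", "crash"], vocab))
```

Skipping unknown tokens matters later in the notebook: words of an unseen document that are not in the dictionary contribute nothing to its vector.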

Inspect the bag-of-words vector of the sample document.

In [21]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("Word {} (`{}`) appears {} time.".format(token_id,
                                                   dictionary[token_id],
                                                   count))

Word 76 (`bushfir`) appears 1 time.
Word 112 (`help`) appears 1 time.
Word 483 (`rain`) appears 1 time.
Word 4014 (`dampen`) appears 1 time.


Create a TF-IDF model from the bag-of-words corpus and print the weights of the first document.

In [23]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5892908644709983),
 (1, 0.38929657403503015),
 (2, 0.4964985198530063),
 (3, 0.5046520328695662)]
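The squared weights printed above sum to roughly 1, which is consistent with each TF-IDF vector being scaled to unit length. A minimal sketch of such a weighting follows, assuming raw counts as term frequency, `log2(N / df)` as inverse document frequency, and L2 normalization. These are believed to match the defaults of `TfidfModel`, but the function `tfidf_vectors` and the toy corpus are illustrative only, not gensim's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(bow_docs):
    """Weight each (token_id, count) by log2(N / df), then scale to unit L2 length."""
    n_docs = len(bow_docs)
    # Each token id appears at most once per bag-of-words document.
    doc_freq = Counter(tid for doc in bow_docs for tid, _ in doc)
    idf = {tid: math.log2(n_docs / df) for tid, df in doc_freq.items()}
    vectors = []
    for doc in bow_docs:
        weighted = [(tid, count * idf[tid]) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        # Tokens present in every document have weight 0 and are dropped;
        # the division only runs for positive weights, so norm is never 0 here.
        vectors.append([(tid, w / norm) for tid, w in weighted if w > 0.0])
    return vectors

bow_docs = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]
vectors = tfidf_vectors(bow_docs)
for vec in vectors:
    print(vec)
```

Token 0 occurs in every toy document, so its weight is zero and it disappears from all three vectors, while the rarer tokens 2 and 3 dominate the documents that contain them.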


Train an LDA model on the bag-of-words corpus (without TF-IDF weighting).

In [25]:
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=2,
                                       workers=2)

Print the top words of each topic.

In [26]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))

Topic: 0 
Words: 0.027*"elect" + 0.017*"say" + 0.016*"hospit" + 0.016*"health" + 0.015*"tasmanian" + 0.014*"labor" + 0.014*"turnbul" + 0.014*"report" + 0.013*"minist" + 0.012*"deal"
Topic: 1 
Words: 0.019*"coast" + 0.018*"help" + 0.018*"chang" + 0.017*"countri" + 0.016*"state" + 0.014*"hour" + 0.014*"indigen" + 0.013*"water" + 0.012*"gold" + 0.012*"communiti"
Topic: 2 
Words: 0.019*"market" + 0.015*"rise" + 0.014*"price" + 0.014*"share" + 0.013*"australian" + 0.012*"victoria" + 0.011*"bank" + 0.011*"hous" + 0.011*"week" + 0.010*"green"
Topic: 3 
Words: 0.061*"polic" + 0.023*"crash" + 0.019*"interview" + 0.018*"miss" + 0.018*"shoot" + 0.016*"arrest" + 0.014*"investig" + 0.012*"driver" + 0.011*"search" + 0.011*"offic"
Topic: 4 
Words: 0.029*"charg" + 0.027*"court" + 0.021*"murder" + 0.018*"woman" + 0.017*"face" + 0.015*"alleg" + 0.015*"brisban" + 0.015*"live" + 0.015*"jail" + 0.014*"perth"
Topic: 5 
Words: 0.036*"australia" + 0.021*"world" + 0.017*"open" + 0.016*"melbourn" + 0.015*"sydne

Train an LDA model on the TF-IDF corpus.

In [None]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics=10,
                                             id2word=dictionary,
                                             passes=2,
                                             workers=4)

Print the top words of each topic.

In [None]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))

Evaluate the LDA bag-of-words model by classifying the sample document.

In [27]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.4199999272823334	 
Topic: 0.036*"australia" + 0.021*"world" + 0.017*"open" + 0.016*"melbourn" + 0.015*"sydney" + 0.014*"final" + 0.013*"donald" + 0.010*"leagu" + 0.010*"take" + 0.010*"win"

Score: 0.2200000286102295	 
Topic: 0.019*"coast" + 0.018*"help" + 0.018*"chang" + 0.017*"countri" + 0.016*"state" + 0.014*"hour" + 0.014*"indigen" + 0.013*"water" + 0.012*"gold" + 0.012*"communiti"

Score: 0.21999773383140564	 
Topic: 0.024*"south" + 0.023*"kill" + 0.021*"north" + 0.015*"west" + 0.013*"island" + 0.012*"fall" + 0.011*"farm" + 0.010*"attack" + 0.009*"korea" + 0.009*"australian"

Score: 0.02000226452946663	 
Topic: 0.016*"rural" + 0.014*"power" + 0.012*"farmer" + 0.011*"busi" + 0.011*"industri" + 0.011*"guilti" + 0.010*"centr" + 0.010*"council" + 0.009*"region" + 0.009*"liber"

Score: 0.019999999552965164	 
Topic: 0.027*"elect" + 0.017*"say" + 0.016*"hospit" + 0.016*"health" + 0.015*"tasmanian" + 0.014*"labor" + 0.014*"turnbul" + 0.014*"report" + 0.013*"minist" + 0.012*"deal"
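The scores above take only three distinct values (about 0.42, 0.22 and 0.02). One reading that fits these numbers, offered as an assumption rather than a description of gensim's internals: with a symmetric document-topic prior of `alpha = 1 / num_topics = 0.1`, and each of the document's four tokens attributed almost entirely to a single topic, the score of topic `k` is roughly `(alpha + n_k) / (num_topics * alpha + N)`, where `n_k` is the number of tokens attributed to that topic and `N` is the document length. The helper `approx_topic_scores` and the 2/1/1 token split are hypothetical, chosen to match the printed scores.

```python
def approx_topic_scores(token_counts, num_topics=10, alpha=None):
    """Approximate document-topic scores as (alpha + n_k) / (K * alpha + N)."""
    # Assumed symmetric prior: alpha = 1 / num_topics unless given explicitly.
    alpha = 1.0 / num_topics if alpha is None else alpha
    # Topics with no attributed tokens get a count of 0.
    counts = list(token_counts) + [0] * (num_topics - len(token_counts))
    total = num_topics * alpha + sum(counts)
    return [(alpha + n) / total for n in counts]

# Hypothetical split of the four tokens: 2 to one topic, 1 each to two others.
scores = approx_topic_scores([2, 1, 1])
print([round(s, 2) for s in scores[:4]])
```

Under this reading, very short documents such as headlines can only produce a handful of distinct score levels, so small differences between topic scores should not be over-interpreted.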

Evaluate the LDA TF-IDF model by classifying the same sample document.

In [None]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]],
                           key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model_tfidf.print_topic(index, 10)))

Test the LDA bag-of-words model on an unseen document.

In [31]:
unseen_document = "How a Pentagon deal became an identity crisis for Google"
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector],
                           key=lambda tup: -1*tup[1]):
    print("Score: {}\nTopic: {}\n".format(score, lda_model.print_topic(index, 5)))

Score: 0.3500001132488251
Topic: 0.027*"elect" + 0.017*"say" + 0.016*"hospit" + 0.016*"health" + 0.015*"tasmanian"

Score: 0.1833333522081375
Topic: 0.019*"market" + 0.015*"rise" + 0.014*"price" + 0.014*"share" + 0.013*"australian"

Score: 0.1833333522081375
Topic: 0.061*"polic" + 0.023*"crash" + 0.019*"interview" + 0.018*"miss" + 0.018*"shoot"

Score: 0.18333317339420319
Topic: 0.019*"coast" + 0.018*"help" + 0.018*"chang" + 0.017*"countri" + 0.016*"state"

Score: 0.01666666753590107
Topic: 0.029*"charg" + 0.027*"court" + 0.021*"murder" + 0.018*"woman" + 0.017*"face"

Score: 0.01666666753590107
Topic: 0.036*"australia" + 0.021*"world" + 0.017*"open" + 0.016*"melbourn" + 0.015*"sydney"

Score: 0.01666666753590107
Topic: 0.024*"south" + 0.023*"kill" + 0.021*"north" + 0.015*"west" + 0.013*"island"

Score: 0.01666666753590107
Topic: 0.016*"rural" + 0.014*"power" + 0.012*"farmer" + 0.011*"busi" + 0.011*"industri"

Score: 0.01666666753590107
Topic: 0.037*"trump" + 0.025*"australian" + 0.021*