# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.

Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up.

In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.

Each LDA topic model requires:

1. A set of documents for training the model—the training corpus
2. A dictionary of words to form the vocabulary used in the model—this can be derived from the training corpus


Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.
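
The mixture-of-topics assumption described above can be made concrete with a small simulation. The sketch below is only an illustration of the generative idea, not the training procedure: the topic names, words and probabilities are invented, and `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (illustrative values only).
topics = {
    "politics": {"govt": 0.5, "vote": 0.3, "elect": 0.2},
    "sport": {"match": 0.4, "win": 0.4, "final": 0.2},
}

def generate_document(topic_mix, topics, n_words, rng):
    """Draw n_words by first picking a topic from the mix, then a word from that topic."""
    names = list(topic_mix)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=[topic_mix[t] for t in names])[0]
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
doc = generate_document({"politics": 0.8, "sport": 0.2}, topics, 8, rng)
print(doc)
```

Training an LDA model works in the opposite direction: given only the documents, it estimates the topic mix of each document and the word probabilities of each topic.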

## Dataset

In [13]:
import pandas as pd

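# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('C:/Users/User/Downloads/abcnews-date-text.csv', on_bad_lines='skip')
data_text = data[['headline_text']].copy()  # copy to avoid SettingWithCopyWarning
data_text['index'] = data_text.index
documents = data_text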

In [14]:
len(documents)

1226258

In [39]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Pre-processing

In [16]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(2018)

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [19]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data = {'original word': original_words, 'stemmed': singles})

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [20]:
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce the result to its stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 or fewer characters
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
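
To see the filtering part of `preprocess` in isolation, here is a rough standard-library sketch. It is a simplification and an assumption on my part: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny stand-in for gensim's `STOPWORDS`, and no lemmatizing or stemming is applied.

```python
import re

# Tiny illustrative stopword list; gensim's STOPWORDS set is much larger.
TOY_STOPWORDS = {"the", "for", "with", "this", "that", "from", "against"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split into alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sample = toy_preprocess("air nz strike to affect australian travellers")
print(sample)
```

Short tokens such as "air", "nz" and "to" are removed by the length filter alone; the real function would additionally stem what remains.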

## Reviewing a Pre-Processed Document

In [21]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


 tokenized and lemmatized document: 
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']


In [22]:
processed_docs = documents['headline_text'].map(preprocess)

In [23]:
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Bag of Words on the Dataset

Create a dictionary from ‘processed_docs’ that maps each unique word to an integer id. Gensim also records how often each word appears in the training set.

In [24]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [25]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
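
The word-to-id mapping can be pictured with plain Python. This is a sketch of the idea rather than gensim's implementation: `build_token2id` is a hypothetical helper, and assigning ids in sorted order within each document is an assumption chosen because it reproduces the ids printed above for the first three headlines.

```python
# First three processed headlines, copied from the output shown earlier.
docs = [
    ["decid", "communiti", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["call", "infrastructur", "protect", "summit"],
]

def build_token2id(docs):
    """Assign the next free integer id to each token not seen before."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):  # assumed ordering: alphabetical per document
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

token2id = build_token2id(docs)
for token, idx in token2id.items():
    print(idx, token)
```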


## Gensim doc2bow

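For each document, we create a bag of words: a list of (word id, count) pairs recording which dictionary words appear in the document and how many times. First, filter out tokens that appear in fewer than 15 documents or in more than 50% of all documents, keeping at most the 100,000 most frequent of those that remain. Then save the result to ‘bow_corpus’ and check the document we selected earlier.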

In [26]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
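
The effect of the three `filter_extremes` arguments can be sketched on a toy corpus. The code below is an approximation written for illustration: `toy_filter_extremes` is a hypothetical helper, the documents are invented, and details such as rounding and id reassignment in gensim are not modelled.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Keep tokens found in at least no_below docs and at most no_above (fraction) of docs."""
    # Document frequency: in how many documents each token occurs
    df = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, n in df.items() if no_below <= n <= max_docs]
    # Most frequent first, ties broken alphabetically, then cap at keep_n
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:keep_n]

docs = [
    ["strike", "staff"],
    ["strike", "travel"],
    ["strike", "staff", "rise"],
    ["staff", "vote"],
]
kept = toy_filter_extremes(docs, no_below=2, no_above=0.8, keep_n=10)
print(kept)
```

With these toy settings, words seen in only one document are dropped as too rare, while lowering `no_above` to 0.5 would also drop the two words that appear in three of the four documents.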

In [27]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(162, 1), (240, 1), (292, 1), (589, 1), (838, 1), (3570, 1), (3571, 1)]

In [28]:
bow_doc_4310 = bow_corpus[4310]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 162 ("govt") appears 1 time.
Word 240 ("group") appears 1 time.
Word 292 ("vote") appears 1 time.
Word 589 ("local") appears 1 time.
Word 838 ("want") appears 1 time.
Word 3570 ("compulsori") appears 1 time.
Word 3571 ("ratepay") appears 1 time.
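
Conceptually, the bag-of-words step is just counting. The following sketch mimics it with `collections.Counter`; `toy_doc2bow` is a hypothetical helper, the ids are a subset of those printed above, and the input tokens are invented to show a repeated word and an out-of-vocabulary word.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token id, count) pairs, ignoring tokens missing from the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Subset of the ids shown in the output above
token2id = {"govt": 162, "group": 240, "vote": 292}
bow = toy_doc2bow(["group", "vote", "govt", "vote", "unknown"], token2id)
print(bow)
```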


In [29]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [30]:
corpus_tfidf = tfidf[bow_corpus]

In [31]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5842699484464488),
 (1, 0.38798859072167835),
 (2, 0.5008422243250992),
 (3, 0.5071987254965034)]
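
The weights above are TF-IDF scores scaled so that each document vector has unit length. A minimal sketch of one common formulation follows. It assumes raw term counts, a base-2 logarithm for the inverse document frequency, and L2 normalization, which I believe matches gensim's defaults but is stated here as an assumption. `toy_tfidf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def toy_tfidf(bow_corpus):
    """Weight each (id, count) pair by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(bow_corpus)
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        vec = [(tid, w) for tid, w in vec if w > 0]  # drop terms found in every document
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec])
    return weighted

toy_corpus = [[(0, 1), (1, 1)], [(0, 2), (2, 1)], [(1, 1), (2, 1)], [(3, 1)]]
weights = toy_tfidf(toy_corpus)
for doc in weights:
    print(doc)
```

Words that occur in few documents receive larger weights than words spread across the whole corpus, which is why the TF-IDF model later produces different topics from the plain bag-of-words model.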


In [32]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [33]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.041*"polic" + 0.026*"death" + 0.026*"case" + 0.025*"charg" + 0.025*"court" + 0.021*"murder" + 0.017*"woman" + 0.017*"face" + 0.015*"alleg" + 0.013*"shoot"
Topic: 1 
Words: 0.047*"trump" + 0.025*"world" + 0.020*"open" + 0.018*"women" + 0.016*"island" + 0.013*"win" + 0.013*"return" + 0.012*"lose" + 0.012*"street" + 0.011*"sydney"
Topic: 2 
Words: 0.047*"coronavirus" + 0.032*"victoria" + 0.024*"live" + 0.021*"covid" + 0.021*"nation" + 0.016*"restrict" + 0.014*"water" + 0.012*"life" + 0.011*"plan" + 0.010*"park"
Topic: 3 
Words: 0.040*"queensland" + 0.026*"sydney" + 0.021*"bushfir" + 0.020*"crash" + 0.019*"adelaid" + 0.018*"die" + 0.015*"final" + 0.014*"miss" + 0.011*"break" + 0.011*"million"
Topic: 4 
Words: 0.034*"year" + 0.020*"famili" + 0.019*"canberra" + 0.018*"tasmania" + 0.018*"melbourn" + 0.015*"jail" + 0.014*"australian" + 0.013*"work" + 0.013*"high" + 0.012*"tasmanian"
Topic: 5 
Words: 0.030*"govern" + 0.020*"health" + 0.019*"school" + 0.018*"state" + 0.014*"he

In [34]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [35]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.016*"interview" + 0.010*"australia" + 0.009*"cricket" + 0.009*"hill" + 0.008*"weather" + 0.007*"extend" + 0.007*"daniel" + 0.007*"august" + 0.006*"world" + 0.006*"smith"
Topic: 1 Word: 0.016*"polic" + 0.016*"charg" + 0.014*"murder" + 0.013*"donald" + 0.011*"alleg" + 0.010*"court" + 0.010*"drum" + 0.010*"woman" + 0.010*"death" + 0.010*"kill"
Topic: 2 Word: 0.014*"queensland" + 0.013*"coast" + 0.012*"coronavirus" + 0.010*"restrict" + 0.009*"miss" + 0.009*"morrison" + 0.007*"gold" + 0.007*"victoria" + 0.006*"rain" + 0.006*"search"
Topic: 3 Word: 0.010*"bushfir" + 0.008*"hobart" + 0.008*"age" + 0.007*"korea" + 0.007*"hotel" + 0.006*"north" + 0.006*"fire" + 0.006*"insid" + 0.006*"sydney" + 0.006*"burn"
Topic: 4 Word: 0.010*"live" + 0.009*"final" + 0.008*"coronavirus" + 0.008*"australia" + 0.008*"financ" + 0.008*"updat" + 0.007*"australian" + 0.006*"alan" + 0.006*"open" + 0.006*"covid"
Topic: 5 Word: 0.035*"trump" + 0.010*"friday" + 0.009*"scott" + 0.008*"zealand" + 0.006*"g

## Classification of the topics

In [36]:
processed_docs[4310]

['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']

In [37]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6611946821212769	 
Topic: 0.030*"govern" + 0.020*"health" + 0.019*"school" + 0.018*"state" + 0.014*"help" + 0.013*"indigen" + 0.012*"communiti" + 0.012*"fund" + 0.012*"fight" + 0.011*"feder"

Score: 0.14088132977485657	 
Topic: 0.034*"year" + 0.020*"famili" + 0.019*"canberra" + 0.018*"tasmania" + 0.018*"melbourn" + 0.015*"jail" + 0.014*"australian" + 0.013*"work" + 0.013*"high" + 0.012*"tasmanian"

Score: 0.11033952981233597	 
Topic: 0.044*"australia" + 0.027*"australian" + 0.027*"elect" + 0.026*"donald" + 0.023*"kill" + 0.020*"coast" + 0.017*"border" + 0.013*"protest" + 0.013*"gold" + 0.012*"attack"

Score: 0.012513488531112671	 
Topic: 0.047*"coronavirus" + 0.032*"victoria" + 0.024*"live" + 0.021*"covid" + 0.021*"nation" + 0.016*"restrict" + 0.014*"water" + 0.012*"life" + 0.011*"plan" + 0.010*"park"

Score: 0.012512878514826298	 
Topic: 0.026*"news" + 0.023*"hous" + 0.017*"brisban" + 0.017*"busi" + 0.017*"peopl" + 0.016*"farmer" + 0.016*"time" + 0.015*"market" + 0.014*"roya
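
To turn the ranked scores into a single label, one simple option is to take the highest-scoring topic. The sketch below uses rounded scores from the output above; the topic ids are matched by eye against the earlier topic listing, so treat them as an assumption, and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(topic_scores):
    """Return the (topic id, score) pair with the highest score."""
    return max(topic_scores, key=lambda pair: pair[1])

# (topic id, score) pairs; ids assumed from matching word lists, scores rounded
scores = [(2, 0.0125), (4, 0.1409), (5, 0.6612)]

ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
best_id, best_score = dominant_topic(scores)
print(best_id, best_score)
```

Whether a single label is appropriate depends on the use case; a document whose top two scores are close may be better described by both topics.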

In [38]:
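# Note: lda_model_tfidf was trained on TF-IDF weights, but this scores the raw
# bag-of-words vector; pass tfidf[bow_corpus[4310]] to score the TF-IDF vector instead.
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))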


Score: 0.5429655313491821	 
Topic: 0.013*"countri" + 0.010*"hour" + 0.008*"elect" + 0.006*"fund" + 0.006*"lockdown" + 0.006*"liber" + 0.006*"farm" + 0.006*"labor" + 0.005*"school" + 0.005*"social"

Score: 0.2121959775686264	 
Topic: 0.023*"news" + 0.019*"market" + 0.014*"rural" + 0.009*"price" + 0.008*"share" + 0.007*"rise" + 0.007*"andrew" + 0.007*"nation" + 0.007*"busi" + 0.006*"australian"

Score: 0.1572805643081665	 
Topic: 0.010*"bushfir" + 0.008*"hobart" + 0.008*"age" + 0.007*"korea" + 0.007*"hotel" + 0.006*"north" + 0.006*"fire" + 0.006*"insid" + 0.006*"sydney" + 0.006*"burn"

Score: 0.012509647756814957	 
Topic: 0.008*"coronavirus" + 0.008*"thursday" + 0.007*"vaccin" + 0.007*"turnbul" + 0.007*"david" + 0.006*"juli" + 0.006*"pandem" + 0.006*"brief" + 0.006*"quarantin" + 0.005*"say"

Score: 0.012508605606853962	 
Topic: 0.018*"coronavirus" + 0.011*"govern" + 0.010*"covid" + 0.010*"health" + 0.008*"climat" + 0.008*"stori" + 0.008*"christma" + 0.007*"wall" + 0.007*"sport" + 0.007*