## Discover news topics through Latent Dirichlet Allocation (LDA)

**Outline**

* [Dataset](#data)
* [LDA Model on BOW](#exp)
* [LDA Model on Corpus with TF-IDF](#use%20case)
* [References](#ref)

In [62]:
import pandas as pd

# WordNetLemmatizer needs the WordNet data: run nltk.download('wordnet') once
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer  # reduces each word to its stem
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

## <a id="data">Dataset</a>
This notebook uses `abcnews-date-text.csv`, the "A Million News Headlines" dataset from [Kaggle](https://www.kaggle.com/therohk/million-headlines/version/7#abcnews-date-text.csv).

## <a id="exp">LDA Modeling</a>

In [2]:
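# skip malformed rows (`on_bad_lines` replaces `error_bad_lines`, which pandas 2.0 removed)
data = pd.read_csv('data/abcnews-date-text.csv', on_bad_lines='skip')
# copy the column so that adding 'index' does not trigger SettingWithCopyWarning
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text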

In [6]:
print(len(documents))
print(documents[:3])

1103663
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2


**Step 1: Preprocessing**
* Tokenization: each headline is lowercased and split into word tokens.
* Stop words are removed.
* Words with 3 or fewer characters are removed.
* Words are lemmatized: third-person forms are changed to first person, and verbs in past and future tenses are changed to present.
* Words are stemmed: words are reduced to their root form.

In [32]:
def lemmatize_stemming(text):
    # lemmatize the token as a verb, then reduce it to its Porter stem
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [33]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

# original headline, split on spaces
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

['rain', 'helps', 'dampen', 'bushfires']


 tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [34]:
documents.head(2)

                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1


In [35]:
# apply to the entire docs list
corpus_clean = documents['headline_text'].map(preprocess)
corpus_clean[:10]

0               [decid, commun, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

**Step 2: Create vocabulary dictionary**

In [43]:
# assign a unique integer id to each unique term
dictionary = corpora.Dictionary(corpus_clean)
# preview the first six entries; iteration yields (id, term) pairs
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 5:
        break

0 broadcast
1 commun
2 decid
3 licenc
4 awar
5 defam


**Step 2.1: Filter extremes**

Filter out tokens that appear in

* fewer than 15 documents (absolute number) or
* more than 70% of documents (`no_above=0.7` is a fraction of total corpus size, not an absolute number).
* after the above two steps, keep only the 100000 most frequent remaining tokens.
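
To make the three rules concrete, here is a minimal plain-Python sketch of the same filtering logic on a toy corpus. It is an illustration only, not gensim's implementation: the helper name `filter_vocab` and the toy documents are made up for this example, and gensim's exact rounding of the `no_above` threshold may differ.

```python
from collections import Counter

def filter_vocab(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering (hypothetical helper)."""
    # document frequency: number of documents each token appears in
    df = Counter(token for doc in docs for token in set(doc))
    max_df = no_above * len(docs)  # convert the fraction to a document count
    kept = [t for t, n in df.items() if no_below <= n <= max_df]
    # keep the keep_n most frequent survivors (ties broken alphabetically)
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:keep_n]

# toy corpus, not taken from the headlines data
toy_docs = [
    ["rain", "help", "bushfir"],
    ["rain", "flood", "warn"],
    ["rain", "bushfir", "warn"],
    ["rain", "elect"],
]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(vocab)
```

Here "rain" is dropped because it occurs in every document, and "help", "flood" and "elect" are dropped because each occurs in only one.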

In [46]:
dictionary.filter_extremes(no_below=15, no_above=0.7, keep_n=100000)

**Step 3: Apply doc2bow for each doc**

In [47]:
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus_clean]
bow_corpus[4310]

[(76, 1), (112, 1), (483, 1), (4021, 1)]

In [51]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 76 ("bushfir") appears 1 time.
Word 112 ("help") appears 1 time.
Word 483 ("rain") appears 1 time.
Word 4021 ("dampen") appears 1 time.
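
Conceptually, `doc2bow` counts how often each known token occurs in a document and returns sorted `(token_id, count)` pairs, ignoring tokens that are not in the dictionary. A minimal plain-Python sketch of that idea follows. It is not gensim's implementation: the helper `to_bow` is hypothetical, and the `token2id` mapping is hand-written from the four ids printed above.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Map a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# ids copied by hand from the output above; a real mapping comes from the dictionary
token2id = {"bushfir": 76, "help": 112, "rain": 483, "dampen": 4021}

# "unknownword" stands in for a token that was filtered out of the vocabulary
bow = to_bow(["rain", "help", "dampen", "bushfir", "unknownword"], token2id)
print(bow)
```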


**Step 4: Run the LDA model on the BOW corpus**

In [61]:
# LDA needs a large corpus and many passes/iterations to work well;
# the number of topics to extract must be chosen up front
ldamodel = LdaModel(bow_corpus, num_topics=10, id2word=dictionary, passes=5, iterations=200)

In [67]:
for i in ldamodel.print_topics(num_words = 6):
    print(i)

(0, '0.020*"canberra" + 0.019*"home" + 0.018*"north" + 0.012*"commun" + 0.012*"servic" + 0.012*"korea"')
(1, '0.019*"turnbul" + 0.018*"power" + 0.017*"miss" + 0.017*"tasmania" + 0.016*"rise" + 0.014*"break"')
(2, '0.049*"australian" + 0.021*"chang" + 0.021*"women" + 0.013*"victoria" + 0.011*"john" + 0.011*"council"')
(3, '0.034*"sydney" + 0.033*"govern" + 0.030*"queensland" + 0.019*"brisban" + 0.017*"say" + 0.015*"peopl"')
(4, '0.052*"trump" + 0.024*"kill" + 0.024*"south" + 0.017*"attack" + 0.017*"tasmanian" + 0.014*"leagu"')
(5, '0.037*"year" + 0.027*"polic" + 0.023*"hous" + 0.020*"crash" + 0.019*"coast" + 0.019*"die"')
(6, '0.028*"charg" + 0.028*"court" + 0.022*"murder" + 0.022*"adelaid" + 0.019*"perth" + 0.018*"face"')
(7, '0.058*"australia" + 0.019*"test" + 0.015*"accus" + 0.015*"alleg" + 0.012*"interview" + 0.012*"citi"')
(8, '0.023*"elect" + 0.021*"school" + 0.017*"state" + 0.015*"indigen" + 0.014*"high" + 0.012*"polit"')
(9, '0.024*"world" + 0.021*"nation" + 0.020*"donald" + 0.0

**Step 5: Evaluate the model on a sample document and an unseen document**

In [72]:
#evaluate on sample doc
sample = corpus_clean[4310]
print(sample)
for index, score in sorted(ldamodel[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, ldamodel.print_topic(index, 10)))

['rain', 'help', 'dampen', 'bushfir']

Score: 0.41999632120132446	 
Topic: 0.019*"turnbul" + 0.018*"power" + 0.017*"miss" + 0.017*"tasmania" + 0.016*"rise" + 0.014*"break" + 0.014*"water" + 0.012*"fall" + 0.012*"forc" + 0.012*"busi"

Score: 0.22000007331371307	 
Topic: 0.020*"canberra" + 0.019*"home" + 0.018*"north" + 0.012*"commun" + 0.012*"servic" + 0.012*"korea" + 0.011*"protest" + 0.011*"street" + 0.011*"train" + 0.011*"royal"

Score: 0.21999971568584442	 
Topic: 0.037*"year" + 0.027*"polic" + 0.023*"hous" + 0.020*"crash" + 0.019*"coast" + 0.019*"die" + 0.018*"live" + 0.017*"market" + 0.017*"shoot" + 0.013*"investig"

Score: 0.02000216767191887	 
Topic: 0.023*"elect" + 0.021*"school" + 0.017*"state" + 0.015*"indigen" + 0.014*"high" + 0.012*"polit" + 0.012*"need" + 0.011*"concern" + 0.010*"want" + 0.010*"liber"

Score: 0.020001718774437904	 
Topic: 0.034*"sydney" + 0.033*"govern" + 0.030*"queensland" + 0.019*"brisban" + 0.017*"say" + 0.015*"peopl" + 0.015*"labor" + 0.015*"drug" + 0.

In [75]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(ldamodel[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, ldamodel.print_topic(index, 5)))

Score: 0.1833333820104599	 Topic: 0.058*"australia" + 0.019*"test" + 0.015*"accus" + 0.015*"alleg" + 0.012*"interview"
Score: 0.1833333522081375	 Topic: 0.019*"turnbul" + 0.018*"power" + 0.017*"miss" + 0.017*"tasmania" + 0.016*"rise"
Score: 0.1833333522081375	 Topic: 0.023*"elect" + 0.021*"school" + 0.017*"state" + 0.015*"indigen" + 0.014*"high"
Score: 0.1833333522081375	 Topic: 0.024*"world" + 0.021*"nation" + 0.020*"donald" + 0.016*"final" + 0.013*"time"
Score: 0.18333323299884796	 Topic: 0.052*"trump" + 0.024*"kill" + 0.024*"south" + 0.017*"attack" + 0.017*"tasmanian"
Score: 0.01666666753590107	 Topic: 0.020*"canberra" + 0.019*"home" + 0.018*"north" + 0.012*"commun" + 0.012*"servic"
Score: 0.01666666753590107	 Topic: 0.049*"australian" + 0.021*"chang" + 0.021*"women" + 0.013*"victoria" + 0.011*"john"
Score: 0.01666666753590107	 Topic: 0.034*"sydney" + 0.033*"govern" + 0.030*"queensland" + 0.019*"brisban" + 0.017*"say"
Score: 0.01666666753590107	 Topic: 0.037*"year" + 0.027*"polic" +

## <a id="use case">LDA Model on Corpus with TF-IDF</a>

In [76]:
# fit TF-IDF on the BOW corpus, then apply the transformation to the entire corpus
# and preview the TF-IDF scores for the first document
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5903603121911333),
 (1, 0.3852450692300274),
 (2, 0.4974556050119205),
 (3, 0.505567858418396)]
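
The four weights above form a unit-length vector (their squares sum to 1), which is consistent with each document being L2-normalised. A minimal plain-Python sketch of such a weighting is shown below, assuming the scheme is raw term count times log2(N / df) followed by L2 normalisation. The function name `tfidf_vector` and the toy document frequencies are made up for illustration; this is not gensim's code.

```python
import math

def tfidf_vector(bow, dfs, num_docs):
    """Weight (token_id, count) pairs by count * log2(num_docs / df), then L2-normalise."""
    weighted = [(tid, count * math.log2(num_docs / dfs[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []  # every token occurs in all documents, so nothing is informative
    return [(tid, w / norm) for tid, w in weighted]

# toy numbers, not taken from the headlines corpus
dfs = {0: 1, 1: 4, 2: 2}  # document frequency per token id
vec = tfidf_vector([(0, 1), (1, 1), (2, 1)], dfs, num_docs=8)
print(vec)
```

The rarest token (id 0, found in 1 of 8 documents) receives the largest weight, and the most common one (id 1) the smallest.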


In [77]:
ldamodel2 = LdaModel(corpus_tfidf, num_topics=10, id2word = dictionary, passes=5, iterations=200)

In [79]:
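# evaluate on the sample document
# note: ldamodel2 was trained on TF-IDF weights but is queried here with the raw BOW vector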
sample = corpus_clean[4310]
print(sample)
for index, score in sorted(ldamodel2[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, ldamodel2.print_topic(index, 5)))

['rain', 'help', 'dampen', 'bushfir']

Score: 0.5826798677444458	 
Topic: 0.014*"peopl" + 0.012*"countri" + 0.011*"royal" + 0.010*"citi" + 0.009*"refuge"

Score: 0.2572932839393616	 
Topic: 0.023*"queensland" + 0.016*"north" + 0.011*"life" + 0.011*"australia" + 0.011*"korea"

Score: 0.020009439438581467	 
Topic: 0.019*"donald" + 0.016*"murder" + 0.014*"polic" + 0.013*"canberra" + 0.013*"tasmanian"

Score: 0.02000546082854271	 
Topic: 0.016*"elect" + 0.015*"tasmania" + 0.011*"rural" + 0.010*"guilti" + 0.010*"million"

Score: 0.020005056634545326	 
Topic: 0.041*"trump" + 0.011*"rise" + 0.010*"leagu" + 0.009*"coast" + 0.009*"price"

Score: 0.020004302263259888	 
Topic: 0.012*"week" + 0.010*"australian" + 0.008*"sport" + 0.007*"region" + 0.006*"star"

Score: 0.020001139491796494	 
Topic: 0.011*"polit" + 0.010*"senat" + 0.010*"hour" + 0.009*"young" + 0.008*"show"

Score: 0.02000080794095993	 
Topic: 0.015*"live" + 0.014*"crash" + 0.012*"drum" + 0.011*"south" + 0.010*"malcolm"

Score: 0.0200

## <a id="ref">References</a>
* [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)