**Adapted from [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) by Susan Li**

**An intuitive and insightful read on LDA: [Intuitive Guide to Latent Dirichlet Allocation](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158) by Thushan Ganegedara. Highly recommended.**

In [None]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

In [51]:
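# skip malformed rows; on_bad_lines (pandas >= 1.3) replaces error_bad_lines,
# which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')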

In [52]:
# copy so that adding a column later does not trigger SettingWithCopyWarning
data_text = data[['headline_text']].copy()

**Tokenization** : split text into sentences and sentences into words, lowercased with punctuation removed

**Lemmatization** : inflected forms (third-person, past / future tense) are reduced to their dictionary form

**Stemming** : words are reduced to their root form by stripping suffixes, e.g. tokenization -> token (not used in this case)

In this case, stopwords and words with 3 or fewer characters are also removed

In [46]:
# def stemming(word):
#     return SnowballStemmer(language='english').stem(word)

def lemmatize(word, pos):
    return WordNetLemmatizer().lemmatize(word=word, pos=pos)

def tokenize(sentence):
    # note: simple_preprocess keeps tokens of 2 to 15 characters by default (min_len=2, max_len=15)
    return simple_preprocess(sentence)
    
def remove_stopwords_shortwords(token_list, min_len):
    # keep tokens that are not stopwords and are longer than min_len characters
    return [i for i in token_list if i not in STOPWORDS and len(i) > min_len]

def process(sentence, pos='v', min_len=3):
    tokens = tokenize(sentence)
    tokens = remove_stopwords_shortwords(tokens, min_len)
    # lemmatize with the requested part of speech ('v' = verb by default)
    res_tokens = [lemmatize(i, pos) for i in tokens]
    return res_tokens

In [55]:
data_text['tokens'] = data_text['headline_text'].map(process)

In [82]:
data_text.head()

Unnamed: 0,headline_text,tokens
0,aba decides against community broadcasting lic...,"[decide, community, broadcast, licence]"
1,act fire witnesses must be aware of defamation,"[witness, aware, defamation]"
2,a g calls for infrastructure protection summit,"[call, infrastructure, protection, summit]"
3,air nz staff in aust strike for pay rise,"[staff, aust, strike, rise]"
4,air nz strike to affect australian travellers,"[strike, affect, australian, travellers]"


In [61]:
text_dict = gensim.corpora.Dictionary(data_text['tokens'])

**Bag of words** : each document is represented as a list of (token id, count of that token in the document) pairs

In gensim, the mapping between tokens and ids is generated by gensim.corpora.Dictionary
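
To make the (token id, token count) representation concrete, here is a minimal pure-Python sketch of the same idea. The toy documents and the `doc2bow` helper are hypothetical stand-ins written for illustration; ids are assigned in order of first appearance, which is an assumption and may not match the ids gensim assigns.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not taken from the ABC dataset)
docs = [
    ["strike", "affect", "travellers"],
    ["staff", "strike", "strike"],
]

# Give each new token the next free integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow(doc, token2id):
    """Return sorted (token id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_toy = [doc2bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_toy)
```

The second toy document contains "strike" twice, so its pair carries a count of 2. Headlines are short, which is why almost every count in the real output is 1.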

In [62]:
# filter out tokens that appear in fewer than 15 documents
# or in more than 50% of all documents,
# then keep only the 100000 most frequent of the remaining tokens
text_dict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
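
The effect of the three thresholds can be sketched in plain Python. This is an illustration of document-frequency filtering under my reading of the parameters, not gensim's implementation; the function name and toy data are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above, keep_n):
    """Return the set of tokens that survive document-frequency filtering."""
    # document frequency: number of documents containing the token at least once
    df = Counter(token for doc in docs for token in set(doc))
    max_df = int(no_above * len(docs))
    kept = [t for t, n in df.items() if no_below <= n <= max_df]
    # keep the keep_n tokens with the highest document frequency
    kept.sort(key=lambda t: (-df[t], t))
    return set(kept[:keep_n])

docs = [
    ["police", "charge", "man"],
    ["police", "crash"],
    ["police", "charge"],
    ["market", "crash"],
]
vocab = filter_by_document_frequency(docs, no_below=2, no_above=0.5, keep_n=10)
print(sorted(vocab))
```

In this toy corpus "police" is dropped for being too common (3 of 4 documents), while "man" and "market" are dropped for being too rare (1 document each).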

In [85]:
# bag of words formation for processed tokens
bow = [text_dict.doc2bow(doc) for doc in data_text['tokens']]

In [86]:
# example of bow formation
bow[:5]

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1)],
 [(13, 1), (15, 1), (16, 1), (17, 1)]]

**term frequency (tf)** : number of times a token occurs in a document / total number of tokens in the document

**inverse document frequency (idf)** : log(number of documents / number of documents that contain the token)

**tf-idf** : tf * idf
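
The definitions above can be worked through on a toy corpus. This is a minimal sketch of the textbook formulas only. As far as I know, gensim's TfidfModel defaults differ in detail (raw counts for tf, a base-2 logarithm for idf, and each document vector normalized to unit length), so its scores will not match these numbers. The toy documents and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical)
docs = [
    ["police", "charge", "man"],
    ["police", "crash"],
    ["market", "crash", "crash", "fall"],
]

n_docs = len(docs)
# number of documents that contain each token
df = Counter(token for doc in docs for token in set(doc))

def tf_idf(doc):
    """Map each token in doc to tf * idf, using the definitions above."""
    counts = Counter(doc)
    return {
        token: (count / len(doc)) * math.log(n_docs / df[token])
        for token, count in counts.items()
    }

weights = [tf_idf(doc) for doc in docs]
for w in weights:
    print(w)
```

A token that appears in every document would get idf = log(1) = 0, so it contributes nothing. This is why tf-idf downweights very common words.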

In [87]:
# create the tfidf model using the bag of words
tfidf = gensim.models.TfidfModel(bow)

In [88]:
# apply the tfidf model to the bag-of-words corpus to get tf-idf weights
corpus_tfidf = tfidf[bow]

In [96]:
# [(token id, tfidf score), ...]
corpus_tfidf[0]

[(0, 0.49762407643822143),
 (1, 0.3948343404468724),
 (2, 0.5955575073121571),
 (3, 0.4917187993528625)]

# Latent Dirichlet Allocation (LDA)
LDA is a statistical model for discovering the abstract topics that occur in a collection of documents.

It models each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions.
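
One way to build intuition is to run the generative story that LDA assumes: draw a topic mixture for the document, then for each word pick a topic from that mixture and a word from that topic. The sketch below does this with the standard library only. The vocabulary, prior values, and function names are hypothetical, and this simulates the model's assumptions rather than showing how gensim fits it.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this document
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from it
        words.append(w)
    return theta, words

rng = random.Random(2018)
vocab = ["police", "court", "market", "price"]
beta = [0.5] * len(vocab)  # hypothetical prior over words for each topic
topic_word = [sample_dirichlet(beta, rng) for _ in range(2)]  # 2 topics
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=6, rng=rng)
print(theta)
print(words)
```

Training reverses this process: given only the observed words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.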

In [97]:
# create the lda model from the bag of words
# LdaMulticore for parallel processing
# This one takes a while to complete...
lda_model = gensim.models.LdaMulticore(bow, num_topics=10, id2word=text_dict, passes=2, workers=2)

In [109]:
# The ten topics identified by the LDA model,
# each shown as its most probable words
# the number in front of each word is its probability within the topic
lda_model.print_topics() 

[(0,
  '0.022*"murder" + 0.018*"market" + 0.018*"year" + 0.016*"australian" + 0.014*"attack" + 0.013*"record" + 0.013*"tasmania" + 0.013*"share" + 0.013*"sydney" + 0.013*"family"'),
 (1,
  '0.038*"trump" + 0.028*"queensland" + 0.016*"rise" + 0.014*"price" + 0.013*"fall" + 0.011*"children" + 0.009*"energy" + 0.009*"say" + 0.008*"inquiry" + 0.008*"city"'),
 (2,
  '0.052*"police" + 0.022*"charge" + 0.019*"crash" + 0.019*"perth" + 0.018*"woman" + 0.017*"die" + 0.015*"donald" + 0.014*"drug" + 0.013*"people" + 0.012*"jail"'),
 (3,
  '0.035*"court" + 0.022*"face" + 0.012*"sentence" + 0.012*"talk" + 0.011*"accuse" + 0.011*"hold" + 0.011*"royal" + 0.011*"hear" + 0.011*"release" + 0.010*"korea"'),
 (4,
  '0.020*"government" + 0.018*"coast" + 0.016*"state" + 0.015*"plan" + 0.014*"school" + 0.012*"fund" + 0.012*"gold" + 0.010*"say" + 0.009*"centre" + 0.009*"labor"'),
 (5,
  '0.024*"kill" + 0.017*"north" + 0.015*"arrest" + 0.015*"west" + 0.014*"south" + 0.013*"dead" + 0.013*"train" + 0.012*"china" 
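
Each topic above is printed as a single string of weight*"word" terms. If the pairs are needed as data, the string can be split with a small regular expression. The `parse_topic` helper below is a hypothetical convenience written for this format; I believe gensim's show_topics(formatted=False) returns the pairs directly, which is the simpler route when the model object is at hand.

```python
import re

def parse_topic(topic_string):
    """Split '0.022*"murder" + 0.018*"market"' into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# first three terms of topic 0 from the output above
example = '0.022*"murder" + 0.018*"market" + 0.018*"year"'
parsed = parse_topic(example)
print(parsed)
```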

In [110]:
# an LDA model can also be trained on the tf-idf-weighted corpus instead of raw counts
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=text_dict, passes=2, workers=4)

In [111]:
lda_model_tfidf.print_topics() 

[(0,
  '0.016*"rural" + 0.011*"news" + 0.006*"business" + 0.006*"national" + 0.005*"cut" + 0.005*"fund" + 0.005*"care" + 0.004*"july" + 0.004*"school" + 0.004*"centre"'),
 (1,
  '0.011*"people" + 0.010*"search" + 0.010*"miss" + 0.009*"weather" + 0.008*"queensland" + 0.006*"david" + 0.005*"police" + 0.005*"wild" + 0.005*"boat" + 0.005*"body"'),
 (2,
  '0.012*"market" + 0.010*"share" + 0.007*"price" + 0.007*"australian" + 0.006*"christmas" + 0.006*"michael" + 0.006*"coal" + 0.005*"monday" + 0.005*"live" + 0.005*"dollar"'),
 (3,
  '0.011*"turnbull" + 0.007*"september" + 0.007*"wednesday" + 0.007*"peter" + 0.005*"grand" + 0.004*"wrap" + 0.004*"collapse" + 0.004*"destroy" + 0.004*"blaze" + 0.004*"damage"'),
 (4,
  '0.016*"interview" + 0.008*"crash" + 0.008*"john" + 0.007*"royal" + 0.006*"october" + 0.006*"asylum" + 0.006*"august" + 0.005*"die" + 0.005*"june" + 0.005*"tony"'),
 (5,
  '0.010*"podcast" + 0.010*"drum" + 0.009*"world" + 0.009*"australia" + 0.008*"league" + 0.006*"street" + 0.006