# Topic Modeling and Latent Dirichlet Allocation (LDA) in Python

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

In [8]:
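import pandas as pd

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False (removed in pandas 2.0)
data = pd.read_csv('/Users/teresazhang/Downloads/abcnews-date-text.v8.csv', on_bad_lines='skip')
# Keep only the headline column; reset_index() adds the old index as an 'index' column
data_text = data[['headline_text']].reset_index()
documents = data_text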

Data Pre-processing
- tokenization: split the text into sentences and the sentences into words; lowercase the words and remove punctuation
- remove words with 3 or fewer characters
- remove all stopwords
- words are lemmatized: words in third person are changed to first person, and verbs in past and future tenses are changed into present tense
- words are stemmed: words are reduced to their root form

In [10]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/c8/3a/32a1edf4f335eba0873021a7ddb3230f05dedd2b5450960118b402ca0771/gensim-3.8.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (24.7MB)
Collecting smart-open>=1.7.0 (from gensim)
  Downloading https://files.pythonhosted.org/packages/37/c0/25d19badc495428dec6a4bf7782de617ee0246a9211af75b302a2681dea7/smart_open-1.8.4.tar.gz (63kB)
Collecting boto3 (from smart-open>=1.7.0->gensim)
  Downloading https://files.pythonhosted.org/packages/dd/cf/d2b386e197d2c093dfc0a4f2d8225ce8967b9c57b7c290ff61dbfe45d5a7/boto3-1.9.185-py2.py3-none-any.whl (128kB)
Collecting s3transfer<0.3.0,>=0.2.0 (from boto3->smart-open>=1.7.0->gensim)
  Downloading https://files

In [12]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np

# Fix NumPy's global seed for reproducibility
np.random.seed(2019)
# WordNet data is required by WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/teresazhang/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [27]:
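# Build the stemmer and lemmatizer once instead of on every call
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 or fewer characters
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result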

In [30]:
print(documents.headline_text[0].split(' '))
print(preprocess(documents.headline_text[0]))

['aba', 'decides', 'against', 'community', 'broadcasting', 'licence']
['decid', 'communiti', 'broadcast', 'licenc']


In [32]:
preprocessed_docs = documents.headline_text.apply(preprocess)

In [33]:
preprocessed_docs.head()

0     [decid, communiti, broadcast, licenc]
1                        [wit, awar, defam]
2    [call, infrastructur, protect, summit]
3               [staff, aust, strike, rise]
4      [strike, affect, australian, travel]
Name: headline_text, dtype: object

In [34]:
dictionary = gensim.corpora.Dictionary(preprocessed_docs)

In [37]:
# Drop tokens in fewer than 15 documents or in more than 50% of them; keep at most 100000
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
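
To make the three thresholds concrete, here is a minimal pure-Python sketch of the same kind of filtering on a toy corpus. The helper `filter_vocab` and the toy documents are hypothetical, and the ordering of ties may differ from gensim's; it only illustrates dropping tokens that appear in too few or too many documents and then keeping the most frequent survivors.

```python
from collections import Counter

def filter_vocab(docs, no_below, no_above, keep_n):
    """Hypothetical helper illustrating the idea behind Dictionary.filter_extremes."""
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus, so convert it to a document count
    max_docs = no_above * len(docs)
    # Keep tokens that are neither too rare nor too common
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep at most keep_n tokens, most frequent first (ties broken alphabetically)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Toy corpus: 'polic' is in every document, several tokens appear only once
toy_docs = [
    ['polic', 'court', 'murder'],
    ['polic', 'market', 'bank'],
    ['polic', 'court', 'elect'],
    ['polic', 'bank', 'share'],
]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(vocab)
```

With these toy settings, 'polic' is dropped for being too common and the single-occurrence tokens for being too rare, which leaves only 'bank' and 'court'.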

In [38]:
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

In [39]:
bow_corpus[0]

[(0, 1), (1, 1), (2, 1), (3, 1)]

In [40]:
preprocessed_docs[0]

['decid', 'communiti', 'broadcast', 'licenc']
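
Each pair in the bag-of-words output is (token id, count). As a rough illustration, the sketch below reproduces that structure with a plain `Counter`. The function `doc_to_bow` and the id mapping in `token2id` are hypothetical; gensim assigns its own ids, and this is not its implementation.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Hypothetical stand-in for doc2bow: count known tokens, keyed by id."""
    # Tokens missing from the vocabulary are silently ignored
    counts = Counter(token2id[t] for t in doc if t in token2id)
    # Return (token_id, count) pairs sorted by id
    return sorted(counts.items())

# Assumed id mapping, for illustration only
token2id = {'broadcast': 0, 'communiti': 1, 'decid': 2, 'licenc': 3}

bow = doc_to_bow(['decid', 'communiti', 'broadcast', 'licenc'], token2id)
print(bow)

# A repeated token raises its count; an unknown token ('aba') is skipped
bow_repeat = doc_to_bow(['decid', 'decid', 'aba'], token2id)
print(bow_repeat)
```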

In [41]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In information retrieval, tf–idf (short for term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in information retrieval searches, text mining, and user modeling.
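
As a worked illustration of the weighting, the sketch below computes tf-idf by hand on a toy bag-of-words corpus. It assumes the weight is count * log2(N / df) followed by L2 normalization of each document, which is my understanding of gensim's defaults rather than something stated in this notebook. The function `tfidf_weights` and the toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(bow_docs):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    # Document frequency: number of documents containing each token id
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    weighted = []
    for doc in bow_docs:
        raw = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        # If every weight is 0 (tokens present in all documents), return an empty list
        weighted.append([(tid, w / norm) for tid, w in raw] if norm else [])
    return weighted

# Toy corpus of (token_id, count) pairs, in the same shape as bow_corpus
toy_bow = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
    [(3, 1)],
]
toy_tfidf = tfidf_weights(toy_bow)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

In the first toy document, token 0 appears in three of the four documents and token 1 in only two, so token 1 receives the larger weight even though both occur once.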

In [43]:
corpus_tfidf[0]

[(0, 0.5893154583024485),
 (1, 0.3892866165028569),
 (2, 0.49651921997736453),
 (3, 0.5046106271280878)]

In [44]:
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10, id2word=dictionary)

In [47]:
lda_model.print_topics(-1)

[(0,
  '0.036*"polic" + 0.026*"queensland" + 0.023*"death" + 0.021*"court" + 0.018*"woman" + 0.017*"alleg" + 0.017*"die" + 0.017*"brisban" + 0.016*"murder" + 0.015*"shoot"'),
 (1,
  '0.020*"chang" + 0.017*"live" + 0.017*"state" + 0.017*"hous" + 0.016*"market" + 0.014*"labor" + 0.012*"show" + 0.012*"help" + 0.012*"share" + 0.012*"bank"'),
 (2,
  '0.033*"year" + 0.019*"elect" + 0.018*"say" + 0.018*"women" + 0.017*"face" + 0.017*"north" + 0.016*"australia" + 0.015*"turnbul" + 0.013*"peopl" + 0.012*"china"'),
 (3,
  '0.036*"attack" + 0.026*"test" + 0.025*"open" + 0.017*"take" + 0.016*"lose" + 0.016*"abus" + 0.014*"flood" + 0.013*"drug" + 0.013*"lead" + 0.013*"aborigin"'),
 (4,
  '0.032*"melbourn" + 0.025*"donald" + 0.022*"coast" + 0.018*"dead" + 0.015*"win" + 0.015*"island" + 0.015*"deal" + 0.014*"water" + 0.014*"student" + 0.014*"polit"'),
 (5,
  '0.021*"nation" + 0.015*"power" + 0.015*"tasmania" + 0.015*"child" + 0.014*"time" + 0.013*"say" + 0.012*"plan" + 0.011*"need" + 0.010*"busi" + 0