# Latent Dirichlet Allocation for Topic Modeling


Topic modeling lets us analyze large volumes of text efficiently by grouping documents into topics. **Latent Dirichlet Allocation (LDA)**, the method used in this Python notebook, is an unsupervised topic-modeling technique: it needs no labels and discovers groups of related words and documents directly from the text.

The steps in LDA include:

1. Decide the number of topics for the LDA model to generate (in my case, 7 topics for all my documents).
2. Create the model with `LatentDirichletAllocation` in scikit-learn, which starts from a random initialization (made reproducible with `random_state`).
3. Let the model iteratively re-estimate which topic each word most probably belongs to, based on how words are distributed across the documents.
4. After many iterations, read off the final topic distribution for each document and the word distribution for each topic.

Below is a simple step-by-step guide to building an LDA topic model for a collection of online articles.


Import Libraries

In [3]:
import pandas as pd
import numpy as np

Read the CSV file

In [8]:
npr = pd.read_csv(r"C:\Users\Owner\Desktop\UPDATED_NLP_COURSE\UPDATED_NLP_COURSE\05-Topic-Modeling\npr.csv")
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [10]:
npr.describe().transpose()

Unnamed: 0,count,unique,top,freq
Article,11992,11991,"Washington state has released an estimated 3, ...",2


In [14]:
# View one of the articles
npr['Article'][10]



#### Preprocessing our Data

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
# Count-vectorize the articles: drop English stop words, ignore terms that appear in more than 95% of documents, and keep only terms that appear in at least 2 documents
cv = CountVectorizer(max_df= 0.95, min_df=2, stop_words='english')
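
The `max_df` and `min_df` arguments filter on document frequency (how many documents contain a term), not on total word count. The pure-Python sketch below imitates that filter on a hypothetical, pre-tokenized toy corpus to show which terms survive; it is a simplified stand-in for what `CountVectorizer` does, not its actual implementation.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized for simplicity
toy_docs = [
    ["npr", "senate", "health"],
    ["npr", "senate", "election"],
    ["npr", "music", "guitar"],
    ["npr", "music", "health"],
]

max_df = 0.95  # float: maximum share of documents a term may appear in
min_df = 2     # int: minimum number of documents a term must appear in

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

n_docs = len(toy_docs)
kept = sorted(
    term for term, df in doc_freq.items()
    if min_df <= df <= max_df * n_docs
)
print(kept)  # ['health', 'music', 'senate']
```

Here `npr` is dropped because it appears in every document, while `election` and `guitar` are dropped because each appears in only one.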

In [23]:
dtm = cv.fit_transform(npr['Article'])

In [25]:
# Final document-term matrix (rows are articles, columns are vocabulary terms)
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
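
The shape and stored-element count reported above explain why the matrix is kept in a sparse format. A quick back-of-the-envelope check, using only the numbers printed in that output:

```python
# Numbers copied from the sparse-matrix summary printed above
n_docs, n_terms = 11992, 54777
stored = 3033388

# Share of cells that actually hold a non-zero count
density = stored / (n_docs * n_terms)
print(f"{density:.2%} of the cells are non-zero")
```

Fewer than 1% of the cells hold a count, so storing only the non-zero entries saves a large amount of memory compared with a dense array.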

In [27]:
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [29]:
# Fit the LDA model to the document-term matrix
LDA.fit(dtm)

Grab the vocabulary of words

In [37]:
# First check the length of the vocabulary (the feature names)
len(cv.get_feature_names_out())

54777

In [39]:
# Check the data type to understand how to grab words from the vocabulary
type(cv.get_feature_names_out())

numpy.ndarray

In [57]:
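# Pick a random index to look up a random word in the vocabulary
import random
# randint includes both endpoints, so the upper bound must be the last valid index
random_word_id = random.randint(0, len(cv.get_feature_names_out()) - 1)
cv.get_feature_names_out()[random_word_id]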

'solemnly'

Grab the topics

In [59]:
# Check the number of topic components
len(LDA.components_) 

7

In [63]:
# Check the shape: (number of topics, vocabulary size)
LDA.components_.shape

(7, 54777)

In [65]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
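
The values in `components_` are unnormalized word weights rather than probabilities, which is why many of them exceed 1. The many entries close to 0.142857 (about 1/7) are probably the prior showing through for words a topic essentially never uses, assuming the default `topic_word_prior` of `1 / n_components`. To read a row as a probability distribution over words, divide it by its sum. A minimal sketch with hypothetical numbers (`toy_components` is made up, not real output):

```python
import numpy as np

# Hypothetical weights for 2 topics over a 4-word vocabulary (not real output)
toy_components = np.array([
    [8.0, 0.5, 0.5, 1.0],
    [0.5, 6.0, 3.0, 0.5],
])

# Divide each row by its sum so every topic becomes a probability distribution
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # each topic's probabilities sum to 1
```

Normalizing does not change the ranking of words within a topic, so the `argsort` steps that follow work the same way on the raw weights.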

In [68]:
# Grab a single topic
single_topic = LDA.components_[0]
# Use argsort to get the word indices ordered from the lowest to the highest weight
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [70]:
# Get the indices of the 10 highest-weighted words using argsort
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)
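
Because `argsort` orders indices from the lowest to the highest value, slicing from the end gives the strongest words, still in ascending order. The toy example below makes that explicit; `toy_vocab` and `toy_topic` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical word weights for one topic and a matching toy vocabulary
toy_vocab = np.array(["care", "guitar", "health", "music", "tax"])
toy_topic = np.array([5.0, 0.2, 9.0, 0.1, 3.0])

order = toy_topic.argsort()       # indices from lowest to highest weight
top_three = order[-3:]            # the three highest weights, still ascending
top_three_desc = top_three[::-1]  # reversed so the strongest word comes first

print(order)                      # [3 1 4 0 2]
print(toy_vocab[top_three])       # ['tax' 'care' 'health']
print(toy_vocab[top_three_desc])  # ['health' 'care' 'tax']
```

This is why the word lists printed in this notebook end with the most heavily weighted word rather than starting with it.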

In [78]:
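# Indices of the 20 highest-weighted words, in ascending order of weight
top_twenty_words = single_topic.argsort()[-20:]
# Look up the vocabulary once instead of on every loop iteration
feature_names = cv.get_feature_names_out()
for i in top_twenty_words:
    print(feature_names[i])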

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


Grab the highest-probability words per topic

In [82]:
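# Look up the vocabulary once, then print the 15 highest-weighted words of each topic
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')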


THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

In [84]:
# Get each article's topic distribution so a topic number can be attached to the npr dataset
topic_results = LDA.transform(dtm)
topic_results.shape

(11992, 7)

In [86]:
npr['Topic'] = topic_results.argmax(axis=1)
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4
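
Each row of `topic_results` holds one article's share of every topic, and `argmax(axis=1)` keeps only the index of the largest share. The sketch below uses hypothetical probabilities (`toy_topic_results` is made up) to show that assignment, along with the share itself as a rough indicator of how clear-cut the label is.

```python
import numpy as np

# Hypothetical document-topic probabilities for 3 documents and 3 topics
toy_topic_results = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.34, 0.33, 0.33],
])

dominant_topic = toy_topic_results.argmax(axis=1)  # index of the largest value per row
dominant_share = toy_topic_results.max(axis=1)     # how strongly that topic dominates

print(dominant_topic)  # [0 2 0]
print(dominant_share)  # [0.7  0.75 0.34]
```

The third toy document is labeled topic 0 even though its shares are nearly tied, so keeping the winning share next to the label can help flag articles that do not fit a single topic well.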
