# Topic Modelling

Two techniques for topic modelling are covered here:

* Latent Dirichlet Allocation
* Non-Negative Matrix Factorization

## Latent Dirichlet Allocation

In [1]:
#Importing libraries
import pandas as pd

In [2]:
#Reading Csv file
npr=pd.read_csv('npr.csv')

In [3]:
#The dataset holds about 12,000 articles; each row is the full text of one article.
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
#Viewing an article
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

**We will attempt to discover topics and assign one to each of these articles**

#### Count Vectorizer (Feature Representation)

**We use Count Vectorizer for LDA because LDA is a probabilistic model of raw word counts, so it expects integer term frequencies rather than TF-IDF weights**

In [5]:
#Pre-Processing the data
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
cv=CountVectorizer(max_df=0.9,min_df=2,stop_words='english')

# 'max_df' gets rid of terms that are very common across the documents
# (0.9 discards words that show up in more than 90 percent of the documents)
# It can also be an integer, meaning the maximum number of documents a word may appear in

# 'min_df' sets the minimum number of documents a word must appear in
# (2 means a word has to show up in at least two documents)

# stop_words='english' removes common English stop words
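To see what `max_df` and `min_df` do in isolation, here is a minimal sketch on a hypothetical four-sentence corpus (the sentences and the `toy_` names are made up for illustration and are not part of `npr.csv`). Stop-word removal is left out on purpose so that the effect of `max_df` on "the" is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from npr.csv
toy_docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog barked loudly",
]

# No stop-word list here, so the effect of max_df is visible on "the"
toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# "the" appears in 4 of 4 documents (more than 90%), so max_df drops it.
# Words that appear in only one document ("sat", "mouse", ...) are dropped by min_df.
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.shape)
```

Only the terms that occur in two or three of the four documents survive, which is the same filtering applied to the articles above, just at a much smaller scale.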

In [7]:
#Fit_transform
dtm=cv.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [9]:
#Using Latent Dirichlet Allocation
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
LDA=LatentDirichletAllocation(n_components=7,random_state=42)
#n_components is the number of topics to look for (7 here means 7 topics)

In [11]:
#Fitting to the count vectorized matrix
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

LDA is an iterative process: it keeps updating the per-word, per-topic weights over repeated passes through the data so that they move towards stable values. In scikit-learn the number of passes is capped by the max_iter parameter (default 10).
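The iterative fit can be observed on a small scale. The sketch below uses a hypothetical 6 x 5 count matrix (not derived from the articles) and an arbitrary `max_iter=20`; it only illustrates the shapes involved and the fact that the document-topic rows returned by the model sum to 1.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical 6-document, 5-term count matrix (not derived from npr.csv)
toy_counts = np.array([
    [4, 3, 0, 0, 1],
    [5, 2, 0, 1, 0],
    [3, 4, 1, 0, 0],
    [0, 0, 5, 4, 1],
    [0, 1, 4, 5, 0],
    [1, 0, 3, 4, 2],
])

# max_iter=20 is an arbitrary choice for this illustration
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_lda.n_iter_)              # number of passes over the data that were run
print(toy_lda.components_.shape)    # (number of topics, number of terms)
print(toy_doc_topics.sum(axis=1))   # each document's topic weights sum to 1
```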

**Grab the vocabulary of words**

In [12]:
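#Note: scikit-learn 1.0+ offers get_feature_names_out() (returns an array); get_feature_names() was removed in 1.2
cv.get_feature_names()[50000]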

'transcribe'

In [13]:
#Total number of words in the vocabulary
len(cv.get_feature_names())

54777

In [14]:
type(cv.get_feature_names())

list

**Grab the topics**

In [15]:
len(LDA.components_) #7 topics, as requested via n_components

7

In [16]:
type(LDA.components_)

numpy.ndarray

In [17]:
LDA.components_.shape

(7, 54777)

In [18]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
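The values in `components_` are pseudo-counts rather than probabilities, which is why they range from about 0.14 to several thousand. The recurring 0.142857 is 1/7, which matches scikit-learn's default topic-word prior of `1 / n_components`. Dividing each row by its sum gives a word distribution per topic. A minimal sketch, using a hypothetical 2 x 4 matrix in place of the real one:

```python
import numpy as np

# Hypothetical pseudo-count matrix: 2 topics x 4 terms
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 6.0, 3.0],
])

# Divide each row by its sum so every topic becomes a probability distribution over terms
toy_topic_word = toy_components / toy_components.sum(axis=1, keepdims=True)
print(toy_topic_word)
```

The ranking of words within a topic is unchanged by this normalisation, so the `argsort` steps that follow work equally well on the raw pseudo-counts.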

In [22]:
single_topic=LDA.components_[0]

In [23]:
#Sorts the array from smaller to larger and returns the index positions
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [24]:
# One of the words least representative of this topic (second-lowest weight in the argsort output)
single_topic[18302]

0.14285714309286987

In [25]:
# Word most representative of this topic
single_topic[42993]

6247.245510521057

**For my Understanding**

In [26]:
#Argsort example
import numpy as np
arr=np.array([10,200,1])
arr

array([ 10, 200,   1])

In [27]:
arr.argsort() #Smaller to larger

array([2, 0, 1], dtype=int64)

In [28]:
#Returns the indices of the top 10 words for this topic (the last 10 values of .argsort())
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)

In [29]:
top_ten_words=single_topic.argsort()[-10:]

In [31]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


**Grab the highest probability words per topic**

In [32]:
for i,topic in enumerate(LDA.components_):
    print(f"The TOP 15 words for Topic #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])

The TOP 15 words for Topic #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']
The TOP 15 words for Topic #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']
The TOP 15 words for Topic #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']
The TOP 15 words for Topic #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']
The TOP 15 words for Topic #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']
The TOP 15 words for Topic #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people',
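The per-topic lookup can be wrapped in a small helper so the vocabulary is fetched only once. The function name `top_words` and the toy data below are hypothetical; unlike the printed lists above, this sketch returns the most representative word first.

```python
import numpy as np

def top_words(components, vocabulary, n_top=15):
    """Return the n_top highest-weighted terms per topic, most representative first."""
    vocabulary = np.asarray(vocabulary)
    result = []
    for weights in components:
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = np.argsort(weights)[::-1][:n_top]
        result.append(vocabulary[top_idx].tolist())
    return result

# Hypothetical 4-word vocabulary and 2-topic weight matrix
toy_vocab = ["care", "court", "health", "trump"]
toy_weights = np.array([
    [2.0, 0.1, 5.0, 0.3],
    [0.2, 3.0, 0.1, 4.0],
])
toy_top = top_words(toy_weights, toy_vocab, n_top=2)
print(toy_top)
```

With the fitted objects from this notebook, the equivalent call would be `top_words(LDA.components_, cv.get_feature_names())`.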

Now that we have these 7 topics, we can assign one to each article. If you want finer-grained topics, increase n_components when creating the model.

In [33]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [34]:
npr

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
...,...
11987,The number of law enforcement officers shot an...
11988,"Trump is busy these days with victory tours,..."
11989,It’s always interesting for the Goats and Soda...
11990,The election of Donald Trump was a surprise to...


In [35]:
topic_results=LDA.transform(dtm)

In [36]:
topic_results.shape

(11992, 7)

In [37]:
#Probability of the first document belonging to each topic
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [38]:
#Returns the index of the maximum value (here index 1 has the highest probability)
topic_results[0].argmax()

1

In [39]:
npr['Topic']=topic_results.argmax(axis=1)

In [40]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
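Taking only the `argmax` hides how strongly a document belongs to its topic. One possible extension, sketched here on hypothetical probabilities for three documents, keeps the winning probability next to the topic id. The column names and the 0.5 threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities for three documents and seven topics
toy_topic_results = np.array([
    [0.02, 0.68, 0.00, 0.00, 0.30, 0.00, 0.00],
    [0.40, 0.05, 0.05, 0.35, 0.05, 0.05, 0.05],
    [0.00, 0.00, 0.90, 0.00, 0.00, 0.10, 0.00],
])

toy_df = pd.DataFrame({
    "Topic": toy_topic_results.argmax(axis=1),
    "Topic Probability": toy_topic_results.max(axis=1),
})

# Flag documents whose best topic falls below an arbitrary, illustrative threshold
toy_df["Low Confidence"] = toy_df["Topic Probability"] < 0.5
print(toy_df)
```

The second toy document is split almost evenly between two topics, which is exactly the kind of case a single topic id would not reveal.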


## Non-Negative Matrix Factorization

Let's repeat that topic modeling task, but this time, we will use NMF instead of LDA.

In [41]:
import pandas as pd

In [42]:
npr=pd.read_csv('npr.csv')

**LDA needs raw counts from Count Vectorizer because it is a probabilistic model, but NMF works on non-negative coefficient values, so here we can pre-process the text with TfidfVectorizer**
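The difference between the two representations can be checked directly. This is a minimal sketch on a hypothetical three-sentence corpus with default vectorizer settings; it shows that counts are integers while TF-IDF weights are floats, and that each TF-IDF row has unit length under the default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from npr.csv
toy_docs = [
    "health care costs rise",
    "health insurance plan",
    "election campaign rally",
]

toy_count_dtm = CountVectorizer().fit_transform(toy_docs)
toy_tfidf_dtm = TfidfVectorizer().fit_transform(toy_docs)

print(toy_count_dtm.dtype)   # integer counts
print(toy_tfidf_dtm.dtype)   # floating-point weights

# With the default norm='l2', every TF-IDF row has unit Euclidean length
row_norms = np.linalg.norm(toy_tfidf_dtm.toarray(), axis=1)
print(row_norms)
```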

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [44]:
tfidf=TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')
#min_df=2 means a word must appear in at least 2 documents; a word unique to a single document is not useful for finding shared topics

In [45]:
dtm=tfidf.fit_transform(npr['Article'])

In [46]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [47]:
#Non-Negative Matrix Factorization
from sklearn.decomposition import NMF

In [48]:
#Creating 7 topics
nmf_model=NMF(n_components=7,random_state=42)

In [49]:
# This can take a while; we're dealing with a large number of documents!
nmf_model.fit(dtm)

NMF(n_components=7, random_state=42)
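NMF approximates the document-term matrix V as the product of two non-negative matrices, W (documents x topics) and H (topics x terms). The sketch below runs it on a hypothetical random 6 x 5 matrix to show the shapes; the `init` and `max_iter` values are illustrative choices rather than the settings used above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix with 6 documents and 5 terms
rng = np.random.default_rng(0)
toy_V = rng.random((6, 5))

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
toy_W = toy_nmf.fit_transform(toy_V)   # document-topic coefficients
toy_H = toy_nmf.components_            # topic-term coefficients

print(toy_W.shape, toy_H.shape)
print(np.linalg.norm(toy_V - toy_W @ toy_H))   # reconstruction error of V ~ W @ H
```

In the notebook, `nmf_model.components_` plays the role of H and `nmf_model.transform(dtm)` returns W.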

In [50]:
tfidf.get_feature_names()[5000]

'bask'

**Displaying Topics**

In [51]:
for i,topic in enumerate(nmf_model.components_):
    print(f"The TOP 15 Words for the Topic #{i}")
    print([tfidf.get_feature_names()[index] for index in topic.argsort()[-15:]])

The TOP 15 Words for the Topic #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']
The TOP 15 Words for the Topic #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']
The TOP 15 Words for the Topic #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']
The TOP 15 Words for the Topic #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']
The TOP 15 Words for the Topic #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']
The TOP 15 Words for the Topic #5
['love', 'v

The top words per topic will not match the LDA results exactly because the two methods are different, although some topics do overlap. The topics still have to be interpreted by a person, which is easier with domain knowledge.
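One way to see how far two sets of topic assignments overlap is a cross-tabulation. The topic ids below are hypothetical values for six documents, and the column names `lda_topic` and `nmf_topic` are assumptions for this sketch; in practice they would hold the `argmax` results from each model.

```python
import pandas as pd

# Hypothetical topic ids for six documents from two different models
toy_assignments = pd.DataFrame({
    "lda_topic": [1, 1, 4, 0, 3, 3],
    "nmf_topic": [1, 1, 4, 2, 0, 0],
})

# Rows are LDA topics, columns are NMF topics, cells count shared documents
overlap = pd.crosstab(toy_assignments["lda_topic"], toy_assignments["nmf_topic"])
print(overlap)
```

A row with one dominant cell suggests the two models found a similar topic, even if they numbered it differently.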

In [52]:
topic_results=nmf_model.transform(dtm)

In [53]:
topic_results[0].round(2)

array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ])
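Unlike the LDA output, these NMF coefficients are not probabilities: the row above sums to about 0.2 rather than 1. If relative shares are wanted, the row can be rescaled by its sum. A minimal sketch using the rounded values printed above (the `toy_` names are made up for illustration):

```python
import numpy as np

# Rounded NMF coefficients for the first article, copied from the output above
toy_row = np.array([0.0, 0.12, 0.0, 0.06, 0.02, 0.0, 0.0])
print(toy_row.sum())

# Rescale so the coefficients sum to 1 (assumes the row is not all zeros)
toy_shares = toy_row / toy_row.sum()
print(toy_shares.round(2))

# The winning topic is the same before and after rescaling
print(toy_row.argmax(), toy_shares.argmax())
```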

In [54]:
topic_results[0].argmax()

1

In [55]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)

In [56]:
npr['Topic']=topic_results.argmax(axis=1)

In [57]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
...,...,...
11987,The number of law enforcement officers shot an...,3
11988,"Trump is busy these days with victory tours,...",1
11989,It’s always interesting for the Goats and Soda...,0
11990,The election of Donald Trump was a surprise to...,4


In [58]:
#Creating a label for the Topic numbers
mytopic_dict={0:"health",1:"election",2:"Legislation",3:"politics",4:"election",5:"music",6:"edu"}
npr['Topic Label']=npr['Topic'].map(mytopic_dict)

In [59]:
npr

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,politics
4,"From photography, illustration and video, to d...",6,edu
...,...,...,...
11987,The number of law enforcement officers shot an...,3,politics
11988,"Trump is busy these days with victory tours,...",1,election
11989,It’s always interesting for the Goats and Soda...,0,health
11990,The election of Donald Trump was a surprise to...,4,election
