<a href="https://colab.research.google.com/github/plaban1981/NLP-with-Python/blob/master/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Topic modeling lets us efficiently analyze large volumes of text by clustering documents into topics.

#### In the real world, a large amount of text data is unlabeled, so supervised machine learning algorithms cannot be applied to it directly.

##Latent Dirichlet Allocation
Based on probability distributions.

#### Assumptions :-
* Documents with similar topics use similar groups of words

* Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus.

Documents are probability distributions over latent topics.

Topics are probability distributions over words.
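
The two statements above describe the generative view behind LDA. A minimal sketch, assuming a toy six-word vocabulary and two hand-written topics (all names and numbers are illustrative and not taken from the NPR data), samples a document by first drawing its topic mixture and then drawing each word from the chosen topic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and two hand-written topics (each row sums to 1).
vocab = ["court", "law", "judge", "health", "insurance", "care"]
topic_word = np.array([
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # topic 0: legal words
    [0.00, 0.00, 0.00, 0.40, 0.30, 0.30],  # topic 1: health words
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Sample one document under the LDA generative assumptions."""
    # 1. The document is a probability distribution over topics.
    doc_topic = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choice(len(topic_word), p=doc_topic)
        # 3. Pick a word from that topic's distribution over the vocabulary.
        word_id = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_id])
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic.round(2), words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.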

In [1]:
import pandas as pd
data = pd.read_csv('/content/npr.csv')
data.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


## Number of training samples

In [2]:
data.shape

(11992, 1)

##Preprocessing

In [0]:
from sklearn.feature_extraction.text import CountVectorizer


**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
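
To see what the two thresholds do in practice, here is a small sketch on a hypothetical four-document corpus (the documents and variable names are made up for illustration). With `max_df=0.9` a term is dropped if it appears in more than 0.9 × 4 = 3.6 documents, and with `min_df=2` a term is dropped if it appears in fewer than 2 documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus.
toy_docs = [
    "apple banana cherry",
    "apple banana",
    "apple date",
    "apple banana elder",
]

# "apple" occurs in all 4 documents (> 3.6), so max_df=0.9 removes it.
# "cherry", "date" and "elder" occur in 1 document each (< 2), so min_df=2 removes them.
toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_counts = toy_cv.fit_transform(toy_docs)

kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)        # ['banana']
print(toy_counts.shape)  # (4, 1)
```

Only `banana` survives, because it is neither too common nor too rare in this toy corpus.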

In [0]:
cv = CountVectorizer(max_df = 0.90,min_df=2,stop_words='english')

In [0]:
article_vector = cv.fit_transform(data['Article'])

In [6]:
article_vector.shape

(11992, 54777)

In [8]:
type(article_vector)

scipy.sparse.csr.csr_matrix

##Latent Dirichlet Allocation

In [0]:
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=10,random_state=42)

In [10]:
LDA.fit(article_vector)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

##Grab the vocabulary of words

In [12]:
len(cv.get_feature_names())

54777

In [13]:
type(cv.get_feature_names())

list

In [17]:
cv.get_feature_names()[50000],cv.get_feature_names()[10000]

('transcribe', 'coelho')

In [19]:
import random
# randint includes both endpoints, so the upper bound must be the last valid index (54776)
random_word_id = random.randint(0, 54776)
cv.get_feature_names()[random_word_id]

'baloo'
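
A note on versions: `get_feature_names()` was removed in scikit-learn 1.2 in favour of `get_feature_names_out()`, which returns a NumPy array instead of a list. The sketch below uses a hypothetical helper, `feature_names`, that works with either version; the two-sentence corpus exists only to exercise it:

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the fitted vocabulary as a list of str, ordered by column index."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn: returns a NumPy array of strings.
        return [str(word) for word in vectorizer.get_feature_names_out()]
    # Older scikit-learn: returns a plain list.
    return list(vectorizer.get_feature_names())

# Hypothetical two-document corpus, only to exercise the helper.
demo_cv = CountVectorizer().fit([
    "topics are word distributions",
    "documents are topic mixtures",
])
demo_words = feature_names(demo_cv)  # build the list once, then index into it
print(demo_words)
```

Building the list once and reusing it also avoids regenerating the full vocabulary on every lookup inside a loop.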

##Grab the Topics

In [20]:
len(LDA.components_)

10

In [21]:
type(LDA.components_)

numpy.ndarray

In [22]:
LDA.components_.shape

(10, 54777)

##Grab the highest probability words per topic

In [0]:
single_topic = LDA.components_[0]

In [24]:
single_topic

array([5.11072577e+00, 1.94461867e+03, 1.00001806e-01, ...,
       1.00005562e-01, 1.00000000e-01, 1.00001005e-01])
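
The values in `components_` are pseudo-counts rather than probabilities, which is why some of them are far larger than 1. Dividing each row by its sum turns a topic into a distribution over words. A minimal sketch on a hypothetical 2 × 4 matrix standing in for `LDA.components_`:

```python
import numpy as np

# Hypothetical pseudo-count matrix standing in for LDA.components_ (2 topics x 4 words).
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.1, 0.1, 4.9, 4.9],
])

# Divide each row by its sum so that every topic becomes a distribution over words.
toy_topic_word = toy_components / toy_components.sum(axis=1)[:, np.newaxis]

print(toy_topic_word[0])           # first topic: 0.8, 0.1, 0.05, 0.05
print(toy_topic_word.sum(axis=1))  # every row now sums to 1
```

Normalizing does not change the order of words within a topic, so the top words can be ranked on either the raw or the normalized values.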

In [25]:
single_topic.argsort()  # indices that would sort the array in ascending order of value

array([18302,  2475, 44967, ..., 10425, 42561, 42993])
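
Because `argsort()` sorts in ascending order, the highest-weight words sit at the end of the result. A small sketch with a hypothetical five-word vocabulary shows how to take the last few indices and reverse them so the strongest word comes first:

```python
import numpy as np

# Hypothetical weights for a five-word vocabulary.
toy_words = ["tax", "health", "court", "food", "money"]
toy_weights = np.array([3.0, 9.5, 0.1, 1.2, 7.0])

order = toy_weights.argsort()  # indices from smallest to largest weight
top_3 = order[-3:][::-1]       # last three indices, reversed: largest first

print(order.tolist())                  # [2, 3, 0, 4, 1]
print([toy_words[i] for i in top_3])   # ['health', 'money', 'tax']
```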

##Grab top 10 words

In [27]:
single_topic.argsort()[-10:]#last 10 values of argsort()

array([    1, 18349, 33390, 32089, 10421, 31464, 22673, 10425, 42561,
       42993])

In [29]:
top_10_words = single_topic.argsort()[-10:]
for index in top_10_words:
  print(cv.get_feature_names()[index])

000
federal
new
money
companies
million
health
company
said
says


In [30]:
top_20_words = single_topic.argsort()[-20:]
for index in top_20_words:
  print(cv.get_feature_names()[index])

industry
tax
business
percent
pay
people
care
government
year
insurance
000
federal
new
money
companies
million
health
company
said
says


In [32]:
for index,topics in enumerate(LDA.components_):
  print(f"The top 15 words for the topic # {index}")
  print([cv.get_feature_names()[i] for i in topics.argsort()[-15:]])

The top 15 words for the topic # 0
['people', 'care', 'government', 'year', 'insurance', '000', 'federal', 'new', 'money', 'companies', 'million', 'health', 'company', 'said', 'says']
The top 15 words for the topic # 1
['npr', 'intelligence', 'security', 'new', 'told', 'russian', 'campaign', 'obama', 'news', 'white', 'russia', 'house', 'president', 'said', 'trump']
The top 15 words for the topic # 2
['know', 'little', 'home', 'make', 'way', 'day', 'water', 'time', 'years', 'people', 'food', 'new', 'just', 'like', 'says']
The top 15 words for the topic # 3
['don', 'food', 'work', 'day', 'life', 'time', 'family', 'children', 'years', 'just', 'women', 'world', 'like', 'people', 'says']
The top 15 words for the topic # 4
['supreme', 'order', 'city', 'states', 'federal', 'country', 'president', 'rights', 'government', 'people', 'law', 'state', 'said', 'court', 'says']
The top 15 words for the topic # 5
['going', 've', 'story', 'life', 'don', 'new', 'way', 'time', 'really', 'know', 'think', 

##Attach the topic numbers to original articles

In [33]:
article_vector

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [0]:
topic_results = LDA.transform(article_vector)

In [36]:
topic_results.shape

(11992, 10)

##Probability of each topic for a document

In [39]:
topic_results[0].round(2)

array([0.01, 0.91, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  ])

In [38]:
import numpy as np
np.argmax(topic_results[0])

1
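
Each row of the document-topic matrix is a distribution over topics, so `argmax` picks the dominant topic and the corresponding maximum shows how strongly it dominates. A minimal sketch on a hypothetical 3 × 4 matrix (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 4 topics.
toy_topic_results = np.array([
    [0.01, 0.91, 0.00, 0.08],
    [0.70, 0.10, 0.10, 0.10],
    [0.30, 0.25, 0.35, 0.10],
])

# Each row is a probability distribution over topics.
row_sums = toy_topic_results.sum(axis=1)

dominant_topic = toy_topic_results.argmax(axis=1)  # index of the largest value per row
dominant_weight = toy_topic_results.max(axis=1)    # weight of that dominant topic

print(dominant_topic.tolist())   # [1, 0, 2]
print(dominant_weight.tolist())  # [0.91, 0.7, 0.35]
```

The third row shows why the weight is worth checking: its dominant topic holds only 0.35 of the mass, so a single hard label hides a fairly even mixture.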

In [40]:
data['Article'].iloc[0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

##Generate the topic label

In [0]:
data['Topic'] = topic_results.argmax(axis=1)

In [42]:
data.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",6


In [43]:
data.tail()

Unnamed: 0,Article,Topic
11987,The number of law enforcement officers shot an...,7
11988,"Trump is busy these days with victory tours,...",1
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,8
11991,Voters in the English city of Sunderland did s...,4
