# LDA 

Reuters data: a collection of 395 documents from the Reuters newswire, shipped with the `lda` package as a document-term count matrix over a 4,258-word vocabulary. The titles shown below are dated 1996.

Source: 
- https://pythonhosted.org/lda/
- https://pypi.python.org/pypi/lda


## Getting started

In [34]:
# install lda package 
# ! pip install lda

In [20]:
# import packages 
import numpy as np
import lda
from pprint import pprint
import pandas as pd

In [2]:
X = lda.datasets.load_reuters()
print(X.shape)
print(type(X))
print(X.sum())

(395, 4258)
<class 'numpy.ndarray'>
84010
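
`X` is a document-term count matrix: each row is a document, each column is a vocabulary word, and each entry counts how often that word occurs in that document, so `X.sum()` is the total number of word tokens in the corpus. As a minimal sketch of that layout, the block below builds such a matrix for a made-up three-document corpus. The documents and the names `toy_docs`, `toy_vocab`, `word_index` and `X_toy` are illustrative assumptions, not part of the Reuters data or the `lda` API.

```python
import numpy as np

# Made-up corpus, purely for illustration
toy_docs = ["pope visits church", "church bells ring", "pope pope speaks"]

# Vocabulary: every distinct word, in a fixed (sorted) order
toy_vocab = tuple(sorted({word for doc in toy_docs for word in doc.split()}))
word_index = {word: col for col, word in enumerate(toy_vocab)}

# One row per document, one column per vocabulary word
X_toy = np.zeros((len(toy_docs), len(toy_vocab)), dtype=np.int64)
for row, doc in enumerate(toy_docs):
    for word in doc.split():
        X_toy[row, word_index[word]] += 1

print(X_toy.shape)  # (documents, vocabulary size)
print(X_toy.sum())  # total number of word tokens
```

Read the same way, the shape `(395, 4258)` above means 395 documents over 4,258 vocabulary words, and the sum `84010` is the total token count.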


In [16]:
vocab = lda.datasets.load_reuters_vocab()
print(vocab[:10])
print(len(vocab))
print(type(vocab))

('church', 'pope', 'years', 'people', 'mother', 'last', 'told', 'first', 'world', 'year')
4258
<class 'tuple'>


In [28]:
titles = lda.datasets.load_reuters_titles()
pprint(titles[:2])
print(len(titles))
print(type(titles))

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany '
 '1996-08-21')
395
<class 'tuple'>


## Modeling

In [12]:
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available

INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
INFO:lda:<50> log likelihood: -681322
INFO:lda:<60> log likelihood: -678979
INFO:lda:<70> log likelihood: -676598
INFO:lda:<80> log likelihood: -675383
INFO:lda:<90> log likelihood: -673316
INFO:lda:<100> log likelihood: -672761
INFO:lda:<110> log likelihood: -671320
INFO:lda:<120> log likelihood: -669744
INFO:lda:<130> log likelihood: -669292
INFO:lda:<140> log likelihood: -667940
INFO:lda:<150> log likelihood: -668038
INFO:lda:<160> log likelihood: -667429
INFO:lda:<170> log likelihood: -666475
INFO:lda:<180> log likelihood: -665562
INFO:lda:<190> log likelihood: -664920
INFO:lda:<200> log likelihood: -664979
INFO:lda:<210> log likelihood: -664722
INFO:lda:<220> log likelihood: -

<lda.lda.LDA at 0x10ee0fbe0>
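
The Reference section describes `lda.LDA` as latent Dirichlet allocation fitted by collapsed Gibbs sampling. To give a feel for what such a sampler does, here is a deliberately small sketch. It is not the package's implementation: the function name `toy_lda_gibbs` and the toy matrix are hypothetical, the count arrays `ndz` and `nzw` are only named after the `ndz_` and `nzw_` attributes listed in the help text, and it is far too slow for real corpora.

```python
import numpy as np

def toy_lda_gibbs(X, n_topics, n_iter=30, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a document-term count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape

    # Expand the count matrix into one (document, word) pair per token
    doc_ids, word_ids = [], []
    for d, w in zip(*np.nonzero(X)):
        doc_ids.extend([d] * int(X[d, w]))
        word_ids.extend([w] * int(X[d, w]))

    # Random initial topic for every token, plus the matching count tables
    z = rng.integers(n_topics, size=len(word_ids))
    ndz = np.zeros((n_docs, n_topics))   # tokens per (document, topic)
    nzw = np.zeros((n_topics, n_words))  # tokens per (topic, word)
    nz = np.zeros(n_topics)              # tokens per topic
    for d, w, k in zip(doc_ids, word_ids, z):
        ndz[d, k] += 1
        nzw[k, w] += 1
        nz[k] += 1

    for _ in range(n_iter):
        for t, (d, w) in enumerate(zip(doc_ids, word_ids)):
            # Remove the token's current assignment from the counts
            k = z[t]
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1
            # Conditional probability of each topic given all other tokens
            p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + n_words * eta)
            k = rng.choice(n_topics, p=p / p.sum())
            # Record the newly sampled assignment
            z[t] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1

    # Smoothed, normalized counts as point estimates
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    return topic_word, doc_topic

# Toy data: two documents use only words 0-1, two use only words 2-3
X_demo = np.array([[4, 3, 0, 0],
                   [3, 4, 0, 0],
                   [0, 0, 4, 3],
                   [0, 0, 3, 4]])
topic_word_demo, doc_topic_demo = toy_lda_gibbs(X_demo, n_topics=2)
print(topic_word_demo.shape, doc_topic_demo.shape)
```

Each row of both returned arrays is a probability distribution, which is the same shape of result that `model.topic_word_` and `model.doc_topic_` expose below.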

In [41]:
topic_word = model.topic_word_  # model.components_ also works
print(topic_word[0])

[  3.62505347e-06   3.62505347e-06   3.62505347e-06 ...,   3.62505347e-06
   3.62505347e-06   3.62505347e-06]


In [32]:
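n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    # sort words by ascending probability, then read back from the end;
    # [:-n_top_words:-1] stops one short, so 7 words print per topic
    # (use [:-(n_top_words + 1):-1] for exactly n_top_words)
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))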

Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic be
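
Each topic line above lists seven words although `n_top_words = 8`. That follows from how the reversed slice `[:-n_top_words:-1]` treats its stop index: the stop position itself is excluded. A small sketch on made-up data (the names `toy_vocab` and `toy_dist` are illustrative) shows the difference:

```python
import numpy as np

toy_vocab = np.array(["a", "b", "c", "d", "e"])
toy_dist = np.array([0.10, 0.40, 0.05, 0.30, 0.15])  # made-up word probabilities

order = np.argsort(toy_dist)  # indices from least to most probable
n = 3

# Reversed slice that stops *before* position -n: only n - 1 items
short = toy_vocab[order][:-n:-1]
# Moving the stop one position further yields exactly n items
full = toy_vocab[order][:-(n + 1):-1]

print(list(short))  # ['b', 'd']
print(list(full))   # ['b', 'd', 'e']
```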

## Topic Analysis

In [10]:
doc_topic = model.doc_topic_
print(doc_topic[0])
print(doc_topic.shape)
# each row is one document; its 20 entries are that document's
# probabilities for each topic

[  4.34782609e-04   3.52173913e-02   4.34782609e-04   9.13043478e-03
   4.78260870e-03   4.34782609e-04   9.13043478e-03   3.08695652e-02
   5.04782609e-01   4.78260870e-03   4.34782609e-04   4.34782609e-04
   3.08695652e-02   2.17826087e-01   4.34782609e-04   4.34782609e-04
   4.34782609e-04   3.95652174e-02   4.34782609e-04   1.09130435e-01]
(395, 20)
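
The values printed for `doc_topic[0]` are consistent with smoothed, normalized topic counts of the form (n_dk + alpha) / (N_d + K * alpha), where n_dk is the number of words in document d assigned to topic k, N_d is the document length and K is the number of topics. That formula is an assumption inferred from the printed numbers and the default `alpha = 0.1` shown in the Reference section; the notebook itself does not state it. The sketch below only checks the arithmetic.

```python
alpha = 0.1     # default alpha listed by help(lda.LDA)
n_topics = 20   # n_topics used when fitting the model

smallest = 4.34782609e-04  # smallest entry printed for doc_topic[0]
largest = 5.04782609e-01   # entry printed for topic 8

# If a topic with no assigned words gets alpha / (N_d + K * alpha),
# the implied length of document 0 is:
doc_len = alpha / smallest - n_topics * alpha
print(round(doc_len))  # 228

# Under the same assumption, the implied number of words assigned to topic 8 is:
topic8_words = largest * (doc_len + n_topics * alpha) - alpha
print(round(topic8_words))  # 116
```

Both implied counts come out as whole numbers to within rounding of the printed values, which is what one would expect if the assumption holds.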


In [11]:
# shows that for document 0, topic 8 has the highest probability
doc_topic[0].argmax()

8

In [18]:
# get the dominant (highest-probability) topic for each document
doc_topics = doc_topic.argmax(axis=1).tolist()

print(doc_topics[:10])

[8, 13, 14, 8, 14, 14, 14, 14, 14, 8]


In [35]:
# topic 13 is the dominant topic of the most documents (56 of 395)
pd.Series(doc_topics).value_counts()

13    56
8     36
4     31
1     29
3     23
5     22
18    21
6     20
17    18
10    18
16    17
11    17
14    16
0     15
9     13
19    11
7      9
12     9
2      8
15     6
dtype: int64

In [45]:
# get the 10 most frequent dominant topics, most frequent first
top_topics = list(pd.Series(doc_topics).value_counts()[:10].index)
top_topics

[13, 8, 4, 1, 3, 5, 18, 6, 17, 10]

In [58]:
n_top_words = 10
topic_n_words = {}  # topic index -> space-joined top words

for i, topic_dist in enumerate(topic_word):
    # [:-n_top_words:-1] keeps n_top_words - 1 words, so 9 words per topic here
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    topic_n_words[i] = ' '.join(topic_words)

topic_n_words[0]

'british churchill sale million major letters west britain john'

In [57]:
topic_words

array(['city', 'museum', 'art', 'exhibition', 'century', 'million',
       'churches', 'set', 'used'], 
      dtype='<U18')

In [59]:
# print the 10 most frequent topics with their top words
print('10 Most Frequent Topics')
for i in top_topics: 
    print('Topic %s: %s' % (i, str(topic_n_words[i])))

10 Most Frequent Topics
Topic 13: years year time last church world people say during
Topic 8: charles prince diana royal king queen parker bowles camilla
Topic 4: pope vatican paul john surgery hospital pontiff rome mass
Topic 1: church government political country state people party against first
Topic 3: yeltsin russian russia president kremlin moscow michael operation orthodox
Topic 5: family funeral police miami versace cunanan city service home
Topic 18: harriman clinton u.s ambassador paris president churchill france american
Topic 6: simpson former years court president wife south church york
Topic 17: church catholic bernardin cardinal bishop wright death cancer chicago
Topic 10: germany german war nazi letter christian book jews scientology


In [60]:
# display the first 10 titles with each document's dominant topic
for i in range(10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))

0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
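
A listing like the one above can also be turned around, collecting titles under their dominant topic. The sketch below does this for the first four title and topic pairs printed above; the names `sample_titles`, `sample_top_topics` and `titles_by_topic` are illustrative.

```python
from collections import defaultdict

# First four titles and dominant topics, copied from the output above
sample_titles = [
    "0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20",
    "1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21",
    "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
    "3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25",
]
sample_top_topics = [8, 13, 14, 8]

# Map each dominant topic to the list of titles it dominates
titles_by_topic = defaultdict(list)
for title, topic in zip(sample_titles, sample_top_topics):
    titles_by_topic[topic].append(title)

for topic in sorted(titles_by_topic):
    print("Topic {}: {} title(s)".format(topic, len(titles_by_topic[topic])))
```

With the fitted model, the same loop should work on `titles` paired with `doc_topic.argmax(axis=1)`.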


### Putting it together

In [62]:
topic_word = model.topic_word_
doc_topic = model.doc_topic_

# get the dominant topic for each document
doc_topics = doc_topic.argmax(axis=1).tolist()

# rank topics by how many documents they dominate and keep the 10 most frequent
# (this needs doc_topics, so it comes after the step above)
top_topics = list(pd.Series(doc_topics).value_counts()[:10].index)

n_top_words = 10
topic_n_words = {}

for i, topic_dist in enumerate(topic_word):
    # [:-n_top_words:-1] keeps n_top_words - 1 words, so 9 words per topic
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    topic_n_words[i] = ' '.join(topic_words)

# print the 10 most frequent topics with their top words
print('10 Most Frequent Topics')
for i in top_topics:
    print('Topic %s: %s' % (i, str(topic_n_words[i])))

10 Most Frequent Topics
Topic 13: years year time last church world people say during
Topic 8: charles prince diana royal king queen parker bowles camilla
Topic 4: pope vatican paul john surgery hospital pontiff rome mass
Topic 1: church government political country state people party against first
Topic 3: yeltsin russian russia president kremlin moscow michael operation orthodox
Topic 5: family funeral police miami versace cunanan city service home
Topic 18: harriman clinton u.s ambassador paris president churchill france american
Topic 6: simpson former years court president wife south church york
Topic 17: church catholic bernardin cardinal bishop wright death cancer chicago
Topic 10: germany german war nazi letter christian book jews scientology
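
The steps in this section can also be wrapped in one reusable function. The sketch below is one possible shape for it, using NumPy only; `summarize_topics` and the toy arrays are hypothetical names and data, and ties between equally frequent topics may be ordered differently than by `value_counts`.

```python
import numpy as np

def summarize_topics(topic_word, doc_topic, vocab, n_words=3, n_topics=2):
    """Return (topic id, number of documents, top words) for the most frequent dominant topics."""
    vocab = np.asarray(vocab)
    dominant = doc_topic.argmax(axis=1)  # dominant topic per document
    counts = np.bincount(dominant, minlength=topic_word.shape[0])
    ranked = np.argsort(-counts, kind="stable")[:n_topics]  # most frequent first
    summary = []
    for k in ranked:
        # indices of the n_words most probable words, most probable first
        top = np.argsort(topic_word[k])[::-1][:n_words]
        summary.append((int(k), int(counts[k]), " ".join(vocab[top])))
    return summary

# Made-up inputs with the same layout as topic_word_ and doc_topic_
toy_vocab = ("pope", "church", "film", "music")
toy_topic_word = np.array([[0.60, 0.30, 0.05, 0.05],
                           [0.05, 0.05, 0.50, 0.40]])
toy_doc_topic = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.7, 0.3]])

toy_summary = summarize_topics(toy_topic_word, toy_doc_topic, toy_vocab, n_words=2)
for topic_id, n_docs, words in toy_summary:
    print("Topic {} ({} docs): {}".format(topic_id, n_docs, words))
```

Unlike the cells above, this version takes exactly `n_words` words per topic, so passing `n_words=10` would list ten words rather than nine.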


## All Together


In [1]:
# import packages 
import numpy as np
import lda
from pprint import pprint

X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available

topic_word = model.topic_word_  # model.components_ also works
n_top_words = 8

for i, topic_dist in enumerate(topic_word):
    # [:-n_top_words:-1] keeps n_top_words - 1 words, so 7 words print per topic
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
INFO:lda:<50> log likelihood: -681322
INFO:lda:<60> log likelihood: -678979
INFO:lda:<70> log likelihood: -676598
INFO:lda:<80> log likelihood: -675383
INFO:lda:<90> log likelihood: -673316
INFO:lda:<100> log likelihood: -672761
INFO:lda:<110> log likelihood: -671320
INFO:lda:<120> log likelihood: -669744
INFO:lda:<130> log likelihood: -669292
INFO:lda:<140> log likelihood: -667940
INFO:lda:<150> log likelihood: -668038
INFO:lda:<160> log likelihood: -667429
INFO:lda:<170> log likelihood: -666475
INFO:lda:<180> log likelihood: -665562
INFO:lda:<190> log likelihood: -664920
INFO:lda:<200> log likelihood: -664979
INFO:lda:<210> log likelihood: -664722
INFO:lda:<220> log likelihood: -

Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic be

## Reference

In [7]:
help(lda.LDA)

Help on class LDA in module lda.lda:

class LDA(builtins.object)
 |  Latent Dirichlet allocation using collapsed Gibbs sampling
 |  
 |  Parameters
 |  ----------
 |  n_topics : int
 |      Number of topics
 |  
 |  n_iter : int, default 2000
 |      Number of sampling iterations
 |  
 |  alpha : float, default 0.1
 |      Dirichlet parameter for distribution over topics
 |  
 |  eta : float, default 0.01
 |      Dirichlet parameter for distribution over words
 |  
 |  random_state : int or RandomState, optional
 |      The generator used for the initial topics.
 |  
 |  Attributes
 |  ----------
 |  `components_` : array, shape = [n_topics, n_features]
 |      Point estimate of the topic-word distributions (Phi in literature)
 |  `topic_word_` :
 |      Alias for `components_`
 |  `nzw_` : array, shape = [n_topics, n_features]
 |      Matrix of counts recording topic-word assignments in final iteration.
 |  `ndz_` : array, shape = [n_samples, n_topics]
 |      Matrix of counts recordi