---
Exercises: Topic Modeling with LDA
----

![](http://www.thewrap.com/wp-content/uploads/2015/12/New-York-Times-paper.jpg)

Today you will apply latent Dirichlet allocation (LDA) to a corpus of NYT articles to discover latent topics.
_Yes, the same NYT articles_ as previously.

Load the data
----

Same as last lab

In [49]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [2]:
df = pd.read_pickle("../../corpora/nyt_articles.pkl")

Vectorize data
-----

Let's use tf-idf 

[Whether that is theoretically correct is debated, but it works better in practice.](https://groups.google.com/forum/#!topic/gensim/OESG1jcaXaQ)

In [58]:
tfidf = TfidfVectorizer(max_features=2000,max_df=.95,min_df=2,stop_words='english')
tfidf_matrix = tfidf.fit_transform(df.content)

In [59]:
tfidf_matrix

<1405x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 181848 stored elements in Compressed Sparse Row format>
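
LDA is defined as a generative model over word counts, which is why feeding it tf-idf weights is debated. The following is a minimal sketch, on a made-up three-document corpus (`toy_docs` is hypothetical, not the NYT data), of how the two vectorizers differ: `CountVectorizer` yields integer counts, while `TfidfVectorizer` yields real-valued weights whose rows are L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the team won the game",
    "the senate passed the budget",
    "the team lost the budget game",
]

count_matrix = CountVectorizer(stop_words='english').fit_transform(toy_docs)
toy_tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Raw counts are integers; tf-idf weights are floats
print(count_matrix.dtype, toy_tfidf_matrix.dtype)

# With the default norm='l2', every tf-idf row has unit length
row_norms = np.sqrt(toy_tfidf_matrix.multiply(toy_tfidf_matrix).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the cell above, with the same arguments, would be one way to try the count-based variant on the real corpus.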

---
Scikit-learn's LDA
------

Use [Scikit-learn's LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to find topics.

In [60]:
from sklearn.decomposition import LatentDirichletAllocation

In [82]:
# Note: scikit-learn 0.19 and later call this argument `n_components` instead of `n_topics`
lda = LatentDirichletAllocation(n_topics=15,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=42)
lda.fit(tfidf_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=15, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)
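
Once fitted, the model can also describe each document as a mixture of topics. Below is a minimal sketch on made-up count data (`toy_counts` and `toy_lda` are hypothetical names); it assumes scikit-learn 0.19 or later, where the `n_topics` argument used above is spelled `n_components`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 6 documents, 8 vocabulary terms
rng = np.random.RandomState(42)
toy_counts = rng.randint(0, 5, size=(6, 8))

toy_lda = LatentDirichletAllocation(n_components=3,
                                    learning_method='online',
                                    random_state=42)

# Each row of doc_topic holds one document's topic proportions
doc_topic = toy_lda.fit_transform(toy_counts)

print(doc_topic.shape)           # (n_documents, n_topics)
print(doc_topic.sum(axis=1))     # each row sums to approximately 1
print(doc_topic.argmax(axis=1))  # most probable topic per document
```

On the real data, `lda.transform(tfidf_matrix)` would play the role of `doc_topic`.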

Write a function to print topics and words:

It should look like this:  
`Topic #0:
said mr year game new season team like time government state people ms company percent republican work million city party`

<br>
<details><summary>
Click here for a hint…
</summary>
lda.components_   
tf_feature_names = tfidf.get_feature_names()
</details>

<br>
<details><summary>
Click here for the solution…
</summary>
```
def print_top_words(model, feature_names, n_top_words=20):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
print("Topics in LDA model:")
tf_feature_names = tfidf.get_feature_names()
print_top_words(lda, tf_feature_names)
```
</details>
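
The slice `[:-n_top_words - 1:-1]` in the solution walks the sorted indices backwards from the end, so it returns the positions of the largest weights in descending order. A small sketch with made-up weights and vocabulary (`toy_topic` and `toy_names` are hypothetical):

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary
toy_topic = np.array([0.1, 2.5, 0.3, 1.7, 0.9])
toy_names = ['art', 'game', 'bank', 'team', 'vote']
n_top_words = 3

# argsort gives indices from smallest to largest weight;
# the negative-step slice takes the last n_top_words of them in reverse
top_idx = toy_topic.argsort()[:-n_top_words - 1:-1]
top_words = [toy_names[i] for i in top_idx]

# Equivalent, arguably more readable form
same_idx = toy_topic.argsort()[::-1][:n_top_words]

print(top_words)  # ['game', 'team', 'vote']
```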

In [84]:
# each row is a topic, each column is a word
lda.components_

array([[ 0.14516064,  0.13677774,  0.14248406, ...,  0.12968623,
         0.13319879,  0.14408009],
       [ 0.11829146,  0.13801389,  0.13182979, ...,  0.13226887,
         0.11760735,  0.14014631],
       [ 1.16823389,  3.98714025,  0.85274236, ...,  0.23356189,
         3.10296438,  0.78222631],
       ..., 
       [ 0.13332768,  0.13273025,  0.12754854, ...,  0.12070242,
         0.12528881,  0.11884131],
       [ 0.13558369,  0.13571441,  0.1286715 , ...,  0.14539968,
         0.15015816,  0.12436963],
       [ 0.17839324,  0.14199029,  0.13585885, ...,  0.13436469,
         0.12590932,  0.13920161]])
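
The entries of `components_` are unnormalized pseudo-counts rather than probabilities. Here is a sketch, using a made-up array in place of `lda.components_` (`toy_components` is hypothetical), of how dividing each row by its sum gives a word distribution per topic:

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics, 4 words
toy_components = np.array([[0.5, 1.5, 2.0, 1.0],
                           [3.0, 0.2, 0.3, 0.5]])

# Divide each row by its total so that every topic sums to 1
topic_word = toy_components / toy_components.sum(axis=1)[:, np.newaxis]

print(topic_word)
print(topic_word.sum(axis=1))
```

Normalizing does not change the ranking of words within a topic, so the top-word lists are the same either way.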

In [85]:
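def lda_topic(tfidf, lda, number_of_words):
    """Print the top `number_of_words` terms for each topic of a fitted LDA model."""
    # Look up the vocabulary once instead of once per term
    feature_names = tfidf.get_feature_names()
    for count, topic in enumerate(lda.components_):
        # Indices of the largest weights, in descending order
        top_indices = np.argsort(topic)[::-1][:number_of_words]
        term_ranking = [feature_names[i] for i in top_indices]
        print('Topic {} :'.format(count+1), term_ranking)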

In [90]:
lda_2 = LatentDirichletAllocation(n_topics=10,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=42)
lda_2.fit(tfidf_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [95]:
lda_3 = LatentDirichletAllocation(n_topics=5,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=42)
lda_3.fit(tfidf_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [99]:
lda_topic(tfidf,lda_3,20)
    

Topic 1 : ['said', 'mr', 'government', 'state', 'year', 'people', 'new', 'republican', 'united', 'percent', 'president', 'party', 'country', 'company', 'official', 'house', 'american', 'iran', 'law', 'health']
Topic 2 : ['oil', 'cup', 'said', 'bank', 'russian', 'film', 'judge', 'company', 'gas', 'team', 'entertainment', 'million', 'nation', 'report', 'world', 'financial', 'tuesday', 'month', 'victim', 'determined']
Topic 3 : ['game', 'team', 'season', 'said', 'player', 'league', 'yankee', 'yard', 'cup', 'play', 'coach', 'win', 'rivera', 'second', 'year', 'run', 'inning', 'time', 'hit', 'race']
Topic 4 : ['said', 'game', 'pettitte', 'rivera', 'girardi', 'teacher', 'team', 'season', 'school', 'brand', 'yankee', 'year', 'giant', 'sept', 'coughlin', 'series', 'national', 'street', 'government', 'like']
Topic 5 : ['mr', 'music', 'art', 'ms', 'song', 'dance', 'work', 'new', 'opera', 'like', 'night', 'film', 'band', 'museum', 'ballet', 'artist', 'orchestra', 'performance', 'album', 'series']


In [94]:
lda_topic(tfidf,lda_2,10)

Topic 1 : ['dawn', 'party', 'golden', 'judge', 'mr', 'ms', 'government', 'said', 'art', 'court']
Topic 2 : ['pakistan', 'said', 'drug', 'state', 'attack', 'exchange', 'mr', 'federal', 'people', 'rodriguez']
Topic 3 : ['sept', 'housing', 'student', 'israel', 'israeli', 'editor', 'iran', 'income', 'iranian', '2013']
Topic 4 : ['republican', 'house', 'senate', 'cruz', 'boehner', 'health', 'reid', 'shutdown', 'vote', 'senator']
Topic 5 : ['mr', 'said', 'government', 'obama', 'republican', 'korea', 'song', 'tax', 'debt', 'president']
Topic 6 : ['gun', 'jackson', 'republican', 'energy', 'draft', 'death', 'mr', 'way', 'room', 'palestinian']
Topic 7 : ['percent', 'bank', 'said', 'brand', 'european', 'sugar', 'rate', 'market', 'party', 'google']
Topic 8 : ['iran', 'mr', 'rouhani', 'said', 'united', 'attack', 'syria', 'nuclear', 'weapon', 'iranian']
Topic 9 : ['said', 'mr', 'year', 'game', 'new', 'season', 'team', 'like', 'time', 'government']
Topic 10 : ['vatican', 'said', 'radio', 'game', 'injured', 'championship', 'italy', 'viewer', 'sunday', 'hockey']

In [98]:
lda_topic(tfidf,lda,15)

Topic 1 : ['iran', 'mr', 'rouhani', 'iranian', 'obama', 'nuclear', 'war', 'rebel', 'israel', 'said', 'president', 'military', 'government', 'netanyahu', 'radio']
Topic 2 : ['golden', 'energy', 'murder', 'party', 'government', 'dawn', 'jackson', 'mr', 'islamist', 'qaeda', 'minister', 'member', 'al', 'nation', 'said']
Topic 3 : ['game', 'season', 'team', 'yankee', 'said', 'player', 'league', 'yard', 'play', 'rivera', 'win', 'coach', 'inning', 'cup', 'second']
Topic 4 : ['government', 'religious', 'drug', 'said', 'default', 'french', 'sunday', 'republican', 'ranger', 'shutdown', 'company', 'wing', 'album', 'social', 'report']
Topic 5 : ['mr', 'said', 'year', 'new', 'government', 'state', 'people', 'like', 'ms', 'republican', 'united', 'company', 'percent', 'american', 'president']
Topic 6 : ['boat', 'republican', 'said', 'president', 'people', 'new', 'girardi', 'gun', 'teacher', 'continuing', 'killed', 'mr', 'green', 'government', 'attack']
Topic 7 : ['jet', 'said', 'game', 'ms', 'johnson

In [74]:
np.shape(lda.components_)

(10, 2000)

In [75]:
# NMF

model_3 = NMF(init="nndsvd",
              n_components=10,
              max_iter=200)

W_3 = model_3.fit_transform(tfidf_matrix)  # document-topic weights
H_3 = model_3.components_                  # topic-term weights

# Build an array that maps each column index to its term
terms = [""] * len(tfidf.vocabulary_)
for term, index in tfidf.vocabulary_.items():
    terms[index] = term
terms = np.array(terms)

for c, topic in enumerate(H_3):
    # Ten highest-weighted terms for this topic, in descending order
    current_terms = terms[np.argsort(topic)[::-1][:10]].tolist()
    print('Topic: ', c+1, ' Top 10 terms :', current_terms)

Topic:  1  Top 10 terms : ['mr', 'said', 'court', 'judge', 'case', 'state', 'justice', 'prison', 'official', 'lawyer']
Topic:  2  Top 10 terms : ['game', 'season', 'team', 'yard', 'said', 'league', 'player', 'touchdown', 'coach', 'play']
Topic:  3  Top 10 terms : ['iran', 'rouhani', 'nuclear', 'iranian', 'mr', 'obama', 'israel', 'united', 'netanyahu', 'president']
Topic:  4  Top 10 terms : ['republican', 'house', 'health', 'government', 'care', 'senate', 'shutdown', 'obama', 'law', 'democrat']
Topic:  5  Top 10 terms : ['attack', 'said', 'syria', 'killed', 'official', 'security', 'government', 'chemical', 'people', 'mall']
Topic:  6  Top 10 terms : ['percent', 'company', 'said', 'market', 'year', 'million', 'bank', 'china', 'rate', 'price']
Topic:  7  Top 10 terms : ['yankee', 'rivera', 'pettitte', 'inning', 'game', 'season', 'girardi', 'jeter', 'baseball', 'rodriguez']
Topic:  8  Top 10 terms : ['ms', 'music', 'art', 'work', 'new', 'like', 'dance', 'mr', 'song', 'opera']
Topic:  9  To
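
NMF factorizes the document-term matrix into a document-topic matrix `W` and a topic-term matrix `H`, both non-negative, so that `W @ H` approximates the input. A small sketch on random non-negative data (`toy_X` and `toy_nmf` are hypothetical, not the NYT matrix):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix: 6 documents, 5 terms
rng = np.random.RandomState(0)
toy_X = rng.rand(6, 5)

toy_nmf = NMF(init="nndsvd", n_components=2, max_iter=500)
W = toy_nmf.fit_transform(toy_X)  # document-topic weights, shape (6, 2)
H = toy_nmf.components_           # topic-term weights, shape (2, 5)

# Frobenius norm of the reconstruction error
error = np.linalg.norm(toy_X - W @ H)
print(W.shape, H.shape, error)
```

Unlike LDA, the rows of `W` and `H` are not probability distributions; they are simply non-negative weights.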

In [66]:
# NMF is a better match for the topics; the words chosen seem to make more sense.
# LDA includes words like 'mr' and 'ms' that add no context to the given topic.

In [93]:
df.section_name.unique()

array(['Sports', 'U.S.', 'Business Day', 'World', 'Opinion', 'Arts',
       'Travel', 'Magazine', 'Real Estate', 'Books'], dtype=object)

Experiment with the number of topics. What patterns emerge?

What is the best number of topics?

In [100]:
# Five seems to be the best number of topics for LDA, judging by how well the top words within each topic fit together
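
Eyeballing the top words is one way to choose. A complementary, and still imperfect, heuristic is to compare perplexity across several topic counts. The sketch below runs on made-up counts (`toy_counts` is hypothetical); it uses raw counts because perplexity is a likelihood-based measure, and it assumes the newer `n_components` spelling of `n_topics`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 30 documents, 12 vocabulary terms
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 4, size=(30, 12))

perplexities = {}
for k in (2, 3, 5):
    model = LatentDirichletAllocation(n_components=k,
                                      learning_method='online',
                                      max_iter=5,
                                      random_state=42)
    model.fit(toy_counts)
    # Lower perplexity means the model assigns higher likelihood to the data
    perplexities[k] = model.perplexity(toy_counts)

for k, value in sorted(perplexities.items()):
    print(k, round(value, 2))
```

For a fairer comparison, perplexity would usually be computed on held-out documents rather than on the training data, and it does not always agree with human judgments of topic quality.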

How do the LDA topics compare to the NMF topics?

### LDA with 10 topics
- Topic 1 : ['dawn', 'party', 'golden', 'judge', 'mr', 'ms', 'government', 'said', 'art', 'court']
- Topic 2 : ['pakistan', 'said', 'drug', 'state', 'attack', 'exchange', 'mr', 'federal', 'people', 'rodriguez']
- Topic 3 : ['sept', 'housing', 'student', 'israel', 'israeli', 'editor', 'iran', 'income', 'iranian', '2013']
- Topic 4 : ['republican', 'house', 'senate', 'cruz', 'boehner', 'health', 'reid', 'shutdown', 'vote', 'senator']
- Topic 5 : ['mr', 'said', 'government', 'obama', 'republican', 'korea', 'song', 'tax', 'debt', 'president']
- Topic 6 : ['gun', 'jackson', 'republican', 'energy', 'draft', 'death', 'mr', 'way', 'room', 'palestinian']
- Topic 7 : ['percent', 'bank', 'said', 'brand', 'european', 'sugar', 'rate', 'market', 'party', 'google']
- Topic 8 : ['iran', 'mr', 'rouhani', 'said', 'united', 'attack', 'syria', 'nuclear', 'weapon', 'iranian']
- Topic 9 : ['said', 'mr', 'year', 'game', 'new', 'season', 'team', 'like', 'time', 'government']
- Topic 10 : ['vatican', 'said', 'radio', 'game', 'injured', 'championship', 'italy', 'viewer', 'sunday', 'hockey']

The LDA topics appear to be less intuitive to understand than the NMF topics.

- 1: Government party
- 2: Iran deal
- 3: Student protests in Iran
- 4: Obama presidency
- 5: Republicans and the government
- 6: Gun deaths in Palestine
- 7: Bank recovery in Europe
- 8: Iran deal
- 9: Sports season
- 10: Pope Francis

### LDA with 5 topics


- Topic 1 : ['said', 'mr', 'government', 'state', 'year', 'people', 'new', 'republican', 'united', 'percent', 'president', 'party', 'country', 'company', 'official', 'house', 'american', 'iran', 'law', 'health']
- Topic 2 : ['oil', 'cup', 'said', 'bank', 'russian', 'film', 'judge', 'company', 'gas', 'team', 'entertainment', 'million', 'nation', 'report', 'world', 'financial', 'tuesday', 'month', 'victim', 'determined']
- Topic 3 : ['game', 'team', 'season', 'said', 'player', 'league', 'yankee', 'yard', 'cup', 'play', 'coach', 'win', 'rivera', 'second', 'year', 'run', 'inning', 'time', 'hit', 'race']
- Topic 4 : ['said', 'game', 'pettitte', 'rivera', 'girardi', 'teacher', 'team', 'season', 'school', 'brand', 'yankee', 'year', 'giant', 'sept', 'coughlin', 'series', 'national', 'street', 'government', 'like']
- Topic 5 : ['mr', 'music', 'art', 'ms', 'song', 'dance', 'work', 'new', 'opera', 'like', 'night', 'film', 'band', 'museum', 'ballet', 'artist', 'orchestra', 'performance', 'album', 'series']


- 1: Government
- 2: International
- 3: Sports
- 4: Sports
- 5: Arts/Culture

How do the LDA topics compare to the NYT section labels?

In [101]:
# LDA recovers fewer distinct topics than the NYT has sections (10 section names)
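
One way to make this comparison concrete is to assign each article its most probable topic and cross-tabulate that against the section label. The sketch below uses made-up values standing in for `lda_3.transform(tfidf_matrix)` and `df.section_name` (`toy_doc_topic` and `toy_sections` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 6 articles, 3 topics
toy_doc_topic = np.array([[0.8, 0.1, 0.1],
                          [0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.2, 0.7, 0.1],
                          [0.1, 0.1, 0.8],
                          [0.6, 0.3, 0.1]])
toy_sections = pd.Series(['Sports', 'Sports', 'World', 'World', 'Arts', 'World'],
                         name='section_name')

# Most probable topic for each article
dominant_topic = pd.Series(toy_doc_topic.argmax(axis=1), name='topic')

# Rows are sections, columns are topics, cells are article counts
table = pd.crosstab(toy_sections, dominant_topic)
print(table)
```

A section whose articles concentrate in a single column corresponds cleanly to one topic; a section spread across columns, or a column shared by several sections, suggests the topics and sections cut the corpus differently.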

---
Challenge Exercises
----

1) Try the same analysis with the `lda` package.

[RTFM](http://pythonhosted.org/lda/)

In [5]:
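try:
    import lda
except ImportError:
    # pip.main() is no longer available as of pip 10; call pip through the current interpreter instead
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'lda'])
    import lda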

Collecting lda
  Downloading lda-1.0.4-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (373kB)
Installing collected packages: lda
Successfully installed lda-1.0.4


2) Try with [gensim](https://radimrehurek.com/gensim/tut2.html)

3) Try the same analysis with [GraphLab's API](https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html)

Check out this notebook for a great [lda visualization](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=2&lambda=0.41&term=).

__NOTE:__ GraphLab only supports Python 2.7

In [None]:
%%bash
# Install graphlab
sudo pip2 install graphlab-create
# Make sure the config directory exists before writing the product key
mkdir -p ~/.graphlab
echo '[Product]
product_key=D868-7DBE-AC8A-0343-45F3-E250-34B4-24CA' > ~/.graphlab/config

In [None]:
import graphlab as gl

<br>
<br>
<br>

---