----
Exercises: Topic Modeling with Non-Negative Matrix Factorization (NMF)
----

Today we will apply the Non-Negative Matrix Factorization (NMF) algorithm to discover latent topics in New York Times articles.

![](http://1.bp.blogspot.com/_JNTikHKvtnY/S6tLPRWmxjI/AAAAAAAABcQ/-eszxl-WIQ0/s1600/New-York-Times.jpg)

---
Preprocessing
----

1) Read the articles.pkl file using the `pd.read_pickle` function in pandas and transform the data into a structure for scikit-learn.

In [1]:
reset -fs

In [10]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_pickle("../../corpora/nyt_articles.pkl")

In [5]:
# Look at the data...
# What are the columns?
# What are the rows?
df.head(n=2)

Unnamed: 0,document_type,web_url,lead_paragraph,abstract,snippet,news_desk,word_count,source,section_name,subsection_name,_id,pub_date,print_page,headline,content
0,article,http://www.nytimes.com/2013/10/03/sports/footb...,You would think that in a symmetric zero-sum s...,,You would think that in a symmetric zero-sum s...,Sports,347,The New York Times,Sports,Pro Football,524d4e3a38f0d8198974001f,2013-10-03T00:00:00Z,,Week 5 Probabilities: Why Offense Is More Impo...,the original goal building model football fore...
1,article,http://www.nytimes.com/2013/10/03/us/new-immig...,House Democrats on Wednesday unveiled an immig...,House Democrats unveil immigration bill that p...,House Democrats on Wednesday unveiled an immig...,National,83,The New York Times,U.S.,,524cf71338f0d8198973ff7b,2013-10-03T00:00:00Z,21.0,New Immigration Bill Put Forward,house unveiled immigration bill provides path ...


In [42]:
# Look at a sample of the data - content of the news stories
df.content

0       the original goal building model football fore...
1       house unveiled immigration bill provides path ...
2       federal judge wednesday ordered appointment in...
3       texas nation’s active death-penalty state turn...
4       rafael nadal aiming end novak djokovic’s run 1...
5       brazilian labor judge ordered suspension const...
6       defending champion bayern munich produced powe...
7       simone bile used superb floor vault exercise s...
8       the devil announced cory schneider would start...
9       lindsey vonn’s surgically repaired right knee ...
10      the whitney handicap winner cross traffic run ...
11      atlanta at least eight people died front tire ...
12      farmer waste management company energy industr...
13      juba south sudan even standard fashion model w...
14      to editor read interest “three sister not chek...
15      to editor re “data back bloomberg disparity wi...
16      safety big issue american many danger persist ...
17      a gove

2) Use the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn to turn the content of the news stories into the document-term matrix $\textbf{A}$ 

(I call it "vectorized")

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Here are some good defaults for CountVectorizer:
#   max_df=0.95,
#   min_df=2,
#   max_features=1000,
#   stop_words='english'

In [47]:
vectorized = CountVectorizer(max_features=1000,max_df=.95,min_df=2,stop_words='english')
doc_term = vectorized.fit_transform(df.content)

In [49]:
np.shape(df.content)

(1405,)

What is the size of the document-term matrix?

In [50]:
#1405 docs by 1000 terms is the matrix size
doc_term

<1405x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 135271 stored elements in Compressed Sparse Row format>
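
The repr above reports 135271 stored elements out of 1405 x 1000 cells, so roughly 9.6% of the matrix is non-zero. As a minimal sketch of what the vectorizer produces, here is the same idea on a tiny made-up corpus (the `toy_*` names and sentences are illustrative assumptions, not part of the NYT data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, only for illustration.
toy_docs = [
    "the yankee game ran long",
    "the senate passed the health law",
    "yankee fans watched the game",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse, docs x terms

# Recover the column order from the fitted vocabulary (term -> column index).
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

n_docs, n_terms = toy_counts.shape
density = toy_counts.nnz / float(n_docs * n_terms)  # fraction of non-zero cells

print(toy_terms)
print(toy_counts.toarray())
print("density:", density)
```

Each row is a document and each column is a term; stop words such as "the" are dropped before counting.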

----
NMF with scikit-learn 
------

Hint: [Here is an example](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html)

In [51]:
import sklearn

# Make sure we are in the modern age (0.18 or newer)
assert tuple(int(part) for part in sklearn.__version__.split('.')[:2]) >= (0, 18)

In [52]:
from sklearn.decomposition import NMF

Apply NMF with SVD-based initialization to the document-term matrix $\text{A}$ to generate 4 topics.

In [53]:
model = NMF(init="nndsvd",
            n_components=4,
            max_iter=200)

Get the factors $\text{W}$ and $\text{H}$ from the resulting model.

In [55]:
W = model.fit_transform(doc_term)
H = model.components_

In [95]:
W[:100,0]

array([  0.00000000e+00,   0.00000000e+00,   6.00670745e-02,
         7.19590348e-02,   0.00000000e+00,   2.96473692e-02,
         2.54871183e-02,   1.09730065e-02,   0.00000000e+00,
         7.08346396e-02,   0.00000000e+00,   4.05933677e-01,
         7.82687398e-01,   7.21490987e-01,   3.73559476e-01,
         1.70544914e-01,   2.99002649e-01,   1.66389567e-01,
         0.00000000e+00,   4.08411481e-01,   0.00000000e+00,
         5.54626515e-01,   1.86451654e-01,   1.40120017e-01,
         0.00000000e+00,   1.12657534e-01,   0.00000000e+00,
         7.58065691e-01,   9.12978787e-02,   1.18726844e-01,
         8.59929137e-02,   0.00000000e+00,   3.97143189e-02,
         9.54671582e-02,   2.67524960e-01,   2.09066979e-01,
         1.36032642e+00,   8.52540223e-01,   1.86869149e-01,
         4.05289948e-02,   7.41148653e-01,   7.54246731e-01,
         1.11175020e+00,   1.25670998e-01,   3.12460244e-01,
         3.79526966e-01,   6.11516237e-01,   8.66073649e-02,
         0.00000000e+00,

In [62]:
H

array([[ 1.97109492,  1.08278721,  0.50624774, ...,  1.93792322,
         0.80479987,  0.32067983],
       [ 0.14001152,  1.14926156,  0.15534813, ...,  0.15976554,
         0.48232903,  0.04306665],
       [ 0.34525608,  0.10497164,  0.05540141, ...,  0.04922689,
         0.        ,  0.07341733],
       [ 0.05793673,  0.10327576,  0.03387658, ...,  0.71556379,
         0.2551675 ,  0.        ]])

What are the sizes of W and H?

In [56]:
print(np.shape(W)) #docs by topics
print(np.shape(H)) # topics by terms

(1405, 4)
(4, 1000)
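
The shapes line up because the product of a (docs x topics) matrix and a (topics x terms) matrix has the same shape as the document-term matrix it approximates. A small sketch on random data (the `toy_*` names and the 6 x 5 matrix are assumptions for illustration) checks the shapes, the non-negativity of both factors, and the size of the approximation error:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data standing in for a document-term matrix.
rng = np.random.RandomState(0)
toy_A = rng.rand(6, 5)

toy_model = NMF(n_components=2, init="nndsvd", max_iter=200)
toy_W = toy_model.fit_transform(toy_A)  # docs x topics
toy_H = toy_model.components_           # topics x terms

approx = toy_W.dot(toy_H)               # same shape as toy_A
frobenius_error = np.linalg.norm(toy_A - approx)

print(toy_W.shape, toy_H.shape, approx.shape)
print("Frobenius error:", frobenius_error)
```

With only 2 components the reconstruction is approximate, not exact; the error shrinks as `n_components` grows.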


Get the list of all terms whose indices correspond to the columns of the document-term matrix.

In [59]:
terms = [""] * len(vectorized.vocabulary_)
for term in vectorized.vocabulary_.keys():
    terms[vectorized.vocabulary_[term]] = term

In [81]:
# Convert to a NumPy array so the terms can be indexed with an array of sorted indices
terms=np.array(terms)

In [78]:
len(terms)

1000

In [93]:
for c, topic in enumerate(H):
    # Indices of the 10 largest weights in this topic, in descending order
    top_indices = np.argsort(topic)[::-1][:10]
    current_terms = terms[top_indices].tolist()
    print('Topic: ', c+1, ' Top 10 terms :', current_terms)

Topic:  1  Top 10 terms : ['said', 'year', 'new', 'people', 'state', 'company', 'gun', 'work', 'like', 'percent']
Topic:  2  Top 10 terms : ['game', 'season', 'said', 'team', 'year', 'player', 'time', 'play', 'yankee', 'league']
Topic:  3  Top 10 terms : ['republican', 'government', 'house', 'health', 'law', 'care', 'party', 'shutdown', 'senate', 'president']
Topic:  4  Top 10 terms : ['mr', 'said', 'iran', 'rouhani', 'united', 'nuclear', 'president', 'obama', 'state', 'netanyahu']


Print the top 10 ranked terms for each topic, by sorting the values in the rows of the $\text{H}$ factor 

<br>

<details><summary>
Click here for a hint…
</summary>
```
for topic_index in None:
    top_indices = np.argsort(None)[None][None]
    term_ranking = [None[i] for i in None]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

<br>

<details><summary>
Click here for the answer…
</summary>
```
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

In [None]:
# Topic 1: State Gun Regulation in the New Year

In [None]:
#Topic 2: Yankee lead the baseball season
    

In [None]:
#Topic 3: The Republican Government

In [None]:
#Topic 4:Iran Nuclear Deal

Look at the words in the numbered topics. For each one, make up a label that describes it.

For example:  
`Topic 3: people, mobile, said, phone, technology, music, digital, users, microsoft, software`  
is about "The Singularity" 😉

Are there any topics that don't make sense (i.e., the words don't go together)?

In [1]:
# No, all of the topics seem to go together

Change the number of topics to match the number of NYT section labels.

In [104]:
df.section_name.unique()

array(['Sports', 'U.S.', 'Business Day', 'World', 'Opinion', 'Arts',
       'Travel', 'Magazine', 'Real Estate', 'Books'], dtype=object)

In [102]:
model_2 = NMF(init="nndsvd",
            n_components=10,
            max_iter=200)

W_2 = model_2.fit_transform(doc_term)
H_2 = model_2.components_


In [103]:
for c, topic in enumerate(H_2):
    # Indices of the 10 largest weights in this topic, in descending order
    top_indices = np.argsort(topic)[::-1][:10]
    current_terms = terms[top_indices].tolist()
    print('Topic: ', c+1, ' Top 10 terms :', current_terms)

Topic:  1  Top 10 terms : ['said', 'year', 'day', 'people', 'official', 'case', 'time', 'decision', 'court', 'added']
Topic:  2  Top 10 terms : ['game', 'season', 'team', 'year', 'player', 'time', 'league', 'yankee', 'run', 'play']
Topic:  3  Top 10 terms : ['republican', 'house', 'government', 'health', 'law', 'care', 'party', 'president', 'shutdown', 'obama']
Topic:  4  Top 10 terms : ['mr', 'year', 'party', 'political', 'case', 'like', 'leader', 'state', 'court', 'night']
Topic:  5  Top 10 terms : ['new', 'work', 'company', 'like', 'york', 'people', 'ms', 'job', 'worker', 'executive']
Topic:  6  Top 10 terms : ['gun', 'child', 'death', 'year', 'law', 'state', 'time', 'shooting', 'old', 'killed']
Topic:  7  Top 10 terms : ['iran', 'rouhani', 'nuclear', 'obama', 'iranian', 'netanyahu', 'president', 'israel', 'united', 'mr']
Topic:  8  Top 10 terms : ['davis', 'state', 'story', 'texas', 'woman', 'democratic', 'city', 'new', 'republican', 'candidate']
Topic:  9  Top 10 terms : ['percent

How do the NMF topics compare to the NYT section labels?

In [105]:
#array(['Sports', 'U.S.', 'Business Day', 'World', 'Opinion', 'Arts',
   #    'Travel', 'Magazine', 'Real Estate', 'Books'], dtype=object)
#They seem to match, but it is harder to find a topic name using NMF
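
One way to make this comparison less subjective is to assign each document to its highest-weighted topic and cross-tabulate that against the section label. The sketch below uses a small made-up weights matrix and made-up labels (`toy_W` and `toy_sections` are assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (4 documents, 3 topics).
toy_W = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.0, 0.7],
                  [0.0, 0.8, 0.1],
                  [0.6, 0.3, 0.0]])
toy_sections = pd.Series(["Sports", "World", "U.S.", "Sports"], name="section")

# Dominant topic per document, numbered from 1 to match the printed topics.
dominant_topic = pd.Series(toy_W.argmax(axis=1) + 1, name="topic")

# Rows are sections, columns are topics, cells are document counts.
table = pd.crosstab(toy_sections, dominant_topic)
print(table)
```

On the real data the analogous call would be along the lines of `pd.crosstab(df.section_name, W_2.argmax(axis=1) + 1)`, assuming the rows of `W_2` are in the same order as the rows of `df`. A section whose documents concentrate in one column corresponds closely to a single NMF topic.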

Which would you use to filter your news?

In [106]:
#I would use the NYT topics since they have an additional screen on the topic modeling.

Repeat the same modeling with [`tf-idf`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [117]:
tfidf = TfidfVectorizer(max_features=1000,max_df=.95,min_df=2,stop_words='english')
tfidf_matrix = tfidf.fit_transform(df.content)

In [118]:
tfidf_matrix

<1405x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 135271 stored elements in Compressed Sparse Row format>

In [119]:
model_3 = NMF(init="nndsvd",
            n_components=10,
            max_iter=200)

W_3 = model_3.fit_transform(tfidf_matrix)
H_3 = model_3.components_


In [120]:
print(np.shape(W_3))

(1405, 10)


In [121]:
print(np.shape(H_3))

(10, 1000)


In [122]:
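# `terms` was built from the CountVectorizer vocabulary; the TfidfVectorizer above
# uses the same settings, so its column order is assumed to match.
for c, topic in enumerate(H_3):
    # Indices of the 10 largest weights in this topic, in descending order
    top_indices = np.argsort(topic)[::-1][:10]
    current_terms = terms[top_indices].tolist()
    print('Topic: ', c+1, ' Top 10 terms :', current_terms)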

Topic:  1  Top 10 terms : ['mr', 'said', 'court', 'case', 'judge', 'state', 'justice', 'lawyer', 'prison', 'official']
Topic:  2  Top 10 terms : ['game', 'season', 'yard', 'team', 'said', 'league', 'player', 'coach', 'play', 'touchdown']
Topic:  3  Top 10 terms : ['republican', 'house', 'health', 'care', 'government', 'senate', 'shutdown', 'obama', 'law', 'democrat']
Topic:  4  Top 10 terms : ['iran', 'rouhani', 'nuclear', 'iranian', 'obama', 'israel', 'united', 'mr', 'netanyahu', 'president']
Topic:  5  Top 10 terms : ['ms', 'music', 'art', 'new', 'work', 'like', 'dance', 'york', 'museum', 'song']
Topic:  6  Top 10 terms : ['company', 'percent', 'said', 'market', 'year', 'million', 'bank', 'china', 'price', 'state']
Topic:  7  Top 10 terms : ['yankee', 'rivera', 'pettitte', 'inning', 'game', 'season', 'baseball', 'run', 'pitch', 'stadium']
Topic:  8  Top 10 terms : ['attack', 'said', 'official', 'syria', 'killed', 'people', 'government', 'police', 'security', 'mall']
Topic:  9  Top 10

How does that change the topics?

Are they "tighter"? Easier to describe?

In [123]:
# The words within each topic seem to be more closely related to each other now. The top words almost create a sentence.
# This is due to the way TF-IDF weights the features: terms that are frequent within a document
# but rare across the corpus get a higher weight.

---
Challenge Exercises
====

Rolling Your Own (RYO) NMF
-----

With the document matrix (our bags of words), we can begin implementing the NMF algorithm.  

1. Create an NMF class that is initialized with a document matrix (bag of words or tf-idf) __V__. In addition to the document matrix, it should take as arguments __k__ (the number of latent topics) and the maximum number of iterations to perform.

   First we need to initialize our weights (__W__) and features (__H__) matrices.

2. Initialize the weights matrix (__W__) with (positive) random values to be an __n x k__ matrix, where __n__ is the number of documents and __k__ is the number of latent topics.

3. Initialize the feature matrix (__H__) to be __k x m__, where __m__ is the number of words in our vocabulary (i.e. the length of the bag). Our original document matrix (__V__) is an __n x m__ matrix. __NOTICE: shape(V) = shape(W * H)__

4. Now that we have initialized our matrices, we can begin iterating. Update your weights and features matrices as described in the next two steps, and repeat the updates until convergence (i.e. the change in __cost(V, W*H)__ is close to 0) or until the maximum number of iterations is reached.

5. Assume we want to use a least-squares error metric when we update the matrices __W__ and __H__. This allows us to use the `numpy.linalg.lstsq` solver. Update __H__ by calling lstsq, holding __W__ fixed and minimizing the sum of squared errors in predicting the document matrix. Since these values should all be at least 0, clip all the values in __H__ after the call to lstsq.

6. Use the lstsq solver to update __W__ while holding __H__ fixed. The lstsq solver assumes it is optimizing the right matrix of the multiplication (e.g. x in the equation __ax=b__), so you will need to get creative to use it and have the dimensions line up correctly. Brainstorm on paper or a whiteboard how to manipulate the matrices so lstsq can get the dimensionality correct and optimize __W__. __Hint: it involves transposes.__ Clip __W__ appropriately after updating it with lstsq to ensure it is at least 0.
   `from numpy.linalg import lstsq`

7. Repeat steps 5 and 6 for a fixed number of iterations.

8. Return the computed weights matrix and features matrix.
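
One possible shape for the alternating least-squares updates described above is sketched below. It is a plain function, not the class the exercise asks for, and the name `als_nmf`, the `seed` argument, and the fixed iteration count are assumptions made for illustration. It expects a dense array, so a sparse matrix from a vectorizer would need `.toarray()` first.

```python
import numpy as np
from numpy.linalg import lstsq


def als_nmf(V, k, max_iter=50, seed=0):
    """Sketch of NMF by alternating least squares with clipping (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n, m = V.shape
    W = rng.rand(n, k)  # weights: documents x topics, positive random start
    H = rng.rand(k, m)  # features: topics x terms

    for _ in range(max_iter):
        # Hold W fixed: solve W @ H ~= V for H, then clip negatives to 0.
        H = lstsq(W, V, rcond=None)[0].clip(min=0)
        # Hold H fixed: solve H.T @ W.T ~= V.T for W.T, transpose back, clip.
        W = lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)
    return W, H


# Usage on a small random non-negative matrix.
V_toy = np.random.RandomState(1).rand(6, 5)
W_toy, H_toy = als_nmf(V_toy, k=2, max_iter=20)
print(W_toy.shape, H_toy.shape)
```

The transpose trick works because `W H = V` is equivalent to `H^T W^T = V^T`, which puts the unknown on the right-hand side of the product as lstsq expects. Clipping after each solve keeps the factors non-negative but means the cost is not guaranteed to decrease at every iteration, so a convergence check on the change in cost is a reasonable addition.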



Using Your NMF Function
----

1. Write a function that takes __W__, __H__ and the document matrix as arguments, and returns the mean-squared error (of __document matrix - WH__).

2. Using argsort on each topic in __H__, find the index values of the words most associated with that topic.  Combine these index values with the word names you stored earlier (the list of terms in column order) to print out the most common words for each topic.
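
A minimal sketch of both helpers, assuming dense NumPy arrays (the function names `mean_squared_error` and `top_term_indices` are hypothetical choices, not prescribed by the exercise):

```python
import numpy as np


def mean_squared_error(V, W, H):
    """Mean of the squared entries of (V - WH); V is assumed to be a dense array."""
    return np.mean((V - W.dot(H)) ** 2)


def top_term_indices(H, n_top=10):
    """For each topic (row of H), column indices of the n_top largest weights."""
    return np.argsort(H, axis=1)[:, ::-1][:, :n_top]


# Tiny worked example with made-up factors.
W_demo = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
H_demo = np.array([[0.1, 0.9, 0.5],
                   [0.7, 0.2, 0.3]])
V_demo = W_demo.dot(H_demo)  # exact product, so the error is zero

print(mean_squared_error(V_demo, W_demo, H_demo))
print(top_term_indices(H_demo, n_top=2))
```

Indexing an array of term names with one row of the result gives the top words for that topic.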




Run the code you wrote for the __Using Your NMF Function__ section on the scikit-learn NMF model.  How close is the output to what you found with your own NMF implementation?

__Design an API__:
1. Put your nmf function in an nmf class.
2. Define a function that displays the headlines/titles of the top 10 documents for each topic.
3. Define a function that takes as input a document and displays the top 3 topics it belongs to.
4. Define a function that ensures consistent ordering between your nmf function and the sklearn nmf class.
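
As a starting point for items 2 and 3, here is a rough sketch that works directly on a weights matrix. The function names and the toy data are assumptions for illustration; it returns values instead of displaying them, so that the display format stays up to you.

```python
import numpy as np


def top_documents(W, titles, n_top=10):
    """For each topic (column of W), titles of the n_top highest-weighted documents."""
    order = np.argsort(W, axis=0)[::-1][:n_top]  # shape: (n_top, n_topics)
    return [[titles[i] for i in order[:, t]] for t in range(W.shape[1])]


def top_topics(doc_weights, n_top=3):
    """Indices of the n_top highest-weighted topics for one document (one row of W)."""
    return np.argsort(doc_weights)[::-1][:n_top]


# Hypothetical weights for 4 documents and 3 topics.
W_api = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.0, 0.7],
                  [0.0, 0.8, 0.1],
                  [0.6, 0.3, 0.0]])
titles_api = ["a", "b", "c", "d"]

print(top_documents(W_api, titles_api, n_top=2))
print(top_topics(W_api[1], n_top=2))
```

With the NYT data, `titles` could be something like `df.headline.tolist()`, assuming the rows of `W` are in the same order as the rows of `df`.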

<br>
<br>
<br>

---