----
Exercises: Topic Modeling with Non-Negative Matrix Factorization (NMF)
----

Today we will apply the Non-Negative Matrix Factorization (NMF) algorithm to discover latent topics in New York Times articles.

![](http://1.bp.blogspot.com/_JNTikHKvtnY/S6tLPRWmxjI/AAAAAAAABcQ/-eszxl-WIQ0/s1600/New-York-Times.jpg)

---
Preprocessing
----

1) Read the articles.pkl file using the `pd.read_pickle` function in pandas and transform the data into a structure for scikit-learn.

In [1]:
reset -fs

In [None]:
import pandas as pd

In [None]:
df = pd.read_pickle("../../../corpora/nyt_articles.pkl")

In [None]:
# Look at the data...
# What are the columns?
# What are the rows?
df.head(n=2)

In [None]:
# Look at a sample of the data - content of the news stories
df.content[0][:100]

2) Use the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn to turn the content of the news stories into the document-term matrix $\textbf{A}$ 

(I call it "vectorized")

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
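# Here are some good defaults
max_df = 0.95          # ignore terms that appear in more than 95% of documents
min_df = 2             # ignore terms that appear in fewer than 2 documents
max_features = 1000    # keep only the 1000 most frequent terms
stop_words = 'english'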

In [None]:
vectorized = None

What is the size of the document-term matrix?
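
As a minimal sketch of what the vectorizer produces, the block below applies the same settings to a made-up four-sentence corpus (`toy_docs`, `toy_vectorizer`, and `toy_matrix` are hypothetical names, not part of the exercise data). The `shape` attribute gives the size of the document-term matrix as (documents, terms).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the shape of the output
toy_docs = [
    "the senate passed the budget bill",
    "the team won the game last night",
    "budget talks stalled in the senate",
    "the game went to overtime",
]

toy_vectorizer = CountVectorizer(max_df=0.95,
                                 min_df=2,
                                 max_features=1000,
                                 stop_words='english')

# fit_transform returns a sparse matrix: one row per document, one column per term
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Only terms that occur in at least 2 documents survive min_df=2
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)
```

On the real corpus the same pattern applies, with `df.content` in place of `toy_docs`.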

----
NMF with scikit-learn 
------

Hint: [Here is an example](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html)

In [None]:
import sklearn

# Make sure we are in the modern age (0.18 or newer)
assert tuple(int(part) for part in sklearn.__version__.split('.')[:2]) >= (0, 18)

In [4]:
from sklearn.decomposition import NMF

Apply NMF with SVD-based initialization to the document-term matrix $\text{A}$ to generate 4 topics.

In [5]:
model = NMF(init="nndsvd",
            n_components=4,
            max_iter=200)

__Sidebar__: NNDSVD (Nonnegative Double Singular Value Decomposition)

- Provides a deterministic initialization with no random element.
- Chooses initial factors based on positive components of the first k dimensions of SVD of data matrix A.
- Often leads to significant decrease in number of NMF iterations required before convergence.

(Boutsidis & Gallopoulos, 2008)

Get the factors $\text{W}$ and $\text{H}$ from the resulting model.

In [None]:
W = model.fit_transform(vectorized)
H = model.components_

What are the sizes of W and H?
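
To see how the factor shapes relate to the input, here is a small sketch on a random non-negative matrix (`toy_A` and the other `toy_` names are made up for illustration, and `random_state=0` is an added assumption to make the run repeatable). With `n` rows, `m` columns, and `k` components, W should come out as `n x k` and H as `k x m`.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for a document-term matrix: 6 "documents" x 5 "terms"
rng = np.random.RandomState(0)
toy_A = rng.rand(6, 5)

toy_model = NMF(init="nndsvd", n_components=2, max_iter=500, random_state=0)
toy_W = toy_model.fit_transform(toy_A)   # document-by-topic weights
toy_H = toy_model.components_            # topic-by-term weights

print(toy_W.shape, toy_H.shape)
```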

Get the list of all terms whose indices correspond to the columns of the document-term matrix.

In [2]:
terms = [""] * len(vectorizer.vocabulary_)
for term in vectorizer.vocabulary_.keys():
    terms[vectorizer.vocabulary_[term]] = term

In [None]:
# Have a look at some of the terms
terms[-5:]

Print the top 10 ranked terms for each topic, by sorting the values in the rows of the $\text{H}$ factor.

It should look like:  
`Topic 0: said, year, new, people, state, company, gun, work, like, percent`

<br>

<details><summary>
Click here for a hint…
</summary>
```
for topic_index in None:
    top_indices = np.argsort(None)[None][None]
    term_ranking = [None[i] for i in None]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

<br>

<details><summary>
Click here for the answer…
</summary>
```
import numpy as np

for topic_index in range(H.shape[0]):
    # Sort this topic's term weights in descending order and keep the top 10
    top_indices = np.argsort(H[topic_index, :])[::-1][:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

Look at the words in the numbered topics. For each one, make up a label that describes it.

For example:  
`Topic 3: people, mobile, said, phone, technology, music, digital, users, microsoft, software`  
is about "The Singularity" 😉

Are there any topics that don't make sense (i.e., the words don't go together)?

Change the number of topics to match the number of NYT section labels.

How do the NMF topics compare to the NYT section labels?

Would you use your NMF topics or NYT section labels to filter your news?

Repeat the same modeling with [`tf-idf`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

How does that change the topics?

Are they "tighter"? Easier to describe?
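
One way to build intuition for why tf-idf can change the topics: it down-weights terms that appear in many documents. The sketch below uses a made-up three-document corpus (`toy_docs` and the other names are hypothetical) in which "budget" appears everywhere, so within the first document the rarer term "senate" should receive the larger weight. It uses `TfidfVectorizer` with its default settings, which also scale each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "budget" is in every document, the other terms in one each
toy_docs = ["budget senate", "budget game", "budget vote"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()

# Look up column indices by term, then compare weights within the first document
column = toy_tfidf.vocabulary_
budget_weight = toy_weights[0, column["budget"]]
senate_weight = toy_weights[0, column["senate"]]
print(senate_weight > budget_weight)

# Length of each document's weight vector
row_norms = np.linalg.norm(toy_weights, axis=1)
print(row_norms)
```

The resulting matrix can be passed to `NMF.fit_transform` in the same way as the count matrix.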

---
Challenge Exercises
====

Rolling Your Own (RYO) NMF
-----

With the document matrix (our bags of words), we can begin implementing the NMF algorithm.  

1. Create an NMF class that is initialized with a document matrix (bag of words or tf-idf) __V__. In addition to the document matrix, it should take as arguments __k__ (the number of latent topics) and the maximum number of iterations to perform.
  
  First we need to initialize our weights (__W__) and features (__H__) matrices.  

2. Initialize the weights matrix (__W__) with positive random values as an __n x k__ matrix, where __n__ is the number of documents and __k__ is the number of latent topics.

3. Initialize the features matrix (__H__) as a __k x m__ matrix, where __m__ is the number of words in our vocabulary (i.e. the length of the bag). Our original document matrix (__V__) is an __n x m__ matrix.  __NOTICE: shape(V) = shape(W * H)__

4. With the matrices initialized, we can begin iterating. Update the weights and features matrices as described in steps 5 and 6, and repeat until convergence (i.e. the change in __cost(V, W*H)__ is close to 0) or until the maximum number of iterations is reached.

5. Assume we want to use a least-squares error metric when we update the matrices __W__ and __H__. This allows us to use the `numpy.linalg.lstsq` solver. 
Update __H__ by calling `lstsq`, holding __W__ fixed and minimizing the sum of squared errors in predicting the document matrix. Since these values should all be at least 0, clip all the values in __H__ after the call to `lstsq`.

6. Use the `lstsq` solver to update __W__ while holding __H__ fixed. The `lstsq` solver assumes it is optimizing the right-hand matrix of the multiplication (e.g. x in the equation __ax=b__), so you will need to get creative to use it and have the dimensions line up correctly.  Brainstorm on paper or a whiteboard how to manipulate the matrices so `lstsq` can get the dimensionality correct and optimize __W__. __hint: it involves transposes.__ Clip __W__ appropriately after updating it with `lstsq` to ensure it is at least 0.  
`from numpy.linalg import lstsq`

7. Repeat steps 5 and 6 for a fixed number of iterations.

8. Return the computed weights matrix and features matrix.
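
The steps above can be sketched as a single function. This is one possible reading of the exercise, not a reference solution: the name `als_nmf`, the `seed` argument, and the toy matrix are all made up for illustration, and the loop simply runs for a fixed number of iterations without a convergence check. It assumes __V__ is a dense NumPy array (a sparse matrix from a vectorizer would need `.toarray()` first).

```python
import numpy as np
from numpy.linalg import lstsq


def als_nmf(V, k, max_iter=50, seed=0):
    """Alternating least squares with clipping (a sketch, not sklearn's solver)."""
    rng = np.random.RandomState(seed)
    n, m = V.shape
    W = rng.rand(n, k)   # weights: n documents x k topics, positive random values
    H = rng.rand(k, m)   # features: k topics x m terms

    for _ in range(max_iter):
        # Hold W fixed: solve W @ H ~= V for H, then clip negatives to 0
        H = lstsq(W, V, rcond=None)[0].clip(min=0)
        # Hold H fixed: V ~= W @ H is equivalent to V.T ~= H.T @ W.T,
        # so solve for W.T, transpose back, and clip negatives to 0
        W = lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)
    return W, H


# Hypothetical toy matrix: 4 documents x 3 terms
toy_V = np.array([[3.0, 0.0, 1.0],
                  [2.0, 0.0, 1.0],
                  [0.0, 4.0, 1.0],
                  [0.0, 3.0, 1.0]])
toy_W, toy_H = als_nmf(toy_V, k=2)
print(toy_W.shape, toy_H.shape)
```

Clipping after an unconstrained least-squares solve is a simple heuristic, so the error is not guaranteed to decrease at every iteration.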



Using Your NMF Function
----

1. Write a function that takes __W__, __H__ and the document matrix as arguments, and returns the mean-squared error (of __document matrix - WH__).

2. Using argsort on each topic in __H__, find the index values of the words most associated with that topic.  Combine these index values with the `terms` list you built earlier to print out the most common words for each topic.
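
A minimal sketch of both helpers, using small hand-made arrays so the results can be checked by eye. The function names `reconstruction_mse` and `top_term_indices` are hypothetical, and the first assumes a dense __V__.

```python
import numpy as np


def reconstruction_mse(V, W, H):
    """Mean of the squared entries of (V - W @ H)."""
    residual = np.asarray(V) - W @ H
    return float(np.mean(residual ** 2))


def top_term_indices(H, n_top=3):
    """For each topic (row of H), column indices of the n_top largest weights."""
    return np.argsort(H, axis=1)[:, ::-1][:, :n_top]


toy_V = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_W = np.eye(2)

# A perfect reconstruction has zero error
print(reconstruction_mse(toy_V, toy_W, toy_V))              # 0.0
# Reconstructing with all zeros leaves the mean of V squared: (1 + 4 + 9 + 16) / 4
print(reconstruction_mse(toy_V, toy_W, np.zeros((2, 2))))   # 7.5

toy_H = np.array([[0.1, 0.9, 0.5],
                  [0.7, 0.2, 0.3]])
toy_top = top_term_indices(toy_H, n_top=2)
print(toy_top)   # [[1 2], [0 2]]
```

The index arrays returned by `top_term_indices` can then be used to look up term names, for example `[terms[i] for i in row]` for each row.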




Run the code you wrote for the __Using Your NMF Function__ section on the scikit-learn NMF model.  How close is the output to what you found with your own NMF implementation?

__Design an API__:
1. Put your nmf function in an nmf class.
2. Define a function that displays the headlines/titles of the top 10 documents for each topic.
3. Define a function that takes as input a document and displays the top 3 topics it belongs to.
4. Define a function that ensures consistent topic ordering between your nmf function and the sklearn NMF class.
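
As a hedged sketch for items 3 and 4: the top topics for a document can be read off its row of __W__, and one possible way to get consistent ordering is to match topics between two __H__ matrices by cosine similarity. The function names are hypothetical, and the matching uses `scipy.optimize.linear_sum_assignment`, which is a design choice made here rather than something the exercise prescribes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def top_topics_for_document(W, doc_index, n_top=3):
    """Indices of the n_top topics with the largest weight for one document."""
    return np.argsort(W[doc_index])[::-1][:n_top]


def align_topics(H_reference, H_other):
    """Return `order` such that H_other[order][i] best matches H_reference[i]."""
    # Normalize rows so the dot product below is a cosine similarity
    ref = H_reference / (np.linalg.norm(H_reference, axis=1, keepdims=True) + 1e-12)
    oth = H_other / (np.linalg.norm(H_other, axis=1, keepdims=True) + 1e-12)
    similarity = ref @ oth.T
    # linear_sum_assignment minimizes cost, so negate to maximize similarity
    _, order = linear_sum_assignment(-similarity)
    return order


toy_W = np.array([[0.1, 0.7, 0.2, 0.4]])
toy_top = top_topics_for_document(toy_W, doc_index=0)
print(toy_top)   # [1 3 2]

# Same topics as the reference, but shuffled and rescaled
H_ref = np.eye(3)
H_shuffled = 2.0 * H_ref[[2, 0, 1]]
order = align_topics(H_ref, H_shuffled)
print(order)     # [1 2 0]
```

After alignment, `H_shuffled[order]` lists the topics in the same order as `H_ref`, and the columns of the matching __W__ can be reordered with the same index array.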

<br>
<br>
<br>

---