# Topic Modeling: Finding latent topics in NYT articles with NMF

This exercise takes an _unsupervised_ approach to mining a corpus. Recall: _Unsupervised_ methods are those which do not use _labeled_ data, but instead attempt to glean information from the observed data only. In this exercise, we'll re-discover a few NYT topics (sections) using only the contents (words) of each article. We will acheive this by using NMF to find latent topics, having the user hand-label each latent topic, then analyzing a few NYT articles to see how well NMF picked up on the underlying article topics.

We've implemented this exercise in two stand-alone modules (`pair.py` and `my_nmf.py`). This notebook will walk you through the high-level steps, using those two modules along the way.

In [1]:
from nmf_helpers import build_text_vectorizer, hand_label_topics, analyze_article
from my_nmf import NMF

import numpy as np
import pandas as pd

# Overview

The high-level steps below are:

1. Read the NYT articles from disk, and vectorize the text.
2. Run NMF to find latent topics (an unsupervised approach).
3. Have the user view and hand-label each topic.
4. Analyze a few NYT articles to see how well NMF and our hand-labeled topics capture the underlying sections.
5. Do it all again using scikit-learn's NMF implementation.

### Step 1: Read the NYT articles and vectorize them

The NYT articles are stored in a picked pandas DataFrame. We only care about the `content` of each article, but we'll also keep track of the URL of each article so that we can better evaluate our results at the end (by viewing each article in its original format).

We will use TfIdf to vectorize our corpus into a document-term matrix. For most NLP tasks, TfIdf usually gives better results over Bag-of-Words. We will not stem the words, but you are welcome to try out stemming to see how the results change (just change the flab below)! We'll limit the vocabulary to 5000 words so that we only focus on common words; uncommon words might muddle the signal, but feel free to try out other values than 5000 to see how the results are affected.

In [2]:
# Load the corpus.
df = pd.read_pickle("data/articles.pkl")
contents = df.content
web_urls = df.web_url

# Build our text-to-vector vectorizer, then vectorize our corpus.
vectorizer, vocabulary = build_text_vectorizer(contents,
                             use_tfidf=True,
                             use_stemmer=False,
                             max_features=5000)
X = vectorizer(contents)

### Step 2: Run NMF

NMF will factorize our document-term matrix `V` into a matrix `W` (where each row is a latent vector of a single document in the corpus) and a matrix `H` (where each column is a latent vector of a single word in the vocabulary). See the docstring of the NMF.fit() method:

In [3]:
help(NMF.fit)

Help on function fit in module my_nmf:

fit(self, V, verbose=False)
    Do many ALS iterations to factorize `V` into matrices `W` and `H`.
    
    Let `V` be a matrix (`n` x `m`) where each row is an observation
    and each column is a feature. `V` will be factorized into a the matrix
    `W` (`n` x `k`) and the matrix `H` (`k` x `m`) such that `WH` approximates
    `V`.
    
    This method returns the tuple (W, H); `W` and `H` are each ndarrays.



In [4]:
# We'd like to see consistent results, so set the seed.
np.random.seed(12345)

# Find latent topics using our NMF model.
factorizer = NMF(k=7, max_iters=35, alpha=0.5)
W, H = factorizer.fit(X, verbose=True)

iter 0 : reconstruction error: 102.50543281053797
iter 1 : reconstruction error: 44.546743477439996
iter 2 : reconstruction error: 37.723539022352334
iter 3 : reconstruction error: 36.64915715676204
iter 4 : reconstruction error: 36.17495901413914
iter 5 : reconstruction error: 35.97493163298691
iter 6 : reconstruction error: 35.87866304757567
iter 7 : reconstruction error: 35.82390316808533
iter 8 : reconstruction error: 35.78768890593955
iter 9 : reconstruction error: 35.760427355849814
iter 10 : reconstruction error: 35.73744314518574
iter 11 : reconstruction error: 35.715979333148695
iter 12 : reconstruction error: 35.69419184074153
iter 13 : reconstruction error: 35.67095745763352
iter 14 : reconstruction error: 35.64624287087397
iter 15 : reconstruction error: 35.6215453871338
iter 16 : reconstruction error: 35.598923645926625
iter 17 : reconstruction error: 35.57983470940949
iter 18 : reconstruction error: 35.56461759974646
iter 19 : reconstruction error: 35.552737600632426
iter

### Step 3: Hand-label each latent topic

The latent topics which NMF finds can be seen as a reduced dimensionality representation of each document (by taking the `W` matrix). That interpretation doesn't require _knowing_ what each latent topic represents, you simply take the reduced dimensionality representation as an new, opaque feature vector.

But, that's not what we're interested in doing here. For this corpus, we'd like to see if we can _interpret_ what each latent topic is capturing in the corpus! This will take some work on our part, and we'll need to use the `H` matrix to help us. We'll use the `H` matrix to find which words in the vocabulary contribute most to each latent topic. We'll use that as our way to peek at each latent topic in hopes of figuring out what it is capturing.

So, you have a job to do now. For each topic, the code below will print the 20 words which most contribute to that topic. You must look at those words, use your _humanness_, and label what each topic seems to be. If you haven't modified the code thus-far, you should see the following topics pop out:
1. "football"
2. "arts"
3. "baseball"
4. "world news (middle eastern?)"
5. "politics"
6. "world news (war?)"
7. "economics"

In [6]:
# Label topics and analyze a few NYT articles.
hand_labels = hand_label_topics(H, vocabulary)

topic 0
--> game team season said league run inning cup player l hit race mets two n win wild red playoff card
please label this topic: football

topic 1
--> iran rouhani nuclear iranian mr obama israel united netanyahu president weapon syria nation sanction state israeli speech meeting hassan leader
please label this topic: arts

topic 2
--> yankee rivera pettitte season art music like game show work new one dance girardi opera jeter p time ms night
please label this topic: baseball

topic 3
--> mr party said ms court merkel judge case election political state new german parliament year minister right germany conservative former
please label this topic: world news (middle eastern?)

topic 4
--> republican house health government care senate shutdown law obama would debt bill party congress democrat vote senator president insurance cruz
please label this topic: politics

topic 5
--> said attack people government percent company official state year united country group syria security ma

### Step 4: Analyze a few NYT articles using our new labels

Yay! We have figured out what each latent topic **IS**. Now we can go back to our documents and label them. Let's label 15 random articles to see if our unsupervised topic mining is on the right track.

In [7]:
rand_articles = np.random.choice(range(len(W)), 15)

for i in rand_articles:
    analyze_article(i, contents, web_urls, W, hand_labels)

http://www.nytimes.com/2013/09/21/arts/sammy-obeid-on-his-1000th-comedy-show.html
if sammy obeid learned one thing performing stand-up 1,000 consecutive day it’s embrace inner robot mr. obeid 29 finish ironman project friday night long driven goal-oriented perfectionist he earned 3.9 grade point average university california berkeley majoring applied mathematics business administration trying comedy approached rigor “jokes systematic said interview skype tuesday “what right word fit blank make equation work how meaning evoke best laugh his stand-up display analytic style full wordplay clever misdirection ethnic humor riffing middle eastern root he’s lebanese-american in one characteristic joke say “i believe god woman so don’t say ‘amen.’ say ‘that’s a soon started performing six year ago mr. obeid meticulously tracked success rate joke now give grade a++ a+++ based quantity laugh organized joke list “this thing example mechanical robotic person discovered mr. obeid said originally und

THIS WAS VERY SUCCESSFUL! Notice how 7 different _sections_ popped out of the data without us using the _section labels_, we only used the text itself!

### Step 5: Do it all again with scikit-learn

Below is equivalent code, but this time using scikit-learn's NMF class instead of our own. Note: Don't be confused by `alpha` below; sklearn uses `alpha` as a regularization term, where we used `alpha` in our code as a learning rate.

How does this compare with our NMF implementation? (It should be similar! The same topics are found! They only are in a different order.)

In [10]:
from sklearn.decomposition import NMF as NMF_sklearn

nmf = NMF_sklearn(n_components=7, max_iter=100, random_state=12345, alpha=0.0)
W = nmf.fit_transform(X)
H = nmf.components_
print('reconstruction error:', nmf.reconstruction_err_)

hand_labels = hand_label_topics(H, vocabulary)

for i in rand_articles:
    analyze_article(i, contents, web_urls, W, hand_labels)

reconstruction error: 35.396962227823295
topic 0
--> mr ms new art work like music show one said p year york dance opera time city song museum people
please label this topic: arts

topic 1
--> game team season yard 0 touchdown league 1 l said n player first giant 2 coach quarterback play 3 win
please label this topic: football

topic 2
--> iran rouhani nuclear mr iranian obama israel united netanyahu president nation sanction state weapon speech israeli meeting said syria leader
please label this topic: world news (middle eastern?)

topic 3
--> republican house health government care senate party law obama mr shutdown democrat bill would vote senator congress president conservative cruz
please label this topic: politics

topic 4
--> percent bank company market said rate government year economy china european price euro million growth country economic would month investor
please label this topic: economics

topic 5
--> said attack official syria mr police government killed security unit