# DAY 13: Nonnegative Matrix Factorization


### Machine Learning and Computational Statistics (DSC6232)

#### Instructors: Weiwei Pan, Melanie Fernandez, Pavlos Protopapas

#### Due: August 12th, 2:00 pm Kigali Time

**First name**: _________________________________________________________


**Last name**: _____________

## Learning Goals:

1. learn how to process and encode text data
2. understand how to analyze documents using a simple topic model 
3. learn how to interpret nonnegative matrix factorization models

### Load necessary libraries

In [1]:
import random
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
from IPython.display import display
%matplotlib inline


### We include auxiliary functions here that we will need to use later  **No need to read in details!**

We include auxiliary functions here that we will need to use later



In [2]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

def print_features(feature_names, num_columns=5):
    padding = num_columns - len(feature_names) % num_columns
    feature_names += [''] * (padding * (padding != num_columns))
    feature_names = np.array(feature_names).reshape(-1, num_columns)
    display(pd.DataFrame(feature_names, columns=[''] * num_columns).reset_index(drop=True))

# Topic Modeling for News Articles

This exercise is designed to help you transform and model textual data. You may find the tutorial [here](http://scikit-learn.org/stable/modules/feature_extraction.html) helpful.

You will encode a small set of news articles (i.e. represent them as count vectors) and model this set using a Nonnegative Factorization Model. Your goal is to discover a latent set of topic underlying the articles and discover which topics appear in each article.


### Load-in the data and examine it

We use the `fetch_20newsgroups` function from `sklearn` to load a set of articles in the categories: "medicine", "religion" and "motorcycles".

In [3]:
# Load the data set
print("Loading dataset...")
data, _ = fetch_20newsgroups(shuffle=False, remove=('headers', 'footers', 'quotes'),
                             return_X_y=True, categories=['sci.med', 'soc.religion.christian', 'rec.motorcycles'])
print('Done.')

Loading dataset...
Done.


We check to see how many articles we have loaded. We also print two articles form this set to see what they look like.

In [4]:
# Print the number of articles in the data
print('Number of data points: {}\n'.format(len(data)))

# Print out an example article from the data
print('Example articles:\n\n')
print('*' * 10 + ' Example 1 ' + '*' * 10)
print(data[0])

# Print out another article from the data
print('\n\n' + '*' * 10 + ' Example 2 ' + '*' * 10)
print(data[5])

Number of data points: 1791

Example articles:


********** Example 1 **********
Does anyone on this newsgroup happen to know WHY morphine was
first isolated from opium?  If you know why, or have an idea for where I
could look to find this info, please mail me.
	CSH
any suggestionas would be greatly appreciated

--
 "Kilimanjaro is a pretty tricky climb. Most of it's up, until you reach
the very, very top, and then it tends to slope away rather sharply."
					Sir George Head, OBE (JC)


********** Example 2 **********
I just noticed that my halogen table lamp runs off 12 Volts.
The big thinngy that plugs into the wall says 12 Volts DC,  20mA

The question is: Can I trickle charge the battery on my CB650
with it?

I don't know the rating of the battery, but it is a factory
intalled one. 


Thanks,
Sanjay

-- 
   '81 CB650 						DoD #1224


### Encode the data as count vectors

We are going to use `sklearn`'s `TfidfVectorizer` function to remove punctuations and non-meaningful words from the documents and then convert them into count vectors.

**Exercise 1:** Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for `TfidfVectorizer`, and experiment with different values for the parameters `max_df`, `min_df`, `max_features`. What do each of these parameters mean? How does changing these parameters change the count vector representation of the data?

In [18]:
# Step 1: Reduce the size of the data
n_samples = 1000
data_samples = random.sample(data, min(n_samples, len(data)))

# Step 2: Choose the number of features, or important words, to extract
n_features = 500

# Step 3: Extract tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5, max_features=n_features, stop_words='english')

# Step 4: Encode the documents as normalized count vectors
vectorized_data = tfidf_vectorizer.fit_transform(data_samples)

# Step 5: Get learned feature names
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Step 6: Select a samplee of the learned features 
sample_of_features = random.sample(sorted(tfidf_feature_names), 100)

# Step 7: Print that sample of learned feature names
print_features(sample_of_features)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,years,software,tell,patient,war
1,1993,didn,fit,exist,drug
2,know,subject,world,trying,try
3,dog,advice,questions,motorcycle,available
4,suppose,result,right,instead,hurt
5,kind,st,page,matter,condition
6,says,self,skepticism,pain,experience
7,list,mind,money,second,specific
8,christianity,11,able,car,best
9,actually,come,interesting,foods,rec


**Exercise 2:** Print the normalized count vector representation of a single document. What kind of numbers are in this vector? What do these numbers represent? ***Hint:*** recall how we process count vectors before fitting a nonnegative matrix factorization model. 

The `TfidfVectorizer` function normalizes the count vectors, what does this mean and why is this step necessary?

In [20]:
# Step 1: Print the normalized count vector representation of a single document
n = 10
print('The normalized count vector representation of the {}-th document'.format(n))
print(vectorized_data[n, :])

The normalized count vector representation of the 10-th document
  (0, 117)	0.11305800827015901
  (0, 233)	0.17937797725929913
  (0, 313)	0.1559434312822325
  (0, 140)	0.15791746432932727
  (0, 237)	0.17427516845032495
  (0, 364)	0.17589813420027145
  (0, 87)	0.3045786247813023
  (0, 225)	0.1514301970447727
  (0, 193)	0.16337697524024486
  (0, 94)	0.15059083323716208
  (0, 123)	0.18988145616452512
  (0, 84)	0.1321065592771092
  (0, 44)	0.10938975710915493
  (0, 443)	0.12098318912139533
  (0, 231)	0.25483854151349056
  (0, 18)	0.13634443754267764
  (0, 497)	0.1142650840757048
  (0, 13)	0.1514301970447727
  (0, 494)	0.3368418864691065
  (0, 329)	0.11883488024174688
  (0, 268)	0.10755729352054061
  (0, 352)	0.11305800827015901
  (0, 250)	0.08545005863365784
  (0, 130)	0.10145802901185105
  (0, 251)	0.15407067629618976
  (0, 134)	0.08760259060250164
  (0, 405)	0.13988049418878795
  (0, 406)	0.17589813420027145
  (0, 298)	0.2285301681514096
  (0, 55)	0.4196414825663638


### Fit a Nonnegative Matrix Factorization Model to the data

Now that our data has been encoded as normalized count vectors, we can fit an NMF model to it.

**Exercise 3:** Fit an NMF model with 10 topics, print out the top words associated to each topic. Can you interpret what each topic is about?

Fit an NMF model with 2 topics, print out the top words associated to each topic. Can you interpret what each topic is about?

Find an appropriate number of topics. Why is this number appropriate?

In [28]:
# Step 1: Fit NMF model
nmf = NMF(n_components=10, alpha=0.1, l1_ratio=0.5, regularization=None, init='nndsvd').fit(vectorized_data)

# Step 2: Print out the learned topics
print('Topics learned by the NMF:')
print_top_words(nmf, tfidf_feature_names, 10)

Topics learned by the NMF:
Topic #0: god love life lord sin hell scripture heaven christ faith
Topic #1: geb banks pitt chastity n3jxp dsl shameful cadre surrender skepticism
Topic #2: bike bikes dod ride riding motorcycle miles engine turn advice
Topic #3: doctor pain years time patients disease new treatment ago problems
Topic #4: msg food eat use foods people effects effect cause natural
Topic #5: thanks know edu info does looking mail information email post
Topic #6: believe truth people science true christians does think christianity question
Topic #7: jesus church christ mary say father john pope did son
Topic #8: don just like think ll know people ve little try
Topic #9: helmet fall head just fit long face usually ground big





**Exercise 4:** Pick a document and print out the combinations of topics in that document. Which combinations of topics are contained in this article? Do you agree with the the combination of topics learned by the model?

In [29]:
# Step 1: get the document to topic matrix each row of this matrix is the combination of topics in the document
document_to_topic = nmf.transform(vectorized_data)

# Step 2: print the shape of this matrix to verify that we have 1000 documents and 10 topic
print(document_to_topic.shape)

(1000, 10)




In [30]:
# Step 3: print an article and the predicted combination of topics in this article
n = 10

# Print the predicted combination of topics in this article
print('\n\n' + '*' * 10 + ' Predicted combination of topics ' + '*' * 10 + '\n\n')
print(document_to_topic[n])

# Print out an example article from the data
print('\n\n' + '*' * 10 + ' Article {} '.format(n) + '*' * 10)
print(data[n])



********** Predicted combination of topics **********


[0.         0.00134814 0.00029046 0.02690346 0.00311581 0.00057397
 0.0566424  0.17742877 0.039824   0.        ]


********** Article 10 **********

I attribute my success to several factors:

Very low fat.  Except when someone else has cooked a meal for me,
I only eat fruit, vegetables, and whole grain or bran cereals.  I
estimate I only get about 5 to 10 percent of my calories from fat.

Very little sugar or salt.

Very high fiber.  Most Americans get about 10 grams.  25 to 35 are
recommended.  I get between 50 and 150.  Sometimes 200.  (I've heard
of people taking fiber pills.  It seems unlikely that pills can
contain enough fiber to make a difference.  It would be about as
likely as someone getting fat by popping fat pills.  Tablets are
just too small, unless you snarf down hundreds of them daily.)

My "clean your plate" conditioning works *for* me.  Eating the last
10% takes half my eating time, and gives satiety a chance t