<a href="https://colab.research.google.com/github/jamiehadd/Math189AD-MathematicalDataScienceAndTopicModeling/blob/main/tutorials/Nonnegative_Matrix_Factorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Math 189: Nonnegative Matrix Factorization (Our Lens)




In this notebook, we'll explore the application of nonnegative matrix factorization (NMF) on a toy dataset.

##**Activity**

After working through the toy example, you'll have the opportunity to explore NMF on a new dataset!  Try to select a dataset where you believe the outputs of the model will be interpretable and visualizable!

###Tasks

* Divide into small groups (2 or 3 people).
* Explore the code below applying NMF to a toy dataset!
* Identify the dataset you want to explore.
* Think about how this data is formatted (is it a matrix? nonnegative?) and hypothesize about what the outcome of NMF will be!
* Run your NMF model.
* Try to interpret the results.
* Report back with interesting findings!
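
For the "is it a matrix? nonnegative?" question in the tasks above, a quick check along the following lines may help. This is only a sketch: `check_nmf_input` is a hypothetical helper (not part of scikit-learn), the `ratings` data is made up, and the function assumes a dense NumPy array rather than a sparse matrix.

```python
import numpy as np

def check_nmf_input(X):
    """Return a list of problems that would prevent applying NMF to X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    problems = []
    if X.ndim != 2:
        problems.append(f"expected a 2-D matrix, got {X.ndim} dimension(s)")
    if not np.all(np.isfinite(X)):
        problems.append("contains NaN or infinite entries")
    elif np.any(X < 0):
        problems.append(f"contains negative entries (min = {X.min()})")
    return problems

ratings = np.array([[5, 0, 3], [4, 1, 0]])    # made-up nonnegative data
centered = ratings - ratings.mean(axis=0)     # mean-centering introduces negative entries

print(check_nmf_input(ratings))
print(check_nmf_input(centered))
```

Note that common preprocessing steps such as mean-centering or standardizing produce negative entries, so data prepared that way would need a different transformation before NMF applies.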

In [None]:
import numpy as np
import re
from sklearn.decomposition import NMF

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We're exploring the [20 Newsgroups](https://scikit-learn.org/stable/datasets/real_world.html?highlight=newsgroups#the-20-newsgroups-text-dataset) dataset, which is a collection of approximately 20,000 newsgroup (think "email chatroom") documents. The documents are partitioned roughly by subject into 20 subgroups, which fall under seven top-level groups ("alt", "comp", "misc", "rec", "sci", "soc", and "talk"). The dataset is commonly used as an experimental benchmark for document classification and clustering.

In this notebook, we load only 10 subgroups from the five groups "comp", "misc", "rec", "sci", and "talk"; this list of subgroups is called `categories` below.  We remove stopwords and some other frequently used words, remove numbers, and use a common document data transformation called "term frequency-inverse document frequency (tf-idf)" to construct our data matrix.

In [None]:
remove = ('headers','footers','quotes')
stopwords_list = stopwords.words('english')
stopwords_list.extend(['thanks','edu','also','would','one','could','please','really','many','anyone','good','right','get','even','want','must','something','well','much','still','said','stay','away','first','looking','things','try','take','look','make','may','include','thing','like','two','or','etc','phone','oh','email'])

categories = [
 'comp.graphics',
 'comp.sys.mac.hardware',
 'misc.forsale',
 'rec.motorcycles',
 'rec.sport.baseball',
 'sci.med',
 'sci.space',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.religion.misc'
 ]

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)

# remove numbers
data_cleaned = [re.sub(r'\d+','', file) for file in newsgroups_train.data]

vectorizer = TfidfVectorizer(stop_words=stopwords_list)
vectors = vectorizer.fit_transform(data_cleaned).transpose()
idx_to_word = np.array(vectorizer.get_feature_names_out())

X = vectors
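
To make the tf-idf transformation used above less of a black box, here is a minimal NumPy sketch of the same idea on a made-up three-document corpus. It assumes the smoothed idf formula and per-document L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; it skips stopword removal and tokenization details, so it is an illustration and not a replacement for the vectorizer.

```python
import numpy as np

docs = [
    "nasa launch orbit",
    "nasa shuttle launch launch",
    "game team pitching",
]  # made-up toy corpus

vocab = sorted({word for doc in docs for word in doc.split()})

# Term counts: one row per document, one column per word.
counts = np.array(
    [[doc.split().count(word) for word in vocab] for doc in docs], dtype=float
)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
tfidf = counts * idf                              # rare words get larger weights
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each document

X_toy = tfidf.T   # words x documents, matching the transpose used for X above
print(vocab)
print(np.round(X_toy, 2))
```

In the first toy document, "orbit" receives a larger weight than "nasa" even though each appears once, because "orbit" occurs in fewer documents.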

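We'll apply NMF with rank k.  Note that scikit-learn, one of the most widely used Python tools for machine learning, has you declare an NMF model before actually training the model on the data X and accessing the two factor matrices W and H.  Because X has one row per word and one column per document, W has one row per word and one column per topic, while H has one row per topic and one column per document.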

In [None]:
k = 10

# declare the NMF model
model = NMF(n_components=k, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(X)   # fit the model to X and return the factor W (words x topics)
H = model.components_        # access the other factor H (topics x documents)
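
To get a feel for what the fit is doing, the sketch below factors a small toy matrix with the classic multiplicative update rules for the Frobenius-norm objective. This is a swapped-in illustration: it is not the solver scikit-learn runs by default (that is a coordinate descent method), and `nmf_multiplicative`, `X_toy`, `W_toy`, and `H_toy` are hypothetical names used only here.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, seed=0, eps=1e-10):
    """Approximate nonnegative X (m x n) as W @ H with W (m x k) and H (k x n) nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Each update multiplies entries by a nonnegative ratio, so no entry becomes negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative matrix built from two hidden "topics" (rank 2 by construction).
rng = np.random.default_rng(1)
X_toy = rng.random((6, 2)) @ rng.random((2, 5))

W_toy, H_toy = nmf_multiplicative(X_toy, k=2)
rel_error = np.linalg.norm(X_toy - W_toy @ H_toy) / np.linalg.norm(X_toy)

print(W_toy.shape, H_toy.shape)   # (6, 2) (2, 5)
print(rel_error)
```

The shapes mirror the notebook's setup: with rows as words and columns as documents, the first factor describes words by topic and the second describes topics by document.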

Finally, we print out the top 10 keywords of each topic below!  Investigate these topics and see if you can identify the group and subgroup to which each one most closely relates.

In [None]:
num_keywords = 10

keywords = np.empty((num_keywords+2,k), dtype=object)

for i in range(keywords.shape[1]):
    keywords[0,i] = "Topic " + str(i+1)
    keywords[1,i] = "-------"

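for i in range(W.shape[1]):
    col = W[:,i]                         # weights of every word in topic i
    top = col.argsort()                  # word indices sorted by increasing weight
    top = top[-num_keywords:][::-1]      # keep the num_keywords largest, heaviest first

    keywords[2:,i] = idx_to_word[top]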

col_widths = [max([len(keywords[i][j]) for i in range(keywords.shape[0])])+2 for j in range(keywords.shape[1])]
for row in keywords:
    print("".join(row[i].ljust(col_widths[i]) for i in range(len(row))))

Topic 1  Topic 2   Topic 3     Topic 4   Topic 5  Topic 6  Topic 7   Topic 8   Topic 9     Topic 10    
-------  -------   -------     -------   -------  -------  -------   -------   -------     -------     
people   year      sale        israel    card     space    graphics  drive     armenian    geb         
think    game      shipping    israeli   mac      nasa     files     hard      armenians   dsl         
know     team      offer       jews      monitor  shuttle  file      scsi      turkish     chastity    
gun      games     new         arab      video    launch   image     disk      genocide    njxp        
say      runs      condition   arabs     mb       orbit    know      internal  armenia     cadre       
time     last      asking      lebanese  apple    moon     program   drives    turkey      skepticism  
see      pitching  price       lebanon   simms    lunar    software  floppy    turks       pitt        
us       baseball  sell        peace     color    earth    mail 

What topics make sense?  Any confusing ones?  You likely have a topic with the words "chastity", "skepticism", "intellect", and "cadre".  What is up with this topic?  Try to do a little internet sleuthing!  It may surprise you where this common topic came from.
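
One way to check your guesses about which subgroup each topic matches is to assign every document to its strongest topic and count the true labels within each topic. The sketch below uses made-up stand-ins (`H_toy`, `labels_toy`, `label_names`); in this notebook the corresponding objects would be `H`, `newsgroups_train.target`, and `newsgroups_train.target_names`.

```python
import numpy as np

# Hypothetical stand-ins for H (topics x documents) and the true document labels.
H_toy = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.2],
    [0.1, 0.0, 0.7, 0.6, 0.3],
])
labels_toy = np.array([0, 0, 1, 1, 1])
label_names = ["sci.space", "rec.sport.baseball"]

dominant_topic = H_toy.argmax(axis=0)   # strongest topic for each document

# counts[t, c] = number of documents with dominant topic t and true label c
counts = np.zeros((H_toy.shape[0], len(label_names)), dtype=int)
np.add.at(counts, (dominant_topic, labels_toy), 1)

for t, row in enumerate(counts):
    print(f"Topic {t + 1}: most documents come from {label_names[row.argmax()]}")
```

A topic whose documents are spread evenly across many labels, or concentrated in very few documents, is a good candidate for closer inspection.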