# Topic Modeling: Non-Negative Matrix Factorization (NNMF)

This notebook shows how topic discovery and assignment can be done using **non-negative matrix factorization (NNMF)**. The idea is closely related to the **matrix factorization used for collaborative filtering in recommender systems**.

For an introduction to recommender systems with matrix factorization:
- Visit my [Github guide](https://github.com/mxagar/machine_learning_coursera/tree/main/07_Anomaly_Recommender) done after following the Coursera/Stanford course [Machine Learning](https://www.coursera.org/learn/machine-learning) by Andrew Ng.
- Have a look at the summary notes in `./RecommenderSystems_Notes.pdf`

The key idea is that a matrix `A (n x m)` is decomposed as the product of two lower-rank matrices `W (n x k)` and `H (k x m)`, both with non-negative entries, such that the difference `A - W*H` is minimal across all elements; i.e., ideally `A = W*H`.

![Matrix Factorization](../pics/nnmf_decompostion.png)

The elements in the equations are the following:

- `A` is the Document-Term Matrix (data: vectorized documents)
    - `n` (rows): documents
    - `m` (cols): words
- `k`: latent topics, number chosen by us
- `W`: `documents x topics`; unknown topic weights/probabilities associated to each document, initialized with random values, to be discovered
- `H`: `topics x words`; unknown topic weights/probabilities associated to each word, initialized with random values, to be discovered
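
The shapes listed above can be checked with a minimal sketch. The sizes `n`, `m`, `k` and the random values are arbitrary toy choices (assumptions for illustration, not the NPR data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for illustration
n, m, k = 6, 10, 3           # documents, words, latent topics

W = rng.random((n, k))       # documents x topics, non-negative
H = rng.random((k, m))       # topics x words, non-negative

A_approx = W @ H             # documents x words reconstruction

print(A_approx.shape)                        # (6, 10)
print(np.linalg.matrix_rank(A_approx) <= k)  # the product has rank at most k
```

Since `W` and `H` are non-negative, every entry of `W @ H` is non-negative too, which is why the factorization suits count-like or TF-IDF data.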

I think that the main differences with recommender systems are:

- Now, we have a full matrix `A`, no missing values are present.
- The number of words in the vocabulary is expected to be much larger than the number of movies.

Apart from that, the cost to be minimized is equivalent to the one seen in recommender systems, and the values of `W` and `H` are updated similarly:

$$ \min_{W, H \geq 0} J = \min_{W, H \geq 0} \frac{1}{2} \Vert A - WH \Vert_F^{2} = \min_{W, H \geq 0} \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{m} (A_{ij} - (WH)_{ij})^{2}$$
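
To make the cost concrete, here is a small sketch that evaluates the squared-error cost `J` and lowers it with the classic multiplicative update rules. This is an assumption for illustration: it is one well-known way to keep `W` and `H` non-negative, not necessarily what scikit-learn runs (its `NMF` defaults to a coordinate descent solver). The helper names and the toy matrix are hypothetical.

```python
import numpy as np

def frobenius_cost(A, W, H):
    """J = 1/2 * sum of squared element-wise reconstruction errors."""
    return 0.5 * np.sum((A - W @ H) ** 2)

def multiplicative_update(A, W, H, eps=1e-10):
    """One round of multiplicative updates (illustrative sketch only).

    Each factor is multiplied by a non-negative ratio, so entries
    that start non-negative stay non-negative.
    """
    H = H * (W.T @ A) / (W.T @ W @ H + eps)
    W = W * (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(42)
A = rng.random((8, 12))      # toy non-negative "document-term" matrix
W = rng.random((8, 3))       # random non-negative initialization
H = rng.random((3, 12))

costs = [frobenius_cost(A, W, H)]
for _ in range(50):
    W, H = multiplicative_update(A, W, H)
    costs.append(frobenius_cost(A, W, H))

print(f"cost before: {costs[0]:.4f}, after: {costs[-1]:.4f}")
```

Unlike plain gradient descent, these updates need no learning rate and no clipping step to enforce non-negativity.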

Note that, in practice, this notebook is almost identical to the previous one on Latent Dirichlet Allocation; the differences are:

- The TF-IDF matrix is computed with `TfidfVectorizer`, instead of the DTM of raw counts (with `CountVectorizer`). I understand that bounding the value of each document-word pair is a condition for matrix factorization, like the 5-star maximum in movie reviews.
- The non-negative matrix factorization from scikit-learn, `NMF`, is applied to the TF-IDF matrix, instead of `LatentDirichletAllocation`.

Thus, read the previous notebook first and then have a look at this one.

Overview of contents:

1. Load the Dataset
2. Create a TF-IDF Matrix and Fit the Non-Negative Matrix Factorization (NNMF) Model to It
3. Explore the Discovered Topics
4. Assign Topics to Articles

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided with a download link; I haven't found a repository to fork from.*

## 1. Load the Dataset

In [17]:
import pandas as pd

In [18]:
# Load the NPR dataset: around 12k articles; we want to discover and assign topics
npr = pd.read_csv('../data/npr.csv')

## 2. Create a TF-IDF Matrix and Fit the Non-Negative Matrix Factorization (NNMF) Model to It

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# Important parameters of TfidfVectorizer
# max_df: When building the vocabulary 
#   ignore terms that have a document frequency strictly higher than the given threshold,
#   i.e., corpus-specific stop words.
#   If float, the parameter represents a proportion of documents, integer absolute counts.
# min_df: When building the vocabulary
#   ignore terms that have a document frequency strictly lower than the given threshold.
#   This value is also called cut-off in the literature.
#   If float, the parameter represents a proportion of documents, integer absolute counts
# Stop words: we remove them
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [21]:
# We build the Document-Term Matrix
# We don't do a train/test split, because this is unsupervised learning (there are no labels)
dtm = tfidf.fit_transform(npr['Article'])

In [22]:
from sklearn.decomposition import NMF

In [23]:
# Non-Negative Matrix Factorization
# n_components: number of topics
nmf_model = NMF(n_components=7,random_state=42)

In [24]:
# We fit NMF model to our Document-Term Matrix
# This can take a while, we're dealing with a large number of documents!
nmf_model.fit(dtm)



NMF(n_components=7, random_state=42)

## 3. Explore the Discovered Topics

In [26]:
# Get all words in the DTM, i.e., our vocabulary
len(tfidf.get_feature_names_out())

54777

In [28]:
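# Explore some of those vocabulary words
# Note: get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
import random
feature_names = tfidf.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])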

legislator
crusading
tells
unicorns
deleon
indictments
festivals
willie
disquisition
grated




In [29]:
# Matrix with topic-word weights/probabilities: topics x words
nmf_model.components_.shape

(7, 54777)
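
As a hedged sketch of how the matrices from the introduction map to scikit-learn: `fit_transform` returns `W` (documents x topics) and `components_` holds `H` (topics x words). The random toy matrix below is an assumption that stands in for the real DTM:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A_toy = rng.random((6, 10))   # toy stand-in for a documents x words matrix

toy_model = NMF(n_components=2, random_state=42, max_iter=500)
W_toy = toy_model.fit_transform(A_toy)   # documents x topics
H_toy = toy_model.components_            # topics x words

print(W_toy.shape, H_toy.shape)          # (6, 2) (2, 10)
print((W_toy >= 0).all() and (H_toy >= 0).all())
```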

In [31]:
# Take a single topic k
k = 0
single_topic = nmf_model.components_[k]

In [32]:
# Get indices that sort this array: [0.9 0.7 0.3] -> [2 1 0]
# These are the word indices ordered according to their weight for topic k
# Watch out for the order: argsort() sorts ascending, so the highest-weight words come last
single_topic.argsort()

array([    0, 27208, 27206, ..., 36283, 54692, 42993])
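
The ascending order of `argsort()` is easy to get wrong, so here is a minimal sketch with hypothetical weights and words (not taken from the fitted model):

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only
weights = np.array([0.9, 0.7, 0.3, 0.8])
vocab = np.array(["health", "care", "tax", "virus"])

order = weights.argsort()       # ascending: [2, 1, 3, 0]
top2 = order[-2:]               # two largest, smallest of them first: [3, 0]
top2_desc = order[::-1][:2]     # two largest, largest first: [0, 3]

print(vocab[top2])              # ['virus' 'health']
print(vocab[top2_desc])         # ['health' 'virus']
```

Slicing with `[-15:]`, as done in this notebook, therefore lists the top words with the most significant one at the end.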

In [35]:
# Get the 15 most significant words
# We can see how we could assign topics/themes, similarly to LDA, but in a different order
# 1: Health
# 2: Politics and Elections
# 3: Politics and Elections
# 4: Security & International Affairs
# 5: Elections
# 6: Lifestyle
# 7: Education
# LDA was
# 1: Economy & Finances
# 2: Military and Security Affairs
# 3: Family & Resources
# 4: Health
# 5: Politics and Elections
# 6: Lifestyle
# 7: Education
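# Fetch the vocabulary once; get_feature_names() was removed in recent scikit-learn versions
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index+1}')
    # argsort() is ascending, so the last 15 indices are the highest-weight words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')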

THE TOP 15 WORDS FOR TOPIC #1
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #2
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #3
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #4
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #5
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #6
['love', 've', 'don', 'al