# Tutorial on Online Non-Negative Matrix Factorization

This notebook explains the basic ideas behind the NMF implementation and walks through training examples and use cases.

**Matrix Factorizations** are useful for many things: recommendation systems, bi-clustering, image compression and, in particular, topic modeling.

Why **Non-Negative**? The constraint makes the problem stricter and allows us to apply some optimizations.
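
To make the factorization concrete, here is a minimal NumPy sketch that approximates a small non-negative matrix `V` as `W @ H` using the classic multiplicative updates of Lee and Seung. This is a textbook method swapped in for illustration only, not the online algorithm that Gensim implements; the function name `nmf_multiplicative` and the toy matrix are assumptions.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy "term-document" matrix: rows 0-1 and rows 2-3 use disjoint columns.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_multiplicative(V, rank=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In topic-modeling terms, the columns of `W` play the role of topics (weights over terms) and the columns of `H` give each document's topic weights.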

Why **Online**? Because corpora are large and RAM is limited. Online NMF learns topics iteratively, one chunk of documents at a time.
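
The streaming idea can be sketched in plain Python: instead of materializing the whole corpus, a model consumes fixed-size chunks from an iterator. The helper `iter_chunks` below is a hypothetical name used for illustration and is not part of Gensim's API.

```python
from itertools import islice


def iter_chunks(documents, chunksize):
    """Yield successive lists of at most `chunksize` documents."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Any iterable works, including a generator that reads documents lazily.
stream = (f"document {i}" for i in range(7))
chunk_sizes = [len(chunk) for chunk in iter_chunks(stream, chunksize=3)]
```

Only one chunk is held in memory at a time, which is what makes training on corpora larger than RAM feasible.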

This particular implementation is based on [this paper](https://arxiv.org/abs/1604.02634).

## Training

In [18]:
import numpy as np

from gensim import matutils
from gensim.models.nmf import Nmf
from gensim.models import CoherenceModel
from gensim.parsing.preprocessing import preprocess_string
from sklearn.datasets import fetch_20newsgroups

### Dataset preprocessing

In [19]:
categories = [
    'alt.atheism',
    'comp.graphics',
    'rec.motorcycles',
    'talk.politics.mideast',
    'sci.space'
]

trainset = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
testset = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

train_documents = [preprocess_string(doc) for doc in trainset.data]
test_documents = [preprocess_string(doc) for doc in testset.data]

### Dictionary compilation

In [20]:
from gensim.corpora import Dictionary

dictionary = Dictionary(train_documents)

dictionary.filter_extremes()
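
Called without arguments, `filter_extremes` relies on its defaults, which in current Gensim releases are `no_below=5`, `no_above=0.5` and `keep_n=100000`: tokens that appear in fewer than 5 documents or in more than half of all documents are dropped, and at most 100,000 of the most frequent remaining tokens are kept. The plain-Python sketch below mimics that document-frequency filter on a toy corpus; `filter_by_doc_freq` is a hypothetical helper, not Gensim's implementation.

```python
from collections import Counter


def filter_by_doc_freq(documents, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive a document-frequency filter."""
    # Count in how many documents each token appears (not total occurrences).
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * len(documents)
    kept = [
        token for token, df in doc_freq.items()
        if no_below <= df <= max_docs
    ]
    # Keep at most `keep_n` tokens, preferring the most frequent ones.
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return set(kept[:keep_n])


toy_documents = [
    ["space", "launch", "the"],
    ["space", "orbit", "the"],
    ["god", "exist", "the"],
    ["god", "believ", "the"],
]
# Small thresholds so the effect is visible on four documents.
vocabulary = filter_by_doc_freq(toy_documents, no_below=2, no_above=0.5)
```

Here `"the"` is removed for being too common and the tokens that occur in a single document are removed for being too rare.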

### Corpora compilation

In [21]:
train_corpus = [
    dictionary.doc2bow(document)
    for document
    in train_documents
]

test_corpus = [
    dictionary.doc2bow(document)
    for document
    in test_documents
]
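
`doc2bow` turns a token list into a sparse bag-of-words vector: a sorted list of `(token_id, count)` pairs, with tokens missing from the dictionary silently ignored. A minimal plain-Python sketch of that conversion, assuming a hypothetical `token2id` mapping in place of a real `Dictionary`:

```python
from collections import Counter


def doc2bow_sketch(document, token2id):
    """Convert a list of tokens to sorted (token_id, count) pairs."""
    counts = Counter(
        token2id[token] for token in document if token in token2id
    )
    return sorted(counts.items())


token2id = {"god": 0, "space": 1, "launch": 2}
bow = doc2bow_sketch(["space", "launch", "space", "rocket"], token2id)
```

Because unknown tokens are dropped, the test corpus can safely be built with the dictionary compiled from the training documents.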

### Training

The API works similarly to [gensim.models.LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html).

Specific parameters:

- `use_r` - whether to use residuals. Effectively adds regularization to the model.
- `kappa` - optimizer step size coefficient.
- `lambda_` - residuals coefficient. It controls how strongly the result is regularized.
- `sparse_coef` - sparsity coefficient for the internal matrices. The larger it is, the faster and less accurate training is.

In [30]:
%%time

nmf = Nmf(
    corpus=train_corpus,
    chunksize=1000,
    num_topics=5,
    id2word=dictionary,
    passes=5,
    eval_every=10,
    minimum_probability=0,
    random_state=42,
    use_r=True,
    lambda_=1000,
    kappa=1,
    sparse_coef=3
)

CPU times: user 12.4 s, sys: 1.08 s, total: 13.5 s
Wall time: 13.7 s


### Topics

In [31]:
nmf.show_topics()

[(0,
  '0.035*"god" + 0.030*"atheist" + 0.021*"believ" + 0.020*"exist" + 0.019*"atheism" + 0.016*"religion" + 0.013*"christian" + 0.013*"religi" + 0.013*"peopl" + 0.012*"argument"'),
 (1,
  '0.055*"imag" + 0.054*"jpeg" + 0.033*"file" + 0.024*"gif" + 0.021*"color" + 0.019*"format" + 0.015*"program" + 0.014*"version" + 0.013*"bit" + 0.012*"us"'),
 (2,
  '0.053*"space" + 0.034*"launch" + 0.024*"satellit" + 0.017*"nasa" + 0.016*"orbit" + 0.013*"year" + 0.012*"mission" + 0.011*"data" + 0.010*"commerci" + 0.010*"market"'),
 (3,
  '0.022*"armenian" + 0.021*"peopl" + 0.020*"said" + 0.018*"know" + 0.011*"sai" + 0.011*"went" + 0.010*"come" + 0.010*"like" + 0.010*"apart" + 0.009*"azerbaijani"'),
 (4,
  '0.024*"graphic" + 0.017*"pub" + 0.015*"mail" + 0.013*"data" + 0.013*"ftp" + 0.012*"send" + 0.011*"imag" + 0.011*"rai" + 0.010*"object" + 0.010*"com"')]

### Coherence

In [32]:
CoherenceModel(
    model=nmf,
    corpus=test_corpus,
    coherence='u_mass'
).get_coherence()

-1.6698708891486376
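
For intuition about what `u_mass` measures, here is a simplified UMass-style score in plain Python: for each ordered pair of a topic's top words, it takes the log of a smoothed conditional document co-occurrence frequency and averages the results. The function name `umass_sketch`, the `+ 1` smoothing constant and the toy documents are assumptions for illustration; Gensim's coherence pipeline differs in details and will not reproduce the number above.

```python
import math
from itertools import combinations


def umass_sketch(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered top-word pairs.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # combinations() preserves order, so w_j always ranks above w_i.
    for w_j, w_i in combinations(top_words, 2):
        scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j)))
    return sum(scores) / len(scores)


toy_documents = [
    ["space", "launch", "orbit"],
    ["space", "launch"],
    ["space", "god"],
    ["god", "exist"],
]
coherent = umass_sketch(["space", "launch"], toy_documents)
mixed = umass_sketch(["space", "exist"], toy_documents)
```

Top words that co-occur in the same documents score higher than words that never appear together, which is why a higher value indicates more coherent topics.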

### Perplexity

In [None]:
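def perplexity(model, corpus):
    # W: (num_terms, num_topics) term-topic matrix.
    W = model.get_topics().T

    # H: (num_topics, num_docs) topic weights inferred for each document.
    H = np.zeros((W.shape[1], len(corpus)))
    for bow_id, bow in enumerate(corpus):
        for topic_id, proba in model[bow]:
            H[topic_id, bow_id] = proba

    dense_corpus = matutils.corpus2dense(corpus, W.shape[0])

    # Take the log only where the reconstruction is positive. `out` supplies
    # zeros elsewhere; `where` alone would leave those entries uninitialized.
    WH = W.dot(H)
    log_WH = np.log(WH, out=np.zeros_like(WH), where=WH > 0)

    return np.exp(-(log_WH * dense_corpus).sum() / dense_corpus.sum())

perplexity(nmf, test_corpus)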
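
Written out, the function above computes the following quantity, where $V$ is the dense term-document count matrix (`dense_corpus`), $W$ is the term-topic matrix, $H$ is the topic-document matrix, and terms with $(WH)_{wd} = 0$ are excluded from the sum in the numerator:

$$
\mathrm{perplexity} = \exp\left(-\frac{\sum_{w,d} V_{wd}\,\log (WH)_{wd}}{\sum_{w,d} V_{wd}}\right)
$$

Lower values mean the reconstruction $WH$ assigns higher weight to the terms that actually occur in the held-out documents.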