# Wikipedia training

In this tutorial we will:
 - Learn how to train the NMF topic model on the English Wikipedia corpus
 - Compare it with the LDA model
 - Evaluate results

In [1]:
%load_ext autoreload
%autoreload 2

import itertools
import json
import logging
import numpy as np
import pandas as pd
import scipy.sparse
import time
from tqdm import tqdm, tqdm_notebook

import gensim.downloader as api
from gensim import matutils
from gensim.corpora import MmCorpus, Dictionary
from gensim.models import LdaModel, CoherenceModel
from gensim.models.nmf import Nmf
from gensim.parsing.preprocessing import preprocess_string

tqdm.pandas()

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

## Preprocessing

### Load Wikipedia dump
Let's use `gensim.downloader` (imported above as `api`) for that

In [2]:
data = api.load("wiki-english-20171001")
for article in data:
    for section_title, section_text in zip(article['section_titles'],
                                           article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
    break

Section title: Introduction
Section text: 




'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful.

While anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including—but not limited to—the state system. Anarchism is usually considered a far-left ideology and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism or participatory economics.

Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types and traditions of 

### Preprocess and save articles

In [3]:
def wiki_articles_iterator():
    for article in tqdm_notebook(data):
        yield (
            preprocess_string(
                " ".join(
                    " ".join(section)
                    for section
                    in zip(article['section_titles'], article['section_texts'])
                )
            )
        )

def save_preprocessed_articles(filename, articles):
    with open(filename, 'w+') as writer:
        for article in tqdm_notebook(articles):
            writer.write(
                json.dumps(
                    preprocess_string(
                        " ".join(
                            " ".join(section)
                            for section
                            in zip(article['section_titles'],
                                   article['section_texts'])
                        )
                    )
                ) + '\n'
            )


def get_preprocessed_articles(filename):
    with open(filename, 'r') as reader:
        for line in tqdm_notebook(reader):
            yield json.loads(
                line
            )
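
The helpers above exchange data through a JSON Lines file: one JSON-encoded token list per line. The sketch below shows that round trip on its own. It is only an illustration: the two-article sample is made up, and `tokenize` is a plain lowercase-and-split stand-in for `preprocess_string`, so no stemming or stop-word removal happens here.

```python
import json
import os
import tempfile

# Hypothetical sample mimicking the 'section_titles' / 'section_texts' layout.
sample_articles = [
    {"section_titles": ["Introduction", "History"],
     "section_texts": ["Anarchism is a political philosophy.", "Early roots."]},
    {"section_titles": ["Introduction"],
     "section_texts": ["Autism is a developmental disorder."]},
]


def join_sections(article):
    # Pair each title with its text, then flatten everything into one string.
    return " ".join(
        " ".join(section)
        for section in zip(article["section_titles"], article["section_texts"])
    )


def tokenize(text):
    # Stand-in for gensim's preprocess_string (assumption: no stemming here).
    return text.lower().split()


path = os.path.join(tempfile.mkdtemp(), "sample_articles.jsonlines")

# Write one JSON list of tokens per line.
with open(path, "w") as writer:
    for article in sample_articles:
        writer.write(json.dumps(tokenize(join_sections(article))) + "\n")

# Read the file back, one article per line.
with open(path) as reader:
    restored = [json.loads(line) for line in reader]

print(restored[0][:4])
```

Keeping one article per line means the file can be streamed without loading the whole dump into memory, which is what `get_preprocessed_articles` relies on.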

In [4]:
# save_preprocessed_articles('wiki_articles.jsonlines', data)

### Create and save dictionary

In [5]:
# dictionary = Dictionary(get_preprocessed_articles('wiki_articles.jsonlines'))

# dictionary.save('wiki.dict')

### Load and filter dictionary

In [6]:
dictionary = Dictionary.load('wiki.dict')
dictionary.filter_extremes(keep_n=20000)
dictionary.compactify()

2019-01-15 14:02:25,430 : INFO : loading Dictionary object from wiki.dict
2019-01-15 14:02:26,296 : INFO : loaded wiki.dict
2019-01-15 14:02:28,498 : INFO : discarding 1990258 tokens: [('abdelrahim', 49), ('abstention', 120), ('ammon', 1736), ('amoureus', 359), ('amoureux', 566), ('amparo', 1178), ('anarcha', 101), ('anarchica', 40), ('anarcho', 1433), ('anarchosyndicalist', 20)]...
2019-01-15 14:02:28,499 : INFO : keeping 20000 tokens which were in no less than 5 and no more than 2462447 (=50.0%) documents
2019-01-15 14:02:28,738 : INFO : resulting dictionary: Dictionary(20000 unique tokens: ['abandon', 'abil', 'abl', 'abolit', 'abstent']...)
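
The log above reports the thresholds that were applied: tokens had to appear in at least 5 documents and in no more than 50% of them, and only the 20000 most frequent survivors were kept. The following is a simplified pure-Python sketch of that kind of document-frequency filter, not gensim's implementation. The function name, the toy documents, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=20000):
    """Return the tokens a document-frequency filter would keep."""
    # Document frequency: in how many documents each token appears.
    dfs = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)

    # Drop tokens that are too rare or too common.
    kept = [tok for tok, df in dfs.items() if no_below <= df <= max_docs]

    # Of the survivors, keep the keep_n tokens with the highest frequency
    # (ties broken alphabetically, an assumption of this sketch).
    kept.sort(key=lambda tok: (-dfs[tok], tok))
    return kept[:keep_n]


# Toy corpus of already preprocessed documents (hypothetical).
docs = [
    ["state", "anarch", "theori"],
    ["state", "school", "theori"],
    ["state", "school", "leagu"],
    ["state", "leagu", "season"],
]
result = filter_extremes_sketch(docs, no_below=2, no_above=0.5, keep_n=2)
print(result)
```

In the toy corpus, "state" is dropped because it occurs in every document, and "anarch" and "season" are dropped because they occur only once.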


### MmCorpus wrapper
This wrapper lets us:

- Shuffle the documents
- Split the corpus into train and test sets without rewriting it

In [7]:
class RandomCorpus(MmCorpus):
    def __init__(self, random_seed=42, testset=False, testsize=1000, *args,
                 **kwargs):
        super().__init__(*args, **kwargs)

        random_state = np.random.RandomState(random_seed)
        # TODO: remove the `[:4000]` cap below before running on the full corpus
        self.indices = random_state.permutation(range(self.num_docs))[:4000]
        if testset:
            self.indices = self.indices[:testsize]
        else:
            self.indices = self.indices[testsize:]

    def __iter__(self):
        for doc_id in self.indices:
            yield self[doc_id]
            
    def __len__(self):
        return len(self.indices)
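
The split works because a fixed seed always yields the same permutation, so the first `testsize` indices and the remainder can be taken in two separate objects without overlapping. Here is a minimal sketch of that idea using only the standard library. `split_indices` is a hypothetical helper, and `random.Random` stands in for `np.random.RandomState`, so the actual ordering differs from the notebook's.

```python
import random


def split_indices(num_docs, random_seed=42, testset=False, testsize=1000):
    # The same seed always produces the same permutation ...
    indices = list(range(num_docs))
    random.Random(random_seed).shuffle(indices)
    # ... so the first `testsize` positions and the rest never overlap.
    return indices[:testsize] if testset else indices[testsize:]


train = split_indices(10, random_seed=42, testset=False, testsize=3)
test = split_indices(10, random_seed=42, testset=True, testsize=3)

print(len(train), len(test))
print(set(train) & set(test))
```

Using different seeds for the two calls would break this guarantee, which is why both corpora should be created with the same `random_seed` and `testsize`.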

### Create and save corpus

In [8]:
# corpus = (
#     dictionary.doc2bow(article)
#     for article
#     in get_preprocessed_articles('wiki_articles.jsonlines')
# )

# RandomCorpus.serialize('wiki.mm', corpus)

### Load train and test corpus
Both splits use the `RandomCorpus` wrapper with the same seed and test size, so they do not overlap

In [9]:
train_corpus = RandomCorpus(
    random_seed=42, testset=False, testsize=2000, fname='wiki.mm'
)
test_corpus = RandomCorpus(
    random_seed=42, testset=True, testsize=2000, fname='wiki.mm'
)

2019-01-15 14:02:29,428 : INFO : loaded corpus index from wiki.mm.index
2019-01-15 14:02:29,428 : INFO : initializing cython corpus reader from wiki.mm
2019-01-15 14:02:29,429 : INFO : accepted corpus with 4924894 documents, 20000 features, 629448427 non-zero entries
2019-01-15 14:02:30,774 : INFO : loaded corpus index from wiki.mm.index
2019-01-15 14:02:30,775 : INFO : initializing cython corpus reader from wiki.mm
2019-01-15 14:02:30,775 : INFO : accepted corpus with 4924894 documents, 20000 features, 629448427 non-zero entries


## Metrics

In [10]:
def get_execution_time(func):
    start = time.time()

    result = func()

    return (time.time() - start), result


def get_tm_metrics(model, test_corpus):
    W = model.get_topics().T
    H = np.zeros((model.num_topics, len(test_corpus)))
    # Fill H column by column with each test document's topic weights
    for bow_id, bow in enumerate(test_corpus):
        for topic_id, weight in model.get_document_topics(bow):
            H[topic_id, bow_id] = weight

    pred_factors = W.dot(H)
    pred_factors /= pred_factors.sum(axis=0)
    
    dense_corpus = matutils.corpus2dense(test_corpus, pred_factors.shape[0])

    perplexity = get_tm_perplexity(pred_factors, dense_corpus)

    l2_norm = get_tm_l2_norm(pred_factors, dense_corpus)

    model.normalize = True

    coherence = CoherenceModel(
        model=model,
        corpus=test_corpus,
        coherence='u_mass'
    ).get_coherence()

    topics = model.show_topics()

    model.normalize = False

    return dict(
        perplexity=perplexity,
        coherence=coherence,
        topics=topics,
        l2_norm=l2_norm,
    )


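def get_tm_perplexity(pred_factors, dense_corpus):
    # Pre-fill with zeros so entries skipped by `where` are not left uninitialized
    log_pred = np.log(
        pred_factors, out=np.zeros_like(pred_factors), where=pred_factors > 0
    )
    return np.exp(-(log_pred * dense_corpus).sum() / dense_corpus.sum())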


def get_tm_l2_norm(pred_factors, dense_corpus):
    return np.linalg.norm(dense_corpus / dense_corpus.sum(axis=0) - pred_factors)
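
To make the two matrix metrics concrete, here is a small pure-Python sketch of the same formulas on toy data. The nested lists stand in for the NumPy arrays and all numbers are made up. Perplexity is the exponential of the negative average log-probability per observed word, and the L2 norm is the Frobenius distance between the column-normalised counts and the predictions.

```python
import math


def perplexity(pred, counts):
    # pred[w][d]: predicted probability of word w in document d (columns sum to 1).
    # counts[w][d]: observed count of word w in document d.
    log_likelihood = sum(
        c * math.log(p)
        for pred_row, count_row in zip(pred, counts)
        for p, c in zip(pred_row, count_row)
        if p > 0
    )
    total = sum(sum(row) for row in counts)
    return math.exp(-log_likelihood / total)


def l2_norm(pred, counts):
    # Frobenius norm between column-normalised counts and predictions.
    col_sums = [sum(col) for col in zip(*counts)]
    return math.sqrt(sum(
        (c / col_sums[d] - p) ** 2
        for pred_row, count_row in zip(pred, counts)
        for d, (p, c) in enumerate(zip(pred_row, count_row))
    ))


# Toy data: 4 words, 2 documents (hypothetical numbers).
counts = [[2, 0], [1, 1], [1, 3], [0, 0]]
uniform = [[0.25, 0.25] for _ in range(4)]
exact = [[0.5, 0.0], [0.25, 0.25], [0.25, 0.75], [0.0, 0.0]]

print(perplexity(uniform, counts))  # a uniform guess over 4 words
print(l2_norm(exact, counts))       # predictions equal to the empirical distribution
```

A uniform guess over a vocabulary of size V has perplexity V, which gives a useful upper reference point when reading the numbers in the results table.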

In [11]:
tm_metrics = pd.DataFrame()

### Define common params for models

In [12]:
params = dict(
    corpus=train_corpus,
    chunksize=2000,
    num_topics=50,
    id2word=dictionary,
    passes=1,
    eval_every=10,
    minimum_probability=0,
    random_state=42,
)

## Training

### Train NMF and save it
Normalization is turned off to compute metrics correctly

In [13]:
row = dict()
row['model'] = 'nmf'
row['train_time'], nmf = get_execution_time(
    lambda: Nmf(
        use_r=False,
        normalize=False,
        **params
    )
)
nmf.save('nmf.model')

2019-01-15 14:02:37,902 : INFO : Loss (no outliers): 1913.454696981355	Loss (with outliers): 1913.454696981355
2019-01-15 14:02:37,913 : INFO : saving Nmf object under nmf.model, separately None
2019-01-15 14:02:37,981 : INFO : saved nmf.model


### Load NMF and get metrics

In [14]:
nmf = Nmf.load('nmf.model')
row.update(get_tm_metrics(nmf, test_corpus))
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)

nmf.show_topics(50)

2019-01-15 14:02:38,003 : INFO : loading Nmf object from nmf.model
2019-01-15 14:02:38,038 : INFO : loading id2word recursively from nmf.model.id2word.* with mmap=None
2019-01-15 14:02:38,039 : INFO : loaded nmf.model
2019-01-15 14:02:55,613 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2019-01-15 14:02:55,738 : INFO : CorpusAccumulator accumulated stats from 2000 documents


[(0,
  '0.009*"cell" + 0.009*"linear" + 0.009*"highwai" + 0.009*"base" + 0.007*"open" + 0.007*"air" + 0.005*"set" + 0.005*"site" + 0.005*"neat" + 0.004*"april"'),
 (1,
  '0.019*"secur" + 0.016*"state" + 0.011*"resolut" + 0.011*"wayn" + 0.011*"john" + 0.009*"school" + 0.008*"design" + 0.008*"nation" + 0.007*"yard" + 0.007*"israel"'),
 (2,
  '0.023*"univers" + 0.023*"ukrainian" + 0.017*"doctor" + 0.015*"nation" + 0.012*"ukrain" + 0.011*"women" + 0.011*"wear" + 0.010*"million" + 0.010*"hood" + 0.009*"line"'),
 (3,
  '0.016*"sequenc" + 0.014*"power" + 0.012*"ukrainian" + 0.011*"linear" + 0.009*"new" + 0.009*"arena" + 0.009*"number" + 0.008*"telephon" + 0.008*"switch" + 0.008*"boi"'),
 (4,
  '0.023*"design" + 0.017*"intellig" + 0.011*"song" + 0.010*"glori" + 0.009*"def" + 0.008*"decis" + 0.007*"final" + 0.007*"tournament" + 0.006*"time" + 0.006*"wai"'),
 (5,
  '0.015*"switch" + 0.014*"col" + 0.010*"divis" + 0.009*"learn" + 0.009*"new" + 0.008*"port" + 0.008*"warrant" + 0.007*"compani" + 0.0

### Train NMF with residuals and save it
Residuals add regularization to the model, which can improve quality but slows down training

In [15]:
row = dict()
row['model'] = 'nmf_with_r'
row['train_time'], nmf_with_r = get_execution_time(
    lambda: Nmf(
        use_r=True,
        lambda_=200,
        normalize=False,
        **params
    )
)
nmf_with_r.save('nmf_with_r.model')

2019-01-15 14:04:09,187 : INFO : Loss (no outliers): 1991.7502435710787	Loss (with outliers): 1910.464771951029
2019-01-15 14:04:09,199 : INFO : saving Nmf object under nmf_with_r.model, separately None
2019-01-15 14:04:09,200 : INFO : storing scipy.sparse array '_r' under nmf_with_r.model._r.npy
2019-01-15 14:04:11,249 : INFO : saved nmf_with_r.model


### Load NMF with residuals and get metrics

In [16]:
nmf_with_r = Nmf.load('nmf_with_r.model')
row.update(get_tm_metrics(nmf_with_r, test_corpus))
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)

nmf_with_r.show_topics(50)

2019-01-15 14:04:11,266 : INFO : loading Nmf object from nmf_with_r.model
2019-01-15 14:04:11,301 : INFO : loading id2word recursively from nmf_with_r.model.id2word.* with mmap=None
2019-01-15 14:04:11,302 : INFO : loading _r from nmf_with_r.model._r.npy with mmap=None
2019-01-15 14:04:11,391 : INFO : loaded nmf_with_r.model
2019-01-15 14:04:33,763 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2019-01-15 14:04:33,887 : INFO : CorpusAccumulator accumulated stats from 2000 documents


[(0,
  '0.011*"cell" + 0.010*"highwai" + 0.010*"base" + 0.009*"open" + 0.008*"air" + 0.007*"set" + 0.006*"site" + 0.005*"squadron" + 0.005*"german" + 0.004*"interchang"'),
 (1,
  '0.020*"intellig" + 0.020*"design" + 0.019*"state" + 0.018*"secur" + 0.011*"wayn" + 0.011*"school" + 0.011*"john" + 0.010*"resolut" + 0.009*"scienc" + 0.007*"nation"'),
 (2,
  '0.024*"univers" + 0.023*"ukrainian" + 0.017*"doctor" + 0.015*"nation" + 0.012*"ukrain" + 0.012*"women" + 0.011*"wear" + 0.011*"million" + 0.011*"hood" + 0.010*"line"'),
 (3,
  '0.016*"sequenc" + 0.015*"power" + 0.013*"ukrainian" + 0.011*"new" + 0.010*"arena" + 0.009*"number" + 0.009*"telephon" + 0.009*"switch" + 0.009*"boi" + 0.007*"nation"'),
 (4,
  '0.018*"design" + 0.016*"intellig" + 0.011*"song" + 0.010*"glori" + 0.010*"def" + 0.008*"decis" + 0.007*"tournament" + 0.007*"final" + 0.006*"time" + 0.006*"wai"'),
 (5,
  '0.016*"switch" + 0.013*"col" + 0.010*"divis" + 0.009*"learn" + 0.008*"new" + 0.008*"port" + 0.008*"warrant" + 0.008*"a

### Train LDA and save it
LDA is a commonly used topic model, so it serves as the baseline here

In [17]:
row = dict()
row['model'] = 'lda'
row['train_time'], lda = get_execution_time(
    lambda: LdaModel(**params)
)
lda.save('lda.model')

2019-01-15 14:04:34,008 : INFO : using symmetric alpha at 0.02
2019-01-15 14:04:34,008 : INFO : using symmetric eta at 0.02
2019-01-15 14:04:34,011 : INFO : using serial LDA version on this node
2019-01-15 14:04:34,132 : INFO : running online (single-pass) LDA training, 50 topics, 1 passes over the supplied corpus of 2000 documents, updating model once every 2000 documents, evaluating perplexity every 2000 documents, iterating 50x with a convergence threshold of 0.001000
2019-01-15 14:04:39,040 : INFO : -15.341 per-word bound, 41496.7 perplexity estimate based on a held-out corpus of 2000 documents with 515952 words
2019-01-15 14:04:39,041 : INFO : PROGRESS: pass 0, at document #2000/2000
2019-01-15 14:04:42,076 : INFO : topic #48 (0.020): 0.006*"state" + 0.004*"award" + 0.004*"parti" + 0.004*"game" + 0.003*"syke" + 0.003*"new" + 0.003*"year" + 0.003*"includ" + 0.003*"plane" + 0.003*"unit"
2019-01-15 14:04:42,077 : INFO : topic #31 (0.020): 0.005*"time" + 0.005*"new" + 0.005*"year" + 0

### Load LDA and get metrics

In [18]:
lda = LdaModel.load('lda.model')
row.update(get_tm_metrics(lda, test_corpus))
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
tm_metrics = pd.concat([tm_metrics, pd.DataFrame([row])], ignore_index=True)

lda.show_topics(50)

2019-01-15 14:04:42,171 : INFO : loading LdaModel object from lda.model
2019-01-15 14:04:42,172 : INFO : loading expElogbeta from lda.model.expElogbeta.npy with mmap=None
2019-01-15 14:04:42,175 : INFO : setting ignored attribute state to None
2019-01-15 14:04:42,175 : INFO : setting ignored attribute id2word to None
2019-01-15 14:04:42,176 : INFO : setting ignored attribute dispatcher to None
2019-01-15 14:04:42,177 : INFO : loaded lda.model
2019-01-15 14:04:42,178 : INFO : loading LdaState object from lda.model.state
2019-01-15 14:04:42,197 : INFO : loaded lda.model.state
2019-01-15 14:04:47,706 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2019-01-15 14:04:47,822 : INFO : CorpusAccumulator accumulated stats from 2000 documents


[(0,
  '0.006*"world" + 0.006*"new" + 0.005*"won" + 0.004*"time" + 0.004*"leagu" + 0.004*"season" + 0.003*"place" + 0.003*"hous" + 0.003*"univers" + 0.003*"team"'),
 (1,
  '0.006*"team" + 0.004*"year" + 0.004*"decemb" + 0.003*"season" + 0.003*"time" + 0.003*"minist" + 0.003*"state" + 0.003*"octob" + 0.003*"nation" + 0.003*"new"'),
 (2,
  '0.004*"school" + 0.004*"year" + 0.004*"includ" + 0.004*"new" + 0.004*"time" + 0.003*"state" + 0.003*"world" + 0.003*"univers" + 0.003*"later" + 0.003*"citi"'),
 (3,
  '0.006*"new" + 0.005*"svg" + 0.004*"imag" + 0.004*"time" + 0.004*"symbol" + 0.004*"year" + 0.004*"citi" + 0.004*"state" + 0.003*"team" + 0.003*"church"'),
 (4,
  '0.005*"new" + 0.004*"american" + 0.004*"state" + 0.004*"nation" + 0.003*"unit" + 0.003*"born" + 0.003*"includ" + 0.003*"page" + 0.003*"open" + 0.003*"member"'),
 (5,
  '0.011*"dai" + 0.010*"hous" + 0.007*"group" + 0.007*"nomin" + 0.005*"main" + 0.005*"glass" + 0.005*"stori" + 0.004*"new" + 0.004*"evict" + 0.004*"week"'),
 (6,
 

## Results

In [19]:
tm_metrics

| | coherence | l2_norm | model | perplexity | topics | train_time |
|---|---|---|---|---|---|---|
| 0 | -5.210743 | 7.915856 | nmf | 244.153459 | `[(21, 0.042*"finn" + 0.032*"keeper" + 0.026*"d...` | 6.285688 |
| 1 | -5.077906 | 7.920052 | nmf_with_r | 248.054745 | `[(21, 0.042*"finn" + 0.032*"keeper" + 0.026*"d...` | 73.330969 |
| 2 | -1.914974 | 7.987976 | lda | 4259.619466 | `[(49, 0.010*"linear" + 0.005*"neat" + 0.005*"m...` | 8.085454 |


In [20]:
for row_idx, row in tm_metrics.iterrows():
    print('='*20)
    print(row['model'])
    print('='*20)
    print()
    print("\n\n".join(str(topic) for topic in row['topics']))
    print('\n')

nmf

(21, '0.042*"finn" + 0.032*"keeper" + 0.026*"disnei" + 0.022*"cinema" + 0.021*"overtak" + 0.020*"act" + 0.017*"art" + 0.016*"million" + 0.015*"insid" + 0.014*"rover"')

(33, '0.197*"linear" + 0.102*"neat" + 0.089*"palomar" + 0.058*"april" + 0.055*"februari" + 0.047*"march" + 0.037*"august" + 0.036*"septemb" + 0.023*"octob" + 0.021*"mesa"')

(13, '0.041*"hous" + 0.034*"dai" + 0.031*"develop" + 0.030*"vote" + 0.027*"women" + 0.020*"nomin" + 0.020*"elect" + 0.018*"gender" + 0.017*"glass" + 0.016*"ballot"')

(5, '0.024*"switch" + 0.023*"col" + 0.016*"divis" + 0.014*"learn" + 0.014*"new" + 0.013*"port" + 0.013*"warrant" + 0.012*"compani" + 0.012*"action" + 0.011*"maj"')

(26, '0.039*"club" + 0.032*"season" + 0.029*"leagu" + 0.025*"svg" + 0.022*"imag" + 0.021*"symbol" + 0.021*"divis" + 0.017*"john" + 0.016*"rover" + 0.015*"premier"')

(34, '0.041*"determin" + 0.026*"matrix" + 0.017*"finn" + 0.015*"team" + 0.013*"keeper" + 0.011*"column" + 0.010*"matric" + 0.010*"myth" + 0.010*"disnei" +

`DISCLAIMER: this section will be edited when run on full corpus`

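As we can see, on this 2000-document sample NMF without residuals trains faster than LDA (about 6.3 s versus 8.1 s) and reaches a far lower perplexity (about 244 versus 4260) with a nearly identical L2 norm, although LDA's `u_mass` coherence is closer to zero (-1.91 versus -5.21). Enabling residuals slows NMF training to about 73 s without a clear gain in these metrics.

Moreover, NMF can be flexible in its RAM usage thanks to its sparsity option, which keeps only a small number of elements in the internal matrices.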