# Topic Modelling

Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics present in a corpus.

Non-negative Matrix Factorization (NMF) can also be used to find topics in text. NMF sometimes produces more meaningful topics for smaller datasets.

While LDA and NMF have differing mathematical underpinnings, both algorithms are able to return the documents that belong to a topic in a corpus and the words that belong to a topic. LDA is based on probabilistic graphical modeling while NMF relies on linear algebra.

Both algorithms take as input a bag-of-words matrix (i.e., each document is a row, and each column holds the count of one vocabulary word in that document). The aim of each algorithm is to produce two smaller matrices, a document-to-topic matrix and a word-to-topic matrix, that when multiplied together reproduce the bag-of-words matrix with the lowest error.
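To make the input format concrete, here is a minimal sketch that builds a bag-of-words matrix by hand for a three-document toy corpus. The corpus and the helper name `bag_of_words` are made up for illustration; the notebook itself relies on scikit-learn's `CountVectorizer`, which also handles tokenization details and stop words that this sketch ignores.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every word a fixed column position.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus
toy_docs = [
    "the team won the game",
    "the disk drive failed",
    "the team replaced the drive",
]
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many entries as there are distinct words in the corpus, which is why real bag-of-words matrices are stored in sparse form.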

Neither NMF nor LDA can determine the number of topics automatically; it must be specified in advance.

NMF tries to learn a latent embedding that captures the information in the matrix in a much smaller space. In its general form, NMF seeks to factor a (non-negative) matrix $M$ into the product of two (non-negative) matrices $W$ and $H$ (or $D$ and $V$ as used in this paper). How does that help us? We can pick some dimension $d$ (controlling the size of the latent space) and break down the matrix $\mathbf{X} \in \mathbb{R}^{n \times t}$ into a $d$-dimensional representation of news articles, $\mathbf{D} \in \mathbb{R}^{n \times d}$, and a $d$-dimensional representation of words in the vocabulary, $\mathbf{V} \in \mathbb{R}^{t \times d}$, so that $\mathbf{X} \approx \mathbf{D}\mathbf{V}^\top$.
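The factorization can be sketched in a few lines of NumPy using the classic multiplicative update rules of Lee and Seung. This is an illustrative sketch only, not how scikit-learn's `NMF` is implemented; the function name `nmf_multiplicative`, the toy matrix, the iteration count, and the small `eps` guard against division by zero are all assumptions.

```python
import numpy as np


def nmf_multiplicative(X, d, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (n x t) by D @ V.T with non-negative D (n x d) and V (t x d)."""
    rng = np.random.default_rng(seed)
    n, t = X.shape
    D = rng.random((n, d))
    V = rng.random((t, d))
    for _ in range(n_iter):
        # Multiplicative updates: entries are only ever rescaled by
        # non-negative ratios, so D and V stay non-negative.
        D *= (X @ V) / (D @ (V.T @ V) + eps)
        V *= (X.T @ D) / (V @ (D.T @ D) + eps)
    return D, V


# Hypothetical 4-document, 5-word count matrix with two obvious "topics"
X = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 0, 4, 1],
    [0, 0, 0, 1, 4],
], dtype=float)

D, V = nmf_multiplicative(X, d=2)
error = np.linalg.norm(X - D @ V.T)
print(D.shape, V.shape)
print(error)
```

Here `D` plays the role of the document-to-topic matrix and `V` the word-to-topic matrix; `error` is the Frobenius norm of the reconstruction residual.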

See [The why and how of nonnegative matrix factorization](https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/) February 18, 2019 and [Beyond news contents: the role of social context for fake news detection](https://blog.acolyer.org/2019/02/13/beyond-news-contents-the-role-of-social-context-for-fake-news-detection/) February 13, 2019.

In [1]:
# from sklearn.datasets import fetch_20newsgroups

In [2]:
# dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
# documents = dataset.data

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [20]:
from pathlib import Path
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [38]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [10]:
DATA_DIR = Path('/Users/d777710/src/DeepLearning/dltemplate/data/20_newsgroups')

In [11]:
DATA_PATH = DATA_DIR / '20_newsgroups.txt'

In [12]:
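def load_data():
    """Read one document per line from DATA_PATH, skipping the first line."""
    texts = []
    with open(DATA_PATH, 'r') as f:
        for i, line in enumerate(f):
            if i == 0:
                # Skip the first line of the file (presumably a header)
                continue

            texts.append(line.rstrip('\n'))

    return texts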

In [13]:
texts = load_data()

In [40]:
len(texts)

11314

In [21]:
# Hyperparameters
n_features = 1000
n_topics = 20

In [18]:
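# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
# Use the `texts` loaded above (`documents` exists only in the commented-out cells)
tfidf = tfidf_vectorizer.fit_transform(texts)
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_feature_names()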

In [19]:
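# LDA can only use raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
# Use the `texts` loaded above (`documents` exists only in the commented-out cells)
tf = tf_vectorizer.fit_transform(texts)
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names()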

In [22]:
# Run NMF
# Note: `alpha` is accepted by older scikit-learn releases; newer ones use alpha_W / alpha_H
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

In [24]:
# Run LDA
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)

In [28]:
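def display_topics(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted words for each topic of a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        print('Topic %d:' % topic_idx)
        # argsort is ascending, so the reversed slice picks the largest weights first
        print(' '.join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))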
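The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. A small sketch with a made-up weight vector shows that it yields the indices of the largest weights, highest first:

```python
import numpy as np

topic = np.array([0.1, 0.7, 0.2, 0.9])  # hypothetical word weights for one topic
n_top_words = 2

ascending = topic.argsort()              # indices ordered by increasing weight
top = ascending[:-n_top_words - 1:-1]    # walk backwards from the end, n_top_words steps
print(top.tolist())                      # [3, 1]

# An equivalent, arguably more readable form (ties aside)
alternative = np.argsort(-topic)[:n_top_words]
print(alternative.tolist())              # [3, 1]
```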

In [30]:
n_top_words = 10

In [31]:
display_topics(nmf, tfidf_feature_names, n_top_words)

Topic 0:
people time right did good said say make way government
Topic 1:
window problem using server application screen display motif manager running
Topic 2:
god jesus bible christ faith believe christian christians sin church
Topic 3:
game team year games season players play hockey win league
Topic 4:
new 00 sale 10 price offer shipping condition 20 15
Topic 5:
thanks mail advance hi looking info help information address appreciated
Topic 6:
windows file files dos program version ftp ms directory running
Topic 7:
edu soon cs university ftp internet article email pub david
Topic 8:
key chip clipper encryption keys escrow government public algorithm nsa
Topic 9:
drive scsi drives hard disk ide floppy controller cd mac
Topic 10:
just ll thought tell oh little fine work wanted mean
Topic 11:
does know anybody mean work say doesn help exist program
Topic 12:
card video monitor cards drivers bus vga driver color memory
Topic 13:
like sounds looks look bike sound lot things really thing
...

In [32]:
display_topics(lda, tf_feature_names, n_top_words)

Topic 0:
people gun state control right guns crime states law police
Topic 1:
time question book years did like don space answer just
Topic 2:
mr line rules science stephanopoulos title current define int yes
Topic 3:
key chip keys clipper encryption number des algorithm use bit
Topic 4:
edu com cs vs w7 cx mail uk 17 send
Topic 5:
use does window problem way used point different case value
Topic 6:
windows thanks know help db does dos problem like using
Topic 7:
bike water effect road design media dod paper like turn
Topic 8:
don just like think know people good ve going say
Topic 9:
car new price good power used air sale offer ground
Topic 10:
file available program edu ftp information files use image version
Topic 11:
ax max b8f g9v a86 145 pl 1d9 0t 34u
Topic 12:
government law privacy security legal encryption court fbi technology information
Topic 13:
card bit memory output video color data mode monitor 16
Topic 14:
drive scsi disk mac hard apple drives controller software port
...
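`display_topics` shows only the words for each topic. The other half, the documents that belong to a topic, comes from the document-to-topic matrix, which scikit-learn returns from `model.transform(...)` with one row per document and one column per topic. The following is a minimal sketch; the helper name `top_documents` and the small stand-in matrix are hypothetical, and on the real data the input would be something like `nmf.transform(tfidf)` or `lda.transform(tf)`.

```python
import numpy as np


def top_documents(doc_topic, n_top_docs=3):
    """Map each topic index to the indices of its highest-weighted documents."""
    doc_topic = np.asarray(doc_topic)
    result = {}
    for topic_idx in range(doc_topic.shape[1]):
        # Sort documents by their weight for this topic, largest first
        order = np.argsort(doc_topic[:, topic_idx])[::-1]
        result[topic_idx] = order[:n_top_docs].tolist()
    return result


# Stand-in for a document-to-topic matrix: 4 documents, 2 topics
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
])
print(top_documents(doc_topic, n_top_docs=2))  # {0: [0, 2], 1: [3, 1]}
```

The returned indices can be used to look up the original entries in `texts` and read a few representative documents per topic.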

In [33]:
import pyLDAvis
# Note: newer pyLDAvis releases (3.4+) expose this module as pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [39]:
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer, mds='tsne')