# Topic Modeling and t-SNE Visualization
Ref:
https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html

**Topic models are a suite of algorithms/statistical models that uncover the hidden topics in a collection of documents.**
Popular topic modeling algorithms include latent semantic analysis (LSA), hierarchical Dirichlet process (HDP), and latent Dirichlet allocation (LDA). Of these, LDA has shown strong results in practice and has therefore been widely adopted. <br>
**gensim library:** a popular Python library for topic modeling (this notebook uses the `lda` package instead)

## t-SNE
t-SNE, or t-distributed stochastic neighbor embedding, is a dimensionality reduction algorithm for visualizing high-dimensional data. It helps work around the fact that humans cannot (at least not yet) perceive vector spaces beyond 3-D.

In [1]:
# Load the data

from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# next cell: build a list of 18,846 news documents as strings,
# lower-casing each one and collapsing runs of whitespace


In [2]:
news = [' '.join(filter(None, raw.lower().split())) for raw in
        newsgroups_train.data + newsgroups_test.data]
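
To see what the list comprehension above does to a single document, here is a minimal sketch on a made-up string. The `normalize` helper and the sample text are illustrative only and are not part of the original notebook. Note that punctuation and digits are kept: the expression only lower-cases the text and collapses whitespace.

```python
def normalize(raw):
    """Lower-case a document and collapse whitespace, mirroring the cell above."""
    # str.split() with no arguments already drops empty strings,
    # so filter(None, ...) is a harmless extra safeguard.
    return ' '.join(filter(None, raw.lower().split()))

sample = "  Hello,\n\tWORLD!  It's   2016. "
cleaned = normalize(sample)
print(cleaned)  # hello, world! it's 2016.
```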

### Training an LDA model

In [3]:
import lda  # if this raises ImportError, run: pip install lda

from sklearn.feature_extraction.text import CountVectorizer

n_topics = 20 # number of topics
n_iter = 500 # number of iterations

# vectorizer: ignore English stopwords & words that appear in fewer than 5 documents
cvectorizer = CountVectorizer(min_df=5, stop_words='english')
cvz = cvectorizer.fit_transform(news)

# train an LDA model
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

INFO:lda:n_documents: 18846
INFO:lda:vocab_size: 24164
INFO:lda:n_words: 1721127
INFO:lda:n_topics: 20
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -21729674
INFO:lda:<10> log likelihood: -15811330
INFO:lda:<20> log likelihood: -15020541
INFO:lda:<30> log likelihood: -14774572
INFO:lda:<40> log likelihood: -14644561
INFO:lda:<50> log likelihood: -14557450
INFO:lda:<60> log likelihood: -14489942
INFO:lda:<70> log likelihood: -14443270
INFO:lda:<80> log likelihood: -14411212
INFO:lda:<90> log likelihood: -14390832
INFO:lda:<100> log likelihood: -14369004
INFO:lda:<110> log likelihood: -14358385
INFO:lda:<120> log likelihood: -14345462
INFO:lda:<130> log likelihood: -14340922
INFO:lda:<140> log likelihood: -14332530
INFO:lda:<150> log likelihood: -14329208
INFO:lda:<160> log likelihood: -14324311
INFO:lda:<170> log likelihood: -14319266
INFO:lda:<180> log likelihood: -14317531
INFO:lda:<190> log likelihood: -14316935
INFO:lda:<2
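
The `min_df=5` argument passed to `CountVectorizer` above filters by document frequency (the number of documents a term appears in), not by total occurrence count. The following pure-Python sketch illustrates that rule on a toy corpus. It is not scikit-learn's implementation: the helper name, the whitespace tokenization, and the documents are all assumptions made for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

toy_docs = [
    "space space space shuttle",
    "space station orbit",
    "hockey game",
    "hockey season game",
]
kept_terms = filter_by_document_frequency(toy_docs, min_df=2)
print(kept_terms)  # ['game', 'hockey', 'space']
```

Here "space" occurs four times in total but is kept only because it appears in two different documents.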

In [13]:
# add a threshold to filter out low-confidence topic assignments;
# otherwise the visualization gets too busy
import numpy as np

threshold = 0.5
_idx = np.amax(X_topics, axis=1) > threshold  # mask of docs whose top topic probability exceeds the threshold
X_topics = X_topics[_idx]
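
As a small illustration of the filtering step above, the sketch below applies the same boolean mask to a made-up document-topic matrix (the values and variable names are hypothetical).

```python
import numpy as np

# Made-up document-topic matrix: 4 documents, 3 topics, each row sums to 1.
doc_topic = np.array([
    [0.80, 0.10, 0.10],   # confident: topic 0
    [0.40, 0.35, 0.25],   # no dominant topic
    [0.05, 0.05, 0.90],   # confident: topic 2
    [0.50, 0.30, 0.20],   # exactly at the threshold, dropped because the test is a strict >
])

threshold = 0.5
keep = np.amax(doc_topic, axis=1) > threshold  # boolean mask, one entry per document
confident = doc_topic[keep]

print(keep)             # [ True False  True False]
print(confident.shape)  # (2, 3)
```

Any other per-document array (for example the raw texts) would need the same mask applied to stay aligned with the filtered rows.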

### Reducing to 2-D with t-SNE

In [14]:
from sklearn.manifold import TSNE

# a t-SNE model
# an angle value close to 1 means sacrificing accuracy for speed
# PCA initialization usually leads to better results
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')

# 20-D -> 2-D
tsne_lda = tsne_model.fit_transform(X_topics)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 9129 samples in 0.007s...
[t-SNE] Computed neighbors for 9129 samples in 3.049s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9129
[t-SNE] Computed conditional probabilities for sample 2000 / 9129
[t-SNE] Computed conditional probabilities for sample 3000 / 9129
[t-SNE] Computed conditional probabilities for sample 4000 / 9129
[t-SNE] Computed conditional probabilities for sample 5000 / 9129
[t-SNE] Computed conditional probabilities for sample 6000 / 9129
[t-SNE] Computed conditional probabilities for sample 7000 / 9129
[t-SNE] Computed conditional probabilities for sample 8000 / 9129
[t-SNE] Computed conditional probabilities for sample 9000 / 9129
[t-SNE] Computed conditional probabilities for sample 9129 / 9129
[t-SNE] Mean sigma: 0.059164
[t-SNE] KL divergence after 250 iterations with early exaggeration: 68.591118
[t-SNE] KL divergence after 1000 iterations: 1.089022
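
The "KL divergence" values in the log above are the t-SNE objective. As a brief reminder, in standard t-SNE notation (not taken from this notebook), $p_{ij}$ are the pairwise similarities in the original 20-D space, $q_{ij}$ the similarities in the 2-D map, and $y_i$ the 2-D coordinates:

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$

Lower is better. The value reported after the first 250 iterations is measured during early exaggeration, when the $p_{ij}$ are artificially inflated, so it is not directly comparable with the final value.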


In [15]:
X_topics

array([[0.00263158, 0.13421053, 0.00263158, ..., 0.00263158, 0.00263158,
        0.00263158],
       [0.02156863, 0.00196078, 0.00196078, ..., 0.02156863, 0.06078431,
        0.00196078],
       [0.721875  , 0.003125  , 0.003125  , ..., 0.003125  , 0.003125  ,
        0.065625  ],
       ...,
       [0.10625   , 0.00208333, 0.00208333, ..., 0.00208333, 0.00208333,
        0.10625   ],
       [0.00169492, 0.00169492, 0.00169492, ..., 0.00169492, 0.08644068,
        0.00169492],
       [0.003125  , 0.003125  , 0.003125  , ..., 0.003125  , 0.003125  ,
        0.003125  ]])

In [16]:
X_topics.shape

(9129, 20)

## Visualizing groups and their keywords

In [17]:
# First we do some setup work (import classes & functions, set params, etc.)
import numpy as np
import bokeh.plotting as bp
from bokeh.plotting import save, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool

n_top_words = 5 # number of keywords we show

# 20 colors
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

In [18]:
# Then we find the most likely topic for each news document
# (kept as a plain list because .index() is called on it later)
_lda_keys = X_topics.argmax(axis=1).tolist()
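
The cell above assigns each document its most likely topic, and those topic indices are later used to look up one color per point. A minimal sketch of both steps on made-up data (the matrix, the 4-color palette, and the variable names are illustrative assumptions):

```python
import numpy as np

# Made-up document-topic matrix (3 documents, 4 topics) and a 4-color palette.
doc_topic = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.2, 0.6, 0.1, 0.1],
])
palette = np.array(["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"])

# Most likely topic per document, computed for all rows at once.
topic_keys = doc_topic.argmax(axis=1).tolist()

# Indexing a NumPy array with a list of integers picks one color per document.
point_colors = palette[topic_keys]

print(topic_keys)    # [0, 3, 1]
print(point_colors)  # ['#1f77b4' '#d62728' '#ff7f0e']
```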

In [19]:
# get top words for each topic
topic_summaries = []
topic_word = lda_model.topic_word_  # all topic words
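vocab = cvectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older versions: get_feature_names()
for i, topic_dist in enumerate(topic_word):
  # the n_top_words most probable words of this topic, most probable first
  topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
  topic_summaries.append(' '.join(topic_words))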
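
The slice `[:-(n_top_words + 1):-1]` in the cell above is compact but easy to misread. The sketch below walks through it on a made-up five-word vocabulary (all names and numbers here are hypothetical).

```python
import numpy as np

# Made-up vocabulary and one topic's word distribution (sums to 1, no ties).
toy_vocab = np.array(["game", "orbit", "team", "nasa", "space"])
toy_topic_dist = np.array([0.04, 0.20, 0.06, 0.30, 0.40])
n_top = 3

order = np.argsort(toy_topic_dist)   # indices from least to most probable
top_idx = order[:-(n_top + 1):-1]    # walk backwards from the end for n_top steps
top_words = toy_vocab[top_idx]

print(' '.join(top_words))  # space nasa orbit
```

An equivalent and arguably more readable form is `order[::-1][:n_top]`.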

In [20]:
output_notebook()

In [21]:
# plot the news (each point representing one news)
title = '20 newsgroups LDA viz'
num_example = len(X_topics)

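plot_lda = bp.figure(width=1400, height=1100,  # called plot_width / plot_height before Bokeh 3.0
                     title=title,
                     tools="pan,wheel_zoom,box_zoom,reset,hover,save",
                     x_axis_type=None, y_axis_type=None, min_border=1)

# keep the topic index of each point in a data source so the hover tool can display it
source = bp.ColumnDataSource({
    'x': tsne_lda[:, 0],
    'y': tsne_lda[:, 1],
    'color': colormap[_lda_keys[:num_example]],
    'topic_key': _lda_keys[:num_example],
})
plot_lda.scatter(x='x', y='y', color='color', source=source)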

In [22]:
show(plot_lda)

In [24]:
# plot the crucial words for each topic
# use the coordinate of the first news document assigned to a topic as the position of its keywords
topic_coord = np.full((X_topics.shape[1], 2), np.nan)
for topic_num in _lda_keys:
  if not np.isnan(topic_coord).any():
    break  # every topic already has a coordinate
  topic_coord[topic_num] = tsne_lda[_lda_keys.index(topic_num)]

# plot crucial words
for i in range(X_topics.shape[1]):
  plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])

# hover tool: @topic_key refers to a column of the scatter's data source
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips = [("topic", "@topic_key")]

# save the plot
#save(plot_lda, '{}.html'.format(title))
show(plot_lda)
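
The loop in the cell above places each topic's keywords at the coordinates of the first document assigned to that topic. A vectorized sketch of the same idea, on made-up inputs and with hypothetical variable names, uses `np.unique(..., return_index=True)`:

```python
import numpy as np

# Made-up inputs: topic index of 6 documents and their 2-D coordinates.
keys = np.array([2, 0, 2, 1, 0, 1])
coords = np.array([
    [0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
    [3.0, 3.0], [4.0, 4.0], [5.0, 5.0],
])
n_topics_toy = 4  # topic 3 has no documents on purpose

# np.unique returns the sorted distinct topics and, with return_index=True,
# the position of the first document assigned to each of them.
topics_present, first_doc = np.unique(keys, return_index=True)

label_coord = np.full((n_topics_toy, 2), np.nan)  # NaN marks "no document for this topic"
label_coord[topics_present] = coords[first_doc]

print(label_coord)
```

Topics with no document above the threshold keep NaN coordinates, which is also what happens in the loop version.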
