# Topic Modeling and t-SNE Visualization
Ref:
https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html

**Topic models are a suite of algorithms/statistical models that uncover the hidden topics in a collection of documents.**
Popular topic modeling algorithms include latent semantic analysis (LSA), hierarchical Dirichlet process (HDP), and latent Dirichlet allocation (LDA). Of these, LDA has shown strong results in practice and has therefore been widely adopted. <br>
**gensim library:** a popular Python library for topic modeling.
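
To make the idea of "hidden topics" concrete, here is a minimal sketch of the mixture view behind LDA: each topic is a distribution over words, each document is a mix of topics, and the probability of a word in a document is the topic-weighted average. The vocabulary, the two topics, and all numbers below are hypothetical toy values chosen for illustration, not output of any model in this notebook.

```python
# Hypothetical toy vocabulary and distributions (illustration only).
vocab = ["game", "team", "space", "orbit"]

# topic_word[k][w]: probability of word w under topic k (each row sums to 1)
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # a "sports"-like topic
    [0.05, 0.05, 0.50, 0.40],  # a "space"-like topic
]

# doc_topic[k]: share of topic k in one document (sums to 1)
doc_topic = [0.75, 0.25]


def word_probabilities(doc_topic, topic_word):
    """Mix the topic-word distributions using the document's topic shares."""
    n_words = len(topic_word[0])
    return [
        sum(theta * phi[w] for theta, phi in zip(doc_topic, topic_word))
        for w in range(n_words)
    ]


p_words = word_probabilities(doc_topic, topic_word)
for word, p in zip(vocab, p_words):
    print(f"{word}: {p:.4f}")
```

Training a topic model runs this in reverse: it starts from observed word counts and estimates both the topic-word and document-topic distributions.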

## t-SNE
t-SNE, or t-distributed stochastic neighbor embedding, is a dimensionality reduction algorithm for visualizing high-dimensional data. It is useful because humans cannot (at least not yet) perceive vector spaces beyond 3-D, so high-dimensional points have to be mapped to 2-D or 3-D before they can be plotted.
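
As a rough sketch of what t-SNE optimizes: it turns pairwise distances in the original space into Gaussian-based affinities P, turns distances in the low-dimensional layout into Student-t affinities Q, and looks for a layout that minimizes the KL divergence between P and Q. The code below only evaluates that cost for two hand-made layouts. It is a simplification: it uses one fixed `sigma` (real t-SNE tunes a per-point bandwidth to match a target perplexity) and performs no optimization. The points and function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(points, sigma=1.0):
    """Symmetrized high-dimensional affinities P (fixed sigma: a simplification)."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    # p_ij = (p_j|i + p_i|j) / (2n), so that all entries sum to 1
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def student_t_affinities(points):
    """Low-dimensional affinities Q from a Student-t kernel (1 degree of freedom)."""
    n = len(points)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(points[i], points[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def kl_divergence(P, Q):
    """KL(P || Q), skipping entries where p is zero."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)


# Two tight clusters in 3-D (toy data).
high_dim = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (5.1, 5, 5)]
good_layout = [(0, 0), (0.1, 0), (5, 5), (5.1, 5)]  # keeps the clusters together
bad_layout = [(0, 0), (5, 5), (0.1, 0), (5.1, 5)]   # splits each cluster

P = gaussian_affinities(high_dim)
kl_good = kl_divergence(P, student_t_affinities(good_layout))
kl_bad = kl_divergence(P, student_t_affinities(bad_layout))
print(kl_good < kl_bad)
```

The layout that keeps neighbors together has the lower cost, which is the property the optimizer exploits.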

In [11]:
#Loading data
import os
import argparse

from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# a list of 18,846 cleaned news documents, each a single string
# lowercase the text and collapse runs of whitespace
# (punctuation and digits are not removed here)


In [17]:
news = [' '.join(raw.lower().split()) for raw in
        newsgroups_train.data + newsgroups_test.data]
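
To see what this normalization step does to a single document, here is a small check on a made-up string (the sample text is hypothetical). It also shows that wrapping the split result in `filter(None, ...)` gives the same output, since `str.split()` with no arguments never returns empty strings.

```python
# A made-up raw document with mixed case, tabs, newlines and extra spaces.
raw = "  Hello,\tWorld!\n\nThis   is a TEST.  "

# Lowercase, split on any whitespace, and re-join with single spaces.
cleaned = ' '.join(raw.lower().split())
print(cleaned)  # hello, world! this is a test.

# Same result with an explicit filter for empty tokens.
cleaned_filtered = ' '.join(filter(None, raw.lower().split()))
print(cleaned == cleaned_filtered)  # True
```

Note that punctuation survives this step; only case and whitespace are normalized.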

### Training an LDA model

In [21]:
import lda  # if this import fails, run: pip install lda

from sklearn.feature_extraction.text import CountVectorizer

n_topics = 20 # number of topics
n_iter = 500 # number of iterations

# vectorizer: ignore English stopwords & words that appear in fewer than 5 documents
cvectorizer = CountVectorizer(min_df=5, stop_words='english')
cvz = cvectorizer.fit_transform(news)

# train an LDA model
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

INFO:lda:n_documents: 18846
INFO:lda:vocab_size: 24164
INFO:lda:n_words: 1721127
INFO:lda:n_topics: 20
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -21729674
INFO:lda:<10> log likelihood: -15795438
INFO:lda:<20> log likelihood: -14977411
INFO:lda:<30> log likelihood: -14738638
INFO:lda:<40> log likelihood: -14629942
INFO:lda:<50> log likelihood: -14566211
INFO:lda:<60> log likelihood: -14524718
INFO:lda:<70> log likelihood: -14491442
INFO:lda:<80> log likelihood: -14472611
INFO:lda:<90> log likelihood: -14457321
INFO:lda:<100> log likelihood: -14443151
INFO:lda:<110> log likelihood: -14434205
INFO:lda:<120> log likelihood: -14424413
INFO:lda:<130> log likelihood: -14417094
INFO:lda:<140> log likelihood: -14410215
INFO:lda:<150> log likelihood: -14403969
INFO:lda:<160> log likelihood: -14400224
INFO:lda:<170> log likelihood: -14395670
INFO:lda:<180> log likelihood: -14395191
INFO:lda:<190> log likelihood: -14392798
INFO:lda:<2
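
The `min_df=5` setting passed to `CountVectorizer` in the training cell filters on document frequency (how many documents contain a word), not on the total number of occurrences. The following is a minimal pure-Python sketch of that rule on hypothetical toy documents, using whitespace tokenization and `min_df = 2` for brevity; the real vectorizer tokenizes differently and also removes stopwords.

```python
from collections import Counter

# Hypothetical toy corpus; "dog" occurs three times but in only one document.
docs = ["the cat sat", "the cat ran", "dog dog dog ran"]
min_df = 2  # keep words that appear in at least this many documents


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each word once per document
    return df


df = document_frequency(docs)
kept_vocab = sorted(word for word, count in df.items() if count >= min_df)
print(kept_vocab)  # ['cat', 'ran', 'the']
```

Here "dog" is dropped despite its three occurrences, because it appears in a single document.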

### Reducing to 2-D with t-SNE

In [None]:
from sklearn.manifold import TSNE

# a t-SNE model
# an angle value close to 1 sacrifices accuracy for speed
# PCA initialization usually leads to better results
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')

# 20-D -> 2-D
tsne_lda = tsne_model.fit_transform(X_topics)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 18846 samples in 0.018s...
[t-SNE] Computed neighbors for 18846 samples in 13.503s...
[t-SNE] Computed conditional probabilities for sample 1000 / 18846
[t-SNE] Computed conditional probabilities for sample 2000 / 18846
[t-SNE] Computed conditional probabilities for sample 3000 / 18846
[t-SNE] Computed conditional probabilities for sample 4000 / 18846
[t-SNE] Computed conditional probabilities for sample 5000 / 18846
[t-SNE] Computed conditional probabilities for sample 6000 / 18846
[t-SNE] Computed conditional probabilities for sample 7000 / 18846
[t-SNE] Computed conditional probabilities for sample 8000 / 18846
[t-SNE] Computed conditional probabilities for sample 9000 / 18846
[t-SNE] Computed conditional probabilities for sample 10000 / 18846
[t-SNE] Computed conditional probabilities for sample 11000 / 18846
[t-SNE] Computed conditional probabilities for sample 12000 / 18846
[t-SNE] Computed conditional probabilities for sa

## Visualizing groups and their keywords

In [None]:
#First we do some setup work (import classes & functions, set params, etc.)
import numpy as np
import bokeh.plotting as bp
from bokeh.plotting import save
from bokeh.models import HoverTool

n_top_words = 5 # number of keywords we show

# 20 colors
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

In [None]:
# Then we find the most likely topic for each news document
_lda_keys = []
for i in range(X_topics.shape[0]):
  _lda_keys.append(X_topics[i].argmax())
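
As a small illustration of this step, the sketch below picks the most likely topic for each row of a hypothetical document-topic matrix (three documents, three topics, made-up numbers). It uses plain Python lists so it runs without any dependencies.

```python
# Hypothetical document-topic matrix: one row per document, one column per topic.
doc_topic = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]


def most_likely_topic(row):
    """Return the index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)


keys = [most_likely_topic(row) for row in doc_topic]
print(keys)  # [1, 0, 2]
```

On a NumPy array such as `X_topics`, `X_topics.argmax(axis=1)` computes the same indices in a single call.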

In [None]:
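# get top words for each topic
topic_summaries = []
topic_word = lda_model.topic_word_  # topic-word distributions, shape (n_topics, vocab_size)
# note: scikit-learn >= 1.2 removed get_feature_names(); use get_feature_names_out() there
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
  # sort word indices by ascending probability, then take the last n_top_words in reverse
  topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
  topic_summaries.append(' '.join(topic_words))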
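
The slice `[:-(n_top_words + 1):-1]` can be hard to read, so here is a dependency-free sketch of the same idea on hypothetical toy data: rank the words of each topic by probability and keep the top few, highest first. The vocabulary and probabilities are made up.

```python
# Hypothetical toy vocabulary and topic-word probabilities.
vocab = ["game", "team", "space", "orbit", "launch"]
topic_word = [
    [0.40, 0.35, 0.10, 0.05, 0.10],
    [0.05, 0.05, 0.40, 0.20, 0.30],
]
n_top_words = 2


def top_words(dist, vocab, n):
    """Return the n most probable words for one topic, most probable first."""
    order = sorted(range(len(dist)), key=lambda j: dist[j], reverse=True)
    return [vocab[j] for j in order[:n]]


summaries = [' '.join(top_words(dist, vocab, n_top_words)) for dist in topic_word]
print(summaries)  # ['game team', 'space launch']

# The reversed slice used with argsort takes the last n items, in reverse order.
last_two_reversed = [1, 2, 3, 4, 5][:-(2 + 1):-1]
print(last_two_reversed)  # [5, 4]
```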

In [None]:
# plot the news (each point represents one news document)
title = '20 newsgroups LDA viz'
num_example = len(X_topics)

plot_lda = bp.figure(plot_width=1400, plot_height=1100,
                     title=title,
                     tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                     x_axis_type=None, y_axis_type=None, min_border=1)

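# pass every column through the data source and refer to columns by name;
# newer Bokeh versions reject mixing raw arrays with an explicit source
plot_lda.scatter(x='x', y='y', color='color',
                 source=bp.ColumnDataSource({
                   "x": tsne_lda[:num_example, 0],
                   "y": tsne_lda[:num_example, 1],
                   "color": colormap[_lda_keys][:num_example],
                   "content": news[:num_example],
                   "topic_key": _lda_keys[:num_example]
                   }))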