# The 20 newsgroups topic analysis

Instead of repeating the IMDB sentiment analysis from the lesson (because frankly, I'm a little bored with sentiment analysis), I will apply a similar deep-learning approach to NLP classification on a dataset a coworker has recently been messing around with in `scikit-learn`: `sklearn.datasets.fetch_20newsgroups`.

http://people.csail.mit.edu/jrennie/20Newsgroups/

## Setup data

In [None]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

DATASET_DIR = DATA_HOME_DIR + '20_newsgroup/'
MODEL_DIR = DATASET_DIR + 'models/'

In [None]:
# makedirs also creates any missing parent directories (data/ and 20_newsgroup/)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)

In [None]:
from sklearn.datasets import fetch_20newsgroups

category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]

newsgroups = fetch_20newsgroups(
    subset = 'all',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

In [None]:
newsgroups.target_names

`target_names` are as requested

In [None]:
newsgroups.filenames.shape, newsgroups.target.shape, len(newsgroups.data)

Keras implements `get_word_index()` for the IMDB dataset, which returns a dictionary of word->index derived from a JSON file hosted on Amazon S3.

It seems bizarre to me to host this when you can easily create it on-demand... anyway, sklearn doesn't provide this. So let's create our own index with `keras.preprocessing.text.Tokenizer` (https://keras.io/preprocessing/text/).
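
To make "create it on-demand" concrete, here is a minimal sketch of a frequency-ranked word index. It is a simplified stand-in, not the Keras implementation: it only lowercases and splits on whitespace, and the alphabetical tie-break and the names `build_word_index` and `toy_texts` are my own assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent); 0 stays free for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ["the cat sat", "the dog sat", "the end"]
toy_index = build_word_index(toy_texts)
```

Here `toy_index['the']` is 1 and `toy_index['sat']` is 2, which mirrors the "lower index means more frequent" convention the rest of the notebook relies on.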

In [None]:
import keras.preprocessing.text
import string

# Workaround to add "Unicode support for keras.preprocessing.text"
# (https://github.com/fchollet/keras/issues/1072#issuecomment-295470970)
def text_to_word_sequence(text,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    if lower: text = text.lower()
    if type(text) == unicode:
        translate_table = {ord(c): ord(t) for c,t in zip(filters, split*len(filters)) }
    else:
        translate_table = string.maketrans(filters, split * len(filters))
    text = text.translate(translate_table)
    seq = text.split(split)
    return [i for i in seq if i]
    
keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence

In [None]:
from keras.preprocessing.text import Tokenizer

vocab_size = 20000

tokenizer = Tokenizer(nb_words=vocab_size)
tokenizer.fit_on_texts(newsgroups.data) # builds the word index
sequences = tokenizer.texts_to_sequences(newsgroups.data)

In [None]:
word_index = tokenizer.word_index

In [None]:
len(word_index)

Invert `word_index` to get `idx2word`.

In [None]:
idx2word = {v: k for k, v in word_index.items()}

Let's take a look at the first post, both as a list of indices and as text reconstructed from the indices.

In [None]:
', '.join(map(str, sequences[0]))

In [None]:
idx2word[24]

In [None]:
' '.join([idx2word[o] for o in sequences[0]])

In [None]:
newsgroups.target[0], newsgroups.target_names[newsgroups.target[0]]

Distribution of document lengths, in tokens:

In [None]:
import numpy as np

# Count tokens per tokenized document (not characters of the raw text)
lens = np.array([len(s) for s in sequences])
(lens.max(), lens.min(), lens.mean())

Weird that there are documents with zero tokens in them...

In [None]:
# Indices of documents whose token sequence is empty
zero_indices = [i for i, s in enumerate(sequences) if len(s) == 0]
len(zero_indices)

So there are 101 documents with no words. E.g.

In [None]:
sequences[64], newsgroups.target_names[newsgroups.target[64]]

...sure.

Pad (with zeros) or truncate each document to a consistent length.

In [None]:
from keras.preprocessing import sequence

seq_len = 1000

data = sequence.pad_sequences(sequences, maxlen=seq_len, value=0)

In [None]:
data[:10]
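
As a rough illustration of what padding and truncating do to a single sequence, here is a pure-Python sketch. It assumes the Keras defaults of padding and truncating at the front (`'pre'`); `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_or_truncate([5, 8, 2], maxlen=5)              # padded at the front
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # oldest tokens dropped
empty = pad_or_truncate([], maxlen=5)                     # an empty post becomes all zeros
```

The last case is what happens to the empty documents found earlier: they turn into rows consisting only of the padding value.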

Finally, let's turn the labels into categorical information.

In [None]:
from keras.utils.np_utils import to_categorical

newsgroups.target = to_categorical(np.asarray(newsgroups.target))

In [None]:
data.shape, newsgroups.target.shape
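
For reference, "categorical" here means one-hot rows: one column per class, with a 1 in the column of the true class. A minimal sketch, where `one_hot` is a hypothetical helper and the labels are made up:

```python
def one_hot(labels, n_classes):
    """Turn integer class ids into rows with a single 1 at the class position."""
    return [[1 if k == label else 0 for k in range(n_classes)] for label in labels]

encoded = one_hot([2, 0, 3], n_classes=4)
```

With four newsgroups, each label becomes a row of length 4, which is why the target's second dimension matches the size of the final `softmax` layer.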

Split data into train-test.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, newsgroups.target, test_size=0.33)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Create simple models

### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.
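
Conceptually, an embedding is just a lookup table with one row of floats per word id. The toy sketch below uses made-up sizes and an arbitrary initialization range; `emb_table` and `embed` are illustrative names, not Keras API.

```python
import random

random.seed(0)
toy_vocab_size, emb_dim = 6, 4  # toy stand-ins for vocab_size and the 32 floats per word

# One row of (initially random) floats per word id
emb_table = [[random.uniform(-0.05, 0.05) for _ in range(emb_dim)]
             for _ in range(toy_vocab_size)]

def embed(word_ids):
    """Replace each word id with its row of the table."""
    return [emb_table[i] for i in word_ids]

embedded = embed([3, 1, 3])
```

The same id always maps to the same row, and training adjusts the rows so that they become useful for the classification task.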

In [None]:
vocab_size, seq_len

In [None]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.optimizers import Adam

# input_length => posts padded/truncated to seq_len (1000) tokens, 32 floats per token
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(len(newsgroups.target_names), activation='softmax')])

In [None]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Is a validation accuracy in the mid-70s good? Bad?

Here are some accuracies [from an official `sklearn` example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html) that classifies documents by topics using a bag-of-words approach:

```
[('RidgeClassifier', 0.89726533628972649),
 ('Perceptron', 0.88543976348854403),
 ('PassiveAggressiveClassifier', 0.90613451589061345),
 ('KNeighborsClassifier', 0.85809312638580926),
 ('RandomForestClassifier', 0.83813747228381374),
 ('LinearSVC', 0.90022172949002222),
 ('SGDClassifier', 0.90096082779009612),
 ('LinearSVC', 0.87287509238728755),
 ('SGDClassifier', 0.88543976348854403),
 ('SGDClassifier', 0.89874353288987441),
 ('NearestCentroid', 0.85513673318551364),
 ('MultinomialNB', 0.90022172949002222),
 ('BernoulliNB', 0.88396156688839611),
 ('Pipeline', 0.8810051736881005)]
 
 mean: 0.88311688311688319
 ```

So, not a good result in comparison with much simpler approaches. Training accuracy is high, but testing accuracy is much poorer, which suggests the model is overfitting.
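
One plausible contributor to that gap is the sheer number of weights in the single hidden layer net. The arithmetic below uses the sizes from the model definition above (20000-word vocab, 1000 tokens, 32 floats per token, 100 hidden units, 4 classes); the variable names are mine.

```python
n_vocab, n_steps, emb_dim, hidden, n_classes = 20000, 1000, 32, 100, 4

embedding_params = n_vocab * emb_dim
dense_params = (n_steps * emb_dim) * hidden + hidden  # weights + biases after Flatten
output_params = hidden * n_classes + n_classes
total_params = embedding_params + dense_params + output_params
```

That comes to roughly 3.8 million trainable weights, most of them in the dense layer that follows `Flatten()`, which is a lot to fit from a few thousand posts.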

As a sanity check, I also ran code from [`pretrained_word_embeddings.py`](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py) (from Keras's examples repository) which also runs against `20_newsgroups` (not the `sklearn` version though), and it was able to achieve:

    loss: 0.3784 - acc: 0.8734 - val_loss: 0.9177 - val_acc: 0.7257

after 10 epochs. Again, that's not as accurate as the 'shallow', bag-of-words models, but it's comparable to the results I'm getting here.

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.
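
To show what "1D" means here, this is a toy convolution and max pooling over a plain list of numbers. It is only a sketch: a real `Convolution1D` slides many filters over vectors (one embedding per word), whereas this uses a single hand-picked filter over scalars.

```python
def conv1d_valid(seq, kernel):
    """Slide `kernel` along `seq`; each output is the dot product with one window, then ReLU."""
    k = len(kernel)
    return [max(0.0, sum(w * x for w, x in zip(kernel, seq[i:i + k])))
            for i in range(len(seq) - k + 1)]

def max_pool1d(seq, size):
    """Keep the largest value in each non-overlapping block of `size`."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]

signal = [0, 1, 2, 3, 2, 1, 0, 1]
features = conv1d_valid(signal, kernel=[-1, 0, 1])  # fires where values are rising
pooled = max_pool1d(features, size=2)
```

The filter only responds to the rising part at the start of `signal`, and pooling keeps that response while halving the length. Order matters, which is exactly what a bag-of-words model throws away.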

In [None]:
from keras.layers.convolutional import Convolution1D, MaxPooling1D

conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2),
    Dropout(0.4),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.4),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.7),
    Dense(len(newsgroups.target_names), activation='softmax')])

In [None]:
conv1.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.summary()

In [None]:
conv1.optimizer.lr.get_value().item()

In [None]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

In [None]:
conv1.optimizer.lr=0.01

In [None]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=4, batch_size=64)

In [None]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)

A good improvement over the previous model.

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [None]:
from keras.utils.data_utils import get_file

def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    if not os.path.exists(glove_path):
        os.makedirs(glove_path)
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [None]:
from utils import load_array
import pickle

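def load_vectors(loc):
    # Load the embedding matrix plus the word list and word->index mapping
    with open(loc + '_words.pkl', 'rb') as f_words, \
         open(loc + '_idx.pkl', 'rb') as f_idx:
        return (load_array(loc + '.dat'),
                pickle.load(f_words),
                pickle.load(f_idx))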

In [None]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.100d'))

In [None]:
len(wordidx)

The GloVe word ids and our newsgroups word ids use different indexes. So we create a simple function that builds an embedding matrix using the indexes from our tokenizer and the embeddings from GloVe (where they exist).

In [None]:
from numpy.random import normal

def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    # Row 0 stays zero: index 0 is only used for padding
    for i in range(1, len(emb)):
        word = idx2word.get(i)
        if word and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb /= 3
    return emb

In [None]:
emb = create_emb()

In [None]:
emb.shape

In [None]:
emb_model = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(128, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='softmax')])

_Note_: I started seeing lines like `4s - loss: nan - acc: 0.6783 - val_loss: nan - val_acc: 0.2131` where in the previous epoch, `val_acc` was twice that amount. A [quick search on the forums](http://forums.fast.ai/t/why-are-my-losses-nan/2931/2) surfaced this explanation:

    "There is one thing that doesn't look quite right: the final activation is not compatible with that loss function. Categorical cross-entropy expects a 'softmax' activation in the final layer, not 'sigmoid'. Consider changing that to see what happens."
    
**Categorical cross-entropy expects a `softmax` activation in the final layer, not `sigmoid`.** So I switched to `softmax`. I don't recall ever learning this, though, so it's worth pondering why.
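
One way to see the mismatch: categorical cross-entropy treats the outputs as a single probability distribution over the classes. `softmax` guarantees that the outputs sum to 1, while independent `sigmoid` outputs do not. The sketch below uses made-up scores and makes no attempt to reproduce the `nan` losses.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def categorical_crossentropy(one_hot, probs):
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for 4 classes
target = [1, 0, 0, 0]

soft = softmax(logits)
sig = sigmoid(logits)
loss = categorical_crossentropy(target, soft)
```

Here `sum(soft)` is 1 (up to rounding) while `sum(sig)` is well above 1, so only the softmax outputs can be read as "the probability of each newsgroup".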

In [None]:
emb_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
emb_model.summary()

In [None]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Let's fine-tune the embedding weights - especially since the words we couldn't find in glove just have random embeddings.

In [None]:
emb_model.layers[0].trainable=True

In [None]:
emb_model.optimizer.lr=1e-4

In [None]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

In [None]:
emb_model.layers[0].trainable=False

In [None]:
emb_model.optimizer.lr=1e-2

In [None]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=64)

Interestingly, the pretrained embeddings didn't provide any improvement...

## Pre-trained vectors + BatchNorm

In [None]:
emb = create_emb()

In [None]:
from keras.layers.normalization import BatchNormalization

batch_model = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(128, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    BatchNormalization(axis=1),
    Dropout(0.7),
    Dense(4, activation='softmax')])

In [None]:
batch_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
batch_model.summary()

In [None]:
batch_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

In [None]:
batch_model.layers[0].trainable=True

In [None]:
batch_model.optimizer.lr=1e-4

In [None]:
batch_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

In [None]:
batch_model.layers[0].trainable=False

In [None]:
batch_model.optimizer.lr=1e-2

In [None]:
batch_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=64)

Nada.

## Pre-trained vectors III

Let's try the model from `pretrained_word_embeddings.py` (+ Dropout).

In [None]:
emb = create_emb()

In [None]:
deep_model = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(35), # global max pooling
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(len(newsgroups.target_names), activation='softmax')
])
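
The `MaxPooling1D(35)` above acts as global max pooling because 35 is exactly the number of time steps left at that point. A quick check of the arithmetic, assuming unpadded ('valid') convolutions and non-overlapping pooling, with hypothetical helper names:

```python
def conv_len(n, kernel):  # 'valid' convolution: no padding
    return n - kernel + 1

def pool_len(n, size):    # non-overlapping max pooling
    return n // size

n = 1000  # seq_len
trace = []
for _ in range(2):        # two conv(5) + pool(5) stages
    n = pool_len(conv_len(n, 5), 5)
    trace.append(n)
n = conv_len(n, 5)        # third conv, before the final pooling
trace.append(n)
```

The lengths go 1000 -> 199 -> 39 -> 35, so pooling over 35 steps leaves a single value per filter. With a different `seq_len` the 35 would need to change accordingly.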

In [None]:
deep_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
deep_model.summary()

In [None]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=128)

In [None]:
deep_model.layers[0].trainable=True

In [None]:
deep_model.optimizer.lr=1e-4

In [None]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=128)

In [None]:
deep_model.optimizer.lr=1e-3

In [None]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=128)

Better than the first two pretrained embedding models and back to being competitive with the single conv layer with max pooling.

## Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' [excellent blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data).

In [None]:
from keras.layers import Merge

We use the functional API to create multiple conv layers of different sizes, and then concatenate them.

In [None]:
from keras.layers import Input, Merge
from keras.models import Model

graph_in = Input((seq_len, 100))  # (time steps, embedding size), as produced by the Embedding layer
convs = [] 
for fsz in range (3, 6): 
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x) 
    x = Flatten()(x) 
    convs.append(x)
out = Merge(mode='concat')(convs) 
graph = Model(graph_in, out) 
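
To get a feel for what the concatenation produces, here is the size arithmetic for the three branches. It assumes each branch sees `seq_len = 1000` time steps, 'same' padding, and a pooling size of 2 (which I believe is the `MaxPooling1D` default); the variable names are mine.

```python
n_steps, n_filters, pool = 1000, 64, 2

branch_sizes = []
for filter_size in range(3, 6):
    conv_steps = n_steps              # 'same' padding keeps the length for any filter size
    pooled_steps = conv_steps // pool
    branch_sizes.append(pooled_steps * n_filters)  # size after Flatten

merged_size = sum(branch_sizes)
```

Each branch flattens to 32000 values regardless of its filter size, so the merged vector has 96000 entries. The branches differ only in how many neighbouring words each filter looks at (3, 4 or 5).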

In [None]:
emb = create_emb()

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.

In [None]:
multi = Sequential ([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout (0.2),
    graph,
    Dropout (0.5),
    Dense (100, activation='relu'),
    Dropout (0.7),
    Dense (len(newsgroups.target_names), activation='softmax')
    ])

In [None]:
multi.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
multi.summary()

In [None]:
multi.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=128)

In [None]:
multi.layers[0].trainable=False

In [None]:
multi.optimizer.lr=1e-5

In [None]:
multi.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=128)

In [None]:
multi.layers[0].trainable=True

In [None]:
multi.optimizer.lr=1e-2

In [None]:
multi.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=128)

Highest deep-learning result so far! And the most comparable to the 'shallow' bag-of-words results that achieved upper-80s.

Bonus: this was also the least stressful to watch train because `val_acc` (generally) continued to rise instead of bouncing around like the other models.

## lda2vec

http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

http://lda2vec.readthedocs.io/en/latest/

### Install `lda2vec` and accompanying Python module

1. Download `setup.py`, `requirements.txt`, and the `lda2vec/` folder from https://github.com/cemoody/lda2vec (I used `wget`)
2. Run `python setup.py install`
3. To avoid [this error](https://github.com/explosion/spaCy/issues/855), re-install spaCy with `conda install spacy -c conda-forge`
4. To [enable CUDA support for `chainer`](https://github.com/chainer/chainer#installation), run `pip install cupy`

Also: `conda install seaborn`

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn

Install `pyLDAvis` with `pip install git+https://github.com/bmabey/pyLDAvis.git@master#egg=pyLDAvis`

In [None]:
import pyLDAvis
pyLDAvis.enable_notebook()

### Generate preprocessed data files

_Because `lda2vec` is not fully polished, there are a few steps beyond just running `preprocess.py`_

Download `GoogleNews-vectors-negative300.bin.gz` (1.53GB) and unzip it:

    wget -c https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
    gzip -d GoogleNews-vectors-negative300.bin.gz

Run `pip install pyxdameraulevenshtein` (to get access to `damerau_levenshtein_distance_ndarray`\* in `corpus.py`).

\* I had to replace the call to `damerau_levenshtein_distance_withNPArray` in `corpus.py` with this function.

Install `gensim` (we'll be using the `gensim.models.KeyedVectors` class)\*\*.

\*\* Note that the `lda2vec` module's [`corpus.py`](https://github.com/cemoody/lda2vec/blob/master/lda2vec/corpus.py) suggests it provides a `use_spacy` option to load word vectors, but this appears to be broken, so `gensim` is the only option at the moment. `corpus.py` must also be modified to import `gensim.models.KeyedVectors` instead of `gensim.models.word2vec`.

Finally, run either the code block below, or [preprocess.py](https://raw.githubusercontent.com/cemoody/lda2vec/master/examples/twenty_newsgroups/data/preprocess.py) in `examples/twenty_newsgroups/data` to generate the `20_newsgroup` data that `lda2vec` will run on.

In [1]:
#!python data/preprocess.py
# from https://raw.githubusercontent.com/cemoody/lda2vec/master/examples/twenty_newsgroups/data/preprocess.py

# Author: Chris Moody <chrisemoody@gmail.com>
# License: MIT

# This simple example loads the newsgroups data from sklearn
# and train an LDA-like model on it
import pickle

from sklearn.datasets import fetch_20newsgroups
import numpy as np

from lda2vec import preprocess, Corpus

# Fetch data
remove = ('headers', 'footers', 'quotes')
category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]
texts = fetch_20newsgroups(subset='all', categories=category_subset, remove=remove).data
# Remove tokens with these substrings
bad = set(["ax>", '`@("', '---', '===', '^^^'])

def clean(line):
    return ' '.join(w for w in line.split() if not any(t in w for t in bad))

# Preprocess data
max_length = 10000   # Limit of 10k words per document
# Convert to unicode (spaCy only works with unicode)
texts = [unicode(clean(d)) for d in texts]
tokens, vocab = preprocess.tokenize(texts, max_length, merge=False,
                                    n_threads=4)
corpus = Corpus()
# Make a ranked list of rare vs frequent words
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=30)
# Convert the compactified arrays into bag of words arrays
bow = corpus.compact_to_bow(pruned)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
doc_ids = np.arange(pruned.shape[0])
flattened, (doc_ids,) = corpus.compact_to_flat(pruned, doc_ids)
assert flattened.min() >= 0
# Fill in the pretrained word vectors
n_dim = 300
fn_wordvc = 'data/GoogleNews-vectors-negative300.bin'
vectors, s, f = corpus.compact_word_vectors(vocab, filename=fn_wordvc)
# Save all of the preprocessed files
# Pickle files must be opened in binary mode
pickle.dump(vocab, open('vocab.pkl', 'wb'))
pickle.dump(corpus, open('corpus.pkl', 'wb'))
np.save("flattened", flattened)
np.save("doc_ids", doc_ids)
np.save("pruned", pruned)
np.save("bow", bow)
np.save("vectors", vectors)


2 <SKIP>  -->  SKIP
3 ,  -->  上
4 .  -->  上
13 "  -->  上
15 -  -->  上
18 )  -->  上
21 :  -->  上
22 (  -->  上
25 '  -->  上
31 ?  -->  上
32 <  -->  上
37 /  -->  上
44 max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax  -->  Malavika_Jagannathan_•_mjaganna@greenbaypressgazette.com
63 ;  -->  上
66 !  -->  上
75 ]  -->  上
77 [  -->  上
81 ...  -->  В.В.
91 --  -->  -4
118 g9v  -->  guv
128 g)r  -->  grr
148 |  -->  上
363 ..  -->  В.В.
404 3.1  -->  S.1
441 }  -->  上
494 /pub  -->  pub
534 24  -->  2_
549 10  -->  -0
583 i.e.  -->  ie.
589 os/2  -->  Rs2
634 {  -->  上
635 16  -->  O6
710 14  -->  q4
714 1993  -->  9ja
718 15  -->  O5
731 20  -->  2_
867 256  -->  2_
891 jfif  -->  fif
913 12  -->  q2
960 24-bit  -->  2Gbit
983 30  -->  -0
990 2.0  -->  P.0
1025 phigs  -->  Whigs
1032 e.g.  -->  eg.
1042 100  -->  cw0
1061 msdos  -->  ms_dos
1139 3.0  -->  P.0
1146 18  -->  -8
1160 11  -->  q1
1175 and/or  -->  andor
1190 50  -->  -0
1208 25  -->  2_
1239 alt.atheism  -->  neo_athei

_Note_: the mappings printed above come from this line in `corpus.py`:

    print compact, word, ' --> ', choice
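
Those replacements appear to be nearest matches by edit distance: when a token has no pretrained vector, the closest known word is used instead. The sketch below implements the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance in pure Python. It is an illustration only and may differ from `pyxdameraulevenshtein` in corner cases; `osa_distance` and `closest` are hypothetical names.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def closest(word, choices):
    """Pick the candidate with the smallest edit distance to `word`."""
    return min(choices, key=lambda c: osa_distance(word, c))

best = closest("jfif", ["gif", "fif", "jpeg"])
```

This explains pairs like `i.e.` -> `ie.` or `and/or` -> `andor` (distance 1 each), and also why short tokens such as numbers and punctuation end up with fairly arbitrary neighbours.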

Run [`lda2vec_run.py`](https://github.com/cemoody/lda2vec/raw/master/examples/twenty_newsgroups/lda2vec/lda2vec_run.py) in the `examples/twenty_newsgroups/lda2vec` directory to generate `topics.pyldavis.npz`, which contains the topic-to-word probabilities and frequencies. What's left is to visualize each topic and label it from its most prevalent words.

_Note_: make sure you've downloaded `lda2vec_model.py` beforehand

In [None]:
!python lda2vec/lda2vec_run.py
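
Once the run has produced topic-to-word probabilities, labeling a topic boils down to reading off its highest-probability words. A minimal sketch with made-up numbers follows; the vocabulary, the probabilities and the `top_words` helper are all illustrative, and it does not read the actual `.npz` file.

```python
def top_words(topic_term_probs, vocab, n=3):
    """For each topic (row), return the n words with the highest probability."""
    return [[vocab[i] for i in sorted(range(len(row)), key=lambda i: -row[i])[:n]]
            for row in topic_term_probs]

toy_vocab = ['god', 'image', 'windows', 'faith', 'file', 'jpeg']
toy_topics = [
    [0.40, 0.05, 0.05, 0.35, 0.10, 0.05],  # looks like a religion topic
    [0.05, 0.35, 0.10, 0.05, 0.15, 0.30],  # looks like a graphics topic
]
labels = top_words(toy_topics, toy_vocab, n=2)
```

On real output the same idea applies to each row of the topic-to-word matrix, with `pyLDAvis` providing the interactive version of this listing.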