# Visualizing a Gensim model

To illustrate how to use [`pyLDAvis`](https://github.com/bmabey/pyLDAvis)'s gensim [helper functions](https://pyldavis.readthedocs.org/en/latest/modules/API.html#module-pyLDAvis.gensim), we will create a model from the [20 Newsgroups corpus](http://qwone.com/~jason/20Newsgroups/). Only minimal preprocessing is done, so the model is not the best; the goal of this notebook is to demonstrate the helper functions.

## Downloading the data

In [None]:
%%bash
mkdir -p data
pushd data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Let's take a look at the groups..."
ls 20news-bydate-train/
popd

## Exploring the dataset

Each group directory contains a set of files:

In [None]:
!ls -lah data/20news-bydate-train/sci.space | tail -n 5

Let's take a peek at one email:

In [None]:
!head -n 20 data/20news-bydate-train/sci.space/61422
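
The same exploration can be done from Python. The following is a minimal sketch, assuming the `data/20news-bydate-train` layout created above (one sub-directory per group, one file per message). `count_docs_per_group` is a hypothetical helper written for illustration; it is not part of the original notebook.

```python
import os


def count_docs_per_group(root):
    """Return {group_name: number_of_files} for each sub-directory of root."""
    counts = {}
    for group in sorted(os.listdir(root)):
        group_dir = os.path.join(root, group)
        if os.path.isdir(group_dir):
            # Each regular file in a group directory is one message.
            counts[group] = sum(
                1 for name in os.listdir(group_dir)
                if os.path.isfile(os.path.join(group_dir, name))
            )
    return counts


if __name__ == "__main__":
    train_dir = "data/20news-bydate-train"  # path used by the download cell
    if os.path.isdir(train_dir):
        for group, n in count_docs_per_group(train_dir).items():
            print(group, n)
```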

## Loading and tokenizing the corpus

In [2]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [3]:
# quick and dirty....
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

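def tokenize_line(line):
    # Lowercase, replace e-mail addresses with "#email", then blank out
    # everything except letters, spaces, apostrophes and '#'.
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()


def tokenize(lines, token_size_filter=2):
    # Flatten the per-line tokens and keep only those longer than
    # token_size_filter characters.
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]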

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename, 'r') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}


docs = pd.DataFrame(list(map(load_doc, glob('data/20news-bydate-train/*/*')))).set_index(['group','id'])
docs.head()

| group | id | doc | tokens |
|---|---|---|---|
| alt.atheism | 49960 | `[From: mathew <mathew@mantis.co.uk>\n, Subject...` | `[from, mathew, #email, subject, alt, atheism, ...` |
| alt.atheism | 51060 | `[From: mathew <mathew@mantis.co.uk>\n, Subject...` | `[from, mathew, #email, subject, alt, atheism, ...` |
| alt.atheism | 51119 | `[From: I3150101@dbstu1.rz.tu-bs.de (Benedikt R...` | `[from, #email, benedikt, rosenau, subject, gos...` |
| alt.atheism | 51120 | `[From: mathew <mathew@mantis.co.uk>\n, Subject...` | `[from, mathew, #email, subject, university, vi...` |
| alt.atheism | 51121 | `[From: strom@Watson.Ibm.Com (Rob Strom)\n, Sub...` | `[from, #email, rob, strom, subject, soc, motss...` |
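
To make the tokenization rules concrete, here is a small self-contained sketch that applies the same two regular expressions to a couple of header lines. It uses `itertools.chain` in place of `funcy.mapcat` so that it runs without third-party packages; that substitution and the second sample line are assumptions of this sketch, not part of the notebook.

```python
import re
from itertools import chain

EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, " ")]


def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()


def tokenize(lines, token_size_filter=2):
    # chain.from_iterable stands in for funcy.mapcat here.
    tokens = chain.from_iterable(tokenize_line(line) for line in lines)
    return [t for t in tokens if len(t) > token_size_filter]


sample = ["From: mathew <mathew@mantis.co.uk>\n", "Subject: Re: Is it 42?\n"]
print(tokenize(sample))  # ['from', 'mathew', '#email', 'subject']
```

Note how the address collapses to the single token `#email`, and how short tokens such as `re`, `is` and `it` are dropped by the length filter.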


## Creating the dictionary and bag-of-words corpus

In [4]:

def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=frozenset(), no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    # Stopword removal and frequency filtering are left disabled here (minimal
    # preprocessing), so the keyword arguments above are currently unused.
    # stopwords = nltk_stopwords().union(additional_stopwords)
    # stopword_ids = [dictionary.token2id[w] for w in stopwords if w in dictionary.token2id]
    # dictionary.filter_tokens(stopword_ids)
    # dictionary.compactify()
    # dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    # dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus



In [5]:
dictionary, corpus = prep_corpus(docs['tokens'])

Building dictionary...
Building corpus...
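
For readers new to gensim, the sketch below mimics in plain Python what a dictionary plus `doc2bow` produce: each document becomes a sorted list of `(token_id, count)` pairs. The helper names `build_token2id` and `to_bow` are hypothetical, and the id assignment order is an assumption of this sketch; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["space", "shuttle", "launch", "space"], ["shuttle", "orbit"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'space': 0, 'shuttle': 1, 'launch': 2, 'orbit': 3}
print(bows)      # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1)]]
```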


In [None]:
MmCorpus.serialize('newsgroups.mm', corpus)
dictionary.save('newsgroups.dict')

## Fitting the LDA model

In [6]:
# LdaModel.load expects a file path, not an open file object. This only
# works once the model has been saved by the training cell below.
lda = models.ldamodel.LdaModel.load('newsgroups_50_lda.model', mmap='r')

In [7]:
#%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
                                      
lda.save('newsgroups_50_lda.model')
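
Training takes a while, so it can be convenient to reuse a saved model when one exists. Below is a minimal sketch of that pattern. `load_or_train` is a hypothetical helper, and its `load`, `train` and `save` arguments are placeholders for `LdaModel.load`, the training call above and `lda.save`. The sketch assumes that checking the single path passed in is enough to detect a saved model.

```python
import os


def load_or_train(path, load, train, save):
    """Return load(path) if path exists; otherwise train, save to path, and return the model."""
    if os.path.exists(path):
        return load(path)
    model = train()
    save(model, path)
    return model


# Hypothetical usage with the names from this notebook:
# lda = load_or_train(
#     'newsgroups_50_lda.model',
#     load=models.ldamodel.LdaModel.load,
#     train=lambda: models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
#                                            num_topics=50, passes=10),
#     save=lambda model, path: model.save(path),
# )
```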

## Visualizing the model with pyLDAvis

Okay, the moment we have all been waiting for is finally here! You'll notice in the visualization that we have a few junk topics that would probably disappear after better preprocessing of the corpus. This is left as an exercise to the reader. :)

In [None]:
dictionary = Dictionary.load('newsgroups.dict')
corpus = MmCorpus('newsgroups.mm')

In [18]:
#%load_ext autoreload
%autoreload 2
reload(pyLDAvis)

<module 'pyLDAvis' from '/usr/local/lib/python2.7/dist-packages/pyLDAvis/__init__.pyc'>

In [21]:
import pyLDAvis
import pyLDAvis.gensim  # make sure the gensim helper submodule is loaded
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
with open('topic.html', 'w') as outf:
    pyLDAvis.save_html(vis, outf)



In [22]:
import gensim

dictionary1 = gensim.corpora.Dictionary.from_corpus(corpus)

In [None]:
import copy 
from gensim.models import VocabTransform

# filter the dictionary
old_dict = dictionary
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus2 = corpus
old2new = {old_dict.token2id[token]:new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
MmCorpus.serialize('filtered_corpus.mm', vt[corpus2], id2word=new_dict)

import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
# Name the output file after the model's topic count and number of passes.
pyLDAvis.save_html(vis, "topic_viz_{}_passes_{}.html".format(lda.num_topics, lda.passes))
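
The `old2new` mapping built above drives the corpus transformation: ids that survive filtering are renumbered, and ids that were filtered out are dropped. A plain-Python sketch of that remapping follows. `remap_bow` is a hypothetical helper written for illustration; it is not gensim's `VocabTransform` implementation.

```python
def remap_bow(bow, old2new):
    """Renumber (old_id, count) pairs; drop ids missing from old2new."""
    return sorted(
        (old2new[old_id], count) for old_id, count in bow if old_id in old2new
    )


# Toy example: token 1 was filtered out, tokens 0 and 2 were renumbered.
old2new = {0: 1, 2: 0}
print(remap_bow([(0, 3), (1, 5), (2, 1)], old2new))  # [(0, 1), (1, 3)]
```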

## Fitting the HDP model

Just as we visualized the LDA model with pyLDAvis, we can visualize gensim HDP models in the same manner.

The difference between HDP and LDA is that HDP is a non-parametric method: you don't need to specify the number of topics in advance. Instead, HDP infers from the data how many topics to use, up to a truncation limit.

In [None]:
%%time
# The optional parameter T caps the number of topics HDP can find at 50;
# it may end up using fewer.
hdp = models.hdpmodel.HdpModel(corpus, dictionary, T=50)
                                      
hdp.save('newsgroups_hdp.model')
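
Why have a truncation level at all? HDP is commonly described through a stick-breaking construction, in which each topic takes a random fraction of the probability mass left over by the topics before it; truncating at `T` keeps only the first `T` pieces. The sketch below illustrates that idea in plain Python. It shows the general construction under assumed parameters (`gamma=1.0`, a fixed seed) and is not gensim's inference code; `stick_breaking` is a hypothetical helper.

```python
import random


def stick_breaking(T, gamma=1.0, seed=0):
    """Return T topic weights from a truncated stick-breaking process."""
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(T):
        # Break off a Beta(1, gamma)-distributed fraction of what is left.
        fraction = rng.betavariate(1.0, gamma)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


weights = stick_breaking(T=50)
# Count how many of the T weights carry more than 1% of the mass.
print(sum(w > 0.01 for w in weights), "of", len(weights), "weights exceed 0.01")
```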

## Visualizing the HDP model with pyLDAvis

As with the LDA model, you only need to pass the model, the corpus, and the associated dictionary to prepare the visualization.

In [None]:
import pyLDAvis
import pyLDAvis.gensim as gensimvis

vis_data = gensimvis.prepare(hdp, corpus, dictionary)
pyLDAvis.display(vis_data)