# Visualizing a Gensim model

To illustrate how to use [`pyLDAvis`](https://github.com/bmabey/pyLDAvis)'s gensim [helper functions](https://pyldavis.readthedocs.org/en/latest/modules/API.html#module-pyLDAvis.gensim), we will create a model from the [20 Newsgroups corpus](http://qwone.com/~jason/20Newsgroups/). Only minimal preprocessing is done, so the model is not the best. However, the goal of this notebook is to demonstrate the helper functions.

## Downloading the data

In [1]:
%%bash
mkdir -p data
pushd data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Let's take a look at the groups..."
ls 20news-bydate-train/
popd

/content/data /content
20news-bydate-test/
20news-bydate-test/alt.atheism/
20news-bydate-test/alt.atheism/53265
20news-bydate-test/alt.atheism/53339
20news-bydate-test/alt.atheism/53260
20news-bydate-test/alt.atheism/53340
20news-bydate-test/alt.atheism/53333
20news-bydate-test/alt.atheism/53302
20news-bydate-test/alt.atheism/53313
20news-bydate-test/alt.atheism/53293
20news-bydate-test/alt.atheism/53297
20news-bydate-test/alt.atheism/53315
20news-bydate-test/alt.atheism/53320
20news-bydate-test/alt.atheism/53324
20news-bydate-test/alt.atheism/53328
20news-bydate-test/alt.atheism/53325
20news-bydate-test/alt.atheism/53322
20news-bydate-test/alt.atheism/53326
20news-bydate-test/alt.atheism/53261
20news-bydate-test/alt.atheism/53327
20news-bydate-test/alt.atheism/53329
20news-bydate-test/alt.atheism/53321
20news-bydate-test/alt.atheism/53068
20news-bydate-test/alt.atheism/53338
20news-bydate-test/alt.atheism/53257
20news-bydate-test/alt.atheism/53262
20news-bydate-test/alt.atheism/53276


--2019-11-09 07:08:15--  http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
Resolving qwone.com (qwone.com)... 108.20.201.166
Connecting to qwone.com (qwone.com)|108.20.201.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14464277 (14M) [application/x-gzip]
Saving to: ‘20news-bydate.tar.gz’


## Exploring the dataset

Each group directory contains a set of files:

In [2]:
ls -lah data/20news-bydate-train/sci.space | tail  -n 5

-rw-r--r--  1 6602 6602 1.5K Mar 18  2003 61250
-rw-r--r--  1 6602 6602  889 Mar 18  2003 61252
-rw-r--r--  1 6602 6602 1.2K Mar 18  2003 61264
-rw-r--r--  1 6602 6602 1.7K Mar 18  2003 61308
-rw-r--r--  1 6602 6602 1.4K Mar 18  2003 61422
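
For a quick overview of how many files each group holds, a small Python sketch like the one below can complement the shell listing. The helper name `count_docs_per_group` is made up for this example, and the path assumes the `data/20news-bydate-train` layout created by the download cell.

```python
from pathlib import Path


def count_docs_per_group(root):
    """Return {group_name: number_of_files} for each sub-directory of root."""
    root = Path(root)
    return {
        group.name: sum(1 for f in group.iterdir() if f.is_file())
        for group in sorted(root.iterdir())
        if group.is_dir()
    }


# Path assumed from the download step; nothing is printed if it is missing.
train_dir = Path("data/20news-bydate-train")
if train_dir.is_dir():
    for group, n_docs in count_docs_per_group(train_dir).items():
        print(f"{group:30s} {n_docs}")
```

Files that sit directly in the root directory are ignored, so only the per-group counts are reported.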


Let's take a peek at one email:

In [3]:
!head data/20news-bydate-train/sci.space/61422 -n 20

From: ralph.buttigieg@f635.n713.z3.fido.zeta.org.au (Ralph Buttigieg)
Subject: Why not give $1 billion to first year-lo
Organization: Fidonet. Gate admin is fido@socs.uts.edu.au
Lines: 34

Original to: keithley@apple.com
G'day keithley@apple.com

21 Apr 93 22:25, keithley@apple.com wrote to All:

 kc> keithley@apple.com (Craig Keithley), via Kralizec 3:713/602


 kc> But back to the contest goals, there was a recent article in AW&ST
about a
 kc> low cost (it's all relative...) manned return to the moon.  A General
 kc> Dynamics scheme involving a Titan IV & Shuttle to lift a Centaur upper
 kc> stage, LEV, and crew capsule.  The mission consists of delivering two
 kc> unmanned payloads to the lunar surface, followed by a manned mission.
 kc> Total cost:  US was $10-$13 billion.  Joint ESA(?)/NASA project was


## Loading and tokenizing the corpus

In [8]:
!pip install funcy gensim pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97711 sha256=791094312cdfaa426595a5756507c9311d78a75f8ad0844173e27b7f5c3abe7a
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyLDAvis
Installing collected packages: pyLDAvis
Successfully installed pyLDAvis-2.1.2


In [0]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [11]:
# quick and dirty....
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()
    
def tokenize(lines, token_size_filter=2):
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]
    

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename, errors='ignore') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}



docs = pd.DataFrame([load_doc(file) for file in glob('data/20news-bydate-train/*/*')]).set_index(['group','id'])
docs.head()

| group | id | doc | tokens |
| --- | --- | --- | --- |
| talk.politics.guns | 54284 | `[From: lvc@cbnews.cb.att.com (Larry Cipriani)\...` | `[from, #email, larry, cipriani, subject, decon...` |
| talk.politics.guns | 54353 | `[From: gs26@prism.gatech.EDU (Glenn R. Stone)\...` | `[from, #email, glenn, stone, subject, batf, fb...` |
| talk.politics.guns | 54119 | `[From: scottj@magic.dml.georgetown.edu (John L...` | `[from, #email, john, scott, subject, that, sil...` |
| talk.politics.guns | 54318 | `[From: jon@atlas.MITRE.org (J. E. Shum)\n, Sub...` | `[from, #email, shum, subject, change, name, or...` |
| talk.politics.guns | 54148 | `[From: gardner@convex.com (Steve Gardner)\n, S...` | `[from, #email, steve, gardner, subject, ban, a...` |
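
To see what the tokenizer does to individual lines, here is a standalone sketch that repeats the regexes from the cell above using only the standard library (`itertools.chain` stands in for `funcy.mapcat`). The first sample line is the header shown in the table; the second is a made-up subject line added purely for illustration.

```python
import re
from itertools import chain

# Same patterns as in the tokenizer cell.
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, " ")]


def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()


def tokenize(lines, token_size_filter=2):
    # chain.from_iterable plays the role of funcy.mapcat here.
    tokens = chain.from_iterable(tokenize_line(line) for line in lines)
    return [t for t in tokens if len(t) > token_size_filter]


sample = [
    "From: lvc@cbnews.cb.att.com (Larry Cipriani)\n",
    "Subject: Re: It's a test of 2 things\n",  # hypothetical line
]
print(tokenize(sample))
# ['from', '#email', 'larry', 'cipriani', 'subject', "it's", 'test', 'things']
```

Note that the address collapses into the single `#email` token, digits and punctuation become spaces, and tokens of one or two characters (`re`, `a`, `of`) are dropped by the length filter.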


## Creating the dictionary, and bag of words corpus

In [0]:

def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=None, no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    stopwords = nltk_stopwords().union(additional_stopwords or set())
    # Only collect ids for stopwords that actually occur in the dictionary
    stopword_ids = [dictionary.token2id[w] for w in stopwords if w in dictionary.token2id]
    dictionary.filter_tokens(stopword_ids)
    dictionary.compactify()
    # Drop tokens seen in fewer than no_below docs or in more than no_above (fraction) of docs
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus



In [15]:
import nltk
nltk.download('stopwords')

dictionary, corpus = prep_corpus(docs['tokens'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Building dictionary...
Building corpus...
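
The `no_below` and `no_above` arguments prune the vocabulary by document frequency: a token is kept only if it appears in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The pure-Python sketch below mimics that rule on a toy corpus and shows the `(token_id, count)` shape of a bag-of-words document. It is not gensim's code: `filter_by_doc_freq`, `to_bow`, and `toy_docs` are names invented for this illustration, the thresholds are chosen to suit the toy data, and rounding details may differ from gensim.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and in at most
    a fraction `no_above` of all docs."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)


def to_bow(doc, token2id):
    """Map a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [
    ["space", "moon", "nasa", "the"],
    ["space", "shuttle", "space", "the"],
    ["gun", "law", "the"],
    ["gun", "ban", "the", "space"],
]

vocab = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
token2id = {token: i for i, token in enumerate(vocab)}
bows = [to_bow(doc, token2id) for doc in toy_docs]
print(vocab)  # ['gun', 'space']
print(bows)   # [[(1, 1)], [(1, 2)], [(0, 1)], [(0, 1), (1, 1)]]
```

Here `the` is dropped because it appears in every document, and the rare words are dropped because each appears in only one.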


In [16]:
MmCorpus.serialize('newsgroups.mm', corpus)
dictionary.save('newsgroups.dict')



## Fitting the LDA model

In [17]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
                                      
lda.save('newsgroups_50_lda.model')

CPU times: user 4min 8s, sys: 2min 48s, total: 6min 56s
Wall time: 3min 36s




## Visualizing the model with pyLDAvis

Okay, the moment we have all been waiting for is finally here! You'll notice in the visualization that we have a few junk topics that would probably disappear with better preprocessing of the corpus. This is left as an exercise for the reader. :)

In [0]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis
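
The `pyLDAvis.gensim` module name matches the pyLDAvis 2.1.2 release installed above. As far as I know, pyLDAvis 3.x renamed this module to `pyLDAvis.gensim_models`, so the import above may fail on a newer install. A hedged sketch of an import that tries both names (the helper `load_gensim_helpers` is invented for this example):

```python
import importlib


def load_gensim_helpers():
    """Return pyLDAvis's gensim helper module, or None if it cannot be imported."""
    # Assumption: "pyLDAvis.gensim_models" is the 3.x name, "pyLDAvis.gensim" the 2.x name.
    for module_name in ("pyLDAvis.gensim_models", "pyLDAvis.gensim"):
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue
    return None


gensimvis = load_gensim_helpers()
if gensimvis is None:
    print("pyLDAvis gensim helpers not found; try `pip install pyLDAvis`")
else:
    print("Using", gensimvis.__name__)
```

Either way, the module is bound to the same `gensimvis` name, so the `gensimvis.prepare(...)` calls in this notebook stay unchanged.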

In [19]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)



## Fitting the HDP model

pyLDAvis can visualize gensim HDP models as well as LDA models.

Unlike LDA, HDP is a non-parametric method, which means that we don't need to specify the number of topics up front. Instead, HDP infers how many topics the data supports, up to a truncation limit.
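
One way to build intuition for why a truncation limit rarely matters much is the truncated stick-breaking construction commonly used to describe HDP-style topic weights: each topic takes a random fraction of whatever probability mass is left, so later topics tend to receive very little. The sketch below is only an illustration of that idea, not gensim's implementation; the function name and the concentration value `gamma=1.0` are assumptions made for this example.

```python
import random


def truncated_stick_breaking(T, gamma, seed=0):
    """Sample T topic weights that sum to 1 via truncated stick-breaking."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for k in range(T):
        # The last break takes whatever is left so the weights sum to 1.
        frac = 1.0 if k == T - 1 else rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights


weights = truncated_stick_breaking(T=50, gamma=1.0)
effective = sum(w > 0.01 for w in weights)
print(f"{effective} of {len(weights)} topics carry more than 1% of the mass")
```

With a small concentration value, only a handful of the 50 available slots typically end up with noticeable weight, which is the sense in which the model "chooses" its number of topics.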

In [20]:
%%time
# The optional parameter T is the top-level truncation: an upper bound on the
# number of topics HDP may use (here, at most 50).
hdp = models.hdpmodel.HdpModel(corpus, dictionary, T=50)
                                      
hdp.save('newsgroups_hdp.model')



CPU times: user 1min 1s, sys: 19.5 s, total: 1min 21s
Wall time: 58.8 s


## Visualizing the HDP model with pyLDAvis

As with the LDA model, preparing the visualization only requires the model, the corpus, and the associated dictionary.

In [21]:
vis_data = gensimvis.prepare(hdp, corpus, dictionary)
pyLDAvis.display(vis_data)

