# Topic Modeling with gensim
We'll try out [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) on the [20 Newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) with some simple preprocessing.

#### Install gensim

In [1]:
!pip install --upgrade gensim

Requirement already up-to-date: gensim in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages
Requirement already up-to-date: six>=1.5.0 in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: numpy>=1.11.3 in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: smart-open>=1.2.1 in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: scipy>=0.18.1 in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: bz2file in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: boto>=2.32 in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (from smart-open>=1.2.1->gensim)
Requirement already up-to-date: requests in /Applications/anaconda/envs/py3env/lib/python3.5/site-packages (f

##### Imports

In [2]:
from __future__ import print_function

In [3]:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Using TensorFlow backend.


Let's retain only a subset of the 20 categories in the original 20 Newsgroups Dataset.

In [4]:
# Set categories
categories = ['comp.graphics', 'rec.sport.baseball', 'rec.motorcycles', 'sci.space', 'alt.atheism']
# Download the training subset of the 20 NG dataset, with headers, footers, quotes removed
# Only keep docs from the 5 categories above
ng_train = datasets.fetch_20newsgroups(subset='train', categories=categories, 
                                      remove=('headers', 'footers', 'quotes'))

In [5]:
# Take a look at one of the docs (index 16)
ng_train.data[16]

'Archive-name: atheism/introduction\nAlt-atheism-archive-name: introduction\nLast-modified: 5 April 1993\nVersion: 1.2\n\n-----BEGIN PGP SIGNED MESSAGE-----\n\n                          An Introduction to Atheism\n                       by mathew <mathew@mantis.co.uk>\n\nThis article attempts to provide a general introduction to atheism.  Whilst I\nhave tried to be as neutral as possible regarding contentious issues, you\nshould always remember that this document represents only one viewpoint.  I\nwould encourage you to read widely and draw your own conclusions; some\nrelevant books are listed in a companion article.\n\nTo provide a sense of cohesion and progression, I have presented this article\nas an imaginary conversation between an atheist and a theist.  All the\nquestions asked by the imaginary theist are questions which have been cropped\nup repeatedly on alt.atheism since the newsgroup was created.  Some other\nfrequently asked questions are answered in a companion article.\n\n

## Document Preprocessing
We'll need to generate a term-document matrix of word (token) counts for use in LDA.

We'll use `sklearn`'s `CountVectorizer` to generate our term-document matrix of counts. A few of its parameters handle all of the text preprocessing we need:
* `analyzer='word'`: Tokenize by word (this is the default, so the code below does not set it explicitly)
* `ngram_range=(1, 2)`: Keep all unigrams and bigrams (1- and 2-word sequences)
* `stop_words='english'`: Remove all English stop words
* `token_pattern="\\b[a-z][a-z]+\\b"`: Keep only tokens made up of two or more letters (text is lowercased first, so `[a-z]` also covers uppercase input)
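
To make these options concrete, here is a minimal pure-Python sketch of roughly what they do to a single document. It is an illustration, not `CountVectorizer` itself: the `tokenize` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins (sklearn's English stop list is much longer), and the sample sentence is made up.

```python
import re

# The same pattern we pass to CountVectorizer: a word boundary,
# two or more lowercase letters, then another word boundary.
TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")

# Tiny hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "is", "to", "and"}


def tokenize(text, ngram_range=(1, 2)):
    """Lowercase, extract tokens, drop stop words, then build n-grams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


sample = "The Space Shuttle launched in 1993, a triumph of NASA."
terms = tokenize(sample)
print(terms)
```

Note that the number `1993` and the one-letter word `a` never become tokens, and that bigrams are built after stop words are removed, so `launched triumph` appears as a bigram even though the two words were not adjacent in the original sentence.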

In [6]:
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(ngram_range=(1, 2),  
                                   stop_words='english', token_pattern="\\b[a-z][a-z]+\\b")
count_vectorizer.fit(ng_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
# Create the term-document matrix
# Transpose it so the terms are the rows
counts = count_vectorizer.transform(ng_train.data).transpose()

In [8]:
counts.shape

(199825, 2852)

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [9]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(counts)
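
A gensim corpus is just an iterable of documents, where each document is a list of `(token_id, count)` pairs. `Sparse2Corpus` treats each column of the matrix as one document by default, which is why we transposed `counts` earlier. The sketch below mimics that conversion in plain Python on a made-up 3x3 matrix; `columns_to_bow` is a hypothetical helper written for illustration, not a gensim function.

```python
# Toy term-document matrix: rows are terms, columns are documents.
term_doc = [
    [2, 0, 1],  # counts of term 0 in documents 0, 1, 2
    [0, 3, 0],  # counts of term 1
    [1, 1, 0],  # counts of term 2
]


def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    bow_corpus = []
    for d in range(n_docs):
        doc = [(t, row[d]) for t, row in enumerate(matrix) if row[d] != 0]
        bow_corpus.append(doc)
    return bow_corpus


toy_corpus = columns_to_bow(term_doc)
for doc in toy_corpus:
    print(doc)
```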

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [10]:
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())

In [11]:
len(id2word)

199825
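
The inversion above works because `vocabulary_` maps each term to a unique row index. Here is the same idea on a made-up three-word vocabulary, written as a dict comprehension; `toy_vocabulary` is a hypothetical stand-in for `count_vectorizer.vocabulary_`.

```python
# Hypothetical stand-in for count_vectorizer.vocabulary_ (term -> row index).
toy_vocabulary = {"space": 2, "nasa": 1, "god": 0}

# Invert it so that each row index maps back to its term.
toy_id2word = {index: term for term, index in toy_vocabulary.items()}

print(toy_id2word[1])

# Every row index from 0 to len - 1 should appear exactly once.
print(sorted(toy_id2word) == list(range(len(toy_vocabulary))))
```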

## LDA
At this point we can simply plow ahead in creating an LDA model. It requires our corpus of word counts, the mapping of row ids to words, and the number of topics (6).

In [12]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus, num_topics=6, id2word=id2word, passes=10)

Let's take a look at what happened. Here are the 20 most important words for each of the 6 topics we found:

In [13]:
lda.print_topics(num_words=20)

[(0,
  '0.003*"image" + 0.002*"data" + 0.001*"software" + 0.001*"like" + 0.001*"just" + 0.001*"bike" + 0.001*"time" + 0.001*"don" + 0.001*"edu" + 0.001*"know" + 0.001*"think" + 0.001*"use" + 0.001*"available" + 0.001*"images" + 0.001*"processing" + 0.001*"analysis" + 0.001*"does" + 0.001*"graphics" + 0.001*"need" + 0.001*"new"'),
 (1,
  '0.003*"god" + 0.002*"like" + 0.002*"just" + 0.002*"people" + 0.002*"does" + 0.002*"dod" + 0.002*"argument" + 0.002*"know" + 0.002*"don" + 0.002*"atheism" + 0.001*"think" + 0.001*"true" + 0.001*"believe" + 0.001*"say" + 0.001*"good" + 0.001*"example" + 0.001*"atheists" + 0.001*"time" + 0.001*"evidence" + 0.001*"way"'),
 (2,
  '0.004*"space" + 0.002*"edu" + 0.002*"nasa" + 0.002*"graphics" + 0.001*"jpeg" + 0.001*"data" + 0.001*"ftp" + 0.001*"available" + 0.001*"information" + 0.001*"program" + 0.001*"image" + 0.001*"faq" + 0.001*"launch" + 0.001*"earth" + 0.001*"new" + 0.001*"pub" + 0.001*"shuttle" + 0.001*"send" + 0.001*"use" + 0.001*"files"'),
 (3,
  '0
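
Each topic in the output above is a single string of `weight*"word"` terms joined by `+`. If you want the weights as numbers, `lda.show_topic(topic_id, topn)` returns `(word, weight)` pairs directly. As a rough sketch of how the string format reads, the hypothetical `parse_topic` helper below pulls the pairs out of a shortened copy of the first topic.

```python
import re

# Shortened copy of the first topic string from the output above.
topic_string = '0.003*"image" + 0.002*"data" + 0.001*"software"'


def parse_topic(formatted):
    """Extract (word, weight) pairs from a weight*"word" + ... string."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', formatted)
    return [(word, float(weight)) for weight, word in pairs]


topic_terms = parse_topic(topic_string)
print(topic_terms)
```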

#### Topic Space
To map our documents into the topic space, we apply the `LdaModel` we trained above as a transformer, like so:

In [14]:
#lda.get_document_topics(bow5)

In [15]:
# Transform the docs from the word space to the topic space (like "transform" in sklearn)
lda_corpus = lda[corpus]
lda_corpus

<gensim.interfaces.TransformedCorpus at 0x11ec52048>

In [16]:
# Store the documents' topic vectors in a list so we can take a peek
lda_docs = [doc for doc in lda_corpus]

Now we can take a look at the document vectors in the topic space. Each vector lists the weight of the document on each topic, and gensim leaves out topics whose weight is very small. A document vector can therefore have at most `num_topics` nonzero components, and most have far fewer.

In [17]:
# Check out the document vector in the topic space for a single document
lda_docs[6]

[(3, 0.98183259094617337)]
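
A sparse vector like the one above lists only the topics that carry weight. For clustering or plotting you usually want a fixed-length dense vector instead. gensim provides `matutils.sparse2full(doc, length)` for this; the sketch below does the same thing in plain Python so the mechanics are visible. The `to_dense` helper and the rounded weight are illustrative assumptions, not part of the notebook's results.

```python
def to_dense(sparse_vec, num_topics):
    """Expand a list of (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = weight
    return dense


# Rounded version of the document vector shown above.
doc_topics = [(3, 0.98)]

dense_vec = to_dense(doc_topics, num_topics=6)
print(dense_vec)

# The dominant topic is the one with the largest weight.
dominant_topic = max(doc_topics, key=lambda pair: pair[1])[0]
print(dominant_topic)
```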

In [18]:
ng_train.data[5]

'\nI was *hoping* somebody would mention clutch.  Clutch?  Baerga?  The\ntwo words simply do not go together.  With runners in scoring\nposition, Baerga batted .308/.366/.418 last year.  This doesn\'t quite\n*suck*, but most batters hit *better* in this situation.\n\nAlomar?  He hit .354/.439/.517 with runners in scoring position!\n\nThe difference?  Alomar had 68 RBIs in 147 such AB.  Baerga had 81\nRBIs in 182 such AB.  Baerga got 25% more chances, yet succeeded only\n20% more times.\n\nFrankly, I don\'t believe in clutch.  But if I did, my vote would\ngo to Alomar for MVP (let alone "best 2B in the AL").'

## On your own...
- Go get some of the NIPS papers from [here](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
- Try performing LDA on this data with gensim.
- Play with some of the preprocessing options and LDA parameters, and observe what happens.
- See if you can use the resulting topic space to extract topic vectors and cluster some documents.
- How do your results look?