In [None]:
# Interactive visualization of topics in our dataset

Below is an IPython notebook presenting the results of applying LDA to our already preprocessed dataset. It uses the raw bag-of-words (BoW) representation; tf-idf weighting could be added in the future.

First load the data and drop unneeded columns.

In [1]:
# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import gensim.models.ldamulticore as lda
from gensim import matutils as mu
import pyLDAvis as pld
#note: newer pyLDAvis releases (3.3 and later) renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
import gensim.corpora.dictionary as gensimdict
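filename = 'data/dataset.txt'
#the file is semicolon-separated; pass the separator by keyword
X = pd.read_csv(filename, sep=';')
#drop the label columns, only the term counts are needed
X = X.drop(['category', 'project'], axis=1)
vocab = list(X)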

## Additional preprocessing

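We don't need terms that appear too often or too rarely in our dataset: very frequent terms tend to dominate the topics, while very rare ones are mostly noise. The cell below drops every term whose total count across the dataset is at most 20 or at least 500.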

In [3]:
#get rid of terms that are too rare (<= 20 occurrences) or too frequent (>= 500)
droplist = []
for item in vocab:
    total = X[item].sum()
    if total <= 20 or total >= 500:
        droplist.append(item)
X.drop(droplist, axis=1, inplace=True, errors='ignore')

vocab = list(X)
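
The same filter can also be written without an explicit loop. The cell below is a minimal sketch on a made-up three-term matrix (`X_toy` and its column names are hypothetical, not taken from our dataset); it assumes, as above, that rows are documents and columns are term counts.

```python
import pandas as pd

# Toy document-term matrix (hypothetical data): rows are documents, columns are terms.
X_toy = pd.DataFrame({
    "rare": [1, 0, 0],
    "useful": [10, 15, 5],
    "common": [300, 200, 100],
})

LOW, HIGH = 20, 500  # same thresholds as the loop above

# Total count of each term across all documents.
totals = X_toy.sum(axis=0)

# Keep only terms strictly between the two thresholds.
keep = totals[(totals > LOW) & (totals < HIGH)].index
X_filtered = X_toy[keep]

vocab_toy = list(X_filtered)
print(vocab_toy)
```

Selecting the columns to keep, rather than dropping in place, leaves the original frame untouched, which makes it easier to try different thresholds.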

## LDA

We use the parallel (multicore) flavor of gensim's LDA module. id2word_dict is a dictionary that maps numeric IDs to words; LDA needs it to produce readable output.

In [7]:
#dictionary that maps index to word. We'll need it later for LDA
id2word_dict = dict(enumerate(vocab))
#Dense2Corpus treats columns as documents by default, so transpose the (documents x terms) matrix
corpus = mu.Dense2Corpus(X.values.T)
#below is how it was trained; for now we'll use the saved version of it. Much like a cooking show!
#model = lda.LdaMulticore(corpus, num_topics=10, id2word=id2word_dict, workers=3, iterations=1000, passes=3)
#load() is a classmethod that returns the model, so its result must be assigned
model = lda.LdaMulticore.load('data/lda')
model.show_topics(num_words=10, formatted=True)

[(0,
  '0.004*jakewharton + 0.004*rxbinding + 0.004*assertBothWays + 0.004*acceptance + 0.003*Observable + 0.003*jsonpath + 0.003*jayway + 0.003*JUnitCore + 0.003*HierarchicalStreamReader + 0.003*addOption'),
 (1,
  '0.004*peer + 0.004*Buffer + 0.003*LOGGER + 0.003*CSVFormat + 0.003*AbstractApplication + 0.003*XMLUnit + 0.003*readAscii + 0.003*XMLStreamReader + 0.003*ComparisonResult + 0.002*TLS'),
 (2,
  '0.003*PropertyChangeListener + 0.003*JButton + 0.003*JPanel + 0.003*SwingUtilities + 0.003*firePropertyChange + 0.003*JXTree + 0.003*defaults + 0.003*Insets + 0.003*AbstractAction + 0.003*JFrame'),
 (3,
  '0.003*AccessToken + 0.003*LatLng + 0.003*putString + 0.003*gms + 0.003*amazon + 0.003*Utility + 0.003*EXTRA + 0.003*ResponseData + 0.003*BeansManager + 0.002*geo'),
 (4,
  '0.006*mvp + 0.004*jcommander + 0.004*beust + 0.004*DocType + 0.003*presenter + 0.003*Parameter + 0.003*Mail + 0.003*Verifier + 0.003*AtomicReference + 0.003*fstack'),
 (5,
  '0.004*SuperCsvTestUtils + 0.004*PREF
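
To make the corpus format more concrete: gensim represents each document as a sparse list of (term id, count) pairs, and id2word_dict is what turns those ids back into words. The cell below is a rough plain-Python sketch of that idea on made-up data (`vocab_toy`, `rows` and `to_bow` are hypothetical names for illustration); it does not call gensim, and the real Dense2Corpus yields float values rather than ints.

```python
# Hypothetical toy data: 2 documents over a 3-term vocabulary.
vocab_toy = ["JButton", "JPanel", "LatLng"]
rows = [
    [2, 1, 0],  # document 0
    [0, 0, 3],  # document 1
]

# id -> word mapping, built the same way as id2word_dict above.
id2word_toy = dict(enumerate(vocab_toy))


def to_bow(row):
    """Return the sparse (term_id, count) pairs for one document."""
    return [(term_id, count) for term_id, count in enumerate(row) if count > 0]


corpus_toy = [to_bow(row) for row in rows]
print(corpus_toy)

# Translate the ids of document 0 back into words.
print([(id2word_toy[term_id], count) for term_id, count in corpus_toy[0]])
```

Only non-zero counts are stored, which is why a wide, mostly empty term matrix stays cheap to iterate over.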

## Prepare a better visualization

We're going to use the pyLDAvis module to produce an interactive plot of our data. The people who built the module were kind enough to provide helpers that smooth the process for models trained with the gensim toolkit. What's needed is the model, the corpus, and a gensim dictionary, which we "generate" a posteriori from our vocabulary.




In [8]:
#pyLDAvis expects a gensim Dictionary, which maps word -> id, so reverse our id -> word dict
id2word_dict_rev = {v: k for k, v in id2word_dict.items()}
#start from an empty Dictionary and fill in the mapping by hand
visdict = gensimdict.Dictionary()
visdict.token2id = id2word_dict_rev
visdata = gensimvis.prepare(model, corpus, visdict)


In [9]:
#voila!
pld.display(visdata)