# Summary

The goal is here to compare the different lda that we coded so far. It's time to pick the best one, to test our data we already used a synthetic dataset which works alost perfectly with all these models.
Now we can try on the Edimbourgh dataset (we don't really have a way to evaluate its performance actually) and on the AP corpus (here we can compare with Blei results and we know that the data are clean enough).

On the edimbourgh, we have a lot of redundant words in the different topics. I think it's because of the nature of the data because all the models provide these results, it may not be only an algo issue.
We should gather more data from the Yelp review to experiment.

On the AP corpus, the lda package provides really good results for 100 iterations in around 15 minutes whereas our models (only got result for the lda minibatch of size 10 without ordering and lda batch with executed in 2h30) provide poor results with still a lot of repetitions.
It seems that we are missing something because Blei and the gibbs sampler version provides better results.

Based on our experiments, we should focus on the mini-batch lda (for a decent number of docs per batch, to prevent the sensitivity at the initialization) and benchmarking the necessity to order also.


### Next Steps

* Improving the mini-batch lda:
    * Functional Cross validation for topics number and (tau, kappa)
    * Compute the gamma matrix on a larger review dataset from Yelp (for instance: Las Vegas)
    
* Multi-class classification: Using the features from the documents-topics distribution in the models
    * multi class logistic regression
    * one-vs-all SVM
    
* Recommender system: I listed the different possibilities
    * Latent features: from a matrix factorisation on the sparse matrix of ratings (user_id, business) to find the user and restaurants features.
    * Computation of a similarity graph between the restaurant: use of the euclidian distance on the different businesses with their features (or in two stages: KL divergence on the topic distribution as it's a probability distribution for each document ==> provide one KL score for each business then compute the euclidean distance on the other numerical feature (checkin, categories...) concatenate to the KL score (possibility of adding a weight to emphasize the KL divergence))
    * Clustering algorithms on our numerical features (k-means)



In [1]:
import sys
sys.path.append('/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages')

In [2]:
import LDA
import lda
import time
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import psi
import collections
import json
from scipy import sparse
import sklearn.cluster
import sklearn.decomposition

In [16]:
with open('temp/Edi10.json', 'r') as fp:
    reduced_edi = json.load(fp)

with open('temp/word_to_index.json', 'r') as fp:
    word_to_index = json.load(fp)

with open('temp/index_to_word.json', 'r') as fp:
    index_to_word = json.load(fp)

with open('temp/bid_to_index.json', 'r') as fp:
    bid_to_index = json.load(fp)

with open('temp/Edi10.json', 'r') as fp:
    index_to_bid = json.load(fp)

In [17]:
vocab10 = word_to_index.keys()
nonzero_data = []
rows_s = []
cols_s = []

for k in reduced_edi.keys():
    counter = collections.Counter(reduced_edi[k])
    nonzero_data += counter.values()
    rows_s += [bid_to_index[k]]*len(counter.values())
    cols_s += [word_to_index[ck] for ck in counter.keys()]

sparse_mat = sparse.csc_matrix((nonzero_data,(rows_s,cols_s)),shape = (len(bid_to_index),len(word_to_index)))

dtm_edi = sparse_mat.toarray()

In [18]:
dtm_edi.shape

(667, 2313)

In [46]:
# Routines used in the lda
def e_step(d, corpus, gamma_d_k, lambda_, lambda_int, alpha, threshold):
    '''
    Routine for the lda function
    '''
    # Info for the current doc
    ids = np.nonzero(corpus[d, :])[0]
    counts = corpus[d, ids]

    gamma_d = gamma_d_k[d, :]
    E_log_beta = digamma(lambda_)[:, ids]
    for t in xrange(100):  # TODO: Wait for convergence
        # Used to check convergence
        old_gamma = gamma_d

        E_log_thetad = digamma(gamma_d)

        # shape of phi is (num_topics, len(ids))
        phi = np.exp(E_log_beta + E_log_thetad[:, np.newaxis])
        phi /= phi.sum(axis=0)

        gamma_d = alpha + np.dot(phi, counts)


        # Check if convergence
        if (np.mean((gamma_d - old_gamma)**2) < threshold):
            break

    gamma_d_k[d, :] = gamma_d
    lambda_int[:, ids] += counts[np.newaxis, :] * phi

    return gamma_d_k, lambda_int

def digamma(data):
    if (len(data.shape) == 1):
        return psi(data) - psi(np.sum(data))
    return psi(data) - psi(np.sum(data, axis=1))[:, np.newaxis]


def get_samples(C, S, max_iter):
    batches_temp = np.zeros(C * max_iter, dtype=int)
    sample = np.arange(C, dtype=int)
    for k in xrange(max_iter):
        batches_temp[k * C: (k + 1) * C] = sample
    # List of mini-batches
    batches = np.split(batches_temp, C * max_iter / S)
    return batches

# Display the selected topics
def print_topic_words(lambda_, vocabulary, num_topics, num_words):
    '''
    Display the first num_words for the topic distribution lambda_ from a
    vocabulary.
    '''
    for t in xrange(num_topics):
        topic_distribution = sorted([(i, p) for i, p in enumerate(lambda_[t, :])], key=lambda x: x[1], reverse=True)
        top_words = [vocabulary[tup[0]] for tup in topic_distribution[:num_words]]
        print 'Topic number ', t
        print top_words


In [20]:
def batch_lda(corpus, lambda_=None, num_topics=10, num_iter=10, alpha=0.5, eta=0.001, threshold=0.000001):
    '''
    Batch Variational Inference EM algorithm for LDA.
    (from algorithm 1 in Blei 2010)
    corpus is a list of lists of [word_index, count] for each document
    corpus is a matrix of count: (docs, voca)
    Args:
        lambda_: to set a specific lambda for the initialization
    '''
    C, V = corpus.shape

    # Initialisation
    if not np.any(lambda_):
        lambda_ = np.random.gamma(100, 1./100, size=(num_topics, V))
    else:
        lambda_ = lambda_.copy()

    gamma_d_k = np.ones((C, num_topics))
    sample = range(C)
    np.random.shuffle(sample)

    for t in xrange(num_iter):
        old_lambda_ = lambda_
        # #### E-step
        lambda_int = np.zeros((num_topics, V))
        for d in sample:
            gamma_d_k, lambda_int = e_step(d, corpus, gamma_d_k, lambda_,
                                           lambda_int, alpha, threshold)

        # #### M-step
        lambda_ = eta + lambda_int

        # Check if convergence
        if (np.mean(np.abs((lambda_ - old_lambda_) / old_lambda_)) < threshold):
            break

    return lambda_, gamma_d_k


In [29]:
def stochastic_lda_ordering(corpus, lambda_=None, S=1, num_topics=10, max_iter=300, tau=1, kappa=0.5, alpha=0.5, eta=0.001, threshold=0.000001):
    '''
    Stochastic Variational Inference EM algorithm for LDA.
    (from algorithm 2 in Blei 2010)
    corpus is a list of lists of [word_index, count] for each document
    corpus is a matrix of count: (docs, voca)
    Args:
        lambda_: to set a specific lambda for the initialization
        batches: to set an order on the use of the corpus
        S: size of the mini-batches
    '''
    C, V = corpus.shape

    # Initialisation
    if not np.any(lambda_):
        lambda_ = np.random.gamma(100, 1./100, size=(num_topics, V))
    else:
        lambda_ = lambda_.copy()

    gamma_d_k = np.ones((C, num_topics))

    # Sampling
    idx = range(C)
    # TODO: better splitting because here we may not consider some documents
    # and could raise an error if C/S does not split idx in equal parts.
    batches = np.array_split(idx, C/S)
    
    for it in xrange(max_iter):
        for t in xrange(len(batches)):
            # #### E-step
            lambda_int = np.zeros((num_topics, V))

            for d in batches[t]:
                gamma_d_k, lambda_int = e_step(d, corpus, gamma_d_k, lambda_, lambda_int, alpha, threshold)

            # #### M-step
            rho = (tau + t)**(-kappa)
            indices = np.unique(np.nonzero(corpus[batches[t], :])[1])
            lambda_int = eta + C / (1. * S) * lambda_int
            lambda_[:, indices] = (1 - rho)*lambda_[:, indices] + rho*lambda_int[:, indices]
        
        if it % 10 == 0:
            sorted_model  = sklearn.cluster.KMeans(n_clusters  = num_topics)
            sorted_index = np.argsort(sorted_model.fit_predict(gamma_d_k))
            batches = np.array_split(sorted_index,C/S)

    return lambda_, gamma_d_k

In [47]:

def stochastic_lda(corpus, batches=None, lambda_=None,
                   ordering=False, S=1, num_topics=10, max_iter=300, tau=1,
                   kappa=0.5, alpha=0.5, eta=0.001, threshold=0.000001):
    '''
    Stochastic Variational Inference EM algorithm for LDA.
    (from algorithm 2 in Blei 2010)
    corpus is a list of lists of [word_index, count] for each document
    corpus is a matrix of count: (docs, voca)
    Args:
        lambda_: to set a specific lambda for the initialization
        batches: to set an order on the use of the corpus
        S: size of the mini-batches
    '''
    C, V = corpus.shape

    # Initialisation
    if not np.any(lambda_):
        lambda_ = np.random.gamma(100, 1./100, size=(num_topics, V))
    else:
        lambda_ = lambda_.copy()

    gamma_d_k = np.ones((C, num_topics))

    # Sampling
    if not np.any(batches):
        batches = get_samples(C, S, max_iter)

    for t in xrange(len(batches)):
        # #### E-step
        lambda_int = np.zeros((num_topics, V))

        for d in batches[t]:
            gamma_d_k, lambda_int = e_step(d, corpus, gamma_d_k, lambda_, lambda_int, alpha, threshold)

        # #### M-step
        rho = (tau + t)**(-kappa)
        indices = np.unique(np.nonzero(corpus[batches[t], :])[1])
        lambda_int = eta + C / (1. * S) * lambda_int
        lambda_[:, indices] = (1 - rho)*lambda_[:, indices] + rho*lambda_int[:, indices]

    return lambda_, gamma_d_k


# Edimbourgh Reviews
Test on the edimbourgh reviews with three different models:
* batch lda: classical EM version where we iterate over the whole document at each E step (algo 1 in Blei 2010). The update in the M step consider only the new evaluation of lambda in the previous E step (because no approximation is done)
* mini-batch lda: we go only over a mini-batch in the E step (here 10). There is the issue relative to the sensitivity to the order of the corpus, this may be handled with the ordering version below.
* mini-batch with ordering

In [56]:
%%time
lambda_batch, gamma_batch = batch_lda(dtm_edi, num_topics=10, num_iter=200, alpha=0.001, eta=0.001, threshold=0.000001)

CPU times: user 1min 59s, sys: 783 ms, total: 1min 59s
Wall time: 2min


In [57]:
%%time
lambda_stochastic, gamma_stochastic = stochastic_lda_ordering(dtm_edi, S=10, num_topics=10, max_iter=200, tau=500, kappa=0.6, alpha=0.001, eta=0.001, threshold=0.000001)

CPU times: user 2min 31s, sys: 1.22 s, total: 2min 32s
Wall time: 2min 33s


In [58]:
%%time
lambda_stochastic_no_ordering, gamma_stochastic_no_ordering = stochastic_lda(dtm_edi, S=10, num_topics=10, max_iter=200, tau=500, kappa=0.6, alpha=0.001, eta=0.001, threshold=0.000001)

CPU times: user 2min 22s, sys: 953 ms, total: 2min 23s
Wall time: 2min 24s


In [70]:
model = lda.LDA(10)
%time model.fit(dtm_edi)

CPU times: user 1min 22s, sys: 287 ms, total: 1min 23s
Wall time: 1min 23s


<lda.lda.LDA instance at 0x106781ea8>

In [69]:
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab10)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: spanish shop oil mile film measure snag strawberry offer £8
Topic 1: component lassi craft energy elephant gnocchi bum beet £5 rosemary
Topic 2: fancy website habit elephant brewery satay kudo spanish crepe high
Topic 3: spanish oil mile scale employee plate rump gym £8 gathering
Topic 4: band problem hash topping charcuterie meaning katsu acoustic challenge absence
Topic 5: village muffin waffle premium plate process vicinity oil video spot
Topic 6: habit warming octopus confusing work relax snag muffin satay envy
Topic 7: delicacy village oil sister gym pancake confusing premium waffle dairy
Topic 8: oil pine energy kilo conclusion whilst packet croissant village yoghurt
Topic 9: oil premium village watermelon rigatoni energy waffle pine plate business


In [61]:
print_topic_words(lambda_stochastic_no_ordering, vocab10, 10, 10)

Topic number  0
[u'oil', u'spanish', u'village', u'plate', u'mile', u'gym', u'scale', u'employee', u'premium', u'waffle']
Topic number  1
[u'broccoli', u'crust', u'hazelnut', u'stand', u'veggie', u'general', u'husband', u'pal', u'brie', u'banana']
Topic number  2
[u'gastropub', u'package', u'pairing', u'stroll', u'marble', u'request', u'veggie', u'score', u'caf\xe9s', u'bonus']
Topic number  3
[u'eating', u'research', u'crust', u'spoonful', u'shoulder', u'relaxing', u'icecream', u'pence', u'broccoli', u'design']
Topic number  4
[u'crust', u'spoonful', u'research', u'eating', u'broccoli', u'icecream', u'relaxing', u'pence', u'shoulder', u'travel']
Topic number  5
[u'research', u'crust', u'spoonful', u'eating', u'broccoli', u'relaxing', u'icecream', u'shoulder', u'pence', u'nut']
Topic number  6
[u'punk', u'feature', u'nut', u'kilt', u'twist', u'shellfish', u'nonsense', u'excuse', u'teeny', u'request']
Topic number  7
[u'rigatoni', u'fall', u'selling', u'slab', u'liver', u'hang', u'gap',

In [62]:
print_topic_words(lambda_batch, vocab10, 10, 10)

Topic number  0
[u'conclusion', u'energy', u'gnocchi', u'spanish', u'charcuterie', u'twist', u'business', u'hall', u'mile', u'humus']
Topic number  1
[u'component', u'habit', u'spanish', u'warming', u'oil', u'mile', u'confusing', u'octopus', u'scale', u'plate']
Topic number  2
[u'oil', u'village', u'plate', u'spanish', u'premium', u'mile', u'muffin', u'waffle', u'gym', u'rump']
Topic number  3
[u'spanish', u'oil', u'delicacy', u'band', u'mile', u'plate', u'rump', u'gym', u'employee', u'scale']
Topic number  4
[u'oil', u'spanish', u'fancy', u'website', u'habit', u'plate', u'village', u'mile', u'employee', u'scale']
Topic number  5
[u'spanish', u'oil', u'habit', u'mile', u'kilo', u'scale', u'work', u'whilst', u'warming', u'octopus']
Topic number  6
[u'spanish', u'band', u'hash', u'oil', u'problem', u'katsu', u'mile', u'measure', u'film', u'employee']
Topic number  7
[u'oil', u'spanish', u'village', u'premium', u'waffle', u'mile', u'gym', u'plate', u'scale', u'employee']
Topic number  8
[

In [63]:
print_topic_words(lambda_stochastic, vocab10, 10, 10)

Topic number  0
[u'punk', u'request', u'kilt', u'nonsense', u'teeny', u'frame', u'cheer', u'platter', u'plethora', u'reservation']
Topic number  1
[u'walk', u'kilo', u'travel', u'deliciousness', u'whilst', u'meze', u'croissant', u'pork', u'warmth', u'award']
Topic number  2
[u'rigatoni', u'broccoli', u'fall', u'selling', u'hang', u'liver', u'slab', u'gap', u'brim', u'waste']
Topic number  3
[u'feature', u'nut', u'twist', u'shellfish', u'excuse', u'wood', u'humus', u'shake', u'cheeseburger', u'vegetarian']
Topic number  4
[u'crust', u'gastropub', u'package', u'stroll', u'bonus', u'sharing', u'marble', u'\xa315', u'outing', u'weekday']
Topic number  5
[u'habit', u'fancy', u'warming', u'website', u'octopus', u'spoonful', u'spanish', u'research', u'eating', u'component']
Topic number  6
[u'spoonful', u'research', u'eating', u'crust', u'icecream', u'shoulder', u'relaxing', u'broccoli', u'pence', u'need']
Topic number  7
[u'eating', u'spoonful', u'research', u'crust', u'icecream', u'relaxing

# AP Corpus
The models on the dataset of Edimbourgh provides a lot of redundancy, one reasion may be the high sparsity of the data. We can try our models on cleaner data set also as the AP corpus.
The comparison need to be donne with Blei's results: 
http://www.cs.princeton.edu/~blei/lda-c/ap-topics.pdf

In [5]:
# Read the vocabulary
ap_vocabulary = np.loadtxt('ap/vocab.txt', dtype=str)
V = len(ap_vocabulary)

# Read the data
# Output format is a list of document (corpus) with
# document: array([[index1, count1], ... , [index2, count2]])

# To build the sparse matrix
counts = []
row_ind = []
col_ind = []

with open('ap/ap.dat', 'r') as f:
    for i, row in enumerate(f):
        # row format is:
        #    [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]
        row_raw = row.split(' ')
        M = int(row_raw[0])
        document = np.zeros((M, 2))

        row_ind += M*[i]
        for j, w in enumerate(row_raw[1:]):
            document[j, :] = [int(u) for u in w.split(':')]
        counts += list(document[:, 1])
        col_ind += list(document[:, 0])

# corpus size
C = i + 1

# Building the corpus matrix
ap_corpus = sparse.csc_matrix((counts, (row_ind, col_ind)), shape=(C, V))
ap_corpus = ap_corpus.toarray()

In [4]:
ap_corpus = ap_corpus.astype(int)

In [6]:
# Choosing the number of topics
num_topics=100

In [90]:
model = lda.LDA(num_topics)
%time model.fit(ap_corpus)

CPU times: user 15min 7s, sys: 3.66 s, total: 15min 11s
Wall time: 15min 15s


<lda.lda.LDA instance at 0x101177638>

In [91]:
%%time
lambda_batch, gamma_batch = batch_lda(ap_corpus, num_topics=num_topics, num_iter=100, alpha=0.001, eta=0.001, threshold=0.000001)

CPU times: user 2h 12min 31s, sys: 10min 55s, total: 2h 23min 26s
Wall time: 2h 23min 28s


In [92]:
%%time
lambda_stochastic, gamma_stochastic = stochastic_lda_ordering(ap_corpus, S=10, num_topics=num_topics, max_iter=100, tau=500, kappa=0.6, alpha=0.001, eta=0.001, threshold=0.000001)

CPU times: user 2h 13min 9s, sys: 12min 28s, total: 2h 25min 37s
Wall time: 2h 25min 31s


In [93]:
%%time
lambda_stochastic_no_ordering, gamma_stochastic_no_ordering = stochastic_lda(ap_corpus, S=10, num_topics=num_topics, max_iter=100, tau=500, kappa=0.6, alpha=0.001, eta=0.001, threshold=0.000001)

KeyboardInterrupt: 

In [94]:
print_topic_words(lambda_stochastic, vocabulary, 100, 10)

Topic number  0
['fleisher', 'blackburn', 'pomona', 'leek', 'shuster', 'beazley', 'amirav', 'transistor', 'zennoh', 'bordallo']
Topic number  1
['bordallo', 'sowan', 'morgenstern', 'blackburn', 'quaker', 'hasselbring', 'sheftel', 'ticketron', 'anasazi', 'lobban']
Topic number  2
['coasters', 'leek', 'barbershop', 'quints', 'popeye', 'tartus', 'wallenda', 'hildreths', 'shiley', 'buckey']
Topic number  3
['harwood', 'bst', 'patmos', 'pekin', 'hastings', 'gaviria', 'bordallo', 'teeley', 'chatham', 'menem']
Topic number  4
['coasters', 'galileo', 'alpo', 'seidon', 'southwell', 'lukanov', 'guterman', 'thyssen', 'transistor', 'poe']
Topic number  5
['storks', 'blackburn', 'turtles', 'ganyile', 'hazelwood', 'mpaa', 'chatham', 'fleisher', 'etruscan', 'dalai']
Topic number  6
['gisclair', 'quints', 'corkscrew', 'chiquita', 'kolb', 'erols', 'dalai', 'transistor', 'kiesner', 'brides']
Topic number  7
['etruscan', 'mydland', 'leek', 'bridesmaids', 'jal', 'herons', 'putnam', 'erols', 'blackowned', 

In [95]:
print_topic_words(lambda_batch, vocabulary, 100, 10)

Topic number  0
['million', 'network', 'bush', 'president', 'aristide', 'skull', 'bones', 'television', 'cnn', 'ptl']
Topic number  1
['american', 'i', 'new', 'gotti', 'blood', 'blier', 'says', 'filed', 'people', 'year']
Topic number  2
['i', 'rights', 'grigoryants', 'new', 'last', 'southern', 'year', 'force', 'authority', 'committee']
Topic number  3
['police', 'people', 'force', 'air', 'government', 'two', 'city', 'killed', 'today', 'thursday']
Topic number  4
['mall', 'downtown', 'report', 'lincoln', 'communist', 'cities', 'senators', 'malls', 'group', 'released']
Topic number  5
['i', 'people', 'mrs', 'police', 'years', 'president', 'two', 'three', 'government', 'think']
Topic number  6
['year', 'company', 'police', 'two', 'chinn', 'three', 'meese', 'men', 'record', 'years']
Topic number  7
['northern', 'hair', 'training', 'dna', 'temperatures', 'new', 'family', 'count', 'band', 'increase']
Topic number  8
['estate', 'mrs', 'meese', 'home', 'property', 'people', 'tax', 'federal', '

In [97]:
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: party political elections opposition government election minister vote parties first
Topic 1: military rebels government president guerrillas coup last human war civil
Topic 2: people india indian government state percent million militants population pakistan
Topic 3: prison death prisoners years released release state day convicted human
Topic 4: hostages lebanon mecham beirut christian hezbollah today syrian hijackers held
Topic 5: church catholic pope john religious roman paul vatican bishops rev
Topic 6: court judge case filed supreme ruling federal appeals order lawsuit
Topic 7: lincoln senators five keating regulators deconcini senate dennis jr morris
Topic 8: south africa black african government mandela de blacks national apartheid
Topic 9: trial charges court case judge guilty attorney jury prison convicted
Topic 10: workers union contract labor strike jobs employees work company unions
Topic 11: world war th american ago years history ii president today
Topic 12: sto

# Test Virgile

In [3]:
def rho(tau,kappa,t):
	return pow(tau + t, - kappa)

def digamma(mat):
	if (len(mat.shape) == 1):
		return(psi(mat) - psi(np.sum(mat)))
	else:
		return(psi(mat) - psi(np.sum(mat, 0))[np.newaxis,:])

In [52]:
def lda_batch(dtm,ntopic,batch_size,tau,kappa):
	nvoc = dtm.shape[1]
	ndoc = dtm.shape[0]
	nu = 1./ntopic
	alpha = 1./ntopic

	topics = np.random.gamma(100.,1./100.,(nvoc,ntopic))
	gamma  = np.random.gamma(100.,1./100.,(ndoc,ntopic))

	numbatch = ndoc / batch_size
	batches = np.array_split(range(ndoc),numbatch)


	for it_batch in range(numbatch):
		ELogBeta = digamma(topics)
		ExpELogBeta = np.exp(ELogBeta)
		
		temp_topics = np.zeros(topics.shape)

		indices = []

		for d in batches[it_batch]:
			# print d
			ids = np.nonzero(dtm[d,:])[0]
			indices.extend(ids)
			cts = dtm[d,ids]
			ExpELogBetad = ExpELogBeta[ids,:]

			gammad = gamma[d,:]
			ElogTethad = digamma(gammad)
			ExpLogTethad = np.exp(ElogTethad)

			# print gammad

			for inner_it in range(1000):
				
				oldgammad = gammad

				phi =  ExpLogTethad * ExpELogBetad
				phi = phi / (phi.sum(axis=1)+0.00001)[:, np.newaxis]

				gammad = alpha + np.dot(cts,phi)

				ElogTethad = digamma(gammad)
				ExpLogTethad = np.exp(ElogTethad)
				# print gammad

				if np.mean((gammad-oldgammad)**2)<0.0000001:
					break

			#print inner_it
			gamma[d,:] = gammad

			temp_topics[ids,:] += phi * cts[:,np.newaxis]

		indices = np.unique(indices)

		rt = rho(tau,kappa,it_batch)

		topics[indices] = (1 - rt) * topics[indices,:] + rt * ndoc * (nu + temp_topics[indices,:]) / len(batches[it_batch])

	return topics,gamma

In [8]:
def lda_batch2(dtm,ntopic,batch_size,tau,kappa,itemax):
	nvoc = dtm.shape[1]
	ndoc = dtm.shape[0]
	nu = 1./ntopic
	alpha = 1./ntopic

	topics = np.random.gamma(100.,1./100.,(nvoc,ntopic))
	phi = np.random.gamma(100.,1./100.,(nvoc,ntopic))
	gamma  = np.random.gamma(100.,1./100.,(ndoc,ntopic))

	numbatch = ndoc / batch_size
	batches = np.array_split(range(ndoc),numbatch)

	for i in range(itemax):
		for it_batch in range(numbatch):
			ELogBeta = digamma(topics)
			ExpELogBeta = np.exp(ELogBeta)
			
			temp_topics = np.zeros(topics.shape)

			indices = []

			for d in batches[it_batch]:
				# print d
				ids = np.nonzero(dtm[d,:])[0]
				indices.extend(ids)
				cts = dtm[d,ids]
				ExpELogBetad = ExpELogBeta[ids,:]

				gammad = gamma[d,:]
				ElogTethad = digamma(gammad)
				ExpLogTethad = np.exp(ElogTethad)

				# print gammad

				for inner_it in range(1000):
					
					oldgammad = gammad

					phi =  ExpLogTethad * ExpELogBetad
					phi = phi / (phi.sum(axis=1)+0.00001)[:, np.newaxis]

					gammad = alpha + np.dot(cts,phi)

					ElogTethad = digamma(gammad)
					ExpLogTethad = np.exp(ElogTethad)
					# print gammad

					if np.mean((gammad-oldgammad)**2)<0.0000001:
						break

				print inner_it
				gamma[d,:] = gammad

				temp_topics[ids,:] += phi * cts[:,np.newaxis]

			indices = np.unique(indices)

			rt = rho(tau,kappa,it_batch)

			topics[indices] = (1 - rt) * topics[indices,:] + rt * ndoc * (nu + temp_topics[indices,:]) / len(batches[it_batch])

	return topics,gamma

In [6]:
%%time
vir = lda_batch(ap_corpus,100,1,1,0)

265
275
999
81
74
848
454
108
7
51
27
164
140
14
176
94
115
147
165
126
29
9
7
36
232
509
318
17
339
159
182
105
13
121
83
76
182
21
30
14
291
59
178
74
157
202
43
276
21
55
45
28
188
117
114
37
41
90
44
35
111
142
30
72
20
28
42
50
25
95
226
30
27
37
24
63
50
53
112
51
143
48
110
25
43
44
41
32
43
21
41
76
37
43
31
46
41
12
49
44
36
46
5
32
23
43
25
40
41
43
44
33
20
23
139
69
23
29
35
31
21
18
51
34
28
37
49
42
40
32
29
44
38
37
40
118
47
20
62
61
41
60
74
102
37
83
25
43
111
43
42
33
38
47
65
32
62
60
50
41
63
29
44
73
16
29
103
52
32
90
15
47
15
72
58
31
69
36
83
29
36
36
155
71
38
20
13
34
32
38
42
35
37
38
120
40
23
36
26
21
41
41
75
74
40
49
49
29
17
97
32
93
19
39
41
53
21
26
38
13
48
15
72
43
195
48
17
18
23
35
39
32
92
32
39
92
48
24
40
35
51
34
64
25
24
79
5
38
28
62
44
84
36
59
41
44
53
6
44
20
54
191
34
36
114
39
23
89
35
23
29
29
57
32
19
37
49
46
24
38
66
48
23
45
25
48
91
44
56
33
65
38
48
29
31
25
34
14
47
39
22
34
22
88
23
35
41
46
56
43
40
35
54
22
53
11
39
17
39
30


In [9]:
%%time
vir2 = lda_batch2(ap_corpus,100,1,1,0,2)

282
545
199
260
61
924
159
74
7
118
13
64
312
14
217
230
89
591
241
246
31
9
7
16
203
186
241
35
455
511
234
180
12
130
167
104
182
26
26
21
319
23
387
54
89
108
47
158
25
170
22
36
244
39
56
48
42
48
88
5
25
83
38
71
23
23
30
38
19
34
33
23
28
19
34
63
69
34
234
36
133
81
68
21
46
122
35
13
17
20
34
30
75
41
68
34
44
12
32
26
55
40
4
51
22
34
52
51
56
29
42
39
47
19
35
43
329
20
156
19
117
19
16
35
21
43
111
12
25
26
40
46
28
62
44
87
104
18
36
248
32
43
36
31
53
35
18
130
9
36
23
17
46
23
64
27
68
25
55
29
26
28
64
37
32
28
80
104
27
79
24
100
32
30
35
49
30
24
101
23
43
46
60
44
47
16
16
25
42
44
21
50
45
34
105
70
30
23
28
28
44
38
41
22
23
37
59
31
10
88
37
28
22
14
44
79
15
21
71
17
29
25
32
25
193
45
39
11
96
41
24
23
148
49
30
53
46
12
33
52
155
18
41
25
27
19
5
56
15
33
101
23
39
33
32
65
34
5
74
23
39
123
80
42
50
30
12
35
26
45
14
33
27
16
23
22
39
119
37
33
47
36
22
37
21
28
47
71
23
39
28
54
21
22
119
89
38
28
79
27
45
35
27
52
41
23
35
33
27
23
45
25
66
38
41
5
38
48
45
6

In [11]:
def print_topic_words(lambda_, vocabulary, num_topics, num_words):
    '''
    Display the first num_words for the topic distribution lambda_ from a
    vocabulary.
    '''
    for t in xrange(num_topics):
        topic_distribution = sorted([(i, p) for i, p in enumerate(lambda_[:, t])], key=lambda x: x[1], reverse=True)
        top_words = [vocabulary[tup[0]] for tup in topic_distribution[:num_words]]
        print 'Topic number ', t
        print top_words

In [7]:
topic_word = vir[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: opera bass danced conductor royal symphony hadson hrb orchestra orchestras
Topic 1: buffs blocked wear opponent confirm roughly slain referendum explain hollywood
Topic 2: tibet earths emissions tibetans tibetan ershad dhaka smog fires roberson
Topic 3: clinton cosmetics cutter jamming transatlantic perfume dock cebu voa wynberg
Topic 4: yates hampden aggravated dunn bujang armory discharge judith sherry teens
Topic 5: comparable yen premium kodak occidental texaco dollar franc volatility futures
Topic 6: greyhound arco osha graves belfast gravley checkpoint ulster hopkins unite
Topic 7: mecham demjanjuk gesell mine impeachment milstead regan walsh relevant mechams
Topic 8: gull diapers landfills landfill virgin contamination lowery superfund recycle refrigerator
Topic 9: volvo editors ag buses discontinued sweden caution correct bus profitable
Topic 10: mohawk federated coroner ferraro reservation raid coleman aground arresting verity
Topic 11: kashmir valdez ramstein srinaga

In [10]:
topic_word = vir2[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: payless asbestos manville centrust uta glauberman jeffer chargeurs weinstein tremolite
Topic 1: pauley newsprint moss documentary barbershop menem morgan icc supersonic cosby
Topic 2: owen greenwald volvo watkins uaw lopez gm chryslers scripps acustar
Topic 3: buffs blocked wear opponent confirm roughly slain referendum explain hollywood
Topic 4: maxwell massage wallenda bowen nih pregnant antiabortion clinics infants fetal
Topic 5: comparable federated coastamerica occidental iacocca koppers lorimar atts glantz hutton
Topic 6: cdy clr barahona grammer rn daisy pinochet matta sula palmerola
Topic 7: bloom kephart vanuatu spoor pillsburys gotner hildreth dances burger athletes
Topic 8: arts endowment cat quints poe awards alpo fellows shuster sisters
Topic 9: quinn baugh feazell vietnamese hoan guinness vase zaccaro zubal rouge
Topic 10: warmus nida mexican greek castaneda athens teens drinks threats missed
Topic 11: sajudis kalugin walesa katyn zinoviev baku shatalin rakowski 

In [11]:
%%time
vir3 = lda_batch(ap_corpus,100,50,1024,0.6)

709
168
999
764
67
708
999
47
2
129
34
430
271
7
122
999
90
491
204
201
37
2
2
43
210
250
422
96
275
177
381
832
44
73
511
141
242
999
219
128
329
74
349
68
127
756
58
257
365
999
65
51
107
107
275
57
253
144
132
5
26
264
14
79
10
101
635
99
126
181
193
86
71
181
180
772
170
197
343
296
223
255
199
38
40
386
12
294
14
189
244
234
186
96
136
97
999
23
335
359
357
96
3
187
84
59
52
140
137
162
52
98
32
154
135
148
24
67
78
9
273
5
94
171
80
34
231
48
88
82
193
104
13
999
70
131
147
69
281
126
132
157
94
19
25
207
62
160
10
99
81
85
158
8
84
43
56
60
197
29
69
57
124
52
8
110
154
74
95
127
101
83
78
16
93
20
65
53
84
74
84
32
126
126
93
7
11
8
39
66
73
77
112
69
77
60
50
78
240
9
133
92
24
34
8
123
261
98
96
110
67
45
10
35
56
135
39
14
7
43
58
42
48
74
58
21
10
8
13
437
17
50
25
48
46
57
42
8
24
8
149
10
244
131
47
92
3
87
11
104
18
67
111
118
29
65
58
6
91
83
18
53
14
47
59
12
16
23
8
62
17
22
17
15
46
21
151
56
37
57
32
29
37
29
19
66
42
30
21
32
25
52
8
50
13
50
16
68
60
9
24
22
63
73

In [12]:
topic_word = vir3[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: sang rose roses actor tony buckley theater mystery ceiling maureen
Topic 1: czechoslovak cuban cuba ctk diplomatic embassy washington havana interests source
Topic 2: hadson llosas ferret cereal cherry satisfactory hazelwood annunzio laurentiis sieck
Topic 3: inhabitants kaunda burdick probability anthrax exceptions squeezed seasonally revco reserved
Topic 4: bombs torres rescue car glass exploded bombings biggest injured war
Topic 5: blier hoan angle disruptive forecaster travell mumford doubtful revco explosions
Topic 6: bailey dna manilas classics cordon nida trump shamrock relayed poet
Topic 7: patronage stretched tiger ridley loves kitty warn antisemitism azerbaijanis relay
Topic 8: blackowned businesses percent firms census receipts book share dorrance business
Topic 9: reversing cdy kahn clout ransom banana ferret insolvent y vargas
Topic 10: symbols coma bordallo mash mpaa wines volvo maxwell dtexas silverman
Topic 11: wellstone abbott hoppe plessey pw populations clar

In [20]:
%%time
vir4 = lda_batch(ap_corpus,100,50,512,0.7)

CPU times: user 13.5 s, sys: 364 ms, total: 13.8 s
Wall time: 13.9 s


In [14]:
topic_word = vir4[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: percent manufacturing production rate goods month october november index utilities
Topic 1: kahn curran menem impediments cruz baguio retribution burleson labels crocodile
Topic 2: peewee hezbollah stormy grassley humphreys apologized chairmen provoked harmful counterparts
Topic 3: harvard cambridge el military sister salvadoran wolf protesters salvador delegations
Topic 4: peres israel ferrets rappaport official animals pipeline bechtel company offer
Topic 5: princess withhold fastfood sovereignty commencement cbn richest affidavit seas disturbances
Topic 6: sporting handgun assaults administrators implied rude mosbacher eleven creque obeyed
Topic 7: popeye wallenda mpaa experiencing outlets sells odd madison richfield quartet
Topic 8: precautionary keidanren preference dartmouth broyles richards sula spinal puppet braking
Topic 9: india state delhi flooded evacuated drowned worst town camps floods
Topic 10: bombs torres exploded injured biggest bombings rescue roof glass cla

In [27]:
def inference(lda,dtm,tau,kappa):

	ntopic = lda[0].shape[1]
	nvoc = dtm.shape[1]
	ndoc = dtm.shape[0]
	nu = 1./ntopic
	alpha = 1./ntopic

	topics = lda[0]
	phi = np.random.gamma(100.,1./100.,(nvoc,ntopic))
	gamma  = np.random.gamma(100.,1./100.,(ndoc,ntopic))

	numbatch = ndoc
	batches = np.array_split(range(ndoc),numbatch)

	for i in range(1):
		for it_batch in range(numbatch):
			ELogBeta = digamma(topics)
			ExpELogBeta = np.exp(ELogBeta)
			
			temp_topics = np.zeros(topics.shape)

			indices = []

			for d in batches[it_batch]:
				# print d
				ids = np.nonzero(dtm[d,:])[0]
				indices.extend(ids)
				cts = dtm[d,ids]
				ExpELogBetad = ExpELogBeta[ids,:]

				gammad = gamma[d,:]
				ElogTethad = digamma(gammad)
				ExpLogTethad = np.exp(ElogTethad)

				# print gammad

				for inner_it in range(1000):
					
					oldgammad = gammad

					phi =  ExpLogTethad * ExpELogBetad
					phi = phi / (phi.sum(axis=1)+0.00001)[:, np.newaxis]

					gammad = alpha + np.dot(cts,phi)

					ElogTethad = digamma(gammad)
					ExpLogTethad = np.exp(ElogTethad)
					# print gammad

					if np.mean((gammad-oldgammad)**2)<0.0000001:
						break

# 				# print inner_it
				gamma[d,:] = gammad

				temp_topics[ids,:] += phi * cts[:,np.newaxis]

			indices = np.unique(indices)

			rt = rho(tau,kappa,it_batch)

			topics[indices] = (1 - rt) * topics[indices,:] + rt * ndoc * (nu + temp_topics[indices,:]) / len(batches[it_batch])

	return topics,gamma

In [28]:
def perplexity_test(lda,newdocs,tau,kappa,perword = False):
	
	new = inference(lda,newdocs,tau,kappa)
	
	topics = new[0]
	gammas = new[1]
	
	topics = topics/topics.sum(axis=0)
	
	if len(gammas.shape) == 1:
		gammas = gammas/np.sum(gammas)
		doc_idx = np.nonzero(newdocs)[0]
		doc_cts = newdocs[doc_idx]
		return np.exp(-np.log(np.sum(np.dot(topics[doc_idx,:],gammas)*doc_cts))/np.sum(doc_cts))
	
	else:
		norm = lambda x: x/np.sum(x)
		gammas = np.apply_along_axis(norm,axis = 1,arr = gammas)
		
		num = 0
		denom = 0
		
		for i in range(gammas.shape[0]):
			doc_idx = np.nonzero(newdocs[i,:])[0]
			doc_cts = newdocs[i,doc_idx]
			num = np.sum(np.log(np.dot(topics[doc_idx,:],gammas[i,:]))*doc_cts)
			denom += np.sum(doc_cts)
			
		if ~perword:
			return num
		else:
			return num/denom

In [17]:
ap_corpus.shape

(2246, 10473)

In [22]:
%%time
np.random.seed(0)
train1 = lda_batch(ap_corpus[:1900,:],100,50,512,0.7)

CPU times: user 17.4 s, sys: 379 ms, total: 17.7 s
Wall time: 17.8 s


In [29]:
perplexity_test(train1,ap_corpus[1900:,:],512,0.7,True)

-666.90144554510755

In [34]:
%%time
np.random.seed(0)
train2 = lda_batch(ap_corpus[:1900,:],100,50,1024,0.7)

CPU times: user 13.8 s, sys: 344 ms, total: 14.2 s
Wall time: 14.2 s


In [37]:
%%time
perplexity_test(train2,ap_corpus[1900:,:],1024,0.7,True)

CPU times: user 23.8 s, sys: 1.23 s, total: 25 s
Wall time: 25.1 s


-684.4899849517393

In [38]:
%%time
np.random.seed(0)
train3 = lda_batch(ap_corpus[:1900,:],100,50,256,0.7)

CPU times: user 15.9 s, sys: 436 ms, total: 16.3 s
Wall time: 16.4 s


In [40]:
%%time
perplexity_test(train3,ap_corpus[1900:,:],256,0.7,True)

CPU times: user 24.4 s, sys: 1.15 s, total: 25.5 s
Wall time: 25.6 s


-664.20605008159703

In [42]:
%%time
np.random.seed(0)
train4 = lda_batch(ap_corpus[:1900,:],100,50,256,0.6)
print perplexity_test(train4,ap_corpus[1900:,:],256,0.6,True)

-662.758024501
CPU times: user 38.6 s, sys: 1.63 s, total: 40.2 s
Wall time: 40.3 s


In [43]:
%%time
np.random.seed(0)
train5 = lda_batch(ap_corpus[:1900,:],100,50,256,0.8)
print perplexity_test(train5,ap_corpus[1900:,:],256,0.8,True)

-669.336657084
CPU times: user 41.4 s, sys: 1.61 s, total: 43 s
Wall time: 43.1 s


In [47]:
%%time
np.random.seed(0)
train6 = lda_batch(ap_corpus[:1900,:],100,50,256,0.5)
print perplexity_test(train6,ap_corpus[1900:,:],256,0.5,True)

-656.685616766
CPU times: user 40.6 s, sys: 1.75 s, total: 42.3 s
Wall time: 42.6 s


In [48]:
np.random.seed(0)
best = lda_batch(ap_corpus,100,50,256,0.5)

In [49]:
topic_word = best[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: collider mchaffie storks khashoggi gacy amalgam ornellas furmark mofford dickman
Topic 1: noriega panama panamanian noriegas infected virus panamas delvalle hiv palma
Topic 2: ferret ferrets animals bites pet fuji thornton seidon au experiments
Topic 3: greene easterly fuji vogel clr hildreths collider rajneesh scripps patriarca
Topic 4: ehrenhalt bcspehealth fixedrate therapist pillsburys cna microbe catholicjewish evren dickey
Topic 5: ceremsak solis suarez putnam tamils gursky rison wynberg tumor siegelman
Topic 6: fuji hasselbring cxcbt dudycz bowsher tartus niklus ziemet sheftel guideline
Topic 7: noriega powell sudan panamanian panama examining jacksons strongman channels noriegas
Topic 8: quints forman atlarge siegelman halloween deri cdy sipc fixedrate dalai
Topic 9: hildreth fis ritalin wppss sula forman zaccaro ames shedd uta
Topic 10: fbi multistate fbis agents copyright exam oferrell testimony immunity immunized
Topic 11: schwarzkopf hasegawa dorrance mayer belongi

In [55]:
%%time
np.random.seed(0)
vir6 = lda_batch2(ap_corpus,100,40,750,0.7,5)

983
122
230
152
102
999
647
41
2
123
207
138
178
6
617
84
351
477
120
999
43
2
2
38
834
614
730
182
829
181
999
405
91
149
554
64
999
114
266
105
716
10
225
806
115
218
44
220
83
153
124
84
208
91
340
128
242
39
274
5
142
305
14
562
14
175
187
39
349
163
929
192
98
103
348
124
154
161
306
91
424
167
60
140
107
136
11
25
10
120
87
57
46
103
54
76
104
7
13
106
77
72
3
50
83
12
44
122
108
521
76
175
101
74
145
93
237
123
107
12
45
6
72
29
52
27
62
30
24
73
69
91
16
80
77
69
66
54
47
122
65
55
66
18
22
71
84
97
16
74
60
24
109
11
108
50
239
78
68
115
65
69
97
159
7
36
80
97
64
98
28
87
35
47
85
35
96
20
72
33
57
81
102
93
73
12
12
11
35
36
83
64
30
80
92
116
84
66
76
9
39
67
316
29
12
22
132
15
36
54
21
64
51
17
60
215
21
86
10
74
124
74
58
116
60
64
11
96
54
73
47
36
68
84
40
106
90
16
51
11
59
16
73
38
30
57
5
56
14
49
56
11
64
96
44
18
42
5
21
14
105
41
129
22
39
28
99
60
11
32
41
63
48
81
20
87
13
12
47
61
18
58
59
13
100
29
26
59
49
94
69
87
8
25
92
61
30
119
50
14
53
62
104
31
23
25


In [56]:
topic_word = vir6[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: collider mchaffie storks khashoggi gacy lopez amalgam formula mofford ornellas
Topic 1: noriega powell channels cordero menorah panamanian mydland noriegas patmos panama
Topic 2: ferrets bites doctors ferret pet babies animals increasingly two paisley
Topic 3: greene vogel fuji easterly procession clr hildreths disarmament scripps storer
Topic 4: ehrenhalt therapist bcspehealth fixedrate bastion microbe camarena pillsburys cna catholicjewish
Topic 5: ceremsak solis tamils suarez putnam gursky tumor rison wynberg chinn
Topic 6: fuji hasselbring cxcbt dudycz aoun azerbaijani circular bowsher tartus niklus
Topic 7: noriega panama powell sudan panamanian jackson channels jacksons examining diplomatic
Topic 8: halloween atlarge quints importers forman siegelman fixedrate deri cdy sipc
Topic 9: hildreth fis sula ritalin wppss raisa forman zaccaro shedd skyscraper
Topic 10: enforcement licenses convictions programs grants suspension drug distribution establish convicted
Topic 11: sch

In [57]:
ap_corpus.shape

(2246, 10473)

In [60]:
%%time
np.random.seed(0)
vir6 = lda_batch2(ap_corpus,100,2246,750,0,100)

983
122
230
152
102
999
647
41
2
123
207
138
178
6
617
84
351
477
120
999
43
2
2
38
834
614
730
182
829
181
999
405
91
149
554
64
999
114
266
105
716
163
665
89
350
999
199
229
526
582
236
43
999
102
384
85
585
76
136
2
189
589
999
999
56
53
111
219
223
160
180
180
483
160
173
999
999
181
393
364
999
175
999
54
999
199
390
234
145
404
853
194
260
172
289
999
999
4
212
226
748
259
2
956
139
335
147
999
425
849
164
194
221
71
774
126
98
758
296
144
89
7
65
327
91
187
445
10
96
777
212
999
264
663
346
999
289
73
199
711
842
726
701
105
999
131
434
941
3
999
29
64
144
289
231
999
156
999
637
188
101
363
384
999
4
422
348
804
663
197
999
914
327
548
204
296
104
29
243
530
85
999
299
432
154
102
138
999
55
295
157
336
72
999
158
693
86
283
927
140
839
171
712
90
278
64
192
133
56
95
742
97
20
134
334
711
43
161
999
85
367
181
209
146
606
437
4
85
616
210
83
419
138
989
196
895
132
205
257
144
188
94
196
7
479
87
2
242
194
999
137
79
125
769
999
86
104
2
801
132
345
842
224
87
233
117
110
406

In [61]:
topic_word = vir6[0].T  # model.components_ also works
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(ap_vocabulary)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print(u'Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: i bush president million drug think time going congress people
Topic 1: noriega jackson housing roberts panama people i two gene year
Topic 2: new years city inheritance president soviet york aids sakharov john
Topic 3: members west women virginia kennedy abortion vote club draft president
Topic 4: electric training launch percent i first boat shuttle wiese news
Topic 5: million trust attorney federal i art milken carpenter manville claims
Topic 6: soviet nato cents president yen supreme gorbachev board state people
Topic 7: police people government president two first iraq today time day
Topic 8: cent cents futures higher lower people state percent two new
Topic 9: i new editor people yearold children ap officials united money
Topic 10: year percent i million program new federal billion house last
Topic 11: new york i bush campaign wednesday dukakis program states asked
Topic 12: company two first disney school year last late tuesday say
Topic 13: people labor officials natio