# Stochastic Variational Inference for LDA

## Introduction

This notebook implements Latent Dirichlet Allocation (LDA), a topic modelling algorithm, in its Stochastic Variational Inference version. The implementation tries to stay close to what Hoffman, Blei, Wang and Paisley propose in [Stochastic Variational Inference](http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf).
Other works used as references are:
 * For a broader overview of LDA: [Inference Methods for Latent Dirichlet Allocation](http://times.cs.uiuc.edu/course/598f16/notes/lda-survey.pdf);
 * For a more detailed explanation of the mean-field variational family: [Variational Inference: A Review for Statisticians](https://arxiv.org/pdf/1601.00670.pdf)
 
Additionally, the idea of implementing LDA came out of an earlier attempt to implement the proposal of Wang and Blei, in [Collaborative Topic Modeling for Recommending Scientific Articles](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf), for recommender systems. In fact, the dataset used here is the same one they use. I may implement that paper as a next step.

## Theoretical background

Assuming K topics and D documents, each with N words (for simplicity I assume documents of equal length) drawn from a vocabulary of size V, the generative model is as follows:

1. Draw topics $\beta_k$ ~ Dirichlet($\eta,\dots,\eta$) for each $k\in \{0,\dots,K-1\}$
2. For each document $d \in \{0,\dots, D-1\}$:
    3. Draw $\theta_d$ ~ Dirichlet($\alpha,\dots,\alpha$)
    4. For each word position $n \in \{0,\dots, N-1\}$:
        5. Draw the topic assignment $z_{dn}$ ~ Multinomial($\theta_d$)
        6. Draw the word $w_{dn}$ ~ Multinomial($\beta_{z_{dn}}$)
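
The generative steps above can be sketched in a few lines of NumPy. This is only an illustration on toy sizes: `generate_corpus` and all the values passed to it are assumptions made for this example and are not part of the notebook's implementation.

```python
import numpy as np


def generate_corpus(K, D, N, V, alpha, eta, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.RandomState(seed)
    # Step 1: one word distribution per topic, beta_k ~ Dirichlet(eta, ..., eta)
    beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)
    # Step 3: topic proportions per document, theta_d ~ Dirichlet(alpha, ..., alpha)
    theta = rng.dirichlet(np.full(K, alpha), size=D)   # shape (D, K)
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        for n in range(N):
            # Step 5: topic assignment z_dn ~ Multinomial(theta_d)
            z[d, n] = rng.choice(K, p=theta[d])
            # Step 6: word w_dn ~ Multinomial(beta_{z_dn})
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return beta, theta, z, w


# Toy sizes chosen only for the example
beta, theta, z, w = generate_corpus(K=3, D=5, N=8, V=20, alpha=0.5, eta=0.5)
print(w.shape)  # (5, 8)
```

Inference runs this process in reverse: only `w` is observed, and the goal is to recover plausible values for `beta`, `theta` and `z`.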

The variables involved are:
* $\beta_k$, topic $k$. A topic is essentially a probability vector of length V that models the distribution of the vocabulary words for that topic.
* $\theta_d$, the topic proportions of document $d$. Again a probability vector, which in this case models the partial contribution of each topic to a single document. Allowing a document to belong to a mixture of topics rather than to a single one is one of the main strengths of LDA. __This is the main variable we want to do inference on__.
* $z_{dn}$, the topic assignment of word $n$ in document $d$.
* $w_{dn}$, word $n$ of document $d$. It is drawn from topic $z_{dn}$ and is the only observed variable.
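
Read together, the generative steps and the variables listed above imply the following joint distribution, where $w$ is observed and the remaining variables are hidden. This is the standard LDA factorization, written here with the notation of this notebook:

$$
p(\beta, \theta, z, w \mid \alpha, \eta)
  = \prod_{k=0}^{K-1} p(\beta_k \mid \eta)
    \prod_{d=0}^{D-1} \left[ p(\theta_d \mid \alpha)
    \prod_{n=0}^{N-1} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}}) \right]
$$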

## Dataset

The dataset was downloaded from http://www.cs.cmu.edu/~chongw/data/citeulike/. It contains data from http://www.citeulike.org/, a site that lets researchers build libraries of papers and receive recommendations based on them. I do not use the information about users and their libraries; I focus only on the article data.
The two files I use are:
* mult.dat, which contains the ids of the most relevant words of each article, along with their counts. There are 16980 articles (documents) in total.
* vocab.dat, which maps each id to the actual word. The vocabulary has 8000 words.
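
To make the `mult.dat` layout concrete, here is a minimal sketch that expands one line into a flat list of word ids. The sample line is made up for illustration and `parse_mult_line` is a hypothetical helper; the format it assumes (a count of distinct words followed by `word_id:count` pairs) is the one the loading code in this notebook relies on.

```python
def parse_mult_line(line):
    """Expand one mult.dat line into a list with one entry per word occurrence."""
    fields = line.split()
    n_distinct = int(fields[0])  # first field: number of distinct words
    word_ids = []
    for pair in fields[1:]:
        word_id, count = pair.split(':')
        word_ids.extend([int(word_id)] * int(count))
    return n_distinct, word_ids


# Made-up example line: 3 distinct words with ids 10, 42 and 7
n_distinct, word_ids = parse_mult_line('3 10:2 42:1 7:3')
print(n_distinct)  # 3
print(word_ids)    # [10, 10, 42, 7, 7, 7]
```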

Vocabulary extraction and the other preprocessing steps are those described in section 4 of [Wang and Blei (2011)](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf).

Additionally, in this first iteration of the algorithm, I require all documents to have the same number of words. To achieve this, `load_dataset` resamples the word occurrences of each document, with replacement, to a fixed length of 60 words.

Next, I import some libraries I am going to use and load the dataset. For reproducibility, I set a fixed seed.

In [1]:
from scipy.stats import dirichlet, multinomial, expon
from scipy.special import digamma
from collections import defaultdict
from math import pow
import numpy as np
import copy

from online_lda import OnlineLDA
from mini_batch_lda import MiniBatchLDA
from helpers import *

In [2]:
def load_dataset():
    words_per_doc = defaultdict(list)
    # Parse the documents file line by line
    with open('data/mult.dat') as keywords_f:
        for d, line in enumerate(keywords_f):
            # The first element is the number of distinct words;
            # the rest are word_id:count pairs
            for w_c in line.split()[1:]:
                word, cnt = w_c.split(':')
                # Repeat each word id as many times as it occurs
                words_per_doc[d].extend([int(word)] * int(cnt))
            # Resample so that every document has the same number of words
            np.random.seed(42)
            words_per_doc[d] = np.random.choice(words_per_doc[d], size=60)

    # Map each word id (the line number in vocab.dat) to its word
    vocab = {}
    with open('data/vocab.dat') as vocab_f:
        for v, word in enumerate(vocab_f):
            vocab[v] = word.rstrip()

    return words_per_doc, vocab
            

In [3]:
words_by_doc, vocab = load_dataset()

## LDA

First I define some helper functions that will be useful later.

In [4]:
from IPython.display import display, Markdown
def print_md(string, bold=False):
    if bold:
        string = '**' + string + '**'
    display(Markdown(string))

The LDA classes used below (_OnlineLDA_ and _MiniBatchLDA_) have _fit_ as their main method, which applies the algorithm proposed by Hoffman et al. (2013). The notation chosen for the parameters closely follows the one used in section 3 of that paper.

The following image shows the pseudocode of the algorithm.

![title](img/SVI-pseudocode.png)
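
As a rough companion to the pseudocode, the sketch below runs one stochastic update from a single sampled document, using the LDA update equations of Hoffman et al. (2013). It is not the code inside `OnlineLDA`, which is not shown in this notebook: `svi_step`, `expected_log_dirichlet`, `step_size`, the convergence test and the toy sizes are all assumptions made for illustration. The schedule $\rho_t = (t + \tau)^{-\kappa}$ with $\tau = 128$ and $\kappa = 0.7$ is likewise an assumption, chosen because it matches the first `Ro` values printed in the training log below.

```python
import numpy as np
from scipy.special import digamma


def expected_log_dirichlet(param):
    """E[log x] under Dirichlet(param); works row-wise on 2-D arrays."""
    param = np.asarray(param, dtype=float)
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))


def svi_step(lamb, doc_words, alpha, eta, D, rho, max_iters=100, tol=1e-4):
    """Update lamb (K x V) from one document given as a sequence of word ids."""
    K = lamb.shape[0]
    doc_words = np.asarray(doc_words)
    e_log_beta = expected_log_dirichlet(lamb)[:, doc_words]  # shape (K, N)
    gamma = np.ones(K)
    for _ in range(max_iters):
        # Local step: phi_dn^k proportional to
        # exp(E[log theta_dk] + E[log beta_{k, w_dn}])
        log_phi = expected_log_dirichlet(gamma)[:, None] + e_log_beta
        log_phi -= log_phi.max(axis=0, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=0, keepdims=True)
        new_gamma = alpha + phi.sum(axis=1)
        converged = np.abs(new_gamma - gamma).mean() < tol
        gamma = new_gamma
        if converged:
            break
    # Intermediate topics, as if the sampled document were replicated D times
    lamb_hat = np.full(lamb.shape, eta, dtype=float)
    for n, word in enumerate(doc_words):
        lamb_hat[:, word] += D * phi[:, n]
    # Global step: weighted average of the old and intermediate parameters
    return (1.0 - rho) * lamb + rho * lamb_hat, gamma


def step_size(t, tau=128.0, kappa=0.7):
    """Assumed step-size schedule rho_t = (t + tau) ** (-kappa)."""
    return (t + tau) ** (-kappa)


# Toy problem, sizes chosen only for the example
rng = np.random.RandomState(0)
K, V, D, N = 4, 30, 100, 12
lamb = rng.gamma(100.0, 0.01, size=(K, V))
doc = rng.randint(V, size=N)
new_lamb, gamma = svi_step(lamb, doc, alpha=50.0 / K, eta=0.1, D=D,
                           rho=step_size(0))
print(new_lamb.shape)
```

Because each column of `phi` sums to one, `gamma` always sums to `K * alpha + N`, which is a convenient sanity check on the local step.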

In [5]:
D = len(words_by_doc)  # number of documents
N = 60 # words per doc
K = 60  # number of topics
V = len(vocab)  # vocabulary size

# For the alpha and eta values I follow the recommendation of Griffiths and Steyvers (2004)
lda = OnlineLDA(K, D, N, V, 50.0/K, 0.1, words_by_doc)
lda.fit()

Iteracion: 0
Documento: 5770
Ro: 0.0334929207043
Iteracion: 1
Documento: 11239
Ro: 0.0333109641294
Iteracion: 2
Documento: 11493
Ro: 0.0331313897449
Iteracion: 3
Documento: 661
Ro: 0.0329541483957
Iteracion: 4
Documento: 12864
Ro: 0.032779192306
Iteracion: 5
Documento: 7180
Ro: 0.0326064750308
Iteracion: 6
Documento: 363
Ro: 0.0324359514088
Iteracion: 7
Documento: 4779
Ro: 0.0322675775179
Iteracion: 8
Documento: 5343
Ro: 0.0321013106322
Iteracion: 9
Documento: 1663
Ro: 0.0319371091812
Iteracion: 10
Documento: 6505
Ro: 0.03177493271
Iteracion: 11
Documento: 14368
Ro: 0.0316147418418
Iteracion: 12
Documento: 9360
Ro: 0.0314564982415
Iteracion: 13
Documento: 13164
Ro: 0.0313001645812
Iteracion: 14
Documento: 11476
Ro: 0.0311457045065
Iteracion: 15
Documento: 7843
Ro: 0.0309930826047
Iteracion: 16
Documento: 4408
Ro: 0.0308422643739
Iteracion: 17
Documento: 11600
Ro: 0.0306932161937
Iteracion: 18
Documento: 4302
Ro: 0.0305459052965
Iteracion: 19
Documento: 6823
Ro: 0.0304002997402
Iteracio

In [6]:
for i, lamb_k in enumerate(lda.lamb):
    print_md('Topic {}'.format(i), bold=True)
    # Draw one word distribution for this topic from Dirichlet(lamb_k)
    np.random.seed(42)
    words_distribution = np.random.dirichlet(lamb_k)
    # Show the 20 highest-probability words
    print_words(words_distribution.argsort()[::-1][:20], vocab)

**Topic 0**

72, techniques
118, text
642, hierarchical
377, integration
385, reference
279, basis
349, traditional
619, visualization
566, tags
574, tagging
478, edition
612, expected
766, tag
951, scales
910, easily
1366, systematically
721, end
1092, reading
1125, index
1132, reasoning


**Topic 1**

99, transcription
163, mechanisms
329, shows
432, experience
460, address
669, relations
585, degree
578, early
459, library
586, categories
1143, optical
864, affect
1178, supporting
734, libraries
1427, realtime
1064, relation
1752, explanation
2596, classroom
863, idea
1165, remain


**Topic 2**

236, areas
209, observed
310, real
257, efficient
254, factor
647, perspective
327, called
266, power
931, offer
223, communication
518, widely
469, building
665, original
548, families
637, loss
468, published
725, choice
651, guide
427, virtual
3282, contributing


**Topic 3**

30, genome
61, dna
141, genomic
58, sites
94, cell
158, identified
140, related
202, technologies
387, signal
122, mirnas
235, microarray
280, spatial
242, group
245, cancer
247, conserved
654, location
837, product
597, challenge
1077, designing
886, translation


**Topic 4**

24, method
43, sequences
126, functions
443, adaptive
608, improved
435, due
325, class
431, mutations
737, selected
692, version
1790, modularity
572, contrast
726, adaptation
535, increasing
840, force
593, distributions
588, strategy
778, rapidly
999, applying
1903, highresolution


**Topic 5**

201, proposed
244, dynamic
218, existing
275, components
231, accuracy
331, parameters
217, better
283, conditions
222, memory
410, technique
221, programming
531, rules
536, graphs
620, perform
1023, benefits
760, matrix
1063, infrastructure
1176, hand
686, great
1086, procedures


**Topic 6**

54, work
124, level
497, modules
272, made
268, site
144, mirna
490, limited
340, alternative
1155, reduction
1240, challenging
871, likelihood
364, simulation
638, wireless
811, suggesting
529, currently
606, points
791, detect
601, solutions
1149, improvement
1788, consisting


**Topic 7**

4, model
28, number
81, control
82, found
212, case
388, availability
333, comprehensive
173, form
193, management
439, inference
464, classes
521, tree
1153, precise
660, revealed
874, observations
1635, binary
980, substantial
1047, involving
696, coverage
2468, scene


**Topic 8**

57, users
161, effects
293, theoretical
285, analyses
363, machine
347, effect
219, scientific
374, measures
899, video
342, introduction
1152, external
771, normal
402, aspects
454, average
1114, genetics
2048, contextaware
1414, trends
915, vision
3017, fear
1372, details


**Topic 9**

120, rna
105, simple
106, high
157, levels
311, help
649, artificial
690, link
470, report
516, directly
761, investigate
740, needed
687, five
500, chromatin
1689, decade
684, despite
698, procedure
1005, animal
850, facilitate
1401, metabolism
1504, maintenance


**Topic 10**

11, protein
93, small
47, evolution
73, selection
1025, internal
213, alignment
400, free
418, rnas
485, pairs
963, thousands
262, energy
747, correlated
1206, dynamical
1313, discovered
1718, finite
708, physics
901, manner
1507, classic
1729, minimum
1222, enriched


**Topic 11**

89, statistical
78, tools
172, state
237, discuss
277, evaluation
503, values
298, ability
291, similarity
445, required
1139, circuits
517, account
555, statistics
600, activities
699, explain
796, surface
1337, determining
849, ones
1248, generating
2447, recursive
810, light


**Topic 12**

25, structure
107, general
51, function
104, molecular
149, associated
192, local
117, computer
450, relationship
359, involved
641, tested
270, predicted
482, simulations
965, logic
581, estimation
742, members
790, imaging
882, leading
3107, devoted
1854, kinase
914, enables


**Topic 13**

1, analysis
10, methods
90, three
112, single
360, quantitative
461, approximately
1411, designs
800, advantage
508, direct
1528, care
806, combining
1485, extracted
1470, screening
1191, profile
918, share
3530, mixtures
1302, profiling
1669, proportion
1601, enrichment
4281, unix


**Topic 14**

23, learning
74, language
258, presented
289, query
326, like
373, images
320, best
430, finding
375, focus
299, pattern
563, feedback
630, consider
452, active
526, resulting
1070, creating
599, goal
659, representations
792, organizations
1048, focused
1101, effectively


**Topic 15**

37, studies
304, identification
314, background
631, scientists
1484, approximate
648, genomics
722, loci
3153, shaping
959, summary
1276, policy
1135, entities
1342, answer
2979, terminology
754, vector
1334, module
861, subjects
1232, observation
2932, exactly
809, illustrate
2268, inheritance


**Topic 16**

59, book
150, people
546, series
496, generation
292, relevant
1140, interact
777, written
625, connectivity
423, bioinformatics
892, seen
607, lead
395, impact
671, learn
1898, logical
938, readers
1226, exchange
1043, identity
970, reveals
1483, teaching
1387, standards


**Topic 17**

26, show
71, cells
113, specific
77, features
321, view
352, recently
488, long
284, complexity
287, programs
611, words
570, question
770, fact
2136, pagerank
748, specifically
650, estimate
858, usage
505, correlation
2384, emergent
895, matching
1215, mathematics


**Topic 18**

18, two
86, recent
65, interaction
205, program
467, past
568, compare
168, evidence
370, positive
382, linear
419, collaborative
884, chemical
670, increasingly
666, stochastic
1660, novo
710, advances
834, protocols
1621, perspectives
1305, date
1500, magnetic
1509, independently


**Topic 19**

45, available
208, environment
191, test
148, significant
114, experiments
409, public
463, history
295, researchers
875, topic
662, errors
768, estimates
656, contains
1662, execution
1068, simulated
1529, enhance
1254, faster
1617, publicly
1157, developers
575, health
685, apply


**Topic 20**

39, algorithm
44, important
159, behavior
338, practical
580, agents
465, distinct
694, experiment
456, increase
541, systematic
720, resolution
355, previous
602, determine
534, parallel
446, low
1131, describing
801, negative
913, actions
560, times
1236, continuous
1112, style


**Topic 21**

13, system
67, performance
176, visual
448, input
203, architecture
532, introduce
297, significantly
434, documents
1102, advantages
633, links
324, cortex
617, interfaces
799, argue
668, shared
1016, evaluated
1277, generic
556, document
1425, capable
2610, instruction
4650, primitives


**Topic 22**

60, search
46, provide
88, science
85, user
328, states
615, relevance
215, neurons
302, cognitive
530, needs
401, independent
1103, changing
729, filtering
779, reveal
971, semantics
924, body
996, principle
1007, established
1258, publication
1291, implicit
752, showed


**Topic 23**

35, theory
76, framework
69, patterns
210, image
154, highly
353, understand
248, relationships
1026, exist
538, give
354, critical
421, diverse
1794, balance
416, mobile
753, determined
661, far
821, interactive
1659, carbon
793, efficiency
925, interesting
1368, descriptions


**Topic 24**

8, networks
20, social
234, internet
462, media
334, students
507, proteinprotein
703, relatively
1231, behavioral
1031, age
905, culture
1362, acquisition
1477, und
682, popular
764, sciences
1614, everyday
1522, names
1195, books
982, structured
1699, spread
573, biomedical


**Topic 25**

2, new
160, key
177, processing
558, component
836, enable
701, numbers
475, difficult
457, making
1008, rich
859, created
1158, construct
1150, serve
3790, infinite
731, emerging
906, efforts
943, achieved
2178, options
2129, looking
1275, mechanics
640, fast


**Topic 26**

19, research
12, human
63, problem
386, cases
348, findings
499, mathematical
510, topics
415, populations
900, purpose
539, languages
756, climate
1283, opportunities
1091, biochemical
1034, comparisons
986, taken
1218, complementary
2193, examination
1711, distinguish
1797, readily
2063, preliminary


**Topic 27**

21, approach
190, modeling
185, distribution
332, growth
255, useful
166, prediction
371, essential
547, decision
453, sources
678, automatically
1331, extend
948, entire
1173, bound
1374, directions
1212, characterize
1390, format
2147, sorting
3701, cope
2942, attractive
1202, collections


**Topic 28**

95, processes
167, dynamics
198, classification
368, objects
472, clusters
366, develop
372, complete
520, collection
525, profiles
346, rates
613, samples
876, economic
718, material
466, increased
798, variability
1304, origin
716, maps
1224, section
1181, ratio
1615, direction


**Topic 29**

130, technology
111, activity
184, article
772, corresponding
549, powerful
474, means
492, discussed
447, environments
644, evaluate
623, full
436, humans
1309, reactions
1266, technological
1535, possibility
1883, slow
1416, characteristic
1129, participants
1916, occurring
1512, earlier
1205, hundreds


**Topic 30**

102, developed
92, known
91, regions
108, sequencing
171, need
300, interest
489, strong
825, transfer
533, molecules
514, motif
605, analyze
2033, multimedia
527, probabilistic
479, signals
817, promoter
1430, scalable
1192, arise
1779, immune
1208, modification
1636, acids


**Topic 31**

9, network
131, role
178, demonstrate
249, differences
367, physical
440, improve
362, designed
365, pathway
634, insights
624, rapid
1120, built
819, characterized
700, routing
567, play
854, traits
672, mammalian
1249, discrete
1039, topology
1076, node
955, represented


**Topic 32**

101, evolutionary
194, literature
224, resources
188, global
271, databases
477, phylogenetic
797, growing
442, suggests
571, hypothesis
590, implemented
992, advanced
318, mining
1227, category
966, gives
746, larger
1360, demonstrates
789, define
935, annotated
885, education
1666, blocks


**Topic 33**

7, results
115, sets
211, task
127, semantic
420, feature
308, tasks
408, measure
306, result
414, good
392, obtained
839, motion
732, reported
1141, kinds
1755, institutions
983, wikipedia
498, reads
1327, things
1065, hidden
868, extent
1370, pair


**Topic 34**

3, gene
34, proteins
143, genomes
147, structural
260, size
246, domain
68, binding
476, implications
552, indicate
622, drosophila
1130, look
1731, distances
783, docking
750, overview
1003, fraction
561, complexes
1074, innovation
1676, eukaryotes
673, combined
1082, insight


**Topic 35**

142, review
153, natural
180, space
339, shown
139, community
301, effective
174, rate
309, four
358, scale
361, transcriptional
406, questions
728, formal
493, primary
616, microbial
950, selective
1156, lists
1001, definition
522, represent
853, searching
688, sensitivity


**Topic 36**

62, multiple
261, addition
170, years
225, propose
643, game
286, recognition
860, success
428, error
898, special
844, eukaryotic
765, activation
528, unique
902, list
762, markov
589, produce
843, respect
887, lower
1235, extension
1027, strongly
1585, carlo


**Topic 37**

5, genes
15, expression
96, support
48, biological
64, genetic
312, graph
233, disease
422, fundamental
1045, linked
962, cerevisiae
1244, flexibility
1055, phenotypic
683, greater
543, little
679, protocol
1562, regulators
1144, correct
1436, regulated
1592, infer
1099, saccharomyces


**Topic 38**

6, systems
22, based
134, types
175, applied
635, assembly
763, typically
921, decisions
2166, transient
1655, messages
1638, intelligent
1734, equilibrium
1161, positions
1600, enhanced
1241, discover
1024, contribute
1482, oscillations
1505, overlap
3320, trains
1399, vast
2189, dynamically


**Topic 39**

17, different
80, understanding
243, services
495, temporal
449, allow
220, computing
351, point
628, trees
2049, genomescale
315, diversity
274, motifs
509, providing
407, training
393, value
564, computation
1406, privacy
604, family
1142, lines
1382, taxonomy
774, conclusions


**Topic 40**

0, information
66, provides
83, approaches
123, current
207, retrieval
313, domains
426, derived
335, interface
657, extraction
972, explicit
891, conventional
695, term
932, graphical
1412, files
741, micrornas
781, chapter
1138, studying
1826, stateoftheart
1715, prototype
1428, deal


**Topic 41**

151, world
451, responses
988, capacity
751, stimuli
288, digital
894, mean
562, core
824, successful
736, trust
1924, exists
826, sensory
1233, emergence
961, treatment
1371, today
1012, assumptions
1299, consists
1837, todays
417, expressed
1434, setting
984, stimulus


**Topic 42**

16, web
110, database
216, issues
145, brain
513, sample
458, describes
412, concept
1265, metrics
2548, hypertext
336, higher
709, testing
807, pages
856, collaboration
878, introduced
873, accessible
2420, continues
1478, engines
3719, faults
1543, paths
1803, embedded


**Topic 43**

100, structures
240, source
273, underlying
675, studied
398, practice
376, object
594, optimal
455, ideas
712, combination
697, base
576, region
4828, helix
1247, works
639, require
3316, embodied
1104, dimensions
855, increases
1920, geometry
1891, gradient
719, specificity


**Topic 44**

33, software
50, species
250, concepts
384, challenges
379, attention
715, cluster
610, games
896, paradigm
1118, discusses
738, analyzed
1004, progress
994, goals
904, issue
814, production
1245, stage
743, developing
890, organizational
1424, planning
1537, degradation
1408, collected


**Topic 45**

32, large
49, complex
116, properties
165, clustering
197, tool
269, open
343, bayesian
186, order
239, response
317, abstract
381, predictions
614, description
305, described
344, empirical
582, sharing
1011, variable
880, ecological
835, shape
553, create
2278, usefulness


**Topic 46**

14, models
38, time
55, applications
322, organisms
378, area
803, bacterial
484, defined
629, modern
399, survey
711, measurements
583, variables
603, performed
1010, capture
1806, performing
968, reviews
912, thought
1474, generalized
879, uncertainty
2819, guidance
3658, reconstructions


**Topic 47**

129, individual
42, interactions
356, mechanism
227, standard
226, access
394, short
337, probability
680, length
429, nodes
1620, temperature
405, mapping
523, consistent
551, service
830, unknown
1017, devices
1357, typical
976, reliable
433, accurate
941, practices
907, allowing


**Topic 48**

70, process
41, functional
75, problems
56, first
265, annotation
282, cellular
584, contact
281, ontology
389, generated
294, individuals
653, annotations
554, characteristics
1128, modelling
591, highthroughput
1441, updated
1169, steps
893, coli
2215, del
1317, que
1354, los


**Topic 49**

31, study
53, algorithms
103, describe
252, basic
238, compared
263, random
1661, emotion
704, solution
652, queries
942, maximum
1274, improvements
818, examined
730, clinical
1675, signatures
1100, six
1148, fluctuations
1119, reduce
4916, fairness
1198, sequenced
1355, coherent


**Topic 50**

189, major
264, life
183, range
303, yeast
494, comparative
296, previously
267, discovery
438, predict
483, additional
577, subject
502, environmental
713, examine
444, importance
676, discussion
1201, led
1670, percent
512, integrated
1284, longterm
1272, workflow
1211, unified


**Topic 51**

29, design
109, common
119, field
52, development
214, future
228, allows
290, detection
471, project
357, strategies
164, elements
323, organization
204, regulation
769, fully
350, uses
403, principles
1122, platform
626, explore
727, business
749, read
658, signaling


**Topic 52**

169, make
196, distributed
162, variation
182, suggest
199, groups
390, largescale
187, population
316, comparison
413, fields
542, distance
928, behaviour
598, appropriate
411, events
1347, law
997, phenomena
397, relative
1656, choices
1471, fit
424, articles
1052, consequences


**Topic 53**

27, sequence
133, novel
121, online
138, biology
230, quality
319, metabolic
195, neural
481, main
307, representation
569, optimization
437, datasets
632, noncoding
705, spike
537, noise
506, coding
739, supplementary
441, mrna
985, candidate
559, frequency
1683, smaller


**Topic 54**

128, potential
256, targets
341, presents
136, identify
251, code
504, influence
717, automatic
519, detailed
940, notion
714, necessary
823, secondary
978, crucial
723, methodology
544, makes
609, forms
1239, political
916, cultural
1166, speed
775, occur
1093, functionally


**Topic 55**

97, experimental
98, changes
79, regulatory
229, pathways
152, change
278, nature
706, place
511, identifying
480, central
724, neuronal
993, bacteria
857, see
805, largely
953, folding
1085, transition
1154, tissue
759, flow
784, broad
2152, neuron
3200, pyramidal


**Topic 56**

36, present
87, computational
40, knowledge
253, wide
137, context
125, particular
179, content
515, requires
241, second
587, offers
369, implementation
867, file
1107, contributions
2437, concurrent
1223, cycle
1303, widespread
1771, extracting
1502, functionality
2239, der
926, predicting


**Topic 57**

206, communities
345, engineering
425, action
276, authors
957, package
645, generate
636, take
1113, estimated
667, theories
832, motor
1243, ranking
1183, understood
501, able
872, conceptual
1252, international
828, measured
845, nonlinear
869, correlations
788, analyzing
787, total


**Topic 58**

259, variety
232, type
200, quantum
181, examples
383, likely
155, target
391, resource
936, last
330, conservation
744, classical
592, personal
877, formation
958, scheme
954, microrna
833, word
1000, perception
1540, expansion
1737, scenarios
1344, differ
1704, viewed


**Topic 59**

132, terms
146, possible
135, application
156, factors
1282, explains
380, association
473, provided
404, genomewide
396, ways
540, robust
1264, cognition
486, map
1909, examining
618, author
944, freely
663, diseases
881, appear
1042, writing
1184, subset
674, products
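
The next cell switches to `MiniBatchLDA`, which also receives a `minibatch_size`. In the mini-batch form of stochastic variational inference described by Hoffman et al. (2013), the intermediate topics computed from each sampled document are averaged before the global step. The sketch below illustrates that idea on toy data. `e_log_dir`, `local_phi` and `minibatch_step` are hypothetical helpers, the fixed number of local iterations and sampling with replacement are assumptions, and none of this is the actual `MiniBatchLDA` code, which is not shown here.

```python
import numpy as np
from scipy.special import digamma


def e_log_dir(param):
    """E[log x] under Dirichlet(param); works row-wise on 2-D arrays."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))


def local_phi(lamb, doc_words, alpha, n_iters=50):
    """Per-word topic responsibilities phi (K x N) for one document."""
    e_log_beta = e_log_dir(lamb)[:, doc_words]
    gamma = np.ones(lamb.shape[0])
    for _ in range(n_iters):
        log_phi = e_log_dir(gamma)[:, None] + e_log_beta
        phi = np.exp(log_phi - log_phi.max(axis=0, keepdims=True))
        phi /= phi.sum(axis=0, keepdims=True)
        gamma = alpha + phi.sum(axis=1)
    return phi


def minibatch_step(lamb, docs, batch_ids, alpha, eta, rho):
    """Average the intermediate topics of the sampled documents, then blend."""
    D = len(docs)
    lamb_hat_sum = np.zeros_like(lamb)
    for d in batch_ids:
        doc_words = np.asarray(docs[d])
        phi = local_phi(lamb, doc_words, alpha)
        # Intermediate topics as if document d were replicated D times
        lamb_hat = np.full(lamb.shape, eta, dtype=float)
        for n, word in enumerate(doc_words):
            lamb_hat[:, word] += D * phi[:, n]
        lamb_hat_sum += lamb_hat
    lamb_hat_mean = lamb_hat_sum / len(batch_ids)
    # Global step: weighted average of the old and averaged intermediate topics
    return (1.0 - rho) * lamb + rho * lamb_hat_mean


# Toy problem, sizes chosen only for the example
rng = np.random.RandomState(0)
K, V, D, N, S = 4, 30, 50, 12, 8
docs = rng.randint(V, size=(D, N))
lamb = rng.gamma(100.0, 0.01, size=(K, V))
batch_ids = rng.randint(D, size=S)  # sampled with replacement (assumption)
new_lamb = minibatch_step(lamb, docs, batch_ids, alpha=50.0 / K, eta=0.1,
                          rho=0.03)
print(new_lamb.shape)
```

Averaging over several documents reduces the variance of each noisy update compared with using a single document per iteration.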


In [7]:
D = len(words_by_doc)  # number of documents
N = 60 # words per doc
K = 60  # number of topics
V = len(vocab)  # vocabulary size
minibatch_size = 32

# For the alpha and eta values I follow the suggestion of Griffiths and Steyvers (2004)
lda = MiniBatchLDA(K, D, N, V, 50.0/K, 0.1, words_by_doc, minibatch_size)
lda.fit()

Iteracion: 0
Documentos: [ 5770 12459  9969  6671 10870 13254 16844  2418 13242 15500   761  3410
  6391 13336 11433 11602  1535 13834  8740  2025  5971  8832  5380 15532
  4029 11998  6332 14655 11041  8740  2748  2060]
Ro: 0.0334929207043
Iteracion: 1
Documentos: [ 7156  1830 14457 13098  6566   397  7317 13271  6759  9351 14493 16739
  4445  2667   844 15046 11211 14846  4267  5498 15313 10708 10266 11143
 15683  5904  5191  5391 15200 16471  2321 12164]
Ro: 0.0333109641294
Iteracion: 2
Documentos: [ 4862 11376 16019  8746 11012 14519  2815  8677 13285 13336  8341 12805
  5307  1455 13226 13655 15638  9548 12009 10438 12804  2581  7766 12724
 12335 15085  3575  3469  6148 13103 11208  5647]
Ro: 0.0331313897449
Iteracion: 3
Documentos: [ 9889  5600  2013  8645 15581  4203 13504  7497  5687 14826  9410 16863
 13965  9624  4634 13494 13013 14370 10219  5723  1508  5049   707 12477
 14617  2483  7613  7214  9638  8937 12783  7451]
Ro: 0.0329541483957
Iteracion: 4
Documentos: [14580  411

In [8]:
for i, lamb_k in enumerate(lda.lamb):
    print_md('Topic {}'.format(i), bold=True)
    # Draw one word distribution for this topic from Dirichlet(lamb_k)
    np.random.seed(42)
    words_distribution = np.random.dirichlet(lamb_k)
    # Show the 20 highest-probability words
    print_words(words_distribution.argsort()[::-1][:20], vocab)

**Topic 0**

149, associated
210, image
642, hierarchical
382, linear
251, code
228, allows
448, input
299, pattern
619, visualization
373, images
286, recognition
451, responses
1538, arbitrary
612, expected
774, conclusions
719, specificity
659, representations
709, testing
625, connectivity
421, diverse


**Topic 1**

24, method
163, mechanisms
371, essential
669, relations
327, called
699, explain
682, popular
1178, supporting
726, adaptation
916, cultural
976, reliable
891, conventional
863, idea
871, likelihood
938, readers
1361, phenomenon
1883, slow
745, synaptic
1780, reasons
2561, illustrates


**Topic 2**

78, tools
92, known
236, areas
158, identified
223, communication
254, factor
379, attention
518, widely
266, power
298, ability
426, derived
543, little
647, perspective
725, choice
568, compare
548, families
637, loss
498, reads
1285, constant
927, assessment


**Topic 3**

47, evolution
182, suggest
383, likely
629, modern
510, topics
521, tree
597, challenge
555, statistics
683, greater
1183, understood
1005, animal
922, cooperation
1540, expansion
735, clear
820, remains
827, copy
1378, host
1228, evolve
140, related
1667, ago


**Topic 4**

222, memory
263, random
466, increased
535, increasing
378, area
588, strategy
608, improved
707, working
692, version
748, specifically
560, times
552, indicate
1354, los
1231, behavioral
1359, kind
1288, las
1614, everyday
1799, establish
1778, release
2210, computations


**Topic 5**

32, large
101, evolutionary
148, significant
443, adaptive
192, local
444, importance
397, relative
593, distributions
1003, fraction
986, taken
756, climate
1133, splicing
854, traits
1530, induced
1348, transcriptome
1152, external
1501, greatly
1622, news
1398, year
1006, plant


**Topic 6**

135, application
224, resources
220, computing
209, observed
377, integration
55, applications
57, users
490, limited
566, tags
897, movement
665, original
867, file
766, tag
850, facilitate
1155, reduction
974, metadata
1642, boundaries
832, motor
1149, improvement
1412, files


**Topic 7**

4, model
28, number
121, online
100, structures
85, user
120, rna
173, form
216, issues
140, related
186, order
146, possible
393, value
636, take
1098, realworld
1028, assess
489, strong
1517, extensions
55, applications
1182, effectiveness
2784, conduct


**Topic 8**

21, approach
29, design
72, techniques
36, present
219, scientific
214, future
342, introduction
310, real
475, difficult
197, tool
287, programs
297, significantly
353, understand
605, analyze
554, characteristics
585, degree
301, effective
979, storage
915, vision
917, numerous


**Topic 9**

157, levels
232, type
311, help
155, target
470, report
122, mirnas
461, approximately
144, mirna
477, phylogenetic
418, rnas
558, component
649, artificial
500, chromatin
925, interesting
697, base
741, micrornas
977, modifications
810, light
450, relationship
1265, metrics


**Topic 10**

10, methods
94, cell
212, case
458, describes
213, alignment
281, ontology
469, building
656, contains
645, generate
1132, reasoning
708, physics
767, patients
524, alignments
1082, insight
1036, efficiently
813, construction
1206, dynamical
1142, lines
795, density
2177, integrates


**Topic 11**

7, results
18, two
257, efficient
277, evaluation
253, wide
256, targets
245, cancer
320, best
358, scale
229, pathways
438, predict
66, provides
430, finding
895, matching
899, video
835, shape
1812, transform
1020, face
759, flow
1333, forces


**Topic 12**

0, information
25, structure
53, algorithms
89, statistical
133, novel
207, retrieval
380, association
332, growth
400, free
424, articles
463, history
641, tested
503, values
599, goal
694, experiment
581, estimation
582, sharing
57, users
1108, academic
1258, publication


**Topic 13**

69, patterns
108, sequencing
71, cells
112, single
160, key
313, domains
187, population
272, made
533, molecules
434, documents
472, clusters
390, largescale
717, automatic
715, cluster
556, document
771, normal
617, interfaces
982, structured
1001, definition
1410, navigation


**Topic 14**

107, general
176, visual
200, quantum
345, engineering
290, detection
238, compared
420, feature
402, aspects
329, shows
479, signals
872, conceptual
576, region
531, rules
877, formation
1129, participants
839, motion
1048, focused
1163, successfully
821, interactive
1080, accurately


**Topic 15**

48, biological
119, field
88, science
190, modeling
170, years
502, environmental
314, background
335, interface
322, organisms
631, scientists
757, amount
456, increase
648, genomics
545, variants
846, significance
729, filtering
1093, functionally
1115, showing
1114, genetics
920, differential


**Topic 16**

152, change
394, short
496, generation
395, impact
564, computation
341, presents
892, seen
994, goals
436, humans
607, lead
1100, six
1794, balance
705, spike
777, written
1177, variations
970, reveals
1224, section
1453, modular
928, behaviour
1571, citations


**Topic 17**

77, features
384, challenges
352, recently
409, public
284, complexity
419, collaborative
321, view
602, determine
512, integrated
677, initial
1126, supports
1358, run
668, shared
804, demonstrated
858, usage
779, reveal
862, build
905, culture
2170, adoption
1665, running


**Topic 18**

139, community
1026, exist
416, mobile
601, solutions
544, makes
638, wireless
142, review
834, protocols
679, protocol
700, routing
884, chemical
901, manner
1219, come
1021, predictive
1519, extremely
1308, solve
1924, exists
800, advantage
2038, underlie
1063, infrastructure


**Topic 19**

45, available
110, database
102, developed
114, experiments
93, small
191, test
351, point
445, required
467, past
788, analyzing
1157, developers
937, true
1077, designing
433, accurate
1223, cycle
1091, biochemical
944, freely
960, contain
932, graphical
1237, done


**Topic 20**

11, protein
355, previous
590, implemented
848, leads
483, additional
537, noise
720, resolution
990, stable
801, negative
852, synthetic
885, education
1107, contributions
1383, period
1105, gain
732, reported
1236, continuous
886, translation
688, sensitivity
1784, allowed
964, precision


**Topic 21**

3, gene
5, genes
15, expression
99, transcription
65, interaction
58, sites
340, alternative
235, microarray
204, regulation
41, functional
283, conditions
365, pathway
303, yeast
497, modules
653, annotations
514, motif
1154, tissue
441, mrna
404, genomewide
595, mouse


**Topic 22**

105, simple
195, neural
172, state
319, metabolic
211, task
363, machine
328, states
215, neurons
412, concept
302, cognitive
1025, internal
401, independent
667, theories
790, imaging
754, vector
743, developing
996, principle
1739, instance
752, showed
1103, changing


**Topic 23**

130, technology
35, theory
203, architecture
205, program
646, requirements
508, direct
381, predictions
591, highthroughput
1131, describing
527, probabilistic
988, capacity
985, candidate
1666, blocks
1373, numerical
1040, mind
903, developmental
924, body
1494, spaces
783, docking
1078, developments


**Topic 24**

19, research
40, knowledge
31, study
37, studies
123, current
162, variation
513, sample
643, game
293, theoretical
357, strategies
610, games
718, material
1096, hypotheses
814, production
975, volume
882, leading
1419, examines
468, published
992, advanced
945, qualitative


**Topic 25**

96, support
208, environment
336, higher
770, fact
457, making
530, needs
580, agents
634, insights
484, defined
201, proposed
504, influence
784, broad
1104, dimensions
2, new
913, actions
906, efforts
1158, construct
1618, objectoriented
1000, perception
921, decisions


**Topic 26**

63, problem
74, language
75, problems
386, cases
255, useful
261, addition
539, languages
177, processing
698, procedure
499, mathematical
900, purpose
415, populations
414, good
354, critical
570, question
1283, opportunities
946, extended
704, solution
1035, conformational
942, maximum


**Topic 27**

2, new
80, understanding
64, genetic
97, experimental
90, three
166, prediction
547, decision
622, drosophila
519, detailed
425, action
662, errors
763, typically
1289, event
1106, closely
1032, utility
948, entire
1365, benefit
1935, proposes
930, elegans
1374, directions


**Topic 28**

33, software
129, individual
61, dna
233, disease
439, inference
359, involved
431, mutations
714, necessary
731, emerging
733, drug
663, diseases
716, maps
613, samples
523, consistent
980, substantial
703, relatively
798, variability
963, thousands
520, collection
525, profiles


**Topic 29**

111, activity
137, context
312, graph
474, means
288, digital
644, evaluate
347, effect
411, events
571, hypothesis
713, examine
780, interpretation
969, associations
1264, cognition
1179, psychology
1759, rational
1266, technological
1147, physiological
1534, directed
1402, aim
2046, put


**Topic 30**

132, terms
41, functional
145, brain
171, need
387, signal
316, comparison
91, regions
609, forms
587, offers
640, fast
760, matrix
772, corresponding
823, secondary
825, transfer
675, studied
751, stimuli
1482, oscillations
861, subjects
1396, driven
1357, typical


**Topic 31**

9, network
237, discuss
440, improve
536, graphs
624, rapid
870, comparing
574, tagging
360, quantitative
1118, discusses
1120, built
806, combining
1018, parameter
1630, focusing
22, based
1238, mass
2781, vertices
747, correlated
1459, metric
1240, challenging
1729, minimum


**Topic 32**

81, control
113, specific
159, behavior
481, main
280, spatial
452, active
399, survey
413, fields
569, optimization
734, libraries
853, searching
1010, capture
1484, approximate
1576, codes
746, larger
1180, operations
1548, mental
1074, innovation
966, gives
787, total


**Topic 33**

82, found
226, access
264, life
234, internet
385, reference
194, literature
480, central
294, individuals
408, measure
575, health
455, ideas
660, revealed
1012, assumptions
546, series
3841, mixing
866, medical
650, estimate
1117, speech
1141, kinds
1371, today


**Topic 34**

26, show
22, based
73, selection
104, molecular
68, binding
260, size
246, domain
57, users
309, four
289, query
488, long
307, representation
492, discussed
737, selected
454, average
1101, effectively
808, acid
58, sites
1045, linked
1790, modularity


**Topic 35**

76, framework
153, natural
278, nature
339, shown
370, positive
464, classes
706, place
361, transcriptional
664, presence
830, unknown
358, scale
764, sciences
690, link
951, scales
1161, positions
1728, perceptual
1450, missing
1250, journals
2041, recall
1300, rdf


**Topic 36**

56, first
225, propose
349, traditional
146, possible
620, perform
589, produce
526, resulting
797, growing
843, respect
542, distance
762, markov
860, success
550, phase
929, composition
1279, starting
1317, que
1156, lists
596, considered
1689, decade
529, currently


**Topic 37**

87, computational
86, recent
95, processes
218, existing
143, genomes
305, described
66, provides
142, review
875, topic
422, fundamental
295, researchers
1043, identity
1067, close
666, stochastic
1140, interact
959, summary
1808, benchmark
1743, drawing
1344, differ
1588, market


**Topic 38**

131, role
117, computer
134, types
375, focus
269, open
769, fully
398, practice
567, play
1017, devices
878, introduced
775, occur
1433, schemes
1051, organism
1795, intensity
1992, consequence
2109, meet
1734, equilibrium
1585, carlo
2005, assume
2151, modeled


**Topic 39**

17, different
43, sequences
127, semantic
243, services
292, relevant
267, discovery
449, allow
447, environments
333, comprehensive
491, ontologies
753, determined
495, temporal
628, trees
1406, privacy
803, bacterial
1248, generating
1564, spectrum
1088, effort
604, family
1386, strength


**Topic 40**

1, analysis
8, networks
44, important
60, search
325, class
598, appropriate
728, formal
175, applied
573, biomedical
868, extent
1478, engines
1726, record
2163, aimed
22, based
1773, personalized
1876, multivariate
55, applications
1664, aims
2769, exploited
2338, documentation


**Topic 41**

52, development
83, approaches
59, book
151, world
181, examples
334, students
406, questions
710, advances
809, illustrate
736, trust
824, successful
562, core
1056, easy
961, treatment
776, step
1112, style
1049, capabilities
1532, ithe
1252, international
1168, living


**Topic 42**

16, web
179, content
271, databases
368, objects
586, categories
392, obtained
658, signaling
1044, items
494, comparative
1597, minimal
1187, google
807, pages
1321, manual
1053, page
1445, top
873, accessible
1202, collections
2183, load
1609, interests
2262, wikis


**Topic 43**

14, models
67, performance
54, work
165, clustering
240, source
407, training
898, special
584, contact
712, combination
528, unique
1384, transformation
907, allowing
1008, rich
1247, works
2436, advice
1030, robustness
1916, occurring
1637, performs
1339, views
1159, limitations


**Topic 44**

6, systems
115, sets
250, concepts
275, components
376, object
279, basis
391, resource
462, media
372, complete
674, products
837, product
670, increasingly
904, issue
896, paradigm
1004, progress
931, offer
1409, historical
1360, demonstrates
1227, category
1368, descriptions


**Topic 45**

38, time
258, presented
150, people
343, bayesian
473, provided
799, argue
509, providing
630, consider
639, require
661, far
614, description
616, microbial
880, ecological
874, observations
1162, operating
1512, earlier
1281, reports
1144, correct
680, length
1763, specification


**Topic 46**

23, learning
118, text
259, variety
198, classification
178, demonstrate
318, mining
678, automatically
532, introduce
657, extraction
308, tasks
427, virtual
534, parallel
685, apply
594, optimal
583, variables
778, rapidly
1362, acquisition
912, thought
478, edition
902, list


**Topic 47**

42, interactions
106, high
128, potential
227, standard
350, uses
217, better
241, second
356, mechanism
268, site
405, mapping
486, map
551, service
851, experimentally
453, sources
429, nodes
654, location
1229, coupled
1244, flexibility
1764, scaling
1377, rule


**Topic 48**

13, system
39, algorithm
70, process
196, distributed
156, factors
180, space
410, technique
615, relevance
174, rate
538, give
428, error
740, needed
563, feedback
840, force
958, scheme
1254, faster
687, five
1014, false
1083, constructed
1391, hardware


**Topic 49**

103, describe
206, communities
167, dynamics
231, accuracy
188, global
252, basic
239, response
517, account
516, directly
317, abstract
606, points
442, suggests
324, cortex
623, full
652, queries
730, clinical
828, measured
1668, compute
796, surface
995, citation


**Topic 50**

116, properties
147, structural
221, programming
168, evidence
183, range
265, annotation
296, previously
364, simulation
1235, extension
635, assembly
577, subject
816, computers
965, logic
561, complexes
815, prior
1097, potentially
1226, exchange
978, crucial
842, flexible
691, roles


**Topic 51**

49, complex
109, common
136, identify
164, elements
403, principles
323, organization
459, library
362, designed
471, project
460, address
485, pairs
918, share
792, organizations
890, organizational
849, ones
655, cost
721, end
855, increases
859, created
768, estimates


**Topic 52**

388, availability
331, parameters
432, experience
887, lower
600, activities
338, practical
749, read
1023, benefits
742, members
1249, discrete
997, phenomena
681, sampling
1122, platform
1033, phenotypes
1337, determining
1151, architectures
1119, reduce
1366, systematically
1462, foundation
1123, display


**Topic 53**

27, sequence
34, proteins
138, biology
185, distribution
141, genomic
515, requires
262, energy
482, simulations
326, like
291, similarity
346, rates
507, proteinprotein
739, supplementary
981, connected
953, folding
1034, comparisons
758, amino
1084, majority
1094, detail
1491, considerable


**Topic 54**

124, level
193, management
476, implications
273, underlying
315, diversity
505, correlation
723, methodology
522, represent
423, bioinformatics
553, create
578, early
755, position
1284, longterm
724, neuronal
971, semantics
876, economic
1184, subset
1221, creation
889, biodiversity
1081, society


**Topic 55**

12, human
30, genome
79, regulatory
126, functions
51, function
50, species
154, highly
511, identifying
91, regions
247, conserved
274, motifs
621, extensive
506, coding
330, conservation
632, noncoding
829, promoters
822, million
962, cerevisiae
817, promoter
229, pathways


**Topic 56**

62, multiple
125, particular
169, make
367, physical
750, overview
501, able
446, low
671, learn
611, words
651, guide
744, classical
1153, precise
1127, meaning
557, constraints
869, correlations
1272, workflow
1148, fluctuations
1042, writing
1336, intrinsic
1486, promising


**Topic 57**

98, changes
276, authors
337, probability
230, quality
344, empirical
389, generated
282, cellular
627, projects
983, wikipedia
702, tests
673, combined
894, mean
1213, reconstruction
1019, suggested
856, collaboration
1171, detecting
1186, agent
957, package
1075, objective
1349, traffic


**Topic 58**

20, social
46, provide
202, technologies
189, major
248, relationships
184, article
199, groups
369, implementation
242, group
300, interest
592, personal
366, develop
1027, strongly
493, primary
789, define
936, last
919, collective
1539, focuses
940, notion
1087, matter


**Topic 59**

161, effects
348, findings
540, robust
374, measures
618, author
1029, conclude
1282, explains
695, term
881, appear
972, explicit
811, suggesting
396, ways
1369, locations
1057, commonly
1269, scoring
2179, teachers
1909, examining
1208, modification
3331, circuitry
1089, reflect
