# Stochastic Variational Inference for LDA

## Introduction

This notebook implements Latent Dirichlet Allocation (LDA), a topic-modelling algorithm, in its Stochastic Variational Inference version. The implementation tries to stay close to what Hoffman, Blei, Wang and Paisley propose in [Stochastic Variational Inference](http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf).
Other works used as references:
 * For a broader overview of LDA: [Inference Methods for Latent Dirichlet Allocation](http://times.cs.uiuc.edu/course/598f16/notes/lda-survey.pdf);
 * For a more detailed explanation of the mean-field variational family: [Variational Inference: A Review for Statisticians](https://arxiv.org/pdf/1601.00670.pdf)
 
Additionally, the idea of implementing LDA came out of an earlier attempt to implement the approach of Wang and Blei in [Collaborative Topic Modeling for Recommending Scientific Articles](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf) for recommender systems. In fact, the dataset used here is the same one they use. I may implement that paper as a next step.

## Theoretical background

Assuming K topics and D documents, each with N words (for simplicity I assume documents of equal length) drawn from a vocabulary of size V, the generative model is as follows:

1. Draw topics $\beta_k \sim \mathrm{Dirichlet}(\eta,\dots,\eta)$ for each $k\in \{0,\dots,K-1\}$
2. For each document $d \in \{0,\dots, D-1\}$:
    3. Draw $\theta_d \sim \mathrm{Dirichlet}(\alpha,\dots,\alpha)$
    4. For each word position $n \in \{0,\dots, N-1\}$:
        5. Draw the topic assignment $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$
        6. Draw the word $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$

The variables involved are:
* $\beta_k$, topic $k$. A topic is essentially a probability vector of length V that models the distribution over the vocabulary words for that topic.
* $\theta_d$, the topic proportions for document $d$. This is again a probability vector, which in this case models the partial contribution of the different topics to a single document. Allowing a document to belong to a mixture of topics rather than to a single one is one of the main strengths of LDA. __This is the main variable on which we want to perform inference__.
* $z_{dn}$, the topic assignment of word $n$ in document $d$.
* $w_{dn}$, word $n$ of document $d$. It is drawn from topic $z_{dn}$.
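
As a rough illustration of the generative process above, the following sketch draws a small synthetic corpus with NumPy. The function name `sample_lda_corpus`, the seed and the toy sizes are assumptions made for this example; they are not part of the notebook's implementation.

```python
import numpy as np

def sample_lda_corpus(K, D, N, V, alpha, eta, seed=0):
    """Draw a synthetic corpus from the LDA generative process (illustrative only)."""
    rng = np.random.RandomState(seed)
    # One word distribution per topic: beta_k ~ Dirichlet(eta, ..., eta)
    beta = rng.dirichlet(np.full(V, eta), size=K)       # shape (K, V)
    # One vector of topic proportions per document: theta_d ~ Dirichlet(alpha, ..., alpha)
    theta = rng.dirichlet(np.full(K, alpha), size=D)    # shape (D, K)
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        # Topic assignment of every word position in document d
        z[d] = rng.choice(K, size=N, p=theta[d])
        for n in range(N):
            # Word drawn from the topic assigned to position n
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return beta, theta, z, w

# Toy sizes, chosen only to keep the example fast
beta, theta, z, w = sample_lda_corpus(K=3, D=5, N=40, V=50, alpha=0.5, eta=0.5)
```

Each row of `beta` and of `theta` sums to one, and `w[d]` holds the word ids of document `d`.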

## Dataset

The dataset was downloaded from http://www.cs.cmu.edu/~chongw/data/citeulike/. It contains information from the site http://www.citeulike.org/, which lets researchers build their own paper libraries and receive recommendations based on them. I do not use the information about users and their libraries; I focus only on the article data.
The two files I use are:
* mult.dat, which contains the ids of the most relevant words of each article, together with their counts. There are 16980 articles (documents) in total.
* vocab.dat, the mapping from ids to the actual words. The vocabulary has 8000 words.

The vocabulary extraction and the other preprocessing steps are those explained in Section 4 of [Wang and Blei (2011)](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf).

Additionally, in this first iteration of the algorithm, I require all documents to have the same number of words. This leads to an extra processing step in which I discard documents with fewer than 40 words, leaving 14836. For the remaining articles, I keep only the first 40 words, ordered from most to least frequent in the document.
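
The truncation described above could be sketched as follows. This is only an illustration under stated assumptions: the helper `truncate_document` and its dictionary input are hypothetical, and "words" is interpreted as word occurrences (tokens) rather than distinct vocabulary entries. Note that the `load_dataset` cell in this notebook instead resamples a fixed number of tokens per document with `np.random.choice`.

```python
def truncate_document(word_counts, n_words=40):
    """Return n_words token ids, most frequent word first, or None if too short.

    word_counts: dict mapping word_id -> count (hypothetical input format).
    """
    # Order words by decreasing count; ties are broken by word id (an assumption)
    ordered = sorted(word_counts.items(), key=lambda wc: (-wc[1], wc[0]))
    # Expand the counts into a flat list of tokens
    tokens = [word for word, cnt in ordered for _ in range(cnt)]
    if len(tokens) < n_words:
        return None  # the document would be discarded
    return tokens[:n_words]

print(truncate_document({7: 3, 2: 1, 5: 2}, n_words=4))  # [7, 7, 7, 5]
```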

Next, I import some libraries I am going to use and load the dataset. For reproducibility, I set a fixed seed.

In [1]:
from scipy.stats import dirichlet, multinomial, expon
from scipy.special import digamma
from collections import defaultdict
from math import pow
import numpy as np
import copy
import logging

from online_lda import OnlineLDA
from mini_batch_lda import MiniBatchLDA
from helpers import *

logging.basicConfig(level=logging.INFO)

In [2]:
def load_dataset():
    words_per_doc = defaultdict(list)
    # Parse the documents file line by line
    with open('data/mult.dat') as keywords_f:
        for d, line in enumerate(keywords_f):
            # First element is the number of distinct words; the rest are word_id:count pairs
            for w_c in line.split()[1:]:
                word, cnt = w_c.split(':')
                # Repeat each word id as many times as it occurs in the document
                words_per_doc[d].extend([int(word)] * int(cnt))
            # Resample (with replacement) so every document has the same number of words.
            # Re-seeding per document makes each sample depend only on that document's contents.
            np.random.seed(42)
            words_per_doc[d] = np.random.choice(words_per_doc[d], size=60)

    vocab = {}
    with open('data/vocab.dat') as vocab_f:
        for v, word in enumerate(vocab_f):
            vocab[v] = word.rstrip()

    return words_per_doc, vocab

In [3]:
words_by_doc, vocab = load_dataset()

## LDA

First I define some helper functions that will be useful later.

The _LatentDirichletAllocation_ class has _fit_ as its main method, which applies the algorithm proposed by Hoffman et al. (2013). The notation chosen for the parameters closely follows the one used in Section 3 of that paper.

The image below shows the pseudocode of the algorithm.

![title](img/SVI-pseudocode.png)
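
For readers without access to the image, here is a minimal sketch of one stochastic step of SVI for LDA on a single sampled document, following the structure in Hoffman et al. (2013): a local step that iterates $\phi$ and $\gamma$, an intermediate global parameter $\hat\lambda$, and a step size $\rho_t = (t + \tau)^{-\kappa}$. The function names, the convergence test and the defaults `tau=1.0`, `kappa=0.8` are assumptions for illustration, not the code of `OnlineLDA` or `MiniBatchLDA`. With these assumed defaults, $\rho_t$ takes the values 1.0, 0.574..., 0.415... for $t = 0, 1, 2$, which agrees with the `Ro` values logged by the mini-batch run in this notebook.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_expected_log(param):
    """E[log x] for x ~ Dirichlet(param), computed along the last axis."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def svi_step(lamb, doc, t, D, alpha, eta, tau=1.0, kappa=0.8,
             max_local_iters=100, tol=1e-4):
    """One stochastic update of lamb (K x V) from a single sampled document."""
    doc = np.asarray(doc)
    K = lamb.shape[0]
    # E[log beta_{k, w_dn}] for the words of this document, shape (K, N)
    elog_beta = dirichlet_expected_log(lamb)[:, doc]

    # Local step: alternate the phi and gamma updates until gamma stabilises
    gamma = np.ones(K)
    for _ in range(max_local_iters):
        log_phi = dirichlet_expected_log(gamma)[:, None] + elog_beta
        log_phi -= log_phi.max(axis=0)      # for numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=0)              # each column sums to 1 over topics
        new_gamma = alpha + phi.sum(axis=1)
        converged = np.abs(new_gamma - gamma).mean() < tol
        gamma = new_gamma
        if converged:
            break

    # Intermediate global parameters, as if the corpus were D copies of this document
    lamb_hat = np.full(lamb.shape, eta, dtype=float)
    for n, v in enumerate(doc):
        lamb_hat[:, v] += D * phi[:, n]

    # Step size, then a weighted average of the old and intermediate parameters
    rho = (t + tau) ** (-kappa)
    return (1.0 - rho) * lamb + rho * lamb_hat, gamma, rho

# Toy usage with made-up sizes
rng = np.random.RandomState(0)
K, V, D = 4, 30, 100
lamb = rng.gamma(100.0, 0.01, size=(K, V))   # random initialisation (an assumption)
doc = rng.randint(V, size=40)                # toy document of 40 word ids
lamb, gamma, rho = svi_step(lamb, doc, t=0, D=D, alpha=0.5, eta=0.1)
```

In a full run this step would be repeated with a freshly sampled document and an increasing `t`, so that `rho` decays and later documents perturb `lamb` less.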

In [4]:
D = len(words_by_doc)  # number of documents
N = 60 # words per doc
K = 60  # number of topics
V = len(vocab)  # vocabulary size

# For the alpha and eta values I follow the recommendation of Griffiths and Steyvers (2004)
lda = OnlineLDA(K, D, N, V, 50.0/K, 0.1, words_by_doc)
lda.fit()

Iteracion: 0
Documento: 5770
Ro: 0.0334929207043
Iteracion: 1
Documento: 11239
Ro: 0.0333109641294
Iteracion: 2
Documento: 11493
Ro: 0.0331313897449
Iteracion: 3
Documento: 661
Ro: 0.0329541483957
Iteracion: 4
Documento: 12864
Ro: 0.032779192306
Iteracion: 5
Documento: 7180
Ro: 0.0326064750308
Iteracion: 6
Documento: 363
Ro: 0.0324359514088
Iteracion: 7
Documento: 4779
Ro: 0.0322675775179
Iteracion: 8
Documento: 5343
Ro: 0.0321013106322
Iteracion: 9
Documento: 1663
Ro: 0.0319371091812
Iteracion: 10
Documento: 6505
Ro: 0.03177493271
Iteracion: 11
Documento: 14368
Ro: 0.0316147418418
Iteracion: 12
Documento: 9360
Ro: 0.0314564982415
Iteracion: 13
Documento: 13164
Ro: 0.0313001645812
Iteracion: 14
Documento: 11476
Ro: 0.0311457045065
Iteracion: 15
Documento: 7843
Ro: 0.0309930826047
Iteracion: 16
Documento: 4408
Ro: 0.0308422643739
Iteracion: 17
Documento: 11600
Ro: 0.0306932161937
Iteracion: 18
Documento: 4302
Ro: 0.0305459052965
Iteracion: 19
Documento: 6823
Ro: 0.0304002997402
Iteracio

KeyboardInterrupt: 

In [None]:
for i, lamb_k in enumerate(lda.lamb):
    print_md('Topic {}'.format(i), bold=True)
    # Draw one sample of the topic's word distribution from Dirichlet(lamb_k);
    # the fixed seed keeps the sample reproducible
    np.random.seed(42)
    words_distribution = np.random.dirichlet(lamb_k)
    # Show the 20 most probable words of the sampled distribution
    print_words(words_distribution.argsort()[::-1][:20], vocab)

In [5]:
D = len(words_by_doc)  # number of documents
N = 60 # words per doc
K = 60  # number of topics
V = len(vocab)  # vocabulary size
minibatch_size = 64

# For the alpha and eta values I follow the suggestion of Griffiths and Steyvers (2004)
lda = MiniBatchLDA(K, D, N, V, 50.0/K, 0.1, words_by_doc, minibatch_size)
lda.fit()

INFO:mini_batch_lda:Iteration: 0
INFO:mini_batch_lda:Ro: 1.0
INFO:mini_batch_lda:Iteration: 1
INFO:mini_batch_lda:Ro: 0.574349177499
INFO:mini_batch_lda:Iteration: 2
INFO:mini_batch_lda:Ro: 0.415243646539
INFO:mini_batch_lda:Iteration: 3
INFO:mini_batch_lda:Ro: 0.329876977693
INFO:mini_batch_lda:Iteration: 4
INFO:mini_batch_lda:Ro: 0.275945932292
INFO:mini_batch_lda:Iteration: 5
INFO:mini_batch_lda:Ro: 0.238494846851
INFO:mini_batch_lda:Iteration: 6
INFO:mini_batch_lda:Ro: 0.210824737371
INFO:mini_batch_lda:Iteration: 7
INFO:mini_batch_lda:Ro: 0.189464570814
INFO:mini_batch_lda:Iteration: 8
INFO:mini_batch_lda:Ro: 0.172427285991
INFO:mini_batch_lda:Iteration: 9
INFO:mini_batch_lda:Ro: 0.158489319246
INFO:mini_batch_lda:Iteration: 10
INFO:mini_batch_lda:Ro: 0.1468540242
INFO:mini_batch_lda:Iteration: 11
INFO:mini_batch_lda:Ro: 0.136979319126
INFO:mini_batch_lda:Iteration: 12
INFO:mini_batch_lda:Ro: 0.128482896333
INFO:mini_batch_lda:Iteration: 13
INFO:mini_batch_lda:Ro: 0.121087014505
I

In [6]:
for i, lamb_k in enumerate(lda.lamb):
    print_md('Topic {}'.format(i), bold=True)
    # Draw one sample of the topic's word distribution from Dirichlet(lamb_k);
    # the fixed seed keeps the sample reproducible
    np.random.seed(42)
    words_distribution = np.random.dirichlet(lamb_k)
    # Show the 20 most probable words of the sampled distribution
    print_words(words_distribution.argsort()[::-1][:20], vocab)

**Topic 0**

52, development
382, linear
471, project
286, recognition
641, tested
448, input
537, noise
694, experiment
506, coding
624, rapid
867, file
571, hypothesis
709, testing
659, representations
756, climate
666, stochastic
1221, creation
1166, speed
1229, coupled
1784, allowed


**Topic 1**

99, transcription
163, mechanisms
185, distribution
256, targets
122, mirnas
144, mirna
1026, exist
405, mapping
585, degree
461, approximately
741, micrornas
970, reveals
747, correlated
595, mouse
817, promoter
71, cells
417, expressed
1054, units
954, microrna
1183, understood


**Topic 2**

131, role
236, areas
223, communication
158, identified
279, basis
518, widely
298, ability
296, previously
354, critical
979, storage
770, fact
567, play
675, studied
701, numbers
1029, conclude
548, families
882, leading
663, diseases
752, showed
1251, equations


**Topic 3**

47, evolution
160, key
410, technique
463, history
510, topics
763, typically
719, specificity
775, occur
873, accessible
654, location
860, success
804, demonstrated
820, remains
383, likely
1005, animal
664, presence
1214, evolved
1512, earlier
1228, evolve
857, see


**Topic 4**

24, method
43, sequences
197, tool
392, obtained
472, clusters
521, tree
728, formal
628, trees
535, increasing
692, version
485, pairs
744, classical
959, summary
687, five
560, times
674, products
910, easily
445, required
1922, joint
907, allowing


**Topic 5**

32, large
65, interaction
101, evolutionary
148, significant
168, evidence
231, accuracy
898, special
403, principles
397, relative
684, despite
928, behaviour
854, traits
1136, obtain
489, strong
1367, organized
2034, particles
1554, weak
1559, sizes
779, reveal
1572, atomic


**Topic 6**

222, memory
227, standard
175, applied
401, independent
329, shows
412, concept
262, energy
319, metabolic
155, target
508, direct
490, limited
834, protocols
1155, reduction
921, decisions
1032, utility
1235, extension
1152, external
767, patients
1009, vivo
1106, closely


**Topic 7**

4, model
225, propose
237, discuss
140, related
173, form
349, traditional
338, practical
470, report
439, inference
468, published
1098, realworld
711, measurements
815, prior
1286, similarities
1642, boundaries
1164, contexts
1379, reaction
1902, name
70, process
1517, extensions


**Topic 8**

21, approach
36, present
72, techniques
310, real
287, programs
301, effective
374, measures
297, significantly
475, difficult
1283, opportunities
662, errors
554, characteristics
272, made
771, normal
865, electronic
1044, items
858, usage
1111, recommender
1217, partial
1146, represents


**Topic 9**

93, small
106, high
120, rna
157, levels
241, second
418, rnas
533, molecules
617, interfaces
584, contact
477, phylogenetic
661, far
733, drug
758, amino
808, acid
2180, surfaces
1348, transcriptome
1116, enzymes
1666, blocks
1788, consisting
1313, discovered


**Topic 10**

17, different
73, selection
112, single
246, domain
212, case
458, describes
281, ontology
469, building
491, ontologies
437, datasets
529, currently
482, simulations
732, reported
855, increases
901, manner
813, construction
1192, arise
1082, insight
1224, section
1016, evaluated


**Topic 11**

66, provides
59, book
277, evaluation
369, implementation
360, quantitative
308, tasks
517, account
621, extensive
438, predict
720, resolution
466, increased
555, statistics
623, full
958, scheme
1113, estimated
1211, unified
722, loci
1003, fraction
396, ways
1201, led


**Topic 12**

0, information
25, structure
167, dynamics
576, region
424, articles
573, biomedical
914, enables
949, journal
1467, statistically
1685, seeking
1024, contribute
1870, eight
1910, auditory
2626, promote
1361, phenomenon
995, citation
2047, expertise
2381, presenting
3512, polynomial
22, based


**Topic 13**

1, analysis
69, patterns
187, population
343, bayesian
285, analyses
715, cluster
698, procedure
1148, fluctuations
1212, characterize
1257, principal
10, methods
1124, takes
2198, optimized
1503, substantially
3586, ancestry
3163, singlemolecule
1672, shift
1301, mode
1876, multivariate
1073, depends


**Topic 14**

23, learning
107, general
85, user
220, computing
345, engineering
414, good
381, predictions
488, long
643, game
627, projects
649, artificial
819, characterized
610, games
838, intelligence
1089, reflect
1099, saccharomyces
940, notion
962, cerevisiae
1048, focused
1195, books


**Topic 15**

30, genome
88, science
108, sequencing
202, technologies
143, genomes
362, designed
266, power
332, growth
304, identification
456, increase
565, motivation
534, parallel
294, individuals
400, free
635, assembly
498, reads
394, short
648, genomics
141, genomic
739, supplementary


**Topic 16**

28, number
152, change
209, observed
305, described
450, relationship
496, generation
423, bioinformatics
443, adaptive
564, computation
647, perspective
640, fast
737, selected
1004, progress
757, amount
557, constraints
726, adaptation
994, goals
892, seen
1285, constant
1022, dataset


**Topic 17**

6, systems
196, distributed
200, quantum
321, view
357, strategies
291, similarity
465, distinct
422, fundamental
578, early
512, integrated
1162, operating
1366, systematically
938, readers
862, build
971, semantics
1275, mechanics
1310, actual
943, achieved
1405, integrate
1104, dimensions


**Topic 18**

96, support
221, programming
139, community
386, cases
341, presents
177, processing
288, digital
897, movement
226, access
710, advances
391, resource
601, solutions
947, output
1402, aim
1387, standards
1178, supporting
1294, networking
1176, hand
1151, architectures
1414, trends


**Topic 19**

45, available
119, field
114, experiments
191, test
351, point
377, integration
265, annotation
467, past
46, provide
608, improved
433, accurate
932, graphical
788, analyzing
841, achieve
978, crucial
980, substantial
1077, designing
1138, studying
1149, improvement
944, freely


**Topic 20**

39, algorithm
252, basic
233, disease
299, pattern
355, previous
602, determine
885, education
700, routing
848, leads
794, technical
435, due
992, advanced
677, initial
818, examined
961, treatment
1131, describing
964, precision
915, vision
917, numerous
781, chapter


**Topic 21**

15, expression
3, gene
92, known
51, function
249, differences
235, microarray
162, variation
98, changes
290, detection
347, effect
505, correlation
1282, explains
653, annotations
1156, lists
2049, genomescale
1396, driven
1389, magnitude
933, divergence
1142, lines
1051, organism


**Topic 22**

172, state
111, activity
195, neural
215, neurons
245, cancer
337, probability
302, cognitive
314, background
328, states
1025, internal
415, populations
544, makes
832, motor
451, responses
579, cortical
324, cortex
1143, optical
754, vector
307, representation
774, conclusions


**Topic 23**

29, design
35, theory
205, program
413, fields
646, requirements
784, broad
706, place
1118, discusses
1411, designs
1215, mathematics
1094, detail
956, containing
1071, art
1527, implement
1252, international
1613, definitions
1418, designers
2179, teachers
1780, reasons
1618, objectoriented


**Topic 24**

20, social
19, research
31, study
37, studies
293, theoretical
283, conditions
447, environments
476, implications
764, sciences
1265, metrics
295, researchers
212, case
1129, participants
1247, works
1096, hypotheses
1121, valuable
1081, society
1419, examines
905, culture
1475, disciplines


**Topic 25**

2, new
387, signal
208, environment
261, addition
634, insights
504, influence
877, formation
809, illustrate
705, spike
801, negative
1012, assumptions
1170, simultaneously
1925, mediated
1988, expanded
1671, element
3075, held
1770, chosen
996, principle
2691, truly
2590, intracellular


**Topic 26**

12, human
63, problem
75, problems
380, association
390, largescale
499, mathematical
452, active
645, generate
716, maps
569, optimization
704, solution
942, maximum
868, extent
1042, writing
975, volume
1180, operations
977, modifications
1667, ago
912, thought
2051, outperforms


**Topic 27**

11, protein
64, genetic
203, architecture
166, prediction
190, modeling
213, alignment
313, domains
303, yeast
408, measure
425, action
453, sources
316, comparison
255, useful
547, decision
421, diverse
870, comparing
688, sensitivity
561, complexes
1344, differ
1114, genetics


**Topic 28**

61, dna
90, three
275, components
268, site
503, values
606, points
772, corresponding
822, million
613, samples
174, rate
697, base
748, specifically
1140, interact
791, detect
1392, coupling
1171, detecting
1210, fixed
1331, extend
1072, representing
893, coli


**Topic 29**

130, technology
121, online
224, resources
136, identify
592, personal
474, means
900, purpose
311, help
574, tagging
419, collaborative
734, libraries
566, tags
729, filtering
727, business
766, tag
1266, technological
1046, recommendations
931, offer
530, needs
853, searching


**Topic 30**

55, applications
257, efficient
171, need
335, interface
642, hierarchical
334, students
708, physics
587, offers
619, visualization
563, feedback
967, course
1120, built
460, address
1277, generic
1112, style
671, learn
1372, details
1507, classic
311, help
1225, usability


**Topic 31**

9, network
178, demonstrate
22, based
312, graph
263, random
389, generated
509, providing
536, graphs
446, low
434, documents
609, forms
743, developing
556, document
541, systematic
981, connected
572, contrast
1039, topology
843, respect
429, nodes
1534, directed


**Topic 32**

41, functional
81, control
104, molecular
159, behavior
342, introduction
519, detailed
479, signals
603, performed
824, successful
738, analyzed
683, greater
936, last
787, total
1630, focusing
751, stimuli
1134, difference
930, elegans
1560, preferences
1033, phenotypes
1682, iii


**Topic 33**

115, sets
201, proposed
230, quality
253, wide
353, understand
769, fully
22, based
350, uses
407, training
455, ideas
620, perform
593, distributions
598, appropriate
459, library
760, matrix
1028, assess
785, sense
1013, array
2050, achieves
1202, collections


**Topic 34**

26, show
57, users
83, approaches
50, species
124, level
58, sites
260, size
68, binding
254, factor
156, factors
161, effects
289, query
531, rules
2177, integrates
327, called
852, synthetic
783, docking
1002, ligand
1330, reliability
1564, spectrum


**Topic 35**

76, framework
184, article
153, natural
388, availability
367, physical
278, nature
464, classes
1153, precise
872, conceptual
483, additional
793, efficiency
919, collective
828, measured
1539, focuses
972, explicit
1001, definition
946, extended
1519, extremely
484, defined
1428, deal


**Topic 36**

44, important
56, first
126, functions
128, potential
214, future
70, process
494, comparative
526, resulting
359, involved
723, methodology
542, distance
891, conventional
528, unique
550, phase
982, structured
1351, derive
735, clear
352, recently
998, forest
1621, perspectives


**Topic 37**

48, biological
95, processes
218, existing
194, literature
217, better
183, range
322, organisms
318, mining
238, compared
604, family
725, choice
903, developmental
778, rapidly
670, increasingly
1244, flexibility
1640, surveys
1175, suitable
1557, applicable
1095, causal
1649, homologous


**Topic 38**

134, types
117, computer
181, examples
375, focus
234, internet
372, complete
402, aspects
605, analyze
789, define
835, shape
1017, devices
861, subjects
878, introduced
366, develop
730, clinical
968, reviews
1516, combinations
1542, defining
1233, emergence
1043, identity


**Topic 39**

3, gene
102, developed
127, semantic
82, found
243, services
267, discovery
333, comprehensive
292, relevant
404, genomewide
495, temporal
525, profiles
714, necessary
559, frequency
1154, tissue
1305, date
913, actions
929, composition
1406, privacy
1070, creating
1390, format


**Topic 40**

8, networks
416, mobile
326, like
631, scientists
395, impact
597, challenge
638, wireless
718, material
782, connections
599, goal
997, phenomena
679, protocol
1490, behaviors
1302, profiling
1358, run
1299, consists
1197, path
2288, possibilities
1398, year
1336, intrinsic


**Topic 41**

33, software
135, application
100, structures
151, world
154, highly
240, source
371, essential
457, making
216, issues
856, collaboration
629, modern
615, relevance
618, author
859, created
1157, developers
966, gives
1196, security
52, development
1639, quickly
987, third


**Topic 42**

16, web
60, search
123, current
271, databases
513, sample
717, automatic
406, questions
899, video
586, categories
807, pages
777, written
652, queries
682, popular
226, access
1058, email
1300, rdf
1056, easy
1478, engines
1053, page
1312, webbased


**Topic 43**

14, models
67, performance
259, variety
336, higher
543, little
527, probabilistic
398, practice
1008, rich
10, methods
1538, arbitrary
1274, improvements
1050, supported
552, indicate
909, regression
1194, completely
1474, generalized
1018, parameter
1479, enabling
327, called
1655, messages


**Topic 44**

250, concepts
232, type
384, challenges
242, group
264, life
280, spatial
376, object
539, languages
462, media
669, relations
600, activities
636, take
896, paradigm
432, experience
577, subject
736, trust
1067, close
1135, entities
575, health
904, issue


**Topic 45**

53, algorithms
38, time
105, simple
150, people
165, clustering
344, empirical
449, allow
473, provided
639, require
665, original
630, consider
633, links
523, consistent
963, thousands
690, link
1159, limitations
546, series
922, cooperation
1604, latent
1793, modified


**Topic 46**

5, genes
54, work
149, associated
431, mutations
478, edition
594, optimal
637, loss
703, relatively
827, copy
864, affect
993, bacteria
436, humans
811, suggesting
825, transfer
1383, period
803, bacterial
934, mutation
2155, solved
1255, involves
1324, reward


**Topic 47**

42, interactions
80, understanding
94, cell
229, pathways
71, cells
269, open
179, content
356, mechanism
340, alternative
568, compare
580, agents
365, pathway
551, service
51, function
724, neuronal
1168, living
955, represented
876, economic
1007, established
1139, circuits


**Topic 48**

13, system
18, two
211, task
180, space
176, visual
207, retrieval
368, objects
538, give
588, strategy
430, finding
740, needed
875, topic
839, motion
1249, discrete
986, taken
1000, perception
1391, hardware
1363, indexing
1230, improving
908, machines


**Topic 49**

103, describe
78, tools
110, database
192, local
125, particular
385, reference
239, response
317, abstract
189, major
442, suggests
540, robust
393, value
712, combination
626, explore
699, explain
887, lower
806, combining
1090, stability
324, cortex
1037, criteria


**Topic 50**

116, properties
358, scale
284, complexity
364, simulation
444, importance
713, examine
614, description
762, markov
845, nonlinear
581, estimation
871, likelihood
842, flexible
1150, serve
884, chemical
1076, node
1220, property
795, density
1083, constructed
951, scales
1127, meaning


**Topic 51**

118, text
109, common
198, classification
228, allows
164, elements
320, best
611, words
428, error
833, word
10, methods
656, contains
657, extraction
768, estimates
918, share
721, end
1102, advantages
1117, speech
1061, kernel
1536, medicine
1078, developments


**Topic 52**

40, knowledge
146, possible
182, suggest
199, groups
331, parameters
348, findings
570, question
890, organizational
1027, strongly
582, sharing
676, discussion
965, logic
742, members
681, sampling
1110, measurement
731, emerging
1132, reasoning
850, facilitate
792, organizations
1652, assumed


**Topic 53**

27, sequence
10, methods
34, proteins
133, novel
138, biology
147, structural
515, requires
188, global
493, primary
346, rates
507, proteinprotein
632, noncoding
622, drosophila
522, represent
753, determined
786, residues
851, experimentally
247, conserved
755, position
886, translation


**Topic 54**

193, management
251, code
273, underlying
206, communities
325, class
315, diversity
309, four
558, component
330, conservation
502, environmental
616, microbial
562, core
553, create
673, combined
846, significance
678, automatically
880, ecological
1284, longterm
1088, effort
1006, plant


**Topic 55**

97, experimental
79, regulatory
244, dynamic
282, cellular
204, regulation
361, transcriptional
141, genomic
427, virtual
274, motifs
511, identifying
270, predicted
612, expected
514, motif
836, enable
691, roles
976, reliable
500, chromatin
3, gene
672, mammalian
805, largely


**Topic 56**

89, statistical
62, multiple
74, language
87, computational
142, review
91, regions
169, make
145, brain
516, directly
137, context
501, able
750, overview
790, imaging
1064, relation
10, methods
625, connectivity
1500, magnetic
1019, suggested
1362, acquisition
1375, resonance


**Topic 57**

86, recent
49, complex
77, features
170, years
373, images
339, shown
300, interest
379, attention
323, organization
480, central
667, theories
702, tests
957, package
894, mean
821, interactive
849, ones
1020, face
454, average
1010, capture
1213, reconstruction


**Topic 58**

7, results
258, presented
219, scientific
363, machine
378, area
644, evaluate
492, discussed
411, events
409, public
590, implemented
759, flow
988, capacity
983, wikipedia
46, provide
906, efforts
1023, benefits
1125, index
707, working
1031, age
1167, diffusion


**Topic 59**

129, individual
132, terms
186, order
210, image
248, relationships
486, map
651, guide
420, feature
895, matching
863, idea
796, surface
695, term
668, shared
866, medical
780, interpretation
1126, supports
1665, running
1145, contribution
1410, navigation
1128, modelling


In [20]:
lda.gamma[90:100]

array([[ 1.80063452,  0.96543904,  4.6927096 ,  0.83347637,  0.8334179 ,
         0.84745332,  0.86911628,  2.74826727,  0.869611  ,  0.84137215,
         3.76127965,  0.85443808,  2.92360381,  0.83878002,  0.87534081,
         1.74202782,  0.83826289,  0.84145542,  0.83606442,  0.84334155,
         0.83470439,  0.83623865,  0.83370108,  1.8481884 ,  3.32069561,
         0.83683384,  6.56288889,  2.76104599,  1.87336555,  0.83972944,
         0.83646284,  2.90549102,  0.92129091,  0.84222479,  1.64027932,
         0.84410728,  0.8525896 ,  0.83425007,  0.86161461,  1.86977874,
         1.84935466,  0.83918701,  7.56134089,  0.84088599,  3.95019911,
         0.84304385,  1.30005764,  0.83810353,  0.85924589,  0.83629584,
         3.0258385 ,  2.89843033,  2.58121708,  0.83400247,  0.83747627,
         5.88358516,  2.81906587,  4.11606799,  2.84057258,  1.83445639],
       [ 2.83778888,  0.85705955,  0.84069184,  2.6649871 ,  1.85947524,
         0.83679065,  2.82190297,  5.82438299,  2.