# Stochastic Variational Inference for LDA

## Introducción

En el presente notebook se realiza una implementación del algoritmo para topic modelling, Latent Dirichlet Allocation (LDA), en su versión con Stochastic Variational Inference. La implementación trata de apegarse a lo propuesto por Hoffman, Blei y Wang en [Stochastic Variational Inference](http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf).
Otros trabajos que fueron tomados como referencia son:
 * Para un panorama más amplio sobre LDA: [Inference Methods for Latent Dirichlet Allocation](http://times.cs.uiuc.edu/course/598f16/notes/lda-survey.pdf);
 * Para una explicación más detallada sobre la Mean-Field Variational Family: [Variational Inference: A Review for Statisticians](https://arxiv.org/pdf/1601.00670.pdf)
 
Adicionalmente, la idea de implementar LDA surgió de un intento previo de querer implementar la propuesta de Wang y Blei, en [Collaborative Topic Modeling for Recommending Scientific Articles](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf), para Sistemas de Recomendación. De hecho el dataset utilizado aqui es el mismo que usan ellos. No descarto implementar este paper como siguiente paso.

## Marco teórico

Asumiendo K tópicos, y D documentos, cada uno con N palabras (por simplicidad asumo documentos de igual longitud) pertenecientes a un vocabulario de tamaño V; el modelo generativo es el siguiente:

1. Generamos tópicos $\beta_k$ ~ Dirichlet($\eta,\dots,\eta$) para cada $k\in \{0,\dots,K-1\}$
2. Para cada documento $d \in \{0,\dots, D-1\}$:
    3. Generamos $\theta_d$ ~ Dirichlet($\alpha,\dots,\alpha$)
    4. Para cada palabra $w \in \{0,\dots, N-1\}$:
        5. Generamos la asignación de tema $z_{dn}$ ~ Multinomial($\theta_d$)
        6. Generamos la palabra $w_{dn}$ ~ Multinomial($\beta_{z_{dn}}$)

Los parámetros involucrados son:
* $\beta_k$, el tópico $k$. Un tópico consiste básicamente en un vector de probabilidad de longitud V, que modela la distribución de las distintas palabras en el vocabulario para ese tema
* $\theta_d$, la proporción de los tópicos para el documento $d$. Nuevamente es un vector de probabilidades, que en este caso apunta a modelar la participación parcial de los distintos temas para un mismo documento. El hecho de que se permita que un documento pertenezca a un mixture de tópicos en lugar de uno solo, es una de las principales virtudes de LDA. __Esta es la principal variable sobre la que nos interesa poder hacer inferencia__.
* $z_{dn}$, es la asignación del tópico de la palabra $n$ en el domunento $d$
* $w_{dn}$, es la palabra $n$ del documento $d$. Pertenece al tópico $z_{dn}$

## Dataset

El dataset fue descargado de http://www.cs.cmu.edu/~chongw/data/citeulike/. El mismo contiene información del sitio http://www.citeulike.org/, que le permite a investigadores armar sus bibliotecas de papers y recibir recomendaciones en base a las mismas. En particular yo no uso la información referente a los usuarios y sus bibliotecas, si no que únicamente me centro en la data sobre artículos.
Los dos archivos que uso son:
* mult.dat, que contiene los ids de las palabras más relevantes de cada artículos, y sus conteos. En total son 16980 artículos (documentos).
* vocab.dat, el mapeo de los ids a la palabra específica. El tamaño del vocabulario usado es de 8000 palabras.

La extracción del vocabulario, y demás preprocesamientos de la data son los que se explican en la sección 4 de [Wang and Blei (2011)](http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf).

Adicionalmente, en esta primera iteración del algoritmo, pido que todos los documentos tengan la misma cantidad de palabras. Eso me lleva a realizar un procesamiento extra, en el cual descarto documentos con menos de 40 palabras, quedandome con 14836. A su vez, para estos artículos, solo conservo las primeras 40 palabras, ordenadas de mayor a menor ocurrencia en el documento.

A continuación, importo algunas bibliotecas que voy a usar, y realizo la carga del dataset. Por cuestiones de reproducibilidad, seteo una semilla fija.

In [157]:
from scipy.stats import dirichlet, multinomial, expon
from scipy.special import digamma
from collections import defaultdict
from math import pow
import numpy as np

In [158]:
def load_dataset():
    words_per_doc = defaultdict(list)
    with open('data/mult.dat') as keywords_f:
        words_and_counts_by_doc = []
        for l in keywords_f:
            data = l.split()
            n_words = data[0]
            if int(n_words) < 40:
                continue
            words_and_counts_by_doc.append(data[1:])
        for d, words_and_counts in enumerate(words_and_counts_by_doc):
            for w_c in words_and_counts:
                words_per_doc[d].append(tuple(w_c.split(':')))
            words_per_doc[d] = sorted(words_per_doc[d], key=lambda x: -1 *int(x[1]))[:40]
            words_per_doc[d] = [int(word) for word, _ in words_per_doc[d]]
    
    vocab = {}
    with open('data/vocab.dat') as vocab_f:
        for v, word in enumerate(vocab_f):
            vocab[v] = word.rstrip()
    
    return words_per_doc, vocab
            

In [159]:
words_by_doc, vocab = load_dataset()

## LDA

Primero me defino algunas funciones auxiliares que me van a servir luego.

In [160]:
def dirich_log_expectation(alpha):
    res = None
    if len(alpha.shape) == 1:
        # Case gamma, of shape (K, )
        res = digamma(alpha) - digamma(np.sum(alpha, keepdims=True))
        return np.reshape(res, (res.shape[0], 1))
    else:
        # Case phi, of shape (K, V)
        return digamma(alpha) - digamma(np.sum(alpha, axis=1, keepdims=True))
    
def one_hot_encoding(words, length):
    '''
    Arguments:
        words: list of length N, with the positions of the desired non-zero elements
        length: the length of each encoded vector
    Output:
        A matrix of shape (N, length), where each row is a 1-hot encoding
    '''
    res = np.zeros((len(words), length))
    res[np.arange(res.shape[0]), words] = 1.0
    return res

La clase _LatentDirichletAllocation_ tiene como método principal _fit_, que es el que se encarga de aplicar el algoritmo propuesto por Hoffman et al. (2013). La notación elegida para los parámetros sigue fielmente la usada en la sección 3 de dicho paper. 

La imagen siguiente muestra el pseudocódigo del algoritmo en cuestión.

![title](img/SVI-pseudocode.png)

In [169]:
class LatentDirichletAllocation:
        
        def __init__(self, n_topics, n_documents, words_per_doc, vocab_size, alpha, eta, words_by_doc, tau=128.0, kappa=0.7):
            np.random.seed(0)
            
            self.K = n_topics
            self.D = n_documents
            self.N = words_per_doc
            self.V = vocab_size
            self.alpha = alpha
            self.eta = eta
            self.tau = tau
            self.kappa = kappa
            
            # Initialize variational parameters
            self.lamb = []  # Parameter for words distribution by topic (qBeta ~ Dirichlet(lamb)) 
            exp_parameter = float(self.K * self.V) / (self.D * self.N)
            for k in xrange(self.K):
                self.lamb.append(expon(scale=exp_parameter).rvs(size=self.V) + self.eta) # Suggested in Hoffman et al. (2013)
            self.lamb = np.array(self.lamb)
            self.gamma = np.ones((self.D, self.K)) # Parameter for topics assignments by doc (qTheta ~ Dirichlet(gamma))
            self.phi = np.zeros((self.D, self.N, self.K))
            
            self.words_by_doc = words_by_doc
        
        def fit(self):
            #Inicializo lambda como sugiere el paper, pie de pagina 1328
            docs = self.words_by_doc.keys()
            for t in xrange(1000):
                #sample uniformly a document
                # TODO: use mini-batches
                print 'Iteracion:', t
                d = np.random.choice(docs)
                print 'Documento:', d
                ro = pow((t + self.tau), (-self.kappa))
                print 'Ro:', ro
                self.e_step(d)
                self.m_step(d, ro)
        
        def e_step(self, d):
            i = 0
            self.gamma[d] = np.random.gamma(shape=self.K, scale=1./self.K, size=self.K)
            words = self.words_by_doc[d]
            mean_change = 1.0 # Variable for determine if convergence has been achieved
            # The E-Step doesn't updates lambda parameter.
            expected_log_beta = dirich_log_expectation(self.lamb)[:, words] # shape (K, N)
            print 'expected_log_beta:', np.mean(expected_log_beta), np.std(expected_log_beta)
            print 'lambda:', np.mean(self.lamb)
            while i < 100 and mean_change > 0.001:
                if i >= 1:
                    print "ENTRE DOS VECES"
                prev_gamma = self.gamma[d]
                
                # I am leveraging vectorization over the topics and the words.
                expected_log_theta = dirich_log_expectation(self.gamma[d]) # shape (K, 1)
                
                # Updates phi[d]
                self.phi[d] = np.exp(expected_log_theta + expected_log_beta).T # shape (N, K)
                
                # Updates gamma[d]
                self.gamma[d] = self.alpha + np.sum(self.phi[d], axis=0)
                mean_change = np.mean(abs(self.gamma[d] - prev_gamma))
                i += 1
                
        def m_step(self, d, ro):
            indicator_words = one_hot_encoding(self.words_by_doc[d], self.V)
            intermediate_lambda = self.eta + self.D * np.dot(self.phi[d].T, indicator_words)
            self.lamb = (1 - ro) * self.lamb + ro * intermediate_lambda
            

In [170]:
D = len(words_by_doc)  # number of documents
N = 40 # words per doc
K = 100  # number of topics
V = len(vocab)  # vocabulary size

lda = LatentDirichletAllocation(K, D, N, V, 50.0/K, 0.1, words_by_doc)
lda.fit()

Iteracion: 0
Documento: 10575
Ro: 0.0334929207043
expected_log_beta: -10.5443840538 2.33740537511
lambda: 1.44942516929
Iteracion: 1
Documento: 953
Ro: 0.0333109641294
expected_log_beta: -10.5625723156 2.34246917387
lambda: 1.4042302545
Iteracion: 2
Documento: 5371
Ro: 0.0331313897449
expected_log_beta: -10.606943944 2.3856459843
lambda: 1.36078633903
Iteracion: 3
Documento: 125
Ro: 0.0329541483957
expected_log_beta: -10.598980056 2.37008869089
lambda: 1.31901595469
Iteracion: 4
Documento: 11195
Ro: 0.032779192306
expected_log_beta: -10.6862854527 2.35234908501
lambda: 1.27884554689
Iteracion: 5
Documento: 13945
Ro: 0.0326064750308
expected_log_beta: -10.6572782839 2.3602926709
lambda: 1.24020507068
Iteracion: 6
Documento: 7186
Ro: 0.0324359514088
expected_log_beta: -10.667041601 2.38581043828
lambda: 1.2030281801
Iteracion: 7
Documento: 9717
Ro: 0.0322675775179
expected_log_beta: -10.6679140598 2.39450492084
lambda: 1.16725158527
Iteracion: 8
Documento: 4178
Ro: 0.0321013106322
expect

expected_log_beta: -12.7452186937 2.5499297034
lambda: 0.286017042187
Iteracion: 69
Documento: 6017
Ro: 0.0247670135463
expected_log_beta: -12.7381043545 2.57115161665
lambda: 0.281393762717
Iteracion: 70
Documento: 1239
Ro: 0.0246793869184
expected_log_beta: -12.7745659449 2.48146548299
lambda: 0.276901420472
Iteracion: 71
Documento: 3783
Ro: 0.0245925094256
expected_log_beta: -12.8341832491 2.51706736388
lambda: 0.272535820011
Iteracion: 72
Documento: 3406
Ro: 0.024506370947
expected_log_beta: -12.7966411322 2.48426767179
lambda: 0.268292944867
Iteracion: 73
Documento: 1243
Ro: 0.0244209615481
expected_log_beta: -12.8954295272 2.50359282786
lambda: 0.264168908701
Iteracion: 74
Documento: 1097
Ro: 0.0243362714765
expected_log_beta: -12.9242889667 2.49376571195
lambda: 0.260159953572
Iteracion: 75
Documento: 8408
Ro: 0.0242522911578
expected_log_beta: -12.9350501573 2.53500328928
lambda: 0.25626245617
Iteracion: 76
Documento: 6966
Ro: 0.0241690111911
expected_log_beta: -13.0129693186 2

expected_log_beta: -14.9689542431 1.69343274879
lambda: 0.139984693744
Iteracion: 137
Documento: 3077
Ro: 0.0201246292503
expected_log_beta: -14.9733293931 1.68631600492
lambda: 0.139177896602
Iteracion: 138
Documento: 6733
Ro: 0.0200716397865
expected_log_beta: -15.0534177806 1.66639881847
lambda: 0.13838946884
Iteracion: 139
Documento: 8513
Ro: 0.0200189878988
expected_log_beta: -15.0319112256 1.66891657427
lambda: 0.137618940969
Iteracion: 140
Documento: 730
Ro: 0.0199666701842
expected_log_beta: -15.069698914 1.63401392658
lambda: 0.136865859594
Iteracion: 141
Documento: 8750
Ro: 0.0199146832868
expected_log_beta: -15.0903762399 1.64088714442
lambda: 0.136129781625
Iteracion: 142
Documento: 7186
Ro: 0.0198630238966
expected_log_beta: -15.115877456 1.62455383943
lambda: 0.135410279169
Iteracion: 143
Documento: 10065
Ro: 0.0198116887488
expected_log_beta: -15.1281027485 1.60252972579
lambda: 0.13470693456
Iteracion: 144
Documento: 13976
Ro: 0.019760674623
expected_log_beta: -15.13760

expected_log_beta: -16.3121020777 0.762062824634
lambda: 0.110225966489
Iteracion: 210
Documento: 13169
Ro: 0.0169729882201
expected_log_beta: -16.3501523557 0.752980716676
lambda: 0.110052041798
Iteracion: 211
Documento: 4029
Ro: 0.0169379252243
expected_log_beta: -16.3541054952 0.757592360502
lambda: 0.109881429462
Iteracion: 212
Documento: 11130
Ro: 0.0169030376208
expected_log_beta: -16.3775234575 0.715290640471
lambda: 0.109714059411
Iteracion: 213
Documento: 3234
Ro: 0.0168683240205
expected_log_beta: -16.3705962154 0.712801698232
lambda: 0.109549863095
Iteracion: 214
Documento: 9334
Ro: 0.016833783049
expected_log_beta: -16.3750217074 0.708918130981
lambda: 0.109388773699
Iteracion: 215
Documento: 4526
Ro: 0.0167994133468
expected_log_beta: -16.4200287021 0.67477176203
lambda: 0.10923072587
Iteracion: 216
Documento: 2922
Ro: 0.0167652135691
expected_log_beta: -16.393442464 0.688927171656
lambda: 0.109075655808
Iteracion: 217
Documento: 3936
Ro: 0.0167311823854
expected_log_beta:

Iteracion: 278
Documento: 4053
Ro: 0.0149290363848
expected_log_beta: -16.8177046498 0.305576520404
lambda: 0.103376980586
Iteracion: 279
Documento: 2901
Ro: 0.0149033504363
expected_log_beta: -16.8339057721 0.290964773169
lambda: 0.103326565852
Iteracion: 280
Documento: 4188
Ro: 0.014877771552
expected_log_beta: -16.8283573776 0.300438224307
lambda: 0.103276989197
Iteracion: 281
Documento: 11750
Ro: 0.0148522990248
expected_log_beta: -16.8314507153 0.293964903323
lambda: 0.103228235222
Iteracion: 282
Documento: 12298
Ro: 0.0148269321541
expected_log_beta: -16.8333982698 0.289659182269
lambda: 0.103180288824
Iteracion: 283
Documento: 10306
Ro: 0.0148016702456
expected_log_beta: -16.8373661274 0.282088541844
lambda: 0.103133135217
Iteracion: 284
Documento: 12417
Ro: 0.0147765126111
expected_log_beta: -16.8413495627 0.275321502684
lambda: 0.103086759899
Iteracion: 285
Documento: 11779
Ro: 0.0147514585687
expected_log_beta: -16.8436277993 0.27686666027
lambda: 0.103041148661
Iteracion: 28

expected_log_beta: -16.9930277758 0.121319297542
lambda: 0.101299677859
Iteracion: 346
Documento: 1754
Ro: 0.013395383783
expected_log_beta: -16.9935106408 0.122201330981
lambda: 0.101282242656
Iteracion: 347
Documento: 12303
Ro: 0.0133756369778
expected_log_beta: -16.9928734732 0.123470080462
lambda: 0.10126506676
Iteracion: 348
Documento: 11198
Ro: 0.0133559607192
expected_log_beta: -16.9974679422 0.119730542791
lambda: 0.101248145922
Iteracion: 349
Documento: 11976
Ro: 0.0133363546078
expected_log_beta: -16.9975622617 0.123287645232
lambda: 0.101231475969
Iteracion: 350
Documento: 5306
Ro: 0.0133168182471
expected_log_beta: -16.9980328789 0.117729808044
lambda: 0.101215052805
Iteracion: 351
Documento: 268
Ro: 0.0132973512438
expected_log_beta: -16.9990774791 0.118241623398
lambda: 0.1011988724
Iteracion: 352
Documento: 8236
Ro: 0.0132779532078
expected_log_beta: -17.0034890443 0.112976399643
lambda: 0.101182930805
Iteracion: 353
Documento: 3649
Ro: 0.0132586237517
expected_log_beta:

expected_log_beta: -17.0597057458 0.0516839138227
lambda: 0.100534662375
Iteracion: 415
Documento: 9641
Ro: 0.0121797989433
expected_log_beta: -17.0614170959 0.0507466609
lambda: 0.100528142087
Iteracion: 416
Documento: 11423
Ro: 0.012164122083
expected_log_beta: -17.0598640694 0.0524776313435
lambda: 0.10052170962
Iteracion: 417
Documento: 8596
Ro: 0.0121484941364
expected_log_beta: -17.0617025528 0.0515817120488
lambda: 0.10051536368
Iteracion: 418
Documento: 2203
Ro: 0.0121329148617
expected_log_beta: -17.0597286506 0.0533298123089
lambda: 0.100509102985
Iteracion: 419
Documento: 4719
Ro: 0.0121173840185
expected_log_beta: -17.0623730358 0.0504373098152
lambda: 0.10050292628
Iteracion: 420
Documento: 12980
Ro: 0.0121019013681
expected_log_beta: -17.0641736702 0.0478152997109
lambda: 0.100496832327
Iteracion: 421
Documento: 2703
Ro: 0.0120864666735
expected_log_beta: -17.0634521518 0.0476285341765
lambda: 0.100490819907
Iteracion: 422
Documento: 8757
Ro: 0.0120710796991
expected_log_

expected_log_beta: -17.0859753103 0.0237970585314
lambda: 0.100234782014
Iteracion: 485
Documento: 6197
Ro: 0.0111886494653
expected_log_beta: -17.0864169762 0.0234725837408
lambda: 0.100232152295
Iteracion: 486
Documento: 3378
Ro: 0.0111758905576
expected_log_beta: -17.0869194183 0.0229450608527
lambda: 0.100229555002
Iteracion: 487
Documento: 3786
Ro: 0.011163166927
expected_log_beta: -17.0867041177 0.0236104734035
lambda: 0.1002269897
Iteracion: 488
Documento: 7916
Ro: 0.0111504784189
expected_log_beta: -17.0873218027 0.0223881399453
lambda: 0.100224455955
Iteracion: 489
Documento: 13219
Ro: 0.0111378248795
expected_log_beta: -17.0873644002 0.0229516887902
lambda: 0.10022195334
Iteracion: 490
Documento: 6625
Ro: 0.0111252061561
expected_log_beta: -17.0871437917 0.0234077723986
lambda: 0.100219481441
Iteracion: 491
Documento: 13679
Ro: 0.0111126220966
expected_log_beta: -17.088037705 0.0211326174152
lambda: 0.100217039841
Iteracion: 492
Documento: 364
Ro: 0.0111000725502
expected_log

lambda: 0.100113408343
Iteracion: 552
Documento: 3279
Ro: 0.0104050401677
expected_log_beta: -17.0973288076 0.011523131152
lambda: 0.100112227271
Iteracion: 553
Documento: 9145
Ro: 0.0103943424681
expected_log_beta: -17.0976534367 0.0108210055836
lambda: 0.100111059705
Iteracion: 554
Documento: 4601
Ro: 0.0103836714402
expected_log_beta: -17.0978413435 0.0112051267319
lambda: 0.100109905475
Iteracion: 555
Documento: 8571
Ro: 0.0103730269785
expected_log_beta: -17.0980057297 0.010792499648
lambda: 0.100108764417
Iteracion: 556
Documento: 10193
Ro: 0.0103624089781
expected_log_beta: -17.0981376214 0.0106485812546
lambda: 0.100107636364
Iteracion: 557
Documento: 12429
Ro: 0.0103518173348
expected_log_beta: -17.0982897801 0.0107577113286
lambda: 0.100106521153
Iteracion: 558
Documento: 7246
Ro: 0.0103412519447
expected_log_beta: -17.0980374684 0.0110706082799
lambda: 0.100105418629
Iteracion: 559
Documento: 13333
Ro: 0.0103307127047
expected_log_beta: -17.0982203041 0.0104671841523
lambda:

Iteracion: 622
Documento: 10175
Ro: 0.00971532012989
expected_log_beta: -17.102603828 0.00555951782538
lambda: 0.100055324619
Iteracion: 623
Documento: 9991
Ro: 0.0097062627621
expected_log_beta: -17.1027160543 0.00568103506366
lambda: 0.100054787274
Iteracion: 624
Documento: 13211
Ro: 0.00969722587383
expected_log_beta: -17.1027207615 0.00554130056988
lambda: 0.100054255647
Iteracion: 625
Documento: 14634
Ro: 0.00968820939164
expected_log_beta: -17.1028328241 0.00556352665633
lambda: 0.100053729669
Iteracion: 626
Documento: 1632
Ro: 0.00967921324243
expected_log_beta: -17.1029009109 0.005313802855
lambda: 0.100053209277
Iteracion: 627
Documento: 7179
Ro: 0.00967023735348
expected_log_beta: -17.1029888613 0.00528092632021
lambda: 0.100052694405
Iteracion: 628
Documento: 7706
Ro: 0.00966128165241
expected_log_beta: -17.1029517501 0.00528547906165
lambda: 0.10005218499
Iteracion: 629
Documento: 11967
Ro: 0.00965234606721
expected_log_beta: -17.1030005104 0.00536975287717
lambda: 0.100051

expected_log_beta: -17.105179098 0.00282745210212
lambda: 0.100028797067
Iteracion: 692
Documento: 12816
Ro: 0.0091270473443
expected_log_beta: -17.1051170459 0.00296129269437
lambda: 0.100028534153
Iteracion: 693
Documento: 14069
Ro: 0.00911926403
expected_log_beta: -17.1051320187 0.00289673758917
lambda: 0.100028273864
Iteracion: 694
Documento: 8332
Ro: 0.00911149681552
expected_log_beta: -17.1051508156 0.00288557483802
lambda: 0.100028016169
Iteracion: 695
Documento: 159
Ro: 0.00910374564803
expected_log_beta: -17.105184428 0.00281151556811
lambda: 0.100027761043
Iteracion: 696
Documento: 12989
Ro: 0.00909601047493
expected_log_beta: -17.105195289 0.00281602350265
lambda: 0.100027508454
Iteracion: 697
Documento: 13859
Ro: 0.00908829124389
expected_log_beta: -17.1053236472 0.00264572714403
lambda: 0.100027258379
Iteracion: 698
Documento: 4804
Ro: 0.00908058790276
expected_log_beta: -17.1052930006 0.00270919925293
lambda: 0.100027010788
Iteracion: 699
Documento: 12768
Ro: 0.0090729003

Iteracion: 760
Documento: 5217
Ro: 0.00863198990861
expected_log_beta: -17.1063292186 0.00155771349907
lambda: 0.100015567515
Iteracion: 761
Documento: 14757
Ro: 0.00862519191878
expected_log_beta: -17.106303961 0.0015637131631
lambda: 0.100015433272
Iteracion: 762
Documento: 2163
Ro: 0.00861840691606
expected_log_beta: -17.1063269939 0.00155756651282
lambda: 0.100015300292
Iteracion: 763
Documento: 10001
Ro: 0.0086116348611
expected_log_beta: -17.1063327736 0.00154642818146
lambda: 0.100015168563
Iteracion: 764
Documento: 8346
Ro: 0.00860487571469
expected_log_beta: -17.1063703944 0.00148765285264
lambda: 0.100015038071
Iteracion: 765
Documento: 5900
Ro: 0.0085981294378
expected_log_beta: -17.1063585232 0.00153957493224
lambda: 0.100014908804
Iteracion: 766
Documento: 444
Ro: 0.00859139599155
expected_log_beta: -17.1063670571 0.00151628390199
lambda: 0.10001478075
Iteracion: 767
Documento: 6894
Ro: 0.00858467533724
expected_log_beta: -17.106361681 0.0014658591172
lambda: 0.10001465389

expected_log_beta: -17.1069143513 0.000903144439738
lambda: 0.100008990068
Iteracion: 826
Documento: 1863
Ro: 0.00820948923488
expected_log_beta: -17.1069163215 0.000898103893134
lambda: 0.100008916337
Iteracion: 827
Documento: 13288
Ro: 0.00820347086264
expected_log_beta: -17.1069000541 0.00090622967533
lambda: 0.100008843267
Iteracion: 828
Documento: 12842
Ro: 0.00819746319421
expected_log_beta: -17.1069220721 0.000896038978276
lambda: 0.100008770849
Iteracion: 829
Documento: 8596
Ro: 0.00819146619938
expected_log_beta: -17.1069464351 0.000886930486407
lambda: 0.100008699078
Iteracion: 830
Documento: 2019
Ro: 0.00818547984806
expected_log_beta: -17.1069443248 0.000861703673427
lambda: 0.100008627947
Iteracion: 831
Documento: 4927
Ro: 0.0081795041103
expected_log_beta: -17.1069434971 0.000870641504026
lambda: 0.100008557451
Iteracion: 832
Documento: 1266
Ro: 0.00817353895622
expected_log_beta: -17.1069406866 0.000887212614041
lambda: 0.100008487582
Iteracion: 833
Documento: 2001
Ro: 0

expected_log_beta: -17.107242455 0.000547422019666
lambda: 0.100005329672
Iteracion: 891
Documento: 9019
Ro: 0.00783931419749
expected_log_beta: -17.107248508 0.000540159418703
lambda: 0.100005287984
Iteracion: 892
Documento: 5325
Ro: 0.00783393348448
expected_log_beta: -17.107258892 0.000520778613881
lambda: 0.100005246652
Iteracion: 893
Documento: 8137
Ro: 0.00782856173186
expected_log_beta: -17.1072530539 0.000538991737213
lambda: 0.100005205673
Iteracion: 894
Documento: 12525
Ro: 0.00782319891595
expected_log_beta: -17.1072652998 0.000509137576465
lambda: 0.100005165042
Iteracion: 895
Documento: 4161
Ro: 0.00781784501317
expected_log_beta: -17.1072497654 0.00053310359648
lambda: 0.100005124756
Iteracion: 896
Documento: 5129
Ro: 0.0078125
expected_log_beta: -17.1072604096 0.000517355582056
lambda: 0.100005084813
Iteracion: 897
Documento: 8999
Ro: 0.00780716385303
expected_log_beta: -17.1072720237 0.000519512315585
lambda: 0.100005045209
Iteracion: 898
Documento: 1301
Ro: 0.007801836

expected_log_beta: -17.1074258164 0.000322747256843
lambda: 0.10000321134
Iteracion: 957
Documento: 6096
Ro: 0.00750238217017
expected_log_beta: -17.1074385449 0.000322070167385
lambda: 0.100003187348
Iteracion: 958
Documento: 12113
Ro: 0.00749754571236
expected_log_beta: -17.107433311 0.000331712025554
lambda: 0.100003163552
Iteracion: 959
Documento: 4660
Ro: 0.00749271681952
expected_log_beta: -17.1074451452 0.000320929188317
lambda: 0.10000313995
Iteracion: 960
Documento: 2648
Ro: 0.00748789547286
expected_log_beta: -17.1074490539 0.000313054524836
lambda: 0.100003116542
Iteracion: 961
Documento: 9027
Ro: 0.00748308165368
expected_log_beta: -17.1074505809 0.000308108632117
lambda: 0.100003093324
Iteracion: 962
Documento: 7848
Ro: 0.00747827534332
expected_log_beta: -17.1074542082 0.000304633617862
lambda: 0.100003070293
Iteracion: 963
Documento: 3066
Ro: 0.0074734765232
expected_log_beta: -17.107443245 0.000308434937531
lambda: 0.100003047448
Iteracion: 964
Documento: 228
Ro: 0.0074

In [171]:
lda.gamma[444]

array([ 0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.5       ,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50000001,
        0.50000001,  0.50000001,  0.50000001,  0.50000001,  0.50

In [57]:
1*np.random.gamma(100., 1./100., (20, 5))

array([[ 1.0798604 ,  0.97373394,  0.88739621,  1.03440563,  0.98806988],
       [ 1.13795157,  1.06429252,  0.80639188,  1.06942342,  0.91533107],
       [ 0.91871202,  0.90605241,  1.14016092,  1.00765478,  0.87824069],
       [ 1.10465927,  0.97034079,  0.93218111,  0.91484871,  0.94797843],
       [ 0.95292439,  0.96887915,  0.93923932,  1.19762181,  0.96959022],
       [ 1.18304523,  0.99720983,  1.14736179,  1.13641531,  0.83999419],
       [ 1.00538492,  1.03706921,  1.02907629,  0.85584699,  1.00059327],
       [ 0.97266799,  1.00689861,  0.87516129,  0.86166215,  0.99133922],
       [ 0.98128828,  0.88970856,  0.96094688,  0.98216586,  1.10912599],
       [ 1.11726342,  1.08499997,  0.99982186,  0.84480264,  0.87377502],
       [ 1.05386468,  1.15673207,  0.8894465 ,  0.9651385 ,  0.8962132 ],
       [ 0.81650642,  0.87782913,  1.10108531,  0.85179599,  1.00440288],
       [ 1.01389473,  0.98372603,  0.8860026 ,  1.07659528,  0.92636043],
       [ 0.97371583,  1.075801  ,  1.0

In [48]:
l = np.array([[1,2,3], [4,5,6]])
np.sum(l, axis=0)

array([5, 7, 9])

In [81]:
rand_state = np.random.RandomState(888)
rand_state.randn(10)

array([-0.17620087,  0.18887636,  0.82674718, -0.03244731, -0.65249942,
       -0.10533938,  0.21777612,  0.5872815 ,  0.10023789, -1.09994668])

In [82]:
rand_state.randn(10)

array([-0.25530539,  0.40530438,  0.16266395,  1.04163462,  0.22432418,
        0.69930445, -0.86554351, -1.39346831, -0.23668791, -0.75704191])