### LDA links

* [LDA intro (two-layer hierarchical Bayesian model)](http://www.slideshare.net/WayneLee9/lda-oct3-2013)
* [scikit-learn example](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py)
* [Mechanism, Gibbs sampling, and MCMC basics](https://www.quora.com/Does-Topic-Modeling-need-a-training-stage-when-using-Gibbs-sampling-And-why-does-it-work/answer/Ivan-Savov)
* [EM optimization for LDA](http://obphio.us/pdfs/lda_tutorial.pdf)
* [Dirichlet Distribution](http://www.slideshare.net/g33ktalk/machine-learning-meetup-12182013)

In [1]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

In [2]:
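n_samples = 2000    # number of documents to use
n_features = 1000   # vocabulary size (most frequent terms kept)
n_topics = 10       # number of topics to fit
n_top_words = 20    # words to print per topic


def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted words of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort() is ascending; the reversed slice yields the top indices.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))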
    print()
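
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` picks the indices of the largest weights in descending order. A minimal sketch shows the idiom in isolation. The weight vector and vocabulary are made up for illustration and are not taken from the fitted model:

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only.
weights = np.array([0.1, 0.7, 0.2, 0.5])
words = ["apple", "banana", "cherry", "date"]
k = 2

# argsort() orders indices by ascending weight; the slice [:-k - 1:-1]
# walks backwards from the end and keeps the last k indices,
# i.e. the k largest weights, largest first.
top_idx = weights.argsort()[:-k - 1:-1]
top_words = [words[i] for i in top_idx]
print(top_words)
```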


In [15]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

# tf-idf features (used for NMF in the scikit-learn example; not used by LDA below)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(data_samples)
# Raw term counts, which are the features LDA is fitted on
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

In [19]:
print(data_samples[0])

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



In [34]:
print(tf.shape)
print(tf[0])

(2000, 1000)
  (0, 708)	1
  (0, 410)	1
  (0, 493)	1
  (0, 548)	1
  (0, 130)	1
  (0, 567)	1
  (0, 412)	1
  (0, 750)	1
  (0, 841)	1
  (0, 206)	1
  (0, 764)	1
  (0, 748)	1
  (0, 904)	1
  (0, 923)	1
  (0, 527)	1
  (0, 432)	1
  (0, 988)	1
  (0, 488)	2
  (0, 717)	1
  (0, 587)	4
  (0, 862)	1
  (0, 286)	1
  (0, 867)	1
  (0, 881)	1
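
Each printed entry has the form `(row, column)  count`: the document index, the column index of a vocabulary term, and how often that term occurs in the document. The sketch below mimics this layout in plain Python on a hypothetical two-document corpus. It is a simplified stand-in for what `CountVectorizer` produces, not its actual implementation (no stop words, no `min_df`/`max_df` filtering):

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = sorted({tok for doc in docs for tok in doc.split()})
col = {tok: j for j, tok in enumerate(vocab)}


def sparse_row(doc_idx):
    """Return [((row, column), count), ...] for one document."""
    counts = Counter(docs[doc_idx].split())
    return sorted(((doc_idx, col[tok]), n) for tok, n in counts.items())


row0 = sparse_row(0)
for (i, j), n in row0:
    # Map the column index back to its term to make the entry readable.
    print((i, j), n, vocab[j])
```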


In [26]:
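# Note: scikit-learn 0.19 renamed n_topics to n_components; newer releases
# no longer accept n_topics.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,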
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [27]:
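# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names()
# was removed in 1.2.
tf_feature_names = tf_vectorizer.get_feature_names()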
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0:
edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1:
don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2:
christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3:
drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4:
hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5:
god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6:
55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7:
car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8:
people said did just didn know ti

### The LDA model has:

10 topics, a 1000-word vocabulary, and 2000 training documents

In [33]:
print(lda.components_.shape)
print(lda.components_[0][:100])

(10, 1000)
[  4.96604155   4.3537397   21.42539886   9.34656164  11.52305328
  13.8861695   40.27406664   9.84737644  16.13188767  11.95301712
   9.24725729   9.36810397   5.10001764  13.64836908   3.95895035
   2.20351083  47.88483728  21.1535831    0.27293177  14.54287686
   5.5530001    5.75956702  12.34425041   2.57049698   6.16460343
   4.31762237   4.12938934   0.76545744   3.72577593   0.43356217
   5.92791121   3.36509377   5.04403287   2.1290783    4.44416027
   1.44482564   3.01441175   3.15958896   0.18858835   3.28519136
  38.28776415   2.79687305   1.56690784   3.12518011   0.80187587
   4.00337725   3.47845026   5.90786908   9.00886246   1.95526998
   0.93435688   0.2632052    4.41326525   4.37331326   1.36104572
   1.95775214   2.65166035   4.3222133   18.14299357   1.72724557
   3.57822658   5.45766768  41.12427674   3.60894078   1.39120378
  38.54250212   0.13625848  10.37449816   0.18997826   0.1382722
   0.13123498   0.281217     0.91687958  10.21878193   6.09635839
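
Each row of `components_` holds unnormalized weights for one topic, which the scikit-learn documentation describes as pseudo-counts of word-to-topic assignments. Dividing each row by its sum turns it into a probability distribution over the vocabulary. A minimal sketch, with a small hypothetical matrix standing in for `lda.components_`:

```python
import numpy as np

# Hypothetical 2-topic x 4-word matrix standing in for lda.components_.
components = np.array([[4.0, 1.0, 3.0, 2.0],
                       [0.5, 0.5, 8.0, 1.0]])

# Normalize each row so it sums to 1 (one word distribution per topic).
topic_word = components / components.sum(axis=1)[:, np.newaxis]
print(topic_word)
```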


In [39]:
print(lda.transform(tf)[0])

[ 0.00344893  0.6285982   0.00344908  0.00344877  0.00344865  0.34381098
  0.00344842  0.00344869  0.00344944  0.00344884]
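
The vector returned by `lda.transform(tf)[0]` is the first document's topic mixture, and its entries sum to 1. A short sketch, with the values printed above copied in by hand, ranks the topics by weight:

```python
# Topic weights for document 0, copied from the output above.
doc_topic = [0.00344893, 0.6285982, 0.00344908, 0.00344877, 0.00344865,
             0.34381098, 0.00344842, 0.00344869, 0.00344944, 0.00344884]

total = sum(doc_topic)

# Sort topic indices by descending weight and keep the two largest.
ranked = sorted(range(len(doc_topic)), key=lambda k: doc_topic[k], reverse=True)
top_two = ranked[:2]

print("sum = %.4f" % total)
print("dominant topics:", top_two)
```

The document loads mainly on Topic #1 and Topic #5. Topic #1's top words are general conversational terms, and Topic #5's include "god", "people", and "israel", which fits the sample post printed earlier.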
