In [1]:
%matplotlib inline


# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


This example applies Non-negative Matrix Factorization (NMF) and
Latent Dirichlet Allocation (LDA) to a corpus of documents to
extract additive models of the corpus's topic structure.
The output is a list of topics, each represented as a list of its top terms
(weights are not shown).

With the default parameters (n_samples, n_features, n_topics), the
example should run in a few tens of seconds. You can increase the
dimensions of the problem, but be aware that the time complexity of
NMF is polynomial. For LDA, the time complexity is
proportional to (n_samples * iterations).
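
To make "additive model" concrete: NMF approximates a non-negative document-term matrix V by a product W H of two non-negative matrices, so every document is a non-negative mix of topics. The sketch below uses the classic multiplicative update rules in plain NumPy. It is an illustration only and is not claimed to be the solver that scikit-learn's `NMF` uses; the function name `nmf_multiplicative` and the toy matrix are made up for this example.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components)) + eps
    H = rng.random((n_components, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates only rescale entries by non-negative
        # factors, so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy document-term matrix: rows are documents, columns are terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print("reconstruction error: %.3f" % error)
```

Each row of `H` plays the role of one entry of `components_` in the models fitted later: a vector of non-negative term weights for one topic.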


In [2]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 10000
n_features = 1000
n_topics = 19
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
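
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. The small example below, with made-up feature names and weights, shows that it selects the indices of the `n_top_words` largest weights, largest first.

```python
import numpy as np

feature_names = ["apple", "bus", "cable", "disk", "email"]
topic = np.array([0.1, 0.7, 0.05, 0.9, 0.3])  # made-up term weights
n_top_words = 3

# argsort() returns indices in ascending order of weight. The slice
# [:-n_top_words - 1:-1] walks backwards from the last index and stops
# after n_top_words items, so the largest weights come first.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(" ".join(top_words))  # prints "disk bus email"
```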

In [3]:
print("Loading dataset...")
t0 = time()
categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball',
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
              'sci.med', 'sci.space', 'soc.religion.christian',
              'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             categories=categories,
                             remove=('headers', 'footers', 'quotes'))
print(len(dataset.filenames))
print(dataset.target_names)
print(len(dataset.data))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))
len(data_samples)
data_samples[:2]

Loading dataset...
10723
['alt.atheism', 'comp.graphics', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
10723
done in 2.300s.


['\nThere was a recession, and none of the potential entrants could raise any\nmoney.  The race organizers were actually supposed to be handling part of\nthe fundraising, but the less said about that the better.',
 "\nI only have one comment on this:  You call this a *classic* playoff year\nand yet you don't include a Chicago-Detroit series.  C'mon, I'm a Boston\nfan and I even realize that Chicago-Detroit games are THE most exciting\ngames to watch."]

In [4]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 2.169s.
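
As a rough illustration of what tf-idf weighting does to raw counts, the pure-Python sketch below applies a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document. This is assumed to match the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`); the vocabulary and counts are made up.

```python
import math

# Made-up term counts for three tiny documents over a fixed vocabulary.
vocabulary = ["disk", "drive", "game"]
counts = [
    [2, 1, 0],
    [0, 0, 3],
    [1, 0, 1],
]
n_docs = len(counts)

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0)
            for j in range(len(vocabulary))]

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    # L2-normalise each document so that long documents do not dominate.
    tfidf.append([x / norm if norm else 0.0 for x in weighted])

for row in tfidf:
    print(["%.3f" % x for x in row])
```

Here "drive" appears in only one document, so it receives the largest idf of the three terms.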


In [5]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 1.820s.
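
Both vectorizers above pass `max_df=0.95` and `min_df=2`. The pure-Python sketch below mimics that document-frequency filter on a made-up corpus. It assumes the default token pattern (words of two or more characters) and leaves out stop-word removal and the `max_features` cap for brevity.

```python
import re
from collections import Counter

docs = [
    "the disk drive failed",
    "the hockey game was good",
    "the disk controller and the drive cable",
    "the team won the game",
]
min_df = 2      # integer: keep terms found in at least 2 documents
max_df = 0.95   # float: drop terms found in more than 95% of documents

# Assumed default token pattern: words of two or more characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

doc_freq = Counter()
for doc in docs:
    # Use a set so that each document counts a term at most once.
    doc_freq.update(set(token_pattern.findall(doc.lower())))

max_doc_count = max_df * len(docs)
vocabulary = sorted(term for term, df in doc_freq.items()
                    if min_df <= df <= max_doc_count)
print(vocabulary)  # prints ['disk', 'drive', 'game']
```

The word "the" occurs in every document and is dropped by `max_df`, while words that occur in a single document are dropped by `min_df`.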


In [6]:
# Fit the NMF model
print("Fitting the NMF model with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))

lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting the NMF model with tf-idf features, n_samples=10000 and n_features=1000...
done in 1.857s.

Topics in NMF model:
Topic #0:
people time right did good say make said way government really point going years want things believe course long didn
Topic #1:
thanks advance mail hi looking info address email information appreciated post help anybody send interested list reply tell need good
Topic #2:
god jesus bible christ faith believe christian christians sin church lord hell life truth man belief love christianity say son
Topic #3:
game team year games season players play hockey win league player teams nhl baseball good runs toronto best hit better
Topic #4:
new 00 sale price 10 offer shipping condition 20 50 interested asking 15 12 email 11 sell 25 30 excellent
Topic #5:
drive scsi disk drives hard ide controller floppy cd bus internal cable mac tape rom mb power apple ram card
Topic #6:
key chip encryption clipper keys escrow government public algorithm nsa security phone secure ch

In [7]:
data_samples[:5]

['\nThere was a recession, and none of the potential entrants could raise any\nmoney.  The race organizers were actually supposed to be handling part of\nthe fundraising, but the less said about that the better.',
 "\nI only have one comment on this:  You call this a *classic* playoff year\nand yet you don't include a Chicago-Detroit series.  C'mon, I'm a Boston\nfan and I even realize that Chicago-Detroit games are THE most exciting\ngames to watch.",
 '\n\nI\'m not quite sure how these numbers are generated.  It appears that in\na neutral park Bo\'s HR and slugging tend to drop (he actually loses two\nhome runs).  Or do they?  What is "equivalent average?"\n\nOne thing, when looking at Bo\'s stats, is that you can see that KC took\naway some homers.  Normally, you expect some would-be homers to go for\ndoubles or triples in big parks, or to be caught, and for that matter you\nexpect lots of doubles and triples anyway.  But Bo, despite his speed, \nhit very few doubles and not that ma