In [44]:
%matplotlib inline


# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


This is an example of applying :class:`sklearn.decomposition.NMF` and
:class:`sklearn.decomposition.LatentDirichletAllocation` to a corpus
of documents to extract additive models of the topic structure of the
corpus.  The output is a list of topics, each represented as a list of
terms (weights are not shown).

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.
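
As a rough illustration of the two objectives, the sketch below evaluates both losses for a tiny factorization in plain Python. The matrices `X_exact`, `X_noisy`, `W`, `H` and the helper names are made up for this example; the formulas are the standard textbook definitions (half the squared Frobenius norm, and the generalized Kullback-Leibler divergence), not code taken from scikit-learn.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_loss(X, WH):
    """0.5 * ||X - WH||_F^2"""
    return 0.5 * sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def generalized_kl(X, WH):
    """sum(X * log(X / WH) - X + WH), treating 0 * log(0) as 0."""
    total = 0.0
    for rx, ry in zip(X, WH):
        for x, y in zip(rx, ry):
            if x > 0:
                total += x * math.log(x / y)
            total += y - x
    return total


# Hypothetical toy factors: 3 documents, 2 topics, 3 terms.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
H = [[1.0, 2.0, 0.5], [0.5, 1.0, 1.0]]
X_exact = matmul(W, H)                                   # perfectly reconstructed
X_noisy = [[v + 0.1 for v in row] for row in X_exact]    # slightly perturbed data

print(frobenius_loss(X_exact, X_exact), generalized_kl(X_exact, X_exact))
print(frobenius_loss(X_noisy, X_exact), generalized_kl(X_noisy, X_exact))
```

Both losses are zero for a perfect reconstruction and grow as `X` drifts away from `WH`, but they penalize the deviation differently, which is one reason the two NMF fits later in the notebook yield different topics.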

The default parameters (n_samples / n_features / n_components) should make
the example runnable in a few tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity of NMF is polynomial. For LDA, the time complexity is
proportional to (n_samples * iterations).


In [45]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000  # number of input documents to use
n_features = 1000  # value passed to the vectorizers as max_features
n_components = 10  # number of topics (word groups) to extract
n_top_words = 20  # number of top words to show per component


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
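
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` walks the ascending argsort result backwards from the end, so it yields the indices of the `n_top_words` largest weights, largest first. A plain-Python equivalent on a toy topic (the weights, the vocabulary and the helper name are invented for illustration):

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # sorted(...) plays the role of argsort(): indices in ascending weight order
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # same slice as in print_top_words: read the last n_top entries backwards
    return ascending[:-n_top - 1:-1]


toy_topic = [0.1, 0.7, 0.0, 0.4, 0.9]
toy_vocab = ["car", "god", "file", "windows", "space"]

top = top_indices(toy_topic, 3)
print(top)
print(" ".join(toy_vocab[i] for i in top))
```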

In [46]:
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
data_samples = data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 1.737s.


In [47]:
print(len(data))
print(_.shape)
print(data[:2])

11314
(11314,)
["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n", "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need

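* TF-IDF (Term Frequency - Inverse Document Frequency)

** CountVectorizer, TfidfVectorizer (raw counts vs. TF-IDF-weighted floats)
    * parameters
        max_df
        Maximum document frequency for a term to be kept in the vocabulary; an integer, or a float in [0.0, 1.0].
        e.g. 0.95: terms that appear in more than 95% of the documents are dropped from the vocabulary
        e.g. 100: terms that appear in more than 100 documents are dropped from the vocabulary
        max_df can also be used as a way to remove corpus-specific stop words.
        min_df
        Minimum document frequency for a term to be kept in the vocabulary; an integer, or a float in [0.0, 1.0].
        token_pattern
        Regular expression that defines what counts as a token
        max_features
        Keep only the top max_features terms, ordered by term frequency across the corpus
        tokenizer
        Function that produces the tokens
    * methods
        fit_transform
        Learn the vocabulary and build the document-term matrix (sparse matrix)
        get_feature_names
        array mapping (feature integer indices → feature name)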
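
To make `max_df`, `min_df` and `max_features` concrete, here is a simplified plain-Python sketch of the vocabulary pruning they describe. It is not scikit-learn's implementation: the whitespace tokenizer, the toy documents, the alphabetical tie-breaking and the helper name `build_vocabulary` are assumptions made for this example, and only a float `max_df` and an integer `min_df` are handled.

```python
from collections import Counter


def build_vocabulary(docs, max_df=1.0, min_df=1, max_features=None):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # term frequency: how often does each term occur in the whole corpus?
    term_freq = Counter(term for tokens in tokenized for term in tokens)

    max_count = max_df * len(docs)
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_count]

    # max_features keeps the most frequent surviving terms (ties: alphabetical)
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


docs = [
    "the car engine is loud",
    "the car has new tires",
    "the file format is old",
    "the windows driver file",
]

print(build_vocabulary(docs, max_df=0.95, min_df=2))
print(build_vocabulary(docs, max_df=0.95, min_df=2, max_features=2))
```

In this toy corpus "the" occurs in every document and is removed by `max_df=0.95`, while words seen in a single document are removed by `min_df=2`.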

In [48]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

Extracting tf-idf features for NMF...


In [51]:
tfidf_vectorizer

TfidfVectorizer(max_df=0.95, max_features=1000, min_df=2, stop_words='english')


In [49]:
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

done in 0.427s.
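
For intuition about the values `fit_transform` produces, the sketch below computes TF-IDF weights by hand for a toy corpus. It assumes the formula documented for the `TfidfVectorizer` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The tokenizer, documents and helper name are invented, and the real output also depends on stop-word removal and vocabulary pruning.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # smoothed inverse document frequency (assumed default behaviour)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


vocab, rows = tfidf_rows(["car engine car", "car tires", "file format"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

A term that is frequent inside one document but rare across the corpus receives a high weight, and every non-empty row has unit length.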


In [72]:
print(tfidf.shape)
print(tfidf[0].shape)
print(tfidf[:3])  # note that this is a sparse matrix
print(tfidf[:3].toarray())

(2000, 1000)
(1, 1000)
  (0, 709)	0.12621877625178227
  (0, 411)	0.11650651629173196
  (0, 495)	0.1631127602376565
  (0, 550)	0.11873384536901997
  (0, 134)	0.13595955391213657
  (0, 568)	0.13595955391213657
  (0, 413)	0.12831668397369733
  (0, 751)	0.15376128408643466
  (0, 841)	0.18564440175793037
  (0, 210)	0.15810189392327795
  (0, 765)	0.1640284908630232
  (0, 749)	0.13595955391213657
  (0, 904)	0.08983671288492111
  (0, 923)	0.11966934266418663
  (0, 529)	0.1690393571774018
  (0, 433)	0.13369075280946802
  (0, 988)	0.12740095334833063
  (0, 490)	0.3750048191807266
  (0, 718)	0.17767638066823058
  (0, 588)	0.6454209423982519
  (0, 862)	0.1551447391479567
  (0, 289)	0.11115911128919416
  (0, 867)	0.15810189392327795
  (0, 881)	0.11227372176926384
  (1, 382)	0.20157910011124136
  :	:
  (2, 769)	0.1419319840924351
  (2, 190)	0.15392290999271482
  (2, 872)	0.15549779067451702
  (2, 418)	0.12713107605026056
  (2, 125)	0.15392290999271482
  (2, 140)	0.1607584992974209
  (2, 136)	0.12680
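
The printed `(row, column)  value` pairs list only the non-zero entries of the matrix. The sketch below mimics that coordinate-style view for a small dense matrix in plain Python; it illustrates the idea only and says nothing about how SciPy stores the data internally. The helper name and the toy matrix are hypothetical.

```python
def nonzero_entries(dense):
    """Return (row, col, value) triples for the non-zero cells of a dense matrix."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]


dense = [
    [0.0, 0.6, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

entries = nonzero_entries(dense)
for i, j, v in entries:
    print(f"  ({i}, {j})\t{v}")

# fraction of cells that actually hold a value
density = len(entries) / (len(dense) * len(dense[0]))
print(f"density: {density:.2f}")
```

Calling `.toarray()`, as in the cell above, goes the other way and materializes all the zeros, which is fine for three rows but wasteful for a large corpus.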

In [52]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

Extracting tf features for LDA...
done in 0.327s.



In [65]:
print(tf.shape)
print(tf[0].toarray())
print(tf[0])

(2000, 1000)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0

** NMF (Non-Negative Matrix Factorization)
    * parameters
        n_components
            number of topics (components) to extract
        alpha (default: 0)
            regularization term
        l1_ratio
            determines the mix between l1 and l2 regularization
            e.g. 0: only l2 penalty (aka. Frobenius Norm)
            e.g. 1: only l1 penalty
            e.g. 0<l1_ratio<1: combination of l1, l2
        beta_loss
            frobenius (default)
            kullback-leibler
            itakura-saito
            (the non-Frobenius losses require solver='mu')
        solver
            cd (default)
                Coordinate Descent
            mu
                Multiplicative Update
    * methods
        fit
            fit the NMF model
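
How `alpha` and `l1_ratio` combine can be sketched as an elastic-net style penalty on a factor matrix. The formula below follows the description of the `alpha` / `l1_ratio` parameters used in this notebook; treat it as an assumption, since newer scikit-learn releases split `alpha` into `alpha_W` / `alpha_H` and scale the terms differently. The helper name and the toy matrix are hypothetical.

```python
def elastic_net_penalty(M, alpha, l1_ratio):
    """alpha * l1_ratio * ||M||_1 + 0.5 * alpha * (1 - l1_ratio) * ||M||_F^2"""
    l1 = sum(abs(v) for row in M for v in row)
    l2_squared = sum(v * v for row in M for v in row)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


# Toy factor matrix; in NMF the same kind of penalty is applied to W and H.
W = [[1.0, 0.0], [0.5, 2.0]]

for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(W, alpha=0.1, l1_ratio=ratio))
```

With `l1_ratio=0` only the squared l2 term remains, with `l1_ratio=1` only the l1 term, and values in between blend the two, as in the `alpha=.1, l1_ratio=.5` setting used for the fits in this notebook.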

In [35]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.365s.

Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know good time make way really say ve right want did ll new use years
Topic #1: windows use dos using window program os application drivers help software pc running ms screen files version work code mode
Topic #2: god jesus bible faith christian christ christians does sin heaven believe lord life mary church atheism love belief human religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information list send video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil speed power good 000 brake year models used bought
Topic #5: edu soon send com university internet mit ftp mail cc pub article information hope email mac home program blood contact
Topic #6: file files problem format win sound ftp pub read save sit

In [40]:
print(nmf.components_.shape)  # (n_components, n_features)
nmf.components_[0]            # term weights of topic #0
# len(tfidf_feature_names)

(10, 1000)


array([7.00614483e-03, 6.91592895e-02, 1.54308855e-01, 7.42408389e-02,
       5.91235361e-02, 7.76759141e-02, 1.14782362e-03, 4.33649588e-02,
       9.16615747e-04, 6.52796719e-02, 7.46070246e-02, 6.90679233e-02,
       3.78869213e-02, 3.19697170e-02, 1.31875795e-02, 1.61358319e-02,
       3.27155361e-02, 7.09497792e-02, 2.99967821e-02, 2.49985638e-02,
       2.45385808e-02, 2.26007334e-02, 2.19066862e-02, 5.01919710e-02,
       2.54627335e-02, 1.59629887e-02, 1.15323129e-02, 6.99365007e-03,
       4.06008367e-03, 7.07284755e-03, 5.23199452e-02, 1.66495040e-02,
       7.13104294e-03, 2.18357396e-02, 7.47064564e-03, 0.00000000e+00,
       1.11282696e-02, 0.00000000e+00, 0.00000000e+00, 1.82772938e-03,
       4.52868840e-04, 0.00000000e+00, 4.67417530e-02, 1.59230400e-03,
       0.00000000e+00, 0.00000000e+00, 8.07804997e-03, 0.00000000e+00,
       0.00000000e+00, 6.03722537e-02, 4.25036723e-02, 1.83069887e-03,
       1.00417517e-02, 0.00000000e+00, 2.20486719e-02, 1.33125230e-03,
      

In [73]:
# Fit the NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.930s.
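
The `mu` solver is based on multiplicative updates. As a rough sketch of the idea only, the code below runs the classic Lee and Seung multiplicative updates for the Frobenius loss on a tiny matrix in plain Python; the Kullback-Leibler variant follows the same pattern with different update ratios. The matrix, the rank, the iteration count and all helper names are assumptions for this example, and scikit-learn's solver additionally handles regularization, initialization and convergence checks.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_loss(X, W, H):
    WH = matmul(W, H)
    return 0.5 * sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factorize X ~ W @ H with non-negative W, H via multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # strictly positive random start, so the updates never get stuck at zero
    W = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_components)]
    history = [frobenius_loss(X, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        history.append(frobenius_loss(X, W, H))
    return W, H, history


# Toy "document-term" matrix with two visible blocks of co-occurring terms.
X = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 3.0],
    [0.0, 1.0, 3.0, 4.0],
]

W, H, history = nmf_multiplicative(X, n_components=2)
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Because each update only multiplies by a ratio of non-negative quantities, `W` and `H` stay non-negative without any explicit projection step.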


In [79]:
tfidf_vectorizer

TypeError: 'TfidfVectorizer' object does not support indexing

In [42]:
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: people don think just right did like time say really know make said question course let way real things good
Topic #1: windows thanks help hi using looking does info software video use dos pc advance anybody mail appreciated card need know
Topic #2: god does jesus true book christian bible christians religion faith church believe read life christ says people lord exist say
Topic #3: thanks know bike interested car mail new like price edu heard list hear want cars email contact just com mark
Topic #4: 10 time year power 12 sale 15 new offer 20 30 00 16 monitor ve 11 14 condition problem 100
Topic #5: space government 00 nasa public security states earth phone 1993 research technology university subject information science data internet provide blood
Topic #6: edu file com program try problem files soon window remember sun win send library mike article just mit oh code
Topic #7: game team year games play world seas

In [43]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
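                                # 'online' learns from mini-batches of batch_size documents;
                                # 'batch' uses the whole dataset in every update
                                learning_method='online',

                                # learning_offset: downweights early iterations in online
                                # learning; should be greater than 1.0, default is 10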
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 4.062s.

Topics in LDA model:
Topic #0: hiv health aids disease medical care study research said 1993 national april service children test information rules page new dr
Topic #1: drive car disk hard drives game power speed card just like good controller new year bios rom better team got
Topic #2: edu com mail windows file send graphics use version ftp pc thanks available program help files using software time know
Topic #3: vs gm thanks win interested copies john email text st mail copy hi new book division edu buying advance know
Topic #4: performance wanted robert speed couldn math ok change address include organization mr science major university internet edu computer driver kept
Topic #5: space scsi earth moon surface probe lunar orbit mission nasa launch science mars energy bit printer spacecraft probes sci solar
Topic #6: israel 000 section turkish military armenian greek killed state armenians peo
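
To see what `learning_method='online'` and `learning_offset` mean for this run, the sketch below counts the mini-batch updates and evaluates the step-size schedule from the online variational Bayes literature, `rho_t = (learning_offset + t) ** (-learning_decay)`. It assumes scikit-learn's default `batch_size=128` and `learning_decay=0.7`, which the notebook does not set explicitly, and that `max_iter` counts passes over the data. The helper names are hypothetical.

```python
import math


def n_online_updates(n_samples, batch_size, max_iter):
    """Number of mini-batch updates if every pass splits the data into batches."""
    return max_iter * math.ceil(n_samples / batch_size)


def step_size(t, learning_offset, learning_decay=0.7):
    """Weight given to the t-th mini-batch update (t starts at 1)."""
    return (learning_offset + t) ** (-learning_decay)


updates = n_online_updates(n_samples=2000, batch_size=128, max_iter=5)
print("mini-batch updates:", updates)

for offset in (10.0, 50.0):
    first, last = step_size(1, offset), step_size(updates, offset)
    print(f"learning_offset={offset}: rho_1={first:.4f}, rho_{updates}={last:.4f}")
```

A larger `learning_offset`, such as the 50 used here, shrinks the early step sizes more than the default of 10, so the first few mini-batches have less influence on the topics.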