The third gensim tutorial
https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py

# Loading the documents and preparing the corpus

In [1]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

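texts: a list of lists of words
Dictionary takes a list of lists of words as input
corpus: here, a bag-of-words (bow) representation. Word-order information is lost. I'm not sure whether this is specific to this example or the usual approach, though bow should be fine for LDA.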
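
To see why word order is lost, here is a minimal stand-in for `doc2bow` written with the standard library only. It is a sketch, not gensim's implementation; `to_bow` and the `token2id` mapping are hypothetical names and ids chosen for illustration.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical id assignment, for illustration only
token2id = {"computer": 0, "human": 1, "interface": 2}

a = to_bow(["human", "interface", "computer"], token2id)
b = to_bow(["computer", "human", "interface"], token2id)

# Both orderings produce the same bow, so the order cannot be recovered
print(a)
print(a == b)
```

Two token lists that differ only in order give identical output, which is all that "bag of words" means here.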

In [10]:
from pprint import pprint
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


# Initializing the model


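In Gensim, a model is a vector-to-vector transformation. Representations such as bow and tf-idf are treated as high-dimensional vectors indexed by word id. Take care that the input vectors live in the same vector space as the vectors used for training. Here the model is trained on bow, so the input must always be bow.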
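
To make the "high-dimensional vector indexed by word id" reading concrete, the sketch below expands one sparse bow document into a dense list. `to_dense` is a hypothetical helper, and `num_terms = 12` is an assumption based on ids 0 through 11 being the only ones that appear in the corpus printed above.

```python
def to_dense(sparse_vec, num_terms):
    """Expand a list of (id, value) pairs into a dense list of length num_terms."""
    dense = [0.0] * num_terms
    for term_id, value in sparse_vec:
        dense[term_id] = float(value)
    return dense

num_terms = 12                       # assumption: ids 0..11, as seen in the corpus
doc_bow = [(1, 1), (5, 2), (8, 1)]   # the fourth document of the corpus

dense = to_dense(doc_bow, num_terms)
print(dense)
```

A model trained on vectors of this shape expects inputs with the same id-to-dimension mapping, which is why the training space and the input space have to match.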

In [4]:
from gensim import models

tfidf = models.TfidfModel(corpus)

In [6]:
new_doc = [(0, 1), (1, 1)]
print(tfidf[new_doc])

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


Apparently a model is applied with `[ ]` rather than `( )`. Feels odd... why?
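
In Python, the bracket syntax `obj[x]` calls `obj.__getitem__(x)`, so any class can support it. The toy class below is a sketch of that mechanism only; `ToyScaler` is hypothetical and is not how gensim's models are implemented. One difference I'm aware of: gensim wraps a whole corpus lazily, whereas this toy converts it eagerly.

```python
class ToyScaler:
    """Hypothetical transformation: multiplies every weight by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, bow):
        # A single document is a list of (id, value) tuples;
        # anything else is treated as a corpus, i.e. a list of documents.
        if bow and isinstance(bow[0], tuple):
            return [(term_id, value * self.factor) for term_id, value in bow]
        return [self[doc] for doc in bow]

scaler = ToyScaler(2)
one_doc = scaler[[(0, 1), (1, 3)]]               # single document
many_docs = scaler[[[(0, 1)], [(1, 1), (2, 2)]]]  # small corpus
print(one_doc)
print(many_docs)
```

The same `[ ]` works for one document and for a corpus, which mirrors how `tfidf[new_doc]` and `tfidf[corpus]` are both written.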

It can also be applied to the whole corpus

In [12]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    pprint(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476),
 (3, 0.44424552527467476),
 (4, 0.44424552527467476),
 (5, 0.3244870206138555),
 (6, 0.44424552527467476),
 (7, 0.3244870206138555)]
[(2, 0.5710059809418182),
 (5, 0.4170757362022777),
 (7, 0.4170757362022777),
 (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
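
The numbers above can be reproduced by hand. The sketch below assumes the weighting is tf × log2(N / df) followed by L2 normalization; that is my reading of gensim's defaults rather than something stated in this notebook, and `tfidf_by_hand` is a hypothetical helper.

```python
import math

# The bow corpus printed earlier in this notebook
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

def tfidf_by_hand(bow, corpus):
    """Assumed formula: tf * log2(N / df), then scale the vector to unit length."""
    n_docs = len(corpus)
    df = {}  # document frequency of each term id
    for doc in corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    weights = [(t, tf * math.log2(n_docs / df[t])) for t, tf in bow if t in df]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(t, w / norm) for t, w in weights]

new_doc_tfidf = tfidf_by_hand([(0, 1), (1, 1)], corpus)
second_doc_tfidf = tfidf_by_hand(corpus[1], corpus)
print(new_doc_tfidf)
print(second_doc_tfidf)
```

Under that assumption, ids 5 and 7 get a smaller weight in the second document because they occur in three documents rather than two.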


A transformed corpus can be transformed again

In [13]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]

In [16]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
lsi_model.print_topics(2)

2019-12-09 04:17:20,570 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2019-12-09 04:17:20,574 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

I don't know much about LSI, but each topic appears to be a linear combination of the tf-idf features.
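
As a rough illustration of the "linear combination" reading, the sketch below projects one tf-idf document onto topic 0 with a plain dot product. It assumes that a document's topic coordinate is the sum of topic weight × tf-idf weight, which I have not checked against `corpus_lsi`. The topic weights are the rounded values from the `print_topics` output, and the document is "The intersection graph of paths in trees", which reduces to the words graph and trees with a tf-idf weight of about 0.7071 each.

```python
# Rounded topic-0 weights copied from the print_topics output above
topic0 = {"trees": 0.703, "graph": 0.538}

# tf-idf weights of the seventh document, keyed by word instead of id
doc_tfidf = {"graph": 0.7071067811865475, "trees": 0.7071067811865475}

# Assumed projection: dot product of the document with the topic weights
coord0 = sum(topic0.get(word, 0.0) * weight for word, weight in doc_tfidf.items())
print(round(coord0, 4))
```

Because the topic weights are rounded to three decimals, the result is only approximate.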