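This tutorial shows how to transform documents from one vector representation into another. The process serves two goals:

	1.	To bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and (hopefully) more semantically meaningful way.
	2.	To make the document representation more compact, which improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

Creating the Corpus

First, we need a corpus to work with. This step is the same as in the previous tutorial, so if you have already completed it, feel free to skip to the next section.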

In [2]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [3]:
from gensim import models

# fit a TF-IDF model: document frequencies are collected from the bag-of-words corpus
tfidf = models.TfidfModel(corpus)

In [4]:
# apply the transformation to the whole corpus (evaluated lazily, one document at a time)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
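
The weights above can be checked by hand. The following is a minimal sketch, assuming gensim's default TfidfModel settings (raw term counts, an inverse document frequency of log2(N / df), and L2 normalization of each document vector). The function name tfidf_by_hand is hypothetical, and bow_corpus is a hardcoded copy of the bag-of-words counts that doc2bow yields for the nine documents; neither is part of gensim.

```python
import math

# Bag-of-words vectors (token id, count) for the nine documents,
# hardcoded here so the sketch runs without gensim (an assumption for illustration).
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def tfidf_by_hand(corpus):
    """Weight each term by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(corpus)

    # document frequency: in how many documents each token id occurs
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1

    result = []
    for doc in corpus:
        weights = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm == 0:
            # every term occurs in every document, so nothing is left to normalize
            result.append([])
        else:
            result.append([(tid, w / norm) for tid, w in weights])
    return result


# the fourth document contains "system" twice, so its weight dominates
print(tfidf_by_hand(bow_corpus)[3])
```

The printed vector should agree with the fourth row of the gensim output up to floating-point rounding.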


LSI Model

In [8]:
# initialize an LSI transformation
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # 2 latent dimensions

# create a double wrapper over the original corpus : bow->tfidf->fold-in-lsi
corpus_lsi = lsi_model[corpus_tfidf]

In [9]:
# Topics
lsi_model.print_topics(2)

[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [11]:
# Document vectors
# both the bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.066007833960907), (1, -0.5200703306361851)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142932), (1, -0.7609563167700032)] A survey of user opinion of computer system response time
[(0, 0.08992639972446861), (1, -0.7241860626752511)] The EPS user interface management system
[(0, 0.07585847652178528), (1, -0.632055158600343)] System and human system engineering testing of EPS
[(0, 0.10150299184980437), (1, -0.5737308483002944)] Relation of user perceived response time to error measurement
[(0, 0.70321089393783), (1, 0.16115180214026137)] The generation of random binary unordered trees
[(0, 0.877478767311982), (1, 0.16758906864659862)] The intersection graph of paths in trees
[(0, 0.9098624686818569), (1, 0.140865536287195)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569283), (1, -0.053929075663890116)] Graph minors A survey


In [12]:
# singular values, one per topic
lsi_model.projection.s

array([1.59363762, 1.47629312])

In [13]:
# word vectors (left singular vectors): one row per dictionary term, one column per topic
lsi_model.projection.u

array([[ 0.04940859, -0.29287972],
       [ 0.02969616, -0.2804038 ],
       [ 0.03522417, -0.32750471],
       [ 0.05951239, -0.3204961 ],
       [ 0.1869311 , -0.17065511],
       [ 0.06135723, -0.46024666],
       [ 0.05951239, -0.3204961 ],
       [ 0.05823724, -0.3726838 ],
       [ 0.03490897, -0.3323675 ],
       [ 0.70321089,  0.1611518 ],
       [ 0.53773148,  0.07585493],
       [ 0.40171367,  0.0294099 ]])

In [14]:
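# Words, in token-id order
# dictionary[i] is used because dictionary.id2token is built lazily and may still be empty
words = [dictionary[i] for i in range(len(dictionary))]
words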

['computer',
 'human',
 'interface',
 'response',
 'survey',
 'system',
 'time',
 'user',
 'eps',
 'trees',
 'graph',
 'minors']

In [15]:
# Word vectors: pair each word with its row of the left singular vectors
list(zip(words, lsi_model.projection.u))

[('computer', array([ 0.04940859, -0.29287972])),
 ('human', array([ 0.02969616, -0.2804038 ])),
 ('interface', array([ 0.03522417, -0.32750471])),
 ('response', array([ 0.05951239, -0.3204961 ])),
 ('survey', array([ 0.1869311 , -0.17065511])),
 ('system', array([ 0.06135723, -0.46024666])),
 ('time', array([ 0.05951239, -0.3204961 ])),
 ('user', array([ 0.05823724, -0.3726838 ])),
 ('eps', array([ 0.03490897, -0.3323675 ])),
 ('trees', array([0.70321089, 0.1611518 ])),
 ('graph', array([0.53773148, 0.07585493])),
 ('minors', array([0.40171367, 0.0294099 ]))]
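
The topic strings printed by print_topics earlier can be recovered from these word vectors: column k holds the weight of every word in topic k, and the words are listed by decreasing absolute weight. A minimal sketch, with the values above hardcoded (rounded to eight decimals) and a hypothetical helper top_words that is not part of gensim:

```python
# (word, [weight in topic 0, weight in topic 1]) pairs, copied from the output above
word_vectors = [
    ("computer", [0.04940859, -0.29287972]),
    ("human", [0.02969616, -0.2804038]),
    ("interface", [0.03522417, -0.32750471]),
    ("response", [0.05951239, -0.3204961]),
    ("survey", [0.1869311, -0.17065511]),
    ("system", [0.06135723, -0.46024666]),
    ("time", [0.05951239, -0.3204961]),
    ("user", [0.05823724, -0.3726838]),
    ("eps", [0.03490897, -0.3323675]),
    ("trees", [0.70321089, 0.1611518]),
    ("graph", [0.53773148, 0.07585493]),
    ("minors", [0.40171367, 0.0294099]),
]


def top_words(vectors, topic, n=3):
    """Return the n (word, weight) pairs with the largest absolute weight in a topic."""
    ranked = sorted(vectors, key=lambda item: abs(item[1][topic]), reverse=True)
    return [(word, vector[topic]) for word, vector in ranked[:n]]


for topic in range(2):
    print(topic, top_words(word_vectors, topic))
```

Ranking by absolute value matters for the second topic, where the strongest words all carry negative weights.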

In [16]:
# Document vectors (right singular vectors, scaled by the singular values)
for doc in corpus_lsi:
    print(doc)

[(0, 0.066007833960907), (1, -0.5200703306361851)]
[(0, 0.19667592859142932), (1, -0.7609563167700032)]
[(0, 0.08992639972446861), (1, -0.7241860626752511)]
[(0, 0.07585847652178528), (1, -0.632055158600343)]
[(0, 0.10150299184980437), (1, -0.5737308483002944)]
[(0, 0.70321089393783), (1, 0.16115180214026137)]
[(0, 0.877478767311982), (1, 0.16758906864659862)]
[(0, 0.9098624686818569), (1, 0.140865536287195)]
[(0, 0.6165825350569283), (1, -0.053929075663890116)]
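
Each of these vectors is a document's TF-IDF vector projected onto the word vectors: component k is the dot product of the TF-IDF weights with column k of lsi_model.projection.u. The following is a minimal sketch in plain Python, with the matrix and singular values hardcoded from the earlier outputs (rounded to eight decimals) and a hypothetical helper project that is not part of gensim. Because of the rounding, the result should agree with the fourth row above only to about six decimal places.

```python
# Left singular vectors (one row per token id), copied from lsi_model.projection.u
U = [
    [0.04940859, -0.29287972],
    [0.02969616, -0.2804038],
    [0.03522417, -0.32750471],
    [0.05951239, -0.3204961],
    [0.1869311, -0.17065511],
    [0.06135723, -0.46024666],
    [0.05951239, -0.3204961],
    [0.05823724, -0.3726838],
    [0.03490897, -0.3323675],
    [0.70321089, 0.1611518],
    [0.53773148, 0.07585493],
    [0.40171367, 0.0294099],
]

# Singular values, copied from lsi_model.projection.s
S = [1.59363762, 1.47629312]

# TF-IDF vector of the fourth document ("System and human system engineering testing of EPS")
doc_tfidf = [(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]


def project(doc, u):
    """Project a sparse (token id, weight) vector onto the columns of u."""
    num_topics = len(u[0])
    return [
        sum(weight * u[token_id][k] for token_id, weight in doc)
        for k in range(num_topics)
    ]


doc_lsi = project(doc_tfidf, U)
print(doc_lsi)

# dividing by the singular values removes the scaling and leaves
# the document's entries in the right singular vectors
unscaled = [x / s for x, s in zip(doc_lsi, S)]
print(unscaled)
```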
