We train an LSI (Latent Semantic Indexing) model from a document-term matrix that was built in advance. To do this, we apply SVD directly.

In [1]:
import pickle

with open('../../../data/corpus_10days/models/params_keywords', 'rb') as f:
    params = pickle.load(f)
x = params['x']
vocab2idx = params['word2index']
idx2vocab = params['index2word']

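Because the number of words differs from document to document, we apply L2 normalization to each row (document) so that long and short documents become comparable.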

In [2]:
from sklearn.preprocessing import normalize

x_ = normalize(x)
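
As a quick sanity check of what `normalize` does, here is a minimal sketch on a made-up 2 x 3 count matrix (`toy_x` is hypothetical data, not the corpus loaded above). `normalize` defaults to `norm='l2'` and `axis=1`, so each row is scaled to unit length.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Toy document-term counts (hypothetical data, not the corpus used above)
toy_x = csr_matrix(np.array([[3, 4, 0],
                             [1, 0, 0]], dtype=float))

# Row-wise L2 normalization (norm='l2' and axis=1 are the defaults)
toy_x_ = normalize(toy_x)

# Each document vector now has length 1, regardless of how many words it had
row_norms = np.asarray(
    np.sqrt(toy_x_.multiply(toy_x_).sum(axis=1))
).ravel()

print(toy_x_.toarray())
print(row_norms)
```

The first row `[3, 4, 0]` has length 5, so it becomes `[0.6, 0.8, 0.0]`; the sparsity pattern is unchanged.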

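With TruncatedSVD we can map both documents and terms into an n_components-dimensional space. Note that TruncatedSVD uses a randomized solver by default, so results can vary slightly between runs unless random_state is set.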

In [3]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
y = svd.fit_transform(x_)

The document-term matrix had shape (30091, 9774). The document space spanned by 9,774 terms has been reduced to a 100-dimensional space.

In [4]:
print(x_.shape, y.shape)

(30091, 9774) (30091, 100)


The 100-dimensional vector of each term is stored in components_. Each column of components_ corresponds to one term.

In [5]:
svd.components_.shape

(100, 9774)
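
To make the layout of `components_` concrete, the following sketch fits `TruncatedSVD` on a small random sparse matrix (hypothetical data; the names `toy`, `toy_svd`, and `term_vec` are illustrative only). It checks that `transform` is the projection of the documents onto the rows of `components_`, and that a term vector is a column of `components_`.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Toy sparse matrix: 50 "documents" x 20 "terms" (random, hypothetical data)
toy = sparse_random(50, 20, density=0.3, format='csr', random_state=0)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
toy_svd.fit(toy)

# components_ has one row per latent dimension and one column per term
print(toy_svd.components_.shape)

# transform() projects documents onto the rows of components_
projected = toy_svd.transform(toy)
manual = toy @ toy_svd.components_.T
print(np.allclose(projected, manual))

# The vector of term j is column j of components_ (here j = 3)
term_vec = toy_svd.components_[:, 3]
print(term_vec.shape)
```

This is why the functions in this notebook slice `svd.components_[:, idx]` to obtain a term vector and transpose `components_` to obtain all term vectors at once.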

Using these vectors, we can find topically similar terms and topically similar docs.

In [6]:
from sklearn.metrics import pairwise_distances

def most_similar_terms(term, topn=10):

    # encode term as index
    idx = vocab2idx.get(term, -1)
    if idx < 0:
        return []
    
    # prepare query term vector
    query_vec = svd.components_[:,idx].reshape(1,-1)

    # compute cosine distance between every term vector and the query
    dist = pairwise_distances(
        # transpose word vectors : (100, 9774) -> (9774, 100)
        svd.components_.transpose(),
        query_vec,
        metric='cosine'
    ).reshape(-1)
    
    # find the closest terms (smallest cosine distance first)
    similar_idx = dist.argsort()[:topn]
    
    # get their distance
    similar_dist = dist[similar_idx]
    
    # format and decode term indices : [(term, distance), ... ]
    similar_terms = [
        (idx2vocab[i], d) for i, d in zip(similar_idx, similar_dist)
    ]

    # return
    return similar_terms

Topically similar terms for '아이오아이' (the group I.O.I):

In [7]:
most_similar_terms('아이오아이')

[('아이오아이', 1.1102230246251565e-16),
 ('신용재', 0.04015749988006123),
 ('엠카운트다운', 0.04376174748669781),
 ('오블리스', 0.0456031178529811),
 ('빅브레인', 0.0467705423425252),
 ('너무너무너무', 0.05191660254815156),
 ('세븐', 0.06866961383970227),
 ('갓세븐', 0.07091395215096752),
 ('산들', 0.11148465700268906),
 ('중독성', 0.11373631637330073)]

Topically similar terms for '오바마' (Obama):

In [8]:
most_similar_terms('오바마')

[('오바마', 0.0),
 ('버락', 0.02842665009085288),
 ('백악관', 0.10196960176734082),
 ('주지사', 0.1818647366158357),
 ('꼭두각시', 0.19674068147232604),
 ('이민자', 0.21531647556387057),
 ('클린턴', 0.2251726272441089),
 ('장벽', 0.2273930054140233),
 ('공화당', 0.23119685454084837),
 ('푸틴', 0.23313240797754864)]
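
The ranking above depends only on the direction of the vectors, not their length. As a rough illustration, the sketch below computes the same kind of cosine-distance ranking with plain NumPy on four made-up 2-dimensional vectors. The helper `nearest_by_cosine` and the data are hypothetical; in this notebook the rows would be `svd.components_.T`.

```python
import numpy as np

def nearest_by_cosine(vectors, query_idx, topn=3):
    # Scale every row to unit length; guard against all-zero rows
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    # Cosine distance = 1 - cosine similarity
    dist = 1.0 - unit @ unit[query_idx]
    order = dist.argsort()[:topn]
    return [(int(i), float(dist[i])) for i in order]

toy_vectors = np.array([
    [1.0, 0.0],    # item 0: the query
    [2.0, 0.1],    # item 1: nearly the same direction as item 0
    [0.0, 1.0],    # item 2: orthogonal to item 0
    [-1.0, 0.0],   # item 3: opposite direction
])

result = nearest_by_cosine(toy_vectors, query_idx=0, topn=4)
print(result)
```

The query itself comes first with distance 0, an orthogonal vector has distance 1, and an opposite vector has distance 2, which is the full range of the cosine distance.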

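Now that we have learned 100-dimensional vectors for both terms and documents, we can use them to retrieve documents that are topically relevant to a given term.

The cell below defines most_similar_docs_from_term for this search, and a get_bow function for inspecting the most frequent terms of each retrieved document.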

In [9]:
def most_similar_docs_from_term(term, topn=10):

    # encode term as index
    idx = vocab2idx.get(term, -1)
    if idx < 0:
        return []

    # prepare query term vector
    query_vec = svd.components_[:,idx].reshape(1,-1)

    # compute distance between query term vector and document vectors
    dist = pairwise_distances(
        y,
        query_vec,
        metric='cosine'
    ).reshape(-1)
    
    # find similar document indices
    similar_doc_idx = dist.argsort()[:topn]

    # return
    return similar_doc_idx

def get_bow(doc_idx, topn=10):

    # get term frequency submatrix
    x_sub = x[doc_idx,:]

    # get term indices and their frequencies
    terms = x_sub.nonzero()[1]
    freqs = x_sub.data
    
    # format : [(term, frequency), ... ]
    bow = [(term, freq) for term, freq in zip(terms, freqs)]
    
    # sort by frequency in decreasing order
    bow = sorted(bow, key=lambda pair: -pair[1])[:topn]
    
    # decode term index to vocabulary
    bow = [(idx2vocab[term], freq) for term, freq in bow]
    
    # return
    return bow
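
The pairing of term indices with frequencies can be checked on a tiny example. The sketch below assumes the count matrix is a SciPy CSR matrix (the original uses `.nonzero()` and `.data`, which suggests a SciPy sparse type) and uses made-up data with hypothetical names `toy_x`, `toy_idx2vocab`, and `toy_get_bow`. It reads the row's `indices` and `data` arrays, which are aligned by construction in CSR.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy counts: 2 documents x 4 terms (hypothetical data)
toy_x = csr_matrix(np.array([[0, 5, 2, 0],
                             [1, 0, 0, 7]]))
toy_idx2vocab = ['a', 'b', 'c', 'd']  # hypothetical vocabulary

def toy_get_bow(doc_idx, topn=10):
    row = toy_x.getrow(doc_idx)
    # In a CSR row, indices[k] is the column of data[k], so they stay aligned
    pairs = zip(row.indices, row.data)
    # Sort by frequency in decreasing order and keep the top entries
    pairs = sorted(pairs, key=lambda p: -p[1])[:topn]
    return [(toy_idx2vocab[t], int(f)) for t, f in pairs]

print(toy_get_bow(0))  # [('b', 5), ('c', 2)]
print(toy_get_bow(1))  # [('d', 7), ('a', 1)]
```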

Documents related to '오바마' (Obama). October 2016 was in the run-up to the US presidential election.

In [10]:
for doc_idx in most_similar_docs_from_term('오바마'):
    bow = get_bow(doc_idx)
    print('doc#={} : {}'.format(doc_idx, bow))

doc#=9615 : [('오바마', 11), ('트럼프', 11), ('대통령', 10), ('미국', 5), ('초대', 4), ('토론', 4), ('대변', 3), ('지지', 3), ('케냐', 3), ('힐러리', 3)]
doc#=9471 : [('오바마', 13), ('대통령', 12), ('트럼프', 11), ('미국', 6), ('초대', 4), ('대변', 3), ('지지', 3), ('케냐', 3), ('토론', 3), ('부인', 2)]
doc#=14951 : [('트럼프', 10), ('대통령', 7), ('클린턴', 7), ('후보', 7), ('공화당', 4), ('결과', 3), ('꼭두각시', 3), ('대선', 3), ('비방', 3), ('기자', 2)]
doc#=7256 : [('트럼프', 25), ('힐러리', 17), ('대통령', 7), ('토론', 7), ('푸틴', 6), ('미국', 5), ('여자', 5), ('끔찍', 4), ('러시아', 4), ('꼭두각시', 3)]
doc#=11929 : [('대통령', 7), ('오바마', 7), ('트럼프', 7), ('선거', 6), ('주장', 6), ('조작', 5), ('대선', 4), ('이라고', 4), ('클린턴', 4), ('후보', 4)]
doc#=11441 : [('공화당', 20), ('대통령', 7), ('로비', 6), ('선거', 6), ('후보', 6), ('것이다', 5), ('미국', 5), ('조직', 4), ('트럼프', 4), ('권력', 3)]
doc#=27797 : [('트럼프', 31), ('클린턴', 20), ('대통령', 12), ('미국', 9), ('토론', 9), ('꼭두각시', 6), ('여성', 6), ('후보', 6), ('19일', 5), ('뉴스1', 5)]
doc#=30018 : [('토론', 12), ('트럼프', 9), ('후보', 7), ('클린턴', 6), ('3차', 5), ('대선', 4), ('대통