# Dimensionality Reduction in the Vector Space Model

## Part 1. The standard example from Deerwester et al., “Indexing by Latent Semantic Analysis”

Let us define a small collection of documents.

In [1]:
from gensim import corpora
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Preprocessing

In [2]:
from collections import defaultdict
from pprint import pprint  # pretty-printer

# Stop words to drop from every document
stoplist = set('for a of the and to in'.split())

# Lowercase, split on whitespace, and remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Build a frequency dictionary over the whole collection
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Drop tokens that occur only once in the collection
texts = [[token for token in text if frequency[token] > 1] for text in texts]
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


Build the dictionary and the vector representation of the texts

In [3]:
dictionary = corpora.Dictionary(texts)  # initialize the dictionary (token -> integer id)
print(dictionary)
print(dictionary.token2id)

Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)
{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [4]:
corpus = [dictionary.doc2bow(text) for text in texts]  # the vector model itself: one sparse (id, count) list per document
print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
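
To make the printed `(id, count)` pairs easier to read, here is a minimal plain-Python sketch of the same representation. The helpers `to_bow` and `to_dense` are hypothetical names introduced only for illustration; they imitate the output format of `doc2bow` and are not gensim's implementation. The `token2id` mapping is copied from the output above.

```python
from collections import Counter

# Mapping copied from the dictionary.token2id output above.
token2id = {'human': 0, 'interface': 1, 'computer': 2, 'survey': 3,
            'user': 4, 'system': 5, 'response': 6, 'time': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(tokens, token2id):
    """Hypothetical helper: count known tokens and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_dense(bow, size):
    """Hypothetical helper: expand sparse (id, count) pairs into a dense count vector."""
    vec = [0] * size
    for term_id, count in bow:
        vec[term_id] = count
    return vec


# Fourth document after preprocessing: 'system' occurs twice.
bow = to_bow(['system', 'human', 'system', 'eps'], token2id)
print(bow)                            # [(0, 1), (5, 2), (8, 1)]
print(to_dense(bow, len(token2id)))   # [1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0]
```

Each document is thus a 12-dimensional count vector, stored sparsely; this is the space whose dimensionality is reduced in the next step.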


Dimensionality reduction

In [6]:
from gensim import corpora, models, similarities
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # fit an LSI model with 2 topics
print(lsi)

LsiModel(num_terms=12, num_topics=2, decay=1.0, chunksize=20000)
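
Conceptually, LSI is a truncated singular value decomposition of the term-document matrix. The sketch below illustrates that idea with a dense `numpy.linalg.svd` on the same counts. This is an assumption made for illustration: gensim uses its own incremental algorithm, and the sign of each singular vector is arbitrary, so individual weights may come out with flipped signs.

```python
import numpy as np

# Bag-of-words corpus and term order copied from the outputs above.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(1, 1), (4, 1), (5, 1), (8, 1)],
    [(0, 1), (5, 2), (8, 1)],
    [(4, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(3, 1), (10, 1), (11, 1)],
]
terms = ['human', 'interface', 'computer', 'survey', 'user', 'system',
         'response', 'time', 'eps', 'trees', 'graph', 'minors']

# Dense 12 x 9 term-document matrix: rows are terms, columns are documents.
A = np.zeros((len(terms), len(corpus)))
for doc_id, bow in enumerate(corpus):
    for term_id, count in bow:
        A[term_id, doc_id] = count

# Full SVD, then keep only the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# Each column of U_k is one topic: a weight for every term.
# Compare magnitudes, because the overall sign of a column is arbitrary.
top_term = terms[int(np.argmax(np.abs(U_k[:, 0])))]
print("top term of topic 0:", top_term)

# Documents in the reduced space: project each column of A onto the topics.
docs_2d = U_k.T @ A
print(docs_2d.shape)
```

With `k = 2`, every 12-dimensional document vector is replaced by two coordinates, one per topic.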


Searching with a multi-word query

In [8]:
doc = "Human computer interaction"  # the query, treated as a document
vec_bow = dictionary.doc2bow(doc.lower().split())  # query as a bag-of-words vector; "interaction" is not in the dictionary and is ignored
print(vec_bow)
vec_lsi = lsi[vec_bow]  # convert the query into the lower-dimensional space
print(vec_lsi)

[(0, 1), (2, 1)]
[(0, 0.46182100453271591), (1, 0.070027665279000173)]


In [9]:
index = similarities.MatrixSimilarity(lsi[corpus])  # index over the source texts represented in the lower-dimensional space
print(index)

MatrixSimilarity<9 docs, 2 features>


In [10]:
sims = index[vec_lsi]  # query the index with the query vector to get its similarity to every document
sims = sorted(enumerate(sims), key=lambda item: -item[1])  # most similar documents first
print("Q:", doc)
for doc_id, score in sims:
    print('doc', doc_id, documents[doc_id], score)

Q: Human computer interaction
doc 2 The EPS user interface management system 0.998445
doc 0 Human machine interface for lab abc computer applications 0.998093
doc 3 System and human system engineering testing of EPS 0.986589
doc 1 A survey of user opinion of computer system response time 0.937486
doc 4 Relation of user perceived response time to error measurement 0.907559
doc 8 Graph minors A survey 0.0500418
doc 7 Graph minors IV Widths of trees and well quasi ordering -0.0987946
doc 6 The intersection graph of paths in trees -0.106393
doc 5 The generation of random binary unordered trees -0.124168
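
The scores above are cosine similarities between the query vector and each document vector in the 2-dimensional space. The following is a minimal plain-Python sketch of that measure. The query coordinates are rounded from the `vec_lsi` output; `doc_a` and `doc_b` are hypothetical vectors made up for illustration, not actual documents from the corpus.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = (0.4618, 0.0700)    # rounded from the vec_lsi output above
doc_a = (0.9236, 0.1400)    # hypothetical: same direction as the query, twice as long
doc_b = (-0.0700, 0.4618)   # hypothetical: perpendicular to the query

print(cosine(query, doc_a))
print(cosine(query, doc_b))
```

Only the direction of a vector matters, not its length. This is why the first five documents, which point the same way as the query along topic 0, all score above 0.9, while the graph-theory documents score near zero or below.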


In [11]:
lsi.show_topics(2)

[(0,
  '0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"response" + 0.265*"time" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"'),
 (1,
  '-0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system" + 0.141*"eps" + 0.113*"human" + -0.107*"response" + -0.107*"time" + 0.072*"interface"')]
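
These weights also account for the query vector computed earlier. Assuming the default, unscaled projection, a query's coordinate on a topic is the sum of the topic weights of its terms, multiplied by their counts. The small check below uses the rounded topic-0 weights printed above.

```python
# Topic-0 weights for the two query terms, copied (rounded) from the output above.
topic0_weights = {"human": 0.221, "computer": 0.240}

# Bag-of-words query: "interaction" is not in the dictionary, so only two terms remain.
query_counts = {"human": 1, "computer": 1}

# Coordinate of the query on topic 0.
coord0 = sum(count * topic0_weights[term] for term, count in query_counts.items())
print(round(coord0, 3))  # 0.461
```

The result agrees, up to rounding, with the first coordinate of `vec_lsi` printed earlier (0.4618).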