# Vector space model

Let us define a small collection of documents.

In [1]:
from gensim import corpora, models, similarities

from pylab import pcolor, show, colorbar, xticks, yticks
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]



Preprocessing

In [2]:
from collections import defaultdict
from pprint import pprint  # pretty-printer

stoplist = set('for a of the and to in'.split())  # stop words
# lowercase, split on whitespace, and drop stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# build a token frequency dictionary
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# drop tokens that occur only once in the whole collection
texts = [[token for token in text if frequency[token] > 1] for text in texts]
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


## Vector representation of a text collection

|                        | $d_1$                  | $d_2$                  | $\ldots$ | $d_{D}$                |
|------------------------|------------------------|------------------------|----------|------------------------|
| $w_1$                  | $f_{11}$               | $f_{12}$               | $\ldots$ | $f_{1D}$               |
| $w_2$                  | $f_{21}$               | $f_{22}$               | $\ldots$ | $f_{2D}$               |
| $\ldots$               |                        |                        |          |                        |
| $w_{\lvert V \rvert}$  | $f_{\lvert V \rvert 1}$ | $f_{\lvert V \rvert 2}$ | $\ldots$ | $f_{\lvert V \rvert D}$ |


Cosine similarity in the vector space model [Salton et al., 1975], where $f_{ki}$ is the frequency of word $w_k$ in document $d_i$:
$ \cos(d_i, d_j) = \frac {d_i \cdot d_j}{||d_i|| \, ||d_j||} = \frac{\sum_k f_{ki} f_{kj}} {\sqrt{\sum_k f_{ki}^2} \sqrt{\sum_k f_{kj}^2}}$


If the vectors are normalized to unit length, $||d_i|| = ||d_j|| = 1$, then $ \cos(d_i, d_j) = d_i \cdot d_j$.
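
As a minimal sketch, the cosine measure (the dot product divided by the product of the Euclidean norms) can be computed directly on sparse bag-of-words vectors. The helpers `dot`, `norm`, `cosine` and `normalize` are illustrative names, not part of gensim, and the vectors are assumed to be dicts that map a term id to its frequency.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term_id: weight}."""
    return sum(w * v[k] for k, w in u.items() if k in v)

def norm(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in u.values()))

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is empty."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return dot(u, v) / (nu * nv)

def normalize(u):
    """Scale a non-empty vector to unit length."""
    n = norm(u)
    return {k: w / n for k, w in u.items()} if n else dict(u)

# two toy vectors over term ids 0..2 (hypothetical data)
d_i = {0: 1, 1: 1, 2: 1}
d_j = {0: 1, 2: 1}

print(cosine(d_i, d_j))                      # 2 / sqrt(6), about 0.8165
print(dot(normalize(d_i), normalize(d_j)))   # same value, up to rounding
```

For unit-length vectors the second call shows that the plain dot product already gives the cosine, which is why similarity indexes typically normalize vectors once, at indexing time.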

Build the dictionary and the vector representation of the texts.

In [3]:
dictionary = corpora.Dictionary(texts)  # initialize the dictionary
print(dictionary) 
print(dictionary.token2id)

Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)
{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [4]:
corpus = [dictionary.doc2bow(text) for text in texts]  # this is the vector space model itself
print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
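
To make the bag-of-words format concrete, here is a rough pure-Python sketch of the same idea: each document becomes a sorted list of `(token_id, count)` pairs. The functions `build_token2id` and `to_bow` are hypothetical helpers written for illustration. Assigning ids in order of first appearance is an assumption of this sketch, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_token2id(texts):
    """Assign consecutive ids to tokens in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# two of the preprocessed documents from above
toy_texts = [['human', 'interface', 'computer'],
             ['system', 'human', 'system', 'eps']]

toy_token2id = build_token2id(toy_texts)
toy_bow = [to_bow(text, toy_token2id) for text in toy_texts]

print(toy_token2id)
print(toy_bow)
# a token missing from the dictionary is silently dropped
print(to_bow(['human', 'interaction'], toy_token2id))
```

Only non-zero entries are stored, so a document over a large vocabulary stays as short as its number of distinct known tokens.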


## Query search

Find the documents closest to the query vector by cosine similarity:

![](img/ir3.png)

In [5]:
q = "human computer interaction" 
vec = dictionary.doc2bow(q.lower().split()) 
print(vec)

[(0, 1), (2, 1)]


In [6]:
index = similarities.MatrixSimilarity(corpus) 
print(index)

MatrixSimilarity<9 docs, 12 features>


In [7]:
sims = index[vec]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print("Q:", q)
for i in sims:
    print('doc', i[0], documents[i[0]], i[1])

Q: human computer interaction
doc 0 Human machine interface for lab abc computer applications 0.81649655
doc 1 A survey of user opinion of computer system response time 0.28867513
doc 3 System and human system engineering testing of EPS 0.28867513
doc 2 The EPS user interface management system 0.0
doc 4 Relation of user perceived response time to error measurement 0.0
doc 5 The generation of random binary unordered trees 0.0
doc 6 The intersection graph of paths in trees 0.0
doc 7 Graph minors IV Widths of trees and well quasi ordering 0.0
doc 8 Graph minors A survey 0.0


Apply the $\text{tf-idf}$ transformation:

In [8]:
tfidf = models.TfidfModel(corpus)

In [9]:
for word_id in dictionary:
    print("%s – %d – %1.4f" %(dictionary[word_id],  tfidf.dfs[word_id], tfidf.idfs[word_id]))

human – 2 – 2.1699
interface – 2 – 2.1699
computer – 2 – 2.1699
survey – 2 – 2.1699
user – 3 – 1.5850
system – 3 – 1.5850
response – 2 – 2.1699
time – 2 – 2.1699
eps – 2 – 2.1699
trees – 3 – 1.5850
graph – 3 – 1.5850
minors – 2 – 2.1699


In [10]:
corpus[1], tfidf[corpus[1]]

([(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 0.44424552527467476),
  (3, 0.44424552527467476),
  (4, 0.3244870206138555),
  (5, 0.3244870206138555),
  (6, 0.44424552527467476),
  (7, 0.44424552527467476)])
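
The numbers printed above appear consistent with $idf(w) = \log_2(D / df_w)$ followed by scaling each document vector to unit length. The sketch below recomputes them in pure Python under that assumption. `idf_weights` and `tfidf_vector` are illustrative helpers, not gensim functions, and `bow_corpus` is a copy of the bag-of-words corpus printed earlier.

```python
import math

def idf_weights(bow_corpus):
    """idf(t) = log2(number of documents / document frequency of t)."""
    n_docs = len(bow_corpus)
    df = {}
    for doc in bow_corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return {t: math.log2(n_docs / n) for t, n in df.items()}

def tfidf_vector(doc, idf):
    """Weight each count by idf, then scale the vector to unit length."""
    weighted = [(t, tf * idf[t]) for t, tf in doc if t in idf]
    length = math.sqrt(sum(w * w for _, w in weighted))
    if length == 0:
        return []
    return [(t, w / length) for t, w in weighted]

bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(1, 1), (4, 1), (5, 1), (8, 1)],
    [(0, 1), (5, 2), (8, 1)],
    [(4, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(3, 1), (10, 1), (11, 1)],
]

idf = idf_weights(bow_corpus)
doc1 = tfidf_vector(bow_corpus[1], idf)

print(idf[0], idf[4])   # idf of 'human' (df = 2) and 'user' (df = 3)
print(doc1)
```

Terms that occur in fewer documents receive a larger weight, so in document 1 the rarer `computer` (id 2) outweighs the more common `user` (id 4) even though both occur once.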

#### Task 1
1. Transform all vectors of the corpus with $\text{tf-idf}$ and build a new index.
2. Transform the query vector with $\text{tf-idf}$.
3. How are the $idf$ weights in the query vector computed?
4. Repeat the query search after the $\text{tf-idf}$ transformation.

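## Dimensionality reduction

Singular value decomposition: $M = U \Sigma V^T$

Dimensionality reduction with the singular value decomposition keeps only the $k$ largest singular values and the corresponding singular vectors: $M'_k = U_k \Sigma_k V_k^T$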


![figure](img/svd.jpg)
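
The worked equations below sketch how the truncated decomposition can be used for retrieval. Here $U_k$ and $V_k$ denote the first $k$ columns of $U$ and $V$, $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$, and $r$ is the rank of $M$. The symbols $\hat{d}_j$ and $\hat{q}$ are notation introduced here, and projecting without rescaling by $\Sigma_k^{-1}$ is an assumed convention, not the only one in use.

$$
\begin{aligned}
M &= U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T, \qquad \sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0, \\
M'_k &= U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \qquad k \le r, \\
\hat{d}_j &= U_k^T d_j, \qquad \hat{q} = U_k^T q .
\end{aligned}
$$

By the Eckart-Young theorem, $M'_k$ is the best rank-$k$ approximation of $M$ in the Frobenius norm. Documents and the query are then compared by the cosine measure between their $k$-dimensional images $\hat{d}_j$ and $\hat{q}$.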

In [11]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # define the LSI model with 2 topics
print(lsi)

LsiModel(num_terms=12, num_topics=2, decay=1.0, chunksize=20000)


#### Task 2
1. Repeat Task 1, now in the lower-dimensional space.
2. What do the new coordinates of the query vector mean?


In [22]:
## Solution 1

index = similarities.MatrixSimilarity(tfidf[corpus]) 
vec_tfidf = tfidf[vec]

for word in q.lower().split():
    if word in dictionary.token2id:
        word_id = dictionary.token2id[word]
        print("%s – %d – %1.4f" %(dictionary[word_id],  tfidf.dfs[word_id], tfidf.idfs[word_id]))
        
        
sims = index[vec_tfidf]
sims_tfidf = sorted(enumerate(sims), key=lambda item: -item[1])
print("Q:", q)
for i in sims_tfidf:
    print('doc', i[0], documents[i[0]], i[1])

human – 2 – 2.1699
computer – 2 – 2.1699
Q: human computer interaction
doc 0 Human machine interface for lab abc computer applications 0.81649655
doc 3 System and human system engineering testing of EPS 0.3477732
doc 1 A survey of user opinion of computer system response time 0.31412902
doc 2 The EPS user interface management system 0.0
doc 4 Relation of user perceived response time to error measurement 0.0
doc 5 The generation of random binary unordered trees 0.0
doc 6 The intersection graph of paths in trees 0.0
doc 7 Graph minors IV Widths of trees and well quasi ordering 0.0
doc 8 Graph minors A survey 0.0


In [13]:
## Solution 2

index = similarities.MatrixSimilarity(lsi[corpus])  # index over the original texts represented in the lower-dimensional space

vec_lsi = lsi[vec]  # project the query into the lower-dimensional space
print(vec_lsi)

sims = index[vec_lsi]
sims_lsi = sorted(enumerate(sims), key=lambda item: -item[1])
print("Q:", q)
for i in sims_lsi:
    print('doc', i[0], documents[i[0]], i[1])

[(0, 0.4618210045327157), (1, -0.07002766527899983)]
Q: human computer interaction
doc 2 The EPS user interface management system 0.9984453
doc 0 Human machine interface for lab abc computer applications 0.998093
doc 3 System and human system engineering testing of EPS 0.9865886
doc 1 A survey of user opinion of computer system response time 0.93748635
doc 4 Relation of user perceived response time to error measurement 0.90755945
doc 8 Graph minors A survey 0.05004177
doc 7 Graph minors IV Widths of trees and well quasi ordering -0.09879463
doc 6 The intersection graph of paths in trees -0.10639259
doc 5 The generation of random binary unordered trees -0.12416792
