# Check Similarity Between Documents

In [31]:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm') 
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [4]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

This builds a two-dimensional LSI (Latent Semantic Indexing) model: `num_topics=2` projects every document onto two latent topics.

In [7]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, 0.4618210045327159), (1, -0.07002766527900048)]


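Because the `lsi` model was trained on bag-of-words vectors, any new input must first be converted to a bag-of-words vector with the same dictionary before it is projected into LSI space. The word `interaction` is not in the dictionary, so only `human` and `computer` contribute to the query vector.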
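To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what the conversion does conceptually. The helper `to_bow` is hypothetical and written only for illustration; it is not gensim's implementation of `doc2bow`. The vocabulary is copied from the `token2id` output above.

```python
from collections import Counter

# Vocabulary copied from the dictionary.token2id output shown earlier.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(text, vocab):
    """Hypothetical helper: count tokens found in vocab, ignore the rest."""
    counts = Counter(vocab[tok] for tok in text.lower().split() if tok in vocab)
    # Return (token_id, count) pairs ordered by token id.
    return sorted(counts.items())


bow = to_bow("Human computer interaction", token2id)
print(bow)  # [(0, 1), (1, 1)]
```

The unknown word `interaction` is silently dropped, which is why the query ends up described by only two token ids.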

#### Cosine Similarity
Cosine similarity measures the angle between two non-zero vectors in an inner product space. The score is the cosine of that angle: two vectors are unrelated if the angle between them is `90` degrees (score 0) and maximally similar if the angle is `0` degrees (score 1). It is a standard measure in `Vector Space Modeling`, but when the vectors represent a `Probability Distribution`, a measure such as `Kullback-Leibler divergence` may be more appropriate.
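As a small illustration of the definition, the sketch below computes cosine similarity by hand for a few toy vectors. The function name `cosine_similarity` and the example vectors are assumptions made for this illustration; gensim computes the same quantity internally when the index is queried.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Same direction (angle 0): score close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular (angle 90 degrees): score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction (angle 180 degrees): score close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why `[1, 2]` and `[2, 4]` count as identical.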

#### Find Similarity
We project every document in the corpus into the same `lsi` space as the query and index the result.

###### `similarities.MatrixSimilarity()` is only suitable when the entire corpus fits in memory
If memory is not sufficient, use `similarities.Similarity()` instead, which splits the index into shards stored on disk.

In [20]:
index = similarities.MatrixSimilarity(lsi[corpus])

In [21]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

#### Similarity of the query `Human computer interaction` (stored in `vec_lsi`) to each document

In [22]:
sims = index[vec_lsi]
print(list(enumerate(sims)))

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.09879463), (8, 0.05004177)]


#### Sorting the data

In [23]:
from pprint import pprint
# sort (document index, similarity) pairs by descending similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])
pprint(sims)

[(2, 0.9984453),
 (0, 0.998093),
 (3, 0.9865886),
 (1, 0.93748635),
 (4, 0.90755945),
 (8, 0.05004177),
 (7, -0.09879463),
 (6, -0.10639259),
 (5, -0.12416792)]


The documents in the ranked order above, most similar first (0-based corpus index in parentheses):

1. The EPS user interface management system (document 2)
2. Human machine interface for lab abc computer applications (document 0)
3. System and human system engineering testing of EPS (document 3)
4. A survey of user opinion of computer system response time (document 1)
5. Relation of user perceived response time to error measurement (document 4)
6. Graph minors A survey (document 8)
7. Graph minors IV Widths of trees and well quasi ordering (document 7)
8. The intersection graph of paths in trees (document 6)
9. The generation of random binary unordered trees (document 5)

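The documents with corpus index 2 ("The EPS user interface management system") and 4 ("Relation of user perceived response time to error measurement") share no words with the query `Human computer interaction`, yet both receive high similarity scores. A plain keyword search would never return them. LSI ranks them highly because it places documents that use related vocabulary (`user`, `interface`, `system`, `response`) close to the query in the latent topic space.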
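The lack of word overlap can be checked without gensim. The sketch below is an illustration under stated assumptions: the document texts and vocabulary are copied from this page, the keys are the 0-based corpus indices used in the similarity output, and the helper `known_tokens` is hypothetical.

```python
# Dictionary vocabulary, copied from the token2id output earlier in the notebook.
vocab = {'computer', 'human', 'interface', 'response', 'survey', 'system',
         'time', 'user', 'eps', 'trees', 'graph', 'minors'}

query = "Human computer interaction"

# Three of the corpus documents, keyed by their index in the similarity output.
documents = {
    0: "Human machine interface for lab abc computer applications",
    2: "The EPS user interface management system",
    4: "Relation of user perceived response time to error measurement",
}


def known_tokens(text):
    """Hypothetical helper: lowercase tokens that appear in the dictionary."""
    return {tok for tok in text.lower().split() if tok in vocab}


query_tokens = known_tokens(query)
overlap = {doc_id: known_tokens(text) & query_tokens
           for doc_id, text in documents.items()}

for doc_id, shared in sorted(overlap.items()):
    print(doc_id, sorted(shared))
```

Document 0 shares `computer` and `human` with the query, while documents 2 and 4 share nothing, so their high scores come entirely from the LSI projection.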