# Text similarity

How can we compute the similarity between two strings?

In [1]:
from __future__ import unicode_literals
from __future__ import division

In [2]:
a = 'Geladeira Brastemp CFR45 20L frostfree'
b = 'Geladeira Brastemp CFR45 20L com desgelo automático'

In [3]:
# Tokens shared by both strings:
tokensA = a.split()
tokensB = b.split()
set(tokensA).intersection(tokensB)

{u'20L', u'Brastemp', u'CFR45', u'Geladeira'}

In [4]:
similar = len(set(tokensA).intersection(tokensB))
total = len(set(tokensA).union(tokensB))
# shared tokens / distinct tokens (this ratio is the Jaccard similarity of the two token sets)
print('{} shared tokens out of {}: {:0.2f}% similarity'.format(similar, total, similar/total*100))

4 shared tokens out of 8: 50.00% similarity
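
The token-overlap ratio computed above is the Jaccard similarity of the two token sets. A minimal sketch that wraps it in a reusable function follows; the name `jaccard_similarity`, the whitespace tokenization, and the handling of two empty strings are assumptions made for illustration, not part of the original notebook.

```python
def jaccard_similarity(text_a, text_b):
    """Size of the token intersection divided by the size of the token union."""
    tokens_a = set(text_a.split())
    tokens_b = set(text_b.split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty strings: treat them as identical (an assumption).
        return 1.0
    return len(tokens_a & tokens_b) / float(len(union))


a = 'Geladeira Brastemp CFR45 20L frostfree'
b = 'Geladeira Brastemp CFR45 20L com desgelo automático'
score = jaccard_similarity(a, b)  # 4 shared tokens / 8 distinct tokens = 0.5
```

Because the comparison is on exact tokens, `Geladeira` and `geladeira` would count as different; lowercasing first is usually a sensible extra step.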


In [5]:
# There are several other metrics; see jellyfish, fuzzywuzzy, metaphone, etc.
import jellyfish

In [6]:
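# jaro_distance computes the plain Jaro similarity (1.0 means identical strings).
# Like its Jaro–Winkler variant (jellyfish.jaro_winkler), it is designed for and
# best suited to short strings such as person names.
jellyfish.jaro_distance(a, b)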

0.8439972480220158
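
To make the metric less opaque, here is a minimal pure-Python sketch of the Jaro similarity as it is usually defined: count characters that match within a sliding window, then penalize matched characters that appear out of order. The function name `jaro_similarity_sketch` is hypothetical, and this is not the jellyfish implementation, so small numerical differences from the library are possible.

```python
def jaro_similarity_sketch(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2.0
    m = float(matches)
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0


example = jaro_similarity_sketch('MARTHA', 'MARHTA')  # 17/18, roughly 0.944
```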

## Other possibilities:

* extract named features to measure the importance of each token
* apply some basic text preprocessing (lowercasing, stemming, etc.)
* remove stopwords
* weight the words using a measure of importance (TF-IDF, for example)
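
A few of the ideas above (lowercasing, stopword removal, TF-IDF weighting) can be sketched with the standard library alone. Everything named here is an assumption for illustration: the tiny `STOPWORDS` set, the two extra product titles used to make the IDF meaningful, the smoothed IDF formula, and the helper names `preprocess`, `tfidf_vectors` and `sparse_cosine`. A real pipeline would more likely rely on NLTK or scikit-learn.

```python
import math
from collections import Counter

# Tiny hypothetical stopword list; a real one would come from a library.
STOPWORDS = {'com', 'de', 'e', 'o', 'a'}


def preprocess(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def tfidf_vectors(documents):
    """Return one {token: tf * idf} dict per document, using a smoothed IDF."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            t: c * (math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors


def sparse_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = [
    'Geladeira Brastemp CFR45 20L frostfree',
    'Geladeira Brastemp CFR45 20L com desgelo automático',
    'Geladeira Electrolux DF42 frostfree',  # hypothetical extra title
    'Fogão Brastemp 4 bocas',               # hypothetical extra title
]
vectors = tfidf_vectors(documents)
weighted_similarity = sparse_cosine(vectors[0], vectors[1])
```

With this weighting, a rare token such as the model code `cfr45` contributes more to the score than a token like `geladeira` that appears in most titles.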

## Using word2vec to compute vector similarities

It is possible to use [word2vec](http://nbviewer.ipython.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb) or [gensim](https://radimrehurek.com/gensim/models/word2vec.html) for this.

In [13]:
# read the corpus
import codecs

# this could be done iteratively for better performance on a huge corpus
with codecs.open('corpus.txt', encoding='utf8') as fp:
    corpus = fp.read()

# truncate the corpus to speed up the example
corpus = corpus[:10000000]
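
For a corpus too large to read in one go, the iterative approach mentioned in the comment above could look like the sketch below. It assumes one sentence per line and plain whitespace tokenization, which is cruder than the NLTK tokenizers; the class name `LineSentences` is hypothetical. gensim accepts any iterable of token lists that can be iterated more than once, and it also ships a similar helper, `gensim.models.word2vec.LineSentence`.

```python
import io


class LineSentences(object):
    """Restartable iterable yielding one lowercased token list per line."""

    def __init__(self, path, encoding='utf8'):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # The file is reopened on every pass, so the object can be
        # iterated several times without holding the corpus in memory.
        with io.open(self.path, encoding=self.encoding) as fp:
            for line in fp:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens
```

An instance such as `LineSentences('corpus.txt')` could then be passed wherever a list of tokenized sentences is expected.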

In [14]:
# sentence- and word-tokenize with NLTK, lowercasing every token
from nltk import sent_tokenize, word_tokenize
sentences = [
    [w.lower() for w in word_tokenize(sentence, language='portuguese')]
    for sentence in sent_tokenize(corpus, language='portuguese')
]

In [15]:
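from gensim.models import Word2Vec
# train 100-dimensional vectors, ignoring tokens that occur fewer than 5 times
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=8)
# L2-normalize the vectors in place to save memory (the model cannot be trained further)
model.init_sims(replace=True)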

In [16]:
model.most_similar('geladeira')

[(u'ative', 0.5917613506317139),
 (u'frost', 0.585434079170227),
 (u'assistindo', 0.5765464901924133),
 (u'lavadora', 0.560122013092041),
 (u'cinematogr\xe1fica', 0.5491166114807129),
 (u'dama', 0.5454461574554443),
 (u'curtindo', 0.5446858406066895),
 (u'melhorias', 0.5411743521690369),
 (u'organizados', 0.5394273400306702),
 (u'envolver', 0.5388178825378418)]

In [26]:
# represent each string as the sum of the vectors of its in-vocabulary tokens
tokensA = [t.lower() for t in tokensA]
vectorsA = sum([model[token] for token in tokensA if token in model.vocab])

tokensB = [t.lower() for t in tokensB]
vectorsB = sum([model[token] for token in tokensB if token in model.vocab])

In [28]:
from nltk.cluster.util import cosine_distance
# cosine_distance returns 1 - cos(u, v), so subtracting it from 1 recovers the cosine similarity
print('Similarity: {}'.format(abs(1 - cosine_distance(vectorsA, vectorsB))))

Similarity: 0.804425913887
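
The last two cells can be illustrated without a trained model. The sketch below sums per-token vectors and compares the sums with cosine similarity, using only the standard library. The 3-dimensional `toy_vectors` are invented for illustration and do not come from the model above; `sum_vectors` and `cosine_similarity` are hypothetical helper names.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    'geladeira': [0.9, 0.1, 0.0],
    'brastemp': [0.7, 0.3, 0.1],
    'frostfree': [0.2, 0.9, 0.1],
    'desgelo': [0.3, 0.8, 0.2],
}


def sum_vectors(tokens, vectors):
    """Element-wise sum of the vectors of all in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the string is in the vocabulary
    return [sum(components) for components in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vec_a = sum_vectors(['geladeira', 'brastemp', 'frostfree'], toy_vectors)
vec_b = sum_vectors(['geladeira', 'brastemp', 'com', 'desgelo'], toy_vectors)
similarity = cosine_similarity(vec_a, vec_b)
```

Out-of-vocabulary tokens such as `com` are silently skipped, just as in the `if token in model.vocab` filter used earlier, so a string made only of unknown tokens needs separate handling.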
