# Metrics for text similarity: cosine distance

In the previous notebook, we studied two metrics for measuring the similarity between texts: Jaccard and ROUGE. These metrics only capture lexical similarity.

In this notebook, we learn how to measure similarity with the **cosine similarity**, that is, the cosine of the angle between two vectors. The related **cosine distance** is simply 1 minus the cosine similarity.


<img src="https://miro.medium.com/max/852/1*hub04IikybZIBkSEcEOtGA.png">


where A and B are vectors (sentence embeddings) representing two texts. 


This metric depends on the embeddings (vectors) that represent the texts being compared. If the vectors capture semantic and syntactic relations between the words of a text, the cosine similarity should reflect these relations and therefore be a better metric for text similarity.
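
As a minimal sketch of the formula above, the cosine can be computed directly with NumPy: the dot product of the two vectors divided by the product of their Euclidean norms. The helper name `cosine_sim` is our own choice for illustration and is not part of any library.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim = cosine_sim([1, 0, -1], [-1, -1, 0])
print(round(sim, 4))
```

The helper assumes that neither vector is all zeros; otherwise the denominator would be zero.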


The sklearn library already provides an implementation of this measure, `cosine_similarity`:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])

array([[-0.5]])
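
Note that `cosine_similarity` returns a similarity (1 means same direction, 0 means orthogonal, -1 means opposite direction). If a distance is needed, sklearn also offers `cosine_distances`, which is 1 minus the similarity. A small sketch comparing both on the vectors used above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

a = [[1, 0, -1]]
b = [[-1, -1, 0]]

similarity = cosine_similarity(a, b)
distance = cosine_distances(a, b)

print("similarity:", similarity)
print("distance:  ", distance)
# the distance is 1 minus the similarity
print(np.allclose(distance, 1 - similarity))
```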

However, we first have to represent the texts as vectors.


## Bag-of-Words model

The best-known model is the Bag-of-Words (BoW) model, which has been successfully applied to text classification for many years. To build this model, we will use two libraries:
- NLTK to remove stopwords
- sklearn to create the vocabulary and represent the texts as vectors.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We consider a small corpus containing three sentences. We use this corpus to illustrate how to build a BoW model with sklearn.

In [None]:
texts = ["The hotel was very expensive and not good",
"The hotel was very good and not expensive",
"The hotel was horrible"]


In [None]:
from operator import itemgetter

from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer (newer scikit-learn versions require the stop words as a list, not a set)
bow = CountVectorizer(stop_words=list(stop_words))
# tokenize and build the vocabulary
bow.fit(texts)
# summarize
print('Vocabulary size:', len(bow.vocabulary_))
print(sorted(bow.vocabulary_.items(), key=itemgetter(1)))


Vocabulary size: 4
[('expensive', 0), ('good', 1), ('horrible', 2), ('hotel', 3)]


Now, we represent each text by using this model:

In [None]:
embeddings = []
for text in texts:
    # encode the text as a vector of word counts
    vector = bow.transform([text])
    embeddings.append(vector)
    # summarize the encoded vector
    print('{}->:\t{}'.format(text, vector.toarray()))


The hotel was very expensive and not good->:	[[1 1 0 1]]
The hotel was very good and not expensive->:	[[1 1 0 1]]
The hotel was horrible->:	[[0 0 1 1]]


For example, we can calculate the similarity of the first sentence with the other sentences. 

In [None]:
print(cosine_similarity( embeddings[0], embeddings[1] ))
print(cosine_similarity( embeddings[0], embeddings[2] ))


[[1.]]
[[0.40824829]]
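
Instead of comparing the sentences two at a time, all pairwise similarities can be obtained in a single call by passing the whole document-term matrix to `cosine_similarity`. The sketch below is self-contained: it assumes a small hand-written stop word list (`toy_stop_words`, our own name) in place of the NLTK list, so that no download is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The hotel was very expensive and not good",
    "The hotel was very good and not expensive",
    "The hotel was horrible",
]

# assumption: a tiny stop word list that covers this toy corpus
toy_stop_words = ["the", "was", "very", "and", "not"]

vectorizer = CountVectorizer(stop_words=toy_stop_words)
X = vectorizer.fit_transform(corpus)  # one row per sentence

# entry [i, j] is the similarity between sentence i and sentence j
sim_matrix = cosine_similarity(X)
print(sim_matrix.round(3))
```

The matrix is symmetric and its diagonal contains the similarity of each sentence with itself.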


To train a useful BoW model, you should use a large collection of texts. NLTK provides several corpora that you can use for this purpose. For example, it includes a selection of books from Project Gutenberg.

In [None]:
# first we must download the corpus (uncomment on the first run)
#nltk.download('gutenberg')
#nltk.download('punkt')
gb = nltk.corpus.gutenberg
print("Last five books in Gutenberg:", gb.fileids()[-5:])
text_sent = gb.sents()


Last five books in Gutenberg: ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [None]:
print("total number of sentences", len(text_sent))
print(text_sent[10])

total number of sentences 98552
['The', 'danger', ',', 'however', ',', 'was', 'at', 'present', 'so', 'unperceived', ',', 'that', 'they', 'did', 'not', 'by', 'any', 'means', 'rank', 'as', 'misfortunes', 'with', 'her', '.']


In [None]:
# each sentence is a list of tokens; join them back into a single string
texts = [' '.join(sentence) for sentence in text_sent]

In [None]:
# create the vectorizer (newer scikit-learn versions require the stop words as a list, not a set)
bow = CountVectorizer(stop_words=list(stop_words))
# tokenize and build the vocabulary
bow.fit(texts)
# summarize
print('Vocabulary size:', len(bow.vocabulary_))

# the full vocabulary is very long; uncomment to print it
#print(sorted(bow.vocabulary_.items(), key=itemgetter(1)))

Vocabulary size: 41919


We can use the BoW model to represent a list of sentences, and then obtain their similarities.

In [None]:
sentences = ["Where can I find the user guide?",
             "Where can I find the true love?",
             "Where can I find the manual?",
             "where can i find the user guide?",
             "The streets are plenty of people"]
# represent each sentence with the BoW model
embeddings = [bow.transform([sentence]) for sentence in sentences]

In [None]:
for i in range(len(sentences)):
    sentence1 = sentences[i]
    vector1 = embeddings[i]
    # compare each sentence only with the ones that follow it
    for j in range(i + 1, len(sentences)):
        sentence2 = sentences[j]
        vector2 = embeddings[j]

        result = cosine_similarity(vector1, vector2)
        print("\nCosine similarity '{}' and '{}' is {}".format(sentence1, sentence2, result))


Cosine similarity 'Where can I find the user guide?' and 'Where can I find the true love?' is [[0.33333333]]

Cosine similarity 'Where can I find the user guide?' and 'Where can I find the manual?' is [[0.40824829]]

Cosine similarity 'Where can I find the user guide?' and 'where can i find the user guide?' is [[1.]]

Cosine similarity 'Where can I find the user guide?' and 'The streets are plenty of people' is [[0.]]

Cosine similarity 'Where can I find the true love?' and 'Where can I find the manual?' is [[0.40824829]]

Cosine similarity 'Where can I find the true love?' and 'where can i find the user guide?' is [[0.33333333]]

Cosine similarity 'Where can I find the true love?' and 'The streets are plenty of people' is [[0.]]

Cosine similarity 'Where can I find the manual?' and 'where can i find the user guide?' is [[0.40824829]]

Cosine similarity 'Where can I find the manual?' and 'The streets are plenty of people' is [[0.]]

Cosine similarity 'where can i find the user guide?' and 'The streets ar

In [None]:
s1 = "AI is our friend and it has been friendly"
s2 = "AI and humans have always been friendly"
sentences = [s1, s2]
# represent each sentence with the BoW model
embeddings = [bow.transform([sentence]) for sentence in sentences]

result = cosine_similarity(embeddings[0], embeddings[1])
print("\nCosine similarity '{}' and '{}' is {}".format(s1, s2, result))


Cosine similarity 'AI is our friend and it has been friendly' and 'AI and humans have always been friendly' is [[0.66666667]]


## TF-IDF: an extension of BoW

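In BoW, most vectors are sparse: the vocabulary is large, but each text contains only a few of its words, so almost all entries are zero. Large vocabularies therefore demand more memory and computational resources.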
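
The sparsity can be inspected directly, because sklearn returns the document-term matrix as a SciPy sparse matrix that stores only the non-zero entries. The following sketch uses a small illustrative corpus (our own selection of sentences from this notebook) and measures the fraction of non-zero entries:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The hotel was very expensive and not good",
    "The hotel was horrible",
    "Where can I find the user guide?",
    "The streets are plenty of people",
    "AI and humans have always been friendly",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix

n_docs, vocab_size = X.shape
# X.nnz is the number of stored (non-zero) entries
density = X.nnz / (n_docs * vocab_size)
print("shape:", X.shape)
print("non-zero entries:", X.nnz)
print("density:", round(density, 3))
```

Even in this tiny corpus most entries are zero; with a vocabulary of tens of thousands of words, a single sentence fills only a tiny fraction of its vector.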

TF-IDF is an extension of the BoW model that gives less weight to words that appear frequently in many texts.
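
To make this weighting visible, the sketch below fits a `TfidfVectorizer` on the three hotel sentences and prints the learned inverse document frequency (`idf_`) of each word. It assumes a small hand-written stop word list (`toy_stop_words`, our own name) so that the example runs without any download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The hotel was very expensive and not good",
    "The hotel was very good and not expensive",
    "The hotel was horrible",
]

# assumption: a tiny stop word list that covers this toy corpus
toy_stop_words = ["the", "was", "very", "and", "not"]

toy_tfidf = TfidfVectorizer(stop_words=toy_stop_words)
toy_tfidf.fit(corpus)

# the more documents contain a word, the lower its idf weight
for word, index in sorted(toy_tfidf.vocabulary_.items()):
    print(word, round(float(toy_tfidf.idf_[index]), 3))
```

Here "hotel" appears in all three sentences and gets the lowest weight, while "horrible" appears in only one and gets the highest.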



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(texts)
# summarize
#print(tfidf.vocabulary_)
print(sorted(tfidf.vocabulary_.items(), key=itemgetter(1)))






In [None]:
s1 = "AI is our friend and it has been friendly"
s2 = "AI and humans have always been friendly"
sentences = [s1, s2]
# represent each sentence with the TF-IDF model
embeddings = [tfidf.transform([sentence]) for sentence in sentences]

result = cosine_similarity(embeddings[0], embeddings[1])
print("\nCosine similarity '{}' and '{}' is {}".format(s1, s2, result))


## Limitations of BoW / TF-IDF
BoW and TF-IDF share the limitations already described for Jaccard:
- They do not consider the order of the words.
- They do not capture semantic similarity.

In [None]:
s1 = "The hotel was very good, and not expensive"
s2 = "The hotel was not very good, and not expensive"
sentences = [s1, s2]
# represent each sentence with the BoW model
embeddings = [bow.transform([sentence]) for sentence in sentences]

result = cosine_similarity(embeddings[0], embeddings[1])
print("\nCosine similarity '{}' and '{}' is {}".format(s1, s2, result))


Cosine similarity 'The hotel was very good, and not expensive' and 'The hotel was not very good, and not expensive' is [[1.]]
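
The score of 1 above arises because "not" is in the stop word list, so both sentences are mapped to exactly the same vector. The sketch below reproduces this with a small hand-written stop word list (an assumption, to avoid the NLTK download) and shows that keeping every word barely helps: the extra "not" only changes one count, and its position in the sentence is still ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

positive = "The hotel was very good, and not expensive"
negated = "The hotel was not very good, and not expensive"

# assumption: a tiny stop word list that, like the NLTK one, contains "not"
with_stop = CountVectorizer(stop_words=["the", "was", "very", "and", "not"])
vectors_stop = with_stop.fit_transform([positive, negated])
sim_stop = cosine_similarity(vectors_stop)[0, 1]

# keep every word, including "not"
keep_all = CountVectorizer()
vectors_all = keep_all.fit_transform([positive, negated])
sim_all = cosine_similarity(vectors_all)[0, 1]

print("with stop words removed:", round(sim_stop, 3))
print("keeping all words:      ", round(sim_all, 3))
```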


In [None]:
s1 = "The hotel was very good, and not expensive"
s2 = "The inn was especially nice, and not overpriced"
sentences = [s1, s2]
# represent each sentence with the BoW model
embeddings = [bow.transform([sentence]) for sentence in sentences]

result = cosine_similarity(embeddings[0], embeddings[1])
print("\nCosine similarity '{}' and '{}' is {}".format(s1, s2, result))


Cosine similarity 'The hotel was very good, and not expensive' and 'The inn was especially nice, and not overpriced' is [[0.]]


TF-IDF has the same problem: it cannot capture the order of the words. The two sentences below again receive the maximum score, even though the second one negates the first.

In [None]:
s1 = "The hotel was very good, and not expensive"
s2 = "The hotel was not very good, and not expensive"
sentences = [s1, s2]
# represent each sentence with the TF-IDF model
embeddings = [tfidf.transform([sentence]) for sentence in sentences]

result = cosine_similarity(embeddings[0], embeddings[1])
print("\nCosine similarity '{}' and '{}' is {}".format(s1, s2, result))


Cosine similarity 'The hotel was very good, and not expensive' and 'The hotel was not very good, and not expensive' is [[1.]]
