<a href="https://colab.research.google.com/github/isegura/TextSimilarity/blob/master/Metrics_for_Text_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Metrics for Text Similarity

Computing the similarity between texts is necessary in many tasks such as search engines (which have to provide relevant documents for a given query) or customer services (which should be able to understand semantically similar queries from users and provide a uniform response). For example, if the user asks “What has happened to my delivery?” or “What is wrong with my shipping?”, the user will expect the same response.

The goal of text similarity is to determine how ‘close’ two pieces of text are 
in (1) meaning (**semantic similarity**) or (2) surface closeness (**lexical similarity**). 
For instance, how similar are the phrases *the dog bites the man* and *the man bites the dog* if we look only at the words? The two phrases are lexically similar because they contain exactly the same words. However, they have very different meanings. 
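
The word-order example can be checked in a few lines. This is only a minimal sketch, assuming naive whitespace tokenization rather than the spaCy preprocessing used later in this notebook; the variable names are illustrative.

```python
# Two phrases with the same words in a different order.
phrase1 = "the dog bites the man"
phrase2 = "the man bites the dog"

# Assumption: naive whitespace tokenization, for illustration only.
words1 = set(phrase1.split())
words2 = set(phrase2.split())

same_vocabulary = words1 == words2   # the sets of words are identical
same_sentence = phrase1 == phrase2   # the word order (and meaning) differs

print(same_vocabulary, same_sentence)
```

A purely word-based comparison sees the two phrases as identical, which is exactly the limitation of lexical similarity described above.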

We will study two metrics: (1) Jaccard similarity, which measures lexical similarity, and (2) cosine distance, which, when computed over word embeddings, takes the contexts of the words into account and is useful for measuring semantic similarity. 



## Jaccard similarity

Jaccard similarity, also known as intersection over union, is defined as the size of the intersection of two sets divided by the size of their union. Let’s take two sentences as an example:


Sentence 1: AI is our friend and it has been friendly

Sentence 2: AI and humans have always been friendly

<img src='https://miro.medium.com/max/926/1*NSK8ERXexyIZ_SRaxioFEg.png'/>

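To calculate the Jaccard similarity, we first perform lemmatization so that different forms of a word are reduced to the same base form. Ideally, related forms such as “friend” and “friendly”, or “has” and “have”, end up as a single token (the lemma of “has” is “have”), although how far words are merged depends on the lemmatizer. The function below also removes stop words.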

In [0]:
import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')

def preprocess(text):
  """removes the stop words and returns the list of lemmas"""
  doc = spacy_nlp(text)
  tokens = [token.lemma_.lower() for token in doc if not token.is_stop]
  return tokens



spaCy Version: 2.1.9


In [0]:


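def jaccard_similarity(list1, list2):
    """Jaccard similarity of two token lists, compared as sets."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2  # duplicates in the input lists are counted once
    if not union:
        return 0.0  # both inputs are empty
    return float(len(set1 & set2)) / len(union)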


text1="AI is our friend and it has been friendly"
text2="AI and humans have always been friendly"
tokens1=preprocess(text1)
tokens2=preprocess(text2)



jaccard_similarity(tokens1,tokens2)

0.5
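
To see where this value comes from, the ratio can be reproduced by hand. The sketch below assumes that preprocessing leaves three tokens per sentence, as listed; these token lists are an assumption made for illustration, and the actual spaCy output may differ between versions.

```python
# Assumed token lists after stop-word removal and lemmatization (hypothetical).
tokens_a = ["ai", "friend", "friendly"]
tokens_b = ["ai", "human", "friendly"]

shared = set(tokens_a) & set(tokens_b)     # tokens present in both sentences
combined = set(tokens_a) | set(tokens_b)   # all distinct tokens

score = len(shared) / len(combined)        # 2 shared / 4 distinct
print(sorted(shared), len(combined), score)
```

Under this assumption, two of the four distinct tokens are shared, which gives 2 / 4 = 0.5.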

The following texts have no words in common, so their Jaccard similarity is 0. However, their meanings are very similar. Jaccard similarity therefore fails to capture the semantic similarity of these two sentences.


In [0]:
text1="President greets the press in Chicago"
text2="Obama speaks in Illinois"

tokens1=preprocess(text1)
tokens2=preprocess(text2)

jaccard_similarity(tokens1,tokens2)

0.0


## Cosine distance
Instead of doing a word-for-word comparison, we also need to pay attention to context in order to capture more of the semantics. Word embeddings allow us to capture semantic relations between words. 

Cosine similarity measures the cosine of the angle between two vectors: the smaller the angle, the higher the cosine similarity. Cosine distance is 1 minus the cosine similarity, so similar vectors have a distance close to 0.
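
The cosine of the angle is the dot product of the two vectors divided by the product of their lengths. A minimal sketch of this computation in plain Python follows; the helper names `cosine_similarity` and `cosine_distance` are illustrative and not part of the notebook, and the vectors are toy values rather than real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, taken here as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # angle of 90 degrees
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])        # angle of 180 degrees

print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: `[1.0, 0.0]` and `[2.0, 0.0]` are treated as identical.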


### Exercise

In this exercise, you should define a function that allows us to measure the similarity between two documents (or sentences) based on the cosine distance.

The idea is to use a pre-trained word embedding model and the cosine distance. You can load a model using the gensim library (for example, the pre-trained word embedding model **GoogleNews-vectors-negative300.bin**, which we used in the notebook about Gensim). A document can be represented by the mean vector of its word embeddings. Beforehand, the texts should be cleaned by removing stop words, and their tokens should be normalized (by lemmatization) to reduce noise.  


The **numpy** module provides a function, **mean**, to calculate the mean vector of a set of vectors (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html).

To obtain the cosine distance between two vectors, you can use the **scipy** library, in particular its function **cosine** (see 
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html). Note that this function returns the distance, that is, 1 minus the cosine similarity.
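
A small sketch of how these two functions behave, using made-up 3-dimensional vectors in place of real word embeddings (the values and variable names are purely illustrative):

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical "word vectors" for a two-word document (not real embeddings).
word_vectors = np.array([[1.0, 0.0, 2.0],
                         [3.0, 2.0, 0.0]])

# axis=0 averages across the words, giving a single vector for the document.
doc_vector = np.mean(word_vectors, axis=0)

# scipy returns the cosine *distance*, i.e. 1 minus the cosine similarity.
d_same = distance.cosine(doc_vector, 2 * doc_vector)          # same direction
d_orth = distance.cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])    # orthogonal vectors

print(doc_vector)
```

Without `axis=0`, `np.mean` would average every number in the array into a single scalar, which is not what is needed for a document vector.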



First, we load the word embedding model with gensim. This can take a few minutes.

In [0]:
from google.colab import drive
drive.mount("/content/drive/")

sst_home='drive/My Drive/Colab Notebooks/'
#Modify your folder
sst_home += 'TESI/5-TextSimilarity/'

!ls 'drive/My Drive/Colab Notebooks/TESI/5-TextSimilarity/data/'

from gensim.models import KeyedVectors
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format(sst_home+'data/GoogleNews-vectors-negative300.bin', binary=True)


Mounted at /content/drive/
GoogleNews-vectors-negative300.bin




Implement your solution:

In [0]:
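import numpy as np 
import scipy.spatial.distance  # makes scipy.spatial.distance.cosine available


def cosine_distance_wordembedding(s1, s2):
    print(s1)
    print(s2)
    
    # Complete your code: compute `cosine`, the cosine distance between
    # the mean word-embedding vectors of s1 and s2
    print('The word-embedding method with cosine distance estimates that the two sentences are', round((1-cosine)*100,2), '% similar')
    print()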

text1="President greets the press in Chicago"
text2="Obama speaks in Illinois"

cosine_distance_wordembedding(text1, text2)

cosine_distance_wordembedding(text1, text1)

text1="AI is our friend and it has been friendly"
text2="AI and humans have always been friendly"

cosine_distance_wordembedding(text1, text2)
