# Metrics for text similarity: Jaccard and Rouge


There are several metrics that we can use to measure the similarity of two texts. In this notebook, we will review and implement some of these metrics.  
Source: 
https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50


Two of the most widely used metrics for measuring text similarity are **Jaccard similarity** and **cosine distance**.

**Jaccard similarity** focuses only on lexical similarity: it is based on the number of words that two texts have in common. This approach cannot capture semantic relations between words (for example, synonyms or antonyms).

On the other hand, the similarity provided by the **cosine distance** depends on the quality of embeddings (vectors). If the vectors capture semantic and syntactic relations between words in a text, the cosine distance should be able to reflect these relations, becoming a better metric for text similarity. 
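
To make the idea of cosine distance concrete, here is a minimal sketch of cosine similarity on plain Python lists. The three vectors are hypothetical 3-dimensional "embeddings" invented for illustration; they do not come from any real model, and real embeddings have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for this example only.
vec_guide = [0.9, 0.1, 0.3]
vec_manual = [0.8, 0.2, 0.4]
vec_love = [0.1, 0.9, 0.2]

sim_close = cosine_similarity(vec_guide, vec_manual)
sim_far = cosine_similarity(vec_guide, vec_love)

print("similarity guide vs manual:", round(sim_close, 3))
print("similarity guide vs love:  ", round(sim_far, 3))
# Cosine distance is commonly defined as 1 - cosine similarity.
print("distance guide vs manual:  ", round(1 - sim_close, 3))
```

Vectors that point in similar directions get a similarity close to 1, so the quality of the result depends entirely on how well the vectors encode meaning.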

## Jaccard similarity
Let us start with the definition of Jaccard similarity. It is given by the following equation:

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9dfe117504d4a74fc8f5d44445756380153cd576">

where A and B are the sets of tokens of the two sentences (texts) being compared.
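
As a small worked example of the definition, the sketch below computes the score by hand for two token sets. The sets are written out manually for illustration (already lowercased and without punctuation), so no tokenizer is involved.

```python
# Two small token sets, written out by hand for illustration.
A = {"where", "can", "i", "find", "the", "user", "guide"}
B = {"where", "can", "i", "find", "the", "manual"}

intersection = A & B  # tokens present in both sets
union = A | B         # tokens present in at least one set

jaccard = len(intersection) / len(union)

print(sorted(intersection))
print(len(intersection), "/", len(union), "=", jaccard)
```

Here the two sets share 5 tokens out of 8 distinct tokens overall, so the score is 5/8 = 0.625.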


Below we implement the function and apply it to several pairs of sentences to compare them.

In [44]:
def get_jaccard_sim(text1, text2):
    """calculates the jaccard similarity of the two input texts""" 
    tokens1 = set(text1.split())  # unique tokens in text1
    tokens2 = set(text2.split())  # unique tokens in text2
    common = tokens1.intersection(tokens2)  # tokens that appear in both texts
    union = tokens1.union(tokens2)  # tokens that appear in at least one text

    # Jaccard is the number of shared tokens divided by the number of distinct tokens overall
    return len(common) / len(union)



In [45]:
s1="Where can I find the user guide?"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s1,get_jaccard_sim(s1,s1)))

s2="Where can I find the true love?"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim(s1,s2)))

s2="Where are the manual?"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim(s1,s2)))


jaccard similarity of 'Where can I find the user guide?' and 'Where can I find the user guide?' is : 1.0

jaccard similarity of 'Where can I find the user guide?' and 'Where can I find the true love?' is : 0.5555555555555556

jaccard similarity of 'Where can I find the user guide?' and 'Where are the manual?' is : 0.2222222222222222


We can improve the implementation of this function. For example, it should not be case sensitive. The following two sentences:
- s1="Where can I find the user guide?"
- s2="where can i find the user guide?"

are identical except for the case of some words, yet their Jaccard score is only 0.56:


In [75]:
s1="Where can I find the user guide?"
s2="where can i find the user guide?"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim(s1,s2)))




jaccard similarity of 'Where can I find the user guide?' and 'where can i find the user guide?' is : 0.5555555555555556


Now we redefine the function so that tokens are converted to lowercase.
We also use the spaCy library to tokenize the sentences.


In [None]:
#!python3 -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')           # load model package "en_core_web_sm"
print('spacy.en loaded')


In [76]:
def get_jaccard_sim2(text1, text2):
    """calculates the jaccard similarity of the two input texts. 
    tokens are transformed to lowercase""" 
    doc1=nlp(text1)
    doc2=nlp(text2)
    tokens1=[token.text.lower() for token in doc1]
    tokens2=[token.text.lower() for token in doc2]
    
    set1 = set(tokens1)
    set2 = set(tokens2)

    set_union = set1.union(set2)
    set_intersection = set1.intersection(set2)
    
    result=len(set_intersection) / len(set_union)
    
    return result

In [70]:
s1="Where can I find the user guide?"
s2="where can i find the user guide?"
print("\njaccard similarity (version 1) of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim(s1,s2)))
print("jaccard similarity (version 2) of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim2(s1,s2)))



jaccard similarity (version 1) of 'Where can I find the user guide?' and 'where can i find the user guide?' is : 0.5555555555555556
jaccard similarity (version 2) of 'Where can I find the user guide?' and 'where can i find the user guide?' is : 1.0
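
For readers who do not want to install spaCy, the following is a minimal dependency-free sketch of the same idea. The names `simple_tokens` and `jaccard_regex` are hypothetical helpers, and the regular expression is a deliberately crude tokenizer (an assumption, not what spaCy does). Its scores can differ from `get_jaccard_sim2`, because spaCy keeps punctuation marks as separate tokens while this sketch drops them.

```python
import re

def simple_tokens(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def jaccard_regex(text1, text2):
    """Jaccard similarity over lowercased, punctuation-free tokens."""
    set1 = set(simple_tokens(text1))
    set2 = set(simple_tokens(text2))
    union = set1 | set2
    if not union:
        return 0.0  # both texts empty: convention chosen for this sketch
    return len(set1 & set2) / len(union)

print(simple_tokens("Where can I find the user guide?"))
# Case no longer matters.
print(jaccard_regex("Where can I find the user guide?",
                    "where can i find the user guide?"))
# "guide?" and "guide" now count as the same token.
print(jaccard_regex("Where is the guide?", "I need the guide"))
```

With plain `split()`, the token `guide?` would not match `guide`; stripping punctuation removes that mismatch.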


Another improvement is to use lemmas (or stems) instead of tokens, which reduces the vocabulary:

In [71]:
def get_jaccard_sim3(text1, text2):
    """calculates the jaccard similarity of the two input texts.
    We obtain lemmas instead of tokens."""

    doc1=nlp(text1)
    doc2=nlp(text2)
    lemmas1=[token.lemma_.lower() for token in doc1]
    lemmas2=[token.lemma_.lower() for token in doc2]
    
    set1 = set(lemmas1)
    set2 = set(lemmas2)

    set_union = set1.union(set2)
    set_intersection = set1.intersection(set2)
    
    return len(set_intersection) / len(set_union)
    

In [77]:
s1="AI is our friend and it has been friendly"
s2="AI and humans have always been friendly"

print("\njaccard similarity (without any preprocessing) of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim(s1,s2)))
print("jaccard similarity (with lowercase) of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim2(s1,s2)))
print("jaccard similarity (with lemmas) of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim3(s1,s2)))


jaccard similarity (without any preprocessing) of 'AI is our friend and it has been friendly' and 'AI and humans have always been friendly' is : 0.3333333333333333
jaccard similarity (with lowercase) of 'AI is our friend and it has been friendly' and 'AI and humans have always been friendly' is : 0.3333333333333333
jaccard similarity (with lemmas) of 'AI is our friend and it has been friendly' and 'AI and humans have always been friendly' is : 0.5555555555555556


## Limitations of Jaccard similarity

The next example shows two sentences with opposite meanings that nevertheless obtain the maximum Jaccard score, 1. This happens because Jaccard similarity works on sets of tokens, so it ignores the **order of words** (and how often each word occurs), which is often critical for the meaning of a text.

In [78]:
s1="The hotel was very good, and not expensive"
s2="The hotel was not very good, and not expensive"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim3(s1,s2)))



jaccard similarity of 'The hotel was very good, and not expensive' and 'The hotel was not very good, and not expensive' is : 1.0
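
One way to make the score sensitive to local word order is to compare sets of n-grams instead of sets of single tokens. The sketch below is an illustration under stated assumptions: `tokens`, `ngram_set` and `jaccard_ngrams` are hypothetical helpers, and the regex tokenizer is a crude stand-in for spaCy. It only captures order within a window of n words, so it is a partial mitigation rather than a fix.

```python
import re

def tokens(text):
    """Lowercased word tokens without punctuation (crude regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_set(toks, n):
    """Set of all contiguous n-token tuples in the token list."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard_ngrams(text1, text2, n=2):
    """Jaccard similarity computed over n-gram sets."""
    set1 = ngram_set(tokens(text1), n)
    set2 = ngram_set(tokens(text2), n)
    union = set1 | set2
    if not union:
        return 0.0  # texts shorter than n tokens: convention for this sketch
    return len(set1 & set2) / len(union)

s1 = "The hotel was very good, and not expensive"
s2 = "The hotel was not very good, and not expensive"

print("unigrams:", jaccard_ngrams(s1, s2, n=1))
print("bigrams: ", jaccard_ngrams(s1, s2, n=2))
```

On unigrams the two sentences are indistinguishable, while on bigrams the score drops to 6/9 because `was not` and `not very` appear only in the second sentence.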


On the other hand, the following cell shows two sentences with very close meanings but a low Jaccard similarity. Here the problem is that Jaccard similarity does not take into account the **semantic relations** between words (in this case, synonymy).

In [80]:
s1="The hotel was very good, and not expensive"
s2="The inn was especially nice, and not overprice"
print("\njaccard similarity of '{}' and '{}' is : {}".format(s1,s2,get_jaccard_sim3(s1,s2)))


jaccard similarity of 'The hotel was very good, and not expensive' and 'The inn was especially nice, and not overprice' is : 0.38461538461538464


## More metrics to measure the lexical similarity

ROUGE is a set of metrics that measure the lexical similarity of two texts. They are usually used for tasks such as text summarization or machine translation. For example, in text summarization, ROUGE is used to compare the generated summary with the reference summary provided in the dataset.

The most popular ROUGE metrics are:

- ROUGE-N: measures the overlap of n-grams (sequences of n tokens) between the two texts to be compared. So, ROUGE-1 refers to the overlap of unigrams, while ROUGE-2 refers to the overlap of bigrams.
- ROUGE-L: is based on the length of the longest common subsequence of the two texts.
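
To show what these two definitions compute, here is a from-scratch sketch, not the implementation of any library. It assumes whitespace tokenization and clipped n-gram counts; existing packages may tokenize and count differently, so their numbers can differ. The function names and the two example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """List of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(overlap, hyp_total, ref_total):
    """Precision, recall and F1 from an overlap count and the two totals."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

def rouge_n(hypothesis, reference, n=1):
    """ROUGE-N sketch: n-gram overlap, each n-gram counted at most as often as in both texts."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((hyp & ref).values())  # Counter & keeps the minimum count
    return prf(overlap, sum(hyp.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming, one row at a time)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis, reference):
    """ROUGE-L sketch: LCS length relative to each text's length."""
    hyp, ref = hypothesis.split(), reference.split()
    return prf(lcs_length(hyp, ref), len(hyp), len(ref))

hypothesis_text = "the cat sat on the mat"
reference_text = "the cat is on the mat"

print("rouge-1", rouge_n(hypothesis_text, reference_text, n=1))
print("rouge-2", rouge_n(hypothesis_text, reference_text, n=2))
print("rouge-l", rouge_l(hypothesis_text, reference_text))
```

Precision divides the overlap by the size of the first text (the hypothesis) and recall divides it by the size of the second (the reference), which is why swapping the arguments swaps `p` and `r`.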

There are several Python libraries that implement these metrics. For example, https://pypi.org/project/rouge/



In [81]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [100]:
from rouge import Rouge 
s1="Where can I find the user guide"
s2="Where can I find the manual"

rouge = Rouge()
scores = rouge.get_scores(s1, s2)
scores=scores[0]
for s in scores.keys():
    print(s,scores[s])

rouge-1 {'f': 0.7692307642603551, 'p': 0.7142857142857143, 'r': 0.8333333333333334}
rouge-2 {'f': 0.7272727223140496, 'p': 0.6666666666666666, 'r': 0.8}
rouge-l {'f': 0.7692307642603551, 'p': 0.7142857142857143, 'r': 0.8333333333333334}
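
The ROUGE-1 numbers above can be checked by hand: the two sentences share 5 unigrams, the first has 7 tokens and the second has 6. The sketch below redoes that arithmetic and computes `f` as the harmonic mean of `p` and `r`. The tiny difference in the last digits of the reported `f` presumably comes from a small smoothing constant inside the package; that is an assumption, not something verified against its source.

```python
shared = 5  # Where, can, I, find, the

p = shared / 7  # precision: shared unigrams over the 7 tokens of s1
r = shared / 6  # recall: shared unigrams over the 6 tokens of s2
f = 2 * p * r / (p + r)  # harmonic mean of precision and recall

print("p =", p)
print("r =", r)
print("f =", f)
```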


In [101]:
s1="Where can I find the user guide"
s2="Where can I find the true love"

rouge = Rouge()
scores = rouge.get_scores(s1, s2)
scores=scores[0]
for s in scores.keys():
    print(s,scores[s])

rouge-1 {'f': 0.7142857092857143, 'p': 0.7142857142857143, 'r': 0.7142857142857143}
rouge-2 {'f': 0.6666666616666668, 'p': 0.6666666666666666, 'r': 0.6666666666666666}
rouge-l {'f': 0.7142857092857143, 'p': 0.7142857142857143, 'r': 0.7142857142857143}


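ROUGE shares the limitations of Jaccard similarity. The two sentence pairs used earlier illustrate this: the pair with opposite meanings scores high, while the pair with close meanings scores low: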


In [103]:
s1="The hotel was very good, and not expensive"
s2="The hotel was not very good, and not expensive"
rouge = Rouge()
scores = rouge.get_scores(s1, s2)
scores=scores[0]
for s in scores.keys():
    print(s,scores[s])

rouge-1 {'f': 0.9411764656055364, 'p': 1.0, 'r': 0.8888888888888888}
rouge-2 {'f': 0.7999999950222222, 'p': 0.8571428571428571, 'r': 0.75}
rouge-l {'f': 0.999999995, 'p': 1.0, 'r': 1.0}


In [102]:
from rouge import Rouge 

s1="The hotel was very good, and not expensive"
s2="The inn was especially nice, and not overprice"
rouge = Rouge()
scores = rouge.get_scores(s1, s2)
scores=scores[0]
for s in scores.keys():
    print(s,scores[s])

rouge-1 {'f': 0.4999999950000001, 'p': 0.5, 'r': 0.5}
rouge-2 {'f': 0.14285713785714302, 'p': 0.14285714285714285, 'r': 0.14285714285714285}
rouge-l {'f': 0.4999999950000001, 'p': 0.5, 'r': 0.5}
