In [6]:
# Run the following commands to install spaCy and its English models

# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md

In [1]:
import spacy
import numpy as np
import pandas as pd
from math import e

In [2]:
nlp = spacy.load("en_core_web_md")

### Code

In [3]:
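def similarity(s1, s2):
    """Return Jaccard, cosine, and exp(-Euclidean distance) scores for two texts."""
    # Convert to lowercase
    s1 = s1.lower()
    s2 = s2.lower()

    # Run both texts through the spaCy pipeline
    docs = [nlp(s) for s in [s1, s2]]

    # Jaccard: unique tokens shared by both texts / unique tokens in either text
    tokens1, tokens2 = [{token.text for token in doc} for doc in docs]
    jac = len(tokens1 & tokens2) / len(tokens1 | tokens2)

    # Document vectors (by default, doc.vector is the average of the token vectors)
    v1, v2 = [doc.vector for doc in docs]

    # Cosine similarity: dot product divided by the product of the norms
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    # Euclidean distance d, mapped to a similarity score e^(-d)
    euc = np.exp(-np.linalg.norm(v1 - v2))

    df = pd.DataFrame([[jac, cos, euc]], index=["Similarity"], columns=["Jaccard", "Cosine", "Eucledian"]).T
    return df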

### Approach

1. Convert the texts to lowercase
2. Tokenize the texts with spaCy
3. Represent each text as a single vector by averaging its word vectors (`doc.vector`)

### Evaluation Metrics

1. Jaccard: ratio of the number of unique tokens shared by both texts to the number of unique tokens in either text
2. Cosine: $\cos(\theta)$, where $\theta$ is the angle between the two document vectors
3. Euclidean: $e^{-d}$, where $d$ is the Euclidean distance between the two document vectors
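
As a minimal sketch of the three metrics listed above, the snippet below computes them on toy inputs with NumPy only. The helper names (`jaccard`, `cosine`, `exp_euclidean`) and the toy tokens and vectors are illustrative assumptions; in the notebook the tokens and vectors come from spaCy.

```python
import numpy as np

def jaccard(tokens_a, tokens_b):
    """Unique tokens shared by both lists divided by unique tokens in either."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

def cosine(v1, v2):
    """Cosine of the angle between two vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def exp_euclidean(v1, v2):
    """Map the Euclidean distance d to a score e^(-d) in (0, 1]."""
    return float(np.exp(-np.linalg.norm(v1 - v2)))

# Toy inputs (assumptions for illustration only)
tokens_a = "the cat sat".split()
tokens_b = "the cat ran".split()
v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])

print(jaccard(tokens_a, tokens_b))  # 2 shared / 4 unique = 0.5
print(cosine(v1, v2))               # orthogonal vectors -> 0.0
print(exp_euclidean(v1, v1))        # identical vectors -> 1.0
```

Cosine depends only on the direction of the vectors, whereas $e^{-d}$ shrinks quickly as the distance grows, so the two scores are not on comparable scales.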

# Performance

### Example 1 : Paraphrased texts

In [7]:
# Paraphrased texts from QuillBot
print("Text 1")
a = """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."""
b = """Alice was becoming bored of sitting on the bank with her sister and having nothing to do: she had peered inside the book her sister was reading once or twice, but it was devoid of pictures or discussions, 'and what is the use of a book without pictures or conversation?' Alice wondered. ' 
So she was debating in her head (as best she could, because the hot day had made her feel drowsy and foolish) whether the pleasure of making a daisy-chain was worth the trouble of getting up and gathering the daisies, when a White Rabbit with pink eyes dashed close by her."""
display(similarity(a,b))

print("\nText 2")
a = "It was a pleasure to burn.It was a special pleasure to see things eaten, to see things blackened and changed. With the brass nozzle in his fists, with this great python spitting its venomous kerosene upon the world, the blood pounded in his head, and his hands were the hands of some amazing conductor playing all the symphonies of blazing and burning to bring down the tatters and charcoal ruins of history. With his symbolic helmet numbered 451 on his stolid head, and his eyes all orange flame with the thought of what came next, he flicked the igniter and the house jumped up in a gorging fire that burned the evening sky red and yellow and black."
b = "It was a lot of fun to burn.It was really satisfying to see things devoured, charred, and transformed. The blood beat in his skull, and his hands were the hands of some fantastic conductor playing all the symphonies of scorching and burning to bring down the tatters and charcoal ruins of history with the brass nozzle in his fists, with this big python spitting its toxic kerosene into the world. He flicked the igniter, and the house jumped up in a gorging fire that burned the evening sky red, yellow, and black. With his symbolic helmet numbered 451 on his stolid head, and his eyes all orange flame with the thought of what came next, he flicked the igniter, and the house jumped up in a gorging fire that burned the evening sky red, yellow, and black."
display(similarity(a,b))

print("\nText 3")
a = """ABOUT 13.5 BILLION YEARS AGO, MATTER, energy, time and space came into being in what is known as the Big Bang. The story of these fundamental features of our universe is called physics.
About 300,000 years after their appearance, matter and energy started to coalesce into complex structures, called atoms, which then combined into molecules. The story of atoms, molecules and their interactions is called chemistry.
About 3.8 billion years ago, on a planet called Earth, certain molecules combined to form particularly large and intricate structures called organisms. The story of organisms is called biology."""
b = """MATTER, ENERGY, TIME, AND SPACE BEGAN ABOUT 13.5 BILLION YEARS AGO in what is known as the Big Bang. Physics is the narrative of these fundamental aspects of our universe. 
Matter and energy began to consolidate into complex formations called atoms some 300,000 years after they first appeared, which subsequently merged to form molecules. Chemistry is the study of atoms, molecules, and their interactions. 
3.8 billion years ago, on a planet named Earth, some chemicals came together to form animals, which are especially massive and intricate structures. Biology is the study of living things."""
display(similarity(a,b))

print("\nCosine works great,\nEuclidean is also reliable,\nJaccard isn't")


Text 1


array([0.6372549 , 0.99712061, 0.77839407])


Text 2


array([0.7311828 , 0.99622822, 0.75834485])


Text 3


array([0.59756098, 0.99450341, 0.72205203])


Cosine works great,
Euclidean is also reliable,
Jaccard isn't


### Example 2 : Active vs passive voices

In [5]:
a = "He is kicking the monkey"
b = "The monkey is being kicked by him"
display(similarity(a,b))

a = "Mom read the novel in one day."
b = "The novel was read by Mom in one day."
display(similarity(a,b))

print("Again, Cosine works really well")

| | Similarity |
|---|---|
| Jaccard | 0.333333 |
| Cosine | 0.929012 |
| Eucledian | 0.236978 |


| | Similarity |
|---|---|
| Jaccard | 0.8 |
| Cosine | 0.978416 |
| Eucledian | 0.470138 |


Again, Cosine works really well


### Limitations: Cannot differentiate sentences based on their meaning

In [6]:
a = "The cat was purring tonight"
b = "I am going to eat tonight"
display(similarity(a,b))

print("Cosine similarity should not be this high")

| | Similarity |
|---|---|
| Jaccard | 0.1 |
| Cosine | 0.737367 |
| Eucledian | 0.049716 |


Cosine similarity should not be this high
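
One contributing factor, though not necessarily the cause of the score above, is that an averaged document vector discards word order. spaCy's `doc.vector` defaults to the average of the token vectors, so two sentences made of the same words get the same vector regardless of meaning. The sketch below illustrates this with made-up 3-dimensional vectors; the `toy_vectors` table and the `sentence_vector` helper are hypothetical and are not spaCy APIs.

```python
import numpy as np

# Hypothetical toy embeddings; real vectors would come from the spaCy model.
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence):
    """Average the toy vectors of the whitespace-separated tokens."""
    tokens = sentence.lower().split()
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

v1 = sentence_vector("dog bites man")
v2 = sentence_vector("man bites dog")

cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(np.allclose(v1, v2))  # True: the averages are identical
print(round(cos, 6))        # 1.0
```

The two sentences describe opposite events, yet any metric computed on the averaged vectors treats them as identical.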


### Further Research

1. Contextual Embeddings
2. Sentence transformers (BERT, Universal Sentence Encoder)
3. Doc2Vec
4. Lemmatization and Stemming
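
To illustrate why lemmatization (item 4 above) might help the Jaccard score, here is a small sketch on the first active/passive pair from Example 2. The `toy_lemmas` table is hand-written for this example and is an assumption; a real pipeline would take lemmas from a lemmatizer, for instance spaCy's `token.lemma_`, and its output may differ.

```python
# Hand-written lemma table for illustration only (not produced by spaCy).
toy_lemmas = {
    "kicking": "kick",
    "kicked": "kick",
    "is": "be",
    "being": "be",
    "him": "he",
}

def jaccard(tokens_a, tokens_b):
    """Unique tokens shared by both lists divided by unique tokens in either."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

def lemmatize(tokens):
    """Replace each token by its toy lemma, leaving unknown tokens unchanged."""
    return [toy_lemmas.get(t, t) for t in tokens]

s1 = "he is kicking the monkey".split()
s2 = "the monkey is being kicked by him".split()

raw = jaccard(s1, s2)
lemmatized = jaccard(lemmatize(s1), lemmatize(s2))
print(round(raw, 6))         # 0.333333 (3 shared / 9 unique)
print(round(lemmatized, 6))  # 0.833333 (5 shared / 6 unique)
```

With word forms normalized, the only token left unmatched is "by", so the Jaccard score moves much closer to what a reader would expect for two sentences with the same meaning.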

