## Finding Semantic Textual Similarity

### Problem Statement
Given two paragraphs, quantify the degree of similarity between the two text-based on Semantic similarity. Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other. The STS task is motivated by the observation that accurately modelling the meaning similarity of sentences is a foundational language understanding problem relevant to numerous applications including machine translation (MT), summarization, generation, question-answering (QA), short answer grading, semantic search.  

STS is the assessment of pairs of sentences according to their degree of semantic similarity. The task involves producing real-valued similarity scores for sentence pairs. 


The data contains a pair of paragraphs. These text paragraphs are randomly sampled from a raw dataset. Each pair of the sentence may or may not be semantically similar. The candidate is to predict a value between 0-1 indicating a degree of similarity between the pair of text paras. 

1 means highly similar  

0 means highly dissimilar 

In [38]:
import numpy as np
import pandas as pd

import re
from tqdm import tqdm

import collections

from sklearn.cluster import KMeans

from nltk.stem import WordNetLemmatizer  # For Lemmetization of words
from nltk.corpus import stopwords  # Load list of stopwords
from nltk import word_tokenize # Convert paragraph in tokens

import pickle
import sys

from gensim.models import word2vec # For represent words in vectors
import gensim

In [37]:
import os
print(os.listdir("C:\\Users\\Deepak Jaiswal\\Desktop\\Data Science Project\\Precily Assignment"))

['Precily Assessment.zip', 'Precily_Assessment_DSFT.pdf', 'Text_Similarity_Dataset.csv', 'word2vec-GoogleNews-vectors-master', 'word2vec-GoogleNews-vectors-master.zip']


In [3]:
text_data = pd.read_csv("C:\\Users\Deepak Jaiswal\\Desktop\\Data Science Project\\Precily Assignment\\Text_Similarity_Dataset.csv")

In [4]:
# shape of the text dataset with rows and columns
print(text_data.shape)
text_data.head()

(4023, 3)


Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...
3,3,diageo to buy us wine firm diageo the world s...,mci shares climb on takeover bid shares in us ...
4,4,be careful how you code a new european directi...,media gadgets get moving pocket-sized devices ...


In [5]:
# Check the text data if any null values
text_data.isnull().sum()

Unique_ID    0
text1        0
text2        0
dtype: int64

### Preprocessing of text1 & text 2
1. Convert phrases like won't to will not using function decontracted() below
2. Remove Stopwords
3. Remove any special symbol and lower case all words
4. Lemmatizing words using WordNetLemmatizer define in function word_tokenizer below 

In [6]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't","will not", phrase)
    phrase = re.sub(r"can\'t","can not", phrase)
    
    #general
    phrase = re.sub(r"n\'t"," not", phrase)
    phrase = re.sub(r"n\'re"," are", phrase)
    phrase = re.sub(r"n\'s"," is", phrase)
    phrase = re.sub(r"n\'d"," would", phrase)
    phrase = re.sub(r"n\'ll"," will", phrase)
    phrase = re.sub(r"n\'t"," not", phrase)
    phrase = re.sub(r"n\'ve"," have", phrase)
    phrase = re.sub(r"n\'m"," am", phrase)
    return phrase

In [39]:
# Combining all the above stundents 


preprocessed_text1 = []

# tqdm is for printing the status bar

for sentance in tqdm(text_data['text1'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)

    sent = ' '.join(e for e in sent.split() if e not in stopwords.words('english'))
    preprocessed_text1.append(sent.lower().strip())

100%|██████████| 4023/4023 [08:43<00:00,  8.90it/s]


In [40]:
# Merging preprocessed _text1 in text_data

text_data['text1'] = preprocessed_text1
text_data.head()

Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail spot ads internet search ...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions miss net 2025 40 uk population still ...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short ginepri fifteen year old...,ruddock backs yapp s credentials wales coach m...
3,3,diageo buy us wine firm diageo world biggest s...,mci shares climb on takeover bid shares in us ...
4,4,careful code new european directive could put ...,media gadgets get moving pocket-sized devices ...


In [41]:
def word_tokenizer(text):
          #tokenizes and stems the next
        tokens = word_tokenize(text)
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
        return tokens
    

### Word embeddings:

1. Word embeddings are low dimensional vectors obtained by training a neural network on a large corpus to predict a word given  context (Continuous Bag Of Words model) or to predict the context given a word (skip gram model). The context is a window of surrounding words. Pre-trained word embeddings are also available in the word2vec code.google page.

2.In this i am using Google news pre trained vectors and compare similarity between text1 & text2 using n_similarity method in gensim library which is nothing compares cosine similarity between two

3.Consider it as a unsupervised problem.

In [None]:

# Load pre_trained Google News Vectors after download file

wordmodelfile="GoogleNews-vectors-negative300.bin.gz"
wordmodel= gensim.models.KeyedVectors.load_word2vec_format(wordmodelfile, binary=True)

In [24]:
# This code check if word in text1 & text2 present in our google news vectors vocabalry.
# if not it removes that word and if present it compares similarity score between text1 and text2 words


similarity = [] # List for store similarity score



for ind in text_data.index:
    
        s1 = text_data['text1'][ind]
        s2 = text_data['text2'][ind]
        
        if s1==s2:
                 similarity.append(0.0) # 0 means highly similar
                
        else:   

            s1words = word_tokenizer(s1)
            s2words = word_tokenizer(s2)
            
           
            
            vocab = wordmodel.vocab #the vocabulary considered in the word embeddings
            
            if len(s1words and s2words)==0:
                    similarity.append(1.0)

            else:
                
                for word in s1words.copy(): #remove sentence words not found in the vocab
                    if (word not in vocab):
                           
                            
                            s1words.remove(word)
                        
                    
                for word in s2words.copy(): #idem

                    if (word not in vocab):
                           
                            s2words.remove(word)
                            
                            
                similarity.append((1-wordmodel.n_similarity(s1words, s2words))) 
                # as it is given 1 means highly dissimilar & 0 means highly similar

In [21]:
# get Unique_ ID and similarity

final_score = pd.DataFrame({'Unique_ID':text_data.Unique_ID,
                            'Similarity_score':similarity})
final_score.head(3)

Unnamed: 0,Unique_ID,Similarity_score
0,0,0.30123
1,1,0.238022
2,2,0.282121


In [23]:
# Save DF as CSV file
final_score.to_csv('final_score.csv',index=False)

2nd Method (Jaccard similarity)

Jaccard similarity or intersection over union is defined as size of intersection divided by size of union of two sets. Let’s take example of two sentences:

* Sentence 1: AI is our friend and it has been friendly
* Sentence 2: AI and humans have always been friendly

In order to calculate similarity using Jaccard similarity, we will first perform lemmatization to reduce words to the same root word. In our case, “friend” and “friendly” will both become “friend”, “has” and “have” will both become “has”.
For the above two sentences, we get Jaccard similarity of 5/(5+3+2) = 0.5 which is size of intersection of the set divided by total size of set.

In [25]:
# Similarity check with Jaccard Similarity
def get_jaccard_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float (len(c)) / (len(a) + len(b) - len(c))

In [32]:
similarity = [] # List for store similarity score



for ind in text_data.index:
    
        s1 = text_data['text1'][ind]
        s2 = text_data['text2'][ind]
        similarity.append(1-get_jaccard_sim(s1, s2))
 

In [33]:
# get Unique_ ID and similarity

final_score = pd.DataFrame({'Unique_ID':text_data.Unique_ID,
                            'Similarity_score':similarity})
final_score.head(3)

Unnamed: 0,Unique_ID,Similarity_score
0,0,0.970356
1,1,0.978528
2,2,0.975207


In [35]:
# Save DF as CSV file
final_score.to_csv('Jaccard_Similarity.csv',index=False)