# **Semantic Similarity between Paragraphs or Sentences**

-----------------------

## **Part A:** **Task Description:**
#### The task is to evaluate the semantic similarity between two paragraphs. **Semantic Textual Similarity (STS)** measures the degree to which two pieces of text convey the same meaning. Our task is to produce a real-valued similarity score for each pair of texts.


## **Description of the Data:**

#### The dataset comprises pairs of paragraphs randomly selected from a larger raw dataset. These paragraphs may or may not exhibit semantic similarity. Participants are tasked with predicting a value ranging from 0 to 1, which indicates the degree of similarity between each pair of text paragraphs. A score of

- ### **1** means highly similar
- ### **0** means highly dissimilar

## **Approach to solve this problem:**
To solve this **Natural Language Processing (NLP)** problem, the initial step involves text embedding, a pivotal aspect in building deep learning models. Text embedding transforms **textual data** (such as sentences) into **numerical vectors**.

Once the sentences are converted into vectors, we can calculate how close these vectors are based on the cosine similarity.

Matching on keywords alone is not enough: the embedding needs to capture the context and meaning of the text.
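To illustrate why keyword matching falls short, here is a minimal sketch using a plain word-overlap (Jaccard) score. The helper `keyword_overlap` and the example sentences are hypothetical and are not part of the notebook's pipeline.

```python
def keyword_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap between the sets of lowercase words in two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # convention chosen here for two empty texts
    return len(words_a & words_b) / len(words_a | words_b)

# Two sentences with the same meaning but no shared words
paraphrase = keyword_overlap("the film was excellent", "a truly great movie")

# Two sentences about different topics that happen to share words
unrelated = keyword_overlap("the bank raised interest rates",
                            "we sat on the river bank")

print(paraphrase)
print(unrelated)
```

The paraphrase pair shares no words, so it scores lower than the unrelated pair, which shares "the" and "bank". An embedding that encodes meaning is intended to avoid this failure mode.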

To address this, we use the **Universal Sentence Encoder (USE)**. This encoder maps text to high-dimensional vectors (512 dimensions for the version used here), which suit our semantic similarity task. The pre-trained model is available through TensorFlow Hub.

## **Step-1: Install the required libraries or packages**

In [None]:
!pip install -q tensorflow tensorflow_hub pandas

## **Step-2: Importing required libraries:**

### Let's import the necessary libraries and load the TensorFlow Hub module for the Universal Sentence Encoder.

In [None]:
import tensorflow as tf       # Used later to convert tensors to NumPy arrays
import pandas as pd           # To work with tables
import tensorflow_hub as hub  # Loads the pre-trained USE model
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"  # USE version 4 on TensorFlow Hub
model = hub.load(module_url)
def embed(texts):
  # Return one embedding vector per input string
  return model(texts)

## **Step-3: Reading Data**

In [None]:
data = pd.read_csv("/content/DataNeuron_Text_Similarity.csv")

In [None]:
data.head()

Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


In [None]:
data.shape

(3000, 2)

In [None]:
data['text1'][0]

'broadband challenges tv viewing the number of europeans with broadband has exploded over the past 12 months  with the web eating into tv viewing habits  research suggests.  just over 54 million people are hooked up to the net via broadband  up from 34 million a year ago  according to market analysts nielsen/netratings. the total number of people online in europe has broken the 100 million mark. the popularity of the net has meant that many are turning away from tv  say analysts jupiter research. it found that a quarter of web users said they spent less time watching tv in favour of the net  the report by nielsen/netratings found that the number of people with fast internet access had risen by 60% over the past year.  the biggest jump was in italy  where it rose by 120%. britain was close behind  with broadband users almost doubling in a year. the growth has been fuelled by lower prices and a wider choice of always-on  fast-net subscription plans.  twelve months ago high speed internet

In [None]:
type(data['text1'][0]) # the first entry is a Python string

str
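The check above inspects a single entry. A minimal sketch of checking every entry could look like the following; the `sample` frame and the helper `non_string_counts` are hypothetical stand-ins for the real `data`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, with one missing value
sample = pd.DataFrame({
    "text1": ["first paragraph", "second paragraph"],
    "text2": ["another paragraph", None],
})

def non_string_counts(frame, columns):
    """Count, per column, the entries that are not Python strings."""
    return {
        col: int((~frame[col].apply(lambda v: isinstance(v, str))).sum())
        for col in columns
    }

counts = non_string_counts(sample, ["text1", "text2"])
print(counts)
```

A non-zero count would point at missing or non-text entries that should be handled before the texts are passed to the encoder.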

## **Step-4: Encoding text to vectors:**
We use USE version 4.
According to its authors, the model is trained on a variety of web sources, such as Wikipedia, web news, question-answer pages and discussion forums.
Each text is a sequence of words; when we pass it to the model (USE4), it returns a dense numeric vector.
Here, we pass a pair of texts and get a pair of vectors.

In [None]:
message = [data['text1'][0], data['text2'][0]]
message_embeddings = embed(message)
message_embeddings

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[-0.02720232,  0.00681642, -0.03939367, ..., -0.03903357,
        -0.05795865, -0.05810072],
       [-0.05569994, -0.0564485 , -0.056383  , ...,  0.04282599,
        -0.05645383, -0.05647698]], dtype=float32)>

In [None]:
type(message_embeddings)

tensorflow.python.framework.ops.EagerTensor

### Here we can see that the returned value is of type tensorflow.python.framework.ops.EagerTensor. To compute the cosine similarity with NumPy, we first convert it into a NumPy array (calling `.numpy()` on the tensor would also work).
---



In [None]:
type(message_embeddings[0])

tensorflow.python.framework.ops.EagerTensor

In [None]:
type(tf.make_ndarray(tf.make_tensor_proto(message_embeddings)))

numpy.ndarray

In [None]:
a_np = tf.make_ndarray(tf.make_tensor_proto(message_embeddings))

## **Step-5: Finding Cosine similarity**
For every text pair in our data, we compute the vector representation of both texts in a for loop. For each vector pair, we then compute the cosine of the angle between the two vectors using the usual formula:

### **Cosine Similarity = Dot(a, b) / (norm(a) * norm(b))**

Cosine similarity ranges from -1 to 1, but we need values from 0 to 1. We therefore add 1 to each cosine similarity value and then normalize the result.
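As a quick check of the formula, here is a small sketch on toy 2-D vectors. The function name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors.

    Returns 0.0 if either vector has zero length (a convention chosen here).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # reversed direction

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way give a value close to 1, perpendicular vectors give 0, and opposite vectors give a value close to -1, regardless of their lengths.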


In [None]:
from numpy import dot                                           # to calculate the dot product of two vectors
from numpy.linalg import norm                                   # to calculate the norm of a vector

ans = []                                                        # will hold the cosine similarity of each vector pair
for i in range(len(data)):
  messages = [data['text1'][i], data['text2'][i]]               # the i-th pair of texts
  message_embeddings = embed(messages)                          # convert the text pair to a vector pair with embed()
  a = tf.make_ndarray(tf.make_tensor_proto(message_embeddings)) # store the vectors as a NumPy array
  cos_sim = dot(a[0], a[1])/(norm(a[0])*norm(a[1]))             # cosine of the angle between the two vectors
  ans.append(cos_sim)                                           # append the value to the ans list
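The loop above handles one pair at a time. If the embeddings of all `text1` and `text2` entries were already available as two `(n, 512)` arrays, the cosine values could be computed in one vectorized step. The sketch below assumes that setup and uses random arrays as stand-ins for real embeddings; `rowwise_cosine` is a hypothetical helper.

```python
import numpy as np

def rowwise_cosine(emb_a, emb_b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    dots = np.sum(emb_a * emb_b, axis=1)                        # row-wise dot products
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return dots / norms

# Random stand-ins for precomputed embeddings (not real USE output)
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 512))
emb_b = rng.normal(size=(4, 512))

scores = rowwise_cosine(emb_a, emb_b)
print(scores.shape)
```

This produces the same values as applying the per-pair formula row by row, while avoiding a Python-level loop over the arithmetic.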

In [None]:
len(ans)

3000

## **Step-6: Computing the final scores and saving them in CSV format**

In [None]:
# Converting the ans list into Dataframe so that we can add it to our "data"
Answer = pd.DataFrame(ans, columns = ['Similarity_Score'])

In [None]:
Answer.head()

Unnamed: 0,Similarity_Score
0,0.272668
1,0.277622
2,0.169011
3,0.157467
4,0.246201


In [None]:
data = data.join(Answer)  # Join the Similarity_Score DataFrame (Answer) to our main data

In [None]:
data.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.272668
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.277622
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.169011
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.157467
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.246201


In [None]:
# Adding 1 to each of the values of Similarity_Score to make the values from 0 to 2. (Initially it was from [-1,1])
data['Similarity_Score'] = data['Similarity_Score'] + 1

In [None]:
data.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,1.272668
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,1.277622
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,1.169011
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,1.157467
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,1.246201


In [None]:
# Normalizing the Similarity_Score to get the value between 0 and 1
data['Similarity_Score'] = data['Similarity_Score']/data['Similarity_Score'].abs().max()
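Dividing by the largest shifted score makes the result depend on the dataset: the highest-scoring pair always maps to 1.0. A fixed alternative is to divide by 2, the largest value a shifted cosine can take. The sketch below compares the two on hypothetical cosine values; in this notebook's run the printed scores are consistent with a maximum of about 2, so both approaches give nearly the same numbers here.

```python
import numpy as np

# Hypothetical cosine similarities, for illustration only
cos_scores = np.array([-0.2, 0.1, 0.6])

shifted = cos_scores + 1.0                 # now in [0, 2]

by_max = shifted / np.abs(shifted).max()   # data-dependent scaling, as in the cell above
by_two = shifted / 2.0                     # fixed scaling: (cos + 1) / 2

print(by_max)
print(by_two)
```

With the fixed scaling, a given cosine value always maps to the same score, whichever other pairs are in the dataset.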

In [None]:
data.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.636334
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.638811
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.584505
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.578734
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.6231


In [None]:
data.insert(0, 'Unique_ID', range(1, len(data) + 1))

In [None]:
data.head()

Unnamed: 0,Unique_ID,text1,text2,Similarity_Score
0,1,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.636334
1,2,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.638811
2,3,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.584505
3,4,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.578734
4,5,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.6231


In [None]:
# # Similarity_Score

# from matplotlib import pyplot as plt
# data['Similarity_Score'].plot(kind='line', figsize=(8, 4), title='Similarity_Score')
# plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
data['Unique_ID'].shape

(3000,)

In [None]:
Submission_task = data[['Unique_ID', 'Similarity_Score']]

In [None]:
Submission_task.head()

Unnamed: 0,Unique_ID,Similarity_Score
0,1,0.636334
1,2,0.638811
2,3,0.584505
3,4,0.578734
4,5,0.6231


In [None]:
Submission_task.set_index("Unique_ID", inplace = True)

In [None]:
from google.colab import files
Submission_task.to_csv('Submission_Task.csv')
files.download('Submission_Task.csv')
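The `google.colab` module is only available inside Google Colab. A minimal sketch of a variant that also runs elsewhere is shown below; the small `submission` frame and the temporary output path are hypothetical stand-ins for the real `Submission_task` and file name.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real submission table
submission = pd.DataFrame(
    {"Unique_ID": [1, 2, 3], "Similarity_Score": [0.636334, 0.638811, 0.584505]}
).set_index("Unique_ID")

out_path = os.path.join(tempfile.gettempdir(), "Submission_Task_example.csv")
submission.to_csv(out_path)

# Trigger a browser download only when running inside Colab
try:
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    files.download(out_path)
else:
    print(f"Saved to {out_path}")

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_path, index_col="Unique_ID")
```

Outside Colab the file is simply written to disk, so the same cell can be reused in a local Jupyter session.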
