<a href="https://colab.research.google.com/github/maitysuvo19/Semantic_textual_similarity/blob/main/bert_sts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The semantic-text-similarity module is an easy-to-use interface to fine-tuned BERT models for computing semantic similarity. It modifies pytorch-transformers by abstracting away all the research benchmarking code for ease of real-world applicability. For more information, visit https://pypi.org/project/semantic-text-similarity/#description

In [None]:
#installation
!pip install semantic-text-similarity

Collecting semantic-text-similarity
  Downloading https://files.pythonhosted.org/packages/f1/d7/eade8afd89103e3dcc4b4db146a134a26bd7336ba86d9a95cf0d0e3a28cb/semantic_text_similarity-1.0.3-py3-none-any.whl (416kB)
Collecting fuzzywuzzy[speedup]
  Downloading https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl
Collecting pytorch-transformers==1.1.0
  Downloading https://files.pythonhosted.org/packages/50/89/ad0d6bb932d0a51793eaabcf1617a36ff530dc9ab9e38f765a35dc293306/pytorch_transformers-1.1.0-py3-none-any.whl (158kB)
Collecting strsim
  Downloading https://files.pythonhosted.org/packages/0d/95/14e5dea00c3bc73e5962261442957ee3691de8d51c97909ba7b2f46bf584/strsim-0.0.3-py3-none-any.whl (42kB)
Collectin

In [None]:
from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cpu', batch_size=10)  # run on CPU here; the package defaults to GPU prediction

clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10)  # run on GPU (needs a CUDA-enabled runtime)

Downloading model: web-bert-similarity from https://github.com/AndriyMulyar/semantic-text-similarity/releases/download/v1.0.0/web_bert_similarity.tar.gz


100%|██████████| 405359924/405359924 [00:09<00:00, 44260724.63B/s]
  0%|          | 0/401555686 [00:00<?, ?B/s]

Downloading model: clinical-bert-similarity from https://github.com/AndriyMulyar/semantic-text-similarity/releases/download/v1.0.0/clinical_bert_similarity.tar.gz


100%|██████████| 401555686/401555686 [00:09<00:00, 44221716.96B/s]
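
The clinical model above is pinned to `device='cuda'`, which will likely fail on a CPU-only runtime. A minimal sketch of choosing the device at run time, assuming PyTorch is the backend used by the package; `pick_device` is a hypothetical helper and not part of semantic-text-similarity:

```python
def pick_device():
    """Return 'cuda' when a GPU is visible to PyTorch, otherwise 'cpu'."""
    try:
        import torch  # assumed backend; pulled in by pytorch-transformers
    except ImportError:
        # No PyTorch available, so only CPU can be reported.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string could then be passed as the `device` argument when constructing either model.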


In [None]:
#basic imports
import pandas as pd
import numpy as np

In [None]:
#loading the data set
df=pd.read_csv('/content/drive/MyDrive/nlp data/Text_Similarity_Dataset.csv')

In [None]:
#view first few rows of the dataset
df.head(3)

Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...


In [None]:
#importing packages for pre-processing texts
import re    # for regular expressions 
import nltk  # for text manipulation 
nltk.download('stopwords')
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
import string 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')
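# raw string so that \S is not treated as a string escape; matches @mentions, URLs, and runs of non-alphanumeric characters
text_cleaning_re = r"@\S+|https?:\S+|[^A-Za-z0-9]+"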

In [None]:
#function to preprocess texts
def preprocess(text, stem=True):
  text = re.sub(text_cleaning_re, ' ', str(text).lower()).strip()
  tokens = []
  for token in text.split():
    if token not in stop_words:
      if stem:
        tokens.append(stemmer.stem(token))
      else:
        tokens.append(token)
  return " ".join(tokens)
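
To make the cleaning step concrete, here is a small sketch of what the regular expression does before stop-word removal and stemming. It assumes the pattern reduces to three alternatives (mention, URL, run of non-alphanumeric characters); `CLEANING_RE`, `clean`, and the sample sentence are illustrative names and data, not taken from the dataset:

```python
import re

# Assumed pattern: @mentions, URLs, then any run of non-alphanumeric characters.
CLEANING_RE = re.compile(r"@\S+|https?:\S+|[^A-Za-z0-9]+")


def clean(text):
    """Lower-case the text, replace every match with a space, collapse whitespace."""
    replaced = CLEANING_RE.sub(" ", str(text).lower())
    return " ".join(replaced.split())


sample = "Newcastle 2-1 Bolton! Report: https://example.com/match"
cleaned = clean(sample)
print(cleaned)
```

Punctuation and the URL are dropped, so the score "2-1" is split into the two tokens "2" and "1", which matches the preprocessed rows shown later in the notebook.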

In [None]:
df['text1'] = df['text1'].apply(preprocess)
df['text2'] = df['text2'].apply(preprocess)

In [None]:
#preprocessed dataframe
df.head()

Unnamed: 0,Unique_ID,text1,text2
0,0,savvi searcher fail spot ad internet search en...,newcastl 2 1 bolton kieron dyer smash home win...
1,1,million miss net 2025 40 uk popul still withou...,nasdaq plan 100m share sale owner technolog do...
2,2,young debut cut short ginepri fifteen year old...,ruddock back yapp credenti wale coach mike rud...
3,3,diageo buy us wine firm diageo world biggest s...,mci share climb takeov bid share us phone comp...
4,4,care code new european direct could put softwa...,media gadget get move pocket size devic let pe...


In [None]:
text1=df.text1.tolist()
text2=df.text2.tolist()

This package maps batches of sentence pairs to real-valued scores in the range [0, 5].

In [None]:
#finding similarity score between each pair of sentences
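similarity_score = []
# iterate over the two lists directly instead of relying on the DataFrame index as a list position
for t1, t2 in zip(text1, text2):
  # predict() takes a list of pairs; one pair per call gives a one-element result per row
  similarity_score.append(web_model.predict([(t1, t2)]))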
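
Since the package maps batches of pairs to scores, the pairs can also be passed in chunks rather than one at a time. The sketch below assumes that `predict` accepts a list of `(text1, text2)` tuples and returns one score per pair in the same order; `score_in_chunks` is a hypothetical helper, and `fake_predict` is a stand-in so the example runs without downloading a model:

```python
def score_in_chunks(predict_fn, pairs, chunk_size=256):
    """Call predict_fn on consecutive chunks of pairs and return a flat list of floats."""
    scores = []
    for start in range(0, len(pairs), chunk_size):
        chunk = pairs[start:start + chunk_size]
        # One call per chunk; each returned score is converted to a plain float.
        scores.extend(float(s) for s in predict_fn(chunk))
    return scores


def fake_predict(batch):
    """Stand-in for web_model.predict: counts shared words, capped at 5."""
    return [min(5.0, float(len(set(a.split()) & set(b.split())))) for a, b in batch]


pairs = [("a b c", "a b d"), ("x y", "p q"), ("m n", "m n")]
scores = score_in_chunks(fake_predict, pairs, chunk_size=2)
print(scores)
```

With the real model, `web_model.predict` would take the place of `fake_predict` and `list(zip(text1, text2))` the place of `pairs`. The result is a flat list of floats rather than a list of one-element arrays.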

In [None]:
#transforming similarity score scale [0,5] to [0,1]
Similarity_Score = [x /5 for x in similarity_score]
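
Because each prediction is stored as a one-element result, the rescaled column ends up holding values such as `[0.07144032]` instead of plain numbers. A minimal sketch of unwrapping and clamping while rescaling; the input values are made up for illustration, and `rescale` is a hypothetical helper:

```python
def rescale(raw_scores, max_score=5.0):
    """Map one-element results on [0, max_score] to plain floats on [0, 1]."""
    scaled = []
    for item in raw_scores:
        value = float(item[0]) / max_score  # unwrap the single score
        scaled.append(min(1.0, max(0.0, value)))  # clamp in case the model overshoots
    return scaled


raw = [[0.5], [2.5], [5.0], [5.4]]  # illustrative values, not model output
scaled = rescale(raw)
print(scaled)
```

Indexing with `item[0]` works for both Python lists and one-element NumPy arrays, so the same helper could be applied to `similarity_score` before assigning the column.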

In [None]:
df=df.assign(Similarity_Score=Similarity_Score)

In [None]:
df.head()

Unnamed: 0,Unique_ID,text1,text2,Similarity_Score
0,0,savvi searcher fail spot ad internet search en...,newcastl 2 1 bolton kieron dyer smash home win...,[0.07144032]
1,1,million miss net 2025 40 uk popul still withou...,nasdaq plan 100m share sale owner technolog do...,[0.15003903]
2,2,young debut cut short ginepri fifteen year old...,ruddock back yapp credenti wale coach mike rud...,[0.12324971]
3,3,diageo buy us wine firm diageo world biggest s...,mci share climb takeov bid share us phone comp...,[0.21954179]
4,4,care code new european direct could put softwa...,media gadget get move pocket size devic let pe...,[0.15129845]


In [None]:
df=df[['Unique_ID','Similarity_Score']]
df.head()

Unnamed: 0,Unique_ID,Similarity_Score
0,0,[0.07144032]
1,1,[0.15003903]
2,2,[0.12324971]
3,3,[0.21954179]
4,4,[0.15129845]


In [None]:
from google.colab import files
df.to_csv('df_sts_bert_1.csv')
files.download('df_sts_bert_1.csv')


In [None]:
#for printing
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('sts_bert_1.ipynb')

File ‘colab_pdf.py’ already there; not retrieving.





[NbConvertApp] Converting notebook /content/drive/My Drive/Colab Notebooks/sts_bert_1.ipynb to pdf
[NbConvertApp] Writing 41918 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 36818 bytes to /content/drive/My Drive/sts_bert_1.pdf



'File ready to be Downloaded and Saved to Drive'