# Testing SentenceBERT for semantic similarity

* https://medium.com/analytics-vidhya/semantic-similarity-in-sentences-and-bert-e8d34f5a4677
* https://towardsdatascience.com/word-embedding-using-bert-in-python-dd5a86c00342
* https://github.com/huggingface/transformers

Install the Hugging Face transformers and sentence-transformers packages.

In [2]:
!pip install transformers # https://github.com/huggingface/transformers
!pip install -U sentence-transformers # https://github.com/UKPLab/sentence-transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

In [3]:
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer


# Load the SentenceBERT model. Pretrained Semantic Textual Similarity models are listed at
# https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md
model = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

100%|██████████| 1.24G/1.24G [01:12<00:00, 17.1MB/s]


## 1. Load the sts-benchmark data and remove lines that contain errors.

In [6]:
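# Remove "warn_bad_lines=False" to print the lines that have errors.
# Note: pandas 1.3 deprecated error_bad_lines/warn_bad_lines in favor of on_bad_lines='skip',
# and pandas 2.0 removed them, so newer installs need on_bad_lines instead.
train_df = pd.read_csv('sts-train.csv', sep='\t', engine='python', header=None, encoding='utf-8', error_bad_lines=False, warn_bad_lines=False)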
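
The notebook does not say why some lines fail to parse. A minimal sketch of a more forgiving reader, assuming the rejected lines contain stray quote characters or extra trailing fields: it treats quotes as ordinary text and keeps only the first seven tab-separated fields. The sample rows and the helper name `read_sts_rows` are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the layout shown in this notebook: five metadata/score
# fields, then the two sentences, optionally followed by extra fields.
SAMPLE = (
    "main-captions\tMSRvid\t2012test\t1\t5.000\t"
    "A plane is taking off.\tAn air plane is taking off.\n"
    'main-news\theadlines\t2015\t7\t3.200\tHe said "no.\tHe declined.\textra\tfields\n'
)

def read_sts_rows(handle):
    """Yield score and sentence pair from the first seven fields of each line."""
    # QUOTE_NONE makes the reader treat '"' as an ordinary character.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 7:
            continue  # too short to hold a score and two sentences
        yield {"score": float(fields[4]), "s1": fields[5], "s2": fields[6]}

rows = list(read_sts_rows(io.StringIO(SAMPLE)))
print(len(rows), rows[1]["s1"])
```

Passing an open file handle instead of the `io.StringIO` sample should work the same way. Whether this recovers every line of `sts-train.csv` depends on what is actually wrong with the skipped lines.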

## 2. A quick look at the dataset we are using

In [7]:
print(train_df.loc[0])
print('\n')
print(train_df.loc[45])

train_df.head()

0                  main-captions
1                         MSRvid
2                       2012test
3                              1
4                              5
5         A plane is taking off.
6    An air plane is taking off.
Name: 0, dtype: object


0                     main-captions
1                            MSRvid
2                          2012test
3                                68
4                                 1
5       A man is playing the piano.
6    A woman is playing the violin.
Name: 45, dtype: object


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | main-captions | MSRvid | 2012test | 1 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | main-captions | MSRvid | 2012test | 4 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | main-captions | MSRvid | 2012test | 5 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | main-captions | MSRvid | 2012test | 6 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | main-captions | MSRvid | 2012test | 9 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |


## 3. Comparing two sentence pairs with SentenceBERT as an example

In [8]:
s1 = train_df.loc[0][5]
s2 = train_df.loc[0][6]
s3 = train_df.loc[45][5]
s4 = train_df.loc[45][6]

print(f's1 = {s1}')
print(f's2 = {s2}')
print('\n')
print(f's3 = {s3}')
print(f's4 = {s4}')

s1 = A plane is taking off.
s2 = An air plane is taking off.


s3 = A man is playing the piano.
s4 = A woman is playing the violin.


In [9]:
from scipy.spatial import distance

s1_embedding = model.encode(s1)
s2_embedding = model.encode(s2)
s3_embedding = model.encode(s3)
s4_embedding = model.encode(s4)

print(f's1 vs s2 = {distance.cosine(s1_embedding,s2_embedding)}')
print(f'Human score = {train_df.loc[0][4]}')
print(f'SentenceBERT Score = {round((1-distance.cosine(s1_embedding,s2_embedding))*5,1)}')

print(f's3 vs s4 = {distance.cosine(s3_embedding,s4_embedding)}')
print(f'Human score = {train_df.loc[45][4]}')
print(f'SentenceBERT Score = {round((1-distance.cosine(s3_embedding,s4_embedding))*5,1)}')

print(f's1 vs s3 = {distance.cosine(s1_embedding,s3_embedding)}')
print(f's1 vs s4 = {distance.cosine(s1_embedding,s4_embedding)}')

s1 vs s2 = 0.017929553985595703
Human score = 5.0
SentenceBERT Score = 4.9
s3 vs s4 = 0.77469402551651
Human score = 1.0
SentenceBERT Score = 1.1
s1 vs s3 = 0.8804589062929153
s1 vs s4 = 0.8903428241610527
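
The values printed above are cosine distances, i.e. one minus the cosine similarity of the two embeddings, which is what `scipy.spatial.distance.cosine` returns. As a minimal sketch of the arithmetic in pure Python (the helper `to_sts_scale` is a hypothetical name, not part of any library), the distance and its mapping onto the 0 to 5 STS scale look like this:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def to_sts_scale(dist):
    """Hypothetical helper: map a cosine distance onto the 0-5 STS scale."""
    return round((1.0 - dist) * 5, 1)

# Toy 2-D vectors standing in for real sentence embeddings.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal

# The two distances reported in the cell output above.
print(to_sts_scale(0.017929553985595703))
print(to_sts_scale(0.77469402551651))
```

Cosine distance ranges from 0 to 2, so a pair with negative cosine similarity would map to a score below 0 under this formula; clipping at 0 is one option if that matters.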


## 4. Getting the human scores and the SentenceBERT scores and comparing them

### 4.1 Load the data and preprocess it

In [10]:
import nltk

data = []
with open('sts-dev.csv', encoding='utf-8') as f:
    for line in f.read().splitlines():
        splits = line.split('\t')
        data.append({
            'score': float(splits[4]),
            's1': splits[5],
            's2': splits[6]
        })

# removes punctuation from sentences
tokenizer = nltk.RegexpTokenizer(r"\w+")

# tokenize, drop punctuation, and rejoin with single spaces.
# Case is left unchanged: str.lower() returns a new string, so it only
# takes effect if its result is assigned back.
for x in data:
    x['s1'] = ' '.join(tokenizer.tokenize(x['s1']))
    x['s2'] = ' '.join(tokenizer.tokenize(x['s2']))

In [11]:
data[3]

{'s1': 'A woman is playing the guitar',
 's2': 'A man is playing guitar',
 'score': 2.4}
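
Python strings are immutable, so `str.lower()` changes nothing unless its return value is kept. A minimal sketch of the same normalization with optional lowercasing, using `re.findall` in place of `nltk.RegexpTokenizer` (which should behave the same for the `\w+` pattern); the helper name `normalize` and its `lowercase` flag are assumptions for illustration:

```python
import re

def normalize(sentence, lowercase=True):
    """Keep only runs of word characters, joined by single spaces."""
    if lowercase:
        # lower() returns a new string, so the result has to be assigned.
        sentence = sentence.lower()
    return " ".join(re.findall(r"\w+", sentence))

print(normalize("A woman is playing the guitar."))
print(normalize("A woman is playing the guitar.", lowercase=False))
print(normalize("He doesn't play."))  # the apostrophe splits the word in two
```

The sentence shown in the output above keeps its capital letter, so the scores reported later in this notebook correspond to `lowercase=False`; lowercasing may shift them slightly.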

### 4.2 Get the scores and normalize them

In [12]:
score_human = []

for x in data:
    score = x['score']/5
    score_human.append(score)

In [13]:
score_machine = []

for x in data:
    s1_embedding = model.encode(x['s1'])
    s2_embedding = model.encode(x['s2'])
    score = (1-distance.cosine(s1_embedding,s2_embedding))
    score_machine.append(score)
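
The loop above encodes one sentence per call. `SentenceTransformer.encode` also accepts a list of sentences, so the pairs can be encoded in two batched calls instead. The sketch below shows the idea with a stand-in encoder so that it runs without downloading a model; `score_pairs` and `toy_encode` are hypothetical names, and `toy_encode` is not a real embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_pairs(encode, pairs):
    """Encode all first and all second sentences in two batched calls."""
    first = encode([p["s1"] for p in pairs])
    second = encode([p["s2"] for p in pairs])
    return [cosine_similarity(u, v) for u, v in zip(first, second)]

def toy_encode(sentences):
    """Stand-in encoder (NOT a real model): letter counts as a 2-D vector."""
    return [[s.count("a") + 1.0, s.count("e") + 1.0] for s in sentences]

pairs = [{"s1": "a man", "s2": "a man"}, {"s1": "aaa", "s2": "eee"}]
print(score_pairs(toy_encode, pairs))
```

With the model loaded earlier, `score_pairs(model.encode, data)` should produce the same kind of list as `score_machine`, usually faster on a GPU because the sentences are processed in batches.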

### 4.3 Compare human and SentenceBERT scores

In [14]:
from scipy.stats import pearsonr

result, _ = pearsonr(score_machine, score_human)
print('Pearsonr:', end=' ')
print("%.1f" % (result*100))

Pearsonr: 87.8
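
Pearson correlation is unchanged by positive linear rescaling, so dividing the human scores by 5 in section 4.2 puts them on a scale comparable to cosine similarity but does not affect the reported figure. A minimal pure-Python sketch of the coefficient illustrates this; the scores below are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for illustration only.
human_raw = [5.0, 3.8, 2.6, 1.0]      # 0-5 scale
machine = [0.98, 0.80, 0.55, 0.23]    # cosine similarity

human_scaled = [h / 5 for h in human_raw]
print(pearson(machine, human_raw))
print(pearson(machine, human_scaled))  # same value up to rounding error
```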
