# Testing SentenceBERT for semantic similarity

* https://medium.com/analytics-vidhya/semantic-similarity-in-sentences-and-bert-e8d34f5a4677
* https://towardsdatascience.com/word-embedding-using-bert-in-python-dd5a86c00342
* https://github.com/huggingface/transformers

Install Hugging Face transformers and sentence-transformers.

In [None]:
!pip install transformers # https://github.com/huggingface/transformers
!pip install -U sentence-transformers # https://github.com/UKPLab/sentence-transformers

In [2]:
import pandas as pd
import numpy as np
import torch

from sentence_transformers import SentenceTransformer
from scipy.spatial import distance
import nltk

# Load a BERT model fine-tuned for Semantic Textual Similarity (STS).
# Other pretrained STS models are listed at
# https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md
model = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

100%|██████████| 1.24G/1.24G [01:05<00:00, 19.0MB/s]


## 1. Load the sts-benchmark data and remove lines that contain errors.

In [None]:
train_df = pd.read_table(
    'stsbenchmark/sts-train.csv',
    on_bad_lines='skip',  # pandas >= 1.3; replaces error_bad_lines=False, which pandas 2.0 removed
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])
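
The rows that `read_table` rejects are probably ones where a sentence contains an unbalanced double quote, which the default parser treats as the start of a quoted field. This is an assumption; inspecting the skipped lines would confirm it. As a hedged alternative sketch, the standard-library `csv` module can read the file with quoting disabled so that such rows are kept. The two sample rows below are made up for illustration and only mimic the column layout implied by `usecols=[4, 5, 6]`; `read_sts` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample in the assumed layout: four leading metadata columns,
# then score, sentence 1, sentence 2, separated by tabs.
SAMPLE = (
    'main-captions\tMSRvid\t2012test\t0001\t5.000\t'
    'A plane is taking off.\tAn air plane is taking off.\n'
    'main-news\theadlines\t2015\t0002\t1.500\t'
    '"Quote left open\tSecond sentence\n'
)

def read_sts(handle):
    """Return (score, s1, s2) tuples, keeping rows with stray quote characters."""
    rows = []
    # QUOTE_NONE makes the reader treat '"' as an ordinary character.
    for fields in csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        rows.append((float(fields[4]), fields[5], fields[6]))
    return rows

rows = read_sts(io.StringIO(SAMPLE))
print(len(rows), rows[1])
```

With pandas, passing `quoting=csv.QUOTE_NONE` to `read_table` should have a similar effect, though it is worth comparing row counts against the skip-bad-lines approach before relying on it.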

## 2. A quick look at the dataset we are using

In [None]:
train_df.head()

|   | score | s1 | s2 |
|---|-------|----|----|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |


In [None]:
train_df.tail()

|   | score | s1 | s2 |
|---|-------|----|----|
| 5706 | 0.0 | Severe Gales As Storm Clodagh Hits Britain | Merkel pledges NATO solidarity with Latvia |
| 5707 | 0.0 | Dozens of Egyptians hostages taken by Libyan t... | Egyptian boat crash death toll rises as more b... |
| 5708 | 0.0 | President heading to Bahrain | President Xi: China to continue help to fight ... |
| 5709 | 0.0 | China, India vow to further bilateral ties | China Scrambles to Reassure Jittery Stock Traders |
| 5710 | 0.0 | Putin spokesman: Doping charges appear unfounded | The Latest on Severe Weather: 1 Dead in Texas ... |


## 3. Comparing two sentence pairs with SentenceBERT as an example

In [None]:
# Select each sentence by row label and column name rather than by position
s1 = train_df.loc[0, 's1']
s2 = train_df.loc[0, 's2']
s3 = train_df.loc[45, 's1']
s4 = train_df.loc[45, 's2']

print(f's1 = {s1}')
print(f's2 = {s2}')
print('\n')
print(f's3 = {s3}')
print(f's4 = {s4}')

s1 = A plane is taking off.
s2 = An air plane is taking off.


s3 = A man is playing the piano.
s4 = A woman is playing the violin.


In [None]:
s1_embedding = model.encode(s1)
s2_embedding = model.encode(s2)
s3_embedding = model.encode(s3)
s4_embedding = model.encode(s4)

# distance.cosine returns the cosine distance (0 = same direction), so
# 1 - distance is the cosine similarity; multiplying by 5 maps it onto
# the 0-5 scale of the human scores.
print(f's1 vs s2 = {distance.cosine(s1_embedding,s2_embedding)}')
print(f'Human score = {train_df.loc[0, "score"]}')
print(f'SentenceBERT Score = {round((1-distance.cosine(s1_embedding,s2_embedding))*5,1)}')

print(f's3 vs s4 = {distance.cosine(s3_embedding,s4_embedding)}')
print(f'Human score = {train_df.loc[45, "score"]}')
print(f'SentenceBERT Score = {round((1-distance.cosine(s3_embedding,s4_embedding))*5,1)}')

# Unrelated pairs, for reference: these distances should be large
print(f's1 vs s3 = {distance.cosine(s1_embedding,s3_embedding)}')
print(f's1 vs s4 = {distance.cosine(s1_embedding,s4_embedding)}')

s1 vs s2 = 0.017929434776306152
Human score = 5.0
SentenceBERT Score = 4.9
s3 vs s4 = 0.7746940553188324
Human score = 1.0
SentenceBERT Score = 1.1
s1 vs s3 = 0.8804589882493019
s1 vs s4 = 0.8903428316116333
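
The "SentenceBERT Score" printed above is the cosine similarity (1 minus the cosine distance) multiplied by 5. A minimal sketch of that arithmetic in plain Python, using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `sts_score` are illustrative names, not part of any library:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sts_score(u, v):
    """Scale cosine similarity to the 0-5 range used by the human labels."""
    return round(cosine_similarity(u, v) * 5, 1)

# Toy vectors standing in for real sentence embeddings.
same_direction = sts_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = sts_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)

# The same scaling applied to the s1 vs s2 distance reported above.
from_distance = round((1 - 0.017929434776306152) * 5, 1)
print(from_distance)
```

Cosine similarity can be negative for vectors pointing in opposite directions, so this scaling is not guaranteed to stay within 0 to 5.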


## 4. Getting the human scores and the SentenceBERT scores and comparing them

### 4.1 Load the data and preprocess it

In [None]:
dev_df = pd.read_table(
    'stsbenchmark/sts-dev.csv',
    on_bad_lines='skip',  # pandas >= 1.3; replaces error_bad_lines=False, which pandas 2.0 removed
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])

# Keep only runs of word characters, which drops punctuation
tokenizer = nltk.RegexpTokenizer(r"\w+")

# Some sentences are read as floats (probably missing values parsed as NaN),
# so cast every sentence to str before tokenizing
dev_df['s1'] = dev_df['s1'].astype(str)
dev_df['s2'] = dev_df['s2'].astype(str)

# Tokenize, rejoin with single spaces, and lowercase each sentence
dev_df['s1'] = dev_df['s1'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())
dev_df['s2'] = dev_df['s2'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())
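
To see what the preprocessing above does to individual sentences, here is a small sketch that uses `re.findall` in place of `nltk.RegexpTokenizer`. To my understanding the two behave the same for the pattern `\w+`, but treat that as an assumption. `normalize_sentence` is a hypothetical helper name.

```python
import re

def normalize_sentence(sentence):
    """Drop punctuation, collapse to single spaces, and lowercase."""
    # Keep runs of word characters only; everything else acts as a separator.
    tokens = re.findall(r"\w+", str(sentence))
    return " ".join(tokens).lower()

plain = normalize_sentence("A plane is taking off.")
split_up = normalize_sentence("It's a well-known U.S. company")
missing = normalize_sentence(float("nan"))

print(plain)
print(split_up)
print(missing)
```

Two side effects are worth noting: contractions, hyphenated words, and abbreviations are split into separate tokens, and a missing value cast to `str` becomes the literal sentence `nan`, which is then scored like any other text.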

In [None]:
dev_df.head()

|   | score | s1 | s2 |
|---|-------|----|----|
| 0 | 5.0 | a man with a hard hat is dancing | a man wearing a hard hat is dancing |
| 1 | 4.75 | a young child is riding a horse | a child is riding a horse |
| 2 | 5.0 | a man is feeding a mouse to a snake | the man is feeding a mouse to the snake |
| 3 | 2.4 | a woman is playing the guitar | a man is playing guitar |
| 4 | 2.75 | a woman is playing the flute | a man is playing a flute |


### 4.2 Get the scores and normalize them

In [None]:
dev_scores = dev_df['score'].tolist()

# Scale the human scores from the 0-5 range to 0-1
score_human = [score / 5 for score in dev_scores]

In [None]:
score_machine = []

for row in dev_df.itertuples(index=False):
    s1_embedding = model.encode(str(row[1]))
    s2_embedding = model.encode(str(row[2]))
    score = (1-distance.cosine(s1_embedding,s2_embedding))
    score_machine.append(score)
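
The loop above encodes one sentence per call. `model.encode` also accepts a list of sentences, so a possible variant is to encode each column in one call and then score the pairs. The sketch below takes the encoder as a parameter so that it runs without downloading the model; `score_pairs` and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in that produces two-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_pairs(encode, pairs):
    """Encode all first and all second sentences in two batched calls.

    encode: callable mapping a list of sentences to a sequence of vectors.
    pairs: list of (s1, s2) tuples.
    """
    first = encode([s1 for s1, _ in pairs])
    second = encode([s2 for _, s2 in pairs])
    return [cosine_similarity(u, v) for u, v in zip(first, second)]

def toy_encode(sentences):
    # Stand-in encoder for illustration: (character count, word count).
    return [[float(len(s)), float(len(s.split()))] for s in sentences]

pairs = [
    ("a man is playing a flute", "a man is playing a flute"),
    ("a woman is playing the guitar", "a man is playing guitar"),
]
scores = score_pairs(toy_encode, pairs)
print(scores)
```

With the real model this would presumably be called as `score_pairs(model.encode, list(zip(dev_df['s1'], dev_df['s2'])))`; whether it is noticeably faster depends on batch size and hardware.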

### 4.3 Compare human and SentenceBERT scores

In [None]:
from scipy.stats import pearsonr

result, _ = pearsonr(score_machine, score_human)
# Report the Pearson correlation coefficient as a percentage
print(f'Pearsonr: {result * 100:.1f}')

Pearsonr: 87.4
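
For reference, the Pearson correlation reported above is the covariance of the two score lists divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers (not taken from the dataset) and shows that dividing the human scores by 5 does not change the result, since the coefficient is unaffected by positive rescaling.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for illustration only.
human = [5.0, 3.8, 2.6, 0.0]        # 0-5 scale
machine = [0.98, 0.80, 0.55, 0.10]  # cosine similarities

r_raw = pearson(machine, human)
r_scaled = pearson(machine, [h / 5 for h in human])
print(f'{r_raw * 100:.1f} {r_scaled * 100:.1f}')
```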


## 5. Compare similarity between longer texts

In [3]:
text1 = "The public beta of macOS Big Sur, the next major release of Apple’s Mac operating system, is now available. The new update brings a big visual overhaul to macOS while also adding a number of brand-new enhancements. If you’re thinking about installing the macOS Big Sur public beta, be warned that it’s still, well, a beta. That means you could experience some unexpected bugs, and software you rely on may not work with the new OS just yet. Before you install Big Sur, make sure all of your important documents are backed up somewhere safe, and if at all possible, you should only install this on a secondary Mac. But if you do roll the dice and install the Big Sur beta, you’ll immediately see that it looks much different than previous versions of macOS, as Apple has made significant design changes across the entire operating system. Windows have a whole lot more white, for example (unless you’re using dark mode, in which case, there’s still a lot of black). Apple’s app icons have received a major facelift and are now rounded squares, like iOS’s app icons. And the menu bar is now translucent, blending into your wallpaper. In Big Sur, Apple has added a dedicated Control Center, like what iOS has had for years, making it easy to manage items like your Wi-Fi and Bluetooth connections and the display brightness and volume of your Mac, all in one place. And Notification Center is no longer two separate panels for notifications and widgets; it’s now combined into one. If you’re a Safari user, you’ll notice some big changes, too. You can now set a customized start page, letting you add things like your favorites, frequently visited websites, and even a background image of your choice. Tabs get some improvements as well: favicons are turned on by default, and when you hover your mouse over a tab, it shows a preview of that webpage. 
And like iOS 14, Safari in macOS Big Sur offers what Apple calls a Privacy Report, which shows you what trackers the browser has blocked for you. Messages is also getting some much-needed improvements in Big Sur. You can finally send the message effects like the ones on iOS, meaning you can send virtual confetti, balloons, lasers, and more to your contacts (though only if they’re also on iMessage). Some of the new updates to Messages on iOS 14 are coming to Big Sur, too, such as pinned conversations and inline replies. There’s a bunch more packed into Big Sur that I didn’t touch on here, such as improvements to Maps, a suite of new system sounds, and the return of the Mac’s iconic startup chime. So if you do install the Big Sur public beta, there’s a fair amount to dig into. I’ve been running the Big Sur developer betas on my personal MacBook Air purchased in 2014 without too many issues, and I like a lot of the changes, especially to Messages. But I pretty much only use Apple-made apps on that computer, so I can’t really speak to how other apps you might rely on will run. If you decide to install the public beta, just know that things may not work like you expect them to just yet." 
# https://www.theverge.com/2020/8/6/21356413/apple-macos-big-sur-public-beta-now-available

text2 = "Apple today seeded the first beta of the upcoming macOS 11 Big Sur update to its public beta testing group, allowing non-developers to give the software a try ahead of its public release this fall. Beta testers who signed up for Apple's beta testing program can download the macOS Big Sur beta through the Software Update mechanism in System Preferences after installing the proper profile. Mac users who want to be a part of Apple's beta testing program can sign up to participate on the beta website, which gives users access to iOS, macOS, watchOS, and tvOS betas. Potential beta testers should make a full backup before installing ‌macOS Big Sur‌, and it may not be wise to install the update on a primary machine because betas can be unstable. macOS Big Sur introduces a refined design for the macOS operating system, which is more similar to iOS but immediately familiar to Mac users with tweaks to window design, color palette, app icons, system sounds, menu bars, and sidebars. The update brings Control Center to the Mac for the first time, providing quicker access to system controls for things like volume, keyboard brightness, screen brightness, Wi-Fi connection, and more. An updated Notification Center includes more interactive notifications and redesigned widgets that mirror the new ‌widgets‌ in iOS 14. Notifications are now grouped by app, and you can customize which ‌widgets‌ show up. Safari has a new customizable start page, built-in translation, and a Privacy Report feature that lets you know which trackers each website is using to follow you across the web. There's a new Mac App Store category for extensions, and you can now control the specific sites that extensions are able to work with for more privacy. The Messages app for Mac has been overhauled to bring it more in line with the Messages app for iOS and it supports features like pinned conversations, mentions, inline replies, Messages effects, and Memoji creation and Memoji stickers. 
Search is also better to make it easier to find old conversations, photos, links, and more. A redesigned Maps app in macOS Big Sur adds support for Look Around, indoor maps, Guides, and Shared ETA updates, plus it can be used to generate cycling routes and routes with charging stops for electric vehicles, which can be sent to iPhone. There are also smaller updates for apps like Photos, Music, and Home, with a full list of everything new in macOS Big Sur available in our roundup." 
# https://www.macrumors.com/2020/08/06/macos-big-sur-public-beta/

text3 = "Earth, the Moon and Mars come into alignment this weekend. So far this summer has been all about super-bright Jupiter and, just 8º away, ringed planet Saturn, which have been dominating the southwestern night sky after dark. This weekend it’s the turn of Mars, as the waning Moon passes close to the red planet. Three spacecraft are on their way to Mars right now, and it’s also a great time to admire it. Mars is now creeping towards opposition in October, the point in its orbit when it’s closest to Earth, so as big and bright as it gets. It’s already getting visibly bigger and brighter with every passing night. Mars is rising earlier each evening, and this weekend is now in the sky before midnight, with a 65% illuminated waning gibbous Moon in tow. Stargazers call this event—when two celestial bodies appear to pass close to each other—a conjunction. How to see Mars and the Moon in conjunction on Saturday, August 8, 202 Look to the east around midnight on Saturday going into Sunday and you’ll easily find a waning gibbous Moon. Only those in North America will see the closest conjunction, at around 4:00 a.m. EDT on the morning of Sunday, August 9.  The Moon will be close to Mars on Saturday night through Sunday morning. You could even try to catch the Moon at moonrise—the most beautiful time to observe our satellite—by consulting this Moon calculator to get times for your exact location. In doing so you’ll also witness a “Mars-rise.” Just 0.8º north of the Moon will be Mars, shining at magnitude -1.3. That’s significantly brighter than any stars, so Mars will be obvious. The conjunction of two of the night sky’s top sights isn’t that rare, but there are few more pleasing celestial sights to unaided naked eyes than a big Moon passing a bright, red planet. Wishing you clear skies and wide eyes." 
# https://www.forbes.com/sites/jamiecartereurope/2020/08/05/mars-and-the-moon-will-align-this-friday-heres-when-and-where-you-can-see-a-marvellous-mars-rise/#23a345516c4d

In [4]:
text1_embedding = model.encode(text1)
text2_embedding = model.encode(text2)
text3_embedding = model.encode(text3)


print(f'text1 vs text2 = {distance.cosine(text1_embedding,text2_embedding)}')
print(f'SentenceBERT Score = {round((1-distance.cosine(text1_embedding,text2_embedding))*5,1)}')

print(f'text1 vs text3 = {distance.cosine(text1_embedding,text3_embedding)}')
print(f'SentenceBERT Score = {round((1-distance.cosine(text1_embedding,text3_embedding))*5,1)}')



text1 vs text2 = 0.2723304033279419
SentenceBERT Score = 3.6
text1 vs text3 = 0.7658985704183578
SentenceBERT Score = 1.2
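
One caveat for texts this long: transformer encoders generally cap the number of tokens they read, and sentence-transformers exposes that cap as `model.max_seq_length`. Text past the limit is truncated, so the scores above may reflect only the opening of each article; checking the limit for the model used here would settle that. One possible workaround, which is not part of the original experiment, is to split each text into chunks, encode the chunks, and average the vectors. The sketch below takes the encoder as a parameter; `chunk_words`, `mean_vector`, `embed_long_text`, and `toy_encode` are hypothetical names, and the chunk size is counted in words rather than model tokens.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def mean_vector(vectors):
    """Element-wise mean of a non-empty sequence of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def embed_long_text(encode, text, size=100):
    """Encode each chunk separately and average the chunk vectors."""
    return mean_vector(encode(chunk_words(text, size)))

def toy_encode(chunks):
    # Stand-in encoder for illustration: (word count, character count).
    return [[float(len(c.split())), float(len(c))] for c in chunks]

chunks = chunk_words("one two three four five", 2)
vector = embed_long_text(toy_encode, "one two three four five", size=2)
print(chunks)
print(vector)
```

Averaging weights every chunk equally regardless of length, and it is only one of several ways to pool chunk embeddings, so the resulting similarities are not directly comparable to the scores above.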
