# CH07c Semantic similarity experiment with FLAIR
In this experiment, we will qualitatively evaluate sentence representation models using the flair library, which greatly simplifies obtaining document embeddings.

We will run the experiment with the following approaches:
- Document average pool embeddings
- RNN-based embeddings
- BERT embeddings
- SBERT embeddings


For qualitative evaluation, we define a list of similar sentence pairs and a list of dissimilar sentence pairs (five pairs each). We expect a good embedding model to produce a high similarity score for the first list and a low score for the second.  

The sentence pairs are extracted from the STS Benchmark (STS-B) dataset, which we are already familiar with from the sentence-pair regression part of Chapter 6, Fine-Tuning Language Models for Token Classification. For similar pairs, the two sentences are equivalent and share the same meaning.

In [None]:
# !pip install flair

In [1]:
import pandas as pd

The following pairs, which have a similarity score of around 5 in the STS-B dataset, were picked at random:

In [2]:
similar = [("A black dog walking beside a pool.", "A black dog is walking along the side of a pool."),
("A blonde woman looks for medical supplies for work in a suitcase.", "The blond woman is searching for medical supplies in a suitcase."),
("A doubly decker red bus driving down the road.", "A red double decker bus driving down a street."),
("There is a black dog jumping into a swimming pool.", "A black dog is leaping into a swimming pool."),
("The man used a sword to slice a plastic bottle.", "A man sliced a plastic bottle with a sword.")]
pd.DataFrame(similar, columns=["sen1", "sen2"])

Unnamed: 0,sen1,sen2
0,A black dog walking beside a pool.,A black dog is walking along the side of a pool.
1,A blonde woman looks for medical supplies for ...,The blond woman is searching for medical suppl...
2,A doubly decker red bus driving down the road.,A red double decker bus driving down a street.
3,There is a black dog jumping into a swimming p...,A black dog is leaping into a swimming pool.
4,The man used a sword to slice a plastic bottle.,A man sliced a plastic bottle with a sword.


Here is the list of dissimilar sentences whose similarity scores are around 0, taken from the STS-B dataset:

In [3]:
# import pandas as pd
dissimilar= [("A little girl and boy are reading books. ", "An older child is playing with a doll while gazing out the window."),
("Two horses standing in a field with trees in the background.", "A black and white bird on a body of water with grass in the background."),
("Two people are walking by the ocean." , "Two men in fleeces and hats looking at the camera."),
("A cat is pouncing on a trampoline.","A man is slicing a tomato."),
("A woman is riding on a horse.","A man is turning over tables in anger.")]
pd.DataFrame(dissimilar, columns=["sen1", "sen2"])

Unnamed: 0,sen1,sen2
0,A little girl and boy are reading books.,An older child is playing with a doll while ga...
1,Two horses standing in a field with trees in t...,A black and white bird on a body of water with...
2,Two people are walking by the ocean.,Two men in fleeces and hats looking at the cam...
3,A cat is pouncing on a trampoline.,A man is slicing a tomato.
4,A woman is riding on a horse.,A man is turning over tables in anger.


The following `sim()` function computes the cosine similarity between the embeddings of two sentences, `s1` and `s2`:

In [10]:
import numpy as np
import torch

def sim(s1, s2):
  # cosine similarity of two embedded flair sentences; values lie in the range [-1, 1]
  e1 = s1.embedding.unsqueeze(0)   # add a batch dimension: (dim,) -> (1, dim)
  e2 = s2.embedding.unsqueeze(0)
  score = torch.cosine_similarity(e1, e2).item()
  return np.round(score, 2)
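
To make the value range of the score concrete, here is a minimal sketch of cosine similarity in plain Python. The `cosine` helper and the toy vectors are assumptions made for illustration; they are not part of flair or of the experiment itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # opposite directions
print(round(same_direction, 2), round(orthogonal, 2), round(opposite, 2))
```

The three cases sit at the top, the middle, and the bottom of the range, which is why a slightly negative score for a dissimilar pair is possible.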


All the document embedding models in this experiment are built on pre-trained models and are used without any further training.  
We will pass the document embedding model object and a sentence pair list (similar or dissimilar) to the following `evaluate()` function. It embeds both sentences of each pair, computes the similarity score for each pair in the list, and returns the scores along with their average. The definition of the function is as follows:

In [4]:
import numpy as np
from flair.data import Sentence

def evaluate(embeddings, myPairList):
  # scores every sentence pair in the list with the given document embeddings
  scores = []
  for s1, s2 in myPairList:
    s1, s2 = Sentence(s1), Sentence(s2)   # tokenization
    embeddings.embed(s1)                  # attaches the document vector to s1
    embeddings.embed(s2)
    score = sim(s1, s2)
    scores.append(score)
  # per-pair scores and their average, rounded to two decimals
  return scores, np.round(np.mean(scores), 2)

## Document Pool Embedding

Document pool embeddings (also called average word embeddings) apply a mean pooling operation over all the words in a sentence: the average of the word embeddings is taken as the sentence embedding.  
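
As a rough illustration of mean pooling, the sketch below averages three toy word vectors element-wise. The `mean_pool` helper and the 3-dimensional vectors are hypothetical; real GloVe vectors have far more dimensions, and flair performs this step internally.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors element-wise."""
    n = len(word_vectors)
    return [sum(dim) / n for dim in zip(*word_vectors)]

# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "black": [1.0, 0.0, 2.0],
    "dog": [3.0, 2.0, 0.0],
    "runs": [2.0, 4.0, 1.0],
}
sentence = ["black", "dog", "runs"]
sentence_embedding = mean_pool([toy_vectors[w] for w in sentence])
print(sentence_embedding)
```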

The following cell instantiates a document pool embedding based on GloVe vectors. Although we use only GloVe vectors here, the flair API accepts a list of several word embeddings. Here is the code:

In [8]:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
glove_embedding = WordEmbeddings('glove')
glove_pool_embeddings = DocumentPoolEmbeddings([glove_embedding])

In [11]:
evaluate(glove_pool_embeddings, similar)

([0.97, 0.99, 0.97, 0.99, 0.98], 0.98)

The results look good: the scores are very high, which is what we expect.  
However, the model also produces high scores for the dissimilar list, 0.94 on average, whereas we would expect something below 0.4. We'll discuss why this happens later in this chapter. Here is the execution:

In [12]:
evaluate(glove_pool_embeddings, dissimilar)

([0.94, 0.97, 0.94, 0.92, 0.93], 0.94)

## RNN-based Document Embeddings

In [13]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
gru_embeddings = DocumentRNNEmbeddings([glove_embedding])

In [14]:
evaluate(gru_embeddings, similar)

([0.98, 1.0, 0.94, 1.0, 0.88], 0.96)

In [15]:
evaluate(gru_embeddings, dissimilar)

([0.86, 1.0, 0.87, 0.83, 0.86], 0.88)

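Likewise, we get a high score for the dissimilar list, which is not what we want from sentence embeddings. Note that flair creates the GRU layer of `DocumentRNNEmbeddings` with randomly initialized weights on top of the pre-trained GloVe vectors, so unless it is trained on a downstream task, the exact scores can vary from run to run.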

## Transformer-based BERT Embeddings

In [16]:
from flair.embeddings import TransformerDocumentEmbeddings
from flair.data import Sentence
bert_embeddings = TransformerDocumentEmbeddings('bert-base-uncased')

In [17]:
evaluate(bert_embeddings, similar)

([0.85, 0.9, 0.96, 0.91, 0.89], 0.9)

In [18]:
evaluate(bert_embeddings, dissimilar)

([0.93, 0.94, 0.86, 0.93, 0.92], 0.92)

This is worse! The average score of the dissimilar list (0.92) is higher than that of the similar list (0.9).

## Sentence-BERT

In [None]:
# !pip install sentence-transformers

As we mentioned previously, Sentence-BERT provides a variety of pre-trained models. We will pick the `bert-base-nli-mean-tokens` model for evaluation.

In [19]:
from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings
# init embedding
sbert_embeddings = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')

In [20]:
evaluate(sbert_embeddings, similar)

([0.98, 0.95, 0.96, 0.99, 0.98], 0.97)

In [21]:
evaluate(sbert_embeddings, dissimilar)

([0.48, 0.41, 0.19, -0.05, 0.0], 0.21)

Well done! The SBERT model produces better results: it keeps the scores of the similar list high (0.97 on average) while giving the dissimilar list a low score (0.21 on average), which is what we expect.
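
One simple way to compare the four models side by side is the gap between the two averages. The sketch below uses the average scores reported in the cells above; the "gap" measure itself is an assumption introduced here for illustration, not a metric defined by flair or by the STS benchmark.

```python
# Average scores reported in the cells above: (similar list, dissimilar list).
reported = {
    "GloVe pool": (0.98, 0.94),
    "GRU": (0.96, 0.88),
    "BERT": (0.90, 0.92),
    "SBERT": (0.97, 0.21),
}

# Gap between the two averages; a larger gap means the lists are better separated.
gaps = {name: round(sim_avg - dis_avg, 2)
        for name, (sim_avg, dis_avg) in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} gap = {gap:+.2f}")

best = max(gaps, key=gaps.get)
```

A negative gap, as for BERT, means the dissimilar list scored higher than the similar one.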

## Tricky pairs

In [22]:
tricky_pairs=[("An elephant is bigger than a lion","A lion is bigger than an elephant") ,("the cat sat on the mat","the mat sat on the cat")]

In [23]:
evaluate(glove_pool_embeddings, tricky_pairs)

([1.0, 1.0], 1.0)

In [24]:
evaluate(gru_embeddings, tricky_pairs)

([0.86, 0.59], 0.72)

In [25]:
evaluate(bert_embeddings, tricky_pairs)

([1.0, 0.98], 0.99)

In [26]:
evaluate(sbert_embeddings, tricky_pairs)

([0.93, 0.97], 0.95)

Interesting! The scores are very high because a sentence similarity model works much like topic detection: it measures content similarity. The sentences in each pair share the same content (a lion and an elephant, or a cat and a mat) even though they contradict each other, so the models produce a high similarity score. Since the GloVe pooling method averages the word vectors without regard to word order, it treats the two sentences as identical. The GRU model, on the other hand, produces lower values because it is sensitive to word order. Surprisingly, even the SBERT model does not separate these pairs well, which may be due to the content similarity-based supervision used to train it.
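
The order-insensitivity of average pooling can be checked without any embedding model. The sketch below assumes a simplified tokenization (lowercasing and whitespace splitting, which is not what flair does internally) and a hypothetical `bag_of_words` helper. If two sentences contain exactly the same words with the same counts, the average of their word vectors must be identical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercased, whitespace-split token counts; a simplification of real tokenization."""
    return Counter(sentence.lower().split())

tricky_pairs = [
    ("An elephant is bigger than a lion", "A lion is bigger than an elephant"),
    ("the cat sat on the mat", "the mat sat on the cat"),
]

# True when both sentences of a pair consist of the same words with the same counts.
same_bag = [bag_of_words(s1) == bag_of_words(s2) for s1, s2 in tricky_pairs]
print(same_bag)
```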

Detecting such contradictions is a natural language inference (NLI) task rather than a similarity task. The following cell passes the same tricky pairs to an NLI model, `joeddav/xlm-roberta-large-xnli`, and prints the predicted class along with the softmax scores:

In [28]:
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

for premise, hypothesis in tricky_pairs:
    # encode the pair; if the input is too long, truncate only the premise
    x = tokenizer.encode(premise, hypothesis, return_tensors='pt', truncation='only_first')
    logits = nli_model(x)[0]
    probs = logits.softmax(dim=1)[0].detach().numpy()   # class probabilities
    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print("Top Class:")
    print(nli_model.config.id2label[int(np.argmax(probs))])
    print("Full softmax scores:")
    for i in range(3):
        print(nli_model.config.id2label[i], probs[i])

    print("="*20)

Premise: An elephant is bigger than a lion
Hypothesis: A lion is bigger than an elephant
Top Class:
contradiction
Full softmax scores:
contradiction 0.9954543
neutral 0.00049089367
entailment 0.0040547904
====================
Premise: the cat sat on the mat
Hypothesis: the mat sat on the cat
Top Class:
entailment
Full softmax scores:
contradiction 0.49365252
neutral 0.007260751
entailment 0.49908674
====================

The NLI model flags the first pair as a contradiction with high confidence, whereas the second pair is nearly a tie between entailment and contradiction.
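
The softmax scores printed above are obtained by normalizing the raw model outputs (logits) into probabilities. Here is a minimal sketch in plain Python; the `softmax` helper and the logit values are hypothetical, chosen only to mimic a near tie like the one in the second pair.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order contradiction, neutral, entailment.
labels = ["contradiction", "neutral", "entailment"]
probs = softmax([2.0, -2.0, 2.0])
for label, p in zip(labels, probs):
    print(f"{label:<13} {p:.3f}")
```

When two logits are close, their probabilities are close as well, so the top class is decided by a very small margin and should be read with caution.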
