<a href="https://colab.research.google.com/github/kjahan/semantic_similarity/blob/main/examples/colab/sentence_transformers_limitations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal: explore limitations of sentence transformers

**Install necessary packages**

[Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)


In [1]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Building wheels for collected p

In [2]:
from sentence_transformers import SentenceTransformer, util

from transformers import BertTokenizer
from transformers import MPNetTokenizer, MPNetModel
import torch

import numpy as np
import pandas as pd

import random

## Load pre-trained model

https://www.sbert.net/docs/pretrained_models.html

In [3]:
model_name_1 = 'all-mpnet-base-v2'
model_1 = SentenceTransformer(model_name_1)

## Last year's model from Sentence Transformers

`paraphrase-mpnet-base-v2`

In [4]:
model_name_2 = 'paraphrase-mpnet-base-v2'
model_2 = SentenceTransformer(model_name_2)

## Simple test

In [11]:
q1 = "what is the united stats population?"
q2 = "how many people live in the united states?"

# Compute an embedding for each question
emb_1 = model_1.encode(q1, convert_to_tensor=True)
emb_2 = model_1.encode(q2, convert_to_tensor=True)

# Compute the cosine similarity between the two embeddings
cosine_sim = util.pytorch_cos_sim(emb_1, emb_2).item()
print("Cosine sim from : {} - model name: {}".format(round(cosine_sim, 3), model_name_1))

Cosine sim from : 0.7886649370193481 - model name: all-mpnet-base-v2


In [12]:
q1 = "what is the united stats population?"
q2 = "how many people live in the united states?"

# Compute an embedding for each question
emb_1 = model_2.encode(q1, convert_to_tensor=True)
emb_2 = model_2.encode(q2, convert_to_tensor=True)

# Compute the cosine similarity between the two embeddings
cosine_sim = util.pytorch_cos_sim(emb_1, emb_2).item()
print("Cosine sim from : {} - model name: {}".format(round(cosine_sim, 3), model_name_2))

Cosine sim from : 0.773 - model name: paraphrase-mpnet-base-v2
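
For reference, the score printed above is the dot product of the two embeddings divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula. The helper name `cosine_similarity` and the toy vectors are made up for illustration; they are not real model embeddings and this is not the library's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (assumptions, not model output)
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # same direction as toy_a, so the score is close to 1
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a, so the score is 0

print(round(cosine_similarity(toy_a, toy_b), 3))
print(round(cosine_similarity(toy_a, toy_c), 3))
```

The score depends only on direction, not on vector length, which is why `toy_a` and `toy_b` score as identical.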


## Test MCQ (multiple-choice questions)

Chemistry problems:

https://www.chem.tamu.edu/class/fyp/mcquest/ch1.html
https://www.chem.tamu.edu/class/fyp/mcquest/mcquest.html

In [13]:
test_1 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
"""

test_1_reordered_1 = """
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
Of the following name/symbol combinations of elements, which one is WRONG?
"""

test_1_reordered_2 = """
uranium/U
sulfur/S
nitrogen/N
potassium/K
iron/I
(a)
(b)
(c)
(e)
(d)
Of the following name/symbol combinations of elements, which one is WRONG?
"""

test_1_reordered_3 = """
(a)
iron/I
(b)
potassium/K
(c)
nitrogen/N
(d) 
sulfur/S
(e) 
uranium/U
Of the following name/symbol combinations of elements, which one is WRONG?
"""

test_1_truncated_4 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
"""

test_1_truncated_5 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(d) potassium/K
(e) iron/I
"""

test_1_truncated_6 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
"""

test_1_truncated_7 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
"""

test_1_truncated_8 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
"""

test_1_truncated_9 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(b) sulfur/S
"""

test_1_truncated_10 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(c) nitrogen/N
"""

test_1_truncated_11 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(d) potassium/K
"""

test_1_truncated_12 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(e) iron/I
"""

test_1_truncated_13 = """
Of the following name/symbol combinations of elements, which one is WRONG?
(e) iron/Fe
"""
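
The variants above are written out by hand. As a rough sketch, similar reordered and truncated variants could also be generated programmatically. The helpers `build_mcq` and `shuffled_variant` are hypothetical names introduced here, and the seed value is arbitrary.

```python
import random


def build_mcq(question, options, question_first=True):
    """Format a question and its options in the same layout as test_1."""
    labels = "abcdefghijklmnopqrstuvwxyz"
    lines = ["({}) {}".format(labels[i], option) for i, option in enumerate(options)]
    if question_first:
        lines.insert(0, question)
    else:
        lines.append(question)
    return "\n" + "\n".join(lines) + "\n"


def shuffled_variant(question, options, seed):
    """Same question, options in a reproducible random order."""
    rng = random.Random(seed)
    shuffled = list(options)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return build_mcq(question, shuffled)


question = "Of the following name/symbol combinations of elements, which one is WRONG?"
options = ["uranium/U", "sulfur/S", "nitrogen/N", "potassium/K", "iron/I"]

original = build_mcq(question, options)                              # same text as test_1
question_last = build_mcq(question, options, question_first=False)   # like test_1_reordered_1
truncated = build_mcq(question, options[:4])                         # like test_1_truncated_4
shuffled = shuffled_variant(question, options, seed=0)

print(shuffled)
```

Generating the variants this way makes it easier to run the same comparison on other questions without retyping each permutation.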

In [17]:
queries_2 = [test_1_reordered_1, test_1_reordered_2, test_1_reordered_3, test_1_truncated_4, test_1_truncated_5, test_1_truncated_6, test_1_truncated_7, test_1_truncated_8, test_1_truncated_9, 
             test_1_truncated_10, test_1_truncated_11, test_1_truncated_12, test_1_truncated_13]
queries_1 = len(queries_2)*[test_1]

# Compute embeddings for both lists
embeddings1 = model_1.encode(queries_1, convert_to_tensor=True)
embeddings2 = model_1.encode(queries_2, convert_to_tensor=True)

# Compute the pairwise cosine similarities (entry [i][j] compares queries_1[i] with queries_2[j])
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(queries_1)):
    print("{} \t\t {} \t\t Score: {:.4f}\n\n".format(queries_1[i], queries_2[i], round(cosine_scores[i][i].item(), 2)))


Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
 		 
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
Of the following name/symbol combinations of elements, which one is WRONG?
 		 Score: 0.9800



Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
 		 
uranium/U
sulfur/S
nitrogen/N
potassium/K
iron/I
(a)
(b)
(c)
(e)
(d)
Of the following name/symbol combinations of elements, which one is WRONG?
 		 Score: 0.9800



Of the following name/symbol combinations of elements, which one is WRONG?
(a) uranium/U
(b) sulfur/S
(c) nitrogen/N
(d) potassium/K
(e) iron/I
 		 
(a)
iron/I
(b)
potassium/K
(c)
nitrogen/N
(d) 
sulfur/S
(e) 
uranium/U
Of the following name/symbol combinations of elements, which one is WRONG?
 		 Score: 0.9700



Of the following name/symbol combinations of elements, 

## Test BERT Tokenizer / OOV

https://albertauyeung.github.io/2020/06/19/bert-tokenization.html/

We can test BERT Tokenizer:
https://huggingface.co/docs/transformers/model_doc/bert

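The default BERT configuration has `vocab_size = 30522` tokens (the size of the `bert-base-uncased` vocabulary). The `bert-base-cased` tokenizer loaded below has a smaller vocabulary of 28,996 tokens.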

When the BERT model was trained, each token was given a unique ID. Hence, when we want to use a pre-trained BERT model, we will first need to convert each token in the input sentence into its corresponding unique IDs.

There is an important point to note when we use a pre-trained model. Since the model is pre-trained on a certain corpus, the vocabulary was also fixed. In other words, when we apply a pre-trained model to some other data, it is possible that some tokens in the new data might not appear in the fixed vocabulary of the pre-trained model. This is commonly known as the out-of-vocabulary (OOV) problem.

Tokens that do not appear in the original vocabulary are replaced with the special token `[UNK]`, which stands for unknown token.

However, converting all unseen tokens into [UNK] will take away a lot of information from the input data. Hence, BERT makes use of a WordPiece algorithm that breaks a word into several subwords, such that commonly seen subwords can also be represented by the model.
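
To make the subword idea concrete, here is a small sketch of the greedy longest-match-first splitting that WordPiece applies to a single word at tokenization time. The function `wordpiece_tokenize` and the four-entry `toy_vocab` are assumptions made up for illustration. A real BERT vocabulary is far larger, and the real tokenizer also handles casing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"characteristic", "##ally", "confident", "[UNK]"}

print(wordpiece_tokenize("characteristically", toy_vocab))  # ['characteristic', '##ally']
print(wordpiece_tokenize("xzsl", toy_vocab))                # ['[UNK]']
```

A word only falls back to `[UNK]` when no sequence of vocabulary pieces covers it, which is different from the word simply not being a single vocabulary entry.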

The `[UNK]` (unknown) token has ID 100 in the `BERT` vocabulary.

For the pre-training corpus, BERT used the BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia, the authors extracted only the text passages and ignored lists, tables, and headers.

https://arxiv.org/pdf/1810.04805.pdf

In [18]:
bert_tz = BertTokenizer.from_pretrained("bert-base-cased")

In [19]:
bert_tz.convert_tokens_to_ids(["characteristically"])
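# convert_tokens_to_ids looks up the whole string as a single vocabulary entry
# (no WordPiece splitting). The full word is not an entry, so it maps to UNK --> [100]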

[100]

In [38]:
sent = "He remains characteristically confident and optimistic."
bert_tz.tokenize(sent)

['He',
 'remains',
 'characteristic',
 '##ally',
 'confident',
 'and',
 'optimistic',
 '.']

In [20]:
## Persian word!!
bert_tz.convert_tokens_to_ids(["سلام"])

[100]

In [21]:
## Made up word
bert_tz.convert_tokens_to_ids(["xzslsopsksgdgfagsfagfsgafsa"])

[100]

## Let's check some special terms in BERT

1. Math formulas/symbols
2. Biology terms
3. Physics
4. Chemistry
5. Law
6. Accounting

## Math


https://www.math.ucdavis.edu/~kouba/CalcTwoDIRECTORY/areadirectory/Area.html

In [33]:
math_tokens = ["h(x)", "y=x2"]

for token in math_tokens:
  print(bert_tz.convert_tokens_to_ids([token]))

[100]
[100]


## Biology

https://ocw.mit.edu/courses/7-012-introduction-to-biology-fall-2004/resources/ps1q/

`"""
Question 2
A new startup company hires you to help with their product development. Your task is to find
a protein that interacts with a polysaccharide.
a) You find a large protein that has a single binding site for the polysaccharide cellulose.
Which amino acids might you expect to find in the binding pocket of the protein? What is the
strongest type of interaction possible between these amino acids and the cellulose?
"""`

In [34]:
bio_tokens = ["polysaccharide", "cellulose", "acids"]

for token in bio_tokens:
  print(bert_tz.convert_tokens_to_ids([token]))

[100]
[100]
[13087]


## CHEM Practice Problems

https://ocw.mit.edu/courses/3-091-introduction-to-solid-state-chemistry-fall-2018/resources/mit3_091f18_ppa/

`"""
Give the symbol ^A_Z X for these elements, all of which exist as a single isotope:
a. beryllium
b. ruthenium
c. phosphorus
d. aluminum
e. cesium
f. praseodymium
g. cobalt
h. yttrium
i. arsenic
"""`

`Calculate the molecular mass or formula mass (molar mass) of each compound:
a. V2O4 (vanadium(IV) oxide)
b. CaSiO3 (calcium silicate)
c. BiOCl (bismuth oxychloride)
d. CH3COOH (acetic acid)
e. Ag2SO4 (silver sulfate)
f. Na2CO3 (sodium carbonate)
g. (CH3)2CHOH (isopropyl alcohol)`

In [46]:
chem_tokens = ["aluminum", "yttrium", "praseodymium", "isotope", "beryllium", "ruthenium", "BiOCl", "Ag2SO4", "molecular", "(CH3)2CHOH"]

for token in chem_tokens:
  print(bert_tz.convert_tokens_to_ids([token]))

[14349]
[100]
[100]
[100]
[100]
[100]
[100]
[100]
[9546]
[100]
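
Note that `convert_tokens_to_ids` only checks whether the whole string is a single vocabulary entry, so an ID of 100 here does not by itself mean the word is lost at tokenization time (compare `characteristically` above, which `tokenize` splits into two known pieces). A complementary check is to count how many pieces each term is split into. The sketch below is tokenizer-agnostic: `tokenize` can be any callable that returns a list of pieces, for example `bert_tz.tokenize`. The helper `fragmentation_report` and the stand-in `toy_tokenize` are hypothetical. The toy splits mirror the two cases shown earlier in this notebook and are not a full tokenizer.

```python
def fragmentation_report(words, tokenize, unk_token="[UNK]"):
    """For each word, record its pieces, the piece count, and whether [UNK] appears."""
    report = {}
    for word in words:
        pieces = tokenize(word)
        report[word] = {
            "pieces": pieces,
            "n_pieces": len(pieces),
            "has_unk": unk_token in pieces,
        }
    return report


def toy_tokenize(word):
    """Stand-in tokenizer with hard-coded splits (assumption for illustration only)."""
    toy_splits = {
        "characteristically": ["characteristic", "##ally"],
        "acids": ["acids"],
    }
    return toy_splits.get(word, ["[UNK]"])


report = fragmentation_report(["characteristically", "acids", "سلام"], toy_tokenize)
for word, row in report.items():
    print(word, row["n_pieces"], row["has_unk"])
```

Terms that split into many short pieces without any `[UNK]` are still represented, but usually less faithfully than terms that are a single entry.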


## MPNet Tokenizer

OOV tests

https://huggingface.co/docs/transformers/model_doc/mpnet#transformers.MPNetTokenizerFast


MPNet tokenizer:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mpnet/tokenization_mpnet.py

In [39]:
mpnet_tz = MPNetTokenizer.from_pretrained("microsoft/mpnet-base")
model = MPNetModel.from_pretrained("microsoft/mpnet-base")

Some weights of the model checkpoint at microsoft/mpnet-base were not used when initializing MPNetModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing MPNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MPNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.weight', 'mpnet.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predi

In [40]:
mpnet_tz.convert_tokens_to_ids(["characteristically"])
# Whole-string vocabulary lookup (no subword splitting): UNK --> [104], so the UNK token ID is 104 in MPNet

[104]

In [41]:
sent = "He remains characteristically confident and optimistic."
mpnet_tz.tokenize(sent)

['he',
 'remains',
 'characteristic',
 '##ally',
 'confident',
 'and',
 'optimistic',
 '.']

In [44]:
mpnet_tz.convert_tokens_to_ids(["سلام"])

[104]

## CHEM, BIO, MATH

In [47]:
tokens = ["h(x)", "y=x2"] + ["polysaccharide", "cellulose", "acids"] + ["aluminum", "yttrium", "praseodymium", "isotope", "beryllium", "ruthenium", "BiOCl", "Ag2SO4", "molecular", "(CH3)2CHOH"]

for token in tokens:
  print(mpnet_tz.convert_tokens_to_ids([token]))

[104]
[104]
[104]
[104]
[12741]
[13065]
[104]
[104]
[28850]
[104]
[104]
[104]
[104]
[8386]
[104]
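
To summarize the two whole-token lookups, the sketch below computes the share of the 15 test terms that map to the unknown ID in each vocabulary. The ID lists are copied from the outputs recorded in this notebook, in the same order (math, biology, chemistry). The helper name `unk_rate` is an assumption introduced here.

```python
def unk_rate(ids, unk_id):
    """Fraction of looked-up IDs that equal the tokenizer's unknown-token ID."""
    if not ids:
        raise ValueError("ids must not be empty")
    return sum(1 for token_id in ids if token_id == unk_id) / len(ids)


# IDs copied from the recorded cell outputs (bert-base-cased, UNK = 100)
bert_ids = [100, 100, 100, 100, 13087, 14349, 100, 100, 100, 100, 100, 100, 100, 9546, 100]
# IDs copied from the recorded cell outputs (microsoft/mpnet-base, UNK = 104)
mpnet_ids = [104, 104, 104, 104, 12741, 13065, 104, 104, 28850, 104, 104, 104, 104, 8386, 104]

print("BERT (cased): {:.2f}".format(unk_rate(bert_ids, unk_id=100)))  # BERT (cased): 0.80
print("MPNet: {:.2f}".format(unk_rate(mpnet_ids, unk_id=104)))        # MPNet: 0.73
```

On this small sample the only difference is `isotope`, which is a single entry in the MPNet vocabulary but not in the cased BERT one.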
