<a href="https://colab.research.google.com/github/kjahan/semantic_similarity/blob/main/examples/colab/sbert_chem_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal: explore limitations of sentence transformers

We want to probe the limitations of SBERT and MPNet when they are applied to chemical formulas.

In [1]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)

In [2]:
from sentence_transformers import SentenceTransformer, util

from transformers import BertTokenizer
from transformers import MPNetTokenizer, MPNetModel
import torch

import numpy as np
import pandas as pd

import random

## Load pre-trained model

https://www.sbert.net/docs/pretrained_models.html

In [3]:
model_name_1 = 'all-mpnet-base-v2'
model_1 = SentenceTransformer(model_name_1)

## Test 1

In [4]:
q1 = "H2O2"
q2 = "H2O"

# Compute an embedding for each formula
emb_1 = model_1.encode(q1, convert_to_tensor=True)
emb_2 = model_1.encode(q2, convert_to_tensor=True)

# Compute the cosine similarity between the two embeddings
cosine_sim = util.pytorch_cos_sim(emb_1, emb_2).item()
print("Cosine sim from : {} - model name: {}".format(round(cosine_sim, 3), model_name_1))

Cosine sim from : 0.947 - model name: all-mpnet-base-v2
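
To make the score above concrete, here is a minimal sketch of the cosine-similarity formula itself, written in plain Python. The vectors `toy_emb_1` and `toy_emb_2` are invented for illustration and are not real model embeddings; the helper name `cosine_similarity` is also an assumption of this sketch, not part of `sentence_transformers`.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", made up for this example only.
toy_emb_1 = [1.0, 2.0, 3.0]
toy_emb_2 = [1.0, 2.0, 2.5]

toy_sim = cosine_similarity(toy_emb_1, toy_emb_2)
print(round(toy_sim, 3))
```

A value close to 1 means the two vectors point in almost the same direction, which is why a score of 0.947 for "H2O2" and "H2O" suggests the model treats the two formulas as near-synonyms.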


## MPNet Tokenizer

Out-of-vocabulary (OOV) tests

https://huggingface.co/docs/transformers/model_doc/mpnet#transformers.MPNetTokenizerFast


MPNet tokenizer:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mpnet/tokenization_mpnet.py

In [5]:
mpnet_tz = MPNetTokenizer.from_pretrained("microsoft/mpnet-base")

## Test 2

In [8]:
q = "H2O2S4"
mpnet_tz.tokenize(q)

['h', '##2', '##o', '##2', '##s', '##4']

## Test 3

In [9]:
q = "H2OS6"
mpnet_tz.tokenize(q)

['h', '##2', '##os', '##6']
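
The splits in Tests 2 and 3 are consistent with WordPiece-style greedy longest-match-first tokenization on lowercased input. The sketch below is a simplified re-implementation of that idea, not the Hugging Face code. `TOY_VOCAB` is a hypothetical seven-entry vocabulary chosen only so that the two formulas above can be segmented; the real MPNet vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    word = word.lower()  # the outputs above are lowercased
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched: the whole word is out of vocabulary.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"h", "##2", "##o", "##os", "##s", "##4", "##6"}

print(wordpiece_tokenize("H2O2S4", TOY_VOCAB))
print(wordpiece_tokenize("H2OS6", TOY_VOCAB))
```

With this toy vocabulary, "H2OS6" yields `##os` rather than `##o` followed by `##s`, because the longer piece is tried first. The formula is therefore cut along vocabulary boundaries rather than element boundaries, which is one plausible reason the embeddings carry little chemical meaning.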