In [1]:
%load_ext autoreload
%autoreload 2

Let's benchmark LLM representations against some common representations used in molecular modelling.


We will use the same setup found in the [molfeat benchmark](https://molfeat-docs.datamol.io/stable/benchmark.html). Note that this is not an extensive benchmark, and therefore the outcomes should not be taken as a definitive conclusion. 

The molfeat benchmark used the following representations: **ECFP6**, **Mordred** and **ChemBERTa**. We keep the same setup and reuse its reported results to avoid rerunning those experiments.

Furthermore, because LLMs are computationally costly, we will only run the _Lipophilicity_ benchmark.

In our experiments, we consider the following featurizers:

- **openai/text-embedding-ada-002**: the default OpenAI embedding model
- **sentence-transformers/all-mpnet-base-v2**: a popular [sentence embedding model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) that maps text into a 768-dimensional dense vector.
- **openai/gpt-3.5-turbo**: the OpenAI instruction-following model that backs ChatGPT (referred to as `openai/chatgpt` in the code and results)
- **hkunlp/instructor-large**: an [instruction-conditioned model](https://huggingface.co/hkunlp/instructor-large) for embedding generation. 

<div class="admonition tip highlight">
<p class="admonition-title">TL;DR: Can non-finetuned LLMs outperform hand-crafted or pretrained molecular featurizers?</p>
<p>
<strong>No!</strong> An understanding of molecular context, structure, and properties is key to building good molecular featurizers.
</p>
</div>


```bash
! pip install auto-sklearn
```

In [2]:
import os
import warnings
import numpy as np
import pandas as pd
import datamol as dm
import fsspec

import matplotlib.pyplot as plt
import autosklearn.classification
import autosklearn.regression
from tqdm.auto import tqdm
from collections import defaultdict
from rdkit.Chem import SaltRemover

from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

from molfeat.utils.cache import FileCache
from molfeat.trans.base import PrecomputedMolTransformer
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
from molfeat_hype.trans.llm_embeddings import LLMTransformer
from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer


In [3]:
# Making the output less verbose
warnings.simplefilter("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dm.disable_rdkit_log()

In [4]:
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
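    # Parse without sanitizing first, so that sanitization failures can be caught below
    mol = dm.to_mol(smi, ordered=True, sanitize=False)
    try:
        mol = dm.sanitize_mol(mol)
    except Exception:
        mol = None

    # Molecules that cannot be sanitized are dropped by returning None
    if mol is None:
        return None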
        
    mol = dm.standardize_mol(mol, disconnect_metals=True)
    remover = SaltRemover.SaltRemover()
    mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))
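
To illustrate why the split is grouped by scaffold, here is a minimal, dependency-free sketch of a group-aware split. It is not `GroupShuffleSplit` itself: the helper `group_split` and the toy scaffold labels are hypothetical, and the only property it shares with the function above is that every scaffold ends up entirely in the train set or entirely in the test set.

```python
import random


def group_split(groups, test_size=0.2, seed=42):
    """Split indices so that no group appears in both train and test.

    Hypothetical helper for illustration only; scikit-learn's
    GroupShuffleSplit uses its own sampling procedure.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Hold out a fraction of the *groups* (at least one), not of the samples
    n_test_groups = max(1, round(test_size * len(unique_groups)))
    test_groups = set(unique_groups[:n_test_groups])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy scaffold labels, one entry per molecule (made up for this example)
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1", "C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = group_split(scaffolds)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}

# The intersection is empty: test molecules always come from unseen scaffolds
print(train_scaffolds & test_scaffolds)
```

Because whole scaffolds are held out, the test score reflects generalization to new chemical series rather than to close analogues of training molecules.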


Classic embeddings

In [5]:
openai_api_key = os.environ.get("OPENAI_API_KEY", None)

In [6]:
openai_ada_cache = FileCache(cache_file="../../cache/openai_ada_cache.parquet", name="openai_ada_cache")
transf_openai_ada = LLMTransformer(kind="openai/text-embedding-ada-002", openai_api_key=openai_api_key, precompute_cache=openai_ada_cache)

In [7]:
sent_trans_cache = FileCache(cache_file="../../cache/sentence_transformer.parquet", name="sent_trans_cache")
transf_sentence = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2", precompute_cache=sent_trans_cache)

Instruct embeddings

In [8]:
cond_embed_cache = FileCache(cache_file="../../cache/cond_embed.parquet", name="cond_embed_cache")
transf_cond_embed = InstructLLMTransformer(kind="hkunlp/instructor-large", precompute_cache=cond_embed_cache)

load INSTRUCTOR_Transformer
max_seq_length  512


#### Lipophilicity
Lipophilicity is a regression task with 4200 molecules.

In [9]:
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")

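smiles = [preprocess_smiles(smi) for smi in smiles]
# Keep only molecules that survived preprocessing, and filter the labels
# with the same mask so that smiles and y_true stay aligned
valid = np.array([smi is not None and dm.to_mol(smi) is not None for smi in smiles])
smiles = np.array([smi for smi, keep in zip(smiles, valid) if keep])
y_true = y_true[valid]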

feats_openai_ada, ind_openai_ada = transf_openai_ada(smiles, ignore_errors=True)

X = {
    "openai/text-embedding-ada-002": feats_openai_ada[ind_openai_ada],
}

In [10]:
feats_sentence, ind_sentence = transf_sentence(smiles, ignore_errors=True)
X["sentence-transformers/all-mpnet-base-v2"] = feats_sentence[ind_sentence]

In [11]:
# feats_cond_embed = transf_cond_embed.batch_transform(transf_cond_embed, smiles, batch_size=512, n_jobs=8, progress=True)
# ind_cond_embed = np.arange(len(smiles))
# X["hkunlp/instructor-large"] = feats_cond_embed[ind_cond_embed]

In [12]:
# mols = [dm.to_mol(smi) for smi in smiles]
# _cache = dict(zip(mols, feats_cond_embed))
# transf_cond_embed.precompute_cache.update(_cache)

In [13]:
feats_cond_embed, ind_cond_embed = transf_cond_embed(smiles, ignore_errors=True)
X["hkunlp/instructor-large"] = feats_cond_embed[ind_cond_embed]

In [14]:
! mkdir -p ../../cache/

In [15]:
# transf_base_chatgpt = InstructLLMTransformer(kind="openai/chatgpt", embedding_size=16, openai_api_key=openai_api_key, precompute_cache=False, conv_buffer_size=4, request_timeout=300)
# transf_chatgpt = PrecomputedMolTransformer(cache=chatgpt_cache, featurizer=transf_base_chatgpt)
#feats_chatgpt = transf_chatgpt.batch_transform(transf_chatgpt, smiles, batch_size=16, n_jobs=-1)
#X["openai/chatgpt"] = feats_chatgpt[ind_chatgpt]
# chatgpt_cache.update(transf_chatgpt.cache)
# chatgpt_cache.save_to_file()

In [16]:
chatgpt_cache = FileCache(cache_file="../../cache/chatgpt.parquet", name="chatgpt_cache")
transf_chatgpt = InstructLLMTransformer(kind="openai/chatgpt", embedding_size=16, openai_api_key=openai_api_key, conv_buffer_size=3, request_timeout=300, precompute_cache=chatgpt_cache, batch_size=4)

In [17]:
# for k, x in transf_chatgpt.precompute_cache.cache.copy().items():
#     if x is None  or np.any(np.isnan(x)):
#         del transf_chatgpt.precompute_cache.cache[k]

In [18]:
feats_chatgpt, ind_chatgpt = transf_chatgpt(smiles, ignore_errors=True)
X["openai/chatgpt"] = feats_chatgpt#[ind_chatgpt]

In [19]:
transf_sentence.precompute_cache.save_to_file()
transf_openai_ada.precompute_cache.save_to_file()
transf_cond_embed.precompute_cache.save_to_file()
transf_chatgpt.precompute_cache.save_to_file()

In [25]:
# Train a model
train_ind, test_ind = scaffold_split(smiles)

lipo_scores = {}
for name, feats in X.items():
    # print(name, feats.shape, y_true.shape, np.any(np.isnan(feats)))
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=None, 
        # For practicality's sake, limit each run to 3 minutes (180 s)!
        # (x4 featurizers = 12 min in total)
        time_left_for_this_task=180,  
        n_jobs=1,
        seed=1,
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    
    # Predict on the held-out test set
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    lipo_scores[name] = mae

lipo_scores

	Models besides current dummy model: 0
	Dummy models: 1


{'openai/text-embedding-ada-002': 0.8497511191899347,
 'sentence-transformers/all-mpnet-base-v2': 0.846618632301065,
 'hkunlp/instructor-large': 0.8109557972763406,
 'openai/chatgpt': 0.9238916528125248}
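
For reference, the MAE values above are the mean of the absolute differences between measured and predicted lipophilicity. The sketch below computes it by hand on made-up numbers and compares it with a constant predictor that always outputs the training mean, which is roughly what a dummy regressor does. All values and names are illustrative assumptions, not data from this benchmark.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up values for illustration only
y_train = [1.0, 2.0, 3.0]
y_test = [1.5, 2.5, 3.5]
y_model = [1.0, 3.0, 3.0]  # predictions from some hypothetical model

# Constant baseline: always predict the mean of the training labels
baseline = [mean(y_train)] * len(y_test)

print("model MAE:", mae(y_test, y_model))
print("baseline MAE:", mae(y_test, baseline))
```

A featurizer whose model scores close to such a baseline has contributed little beyond the label mean.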

#### Conclusion

| Dataset       | Metric | Representation                          | Score | Rank (0 = best) |
|---------------|--------|-----------------------------------------|-------|-----------------|
| Lipophilicity | MAE ↓  | ECFP                                    | 0.727 | 1               |
|               |        | Mordred                                 | 0.579 | 0               |
|               |        | ChemBERTa                               | 0.740 | 2               |
|               |        | openai/text-embedding-ada-002           | 0.850 | 5               |
|               |        | sentence-transformers/all-mpnet-base-v2 | 0.847 | 4               |
|               |        | hkunlp/instructor-large                 | 0.811 | 3               |
|               |        | openai/chatgpt                          | 0.924 | 6               |


Unsurprisingly, models built on LLM embeddings without any fine-tuning performed worse than featurizers that are aware of molecular structure, context, or properties, and some of them were no better than a random model. Interestingly, the instruction-conditioned embedding (hkunlp/instructor-large, MAE 0.811) performed best among the LLM featurizers, while ChatGPT (MAE 0.924) was the worst.
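
The ranks in the table can be reproduced from the scores alone. The following is a small sketch that assumes the rounded MAE values shown in the table and a 0-based rank where lower MAE is better; the variable names are illustrative.

```python
# MAE scores copied from the table above (lower is better)
scores = {
    "ECFP": 0.727,
    "Mordred": 0.579,
    "ChemBERTa": 0.740,
    "openai/text-embedding-ada-002": 0.850,
    "sentence-transformers/all-mpnet-base-v2": 0.847,
    "hkunlp/instructor-large": 0.811,
    "openai/chatgpt": 0.924,
}

# Sort by ascending MAE and assign 0-based ranks, so rank 0 is the best score
ranked = sorted(scores, key=scores.get)
ranks = {name: rank for rank, name in enumerate(ranked)}

for name in ranked:
    print(f"{ranks[name]}  {scores[name]:.3f}  {name}")
```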