# Literature Review

## 1. Transformers for molecular property prediction: Domain adaptation efficiently improves performance (22 May 2025) https://arxiv.org/pdf/2503.03360

"In this study, we systematically investigated the performance of transformer-based models for seven molecular property
datasets, evaluating the effects of pre-training strategies and domain adaptation objectives. Our results show that
while large-scale pre-training with generic objectives like masked language modeling (MLM) offers some benefit,
performance plateaus beyond a certain scale. In contrast, domain adaptation using a chemically informed multi-task
regression (MTR) objective on domain molecules led to consistent and statistically significant improvements across
diverse ADME datasets, even when applied to ≤ 4K molecules."

## 2. A review of transformers in drug discovery and beyond (30 August 2024) https://www.sciencedirect.com/science/article/pii/S2095177924001783

"In this review, we provide a comprehensive overview of the applications of transformer-based models in drug discovery, as well as chemistry and biology. Specifically, we discuss primary areas, such as protein design and protein engineering, MD, drug target identification, transformer-enabled drug VS, drug lead optimization, drug addiction, small data set challenges, chemical and biological image analysis, chemical language understanding, and single cell data"

## 3. AI-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMs) in drug discovery and development (12 February 2025) https://www.sciencedirect.com/science/article/pii/S2090123225001092

"Integrating prompt-engineering LMs, LLMs and MLLMs represent a transformative leap in drug discovery and development. These sophisticated AI technologies facilitate the precise and efficient identification of drug candidates and offer deeper insights into complex biological processes by synthesizing extensive and varied datasets."

# Prototype 1: LLM (DistilBERT) evaluates drug efficacy based on text descriptions

Testing DistilBERT https://huggingface.co/docs/transformers/en/model_doc/distilbert

DistilBERT is pretrained by knowledge distillation, which yields a smaller model with faster inference that requires less compute to train. Through a triple loss objective during pretraining (language modeling loss, distillation loss, and cosine-distance loss), DistilBERT achieves performance similar to that of a larger transformer language model.
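
To make the triple loss concrete, here is a minimal pure-Python sketch of how the three terms could be combined for a single token position. It is an illustration under stated assumptions, not DistilBERT's actual training code: the function names, the temperature of 2.0, and the equal weights are all hypothetical, and the distillation term is written as a cross-entropy against the teacher's softened distribution (for a fixed teacher this differs from a KL divergence only by a constant).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def language_modeling_loss(student_logits, target_index):
    """Cross-entropy of the student against the true (masked) token."""
    return -math.log(softmax(student_logits)[target_index])

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def cosine_distance_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are placeholders."""
    w_lm, w_distil, w_cos = weights
    return (
        w_lm * language_modeling_loss(student_logits, target_index)
        + w_distil * distillation_loss(student_logits, teacher_logits)
        + w_cos * cosine_distance_loss(student_hidden, teacher_hidden)
    )
```

The cosine term is zero when the student's hidden vector points in the same direction as the teacher's, regardless of magnitude, so it constrains direction only.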

In [None]:
!pip install transformers datasets torch --quiet

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load a pretrained transformer model for sequence classification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Sample drug formulation descriptions (mock)
formulations = [
    "This drug formulation includes ibuprofen combined with a lipid nanoparticle for extended release.",
    "The compound contains unstable enzymes that degrade quickly in the bloodstream.",
    "A novel combination of amoxicillin and clavulanic acid with enhanced bioavailability."
]

# Simulate efficacy prediction
print("Predicted Efficacy Labels (Mock):")
for desc in formulations:
    result = classifier(desc)[0]
    label = result["label"]
    score = result["score"]

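    # Mock efficacy mapping based on sentiment as proxy.
    # Note: the pipeline reports the score of the winning label, which is at
    # least 0.5 for this two-class model, so NEGATIVE inputs land in "Medium"
    # and the "Low" branch is effectively unreachable.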
    if label == "POSITIVE":
        efficacy = "High"
    elif score > 0.5:
        efficacy = "Medium"
    else:
        efficacy = "Low"

    print(f"\nInput: {desc}\nPredicted efficacy: {efficacy} (Confidence: {score:.2f})")

Predicted Efficacy Labels (Mock):

Input: This drug formulation includes ibuprofen combined with a lipid nanoparticle for extended release.
Predicted efficacy: Medium (Confidence: 0.98)

Input: The compound contains unstable enzymes that degrade quickly in the bloodstream.
Predicted efficacy: Medium (Confidence: 1.00)

Input: A novel combination of amoxicillin and clavulanic acid with enhanced bioavailability.
Predicted efficacy: High (Confidence: 1.00)
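
Both inputs labeled Medium above must have been classified NEGATIVE, because a POSITIVE label always maps to High. The reported confidence is the score of the winning class, so for a two-class model it is at least 0.5 and the Low category never appears. One possible alternative, sketched here as an assumption rather than as the notebook's method, converts the label and score into a positive-class probability and buckets that value. The function name and both cutoffs are hypothetical, and sentiment remains only a mock proxy for efficacy.

```python
def efficacy_from_sentiment(label, score, low_cutoff=1 / 3, high_cutoff=2 / 3):
    """Map a binary sentiment result to a mock efficacy bucket.

    label: "POSITIVE" or "NEGATIVE" (the winning class)
    score: probability of the winning class, between 0.5 and 1.0
    """
    # Express every result as the probability of the POSITIVE class.
    p_positive = score if label == "POSITIVE" else 1.0 - score
    if p_positive >= high_cutoff:
        return "High"
    if p_positive >= low_cutoff:
        return "Medium"
    return "Low"
```

With this mapping a confident NEGATIVE result falls into Low, and only results close to the decision boundary fall into Medium.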


# Prototype 2: LLM (GPT-2) evaluates drug efficacy based on features like: molecular_weight, logP, num_hbond_donors


## 1. Molecular Weight (MW)
Definition: The total mass of a molecule (sum of atomic weights of all atoms).

Unit: Dalton (Da)

Significance:
*   Affects absorption and permeability.
*   Smaller molecules (typically <500 Da) are more likely to be orally bioavailable.

Lipinski's Rule of Five suggests: MW < 500 for good absorption.

## 2. logP (Partition Coefficient)
Definition: The logarithm of a compound’s partition coefficient between octanol and water.

Formula: logP = log10([drug]_octanol / [drug]_water)

Significance:

* Measures lipophilicity: how soluble the compound is in fat vs. water.

* High logP → lipophilic → crosses cell membranes more easily but might have low solubility.

* Low logP → hydrophilic → better solubility, but poor permeability.

Ideal range: logP < 5
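
As a worked example of the formula above, the following sketch computes logP from two equilibrium concentrations. The function name and the concentration values are made up for illustration; only the formula comes from the text.

```python
import math

def log_p(conc_octanol, conc_water):
    """logP = log10([drug]_octanol / [drug]_water); both concentrations in the same unit."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# A compound 100x more concentrated in octanol than in water is lipophilic.
print(round(log_p(100.0, 1.0), 2))  # 2.0
# A compound 10x more concentrated in water than in octanol is hydrophilic.
print(round(log_p(1.0, 10.0), 2))   # -1.0
```

Each unit of logP corresponds to a tenfold change in the octanol-to-water concentration ratio, so logP = 5 means a ratio of 100,000 to 1.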

## 3. num_hbond_donors (Number of Hydrogen Bond Donors)
Definition: Number of groups in the molecule that can donate a hydrogen bond (e.g. –OH, –NH groups)

Significance:

* Too many donors can reduce membrane permeability (can’t pass through lipid bilayer).

* Affects solubility and binding to target proteins.

Rule of thumb: ≤ 5 donors for good oral bioavailability.
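
The three thresholds above can be checked together with a small helper. This is a minimal sketch that covers only the three features described in this section, so it is a partial check rather than the complete Rule of Five; the function name is hypothetical.

```python
def rule_of_five_violations(molecular_weight, logp, num_hbond_donors):
    """Return the names of the features that break the thresholds listed above."""
    violations = []
    if molecular_weight >= 500:   # text: MW < 500 Da
        violations.append("molecular_weight")
    if logp >= 5:                 # text: logP < 5
        violations.append("logP")
    if num_hbond_donors > 5:      # text: at most 5 donors
        violations.append("num_hbond_donors")
    return violations

print(rule_of_five_violations(280, 3.2, 1))  # []
print(rule_of_five_violations(510, 5.5, 4))  # ['molecular_weight', 'logP']
```

An empty list means the compound passes all three checks; a non-empty list names the properties to look at first.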

In [None]:
!pip install transformers torch pandas --quiet

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import numpy as np
import torch
import random


model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Simulate drug data with 3 features
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)  # also seed PyTorch so that sampled generations are repeatable

n_samples = 5
data = {
    "molecular_weight": np.random.normal(300, 50, n_samples),
    "logP": np.random.normal(3, 1, n_samples),
    "num_hbond_donors": np.random.randint(0, 5, n_samples),
}

df = pd.DataFrame(data)

# Few-shot prompt using only 3 features
few_shot_prompt = """Given the following drug formulation features, predict the efficacy category: High, Medium, or Low.

Example 1:
- Molecular weight: 280
- logP: 3.2
- H-bond donors: 1
Efficacy: High

Example 2:
- Molecular weight: 360
- logP: 4.8
- H-bond donors: 2
Efficacy: Medium

Example 3:
- Molecular weight: 510
- logP: 5.5
- H-bond donors: 4
Efficacy: Low

Now classify the following compound:
"""

# Generate predictions
print("LLM-Predicted Efficacy Labels:\n")
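# Note: iterrows() upcasts each row to float, so the integer donor count
# is rendered in the prompt as e.g. "3.0".
for idx, row in df.iterrows():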
    compound_desc = (
        f"- Molecular weight: {round(row['molecular_weight'], 1)}\n"
        f"- logP: {round(row['logP'], 2)}\n"
        f"- H-bond donors: {row['num_hbond_donors']}\n"
        f"Efficacy:"
    )

    full_prompt = few_shot_prompt + compound_desc

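    # temperature only takes effect when sampling, so request sampling explicitly.
    result = generator(
        full_prompt,
        max_new_tokens=10,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )[0]["generated_text"]
    # The pipeline returns the prompt plus the continuation; keep only the
    # continuation and guard against an empty generation.
    tokens = result[len(full_prompt):].split()
    prediction = tokens[0] if tokens else "(no output)"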
    print(f"Input features (row {idx + 1}):")
    print(compound_desc)
    print(f"Predicted efficacy: {prediction}\n")


LLM-Predicted Efficacy Labels:

Input features (row 1):
- Molecular weight: 324.8
- logP: 2.77
- H-bond donors: 3.0
Efficacy:
Predicted efficacy: High

Input features (row 2):
- Molecular weight: 293.1
- logP: 4.58
- H-bond donors: 4.0
Efficacy:
Predicted efficacy: High

Input features (row 3):
- Molecular weight: 332.4
- logP: 3.77
- H-bond donors: 0.0
Efficacy:
Predicted efficacy: Medium

Input features (row 4):
- Molecular weight: 376.2
- logP: 2.53
- H-bond donors: 3.0
Efficacy:
Predicted efficacy: High

Input features (row 5):
- Molecular weight: 288.3
- logP: 3.54
- H-bond donors: 1.0
Efficacy:
Predicted efficacy: High
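
GPT-2 is not constrained to answer with one of the three categories, so taking the first generated word can return arbitrary text. A stricter parser could accept only the known labels. The sketch below is one way to do that, under the assumption that the generated text starts with the prompt (the default behavior of the text-generation pipeline); the function name is hypothetical.

```python
import re

# Accept only the three categories named in the few-shot prompt.
_LABEL_PATTERN = re.compile(r"\b(high|medium|low)\b", re.IGNORECASE)

def parse_efficacy(generated_text, prompt):
    """Return "High", "Medium", or "Low" from the continuation, or None if absent."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        # Fallback: use whatever follows the last "Efficacy:" marker.
        continuation = generated_text.rsplit("Efficacy:", 1)[-1]
    match = _LABEL_PATTERN.search(continuation)
    return match.group(1).capitalize() if match else None
```

Returning `None` for unrecognized output makes it possible to count how often the model fails to produce a valid label instead of silently treating any word as a prediction.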

