# Task 1: Fine-tune Chemical Language Model on Lipophilicity

In this notebook, we fine-tune a pre-trained chemical language model (MoLFormer-XL) on the Lipophilicity dataset. The goal is to predict the lipophilicity (logD) of molecules represented as SMILES strings.

In [1]:
# Install necessary packages
!pip install torch datasets transformers scikit-learn pandas tqdm



In [2]:
# Import dependencies
import torch
from datasets import load_dataset
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
import pandas as pd
from tqdm.notebook import tqdm
import random
import os

In [3]:
# Hugging Face token (authentication is optional for public models and datasets; never commit a real token)
import os
os.environ['HF_TOKEN'] = 'YOUR_TOKEN_HERE'

## Step 1: Load Dataset

Load the Lipophilicity dataset from Hugging Face and perform some exploratory data analysis (EDA).

In [4]:
# Specify dataset and model names
DATASET_PATH = "scikit-fingerprints/MoleculeNet_Lipophilicity"
MODEL_NAME = "ibm/MoLFormer-XL-both-10pct"  # MoLFormer model

# Load the dataset
lipophilicity_data = load_dataset(DATASET_PATH)

# Explore the dataset: print info, column names, and first 5 samples
print(lipophilicity_data)
columns = lipophilicity_data['train'].column_names
print("Columns:", columns)
print("First 5 samples:", lipophilicity_data['train'][:5])


DatasetDict({
    train: Dataset({
        features: ['SMILES', 'label'],
        num_rows: 4200
    })
})
Columns: ['SMILES', 'label']
First 5 samples: {'SMILES': ['Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14', 'COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)CCc3ccccc23', 'COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl', 'OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(Cl)sc4[nH]3', 'Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)NCC#N)c1'], 'label': [3.54, -1.18, 3.69, 3.37, 3.1]}
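
The cell above only prints raw rows. As a minimal sketch of a first numeric look, the snippet below summarizes the five samples shown in that output. The variable names (`samples`, `label_summary`, `smiles_lengths`) are illustrative and not part of the notebook; on the full dataset the same summary could be computed from the `label` and `SMILES` columns.

```python
import statistics

# The five rows printed by the cell above, copied from its output.
samples = {
    "SMILES": [
        "Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14",
        "COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)CCc3ccccc23",
        "COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl",
        "OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(Cl)sc4[nH]3",
        "Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)NCC#N)c1",
    ],
    "label": [3.54, -1.18, 3.69, 3.37, 3.1],
}

# Basic statistics of the regression target (logD).
label_summary = {
    "min": min(samples["label"]),
    "max": max(samples["label"]),
    "mean": statistics.mean(samples["label"]),
    "stdev": statistics.stdev(samples["label"]),
}

# Character length of each SMILES string, a rough proxy for sequence length.
smiles_lengths = [len(s) for s in samples["SMILES"]]

print(label_summary)
print("SMILES character lengths:", smiles_lengths)
```

Character length is only a proxy: the tokenizer splits SMILES into atom-level tokens, so a multi-character token such as `[C@@H]` or `Cl` counts once.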


## Step 2: Split Dataset

Since the dataset has only a single (train) split, we perform a train-test split. We use stratification on binned target values (logD) to ensure the split is representative.

In [5]:
# Convert the dataset to a DataFrame
df = pd.DataFrame(lipophilicity_data['train'])

# Create stratification bins for the continuous target (label)
num_bins = 10  # adjust number of bins as needed
df['bin'] = pd.qcut(df['label'], q=num_bins, duplicates='drop')

# Perform train-test split with stratification based on bins
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['bin'], random_state=42)

print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")

# Remove the auxiliary bin column
train_df = train_df.drop(columns=['bin'])
test_df = test_df.drop(columns=['bin'])

Train size: 3360, Test size: 840
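
To make the idea behind the stratified split concrete, here is a framework-free sketch of what binning a continuous target and sampling per bin amounts to. It is a simplification, not a reimplementation of `pd.qcut` or `train_test_split`: bins are assigned by rank (so ties may be split across bins, which `pd.qcut` does not do), and the labels are synthetic stand-ins rather than the real logD values. The helper names `quantile_bins` and `stratified_split` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def quantile_bins(values, num_bins):
    """Assign each value a bin index in [0, num_bins) by rank, giving near-equal bin sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * num_bins // len(values)
    return bins


def stratified_split(bins, test_fraction, seed):
    """Return (train_indices, test_indices), drawing test_fraction of every bin for the test set."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for idx, b in enumerate(bins):
        by_bin[b].append(idx)
    train, test = [], []
    for b in sorted(by_bin):
        members = by_bin[b][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic stand-in for the 4200 labels (hypothetical values, not the real data).
label_rng = random.Random(0)
labels = [label_rng.gauss(2.2, 1.2) for _ in range(4200)]

bins = quantile_bins(labels, num_bins=10)
train_idx, test_idx = stratified_split(bins, test_fraction=0.2, seed=42)

# Every bin contributes the same share to the test set.
test_bin_counts = Counter(bins[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(sorted(test_bin_counts.items()))
```

Because every decile of the target contributes the same share to the test set, the label distributions of the two splits stay close, which is what stratifying on `df['bin']` is meant to achieve.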


## Step 3: Tokenization and PyTorch Dataset Class

Load the tokenizer for MoLFormer-XL and define a custom PyTorch `Dataset` to process SMILES strings and their target lipophilicity values.

In [6]:
# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Test the tokenizer on a sample SMILES string
sample_smiles = train_df.iloc[0]['SMILES']
tokens = tokenizer.tokenize(sample_smiles)
ids = tokenizer.convert_tokens_to_ids(tokens)
print("SMILES:", sample_smiles)
print("Tokens:", tokens)
print("Token IDs:", ids)

# Define a custom PyTorch Dataset
class LipoDataset(Dataset):
    def __init__(self, smiles_list, targets, tokenizer, max_length=128):
        self.smiles_list = smiles_list
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        smiles = self.smiles_list[idx]
        target = self.targets[idx]
        encoding = self.tokenizer(smiles, padding='max_length', truncation=True,
                                   max_length=self.max_length, return_tensors="pt")
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item["labels"] = torch.tensor(target, dtype=torch.float)
        return item

# Create dataset instances for training and testing
train_dataset = LipoDataset(train_df['SMILES'].tolist(), train_df['label'].tolist(), tokenizer)
test_dataset  = LipoDataset(test_df['SMILES'].tolist(), test_df['label'].tolist(), tokenizer)

# Create a DataLoader to test the dataset
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
batch = next(iter(train_loader))
print(batch['input_ids'].shape, batch['labels'].shape)


SMILES: O[C@@H](CNCCCOCCOCCc1cccc2ccccc12)c3ccc(O)c4NC(=O)Sc34
Tokens: ['O', '[C@@H]', '(', 'C', 'N', 'C', 'C', 'C', 'O', 'C', 'C', 'O', 'C', 'C', 'c', '1', 'c', 'c', 'c', 'c', '2', 'c', 'c', 'c', 'c', 'c', '1', '2', ')', 'c', '3', 'c', 'c', 'c', '(', 'O', ')', 'c', '4', 'N', 'C', '(', '=', 'O', ')', 'S', 'c', '3', '4']
Token IDs: [9, 16, 6, 4, 10, 4, 4, 4, 9, 4, 4, 9, 4, 4, 5, 8, 5, 5, 5, 5, 11, 5, 5, 5, 5, 5, 8, 11, 7, 5, 14, 5, 5, 5, 6, 9, 7, 5, 19, 10, 4, 6, 12, 9, 7, 18, 5, 14, 19]
torch.Size([32, 128]) torch.Size([32])
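
The token list above shows that SMILES strings are split at the atom level, with bracket atoms such as `[C@@H]` kept whole. The sketch below reproduces that behaviour with a regular expression that is widely used for SMILES. It is an assumption that this pattern approximates the MoLFormer tokenizer; the tokenizer's own pattern may differ in details, and the function name `tokenize_smiles` is hypothetical.

```python
import re

# A commonly used atom-level SMILES pattern (assumed, not taken from the MoLFormer code).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-level tokens; fail if any character is not covered."""
    tokens = SMILES_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Pattern does not cover every character of: {smiles}")
    return tokens


# The sample SMILES printed by the cell above.
sample_smiles = "O[C@@H](CNCCCOCCOCCc1cccc2ccccc12)c3ccc(O)c4NC(=O)Sc34"
tokens = tokenize_smiles(sample_smiles)
print(len(tokens), tokens[:6])

# Two-letter elements stay together instead of being split into 'C' and 'l'.
print(tokenize_smiles("Clc1ccccc1"))
```

For the sample molecule this yields one token per entry in the `Tokens:` list printed above. Mapping tokens to IDs and adding special tokens and padding remain the job of the real tokenizer.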


## Step 4: Load Model and Add Regression Head

Load the pre-trained MoLFormer-XL model and add a regression head to predict the continuous lipophilicity value.

In [15]:
# Load the base MoLFormer-XL model
base_model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Define a model with a regression head
class MolFormerRegressor(nn.Module):
    def __init__(self, base_model):
        super(MolFormerRegressor, self).__init__()
        self.base_model = base_model
        hidden_size = base_model.config.hidden_size
        self.regressor = nn.Linear(hidden_size, 1)  # Regression head


    def forward(self, input_ids, attention_mask):
        # Get the output from the base model
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)

        # If outputs is a dict, use 'last_hidden_state'; if it's a tuple, use index 0.
        if isinstance(outputs, dict):
            hidden_state = outputs.get('last_hidden_state', None)
        else:
            hidden_state = outputs[0]

        # Ensure we have a valid hidden state
        if hidden_state is None:
            raise ValueError("The base model did not return a valid hidden state.")

        # Use the first token's representation (position 0) as a [CLS]-style summary of the sequence
        cls_hidden_state = hidden_state[:, 0, :]

        # Pass the representation to the regressor
        logits = self.regressor(cls_hidden_state)
        return logits


# Initialize the regression model and move it to device
model = MolFormerRegressor(base_model)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
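
The regression head above reduces each sequence to a single vector by taking the hidden state at position 0 and applying one linear layer. The framework-free sketch below shows that reduction on nested lists. For contrast, it also shows masked mean pooling, a common alternative that the notebook does not use. All function names and the toy numbers are hypothetical.

```python
def first_token_pool(hidden_states):
    """[batch][seq][hidden] -> [batch][hidden], keeping only the vector at position 0."""
    return [seq[0] for seq in hidden_states]


def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention mask is 1, so padding is ignored."""
    pooled = []
    for seq, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(seq, mask) if m == 1]
        hidden_size = len(seq[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)])
    return pooled


def linear_head(pooled, weights, bias):
    """Apply a single-output linear layer: [batch][hidden] -> [batch][1]."""
    return [[sum(w * x for w, x in zip(weights, vec)) + bias] for vec in pooled]


# Toy batch: one sequence of three tokens, hidden size 2, last token is padding.
hidden_states = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]
weights, bias = [0.5, -1.0], 0.25

first_pred = linear_head(first_token_pool(hidden_states), weights, bias)
mean_pred = linear_head(masked_mean_pool(hidden_states, attention_mask), weights, bias)
print(first_pred, mean_pred)
```

Both variants produce one value per molecule, matching the `[batch, 1]` output shape that the training loop compares against `labels.unsqueeze(1)`.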

## Step 5: Training

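Train the regression model using gradient accumulation (two batches of 16 per optimizer step), a learning rate scheduler (`StepLR`, which multiplies the learning rate by 0.9 after every epoch), and early stopping with a patience of 2 epochs. Note that the validation loss is computed on the test split, so the test metrics reported later are not fully independent of model selection.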
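
The settings used in this step (batch size 16, 2 accumulation steps, 5 epochs, `StepLR` with `step_size=1` and `gamma=0.9`), together with the base learning rate of 2e-5 from Step 4 and the 3360 training molecules from Step 2, imply a concrete training plan. The sketch below works out that arithmetic. The helper `training_plan` is hypothetical and only mirrors how the loop counts batches and when the scheduler steps.

```python
import math


def training_plan(num_samples, batch_size, accumulation_steps, base_lr, gamma, epochs):
    """Summarize the batch and learning-rate arithmetic of an accumulation loop with per-epoch decay."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return {
        # Number of samples that contribute to each optimizer update.
        "effective_batch_size": batch_size * accumulation_steps,
        "batches_per_epoch": batches_per_epoch,
        "optimizer_steps_per_epoch": batches_per_epoch // accumulation_steps,
        # Batches at the end of an epoch whose gradients are accumulated but not yet applied.
        "leftover_batches": batches_per_epoch % accumulation_steps,
        # Learning rate in effect during each epoch when the scheduler steps once per epoch.
        "lr_per_epoch": [base_lr * gamma ** epoch for epoch in range(epochs)],
    }


plan = training_plan(
    num_samples=3360, batch_size=16, accumulation_steps=2,
    base_lr=2e-5, gamma=0.9, epochs=5,
)
for key, value in plan.items():
    print(f"{key}: {value}")
```

With these numbers there are no leftover batches. With an odd number of batches per epoch, the last batch's gradients would stay in the buffers and be added to the first update of the next epoch.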

In [8]:
from torch.optim.lr_scheduler import StepLR

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

accumulation_steps = 2
scheduler = StepLR(optimizer, step_size=1, gamma=0.9)

epochs = 5
best_val_loss = float('inf')
patience = 2
patience_counter = 0

# Directory for saving checkpoints
checkpoint_dir = "./checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for i, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].unsqueeze(1).to(device)

        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels)
        loss = loss / accumulation_steps
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    scheduler.step()

    # Validation phase
    model.eval()
    val_losses = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].unsqueeze(1).to(device)
            preds = model(input_ids, attention_mask)
            val_loss = loss_fn(preds, labels)
            val_losses.append(val_loss.item())
    avg_val_loss = sum(val_losses) / len(val_losses)
    print(f"Epoch {epoch+1}: Val MSE = {avg_val_loss:.4f}")

    # Save a checkpoint after each epoch
    checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_regression_epoch_{epoch+1}.pt")
    torch.save({
        'epoch': epoch+1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'avg_val_loss': avg_val_loss,
        'best_val_loss': best_val_loss
    }, checkpoint_path)
    print(f"Checkpoint saved at {checkpoint_path}")

    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        patience_counter = 0
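        # Clone the tensors: state_dict() returns references that later training steps would overwrite
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}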
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping triggered.")
            break

if 'best_model_state' in globals():
    model.load_state_dict(best_model_state)

Epoch 1: Val MSE = 0.9409
Checkpoint saved at ./checkpoints/checkpoint_regression_epoch_1.pt
Epoch 2: Val MSE = 0.7134
Checkpoint saved at ./checkpoints/checkpoint_regression_epoch_2.pt
Epoch 3: Val MSE = 0.5889
Checkpoint saved at ./checkpoints/checkpoint_regression_epoch_3.pt
Epoch 4: Val MSE = 0.5423
Checkpoint saved at ./checkpoints/checkpoint_regression_epoch_4.pt
Epoch 5: Val MSE = 0.5164
Checkpoint saved at ./checkpoints/checkpoint_regression_epoch_5.pt
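
The early-stopping rule in the loop above can be replayed on the validation losses it printed. The sketch below uses a hypothetical helper, `run_early_stopping`, that applies the same rule (reset the counter on improvement, stop after `patience` epochs without one). The second loss sequence is invented purely to show a case where the rule triggers.

```python
def run_early_stopping(val_losses, patience):
    """Return (best_epoch, best_loss, stopped_at); stopped_at is None if all epochs ran."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, best_loss, epoch
    return best_epoch, best_loss, None


# Validation MSE values printed by the training cell above.
recorded = [0.9409, 0.7134, 0.5889, 0.5423, 0.5164]
recorded_result = run_early_stopping(recorded, patience=2)

# Invented sequence (not from the notebook) in which the loss stops improving.
hypothetical = [0.90, 0.70, 0.72, 0.71, 0.60]
hypothetical_result = run_early_stopping(hypothetical, patience=2)

print(recorded_result)
print(hypothetical_result)
```

In the recorded run the loss improved every epoch, so early stopping never triggered and the best state is the one from epoch 5. In the invented sequence training stops at epoch 4 and would restore the epoch 2 state, even though epoch 5 would have been better. That is the usual trade-off of a small patience value.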


## Step 6: Evaluation

Evaluate the trained model on the test set using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R² score.

In [9]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

model.eval()
predictions = []
true_values = []
with torch.no_grad():
    for batch in DataLoader(test_dataset, batch_size=32):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels']
        preds = model(input_ids, attention_mask)
        predictions.extend(preds.squeeze(1).cpu().tolist())
        true_values.extend(labels.tolist())

mse = mean_squared_error(true_values, predictions)
mae = mean_absolute_error(true_values, predictions)
r2  = r2_score(true_values, predictions)

print(f"Test MSE: {mse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test R^2: {r2:.4f}")

Test MSE: 0.5196
Test MAE: 0.5493
Test R^2: 0.6424
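
For reference, the three metrics can be written out in a few lines of plain Python. The sketch below is meant to show the definitions, not to replace the scikit-learn functions, and the toy values are invented. It also uses the identity R² = 1 − MSE / Var(y) to read the label variance and the root-mean-squared error off the numbers printed above. The helper `implied_label_variance` is hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def implied_label_variance(mse_value, r2_value):
    """Population variance of the labels implied by R^2 = 1 - MSE / Var(y)."""
    return mse_value / (1.0 - r2_value)


# Toy example (invented values) to check the definitions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))

# Applied to the metrics printed above.
test_label_variance = implied_label_variance(0.5196, 0.6424)
test_rmse = math.sqrt(0.5196)
print(f"implied label variance: {test_label_variance:.3f}, RMSE: {test_rmse:.3f}")
```

Read this way, an MSE of 0.5196 corresponds to a typical error of roughly 0.72 logD units on labels whose variance is about 1.45.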


## 2. Add Unsupervised Fine-Tuning (MLM)

Perform unsupervised fine-tuning using the Masked Language Modeling (MLM) objective on the SMILES strings.

In [10]:
mlm_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
mlm_model.to(device)
mlm_model.train()

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

class SmilesDataset(Dataset):
    def __init__(self, smiles_list, tokenizer, max_length=128):
        self.smiles_list = smiles_list
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.smiles_list)
    def __getitem__(self, idx):
        smiles = self.smiles_list[idx]
        enc = self.tokenizer(smiles, padding='max_length', truncation=True, max_length=self.max_length, return_tensors="pt")
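        # Return the attention mask too, so padded positions are ignored by the model
        # (the recorded run passed only input_ids, which triggers the attention_mask warning in the output)
        return {'input_ids': enc['input_ids'].squeeze(0), 'attention_mask': enc['attention_mask'].squeeze(0)}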

unlabeled_smiles = train_df['SMILES'].tolist()
unlabeled_dataset = SmilesDataset(unlabeled_smiles, tokenizer)
mlm_loader = DataLoader(unlabeled_dataset, batch_size=32, shuffle=True, collate_fn=data_collator)

mlm_optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=5e-5)
mlm_epochs = 1  # You can adjust the number of epochs as needed

for epoch in range(mlm_epochs):
    for batch in mlm_loader:
        mlm_optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = mlm_model(**batch)
        loss = outputs.loss
        loss.backward()
        mlm_optimizer.step()
    # Save checkpoint for MLM fine-tuning after each epoch
    mlm_checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_mlm_epoch_{epoch+1}.pt")
    torch.save({
        'epoch': epoch+1,
        'mlm_model_state_dict': mlm_model.state_dict(),
        'optimizer_state_dict': mlm_optimizer.state_dict(),
        'loss': loss.item(),
    }, mlm_checkpoint_path)
    print(f"MLM Epoch {epoch+1} completed and checkpoint saved at {mlm_checkpoint_path}")


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


MLM Epoch 1 completed and checkpoint saved at ./checkpoints/checkpoint_mlm_epoch_1.pt
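
To illustrate what the MLM objective asks of the model, the sketch below mimics, in plain Python, the masking scheme that `DataCollatorForLanguageModeling` applies by default as we understand it. Each non-special token is selected with probability `mlm_probability`. Of the selected tokens, about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. Unselected positions get the label -100 so the loss ignores them. The special-token IDs, mask ID, and vocabulary size below are hypothetical placeholders, not values read from the MoLFormer tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (masked_inputs, labels) for one sequence under the 80/10/10 masking scheme."""
    masked_inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token_id in enumerate(input_ids):
        # Special tokens (start, end, padding) are never selected for prediction.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_inputs[pos] = mask_id
        elif roll < 0.9:
            masked_inputs[pos] = rng.randrange(vocab_size)
        # otherwise the token is left unchanged but is still predicted
    return masked_inputs, labels


# Hypothetical IDs: 0 = start, 1 = end, 2 = padding, 3 = mask.
special_ids = {0, 1, 2, 3}
input_ids = [0, 9, 16, 6, 4, 10, 4, 4, 4, 9, 4, 1, 2, 2]

rng = random.Random(0)
masked_inputs, labels = mask_tokens(input_ids, special_ids, mask_id=3, vocab_size=100, rng=rng)
print(masked_inputs)
print(labels)

# Empirical selection rate over many draws (10 maskable tokens per sequence).
trials = 2000
selected = sum(
    sum(label != IGNORE_INDEX for label in mask_tokens(input_ids, special_ids, 3, 100, rng)[1])
    for _ in range(trials)
)
selection_rate = selected / (trials * 10)
print(f"selection rate: {selection_rate:.3f}")
```

Because padding is treated as a special token, it is never chosen as a prediction target. Whether the model attends to padded positions is a separate question, governed by the attention mask.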


## 3. Fine-Tune for Comparison

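After unsupervised MLM fine-tuning, reinitialize the regression model using the updated base model and fine-tune again on the regression task. Unlike Step 5, this loop runs all epochs at a constant learning rate: it has no scheduler step, validation pass, or early stopping.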

In [18]:
model.base_model = mlm_model.base_model
model.regressor = nn.Linear(model.base_model.config.hidden_size, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.to(device)


for epoch in range(epochs):
    model.train()
    for i, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].unsqueeze(1).to(device)
        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels)
        loss = loss / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # Save checkpoint for fine-tuning after each epoch
    ft_checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_finetune_epoch_{epoch+1}.pt")
    torch.save({
        'epoch': epoch+1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss.item(),
    }, ft_checkpoint_path)
    print(f"Fine-tuning Epoch {epoch+1} completed and checkpoint saved at {ft_checkpoint_path}")

model.eval()
predictions = []
true_values = []
with torch.no_grad():
    for batch in DataLoader(test_dataset, batch_size=32):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels']
        preds = model(input_ids, attention_mask)
        predictions.extend(preds.squeeze(1).cpu().tolist())
        true_values.extend(labels.tolist())

mse = mean_squared_error(true_values, predictions)
mae = mean_absolute_error(true_values, predictions)
r2  = r2_score(true_values, predictions)

print("After MLM Fine-Tuning:")
print(f"Test MSE: {mse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test R^2: {r2:.4f}")


Fine-tuning Epoch 1 completed and checkpoint saved at ./checkpoints/checkpoint_finetune_epoch_1.pt
Fine-tuning Epoch 2 completed and checkpoint saved at ./checkpoints/checkpoint_finetune_epoch_2.pt
Fine-tuning Epoch 3 completed and checkpoint saved at ./checkpoints/checkpoint_finetune_epoch_3.pt
Fine-tuning Epoch 4 completed and checkpoint saved at ./checkpoints/checkpoint_finetune_epoch_4.pt
Fine-tuning Epoch 5 completed and checkpoint saved at ./checkpoints/checkpoint_finetune_epoch_5.pt
After MLM Fine-Tuning:
Test MSE: 0.5417
Test MAE: 0.5642
Test R^2: 0.6272


## Conclusion

We have successfully fine-tuned a pre-trained chemical language model on the Lipophilicity dataset using both supervised and unsupervised (MLM) fine-tuning.

## Results and Performance Metrics

Two fine-tuning strategies were evaluated on the same test split:

1. Direct fine-tuning (initial training):

**Test Mean Squared Error (MSE)**: 0.5196

**Test Mean Absolute Error (MAE)**: 0.5493

**Test R²** (coefficient of determination): 0.6424

The model explains about 64% of the variance in the test labels.

2. MLM fine-tuning followed by regression fine-tuning:

The base model first received an additional unsupervised fine-tuning step (Masked Language Modeling, MLM) and was then fine-tuned on the regression task:

**Test MSE**: 0.5417 (slightly worse than direct fine-tuning)

**Test MAE**: 0.5642 (slightly worse than direct fine-tuning)

**Test R²**: 0.6272 (slightly lower, consistent with the higher MSE)

The additional MLM fine-tuning did not lead to performance gains. It slightly decreased predictive performance: Test MSE rose from 0.5196 to 0.5417.
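
To put the size of the difference in perspective, the sketch below computes absolute and relative changes from the metric values printed by the two evaluation cells. The dictionary names are illustrative only.

```python
# Metric values copied from the printed outputs of the two evaluation cells.
direct = {"MSE": 0.5196, "MAE": 0.5493, "R2": 0.6424}
after_mlm = {"MSE": 0.5417, "MAE": 0.5642, "R2": 0.6272}

changes = {}
for name in direct:
    delta = after_mlm[name] - direct[name]
    changes[name] = {"delta": delta, "relative_pct": 100 * delta / direct[name]}
    print(
        f"{name}: {direct[name]:.4f} -> {after_mlm[name]:.4f} "
        f"({delta:+.4f}, {changes[name]['relative_pct']:+.2f}%)"
    )
```

All three changes amount to a few percent and come from a single run of each strategy, so they should be read as a tendency rather than a firm ranking.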

## Interpretation

**Initial Fine-Tuning:**

Direct fine-tuning reached a test R² of 0.64, which indicates that the representations of the pre-trained MoLFormer-XL model transfer usefully to lipophilicity prediction.

**MLM Additional Fine-Tuning:**

Additional unsupervised fine-tuning via MLM slightly hurt performance. One possible explanation is that the pre-trained MoLFormer-XL representations were already well suited to the task, so a short round of general-purpose MLM fine-tuning introduced noise rather than beneficial representations. The gap is small and comes from a single run of each strategy. The second regression run also did not use the learning rate scheduler or early stopping, so the comparison should be read with caution.

## Recommendations

Since the additional MLM fine-tuning step slightly degraded performance, future experiments might focus on optimizing hyperparameters during initial fine-tuning or on using domain-specific unsupervised training strategies.

Early stopping or a better-tailored learning rate scheduler could be used during MLM fine-tuning to avoid potential overfitting or negative transfer.