# Task 1: Fine-tune Chemical Language Model

The goal is to fine-tune a pre-trained chemical language model on a regression task using the Lipophilicity dataset. The task involves predicting the lipophilicity value for a given molecule representation (SMILES string). You will learn how to load and tokenize a dataset from HuggingFace, how to load a pre-trained language model, and finally, how to run a model in inference mode.

Your task is to complete the missing code blocks below.

In [1]:
!pip install -r requirements.txt

Collecting accelerate==0.26.0 (from -r requirements.txt (line 2))
  Downloading accelerate-0.26.0-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.26.0-py3-none-any.whl (270 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m270.7/270.7 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 1.4.0
    Uninstalling accelerate-1.4.0:
      Successfully uninstalled accelerate-1.4.0
Successfully installed accelerate-0.26.0


In [4]:
# import dependencies
import torch
from datasets import load_dataset
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader, Dataset, Subset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
from tqdm.notebook import tqdm
import random
import numpy as np
import os
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from tqdm import tqdm
import torch.optim as optim

In [5]:
# setting the seed
seed = 100

# for reproducibility
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# 1.Fine-tune a Chemical Language Model on Lipophilicity


## --- Step 1: Load Dataset ---

The dataset we are going to use is the [Lipophilicity](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity) dataset, part of [MoleculeNet](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a) benchmark.

Lipophilicity, also known as hydrophobicity, is a measure of how readily a substance dissolves in nonpolar solvents (such as oil) compared to polar solvents (such as water).

In [5]:
# specify dataset name and model name
DATASET_PATH = "scikit-fingerprints/MoleculeNet_Lipophilicity"
MODEL_NAME = "ibm/MoLFormer-XL-both-10pct"  #MoLFormer model

In [6]:
# load the dataset from HuggingFace
import pandas as pd
lipophilicity_df = pd.read_csv("hf://datasets/scikit-fingerprints/MoleculeNet_Lipophilicity/lipophilicity.csv")

#dataset = load_dataset(DATASET_PATH)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
# Explore the dataset
# For example, print the column names and display a few sample rows
# TODO: your code goes here
print(lipophilicity_df.shape)
print(lipophilicity_df.info())
print(lipophilicity_df.head())

(4200, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4200 entries, 0 to 4199
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SMILES  4200 non-null   object 
 1   label   4200 non-null   float64
dtypes: float64(1), object(1)
memory usage: 65.8+ KB
None
                                              SMILES  label
0            Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14   3.54
1  COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)...  -1.18
2             COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl   3.69
3  OC[C@H](O)CN1C(=O)C(Cc2ccccc12)NC(=O)c3cc4cc(C...   3.37
4  Cc1cccc(C[C@H](NC(=O)c2cc(nn2C)C(C)(C)C)C(=O)N...   3.10


In [8]:
lipophilicity_strings = lipophilicity_df['SMILES'].values
lipophilicity_targets = lipophilicity_df['label'].values

In [9]:
# define a PyTorch Dataset class for handling SMILES strings and targets

# TODO: your code goes here
class SMILESDataset(Dataset):
    def __init__(self, strings, labels):
        self.strings = strings
        self.labels = labels

    def __len__(self):
        return len(self.strings)

    def __getitem__(self, idx):
        string = self.strings[idx]
        target = self.labels[idx]
        return [string, target]

#printing a sample to check whether it is working or not
smile_dataset = SMILESDataset(lipophilicity_strings, lipophilicity_targets)
print(smile_dataset[10])

['CS(=O)(=O)c1ccc(Oc2ccc(cc2)C#C[C@]3(O)CN4CCC3CC4)cc1', 1.51]


## --- Step 2: Split Dataset ---

As there is only one split (train split) in the original dataset, we need to split the data into training and testing sets by ourselves.

In [10]:
# tokenize the data
# load a pre-trained tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

In [11]:
# split the data into training and test datasets
# TODO: your code goes here
from sklearn.preprocessing import MinMaxScaler
#X_train and X_test contains strings and targets
scaler = MinMaxScaler()
# scaling the target values
lipophilicity_targets = scaler.fit_transform(lipophilicity_targets.reshape(-1,1))
X_train, X_test, y_train_scaled, y_test_scaled = train_test_split(lipophilicity_strings, lipophilicity_targets, test_size = 0.2, random_state=seed)

In [12]:
# construct Pytorch data loaders for both train and test datasets
BATCH_SIZE = 16 # adjust based on memory constraints

# TODO: your code goes here
# creating train dataloader
train_dataset = SMILESDataset(X_train, y_train_scaled)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE)

# creating test dataloader
test_dataset = SMILESDataset(X_test, y_test_scaled)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

## --- Step 3: Load Model ---

In [13]:
# load pre-trained model from HuggingFace
model = AutoModel.from_pretrained(MODEL_NAME, deterministic_eval=True, trust_remote_code=True)
print(model)

MolformerModel(
  (embeddings): MolformerEmbeddings(
    (word_embeddings): Embedding(2362, 768, padding_idx=2)
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (encoder): MolformerEncoder(
    (layer): ModuleList(
      (0-11): 12 x MolformerLayer(
        (attention): MolformerAttention(
          (self): MolformerSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (rotary_embeddings): MolformerRotaryEmbedding()
            (feature_map): MolformerFeatureMap(
              (kernel): ReLU()
            )
          )
          (output): MolformerSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (in

In [14]:
# We need to add a regression head on the language model as we are doing a regression task.

# specify model with a regression head

class MoLFormerWithRegressionHead(nn.Module):
    def __init__(self, model, hidden_size=768):
        super(MoLFormerWithRegressionHead, self).__init__()
        self.encoder = model

        # regression head (fully connected layer)
        self.regressor = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size // 2, 1)
        )

    def forward(self, inputs):
        outputs = self.encoder(**inputs)
        model_representation = outputs.pooler_output
        regression_output = self.regressor(model_representation)
        return regression_output

In [15]:
# initialize the regression model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
regression_model = MoLFormerWithRegressionHead(model).to(device)
print(device)
print(regression_model)

cuda
MoLFormerWithRegressionHead(
  (encoder): MolformerModel(
    (embeddings): MolformerEmbeddings(
      (word_embeddings): Embedding(2362, 768, padding_idx=2)
      (dropout): Dropout(p=0.2, inplace=False)
    )
    (encoder): MolformerEncoder(
      (layer): ModuleList(
        (0-11): 12 x MolformerLayer(
          (attention): MolformerAttention(
            (self): MolformerSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (rotary_embeddings): MolformerRotaryEmbedding()
              (feature_map): MolformerFeatureMap(
                (kernel): ReLU()
              )
            )
            (output): MolformerSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)

## --- Step 4: Training ---

In [16]:
# TODO: your code goes here
def train(train_dataloader, model, tokenizer, epochs, loss_fn, optimizer, save_path):
    model.train()
    best_loss = 1000
    for epoch in range(epochs):
        running_loss = 0
        count = 0
        with tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{epochs}", unit="batch") as pbar:
            for index, data in enumerate(pbar):
                smile_strings = data[0]
                smile_targets = data[1].to(device).float()

                inputs = tokenizer(smile_strings, padding=True, return_tensors="pt").to(device)
                outputs = model(inputs)

                loss = loss_fn(outputs, smile_targets)

                running_loss = loss.item() + running_loss
                count = count + len(data)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        epoch_loss = running_loss / count
        print(f"Epoch {epoch+1} Loss: {epoch_loss:.4f}")
        # saving the best model
        if epoch_loss < best_loss :
            best_loss = epoch_loss
            torch.save(model.state_dict(), save_path)


epochs = 2
learning_rate = 0.0001
model_save_path = "base_model.pth"
mse_loss = nn.MSELoss()
optimizer = optim.Adam(regression_model.parameters(), lr=learning_rate)

train(
    train_dataloader,
    regression_model,
    tokenizer,
    epochs = epochs,
    loss_fn=mse_loss,
    optimizer=optimizer,
    save_path=model_save_path
)

Epoch 1/2: 100%|██████████| 210/210 [00:32<00:00,  6.55batch/s]


Epoch 1 Loss: 0.0135


Epoch 2/2: 100%|██████████| 210/210 [00:25<00:00,  8.32batch/s]


Epoch 2 Loss: 0.0080


## --- Step 5: Evaluation ---

In [17]:
# TODO: your code goes here

def test(test_dataloader, model):
    model.eval()
    predictions = []
    actuals = []

    with torch.no_grad():
        with tqdm(test_dataloader, desc="Testing", unit="batch") as pbar:
            for data in pbar:
                smile_strings = data[0]
                smile_targets = data[1].to(device).float()

                inputs = tokenizer(smile_strings, padding=True, return_tensors="pt").to(device)
                outputs = model(inputs)  # Flatten output to match targets

                predictions.extend(outputs.cpu().numpy())
                actuals.extend(smile_targets.cpu().numpy())

    # Convert lists to numpy arrays
    predictions = np.array(predictions).flatten()  # Ensure 1D shape
    actuals = np.array(actuals).flatten()

    # Ensure correct dtype
    predictions = predictions.astype(np.float64)
    actuals = actuals.astype(np.float64)

    predictions = predictions.reshape(-1, 1)
    actuals = actuals.reshape(-1, 1)

    # Scale back
    predictions = np.array(scaler.inverse_transform(predictions)).flatten().tolist()
    actuals = np.array(scaler.inverse_transform(actuals)).flatten().tolist()


    # Calculate metrics
    mse = mean_squared_error(actuals, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(actuals, predictions)
    r2 = r2_score(actuals, predictions)
    pearson_corr, _ = pearsonr(actuals, predictions)
    spearman_corr, _ = spearmanr(actuals, predictions)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R-squared (R²): {r2:.4f}")
    print(f"Pearson Correlation: {pearson_corr:.4f}")
    print(f"Spearman Correlation: {spearman_corr:.4f}")

regression_model = MoLFormerWithRegressionHead(model).to(device)
regression_model.load_state_dict(torch.load(model_save_path, map_location=device))
test(
    test_dataloader,
    regression_model
)

  regression_model.load_state_dict(torch.load(model_save_path, map_location=device))
Testing: 100%|██████████| 53/53 [00:02<00:00, 25.25batch/s]

Mean Squared Error (MSE): 0.5822
Root Mean Squared Error (RMSE): 0.7630
Mean Absolute Error (MAE): 0.5882
R-squared (R²): 0.5938
Pearson Correlation: 0.8044
Spearman Correlation: 0.7911





# 2.Add Unsupervised Finetuning
In this step, you will perform unsupervised fine-tuning on the training dataset. This means the model will leverage only the SMILES strings without any corresponding labels to adapt its understanding of the data distribution. By familiarizing the model with the patterns and structure of the SMILES strings, you can potentially enhance its performance on downstream supervised tasks.

For this fine-tuning, you will use the Masked Language Modeling (MLM) objective, where the model learns to predict randomly masked tokens within the input sequence. Remember to save the fine-tuned model for later use.


In [18]:
import torch
import torch.nn as nn
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset

# load the model for MLM with AutoModelForMaskedLM class
model = AutoModelForMaskedLM.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# creating the traning data
training_data = []
for index, data in enumerate(train_dataloader):
    smile_strings = list(data[0])
    training_data.extend(smile_strings)

dataset = Dataset.from_list([{"smiles": s} for s in training_data])

def tokenize_function(examples):
    tokenized = tokenizer(
        examples["smiles"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
    return tokenized


tokenized_dataset = dataset.map(tokenize_function, batched=True)

# creating the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)
from transformers import TrainingArguments, Trainer

# setting the traning arguments
training_args = TrainingArguments(
    output_dir="./molformer_finetuned",
    evaluation_strategy="no",
    save_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=50,
    report_to = "none"
    #no_cuda=True
)

# training the Masked LM on smile strings
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

# saving the model
model.save_pretrained("./molformer_finetuned")
tokenizer.save_pretrained("./molformer_finetuned")

print("Training completed! Model saved to ./molformer_finetuned")

Map:   0%|          | 0/3360 [00:00<?, ? examples/s]

  trainer = Trainer(


Step,Training Loss
50,0.3808
100,0.2822


Training completed! Model saved to ./molformer_finetuned


In [19]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Define the path where the model is saved
model_path = "./molformer_finetuned"

# Load the fine-tuned model
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
regression_model = MoLFormerWithRegressionHead(model).to(device)

print("Model and tokenizer successfully loaded!")

Model and tokenizer successfully loaded!


# 3.Fine-Tune for Comparison
After performing unsupervised fine-tuning on the training data, we now fine-tune the model on the regression task with the regression head. By comparing the performance of the model before and after unsupervised fine-tuning, you can evaluate how the unsupervised fine-tuning impacts the model's performance on our target task.


In [20]:
# TODO: your code goes here
def train(train_dataloader, model, tokenizer, epochs, loss_fn, optimizer, save_path):
    model.train()
    best_loss = 1000
    for epoch in range(epochs):
        running_loss = 0
        count = 0
        with tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{epochs}", unit="batch") as pbar:
            for index, data in enumerate(pbar):
                smile_strings = data[0]
                smile_targets = data[1].to(device).float()

                inputs = tokenizer(smile_strings, padding=True, return_tensors="pt").to(device)
                outputs = model(inputs)

                loss = loss_fn(outputs, smile_targets)

                running_loss = loss.item() + running_loss

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        epoch_loss = running_loss / len(train_dataloader)
        print(f"Epoch {epoch+1} Loss: {epoch_loss:.4f}")
        if epoch_loss < best_loss :
            best_loss = epoch_loss
            torch.save(model.state_dict(), save_path)


epochs = 3
learning_rate = 0.0001
model_save_path = "unsupervised_model.pth"
mse_loss = nn.MSELoss()
optimizer = optim.Adam(regression_model.parameters(), lr=learning_rate)

train(
    train_dataloader,
    regression_model,
    tokenizer,
    epochs = epochs,
    loss_fn=mse_loss,
    optimizer=optimizer,
    save_path=model_save_path
)

Epoch 1/3: 100%|██████████| 210/210 [00:29<00:00,  7.18batch/s]


Epoch 1 Loss: 0.0276


Epoch 2/3: 100%|██████████| 210/210 [00:28<00:00,  7.48batch/s]


Epoch 2 Loss: 0.0168


Epoch 3/3: 100%|██████████| 210/210 [00:27<00:00,  7.68batch/s]


Epoch 3 Loss: 0.0123


In [21]:
# TODO: your code goes here
def test(test_dataloader, model):
    model.eval()
    predictions = []
    actuals = []

    with torch.no_grad():
        with tqdm(test_dataloader, desc="Testing", unit="batch") as pbar:
            for data in pbar:
                smile_strings = data[0]
                smile_targets = data[1].to(device).float()

                inputs = tokenizer(smile_strings, padding=True, return_tensors="pt").to(device)
                outputs = model(inputs)  # Flatten output to match targets

                predictions.extend(outputs.cpu().numpy())
                actuals.extend(smile_targets.cpu().numpy())

    # Convert lists to numpy arrays
    predictions = np.array(predictions).flatten()  # Ensure 1D shape
    actuals = np.array(actuals).flatten()

    # Ensure correct dtype
    predictions = predictions.astype(np.float64)
    actuals = actuals.astype(np.float64)

    predictions = predictions.reshape(-1, 1)
    actuals = actuals.reshape(-1, 1)

    # Scale back
    predictions = np.array(scaler.inverse_transform(predictions)).flatten().tolist()
    actuals = np.array(scaler.inverse_transform(actuals)).flatten().tolist()

    # Calculate metrics
    mse = mean_squared_error(actuals, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(actuals, predictions)
    r2 = r2_score(actuals, predictions)
    pearson_corr, _ = pearsonr(actuals, predictions)
    spearman_corr, _ = spearmanr(actuals, predictions)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R-squared (R²): {r2:.4f}")
    print(f"Pearson Correlation: {pearson_corr:.4f}")
    print(f"Spearman Correlation: {spearman_corr:.4f}")

regression_model = MoLFormerWithRegressionHead(model).to(device)
regression_model.load_state_dict(torch.load(model_save_path, map_location=device))
test(
    test_dataloader,
    regression_model
)

  regression_model.load_state_dict(torch.load(model_save_path, map_location=device))
Testing: 100%|██████████| 53/53 [00:02<00:00, 23.37batch/s]

Mean Squared Error (MSE): 0.4905
Root Mean Squared Error (RMSE): 0.7004
Mean Absolute Error (MAE): 0.5375
R-squared (R²): 0.6578
Pearson Correlation: 0.8220
Spearman Correlation: 0.8129



