# 🤠 MolGPT Training - Cowboy Chronicle 🤠



This notebook demonstrates how to train the MolGPT model on molecular datasets. MolGPT is a transformer-decoder model for molecular generation that can be trained on SMILES strings with or without conditional properties.

## 1. Setup Environment



First, let's make sure we have all the necessary imports and set up our environment.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Import directly from the files instead of using package imports
sys.path.insert(0, '.')
from train.model import GPT, GPTConfig
from train.trainer import Trainer, TrainerConfig
from train.dataset import SmileDataset
from train.utils import set_seed

# Set random seed for reproducibility
set_seed(42)

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Load and Explore Dataset



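Let's load the dataset and explore its structure. This demonstration uses the MOSES benchmark dataset, read from `datasets/moses2.csv`. After the column names are lowercased, the later cells expect `smiles`, `scaffold_smiles`, `split`, and `logp` columns.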

In [None]:
# Load the Moses dataset
data_name = 'moses2'
data = pd.read_csv(f'datasets/{data_name}.csv')
data = data.dropna(axis=0).reset_index(drop=True)
data.columns = data.columns.str.lower()

# Display the first few rows
print(f"Dataset shape: {data.shape}")
data.head()

In [None]:
# Split the data into train and validation sets
# (rows labelled 'test' in the 'split' column serve as the validation set here)
train_data = data[data['split'] == 'train'].reset_index(drop=True)
val_data = data[data['split'] == 'test'].reset_index(drop=True)

print(f"Training data shape: {train_data.shape}")
print(f"Validation data shape: {val_data.shape}")

## 3. Prepare Data for Training



Now we'll prepare the data for training by tokenizing the SMILES strings and creating the dataset objects.

In [None]:
# Extract SMILES and scaffold strings
smiles = train_data['smiles']
vsmiles = val_data['smiles']
scaffold = train_data['scaffold_smiles']
vscaffold = val_data['scaffold_smiles']

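# Define the regex pattern for tokenizing SMILES.
# A raw string passes the escapes to `re` unchanged. In a plain string, "\\|\/" collapses into one
# alternative matching the two characters "|/", so the bond symbols "\" and "/" are never tokenized.
# Assumption: if train/dataset.py tokenizes with its own pattern, keep the two patterns in sync.
pattern = r"(\[[^\]]+]|<|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
regex = re.compile(pattern)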

# Calculate maximum lengths
lens = [len(regex.findall(i.strip())) for i in (list(smiles.values) + list(vsmiles.values))]
max_len = max(lens)
print(f'Max SMILES length: {max_len}')

lens = [len(regex.findall(i.strip())) for i in (list(scaffold.values) + list(vscaffold.values))]
scaffold_max_len = max(lens)
print(f'Max scaffold length: {scaffold_max_len}')

# Right-pad the SMILES and scaffold strings with '<' up to their maximum token lengths
smiles = [i + '<' * (max_len - len(regex.findall(i.strip()))) for i in smiles]
vsmiles = [i + '<' * (max_len - len(regex.findall(i.strip()))) for i in vsmiles]

scaffold = [i + '<' * (scaffold_max_len - len(regex.findall(i.strip()))) for i in scaffold]
vscaffold = [i + '<' * (scaffold_max_len - len(regex.findall(i.strip()))) for i in vscaffold]
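To make the tokenize-and-pad step concrete, here is a minimal, self-contained sketch on two toy molecules. The pattern is a simplified subset chosen for illustration (an assumption, not the notebook's full pattern), and the names `TOKEN_RE`, `tokenize`, and `pad_to` are hypothetical helpers. The point it shows is that lengths are counted in tokens, not characters: `Br` and `[NH+]` each count once.

```python
import re

# Simplified SMILES token pattern (illustrative subset, not the notebook's full pattern).
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|<|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|%[0-9]{2}|[0-9])"
)
PAD = "<"  # same padding symbol the notebook uses


def tokenize(smi):
    """Split a SMILES string into tokens (multi-character tokens stay whole)."""
    return TOKEN_RE.findall(smi.strip())


def pad_to(smi, length):
    """Right-pad a SMILES string so that it has exactly `length` tokens."""
    return smi + PAD * (length - len(tokenize(smi)))


toy_smiles = ["CC(=O)Nc1ccc(Br)cc1", "C[NH+](C)C"]
toy_max_len = max(len(tokenize(s)) for s in toy_smiles)
padded = [pad_to(s, toy_max_len) for s in toy_smiles]

for s in padded:
    print(len(tokenize(s)), s)
```

After padding, every string tokenizes to the same length, which is what lets the dataset stack examples into fixed-size tensors.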

In [None]:
# Define the token vocabulary (tokens, not single characters: 'Br', '[NH+]' and '%10' are each one entry)
whole_string = ['#', '%10', '%11', '%12', '(', ')', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '<', '=', 'B', 'Br', 'C', 'Cl', 'F', 'I', 'N', 'O', 'P', 'S', '[B-]', '[BH-]', '[BH2-]', '[BH3-]', '[B]', '[C+]', '[C-]', '[CH+]', '[CH-]', '[CH2+]', '[CH2]', '[CH]', '[F+]', '[H]', '[I+]', '[IH2]', '[IH]', '[N+]', '[N-]', '[NH+]', '[NH-]', '[NH2+]', '[NH3+]', '[N]', '[O+]', '[O-]', '[OH+]', '[O]', '[P+]', '[PH+]', '[PH2+]', '[PH]', '[S+]', '[S-]', '[SH+]', '[SH]', '[Se+]', '[SeH+]', '[SeH]', '[Se]', '[Si-]', '[SiH-]', '[SiH2]', '[SiH]', '[Si]', '[b-]', '[bH-]', '[c+]', '[c-]', '[cH+]', '[cH-]', '[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[sH+]', '[se+]', '[se]', 'b', 'c', 'n', 'o', 'p', 's']

# Extract property values for conditional training
props = ['logp']
prop = train_data[props].values.tolist()
vprop = val_data[props].values.tolist()
num_props = len(props)

# Minimal stand-in for the settings object that SmileDataset takes as its first argument
class Args:
    def __init__(self):
        self.debug = False
        self.scaffold = True
        self.lstm = False
        self.lstm_layers = 0
        self.num_props = num_props
        self.props = props

args = Args()

# Create dataset objects
train_dataset = SmileDataset(args, smiles, whole_string, max_len, prop=prop, aug_prob=0, scaffold=scaffold, scaffold_maxlen=scaffold_max_len)
valid_dataset = SmileDataset(args, vsmiles, whole_string, max_len, prop=vprop, aug_prob=0, scaffold=vscaffold, scaffold_maxlen=scaffold_max_len)

print(f"Vocabulary size: {train_dataset.vocab_size}")
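The trainer later receives `train_dataset.stoi` and `train_dataset.itos`. As a rough illustration of what such lookup tables do, the sketch below builds them for a tiny, hypothetical vocabulary and round-trips one padded token sequence. It assumes the mappings are built by enumerating a sorted vocabulary, which is a common convention but is not confirmed by this notebook; `toy_vocab`, `encode`, and `decode` are made-up names.

```python
# Hypothetical miniature vocabulary; the notebook's real one is `whole_string`.
toy_vocab = sorted(["<", "C", "N", "O", "(", ")", "=", "c", "1"])

# token -> integer id, and the inverse mapping
stoi = {tok: i for i, tok in enumerate(toy_vocab)}
itos = {i: tok for tok, i in stoi.items()}


def encode(tokens):
    """Map a list of tokens to integer ids."""
    return [stoi[t] for t in tokens]


def decode(ids):
    """Map integer ids back to a SMILES string (padding included)."""
    return "".join(itos[i] for i in ids)


tokens = ["C", "C", "(", "=", "O", ")", "N", "<", "<"]
ids = encode(tokens)
print(ids)
print(decode(ids))                   # padded string
print(decode(ids).replace("<", ""))  # padding stripped
```

A token missing from the vocabulary raises a `KeyError` in `encode`, which is why the vocabulary has to cover every token the tokenizer can emit for the dataset.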

## 4. Define Model Architecture



Now we'll define the model architecture using the GPT configuration.

In [None]:
# Define model configuration
n_layer = 8
n_head = 8
n_embd = 256

mconf = GPTConfig(train_dataset.vocab_size, train_dataset.max_len, 
                  num_props=num_props,
                  n_layer=n_layer, n_head=n_head, n_embd=n_embd, 
                  scaffold=args.scaffold, scaffold_maxlen=scaffold_max_len,
                  lstm=args.lstm, lstm_layers=args.lstm_layers)

# Create the model
model = GPT(mconf)

# Print model summary
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
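As a sanity check on the printed parameter count, a back-of-the-envelope estimate can be computed from `n_layer` and `n_embd` alone. The sketch below assumes a standard GPT-style block (four `n_embd x n_embd` attention projections plus a feed-forward layer with 4x expansion) and ignores biases, layer norms, embeddings, and the output head. Whether `train.model.GPT` matches this layout exactly is an assumption, and `approx_block_params` is a hypothetical helper.

```python
def approx_block_params(n_layer, n_embd):
    """Rough weight count for the transformer blocks only.

    Assumes a standard GPT block: Q, K, V and output projections (4 * d^2)
    plus an MLP with 4x expansion (d * 4d + 4d * d = 8 * d^2).
    """
    attention = 4 * n_embd * n_embd
    mlp = 8 * n_embd * n_embd
    return n_layer * (attention + mlp)


estimate = approx_block_params(n_layer=8, n_embd=256)
print(f"Approximate block parameters: {estimate:,}")  # 6,291,456
```

The number reported by the cell above should be somewhat larger than this estimate, since it also counts the terms ignored here.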

## 5. Train the Model



Now we'll set up the training configuration and train the model.

In [None]:
# Define training configuration
batch_size = 32  # Reduced for demonstration
max_epochs = 2   # Reduced for demonstration
learning_rate = 6e-4
run_name = 'logp_moses_demo'

tconf = TrainerConfig(
    max_epochs=max_epochs, 
    batch_size=batch_size, 
    learning_rate=learning_rate,
    lr_decay=True, 
    warmup_tokens=0.1*len(train_data)*max_len, 
    final_tokens=max_epochs*len(train_data)*max_len,
    num_workers=2, 
    ckpt_path=f'weights/{run_name}.pt', 
    block_size=train_dataset.max_len, 
    generate=False
)

# Initialize trainer
trainer = Trainer(model, train_dataset, valid_dataset, tconf, train_dataset.stoi, train_dataset.itos)

# For demonstration, we'll skip the actual training since it would take too long
# In a real scenario, you would run:
# df = trainer.train(None)  # Set to None since we're not using wandb

print("Training would start here in a real scenario.")
print("For demonstration purposes, training is skipped and the next section plots mock results.")
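The `lr_decay`, `warmup_tokens`, and `final_tokens` settings describe a learning-rate schedule measured in processed tokens. The sketch below shows one common form of such a schedule: linear warmup followed by cosine decay with a floor at 10% of the base rate. It assumes the trainer follows this minGPT-style rule, which the notebook does not confirm; `lr_multiplier` and the toy dataset sizes are hypothetical.

```python
import math


def lr_multiplier(tokens_seen, warmup_tokens, final_tokens, floor=0.1):
    """Scale factor applied to the base learning rate after `tokens_seen` tokens."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to 1
        return tokens_seen / max(1, warmup_tokens)
    # Cosine decay from 1 towards 0, clipped at `floor`
    progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical sizes, using the same formulas as the TrainerConfig call above
n_examples, seq_len, epochs = 1000, 50, 2
warmup = 0.1 * n_examples * seq_len    # 10% of one epoch
final = epochs * n_examples * seq_len  # all tokens across all epochs
base_lr = 6e-4

for t in (0, warmup / 2, warmup, final / 2, final):
    print(f"tokens={int(t):>6}  lr={base_lr * lr_multiplier(t, warmup, final):.2e}")
```

With these formulas and `max_epochs = 2`, warmup covers the first 5% of the run, because 10% of one epoch is 5% of two.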

## 6. Visualize Training Results



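Because training was skipped above, there are no real metrics to plot. The loss values below are hand-written placeholders that only illustrate the plotting code. They span 10 epochs rather than the 2 set in `max_epochs`, and they should not be read as MolGPT results.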

In [None]:
# Create mock training results
epochs = list(range(1, 11))
train_loss = [2.5, 2.2, 1.9, 1.7, 1.5, 1.4, 1.3, 1.25, 1.2, 1.18]
val_loss = [2.6, 2.3, 2.0, 1.8, 1.65, 1.55, 1.5, 1.45, 1.4, 1.38]

# Plot training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_loss, 'b-', label='Training Loss')
plt.plot(epochs, val_loss, 'r-', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
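Beyond plotting, two quantities are worth reading off a loss history: the epoch with the lowest validation loss and the gap between validation and training loss. The sketch below computes both from the same placeholder numbers used in the plot, so the results describe the mock data only; the variable names `best_idx`, `best_epoch`, and `gaps` are illustrative.

```python
# Same placeholder values as in the plotting cell (not real training results)
epochs = list(range(1, 11))
train_loss = [2.5, 2.2, 1.9, 1.7, 1.5, 1.4, 1.3, 1.25, 1.2, 1.18]
val_loss = [2.6, 2.3, 2.0, 1.8, 1.65, 1.55, 1.5, 1.45, 1.4, 1.38]

# Epoch with the lowest validation loss (a natural checkpoint to keep)
best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = epochs[best_idx]

# Validation minus training loss per epoch; a widening gap hints at overfitting
gaps = [v - t for t, v in zip(train_loss, val_loss)]

print(f"Best epoch: {best_epoch} (val loss {val_loss[best_idx]:.2f})")
print(f"Gap at first epoch: {gaps[0]:.2f}, at last epoch: {gaps[-1]:.2f}")
```

In a real run, the lists would come from the history returned by the trainer rather than from hard-coded values.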

## 7. Save Model Weights



In a real training scenario, the trainer would save the model weights automatically to the configured `ckpt_path` (`weights/{run_name}.pt` here). For demonstration, we'll show how you would save the weights manually.

In [None]:
# Save model weights (commented out since we're not actually training)
# os.makedirs('weights', exist_ok=True)  # torch.save does not create missing directories
# torch.save(model.state_dict(), f'weights/{run_name}.pt')
print(f"Model weights would be saved to: weights/{run_name}.pt")

## 8. Summary



In this notebook, we've demonstrated how to:

1. Set up the environment for MolGPT training
2. Load and explore the molecular dataset
3. Prepare the data for training
4. Define the model architecture
5. Configure the trainer (the training run itself is skipped)
6. Visualize training results, using mock loss values
7. Save model weights



The MolGPT model can be trained with different conditioning options:

- Unconditional generation
- Property-based conditional generation (e.g., logP, QED, SAS)
- Scaffold-based conditional generation
- Combined property and scaffold-based conditional generation
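One practical consequence of these options is the length of the sequence the transformer attends over. The sketch below assumes that all properties are projected into a single conditioning slot and that scaffold tokens are prepended to the SMILES tokens. That is one plausible arrangement and should be checked against `train/model.py`; `conditioned_length` and the lengths used are hypothetical.

```python
def conditioned_length(smiles_len, num_props, scaffold_len):
    """Total sequence length under the assumed layout.

    Assumption: all properties share one conditioning slot, and the scaffold
    contributes `scaffold_len` tokens placed before the SMILES tokens.
    """
    prop_slots = 1 if num_props > 0 else 0
    return prop_slots + scaffold_len + smiles_len


# Hypothetical token lengths, for illustration only
smiles_len, scaffold_len = 54, 48

options = [
    ("unconditional", 0, 0),
    ("property only", 1, 0),
    ("scaffold only", 0, scaffold_len),
    ("property + scaffold", 3, scaffold_len),
]
for name, n_props, s_len in options:
    print(f"{name:<20} -> {conditioned_length(smiles_len, n_props, s_len)} positions")
```

Under this assumption, scaffold conditioning is the option that noticeably lengthens the context, while property conditioning adds a single position regardless of how many properties are used.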



For actual training, uncomment the `trainer.train(None)` call and raise `max_epochs` and `batch_size` above the demonstration values (2 and 32). Expect a run to take several hours, depending on your hardware.