We introduce VeGA-RX, conditioned on dual semantic descriptors (SMILES + SMARTS-RX), and VeGA-SCX, which integrates topological guidance (SMILES + Scaffold + SMARTS-RX). Both models efficiently explore chemical space but with distinct profiles: VeGA-SCX prioritizes structural discipline, while VeGA-RX maximizes generative freedom.
This repository provides a complete pipeline for training, fine-tuning, and generating molecules using conditional SMARTS-RX → SMILES models.
The workflow is organized into six notebooks.
First, download the codebase. Then, use conda to set up a new environment for VeGA. If you're new to conda, we recommend working through an introductory conda tutorial before proceeding.
conda env create -f enviroment.yml
conda activate vega
python -m pip install tensorflow[and-cuda]
conda install -c conda-forge jupyter notebook

Now you can launch the notebooks with Jupyter Notebook.
Recommended order of execution:
Training → (Optional) Fine-Tuning → Generation
Train the SMARTS + Scaffold conditioned SMILES generator.
Modify the file paths in the configuration section:
- SMILES dataset
- SMARTS definition file
- Output/cache paths
Run the main training cell.
If train_data.pkl and val_data.pkl do NOT exist, the notebook starts full preprocessing and training:
logger.info("🚀 START TRAINING: SMARTS-RX → SMILES")
If the cached files already exist, the notebook loads them directly:
logger.info("🚀 AVVIO SCRIPT: CARICAMENTO DIRETTO E ADDESTRAMENTO")  # i.e., "SCRIPT START: DIRECT LOADING AND TRAINING"
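The cache check described above can be sketched as follows (file names are taken from the notebook; the cache directory parameter is a hypothetical addition):

```python
import os

def needs_preprocessing(cache_dir="."):
    """Return True when train_data.pkl / val_data.pkl are missing,
    i.e. the notebook must run full preprocessing before training."""
    required = ("train_data.pkl", "val_data.pkl")
    return not all(os.path.exists(os.path.join(cache_dir, name)) for name in required)
```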
In most cases, simply running the main training cell is sufficient. The run produces the following artifacts:
- Trained model (.keras)
- char2idx.pkl
- idx2char.pkl
- vocab.json
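The char2idx / idx2char artifacts map characters to integer token indices and back. A minimal sketch of how such mappings are typically used, with a toy vocabulary rather than the files the notebook writes:

```python
# Toy vocabulary; the real mappings are loaded from char2idx.pkl / idx2char.pkl.
char2idx = {c: i for i, c in enumerate("^$#()=123CNOScnos")}
idx2char = {i: c for c, i in char2idx.items()}

def encode(smiles):
    """Turn a SMILES string into a sequence of token indices."""
    return [char2idx[c] for c in smiles]

def decode(indices):
    """Invert encode(), recovering the SMILES string."""
    return "".join(idx2char[i] for i in indices)
```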
Train a SMARTS-RX conditioned model (alternative implementation).
Update:
- Dataset path
- SMARTS file
- Output directory
Run all cells sequentially.
- GPU acceleration is recommended for training.
Fine-tune a pretrained SMARTS + Scaffold model.
Modify:
- PRETRAINED_MODEL
- VOCAB_PATH
- SMARTS file
- Fine-tuning dataset
- SAVE_DIR
Ensure MAX_LENGTH matches the original training configuration.
Run all cells.
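One way to enforce the MAX_LENGTH requirement is a fail-fast guard at the top of the notebook. This is a sketch; the function and argument names are my own, not the notebook's:

```python
def check_max_length(fine_tune_max_length, pretrained_max_length):
    """Raise early if the fine-tuning sequence length differs from the
    one the pretrained model was trained with."""
    if fine_tune_max_length != pretrained_max_length:
        raise ValueError(
            f"MAX_LENGTH mismatch: fine-tuning uses {fine_tune_max_length}, "
            f"but the pretrained model expects {pretrained_max_length}"
        )
```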
Fine-tune a SMARTS-only conditioned model.
Update:
- Pretrained model path
- Vocabulary files
- Dataset path
- Save directory
Run all cells sequentially.
Interactive molecule generation.
Supports:
- SMARTS conditioning
- Scaffold conditioning
- Combined SMARTS + Scaffold
- Unconditional generation
- Batch generation with CSV export
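The CSV export step can be as simple as the following sketch (the column name and default output path are placeholders, not the notebook's actual values):

```python
import csv

def export_generated(smiles_list, path="generated.csv"):
    """Write one generated SMILES per row under a 'smiles' header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["smiles"])
        writer.writerows([s] for s in smiles_list)
```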
Update:
- Model path (.keras)
- char2idx.pkl
- idx2char.pkl
- vocab.json
Run the notebook.
An interactive menu will appear in the console.
Follow the prompts to:
- Select generation mode
- Set batch size
- Adjust temperature
- Provide SMARTS and/or scaffold inputs
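Temperature controls how sharply the model's next-token distribution is sampled: values below 1 favor high-probability tokens, values above 1 increase diversity. A pure-Python sketch of the idea (not the notebook's implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from raw logits, sharpened or softened by
    the temperature (lower = greedier, higher = more diverse)."""
    scaled = [l / max(temperature, 1e-8) for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    r = random.random() * total
    cumulative = 0.0
    for i, e in enumerate(exps):
        cumulative += e
        if r <= cumulative:
            return i
    return len(exps) - 1
```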
- Keep MAX_LENGTH consistent across all notebooks.
- Vocabulary files must correspond to the specific trained model.
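That correspondence can be checked programmatically before generation. A hedged sketch (the function name is my own; the two things that must agree are the inverse mapping and the vocabulary size the model's embedding layer expects):

```python
def vocabs_consistent(char2idx, idx2char, model_vocab_size):
    """True when the two mappings are mutual inverses and their size
    matches the model's expected vocabulary size."""
    inverse_ok = all(idx2char.get(i) == c for c, i in char2idx.items())
    return inverse_ok and len(char2idx) == model_vocab_size
```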