This is the code for the paper on

This repo introduces two things:

  • A new Variational Auto-Encoder (VAE) architecture that goes from a molecular graph to a sequence representation (in particular SELFIES).
  • An optimization pipeline that optimizes a scoring function that includes docking.

The required packages are listed in conda environment files (ymls), available for cpu or cuda10 usage.

conda env create -f ymls/cpu.yml 

Otherwise, install the following packages manually:

pytorch, dgl, networkx, scikit-learn, rdkit, tqdm, ordered-sets, moses, pandas
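After a manual install, a quick check that every package can be imported may save a failed run later. The sketch below is not part of the repo. The import names in the mapping are assumptions based on how these packages are usually imported (for example `torch` for pytorch and `sklearn` for scikit-learn) and may need adjusting for your environment.

```python
import importlib.util

# Package name (as listed above) -> import name.
# The import names are assumptions, not taken from this repo.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "dgl": "dgl",
    "networkx": "networkx",
    "scikit-learn": "sklearn",
    "rdkit": "rdkit",
    "tqdm": "tqdm",
    "ordered-sets": "ordered_set",
    "moses": "moses",
    "pandas": "pandas",
}


def find_missing(modules):
    """Return the package names whose import name cannot be located."""
    missing = []
    for package, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages can be located.")
```

The check only locates the modules without importing them, so it stays fast even for heavy packages.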

Prior model training

Data loading

We use Molecular Sets (MOSES) to train our model. After installing the moses python library, the data can be fetched by running

python data_processing/ 

To train a graph2selfies model, SELFIES must be precomputed for the train set. To compute SELFIES for another dataset stored as a csv file, with the molecules in a column entitled 'smiles', run:

python data_processing/ -i [path_to_my_csv_dataset]

Model training

To train the model, run:

python --train [my_dataset.csv] --n [your_model_name]

The csv must contain columns entitled 'smiles' and 'selfies'.
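Before launching a training run, it can help to confirm that the csv has both required columns. The helper below is a minimal sketch using only the standard library. It is not part of the repo, and the function name `missing_columns` is hypothetical.

```python
import csv

# Column names required by the training step, as stated above.
REQUIRED_COLUMNS = ("smiles", "selfies")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the csv header."""
    with open(csv_path, newline="") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

For example, `missing_columns("my_dataset.csv")` returns an empty list when both columns are present, and `['selfies']` when the SELFIES have not been precomputed yet.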

Embedding molecules

To compute embeddings for the molecules in a csv file:

python -i [path_to_csv] --name [your_model_name] -v [smiles]/[selfies]

The column containing the smiles/selfies should be labeled 'smiles'.

Generating samples

To generate samples from a trained model, run:

python generate/ -N [number_of_samples] --name [name_of_the_model]

Moses metrics

To compute the Moses benchmark metrics for the samples (30k samples recommended), run:

python eval/ -i [path_to_txt_with_samples]

Scoring function optimization

This is mostly an efficient implementation of the CbAS algorithm for docking. There are also two implementations of Bayesian optimization (BO) in /optim.
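For readers new to CbAS (Conditioning by Adaptive Sampling), the core loop can be sketched on a toy problem. This is not the code in /cbas. It assumes a one-dimensional Gaussian search model, a standard normal prior, and a made-up deterministic score in place of docking, so that the probability of exceeding the threshold reduces to an indicator. All names and parameters are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def toy_score(x):
    # Stand-in for a docking-based score; highest at x = 2.
    return -(x - 2.0) ** 2


def toy_cbas(n_iters=10, n_samples=500, quantile=0.8, seed=0):
    rng = random.Random(seed)
    prior_mu, prior_sigma = 0.0, 1.0  # prior p(x)
    mu, sigma = prior_mu, prior_sigma  # search model q(x), starts at the prior
    gamma = -math.inf                  # score threshold, never decreases
    for _ in range(n_iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        scores = [toy_score(x) for x in xs]
        # Raise the threshold to the chosen quantile of the current scores.
        gamma = max(gamma, sorted(scores)[int(quantile * (n_samples - 1))])
        # Weight = prior/search density ratio for samples above the threshold.
        weights = [
            normal_pdf(x, prior_mu, prior_sigma) / normal_pdf(x, mu, sigma)
            if s >= gamma else 0.0
            for x, s in zip(xs, scores)
        ]
        total = sum(weights)
        if total <= 0.0:
            break  # no sample passed the threshold; keep the last model
        # Weighted maximum-likelihood refit of the Gaussian search model.
        mu = sum(w * x for w, x in zip(weights, xs)) / total
        var = sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / total
        sigma = max(math.sqrt(var), 1e-2)  # floor to avoid collapse
    return mu, sigma


if __name__ == "__main__":
    print(toy_cbas())
```

The density ratio keeps the search model anchored to the prior while the rising threshold pushes it toward high-scoring regions. In the molecular setting the Gaussian would be replaced by a generative model and the toy score by the docking-based scoring function.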


Go to /cbas


Optimization of binding affinities in chemical space for drug discovery


