This repo complements the publication "Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates".
The specific versions used in this project were:
Python: 3.6.9
PyTorch: 1.2.0
TorchText: 0.4.0
OpenNMT-py: 1.0.0
RDKit: 2019.03.2
conda create -n onmt36 python=3.6
conda activate onmt36
conda install -c rdkit rdkit=2019.03.2 -y
conda install -c pytorch pytorch=1.2.0 -y
git clone https://github.com/rxn4chemistry/OpenNMT-py.git
cd OpenNMT-py
git checkout carbohydrate_transformer
pip install -e .
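To verify the setup, a quick sanity check along these lines can be run (a minimal sketch; it only assumes the packages above installed cleanly):
# Sanity check: confirm the pinned versions are importable.
import torch
import torchtext
import rdkit
import onmt  # installed by the editable pip install above

print('Torch:', torch.__version__)                           # expect 1.2.0
print('TorchText:', getattr(torchtext, '__version__', '?'))  # expect 0.4.0
print('RDKit:', rdkit.__version__)                           # expect 2019.03.2
print('OpenNMT-py:', getattr(onmt, '__version__', '?'))      # expect 1.0.0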
Training and evaluation were performed using OpenNMT-py. The full documentation of the OpenNMT library can be found at https://opennmt.net/OpenNMT-py/.
Start by merging the two USPTO training source files into a single file by running python merge_src_splits.py in data/uspto_dataset.
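The script concatenates the split source files back into one. A minimal sketch of the idea (the split filenames below are hypothetical; merge_src_splits.py in the repo is the authoritative version):
# Hypothetical sketch of the merge step: concatenate the source splits
# back into a single src-train.txt.
from pathlib import Path

parts = sorted(Path('data/uspto_dataset').glob('src-train-split*.txt'))  # hypothetical names
with open('data/uspto_dataset/src-train.txt', 'w') as out:
    for part in parts:
        out.write(part.read_text())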
Three preprocessed datasets are built: the USPTO set used for pretraining, the carbohydrate transfer set used for sequential fine-tuning, and a combined set used for multi-task training.
DATADIR=data/uspto_dataset
onmt_preprocess -train_src $DATADIR/src-train.txt -train_tgt $DATADIR/tgt-train.txt -valid_src $DATADIR/src-valid.txt -valid_tgt $DATADIR/tgt-valid.txt -save_data $DATADIR/uspto -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab
DATADIR=data/transfer_dataset
onmt_preprocess -train_src $DATADIR/src-train.txt -train_tgt $DATADIR/tgt-train.txt -valid_src $DATADIR/src-valid.txt -valid_tgt $DATADIR/tgt-valid.txt -save_data $DATADIR/sequential -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab
DATASET=data/uspto_dataset
DATASET_TRANSFER=data/transfer_dataset
onmt_preprocess -train_src ${DATASET}/src-train.txt ${DATASET_TRANSFER}/src-train.txt -train_tgt ${DATASET}/tgt-train.txt ${DATASET_TRANSFER}/tgt-train.txt -train_ids uspto transfer -valid_src ${DATASET_TRANSFER}/src-valid.txt -valid_tgt ${DATASET_TRANSFER}/tgt-valid.txt -save_data ${DATASET_TRANSFER}/multi_task -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab
The files have been tokenized in advance using the reaction SMILES tokenization function available from https://github.com/pschwllr/MolecularTransformer.
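For reference, the tokenizer used there is essentially the following regex-based function (reproduced from memory; check the MolecularTransformer repo for the authoritative version):
import re

def smi_tokenizer(smi):
    # Split a SMILES molecule or reaction string into tokens (bracket atoms,
    # two-letter halogens, bonds, ring closures, ...) joined by single spaces.
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    tokens = re.findall(pattern, smi)
    assert smi == ''.join(tokens)  # every character must be accounted for
    return ' '.join(tokens)

For example, smi_tokenizer('CC(=O)O') returns 'C C ( = O ) O'.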
The data consists of parallel precursor (src) and product (tgt) files containing one reaction per line, with tokens separated by a space:
src-train.txt
tgt-train.txt
src-valid.txt
tgt-valid.txt
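For illustration, a hypothetical esterification (ethanol plus acetic acid giving ethyl acetate) would appear as one tokenized line in each file:
src-train.txt: C C O . C C ( = O ) O
tgt-train.txt: C C O C ( C ) = O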
After running the preprocessing, the following files are generated:
uspto.train.pt: serialized PyTorch file containing training data
uspto.valid.pt: serialized PyTorch file containing validation data
uspto.vocab.pt: serialized PyTorch file containing vocabulary data
Internally, the system never operates on the tokens themselves, but on their vocabulary indices.
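As a toy illustration of that mapping (not the actual OpenNMT-py internals):
# Toy illustration: the vocabulary maps each token to an integer index,
# and training operates on these indices, not on the token strings.
tokens = 'C C ( = O ) O'.split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
indices = [vocab[tok] for tok in tokens]
print(indices)  # [3, 3, 0, 2, 4, 1, 4]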
The transformer models were trained using the following hyperparameters:
DATADIR=data/uspto_dataset
SEED=42  # any fixed random seed; the commands below reference $SEED
onmt_train -data $DATADIR/uspto \
-save_model uspto_model_pretrained \
-seed $SEED -gpu_ranks 0 \
-train_steps 250000 -param_init 0 \
-param_init_glorot -max_generator_batches 32 \
-batch_size 6144 -batch_type tokens \
-normalization tokens -max_grad_norm 0 -accum_count 4 \
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
-layers 4 -rnn_size 384 -word_vec_size 384 \
-encoder_type transformer -decoder_type transformer \
-dropout 0.1 -position_encoding -share_embeddings \
-global_attention general -global_attention_function softmax \
-self_attn_type scaled-dot -heads 8 -transformer_ff 2048
DATADIR=data/transfer_dataset
SEED=42
WEIGHT1=9  # relative sampling weight for the uspto corpus
WEIGHT2=1  # relative sampling weight for the transfer corpus
onmt_train -data $DATADIR/multi_task \
-save_model multi_task_model \
-data_ids uspto transfer -data_weights $WEIGHT1 $WEIGHT2 \
-seed $SEED -gpu_ranks 0 \
-train_steps 250000 -param_init 0 \
-param_init_glorot -max_generator_batches 32 \
-batch_size 6144 -batch_type tokens \
-normalization tokens -max_grad_norm 0 -accum_count 4 \
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
-layers 4 -rnn_size 384 -word_vec_size 384 \
-encoder_type transformer -decoder_type transformer \
-dropout 0.1 -position_encoding -share_embeddings \
-global_attention general -global_attention_function softmax \
-self_attn_type scaled-dot -heads 8 -transformer_ff 2048
DATADIR=data/transfer_dataset
SEED=42
onmt_train -data $DATADIR/sequential \
-train_from models/uspto_model_pretrained.pt \
-save_model sequential_model \
-seed $SEED -gpu_ranks 0 \
-train_steps 6000 -param_init 0 \
-param_init_glorot -max_generator_batches 32 \
-batch_size 6144 -batch_type tokens \
-normalization tokens -max_grad_norm 0 -accum_count 4 \
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
-layers 4 -rnn_size 384 -word_vec_size 384 \
-encoder_type transformer -decoder_type transformer \
-dropout 0.1 -position_encoding -share_embeddings \
-global_attention general -global_attention_function softmax \
-self_attn_type scaled-dot -heads 8 -transformer_ff 2048
To test the model on new reactions, run:
onmt_translate -model uspto_model_pretrained.pt -src $DATADIR/src-test.txt -output predictions.txt -n_best 1 -beam_size 5 -max_length 300 -batch_size 64
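To score the predictions, a sketch along the following lines can be used (a hypothetical helper, not part of this repo; it assumes a tokenized ground-truth file such as tgt-test.txt and the single prediction per line produced by -n_best 1):
# Sketch: top-1 accuracy via canonical SMILES comparison with RDKit.
from rdkit import Chem

def canonicalize(tokenized_smiles):
    # De-tokenize (strip spaces) and canonicalize; return '' if invalid.
    smi = tokenized_smiles.strip().replace(' ', '')
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else ''

with open('predictions.txt') as f:
    preds = [canonicalize(line) for line in f]
with open('data/transfer_dataset/tgt-test.txt') as f:  # hypothetical path
    targets = [canonicalize(line) for line in f]

correct = sum(p == t and t != '' for p, t in zip(preds, targets))
print(f'Top-1 accuracy: {correct / len(targets):.3f}')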
Pretrained models can be found in the models folder.
If you use this work, please cite:
@article{pesciullesi2020transfer,
title={Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates},
author={Pesciullesi, Giorgio and Schwaller, Philippe and Laino, Teodoro and Reymond, Jean-Louis},
journal={Nature Communications},
volume={11},
number={1},
pages={1--8},
year={2020},
publisher={Nature Publishing Group}
}
The Carbohydrate Transformer is based on OpenNMT-py; if you reuse this code, please also cite the underlying framework:
OpenNMT: Neural Machine Translation Toolkit
@inproceedings{opennmt,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
booktitle = {Proc. ACL},
year = {2017},
url = {https://doi.org/10.18653/v1/P17-4012},
doi = {10.18653/v1/P17-4012}
}