Grammar Variational Autoencoder

This repository contains training and sampling code for the paper "Grammar Variational Autoencoder" (https://arxiv.org/abs/1703.01925).

Requirements

Install the dependencies (CPU version) with pip install -r requirements.txt

For GPU compatibility, replace the fourth line in requirements.txt with: https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-0.12.1-cp27-none-linux_x86_64.whl

Creating datasets

Molecules

To create the molecule datasets, call:

  • python make_zinc_dataset_grammar.py
  • python make_zinc_dataset_str.py
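These scripts preprocess the ZINC SMILES data into one-hot encoded arrays. A minimal sketch of inspecting the result, assuming the grammar script writes an HDF5 file named zinc_grammar_dataset.h5 with a dataset key 'data' (both the filename and the key are assumptions; check the script if they differ):

```python
import h5py

# Open the HDF5 file produced by make_zinc_dataset_grammar.py
# (filename and dataset key are assumptions; adjust to match the script).
with h5py.File('zinc_grammar_dataset.h5', 'r') as h5f:
    data = h5f['data'][:]

# Expected layout: one one-hot vector over grammar production rules
# per timestep, i.e. (num_molecules, max_sequence_length, num_rules).
print(data.shape)
```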

Equations

The equation dataset can be downloaded here: grammar, string

Training

Molecules

To train the molecule models, call:

  • python train_zinc.py % train the grammar model
  • python train_zinc.py --latent_dim=2 --epochs=50 % train a model with a 2D latent space for 50 epochs
  • python train_zinc_str.py % train the character (string) model

Equations

  • python train_eq.py % train the grammar model
  • python train_eq.py --latent_dim=2 --epochs=50 % train a model with a 2D latent space for 50 epochs
  • python train_eq_str.py % train the character (string) model
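Both train_zinc.py and train_eq.py accept the same flags. A minimal sketch of how such options are typically parsed (the defaults below are hypothetical; consult the scripts for the actual values and any additional options):

```python
import argparse

# Hypothetical flag parsing mirroring the commands above;
# the real training scripts may differ in names and defaults.
parser = argparse.ArgumentParser(description='Train a grammar VAE')
parser.add_argument('--latent_dim', type=int, default=56,
                    help='dimensionality of the latent space')
parser.add_argument('--epochs', type=int, default=100,
                    help='number of training epochs')
args = parser.parse_args()

print(args.latent_dim)
print(args.epochs)
```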

Sampling

Molecules

The file molecule_vae.py can be used to encode and decode SMILES strings. For a demo, run:

  • python encode_decode_zinc.py
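A minimal usage sketch, assuming molecule_vae.py exposes a ZincGrammarModel class whose encode/decode methods operate on lists of SMILES strings and on latent vectors (the class name, method names, and weights path are assumptions; encode_decode_zinc.py shows the exact interface):

```python
import molecule_vae

# Load a trained grammar VAE. The weights filename is an assumption;
# use the file produced by train_zinc.py or the pretrained weights.
grammar_weights = 'pretrained/zinc_vae_grammar.hdf5'
model = molecule_vae.ZincGrammarModel(grammar_weights)

smiles = ['CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1']
latent = model.encode(smiles)    # map SMILES strings to latent vectors
decoded = model.decode(latent)   # map latent vectors back to SMILES
print(decoded)
```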

Equations

The analogous file equation_vae.py can encode and decode equation strings. Run:

  • python encode_decode_eq.py
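The interface mirrors the molecule case. A short sketch, assuming equation_vae.py provides an EquationGrammarModel class analogous to the molecule model (the class name, weights path, and example strings are illustrative assumptions):

```python
import equation_vae

# Hypothetical weights path; use the file produced by train_eq.py.
model = equation_vae.EquationGrammarModel('eq_grammar_weights.hdf5')

equations = ['sin(x*2)+x', 'exp(x)*x']   # illustrative equation strings
latent = model.encode(equations)
print(model.decode(latent))
```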

Bayesian optimization

The Bayesian optimization experiments use sparse Gaussian processes implemented in Theano.

We use a modified version of Theano with a few add-ons, e.g. to compute the log determinant of a positive definite matrix in a numerically stable manner. This modified version of Theano can be installed by going to the folder Theano-master and typing

  • python setup.py install
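The add-on mentioned above computes log det(A) for a positive definite matrix A without forming the determinant directly, which would overflow or underflow for large matrices. The standard trick, shown here with plain NumPy rather than the modified Theano op, uses a Cholesky factorisation A = L L^T, so that log det(A) = 2 * sum(log(diag(L))):

```python
import numpy as np

def stable_logdet(A):
    """Numerically stable log-determinant of a positive definite matrix.

    np.log(np.linalg.det(A)) can overflow or underflow; the diagonal of
    the Cholesky factor gives the same quantity safely:
    log det(A) = 2 * sum(log(diag(chol(A)))).
    """
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

# Example: a random positive definite matrix.
X = np.random.randn(500, 5)
A = X.T.dot(X) + 1e-6 * np.eye(5)
print(stable_logdet(A))
print(np.linalg.slogdet(A)[1])  # reference value; should agree
```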

The experiments with molecules require the RDKit library, which can be installed as described at http://www.rdkit.org/docs/Install.html.

The Bayesian optimization experiments can be replicated as follows:

1 - Generate the latent representations of molecules and equations. For this, go to the folders

molecule_optimization/latent_features_and_targets_grammar/

molecule_optimization/latent_features_and_targets_character/

equation_optimization/latent_features_and_targets_grammar/

equation_optimization/latent_features_and_targets_character/

and type

  • python generate_latent_features_and_targets.py
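This script encodes the training data with the corresponding VAE and stores the latent vectors together with their target values for the Gaussian process. A sketch of inspecting the output, assuming the script saves plain-text latent_features.txt and targets.txt files (the filenames are assumptions; check the script for the exact names it writes):

```python
import numpy as np

# Filenames are assumptions; see generate_latent_features_and_targets.py.
X = np.loadtxt('latent_features.txt')   # one latent vector per row
y = np.loadtxt('targets.txt')           # one target value per molecule/equation

print(X.shape)   # number of rows should match the number of targets
print(y.shape)
```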

2 - Go to the folders

molecule_optimization/simulation1/grammar/

molecule_optimization/simulation1/character/

equation_optimization/simulation1/grammar/

equation_optimization/simulation1/character/

and type

  • nohup python run_bo.py &

Repeat this step for all the simulation folders (simulation2, ..., simulation10). For speed, it is recommended to run these jobs in parallel on a computer cluster.
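If a cluster scheduler is not available, the runs can also be launched in parallel from a single machine. A minimal sketch using Python's subprocess module, mirroring the manual "nohup python run_bo.py &" step in every simulation folder (folder layout taken from the steps above; adjust paths as needed):

```python
import subprocess

# Launch run_bo.py in every simulation/method folder in parallel.
# Each process uses its own folder as the working directory and logs
# to a nohup.out-style file inside that folder.
procs = []
for experiment in ['molecule_optimization', 'equation_optimization']:
    for i in range(1, 11):
        for method in ['grammar', 'character']:
            folder = '{}/simulation{}/{}'.format(experiment, i, method)
            with open('{}/nohup.out'.format(folder), 'w') as log:
                procs.append(subprocess.Popen(
                    ['python', 'run_bo.py'], cwd=folder,
                    stdout=log, stderr=subprocess.STDOUT))

# Wait for all runs to finish.
for p in procs:
    p.wait()
```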

3 - Extract the results by going to the folders

molecule_optimization/

equation_optimization/

and typing

  • python get_final_results.py
  • ./get_average_test_RMSE_LL.sh