
CS-598 Deep Learning for Healthcare Final Project

This project explores generative methods for drug discovery using Variational Autoencoders (VAEs). Our initial objective is to measure the fraction of syntactically valid molecules each model produces, in order to determine which adjustments to the baseline model yield the largest improvements. We explore various training techniques, model improvements, and hard constraints imposed on the input format or latent space to achieve molecular generation that is more often valid and more chemically rich.
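
As a concrete version of that validity metric, here is a minimal sketch that uses RDKit to compute the fraction of generated SMILES strings that parse into a molecule; the function name and the example strings are illustrative and not taken from the notebooks.

from rdkit import Chem

def valid_ratio(smiles_list):
    # A string counts as valid if RDKit can parse it into a molecule object.
    valid = sum(1 for s in smiles_list if Chem.MolFromSmiles(s) is not None)
    return valid / len(smiles_list)

# Two valid SMILES and one syntactically broken one -> 2/3.
print(valid_ratio(["CCO", "c1ccccc1", "C((("]))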

Our presentation can be found here: https://youtu.be/zZHGXTrgfrY

We discuss our results and ideas for further investigation in our paper, found here.

Sources and Inspiration

The inspiration and sources for this repository come from:

Roadmap

1. Exploring the Data

First, we explored a dataset of 250K molecules to understand the data a little better. None of us has a background in chemistry or biology, so this step was fairly essential. The exploration can be reproduced by running Data_Exploration.ipynb in a Jupyter notebook environment (originally Google Colab).
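
For readers who want a feel for what that notebook does, here is a minimal sketch, assuming the 250K molecules are available as a CSV with a smiles column; the file name is a placeholder.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Placeholder path; Data_Exploration.ipynb downloads the actual 250K-molecule dataset.
df = pd.read_csv("250k_smiles.csv")

# String-level statistics: how long are the sequences the VAE must reconstruct?
df["length"] = df["smiles"].str.len()
print(df["length"].describe())

# Simple chemical properties via RDKit on a random sample.
sample = df["smiles"].sample(1000, random_state=0)
mols = [Chem.MolFromSmiles(s) for s in sample]
weights = [Descriptors.MolWt(m) for m in mols if m is not None]
print(sum(weights) / len(weights))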

Note on Notebooks:

Because these notebooks use Azure ML, they build an image to run on a compute cluster and automatically download the dataset required for training. The notebooks listed below run on an Azure ML compute instance, each creating a run of an experiment and logging its results.
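
If you have not used that workflow before, this is roughly what submitting a training script as an Azure ML experiment run looks like with the v1 azureml-core SDK; the experiment, script, and compute-target names are placeholders, not the ones used in this repository.

from azureml.core import Workspace, Experiment, ScriptRunConfig

# Loads the workspace from a local config.json downloaded from the Azure portal.
ws = Workspace.from_config()

# Placeholder names; each notebook defines its own experiment and compute target.
experiment = Experiment(workspace=ws, name="drug-discovery-vae")
config = ScriptRunConfig(source_directory=".",
                         script="train.py",
                         compute_target="gpu-cluster")

run = experiment.submit(config)
run.wait_for_completion(show_output=True)  # streams logs; metrics show up in Azure ML Studio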

2. Baseline Model using Deepchem

After exploring the data, we leveraged the deepchem library to train an instance of the AspuruGuzikAutoEncoder using our 250K molecule dataset. This work can be reproduced by running the notebook (inside an Azure ML Workspace):

Approach1_BaselineModel.ipynb
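
A rough sketch of what that training loop looks like, assuming deepchem's SeqToSeq-style API; the constructor argument names can differ between deepchem versions, and the tiny SMILES list here stands in for the real 250K-molecule dataset, so treat this as an outline rather than a drop-in replacement for the notebook.

import numpy as np
from deepchem.models.seqtoseq import AspuruGuzikAutoEncoder

# Placeholder data; the notebook trains on the full 250K-molecule dataset.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

tokens = sorted(set(c for s in train_smiles for c in s))
max_length = max(len(s) for s in train_smiles)

# Token list and maximum output length follow deepchem's SeqToSeq convention;
# check deepchem.models.seqtoseq for the exact signature in your installed version.
model = AspuruGuzikAutoEncoder(tokens, max_length, model_dir="vae_baseline")

def generate_sequences(epochs):
    # A VAE is trained to reconstruct its input, so every pair is (smiles, smiles).
    for _ in range(epochs):
        for s in train_smiles:
            yield (s, s)

model.fit_sequences(generate_sequences(50))

# Decode random latent vectors into candidate molecules (196 is the latent
# dimension from the original paper) and keep them as strings for validity checks.
latent = np.random.normal(size=(100, 196))
candidates = ["".join(seq) for seq in model.predict_from_embeddings(latent)]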

3. AspuruGuzikAutoEncoder with KL Annealing

KL annealing prioritizes the reconstruction loss early in the training process and only later incorporates more of the Kullback–Leibler (KL) divergence loss. The baseline model provided by deepchem uses this cost annealing, and we wanted to test the impact of turning it off; a minimal sketch of such a schedule follows the list below. Those results can be reproduced by running:

  • Approach2_DisablingCostAnnealing.ipynb
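
To make the idea concrete, here is a minimal, framework-agnostic sketch of a cost-annealing schedule; the step boundaries are illustrative, not the values deepchem uses internally.

def kl_weight(step, anneal_start=5000, anneal_end=10000):
    # Weight applied to the KL term at a given training step. Before anneal_start
    # the loss is purely reconstruction; between the boundaries the weight rises
    # linearly to 1; with anneal_start = anneal_end = 0 annealing is disabled and
    # the full VAE objective is used from the first step.
    if step >= anneal_end:
        return 1.0
    if step <= anneal_start:
        return 0.0
    return (step - anneal_start) / (anneal_end - anneal_start)

# total_loss = reconstruction_loss + kl_weight(step) * kl_divergence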

4. Teacher Forcing

Teacher forcing is a technique that helps the decoder during training so that characters later in a sequence get a fair chance to be learned. For example, when reconstructing C[NH+](C/C=C/c1ccco1)CCC(F)(F)F, if the decoder gets the first few characters wrong, teacher forcing feeds the correct character into the next decoding step anyway, so the following characters still have a good chance of being predicted correctly. The whole sequence is therefore trained at once, rather than the later characters depending on the initial decoding being correct. We attempted to add this technique to the deepchem library, but our efforts were unsuccessful. We were, however, able to train a VAE with teacher forcing enabled using the moses library; that implementation can be reproduced by running moses_vae.ipynb. A minimal sketch of a teacher-forced decoding step appears after the list below.

The results can be reproduced by running:

  • Approach3_Moses.ipynb
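
Here is a minimal, self-contained sketch of a teacher-forced decoding step in PyTorch. It is not the moses implementation; the GRU decoder, layer sizes, and the random data standing in for an encoded SMILES batch are all illustrative.

import torch
import torch.nn as nn

# Toy decoder: embed the previous character, run one GRU step, predict the next one.
vocab_size, emb_dim, hidden_dim = 40, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
to_logits = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def decode_with_teacher_forcing(target, hidden):
    # target: (batch, seq_len) token ids; hidden: (1, batch, hidden_dim) from the encoder.
    loss = 0.0
    prev = target[:, :1]  # start from the first ground-truth token
    for t in range(1, target.size(1)):
        out, hidden = gru(embed(prev), hidden)
        logits = to_logits(out[:, -1])
        loss = loss + loss_fn(logits, target[:, t])
        # Teacher forcing: feed the *correct* token back in, not the prediction,
        # so later positions still get a useful training signal even if this
        # prediction was wrong.
        prev = target[:, t:t + 1]
    return loss / (target.size(1) - 1)

# Random data standing in for a batch of tokenized SMILES and encoder states.
batch, seq_len = 8, 20
target = torch.randint(0, vocab_size, (batch, seq_len))
hidden = torch.zeros(1, batch, hidden_dim)
print(decode_with_teacher_forcing(target, hidden))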

5. SELFIES

Up to this point, our models have used the SMILES representation to encode a molecule as a string. We also experimented with a different string representation called SELFIES, which is designed so that every string decodes to a syntactically valid molecule; a short encode/decode sketch follows the list below. Those results can be reproduced by running:

  • Approach4_Selfies.ipynb
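
A minimal sketch of what the SELFIES representation looks like, using the selfies Python package: a SMILES string is translated into SELFIES and back, and SELFIES strings split cleanly into tokens, which is convenient when building a VAE vocabulary.

import selfies as sf

# Same molecule as in the teacher-forcing example above.
smiles = "C[NH+](C/C=C/c1ccco1)CCC(F)(F)F"

encoded = sf.encoder(smiles)   # SMILES -> SELFIES
decoded = sf.decoder(encoded)  # SELFIES -> SMILES round trip
print(encoded)
print(decoded)

# Token list for building a model vocabulary.
tokens = list(sf.split_selfies(encoded))
print(len(tokens), tokens[:5])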

Conclusion

Hopefully these notebooks are useful to you in your drug discovery journey!
