This project explored generative methods for drug discovery using Variational Autoencoders. Our initial objective was to measure the proportion of syntactically valid molecules each model could produce, so that we could determine which adjustments to the baseline model yielded the largest improvements. We explored various training techniques, model improvements, and hard constraints on the input format or latent space to achieve molecular generation that is more valid and chemically rich.
Our presentation can be found here: https://youtu.be/zZHGXTrgfrY
We discuss our results and ideas for further investigation in our paper, found here.
The inspiration and sources for this repository come from:
- Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
- https://docs.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources
- https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc
- https://github.com/joeym-09/Leveraging-VAE-to-generate-molecules/blob/master/VAE_model_250k.ipynb
First, we explored a dataset of 250K molecules to understand the data a little better. None of us has a background in chemistry or biology, so this step was fairly essential. The exploration can be reproduced by running Data_Exploration.ipynb in a Jupyter notebook environment (originally Google Colab).
Because these notebooks use Azure ML, they will create an image to run the notebook on a compute cluster and will automatically download the dataset required for training. The notebooks listed below run on an Azure ML compute instance, creating a run of an experiment and logging the results.
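To give a sense of the pattern these notebooks follow, here is a minimal sketch using the azureml-core SDK; the environment, script, cluster, and experiment names below are placeholders for illustration, not values taken from this repository:

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

# Connect to the workspace described by config.json (downloaded from the Azure portal).
ws = Workspace.from_config()

# Placeholder names: each notebook defines its own environment, script, and compute target.
env = Environment.from_conda_specification(name="vae-env", file_path="environment.yml")
config = ScriptRunConfig(
    source_directory=".",
    script="train_vae.py",
    compute_target="gpu-cluster",
    environment=env,
)

# Submit the script as a run of an experiment and stream its logs.
experiment = Experiment(workspace=ws, name="molecule-vae")
run = experiment.submit(config)
run.wait_for_completion(show_output=True)
```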
After exploring the data, we leveraged the deepchem library to train an instance of the AspuruGuzikAutoEncoder using our 250K molecule dataset. This work can be reproduced by running the notebook (inside an Azure ML Workspace):
Approach1_BaselineModel.ipynb
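The core of that notebook looks roughly like the following (a minimal sketch of deepchem's sequence-to-sequence VAE API; `smiles_list` stands in for the 250K SMILES strings loaded in the notebook, and the hyperparameters are illustrative):

```python
from deepchem.models.seqtoseq import AspuruGuzikAutoEncoder
from deepchem.models.optimizers import ExponentialDecay

# smiles_list is assumed to hold the 250K SMILES strings from the dataset.
tokens = sorted(set(c for s in smiles_list for c in s))
max_length = max(len(s) for s in smiles_list) + 1

batch_size = 100
batches_per_epoch = len(smiles_list) / batch_size
model = AspuruGuzikAutoEncoder(
    tokens,
    max_length,
    batch_size=batch_size,
    learning_rate=ExponentialDecay(0.001, 0.95, batches_per_epoch),
)

# An autoencoder trains on (input, output) pairs where both are the same string.
def generate_sequences(epochs):
    for _ in range(epochs):
        for s in smiles_list:
            yield (s, s)

model.fit_sequences(generate_sequences(50))
```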
KL annealing is used to prioritize reconstruction loss early in the training process and then gradually incorporate more of the Kullback–Leibler (KL) loss. Annealing is enabled in the baseline model provided by deepchem, and we wanted to test the impact of turning it off. Those results can be reproduced by running
Approach2_DisablingCostAnnealing.ipynb
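To make the idea concrete, here is a minimal sketch of a linear KL annealing schedule; the step bounds and the way the weight is applied are illustrative, not the exact scheme used inside deepchem:

```python
def kl_weight(step, anneal_start=5000, anneal_final=10000):
    """Weight on the KL term that ramps linearly from 0 to 1 between two training steps."""
    if step < anneal_start:
        return 0.0   # early training: optimize reconstruction loss only
    if step >= anneal_final:
        return 1.0   # late training: full KL term
    return (step - anneal_start) / (anneal_final - anneal_start)

# Per-step VAE loss; disabling annealing is equivalent to kl_weight(step) == 1.0 throughout:
# loss = reconstruction_loss + kl_weight(step) * kl_divergence
```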
Teacher forcing is a technique that helps the decoder along during training so that characters later in a sequence get a better chance to be learned. For example, if I'm trying to reconstruct
C[NH+](C/C=C/c1ccco1)CCC(F)(F)F
and my decoder messes up the first couple of characters, teacher forcing feeds the correct character back in at that step of the decoder so that the following characters have a better chance of also being correct. This allows the whole sequence to be trained at the same time instead of depending on the initial decoding being correct in order to learn the remaining characters. We attempted to add this technique to the deepchem library, but our efforts were unsuccessful.
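As a sketch of what teacher forcing looks like inside a decoder loop (generic PyTorch code for illustration only, not the deepchem or moses implementation):

```python
import torch
import torch.nn as nn

def decode_with_teacher_forcing(decoder_cell, output_layer, embed, latent_hidden,
                                targets, start_idx):
    """Decode step by step, predicting each token from the ground-truth previous token.

    decoder_cell:  nn.GRUCell taking (embedded token, hidden state) -> new hidden state
    output_layer:  nn.Linear mapping hidden state -> vocabulary logits
    embed:         nn.Embedding over the character vocabulary
    latent_hidden: initial hidden state derived from the VAE latent vector
    targets:       (batch, seq_len) ground-truth token indices
    start_idx:     index of the start-of-sequence token
    """
    batch, seq_len = targets.shape
    hidden = latent_hidden
    prev = torch.full((batch,), start_idx, dtype=torch.long, device=targets.device)
    logits = []
    for t in range(seq_len):
        hidden = decoder_cell(embed(prev), hidden)
        logits.append(output_layer(hidden))  # prediction for targets[:, t]
        # Teacher forcing: feed the ground-truth character as the next input,
        # instead of the model's own (possibly wrong) prediction.
        prev = targets[:, t]
    return torch.stack(logits, dim=1)        # (batch, seq_len, vocab_size)
```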
Ultimately we were able to use a VAE model with teacher forcing enabled using
the moses library. That implementation can be reproduced by running
moses_vae.ipynb
The results can be reproduced by running:
Approach3_Moses.ipynb
Up to this point, our models have used the SMILES representation to encode a molecule as a string. We also experimented with a different string representation of molecules called SELFIES. Those results can be reproduced by running
Approach4_Selfies.ipynb
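For reference, converting between the two representations is straightforward with the selfies package (a minimal sketch; the molecule is the same example used above):

```python
import selfies as sf

smiles = "C[NH+](C/C=C/c1ccco1)CCC(F)(F)F"

# Encode SMILES into SELFIES, a representation designed so that any sequence of
# SELFIES tokens decodes back to a syntactically valid molecule.
selfies_str = sf.encoder(smiles)

# Decoding recovers a valid SMILES string.
recovered = sf.decoder(selfies_str)
print(selfies_str)
print(recovered)
```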
Hopefully these notebooks are useful to you in your drug discovery journey!