The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology #213

agitter opened this issue Jan 27, 2017 · 4 comments


agitter commented Jan 27, 2017

Recent advances in deep learning and specifically in generative adversarial networks have demonstrated surprising results in generating new images and videos upon request even using natural language as input. In this paper we present the first application of generative adversarial autoencoders (AAE) for generating novel molecular fingerprints with a defined set of parameters. We developed a 7-layer AAE architecture with the latent middle layer serving as a discriminator. As an input and output the AAE uses a vector of binary fingerprints and concentration of the molecule. In the latent layer we also introduced a neuron responsible for growth inhibition percentage, which when negative indicates the reduction in the number of tumor cells after the treatment. To train the AAE we used the NCI-60 cell line assay data for 6252 compounds profiled on MCF-7 cell line. The output of the AAE was used to screen 72 million compounds in PubChem and select candidate molecules with potential anti-cancer properties. This approach is a proof of concept of an artificially-intelligent drug discovery engine, where AAEs are used to generate new molecular fingerprints with the desired molecular properties.

@agitter agitter added the treat label Jan 27, 2017

gwaybio commented Feb 2, 2017

Interesting study that uses binarized chemical compound vectors of length 166 (that look like this) combined with dosage concentration data to generate new molecular fingerprints, which may help prioritize candidate small molecules for treating cancer.

Biological Aspects

  • Chemical compounds with dosage information as input
  • Also included is the chemical's corresponding growth inhibition in a breast cancer cell line (MCF-7)

Computational Aspects

  • adversarial autoencoder that encodes input binarized chemical compound vectors into a length 5 latent layer
  • 2 layer encoder to learn how the molecular fingerprint impacts growth inhibition
    • The latent layer can thereby represent a vector of how well the corresponding fingerprint impacts MCF-7 growth
  • 2 layer decoder for reconstruction
  • The adversarial training comes in as the authors sample from a learned prior distribution
    • The sampled length 5 vector from the prior is then run through a discriminator to detect real latent vectors from fake
    • Growth inhibition is sampled from a normal distribution with mean=5 and variance=1 independently from the prior
  • Once the model is trained, the sampled latent vector is decoded to output an artificial molecular fingerprint with a corresponding drug concentration
  • This artificial fingerprint is compared against a reference of 72 million compounds from pubchem
    • The authors then selected the top 10 most similar compounds to their predicted compounds if the decoded log concentration was less than -5.0 molar
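
To make the sampling and filtering steps above concrete, here is a rough sketch of the generation stage only. It is not the authors' code: the hidden-layer widths, the random (untrained) weights, the placeholder concentration bias, and the split of the length 5 latent layer into 4 prior coordinates plus 1 growth-inhibition coordinate are all assumptions made for illustration. Only the fingerprint length (166), the latent size (5), the 640 samples, the N(5, 1) growth-inhibition draw, and the -5.0 log concentration cutoff come from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BITS = 166        # fingerprint length (from the summary)
N_LATENT = 5        # latent layer size (from the summary)
HIDDEN = (64, 128)  # hypothetical decoder hidden-layer widths


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_layer(n_in, n_out):
    # Small random weights: an untrained stand-in for learned parameters.
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


# 2-layer decoder followed by two output heads (bits and log concentration).
decoder = [init_layer(N_LATENT, HIDDEN[0]), init_layer(HIDDEN[0], HIDDEN[1])]
W_bits, b_bits = init_layer(HIDDEN[1], N_BITS)
W_conc, _ = init_layer(HIDDEN[1], 1)
b_conc = np.full(1, -5.0)  # placeholder bias so the cutoff below is exercised


def decode(z):
    h = z
    for W, b in decoder:
        h = np.tanh(h @ W + b)
    bit_probs = sigmoid(h @ W_bits + b_bits)  # per-bit presence probability
    log_conc = (h @ W_conc + b_conc)[:, 0]    # decoded log concentration
    return bit_probs, log_conc


def sample_latent(n):
    # Assumption: 4 coordinates from a standard normal prior, plus a
    # growth-inhibition coordinate drawn independently from N(5, 1).
    z = rng.normal(0.0, 1.0, size=(n, N_LATENT - 1))
    gi = rng.normal(5.0, 1.0, size=(n, 1))
    return np.hstack([z, gi])


z = sample_latent(640)
bit_probs, log_conc = decode(z)
keep = log_conc < -5.0        # keep only low decoded concentrations
candidates = bit_probs[keep]  # fingerprints to compare against PubChem
```

With trained weights in place of the random ones, `candidates` would hold the generated fingerprints that pass the concentration filter.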

Why we should include it in our review

I am not entirely sure if we should consider this paper for our review. edit I think we can include it now, in the treat section or as a method for prioritizing drug candidates/repurposing.

This is not my field of expertise, but I am interested in adversarial methods so I gave this paper a thorough read. However, the methods, results, and evaluation remain a bit unclear to me. Another really nice thing about this paper is the availability of source code. Perhaps @spoilt333 can help to clarify some of my confusion. I outlined my understanding above, but a couple of points remain:

  • Why was the growth inhibition (GI) sampled independently?
    • it seems to me that this is a critical component of the model and if the GI is high, then the drug is considered effective. Isn't this artificial sampling decoupled from the learning process? edit Based on @spoilt333's response, it is now clear that this parameter is learned. If the latent vector combination is unreasonable (presents a very high concentration in the output layer) the generated compound is rejected.
  • Why did the authors choose to sample 640 vectors and how did they exactly determine similar compounds from pubchem?
    • edit 640 is a random number
  • What is the discriminator? Is it using some sort of density metric or KL divergence as compared to the latent distribution?
  • There is no discussion of how the model trains or whether it is actually learning something meaningful. The authors do nicely discuss several specific examples of "nearest" compounds, so it seems to be working, but it would be great to see some sort of model evaluation.
    • For example, what is the reconstruction cost associated with the autoencoder portion of the model, how does it change across epochs, and what was the stopping criterion?
    • What are the hyperparameters of the model and how were they chosen?
    • edit these results will be expanded upon in a future publication

Overall, I thought the paper elegantly laid out the problem of the very high drug development failure rate and the evolution of computational methods for compound prioritization. The authors also apply an approach that appears to be working at first glance, and I would like to see it work really well because it looks very promising for drug development and drug repurposing. However, given my concerns, perhaps it is not suitable for this review. Maybe we could talk about the idea of the approach in the discussion, but I am not sure.

spoilt333 commented Feb 2, 2017 via email

gwaybio commented Feb 2, 2017

Hi @spoilt333 - this is great! thanks for your prompt response - i think this clears up a lot. I'll respond to your points below:

  1. Actually, GI neuron was trained jointly with rest latent neurons as predictor of "efficiency" of drug. But, after training, it was used as tuner for generating new drugs. Latent layer is a kind of noise and GI is a condition for Decoder net, and both used to produce output.

Ah, I see, this makes sense now - I think this is a nice innovation! I can see then that the rejection criterion was whether or not the concentration of the corresponding reconstructed molecular fingerprint was reasonable.

  1. There was no reason to pick exactly 640 samples, but we had to choose some:) As the output layer has a sigmoid activation, we treat it as the probability of presence of the corresponding bit in the compound code. So, "similarity" was just the likelihood of a compound being sampled from the generated vector.

Great, ok, I see now. I must have missed that the output layer was sigmoid.
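
As I read this, each sigmoid output is the probability that a bit is set, so a library compound can be scored by its Bernoulli log-likelihood under the generated vector. A minimal sketch of that scoring, assuming independent bits; the function names and the toy 4-bit data are made up for illustration and are not from the paper's code:

```python
import numpy as np


def bernoulli_log_likelihood(bit_probs, fingerprints, eps=1e-7):
    """Log-likelihood of each binary fingerprint under per-bit probabilities."""
    p = np.clip(bit_probs, eps, 1.0 - eps)  # avoid log(0)
    fp = np.asarray(fingerprints, dtype=float)
    return fp @ np.log(p) + (1.0 - fp) @ np.log(1.0 - p)


def top_k(bit_probs, fingerprints, k=10):
    """Indices and scores of the k most likely fingerprints, best first."""
    scores = bernoulli_log_likelihood(bit_probs, fingerprints)
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy example: one generated vector and a three-compound "library".
probs = np.array([0.9, 0.1, 0.8, 0.2])
library = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
])
order, scores = top_k(probs, library, k=2)
```

Here the first library compound agrees with the generated vector on every bit and ranks first, followed by the third compound, which differs on one bit. For the real 166-bit vectors, `k=10` would match the top 10 selection described in the paper.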

  1. Discriminator is a standard GANs part. In fact, it is a binary classifier which tries to determine whether a sample came from some "true" distribution or was generated by the NN. In our case, the true distribution was Gaussian, and the false samples came from the Encoder.

Yep! I was wondering what the architecture of the discriminator was. Sounds like it could be a logistic regression classifier? Or was it that you sampled several times from the generator and if it fell beyond the distribution of the real latent space then it was rejected?
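
To spell out what I mean by the logistic regression guess, here is a small sketch of a discriminator as a binary classifier between Gaussian "true" samples and encoder-produced "false" samples. The actual discriminator architecture is not given in this thread, so the logistic regression form, the learning rate, the step count, and the shifted Gaussian used as a stand-in for encoder outputs are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_logistic_discriminator(real, fake, lr=0.1, steps=500):
    """Fit weights so that sigmoid(x @ w + b) approximates P(x is 'true')."""
    X = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y  # gradient of the cross-entropy loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# "True" samples come from the Gaussian prior over the length 5 latent space.
real = rng.normal(0.0, 1.0, size=(1000, 5))
# Hypothetical stand-in for encoder outputs that do not yet match the prior.
fake = rng.normal(2.0, 1.0, size=(1000, 5))

w, b = train_logistic_discriminator(real, fake)
pred_real = sigmoid(real @ w + b) > 0.5
pred_fake = sigmoid(fake @ w + b) > 0.5
accuracy = (pred_real.sum() + (~pred_fake).sum()) / 2000
```

In adversarial training the encoder would be updated to push this accuracy back toward chance, which is what pulls the latent codes toward the Gaussian prior.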

  1. It is really a big point and we are going to make it more clear in the next paper. There are a few ideas. The most important hyperparameter is the latent layer size, IMO.

I have found this to be the case as well. Looking forward to the next paper.

Thanks again for responding so quickly, I will update my summary posted above accordingly.

spoilt333 commented Feb 2, 2017 via email
