## Demo

How can we help doctors discover new drugs? We'll use a Generative Adversarial Network to generate new drugs. 

https://dash-drug-explorer.plot.ly/
https://github.com/mostafachatillon/ChemGAN-challenge

![alt text](https://cdn-images-1.medium.com/max/1600/0*dwa5nSFfTgPiUyeU. "Logo Title Text 1")

## How Drugs are Discovered

![alt text](https://image.slidesharecdn.com/drugdiscoveryanddevelopment-111227060152-phpapp02/95/drug-discovery-and-development-11-638.jpg?cb=1422671083 "Logo Title Text 1")

- The process of getting a drug from initial discovery to market can take years or decades
- Experiments, clinical studies, and clinical trials are necessary
- 90% of all clinical trials in humans fail even after the molecules have been successfully tested in animals.

#### Step 1 - Studying Medical Literature

![alt text](https://image.slidesharecdn.com/drugdiscoveryprocessstyle3powerpointpresentationtemplates-120316015043-phpapp02/95/drug-discovery-process-style-3-powerpoint-presentation-templates-1-728.jpg?cb=1331862866 "Logo Title Text 1")

- Doctors must study the particular associations between drugs, diseases, and proteins published in papers and clinical studies
- They have to find out what the target for the drug should be, i.e., which protein it should bind with

#### Step 2 - Figuring out what the properties of the drug should be

![alt text](https://www.scientist.com/wp-content/uploads/2013/11/Stem-cells-can-be-used-to-determine-drug-like-properties.jpg "Logo Title Text 1")

- What properties should the drug have?
- How soluble should it be?
- Which specific structures does it need in order to bind with this protein?
- Should it treat this kind of cancer or that kind of cancer?

#### Step 3 - Figuring out which molecules have these properties

![alt text](http://images.slideplayer.com/28/9337949/slides/slide_4.jpg "Logo Title Text 1")

- One standard database lists 72 million molecules, complete with their formulas and some of their properties
- Does a certain molecule cure a certain disease? They must find out

#### Step 4 - Experimentation!

![alt text](http://humanspareparts.fi/wp-content/uploads/2015/06/hsp_infographic_4.jpg "Logo Title Text 1")

- Their ideas, called lead molecules, are sent to the lab for experimental validation
- If the lab says that the substance works, the clinical trial procedure can be initiated
- Only a small percentage of drugs actually go all the way through the funnel and reach the market
- This is because we need to be confident a drug is effective on a large number of patients
- Maybe we could soon go from in silico (on computer) to patients immediately.

![alt text](https://cdn-images-1.medium.com/max/1600/0*-jW_R9CNS3NEjnjx. "Logo Title Text 1")

## Possibilities

![alt text](https://294305267s7hqfks2cfh08ip-wpengine.netdna-ssl.com/wp-content/uploads/2017/05/Exscientia-Sanofi-artificial-intelligence-deal-e1494255220247.png "Logo Title Text 1")

- During the initial stage of identifying the lead molecules, we cannot be sure of anything
- Live experiments in the lab are still very slow and expensive, so we would like to find lead molecules as accurately as we can. 
- Even if the goal is only to treat cancer, there is no hope of checking the endless variety of small molecules in the lab
- 72 million is just the size of one specific database; the total number of small molecules is estimated to be between 10⁶⁰ and 10²⁰⁰
- Synthesizing and testing a single new molecule in the lab may cost thousands or tens of thousands of dollars.
- The early guessing stage is really, really important.
- We can use machine learning models to try and choose the molecules that are most likely to have desired properties.
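
To make the scale concrete, here is a back-of-envelope calculation using the figures from the bullets above. The per-molecule costs of $1,000 and $10,000 are illustrative picks from the "thousands or tens of thousands" range, not measured prices.

```python
# Back-of-envelope numbers taken from the bullets above.
DATABASE_SIZE = 72_000_000           # molecules in the database
COST_LOW, COST_HIGH = 1_000, 10_000  # assumed dollars per molecule (illustrative bounds)
CHEMICAL_SPACE_LOW = 10**60          # lower estimate of all small molecules


def screening_cost(n_molecules: int, cost_per_molecule: int) -> int:
    """Total cost of synthesizing and testing every molecule once."""
    return n_molecules * cost_per_molecule


low = screening_cost(DATABASE_SIZE, COST_LOW)
high = screening_cost(DATABASE_SIZE, COST_HIGH)
coverage = DATABASE_SIZE / CHEMICAL_SPACE_LOW

print(f"Exhaustive lab screening: ${low:,} to ${high:,}")
print(f"Fraction of chemical space covered by the database: {coverage:.1e}")
```

Under these assumptions, testing the whole database in the lab would cost tens to hundreds of billions of dollars, and the database itself is still a vanishingly small slice of chemical space.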

When you have 72 million of something, “choosing” stops looking like classification and starts looking like “generation”. We have to generate a molecule from scratch, a promising candidate for a drug. We can stop searching for a needle in a haystack and design perfect needles instead:


![alt text](https://cdn-images-1.medium.com/max/1600/0*FBFuaOq__7M6vEb3. "Logo Title Text 1")

## History of ML in Drug Discovery

#### Recurrent Networks https://arxiv.org/pdf/1701.01329v1.pdf

![alt text](https://image.slidesharecdn.com/deeplearningbusinessmodelsvnitc-2015-09-13-150912180100-lva1-app6891/95/deep-learning-and-business-models-vnitc-20150913-31-638.jpg?cb=1442162292 "Logo Title Text 1")

- The chemical language model was trained on a SMILES file containing 1.4 million molecules from the ChEMBL database, which contains molecules and measured biological activity data.
- The authors employed a recurrent neural network with three stacked LSTM layers, each with 1024 dimensions and each followed by a dropout layer with a dropout ratio of 0.2, to regularise the network.
- To generate novel molecules, 50,000,000 SMILES symbols were sampled from the model symbol-by-symbol.
- After filtering out duplicates, they obtained 847,955 novel molecules.
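
The sample-then-filter loop can be sketched as follows. This is a toy illustration, not the paper's code: a hand-written table of next-symbol probabilities (hypothetical values) stands in for the trained LSTM, and it conditions only on the previous symbol, whereas the real model conditions on the whole prefix. The names `NEXT_SYMBOL`, `sample_smiles`, and `generate_novel` are made up for this example.

```python
import random

# Toy next-symbol probabilities standing in for the trained LSTM (hypothetical values).
# "^" marks the start of a molecule and "\n" marks its end.
NEXT_SYMBOL = {
    "^": {"C": 0.7, "N": 0.2, "O": 0.1},
    "C": {"C": 0.5, "O": 0.2, "N": 0.1, "\n": 0.2},
    "N": {"C": 0.7, "\n": 0.3},
    "O": {"C": 0.5, "\n": 0.5},
}


def sample_smiles(rng, max_len=20):
    """Draw one string symbol-by-symbol until the end token appears."""
    symbols, current = [], "^"
    while len(symbols) < max_len:
        choices = NEXT_SYMBOL[current]
        current = rng.choices(list(choices), weights=list(choices.values()))[0]
        if current == "\n":
            break
        symbols.append(current)
    return "".join(symbols)


def generate_novel(n_samples, training_set, seed=0):
    """Sample strings, drop duplicates, and keep only those not seen in training."""
    rng = random.Random(seed)
    sampled = {sample_smiles(rng) for _ in range(n_samples)}  # a set drops duplicates
    return sampled - training_set                               # keep unseen strings only


novel = generate_novel(1000, training_set={"CC", "CCO"})
print(len(novel), "novel strings, e.g.", sorted(novel)[:5])
```

Swapping the lookup table for a trained network's softmax output would leave the sampling loop itself unchanged.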

#### Convolutional Networks http://arxiv.org/abs/1510.02855v1

![alt text](https://media.springernature.com/lw785/springer-static/image/art%3A10.1186%2Fs12859-017-1702-0/MediaObjects/12859_2017_1702_Fig4_HTML.gif "Logo Title Text 1")

- AtomNet is described by its authors as the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications.
- The authors demonstrated how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions.
- AtomNet’s application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators.
- The network topology consists of an input layer, followed by multiple 3D-convolutional and fully-connected layers, and topped by a logistic-cost layer that assigns probabilities over the active and inactive classes.
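
The topology in the last bullet (3D convolution, then a fully connected layer, then class probabilities) can be sketched in plain Python. This is only a shape-level illustration under heavy assumptions: a single 2x2x2 filter of ones, a 4x4x4 toy occupancy grid, and hand-picked fully connected weights. The real AtomNet uses many learned filters over multi-channel structural input.

```python
import math


def conv3d(grid, kernel):
    """Valid 3D cross-correlation: slide a cubic kernel over a cubic grid."""
    g, k = len(grid), len(kernel)
    n = g - k + 1
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                out[x][y][z] = sum(
                    grid[x + i][y + j][z + m] * kernel[i][j][m]
                    for i in range(k) for j in range(k) for m in range(k)
                )
    return out


def softmax(logits):
    """Turn class scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# 4x4x4 occupancy grid: 1.0 where an atom sits (toy input, not real structural data).
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
grid[1][1][1] = grid[1][2][1] = grid[2][2][2] = 1.0

kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]  # one local 2x2x2 filter

# ReLU, then flatten the 3x3x3 feature map into a vector of 27 features.
features = [max(0.0, v) for plane in conv3d(grid, kernel) for row in plane for v in row]

# Hypothetical fully connected layer: one weight per feature for each class.
w_active = [0.1] * len(features)
w_inactive = [-0.1] * len(features)
logits = [
    sum(w * f for w, f in zip(w_active, features)),
    sum(w * f for w, f in zip(w_inactive, features)),
]
p_active, p_inactive = softmax(logits)
print(f"P(active)={p_active:.3f}, P(inactive)={p_inactive:.3f}")
```

Each output cell of `conv3d` only sees a small neighbourhood of the grid, which is the "feature locality" idea from the bullets above.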

#### Generative Adversarial Networks

![alt text](https://cdn-images-1.medium.com/max/1600/1*L_pqMKMNJ8J-sa0Fx0qKpw.jpeg "Logo Title Text 1")

- GANs are a popular model (introduced in 2014) that consists of 2 neural networks 'competing with each other'
- The objective of the generator is to generate new objects that are supposed to pass for “true” data points
- The discriminator has to see through the tricks played by the generator and distinguish between real data points and the ones produced by the generator.

![alt text](https://cdn-images-1.medium.com/max/1600/0*MNymJpmxULbHQKVo. "Logo Title Text 1")

- The discriminator learns to spot the generator’s counterfeit images.
- The generator learns to fool the discriminator.
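
The two competing objectives can be written as a pair of loss functions. This is a minimal sketch, assuming the discriminator outputs a probability that a sample is real and using the common non-saturating generator loss. The numbers in `d_real`, `fooled`, and `caught` are made-up discriminator outputs, not results from any trained model.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1)."""
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_real, d_fake):
    """D wants real samples scored as 1 and generated samples scored as 0."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """G wants its samples scored as 1 (the non-saturating form)."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs: probability that each sample is real.
d_real = [0.9, 0.8, 0.95]
fooled = [0.7, 0.6, 0.8]     # generator is winning
caught = [0.1, 0.05, 0.2]    # discriminator is winning

print("fooled:", discriminator_loss(d_real, fooled), generator_loss(fooled))
print("caught:", discriminator_loss(d_real, caught), generator_loss(caught))
```

When the discriminator is fooled, its loss rises and the generator's falls; when it catches the fakes, the opposite happens. Training alternates gradient steps on the two losses.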

- Conditional GANs have been used for image transformations with the explicit purpose of enhancing images.
- GANs also make sense for some other relatively “continuous” kinds of data.
- But molecules? Atomic structure is discrete, not continuous, and GANs are notoriously hard to train for discrete structures.
- Still, GANs did prove to work for generating molecules as well. 

#### Enter Adversarial Autoencoders

![alt text](http://www.inference.vc/content/images/2016/01/Screen-Shot-2016-01-08-at-14-48-25.png "Logo Title Text 1")

- Researchers presented an architecture for generating lead molecules based on a variation of the GAN idea called Adversarial Autoencoders (AAE). 
- In AAE, the idea is to learn to generate objects from their latent representations.
- Autoencoders are neural architectures that take an object as input and try to return the same object as output. 
- In the middle of the architecture, the input must pass through a layer that learns a latent representation, i.e., a set of features that encode the input in such a way that subsequent layers can decode the object back

![alt text](https://cdn-images-1.medium.com/max/1600/0*ovqbOqz_q6FERZCn. "Logo Title Text 1")

- Either the middle layer is simply smaller (has lower dimension) than the input and output, or the autoencoder uses special regularization techniques. In either case it’s impossible to simply copy the input through all layers, so the autoencoder has to extract the really important stuff.
- The researchers took a conditional adversarial autoencoder and trained it to generate fingerprints of molecules, using desired properties as conditions.

![alt text](https://cdn-images-1.medium.com/max/1600/0*KzKdBlERyNuY_RXd. "Logo Title Text 1")

- There is a discriminator that tries to distinguish the distribution of latent representations from some standard distribution
- If you can make the distribution of latent codes indistinguishable from some standard distribution, it means that you can then sample from this distribution and generate reasonable samples through the decoder
- There is also a condition that in this case encodes desired properties of the molecule; we train on the molecules with known properties, and the problem is then to generate molecules with desired (perhaps even never before seen) combinations of properties.
- A simple screening of the database can find molecules with the fingerprints most similar to generated ones.
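
The generate-then-screen step can be sketched as below. Everything here is a stand-in: the decoder uses random weights instead of a trained network, the fingerprint is only 16 bits, the database entries are invented, and Tanimoto similarity is an assumed choice of metric (the notes above only say "most similar").

```python
import math
import random

FP_BITS, LATENT_DIM, COND_DIM = 16, 4, 1
rng = random.Random(0)

# Stand-in decoder: random weights instead of a trained network (assumption).
W = [[rng.gauss(0, 1) for _ in range(LATENT_DIM + COND_DIM)] for _ in range(FP_BITS)]


def decode(z, condition):
    """Map latent code + condition to a fingerprint, stored as a set of 'on' bits."""
    inputs = z + condition
    probs = [1 / (1 + math.exp(-sum(w * v for w, v in zip(row, inputs)))) for row in W]
    return {i for i, p in enumerate(probs) if p > 0.5}


def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for fingerprints stored as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def screen(generated, database, top_k=2):
    """Rank database molecules by similarity to the generated fingerprint."""
    ranked = sorted(database, key=lambda name: tanimoto(generated, database[name]), reverse=True)
    return ranked[:top_k]


z = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]  # sample from the standard normal prior
generated = decode(z, condition=[1.0])            # condition = desired property value

# Hypothetical database of known fingerprints.
database = {
    "mol_a": {0, 1, 2, 3},
    "mol_b": {4, 5, 6, 7, 8},
    "mol_c": {0, 2, 4, 6, 8, 10, 12, 14},
}
print(screen(generated, database))
```

Because the adversarial training pushes latent codes toward the standard distribution, sampling `z` from that distribution is what lets the decoder produce plausible fingerprints without an input molecule.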

![alt text](https://cdn-images-1.medium.com/max/1600/0*wWlxq_gYBOMZsfVf. "Logo Title Text 1")

## GAN Demo time!