# Baseline Model Implementation - CS598 Deep Learning

## Drug Discovery: Variational Autoencoder Techniques for Molecule Generation

## Approach 1: Baseline Model

Team Members:

- Andrew Jacobson jonaj2@illinois.edu
- Dixon Liang dixonl2@illinois.edu
- John Judge jmjudge2@illinois.edu
- Megan Masanz mjneuman@illinois.edu

Implementation Description:

- Baseline Model
- Character Based Chemical VAE
- Aspuru-Guzik 

References include
* [Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules](https://arxiv.org/abs/1610.02415)
* https://github.com/aspuru-guzik-group/chemical_vae
* https://github.com/deepchem/deepchem
* https://github.com/molecularsets/moses (Leveraged in Notebook #3, but included here for completeness)
* https://github.com/Azure/azureml-examples/blob/main/tutorials/an-introduction/2.pytorch-model.ipynb

Requirements for notebook to run:
* Run this notebook inside an AzureML workspace (or provide configuration)
* No data is required as the training script will download the dataset

### The imports below are required to run this notebook inside an Azure ML environment, for:
- connecting to a workspace
- creating remote compute for training

In [1]:
import azureml.core  # core SDK; available by default in notebooks run on Azure ML compute
from azureml.core import Workspace  # needed for connecting to the workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Environment


ws = Workspace.from_config()

### Create the compute cluster. The Azure ML SDK looks for a compute target with the given name and creates it if it does not exist. This is standard code from the Azure ML notebook examples.

In [2]:
cluster_name = 'mmdsvm04d'
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('found existing:', compute_target.name)
    
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_NC12',
        min_nodes=0,
        max_nodes=1)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

found existing: mmdsvm04d


### Below are the required packages for running an Azure ML experiment

In [3]:

myenv = Environment('deepchem_backend2')

#already created
# create() is a static method, so it is called on the class rather than on an instance
conda_dep = CondaDependencies.create(python_version='3.7.10', conda_packages=['tensorflow-gpu==2.4.1', 'rdkit', 'openmm', 'pdbfixer'])
conda_dep.add_channel("conda-forge")
conda_dep.add_channel("omnia")
conda_dep.add_pip_package("azureml-sdk")
conda_dep.add_pip_package("deepchem")
# pin numpy to version 1.19.4
conda_dep.add_pip_package("numpy==1.19.4")
# IPython is imported by the training script
conda_dep.add_pip_package("IPython")
conda_dep.save(path="./train/condadep.yml")
myenv.python.conda_dependencies=conda_dep
myenv.docker.enabled = True
myenv.docker.base_image = DEFAULT_GPU_IMAGE
myenv.register(workspace=ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04:20210113.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": true,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "deepchem_backend2",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
         

### Get the current working directory and create a "train" subdirectory in it to hold the training script. The contents of this directory are passed to the remote machine for training.

In [4]:
import os

cwd = os.getcwd()
current_dir = cwd
print(cwd)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/mm-deepchem/code/Users/meganmasanz.work/vae_training


In [5]:
import os
script_folder = os.path.join(os.getcwd(), "train")
print(script_folder)
os.makedirs(script_folder, exist_ok = True)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/mm-deepchem/code/Users/meganmasanz.work/vae_training/train


### The cell below writes the training and model evaluation script. Changing the values in the `main` function of the script changes the training parameters.

### Parameters that can be changed in the script:
- epoch_count = 2
- learning_rate = .0001
- sampling_size = 100000: the number of molecules to attempt to generate during model evaluation
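
Rather than editing `main`, the same three values could be supplied on the command line. The sketch below is an assumption, not part of the original script: the `parse_args` helper and the flag names are hypothetical, and the defaults are copied from the values listed above.

```python
import argparse


def parse_args(argv=None):
    """Parse the three tunable training parameters (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Chemical VAE training parameters")
    parser.add_argument("--epoch_count", type=int, default=2,
                        help="number of passes over the training SMILES")
    parser.add_argument("--learning_rate", type=float, default=0.0001,
                        help="initial learning rate before decay")
    parser.add_argument("--sampling_size", type=int, default=100000,
                        help="molecules to attempt to generate during evaluation")
    return parser.parse_args(argv)


# Example: override only the epoch count and keep the other defaults.
args = parse_args(["--epoch_count", "5"])
print(args.epoch_count, args.learning_rate, args.sampling_size)
```

With Azure ML, flags like these would typically be passed through the `arguments` parameter of `ScriptRunConfig`, so that each experiment run records the values it was launched with.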

In [None]:
%%writefile $script_folder/train.py



import sys
import os
import requests
import subprocess
import shutil
import IPython
from logging import getLogger, StreamHandler, INFO
from deepchem.models.optimizers import Adam, ExponentialDecay
from deepchem.models.seqtoseq import AspuruGuzikAutoEncoder
import rdkit
import numpy as np
import deepchem
import tensorflow as tf
from azureml.core import Run
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from rdkit import DataStructs
import random

# DEFINE THE MODEL
def get_model(train_smiles, tokens, max_length, learning_rate):
    run = Run.get_context()
    batch_size = 100
    
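    # Learning rate that decreases exponentially with the number of training steps.
    # ExponentialDecay(initial_rate: float, decay_rate: float, decay_steps: int, staircase: bool = True)
    run.log('learningRate', learning_rate)
    # decay_steps must be an int; use the number of batches per epoch so the rate drops by 5% per epoch
    steps_per_epoch = max(1, len(train_smiles) // batch_size)
    learning_rate = ExponentialDecay(learning_rate, 0.95, steps_per_epoch)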
    
    model = AspuruGuzikAutoEncoder(tokens, max_length, model_dir='vae', batch_size=batch_size, learning_rate=learning_rate)
    return model

# GENERATE MOLECULES AND TEST IF THEY ARE VALID
def generate_molecules(model, zinc_data, n_molecules=5000):
    run = Run.get_context()
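    run.log('n_molecules', n_molecules)
    
    # sample latent vectors from a standard normal; 196 is the latent (embedding) size the model expects
    predictions = model.predict_from_embeddings(np.random.normal(size=(n_molecules, 196)))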
    valid = []

    #using chem from rdkit to ensure generated molecules are valid
    count = 0
    for p in predictions:
        count += 1
        smiles = ''.join(p)
        if count < 25:
            print(smiles)
        if rdkit.Chem.MolFromSmiles(smiles) is not None:
            valid.append(smiles) 

    print(len(valid) / n_molecules)
    
    run.log('valid', (len(valid) / n_molecules))
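    print('**************************')
    output_dir = './outputs/'
    os.makedirs(output_dir, exist_ok=True)  # create the output folder once, not per molecule
    for i, s in enumerate(valid, start=1):
        print(s)
        mol = rdkit.Chem.MolFromSmiles(s)
        # SMILES may contain '/' and '\' (bond stereochemistry), which would act as path separators
        safe_name = s.replace('/', '_').replace('\\', '_')
        filename = os.path.join(output_dir, 'image' + str(i) + '_' + safe_name + '.png')
        print(filename)
        rdkit.Chem.Draw.MolToFile(mol, filename)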
  
    print('***************************')
    
    
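    print('looking for winners')
    # a set gives constant-time lookups; scanning the list for every molecule is slow on large datasets
    zinc_set = set(zinc_data)
    valid_in_zinc = []
    for x in valid:
        # exact string match: generated SMILES are not canonicalized before the comparison
        if x in zinc_set:
            print('Found a winner')
            valid_in_zinc.append(x)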
            
    
    run.log('valid_in_zinc', len(valid_in_zinc))
    print(len(valid_in_zinc))
    
    if len(valid) > 0:
        print(len(valid_in_zinc)/len(valid))
        run.log('valid_in_zinc_out_of_valid', len(valid_in_zinc)/len(valid))
    else:
        run.log('valid_in_zinc_out_of_valid', 0)

    
    print(len(valid_in_zinc)/n_molecules)
    run.log('valid_in_zinc_out_of_all_generated', len(valid_in_zinc)/n_molecules)
    


def get_mol(smiles):
    mol = rdkit.Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Kekulize modifies the molecule in place and returns None, so return mol itself
    rdkit.Chem.Kekulize(mol)
    return mol
   


def generate_sequences(epochs, train_smiles): 
    run = Run.get_context()
    run.log('epochs', epochs)
    for i in range(epochs):
        print('epoch:', i+1)
        for s in train_smiles: 
            yield (s, s)
            

#deepchem has its own fit model variation
def train(model, train_smiles, epochs=1):
    model.fit_sequences(generate_sequences(epochs, train_smiles))

    
def modeltrain(epoch_count, learning_rate, sampling_size): 

    os.makedirs('data', exist_ok = True)
    run = Run.get_context()
    tasks, datasets, _ = deepchem.molnet.load_zinc15(
        featurizer='raw',
        splitter=None,
        transformers=[],
        data_dir='data',
        save_dir='data')
    print(tasks)


    data = datasets[0]
    train_smiles = []
    for X, _, _, _ in data.itersamples():
        train_smiles.append(rdkit.Chem.MolToSmiles(X))
    print(len(train_smiles))
    run.log('datasetsize', len(train_smiles))
    for smile in train_smiles[0:5]:
        print(smile)

    # DEFINE THE SMILES TOKENS AND MAX_LENGTHS
    tokens = set()
    for s in train_smiles:
        tokens = tokens.union(set(s))
    tokens = sorted(list(tokens))
    max_length = max(len(s) for s in train_smiles)

    try:
        seed = 123
        tf.random.set_seed(seed)
        device_name = tf.test.gpu_device_name()
        print('***************')
        print(device_name)
        print('***************')
        run.log('device_name', device_name)

        with tf.device(device_name):
            model = get_model(train_smiles, tokens, max_length, learning_rate)
            train(model, train_smiles, epoch_count)
            generate_molecules(model, train_smiles, sampling_size)

    except Exception as e:
        print(e)
        raise  # re-raise so the failure is not silently swallowed and the run does not appear successful



def main():
    seed = 123
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    #os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    print('in main')
    epoch_count = 2
    learning_rate = .0001
    sampling_size = 100000
    modeltrain(epoch_count, learning_rate, sampling_size)
    
if __name__ == "__main__":
    main()
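
The `get_model` function wraps the initial learning rate in an exponential decay whose `decay_steps` is the number of batches per epoch. A minimal sketch of that arithmetic in plain Python follows, assuming the staircase form noted in the script's comment (the rate changes only at multiples of `decay_steps`). The `staircase_decay` helper and the dataset size of 250,000 are illustrative assumptions, not values confirmed by the run.

```python
import math


def staircase_decay(initial_rate, decay_rate, decay_steps, step):
    """Learning rate after `step` training steps under staircase exponential decay."""
    return initial_rate * decay_rate ** math.floor(step / decay_steps)


batch_size = 100
dataset_size = 250_000                        # assumed size, for illustration only
steps_per_epoch = dataset_size // batch_size  # number of batches in one epoch

# Print the rate in effect at the start of each of the first three epochs.
for epoch in range(3):
    step = epoch * steps_per_epoch
    lr = staircase_decay(0.0001, 0.95, steps_per_epoch, step)
    print(f"epoch {epoch + 1}: step {step}, learning rate {lr:.3e}")
```

Under these assumptions the rate stays constant within an epoch and is multiplied by 0.95 at each epoch boundary, so a two-epoch run trains its second epoch at 95% of the initial rate.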

### The cell below creates an experiment named 'vae-no-teacher-forcing-run-410e' and submits the train.py file in the training folder to the compute cluster created earlier

In [10]:
from azureml.core import Experiment, ScriptRunConfig
from azureml.widgets import RunDetails

experiment = Experiment(workspace=ws, name="vae-no-teacher-forcing-run-410e")
script_config = ScriptRunConfig(source_directory=script_folder, script='train.py',
                                environment=myenv, compute_target=cluster_name)

run = experiment.submit(config=script_config)

In [6]:
### The widget below shows output while the job runs, for real-time review of results

In [11]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [9]:
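# note the parentheses: without them the cell only echoes the bound method (the recorded output below)
run.wait_for_completion(show_output=True)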

<bound method Run.wait_for_completion of Run(Experiment: vae-no-teacher-forcing-run-410e,
Id: vae-no-teacher-forcing-run-410e_1618575134_dcf82369,
Type: azureml.scriptrun,
Status: Preparing)>
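
The training script logs three ratios for the generated samples: `valid`, `valid_in_zinc_out_of_valid`, and `valid_in_zinc_out_of_all_generated`. The sketch below shows how they relate on a toy example. It is an illustration only: `summarize_generation` is a hypothetical helper, and the validity check is passed in as a function so the example runs without RDKit (the script itself uses `rdkit.Chem.MolFromSmiles`).

```python
def summarize_generation(generated, training_smiles, is_valid):
    """Return the three ratios the training script logs for a batch of samples."""
    training_set = set(training_smiles)  # fast exact-string lookup
    valid = [s for s in generated if is_valid(s)]
    in_training = [s for s in valid if s in training_set]
    n = len(generated)
    return {
        "valid": len(valid) / n,
        "valid_in_zinc_out_of_valid": len(in_training) / len(valid) if valid else 0,
        "valid_in_zinc_out_of_all_generated": len(in_training) / n,
    }


def toy_is_valid(smiles):
    """Stand-in validity rule (balanced parentheses only); not a real chemistry check."""
    return smiles.count("(") == smiles.count(")")


generated = ["CCO", "CC(C", "c1ccccc1", "CC(=O)O"]
training = ["CCO", "CC(=O)O", "CCN"]
metrics = summarize_generation(generated, training, toy_is_valid)
print(metrics)
```

Because the comparison is an exact string match, a valid molecule written with a different but equivalent SMILES string is not counted as found in the training set.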