# Description

This notebook trains an RNN (recurrent neural network) on the known universe of SMILES strings so that it learns to generate novel small molecules. We then sample from this baseline network to produce our generation 0 (gen0) candidate molecules.
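
As a rough illustration of what "training on SMILES" means, the sketch below turns one SMILES string into the (current token, next token) pairs that a next-token model learns from. The `tokenize` and `next_token_pairs` helpers and the `G`/`E` start and end markers are assumptions made for this example; the tokenizer actually used by `lstm_chem` is not shown in this notebook and may differ.

```python
# Hypothetical start/end markers; the notebook's real tokenizer is not shown.
START, END = "G", "E"

def tokenize(smiles):
    """Split a SMILES string into tokens, keeping 'Cl' and 'Br' whole."""
    tokens, i = [START], 0
    while i < len(smiles):
        two = smiles[i:i + 2]
        if two in ("Cl", "Br"):
            # Two-letter halogens are single atoms, so keep them as one token.
            tokens.append(two)
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    tokens.append(END)
    return tokens

def next_token_pairs(smiles):
    """Pair each token with the token that follows it (the training target)."""
    tokens = tokenize(smiles)
    return list(zip(tokens[:-1], tokens[1:]))

print(next_token_pairs("CCl"))
```

A real tokenizer would also need to handle bracket atoms, ring-closure labels above 9, and padding, which this sketch leaves out.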

## Train the Network

In [1]:
import tensorflow
print(tensorflow.test.is_gpu_available())
print(tensorflow.config.list_physical_devices('GPU'))

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [3]:
import numpy as np
from copy import copy

import keras

from lstm_chem.utils.config import process_config
from lstm_chem.model import LSTMChem
from lstm_chem.generator import LSTMChemGenerator
from lstm_chem.trainer import LSTMChemTrainer
from lstm_chem.data_loader import DataLoader

Using TensorFlow backend.


In [4]:
CONFIG_FILE = './config/config.json'
config = process_config(CONFIG_FILE)

In [None]:
modeler = LSTMChem(config, session='train')

In [None]:
train_dl = DataLoader(config, data_type='train')

In [None]:
valid_dl = copy(train_dl)
valid_dl.data_type = 'valid'
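
The cell above relies on `copy` making a shallow copy: the new object gets its own attributes, but anything those attributes point to is shared with the original. The sketch below shows that behaviour with a hypothetical `ToyLoader` class, which is a stand-in for illustration and not the `lstm_chem` `DataLoader`.

```python
from copy import copy

class ToyLoader:
    """Hypothetical stand-in for a data loader; not the lstm_chem DataLoader."""
    def __init__(self, smiles, data_type):
        self.smiles = smiles
        self.data_type = data_type

train = ToyLoader(["CCO", "c1ccccc1"], data_type="train")
valid = copy(train)          # shallow copy: new object, same underlying list
valid.data_type = "valid"    # rebinding an attribute affects only the copy

print(train.data_type, valid.data_type)  # train valid
print(valid.smiles is train.smiles)      # True
```

Under this assumption the validation loader reuses the data already loaded for training rather than reading the file a second time, and only the `data_type` flag differs between the two.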

In [None]:
trainer = LSTMChemTrainer(modeler, train_dl, valid_dl)

In [None]:
trainer.train()

In [None]:
# Save weights of the trained model
trainer.model.save_weights('./checkpoints/LSTM_Chem-baseline-model-full.hdf5')

## Load the Model and Generate New Molecules

In [5]:
config['model_weight_filename'] = './checkpoints/LSTM_Chem-baseline-model-full.hdf5'
print(config)

batch_size: 512
checkpoint_dir: experiments/2020-04-27/LSTM_Chem/checkpoints/
checkpoint_mode: min
checkpoint_monitor: val_loss
checkpoint_save_best_only: false
checkpoint_save_weights_only: true
checkpoint_verbose: 1
config_file: ./config/config.json
data_filename: ./datasets/all_smiles_clean.smi
data_length: 0
exp_dir: experiments/2020-04-27/LSTM_Chem
exp_name: LSTM_Chem
finetune_batch_size: 1
finetune_data_filename: ''
finetune_epochs: 5
model_arch_filename: ./config/model_arch.json
model_weight_filename: ./checkpoints/LSTM_Chem-baseline-model-full.hdf5
num_epochs: 42
optimizer: adam
sampling_temp: 0.75
seed: 71
smiles_max_length: 128
tensorboard_log_dir: experiments/2020-04-27/LSTM_Chem/logs/
tensorboard_write_graph: true
train_smi_max_len: 128
units: 256
validation_split: 0.1
verbose_training: true



In [6]:
modeler = LSTMChem(config, session='generate')
generator = LSTMChemGenerator(modeler)

Loading model architecture from ./config/model_arch.json ...
Loading model checkpoint from ./checkpoints/LSTM_Chem-baseline-model-full.hdf5 ...
Loaded the Model.



In [7]:
sample_number = 10000
sampled_smiles = generator.sample(num=sample_number)
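
The config sets `sampling_temp: 0.75`, but how `generator.sample` applies it is not shown in this notebook. A common approach is to rescale the model's next-token probabilities by the temperature before drawing a token. The sketch below illustrates that approach; the `sample_with_temperature` function is hypothetical and is not taken from `lstm_chem`.

```python
import numpy as np

def sample_with_temperature(probs, temperature, rng):
    """Draw one index from `probs` after rescaling by `temperature`.

    Returns the sampled index and the rescaled distribution.
    """
    # Work in log space; the small epsilon avoids log(0).
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()           # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()           # renormalise to a probability distribution
    return rng.choice(len(scaled), p=scaled), scaled

rng = np.random.default_rng(71)
probs = [0.6, 0.3, 0.1]
index, sharpened = sample_with_temperature(probs, 0.75, rng)
print(index, sharpened)
```

With a temperature below 1, the most likely token becomes more likely and rare tokens become rarer. That tends to raise validity at the cost of diversity.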

In [8]:
from rdkit import RDLogger, Chem, DataStructs
from rdkit.Chem import AllChem, Draw, Descriptors
from rdkit.Chem.Draw import IPythonConsole
RDLogger.DisableLog('rdApp.*')

In [9]:
valid_mols = []
for smi in sampled_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid_mols.append(mol)
# Validity: fraction of sampled SMILES that RDKit can parse
print('Validity: ', f'{len(valid_mols) / sample_number:.2%}')

valid_smiles = [Chem.MolToSmiles(mol) for mol in valid_mols]
# Uniqueness: fraction of valid canonical SMILES that are distinct
print('Uniqueness: ', f'{len(set(valid_smiles)) / len(valid_smiles):.2%}')

# Originality: fraction of unique valid SMILES that do not occur in the training data
import pandas as pd
# Note: this path differs from the config's data_filename (./datasets/all_smiles_clean.smi)
training_data = pd.read_csv('./datasets/dataset_cleansed.smi', header=None)
training_set = set(training_data[0])
original = []
for smile in valid_smiles:
    if smile not in training_set:
        original.append(smile)
print('Originality: ', f'{len(set(original)) / len(set(valid_smiles)):.2%}')

Validity:  74.32%
Uniqueness:  16.78%
Originality:  100.00%
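
The three metrics above can be gathered into one function, sketched below. The `generation_metrics` function and the toy canonicalizer are hypothetical helpers written for this example. Unlike the cell above, the sketch also canonicalizes the training strings before comparing, on the assumption that the training file may not be stored in canonical form. If it is not, a plain string comparison can overstate originality.

```python
def generation_metrics(sampled, training, canonicalize):
    """Return (validity, uniqueness, originality) as fractions.

    `canonicalize` maps a SMILES string to its canonical form,
    or returns None when the string is invalid.
    """
    canonical = [canonicalize(s) for s in sampled]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    # Canonicalize the training set too, so both sides are comparable.
    training_canonical = {canonicalize(s) for s in training} - {None}
    novel = unique - training_canonical
    validity = len(valid) / len(sampled) if sampled else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    originality = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, originality

def toy_canonicalize(smiles):
    """Toy stand-in for RDKit: '!' marks an invalid string, sorting mimics canonical order."""
    return None if "!" in smiles else "".join(sorted(smiles))

validity, uniqueness, originality = generation_metrics(
    sampled=["CO", "OC", "CC", "C!"], training=["CC"], canonicalize=toy_canonicalize
)
print(validity, uniqueness, originality)
```

In the notebook, `canonicalize` would wrap `Chem.MolFromSmiles` and `Chem.MolToSmiles`, returning None when parsing fails.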


In [12]:
with open('./generations/gen0.smi', 'w') as f:
    for item in valid_smiles:
        f.write(f"{item}\n")
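
The cell above writes every valid SMILES string, repeats included, and the reported uniqueness of 16.78% means most lines in `gen0.smi` are duplicates. If later steps expect distinct molecules, the list could be deduplicated before writing. The sketch below is one way to do that; `write_unique_smiles` is a hypothetical helper, and whether duplicates should be kept is the author's choice.

```python
import os
import tempfile

def write_unique_smiles(smiles, path):
    """Write each distinct SMILES once, keeping first-seen order; return the count."""
    unique = list(dict.fromkeys(smiles))  # dicts preserve insertion order
    with open(path, "w") as f:
        for smi in unique:
            f.write(f"{smi}\n")
    return len(unique)

# Demonstration with a temporary file instead of ./generations/gen0.smi
out_path = os.path.join(tempfile.mkdtemp(), "gen0_unique.smi")
count = write_unique_smiles(["CCO", "CCO", "c1ccccc1"], out_path)
print(count)  # 2
```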