
Install and import the required packages.

In [None]:
!pip install keras-tuner


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Flatten, Dense, Dropout, Softmax
from tensorflow.keras.callbacks import EarlyStopping
import keras_tuner as kt
from pandas import read_csv, concat, DataFrame
import numpy as np
from math import floor, ceil, sqrt
from statistics import mean
from tensorflow.keras.metrics import FalseNegatives, FalsePositives, TrueNegatives, TruePositives, Accuracy

In [None]:
from google.colab import files

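Function to balance imbalanced data by oversampling the minority class. The minority class is replicated as many times as possible without it outnumbering the majority class. Data in which the majority class is less than twice the size of the minority class is treated as sufficiently balanced and returned unchanged.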

In [None]:
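# oversample the minority class (inhibitors) in imbalanced data to give a more balanced dataset
# input_x is a DataFrame; input_y is a pandas Series named 'classification' (1 = inhibitor, 0 = non-inhibitor)
# that shares its index with input_x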
def data_balancer(input_x, input_y):
    num_inhibitors = sum(input_y)
    num_non_inhibitors = len(input_y) - num_inhibitors
    if num_non_inhibitors/num_inhibitors < 2:  # data sufficiently balanced
        x_out = input_x
        y_out = input_y
    else:   # data imbalanced
        # x = DataFrame(input_x)
        y = input_y.to_frame()
        data = concat([input_x, y], axis=1)
        times_to_replicate = floor(num_non_inhibitors/num_inhibitors) - 1
        inhibitors = data[data['classification'] == 1]
        inhibitors_replicated = concat([inhibitors]*times_to_replicate, ignore_index=True)
        data_balanced = concat([data, inhibitors_replicated], ignore_index=True)
        x_out = data_balanced.drop(['classification'], axis=1)
        y_out = data_balanced['classification']
    return x_out, y_out
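
The replication arithmetic can be checked on small counts. The sketch below uses a hypothetical helper, `replication_plan`, which is not part of the notebook. It mirrors the `floor(ratio) - 1` rule in `data_balancer` and assumes at least one inhibitor is present.

```python
from math import floor


def replication_plan(num_non_inhibitors, num_inhibitors):
    """Return (extra_copies, inhibitors_after) under the data_balancer rule.

    Hypothetical helper for illustration; assumes num_inhibitors > 0.
    """
    ratio = num_non_inhibitors / num_inhibitors
    if ratio < 2:
        # Already balanced enough: nothing is replicated.
        return 0, num_inhibitors
    extra_copies = floor(ratio) - 1
    # The original rows are kept, so the final count is
    # (extra_copies + 1) times the original count.
    return extra_copies, num_inhibitors * (extra_copies + 1)


# 10 non-inhibitors and 3 inhibitors: ratio 3.33, so 2 extra copies are appended.
print(replication_plan(10, 3))  # (2, 9)
# 5 non-inhibitors and 3 inhibitors: ratio 1.67, so the data is left unchanged.
print(replication_plan(5, 3))   # (0, 3)
```

In the first case the inhibitors grow from 3 to 9 rows, which is still fewer than the 10 non-inhibitors.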


Function to reshape 1D data (e.g. a 1D numpy array) into the smallest square 2D array that can hold every element. Values are filled in row by row, and any unused cells are left as zeros.

In [None]:
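def make_one_dim_array_square(one_d):
    # convert to a numpy array first so that indexing is positional
    # even when a pandas Series with non-integer labels is passed in
    values = np.asarray(one_d)
    length = len(values)
    dim = ceil(sqrt(length))
    square = np.zeros(shape=(dim, dim))
    # fill row by row; cells past the end of the data stay zero
    square.flat[:length] = values
    return square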
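
The index arithmetic is easier to see on a tiny input. The sketch below is a pure-Python stand-in that uses nested lists instead of numpy; the name `square_layout` is hypothetical and only serves this illustration.

```python
from math import ceil, sqrt


def square_layout(values):
    """Lay out a flat sequence row by row in the smallest square grid.

    Pure-Python stand-in for the numpy version; unused cells are zero.
    """
    dim = ceil(sqrt(len(values)))
    grid = [[0] * dim for _ in range(dim)]
    for i, value in enumerate(values):
        row_idx, col_idx = divmod(i, dim)  # same as floor(i / dim) and i % dim
        grid[row_idx][col_idx] = value
    return grid


# Five values need a 3 x 3 grid, because a 2 x 2 grid holds only four.
for row in square_layout([1, 2, 3, 4, 5]):
    print(row)
# [1, 2, 3]
# [4, 5, 0]
# [0, 0, 0]
```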

Function that uses the above function to convert data in a dataframe (a sequence of rows containing data) into a collection of 2D arrays ('images') - each 'image' corresponding to a row of data.

In [None]:
def data_df_to_images(df_input):
    df_input_list = [df_input.loc[i] for i in df_input.index]
    output_images = np.array([make_one_dim_array_square(series) for series in df_input_list])
    return output_images

Function to train a number of neural networks independently and return an output assembled from all of them. The number of neural networks is set by num_nets.

The function returns a list of four items: a num_nets-long list of keras.callbacks.History objects; a DataFrame with the final-epoch training and validation loss and accuracy of each network (one row per network); an array with the pooled predictions for the test data; and an array with the pooled predictions for the prediction data.

Predictions are pooled by majority vote: the classes predicted by the individual networks are averaged across the forest and the average is rounded. np.round rounds halves to the nearest even number, so an exact tie gives class 0.

The definition of y_train_forest depends on the metrics used. If only accuracy is used, y_train_forest = data_input[1] is sufficient. If true and false positives and negatives are used, the alternative definition must be used.

In [None]:
def neural_forest(num_nets, data_input, callback_forest, train_frac_forest=0.8, val_frac_forest=0.1,
                   num_epochs=50, shuffle_bool=False):
    assert len(data_input) == 4, 'Number of data inputs should be 4 - x_train, y_train, x_test and x_predict'
    forest_train_val_stats = DataFrame()
    forest_test_votes = DataFrame()
    forest_predict_votes = DataFrame()
    histories = []
    x_train_forest = data_input[0]
    y_train_forest = data_input[1]
    for i in range(num_nets):
        print(i + 1)
        model = build_tuned_model()
        history = model.fit(x_train_forest, y_train_forest, epochs=num_epochs,
                            validation_split=(val_frac_forest / (train_frac_forest + val_frac_forest)),
                            shuffle=shuffle_bool, callbacks=[callback_forest])
        histories.append(history)

        # get train-val stats
        history_df = DataFrame(history.history)
        train_val_stats = history_df.iloc[[-1]]
        forest_train_val_stats = concat([forest_train_val_stats, train_val_stats])

        prediction_model = tf.keras.Sequential([model, Softmax()])
        # get test predictions
        x_test_forest = data_input[2]
        test_prediction_probabilities = prediction_model.predict(x_test_forest)
        test_prediction = np.argmax(test_prediction_probabilities, axis=1)
        forest_test_votes = concat([forest_test_votes, DataFrame([test_prediction])])

        # get UKB predictions
        x_predict_forest = data_input[3]
        prediction_probabilities = prediction_model.predict(x_predict_forest)
        prediction = np.argmax(prediction_probabilities, axis=1)
        forest_predict_votes = concat([forest_predict_votes, DataFrame([prediction])])

    forest_test_consensus = forest_test_votes.mean()
    forest_test_consensus_out = np.round(np.array(forest_test_consensus))
    forest_predict_consensus = forest_predict_votes.mean()
    forest_predict_consensus_out = np.round(np.array(forest_predict_consensus))

    return [histories, forest_train_val_stats, forest_test_consensus_out, forest_predict_consensus_out]
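
The pooling step at the end of `neural_forest` can be illustrated without training anything. The sketch below is a pure-Python approximation with made-up votes; `forest_consensus` is a hypothetical name, and Python's built-in `round` stands in for `np.round` (both round halves to the nearest even number).

```python
from statistics import mean


def forest_consensus(votes):
    """Pool per-network class votes (0 or 1) into one prediction per sample.

    `votes` holds one list per network, each with one vote per sample.
    """
    per_sample = zip(*votes)  # regroup the votes by sample
    # round() rounds halves to the nearest even number, as np.round does,
    # so an exact tie (mean 0.5) becomes class 0.
    return [round(mean(sample_votes)) for sample_votes in per_sample]


votes = [
    [1, 0, 1, 0],  # network 1
    [1, 0, 0, 1],  # network 2
    [1, 1, 0, 0],  # network 3
    [0, 0, 0, 1],  # network 4
]
print(forest_consensus(votes))  # [1, 0, 0, 0]
```

The last sample receives two votes for each class, and the tie resolves to 0. With an even number of networks, such as the 100 used in this notebook, ties of this kind are possible.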

Function to get the performance metrics of a fitted model. It takes the true values and the model's predictions, and returns the accuracy, sensitivity, specificity, balanced accuracy and F1 score.

In [None]:
def performance_metrics(y_true, y_pred):
    accuracy = Accuracy()
    accuracy.update_state(y_true, y_pred)
    accuracy_val = accuracy.result().numpy()
    fn = FalseNegatives()
    fn.update_state(y_true, y_pred)
    fn_val = fn.result().numpy()
    fp = FalsePositives()
    fp.update_state(y_true, y_pred)
    fp_val = fp.result().numpy()
    tn = TrueNegatives()
    tn.update_state(y_true, y_pred)
    tn_val = tn.result().numpy()
    tp = TruePositives()
    tp.update_state(y_true, y_pred)
    tp_val = tp.result().numpy()
    sensitivity = tp_val / (tp_val + fn_val)
    specificity = tn_val / (tn_val + fp_val)
    balanced_accuracy = mean([sensitivity, specificity])
    precision = tp_val / (tp_val + fp_val)
    f1 = 2 * (sensitivity * precision)/(sensitivity + precision)
    return {'accuracy': accuracy_val, 'sensitivity': sensitivity, 'specificity': specificity, 
            'balanced_accuracy': balanced_accuracy, 'f1': f1}
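
As a cross-check on the formulas, the same metrics can be computed by hand from the confusion counts. The sketch below is pure Python with made-up labels; `summary_metrics` is a hypothetical name, and it assumes binary labels with at least one positive and one negative in both the true values and the predictions (otherwise a denominator is zero).

```python
def summary_metrics(y_true, y_pred):
    """Compute the same five metrics from binary labels (1 = inhibitor)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sensitivity = tp / (tp + fn)  # fraction of true inhibitors found
    specificity = tn / (tn + fp)  # fraction of non-inhibitors rejected
    precision = tp / (tp + fp)    # fraction of predicted inhibitors that are real
    return {
        'accuracy': (tp + tn) / len(pairs),
        'sensitivity': sensitivity,
        'specificity': specificity,
        'balanced_accuracy': (sensitivity + specificity) / 2,
        'f1': 2 * sensitivity * precision / (sensitivity + precision),
    }


# Made-up labels: 3 inhibitors and 5 non-inhibitors.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
stats = summary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in stats.items()})
# {'accuracy': 0.75, 'sensitivity': 0.667, 'specificity': 0.8, 'balanced_accuracy': 0.733, 'f1': 0.667}
```

Here tp = 2, fn = 1, fp = 1 and tn = 4. Balanced accuracy (0.733) is lower than plain accuracy (0.75) because it weights the smaller inhibitor class equally.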


In [None]:
train_frac = 0.60
val_frac = 0.20
test_frac = 0.20

callback = EarlyStopping(monitor='val_loss', patience=10)

df = read_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/happyhour_inhibitor_name_class_fingerprints.csv')

df_train = df.sample(frac=train_frac+val_frac, random_state=42)
df_test = df.drop(df_train.index)

x_train_df = df_train.drop(['molecule_chembl_id', 'classification'], axis=1)
y_train = df_train['classification']
x_train_df, y_train = data_balancer(x_train_df, y_train)

x_train = data_df_to_images(x_train_df)
y_train = np.array(y_train)
x_test = data_df_to_images(df_test.drop(['molecule_chembl_id', 'classification'], axis=1))
y_test = np.array(df_test['classification'])
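
The three fractions above interact with the `validation_split` that `neural_forest` passes to `model.fit`. The sketch below only reproduces that arithmetic with the values from this cell; it does not touch the data.

```python
train_frac, val_frac, test_frac = 0.60, 0.20, 0.20

# Fraction of all rows sampled into df_train (training plus validation).
fit_frac = train_frac + val_frac
# Share of the rows given to model.fit that Keras holds out for validation.
validation_split = val_frac / (train_frac + val_frac)

print(round(fit_frac, 2))                     # 0.8
print(round(validation_split, 2))             # 0.25
print(round(fit_frac * validation_split, 2))  # 0.2
```

These fractions describe the rows before oversampling. Once `data_balancer` has added replicated inhibitors, the 0.25 applies to the enlarged training set. Keras takes the validation rows from the end of the arrays passed to `fit`, before any shuffling, so the order of the rows in `x_train` determines which samples are used for validation.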

Build the tuned model, with accuracy as the only metric. Outputs from this model are named dnn_tuned*x*.

In [None]:
def build_tuned_model():
    best_hps_df = read_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/hppy_dnn_best_hps_60_20_20_run4.csv')
    model = tf.keras.Sequential([
        Flatten(input_shape=(30, 30)),
        Dense(160, activation='relu'),
        Dropout(rate=0.25),
        Dense(2)])
    learning_rate = best_hps_df['lr'][0]
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model
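
The final `Dense(2)` layer outputs raw scores (logits), which is why the loss is built with `from_logits=True` and why `neural_forest` appends a `Softmax` layer before predicting. The sketch below shows that conversion in pure Python on made-up logits; `softmax` and `predicted_class` are hypothetical helpers, not Keras functions.

```python
from math import exp


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(logits):
    """Return the index of the most probable class."""
    probabilities = softmax(logits)
    return probabilities.index(max(probabilities))


logits = [-1.2, 0.7]  # illustrative values only, not model output
probabilities = softmax(logits)
print([round(p, 3) for p in probabilities])
print(predicted_class(logits))  # 1
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether the argmax is taken over the logits or over the probabilities.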

In [None]:
ukb_drugs_descriptor = read_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/drug_ukb_name_fingerprints.csv')
ukb_drugs_fingerprints = ukb_drugs_descriptor.drop(['Name'], axis=1)
ukb_drugs_notna = read_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/drug_ukb_cleaned.csv')

ukb_drugs_images = data_df_to_images(ukb_drugs_fingerprints)

In [None]:
forest = neural_forest(100, [x_train, y_train, x_test, ukb_drugs_images],
                       callback_forest=callback,
                       train_frac_forest=train_frac,
                       val_frac_forest=val_frac,
                       num_epochs=500)


Unpack the neural_forest output. 

In [None]:
forest_histories = forest[0]
forest_train_val = forest[1]
forest_test = forest[2]
forest_predict = forest[3]

Save outputs. 

In [None]:
for i, history in enumerate(forest_histories):
    history_df = DataFrame(history.history)
    history_df.to_csv(f'/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run2/dnn_tuned3_run2_training_val_history_{i}.csv', index=False)

In [None]:
forest_train_val.to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run2/dnn_tuned3_run2_train_val_stats.csv', index=False)   

In [None]:
stats_testing = performance_metrics(y_true=y_test, y_pred=forest_test)
DataFrame([stats_testing]).to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run2/dnn_tuned3_run2_test_stats.csv', index=False)

In [None]:
ukb_drugs_dnn_classed = concat([ukb_drugs_notna.drop(['Drug', 'Drug_curated', 'smiles'], axis=1),
                              DataFrame(np.vstack(forest_predict), columns=['predicted_classification'])], axis=1)
dnn_active_ukb_drugs = ukb_drugs_dnn_classed[ukb_drugs_dnn_classed['predicted_classification'] == 1]
dnn_active_ukb_drugs.to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run2/dnn_tuned3_run2_active_ukb_drugs.csv', index=False)


Run the forest again and save the outputs as tuned3_run3.

In [None]:
forest = neural_forest(100, [x_train, y_train, x_test, ukb_drugs_images],
                       callback_forest=callback,
                       train_frac_forest=train_frac,
                       val_frac_forest=val_frac,
                       num_epochs=500)
forest_histories = forest[0]
forest_train_val = forest[1]
forest_test = forest[2]
forest_predict = forest[3]

for i, history in enumerate(forest_histories):
    history_df = DataFrame(history.history)
    history_df.to_csv(f'/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run3/dnn_tuned3_run3_training_val_history_{i}.csv', index=False)
  
forest_train_val.to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run3/dnn_tuned3_run3_train_val_stats.csv', index=False) 

stats_testing = performance_metrics(y_true=y_test, y_pred=forest_test)
DataFrame([stats_testing]).to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run3/dnn_tuned3_run3_test_stats.csv', index=False)

ukb_drugs_dnn_classed = concat([ukb_drugs_notna.drop(['Drug', 'Drug_curated', 'smiles'], axis=1),
                              DataFrame(np.vstack(forest_predict), columns=['predicted_classification'])], axis=1)
dnn_active_ukb_drugs = ukb_drugs_dnn_classed[ukb_drugs_dnn_classed['predicted_classification'] == 1]
dnn_active_ukb_drugs.to_csv('/content/drive/MyDrive/partIII_sysbiol2021/happyhour_inhibition_ml/class_2/dnn_tuned3/run3/dnn_tuned3_run3_active_ukb_drugs.csv', index=False)
