Link to GitHub repo: https://github.com/milanpatel1022/dlh-project

# Introduction
***
## Background
    - Paper: CNN-DDI
    - I built off of the DDIMDL repo code (which was specified in the paper selection list) to follow this paper

### Type of problem
    - Predicting events associated with Drug-Drug Interactions (DDIs)
    - DDIs are the reactions between two or more drugs
### Importance
    - Unexpected DDIs can cause serious and unforeseen health issues
    - Multiple drug consumption is becoming increasingly common
    - Studying DDIs is important for drug discovery and development, and also for improving the safety of patients who take multiple drugs
### Difficulty
    - In the past, DDIs were solely discovered through wet experiments
    - This method is labor-intensive and time-consuming
### Methods and effectiveness
    - In recent years, databases containing large amounts of drug data have been constructed using information from literature and reports
    - This has allowed for ML methods to be applied to this task of predicting DDI associated events, thereby reducing efforts, time, and cost
### Paper Proposal
    - This paper proposes a novel CNN architecture as a supervised learning approach to predicting DDIs
### Paper Innovations
    - This paper's innovations are the novel CNN architecture and the use of a drug feature that earlier papers did not use (drug category) in training/prediction
### Paper Metrics
    - Outperforms all ML methods that came before it, including DDIMDL (DDIMDL was trained on different features)
    - ACC      AUPR     AUC       F1       Precision    Recall
      0.8871   0.9251   0.9980    0.7496   0.8556       0.7220
### Paper Contribution to Research
    - This paper's contribution is identifying that the CNN architecture, combined with the drug category feature, can improve DDI predictions

# Scope of Reproducibility
***
## Hypothesis 1
    - Drugs with similar features will interact similarly
    - Ex: If Drug A and B interact with each other and have a biological impact
        - If Drug C is similar to A, it will likely interact with B in the same manner A does
    - CNNs will capture patterns in the feature similarity scores of drugs and accurately predict interactions accordingly

# Methodology
***
## IMPORTANT NOTE
    - All data extraction, data processing, model, evaluation code is present in this notebook. main() will begin running the entire process.
        - Each cell/section of code is labeled with its high-level purpose
        - I understand the code is out of order of what the template specifies, but nonetheless, the implementations are all there
        - I have added highly detailed comments throughout each and every part of the code for your understanding as well.
    - Most of this code comes from the DDIMDL paper
    - I have made several changes to remove bugs, unnecessary code, add detailed explanations, and to include the CNN architecture specified in CNN-DDI

## Data
### Source
    - DDIMDL repo (within the event.db file) -> ultimately sourced from DrugBank
    - Drug table: contains 572 drugs and their features (id, target, enzyme, pathway, smile, name)
    - Event table: 37264 DDIs between 572 drugs (id1, name1, id2, name2, interaction)
    - Extraction table: each interaction transformed to tuple {mechanism, action, drugA, drugB} -> done using NLPProcess code from DDIMDL
    - Event_number table: interaction/event mapped to its frequency
### Data Process
    - 4 (572x572) similarity matrices created and data split into training/test sets using 5-fold cross validation
    - Why 4 matrices? 4 features to consider
    - Why 572 x 572? Because 572 drugs
    - In each feature matrix, each drug is represented by a binary vector
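
As a small illustration of the similarity step, the sketch below builds a Jaccard similarity matrix from binary descriptor vectors using the same intersection-over-union idea as the notebook's `Jaccard` helper. The three toy drugs and four descriptors are invented for the example, and `jaccard_matrix` is an illustrative name; the real matrices are 572 x 572.

```python
import numpy as np

def jaccard_matrix(binary):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    binary = np.asarray(binary, dtype=float)
    intersection = binary @ binary.T              # shared descriptors for every pair of rows
    row_sums = binary.sum(axis=1)                 # number of descriptors per drug
    union = row_sums[:, None] + row_sums[None, :] - intersection
    return intersection / union                   # assumes no drug has an all-zero vector

# Toy data (hypothetical): 3 drugs described by 4 binary descriptors
toy = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])
sim = jaccard_matrix(toy)
print(np.round(sim, 3))
```

Drugs 0 and 2 share every descriptor, so their similarity is 1; drugs 0 and 1 share one descriptor out of three distinct ones, giving 1/3.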
## Model Architecture
- I still need to implement the residual block/connection described in the paper (part of next steps)
### Layers
    - 5 convolutional layers followed by 2 fully connected layers
    - 2 max pooling layers in between
### Activation functions
    - ReLU activation used after each convolutional layer
    - Softmax used in output layer
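
To make the layer stack concrete, the arithmetic below traces the sequence length through the five convolutions and two pooling layers. It assumes the settings used in this notebook's `CNN()` function: an input of length 1144 (two concatenated 572-dimensional vectors), kernel size 3 with stride 1 and no padding, and pool size 2. The helper names are illustrative only.

```python
def conv1d_len(length, kernel_size=3):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def pool1d_len(length, pool_size=2):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size

# (layer type, number of filters) in the order used by the notebook's CNN()
layers = [("conv", 64), ("conv", 128), ("pool", None),
          ("conv", 128), ("conv", 128), ("pool", None),
          ("conv", 256)]

length, channels = 1144, 1
for kind, filters in layers:
    if kind == "conv":
        length, channels = conv1d_len(length), filters
    else:
        length = pool1d_len(length)
    print(f"{kind:4s} -> length {length}, channels {channels}")

flattened = length * channels  # size of the vector fed to the first dense layer
print("flattened size:", flattened)
```

Under these assumptions the flattened vector has 281 x 256 = 71,936 entries, so the first fully connected layer holds most of the model's parameters.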
## Training Objectives
### Loss Function
    - Categorical Cross Entropy (aka Softmax Loss)
### Optimizer
    - Adam
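
For reference, categorical cross-entropy for a single sample is the negative log of the softmax probability assigned to the true class. The sketch below computes it with the standard library for made-up raw scores over three classes (the notebook's model has 65); the function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """Loss for one sample: -log(probability of the true class)."""
    return -math.log(probs[true_class])

logits = [2.0, 0.5, -1.0]   # hypothetical raw outputs for 3 classes
probs = softmax(logits)
loss_correct = categorical_cross_entropy(probs, true_class=0)  # true class has the top score
loss_wrong = categorical_cross_entropy(probs, true_class=2)    # true class has the lowest score
print(probs, loss_correct, loss_wrong)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, which is what the optimizer minimizes during training.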

# Results
***
## Note
    - Results should be taken with a grain of salt at this time. I was not able to get the drug category feature into my dataset
    - The results are from only training using 3 features (smile, target, and enzyme) while the paper used 4 features (pathway, target, enzyme, category)
## Stored in 2 Files
    - First CSV: shows 11 performance metrics describing the overall classification performance (11 rows, 1 column)
    - Second CSV: shows 6 performance metrics describing the per-event classification performance (65 rows, 6 columns)
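
The overall CSV reports both micro- and macro-averaged versions of most metrics. Micro-averaging pools every prediction, while macro-averaging gives each event class equal weight, so the two can differ noticeably when some events are much rarer than others. The pure-Python sketch below contrasts them for recall on invented labels; it is a simplified illustration, not the notebook's `evaluate` function.

```python
def recall_micro_macro(y_true, y_pred):
    """Return (micro, macro) recall for single-label multiclass predictions."""
    classes = sorted(set(y_true))
    per_class = []
    total_hits = 0
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        per_class.append(hits / support)
        total_hits += hits
    micro = total_hits / len(y_true)          # pooled over all samples
    macro = sum(per_class) / len(per_class)   # unweighted mean over classes
    return micro, macro

# Hypothetical labels: class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
micro, macro = recall_micro_macro(y_true, y_pred)
print(micro, macro)
```

Here the single mistake on the rare class barely moves the micro average but pulls the macro average down, which is one reason a macro score can sit well below accuracy.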
## Figures
| Model              | ACC   | AUPR  | AUC   | F1    | Precision | Recall |
|--------------------|-------|-------|-------|-------|-----------|--------|
| CNN-DDI (Paper)    | 0.8871| 0.9251| 0.9980| 0.7496| 0.8556    | 0.7220 |
| My Implementation  | 0.8633| 0.9207| 0.9975| 0.7350| 0.8429    | 0.6753 |

# Discussion
***
Training epochs were reduced from 10 to 3 for this demo; three epochs are enough to show that the model is training/learning properly
## Reproducibility
    - It's been somewhat reproducible so far. 
    - This paper has no code, so instead, the code from DDIMDL was used as the foundation
    - The DDIMDL repo has already acquired the data and placed it into a database file
    - CNN-DDI says it uses data acquired from DDIMDL
    - However, CNN-DDI uses a new drug feature that is not present in the Drug table in the database
    - Neither paper explained the data acquisition method, only where the data came from
    - I have been unable to acquire that extra feature thus far.
    
## What was easy/difficult?
    - It was difficult to understand the DDIMDL code well enough to debug it and modify it to work with a CNN architecture instead of the DNN it was using
## Suggestions to authors
    - Explain how the data was acquired
    - Provide some code
## What I will do in next phase
    - Figure out how to get that drug category feature
    - Add residual block/connection to CNN architecture

# References
***
### [1] Zhang, C., Lu, Y. & Zang, T. CNN-DDI: a learning-based method for predicting drug–drug interactions using convolution neural networks. BMC Bioinformatics 23 (Suppl 1), 88 (2022). https://doi.org/10.1186/s12859-022-04612-2
### [2] Yifan Deng, Xinran Xu, Yang Qiu, Jingbo Xia, Wen Zhang, Shichao Liu, A multimodal deep learning framework for predicting drug–drug interaction events, Bioinformatics, Volume 36, Issue 15, August 2020, Pages 4316–4322, https://doi.org/10.1093/bioinformatics/btaa501


In [1]:
from numpy.random import seed
seed(1)

import csv
import os
import pickle
import sqlite3
import time

import numpy as np
import pandas as pd
from pandas import DataFrame

from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import label_binarize

from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense, Input, Activation, Conv1D, MaxPooling1D, Flatten
from keras.callbacks import EarlyStopping
from tensorflow.keras.models import model_from_json

In [2]:
event_num = 65
droprate = 0.3
vector_size = 572

In [3]:
'''MODEL ARCHITECTURE AND CODE'''

def CNN():
    #TO-DO: still need to add residual connection
    
    # Define input layer (note: batch_size doesn't need to be included in shape but it is implied, so input shape is: [batch_size, 1144, 1])
    train_input = Input(shape=(vector_size * 2, 1), name='Inputlayer')

    # Define model architecture (followed CNN-DDI paper, which mentioned the 5 conv layers and 2 dense/fc layers)
    x = Conv1D(64, kernel_size=3, activation='relu')(train_input) #64 output filters
    x = Conv1D(128, kernel_size=3, activation='relu')(x) #128 output filters
   
    x = MaxPooling1D(pool_size=2)(x) #reduce dimensionality & extract important features
    
    x = Conv1D(128, kernel_size=3, activation='relu')(x) #128 output filters
    x = Conv1D(128, kernel_size=3, activation='relu')(x) #128 output filters
    
    x = MaxPooling1D(pool_size=2)(x) #reduce dimensionality & extract important features
    
    x = Conv1D(256, kernel_size=3, activation='relu')(x) #256 output filters
    
    x = Flatten()(x) #flatten feature vector to prepare for dense/fc layers
    
    x = Dense(256, activation='relu')(x)
    output = Dense(event_num, activation='softmax')(x) #apply softmax to get probability distribution over the 65 output classes (event_num)

    # Define the model (specify the input and output layers)
    model = Model(inputs=train_input, outputs=output)

    # Compile the model (optimizer, loss function, and evaluation metric)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    return model

In [4]:
'''DATA PROCESSING PART 1'''

'''This function is called separately for each of the 4 features'''

def prepare(df_drug, feature_list, vector_size, mechanism, action, drugA, drugB):
    
    #dicts to store labels and features
    d_label = {}
    d_feature = {}

    #store events as strings (mechanism + action = event)
    d_event=[] #length: 37264 (that many events have occurred)
    for i in range(len(mechanism)):
        d_event.append(mechanism[i]+" "+action[i])

    
    #all 37264 events end up being classified into 65 unique events (and our models will predict two drugs to have one of these 65 types of events)
    count={} #65 events mapped to their individual occurrence count
    for i in d_event:
        if i in count:
            count[i]+=1
        else:
            count[i]=1

    #sort events by count in descending order
    list1 = sorted(count.items(), key=lambda x: x[1],reverse=True)

    #each event is assigned a label (index in this case) where label 0 is the most frequently occurring event and label 64 is the least
    for i in range(len(list1)):
        d_label[list1[i][0]]=i

    #initialize empty vector for each of the 572 drugs
    vector = np.zeros((len(np.array(df_drug['name']).tolist()), 0), dtype=float) #shape: (572, 0)... we set up all the rows in our feature matrix
    
    for i in feature_list:
        #horizontally stack the feature vectors into our vector object
        vector = np.hstack((vector, feature_vector(i, df_drug, vector_size)))

    #vector is now shape: (572, 572) -> it is our similarity matrix
        
    #Map each drug name to its corresponding feature vector from the vector calculated above
    for i in range(len(np.array(df_drug['name']).tolist())):
        d_feature[np.array(df_drug['name']).tolist()[i]] = vector[i]
    
    new_feature = [] #store new feature vectors
    new_label = [] #store new labels


    #iterate over length of events (length is 37264)
    for i in range(len(d_event)):
        #horizontally concatenate feature vectors of drugA + drugB (note: we are doing this for each event)
        new_feature.append(np.hstack((d_feature[drugA[i]], d_feature[drugB[i]])))

        #event i is our key into d_label, which tells us which one of the 65 labels it is assigned
        new_label.append(d_label[d_event[i]])

    #shape: (37264, 1144)... 37264 = number of events between 572 drugs; 1144 = (572x2) since we are hstacking feature vectors of two drugs
    new_feature = np.array(new_feature) 

    #shape: (37264, )... label for each event
    new_label = np.array(new_label)

    return (new_feature, new_label, event_num)
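
The label assignment inside `prepare` can be illustrated in isolation: events are counted, sorted by descending frequency, and numbered so that label 0 is the most common event. The sketch below uses `collections.Counter` as a compact stand-in for the manual counting loop; the event strings are made up for the example.

```python
from collections import Counter

# Hypothetical "mechanism action" strings for six interactions
events = ["metabolism decrease", "risk increase", "metabolism decrease",
          "serum concentration increase", "metabolism decrease", "risk increase"]

counts = Counter(events)
# most_common() sorts by descending count, so label 0 is the most frequent event
label_of = {event: idx for idx, (event, _) in enumerate(counts.most_common())}
labels = [label_of[e] for e in events]
print(label_of)
print(labels)
```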

In [5]:
'''DATA PROCESSING PART 2'''

'''
This function is called separately for each of the 4 features
It will generate the respective feature vectors for each drug
'''

def feature_vector(feature_name, df, vector_size):
    # df are the 572 kinds of drugs
    
    # Jaccard Similarity: takes in matrix that represents the feature vectors of drugs
    #we want to calculate the similarity between the feature vectors of the drugs
    def Jaccard(matrix):
        matrix = np.asarray(matrix, dtype=float) #use a plain ndarray instead of the legacy np.matrix type
        
        numerator = matrix @ matrix.T #size of the intersection for every pair of drugs (matrix product with the transpose of itself)
        denominator = np.ones(np.shape(matrix)) @ matrix.T + matrix @ np.ones(np.shape(matrix.T)) - matrix @ matrix.T #size of the union for every pair

        #return the Jaccard similarity matrix (element-wise intersection / union)
        return numerator / denominator

    '''
    Each of the 4 features is defined by a set of descriptors
        - Ex) Smile feature has 881 descriptors (so the drug is represented by a length 881 binary vector in terms of the smile feature)
        - Ex) Target feature has 1162 descriptors (so the drug is represented by a length 1162 binary vector in terms of the target feature)
    '''

    all_descriptors = [] #stores all the unique descriptors of this feature

    #drug_list=["P30556|P05412","P28223|P46098|……"]
    drug_list = np.array(df[feature_name]).tolist() #list of length 572 (for each drug, we list its descriptors)

    #For each of the 572 drugs 
    for i in drug_list:
        #Identify its descriptors
        for each_descriptor in i.split('|'):
            #Add any unseen descriptors to all_descriptors (in this fashion, we obtain all the descriptors for a feature)
            if each_descriptor not in all_descriptors:
                all_descriptors.append(each_descriptor)

    #shape: (572, len(all_descriptors))
    feature_matrix = np.zeros((len(drug_list), len(all_descriptors)), dtype=float)

    #dataframe where rows are 572 drugs and columns are all_descriptors (columns can be indexed by descriptor name)
    df_feature = DataFrame(feature_matrix, columns=all_descriptors)

    #For each drug's feature vector -> mark cell as 1 if it has that corresponding descriptor present (else it will remain 0)
    for i in range(len(drug_list)):
        for each_descriptor in df[feature_name].iloc[i].split('|'):
            df_feature.at[i, each_descriptor] = 1 #single-step assignment avoids pandas chained-assignment problems

    #Calculate Jaccard Similarity matrix (shape: (572, 572) where each element [i, j] represents similarity between drug i and drug j)
    sim_matrix = Jaccard(np.array(df_feature))

    #Apply PCA to reduce noise/redundancy in data while keeping dimensionality the same
    pca = PCA(n_components=vector_size)
    pca.fit(np.asarray(sim_matrix))
    sim_matrix = pca.transform(np.asarray(sim_matrix))

    #shape: (572, 572)
    return sim_matrix

In [6]:
'''DATA PROCESSING PART 3'''

'''
This function assigns fold numbers to each of the 37264 events for cross-validation
It is basically a helper to the cross_validation function of ours
'''

def get_index(label_matrix, event_num, seed, CV):

    #store the fold number for each sample in the dataset (fold numbers range from 0 to 4)
    index_all_class = np.zeros(len(label_matrix)) #shape: (37264, )

    #for each event number (0 through 64)
    for j in range(event_num):
        
        #find label indices where the event number matches event j
        index = np.where(label_matrix == j)

        #initialize KFold cross-validator with CV (5) folds, shuffling data, and setting random seed
        kf = KFold(n_splits=CV, shuffle=True, random_state=seed)
        k_num = 0 #keep track of fold number

        #for the samples that were matched to j -> split them into training and test sets across 5 folds
        for train_index, test_index in kf.split(range(len(index[0]))):

            #for the current training & test split/fold, assign the test indices to current fold number
            index_all_class[index[0][test_index]] = k_num
            k_num += 1 #increment fold number for next iteration

    #for each of the 37264 events, we now know in which fold it is treated as test data. else, it will be used as training data in the other folds.
    return index_all_class
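
The idea behind `get_index` is that every event class is split across the folds separately, so each fold's test set contains samples of every class. The sketch below shows that idea with NumPy only. It is a simplified stand-in that uses a shuffled round-robin assignment rather than scikit-learn's `KFold`, so the exact fold membership differs from the notebook's; `assign_folds` and the toy labels are hypothetical.

```python
import numpy as np

def assign_folds(labels, n_folds=5, seed=0):
    """Give every sample a fold id so each class is spread across all folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = np.zeros(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)          # positions of this class
        rng.shuffle(idx)                             # randomize order within the class
        folds[idx] = np.arange(len(idx)) % n_folds   # round-robin fold ids
    return folds

labels = np.array([0] * 10 + [1] * 5)   # toy data: 10 samples of class 0, 5 of class 1
folds = assign_folds(labels)
for k in range(5):
    test_mask = folds == k
    print(k, np.bincount(labels[test_mask], minlength=2))
```

Each fold ends up with two samples of class 0 and one of class 1 as its test set, mirroring the class proportions of the full data.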

In [7]:
'''TRAINING CODE'''

'''
This function performs 5-fold cross validation for each of the 4 feature similarity matrices (so 20 different CNN models are trained)
The results are aggregated and overall and per-event performance is evaluated
'''

def cross_validation(feature_matrix, label_matrix, event_num, seed, CV):
    #evaluation results across all events (CSV file will have 11 rows and 1 column)
    all_eval_type = 11
    result_all = np.zeros((all_eval_type, 1), dtype=float)

    #evaluation results for each event (CSV file will have 65 rows and 6 columns)
    each_eval_type = 6
    result_eve = np.zeros((event_num, each_eval_type), dtype=float)
    
    y_true = np.array([]) #store true labels
    y_pred = np.array([]) #store predicted labels
    y_score = np.zeros((0, event_num), dtype=float) #store scores of predictions

    #get fold numbers for each of the 37264 events for cross-validation
    index_all_class = get_index(label_matrix, event_num, seed, CV) #shape: (37264, )

    #NOTE: we will train 20 models (for each fold (5) -> for each feature matrix (4) = 5 * 4 = 20)

    #for each fold (0 through 4)
    for k in range(CV):
        print(f'Training for Fold {k}')
        
        train_index = np.where(index_all_class != k) #training indices are all indices not marked as k
        test_index = np.where(index_all_class == k) #test indices are all indices marked as k
        pred = np.zeros((len(test_index[0]), event_num), dtype=float) #initialize array to store predicted scores for testing set

        #iterate over each of our 4 feature similarity matrices (smile, target, enzyme, pathway)
        for i in range(len(feature_matrix)):
            print(f'Training for Feauture {i}')
            
            #separate the data into training and testing sets based on the current fold indices we calculated above
            x_train = feature_matrix[i][train_index] #shape: (train_index, 1144)
            x_test = feature_matrix[i][test_index] #shape: (test_index, 1144)
            y_train = label_matrix[train_index] #shape: (train_index, )
            y_test = label_matrix[test_index] #shape: (test_index, )

            #one-hot encoding (each label represented by length 65 vector, with 1 at index corresponding to label number and 0's elsewhere)
            #shape: (y_train, 65); using event_num keeps the width at 65 even if a split happens to miss the highest label
            y_train_one_hot = (np.arange(event_num) == y_train[:, None]).astype(dtype='float32')
            
            # one-hot encoding
            y_test_one_hot = (np.arange(event_num) == y_test[:, None]).astype(dtype='float32')

            #let's train a sub-model in regards to the current feature
            cnn = CNN()

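            #technique to prevent overfitting (monitors model performance on the validation set & stops when performance stops improving/worsens)
            #note: patience=10 is larger than the 3 epochs used below, so early stopping cannot trigger in this demo run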
            early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='auto')

            #our data is compatible with the original DNN from paper, but with our CNN, an extra dimension is required. let's add it
            #now data is of shape: [batch_size, 1144, 1]
            x_train = np.expand_dims(x_train, axis=-1)
            x_test = np.expand_dims(x_test, axis=-1)

            #fit the model to the training data
            #epochs originally 10, let's just do 3 for time and demo purposes
            cnn.fit(x_train, y_train_one_hot, batch_size=128, epochs=3, validation_data=(x_test, y_test_one_hot),
                    callbacks=[early_stopping])

            # Save the trained model
            if cnn is not None:
                print("model should be saved???")
                print(cnn)

            #save model and its weights separately (because pickle and joblib are not working)
            os.makedirs("./models", exist_ok=True) #create the output folders if they do not exist yet
            os.makedirs("./model_weights", exist_ok=True)
            model_json = cnn.to_json()
            model_filename = os.path.join("./models", f"fold_{k}_feature_{i}_model.json")
            
            with open(model_filename, 'w') as file:
                file.write(model_json)

            weights_filename = os.path.join("./model_weights", f"fold_{k}_feature_{i}.weights.h5")
            cnn.save_weights(weights_filename)

            
            #make predictions for the testing data and append it to current fold's pred array
            pred += cnn.predict(x_test)
                

        #AGGREGATE RESULTS
        pred_score = pred / len(feature_matrix) #average the prediction scores of the per-feature sub-models for each test sample in this fold
        pred_type = np.argmax(pred_score, axis=1) #determine the predicted event number by taking index of max value in each row of pred_score
        
        y_true = np.hstack((y_true, y_test)) #horizontally stack true labels to keep track of true labels across all folds
        y_pred = np.hstack((y_pred, pred_type)) #horizontally stack predicted labels to keep track of predicted labels across all folds

        #vertically stack prediction scores so that each row holds the scores of one test sample; rows accumulate across all folds
        #columns represent each event number
        y_score = np.vstack((y_score, pred_score))

    '''
    Recap
        - Within each fold, the predictions of the per-feature sub-models are averaged, so 20 trained models (5 folds x 4 features) contribute to the final evaluation (rather than selecting a single best performing model)
    '''
    
    #calculate evaluation metrics based on predicted labels and scores (returns overall and event-specific evaluation metrics)
    result_all, result_eve = evaluate(y_pred, y_score, y_true, event_num)
 
    return result_all, result_eve
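
The aggregation step in `cross_validation` is a simple average of softmax outputs followed by an argmax. The sketch below reproduces that step on invented numbers: three hypothetical sub-models, two samples, and three classes instead of 65.

```python
import numpy as np

# Hypothetical softmax outputs from 3 sub-models for 2 samples and 3 classes
preds = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]),
]

summed = np.zeros_like(preds[0])
for p in preds:
    summed += p                        # mirrors `pred += cnn.predict(x_test)`

avg_score = summed / len(preds)        # mean probability per class
pred_class = np.argmax(avg_score, axis=1)
print(avg_score)
print(pred_class)
```

For the second sample the first sub-model alone would pick class 1, but the averaged scores favor class 2, which shows how the combined prediction can differ from any single sub-model.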

In [8]:
'''EVALUATION CODE'''

'''
This function computes multiple evaluation metrics (accuracy, AUPR, ROC AUC, precision, recall, F1)
    - For both overall performance and per individual event
'''

def evaluate(pred_type, pred_score, y_test, event_num):

    #evaluation results across all events (CSV file will have 11 rows and 1 column)
    all_eval_type = 11 #11 evaluation metrics
    result_all = np.zeros((all_eval_type, 1), dtype=float)

    #evaluation results for each event (CSV file will have 65 rows and 6 columns)
    each_eval_type = 6 #6 evaluation metrics
    result_eve = np.zeros((event_num, each_eval_type), dtype=float)

    #convert true labels (y_test) and predicted labels (pred_type) to one-hot encoded arrays (just like we did in cross_validation)
    y_one_hot = label_binarize(y_test, classes=np.arange(event_num))
    pred_one_hot = label_binarize(pred_type, classes=np.arange(event_num))

    #results across all events
    result_all[0] = accuracy_score(y_test, pred_type) #compute overall accuracy
    
    result_all[1] = roc_aupr_score(y_one_hot, pred_score, average='micro') #compute micro-average AUPR score
    result_all[2] = roc_aupr_score(y_one_hot, pred_score, average='macro') #compute macro-average AUPR score
    
    result_all[3] = roc_auc_score(y_one_hot, pred_score, average='micro') #compute micro-average ROC_AUC score
    result_all[4] = roc_auc_score(y_one_hot, pred_score, average='macro') #compute macro-average ROC_AUC score
    
    result_all[5] = f1_score(y_test, pred_type, average='micro') 
    result_all[6] = f1_score(y_test, pred_type, average='macro')
    
    result_all[7] = precision_score(y_test, pred_type, average='micro')
    result_all[8] = precision_score(y_test, pred_type, average='macro')
    
    result_all[9] = recall_score(y_test, pred_type, average='micro')
    result_all[10] = recall_score(y_test, pred_type, average='macro')

    #results for each event
    for i in range(event_num):
        #calculate accuracy for each event class (will be stored in 1st column)
        result_eve[i, 0] = accuracy_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel())

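        #calculate AUPR (area under the precision-recall curve) for each event class (will be stored in 2nd column)
        #note: this and the ROC AUC below are computed from the hard one-hot predictions, not from pred_score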
        result_eve[i, 1] = roc_aupr_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel(),
                                          average=None)
        #ROC AUC score
        result_eve[i, 2] = roc_auc_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel(),
                                         average=None)

        #f1 score
        result_eve[i, 3] = f1_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel(),
                                    average='binary')

        #precision score
        result_eve[i, 4] = precision_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel(),
                                           average='binary')

        #recall score
        result_eve[i, 5] = recall_score(y_one_hot.take([i], axis=1).ravel(), pred_one_hot.take([i], axis=1).ravel(),
                                        average='binary')
        
    return [result_all, result_eve]

In [9]:
'''EVALUATION CODE HELPER FUNCTION'''

def roc_aupr_score(y_true, y_score, average="macro"):
    def _binary_roc_aupr_score(y_true, y_score):
        precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
        return auc(recall, precision) #update

    def _average_binary_score(binary_metric, y_true, y_score, average):  # y_true= y_one_hot
        if average == "binary":
            return binary_metric(y_true, y_score)
        if average == "micro":
            y_true = y_true.ravel()
            y_score = y_score.ravel()
        if y_true.ndim == 1:
            y_true = y_true.reshape((-1, 1))
        if y_score.ndim == 1:
            y_score = y_score.reshape((-1, 1))
        n_classes = y_score.shape[1]
        score = np.zeros((n_classes,))
        for c in range(n_classes):
            y_true_c = y_true.take([c], axis=1).ravel()
            y_score_c = y_score.take([c], axis=1).ravel()
            score[c] = binary_metric(y_true_c, y_score_c)
        return np.average(score)

    return _average_binary_score(_binary_roc_aupr_score, y_true, y_score, average)

In [10]:
'''SAVING EVALUATION RESULTS TO TWO CSV FILES'''

def save_result(feature_name, result_type, clf_type, result):
    with open(feature_name + '_' + result_type + '_' + clf_type+ '.csv', "w", newline='') as csvfile:
        writer = csv.writer(csvfile)
        for i in result:
            writer.writerow(i)
    return 0

In [11]:
'''
DATA EXTRACTION IS DONE HERE
DATA PROCESSING AND TRAINING CODE IS CALLED FROM HERE
SAVING RESULTS IS CALLED FROM HERE AS WELL
'''

def main(feature_list=["smile", "target", "enzyme", "pathway"], classifier="CNN"):
    seed = 0
    CV = 5

    #Establish connection to SQLite DB and load in 3 tables
    conn = sqlite3.connect("event.db")
    
    df_drug = pd.read_sql('select * from drug;', conn) #drug info (id, the 4 drug features, name)
    df_event = pd.read_sql('select * from event_number;', conn) #frequency of events (event, frequency)
    df_interaction = pd.read_sql('select * from event;', conn) #event info (id1, name1, id2, name2, interaction)

    extraction = pd.read_sql('select * from extraction;', conn) #mechanism, action, drugA, drugB
    mechanism = extraction['mechanism']
    action = extraction['action']
    drugA = extraction['drugA']
    drugB = extraction['drugB']

    featureName="+".join(feature_list) #concatenate feature names into single string separated by "+"
    clf_list = [classifier]

    #these two will store the overall and event-wise evaluation results
    result_all = {}
    result_eve = {}

    #store the feature matrix for each feature
    all_matrix = []

    #read all drug names from txt and store in drugList
    drugList=[]
    with open("DrugList.txt",'r') as drug_file: #context manager closes the file when done
        for line in drug_file:
            drugList.append(line.split()[0])

    #loop through every feature -> prepare the feature matrix -> append to all_matrix
    for feature in feature_list:
        print(feature)
        new_feature, new_label, event_num = prepare(df_drug, [feature], vector_size, mechanism,action,drugA,drugB)

        #note: we don't store new_label, event_num in each iteration because they are the same in each iteration
        #note: however, we store new_feature into all_matrix, because that does change per feature
        all_matrix.append(new_feature)

    #keep track of time taken for validation + evaluation
    start = time.time()

    clf = "CNN"
    
    #perform cross-validation and evaluation
    all_result, each_result = cross_validation(all_matrix, new_label, event_num, seed, CV)

    #save the overall and event-wise evaluation results to individual CSV files
    save_result(featureName, 'all', clf, all_result)
    save_result(featureName, 'each', clf, each_result)

    #save the results in a variable as well (this part can be removed)
    result_all[clf] = all_result
    result_eve[clf] = each_result

    #total time taken for validation + evaluation
    print("time used:", time.time() - start)

In [None]:
main()

smile
target
enzyme
pathway
Training for Fold 0
Training for Feauture 0
Epoch 1/3
233/233 ━━━━━━━━━━━━━━━━━━━━ 101s 425ms/step - accuracy: 0.2779 - loss: 2.5185 - val_accuracy: 0.5911 - val_loss: 1.4869
Epoch 2/3
233/233 ━━━━━━━━━━━━━━━━━━━━ 98s 422ms/step - accuracy: 0.6575 - loss: 1.1933 - val_accuracy: 0.7608 - val_loss: 0.7645
Epoch 3/3
233/233 ━━━━━━━━━━━━━━━━━━━━ 100s 430ms/step - accuracy: 0.7948 - loss: 0.6240 - val_accuracy: 0.7966 - val_loss: 0.6123
model should be saved???
<Functional name=functional_1, built=True>
234/234 ━━━━━━━━━━━━━━━━━━━━ 5s 22ms/step
Training for Feauture 1
Epoch 1/3
233/233 ━━━━━━━━━━━━━━━━━━━━ 101s 425ms/step - accuracy: 0.4084 - loss: 2.0959 - val_accuracy: 0.7172 - val_loss: 0.9103
Epoch 2/3
233/233 ━━━━━━━━━━━━━━━━━━━━ 101s 432ms/step - accuracy: 