In [None]:
# Get the DrugBank XML database
!pip install gdown # downloading files from google drive.
!gdown --folder https://drive.google.com/drive/folders/1hZa_Vc9dZf_oyNjQoCsO2eKVAOzz-78e
!unzip drug-drug/drugbank_all_full_database.xml.zip

In [19]:
# Decompress the graph structure descriptor TSV file
!gzip -dk drug-drug/ChCh-Miner_durgbank-chem-chem.tsv.gz # unzip, keep original .gz file as well 

Archive:  drug-drug/drugbank_all_full_database.xml.zip
  inflating: full database.xml       


In [46]:
import numpy as np
import pandas as pd

edges_df = pd.read_csv('drug-drug/ChCh-Miner_durgbank-chem-chem.tsv', sep='\t', header=None, index_col=False)
edges_df

Unnamed: 0,0,1
0,DB00862,DB00966
1,DB00575,DB00806
2,DB01242,DB08893
3,DB01151,DB08883
4,DB01235,DB01275
...,...,...
48509,DB00542,DB01354
48510,DB00476,DB01239
48511,DB00621,DB01120
48512,DB00808,DB01356


In [47]:
vertices = np.unique(edges_df.values)
vertices.shape, vertices

((1514,),
 array(['DB00005', 'DB00006', 'DB00007', ..., 'DB11256', 'DB11315',
        'DB11354'], dtype=object))

## Generating the training, validation and test data

The article suggests that the auto-encoder builds a latent representation ($\textbf{Z}$) based on the $\textbf{X}$ feature vectors and the $\textbf{A}$ adjacency matrix, then the decoder (the generative model) predicts $\textbf{A}$ based on $\textbf{Z}$, with pairs of vertices whose $\textbf{z}$ latent representations have a higher inner product being more likely to be connected.
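A minimal sketch of such a decoder, assuming the common inner-product formulation in which a sigmoid turns each inner product $\textbf{z}_i^T \textbf{z}_j$ into an edge probability. The sigmoid, the function name `inner_product_decoder` and the toy latent vectors are illustrative assumptions, not taken from the article:

```python
import numpy as np

def inner_product_decoder(Z):
    """Return edge probabilities sigmoid(Z @ Z.T) for latent vectors Z (N x D)."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy latent vectors: vertices 0 and 1 point in similar directions,
# vertex 2 points in the opposite direction
Z = np.array([[1.0, 2.0],
              [1.5, 1.0],
              [-1.0, -2.0]])
A_pred = inner_product_decoder(Z)
```

In this toy case the pair (0, 1) has a larger inner product than the pair (0, 2), so it receives the higher predicted edge probability.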

The goal of the prediction is to determine the existence of an edge between any two nodes. Any edge should be in at most one of the *train*, *validation* or *test* sets; otherwise we would have, for example, validation data that is also part of the training data, defeating the purpose and integrity of the validation set. As a consequence, the best way to split the data is to partition the edges into *train edges*, *validation edges* and *test edges*.

The proposed model uses graph convolutional layers to extract the latent variables ($\textbf{Z}$). Graph convolutions use vertex-local structure: each vertex receives information from vertices closer than $d$ hops, where $d$ is the convolutional filter's size. The samples should therefore be subgraphs (as opposed to one or a few pairs of nodes per sample, querying whether they are connected or not), so that some of the valid information about connectedness stays intact.
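The locality of graph convolutions can be illustrated with a small sketch. It assumes the commonly used propagation rule with a symmetrically normalised adjacency matrix plus self-loops, which may differ from the exact filter used in the article; the path graph and the helper `normalized_adjacency` are made up for illustration:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalise A + I, a common choice for graph convolutions."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy path graph 0 - 1 - 2 - 3
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

P = normalized_adjacency(A)
signal = np.array([1.0, 0.0, 0.0, 0.0])  # information starts at vertex 0 only

one_hop = P @ signal      # one propagation step
two_hops = P @ one_hop    # two propagation steps
reached_after_1 = np.flatnonzero(one_hop > 0)
reached_after_2 = np.flatnonzero(two_hops > 0)
```

Each propagation step moves information exactly one hop further along the path, which is why a sample made of isolated vertex pairs would leave the convolution with almost nothing to aggregate.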

The three-way split defines three separate subgraphs, and the samples will be subgraphs of these. During the split, it is possible that, for example, for some vertex $a$, edge $(a, x)$ is in the *train* split, $(a, y)$ is in *validation* and $(a, z)$ is in *test*, so during training, the majority of the information about the vertex's neighbourhood is lost. We solve this problem by keeping the percentage of *training* edges high, to render this situation unlikely.
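To get a feel for how much of a vertex's neighbourhood survives the split, one can measure the share of each vertex's edges that lands in the training split. The sketch below uses a hypothetical toy graph (a complete graph on 30 vertices) rather than the DrugBank data, and the helper `retained_fraction` is an illustrative name:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy edge list standing in for the real interaction graph
n_vertices = 30
toy_edges = np.array([(u, v) for u in range(n_vertices)
                      for v in range(u + 1, n_vertices)])
perm = rng.permutation(len(toy_edges))  # one fixed random order of the edges

def retained_fraction(train_fraction):
    """Mean share of each vertex's edges that lands in the training split."""
    n_train = int(train_fraction * len(toy_edges))
    train = toy_edges[perm[:n_train]]
    full_degree = np.bincount(toy_edges.ravel(), minlength=n_vertices)
    train_degree = np.bincount(train.ravel(), minlength=n_vertices)
    return (train_degree / full_degree).mean()

low = retained_fraction(0.6)
high = retained_fraction(0.9)
```

On average, the retained share of a vertex's neighbourhood tracks the training fraction, so a higher percentage of *training* edges leaves fewer vertices with a mostly hidden neighbourhood.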

Furthermore, it is also possible that during the sampling of the *training* graph, *training* edges connected to the same vertex get split up across samples, so we reduce the information that the model 'sees', at that particular sample, with respect to that given vertex. This is not necessarily a problem, because this situation can condition the model to be more redundant (in the way a *dropout* layer can condition a model using fully connected layers to be more redundant). However, we will solve this problem, because so far, the feature extraction from the database does not give much to rely on, so neighbourhood and clustering information between interacting molecules can be critical. We will sample a fixed number of vertices from the *training* subgraph first, then we will include all the edges between these vertices that are in the *training* graph - also called an *induced subgraph* (same for the other splits).
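A minimal sketch of drawing one induced subgraph, assuming the edges are stored as an `(M, 2)` array of vertex identifiers as in `edges_df.values`. The function name `induced_subgraph` and the four-vertex toy cycle are illustrative assumptions:

```python
import numpy as np

def induced_subgraph(edges, sample_vertices):
    """Keep only the edges whose two endpoints are both in sample_vertices."""
    keep = np.isin(edges, sample_vertices).all(axis=1)
    return edges[keep]

# Toy stand-in for the training edge list: the cycle a - b - c - d - a
toy_train_edges = np.array([['a', 'b'],
                            ['b', 'c'],
                            ['c', 'd'],
                            ['a', 'd']])
toy_vertices = np.unique(toy_train_edges)

# Sample a fixed number of vertices first, then collect the edges between them
rng = np.random.default_rng(0)
sample_vertices = rng.choice(toy_vertices, size=3, replace=False)
sample_edges = induced_subgraph(toy_train_edges, sample_vertices)
```

Because the vertices are drawn first and the edges follow from them, no *training* edge between two sampled vertices can be left out of the sample.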

It is crucial that when the *loss* is calculated, we do **not penalize** the model for finding an edge that is **not in the *training* sample**, but **is present in the original graph** (the same applies to the *validation* and *test* steps as well). Suppose we have a training sample $(X, y)$, where $X=(G, features)$, $y=G'$, and the model should infer $y$ from $X$. $G$ should be a subgraph of the *training graph* (to keep the integrity of the *train*, *validation* and *test* sets - see the second paragraph of this section), but $G'$ should be the maximal subgraph *of the original graph* that has exactly the same vertices as $G$ (for correctness).
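The difference between $G$ and $G'$ can be sketched on a toy graph. The dense adjacency representation, the helper `induced_adjacency` and the toy edge lists are assumptions made for illustration only:

```python
import numpy as np

def induced_adjacency(edges, sample_vertices):
    """Dense 0/1 adjacency matrix of the subgraph induced by sample_vertices."""
    index = {v: i for i, v in enumerate(sample_vertices)}
    A = np.zeros((len(sample_vertices), len(sample_vertices)))
    for u, v in edges:
        if u in index and v in index:
            A[index[u], index[v]] = A[index[v], index[u]] = 1.0
    return A

# Toy original graph and a toy training split of it
all_edges_toy = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')]
train_edges_toy = [('a', 'b'), ('b', 'c')]  # ('a', 'c') fell into another split
sample = ['a', 'b', 'c']

G_input = induced_adjacency(train_edges_toy, sample)  # G: what the model sees
G_target = induced_adjacency(all_edges_toy, sample)   # G': what the loss compares against
```

Here the edge between `a` and `c` is absent from the input but present in the target, so a model that predicts it is rewarded rather than penalized.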

Applying the principles above, representing the $G$ and $G'$ graphs as a list of edges, we generate the *train*, *validation* and *test* sets as follows:

In [65]:
# Input representation for the whole graph
N = vertices.shape[0] # we have about 1500 vertices
M = edges_df.shape[0] # we have about 48000 edges
F = 10 # insert number of relevant features
features = np.random.uniform(size=(N, F)) # insert values parsed from XML

# Train-val-test split
test_percentage = 0.2
val_percentage = 0.2
val_split = int((1.0 - test_percentage) * M)
train_split = int((1.0 - val_percentage - test_percentage) * M)

# Partitioning edges randomly into either category
shuffled_edges = edges_df.sample(frac=1.0).reset_index(drop=True)
train_edges = shuffled_edges.iloc[:train_split].values
train_vertices = np.unique(train_edges)
val_edges = shuffled_edges.iloc[train_split:val_split].values
val_vertices = np.unique(val_edges)
test_edges = shuffled_edges.iloc[val_split:].values
test_vertices = np.unique(test_edges)
# Print
print(f'Training subgraph:\n\tEdges:{train_edges.shape[0]}\n\tVertices:{train_vertices.shape[0]}\n')
print(f'Validation subgraph:\n\tEdges:{val_edges.shape[0]}\n\tVertices:{val_vertices.shape[0]}\n')
print(f'Test subgraph:\n\tEdges:{test_edges.shape[0]}\n\tVertices:{test_vertices.shape[0]}\n')
# train_edges, val_edges, test_edges

Training subgraph:
	Edges:29108
	Vertices:1466

Validation subgraph:
	Edges:9703
	Vertices:1322

Test subgraph:
	Edges:9703
	Vertices:1323



In [None]:
# Sampling parameters for the induced subgraphs
nb_vertices = 500 # number of vertices in a sample
nb_train = 6000 # number of train samples
nb_val = 2000 # number of validation samples
nb_test = 2000 # number of test samples
nb_samples = nb_train + nb_val + nb_test # 10000 in total; there are about 2 * 10^415 possible subsamples

# np.sample() does not exist; as a placeholder, draw the vertex set of a single
# training sample without replacement (the full sampling loop is still to be written)
rng = np.random.default_rng()
sample_vertices = rng.choice(train_vertices, size=nb_vertices, replace=False)