# Create data structure to inform GCPN loss function using KG-COVID-19 embeddings

In this notebook, we make a data structure to be used to inform a GCPN loss function with
information from embeddings of the KG-COVID-19 knowledge graph. This graph contains a broad
array of information about COVID-19 and SARS-CoV-2 ([detailed here](https://knowledge-graph-hub.github.io/kg-covid-19-dashboard/)).

The data structure we produce will be a 2D matrix with the 6,900 ChEMBL antiviral compounds as rows and
important SARS-CoV-2 entities represented as nodes in our KG (such as SARS-CoV-2 itself and protein cleavage
products like ADRP and Mpro) as columns. Each value in the matrix will be the cosine similarity
between the embedding of the node for the ChEMBL antiviral in that row and the embedding of the node for
the SARS-CoV-2 entity in that column. The matrix will be output as a TSV. The SMILES string representation of each ChEMBL antiviral will also be output as an additional column so that Tanimoto similarity can be calculated.
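
A minimal sketch of how such a matrix could be assembled and written out, assuming the embeddings are available as NumPy arrays with non-zero rows. The toy names, SMILES strings, embedding values, and the helper `cosine_similarity_matrix` are hypothetical and only illustrate the intended layout; they are not taken from KG-COVID-19.

```python
import csv
import io

import numpy as np


def cosine_similarity_matrix(rows: np.ndarray, cols: np.ndarray) -> np.ndarray:
    """Cosine similarity between every vector in `rows` and every vector in `cols`.

    Assumes no vector has zero norm.
    """
    rows_unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    cols_unit = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    return rows_unit @ cols_unit.T


# Toy stand-ins (hypothetical values, not real KG-COVID-19 data).
compound_names = ["compound_a", "compound_b"]
compound_smiles = ["CCO", "CC(=O)O"]
compound_emb = np.array([[1.0, 0.0], [1.0, 1.0]])
entity_names = ["entity_x", "entity_y"]
entity_emb = np.array([[1.0, 0.0], [0.0, 1.0]])

similarity = cosine_similarity_matrix(compound_emb, entity_emb)

# One row per compound: name, SMILES, then one similarity per entity column.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["compound", "smiles"] + entity_names)
for name, smiles, sims in zip(compound_names, compound_smiles, similarity):
    writer.writerow([name, smiles] + [f"{s:.4f}" for s in sims])
tsv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `io.StringIO()` for an opened file would produce the TSV on disk.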

The data structure produced here will be used to investigate whether embeddings of the COVID-19 KG
can be used to guide GCPNs in producing more useful/viable antivirals for COVID-19 treatment. Specifically,
for each compound produced by the GCPN, a loss function might be defined using the product of its Tanimoto similarity to each ChEMBL antiviral and that antiviral's cosine similarity to the node of interest, something like:

$L = -\max_{i=1}^{n}\left(\mathrm{tanimoto}(C^{gcpn}, C^{chembl}_i) \cdot \mathrm{cosine\_sim}(C^{chembl}_i, N^{interest})\right)$

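where $n$ is the number of ChEMBL antivirals, $C^{gcpn}$ is the compound produced by the GCPN, $C^{chembl}_i$ is the $i$-th ChEMBL antiviral, and $N^{interest}$ is the SARS-CoV-2 node of interest (ADRP, Mpro, SARS-CoV-2 itself, etc.) that is to be targeted by the therapeutic.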

Conceptually, this should confine the GCPN to some boundary of "druglikeness", as defined by similarity to ChEMBL antivirals and by how closely those antivirals are associated with the drug target of interest.
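
The loss described above can be sketched in plain Python. This is only an illustration under stated assumptions: fingerprints are represented as sets of "on" bit indices, and the functions `tanimoto` and `gcpn_loss` as well as all values are hypothetical. In practice the fingerprints would be derived from the SMILES strings with a cheminformatics toolkit, which this sketch leaves out.

```python
from typing import FrozenSet, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def gcpn_loss(
    candidate_fp: FrozenSet[int],
    chembl_fps: Sequence[FrozenSet[int]],
    chembl_cosine_sims: Sequence[float],
) -> float:
    """L = -max_i tanimoto(candidate, chembl_i) * cosine_sim(chembl_i, node of interest)."""
    return -max(
        tanimoto(candidate_fp, fp) * sim
        for fp, sim in zip(chembl_fps, chembl_cosine_sims)
    )


# Hypothetical fingerprints and precomputed cosine similarities to the node of interest.
chembl_fps = [frozenset({1, 2, 3}), frozenset({3, 4})]
chembl_cosine_sims = [0.5, 0.9]
candidate_fp = frozenset({1, 2, 3, 4})

loss = gcpn_loss(candidate_fp, chembl_fps, chembl_cosine_sims)
```

Here the second antiviral dominates: it is less similar to the candidate (0.5 versus 0.75) but much closer to the node of interest, so the product is larger.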

CAVEAT: these embeddings were created with a version of ensmallen_graph that has a known issue when producing holdouts (too few test edges are held out). This probably will not affect the quality of the embeddings, but we will need to redo these embeddings before any results are published, just to be safe.

## Loading the KG-COVID-19 knowledge graph
We need to load the graph and redo the training/test split exactly as we did when generating the embeddings, in order to retrieve the labels for the embeddings.

In [10]:
!pip install ensmallen_graph==0.3.6



In [11]:
graph_data_dir = "data"

In [12]:
# Get the graphs from Zenodo. This zenodo upload also contains embeddings from an unrelated experiment.
# (We are using a different (better) set of embeddings from those contained in this zenodo upload - we load these
# later on below.)

import urllib.request  # importing only `urllib` does not guarantee that `urllib.request` is loaded
import os
os.makedirs(graph_data_dir, exist_ok=True)
if not os.path.exists(graph_data_dir + "/kg-covid-19-skipgram-aug-2020.tar.gz"):
    with urllib.request.urlopen("https://zenodo.org/record/4011267/files/kg-covid-19-skipgram-aug-2020.tar.gz") as response, \
        open(graph_data_dir + "/kg-covid-19-skipgram-aug-2020.tar.gz", 'wb') as out_file:
            data = response.read()  # a `bytes` object
            out_file.write(data)

os.system("tar -xvzf " + graph_data_dir + "/kg-covid-19-skipgram-aug-2020.tar.gz -C " + graph_data_dir)

0

In [13]:
%%time
from ensmallen_graph import EnsmallenGraph

graph = EnsmallenGraph.from_csv(
    edge_path= graph_data_dir + "/merged-kg_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    default_edge_type="biolink:association",
    node_path= graph_data_dir + "/merged-kg_nodes.tsv",
    nodes_column="id",
    node_types_column="category",
    default_node_type="biolink:NamedThing",
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
    force_conversion_to_undirected=True
)

CPU times: user 6min 22s, sys: 10 s, total: 6min 32s
Wall time: 6min 32s


In [14]:
graph.report()

{'degrees_mode': '1',
 'degrees_max': '90378',
 'bidirectional_rate': '1',
 'nodes_number': '375365',
 'degrees_median': '6',
 'unique_edge_types_number': '0',
 'unique_node_types_number': '36',
 'degrees_min': '0',
 'selfloops_rate': '0.000015391581103247148',
 'strongly_connected_components_number': '8976',
 'degrees_mean': '82.21604837957722',
 'edges_number': '30861027',
 'traps_rate': '0.021906677500566116',
 'connected_components_number': '8976',
 'density': '0.00021902960686152735',
 'singleton_nodes': '8223',
 'is_directed': 'false'}

In [15]:
training, validation = graph.connected_holdout(42, 0.8)

In [16]:
training.report()

{'degrees_median': '5',
 'degrees_mean': '65.77283976929122',
 'degrees_min': '0',
 'bidirectional_rate': '1',
 'strongly_connected_components_number': '8976',
 'connected_components_number': '8976',
 'selfloops_rate': '0.000014743514291609376',
 'nodes_number': '375365',
 'unique_edge_types_number': '0',
 'edges_number': '24688822',
 'singleton_nodes': '8233',
 'degrees_max': '71988',
 'degrees_mode': '1',
 'unique_node_types_number': '36',
 'is_directed': 'false',
 'density': '0.0001752236883281372',
 'traps_rate': '0.021933318236916067'}

In [17]:
validation.report()

{'edges_number': '6172205',
 'bidirectional_rate': '1',
 'degrees_min': '0',
 'connected_components_number': '162705',
 'unique_edge_types_number': '0',
 'degrees_median': '1',
 'degrees_max': '18390',
 'strongly_connected_components_number': '162705',
 'traps_rate': '0.42166957494705154',
 'density': '0.000043805918533390134',
 'is_directed': 'false',
 'nodes_number': '375365',
 'degrees_mode': '0',
 'unique_node_types_number': '36',
 'selfloops_rate': '0.00001798384855979346',
 'degrees_mean': '16.44320861028599',
 'singleton_nodes': '158280'}

The following checks are not strictly necessary, but are offered as sanity checks:

In [18]:
assert graph > training
assert graph > validation
assert (training + validation).contains(graph)
assert graph.contains(training + validation)
assert not training.overlaps(validation)
assert not validation.overlaps(training)
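
Read as relations over edge sets, these assertions say that the training and validation graphs are each strictly smaller than the full graph, that together they cover it, and that they share no edges. The sketch below illustrates that reading with plain Python sets. It assumes the graph operators behave like set operations on edges, which is an interpretation rather than something this notebook confirms, and the edges are hypothetical.

```python
# Toy edge sets (hypothetical); each edge is a (source, destination) pair.
graph_edges = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}
training_edges = {(0, 1), (1, 2), (2, 3), (3, 0)}
validation_edges = graph_edges - training_edges

# Each split is a proper subset of the full edge set.
is_proper_subset = training_edges < graph_edges and validation_edges < graph_edges
# Together the splits reconstruct the full edge set.
is_partition = (training_edges | validation_edges) == graph_edges
# No edge appears in both splits.
is_disjoint = training_edges.isdisjoint(validation_edges)
```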

## Loading the embeddings

In [19]:
# https://zenodo.org/record/4019808/files/SkipGram_80_20_training_test_epoch_500_delta_0.0001_embedding.npy?download=1
embedding_dir = "link_prediction_experiment_embeddings"
embedding_file = os.path.join(embedding_dir, "SkipGram_embedding.npy")
os.makedirs(embedding_dir, exist_ok=True)

with urllib.request.urlopen("https://zenodo.org/record/4019808/files/SkipGram_80_20_training_test_epoch_500_delta_0.0001_embedding.npy") as response, \
    open(embedding_file, 'wb') as out_file:
        data = response.read()  # a `bytes` object
        out_file.write(data)

In [23]:
!pip install numpy
import numpy as np
embedding_file = os.path.join(embedding_dir, "SkipGram_embedding.npy")  # same path the download cell wrote to
embeddings = np.load(embedding_file)

Collecting numpy
  Using cached numpy-1.19.2-cp37-cp37m-macosx_10_9_x86_64.whl (15.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.19.2


In [29]:
node_names = list(np.array(training.nodes_reverse_mapping))

In [25]:
assert len(training.nodes_reverse_mapping) == len(embeddings)

In [26]:
assert len(training.node_types) == len(embeddings)

In [27]:
# Get embeddings for nodes of interest
sars_cov_2_name = 'CHEMBL.TARGET:CHEMBL4303835'
# <http://identifiers.org/uniprot/P0DTD1-PRO_0000449623

In [30]:
sars_cov_2_idx = node_names.index(sars_cov_2_name)
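
The lookup above relies on row `i` of the embedding matrix belonging to `node_names[i]`. As a sketch of the same idea on toy data, a name-to-row dictionary gives that mapping without a linear scan per lookup. The node names, embedding values, and the `name_to_row` helper are hypothetical.

```python
import numpy as np

# Toy stand-ins (hypothetical): three nodes with 4-dimensional embeddings.
toy_node_names = ["node_a", "node_b", "node_c"]
toy_embeddings = np.arange(12, dtype=float).reshape(3, 4)

# The mapping is only meaningful if both share the same length and order.
assert len(toy_node_names) == len(toy_embeddings)

# Build the name -> row index mapping once, then reuse it for every lookup.
name_to_row = {name: row for row, name in enumerate(toy_node_names)}
node_b_emb = toy_embeddings[name_to_row["node_b"]]
```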

In [None]:
import re

chembl_prefix = 'CHEMBL.COMPOUND'
chembl_pattern = re.compile(chembl_prefix)  # compile once instead of once per node
chembl_idx = [index for index, x in enumerate(node_names) if chembl_pattern.search(x)]
chembl_names = [node_names[index] for index in chembl_idx]

In [None]:
sars_cov_2_emb = embeddings[sars_cov_2_idx]

In [None]:
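from embiggen import GraphTransformer, EdgeTransformer

# NOTE: `mlp_model` is not defined in this notebook. From its use here it is assumed to be an
# (edge embedding method name, trained model) pair from the link prediction experiment.
assert mlp_model[0] in EdgeTransformer.methods

transformer = GraphTransformer(mlp_model[0])  # pass the edge embedding method, which is mlp_model[0]
transformer.fit(embeddings)
train_edges = transformer.transform(training)
assert training.get_edges_number() == len(train_edges)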

In [None]:
# let's try to predict a link that should exist in training graph
# example SARS-CoV-2 -> ChEMBL compound edge (which should be positive)
example_chembl_edge = train_edges[training.get_edge_id(sars_cov_2_idx, chembl_idx[0])]
example_chembl_edge.shape
example_chembl_edge.__class__
mlp_model[1].predict(example_chembl_edge)