# Create data structure to inform GCPN loss function using KG-COVID-19 embeddings

In this notebook, we make a data structure to be used to inform a GCPN loss function with
information from embeddings of the KG-COVID-19 knowledge graph. This graph contains a broad
array of information about COVID-19 and SARS-CoV-2 ([detailed here](https://knowledge-graph-hub.github.io/kg-covid-19-dashboard/)).

The data structure we produce will be a 2D matrix with the roughly 6,900 ChEMBL antiviral compounds as rows
and important SARS-CoV-2 entities represented as nodes in our KG (such as SARS-CoV-2 itself and protein cleavage
products like ADRP and Mpro) as columns. Each value in the matrix will be the cosine similarity
between the embedding of the node for the ChEMBL antiviral in that row and the embedding of the node for
the SARS-CoV-2 entity in that column. The matrix will be output as a TSV. The SMILES string representation of each ChEMBL antiviral will also be output as an additional column so that Tanimoto similarity can be calculated.
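
As a rough illustration of the layout described above, the following sketch builds a tiny version of the matrix and serializes it as TSV. The compound IDs, SMILES strings, embedding vectors, column names, and the helper names `similarity_rows` and `to_tsv` are all made up for illustration; they are not taken from the real graph or from the final output format.

```python
import csv
import io
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins (hypothetical values, NOT real embeddings or real compounds)
compound_emb = {
    "CHEMBL.COMPOUND:A": [1.0, 0.0],
    "CHEMBL.COMPOUND:B": [0.0, 2.0],
}
compound_smiles = {"CHEMBL.COMPOUND:A": "CCO", "CHEMBL.COMPOUND:B": "CC(=O)O"}
target_emb = {"SARS-CoV-2": [1.0, 0.0], "nsp5": [1.0, 1.0]}


def similarity_rows(compound_emb, compound_smiles, target_emb):
    """One row per compound: id, SMILES, then one cosine similarity per target."""
    targets = list(target_emb)
    rows = [["id", "smiles"] + targets]
    for cid, emb in compound_emb.items():
        sims = [cosine_similarity(emb, target_emb[t]) for t in targets]
        rows.append([cid, compound_smiles[cid]] + [f"{s:.4f}" for s in sims])
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = similarity_rows(compound_emb, compound_smiles, target_emb)
print(to_tsv(rows))
```

Keeping the SMILES string as an ordinary column next to the similarities means a downstream consumer can compute Tanimoto similarity and look up the cosine similarity from the same row.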

The data structure produced here will be used to investigate whether embeddings of the COVID-19 KG
can be used to guide GCPNs in producing more useful/viable antivirals for COVID-19 treatment. Specifically,
for each compound produced by GCPN, a loss function might be defined using the product of the Tanimoto similarity and cosine similarity for the ChEMBL antivirals, something like:

$L = -\max_{i=1}^{n}\left(\mathrm{tanimoto}(C^{gcpn}, C^{chembl}_i) \cdot \mathrm{cosine\_sim}(C^{chembl}_i, N^{interest})\right)$


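where $n$ is the number of ChEMBL antivirals, $C^{gcpn}$ is the compound produced by the GCPN, $C^{chembl}_i$ is the $i$-th ChEMBL antiviral, and $N^{interest}$ is the SARS-CoV-2 node of interest (ADRP, Mpro, SARS-CoV-2 itself, etc.) that is to be targeted by the therapeutic.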

Conceptually, this should confine the GCPN to some boundary of "druglikeness", defined by similarity to ChEMBL antivirals and by the ability of those antivirals to target the entity of interest.
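
To make the proposed loss concrete, here is a minimal sketch with toy inputs. It assumes fingerprints are represented as Python sets of "on" bit indices (a simplification; a real pipeline would presumably derive fingerprints from the SMILES strings with a cheminformatics toolkit), and the function name `kg_guided_loss` and all numeric values are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def kg_guided_loss(gcpn_fp, chembl_fps, chembl_cosine_sims):
    """L = -max_i( tanimoto(C_gcpn, C_chembl_i) * cosine_sim(C_chembl_i, N_interest) )."""
    return -max(
        tanimoto(gcpn_fp, fp) * cos
        for fp, cos in zip(chembl_fps, chembl_cosine_sims)
    )


# Hypothetical toy data: one generated compound and three ChEMBL antivirals
gcpn_fp = {1, 2, 3, 4}
chembl_fps = [{1, 2, 3, 4}, {1, 2}, {7, 8}]
chembl_cosine_sims = [0.2, 0.9, 0.95]  # similarity of each antivirals's node to the target node

loss = kg_guided_loss(gcpn_fp, chembl_fps, chembl_cosine_sims)
print(loss)
```

In this toy case the structurally identical antiviral contributes little because its node is far from the target, while the antiviral closest to the target contributes nothing because it shares no fingerprint bits; the maximum comes from the compound that scores moderately on both terms.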

CAVEAT: These embeddings were generated using an 80/20 training/test split. We could, and possibly should, regenerate these embeddings with the entire graph (no 80/20 split).

## Loading the KG-COVID-19 knowledge graph
We need to load the graph and redo the training/test split exactly as we did when generating the embeddings, in order to retrieve the labels for the embeddings.

### Define all files and URLs up top here

In [1]:
import os

base_dl_dir = "downloaded_data"
graph_data_dir = os.path.join(base_dl_dir, "kg-covid-19-20201001")
embedding_data_dir = os.path.join(base_dl_dir, "embeddings-20201001")

# graph stuff
graph_out_file = os.path.join(graph_data_dir, "kg-covid-19.tar.gz")
nodes_file = os.path.join(graph_data_dir, "merged-kg_nodes.tsv")
edges_file = os.path.join(graph_data_dir, "merged-kg_edges.tsv")
sorted_edges_file = os.path.join(graph_data_dir, "merged-kg_edges_SORTED.tsv")
graph_tar_url = "https://kg-hub.berkeleybop.io/kg-covid-19/20201001/kg-covid-19.tar.gz"

# embeddings URLs
base_kghub_url = "http://kg-hub.berkeleybop.io/"
# build the URL by string concatenation; os.path.join is meant for filesystem paths
embeddings_url = base_kghub_url + "embeddings/20201001/SkipGram_80_20_kg_covid_19_20201001_training_test_epoch_500_delta_0.0001_embedding.npy"
embedding_file = os.path.join(embedding_data_dir, "SkipGram_embedding.npy")

# params
seed = 42
train_percentage = 0.8
patience = 5

In [2]:
chembl_antiviral_dir = os.path.join(base_dl_dir, "chembl_antiviral-20201001")
chembl_antiviral_file = os.path.join(chembl_antiviral_dir, "chembl_nodes.tsv")
chembl_antiviral_url = "https://kg-hub.berkeleybop.io/kg-covid-19/20201001/transformed/ChEMBL/nodes.tsv"

In [3]:
from pkg_resources import get_distribution
assert(get_distribution("ensmallen-graph").version == '0.4.3')
assert(get_distribution("embiggen").version == '0.6.0')

In [4]:
# download the graphs, if necessary

import os
import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded
os.makedirs(graph_data_dir, exist_ok=True)

if not os.path.exists(nodes_file) or not os.path.exists(edges_file):
    with urllib.request.urlopen(graph_tar_url) as response, \
        open(graph_out_file, 'wb') as out_file:
            data = response.read()  # a `bytes` object
            out_file.write(data)
    os.system("tar -xvzf " + graph_out_file + " -C " + graph_data_dir)
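
The extraction step above shells out to `tar`. A portable alternative, sketched here under the assumption that the archive is a trusted gzip-compressed tarball, is the standard library's `tarfile` module. The helper name `extract_tarball` is hypothetical and not part of the original notebook.

```python
import os
import tarfile


def extract_tarball(tar_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Only use this on archives from a trusted source: extractall() writes
    whatever paths the archive contains.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names


# Hypothetical usage with the variables defined earlier in this notebook:
# extract_tarball(graph_out_file, graph_data_dir)
```

This avoids building a shell command by string concatenation, which breaks if a path contains spaces.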

## Retrieve the embeddings

In [5]:
os.makedirs(embedding_data_dir, exist_ok=True)

if not os.path.exists(embedding_file):
    with urllib.request.urlopen(embeddings_url) as response, \
        open(embedding_file, 'wb') as out_file:
            data = response.read()  # a `bytes` object
            out_file.write(data)

In [6]:
%%time
from ensmallen_graph import EnsmallenGraph

if not os.path.exists(sorted_edges_file):
    graph = EnsmallenGraph.from_unsorted_csv(
        edge_path = edges_file,
        sources_column="subject",
        destinations_column="object",
        directed=False,
        node_path = nodes_file,
        nodes_column = 'id',
        node_types_column = 'category',
        default_node_type = 'biolink:NamedThing'
    )

    graph.dump_edges(sorted_edges_file,
        sources_column="subject",
        destinations_column="object")

CPU times: user 1.43 ms, sys: 1.25 ms, total: 2.69 ms
Wall time: 3.43 ms


In [7]:
from ensmallen_graph import EnsmallenGraph

graph = EnsmallenGraph.from_sorted_csv(
    edge_path = sorted_edges_file,
    sources_column="subject",
    destinations_column="object",
    directed=False,
    nodes_number=377577,  # must be >= the actual number of nodes
    edges_number=30949369,   # must be >= the actual number of edges
    node_path = nodes_file,
    nodes_column = 'id',
    node_types_column = 'category',
    default_node_type = 'biolink:NamedThing'
)

graph.report()

{'self_loops_rate': '0.00001554151233261008',
 'nodes_number': '377577',
 'singletons': '8314',
 'directed': 'false',
 'unique_edge_types_number': '0',
 'self_loops_number': '481',
 'degree_mean': '81.96836406878597',
 'edges_number': '30949369',
 'density': '0.037284237441157525',
 'unique_node_types_number': '37'}

In [8]:
%%time
pos_training, pos_validation = graph.connected_holdout(train_percentage, seed=seed)

CPU times: user 2min 28s, sys: 3.07 s, total: 2min 31s
Wall time: 2min 33s


The following checks are not strictly necessary, but are offered as sanity checks:

In [9]:
%%time
coherence_check=False
if coherence_check:
    assert graph.contains(pos_training)
    assert graph.contains(pos_validation)
    assert (pos_training | pos_validation).contains(graph)
    assert graph.contains(pos_training | pos_validation)
    assert not pos_training.overlaps(pos_validation)
    assert not pos_validation.overlaps(pos_training)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 6.91 µs


In [10]:
import numpy as np
embeddings = np.load(embedding_file)

In [11]:
node_names = list(np.array(pos_training.get_nodes_reverse_mapping()))

In [12]:
assert len(pos_training.get_nodes_reverse_mapping()) == len(embeddings)

In [13]:
assert len(pos_training.get_node_types()) == len(embeddings)

#### Here are all the SARS-CoV-2 proteins, in case we need to add more

In [15]:
import wget

if not os.path.exists("uniprot_sars-cov-2.gpi"):
    url = "https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi"
    filename = wget.download(url)

with open("uniprot_sars-cov-2.gpi", 'r') as gpi:
    lines = gpi.readlines()
    print("NAME\tDB ID")
    for line in lines:
        if line.startswith("!"):
            continue
        fields = line.split("\t")
        print("%s\t%s" % (fields[2], ":".join([fields[0], fields[1]])))

NAME	DB ID
pp1a	UniProtKB:P0DTC1
nsp11	UniProtKB:P0DTC1-PRO_0000449645
S protein	UniProtKB:P0DTC2
Spike protein S1	UniProtKB:P0DTC2-PRO_0000449647
Spike protein S2	UniProtKB:P0DTC2-PRO_0000449648
Spike protein S2'	UniProtKB:P0DTC2-PRO_0000449649
ORF3a	UniProtKB:P0DTC3
E protein	UniProtKB:P0DTC4
M protein	UniProtKB:P0DTC5
ORF6	UniProtKB:P0DTC6
ORF7a	UniProtKB:P0DTC7
ORF8	UniProtKB:P0DTC8
N protein	UniProtKB:P0DTC9
pp1ab	UniProtKB:P0DTD1
nsp1	UniProtKB:P0DTD1-PRO_0000449619
nsp2	UniProtKB:P0DTD1-PRO_0000449620
nsp3	UniProtKB:P0DTD1-PRO_0000449621
nsp4	UniProtKB:P0DTD1-PRO_0000449622
nsp5	UniProtKB:P0DTD1-PRO_0000449623
nsp6	UniProtKB:P0DTD1-PRO_0000449624
nsp7	UniProtKB:P0DTD1-PRO_0000449625
nsp8	UniProtKB:P0DTD1-PRO_0000449626
nsp9	UniProtKB:P0DTD1-PRO_0000449627
nsp10	UniProtKB:P0DTD1-PRO_0000449628
nsp12	U

In [None]:
# names of the nodes of interest (their embeddings are looked up below)
sars_cov_2_name = 'CHEMBL.TARGET:CHEMBL4303835'
nsp5_name = 'UniProtKB:P0DTD1-PRO_0000449623' # nsp5 3C-like proteinase

In [16]:
sars_cov_2_idx = node_names.index(sars_cov_2_name)
sars_cov_2_emb = embeddings[sars_cov_2_idx]

In [17]:
# get chembl antiviral nodes file
import wget

os.makedirs(chembl_antiviral_dir, exist_ok=True)

if not os.path.exists(chembl_antiviral_file):
    filename = wget.download(chembl_antiviral_url, chembl_antiviral_file)

In [18]:
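chembl_antiviral_names = []
with open(chembl_antiviral_file, 'r') as f:
    # strip the trailing newline so the last column name and value are clean
    header = f.readline().rstrip("\n").split("\t")
    id_col = header.index("id")  # look up the id column once, not per line
    while line := f.readline():
        items = line.rstrip("\n").split("\t")
        chembl_antiviral_names.append(items[id_col])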

In [19]:
chembl_antiviral_idx = [node_names.index(av) for av in chembl_antiviral_names]
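
The list comprehension above scans `node_names` once per antiviral and raises `ValueError` on the first name that is not in the graph. A dictionary-based lookup is one possible alternative; the sketch below uses small made-up name lists, and the helper name `build_index` is hypothetical.

```python
def build_index(names):
    """Map each node name to its first position, mirroring list.index() semantics."""
    index = {}
    for i, name in enumerate(names):
        index.setdefault(name, i)  # keep the first occurrence if a name repeats
    return index


# Hypothetical toy data standing in for node_names and chembl_antiviral_names
node_names_demo = ["CHEMBL.COMPOUND:A", "UniProtKB:X", "CHEMBL.COMPOUND:B"]
antivirals_demo = ["CHEMBL.COMPOUND:B", "CHEMBL.COMPOUND:A", "CHEMBL.COMPOUND:MISSING"]

name_to_idx = build_index(node_names_demo)
found = [(n, name_to_idx[n]) for n in antivirals_demo if n in name_to_idx]
missing = [n for n in antivirals_demo if n not in name_to_idx]

print(found)
print(missing)
```

Collecting the missing names separately makes it easy to report which antivirals have no embedding, rather than stopping at the first failure.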

In [20]:
# import re
# chembl_prefix = 'CHEMBL.COMPOUND'
# chembl_names = [x for x in node_names if (match := re.compile(chembl_prefix).search(x))]
# chembl_idx = [index for index, x in enumerate(node_names) if (match := re.compile(chembl_prefix).search(x))]
# len(chembl_names)

In [21]:
from scipy import spatial
from tqdm import tqdm

chembl_antiviral_cosine_sim = []

sars_cov_2_emb = embeddings[sars_cov_2_idx]

for antiviral_idx in tqdm(chembl_antiviral_idx):
    antiviral_emb = embeddings[antiviral_idx]
    # scipy returns the cosine *distance*, so similarity = 1 - distance
    chembl_antiviral_cosine_sim.append(1 - spatial.distance.cosine(antiviral_emb, sars_cov_2_emb))

100%|██████████| 6956/6956 [00:00<00:00, 14091.69it/s]
