# Drug Repurposing Using PAM: A Case Study

In this notebook we showcase how PAM can be applied to drug repurposing.

We follow the same procedure as [DRKG - COVID-19 Drug Repurpose](https://github.com/gnn4dr/DRKG/blob/master/drug_repurpose/COVID-19_drug_repurposing.ipynb).


The process is simple:
1. Download the [DRKG](https://github.com/gnn4dr/DRKG/) dataset and the related files from the drug-repurposing use case. The files are expected under `./data/DRKG/`, next to this notebook (e.g. `./data/DRKG/train.tsv` for the triples).
2. Create the **lossless** $1$-hop PAM $P$ of the KG.
3. Create a low-rank approximation of the PAM adjacency matrix, $\tilde{P} = U S V^\top$, of rank $k=200$ (truncated SVD).
4. In the approximated matrix $\tilde{P}$, rank all possible drug-disease combinations from highest to lowest score.
5. Calculate the Hits@100 of our methodology and compare it with that of the original work.
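
Step 5 can be sketched on toy data as follows. This is only an illustration: the helper `hits_at_k` and the drug names are hypothetical and are not part of the PAM or DRKG code.

```python
def hits_at_k(ranked_items, relevant_items, k=100):
    """Return (item, 1-based rank) for every relevant item found in the top-k."""
    relevant = set(relevant_items)
    return [
        (item, rank)
        for rank, item in enumerate(ranked_items[:k], start=1)
        if item in relevant
    ]


# Hypothetical toy ranking and ground truth, for illustration only
ranked = ["drug_c", "drug_a", "drug_e", "drug_b", "drug_d"]
ground_truth = {"drug_a", "drug_d", "drug_z"}

hits = hits_at_k(ranked, ground_truth, k=3)
print(f"Hits@3: {len(hits)}/{len(ground_truth)}")
for item, rank in hits:
    print(f"{item} ({rank})")
```

The notebook reports the same quantity inline: the size of the overlap between the top-100 predicted drug names and a ground-truth list, together with the rank of each hit.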






## Initial Imports

In [1]:
%load_ext autoreload
%autoreload 2

import scipy
import csv
import numpy as np
import pandas as pd


from pam_creation import create_pam_matrices
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

## Load the original KG

In [2]:
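path = "./data/DRKG/train.tsv"
# NOTE: read_csv treats the first row as a header by default; if train.tsv
# has no header row, pass header=None so that the first triple is not dropped.
df_train = pd.read_csv(path, sep="\t")
df_train.dropna(inplace=True)
df_train.columns = ["head", "rel", "tail"]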

## Create the PAM matrix

In [3]:
# we only want the 1-hop matrix
max_order = 1

power_A_directed, node2id, rel2id = create_pam_matrices(df_train, max_order=max_order, spacing_strategy='step_10', use_log=True)

# create a dictionary that maps node indices back to their names
id2node = {idx: name for name, idx in node2id.items()}

# of unique rels: 107 	 | # of unique nodes: 97238
(97238, 97238) Sparsity: 99.95 %
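
To give an intuition for what a "lossless" 1-hop matrix can look like, here is a minimal sketch on toy triples. It assumes that each relation type is mapped to a distinct prime and that a cell holds the product of the primes of all relations between two nodes. It is not the implementation of `create_pam_matrices`: the `spacing_strategy` and `use_log` options are not modelled, and all names below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy triples (not taken from DRKG)
toy = pd.DataFrame(
    [
        ("drug_a", "treats", "dis_x"),
        ("drug_a", "palliates", "dis_x"),
        ("drug_a", "binds", "gene_g"),
        ("gene_g", "assoc", "dis_x"),
    ],
    columns=["head", "rel", "tail"],
)

# Assumption: one distinct prime per relation type
toy_rel2prime = dict(zip(sorted(toy["rel"].unique()), [2, 3, 5, 7]))
toy_node2id = {
    name: i for i, name in enumerate(sorted(set(toy["head"]) | set(toy["tail"])))
}

# Each cell holds the product of the primes of all relations between two nodes
n = len(toy_node2id)
dense = np.ones((n, n), dtype=np.int64)
present = np.zeros((n, n), dtype=bool)
for h, r, t in toy.itertuples(index=False):
    i, j = toy_node2id[h], toy_node2id[t]
    dense[i, j] *= toy_rel2prime[r]
    present[i, j] = True
dense[~present] = 0
toy_pam = csr_matrix(dense)

# The relations behind a cell can be recovered by testing divisibility
value = int(toy_pam[toy_node2id["drug_a"], toy_node2id["dis_x"]])
recovered = [rel for rel, p in toy_rel2prime.items() if value % p == 0]
print(value, recovered)
```

Because prime factorisation is unique, the set of relations between two nodes can be read back from a single integer, which is the sense in which such a matrix loses no information.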


## Load: 

- the representative nodes of the COVID disease.
- the candidate drugs

In [4]:
COV_disease_list = [
'Disease::SARS-CoV2 E',
'Disease::SARS-CoV2 M',
'Disease::SARS-CoV2 N',
'Disease::SARS-CoV2 Spike',
'Disease::SARS-CoV2 nsp1',
'Disease::SARS-CoV2 nsp10',
'Disease::SARS-CoV2 nsp11',
'Disease::SARS-CoV2 nsp12',
'Disease::SARS-CoV2 nsp13',
'Disease::SARS-CoV2 nsp14',
'Disease::SARS-CoV2 nsp15',
'Disease::SARS-CoV2 nsp2',
'Disease::SARS-CoV2 nsp4',
'Disease::SARS-CoV2 nsp5',
'Disease::SARS-CoV2 nsp5_C145A',
'Disease::SARS-CoV2 nsp6',
'Disease::SARS-CoV2 nsp7',
'Disease::SARS-CoV2 nsp8',
'Disease::SARS-CoV2 nsp9',
'Disease::SARS-CoV2 orf10',
'Disease::SARS-CoV2 orf3a',
'Disease::SARS-CoV2 orf3b',
'Disease::SARS-CoV2 orf6',
'Disease::SARS-CoV2 orf7a',
'Disease::SARS-CoV2 orf8',
'Disease::SARS-CoV2 orf9b',
'Disease::SARS-CoV2 orf9c',
'Disease::MESH:D045169',
'Disease::MESH:D045473',
'Disease::MESH:D001351',
'Disease::MESH:D065207',
'Disease::MESH:D028941',
'Disease::MESH:D058957',
'Disease::MESH:D006517'
]


# Load entity file
drug_list = []
with open("./data/DRKG/infer_drug.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['drug','ids'])
    for row_val in reader:
        drug_list.append(row_val['drug'])

## Map the wanted drug and diseases to their indexes

In [5]:
# handle the ID mapping
drug_ids = []
disease_ids = []
for drug in drug_list:
    drug_ids.append(node2id[drug])
    
for disease in COV_disease_list:
    disease_ids.append(node2id[disease])

print(f"# Drugs: {len(drug_ids)} \t # Diseases: {len(disease_ids)}")

# Drugs: 8104 	 # Diseases: 34


## Load the ground-truth clinical trial drugs.


These are the drugs that were selected for clinical trials; we use them as the ground truth for evaluating our predictions.

In [6]:
clinical_drugs_file = './data/DRKG/COVID19_clinical_trial_drugs.tsv'
clinical_drug_map = {}
with open(clinical_drugs_file, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['id', 'drug_name','drug_id'])
    for row_val in reader:
        clinical_drug_map[row_val['drug_id']] = row_val['drug_name']
        
print(f"# Ground-Truth Drugs: {len(clinical_drug_map)}")

# Ground-Truth Drugs: 32


## Find the latest clinical trial drugs as well


Because the study was performed in September of 2021, many other drugs had been used in clinical trials by then as well.

An updated list can be found on the [DrugBank](https://go.drugbank.com/covid-19#drugs) website.

We will scrape it to create an updated list with more drugs that have been used in newer clinical trials.


In [13]:
from bs4 import BeautifulSoup
import requests

# Downloading contents of the web page
url = "https://go.drugbank.com/covid-19#drugs"
data = requests.get(url).text


# Creating BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')


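# Find the names of the drugs from the corresponding table.
# NOTE: the table index (4) and the column position (second <td>) depend on the
# page layout at scraping time and may need adjusting if the page changes.
table = soup.find_all("table")[4]
rows = [[ele.text.strip() for ele in item.find_all("td")[1:2]]
        for item in table.find_all("tr")]
# skip the first (header) row and keep the unique drug names
covid_latest_drugs = {ll[0] for ll in rows[1:]}

print(f"The latest # of Ground-Truth  Drugs is: {len(covid_latest_drugs)}")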


The latest # of Ground-Truth  Drugs is: 708


## Load a mapping of all the compound names from DrugBank


Use the vocabulary of DrugBank from [here](https://go.drugbank.com/releases/latest#open-data).

In [11]:
drugbank_df = pd.read_csv("./data/DRKG/drugbank_vocabulary.csv")
dbid2name = dict(zip(drugbank_df['DrugBank ID'], drugbank_df['Common name']))
print(f"# of Drugs in DrugBank: {len(dbid2name)}")

# of Drugs in DrugBank: 15235


## DRKG-methodology predictions

In [19]:
found_by_DRKG_full = """Compound::DB00811	-0.21416784822940826
Compound::DB00993	-0.8350892663002014
Compound::DB00635	-0.8974801898002625
Compound::DB01082	-0.9854875802993774
Compound::DB01234	-0.9984006881713867
Compound::DB00982	-1.0160722732543945
Compound::DB00563	-1.0189464092254639
Compound::DB00290	-1.064104437828064
Compound::DB01394	-1.080674648284912
Compound::DB01222	-1.084547519683838
Compound::DB00415	-1.0853980779647827
Compound::DB01004	-1.096668004989624
Compound::DB00860	-1.1004775762557983
Compound::DB00681	-1.1011559963226318
Compound::DB00688	-1.125687599182129
Compound::DB00624	-1.1428285837173462
Compound::DB00959	-1.1618402004241943
Compound::DB00115	-1.1868144273757935
Compound::DB00091	-1.1906721591949463
Compound::DB01024	-1.2051165103912354
Compound::DB00741	-1.2147064208984375
Compound::DB00441	-1.2320444583892822
Compound::DB00158	-1.2346539497375488
Compound::DB00499	-1.2525147199630737
Compound::DB00929	-1.2730510234832764
Compound::DB00770	-1.2825534343719482
Compound::DB01331	-1.2960500717163086
Compound::DB00958	-1.2967796325683594
Compound::DB02527	-1.303438663482666
Compound::DB00196	-1.3053392171859741
Compound::DB00537	-1.3131829500198364
Compound::DB00644	-1.3131871223449707
Compound::DB01048	-1.3267226219177246
Compound::DB00552	-1.3272088766098022
Compound::DB00328	-1.3286101818084717
Compound::DB00171	-1.3300385475158691
Compound::DB01212	-1.3330755233764648
Compound::DB09093	-1.3382999897003174
Compound::DB00783	-1.338560938835144
Compound::DB09341	-1.3396968841552734
Compound::DB00558	-1.3425884246826172
Compound::DB05382	-1.3575129508972168
Compound::DB01112	-1.3584508895874023
Compound::DB00515	-1.3608112335205078
Compound::DB01101	-1.381548523902893
Compound::DB01165	-1.3838160037994385
Compound::DB01183	-1.3862146139144897
Compound::DB00815	-1.3863483667373657
Compound::DB00755	-1.3881785869598389
Compound::DB00198	-1.3885014057159424
Compound::DB00480	-1.3935325145721436
Compound::DB00806	-1.3996552228927612
Compound::DB01656	-1.3999741077423096
Compound::DB00759	-1.404650092124939
Compound::DB00917	-1.4116020202636719
Compound::DB01181	-1.4148889780044556
Compound::DB01039	-1.4176580905914307
Compound::DB00512	-1.4207379817962646
Compound::DB01233	-1.4211887121200562
Compound::DB11996	-1.425789475440979
Compound::DB00738	-1.4274098873138428
Compound::DB00716	-1.4327492713928223
Compound::DB03461	-1.437927484512329
Compound::DB00591	-1.4404338598251343
Compound::DB01327	-1.4408743381500244
Compound::DB00131	-1.4446886777877808
Compound::DB00693	-1.4460749626159668
Compound::DB00369	-1.4505752325057983
Compound::DB04630	-1.453115463256836
Compound::DB00878	-1.456466555595398
Compound::DB08818	-1.4633680582046509
Compound::DB00682	-1.4691765308380127
Compound::DB01068	-1.4700121879577637
Compound::DB00446	-1.4720206260681152
Compound::DB01115	-1.4729849100112915
Compound::DB00355	-1.4770021438598633
Compound::DB01030	-1.485068678855896
Compound::DB00620	-1.4973516464233398
Compound::DB00396	-1.4976921081542969
Compound::DB01073	-1.4987037181854248
Compound::DB00640	-1.5026229619979858
Compound::DB00999	-1.5034282207489014
Compound::DB01060	-1.504364252090454
Compound::DB00493	-1.5072362422943115
Compound::DB01240	-1.5090957880020142
Compound::DB00364	-1.509944200515747
Compound::DB01263	-1.511993169784546
Compound::DB00746	-1.513066053390503
Compound::DB00718	-1.5183149576187134
Compound::DB01065	-1.5207160711288452
Compound::DB01205	-1.521277904510498
Compound::DB01137	-1.5229592323303223
Compound::DB08894	-1.5239660739898682
Compound::DB00813	-1.5308701992034912
Compound::DB01157	-1.5316557884216309
Compound::DB04570	-1.5430843830108643
Compound::DB00459	-1.5503207445144653
Compound::DB01752	-1.5541703701019287
Compound::DB00775	-1.5559712648391724
Compound::DB01610	-1.5563474893569946"""

found_by_DRKG = [dbid2name[line.split("\t")[0].split("::")[1]] for line in found_by_DRKG_full.split("\n")]

print(f"Old clinical trials-drug overlap top-100: {len(set(found_by_DRKG).intersection(list(clinical_drug_map.values())))}/{len(clinical_drug_map)}\n")
print(f"New clinical trials-drug overlap top-100: {len(set(found_by_DRKG).intersection(list(covid_latest_drugs)))}/{len(covid_latest_drugs)}\n")

for rank, drug in enumerate(found_by_DRKG):
    if drug in clinical_drug_map.values():
        print(f"{drug} ({rank+1})")

Old clinical trials-drug overlap top-100: 6/32

New clinical trials-drug overlap top-100: 32/708

Ribavirin (1)
Dexamethasone (5)
Colchicine (9)
Methylprednisolone (17)
Oseltamivir (50)
Deferoxamine (88)


## Low-rank approximation of the KG

In [20]:
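# Approximate P with a truncated SVD that keeps the k = 200 largest singular values
k = 200

P = power_A_directed[0]
# svds returns (U, S, Vt): the third factor is already transposed, with shape (k, n_nodes)
U, S, Vt = svds(P.astype(np.float32), k=k)
# Reconstruct only the drug x disease block of the approximation
P_approximate = (U[drug_ids] * S).dot(Vt[:, disease_ids])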
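
The cell above never builds the full $97238 \times 97238$ approximation; it reconstructs only the rows of the drugs and the columns of the diseases. The sketch below checks on a small random matrix that this gives the same values as slicing the full reconstruction. The toy matrix and index lists are hypothetical stand-ins for `P`, `drug_ids` and `disease_ids`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical sparse toy matrix standing in for P
rng = np.random.default_rng(0)
dense = rng.random((30, 30))
dense[dense < 0.8] = 0.0
toy_P = csr_matrix(dense)

# svds returns the right singular vectors already transposed, shape (k, n)
toy_U, toy_S, toy_Vt = svds(toy_P, k=5)

rows = [1, 4, 7]  # stand-ins for drug_ids
cols = [0, 2]     # stand-ins for disease_ids

# Full rank-5 reconstruction versus reconstruction of the wanted block only
full = (toy_U * toy_S) @ toy_Vt
block = (toy_U[rows] * toy_S) @ toy_Vt[:, cols]

same = np.allclose(block, full[np.ix_(rows, cols)])
print(block.shape, same)
```

Multiplying `toy_U` by `toy_S` scales each column of `toy_U` by its singular value, so it is equivalent to `toy_U @ np.diag(toy_S)` without materialising the diagonal matrix.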

## Find the highest-scoring drug-disease pairs and rank the drugs by their score

In [22]:
top_k = 100

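# Sort all (drug, disease) scores in descending order and keep the drug (row) position of each pair
top_drugs_ids = np.unravel_index(np.argsort(P_approximate.ravel())[::-1], P_approximate.shape)[0]
# Keep each drug only once, at the position of its best-scoring pair
_, idx = np.unique(top_drugs_ids, return_index=True)
unique_top_drugs_ids = top_drugs_ids[np.sort(idx)][:top_k]

# Map row positions back to KG node ids, then to DrugBank names
unique_top_drugs = np.array(drug_ids)[unique_top_drugs_ids]
# Node names look like "Compound::DB00811"; the slice [10:17] extracts the DrugBank ID
unique_top_drugs_names = [dbid2name[id2node[index][10:17]] for index in unique_top_drugs if id2node[index][10:17] in dbid2name]
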

print(f"Old clinical trials overlap top-100: {len(set(unique_top_drugs_names).intersection(list(clinical_drug_map.values())))}/{len(clinical_drug_map)}")
print(f"New clinical trials-drug overlap top-100: {len(set(unique_top_drugs_names).intersection(list(covid_latest_drugs)))}/{len(covid_latest_drugs)}\n")


for rank, drug in enumerate(unique_top_drugs_names):
    if drug in clinical_drug_map.values():
        print(f"{drug} ({rank+1})")

Old clinical trials overlap top-100: 10/32
New clinical trials-drug overlap top-100: 45/708

Dexamethasone (1)
Methylprednisolone (5)
Ribavirin (14)
Colchicine (28)
Thalidomide (34)
Deferoxamine (51)
Azithromycin (58)
Oseltamivir (60)
Chloroquine (70)
Hydroxychloroquine (91)
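
The ranking step above sorts every (drug, disease) score and then keeps each drug at the position of its first, i.e. best, pair. The following sketch replays that logic on a hypothetical $3 \times 2$ score matrix and checks that, when there are no ties, it matches ranking the drugs by their maximum score over all diseases.

```python
import numpy as np

# Hypothetical scores: rows are drugs, columns are diseases
scores = np.array([
    [0.1, 0.9],  # drug 0
    [0.2, 0.3],  # drug 1
    [0.8, 0.7],  # drug 2
])

# Row index of every (drug, disease) pair, sorted by descending score
order = np.argsort(scores.ravel())[::-1]
row_of_pair = np.unravel_index(order, scores.shape)[0]

# First occurrence of each drug in that ordering, kept in ranking order
_, first_pos = np.unique(row_of_pair, return_index=True)
ranked_rows = row_of_pair[np.sort(first_pos)]

# Equivalent formulation (no ties): rank drugs by their best score
by_max = np.argsort(-scores.max(axis=1))

print(ranked_rows.tolist(), by_max.tolist())
```

`np.unique(..., return_index=True)` returns the index of the first occurrence of each value, which is what makes the de-duplication keep the best-scoring pair of every drug.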


## Check ranks of predictions for the 45/708 drugs that are found in the newest clinical trials drug list

In [23]:
for rank, drug in enumerate(unique_top_drugs_names):
    if drug in covid_latest_drugs:
        print(f"{drug} ({rank+1})")

Dexamethasone (1)
Prednisone (2)
Prednisolone (3)
Hydrocortisone (4)
Methylprednisolone (5)
Cyclosporine (6)
Methotrexate (7)
Betamethasone (13)
Ribavirin (14)
Isotretinoin (16)
Fluoxetine (17)
Vitamin D (18)
Pentoxifylline (22)
Simvastatin (23)
Morphine (26)
Colchicine (28)
Alprostadil (29)
Sirolimus (31)
Thalidomide (34)
Tretinoin (35)
Clarithromycin (37)
Folic acid (38)
Cyanocobalamin (39)
Itraconazole (40)
Ceftriaxone (42)
Glutathione (44)
Budesonide (46)
Indomethacin (48)
Deferoxamine (51)
Naltrexone (53)
Tacrolimus (56)
Azithromycin (58)
Oseltamivir (60)
Midazolam (62)
Cholecalciferol (64)
Curcumin (69)
Chloroquine (70)
Minocycline (71)
Iodine (73)
Amiodarone (77)
Melatonin (78)
Clindamycin (80)
Colistin (87)
Hydroxychloroquine (91)
Erythromycin (94)


## Conclusion

We showed a simple way of using PAM for link prediction in the context of drug repurposing that needs only a single truncated SVD. The rank-200 approximation of the 1-hop PAM finds 10/32 of the original clinical-trial drugs in its top 100 (versus 6/32 for the DRKG methodology) and 45/708 of the updated list (versus 32/708).

An interesting idea would be to use the 2-hop PAM for the same task.

As the DRKG connectivity diagram shows, the 1-hop PAM captures the direct Compound-Disease relations.

The 2-hop PAM would additionally contain Compound-Compound-Disease and Compound-Gene-Disease paths.

Performing the same procedure on the 2-hop matrix would therefore take more information into account.

![Image of DRKG](https://github.com/gnn4dr/DRKG/blob/master/connectivity.png?raw=true)
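
The 2-hop idea can be illustrated on a toy graph. The sketch assumes that the 2-hop matrix is the matrix product of the 1-hop matrix with itself, which the name `power_A_directed` suggests but this notebook does not confirm; the node layout and the relation primes are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Nodes: 0 = compound, 1 = gene, 2 = disease.
# Non-zero values are hypothetical relation primes.
A = csr_matrix(np.array([
    [0, 3, 0],  # compound -(3)-> gene
    [0, 0, 2],  # gene     -(2)-> disease
    [0, 0, 0],
], dtype=np.int64))

# Assumed 2-hop matrix: the 1-hop matrix multiplied by itself
A2 = A @ A

print(A[0, 2], A2[0, 2])
```

In this toy graph the compound has no direct link to the disease, so the 1-hop cell is empty, while the 2-hop cell is non-zero because of the Compound-Gene-Disease path.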