# Description

This notebook builds the gold standard for drug-disease prediction using [PharmacotherapyDB](https://dx.doi.org/10.7554%2FeLife.26726).

Instead of using all drug-disease pairs in PharmacotherapyDB, we use only disease-modifying (DM) pairs as positive cases and non-indications (NOT) as negative cases. We exclude symptomatic (SYM) pairs because those drugs might not exert an important effect on the disease itself.
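The labeling rule above can be sketched on a toy table. This is a minimal illustration rather than the notebook's actual cell: the `label_pairs` helper is hypothetical and the three-row table is made up for the example. Only the column names (`doid_id`, `drugbank_id`, `category`) and the category codes follow the PharmacotherapyDB file loaded later in this notebook.

```python
import pandas as pd


def label_pairs(indications: pd.DataFrame) -> pd.DataFrame:
    """Keep DM and NOT pairs; label DM as 1 and NOT as 0 (hypothetical helper)."""
    kept = indications[indications["category"].isin(["DM", "NOT"])]
    return pd.DataFrame(
        {
            "trait": kept["doid_id"].to_numpy(),
            "drug": kept["drugbank_id"].to_numpy(),
            "true_class": (kept["category"] == "DM").astype(int).to_numpy(),
        }
    )


# Toy rows for illustration only: one pair per category.
toy = pd.DataFrame(
    {
        "doid_id": ["DOID:10652", "DOID:10652", "DOID:10652"],
        "drugbank_id": ["DB00843", "DB00245", "DB00810"],
        "category": ["DM", "SYM", "NOT"],
    }
)

toy_labels = label_pairs(toy)
print(toy_labels)
```

The SYM row is dropped entirely instead of being labeled 0, so the negative class contains only curated non-indications.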

# Modules loading

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path

import pandas as pd

import conf

# Settings

In [3]:
OUTPUT_DIR = conf.RESULTS["DRUG_DISEASE_ANALYSES"]
display(OUTPUT_DIR)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

PosixPath('/media/miltondp/Elements1/projects/phenoplier/results/drug_disease_analyses')

# PharmacotherapyDB: load gold standard

## Read data

In [4]:
input_file = conf.PHARMACOTHERAPYDB["INDICATIONS_FILE"]
display(input_file)

pharmadb_gold_standard = pd.read_csv(input_file, sep="\t")

PosixPath('/home/miltondp/projects/labs/greenelab/phenoplier/base/data/hetionet/pharmacotherapydb-v1.0/indications.tsv')

In [5]:
pharmadb_gold_standard.shape

(1388, 7)

In [6]:
pharmadb_gold_standard.head()

| | doid_id | drugbank_id | disease | drug | category | n_curators | n_resources |
|---|---|---|---|---|---|---|---|
| 0 | DOID:10652 | DB00843 | Alzheimer's disease | Donepezil | DM | 2 | 1 |
| 1 | DOID:10652 | DB00674 | Alzheimer's disease | Galantamine | DM | 1 | 4 |
| 2 | DOID:10652 | DB01043 | Alzheimer's disease | Memantine | DM | 1 | 3 |
| 3 | DOID:10652 | DB00989 | Alzheimer's disease | Rivastigmine | DM | 1 | 3 |
| 4 | DOID:10652 | DB00245 | Alzheimer's disease | Benzatropine | SYM | 3 | 1 |


In [7]:
pharmadb_gold_standard["doid_id"].unique().shape

(97,)

In [8]:
pharmadb_gold_standard["drugbank_id"].unique().shape

(601,)

## Build gold standard

In [9]:
pharmadb_gold_standard["category"].value_counts()

DM     755
SYM    390
NOT    243
Name: category, dtype: int64

In [10]:
gold_standard = (
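    # Keep disease-modifying (DM) and non-indication (NOT) pairs; drop symptomatic (SYM)
    pharmadb_gold_standard[pharmadb_gold_standard["category"].isin(("DM", "NOT"))]
    .set_index(["doid_id", "drugbank_id"])
    # Label DM pairs as 1 (positive) and NOT pairs as 0 (negative)
    .apply(lambda x: int(x.category in ("DM",)), axis=1)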
    .reset_index()
    .rename(
        columns={
            "doid_id": "trait",
            "drugbank_id": "drug",
            0: "true_class",
        }
    )
)

In [11]:
gold_standard.shape

(998, 3)

In [12]:
# Expected size: 755 DM pairs + 243 NOT pairs
assert gold_standard.shape[0] == 998

In [13]:
gold_standard.head()

| | trait | drug | true_class |
|---|---|---|---|
| 0 | DOID:10652 | DB00843 | 1 |
| 1 | DOID:10652 | DB00674 | 1 |
| 2 | DOID:10652 | DB01043 | 1 |
| 3 | DOID:10652 | DB00989 | 1 |
| 4 | DOID:10652 | DB00810 | 0 |


In [14]:
gold_standard["trait"].unique().shape

(87,)

In [15]:
gold_standard["drug"].unique().shape

(465,)

In [16]:
gold_standard["true_class"].value_counts()

1    755
0    243
Name: true_class, dtype: int64

In [17]:
gold_standard.dropna().shape

(998, 3)

In [18]:
doids_in_gold_standard = set(gold_standard["trait"].values)

# Save

In [19]:
output_file = Path(OUTPUT_DIR, "gold_standard.pkl").resolve()
display(output_file)

PosixPath('/media/miltondp/Elements1/projects/phenoplier/results/drug_disease_analyses/gold_standard.pkl')

In [20]:
gold_standard.to_pickle(output_file)
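
A saved pickle can be checked by reloading it and comparing it with the in-memory table. The sketch below assumes a temporary directory and a made-up two-row table in place of `OUTPUT_DIR` and the real `gold_standard`; it is an illustration of the round trip, not a cell from the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the gold standard (illustrative rows only).
toy_gold_standard = pd.DataFrame(
    {
        "trait": ["DOID:10652", "DOID:10652"],
        "drug": ["DB00843", "DB00810"],
        "true_class": [1, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = Path(tmp_dir, "gold_standard.pkl")
    toy_gold_standard.to_pickle(tmp_file)
    # pd.read_pickle is the pandas counterpart of DataFrame.to_pickle.
    reloaded = pd.read_pickle(tmp_file)

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy_gold_standard, reloaded)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.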