## BioSNAP

BioSNAP is a collection of datasets for graph-based representation of biological networks. The datasets are collected from various sources and are used for various tasks such as node classification, link prediction, etc.

We only consider to use the side effect dataset from BioSNAP. More details about the dataset can be found [here](https://snap.stanford.edu/biodata/datasets/10018/10018-ChSe-Decagon.html).


### Find matched ids for drugs and side effects in ChSe-Decagon_monopharmacy.csv.gz

Our knowledge graph is based on the DrugBank database. Before we use the drug-sideeffect file to annotate our predicted results, we need to find the drug bank ids for all drugs and the MESH/SYMP ids for all side effects in the file.

### [Optional] Dependencies

It's optional to run the following code to convert the pubchem ids to drugbank ids if you already have the ChSe-Decagon_monopharmacy_drugbank.csv file.

In [3]:
import pandas as pd
import requests
import json


def convert_pubchem_to_drugbank(pubchem_ids):
    # mychem.info API URL
    url = "https://mychem.info/v1/query"

    # Dictionary to hold DrugBank to MeSH ID mappings
    mapping = {}

    for i in range(0, len(pubchem_ids), 100):
        # Prepare the query
        q = ",".join(pubchem_ids[i : i + 100])
        params = {
            "q": q,
            "fields": "drugbank.id,drugcentral.xrefs.drugbank_id,pharmgkb.xrefs.drugbank,unichem.drugbank",
            "scopes": "pubchem.cid",
        }

        # Send the request
        response = requests.post(url, params=params)

        # Check if the response is valid
        print(response.status_code, response.text)
        results = response.json()
        for result in results:
            if result.get("drugbank"):
                mapping[result["query"]] = result["drugbank"]["id"]
            else:
                mapping[result["query"]] = None

    return mapping


def convert_id_to_umls(id, id_type, api_key):
    """
    Convert a ID to UMLS ID using BioPortal's REST API.

    :param id: The ID to convert.
    :param id_type: The type of ID to convert. Must be one of MESH, SNOMEDCT, SYMP, MEDDRA.
    :param api_key: Your BioPortal API key.
    :return: The corresponding UMLS ID, if found.
    """
    base_url = "http://data.bioontology.org"
    headers = {"Authorization": f"apikey token={api_key}"}

    # More details on the API here: https://data.bioontology.org/documentation#Class
    # You can get the related UMLS ids for SYMP from the downloaded file here: https://bioportal.bioontology.org/ontologies/SYMP?p=summary
    if id_type not in ["MESH", "SNOMEDCT", "MEDDRA"]:
        print(
            f"Error: {id_type} is not a valid ID type, must be one of MESH, SNOMEDCT, MEDDRA"
        )
        return None

    if id_type in ["MESH", "SNOMEDCT", "MEDDRA"]:
        path = f"http%3A%2F%2Fpurl.bioontology.org%2Fontology%2F{id_type}%2F{id}"

    url = f"{base_url}/ontologies/{id_type}/classes/{path}"
    print("The URL is: ", url)

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        print(json.dumps(data, indent=2))
        mappings = data.get("cui", [])
        if len(mappings) > 0:
            return mappings[0]
        else:
            print(f"Error: No mappings found for {id}")
            return None
    else:
        print(f"Error: {response.status_code}")
        return None

### [Optional] Convert the pubchem ids to drugbank ids

It's optional to run the following code to convert the pubchem ids to drugbank ids if you already have the ChSe-Decagon_monopharmacy_drugbank.csv file.

#### Input

In [2]:
import pandas as pd

data = pd.read_csv(
    "./ChSe-Decagon_monopharmacy.csv.gz",
    compression="gzip",
)
output_file = "./ChSe-Decagon_monopharmacy_drugbank.csv"

#### Output

In [3]:
data = data.rename(
    columns={
        "# STITCH": "pubchem_id",
        "Individual Side Effect": "ulms_id",
        "Side Effect Name": "side_effect_name",
    }
)

In [4]:
pubchem_ids = data["pubchem_id"].unique().tolist()
formatted_pubchem_ids = [x.replace("CID", "").strip("0") for x in pubchem_ids]
id_map = dict(zip(pubchem_ids, formatted_pubchem_ids))
mapping = convert_pubchem_to_drugbank(formatted_pubchem_ids)

drugbank_ids = []
for pubchem_id in data["pubchem_id"]:
    drugbank_id = mapping.get(id_map.get(pubchem_id))
    drugbank_ids.append(drugbank_id)
data["drugbank_id"] = drugbank_ids

200 [{"query":"3062316","_id":"ZBNZXTGUTAYRHI-UHFFFAOYSA-N","_score":17.083454,"drugbank":{"_license":"https://bit.ly/3Hikpvm","id":"DB01254"}},{"query":"3117","_id":"AUZONCFQVSMFAP-UHFFFAOYSA-N","_score":17.083454,"drugbank":{"_license":"https://bit.ly/3Hikpvm","id":"DB00822"}},{"query":"3114","_id":"UVTNFZQICZKOEM-UHFFFAOYSA-N","_score":17.083544,"drugbank":{"_license":"https://bit.ly/3Hikpvm","id":"DB00280"}},{"query":"373","notfound":true},{"query":"3736","_id":"DGAIEPBNLOQYER-UHFFFAOYSA-N","_score":17.083454,"drugbank":{"_license":"https://bit.ly/3Hikpvm","id":"DB09156"}},{"query":"3734","_id":"XQZXYNRDCRIARQ-UHFFFAOYSA-N","_score":17.083544},{"query":"2646","_id":"WDLWHQDACQUCJR-UHFFFAOYSA-N","_score":17.08357},{"query":"28112","_id":"OCZDCIYGECBNKL-UHFFFAOYSA-N","_score":17.083454},{"query":"4183806","_id":"OJLOPKGSLYJEMD-UHFFFAOYSA-N","_score":17.083454},{"query":"2462","_id":"VOVIALXJUBGFJZ-UHFFFAOYSA-N","_score":17.083454},{"query":"5381","_id":"OGQICQVSFDPSEI-UHFFFAOYSA-N","

In [6]:
data.head()

Unnamed: 0,pubchem_id,ulms_id,side_effect_name,drugbank_id
0,CID003062316,C1096328,central nervous system mass,DB01254
1,CID003062316,C0162830,Photosensitivity reaction,DB01254
2,CID003062316,C1611725,leukaemic infiltration brain,DB01254
3,CID003062316,C0541767,platelet adhesiveness abnormal,DB01254
4,CID003062316,C0242973,Ventricular dysfunction,DB01254


In [5]:
data.to_csv(output_file, index=False)

### Format the data to match the biomedgps format

More details on the data format can be found [here](https://open-prophetdb.github.io/biomedgps-data/graph_data_index/#knowledge-graph-file).

**Examples:**

| relation_type                  | resource | source_id | source_type | target_id   | target_type | source_name                    | target_name |
|--------------------------------|----------|-----------|-------------|-------------|-------------|--------------------------------|-------------|
| DGIDB::INHIBITOR::Gene:Compound| DGIDB    | ENTREZ:4311 | Gene        | MESH:D015244| Compound    | membrane metalloendopeptidase  | Thiorphan   |
| DGIDB::INHIBITOR::Gene:Compound| DGIDB    | ENTREZ:4311 | Gene        | MESH:C097292| Compound    | membrane metalloendopeptidase  | aladotrilat |


**NOTE:**
> Currently, we don't have enough information to determine the target type. Disease or Symptom? BioMedGPS::SideEffect::Compound:Symptom or BioMedGPS::SideEffect::Compound:Disease
> 
> So we keep both the disease and symptom ids in the target_id column. So we have two output files.
> 
> In the edge mapping stage, we will use the target_id to find the matched ids in the entity file. If a target_id can be matched to both disease and symptom ids, we will have two edges. Otherwise, we will only have one edge or none.


**WARNING:**
> we will have invalid rows in the output file, because we can't find the related drugbank ids for the pubchem ids. **Is MyChem.info a good source to convert pubchem ids to drugbank ids?**


In [1]:
import pandas as pd

input_file = "./ChSe-Decagon_monopharmacy_drugbank.csv"
formatted_data = pd.read_csv(input_file, sep=",")

formatted_data["relation_type"] = ""
formatted_data["raw_source_id"] = formatted_data["pubchem_id"]
formatted_data["source_id"] = "DrugBank:" + formatted_data["drugbank_id"]
formatted_data["source_type"] = "Compound"
formatted_data["source_name"] = ""
formatted_data["raw_target_id"] = formatted_data["ulms_id"]
formatted_data["target_id"] = "UMLS:" + formatted_data["ulms_id"]
formatted_data["target_type"] = ""
formatted_data["target_name"] = formatted_data["side_effect_name"]
formatted_data["resource"] = "BioSNAP"

invalid_formatted_data = formatted_data[
    (formatted_data["source_id"].isna()) | (formatted_data["target_id"].isna())
]

# Filter out the rows with empty source_id or target_id
formatted_data = formatted_data[
    (formatted_data["source_id"].notna()) & (formatted_data["target_id"].notna())
]

formatted_data = formatted_data[
    [
        "source_id",
        "source_type",
        "source_name",
        "target_id",
        "target_type",
        "target_name",
        "relation_type",
        "resource",
    ]
]

In [14]:
import os
import os.path as osp
import subprocess


def format_biosnap(filename, target_type="Disease"):
    def get_project_root():
        try:
            return osp.dirname(osp.dirname(os.getcwd()))
        except Exception as e:
            raise RuntimeError(f"Failed to determine project root: {e}")

    try:
        root_dir = get_project_root()
        print(f"Project root directory: {root_dir}")
    except RuntimeError as e:
        print(e)
        exit(1)

    database = "customdb"
    relations_path = osp.join(
        root_dir,
        "relations",
        "biosnap",
        filename,
    )
    output_dir = osp.join(root_dir, "formatted_relations", f"biosnap_{target_type.lower()}")
    entities_path = osp.join(root_dir, "entities.tsv")
    log_file = osp.join(output_dir, "log.txt")

    command = [
        "graph-builder",
        "--database",
        database,
        "-d",
        relations_path,
        "-o",
        output_dir,
        "-f",
        entities_path,
        "-n",
        "20",
        "--download",
        "--skip",
        "-l",
        log_file,
        "--debug",
    ]

    print("Executing command:", " ".join(command))

    try:
        subprocess.run(command, check=True)
    except FileNotFoundError:
        print(
            "Error: 'graph-builder' command not found. Make sure it is installed and available in the PATH."
        )
        exit(1)
    except subprocess.CalledProcessError as e:
        print(f"Error: Command execution failed with return code {e.returncode}")
        print(f"Output: {e.output}")
        exit(1)
    except Exception as e:
        print(f"Unexpected error: {e}")
        exit(1)

#### Determine the target_type for the side effects, map the side effect ids to the disease ids

In [15]:
formatted_data["relation_type"] = "BioMedGPS::SideEffect::Compound:Disease"
formatted_data["target_type"] = "Disease"

output_file = "./formatted_biosnap_compound_sideeffect_disease.tsv"
formatted_data.to_csv(output_file, index=False, sep="\t")

invalid_output_file = "./invalid_biosnap_compound_sideeffect_disease.tsv"
invalid_formatted_data.to_csv(invalid_output_file, index=False, sep="\t")

In [16]:
format_biosnap(output_file, target_type="Disease")

Project root directory: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data
Executing command: graph-builder --database customdb -d /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/relations/biosnap/./formatted_biosnap_compound_sideeffect_disease.tsv -o /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_disease -f /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/entities.tsv -n 20 --download --skip -l /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_disease/log.txt --debug


2024-11-12 11:49:44 - cli:156 - INFO - Run jobs with (output_dir: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_disease, db file/directory: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/relations/biosnap/./formatted_biosnap_compound_sideeffect_disease.tsv, databases: ('customdb',), download: True, skip: True)
2024-11-12 11:49:46 - customdb_parser:90 - INFO - Get 91051 relations
2024-11-12 11:49:47 - base_parser:484 - INFO - Found 91051 relations.
2024-11-12 11:49:47 - base_parser:784 - INFO - Found entity id map file, skip to generate it. If you want to regenerate it, please delete the file: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_disease/customdb.entity_id_map.json
2024-11-12 11:49:47 - base_parser:486 - INFO - Found 9158 entity ids in entity id map.
2024-11-12 11:49:47 - base_parser:500 - INFO - The number of relations before dropna: 91051
2024-11-12 11:49:47 - base_par

#### Determine the target_type for the side effects, map the side effect ids to the symptom ids

In [17]:
formatted_data["relation_type"] = "BioMedGPS::SideEffect::Compound:Phenotype"
formatted_data["target_type"] = "Phenotype"

output_file = "./formatted_biosnap_compound_sideeffect_phenotype.tsv"
formatted_data.to_csv(output_file, index=False, sep="\t")

invalid_output_file = "./invalid_biosnap_compound_sideeffect_phenotype.tsv"
invalid_formatted_data.to_csv(invalid_output_file, index=False, sep="\t")

In [18]:
format_biosnap(output_file, target_type="Phenotype")

Project root directory: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data
Executing command: graph-builder --database customdb -d /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/relations/biosnap/./formatted_biosnap_compound_sideeffect_phenotype.tsv -o /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_phenotype -f /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/entities.tsv -n 20 --download --skip -l /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_phenotype/log.txt --debug


2024-11-12 11:49:55 - cli:156 - INFO - Run jobs with (output_dir: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/formatted_relations/biosnap_phenotype, db file/directory: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/graph_data/relations/biosnap/./formatted_biosnap_compound_sideeffect_phenotype.tsv, databases: ('customdb',), download: True, skip: True)
2024-11-12 11:49:57 - customdb_parser:90 - INFO - Get 91051 relations
2024-11-12 11:49:58 - base_parser:484 - INFO - Found 91051 relations.
2024-11-12 11:49:58 - base_parser:792 - INFO - Start to get entity id map.
2024-11-12 11:53:37 - base_parser:834 - INFO - The number of entity ids: 9158
2024-11-12 11:53:37 - base_parser:486 - INFO - Found 9158 entity ids in entity id map.
2024-11-12 11:53:38 - base_parser:500 - INFO - The number of relations before dropna: 91051
2024-11-12 11:53:38 - base_parser:502 - INFO - The number of relations after dropna: 91051
2024-11-12 11:53:38 - base_parser:510 - INFO - Processing 