# Notebook to add confidence annotations to edges in MoA-net

This notebook attaches a confidence score to each protein-protein interaction (PPI) edge in MoA-net, so that the scores can be used alongside weighted rules in a probabilistic logic program.

In [1]:
import os
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
import json

tqdm.pandas()

## Load MoA-net:

In [2]:
KG_DIR = "../data/kg"
FIG_DIR = "../figures"
os.makedirs(FIG_DIR, exist_ok=True)

In [3]:
kg = pd.read_csv(f"{KG_DIR}/final_kg.tsv", sep="\t")
kg.drop_duplicates(inplace=True)
kg.head(2)

Unnamed: 0,source,source_node_type,target,target_node_type,edge_type
0,pubchem.compound:10607,Compound,ncbigene:3553,Gene,upregulates
1,pubchem.compound:10607,Compound,ncbigene:203068,Gene,downregulates


## Get edge pairs

Here, we extract all PPIs (edges whose source and target are both genes) from MoA-net:

In [4]:
gene_edges = set()

for row in kg.values:
    (
        source_id,
        source_type,
        target_id,
        target_type,
        edge_type,
    ) = row

    if source_type == "Gene" and target_type == "Gene":
        gene_edges.add(f"{source_id}_{target_id}")

len(gene_edges)

86786
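The same filter can also be written without an explicit loop. The following is a minimal sketch on a toy table, assuming the same five column names as `kg`; the `interacts` edge type and the toy gene IDs are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the MoA-net edge table (same column names as kg above).
# The "interacts" edge type and the short gene IDs are illustrative only.
toy_kg = pd.DataFrame(
    {
        "source": ["ncbigene:1", "pubchem.compound:10607", "ncbigene:2"],
        "source_node_type": ["Gene", "Compound", "Gene"],
        "target": ["ncbigene:2", "ncbigene:3553", "ncbigene:3"],
        "target_node_type": ["Gene", "Gene", "Gene"],
        "edge_type": ["interacts", "upregulates", "interacts"],
    }
)

# Keep only rows where both endpoints are genes.
is_ppi = (toy_kg["source_node_type"] == "Gene") & (
    toy_kg["target_node_type"] == "Gene"
)
ppi = toy_kg.loc[is_ppi]

# Build the same "<source>_<target>" keys as the loop above.
toy_gene_edges = set(ppi["source"] + "_" + ppi["target"])
```

Selecting columns by name also avoids relying on the column order when unpacking each row.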

## Get gene-gene metadata

Next, we need confidence information for the PPIs.

This information is obtained from the STRING database, which assigns a combined confidence score to each pair of interacting proteins.

The following files can be downloaded from [this page](https://string-db.org/cgi/download?sessionId=bALlO0OwuwIq&species_text=Homo+sapiens).

Note that STRING stores the combined score multiplied by 1000 so that it is an integer (see the [STRING FAQ](https://string-db.org/help/faq/)), so we divide by 1000 to recover a probability.

> "The combined score is computed by combining the probabilities from the different evidence channels and corrected for the probability of randomly observing an interaction." ([von Mering *et al.*](https://pubmed.ncbi.nlm.nih.gov/15608232/))
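To make the quoted description concrete, here is a simplified sketch of a noisy-OR style combination with a prior correction, followed by the divide-by-1000 conversion. It is an illustration only: the function name is hypothetical, the `prior` value of 0.041 is an assumption, and STRING's additional homology correction for some channels is left out.

```python
from math import prod


def combine_channel_scores(channel_scores, prior=0.041):
    """Combine per-channel probabilities (0..1) into one score.

    Simplified sketch: the prior value is an assumption and the
    homology correction applied by STRING is not modelled here.
    """
    # Remove the prior from each channel, clamping scores below the prior.
    no_prior = [(max(s, prior) - prior) / (1.0 - prior) for s in channel_scores]
    # Probability that at least one channel supports the interaction.
    combined = 1.0 - prod(1.0 - s for s in no_prior)
    # Add the prior back exactly once.
    return combined * (1.0 - prior) + prior


# Converting a stored integer score back to a probability:
raw_score = 471
probability = raw_score / 1000
```

With a single channel the function returns that channel's score unchanged, and adding a second supporting channel can only raise the combined score.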

In [5]:
stringdb_df = pd.read_csv(
    "../data/mappings/9606.protein.links.v12.0.txt.gz", compression="gzip", sep=" "
)
stringdb_df.head(2)

Unnamed: 0,protein1,protein2,combined_score
0,9606.ENSP00000000233,9606.ENSP00000356607,173
1,9606.ENSP00000000233,9606.ENSP00000427567,154


Now, map the STRING protein IDs to gene symbols, using the `preferred_name` column of the STRING protein info file.

In [6]:
ensembl_gene_mapper = (
    pd.read_csv(
        "../data/mappings/9606.protein.info.v12.0.txt.gz", sep="\t", compression="gzip"
    )
    .set_index("#string_protein_id")
    .to_dict()["preferred_name"]
)

stringdb_df["gene1"] = stringdb_df["protein1"].map(ensembl_gene_mapper)
stringdb_df["gene2"] = stringdb_df["protein2"].map(ensembl_gene_mapper)

In [7]:
stringdb_df.dropna(subset=["gene1", "gene2"], inplace=True)

In [8]:
stringdb_df.head(5)

Unnamed: 0,protein1,protein2,combined_score,gene1,gene2
0,9606.ENSP00000000233,9606.ENSP00000356607,173,ARF5,RALGPS2
1,9606.ENSP00000000233,9606.ENSP00000427567,154,ARF5,FHDC1
2,9606.ENSP00000000233,9606.ENSP00000253413,151,ARF5,ATP6V1E1
3,9606.ENSP00000000233,9606.ENSP00000493357,471,ARF5,CYTH2
4,9606.ENSP00000000233,9606.ENSP00000324127,201,ARF5,PSD3
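Before dropping unmapped rows it can be useful to count them. The sketch below uses a toy frame and a toy mapper; `9606.ENSP_UNKNOWN` is a made-up ID standing in for a protein that is missing from the info file. `Series.map` with a plain `dict` returns `NaN` for keys that are not present, which is what `dropna` then removes.

```python
import pandas as pd

# Toy link table; the second protein ID is invented and has no mapping.
toy_links = pd.DataFrame(
    {
        "protein1": ["9606.ENSP00000000233", "9606.ENSP_UNKNOWN"],
        "combined_score": [173, 154],
    }
)
toy_mapper = {"9606.ENSP00000000233": "ARF5"}

# Unmapped IDs become NaN when mapping with a plain dict.
toy_links["gene1"] = toy_links["protein1"].map(toy_mapper)
n_unmapped = int(toy_links["gene1"].isna().sum())

# Drop the rows that could not be mapped, as in the cell above.
toy_links = toy_links.dropna(subset=["gene1"])
```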


Next, map the HGNC symbols to NCBI gene IDs.

The mapping file was obtained from [HGNC](https://www.genenames.org/download/custom/) using a custom download with the following columns selected:

- Approved Symbol
- Previous Symbols
- NCBI Gene ID

In [9]:
symbol_to_gene_df = pd.read_csv(
    "../data/mappings/symbol_to_ncbigene.txt", sep="\t", dtype=str
)

In [10]:
gene_id_mapper = defaultdict(str)

for row in symbol_to_gene_df.values:
    (symbol, prev_symbol, gene_id) = row

    gene_id_mapper[symbol] = f"ncbigene:{gene_id}"

    if pd.notna(prev_symbol):  ## add all previous symbols as well
        for symn in prev_symbol.split(","):
            gene_id_mapper[symn] = f"ncbigene:{gene_id}"

len(gene_id_mapper)

62220
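A slightly more defensive variant of this mapper is sketched below on toy rows. It assumes, without having checked the downloaded file, that previous symbols may be separated by a comma followed by a space and that some rows may lack an NCBI gene ID. All symbols and IDs in the example are made up.

```python
import pandas as pd

# Toy rows in the (symbol, previous symbols, NCBI gene ID) layout.
toy_rows = [
    ("GENE1", "OLD1, OLD2", "101"),
    ("GENE2", float("nan"), "102"),
    ("GENE3", float("nan"), float("nan")),  # no NCBI gene ID available
]

toy_mapper = {}
for symbol, prev_symbol, gene_id in toy_rows:
    if pd.isna(gene_id):
        # Skip rows without an ID instead of storing "ncbigene:nan".
        continue

    curie = f"ncbigene:{gene_id}"
    # The approved symbol always takes precedence.
    toy_mapper[symbol] = curie

    if pd.notna(prev_symbol):
        for old_symbol in prev_symbol.split(","):
            # Strip surrounding whitespace; never overwrite an existing key.
            toy_mapper.setdefault(old_symbol.strip(), curie)
```

Using `setdefault` for previous symbols means a retired symbol never replaces an entry that is already present for a currently approved symbol.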

Map the gene symbols to NCBI gene IDs:

In [11]:
stringdb_df["gene1_id"] = stringdb_df["gene1"].map(gene_id_mapper)
stringdb_df["gene2_id"] = stringdb_df["gene2"].map(gene_id_mapper)
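Because `gene_id_mapper` is a `defaultdict(str)`, symbols without an entry are mapped to an empty string rather than `NaN`: pandas uses the default value of a `dict` subclass that defines `__missing__`. A small sketch of that behaviour, with a made-up gene ID:

```python
from collections import defaultdict

import pandas as pd

# "ncbigene:111" is a placeholder ID used only for this illustration.
toy_ids = defaultdict(str, {"ARF5": "ncbigene:111"})

symbols = pd.Series(["ARF5", "NOT_A_GENE"])
mapped_ids = symbols.map(toy_ids)

# Missing symbols show up as "" (falsy), not NaN.
n_missing = int((mapped_ids == "").sum())
```

This is why a plain truthiness check on the mapped IDs is enough to skip unmapped genes later on.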

Build a dictionary from PPI keys to STRING combined scores, storing each pair in both directions:

In [12]:
string_ppi = defaultdict(int)

for row in tqdm(stringdb_df.values):
    (
        protein1,
        protein2,
        combined_score,
        gene1,
        gene2,
        gene1_id,
        gene2_id,
    ) = row

    if gene1_id and gene2_id:
        string_ppi[f"{gene1_id}_{gene2_id}"] = combined_score
        string_ppi[f"{gene2_id}_{gene1_id}"] = combined_score

100%|██████████| 13715404/13715404 [00:34<00:00, 399315.69it/s]
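Storing both directions doubles the number of keys. An alternative, sketched here with a hypothetical helper `edge_key`, is to store one order-independent key per pair. This would only work if the same normalisation were applied to the MoA-net edge keys, and it discards edge direction.

```python
def edge_key(gene_a, gene_b):
    """Return one key per unordered pair by sorting the two IDs."""
    first, second = sorted((gene_a, gene_b))
    return f"{first}_{second}"


# Toy score table keyed by the normalised pair.
toy_scores = {edge_key("ncbigene:2", "ncbigene:1"): 471}

# Lookup succeeds regardless of the order of the two genes.
score = toy_scores.get(edge_key("ncbigene:1", "ncbigene:2"))
```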


Let's see what proportion of our PPIs have a score in the STRING DB:

In [13]:
len(set(string_ppi.keys()) & gene_edges) / len(gene_edges)

0.7249786831977508
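The edges that have no STRING score can be listed with a set difference. A minimal sketch on toy data (all IDs and scores are illustrative):

```python
toy_gene_edges = {
    "ncbigene:1_ncbigene:2",
    "ncbigene:2_ncbigene:3",
    "ncbigene:3_ncbigene:4",
}
toy_string_ppi = {
    "ncbigene:1_ncbigene:2": 471,
    "ncbigene:2_ncbigene:1": 471,
    "ncbigene:9_ncbigene:8": 154,
}

scored_keys = set(toy_string_ppi)

# Edges with and without a STRING score.
covered = toy_gene_edges & scored_keys
uncovered = toy_gene_edges - scored_keys

# Same proportion as computed in the cell above.
coverage = len(covered) / len(toy_gene_edges)
```

Inspecting `uncovered` helps distinguish edges that STRING does not report from edges lost during ID mapping.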

Okay, we'll make a final file which maps the PPIs to confidences.

Divide the confidences by 1000 to convert back to probabilities:

In [14]:
mapped = {key: val / 1000 for key, val in string_ppi.items() if key in gene_edges}

In [15]:
# save the mappings to confidence as a json file:
with open(f"{KG_DIR}/stringdb_ppi_confidences.json", "w") as f:
    json.dump(mapped, f, indent=2)
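A downstream consumer could read the file back and look up a confidence per edge. The sketch below writes a toy file to a temporary directory so that it is self-contained. The helper `edge_confidence` and the fallback value `DEFAULT_CONFIDENCE` are hypothetical: how edges without a STRING score should be weighted is a modelling decision this notebook does not make.

```python
import json
import os
import tempfile

toy_confidences = {"ncbigene:1_ncbigene:2": 0.471}

# Round-trip through a JSON file with the same layout as the one saved above.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stringdb_ppi_confidences.json")
    with open(path, "w") as f:
        json.dump(toy_confidences, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

# Hypothetical fallback for edges that have no STRING score.
DEFAULT_CONFIDENCE = 0.5


def edge_confidence(source_id, target_id, confidences, default=DEFAULT_CONFIDENCE):
    """Return the stored confidence for an edge, or the fallback value."""
    return confidences.get(f"{source_id}_{target_id}", default)
```

For example, `edge_confidence("ncbigene:1", "ncbigene:2", loaded)` returns the stored value, while an unknown pair returns the fallback.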