# Introduction to Network Medicine

Network medicine applies network science to understand the complexity of human diseases. In this tutorial, we will walk through a practical workflow using the `netmedpy` package to explore how molecular interactions and disease-gene associations can be integrated to gain mechanistic insights and identify potential therapeutic opportunities.

This notebook will guide you through:
- Building a protein-protein interaction (PPI) network from STRING data.
- Identifying Vitamin D protein targets from multiple compound-target interaction databases using `CPIExtract`.
- Collecting disease-associated genes from DisGeNET.
- Calculating network-based proximity between Vitamin D targets and disease genes.
- Comparing proximity z-scores using two different null models.

Let's get started!

## Load Required Libraries

We start by importing the necessary Python libraries, including `netmedpy`, `cpiextract`, and standard packages for data manipulation and visualization.


In [1]:
import pandas as pd
import numpy as np
import requests
import zipfile
import gzip
import shutil
import os
import mygene
import networkx as nx
import json
import matplotlib.pyplot as plt
import seaborn as sns

import tools
import netmedpy
from cpiextract import Comp2Prot


## Download STRING PPI Interactions

We download and process the human protein-protein interaction dataset from STRING (version 12). We then:
- Convert Ensembl protein IDs to HGNC symbols.
- Keep physical interactions with a combined score above 300.
- Extract the largest connected component (LCC).
- Save the resulting network in `output/string_ppi_filtered.csv`.


In [2]:
# Define the URL for the STRING PPI dataset
string_url = "https://stringdb-downloads.org/download/protein.physical.links.v12.0/9606.protein.physical.links.v12.0.txt.gz"

# Define paths for temporary files
string_gz_path = './tmp_string/string.gz'

# Download and extract STRING data
print("Downloading STRING dataset...")
tools.download_file(string_url, string_gz_path)
tools.ungz_file(string_gz_path, "./tmp_string/string_data")

print("Reading STRING dataset...")
string_df = pd.read_csv("./tmp_string/string_data/string", sep=r"\s+", engine="python")

# Clean up temporary files
shutil.rmtree("./tmp_string")

# Remove prefixes from protein names
print("Processing protein names...")
string_df["protein1"] = string_df["protein1"].str.replace("9606.", "", regex=False)
string_df["protein2"] = string_df["protein2"].str.replace("9606.", "", regex=False)

# Convert Ensembl IDs to HGNC symbols
ens_to_hgnc = tools.ensembl_to_hgnc(string_df)
string_df["HGNC1"] = string_df["protein1"].map(ens_to_hgnc)
string_df["HGNC2"] = string_df["protein2"].map(ens_to_hgnc)

# Remove entries with unknown gene mappings
string_df = string_df.query("HGNC1 != 'Unknown' and HGNC2 != 'Unknown'")
string_df = string_df.rename(columns={"combined_score": "weight"})

filtered_df = string_df.query("weight > 300")
G_string = nx.from_pandas_edgelist(filtered_df, 'HGNC1', 'HGNC2', create_using=nx.Graph())

G_string = netmedpy.extract_lcc(G_string.nodes, G_string)

print(f"Nodes: {len(G_string.nodes)}")
print(f"Edges: {len(G_string.edges)}")


# Save to CSV (create the output directory first if it does not exist)
os.makedirs("output", exist_ok=True)
df_edges = nx.to_pandas_edgelist(G_string)
df_edges.to_csv("output/string_ppi_filtered.csv", index=False)



Downloading STRING dataset...
File downloaded successfully and saved to ./tmp_string/string.gz
File extracted to: ./tmp_string/string_data/string
Reading STRING dataset...
Processing protein names...


Input sequence provided is already in string format. No operation performed
Input sequence provided is already in string format. No operation performed
364 input query terms found no hit:	['ENSP00000053469', 'ENSP00000074304', 'ENSP00000155858', 'ENSP00000224807', 'ENSP00000224862', 'ENS


Nodes: 16785
Edges: 280081
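
To make the filtering and LCC steps concrete, here is a minimal sketch on a toy edge list. The gene names and scores are made up, and the sketch assumes that `netmedpy.extract_lcc` is conceptually equivalent to taking the subgraph induced by the largest connected component, which is what the plain `networkx` calls below do.

```python
import networkx as nx
import pandas as pd

# Toy edge list with hypothetical gene symbols and STRING-style scores (0-1000).
toy_edges = pd.DataFrame({
    "HGNC1": ["A", "B", "C", "X", "P"],
    "HGNC2": ["B", "C", "A", "Y", "Q"],
    "weight": [900, 450, 310, 700, 150],
})

# Keep edges above the same threshold used in the cell above.
kept = toy_edges[toy_edges["weight"] > 300]
G_toy = nx.from_pandas_edgelist(kept, "HGNC1", "HGNC2")

# Largest connected component: the biggest set of mutually reachable nodes.
lcc_nodes = max(nx.connected_components(G_toy), key=len)
G_lcc = G_toy.subgraph(lcc_nodes).copy()

print(sorted(G_lcc.nodes), G_lcc.number_of_edges())
```

The edge `P`-`Q` is dropped by the score filter, and the pair `X`-`Y` survives the filter but is discarded because it is not connected to the main component.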


## Extract Vitamin D Targets with CPIExtract

We extract protein targets of Cholecalciferol (Vitamin D3, PubChem CID: 5280795) using the `Comp2Prot` module from `cpiextract`. The tool integrates multiple drug-target interaction databases, including ChEMBL, STITCH, BindingDB, CTD, and more.


### Construct Compound-Target Database

We unzip and load all required compound-target databases into memory. These databases provide curated chemical–protein interaction data from various sources.

The databases are extracted to `output/cpie_Databases`.


In [3]:
# Define database directory path
data_path = "./output/cpie_Databases"

if os.path.exists(data_path):
    shutil.rmtree(data_path)

tools.unzip_file("../VitaminD/supplementary/sup_data/cpie_databases/Databases.zip", data_path)

# Load databases into pandas DataFrames

# BindingDB (downloaded on 03/30/2023)
file_path = os.path.join(data_path, 'BindingDB.csv')
BDB_data = pd.read_csv(file_path, sep=',', usecols=['CID', 'Ligand SMILES', 'Ligand InChI', 'BindingDB MonomerID',
                                                    'Ligand InChI Key', 'BindingDB Ligand Name',
                                                    'Target Name Assigned by Curator or DataSource',
                                                    'Target Source Organism According to Curator or DataSource',
                                                    'Ki (nM)', 'IC50 (nM)', 'Kd (nM)', 'EC50 (nM)', 'pH', 'Temp (C)',
                                                    'Curation/DataSource',
                                                    'UniProt (SwissProt) Entry Name of Target Chain',
                                                    'UniProt (SwissProt) Primary ID of Target Chain'],
                         on_bad_lines='skip')

# STITCH (downloaded on 02/22/2023)
file_path = os.path.join(data_path, 'STITCH.tsv')
sttch_data = pd.read_csv(file_path, sep='\t')

# ChEMBL (downloaded on 02/01/2024)
file_path = os.path.join(data_path, 'ChEMBL.csv')
chembl_data = pd.read_csv(file_path, sep=',')

# CTD
file_path = os.path.join(data_path, 'CTD.csv')
CTD_data = pd.read_csv(file_path, sep=',')

# DTC (downloaded on 02/24/2023)
file_path = os.path.join(data_path, 'DTC.csv')
DTC_data = pd.read_csv(file_path, sep=',', usecols=['CID', 'compound_id', 'standard_inchi_key', 'target_id',
                                                    'gene_names', 'wildtype_or_mutant', 'mutation_info',
                                                    'standard_type', 'standard_relation', 'standard_value',
                                                    'standard_units', 'activity_comment', 'pubmed_id', 'doc_type'])

# DrugBank (downloaded on 03/02/2022)
file_path = os.path.join(data_path, 'DB.csv')
DB_data = pd.read_csv(file_path, sep=',')

# DrugCentral (downloaded on 02/25/2024)
file_path = os.path.join(data_path, 'DrugCentral.csv')
DC_data = pd.read_csv(file_path, sep=',')

# Store all databases in a dictionary
dbs = {
    'chembl': chembl_data,
    'bdb': BDB_data,
    'stitch': sttch_data,
    'ctd': CTD_data,
    'dtc': DTC_data,
    'db': DB_data,
    'dc': DC_data
}

Files extracted to: ./output/cpie_Databases




### Search for Vitamin D Targets

Using the integrated databases, we now extract the protein targets of Cholecalciferol. The targets are mapped to HGNC symbols for compatibility with the PPI network and then saved in `output/vd_targets.json`.



In [4]:
# Cholecalciferol (PubChem CID: 5280795)
comp_id = 5280795

# Initialize Comp2Prot
C2P = Comp2Prot('local', dbs=dbs)

# Search for interactions
comp_dat, status = C2P.comp_interactions(input_id=comp_id)

# Extract HGNC symbols
vd_targets = {"Vitamin D": list(comp_dat.hgnc_symbol)} 

# Save extracted targets
with open('./output/vd_targets.json', 'w') as f:
    json.dump(vd_targets, f)



pc done!
chembl done!
bdb done!
stitch done!
ctd done!
dtc done!
otp done!
dc done!
db done!
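
Because several databases are merged, the same target can plausibly appear more than once, and some rows may lack a symbol. The sketch below shows one way to clean such a column before writing JSON. The `toy_comp_dat` table and its gene names are hypothetical stand-ins for `comp_dat`; whether the real table contains duplicates or missing symbols is an assumption, not something shown above.

```python
import json
import pandas as pd

# Hypothetical stand-in for the `comp_dat` table returned by the search above.
toy_comp_dat = pd.DataFrame(
    {"hgnc_symbol": ["GENE_B", "GENE_A", None, "GENE_B", "GENE_C"]}
)

# Drop missing symbols and repeated targets, then sort for a stable file.
symbols = toy_comp_dat["hgnc_symbol"].dropna().drop_duplicates()
clean_targets = {"Vitamin D": sorted(symbols)}

print(json.dumps(clean_targets))
```

Dropping missing values matters for the saved file as well: by default `json.dump` writes a float `NaN` as the bare token `NaN`, which strict JSON parsers reject.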




## Construct Disease Gene Sets

We extract disease-gene associations for four conditions: Huntington's disease, inflammation, rickets, and vitamin D deficiency. These associations come from DisGeNET, and we keep only those with a gene-disease association score (`Score_gda`) above 0.1. The results are saved in `output/disease_genes.json`.

In [5]:
# Directory containing the disease genes
dis_gene_path = "input_data/disease_genes"

disease_file_names = {
    "Huntington":"DGN_Huntington.csv",
    "Inflammation": "DGN_inflammation.csv",
    "Rickets": "DGN_Rickets.csv",
    "Vit. D deficiency": "DGN_VDdeff.csv"
}

disease_genes = {}

# Load each file and keep associations with Score_gda > 0.1
for name, file_name in disease_file_names.items():
    path = os.path.join(dis_gene_path, file_name)

    df = pd.read_csv(path)
    df = df.query("Score_gda > 0.1")

    disease_genes[name] = list(df.Gene)

# Save file
with open('./output/disease_genes.json', 'w') as f:
    json.dump(disease_genes, f)

## Verify Network Coverage

We check how many of the disease-associated genes and Vitamin D targets are present in the PPI network. This step ensures that the proximity analysis is based on genes with known interactions.


In [6]:
# Load PPI network
ppi = pd.read_csv("output/string_ppi_filtered.csv")
ppi = nx.from_pandas_edgelist(ppi, 'source', 'target', create_using=nx.Graph())

# Load disease genes
with open('./output/disease_genes.json', 'r') as f:
    disease_genes = json.load(f)

# Load Vitamin D targets
with open('./output/vd_targets.json', 'r') as f:
    dtargets = json.load(f)

# Keep only associations existing in the PPI
nodes = set(ppi.nodes)
for name, genes in disease_genes.items():
    disease_genes[name] = set(genes) & nodes
    print(f"{name}: {len(disease_genes[name])} associations in PPI")

for name, targets in dtargets.items():
    dtargets[name] = set(targets) & nodes
    print(f"{name}: {len(dtargets[name])} targets in PPI")


Huntington: 46 associations in PPI
Inflammation: 164 associations in PPI
Rickets: 11 associations in PPI
Vit. D deficiency: 5 associations in PPI
Vitamin D: 24 targets in PPI
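
The counts above show how many genes remain but not how many were lost. A small helper along the following lines can report both. It is a sketch on toy data; `coverage_report` and the gene names are hypothetical and not part of `netmedpy`.

```python
def coverage_report(gene_sets, network_nodes):
    """Summarize how many genes of each set are present in the network."""
    report = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        report[name] = {
            "total": len(genes),
            "in_network": len(genes & network_nodes),
            "missing": sorted(genes - network_nodes),
        }
    return report


# Toy inputs: four network nodes and two hypothetical gene sets.
toy_nodes = {"G1", "G2", "G3", "G4"}
toy_sets = {"Disease X": ["G1", "G2", "G9"], "Disease Y": ["G3"]}

report = coverage_report(toy_sets, toy_nodes)
for name, row in report.items():
    print(f"{name}: {row['in_network']}/{row['total']} in network, missing {row['missing']}")
```

Sets that end up with only a handful of genes in the network, such as the five vitamin D deficiency genes above, give noisier proximity estimates, so it is worth knowing how much was filtered out.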


## Compute Random Walk Based Distances

Using `netmedpy`, we calculate the pairwise biased random walk distances between all nodes in the PPI network. These distances will later be used to compute the proximity between drug targets and disease genes. The resulting matrix is stored as a pickle file in `output/ppi_distances_BRW.pkl`.


In [7]:
# Calculate random-walk-based distances between all pairs of genes
dmat = netmedpy.all_pair_distances(
    ppi,
    distance='biased_random_walk',
    reset=0.3
)

# Save distances for further use
netmedpy.save_distances(dmat,"output/ppi_distances_BRW.pkl")
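
For intuition about what a random-walk-based measure captures, the sketch below computes plain random walk with restart (RWR) visiting probabilities on a four-node path graph. This is a generic textbook formulation, not `netmedpy`'s implementation: how the package turns such probabilities into distances, and whether its `reset` argument corresponds exactly to the restart probability used here, are assumptions. The function `rwr_scores` is hypothetical.

```python
import networkx as nx
import numpy as np


def rwr_scores(G, seed, restart=0.3):
    """Stationary visiting probabilities of a random walk with restart.

    Assumes G has no isolated nodes (every column of A has a nonzero sum).
    """
    nodes = list(G.nodes)
    A = nx.to_numpy_array(G, nodelist=nodes)
    W = A / A.sum(axis=0, keepdims=True)  # column-stochastic transition matrix
    e = np.zeros(len(nodes))
    e[nodes.index(seed)] = 1.0
    # Solve p = restart * e + (1 - restart) * W @ p for p.
    p = restart * np.linalg.solve(np.eye(len(nodes)) - (1 - restart) * W, e)
    return dict(zip(nodes, p))


toy_path = nx.path_graph(4)  # 0 - 1 - 2 - 3
scores = rwr_scores(toy_path, seed=0, restart=0.3)
for node, prob in scores.items():
    print(node, round(prob, 4))
```

On this path the visiting probability falls off with every step away from the seed, which is the behaviour a random-walk-based distance builds on: nodes that the walker rarely reaches are treated as far away.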

## Compute Proximity Using Log Binning Null Model

We compute proximity scores between Vitamin D targets and disease gene sets using `netmedpy`’s screening function. This step uses a log-binning-based null model to estimate z-scores, which help assess the significance of the observed proximities.


In [8]:
# Calculate proximity between Vitamin D targets and Diseases
proximity_lb = netmedpy.screening(
    dtargets,
    disease_genes,
    ppi,
    dmat,
    score="proximity",
    properties=["z_score"],
    null_model="log_binning",
    n_iter=10000,
    n_procs=10
)

zscore_lb = proximity_lb['z_score'].T
zscore_lb = zscore_lb.sort_values(by='Vitamin D')
zscore_lb

2025-04-03 22:01:04,085	INFO worker.py:1816 -- Started a local Ray instance.


(_calculate_score pid=250541) Vitamin D-Vit. D deficiency finished
(_calculate_score pid=250549) Vitamin D-Rickets finished
(_calculate_score pid=250544) Vitamin D-Huntington finished


| Disease | Vitamin D |
|---|---|
| Vit. D deficiency | -3.122121 |
| Inflammation | -2.134047 |
| Huntington | -1.552955 |
| Rickets | -1.498974 |
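
The z-scores in this table compare the observed proximity with the values obtained from randomized gene sets. The standard calculation can be sketched as follows. The numbers are toy values, and the helper `z_score` is hypothetical. It illustrates the general formula rather than the internals of `netmedpy.screening`.

```python
import numpy as np


def z_score(observed, null_samples):
    """Standardize an observed value against an empirical null distribution."""
    null_samples = np.asarray(null_samples, dtype=float)
    return (observed - null_samples.mean()) / null_samples.std()


# Toy null distribution of proximities from five hypothetical randomizations.
toy_null = [1.0, 2.0, 3.0, 4.0, 5.0]
z = z_score(1.0, toy_null)
print(round(z, 4))
```

A negative z-score means the observed proximity is smaller than expected under the null model, that is, the two gene sets sit closer together in the network than randomly chosen sets do.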


## Repeat Proximity Calculation with Degree-Matched Null Model

Here, we repeat the proximity analysis using a different null model: degree matching. This approach controls for node degree when building the null distribution, so it serves as a robustness check on the log-binning results.


In [9]:
proximity_dm = netmedpy.screening(
    dtargets,
    disease_genes,
    ppi,
    dmat,
    score="proximity",
    properties=["z_score"],
    null_model="degree_match",
    n_iter=10000,
    n_procs=10
)

zscore_dm = proximity_dm['z_score'].T
zscore_dm = zscore_dm.sort_values(by='Vitamin D')
zscore_dm

2025-04-03 22:01:50,116	INFO worker.py:1816 -- Started a local Ray instance.


(_calculate_score pid=251812) Vitamin D-Vit. D deficiency finished
(_calculate_score pid=251806) Vitamin D-Rickets finished
(_calculate_score pid=251805) Vitamin D-Huntington finished


| Disease | Vitamin D |
|---|---|
| Vit. D deficiency | -3.451563 |
| Inflammation | -2.44133 |
| Huntington | -1.947869 |
| Rickets | -1.743998 |
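
The idea behind a degree-matched null model is to replace each gene with a random gene of the same degree, so that hubs are swapped for hubs. The sketch below shows this on a toy path graph. It is a simplified illustration under stated assumptions: `sample_degree_matched` is a hypothetical helper, it matches degrees exactly, and `netmedpy`'s own sampling may differ in its details.

```python
import random
from collections import defaultdict

import networkx as nx


def sample_degree_matched(G, targets, rng):
    """Draw one distinct random node per target, matching each target's degree."""
    by_degree = defaultdict(list)
    for node, deg in G.degree():
        by_degree[deg].append(node)

    chosen = []
    for t in targets:
        # Candidates share the target's degree and have not been drawn yet.
        pool = [n for n in by_degree[G.degree(t)] if n not in chosen]
        chosen.append(rng.choice(pool))
    return chosen


G_toy = nx.path_graph(10)  # end nodes have degree 1, inner nodes degree 2
rng = random.Random(0)
random_set = sample_degree_matched(G_toy, [0, 3, 4], rng)
print(random_set, [G_toy.degree(n) for n in random_set])
```

Exact matching breaks down for rare degrees, where few or no alternative nodes exist. Grouping nodes into bins of similar degree, as the log-binning model does, is one way around that.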


## Compare Z-scores from Both Null Models

Finally, we merge the proximity z-scores from both null models into a single table. This comparison allows us to evaluate the robustness of the Vitamin D–disease associations across different randomization strategies.


In [10]:
zscore_lb.columns = ["Log Binning"]
zscore_dm.columns = ["Degree Match"]

zscore = pd.merge(zscore_lb,zscore_dm, left_index=True, right_index=True)

zscore

| Disease | Log Binning | Degree Match |
|---|---|---|
| Vit. D deficiency | -3.122121 | -3.451563 |
| Inflammation | -2.134047 | -2.44133 |
| Huntington | -1.552955 | -1.947869 |
| Rickets | -1.498974 | -1.743998 |
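
One simple way to quantify the agreement between the two null models is to check whether they rank the diseases identically and how far the z-scores shift. The sketch below rebuilds the table above by hand so that it runs on its own. The name `zscore_toy` is hypothetical; in the notebook the same checks could be applied to `zscore` directly.

```python
import pandas as pd

# Values copied from the comparison table above.
zscore_toy = pd.DataFrame(
    {
        "Log Binning": [-3.122121, -2.134047, -1.552955, -1.498974],
        "Degree Match": [-3.451563, -2.44133, -1.947869, -1.743998],
    },
    index=["Vit. D deficiency", "Inflammation", "Huntington", "Rickets"],
)

# Do both null models order the diseases the same way?
same_order = bool(
    (zscore_toy["Log Binning"].rank() == zscore_toy["Degree Match"].rank()).all()
)

# Shift in z-score when moving from log binning to degree matching.
zscore_toy["Difference"] = zscore_toy["Degree Match"] - zscore_toy["Log Binning"]

print(same_order)
print(zscore_toy)
```

For these results the ordering is identical under both models, and the degree-matched z-scores are more negative for all four conditions.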
