# Analyzing Subnetworks

This notebook provides an example of analyzing subnetworks for a study that uses drugs to target transcription factors. We care about networks in which proteins bind, inhibit, or activate each other.

In [1]:
!pip install -q git+https://github.com/gyorilab/adeft.git

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
indra 1.21.0 requires markupsafe<2.1.0, but you have markupsafe 2.1.5 which is incompatible.
indra 1.21.0 requires ndex2==2.0.1, but you have ndex2 3.5.0 which is incompatible.
indra 1.21.0 requires pandas<1.3, but you have pandas 2.2.2 which is incompatible.
indra 1.21.0 requires protmapper>=0.0.23, but you have protmapper 0.0.21 which is incompatible.
indra 1.21.0 requires pysb<=1.9.1,>=1.3.0, but you have pysb 1.14.0 which is incompatible.
indra 1.21.0 requires sympy==1.3, but you have sympy 1.12 which is incompatible.
pgadmin4 7.8 requires SQLAlchemy==2.*, but you have sqlalchemy 1.3.24 which is incompatible.
flask-sqlalchemy 3.0.5 requires sqlalchemy>=1.4.18, but you have sqlalchemy 1.3.24 which is incompatible.
belql 0.3.0 requires Flask<3.0.0,>=2.3.2, but you have flask 2.2.5 which is incompat

In [2]:
!pip install -q -r requirements.txt

## STEP 1: Import MSstats Dataset

First, we import an MSstats dataset as a pandas DataFrame. The dataset is the output of the MSstats `groupComparison` function, which contains per-protein p-values and log2 fold changes.

We filter the dataset to a smaller size based on the adjusted p-value and the comparison label, while always keeping the main drug targets. You can adjust these parameters.

In [10]:
import pandas as pd

P_VALUE_LOGFC_PATH = "model.csv" # Set this path yourself
LABELS_FILTER = ["DMSO-PF477736"]
#LABELS_FILTER = ["DMSO-DbET6"]
#LABELS_FILTER = ["DMSO-Nuc"]
P_VALUE_FILTER = 0.05 # Adjust this yourself
MAIN_TARGETS = ["CHK1_HUMAN"]

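def construct_pvalue_logfc_df(filename):
    """Return a filtered data frame from the given data file."""
    pandas_df = pd.read_csv(filename)
    # Keep rows that are significant with no flagged issue, plus the main targets
    # regardless of significance, restricted to the selected comparison labels.
    is_significant = pandas_df['issue'].isnull() & (pandas_df['adj.pvalue'] < P_VALUE_FILTER)
    is_main_target = pandas_df['Protein'].isin(MAIN_TARGETS)
    has_label = pandas_df['Label'].isin(LABELS_FILTER)
    return pandas_df.loc[(is_significant | is_main_target) & has_label]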

pvalue_logfc_df = construct_pvalue_logfc_df(P_VALUE_LOGFC_PATH)
pvalue_logfc_df

Unnamed: 0,Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue,issue,MissingPercentage,ImputationPercentage
30606,CHK1_HUMAN,DMSO-PF477736,0.115541,0.157794,0.732229,238.0,0.464749,0.9980494,,0.334667,0.0
32541,CLH1_HUMAN,DMSO-PF477736,-0.269784,0.056036,-4.814489,260.0,2.506764e-06,0.0030946,,0.143333,0.0
63546,FCGRN_HUMAN,DMSO-PF477736,-0.674851,0.132991,-5.074427,179.0,9.674775e-07,0.001592468,,0.195918,0.0
115656,NECP2_HUMAN,DMSO-PF477736,0.693701,0.097077,7.145875,259.0,9.076739e-12,2.241047e-08,,0.04,0.0
152151,RFA1_HUMAN,DMSO-PF477736,-0.603824,0.07619,-7.925239,260.0,6.705747e-14,3.311298e-10,,0.120667,0.0
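
To see why `CHK1_HUMAN` survives the filter despite its high adjusted p-value, the following minimal sketch applies the same three conditions to a toy table. The rows and the helper name `filter_rows` are made up for illustration; only the column names come from the dataset above.

```python
import pandas as pd

# Toy rows mimicking the groupComparison columns used by the filter (values are made up).
toy_df = pd.DataFrame({
    "Protein": ["CHK1_HUMAN", "AAA_HUMAN", "BBB_HUMAN", "CCC_HUMAN"],
    "Label": ["DMSO-PF477736", "DMSO-PF477736", "DMSO-PF477736", "DMSO-Nuc"],
    "adj.pvalue": [0.99, 0.01, 0.50, 0.01],
    "issue": [None, None, None, None],
})

def filter_rows(df, labels, p_cutoff, main_targets):
    """Keep significant, issue-free rows plus main targets, for the chosen labels."""
    significant = df["issue"].isnull() & (df["adj.pvalue"] < p_cutoff)
    main_target = df["Protein"].isin(main_targets)
    has_label = df["Label"].isin(labels)
    return df[(significant | main_target) & has_label]

kept = filter_rows(toy_df, ["DMSO-PF477736"], 0.05, ["CHK1_HUMAN"])
print(kept["Protein"].tolist())
```

The main target is kept because it bypasses the significance test, while `CCC_HUMAN` is dropped because its label is not selected, even though it is significant.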


## STEP 2: ID Conversion

INDRA CoGEx only accepts HGNC IDs at the time of this writing. However, the dataset in the example above contains UniProt mnemonic IDs.

Luckily, INDRA has code to convert UniProt mnemonic IDs into HGNC IDs. For now, we will store this mapping in a separate dictionary.

In [11]:
from indra.databases import uniprot_client

def uniprot_to_hgnc_id(uniprot_mnemonic):
    """Get an HGNC ID from a UniProt mnemonic."""
    uniprot_id = uniprot_client.get_id_from_mnemonic(uniprot_mnemonic)
    if uniprot_id:
        return uniprot_client.get_hgnc_id(uniprot_id)
    else:
        return None

uniprot_to_hgnc_id("CLH1_HUMAN")

'2092'

In [12]:
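def uniprot_to_hgnc_gene_name(uniprot_mnemonic):
    """Get an HGNC gene name from a UniProt mnemonic."""
    # Resolve the mnemonic to a UniProt ID first, as in uniprot_to_hgnc_id
    uniprot_id = uniprot_client.get_id_from_mnemonic(uniprot_mnemonic)
    if uniprot_id:
        return uniprot_client.get_gene_name(uniprot_id)
    return None

uniprot_to_hgnc_gene_name("CLH1_HUMAN")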

'CLTC'

In [13]:
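def create_hgnc_id_to_uniprot_mapping(pandas_df):
    """Map each HGNC ID to its UniProt mnemonic, skipping proteins with no HGNC ID."""
    mappings = {}
    for protein in pandas_df['Protein'].unique():
        hgnc_id = uniprot_to_hgnc_id(protein)
        if hgnc_id:
            mappings[hgnc_id] = protein
    return mappings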

hgnc_id_to_uniprot_mapping = create_hgnc_id_to_uniprot_mapping(pvalue_logfc_df)
hgnc_id_to_uniprot_mapping

{'1925': 'CHK1_HUMAN',
 '2092': 'CLH1_HUMAN',
 '3621': 'FCGRN_HUMAN',
 '25528': 'NECP2_HUMAN',
 '10289': 'RFA1_HUMAN'}

In [14]:
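def create_hgnc_gene_name_to_uniprot_mapping(pandas_df):
    """Map each HGNC gene name to its UniProt mnemonic, skipping unmapped proteins."""
    mappings = {}
    for protein in pandas_df['Protein'].unique():
        gene_name = uniprot_to_hgnc_gene_name(protein)
        if gene_name:
            mappings[gene_name] = protein
    return mappings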

hgnc_gene_name_to_uniprot_mapping = create_hgnc_gene_name_to_uniprot_mapping(pvalue_logfc_df)
hgnc_gene_name_to_uniprot_mapping

{'CHEK1': 'CHK1_HUMAN',
 'CLTC': 'CLH1_HUMAN',
 'FCGRT': 'FCGRN_HUMAN',
 'NECAP2': 'NECP2_HUMAN',
 'RPA1': 'RFA1_HUMAN'}
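
One plausible use of these dictionaries is translating INDRA results, which refer to proteins by gene name, back to the UniProt mnemonics used in the MSstats data. The sketch below assumes that use case; the helper `to_uniprot_pairs` and the edge list are hypothetical, and the mapping is copied from the output above.

```python
# Mapping copied from the notebook output above.
gene_name_to_uniprot = {
    "CHEK1": "CHK1_HUMAN",
    "CLTC": "CLH1_HUMAN",
    "FCGRT": "FCGRN_HUMAN",
    "NECAP2": "NECP2_HUMAN",
    "RPA1": "RFA1_HUMAN",
}

def to_uniprot_pairs(gene_pairs, mapping):
    """Translate (geneA, geneB) pairs to UniProt mnemonics; unmapped names become None."""
    return [(mapping.get(a), mapping.get(b)) for a, b in gene_pairs]

# Hypothetical edge list using gene names; TP53 is deliberately absent from the mapping.
edges = [("RPA1", "CLTC"), ("CHEK1", "RPA1"), ("RPA1", "TP53")]
uniprot_edges = to_uniprot_pairs(edges, gene_name_to_uniprot)
print(uniprot_edges)
```

Using `dict.get` rather than indexing means a gene outside the filtered dataset yields `None` instead of raising a `KeyError`.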

## STEP 3: Query INDRA CoGEx

Using INDRA CoGEx, we can extract subnetwork relationships among the proteins from the MSstats output.

In [15]:
import requests

def query_indra_subnetwork(groundings):
    """Return a list of INDRA subnetwork relations based on a list of groundings."""
    res = requests.post(
        'https://discovery.indra.bio/api/indra_subnetwork_relations',
        json={'nodes': groundings}
    )
    # Fail early on HTTP errors instead of trying to parse an error page as JSON
    res.raise_for_status()
    return res.json()

In [16]:
groundings = []
for hgnc_id in hgnc_id_to_uniprot_mapping.keys():
    groundings.append(('HGNC', hgnc_id))
subnetwork_relations = query_indra_subnetwork(groundings)
subnetwork_relations[0]

{'data': {'belief': 0.98,
  'evidence_count': 1,
  'has_database_evidence': True,
  'has_reader_evidence': False,
  'has_retracted_evidence': False,
  'medscan_only': False,
  'source_counts': '{"biogrid": 1}',
  'sparser_only': False,
  'stmt_hash': 13959655286867654,
  'stmt_json': '{"type": "Complex", "members": [{"name": "RPA1", "db_refs": {"HGNC": "10289", "UP": "P27694", "EGID": "6117"}}, {"name": "CLTC", "db_refs": {"HGNC": "2092", "UP": "Q00610", "EGID": "1213"}}], "belief": 0.98, "evidence": [{"source_api": "biogrid", "pmid": "24332808", "source_id": "934345", "annotations": {"biogrid_int_id": "934345", "entrez_a": "6117", "entrez_b": "1213", "biogrid_a": "112037", "biogrid_b": "107623", "syst_name_a": null, "syst_name_b": null, "hgnc_a": "RPA1", "hgnc_b": "CLTC", "syn_a": "HSSB|MST075|REPA1|RF-A|RP-A|RPA70", "syn_b": "CHC|CHC17|CLH-17|CLTCL2|Hc", "exp_system": "Affinity Capture-MS", "exp_system_type": "physical", "author": "Marechal A (2014)", "pmid": "24332808", "organism_a"
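
Note that `stmt_json` and `source_counts` arrive as JSON strings nested inside the response, so they need a second `json.loads` before use. A minimal sketch of unpacking one relation follows; `example_entry` is an abbreviated, hand-built stand-in shaped like the output above, and `summarize_relation` is a hypothetical helper.

```python
import json

# Abbreviated entry shaped like the output above; most evidence fields are omitted.
example_entry = {
    "data": {
        "belief": 0.98,
        "evidence_count": 1,
        "source_counts": '{"biogrid": 1}',
        "stmt_hash": 13959655286867654,
        "stmt_json": json.dumps({
            "type": "Complex",
            "members": [
                {"name": "RPA1", "db_refs": {"HGNC": "10289"}},
                {"name": "CLTC", "db_refs": {"HGNC": "2092"}},
            ],
        }),
    }
}

def summarize_relation(entry):
    """Return the statement type, agent names, and evidence summary for one relation."""
    data = entry["data"]
    stmt = json.loads(data["stmt_json"])
    # Complex statements list agents under "members"; other statement types use
    # different keys (for example "enz"/"sub" or "subj"/"obj"), not handled here.
    agents = [member["name"] for member in stmt.get("members", [])]
    return {
        "type": stmt["type"],
        "agents": agents,
        "evidence_count": data["evidence_count"],
        "sources": json.loads(data["source_counts"]),
    }

summary = summarize_relation(example_entry)
print(summary)
```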

## STEP 4: Assemble Data

INDRA has a set of assembler classes that present the data for further analysis. Below are some examples of assemblers.

### HTML Assembler

In [17]:
import json
from indra.statements import stmts_from_json

# Gather statistics for HTML presentation
unique_stmts = {entry['data']['stmt_hash']: json.loads(entry['data']['stmt_json'])
                for entry in subnetwork_relations}
ev_counts_by_hash = {entry['data']['stmt_hash']: entry['data']['evidence_count']
                     for entry in subnetwork_relations}
source_counts_by_hash = {entry['data']['stmt_hash']: json.loads(entry['data']['source_counts'])
                         for entry in subnetwork_relations}
stmts = stmts_from_json(list(unique_stmts.values()))

In [18]:
from indra.assemblers.html import HtmlAssembler
ha = HtmlAssembler(stmts,
                   title='INDRA subnetwork statements',
                   db_rest_url='https://db.indra.bio',
                   ev_counts=ev_counts_by_hash,
                   source_counts=source_counts_by_hash)
html_str = ha.make_model()

In [19]:
from IPython.display import HTML
HTML(html_str)

### Network Visualization

We can also visualize the subnetwork acquired from INDRA CoGEx using a built-in INDRA assembler.

In [1]:
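# pygraphviz builds against the Graphviz headers and libraries, so Graphviz must
# already be installed (see the `brew install graphviz` cell in this section).
# Reference: https://pygraphviz.github.io/documentation/stable/install.html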
!pip install -q pygraphviz \
    --config-settings=--global-option=build_ext \
    --config-settings=--global-option="-I$(brew --prefix graphviz)/include" \
    --config-settings=--global-option="-L$(brew --prefix graphviz)/lib"

In [2]:
!brew install graphviz

[34m==>[0m [1mAuto-updating Homebrew...[0m
Adjust how often this is run with HOMEBREW_AUTO_UPDATE_SECS or disable with
HOMEBREW_NO_AUTO_UPDATE. Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
[34m==>[0m [1mAuto-updated Homebrew![0m
Updated 2 taps (homebrew/core and homebrew/cask).
[34m==>[0m [1mNew Formulae[0m
bigquery-emulator                        tmt
[34m==>[0m [1mNew Casks[0m
carbon-copy-cloner@6

You have [1m13[0m outdated formulae installed.

graphviz 10.0.1 is already installed but outdated (so it will be upgraded).
[34m==>[0m [1mDownloading https://ghcr.io/v2/homebrew/core/graphviz/manifests/11.0.0[0m
######################################################################### 100.0%
[32m==>[0m [1mFetching dependencies for graphviz: [32mgiflib[39m, [32mwebp[39m, [32maom[39m, [32mopenssl@3[39m, [32mjasper[39m, [32mnetpbm[39m, [32mlibxcb[39m, [32mlibx11[39m, [32mgdk-pixbuf[39m and [32mfribidi[39m[0m
[34m==>[0m [1mDownlo

In [21]:
from indra.assemblers.graph.assembler import GraphAssembler

ga = GraphAssembler(stmts=stmts)
ga.make_model()
ga.save_pdf(file_name='graph.pdf', prog='dot')
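
If pygraphviz cannot be installed, one fallback is to write Graphviz DOT text by hand and render it with any DOT viewer. The following is a minimal sketch of that idea and does not use INDRA's `GraphAssembler`; `relations_to_dot` is a hypothetical helper and the example tuples are illustrative.

```python
def relations_to_dot(relations):
    """Build Graphviz DOT text from (source, target, stmt_type) tuples."""
    lines = ["digraph subnetwork {"]
    for source, target, stmt_type in relations:
        # Complexes are undirected, so draw them without an arrowhead.
        extra = ', dir="none"' if stmt_type == "Complex" else ""
        lines.append(f'  "{source}" -> "{target}" [label="{stmt_type}"{extra}];')
    lines.append("}")
    return "\n".join(lines)

# Illustrative relations using gene names that appear in this notebook.
example_relations = [
    ("CHEK1", "RPA1", "Phosphorylation"),
    ("RPA1", "CLTC", "Complex"),
]
dot_text = relations_to_dot(example_relations)
print(dot_text)
```

The resulting text can be saved to a `.dot` file and rendered with the `dot` command-line tool once Graphviz itself is available.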

### Tabular Format

INDRA also has an assembler for displaying INDRA CoGEx results as a table.

In [22]:
from indra.assemblers.indranet.assembler import IndraNetAssembler

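def add_evidence_column(stmt, ev_counts=ev_counts_by_hash):
    """Look up the evidence count for a statement by its hash."""
    # Named stmt_hash to avoid shadowing the built-in hash()
    stmt_hash = stmt.get_hash(refresh=True)
    return ev_counts[stmt_hash]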

indra_net_assembler = IndraNetAssembler(statements=stmts)
relations_table = indra_net_assembler.make_df(extra_columns=[('evidence_count', add_evidence_column)])
relations_table = relations_table.sort_values(by=['evidence_count'], ascending=False)
relations_table

Unnamed: 0,agA_name,agB_name,agA_ns,agA_id,agB_ns,agB_id,residue,position,stmt_type,evidence_count,stmt_hash,belief,source_counts,initial_sign
7,CHEK1,RPA1,HGNC,1925,HGNC,10289,,,Phosphorylation,16,22039018332715694,0.999082,{'sparser': 1},
5,RPA1,CHEK1,HGNC,10289,HGNC,1925,,,Phosphorylation,3,6298771024467860,0.923,{'reach': 1},
4,RPA1,CHEK1,HGNC,10289,HGNC,1925,,,Dephosphorylation,2,9751777159668901,0.9419,{'reach': 1},
0,RPA1,CLTC,HGNC,10289,HGNC,2092,,,Complex,1,13959655286867654,0.98,{'biogrid': 1},
1,CLTC,RPA1,HGNC,2092,HGNC,10289,,,Complex,1,13959655286867654,0.98,{'biogrid': 1},
2,RPA1,CHEK1,HGNC,10289,HGNC,1925,S,345.0,Dephosphorylation,1,-4359730843333697,0.65,{'reach': 1},
3,RPA1,CHEK1,HGNC,10289,HGNC,1925,,,Activation,1,-22751136463918078,0.65,{'reach': 1},
6,RPA1,CHEK1,HGNC,10289,HGNC,1925,,,Dephosphorylation,1,-28824006899253340,0.65,{'reach': 1},
8,CHEK1,RPA1,HGNC,1925,HGNC,10289,,,Inhibition,1,-891482891587470,0.65,{'reach': 1},
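
Because one protein pair can appear in several statements, it can help to aggregate the table per directed pair. The sketch below assumes that goal and uses a hand-copied subset of the rows above rather than the full `relations_table`; the column names `total_evidence` and `n_statements` are made up for illustration.

```python
import pandas as pd

# Subset of the relations table above, copied by hand for illustration.
rows = pd.DataFrame({
    "agA_name": ["CHEK1", "RPA1", "RPA1", "RPA1"],
    "agB_name": ["RPA1", "CHEK1", "CHEK1", "CLTC"],
    "stmt_type": ["Phosphorylation", "Phosphorylation", "Dephosphorylation", "Complex"],
    "evidence_count": [16, 3, 2, 1],
})

# Sum evidence and count statements for each directed (agA, agB) pair.
pair_summary = (
    rows.groupby(["agA_name", "agB_name"], as_index=False)
    .agg(total_evidence=("evidence_count", "sum"),
         n_statements=("stmt_type", "count"))
    .sort_values("total_evidence", ascending=False)
    .reset_index(drop=True)
)
print(pair_summary)
```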


In [27]:
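from indra.assemblers.cx import CxAssembler

# Assemble the statements into a CX network (the format used by NDEx and Cytoscape)
ca = CxAssembler(stmts, 'INDRA May Institute Network')
ncx = ca.make_model()
# Write the network to disk; with no argument, the assembler's default file name is used
ca.save_model()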

### Correlation Matrix

In [31]:
ABUNDANCE_PATH = "ProteinLevelData.csv" # Set this path yourself
ABUNDANCE_GROUPS_FILTER = ['DMSO', 'PF477736']

def construct_abundance_df(filename):
    """Return protein abundances restricted to the selected groups."""
    pandas_df = pd.read_csv(filename)
    pandas_df = pandas_df[pandas_df['GROUP'].isin(ABUNDANCE_GROUPS_FILTER)]
    return pandas_df

protein_abundance_df = construct_abundance_df(ABUNDANCE_PATH)
protein_abundance_df

Unnamed: 0,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT,TotalGroupMeasurements,NumMeasuredFeature,MissingPercentage,more50missing,NumImputedFeature
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2,1210,10,0.0,False,0
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5,1210,10,0.0,False,0
2,3,1433B_HUMAN,12.686827,230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA,DMSO,10,1210,10,0.0,False,0
3,4,1433B_HUMAN,12.625462,230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA,DMSO,12,1210,10,0.0,False,0
4,5,1433B_HUMAN,12.538365,230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA,DMSO,13,1210,10,0.0,False,0
...,...,...,...,...,...,...,...,...,...,...,...
1189787,232,ZZZ3_HUMAN,10.369426,230719_THP-1_Chrom_end2end_Plate3_VE-821_D08,PF477736,236,179,10,0.0,False,0
1189788,233,ZZZ3_HUMAN,10.587775,230719_THP-1_Chrom_end2end_Plate3_DbET6_E08,PF477736,248,179,10,0.0,False,0
1189789,234,ZZZ3_HUMAN,10.755412,230719_THP-1_Chrom_end2end_Plate3_DMSO_E10,PF477736,250,179,10,0.0,False,0
1189790,235,ZZZ3_HUMAN,10.653718,230719_THP-1_Chrom_end2end_Plate3_DMSO_F01,PF477736,253,179,10,0.0,False,0


The MSstats `dataProcess` function outputs a dataset containing protein abundances for each protein and biological replicate pair. Using the protein abundance data, we can compute the correlation of abundance between pairs of proteins across replicates.

In [32]:
import numpy as np

def calculate_correlation_matrix(pvalue_df, protein_level_summary):
    """Return pairwise correlations of log intensities across subjects."""
    data = {}
    subjects = protein_level_summary['SUBJECT'].unique()
    for protein in pvalue_df['Protein'].unique():
        data[protein] = []
        protein_level_df = protein_level_summary[protein_level_summary['Protein'] == protein]
        for subject in subjects:
            if subject in protein_level_df['SUBJECT'].values:
                protein_level_df_subject = protein_level_df[protein_level_df['SUBJECT'] == subject]
                # Use the first measurement when a subject has several rows
                data[protein].append(protein_level_df_subject['LogIntensities'].iloc[0])
            else:
                # No measurement for this subject; corr() skips NaN values pairwise
                data[protein].append(np.nan)
    # One column per protein, one row per subject
    df = pd.DataFrame(data)
    return df.corr()

corr_matrix = calculate_correlation_matrix(pvalue_logfc_df, protein_abundance_df)
corr_matrix

Unnamed: 0,CHK1_HUMAN,CLH1_HUMAN,FCGRN_HUMAN,NECP2_HUMAN,RFA1_HUMAN
CHK1_HUMAN,1.0,-0.217592,-0.103847,0.287814,0.212776
CLH1_HUMAN,-0.217592,1.0,0.187168,-0.111017,-0.014356
FCGRN_HUMAN,-0.103847,0.187168,1.0,-0.242859,0.316762
NECP2_HUMAN,0.287814,-0.111017,-0.242859,1.0,-0.115349
RFA1_HUMAN,0.212776,-0.014356,0.316762,-0.115349,1.0
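
The same matrix can be obtained by pivoting the long table to one column per protein before calling `corr()`. The sketch below illustrates this on toy data; `correlation_from_long` is a hypothetical helper, and `aggfunc="first"` is an assumption that takes the first non-missing value per subject, which can differ from `.iloc[0]` when the first row is missing.

```python
import pandas as pd

# Toy long-format abundances; B_HUMAN has no measurement for subject 4.
toy = pd.DataFrame({
    "Protein": ["A_HUMAN"] * 4 + ["B_HUMAN"] * 3,
    "SUBJECT": [1, 2, 3, 4, 1, 2, 3],
    "LogIntensities": [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0],
})

def correlation_from_long(df, proteins):
    """Pivot to subjects x proteins, then compute pairwise Pearson correlations."""
    wide = df[df["Protein"].isin(proteins)].pivot_table(
        index="SUBJECT", columns="Protein",
        values="LogIntensities", aggfunc="first",
    )
    # Keep the requested protein order; missing subjects stay NaN and are
    # excluded pairwise by corr().
    return wide.reindex(columns=proteins).corr()

corr = correlation_from_long(toy, ["A_HUMAN", "B_HUMAN"])
print(corr)
```

Only the three subjects measured for both proteins contribute to the off-diagonal value, mirroring the pairwise handling of `NaN` in the loop-based version.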


In [34]:
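LOG_FC_FILTER = 0.25 # Adjust this yourself
# Build a new frame rather than assigning into a filtered slice in place
pvalue_logfc_df = pvalue_logfc_df.assign(log2FC=pvalue_logfc_df['log2FC'].astype(float))
# Keep proteins whose absolute log2 fold change exceeds the threshold
logfc_proteins = pvalue_logfc_df[pvalue_logfc_df['log2FC'].abs() > LOG_FC_FILTER]
logfc_proteins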

Unnamed: 0,Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue,issue,MissingPercentage,ImputationPercentage
32541,CLH1_HUMAN,DMSO-PF477736,-0.269784,0.056036,-4.814489,260.0,2.506764e-06,0.0030946,,0.143333,0.0
63546,FCGRN_HUMAN,DMSO-PF477736,-0.674851,0.132991,-5.074427,179.0,9.674775e-07,0.001592468,,0.195918,0.0
115656,NECP2_HUMAN,DMSO-PF477736,0.693701,0.097077,7.145875,259.0,9.076739e-12,2.241047e-08,,0.04,0.0
152151,RFA1_HUMAN,DMSO-PF477736,-0.603824,0.07619,-7.925239,260.0,6.705747e-14,3.311298e-10,,0.120667,0.0
