# Hands on: Creating context-specific mechanistic networks from experimental data and prior knowledge

This Python notebook shows how to integrate your own dataset with INDRA to support mechanistic interpretation. We start from an MSstats dataset that lists proteins alongside their p-values, and we use those p-values to select which genes to query INDRA with.

In [2]:
!pip install -q -r requirements.txt

## STEP 1: Import MSstats Dataset

First, we import the MSstats dataset as a pandas DataFrame. The dataset is the output of the MSstats `groupComparison` function, which reports a p-value and an adjusted p-value for each protein and comparison.

We then shrink the dataset by keeping only rows whose adjusted p-value falls below `P_VALUE_FILTER` and whose comparison label is listed in `LABELS_FILTER`. Adjust both parameters to suit your data.
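
To see how the two parameters interact before running the filter on real data, here is a minimal sketch on a hypothetical three-row table. The helper name `filter_comparisons` and the toy values are assumptions made for illustration; only the column names `issue`, `Label`, and `adj.pvalue` come from the MSstats output used in this notebook.

```python
import pandas as pd

def filter_comparisons(df, labels, max_adj_pvalue=0.05):
    """Keep rows with no flagged issue, a selected label, and a small adjusted p-value."""
    mask = (
        df["issue"].isnull()
        & df["Label"].isin(labels)
        & (df["adj.pvalue"] < max_adj_pvalue)
    )
    return df[mask].copy()

# Hypothetical toy data standing in for a groupComparison result
toy = pd.DataFrame({
    "Protein": ["P1_HUMAN", "P2_HUMAN", "P3_HUMAN"],
    "Label": ["DMSO-DbET6", "DMSO-DbET6", "DMSO-Other"],
    "adj.pvalue": [0.001, 0.2, 0.001],
    "issue": [None, None, None],
})

strict = filter_comparisons(toy, ["DMSO-DbET6"], max_adj_pvalue=0.05)
relaxed = filter_comparisons(toy, ["DMSO-DbET6"], max_adj_pvalue=0.25)
print(strict["Protein"].tolist())
print(relaxed["Protein"].tolist())
```

Loosening the cutoff admits more proteins, which in turn enlarges the network queried later, so it is worth checking the row count after each change.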

In [14]:
import pandas as pd

P_VALUE_LOGFC_PATH = "../data/model.csv" # Set this path yourself
LABELS_FILTER = ["DMSO-DbET6"]

P_VALUE_FILTER = 0.05 # Adjust this yourself

def construct_pvalue_logfc_df(filename):
    """Return a filtered data frame from the given data file."""
    pandas_df = pd.read_csv(filename)
    pandas_df = pandas_df[pandas_df['issue'].isnull()]
    pandas_df = pandas_df[pandas_df['adj.pvalue'] < P_VALUE_FILTER]
    pandas_df = pandas_df[pandas_df['Label'].isin(LABELS_FILTER)]
    return pandas_df

pvalue_logfc_df = construct_pvalue_logfc_df(P_VALUE_LOGFC_PATH)
pvalue_logfc_df

Unnamed: 0,Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue,issue,MissingPercentage,ImputationPercentage
19890,BRD2_HUMAN,DMSO-DbET6,2.046185244,0.114339,17.895836,260.0,0.0,0.0,,0.310067,0.0
19935,BRD3_HUMAN,DMSO-DbET6,3.333427936,0.126571,26.336522,257.0,0.0,0.0,,0.252349,0.0
19980,BRD4_HUMAN,DMSO-DbET6,2.668934662,0.101283,26.351317,257.0,0.0,0.0,,0.118121,0.0
27900,CEBPZ_HUMAN,DMSO-DbET6,-0.291058829,0.07434,-3.915236,260.0,0.000115442,0.03005139,,0.05906,0.0
37530,CRNL1_HUMAN,DMSO-DbET6,-0.268053808,0.069816,-3.839419,260.0,0.000154958,0.03832114,,0.027517,0.0
41580,DAZP1_HUMAN,DMSO-DbET6,0.617508071,0.099537,6.203818,260.0,2.15995e-09,1.780518e-06,,0.286577,0.0
66060,FUBP2_HUMAN,DMSO-DbET6,0.291044501,0.077226,3.768747,260.0,0.000203053,0.04782372,,0.01745,0.0
66105,FUBP3_HUMAN,DMSO-DbET6,0.300409541,0.069882,4.298798,260.0,2.430235e-05,0.008941156,,0.095302,0.0
72855,GTPB4_HUMAN,DMSO-DbET6,-0.344028893,0.087806,-3.918077,260.0,0.000114166,0.03005139,,0.020134,0.0
75735,HEAT3_HUMAN,DMSO-DbET6,-0.409028514,0.100056,-4.08798,260.0,5.804139e-05,0.01794204,,0.127517,0.0


In [3]:
sorted(set(pvalue_logfc_df.Label))

['DMSO-DbET6']

## STEP 2: ID Conversion

INDRA CoGEx only accepts HGNC IDs at the time of this writing. However, the dataset in the example above identifies proteins by UniProt mnemonic (for example, `BRD2_HUMAN`).

Luckily, INDRA includes code to convert UniProt mnemonics into HGNC IDs. For now, we will store this mapping in a separate dictionary.
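
Some identifiers may fail to convert, so it can help to track them explicitly instead of storing them under a `None` key. The sketch below illustrates that idea with a plain dictionary standing in for the INDRA lookup; the helper name `build_id_mapping` and the identifier `MADEUP_HUMAN` are hypothetical.

```python
def build_id_mapping(proteins, convert):
    """Map converted IDs back to the original identifiers and collect failures."""
    mapping, unmapped = {}, []
    for protein in proteins:
        converted = convert(protein)
        if converted is None:
            unmapped.append(protein)
        else:
            mapping[converted] = protein
    return mapping, unmapped

# Stand-in for an INDRA-backed converter (assumption: returns None when no ID is found)
toy_lookup = {"BRD2_HUMAN": "1103", "BRD4_HUMAN": "13575"}

mapping, unmapped = build_id_mapping(
    ["BRD2_HUMAN", "BRD4_HUMAN", "MADEUP_HUMAN"], toy_lookup.get
)
print(mapping)
print(unmapped)
```

Passing the converter as an argument keeps the bookkeeping separate from the lookup, so the same helper could wrap either of the conversion functions defined in this step.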

In [15]:
from indra.databases import uniprot_client

def uniprot_to_hgnc_id(uniprot_mnemonic):
    """Get an HGNC ID from a UniProt mnemonic."""
    uniprot_id = uniprot_client.get_id_from_mnemonic(uniprot_mnemonic)
    if uniprot_id:
        return uniprot_client.get_hgnc_id(uniprot_id)
    else:
        return None

uniprot_to_hgnc_id("BRD2_HUMAN")

'1103'

In [16]:
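def uniprot_to_hgnc_gene_name(uniprot_mnemonic):
    """Get an HGNC gene name from a UniProt mnemonic."""
    # Resolve the mnemonic to a UniProt ID first, then look up the gene name
    uniprot_id = uniprot_client.get_id_from_mnemonic(uniprot_mnemonic)
    if uniprot_id:
        return uniprot_client.get_gene_name(uniprot_id)
    return None

uniprot_to_hgnc_gene_name("BRD2_HUMAN")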

'BRD2'

In [17]:
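def create_hgnc_id_to_uniprot_mapping(pandas_df):
    """Map HGNC IDs to the UniProt mnemonics found in the data frame."""
    mappings = {}
    for protein in pandas_df['Protein'].unique():
        hgnc_id = uniprot_to_hgnc_id(protein)
        if hgnc_id:  # Skip proteins that have no HGNC mapping
            mappings[hgnc_id] = protein
    return mappings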

hgnc_id_to_uniprot_mapping = create_hgnc_id_to_uniprot_mapping(pvalue_logfc_df)
hgnc_id_to_uniprot_mapping

{'1103': 'BRD2_HUMAN',
 '1104': 'BRD3_HUMAN',
 '13575': 'BRD4_HUMAN',
 '24218': 'CEBPZ_HUMAN',
 '15762': 'CRNL1_HUMAN',
 '2683': 'DAZP1_HUMAN',
 '6316': 'FUBP2_HUMAN',
 '4005': 'FUBP3_HUMAN',
 '21535': 'GTPB4_HUMAN',
 '26087': 'HEAT3_HUMAN',
 '5036': 'HNRPD_HUMAN',
 '5176': 'KRR1_HUMAN',
 '7867': 'NOP2_HUMAN',
 '9296': 'PP1R8_HUMAN',
 '17351': 'PRP18_HUMAN',
 '21100': 'QKI_HUMAN',
 '11802': 'TIA1_HUMAN',
 '25758': 'UTP15_HUMAN',
 '14098': 'WDR12_HUMAN',
 '28945': 'WDR43_HUMAN',
 '25725': 'WDR75_HUMAN'}

In [18]:
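def create_hgnc_gene_name_to_uniprot_mapping(pandas_df):
    """Map HGNC gene names to the UniProt mnemonics found in the data frame."""
    mappings = {}
    for protein in pandas_df['Protein'].unique():
        gene_name = uniprot_to_hgnc_gene_name(protein)
        if gene_name:  # Skip proteins that have no gene name
            mappings[gene_name] = protein
    return mappings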

hgnc_gene_name_to_uniprot_mapping = create_hgnc_gene_name_to_uniprot_mapping(pvalue_logfc_df)
hgnc_gene_name_to_uniprot_mapping

{'BRD2': 'BRD2_HUMAN',
 'BRD3': 'BRD3_HUMAN',
 'BRD4': 'BRD4_HUMAN',
 'CEBPZ': 'CEBPZ_HUMAN',
 'CRNKL1': 'CRNL1_HUMAN',
 'DAZAP1': 'DAZP1_HUMAN',
 'KHSRP': 'FUBP2_HUMAN',
 'FUBP3': 'FUBP3_HUMAN',
 'GTPBP4': 'GTPB4_HUMAN',
 'HEATR3': 'HEAT3_HUMAN',
 'HNRNPD': 'HNRPD_HUMAN',
 'KRR1': 'KRR1_HUMAN',
 'NOP2': 'NOP2_HUMAN',
 'PPP1R8': 'PP1R8_HUMAN',
 'PRPF18': 'PRP18_HUMAN',
 'QKI': 'QKI_HUMAN',
 'TIA1': 'TIA1_HUMAN',
 'UTP15': 'UTP15_HUMAN',
 'WDR12': 'WDR12_HUMAN',
 'WDR43': 'WDR43_HUMAN',
 'WDR75': 'WDR75_HUMAN'}

## STEP 3: Query INDRA CoGEx

Using INDRA CoGEx, we can extract the subnetwork of relations that connect the proteins in the MSstats output.

In [12]:
import requests

def query_indra_subnetwork(groundings):
    """Return a list of INDRA subnetwork relations based on a list of groundings."""
    res = requests.post(
        'https://discovery.indra.bio/api/indra_subnetwork_relations',
        json={'nodes': groundings}
    )
    res.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
    return res.json()

In [19]:
groundings = []
for hgnc_id in hgnc_id_to_uniprot_mapping:
    groundings.append(('HGNC', hgnc_id))
subnetwork_relations = query_indra_subnetwork(groundings)
subnetwork_relations[0]

{'data': {'belief': 0.65,
  'evidence_count': 1,
  'has_database_evidence': False,
  'has_reader_evidence': True,
  'has_retracted_evidence': False,
  'medscan_only': False,
  'source_counts': '{"sparser": 1}',
  'sparser_only': True,
  'stmt_hash': 6100415255007272,
  'stmt_json': '{"type": "Complex", "members": [{"name": "RAD21", "db_refs": {"UP": "O60216", "TEXT": "RAD21", "HGNC": "9811", "EGID": "5885"}}, {"name": "BRD2", "db_refs": {"UP": "P25440", "TEXT": "BRD2", "HGNC": "1103", "EGID": "6046"}}, {"name": "BRD4", "db_refs": {"UP": "O60885", "TEXT": "BRD4", "HGNC": "13575", "EGID": "23476"}}], "belief": 0.65, "evidence": [{"source_api": "sparser", "pmid": "28107481", "text": "However, we were unable to demonstrate any direct physical interaction between BRD2 or BRD4 with cohesin subunit RAD21 ( xref ), suggesting the conformational control of KSHV latency involves additional factors.", "annotations": {"found_by": "INTERACT"}, "text_refs": {"PMID": "28107481", "TRID": 16352739, "PM

In [34]:
# Fetch the full evidence for a single statement by its hash
res = requests.post(
    'https://discovery.indra.bio/api/get_stmts_for_stmt_hashes',
    json={
        "stmt_hashes": ["12484535149707124"],
        "evidence_limit": 200
    }
)

In [35]:
res.json()

[{'belief': 0.99995,
  'evidence': [{'annotations': {'agents': {'coords': [[110, 114], [119, 123]]},
     'found_by': 'binding_of_and'},
    'context': {'species': {'db_refs': {'TAXONOMY': '9606'}, 'name': None},
     'type': 'bio'},
    'epistemics': {'direct': True, 'section_type': None},
    'pmid': '25603177',
    'source_api': 'reach',
    'source_hash': -1268912701169983876,
    'text': 'BRD2 and BRD4 have been reported to be involved in cell cycle progression, given the evidence that binding of BRD2 and BRD4 to acetylated chromatin persists even during mitosis when chromatin is highly condensed and transcription is interrupted [XREF_BIBR, XREF_BIBR].',
    'text_refs': {'DOI': '10.3390/IJMS16011928',
     'PMCID': 'PMC4307342',
     'PMID': '25603177',
     'TRID': 461769}},
   {'annotations': {'found_by': 'BINDING'},
    'pmid': '25603177',
    'source_api': 'sparser',
    'source_hash': -431739944072706480,
    'text': 'BRD2 and BRD4 have been reported to be involved in cell c

In [20]:
len(subnetwork_relations)

140
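
Before assembling the relations, it can be useful to check which statement types and sources dominate the result. The following is a minimal sketch, assuming each relation carries `data.stmt_type` and a JSON-encoded `data.source_counts` string as in the response shown above; the helper name `summarize_relations` and the toy records are hypothetical.

```python
import json
from collections import Counter

def summarize_relations(relations):
    """Count relations per statement type and total evidence per source."""
    type_counts = Counter(rel["data"]["stmt_type"] for rel in relations)
    source_totals = Counter()
    for rel in relations:
        # source_counts arrives as a JSON string, so decode it first
        source_totals.update(json.loads(rel["data"]["source_counts"]))
    return type_counts, source_totals

# Hypothetical records that mimic the shape of the API response
toy_relations = [
    {"data": {"stmt_type": "Complex", "source_counts": '{"sparser": 1}'}},
    {"data": {"stmt_type": "Complex", "source_counts": '{"reach": 2, "sparser": 1}'}},
]

type_counts, source_totals = summarize_relations(toy_relations)
print(type_counts.most_common())
print(source_totals.most_common())
```

On the real data, the same call would be `summarize_relations(subnetwork_relations)`.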

## STEP 4: Assemble Data

INDRA provides a set of assembler classes that render statements in different formats for further analysis. Below are some examples.

### HTML Assembler

In [21]:
import json
from indra.statements import stmts_from_json

# Gather statistics for HTML presentation
unique_stmts = {entry['data']['stmt_hash']: json.loads(entry['data']['stmt_json'])
                for entry in subnetwork_relations}
ev_counts_by_hash = {entry['data']['stmt_hash']: entry['data']['evidence_count']
                     for entry in subnetwork_relations}
source_counts_by_hash = {entry['data']['stmt_hash']: json.loads(entry['data']['source_counts'])
                         for entry in subnetwork_relations}
stmts = stmts_from_json(list(unique_stmts.values()))

In [26]:
from indra.assemblers.html import HtmlAssembler
ha = HtmlAssembler(stmts,
                   title='INDRA subnetwork statements',
                   db_rest_url='https://db.indra.bio',
                   ev_counts=ev_counts_by_hash,
                   source_counts=source_counts_by_hash)
html_str = ha.make_model()
ha.save_model('statements.html')

In [12]:
from IPython.display import HTML
# Uncomment the next line to render the statements inline in the notebook
# HTML(html_str)

### Network Visualization

We can also visualize the subnetwork acquired from INDRA CoGEx using a built-in INDRA assembler.

In [12]:
# Reference: https://pygraphviz.github.io/documentation/stable/install.html
!pip install -q pygraphviz \
    --config-settings=--global-option=build_ext \
    --config-settings=--global-option="-I$(brew --prefix graphviz)/include" \
    --config-settings=--global-option="-L$(brew --prefix graphviz)/lib"

In [27]:
from indra.assemblers.graph.assembler import GraphAssembler

ga = GraphAssembler(stmts=stmts)
ga.make_model()
ga.save_pdf(file_name='graph.pdf', prog='dot')

### Tabular Format

INDRA also has an assembler that presents INDRA CoGEx results as a table.

In [19]:
from indra.assemblers.indranet.assembler import IndraNetAssembler

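def add_evidence_column(stmt, ev_counts=ev_counts_by_hash):
    """Return the evidence count recorded for a statement's hash."""
    stmt_hash = stmt.get_hash(refresh=True)  # Avoid shadowing the built-in hash()
    return ev_counts[stmt_hash]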

indra_net_assembler = IndraNetAssembler(statements=stmts)
relations_table = indra_net_assembler.make_df(extra_columns=[('evidence_count', add_evidence_column)])
relations_table = relations_table.sort_values(by=['evidence_count'], ascending=False)
relations_table

Unnamed: 0,agA_name,agB_name,agA_ns,agA_id,agB_ns,agB_id,residue,position,stmt_type,evidence_count,stmt_hash,belief,source_counts,initial_sign
119,BRD4,BRD2,HGNC,13575,HGNC,1103,,,Complex,113,12484535149707124,0.999950,{'reach': 1},
118,BRD2,BRD4,HGNC,1103,HGNC,13575,,,Complex,113,12484535149707124,0.999950,{'reach': 1},
231,BRD4,BRD3,HGNC,13575,HGNC,1104,,,Complex,51,359233681958482,0.999975,{'sparser': 1},
232,BRD3,BRD4,HGNC,1104,HGNC,13575,,,Complex,51,359233681958482,0.999975,{'sparser': 1},
151,BRD2,BRD3,HGNC,1103,HGNC,1104,,,Complex,42,-16515703320827288,0.999950,{'sparser': 1},
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,JMJD6,BRD2,HGNC,19355,HGNC,1103,,,Complex,1,34194495135899326,0.650000,{'sparser': 1},
158,JMJD6,BRD3,HGNC,19355,HGNC,1104,,,Complex,1,34194495135899326,0.650000,{'sparser': 1},
159,KRR1,NOP2,HGNC,5176,HGNC,7867,,,Complex,1,-24130039704357981,0.980000,{'biogrid': 1},
160,NOP2,KRR1,HGNC,7867,HGNC,5176,,,Complex,1,-24130039704357981,0.980000,{'biogrid': 1},
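
Each `Complex` statement appears in the table once per direction (for example, BRD4/BRD2 and BRD2/BRD4 share one statement hash), so counting rows overstates the number of distinct relations. The sketch below shows one way to keep a single row per statement and unordered agent pair. The helper name `drop_mirrored_rows` is hypothetical, and the toy table copies only a few columns from the output above.

```python
import pandas as pd

def drop_mirrored_rows(table):
    """Keep one row per (statement hash, unordered agent pair)."""
    keyed = table.assign(
        pair=[tuple(sorted(p)) for p in zip(table["agA_name"], table["agB_name"])]
    )
    deduped = keyed.drop_duplicates(subset=["stmt_hash", "pair"])
    return deduped.drop(columns="pair")

# Toy table with a subset of the columns shown above
toy_table = pd.DataFrame({
    "agA_name": ["BRD4", "BRD2", "KRR1", "NOP2"],
    "agB_name": ["BRD2", "BRD4", "NOP2", "KRR1"],
    "stmt_hash": [12484535149707124, 12484535149707124,
                  -24130039704357981, -24130039704357981],
    "evidence_count": [113, 113, 1, 1],
})

deduped = drop_mirrored_rows(toy_table)
# Optionally keep only relations backed by at least 10 pieces of evidence
strong = deduped[deduped["evidence_count"] >= 10]
print(deduped)
print(strong)
```

The threshold of 10 is only an example; pick a value that matches how conservative the downstream analysis needs to be.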


### NDEx Network Upload

**This only works if you have a working NDEx account with credentials configured locally**

In [28]:
import ndex2.client
from ndex2 import create_nice_cx_from_server
from indra.assemblers.cx import NiceCxAssembler
from indra.databases import ndex_client

ca = NiceCxAssembler(stmts, 'INDRA May Institute Network')
ncx = ca.make_model(self_loops=False)

style_network_id = '058c452f-b0d6-11ea-a4d3-0660b7976219'
style_ncx = create_nice_cx_from_server(
    server='http://test.ndexbio.org',
    uuid=style_network_id)
ncx.apply_style_from_network(style_ncx)

username, password = ndex_client.get_default_ndex_cred(ndex_cred=None)
ndex_args = {'server': 'http://public.ndexbio.org',
             'username': username,
             'password': password}
network_url = ncx.upload_to(**ndex_args)
network_url

Generating CX


'https://www.ndexbio.org/v2/network/2eccaaf8-0ee1-11ef-9621-005056ae23aa'

## ADDITIONAL STEP: More Creative Ways to Display Data

### Example 1: Simple Network Visualization with P-values, Correlation, LogFC Annotations

In this example, we combine the outputs of the MSstats functions `dataProcess` and `groupComparison`, collect p-values, correlations, and log fold changes, and then contextualize them with INDRA.

In [9]:
ABUNDANCE_PATH = "../data/ProteinLevelData.csv" # Set this path yourself
ABUNDANCE_GROUPS_FILTER = ['DMSO', 'DbET6']
def construct_abundance_df(filename):
    pandas_df = pd.read_csv(filename)
    pandas_df = pandas_df[pandas_df['GROUP'].isin(ABUNDANCE_GROUPS_FILTER)]
    return pandas_df

protein_abundance_df = construct_abundance_df(ABUNDANCE_PATH)
protein_abundance_df

Unnamed: 0,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT,TotalGroupMeasurements,NumMeasuredFeature,MissingPercentage,more50missing,NumImputedFeature
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2,1210,10,0.0,False,0
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5,1210,10,0.0,False,0
2,3,1433B_HUMAN,12.686827,230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA,DMSO,10,1210,10,0.0,False,0
3,4,1433B_HUMAN,12.625462,230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA,DMSO,12,1210,10,0.0,False,0
4,5,1433B_HUMAN,12.538365,230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA,DMSO,13,1210,10,0.0,False,0
...,...,...,...,...,...,...,...,...,...,...,...
1189700,145,ZZZ3_HUMAN,10.725469,230719_THP-1_Chrom_end2end_Plate3_PF477736_D05,DbET6,233,169,10,0.0,False,0
1189701,146,ZZZ3_HUMAN,10.155338,230719_THP-1_Chrom_end2end_Plate3_DMSO_D06,DbET6,234,169,10,0.0,False,0
1189702,147,ZZZ3_HUMAN,9.700678,230719_THP-1_Chrom_end2end_Plate3_K975_D12,DbET6,240,169,10,0.0,False,0
1189703,148,ZZZ3_HUMAN,10.889323,230719_THP-1_Chrom_end2end_Plate3_VTP50469_F06,DbET6,258,169,10,0.0,False,0


The `dataProcess` function outputs a dataset containing the protein abundance for each protein and biological replicate pair. Using these abundances, we can compute the correlation of abundance between pairs of proteins across replicates.
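
The same correlation can be expressed by pivoting the long table into a subject-by-protein matrix. This is a minimal alternative sketch, not the notebook's own implementation: the helper name `correlation_from_long` and the toy values are assumptions, while the column names `SUBJECT`, `Protein`, and `LogIntensities` follow the `dataProcess` output.

```python
import pandas as pd

def correlation_from_long(abundance, proteins):
    """Pivot long-format abundances to subject x protein and correlate the columns."""
    subset = abundance[abundance["Protein"].isin(proteins)]
    wide = subset.pivot_table(
        index="SUBJECT",
        columns="Protein",
        values="LogIntensities",
        aggfunc="first",  # Use the first measurement per protein and subject
    )
    # Missing protein/subject pairs become NaN and are ignored pairwise by corr()
    return wide.corr()

# Hypothetical toy abundances; C_HUMAN is missing for subject 3
toy_abundance = pd.DataFrame({
    "SUBJECT": [1, 2, 3, 1, 2, 3, 1, 2],
    "Protein": ["A_HUMAN"] * 3 + ["B_HUMAN"] * 3 + ["C_HUMAN"] * 2,
    "LogIntensities": [10.0, 11.0, 12.0, 20.0, 22.0, 24.0, 5.0, 4.0],
})

toy_corr = correlation_from_long(toy_abundance, ["A_HUMAN", "B_HUMAN", "C_HUMAN"])
print(toy_corr.round(2))
```

Pivoting avoids the nested Python loops, which may matter when the abundance table has more than a million rows as in this dataset.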

In [16]:
import numpy as np
def calculate_correlation_matrix(pvalue_df, protein_level_summary):
    data = {}
    subjects = protein_level_summary['SUBJECT'].unique()
    for protein in pvalue_df['Protein'].unique():
        data[protein] = []
        protein_level_df = protein_level_summary[protein_level_summary['Protein'] == protein]
        for subject in subjects:
            if subject in protein_level_df['SUBJECT'].values:
                protein_level_df_subject = protein_level_df[protein_level_df['SUBJECT'] == subject]
                data[protein].append(protein_level_df_subject['LogIntensities'].iloc[0])
            else:
                data[protein].append(np.nan)
    df = pd.DataFrame(data)
    corrM = df.corr() 
    return corrM

corr_matrix = calculate_correlation_matrix(pvalue_logfc_df, protein_abundance_df)
corr_matrix

Unnamed: 0,BRD2_HUMAN,BRD3_HUMAN,BRD4_HUMAN,CEBPZ_HUMAN,CRNL1_HUMAN,DAZP1_HUMAN,FUBP2_HUMAN,FUBP3_HUMAN,GTPB4_HUMAN,HEAT3_HUMAN,...,KRR1_HUMAN,NOP2_HUMAN,PP1R8_HUMAN,PRP18_HUMAN,QKI_HUMAN,TIA1_HUMAN,UTP15_HUMAN,WDR12_HUMAN,WDR43_HUMAN,WDR75_HUMAN
BRD2_HUMAN,1.0,0.797708,0.879262,0.01985,-0.207924,0.598796,0.515316,0.557954,-0.09377,-0.429264,...,-0.118297,-0.225047,0.621668,0.255623,0.479966,0.544447,-0.466654,-0.26525,-0.253804,-0.157479
BRD3_HUMAN,0.797708,1.0,0.866766,-0.271567,-0.104151,0.312657,0.307522,0.226802,-0.387214,-0.175765,...,-0.192796,-0.302674,0.422305,0.483052,0.660086,0.185024,-0.322859,-0.183338,-0.316502,-0.223405
BRD4_HUMAN,0.879262,0.866766,1.0,-0.027827,-0.155598,0.552971,0.43537,0.467304,-0.034894,-0.385975,...,-0.068881,-0.274753,0.472644,0.275248,0.499226,0.489967,-0.423467,-0.236744,-0.388817,-0.146948
CEBPZ_HUMAN,0.01985,-0.271567,-0.027827,1.0,0.329054,0.36781,0.363832,0.413635,0.765247,-0.066058,...,0.575716,0.447506,0.250952,-0.262162,-0.105441,0.483017,0.20818,0.236918,0.353536,0.377506
CRNL1_HUMAN,-0.207924,-0.104151,-0.155598,0.329054,1.0,-0.035805,0.107018,0.026768,0.230088,0.231355,...,0.396803,0.527981,-0.047657,0.219325,0.15109,-0.183671,0.600379,0.558789,0.56693,0.510267
DAZP1_HUMAN,0.598796,0.312657,0.552971,0.36781,-0.035805,1.0,0.608002,0.743681,0.329881,-0.311158,...,0.199983,0.006942,0.595334,0.0626,0.372729,0.751419,-0.268644,-0.070602,-0.132739,0.042042
FUBP2_HUMAN,0.515316,0.307522,0.43537,0.363832,0.107018,0.608002,1.0,0.610191,0.168864,-0.218615,...,0.282158,0.194035,0.513343,0.237327,0.469106,0.589631,-0.048102,0.109012,0.165,0.105673
FUBP3_HUMAN,0.557954,0.226802,0.467304,0.413635,0.026768,0.743681,0.610191,1.0,0.419721,-0.322835,...,0.156421,0.0709,0.604436,0.042284,0.303556,0.790524,-0.295023,-0.032122,0.020199,0.132255
GTPB4_HUMAN,-0.09377,-0.387214,-0.034894,0.765247,0.230088,0.329881,0.168864,0.419721,1.0,-0.044,...,0.48497,0.288195,0.11246,-0.407856,-0.330488,0.534385,0.066697,0.030891,0.189041,0.265675
HEAT3_HUMAN,-0.429264,-0.175765,-0.385975,-0.066058,0.231355,-0.311158,-0.218615,-0.322835,-0.044,1.0,...,0.199914,0.096674,-0.343605,0.141172,-0.025871,-0.320925,0.445552,0.254046,0.246794,-0.046507


In [17]:
LOG_FC_FILTER = 0.25
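# Use assign() so we do not modify a filtered slice in place (avoids SettingWithCopyWarning)
pvalue_logfc_df = pvalue_logfc_df.assign(log2FC=pvalue_logfc_df['log2FC'].astype(float))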
logfc_proteins = pvalue_logfc_df[pvalue_logfc_df['log2FC'] > LOG_FC_FILTER]
logfc_proteins

Unnamed: 0,Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue,issue,MissingPercentage,ImputationPercentage
19890,BRD2_HUMAN,DMSO-DbET6,2.046185,0.114339,17.895836,260.0,0.0,0.0,,0.310067,0.0
19935,BRD3_HUMAN,DMSO-DbET6,3.333428,0.126571,26.336522,257.0,0.0,0.0,,0.252349,0.0
19980,BRD4_HUMAN,DMSO-DbET6,2.668935,0.101283,26.351317,257.0,0.0,0.0,,0.118121,0.0
41580,DAZP1_HUMAN,DMSO-DbET6,0.617508,0.099537,6.203818,260.0,2.15995e-09,2e-06,,0.286577,0.0
66060,FUBP2_HUMAN,DMSO-DbET6,0.291045,0.077226,3.768747,260.0,0.000203053,0.047824,,0.01745,0.0
66105,FUBP3_HUMAN,DMSO-DbET6,0.30041,0.069882,4.298798,260.0,2.430235e-05,0.008941,,0.095302,0.0
137745,PP1R8_HUMAN,DMSO-DbET6,0.358282,0.069056,5.188314,260.0,4.275951e-07,0.000235,,0.045638,0.0
140535,PRP18_HUMAN,DMSO-DbET6,0.485713,0.120799,4.020834,260.0,7.603588e-05,0.022122,,0.049664,0.0
145440,QKI_HUMAN,DMSO-DbET6,0.336985,0.05513,6.112567,260.0,3.570594e-09,3e-06,,0.0,0.0
196155,TIA1_HUMAN,DMSO-DbET6,0.476791,0.105361,4.525306,260.0,9.182867e-06,0.004129,,0.058389,0.0


In [20]:
CORRELATION_FILTER = 0.3
EVIDENCE_FILTER = 10

for node in ga.graph.nodes():
    node_properties = {
        'color': '#808080',
        'shape': 'Mrecord',
        'fontsize': 8
    }
    protein_id = hgnc_gene_name_to_uniprot_mapping.get(node)
    if not protein_id:
        ga.graph.add_node(node,
                          label=node,
                          **node_properties)
    elif protein_id in logfc_proteins['Protein'].values:
        node_properties['color'] = '#00FF00'
        logFC_value = round(pvalue_logfc_df[pvalue_logfc_df['Protein'] == protein_id]['log2FC'].iloc[0], 2)
        ga.graph.add_node(node,
                          label=f'{node}: {logFC_value} LogFC',
                          **node_properties)
    else:
        node_properties['color'] = '#FF0000'
        logFC_value = round(pvalue_logfc_df[pvalue_logfc_df['Protein'] == protein_id]['log2FC'].iloc[0], 2)
        ga.graph.add_node(node,
                          label=f'{node}: {logFC_value} LogFC',
                          **node_properties)

color = '#ff0000'
color_default = '#000000'
for edge in ga.graph.edges():
    params = {'color': color_default,
              'arrowhead': 'normal',
              'dir': 'forward'}
    # Annotate only edges whose endpoints both appear in the MSstats data
    if edge[0] in hgnc_gene_name_to_uniprot_mapping and \
        edge[1] in hgnc_gene_name_to_uniprot_mapping:
        uniprot_0 = hgnc_gene_name_to_uniprot_mapping[edge[0]]
        uniprot_1 = hgnc_gene_name_to_uniprot_mapping[edge[1]]
        correlation = round(corr_matrix.loc[uniprot_0, uniprot_1], 2)
        evidence_df = relations_table[(relations_table['agA_name'] == edge[0]) &
                                      (relations_table['agB_name'] == edge[1])]
        evidence = evidence_df['evidence_count'].iloc[0]
        # Green: enough evidence and sufficiently correlated; red otherwise
        if evidence >= EVIDENCE_FILTER and correlation > CORRELATION_FILTER:
            params['color'] = '#00ff00'
        else:
            params['color'] = color
        params['label'] = f'Correlation: {correlation}, Evidence: {evidence}'
    ga.graph.add_edge(edge[0], edge[1], **params)

ga.save_pdf(file_name='graph_annotations.pdf', prog='dot')
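
The edge-coloring rule in the cell above can also be factored into a small pure function, which makes the thresholds easier to test and tune. This is a sketch under the same defaults as `EVIDENCE_FILTER` and `CORRELATION_FILTER`; the function name `edge_color` is hypothetical.

```python
import math

def edge_color(evidence, correlation, min_evidence=10, min_correlation=0.3):
    """Return green for well-supported, correlated edges and red otherwise."""
    supported = evidence >= min_evidence and correlation > min_correlation
    return "#00ff00" if supported else "#ff0000"

# BRD2-BRD4: 113 pieces of evidence, correlation 0.88 in the tables above
print(edge_color(113, 0.88))
# A single piece of evidence is not enough, whatever the correlation
print(edge_color(1, 0.9))
# A missing correlation (NaN) never passes the threshold
print(edge_color(50, math.nan))
```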

### Example 2: Plotly Visualization

In [21]:
import math
import random
import networkx as nx
import plotly.graph_objects as go
from indra.databases import hgnc_client

In [22]:
def initialize_networkx_graph(subnetwork_relations, filter_bidirectional=False):
    """Return a networkx graph from the INDRA relations."""
    G = nx.DiGraph()
    
    # Get max evidence count of "Complex" INDRA statements for each pair of proteins.
    ev_counts = {}
    for entry in subnetwork_relations:
        if entry['data']['stmt_type'] == "Complex":
            pair = (entry['source_id'], entry['target_id'])
            count = int(entry['data']['evidence_count'])
            ev_counts[pair] = max(count, ev_counts.get(pair, 0))

    # Construct graph object
    for entry in subnetwork_relations:
        source = entry['source_id']
        target = entry['target_id']
        
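        if filter_bidirectional:
            # If there is a statement with opposite direction and more evidence
            # then we skip this one. Pairs without a "Complex" statement are
            # absent from ev_counts, so default to 0 instead of raising KeyError.
            if ev_counts.get((source, target), 0) < ev_counts.get((target, source), 0):
                continue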

        # Add nodes to graph
        source_name = hgnc_client.get_hgnc_name(source)
        target_name = hgnc_client.get_hgnc_name(target)
        G.add_node(source, label=source_name)
        G.add_node(target, label=target_name)
            
        # Add the edge to the graph
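        # Pairs without a "Complex" statement are absent from ev_counts,
        # so fall back to this entry's own evidence count.
        G.add_edge(
            source,
            target,
            evidence_count=ev_counts.get((source, target),
                                         int(entry['data']['evidence_count'])),
            belief=entry['data']['belief'],
            stmt_type=entry['data']['stmt_type']
        )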

    return G
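
The `filter_bidirectional` rule used above can be illustrated in isolation: a direction is dropped only when the opposite direction has strictly more evidence, so ties keep both. The helper name `keep_stronger_direction` and the toy counts are hypothetical.

```python
def keep_stronger_direction(ev_counts):
    """Drop (source, target) pairs whose reverse direction has strictly more evidence."""
    return {
        (source, target): count
        for (source, target), count in ev_counts.items()
        if count >= ev_counts.get((target, source), 0)
    }

# Hypothetical evidence counts keyed by (source, target)
toy_counts = {
    ("A", "B"): 5,
    ("B", "A"): 2,   # Weaker than A -> B, so it is dropped
    ("C", "D"): 3,
    ("D", "C"): 3,   # Tie: both directions are kept
}

filtered = keep_stronger_direction(toy_counts)
print(sorted(filtered))
```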

In [23]:
def find_communities(G, weight='evidence_count'):
    """Return the communities of a networkx graph using a custom weight attribute."""
    return nx.community.louvain_communities(G, weight=weight)
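
Louvain community detection is randomized, so repeated runs can assign nodes differently. A minimal sketch of a reproducible run on a toy graph follows; it passes a fixed `seed` and uses hypothetical node names and weights, with `evidence_count` as the edge weight as in the function above.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two tightly connected triangles joined by one weak edge (toy data)
G_toy = nx.Graph()
strong_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                ("D", "E"), ("E", "F"), ("D", "F")]
for u, v in strong_edges:
    G_toy.add_edge(u, v, evidence_count=10)
G_toy.add_edge("C", "D", evidence_count=1)

# A fixed seed makes the partition repeatable across runs
toy_communities = louvain_communities(G_toy, weight="evidence_count", seed=42)
print(sorted(sorted(c) for c in toy_communities))
```

Heavier edges pull their endpoints into the same community, which is why weighting by evidence count groups proteins that are frequently reported together.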

In [24]:
def generate_node_initial_positions(G, communities):
    """Return node positions of a networkx graph based on communities."""
    initial_pos = {}
    circle_r = 1
    big_r = 1
    pi = math.pi
    centers = [(math.cos(2 * pi / len(communities) * x) * big_r, math.sin(2 * pi / len(communities) * x) * big_r)
               for x in range(0, len(communities))]
    for index, nodes in enumerate(communities):
        for node in nodes:
            alpha = 2 * math.pi * random.random()
            r = circle_r * math.sqrt(random.random())
            x = r * math.cos(alpha) + centers[index][0]
            y = r * math.sin(alpha) + centers[index][1]
            initial_pos[node] = [x, y]
            nx.set_node_attributes(G, {node: index}, name='community')
    
    return initial_pos
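
The square root in `r = circle_r * math.sqrt(random.random())` is what spreads nodes evenly over each community's disk: the area within radius r grows with r squared, so sampling r directly would crowd points near the center. The sketch below isolates that sampling step; the helper name `sample_in_disk` and the explicit `random.Random` instance are assumptions added for reproducibility.

```python
import math
import random

def sample_in_disk(center, radius, rng):
    """Return a point drawn uniformly from the disk around `center`."""
    angle = 2 * math.pi * rng.random()
    r = radius * math.sqrt(rng.random())  # sqrt keeps the density uniform over area
    return (center[0] + r * math.cos(angle), center[1] + r * math.sin(angle))

rng = random.Random(0)  # Seeded generator so the layout can be reproduced
center = (1.0, 0.0)
points = [sample_in_disk(center, 1.0, rng) for _ in range(1000)]

# Every point lies within the unit disk around the center
distances = [math.hypot(x - center[0], y - center[1]) for x, y in points]
print(max(distances) <= 1.0)
```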

In [25]:
def apply_layout_to_graph(G, initial_pos, k=50, iterations=100, weight='evidence_count'):
    """Apply custom layout positions to a networkx graph."""
    pos = nx.spring_layout(G, weight=weight, k=k / math.sqrt(len(G.nodes)), pos=initial_pos, iterations=iterations)
    for node in G.nodes():
        x = pos[node][0]
        y = pos[node][1]
        nx.set_node_attributes(G, {node: [x, y]}, name='pos')
    return G

In [26]:
def construct_networkx_graph(subnetwork_relations):
    """Return a custom networkx graph from INDRA subnetwork relations."""
    G = initialize_networkx_graph(subnetwork_relations)
    communities = find_communities(G)
    initial_pos = generate_node_initial_positions(G, communities)
    apply_layout_to_graph(G, initial_pos)
    return G


G = construct_networkx_graph(subnetwork_relations)

In [27]:
def construct_arrows(G):
    """Return list of directed edges for network visualization."""
    arrow_list = []
    for edge in G.edges():
        x0, y0 = G.nodes[edge[0]]['pos']
        x1, y1 = G.nodes[edge[1]]['pos']

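        # Plotly draws the arrow from the tail (ax, ay) to the head (x, y),
        # so the head must sit on the target node of the directed edge.
        arrow = go.layout.Annotation(dict(
            x=x1,
            y=y1,
            xref="x", yref="y",
            showarrow=True,
            axref="x", ayref='y',
            ax=x0,
            ay=y0,
            arrowhead=3,
            arrowwidth=min(2, G.edges[edge]['evidence_count']),
            arrowcolor='lightgreen')
        )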

        arrow_list.append(arrow)
    return arrow_list

In [28]:
def construct_arrow_labels(G):
    """Return custom edge labels for network visualization."""
    mnode_x, mnode_y, mnode_txt = [], [], []
    for edge in G.edges():
        x0, y0 = G.nodes[edge[0]]['pos']
        x1, y1 = G.nodes[edge[1]]['pos']
        name0 = G.nodes[edge[0]]['label']
        name1 = G.nodes[edge[1]]['label']

        mnode_x.extend([(x0 + x1) / 2])
        mnode_y.extend([(y0 + y1) / 2])
        mnode_txt.extend([f'{name0}->{name1} evidence count: {G.edges[edge]["evidence_count"]}'])

    mnode_trace = go.Scatter(
        x=mnode_x, y=mnode_y,
        mode="markers",
        showlegend=False,
        hovertext=mnode_txt,
        hovertemplate="Edge %{hovertext}<extra></extra>",
        marker=go.scatter.Marker(opacity=0)
    )
    
    return mnode_trace

In [29]:
def construct_node_trace(G):
    """Return custom nodes for network visualization."""
    node_x = []
    node_y = []
    node_colors = []
    for node in G.nodes():
        x, y = G.nodes[node]['pos']
        node_x.append(x)
        node_y.append(y)
        # Color each node by the community it was assigned to
        node_colors.append(G.nodes[node]['community'])

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        hoverinfo='text',
        text=[data['label'] for node, data in G.nodes(data=True)],
        textposition="bottom center",
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=node_colors,
            size=30,
            colorbar=dict(
                thickness=15,
                # A nested title dict replaces the deprecated `titleside` argument
                title=dict(text='Cluster ID', side='right'),
                xanchor='left'
            ),
            line_width=2))

    return node_trace

In [39]:
def show_plotly_graph(nodes, arrows):
    """Visualize the network based on the list of nodes and arrows."""
    fig = go.Figure(data=nodes,
                    layout=go.Layout(
                        title='<br>Network graph made with Python',
                        font=dict(
                            family="Courier New, monospace",
                            size=10,
                            color="Black"
                        ),
                        annotations=arrows,
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )
    fig.show(renderer="notebook")

In [35]:
def create_plotly_graph(G):
    """Generate and visualize a custom network from a networkx graph."""
    edges = construct_arrows(G)
    edge_midpoints = construct_arrow_labels(G)
    nodes = construct_node_trace(G)
    show_plotly_graph([edge_midpoints, nodes], edges)

In [None]:
create_plotly_graph(G)

In [32]:
G.nodes['13575']

{'label': 'BRD4',
 'community': 0,
 'pos': [0.9528133707816417, 0.19311777236587424]}