# Network propagation development

This notebook currently covers how results from an AnnData/MuData object can be added to a cpr_graph to set up network-based inference. The general strategy is to:
1. Pull out a pd.DataFrame containing feature-level measures of interest along with feature metadata.
2. Map these features onto the species ids in an sbml_dfs model by shared ontology, disambiguate the matches (to handle multiple features mapping to the same s_id), and embed the s_id-indexed results in the sbml_dfs as a table in species_data.
3. Pass the attributes of interest from the sbml_dfs model into the graph.
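
Step 2 can be illustrated with a toy example in plain pandas. This is only a sketch of the idea, not napistu's implementation: the `features` and `species_identifiers` tables are made up, and the disambiguation rule (averaging numeric results and joining feature ids per s_id) is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical feature-level results keyed by a UniProt identifier.
features = pd.DataFrame(
    {
        "feature_id": [0, 1, 2, 3],
        "uniprot": ["P1", "P2", "P3", "P4"],
        "effect": [0.10, 0.30, -0.20, 0.50],
    }
)

# Hypothetical species identifiers: two proteins belong to the same species (S2).
species_identifiers = pd.DataFrame(
    {
        "s_id": ["S1", "S2", "S2"],
        "ontology": ["uniprot", "uniprot", "uniprot"],
        "identifier": ["P1", "P2", "P3"],
    }
)

# Match features to species through the shared ontology.
matched = features.merge(
    species_identifiers.query("ontology == 'uniprot'"),
    left_on="uniprot",
    right_on="identifier",
    how="inner",
)

# Disambiguate many-to-one matches so that each s_id has exactly one row.
# Averaging the numeric result and joining feature ids is an assumed rule.
species_results = matched.groupby("s_id").agg(
    effect=("effect", "mean"),
    feature_id=("feature_id", lambda ids: ",".join(str(i) for i in sorted(ids))),
)

print(species_results)
```

Features without a matching identifier (here `P4`) drop out of the s_id-indexed table, which is why only a small fraction of features is retained when the pathway model is small.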

This example uses real MuData results but only a small sbml_dfs object which has uniprot but not ENSG identifiers. This makes things easy to work with but a genome-scale graph will need to be used for a real analysis.

Reflecting on the current functionality:

- (1) is not too hard, but the interface can probably be cleaned up; we should have a function which applies 1-3 in a single call.
- (2) is in pretty good shape following a LOT of new functionality being added to napistu-py for handling many-to-one mappings and wide/nested formats for identifiers.
- (3) will need some better functionality since the reaction_attrs syntax is pretty cryptic, but the core functionality is all there.
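
One way to make the attribute syntax less cryptic would be a small helper that expands a compact `{table: [variables]}` mapping into the nested dictionary passed to the graph later in this notebook. The helper below, `build_species_graph_attrs`, is hypothetical and not part of napistu; it assumes every variable uses the same `trans` value and keeps its own name as the vertex attribute name.

```python
from typing import Dict, Iterable


def build_species_graph_attrs(
    table_variables: Dict[str, Iterable[str]], trans: str = "identity"
) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Expand {table: [variables]} into the nested species-attribute spec."""
    species_attrs: Dict[str, Dict[str, str]] = {}
    for table, variables in table_variables.items():
        for variable in variables:
            # Attribute names must be unique across tables.
            if variable in species_attrs:
                raise ValueError(f"'{variable}' is defined in more than one table")
            species_attrs[variable] = {
                "table": table,
                "variable": variable,
                "trans": trans,
            }
    return {"species": species_attrs}


graph_attrs = build_species_graph_attrs(
    {"mudata_varm_results": ["LFs5"], "var_level_results": ["effect_case"]}
)
print(graph_attrs)
```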

Next steps will be to develop basic personalized PageRank (PPR) functionality.

In [1]:
import os

import pandas as pd

from napistu.ingestion import sbml
from napistu import sbml_dfs_core
from napistu import mechanism_matching

# local utils (these should be refactored and moved elsewhere, e.g., into napistu-py)
import utils
import test_utils

# setup logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# paths
PROJECT_DIR =  os.path.expanduser("~/Desktop/Forny_2023_data")
SUPPLEMENTAL_DATA_DIR = os.path.join(PROJECT_DIR, "input")
CACHE_DIR = os.path.join(PROJECT_DIR, "cache")

# Define the path to save hyperparameter scan results
MOFA_PARAM_SCAN_MODELS_PATH = os.path.join(CACHE_DIR, "mofa_param_scan_h5mu")
# Final results 
OPTIMAL_MODEL_H5MU_PATH = os.path.join(CACHE_DIR, "mofa_optimal_model.h5mu")

DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7


In [2]:
PATH_TO_TEST_DATA = os.path.expanduser("~/Desktop/GITHUB/napistu/lib/napistu-py/src/tests/test_data")
example_pathway = os.path.join(PATH_TO_TEST_DATA, "reactome_glucose_metabolism.sbml")
assert os.path.exists(example_pathway)

In [3]:
sbml_dfs = sbml_dfs_core.SBML_dfs(sbml.SBML(example_pathway))

species_identifiers = sbml_dfs.get_identifiers("species").query("bqb == 'BQB_IS'").query("ontology != 'reactome'")

INFO:napistu.utils:creating an edgelist linking index levels s_id, entry and linking it to levels defined by ontology, identifier
DEBUG:napistu.utils:label is not defined in table_schema; adding a constant (1)


In [4]:
# let's load the Forny results so we can try adding a few different types of tables to the sbml_dfs

import mudata as md
mdata = md.read_h5mu(OPTIMAL_MODEL_H5MU_PATH)


DEBUG:h5py._conv:Creating converter from 3 to 5
  self._update_attr("var", axis=0, join_common=join_common)
  self._update_attr("obs", axis=1, join_common=join_common)


In [5]:
# results from var
var_level_results = mdata["proteomics"].var[["effect_case", "qval_case"]].copy()
var_level_results.index.name = "feature_id"
var_level_results['uniprot'] = var_level_results.index.to_series()

mechanism_matching.bind_wide_results(
    sbml_dfs,
    var_level_results,
    "var_level_results",
    ontologies = {"uniprot"},
    dogmatic = False,
    verbose = True
)

sbml_dfs.species_data["var_level_results"]

INFO:napistu.sbml_dfs_utils:Running in non-dogmatic mode - genes, transcripts, and proteins will be merged if possible.
  promiscuous_component_identifiers = pd.Series(
DEBUG:napistu.mechanism_matching:Validated ontology columns: {'uniprot'}
INFO:napistu.mechanism_matching:Using columns as results: ['qval_case', 'effect_case', 'feature_id']
DEBUG:napistu.mechanism_matching:Final long format shape: (4788, 5)
DEBUG:napistu.mechanism_matching:Matching 4788 features to 98 species for ontology uniprot
INFO:napistu.mechanism_matching:Found 57 total matches across 1 ontologies
INFO:napistu.mechanism_matching:1.3% of feature_ids are present one or more times in the output (57/4513)
INFO:napistu.mechanism_matching:7 s_id(s) map to more than one feature_id.
INFO:napistu.mechanism_matching:Examples of s_id mapping to multiple feature_ids (showing up to 3):
s_id       s_name           
S00000016  SLC25A12,13               [541, 4488]
S00000031  enolase dimer        [833, 906, 1034]
S00000050  aldo

s_id,effect_case,qval_case,feature_id
S00000006,0.064783,0.996212,1512
S00000012,0.059317,0.996212,4425
S00000013,0.290775,0.996212,2156
S00000015,0.07957,0.996212,1513
S00000016,0.021029,0.996212,4488541
S00000019,0.047774,0.996212,1767
S00000022,0.202818,0.996212,984
S00000031,0.165435,0.99666,1034833906
S00000033,0.083939,0.996212,730
S00000036,0.217395,0.996212,2660


In [6]:
# merge factors with metadata
mofa_dfs_dict = utils.split_varm_by_modality(mdata)

mofa_df_list = []
for modality, varm_df in mofa_dfs_dict.items():

    # the index name is the modality's identifier ontology (e.g., uniprot)
    modality_pk = varm_df.index.name
    # keep only the latent factor columns
    factor_cols = [col for col in varm_df.columns if col.startswith("LF")]
    modality_df = varm_df[factor_cols].copy()
    modality_df.index.name = "feature_id"
    modality_df[modality_pk] = modality_df.index.to_series()
    modality_df["modality"] = modality

    mofa_df_list.append(modality_df)

mofa_df = pd.concat(mofa_df_list, axis=0)

mofa_df.groupby("modality").sample(5)


feature_id,LFs1,LFs2,LFs3,LFs4,LFs5,LFs6,LFs7,LFs8,LFs9,LFs10,...,LFs24,LFs25,LFs26,LFs27,LFs28,LFs29,LFs30,ensembl_gene,modality,uniprot
P10619,0.157525,-0.001121,-0.006856,-0.000532,-0.032077,0.014163,-0.055452,0.069446,0.041652,-0.002323,...,0.03924,0.026135,-0.008246,0.136725,-0.002622,-0.045592,0.024062,,proteomics,P10619
O94817,-0.021583,0.053493,0.065295,0.000293,0.014501,0.005027,-0.008566,-0.154248,0.013573,0.434847,...,0.011772,-0.0005,0.000609,0.02527,0.001872,0.000564,-9.7e-05,,proteomics,O94817
Q9Y320,0.031353,-0.018147,-0.420229,0.001172,0.001451,0.00227,-0.007167,0.114271,0.006712,0.154855,...,-0.012411,-0.007562,0.004903,0.085166,0.009934,-0.00294,-0.004994,,proteomics,Q9Y320
Q13011,0.115146,0.017297,0.104002,-0.002252,-0.02376,0.031539,-0.001304,0.139995,-0.000441,-0.003144,...,0.053806,-0.005177,0.001855,-0.025064,0.019595,-0.003953,0.009272,,proteomics,Q13011
Q9Y4F1,0.0721,-0.123391,0.008789,0.000759,0.063862,-0.146324,-0.043278,0.144303,-0.086098,0.062231,...,-0.012512,0.009882,0.003497,0.14768,0.002049,-0.006441,-0.001216,,proteomics,Q9Y4F1
ENSG00000075856,-0.006122,0.000642,-7e-05,0.017013,-0.041001,0.020535,0.066069,0.006857,0.048364,-0.000362,...,0.050571,0.023285,0.068142,0.151724,-0.050139,-0.092024,0.230439,ENSG00000075856,transcriptomics,
ENSG00000120948,-0.03341,0.034386,-2.5e-05,0.011871,0.001928,0.000794,-0.03266,-0.009901,3.3e-05,3e-05,...,-0.112563,-0.11295,-0.016824,0.065135,0.009042,0.003518,-0.008894,ENSG00000120948,transcriptomics,
ENSG00000148153,-0.016885,0.023966,1.4e-05,0.101808,0.049753,-0.01005,-0.053605,0.000394,0.002498,0.000115,...,-0.041581,0.052418,-0.034114,0.004692,0.052697,-0.026747,0.056356,ENSG00000148153,transcriptomics,
ENSG00000156256,-0.031345,0.023432,4.8e-05,0.011964,-0.013495,0.010553,-0.010285,-0.011395,-0.013437,-0.000213,...,-0.012795,-0.024091,0.124111,-0.004164,0.108016,-0.049354,0.32345,ENSG00000156256,transcriptomics,
ENSG00000179222,-0.149252,-0.000717,0.000242,-0.032592,-0.024025,-0.084776,0.048855,0.008628,0.04166,0.000111,...,-0.147715,0.260968,0.066899,-0.259053,-0.035703,-0.156871,0.110098,ENSG00000179222,transcriptomics,


In [7]:
mechanism_matching.bind_wide_results(
    sbml_dfs,
    mofa_df,
    "mudata_varm_results",
    ontologies = {"uniprot", "ensembl_gene"},
    dogmatic = False,
    verbose = True
)

sbml_dfs.species_data["mudata_varm_results"]

INFO:napistu.sbml_dfs_utils:Running in non-dogmatic mode - genes, transcripts, and proteins will be merged if possible.
  promiscuous_component_identifiers = pd.Series(
DEBUG:napistu.mechanism_matching:Validated ontology columns: {'ensembl_gene', 'uniprot'}
INFO:napistu.mechanism_matching:Using columns as results: ['LFs9', 'LFs14', 'LFs10', 'LFs7', 'LFs20', 'LFs27', 'LFs25', 'LFs23', 'feature_id', 'LFs17', 'LFs1', 'LFs15', 'LFs4', 'LFs26', 'LFs12', 'modality', 'LFs2', 'LFs16', 'LFs30', 'LFs21', 'LFs22', 'LFs8', 'LFs19', 'LFs28', 'LFs18', 'LFs29', 'LFs11', 'LFs6', 'LFs24', 'LFs13', 'LFs3', 'LFs5']
DEBUG:napistu.mechanism_matching:Final long format shape: (13922, 34)
DEBUG:napistu.mechanism_matching:Matching 4788 features to 98 species for ontology uniprot
INFO:napistu.mechanism_matching:Found 57 total matches across 1 ontologies
INFO:napistu.mechanism_matching:0.4% of feature_ids are present one or more times in the output (57/13647)
INFO:napistu.mechanism_matching:7 s_id(s) map to more

s_id,LFs1,LFs2,LFs3,LFs4,LFs5,LFs6,LFs7,LFs8,LFs9,LFs10,...,LFs23,LFs24,LFs25,LFs26,LFs27,LFs28,LFs29,LFs30,modality,feature_id
S00000006,0.05153,0.036362,-0.041906,-0.001347,-0.013508,0.001565,-0.021276,-0.244355,-0.031964,0.057272,...,-0.191047,0.075975,-0.006429,0.009816,0.074251,0.010391,0.000629,-0.004023,proteomics,10646
S00000012,0.034465,0.115698,-0.047133,-0.001525,-0.006179,0.015768,0.042666,0.118975,0.002245,0.035891,...,0.15304,-0.00704,0.000832,0.005247,0.003451,0.000202,0.001349,-0.00289,proteomics,13559
S00000013,0.111319,-0.002571,-0.073818,-0.000615,-0.066941,0.040771,-0.002893,0.12033,-0.0244,0.337955,...,-0.075491,-0.005003,0.001831,-0.016633,-0.083339,-0.000686,0.003954,-0.00243,proteomics,11290
S00000015,0.01002,0.043131,-0.000815,1e-05,-0.032613,0.026158,0.044396,0.171346,0.001049,0.003718,...,-0.071206,0.054874,-0.052803,-0.000263,-0.014903,0.000827,0.036917,-0.002143,proteomics,10647
S00000016,0.099801,0.064125,-0.015871,-0.001283,-0.085869,0.002985,0.009716,0.216461,-0.005344,0.003299,...,-0.044812,0.032438,-0.001762,0.001995,-0.094416,-0.005864,-0.006293,-0.005239,proteomics,136229675
S00000019,0.064609,0.006714,-0.110616,-0.017464,-0.020428,-0.032641,-0.003856,0.083932,0.000541,-0.054995,...,-0.121061,-0.032175,-0.006698,-0.00278,0.095618,0.002672,0.074969,-0.004676,proteomics,10901
S00000022,0.03489,0.007629,-0.122095,-0.001708,-0.076321,0.06122,-0.055165,-0.110623,0.014309,0.324787,...,-0.065491,-0.016958,-0.011363,-0.004861,-0.308404,0.006624,-0.031057,0.000834,proteomics,10118
S00000031,0.014932,-0.100028,-0.056794,-0.015458,-0.039523,0.024284,-0.102462,-0.278852,-0.129555,0.008516,...,-0.225068,0.005913,-0.00828,-0.010984,0.235486,0.014511,0.010286,0.00522,proteomics,10040101689967
S00000033,0.055625,-0.149947,0.003432,-0.031034,-0.020317,0.054525,-0.158043,-0.283876,-0.124461,0.013506,...,-0.285335,0.016808,-0.004187,-0.003736,0.322549,0.01849,0.048649,-0.000388,proteomics,9864
S00000036,0.00607,-0.180905,0.004306,0.004061,0.082143,0.124391,0.241261,0.136064,-0.099635,-0.021902,...,-0.151604,0.009547,-0.011912,-0.036209,-0.212754,0.002011,0.004901,-0.001723,proteomics,11794


In [8]:
from napistu.network import net_create

# now we can pass these species_data attributes to the graph

reaction_graph_attrs = {
    "species": {
        "LFs5": {
            "table": "mudata_varm_results",
            "variable": "LFs5",
            "trans": "identity",
        },
        "effect_case": {
            "table": "var_level_results",
            "variable": "effect_case",
            "trans": "identity",
        },
    },
}

cpr_graph = net_create.create_cpr_graph(
    sbml_dfs,
    directed=True,
    graph_type="regulatory"
)

# add species attributes
# TO DO - this is definitely not a utility function
graph_w_annotations = net_create._add_graph_species_attribute(
    cpr_graph,
    sbml_dfs,
    species_graph_attrs = reaction_graph_attrs,
)


INFO:napistu.network.net_create:Organizing all network nodes (compartmentalized species and reactions)
INFO:napistu.network.net_create:Formatting edges as a regulatory graph
INFO:napistu.network.net_create:Formatting 250 reactions species as tiered edges.
INFO:napistu.network.net_create:Adding additional attributes to edges, e.g., # of children and parents.
INFO:napistu.network.net_create:Done preparing regulatory graph
INFO:napistu.network.net_create:Adding reversibility and other meta-data from reactions_data
INFO:napistu.network.net_create:No reactions annotations provided in "graph_attrs"; returning None
INFO:napistu.network.net_create:Creating reverse reactions for reversible reactions on a directed graph
INFO:napistu.network.net_create:Formatting cpr_graph output
INFO:napistu.network.net_create:Adding meta-data from species_data
INFO:napistu.network.net_create:Adding new attribute LFs5 to vertices
INFO:napistu.network.net_create:Adding new attribute effect_case to vertices


In [9]:
from napistu import utils as napistu_utils

napistu_utils.style_df(graph_w_annotations.get_vertex_dataframe().sort_values("LFs5").head(5))

vertex ID,name,node_name,node_type,LFs5,effect_case
26,species_372469,"SLC25A12,13 [mitochondrial inner membrane]",species,-0.086,0.021
96,species_6798333,BPGM dimer [cytosol],species,-0.084,-0.031
36,species_70499,pyruvate carboxylase holoenzyme [mitochondrial matrix],species,-0.076,0.203
18,species_376856,SLC25A11 homodimer [mitochondrial inner membrane],species,-0.067,0.291
58,species_70594,GOT2 dimer [mitochondrial matrix],species,-0.051,0.043


## Network Propagation

Here we'll implement a workflow for applying network propagation to a cpr_graph's vertex attributes.
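
For reference, personalized PageRank can be written as a short power iteration. The NumPy sketch below is not the implementation behind `utils.personalized_pagerank_by_attribute`, which is not shown in this notebook. The function name `personalized_pagerank`, the dense adjacency matrix input, and the convention that dangling vertices jump according to the reset distribution are all assumptions made for illustration.

```python
import numpy as np


def personalized_pagerank(adjacency, reset, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for personalized PageRank on a dense adjacency matrix.

    adjacency[i, j] > 0 means there is an edge from vertex i to vertex j.
    reset holds non-negative weights; it is normalized to sum to 1.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    reset = np.asarray(reset, dtype=float)
    if np.any(reset < 0):
        raise ValueError("reset weights must be non-negative")
    if reset.sum() == 0:
        raise ValueError("reset weights must not all be zero")
    reset = reset / reset.sum()

    # Row-normalize to get transition probabilities. Vertices without
    # outgoing edges (dangling) keep an all-zero row and are handled below.
    out_degree = adjacency.sum(axis=1)
    dangling = out_degree == 0
    transition = np.divide(
        adjacency,
        out_degree[:, None],
        out=np.zeros_like(adjacency),
        where=~dangling[:, None],
    )

    scores = reset.copy()
    for _ in range(max_iter):
        # Mass sitting on dangling vertices is redistributed via the reset vector.
        dangling_mass = scores[dangling].sum()
        updated = damping * (scores @ transition + dangling_mass * reset)
        updated += (1 - damping) * reset
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# A three-vertex chain 0 -> 1 -> 2 with all reset mass on vertex 0.
adjacency = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
scores = personalized_pagerank(adjacency, reset=[1.0, 0.0, 0.0])
print(scores.round(3))
```

Because the reset vector is treated as a probability distribution, it has to be non-negative and have a positive sum, which is what the checks at the top of the function enforce.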

In [10]:
test_utils.test_personalized_pagerank_by_attribute_basic()
test_utils.test_personalized_pagerank_by_attribute_no_uniform()
test_utils.test_personalized_pagerank_by_attribute_missing_and_negative()
test_utils.test_personalized_pagerank_by_attribute_additional_args_invalid()
test_utils.test_personalized_pagerank_by_attribute_all_missing()
test_utils.test_personalized_pagerank_by_attribute_all_zero()


In [13]:
graph_w_annotations.vs.attributes()

['name', 'node_name', 'node_type', 'LFs5', 'effect_case']

In [14]:
RESET_PROPORTIONAL_TO = "effect_case"

utils.personalized_pagerank_by_attribute(
    graph_w_annotations,
    RESET_PROPORTIONAL_TO
)

ValueError: Attribute 'effect_case' contains negative values.
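
The error is expected: `effect_case` is a signed effect size (for example, -0.031 for the BPGM dimer in the vertex table above), while reset weights have to be non-negative. One possible fix, assuming the magnitude of the effect is what should drive the reset probability, is to take absolute values before propagation. The helper below is a hypothetical sketch operating on a plain list of attribute values; treating missing values (None or NaN) as zero weight is also an assumption.

```python
import numpy as np


def attribute_to_reset_weights(values):
    """Convert signed vertex attributes into non-negative reset weights."""
    # Missing attributes may appear as None or NaN; map both to NaN first.
    weights = np.array([np.nan if v is None else v for v in values], dtype=float)
    # Missing values get zero weight; signed values contribute their magnitude.
    weights = np.abs(np.nan_to_num(weights, nan=0.0))
    total = weights.sum()
    if total == 0:
        raise ValueError("all attribute values are missing or zero")
    return weights / total


weights = attribute_to_reset_weights([0.021, -0.031, None, 0.203])
print(weights)
```

Whether an absolute value, a squared value, or a split into separate up and down signals is appropriate depends on the analysis question.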