# Network propagation development

The notebook currently covers how results from a AnnData/MuData object can be added to a cpr_graph to setup network-based inference. The general strategy is to:
1. pull out a pd.DataFrame containing feature-level measures of interest along with feature metadata.
2. These are then mapped on the species ids in an sbml_dfs model by shared on ontology, disambiguated (to handle mapping of multiple features to the same s_id), and s_id-indexed results are embedded in the sbml_dfs as a table in species_data
3. attributes of interrest are then passed from the sbml_dfs model into the graph.

This example uses real MuData results but only a small sbml_dfs object which has uniprot but not ENSG identifiers. This makes things easy to work with but a genome-scale graph will need to be used for a real analysis.

Reflecting on the current functionality,

(1) is not too hard but the interface can probably be cleaned up as we should have a function which applies 1-3 in a single call.
(2) is in pretty good shape following a LOT of new functionality being added to napistu-py for handling many-to-one mappings and wide/nested formats for identifiers.
(3) will need some better functionality since the reaction_attrs syntax is pretty cryptic but the core functionality is all there.

Next, steps will be develop basic PPR functionality.

In [30]:
import os

import mudata as md
import pandas as pd

from napistu.ingestion import sbml
from napistu import sbml_dfs_core
from napistu import mechanism_matching
from napistu.network import net_propagation
from napistu import utils as napistu_utils
from napistu.network import net_utils
from napistu.gcs import downloads

# local library
import regulation

# setup logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# paths
PROJECT_DIR =  os.path.expanduser("~/Desktop/DATA/Forny2023")
SUPPLEMENTAL_DATA_DIR = os.path.join(PROJECT_DIR, "input")
CACHE_DIR = os.path.join(PROJECT_DIR, "cache")
NAPISTU_DATA_DIR = os.path.expanduser("~/Desktop/DATA/napistu_data")

# Define the path to save hyperparameter scan results
MOFA_PARAM_SCAN_MODELS_PATH = os.path.join(CACHE_DIR, "mofa_param_scan_h5mu")
# Final results 
OPTIMAL_MODEL_H5MU_PATH = os.path.join(CACHE_DIR, "mofa_optimal_model.h5mu")

In [31]:
sbml_dfs_path = downloads.load_public_napistu_asset(
    asset = "human_consensus",
    data_dir = NAPISTU_DATA_DIR,
    subasset = "sbml_dfs"
)

cpr_graph_path  = downloads.load_public_napistu_asset(
    asset = "human_consensus",
    data_dir = NAPISTU_DATA_DIR,
    subasset = "regulatory_graph"
)

identifiers_path = downloads.load_public_napistu_asset(
    asset = "human_consensus",
    data_dir = NAPISTU_DATA_DIR,
    subasset = "identifiers"
)


In [32]:
# ~2 min load
sbml_dfs = napistu_utils.load_pickle(sbml_dfs_path)
cpr_graph = napistu_utils.load_pickle(cpr_graph_path)
identifiers = pd.read_csv(identifiers_path, delimiter = "\t")

In [33]:
#net_utils.validate_assets(
#    sbml_dfs = sbml_dfs,
#    cpr_graph = cpr_graph,
#    # TODO - it should really possible for this to be optional
#    precomputed_distances = None,
#    identifiers = identifiers
#    )

In [34]:
# lets load the Forny results so we trying adding a few different types of tables to the sbml_dfs
mdata = md.read_h5mu(OPTIMAL_MODEL_H5MU_PATH)

  self._update_attr("var", axis=0, join_common=join_common)
  self._update_attr("obs", axis=1, join_common=join_common)


## Adding genome-scale datasets

To use an 'omic dataset in Napistu, we want to:
1. mount the dataset on the pathway `sbml_dfs`. This entails:
    - matching systematic identifiers between the dataset and pathway to connect 'omic features to Napistu `species`.
    - resolve many-to-1 mappings (e.g., where 2+ features match the same species).
    - create a table with unique species ids as the index with variable from the dataset.
    - add this to the `species_data` attriute of the `sbml_dfs`. Multiple tables and/or datasets can be added to `species_data`.
2. pass variables from one or more `species_data` tables to a `cpr_graph`'s vertices with `net_create._add_graph_species_attribute`. Variables can be transformed (e.g., to make them non-negative for personalized pagerank) at this point (or this could be done before step (1)).
3. use these verterx attributes for downstream analysis (e.g., using it in the reset_proportional_to parameters of PPR).

Step (1) needs to be adapted depending on how datasets are organized. The currently, supported inputs are:
- `pd.DataFrame` objects which including 1+ systematic identifiers
- `anndata.AnnData` objects where the `var` table provided identifiers, and feature-level summaries come from either the `var`, `varm` or `X` tables.
- `mudata.MuData` objects containing multiple `AnnData` objects where `var` and `varm` attributes can be defined across multiple datasets.

We'll provide an examples using each of these inputs

### Loading results from a pd.DataFrame

In [36]:
SUPPLEMENTAL_DATA_DIR = os.path.join(PROJECT_DIR, "input")
VZ_LMM_RESULTS = {
    "transcriptomics": "diff_exp_lmm_rnaseq_pathwayact_all_annotout.txt",
    "proteomics": "diff_exp_lmm_prot_pathwayact_all_annotout.txt"
}

sideloaded_data_path = {x : os.path.join(SUPPLEMENTAL_DATA_DIR, y) for x, y in VZ_LMM_RESULTS.items()}

assert all([os.path.isfile(x) for x in sideloaded_data_path.values()])

sideloaded_data = {
    x : pd.read_csv(y, delimiter= "\t") for x, y in sideloaded_data_path.items()
}

In [43]:
for k in sideloaded_data.keys():
    x = sideloaded_data[k][["ensembl", "chi_sq", "pval", "fdr"]]

    mechanism_matching.bind_wide_results(
        sbml_dfs,
        x,
        f"{k}_loose_data",
        # map columns to controlled Napistu's vocabulary
        ontologies = {"ensembl" : "ensembl_gene"},
        species_identifiers = identifiers,
        dogmatic = False,
        verbose = True
    )

INFO:napistu.sbml_dfs_utils:Running in non-dogmatic mode - genes, transcripts, and proteins will be merged if possible.
DEBUG:napistu.mechanism_matching:Validated ontology columns: {'ensembl_gene'}
INFO:napistu.mechanism_matching:Using columns as results: ['fdr', 'feature_id', 'chi_sq', 'pval']
DEBUG:napistu.mechanism_matching:Final long format shape: (14749, 6)
DEBUG:napistu.mechanism_matching:Matching 14749 features to 42421 species for ontology ensembl_gene
INFO:napistu.mechanism_matching:Found 28526 total matches across 1 ontologies
INFO:napistu.mechanism_matching:98.6% of feature_ids are present one or more times in the output (14544/14749)
INFO:napistu.mechanism_matching:2617 s_id(s) map to more than one feature_id.
INFO:napistu.mechanism_matching:Examples of s_id mapping to multiple feature_ids (showing up to 3):
s_id       s_name                        
S00000054  UBE2L3                                     [3752, 9356]
S00000105  Ferritin Complex                           [2387

## Loading Results from an AnnData object

Since the Forny dataset is a multiomics experiment many of the variablges we are interested in will hold a common interpretation across all modalities. For example, the effect size of a term in a regression holds a common meaning as do the loadings from a multi-omic factor analysis (MOFA) decomposition.

But, many datasets will just be a single modality, and even for multiomic datasets we may be interested in exploring the biology of datamodality-specific attributes. An example in this study is the data-modality specific principal component loadings. Since PCA was performed separately on each data modality the principal components will likely be relatively uncorrelated hence it doesn't make much sense to treat the loadings of PCX to one another across modalities. This is definitely the case for this dataset - PC1 of the proteomics data largely reflects a chromatography-driven technical batch effect which is not seen in the transcriptomics data. To more directly explore this proteomics batch effect we can pull PC1 out of its `AnnData` table.

In [76]:
len(mdata["proteomics"].obsm.keys())

#mdata["proteomics"].var

1

In [108]:
from types import SimpleNamespace
import copy

ADATA = SimpleNamespace(
    LAYERS="layers",
    OBS="obs",
    OBSM="obsm",
    OBSP="obsp",
    VAR="var",
    VARM="varm",
    VARP="varp",
    X="X",   
)

ADATA_DICTLIKE_ATTRS = [ADATA.LAYERS, ADATA.OBSM, ADATA.OBSP, ADATA.VARM, ADATA.VARP]
ADATA_IDENTITY_ATTRS = [ADATA.OBS, ADATA.VAR, ADATA.X]


import anndata
from typing import Literal, Optional, List, Union, Set, Dict

def _load_raw_table(
    adata: anndata.AnnData,
    table_type: str,
    table_name: Optional[str] = None
):
    
    """
    Load an AnnData table.
    
    This function loads an AnnData table and returns it as a pd.DataFrame.
    
    Parameters
    ----------
    adata : anndata.AnnData
        The AnnData object to load the table from.
    table_type : str
        The type of table to load.
    table_name : str, optional
        The name of the table to load.

    Returns
    -------
    pd.DataFrame
        The loaded table.
    """
    
    if table_type not in [*ADATA_DICTLIKE_ATTRS, *ADATA_IDENTITY_ATTRS]:
        raise ValueError(f"table_type {table_type} is not a valid AnnData attribute. Valid attributes are: {ADATA_DICTLIKE_ATTRS + ADATA_IDENTITY_ATTRS}")

    if table_type in ADATA_IDENTITY_ATTRS:
        if table_name is not None:
            logger.debug(f"table_name {table_name} is not None, but table_type is in IDENTITY_TABLES. "
                        f"table_name will be ignored.")
        return getattr(adata, table_type)

    # pull out a dict-like attribute
    return _get_table_from_dict_attr(
        adata,
        table_type,
        table_name
    )
    
    
def _get_table_from_dict_attr(
    adata: anndata.AnnData,
    attr_name: str,
    table_name: Optional[str] = None
):
    """
    Generic function to get a table from a dict-like AnnData attribute (varm, layers, etc.)
    
    Args:
        adata: AnnData object
        attr_name: Name of the attribute ('varm', 'layers', etc.)
        table_name: Specific table name to retrieve, or None for auto-selection
    """

    if attr_name not in ADATA_DICTLIKE_ATTRS:
        raise ValueError(f"attr_name {attr_name} is not a dict-like AnnData attribute. Valid attributes are: {VALID_ATTRS}")

    attr_dict = getattr(adata, attr_name)
    available_tables = list(attr_dict.keys())
    
    if len(available_tables) == 0:
        raise ValueError(f"No tables found in adata.{attr_name}")
    elif (len(available_tables) > 1) and (table_name is None):
        raise ValueError(f"Multiple tables found in adata.{attr_name} and table_name is not specified. "
                        f"Available: {available_tables}")
    elif (len(available_tables) == 1) and (table_name is None):
        return attr_dict[available_tables[0]]
    elif table_name not in available_tables:
        raise ValueError(f"table_name '{table_name}' not found in adata.{attr_name}. "
                        f"Available: {available_tables}")
    else:
        return attr_dict[table_name]
    
#_load_raw_table(adata, "X")
#_load_raw_table(adata, "var", "ignored")
# _load_raw_table(adata, "layers")
# _load_raw_table(adata, "foo")

def _select_results_attrs(
    adata: anndata.AnnData,
    raw_results_table: Union[pd.DataFrame, np.ndarray],
    results_attrs: Optional[List[str]] = None
) -> pd.DataFrame:

    """
    Select results attributes from an AnnData object.

    This function selects results attributes from raw_results_table derived
    from an AnnData object and converts them if needed to a pd.DataFrame
    with appropriate indicies.

    Parameters
    ----------
    adata : anndata.AnnData
        The AnnData object containing the results to be formatted.
    raw_results_table : pd.DataFrame or np.ndarray
        The raw results table to be formatted.
    results_attrs : list of str, optional
        The attributes to extract from the raw_results_table.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the formatted results.
    """
    if isinstance(raw_results_table, pd.DataFrame):
        if results_attrs is not None:
            results_table_data = raw_results_table.loc[results_attrs]
        else:
            results_table_data = raw_results_table
    else:
        if results_attrs is not None:
            # Check that results_attrs exist in adata.obs.index
            valid_obs = adata.obs.index.tolist()
            
            invalid_results_attrs = [x for x in results_attrs if x not in valid_obs]
            if len(invalid_results_attrs) > 0:
                raise ValueError(f"The following results attributes are not present in the AnnData object's obs index: {invalid_results_attrs}")

            # Find positions of desired rows in adata.obs.index
            row_positions = [adata.obs.index.get_loc(attr) for attr in results_attrs]
            
            # Select ROWS from numpy array using positions
            selected_array = raw_results_table[row_positions, :]
            
            # Convert to DataFrame and set row names to results_attrs
            results_table_data = pd.DataFrame(
                selected_array,
                index = results_attrs,
                columns = adata.var.index
                ).T
        else:
            # Convert entire array to DataFrame
            results_table_data = pd.DataFrame(
                raw_results_table,
                index = adata.obs.index,
                columns = adata.var.index
            ).T

    return results_table_data

# _select_results_attrs(adata, adata.X, ["MMA001", "MMA004", "MMA005"])

def prepare_anndata_results_df(
    adata: anndata.AnnData,
    table_type: Literal[ADATA.VAR, ADATA.VARM, ADATA.X] = ADATA.VAR,
    table_name: Optional[str] = None,
    results_attrs: Optional[List[str]] = None,
    ontologies: Optional[Union[Set[str], Dict[str, str]]] = None,
    index_which_ontology: Optional[str] = None,
    verbose: bool = True
):

    """
    Prepare a results table from an AnnData object for use in Napistu.

    This function extracts a table from an AnnData object and formats it for use in Napistu.

    Parameters
    ----------
    adata : anndata.AnnData
        The AnnData object containing the results to be formatted.
    table_type : Literal["var", "varm", "X"], optional
        The type of table to extract from the AnnData object.
    table_name : str, optional
        The name of the table to extract from the AnnData object.
    results_attrs : list of str, optional
        The attributes to extract from the table.
    index_which_ontology : str, optional
        The ontology to use for the systematic identifiers. This column will be pulled out of the
        index renamed to the ontology name, and added to the results table as a new column with
        the same name.
    ontologies : Optional[Union[Set[str], Dict[str, str]]], default=None
        Either:
        - Set of columns to treat as ontologies (these should be entries in ONTOLOGIES_LIST )
        - Dict mapping wide column names to ontology names in the ONTOLOGIES_LIST controlled vocabulary
        - None to automatically detect valid ontology columns based on ONTOLOGIES_LIST

        If index_which_ontology is defined, it should be represented in these ontologies. 
    verbose : bool, optional
        Whether to print verbose output.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the formatted results.
    """

    # pull out the table containing results
    raw_results_table = _load_raw_table(adata, table_type, table_name)

    if table_type == ADATA.VAR:
        var_table = copy.deepcopy(raw_results_table)
    else:
        var_table = adata.var


    return raw_results_table

prepare_anndata_results_df(
    mdata["proteomics"],
    table_type = "X"
)


TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

In [109]:
table_type = "var"

if table_type not in [*ADATA_DICTLIKE_ATTRS, *ADATA_IDENTITY_ATTRS]:
        raise ValueError(f"table_type {table_type} is not a valid AnnData attribute. Valid attributes are: {ADATA_DICTLIKE_ATTRS + ADATA_IDENTITY_ATTRS}")

if table_type in ADATA_IDENTITY_ATTRS:
    if table_name is not None:
        logger.debug(f"table_name {table_name} is not None, but table_type is in IDENTITY_TABLES. "
                    f"table_name will be ignored.")
    return getattr(adata, table_type)


NameError: name 'table_type' is not defined

In [79]:
adata.layers

Layers with keys: log2_centered

numpy.ndarray

In [None]:
# if table_name

In [145]:
import numpy as np

adata = mdata["proteomics"]
table_type = "X"
table_name = None
results_attrs = ["MMA001", "MMA004", "MMA005"]

raw_results_table = _load_raw_table(adata, table_type, table_name)

if table_type == ADATA.VAR:
    var_table = copy.deepcopy(raw_results_table)
else:
    var_table = adata.var

# select relevant attributes returning a pd.DataFrame
# if raw_results_table is a np.ndarray select observations
# based on their primary key if results_attrs is not None




# selecting from array by name
_select_results_attrs(adata, adata.X, ["MMA001", "MMA004", "MMA005"])

Unnamed: 0_level_0,MMA001,MMA004,MMA005
uniprot,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A0AVF1,83727.578125,351463.812500,473821.687500
A0AVT1,37115.953125,71438.554688,42789.007812
A0FGR8,40237.117188,122243.023438,152602.468750
A1AG_BOVINAlpha-1-acidglycoproteinOS=BostaurusGN=ORM1PE=2SV=1;CONT_Q3SZR3,46269.468750,70375.281250,71832.406250
A1L0T0,55125.847656,40272.636719,49587.910156
...,...,...,...
Q9Y6R0,28617.105469,75366.789062,20555.519531
Q9Y6R4,75033.437500,28757.421875,44098.183594
Q9Y6U3,325.786499,41348.601562,17314.455078
Q9Y6W5,84969.218750,137218.062500,134523.421875


In [133]:
def _select_results_attrs(
    adata: anndata.AnnData,
    raw_results_table: Union[pd.DataFrame, np.ndarray],
    results_attrs: Optional[List[str]] = None
) -> pd.DataFrame:

    if isinstance(raw_results_table, pd.DataFrame):
        if results_attrs is not None:
            results_table_data = raw_results_table.loc[results_attrs]
        else:
            results_table_data = raw_results_table
    else:
        if results_attrs is not None:
            # Check that results_attrs exist in adata.obs.index
            valid_obs = adata.obs.index.tolist()
            
            invalid_results_attrs = [x for x in results_attrs if x not in valid_obs]
            if len(invalid_results_attrs) > 0:
                raise ValueError(f"The following results attributes are not present in the AnnData object's obs index: {invalid_results_attrs}")

            # Find positions of desired rows in adata.obs.index
            row_positions = [adata.obs.index.get_loc(attr) for attr in results_attrs]
            
            # Select ROWS from numpy array using positions
            selected_array = raw_results_table[row_positions, :]
            
            # Convert to DataFrame and set row names to results_attrs
            results_table_data = pd.DataFrame(
                selected_array,
                index = results_attrs,
                columns = adata.var.index
                ).T
        else:
            # Convert entire array to DataFrame
            results_table_data = pd.DataFrame(
                raw_results_table,
                index = adata.obs.index,
                columns = adata.var.index
            ).T

    return results_table_data
adata.X.shape

(221, 4788)

In [135]:
valid_obs = adata.obs.index.tolist()
            
invalid_results_attrs = [x for x in results_attrs if x not in valid_obs]
if len(invalid_results_attrs) > 0:
    raise ValueError(f"The following results attributes are not present in the AnnData object's obs index: {invalid_results_attrs}")

# Find positions of desired rows in adata.obs.index
row_positions = [adata.obs.index.get_loc(attr) for attr in results_attrs]

# Select ROWS from numpy array using positions
selected_array = raw_results_table[row_positions, :]

In [142]:
pd.DataFrame(
    selected_array,
    index = results_attrs,
    columns = adata.var.index
    ).T

Unnamed: 0_level_0,MMA001,MMA004,MMA005
uniprot,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A0AVF1,83727.578125,351463.812500,473821.687500
A0AVT1,37115.953125,71438.554688,42789.007812
A0FGR8,40237.117188,122243.023438,152602.468750
A1AG_BOVINAlpha-1-acidglycoproteinOS=BostaurusGN=ORM1PE=2SV=1;CONT_Q3SZR3,46269.468750,70375.281250,71832.406250
A1L0T0,55125.847656,40272.636719,49587.910156
...,...,...,...
Q9Y6R0,28617.105469,75366.789062,20555.519531
Q9Y6R4,75033.437500,28757.421875,44098.183594
Q9Y6U3,325.786499,41348.601562,17314.455078
Q9Y6W5,84969.218750,137218.062500,134523.421875


In [122]:
adata.obs

Unnamed: 0_level_0,case,gender,consanguinity,mut_category,wgs_zygosity,acidosis,metabolic_acidosis,metabolic_ketoacidosis,ketosis,hyperammonemia,...,ammonia_umolL,pH,base_excess,MMA_urine_after_treat,carnitine_dose,natural_protein_amount,total_protein_amount,weight_centile_quant,length_centile_quant,head_circumfernce_quant
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MMA001,1,1,0,1.0,1.0,1,1,0,0,0,...,3.762349,0.521214,-15.36,0.533427,5.643856,-0.316992,1.222182,0.000000,0.000000,0.0
MMA002,1,1,1,0.0,0.0,0,0,0,0,0,...,9.732642,-0.077624,-21.00,0.105576,6.643856,-0.104195,1.397998,1.732051,1.000000,3.0
MMA003,1,1,1,0.0,1.0,0,0,0,0,0,...,8.422174,-1.664230,-24.00,-0.514988,5.862947,-1.442128,1.411885,1.000000,1.000000,3.0
MMA004,1,0,0,0.0,1.0,1,1,0,0,0,...,8.666115,-0.641109,-16.94,-0.037999,6.715085,0.752554,1.096611,1.732051,1.732051,0.0
MMA005,1,0,0,0.0,0.0,0,0,0,0,0,...,7.108113,0.833357,-1.96,0.249966,6.513956,-0.103818,1.660840,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MMA226,0,0,0,,,0,0,0,0,0,...,7.267153,0.414281,-3.76,-0.354647,6.360849,0.127875,0.886244,0.000000,0.000000,0.0
MMA227,0,0,1,,,0,0,0,0,0,...,5.746581,1.230584,-2.30,-0.824862,6.974927,0.081066,1.818446,0.000000,0.000000,0.0
MMA228,0,0,0,,,0,0,0,0,0,...,7.519280,0.982586,-6.10,0.060576,6.643856,0.127875,0.907549,0.000000,0.000000,0.0
MMA229,0,1,0,,,0,0,0,0,0,...,6.209349,1.144137,-4.36,-0.379918,6.643856,0.261158,1.271238,0.000000,0.000000,0.0


Index(['MMA001', 'MMA002', 'MMA003', 'MMA004', 'MMA005', 'MMA006', 'MMA007',
       'MMA008', 'MMA009', 'MMA010',
       ...
       'MMA220', 'MMA222', 'MMA223', 'MMA224', 'MMA225', 'MMA226', 'MMA227',
       'MMA228', 'MMA229', 'MMA230'],
      dtype='object', name='patient_id', length=221)

In [35]:
MODALITY_TO_ONTOLOGY = {
    "proteomics": "uniprot",
    "transcriptomics": "ensembl_gene",
}

for k, v in MODALITY_TO_ONTOLOGY.items():
    # results from var
    var_level_results = mdata[k].var[["effect_case", "qval_case"]].copy()
    var_level_results.index.name = "feature_id"
    var_level_results[v] = var_level_results.index.to_series()

    mechanism_matching.bind_wide_results(
        sbml_dfs,
        var_level_results,
        f"{k}_var_level_results",
        ontologies = {v},
        dogmatic = False,
        verbose = True
    )

INFO:napistu.sbml_dfs_utils:Running in non-dogmatic mode - genes, transcripts, and proteins will be merged if possible.
DEBUG:napistu.mechanism_matching:Validated ontology columns: {'uniprot'}
INFO:napistu.mechanism_matching:Using columns as results: ['feature_id', 'effect_case', 'qval_case']
DEBUG:napistu.mechanism_matching:Final long format shape: (4788, 5)
DEBUG:napistu.mechanism_matching:Matching 4788 features to 124813 species for ontology uniprot
INFO:napistu.mechanism_matching:Found 7062 total matches across 1 ontologies
INFO:napistu.mechanism_matching:96.2% of feature_ids are present one or more times in the output (4607/4788)
INFO:napistu.mechanism_matching:64 s_id(s) map to more than one feature_id.
INFO:napistu.mechanism_matching:Examples of s_id mapping to multiple feature_ids (showing up to 3):
s_id       s_name     
S00000766  KMT2B          [1046, 1058, 1538]
S00001060  PML                   [836, 1335]
S00001352  UGT1A9 gene          [4321, 4322]
Name: feature_id, dtype

In [58]:
??mechanism_matching.bind_wide_results

[31mSignature:[39m
mechanism_matching.bind_wide_results(
    sbml_dfs: [33m'sbml_dfs_core.SBML_dfs'[39m,
    results_df: [33m'pd.DataFrame'[39m,
    results_name: [33m'str'[39m,
    ontologies: [33m'Optional[Union[Set[str], Dict[str, str]]]'[39m = [38;5;28;01mNone[39;00m,
    dogmatic: [33m'bool'[39m = [38;5;28;01mFalse[39;00m,
    species_identifiers: [33m'Optional[pd.DataFrame]'[39m = [38;5;28;01mNone[39;00m,
    feature_id_var: [33m'str'[39m = [33m'feature_id'[39m,
    numeric_agg: [33m'str'[39m = [33m'weighted_mean'[39m,
    keep_id_col: [33m'bool'[39m = [38;5;28;01mTrue[39;00m,
    verbose: [33m'bool'[39m = [38;5;28;01mFalse[39;00m,
) -> [33m'sbml_dfs_core.SBML_dfs'[39m
[31mSource:[39m   
[38;5;28;01mdef[39;00m bind_wide_results(
    sbml_dfs : sbml_dfs_core.SBML_dfs,
    results_df : pd.DataFrame,
    results_name : str,
    ontologies : Optional[Union[Set[str], Dict[str, str]]] = [38;5;28;01mNone[39;00m,
    dogmatic : bool = [38;5;28

In [29]:
sbml_dfs.species_data["transcriptomics_var_level_results"]

sbml_dfs.species_data["protoemics_var_level_results"]

KeyError: 'protoemics_var_level_results'

In [6]:
# merge factors with metadata
mofa_dfs_dict = regulation.split_varm_by_modality(mdata)

modality = "transcriptomics"

mofa_df_list = list()
for modality in mofa_dfs_dict.keys():

    modality_pk = mofa_dfs_dict[modality].index.name
    filter_col = [col for col in mofa_dfs_dict[modality] if col.startswith('LF')]
    modality_df = mofa_dfs_dict[modality][filter_col].copy()
    modality_df.index.name = "feature_id"
    modality_df[modality_pk] = modality_df.index.to_series()
    modality_df["modality"] = modality

    mofa_df_list.append(modality_df)

mofa_df = pd.concat(mofa_df_list, axis=0)

mofa_df.groupby("modality").sample(5)


Unnamed: 0_level_0,LFs1,LFs2,LFs3,LFs4,LFs5,LFs6,LFs7,LFs8,LFs9,LFs10,...,LFs24,LFs25,LFs26,LFs27,LFs28,LFs29,LFs30,ensembl_gene,modality,uniprot
feature_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Q9H792,-0.027243,-0.119244,0.144007,0.001119,0.026899,-0.030268,0.033949,0.063107,0.009831,0.016116,...,-0.009719,-0.003239,-0.001262,-0.214363,0.008808,0.021,-0.003012,,proteomics,Q9H792
P0C0L4,-0.09246,-0.093116,0.178606,0.006779,0.0835,-0.000353,0.146885,0.020423,0.017907,-0.020367,...,0.015213,-0.003062,0.005986,-0.152157,-0.005441,0.001834,0.000642,,proteomics,P0C0L4
Q9UBI1,0.029154,-0.041047,0.379047,0.001382,-0.022916,0.054253,0.016286,-0.079876,-0.004745,0.369614,...,-0.000681,0.021992,-0.017142,-0.087458,-0.018002,0.004626,-0.00615,,proteomics,Q9UBI1
Q9NRL2,-0.075254,-0.009926,0.008734,0.005759,-0.008589,0.00235,0.016089,-0.150416,-0.026265,-0.092472,...,-0.013159,-0.002475,0.001101,0.033993,-0.005234,0.005804,-0.000699,,proteomics,Q9NRL2
P37802,0.001871,-0.154841,0.074453,-0.007104,-0.077237,-0.020003,-0.023796,-0.377999,-0.058973,0.061443,...,0.021791,0.006501,0.002317,0.086762,-0.01541,0.012741,-0.008488,,proteomics,P37802
ENSG00000104299,0.030109,0.009456,-5e-05,-0.032007,0.020188,0.016357,0.047936,0.000251,-0.001013,4.8e-05,...,0.047624,0.001612,0.020146,-0.018153,-0.20568,0.118215,-0.065097,ENSG00000104299,transcriptomics,
ENSG00000185532,0.033874,-0.009365,4.5e-05,0.170626,-0.001765,-0.026931,-0.078983,-0.007647,-0.086124,-4e-05,...,0.074793,-0.037001,-0.024062,-0.019362,0.19033,-0.226037,0.07166,ENSG00000185532,transcriptomics,
ENSG00000112715,0.049501,-0.115302,-0.000124,0.016265,-0.077598,0.144006,-0.157112,0.008054,-0.122357,-7.1e-05,...,0.179531,0.220982,-0.086252,0.262317,-0.195597,-0.027755,0.115978,ENSG00000112715,transcriptomics,
ENSG00000174807,0.020886,0.130035,0.000283,-0.036658,-0.041395,-0.169194,0.146575,0.000694,0.198987,-0.000107,...,-0.234533,0.248023,-0.24355,-0.150132,0.20899,-0.253592,0.004672,ENSG00000174807,transcriptomics,
ENSG00000136986,-0.001188,-0.025556,-6e-06,0.011816,0.054273,0.077171,-0.107502,0.002933,-0.022823,-0.000134,...,0.02485,-0.032014,-0.068765,-0.010854,0.095898,-0.024257,-0.024088,ENSG00000136986,transcriptomics,


In [7]:
mechanism_matching.bind_wide_results(
    sbml_dfs,
    mofa_df,
    "mudata_varm_results",
    ontologies = {"uniprot", "ensembl_gene"},
    species_identifiers = identifiers,
    dogmatic = False,
    verbose = True
)

sbml_dfs.species_data["mudata_varm_results"]

INFO:napistu.sbml_dfs_utils:Running in non-dogmatic mode - genes, transcripts, and proteins will be merged if possible.
DEBUG:napistu.mechanism_matching:Validated ontology columns: {'uniprot', 'ensembl_gene'}
INFO:napistu.mechanism_matching:Using columns as results: ['LFs11', 'LFs10', 'LFs13', 'LFs18', 'LFs28', 'LFs29', 'LFs19', 'LFs5', 'LFs7', 'LFs8', 'LFs12', 'LFs27', 'LFs16', 'LFs26', 'LFs9', 'LFs2', 'LFs6', 'LFs14', 'LFs15', 'LFs1', 'LFs4', 'LFs25', 'LFs17', 'LFs21', 'LFs30', 'LFs23', 'LFs24', 'LFs3', 'LFs20', 'modality', 'LFs22', 'feature_id']
DEBUG:napistu.mechanism_matching:Final long format shape: (13922, 34)
DEBUG:napistu.mechanism_matching:Matching 4788 features to 98 species for ontology uniprot
INFO:napistu.mechanism_matching:Found 57 total matches across 1 ontologies
INFO:napistu.mechanism_matching:0.4% of feature_ids are present one or more times in the output (57/13647)
INFO:napistu.mechanism_matching:7 s_id(s) map to more than one feature_id.
INFO:napistu.mechanism_matc

Unnamed: 0_level_0,LFs1,LFs2,LFs3,LFs4,LFs5,LFs6,LFs7,LFs8,LFs9,LFs10,...,LFs23,LFs24,LFs25,LFs26,LFs27,LFs28,LFs29,LFs30,modality,feature_id
s_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
S00000006,0.05153,0.036362,-0.041906,-0.001347,-0.013508,0.001565,-0.021276,-0.244355,-0.031964,0.057272,...,-0.191047,0.075975,-0.006429,0.009816,0.074251,0.010391,0.000629,-0.004023,proteomics,10646
S00000012,0.034465,0.115698,-0.047133,-0.001525,-0.006179,0.015768,0.042666,0.118975,0.002245,0.035891,...,0.15304,-0.00704,0.000832,0.005247,0.003451,0.000202,0.001349,-0.00289,proteomics,13559
S00000013,0.111319,-0.002571,-0.073818,-0.000615,-0.066941,0.040771,-0.002893,0.12033,-0.0244,0.337955,...,-0.075491,-0.005003,0.001831,-0.016633,-0.083339,-0.000686,0.003954,-0.00243,proteomics,11290
S00000015,0.01002,0.043131,-0.000815,1e-05,-0.032613,0.026158,0.044396,0.171346,0.001049,0.003718,...,-0.071206,0.054874,-0.052803,-0.000263,-0.014903,0.000827,0.036917,-0.002143,proteomics,10647
S00000016,0.099801,0.064125,-0.015871,-0.001283,-0.085869,0.002985,0.009716,0.216461,-0.005344,0.003299,...,-0.044812,0.032438,-0.001762,0.001995,-0.094416,-0.005864,-0.006293,-0.005239,proteomics,136229675
S00000019,0.064609,0.006714,-0.110616,-0.017464,-0.020428,-0.032641,-0.003856,0.083932,0.000541,-0.054995,...,-0.121061,-0.032175,-0.006698,-0.00278,0.095618,0.002672,0.074969,-0.004676,proteomics,10901
S00000022,0.03489,0.007629,-0.122095,-0.001708,-0.076321,0.06122,-0.055165,-0.110623,0.014309,0.324787,...,-0.065491,-0.016958,-0.011363,-0.004861,-0.308404,0.006624,-0.031057,0.000834,proteomics,10118
S00000031,0.014932,-0.100028,-0.056794,-0.015458,-0.039523,0.024284,-0.102462,-0.278852,-0.129555,0.008516,...,-0.225068,0.005913,-0.00828,-0.010984,0.235486,0.014511,0.010286,0.00522,proteomics,10040101689967
S00000033,0.055625,-0.149947,0.003432,-0.031034,-0.020317,0.054525,-0.158043,-0.283876,-0.124461,0.013506,...,-0.285335,0.016808,-0.004187,-0.003736,0.322549,0.01849,0.048649,-0.000388,proteomics,9864
S00000036,0.00607,-0.180905,0.004306,0.004061,0.082143,0.124391,0.241261,0.136064,-0.099635,-0.021902,...,-0.151604,0.009547,-0.011912,-0.036209,-0.212754,0.002011,0.004901,-0.001723,proteomics,11794


In [8]:
from napistu.network import net_create

# now we can pass these species_data attributes to the graph

reaction_graph_attrs = {
    "species": {
        "LFs5": {
            "table": "mudata_varm_results",
            "variable": "LFs5",
            "trans": "abs",
        },
        "effect_case": {
            "table": "var_level_results",
            "variable": "effect_case",
            "trans": "abs",
        },
    },
}

cpr_graph = net_create.create_cpr_graph(
    sbml_dfs,
    directed=True,
    graph_type="regulatory"
)

# add species attributes
# TO DO - this is definitely not a utility function
graph_w_annotations = net_create._add_graph_species_attribute(
    cpr_graph,
    sbml_dfs,
    species_graph_attrs = reaction_graph_attrs,
    custom_transformations = {
        # take the absolute value
        "abs" : lambda x: abs(x)
    }
)


INFO:napistu.network.net_create:Organizing all network nodes (compartmentalized species and reactions)
INFO:napistu.network.net_create:Formatting edges as a regulatory graph
INFO:napistu.network.net_create:Formatting 250 reactions species as tiered edges.
INFO:napistu.network.net_create:Adding additional attributes to edges, e.g., # of children and parents.
INFO:napistu.network.net_create:Done preparing regulatory graph
INFO:napistu.network.net_create:Adding reversibility and other meta-data from reactions_data
INFO:napistu.network.net_create:No reactions annotations provided in "graph_attrs"; returning None
INFO:napistu.network.net_create:Creating reverse reactions for reversible reactions on a directed graph
INFO:napistu.network.net_create:Formatting cpr_graph output
INFO:napistu.network.net_create:Adding meta-data from species_data
INFO:napistu.network.net_create:Adding new attribute LFs5 to vertices
INFO:napistu.network.net_create:Adding new attribute effect_case to vertices


In [9]:
napistu_utils.style_df(graph_w_annotations.get_vertex_dataframe().sort_values("LFs5").head(5))

Unnamed: 0_level_0,name,node_name,node_type,LFs5,effect_case
vertex ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,species_113780,Glc [endoplasmic reticulum lumen],species,0.0,0.0
101,species_76116,Glycerol [cytosol],species,0.0,0.0
103,species_163745,phosphoPFKFB1 dimer [cytosol],species,0.0,0.0
104,species_71786,PFKFB1 dimer [cytosol],species,0.0,0.0
107,reaction_198458,Efflux of glucose from the endoplasmic reticulum,reaction,0.0,0.0


## Network Propagation

Here we'll implement a workflow for applying network propagation to a cpr_graph's vertex attributes.

In [11]:
RESET_PROPORTIONAL_TO = "effect_case"

net_propagation.personalized_pagerank_by_attribute(
    graph_w_annotations,
    RESET_PROPORTIONAL_TO
).sort_values("effect_case", ascending=False)

Unnamed: 0,name,pagerank_by_attribute,effect_case,pagerank_uniform
18,species_376856,0.019399,0.290775,0.006498
100,species_8955798,0.016729,0.250761,0.006498
54,species_372815,0.014503,0.217395,0.006498
98,species_8955670,0.014410,0.215993,0.006498
72,species_70579,0.014029,0.210290,0.006498
...,...,...,...,...
47,species_29356,0.015775,0.000000,0.014823
45,species_29420,0.002402,0.000000,0.001767
44,species_113528,0.002402,0.000000,0.001767
43,species_29438,0.000000,0.000000,0.000000
