---
title: Tutorial - Adding Data to Graphs
author: "Shackett"
date: "September 12th 2025"
---

This notebook describes how to add species- or reaction-level data to a pathway representation (`SBML_dfs`) and how to propagate these attributes to the vertices and edges of a `NapistuGraph`.

**Key Concepts:**

- **SBML_dfs**: Pathway representation storing species/reaction data in `.species_data` and `.reactions_data` dictionaries
- **NapistuGraph**: Network representation where data becomes vertex/edge attributes
- **Data flow**: SBML_dfs data → Graph attributes via configuration → Network analysis

## Adding data to pathways

Species- and reaction-level data are stored in the `species_data` and `reactions_data` attributes of an `SBML_dfs` object. Each attribute can hold multiple sources of entity data, organized as a dictionary whose keys are information source labels and whose values are `pd.DataFrame`s. Each DataFrame is indexed by species or reaction ids (`s_id`s or `r_id`s) corresponding to the indices of the `species` and `reactions` tables.
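
As a rough illustration of this layout, the sketch below builds the same structure from plain pandas objects rather than a real `SBML_dfs`. The `expression_data` label and the `log_fc` values are made up for the example.

```python
import pandas as pd

# Hypothetical species table standing in for `sbml_dfs.species`
species = pd.DataFrame(
    {"s_name": ["MDH2 dimer", "NAD+", "MAL"]},
    index=pd.Index(["S00000000", "S00000001", "S00000002"], name="s_id"),
)

# One entry per information source; each value is indexed by s_id
species_data = {
    "expression_data": pd.DataFrame(
        {"log_fc": [1.2, -0.4]},  # illustrative values only
        index=pd.Index(["S00000000", "S00000002"], name="s_id"),
    )
}

# Every data row should point at a species that exists in the model
for label, table in species_data.items():
    unknown_ids = table.index.difference(species.index)
    print(label, "rows:", len(table), "unknown ids:", list(unknown_ids))
```

Not every species needs a row: here `S00000001` simply has no expression value.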

**Current approaches:**

1. **During construction**: Use edgelist format or specialized conversion functions. (Only recommended for advanced users; see the [wiki](https://github.com/napistu/napistu/wiki/SBML-DFs#from-an-edgelist))
2. **Post-construction**: Use `sbml_dfs.add_species_data()` and `sbml_dfs.add_reactions_data()` methods

```python
# Add species data
sbml_dfs.add_species_data("expression_data", expression_df)

# Add reaction data  
sbml_dfs.add_reactions_data("kinetics", kinetics_df)
```

3. **Identifier-based joining**: Use `matching.species.features_to_pathway_species()` to merge on systematic identifiers

```python
from napistu.matching import species

# Map external gene expression data to pathway species
mapped_data = species.features_to_pathway_species(
    feature_data,
    species_identifiers,
    ontologies={"ensembl_gene"},
    feature_identifiers_var="gene_id",
)

# Index by s_id before attaching, as in the demo below
sbml_dfs.add_species_data("expression", mapped_data.set_index("s_id"))
```

**Data organization**: In each case, a `pd.DataFrame` is created with the appropriate entity ids as its index (`s_id` for species, `r_id` for reactions). Each species or reaction is represented by zero or one row in the table.
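
A quick pre-flight check along these lines can catch malformed tables before they are added. This is only a sketch: `check_entity_table` is a hypothetical helper written for this tutorial, not part of napistu, and the ids are placeholders.

```python
import pandas as pd


def check_entity_table(table: pd.DataFrame, valid_ids: pd.Index, id_name: str) -> None:
    """Raise ValueError if `table` breaks the zero-or-one-row-per-entity layout."""
    if table.index.name != id_name:
        raise ValueError(f"index must be named {id_name!r}, got {table.index.name!r}")
    if table.index.has_duplicates:
        raise ValueError("each entity may appear in at most one row")
    missing = table.index.difference(valid_ids)
    if len(missing) > 0:
        raise ValueError(f"ids not found in the model: {list(missing)}")


# Placeholder ids standing in for `sbml_dfs.species.index`
model_s_ids = pd.Index(["S00000000", "S00000001", "S00000002"], name="s_id")

good = pd.DataFrame({"log_fc": [0.5]}, index=pd.Index(["S00000001"], name="s_id"))
check_entity_table(good, model_s_ids, "s_id")  # passes silently

duplicated = pd.DataFrame(
    {"log_fc": [0.5, 0.7]},
    index=pd.Index(["S00000001", "S00000001"], name="s_id"),
)
try:
    check_entity_table(duplicated, model_s_ids, "s_id")
except ValueError as err:
    print(f"rejected: {err}")
```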

## Transferring data to graphs

Data flows from SBML_dfs to NapistuGraph through a three-step process:

### 1. Configure graph attributes

Define which data to extract and how to transform it:

```python
graph_attrs = {
    "species": {
        "expression": {
            "table": "expression_data", 
            "variable": "log_fc",
            "trans": "identity"
        }
    },
    "reactions": {
        "confidence": {
            "table": "string_scores",
            "variable": "combined_score", 
            "trans": "string_inv"
        }
    }
}

# Apply configuration
graph.set_graph_attrs(graph_attrs)
```
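
Because the configuration is a nested dictionary, a typo in a key is easy to miss. The sketch below checks the layout used in this tutorial. The required keys are inferred from the examples here, and `find_config_problems` is a hypothetical helper, so napistu itself may accept or require other fields.

```python
# Keys every attribute spec carries in this tutorial's examples (an assumption)
REQUIRED_KEYS = {"table", "variable", "trans"}
ENTITY_TYPES = {"species", "reactions"}


def find_config_problems(graph_attrs: dict) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for entity_type, attrs in graph_attrs.items():
        if entity_type not in ENTITY_TYPES:
            problems.append(f"unexpected top-level key: {entity_type}")
            continue
        for attr_name, spec in attrs.items():
            missing = REQUIRED_KEYS - set(spec)
            if missing:
                problems.append(f"{entity_type}.{attr_name} is missing {sorted(missing)}")
    return problems


example_attrs = {
    "species": {
        "expression": {"table": "expression_data", "variable": "log_fc", "trans": "identity"}
    },
    "reactions": {
        # "trans" is deliberately left out to trigger a report
        "confidence": {"table": "string_scores", "variable": "combined_score"}
    },
}

print(find_config_problems(example_attrs))
```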

### 2. Add data to graph

Extract and merge data with graph entities:

```python
# Add vertex data from species_data tables
graph.add_vertex_data(sbml_dfs)

# Add edge data from reactions_data tables  
graph.add_edge_data(sbml_dfs)
```
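
The merge idea can be pictured with plain pandas. This is an illustration of the concept, not napistu's implementation. It assumes the vertex table carries an `s_id` column, as the vertex dataframe in the demo does, and the `log_fc` values are invented.

```python
import pandas as pd

# Hypothetical vertex table: two compartmentalized species share s_id S00000001
vertices = pd.DataFrame(
    {
        "name": ["SC00000000", "SC00000001", "SC00000002"],
        "s_id": ["S00000000", "S00000001", "S00000001"],
    }
)

# Species-level data indexed by s_id, with at most one row per species
expression_data = pd.DataFrame(
    {"log_fc": [1.5, -0.8]},  # illustrative values only
    index=pd.Index(["S00000000", "S00000001"], name="s_id"),
)

# A left join keeps every vertex and repeats species-level values
# for each compartmentalized copy of the same species
annotated = vertices.merge(
    expression_data, how="left", left_on="s_id", right_index=True
)
print(annotated)
```

Since several vertices can map to one species, a single row of species data may show up on more than one vertex.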

### 3. Apply transformations

Transform raw data according to the configuration:

```python
# Transform vertex attributes
graph.transform_vertices()

# Transform edge attributes
graph.transform_edges()
```
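
One way to picture this step is as a lookup from the `trans` name in the configuration to a function that is applied to the matching column. The registry below is a hypothetical stand-in with two simple entries, not napistu's built-in set of transformations.

```python
import pandas as pd

# Hypothetical registry of named transformations
transformations = {
    "identity": lambda x: x,
    "square": lambda x: x**2,
}

# Attribute config in the same layout as `graph_attrs["species"]`
species_attrs = {
    "expression": {"table": "expression_data", "variable": "log_fc", "trans": "square"},
}

# Raw vertex attributes, already stored under their graph attribute names
vertex_df = pd.DataFrame({"expression": [2.0, None, 3.0]})

for attr_name, spec in species_attrs.items():
    transform = transformations[spec["trans"]]
    # Missing values pass through the arithmetic as NaN
    vertex_df[attr_name] = transform(vertex_df[attr_name])

print(vertex_df)
```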

## Alternative: Direct data addition

For simpler workflows, add data directly without configuration:

```python
# Add results table directly to graph
from napistu.network.data_handling import add_results_table_to_graph

add_results_table_to_graph(
    graph, 
    sbml_dfs,
    attribute_names=["expression", "significance"],
    table_name="rna_seq_results",
    transformation="identity"
)
```

# Demos

For these demos we'll load a simple `SBML_dfs` and its associated `NapistuGraph` from Google Cloud Storage (GCS). We'll then cover a few ways to add data to the `SBML_dfs`, and finally go over how to pass information from an `SBML_dfs` to a `NapistuGraph` and how to side-load data directly onto the graph.

```python
# general logging
import logging
logger = logging.getLogger()
logger.setLevel("INFO")

# add tutorial globals
import tutorial_utils
config = tutorial_utils.NapistuConfig("config.yaml", "downloading_pathway_data")

# load assets
import pandas as pd

from napistu.sbml_dfs_core import SBML_dfs
from napistu.network.ng_core import NapistuGraph
from napistu.gcs import downloads
from napistu import utils

sbml_dfs_path = downloads.load_public_napistu_asset(
    asset = "test_pathway",
    data_dir = config.data_dir,
    subasset = "sbml_dfs",
)

napistu_graph_path = downloads.load_public_napistu_asset(
    asset = "test_pathway",
    data_dir = config.data_dir,
    subasset = "napistu_graph",
)

sbml_dfs = SBML_dfs.from_pickle(sbml_dfs_path)
napistu_graph = NapistuGraph.from_pickle(napistu_graph_path)
```


## During construction

Since it's advanced usage, we won't provide a comprehensive example of adding data to an `SBML_dfs` during construction, but this route, which uses the `edgelist` format, is good to be aware of. This is how the species and/or reactions data bundled with STRING, IDEA, IntAct, and other pathway sources are defined.

- [STRING](https://github.com/napistu/napistu-py/blob/main/src/napistu/ingestion/string.py)
- [IDEA](https://github.com/napistu/napistu-py/blob/main/src/napistu/ingestion/idea_yeast.py)


## Adding Data to `SBML_dfs`

To add reaction- or species-level data to an existing `SBML_dfs` object, we can create an appropriate `pd.DataFrame` and add it directly to the object. As with all `species_data` or `reactions_data` entries, this table must be indexed by the model's species or reaction ids. Because of this, the challenge in merging results is determining which species in our model match entries in the to-be-added entity data.

First, we'll see how to add results once we've already assigned them to species or reactions, then we'll go over how to perform this merge using systematic identifiers.

### Directly using `SBML_dfs.add_species_data` or `SBML_dfs.add_reactions_data`

```python
new_species_data = sbml_dfs.species[0:2].assign(spec_attr=2)[["spec_attr"]]

new_reactions_data = pd.DataFrame(
    [
        {"r_id": sbml_dfs.reactions.index[0], "rxn_attr_1": 2, "rxn_attr_2": 3},
        {"r_id": sbml_dfs.reactions.index[1], "rxn_attr_1": 3, "rxn_attr_2": 4},
    ]
).set_index("r_id")

sbml_dfs.add_species_data("species_data_1", new_species_data)
utils.show("sbml_dfs.species_data['species_data_1']")
utils.show(sbml_dfs.species_data["species_data_1"])

sbml_dfs.add_reactions_data("reactions_data_1", new_reactions_data)
utils.show("sbml_dfs.reactions_data['reactions_data_1']")
utils.show(sbml_dfs.reactions_data["reactions_data_1"])
```


"sbml_dfs.species_data['species_data_1']"

| s_id | spec_attr |
| --- | --- |
| S00000000 | 2 |
| S00000001 | 2 |


"sbml_dfs.reactions_data['reactions_data_1']"

| r_id | rxn_attr_1 | rxn_attr_2 |
| --- | --- | --- |
| R00000000 | 2 | 3 |
| R00000001 | 3 | 4 |


### Matching by identifiers with `matching.species.features_to_pathway_species`

Generally we will be trying to add molecular data to a network which is associated with one or more systematic ontologies. A nice way to do this is with `matching.species.features_to_pathway_species()`. This function compares a table containing all species or reaction identifiers in the pathway model to a set of query features, producing a lookup table from query identifiers to pathway ids.

```python
from napistu.matching import species

# export identifiers from pathway
species_identifiers = sbml_dfs.get_identifiers("species")

feature_annotations = pd.DataFrame(
    [
        {"uniprot": "P40926", "expression": 1000, "is_changing": True},
        {"uniprot": "P35575", "expression": 50, "is_changing": False},
    ],
    index=[0, 1],
)

updated_species_data = species.features_to_pathway_species(
    feature_annotations,
    species_identifiers,
    ontologies={"uniprot"},
    feature_identifiers_var="uniprot",
)[["s_id", "expression", "is_changing"]].set_index("s_id")

sbml_dfs.add_species_data("species_data_2", updated_species_data)
utils.show("sbml_dfs.species_data['species_data_2']")
utils.show(sbml_dfs.species_data["species_data_2"])
```

No feature_id column found in DataFrame, creating one


"sbml_dfs.species_data['species_data_2']"

| s_id | expression | is_changing |
| --- | --- | --- |
| S00000000 | 1000 | True |
| S00000125 | 50 | False |
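
Conceptually, the matching step resembles a join between the query features and the pathway's identifier table. The sketch below uses plain pandas on a hand-made identifier table. The column names `ontology` and `identifier` and the placeholder third row are assumptions, and the real function may handle cases (such as multiple ontologies or one-to-many matches) that this join ignores.

```python
import pandas as pd

# Hand-made stand-in for the pathway's species identifiers (assumed columns)
species_identifiers = pd.DataFrame(
    {
        "s_id": ["S00000000", "S00000125", "S00000001"],
        "ontology": ["uniprot", "uniprot", "chebi"],
        "identifier": ["P40926", "P35575", "CHEBI_PLACEHOLDER"],
    }
)

features = pd.DataFrame(
    {"uniprot": ["P40926", "P35575"], "expression": [1000, 50]}
)

# Keep only the ontology we are matching on, then join on the identifier
lookup = species_identifiers.query("ontology == 'uniprot'")[["identifier", "s_id"]]
matched = features.merge(lookup, left_on="uniprot", right_on="identifier", how="inner")

# Index by s_id so the table has the layout expected for species data
species_table = matched[["s_id", "expression"]].set_index("s_id")
print(species_table)
```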


## Adding data to a `NapistuGraph`

### During construction

Most source-specific attributes are added to a `NapistuGraph` after its construction. An exception to this rule is [`network.net_create.process_napistu_graph`](https://github.com/napistu/napistu-py/blob/51adb19e6abe55b24984f543a7071e92d9001a0e/src/napistu/network/net_create.py#L255) which bundles network creation and weighting. For some methods, like hybrid weighting, source-specific confidence scores (like STRING weights) are needed. These are specified using the same methodology you'll see below (define a `reaction_graph_attrs` config; add reactions_data variables as edge attributes; transform attributes).

### Using the `add_vertex_data` and `add_edge_data` utilities

```python
graph_attrs = {
    "reactions": {
        "reaction_wts": {"table": "reactions_data_1", "variable": "rxn_attr_1", "trans": "square"}
    },
    "species": {
        "species_var1": {
            "table": "species_data_1",
            "variable": "spec_attr",
            "trans": "square",
        },
        "species_var2": {
            "table": "species_data_2",
            "variable": "expression",
            "trans": "identity",
        },
    },
}

custom_transforms = {
    "square": lambda x: x ** 2,
}

napistu_graph.set_graph_attrs(graph_attrs, custom_transformations = custom_transforms)
napistu_graph._metadata
```

```
{'is_reversed': False,
 'wiring_approach': 'regulatory',
 'weighting_strategy': 'unweighted',
 'weight_by': [],
 'creation_params': {},
 'species_attrs': {'species_var1': {'table': 'species_data_1',
   'variable': 'spec_attr',
   'trans': 'square'},
  'species_var2': {'table': 'species_data_2',
   'variable': 'expression',
   'trans': 'identity'}},
 'reaction_attrs': {'reaction_wts': {'table': 'reactions_data_1',
   'variable': 'rxn_attr_1',
   'trans': 'square'}}}
```

```python
napistu_graph.add_vertex_data(sbml_dfs)
# transform vertices
napistu_graph.transform_vertices(
    keep_raw_attributes = True,
    custom_transformations = custom_transforms
)

napistu_graph.add_edge_data(sbml_dfs)
# TO DO - fix once https://github.com/napistu/napistu-py/issues/220 is solved
# napistu_graph.transform_edges(custom_transformations = custom_transforms)
```

### Directly from pd.DataFrames

`add_vertex_data` and `add_edge_data` can use the `graph_attrs` to pull attributes out of species/reactions data, but they can also pull attributes out of a separate, optional `side_loaded_attributes` dict. This makes it easy to add graph-based results to the graph while still maintaining a consistent framework for tracking and transforming attributes.

To use this workflow we'll create a dict with one or more tables. Each table should be indexed by a variable that is already present as a vertex or edge attribute. It's also possible to use a MultiIndex, which is handy if you want to add edge-level summaries to edges based on (from, to) pairs.

```python
# Get vertex dataframe and sample 20 random vertices
vertices = napistu_graph.get_vertex_dataframe()
sampled_vertices = vertices.sample(n=20, random_state=42).set_index("name")

# Create side-loaded data with 2 new variables
import numpy as np

side_loaded_data = pd.DataFrame({
    'experimental_score': np.random.normal(5, 2, 20),  # Normally distributed scores
    'category': np.random.choice(['high', 'medium', 'low'], 20)  # Categorical variable
}, index=sampled_vertices.index)

# Ensure the index has the correct name for merging
side_loaded_data.index.name = 'name'  # or whatever the vertex name column is

# Configure graph attributes for the side-loaded data
graph_attrs = {
    "species": {
        "exp_score": {
            "table": "experimental_data",
            "variable": "experimental_score",
            "trans": "identity"
        },
        "sample_category": {
            "table": "experimental_data",
            "variable": "category",
            "trans": "identity"
        }
    }
}

# here we'll add the attributes using extend to tag the config and attributes themselves onto the existing ones
napistu_graph.set_graph_attrs(graph_attrs, mode = "extend")

napistu_graph.add_vertex_data(
    mode = "extend",
    sbml_dfs=None,
    side_loaded_attributes={"experimental_data": side_loaded_data}
)
```
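
The demo above side-loads vertex data. For the edge-level case, a table keyed by (from, to) pairs might look like the sketch below. It only shows the table layout and a plain pandas join, not a call into napistu, and the `path_count` variable is invented for illustration.

```python
import pandas as pd

# Hypothetical edge table with the `from` and `to` columns used as merge keys
edges = pd.DataFrame(
    {
        "from": ["R00000000", "R00000000", "R00000001"],
        "to": ["SC00000005", "SC00000007", "SC00000003"],
    }
)

# Edge-level summaries indexed by (from, to) pairs
edge_scores = pd.DataFrame(
    {"path_count": [4, 1]},  # illustrative values only
    index=pd.MultiIndex.from_tuples(
        [("R00000000", "SC00000005"), ("R00000001", "SC00000003")],
        names=["from", "to"],
    ),
)

# Joining on both index levels leaves unmatched edges as NaN
annotated_edges = edges.join(edge_scores, on=["from", "to"])
print(annotated_edges)
```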

```python
utils.show("NapistuGraph.get_vertex_dataframe().head(5)")
utils.show(napistu_graph.get_vertex_dataframe().head(5))

utils.show("NapistuGraph.get_edge_dataframe().head(5)")
utils.show(napistu_graph.get_edge_dataframe().head(5))
```

'NapistuGraph.get_vertex_dataframe().head(5)'

| vertex ID | name | node_name | node_type | species_type | s_id | c_id | species_var1 | species_var2 | sample_category | exp_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | SC00000000 | MDH2 dimer [mitochondrial matrix] | species | complex | S00000000 | C00000000 | 4.0 | 1000.0 |  |  |
| 1 | SC00000001 | NAD+ [mitochondrial matrix] | species | metabolite | S00000001 | C00000000 | 4.0 |  |  |  |
| 2 | SC00000002 | NAD+ [cytosol] | species | metabolite | S00000001 | C00000002 | 4.0 |  |  |  |
| 3 | SC00000003 | MAL [mitochondrial matrix] | species | metabolite | S00000002 | C00000000 |  |  |  |  |
| 4 | SC00000004 | MAL [cytosol] | species | metabolite | S00000002 | C00000002 |  |  |  |  |


'NapistuGraph.get_edge_dataframe().head(5)'

| edge ID | source | target | from | to | stoichiometry | sbo_term | r_id | species_type | r_isreversible | direction | weight | upstream_weight | reaction_wts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 159 | 5 | R00000000 | SC00000005 | 1.0 | SBO:0000011 | R00000000 | metabolite | False | forward | 0.5 | 0.5 | 2.0 |
| 1 | 159 | 7 | R00000000 | SC00000007 | 1.0 | SBO:0000011 | R00000000 | metabolite | False | forward | 0.5 | 0.5 | 2.0 |
| 2 | 160 | 3 | R00000001 | SC00000003 | 1.0 | SBO:0000011 | R00000001 | metabolite | False | forward | 0.5 | 0.5 | 3.0 |
| 3 | 161 | 7 | R00000002 | SC00000007 | 1.0 | SBO:0000011 | R00000002 | metabolite | False | forward | 0.5 | 0.5 |  |
| 4 | 161 | 14 | R00000002 | SC00000014 | 1.0 | SBO:0000011 | R00000002 | metabolite | False | forward | 0.5 | 0.5 |  |
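
Blank cells in these tables are missing values: only entities with a matching row in the source table receive an attribute. A short pandas summary, sketched here on a small stand-in built from three of the vertex rows shown above, reports how much of the graph each attribute covers.

```python
import pandas as pd

# Stand-in for `napistu_graph.get_vertex_dataframe()`, using rows 0, 1 and 3 shown above
vertex_df = pd.DataFrame(
    {
        "name": ["SC00000000", "SC00000001", "SC00000003"],
        "species_var1": [4.0, 4.0, None],
        "species_var2": [1000.0, None, None],
    }
)

# Fraction of vertices that received a value for each attribute
coverage = vertex_df[["species_var1", "species_var2"]].notna().mean()
print(coverage)
```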
