---
title: "Tutorial - Understanding the SBML DFs Pathway Format"
author: "Shackett"
date: "May 9th 2025"
---

In [2]:
#| label: config

import logging
import os

import pandas as pd

from napistu import utils

import tutorial_utils

logger = logging.getLogger()
logger.setLevel("INFO")

config = tutorial_utils.NapistuConfig("config.yaml", "understanding_sbml_dfs")
_ = config.load_asset("test_pathway", "sbml_dfs")

# Overview

This tutorial aims to cover:
- What is the `SBML_dfs` format?
- How are `SBML_dfs` created?
- How can they be modified?

# Basic Structure

The [cpr wiki: sbml_dfs](https://github.com/calico/Open-CPR/wiki/SBML-DFs) provides a detailed overview of the `SBML_dfs` data structure. Here, we'll complement that description by showing the attributes of an actual pathway. To get up and running quickly, we'll load an existing `SBML_dfs` object that was constructed by merging multiple metabolic pathways into a consensus pathway.

In [3]:
#| label: load_data
sbml_dfs = utils.load_pickle(config.artifacts["sbml_dfs"])

## Compartments

In [4]:
#| label: compartments
utils.style_df(sbml_dfs.compartments)

| c_id | c_name | c_Identifiers | c_Source |
|---|---|---|---|
| C00000000 | mitochondrial matrix | | |
| C00000001 | mitochondrial inner membrane | | |
| C00000002 | cytosol | | |
| C00000003 | mitochondrion | | |
| C00000004 | endoplasmic reticulum membrane | | |
| C00000005 | nucleoplasm | | |
| C00000006 | nuclear envelope | | |
| C00000007 | endoplasmic reticulum lumen | | |
| C00000008 | extracellular region | | |


## Chemical Species (genes, metabolites, complexes, drugs)

In [5]:
#| label: species
utils.style_df(sbml_dfs.species.head())

| s_id | s_name | s_Identifiers | s_Source |
|---|---|---|---|
| S00000000 | MDH2 dimer | | |
| S00000001 | NAD+ | | |
| S00000002 | MAL | | |
| S00000003 | OAA | | |
| S00000004 | NADH | | |


## Compartmentalized Species (species in a compartment)

In [6]:
#| label: cspecies
utils.style_df(sbml_dfs.compartmentalized_species.head())

| sc_id | sc_name | s_id | c_id | sc_Source |
|---|---|---|---|---|
| SC00000000 | MDH2 dimer [mitochondrial matrix] | S00000000 | C00000000 | |
| SC00000001 | NAD+ [mitochondrial matrix] | S00000001 | C00000000 | |
| SC00000002 | NAD+ [cytosol] | S00000001 | C00000002 | |
| SC00000003 | MAL [mitochondrial matrix] | S00000002 | C00000000 | |
| SC00000004 | MAL [cytosol] | S00000002 | C00000002 | |


## Reactions

In [7]:
#| label: reactions
utils.style_df(sbml_dfs.reactions.head())

| r_id | r_name | r_Identifiers | r_Source | r_isreversible |
|---|---|---|---|---|
| R00000000 | (S)-Malate + NAD+ <=> Oxaloacetate + NADH + H+ | | | False |
| R00000001 | Oxaloacetate + NADH + H+ <=> (S)-Malate + NAD+ | | | False |
| R00000002 | ME2:Mg2+ tetramer oxidatively decarboxylates MAL to PYR | | | False |
| R00000003 | NADPH + NAD+ + H+ [cytosol] => NADP+ + NADH + H+ [mitochondrial matrix] | | | False |
| R00000004 | isocitrate + NAD+ => alpha-ketoglutarate + CO2 + NADH + H+ [IDH3] | | | False |


## Reaction Species

In [8]:
#| label: reaction_species
utils.style_df(sbml_dfs.reaction_species.head())

| rsc_id | r_id | sc_id | stoichiometry | sbo_term |
|---|---|---|---|---|
| RSC00000000 | R00000000 | SC00000000 | 0.0 | SBO:0000013 |
| RSC00000001 | R00000000 | SC00000001 | -1.0 | SBO:0000010 |
| RSC00000002 | R00000000 | SC00000003 | -1.0 | SBO:0000010 |
| RSC00000003 | R00000000 | SC00000005 | 1.0 | SBO:0000011 |
| RSC00000004 | R00000000 | SC00000007 | 1.0 | SBO:0000011 |
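
The core tables are linked by shared keys: `reaction_species` points at `reactions` through `r_id` and at `compartmentalized_species` through `sc_id`, which in turn points at `species` (`s_id`) and `compartments` (`c_id`). As a minimal sketch, the toy tables below copy a few rows from the outputs above into plain pandas DataFrames (they are not loaded from the real `sbml_dfs` object) and join them to list some participants of reaction `R00000000`. The `SBO_ROLES` mapping uses the standard Systems Biology Ontology labels for these three terms; the variable names are illustrative only.

```python
import pandas as pd

# Toy copies of rows shown above; not the real sbml_dfs tables.
compartmentalized_species = pd.DataFrame(
    {
        "sc_id": ["SC00000000", "SC00000001", "SC00000003"],
        "sc_name": [
            "MDH2 dimer [mitochondrial matrix]",
            "NAD+ [mitochondrial matrix]",
            "MAL [mitochondrial matrix]",
        ],
        "s_id": ["S00000000", "S00000001", "S00000002"],
        "c_id": ["C00000000", "C00000000", "C00000000"],
    }
).set_index("sc_id")

reaction_species = pd.DataFrame(
    {
        "rsc_id": ["RSC00000000", "RSC00000001", "RSC00000002"],
        "r_id": ["R00000000"] * 3,
        "sc_id": ["SC00000000", "SC00000001", "SC00000003"],
        "stoichiometry": [0.0, -1.0, -1.0],
        "sbo_term": ["SBO:0000013", "SBO:0000010", "SBO:0000010"],
    }
).set_index("rsc_id")

# Human-readable labels for the SBO terms used in this example.
SBO_ROLES = {
    "SBO:0000010": "reactant",
    "SBO:0000011": "product",
    "SBO:0000013": "catalyst",
}

# Match the sc_id column against the index of compartmentalized_species.
participants = reaction_species.join(
    compartmentalized_species[["sc_name"]], on="sc_id"
)
participants["role"] = participants["sbo_term"].map(SBO_ROLES)
print(participants[["r_id", "sc_name", "stoichiometry", "role"]])
```

Note that the catalyst row carries a stoichiometry of 0, while reactants are negative and products positive, matching the sign convention visible in the table above.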


## Optional attributes (`species_data` and `reactions_data`)

We can store additional data alongside species and reactions, which lets us connect molecular data to our graph. These tables, `species_data` and `reactions_data`, are discussed in the **adding_data_to_graphs** tutorial.

# Identifiers and Sources

Many of the core tables in `SBML_dfs` contain a column to track systematic identifiers and/or the information source. For example, `s_Identifiers` tracks a molecule's systematic identifiers, while `s_Source` tracks the pathways that include the molecule (this becomes relevant once we start merging network models; see the **merging_networks_as_a_consensus** tutorial).

## Identifiers

Identifiers are described in [cpr wiki: identifiers](https://github.com/calico/Open-CPR/wiki/SBML-DFs#identifiers), but here is an actual `Identifiers` object.

In [9]:
#| label: identifiers_a
an_identifier = sbml_dfs.species["s_Identifiers"][0]
type(an_identifier)

napistu.identifiers.Identifiers

At their core, `Identifiers` objects are lists of dictionaries:

In [10]:
#| label: identifiers_b
an_identifier.ids

[{'ontology': 'reactome',
  'identifier': 'R-HSA-198511',
  'url': 'https://reactome.org/content/detail/R-HSA-198511',
  'bqb': 'BQB_IS'},
 {'ontology': 'uniprot',
  'identifier': 'P40926',
  'url': 'https://purl.uniprot.org/uniprot/P40926',
  'bqb': 'BQB_HAS_PART'},
 {'ontology': 'reactome',
  'identifier': 'R-MMU-198511',
  'url': 'https://reactome.org/content/detail/R-MMU-198511',
  'bqb': 'BQB_IS_HOMOLOG_TO'},
 {'ontology': 'reactome',
  'identifier': 'R-RNO-198511',
  'url': 'https://reactome.org/content/detail/R-RNO-198511',
  'bqb': 'BQB_IS_HOMOLOG_TO'},
 {'ontology': 'reactome',
  'identifier': 'R-CFA-198511',
  'url': 'https://reactome.org/content/detail/R-CFA-198511',
  'bqb': 'BQB_IS_HOMOLOG_TO'},
 {'ontology': 'reactome',
  'identifier': 'R-BTA-198511',
  'url': 'https://reactome.org/content/detail/R-BTA-198511',
  'bqb': 'BQB_IS_HOMOLOG_TO'},
 {'ontology': 'reactome',
  'identifier': 'R-DRE-198511',
  'url': 'https://reactome.org/content/detail/R-DRE-198511',
  'bqb': 'BQB

Or, reformatted:

In [11]:
#| label: identifiers_c
utils.style_df(pd.DataFrame(an_identifier.ids))

| | ontology | identifier | url | bqb |
|---|---|---|---|---|
| 0 | reactome | R-HSA-198511 | https://reactome.org/content/detail/R-HSA-198511 | BQB_IS |
| 1 | uniprot | P40926 | https://purl.uniprot.org/uniprot/P40926 | BQB_HAS_PART |
| 2 | reactome | R-MMU-198511 | https://reactome.org/content/detail/R-MMU-198511 | BQB_IS_HOMOLOG_TO |
| 3 | reactome | R-RNO-198511 | https://reactome.org/content/detail/R-RNO-198511 | BQB_IS_HOMOLOG_TO |
| 4 | reactome | R-CFA-198511 | https://reactome.org/content/detail/R-CFA-198511 | BQB_IS_HOMOLOG_TO |
| 5 | reactome | R-BTA-198511 | https://reactome.org/content/detail/R-BTA-198511 | BQB_IS_HOMOLOG_TO |
| 6 | reactome | R-DRE-198511 | https://reactome.org/content/detail/R-DRE-198511 | BQB_IS_HOMOLOG_TO |
| 7 | reactome | R-XTR-198511 | https://reactome.org/content/detail/R-XTR-198511 | BQB_IS_HOMOLOG_TO |
| 8 | reactome | R-GGA-198511 | https://reactome.org/content/detail/R-GGA-198511 | BQB_IS_HOMOLOG_TO |
| 9 | reactome | R-DME-198511 | https://reactome.org/content/detail/R-DME-198511 | BQB_IS_HOMOLOG_TO |
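
Because the entries are plain dictionaries, ordinary Python is enough to query them. The sketch below hard-codes three of the entries shown above (rather than reading them from `an_identifier.ids`) and pulls out the UniProt accessions and the entries whose qualifier is `BQB_IS`. The helper name `ids_for_ontology` is hypothetical and is not part of napistu.

```python
# Three entries copied from the output above; the real list is longer.
ids = [
    {"ontology": "reactome", "identifier": "R-HSA-198511",
     "url": "https://reactome.org/content/detail/R-HSA-198511",
     "bqb": "BQB_IS"},
    {"ontology": "uniprot", "identifier": "P40926",
     "url": "https://purl.uniprot.org/uniprot/P40926",
     "bqb": "BQB_HAS_PART"},
    {"ontology": "reactome", "identifier": "R-MMU-198511",
     "url": "https://reactome.org/content/detail/R-MMU-198511",
     "bqb": "BQB_IS_HOMOLOG_TO"},
]


def ids_for_ontology(entries, ontology):
    """Return the identifiers belonging to one ontology (hypothetical helper)."""
    return [e["identifier"] for e in entries if e["ontology"] == ontology]


uniprot_ids = ids_for_ontology(ids, "uniprot")
# Entries qualified with BQB_IS describe the entity itself,
# as opposed to its parts or its homologs.
exact_matches = [e["identifier"] for e in ids if e["bqb"] == "BQB_IS"]
print(uniprot_ids, exact_matches)
```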


## Sources

While `Identifiers` track the systematic identifiers associated with entities, `Source` objects track which pathway representation the information came from. For a single-source `SBML_dfs` object, the sources will be empty. For a consensus model derived by combining multiple pathways, the `Source` object will track the upstream pathways. See [cpr wiki: sources](https://github.com/calico/Open-CPR/wiki/SBML-DFs#sources) for more details. Here, we can look more closely at a representative `Source` object to better understand its structure.

In [12]:
#| label: source_a
a_source = sbml_dfs.species["s_Source"][0]
type(a_source)

napistu.source.Source

The core attribute of a `Source` object is its `source` pd.DataFrame. Since we are loading an `SBML_dfs` object that was constructed by merging multiple upstream `SBML_dfs` objects (see dev/create_test_pathway.qmd for how this was done), the source table tracks which upstream pathway(s) each entity came from:

In [13]:
#| label: source_b
utils.style_df(a_source.source)

| | model | s_id | file | source | species | pathway_id | name | date |
|---|---|---|---|---|---|---|---|---|
| 0 | tca | S00000000 | reactome_tca.sbml | Reactome | Homo sapiens | tca | citric acid cycle | |
| 1 | glucose metabolism | S00000015 | reactome_glucose_metabolism.sbml | Reactome | Homo sapiens | glucose metabolism | glucose metabolism | |


When an `SBML_dfs` model is initialized from a single source, its `Source.source` attributes are generally None. The presence of source information therefore indicates that the model was created using the consensus module. This module is described in the **merging_networks_as_a_consensus** tutorial.
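
To make the structure concrete, here is a small sketch that rebuilds part of the source table shown above as a plain DataFrame and summarizes which upstream pathways contributed to the entity. The `summarize_source` helper is hypothetical; it only mimics the None-versus-DataFrame convention described above and does not call napistu.

```python
import pandas as pd

# Toy copy of the source table shown above (subset of columns).
source_table = pd.DataFrame(
    {
        "model": ["tca", "glucose metabolism"],
        "s_id": ["S00000000", "S00000015"],
        "file": ["reactome_tca.sbml", "reactome_glucose_metabolism.sbml"],
        "pathway_id": ["tca", "glucose metabolism"],
    }
)


def summarize_source(source):
    """List contributing pathways, or [] when no source table is present."""
    if source is None:
        # Single-source models generally carry no source table.
        return []
    return sorted(source["pathway_id"].unique())


print(summarize_source(source_table))
print(summarize_source(None))
```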

# Creating `SBML_dfs`

## From a `.sbml` file

`SBML_dfs` pathway models can be created in three ways: from a .sbml file, from an edgelist, and from raw tables. The latter two are covered later in this document. We start with .sbml files because the process is straightforward and the SBML convention is arguably the most expressive format for accurately describing molecular mechanisms: [cpr wiki: sbml](https://github.com/calico/Open-CPR/wiki/SBML).

Here is how we can load and format the Reactome Glucose Metabolism pathway as an `SBML_dfs` object.

In [14]:
#| label: sbml_dfs_from_sbml

from napistu import sbml_dfs_core
from napistu.ingestion import sbml

EXAMPLE_MODEL = "reactome_glucose_metabolism.sbml"
test_data = tutorial_utils.locate_test_data()

# parse the .sbml file, then convert the parsed model into an SBML_dfs object
model_path = os.path.join(test_data, EXAMPLE_MODEL)
sbml_model = sbml.SBML(model_path)
sbml_dfs = sbml_dfs_core.SBML_dfs(sbml_model)

INFO:napistu.utils:creating an edgelist linking index levels s_id, entry and linking it to levels defined by ontology, identifier
DEBUG:napistu.utils:label is not defined in table_schema; adding a constant (1)


## From the underlying tables

In [15]:
#| label: sbml_dfs_from_tables

sbml_tbl_dict = {
    "compartments": sbml_dfs.compartments,
    "species": sbml_dfs.species,
    "compartmentalized_species": sbml_dfs.compartmentalized_species,
    "reactions": sbml_dfs.reactions,
    "reaction_species": sbml_dfs.reaction_species,
}

sbml_dfs_core.SBML_dfs(sbml_tbl_dict).species

| s_id | s_name | s_Identifiers | s_Source |
|---|---|---|---|
| S00000000 | Glc | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x1581610d0>` |
| S00000001 | G6PC2 | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x158392150>` |
| S00000002 | G6P | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x15846c710>` |
| S00000003 | H2O | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x15846fed0>` |
| S00000004 | Pi | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x158456290>` |
| ... | ... | ... | ... |
| S00000075 | PP2A-ABdeltaC complex | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x1584947d0>` |
| S00000076 | phosphoPFKFB1 dimer | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x1584942d0>` |
| S00000077 | PFKFB1 dimer | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x1584949d0>` |
| S00000078 | PFKFB dimers | `<napistu.identifiers.Identifiers object at 0x1...` | `<napistu.source.Source object at 0x158494410>` |


## From an edgelist

**TO DO**


# Modifying `SBML_dfs`

## Dropping cofactors

While cofactors such as ATP and NADH are biologically important, they can generate confusion when we translate reactions into networks. For example, a dehydration reaction (X - H2O) and a hydration reaction (Y + H2O) would not really suggest regulation of Y by X via water. To deal with this issue, it's helpful to remove cofactors for many applications.

In [16]:
#| label: drop_cofactors
from napistu.modify import pathwayannot

filtered_pathway = pathwayannot.drop_cofactors(sbml_dfs)

INFO:napistu.modify.pathwayannot:There were 13 cofactor species: H2O, Pi, NAD+, NADH, H+, L-Glu, CO2, ATP, ADP, Ac-CoA, GTP, GDP, AMP
INFO:napistu.modify.pathwayannot:Cofactor species are present 70 times in reactions
INFO:napistu.modify.pathwayannot:32 of 50 reactions include cofactor species
INFO:napistu.modify.pathwayannot:60 of 250 reaction species will be filtered as cofactors
INFO:napistu.modify.pathwayannot:<pandas.io.formats.style.Styler object at 0x15832dc50>
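
Conceptually, dropping cofactors amounts to removing rows of `reaction_species` whose species appear on a cofactor list. The toy sketch below illustrates that idea with plain pandas. It is not `pathwayannot.drop_cofactors`, and it ignores whatever additional rules napistu applies (the log above reports 70 cofactor occurrences but only 60 filtered rows, so the real function is evidently more selective). The `COFACTORS` set, reaction ids, and species rows are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cofactor list, for illustration only.
COFACTORS = {"H2O", "NAD+", "NADH", "H+", "ATP", "ADP"}

# Toy reaction_species table with species names already attached.
reaction_species = pd.DataFrame(
    {
        "r_id": ["R1", "R1", "R1", "R1", "R2", "R2", "R2", "R2"],
        "s_name": ["MAL", "NAD+", "OAA", "NADH", "Glc", "ATP", "G6P", "ADP"],
        "stoichiometry": [-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0],
    }
)

# Flag every row whose species is on the cofactor list, then drop those rows.
is_cofactor = reaction_species["s_name"].isin(COFACTORS)
filtered = reaction_species.loc[~is_cofactor].reset_index(drop=True)

print(f"{is_cofactor.sum()} of {len(reaction_species)} reaction species filtered")
print(filtered)
```

After filtering, each toy reaction keeps only its primary substrate and product, so a network built from it links MAL to OAA and Glc to G6P without routing paths through shared cofactors.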


## Uncompartmentalizing

Compartments are necessary to represent many mechanisms involving transport, such as the proton gradient that is coupled to the electron transport chain. However, they add complexity that may be unnecessary for some applications. By merging compartments we can create an uncompartmentalized model.

In [17]:
#| label: uncompartmentalize
from napistu.modify import uncompartmentalize

one_compartment_sbml_dfs = uncompartmentalize.uncompartmentalize_sbml_dfs(sbml_dfs)

INFO:napistu.consensus:Merging reactions based on identical membership (sc_id + stoichiometry)
INFO:napistu.consensus:>>>> 2 r_id entries merged into 1
INFO:napistu.consensus:Creating formulas for to-be-merged reactions to help with reporting merges of reactions with inconsistently named reactants
INFO:napistu.consensus:Done creating reaction formulas
INFO:napistu.consensus:Testing for identical formulas of to-be-merged reactions
1 merges were of entities with distinct names, including:



INFO:napistu.consensus:Merging reactions identifiers
INFO:napistu.consensus:Merging reactions sources
INFO:napistu.consensus:Creating source table
INFO:napistu.consensus:Aggregating old sources
INFO:napistu.consensus:Returning new source table
DEBUG:napistu.utils:label is not defined in table_schema; adding a constant (1)
INFO:napistu.modify.uncompartmentalize:14 reactions species will be removed because they are substrates and products in the same reaction
INFO:napistu.modify.uncompartmentalize:2 re
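
The idea can be illustrated with a toy example. The sketch below maps each compartmentalized species back to its species and then drops entries where the same species ends up as both a substrate and a product of one reaction (as happens for a pure transport reaction once compartments are merged). This is a simplification in plain pandas under stated assumptions, not the logic of `uncompartmentalize_sbml_dfs`; every identifier in it is invented.

```python
import pandas as pd

# Toy lookup from compartmentalized species to species; ids are invented.
sc_to_s = pd.Series(
    {"SC1": "S_NAD", "SC2": "S_NAD", "SC3": "S_MAL", "SC4": "S_OAA"},
    name="s_id",
)

reaction_species = pd.DataFrame(
    {
        "r_id": ["R_transport", "R_transport", "R_mdh", "R_mdh"],
        "sc_id": ["SC1", "SC2", "SC3", "SC4"],
        "stoichiometry": [-1.0, 1.0, -1.0, 1.0],
    }
)

# Replace compartment-specific ids with compartment-free species ids.
merged = reaction_species.assign(s_id=reaction_species["sc_id"].map(sc_to_s))

# Flag species that are both consumed and produced within one reaction.
grouped = merged.groupby(["r_id", "s_id"])["stoichiometry"]
cancels = (grouped.transform("min") < 0) & (grouped.transform("max") > 0)

uncompartmentalized = merged.loc[~cancels, ["r_id", "s_id", "stoichiometry"]]
print(uncompartmentalized)
```

Checking for both a negative and a positive entry, rather than for a zero sum, keeps rows with a stoichiometry of 0 (such as catalysts) from being dropped by accident.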

## Gap-filling

**TO DO**