This notebook demonstrates PyBioPAX with a collection of functions for traversing the full, detailed BioPAX export of Pathway Commons.

In [1]:
import pystow
import gzip
from typing import Optional, Iterable, Set, Tuple, Any
import pickle
import pybiopax
from lxml import etree
from tabulate import tabulate
from pybiopax.biopax import *
from tqdm.auto import tqdm
from collections import Counter
from IPython.display import HTML

In [2]:
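def ensure_pc_detailed(version: Optional[str] = None, force: bool = False):
    """Download (if needed) and parse the detailed BioPAX export of Pathway Commons."""
    if version is None:
        # Look up the current Pathway Commons version when none is given
        import bioversions

        version = bioversions.get_version("pathwaycommons")

    url = f"https://www.pathwaycommons.org/archives/PC2/v{version}/PathwayCommons{version}.Detailed.BIOPAX.owl.gz"
    # pystow caches the file locally; force=True re-downloads it
    path = pystow.ensure("bio", "pathwaycommons", version, url=url, force=force)
    return pybiopax.model_from_owl_gz(path)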

pc12 = ensure_pc_detailed(version="12")

A simple first look into a model is to count how many of each type of BioPAX element it contains.

In [10]:
type_counter = Counter(
    obj.__class__.__name__
    for obj in pc12.objects.values()
)

print(f"Got {sum(type_counter.values()):,} objects")

HTML(tabulate(type_counter.most_common(), tablefmt="html", headers=["Type", "Count"]))

Got 3,745,386 objects


Type,Count
TemplateReactionRegulation,962641
RelationshipXref,683543
Evidence,558049
TemplateReaction,166320
Control,152432
UnificationXref,133283
PublicationXref,119613
Protein,114023
SequenceSite,108257
Catalysis,100923
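
The counting pattern can be tried without downloading the archive. The following is a minimal sketch, assuming a small dictionary in place of `pc12.objects`; the classes `Protein`, `Catalysis`, and `Evidence` defined here are empty, hypothetical placeholders rather than the PyBioPAX classes.

```python
from collections import Counter


# Hypothetical stand-ins for BioPAX classes (the real ones live in pybiopax.biopax)
class Protein: ...
class Catalysis: ...
class Evidence: ...


# Assumed toy replacement for the model's ``objects`` mapping
objects = {
    "uri:1": Protein(),
    "uri:2": Protein(),
    "uri:3": Catalysis(),
    "uri:4": Evidence(),
    "uri:5": Protein(),
}

# Count objects by the name of their class
type_counter = Counter(type(obj).__name__ for obj in objects.values())

# most_common() orders the pairs by descending count, as in the table above
for name, count in type_counter.most_common():
    print(f"{name}: {count}")
```

Counting by class name rather than by class keeps the keys printable, which is convenient when passing the result straight to `tabulate`.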


## Which enzymes need to be phosphorylated to catalyze a reaction?

There are several ways to define an "active state", but this is a quick and dirty way of identifying them: look for catalyzing proteins that carry a phosphorylation. The first 15 matches are shown for brevity. To implement this check, it was also necessary to implement the `iter_modifications()` function, which yields the modification features of a given protein (or any other physical entity) whose type matches a query string.

In [4]:
def iter_modifications(entity: PhysicalEntity, query: str) -> Iterable[ModificationFeature]:
    """Iterate over modification features in a protein that have the query string as a substring."""
    for feature in entity.feature or []:
        # Keep only modification features with a known type whose
        # terms contain the query string (e.g., "phospho" for phosphorylation)
        if (
            isinstance(feature, ModificationFeature)
            and feature.modification_type
            and any(query in mod for mod in feature.modification_type.term)
        ):
            yield feature

def iter_phosphosites(protein: Protein):
    yield from iter_modifications(protein, "phospho")

rows = []
for obj in tqdm(pc12.get_objects_by_type(Catalysis)):
    for protein in obj.controller:
        if not isinstance(protein, Protein):
            continue
        features = list(iter_phosphosites(protein))
        if not features:
            continue
        rows.append((
            obj.display_name or "",
            protein.display_name,
            ", ".join(o.display_name for o in obj.controlled.left),
            ", ".join(o.display_name for o in obj.controlled.right),
        ))

print(f"Matched {len(rows)} examples")
        
HTML(tabulate(rows[:15], tablefmt="html", headers=["Name", "Enzyme", "Reactants", "Products"]))

Matched 1283 examples


Name,Enzyme,Reactants,Products
,RAF1,MEK1,MEK1
,JAK2_p,"STAT5, STAT5",STAT5_p
,RAF1,MEK2,MEK2
,JAK1_p,"gp130, gp130",gp130_p
,Akt_p,GSK3b,GSK3b_p
CATALYSIS,Raf1,MEK,MEK
,MEK1/2(MKK1/2)_p,ERK1/2,ERK1/2_p
,PLCgamma1,PI-4-5-P2,"DAG, IP3"
,MEK1/2(MKK1/2)_p,ERK1/2,ERK1/2_p
CATALYSIS,MAPKAPK2,SRF,SRF
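
The substring check at the heart of `iter_modifications()` can be exercised in isolation. This is a minimal sketch, assuming simplified stand-in dataclasses (`Vocabulary`, `Feature`, and `Entity` are hypothetical and only mimic the `feature`, `modification_type`, and `term` attributes used above); the modification terms are illustrative examples, not values taken from Pathway Commons.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Vocabulary:
    """Hypothetical stand-in for a vocabulary object holding a list of terms."""
    term: List[str]


@dataclass
class Feature:
    """Hypothetical stand-in for a modification feature."""
    modification_type: Optional[Vocabulary] = None


@dataclass
class Entity:
    """Hypothetical stand-in for a physical entity."""
    display_name: str
    feature: List[Feature] = field(default_factory=list)


def iter_matching_features(entity: Entity, query: str) -> Iterable[Feature]:
    """Yield features whose modification type has a term containing ``query``."""
    for feature in entity.feature or []:
        if feature.modification_type and any(
            query in term for term in feature.modification_type.term
        ):
            yield feature


jak2 = Entity(
    "JAK2_p",
    feature=[
        Feature(Vocabulary(["O-phospho-L-tyrosine"])),
        Feature(Vocabulary(["N6-acetyl-L-lysine"])),
        Feature(None),  # features without a known type are skipped
    ],
)

print(len(list(iter_matching_features(jak2, "phospho"))))  # 1
print(len(list(iter_matching_features(jak2, "acetyl"))))   # 1
print(len(list(iter_matching_features(jak2, "methyl"))))   # 0
```

Note that the `in` test is case-sensitive, so a term written as "Phosphorylation" would not match the query "phospho" unless both sides are lowercased first.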


## Which catalyses of biochemical reactions require a cofactor?

Pathway Commons v12 contains 84 catalyses annotated at this level of detail.

In [5]:
def iter_cofactored_catalyses(model: BioPaxModel) -> Iterable[Catalysis]:
    """Iterate over catalyses of biochemical reactions that require a cofactor."""
    for obj in model.get_objects_by_type(Catalysis):
        if not obj.cofactor:
            continue
        if not isinstance(obj.controlled, BiochemicalReaction):
            continue
        yield obj

rows = [
    (
        obj.display_name,
        obj.controller,
        obj.cofactor.display_name,
        ", ".join(o.display_name for o in obj.controlled.left),
        ", ".join(o.display_name for o in obj.controlled.right),
    )
    for obj in iter_cofactored_catalyses(pc12)
]

print(f"Matched {len(rows)} examples")

HTML(tabulate(rows[:15], tablefmt="html", headers=["Name", "Controller", "Cofactor", "Reactants", "Products"]))

Matched 84 examples


Name,Controller,Cofactor,Reactants,Products
tyrosine aminotransferase,[Complex(TAT)],pyridoxal-P,"2-oxoglutarate, tyr","glt, 4-hydroxyphenylpyruvate"
nitric oxide synthase,[Complex(iNOS)],protoheme IX,"NADPH, O2, arg, H+","L-citrulline, nitric oxide, NADP+, H2O"
protoporphyrinogen oxidase,[Complex(PPO)],FAD,"protoporphyrinogen IX, O2","hydrogen peroxide, protoporphyrin IX"
creatine kinase,[Complex(BB)],Mg2+,"creatine, ATP","creatine-phosphate, ADP, H+"
kynurenine 3-monooxygenase,[Protein(Kynurenine 3-monooxygenase)],FAD,"NADPH, O2, H+, L-kynurenine","3-hydroxy-L-kynurenine, NADP+, H2O"
arginase,[Complex(arginase type 2)],Mn2+,"arg, H2O","L-ornithine, urea"
methylarsonate reductase,"[Complex(glutathione transferase, Ω class)]",glutathione,"glutathione, methylarsonate","glutathione disulfide, methylarsonite"
glutamine-phenylpyruvate transaminase,[Complex(KAT1)],pyridoxal-P,"gln, 2-oxo-3-phenylpropanoate","phe, 2-oxoglutaramate"
D-lactate dehydrogenase (cytochrome),[Protein(DLD)],FAD,"an oxidized cytochrome c, (R)-lactate","a reduced c-type cytochrome, pyruvate, H+"
"indoleamine 2,3-dioxygenase",[Protein(IDO)],protoheme IX,"trp, O2, H+",N-formylkynurenine


## Find Phosphorylation Reactions

The same approach can later be generalized to find the addition of any kind of modification.

In [6]:
def get_simple_physical_entity_xrefs(obj: SimplePhysicalEntity) -> Set[Tuple[str, str]]:
    """Get xrefs from a simple physical entity as pairs."""
    if not obj.entity_reference:
        return set()
    return {(xref.db, xref.id) for xref in obj.entity_reference.xref or []}

def is_modification_reaction(obj: Any) -> bool:
    """Check if the object is a biochemical reaction with the same
    entity as reactant/product but it's modified.
    """
    if not isinstance(obj, BiochemicalReaction):
        return False
    if len(obj.left) != 1 or len(obj.right) != 1:
        return False
    left, right = obj.left[0], obj.right[0]
    if not isinstance(left, Protein) or not isinstance(right, Protein):
        return False
    left_xrefs = get_simple_physical_entity_xrefs(left)
    right_xrefs = get_simple_physical_entity_xrefs(right)
    return bool(left_xrefs & right_xrefs)

def iter_modification_reactions(model: BioPaxModel) -> Iterable[BiochemicalReaction]:
    """Iterate over biochemical reactions in the model that are modification reactions which
    pass :func:`is_modification_reaction`.
    """
    for obj in model.get_objects_by_type(BiochemicalReaction):
        if is_modification_reaction(obj):
            yield obj

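def iter_phosphorylations(m: BioPaxModel) -> Iterable[BiochemicalReaction]:
    """Iterate over modification reactions whose product gains a phosphosite."""
    for obj in iter_modification_reactions(m):
        left = list(iter_phosphosites(obj.left[0]))
        right = list(iter_phosphosites(obj.right[0]))
        # Keep reactions where only the product has phosphosites
        if not left and right:
            yield obj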

rows = [
    (
        obj.display_name, obj.left[0]
    )
    for obj in iter_phosphorylations(pc12)
]

print(f"Matched {len(rows)} examples")

HTML(tabulate(rows[:15], tablefmt="html", headers=["Name", "Reactant"]))

Matched 15332 examples


Name,Reactant
,Protein(PDHA2)
,Protein(AVEN)
,Protein(SIT1)
,Protein(RALY)
,Protein(ATP1A1)
,Protein(Kv11.1 iso5)
,Protein(HSF1)
,Protein(IRF3)
,Protein(TPPP)
,Protein(PINX1)
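
One possible way to generalize from phosphorylation to any added modification is to make the query string a parameter. This is a minimal sketch, assuming the modification terms of the reactant and product have already been collected into plain lists of strings; `has_modification` and `is_modification_addition` are hypothetical helpers, not part of PyBioPAX, and the term lists are made up for illustration.

```python
from typing import Iterable, List


def has_modification(terms: Iterable[str], query: str) -> bool:
    """Check whether any modification term contains ``query`` as a substring."""
    return any(query in term for term in terms)


def is_modification_addition(left_terms: List[str], right_terms: List[str], query: str) -> bool:
    """Check whether the product carries a ``query`` modification that the reactant lacks."""
    return not has_modification(left_terms, query) and has_modification(right_terms, query)


# Illustrative (made-up) term lists for the reactant and product of one reaction
unmodified: List[str] = []
phosphorylated = ["O-phospho-L-serine"]

print(is_modification_addition(unmodified, phosphorylated, "phospho"))      # True
print(is_modification_addition(phosphorylated, phosphorylated, "phospho"))  # False
print(is_modification_addition(unmodified, phosphorylated, "acetyl"))       # False
```

Like the phosphorylation check in the notebook, this only reports a modification that is absent on the reactant and present on the product, so a second phosphorylation of an already phosphorylated protein would not be detected.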


## Get Proteins with Bound Small Molecules

Complexes are easy to retrieve by calling `get_objects_by_type(Complex)` and then iterating over each complex's `component` attribute. The following functions iterate over complexes that consist of exactly one protein and one or more small molecules.

In [7]:
def head(it, n=10):
    for _, obj in zip(range(n), it):
        yield obj

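def iter_bound(m):
    """Iterate over complexes of exactly one protein and at least one small molecule."""
    for obj in m.get_objects_by_type(Complex):
        # Count the components of the complex by class
        counts = Counter(component.__class__ for component in obj.component)
        if counts.get(Protein) != 1:
            continue
        # Skip complexes containing anything besides proteins and small molecules
        if {SmallMolecule, Protein} != set(counts):
            continue
        yield obj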

for obj in head(iter_bound(pc12)):
    print(obj)
    for component in obj.component:
        print(" ", component)

Complex(phenanthrene/ESR2 protein complex)
  Protein(ESR2 protein)
  SmallMolecule(phenanthrene)
Complex(SIRT3(?-399):Zn2+)
  SmallMolecule(Zn2+)
  Protein(SIRT3)
Complex(mono-n-octyl phthalate/PPARB protein complex)
  SmallMolecule(mono-n-octyl phthalate)
  Protein(PPARB protein)
Complex(Cholestanols/GPBAR1 protein complex)
  SmallMolecule(Cholestanols)
  Protein(GPBAR1 protein)
Complex(RAB4A:GTP)
  Protein(RAB4A)
  SmallMolecule(GTP)
Complex(fenofibric acid/PPARA protein complex)
  SmallMolecule(fenofibric acid)
  Protein(PPARA protein)
Complex(Dipyridamole/RCAN1 protein alternative form complex)
  SmallMolecule(Dipyridamole)
  Protein(RCAN1 protein alternative form)
Complex(pirinixic acid/PPARA protein complex)
  SmallMolecule(pirinixic acid)
  Protein(PPARA protein)
Complex(4-nitrobenzylthioinosine/SLC29A1 protein complex)
  Protein(SLC29A1 protein)
  SmallMolecule(4-nitrobenzylthioinosine)
Complex(bisphenol A/PGR complex)
  Protein(PGR)
  SmallMolecule(bisphenol A)
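
The composition test used by `iter_bound()` can be checked on toy data. This is a minimal sketch, assuming empty placeholder classes (`Protein`, `SmallMolecule`, and `Dna` here are hypothetical stand-ins, not the PyBioPAX classes) and a hypothetical helper `is_single_protein_with_small_molecules` that takes a plain list of components.

```python
from collections import Counter
from typing import Any, List


# Hypothetical stand-ins for the BioPAX component classes
class Protein: ...
class SmallMolecule: ...
class Dna: ...


def is_single_protein_with_small_molecules(components: List[Any]) -> bool:
    """Check for exactly one protein plus one or more small molecules, and nothing else."""
    counts = Counter(type(component) for component in components)
    # Exactly one protein is required
    if counts.get(Protein) != 1:
        return False
    # The set of classes present must be exactly these two
    return set(counts) == {Protein, SmallMolecule}


print(is_single_protein_with_small_molecules([Protein(), SmallMolecule()]))             # True
print(is_single_protein_with_small_molecules([Protein(), Protein(), SmallMolecule()]))  # False
print(is_single_protein_with_small_molecules([Protein()]))                              # False
print(is_single_protein_with_small_molecules([Protein(), SmallMolecule(), Dna()]))      # False
```

Because the counter is keyed on the exact class of each component, instances of subclasses would be counted separately from their parent class; an `isinstance`-based count would be needed if that matters.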


Get proteins with multiple bound small molecules

In [8]:
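def iter_bound_multiple(m):
    """Iterate over complexes of exactly one protein and two or more small molecules."""
    for obj in m.get_objects_by_type(Complex):
        # Count the components of the complex by class
        counts = Counter(component.__class__ for component in obj.component)
        if counts.get(Protein) != 1:
            continue
        # Skip complexes containing anything besides proteins and small molecules
        if {SmallMolecule, Protein} != set(counts):
            continue
        if counts.get(SmallMolecule) < 2:
            continue
        yield obj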

for obj in head(iter_bound_multiple(pc12)):
    print(obj)
    for component in obj.component:
        print(" ", component)

Complex(Vitamin K 3/NADP/DCXR protein complex)
  SmallMolecule(Vitamin K 3)
  SmallMolecule(NADP)
  Protein(DCXR protein)
Complex(palmitoylated, myristoylated eNOS dimer)
  SmallMolecule(Zn2+)
  SmallMolecule(FMN)
  SmallMolecule(heme)
  Protein(2xPalmC-MyrG-NOS3)
  SmallMolecule(FAD)
Complex(Cholestanol/NAD/HSD17B10 protein complex)
  SmallMolecule(NAD)
  Protein(HSD17B10 protein)
  SmallMolecule(Cholestanol)
Complex(lauroyl-coenzyme A/NADP/HSDL2 protein complex)
  Protein(HSDL2 protein)
  SmallMolecule(lauroyl-coenzyme A)
  SmallMolecule(NADP)
Complex(CIAPIN1:4Fe-4S:2Fe-2S oxidized)
  Protein(CIAPIN1)
  SmallMolecule((2Fe-2S)(2+))
  SmallMolecule(4Fe-4S cluster)
Complex(pyruvate kinase complex, liver and RBC)
  Protein(Pyruvate kinase, R/L type)
  SmallMolecule(Mg2+)
  SmallMolecule(K+)
Complex(PP2B catalytic (Fe3+, Zn2+))
  SmallMolecule(Zn2+)
  SmallMolecule(Fe3+)
  Protein(PPP3CA,B,C)
Complex(Glycyrrhetinic Acid/NADP/HSD3B1 protein complex)
  SmallMolecule(Glycyrrhetinic Acid)
  S