# *i*CH360 Knowledge Graph assembly pipeline
This script contains the complete pipeline used to parse, assemble and curate the *i*CH360 knowledge graph. 

**Important**

The pipeline that constructs the graph relies on extensive HTTP querying of the EcoCyc database. To guarantee reproducibility of the results in this paper, the output of all BioCyc queries was serialised and cached, so the script can be run without querying the database.

However, it is possible to run the pipeline from scratch, retrieving the required data from the online database. To do so:
1. Create a file with valid credentials (username and password) for an EcoCyc account. Use `./graph_assembly/biocyc_username_and_password_template.csv` as a template, replace the placeholder username and password with your own, and save the modified file in the same directory as `./graph_assembly/biocyc_username_and_password.csv` (note that the filename differs from that of the template).
2. Run this script, changing `USE_CACHE=True` to `USE_CACHE=False` in the next cell.

In this case, we cannot guarantee that the script will keep working or reproduce the same results in the future, since the content of the online database may change.
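
As a rough sketch of step 1, the credentials file could also be written programmatically. The column names `username` and `password` are taken from how the file is read later in this script; the helper names `write_credentials` and `read_credentials`, and the assumption that the file is a one-row CSV with exactly these two columns, are ours and not part of the pipeline.

```python
import csv
from pathlib import Path

def write_credentials(path, username, password):
    """Write a one-row CSV with 'username' and 'password' columns (assumed layout)."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["username", "password"])
        writer.writeheader()
        writer.writerow({"username": username, "password": password})
    return path

def read_credentials(path):
    """Return the first data row as a dict, similar to pd.read_csv(path).iloc[0]."""
    with Path(path).open(newline="") as handle:
        return next(csv.DictReader(handle))

# Example usage (placeholder values, replace with your own account details):
# write_credentials("./biocyc_username_and_password.csv", "you@example.org", "your-password")
```

Keeping the credentials in a separate, untracked file avoids hard-coding them in the notebook itself.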

In [1]:
#Change this to False if you wish to perform HTTP querying of EcoCyc
USE_CACHE=True

## Imports

In [2]:
import importlib
import cobra
import sys
import os
sys.path.append('../../utils')
import graph_utils
importlib.reload(graph_utils)
import numpy as np
import pandas as pd
import networkx as nx
from tqdm import tqdm
import pickle
import biocyc_query_utils
importlib.reload(biocyc_query_utils)
import json

## Preliminaries

### Load Model and Biocyc Session

In [3]:
#Load model
model=cobra.io.read_sbml_model('../../Model/iCH360/Escherichia_coli_iCH360.xml')

#establish Biocyc session. 
if USE_CACHE:
    biocyc_session=None
    #Uncomment the next two lines if you still wish to establish a biocyc session even if using the cache
    #biocyc_credential=pd.read_csv('./biocyc_username_and_password.csv').iloc[0]
    #biocyc_session=biocyc_query_utils.establish_biocyc_session(email=biocyc_credential['username'],password=biocyc_credential['password'])

else:
    biocyc_credential=pd.read_csv('./biocyc_username_and_password.csv').iloc[0]
    biocyc_session=biocyc_query_utils.establish_biocyc_session(email=biocyc_credential['username'],
                                                        password=biocyc_credential['password'])


Set parameter Username
Academic license - for non-commercial use only - expires 2025-03-12


### Prepare reactions to include in graph and bigg2biocyc mappings

In [4]:
reactions_to_parse=[r.id for r in model.reactions if r not in model.boundary]
print(f'{len(reactions_to_parse)} reactions to parse into the annotation graph')

bigg2biocyc_df=pd.read_csv('../../Annotation/BioCyc/bigg_biocyc_map.tsv',sep='\t',index_col=0)
print(bigg2biocyc_df.head())

bigg2biocyc_dict=bigg2biocyc_df.set_index('bigg_reaction_id').to_dict()['biocyc_reaction_id']


323 reactions to parse into the annotation graph
  bigg_reaction_id                   biocyc_reaction_id
0            NDPK5                    ECOLI:DGDPKIN-RXN
1           SHK3Dr  ECOLI:SHIKIMATE-5-DEHYDROGENASE-RXN
2            NDPK6                    ECOLI:DUDPKIN-RXN
3            NDPK8                    ECOLI:DADPKIN-RXN
4           DHORTS                ECOLI:DIHYDROOROT-RXN
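
For readers without pandas at hand, the same BiGG-to-BioCyc lookup can be sketched with the standard library. The two rows below are copied from the output above; the exact file layout (an unnamed index column followed by the two ID columns) is inferred from the `index_col=0` argument, and the helper name `load_reaction_map` is hypothetical.

```python
import csv
import io

# Two-row excerpt in the layout we assume for bigg_biocyc_map.tsv:
# an unnamed index column followed by the two ID columns.
example_tsv = (
    "\tbigg_reaction_id\tbiocyc_reaction_id\n"
    "0\tNDPK5\tECOLI:DGDPKIN-RXN\n"
    "1\tSHK3Dr\tECOLI:SHIKIMATE-5-DEHYDROGENASE-RXN\n"
)

def load_reaction_map(handle):
    """Return {bigg_reaction_id: biocyc_reaction_id} from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["bigg_reaction_id"]: row["biocyc_reaction_id"] for row in reader}

bigg2biocyc_example = load_reaction_map(io.StringIO(example_tsv))
print(bigg2biocyc_example["NDPK5"])  # ECOLI:DGDPKIN-RXN
```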


### Biocyc Objects CACHE specification (for reproducibility and/or faster parsing)

In [5]:
cache_file='cache/biocyc_objects_cache.pkl'

if os.path.exists(cache_file):
    print(f'Loading cache from {cache_file}')
    with open(cache_file,'rb') as f:
        biocyc_objects=pickle.load(f)
else:
    biocyc_objects=None

Loading cache from cache/biocyc_objects_cache.pkl
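
The load-or-recompute pattern used throughout this script can be summarised in a small helper. This is only a sketch: `load_or_compute` is a hypothetical name, and unlike the cells in this notebook it silently falls back to recomputing when the cache file is missing.

```python
import os
import pickle

def load_or_compute(cache_path, compute, use_cache=True):
    """Return the pickled object at cache_path if allowed and present;
    otherwise call compute(), pickle its result to cache_path, and return it."""
    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        # Create the cache directory on first use
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.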


## Parse catalytic and protein composition relationships from Biocyc 
Note: manual curation (including additional nodes, edges, functional annotations, and other changes) is added later in the script.

In [6]:
if USE_CACHE:
    with open('tmp/graph_dict_from_ecocyc.pkl','rb') as f:
        graph_dict_ecocyc=pickle.load(f)
else:
    graph_dict_ecocyc=graph_utils.build_graph_tables(session=biocyc_session,
                                                bigg_rxns=reactions_to_parse,
                                                bigg2biocyc_map=bigg2biocyc_dict,
                                                db='ECOLI',
                                                out_path='./tmp/',
                                                cache=biocyc_objects)
    with open('tmp/graph_dict_from_ecocyc.pkl','wb') as f:
        pickle.dump(graph_dict_ecocyc,f)

## Add small molecule regulation
Regulatory information from EcoCyc was parsed separately.

In [7]:
if USE_CACHE:
    with open('cache/small_molecule_regulation_biocyc_cache.pkl','rb') as file:
        regulatory_data_from_ecocyc=pickle.load(file)
else:
    biocyc_rxn_ids=[node['biocyc_id'] for node in graph_dict_ecocyc['nodes'] if node['type']=='reaction']
    regulatory_data_from_ecocyc=biocyc_query_utils.parse_regulation_info(biocyc_rxn_ids=biocyc_rxn_ids,
                                        biocyc_session=biocyc_session,
                                        cache=graph_dict_ecocyc['objects'])
    with open('cache/small_molecule_regulation_biocyc_cache.pkl','wb') as file:
        pickle.dump(regulatory_data_from_ecocyc,file)

In [8]:
bigg_ecocyc_metabolite_map=pd.read_csv('../../Annotation/BioCyc/metabolites_bigg_biocyc_map.csv',index_col=0).drop_duplicates().set_index('bigg.metabolite')
#Map EcoCyc compound IDs (without the 'META:' prefix) to BiGG metabolite IDs
bigg_ecocyc_metabolite_map_dict={}
for bigg_id in bigg_ecocyc_metabolite_map.index:
    if not pd.isna(bigg_ecocyc_metabolite_map.loc[bigg_id,'biocyc_id']):
        bigg_ecocyc_metabolite_map_dict[bigg_ecocyc_metabolite_map.loc[bigg_id,'biocyc_id'].replace('META:','') ]=bigg_id
added_regulator_ids=[]
for entry_id,entry in tqdm(regulatory_data_from_ecocyc.items()):
    biocyc_id=entry_id
    regulation=entry.find('./Regulation')
    regulators=regulation.findall('./regulator/')
    for regulator in regulators:
        biocyc_regulator_type=regulator.tag
        if biocyc_regulator_type=='Compound':
            regulator_type='compound'
            regulator_subtype='NA'
        elif biocyc_regulator_type=='Protein':
            regulator_type='protein'
            if regulator.attrib['frameid'] in biocyc_objects:
                regulator_obj=biocyc_objects[regulator.attrib['frameid']]
            else:
                print(f"Regulator {regulator.attrib['frameid']} not found in Biocyc object cache. Querying Biocyc from API")
                regulator_obj=biocyc_query_utils.get_biocyc_object(biocyc_session,regulator.attrib['frameid'])
                biocyc_objects[regulator.attrib['frameid']]=regulator_obj
            if len(regulator_obj.findall('.//Protein/component'))>0:
                regulator_subtype='multimeric_protein'
            elif len(regulator_obj.findall('.//Protein/unmodified-form/Protein'))==1:
                regulator_subtype='modified_protein'
            else:
                regulator_subtype='polypeptide'
        else:
            regulator_type=biocyc_regulator_type
            #No subtype is defined for other regulator classes
            regulator_subtype='NA'
        regulator_id=regulator.attrib['frameid']
        regulator_biocyc_id=f"{regulator.attrib['orgid']}:{regulator.attrib['frameid']}"
        enzyme=regulation.find('./regulated-entity/Enzymatic-Reaction/enzyme/Protein').attrib['frameid']
        regulated_reaction=regulation.find('./regulated-entity/Enzymatic-Reaction/reaction/Reaction').attrib['frameid']
        if regulation.find('./mechanism') is not None:
            mechanism=regulation.find('./mechanism').text
        else:
            mechanism='NA'
        regulation_mode=regulation.find('./mode').text

        if regulator_id in bigg_ecocyc_metabolite_map_dict.keys():
            regulator_bigg_id=bigg_ecocyc_metabolite_map_dict[regulator_id]
        else:
           regulator_bigg_id='NA'
        #Ki info
        if regulation.findall('./ki'):
            KIs=[float(x.text) for x in regulation.findall('./ki')]
            ki_units=[x.attrib['units'] for x in regulation.findall('./ki')]
            if len(set(ki_units))>1:
                #Values reported in different units cannot be pooled: leave the KI unassigned
                print(f'WARNING: different units found for {entry_id}. Skipping KI assignment')
                ki='NA'
                ki_unit='NA'
            else:
                ki=np.median(KIs)
                ki_unit=ki_units[0]
        else:
            ki='NA'
            ki_unit='NA'


        #Add a node for the regulator, unless this already exists
        if regulator_type=='compound':
            regulator_node_id=regulator_id
            regulator_node={"id":regulator_id,
                            "bigg_id":regulator_bigg_id,
                            "type":regulator_type,
                            "subtype":regulator_subtype,
                            "biocyc_id":regulator_id,
                            }
        else:
            regulator_node_id=regulator_id#+'_regulator'
            regulator_node={"id":regulator_node_id,
                            "type":regulator_type,
                            "subtype":regulator_subtype,
                            "biocyc_id":regulator_id,
                            }

        regulation_edge={'source':regulator_node_id,
                        'target': enzyme,
                         'type': 'regulation',
                         'regulated_reaction':regulated_reaction,
                        'mechanism':mechanism,
                        'regulation_mode':regulation_mode,
                        'weight': ki,
                        "weight_unit":ki_unit,
                        'subtype': 'NA',
                        'notes': '',
                        'references': ''}
        if regulator_node_id not in [node['id'] for node in graph_dict_ecocyc['nodes'] ]:
            graph_dict_ecocyc['nodes'].append(regulator_node)
            added_regulator_ids.append(regulator_id)
        graph_dict_ecocyc['edges'].append(regulation_edge)
with open('tmp/graph_dict_from_ecocyc_w_regulation.pkl','wb') as f:
    pickle.dump(graph_dict_ecocyc,f)
with open(cache_file,'wb') as file:
    pickle.dump(biocyc_objects,file)


100%|██████████| 803/803 [00:00<00:00, 11579.86it/s]
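
To make the parsing logic above easier to follow, here is a minimal, self-contained sketch that runs on a made-up XML fragment. The element names mirror the paths queried in the cell above, but the fragment itself, the frame ID `EXAMPLE-CPD`, the KI values and the helper name `summarise_regulation` are all hypothetical; real EcoCyc responses contain many more fields.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical fragment: structure follows the paths used above, values are invented.
EXAMPLE_XML = """
<ptools-xml>
  <Regulation>
    <mode>-</mode>
    <mechanism>Allosteric</mechanism>
    <ki units="uM">10.0</ki>
    <ki units="uM">30.0</ki>
    <ki units="uM">50.0</ki>
    <regulator><Compound orgid="ECOLI" frameid="EXAMPLE-CPD"/></regulator>
  </Regulation>
</ptools-xml>
"""

def summarise_regulation(root):
    """Extract regulators, mode, mechanism and a median KI from one entry."""
    regulation = root.find("./Regulation")
    ki_elements = regulation.findall("./ki")
    units = {el.attrib["units"] for el in ki_elements}
    if ki_elements and len(units) == 1:
        # All values share one unit, so they can be pooled into a median
        ki = statistics.median(float(el.text) for el in ki_elements)
        ki_unit = units.pop()
    else:
        ki, ki_unit = "NA", "NA"
    mechanism_el = regulation.find("./mechanism")
    return {
        "regulators": [(el.tag, el.attrib["frameid"])
                       for el in regulation.findall("./regulator/*")],
        "mode": regulation.find("./mode").text,
        "mechanism": mechanism_el.text if mechanism_el is not None else "NA",
        "ki": ki,
        "ki_unit": ki_unit,
    }

summary = summarise_regulation(ET.fromstring(EXAMPLE_XML.strip()))
print(summary["regulators"], summary["ki"], summary["ki_unit"])
```

The tag of each child of `regulator` (here `Compound`) is what the pipeline uses to decide the regulator type.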


## Checkpoint 1

In [9]:
with open('tmp/graph_dict_from_ecocyc_w_regulation.pkl','rb') as f:
    graph_dict_ecocyc=pickle.load(f)

## Manual Curation

In [11]:
if USE_CACHE:
    with open('tmp/graph_dict_w_manual_curation.pkl','rb') as f:
        graph_dict_curated=pickle.load(f)
else:
    graph_dict_curated=graph_dict_ecocyc.copy()

### Spontaneous reactions


In [12]:
if not USE_CACHE:
    spontaneous_rxns=[r.id for r in model.reactions if 's0001' in r.gene_reaction_rule]
    print(f'adding edges of type spontaneous forreactions {spontaneous_rxns}')

    spontaneous_node={'id':'spontaneous_pseudogene',
                    'type':'spontaneous',
                    "subtype":"NA",
                    'biocyc_id':None}
    for rxn in spontaneous_rxns:
        parent_node=[node for node in graph_dict_curated['nodes'] if node['id']==f'bigg:{rxn}'][0]
        graph_dict_curated=graph_utils.add_edge(graph_dict_curated,
                                                parent_node=parent_node,
                                                child_node=spontaneous_node,
                                                 weight='NA',
                                                type='spontaneous_reaction',
                                                parse_from_biocyc=False
                                                )
                                                
                        

adding edges of type spontaneous forreactions ['G5SADs', 'H2Otex', 'O2tpp', 'CO2tpp', 'ACALDtpp', 'NH4tpp', 'GLYCtpp', 'H2Otpp', 'OMCDC', 'ETOHtrpp']
Adding new edge between bigg:G5SADs and spontaneous_pseudogene
Adding child node {'id': 'spontaneous_pseudogene', 'type': 'spontaneous', 'subtype': 'NA', 'biocyc_id': None} to the graph dict
Adding new edge between bigg:H2Otex and spontaneous_pseudogene
Adding new edge between bigg:O2tpp and spontaneous_pseudogene
Adding new edge between bigg:CO2tpp and spontaneous_pseudogene
Adding new edge between bigg:ACALDtpp and spontaneous_pseudogene
Adding new edge between bigg:NH4tpp and spontaneous_pseudogene
Adding new edge between bigg:GLYCtpp and spontaneous_pseudogene
Adding new edge between bigg:H2Otpp and spontaneous_pseudogene
Adding new edge between bigg:OMCDC and spontaneous_pseudogene
Adding new edge between bigg:ETOHtrpp and spontaneous_pseudogene


### Reactions involving Proteins as metabolites

In [13]:
if not USE_CACHE:
    # 1. Acyl carrier protein (ACP)
    reactions_involving_acp=[r.id for r in model.reactions if 'ACP_c' in[m.id for m in r.metabolites.keys()]]
    reactions_involving_acp.remove('Biomass')
    print(f'adding edges of type protein_metabolite for reactions {reactions_involving_acp}')

    acp_node= {'id':'ACP-MONOMER',
                'type':'protein',
                "subtype":'polypeptide',
                'biocyc_id':'ACP-MONOMER'
                }
    metadata={'notes':'Free Acyl Carrying Protein (ACP) is involved in this reaction and is thus required for the reaction to operate.'}
    for acp_rxn in reactions_involving_acp:
        parent_node=[node for node in graph_dict_curated['nodes'] if node['id']==f'bigg:{acp_rxn}'][0]
        child_node=acp_node
        graph_dict_curated=graph_utils.add_edge(
                            graph_dict=graph_dict_curated,
                            parent_node=parent_node,
                            child_node=child_node,
                            weight='NA',
                            type='non_catalytic_requirement',
                            metadata=metadata,
                            session=biocyc_session,
                            parse_from_biocyc=True                         
                            )

adding edges of type protein_metabolite for reactions ['KAS14', 'MCOATA', 'ACOATA', '3OAS161', '3OAS141', '3OAS181', '3OAS140', '3OAS60', '3OAS160', '3OAS121', '3OAS80', '3OAS100', '3OAS120', '3OAS180']
Adding new edge between bigg:KAS14 and ACP-MONOMER
Parsing components of ACP-MONOMER from biocyc as this node was not found in the graph
Adding new edge between bigg:MCOATA and ACP-MONOMER
Adding new edge between bigg:ACOATA and ACP-MONOMER
Adding new edge between bigg:3OAS161 and ACP-MONOMER
Adding new edge between bigg:3OAS141 and ACP-MONOMER
Adding new edge between bigg:3OAS181 and ACP-MONOMER
Adding new edge between bigg:3OAS140 and ACP-MONOMER
Adding new edge between bigg:3OAS60 and ACP-MONOMER
Adding new edge between bigg:3OAS160 and ACP-MONOMER
Adding new edge between bigg:3OAS121 and ACP-MONOMER
Adding new edge between bigg:3OAS80 and ACP-MONOMER
Adding new edge between bigg:3OAS100 and ACP-MONOMER
Adding new edge between bigg:3OAS120 and ACP-MONOMER
Adding new edge between bigg

In [14]:
if not USE_CACHE:
    #2. protein redox cofactors
    #Thioredoxin ================================
    reactions_involving_trd=[r.id for r in model.reactions if 'trdrd_c' in[m.id for m in r.metabolites.keys()]]
    print(f'adding edges of type protein_metabolite for reactions {reactions_involving_trd}')

    trdr_nodes=[{'id':id,'type':'protein','subtype':'polypeptide','biocyc_id':id} for id in ['RED-THIOREDOXIN-MONOMER','RED-THIOREDOXIN2-MONOMER']]
    trdr1_OR_trdr2_node= {'id':'THIOREDOXINS','type':'logical_OR','subtype':'NA','biocyc_id':None}
    metadata={'notes':'Thioredoxin is involved in this reaction as a redox cofactor'}
    for trdr_rxn in reactions_involving_trd:
        parent_node=[node for node in graph_dict_curated['nodes'] if node['id']==f'bigg:{trdr_rxn}'][0]
        graph_dict_curated=graph_utils.add_edge(
                            graph_dict=graph_dict_curated,
                            parent_node=parent_node,
                            child_node=trdr1_OR_trdr2_node,
                             weight='NA',
                            type='non_catalytic_requirement',
                            metadata=metadata,
                            parse_from_biocyc=False                         
                            )
        for trdr_node in trdr_nodes:
            graph_dict_curated=graph_utils.add_edge(
                            graph_dict=graph_dict_curated,
                            parent_node=trdr1_OR_trdr2_node,
                            child_node=trdr_node,
                            weight='NA',
                            type='logical',
                            metadata=metadata,
                            parse_from_biocyc=True,
                            session=biocyc_session                        
                            )


adding edges of type protein_metabolite for reactions ['RNDR1', 'RNDR3', 'RNDR4', 'TRDR', 'PAPSR', 'RNDR2']
Adding new edge between bigg:RNDR1 and THIOREDOXINS
Adding child node {'id': 'THIOREDOXINS', 'type': 'logical_OR', 'subtype': 'NA', 'biocyc_id': None} to the graph dict
Adding new edge between THIOREDOXINS and RED-THIOREDOXIN-MONOMER
Parsing components of RED-THIOREDOXIN-MONOMER from biocyc as this node was not found in the graph
Adding new edge between THIOREDOXINS and RED-THIOREDOXIN2-MONOMER
Parsing components of RED-THIOREDOXIN2-MONOMER from biocyc as this node was not found in the graph
Adding new edge between bigg:RNDR3 and THIOREDOXINS
Edge between THIOREDOXINS and RED-THIOREDOXIN-MONOMER already exists. Skipping
Edge between THIOREDOXINS and RED-THIOREDOXIN2-MONOMER already exists. Skipping
Adding new edge between bigg:RNDR4 and THIOREDOXINS
Edge between THIOREDOXINS and RED-THIOREDOXIN-MONOMER already exists. Skipping
Edge between THIOREDOXINS and RED-THIOREDOXIN2-MONOME

In [15]:
if not USE_CACHE:
    #glutaredoxin ================================
    reactions_involving_grx=[r.id for r in model.reactions if 'grxrd_c' in[m.id for m in r.metabolites.keys()]]
    print(f'adding edges of type protein_metabolite for reactions {reactions_involving_grx}')


    grx_nodes=[{'id':id,'type':'protein','subtype':'polypeptide','biocyc_id':id} for id in ['RED-GLUTAREDOXIN','GRXB-MONOMER','GRXC-MONOMER']]
    grx_OR_node= {'id':'GLUTAREDOXINS','type':'logical_OR','subtype':'NA','biocyc_id':None}
    metadata={'notes':'Glutaredoxin is involved in this reaction as a redox cofactor'}
    for grx_rxn in reactions_involving_grx:
        parent_node=[node for node in graph_dict_curated['nodes'] if node['id']==f'bigg:{grx_rxn}'][0]
        graph_dict_curated=graph_utils.add_edge(
                            graph_dict=graph_dict_curated,
                            parent_node=parent_node,
                            child_node=grx_OR_node,
                            weight='NA',
                            type='non_catalytic_requirement',
                            metadata=metadata,
                            parse_from_biocyc=False                         
                            )
        for grx_node in grx_nodes:
            graph_dict_curated=graph_utils.add_edge(
                                graph_dict=graph_dict_curated,
                                parent_node=grx_OR_node,
                                child_node=grx_node,
                                 weight='NA',
                                type='logical',
                                metadata=metadata,
                                session=biocyc_session,
                                parse_from_biocyc=True                         
                                )

adding edges of type protein_metabolite for reactions ['GRXR', 'RNDR1b', 'RNDR2b']
Adding new edge between bigg:GRXR and GLUTAREDOXINS
Adding child node {'id': 'GLUTAREDOXINS', 'type': 'logical_OR', 'subtype': 'NA', 'biocyc_id': None} to the graph dict
Adding new edge between GLUTAREDOXINS and RED-GLUTAREDOXIN
Adding new edge between GLUTAREDOXINS and GRXB-MONOMER
Adding new edge between GLUTAREDOXINS and GRXC-MONOMER
Adding new edge between bigg:RNDR1b and GLUTAREDOXINS
Edge between GLUTAREDOXINS and RED-GLUTAREDOXIN already exists. Skipping
Edge between GLUTAREDOXINS and GRXB-MONOMER already exists. Skipping
Edge between GLUTAREDOXINS and GRXC-MONOMER already exists. Skipping
Adding new edge between bigg:RNDR2b and GLUTAREDOXINS
Edge between GLUTAREDOXINS and RED-GLUTAREDOXIN already exists. Skipping
Edge between GLUTAREDOXINS and GRXB-MONOMER already exists. Skipping
Edge between GLUTAREDOXINS and GRXC-MONOMER already exists. Skipping
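
The pattern used in the last two cells (a reaction pointing to a `logical_OR` node, which in turn points to each interchangeable protein) can be illustrated on a plain dictionary. The helper `add_edge_once` below is a simplified stand-in written for this illustration; it is not `graph_utils.add_edge` and ignores weights, metadata and BioCyc parsing.

```python
def add_edge_once(graph_dict, parent, child, edge_type, overwrite=False):
    """Add parent/child nodes if new, then a parent->child edge.
    An existing edge is skipped unless overwrite=True."""
    node_ids = {node["id"] for node in graph_dict["nodes"]}
    for node in (parent, child):
        if node["id"] not in node_ids:
            graph_dict["nodes"].append(node)
            node_ids.add(node["id"])
    new_edge = {"source": parent["id"], "target": child["id"], "type": edge_type}
    for i, edge in enumerate(graph_dict["edges"]):
        if (edge["source"], edge["target"]) == (new_edge["source"], new_edge["target"]):
            if overwrite:
                graph_dict["edges"][i] = new_edge
            return graph_dict
    graph_dict["edges"].append(new_edge)
    return graph_dict

graph_example = {"nodes": [], "edges": []}
or_node = {"id": "GLUTAREDOXINS", "type": "logical_OR"}
isozymes = [{"id": pid, "type": "protein"}
            for pid in ("RED-GLUTAREDOXIN", "GRXB-MONOMER", "GRXC-MONOMER")]
for rxn_id in ("bigg:GRXR", "bigg:RNDR1b", "bigg:RNDR2b"):
    add_edge_once(graph_example, {"id": rxn_id, "type": "reaction"},
                  or_node, "non_catalytic_requirement")
    for protein in isozymes:
        # Only the first reaction creates these edges; later ones are skipped
        add_edge_once(graph_example, or_node, protein, "logical")

print(len(graph_example["nodes"]), len(graph_example["edges"]))  # 7 6
```

Sharing one OR node between reactions means the alternatives are stored once, which matches the "already exists. Skipping" messages in the output above.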


### Manually curated edges

Here, we add a number of manually curated edges to the graph. These may be:
- Edges that are not found in EcoCyc
- Changes to attributes of edges already found in EcoCyc

In [16]:
edges_manual_curation=pd.read_csv('manual_curation/edge_functional_curation.csv')
edges_manual_curation.head(2)

Unnamed: 0,bigg_rxn_id,biocyc_rxn_id,parent_node,parent_node_type,parent_node_subtype,parent_node_biocyc_id,child_node,child_node_type,child_node_subtype,child_node_biocyc_id,edge_weight,edge_source,edge_type,edge_subtype,putative,edge_notes,edge_ref,putative_removal,putative_removal_notes
0,SHK3Dr,SHIKIMATE-5-DEHYDROGENASE-RXN,bigg:SHK3Dr,reaction,,SHIKIMATE-5-DEHYDROGENASE-RXN,EG11234-MONOMER,protein,polypeptide,EG11234-MONOMER,,iml1515,catalysis,secondary,False,shikimate dehydrogenase with much lower activi...,PMID:12637497,,
1,NDPK6,DUDPKIN-RXN,bigg:NDPK6,reaction,,DUDPKIN-RXN,ADENYL-KIN-MONOMER,protein,polypeptide,ADENYL-KIN-MONOMER,,iml1515,catalysis,secondary,False,adenylate kinase main catalytic function is bi...,PMID:8650159;PMID:15941717,,


In [17]:
if not USE_CACHE:
    for i,row in edges_manual_curation.iterrows():
        parent_node={'id':row['parent_node'],
                    'type':row['parent_node_type'],
                    'subtype':row['parent_node_subtype'],
                    'biocyc_id':f"ECOLI:{row['parent_node_biocyc_id']}"}
        
        child_node={'id':row['child_node'],
                    'type':row['child_node_type'],
                    'subtype':row['child_node_subtype'],
                    'biocyc_id':f"ECOLI:{row['child_node_biocyc_id']}"}
        weight=row['edge_weight'] if row['edge_weight']=='NA' else float(row['edge_weight'])
        type=row['edge_type']
        notes=row['edge_notes']
        references=row['edge_ref']
        metadata={'subtype':row['edge_subtype'],
                'notes':notes,
                'references':references}
        graph_dict_curated=graph_utils.add_edge(
                            graph_dict=graph_dict_curated,
                            parent_node=parent_node,
                            child_node=child_node,
                            weight=weight,
                            type=type,
                            metadata=metadata,
                            session=biocyc_session,
                            parse_from_biocyc=True,
                            overwrite=True                         
                            )

Adding new edge between bigg:SHK3Dr and EG11234-MONOMER
Parsing components of EG11234-MONOMER from biocyc as this node was not found in the graph
Adding new edge between bigg:NDPK6 and ADENYL-KIN-MONOMER
Adding new edge between bigg:NDPK8 and ADENYL-KIN-MONOMER
Overwriting edge between bigg:NDPK1 and ADENYL-KIN-MONOMER
Overwriting edge between bigg:NDPK2 and ADENYL-KIN-MONOMER
Overwriting edge between bigg:NDPK4 and ADENYL-KIN-MONOMER
Overwriting edge between bigg:NDPK5 and ADENYL-KIN-MONOMER
Overwriting edge between bigg:NDPK7 and ADENYL-KIN-MONOMER
Adding new edge between bigg:XYLK and RIBULOKIN-MONOMER
Parsing components of RIBULOKIN-MONOMER from biocyc as this node was not found in the graph
Adding new edge between bigg:ASPTA and TYRB-DIMER
Overwriting edge between bigg:PHETA1 and BRANCHED-CHAINAMINOTRANSFER-CPLX
Overwriting edge between bigg:PHETA1 and ASPAMINOTRANS-DIMER
Overwriting edge between bigg:TYRTA and ASPAMINOTRANS-DIMER
Adding new edge between bigg:TYRTA and BRANCHED-CH

In [18]:
with open('tmp/graph_dict_w_manual_curation.pkl','wb') as f:
    pickle.dump(graph_dict_curated,f)
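
One detail worth keeping in mind when reading the curation table: by default, `pandas.read_csv` parses empty fields and the string `NA` as missing values (NaN), so an `edge_weight` cell may arrive as a float NaN rather than as the string `'NA'`. A possible way to normalise such values is sketched below; `parse_edge_weight` is a hypothetical helper and not part of `graph_utils`.

```python
import math

def parse_edge_weight(value):
    """Return 'NA' for missing weights, otherwise the weight as a float."""
    if value is None:
        return "NA"
    if isinstance(value, str):
        if value.strip() in ("", "NA", "NaN", "nan"):
            return "NA"
        return float(value)
    value = float(value)
    return "NA" if math.isnan(value) else value

print(parse_edge_weight(float("nan")), parse_edge_weight("2"), parse_edge_weight(0.5))
```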

## Checkpoint 2

In [19]:
with open('tmp/graph_dict_w_manual_curation.pkl','rb') as f:
    graph_dict_curated=pickle.load(f)

## Connect polypeptides to their coding gene, and add any other relevant annotations to polypeptide nodes

In [21]:
if USE_CACHE:
    with open("cache/pp_data_cache.pkl",'rb') as file:
        pp_data_cache=pickle.load(file)
else:
    pp_data_cache={}


processed_pp_nodes=[]
added_gene_nodes=[]
pp_nodes=[node for node in graph_dict_curated['nodes'] if (node['type']=='protein' and node['subtype']=='polypeptide')]

for node in tqdm(pp_nodes):
    pp_id=node['id']
    if pp_id not in processed_pp_nodes:
        if pp_id in pp_data_cache.keys():
            pp_data=pp_data_cache[pp_id]
        else:
            if node['biocyc_id'] in biocyc_objects.keys():
                node_biocyc_object=biocyc_objects[node['biocyc_id']]
            else:
                print(f"Polypeptide {node['biocyc_id']} not found in cache. Attempting to query Biocyc for it. ")
                node_biocyc_object=biocyc_query_utils.get_biocyc_object(session=biocyc_session,object_id=pp_id)
                graph_dict_curated['objects'][node['biocyc_id']]=node_biocyc_object
            if node_biocyc_object is None:
                print(f"Unable to retrieve biocyc object for {pp_id}")
                continue

            pp_data=graph_utils.pp_gene_and_annotation(session=biocyc_session,
                                                        biocyc_pp=node_biocyc_object
                                                        )
            pp_data_cache[pp_id]=pp_data
            
        #First, add the available annotation to the polypeptide
        # "-" is not supported by the GML graph format, so we replace it with underscores in the annotation keys
        for key in list(pp_data['polypeptide_annotation'].keys()):
            if '-' in key:
                pp_data['polypeptide_annotation'][key.replace('-','_')]=pp_data['polypeptide_annotation'].pop(key)
        node['annotation']=pp_data['polypeptide_annotation']
        #Create a new node for the gene associated with this polypeptide. By default, we use b-numbers (bnum) as node IDs
        gene_data=pp_data['gene']
        if gene_data is not None:
            gene_node_id=gene_data['bnum'] if 'bnum' in gene_data.keys() else gene_data['id']
            if gene_node_id not in added_gene_nodes:
                graph_dict_curated['nodes'].append({"id":gene_node_id,
                                                "type":"gene",
                                                "subtype":"NA",
                                                "annotation":{k:v for k,v in gene_data.items() if k!='id'},
                                                "biocyc_id":f"ECOLI:{gene_data['id']}",
                                                })
                
                graph_dict_curated['edges'].append( {'source':pp_id,
                                                    'target': gene_node_id,
                                                    'weight': 'NA',
                                                    'type': 'coding_relation',
                                                    'subtype': 'NA',
                                                    'notes': '',
                                                    'references': ''})
                added_gene_nodes.append(gene_node_id)
        else:
            print(f"Warning: Unable to map {pp_id} to a gene. This is likely because the polypeptide was parsed as a regulator and lacks gene annotation on EcoCyc")
        processed_pp_nodes.append(pp_id)
with open("cache/pp_data_cache.pkl",'wb') as file:
    pickle.dump(pp_data_cache,file)
        
with open('tmp/graph_dict_curated_w_genes.pkl','wb') as f:
    pickle.dump(graph_dict_curated,f)

100%|██████████| 627/627 [00:00<00:00, 200842.26it/s]
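
The key renaming step in the cell above can be isolated as follows. This sketch returns a new dictionary instead of modifying the cached one in place; the helper name `sanitise_keys` and the example keys are hypothetical.

```python
def sanitise_keys(annotation):
    """Return a copy of annotation with '-' replaced by '_' in every key."""
    return {key.replace("-", "_"): value for key, value in annotation.items()}

# Invented annotation entry, for illustration only
example_annotation = {"common-name": "example protein", "uniprot": "P00000"}
print(sanitise_keys(example_annotation))
```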




In [22]:
with open('tmp/graph_dict_curated_w_genes.pkl','rb') as f:
    graph_dict_curated=pickle.load(f)

In [23]:
if biocyc_objects is not None:
    for object_id,xml_object in graph_dict_curated['objects'].items():
        if object_id not in biocyc_objects.keys():
            print(f"Adding {object_id} to the cache of biocyc XML objects for future use")
            biocyc_objects[object_id]=xml_object
    with open('cache/biocyc_objects_cache.pkl','wb') as file:
        pickle.dump(biocyc_objects,file)
        
        

## Build Graph in NetworkX

First, create pandas dataframes for nodes and edges

In [24]:
#Make sure all nodes and edges have a subtype field, and use NA if the field is not applicable
for node in graph_dict_curated['nodes']:
    if "subtype" not in node.keys():
        node['subtype']='NA'
for edge in graph_dict_curated['edges']:
    if "subtype" not in edge.keys():
        edge['subtype']='NA'

In [25]:
nodes_df=pd.DataFrame([pd.Series(node) for node in graph_dict_curated['nodes']]).drop_duplicates(subset=['id'])
edges_df=pd.DataFrame([pd.Series(edge) for edge in graph_dict_curated['edges']]).drop_duplicates(subset=['source','target'])
nodes_df.to_csv('tmp/graph_nodes_curated_df.csv')
edges_df.to_csv('tmp/graph_edges_curated_df.csv')


Then, build the graph in NetworkX from the curated node and edge lists

In [26]:
graph=nx.DiGraph()
nodes_list=[(node['id'],node) for node in graph_dict_curated['nodes']]
edges_list=[(edge['source'],edge['target'],{k:v for k,v in edge.items() if k not in ['source','target']}) for edge in graph_dict_curated['edges']]
graph.add_nodes_from(nodes_list)
graph.add_edges_from(edges_list)
print(f'parsed graph in networkX. Graph has {len(graph.nodes)} nodes and {len(graph.edges)} edges')

parsed graph in networkX. Graph has 1665 nodes and 2002 edges
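
When the edge list contains repeated (source, target) pairs, the exported tables and the NetworkX graph treat them differently: `DataFrame.drop_duplicates` keeps the first occurrence by default, whereas adding an already existing edge to a `DiGraph` updates its attributes with the later values. The pure-Python sketch below mimics both behaviours on invented node names; `keep_first` and `keep_last` are hypothetical helpers.

```python
edges = [
    {"source": "A", "target": "B", "type": "catalysis", "subtype": "primary"},
    {"source": "A", "target": "B", "type": "catalysis", "subtype": "secondary"},
    {"source": "B", "target": "C", "type": "coding_relation", "subtype": "NA"},
]

def keep_first(edges):
    """Mimic drop_duplicates(subset=['source','target']): first occurrence wins."""
    seen = {}
    for edge in edges:
        seen.setdefault((edge["source"], edge["target"]), edge)
    return list(seen.values())

def keep_last(edges):
    """Mimic repeated DiGraph.add_edge calls: later attributes update earlier ones."""
    merged = {}
    for edge in edges:
        merged.setdefault((edge["source"], edge["target"]), {}).update(edge)
    return list(merged.values())

print(keep_first(edges)[0]["subtype"])  # primary
print(keep_last(edges)[0]["subtype"])   # secondary
```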


Some final manual curation 

In [27]:
to_remove=[]
# AlaB : This is a putative glutamate—pyruvate aminotransferase but lacks proper annotation on Ecocyc. We disregard it as an isozyme for ALANINE-AMINOTRANSFERASE-RXN
to_remove+=['MONOMER0-1241','G0-9281']

for node in to_remove:
    if node in graph.nodes:
        graph.remove_node(node)



#Change the ACP node (from parsing regulators) into the ACP-MONOMER
#On a DiGraph, graph.edges('ACP') only returns outgoing edges, so incoming edges are handled explicitly
if 'ACP' in graph.nodes:
    for _,target,data in list(graph.out_edges('ACP',data=True)):
        graph.add_edge('ACP-MONOMER',target,**data)
    for source,_,data in list(graph.in_edges('ACP',data=True)):
        graph.add_edge(source,'ACP-MONOMER',**data)
    #Removing the node also removes all edges attached to it
    graph.remove_node('ACP')

Save the graph to file

In [28]:
with open('tmp/graph_final.pkl','wb') as f:
    pickle.dump(graph,f)

## Checkpoint 3

In [29]:
#Reload serialised graph from file
with open('tmp/graph_final.pkl','rb') as f:
    graph=pickle.load(f)
    

# Computation of Graph Attributes
Here, we annotate the knowledge graph with the molecular weight of each protein node and compute the GPR of each reaction, which we then add to the stoichiometric model.

## Add Molecular Weight data to the graph

In [30]:
# First, retrieve all polypeptide MWs from EcoCyc. Once this has run, you can use the cached data for fast execution.
pp_mw_cache_file='cache/pp_mw_map.pkl'
pps=[node for node in graph.nodes() if graph.nodes[node]['subtype']=='polypeptide']
if USE_CACHE:
    with open(pp_mw_cache_file,'rb') as file:
        pp_mw_map=pickle.load(file)
else:
    pp_mw_map={pp:biocyc_query_utils.pp_mw(biocyc_session,pp_id=pp) for pp in tqdm(pps)}
    with open(pp_mw_cache_file,'wb') as file:
        pickle.dump(pp_mw_map,file)

In [31]:
#Next, compute the MW for each protein node in the graph
graph=graph_utils.compute_graph_mw(graph,pp_mw_map)
with open('tmp/graph_with_molecular_weights.pkl','wb') as file:
    pickle.dump(graph,file)
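
We do not reproduce `graph_utils.compute_graph_mw` here, but the underlying idea can be sketched under one assumption: that the weight of a complex is the sum of its components' weights, each multiplied by its copy number. The function `complex_mw`, the data layout and all IDs and values below are hypothetical.

```python
def complex_mw(node_id, components, polypeptide_mw):
    """Recursively sum component weights.

    components: {complex_id: [(component_id, copies), ...]}
    polypeptide_mw: {polypeptide_id: molecular weight}
    """
    if node_id in polypeptide_mw:
        return polypeptide_mw[node_id]
    return sum(copies * complex_mw(child, components, polypeptide_mw)
               for child, copies in components[node_id])

# Invented example: a homodimer nested inside a larger complex
polypeptide_mw = {"PP-A": 30.0, "PP-B": 10.0}
components = {
    "DIMER-A": [("PP-A", 2)],
    "COMPLEX-AB": [("DIMER-A", 1), ("PP-B", 4)],
}
print(complex_mw("COMPLEX-AB", components, polypeptide_mw))  # 100.0
```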

## Checkpoint 4

In [32]:
with open('tmp/graph_with_molecular_weights.pkl','rb') as file:
    graph=pickle.load(file)

In [33]:
pp_nodes=[node for node in graph.nodes() if graph.nodes[node]['subtype']=='polypeptide']
graph_genes=[]
for node in pp_nodes:
    gene_children_nodes=[child for child in graph.successors(node) if graph.nodes[child]['type']=='gene']
    graph_genes+=gene_children_nodes
len(set(graph_genes))

360

## Compute GPR based on the graph and add to the metabolic model

In [34]:
all_genes=[]
graph_gprs={}
for r in model.reactions:
    node_id=f'bigg:{r.id}'
    if node_id in graph.nodes:
        cur_gpr=graph_utils.compute_node_gpr(graph,node_id)
        graph_gprs[r.id]=cur_gpr
        all_genes+=graph_utils.genes_in_gpr(cur_gpr)
all_genes=set(all_genes)
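
To give an idea of what a graph-derived GPR looks like, here is a toy sketch. It is not the implementation of `graph_utils.compute_node_gpr` or `graph_utils.genes_in_gpr`, which are not shown here. It assumes the usual convention that subunits of a complex are joined with `and` and alternative catalysts with `or`; the functions `gpr_of`, `reaction_gpr` and `genes_in`, as well as all identifiers, are placeholders.

```python
import re

def gpr_of(node, children, gene_of):
    """Build a GPR string for an enzyme node: subunits of a complex are joined with 'and'."""
    if node in gene_of:  # polypeptide: return its gene
        return gene_of[node]
    parts = [gpr_of(child, children, gene_of) for child in children[node]]
    return parts[0] if len(parts) == 1 else '(' + ' and '.join(parts) + ')'

def reaction_gpr(catalysts, children, gene_of):
    """Alternative catalysts (isozymes) of a reaction are joined with 'or'."""
    return ' or '.join(gpr_of(c, children, gene_of) for c in catalysts)

def genes_in(gpr):
    """Extract gene identifiers from a GPR string (every token except and/or and brackets)."""
    return [tok for tok in re.findall(r'[^\s()]+', gpr) if tok not in ('and', 'or')]

# Placeholder data: one two-subunit complex and one monomeric isozyme.
gene_of = {'PP-A': 'b0001', 'PP-B': 'b0002', 'PP-C': 'b0003'}
children = {'CPLX-AB': ['PP-A', 'PP-B']}

gpr = reaction_gpr(['CPLX-AB', 'PP-C'], children, gene_of)
print(gpr)            # (b0001 and b0002) or b0003
print(genes_in(gpr))  # ['b0001', 'b0002', 'b0003']
```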

In [35]:
#Assign the graph-derived GPRs to the model, printing any reaction whose GPR references genes missing from the model
for r in model.reactions:
    if r.id in graph_gprs:
        cur_gpr=graph_gprs[r.id]
        genes_in_gpr=graph_utils.genes_in_gpr(cur_gpr)
        if np.any([g not in model.genes for g in genes_in_gpr]):
            print(r.id,cur_gpr)
        r.gene_reaction_rule=cur_gpr


EAR161y 
EAR141y 
EAR181y 
EAR121y 
Biomass 


In [36]:
#remove genes that are not associated with any GPR
genes_to_remove=[g.id for g in model.genes if g.id not in all_genes]
cobra.manipulation.remove_genes(model,genes_to_remove)

In [37]:
#save to file
cobra.io.write_sbml_model(model,'../../Model/iCH360/Escherichia_coli_iCH360.xml')

#Make sure you can open the model with the new gpr
model=cobra.io.read_sbml_model('../../Model/iCH360/Escherichia_coli_iCH360.xml')

#Final output
cobra.io.write_sbml_model(model,'../../Model/iCH360/Escherichia_coli_iCH360.xml')
cobra.io.save_json_model(model,'../../Model/iCH360/Escherichia_coli_iCH360.json')

### Create a model only including primary catalysis GPRs


In [38]:
model_primary_catalysis_gprs=model.copy()
graph_gprs_primary_catalysis={}
all_genes_primary_catalysis=[]
for r in model_primary_catalysis_gprs.reactions:
    node_id=f'bigg:{r.id}'
    if node_id in graph.nodes:
        cur_gpr=graph_utils.compute_node_gpr(graph,
                                             node_id,
                                             catalysis_subtypes_to_include=['primary'],
                                             include_spontaneous=True,
                                             spontaneous_gpr='s0001'
                                             )
        graph_gprs_primary_catalysis[r.id]=cur_gpr
        all_genes_primary_catalysis+=graph_utils.genes_in_gpr(cur_gpr)
all_genes_primary_catalysis=set(all_genes_primary_catalysis)       
for r in model_primary_catalysis_gprs.reactions:
    if r.id in graph_gprs_primary_catalysis.keys():
        r.gene_reaction_rule=graph_gprs_primary_catalysis[r.id]
#remove orphan genes
genes_to_remove=[g.id for g in model_primary_catalysis_gprs.genes if g.id not in all_genes_primary_catalysis]
cobra.manipulation.remove_genes(model_primary_catalysis_gprs,genes_to_remove)
cobra.io.write_sbml_model(model_primary_catalysis_gprs,'../../Model/iCH360/Escherichia_coli_iCH360_primary_catalysis_only.xml')

model_primary_catalysis_gprs=cobra.io.read_sbml_model('../../Model/iCH360/Escherichia_coli_iCH360_primary_catalysis_only.xml')
cobra.io.write_sbml_model(model_primary_catalysis_gprs,'../../Model/iCH360/Escherichia_coli_iCH360_primary_catalysis_only.xml')
cobra.io.save_json_model(model_primary_catalysis_gprs,'../../Model/iCH360/Escherichia_coli_iCH360_primary_catalysis_only.json')



## Save Final Graph in GML format (preferred for graph I/O) and Cytoscape format (JSON)

In [39]:
#Replace None values with 'NA', as None is not supported by the GML format
for node in graph.nodes:
    for attr, value in graph.nodes[node].items():
        if value is None:
            graph.nodes[node][attr]='NA'

#GML
nx.write_gml(graph,"../ich360_graph.gml")
#Cytoscape
cytoscape_data=nx.cytoscape_data(graph)
with open("../ich360_graph.cyjs",'w') as file:
    json.dump(cytoscape_data,file)


Ensure we can read the graph back without issues

In [40]:
try:
    graph_reloaded_from_gml=nx.read_gml("../ich360_graph.gml")
except Exception as error:
    print(f"Unable to read saved GML-format graph! ({error})")

try:
    with open("../ich360_graph.cyjs",'r') as file:
        graph_data=json.load(file)
    graph_reloaded_from_json=nx.cytoscape_graph(graph_data)
except Exception as error:
    print(f"Unable to read saved Cytoscape-format graph! ({error})")
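
A stricter check would compare the reloaded graphs with the original rather than only confirming that no exception is raised. The sketch below does this on a toy graph written to a temporary directory; the node names, attribute values and the edge type `coding_relation` are hypothetical, and only node and edge sets are compared because the Cytoscape round trip adds bookkeeping attributes to the nodes.

```python
import json
import os
import tempfile
import networkx as nx

toy = nx.DiGraph()
toy.add_node('PP-A', subtype='polypeptide', mw=None)  # None is not supported by GML
toy.add_node('GENE-A', subtype='NA', mw=1.5)
toy.add_edge('PP-A', 'GENE-A', type='coding_relation')

# Replace None values before writing, as done for the full graph.
for node in toy.nodes:
    for attr, value in toy.nodes[node].items():
        if value is None:
            toy.nodes[node][attr] = 'NA'

tmp_dir = tempfile.mkdtemp()
gml_path = os.path.join(tmp_dir, 'toy.gml')
cyjs_path = os.path.join(tmp_dir, 'toy.cyjs')

# Write both formats.
nx.write_gml(toy, gml_path)
with open(cyjs_path, 'w') as file:
    json.dump(nx.cytoscape_data(toy), file)

# Read both formats back.
from_gml = nx.read_gml(gml_path)
with open(cyjs_path) as file:
    from_cyjs = nx.cytoscape_graph(json.load(file))

# Compare node and edge sets of each reloaded graph with the original.
for name, reloaded in [('GML', from_gml), ('Cytoscape', from_cyjs)]:
    same_nodes = set(reloaded.nodes) == set(toy.nodes)
    same_edges = set(reloaded.edges) == set(toy.edges)
    print(name, 'round trip OK:', same_nodes and same_edges)
```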