# Using the *i*CH360 knowledge graph

## Introduction
In a metabolic model, the path connecting a biochemical reaction to its associated genes typically passes through a number of intermediate entities, including enzymes and their protein subunits, protein activators, protein cofactors, and many others. The *i*CH360 knowledge graph is a graph data structure that enhances and supports the stoichiometric model by providing an explicit representation of these intermediate biological entities and the relationships connecting them. The graph was assembled from data retrieved from the EcoCyc database, which was then manually curated, extended, and annotated based on available literature.

In the graph, nodes represent biological entities (reactions, proteins, genes, compounds) and edges denote (potentially quantitative) relationships between them. For example, a reaction may be connected to a protein by a *catalysis* edge, indicating that the protein is an enzyme for that reaction. Similarly, a protein may be connected to another protein by a *subunit composition* relationship, implying that the latter protein is a subunit of the former. You can read more about the graph and its node/edge types in `Knowledge_graph/Knowledge_graph/knowledge_graph.md`.


In this tutorial, we demonstrate how to load the graph, inspect it, and visualise relevant parts of it. We will use the package NetworkX to manipulate the graph, and the gravis package (https://github.com/robert-haas/gravis) for interactive visualisation.

## Importing the required packages

In [65]:
import networkx as nx #for graph manipulation
import gravis #for interactive visualisation

## Loading the graph
We will load the graph using the `GML` (Graph Modelling Language) file in this repository.

In [66]:
ich360_graph=nx.read_gml('../Knowledge_Graph/ich360_graph.gml')
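As a quick sanity check after loading, it can help to confirm that the object is a directed graph and to look at its size. The sketch below does not touch the real file: it writes a tiny, made-up graph (the node names `bigg:RXN1` and `ENZ1-MONOMER` are hypothetical) to a temporary GML file and reads it back, only to illustrate the calls one might run on `ich360_graph`.

```python
import os
import tempfile

import networkx as nx

# Build a tiny directed toy graph (hypothetical node names, not from iCH360)
toy = nx.DiGraph()
toy.add_node("bigg:RXN1", type="reaction")
toy.add_node("ENZ1-MONOMER", type="protein")
toy.add_edge("bigg:RXN1", "ENZ1-MONOMER", type="catalysis")

# Round-trip the toy graph through a temporary GML file
with tempfile.TemporaryDirectory() as tmp_dir:
    gml_path = os.path.join(tmp_dir, "toy_graph.gml")
    nx.write_gml(toy, gml_path)
    loaded = nx.read_gml(gml_path)

# The same checks can be run on the graph loaded above
print(f"Graph class: {type(loaded).__name__}")
print(f"Directed: {loaded.is_directed()}")
print(f"Nodes: {loaded.number_of_nodes()}, edges: {loaded.number_of_edges()}")
```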

## Inspecting nodes and their attributes

The object we just created is a directed graph (meaning edges have a defined direction). Let's inspect its nodes:

In [67]:
ich360_graph.nodes

NodeView(('bigg:NDPK5', 'ADENYL-KIN-MONOMER', 'NUCLEOSIDE-DIP-KIN-MONOMER', 'NUCLEOSIDE-DIP-KIN-CPLX', 'bigg:SHK3Dr', 'AROE-MONOMER', 'bigg:NDPK6', 'bigg:NDPK8', 'bigg:DHORTS', 'DIHYDROOROT-MONOMER', 'DIHYDROOROT-CPLX', 'bigg:OMPDC', 'OROTPDECARB-MONOMER', 'OROTPDECARB-CPLX', 'bigg:G5SD', 'GLUTSEMIALDEHYDROG-MONOMER', 'GLUTSEMIALDEHYDROG-CPLX', 'bigg:CS', 'CITSYN-MONOMER', 'CITRATE-SI-SYNTHASE', 'bigg:ICDHyr', 'ISOCITDEH-SUBUNIT', 'ISOCITHASE-CPLX', 'bigg:ACALD', 'MHPF-MONOMER', 'ADHE-MONOMER', 'ADHE-CPLX', 'G7285-MONOMER', 'bigg:PPA', 'ALKAPHOSPHA-MONOMER', 'ALKAPHOSPHA-CPLX', 'INORGPYROPHOSPHAT-MONOMER', 'CPLX0-243', 'bigg:PPCK', 'PEPCARBOXYKIN-MONOMER', 'bigg:ME1', 'MALIC-NAD-MONOMER', 'MALIC-NAD-CPLX', 'bigg:ALATA_L', 'G7242-MONOMER', 'CPLX0-7887', 'G7184-MONOMER', 'CPLX0-7888', 'bigg:XYLK', 'XYLULOKIN-MONOMER', 'CPLX0-7466', 'bigg:RBK', 'RIBOKIN-MONOMER', 'CPLX0-7647', 'bigg:GLYK', 'GLYCEROL-KIN-MONOMER', 'GLYCEROL-KIN-CPLX', 'bigg:ASPTA', 'ASPAMINOTRANS-MONOMER', 'ASPAMINOTRANS-D

In the graph, every node is assigned a unique identifier. For example, reaction nodes are named via the same BiGG identifiers used in the stoichiometric model (e.g. `bigg:NDPK5` corresponds to the `NDPK5` reaction in the COBRA model). Since the graph data was largely parsed from EcoCyc, protein nodes are, for simplicity, named after their EcoCyc ID (e.g. `ADENYL-KIN-MONOMER`), but note that these IDs are intended as internal identifiers and should not be used as annotations to external databases (as we will see, database annotations are available as node attributes, whenever applicable). Finally, genes are named after their b-number (e.g. `b2316`).

Each node in the graph comes with a collection of attributes. We can inspect these by looking up a node using square bracket notation, which returns a dictionary of attribute/value pairs:

In [68]:
ich360_graph.nodes['ISOCITHASE-CPLX']

{'type': 'protein',
 'subtype': 'multimeric_protein',
 'biocyc_id': 'ECOLI:ISOCITHASE-CPLX',
 'mw': 91.514008}

In [69]:
[edge for edge in ich360_graph.edges if ich360_graph.edges[edge]['type']=='logic_operator']

[]

All nodes in the graph have a `type` attribute (for example, `reaction`, `protein`, `gene` or `compound`), corresponding to the biological entity they represent, and, whenever applicable, an annotation to the BioCyc database (`biocyc_id`). In addition, `protein` nodes are further annotated with a subtype (`polypeptide`, `multimeric_protein`, or `modified_protein`) and a molecular weight (`mw`, in kDa). These attributes can be used to perform arbitrary filtering queries on the graph. For example:

In [70]:
#identify all protein nodes
protein_nodes=[node for node in ich360_graph.nodes if ich360_graph.nodes[node]['type']=='protein']
#Identify all polypeptides
polypeptide_nodes=[node for node in ich360_graph.nodes if ich360_graph.nodes[node]['type']=='protein' and ich360_graph.nodes[node]['subtype']=='polypeptide']
#Identify all nodes, excluding genes
all_but_genes=[node for node in ich360_graph.nodes if ich360_graph.nodes[node]['type']!='gene']
#Identify all proteins with Molecular weight above 500 kDa
heavy_proteins=[node for node in ich360_graph.nodes if ich360_graph.nodes[node]['type']=='protein' and ich360_graph.nodes[node]['mw']>500]

#----------------
print(f"Found {len(protein_nodes)} nodes of type protein in the graph")
print(f"Found {len(polypeptide_nodes)} polypeptide nodes in the graph")
print(f"Found {len(all_but_genes)} nodes that are not genes in the graph")
print(f"Found {len(heavy_proteins)} protein nodes with annotated molecular weight greater than 500 kDa")

Found 589 nodes of type protein in the graph
Found 360 polypeptide nodes in the graph
Found 1302 nodes that are not genes in the graph
Found 13 protein nodes with annotated molecular weight greater than 500 kDa
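The same kind of filtering can be written more compactly by iterating over `nodes(data=True)`, which yields each node together with its attribute dictionary. The sketch below runs on a small toy graph with made-up node names and molecular weights (not values from *i*CH360), and uses `.get()` so that a node lacking an attribute does not raise a `KeyError`.

```python
from collections import Counter

import networkx as nx

# Toy graph with hypothetical nodes and molecular weights
toy = nx.DiGraph()
toy.add_node("bigg:RXN1", type="reaction")
toy.add_node("ENZ1-CPLX", type="protein", subtype="multimeric_protein", mw=90.0)
toy.add_node("ENZ1-MONOMER", type="protein", subtype="polypeptide", mw=45.0)
toy.add_node("b0001", type="gene")

# Count how many nodes of each type the graph contains
type_counts = Counter(attrs.get("type") for _, attrs in toy.nodes(data=True))

# Select proteins above a molecular weight threshold (in kDa),
# treating a missing 'mw' attribute as 0
heavy_proteins = [
    node
    for node, attrs in toy.nodes(data=True)
    if attrs.get("type") == "protein" and attrs.get("mw", 0) > 50
]

print(dict(type_counts))
print(heavy_proteins)
```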


Among all proteins, a wide range of external annotations (as parsed from EcoCyc) are typically available for polypeptide nodes. These are pooled under the `annotation` attribute:

In [71]:
ich360_graph.nodes['PGLUCISOM']

{'type': 'protein',
 'subtype': 'polypeptide',
 'biocyc_id': 'ECOLI:PGLUCISOM',
 'annotation': {'INTERPRO': 'IPR001672',
  'ALPHAFOLD': 'P0A6T1',
  'PFAM': 'PF00342',
  'SMR': 'P0A6T1',
  'DIP': 'DIP-35887N',
  'PROSITE': 'PS00765',
  'PDB': '3NBU',
  'PRIDE': 'P0A6T1',
  'PANTHER': 'PTHR11469',
  'PRINTS': 'PR00662',
  'PRODB': 'PRO_000025181',
  'ECOLIWIKI': 'b4025',
  'MODBASE': 'P0A6T1',
  'SWISSMODEL': 'P0A6T1',
  'UNIPROT': 'P0A6T1',
  'REFSEQ': 'NP_418449'},
 'mw': 61.530003}
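Since not every node necessarily carries an `annotation` attribute, a lookup that tolerates missing entries can be convenient. The following is a minimal sketch on a toy graph: the `PGLUCISOM` annotations are copied from the output above, while `TOY-CPLX` and the helper name `get_annotation` are hypothetical.

```python
import networkx as nx

# Toy graph: one annotated polypeptide and one node without annotations
toy = nx.DiGraph()
toy.add_node(
    "PGLUCISOM",
    type="protein",
    subtype="polypeptide",
    annotation={"UNIPROT": "P0A6T1", "PDB": "3NBU"},
)
toy.add_node("TOY-CPLX", type="protein", subtype="multimeric_protein")


def get_annotation(graph, node, database):
    """Return the annotation of `node` for `database`, or None if unavailable."""
    return graph.nodes[node].get("annotation", {}).get(database)


# Map every node to its UniProt accession (None when not annotated)
uniprot_ids = {node: get_annotation(toy, node, "UNIPROT") for node in toy.nodes}
print(uniprot_ids)
```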

## Inspecting edges and their attributes
Nodes in the graph are connected to each other via functional relationships. These relationships (or *edges*, as they are called in graph theory) are directional (i.e. they are characterised by a parent and a child node) and, like nodes, have a well-defined type. For example, we can inspect the edges associated with a node by running:

In [72]:
my_edges=ich360_graph.edges('ISOCITHASE-CPLX')
my_edges

OutEdgeDataView([('ISOCITHASE-CPLX', 'ISOCITDEH-SUBUNIT')])

In this case, we can see that the node `ISOCITHASE-CPLX` has a single (outward) edge targeting the node `ISOCITDEH-SUBUNIT`. As before, we can find out more about an edge by looking it up in the graph with square bracket notation. However, edges in the graph do not have an ID of their own, and are instead identified by the parent-child node pair they connect:

In [73]:
ich360_graph.edges['ISOCITHASE-CPLX','ISOCITDEH-SUBUNIT']

{'weight': 2,
 'type': 'subunit_composition',
 'subtype': 'requirement',
 'notes': '',
 'references': ''}

As its type suggests, this edge is a `subunit_composition` relationship, indicating that the child node (`ISOCITDEH-SUBUNIT`) is a protein subunit of the parent node (`ISOCITHASE-CPLX`). Some edge types, like this one, have a `weight` which, in this case, represents the stoichiometry of the subunit in the complex. Some nodes may be associated with multiple edges. For example, a reaction may have multiple catalysis edges, implying that there are multiple isoenzymes able to catalyze it:

In [74]:
edges=ich360_graph.edges('bigg:MALS')
for edge in edges:
    print(f"{edge[0]}  ---{ich360_graph.edges[edge]['type']}--->  {edge[1]}")

bigg:MALS  ---catalysis--->  MALSYNG-MONOMER
bigg:MALS  ---catalysis--->  MALATE-SYNTHASE


Similarly, a modified protein can be associated with its unmodified form by a `protein_modification` edge, and with another protein, such as an activator, via a `protein_modification_requirement` edge:

In [75]:
edges=ich360_graph.edges('PYRUVFORMLY-CPLX')
for edge in edges:
    print(f"{edge[0]} ---{ich360_graph.edges[edge]['type']}--->  {edge[1]}")

PYRUVFORMLY-CPLX ---protein_modification--->  PYRUVFORMLY-INACTIVE-CPLX
PYRUVFORMLY-CPLX ---protein_modification_requirement--->  PFLACTENZ-MONOMER


## More advanced querying

Having investigated nodes and edges individually, let's now look at some slightly more involved examples of querying information from the graph. For this purpose, we are going to use a number of NetworkX built-in functions tailored for graph analysis.

In [76]:
reaction_node='bigg:MCOATA'
enzyme_node='MALATE-SYNTHASE'
# Get all children of a node, i.e. all nodes receiving an edge from that node, using the successors() method
children_nodes=ich360_graph.successors(reaction_node)
# Get all catalytic children nodes, i.e. all isoenzymes for a reaction
catalytic_children=[child for child in ich360_graph.successors(reaction_node)
                    if ich360_graph.edges[reaction_node,child]['type']=='catalysis'
                    ]

#get all parents of a node, i.e. all nodes sending an edge to this node, using the predecessors() method
parent_nodes=ich360_graph.predecessors(enzyme_node)
# Get all annotated small molecule regulators of an enzyme (these are parents of the enzyme node)
regulators=[parent for parent in ich360_graph.predecessors(enzyme_node)
            if ich360_graph.edges[parent,enzyme_node]['type']=='regulation'
            ]

#---------
print(f"Reaction node {reaction_node} has the following children in the graph: {list(children_nodes)}")
print(f"Of these, the following are related to {reaction_node} via a catalytic relationship: {catalytic_children}")
print()
print(f"Protein node {enzyme_node} has the following parents: {list(parent_nodes)}")
print(f"Of these, the following are related to {enzyme_node} via a regulation edge: {regulators}")


Reaction node bigg:MCOATA has the following children in the graph: ['MALONYL-COA-ACP-TRANSACYL-MONOMER', 'ACP-MONOMER']
Of these, the following are related to bigg:MCOATA via a catalytic relationship: ['MALONYL-COA-ACP-TRANSACYL-MONOMER']

Protein node MALATE-SYNTHASE has the following parents: ['bigg:MALS', 'PYRUVATE', 'OXALATE']
Of these, the following are related to MALATE-SYNTHASE via a regulation edge: ['PYRUVATE', 'OXALATE']
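Beyond direct children and parents, one may want every gene reachable from a reaction by following edges downstream. A minimal sketch of this uses `nx.descendants()` and then keeps only the nodes of type `gene`. The toy graph below hand-copies a few edges of the `bigg:GLUDy` neighbourhood of the graph; the helper name `genes_for_reaction` is hypothetical, and the sketch assumes that every outgoing edge type should be followed, which may be too permissive for some analyses on the full graph.

```python
import networkx as nx

# Toy graph: a hand-copied fragment around bigg:GLUDy
toy = nx.DiGraph()
edges = [
    ("bigg:GLUDy", "GDHA-CPLX", "catalysis"),
    ("bigg:GLUDy", "GLUTAMATESYN-CPLX", "catalysis"),
    ("GDHA-CPLX", "GDHA-MONOMER", "subunit_composition"),
    ("GDHA-MONOMER", "b1761", "coding_relation"),
    ("GLUTAMATESYN-CPLX", "GLUTAMATESYN-DIMER", "subunit_composition"),
    ("GLUTAMATESYN-DIMER", "GLUSYNLARGE-MONOMER", "subunit_composition"),
    ("GLUTAMATESYN-DIMER", "GLUSYNSMALL-MONOMER", "subunit_composition"),
    ("GLUSYNLARGE-MONOMER", "b3212", "coding_relation"),
    ("GLUSYNSMALL-MONOMER", "b3213", "coding_relation"),
]
for parent, child, edge_type in edges:
    toy.add_edge(parent, child, type=edge_type)

# Annotate node types (in the real graph these attributes already exist)
for node in toy.nodes:
    if node.startswith("bigg:"):
        toy.nodes[node]["type"] = "reaction"
    elif node.startswith("b"):
        toy.nodes[node]["type"] = "gene"
    else:
        toy.nodes[node]["type"] = "protein"


def genes_for_reaction(graph, reaction_node):
    """Return the sorted gene nodes reachable downstream of a reaction node."""
    downstream = nx.descendants(graph, reaction_node)
    return sorted(n for n in downstream if graph.nodes[n].get("type") == "gene")


print(genes_for_reaction(toy, "bigg:GLUDy"))
```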


## Creating subgraphs
So far, we have been looking at relationships in the graph between directly connected nodes. But what if we wanted to inspect all nodes related to a given node, either directly or indirectly (through other nodes)? For example, we may want to extract from the graph all the information related to a given reaction. Luckily for us, structuring this information in a graph format means that many common operations, such as this one, have already been implemented in packages such as NetworkX. In this case, we will use the `node_connected_component()` function of NetworkX to obtain all nodes connected to a reaction. Since the annotated small molecule regulation interactions can be very promiscuous, we create a temporary copy of the graph that omits them for now. For example, let's look at the subgraph related to the `bigg:GLUDy` reaction node:

In [77]:
#make a temporary copy of the graph
temp_graph=ich360_graph.copy()
#Remove regulation edges to simplify the subgraph
temp_graph.remove_edges_from([edge for edge in ich360_graph.edges if ich360_graph.edges[edge]['type']=='regulation'])
#Generate a list of all nodes connected (directly or indirectly) to the reaction of interest. Note that we convert the graph to undirected form before passing it to this function
subgraph_nodes=nx.node_connected_component(temp_graph.to_undirected(),
                                        'bigg:GLUDy')
subgraph=nx.subgraph(ich360_graph,subgraph_nodes)

for edge in subgraph.edges:
    print(f"{edge[0]} ---{ich360_graph.edges[edge]['type']}--->  {edge[1]}")

bigg:GLUDy ---catalysis--->  GLUTAMATESYN-CPLX
bigg:GLUDy ---catalysis--->  GDHA-CPLX
GLUTAMATESYN-DIMER ---subunit_composition--->  GLUSYNSMALL-MONOMER
GLUTAMATESYN-DIMER ---subunit_composition--->  GLUSYNLARGE-MONOMER
GDHA-MONOMER ---coding_relation--->  b1761
GDHA-CPLX ---subunit_composition--->  GDHA-MONOMER
GLUTAMATESYN-CPLX ---subunit_composition--->  GLUTAMATESYN-DIMER
GLUSYNLARGE-MONOMER ---coding_relation--->  b3212
bigg:GLUSy ---catalysis--->  GLUTAMATESYN-CPLX
GLUSYNSMALL-MONOMER ---coding_relation--->  b3213


## Interactive visualisation with Gravis

While the `print` statements used above allow some quick, rough exploration of the knowledge in the graph, investigating larger portions of the graph, composed of many interconnected nodes, can become daunting this way. In this case, a graphical visualisation of a (sub)graph may be desirable. Luckily, there are a number of tools designed for graph visualisation, and you can pick the one that best suits your needs. Here, we demonstrate the use of `gravis` (https://github.com/robert-haas/gravis) which, among other things, integrates directly with NetworkX and provides an interactive visualisation that can be easily customised through a graphical interface.

For example, let's use `gravis` to visualise the subgraph we created above. We will use the `d3()` function:

In [78]:
gravis.d3(subgraph,graph_height=300)

You can call `help(gravis.d3)` to get information about the (many) plotting parameters you can set. 

Currently, all nodes and edges are plotted in the same color, but it would be nice to show different node types with different colors. To decide on the color of a node, gravis looks for a `color` attribute in the node, and uses black if no such attribute is found. Hence, to plot different node types with different colors, we simply need to set this attribute on each node (note that the same applies to edge colouring).

Let's define a colormap mapping each node type (or subtype) to a desired color and then use it to annotate nodes in the graph with their desired plot colour:

In [79]:
#Define colormap for plotting 
nodes_colormap={'reaction':'red',
                'protein':{'multimeric_protein':'blue',
                            'modified_protein':'yellow',
                            'polypeptide':'lightblue'},
                'gene':'green',
                'compound':'gray',
                'logical_OR':'pink',
                'spontaneous':'black'}

#Now annotate each node with a "color" attribute, using the above map
for node_id in subgraph.nodes:
    node_type=subgraph.nodes[node_id]['type'] 
    if node_type=='protein':
        #If the node is a protein, we choose a color depending on the protein subtype
        protein_subtype=subgraph.nodes[node_id]['subtype']
        subgraph.nodes[node_id]['color']=nodes_colormap['protein'][protein_subtype]
    else:
        #We simply choose the color for this node type from the colormap
        subgraph.nodes[node_id]['color']=nodes_colormap[node_type]

# Plot the graph again, now with different colours for different node types
gravis.d3(subgraph,
          graph_height=300)           

Success! As a last step, let's also visualise edge types as labels over the edges. To do this, we set `show_edge_label=True` and `edge_label_data_source` to the attribute we'd like to use as a label, in this case the `type` attribute. Tuning the `many_body_force_strength` parameter will also help space nodes apart (negative values correspond to repulsion between nodes).

In [80]:
#Plot graph again, showing the edge type as a label over each edge
gravis.d3(subgraph,
          graph_height=450,
          show_edge_label=True,
          edge_label_data_source='type',
          many_body_force_strength=-300
         )

If you wanted to include edge subtypes in the labels (for those edge types where this is applicable), you could just create a new attribute for each edge, obtained by merging the edge type and subtype strings, and specify that as the desired `edge_label_data_source`.
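A minimal sketch of that merging step is shown below on a toy graph containing two edges taken from the outputs earlier in this tutorial. The attribute name `type_and_subtype` is an arbitrary choice made for this example; the resulting graph could then be plotted by passing `edge_label_data_source='type_and_subtype'` to `gravis.d3()`.

```python
import networkx as nx

# Toy graph: one edge with a subtype and one without
toy = nx.DiGraph()
toy.add_edge("ISOCITHASE-CPLX", "ISOCITDEH-SUBUNIT",
             type="subunit_composition", subtype="requirement")
toy.add_edge("bigg:MALS", "MALATE-SYNTHASE", type="catalysis")

# Store a combined label on every edge; edges without a (non-empty)
# subtype simply keep their type as the label
for _, _, attrs in toy.edges(data=True):
    subtype = attrs.get("subtype")
    if subtype:
        attrs["type_and_subtype"] = f"{attrs['type']} ({subtype})"
    else:
        attrs["type_and_subtype"] = attrs["type"]

for parent, child, attrs in toy.edges(data=True):
    print(f"{parent} ---{attrs['type_and_subtype']}---> {child}")
```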

That's it for now! Make sure to look at the docs for NetworkX (https://networkx.org/documentation/stable/) and gravis (https://robert-haas.github.io/gravis-docs/) to learn more about the many ways to embed the *i*CH360 knowledge graph in your pipelines. To conclude, here is a visualisation of the entire graph (once again, omitting small molecule regulation to streamline the visualisation):

In [81]:
#Annotate the graph with colors
for node_id in ich360_graph.nodes:
    node_type=ich360_graph.nodes[node_id]['type'] 
    if node_type=='protein':
        #If the node is a protein, we choose a color depending on the protein subtype
        protein_subtype=ich360_graph.nodes[node_id]['subtype']
        ich360_graph.nodes[node_id]['color']=nodes_colormap['protein'][protein_subtype]
    else:
        #We simply choose the color for this node type from the colormap
        ich360_graph.nodes[node_id]['color']=nodes_colormap[node_type]

#Remove regulation edges to simplify visualisation
full_graph_wo_regulation=ich360_graph.copy()
full_graph_wo_regulation.remove_edges_from([edge for edge in ich360_graph.edges if ich360_graph.edges[edge]['type']=='regulation'])

#Plot
gravis.d3(full_graph_wo_regulation,
          graph_height=1000,
          zoom_factor=0.1
         )