# Assigning Pathways

During NeuroMMSig v1.0, pathways were manually assigned to each edge. For v2.0, we would like to automate this process: first with a rule-based system, then with a machine learning system that prioritizes edges for curation, and only then by resorting to manual curation.

## Preamble

### Imports

In [1]:
import getpass
import itertools as itt
import os
import random
import sys
import time
from collections import defaultdict

import bio2bel_wikipathways
import hbp_knowledge
import pybel
from pybel.dsl import BaseAbundance, ListAbundance

### Environment

In [2]:
print(time.asctime())

Mon Aug 19 13:58:20 2019


In [3]:
print(sys.version)

3.7.3 (default, Mar 27 2019, 09:23:39) 
[Clang 10.0.0 (clang-1000.11.45.5)]


In [4]:
print(getpass.getuser())

cthoyt


In [5]:
pybel.get_version()

'0.13.3-dev'

In [6]:
print(hbp_knowledge.VERSION)

0.0.7


In [7]:
print(bio2bel_wikipathways.get_version())

0.2.4-dev


### Data

In [8]:
graph = hbp_knowledge.get_graph()
graph.summarize()

Human Brain Pharmacome Knowledge v0.0.7
Number of Nodes: 6023
Number of Edges: 21625
Number of Citations: 358
Number of Authors: 2012
Network Density: 5.96E-04
Number of Components: 31


## Assigning Pathways

Generate mappings between the pathways in a given database (here, WikiPathways) and the HGNC gene symbols of their proteins, in both directions.

In [9]:
wikipathways_manager = bio2bel_wikipathways.Manager()
wikipathways_manager.summarize()

{'pathways': 556, 'proteins': 6613}

In [10]:
pathway_to_symbols = {
    ('wikipathways', pathway.wikipathways_id, pathway.name): {
        protein.hgnc_symbol
        for protein in pathway.proteins
    }
    for pathway in wikipathways_manager._query_pathway().all()
}

symbol_to_pathways = defaultdict(set)
for pathway_tuple, genes in pathway_to_symbols.items():
    for gene in genes:
        symbol_to_pathways[gene].add(pathway_tuple)
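
To make the shape of these two mappings concrete, here is a minimal sketch on made-up data. The pathway identifiers (`WP0001`, `WP0002`), pathway names, and gene sets are invented for illustration and are not taken from WikiPathways.

```python
from collections import defaultdict

# Hypothetical toy data shaped like pathway_to_symbols above;
# the identifiers and gene sets are invented for illustration.
toy_pathway_to_symbols = {
    ('wikipathways', 'WP0001', 'Toy pathway A'): {'MAPT', 'GSK3B'},
    ('wikipathways', 'WP0002', 'Toy pathway B'): {'GSK3B', 'APP'},
}

# Invert the mapping: gene symbol -> set of pathway tuples
toy_symbol_to_pathways = defaultdict(set)
for pathway_tuple, symbols in toy_pathway_to_symbols.items():
    for symbol in symbols:
        toy_symbol_to_pathways[symbol].add(pathway_tuple)

# A gene shared by both pathways maps back to both of them
print(sorted(p[1] for p in toy_symbol_to_pathways['GSK3B']))
```

Keeping both directions around makes membership checks cheap whether a rule starts from a pathway or from a gene.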

In [11]:
pathway_to_key = defaultdict(set)
key_to_pathway = defaultdict(set)
pmid_to_pathway = defaultdict(set)
double_annotated = defaultdict(lambda: defaultdict(list))

In [12]:
def is_hgnc(node):
    try:
        return node.namespace.lower() == 'hgnc'
    except AttributeError:
        return False
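
The `try`/`except` in `is_hgnc` covers two situations at once: a node with no `namespace` attribute, and a node whose `namespace` is `None`. The sketch below illustrates this with stand-in objects built from `types.SimpleNamespace`; these are hypothetical placeholders, not real PyBEL node classes.

```python
from types import SimpleNamespace


def is_hgnc(node):
    """Same check as in the cell above, repeated so this sketch runs on its own."""
    try:
        return node.namespace.lower() == 'hgnc'
    except AttributeError:
        return False


# Hypothetical stand-ins for graph nodes (not real PyBEL objects)
protein = SimpleNamespace(namespace='HGNC', name='MAPT')
rat_protein = SimpleNamespace(namespace='RGD', name='Mapt')
unnamed = SimpleNamespace(namespace=None)  # attribute exists but is None
no_namespace = object()  # no namespace attribute at all

print([is_hgnc(n) for n in (protein, rat_protein, unnamed, no_namespace)])
```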

### Assigning HGNC-HGNC Edges

1. `If` the subject and object of an edge are both in a canonical pathway, then the edge gets assigned to that pathway.
2. `Else if` only one of the subject and the object of the edge is in the pathway:
  1. `If` the edge is an ontological edge, then add it to the pathway
  2. `If` there are other edges in the pathway mentioned in the same article, assign the edge to the pathway
  3. `Else` leave it for manual curation
3. `Else if` neither of the nodes is in the pathway, but both nodes are connected to nodes in the pathway by directed edges, assign the edge to the pathway, as well as the incident edges
4. `Else` the nodes don't get assigned to the pathway
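
The rules above can be read as a single decision function for one edge and one pathway. The following is only a sketch under stated assumptions, not the notebook's implementation: the function name `classify_edge`, its keyword arguments, and the return labels are hypothetical, and the flags (whether the edge is ontological, whether the article already has edges in the pathway, whether each node is linked to the pathway by a directed edge) are assumed to be computed elsewhere.

```python
def classify_edge(u, v, symbols, *, is_ontological=False,
                  article_has_pathway_edges=False,
                  u_linked=False, v_linked=False):
    """Return 'assign', 'manual', or 'skip' for one edge and one pathway.

    All names here are hypothetical; the flags are assumed to be precomputed.
    """
    n_in = (u in symbols) + (v in symbols)
    if n_in == 2:  # rule 1: both nodes are in the pathway
        return 'assign'
    if n_in == 1:  # rule 2: exactly one node is in the pathway
        if is_ontological or article_has_pathway_edges:
            return 'assign'
        return 'manual'
    if u_linked and v_linked:  # rule 3: both nodes neighbor the pathway
        return 'assign'
    return 'skip'  # rule 4


toy_symbols = {'MAPT', 'GSK3B'}  # invented pathway membership
print(classify_edge('MAPT', 'GSK3B', toy_symbols))
print(classify_edge('MAPT', 'CDK5', toy_symbols))
print(classify_edge('MAPT', 'CDK5', toy_symbols, article_has_pathway_edges=True))
print(classify_edge('APP', 'CDK5', toy_symbols))
```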

In [13]:
c = 0

for u, v, k, d in graph.edges(keys=True, data=True):
    if not isinstance(u, BaseAbundance) or not isinstance(v, BaseAbundance):
        continue
    
    if not is_hgnc(u) or not is_hgnc(v):
        continue
    
    u_name, v_name = u.name, v.name

    for pathway_tuple, symbols in pathway_to_symbols.items():
        if u_name not in symbols or v_name not in symbols:
            continue

        double_annotated[pathway_tuple][tuple(sorted([u_name, v_name]))].append((u, v, k, d))

        pathway_to_key[pathway_tuple].add(k)
        key_to_pathway[k].add(pathway_tuple)

        citation = d.get('citation')
        if citation is not None:
            pmid_to_pathway[citation['reference']].add(pathway_tuple)

        c += 1
            
print(f'Made {c} annotations')

Made 9423 annotations


### Assigning Chemical/Biological Process/Disease - HGNC edges

`If` a non-HGNC entity (a chemical, biological process, or disease) is connected to a gene in a pathway, then that edge gets annotated to the pathway.

In [14]:
c = 0

for u, v, k, d in graph.edges(keys=True, data=True):
    if not isinstance(u, BaseAbundance) or not isinstance(v, BaseAbundance):
        continue

    if is_hgnc(u) and not is_hgnc(v):
        gene_name = u.name
        other_name = v.name
    elif not is_hgnc(u) and is_hgnc(v):
        gene_name = v.name
        other_name = u.name
    else:
        continue

    for pathway_tuple, symbols in pathway_to_symbols.items():
        if gene_name not in symbols:
            continue

        double_annotated[pathway_tuple][tuple(sorted([gene_name, other_name]))].append((u, v, k, d))

        pathway_to_key[pathway_tuple].add(k)
        key_to_pathway[k].add(pathway_tuple)

        citation = d.get('citation')
        if citation is not None:
            pmid_to_pathway[citation['reference']].add(pathway_tuple)

        c += 1
            
print(f'Made {c} annotations')

Made 62318 annotations


In [15]:
pmid_to_pathway

defaultdict(set,
            {'30663117': {('wikipathways', 'WP127', 'IL-5 Signaling Pathway'),
              ('wikipathways', 'WP138', 'Androgen receptor signaling pathway'),
              ('wikipathways',
               'WP1403',
               'AMP-activated Protein Kinase (AMPK) Signaling'),
              ('wikipathways', 'WP1438', 'Influenza A virus infection'),
              ('wikipathways',
               'WP1449',
               'Regulation of toll-like receptor signaling pathway'),
              ('wikipathways',
               'WP1530',
               'miRNA Regulation of DNA Damage Response'),
              ('wikipathways',
               'WP1544',
               'MicroRNAs in cardiomyocyte hypertrophy'),
              ('wikipathways',
               'WP1545',
               'miRNAs involved in DNA damage response'),
              ('wikipathways',
               'WP1559',
               'TFs Regulate miRNAs related to cardiac hypertrophy'),
              ('wikipathways', 'WP1

### Assigning Tangential Nodes

If only one node of an edge appears in a pathway, but another edge from the same article has already been assigned to that pathway, then the edge gets annotated to that pathway too.

In [16]:
c = 0
 
for u, v, k, d in graph.edges(keys=True, data=True):
    citation = d.get('citation')
    if citation is None:
        continue
        
    reference = citation['reference']
    pathways = pmid_to_pathway[reference]

    if is_hgnc(u) and not is_hgnc(v):
        gene_name = u.name
    elif not is_hgnc(u) and is_hgnc(v):
        gene_name = v.name
    else:
        continue

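    # Use the loop variable itself; a stale pathway_tuple from an earlier cell
    # would put every annotation under one unrelated pathway
    for pathway_tuple in pathways:
        if pathway_tuple not in symbol_to_pathways[gene_name]:
            continue

        double_annotated[pathway_tuple][gene_name].append((u, v, k, d))
        pathway_to_key[pathway_tuple].add(k)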

        c += 1
            
print(f'Made {c} annotations')

Made 66254 annotations
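
The citation-based rule in this section boils down to a set intersection: the pathways containing the gene, intersected with the pathways already seen in the same article. Here is a minimal sketch on invented data; the PMIDs, pathway identifiers, and edges are all hypothetical, and edges are simplified to `(gene, other entity, PMID)` triples.

```python
from collections import defaultdict

# Invented data: gene symbol -> pathways, and PMID -> pathways already
# assigned from that article by the earlier rules
toy_symbol_to_pathways = {'GSK3B': {'WP0001', 'WP0002'}, 'APP': {'WP0002'}}
toy_pmid_to_pathways = {'111': {'WP0001'}}

# Simplified edges: (gene symbol, other entity, PMID)
toy_edges = [
    ('GSK3B', 'lithium', '111'),  # gene's pathway WP0001 was seen in article 111
    ('APP', 'memory', '111'),     # APP is not in any pathway seen in article 111
    ('GSK3B', 'lithium', '222'),  # article 222 has no pathways yet
]

assignments = defaultdict(list)
for gene, other, pmid in toy_edges:
    shared = (
        toy_symbol_to_pathways.get(gene, set())
        & toy_pmid_to_pathways.get(pmid, set())
    )
    for pathway in sorted(shared):
        assignments[pathway].append((gene, other, pmid))

print(dict(assignments))
```

Using `.get()` with a default instead of indexing avoids creating empty entries as a side effect when the lookup tables are `defaultdict` instances.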


### Assigning Complexes

If two or more HGNC members of a complex are in a pathway, or if the complex has a single HGNC member and that member is in the pathway, then the whole complex and all of its partOf relationships will get assigned to that pathway.

In [17]:
c = 0
for node in graph:
    if not isinstance(node, ListAbundance):
        continue
    
    hgnc_count = sum(
        member.namespace.lower() == 'hgnc'
        for member in node.members
        if isinstance(member, BaseAbundance)
    )
    
    if 0 == hgnc_count:
        continue

    for pathway_tuple, symbols in pathway_to_symbols.items():
        in_count = sum(
            member.name in symbols
            for member in node.members
            if isinstance(member, BaseAbundance) and member.namespace.lower() == 'hgnc'
        )
        
        do_it = (
            (1 == hgnc_count and 1 == in_count)  # the only HGNC member is in the pathway
            or 2 <= in_count  # at least two HGNC members are in the pathway
        )
        
        if not do_it:
            continue
        
        for u, v, k, d in graph.edges(node, keys=True, data=True):
            double_annotated[pathway_tuple][node].append((u, v, k, d))
            pathway_to_key[pathway_tuple].add(k)
            c += 1
            
print(f'Made {c} annotations')

Made 9398 annotations
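
The membership threshold used for complexes can be isolated as a small predicate, which makes it easier to check the edge cases. This is a sketch only: the function name `should_assign_complex` is hypothetical, and it takes the HGNC symbols of a complex's members as a plain list rather than reading them from PyBEL nodes.

```python
def should_assign_complex(hgnc_member_symbols, pathway_symbols):
    """Decide whether a complex belongs to a pathway (hypothetical helper).

    hgnc_member_symbols: HGNC symbols of the complex's members
    pathway_symbols: set of HGNC symbols in the pathway
    """
    hgnc_count = len(hgnc_member_symbols)
    if hgnc_count == 0:
        return False
    in_count = sum(symbol in pathway_symbols for symbol in hgnc_member_symbols)
    # Either the only HGNC member is in the pathway, or at least two members are
    return (hgnc_count == 1 and in_count == 1) or in_count >= 2


toy_pathway = {'MAPT', 'GSK3B', 'CDK5'}  # invented pathway membership
print(should_assign_complex(['MAPT'], toy_pathway))
print(should_assign_complex(['MAPT', 'APP'], toy_pathway))
print(should_assign_complex(['MAPT', 'GSK3B', 'APP'], toy_pathway))
```

Note that a complex with two HGNC members of which only one is in the pathway is not assigned, while a complex whose single HGNC member is in the pathway is.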


### Print Results

In [18]:
with open('assignments.tsv', 'w') as file, open('assignments.rst', 'w') as log_file:
    print('database', 'pathway_id', 'pathway_name', 'key', 'bel', sep='\t', file=file)
    for (db, pathway_id, pathway), names_dict in double_annotated.items():
        title = f'{db}:{pathway_id} - {pathway}'
        print(title, file=log_file)
        print('=' * len(title), file=log_file)

        for node_key, keys_and_data in names_dict.items():
            print('', file=log_file)
            print(node_key, file=log_file)
            l = len(str(node_key))
            print('-' * l, file=log_file)
            for u, v, key, data in keys_and_data:
                print('-', key[:8], graph.edge_to_bel(u, v, data), file=log_file)
                print(db, pathway_id, pathway, key, graph.edge_to_bel(u, v, data), sep='\t', file=file)

        print('', file=log_file)

How did we do?

In [19]:
annotated_edge_keys = set(itt.chain.from_iterable(pathway_to_key.values()))
n_edges_annotated = len(annotated_edge_keys)

print(f'{n_edges_annotated} ({n_edges_annotated / graph.number_of_edges():.2%}) of {graph.number_of_edges()} edges were annotated')

9165 (42.38%) of 21625 edges were annotated


In [20]:
annotated_nodes = {
    node
    for u, v, k in graph.edges(keys=True)
    if k in annotated_edge_keys
    for node in (u, v)
}

n_nodes_annotated = len(annotated_nodes)

print(f'{n_nodes_annotated} ({n_nodes_annotated / graph.number_of_nodes():.2%}) of {graph.number_of_nodes()} nodes were annotated')

3242 (53.83%) of 6023 nodes were annotated


### Investigating What's Left

- Dealing with orthologs
- Reasoning over hierarchical relations (isA, partOf, hasMember)
- Protein complex membership for GO cellular components
- Checking protein families
- Annotation of GO cellular components to pathways
- Reactions - need to enrich with connections to biological processes in GO or annotate based on any enzymes that they interact with.
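
As one possible direction for the protein family item above, a family node could inherit the pathways that enough of its members belong to. The sketch below is purely illustrative and is not something this notebook implements: the function name `family_pathways`, the `min_fraction` threshold, the family membership, and the pathway identifiers are all assumptions.

```python
from collections import Counter


def family_pathways(members, symbol_to_pathways, min_fraction=0.5):
    """Return pathways containing at least min_fraction of a family's members.

    Hypothetical helper; the 0.5 default threshold is an arbitrary assumption.
    """
    if not members:
        return set()
    counts = Counter()
    for member in members:
        counts.update(symbol_to_pathways.get(member, ()))
    return {
        pathway
        for pathway, n in counts.items()
        if n / len(members) >= min_fraction
    }


# Invented data: three family members and their pathway memberships
toy_members = ['CTSB', 'CTSD', 'CTSL']
toy_symbol_to_pathways = {
    'CTSB': {'WP0003'},
    'CTSD': {'WP0003', 'WP0004'},
}

print(family_pathways(toy_members, toy_symbol_to_pathways))
```

A stricter threshold trades coverage for precision, so the choice would need to be checked against manually curated examples.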

In [21]:
unannotated_edges = [
    (u, v, k, d)
    for u, v, k, d in graph.edges(data=True, keys=True)
    if k not in annotated_edge_keys
]

print(f'There are {len(unannotated_edges)} unannotated edges')

There are 12460 unannotated edges


In [22]:
for u, v, k, d in random.sample(unannotated_edges, 15):
    print(k[:8], graph.edge_to_bel(u, v, d))

0fb28d1e p(HGNCGENEFAMILY:Cathepsins) association a(GO:lysosome)
d271d359 p(HGNC:CDC37) positiveCorrelation p(HGNC:MAPT, pmod(Ph, Thr, 231))
e1fc3be8 a(GO:axon) positiveCorrelation p(RGD:Cdk5r2)
6279f96b a(CONSO:CONSO00018) association a(MESH:"Amyloid beta-Peptides")
7240d478 complex(GO:"NLRP1 inflammasome complex") increases bp(GO:pyroptosis)
f5253ac2 p(HGNCGENEFAMILY:"Cholinergic receptors nicotinic subunits") increases act(a(CHEBI:"amyloid-beta"))
a3020019 path(MESH:Amblyopia) association bp(GO:"GABAergic neuron differentiation")
8ad30f6f a(PUBCHEM:1476756) increases bp(GO:memory)
de00f580 a(MESH:"Receptors, Scavenger") increases deg(a(CHEBI:"amyloid-beta"))
d4e7e344 path(MESH:"Alzheimer Disease") increases complex(a(CONSO:"dystrophic neurite"), a(GO:autophagosome))
5757a00d path(CONSO:"isolation rearing") regulates p(RGD:Bdnf)
9ec0796c tloc(a(MESH:Proteins), fromLoc(GO:intracellular), toLoc(GO:lysosome)) increases deg(a(MESH:Proteins))
ecdcc28c path(MESH:"Protein Aggregation, Patho

In [23]:
unannotated_nodes = set(graph) - annotated_nodes

print(f'There are {len(unannotated_nodes)} unannotated nodes')

There are 2781 unannotated nodes


In [24]:
# random.sample() needs a sequence; passing a set raises TypeError on Python 3.11+
for node in random.sample(list(unannotated_nodes), 15):
    print(node)

complex(a(CONSO:"Tau antibody, pS396"), p(MGI:Mapt, pmod(Ph, Ser, 396)))
p(HGNC:CDC37L1)
complex(GO:"L-type voltage-gated calcium channel complex")
a(CHEBI:physostigmine)
a(CHEBI:glutaraldehyde)
bp(GO:"regulation of circadian sleep/wake cycle, sleep")
path(MESH:"Diabetes Mellitus, Type 2")
p(HGNCGENEFAMILY:"Glutamate ionotropic receptor kainate type subunits")
p(CONSO:"protein aggregates", pmod(Ub))
p(MGI:Gap43)
bp(MESH:"Stress, Physiological")
r(RGD:Irf1)
path(MESH:Ischemia)
p(MGI:Dpysl2)
path(MESH:"Motor Disorders")
