# Graph-Based Model Checking for EMMAA

This notebook explores the use of pathfinding over directed graphs as a means of checking EMMAA models against tests. Statements and tests from the Ras Machine 2.0 model are used to generate a directed graph, and the tests satisfied by paths in that graph are compared with those satisfied by the higher-precision PySB/Kappa model assembly and analysis approach.

In [1]:
import json
import pickle
from collections import defaultdict
import boto3
import pandas as pd
import networkx as nx
from indra.statements import Complex, stmts_from_json
from emmaa.util import get_s3_client

In [2]:
pd.set_option('display.max_rows', 10)

Load the latest model manager and the test results JSON (from the 2019-04-30 run) for the EMMAA Ras Machine model:

In [3]:
s3 = get_s3_client()
# Get the Model Manager containing the assembled INDRA Statements
s3_obj = s3.get_object(Bucket='emmaa', Key='results/rasmachine/latest_model_manager.pkl')
mm = pickle.loads(s3_obj['Body'].read())
# Get the Test results JSON
s3_obj = s3.get_object(Bucket='emmaa', Key='results/rasmachine/results_2019-04-30-17-51-24.json')
res = json.load(s3_obj['Body'])

INFO: [2019-05-01 16:40:27] indra.preassembler.grounding_mapper - DEFT will not be available for grounding disambiguation.


Get the assembled statements from the Model Manager and build a NetworkX DiGraph based on agent names:

In [None]:
stmts = mm.model.assembled_stmts
len(stmts)

In [None]:
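edges = []
provenance = defaultdict(list)
for stmt in stmts:
    agents = stmt.agent_list()
    # Keep only binary statements in which both agents are present
    if len(agents) != 2 or any(agent is None for agent in agents):
        continue
    subj, obj = agents
    edges.append((subj.name, obj.name))
    provenance[(subj.name, obj.name)].append(stmt)
    # Binding is symmetric, so a Complex gets an edge in each direction
    if isinstance(stmt, Complex):
        edges.append((obj.name, subj.name))
        provenance[(obj.name, subj.name)].append(stmt)
g = nx.DiGraph()
g.add_edges_from(edges)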
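
To make the edge-building rule concrete without needing INDRA or S3 access, here is a minimal plain-Python sketch of the same logic. The `(type, subject, object)` tuples are hypothetical stand-ins for INDRA Statements, and `build_adjacency` is an illustrative helper, not part of EMMAA or NetworkX. It stores the graph as a dict of successor sets instead of an `nx.DiGraph`.

```python
# Hypothetical stand-ins for INDRA Statements: (statement type, subject, object)
toy_stmts = [
    ('Activation', 'BRAF', 'MAP2K1'),
    ('Phosphorylation', 'MAP2K1', 'MAPK1'),
    ('Complex', 'KRAS', 'BRAF'),
]


def build_adjacency(stmts):
    """Return a dict mapping each node name to the set of its successors."""
    adj = {}
    for stmt_type, subj, obj in stmts:
        adj.setdefault(subj, set()).add(obj)
        # Register the object as a node even if it has no outgoing edges
        adj.setdefault(obj, set())
        # Binding is symmetric, so a Complex adds the reverse edge as well
        if stmt_type == 'Complex':
            adj[obj].add(subj)
    return adj


toy_adj = build_adjacency(toy_stmts)
for node in sorted(toy_adj):
    print(node, '->', sorted(toy_adj[node]))
```

Only the `Complex` entry produces edges in both directions; the other statement types keep their subject-to-object orientation.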

Get the tests and results from the result JSON and compile a list of (source, target) test pairs along with results obtained from the ModelChecker:

In [None]:
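g_tests = []
# Skip the first entry of the results list; the remaining entries are per-test results
for result in res[1:]:
    code = result['result_json']['result_code']
    # Length of the first path reported by the ModelChecker (0 if there is none)
    if result['english_path']:
        mc_path_len = len(result['english_path'][0])
    else:
        mc_path_len = 0
    tj = result['test_json']
    test_stmt = stmts_from_json([tj])[0]
    agents = test_stmt.agent_list()
    # Keep only binary tests between two distinct agents that are both present
    if len(agents) == 2 and all(agent is not None for agent in agents):
        subj, obj = agents
        if subj.name != obj.name:
            g_tests.append((tj['type'], subj.name, obj.name, code, mc_path_len))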

Using the directed graph compiled from the model statements we look for shortest paths connecting the pairs of test nodes and build a Pandas DataFrame with the results:

In [None]:
rows = []
for stmt_type, subj, obj, code, mc_path_len in g_tests:
    sp_text = ''
    g_path_len = 0
    if subj not in g:
        g_code = 'SUBJECT_NODE_NOT_FOUND'
    elif obj not in g:
        g_code = 'OBJECT_NODE_NOT_FOUND'
    else:
        g_code = 'PATH_FOUND'
        try:
            sp = nx.shortest_path(g, subj, obj)
            g_path_len = len(sp) - 1
            sp_text = ' -> '.join(sp)
        except nx.NetworkXNoPath:
            g_code = 'NO_PATH'
    rows.append((stmt_type, subj, obj, code, mc_path_len, g_code, g_path_len, sp_text))
df = pd.DataFrame.from_records(rows, columns=['test_stmt_type', 'subj', 'obj', 'mc_code',
                                              'mc_path_len', 'g_code', 'g_path_len', 'shortest_path'])
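
The classification used in the cell above can be tried out on a toy graph. The sketch below is a plain-Python approximation: it uses a breadth-first search as a stand-in for `nx.shortest_path` on an unweighted graph, and returns the same three fields (`g_code`, path length in edges, path text). The function name `check_pair` and the toy graph are hypothetical.

```python
from collections import deque


def check_pair(adj, subj, obj):
    """Return (g_code, path_len, path_text) for one (subject, object) pair."""
    if subj not in adj:
        return 'SUBJECT_NODE_NOT_FOUND', 0, ''
    if obj not in adj:
        return 'OBJECT_NODE_NOT_FOUND', 0, ''
    # Breadth-first search; `parents` doubles as the visited set
    parents = {subj: None}
    queue = deque([subj])
    while queue:
        node = queue.popleft()
        if node == obj:
            # Walk back through the parents to reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            path.reverse()
            return 'PATH_FOUND', len(path) - 1, ' -> '.join(path)
        for succ in sorted(adj[node]):
            if succ not in parents:
                parents[succ] = node
                queue.append(succ)
    return 'NO_PATH', 0, ''


# Hypothetical toy graph: node -> set of successors
toy_adj = {
    'KRAS': {'BRAF'},
    'BRAF': {'MAP2K1'},
    'MAP2K1': {'MAPK1'},
    'MAPK1': set(),
}
found = check_pair(toy_adj, 'KRAS', 'MAPK1')
no_path = check_pair(toy_adj, 'MAPK1', 'KRAS')
missing = check_pair(toy_adj, 'EGFR', 'MAPK1')
print(found, no_path, missing, sep='\n')
```

As in the notebook cell, the path length counts edges rather than nodes, so it is directly comparable to the number of steps in a ModelChecker path.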

Let's look at some statistics from the results.
First, the number of tests passed by the ModelChecker (counting both `PATHS_FOUND` and `MAX_PATH_LENGTH_EXCEEDED` as passes) versus the graph:


In [None]:
mc_path = df[(df.mc_code == 'PATHS_FOUND') | (df.mc_code == 'MAX_PATH_LENGTH_EXCEEDED')]
g_path = df[df.g_code == 'PATH_FOUND']
print("Number of MC tests passed: %d (%.1f%%)" % (len(mc_path), 100*len(mc_path)/len(df)))
print("Number of Graph tests passed: %d (%.1f%%)" % (len(g_path), 100*len(g_path)/len(df)))

Next, the cases where both the ModelChecker and the Graph found a path:

In [None]:
both_path = mc_path[mc_path.g_code == 'PATH_FOUND']
print("Number of tests passed by both MC and G: %d" % len(both_path))
both_path

Does the ModelChecker ever find a path where there is none in the graph? Apparently not, which is reassuring: because the graph ignores mechanistic detail that the ModelChecker takes into account, the paths found by the ModelChecker should be a subset of those in the directed graph:

In [None]:
mc_path[mc_path.g_code != 'PATH_FOUND']

As an additional sanity check, we look for cases where the ModelChecker yields a *shorter* path than the shortest path in the graph, which should not happen (and doesn't). The rows in the table below consist only of 

In [None]:
df[(df.mc_code == 'PATHS_FOUND') & (df.mc_path_len < df.g_path_len)]
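
The two sanity checks above can also be written as explicit invariants over the result rows. The following is a minimal sketch on hypothetical rows that carry the same field names as the DataFrame columns; `find_violations` is an illustrative helper, and for simplicity it treats only `PATHS_FOUND` as a ModelChecker pass.

```python
# Hypothetical rows with the same fields as the DataFrame columns
toy_rows = [
    {'mc_code': 'PATHS_FOUND', 'mc_path_len': 3,
     'g_code': 'PATH_FOUND', 'g_path_len': 2},
    {'mc_code': 'OBSERVABLES_NOT_FOUND', 'mc_path_len': 0,
     'g_code': 'PATH_FOUND', 'g_path_len': 4},
    {'mc_code': 'STATEMENT_TYPE_NOT_HANDLED', 'mc_path_len': 0,
     'g_code': 'NO_PATH', 'g_path_len': 0},
]


def find_violations(rows):
    """Return (label, row) pairs for rows that break either expectation."""
    violations = []
    for row in rows:
        if row['mc_code'] != 'PATHS_FOUND':
            continue
        if row['g_code'] != 'PATH_FOUND':
            # The ModelChecker found a path but the graph has none
            violations.append(('mc_path_without_graph_path', row))
        elif row['mc_path_len'] < row['g_path_len']:
            # The ModelChecker path is shorter than the graph's shortest path
            violations.append(('mc_path_shorter_than_graph_path', row))
    return violations


violations = find_violations(toy_rows)
print('Violations found: %d' % len(violations))
```

An empty list corresponds to the two empty tables expected from the filters above.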

Next, how often do tests that fail in the ModelChecker pass in the graph? Here any result code other than `PATHS_FOUND` counts as a ModelChecker failure.

In [None]:
g_only_path = df[(df.mc_code != 'PATHS_FOUND') & (df.g_code == 'PATH_FOUND')]
mc_fail = len(df[df.mc_code != 'PATHS_FOUND'])
g_only = len(g_only_path)
ratio = g_only / mc_fail
print("Number of MC tests failed: %d" % mc_fail)
print("Number of tests failed by MC and passed by G: %d" % g_only)
print("Ratio: %.3f" % ratio)
g_only_path


So approximately half of the tests that fail in the ModelChecker pass in the directed graph. Inspection of the specific cases highlights a few recurring scenarios.

First, **statements involving protein families.** Some tests involve protein families rather than specific genes, e.g. "BRAF activates ERK" rather than a statement about the specific genes MAPK1 or MAPK3. One possibility is that the detailed PySB model doesn't contain sufficiently detailed mechanistic statements involving families to find a connection. The directed graph, however, contains a number of (sometimes indirect) edges that create paths satisfying the tests.

In [None]:
g_only_path[g_only_path.obj == 'ERK']

Second, **statement types not handled by the ModelChecker.** Complex statements, which describe binding between two proteins, are currently not handled by the INDRA ModelChecker for technical reasons. These tests were often trivially satisfied by links in the graph.

In [None]:
g_only_path[g_only_path.mc_code == 'STATEMENT_TYPE_NOT_HANDLED']

Third, cases where the **specific state of the subject or object was not found in the model**. These represent cases where the graph is explicitly ignoring details of the subject or object state in determining the existence of a causal link. Determining the proportion of paths produced by the directed graph that are mechanistically defensible will require inspection of the provenance of the underlying statements.

In [None]:
g_only_path[(g_only_path.mc_code == 'SUBJECT_MONOMERS_NOT_FOUND') |
            (g_only_path.mc_code == 'OBSERVABLES_NOT_FOUND')]
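
As a starting point for the provenance inspection mentioned above, each edge along a graph path can be looked up in an edge-to-statements map like the `provenance` dictionary built earlier. The sketch below uses hypothetical data, with strings standing in for INDRA Statement objects, and `path_provenance` is an illustrative helper name. For a row of the DataFrame, the node list could be recovered with `row.shortest_path.split(' -> ')`.

```python
# Hypothetical provenance map: (subject, object) edge -> supporting statements.
# Strings stand in for the INDRA Statement objects stored in the notebook.
toy_provenance = {
    ('KRAS', 'BRAF'): ['Complex(KRAS(), BRAF())'],
    ('BRAF', 'MAP2K1'): ['Phosphorylation(BRAF(), MAP2K1())',
                         'Activation(BRAF(), MAP2K1())'],
    ('MAP2K1', 'MAPK1'): ['Phosphorylation(MAP2K1(), MAPK1())'],
}


def path_provenance(path, provenance):
    """Pair each edge along a node path with its supporting statements."""
    edges = zip(path, path[1:])
    # .get avoids inserting empty entries when `provenance` is a defaultdict
    return [(edge, provenance.get(edge, [])) for edge in edges]


toy_path = ['KRAS', 'BRAF', 'MAP2K1', 'MAPK1']
support = path_provenance(toy_path, toy_provenance)
for (subj, obj), stmts_for_edge in support:
    print('%s -> %s: %d statement(s)' % (subj, obj, len(stmts_for_edge)))
```

Reviewing the statements behind each edge in this way is one route to judging whether a graph-only path is mechanistically defensible or an artifact of ignoring agent state.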