# Path profiling with sketches

If we want to characterize how control moves through a graph over time, _path profiling_ is a general approach that captures detail that node or edge profiling might miss.  Which engineering constraints matter most depends on the concrete application:  do we want to count a small number of paths quickly and precisely, or a much larger number of paths scalably?

The classic approach of [Ball and Larus](https://dl.acm.org/citation.cfm?id=243857) ([pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.138.7451&rep=rep1&type=pdf)) provides an efficient technique for tracking acyclic intraprocedural paths through the control-flow graphs (CFGs) of compiled programs.  Essentially, each possible path is represented by an integer, which is built up at run time by adding precomputed increments to a path register as control crosses edges of the CFG.  (Their technique uses a spanning tree of the CFG so that it doesn't have to instrument _every_ edge.)  The path number then indexes into an array of counters (for functions with a small number of possible paths) or into a (precise) hash table of counters (for functions with a large number of possible paths but sparsely-distributed actual paths).
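
As a rough illustration of the numbering idea, here is a minimal sketch of the edge-value assignment on a small acyclic graph.  It is not the paper's full instrumentation scheme (there is no spanning-tree optimization here), and the names `ball_larus_numbering`, `path_id`, `all_paths`, and the example graph `succ` are hypothetical helpers introduced only for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def ball_larus_numbering(succ):
    """succ maps each node of an acyclic graph to a list of its successors.
    Returns (num_paths, edge_val): the number of paths from each node to a
    leaf, and the increment assigned to each edge."""
    # TopologicalSorter expects node -> predecessors; feeding it successors
    # instead yields an order in which every successor precedes its source.
    order = TopologicalSorter(succ).static_order()
    num_paths, edge_val = {}, {}
    for v in order:
        targets = succ.get(v, [])
        if not targets:
            num_paths[v] = 1  # a leaf ends exactly one path
            continue
        total = 0
        for w in targets:
            edge_val[(v, w)] = total  # paths through w get ids starting here
            total += num_paths[w]
        num_paths[v] = total
    return num_paths, edge_val

def path_id(path, edge_val):
    """Sum the edge increments along a path (a list of nodes)."""
    return sum(edge_val[(a, b)] for a, b in zip(path, path[1:]))

def all_paths(succ, node):
    """Enumerate every path from node to a leaf (for checking only)."""
    if not succ.get(node):
        yield [node]
        return
    for w in succ[node]:
        for rest in all_paths(succ, w):
            yield [node] + rest

# hypothetical example graph with six entry-to-exit paths
succ = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"],
        "D": ["E", "F"], "E": ["F"], "F": []}
num_paths, edge_val = ball_larus_numbering(succ)
ids = {"".join(p): path_id(p, edge_val) for p in all_paths(succ, "A")}
```

In this sketch every path from `A` to `F` receives a distinct id between 0 and `num_paths["A"] - 1`, so the id can index directly into an array of counters.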

In this notebook, we'll evaluate path profiling for a different problem with different constraints:  profiling the top paths to exceptional states in synthetic business process flowcharts.  The charts resemble function control-flow graphs but may have substantially more nodes than a typical compiled function and (especially since we may want to explicitly consider cycles, at least up to a certain loop bound) substantially longer paths.  We also would like to maintain a separate profile for the paths leading to each terminal node, so that we can identify the most likely ways to get to a particular exceptional state and use this information to refine our processes.
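
To make the "separate profile per terminal node" idea concrete, here is a minimal sketch that keeps one exact counter per terminal node.  The class name `TerminalProfiles` and its methods are hypothetical, and the exact `Counter` is only an assumption standing in for whatever counting structure is ultimately used.

```python
from collections import Counter, defaultdict

class TerminalProfiles:
    """One path counter per terminal node (hypothetical helper)."""

    def __init__(self):
        self.by_terminal = defaultdict(Counter)

    def record(self, path):
        # the last node of a path is the terminal state it reached
        self.by_terminal[path[-1]][tuple(path)] += 1

    def top(self, terminal, k=3):
        # the k most frequent paths ending at this terminal node
        counter = self.by_terminal.get(terminal, Counter())
        return counter.most_common(k)

profiles = TerminalProfiles()
for p in ([0, 1, 5], [0, 1, 5], [0, 2, 5], [0, 3]):
    profiles.record(p)
```

Calling `profiles.top(5, k=1)` then returns the single most frequent path that ended at node 5, together with its count.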

Because 

In [1]:
import networkx as nx
from ipysigma import Sigma

In [2]:
def synthetic_flowchart(size=100):
    import numpy as np
    
    def insert_step(g, f, t, n):
        """ inserts n as an extra node between f and t; removes an existing edge between f and t """
        g.add_edge(f, n)
        g.add_edge(n, t)
        g.remove_edge(f, t)
        
    def insert_branch(g, f, t, n):
        """ inserts n as an alternate route from f to t; keeps the existing edge between f and t """
        g.add_edge(f, n)
        g.add_edge(n, t)
    
    def insert_leaf(g, f, n):
        """ inserts n as a leaf node with an edge from f """
        g.add_edge(f, n)
        
    def create_loop(g, f, t):
        """ inserts a back edge from t to f """
        g.add_edge(t, f)
    
    def choose_edge(g):
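        # materialize the edge list and pick a uniformly random index
        # (np.random.choice requires a 1-D array, so it can't sample (f, t) tuples directly)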
        edges = list(g.edges())
        return edges[np.random.randint(len(edges))]
    
    def choose_node(g):
        nodes = list(g)
        return np.random.choice(nodes)
        
    g = nx.DiGraph()
    g.add_edge(0, 1)
    
    next_node = 2
    while next_node <= size:
        # pick a mutation at random: ~31% step, 30% branch, 20% loop, 19% leaf
        which = np.random.randint(100)
        if which <= 30:
            f, t = choose_edge(g)
            insert_step(g, f, t, next_node)
            next_node = next_node + 1
        elif which <= 60:
            f, t = choose_edge(g)
            insert_branch(g, f, t, next_node)
            next_node = next_node + 1
        elif which <= 80:
            f, t = choose_edge(g)
            create_loop(g, f, t)
        else:
            insert_leaf(g, choose_node(g), next_node)
            next_node = next_node + 1
    
    return g

In [3]:
g = synthetic_flowchart()

In [4]:
Sigma(g)

In [5]:
def path(g, bound=100):
    """ find a random path of at most bound nodes through a directed graph, starting at node 0 """
    import numpy as np
    
    nodes = [0]
    while len(nodes) < bound:
        # out-edges of the current node, as (current, successor) pairs
        edges = list(g.edges(nodes[-1]))
        if len(edges) == 0:
            # reached a terminal node
            return nodes
        _, n = edges[np.random.randint(len(edges))]
        nodes.append(n)
    return nodes

def unipath(g, bound=100):
    """ find a random path of at most bound nodes through a directed graph; 
        return a representation of the path as a Unicode string (so it's hashable) """
    p = path(g, bound)
    # map node id c to the code point 0x1f600 + c; assumes small non-negative integer ids
    return "".join(chr(0x1f600 + c) for c in p)

In [6]:
unipath(g)

'😀🙒😂🙆😟😂😃'
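
To read a path string like the one above back as a list of node ids, the encoding can be inverted by subtracting the same offset.  This is a minimal sketch; `encode_path` and `decode_unipath` are hypothetical helper names, and it assumes the same `0x1f600` offset that `unipath` uses.

```python
BASE = 0x1F600  # same offset as in unipath

def encode_path(nodes):
    """Encode a list of small non-negative node ids as a Unicode string."""
    return "".join(chr(BASE + n) for n in nodes)

def decode_unipath(s):
    """Recover the list of node ids from an encoded path string."""
    return [ord(ch) - BASE for ch in s]

example = [0, 82, 2, 70]
roundtrip = decode_unipath(encode_path(example))
```

Because each node maps to exactly one character, the length of the string equals the number of nodes on the path, and `roundtrip` equals `example`.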