# Path Data and Higher-Order De Bruijn Graphs

## Motivation and Learning Objective

While `pathpyG` is useful for handling static graphs, its main advantage, as the name suggests, is that it facilitates the GPU-based analysis of time series data capturing paths in networks. As we shall see in the following tutorial, there are various types of data that naturally give rise to paths, including (random) walks or trajectories in networks, traces of dynamical processes giving rise to directed acyclic graphs, or temporal graphs with time-stamped edges. `pathpyG` allows us to model patterns in such data based on higher-order De Bruijn graph models, a modelling framework that naturally extends well-known graph models to causal topologies in temporal graphs.

In this first unit, we will introduce the `PathData` class, which can be used to store data on walks and directed acyclic graphs (DAGs). We will show how such data are internally stored as torch tensors, and how this approach facilitates the GPU-based generation of higher-order De Bruijn graph models, a modelling approach that is able to capture higher-order correlations in time series data and that facilitates graph learning and network analysis tasks.

As before, we first import the modules `torch` and `pathpyG`. By setting the device used by `torch`, we can specify whether we want to run our code on the CPU or on the GPU. If you want to run your code on the GPU, just set the device to `cuda`.

In [1]:
import torch
import pathpyG as pp

pp.config['torch']['device'] = 'cpu'

## Modelling Walks in a Graph

Let us consider a simple graph consisting of five nodes `a`, `b`, `c`, `d`, `e` and four edges. The graph is shown below:

In [3]:
g = pp.Graph.from_edge_list([['a', 'c'], ['b', 'c'], ['c', 'd'], ['c', 'e']])
pp.plot(g);

Let us now assume that we have additional data that captures two observations each for the following two paths (or walks) of length two:

- 2 x `a` -> `c` -> `d`  
- 2 x `b` -> `c` -> `e` 

Note that we define the length of a path or walk as the number of edges that are traversed, i.e. a sequence that consists of a single node, e.g. `a` is a walk or path of length zero, while every edge in a graph is a walk or path of length one.

The `PathData` object in `pathpyG` can be used to store such observations of paths or walks of length $l$ in the form of a tensor with shape $(2,l)$. To manually construct an instance of `PathData`, we can create an empty instance and then add observations.

In the following, we refer to nodes by their indices. The graph above provides the following mapping of node indices to node IDs:

In [20]:
g.node_index_to_id

{0: 'a', 1: 'c', 2: 'b', 3: 'd', 4: 'e'}

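The tensors in the remainder of this section assume the index assignment `a`=0, `b`=1, `c`=2, `d`=3, `e`=4 (note that the mapping printed above swaps the indices of `b` and `c`). With this assignment, we can use the following ordered `edge_index` with source and target node indices to represent the walk `a` -> `c` -> `d`: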

`[[0,2],` # source node indices for `a` and `c`  
 ` [2,3]]`   # target node indices for `c` and `d`

It is easy to see that this representation naturally extends the `edge_index` semantics of `pyG`. It will further allow us to efficiently generate higher-order De Bruijn graph models for path data based on a convolution operation that can actually be executed on the GPU. 
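
As a minimal sketch of this representation, the following plain-Python snippet turns a sequence of node indices into the two ordered rows of source and target indices. It uses lists instead of tensors, and the helper name `walk_to_edge_index` is hypothetical and not part of `pathpyG`.

```python
def walk_to_edge_index(node_seq):
    """Return [sources, targets] for a walk given as a sequence of node indices."""
    if len(node_seq) < 2:
        raise ValueError("a walk with at least one edge needs two or more nodes")
    sources = list(node_seq[:-1])  # every node except the last starts an edge
    targets = list(node_seq[1:])   # every node except the first ends an edge
    return [sources, targets]


# a -> c -> d, assuming the indices a=0, c=2, d=3 used in the text
edge_index = walk_to_edge_index([0, 2, 3])
print(edge_index)          # [[0, 2], [2, 3]]
print(len(edge_index[0]))  # 2, the walk length l (number of traversed edges)
```

The number of columns equals the walk length $l$, which matches the shape $(2,l)$ described above.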

For our data set, we can add two walks as follows, where each walk is observed two times.

In [106]:
paths = pp.PathData()
paths.add_walk(torch.tensor([[0,2],[2,3]]), freq=2) # a -> c -> d
paths.add_walk(torch.tensor([[1,2],[2,4]]), freq=2) # b -> c -> e

After adding the two walks, each observed twice, we can directly access an aggregate graph representation of the paths:

In [53]:
print(f'Paths traverse {paths.num_nodes} nodes via {paths.num_edges} edges')
print(paths.edge_index)

Paths traverse 5 nodes via 4 edges
tensor([[0, 1, 2, 2],
        [2, 2, 3, 4]])


We can also create a weighted `edge_index` where each traversed edge occurs only once, and the weights capture the number of times each edge is traversed. The `edge_index_weighted` attribute yields two tensors, the first one being the `edge_index` and the second one being the edge weights.

In [54]:
print(paths.edge_index_weighted)

(tensor([[0, 1, 2, 2],
        [2, 2, 3, 4]]), tensor([2., 2., 2., 2.]))
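
The aggregation behind this weighted representation can be illustrated without `pathpyG`. The following is a plain-Python sketch, assuming that each observation is given as a `[sources, targets]` list together with its frequency. The function name `weighted_edges` is hypothetical, and the sketch says nothing about how `pathpyG` computes the result internally.

```python
from collections import Counter


def weighted_edges(observations):
    """Sum observation frequencies per traversed edge.

    observations: iterable of (edge_index, freq) pairs, where edge_index
    is a [sources, targets] pair of equally long lists.
    """
    weights = Counter()
    for (sources, targets), freq in observations:
        for u, v in zip(sources, targets):
            weights[(u, v)] += freq
    return dict(sorted(weights.items()))


observations = [
    ([[0, 2], [2, 3]], 2),  # a -> c -> d, observed twice
    ([[1, 2], [2, 4]], 2),  # b -> c -> e, observed twice
]
print(weighted_edges(observations))
# {(0, 2): 2, (1, 2): 2, (2, 3): 2, (2, 4): 2}
```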


## Loading Empirical Walks from N-Gram Files

Naturally, for real data on observed walks in graphs it is not convenient to manually construct and add walks based on edge tensors. For such data, we can load paths from an n-gram file, i.e. a text file where each line corresponds to one observed path consisting of comma-separated node IDs. The last component of each line is considered to be the observation count of that particular walk.

As an example, the file `data/tube_paths_train.ngram` contains observed passenger itineraries between stations in the London Tube network, along with their observation frequencies. The following is an excerpt from that file:

```
Southwark,Waterloo,212.0
Liverpool Street,Bank / Monument,1271.0
Barking,West Ham,283.0
Tufnell Park,Kentish Town,103.0
...
```

In [55]:
paths_tube = pp.PathData.from_csv('../data/tube_paths_train.ngram', sep=',')
print(paths_tube)
print(f'London Tube network has {paths_tube.num_nodes} nodes and {paths_tube.num_edges} edges.')

PathData with 61748 walks and 0 dags
London Tube network has 268 nodes and 646 edges.
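
To make the n-gram format concrete, here is a minimal plain-Python sketch that splits a single line into its node IDs and its observation count. The helper `parse_ngram_line` is hypothetical and only illustrates the file layout described above. It is not the implementation of `PathData.from_csv`.

```python
def parse_ngram_line(line, sep=","):
    """Split one n-gram line into (tuple of node IDs, observation count)."""
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) < 2:
        raise ValueError(f"expected at least one node and a count: {line!r}")
    *nodes, count = fields  # the last field is the observation count
    return tuple(nodes), float(count)


excerpt = """Southwark,Waterloo,212.0
Liverpool Street,Bank / Monument,1271.0"""

for line in excerpt.splitlines():
    print(parse_ngram_line(line))
# (('Southwark', 'Waterloo'), 212.0)
# (('Liverpool Street', 'Bank / Monument'), 1271.0)
```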


In the example above, our observations exclusively consisted of walks, i.e. simple sequences of traversed nodes. Let's now have a look at how walks are internally stored in the `PathData` object. Each observation in the `PathData` object is internally assigned an integer identifier, and there are three dictionaries that store (i) the tensors capturing the observed paths, (ii) their frequencies, and (iii) the type of each observation.

In [36]:
paths.paths

{0: tensor([[0, 2],
         [2, 3]]),
 1: tensor([[1, 2],
         [2, 4]])}

For the sake of convenience, there is a helper function that can be used to transform the tensor-based `edge_index` representation of a given walk into a simple sequence of traversed nodes, i.e. for the walk with index 0 above we get the node index sequence `[0,2,3]`:

In [48]:
pp.PathData.walk_to_node_seq(paths.paths[0])

tensor([0, 2, 3])
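
The conversion itself is simple. The following plain-Python sketch shows one way to recover the node sequence from a `[sources, targets]` pair, assuming list inputs rather than tensors. The function name `edge_index_to_node_seq` is hypothetical, and the sketch is not the code behind `walk_to_node_seq`.

```python
def edge_index_to_node_seq(edge_index):
    """Recover the traversed node sequence from a walk's [sources, targets]."""
    sources, targets = edge_index
    if not sources or len(sources) != len(targets):
        raise ValueError("need equally long, non-empty source and target rows")
    # In a walk, every edge must start where the previous edge ended.
    for i in range(len(sources) - 1):
        if targets[i] != sources[i + 1]:
            raise ValueError("edge index does not describe a walk")
    return [sources[0]] + list(targets)


print(edge_index_to_node_seq([[0, 2], [2, 3]]))  # [0, 2, 3]
```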

The frequencies of walks are stored in the dictionary `path_freq`.

In [56]:
paths.path_freq

{0: 2, 1: 2}

There is a third dictionary that stores the type of paths:

In [57]:
paths.path_types

{0: <PathType.WALK: 0>, 1: <PathType.WALK: 0>}

## From Walks to DAGs

We see that this corresponds to our two walks, each being observed two times. But what is the point of explicitly storing the type of an observation as a walk? What other types of path data can be stored and analyzed in `pathpyG`? 

A major feature of `pathpyG` is that it allows us to model the causal topology of temporal graph data, i.e. the topology of time-stamped edges by which nodes can causally influence each other via time-respecting paths. These are paths that must (minimally) follow the arrow of time.

As we shall see in the next tutorial, time-respecting paths in temporal graphs naturally give rise to directed acyclic graphs, where the directionality of edges is due to the directionality of the arrow of time. A very simple example for a DAG is one that consists of the following edges:

`a` -> `b`  
`b` -> `c`  
`b` -> `d`  

This DAG captures that node `a` causally influences node `b`, which in turn causally influences the two nodes `c` and `d`, potentially at a later point in time. In `pathpyG`, such a DAG can be represented by a topologically ordered edge index, where the order of edges corresponds to the topological ordering. For the example above, with node indices following the order `a`, `b`, `c`, `d`, we can thus add two observations of this DAG to a `PathData` object as follows:

In [109]:
paths = pp.PathData()
paths.node_id = ['a', 'b', 'c', 'd']
paths.add_dag(torch.tensor([[0, 1, 1],
                            [1, 2, 3]]),
                        freq=2)

In `pathpyG`, we can mix observations of walks and DAGs, i.e. we can additionally add two observations of a walk `a` -> `b` -> `c` -> `d` to our `PathData` object as follows:

In [110]:
paths.add_walk(torch.tensor([[0, 1, 2], 
                             [1, 2, 3]]), 
                             freq=2)

We can now again inspect the internal dictionaries holding our data, which now consists of two tensors with different types:

In [111]:
print(paths.paths)
print(paths.path_types)
print(paths.path_freq)

{0: tensor([[0, 1, 1],
        [1, 2, 3]]), 1: tensor([[0, 1, 2],
        [1, 2, 3]])}
{0: <PathType.DAG: 1>, 1: <PathType.WALK: 0>}
{0: 2, 1: 2}


At first glance, it may seem unnecessary to distinguish between walks and DAGs, as a walk is simply a special type of DAG in which all nodes have in- and out-degrees of at most one. And indeed, you could simply ignore this distinction and store both walks and DAGs as DAGs. Nevertheless, `pathpyG` explicitly distinguishes between the two types of path data, since some downstream operations, specifically the creation of higher-order De Bruijn graph models, are much faster for walks (which are essentially just sequences of nodes) than for DAGs (which can have arbitrarily complex structures). It is thus advisable to explicitly store walk data as WALK. To improve the scalability of path calculations in temporal graphs, `pathpyG` will further detect whether the resulting paths actually require a DAG representation, automatically choosing the most efficient representation.
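
The degree criterion mentioned above can be checked with a few lines of plain Python. The sketch below assumes list-based `[sources, targets]` input, and the function name `satisfies_walk_degree_criterion` is hypothetical. It only illustrates the criterion from the paragraph above and is not the detection logic used by `pathpyG`.

```python
from collections import Counter


def satisfies_walk_degree_criterion(edge_index):
    """Check that every node has in-degree and out-degree of at most one.

    Limitation: this covers only the degree criterion stated in the text.
    It does not verify that the edges are chained, and a walk that
    revisits a node would not pass.
    """
    sources, targets = edge_index
    out_degree = Counter(sources)  # number of outgoing edges per node
    in_degree = Counter(targets)   # number of incoming edges per node
    max_out = max(out_degree.values(), default=0)
    max_in = max(in_degree.values(), default=0)
    return max_out <= 1 and max_in <= 1


dag = [[0, 1, 1], [1, 2, 3]]   # a -> b, b -> c, b -> d
walk = [[0, 1, 2], [1, 2, 3]]  # a -> b -> c -> d
print(satisfies_walk_degree_criterion(dag))   # False, node b has out-degree two
print(satisfies_walk_degree_criterion(walk))  # True
```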

## Higher-Order De Bruijn Graph Models of Paths

As shown before, we can use the `PathData` class to easily generate an `edge_index` tensor of a weighted graph representation, which essentially aggregates all of the observed walks and DAGs into a weighted static graph. For the example above, this graph has four nodes connected by four edges that have weights four and two.

In [112]:
edge_index, edge_weight = paths.edge_index_weighted
print(edge_index)
print(edge_weight)

tensor([[0, 1, 1, 2],
        [1, 2, 3, 3]])
tensor([4., 4., 2., 2.])


Let's have a look at a visualization of this graph:

In [113]:
g = pp.Graph(edge_index, edge_weight=edge_weight, node_id = paths.node_id)
pp.plot(g);

This is a first-order graph representation, as it only captures the (weighted) edges in the underlying path data, i.e. we could say that we only count the paths of length one. This naturally gives rise to an `edge_index` tensor with shape $(2,m)$, where $m$ is the number of unique edges in the graph that are traversed by the paths.

A key feature of `pathpyG` is that it allows us to generalize this first-order modelling perspective to $k$-th order De Bruijn graph models for paths, where the nodes in a $k$-th order De Bruijn graph model are sequences of $k$ nodes. Edges connect pairs of nodes that overlap in $k-1$ nodes and capture paths of length $k$.

A De Bruijn graph of order $k=1$ is simply a normal (weighted) graph consisting of nodes and edges. Here the pairs of nodes connected by edges overlap in $k-1=0$ nodes and capture paths of length $k=1$, i.e. simple edges in the underlying path data.

For a De Bruijn graph of order $k=2$, in our example above, an edge could connect the pair of nodes $[a,b]$ and $[b,c]$, which overlap in the $k-1=1$ node $b$. Such an edge would represent the path $a \to b \to c$ of length two.
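
For walks, this definition amounts to sliding a window of $k+1$ nodes over each node sequence. The following plain-Python sketch illustrates this for the walk `a` -> `b` -> `c` -> `d` observed twice. It handles walks only (no DAGs), the function name `kth_order_edges` is hypothetical, and it is not the GPU-based procedure used by `pathpyG`.

```python
from collections import Counter


def kth_order_edges(walks, k):
    """Count k-th order De Bruijn edges in walks given as (node_seq, freq)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    edges = Counter()
    for node_seq, freq in walks:
        # Each window of k + 1 consecutive nodes is a path of length k.
        for i in range(len(node_seq) - k):
            window = tuple(node_seq[i:i + k + 1])
            # Source and target are the first and last k nodes of the
            # window, so they overlap in k - 1 nodes.
            edges[(window[:-1], window[1:])] += freq
    return dict(edges)


walks = [(["a", "b", "c", "d"], 2)]
print(kth_order_edges(walks, k=1))
# {(('a',), ('b',)): 2, (('b',), ('c',)): 2, (('c',), ('d',)): 2}
print(kth_order_edges(walks, k=2))
# {(('a', 'b'), ('b', 'c')): 2, (('b', 'c'), ('c', 'd')): 2}
```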

In `pathpyG`, we can directly calculate the `edge_index` for a $k$-th order De Bruijn graph model of the paths contained in a `PathData` object. For our example, we can do this as follows:

In [114]:
i, w = paths.edge_index_k_weighted(k=2)
print('higher-order edges =', i)
print('weights =', w)

higher-order edges = tensor([[[0, 1],
         [0, 1],
         [1, 2]],

        [[1, 2],
         [1, 3],
         [2, 3]]])
weights = tensor([4., 2., 2.])
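
These weights can be checked by hand. The plain-Python sketch below enumerates all paths of length $k$ in each observation and sums their frequencies. It assumes that every observation is acyclic, so the walk is treated like a DAG. The function names are hypothetical, and this brute-force enumeration is only an illustration, not the method implemented in `pathpyG`.

```python
from collections import Counter


def paths_of_length_k(edges, k):
    """Enumerate all paths with k edges in a DAG given as a list of (u, v)."""
    paths = [tuple(edge) for edge in edges]  # paths of length one
    for _ in range(k - 1):
        # Extend each path by every edge that starts at its last node.
        paths = [p + (v,) for p in paths for (u, v) in edges if p[-1] == u]
    return paths


def higher_order_weights(observations, k):
    """Sum frequencies of k-th order edges over (edge_index, freq) pairs."""
    weights = Counter()
    for edge_index, freq in observations:
        edges = list(zip(*edge_index))  # [sources, targets] -> [(u, v), ...]
        for p in paths_of_length_k(edges, k):
            weights[(p[:-1], p[1:])] += freq
    return dict(sorted(weights.items()))


observations = [
    ([[0, 1, 1], [1, 2, 3]], 2),  # DAG: a -> b, b -> c, b -> d
    ([[0, 1, 2], [1, 2, 3]], 2),  # walk: a -> b -> c -> d
]
print(higher_order_weights(observations, k=2))
# {((0, 1), (1, 2)): 4, ((0, 1), (1, 3)): 2, ((1, 2), (2, 3)): 2}
```

The path `a` -> `b` -> `c` occurs in both the DAG and the walk, which explains its weight of four.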


Naturally extending the `pyG`-style `edge_index` to a higher-dimensional representation, the `edge_index` of a $k$-th order De Bruijn graph model with $m$ edges has the shape $(2,m,k)$, i.e. it consists of a source and a target tensor with $m$ entries each, where each entry is a $k$-dimensional tensor that contains the $k$ first-order nodes that constitute the higher-order node. For the example above, each higher-order node is thus a tensor with two elements.

While the details go well beyond the scope of this tutorial, thanks to the tensor-based representation of walks and DAGs in the `PathData` object, the construction of a higher-order De Bruijn graph model can be done based on efficient GPU operations, i.e. we can scale up the models to large graphs.
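
The layout can be inspected with nested Python lists that mirror the printed second-order tensor. The sketch below is illustrative only: `nested_shape` is a hypothetical helper, and the values are copied from the output above.

```python
def nested_shape(x):
    """Return the shape of a regularly nested list, e.g. (2, m, k)."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        if not x:
            break
        x = x[0]
    return tuple(shape)


edge_index_2 = [
    [[0, 1], [0, 1], [1, 2]],  # source higher-order nodes
    [[1, 2], [1, 3], [2, 3]],  # target higher-order nodes
]
node_id = ["a", "b", "c", "d"]

print(nested_shape(edge_index_2))  # (2, 3, 2), i.e. (2, m, k) with m=3, k=2

# The distinct higher-order nodes are the unique entries of both rows.
sources, targets = edge_index_2
nodes = sorted({tuple(n) for n in sources + targets})
print([tuple(node_id[i] for i in n) for n in nodes])
# [('a', 'b'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```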

We can use paths to generate $k$-th order graphs that can be used for GNNs. To make access to nodes and edges convenient, we can pass a `node_id` mapping that assigns string IDs to the indices of the first-order nodes:

In [142]:
g2 = pp.HigherOrderGraph(paths, order=2, node_id=paths.node_id)
print(g2)
pp.plot(g2);

HigherOrderGraph (k=2) with 4 nodes and 3 edges
	Total edge weight = 8.0
Node attributes
	node_id		<class 'list'>

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([3])

Graph attributes
	num_nodes		<class 'int'>



Just like for a "normal" first-order graph, we can iterate through the nodes of a higher-order graph, which are tuples with $k$ elements:

In [128]:
for n in g2.nodes:
    print(n)

('a', 'b')
('b', 'c')
('b', 'd')
('c', 'd')


Edges are tuples with two elements, where each element is a $k$-th order node:

In [129]:
for e in g2.edges:
    print(e)

(('a', 'b'), ('b', 'c'))
(('a', 'b'), ('b', 'd'))
(('b', 'c'), ('c', 'd'))


The `edge_weight` attribute stores a tensor whose entries capture the frequencies of edges, i.e. the frequencies of paths of length $k$.

In [133]:
for e in g2.edges:
    print(e, g2['edge_weight', e[0], e[1]].item())

(('a', 'b'), ('b', 'c')) 4.0
(('a', 'b'), ('b', 'd')) 2.0
(('b', 'c'), ('c', 'd')) 2.0


In [134]:
g3 = pp.HigherOrderGraph(paths, order=3, node_id=paths.node_id)
print(g3)

HigherOrderGraph (k=3) with 2 nodes and 1 edges
	Total edge weight = 2.0
Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([1])

Graph attributes
	node_id		<class 'list'>
	num_nodes		<class 'int'>



In [140]:
edge_index, edge_weight = paths.edge_index_k_weighted(k=3)
print(edge_index)
print(edge_weight)

tensor([[[0, 1, 2]],

        [[1, 2, 3]]])
tensor([2.])


In [141]:
for e in g3.edges:
    print(e, g3['edge_weight', e[0], e[1]].item())

(('a', 'b', 'c'), ('b', 'c', 'd')) 2.0


As we shall see in the following tutorial, the `PathData` and `HigherOrderGraph` classes are the basis for the GPU-based analysis and modelling of causal structures in temporal graphs. In particular, the underlying generalization of first-order static graph models to higher-order De Bruijn graphs allows us to easily build causality-aware graph neural network architectures that consider both the topology and the temporal ordering of time-stamped edges in a temporal graph.