# Multi-Order Models

[Run notebook in Google Colab](https://colab.research.google.com/github/pathpy/pathpy/blob/master/doc/tutorial/multi_order_models.ipynb)  
[Download notebook](https://github.com/pathpy/pathpy/raw/master/doc/tutorial/multi_order_models.ipynb)


The examples given in the previous notebook on `Higher Order Models` are too simple in many ways. Real data are more complex: we typically observe paths of many different lengths at the same time, and such data are likely to exhibit multiple correlation lengths simultaneously.



In [None]:
pip install git+https://github.com/pathpy/pathpy.git

In [19]:
import pathpy as pp
from scipy.stats import chi2

Even more importantly, model selection on real data unfortunately does not work as described before. In fact, we cheated: in general, we cannot directly compare the likelihoods of models with different orders. The following example highlights this problem:

In [37]:
path = pp.Path('a','b','c','d','e','c','b','a','c','d','e','c','e','d','c','a')

pc = pp.PathCollection()
pc.add(path)
print(pc)
print(pc.counter)

hon_1 = pp.HigherOrderNetwork.from_paths(pc, order=1)
hon_2 = pp.HigherOrderNetwork.from_paths(pc, order=2)
hon_5 = pp.HigherOrderNetwork.from_paths(pc, order=5)

print(hon_1.likelihood(pc, log=False))
print(hon_2.likelihood(pc, log=False))
print(hon_5.likelihood(pc, log=False))

{Path ('a', 'b', 'c', 'd', 'e', 'c', 'b', 'a', 'c', 'd', 'e', 'c', 'e', 'd', 'c', 'a')}
PathPyCounter({'0x7fb4f07edb80': 1})
1.7558299039780557e-06
0.25
1.0


Shouldn't the likelihoods of these three models be identical? They are not, and this is a major issue when the data consist of large numbers of short paths. In terms of the number of transitions that enter the likelihood calculation, a model of order $k$ discards the first $k$ nodes on each path. For example, a second-order model accounts for every edge traversal on a path except the first one. In the general case we are therefore comparing likelihoods computed on different sample spaces, which is not valid. Let us highlight this by calculating the number of transitions that enter the likelihood calculation:

In [16]:
print('Path consists of {0} transitions'.format(len(path)))

Path consists of 15 transitions
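
The mismatch in sample spaces can be made concrete with a few lines of plain Python. This is a minimal sketch that does not use pathpy: the helper `transitions_used` is a hypothetical name introduced here, and it simply assumes that a model of order $k$ needs the first $k$ nodes as its initial state before it can score a transition.

```python
# Same node sequence as the example path above (16 nodes, 15 edges).
path_nodes = ['a', 'b', 'c', 'd', 'e', 'c', 'b', 'a',
              'c', 'd', 'e', 'c', 'e', 'd', 'c', 'a']


def transitions_used(num_nodes, order):
    """Transitions an order-k model can score on a path with num_nodes nodes.

    Assumption: the first `order` nodes only serve as the initial state,
    so they do not contribute a transition probability of their own.
    """
    return max(num_nodes - order, 0)


for k in (1, 2, 5):
    print('order', k, '->', transitions_used(len(path_nodes), k), 'transitions')
```

For the 16-node example path this gives 15, 14 and 11 transitions for orders 1, 2 and 5, so the three likelihoods printed earlier are products over different numbers of factors.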


To fix the issues above, we need a probabilistic generative model that can deal with large collections of (short) paths in a network. The key idea is to combine multiple higher-order network models into a single multi-layered `multi-order model`. To calculate the likelihood of such a model we use all layers, which avoids discarding the prefixes of paths. For each path, we start at a layer of order zero, which captures the relative frequencies of nodes, and use it to calculate the probability of observing the first node on the path. For the transition to the second node we use the first-order layer, for the next transition the second-order layer, and so on, until we reach the maximum order of the multi-order model. All remaining transitions of the path are then calculated in the layer of maximum order.
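
Written as a formula, the layer-by-layer calculation described above looks as follows. The notation is an assumption introduced here for illustration: $p = (v_0, v_1, \ldots, v_l)$ is a path with $l \geq K$ transitions, $K$ is the maximum order, and $P^{(k)}$ denotes the transition probabilities of the layer of order $k$.

$$
P(p) = P^{(0)}(v_0) \, \prod_{i=1}^{K-1} P^{(i)}(v_i \mid v_0, \ldots, v_{i-1}) \, \prod_{i=K}^{l} P^{(K)}(v_i \mid v_{i-K}, \ldots, v_{i-1})
$$

For $K = 2$ this reduces to $P^{(0)}(v_0) \, P^{(1)}(v_1 \mid v_0) \, \prod_{i=2}^{l} P^{(2)}(v_i \mid v_{i-2}, v_{i-1})$, so every node on the path contributes exactly one factor, regardless of the maximum order.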

`pathpy` can directly generate, visualise, and analyze multi-order network models. Let us try this in our example:

In [31]:
mog = pp.MultiOrderModel.from_paths(pc, max_order=2)
print(mog)
print(mog.likelihood(pc, log=False))

Multi-order model
- General --------------------------------------------
layer  |        network        |         DoF         
order  |   nodes      edges    |   paths      ngrams  
     0 |          5          0 |          4          4
     1 |          5         12 |          7         20
     2 |         12         12 |         20        100
3.2921810699588516e-07
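
To see how the layers are chained, the calculation can be sketched in plain Python without pathpy. The sketch rests on stated assumptions: the functions `layer_probabilities` and `multi_order_likelihood` are hypothetical names introduced here, each layer uses simple maximum-likelihood estimates counted from the same path, and nothing is implied about how pathpy implements its model internally.

```python
from collections import Counter, defaultdict

# Same node sequence as the example path above.
nodes = ['a', 'b', 'c', 'd', 'e', 'c', 'b', 'a',
         'c', 'd', 'e', 'c', 'e', 'd', 'c', 'a']


def layer_probabilities(seq, order):
    """Estimate P(next node | previous `order` nodes) by relative frequencies.

    For order 0 the history is the empty tuple, so the result is simply
    the relative frequency of each node in the sequence.
    """
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        history = tuple(seq[i - order:i])
        counts[history][seq[i]] += 1
    return {
        history: {node: c / sum(nxt.values()) for node, c in nxt.items()}
        for history, nxt in counts.items()
    }


def multi_order_likelihood(seq, max_order):
    """Chain the layers: order 0 for the first node, order 1 for the second,
    and so on, then the layer of maximum order for all remaining nodes."""
    layers = [layer_probabilities(seq, k) for k in range(max_order + 1)]
    likelihood = 1.0
    for i, node in enumerate(seq):
        k = min(i, max_order)            # layer used for this step
        history = tuple(seq[i - k:i])    # the k nodes preceding position i
        likelihood *= layers[k][history][node]
    return likelihood


for max_order in (1, 2):
    print(max_order, f"{multi_order_likelihood(nodes, max_order):.3g}")
# prints:
# 1 3.29e-07
# 2 0.0234
```

In this example the order-1 value matches the likelihood printed by `mog.likelihood(pc, log=False)` above. That is an observation about these particular numbers, not a documented guarantee about pathpy's defaults.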


We can now use the likelihood function of the class `MultiOrderModel` to repeat our likelihood ratio test. Rather than generating multiple `MultiOrderModel` instances for different hypotheses, we can directly calculate likelihoods based on different model layers within the same `MultiOrderModel` instance.

In [32]:
mog = pp.MultiOrderModel.from_paths(pc, max_order=2)

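# difference in degrees of freedom between the order-2 and order-1 models
d = mog.degrees_of_freedom(order=2) - mog.degrees_of_freedom(order=1)
# likelihood ratio test statistic: -2 * (log L_null - log L_alternative)
x = - 2 * (mog.likelihood(pc, log=True, order=1)
    - mog.likelihood(pc, log=True, order=2))
# p-value from the chi-squared distribution with d degrees of freedom
p_val = 1 - chi2.cdf(x, d)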

print('p value of null hypothesis that data has maximum order 1 = {0}'.format(p_val))

p value of null hypothesis that data has maximum order 1 = 0.32202130203459367


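Observing the path once is not enough: with a p-value of about 0.32 we cannot reject the null hypothesis of a first-order model. Let us see what happens if the same path is observed five times:

In [38]:
# pretend that the same path was observed five times
pc.counter[path.uid] = 5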
print(pc.counter)
mog = pp.MultiOrderModel.from_paths(pc, max_order=2)

d = mog.degrees_of_freedom(order=2) - mog.degrees_of_freedom(order=1)
x = - 2 * (mog.likelihood(pc, log=True, order=1) 
    - mog.likelihood(pc, log=True, order=2))
p_val = 1 - chi2.cdf(x, d)

print('p value of null hypothesis that data has maximum order 1 = {0}'.format(p_val))

PathPyCounter({'0x7fb4f07edb80': 5})
p value of null hypothesis that data has maximum order 1 = 9.43689570931383e-15


We now find strong evidence against the null hypothesis that the paths can be explained by a first-order network model. We actually get a different p-value than before, because the multi-order model also includes a zero-order layer, i.e. it accounts for the relative frequencies at which nodes occur at the start of a path.
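
The likelihood ratio test performed in the two cells above can also be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `chi2_sf_even` and `likelihood_ratio_p_value` are hypothetical helpers introduced here, the closed form used for the chi-squared tail only holds for an even number of degrees of freedom (for the general case `scipy.stats.chi2.sf` would be the usual choice), and the log-likelihoods in the usage part are made-up placeholders rather than pathpy output.

```python
from math import exp


def chi2_sf_even(x, dof):
    """P(X > x) for a chi-squared variable with an even number of degrees
    of freedom: exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this sketch only supports positive, even dof")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j          # (x/2)^j / j!, built up incrementally
        total += term
    return exp(-half) * total


def likelihood_ratio_p_value(log_l_null, log_l_alt, dof_diff):
    """Return the test statistic and p-value for nested models."""
    statistic = -2.0 * (log_l_null - log_l_alt)
    return statistic, chi2_sf_even(statistic, dof_diff)


# Hypothetical log-likelihoods of the null (order 1) and alternative (order 2).
log_l_null, log_l_alt, dof_diff = -14.93, -3.75, 20

for repeats in (1, 5):
    # Observing the same data `repeats` times multiplies both log-likelihoods;
    # the degrees of freedom are assumed to stay the same.
    stat, p = likelihood_ratio_p_value(repeats * log_l_null,
                                       repeats * log_l_alt, dof_diff)
    print(repeats, round(stat, 2), p)
```

Because the test statistic grows linearly with the number of observations while the degrees of freedom stay fixed, repeating the same path turns a non-significant result into a highly significant one, which is the effect seen in the two cells above.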

Rather than performing the likelihood ratio test ourselves, we can simply call the method `MultiOrderModel.predict`. It returns the `maximum order` among all of its layers for which the likelihood ratio test rejects the null hypothesis.

In [39]:
mog.predict(pc)

2

We now test whether this approach to learn the optimal representation of path data actually works. For this, let us generate path statistics that are in line with what we expect based on a first-order network model, and check whether the order estimation gives the right result.

In [None]:
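# NOTE: `paths_2` is not defined in this notebook; it is assumed to be a
# PathCollection generated from a first-order network model.
mog = pp.MultiOrderModel.from_paths(paths_2, max_order=2)
print('Optimal order = ', mog.predict(paths_2))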