In [1]:
import torch

import pathpyG as pp

print('Running on', pp.config['torch']['device'])

Running on cuda


Let's create a simple path data set with four paths of length two:

- 2 x A -> C -> D  
- 2 x B -> C -> E 

Using the following index mappings

0: A  
1: B  
2: C  
3: D  
4: E  

we can create these data as follows:

In [2]:
paths = pp.PathData()
paths.add_walk(torch.tensor([[0,2],[2,3]])) # A -> C -> D
paths.add_walk(torch.tensor([[0,2],[2,3]])) # A -> C -> D
paths.add_walk(torch.tensor([[1,2],[2,4]])) # B -> C -> E
paths.add_walk(torch.tensor([[1,2],[2,4]])) # B -> C -> E
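
Each walk above is passed as a pair of rows: the source nodes of its edges and the target nodes of its edges. As a minimal sketch of that layout, the snippet below converts a walk given as node labels into such a pair of rows. The helper `walk_to_edge_index` and the dictionary `node_to_idx` are hypothetical names introduced for illustration; they are not part of pathpyG.

```python
# Hypothetical helper (not part of pathpyG): turn a walk given as node labels
# into the [sources, targets] layout that is passed to add_walk above.
node_to_idx = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}


def walk_to_edge_index(walk, mapping):
    """Return [sources, targets] for the consecutive node pairs of a walk."""
    idx = [mapping[v] for v in walk]
    return [idx[:-1], idx[1:]]


print(walk_to_edge_index(["A", "C", "D"], node_to_idx))  # [[0, 2], [2, 3]]
print(walk_to_edge_index(["B", "C", "E"], node_to_idx))  # [[1, 2], [2, 4]]
```

The nested lists have the same shape as the tensors used in the cell above and could be wrapped in `torch.tensor(...)`.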

In [3]:
paths.num_paths

4

In [23]:
paths.num_nodes

5

In [24]:
paths.num_edges

8

In [25]:
print(paths)

PathStorage with 4 paths


We can assign features to paths:

In [26]:
paths['path_feature'] = torch.tensor([0,1]).to(pp.config['torch']['device'])

In [27]:
paths.is_path_attr('path_feature')

True

Paths are internally stored in a path_index dictionary, where the key is the index of the path:

In [28]:
paths['path_index']

{0: tensor([[0, 2],
         [2, 3]]),
 1: tensor([[0, 2],
         [2, 3]]),
 2: tensor([[1, 2],
         [2, 4]]),
 3: tensor([[1, 2],
         [2, 4]])}

We can project the path_index onto an edge index that lists every traversed edge, one column per traversal (duplicates included):

In [29]:
paths.edge_index

tensor([[0, 2, 0, 2, 1, 2, 1, 2],
        [2, 3, 2, 3, 2, 4, 2, 4]])

We can also create a weighted edge_index that keeps each unique edge once and returns the number of times it was traversed as its weight:

In [30]:
paths.edge_index_weighted

(tensor([[0, 1, 2, 2],
         [2, 2, 3, 4]]),
 tensor([2, 2, 2, 2]))
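
To make the relation between the two outputs concrete, here is a pure-Python sketch that counts how often each edge is traversed in the four toy walks. It only illustrates the idea with `collections.Counter`; it is not how pathpyG computes the result, and `weighted_edges` is a hypothetical name.

```python
from collections import Counter

# The four walks from above, written as lists of (source, target) edges.
walks = [
    [(0, 2), (2, 3)],
    [(0, 2), (2, 3)],
    [(1, 2), (2, 4)],
    [(1, 2), (2, 4)],
]


def weighted_edges(walks):
    """Return ([sources, targets], weights) over the unique edges."""
    counts = Counter(edge for walk in walks for edge in walk)
    edges = sorted(counts)  # unique edges in lexicographic order
    src = [u for u, _ in edges]
    dst = [v for _, v in edges]
    weights = [counts[e] for e in edges]
    return [src, dst], weights


edge_index, weights = weighted_edges(walks)
print(edge_index)  # [[0, 1, 2, 2], [2, 2, 3, 4]]
print(weights)     # [2, 2, 2, 2]
```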

This is a first-order graph representation. We can also generate k-th order representations with and without weights.

A k-th order edge index with m edges has the shape [2,m,k], i.e. it consists of a src and dst tensor with m entries, where each entry is a k-dim tensor representing the k nodes that constitute the higher-order node.

In [31]:
paths.edge_index_k(k=2)

tensor([[[0, 2],
         [0, 2],
         [1, 2],
         [1, 2]],

        [[2, 3],
         [2, 3],
         [2, 4],
         [2, 4]]])
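
The construction can be sketched in plain Python as a sliding window over the node sequence of a walk: every window of k consecutive nodes is a k-th order node, and two consecutive windows form a k-th order edge. The function `kth_order_edges` is a hypothetical illustration, not pathpyG's implementation.

```python
def kth_order_edges(nodes, k):
    """Higher-order edges of a walk given as a sequence of node indices."""
    # Every window of k consecutive nodes is one k-th order node.
    windows = [tuple(nodes[i:i + k]) for i in range(len(nodes) - k + 1)]
    # Consecutive windows overlap in k-1 nodes and form one k-th order edge.
    return list(zip(windows[:-1], windows[1:]))


print(kth_order_edges([0, 2, 3], k=2))     # [((0, 2), (2, 3))]
print(kth_order_edges([0, 2, 3, 4], k=2))  # [((0, 2), (2, 3)), ((2, 3), (3, 4))]
print(kth_order_edges([0, 2, 3], k=3))     # []
```

A walk with n edges therefore contributes n - k + 1 edges of order k, and none if it is too short.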

In [32]:
i, w = paths.edge_index_k_weighted(k=2)
print('higher-order edges =', i)
print('weights =', w)

higher-order edges = tensor([[[0, 2],
         [1, 2]],

        [[2, 3],
         [2, 4]]])
weights = tensor([2, 2])


We can use paths to generate k-th order graphs that can be used for GNNs. To make access to nodes and edges convenient, we can pass a node_id list that assigns an ID to each first-order node index:

In [33]:
g = pp.HigherOrderGraph(paths, order=2, node_id=['a', 'b', 'c', 'd', 'e'])
print(g)

HigherOrderGraph (k=2) with 4 nodes and 2 edges

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([2])

Graph attributes
	num_nodes		<class 'int'>
	node_id		<class 'list'>



Just like for a normal graph, we can iterate through nodes, which are tuples with k elements:

In [34]:
for n in g.nodes:
    print(n)

('a', 'c')
('b', 'c')
('c', 'd')
('c', 'e')


Edges are tuples with two elements, where each element is a k-th order node:

In [35]:
for e in g.edges:
    print(e)

(('a', 'c'), ('c', 'd'))
(('b', 'c'), ('c', 'e'))
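
The node and edge tuples printed above follow from the index-based second-order edges once each index is looked up in node_id. A small sketch of that lookup, using the hypothetical helper `relabel` and the second-order edges of the toy example written out by hand:

```python
node_id = ["a", "b", "c", "d", "e"]

# Second-order edges of the toy example as pairs of index tuples.
ho_edges = [((0, 2), (2, 3)), ((1, 2), (2, 4))]


def relabel(ho_node, ids):
    """Map a higher-order node given as index tuple to a tuple of IDs."""
    return tuple(ids[i] for i in ho_node)


# Higher-order nodes are all tuples that occur as endpoint of some edge.
ho_nodes = sorted({n for edge in ho_edges for n in edge})

print([relabel(n, node_id) for n in ho_nodes])
# [('a', 'c'), ('b', 'c'), ('c', 'd'), ('c', 'e')]
print([(relabel(u, node_id), relabel(v, node_id)) for u, v in ho_edges])
# [(('a', 'c'), ('c', 'd')), (('b', 'c'), ('c', 'e'))]
```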


The edge_weight attribute stores a tensor with the frequency of each edge:

In [36]:
print(g.data.is_edge_attr('edge_weight'))

True


In [37]:
g['edge_weight', ('a', 'c'), ('c', 'd')].item()

2

We can also add paths with frequencies:

In [3]:
paths = pp.GlobalPathStorage()
paths.add_walk(torch.tensor([[0,2,3],[2,3,4]]).to(pp.config['torch']['device'])) # A -> C -> D -> E
paths.add_walk(torch.tensor([[1,2],[2,4]]).to(pp.config['torch']['device'])) # B -> C -> E
paths.add_walk(torch.tensor([[0,2],[2,3]]).to(pp.config['torch']['device'])) # A -> C -> D


paths['path_freq'] = torch.tensor([2,2,1]).to(pp.config['torch']['device'])

print(paths['path_index'])

# second-order weights:
# (0,2), (2,3) -> 3
# (1,2), (2,4) -> 2
# (2,3), (3,4) -> 2

{0: tensor([[0, 2, 3],
        [2, 3, 4]], device='cuda:0'), 1: tensor([[1, 2],
        [2, 4]], device='cuda:0'), 2: tensor([[0, 2],
        [2, 3]], device='cuda:0')}


Compute the second-order edges of all paths:

In [49]:
i = torch.cat(list(pp.GlobalPathStorage.edge_index_kth_order(x, k=2) for x in paths['path_index'].values()), dim=1)
print(i)

tensor([[[0, 2],
         [2, 3],
         [1, 2],
         [0, 2]],

        [[2, 3],
         [3, 4],
         [2, 4],
         [2, 3]]], device='cuda:0')


Compute the frequencies of the second-order edges from the path frequencies:

In [50]:
# A path with n edges yields n-1 second-order edges, each inheriting the frequency of its path
freq = torch.cat([
    torch.Tensor([paths['path_freq'][idx].item()] * (paths['path_index'][idx].size()[1] - 1)).to(pp.config['torch']['device'])
    for idx in range(paths.num_paths)
], dim=0)
print(freq)

tensor([2., 2., 2., 1.], device='cuda:0')


Compute the unique second-order edges, plus a mapping from the non-unique to the unique second-order edges:

In [51]:
edge_index, reverse_index = i.unique(dim=1, return_inverse=True)
print(edge_index)
print(reverse_index)

tensor([[[0, 2],
         [1, 2],
         [2, 3]],

        [[2, 3],
         [2, 4],
         [3, 4]]], device='cuda:0')
tensor([0, 2, 1, 0], device='cuda:0')


For each edge in the unique edge_index, x contains all indices in the non-unique edge index i that correspond to that edge. Summing freq over these indices gives the weight of the edge:

In [52]:
x = list((reverse_index == idx).nonzero() for idx in range(edge_index.size()[1]))

print(x[0])
print(freq[x[0]])
print(torch.sum(freq[x[0]]))
print(x[1])
print(freq[x[1]])
print(torch.sum(freq[x[1]]))
print(x[2])
print(freq[x[2]])
print(torch.sum(freq[x[2]]))

tensor([[0],
        [3]], device='cuda:0')
tensor([[2.],
        [1.]], device='cuda:0')
tensor(3., device='cuda:0')
tensor([[2]], device='cuda:0')
tensor([[2.]], device='cuda:0')
tensor(2., device='cuda:0')
tensor([[1]], device='cuda:0')
tensor([[2.]], device='cuda:0')
tensor(2., device='cuda:0')


In [44]:
edge_weights = torch.tensor([torch.sum(freq[x[idx]]) for idx in range(edge_index.size()[1])])
print(edge_weights)

tensor([3., 2., 2.])
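
The per-edge loop above can also be written as a single scatter-add over the inverse index. The following is a sketch of that alternative, not a description of what pathpyG does internally. It rebuilds `i` and `freq` from the outputs shown above and keeps them on the CPU so that it runs without a GPU; `weights` is created on the same device as `freq`.

```python
import torch

# Non-unique second-order edges and their frequencies, copied from the
# outputs above (CPU tensors, so the sketch does not require CUDA).
i = torch.tensor([[[0, 2], [2, 3], [1, 2], [0, 2]],
                  [[2, 3], [3, 4], [2, 4], [2, 3]]])
freq = torch.tensor([2.0, 2.0, 2.0, 1.0])

# Unique edges and, for every non-unique edge, the position of its unique edge.
edge_index, inverse = i.unique(dim=1, return_inverse=True)

# Scatter-add: weights[inverse[j]] += freq[j] for every non-unique edge j.
weights = torch.zeros(edge_index.size(1), dtype=freq.dtype, device=freq.device)
weights.index_add_(0, inverse, freq)

print(inverse)  # tensor([0, 2, 1, 0])
print(weights)  # tensor([3., 2., 2.])
```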


In [45]:
paths.edge_index_k_weighted(k=2)

(tensor([[[0, 2],
          [1, 2],
          [2, 3]],
 
         [[2, 3],
          [2, 4],
          [3, 4]]]),
 tensor([2, 1, 1]))

In [46]:
paths.edge_index_k_weighted(k=2, path_freq='path_freq')

(tensor([[[0, 2],
          [1, 2],
          [2, 3]],
 
         [[2, 3],
          [2, 4],
          [3, 4]]]),
 tensor([3., 2., 2.]))

In [47]:
paths.edge_index_k_weighted(k=3, path_freq='path_freq')

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

We can read data from an n-gram file, where each line is a path consisting of comma-separated nodes:

In [48]:
p = pp.GlobalPathStorage.from_csv('tube_paths_train.ngram')

In [34]:
print(p)

PathStorage with 61748 paths


In [35]:
len(p['path_freq'])

61748

In [36]:
p.edge_index_k(k=2).size()

torch.Size([2, 634916, 2])

In [37]:
i, w = p.edge_index_k_weighted(k=2, path_freq='path_freq')
torch.sum(w)

tensor(12356472.)

In [38]:
i, w = p.edge_index_k_weighted(k=3, path_freq='path_freq')
torch.sum(w)

tensor(10308563.)

In [39]:
i, w = p.edge_index_k_weighted(k=5, path_freq='path_freq')
torch.sum(w)

tensor(6791043.)

When paths are read from a CSV file, the path storage automatically assigns node names (node_name) and path frequencies (path_freq):

In [40]:
print(p['node_name'])

['Southwark', 'Waterloo', 'Liverpool Street', 'Bank / Monument', 'Barking', 'West Ham', 'Tufnell Park', 'Kentish Town', 'Ruislip Gardens', 'South Ruislip', 'Turnpike Lane', 'Manor House', 'Seven Sisters', 'Finsbury Park', 'Tower Hill', 'Upminster', 'Embankment', 'Temple', 'Finchley Road', 'Willesden Green', 'Angel', 'Old Street', 'Holland Park', 'Notting Hill Gate', 'Baker Street', 'Bond Street', 'Latimer Road', 'Ladbroke Grove', 'North Harrow', 'Pinner', 'Chesham', 'Chalfont & Latimer', 'Westminster', 'Turnham Green', 'Stamford Brook', 'Belsize Park', 'Chalk Farm', 'Bow Road', 'BromleyByBow', 'Woodford', 'Buckhurst Hill', 'Hanger Lane', 'North Acton', 'Vauxhall', 'Stockwell', "Shepherd's Bush (Cen)", 'White City', 'Stratford', 'Whitechapel', 'South Woodford', 'Snaresbrook', 'East Acton', 'Moorgate', 'Edgware Road (Cir)', 'Paddington', 'Caledonian Road', 'Holloway Road', 'West Brompton', "Earl's Court", 'Queensbury', 'Kingsbury', 'Loughton', 'Debden', 'Lancaster Gate', 'Marble Arch', '

In [41]:
print(p['path_freq'])

tensor([2.1200e+02, 1.2710e+03, 2.8300e+02,  ..., 3.0000e+00, 1.0000e+00,
        1.0000e+00])
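
As a rough illustration of how such a file could be turned into paths, node names and frequencies, the sketch below parses two made-up lines. It assumes that the last comma-separated field of each line is the path frequency and that node indices are assigned in order of first appearance. Both are assumptions made for this example, not documented behaviour of from_csv, and `parse_ngram_line` is a hypothetical helper.

```python
def parse_ngram_line(line, sep=","):
    """Split a line such as 'A,C,D,2' into (['A', 'C', 'D'], 2.0).

    Assumption: the trailing field is the path frequency.
    """
    *nodes, freq = line.strip().split(sep)
    return nodes, float(freq)


lines = ["A,C,D,2", "B,C,E,2"]  # made-up sample lines, not from the tube file

node_index = {}  # node name -> index, assigned on first appearance
paths, freqs = [], []
for line in lines:
    nodes, freq = parse_ngram_line(line)
    idx = [node_index.setdefault(v, len(node_index)) for v in nodes]
    paths.append([idx[:-1], idx[1:]])  # [sources, targets] of the walk
    freqs.append(freq)

print(list(node_index))  # ['A', 'C', 'D', 'B', 'E']
print(paths)             # [[[0, 1], [1, 2]], [[3, 1], [1, 4]]]
print(freqs)             # [2.0, 2.0]
```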


The node names can be used as node IDs in a HigherOrderGraph:

In [49]:
g = pp.HigherOrderGraph(p, order=2, node_id=p['node_name'], path_freq='path_freq')
print(g)

HigherOrderGraph (k=2) with 642 nodes and 1139 edges

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([1139])

Graph attributes
	num_nodes		<class 'int'>
	node_id		<class 'list'>



In [43]:
for e in g.edges:
    print(e, g['edge_weight', e[0], e[1]].item())

(('Southwark', 'Waterloo'), ('Waterloo', 'Embankment')) 8405.0
(('Southwark', 'Waterloo'), ('Waterloo', 'Westminster')) 91710.0
(('Southwark', 'Waterloo'), ('Waterloo', 'Lambeth North')) 204.0
(('Southwark', 'Waterloo'), ('Waterloo', 'Kennington')) 13182.0
(('Southwark', 'London Bridge'), ('London Bridge', 'Bank / Monument')) 73218.0
(('Southwark', 'London Bridge'), ('London Bridge', 'Bermondsey')) 51530.0
(('Southwark', 'London Bridge'), ('London Bridge', 'Borough')) 296.0
(('Waterloo', 'Southwark'), ('Southwark', 'London Bridge')) 140333.0
(('Waterloo', 'Embankment'), ('Embankment', 'Temple')) 1499.0
(('Waterloo', 'Embankment'), ('Embankment', 'Charing Cross')) 11663.0
(('Waterloo', 'Westminster'), ('Westminster', 'Green Park')) 103374.0
(('Waterloo', 'Westminster'), ('Westminster', "St. James's Park")) 19691.0
(('Waterloo', 'Lambeth North'), ('Lambeth North', 'Elephant & Castle')) 3719.0
(('Waterloo', 'Kennington'), ('Kennington', 'Elephant & Castle')) 3771.0
(('Waterloo', 'Kenningt

In [44]:
g = pp.HigherOrderGraph(p, order=3, node_id=p['node_name'])
print(g)

HigherOrderGraph (k=3) with 1134 nodes and 1869 edges

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([1869])

Graph attributes
	num_nodes		<class 'int'>
	node_id		<class 'list'>



In [45]:
p.edge_index_k(k=3).size()

torch.Size([2, 573765, 3])

In [46]:
for e in g.edges:
    print(e, g['edge_weight', e[0], e[1]].item())

(('Southwark', 'Waterloo', 'Embankment'), ('Waterloo', 'Embankment', 'Temple')) 7
(('Southwark', 'Waterloo', 'Embankment'), ('Waterloo', 'Embankment', 'Charing Cross')) 147
(('Southwark', 'Waterloo', 'Westminster'), ('Waterloo', 'Westminster', 'Green Park')) 2663
(('Southwark', 'Waterloo', 'Westminster'), ('Waterloo', 'Westminster', "St. James's Park")) 1102
(('Southwark', 'Waterloo', 'Lambeth North'), ('Waterloo', 'Lambeth North', 'Elephant & Castle')) 1
(('Southwark', 'Waterloo', 'Kennington'), ('Waterloo', 'Kennington', 'Elephant & Castle')) 1
(('Southwark', 'Waterloo', 'Kennington'), ('Waterloo', 'Kennington', 'Oval')) 604
(('Southwark', 'London Bridge', 'Bank / Monument'), ('London Bridge', 'Bank / Monument', 'Liverpool Street')) 2271
(('Southwark', 'London Bridge', 'Bank / Monument'), ('London Bridge', 'Bank / Monument', 'Tower Hill')) 1035
(('Southwark', 'London Bridge', 'Bank / Monument'), ('London Bridge', 'Bank / Monument', 'Moorgate')) 129
(('Southwark', 'London Bridge', 'Ba