## Data Handling of Graphs

9/27/2022

Installation:

1. Install PyTorch Geometric
https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

    Note: Mac M1 with acceleration:   
    https://pytorch.org/get-started/locally/ 
 
2. Install scikit-learn  
https://scikit-learn.org/stable/install.html    

3. Install networkx   
https://networkx.org/documentation/stable/install.html    


#### PyTorch Geometric (PyG) - Graph Data Object
A graph is used to model pairwise relations (edges) between objects (nodes). A single graph in PyG is described by an instance of torch_geometric.data.Data, which holds the following attributes by default:

data.x: Node feature matrix with shape [num_nodes, num_node_features]

data.edge_index: Graph connectivity in COO (coordinate list) format with shape [2, num_edges] and type torch.long

data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]

data.y: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *] or graph-level targets of shape [1, *]

data.pos: Node position matrix with shape [num_nodes, num_dimensions]

None of these attributes are required. In fact, the Data object is not even restricted to these attributes. We can, e.g., extend it by data.face to save the connectivity of triangles from a 3D mesh in a tensor with shape [3, num_faces] and type torch.long.
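The shapes listed above can be checked on a toy example. The following is a minimal sketch using plain `torch` tensors only; the mesh (two triangles sharing an edge) and all feature values are made up for illustration.

```python
import torch

num_nodes, num_node_features = 4, 2

# Node features: [num_nodes, num_node_features]
x = torch.zeros(num_nodes, num_node_features)

# Node positions in 3D: [num_nodes, num_dimensions]
pos = torch.tensor([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0]])

# Two triangles, (0, 1, 2) and (1, 2, 3), one per column: [3, num_faces]
face = torch.tensor([[0, 1],
                     [1, 2],
                     [2, 3]], dtype=torch.long)

# Five undirected mesh edges, each stored once per direction: [2, num_edges]
edge_index = torch.tensor([[0, 1, 0, 2, 1, 2, 1, 3, 2, 3],
                           [1, 0, 2, 0, 2, 1, 3, 1, 3, 2]], dtype=torch.long)

# One (made-up) feature per directed edge: [num_edges, num_edge_features]
edge_attr = torch.ones(edge_index.size(1), 1)

# A single graph-level target
y = torch.tensor([1])

for name, tensor in [('x', x), ('pos', pos), ('face', face),
                     ('edge_index', edge_index), ('edge_attr', edge_attr), ('y', y)]:
    print(name, tuple(tensor.shape), tensor.dtype)
```

These tensors could then be passed to the constructor as keyword arguments, e.g. `Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y, pos=pos, face=face)`.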



A simple example of an unweighted and undirected graph with three nodes and two undirected edges (0-1 and 1-2), stored as four directed edges. Each node carries exactly one feature:

In [None]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)

data

![](graph_data_1.svg)

Note: edge_index, i.e. the tensor defining the source and target nodes of all edges, is not a list of index tuples. If you want to write your indices this way, you can transpose and call `contiguous` on it before passing them to the data constructor:



In [None]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index.t().contiguous())

data

Although the graph has only two edges, we need to define four index tuples to account for both directions of an undirected edge.
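To avoid typing both directions by hand, the duplication can be automated. Below is a minimal sketch in plain Python; `undirected_to_coo` is a hypothetical helper written for this note, not part of PyG.

```python
def undirected_to_coo(pairs):
    """Expand undirected (u, v) pairs into COO rows containing both directions."""
    sources, targets = [], []
    for u, v in pairs:
        sources += [u, v]   # u -> v, then v -> u
        targets += [v, u]
    return [sources, targets]


coo = undirected_to_coo([(0, 1), (1, 2)])
print(coo)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The result can be wrapped as `torch.tensor(coo, dtype=torch.long)` and used as `edge_index`. PyG also ships `torch_geometric.utils.to_undirected`, which performs a similar expansion directly on an `edge_index` tensor.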

Besides holding a number of node-level, edge-level or graph-level attributes, `Data` provides a number of useful utility functions, e.g.:


In [None]:
print('data.keys', data.keys)  # a property in older PyG; call data.keys() in PyG >= 2.4

print("data['x']", data['x'])

for key, item in data:
    print(f'{key} found in data')

print("'edge_attr' in data", 'edge_attr' in data)

print('data.num_nodes', data.num_nodes)

print('data.num_edges', data.num_edges)

print('data.num_node_features', data.num_node_features)

print('data.has_isolated_nodes()', data.has_isolated_nodes())

print('data.has_self_loops()',  data.has_self_loops())

print('data.is_directed()', data.is_directed())

# Transfer data object to GPU.
print('torch.cuda.is_available()', torch.cuda.is_available())
if torch.cuda.is_available():
    device = torch.device('cuda')
    data = data.to(device)

A complete list of all methods can be found here: [torch_geometric.data.Data](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data)
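To make the structural checks less opaque, here is a rough sketch of what `has_self_loops()`, `has_isolated_nodes()` and `is_directed()` report, recomputed by hand from `edge_index` for the three-node graph above. It is an approximation for illustration only; the library's own implementations may handle corner cases (e.g., self-loops or edge attributes) differently.

```python
import torch

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
num_nodes = 3
src, dst = edge_index  # first row: source nodes, second row: target nodes

# A self-loop is an edge whose source equals its target.
has_self_loops = bool((src == dst).any())

# A node is isolated if it never appears in edge_index.
touched = torch.zeros(num_nodes, dtype=torch.bool)
touched[edge_index.flatten()] = True
has_isolated_nodes = not bool(touched.all())

# The graph is undirected if every edge (u, v) has a matching (v, u).
forward = set(zip(src.tolist(), dst.tolist()))
backward = set(zip(dst.tolist(), src.tolist()))
is_directed = forward != backward

print(has_self_loops, has_isolated_nodes, is_directed)  # False False False
```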

## Common Benchmark Datasets

PyG contains a large number of common benchmark datasets, e.g., all Planetoid datasets (Cora, Citeseer, Pubmed), all graph classification datasets from http://graphkernels.cs.tu-dortmund.de and their cleaned versions, the QM7 and QM9 dataset, and a handful of 3D mesh/point cloud datasets like FAUST, ModelNet10/40 and ShapeNet.

Initializing a dataset is straightforward: it automatically downloads the raw files and processes them into the previously described `Data` format. E.g., to load the `ENZYMES` dataset (consisting of 600 graphs within 6 classes), type:
    

In [None]:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
print('dataset', dataset)

print('len(dataset)', len(dataset))

print('dataset.num_classes', dataset.num_classes)

print('dataset.num_node_features', dataset.num_node_features)

# Node and edge counts are per-graph properties, so inspect a single graph.
print('dataset[0].num_nodes', dataset[0].num_nodes)

print('dataset[0].num_edges', dataset[0].num_edges)

The dataset contains 600 graphs.

In [None]:
data = dataset[0]
print('data', data)

print('data.is_undirected()', data.is_undirected())

The first graph in the dataset contains 37 nodes, each one having 3 features. Its `edge_index` has 168 columns, i.e. 168/2 = 84 undirected edges, and the graph is assigned to exactly one class. In addition, the data object holds exactly one graph-level target.
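The halving works because every undirected edge is stored once per direction. As a small sketch of that arithmetic, the hypothetical helper below (not a PyG function) counts undirected edges by collapsing each directed pair to an order-independent key, shown here on the three-node toy graph from earlier.

```python
def count_undirected_edges(sources, targets):
    """Count edges, treating (u, v) and (v, u) as the same edge."""
    unique = {tuple(sorted((u, v))) for u, v in zip(sources, targets)}
    return len(unique)


# Toy graph from the first example: four directed entries, two undirected edges.
sources = [0, 1, 1, 2]
targets = [1, 0, 2, 1]
print(count_undirected_edges(sources, targets))  # 2
```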

We can even use slices, long or bool tensors to split the dataset. E.g., to create a 90/10 train/test split, type:

In [None]:
train_dataset = dataset[:540]
print('train_dataset', train_dataset)

test_dataset = dataset[540:]
print('test_dataset', test_dataset)
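
The same 90/10 split can also be expressed with index or boolean tensors. The sketch below only builds the indices with plain `torch`; the number 600 is taken from the ENZYMES description above, and the seed is an arbitrary choice.

```python
import torch

num_graphs = 600
split = num_graphs * 9 // 10          # 540 graphs for training

torch.manual_seed(0)                  # arbitrary seed, for reproducibility only
perm = torch.randperm(num_graphs)     # random ordering of graph indices

# Long tensors of indices.
train_idx = perm[:split]
test_idx = perm[split:]

# Equivalent boolean masks over the whole dataset.
train_mask = torch.zeros(num_graphs, dtype=torch.bool)
train_mask[train_idx] = True
test_mask = ~train_mask

print(train_idx.numel(), test_idx.numel())  # 540 60
```

Since datasets accept long and bool tensors as indices, `dataset[train_idx]` or `dataset[train_mask]` would then select the training graphs.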


If you are unsure whether the dataset is already shuffled before you split, you can randomly permute it by running:

In [None]:
dataset = dataset.shuffle()
print('dataset', dataset)

Download Cora, the standard benchmark dataset for semi-supervised graph node classification:

In [None]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
print('dataset', dataset)

print('len(dataset)', len(dataset))

print('dataset.num_classes', dataset.num_classes)

print('dataset.num_node_features', dataset.num_node_features)


Here, the dataset contains only a single, undirected citation graph:


In [None]:
data = dataset[0]
print('data', data)

print('data.is_undirected()', data.is_undirected())

print('data.train_mask.sum().item()', data.train_mask.sum().item())

print('data.val_mask.sum().item()', data.val_mask.sum().item())

print('data.test_mask.sum().item()', data.test_mask.sum().item())


This time, the Data object holds a label for each node, and additional node-level attributes: train_mask, val_mask and test_mask, where

train_mask denotes against which nodes to train (140 nodes),

val_mask denotes which nodes to use for validation, e.g., to perform early stopping (500 nodes),

test_mask denotes against which nodes to test (1000 nodes).
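
As an illustration of how such boolean masks are typically used, the sketch below computes accuracy on each node subset. All values are made up (six nodes, hand-written predictions standing in for a model's output), and `masked_accuracy` is a hypothetical helper, not a PyG function.

```python
import torch

# Toy stand-ins for data.y and a model's predicted classes.
y = torch.tensor([0, 1, 2, 1, 0, 2])
pred = torch.tensor([0, 1, 1, 1, 0, 0])

train_mask = torch.tensor([True, True, False, False, False, False])
val_mask = torch.tensor([False, False, True, False, False, False])
test_mask = torch.tensor([False, False, False, True, True, True])


def masked_accuracy(pred, y, mask):
    """Fraction of correct predictions among the nodes selected by mask."""
    correct = (pred[mask] == y[mask]).sum().item()
    return correct / int(mask.sum())


print('train', masked_accuracy(pred, y, train_mask))
print('val', masked_accuracy(pred, y, val_mask))
print('test', masked_accuracy(pred, y, test_mask))
```

Indexing with a mask keeps only the selected nodes, so the loss and metrics for each phase can be computed on the full graph's output without splitting the graph itself.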
