In [None]:
import torch
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

In [None]:
from torch_geometric.datasets import KarateClub

dataset = KarateClub()

print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.
print(data)

Expected output: **Data(x=[34, 34], edge_index=[2, 156], y=[34], train_mask=[34])**

- x=[34, 34] is the **node feature matrix** with shape (number of nodes, number of features). Here, we have 34 nodes (the 34 club members), each associated with a 34-dimensional feature vector.

- edge_index=[2, 156] represents the graph connectivity: 156 directed edges, with the source nodes in the first row and the target nodes in the second.

- y=[34] holds the class label of every node.

- train_mask=[34] is a boolean mask (a list of True and False values) indicating which nodes should be used for training.


In [None]:
print(f'Shape of Node Feature Matrix:{data.x.shape}')
print(data.x)

Expected output:

    Shape of Node Feature Matrix:torch.Size([34, 34])
    tensor([[1., 0., 0.,  ..., 0., 0., 0.],
            [0., 1., 0.,  ..., 0., 0., 0.],
            [0., 0., 1.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 1., 0., 0.],
            [0., 0., 0.,  ..., 0., 1., 0.],
            [0., 0., 0.,  ..., 0., 0., 1.]])


Here, the node feature matrix x is an identity matrix: each node is represented by a one-hot vector that only encodes its index, so x carries no descriptive information about the nodes. In other datasets, the features could include attributes such as age or skill level, but this is not the case here. We will therefore have to classify the nodes based solely on their connections.
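
To make the point about the identity matrix concrete, here is a minimal NumPy sketch. The 4-node size and the weight matrix `W` are made up for illustration and are not part of the dataset. Multiplying one-hot features by a weight matrix simply returns one row of `W` per node, so the features behave like node IDs rather than descriptive attributes.

```python
import numpy as np

num_nodes, hidden_dim = 4, 2   # toy sizes, not the dataset's
X = np.eye(num_nodes)          # identity features: row i is the one-hot vector of node i

rng = np.random.default_rng(0)
W = rng.normal(size=(num_nodes, hidden_dim))  # hypothetical weights of a first linear layer

H = X @ W                                # "transforming" the features ...
features_are_lookup = np.allclose(H, W)  # ... just returns the rows of W unchanged
print(features_are_lookup)
```

In other words, any signal a model extracts from this dataset has to come from the graph structure, not from x.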

In [None]:
print(f'Shape Edge Index:{data.edge_index.shape}')
print(f'Edge Index:{data.edge_index}')

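The edge_index stores the graph connectivity in a way that can seem counter-intuitive at first. It consists of two lists of 156 entries each: the first list contains the source nodes and the second the destination nodes, so each column describes one directed edge. Because the graph is undirected, every connection is stored in both directions, which gives 156 directed edges for 78 undirected ones. This layout is called the coordinate (COO) format and is one way of efficiently storing a sparse matrix.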
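
As a small illustration of the COO layout, the sketch below builds an `edge_index`-style array by hand for a made-up 4-node graph (the edge list is hypothetical, not taken from the Karate Club data). Each undirected edge is written once per direction, so the number of columns is twice the number of undirected edges.

```python
import numpy as np

# Hypothetical undirected edges of a toy graph with 4 nodes.
undirected_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Store each edge in both directions: (u, v) and (v, u).
directed_edges = undirected_edges + [(v, u) for u, v in undirected_edges]

# Row 0 holds the source nodes, row 1 the target nodes.
edge_index = np.array(directed_edges).T

print(edge_index.shape)   # (2, 8): 4 undirected edges -> 8 directed edges
print(edge_index)
```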

The adjacency matrix can be inferred from the edge_index with a utility function.
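
Conceptually, such a conversion sets `A[source, target] = 1` for every column of the edge index. The NumPy sketch below does this by hand for a toy 4-node graph without duplicate edges. The helper name `dense_adjacency` is made up for this example and is not a PyTorch Geometric function.

```python
import numpy as np

def dense_adjacency(edge_index, num_nodes):
    """Build a dense 0/1 adjacency matrix from a (2, num_edges) COO array."""
    A = np.zeros((num_nodes, num_nodes), dtype=int)
    sources, targets = edge_index
    A[sources, targets] = 1   # one entry per directed edge
    return A

# Toy graph: edges 0-1, 0-2 and 2-3, each stored in both directions.
edge_index = np.array([[0, 1, 0, 2, 2, 3],
                       [1, 0, 2, 0, 3, 2]])

A = dense_adjacency(edge_index, num_nodes=4)
print(A)
print((A == A.T).all())   # symmetric, because every edge appears in both directions
```

The dense form is convenient for inspection, but it needs memory quadratic in the number of nodes, which is why the sparse COO form is the default storage.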

In [None]:
from torch_geometric.utils import to_dense_adj

A = to_dense_adj(data.edge_index)[0].numpy().astype(int)
print(f'A = {A.shape}')
print(A)

The ground-truth labels stored in y encode the group number (0, 1, 2, or 3) of each node. There is one label per node, which is why y has 34 values.
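
A quick way to inspect a label vector of this kind is to count how many nodes fall into each group. The sketch below uses a made-up label array (not the real `data.y`) together with `np.bincount`.

```python
import numpy as np

# Hypothetical labels for 8 nodes, one group id (0 to 3) per node.
y = np.array([0, 1, 1, 3, 2, 0, 1, 3])

counts = np.bincount(y, minlength=4)   # counts[k] = number of nodes in group k
for group, count in enumerate(counts):
    print(f'group {group}: {count} nodes')
```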

In [None]:
print(f'y = {data.y.shape}')
print(data.y)

The train mask marks with True the nodes that are supposed to be used for training. These nodes form the training set, while the others can be considered the test set.
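
Boolean masks like this are typically used for indexing. The sketch below uses a made-up mask and label array to show how the training nodes are selected and how the remaining nodes can be obtained by negating the mask. Treating the complement as a test set is the convention described here, not something the mask itself enforces.

```python
import numpy as np

# Hypothetical labels and training mask for 6 nodes.
y = np.array([0, 1, 1, 2, 0, 2])
train_mask = np.array([True, False, False, True, False, False])

train_idx = np.flatnonzero(train_mask)    # indices of the training nodes
test_idx = np.flatnonzero(~train_mask)    # every other node

print(train_idx, y[train_mask])   # training nodes and their labels
print(test_idx, y[~train_mask])   # held-out nodes and their labels
```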

In [None]:
print(f'train_mask = {data.train_mask.shape}')
print(data.train_mask)