In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Reference: https://pytorch-geometric.readthedocs.io/en/latest/

# Install PyTorch Geometric (PyG)

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [3]:
# install
!pip install torch_geometric

# Optional dependencies:
!pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html

Collecting torch_geometric
  Downloading torch_geometric-2.4.0-py3-none-any.whl (1.0 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.4.0
Looking in links: https://data.pyg.org/whl/torch-2.1.0+cu121.html
Collecting torch_scatter
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu121/torch_scatter-2.1.2%2Bpt21cu121-cp310-cp310-linux_x86_64.whl (10.8 MB)
Collecting torch_sparse
  Downloading https://data.pyg.org/whl/torch-2.1.0%2Bcu121/torch_sparse-0.6.18%2Bpt21cu121-cp310-cp310-linux_x86_64.whl (5.0 MB)
Collecting torch_cluster
  Downloading https://data.pyg.org/whl/torch-2.1.0

In [4]:
# load modules
import torch
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

# Data Handling of Graphs

A graph is used to model pairwise relations (edges) between objects (nodes). A single graph in PyG is described by an instance of **torch_geometric.data.Data**, which holds the following attributes by default:

* **data.x**: Node feature matrix with shape [num_nodes, num_node_features]

* **data.edge_index**: Graph connectivity in COO format with shape [2, num_edges] and type torch.long

* **data.edge_attr**: Edge feature matrix with shape [num_edges, num_edge_features]

* **data.y**: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *] or graph-level targets of shape [1, *]

* **data.pos**: Node position matrix with shape [num_nodes, num_dimensions]

None of these attributes are required. In fact, the Data object is not even restricted to these attributes. We can, e.g., extend it by data.face to save the connectivity of triangles from a 3D mesh in a tensor with shape [3, num_faces] and type torch.long.

In [5]:
import torch
from torch_geometric.data import Data

connectivity = torch.tensor([[0, 1, 1, 2],
                             [1, 0, 2, 1]], dtype=torch.long) # 2 x 4 tensor

node_feature = torch.tensor([[-1], [0], [1]], dtype=torch.float) # 3 x 1 tensor

data = Data(x = node_feature, edge_index = connectivity)
print(data)

Data(x=[3, 1], edge_index=[2, 4])


We show a simple example of an unweighted, undirected graph with three nodes and two edges, stored as four directed entries in edge_index.

Each node contains exactly one feature:

<img src="https://pytorch-geometric.readthedocs.io/en/latest/_images/graph.svg" width="400"/>
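To make the COO layout concrete, here is a small sketch in plain PyTorch that expands the same edge_index into a dense adjacency matrix. It is only meant for inspecting tiny graphs like this one; the variable names are illustrative.

```python
import torch

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
num_nodes = 3

# Row 0 holds the source node of each edge, row 1 the target node.
src, dst = edge_index

# Mark every (source, target) pair in a dense num_nodes x num_nodes matrix.
adj = torch.zeros(num_nodes, num_nodes, dtype=torch.long)
adj[src, dst] = 1
print(adj)

# An undirected graph gives a symmetric adjacency matrix.
symmetric = torch.equal(adj, adj.t())
print(symmetric)
```

For real workloads, PyG ships its own conversion utilities (for example `torch_geometric.utils.to_dense_adj`), which are likely a better choice than this hand-rolled version.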

Note that edge_index, i.e. the tensor defining the source and target nodes of all edges, is not a list of index tuples. If you want to write your indices this way, you should transpose and call contiguous on it before passing them to the data constructor:

In [6]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index.t().contiguous())

print(data)
print(data.num_edges)
print(data.is_undirected())
print(data.is_directed())

Data(x=[3, 1], edge_index=[2, 4])
4
True
False


Although the graph has only two edges, we need to define **four index tuples to account for both directions of an edge**.

If the reverse pair of any edge is missing from edge_index, the graph is treated as **directed**.
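A minimal sketch of this rule in plain PyTorch: starting from one column per undirected edge, append the flipped copy of every column. The helper `has_all_reverse_edges` is a hypothetical name written for this example, not a PyG function.

```python
import torch

# One column per undirected edge: 0-1 and 1-2.
one_way = torch.tensor([[0, 1],
                        [1, 2]], dtype=torch.long)

# Append the flipped (target, source) copy of every column.
both_ways = torch.cat([one_way, one_way.flip(0)], dim=1)

def has_all_reverse_edges(edge_index):
    """Hypothetical helper: True if every (u, v) column has a matching (v, u)."""
    pairs = set(map(tuple, edge_index.t().tolist()))
    return all((v, u) in pairs for u, v in pairs)

print(both_ways)
print(has_all_reverse_edges(one_way), has_all_reverse_edges(both_ways))
```

PyG also provides `torch_geometric.utils.to_undirected` for this purpose, which additionally removes duplicate edges.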

Besides holding a number of node-level, edge-level or graph-level attributes, Data provides a number of useful utility functions, e.g.:

In [7]:
print(data.keys())    # keys is a method in this PyG version, so it must be called

['x', 'edge_index']


In [8]:
print(data['x'])    # similar to using dictionaries

tensor([[-1.],
        [ 0.],
        [ 1.]])


* To check which attributes are defined:

In [9]:
for key, item in data:
    print("{} found in data".format(key))

x found in data
edge_index found in data


In [10]:
'edge_attr' in data

False

In [11]:
data.num_nodes

3

In [12]:
data.num_node_features

1

In [13]:
data.has_isolated_nodes()

False

In [14]:
data.has_self_loops()

False
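To show roughly what the last two checks look at, here is a sketch in plain PyTorch on the same three-node graph. It illustrates the idea only; the library's own implementation may treat edge cases differently (for example, a node whose only edge is a self-loop).

```python
import torch

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
num_nodes = 3

# A self-loop is a column whose source equals its target.
self_loops = bool((edge_index[0] == edge_index[1]).any())

# A node counts as isolated here if it never appears in edge_index.
touched = torch.zeros(num_nodes, dtype=torch.bool)
touched[edge_index.flatten()] = True
isolated = bool((~touched).any())

print(self_loops, isolated)
```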

In [15]:
# Transfer the data object to the GPU if one is available, otherwise keep it on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)

# Common datasets

PyG contains a large number of common benchmark datasets, e.g., all Planetoid datasets (Cora, Citeseer, Pubmed), all graph classification datasets from http://graphkernels.cs.tu-dortmund.de and their cleaned versions, the QM7 and QM9 dataset, and a handful of 3D mesh/point cloud datasets like FAUST, ModelNet10/40 and ShapeNet.

Initializing a dataset is straightforward: it automatically downloads the raw files and processes them into the Data format described above. E.g., to load the ENZYMES dataset (consisting of 600 graphs within 6 classes), type:

In [16]:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Extracting /tmp/ENZYMES/ENZYMES/ENZYMES.zip
Processing...
Done!


In [17]:
print(dataset)

ENZYMES(600)


In [18]:
print(len(dataset))

600


In [19]:
dataset.num_classes

6

In [20]:
dataset.num_node_features

3

We now have access to all 600 graphs in the dataset:

In [21]:
data = dataset[0]
print(data)
print(data.y)

Data(edge_index=[2, 168], x=[37, 3], y=[1])
tensor([5])


In [22]:
print(data.is_undirected())

True


We can see that the first graph in the dataset contains 37 nodes, each one having 3 features.

There are 168/2 = 84 undirected edges, and the data object holds exactly one graph-level target, i.e. the graph is assigned to exactly one class.

We can even use slices, long or bool tensors to split the dataset. E.g., to create a 90/10 train/test split, type:

In [23]:
train_dataset = dataset[:540]
print(train_dataset)

ENZYMES(540)


In [24]:
test_dataset = dataset[540:]
print(test_dataset)

ENZYMES(60)


If you are unsure whether the dataset is already shuffled before you split, you can randomly permute it by running:

In [25]:
dataset = dataset.shuffle()

This is equivalent to:

In [26]:
perm = torch.randperm(len(dataset))
print(perm) # list of randomly shuffled indices

dataset = dataset[perm]
print(dataset)

tensor([595, 136, 190, 425, 504, 505, 129, 520, 265,  31,  62, 216, 162, 271,
        266, 282, 489, 118, 169,  83,  21, 252, 215, 560, 170, 305,  45, 269,
        589, 397, 343, 360, 374, 267, 557,  93, 506, 209,  72, 448,  13, 355,
          5, 341,  27, 546, 103, 454,  59, 351, 452, 588, 507, 167, 172, 524,
        386, 572, 529, 219, 443,  42, 154, 257, 109, 473, 248, 513,  29, 455,
        482, 112, 142, 584, 498, 280, 532, 123, 437, 568, 426,  33, 311, 333,
         97,  98, 365, 577, 152, 542, 472, 594, 416, 274,  28, 318, 369, 404,
        548,  73, 412, 125, 201, 298, 281, 549,  34, 101, 272, 205, 213, 585,
        243, 277, 110, 133, 198,  39, 494, 352,  43, 515, 551, 185, 251, 500,
        293, 310, 394,  20, 194, 405, 466, 574, 346, 222, 173, 458, 459, 531,
        160, 522, 245, 165, 268, 299, 336, 253, 151, 581, 225, 161, 501, 183,
        236, 231, 446, 439, 519, 419, 453, 288, 218, 376,   4, 395, 238, 487,
        262, 436,  36, 259, 157, 573, 220, 334, 291, 586,  40, 5
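Because the permutation is random, the split changes on every run. One possible way to make it repeatable is to draw the permutation from a seeded generator and slice the index tensor, as sketched below. The seed value and the variable names are arbitrary choices for this example, and the sizes are the ENZYMES numbers printed earlier.

```python
import torch

num_graphs = 600          # size of ENZYMES, as printed above
num_train = 540           # 90% of the graphs

# A seeded generator makes the permutation repeatable across runs.
generator = torch.Generator().manual_seed(0)   # the seed value 0 is arbitrary
perm = torch.randperm(num_graphs, generator=generator)

train_idx, test_idx = perm[:num_train], perm[num_train:]
print(len(train_idx), len(test_idx))

# With a loaded dataset, the index tensors can be used directly:
# train_dataset, test_dataset = dataset[train_idx], dataset[test_idx]
```

Shuffling before slicing matters when the graphs are stored in class order, since a plain `dataset[:540]` would then leave some classes out of the training set.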

Let’s try another one! Let’s download Cora, the standard benchmark dataset for semi-supervised graph node classification:

In [27]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
print(dataset)

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index


Cora()


Processing...
Done!


In [28]:
len(dataset)

1

In [29]:
dataset.num_classes

7

In [31]:
dataset.num_node_features

1433

Here, the dataset contains only a single, undirected citation graph:

In [32]:
data = dataset[0]
print(data)

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])


In [33]:
data.is_undirected()

True

In [34]:
data.train_mask.sum().item()

140

In [35]:
data.val_mask.sum().item()

500

In [36]:
data.test_mask.sum().item()

1000
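The three masks are boolean vectors of length num_nodes, and together they cover 140 + 500 + 1000 = 1640 of the 2708 nodes. The sketch below shows how such masks restrict a metric to a subset of nodes. It uses made-up toy tensors for a six-node graph rather than Cora itself, and the predictions are invented.

```python
import torch

# Toy stand-ins for a 6-node graph (all values are made up for illustration).
y = torch.tensor([0, 1, 2, 1, 0, 2])            # true class per node
pred = torch.tensor([0, 1, 1, 1, 2, 2])         # predicted class per node
train_mask = torch.tensor([True, True, False, False, False, False])
test_mask = torch.tensor([False, False, False, True, True, True])

# Boolean indexing keeps only the nodes where the mask is True.
train_acc = (pred[train_mask] == y[train_mask]).float().mean().item()
test_acc = (pred[test_mask] == y[test_mask]).float().mean().item()

print(int(train_mask.sum()), int(test_mask.sum()))
print(train_acc, test_acc)
```

The same indexing pattern applies to a loss: only the rows selected by `train_mask` would contribute during training.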

# Chemical datasets in PyG

PyG includes many pre-compiled chemistry-related datasets.

For example, the ZINC dataset is drawn from the ZINC database of commercially available compounds.
The QM7b and QM9 datasets provide quantum-mechanical properties of small molecules.

The complete list of datasets can be found at https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html

## QM9 dataset

----
Ref: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9

Source code: https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/datasets/qm9.html

In [38]:
from torch_geometric.datasets import QM9

dataset = QM9(root='./QM9')
print("Dataset Name:", dataset)

Downloading https://data.pyg.org/datasets/qm9_v3.zip
Extracting QM9/raw/qm9_v3.zip
Processing...
Using a pre-processed version of the dataset. Please install 'rdkit' to alternatively process the raw data.


Dataset Name: QM9(130831)


Done!


In [39]:
dataset.num_classes

19

In [40]:
dataset.num_node_features

11

- First data object (molecule)

In [42]:
data = dataset[0]

In [43]:
data.y

tensor([[    0.0000,    13.2100,   -10.5499,     3.1865,    13.7363,    35.3641,
             1.2177, -1101.4878, -1101.4098, -1101.3840, -1102.0229,     6.4690,
           -17.1722,   -17.2868,   -17.3897,   -16.1519,   157.7118,   157.7100,
           157.7070]])
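The target tensor has one row per molecule and one column per regression target, which is why `num_classes` reports 19 here. A minimal sketch of picking a single target column follows, using the values printed above. The column index 7 is an arbitrary choice for illustration; the QM9 reference page linked above documents which property each column holds.

```python
import torch

# Target row of the first molecule, copied from the output above.
y = torch.tensor([[0.0000, 13.2100, -10.5499, 3.1865, 13.7363, 35.3641,
                   1.2177, -1101.4878, -1101.4098, -1101.3840, -1102.0229, 6.4690,
                   -17.1722, -17.2868, -17.3897, -16.1519, 157.7118, 157.7100,
                   157.7070]])

target = 7                      # hypothetical choice of target column
single = y[:, target]           # shape [1]: one value per molecule
print(tuple(y.shape), single)
```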

In [44]:
print(data)

Data(x=[5, 11], edge_index=[2, 8], edge_attr=[8, 4], y=[1, 19], pos=[5, 3], idx=[1], name='gdb_1', z=[5])


In [45]:
data.x

tensor([[0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
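The rows above can be decoded back into atoms under one assumption, taken from the linked QM9 source code and worth verifying there: the first five columns are a one-hot encoding over the elements H, C, N, O, F (in that order) and the sixth column is the atomic number. The sketch below applies that assumed layout to the printed matrix.

```python
import torch

# Node feature matrix of the first molecule, copied from the output above.
x = torch.tensor([[0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
                  [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
                  [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
                  [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
                  [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

# ASSUMPTION: columns 0-4 are one-hot over these elements, in this order.
ELEMENTS = ['H', 'C', 'N', 'O', 'F']

# ASSUMPTION: column 5 holds the atomic number.
symbols = [ELEMENTS[i] for i in x[:, :5].argmax(dim=1).tolist()]
atomic_numbers = x[:, 5].long().tolist()

print(symbols)
print(atomic_numbers)
```

Under this reading the two columns agree with each other: one carbon (atomic number 6) and four hydrogens (atomic number 1), i.e. a single carbon bonded to four hydrogens.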