# Practical Machine Learning and Deep Learning

# Lab 9

# Graph Classification Task


In this lesson you will implement a graph classifier using a graph neural network (GNN).

Graph classification is an important problem with applications across many fields, such as bioinformatics, chemoinformatics, social network analysis, urban computing, and cybersecurity. Applying graph neural networks to this problem has recently become a popular approach, as the following papers show:

[Ying et al., 2018](https://arxiv.org/abs/1806.08804),
[Cangea et al., 2018](https://arxiv.org/abs/1811.01287),
[Knyazev et al., 2018](https://arxiv.org/abs/1811.09595),
[Bianchi et al., 2019](https://arxiv.org/abs/1901.01343),
[Liao et al., 2019](https://arxiv.org/abs/1901.01484),
[Gao et al., 2019](https://openreview.net/forum?id=HJePRoAct7)


## Prerequisites

To work with graphs we will use the `dgl` library. It offers many methods and data structures for working with graphs, along with `pytorch` support.

In [None]:
!pip install torch==2.1.2

In [None]:
!pip install dgl -q

## Data reading and preprocessing

First, let's create a dataset class for our task. It follows the same pattern as a dataset for any other classification task.

---

## EXERCISE 1:
Create the following functions:
* `__getitem__` should return a sample and its label. The label is a single integer and the sample is a `dgl.DGLGraph`
* `__len__` should return the length of the dataset

---

In [None]:
from typing import Tuple, List
import os
os.environ['DGLBACKEND'] = 'pytorch'
import pandas as pd
from dgl import DGLGraph, graph as graph_constructor
from torch.utils.data import Dataset


class GraphClassificationDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()

        self.data_path = csv_path
        self._read_data()

    def _read_data(self) -> None:
        self.graph_ids = []
        self.graphs = []
        self.labels = []
        df = pd.read_csv(self.data_path, header=0, index_col='graph_id')

        for graph_id, sample in df.iterrows():
            graph = self._create_graph(
                num_nodes=sample.num_nodes,
                num_edges=sample.num_edges,
                edges_from=list(map(int, sample.edges_from.split(' '))),
                edges_to=list(map(int, sample.edges_to.split(' '))),
            )

            self.graph_ids.append(graph_id)
            self.graphs.append(graph)
            try:
                self.labels.append(sample.label)
            except AttributeError:
                # No `label` column (e.g. the test set): keep graph_id as a placeholder
                self.labels.append(graph_id)

    def _create_graph(self, num_nodes: int, num_edges: int, edges_from: List[int], edges_to: List[int]) -> DGLGraph:
        assert len(edges_from) == num_edges, 'Something is wrong with edges_from'
        assert len(edges_to) == num_edges, 'Something is wrong with edges_to'

        graph = graph_constructor(
            num_nodes=num_nodes,
            data=(edges_from, edges_to),
        )
        return graph

    def __getitem__(self, index) -> Tuple[DGLGraph, int]:
        return self.graphs[index], self.labels[index]

    def __len__(self):
        return len(self.labels)

In [None]:
train_set_raw = GraphClassificationDataset('graph_classification_train.csv')
test_dataset = GraphClassificationDataset('graph_classification_test.csv')

In [None]:
assert len(train_set_raw)==800
assert len(test_dataset)==200
print("Test cases passed")

In [None]:
from torch.utils.data import random_split

# Set percentage of data to use as a training subset
train_ratio = 0.8

train_size = int(train_ratio * len(train_set_raw))
val_size = len(train_set_raw) - train_size

train_dataset, val_dataset = random_split(train_set_raw, (train_size, val_size))

### DataLoaders

To train neural networks efficiently, a common practice is to batch
multiple samples together to form a mini-batch. Batching fixed-shaped tensor
inputs is common. For example, batching two images of size 28 x 28
gives a tensor of shape 2 x 28 x 28. By contrast, batching graph inputs
has two challenges:

* Graphs are sparse.
* Graphs vary in size, for example in the number of nodes and edges.

To address this, DGL provides the `dgl.batch` API. It leverages the idea that
a batch of graphs can be viewed as a large graph that has many disjoint
connected components. Below is a visualization that gives the general idea.

![](https://data.dgl.ai/tutorial/batch/batch.png)
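The batching idea above can be sketched in plain Python. This is only a conceptual illustration, not how DGL implements `dgl.batch`: the helper `batch_edge_lists` and the `(num_nodes, edges)` representation are hypothetical and chosen for this example. Each graph's node IDs are shifted by the number of nodes that came before it, so the graphs end up as disjoint components of one larger graph.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def batch_edge_lists(graphs: List[Tuple[int, List[Edge]]]) -> Tuple[int, List[Edge], List[int]]:
    """Merge (num_nodes, edges) graphs into one graph with disjoint components.

    Hypothetical helper for illustration only; it is not part of DGL.
    Returns the total node count, the relabelled edges, and the node count
    of every original graph (needed later to split per-graph results).
    """
    offset = 0
    merged_edges: List[Edge] = []
    nodes_per_graph: List[int] = []
    for num_nodes, edges in graphs:
        # Shift node IDs so they do not collide with nodes of earlier graphs.
        merged_edges.extend((src + offset, dst + offset) for src, dst in edges)
        nodes_per_graph.append(num_nodes)
        offset += num_nodes
    return offset, merged_edges, nodes_per_graph


triangle = (3, [(0, 1), (1, 2), (2, 0)])
single_edge = (2, [(0, 1)])
total_nodes, edges, sizes = batch_edge_lists([triangle, single_edge])
print(total_nodes, edges, sizes)
```

Because no edge connects nodes from different source graphs, message passing on the merged graph never mixes information between samples.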

In [None]:
from torch import Tensor

from dgl import batch as construct_graph_batch

def collate_graph_batch(samples: List[Tuple[DGLGraph, int]]) -> Tuple[DGLGraph, Tensor]:
    graphs, labels = map(list, zip(*samples))
    batched_graph = construct_graph_batch(graphs)
    return batched_graph, Tensor(labels)

In [None]:
from torch.utils.data import DataLoader

BATCH_SIZE = 64
NUM_CLASSES = 8

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_graph_batch, drop_last=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_graph_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_graph_batch)

## Model implementation

Graph classification proceeds as follows.

![](https://data.dgl.ai/tutorial/batch/graph_classifier.png)


Given a batch of graphs, first perform message passing and graph convolution so that nodes can exchange information with their neighbours. After message passing, compute a tensor that represents each graph from its node (and edge) attributes. This step is often called readout or aggregation. Finally, the graph representations are fed into a classifier $g$ to predict the graph labels.

Graph convolution layers can be found in the `dgl.nn.<backend>` submodule.

In this lab the easiest choice is `GraphConv`. [Docs](https://docs.dgl.ai/en/1.1.x/generated/dgl.nn.pytorch.conv.GraphConv.html#dgl.nn.pytorch.conv.GraphConv).
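The message passing and readout steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses an unweighted mean over in-neighbours and has no learnable parameters, whereas `GraphConv` applies a learned weight matrix and its own normalization. The function names `mean_aggregate` and `mean_readout` are hypothetical.

```python
from typing import List, Tuple


def mean_aggregate(num_nodes: int, edges: List[Tuple[int, int]], features: List[float]) -> List[float]:
    """One round of message passing: each node averages its in-neighbours' features."""
    incoming: List[List[float]] = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        # The source node sends its current feature along the edge.
        incoming[dst].append(features[src])
    # Nodes that receive no messages get 0.0 in this sketch.
    return [sum(msgs) / len(msgs) if msgs else 0.0 for msgs in incoming]


def mean_readout(features: List[float]) -> float:
    """Graph-level representation: the average of all node features."""
    return sum(features) / len(features)


# Undirected path 0 - 1 - 2, stored as two directed edges per connection.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
num_nodes = 3

# Initial feature: in-degree of every node, as in the model below.
h0 = [float(sum(1 for _, dst in edges if dst == v)) for v in range(num_nodes)]
h1 = mean_aggregate(num_nodes, edges, h0)
hg = mean_readout(h1)
print(h0, h1, hg)
```

Here the end nodes pick up the degree of the middle node and vice versa, and the readout collapses the three node values into a single number that a classifier could consume.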


In [None]:
import torch
from torch import nn
import dgl
from dgl.nn.pytorch import GraphConv

class GraphClassifier(nn.Module):
    def __init__(self, hidden_dim, n_classes):
        super().__init__()

        input_dim = 1
        # NOTE: for educational purposes we use an input dimension of 1 here.
        # In practice, the feature vector carries information about the node.
        # For example, in a social network it could represent the user
        # (e.g. their choice of movies).
        self.conv1 = GraphConv(input_dim, hidden_dim)
        self.relu1 = nn.ReLU()
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.relu2 = nn.ReLU()
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, graphs):
        # Use node degree as the initial node feature. For undirected graphs, the in-degree
        # is the same as the out-degree.
        h = graphs.in_degrees().view(-1, 1).float()

        # Perform graph convolution and activation function.
        h = self.conv1(graphs, h)
        h = self.relu1(h)
        h = self.conv2(graphs, h)
        h = self.relu2(h)

        graphs.ndata['h'] = h
        # Calculate graph representation by averaging all the node representations.
        hg = dgl.mean_nodes(graphs, 'h')
        return self.head(hg)

## Training


---

## EXERCISE 2:
Set the device type

*   If cuda is available, set torch device to cuda
*   If cuda is unavailable, set torch device to cpu
---



In [None]:
HIDDEN_DIM = 256

# Select the device on which to run the computations
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Create an instance of the model and pass its weights to the device
model = GraphClassifier(HIDDEN_DIM, NUM_CLASSES).to(device)
# Set the loss function
loss_function = nn.CrossEntropyLoss()
# Set the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4) # Using Karpathy's learning rate constant

Here are utility functions to calculate and show metrics:

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def calculate_metric(metric_fn, true_y, pred_y):
    if metric_fn != accuracy_score:
        return metric_fn(true_y, pred_y, average="macro")
    else:
        return metric_fn(true_y, pred_y)

def print_scores(p, r, f1, a, num_batches):
    # Each list holds one score per batch, so average over the number of batches
    for name, scores in zip(("precision", "recall", "F1", "accuracy"), (p, r, f1, a)):
        print(f"\t{name.rjust(14, ' ')}: {sum(scores)/num_batches:.4f}")

In [None]:
import warnings
warnings.filterwarnings('ignore')

from tqdm import tqdm

epochs = 60

losses = []
batches = len(train_dataloader)
val_batches = len(val_dataloader)

# loop for every epoch (training + evaluation)
for epoch in range(epochs):
    total_loss = 0

    # progress bar
    progress = tqdm(enumerate(train_dataloader), desc="Loss: ", total=batches)

    # ----------------- TRAINING  --------------------
    # set model to training
    model.train()

    for i, (graphs, labels) in progress:

        graphs = graphs.to(device)
        labels = labels.to(device).long()

        # training step for single batch
        model.zero_grad()
        outputs = model(graphs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # update running training loss
        current_loss = loss.item()
        # the loss is already averaged over the batch, so accumulate it unscaled
        total_loss += current_loss

        # updating progress bar
        progress.set_description("Loss: {:.4f}".format(total_loss/(i+1)))

    # releasing unnecessary memory in GPU
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # ----------------- VALIDATION  -----------------
    val_losses = 0
    precision, recall, f1, accuracy = [], [], [], []

    # set model to evaluating (testing)
    model.eval()
    with torch.no_grad():
        for i, (graphs, labels) in enumerate(val_dataloader):
            graphs = graphs.to(device)
            labels = labels.to(device).long()

            outputs = model(graphs)

            # update running validation loss (batch mean, as a Python float)
            val_losses += loss_function(outputs, labels).item()

            predicted_classes = torch.max(outputs, 1)[1]

            # calculate P/R/F1/A metrics for batch
            for acc, metric in zip((precision, recall, f1, accuracy),
                                   (precision_score, recall_score, f1_score, accuracy_score)):
                acc.append(
                    calculate_metric(metric, labels.cpu(), predicted_classes.cpu())
                )

    print(f"Epoch {epoch + 1}/{epochs}, training loss: {total_loss/batches}, validation loss: {val_losses/val_batches}")
    print_scores(precision, recall, f1, accuracy, val_batches)
    losses.append(total_loss/batches)

## Inference

Produce labels for the test graphs.


---

## EXERCISE 3:
1. Pass the graphs to the model to generate outputs
2. Use [torch.max](https://pytorch.org/docs/stable/generated/torch.max.html) to select the index of the maximum value (the predicted class) for each output row generated in the previous step
---
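When called with a dimension argument, `torch.max` returns a pair of values and indices, and the indices are what we want as predicted classes. A plain-Python sketch of the same selection is shown here; the helper `predict_classes` and the sample scores are made up for illustration.

```python
from typing import List


def predict_classes(logits: List[List[float]]) -> List[int]:
    """Return, for every row of class scores, the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Two graphs, three classes: one row of scores per graph.
logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.9]]
predicted = predict_classes(logits)
print(predicted)
```

The model outputs raw scores rather than probabilities, but since softmax preserves ordering, taking the index of the largest score gives the same predicted class.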



In [None]:
predictions = []

with torch.no_grad():
    model.eval()
    for i, (graphs, graph_ids) in enumerate(test_dataloader):

        graphs = graphs.to(device)
        outputs = model(graphs)

        predicted = torch.max(outputs, 1)[1]
        predictions.extend(predicted.tolist())

In [None]:
assert list(outputs.shape)==[8,8]
assert len(predictions)==200
print("Test cases passed")

In [None]:
# generate the results file
results = pd.DataFrame(columns=['graph_id', 'label'])
results['graph_id'] = test_dataset.graph_ids
results['label'] = predictions
results.to_csv('results.csv', index=False)