This notebook builds a GCN (graph convolutional network) for node classification on Cora, a common benchmark dataset for graph learning in which each node is a document, each edge is a citation, and the node features encode word occurrences.

Problem Statement:

This is a node classification task on a graph. Each node represents a scientific document, and the objective is to assign each document to one of several predefined categories based on its content and its citation network.

1. Dataset Characteristics:
*   Nodes: Each node in the Cora dataset represents a scientific publication.
*   Node Features: Each node has a feature vector derived from the textual content of the document. Specifically, the features are binary word vectors indicating the presence or absence of corresponding words from a predefined dictionary.
*   Edges: Each edge represents a citation link between two documents. A citation is directed by nature (from the citing document to the cited one), but the Planetoid version of Cora loaded below stores every link in both directions, so the graph is effectively undirected.

2. Classes:
*   The Cora dataset has seven classes, each an area of machine learning research, such as Genetic Algorithms, Neural Networks, and Probabilistic Methods. Each class is a field of study that a document can belong to.

3. Goal:
*   Predict the class (field of study) of each document from its content and its position within the citation network. This is a classic semi-supervised learning problem: only a subset of the nodes (documents) have labels available for training. The GCN uses both the node features and the graph structure to learn how to classify nodes.

4. Why Graph Neural Networks:
*   Graph Neural Networks (here, a GCN) suit this type of problem because they propagate feature information along the edges of the graph. By combining each node's own features with those of its neighbors, and training on the few labeled nodes, a GCN can predict labels for the unlabeled nodes.
*   GCNs use the node features and the edges (citations) to aggregate information from a node’s neighborhood (including itself), which helps in capturing both the topical relevance and the contextual relevance (how nodes influence each other through citations).

For example, if a particular document is about "Neural Networks" and it cites other documents about "Neural Networks," and is cited by documents about "Neural Networks," a GCN can help to identify that the document likely belongs to the category of "Neural Networks" even if the document's label is unknown. This capability makes GCNs highly effective for tasks where the relational structure between data points significantly informs or affects the output variable.
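To make the dataset description above concrete, here is a minimal sketch of the two ingredients a GCN consumes: binary word-occurrence vectors and a citation edge list stored in both directions. The vocabulary, documents, and helper names are hypothetical stand-ins, not taken from Cora itself.

```python
# Toy stand-in for Cora's representation (all names and values are hypothetical).
vocabulary = ["neural", "network", "genetic", "bayesian"]

documents = {
    0: "a neural network for control",
    1: "genetic search for neural network design",
    2: "bayesian network inference",
}

def binary_features(text, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocab]

features = [binary_features(documents[i], vocabulary) for i in sorted(documents)]

# Directed citations as (citing, cited) pairs.
citations = [(1, 0), (2, 0)]

def symmetrize(edges):
    """Store every link in both directions, without duplicates."""
    both = set(edges) | {(dst, src) for src, dst in edges}
    return sorted(both)

edges = symmetrize(citations)
# Same layout as an edge_index: one row of sources, one row of targets.
edge_index = [[src for src, _ in edges], [dst for _, dst in edges]]

print(features)
print(edge_index)
```

Storing both directions doubles the edge count, which is why an edge count reported for a graph stored this way is twice the number of distinct citation links.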

In [2]:
!pip install torch-geometric

Collecting torch-geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Downloading torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch-geometric
Successfully installed torch-geometric-2.6.1


In [3]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import DataLoader  # in PyG 2.x, DataLoader lives in torch_geometric.loader
from torch_geometric.utils import to_networkx
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

In [4]:
# Load the Cora dataset (each node represents a document, each edge represents a citation)
dataset = Planetoid(root="data/Cora", name="Cora")

# Get the first graph from the dataset (Cora has only one graph)
data = dataset[0]
print(f"Dataset has {len(dataset)} graph(s). Each graph has {data.num_nodes} nodes and {data.num_edges} edges.")
print(f"Each node has {data.num_node_features} features.")
print(f"There are {dataset.num_classes} classes for node classification.")

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...


Dataset has 1 graph(s). Each graph has 2708 nodes and 10556 edges.
Each node has 1433 features.
There are 7 classes for node classification.


Done!


In [5]:
# Print unique classes and their counts
unique_classes, counts = torch.unique(data.y, return_counts=True)

# Label order of the Planetoid version of Cora (indices 0 to 6); note that it is not alphabetical
class_labels = ['Theory', 'Reinforcement_Learning', 'Genetic_Algorithms', 'Neural_Networks', 'Probabilistic_Methods', 'Case_Based', 'Rule_Learning']
print("Classes and their corresponding node counts:")
for class_label, count in zip(class_labels, counts):
    print(f"{class_label}: {count} nodes")

Classes and their corresponding node counts:
Theory: 351 nodes
Reinforcement_Learning: 217 nodes
Genetic_Algorithms: 418 nodes
Neural_Networks: 818 nodes
Probabilistic_Methods: 426 nodes
Case_Based: 298 nodes
Rule_Learning: 180 nodes


In the model defined below, each call to self.conv1(x, edge_index) or self.conv2(x, edge_index) internally performs the linear transformation, aggregation, and normalization described here.
Step 1: Linear Transformation
Before any messages are passed between nodes, each node's feature vector undergoes a linear transformation using a weight matrix that is learned during training. This transformation is applied to all node features simultaneously, which can be efficiently implemented as a matrix multiplication:
X' = XW
Where:
*   X is the matrix of input features for all nodes (one row per node).
*   W is the weight matrix associated with the layer.

Step 2: Aggregation
After the initial transformation, the next step is to aggregate features from the neighboring nodes. In the standard GCNConv, this aggregation is a weighted sum over each node's neighbors and the node itself (a self-loop is added to every node). Other GNN layers may use a mean or max instead.

Step 3: Normalization
In the standard implementation of GCNConv, the weights of the aggregation come from the node degrees: the contribution of node j to node i is scaled by 1/sqrt(d_i * d_j), where the degrees d_i and d_j count the added self-loops. This normalization is a crucial step and keeps nodes with high degrees from dominating the feature representation. A learned bias is added to the result.

When using GCNConv, much of this complexity is abstracted away, and the layer can be applied directly like any other module in PyTorch.
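The three steps above can be sketched in plain Python on a tiny graph. This is a simplified illustration of the propagation rule D^-1/2 (A + I) D^-1/2 X W without the bias term, not the GCNConv source code; the helper names, the 3-node graph, and the identity weight matrix are assumptions chosen so that the effect of aggregation is easy to read off.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(x, edges, weight):
    """One GCN propagation step without bias: D^-1/2 (A + I) D^-1/2 X W."""
    n = len(x)
    # Step 1: linear transformation X' = XW.
    xw = matmul(x, weight)
    # Add a self-loop to every node so that it keeps its own features.
    edges = list(edges) + [(i, i) for i in range(n)]
    # Degree of each node, counting the self-loop.
    degree = [0] * n
    for _, dst in edges:
        degree[dst] += 1
    # Steps 2 and 3: sum incoming features, each scaled by 1/sqrt(d_src * d_dst).
    out = [[0.0] * len(weight[0]) for _ in range(n)]
    for src, dst in edges:
        scale = 1.0 / math.sqrt(degree[src] * degree[dst])
        for k, value in enumerate(xw[src]):
            out[dst][k] += scale * value
    return out

# Hypothetical 3-node path graph 0 - 1 - 2, stored in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible

for row in gcn_layer(x, edges, weight):
    print([round(v, 3) for v in row])
```

With an identity weight matrix, each output row is a degree-weighted blend of the node's own features and those of its neighbors. The middle node, which has the highest degree, contributes to its neighbors with a smaller weight than a plain sum would give it.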



In [6]:
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(GCN, self).__init__()
        # Define the GCN layers
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # First GCN layer + ReLU activation
        x = self.conv1(x, edge_index)
        x = F.relu(x)

        # Second GCN layer + log_softmax, which yields per-class log-probabilities
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

In [7]:
# Set up the model, optimizer, and device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN(in_channels=dataset.num_node_features, hidden_channels=16, out_channels=dataset.num_classes).to(device)
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

In [8]:
def train():
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])  # Only use training nodes
    loss.backward()
    optimizer.step()
    return loss.item()
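The loss in train() combines the model's log_softmax output with a negative log-likelihood over the training nodes only, which together amount to a masked cross-entropy. The sketch below reproduces that arithmetic in plain Python for a few made-up scores. The function names mirror the PyTorch ones for readability but are local stand-ins, and the logits, targets, and mask are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, targets, mask):
    """Mean negative log-likelihood over the rows selected by mask."""
    picked = [-row[t] for row, t, keep in zip(log_probs, targets, mask) if keep]
    return sum(picked) / len(picked)

# Hypothetical scores for 3 nodes and 2 classes; only nodes 0 and 2 are training nodes.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
targets = [0, 1, 1]
train_mask = [True, False, True]

log_probs = [log_softmax(row) for row in logits]
loss = nll_loss(log_probs, targets, train_mask)
print(round(loss, 4))
```

Node 1 still takes part in the forward pass, so its features can reach its neighbors through the graph, but its label never enters the loss. That is what makes the setup semi-supervised.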


In [9]:
@torch.no_grad()  # evaluation needs no gradients
def test():
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Get predicted class
    accs = []
    for mask in [data.train_mask, data.val_mask, data.test_mask]:
        correct = (pred[mask] == data.y[mask]).sum()
        acc = int(correct) / int(mask.sum())
        accs.append(acc)
    return accs  # Returns training, validation, and test accuracy
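Because test() returns all three accuracies, one common way to use them is to pick the epoch with the best validation accuracy and quote the test accuracy from that epoch only, so that the test set does not guide model selection. The notebook itself does not do this. The sketch below is one possible way to do it, and the helper name and the accuracy values are hypothetical.

```python
def select_by_validation(history):
    """Return the (epoch, val_acc, test_acc) record with the highest validation accuracy.

    Ties go to the earliest epoch.
    """
    best = None
    for epoch, val_acc, test_acc in history:
        if best is None or val_acc > best[1]:
            best = (epoch, val_acc, test_acc)
    return best

# Hypothetical (epoch, validation accuracy, test accuracy) records, not results from this notebook.
history = [
    (10, 0.70, 0.72),
    (20, 0.76, 0.75),
    (30, 0.76, 0.78),
    (40, 0.74, 0.79),
]

epoch, val_acc, test_acc = select_by_validation(history)
print(f"Best validation epoch: {epoch}, val {val_acc:.2f}, test {test_acc:.2f}")
```

In this made-up history the highest test accuracy occurs at epoch 40, but the selected epoch is 20, because the choice is made on validation accuracy alone.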


In [10]:
for epoch in range(1, 201):  # Training for 200 epochs
    loss = train()
    train_acc, val_acc, test_acc = test()
    if epoch % 10 == 0:
        print(f"Epoch: {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, "
              f"Val Acc: {val_acc:.4f}, Test Acc: {test_acc:.4f}")

Epoch: 010, Loss: 0.6314, Train Acc: 0.9857, Val Acc: 0.7740, Test Acc: 0.7860
Epoch: 020, Loss: 0.1008, Train Acc: 1.0000, Val Acc: 0.7700, Test Acc: 0.7870
Epoch: 030, Loss: 0.0244, Train Acc: 1.0000, Val Acc: 0.7680, Test Acc: 0.7840
Epoch: 040, Loss: 0.0133, Train Acc: 1.0000, Val Acc: 0.7700, Test Acc: 0.7940
Epoch: 050, Loss: 0.0124, Train Acc: 1.0000, Val Acc: 0.7700, Test Acc: 0.7970
Epoch: 060, Loss: 0.0142, Train Acc: 1.0000, Val Acc: 0.7760, Test Acc: 0.8040
Epoch: 070, Loss: 0.0161, Train Acc: 1.0000, Val Acc: 0.7780, Test Acc: 0.8060
Epoch: 080, Loss: 0.0168, Train Acc: 1.0000, Val Acc: 0.7780, Test Acc: 0.8070
Epoch: 090, Loss: 0.0164, Train Acc: 1.0000, Val Acc: 0.7780, Test Acc: 0.8050
Epoch: 100, Loss: 0.0155, Train Acc: 1.0000, Val Acc: 0.7820, Test Acc: 0.8060
Epoch: 110, Loss: 0.0145, Train Acc: 1.0000, Val Acc: 0.7800, Test Acc: 0.8030
Epoch: 120, Loss: 0.0138, Train Acc: 1.0000, Val Acc: 0.7800, Test Acc: 0.8040
Epoch: 130, Loss: 0.0131, Train Acc: 1.0000, Val Acc