# Continuation from GNN_Hands_on_Part_1

## Previously we explored:

1. Introduction to Graph Data Structure in PyG
2. Homogeneous Data Example
3. Loading a Default Dataset
4. Creating Custom Graph Data
5. Splitting the Data
6. Working with Mini-Batches

#### Cora Dataset

The Cora dataset is a well-known benchmark for graph neural networks, particularly for node classification. It consists of scientific publications, represented as nodes, and the citations between them, represented as edges.

Components of the Cora Dataset

Nodes:
- Each node represents a paper in the dataset
- There are a total of 2,708 papers (nodes) in the Cora dataset

Edges:
- Each edge represents a citation from one paper to another
- The original dataset contains 5,429 citation links. PyG's `Planetoid` version stores the graph as undirected, listing each link in both directions, so `data.num_edges` reports 10,556.

Node Features:
- Each paper has a feature vector that describes its content, typically based on the words in the paper.
- The Cora dataset has 1,433 unique features derived from the words in the papers, which are represented in a binary format (1 if a word is present, 0 if not).

Labels:
- Each node (paper) belongs to one of seven classes. These classes represent different topics or categories of research.
- The classes are: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory.
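
To make the components above concrete, here is a minimal sketch in plain Python that builds the same ingredients for a made-up three-paper corpus: binary bag-of-words features, an edge list in the two-row layout PyG uses for `edge_index`, and integer labels. The vocabulary, papers, citations, and labels are invented for illustration and are not taken from Cora.

```python
# Toy illustration of Cora-style data. All values here are made up.
vocabulary = ["graph", "neural", "genetic", "policy"]

# Words that appear in each hypothetical paper
papers = [
    {"graph", "neural"},
    {"genetic"},
    {"neural", "policy"},
]

# Binary bag-of-words features: 1 if the word occurs in the paper, else 0
features = [[1 if word in paper else 0 for word in vocabulary] for paper in papers]

# Citations as (citing, cited) pairs
citations = [(0, 1), (2, 0)]

# Edges in COO layout: two parallel lists (sources, targets).
# An undirected graph lists every link in both directions.
sources = [u for u, v in citations] + [v for u, v in citations]
targets = [v for u, v in citations] + [u for u, v in citations]
edge_index = [sources, targets]

# One integer class label per paper
labels = [0, 1, 0]

print(features)    # one row per paper, one column per vocabulary word
print(edge_index)  # two rows, each of length 2 * number of citations
```

Because every link is stored twice, an undirected graph has twice as many columns in `edge_index` as it has links.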

In [18]:
# Import necessary libraries
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import DataLoader

# Load the Cora dataset
dataset = Planetoid(root='/tmp/Cora', name='Cora')
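data = dataset[0]  # Cora contains a single graph

# Create a DataLoader for mini-batching
# DataLoader expects a dataset (a collection of graphs), not a single Data object.
# Cora holds only one graph, so each pass yields a single batch regardless of batch_size.
loader = DataLoader(dataset, batch_size=64, shuffle=True)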

## Table of Contents for the Current Notebook

7. Creating a GNN Model
8. Defining Train and Evaluation Methods
9. Demonstrating Node Classification Task
10. Demonstrating Graph Classification Task

## 7. Creating a GNN Model

To work with graph data, we need to define a Graph Neural Network (GNN) model. PyG provides a variety of GNN layers that can be used to build models for node classification, graph classification, and other tasks.

We'll cover:
- Choosing the appropriate GNN layer (e.g., `GCNConv`, `GATConv`, etc.)
- Defining the GNN architecture (input, hidden, and output layers)
- Incorporating activation functions, dropout, and other regularization techniques
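
Before choosing a layer, it can help to see what a graph convolution does to the features. The sketch below is a plain-Python approximation of the propagation step behind `GCNConv`: add self-loops, then average neighbor features with the symmetric normalization 1 / sqrt(deg(i) * deg(j)). It deliberately omits the learnable weight matrix and bias, and the helper name `gcn_propagate` and the toy graph are assumptions made for illustration.

```python
import math

def gcn_propagate(features, edges):
    """One GCN-style propagation step, without the learnable weights."""
    n = len(features)
    # Neighbor sets with self-loops, treating every edge as undirected
    neighbors = {i: {i} for i in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {i: len(neighbors[i]) for i in range(n)}

    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            # Symmetric normalization keeps high-degree nodes from dominating
            norm = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += norm * features[j][k]
        out.append(row)
    return out

# Path graph 0 - 1 - 2 with a single feature that is non-zero only on node 0
toy_features = [[1.0], [0.0], [0.0]]
toy_edges = [(0, 1), (1, 2)]
propagated = gcn_propagate(toy_features, toy_edges)
print(propagated)
```

After one step, node 1 has picked up part of node 0's signal while node 2 is still at zero. Stacking a second layer, as the model in this section does, lets information travel two hops.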

In [20]:
# Import necessary libraries
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Define the Graph Convolutional Network (GCN)
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        # Define the layers of the GCN
        self.conv1 = GCNConv(input_dim, hidden_dim)  # First GCN layer
        self.conv2 = GCNConv(hidden_dim, output_dim)  # Second GCN layer

    def forward(self, x, edge_index):
        # Forward pass through the first GCN layer
        x = self.conv1(x, edge_index)
        x = F.relu(x)  # Apply ReLU non-linearity
        # Forward pass through the second GCN layer
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)  # Apply log softmax for multi-class classification

# Example of initializing the model
input_dim = dataset.num_features  # Number of node features (from the dataset)
hidden_dim = 16  # Arbitrary number of hidden units
output_dim = dataset.num_classes  # Number of classes (from the dataset)

# Initialize the GCN model
model = GCN(input_dim, hidden_dim, output_dim)

# Display the model architecture
print(model)


GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)


## 8. Defining Train and Evaluation Methods

Once we have our model, it's important to define methods for training and evaluating its performance. In this section, we'll build the core training loop and evaluation function.

We'll explore:
- Setting up a loss function (e.g., Cross Entropy for classification)
- Optimizing with an appropriate optimizer (e.g., Adam)
- Tracking performance metrics (e.g., accuracy, loss)
- Implementing the training loop with backpropagation
- Evaluating the model on validation and test sets
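
The loss and accuracy in this section are computed only on the nodes selected by a boolean mask. The following plain-Python sketch shows that arithmetic on made-up scores: log-softmax turns raw scores into log-probabilities, the negative log-likelihood averages the log-probability of the true class over masked nodes, and accuracy counts matching predictions. The helper names and the numbers are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one node's class scores."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum_exp for z in logits]

def masked_nll(log_probs, labels, mask):
    """Mean negative log-likelihood over the nodes selected by the mask."""
    losses = [-row[y] for row, y, keep in zip(log_probs, labels, mask) if keep]
    return sum(losses) / len(losses)

def masked_accuracy(log_probs, labels, mask):
    """Fraction of masked nodes whose highest-scoring class matches the label."""
    hits = [
        max(range(len(row)), key=row.__getitem__) == y
        for row, y, keep in zip(log_probs, labels, mask)
        if keep
    ]
    return sum(hits) / len(hits)

# Made-up scores for three nodes and two classes
logits = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
labels = [0, 1, 1]
train_mask = [True, True, False]  # only the first two nodes contribute

log_probs = [log_softmax(row) for row in logits]
loss = masked_nll(log_probs, labels, train_mask)
accuracy = masked_accuracy(log_probs, labels, train_mask)
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Applying log-softmax and then the negative log-likelihood is what a cross-entropy loss does in a single step, which is why a model that already returns log-probabilities is usually paired with a negative log-likelihood loss.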

##### Functions

In [16]:
from sklearn.metrics import accuracy_score

# Define the training function
def train(model, data, optimizer, criterion):
    model.train()  # Set the model to training mode
    optimizer.zero_grad()  # Clear gradients
    # Forward pass
    out = model(data.x, data.edge_index)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Calculate the loss on the training nodes
    loss.backward()  # Backpropagate the gradients
    optimizer.step()  # Update model parameters
    return loss.item()  # Return the loss value for logging

# Define the evaluation function
def evaluate(model, data, mask=None):
    model.eval()  # Set the model to evaluation mode
    if mask is None:
        mask = data.test_mask  # Default to the test nodes; pass data.val_mask for validation
    with torch.no_grad():  # Disable gradient computation for evaluation
        out = model(data.x, data.edge_index)
        # Get predictions for the nodes selected by the mask
        pred = out[mask].argmax(dim=1)
        # Calculate accuracy by comparing predictions to true labels
        accuracy = accuracy_score(data.y[mask].cpu(), pred.cpu())
    return accuracy  # Return the accuracy value for logging

## 9. Demonstrating Node Classification Task

Node classification is a common task in GNNs, where the goal is to predict labels for individual nodes in a graph.

We'll demonstrate:
- How to set up a node classification task using a default or custom dataset
- Training the GNN model to classify nodes
- Evaluating performance using accuracy or other relevant metrics


In [15]:
# Import additional libraries
import torch.optim as optim

# Example: Setting up the optimizer and loss criterion
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
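# The model already returns log-probabilities (log_softmax), so NLLLoss is the matching loss;
# CrossEntropyLoss would apply log_softmax a second time (same value, redundant work)
criterion = torch.nn.NLLLoss()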

# Example: Training loop
for epoch in range(100):  # Train for 100 epochs
    loss = train(model, data, optimizer, criterion)
    if epoch % 10 == 0:  # Log every 10 epochs
        print(f'Epoch {epoch:03d}, Loss: {loss:.4f}')
        
# Example: Evaluating the model
accuracy = evaluate(model, data)
print(f'Test Accuracy: {accuracy:.4f}')


Epoch 000, Loss: 0.0162
Epoch 010, Loss: 0.0122
Epoch 020, Loss: 0.0098
Epoch 030, Loss: 0.0086
Epoch 040, Loss: 0.0080
Epoch 050, Loss: 0.0076
Epoch 060, Loss: 0.0073
Epoch 070, Loss: 0.0072
Epoch 080, Loss: 0.0070
Epoch 090, Loss: 0.0070
Test Accuracy: 0.8040


## 10. Demonstrating Graph Classification Task

Graph classification is another important task, where the aim is to predict a label for an entire graph rather than for individual nodes. We use the MUTAG dataset, loaded through PyG's `TUDataset` class:

https://pytorch-geometric.readthedocs.io/en/2.5.2/generated/torch_geometric.datasets.TUDataset.html#torch_geometric.datasets.TUDataset

In [None]:
import torch
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

# Load the MUTAG dataset
dataset = TUDataset(root='data/MUTAG', name='MUTAG')

# Print dataset properties
print(f'Dataset: {dataset.name}')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of classes: {dataset.num_classes}')
print(f'Number of node features: {dataset.num_node_features}')



##### Split the Dataset

In [44]:
# Define the sizes for splitting
train_size = int(0.7 * len(dataset))  # 70% for training
val_size = int(0.15 * len(dataset))    # 15% for validation
test_size = len(dataset) - train_size - val_size  # 15% for testing

# Shuffle the dataset and create indices for splitting
torch.manual_seed(42)  # For reproducibility
indices = torch.randperm(len(dataset)).tolist()

# Create train, validation, and test datasets
train_data = dataset[indices[:train_size]]
val_data = dataset[indices[train_size:train_size + val_size]]
test_data = dataset[indices[train_size + val_size:]]

# Create loaders for train, validation, and test splits
batch_size = 32  # Assumed value; the original cell uses batch_size without showing its definition
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
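
As a quick check of the split arithmetic, the sketch below reproduces the 70/15/15 sizes for MUTAG's 188 graphs using only the standard library. It uses `random.Random(42)` rather than `torch.randperm`, so the actual permutation differs from the one in the notebook; only the sizes and the disjointness of the three index groups carry over.

```python
import random

num_graphs = 188  # MUTAG contains 188 graphs

train_size = int(0.7 * num_graphs)
val_size = int(0.15 * num_graphs)
test_size = num_graphs - train_size - val_size  # the remainder goes to the test split

# Shuffle indices with a fixed seed, then slice them into three disjoint groups
indices = list(range(num_graphs))
random.Random(42).shuffle(indices)
train_idx = indices[:train_size]
val_idx = indices[train_size:train_size + val_size]
test_idx = indices[train_size + val_size:]

print(train_size, val_size, test_size)  # 131 28 29
```

Because `int()` truncates, the training and validation sizes round down and the test split absorbs the leftover graph.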

##### Create the GNN Model

In [54]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv
from torch.nn import Sequential, Linear, ReLU, Dropout
from torch_geometric.nn import global_mean_pool 

class GNNModel(torch.nn.Module):
    def __init__(self, num_node_features, num_classes):
        super(GNNModel, self).__init__()
        self.conv1 = SAGEConv(num_node_features, 64)   # Change to GCNConv to see a comparison
        self.conv2 = SAGEConv(64, 64)                  # Change to GCNConv to see a comparison
        self.fc = Linear(64, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)  # Global pooling
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
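
The pooling step is what turns per-node embeddings into one vector per graph. In a mini-batch, PyG concatenates the nodes of all graphs and records each node's graph in the `batch` vector. The plain-Python sketch below mimics what `global_mean_pool` computes from those two inputs; the function name `global_mean_pool_sketch` and the numbers are made up, and it assumes every graph id from 0 to `max(batch)` owns at least one node.

```python
def global_mean_pool_sketch(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is the graph id of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for embedding, graph_id in zip(node_embeddings, batch):
        counts[graph_id] += 1
        for k, value in enumerate(embedding):
            sums[graph_id][k] += value
    return [[total / counts[g] for total in sums[g]] for g in range(num_graphs)]

# Five nodes from a mini-batch of two graphs:
# nodes 0-1 belong to graph 0, nodes 2-4 belong to graph 1
node_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]
batch = [0, 0, 1, 1, 1]

pooled = global_mean_pool_sketch(node_embeddings, batch)
print(pooled)  # [[2.0, 3.0], [3.0, 3.0]]
```

The result has one row per graph regardless of how many nodes each graph contains, which is what lets the final linear layer produce one prediction per graph.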

In [55]:
def train(model, loader, optimizer, criterion):
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        out = model(batch)
        loss = criterion(out, batch.y)
        loss.backward()
        optimizer.step()

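def evaluate(model, loader):
    model.eval()  # Disable dropout during evaluation
    correct = 0
    with torch.no_grad():  # Gradients are not needed for evaluation
        for batch in loader:
            out = model(batch)
            pred = out.argmax(dim=1)  # Predicted class per graph
            correct += (pred == batch.y).sum().item()
    return correct / len(loader.dataset)  # Accuracy over all graphs in the split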


In [57]:
# Initialize model, optimizer, and loss function
model = GNNModel(num_node_features=dataset.num_node_features, num_classes=dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# The model already returns log-probabilities (log_softmax), so NLLLoss is the matching loss
criterion = torch.nn.NLLLoss()

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    train(model, train_loader, optimizer, criterion)
    train_acc = evaluate(model, train_loader)
    val_acc = evaluate(model, val_loader)
    print(f'Epoch {epoch + 1}/{num_epochs}, Train Accuracy: {train_acc:.4f}, Validation Accuracy: {val_acc:.4f}')

# Evaluate on the test set after training
test_acc = evaluate(model, test_loader)
print(f'Test Accuracy: {test_acc:.4f}')


Epoch 1/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 2/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 3/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 4/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 5/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 6/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 7/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 8/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 9/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 10/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 11/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 12/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 13/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 14/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 15/50, Train Accuracy: 0.7099, Validation Accuracy: 0.6071
Epoch 16/50, Train Accuracy: 0.709