### BITS Lab details:
**Launch ID:** https://cloudlabs.nuvepro.com/subscriptions/launch?id=2927862

**Path:** http://localhost:8888/notebooks/Desktop/Persistent_Folder/GNN_Assignment_2_Group_20.ipynb

### **Graph Neural Networks Group 20 Assignment 2 Submission:**

1. Hemant Kumar Parakh (2023AA05741)

2. Sushil Kumar (2023aa05849)

3. Jitendra Kumar (2023aa05198)

4. MAREEDU RAVI KISHORE VARMA (2023aa05278)

5. K. KAMALAHASAN (2023ab05086)

All team members contributed equally to the assignment.



## Problem Statement:

Predict the label of each graph using a model designed as per the details below.

Generate graph embeddings using the research paper "Anonymous Walk Embeddings" by Sergey Ivanov and Evgeny Burnaev.

Use a suitable neural network to predict the label.

Optimize the entire model pipeline for prediction.

Dataset: ogbg-molhiv (https://ogb.stanford.edu/docs/graphprop/#ogbg-mol). Read the dataset details before working on this dataset.


### Required libs installation

In [None]:
#install
!pip install torch torch-geometric ogb networkx scikit-learn tqdm

Defaulting to user installation because normal site-packages is not writeable


### Import necessary libs

In [None]:
#imports
import torch
import torch.nn.functional as F
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch.utils.data import DataLoader  # plain PyTorch loader: the batches below come from a TensorDataset, not PyG graphs
import networkx as nx
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from torch.optim.lr_scheduler import ReduceLROnPlateau


### Anonymous Walk Embeddings

In [None]:
# --- Anonymous Walk Embeddings (Graph-Level) ---
# Generate graph embeddings following the paper "Anonymous Walk Embeddings" (Sergey Ivanov, Evgeny Burnaev)

def anonymous_walk_embeddings(graph, walk_length=4, num_walks=100):
    """Simplified Anonymous Walk Embeddings."""
    walk_counts = {}    # Dictionary to store walk frequencies
    for _ in range(num_walks):
        start_node = np.random.choice(list(graph.nodes()))    # Random starting node
        walk = [start_node]
        for _ in range(walk_length - 1):
            neighbors = list(graph.neighbors(walk[-1]))
            if neighbors:
                next_node = np.random.choice(neighbors)
                walk.append(next_node)
            else:
                break   # Stop if no neighbour available
        anonymous_walk = []
        node_counts = {}
        for node in walk:
            if node not in node_counts:
                node_counts[node] = len(node_counts)
            anonymous_walk.append(node_counts[node])
        anonymous_walk_tuple = tuple(anonymous_walk)
        # Count how often each anonymous pattern occurs
        walk_counts[anonymous_walk_tuple] = walk_counts.get(anonymous_walk_tuple, 0) + 1

    # NOTE: counts are laid out in the order patterns were first seen in THIS graph,
    # so position i is not guaranteed to refer to the same pattern in another graph.
    embedding = np.array(list(walk_counts.values()), dtype=float)
    if len(embedding) == 0:
        return np.zeros(10)
    # L2-normalize, guarding against a zero vector
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding

# Convert dataset graphs to networkX format
def graph_to_nx(data):
    """Converts PyTorch Geometric Data to NetworkX Graph."""
    edge_list = data.edge_index.cpu().numpy().T
    graph = nx.Graph()
    graph.add_edges_from(edge_list)
    return graph

# Generate graph embeddings
def generate_graph_embeddings(dataset, walk_length=6, num_walks=200):
    """Generates embeddings for a PyTorch Geometric dataset with padding/truncating."""
    embeddings = []
    max_length = 0
    for data in tqdm(dataset, desc="Generating Embeddings (Finding Max Length)"):
        graph = graph_to_nx(data)
        embedding = anonymous_walk_embeddings(graph, walk_length, num_walks)
        max_length = max(max_length, len(embedding))

    for data in tqdm(dataset, desc="Generating Embeddings (Padding/Truncating)"):
        graph = graph_to_nx(data)
        embedding = anonymous_walk_embeddings(graph, walk_length, num_walks)
        if len(embedding) < max_length:
            padded_embedding = np.pad(embedding, (0, max_length - len(embedding)))
            embeddings.append(padded_embedding)
        elif len(embedding) > max_length:
            truncated_embedding = embedding[:max_length]
            embeddings.append(truncated_embedding)
        else:
            embeddings.append(embedding)

    return np.array(embeddings)

# Use a suitable neural network to predict the label.
class GraphClassifier(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, num_classes, dropout=0.5):
        super(GraphClassifier, self).__init__()
        self.fc1 = torch.nn.Linear(input_dim, hidden_dim1)
        self.bn1 = torch.nn.BatchNorm1d(hidden_dim1)
        self.fc2 = torch.nn.Linear(hidden_dim1, hidden_dim2)
        self.bn2 = torch.nn.BatchNorm1d(hidden_dim2)
        self.fc3 = torch.nn.Linear(hidden_dim2, num_classes)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

### Training & Evaluation of the model

In [None]:
# --- Training and Evaluation---

def train_and_evaluate_graph_classification(dataset, walk_length=4, num_walks=10, hidden_dim1=126, hidden_dim2=68, epochs=100, lr=0.001, batch_size=32, dropout=0.4):
    """Trains and evaluates a graph classification model with optimization."""

    # generate walk embeddings
    embeddings = generate_graph_embeddings(dataset, walk_length, num_walks)
    labels = dataset.data.y.numpy()

    # Train and test split
    X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=42)

    # move to tensor
    X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

    train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
    test_dataset = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    input_dim = embeddings.shape[1]
    num_classes = 1  # Binary classification for ogbg-molhiv

    # create model with optimizer and scheduler
    model = GraphClassifier(input_dim, hidden_dim1, hidden_dim2, num_classes, dropout)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = torch.nn.BCEWithLogitsLoss()
    scheduler = ReduceLROnPlateau(optimizer, mode = 'min', patience=5, factor=0.5)

    print(model)

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
            x, y = batch
            optimizer.zero_grad()
            out = model(x).squeeze()
            loss = criterion(out, y.squeeze())
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        scheduler.step(avg_loss)
        print(f"Epoch {epoch+1}, Loss: {avg_loss}")

    # Evaluation
    model.eval()
    y_pred = []
    y_true = []
    with torch.no_grad():
        for batch in test_loader:
            x, y = batch
            out = model(x).squeeze()
            predicted = torch.sigmoid(out)
            y_pred.extend(predicted.cpu().numpy())
            y_true.extend(y.squeeze().cpu().numpy())

    evaluator = Evaluator(name='ogbg-molhiv')
    input_dict = {"y_true": np.array(y_true).reshape(-1, 1), "y_pred": np.array(y_pred).reshape(-1, 1)}
    result_dict = evaluator.eval(input_dict)
    print(f"\nTest ROC-AUC: {result_dict['rocauc']}")

if __name__ == "__main__":
    # Load dataset from OGB
    dataset = PygGraphPropPredDataset(name='ogbg-molhiv')
    # perform training and evaluation
    train_and_evaluate_graph_classification(dataset)

Generating Embeddings (Finding Max Length): 100%|█| 41127/41127 [00:58<00:00, 69
Generating Embeddings (Padding/Truncating): 100%|█| 41127/41127 [01:08<00:00, 59


GraphClassifier(
  (fc1): Linear(in_features=5, out_features=126, bias=True)
  (bn1): BatchNorm1d(126, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=126, out_features=68, bias=True)
  (bn2): BatchNorm1d(68, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=68, out_features=1, bias=True)
  (dropout): Dropout(p=0.4, inplace=False)
)
Epoch 1, Loss: 0.1773286260722388
Epoch 2, Loss: 0.15931421497328058
Epoch 3, Loss: 0.15734669326852546
Epoch 4, Loss: 0.15707311226621604
Epoch 5, Loss: 0.15662690482316838
Epoch 6, Loss: 0.15658788701593587
Epoch 7, Loss: 0.15530141433782377
Epoch 8, Loss: 0.1548847924805707
Epoch 9, Loss: 0.1552758668490529
Epoch 10, Loss: 0.1538105148014742
Epoch 11, Loss: 0.15430625910763027
Epoch 12, Loss: 0.15414614770543356
Epoch 13, Loss: 0.1544528148916303
Epoch 14, Loss: 0.1534195145333242
Epoch 15, Loss: 0.1535075241147719
Epoch 16, Loss: 0.15295103546394898
Epoch 17, Loss: 0.1

## Justification for Model Performance:

We tried all of the enhancements below to improve model performance; despite these changes, the model remained stuck at a ROC-AUC of roughly 50-51%.

1. **Leaky ReLU**: Replaced ReLU with Leaky ReLU, but performance dropped to 49%.
2. **Increased num_walks**: Raised it from 10 → 100 → 500, but no improvement.
3. **Changed Optimizer**: Switched from Adam to AdamW, but still got 50%.
4. **Loss Function Adjustments**: Tweaked BCEWithLogitsLoss, but no change.
5. **Batch Normalization & Dropout Tweaks**: Made changes, but no effect.
6. **Different Learning Rates**: No improvement.
7. **Refined Anonymous Walk Embeddings**: Fixed padding, truncation, and scaling, but no boost.
8. **Standardized Graph Embeddings**: Used StandardScaler, but performance stayed at 50-51%.
9. **Fixed Dataset Issues**: Resolved import errors, value errors, and file loading issues.
10. **Fixed Shape Mismatch in Loss Function**: Adjusted tensor shapes, but no effect.

A) Despite extensive hyperparameter tuning, the model's ROC-AUC score remained low (~0.5187).

B) The Anonymous Walk Embeddings may not fully capture graph structure, leading to poor feature representation.

C) Further improvements could involve using advanced GNN architectures (e.g., GCN, GraphSAGE) and better embeddings.
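
One factor that may contribute to the near-random score (an observation about the code above, not something verified by a re-run) is that `anonymous_walk_embeddings` stores counts in the order in which patterns are first seen in each graph, so the same vector position can refer to different anonymous walks in different graphs. The sketch below shows a fixed-vocabulary variant using only the standard library. The helper names `anonymize`, `enumerate_anonymous_walks`, and `awe_vector`, the adjacency-list graph format, and the choice to normalize to frequencies (the notebook uses L2 normalization) are all assumptions made for illustration and are not part of the submitted notebook.

```python
import random
from collections import Counter


def anonymize(walk):
    """Map a node walk to its anonymous pattern, e.g. [7, 3, 7, 9] -> (0, 1, 0, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen)) for node in walk)


def enumerate_anonymous_walks(length):
    """All anonymous walks with `length` nodes and no immediate repeats
    (a simple graph without self-loops cannot stay on the same node)."""
    patterns = [(0,)]
    for _ in range(length - 1):
        extended = []
        for p in patterns:
            for nxt in range(max(p) + 2):  # revisit a known node or open a new one
                if nxt != p[-1]:
                    extended.append(p + (nxt,))
        patterns = extended
    return patterns


def awe_vector(adj, length=4, num_walks=200, rng=None):
    """Anonymous-walk frequencies in a fixed, graph-independent order."""
    rng = rng or random.Random(0)
    vocab = {p: i for i, p in enumerate(enumerate_anonymous_walks(length))}
    nodes = list(adj)
    counts = Counter()
    for _ in range(num_walks):
        walk = [rng.choice(nodes)]
        while len(walk) < length and adj[walk[-1]]:
            walk.append(rng.choice(adj[walk[-1]]))
        if len(walk) == length:  # skip walks cut short at a dead end
            counts[anonymize(walk)] += 1
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for pattern, c in counts.items():
        vec[vocab[pattern]] = c / total
    return vec


# A path graph 0-1-2 and a triangle, given as adjacency lists.
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
vec_path = awe_vector(path)
vec_triangle = awe_vector(triangle)
```

Because every graph shares the same vocabulary, all vectors have the same length without padding, and a given position always counts the same pattern. For example, the position for `(0, 1, 2, 0)` stays at zero for the path graph, which has no cycle, but not for the triangle.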



**ML Design Document: Graph Classification with Anonymous Walk Embeddings**

**1. Introduction**

This document outlines the design and implementation of a machine learning model for graph classification, specifically targeting the ogbg-molhiv dataset. The model leverages anonymous walk embeddings to capture structural information from graphs, followed by a neural network for classification.

**2. Problem Definition**

- **Task:** Graph-level binary classification.
- **Dataset:** ogbg-molhiv from the Open Graph Benchmark (OGB).
- **Goal:** Predict whether a molecule inhibits HIV replication.
- **Evaluation Metric:** ROC-AUC (area under the Receiver Operating Characteristic curve).

**3. Dataset**

- **Dataset:** ogbg-molhiv
- **Source:** Open Graph Benchmark (OGB)
- **Characteristics:**
  - Graph-level binary classification.
  - Molecular graphs.
  - Labels indicate HIV inhibition.
- **Preprocessing:**
  - Conversion of PyTorch Geometric Data objects to NetworkX graphs.
  - Generation of anonymous walk embeddings.
  - Padding/truncating embeddings to a uniform length.
  - Random train/test split (80/20).

**4. Model Architecture**

- **Embedding Generation:**
  - Anonymous walk embeddings: capture structural information by counting occurrences of anonymous walks.
  - NetworkX graphs: used for walk generation.
- **Classification:**
  - Feedforward neural network:
    - Three fully connected layers.
    - Batch normalization after the first two layers.
    - Dropout regularization.
    - A single output logit; the sigmoid is applied inside the loss during training and explicitly at evaluation time.
- **Loss Function:**
  - Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss).
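
The relationship between the output layer and the loss can be illustrated with a small standard-library sketch. In the notebook's code, `GraphClassifier.forward` returns a raw logit, and `BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one step. The functions below are an illustrative re-derivation for a single example, not PyTorch's implementation, and their names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, y):
    """Binary cross-entropy computed directly from a logit z and a label y in {0, 1}.

    Uses the rearrangement max(z, 0) - z*y + log(1 + exp(-|z|)),
    which avoids overflow for large |z|.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_from_probability(p, y):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


logit, label = 1.5, 1.0
loss_stable = bce_with_logits(logit, label)
loss_naive = bce_from_probability(sigmoid(logit), label)  # same value, two steps
```

Both routes give the same loss for moderate logits. The single-step form stays finite even for an extreme logit such as 1000, where computing `log(1 - sigmoid(z))` separately would fail.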

**5. Training and Optimization**

- **Optimizer:** Adam.
- **Learning Rate Scheduler:** ReduceLROnPlateau.
- **Weight Decay:** L2 regularization (`weight_decay=1e-4`).
- **Training Procedure:**
  - Data loading with DataLoader.
  - Forward pass, loss calculation, backward pass, and optimizer step.
  - Learning rate scheduling: the code steps the scheduler on the average training loss; scheduling on a validation loss is not implemented but can be added.
  - Epoch-based training.
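
The effect of a plateau-based scheduler can be sketched in a few lines of plain Python. This is a simplified illustration of the idea behind `ReduceLROnPlateau` (reduce the learning rate once the monitored loss has failed to improve for more than `patience` epochs). It omits the threshold and cooldown options of the real class, and the class name `PlateauHalver` is hypothetical.

```python
class PlateauHalver:
    """Simplified plateau scheduler: multiply lr by `factor` once the monitored
    loss has failed to improve for more than `patience` consecutive epochs."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


sched = PlateauHalver(lr=0.001, patience=5, factor=0.5)
# Two improving epochs followed by six epochs without improvement.
losses = [0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16]
history = [sched.step(loss) for loss in losses]
```

With `patience=5`, the learning rate is halved on the sixth consecutive epoch without improvement. A held-out validation loss is a common choice for the monitored quantity.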

**6. Evaluation**

- **Metric:** ROC-AUC.
- **Procedure:**
  - Model evaluation on the test set.
  - Prediction of probabilities using the sigmoid function.
  - Calculation of ROC-AUC using the OGB evaluator.

**7. Implementation Details**

- **Programming Language:** Python.
- **Libraries:** PyTorch, PyTorch Geometric (PyG), NetworkX, OGB, NumPy, scikit-learn, tqdm.
- **Hardware:** CPU/GPU (GPU acceleration can be used for faster training).

**8. Outcome**

- **Expected Results:**
  - A trained model capable of predicting HIV inhibition based on molecular graph structure.
  - ROC-AUC score indicating the model's performance.
- **Observed Results:**
  - The code successfully generated graph embeddings, trained a neural network, and evaluated the model using the OGB evaluator.
  - The model produced a ROC-AUC of about 0.5187 (see the justification section), which is close to random guessing.
  - Given the simplicity of the anonymous walk embedding and the basic neural network, the performance is far from state of the art, although the pipeline runs end to end.
- **Potential Improvements:**
  - Hyperparameter tuning.
  - Experimentation with different neural network architectures (e.g., Graph Neural Networks (GNNs)).
  - More sophisticated anonymous walk embedding methods.
  - Adding a validation set for better hyperparameter tuning and early stopping.
  - Feature engineering.
  - More complex graph embedding techniques such as graph2vec, or other more recent techniques.
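
The validation-set and early-stopping idea mentioned above can be sketched without any framework. The helper name `early_stopping_epoch` and the score values are hypothetical and serve only to illustrate the stopping rule; in practice the scores would come from evaluating the model on a held-out validation split after each epoch.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (best_epoch, best_score) for per-epoch validation scores
    (higher is better), stopping once `patience` epochs pass without a new best."""
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Hypothetical validation ROC-AUC values, for illustration only.
scores = [0.50, 0.52, 0.55, 0.54, 0.53, 0.56, 0.55, 0.54, 0.53]
best_epoch, best_score = early_stopping_epoch(scores, patience=3)
```

Here training would stop after three epochs without improvement, and the weights saved at epoch index 5 would be the ones to evaluate on the test set.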

**9. Code Structure**

- `anonymous_walk_embeddings`: Generates anonymous walk embeddings.
- `graph_to_nx`: Converts PyG Data to NetworkX graphs.
- `generate_graph_embeddings`: Generates and pads/truncates embeddings.
- `GraphClassifier`: Neural network model.
- `train_and_evaluate_graph_classification`: Training and evaluation function.
- main (`if __name__ == "__main__":`): Loads dataset and runs training/evaluation.

**10. Conclusion**

This design document provides an overview of the graph classification model using anonymous walk embeddings. The implementation runs end to end, but the observed ROC-AUC (about 0.5187) is close to random guessing, so the approach as implemented does not yet show useful predictive power. Future work can focus on improving performance through hyperparameter tuning, advanced architectures, and more sophisticated embedding techniques.

ROC-AUC is the area under the Receiver Operating Characteristic curve. It is a widely used metric for evaluating the performance of binary classification models. Here is a breakdown of what it means and how it works:

**Understanding the Components:**

- **Receiver Operating Characteristic (ROC) Curve:**
  - This curve visualizes the performance of a binary classifier as its discrimination threshold is varied.
  - It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
  - TPR (Sensitivity or Recall): The proportion of actual positive cases that are correctly identified. (TPR = True Positives / (True Positives + False Negatives))
  - FPR (1 - Specificity): The proportion of actual negative cases that are incorrectly identified as positive. (FPR = False Positives / (False Positives + True Negatives))
- **Area Under the Curve (AUC):**
  - The AUC represents the area under the ROC curve.
  - It provides a single scalar value that summarizes the overall performance of the classifier.
  - A higher AUC indicates better performance.
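
The TPR and FPR definitions above translate directly into code. The sketch below computes one ROC point per threshold for a toy example; the function name `tpr_fpr` and the labels and scores are made up for illustration.

```python
def tpr_fpr(y_true, y_score, threshold):
    """True/false positive rates when predicting positive for score >= threshold."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    tn = sum(1 for t, s in pairs if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
roc_points = [tpr_fpr(y_true, y_score, thr) for thr in (0.9, 0.5, 0.3, 0.0)]
```

Lowering the threshold moves the point from (TPR, FPR) = (0, 0) towards (1, 1); the ROC curve is the trace of these points over all thresholds.
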

**Interpretation of ROC-AUC:**

- AUC = 1: Perfect classifier. It ranks every positive case above every negative case.
- AUC = 0.5: Random classifier. Its performance is no better than random guessing.
- 0.5 < AUC < 1: Better than random. The closer the AUC is to 1, the better the model's ability to distinguish between positive and negative classes.

**Why ROC-AUC is Useful:**

- Threshold-Invariant: It evaluates the model's performance across all possible classification thresholds, making it robust to variations in threshold selection.
- Imbalanced Datasets: It is relatively insensitive to class imbalance because TPR and FPR are each computed within a single class, although under heavy imbalance it may look more optimistic than precision-based metrics.
- Overall Performance: It provides a single, easy-to-interpret metric that summarizes the model's ability to discriminate between classes.
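
An equivalent way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs; the function name `roc_auc_pairwise` and the toy data are illustrative, and the quadratic loop is meant for clarity rather than for large datasets.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_true, y_score)             # 3 of 4 pairs ranked correctly
auc_constant = roc_auc_pairwise(y_true, [0.5] * 4)  # uninformative scores
```

Under this reading, a score near 0.5 means the model's ranking of positives against negatives is barely better than chance.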

