## 3. Cora dataset
***
One of the well-known categories of machine learning problems is “supervised learning”. In supervised learning, we are given some information called “input” features about certain objects. For each object, we are also given an “output” or target variable that we are trying to predict about. Our goal is to learn the mapping between the features and the target variable. Typically, there is a portion of data where both input features and target variables are available. This portion of the dataset is called the training set. There is also typically another portion of the dataset where the target variable is missing and we want to predict it. This portion is called the “test set”. When the target variable can take on a finite number of discrete values, we call the problem at hand a “classification” problem. In this project, we are trying to solve a classification problem in settings where some additional information is provided in the form of “graph structure”. In this project we work with “Cora” dataset. Cora consists of a set of 2708 documents that are Machine Learning related papers. Each documents is labeled with one of the following seven classes: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, Theory. For each class, only 20 documents are labeled (a total of 140 for the seven classes). We refer to them as “seed” documents. Each document comes with a set of features about its text content. These features are occurrences of a selection of 1433 words in the vocabulary. We are also given an undirected graph where each node is a document and each edge represents a citation. There are a total of 5429 edges. Our goal is to use the hints from text features as well as from graph connections to classify (assign labels to) these documents. To solve this problem for Cora dataset, we pursue three parallel ideas. Implement each idea and compare.

> Ans: In terms of accuracy, the rank of models of different ideas will be Idea 1 > Idea 3 > Idea 2.

In [1]:
import random

import networkx as nx
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from node2vec import Node2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import NormalizeFeatures
from torch_geometric.utils import to_networkx
from tqdm.auto import tqdm


seed = 0
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

device = torch.device("cuda:2")
device

device(type='cuda', index=2)

### Load Cora Dataset
***

In [2]:
dataset = Planetoid(root="./data/Cora", name="Cora", transform=NormalizeFeatures())

data = dataset[0]
print(f"Dataset: {dataset}:")
print("======================")
print(f"Number of graphs: {len(dataset)}")
print(f"Number of features: {dataset.num_features}")
print(f"Number of classes: {dataset.num_classes}")

print(f"Number of nodes: {data.num_nodes}")
print(f"Number of edges: {data.num_edges}")
print(f"Average node degree: {data.num_edges / data.num_nodes:.2f}")
print(f"Number of training nodes: {data.train_mask.sum()}")
print(f"Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}")
print(f"Contains isolated nodes: {data.contains_isolated_nodes()}")
print(f"Contains self-loops: {data.contains_self_loops()}")
print(f"Is undirected: {data.is_undirected()}")

data = data.to(device)
G = to_networkx(data, to_undirected=True)

Dataset: Cora():
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True


#### QUESTION 23: Idea 1
Use Graph Convolutional Networks [1]. What hyperparameters do you choose to get the optimal performance? How many layers did you choose?

Ref: [1] Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).

> Ans: A three-layer GCN with `hidden_dim=32`, `lr=0.01`, and `weight_decay=0.0005` has the best accuracy of 0.818.

In [3]:
class TwoLayerGCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x


class ThreeLayerGCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv3(x, edge_index)
        return x


class FourLayerGCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, hidden_dim)
        self.conv4 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv3(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv4(x, edge_index)
        return x


def train_and_get_best_score(hidden_dim, lr, weight_decay, n_layer, verbose=False):
    print(f"Try: {hidden_dim=}, {lr=}, {weight_decay=}, {n_layer=}")
    n_epoch = 100
    model_map = {2: TwoLayerGCN, 3: ThreeLayerGCN, 4: FourLayerGCN}
    model = model_map[n_layer](dataset.num_features, hidden_dim, dataset.num_classes).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = torch.nn.CrossEntropyLoss()

    best_valid_acc = 0
    best_test_acc = None

    model.train()

    for epoch in range(n_epoch):
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = criterion(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        model.eval()
        pred = model(data.x, data.edge_index).argmax(dim=1)
        correct = (pred[data.train_mask] == data.y[data.train_mask]).sum()
        train_acc = int(correct) / int(data.train_mask.sum())
        correct = (pred[data.val_mask] == data.y[data.val_mask]).sum()
        valid_acc = int(correct) / int(data.val_mask.sum())
        correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
        test_acc = int(correct) / int(data.test_mask.sum())
        
        if verbose:
            print(f"Epoch {epoch + 1}:")
            print(f"  - Train Accuracy: {train_acc: .5f}")
            print(f"  - Valid Accuracy: {valid_acc: .5f}")
            print(f"  - Test Accuracy: {test_acc: .5f}")

        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
            best_test_acc = test_acc

    print(f"Test accuracy of model with best valid accuracy: {best_test_acc: .5f}\n")
    return best_test_acc


hidden_dims = [16, 32, 64]
lrs = [1e-2, 1e-3, 1e-4]
weight_decays = [5e-4, 5e-5, 0]
n_layers = [2, 3, 4]

best_score = 0
best_params = None

for hidden_dim in hidden_dims:
    for lr in lrs:
        for weight_decay in weight_decays:
            for n_layer in n_layers:
                score = train_and_get_best_score(hidden_dim, lr, weight_decay, n_layer)
                if score > best_score:
                    best_score = score
                    best_params = {
                        "hidden_dim": hidden_dim,
                        "lr": lr,
                        "weight_decay": weight_decay,
                        "n_layer": n_layer,
                    }

best_score, best_params

Try: hidden_dim=16, lr=0.01, weight_decay=0.0005, n_layer=2
Test accuracy of model with best valid accuracy:  0.79600

Try: hidden_dim=16, lr=0.01, weight_decay=0.0005, n_layer=3
Test accuracy of model with best valid accuracy:  0.79400

Try: hidden_dim=16, lr=0.01, weight_decay=0.0005, n_layer=4
Test accuracy of model with best valid accuracy:  0.70000

Try: hidden_dim=16, lr=0.01, weight_decay=5e-05, n_layer=2
Test accuracy of model with best valid accuracy:  0.78100

Try: hidden_dim=16, lr=0.01, weight_decay=5e-05, n_layer=3
Test accuracy of model with best valid accuracy:  0.76900

Try: hidden_dim=16, lr=0.01, weight_decay=5e-05, n_layer=4
Test accuracy of model with best valid accuracy:  0.76400

Try: hidden_dim=16, lr=0.01, weight_decay=0, n_layer=2
Test accuracy of model with best valid accuracy:  0.78500

Try: hidden_dim=16, lr=0.01, weight_decay=0, n_layer=3
Test accuracy of model with best valid accuracy:  0.77900

Try: hidden_dim=16, lr=0.01, weight_decay=0, n_layer=4
Test a

Test accuracy of model with best valid accuracy:  0.78000

Try: hidden_dim=64, lr=0.001, weight_decay=0, n_layer=3
Test accuracy of model with best valid accuracy:  0.79300

Try: hidden_dim=64, lr=0.001, weight_decay=0, n_layer=4
Test accuracy of model with best valid accuracy:  0.78800

Try: hidden_dim=64, lr=0.0001, weight_decay=0.0005, n_layer=2
Test accuracy of model with best valid accuracy:  0.69500

Try: hidden_dim=64, lr=0.0001, weight_decay=0.0005, n_layer=3
Test accuracy of model with best valid accuracy:  0.73500

Try: hidden_dim=64, lr=0.0001, weight_decay=0.0005, n_layer=4
Test accuracy of model with best valid accuracy:  0.63800

Try: hidden_dim=64, lr=0.0001, weight_decay=5e-05, n_layer=2
Test accuracy of model with best valid accuracy:  0.63900

Try: hidden_dim=64, lr=0.0001, weight_decay=5e-05, n_layer=3
Test accuracy of model with best valid accuracy:  0.66600

Try: hidden_dim=64, lr=0.0001, weight_decay=5e-05, n_layer=4
Test accuracy of model with best valid accuracy

(0.818, {'hidden_dim': 32, 'lr': 0.01, 'weight_decay': 0.0005, 'n_layer': 3})

#### QUESTION 24: Idea 2
Extract structure-based node features using Node2Vec [2]. Briefly describe how Node2Vec finds node features. Choose your desired classifier (one of SVM, Neural Network, or Random Forest) and classify the documents using only Node2Vec (graph structure) features. Now classify the documents using only the 1433-dimensional text features. Which one outperforms? Why do you think this is the case? Combine the Node2Vec and text features and train your classifier on the combined features. What is the best classification accuracy you get (in terms of the percentage of test documents correctly classified)?

> Ans: Node2Vec first runs several biased random walks on the graph. Then it takes random walks as sentences to feed into the Word2Vec model. As we can see in the result below, no matter what classifier we use, ones trained on text features outperform those trained on Node2Vec features. We think that it is because text features should contain more useful information for a content-classifying task.
>
> Then, if we combine both features, Random Forest gives the best accuracy of 0.305. However, the performance is still much worse than the model trained on text features only.

In [4]:
node2vec = Node2Vec(G, dimensions=128, walk_length=80, num_walks=100, p=1, q=1, workers=16)
model = node2vec.fit(window=10, min_count=0, sg=1, epochs=10, workers=16)
node_embeddings = model.wv.vectors
node_embeddings.shape

Computing transition probabilities:   0%|          | 0/2708 [00:00<?, ?it/s]

Generating walks (CPU: 2): 100%|██████████| 7/7 [00:12<00:00,  1.82s/it]]
Generating walks (CPU: 1): 100%|██████████| 7/7 [00:13<00:00,  1.94s/it]
Generating walks (CPU: 3): 100%|██████████| 7/7 [00:13<00:00,  1.89s/it]]
Generating walks (CPU: 6): 100%|██████████| 6/6 [00:11<00:00,  1.95s/it]
Generating walks (CPU: 7): 100%|██████████| 6/6 [00:11<00:00,  1.89s/it]]
Generating walks (CPU: 8): 100%|██████████| 6/6 [00:11<00:00,  1.86s/it]
Generating walks (CPU: 5): 100%|██████████| 6/6 [00:13<00:00,  2.19s/it]]
Generating walks (CPU: 4): 100%|██████████| 7/7 [00:13<00:00,  1.98s/it]]
Generating walks (CPU: 9): 100%|██████████| 6/6 [00:12<00:00,  2.01s/it]]
Generating walks (CPU: 10): 100%|██████████| 6/6 [00:12<00:00,  2.04s/it]
Generating walks (CPU: 12): 100%|██████████| 6/6 [00:11<00:00,  1.90s/it]
Generating walks (CPU: 11): 100%|██████████| 6/6 [00:12<00:00,  2.02s/it]
Generating walks (CPU: 13): 100%|██████████| 6/6 [00:11<00:00,  1.96s/it]
Generating walks (CPU: 14): 100%|████████

(2708, 128)

In [4]:
def evaluate(y_test, y_pred):
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="macro")
    return acc, f1

In [6]:
def print_results(clf, X_train, y_train, X_test, y_test):
    model = clf().fit(X_train, y_train)
    y_pred = model.predict(X_train)
    acc, f1 = evaluate(y_train, y_pred)
    print(f"    - Accuracy (train): {acc:.5f}")
    print(f"    - F1 score (train): {f1:.5f}")
    y_pred = model.predict(X_test)
    acc, f1 = evaluate(y_test, y_pred)
    print(f"    - Accuracy (test): {acc:.5f}")
    print(f"    - F1 score (test): {f1:.5f}")


# Mistake 3: Accuracy should be at least 65%
#            which is not stated in the question...
clfs = {"SVM": LinearSVC, "NN": MLPClassifier, "RF": RandomForestClassifier}

for clf_name, clf in clfs.items():
    print(f"{clf_name=}:")
    train_mask = data.train_mask.detach().cpu().numpy()
    test_mask = data.test_mask.detach().cpu().numpy()
    X_train = data.x[train_mask].detach().cpu().numpy()
    y_train = data.y[train_mask].detach().cpu().numpy()
    X_test = data.x[test_mask].detach().cpu().numpy()
    y_test = data.y[test_mask].detach().cpu().numpy()
    
    # node2vec
    model = clf().fit(node_embeddings[train_mask], y_train)
    print("  - Node2Vec:")
    print_results(
        clf,
        node_embeddings[train_mask],
        y_train,
        node_embeddings[test_mask],
        y_test,
    )
    
    # text
    print("  - Text:")
    print_results(clf, X_train, y_train, X_test, y_test)
    
    # node2vec + text
    X_train_all = np.concatenate((X_train, node_embeddings[train_mask]), axis=1)
    X_test_all = np.concatenate((X_test, node_embeddings[test_mask]), axis=1)
    
    print("  - Node2Vec + Text:")
    print_results(clf, X_train_all, y_train, X_test_all, y_test)

clf_name='SVM':
  - Node2Vec:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.15700
    - F1 score (test): 0.13597
  - Text:
    - Accuracy (train): 0.99286
    - F1 score (train): 0.99285
    - Accuracy (test): 0.58400
    - F1 score (test): 0.56144
  - Node2Vec + Text:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.16500
    - F1 score (test): 0.14302
clf_name='NN':
  - Node2Vec:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.14500
    - F1 score (test): 0.12486
  - Text:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.54400
    - F1 score (test): 0.52698
  - Node2Vec + Text:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.16300
    - F1 score (test): 0.14050
clf_name='RF':
  - Node2Vec:
    - Accuracy (train): 1.00000
    - F1 score (train): 1.00000
    - Accuracy (test): 0.13900

#### QUESTION 25: Idea 3
We can find the personalized PageRank of each document in seven different runs, one per class. In each run, select one of the classes and take the 20 seed documents of that class. Then, perform a random walk with the following customized properties: (a) teleportation takes the random walker to one of the seed documents of that class (with a uniform probability of 1/20 per seed document). Vary the teleportation probability in {0, 0.1, 0.2}. (b) the probability of transitioning to neighbors is not uniform among the neighbors. Rather, it is proportional to the cosine similarity between the text features of the current node and the next neighboring node. Particularly, assume we are currently visiting a document x0 which has neighbors x1, x2, x3.

Then the probability of transitioning to each neighbor is:
$$
p_i = \frac{\exp(x_0\cdot x_i)}
{\exp(x_0\cdot x_1) + \exp(x_0\cdot x2) + \exp(x_0\cdot x_3)}\text{; for i = 1, 2, 3.}
$$
Repeat part b for every teleportation probability in part a. Run the PageRank only on the GCC. for each seed node, do 1000 random walks. Maintain a class-wise visited frequency count for every unlabeled node. The predicted class for that unlabeled node is the class which lead to maximum visits to that node. Report accuracy and f1 scores.

For example if node ’n’ was visited by 7 random walks from class A, 6 random walks from class B... 1 random walk from class G, then the predicted label of node of ’n’ is class A.

> Ans: The accuracy and f1 (macro) scores are shown in the below cell.

In [5]:
def run_personalized_pagerank(graph, p_teleportation):
    # get gcc nodes
    sorted_ccs = sorted(nx.connected_components(graph), key=len, reverse=True)
    gcc_nodes = set(graph.subgraph(sorted_ccs[0]).nodes)

    # aggregate seeds by class
    train_idxs = np.arange(len(data.x))[data.train_mask.cpu().numpy()]
    y_numpy = data.y.cpu().numpy()
    class2seeds = {i: [] for i in range(dataset.num_classes)}
    print("Number of seed nodes:", len(train_idxs))
    
    n_used = 0
    for train_idx in train_idxs:
        if train_idx in gcc_nodes:
            class2seeds[y_numpy[train_idx]].append(train_idx)
            n_used += 1
    print("Number of seed nodes in GCC:", n_used)
    
    # run pagerank
    pagerank_counts = np.zeros((len(data.x), dataset.num_classes))
    steps = 1000
    
    for class_, seeds in tqdm(class2seeds.items()):
        for seed in seeds:
            cur = seed
            pagerank_counts[cur][class_] += 1
            
            for _ in range(steps):
                neighbors = list(graph.neighbors(cur))
                
                if random.random() < p_teleportation or len(neighbors) == 0:
                    cur = random.choice(seeds)
                    pagerank_counts[cur][class_] += 1
                    neighbors = list(graph.neighbors(cur))
                
                p_sample = torch.softmax(
                    (data.x[cur] * data.x[neighbors]).sum(dim=1), dim=0
                )
                cur = random.choices(neighbors, weights=p_sample)[0]
                pagerank_counts[cur][class_] += 1
    
    # make predictions
    y_pred = np.argmax(pagerank_counts, axis=1)
    
    # evaluate
    test_mask = data.test_mask.cpu().numpy()
    y_test = data.y[test_mask].cpu().numpy()
    acc, f1 = evaluate(y_test, y_pred[test_mask])
    print(f"Teleportation Probability: {p_teleportation}")
    print(f"  - Accuracy: {acc:.5f}")
    print(f"  - F1 score: {f1:.5f}")


# Mistake 4: Accuracy should be at least 30% for p=0, while 70% otherwise
#            which is also not stated in the question...
p_teleportations = [0, 0.1, 0.2]

for p_teleportation in p_teleportations:
    run_personalized_pagerank(G, p_teleportation)

Number of seed nodes: 140
Number of seed nodes in GCC: 122


  0%|          | 0/7 [00:00<?, ?it/s]

Teleportation Probability: 0
  - Accuracy: 0.20100
  - F1 score: 0.18606
Number of seed nodes: 140
Number of seed nodes in GCC: 122


  0%|          | 0/7 [00:00<?, ?it/s]

Teleportation Probability: 0.1
  - Accuracy: 0.62900
  - F1 score: 0.63472
Number of seed nodes: 140
Number of seed nodes in GCC: 122


  0%|          | 0/7 [00:00<?, ?it/s]

Teleportation Probability: 0.2
  - Accuracy: 0.63400
  - F1 score: 0.64491
