# CogDL for Node Classification on a Custom Dataset

This homework focuses on using **CogDL** for node classification on a custom dataset. We first walk through an example on the small ***corax*** dataset; the homework then asks you to train on the provided ***Aminer*** dataset.

- First, import the necessary libraries. The dataset is stored in *.pkl* files, so we need *pickle* to load it. The required **CogDL** functions are imported as follows.

In [34]:
import numpy as np
import torch
import pickle
import random

from cogdl.data import Data
from cogdl.datasets import build_dataset
from cogdl.models import build_model
from cogdl.tasks import build_task
from cogdl.utils import build_args_from_dict


- Construct a class for your custom dataset.

In [35]:
class CustomDataset:
    def __init__(self, data):
        self.data = data
        self.num_features = data.x.shape[1]
        self.num_classes = torch.max(self.data.y).item() + 1

- Set the default arguments of the model.

In [36]:
def get_default_args():
    cuda_available = torch.cuda.is_available()
    default_dict = {'hidden_size': 16,
                    'dropout': 0.5,
                    'patience': 100,
                    'max_epoch': 500,
                    'cpu': not cuda_available,
                    'lr': 0.01,
                    'weight_decay': 5e-4}
    return build_args_from_dict(default_dict)
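
The later cells attach extra settings such as `args.task` and `args.model` to the returned object. A minimal sketch of that pattern follows, assuming the returned object simply exposes each dictionary key as an attribute. The helper `args_from_dict` is a hypothetical stand-in built on `argparse.Namespace`, not the CogDL implementation.

```python
from argparse import Namespace


def args_from_dict(options):
    """Hypothetical stand-in: expose every dictionary key as an attribute."""
    return Namespace(**options)


args = args_from_dict({"hidden_size": 16, "lr": 0.01})

# New attributes can be attached after construction, as the tutorial does
# with args.task, args.model and args.num_layers.
args.model = "pyg_gcn"

print(args.hidden_size, args.lr, args.model)
```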

- Next, we construct the custom dataset from the features, labels, and adjacency matrix. The data fed into **CogDL** consists of $x$ (node features), *edge_index* (the edges of the graph), $y$ (node labels), and three boolean masks, *train_mask*, *val_mask*, and *test_mask*, which mark whether each node belongs to the training, validation, or test set.


In [37]:
def construct_custom_dataset():
    
    # Node feature matrix: one row per node.
    with open("corax_features.pkl", "rb") as f:
        x_features = pickle.load(f)
    print("number of nodes:", x_features.shape[0])
    print("number of node features:", x_features.shape[1])

    # One-hot label matrix: one column per class.
    with open("corax_labels.pkl", "rb") as f:
        x_labels = pickle.load(f)
    print("number of classes:", x_labels.shape[1])

    # Adjacency matrix of the graph.
    with open("corax_adj.pkl", "rb") as f:
        x_adj = pickle.load(f)

    # Row and column indices of the nonzero adjacency entries, i.e. the edges.
    edges = x_adj.nonzero()
    
    x = torch.from_numpy(x_features).float()
    # Labels are one-hot: the position of the top entry in each row is the class id.
    y_onehot = torch.from_numpy(x_labels).long()
    y = torch.topk(y_onehot, 1)[1].squeeze(1)
    edge_index = torch.from_numpy(np.array([edges[0], edges[1]])).long()
    
    data = Data(x=x, edge_index=edge_index, y=y)
    
    # Split by position: the last 1000 nodes are the test set, the 500 nodes
    # before them the validation set, and all remaining nodes the training set.
    num_samples = x_labels.shape[0]
    idx = list(range(num_samples))
    data.train_mask = torch.zeros(num_samples, dtype=torch.bool)
    data.train_mask[idx[:-1500]] = True
    data.val_mask = torch.zeros(num_samples, dtype=torch.bool)
    data.val_mask[idx[-1500:-1000]] = True
    data.test_mask = torch.zeros(num_samples, dtype=torch.bool)
    data.test_mask[idx[-1000:]] = True
    
    custom_dataset = CustomDataset(data)

    return custom_dataset
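
The two conversions inside `construct_custom_dataset` (one-hot labels to class ids, adjacency matrix to a `2 x num_edges` index array) are easier to follow on a tiny graph. The sketch below uses NumPy only and made-up values for a 4-node path graph; it is not the corax data. It assumes strict one-hot rows, in which case `argmax` should pick the same index as the `torch.topk(..., 1)` call above.

```python
import numpy as np

# Toy data: 4 nodes, 3 classes (illustrative values only).
labels_onehot = np.array([[1, 0, 0],
                          [0, 0, 1],
                          [0, 1, 0],
                          [0, 0, 1]])
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])

# One-hot rows -> integer class ids.
y = labels_onehot.argmax(axis=1)

# Nonzero adjacency entries -> edge list of shape (2, num_edges).
rows, cols = adj.nonzero()
edge_index = np.array([rows, cols])

# Same rule as CustomDataset: largest class id plus one.
num_classes = int(y.max()) + 1

print(y)                 # [0 2 1 2]
print(edge_index.shape)  # (2, 6)
print(num_classes)       # 3
```

Note that the `max + 1` rule only gives the right class count when the largest class id actually occurs among the labels.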

- With these preparations done, we select a task and a model, construct the custom dataset, and set the number of layers of the model. Here we use a GCN model for node classification:

In [38]:
args = get_default_args()
args.task = 'node_classification'
args.model = 'pyg_gcn'

dataset = construct_custom_dataset()

args.num_features = dataset.num_features
args.num_classes = dataset.num_classes
args.num_layers = 2


number of nodes: 2680
number of node features: 302
number of classes: 7


- Finally, we run the model.

In [39]:
model = build_model(args)
task = build_task(args, dataset=dataset, model=model)
ret = task.train()

Epoch: 336, Train: 0.9449, Val: 0.8840:  67%|██████▋   | 333/500 [00:02<00:00, 173.90it/s]

Test accuracy = 0.847





The model should reach a test accuracy of around 0.85 on the corax dataset.
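
For reference, the sketch below shows one common reading of "test accuracy": the fraction of nodes selected by *test_mask* whose predicted class equals the label. This is an assumption about the metric, not a description of how CogDL computes it internally, and `masked_accuracy` is a hypothetical helper name.

```python
import numpy as np


def masked_accuracy(pred, y, mask):
    """Fraction of correct predictions among the nodes selected by mask."""
    pred = np.asarray(pred)
    y = np.asarray(y)
    mask = np.asarray(mask, dtype=bool)
    return float((pred[mask] == y[mask]).mean())


# Toy values: 5 nodes, the last 3 form the test set.
pred = [0, 1, 2, 2, 1]
y = [0, 1, 1, 2, 0]
test_mask = [False, False, True, True, True]

acc = masked_accuracy(pred, y, test_mask)
print(round(acc, 3))  # 0.333 (one of three test nodes is correct)
```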

# Homework: Train on Aminer Dataset

Apart from the ***corax*** dataset, we also provide the larger ***Aminer*** dataset, whose adjacency, node features, and labels are provided in *Aminer_adj.pkl, Aminer_features.pkl* and *Aminer_labels.pkl*, respectively.

Your goals are:

- 1: Build a custom model configuration: select a base model and hyperparameters for your custom model. A better base model or better hyperparameters lead to better results.
- 2: Run your custom model on the ***Aminer*** dataset. Randomly select 50,000 nodes for testing and 50,000 nodes for validation.
- 3: The homework is scored based on your code and training results.
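
One possible way to draw the random split asked for in step 2 is sketched below with NumPy. The helper `random_split_masks` and its `seed` parameter are hypothetical names introduced for illustration; the node count 593486 is the one printed for the Aminer features further down.

```python
import numpy as np


def random_split_masks(num_nodes, num_val, num_test, seed=0):
    """Return disjoint boolean (train, val, test) masks from a random permutation."""
    if num_val + num_test > num_nodes:
        raise ValueError("validation and test sets exceed the number of nodes")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)

    # Consecutive slices of one permutation can never overlap.
    test_idx = perm[:num_test]
    val_idx = perm[num_test:num_test + num_val]
    train_idx = perm[num_test + num_val:]

    masks = []
    for idx in (train_idx, val_idx, test_idx):
        mask = np.zeros(num_nodes, dtype=bool)
        mask[idx] = True
        masks.append(mask)
    return tuple(masks)


train_mask, val_mask, test_mask = random_split_masks(593486, 50000, 50000)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 493486 50000 50000
```

Each mask can then be converted with `torch.from_numpy` before being assigned to `data.train_mask`, `data.val_mask`, and `data.test_mask`. Fixing the seed keeps the split reproducible across runs.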


In [40]:
def construct_aminer_dataset():
    # write your code here
    # Node feature matrix: one row per node.
    with open("Aminer_features.pkl", "rb") as f:
        x_features = pickle.load(f)
    print("number of nodes:", x_features.shape[0])
    print("number of node features:", x_features.shape[1])
    
    # One-hot label matrix: one column per class.
    with open("Aminer_labels.pkl", "rb") as f:
        x_labels = pickle.load(f)
    print("number of classes:", x_labels.shape[1])

    # Adjacency matrix of the graph.
    with open("Aminer_adj.pkl", "rb") as f:
        x_adj = pickle.load(f)
    
    edges = x_adj.nonzero()
    
    x = torch.from_numpy(x_features).float()
    y_onehot = torch.from_numpy(x_labels).long()
    y = torch.topk(y_onehot,1)[1].squeeze(1)
    edge_index = torch.from_numpy(np.array([edges[0],edges[1]])).long()
    
    data = Data(x=x, edge_index=edge_index, y=y)
    
    num_samples = x_labels.shape[0]
    idx = list(range(num_samples))
    data.train_mask = torch.zeros(num_samples, dtype = torch.bool)
    data.train_mask[idx[:-1500]] = True
    data.val_mask = torch.zeros(num_samples, dtype = torch.bool)
    data.val_mask[idx[-1500:-1000]] = True
    data.test_mask = torch.zeros(num_samples, dtype = torch.bool)
    data.test_mask[idx[-1000:]] = True
    
    aminer_dataset = CustomDataset(data)

    
    return aminer_dataset

In [41]:
def get_custom_args():
    cuda_available = torch.cuda.is_available()
    # select your custom_dict here
    custom_dict = {'hidden_size': 16,
                    'dropout': 0.5,
                    'patience': 100,
                    'max_epoch': 500,
                    'cpu': not cuda_available,
                    'lr': 0.01,
                    'weight_decay': 5e-4}
    
    return build_args_from_dict(custom_dict)

In [27]:
# build your own model
args = get_custom_args()
args.task = 'node_classification'
args.model = 'pyg_gcn'

dataset = construct_aminer_dataset()

args.num_features = dataset.num_features
args.num_classes = dataset.num_classes
args.num_layers = 3

model = build_model(args)
task = build_task(args, dataset=dataset, model=model)
ret = task.train()



number of nodes: 593486
number of node features: 100
number of classes: 18


Epoch: 499, Train: 0.6262, Val: 0.6420: 100%|██████████| 500/500 [03:46<00:00,  2.22it/s]

Test accuracy = 0.634





# The test accuracy of your custom model on the Aminer dataset is: 0.634