# Node Classification on Knowledge Graphs

This notebook implements one of the optional exercises proposed on ...(add reference).

## Setup

In [2]:
# # Check CUDA Version
# !python -c "import torch; print(torch.version.cuda)"

# # Install Pytorch Geometric
# !pip install -q torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.6.0.html
# !pip install -q torch-sparse==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.6.0.html
# !pip install -q git+https://github.com/rusty1s/pytorch_geometric.git

In [4]:
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

print(f"torch: {torch.__version__}")

torch: 1.8.1


## Knowledge Graphs and Node Classification

Some common characteristics of knowledge graph datasets:

- The dataset is a single large graph rather than many individual graphs (such as molecules).
- Labels for unlabeled nodes are inferred through node-level prediction.

## Dataset Introduction

We will use Cora to showcase the use of binary masks for node-level prediction.
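To make the idea of a binary mask concrete, here is a minimal sketch in plain Python, where lists stand in for tensors so that it runs without PyTorch. The labels, predictions, and mask values are invented for illustration and are not taken from Cora; `masked_accuracy` is a hypothetical helper. A mask holds one boolean per node, and only nodes where it is `True` contribute to the metric.

```python
# Toy example: 6 nodes with hypothetical labels and predictions (not Cora data).
labels      = [0, 1, 2, 1, 0, 2]
predictions = [0, 1, 1, 1, 2, 2]

# One boolean per node: True means the node belongs to the training split.
train_mask = [True, True, False, False, True, False]


def masked_accuracy(preds, targets, mask):
    """Accuracy computed only over the nodes selected by the mask."""
    selected = [(p, t) for p, t, m in zip(preds, targets, mask) if m]
    if not selected:
        raise ValueError("mask selects no nodes")
    correct = sum(p == t for p, t in selected)
    return correct / len(selected)


train_acc = masked_accuracy(predictions, labels, train_mask)
print(f"Nodes in training split: {sum(train_mask)}")
print(f"Training accuracy: {train_acc:.2f}")
```

With PyTorch tensors the same selection is usually written with boolean indexing, for example `data.y[data.train_mask]` once the `data` object has been loaded.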

### What is the Cora Dataset
The Cora dataset consists of 2708 scientific publications, each classified into one of seven classes. Each publication is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from a dictionary. The dictionary consists of 1433 unique words (a "bag of words" vocabulary).

- Nodes: Publications (Papers, Books ...)
- Edges: Citations
- Node Features: word vectors 
- Labels: Seven publication types (Neural_Networks, Reinforcement_Learning, ...)

We normalize the features using PyTorch Geometric's `NormalizeFeatures` transform, which row-normalizes each node's feature vector so that it sums to one.
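As a rough illustration of what row normalization does, the sketch below divides each feature row by its sum so that the row sums to one. It is a plain-Python approximation of the idea, not PyTorch Geometric's actual implementation; the helper name `row_normalize` and the toy vectors are hypothetical.

```python
def row_normalize(rows):
    """Scale each row so its entries sum to 1; all-zero rows stay unchanged."""
    normalized = []
    for row in rows:
        total = sum(row)
        # Guard against division by zero for rows without any active feature.
        divisor = total if total > 0 else 1.0
        normalized.append([value / divisor for value in row])
    return normalized


# Hypothetical 0/1 bag-of-words vectors for three tiny "papers".
features = [
    [1, 0, 1, 0],  # two words present
    [0, 0, 0, 1],  # one word present
    [0, 0, 0, 0],  # no word present
]

for row in row_normalize(features):
    print(row)
```

Under this scheme, a binary row with k active words ends up with nonzero entries equal to 1/k. The value 0.0435 in the feature printout further down is roughly 1/23, which would match a paper containing 23 dictionary words under this assumption.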

In [5]:
dataset = Planetoid(root="data/Planetoid", name="Cora", transform=NormalizeFeatures())

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


### Explore Dataset

In [8]:
# Get some basic info about the dataset
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features:,}')
print(f'Number of classes: {dataset.num_classes}')
print(50*'=')

# There is only one graph in the dataset, use it as new data object
data = dataset[0]  

# Gather some statistics about the graph.
print(data)
print(f'Number of nodes: {data.num_nodes:,}')
print(f'Number of edges: {data.num_edges:,}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Is undirected: {data.is_undirected()}')


Number of graphs: 1
Number of features: 1,433
Number of classes: 7
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Number of nodes: 2,708
Number of edges: 10,556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True


Observations:

- Only about 5% of the nodes are labeled for training, which is a relatively small training set.
- There are only 20 labeled training nodes per class.
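Both observations follow from the statistics printed above. The short calculation below reproduces them from those numbers alone; it assumes the training nodes are split evenly across the classes, which is what the 20-per-class figure implies.

```python
# Numbers taken from the dataset statistics printed above.
num_nodes = 2708
num_train_nodes = 140
num_classes = 7

# Fraction of all nodes that carry a training label.
label_rate = num_train_nodes / num_nodes

# Assumes an even split of training nodes across the classes.
train_nodes_per_class = num_train_nodes // num_classes

print(f"Training node label rate: {label_rate:.2%}")
print(f"Training nodes per class: {train_nodes_per_class}")
```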

In [22]:
print(f"#Nodes x Features: {data.x.shape}")
data.x[1][:50]

#Nodes x Features: torch.Size([2708, 1433])


tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0435, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000])

### Why do we even use the graph structure - aren't the features enough?

Simple MLP models, which only see each node's own features, tend to perform considerably worse than GNNs on this type of task, because the citation information is crucial for correct classification.
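One simplified way to see what the graph structure adds: an MLP only ever sees a node's own feature vector, while a message-passing layer also mixes in the features of its neighbors. The sketch below applies mean aggregation to a hypothetical four-node toy graph in plain Python. It is not the layer of any particular GNN library, and the graph, the features, and the helper `mean_aggregate` are invented for illustration.

```python
# Hypothetical undirected toy graph as an adjacency list: node -> neighbors.
neighbors = {
    0: [1, 2],
    1: [0],
    2: [0, 3],
    3: [2],
}

# Hypothetical 2-dimensional node features.
features = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 1.0],
    3: [0.0, 0.0],
}


def mean_aggregate(node, neighbors, features):
    """Average the feature vectors of a node and its direct neighbors."""
    group = [node] + neighbors[node]
    dim = len(features[node])
    return [sum(features[n][d] for n in group) / len(group) for d in range(dim)]


for node in sorted(neighbors):
    print(node, features[node], "->", mean_aggregate(node, neighbors, features))
```

A learned layer would additionally apply trainable weights and a nonlinearity. The point here is only that the representation of a node now depends on the papers it is connected to through citations, which a feature-only model cannot exploit.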