# Introduction

This notebook considers the use of EGI for link prediction tasks.

It largely follows [this example](https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/node2vec-link-prediction.html), but using EGI instead of node2vec for encoding.


**The stages of link prediction are:**

1. Create and train an encoder to create node embeddings for the source graph.
2. Using a binary operator, combine node embeddings to form edge embeddings.
3. Train a classifier to distinguish between real and fake edges.


For the encoder, we use EGI. We then create edge embeddings using the Hadamard product and use these to train an SGD classifier.

The source dataset is the Cora citation graph. We then transfer the trained models to a small subgraph of the PubMed citation graph, generated by randomly sampling edges from the full PubMed graph.

# Setup

In [1]:
!pip install -e ..
# if the library is not installed yet, restart the notebook

Obtaining file:///workspace/mount
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: graphtransferlearning
  Building editable for graphtransferlearning (pyproject.toml) ... done
  Created wheel for graphtransferlearning: filename=graphtransferlearning-0.0.1-0.editable-py3-none-any.whl size=1319 sha256=00fac8f10ebae2841997675471451004c820bc07c65967ccd3db6c04be38a861
  Stored in directory: /tmp/pip-ephem-wheel-cache-o2ebaz02/wheels/97/9f/f8/a43530fa3975ba118be1e5dde0c412ea8db4864e54b615e504
Successfully built graphtransferlearning
Installing collected packages: graphtransferlearning
  Attempting uninstall: graphtransferlearning
    Found existing installation: graphtransferlearning 0.0.1
    Uninstalling graphtransferlearning-0.0.1:
 

In [2]:
import graphtransferlearning as gtl
from graphtransferlearning.features import degree_bucketing

import dgl
import torch
import warnings
from random import randint,sample
from dgl.data import CoraGraphDataset,PubmedGraphDataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

import networkx as nx
import numpy as np

Using backend: pytorch


# The dataset

For this example, we use the Cora citation graph dataset.

In [3]:
dataset = CoraGraphDataset()

Downloading /root/.dgl/cora_v2.zip from https://data.dgl.ai/dataset/cora_v2.zip...
Extracting file to /root/.dgl/cora_v2
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.


In [4]:
# do things in this not-recommended way as our EGI graph requires a DGLGraphStale not a DGLGraph
# see the version 0.5 source code: https://github.com/dmlc/dgl/blob/0.5.x/python/dgl/data/citation_graph.py
dgl_graph = dgl.DGLGraphStale()
dgl_graph.from_networkx(dataset.graph)
dgl_graph.readonly()



# Encoder

First, an encoder is trained to produce node embeddings for the Cora graph.

In [5]:
with warnings.catch_warnings(): # hide all the "XYZ is deprecated" messages
    warnings.simplefilter('ignore')
    encoder = gtl.training.train_egi_encoder(dgl_graph,gpu=0,save_weights_to="../models/egi.pickle")

100%|██████████| 100/100 [00:53<00:00,  1.86it/s]

Saving model parameters to ../models/egi.pickle





# Edge embeddings

Next, the node embeddings are combined into edge embeddings. For each edge, the embeddings of its start and end nodes are multiplied element-wise (the Hadamard product).

In [6]:
features = degree_bucketing(dgl_graph,32) # the maximum degree must be the same as used in training.
                                          # this is usually equal to n_hidden

torch.cuda.set_device(torch.device('cuda:0'))
features = features.cuda()

embs = encoder(features)

embs = embs.cuda()
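The comment in the cell above says the maximum degree must match the value used in training. One plausible reason, assuming `degree_bucketing` builds one-hot degree features (an assumption, since its implementation is not shown in this notebook), is that the maximum degree fixes the width of the feature vectors the encoder was trained on. The helper `one_hot_degree` below is a hypothetical stand-in written in plain Python, not the library function:

```python
def one_hot_degree(degrees, max_degree):
    """One-hot encode each node's degree, clipping at max_degree - 1.

    Hypothetical stand-in for degree_bucketing; the real function may differ.
    """
    features = []
    for d in degrees:
        row = [0.0] * max_degree
        row[min(d, max_degree - 1)] = 1.0
        features.append(row)
    return features

degrees = [1, 3, 40]  # made-up node degrees
feats = one_hot_degree(degrees, max_degree=32)

print(len(feats), len(feats[0]))  # 3 32
print(feats[2].index(1.0))        # 31: degree 40 is clipped to the last bucket
```

Under this assumption, changing the maximum degree would change the input width, and weights trained for 32-wide inputs would no longer fit.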


In [7]:
def get_edge_embedding(emb,a,b):
    # Hadamard (element-wise) product of the two endpoint embeddings, returned as a CPU tensor
    return emb[a].detach().cpu() * emb[b].detach().cpu()

In [8]:
get_edge_embedding(embs,10,20)

tensor([0.0000, 0.0226, 0.0000, 0.0000, 0.7384, 0.0000, 0.2285, 0.0000, 0.0000,
        0.0000, 0.1601, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.4549, 0.0539,
        0.0000, 0.0000, 0.4325, 0.2416, 0.0000, 0.0000, 0.0000, 0.0000, 0.2010,
        0.0274, 0.0000, 0.0000, 1.3740, 0.0000])
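
The Hadamard product is the element-wise product of two vectors of equal length. As a minimal sketch, with plain Python lists standing in for the embedding tensors (the values and the helper name `hadamard` are illustrative and not part of the notebook):

```python
def hadamard(x, y):
    """Element-wise product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(x, y)]

# toy node embeddings (made-up values)
emb_u = [0.5, 0.0, 2.0, 1.0]
emb_v = [2.0, 3.0, 0.0, 1.0]

edge_uv = hadamard(emb_u, emb_v)
edge_vu = hadamard(emb_v, emb_u)

print(edge_uv)             # [1.0, 0.0, 0.0, 1.0]
print(edge_uv == edge_vu)  # True: the operator is symmetric
```

Two properties follow from the definition: the edge embedding has the same dimension as the node embeddings, and `(u, v)` and `(v, u)` receive the same embedding, so edge direction is not encoded. A component is also zero whenever either endpoint is zero in that component.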

# Training a link prediction classifier

To predict links in a graph, we train a classifier to see if a given edge is real or fake using its embedding as input.

To train the classifier, we need both a set of real (positive) edges, and a set of non-existent (negative) edges.
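
To make the distinction concrete, here is a small sketch on a made-up four-node path graph that lists every negative edge exhaustively. This is only practical for tiny graphs, since the number of node pairs grows quadratically; the notebook cells below draw random pairs instead.

```python
from itertools import combinations

# toy undirected graph (made-up): a path 0-1-2-3
nodes = [0, 1, 2, 3]
positive_edges = [(0, 1), (1, 2), (2, 3)]

# store each edge as an unordered pair so (u, v) and (v, u) match
positive_set = {frozenset(e) for e in positive_edges}

# every unordered node pair that is not an edge is a candidate negative
negative_edges = [
    (u, v) for u, v in combinations(nodes, 2)
    if frozenset((u, v)) not in positive_set
]

print(negative_edges)  # [(0, 2), (0, 3), (1, 3)]
```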

**Create these real and fake edges, and convert them into edge embeddings:**

In [9]:
positive_edges = list(dataset.graph.edges)
nodes = list(dataset.graph.nodes)



In [10]:
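def generate_negative_edges(edges,nodes,n):
    # draw n node pairs that are not connected in either direction
    edge_set = set(edges) # set membership is much faster than scanning a list
    negative_edges = []
    for i in range(n):
        # sample two distinct nodes from the graph's own node list,
        # rather than from range(n), where n is the number of edges requested
        u,v = sample(nodes,2)
        while (u,v) in edge_set or (v,u) in edge_set:
            u,v = sample(nodes,2)

        negative_edges.append((u,v))

    return negative_edges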

In [11]:
negative_edges = generate_negative_edges(positive_edges,nodes,len(positive_edges))

In [12]:
edges = []
values = []

for u,v in positive_edges:
    edges.append(get_edge_embedding(embs,u,v))
    values.append(1)
    
for u,v in negative_edges:
    edges.append(get_edge_embedding(embs,u,v))
    values.append(0)



In [13]:
train_edges,test_edges,train_classes,test_classes = train_test_split(edges,values)
train_edges = torch.stack(train_edges) # list of 1D tensors to a 2D tensor (n_edges x embedding size)
test_edges = torch.stack(test_edges) # list of 1D tensors to a 2D tensor (n_edges x embedding size)

Fit and evaluate the classifier. With its default settings, `SGDClassifier` fits a linear SVM (hinge loss) rather than a logistic regression model.

The classifier needs to support the `partial_fit` API, which allows us to update its parameters with new data during fine-tuning.

The supported models are listed here: https://scikit-learn.org/0.15/modules/scaling_strategies.html

In [14]:
classifier = SGDClassifier(max_iter=1000).fit(train_edges,train_classes)

In [15]:
print(f"The Cora link predictor has an accuracy score of \
{classifier.score(test_edges,test_classes)}")

The Cora link predictor has an accuracy score of 0.819439181508147


# Transfer Learning and fine-tuning

A small subgraph of the PubMed citation graph is used as the transfer target.

This graph is built from 1000 directed edges sampled at random from the larger dataset; the resulting undirected graph can have slightly fewer edges.

In [16]:
transfer_dataset = PubmedGraphDataset()
transfer_g = transfer_dataset.graph

# sample() needs a sequence, so materialise the edge view as a list first
sampled_edges = sample(list(transfer_g.edges()),1000)
transfer_g = nx.edge_subgraph(transfer_dataset.graph,sampled_edges).to_undirected(reciprocal=False)
transfer_g = nx.convert_node_labels_to_integers(transfer_g) # renumber nodes to be sequential integers
print(transfer_g)

Downloading /root/.dgl/pubmed.zip from https://data.dgl.ai/dataset/pubmed.zip...
Extracting file to /root/.dgl/pubmed
Finished data loading and preprocessing.
  NumNodes: 19717
  NumEdges: 88651
  NumFeats: 500
  NumClasses: 3
  NumTrainingSamples: 60
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.
Graph with 1699 nodes and 995 edges
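
The output reports 995 edges rather than 1000. This is consistent with a few sampled edges being the two directions of the same citation, which collapse into one edge when the subgraph is made undirected. The sketch below reproduces the three steps (sample, collapse direction, relabel) in plain Python on a made-up edge list. It mirrors the idea only; it does not use the networkx calls from the cell above, and the order in which nodes are relabelled may differ from `nx.convert_node_labels_to_integers`.

```python
from random import Random

rng = Random(0)  # fixed seed so the sketch is repeatable

# toy directed edge list (made-up) in which some citations appear in both directions
directed_edges = [(10, 20), (20, 10), (20, 30), (30, 40), (40, 30), (50, 10)]

sampled = rng.sample(directed_edges, 4)

# collapse direction: (u, v) and (v, u) become the same undirected edge
undirected = {tuple(sorted(e)) for e in sampled}

# relabel the surviving nodes as consecutive integers 0..n-1
old_nodes = sorted({n for e in undirected for n in e})
new_id = {old: new for new, old in enumerate(old_nodes)}
relabelled = sorted((new_id[u], new_id[v]) for u, v in undirected)

print(len(sampled), len(undirected))
print(relabelled)
```

The second count can never exceed the first, and it is smaller whenever both directions of an edge were sampled.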




In [17]:
# convert to dgl for EGI training
transfer_g_dgl = dgl.DGLGraphStale()
transfer_g_dgl.from_networkx(transfer_g)

transfer_g_dgl.readonly()

## Fine-tuning the embedder

In [25]:
transfer_encoder = gtl.training.train_egi_encoder(transfer_g_dgl,gpu=0,pre_train="../models/egi.pickle")

100%|██████████| 100/100 [00:40<00:00,  2.45it/s]


## Fine-tuning the classifier

The above steps are repeated, but for the transfer dataset:

In [26]:
features = degree_bucketing(transfer_g_dgl,32)

torch.cuda.set_device(torch.device('cuda:0'))
features = features.cuda()

In [27]:
embs = transfer_encoder(features)
embs.shape

torch.Size([1699, 32])

In [28]:
positive_edges = list(transfer_g.edges)
nodes = list(transfer_g.nodes)
negative_edges = generate_negative_edges(positive_edges,nodes,len(positive_edges)) 


In [29]:
edges = []
values = []

for u,v in positive_edges:
    edges.append(get_edge_embedding(embs,u,v))
    values.append(1)
    
for u,v in negative_edges:
    edges.append(get_edge_embedding(embs,u,v))
    values.append(0)


In [30]:
train_edges,test_edges,train_classes,test_classes = train_test_split(edges,values)
train_edges = torch.stack(train_edges) # list of 1D tensors to a 2D tensor (n_edges x embedding size)
test_edges = torch.stack(test_edges) # list of 1D tensors to a 2D tensor (n_edges x embedding size)

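Note that we use `partial_fit` to train the classifier instead of `fit`. This updates the classifier's current weights rather than re-initialising them. `partial_fit` returns the estimator itself, so `classifier2` below refers to the same object as `classifier`.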

In [31]:
classifier2 = classifier.partial_fit(train_edges,train_classes)
print(f"The link predictor has an accuracy score of \
{classifier2.score(test_edges,test_classes)}")

The link predictor has an accuracy score of 0.748995983935743
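
The behaviour of `partial_fit` can be checked in isolation on synthetic data. The sketch below uses made-up "source" and "target" datasets rather than the Cora and PubMed edge embeddings; it first fits an `SGDClassifier` and then updates it with `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# synthetic "source" and "target" data (made-up): the label depends on the first feature
X_source = rng.normal(size=(200, 4))
y_source = (X_source[:, 0] > 0).astype(int)
X_target = rng.normal(size=(50, 4))
y_target = (X_target[:, 0] > 0).astype(int)

clf = SGDClassifier(max_iter=1000, random_state=0).fit(X_source, y_source)
weights_after_fit = clf.coef_.copy()

# partial_fit makes one pass over the new data, starting from the current weights;
# it returns the estimator itself rather than a copy
returned = clf.partial_fit(X_target, y_target)

print(returned is clf)   # True
print(clf.coef_.shape)   # (1, 4)
print(np.abs(clf.coef_ - weights_after_fit).max())  # size of the weight update
```

Because `fit` has already recorded the class labels, the later `partial_fit` call does not need the `classes` argument; it is only required when `partial_fit` is the first call on a fresh estimator.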


# TODO

- model tuning (epochs in fine-tuning, k, etc.)
- comparisons of using:
    - the fine-tuned PubMed model
    - the Cora model on PubMed
    - a model trained only on PubMed
- target graph size vs. accuracy (the size is adjustable above)
