# EGI Airport Experiment

This notebook contains a minimal reproduction of the airport experiment from the [EGI](https://arxiv.org/abs/2009.05204) paper.

The aim of this experiment is to train an encoder on the airport graph of
one region and apply it to the graph of another region, where its embeddings
are used to predict node labels. The transfer is direct: the encoder is not
fine-tuned on the target graph.

The node labels are the relative popularity of the airports, as quartiles (1-4).


In [1]:
from pathlib import Path

import torch
import dgl
import numpy as np
from numpy.typing import NDArray
import wandb
import sklearn.linear_model
import sklearn.model_selection

import gtl
import gtl.training
from gtl import Graph
from gtl.typing import PathLike
from gtl.features import degree_bucketing


In [2]:
# Where is the data stored?
# In this case, it is committed to the git repo under data/airports.
# Larger datasets should be downloaded separately!
DATA_DIR: Path = Path.cwd().parent / "data" / "airports"
print(DATA_DIR)

/Users/niklas/src/staris/main/data/airports


In [3]:
# auto-detect if we are on a GPU or not
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Data loading
The data contains two files for each region: a list of edges, and a list of labels for each node.

In [4]:
with open(DATA_DIR / "europe-airports.edgelist") as f:
    for i in range(3):
        print(f.readline())

252 36

57 50

43 3



In [5]:
with open(DATA_DIR / "labels-europe-airports.txt") as f:
    for i in range(5):
        print(f.readline())

node label

0 1

1 1

2 2

3 1
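
The two previews above suggest a simple whitespace-separated format: one `u v` pair per line in the edge list, and one `node label` pair per line (after a header row) in the label file. The sketch below parses the previewed lines with the standard library only. The helper names `parse_edgelist` and `parse_labels` are hypothetical and are not part of `gtl`.

```python
import io


def parse_edgelist(text: str) -> list[tuple[int, int]]:
    """Parse 'u v' pairs, one edge per non-empty line."""
    edges = []
    for line in io.StringIO(text):
        if line.strip():
            u, v = line.split()
            edges.append((int(u), int(v)))
    return edges


def parse_labels(text: str) -> dict[int, int]:
    """Parse 'node label' rows, skipping the header line."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return {int(node): int(label) for node, label in rows[1:]}


# Sample lines copied from the file previews in this notebook.
edges = parse_edgelist("252 36\n57 50\n43 3\n")
labels = parse_labels("node label\n0 1\n1 1\n2 2\n3 1\n")
print(edges)
print(labels)
```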



In [6]:
def load_dataset(edgefile: PathLike, labelfile: PathLike) -> tuple[Graph, NDArray]:
    """Load an airport graph and its per-node labels."""
    edges = np.loadtxt(edgefile, dtype="int")
    us = torch.from_numpy(edges[:, 0])
    vs = torch.from_numpy(edges[:, 1])
    # Build the graph on the CPU, make it undirected, then move it to the target device.
    dgl_graph: dgl.DGLGraph = dgl.graph((us, vs), device=torch.device("cpu"))
    dgl_graph = dgl.to_bidirected(dgl_graph).to(device)

    graph: Graph = gtl.Graph.from_dgl_graph(dgl_graph)
    # graph.mine_triangles()  # only necessary for the triangle model

    # The first row of the label file is a header ("node label").
    labels = np.loadtxt(labelfile, skiprows=1)
    return graph, labels[:, 1]
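
`dgl.to_bidirected` adds the reverse of every edge and drops parallel edges, so the airport graph is treated as undirected. The pure-Python sketch below mimics that behaviour on a toy edge list. `to_bidirected_edges` is a hypothetical helper written for illustration, not DGL's implementation.

```python
def to_bidirected_edges(edges):
    """Return sorted unique directed edges containing both (u, v) and (v, u)."""
    both = set()
    for u, v in edges:
        both.add((u, v))
        both.add((v, u))
    return sorted(both)


# (0, 1) and (1, 0) are already a pair; (1, 2) gains its reverse (2, 1).
result = to_bidirected_edges([(0, 1), (1, 0), (1, 2)])
print(result)
```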

In [7]:
europe_g, europe_labels = load_dataset(
    DATA_DIR / "europe-airports.edgelist",
    DATA_DIR / "labels-europe-airports.txt",
)

In [8]:
brazil_g, brazil_labels = load_dataset(
    DATA_DIR / "brazil-airports.edgelist",
    DATA_DIR / "labels-brazil-airports.txt",
)

## Running the model
Now, we define a single run of the model.


In this example we will use EGI, transferring from Europe to Brazil.

Configuration options (including hyperparameters) must be defined in a single dictionary.

Valid options are listed in the `gtl.training.train` function documentation. Invalid options are ignored silently, allowing this config dict to be used for other things.
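
One way a function can ignore unknown options silently is to read only the keys it knows about. The sketch below illustrates the idea. `KNOWN_OPTIONS` and `pick_options` are hypothetical and are not taken from `gtl`; the key names are simply the ones used in this notebook's config.

```python
from collections.abc import Mapping

# Assumption: these are just the keys used in this notebook, not gtl's full list.
KNOWN_OPTIONS = {"lr", "hidden_layers", "patience", "min_delta", "n_epochs", "k"}


def pick_options(config: Mapping) -> dict:
    """Keep recognised options and drop everything else without complaint."""
    return {key: value for key, value in config.items() if key in KNOWN_OPTIONS}


config = {"lr": 0.01, "k": 2, "experiment_name": "airport-demo"}
used = pick_options(config)
print(used)
```

A practical consequence of silent filtering is that a misspelled key would also be dropped without warning, so it is worth checking key names against the function documentation.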

In [9]:
?gtl.training.train

Signature:
gtl.training.train(
    model: str,
    graph: gtl.graph.Graph,
    features: torch.Tensor,
    config: collections.abc.Mapping,
    device=None,
) -> collections.abc.Callable[[dgl.heterograph.DGLGraph, torch.Tensor], torch.Tensor]

In [10]:
config = {
    "lr": 0.01,
    "hidden_layers": 32,
    "patience": 50,
    "min_delta": 0.01,
    "n_epochs": 200,
    "k": 2,
}

We have no node features, so we create some using degree bucketing.

In [11]:
# node features for encoder
europe_node_feats = degree_bucketing(
    europe_g.as_dgl_graph(device), config["hidden_layers"]
).to(device)
brazil_node_feats = degree_bucketing(
    brazil_g.as_dgl_graph(device), config["hidden_layers"]
).to(device)
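
Degree bucketing, as it is commonly implemented, one-hot encodes each node's degree, with all degrees at or above the last bucket sharing that bucket. The sketch below shows the idea in plain Python on an edge list. It is an assumption about the general technique, and `degree_bucket_features` is a hypothetical helper, not the `gtl.features.degree_bucketing` implementation.

```python
def degree_bucket_features(n_nodes, edges, n_buckets):
    """One-hot encode each node's degree, clamping to the last bucket."""
    degree = [0] * n_nodes
    for u, v in edges:  # undirected edge list: each edge counts for both ends
        degree[u] += 1
        degree[v] += 1
    features = []
    for d in degree:
        row = [0] * n_buckets
        row[min(d, n_buckets - 1)] = 1
        features.append(row)
    return features


# Star graph: node 0 is connected to nodes 1..4.
feats = degree_bucket_features(5, [(0, 1), (0, 2), (0, 3), (0, 4)], n_buckets=3)
print(feats)
```

In the cell above, the second argument is `config["hidden_layers"]`, which presumably sets the number of buckets and therefore the width of the feature vectors.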

Metrics and results are tracked using Weights and Biases runs. We will use this in offline mode for now:

In [12]:
wandb.init(mode="offline")

We now have everything we need to train the source encoder.

In [13]:
encoder = gtl.training.train("egi", europe_g, europe_node_feats, config, device)

 46%|███████████████████████████████████████████████████████▊                                                                | 93/200 [00:25<00:29,  3.63it/s]

Early stopping!
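
The run above ended before reaching `n_epochs`. A common form of early stopping is sketched here, under the assumption that `patience` counts consecutive epochs without improvement and `min_delta` is the smallest decrease in validation loss that counts as an improvement. The exact rule used by `gtl` may differ, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience, min_delta):
    """Return the epoch index at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement large enough to reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Toy loss curve: improvements after epoch 1 are smaller than min_delta.
losses = [1.0, 0.5, 0.496, 0.494, 0.492, 0.3]
stop = early_stopping_epoch(losses, patience=3, min_delta=0.01)
print(stop)
```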





In [15]:
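# Generate node embeddings for the source graph.
# Move them to the CPU so they can be converted to NumPy arrays below.
source_embs = (
    encoder(europe_g.as_dgl_graph(device), europe_node_feats)
    .to(torch.device("cpu")))

# Direct transfer encoder to target
target_embs = (
    encoder(brazil_g.as_dgl_graph(device), brazil_node_feats)
    .to(torch.device("cpu")))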

# We transfer embeddings, but create separate node classifiers
# for the source and target graphs. This is up to you.

# We use sklearn to create the node classifier.
# You could use an MLP built with PyTorch instead.
train_embs, val_embs, train_classes, val_classes = sklearn.model_selection.train_test_split(
    source_embs.detach().numpy(), europe_labels
)


classifier = sklearn.linear_model.SGDClassifier(loss="log_loss")
classifier = classifier.fit(train_embs, train_classes)
print(f"Source Accuracy {classifier.score(val_embs,val_classes)}")

# Now do target accuracy
train_embs, val_embs, train_classes, val_classes = sklearn.model_selection.train_test_split(
    target_embs.detach().numpy(), brazil_labels
)


classifier = sklearn.linear_model.SGDClassifier(loss="log_loss")
classifier = classifier.fit(train_embs, train_classes)
print(f"Target Accuracy {classifier.score(val_embs,val_classes)}")


Source Accuracy 0.46
Target Accuracy 0.6060606060606061
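
Because `train_test_split` shuffles at random and no seed is set in the cell above, the printed accuracies will vary from run to run. The sketch below shows the idea of a seeded split in plain Python. `seeded_split` and `accuracy` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
import random


def seeded_split(items, labels, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed, then split into train and validation parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(indices) * test_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


items = list(range(8))
labels = [i % 4 for i in items]
first = seeded_split(items, labels, seed=42)
second = seeded_split(items, labels, seed=42)
print(first == second)  # the same seed gives the same split
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
```

With scikit-learn itself, passing `random_state` to both `train_test_split` and `SGDClassifier` serves the same purpose.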


In [16]:
wandb.finish()

Run history:
-training-loss,██▇▆▅▆▆▅▅▄▃▄▃▃▄▄▄▄▄▄▃▃▃▃▄▂▂▂▂▂▂▂▂▁▃▂▁▁▁▂
-validation-loss,█▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
-early-stopping-epoch,42.0
-training-loss,-0.22749
-validation-loss,-0.09549


## Ideas for things to try
1. Run this for GraphSAGE's mean and pool variants.
2. Turn this code into an experiment that measures how model performance depends on the value of k.
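
For the second idea in the list above, a sweep over `k` could be structured as follows. `run_experiment` is a hypothetical stand-in that would wrap the training and evaluation cells of this notebook. Here it returns a placeholder value, not a real measurement, so that the loop structure can be run on its own.

```python
def run_experiment(config):
    """Hypothetical stand-in: train, transfer, and return the target accuracy."""
    # Placeholder so the sketch runs without gtl; NOT a real result.
    return 0.0


def sweep_k(base_config, k_values):
    """Run one experiment per value of k and collect the results by k."""
    results = {}
    for k in k_values:
        config = {**base_config, "k": k}  # copy so base_config is left untouched
        results[k] = run_experiment(config)
    return results


base_config = {"lr": 0.01, "hidden_layers": 32, "patience": 50,
               "min_delta": 0.01, "n_epochs": 200, "k": 2}
results = sweep_k(base_config, k_values=[1, 2, 3])
print(results)
```

Since the splits and the training are random, each value of `k` would ideally be run several times and the accuracies averaged before comparing.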