# Shallow methods for supervised learning

In this notebook we will explore a very naive (yet powerful) approach to graph-based supervised machine learning. The idea relies on the classic machine learning approach of handcrafted feature extraction.

In Chapter 1 you learned how local and global graph properties can be extracted from graphs. Those properties describe the graph itself and carry important information that can be useful for classification.

In [1]:
# !pip install stellargraph

In this demo, we will be using the PROTEINS dataset, which is already integrated in StellarGraph.

In [2]:
from stellargraph import datasets
from IPython.display import display, HTML

dataset = datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()


To compute the graph metrics, one way is to retrieve the adjacency matrix representation of each graph.

In [3]:
# convert graphs from StellarGraph format to dense NumPy adjacency matrices
adjs = [graph.to_adjacency_matrix().A for graph in graphs]
# convert labels from pandas.Series to a NumPy array
labels = graph_labels.to_numpy(dtype=int)

In [4]:
import numpy as np
import networkx as nx

metrics = []
for adj in adjs:
    # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is the replacement
    G = nx.from_numpy_array(adj)
    # basic properties
    num_edges = G.number_of_edges()
    # clustering measures
    cc = nx.average_clustering(G)
    # measure of efficiency
    eff = nx.global_efficiency(G)

    metrics.append([num_edges, cc, eff])
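
To sanity-check the three metrics used above, here is a minimal sketch that computes them on two toy graphs (a triangle and a three-node path, chosen only for illustration). The helper name `graph_features` is hypothetical and not part of the notebook.

```python
import networkx as nx


def graph_features(G):
    """Return [number of edges, average clustering, global efficiency]."""
    return [
        G.number_of_edges(),
        nx.average_clustering(G),
        nx.global_efficiency(G),
    ]


triangle = nx.complete_graph(3)  # every pair of nodes is connected
path = nx.path_graph(3)          # 0 - 1 - 2, no triangles

triangle_features = graph_features(triangle)
path_features = graph_features(path)

print(triangle_features)
print(path_features)
```

The triangle is maximally clustered and maximally efficient, while the path has no triangles and a longer average distance, so even these three numbers already separate the two structures.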



We can now use scikit-learn utilities to create a training set and a test set. In our experiments, we will use 70% of the dataset for training and the remaining 30% for testing.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.3, random_state=42)

As is common in many machine learning workflows, we preprocess the features so that they have zero mean and unit standard deviation.

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
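
As a small illustration of what the scaling step does, the sketch below applies `StandardScaler` to made-up toy values (the `*_toy` names are hypothetical). The scaler is fitted on the training rows only, so the test rows are transformed with the training statistics and do not necessarily end up with zero mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: rows are graphs, columns are features.
X_train_toy = np.array([[10.0, 0.2], [20.0, 0.4], [30.0, 0.6]])
X_test_toy = np.array([[40.0, 0.8]])

# The mean and standard deviation are estimated from the training set only.
toy_scaler = StandardScaler().fit(X_train_toy)
X_train_toy_scaled = toy_scaler.transform(X_train_toy)
X_test_toy_scaled = toy_scaler.transform(X_test_toy)

print(X_train_toy_scaled.mean(axis=0))  # approximately zero for each column
print(X_train_toy_scaled.std(axis=0))   # approximately one for each column
print(X_test_toy_scaled)                # scaled with the training statistics
```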

It's now time to train a classifier. We chose a support vector machine for this task.

In [7]:
from sklearn import svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf = svm.SVC()
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)

print('Accuracy', accuracy_score(y_test,y_pred))
print('Precision', precision_score(y_test,y_pred))
print('Recall', recall_score(y_test,y_pred))
print('F1-score', f1_score(y_test,y_pred))

Accuracy 0.7455089820359282
Precision 0.7709251101321586
Recall 0.8413461538461539
F1-score 0.8045977011494253
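
To make the relationship between these scores explicit, the following sketch derives them from a confusion matrix. The labels are made up for illustration and are unrelated to the PROTEINS results above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only (1 is the positive class).
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` from scikit-learn on the same inputs.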


# Supervised graph representation learning using Graph ConvNet

In this notebook we will be performing supervised graph representation learning using Deep Graph ConvNet as encoder.

The model embeds a graph by using stacked Graph ConvNet layers.

In this demo, we will be using the PROTEINS dataset, which is already integrated in StellarGraph.

In [8]:
import pandas as pd
from stellargraph import datasets
from IPython.display import display, HTML

dataset = datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()

labels = graph_labels.to_numpy(dtype=int)

# necessary for converting default string labels to int
graph_labels = pd.get_dummies(graph_labels, drop_first=True)

StellarGraph, which we use to build the model, relies on tf.Keras as its backend. Accordingly, we need a data generator to feed the model. For supervised graph classification, we create an instance of StellarGraph's PaddedGraphGenerator class. This generator supplies the feature arrays and the adjacency matrices to a mini-batch Keras graph classification model. Differences in the number of nodes are resolved by padding each batch of features and adjacency matrices, and by supplying a boolean mask indicating which entries are valid and which are padding.

In [9]:
from stellargraph.mapper import PaddedGraphGenerator
generator = PaddedGraphGenerator(graphs=graphs)
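
The padding idea described above can be sketched in plain NumPy. This is only an illustration of the concept, not StellarGraph's actual implementation; the function `pad_batch` and the toy arrays are hypothetical.

```python
import numpy as np


def pad_batch(features, adjacencies):
    """Pad per-graph arrays to the largest node count in the batch."""
    max_nodes = max(f.shape[0] for f in features)
    n_features = features[0].shape[1]
    batch_size = len(features)

    X = np.zeros((batch_size, max_nodes, n_features))
    A = np.zeros((batch_size, max_nodes, max_nodes))
    mask = np.zeros((batch_size, max_nodes), dtype=bool)

    for i, (f, a) in enumerate(zip(features, adjacencies)):
        n = f.shape[0]
        X[i, :n] = f        # real node features, the rest stays zero
        A[i, :n, :n] = a    # real adjacency block, the rest stays zero
        mask[i, :n] = True  # True marks valid nodes, False marks padding
    return X, A, mask


# Two toy graphs with 2 and 4 nodes and 3 features per node.
feats_toy = [np.ones((2, 3)), np.ones((4, 3))]
adjs_toy = [np.array([[0, 1], [1, 0]]), np.ones((4, 4)) - np.eye(4)]

X_pad, A_pad, mask = pad_batch(feats_toy, adjs_toy)
print(X_pad.shape, A_pad.shape)
print(mask)
```

The mask lets downstream layers ignore the zero rows that were added only to make the batch rectangular.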

We are now ready to create the model. The GCN layers are created and stacked together through StellarGraph's utility function. This _backbone_ is then followed by 1D convolutional layers and fully connected layers built with tf.Keras.

In [10]:
from stellargraph.layer import DeepGraphCNN
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow.keras.losses import binary_crossentropy
import tensorflow as tf

nrows = 35  # the number of rows for the output tensor
layer_dims = [32, 32, 32, 1]

dgcnn_model = DeepGraphCNN(
    layer_sizes=layer_dims,
    activations=["tanh", "tanh", "tanh", "tanh"],
    k=nrows,
    bias=False,
    generator=generator,
)
gnn_inp, gnn_out = dgcnn_model.in_out_tensors()


x_out = Conv1D(filters=16, kernel_size=sum(layer_dims), strides=sum(layer_dims))(gnn_out)
x_out = MaxPool1D(pool_size=2)(x_out)

x_out = Conv1D(filters=32, kernel_size=5, strides=1)(x_out)

x_out = Flatten()(x_out)

x_out = Dense(units=128, activation="relu")(x_out)
x_out = Dropout(rate=0.5)(x_out)

predictions = Dense(units=1, activation="sigmoid")(x_out)
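
The choice of `kernel_size` and `strides` in the first `Conv1D` can be checked with some shape arithmetic. The sketch below assumes that the DeepGraphCNN output holds `k * sum(layer_dims)` values per graph in a single channel, and that the Keras layers use their default `"valid"` padding; the helper `out_len` is hypothetical.

```python
def out_len(n, kernel, stride):
    """Output length of a 1D convolution or pooling with 'valid' padding."""
    return (n - kernel) // stride + 1


k = 35                        # rows kept per graph (nrows above)
layer_dims = [32, 32, 32, 1]
width = sum(layer_dims)       # concatenated feature width per node: 97

seq_len = k * width                            # assumed flattened output length
after_conv1 = out_len(seq_len, width, width)   # one step per retained node
after_pool = out_len(after_conv1, 2, 2)        # MaxPool1D(pool_size=2)
after_conv2 = out_len(after_pool, 5, 1)        # Conv1D(kernel_size=5, strides=1)
flat_units = after_conv2 * 32                  # 32 filters in the second Conv1D

print(seq_len, after_conv1, after_pool, after_conv2, flat_units)
```

Under these assumptions, setting both the kernel size and the stride to `sum(layer_dims)` makes the first convolution read exactly one node's concatenated features at each step.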



Let's now compile the model

In [11]:
model = Model(inputs=gnn_inp, outputs=predictions)
# `lr` is a deprecated alias; `learning_rate` is the supported argument name
model.compile(optimizer=Adam(learning_rate=0.0001), loss=binary_crossentropy, metrics=["acc"])

We use 70% of the dataset for training and the remaining 30% for testing.

In [12]:
from sklearn import model_selection
train_graphs, test_graphs = model_selection.train_test_split(
    graph_labels, test_size=.3, stratify=labels,
)

In [13]:
gen = PaddedGraphGenerator(graphs=graphs)

train_gen = gen.flow(
    list(train_graphs.index - 1),
    targets=train_graphs.values,
    symmetric_normalization=False,
    batch_size=50,
)

test_gen = gen.flow(
    list(test_graphs.index - 1),
    targets=test_graphs.values,
    symmetric_normalization=False,
    batch_size=1,
)
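
The `index - 1` shift above suggests that the label index starts at 1 while the generator expects 0-based positions in the list of graphs. Assuming that is the case, the toy sketch below shows the mapping; all names and values in it are made up.

```python
import pandas as pd

# Made-up labels with a 1-based index, mimicking the assumed label layout.
toy_labels = pd.Series([1, 0, 1, 0], index=[1, 2, 3, 4])
# Stand-ins for graph objects, stored in a 0-based Python list.
toy_graphs = ["g0", "g1", "g2", "g3"]

# A subset of labels, as a train/test split would return it.
selected = toy_labels.loc[[2, 4]]

# Shift the 1-based label index to 0-based list positions.
positions = list(selected.index - 1)
picked = [toy_graphs[p] for p in positions]

print(positions, picked)
```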

It's now time for training!

In [14]:
epochs = 100
history = model.fit(
    train_gen, epochs=epochs, verbose=1, validation_data=test_gen, shuffle=True,
)

Epoch 1/100
 1/16 [>.............................] - ETA: 19s - loss: 0.7161 - acc: 0.4800
...
Epoch 5/100
 3/16 [====>.........................] - ETA: 2s - loss: 0.6158 - acc: 0.7089
...

In [15]:
# https://stellargraph.readthedocs.io/en/stable/demos/graph-classification/index.html

# Supervised node representation learning using GraphSAGE

In [16]:
from stellargraph import datasets
from IPython.display import display, HTML

dataset = datasets.Cora()
display(HTML(dataset.description))
G, nodes = dataset.load()

Let's split the dataset into a training set and a test set. Here only 10% of the nodes are used for training, and the split is stratified by label.

In [17]:
from sklearn.model_selection import train_test_split
train_nodes, test_nodes = train_test_split(
    nodes, train_size=0.1, test_size=None, stratify=nodes
)

Since we are performing a multi-class classification, it is useful to represent each categorical label with its one-hot encoding.

In [18]:
from sklearn import preprocessing, feature_extraction, model_selection
label_encoding = preprocessing.LabelBinarizer()
train_labels = label_encoding.fit_transform(train_nodes)
test_labels = label_encoding.transform(test_nodes)
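
As a minimal sketch of what `LabelBinarizer` produces, the example below encodes a few made-up subject labels (the `toy_*` names are hypothetical). Classes are ordered alphabetically, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up labels for illustration only.
toy_subjects = ["Theory", "Neural_Networks", "Theory", "Genetic_Algorithms"]

toy_encoder = LabelBinarizer()
one_hot = toy_encoder.fit_transform(toy_subjects)  # one column per class

print(toy_encoder.classes_)  # column order of the encoding
print(one_hot)

# Map the one-hot rows back to the original string labels.
decoded = toy_encoder.inverse_transform(one_hot)
print(decoded)
```

Fitting the encoder on the training labels and reusing it for the test labels, as done above, keeps the column order identical across both sets.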

It's now time to create the model. It is composed of three GraphSAGE layers (one per entry in `n_samples` and `layer_sizes`) followed by a Dense layer with softmax activation for classification.

In [19]:
from stellargraph.mapper import GraphSAGENodeGenerator
batchsize = 50
n_samples = [10, 5, 7]
generator = GraphSAGENodeGenerator(G, batchsize, n_samples)
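
Each entry of `n_samples` gives the number of neighbours sampled per node at the corresponding hop. The pure-Python sketch below illustrates fixed-size neighbourhood sampling with replacement on a toy graph. It is only a conceptual illustration, not StellarGraph's sampler, and the function `sample_neighbourhood` and the toy adjacency are hypothetical.

```python
import random


def sample_neighbourhood(adjacency, node, n_samples, rng):
    """Sample a fixed number of neighbours per node at each hop, with replacement."""
    layers = [[node]]
    for size in n_samples:
        next_layer = []
        for u in layers[-1]:
            neighbours = adjacency[u]  # assumes every node has at least one neighbour
            next_layer.extend(rng.choice(neighbours) for _ in range(size))
        layers.append(next_layer)
    return layers


# Toy undirected graph as an adjacency list.
toy_adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

rng = random.Random(0)
layers = sample_neighbourhood(toy_adjacency, 0, [10, 5, 7], rng)
hop_sizes = [len(layer) for layer in layers]
print(hop_sizes)  # [1, 10, 50, 350]
```

With `n_samples = [10, 5, 7]`, a single target node expands into 10, then 50, then 350 sampled nodes, which is why larger sample sizes or more hops quickly increase the cost of each batch.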

In [20]:
from stellargraph.layer import GraphSAGE
from tensorflow.keras.layers import Dense

graphsage_model = GraphSAGE(
    layer_sizes=[32, 32, 16], generator=generator, bias=True, dropout=0.6,
)

In [21]:
gnn_inp, gnn_out = graphsage_model.in_out_tensors()
outputs = Dense(units=train_labels.shape[1], activation="softmax")(gnn_out)

In [22]:
from tensorflow.keras.losses import categorical_crossentropy
# import from tensorflow.keras (not standalone keras) to match the StellarGraph layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

model = Model(inputs=gnn_inp, outputs=outputs)
model.compile(optimizer=Adam(learning_rate=0.003), loss=categorical_crossentropy, metrics=["acc"])

We will use the flow function of the generator for feeding the model with the train and the test set.

In [23]:
train_gen = generator.flow(train_nodes.index, train_labels, shuffle=True)
test_gen = generator.flow(test_nodes.index, test_labels)

Finally, let's train the model!

In [24]:
history = model.fit(train_gen, epochs=20, validation_data=test_gen, verbose=2, shuffle=False)

Epoch 1/20
6/6 - 12s - loss: 1.8998 - acc: 0.1963 - val_loss: 1.8041 - val_acc: 0.3076
Epoch 2/20
6/6 - 10s - loss: 1.8258 - acc: 0.3148 - val_loss: 1.7662 - val_acc: 0.3093
Epoch 3/20
6/6 - 10s - loss: 1.7868 - acc: 0.3481 - val_loss: 1.7234 - val_acc: 0.3220
Epoch 4/20
6/6 - 10s - loss: 1.7427 - acc: 0.3407 - val_loss: 1.6617 - val_acc: 0.3991
Epoch 5/20
6/6 - 11s - loss: 1.6938 - acc: 0.3852 - val_loss: 1.5804 - val_acc: 0.5160
Epoch 6/20
6/6 - 10s - loss: 1.6212 - acc: 0.5037 - val_loss: 1.4891 - val_acc: 0.5853
Epoch 7/20
6/6 - 11s - loss: 1.5469 - acc: 0.6000 - val_loss: 1.3960 - val_acc: 0.6870
Epoch 8/20
6/6 - 11s - loss: 1.4961 - acc: 0.6222 - val_loss: 1.3267 - val_acc: 0.7559
Epoch 9/20
6/6 - 11s - loss: 1.3982 - acc: 0.7074 - val_loss: 1.2663 - val_acc: 0.7765
Epoch 10/20
6/6 - 11s - loss: 1.3558 - acc: 0.7296 - val_loss: 1.2178 - val_acc: 0.7646
Epoch 11/20
6/6 - 11s - loss: 1.2754 - acc: 0.7778 - val_loss: 1.1740 - val_acc: 0.7744
Epoch 12/20
6/6 - 13s - loss: 1.2252 - ac