-
Hi @theo-long, thanks for the question. Unfortunately, I'm familiar with neither the spektral library nor the paper you mentioned, so I can only give a few general pointers regarding BNN training that might help you debug this:
I hope these ideas help you debug the convergence issues. If this doesn't help, it might be worth trying the default Keras training loop.
-
Hi,
I'm currently trying to recreate the Binarized Graph Convolutional Network from the paper 'Bi-GCN: Binary Graph Convolutional Network' (link) by Wang et al. using Larq. The code for that paper is available here: Bi-GCN Github. It is written in PyTorch and PyTorch Geometric, and I am trying to rewrite it using TensorFlow, Larq, and spektral. However, I am struggling to achieve good performance with my current implementation: the PyTorch code produces a model with ~80% accuracy, while mine only reaches ~20%!
One of the main issues seems to be a weak gradient signal: the model gets stuck at a relatively large loss and fails to decrease it any further. I've been having a hard time diagnosing exactly what is causing this. One thing to note is that when I set `input_quantizer=None, kernel_quantizer=None` (i.e. just a normal `Dense` layer), I'm able to achieve 80+% accuracy, which is comparable to the original non-binarized GCN architecture. This suggests the issues I'm seeing are caused directly by the quantization and not by my implementation of the GCN. Please let me know if you have any ideas about what the issue might be and any tips for dealing with it. See below for my code.

I first define the graph convolutional layer and read in the Cora dataset using the spektral library:
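Conceptually, the binarized propagation I'm trying to reproduce can be sketched in plain NumPy (this is a hypothetical illustration of the scaled sign-binarization scheme described in the Bi-GCN paper, not my actual Larq code; `binarize` and `bi_gcn_layer` are made-up names):

```python
import numpy as np

def binarize(x):
    # Sign binarization with a per-row scaling factor (mean absolute value),
    # in the spirit of Bi-GCN. Zeros are mapped to +1 so sign() never emits 0.
    scale = np.mean(np.abs(x), axis=-1, keepdims=True)
    return np.sign(np.where(x == 0, 1.0, x)) * scale

def bi_gcn_layer(a_hat, x, w):
    # One binarized graph convolution: A_hat @ bin(X) @ bin(W),
    # where a_hat is the normalized adjacency matrix.
    return a_hat @ binarize(x) @ binarize(w)

# Toy example: 3-node graph with an identity "adjacency" for simplicity.
a_hat = np.eye(3)
x = np.array([[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]])
w = np.array([[1.2, -0.4], [0.6, 0.9]])
out = bi_gcn_layer(a_hat, x, w)
print(out.shape)  # (3, 2)
```

In the actual model the forward pass uses the binarized values while gradients flow through a straight-through estimator, which is what Larq's quantizers provide.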