# Graph Attention Network

\[[paper](https://arxiv.org/abs/1710.10903)\] , \[[original code](https://github.com/PetarV-/GAT)\] , \[[all other implementations](https://paperswithcode.com/paper/graph-attention-networks)\]

From [Graph Convolutional Network (GCN)](https://arxiv.org/abs/1609.02907), we learned that combining local graph structure with node-level features yields good performance on node classification tasks. However, the way GCN aggregates neighbor features depends on the graph structure, which may hurt its generalizability.

One workaround is to simply average over all neighbor node features, as in GraphSAGE. Graph Attention Network proposes an alternative: weighting neighbor features with a feature-dependent, structure-free normalization, in the style of attention.

The goal of this tutorial:

- Explain what a Graph Attention Network is.
- Understand the learned attention weights.
- Introduce inductive learning.

Introducing Attention to GCN
----------------------------

The key difference between GAT and GCN is how the information from the one-hop neighborhood is aggregated.

For GCN, a graph convolution operation produces the normalized sum of the node features of neighbors:


$$h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} {\frac{1}{c_{ij}} W^{(l)}h^{(l)}_j}\right)$$

where $\mathcal{N}(i)$ is the set of one-hop neighbors of node $v_i$ (to include $v_i$ itself in the set, simply add a self-loop to each node),
$c_{ij}=\sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$ is a normalization constant based on graph structure, $\sigma$ is an activation function (GCN uses ReLU), and $W^{(l)}$ is a shared weight matrix for node-wise feature transformation. Another model proposed in
[GraphSAGE](https://www-cs-faculty.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)
employs the same update rule except that they set
$c_{ij}=|\mathcal{N}(i)|$.
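To make the difference between the two normalization constants concrete, here is a minimal pure-Python sketch on a toy path graph. The graph, the scalar features, and the helper names (`gcn_norm`, `mean_norm`, `aggregate`) are illustrative assumptions, and the weight matrix $W^{(l)}$ and the activation $\sigma$ are omitted for brevity. Note that both constants depend only on node degrees, never on the features themselves.

```python
import math

# Toy undirected path graph 0 - 1 - 2 with self-loops, stored as neighbor lists.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}

def gcn_norm(i, j):
    """GCN constant c_ij = sqrt(|N(i)|) * sqrt(|N(j)|)."""
    return math.sqrt(len(neighbors[i])) * math.sqrt(len(neighbors[j]))

def mean_norm(i, j):
    """Mean-aggregation constant c_ij = |N(i)|."""
    return float(len(neighbors[i]))

def aggregate(h, norm):
    """Normalized neighbor sum for scalar features h (W and sigma omitted)."""
    return [sum(h[j] / norm(i, j) for j in neighbors[i]) for i in sorted(neighbors)]

h = [1.0, 2.0, 3.0]                  # one scalar feature per node
gcn_out = aggregate(h, gcn_norm)     # symmetric degree normalization
mean_out = aggregate(h, mean_norm)   # plain average over the neighborhood
```

With the mean constant, each node simply receives the average of its neighborhood; with the GCN constant, contributions from high-degree neighbors are scaled down further.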

GAT introduces the attention mechanism as a substitute for the statically
normalized convolution operation. Below are the equations to compute the node
embedding $h_i^{(l+1)}$ of layer $l+1$ from the embeddings of
layer $l$:

<img src="https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/gat/gat.png" height="350" width="450" align="center">
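As a rough companion to the figure, the sketch below implements one attention head in pure Python, assuming the standard GAT formulation: $z_i = W h_i$, $e_{ij} = \text{LeakyReLU}(a^T [z_i \| z_j])$, a softmax of $e_{ij}$ over $j \in \mathcal{N}(i)$ to obtain $\alpha_{ij}$, and a weighted sum of the $z_j$. The function name `gat_head`, the toy graph, and all parameter values are assumptions made for illustration; the output activation is left out.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_head(h, neighbors, W, a):
    """One attention head.

    h: list of input feature vectors, W: Fout x Fin matrix (list of rows),
    a: attention vector of length 2 * Fout, neighbors: dict node -> neighbor list.
    Returns (output features, attention coefficients per node).
    """
    # Linear transformation z_i = W h_i
    z = [[sum(w * x for w, x in zip(row, h_i)) for row in W] for h_i in h]
    out, alphas = [], {}
    for i in range(len(h)):
        nbrs = neighbors[i]
        # Unnormalized scores e_ij = LeakyReLU(a^T [z_i || z_j])
        e = [leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j]))) for j in nbrs]
        # Softmax over the neighborhood (max subtracted for numerical stability)
        m = max(e)
        w = [math.exp(x - m) for x in e]
        total = sum(w)
        alpha = [x / total for x in w]
        alphas[i] = dict(zip(nbrs, alpha))
        # Attention-weighted sum of transformed neighbor features
        out.append([sum(al * z[j][k] for al, j in zip(alpha, nbrs))
                    for k in range(len(W))])
    return out, alphas

# Toy example: 3 nodes on a path with self-loops, 2 input and 2 output features.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.25]
out, alphas = gat_head(h, neighbors, W, a)
uniform_out, uniform_alphas = gat_head(h, neighbors, W, [0.0] * 4)
```

Setting the attention vector to zero makes every score equal, so the head falls back to a plain neighborhood average; nonzero values let the weights depend on the features of both endpoints.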

In [11]:
!pip install -q --upgrade git+https://github.com/mlss-skoltech/tutorials_week2.git#subdirectory=graph_neural_networks

In [13]:
import pkg_resources

ZIP_PATH = pkg_resources.resource_filename('gnnutils', 'data/data.zip')
DATA_PATH = './data'

!unzip -u {ZIP_PATH} -d ./

Archive:  /anaconda3/lib/python3.7/site-packages/gnnutils/data/data.zip
  inflating: ./data/ind.cora.ally    
  inflating: ./data/ind.cora.test.index  
  inflating: ./data/ind.pubmed.graph  
  inflating: ./data/ind.cora.allx    
  inflating: ./data/ind.citeseer.test.index  
  inflating: ./data/ind.citeseer.ty  
  inflating: ./data/ind.citeseer.tx  
  inflating: ./data/ind.citeseer.graph  
  inflating: ./data/preprocessed_MNIST.dump  
  inflating: ./data/ind.cora.tx      
  inflating: ./data/ind.pubmed.test.index  
  inflating: ./data/ind.cora.ty      
  inflating: ./data/ind.cora.graph   
  inflating: ./data/ind.citeseer.y   
  inflating: ./data/ind.citeseer.ally  
  inflating: ./data/ind.citeseer.allx  
  inflating: ./data/ind.citeseer.x   
  inflating: ./data/ind.pubmed.ally  
  inflating: ./data/ind.pubmed.allx  
  inflating: ./data/ind.pubmed.x     
  inflating: ./data/ind.pubmed.tx    
  inflating: ./data/ind.cora.x       
  inflating: ./data/ind.pubmed.ty    
  inflating: ./data/

In [14]:
import os, sys, inspect
import joblib
import tensorflow as tf
import numpy as np
import h5py
import scipy.sparse.linalg as la
import scipy.sparse as sp
import scipy
import time
import pickle

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
%matplotlib inline

import scipy.io as sio

from gnnutils import process_data

In [15]:
def count_no_weights():
    total_parameters = 0
    for variable in tf.trainable_variables():
        # shape is an array of tf.Dimension
        shape = variable.get_shape()
        variable_parameters = 1
        for dim in shape:
            variable_parameters *= dim.value
        total_parameters += variable_parameters
    print('#weights in the model: %d' % (total_parameters,))

def frobenius_norm(tensor):
    square_tensor = tf.square(tensor)
    tensor_sum = tf.reduce_sum(square_tensor)
    frobenius_norm = tf.sqrt(tensor_sum)
    return frobenius_norm


In [16]:
class GAT:
    
    """
    The neural network model.
    """
    def __init__(self, idx_rows, idx_cols, A_shape, X, Y, num_hidden_feat, n_heads, learning_rate=5e-2, gamma=1e-3, idx_gpu = '/gpu:3'):
        
        self.num_hidden_feat = num_hidden_feat
        self.learning_rate = learning_rate
        self.gamma=gamma
        with tf.Graph().as_default() as g:
                self.graph = g
                
                with tf.device(idx_gpu):
                            
                        # list of weights' tensors l2-loss 
                        self.regularizers = []
                            
                        #definition of constant matrices
                        self.X = tf.constant(X, dtype=tf.float32) 
                        self.Y = tf.constant(Y, dtype=tf.float32)
                        
                        #placeholder definition
                        self.idx_nodes = tf.placeholder(tf.int32)
                        self.keep_prob = tf.placeholder(tf.float32)
                        
                        #model definition
                        
                        self.X0 = []
                        for k in range(n_heads):
                            with tf.variable_scope('GCL_1_{}'.format(k+1)):
                                self.X0.append(self.GAT_layer(self.X, num_hidden_feat, idx_rows, idx_cols, A_shape, tf.nn.elu))
                        self.X0 = tf.concat(self.X0, 1)
                        
                        with tf.variable_scope('GCL_2'):
                            self.logits = self.GAT_layer(self.X0, Y.shape[1], idx_rows, idx_cols, A_shape, tf.identity)
                        
                        self.l_out = tf.gather(self.logits, self.idx_nodes)
                        self.c_Y = tf.gather(self.Y, self.idx_nodes)
                        
                        #loss function definition
                        self.l2_reg = tf.reduce_sum(self.regularizers)
                        self.data_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.l_out, labels=self.c_Y)) 
                        
                        self.loss = self.data_loss + self.gamma*self.l2_reg
                        
                        #solver definition
                        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
                        self.opt_step = self.optimizer.minimize(self.loss)
                        
                        #predictions and accuracy extraction
                        self.c_predictions = tf.argmax(tf.nn.softmax(self.l_out), 1)
                        self.accuracy = tf.contrib.metrics.accuracy(self.c_predictions, tf.argmax(self.c_Y, 1))
                        
                        #gradients computation
                        self.trainable_variables = tf.trainable_variables()
                        self.var_grad = tf.gradients(self.loss, tf.trainable_variables())
                        self.norm_grad = frobenius_norm(tf.concat([tf.reshape(g, [-1]) for g in self.var_grad], 0))
                        
                        #session creation
                        config = tf.ConfigProto(allow_soft_placement = True)
                        config.gpu_options.allow_growth = True
                        self.session = tf.Session(config=config)

                        #session initialization
                        init = tf.global_variables_initializer()
                        self.session.run(init)
                        
                        count_no_weights()

    def GAT_layer(self, X, Fout, idx_rows, idx_cols, A_shape, activation):
        X = tf.nn.dropout(X,  self.keep_prob)
        
        W = tf.get_variable("W", shape=[X.shape[1], Fout], initializer=tf.glorot_uniform_initializer())
        self.regularizers.append(tf.nn.l2_loss(W))
        X_w = tf.matmul(X, W)

        # simplest possible attention mechanism
        W_att1 = tf.get_variable("W_att1", shape=[X_w.shape[1], 1], initializer=tf.glorot_uniform_initializer())
        b_att1 = tf.get_variable("b_att1", shape=[1,], initializer=tf.zeros_initializer())
        self.regularizers.append(tf.nn.l2_loss(W_att1))
        W_att2 = tf.get_variable("W_att2", shape=[X_w.shape[1], 1], initializer=tf.glorot_uniform_initializer())
        b_att2 = tf.get_variable("b_att2", shape=[1,], initializer=tf.zeros_initializer())
        self.regularizers.append(tf.nn.l2_loss(W_att2))
                            
        X_att_1 = tf.squeeze(tf.matmul(X_w, W_att1)) + b_att1
        X_att_2 = tf.squeeze(tf.matmul(X_w, W_att2)) + b_att2
        
        logits = tf.gather(X_att_1, idx_rows) +  tf.gather(X_att_2, idx_cols)
                            
        A_att = tf.SparseTensor(indices=np.vstack([idx_rows, idx_cols]).T, 
                                values=tf.nn.leaky_relu(logits), 
                                dense_shape=A_shape)
        A_att = tf.sparse_reorder(A_att)
        A_att = tf.sparse_softmax(A_att)
        
        # apply dropout
        A_att = tf.SparseTensor(indices=A_att.indices,
                                values=tf.nn.dropout(A_att.values, self.keep_prob),
                                dense_shape=A_shape)
        A_att = tf.sparse_reorder(A_att)

        X_w = tf.nn.dropout(X_w, self.keep_prob)
        res = tf.sparse_tensor_dense_matmul(A_att, X_w)
        res = tf.contrib.layers.bias_add(res)

        return activation(res)
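
The `GAT_layer` method above never builds the concatenated vector $[z_i \| z_j]$ explicitly. Because the attention vector acts linearly, $a^T [z_i \| z_j] = a_1^T z_i + a_2^T z_j$, so one score per node can be computed for each half (`W_att1`, `W_att2`) and then gathered along the edge list. The following pure-Python sketch checks this identity on made-up numbers; the variable names and values are assumptions, and the scalar biases used in the notebook are omitted.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

z = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # transformed features, one row per node
a1, a2 = [0.2, -0.1], [0.4, 0.3]           # the two halves of the attention vector
rows, cols = [0, 0, 1, 2], [0, 1, 2, 2]    # edge list: node rows[k] attends to cols[k]

# Per-edge form: concatenate both endpoints, then take one dot product per edge.
concat_logits = [dot(a1 + a2, z[i] + z[j]) for i, j in zip(rows, cols)]

# Per-node form: one score per node and half, then a gather-and-add per edge.
s1 = [dot(a1, z_i) for z_i in z]
s2 = [dot(a2, z_i) for z_i in z]
gathered_logits = [s1[i] + s2[j] for i, j in zip(rows, cols)]
```

The per-node form needs two dot products per node instead of one long dot product per edge, which matters when the number of edges is much larger than the number of nodes.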
     

Multi-head Attention
--------------------

Analogous to multiple channels in ConvNet, GAT introduces **multi-head
attention** to enrich the model capacity and to stabilize the learning
process. Each attention head has its own parameters and their outputs can be
merged in two ways:

\begin{align}\text{concatenation}: h^{(l+1)}_{i} =||_{k=1}^{K}\sigma\left(\sum_{j\in \mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)\end{align}

or

\begin{align}\text{average}: h_{i}^{(l+1)}=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)\end{align}

where $K$ is the number of heads. The authors suggest using
concatenation for intermediate layers and averaging for the final layer.
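
The two merging strategies can be sketched in a few lines of pure Python. The head outputs below are hypothetical pre-activation values for a single node, and `merge_concat` and `merge_average` are illustrative helper names. For comparison, the `GAT` class in this notebook concatenates the heads of the first layer and uses a single head for the output layer instead of averaging several.

```python
import math

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def identity(x):
    return x

# Hypothetical pre-activation outputs of K = 2 heads for one node, 3 features each.
head_outputs = [[1.0, -1.0, 0.5], [3.0, 1.0, -0.5]]

def merge_concat(heads, act):
    """Apply the activation per head, then concatenate: K * F features."""
    return [act(x) for head in heads for x in head]

def merge_average(heads, act):
    """Average the heads feature-wise, then apply the activation: F features."""
    k = len(heads)
    return [act(sum(col) / k) for col in zip(*heads)]

hidden = merge_concat(head_outputs, elu)         # length 6, for intermediate layers
final = merge_average(head_outputs, identity)    # length 3, for the final layer
```

Concatenation grows the feature dimension with the number of heads, which is why averaging is the usual choice when the output size must equal the number of classes.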


In [17]:
#learning parameters and path dataset

learning_rate = 5e-3
val_test_interval = 1
num_hidden_feat = 8
n_heads = 8
gamma = 5e-4
patience = 100
path_dataset = './CORA/dataset.pickle'
    
#dataset loading
#ds = Dataset(path_dataset, normalize_feat=1)

A, X, Y, train_idx, val_idx, test_idx = process_data.load_data("cora", path_to_data=DATA_PATH)
X = process_data.preprocess_features(X)

(2708, 2708)
(2708, 1433)


In [None]:
# extracts rows and cols of adjacency matrix
A = sp.csr_matrix(A)
A.setdiag(1)

idx_rows, idx_cols = A.nonzero()

In [None]:
from tqdm import tqdm

In [None]:
# num_exp = 10 # number of times to train GAT on the given dataset
num_exp = 1 # number of times to train GAT on the given dataset

list_all_acc = []
list_all_cost_val_avg  = []
list_all_data_cost_val_avg = []
list_all_acc_val_avg   = []
list_all_cost_test_avg = []
list_all_acc_test_avg  = []

num_done = 0

In [None]:
num_total_iter_training = int(10e4)

GCNN = GAT(idx_rows, idx_cols, A.shape, X, Y, num_hidden_feat, n_heads, learning_rate=learning_rate, gamma=gamma)

cost_train_avg      = []
grad_norm_train_avg = []
acc_train_avg       = []
cost_test_avg       = []
grad_norm_test_avg  = []
acc_test_avg        = []
cost_val_avg        = []
data_cost_val_avg   = []
acc_val_avg         = []
iter_test           = []
list_training_time = list()

max_val_acc = 0
min_val_loss = np.inf

#Training code
for i in tqdm(range(num_total_iter_training)):
    if (len(cost_train_avg) % val_test_interval) == 0:
        #Print last training performance
        if (len(cost_train_avg)>0):
            tqdm.write("[TRN] epoch = %03i, cost = %3.2e, |grad| = %.2e, acc = %3.2e (%03.2fs)" % \
            (len(cost_train_avg), cost_train_avg[-1], grad_norm_train_avg[-1], acc_train_avg[-1], time.time() - tic))

        #Validate the model
        tic = time.time()

        feed_dict = {GCNN.idx_nodes: val_idx, GCNN.keep_prob:1.0}
        acc_val, cost_val, data_cost_val = GCNN.session.run([GCNN.accuracy, GCNN.loss, GCNN.data_loss], feed_dict)

        data_cost_val_avg.append(data_cost_val)
        cost_val_avg.append(cost_val)
        acc_val_avg.append(acc_val)
        tqdm.write("[VAL] epoch = %03i, data_cost = %3.2e, cost = %3.2e, acc = %3.2e (%03.2fs)" % \
            (len(cost_train_avg), data_cost_val_avg[-1], cost_val_avg[-1], acc_val_avg[-1],  time.time() - tic))

        #Test the model
        tic = time.time()

        feed_dict = {GCNN.idx_nodes: test_idx, GCNN.keep_prob:1.0}
        acc_test, cost_test = GCNN.session.run([GCNN.accuracy, GCNN.loss], feed_dict)

        cost_test_avg.append(cost_test)
        acc_test_avg.append(acc_test)
        tqdm.write("[TST] epoch = %03i, cost = %3.2e, acc = %3.2e (%03.2fs)" % \
            (len(cost_train_avg), cost_test_avg[-1], acc_test_avg[-1],  time.time() - tic))
        iter_test.append(len(cost_train_avg))


        if acc_val_avg[-1] >= max_val_acc or data_cost_val_avg[-1] <= min_val_loss:
            max_val_acc = np.maximum(acc_val_avg[-1], max_val_acc)
            min_val_loss = np.minimum(data_cost_val_avg[-1], min_val_loss)
            if acc_val_avg[-1] >= max_val_acc and data_cost_val_avg[-1] <= min_val_loss:
                best_model_test_acc = acc_test_avg[-1]
            curr_step = 0
        else:
            curr_step += 1
            if curr_step == patience:
                tqdm.write('Early stop! Min loss: %f, Max accuracy: %f' % (min_val_loss, max_val_acc))
                break

    tic = time.time()
    feed_dict = {GCNN.idx_nodes: train_idx, GCNN.keep_prob: 0.4}

    _, current_training_loss, norm_grad, current_acc_training = GCNN.session.run([GCNN.opt_step, GCNN.loss, GCNN.norm_grad, GCNN.accuracy], feed_dict) 

    training_time = time.time() - tic   

    cost_train_avg.append(current_training_loss)
    grad_norm_train_avg.append(norm_grad)
    acc_train_avg.append(current_acc_training)


#Compute and print statistics of the last realized experiment
list_all_acc.append(100*best_model_test_acc)
list_all_cost_val_avg.append(cost_val_avg)
list_all_data_cost_val_avg.append(data_cost_val_avg)
list_all_acc_val_avg.append(acc_val_avg)
list_all_cost_test_avg.append(cost_test_avg)
list_all_acc_test_avg.append(acc_test_avg)

print('Num done: %d' % num_done)
print('Max accuracy on test set achieved: %f%%' % np.max(np.asarray(acc_test_avg)*100))
print('Max suggested accuracy: %f%%' % (100*best_model_test_acc))#(np.asarray(acc_test_avg)[np.asarray(data_cost_val_avg)==np.min(data_cost_val_avg)]),))
print('Current mean: %f%%' % np.mean(list_all_acc))
print('Current std: %f' % np.std(list_all_acc))

num_done += 1
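
The stopping rule used in the training loop can be isolated as follows: the patience counter resets whenever the validation accuracy matches or beats its best value, or the validation data loss matches or beats its best value, and training stops after `patience` consecutive checks with neither. This is a minimal sketch of that logic only; the function name `early_stop_step` and the metric values are made up for illustration.

```python
def early_stop_step(val_acc, val_loss, patience):
    """Return the index of the check at which training would stop, or None."""
    max_acc, min_loss, waited = 0.0, float("inf"), 0
    for step, (acc, loss) in enumerate(zip(val_acc, val_loss)):
        if acc >= max_acc or loss <= min_loss:
            # Either metric is at least as good as its best so far: reset patience.
            max_acc = max(acc, max_acc)
            min_loss = min(loss, min_loss)
            waited = 0
        else:
            waited += 1
            if waited == patience:
                return step
    return None

# Made-up validation curves: both metrics stall after the second check.
val_acc = [0.50, 0.60, 0.55, 0.55, 0.55]
val_loss = [1.00, 0.90, 0.95, 0.95, 0.95]
stop_at = early_stop_step(val_acc, val_loss, patience=2)
never = early_stop_step(val_acc, val_loss, patience=5)
```

Because the condition uses "or", a run continues as long as either metric keeps improving, which is more permissive than monitoring a single metric.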

In [7]:
#Print average performance
print(np.mean(list_all_acc))
print(np.std(list_all_acc))

83.25999975204468
0.5765412469708043
