# Graph Attention Network (GAT)

### Challenges with GCNs


GCNs update a node's feature representation by aggregating information from its neighbors. A common approach weights each neighbor's contribution with a normalization factor based on node degrees, $\frac{1}{\sqrt{deg(i)}\sqrt{deg(j)}}$. The factor shrinks as the degrees grow, so a neighbor with few connections receives more weight than one with many. The idea is to balance the influence of each node's neighbors so that high-degree nodes don't dominate the feature updates.

Normalization based solely on node degrees can be limiting because it ignores the actual features of the nodes. Important feature information might therefore be overlooked, and low-degree nodes might be given undue importance regardless of their relevance to the task.
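To make the limitation concrete, here is a minimal sketch that computes the degree-based coefficient for every edge of a small graph. The graph itself is a hypothetical example, NumPy is used to keep the sketch light, and self-loops are left out for brevity.

```python
import numpy as np

# Hypothetical undirected graph: node 0 is a hub, node 3 links the hub to node 4
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
n = 5

# Build the symmetric adjacency matrix
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)  # degree of each node

# Coefficient 1 / (sqrt(deg(i)) * sqrt(deg(j))) on existing edges, 0 elsewhere
norm = A / np.sqrt(np.outer(deg, deg))

for i, j in edges:
    print(f"edge ({i}, {j}): degrees = ({deg[i]:.0f}, {deg[j]:.0f}), "
          f"coefficient = {norm[i, j]:.3f}")
```

Edges (0, 1) and (0, 2) receive the same coefficient because their endpoints have the same degrees. No node features enter the computation at all, which is exactly the gap an attention mechanism is meant to close.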

GATs address these challenges by introducing attention mechanisms into the aggregation process. Instead of using a fixed normalization factor, GATs learn dynamic weighting factors for each neighbor, allowing the network to focus on the most relevant nodes and features.

### Advantages of GATs

- Feature-based Attention: GATs compute attention coefficients for each pair of connected nodes based on their features. This means the importance of a neighbor is learned from the data, considering both the node's and its neighbor's features.

- Adaptive Weights: The attention mechanism allows the model to adaptively assign different weights to different neighbors, making it more flexible and powerful in capturing the underlying structure and feature significance in the graph.
- End-to-End Training: The attention coefficients are learned end-to-end with the rest of the model parameters, allowing the GAT to optimize these weights for the specific task at hand.

## Graph attentional layer

A key feature of GATs is their ability to calculate *attention scores* through a mechanism known as self-attention. This involves comparing node features to each other to determine the importance of neighboring nodes, allowing the model to focus on the most relevant parts of the graph.

In traditional graph neural networks, the aggregation of information from neighboring nodes often treats all neighbors with equal importance. However, GATs introduce a more nuanced approach by assigning different attention scores to different neighbors, enabling the network to learn which connections are more significant for the task at hand.

We will delve into the detailed process of calculating these attention scores in four steps:

- Linear transformation
- Computing self-attention scores
- Softmax normalization
- Multi-head attention

### Linear transformation

The input to our layer is a set of node features, $h = \{h_1, h_2, \dots, h_n\}$, $h_i \in \mathbb{R}^F$, where $n$ is the number of nodes and $F$ is the number of features in each node. The layer produces a new set of node features (of potentially different cardinality $F'$), $h' = \{h'_1, h'_2, \dots, h'_n\}$, $h'_i \in \mathbb{R}^{F'}$, as its output.

In order to obtain sufficient expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To that end, as an initial step, a shared linear transformation, parametrized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, is applied to every node.
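As a small illustration of the shared transformation, the sketch below stacks the node features as rows of a matrix and applies $W$ to all of them with one matrix product. The sizes and random values are placeholders chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, F_in, F_out = 5, 4, 3  # hypothetical sizes: n nodes, F input and F' output features

H = rng.normal(size=(n, F_in))      # row i holds the input features h_i
W = rng.normal(size=(F_out, F_in))  # shared weight matrix, W in R^{F' x F}

# Apply W to every node at once: row i of H_proj equals W @ h_i
H_proj = H @ W.T

print(H_proj.shape)  # one F'-dimensional vector per node
```

Because the same $W$ is reused for every node, the number of parameters does not depend on the size of the graph.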

<center><img src="images/GAT1.png" width=600></center>
<center><small>image from https://epichka.com/blog/2023/gat-paper-explained/</small></center>

### Computing self-attention scores

We then perform self-attention on the nodes: a shared attentional mechanism $f: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$ computes attention coefficients
\begin{equation*}
e_{ij} = f(Wh_i, Wh_j)
\end{equation*}
that indicate the importance of node $j$'s features to node $i$. We inject the graph structure into the mechanism by performing **masked attention**: we only compute $e_{ij}$ for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is some neighborhood of node $i$ in the graph.

The attention mechanism $f$ is a single-layer feedforward neural network, parametrized by a weight vector $a \in \mathbb{R}^{2F'}$, followed by the LeakyReLU nonlinearity. Writing $||$ for concatenation, the unnormalized score is
\begin{equation*}
e_{ij} = \text{LeakyReLU}\left(a^T [Wh_i||Wh_j]\right)
\end{equation*}
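The scoring step can be sketched in a few lines of NumPy. The neighborhoods, sizes, and random values below are hypothetical, self-loops are included as an assumption, and the LeakyReLU slope of 0.2 follows the GAT paper.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU with the negative slope used in the GAT paper."""
    return np.where(x > 0, x, negative_slope * x)

rng = np.random.default_rng(0)
n, F_out = 4, 3
Wh = rng.normal(size=(n, F_out))  # already-transformed features, row i is W h_i
a = rng.normal(size=2 * F_out)    # attention weight vector, a in R^{2F'}

# Hypothetical neighborhoods N_i (each node is also its own neighbor here)
neighbors = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2, 3], 3: [2, 3]}

# Masked attention: scores are computed only for j in N_i
e = {}
for i, N_i in neighbors.items():
    for j in N_i:
        e[(i, j)] = float(leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])))

for (i, j), score in e.items():
    print(f"e[{i},{j}] = {score:.3f}")
```

Since $a^T [Wh_i||Wh_j]$ splits into one dot product with $Wh_i$ and one with $Wh_j$, the two halves of $a$ can be applied to all nodes once and then combined per edge, which is what vectorized implementations typically do.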

### Softmax normalization

To make coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function:

\begin{equation}
\alpha_{ij}=\frac{\exp\left(\text{LeakyReLU}\left(a^T [Wh_i||Wh_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(a^T [Wh_i||Wh_k]\right)\right)}
\end{equation}
<center><img src="images/GAT2.png" width=800></center>
<center><small>image from https://epichka.com/blog/2023/gat-paper-explained/</small></center>
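The normalization for a single node can be sketched as follows. The raw scores are made-up values for one hypothetical node $i = 0$ with neighborhood $\{0, 1, 2\}$.

```python
import numpy as np

# Hypothetical raw scores e_0j for node 0 and its neighbors j in N_0
e_0 = {0: 0.5, 1: -1.0, 2: 2.0}

scores = np.array(list(e_0.values()))
# Subtracting the maximum leaves the softmax unchanged but avoids overflow in exp
exp_scores = np.exp(scores - scores.max())
alpha_0 = dict(zip(e_0.keys(), exp_scores / exp_scores.sum()))

for j, weight in alpha_0.items():
    print(f"alpha[0,{j}] = {weight:.3f}")
```

The resulting weights are positive and sum to 1 over the neighborhood, and the neighbor with the largest raw score keeps the largest weight.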


Once obtained, the normalized attention coefficients are used to compute a linear combination of the features corresponding to them, to serve as the final output features for every node:

<center><img src="images/GAT3.png" width=800></center>
<center><small>image from https://epichka.com/blog/2023/gat-paper-explained/</small></center>
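Written out, the aggregation is $h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$, where $\sigma$ is a nonlinearity. A minimal sketch under stated assumptions: the features and attention coefficients are hypothetical values, and $\sigma$ is taken to be ELU, the activation used in the model code later in this section.

```python
import numpy as np

def elu(x):
    """ELU activation: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

# Hypothetical transformed features, row j is W h_j
Wh = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [-2.0, -2.0]])

# Hypothetical normalized coefficients: row i holds alpha_ij, zero for non-neighbors
alpha = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.0, 0.4, 0.6]])

# h'_i = sigma(sum_j alpha_ij * W h_j), computed for all nodes at once
H_out = elu(alpha @ Wh)
print(H_out)
```

Each output row is a weighted average of neighbor features passed through the nonlinearity, so a neighbor with a larger $\alpha_{ij}$ contributes more to node $i$'s new representation.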

### Multi-head attention

To stabilize the learning process of self-attention, the mechanism is extended to multi-head attention. Specifically, $K$ independent attention mechanisms each compute their own attention coefficients as in Equation (1) and the corresponding aggregated features, which are then concatenated (or averaged), resulting in the following output feature representation:

<center><img src="images/multihead_attention.png" width=400></center>
<center><small>image from https://arxiv.org/pdf/1710.10903</small></center>

<center><img src="images/multihead.png" width=600></center>
<center><small>Illustration of the multi-head attention mechanism used in the GAT model. Here, the number of attention heads is set to K = 3. The aggregated features from each head are concatenated or averaged to obtain a higher-level representation for each node. The average operation is only used in the output layer. The self-connection is not considered.</small></center>
<center><small>image from https://link.springer.com/article/10.1007/s00138-021-01251-0</small></center>
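The effect of the two combination modes on the output shape can be seen in a short sketch. The per-head outputs below are random stand-ins rather than real attention outputs, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, F_out, K = 4, 3, 3  # hypothetical sizes: n nodes, F' features per head, K heads

# Stand-ins for the aggregated features of each head, each of shape (n, F')
head_outputs = [rng.normal(size=(n, F_out)) for _ in range(K)]

# Hidden layers: concatenate the heads, giving K * F' features per node
H_concat = np.concatenate(head_outputs, axis=1)

# Output layer: average the heads, keeping F' features per node
H_mean = np.mean(head_outputs, axis=0)

print(H_concat.shape, H_mean.shape)
```

Concatenation multiplies the feature dimension by $K$, which is why the second layer of the model defined later in this section takes `hidden_dim*heads` input features.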

In practice, dense implementations of GAT often take a slightly different route. Instead of looping over each node's neighborhood, they compute the attention coefficients $e_{ij}$ for **all possible pairs of nodes** with a single matrix operation and then keep only those that correspond to existing edges. This is simple to vectorize, but its memory cost grows quadratically with the number of nodes. Sparse implementations, such as the layers in PyTorch Geometric, instead take the edge list (`edge_index`) and compute scores only for existing edges.

In the dense approach, we first compute attention scores for all pairs of nodes as follows:
<center><img src="images/GAT_pytorch.png" width=600></center>
<center><small>image from https://epichka.com/blog/2023/gat-paper-explained/</small></center>

To ensure that attention is only applied to existing edges, we use the adjacency matrix of the graph. This matrix is used to mask the non-existent edges by assigning a large negative value (such as $-\infty$) to $e_{ij}$ for pairs of nodes that are not connected. This effectively zeroes out their corresponding attention weights after the softmax operation. This is illustrated in the following figure:

<center><img src="images/GAT_pytorch2.png" width=600></center>
<center><small>image from https://epichka.com/blog/2023/gat-paper-explained/</small></center>
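A minimal dense sketch of this masking step, assuming a small hypothetical adjacency matrix with self-loops and random placeholder features:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

rng = np.random.default_rng(0)
n, F_out = 4, 3
Wh = rng.normal(size=(n, F_out))  # transformed features, row i is W h_i
a = rng.normal(size=2 * F_out)    # attention weight vector

# Hypothetical adjacency matrix of a path graph, with self-loops on the diagonal
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])

# All-pairs scores via broadcasting: e[i, j] = LeakyReLU(a_src . Wh_i + a_dst . Wh_j)
src = Wh @ a[:F_out]
dst = Wh @ a[F_out:]
e = leaky_relu(src[:, None] + dst[None, :])

# Mask non-existent edges with -inf so they vanish after the softmax
e_masked = np.where(A > 0, e, -np.inf)

# Row-wise softmax (max subtracted for numerical stability)
e_masked = e_masked - e_masked.max(axis=1, keepdims=True)
alpha = np.exp(e_masked)
alpha = alpha / alpha.sum(axis=1, keepdims=True)

print(np.round(alpha, 3))
```

The self-loops guarantee that every row has at least one finite score, so the softmax never divides by zero. Entries for missing edges come out as exactly 0.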

## GAT in Python using the PyTorch Geometric library

The following code demonstrates how to implement a Graph Attention Network (GAT) using PyTorch Geometric with the Cora dataset. It involves defining a GAT model with two graph attention layers, setting up the dataset and data masks, and training the model over a specified number of epochs. The layers are `GATv2Conv`, PyTorch Geometric's implementation of the GATv2 variant, which computes its attention scores slightly differently from the formula above but is used in the same way. The training loop includes computing predictions, calculating the loss, performing backpropagation, and evaluating the model's accuracy on the training and test sets.

In [2]:
from torch_geometric.datasets import Planetoid
import torch_geometric
from torch_geometric.nn import GATv2Conv
import torch.nn as nn
import torch.optim as optim
import torch
import torch.nn.functional as F
from torch_geometric.transforms import NormalizeFeatures

# Load the Cora dataset
dataset = Planetoid(root=".", name="Cora",transform=NormalizeFeatures())
data = dataset[0]

# Extract training, validation, and test masks from the data
# These masks are boolean arrays indicating which nodes are used for training, validation, and testing
train_mask = data.train_mask
val_mask = data.val_mask
test_mask = data.test_mask

# Define the GAT (Graph Attention Network) model
class GAT(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, heads=8):
        super().__init__()
        # First graph attention layer: `heads` attention heads whose outputs are concatenated
        self.gat1 = GATv2Conv(in_dim, hidden_dim, heads=heads, dropout=0.6)
        # Second graph attention layer: a single head that outputs the class scores
        self.gat2 = GATv2Conv(hidden_dim*heads, out_dim, heads=1, concat=False, dropout=0.6)
    
    def forward(self, x, edge_index):
        h = F.dropout(x, p=0.6, training=self.training)
        # Apply the first GAT layer and the ELU activation
        h = self.gat1(h, edge_index)
        h = F.elu(h)
        h = F.dropout(h, p=0.6, training=self.training)
        # Apply the second GAT layer
        h = self.gat2(h, edge_index)
        return h 

# Instantiate the GAT model
gat = GAT(dataset.num_features, 32, dataset.num_classes)

# Define the optimizer (Adam) and the loss function (Cross Entropy Loss)
optimizer = optim.Adam(gat.parameters(), lr=0.005, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# Set the number of training epochs
n_epochs = 200

# Define a function to calculate accuracy
def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)

# Training loop
for epoch in range(n_epochs):
    gat.train()
    # Forward pass: Compute predictions
    prediction = gat(data.x, data.edge_index)
    
    # Compute the loss on the training nodes only
    loss = loss_fn(prediction[train_mask, :], data.y[train_mask])
    # Zero the gradients, perform backpropagation, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    

    # Print progress every 10 epochs
    if epoch % 10 == 0:
        # Evaluate without tracking gradients; eval() disables dropout
        with torch.no_grad():
            gat.eval()
            prediction = gat(data.x, data.edge_index)
            # Calculate accuracy on the training and test nodes
            train_acc = accuracy(torch.argmax(prediction[train_mask, :], dim=1), data.y[train_mask])
            test_acc = accuracy(torch.argmax(prediction[test_mask, :], dim=1), data.y[test_mask])
            print(f'Epoch: {epoch}, Loss: {loss.item():.4f}, Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}')


Epoch: 0, Loss: 1.9455, Train Accuracy: 0.2857, Test Accuracy: 0.1800
Epoch: 10, Loss: 1.8028, Train Accuracy: 0.7929, Test Accuracy: 0.5110
Epoch: 20, Loss: 1.6520, Train Accuracy: 0.9357, Test Accuracy: 0.8050
Epoch: 30, Loss: 1.4082, Train Accuracy: 0.9214, Test Accuracy: 0.7420
Epoch: 40, Loss: 1.2132, Train Accuracy: 0.9571, Test Accuracy: 0.8070
Epoch: 50, Loss: 1.0593, Train Accuracy: 0.9571, Test Accuracy: 0.8180
Epoch: 60, Loss: 0.9141, Train Accuracy: 0.9643, Test Accuracy: 0.7970
Epoch: 70, Loss: 0.9031, Train Accuracy: 0.9786, Test Accuracy: 0.8230
Epoch: 80, Loss: 0.8355, Train Accuracy: 0.9857, Test Accuracy: 0.8090
Epoch: 90, Loss: 0.7935, Train Accuracy: 0.9857, Test Accuracy: 0.7900
Epoch: 100, Loss: 0.6930, Train Accuracy: 0.9857, Test Accuracy: 0.8190
Epoch: 110, Loss: 0.7375, Train Accuracy: 0.9929, Test Accuracy: 0.8210
Epoch: 120, Loss: 0.7936, Train Accuracy: 0.9786, Test Accuracy: 0.7940
Epoch: 130, Loss: 0.6390, Train Accuracy: 0.9929, Test Accuracy: 0.8020
Epo