In [1]:
'''
    @Author: King
    @Date: 2019.06.25
    @Purpose: Graph Attention Network
    @Introduction:   This is a gentle introduction to using DGL to implement 
                    Graph Attention Networks (Veličković et al., 
                    Graph Attention Networks). 
                    We build upon the earlier tutorial on Graph Convolutional Networks and demonstrate how GAT 
                    replaces the statically normalized aggregation of GCN with attention over neighbor features.
    @Datasets: 
    @Link : 
    @Reference : https://docs.dgl.ai/tutorials/models/1_gnn/1_gcn.html
'''


# Understand Graph Attention Network

From the Graph Convolutional Network (GCN), we learned that combining local graph structure with node-level features yields good performance on the node classification task. However, the way GCN aggregates neighbor features depends on the graph structure, which may hurt its generalizability.


One workaround is to simply average over all neighbor node features, as in GraphSAGE. Graph Attention Network (GAT) proposes an alternative: weighting neighbor features with a feature-dependent, structure-free normalization, in the style of attention.


The goals of this tutorial:

- Explain what a Graph Attention Network is.
- Demonstrate how it can be implemented in DGL.
- Understand the learned attention weights.
- Introduce inductive learning.



## Introducing Attention to GCN


The key difference between GAT and GCN is how the information from the one-hop neighborhood is aggregated.

For GCN, a graph convolution operation produces the normalized sum of the node features of neighbors:

$$
h_{i}^{(l+1)}=\sigma\left(\sum_{j \in \mathcal{N}(i)} \frac{1}{c_{i j}} W^{(l)} h_{j}^{(l)}\right)
$$


GAT introduces the attention mechanism as a substitute for the statically normalized convolution operation. Below are the equations to compute the node embedding $h_{i}^{(l+1)}$ of layer $l+1$ from the embeddings of layer $l$:

![](img/GAT_1.png)


![](img/GAT_2.png)
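
For readers who cannot view the images, the four steps are usually written as follows. This is transcribed from the standard GAT formulation and numbered to match the `equation (1)` to `equation (4)` comments in the code, not copied from the images themselves, so treat the exact notation as an assumption:

$$
\begin{aligned}
z_{i}^{(l)} &= W^{(l)} h_{i}^{(l)} && (1) \\
e_{i j}^{(l)} &= \text{LeakyReLU}\left(\vec{a}^{(l)^{T}}\left(z_{i}^{(l)} \| z_{j}^{(l)}\right)\right) && (2) \\
\alpha_{i j}^{(l)} &= \frac{\exp \left(e_{i j}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp \left(e_{i k}^{(l)}\right)} && (3) \\
h_{i}^{(l+1)} &= \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{i j}^{(l)} z_{j}^{(l)}\right) && (4)
\end{aligned}
$$

Equation (1) is a shared linear transformation, equation (2) computes an un-normalized attention score for each edge from the concatenated embeddings of its two endpoints, equation (3) normalizes those scores with a softmax over the incoming edges of node $i$, and equation (4) aggregates the neighbor embeddings weighted by the normalized scores.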

The paper includes further details, such as dropout and skip connections. For simplicity, we omit them in this tutorial and leave a link to the full example at the end for interested readers.

In essence, GAT is just a different aggregation function: it applies attention over the features of neighbors instead of a simple mean aggregation.
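
To make the contrast concrete, here is a minimal pure-Python sketch of the two aggregations for a single node. It is an illustration only, not the DGL implementation: the helper names, the toy embeddings, and the attention vector `a` are all hypothetical, and the LeakyReLU slope of 0.01 is assumed to match the default of `F.leaky_relu` in PyTorch.

```python
import math


def leaky_relu(x, negative_slope=0.01):
    # negative_slope=0.01 is assumed to mirror PyTorch's F.leaky_relu default
    return x if x > 0 else negative_slope * x


def softmax(scores):
    # subtract the max for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(z_i, z_neighbors, a):
    # score each neighbor from the concatenation [z_i || z_j], then LeakyReLU
    scores = [
        leaky_relu(sum(w * x for w, x in zip(a, z_i + z_j)))
        for z_j in z_neighbors
    ]
    # normalize the scores over the neighborhood
    alpha = softmax(scores)
    # attention-weighted sum of the neighbor embeddings
    h = [
        sum(al * z_j[d] for al, z_j in zip(alpha, z_neighbors))
        for d in range(len(z_i))
    ]
    return alpha, h


def mean_aggregate(z_neighbors):
    # structure-free baseline: every neighbor gets the same weight
    n = len(z_neighbors)
    return [sum(z[d] for z in z_neighbors) / n for d in range(len(z_neighbors[0]))]


# Toy data (hypothetical): one node with three neighbors in a 2-d embedding space.
z_i = [1.0, 0.0]
z_neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a = [0.0, 0.0, 2.0, -2.0]  # hypothetical attention vector of length 2 * dim

alpha, h_att = attention_aggregate(z_i, z_neighbors, a)
h_mean = mean_aggregate(z_neighbors)
print("attention weights:", alpha)
print("attention output: ", h_att)
print("mean output:      ", h_mean)
```

With this particular `a`, the first neighbor receives the largest weight, so the attention output leans toward its embedding, whereas the mean treats all three neighbors equally.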


## GAT in DGL

Let’s first get an overall impression of how a GATLayer module is implemented in DGL. Don’t worry; we will break down the four equations above one by one.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1)
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2)
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)

    def edge_attention(self, edges):
        # edge UDF for equation (2)
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
        a = self.attn_fc(z2)
        return {'e': F.leaky_relu(a)}

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['z'], 'e': edges.data['e']}

    def reduce_func(self, nodes):
        # reduce UDF for equation (3) & (4)
        # equation (3)
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equation (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')

## Multi-head Attention

Analogous to multiple channels in a ConvNet, GAT introduces multi-head attention to enrich the model capacity and to stabilize the learning process. Each attention head has its own parameters, and their outputs can be merged in two ways:

![](img/GAT_3.png)
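
The two merge strategies differ only in the shape of the result. The following pure-Python sketch (the `merge_heads` helper and the toy values are hypothetical, with nested lists standing in for tensors) shows that concatenation grows the feature dimension to `num_heads * out_dim`, while averaging is taken over the heads only, so the output keeps the shape `(num_nodes, out_dim)`.

```python
def merge_heads(head_outs, merge="cat"):
    """Merge K per-head outputs, each a list of N node vectors of equal length."""
    num_nodes = len(head_outs[0])
    if merge == "cat":
        # concatenate each node's K vectors along the feature dimension
        return [
            [x for head in head_outs for x in head[n]]
            for n in range(num_nodes)
        ]
    # average element-wise over the K heads; the feature dimension is unchanged
    k = len(head_outs)
    out_dim = len(head_outs[0][0])
    return [
        [sum(head[n][d] for head in head_outs) / k for d in range(out_dim)]
        for n in range(num_nodes)
    ]


# Two heads, two nodes, out_dim = 2 (toy values).
head_a = [[1.0, 2.0], [3.0, 4.0]]
head_b = [[5.0, 6.0], [7.0, 8.0]]

cat = merge_heads([head_a, head_b], "cat")
avg = merge_heads([head_a, head_b], "mean")
print(cat)  # [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
print(avg)  # [[3.0, 4.0], [5.0, 6.0]]
```

In the GAT paper, concatenation is used for intermediate layers and averaging for the final layer.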

In [3]:
class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        if self.merge == 'cat':
            # concat on the output feature dimension (dim=1)
            return torch.cat(head_outs, dim=1)
        else:
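            # merge using average over the heads (dim=0 of the stacked tensor),
            # so the result keeps the shape (num_nodes, out_dim)
            return torch.mean(torch.stack(head_outs), dim=0)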

## Put everything together

Now, we can define a two-layer GAT model:

In [4]:
class GAT(nn.Module):
    def __init__(self, g, in_dim, hidden_dim, out_dim, num_heads):
        super(GAT, self).__init__()
        self.layer1 = MultiHeadGATLayer(g, in_dim, hidden_dim, num_heads)
        # Be aware that the input dimension is hidden_dim*num_heads since
        # multiple head outputs are concatenated together. Also, only
        # one attention head in the output layer.
        self.layer2 = MultiHeadGATLayer(g, hidden_dim * num_heads, out_dim, 1)

    def forward(self, h):
        h = self.layer1(h)
        h = F.elu(h)
        h = self.layer2(h)
        return h
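
The dimension bookkeeping in the comment above can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, the parameter counts follow from the two bias-free `nn.Linear` modules in `GATLayer`, and the input size of 1433 is the commonly reported number of features per node in Cora, which is not stated in this tutorial.

```python
def gat_layer_params(in_dim, out_dim):
    # fc has in_dim * out_dim weights; attn_fc has 2 * out_dim weights; neither has a bias
    return in_dim * out_dim + 2 * out_dim


def two_layer_gat_summary(in_dim, hidden_dim, out_dim, num_heads):
    # layer 1 concatenates its heads, so layer 2 sees hidden_dim * num_heads features
    layer2_in = hidden_dim * num_heads
    params = (num_heads * gat_layer_params(in_dim, hidden_dim)
              + gat_layer_params(layer2_in, out_dim))  # single output head
    return {"layer2_in": layer2_in, "out_dim": out_dim, "params": params}


# in_dim=1433 is an assumption (Cora's feature size); the other values match
# the hyperparameters used later in this tutorial.
summary = two_layer_gat_summary(in_dim=1433, hidden_dim=8, out_dim=7, num_heads=2)
print(summary)  # {'layer2_in': 16, 'out_dim': 7, 'params': 23086}
```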

In [5]:
# We then load the cora dataset using DGL’s built-in data module.

from dgl import DGLGraph
from dgl.data import citation_graph as citegrh

def load_cora_data():
    data = citegrh.load_cora()
    features = torch.FloatTensor(data.features)
    labels = torch.LongTensor(data.labels)
    mask = torch.ByteTensor(data.train_mask)
    g = data.graph
    # add self loop
    g.remove_edges_from(g.selfloop_edges())
    g = DGLGraph(g)
    g.add_edges(g.nodes(), g.nodes())
    return g, features, labels, mask

In [6]:
# The training loop is exactly the same as in the GCN tutorial.

import time
import numpy as np

g, features, labels, mask = load_cora_data()

# create the model, 2 heads, each head has hidden size 8
net = GAT(g,
          in_dim=features.size()[1],
          hidden_dim=8,
          out_dim=7,
          num_heads=2)

# create optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# main loop
dur = []
for epoch in range(30):
    if epoch >= 3:
        t0 = time.time()

    logits = net(features)
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[mask], labels[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch >= 3:
        dur.append(time.time() - t0)

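    # np.mean of an empty list emits a RuntimeWarning, so report nan explicitly
    # until the first timed epoch has finished
    print("Epoch {:05d} | Loss {:.4f} | Time(s) {:.4f}".format(
        epoch, loss.item(), np.mean(dur) if dur else float('nan')))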



Epoch 00000 | Loss 1.9459 | Time(s) nan
Epoch 00001 | Loss 1.9451 | Time(s) nan
Epoch 00002 | Loss 1.9443 | Time(s) nan
Epoch 00003 | Loss 1.9434 | Time(s) 0.2730
Epoch 00004 | Loss 1.9426 | Time(s) 0.2697
Epoch 00005 | Loss 1.9418 | Time(s) 0.2794
Epoch 00006 | Loss 1.9410 | Time(s) 0.2815
Epoch 00007 | Loss 1.9401 | Time(s) 0.2837
Epoch 00008 | Loss 1.9392 | Time(s) 0.2826
Epoch 00009 | Loss 1.9384 | Time(s) 0.2817
Epoch 00010 | Loss 1.9375 | Time(s) 0.2800
Epoch 00011 | Loss 1.9366 | Time(s) 0.2789
Epoch 00012 | Loss 1.9356 | Time(s) 0.2777
Epoch 00013 | Loss 1.9347 | Time(s) 0.2774
Epoch 00014 | Loss 1.9338 | Time(s) 0.2772
Epoch 00015 | Loss 1.9328 | Time(s) 0.2767
Epoch 00016 | Loss 1.9318 | Time(s) 0.2760
Epoch 00017 | Loss 1.9308 | Time(s) 0.2759
Epoch 00018 | Loss 1.9298 | Time(s) 0.2753
Epoch 00019 | Loss 1.9288 | Time(s) 0.2752
Epoch 00020 | Loss 1.9278 | Time(s) 0.2753
Epoch 00021 | Loss 1.9267 | Time(s) 0.2747
Epoch 00022 | Loss 1.9256 | Time(s) 0.2743
Epoch 00023 | Loss 1