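# Classifying Scientific Publications with GCN
This experiment is based on MindSpore 2.0 and runs on the OpenI (启智) NPU platform. It uses downloaded copies of the Cora and Citeseer datasets.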

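### 1. Introduction
The Graph Convolutional Network (GCN) is a neural network architecture that has become increasingly popular in recent years. Unlike traditional models such as LSTM and CNN, which only handle regularly structured (grid-based) data, a GCN can process data with a general graph topology and mine its features and patterns.
This experiment shows how to train a GCN with MindSpore on the downloaded Cora and Citeseer datasets.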

### 2. Environment
This experiment is based on MindSpore 2.0 and runs on the OpenI (启智) NPU platform.

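### 3. Overview of graph convolutional networks
The core purpose of a GCN is to extract spatial features from a graph. Graph convolutional networks fall into two families: those defined in the spatial domain (also called the vertex domain) and those defined in the frequency or spectral domain. GCN belongs to the spectral family.  
Spatial-domain methods define the convolution directly on each node's connectivity, so they resemble the convolution in conventional CNNs more closely. Representative methods in this family include Message Passing Neural Networks (MPNN), GraphSage, Diffusion Convolution Neural Networks (DCNN), and PATCHY-SAN.  
Spectral-domain methods use spectral graph theory to define convolution on a graph. Historically, researchers in graph signal processing (GSP) first defined the Fourier transform on graphs, then used it to define graph convolution, and this was finally combined with deep learning to give the Graph Convolutional Network (GCN).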
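
To make the spectral formulation concrete: the GCN layer introduced by Kipf and Welling propagates node features as `H_next = act(A_hat @ H @ W)`, where `A_hat = D^(-1/2) (A + I) D^(-1/2)` is the adjacency matrix with self-loops, symmetrically normalized by node degree. The NumPy sketch below computes `A_hat` for a toy graph. The function name and the three-node graph are illustrative assumptions and are not part of the experiment code.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix."""
    adj_tilde = adj + np.eye(adj.shape[0])   # add self-loops
    degree = adj_tilde.sum(axis=1)           # degree including the self-loop, so always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Scale row i by d_i^(-1/2) and column j by d_j^(-1/2).
    return adj_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy path graph 0 - 1 - 2 (illustrative only).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
adj_norm = normalize_adjacency(adj)
print(adj_norm)
```

The normalization keeps the scale of the aggregated features roughly independent of how many neighbors a node has, which is why it is applied before stacking layers.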

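### 4. Graph neural network datasets
Cora and CiteSeer are commonly used datasets for graph neural networks. Both are available from the LINQS Datasets website.  
The Cora dataset contains 2708 scientific publications divided into seven classes. The citation network consists of 5429 links. Each publication is described by a 0/1-valued word vector whose entries indicate whether the corresponding dictionary word is absent or present. The dictionary contains 1433 unique words. The README file in the dataset provides more details.  
The CiteSeer dataset contains 3312 scientific publications divided into six classes. The citation network consists of 4732 links. Each publication is described by a 0/1-valued word vector whose entries indicate whether the corresponding dictionary word is absent or present. The dictionary contains 3703 unique words. The README file in the dataset provides more details.  
This experiment uses the preprocessed and pre-split version of the datasets from kimiyoung/planetoid on GitHub.  
Place the dataset in the data directory, which should contain the following files:  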
data   
├── ind.cora.allx   
├── ind.cora.ally   
├── ...  
├── ind.cora.test.index   
├── trans.citeseer.tx  
├── trans.citeseer.ty  
├── ...  
└── trans.pubmed.y  
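The model inputs are:  
x: the feature vectors of the labeled training instances,  
y: the one-hot labels of the labeled training instances,  
allx: the feature vectors of both labeled and unlabeled training instances (a superset of x),  
graph: a dict of the form {index: [index_of_neighbor_nodes]}. Let n be the number of labeled and unlabeled training instances. In graph, these n instances are indexed from 0 to n-1, in the same order as in allx.  
Besides x, y, allx, and graph described above, the preprocessed dataset also includes:  
1)	tx, the feature vectors of the test instances,  
2)	ty, the one-hot labels of the test instances,  
3)	test.index, the indices of the test instances in graph,  
4)	ally, the labels of the instances in allx.  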
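
The `graph` entry is an adjacency list, whereas a GCN layer works on an adjacency matrix. The sketch below shows one way such a dict could be turned into a dense symmetric matrix. The helper name `graph_dict_to_adjacency` and the four-node toy graph are assumptions made for illustration; the experiment itself relies on the loaders shipped in `gcn/src`.

```python
import numpy as np

def graph_dict_to_adjacency(graph):
    """Convert {index: [index_of_neighbor_nodes]} into a dense symmetric 0/1 matrix."""
    n = len(graph)
    adj = np.zeros((n, n), dtype=np.float32)
    for node, neighbors in graph.items():
        for neighbor in neighbors:
            # Citation links are treated as undirected, so set both directions.
            adj[node, neighbor] = 1.0
            adj[neighbor, node] = 1.0
    return adj

# Toy graph in the same format as the dataset's `graph` dict (illustrative only).
toy_graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
toy_adj = graph_dict_to_adjacency(toy_graph)
print(toy_adj)
```

A dense matrix is fine at the scale of Cora (2708 nodes); larger graphs would normally call for a sparse representation.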

### 5. Procedure
Step 1: Download the resources and import dependencies

In [2]:

from download import download

url = "https://ascend-professional-construction-dataset.obs.cn-north-4.myhuaweicloud.com:443/NLP/gcn.zip"

download(url, "./", kind="zip", replace=True)

Downloading data from https://ascend-professional-construction-dataset.obs.cn-north-4.myhuaweicloud.com:443/NLP/gcn.zip (12.0 MB)

file_sizes: 100%|██████████████████████████| 12.5M/12.5M [00:00<00:00, 28.1MB/s]
Extracting zip file...
Successfully downloaded / unzipped to ./


'./'

In [3]:
import os
# os.environ['DEVICE_ID']='7'

import time
import argparse
import numpy as np

from mindspore import nn
from mindspore import Tensor
from mindspore import context
from mindspore.ops import operations as P
#from mindspore.nn.layer.activation import get_activation
from mindspore.nn import get_activation
from easydict import EasyDict as edict

from gcn.src.gcn import glorot, LossAccuracyWrapper, TrainNetWrapper
from gcn.src.dataset import get_adj_features_labels, get_mask
from gcn.graph_to_mindrecord.writer import run

Step 2: Configure the runtime environment

In [4]:
context.set_context(mode=context.GRAPH_MODE,device_target="Ascend", save_graphs=False)

Step 3: Define the parameter configuration

In [5]:
dataname = 'cora'
datadir_save = './gcn/data_mr'
datadir = os.path.join(datadir_save, dataname)
cfg = edict({
    'SRC_PATH': './gcn/data',
    'MINDRECORD_PATH': datadir_save,
    'DATASET_NAME': dataname,  # citeseer,cora
    'mindrecord_partitions':1,
    'mindrecord_header_size_by_bit' : 18,
    'mindrecord_page_size_by_bit' : 20,

    'data_dir': datadir,
    'seed' : 123,
    'train_nodes_num':140,
    'eval_nodes_num':500,
    'test_nodes_num':1000
})
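
The three `*_nodes_num` values control the node split. The `train_eval` function defined in Step 7 passes them to `get_mask` as index ranges: the first 140 nodes for training, the next 500 for validation, and the last 1000 for testing. The sketch below shows how masks of that kind could be built. `make_mask` is a hypothetical stand-in written for this example; the actual `get_mask` in `gcn/src/dataset.py` may be implemented differently.

```python
import numpy as np

def make_mask(total, begin, end):
    """Return a float vector of length `total` with ones in [begin, end)."""
    mask = np.zeros(total, dtype=np.float32)
    mask[begin:end] = 1.0
    return mask

nodes_num = 2708  # number of nodes in Cora
train_mask = make_mask(nodes_num, 0, 140)
eval_mask = make_mask(nodes_num, 140, 140 + 500)
test_mask = make_mask(nodes_num, nodes_num - 1000, nodes_num)

print(int(train_mask.sum()), int(eval_mask.sum()), int(test_mask.sum()))
```

Because the whole graph is fed through the network at once, the masks are what restrict the loss and accuracy to the relevant subset of nodes.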


Step 4: Convert the data to MindRecord format

In [6]:
# Convert the data format
print("============== Graph To Mindrecord ==============")
run(cfg)


Init writer  ...
exec task 0, parallel: False ...
Node task is 0
transformed 512 record...
transformed 1024 record...
transformed 1536 record...
transformed 2048 record...
transformed 2560 record...
Processed 2708 lines for nodes.
transformed 2708 record...
exec task 0, parallel: False ...
Edge task is 0
transformed 512 record...
transformed 1024 record...
transformed 1536 record...
transformed 2048 record...
transformed 2560 record...
transformed 3072 record...
transformed 3584 record...
transformed 4096 record...
transformed 4608 record...
transformed 5120 record...
transformed 5632 record...
transformed 6144 record...
transformed 6656 record...
transformed 7168 record...
transformed 7680 record...
transformed 8192 record...
transformed 8704 record...
transformed 9216 record...
transformed 9728 record...
transformed 10240 record...
transformed 10752 record...
Processed 10858 lines for edges.
transformed 10858 record...
--------------------------------------------
END. Total time: 8.2

Step 5: Define the GCN hyperparameters

In [7]:
class ConfigGCN():
    learning_rate = 0.01
    epochs = 200
    hidden1 = 16
    dropout = 0.5
    weight_decay = 5e-4
    early_stopping = 10


Step 6: Define the GCN network structure

In [8]:
class GraphConvolution(nn.Cell):
    """
    GCN graph convolution layer.

    Args:
        feature_in_dim (int): The input feature dimension.
        feature_out_dim (int): The output feature dimension.
        dropout_ratio (float): Dropout ratio for the dropout layer. Default: None.
        activation (str): Activation function applied to the output of the layer, eg. 'relu'. Default: None.

    Inputs:
        - **adj** (Tensor) - Tensor of shape :math:`(N, N)`.
        - **input_feature** (Tensor) - Tensor of shape :math:`(N, C)`.

    Outputs:
        Tensor, output tensor.
    """

    def __init__(self,
                 feature_in_dim,
                 feature_out_dim,
                 dropout_ratio=None,
                 activation=None):
        super(GraphConvolution, self).__init__()
        self.in_dim = feature_in_dim
        self.out_dim = feature_out_dim
        self.weight_init = glorot([self.out_dim, self.in_dim])
        self.fc = nn.Dense(self.in_dim,
                           self.out_dim,
                           weight_init=self.weight_init,
                           has_bias=False)
        self.dropout_ratio = dropout_ratio
        if self.dropout_ratio is not None:
            # `p` is the probability of dropping an element; it replaces the deprecated `keep_prob`.
            self.dropout = nn.Dropout(p=self.dropout_ratio)
        self.dropout_flag = self.dropout_ratio is not None
        self.activation = get_activation(activation)
        self.activation_flag = self.activation is not None
        self.matmul = P.MatMul()

    def construct(self, adj, input_feature):
        dropout = input_feature
        if self.dropout_flag:
            dropout = self.dropout(dropout)

        fc = self.fc(dropout)
        output_feature = self.matmul(adj, fc)

        if self.activation_flag:
            output_feature = self.activation(output_feature)
        return output_feature


class GCN(nn.Cell):
    """
    GCN architecture.

    Args:
        config (ConfigGCN): Configuration for GCN.
        adj (numpy.ndarray): Adjacency matrix of the graph, of shape (N, N).
        feature (numpy.ndarray): Node feature matrix, of shape (N, C).
        output_dim (int): The number of output channels, equal to classes num.
    """

    def __init__(self, config, adj, feature, output_dim):
        super(GCN, self).__init__()
        self.adj = Tensor(adj)
        self.feature = Tensor(feature)
        input_dim = feature.shape[1]
        self.layer0 = GraphConvolution(input_dim, config.hidden1, activation="relu", dropout_ratio=config.dropout)
        self.layer1 = GraphConvolution(config.hidden1, output_dim, dropout_ratio=None)

    def construct(self):
        output0 = self.layer0(self.adj, self.feature)
        output1 = self.layer1(self.adj, output0)
        return output1
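
The shapes flowing through the two layers can be checked with a framework-free sketch. The NumPy code below mirrors the order of operations in `GraphConvolution.construct` (dense projection, then multiplication by the adjacency matrix, then optional ReLU) with dropout left out, as at inference time. The sizes, the random weights, and the identity adjacency matrix are placeholders chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, in_dim, hidden_dim, num_classes = 5, 8, 16, 7

adj = np.eye(num_nodes)                       # placeholder adjacency matrix, shape (N, N)
features = rng.random((num_nodes, in_dim))    # node features, shape (N, C)
# Weights are stored as (out_dim, in_dim), the layout nn.Dense uses.
w0 = rng.standard_normal((hidden_dim, in_dim))
w1 = rng.standard_normal((num_classes, hidden_dim))

def graph_conv(adj, x, weight, relu=False):
    """One graph convolution: project the features, then aggregate over neighbors."""
    out = adj @ (x @ weight.T)
    return np.maximum(out, 0.0) if relu else out

hidden = graph_conv(adj, features, w0, relu=True)   # shape (N, hidden_dim)
logits = graph_conv(adj, hidden, w1)                # shape (N, num_classes)
print(hidden.shape, logits.shape)
```

Projecting before aggregating keeps the intermediate matrix at `(N, out_dim)`, which is cheaper than aggregating first when the input features are wide, as with Cora's 1433-dimensional word vectors.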


Step 7: Define the training and evaluation function

In [9]:
def train_eval(args_opt):
    """Train model."""
    np.random.seed(args_opt.seed)
    config = ConfigGCN()
    adj, feature, label = get_adj_features_labels(args_opt.data_dir)

    nodes_num = label.shape[0]
    train_mask = get_mask(nodes_num, 0, args_opt.train_nodes_num)
    eval_mask = get_mask(nodes_num, args_opt.train_nodes_num, args_opt.train_nodes_num + args_opt.eval_nodes_num)
    test_mask = get_mask(nodes_num, nodes_num - args_opt.test_nodes_num, nodes_num)

    class_num = label.shape[1]
    gcn_net = GCN(config, adj, feature, class_num)
    gcn_net.add_flags_recursive(fp16=True)

    eval_net = LossAccuracyWrapper(gcn_net, label, eval_mask, config.weight_decay)
    test_net = LossAccuracyWrapper(gcn_net, label, test_mask, config.weight_decay)
    train_net = TrainNetWrapper(gcn_net, label, train_mask, config)

    loss_list = []
    for epoch in range(config.epochs):
        t = time.time()

        train_net.set_train()
        train_result = train_net()
        train_loss = train_result[0].asnumpy()
        train_accuracy = train_result[1].asnumpy()

        eval_net.set_train(False)
        eval_result = eval_net()
        eval_loss = eval_result[0].asnumpy()
        eval_accuracy = eval_result[1].asnumpy()

        loss_list.append(eval_loss)
        if epoch%10==0:
            print("Epoch:", '%04d' % (epoch), "train_loss=", "{:.5f}".format(train_loss),
                "train_acc=", "{:.5f}".format(train_accuracy), "val_loss=", "{:.5f}".format(eval_loss),
                "val_acc=", "{:.5f}".format(eval_accuracy), "time=", "{:.5f}".format(time.time() - t))

        if epoch > config.early_stopping and loss_list[-1] > np.mean(loss_list[-(config.early_stopping+1):-1]):
            print("Early stopping...")
            break

    t_test = time.time()
    test_net.set_train(False)
    test_result = test_net()
    test_loss = test_result[0].asnumpy()
    test_accuracy = test_result[1].asnumpy()
    print("Test set results:", "loss=", "{:.5f}".format(test_loss),
          "accuracy=", "{:.5f}".format(test_accuracy), "time=", "{:.5f}".format(time.time() - t_test))
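
The early-stopping test in the loop above compares the latest validation loss with the mean of the preceding `early_stopping` validation losses, and it only takes effect once more than `early_stopping` epochs have run. The sketch below isolates that rule as a standalone function so it can be tried on a synthetic loss curve. The function name `should_stop` and the sample losses are assumptions made for illustration.

```python
import numpy as np

def should_stop(loss_list, epoch, patience=10):
    """Return True when the newest loss exceeds the mean of the previous `patience` losses."""
    if epoch <= patience:
        return False
    recent_mean = np.mean(loss_list[-(patience + 1):-1])
    return bool(loss_list[-1] > recent_mean)

# Synthetic validation losses: 12 steadily decreasing values (epochs 0 to 11).
losses = [1.0 - 0.01 * i for i in range(12)]
print(should_stop(losses, epoch=11))          # loss is still falling

# A sudden jump at epoch 12 pushes the newest loss above the recent mean.
losses_with_jump = losses + [1.0]
print(should_stop(losses_with_jump, epoch=12))
```

Averaging over a window makes the rule tolerant of small fluctuations, but a single noisy epoch can still trigger a stop once the validation loss has flattened out.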


Step 8: Start training and evaluation

In [10]:
# Train and evaluate
print("============== Starting Training ==============")
train_eval(cfg)


Epoch: 0000 train_loss= 1.95375 train_acc= 0.62857 val_loss= 1.94876 val_acc= 0.35800 time= 22.02676
Epoch: 0010 train_loss= 1.86080 train_acc= 0.82857 val_loss= 1.90495 val_acc= 0.50000 time= 0.00418
Epoch: 0020 train_loss= 1.75320 train_acc= 0.85000 val_loss= 1.86154 val_acc= 0.52400 time= 0.00409
Epoch: 0030 train_loss= 1.60470 train_acc= 0.87857 val_loss= 1.80616 val_acc= 0.55800 time= 0.00405
Epoch: 0040 train_loss= 1.46578 train_acc= 0.92857 val_loss= 1.74064 val_acc= 0.59000 time= 0.00400
Epoch: 0050 train_loss= 1.30271 train_acc= 0.95000 val_loss= 1.66939 val_acc= 0.66400 time= 0.00413
Epoch: 0060 train_loss= 1.18308 train_acc= 0.95714 val_loss= 1.59799 val_acc= 0.72000 time= 0.00401
Epoch: 0070 train_loss= 1.06383 train_acc= 0.97857 val_loss= 1.52205 val_acc= 0.76200 time= 0.00465
Epoch: 0080 train_loss= 0.95960 train_acc= 0.97143 val_loss= 1.45305 val_acc= 0.77200 time= 0.00416
Epoch: 0090 train_loss= 0.86225 train_acc= 0.97143 val_loss= 1.39328 val_acc= 0.77400 time= 0.00412