## Note

This solution is built entirely on top of the official baseline.

Running the Res_Unimp_Large code requires a **GPU with 32 GB of memory**.

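## 1. Key references
### (1) The UniMP algorithm
[UniMP on GitHub](https://github.com/PaddlePaddle/PGL/tree/main/ogb_examples/nodeproppred/unimp)

**UniMP is the model that set a new state of the art on the ogbn-arxiv dataset, so it is well worth bookmarking.**

Reference papers: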
* Predict then Propagate: Graph Neural Networks meet Personalized PageRank (https://arxiv.org/abs/1810.05997)
* Simple and Deep Graph Convolutional Networks (https://arxiv.org/abs/2007.02133)
* Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification (https://arxiv.org/abs/2009.03509)
* Combining Label Propagation and Simple Models Out-performs Graph Neural Networks (https://arxiv.org/abs/2010.13993)

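### (2) Companion course on graph neural networks
[Course link](https://aistudio.baidu.com/aistudio/education/group/info/1956)

The instructor explains everything in detail and the material is easy to follow; highly recommended.

**In lesson 6 the instructor walks through this competition and provides a GAT variant with residual (Res) connections:**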

[ResGAT](https://aistudio.baidu.com/aistudio/projectdetail/1280598)

### (3) Reference code
[https://aistudio.baidu.com/aistudio/projectdetail/1467127?channelType=0&channel=0](https://aistudio.baidu.com/aistudio/projectdetail/1467127?channelType=0&channel=0)



## 2. Code implementation

### (1) Model construction (model.py)
The Res_Unimp_Large code is shown below; see also model_modified.py.

```
class res_unimp_large(object):
    def __init__(self, config, num_class):
        self.num_class = num_class
        self.num_layers = config.get("num_layers", 2)
        self.hidden_size = config.get("hidden_size", 128)
        self.out_size=config.get("out_size", 40)
        self.embed_size=config.get("embed_size", 100)
        self.heads = config.get("heads", 8) 
        self.dropout = config.get("dropout", 0.3)
        self.edge_dropout = config.get("edge_dropout", 0.0)
        self.use_label_e = config.get("use_label_e", False)
    
    # Encode the input features
    def embed_input(self, feature):   
        lay_norm_attr=F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=1))
        lay_norm_bias=F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=0))
        feature=L.layer_norm(feature, name='layer_norm_feature_input', 
                                      param_attr=lay_norm_attr, 
                                      bias_attr=lay_norm_bias)
        return feature
    
    # Encode the input together with the partially known labels (MaskLabel)
    def label_embed_input(self, feature):
        label = F.data(name="label", shape=[None, 1], dtype="int64")
        label_idx = F.data(name='label_idx', shape=[None, 1], dtype="int64")

        label = L.reshape(label, shape=[-1])
        label_idx = L.reshape(label_idx, shape=[-1])

        embed_attr = F.ParamAttr(initializer=F.initializer.NormalInitializer(loc=0.0, scale=1.0))
        embed = F.embedding(input=label, size=(self.out_size, self.embed_size), param_attr=embed_attr )

        feature_label = L.gather(feature, label_idx, overwrite=False)
        feature_label = feature_label + embed
        feature = L.scatter(feature, label_idx, feature_label, overwrite=True)
     
        lay_norm_attr = F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=1))
        lay_norm_bias = F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=0))
        feature = L.layer_norm(feature, name='layer_norm_feature_input', 
                                      param_attr=lay_norm_attr, 
                                      bias_attr=lay_norm_bias)
        return feature
        
    def forward(self, graph_wrapper, feature, phase):
        if phase == "train": 
            edge_dropout = self.edge_dropout
            dropout = self.dropout
        else:
            edge_dropout = 0
            dropout = 0

        if self.use_label_e:
            feature = self.label_embed_input(feature)
        else:
            feature = self.embed_input(feature)
        if dropout > 0:
            feature = L.dropout(feature, dropout_prob=dropout, 
                                    dropout_implementation='upscale_in_train')
        
        # Project the input to hidden_size * heads so the Res connection can add directly
        feature = L.fc(feature, size=self.hidden_size * self.heads, name="init_feature")


        for i in range(self.num_layers - 1):
            ngw = pgl.sample.edge_drop(graph_wrapper, edge_dropout) 
            from model_unimp_large import graph_transformer, attn_appnp

            res_feature = feature

            feature, _, cks = graph_transformer(str(i), ngw, feature, 
                                             hidden_size=self.hidden_size,
                                             num_heads=self.heads, 
                                             concat=True, skip_feat=True,
                                             layer_norm=True, relu=True, gate=True)
            if dropout > 0:
                feature = L.dropout(feature, dropout_prob=dropout, 
                                     dropout_implementation='upscale_in_train') 
            
            # The next line is the Res connection
            feature = res_feature + feature 
        
        feature, attn, cks = graph_transformer(str(self.num_layers - 1), ngw, feature, 
                                             hidden_size=self.out_size,
                                             num_heads=self.heads, 
                                             concat=False, skip_feat=True,
                                             layer_norm=False, relu=False, gate=True)

        feature = attn_appnp(ngw, feature, attn, alpha=0.2, k_hop=10)

        pred = L.fc(
            feature, self.num_class, act=None, name="pred_output")
        return pred
```
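
To make the MaskLabel step in `label_embed_input` easier to follow, here is a minimal NumPy sketch of the same gather, add, scatter pattern. It is an illustration only: `add_label_embedding` and `embed_table` are hypothetical names standing in for the learned embedding, and it assumes `label_idx` contains no duplicate node ids. The embedding width has to match the feature width for the addition to work.

```python
import numpy as np

def add_label_embedding(feature, label_idx, label, embed_table):
    """Add a label embedding to the rows of `feature` listed in `label_idx`.

    feature:     (num_nodes, dim) float array of node features
    label_idx:   (k,) int array of node ids whose labels are visible
    label:       (k,) int array of class ids for those nodes
    embed_table: (num_classes, dim) float array (hypothetical stand-in
                 for the learned label embedding)
    """
    out = feature.copy()
    # Gather the selected rows, add the label embedding, write them back.
    out[label_idx] = feature[label_idx] + embed_table[label]
    return out

rng = np.random.default_rng(0)
feature = rng.normal(size=(5, 4))       # 5 nodes, 4 features (toy sizes)
embed_table = rng.normal(size=(3, 4))   # 3 classes, same width as the features
label_idx = np.array([0, 3])            # nodes whose labels are fed as input
label = np.array([2, 1])                # their class ids
mixed = add_label_embedding(feature, label_idx, label, embed_table)
```

Rows that are not listed in `label_idx` pass through unchanged, so nodes with hidden labels carry only their original features.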
        
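### (2) Model configuration (Notebook)
Best setup: a 3-layer res_unimp_large with 128 hidden units (hidden_size 64 × 2 heads), both dropout and edge dropout enabled, and MaskLabel turned on with label_rate = 0.66 (set in the training code).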

```
config = {
    "model_name": "res_unimp_large",
    "num_layers": 3,
    "hidden_size": 64,
    "heads": 2,
    "learning_rate": 0.001,
    "dropout": 0.33,
    "weight_decay": 0.0005,
    "edge_dropout": 0.32,
    "use_label_e": True
}

```
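
As a rough illustration of why the model projects the input to `hidden_size * heads` before the first layer, the sketch below checks the shapes with plain NumPy. The random arrays are stand-ins for the real layers (an assumption for illustration, not the PGL implementation), and the node count and raw feature width are arbitrary toy values.

```python
import numpy as np

hidden_size, heads = 64, 2             # values from the config above
width = hidden_size * heads            # 128 features per node

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 100))         # 10 nodes, 100 raw features (toy sizes)
w_in = rng.normal(size=(100, width))   # stand-in for the "init_feature" projection

h = x @ w_in                           # project once so every layer sees width 128

# Stand-in for one concat=True attention layer:
# `heads` outputs of size hidden_size, concatenated along the feature axis.
head_outputs = [rng.normal(size=(10, hidden_size)) for _ in range(heads)]
layer_out = np.concatenate(head_outputs, axis=1)

h = h + layer_out                      # residual add: both operands are (10, 128)
```
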
### (3) Model training (Notebook)

```
import os
use_label_e = True
label_rate = 0.66
epoch = 4000
exe.run(startup_program)
max_val_acc = 0

# Resume training from a saved checkpoint here
pretrained = False
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)
    max_val_acc = 0.756

earlystop = 0
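# Convert the graph data into a feed_dict for the Paddle Executor
feed_dict = gw.to_feed(dataset.graph)
for epoch in range(epoch):
    # Full-batch training
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden
    # node_label: the hidden labels
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels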
    
    if use_label_e:
        # Sample part of the training set; these labels are treated as known and fed to the network as input
        train_idx_temp = np.array(train_index, dtype="int64")
        train_lab_temp = np.array(train_label, dtype="int64")
        state = np.random.get_state()
        np.random.shuffle(train_idx_temp)
        np.random.set_state(state)
        np.random.shuffle(train_lab_temp)

        label_idx=train_idx_temp[:int(label_rate*len(train_idx_temp))]
        unlabel_idx=train_idx_temp[int(label_rate*len(train_idx_temp)):]
        label=train_lab_temp[:int(label_rate*len(train_idx_temp))]
        unlabel=train_lab_temp[int(label_rate*len(train_idx_temp)):]

        feed_dict["node_index"] = unlabel_idx
        feed_dict["node_label"] = unlabel
        feed_dict['label_idx']= label_idx
        feed_dict['label']= label
    else:
        feed_dict["node_label"] = np.array(train_label, dtype="int64")
        feed_dict["node_index"] = np.array(train_index, dtype="int64")
        

    train_loss, train_acc = exe.run(train_program,
                                feed=feed_dict,
                                fetch_list=[loss, acc],
                                return_numpy=True)

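    # Full-batch validation
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden
    # node_label: the hidden labels
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels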
    
    feed_dict["node_index"] = np.array(val_index, dtype="int64")
    feed_dict["node_label"] = np.array(val_label, dtype="int64")
    if use_label_e:
        feed_dict['label_idx'] = np.array(train_index, dtype="int64")
        feed_dict['label'] = np.array(train_label, dtype="int64")
    val_loss, val_acc = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_loss, v_acc],
                            return_numpy=True)
    print("Epoch", epoch, "Train Acc", train_acc[0], "Valid Acc", val_acc[0])
    
    # Save the model with the best validation accuracy so far
    if val_acc[0] > max_val_acc:
        max_val_acc = val_acc[0]
        fluid.io.save_persistables(exe, './output', train_program)
    
    # Stop once training accuracy has stayed above validation accuracy for enough consecutive epochs
    if train_acc[0] > val_acc[0]:
        earlystop += 1
        if earlystop == 40:
            break
    else:
        earlystop = 0
```
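
The training loop above shuffles the index and label arrays separately, restoring the random state in between so both get the same order. A possible alternative, shown here only as a sketch, is to draw one permutation and index both arrays with it. `split_masked_labels` is a hypothetical helper, not part of the baseline, and it assumes a NumPy version that provides `np.random.default_rng`.

```python
import numpy as np

def split_masked_labels(train_index, train_label, label_rate, rng):
    """Randomly split training nodes into a visible-label part and a part to predict.

    Returns (label_idx, label, unlabel_idx, unlabel), matching the four
    feed_dict entries used by the training loop.
    """
    train_index = np.asarray(train_index, dtype="int64")
    train_label = np.asarray(train_label, dtype="int64")
    # One permutation keeps every index paired with its own label.
    perm = rng.permutation(len(train_index))
    cut = int(label_rate * len(train_index))
    known, hidden = perm[:cut], perm[cut:]
    return (train_index[known], train_label[known],
            train_index[hidden], train_label[hidden])

rng = np.random.default_rng(0)
idx = np.arange(100, 110).reshape(-1, 1)    # toy node ids, shape (10, 1)
lab = (np.arange(10) % 3).reshape(-1, 1)    # toy labels, shape (10, 1)
label_idx, label, unlabel_idx, unlabel = split_masked_labels(idx, lab, 0.66, rng)
```

With `label_rate = 0.66` and 10 toy nodes, 6 nodes expose their labels as input and the remaining 4 are the ones the loss is computed on.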

### (4) Simple voting (local)
```
# Merge the prediction files produced by training with a simple (plurality) vote
import csv
from collections import Counter

def vote_merge(filelst):
    # Collect every file's vote for each nid
    result = {}
    for filepath in filelst:
        with open(filepath, encoding='utf-8', mode='r') as cr:
            csv_reader = csv.reader(cr)
            for i, row in enumerate(csv_reader):
                if i == 0:
                    # Skip the header row
                    continue
                idx, cls = row
                if idx not in result:
                    result[idx] = []
                result[idx].append(cls)

    # Write the most common label for each nid
    with open('D:/subexl/76/merge.csv', encoding='utf-8', mode='w', newline='') as fw:
        csv_writer = csv.writer(fw)
        csv_writer.writerow(['nid', 'label'])
        for nid, clss in result.items():
            counter = Counter(clss)
            true_cls = counter.most_common(1)
            csv_writer.writerow([nid, true_cls[0][0]])

if __name__ == '__main__':
    vote_merge([
 #       "D:/subexl/75/0.76436.csv",
 #       "D:/subexl/75/0.7635.csv",
  #      "D:/subexl/75/0.75666.csv",
   #     "D:/subexl/75/0.75736.csv",
 #       "D:/subexl/75/0.75755.csv",
 #       "D:/subexl/75/0.75801.csv",
 #       "D:/subexl/75/0.75868.csv",
 #       "D:/subexl/75/0.75978.csv",
 #       "D:/subexl/75/0.76171.csv",
 #       "D:/subexl/75/0.76288.csv",
 #       "D:/subexl/75/0.76412.csv",
 #       "D:/subexl/75/0.759664.csv",
 #       "D:/subexl/75/0.75973517.csv",
 #       "D:/subexl/75/0.75980633.csv",
 #       "D:/subexl/75/0.76322347.csv",
 #       "D:/subexl/75/0.763223471.csv",
        "D:/subexl/76/0.75736.csv",
        "D:/subexl/76/0.75755.csv",
        "D:/subexl/76/0.75801.csv",
        "D:/subexl/76/0.75868.csv",
        "D:/subexl/76/0.75978.csv",
        "D:/subexl/76/0.76436.csv",
        "D:/subexl/76/0.759664.csv",
        "D:/subexl/76/0.75973517.csv",
        "D:/subexl/76/0.75980633.csv",
        "D:/subexl/76/0.76322347.csv",
        "D:/subexl/76/0.763223471.csv",
        "D:/subexl/76/submission.csv",
                ])
```
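
One detail of the simple vote that is easy to miss: `Counter.most_common` returns elements with equal counts in the order they were first encountered, so when several labels tie, the file listed first decides. The sketch below shows this on in-memory dictionaries instead of CSV files; `plurality_vote` is a hypothetical helper written for illustration.

```python
from collections import Counter

def plurality_vote(predictions):
    """predictions: list of dicts {nid: label}, one dict per model."""
    votes = {}
    for pred in predictions:
        for nid, label in pred.items():
            votes.setdefault(nid, []).append(label)
    # most_common(1) keeps the first-encountered label when counts tie.
    return {nid: Counter(labels).most_common(1)[0][0]
            for nid, labels in votes.items()}

preds = [
    {"1": "7", "2": "3"},   # listed first, so it wins three-way ties
    {"1": "7", "2": "5"},
    {"1": "4", "2": "9"},
]
merged = plurality_vote(preds)
```

Node "1" gets label "7" by a 2 to 1 vote, while node "2" has three different votes and falls back to the first file's label "3". Listing the most trusted file first is therefore a reasonable choice.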


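### (5) Absolute majority voting (local)
Absolute majority voting is simple: a prediction is accepted only if more than half of the classifiers vote for it; otherwise it is rejected.
The result of the simple vote is then combined with newly trained prediction files through an absolute majority vote.
Here I renamed every submission file to "<test accuracy>.csv", for example 0.76087.csv, and sorted the files by accuracy. Absolute majority voting is applied first; if no label wins more than half of the votes for a node, the prediction from the most accurate CSV is used directly.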

```
import os
import numpy as np
from scipy import stats
import pandas as pd
# path is the folder that holds all of your submission files
path = 'D:/subexl/75'
filelist = os.listdir(path)

# Sort the files by test accuracy, highest first
filelist = sorted(filelist, key= lambda x:float(x[:-4]), reverse=True)
print(filelist)

# n is the number of rows in the test set
n = 37311
num_files = len(filelist)
a = np.zeros([num_files, n], dtype='i8')
for i, filename in enumerate(filelist):
    filepath = os.path.join(path, filename)
    a[i, :] = np.array(np.loadtxt(filepath, dtype=int, delimiter=',', skiprows=1, usecols=1, encoding='utf-8'))

res = np.zeros([n, 1], dtype='i8')
for j in range(n):
    counts = np.bincount(a[:,j])
    maxnum = np.argmax(counts)
    
    # Check whether the top label has more than half of the votes
    if counts[maxnum] > num_files//2:
        res[j] = maxnum
    else:
        res[j] = a[0,j]

# Write the result to a file
data=pd.read_csv('D:/subexl/75/0.75897.csv')
data['label'] = res
data.to_csv('E:/submission.csv',index=False,header=True)
```
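
To make the fallback rule concrete, here is a toy run of the same logic on a small in-memory array instead of CSV files. `majority_vote_with_fallback` is a hypothetical helper name, and the rows are assumed to be sorted with the most accurate file first, as in the script above.

```python
import numpy as np

def majority_vote_with_fallback(preds):
    """preds: (num_files, n) int array, rows sorted best file first."""
    num_files, n = preds.shape
    out = preds[0].copy()                    # fallback: the best file's prediction
    for j in range(n):
        counts = np.bincount(preds[:, j])
        top = counts.argmax()
        if counts[top] > num_files // 2:     # strictly more than half of the votes
            out[j] = top
    return out

preds = np.array([
    [1, 2, 3],   # best file
    [1, 5, 4],
    [1, 5, 3],
    [0, 6, 4],
])
result = majority_vote_with_fallback(preds)
```

In the first column label 1 has 3 of 4 votes and is accepted. In the second column label 5 has only 2 of 4 votes, and in the third column labels 3 and 4 are tied at 2 each, so both columns fall back to the best file's predictions (2 and 3).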


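# Summary
The main improvement came from the effort put into simple voting and absolute majority voting over the prediction results. I fused only 10 prediction files; with more prediction files the submission score might be higher still. The overall workflow is shown in the figure below.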
![](https://ai-studio-static-online.cdn.bcebos.com/5172690bda594068ac99a25b53427b53e0d5da1f53684fb99d4f184ffecb0cf4)


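Note that the simple vote and the absolute majority vote in the figure can each take any number of prediction files. It is a bit like People's Congress delegates collecting opinions from their regions and then proposing a resolution at the congress, haha, just a personal thought.
Finally, thanks to Baidu PaddlePaddle for providing a platform to learn on.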

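## The modified baseline follows and can be run end to end with one click:

## How to run
This baseline is based on PaddlePaddle 1.8.4. To run it locally you may need to install extra modules such as pgl, easydict, and pandas.

## Running locally
Download all the .py files from the folder on the left (including build_model.py and model.py) together with the work directory. Then choose "File" -> "Export Notebook to py" in the top-right corner (this ensures the code is the latest version) and run the exported .py file. When it finishes, download submission.csv and submit it.

## Running on AI Studio (Notebook)
Run the cells below in order, then download submission.csv and submit it. If you modified a cell during a run, restart the kernel from the top-right corner and run the cells in order again to avoid errors caused by stale state in memory. Tip: if you changed data in the folder on the left, you also need to restart the kernel before the new files are loaded.

## Overall code logic

1. Read the provided dataset, which includes building the graph and reading node features (you can change how edges are constructed).

2. Build the model from the configuration; you can also implement your own graph neural network by following the tutorial.

3. Start training.

4. Run prediction and produce the result file.


## Environment setup

This project depends on paddlepaddle==1.8.4 and pgl==1.2.0. Install these exact versions and it should run.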

In [None]:
!cd work && unzip -oq graph.zip  


In [None]:
# Install the required packages
!pip install --upgrade python-dateutil
!pip install easydict
!pip install pgl==1.2.0 
!pip install "pandas>=0.25"
!pip install pyarrow==0.13.0
!pip install chardet==3.0.4


In [None]:
import sys 

In [None]:
import pgl
import paddle.fluid as fluid
import numpy as np
import time
import pandas as pd

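## Graph network configuration

Many powerful model configurations are already available here; you can experiment by simply changing fields in config.
For example, to switch to a GAT configuration: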
```
config = {
    "model_name": "GAT",
    "num_layers":  1,
    "dropout": 0.5,
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "edge_dropout": 0.00,
}
```

In [None]:
from easydict import EasyDict as edict
config = {
    "model_name": "res_unimp_large",
    "num_layers": 3,
    "hidden_size": 64,
    "heads": 2,
    "learning_rate": 0.001,
    "dropout": 0.33,
    "weight_decay": 0.0005,
    "edge_dropout": 0.32,
    "use_label_e": True
}



config = edict(config)

## Data loading module

This part reads the dataset: it loads the graph data, builds the graph, and splits the training set.

In [None]:
from collections import namedtuple

Dataset = namedtuple("Dataset", 
               ["graph", "num_classes", "train_index",
                "train_label", "valid_index", "valid_label", "test_index"])

def load_edges(num_nodes, self_loop=True, add_inverse_edge=True):
    # Read the edges from the data
    edges = pd.read_csv("work/edges.csv", header=None, names=["src", "dst"]).values

    if add_inverse_edge:
        edges = np.vstack([edges, edges[:, ::-1]])

    if self_loop:
        src = np.arange(0, num_nodes)
        dst = np.arange(0, num_nodes)
        self_loop = np.vstack([src, dst]).T
        edges = np.vstack([edges, self_loop])
    
    return edges

def load():
    # Read node features and edges from the data, then split the dataset
    node_feat = np.load("work/feat.npy")
    num_nodes = node_feat.shape[0]
    edges = load_edges(num_nodes=num_nodes, self_loop=True, add_inverse_edge=True)
    graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges, node_feat={"feat": node_feat})
    
    indegree = graph.indegree()
    norm = np.maximum(indegree.astype("float32"), 1)
    norm = np.power(norm, -0.5)
    graph.node_feat["norm"] = np.expand_dims(norm, -1)
    
    df = pd.read_csv("work/train.csv")
    # Shuffle the rows (sample returns a new DataFrame, so assign it back)
    df = df.sample(frac=1.0)
    node_index = df["nid"].values
    node_label = df["label"].values
    train_part = int(len(node_index) * 0.8)
    train_index = node_index[:train_part]
    train_label = node_label[:train_part]
    valid_index = node_index[train_part:]
    valid_label = node_label[train_part:]
    test_index = pd.read_csv("work/test.csv")["nid"].values
    dataset = Dataset(graph=graph, 
                    train_label=train_label,
                    train_index=train_index,
                    valid_index=valid_index,
                    valid_label=valid_label,
                    test_index=test_index, num_classes=35)
    return dataset

In [None]:
dataset = load()

train_index = dataset.train_index
train_label = np.reshape(dataset.train_label, [-1 , 1])
train_index = np.expand_dims(train_index, -1)

val_index = dataset.valid_index
val_label = np.reshape(dataset.valid_label, [-1, 1])
val_index = np.expand_dims(val_index, -1)

test_index = dataset.test_index
test_index = np.expand_dims(test_index, -1)
test_label = np.zeros((len(test_index), 1), dtype="int64")
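
To see what `load_edges` and the normalization in `load` produce, the sketch below runs the same steps on a three-node toy graph. `build_edges` is a hypothetical helper that mirrors `load_edges` without reading a file, and counting the destination column with `np.bincount` is an assumed stand-in for `graph.indegree()`.

```python
import numpy as np

def build_edges(edges, num_nodes, self_loop=True, add_inverse_edge=True):
    """Mirror of load_edges for an in-memory edge list of (src, dst) pairs."""
    edges = np.asarray(edges)
    if add_inverse_edge:
        # Append every edge reversed so messages flow in both directions.
        edges = np.vstack([edges, edges[:, ::-1]])
    if self_loop:
        # Append (i, i) for every node.
        loops = np.arange(num_nodes)
        edges = np.vstack([edges, np.stack([loops, loops], axis=1)])
    return edges

edges = build_edges([[0, 1], [1, 2]], num_nodes=3)

# Assumed stand-in for graph.indegree(): count how often each node is a destination.
indegree = np.bincount(edges[:, 1], minlength=3)
norm = np.power(np.maximum(indegree.astype("float32"), 1), -0.5)
```

The two input edges become seven: two originals, two reversed copies, and three self loops. Node 1 ends up with in-degree 3 and the other two nodes with in-degree 2, so `norm` holds the inverse square roots of 2, 3, and 2.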


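## Network construction module

This is the network construction module. Several predefined models are already provided, including **GCN**, **GAT**, and **APPNP**. The number of layers, hidden_size, and so on can be set through simple configuration. You can also dig into model.py and get creative with your own graph neural network.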

In [None]:
import pgl
import model
import paddle.fluid as fluid
import numpy as np
import time
from build_model import build_model

# # Use the CPU
#place = fluid.CPUPlace()

# Use the GPU
place = fluid.CUDAPlace(0)

train_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(train_program, startup_program):
    with fluid.unique_name.guard():
        gw, loss, acc, pred = build_model(dataset,
                            config=config,
                            phase="train",
                            main_prog=train_program)

test_program = fluid.Program()
with fluid.program_guard(test_program, startup_program):
    with fluid.unique_name.guard():
        _gw, v_loss, v_acc, v_pred = build_model(dataset,
            config=config,
            phase="test",
            main_prog=test_program)


test_program = test_program.clone(for_test=True)

exe = fluid.Executor(place)

## Training

The graph neural network is trained in full-batch mode: every training step runs over all training samples of the entire graph.



In [8]:
import os
use_label_e = True
label_rate = 0.66
epoch = 500
exe.run(startup_program)
max_val_acc = 0

# Resume training from a saved checkpoint here
pretrained = False
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)
    max_val_acc = 0.756

earlystop = 0
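# Convert the graph data into a feed_dict for the Paddle Executor
feed_dict = gw.to_feed(dataset.graph)
for epoch in range(epoch):
    # Full-batch training
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden
    # node_label: the hidden labels
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels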
    
    if use_label_e:
        # Sample part of the training set; these labels are treated as known and fed to the network as input
        train_idx_temp = np.array(train_index, dtype="int64")
        train_lab_temp = np.array(train_label, dtype="int64")
        state = np.random.get_state()
        np.random.shuffle(train_idx_temp)
        np.random.set_state(state)
        np.random.shuffle(train_lab_temp)

        label_idx=train_idx_temp[:int(label_rate*len(train_idx_temp))]
        unlabel_idx=train_idx_temp[int(label_rate*len(train_idx_temp)):]
        label=train_lab_temp[:int(label_rate*len(train_idx_temp))]
        unlabel=train_lab_temp[int(label_rate*len(train_idx_temp)):]

        feed_dict["node_index"] = unlabel_idx
        feed_dict["node_label"] = unlabel
        feed_dict['label_idx']= label_idx
        feed_dict['label']= label
    else:
        feed_dict["node_label"] = np.array(train_label, dtype="int64")
        feed_dict["node_index"] = np.array(train_index, dtype="int64")
        

    train_loss, train_acc = exe.run(train_program,
                                feed=feed_dict,
                                fetch_list=[loss, acc],
                                return_numpy=True)

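    # Full-batch validation
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden
    # node_label: the hidden labels
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels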
    
    feed_dict["node_index"] = np.array(val_index, dtype="int64")
    feed_dict["node_label"] = np.array(val_label, dtype="int64")
    if use_label_e:
        feed_dict['label_idx'] = np.array(train_index, dtype="int64")
        feed_dict['label'] = np.array(train_label, dtype="int64")
    val_loss, val_acc = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_loss, v_acc],
                            return_numpy=True)
    print("Epoch", epoch, "Train Acc", train_acc[0], "Valid Acc", val_acc[0],"train loss",train_loss[0],"val loss",val_loss[0])
    
    # Save the model with the best validation accuracy so far
    if val_acc[0] > max_val_acc:
        max_val_acc = val_acc[0]
        print(val_acc[0])
        fluid.io.save_persistables(exe, './output', train_program)
    
    # Stop once training accuracy has stayed above validation accuracy for enough consecutive epochs
    if train_acc[0] > val_acc[0]:
        earlystop += 1
        if earlystop == 100:
            break
    else:
        earlystop = 0

Epoch 0 Train Acc 0.006229062 Valid Acc 0.03367267 train loss 3.9256654 val loss 3.3745832
0.03367267
Epoch 1 Train Acc 0.03062186 Valid Acc 0.18608956 train loss 3.4400425 val loss 3.0611267
0.18608956


KeyboardInterrupt: 
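
The stopping rule in the training loop counts consecutive epochs in which training accuracy exceeds validation accuracy and resets the count whenever it does not. The sketch below isolates that rule so it can be checked on a made-up accuracy history. `OverfitStopper` is a hypothetical helper and the numbers are invented for illustration, not real training results.

```python
class OverfitStopper:
    """Signal a stop after `patience` consecutive epochs with train_acc > val_acc."""

    def __init__(self, patience):
        self.patience = patience
        self.streak = 0

    def update(self, train_acc, val_acc):
        if train_acc > val_acc:
            self.streak += 1
        else:
            self.streak = 0          # any epoch without the gap resets the count
        return self.streak >= self.patience

stopper = OverfitStopper(patience=3)
# Invented (train_acc, val_acc) pairs, one per epoch.
history = [(0.50, 0.60), (0.70, 0.65), (0.72, 0.66), (0.60, 0.70),
           (0.80, 0.70), (0.81, 0.70), (0.82, 0.70)]
stopped_at = None
for epoch, (train_acc, val_acc) in enumerate(history):
    if stopper.update(train_acc, val_acc):
        stopped_at = epoch
        break
```

The streak of two at epochs 1 and 2 is reset at epoch 3, so the loop only stops at epoch 6, after three uninterrupted epochs with the gap.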

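## Predicting on the test set

After training, we predict on the test set. Because the test labels are unknown, we feed arbitrary placeholder labels at prediction time and obtain the predictions for the test data.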


In [None]:
pretrained = True
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)

In [None]:
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64") # placeholder labels
test_prediction = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_pred],
                            return_numpy=True)[0]

## Generating the submission file

As the last step, pandas makes it easy to generate the submission file; then just download submission.csv and submit it.

In [None]:
submission = pd.DataFrame(data={
                            "nid": test_index.reshape(-1),
                            "label": test_prediction.reshape(-1)
                        })
submission.to_csv("submission.csv", index=False)

In [9]:
def publicnum(num, d = 0):
    dictnum = {}
    for i in range(len(num)):
        if num[i] in dictnum.keys():
            dictnum[num[i]] += 1
        else:
            dictnum.setdefault(num[i], 1)
    maxnum = 0
    maxkey = 0
    for k, v in dictnum.items():
        if v > maxnum:
            maxnum = v
            maxkey = k
    return maxkey

df0=pd.read_csv("submission0.76136.csv")
df1=pd.read_csv("submission0.757822.csv")
df2=pd.read_csv("submission0.7583.csv")
df3=pd.read_csv("submission0.75758.csv")
df4=pd.read_csv("submission0.75921.csv")
df5=pd.read_csv("submission0.75782.csv")
df6=pd.read_csv("submission0.75956.csv")
df7=pd.read_csv("submission0.75801.csv")
df8=pd.read_csv("submission0.75884.csv")
#df9=pd.read_csv("submission9.csv")
#df10=pd.read_csv("submission10.csv")




nids=[]
labels=[]

for i in range(df4.shape[0]):
    label_zs=[]
    label_zs.append(df0.label[i])
    label_zs.append(df1.label[i])
    label_zs.append(df2.label[i])
    label_zs.append(df3.label[i])
    label_zs.append(df4.label[i])
    label_zs.append(df5.label[i])
    label_zs.append(df6.label[i])
    label_zs.append(df7.label[i])
    label_zs.append(df8.label[i])
    #label_zs.append(df9.label[i])
    #label_zs.append(df10.label[i])
    lab=publicnum(label_zs, d = 0)
    labels.append(lab)
    nids.append(df4.nid[i])


submission = pd.DataFrame(data={
                            "nid": nids,
                            "label": labels
                        })
submission.to_csv("submissiona.csv", index=False)