[I have reached Gold level on AI Studio and unlocked 9 badges. Follow me and I will follow back!](http://aistudio.baidu.com/aistudio/personalcenter/thirdview/335435)

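# (1) Competition Overview

Graph neural networks (GNNs) are neural networks designed for graph-structured data. They are widely used in recommender systems, financial risk control, computational biology, and other fields. The classic GNN tasks fall into three groups: node classification, link prediction, and graph classification. This competition is meant to help participants learn how to apply GNNs to node classification.

Over the past century, the number of scientific publications has nearly doubled every 12 years, so automatically classifying each publication by topic and field has become an important task. The goal here is to predict the subject category of unlabeled papers, such as software engineering, artificial intelligence, computation and language, or operating systems. The 35 subject labels used in the competition were confirmed and tagged by the paper authors and arXiv moderators.

The dataset is a subset of ogbn-arxiv, the arXiv paper citation network. ogbn-arxiv consists of a large number of academic papers whose citation relations form a huge directed graph: each directed edge means one paper cites another, and each node carries a simple 100-dimensional word-embedding vector as its feature. Nodes in the training set have been annotated with their paper category. Participants are asked to predict the categories of the unlabeled nodes from the known node labels and the citation relations between papers.


[Competition page](https://aistudio.baidu.com/aistudio/competition/detail/59)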



# (2) Key References
### 1. UniMP algorithm
[UniMP on GitHub](https://github.com/PaddlePaddle/PGL/tree/main/ogb_examples/nodeproppred/unimp)

### 2. Reference code
[https://aistudio.baidu.com/aistudio/projectdetail/1467127?channelType=0&channel=0](https://aistudio.baidu.com/aistudio/projectdetail/1467127?channelType=0&channel=0)

[PaddlePaddle Regular Competition: Introductory GNN Node Classification, 4th-place solution for May](https://aistudio.baidu.com/aistudio/projectdetail/1931047)



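# (3) Solution Details

## Environment setup

```
# Install the required packages
!pip install --upgrade python-dateutil
!pip install easydict
!pip install pgl==1.2.0
# Quote the requirement so the shell does not treat ">" as a redirect
!pip install "pandas>=0.25"
!pip install pyarrow==0.13.0
!pip install chardet==3.0.4
```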



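# Experiment: training with labels as features
(Result: severe overfitting, but the predictions can still feed the final voting ensemble. The code is for reference only.)
### ① Randomly pick one training node for each of the 35 classes
### ② Use cosine similarity to estimate the likely class of each feat
### ③ In each feat, set the position of the class found by cosine similarity to 1 and every other position to 0
### ④ Train with the new feat
### Code: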




In [None]:

import pandas as pd
import numpy as np
edges = pd.read_csv("work/edges.csv", header=None, names=["src", "dst"]).values
node_feat = np.load("work/feat.npy")
df = pd.read_csv("work/train.csv")
node_index = df["nid"].values
node_label = df["label"].values
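# feat_dict marks each node: 0 = unlabeled, 1 = training split, 2 = validation split
feat_dict={}
# feat1 will hold the label-derived features (width 100; only the first 35 columns are used)
feat1=np.zeros([node_feat.shape[0],100],dtype='float32')
for i in range(len(node_feat)):
    feat_dict[i]=0
for i in range(len(node_index)):
    if i<int(0.8*len(node_index)):
        feat_dict[node_index[i]]=1
    else:
        feat_dict[node_index[i]]=2
# Map node id -> label for every labeled node
node_labels={}
for i in range(len(node_index)):
    node_labels[node_index[i]]=node_label[i]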





In [None]:

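def cos_sim(vector_a, vector_b):
    """
    Compute the cosine similarity of two vectors, rescaled to [0, 1].
    :param vector_a: vector a
    :param vector_b: vector b
    :return: sim
    """
    # np.mat is deprecated; plain 1-D arrays and np.dot give the same value
    vector_a = np.asarray(vector_a, dtype="float64").ravel()
    vector_b = np.asarray(vector_b, dtype="float64").ravel()
    num = float(np.dot(vector_a, vector_b))
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    sim = 0.5 + 0.5 * cos
    return sim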


In [None]:

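rnd_list=[]

# Step 1: for each of the 35 classes, draw one random row of train.csv with that label
for j in range(35):
    while True:
        n=np.random.randint(0,len(node_index))
        if node_label[n]==j:
            rnd_list.append(n)
            break


for i in range(len(node_feat)):

    if feat_dict[i]!=1:
        # Not a training node: take the class of the most similar prototype
        cos_sim_max=0
        j_max=0
        for j in range(len(rnd_list)):
            # rnd_list stores row positions in train.csv, so map them to node ids first
            sim=cos_sim(node_feat[i],node_feat[node_index[rnd_list[j]]])
            if sim>cos_sim_max:
                cos_sim_max=sim
                j_max=j
            # Stop early once a very close match has been found
            if cos_sim_max>0.95:
                break
        feat1[i,node_label[rnd_list[j_max]]]=1.0
    else:
        # Training node: use its true label
        feat1[i,node_labels[i]]=1.0
    if i%10000==0:
        print(i)
node_feat=feat1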
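
The per-node loop above calls `cos_sim` up to 35 times for every node, which can be slow on a large graph. As a rough sketch (not the code used for the submission), the same nearest-prototype assignment can be written as one matrix product. The function name `nearest_prototype_onehot` and the toy data are made up for illustration, and unlike the loop it always takes the global best match instead of stopping early at 0.95.

```python
import numpy as np

def nearest_prototype_onehot(feat, proto_feat, proto_label, width):
    """Return one-hot rows marking the label of the most cosine-similar prototype."""
    # L2-normalise the rows so that a dot product equals cosine similarity.
    f = feat / np.maximum(np.linalg.norm(feat, axis=1, keepdims=True), 1e-12)
    p = proto_feat / np.maximum(np.linalg.norm(proto_feat, axis=1, keepdims=True), 1e-12)
    sim = f @ p.T                      # shape: (num_nodes, num_prototypes)
    best = sim.argmax(axis=1)          # closest prototype for each node
    onehot = np.zeros((feat.shape[0], width), dtype="float32")
    onehot[np.arange(feat.shape[0]), proto_label[best]] = 1.0
    return onehot

# Toy data: 4 nodes with 3-d features, 2 prototypes labelled 0 and 1.
feat = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.8, 0.1]])
proto_feat = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
proto_label = np.array([0, 1])
onehot = nearest_prototype_onehot(feat, proto_feat, proto_label, width=5)
print(onehot.argmax(axis=1))
```

The rescaling `0.5 + 0.5 * cos` used in `cos_sim` is monotonic, so skipping it here does not change which prototype wins.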


# (4) Code Implementation

### 1. Model definition (model.py)
Code for Res_Unimp_Large; see also model_modified.py.

```
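# Aliases assumed from the PGL example code: F = paddle.fluid, L = paddle.fluid.layers
import pgl
import paddle.fluid as F
import paddle.fluid.layers as L

class res_unimp_large(object):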
    def __init__(self, config, num_class):
        self.num_class = num_class
        self.num_layers = config.get("num_layers", 2)
        self.hidden_size = config.get("hidden_size", 128)
        self.out_size=config.get("out_size", 40)
        self.embed_size=config.get("embed_size", 100)
        self.heads = config.get("heads", 8) 
        self.dropout = config.get("dropout", 0.3)
        self.edge_dropout = config.get("edge_dropout", 0.0)
        self.use_label_e = config.get("use_label_e", False)
    
    # Encode the input features
    def embed_input(self, feature):   
        lay_norm_attr=F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=1))
        lay_norm_bias=F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=0))
        feature=L.layer_norm(feature, name='layer_norm_feature_input', 
                                      param_attr=lay_norm_attr, 
                                      bias_attr=lay_norm_bias)
        return feature
    
    # Encode the input together with the partially known labels (MaskLabel)
    def label_embed_input(self, feature):
        label = F.data(name="label", shape=[None, 1], dtype="int64")
        label_idx = F.data(name='label_idx', shape=[None, 1], dtype="int64")

        label = L.reshape(label, shape=[-1])
        label_idx = L.reshape(label_idx, shape=[-1])

        embed_attr = F.ParamAttr(initializer=F.initializer.NormalInitializer(loc=0.0, scale=1.0))
        embed = F.embedding(input=label, size=(self.out_size, self.embed_size), param_attr=embed_attr )

        feature_label = L.gather(feature, label_idx, overwrite=False)
        feature_label = feature_label + embed
        feature = L.scatter(feature, label_idx, feature_label, overwrite=True)
     
        lay_norm_attr = F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=1))
        lay_norm_bias = F.ParamAttr(initializer=F.initializer.ConstantInitializer(value=0))
        feature = L.layer_norm(feature, name='layer_norm_feature_input', 
                                      param_attr=lay_norm_attr, 
                                      bias_attr=lay_norm_bias)
        return feature
        
    def forward(self, graph_wrapper, feature, phase):
        if phase == "train": 
            edge_dropout = self.edge_dropout
            dropout = self.dropout
        else:
            edge_dropout = 0
            dropout = 0

        if self.use_label_e:
            feature = self.label_embed_input(feature)
        else:
            feature = self.embed_input(feature)
        if dropout > 0:
            feature = L.dropout(feature, dropout_prob=dropout, 
                                    dropout_implementation='upscale_in_train')
        
        # Project the input to hidden_size * heads so the residual connection can add tensors directly
        feature = L.fc(feature, size=self.hidden_size * self.heads, name="init_feature")


        for i in range(self.num_layers - 1):
            ngw = pgl.sample.edge_drop(graph_wrapper, edge_dropout) 
            from model_unimp_large import graph_transformer, attn_appnp

            res_feature = feature

            feature, _, cks = graph_transformer(str(i), ngw, feature, 
                                             hidden_size=self.hidden_size,
                                             num_heads=self.heads, 
                                             concat=True, skip_feat=True,
                                             layer_norm=True, relu=True, gate=True)
            if dropout > 0:
                feature = L.dropout(feature, dropout_prob=dropout, 
                                     dropout_implementation='upscale_in_train') 
            
            # Residual connection
            feature = res_feature + feature 
        
        feature, attn, cks = graph_transformer(str(self.num_layers - 1), ngw, feature, 
                                             hidden_size=self.out_size,
                                             num_heads=self.heads, 
                                             concat=False, skip_feat=True,
                                             layer_norm=False, relu=False, gate=True)

        feature = attn_appnp(ngw, feature, attn, alpha=0.2, k_hop=10)

        pred = L.fc(
            feature, self.num_class, act=None, name="pred_output")
        return pred
```
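
To make the MaskLabel step in `label_embed_input` easier to follow, here is a NumPy-only sketch of the same gather, add, scatter pattern, followed by a residual addition. It is an illustration under simplifying assumptions: the embedding table is random, layer normalisation is left out, and the names `add_label_embedding` and `embed_table` are hypothetical.

```python
import numpy as np

def add_label_embedding(feature, label_idx, label, embed_table):
    """Add a label embedding to the feature rows of nodes whose labels are visible."""
    out = feature.copy()
    # Gather the rows of known-label nodes, add their label embeddings, scatter back.
    out[label_idx] = feature[label_idx] + embed_table[label]
    return out

rng = np.random.default_rng(0)
num_nodes, feat_dim, num_classes = 6, 4, 3
feature = rng.normal(size=(num_nodes, feat_dim))
embed_table = rng.normal(size=(num_classes, feat_dim))  # one vector per class

label_idx = np.array([0, 2, 5])   # nodes whose labels are fed in as input
label = np.array([1, 0, 2])       # their class ids
mixed = add_label_embedding(feature, label_idx, label, embed_table)

# Nodes outside label_idx keep their original features.
unchanged = np.setdiff1d(np.arange(num_nodes), label_idx)
print(np.allclose(mixed[unchanged], feature[unchanged]))

# Residual connection: a layer's output is added to its input, so both need the same shape.
layer_out = np.tanh(mixed)        # stand-in for a graph_transformer layer
residual = mixed + layer_out
print(residual.shape)
```

This is also why the model first projects the features to `hidden_size * heads`: the residual sum only works when the layer input and output have identical shapes.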
        
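### 2. Model configuration (Notebook)
Best setup: a 3-layer res_unimp_large with a hidden size of 128, both kinds of dropout (feature dropout and edge dropout), MaskLabel enabled, and label_rate = 0.66 (set in the training code).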

```
config = {
    "model_name": "res_unimp_large",
    "num_layers": 3,
    "hidden_size": 128,
    "heads": 2,
    "learning_rate": 0.001,
    "dropout": 0.33,
    "weight_decay": 0.0005,
    "edge_dropout": 0.32,
    "use_label_e": True
}

```
### 3. Model training (Notebook)

```
import os
use_label_e = True
label_rate = 0.66
epoch = 4000
exe.run(startup_program)
max_val_acc = 0

# Set pretrained = True to resume training from the saved checkpoint
pretrained = False
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)
    max_val_acc = 0.756

earlystop = 0
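# Convert the graph data into a feed_dict for the Paddle Executor
feed_dict = gw.to_feed(dataset.graph)
for epoch in range(epoch):
    # Full-batch training
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden (prediction targets)
    # node_label: labels of those nodes
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels
    
    if use_label_e:
        # Sample part of the training set; its labels are treated as known and fed to the network as input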
        train_idx_temp = np.array(train_index, dtype="int64")
        train_lab_temp = np.array(train_label, dtype="int64")
        state = np.random.get_state()
        np.random.shuffle(train_idx_temp)
        np.random.set_state(state)
        np.random.shuffle(train_lab_temp)

        label_idx=train_idx_temp[:int(label_rate*len(train_idx_temp))]
        unlabel_idx=train_idx_temp[int(label_rate*len(train_idx_temp)):]
        label=train_lab_temp[:int(label_rate*len(train_idx_temp))]
        unlabel=train_lab_temp[int(label_rate*len(train_idx_temp)):]

        feed_dict["node_index"] = unlabel_idx
        feed_dict["node_label"] = unlabel
        feed_dict['label_idx']= label_idx
        feed_dict['label']= label
    else:
        feed_dict["node_label"] = np.array(train_label, dtype="int64")
        feed_dict["node_index"] = np.array(train_index, dtype="int64")
        

    train_loss, train_acc = exe.run(train_program,
                                feed=feed_dict,
                                fetch_list=[loss, acc],
                                return_numpy=True)

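    # Full-batch validation
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden (prediction targets)
    # node_label: labels of those nodes
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels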
    
    feed_dict["node_index"] = np.array(val_index, dtype="int64")
    feed_dict["node_label"] = np.array(val_label, dtype="int64")
    if use_label_e:
        feed_dict['label_idx'] = np.array(train_index, dtype="int64")
        feed_dict['label'] = np.array(train_label, dtype="int64")
    val_loss, val_acc = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_loss, v_acc],
                            return_numpy=True)
    print("Epoch", epoch, "Train Acc", train_acc[0], "Valid Acc", val_acc[0])
    
    # Save the model with the best validation accuracy so far
    if val_acc[0] > max_val_acc:
        max_val_acc = val_acc[0]
        fluid.io.save_persistables(exe, './output', train_program)
    
    # Stop once training accuracy has exceeded validation accuracy for 40 consecutive epochs
    if train_acc[0] > val_acc[0]:
        earlystop += 1
        if earlystop == 40:
            break
    else:
        earlystop = 0
```
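
The training loop shuffles the indices and the labels separately and relies on restoring the random state to keep the two arrays aligned. A possibly simpler alternative, sketched below with a hypothetical helper `split_masked_labels`, draws one permutation and applies it to both arrays. This is not the code used for the submission, only another way to express the `label_rate` split.

```python
import numpy as np

def split_masked_labels(train_index, train_label, label_rate, rng):
    """Split training nodes into visible-label inputs and hidden-label targets."""
    perm = rng.permutation(len(train_index))   # a single permutation keeps pairs aligned
    cut = int(label_rate * len(train_index))
    visible, hidden = perm[:cut], perm[cut:]
    return (train_index[visible], train_label[visible],
            train_index[hidden], train_label[hidden])

rng = np.random.default_rng(42)
train_index = np.arange(100, 200).reshape(-1, 1)     # toy nids, shape (100, 1)
train_label = (train_index % 35).astype("int64")     # toy labels derived from the nid

label_idx, label, unlabel_idx, unlabel = split_masked_labels(
    train_index, train_label, label_rate=0.66, rng=rng)

print(len(label_idx), len(unlabel_idx))
```

Calling the helper once per epoch reproduces the behaviour of the loop: a fresh random 66% of the training labels is shown to the model and the remaining 34% is used as the prediction target.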

### 4. Simple voting
```
def publicnum(num, d = 0):
    dictnum = {}
    for i in range(len(num)):
        if num[i] in dictnum.keys():
            dictnum[num[i]] += 1
        else:
            dictnum.setdefault(num[i], 1)
    maxnum = 0
    maxkey = 0
    for k, v in dictnum.items():
        if v > maxnum:
            maxnum = v
            maxkey = k
    return maxkey

df0=pd.read_csv("submission0.76136.csv")
df1=pd.read_csv("submission0.757822.csv")
df2=pd.read_csv("submission0.7583.csv")
df3=pd.read_csv("submission0.75758.csv")
df4=pd.read_csv("submission0.75921.csv")
df5=pd.read_csv("submission0.75782.csv")
df6=pd.read_csv("submission0.75956.csv")
df7=pd.read_csv("submission0.75801.csv")
df8=pd.read_csv("submission0.75884.csv")
#df9=pd.read_csv("submission9.csv")
#df10=pd.read_csv("submission10.csv")




nids=[]
labels=[]

for i in range(df4.shape[0]):
    label_zs=[]
    label_zs.append(df0.label[i])
    label_zs.append(df1.label[i])
    label_zs.append(df2.label[i])
    label_zs.append(df3.label[i])
    label_zs.append(df4.label[i])
    label_zs.append(df5.label[i])
    label_zs.append(df6.label[i])
    label_zs.append(df7.label[i])
    label_zs.append(df8.label[i])
    #label_zs.append(df9.label[i])
    #label_zs.append(df10.label[i])
    lab=publicnum(label_zs, d = 0)
    labels.append(lab)
    nids.append(df4.nid[i])


submission = pd.DataFrame(data={
                            "nid": nids,
                            "label": labels
                        })
submission.to_csv("submissiona.csv", index=False)
```
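
The `publicnum` helper above is a hand-written majority vote. A compact sketch of the same idea using `collections.Counter` follows; the function name `majority_vote` and the toy predictions are illustrative only. Ties go to the label seen first, which matches the strict `>` comparison in `publicnum`.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(votes).most_common(1)[0][0]

# Toy predictions from three hypothetical submissions for four test nodes.
model_preds = [
    [3, 7, 7, 1],
    [3, 2, 7, 4],
    [5, 2, 7, 9],
]
# zip(*...) regroups the lists so that each tuple holds all votes for one node.
ensemble = [majority_vote(node_votes) for node_votes in zip(*model_preds)]
print(ensemble)  # [3, 2, 7, 1]
```

Because ties fall back to the first file in the list, it is reasonable to put the strongest single submission first.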





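# (5) Summary and Possible Improvements
1. Using the UniMP algorithm improves the score.
2. Early stopping helps reduce overfitting and improves the score.
3. Voting improves the score, but the gain has a ceiling.

# (6) Advice for Others Learning PaddlePaddle

#### Take more of the Baidu AI Studio courses and read other people's AI Studio projects. They may spark ideas that help you score better in competitions.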

# （7）One More Thing
If you want more ideas to try, see the papers below. All of them report substantial gains on node classification.

[Predict then Propagate: Graph Neural Networks meet Personalized PageRank](https://arxiv.org/abs/1810.05997)

[Simple and Deep Graph Convolutional Networks](https://arxiv.org/abs/2007.02133)

[Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification](https://arxiv.org/abs/2009.03509)

[Combining Label Propagation and Simple Models Out-performs Graph Neural Networks](https://arxiv.org/abs/2010.13993)

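You can also look at the [UniMP](https://github.com/PaddlePaddle/PGL/tree/main/ogb_examples/nodeproppred/unimp) example on GitHub. It uses a similar dataset and recently reached SOTA results. If it helps, a Star is welcome 👏

Related course: [Graph Neural Network 7-Day Bootcamp](http://aistudio.baidu.com/aistudio/education/group/info/1956)

Code references:

[Node classification on a paper citation network: a summary of tuning experience](https://aistudio.baidu.com/aistudio/projectdetail/1642136)

[PaddlePaddle Regular Competition: Introductory GNN Node Classification, 4th-place solution for May](https://aistudio.baidu.com/aistudio/projectdetail/1931047)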


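## Overall code logic

1. Load the provided dataset, which includes building the graph and reading the node features (you can change how the edges are constructed).

2. Build the model from a config; you can also implement your own graph neural network by following the tutorial.

3. Train the model.

4. Run prediction and write the result file.


## Environment setup

This project depends on PaddlePaddle (paddlepaddle==1.8.4) and pgl==1.2.0. Install exactly these versions to run it.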

In [None]:
!unzip -oq /home/aistudio/data/data93851/graph.zip -d work/

In [None]:
# Install the required packages
!pip install --upgrade python-dateutil
!pip install easydict
!pip install pgl==1.2.0
# Quote the requirement so the shell does not treat ">" as a redirect
!pip install "pandas>=0.25"
!pip install pyarrow==0.13.0
!pip install chardet==3.0.4

In [None]:
import sys 

In [None]:
import pgl
import paddle.fluid as fluid
import numpy as np
import time
import pandas as pd

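## Graph network configuration

Several strong model configurations are already provided, so you can experiment simply by changing fields in config.
For example, to switch to a GAT configuration: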
```
config = {
    "model_name": "GAT",
    "num_layers":  1,
    "dropout": 0.5,
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "edge_dropout": 0.00,
}
```

In [None]:
from easydict import EasyDict as edict
config = {
    "model_name": "res_unimp_large",
    "num_layers": 3,
    "hidden_size": 128,
    "heads": 2,
    "learning_rate": 0.001,
    "dropout": 0.33,
    "weight_decay": 0.0005,
    "edge_dropout": 0.32,
    "use_label_e": True
}



config = edict(config)

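## Data loading module

This module reads the dataset: it builds the graph from the edge data and splits the labeled nodes into training and validation sets.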

In [None]:
from collections import namedtuple

Dataset = namedtuple("Dataset", 
               ["graph", "num_classes", "train_index",
                "train_label", "valid_index", "valid_label", "test_index"])

def load_edges(num_nodes, self_loop=True, add_inverse_edge=True):
    # Read the edges from the data file
    edges = pd.read_csv("work/edges.csv", header=None, names=["src", "dst"]).values

    if add_inverse_edge:
        edges = np.vstack([edges, edges[:, ::-1]])

    if self_loop:
        src = np.arange(0, num_nodes)
        dst = np.arange(0, num_nodes)
        self_loop = np.vstack([src, dst]).T
        edges = np.vstack([edges, self_loop])
    
    return edges

def load():
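    # Read the node features and edges, then split the data
    node_feat = np.load("work/feat.npy")
    # Uncomment to train on the label-derived features from the experiment above
    #node_feat=feat1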
    num_nodes = node_feat.shape[0]
    edges = load_edges(num_nodes=num_nodes, self_loop=True, add_inverse_edge=True)
    graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges, node_feat={"feat": node_feat})
    
    indegree = graph.indegree()
    norm = np.maximum(indegree.astype("float32"), 1)
    norm = np.power(norm, -0.5)
    graph.node_feat["norm"] = np.expand_dims(norm, -1)
    
    df = pd.read_csv("work/train.csv")
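    # Shuffle the rows; sample() returns a new DataFrame, so the result must be assigned.
    # The fixed random_state is an added assumption so the split stays the same when training is resumed.
    df = df.sample(frac=1.0, random_state=0)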
    node_index = df["nid"].values
    node_label = df["label"].values
    train_part = int(len(node_index) * 0.8)
    train_index = node_index[:train_part]
    train_label = node_label[:train_part]
    valid_index = node_index[train_part:]
    valid_label = node_label[train_part:]
    test_index = pd.read_csv("work/test.csv")["nid"].values
    dataset = Dataset(graph=graph, 
                    train_label=train_label,
                    train_index=train_index,
                    valid_index=valid_index,
                    valid_label=valid_label,
                    test_index=test_index, num_classes=35)
    return dataset
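
As a small worked example of what `load_edges` and the `norm` feature compute, the sketch below applies the same steps to a toy 3-node graph using NumPy only. The in-degree is counted directly from the destination column here instead of through `graph.indegree()`, which is an assumption about what that call returns.

```python
import numpy as np

num_nodes = 3
edges = np.array([[0, 1], [0, 2]])          # paper 0 cites papers 1 and 2

# Add the reverse of every edge so that messages flow in both directions.
edges = np.vstack([edges, edges[:, ::-1]])
# Add a self-loop for every node.
loops = np.stack([np.arange(num_nodes), np.arange(num_nodes)], axis=1)
edges = np.vstack([edges, loops])

# In-degree = number of edges that end at each node.
indegree = np.bincount(edges[:, 1], minlength=num_nodes)
# Normalisation factor d^(-1/2), with the degree clipped to at least 1.
norm = np.power(np.maximum(indegree.astype("float32"), 1), -0.5)

print(edges.shape, indegree.tolist())  # (7, 2) [3, 2, 2]
```

With self-loops added, every node has an in-degree of at least 1, so the `np.maximum(..., 1)` guard only matters when `self_loop=False`.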

In [None]:
dataset = load()

train_index = dataset.train_index
train_label = np.reshape(dataset.train_label, [-1 , 1])
train_index = np.expand_dims(train_index, -1)

val_index = dataset.valid_index
val_label = np.reshape(dataset.valid_label, [-1, 1])
val_index = np.expand_dims(val_index, -1)

test_index = dataset.test_index
test_index = np.expand_dims(test_index, -1)
test_label = np.zeros((len(test_index), 1), dtype="int64")


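## Model building module

This is the model building module. Several predefined models are provided, including **GCN**, **GAT**, and **APPNP**. Simple config changes set the number of layers, hidden_size, and so on. You can also dig into model.py, try out your own ideas, and write your own graph neural network.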
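
For readers unfamiliar with APPNP, one of the predefined models and the idea behind the `attn_appnp(..., alpha=0.2, k_hop=10)` call in the model code, here is a dense NumPy sketch of the propagation rule from the "Predict then Propagate" paper: `H(k+1) = (1 - alpha) * A_hat @ H(k) + alpha * H(0)`. It uses a fixed, symmetrically normalised adjacency matrix rather than attention weights, so it illustrates the principle and is not PGL's implementation. The function name `appnp_propagate` is hypothetical.

```python
import numpy as np

def appnp_propagate(adj, h0, alpha=0.2, k_hop=10):
    """Personalised-PageRank style smoothing of the initial predictions h0."""
    adj = adj + np.eye(adj.shape[0])                 # add self-loops
    deg = adj.sum(axis=1)
    a_hat = adj / np.sqrt(np.outer(deg, deg))        # D^(-1/2) A D^(-1/2)
    h = h0
    for _ in range(k_hop):
        # Mix neighbour information with a "teleport" back to the initial prediction.
        h = (1 - alpha) * (a_hat @ h) + alpha * h0
    return h

# Toy path graph 0 - 1 - 2 with 2-class scores; only node 0 starts with evidence.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype="float64")
h0 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
h = appnp_propagate(adj, h0)
print(h[:, 0] > 0)
```

After propagation every node carries some score for class 0, and the node that started with the evidence keeps the largest share because of the `alpha` teleport term.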

In [None]:
import pgl
import model
import paddle.fluid as fluid
import numpy as np
import time
from build_model import build_model

# Use CPU
#place = fluid.CPUPlace()

# Use GPU
place = fluid.CUDAPlace(0)

train_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(train_program, startup_program):
    with fluid.unique_name.guard():
        gw, loss, acc, pred = build_model(dataset,
                            config=config,
                            phase="train",
                            main_prog=train_program)

test_program = fluid.Program()
with fluid.program_guard(test_program, startup_program):
    with fluid.unique_name.guard():
        _gw, v_loss, v_acc, v_pred = build_model(dataset,
            config=config,
            phase="test",
            main_prog=test_program)


test_program = test_program.clone(for_test=True)

exe = fluid.Executor(place)

## Training

The graph neural network is trained in full-batch mode: every training step processes all training samples of the entire graph.



In [16]:
import os
use_label_e = True
label_rate = 0.66
epoch = 4000
exe.run(startup_program)
max_val_acc = 0

# Set pretrained = True to resume training from the saved checkpoint
pretrained = True
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)
    max_val_acc = 0.63

earlystop = 0
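# Convert the graph data into a feed_dict for the Paddle Executor
feed_dict = gw.to_feed(dataset.graph)
for epoch in range(epoch):
    # Full-batch training
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden (prediction targets)
    # node_label: labels of those nodes
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels
    
    if use_label_e:
        # Sample part of the training set; its labels are treated as known and fed to the network as input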
        train_idx_temp = np.array(train_index, dtype="int64")
        train_lab_temp = np.array(train_label, dtype="int64")
        state = np.random.get_state()
        np.random.shuffle(train_idx_temp)
        np.random.set_state(state)
        np.random.shuffle(train_lab_temp)

        label_idx=train_idx_temp[:int(label_rate*len(train_idx_temp))]
        unlabel_idx=train_idx_temp[int(label_rate*len(train_idx_temp)):]
        label=train_lab_temp[:int(label_rate*len(train_idx_temp))]
        unlabel=train_lab_temp[int(label_rate*len(train_idx_temp)):]

        feed_dict["node_index"] = unlabel_idx
        feed_dict["node_label"] = unlabel
        feed_dict['label_idx']= label_idx
        feed_dict['label']= label
    else:
        feed_dict["node_label"] = np.array(train_label, dtype="int64")
        feed_dict["node_index"] = np.array(train_index, dtype="int64")
        

    train_loss, train_acc = exe.run(train_program,
                                feed=feed_dict,
                                fetch_list=[loss, acc],
                                return_numpy=True)

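    # Full-batch validation
    # Choose which nodes of the graph to fetch
    # node_index: nids of the nodes whose labels are hidden (prediction targets)
    # node_label: labels of those nodes
    # label_idx: nids of the nodes whose labels are known
    # label: the known labels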
    
    feed_dict["node_index"] = np.array(val_index, dtype="int64")
    feed_dict["node_label"] = np.array(val_label, dtype="int64")
    if use_label_e:
        feed_dict['label_idx'] = np.array(train_index, dtype="int64")
        feed_dict['label'] = np.array(train_label, dtype="int64")
    val_loss, val_acc = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_loss, v_acc],
                            return_numpy=True)
    print("Epoch", epoch, "Train Acc", train_acc[0], "Valid Acc", val_acc[0],"train loss",train_loss[0],"val loss",val_loss[0])
    
    # Save the model with the best validation accuracy so far
    if val_acc[0] > max_val_acc:
        max_val_acc = val_acc[0]
        print(val_acc[0])
        fluid.io.save_persistables(exe, './output', train_program)
    
    # Stop once training accuracy has exceeded validation accuracy for 40 consecutive epochs
    if train_acc[0] > val_acc[0]:
        earlystop += 1
        if earlystop == 40:
            break
    else:
        earlystop = 0
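
The stopping rule at the end of the loop is slightly unusual: it counts consecutive epochs in which training accuracy exceeds validation accuracy, rather than epochs without a validation improvement. The pure-Python sketch below isolates that rule so it can be tried on toy numbers; the function name `stop_epoch` and the accuracy lists are hypothetical.

```python
def stop_epoch(train_accs, val_accs, patience=40):
    """Return the epoch at which training would stop, or None if it never does."""
    streak = 0
    for epoch, (train_acc, val_acc) in enumerate(zip(train_accs, val_accs)):
        if train_acc > val_acc:
            streak += 1              # one more epoch of apparent overfitting
            if streak == patience:
                return epoch
        else:
            streak = 0               # validation caught up, so reset the counter
    return None

# Toy run: validation leads for two epochs, catches up again at epoch 4,
# and after that training accuracy stays ahead.
train_accs = [0.50, 0.60, 0.70, 0.72, 0.73, 0.80, 0.82, 0.84]
val_accs   = [0.55, 0.65, 0.69, 0.70, 0.74, 0.75, 0.75, 0.75]
print(stop_epoch(train_accs, val_accs, patience=3))  # 7
```

The best checkpoint is still chosen by validation accuracy, so this rule only decides when to stop, not which model is kept.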

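## Predicting on the test set

After training, we run prediction on the test set. Because the test labels are unknown, we feed arbitrary placeholder labels and then collect the predictions for the test data.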


In [17]:
pretrained = True
if pretrained:
    def name_filter(var):
        res = var.name in os.listdir('./output')
        return res
    fluid.io.load_vars(exe, './output',predicate=name_filter)

In [18]:
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64") # placeholder labels
test_prediction = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_pred],
                            return_numpy=True)[0]

## Generating the submission file

As the last step, pandas makes it easy to produce the submission file. Download submission.csv and submit it.

In [19]:
submission = pd.DataFrame(data={
                            "nid": test_index.reshape(-1),
                            "label": test_prediction.reshape(-1)
                        })
submission.to_csv("submission.csv", index=False)
