# Exp6: Predicting Amazon Review Quality with Ensemble Learning

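## 1. Overview

With the rise of e-commerce platforms and the lasting impact of the pandemic, online shopping plays an increasingly important role in daily life. When choosing products online, reviews are often one of the things shoppers pay most attention to. However, review quality on e-commerce sites varies widely, and paid posters sometimes flood products with fake praise or malicious negative reviews, which seriously harms the shopping experience. Predicting review quality has therefore become a growing concern for e-commerce platforms: if quality can be assessed automatically, low-quality reviews can be kept from being displayed. In this case study we use ensemble learning to predict the quality of reviews from a real Amazon setting.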

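## 2. Assignment Description

In this case study you need to implement two ensemble learning algorithms (Bagging and AdaBoost.M1), each with two base classifiers, SVM and decision tree. That gives four combinations to compare, with [AUC](https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics) as the evaluation metric: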

* Bagging + SVM
* Bagging + decision tree
* AdaBoost.M1 + SVM
* AdaBoost.M1 + decision tree

Note that the core ensemble learning algorithms must be **implemented by hand**; the base classifiers may come from a library.
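As a rough illustration of what a hand-written Bagging loop can look like, here is a minimal sketch. It is not the reference solution for the assignment. `ThresholdStump`, `bagging_fit` and `bagging_predict_proba` are hypothetical names introduced only for this example, and the stump is a toy stand-in for a real base classifier so that the block runs without any third-party library.

```python
import random


class ThresholdStump:
    """Toy stand-in base classifier: thresholds one numeric feature."""

    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    pred = [1 if sign * (row[j] - t) >= 0 else 0 for row in X]
                    err = sum(p != yi for p, yi in zip(pred, y))
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        _, self.j, self.t, self.sign = best
        return self

    def predict(self, X):
        return [1 if self.sign * (row[self.j] - self.t) >= 0 else 0 for row in X]


def bagging_fit(make_base, X, y, n_estimators=25, seed=0):
    """Train n_estimators base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        models.append(make_base().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict_proba(models, X):
    """Fraction of base classifiers voting for class 1, usable as an AUC score."""
    votes = [m.predict(X) for m in models]
    return [sum(col) / len(models) for col in zip(*votes)]


# Tiny made-up dataset, for illustration only
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(ThresholdStump, X, y, n_estimators=25, seed=0)
probs = bagging_predict_proba(models, X)
```

Because `make_base` is just a factory, a library classifier with `fit` and `predict` methods (for example a decision tree from sklearn) could presumably be passed in place of the toy stump, which keeps the ensemble logic itself hand-written.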

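### Basic requirements
* Design a feature representation based on the data format
* Report the AUC obtained by each combination
* Analyze the differences between the results in light of the characteristics of each ensemble learning algorithm
* (Using the ensemble learning algorithms from third-party libraries such as sklearn will cost points, at the grader's discretion)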
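Since every combination is compared by AUC, it can help to see what the metric computes. The sketch below uses the rank-sum formulation: AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. `roc_auc` is a hypothetical helper written for this example; in practice `sklearn.metrics.roc_auc_score` serves the same purpose.

```python
def roc_auc(y_true, scores):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Note that AUC depends only on the ordering of the scores, so an ensemble should output a continuous score (such as a vote share) rather than hard 0/1 labels.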

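### Extended requirements
* Try other base classifiers (e.g. k-NN, naive Bayes)
* Analyze the influence of different features
* Analyze the influence of the ensemble learning algorithms' parameters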
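For the AdaBoost.M1 half of the assignment, the following is a minimal sketch of the usual formulation: train on weighted data, stop if the weighted error reaches 1/2, shrink the weights of correctly classified examples by β = ε/(1 − ε), and let each classifier vote with weight log(1/β). The names `WeightedStump`, `adaboost_m1_fit` and `adaboost_m1_score` are hypothetical, and the stump is again only a toy stand-in for a real base classifier that accepts sample weights.

```python
import math


class WeightedStump:
    """Toy stand-in base classifier that minimizes weighted error."""

    def fit(self, X, y, sample_weight):
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    pred = [1 if sign * (row[j] - t) >= 0 else 0 for row in X]
                    err = sum(w for p, yi, w in zip(pred, y, sample_weight) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        _, self.j, self.t, self.sign = best
        return self

    def predict(self, X):
        return [1 if self.sign * (row[self.j] - self.t) >= 0 else 0 for row in X]


def adaboost_m1_fit(make_base, X, y, n_rounds=10):
    n = len(X)
    w = [1.0 / n] * n  # start from the uniform distribution
    models, alphas = [], []
    for _ in range(n_rounds):
        model = make_base().fit(X, y, sample_weight=w)
        pred = model.predict(X)
        eps = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        if eps >= 0.5:  # AdaBoost.M1 needs weighted error below 1/2
            break
        eps = max(eps, 1e-10)  # guard against division by zero for a perfect learner
        beta = eps / (1.0 - eps)
        models.append(model)
        alphas.append(math.log(1.0 / beta))
        # Down-weight the examples this round got right, then renormalize
        w = [wi * beta if p == yi else wi for wi, p, yi in zip(w, pred, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return models, alphas


def adaboost_m1_score(models, alphas, X):
    """Share of the total vote weight cast for class 1, usable as an AUC score."""
    if not models:
        raise ValueError("no base classifier reached error below 1/2")
    total = sum(alphas)
    votes = [m.predict(X) for m in models]
    return [sum(a for a, v in zip(alphas, col) if v == 1) / total for col in zip(*votes)]


# Made-up data that no single threshold separates
X_hard = [[1], [2], [3], [4], [5], [6]]
y_hard = [0, 0, 1, 1, 0, 0]
models, alphas = adaboost_m1_fit(WeightedStump, X_hard, y_hard, n_rounds=10)
scores = adaboost_m1_score(models, alphas, X_hard)
```

The number of rounds is one of the parameters whose influence the extended requirements ask about. A base classifier without weight support could instead be trained on a sample drawn according to `w`; that variant is not shown here.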

## 3. Data Overview

In [1]:
import pandas as pd 


In [2]:
train_df = pd.read_csv('./data/train.csv', sep='\t')
train_df

Unnamed: 0,reviewerID,asin,reviewText,overall,votes_up,votes_all,label
0,7885,3901,"First off, allow me to correct a common mistak...",5.0,6,7,0
1,52087,47978,I am really troubled by this Story and Enterta...,3.0,99,134,0
2,5701,3667,A near-perfect film version of a downright glo...,4.0,14,14,1
3,47191,40892,Keep your expectations low. Really really low...,1.0,4,7,0
4,40957,15367,"""they dont make em like this no more...""well.....",5.0,3,6,0
...,...,...,...,...,...,...,...
57034,58315,29374,"If you like beautifully shot, well acted films...",2.0,12,21,0
57035,23328,45548,This is a great set of films Wayne did Fox and...,5.0,15,18,0
57036,27203,42453,It's what's known as a comedy of manners. It's...,3.0,4,5,0
57037,33992,44891,Ellen can do no wrong as far a creating wonder...,5.0,4,5,0


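The data comes from the Amazon e-commerce platform and contains more than 50,000 reviews left by users after purchasing products. The columns are:

* reviewerID: user ID
* asin: product ID
* reviewText: review text, in English
* overall: the user's rating of the product (1-5)
* votes_up: number of votes marking the review as helpful (training set only)
* votes_all: total number of votes the review received (training set only)
* label: review quality label, 1 for high quality and 0 for low quality (training set only)

The quality label is derived from other users' votes on the review: reviews with votes_up/votes_all ≥ 0.9 are treated as high quality. In addition, the test set contains an extra column, Id, which identifies each test example.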
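The labeling rule can be checked against the first five rows of the training table shown above. The sketch below is only an illustration: `quality_label` is a hypothetical helper, and rejecting reviews with no votes is an assumption, since the source does not say how such rows are handled.

```python
def quality_label(votes_up, votes_all):
    """Return 1 when votes_up / votes_all >= 0.9, else 0."""
    if votes_all <= 0:
        # Assumption: every labeled review has at least one vote
        raise ValueError("votes_all must be positive")
    # Integer arithmetic avoids floating-point surprises right at the 0.9 boundary
    return 1 if 10 * votes_up >= 9 * votes_all else 0


# (votes_up, votes_all) for rows 0-4 of the training table above
rows = [(6, 7), (99, 134), (14, 14), (4, 7), (3, 6)]
labels = [quality_label(up, total) for up, total in rows]  # [0, 0, 1, 0, 0]
```

Because votes_up and votes_all determine the label exactly and exist only in the training set, they cannot serve as input features at test time.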

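## 4. Competition Submission Format

Course page: https://aistudio.baidu.com/aistudio/education/dashboard

The submission file must give, for each review in the test set, the predicted probability that it is high quality. Each line contains an Id (matching the test set) and the predicted probability Predicted (a float between 0 and 1), separated by a comma. An example submission: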

```
Id,Predicted
0,0.9
1,0.45
2,0.78
...
```
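Name the file `result.csv`.

**Note that in addition to the competition submission, you must also submit your code and report (without the data) on XuetangX (学堂在线), as in previous assignments.**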
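One possible way to produce a file in this format with the standard library is sketched below. `write_submission` is a hypothetical helper, and the ids and probabilities are the ones from the example above rather than real predictions.

```python
import csv
import io


def write_submission(ids, probs, fileobj):
    """Write Id,Predicted rows, rejecting values outside [0, 1]."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for example_id, prob in zip(ids, probs):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for Id {example_id}: {prob}")
        writer.writerow([example_id, prob])


# Write to an in-memory buffer here; a real run would open result.csv instead
buffer = io.StringIO()
write_submission([0, 1, 2], [0.9, 0.45, 0.78], buffer)
submission_text = buffer.getvalue()
```

To write the actual file, the same function can be called inside `with open("result.csv", "w", newline="") as f:`.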

In [3]:
import numpy as np


In [4]:
test_df = pd.read_csv('./data/test.csv', sep='\t')
test_df

Unnamed: 0,Id,reviewerID,asin,reviewText,overall
0,0,82947,37386,I REALLY wanted this series but I am in SHOCK ...,1.0
1,1,10154,23543,I have to say that this is a work of art for m...,4.0
2,2,5789,5724,Alien 3 is certainly the most controversal fil...,3.0
3,3,9198,5909,"I love this film...preachy? Well, of course i...",5.0
4,4,33252,21214,Even though I previously bought the Gamera Dou...,5.0
...,...,...,...,...,...
11203,11203,18250,35309,I honestly never heard of the graphic novel un...,5.0
11204,11204,3200,2130,Archie Bunker's command to stifle YOURSELF! wa...,5.0
11205,11205,37366,41971,"In LSD - My Problem Child, Albert Hoffman wrot...",5.0
11206,11206,1781,33089,I have owned this DVD for over a year now and ...,5.0


In [14]:
# Build a training subset with equal numbers of label 1 and label 0 examples
small_batch_df = pd.concat([train_df[train_df['label'] == 0][:12000], train_df[train_df['label'] == 1][:12000]])
small_batch_df

Unnamed: 0,reviewerID,asin,reviewText,overall,votes_up,votes_all,label
0,7885,3901,"First off, allow me to correct a common mistak...",5.0,6,7,0
1,52087,47978,I am really troubled by this Story and Enterta...,3.0,99,134,0
3,47191,40892,Keep your expectations low. Really really low...,1.0,4,7,0
4,40957,15367,"""they dont make em like this no more...""well.....",5.0,3,6,0
5,60198,18940,I had only briefly read of Dorothy Day and did...,5.0,7,8,0
...,...,...,...,...,...,...,...
52955,65649,34775,I like most everything about this complete ser...,4.0,15,16,1
52959,57179,29091,This powerful program chronicles the inspiring...,5.0,28,28,1
52972,1748,14039,"This is a pretty good body-sculpting workout, ...",4.0,70,72,1
52973,54316,46945,When problems in a school district affects two...,5.0,11,11,1


In [15]:
# BERT baseline adapted from an existing reference implementation

import pandas as pd
import codecs, gc
import numpy as np
from sklearn.model_selection import KFold
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.metrics import top_k_categorical_accuracy
from keras.layers import *
from keras.callbacks import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam
from keras.utils import to_categorical
 
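maxlen = 100  # keep only the first 100 characters of each review; BERT accepts at most 512 tokens
 
# Pretrained BERT checkpoint

config_path = 'uncased_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'uncased_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'uncased_L-12_H-768_A-12/vocab.txt'
 
# Build a dictionary mapping each vocabulary token to its index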
token_dict = {}
with codecs.open(dict_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
 
# Override the tokenizer so that text is split character by character
class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            elif self._is_space(c):
                R.append('[unused1]')  # represent whitespace characters with [unused1]
            else:
                R.append('[UNK]')  # characters missing from the vocabulary become [UNK]
        return R
    
tokenizer = OurTokenizer(token_dict)
 
# Pad every sequence in a batch to the same length with zeros
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])
 
# data_generator yields batches lazily to save memory
class data_generator:
    def __init__(self, data, batch_size=32, shuffle=True):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1
 
    def __len__(self):
        return self.steps
 
    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
 
            if self.shuffle:
                np.random.shuffle(idxs)
 
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append([y])
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y[:, 0, :]
                    [X1, X2, Y] = [], [], []
 
# Top-k accuracy: a prediction counts as correct when the target class is among the k highest-scoring classes
def acc_top1(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=1)
 
# Build the BERT classifier
def build_bert(nclass):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)  # load the pretrained model
 
    for l in bert_model.layers:
        l.trainable = True
 
    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))
 
    x = bert_model([x1_in, x2_in])
    x = Lambda(lambda x: x[:, 0])(x) # take the [CLS] vector for classification
    p = Dense(nclass, activation='softmax')(x)
 
    model = Model([x1_in, x2_in], p)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(1e-5),    # use a sufficiently small learning rate
                  metrics=['accuracy', acc_top1])
    print(model.summary())
    return model
 
# Convert the training data, test data and labels into the model's input format
DATA_LIST = []
for data_row in train_df.iloc[:].itertuples():
    DATA_LIST.append((data_row.reviewText, to_categorical(data_row.label, 2)))
DATA_LIST = np.array(DATA_LIST)
 
DATA_LIST_TEST = []
for data_row in test_df.iloc[:].itertuples():
    DATA_LIST_TEST.append((data_row.reviewText, to_categorical(0, 2)))
DATA_LIST_TEST = np.array(DATA_LIST_TEST)

In [16]:
# Try it on a small subset first
DATA_LIST2 = []
for data_row in small_batch_df.iloc[:].itertuples():
    DATA_LIST2.append((data_row.reviewText, to_categorical(data_row.label, 2)))
DATA_LIST2 = np.array(DATA_LIST2)

In [25]:
nfold = 2
data = DATA_LIST2
data_test = DATA_LIST_TEST

kf = KFold(n_splits=nfold, shuffle=True, random_state=520).split(data)
train_model_pred = np.zeros((len(data), 2))
test_model_pred = np.zeros((len(data_test), 2))
for i, (train_fold, test_fold) in enumerate(kf):
        X_train, X_valid, = data[train_fold, :], data[test_fold, :]
 
        model = build_bert(2)
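                                            # should this monitor acc_top1 instead?
        early_stopping = EarlyStopping(monitor='val_loss', patience=3)   # early stopping to prevent overfitting
        plateau = ReduceLROnPlateau(monitor="acc_top1", verbose=1, mode='max', factor=0.5, patience=2) # halve the learning rate when the metric stops improving
        
        # with metrics=['accuracy'], recent Keras versions log the validation metric as 'val_accuracy'
        checkpoint = ModelCheckpoint('./bert_dump/' + str(i) + '.hdf5', monitor='val_accuracy',verbose=2, save_best_only=True, mode='max', save_weights_only=True) # keep the best weights
 
        train_D = data_generator(X_train, shuffle=True)
        valid_D = data_generator(X_valid, shuffle=False)  # keep the order so predictions line up with test_fold
        test_D = data_generator(data_test, shuffle=False)
        # train the model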
        model.fit_generator(
            train_D.__iter__(),
            steps_per_epoch=100,
            epochs=1,
            validation_data=valid_D.__iter__(),
            validation_steps=len(valid_D),
            callbacks=[early_stopping, plateau, checkpoint],
        )
 
        # model.load_weights('./bert_dump/' + str(i) + '.hdf5')
 
        # return model
        train_model_pred[test_fold, :] = model.predict_generator(valid_D.__iter__(), steps=len(valid_D), verbose=1)
        test_model_pred += model.predict_generator(test_D.__iter__(), steps=len(test_D), verbose=1)
 
        del model
        gc.collect()   # free memory
        K.clear_session()   # reset the Keras backend session
        # break

Model: "functional_17"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
functional_15 (Functional)      (None, None, 768)    108891648   input_5[0][0]                    
                                                                 input_6[0][0]                    
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 768)          0           functional_15[0][0]  

In [29]:
test_pred = [np.argmax(x) for x in test_model_pred]

output=pd.DataFrame({'Id':test_df.Id,'Predicted':test_pred})
# output.to_csv('data/results2.csv', index=None)

In [28]:
test_pred

[1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 ...


In [None]:
model = build_bert(2)


In [5]:
DATA_LIST.shape

(57039, 2)

In [None]:
# Train and evaluate the model with cross-validation
def run_cv(nfold, data, data_labels, data_test):
    kf = KFold(n_splits=nfold, shuffle=True, random_state=520).split(data)
    train_model_pred = np.zeros((len(data), 2))
    test_model_pred = np.zeros((len(data_test), 2))
 
    for i, (train_fold, test_fold) in enumerate(kf):
        X_train, X_valid, = data[train_fold, :], data[test_fold, :]
 
        model = build_bert(2)
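                                            # should this monitor acc_top1 instead?
        early_stopping = EarlyStopping(monitor='val_loss', patience=3)   # early stopping to prevent overfitting
        # with metrics=['accuracy'], recent Keras versions log the validation metric as 'val_accuracy'
        plateau = ReduceLROnPlateau(monitor="val_accuracy", verbose=1, mode='max', factor=0.5, patience=2) # halve the learning rate when the metric stops improving
        
        checkpoint = ModelCheckpoint('./bert_dump/' + str(i) + '.hdf5', monitor='val_accuracy',verbose=2, save_best_only=True, mode='max', save_weights_only=True) # keep the best weights
 
        train_D = data_generator(X_train, shuffle=True)
        valid_D = data_generator(X_valid, shuffle=False)  # keep the order so predictions line up with test_fold
        test_D = data_generator(data_test, shuffle=False)
        # train the model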
        model.fit_generator(
            train_D.__iter__(),
            steps_per_epoch=len(train_D),
            epochs=5,
            validation_data=valid_D.__iter__(),
            validation_steps=len(valid_D),
            callbacks=[early_stopping, plateau, checkpoint],
        )
 
        # model.load_weights('./bert_dump/' + str(i) + '.hdf5')
 
        # return model
        train_model_pred[test_fold, :] = model.predict_generator(valid_D.__iter__(), steps=len(valid_D), verbose=1)
        test_model_pred += model.predict_generator(test_D.__iter__(), steps=len(test_D), verbose=1)
 
        del model
        gc.collect()   # free memory
        K.clear_session()   # reset the Keras backend session
        # break
 
    return train_model_pred, test_model_pred
 
# n-fold cross-validation
train_model_pred, test_model_pred = run_cv(2, DATA_LIST, None, DATA_LIST_TEST)
 
test_pred = [np.argmax(x) for x in test_model_pred]
 
# write the test-set predictions to a file (the test set's column is named Id, not id)
output=pd.DataFrame({'Id':test_df.Id,'Predicted':test_pred})
output.to_csv('data/results.csv', index=None)

In [8]:
from pytorch_transformers import  BertModel, BertConfig,BertTokenizer
from torch import nn
import torch
# ------ Build the model ------
class TextNet(nn.Module):
    def __init__(self,  code_length): # code_length is the output dimension of the fc layer
        super(TextNet, self).__init__()

        modelConfig = BertConfig.from_pretrained('bert-base-uncased')
        self.textExtractor = BertModel.from_pretrained('bert-base-uncased', config=modelConfig)
        embedding_dim = self.textExtractor.config.hidden_size

        self.fc = nn.Linear(embedding_dim, code_length)
        self.tanh = torch.nn.Tanh()

    def forward(self, tokens, segments, input_masks):
        output = self.textExtractor(tokens, token_type_ids=segments,
                                    attention_mask=input_masks)
        text_embeddings = output[0][:, 0, :]
        # output[0] has shape (batch size, sequence length, model hidden dimension)

        features = self.fc(text_embeddings)
        features=self.tanh(features)
        return features

textNet = TextNet(code_length=32)



In [9]:
# ------ Prepare the input ------
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

texts = ["[CLS] Who was Jim Henson ? [SEP]",
        "[CLS] Jim Henson was a puppeteer [SEP]"]

# texts = DATA_LIST[:, 0][:maxlen]

tokens, segments, input_masks = [], [], []
for text in texts:
    tokenized_text = tokenizer.tokenize(text) # split the sentence into tokens
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) # list of vocabulary indices
    tokens.append(indexed_tokens)
    segments.append([0] * len(indexed_tokens))
    input_masks.append([1] * len(indexed_tokens))

max_len = max([len(single) for single in tokens]) # length of the longest sentence

for j in range(len(tokens)):
    padding = [0] * (max_len - len(tokens[j]))
    tokens[j] += padding
    segments[j] += padding
    input_masks[j] += padding
# segments is all zeros because there is only sentence 1 and no sentence 2
# input_masks is 1 for real tokens and 0 for the trailing padding, which only keeps the batch rectangular and carries no meaning;
# it tells BertModel to ignore the padded positions
    
# convert to PyTorch tensors
tokens_tensor = torch.tensor(tokens)
segments_tensors = torch.tensor(segments)
input_masks_tensors = torch.tensor(input_masks)


# ------ Extract text features ------
text_hashCodes = textNet(tokens_tensor , segments_tensors , input_masks_tensors ) # text_hashCodes is a 32-dimensional text feature
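
The padding and masking logic in the cell above can be isolated into a small function, which makes it easier to check on its own. This is a sketch only: `pad_batch` is a hypothetical helper, and the token ids in the example are made up rather than taken from the real BERT vocabulary.

```python
def pad_batch(token_ids, pad_id=0):
    """Pad variable-length id lists and build matching segment ids and attention masks."""
    max_len = max(len(ids) for ids in token_ids)
    tokens, segments, masks = [], [], []
    for ids in token_ids:
        n_pad = max_len - len(ids)
        tokens.append(list(ids) + [pad_id] * n_pad)
        segments.append([0] * max_len)  # single-sentence input: segment 0 everywhere
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return tokens, segments, masks


# Made-up token ids, for illustration only
tokens, segments, input_masks = pad_batch([[11, 7, 12], [11, 7, 8, 9, 12]])
```

Each of the three returned lists is rectangular, so it can be passed directly to `torch.tensor` as in the cell above.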

In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

SyntaxError: 'yield' outside function (<ipython-input-10-a6466f1e4360>, line 21)

In [5]:
# # Save train1.csv and test1.csv containing only the first 20 words of each review
# maxlen = 20
# train_df['fcontent'] = 0
# for i in range(len(train_df['reviewText'])):
#     train_df['fcontent'][i] = " ".join(train_df['reviewText'][i].split(sep = ' ')[:maxlen])
# test_df['fcontent'] = 0
# for i in range(len(test_df['reviewText'])):
#     test_df['fcontent'][i] = " ".join(test_df['reviewText'][i].split(sep = ' ')[:maxlen])

# test_df

# train_new = pd.DataFrame()
# test_new = pd.DataFrame()

# train_new['content'] = train_df['fcontent']
# train_new['label'] = train_df['label']
# test_new['Id'] = test_df['Id']
# test_new['content'] = test_df['fcontent']

# train_new.to_csv('data/train1.csv', index=None)
# test_new.to_csv('data/test1.csv', index = None)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Id,reviewerID,asin,reviewText,overall,fcontent
0,0,82947,37386,I REALLY wanted this series but I am in SHOCK ...,1.0,I REALLY wanted this series but I am in SHOCK ...
1,1,10154,23543,I have to say that this is a work of art for m...,4.0,I have to say that this is a work of art for m...
2,2,5789,5724,Alien 3 is certainly the most controversal fil...,3.0,Alien 3 is certainly the most controversal fil...
3,3,9198,5909,"I love this film...preachy? Well, of course i...",5.0,"I love this film...preachy? Well, of course i..."
4,4,33252,21214,Even though I previously bought the Gamera Dou...,5.0,Even though I previously bought the Gamera Dou...
...,...,...,...,...,...,...
11203,11203,18250,35309,I honestly never heard of the graphic novel un...,5.0,I honestly never heard of the graphic novel un...
11204,11204,3200,2130,Archie Bunker's command to stifle YOURSELF! wa...,5.0,Archie Bunker's command to stifle YOURSELF! wa...
11205,11205,37366,41971,"In LSD - My Problem Child, Albert Hoffman wrot...",5.0,"In LSD - My Problem Child, Albert Hoffman wrot..."
11206,11206,1781,33089,I have owned this DVD for over a year now and ...,5.0,I have owned this DVD for over a year now and ...
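
The pandas warning above comes from the chained assignment `train_df['fcontent'][i] = ...` inside the loop of the commented-out cell. A per-row function avoids both the loop and the warning. The sketch below keeps the same splitting rule as the original (split on single spaces, keep the first 20 pieces); `first_n_words` is a hypothetical helper name.

```python
def first_n_words(text, n=20):
    """Keep the first n space-separated words of a review."""
    return " ".join(text.split(" ")[:n])


# Made-up sentence, for illustration only
sample = "one two three four five six"
truncated = first_n_words(sample, n=4)  # "one two three four"
```

With pandas, the whole column could then be built in one step with `train_df['fcontent'] = train_df['reviewText'].map(first_n_words)`, and likewise for `test_df`.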


In [10]:
import pandas as pd
import numpy as np

# Preprocessed data: content holds only the first 20 words of each review
train_df = pd.read_csv(r'data/train1.csv', sep=',')
test_df = pd.read_csv(r'data/test1.csv', sep=',')
train_df.head()

Unnamed: 0,content,label
0,"First off, allow me to correct a common mistak...",0
1,I am really troubled by this Story and Enterta...,0
2,A near-perfect film version of a downright glo...,1
3,Keep your expectations low. Really really low...,0
4,"""they dont make em like this no more...""well.....",0


In [12]:
# BERT baseline adapted from an existing reference implementation

import pandas as pd
import codecs, gc
import numpy as np
from sklearn.model_selection import KFold
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.metrics import top_k_categorical_accuracy
from keras.layers import *
from keras.callbacks import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam
from keras.utils import to_categorical
 
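maxlen = 100  # keep only the first 100 characters of each review; BERT accepts at most 512 tokens
 
# Pretrained BERT checkpoint
config_path = 'uncased_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'uncased_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'uncased_L-12_H-768_A-12/vocab.txt'
 
# Build a dictionary mapping each vocabulary token to its index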
token_dict = {}
with codecs.open(dict_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
 
# Override the tokenizer so that text is split character by character
class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            elif self._is_space(c):
                R.append('[unused1]')  # represent whitespace characters with [unused1]
            else:
                R.append('[UNK]')  # characters missing from the vocabulary become [UNK]
        return R
    
tokenizer = OurTokenizer(token_dict)
 
# Pad every sequence in a batch to the same length with zeros
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])
 
# data_generator yields batches lazily to save memory
class data_generator:
    def __init__(self, data, batch_size=32, shuffle=True):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1
 
    def __len__(self):
        return self.steps
 
    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
 
            if self.shuffle:
                np.random.shuffle(idxs)
 
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append([y])
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y[:, 0, :]
                    [X1, X2, Y] = [], [], []
 
# Top-k accuracy: a prediction counts as correct when the target class is among the k highest-scoring classes
def acc_top1(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=1)
 
# Build the BERT classifier
def build_bert(nclass):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)  # load the pretrained model
 
    for l in bert_model.layers:
        l.trainable = True
 
    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))
 
    x = bert_model([x1_in, x2_in])
    x = Lambda(lambda x: x[:, 0])(x) # take the [CLS] vector for classification
    p = Dense(nclass, activation='softmax')(x)
 
    model = Model([x1_in, x2_in], p)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(1e-5),    # use a sufficiently small learning rate
                  metrics=['accuracy', acc_top1])
    print(model.summary())
    return model

In [13]:
# Convert the training data, test data and labels into the model's input format
DATA_LIST = []
for data_row in train_df.iloc[:].itertuples():
    DATA_LIST.append((data_row.content, to_categorical(data_row.label, 2)))
DATA_LIST = np.array(DATA_LIST)
 
DATA_LIST_TEST = []
for data_row in test_df.iloc[:].itertuples():
    DATA_LIST_TEST.append((data_row.content, to_categorical(0, 2)))
DATA_LIST_TEST = np.array(DATA_LIST_TEST)

In [14]:
def run_cv(nfold, data, data_labels, data_test):
    kf = KFold(n_splits=nfold, shuffle=True, random_state=520).split(data)
    train_model_pred = np.zeros((len(data), 2))
    test_model_pred = np.zeros((len(data_test), 2))
 
    for i, (train_fold, test_fold) in enumerate(kf):
        X_train, X_valid, = data[train_fold, :], data[test_fold, :]
 
        model = build_bert(2)
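        # with metrics=['accuracy'], recent Keras versions log the validation metric as 'val_accuracy'
        early_stopping = EarlyStopping(monitor='val_accuracy', patience=3)   # early stopping to prevent overfitting
        plateau = ReduceLROnPlateau(monitor="val_accuracy", verbose=1, mode='max', factor=0.5, patience=2) # halve the learning rate when the metric stops improving
        checkpoint = ModelCheckpoint('./bert_dump/' + str(i) + '.hdf5', monitor='val_accuracy',verbose=2, save_best_only=True, mode='max', save_weights_only=True) # keep the best weights
 
        train_D = data_generator(X_train, shuffle=True)
        valid_D = data_generator(X_valid, shuffle=False)  # keep the order so predictions line up with test_fold
        test_D = data_generator(data_test, shuffle=False)
        # train the model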
        model.fit(
            train_D.__iter__(),
            steps_per_epoch=100,
            epochs=2,
            validation_data=valid_D.__iter__(),
            validation_steps=len(valid_D),
            callbacks=[early_stopping, plateau, checkpoint],
        )
 
        # model.load_weights('./bert_dump/' + str(i) + '.hdf5')
 
        # return model
        train_model_pred[test_fold, :] = model.predict(valid_D.__iter__(), steps=len(valid_D), verbose=1)
        test_model_pred += model.predict(test_D.__iter__(), steps=len(test_D), verbose=1)
 
        del model
        gc.collect()   # free memory
        K.clear_session()   # reset the Keras backend session
        # break
 
    return train_model_pred, test_model_pred
 


In [15]:
# n-fold cross-validation
train_model_pred, test_model_pred = run_cv(2, DATA_LIST, None, DATA_LIST_TEST)
 
test_pred = [np.argmax(x) for x in test_model_pred]
 
# write the test-set predictions to a file (the test set's column is named Id, not id)
output=pd.DataFrame({'Id':test_df.Id,'Predicted':test_pred})
# output.to_csv('data/results.csv', index=None)

Model: "functional_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
functional_3 (Functional)       (None, None, 768)    108891648   input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lambda (Lambda)                 (None, 768)          0           functional_3[0][0]    

In [16]:
test_model_pred

array([[1.60793507, 0.39206493],
       [1.6540674 , 0.34593257],
       [1.57430983, 0.42569016],
       ...,
       [1.49494547, 0.50505456],
       [1.61476743, 0.38523252],
       [1.62190527, 0.37809466]])
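
Each row of this array sums to about 2 rather than 1, which is consistent with `run_cv` adding the softmax outputs of both folds into `test_model_pred` without dividing by the number of folds. The argmax is unaffected by that scaling, but the submission format asks for a probability, so a sketch of the conversion follows. `positive_probability` is a hypothetical helper, and the three rows are copied from the output above.

```python
def positive_probability(summed_pred, nfold):
    """Average fold-summed softmax outputs and keep the class-1 column."""
    return [float(row[1]) / nfold for row in summed_pred]


# First three rows of test_model_pred as printed above
summed = [
    [1.60793507, 0.39206493],
    [1.6540674, 0.34593257],
    [1.57430983, 0.42569016],
]
probs = positive_probability(summed, nfold=2)
```

The function iterates over rows, so it also accepts the NumPy array itself; with NumPy the same result is `test_model_pred[:, 1] / nfold`.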

In [17]:
[np.argmax(x) for x in test_model_pred]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 ...


In [12]:
text_hashCodes

tensor([[-2.9873e-01, -9.9776e-02,  4.2495e-01,  ...,  4.4706e-01,
          2.3568e-01, -4.6980e-02],
        [-5.9173e-01,  2.6519e-01,  1.9885e-01,  ...,  3.6595e-01,
          1.9525e-01, -3.8467e-02],
        [-2.5105e-01,  4.2369e-01,  6.5390e-01,  ...,  6.6715e-01,
         -3.1308e-02, -6.8767e-02],
        ...,
        [-2.8702e-01,  1.6408e-01,  2.4477e-01,  ...,  5.3268e-01,
          1.0447e-01, -1.0089e-01],
        [-6.1198e-04,  2.5202e-01,  4.4362e-01,  ...,  3.7491e-01,
          2.9650e-01, -1.8157e-02],
        [-4.8216e-01,  1.7214e-01,  3.7160e-01,  ...,  3.6745e-01,
          7.2876e-02, -3.5178e-01]], grad_fn=<TanhBackward>)

In [18]:
train_model_pred

array([[0.74232954, 0.25767049],
       [0.81577289, 0.18422714],
       [0.7534427 , 0.24655725],
       ...,
       [0.81694144, 0.1830585 ],
       [0.76253951, 0.23746051],
       [0.85831869, 0.14168125]])