# 机器学习第二次作业——谣言检测

马锦贵 学号 2401212867

说明：按照要求，本次作业的报告直接和jupyter notebook合并，方便结合代码解释

本次任务有两个部分：

任务一：在原有代码上实现K-Fold验证，报告基于transformer方法的准确率

任务二：以任务一为baseline，改进提升准确率

# 任务一.
**基于原有代码实现K-Fold交叉验证**，报告基于transformer方法的准确率


**代码的核心改动在数据集加载和模型训练[在代码中标记为“K-Fold关键改动”]**

**结论：**
经过K-Fold验证，基于Transformer的方法的预测准确率**平均是86.81%**

## 导入包

In [1]:
!pip install jieba



In [None]:
import json, os
from tqdm import tqdm

#PyTorch用的包
import torch
import torch.nn as nn
import torch.optim
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# 自然语言处理相关的包
import re #正则表达式的包
import jieba #结巴分词包
from collections import Counter #搜集器，可以让统计词频更简单

#绘图、计算用的程序包
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

## 数据预处理

In [3]:
!git clone https://github.com/thunlp/Chinese_Rumor_Dataset.git

fatal: destination path 'Chinese_Rumor_Dataset' already exists and is not an empty directory.


In [4]:
!ls ./Chinese_Rumor_Dataset
!ls ./Chinese_Rumor_Dataset/CED_Dataset/

CED_Dataset  README.md	rumors_v170613.json
non-rumor-repost  original-microblog  README.md  rumor-repost


### 获取微博文本及其配对标签

In [6]:
# 数据来源文件夹 -- 内含多个json文件
non_rumor = './Chinese_Rumor_Dataset/CED_Dataset/non-rumor-repost'
rumor = './Chinese_Rumor_Dataset/CED_Dataset/rumor-repost'
original = './Chinese_Rumor_Dataset/CED_Dataset/original-microblog'

non_rumor_data = []
rumor_data = []

# 遍历文件夹，读取文本数据
print('开始读取数据')
for file in tqdm(os.listdir(original)):
    try:
        data = json.load(open(os.path.join(original, file), 'rb'))['text']
    except:
        continue

    is_rumor = (file in os.listdir(rumor))
    if is_rumor:
        rumor_data.append(data)
    else:
        non_rumor_data.append(data)

print('结束, 有{}条谣言, 有{}条非谣言!'.format(len(rumor_data), len(non_rumor_data)))
print(non_rumor_data[-2:])
print('-'*20)
print(rumor_data[-2:])


# 把数据储存到指定地方 -- 统一到2个txt文件
pth = './rumor_detection_data'
if not os.path.exists(pth):
    os.makedirs(pth)

good_file = os.path.join(pth, 'non_rumor.txt')
bad_file = os.path.join(pth, 'rumor.txt')

with open(good_file, 'w', encoding='utf-8') as f:
    f.write('\n'.join(non_rumor_data))
with open(bad_file, 'w', encoding='utf-8') as f:
    f.write('\n'.join(rumor_data))

开始读取数据


100%|██████████| 3389/3389 [00:07<00:00, 464.28it/s]

结束, 有1538条谣言, 有1849条非谣言!
['今晚研究百合网，婚恋网站，因为明天做会议主持要对话百合网老大。第一次注册，看到好多姑娘的照片，但是系统给我推荐的怎么全是30岁以上的姑娘呢？在百合上看到一个有意思的统计分析，见附图~，女人应该在25岁之前把自己嫁出去哦~', '【亲，清华食堂新增了热干面哟！】@长江日报 讯：为吸引湖北的优质生源，清华大学食堂推出热干面，据说口味正宗，很受欢迎，想吃得排20分钟队，供不应求。清华大学湖北招生组甚至微博卖萌“亲，刚刚新增了热干面哟~”网友称赞清华以神来之笔，在与北大PK“掐尖”中完胜。[哈哈]']
--------------------
['【新浪内部人控制的用户黑洞】新浪内部技术人员开发了一套微博平台，据称现在该平台拥有上亿用户，并且规模在不断增加。这些用户都是制造的，但可以被操纵去评论、投票等，内部人员利用这些业务让一级代理、二级代理去兜售，形成了特别利益链条。#钛爱拍# http://t.cn/zYC5lcZ', '美国高二的数学...请用一句话概况你的感受！']





### 文本预处理（过滤标点，分词）

In [7]:
# 将文本中的标点符号过滤掉
def filter_punc(sentence):
    sentence = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？?、~@#￥%……&*（）：:；“”】》《-【\][]", "",sentence.strip())
    return sentence

# 扫描所有的文本，分词、建立词典，分出是谣言还是非谣言，is_filter可以过滤是否筛选掉标点符号
def Prepare_data(good_file, bad_file, is_filter = True, threshold=3):
    all_words = [] #存储所有的单词
    pos_sentences = [] #存储非谣言
    neg_sentences = [] #存储谣言
    with open(good_file, 'r', encoding='utf-8') as fr:
        for idx, line in enumerate(fr):
            if is_filter:
                #过滤标点符号
                line = filter_punc(line)
                if not idx: # 只打印第一个例子看看
                    print('分词前：', line)
            #分词
            words = jieba.lcut(line)
            if not idx: # 只打印第一个例子看看
                print('分词后：', words)
            if len(words) > 0:
                all_words += words
                pos_sentences.append(words)
    print('{0} 包含 {1} 行, {2} 个词.'.format(good_file, idx+1, len(all_words)))

    count = len(all_words)
    with open(bad_file, 'r', encoding='utf-8') as fr:
        for idx, line in enumerate(fr):
            if is_filter:
                line = filter_punc(line.strip())
            words = jieba.lcut(line)
            if len(words) > 0:
                all_words += words
                neg_sentences.append(words)
    print('{0} 包含 {1} 行, {2} 个词.'.format(bad_file, idx+1, len(all_words)-count))

    #建立词典，只保留频次大于threshold的单词
    vocab = {'<unk>': 0}
    cnt = Counter(all_words)
    for word, freq in cnt.items():
        if freq > threshold:
            vocab[word] = len(vocab)

    print('过滤掉词频 <= {}的单词后，字典大小：{}'.format(threshold, len(vocab)))
    return pos_sentences, neg_sentences, vocab


pos_sentences, neg_sentences, vocab = Prepare_data(good_file, bad_file, True, threshold=3)

Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...


分词前： 请转发请让国人们都看看请转发请让某些人也看看某某人一定能看到


Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.760 seconds.
DEBUG:jieba:Loading model cost 1.760 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.


分词后： ['请', '转发', '请', '让', '国', '人们', '都', '看看', '请', '转发', '请', '让', '某些', '人', '也', '看看', '某某人', '一定', '能', '看到']
./rumor_detection_data/non_rumor.txt 包含 1849 行, 92943 个词.
./rumor_detection_data/rumor.txt 包含 1538 行, 78645 个词.
过滤掉词频 <= 3的单词后，字典大小：6690


### 数据集划分

In [8]:
# 获得句子的向量化表示
def sentence2vec(word_ids, vocab_size):
    vector = np.zeros(vocab_size)
    for word_id in word_ids:
        vector[word_id] += 1
    return 1.0 * vector / len(word_ids)

bow = [] #词袋
labels = [] #标签
sentences = [] #原始句子，调试用
sentences_id = [] #原始句子对应的index列表

# 处理非谣言
for sentence in pos_sentences:
    new_sentence = []
    for word in sentence:
        new_sentence.append(vocab[word] if word in vocab else vocab['<unk>'])

    bow.append(sentence2vec(new_sentence, len(vocab)))
    labels.append(0) #正标签为0
    sentences.append(sentence)
    sentences_id.append(new_sentence)

# 处理谣言
for sentence in neg_sentences:
    new_sentence = []
    for word in sentence:
        new_sentence.append(vocab[word] if word in vocab else vocab['<unk>'])

    bow.append(sentence2vec(new_sentence, len(vocab)))
    labels.append(1) #负标签为1
    sentences.append(sentence)
    sentences_id.append(new_sentence)

# 打乱所有的数据顺序，形成数据集
# indices为所有数据下标的一个全排列
indices = np.random.permutation(len(bow))

#对整个数据集进行划分，分为：训练集、验证集和测试集，这里是2:1:1
test_size = len(bow) // 4

data = {
    'bow': bow,# 词袋数据
    'labels': labels,# 标签
    'sentences_id': sentences_id,# 句子对应的下标列表
    'sentences': sentences,# 句子
    'vocab': vocab # 词典,
}

### 【K-Fold关键改动】——构造K-**Fold数据集划分**
这里使用KFold类，首先，把测试数据集单独划分出来。然后对剩下的训练数据集做K-Fold划分：把数据集划分5个批次，每一次取其中一批为验证集，剩下4批为训练集。

为了代码解耦，足够灵活，在实现上，构造了一个split列表splits，里面5个元素，每一个元素split都是一个字典，它代表了K-Fold过程中每一次训练的时候训练集、验证集、测试集的划分情况。

在训练的时候，每一次训练都只需要把列表中的一个split字典配置传进去就行，能够和原来的代码无缝衔接，代码兼容性非常好。

In [9]:
# K-Fold交叉验证
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
splits = []
test_split_index = indices[:test_size]
train_split_index = indices[test_size:]

for train_index, val_index in kf.split(train_split_index):
    splits.append(
        {
          'train': train_split_index[train_index],
          'vali': train_split_index[val_index],
          'test': test_split_index
        }
    )


In [10]:
# 查看一下划分情况
for split in splits:
  for key, indices in split.items():
    count = [0, 0]
    for idx in indices:
        count[labels[idx]] += 1
    print(key, '非谣言有{}条，谣言有{}条'.format(count[0], count[1]))

train 非谣言有1096条，谣言有936条
vali 非谣言有282条，谣言有227条
test 非谣言有471条，谣言有375条
train 非谣言有1098条，谣言有935条
vali 非谣言有280条，谣言有228条
test 非谣言有471条，谣言有375条
train 非谣言有1115条，谣言有918条
vali 非谣言有263条，谣言有245条
test 非谣言有471条，谣言有375条
train 非谣言有1104条，谣言有929条
vali 非谣言有274条，谣言有234条
test 非谣言有471条，谣言有375条
train 非谣言有1099条，谣言有934条
vali 非谣言有279条，谣言有229条
test 非谣言有471条，谣言有375条


## 训练/测试函数定义

In [11]:
class AverageMeter(object):
    """
    用于储存与计算平均值
    """
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1, multiply=True):
        self.val = val
        if multiply:
            self.sum += val * n
        else:
            self.sum += val
        self.count += n
        self.avg = self.sum / self.count

def training(model, loader, crit, optim, device):
    # 模型调成训练模式
    model.train()
    # 把模型移到指定设备
    model.to(device)
    # 用于记录损失和正确率
    meter_loss, meter_acc = AverageMeter(), AverageMeter()

    for data in loader:
        # 清空梯度
        optim.zero_grad()
        # 获取数据并将其移至指定设备中, cpu / gpu
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.view(-1)
        # 将输入送入网络，获得输出
        outputs = model(inputs)
        # 计算损失
        loss = crit(outputs, labels)
        # 反向传播，计算梯度
        loss.backward()
        # 更新网络参数
        optim.step()

        # 记录损失
        num_sample = inputs.size(0)
        meter_loss.update(loss.item(), num_sample)
        # 记录预测正确率
        preds = outputs.max(dim=1)[1] # 网络预测的类别结果
        correct = (preds == labels).sum() # 计算预测的正确个数
        meter_acc.update(correct.item(), num_sample, multiply=False)

    # 返回训练集的平均损失和平均正确率
    return meter_loss.avg, meter_acc.avg

@torch.no_grad()
def evaluate(model, loader, crit, device):
    # 模型调成评估模式
    model.eval()
    # 把模型移到指定设备
    model.to(device)
    # 用于记录损失和正确率
    meter_loss, meter_acc = AverageMeter(), AverageMeter()
    for data in loader:
        # 获取数据并将其移至指定设备中, cpu / gpu
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.view(-1)
        # 将输入送入网络，获得输出
        outputs = model(inputs)

        # 计算并记录损失
        loss = crit(outputs, labels)
        num_sample = inputs.size(0)
        meter_loss.update(loss.item(), num_sample)
        # 记录预测正确率
        preds = outputs.max(dim=1)[1] # 网络预测的类别结果
        correct = (preds == labels).sum() # 计算预测的正确个数
        meter_acc.update(correct.item(), num_sample, multiply=False)

    return meter_loss.avg, meter_acc.avg

## 基于MLP的方法

### 数据加载器定义

In [15]:
class BaseDataset(Dataset):
    def __init__(self, data, split):
        super().__init__()
        self.make_dataset(data, split)

    def make_dataset(self, data, split):
        # Data是包含了整个数据集的数据
        # 而我们只需要训练集/验证集/测试集的数据
        # 我们按照划分基准split里面的下标来确定加载哪部分的数据
        self.dataset = []
        for idx in split:
            item = [torch.FloatTensor(data['bow'][idx]),
                    torch.LongTensor([data['labels'][idx]])]
            self.dataset.append(item)

    def __getitem__(self, ix):
        # ix大于等于0，小于len(self.dataset)
        return self.dataset[ix]

    def __len__(self):
        # 一共有多少数据
        return len(self.dataset)


def get_loader(data, split, batch_size=64, class_func=BaseDataset):
    # split.keys() 包括 'train', 'vali', 'test'
    # 所以此函数是为了拿到训练集，验证集和测试集的数据加载器
    loader = []
    for mode in split.keys():
        # split[mode]指定了要取data的哪些数据
        dataset = class_func(data, split[mode])
        # Dataloader可帮助我们一次性取batch_size个样本出来
        loader.append(
            DataLoader(
                dataset,
                batch_size = batch_size,
                shuffle = True if mode=='train' else False
            )
        )
    return loader

In [21]:
# 测试一下
_, _, fake_loader = get_loader(data, splits[0], 64)
x, y = iter(fake_loader).__next__()
print('词袋输入的形状：', x.shape)
print('标签的形状：', y.shape)

词袋输入的形状： torch.Size([64, 6690])
标签的形状： torch.Size([64, 1])


### 模型定义

In [22]:
# 一个简单的前馈神经网络，三层，第一层线性层，加一个非线性ReLU，第二层线性层
# 输入维度为词典的大小：每一段评论的词袋模型
class Linear_Model(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, num_class=2, dropout=0):
        super(Linear_Model,self).__init__()

        self.net = nn.Sequential(
                    nn.Linear(vocab_size, hidden_size),
                    nn.ReLU(),
                    nn.Dropout(dropout),
                    nn.Linear(hidden_size, num_class),
                )

    def forward(self,x):
        x = self.net(x)
        x = F.log_softmax(x, dim=1)
        return x

### 【K-Fold关键改动】模型训练
这里只需要把原来的训练代码外层套一个for循环，每一次循环的时候读取前面已经K-Fold划分好的split配置即可（它本质是提供了当次训练数据集的划分索引情况）。

In [25]:
# 参数
num_epochs = 10
learning_rate = 0.005
batch_size = 32
vocab_size = len(vocab)
hidden_size = 10
dropout = 0.0

# 运行的设备
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# 进行K-Fold验证

k_fold_val_acc = []
k_fold_test_acc = []
for val_index in range(5):
  split = splits[val_index]
  print('\n交叉验证第%d / 5 轮训练\n'% val_index)
  # 数据加载器
  train_loader, vali_loader, test_loader = get_loader(data, split, batch_size=batch_size, class_func=BaseDataset)
  # 模型实例化
  model = Linear_Model(vocab_size, hidden_size, dropout=dropout)
  # 打印模型
  # print(model)
  # 损失函数 -- 交叉熵
  crit = torch.nn.NLLLoss()
  # 优化方法 -- Adam
  optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

  records = []
  for epoch in range(num_epochs):
    # 训练
    train_loss, train_acc = training(model, train_loader, crit, optimizer, device)
    # 验证
    vali_loss, vali_acc = evaluate(model, vali_loader, crit, device)
    # 打印消息
    print('第{}轮，训练集损失：{:.2f}, 训练集准确率：{:.2f}, 验证集损失：{:.2f}, 验证集准确率: {:.2f}'.format(
        epoch, train_loss, train_acc, vali_loss, vali_acc))
    # 储存信息以便可视化
    records.append([train_loss, train_acc, vali_loss, vali_acc])
    if epoch == num_epochs - 1:
      k_fold_val_acc.append(vali_acc)
  # 测试
  _, test_acc = evaluate(model, test_loader, crit, device)
  k_fold_test_acc.append(test_acc)
  print('测试集正确率：', test_acc)

print('K-fold交叉验证，验证集的平均准确率：%.4f, 测试集的平均准确率：%.4f' % (sum(k_fold_val_acc)/len(k_fold_val_acc),sum(k_fold_test_acc)/len(k_fold_test_acc)))



交叉验证第0 / 5 轮训练

第0轮，训练集损失：0.66, 训练集准确率：0.59, 验证集损失：0.61, 验证集准确率: 0.71
第1轮，训练集损失：0.52, 训练集准确率：0.81, 验证集损失：0.46, 验证集准确率: 0.87
第2轮，训练集损失：0.37, 训练集准确率：0.89, 验证集损失：0.38, 验证集准确率: 0.88
第3轮，训练集损失：0.27, 训练集准确率：0.93, 验证集损失：0.34, 验证集准确率: 0.88
第4轮，训练集损失：0.20, 训练集准确率：0.96, 验证集损失：0.31, 验证集准确率: 0.90
第5轮，训练集损失：0.15, 训练集准确率：0.97, 验证集损失：0.30, 验证集准确率: 0.89
第6轮，训练集损失：0.12, 训练集准确率：0.98, 验证集损失：0.28, 验证集准确率: 0.89
第7轮，训练集损失：0.09, 训练集准确率：0.99, 验证集损失：0.28, 验证集准确率: 0.88
第8轮，训练集损失：0.07, 训练集准确率：0.99, 验证集损失：0.28, 验证集准确率: 0.88
第9轮，训练集损失：0.06, 训练集准确率：0.99, 验证集损失：0.28, 验证集准确率: 0.89
测试集正确率： 0.8829787234042553

交叉验证第1 / 5 轮训练

第0轮，训练集损失：0.66, 训练集准确率：0.62, 验证集损失：0.60, 验证集准确率: 0.76
第1轮，训练集损失：0.49, 训练集准确率：0.84, 验证集损失：0.45, 验证集准确率: 0.84
第2轮，训练集损失：0.34, 训练集准确率：0.90, 验证集损失：0.38, 验证集准确率: 0.86
第3轮，训练集损失：0.25, 训练集准确率：0.93, 验证集损失：0.34, 验证集准确率: 0.87
第4轮，训练集损失：0.19, 训练集准确率：0.96, 验证集损失：0.32, 验证集准确率: 0.87
第5轮，训练集损失：0.14, 训练集准确率：0.97, 验证集损失：0.30, 验证集准确率: 0.87
第6轮，训练集损失：0.11, 训练集准确率：0.98, 验证集损失：0.29, 验证集准确率: 0.88
第7轮，训练集损失：0.08, 训练集准确

## 基于Transformer的方法

### 数据加载器定义

In [26]:
class MyDataset(Dataset):
    def __init__(self, data, split):
        super().__init__()
        self.vocab = data['vocab']
        self.pad_index = len(self.vocab.keys()) if '<pad>' not in self.vocab.keys() else self.vocab['<pad>']
        self.max_len = data.get('max_len', 30)
        self.make_dataset(data, split)

    def make_dataset(self, data, split):
        # Data是包含了整个数据集的数据
        # 而我们只需要训练集/验证集/测试集的数据
        # 我们按照划分基准split里面的下标来确定加载哪部分的数据
        self.dataset = []
        for idx in split:
            this_sentence_id = data['sentences_id'][idx]
            item = [
                torch.LongTensor(self.pad_data(this_sentence_id)),
                torch.LongTensor([data['labels'][idx]])
            ]
            self.dataset.append(item)

    def pad_data(self, seq):
        # 让序列长度最长只有max_len，不足就补pad，超过就截断
        if len(seq) < self.max_len:
            seq += [self.pad_index] * (self.max_len - len(seq))

        else:
            seq = seq[:self.max_len]
        return seq

    def get_pad_index(self):
        return self.pad_index

    def __getitem__(self, ix):
        # ix大于等于0，小于len(self.dataset)
        return self.dataset[ix]

    def __len__(self):
        # 一共有多少数据
        return len(self.dataset)

### Transformer基础组件实现

In [27]:
import math
class ScaledDotProductAttention(nn.Module):
    def __init__(self, hidden_size, num_heads, dropout):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.attention_head_size = hidden_size // num_heads
        self.all_head_size = hidden_size

        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_heads, self.attention_head_size)
        x = x.view(*new_x_shape) # [bsz, seq_len, n_head, head_size]
        return x.permute(0, 2, 1, 3) # [bsz, n_head, seq_len, head_size]

    def forward(self, q, k, v, attention_mask=None):
        query = self.transpose_for_scores(self.query(q)) # [bsz, n_head, lq, head_size]
        key = self.transpose_for_scores(self.key(k))     # [bsz, n_head, lk, head_size]
        value = self.transpose_for_scores(self.value(v)) # [bsz, n_head, lv, head_size]

        attention_scores = torch.matmul(query, key.transpose(-1, -2)) # [bsz, n_head, lq, lk]
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)

        if attention_mask is not None:
            if attention_mask.dim() == 2:
                attention_mask = attention_mask[:, None, None, :] # [bsz, 1, 1, lk]
            if attention_mask.dim() == 3:
                attention_mask = attention_mask[:, None, :, :] # [bsz, 1, lq, lk]
            attention_scores = attention_scores.masked_fill(attention_mask, -1e9)

        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs) # [bsz, n_head, lq, lk]

        context = torch.matmul(attention_probs, value) # [bsz, n_head, lq, head_size]
        context = context.permute(0, 2, 1, 3).contiguous() # [bsz, lq, n_head, head_size]

        new_context_shape = context.size()[:-2] + (self.all_head_size,)
        context = context.view(*new_context_shape) # [bsz, lq, dim_hidden]

        return context, attention_probs


class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_size, num_heads, attn_dropout, dropout):
        super().__init__()
        self.SDPA = ScaledDotProductAttention(hidden_size, num_heads, attn_dropout)
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.LayerNorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        q, k, v = hidden_states, hidden_states, hidden_states
        context, attention_probs = self.SDPA(q, k, v, attention_mask)
        context = self.dense(context)
        context = self.dropout(context)

        hidden_states = self.LayerNorm(hidden_states + context)
        return hidden_states, attention_probs


class FeedForwardNetwork(nn.Module):
    def __init__(self, hidden_size, intermediate_size, dropout=.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(intermediate_size, hidden_size),
            nn.Dropout(dropout)
        )
        self.LN = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.LN(x + self.net(x))


class TransformerEncoderLayer(nn.Module):
    def __init__(self, hidden_size, num_heads, intermediate_size, attn_dropout=.0, dropout=.0):
        super().__init__()
        self.mha = MultiHeadAttention(hidden_size, num_heads, attn_dropout, dropout)
        self.ffn = FeedForwardNetwork(hidden_size, intermediate_size, dropout)

    def forward(self, hidden_states, attention_mask):
        hidden_states, _ = self.mha(hidden_states, attention_mask)
        hidden_states = self.ffn(hidden_states)
        return hidden_states


class TransformerEncoder(nn.Module):
    def __init__(self, hidden_size, num_layers, num_heads, intermediate_size, attn_dropout=.0, dropout=.0):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(hidden_size, num_heads, intermediate_size, attn_dropout, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, attn_mask):
        for layer in self.layers:
            x = layer(x, attn_mask)
        return x

### 模型定义

In [28]:
class TFModel(nn.Module):
    def __init__(self,
            vocab_size,
            hidden_size,
            max_len,
            pad_index,
            num_class=2,
            num_heads=4,
            num_layers=1,
            dropout=0.0,
            attn_dropout=0.0
    ):
        super().__init__()
        self.word_embs = nn.Embedding(vocab_size + 1, hidden_size, padding_idx=pad_index)
        self.position_embs = nn.Embedding(max_len+1, hidden_size)

        self.net = TransformerEncoder(
            hidden_size,
            num_layers,
            num_heads,
            4 * hidden_size,
            attn_dropout,
            dropout
        )

        self.dropout = nn.Dropout(dropout)
        self.prj = nn.Linear(hidden_size, num_class)

        self.cls_index = vocab_size
        self.pad_index = pad_index

        nn.init.normal_(self.word_embs.weight, std=.02)
        nn.init.normal_(self.position_embs.weight, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, input_ids):
        batch_size, device = input_ids.size(0), input_ids.device
        cls_tokens = torch.zeros((batch_size, 1)).to(input_ids.device) + self.cls_index
        cls_tokens = cls_tokens.long()
        input_ids = torch.cat((cls_tokens, input_ids), dim=1)

        embs = self.word_embs(input_ids)

        seq_len = embs.size(1)
        position_ids = torch.arange(seq_len, dtype=torch.long, device=device)
        position_ids = position_ids[None, :]
        position_embs = self.position_embs(position_ids)

        embs = embs + position_embs
        embs = self.dropout(embs)

        attention_mask = (input_ids == self.pad_index) # (batch_size, seq_len)
        hidden_states = self.net(embs, attention_mask)

        cls_hidden_state = hidden_states[:, 0, :]
        cls_hidden_state = self.dropout(cls_hidden_state)

        output = self.prj(cls_hidden_state)
        output = F.log_softmax(output, dim=-1)

        return output


### 【K-Fold关键改动】transformer K-Fold训练
和前面的MLP模型一样，只需要把原来的训练代码外层包一层for循环，在每一个训练的时候，加载对应的split配置（它提供了当次训练数据集的划分情况的索引）

In [29]:
# 参数
num_epochs = 10
learning_rate = 0.005
batch_size = 128
vocab_size = len(vocab)+1
hidden_size = 16
num_heads = 4
num_layers = 1
attn_dropout = 0.1
dropout = 0.3

data['max_len'] = 100

# 运行的设备
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# 进行K-Fold验证

k_fold_val_acc = []
k_fold_test_acc = []
for val_index in range(5):
  split = splits[val_index]
  print('\n交叉验证第%d / 5 轮训练\n'% val_index)
  # 数据加载器
  train_loader, vali_loader, test_loader = get_loader(data, split, batch_size=batch_size, class_func=MyDataset)
  # 模型实例化
  model = TFModel(
    vocab_size,
    hidden_size,
    data['max_len'],
    pad_index=train_loader.dataset.get_pad_index(),
    num_heads=num_heads,
    num_layers=num_layers,
    attn_dropout=attn_dropout,
    dropout=dropout
  )
  # 打印模型
  #print(model)
  # 损失函数 -- 交叉熵
  crit = torch.nn.NLLLoss()
  # 优化方法
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

  records = []
  for epoch in range(num_epochs):
    # 训练
    train_loss, train_acc = training(model, train_loader, crit, optimizer, device)
    # 验证
    vali_loss, vali_acc = evaluate(model, vali_loader, crit, device)
    # 打印消息
    print('第{}轮，训练集损失：{:.2f}, 训练集准确率：{:.2f}, 验证集损失：{:.2f}, 验证集准确率: {:.2f}'.format(
        epoch, train_loss, train_acc, vali_loss, vali_acc))
    # 储存信息以便可视化
    records.append([train_loss, train_acc, vali_loss, vali_acc])
    if epoch == num_epochs - 1:
      k_fold_val_acc.append(vali_acc)
  # 测试
  _, test_acc = evaluate(model, test_loader, crit, device)
  print('测试集正确率：', test_acc)
  k_fold_test_acc.append(test_acc)
  print('测试集正确率：', test_acc)

print('K-fold交叉验证，验证集的平均准确率：%.4f, 测试集的平均准确率：%.4f' % (sum(k_fold_val_acc)/len(k_fold_val_acc),sum(k_fold_test_acc)/len(k_fold_test_acc)))




交叉验证第0 / 5 轮训练

第0轮，训练集损失：0.69, 训练集准确率：0.55, 验证集损失：0.68, 验证集准确率: 0.53
第1轮，训练集损失：0.55, 训练集准确率：0.77, 验证集损失：0.38, 验证集准确率: 0.86
第2轮，训练集损失：0.25, 训练集准确率：0.92, 验证集损失：0.35, 验证集准确率: 0.88
第3轮，训练集损失：0.14, 训练集准确率：0.96, 验证集损失：0.40, 验证集准确率: 0.88
第4轮，训练集损失：0.09, 训练集准确率：0.98, 验证集损失：0.42, 验证集准确率: 0.88
第5轮，训练集损失：0.06, 训练集准确率：0.99, 验证集损失：0.46, 验证集准确率: 0.87
第6轮，训练集损失：0.04, 训练集准确率：0.99, 验证集损失：0.44, 验证集准确率: 0.89
第7轮，训练集损失：0.03, 训练集准确率：1.00, 验证集损失：0.54, 验证集准确率: 0.88
第8轮，训练集损失：0.02, 训练集准确率：1.00, 验证集损失：0.58, 验证集准确率: 0.88
第9轮，训练集损失：0.02, 训练集准确率：1.00, 验证集损失：0.68, 验证集准确率: 0.88
测试集正确率： 0.8605200945626478
测试集正确率： 0.8605200945626478

交叉验证第1 / 5 轮训练

第0轮，训练集损失：0.69, 训练集准确率：0.54, 验证集损失：0.69, 验证集准确率: 0.55
第1轮，训练集损失：0.62, 训练集准确率：0.68, 验证集损失：0.42, 验证集准确率: 0.82
第2轮，训练集损失：0.31, 训练集准确率：0.90, 验证集损失：0.33, 验证集准确率: 0.88
第3轮，训练集损失：0.16, 训练集准确率：0.95, 验证集损失：0.40, 验证集准确率: 0.87
第4轮，训练集损失：0.10, 训练集准确率：0.97, 验证集损失：0.39, 验证集准确率: 0.89
第5轮，训练集损失：0.06, 训练集准确率：0.99, 验证集损失：0.49, 验证集准确率: 0.88
第6轮，训练集损失：0.03, 训练集准确率：0.99, 验证集损失：0.47, 验证集准确率:

# 任务二.
**改进**，设法提升谣言检测的准确率


**结论：**
经过K-Fold验证，改进后的方法的谣言检测准确率**平均是86.81%**

## 数据集加载
原先的词袋模型采用one-hot的方法编码词向量，丢失了很多词语之间的关联性信息，所以考虑换成transformer的token化方法对词向量进行编码。

In [6]:
import os
os.environ["HF_TOKEN"] = "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" # 模型加载需要

# 数据来源文件夹 -- 内含多个json文件
non_rumor = './Chinese_Rumor_Dataset/CED_Dataset/non-rumor-repost'
rumor = './Chinese_Rumor_Dataset/CED_Dataset/rumor-repost'
original = './Chinese_Rumor_Dataset/CED_Dataset/original-microblog'

non_rumor_data = []
rumor_data = []

# 遍历文件夹，读取文本数据
print('开始读取数据')
for file in tqdm(os.listdir(original)):
    try:
        data = json.load(open(os.path.join(original, file), 'rb'))['text']
    except:
        continue

    is_rumor = (file in os.listdir(rumor))
    if is_rumor:
        rumor_data.append(data)
    else:
        non_rumor_data.append(data)

print('结束, 有{}条谣言, 有{}条非谣言!'.format(len(rumor_data), len(non_rumor_data)))
print(non_rumor_data[-2:])
print('-'*20)
print(rumor_data[-2:])

labels = [0]*len(non_rumor_data)
labels.extend([1]*len(rumor_data))
data = []
data.extend(non_rumor_data)
data.extend(rumor_data)

print('datalength:%d,labels length:%d'%(len(data),len(labels)))



开始读取数据


100%|██████████| 3389/3389 [00:06<00:00, 498.02it/s]

结束, 有1538条谣言, 有1849条非谣言!
['今晚研究百合网，婚恋网站，因为明天做会议主持要对话百合网老大。第一次注册，看到好多姑娘的照片，但是系统给我推荐的怎么全是30岁以上的姑娘呢？在百合上看到一个有意思的统计分析，见附图~，女人应该在25岁之前把自己嫁出去哦~', '卡索拉上季在西甲创造84次射门机会，仅次于厄齐尔、梅西和纳瓦斯，他决心加盟枪手前还从皮雷和小法那得到很多鼓励。阿森纳队近年来罕见地一口气引进波多尔斯基、吉鲁和卡索拉三位国脚级新援。高高在上的范大将军，曼城、尤文图斯都打退堂鼓了，难道你真的，真的下定决心要走出“枪门”？']
--------------------
['李双江之子李天一涉嫌强奸罪于7日被批捕。李的76人律师团领队、法律大学副校长张爱国教授对媒体表示，李天一因第一个与被害女子发生性关系，不构成轮奸罪，只是以判罚较轻的强奸罪批捕，这是律师团所有成员共同努力的结果。http://t.cn/zYmXiH2', '「求辟谣：常务副市长陈尸护城河」原铁岭市常务副市长清华才子袁卫亮尸沉沈阳护城河！当地有关 部门及个别领导正在全力以赴不惜代价不择手段组织在各大论坛贴吧博客删帖，仅此一 贴，即创昨日中国删帖经济历史最高，个别管理单笔收入超过10万元，现在关于袁卫亮死讯99%以上都打不开，令死因扑朔迷离。 \u200b']
datalength:3387,labels length:3387





In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AdamW
from sklearn.metrics import accuracy_score



In [8]:
class RumorDataset(Dataset):
    def __init__(self, data, labels, tokenizer, max_len):
        self.data = data
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

In [9]:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("clue/roberta_chinese_base")
model = BertModel.from_pretrained("clue/roberta_chinese_base")


# tokenizer = RobertaTokenizer.from_pretrained('roberta_chinese_base')
# model = RobertaForSequenceClassification.from_pretrained('roberta-base-chinese', num_labels=2)


max_len = 512  # 可以根据实际情况调整最大序列长度


# 创建数据集实例
dataset = RumorDataset(data, labels, tokenizer, max_len)

# 创建数据加载器，并设置shuffle=True以打散数据集
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'. 
The class this function is called from is 'BertTokenizer'.
You are using a model of type roberta to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


In [10]:
def train_epoch(model, data_loader, optimizer, device):
    print('epoch start')
    model.train()
    total_loss = 0
    iter = 0
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        print('iter %d loss = %.2f'%(iter, loss / len(batch)))
        iter = iter + 1
    return total_loss / len(data_loader)

def evaluate(model, data_loader, device):
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=1)
            y_pred.extend(predictions.cpu().numpy())
            y_true.extend(labels.cpu().numpy())

    accuracy = accuracy_score(y_true, y_pred)
    return accuracy

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 3
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, device)
    accuracy = evaluate(model, train_loader, device)
    print(f'Epoch {epoch+1}, Train Loss: {train_loss:.4f}, Train Accuracy: {accuracy:.4f}')



epoch start


In [12]:
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, split):
        super().__init__()
        self.vocab = data['vocab']
        self.pad_index = len(self.vocab.keys()) if '<pad>' not in self.vocab.keys() else self.vocab['<pad>']
        self.max_len = data.get('max_len', 30)
        self.make_dataset(data, split)

    def make_dataset(self, data, split):
        self.dataset = []
        for idx in split:
            this_sentence_id = data['sentences_id'][idx]
            item = [
                torch.LongTensor(self.pad_data(this_sentence_id)),
                torch.LongTensor([data['labels'][idx]]),
                # 保存原始序列长度，用于生成attention_mask
                torch.LongTensor([len(this_sentence_id)])
            ]
            self.dataset.append(item)

    def pad_data(self, seq):
        if len(seq) < self.max_len:
            seq += [self.pad_index] * (self.max_len - len(seq))
        else:
            seq = seq[:self.max_len]
        return seq

    def get_pad_index(self):
        return self.pad_index

    def __getitem__(self, ix):
        item = self.dataset[ix]
        input_ids = item[0]
        label = item[1]
        orig_len = item[2].item()  # 获取原始序列长度
        attention_mask = torch.zeros(self.max_len, dtype=torch.long)
        attention_mask[:orig_len] = 1  # 将原始序列部分设置为1
        return input_ids, attention_mask, label

    def __len__(self):
        return len(self.dataset)

## 定义训练和测试函数

In [34]:
class AverageMeter(object):
    """
    用于储存与计算平均值
    """
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1, multiply=True):
        self.val = val
        if multiply:
            self.sum += val * n
        else:
            self.sum += val
        self.count += n
        self.avg = self.sum / self.count

def training(model, loader, crit, optim, device):
    # 模型调成训练模式
    model.train()
    # 把模型移到指定设备
    model.to(device)
    # 用于记录损失和正确率
    meter_loss, meter_acc = AverageMeter(), AverageMeter()
    meter_loss.reset()
    meter_acc.reset()
    for data in loader:
        # 清空梯度
        optim.zero_grad()
        # 获取数据并将其移至指定设备中, cpu / gpu
        inputs, mask, labels = data
        inputs, mask, labels = inputs.to(device), mask.to(device), labels.to(device)
        labels = labels.view(-1)
        all_inputs = {
            'input_ids': inputs,
            'attention_mask': mask,
            'labels': labels
        }
        # 将输入送入网络，获得输出
        outputs = model(**all_inputs).logits

        # 计算损失
        loss = crit(outputs, labels)
        # 反向传播，计算梯度
        loss.backward()
        # 更新网络参数
        optim.step()

        # 记录损失
        num_sample = inputs.size(0)
        meter_loss.update(loss.item(), num_sample, multiply=False)
        # 记录预测正确率
        preds = outputs.max(dim=1)[1] # 网络预测的类别结果
        correct = (preds == labels).sum() # 计算预测的正确个数
        meter_acc.update(correct.item(), num_sample, multiply=False)

    # 返回训练集的平均损失和平均正确率
    return meter_loss.avg, meter_acc.avg

@torch.no_grad()
def evaluate(model, loader, crit, device):
    # 模型调成评估模式
    model.eval()
    # 把模型移到指定设备
    model.to(device)
    # 用于记录损失和正确率
    meter_loss, meter_acc = AverageMeter(), AverageMeter()
    meter_loss.reset()
    meter_acc.reset()
    for data in loader:
        # 获取数据并将其移至指定设备中, cpu / gpu
        inputs, mask, labels = data
        inputs, mask, labels = inputs.to(device), mask.to(device), labels.to(device)
        labels = labels.view(-1)
        all_inputs = {
            'input_ids': inputs,
            'attention_mask': mask,
            'labels': labels
        }
        # 将输入送入网络，获得输出
        outputs = model(**all_inputs).logits

        # 计算并记录损失
        loss = crit(outputs, labels)
        print(outputs, labels)
        num_sample = inputs.size(0)
        meter_loss.update(loss.item(), num_sample,multiply=False)
        # 记录预测正确率
        preds = outputs.max(dim=1)[1] # 网络预测的类别结果
        correct = (preds == labels).sum() # 计算预测的正确个数
        meter_acc.update(correct.item(), num_sample, multiply=False)

    return meter_loss.avg, meter_acc.avg

## 模型训练

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score

num_epochs = 50
learning_rate = 2e-6
# 运行的设备
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# 进行K-Fold验证

k_fold_val_acc = []
k_fold_test_acc = []
for val_index in range(5):
  split = splits[val_index]
  print('\n交叉验证第%d / 5 轮训练\n'% val_index)
  # 数据加载器
  train_loader, vali_loader, test_loader = get_loader(data, split, batch_size=8, class_func=MyDataset)
  # 加载预训练的BERT模型和分词器
  model_name = 'bert-base-uncased'
  tokenizer = BertTokenizer.from_pretrained(model_name)
  model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

  # 打印模型
  #print(model)
  # 损失函数 -- 交叉熵
  crit = torch.nn.NLLLoss()
  # 优化方法
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

  records = []
  for epoch in range(num_epochs):
    # 训练
    train_loss, train_acc = training(model, train_loader, crit, optimizer, device)
    # 验证
    vali_loss, vali_acc = evaluate(model, vali_loader, crit, device)
    # 打印消息
    print('第{}轮，训练集损失：{:.2f}, 训练集准确率：{:.2f}, 验证集损失：{:.2f}, 验证集准确率: {:.2f}'.format(
        epoch, train_loss, train_acc, vali_loss, vali_acc))
    # 储存信息以便可视化
    records.append([train_loss, train_acc, vali_loss, vali_acc])
    if epoch == num_epochs - 1:
      k_fold_val_acc.append(vali_acc)
  # 测试
  _, test_acc = evaluate(model, test_loader, crit, device)
  print('测试集正确率：', test_acc)
  k_fold_test_acc.append(test_acc)
  print('测试集正确率：', test_acc)

print('K-fold交叉验证，验证集的平均准确率：%.4f, 测试集的平均准确率：%.4f' % (sum(k_fold_val_acc)/len(k_fold_val_acc),sum(k_fold_test_acc)/len(k_fold_test_acc)))




交叉验证第0 / 5 轮训练



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


第0轮，训练集损失：-0.23, 训练集准确率：0.60, 验证集损失：-0.42, 验证集准确率: 0.64
第1轮，训练集损失：-0.50, 训练集准确率：0.63, 验证集损失：-0.63, 验证集准确率: 0.69
第2轮，训练集损失：-0.69, 训练集准确率：0.69, 验证集损失：-0.81, 验证集准确率: 0.75
第3轮，训练集损失：-0.86, 训练集准确率：0.72, 验证集损失：-0.96, 验证集准确率: 0.78
第4轮，训练集损失：-1.00, 训练集准确率：0.77, 验证集损失：-1.09, 验证集准确率: 0.79
第5轮，训练集损失：-1.11, 训练集准确率：0.79, 验证集损失：-1.19, 验证集准确率: 0.82
第6轮，训练集损失：-1.22, 训练集准确率：0.82, 验证集损失：-1.29, 验证集准确率: 0.83
第7轮，训练集损失：-1.31, 训练集准确率：0.86, 验证集损失：-1.36, 验证集准确率: 0.80
第8轮，训练集损失：-1.40, 训练集准确率：0.88, 验证集损失：-1.45, 验证集准确率: 0.84
第9轮，训练集损失：-1.48, 训练集准确率：0.90, 验证集损失：-1.52, 验证集准确率: 0.83
第10轮，训练集损失：-1.56, 训练集准确率：0.91, 验证集损失：-1.57, 验证集准确率: 0.81
第11轮，训练集损失：-1.63, 训练集准确率：0.93, 验证集损失：-1.65, 验证集准确率: 0.83
第12轮，训练集损失：-1.70, 训练集准确率：0.94, 验证集损失：-1.71, 验证集准确率: 0.83
第13轮，训练集损失：-1.77, 训练集准确率：0.94, 验证集损失：-1.77, 验证集准确率: 0.84
第14轮，训练集损失：-1.84, 训练集准确率：0.95, 验证集损失：-1.81, 验证集准确率: 0.82
第15轮，训练集损失：-1.90, 训练集准确率：0.95, 验证集损失：-1.87, 验证集准确率: 0.82
第16轮，训练集损失：-1.96, 训练集准确率：0.95, 验证集损失：-1.92, 验证集准确率: 0.83
第17轮，训练集损失：-2.02, 训练集准确率：0.96, 验证集损失：-1.9

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


第0轮，训练集损失：-0.24, 训练集准确率：0.57, 验证集损失：-0.42, 验证集准确率: 0.57
第1轮，训练集损失：-0.51, 训练集准确率：0.61, 验证集损失：-0.65, 验证集准确率: 0.62
第2轮，训练集损失：-0.71, 训练集准确率：0.68, 验证集损失：-0.83, 验证集准确率: 0.71
第3轮，训练集损失：-0.88, 训练集准确率：0.73, 验证集损失：-0.99, 验证集准确率: 0.70
第4轮，训练集损失：-1.03, 训练集准确率：0.76, 验证集损失：-1.12, 验证集准确率: 0.73
第5轮，训练集损失：-1.15, 训练集准确率：0.80, 验证集损失：-1.23, 验证集准确率: 0.77
第6轮，训练集损失：-1.26, 训练集准确率：0.83, 验证集损失：-1.33, 验证集准确率: 0.80
第7轮，训练集损失：-1.36, 训练集准确率：0.86, 验证集损失：-1.42, 验证集准确率: 0.83
第8轮，训练集损失：-1.45, 训练集准确率：0.87, 验证集损失：-1.49, 验证集准确率: 0.81
第9轮，训练集损失：-1.53, 训练集准确率：0.88, 验证集损失：-1.56, 验证集准确率: 0.82
第10轮，训练集损失：-1.62, 训练集准确率：0.91, 验证集损失：-1.65, 验证集准确率: 0.85
第11轮，训练集损失：-1.71, 训练集准确率：0.92, 验证集损失：-1.71, 验证集准确率: 0.84
第12轮，训练集损失：-1.78, 训练集准确率：0.93, 验证集损失：-1.78, 验证集准确率: 0.85
第13轮，训练集损失：-1.85, 训练集准确率：0.93, 验证集损失：-1.85, 验证集准确率: 0.86
第14轮，训练集损失：-1.93, 训练集准确率：0.95, 验证集损失：-1.90, 验证集准确率: 0.85
第15轮，训练集损失：-1.99, 训练集准确率：0.95, 验证集损失：-1.96, 验证集准确率: 0.87
第16轮，训练集损失：-2.05, 训练集准确率：0.94, 验证集损失：-2.01, 验证集准确率: 0.85
第17轮，训练集损失：-2.12, 训练集准确率：0.95, 验证集损失：-2.0

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


第0轮，训练集损失：-0.23, 训练集准确率：0.57, 验证集损失：-0.43, 验证集准确率: 0.55
第1轮，训练集损失：-0.51, 训练集准确率：0.57, 验证集损失：-0.64, 验证集准确率: 0.61
第2轮，训练集损失：-0.70, 训练集准确率：0.60, 验证集损失：-0.81, 验证集准确率: 0.59
第3轮，训练集损失：-0.87, 训练集准确率：0.64, 验证集损失：-0.96, 验证集准确率: 0.64
第4轮，训练集损失：-1.01, 训练集准确率：0.70, 验证集损失：-1.08, 验证集准确率: 0.73
第5轮，训练集损失：-1.13, 训练集准确率：0.76, 验证集损失：-1.18, 验证集准确率: 0.69
第6轮，训练集损失：-1.24, 训练集准确率：0.83, 验证集损失：-1.28, 验证集准确率: 0.76
第7轮，训练集损失：-1.33, 训练集准确率：0.85, 验证集损失：-1.37, 验证集准确率: 0.78
第8轮，训练集损失：-1.42, 训练集准确率：0.87, 验证集损失：-1.45, 验证集准确率: 0.79
第9轮，训练集损失：-1.51, 训练集准确率：0.90, 验证集损失：-1.52, 验证集准确率: 0.79
第10轮，训练集损失：-1.59, 训练集准确率：0.91, 验证集损失：-1.55, 验证集准确率: 0.75
第11轮，训练集损失：-1.66, 训练集准确率：0.93, 验证集损失：-1.65, 验证集准确率: 0.81
第12轮，训练集损失：-1.74, 训练集准确率：0.93, 验证集损失：-1.70, 验证集准确率: 0.79
第13轮，训练集损失：-1.81, 训练集准确率：0.93, 验证集损失：-1.76, 验证集准确率: 0.79
第14轮，训练集损失：-1.88, 训练集准确率：0.94, 验证集损失：-1.81, 验证集准确率: 0.80
第15轮，训练集损失：-1.95, 训练集准确率：0.96, 验证集损失：-1.85, 验证集准确率: 0.79
第16轮，训练集损失：-2.01, 训练集准确率：0.96, 验证集损失：-1.91, 验证集准确率: 0.79
第17轮，训练集损失：-2.06, 训练集准确率：0.96, 验证集损失：-1.9

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


第0轮，训练集损失：-0.23, 训练集准确率：0.66, 验证集损失：-0.41, 验证集准确率: 0.68
第1轮，训练集损失：-0.49, 训练集准确率：0.73, 验证集损失：-0.62, 验证集准确率: 0.72
第2轮，训练集损失：-0.68, 训练集准确率：0.77, 验证集损失：-0.79, 验证集准确率: 0.72
第3轮，训练集损失：-0.84, 训练集准确率：0.79, 验证集损失：-0.94, 验证集准确率: 0.80
第4轮，训练集损失：-0.98, 训练集准确率：0.81, 验证集损失：-1.07, 验证集准确率: 0.80
第5轮，训练集损失：-1.10, 训练集准确率：0.84, 验证集损失：-1.18, 验证集准确率: 0.82
第6轮，训练集损失：-1.22, 训练集准确率：0.88, 验证集损失：-1.27, 验证集准确率: 0.81
第7轮，训练集损失：-1.32, 训练集准确率：0.89, 验证集损失：-1.37, 验证集准确率: 0.83
第8轮，训练集损失：-1.42, 训练集准确率：0.90, 验证集损失：-1.45, 验证集准确率: 0.84
第9轮，训练集损失：-1.51, 训练集准确率：0.91, 验证集损失：-1.52, 验证集准确率: 0.83
第10轮，训练集损失：-1.60, 训练集准确率：0.93, 验证集损失：-1.59, 验证集准确率: 0.83
第11轮，训练集损失：-1.68, 训练集准确率：0.93, 验证集损失：-1.66, 验证集准确率: 0.84
第12轮，训练集损失：-1.75, 训练集准确率：0.94, 验证集损失：-1.69, 验证集准确率: 0.81
第13轮，训练集损失：-1.82, 训练集准确率：0.94, 验证集损失：-1.77, 验证集准确率: 0.83
第14轮，训练集损失：-1.89, 训练集准确率：0.96, 验证集损失：-1.83, 验证集准确率: 0.84
第15轮，训练集损失：-1.96, 训练集准确率：0.96, 验证集损失：-1.88, 验证集准确率: 0.84
第16轮，训练集损失：-2.02, 训练集准确率：0.95, 验证集损失：-1.94, 验证集准确率: 0.85
第17轮，训练集损失：-2.08, 训练集准确率：0.96, 验证集损失：-1.9

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


第0轮，训练集损失：-0.24, 训练集准确率：0.60, 验证集损失：-0.42, 验证集准确率: 0.74
第1轮，训练集损失：-0.51, 训练集准确率：0.71, 验证集损失：-0.64, 验证集准确率: 0.75
第2轮，训练集损失：-0.71, 训练集准确率：0.73, 验证集损失：-0.83, 验证集准确率: 0.79
第3轮，训练集损失：-0.88, 训练集准确率：0.76, 验证集损失：-0.99, 验证集准确率: 0.79
第4轮，训练集损失：-1.02, 训练集准确率：0.78, 验证集损失：-1.11, 验证集准确率: 0.80
第5轮，训练集损失：-1.15, 训练集准确率：0.81, 验证集损失：-1.22, 验证集准确率: 0.81
第6轮，训练集损失：-1.25, 训练集准确率：0.83, 验证集损失：-1.32, 验证集准确率: 0.82
第7轮，训练集损失：-1.34, 训练集准确率：0.85, 验证集损失：-1.40, 验证集准确率: 0.83
第8轮，训练集损失：-1.43, 训练集准确率：0.87, 验证集损失：-1.48, 验证集准确率: 0.83
第9轮，训练集损失：-1.51, 训练集准确率：0.90, 验证集损失：-1.56, 验证集准确率: 0.84
