## Environment Setup

Step 1: Set the Python version to 3.9.0

In [1]:
%%capture captured_output
!/home/ma-user/anaconda3/bin/conda create -n python-3.9.0 python=3.9.0 -y --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
!/home/ma-user/anaconda3/envs/python-3.9.0/bin/pip install ipykernel

In [2]:
import json
import os

data = {
   "display_name": "python-3.9.0",
   "env": {
      "PATH": "/home/ma-user/anaconda3/envs/python-3.9.0/bin:/home/ma-user/anaconda3/envs/python-3.7.10/bin:/modelarts/authoring/notebook-conda/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ma-user/modelarts/ma-cli/bin:/home/ma-user/modelarts/ma-cli/bin"
   },
   "language": "python",
   "argv": [
      "/home/ma-user/anaconda3/envs/python-3.9.0/bin/python",
      "-m",
      "ipykernel",
      "-f",
      "{connection_file}"
   ]
}

# Create the kernel directory if it does not exist yet
os.makedirs("/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/", exist_ok=True)

with open('/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/kernel.json', 'w') as f:
    json.dump(data, f, indent=4)

#### Note: After the code above finishes running, switch the kernel to python-3.9.0.
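
After switching kernels, a quick check can confirm that the notebook is really running on the new interpreter. The following is a minimal sketch; the helper name `is_expected_python` is hypothetical and not part of the original notebook.

```python
import sys

def is_expected_python(version_info, major=3, minor=9):
    """Return True when the major/minor version matches the expected one."""
    return tuple(version_info[:2]) == (major, minor)

# Show which interpreter the current kernel uses and whether it is Python 3.9.
print(sys.executable)
print("Running on Python 3.9:", is_expected_python(sys.version_info))
```

If the second line prints `False`, the kernel switch has most likely not taken effect yet.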

Step 2: Install the MindSpore framework and the MindNLP toolkit

In [1]:

!pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.2.13/MindSpore/unified/x86_64/mindspore-2.2.13-cp39-cp39-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting mindspore==2.2.13
  Using cached https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.2.13/MindSpore/unified/x86_64/mindspore-2.2.13-cp39-cp39-linux_x86_64.whl (756.2 MB)
Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install mindnlp-0.2.0-py3-none-any.whl

Looking in indexes: http://repo.myhuaweicloud.com/repository/pypi/simple
Processing ./mindnlp-0.2.0-py3-none-any.whl
Installing collected packages: mindnlp
  Attempting uninstall: mindnlp
    Found existing installation: mindnlp 0.3.2
    Uninstalling mindnlp-0.3.2:
      Successfully uninstalled mindnlp-0.3.2
Successfully installed mindnlp-0.2.0


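# BERT Fake News Detection with MindSpore


## Model Overview

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed and released by Google in late 2018. Pretrained language models like BERT play an important role in many natural language processing tasks, such as question answering, named entity recognition, natural language inference, and text classification.

After pretraining, BERT saves its embedding table and the weights of its 12 Transformer layers (BERT-BASE) or 24 Transformer layers (BERT-LARGE). The pretrained model can then be fine-tuned for downstream tasks such as text classification, similarity judgment, and reading comprehension.

In an era of information overload, the spread of fake news has serious negative effects on society: it misleads the public, damages the reputations of individuals and organizations, and undermines social stability. Identifying and filtering fake news quickly and accurately has therefore become essential. As a language model that captures context and infers the meaning of text, BERT can be used to classify and detect fake news automatically, which helps improve the credibility of information and protect users, with corresponding commercial and social value.

We use the LIAR dataset, reduce it to a binary true/false classification task, and train a BERT model to judge whether a single statement is true or false.

The rest of this notebook walks through the implementation.

#### Note: The MindNLP whl package can be downloaded from [MindNLP](https://repo.mindspore.cn/mindspore-lab/mindnlp/newest/any/)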

In [47]:
import os

import mindspore
from mindspore.dataset import text, GeneratorDataset, transforms
from mindspore import nn, context

from mindnlp.engine import Trainer, Evaluator
from mindnlp.engine.callbacks import CheckpointCallback, BestModelCallback
from mindnlp.metrics import Accuracy

In [48]:
class SentimentDataset:
    """Sentiment Dataset"""
    def __init__(self, path):
        # Store the dataset path and load the data
        self.path = path
        self._labels, self._text_a = [], []
        self._load()

    def _load(self):
        # Read the dataset file and preprocess it
        with open(self.path, "r", encoding="utf-8") as f:
            dataset = f.read()
        lines = dataset.split("\n")
        for line in lines[1:-1]:  # skip the first line (header) and the last line (empty)
            label, text_a = line.split("\t")[0:2]  # read the label and the text
            self._labels.append(1 if label == "TRUE" else 0)  # encode the label as 1 (TRUE) or 0 (FALSE)
            self._text_a.append(text_a)  # store the text

    def __getitem__(self, index):
        # Return the (label, text) pair at the given index
        return self._labels[index], self._text_a[index]

    def __len__(self):
        # Return the number of samples
        return len(self._labels)


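## Dataset

We use the LIAR dataset. It contains 12.8K human-labeled short statements from the POLITIFACT.COM API, and each statement was evaluated for truthfulness by a POLITIFACT.COM editor. The labels include false, half-true, and true, within a set of six fine-grained truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true.

We delete every column except the truthfulness label and the statement text, and add the header names "label" and "text_a".
Then, using Excel's find-and-replace, we change false/pants-fire/barely-true in the label column to false and mostly-true/true to true, and delete the half-true rows. This turns the task into binary classification, as in the example below.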

label--text_a

TRUE--Heroin comes in the United States from the southern border.

FALSE--Says every day of a special session costs taxpayers $40,000.

TRUE--Our trade with Mexico is $720 million a day; thats our No. 1 trading partner.
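
The relabeling described above was done by hand in Excel. The same mapping could also be scripted; the sketch below is one possible way, not the procedure used to produce `fndata`. The function names and the sample statements are placeholders invented for illustration.

```python
# Map the six LIAR ratings to binary labels; half-true has no entry and is dropped.
BINARY_LABEL = {
    "pants-fire": "FALSE",
    "false": "FALSE",
    "barely-true": "FALSE",
    "mostly-true": "TRUE",
    "true": "TRUE",
}

def binarize_rows(rows):
    """Yield (label, text) pairs with binary labels, skipping half-true rows."""
    for label, text in rows:
        mapped = BINARY_LABEL.get(label.strip().lower())
        if mapped is not None:
            yield mapped, text

def to_tsv(rows):
    """Render rows as TSV text with the "label" and "text_a" header."""
    lines = ["label\ttext_a"]
    lines += [f"{label}\t{text}" for label, text in binarize_rows(rows)]
    # The trailing newline leaves an empty last line when split on "\n".
    return "\n".join(lines) + "\n"

# Placeholder rows, not taken from the dataset.
sample = [
    ("mostly-true", "Statement A"),
    ("half-true", "Statement B"),
    ("pants-fire", "Statement C"),
]
print(to_tsv(sample))
```

The output ends with a newline because the `SentimentDataset` loader skips both the first line and the last line of the file.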

This part covers reading the dataset, converting the data format, tokenizing the data, and padding.

In [49]:
# Download the dataset
!git clone https://gitee.com/lmh041027/fndata.git

fatal: destination path 'fndata' already exists and is not an empty directory.


### Data Loading and Preprocessing

We define a process_dataset function for data loading and preprocessing; the details are in the code below.

In [50]:
import numpy as np
import mindspore.dataset as ds
from mindspore.dataset.transforms import TypeCast
from mindspore.dataset.transforms import c_transforms as transforms

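This preprocessing step loads the dataset, tokenizes the text with the given tokenizer and pads it to a uniform length, and converts the labels and tokenized output into the format MindSpore expects. It selects the batching method according to the device type (Ascend or other) and returns the preprocessed dataset for model training and evaluation.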

In [51]:
def process_dataset(source, tokenizer, max_seq_len=128, batch_size=32, shuffle=True):
    is_ascend = mindspore.context.get_context('device_target') == 'Ascend'

    column_names = ["label", "text_a"]
    dataset = ds.GeneratorDataset(source, column_names=column_names, shuffle=shuffle)

    def tokenize_and_pad(text):
        tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)
        input_ids = np.asarray(tokenized['input_ids'], dtype=np.int32)
        attention_mask = np.asarray(tokenized['attention_mask'], dtype=np.int32)
        return input_ids, attention_mask

    type_cast_op = TypeCast(mindspore.int32)

    dataset = dataset.map(operations=tokenize_and_pad, input_columns=["text_a"], output_columns=["input_ids", "attention_mask"])
    dataset = dataset.map(operations=type_cast_op, input_columns=["label"], output_columns="labels")

    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': ([None], tokenizer.pad_token_id),
                                                             'attention_mask': ([None], 0)})
    
    return dataset

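Ascend NPU environments do not yet support dynamic shapes, so data preprocessing uses static shapes.

The text is converted into the input format BERT requires: it is split into word pieces (tokens), each token is mapped to its vocabulary index, and an attention mask is generated.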
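
To make the static-shape idea concrete, the sketch below pads or truncates a list of token IDs to a fixed length and builds the matching attention mask. It is a simplified stand-in written for illustration, not the tokenizer's actual implementation: the helper name `pad_or_truncate` is hypothetical, and a real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len long."""
    kept = list(token_ids[:max_len])      # cut sequences that are too long
    padding = max_len - len(kept)         # number of pad positions to add
    input_ids = kept + [pad_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask

# 101 and 102 are the [CLS] and [SEP] IDs of bert-base-uncased;
# the two IDs in between are used here only as illustrative values.
ids, mask = pad_or_truncate([101, 2758, 2028, 102], max_len=8)
print(ids)
print(mask)
```

Because every sequence comes out with the same length, a batch can be stacked into a fixed `[batch_size, max_len]` array without any per-batch padding.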

In [76]:
from mindnlp.transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [77]:
tokenizer.pad_token_id

0

In [78]:
# Show the first ten records of the dataset
with open("fndata/train.tsv", "r", encoding="utf-8") as f:
    dataset = f.read()
lines = dataset.split("\n")
s=0
for line in lines[1:-1]:
    print(line.split('\t')[0:2])
    s+=1
    if s==10:
        break

['FALSE', 'Says the Annies List political group supports third-trimester abortions on demand.']
['TRUE', 'Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran."']
['FALSE', 'Health care reform legislation is likely to mandate free sex change surgeries.']
['TRUE', 'The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.']
['FALSE', 'Jim Dunnam has not lived in the district he represents for years now.']
['TRUE', 'Says GOP primary opponents Glenn Grothman and Joe Leibham cast a compromise vote that cost $788 million in higher electricity costs.']
['TRUE', '"For the first time in history, the share of the national popular vote margin is smaller than the Latino vote margin."']
['FALSE', '"When Mitt Romney was governor of Massachusetts, we didnt just slow the rate of growth of our government, we actually cut it."']
['TRUE', 'The economy bled 

In [79]:
dataset_train = process_dataset(SentimentDataset("fndata/train.tsv"), tokenizer)
dataset_val = process_dataset(SentimentDataset("fndata/valid.tsv"), tokenizer)
dataset_test = process_dataset(SentimentDataset("fndata/test.tsv"), tokenizer, shuffle=False)

In [80]:
dataset_train.get_col_names()

['input_ids', 'attention_mask', 'labels']

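Input ID tensor: the first tensor, with shape [32, 128]. Each row holds the token IDs of one sentence. Every input sentence is padded or truncated to 128 tokens (max_seq_len=128). 101 is BERT's [CLS] token, which marks the start of the sequence; the values after it are the vocabulary indices of the tokens, and 0 is the padding value.

Attention mask tensor: the second tensor, also with shape [32, 128]. A 1 marks a real token and a 0 marks padding. The model uses this mask to ignore padded positions in its computation.

Label tensor: the third tensor, with shape [32]. Each value is a label: 1 for a true statement and 0 for a false one.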

In [81]:
print(next(dataset_train.create_tuple_iterator()))

[Tensor(shape=[32, 128], dtype=Int32, value=
[[ 101, 1000, 2758 ...    0,    0,    0],
 [ 101, 1037, 2883 ...    0,    0,    0],
 [ 101, 2416, 4896 ...    0,    0,    0],
 ...
 [ 101, 2758, 2028 ...    0,    0,    0],
 [ 101, 1000, 2758 ...    0,    0,    0],
 [ 101, 1031, 1056 ...    0,    0,    0]]), Tensor(shape=[32, 128], dtype=Int32, value=
[[1, 1, 1 ... 0, 0, 0],
 [1, 1, 1 ... 0, 0, 0],
 [1, 1, 1 ... 0, 0, 0],
 ...
 [1, 1, 1 ... 0, 0, 0],
 [1, 1, 1 ... 0, 0, 0],
 [1, 1, 1 ... 0, 0, 0]]), Tensor(shape=[32], dtype=Int32, value= [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 
 0, 1, 0, 0, 0, 0, 0, 0])]


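## Model Construction

BertForSequenceClassification builds the BERT model for fake news detection: it loads the pretrained weights and is configured for two-class (true/false) classification through num_labels=2. Automatic mixed precision is then applied to speed up training. Next we instantiate the optimizer and the evaluation metric, set up the checkpoint-saving callbacks, and finally build the trainer and start training.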

In [82]:
from mindnlp.transformers import BertForSequenceClassification, BertModel
from mindnlp._legacy.amp import auto_mixed_precision

# set bert config and define parameters for training
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model = auto_mixed_precision(model, 'O1')

optimizer = nn.Adam(model.trainable_params(), learning_rate=2e-5)

The following parameters in checkpoint files are not loaded:
['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
The following parameters in models are missing parameter:
['classifier.weight', 'classifier.bias']


In [83]:
metric = Accuracy()
# define callbacks to save checkpoints
ckpoint_cb = CheckpointCallback(save_path='checkpoint', ckpt_name='bert_emotect', epochs=1, keep_checkpoint_max=2)
best_model_cb = BestModelCallback(save_path='checkpoint', ckpt_name='bert_emotect_best', auto_load=True)

trainer = Trainer(network=model, train_dataset=dataset_train,
                  eval_dataset=dataset_val, metrics=metric,
                  epochs=5, optimizer=optimizer, callbacks=[ckpoint_cb, best_model_cb])

In [84]:
# start training
trainer.run(tgt_columns="labels")

The train will start from the checkpoint saved in 'checkpoint'.


Epoch 0: 100%|██████████| 255/255 [02:18<00:00,  1.84it/s, loss=0.6533276] 


Checkpoint: 'bert_emotect_epoch_0.ckpt' has been saved in epoch: 0.


Evaluate: 100%|██████████| 33/33 [00:06<00:00,  4.88it/s]


Evaluate Score: {'Accuracy': 0.6583011583011583}
---------------Best Model: 'bert_emotect_best.ckpt' has been saved in epoch: 0.---------------


Epoch 1: 100%|██████████| 255/255 [02:15<00:00,  1.88it/s, loss=0.59775037]


Checkpoint: 'bert_emotect_epoch_1.ckpt' has been saved in epoch: 1.


Evaluate: 100%|██████████| 33/33 [00:06<00:00,  4.84it/s]


Evaluate Score: {'Accuracy': 0.6698841698841699}
---------------Best Model: 'bert_emotect_best.ckpt' has been saved in epoch: 1.---------------


Epoch 2: 100%|██████████| 255/255 [02:15<00:00,  1.88it/s, loss=0.48619908]


The maximum number of stored checkpoints has been reached.
Checkpoint: 'bert_emotect_epoch_2.ckpt' has been saved in epoch: 2.


Evaluate: 100%|██████████| 33/33 [00:06<00:00,  4.87it/s]


Evaluate Score: {'Accuracy': 0.6496138996138996}


Epoch 3: 100%|██████████| 255/255 [02:15<00:00,  1.88it/s, loss=0.31588903]


The maximum number of stored checkpoints has been reached.
Checkpoint: 'bert_emotect_epoch_3.ckpt' has been saved in epoch: 3.


Evaluate: 100%|██████████| 33/33 [00:06<00:00,  4.87it/s]


Evaluate Score: {'Accuracy': 0.6544401544401545}


Epoch 4: 100%|██████████| 255/255 [02:15<00:00,  1.88it/s, loss=0.17769843]


The maximum number of stored checkpoints has been reached.
Checkpoint: 'bert_emotect_epoch_4.ckpt' has been saved in epoch: 4.


Evaluate: 100%|██████████| 33/33 [00:06<00:00,  4.89it/s]


Evaluate Score: {'Accuracy': 0.6573359073359073}
Loading best model from 'checkpoint' with '['Accuracy']': [0.6698841698841699]...
---------------The model is already load the best model from 'bert_emotect_best.ckpt'.---------------
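
The log shows that the best checkpoint is the one from epoch 1, not the last epoch. The sketch below reproduces that selection from the validation accuracies printed above. It only illustrates the idea of keeping the highest-scoring epoch and is not a description of how `BestModelCallback` is implemented; `best_epoch` is a hypothetical helper.

```python
# Validation accuracy per epoch, copied from the training log above.
val_accuracy = [
    0.6583011583011583,
    0.6698841698841699,
    0.6496138996138996,
    0.6544401544401545,
    0.6573359073359073,
]

def best_epoch(scores):
    """Return (epoch, score) for the highest score; ties keep the earliest epoch."""
    epoch = max(range(len(scores)), key=scores.__getitem__)
    return epoch, scores[epoch]

epoch, score = best_epoch(val_accuracy)
print(f"best epoch: {epoch}, accuracy: {score:.4f}")
```

Note that the training loss keeps falling through epoch 4 while validation accuracy peaks at epoch 1, which is why reloading the best checkpoint matters here.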


## Model Validation

Run the trained model on the held-out test dataset (dataset_test) to see how well it performs. The evaluation metric here is accuracy.

In [85]:
evaluator = Evaluator(network=model, eval_dataset=dataset_test, metrics=metric)
evaluator.run(tgt_columns="labels")

Evaluate: 100%|██████████| 32/32 [00:06<00:00,  4.85it/s]

Evaluate Score: {'Accuracy': 0.6545275590551181}





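## Model Inference

Iterate over the test dataset and report the number of correct predictions.

We first load the test data as dataset_infer and then define a predict function that classifies a statement as true or false. Inside predict, the text is tokenized by tokenizer and converted to a tensor, text_tokenized, which is passed to model to obtain the predicted label. Depending on whether the prediction matches the actual label, the counters tru_num and fal_num are updated to record the numbers of correct and incorrect predictions. Finally, we loop over every text and label in dataset_infer, predict on the whole dataset, and print the total number of samples along with the correct and incorrect counts.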

In [86]:
from mindspore import Tensor

dataset_infer = SentimentDataset("fndata/test.tsv")
tru_num = 0  # number of correct predictions
fal_num = 0  # number of incorrect predictions

def predict(text, label):
    global tru_num
    global fal_num
    # Tokenize the text and wrap the IDs in a batch of size 1
    text_tokenized = Tensor([tokenizer(text).input_ids])
    logits = model(text_tokenized)
    # Index of the largest logit: 0 is FALSE, 1 is TRUE
    predict_label = logits[0].asnumpy().argmax()
    if label == predict_label:
        tru_num += 1
    else:
        fal_num += 1

s = 0
for label, text in dataset_infer:
    predict(text, label)
    s += 1

print(f"Total samples: {s}, correct predictions: {tru_num}, incorrect predictions: {fal_num}")

Total samples: 1016, correct predictions: 665, incorrect predictions: 351
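
As a sanity check, the counts printed above can be turned back into an accuracy and compared with the score reported by the Evaluator. The helper below is a small illustrative sketch; `accuracy` is not a function from the notebook.

```python
def accuracy(correct, total):
    """Return the fraction of correct predictions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total

# 665 correct predictions out of 1016 test samples, as printed above.
acc = accuracy(665, 1016)
print(f"accuracy: {acc:.4f}")
```

The result agrees with the Evaluator's reported accuracy of 0.6545275590551181 on the same test set.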


Install the Gradio library

In [20]:
!pip install gradio

Looking in indexes: http://repo.myhuaweicloud.com/repository/pypi/simple
Collecting gradio
  Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/f1/1f/95a1bc5ec7cf8cdf92f2abf9e0bd59cbddf7870ae233aa33a99262cff9e1/gradio-4.37.2-py3-none-any.whl (12.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/12.3 MB 67.4 MB/s eta 0:00:00
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/c5/19/5af6804c4cc0fed83f47bff6e413a98a36618e7d40185cd36e69737f3b0e/aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting altair<6.0,>=4.2.0 (from gradio)
  Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/46/30/2118537233fa72c1d91a81f5908a7e843a6601ccc68b76838ebc4951505f/altair-5.3.0-py3-none-any.whl (857 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.8/857.8 kB 29.8 MB/s eta 0:00:00
Collecting fastapi

In [87]:
# Add Gradio integration here
import gradio as gr
from mindnlp.transformers import BertTokenizer, BertForSequenceClassification
import numpy as np

In [68]:
# # Load the pretrained BERT model and tokenizer
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

The following parameters in checkpoint files are not loaded:
['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
The following parameters in models are missing parameter:
['classifier.weight', 'classifier.bias']


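gradio_predict takes the input text and the conversation history, then uses the trained model to classify the text. It first tokenizes the text and makes sure the inputs are Tensors. The model then produces logits, and the label with the highest probability is selected and mapped to "TRUE" or "FALSE". The prediction is appended to the conversation history, and the updated history is returned. If an error occurs along the way, the function catches the exception and returns the error message together with the original history.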

In [89]:
def gradio_predict(text, history):
    try:
        label_map = {0: "FALSE", 1: "TRUE"}
        tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='ms')
        input_ids = tokenized['input_ids']
        attention_mask = tokenized['attention_mask']

        # Make sure the inputs are Tensors
        if not isinstance(input_ids, Tensor):
            input_ids = Tensor(input_ids)
        if not isinstance(attention_mask, Tensor):
            attention_mask = Tensor(attention_mask)

        # Run the model and extract the logits
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predict_label = int(logits.asnumpy().argmax(axis=1)[0])  # index of the largest logit
        prediction = label_map[predict_label]

        # Append the result to the history
        history.append((text, prediction))

        return history, history
    except Exception as e:
        # The chatbot output expects a list of (user, bot) pairs, so show the
        # error as a chat message and leave the stored history unchanged
        return history + [(text, f"Error: {e}")], history
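
The step from logits to a label can be illustrated without the model. The sketch below applies a softmax to a pair of logits and picks the most probable class; because softmax preserves ordering, this gives the same label as taking the argmax of the raw logits, which is what gradio_predict does. The logit values are invented for illustration, and `softmax` and `decode` are hypothetical helpers rather than part of the notebook.

```python
import math

LABEL_MAP = {0: "FALSE", 1: "TRUE"}  # same mapping as in gradio_predict

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return (label, probability) for the most likely class."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[index], probs[index]

# Made-up logits for a single statement.
label, confidence = decode([-0.3, 0.9])
print(label, round(confidence, 3))
```

Returning the probability alongside the label is optional; the notebook's interface only displays the label.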

In [90]:
# Clear the input textbox
def reset_user_input():
    return gr.update(value='')

# Reset the chatbot display and the stored history
def reset_state():
    return [], []

In [91]:
# Create a Gradio Blocks interface
with gr.Blocks() as demo:
    # Add an HTML title
    gr.HTML("""<h1 align="center">BERT Fake News Detection</h1>""")

    # Create the chatbot component
    chatbot = gr.Chatbot()
    # Create a row layout
    with gr.Row():
        # Create a column layout and set its scale
        with gr.Column(scale=4):
            # Nest another column inside it and set its scale
            with gr.Column(scale=12):
                # Add the user input textbox
                user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=3, container=False)
            # Create another column for the buttons
            with gr.Column(min_width=32, scale=1):
                with gr.Row():
                    # Add the predict button
                    submitBtn = gr.Button("Predict", variant="primary")
                    # Add the clear-history button
                    emptyBtn = gr.Button("Clear History")

    # Initialize the history state
    history = gr.State([])

    # On a predict-button click, call gradio_predict
    submitBtn.click(gradio_predict, [user_input, history], [chatbot, history], show_progress=True)
    # On the same click, also call reset_user_input to clear the textbox
    submitBtn.click(reset_user_input, [], [user_input])

    # On a clear-history click, call reset_state
    emptyBtn.click(reset_state, outputs=[chatbot, history], show_progress=True)

# Launch the Gradio interface with sharing enabled
demo.queue().launch(share=True)

Running on local URL:  http://127.0.0.1:7870

Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.


