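# Assignment

Swap in the MSRA dataset and a different pretrained model such as ERNIE-Gram or BERT.

- Dataset:
`train_ds, test_ds = load_dataset("msra_ner", splits=["train", "test"])`
- Model:
	Replace `from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification` with the classes of the model you choose.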

# Express waybill information extraction with the PaddleNLP pretrained model ERNIE


**Note**

The code in this project requires a GPU environment:

<img src="https://ai-studio-static-online.cdn.bcebos.com/767f625548714f03b105b6ccb3aa78df9080e38d329e445380f505ddec6c7042" width="40%" height="40%">
<br>
<br>

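Named entity recognition (NER) is a fundamental NLP task and a building block for information extraction, question answering, syntactic parsing, machine translation, and many other tasks. Its accuracy directly affects the quality of those downstream tasks. There are two common approaches to NER. The first is LSTM/GRU + CRF: an RNN-style model extracts features from the input text, and a CRF (conditional random field) models the dependencies between the labels of neighboring tokens. The second uses a pretrained model such as ERNIE or BERT to predict the label of each token directly.

This project shows how to use the PaddleNLP pretrained model ERNIE to extract the name, phone number, province, city, district, and detailed address from an express waybill and turn them into structured data. This helps logistics staff extract the relevant information efficiently and lowers the cost of filling in waybills for customers.

Before 2017, text processing in both industry and academia relied on sequence models, chiefly the [Recurrent Neural Network (RNN)](https://baike.baidu.com/item/%E5%BE%AA%E7%8E%AF%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C/23199490?fromtitle=RNN&fromid=5707183&fr=aladdin).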

<p align="center">
<img src="http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png" width="40%" height="30%"> <br />
</p><br><center>Figure 1: Schematic of an RNN</center></br>

The project [Express waybill information extraction with BiGRU+CRF](https://aistudio.baidu.com/aistudio/projectdetail/1317771) shows how to solve the waybill extraction task with a sequence model.
<br>

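As deep learning has developed in recent years, the number of model parameters has grown rapidly. Training that many parameters without overfitting requires larger datasets. For most NLP tasks, however, building a large labeled dataset is very difficult because annotation is expensive, especially for syntactic and semantic tasks. Large unlabeled corpora, by contrast, are relatively easy to build. To make use of them, we can first learn a good representation from the unlabeled data and then apply that representation to other tasks. Recent research shows that pretrained models (PTMs) trained on large unlabeled corpora perform very well on NLP tasks.

Pretraining also yields general-purpose language representations that benefit downstream NLP tasks, and it avoids training a model from scratch. As computing power has grown, deep architectures (namely the Transformer) and better training techniques have pushed PTMs from shallow to deep models.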


<p align="center">
<img src="https://ai-studio-static-online.cdn.bcebos.com/327f44ff3ed24493adca5ddc4dc24bf61eebe67c84a6492f872406f464fde91e" width="60%" height="50%"> <br />
</p><br><center>Figure 2: Overview of pretrained models. Image source: https://github.com/thunlp/PLMpapers</center></br>
                                                                                                                             
This example shows how to fine-tune a pretrained model, with ERNIE ([Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223)) as the representative, for a sequence labeling task.

**Remember to give [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) a Star⭐**

Open source is hard work, and your support is appreciated.

GitHub: [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
![](https://ai-studio-static-online.cdn.bcebos.com/a0e8ca7743ea4fe9aa741682a63e767f8c48dc55981f4e44a40e0e00d3ab369e)

AI Studio will ship with PaddleNLP preinstalled in the future; until then, install it with the following command.

In [1]:
!pip install --upgrade paddlenlp

Looking in indexes: https://mirror.baidu.com/pypi/simple/
Collecting paddlenlp
  Downloading https://mirror.baidu.com/pypi/packages/b1/e9/128dfc1371db3fc2fa883d8ef27ab6b21e3876e76750a43f58cf3c24e707/paddlenlp-2.0.2-py3-none-any.whl (426kB)
     |████████████████████████████████| 430kB 14.2MB/s eta 0:00:01
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.1
    Uninstalling paddlenlp-2.0.1:
      Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.2


In [2]:
import paddlenlp

print(paddlenlp.__version__)

2.0.2


# Download the MSRA-NER dataset

In [3]:
# Download the msra-ner dataset into the current directory
!wget https://paddlenlp.bj.bcebos.com/datasets/msra_ner.tar.gz

--2021-06-12 04:18:27--  https://paddlenlp.bj.bcebos.com/datasets/msra_ner.tar.gz
Resolving paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c00:6c21:10ad:0:ff:b00e:67d
Connecting to paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3743966 (3.6M) [application/octet-stream]
Saving to: ‘msra_ner.tar.gz’


2021-06-12 04:18:28 (21.3 MB/s) - ‘msra_ner.tar.gz’ saved [3743966/3743966]



In [4]:
# Extract the msra-ner dataset
!tar -zxvf ./msra_ner.tar.gz

msra_ner/
msra_ner/train.tsv
msra_ner/label_map.json
msra_ner/test.tsv


# Dataset description

The extracted msra_ner directory contains three files: train.tsv, test.tsv, and label_map.json.

In train.tsv and test.tsv, each line is one sample in the following format:

tokens	labels

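tokens is the text to be labeled and labels is its annotation; the two columns are separated by a tab (\t). Within each column, the individual characters of tokens and the individual tags of labels are separated by the STX character (\002).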
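
As a minimal sketch of this format, the snippet below builds one line by hand and parses it back. The sample sentence and the helper name `parse_line` are made up for illustration and are not taken from the dataset.

```python
STX = "\002"  # separator used inside the tokens column and the labels column


def parse_line(line):
    """Split one TSV line into parallel lists of characters and tags."""
    tokens_col, labels_col = line.rstrip("\n").split("\t")
    tokens = tokens_col.split(STX)
    labels = labels_col.split(STX)
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return tokens, labels


# Hypothetical sample: a two-character person name followed by one other character.
sample_line = (
    STX.join(["张", "三", "说"]) + "\t" + STX.join(["B-PER", "I-PER", "O"]) + "\n"
)
tokens, labels = parse_line(sample_line)
print(list(zip(tokens, labels)))
```

The length check is an extra safeguard added here; it catches lines where a separator was lost during export.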

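The MSRA-NER training set contains 45,000 samples and the test set contains 3,442 samples.

The dataset annotates three entity types, person, organization, and location, abbreviated as PER, ORG, and LOC. Everything else is tagged O (other).

The dataset uses BIO tagging. The label dictionary file label_map.json stores the mapping from each tag to its tag ID: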

```
    {
      "B-PER": 0,
      "I-PER": 1,
      "B-ORG": 2,
      "I-ORG": 3,
      "B-LOC": 4,
      "I-LOC": 5,
      "O": 6
    }
```

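Each tag has the following meaning:

|Tag|Meaning|
|---|---|
|B-PER|First character of a person name|
|I-PER|Non-initial character of a person name|
|B-ORG|First character of an organization name|
|I-ORG|Non-initial character of an organization name|
|B-LOC|First character of a location name|
|I-LOC|Non-initial character of a location name|
|O|Not part of a named entity|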
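
To make the BIO scheme concrete, here is a small sketch that groups a tagged character sequence back into entities. The function name `bio_to_spans` and the example sentence are illustrative assumptions, not part of the dataset or of PaddleNLP.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans = []
    current_chars, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_chars:
                spans.append(("".join(current_chars), current_type))
            current_chars, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_chars.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_chars:
                spans.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        spans.append(("".join(current_chars), current_type))
    return spans


example_tokens = list("张三在北京工作")
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(example_tokens, example_tags))
# [('张三', 'PER'), ('北京', 'LOC')]
```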


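# Split the dataset

The training set contains 45,000 samples and the test set 3,442. No validation set is provided, so we carve one of roughly the same size as the test set out of the training set and use it to evaluate the model during training. The test set is left unchanged. We randomly draw 3,000 training samples as the validation set, which leaves 42,000 for training.

|Dataset|Samples|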
|---|---|
|train set| 42000|
|dev set|3000|
|test set|3442|

In [6]:
def train_test_split(train_file, proportion=3000):
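    """
    Split the training file into a new training set and a validation (dev) set.

    proportion: int or float
        An int is the number of dev samples; a float is the fraction of the
        training data to hold out.
    """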
    with open(train_file, 'r', encoding='utf-8') as fRead:
        train_data = []
        for line in fRead.readlines():
            train_data.append(line.strip('\n'))
        
        data_id = list(range(len(train_data)))

        train_len = len(data_id)
        
        if isinstance(proportion , int):
            test_split_len = proportion
        elif isinstance(proportion, float):
            test_split_len = int(train_len * proportion)
        else:
            raise ValueError('proportion must be int or float!')


        import random
        random.seed(5233)
        random.shuffle(data_id)

        test_split_data = [train_data[idx] for idx in data_id[:test_split_len]]
        train_split_data = [train_data[idx] for idx in data_id[test_split_len:]]

        import os
        data_dir = './msra_ner/data/msra'
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)

        train_file = os.path.join(data_dir, 'train.tsv')
        test_file = os.path.join(data_dir, 'dev.tsv')

        with open(train_file, 'w', encoding='utf-8') as fWriter1:
            for train_line in train_split_data:
                fWriter1.write(train_line + '\n')

        with open(test_file, 'w', encoding='utf-8') as fWriter2:
            for test_line in test_split_data:
                fWriter2.write(test_line + '\n')

In [7]:
train_test_split('./msra_ner/train.tsv')

In [8]:
!cp ./msra_ner/test.tsv ./msra_ner/data/msra/test.tsv
!cp ./msra_ner/label_map.json ./msra_ner/data/msra/label_map.json

In [9]:
train_data = []
with open('./msra_ner/data/msra/train.tsv', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        train_data.append(line)

print(len(train_data))

42000


## Load the custom dataset

In [10]:
import os

from paddlenlp.datasets import load_dataset


def read(data_path):
    with open(data_path, 'r', encoding='utf-8') as fRead:
        for line in fRead:
            tokens, labels = line.strip('\n').split('\t')
            tokens = tokens.split('\002')  # \002 is the STX character
            labels = labels.split('\002')
            yield {'tokens': tokens, 'labels': labels}

data_dir = './msra_ner/data/msra'
train_file = os.path.join(data_dir, 'train.tsv')
dev_file = os.path.join(data_dir, 'dev.tsv')
test_file = os.path.join(data_dir, 'test.tsv')

train_dataset = load_dataset(read, data_path=train_file, lazy=False)
dev_dataset = load_dataset(read, data_path=dev_file, lazy=False)
test_dataset = load_dataset(read, data_path=test_file, lazy=False)


In [11]:
for idx, ex in enumerate(train_dataset):
    if idx < 3:
        print(ex)

{'tokens': ['苗', '苗', '妈', '妈', '给', '孩', '子', '以', '信', '任', '和', '期', '待', '，', '并', '且', '巧', '妙', '地', '从', '孩', '子', '力', '所', '能', '及', '的', '日', '常', '生', '活', '小', '事', '出', '发', '，', '让', '孩', '子', '在', '劳', '动', '中', '体', '验', '生', '活', '的', '苦', '与', '乐', '。'], 'labels': ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}
{'tokens': ['郎', '青', '山', '亲', '自', '抓', '产', '供', '销', '和', '技', '术', '改', '造', '，', '副', '总', '经', '理', '分', '管', '财', '务', '、', '后', '勤', '和', '生', '产', '调', '度', '，', '其', '他', '人', '负', '责', '技', '术', '监', '督', '。'], 'labels': ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '

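Each example contains one sentence together with a label for every Chinese character and digit in it.

Next, the input sentences have to be preprocessed: tokenized and mapped to vocabulary IDs.

## Data processing: building the BERT input features

Pretrained models such as ERNIE and BERT process Chinese text character by character. PaddleNLP ships a matching tokenizer for each pretrained model; passing the model name is enough to load it. This notebook uses `bert-base-chinese`.

The tokenizer converts raw input text into the input format the model accepts.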


<p align="center">
<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie_network_1.png" hspace='10'/> <br />
</p>


<p align="center">
<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/ernie_network_2.png" hspace='10'/> <br />
</p>
<br><center>Figure 3: ERNIE model architecture</center></br>

Load the label dictionary:

In [12]:
import json


with open('./msra_ner/data/msra/label_map.json', 'r', encoding='utf-8') as fRead:
    label_vocab = json.load(fRead)

print(label_vocab)

print(len(label_vocab))

{'B-PER': 0, 'I-PER': 1, 'B-ORG': 2, 'I-ORG': 3, 'B-LOC': 4, 'I-LOC': 5, 'O': 6}
7


In [13]:
from paddlenlp.transformers import BertTokenizer


tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

def convert_example(example, tokenizer, label_vocab, max_seq_len=512):
    tokens, labels = example['tokens'], example['labels']
    
    tokenized_inputs = tokenizer(tokens, return_length=True, is_split_into_words=True, max_seq_len=max_seq_len)

    labels = [label_vocab[label] for label in labels]
    if len(labels) + 2 > max_seq_len:
        labels = labels[: max_seq_len - 2] 

    tokenized_inputs['labels'] = [label_vocab['O']] + labels + [label_vocab['O']] 

    return tokenized_inputs['input_ids'], tokenized_inputs['token_type_ids'], tokenized_inputs['seq_len'], tokenized_inputs['labels']

for idx, example in enumerate(train_dataset):
    if idx < 1:
        print(convert_example(example, tokenizer, label_vocab))

[2021-06-12 04:33:50,743] [    INFO] - Downloading bert-base-chinese-vocab.txt from https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-chinese-vocab.txt
100%|██████████| 107/107 [00:00<00:00, 3273.38it/s]


([101, 5728, 5728, 1968, 1968, 5314, 2111, 2094, 809, 928, 818, 1469, 3309, 2521, 8024, 2400, 684, 2341, 1975, 1765, 794, 2111, 2094, 1213, 2792, 5543, 1350, 4638, 3189, 2382, 4495, 3833, 2207, 752, 1139, 1355, 8024, 6375, 2111, 2094, 1762, 1227, 1220, 704, 860, 7741, 4495, 3833, 4638, 5736, 680, 727, 511, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 54, [6, 0, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6])
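
In the output above, `input_ids` starts with 101 (`[CLS]`) and ends with 102 (`[SEP]`), so the label list needs two extra entries to stay aligned. The sketch below isolates that alignment step in plain Python. The helper name `align_labels` is hypothetical; it mirrors what `convert_example` does with the labels, not any PaddleNLP API.

```python
def align_labels(label_ids, outside_id, max_seq_len):
    """Make the label list line up with [CLS] tokens... [SEP]."""
    # Two positions are reserved for the special tokens, so truncate first.
    label_ids = label_ids[: max_seq_len - 2]
    # The special tokens are not entities, so they get the "O" label.
    return [outside_id] + label_ids + [outside_id]


label_vocab = {"B-PER": 0, "I-PER": 1, "B-ORG": 2, "I-ORG": 3,
               "B-LOC": 4, "I-LOC": 5, "O": 6}
label_ids = [label_vocab[tag] for tag in ["B-PER", "I-PER", "O", "O"]]

aligned = align_labels(label_ids, label_vocab["O"], max_seq_len=512)
print(aligned)    # [6, 0, 1, 6, 6, 6]

truncated = align_labels(label_ids, label_vocab["O"], max_seq_len=5)
print(truncated)  # [6, 0, 1, 6, 6]
```

With `max_seq_len=5` only three real labels fit, which is why the second result has length 5.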


In [14]:
from functools import partial


trans_func = partial(convert_example, tokenizer=tokenizer, label_vocab=label_vocab)

train_dataset.map(trans_func)
dev_dataset.map(trans_func)
test_dataset.map(trans_func)

<paddlenlp.datasets.dataset.MapDataset at 0x7f83ce011ed0>

In [15]:
for idx, example in enumerate(train_dataset):
    if idx < 3:
        print(example)

([101, 5728, 5728, 1968, 1968, 5314, 2111, 2094, 809, 928, 818, 1469, 3309, 2521, 8024, 2400, 684, 2341, 1975, 1765, 794, 2111, 2094, 1213, 2792, 5543, 1350, 4638, 3189, 2382, 4495, 3833, 2207, 752, 1139, 1355, 8024, 6375, 2111, 2094, 1762, 1227, 1220, 704, 860, 7741, 4495, 3833, 4638, 5736, 680, 727, 511, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 54, [6, 0, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6])
([101, 6947, 7471, 2255, 779, 5632, 2831, 772, 897, 7218, 1469, 2825, 3318, 3121, 6863, 8024, 1199, 2600, 5307, 4415, 1146, 5052, 6568, 1218, 510, 1400, 1249, 1469, 4495, 772, 6444, 2428, 8024, 1071, 800, 782, 6566, 6569, 2825, 3318, 4664, 4719, 511, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

### Data loading: building the DataLoader

The `paddle.io.DataLoader` interface batches the data and can load it asynchronously.

In [16]:
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddle.io import BatchSampler
from paddlenlp.data import Pad
from paddlenlp.data import Stack
from paddlenlp.data import Tuple


pad_label_id = -1
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type_ids
    Stack(), # seq_len
    Pad(axis=0, pad_val=pad_label_id)  # labels
): [data for data in fn(samples)]

train_batch_sampler = DistributedBatchSampler(
    dataset=train_dataset, 
    batch_size=32, 
    shuffle=True
)
train_data_loader = DataLoader(
    dataset=train_dataset, 
    batch_sampler=train_batch_sampler, 
    collate_fn=batchify_fn, 
    return_list=True
)

dev_batch_sampler = BatchSampler(
    dataset=dev_dataset,
    batch_size=32,
    shuffle=False
)
dev_data_loader = DataLoader(
    dataset=dev_dataset,
    batch_sampler=dev_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True
)

test_batch_sampler = BatchSampler(
    dataset=test_dataset,
    batch_size=16,
    shuffle=False
)
test_data_loader = DataLoader(
    dataset=test_dataset,
    batch_sampler=test_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True
)
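
Samples in a batch have different lengths, so `batchify_fn` pads them to a common length before stacking. As a rough illustration of the idea, the plain-Python sketch below pads two short samples. The helper `pad_batch` is hypothetical and stands in for what `Pad` is used for above; the pad value 0 for `input_ids` is an assumption about the `[PAD]` ID of `bert-base-chinese`.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Two toy samples: [CLS] + characters + [SEP], with matching label IDs.
input_ids = [[101, 5728, 102], [101, 6947, 7471, 2255, 102]]
labels = [[6, 0, 6], [6, 0, 1, 1, 6]]

padded_ids = pad_batch(input_ids, pad_val=0)    # assumed [PAD] token ID
padded_labels = pad_batch(labels, pad_val=-1)   # same value as pad_label_id

print(padded_ids)     # [[101, 5728, 102, 0, 0], [101, 6947, 7471, 2255, 102]]
print(padded_labels)  # [[6, 0, 6, -1, -1], [6, 0, 1, 1, 6]]
```

Padding the labels with -1 matters later: the loss is created with `ignore_index=pad_label_id`, so padded positions do not contribute to training.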


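## Loading a pretrained model with one line of PaddleNLP code


Express waybill information extraction is essentially a sequence labeling task. PaddleNLP has built-in fine-tuning networks for downstream tasks on top of each pretrained model. The text below describes the ERNIE classes; the code in this notebook uses the BERT counterpart, `BertForTokenClassification`, in the same way.

`paddlenlp.transformers.ErnieForTokenClassification()` loads, in a single line, the fine-tuning network that adapts ERNIE to sequence labeling. It appends a fully connected layer to the ERNIE model to classify each token.

`paddlenlp.transformers.ErnieForTokenClassification.from_pretrained()` only needs the name of the model and the number of label classes to define the network.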

In [17]:
from paddlenlp.transformers import BertForTokenClassification


model = BertForTokenClassification.from_pretrained('bert-base-chinese', num_classes=len(label_vocab))

[2021-06-12 04:38:43,988] [    INFO] - Downloading http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-base-chinese.pdparams and saved to /home/aistudio/.paddlenlp/models/bert-base-chinese
[2021-06-12 04:38:44,019] [    INFO] - Downloading bert-base-chinese.pdparams from http://paddlenlp.bj.bcebos.com/models/transformers/bert/bert-base-chinese.pdparams
100%|██████████| 696494/696494 [00:14<00:00, 47208.45it/s]


PaddleNLP supports not only ERNIE but also BERT, RoBERTa, Electra, and other pretrained models.
You can use the models provided by PaddleNLP for text classification, sequence labeling, question answering, and other tasks. Pretrained weights are available for many models, including more than twenty Chinese language models. The Chinese pretrained models include `bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge`, among others.

For more pretrained models, see the [PaddleNLP Transformer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/transformers.md).

For more examples of fine-tuning pretrained models on downstream tasks, see the [examples](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples).

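## Fine-tuning optimization strategy and model configuration
For Transformer models such as ERNIE and BERT, the learning rate schedule typically used for fine-tuning is a dynamic learning rate with warmup.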

<p align="center">
<img src="https://ai-studio-static-online.cdn.bcebos.com/2bc624280a614a80b5449773192be460f195b13af89e4e5cbaf62bf6ac16de2c" width="40%" height="30%"/> <br />
</p><br><center>Figure 4: Dynamic learning rate schedule</center></br>
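
The shape of such a schedule can be sketched in a few lines of plain Python. This is the usual formulation of linear warmup followed by linear decay and is only assumed to match what `LinearDecayWithWarmup` computes; the function name and the step counts are chosen for illustration.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative numbers: 1000 training steps, 10% of them used for warmup.
for step in (0, 50, 100, 550, 1000):
    lr = linear_decay_with_warmup(step, base_lr=2e-5, total_steps=1000, warmup_steps=100)
    print(step, lr)
```

With `warmup_steps=0` the ramp is skipped and the rate simply decays linearly from `base_lr`.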



In [20]:
from paddle.nn import CrossEntropyLoss
from paddle.optimizer import AdamW
from paddlenlp.metrics import ChunkEvaluator

from paddlenlp.transformers import LinearDecayWithWarmup

epochs = 5
num_training_steps = len(train_data_loader) * epochs

# Define the learning rate scheduler, which adjusts lr during training
lr_scheduler = LinearDecayWithWarmup(2e-5, num_training_steps, 0.0)

# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=0.0,
    apply_decay_param_fun=lambda x: x in decay_params)

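# Use cross-entropy as the loss function; padded label positions are ignored
loss_fn = CrossEntropyLoss(ignore_index=pad_label_id)

# Evaluate with chunk-level (entity-level) precision, recall, and F1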
metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=False)
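
Chunk-level evaluation scores whole entities rather than single tokens: a predicted entity counts as correct only if its boundaries and type both match the gold annotation. The sketch below shows the idea in plain Python. The helpers `extract_chunks` and `chunk_prf` are hypothetical and simplified (orphan I- tags are ignored); they are not the `ChunkEvaluator` implementation.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) entity chunks in a BIO sequence."""
    chunks, start, chunk_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continues:
            chunks.add((start, i, chunk_type))
            start, chunk_type = None, None
        if tag.startswith("B-"):
            start, chunk_type = i, tag[2:]
    return chunks


def chunk_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one sentence."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    n_correct = len(gold & pred)
    precision = n_correct / len(pred) if pred else 0.0
    recall = n_correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]  # the location is cut short
print(chunk_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Four of the five tokens are tagged correctly here, yet only one of the two entities is, which is why entity-level scores are stricter than token accuracy.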


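## Model training and evaluation


Training a model usually involves the following steps:

1. Fetch one batch of data from the dataloader.
2. Feed the batch to the model for the forward pass.
3. Pass the forward output to the loss function to compute the loss, and to the metric to compute the evaluation scores.
4. Backpropagate the loss and update the parameters. Repeat the steps above.

During training, the program evaluates the current model on the dev set every `evaluate_every_step` steps (100 here).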

In [21]:
import paddle
import numpy as np


@paddle.no_grad()
def evaluate(model,loss_fn, metric, data_loader):
    model.eval()
    metric.reset()

    losses = []

    for batch_data in data_loader:

        input_ids, token_type_ids, length, labels = batch_data

        logits = model(input_ids, token_type_ids)

        loss = loss_fn(logits, labels)
        losses.append(loss.numpy())

        preds = paddle.argmax(logits, axis=-1)
        n_infer, n_label, n_correct = metric.compute(None, length, preds, labels)
        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())

    # Accumulate once, after all batches have been counted
    precision, recall, f1_score = metric.accumulate()

    model.train()

    return np.mean(losses), precision, recall, f1_score


In [23]:
import paddle
import time
import os


global_step = 0

best_f1_score = 0.0
best_precission = 0.0
best_recall = 0.0

print_every_step = 10
evaluate_every_step = 100

save_dir = os.path.join('./data', 'checkpoints')

if not os.path.exists(save_dir):
    os.makedirs(save_dir)
save_param_path = os.path.join(save_dir, 'best_model_state.pdparams')

tic_train = time.time()

for epoch in range(epochs):
    for step, batch_data in enumerate(train_data_loader):
        
        input_ids, token_type_ids, length, labels = batch_data
        logits = model(input_ids, token_type_ids)

        loss = loss_fn(logits, labels)

        preds = paddle.argmax(logits, axis=-1)
        n_infer, n_label, n_correct = metric.compute(None, length, preds, labels)
        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
        precission, recall, f1_score = metric.accumulate()

        global_step += 1

        if global_step % print_every_step == 0:
            print('global_step: %d, epoch: %d, batch: %d, loss: %.5f, precission: %.5f, recall: %.5f, f1_score: %.5f, speed: %.2f step/s' % (
                global_step, epoch, step, loss.numpy(), precission, recall, f1_score, print_every_step / (time.time() - tic_train)
            ))
            tic_train = time.time()
        
        loss.backward()
        optimizer.step()
        lr_scheduler.step()  # advance the learning rate schedule once per step
        optimizer.clear_grad()

        if global_step % evaluate_every_step == 0:
            loss, precission, recall, f1_score = evaluate(model, loss_fn, metric, dev_data_loader)
            metric.reset()  # drop the dev statistics before training resumes
            print('eval dev loss: %.5f, precission: %.5f, recall: %.5f, f1_score: %.5f' % (
                loss, precission, recall, f1_score
            ))
            if f1_score > best_f1_score and precission > best_precission and recall > best_recall:
                
                best_f1_score = f1_score
                best_precission = precission 
                best_recall = recall

                print('save model at global step : %d, best_precission: %.5f, best_recall: %.5f, best val f1_score: %.5f' % (
                    global_step, best_precission, best_recall, best_f1_score
                ))

                paddle.save(model.state_dict(), save_param_path)
                tokenizer.save_pretrained(save_dir)



global_step: 10, epoch: 0, batch: 9, loss: 0.38094, precission: 0.00038, recall: 0.00223, f1_score: 0.00065, speed: 4.26 step/s
global_step: 20, epoch: 0, batch: 19, loss: 0.38861, precission: 0.00038, recall: 0.00108, f1_score: 0.00056, speed: 3.80 step/s
global_step: 30, epoch: 0, batch: 29, loss: 0.23985, precission: 0.00212, recall: 0.00422, f1_score: 0.00283, speed: 4.78 step/s
global_step: 40, epoch: 0, batch: 39, loss: 0.24867, precission: 0.01549, recall: 0.02669, f1_score: 0.01960, speed: 4.64 step/s
global_step: 50, epoch: 0, batch: 49, loss: 0.10196, precission: 0.07232, recall: 0.11760, f1_score: 0.08956, speed: 4.50 step/s
global_step: 60, epoch: 0, batch: 59, loss: 0.10404, precission: 0.12995, recall: 0.20321, f1_score: 0.15853, speed: 4.02 step/s
global_step: 70, epoch: 0, batch: 69, loss: 0.12064, precission: 0.18256, recall: 0.27677, f1_score: 0.22000, speed: 3.62 step/s
global_step: 80, epoch: 0, batch: 79, loss: 0.11108, precission: 0.21901, recall: 0.32335, f1_scor

|dataset|Precision| Recall|F1-Score|
|---|---|---|---|
|train|96.340|97.002|96.67|
|dev|95.783|96.514|96.147|
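
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the last column can be recomputed from the first two. The snippet below does this for the values in the table; the helper name `f1_from` is chosen here for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above, in percent.
reported = [("train", 96.340, 97.002), ("dev", 95.783, 96.514)]
for name, precision, recall in reported:
    print(name, round(f1_from(precision, recall), 3))
```

The recomputed values agree with the reported F1 scores up to rounding.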

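## Model prediction

Once the trained model has been saved, it can be used for prediction. The example code below defines the prediction data and calls the `predict()` function to run prediction in one step.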

In [35]:
test_batch_sampler = BatchSampler(
    dataset=test_dataset,
    batch_size=2,
    shuffle=False
)
test_data_loader = DataLoader(
    dataset=test_dataset,
    batch_sampler=test_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True
)

In [37]:
@paddle.no_grad()  # no gradients are needed at inference time
def predict(model, data_loader, ds, label_vocab):
    model.eval()  # disable dropout so that predictions are deterministic
    pred_list = []
    len_list = []
    for input_ids, seg_ids, lens, labels in data_loader:
        logits = model(input_ids, seg_ids)
        pred = paddle.argmax(logits, axis=-1)
        pred_list.append(pred.numpy())
        len_list.append(lens.numpy())
    preds = parse_decodes(ds, pred_list, len_list, label_vocab)
    return preds

def parse_decodes(ds, decodes, lens, label_vocab):
    decodes = [x for batch in decodes for x in batch]  # [[]]
    lens = [x for batch in lens for x in batch]        # []
    id_label = dict(zip(label_vocab.values(), label_vocab.keys()))

    outputs = []
    for idx, end in enumerate(lens):
        sent = ds.data[idx]['tokens'][:end]
        tags = [id_label[x] for x in decodes[idx][1:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            if t.startswith('B-') or t == 'O':
                if len(words):
                    sent_out.append(words)
                if t.startswith('B-'):
                    tags_out.append(t.split('-')[1])
                else:
                    tags_out.append(t)
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join(
            [str((s, t)) for s, t in zip(sent_out, tags_out)]))
    return outputs



In [36]:

preds = predict(model, test_data_loader, test_dataset, label_vocab)
file_path = "msra_ner_results.txt"
with open(file_path, "w", encoding="utf8") as fout:
    fout.write("\n".join(preds))
# Print some examples
print(
    "The results have been saved in the file: %s, some examples are shown below: "
    % file_path)
print("\n".join(preds[:10]))

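# Going further with a CRF

PaddleNLP provides a CRF layer that learns the relationships between labels, which helps the model learn and predict better on sequence labeling tasks.

The PaddleNLP repository contains an [example](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/information_extraction/waybill_ie/run_ernie_crf.py) that you can follow to extract express waybill information with an ERNIE-CRF architecture.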
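
To see why label relationships help, the sketch below decodes a sequence with the Viterbi algorithm, which is how a CRF picks the best label path from per-token scores plus label-to-label transition scores. It is a plain-Python illustration, not the PaddleNLP CRF layer: all scores are hand-written assumptions, whereas a real CRF learns the transition scores during training.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final label to the start.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B-PER", "I-PER", "O"]
# Hand-written per-token scores for a three-character sentence.
emissions = [[1.0, 0.0, 1.2], [0.0, 2.0, 0.5], [0.0, 0.0, 2.0]]
# All transitions are neutral except O -> I-PER, which is invalid in BIO.
transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -100.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
best = viterbi_decode(emissions, transitions)
print([labels[i] for i in greedy])  # ['O', 'I-PER', 'O']
print([labels[i] for i in best])    # ['B-PER', 'I-PER', 'O']
```

Per-token argmax produces an I-PER that follows O, which is not a valid BIO sequence, while the transition penalty steers Viterbi to a consistent path.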

# Join the discussion group and learn together

Join the course QQ group now to discuss NLP techniques!

<img src="https://ai-studio-static-online.cdn.bcebos.com/d953727af0c24a7c806ab529495f0904f22f809961be420b8c88cdf59b837394" width="200" height="250" >