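# Assignment

- Complete the missing code in the program, understand what it does, and get the whole project running;
- Sign up for the [Qianyan (千言) dataset: information extraction competition](https://aistudio.baidu.com/aistudio/competition/detail/46).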

# GitHub assignment link and requirements screenshot

<b>Github: https://github.com/libertatis/PaddleNLP-Learning</b>

![](https://ai-studio-static-online.cdn.bcebos.com/ecdbaf6026a749bf8e438ec633d2265b1af283d5c7d84dc78624c87fe90b6e44)


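# Entity-relation extraction with a pretrained model

Information extraction aims to extract structured knowledge, such as entities, relations, and events, from unstructured natural-language text. Given a natural-language sentence and a predefined set of schemas, the task is to extract every SPO (subject, predicate, object) triple that satisfies the schema constraints.

For example, the schema for the 「妻子」 (wife) relation, whose subject and object are both of type 人物 (person), is defined as: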
{      
    S_TYPE: 人物,        
    P: 妻子,      
    O_TYPE: {      
        @value: 人物       
    }       
}        
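
As a minimal sketch of what "satisfying the schema constraint" means, the snippet below checks a candidate triple against the schema above. The dictionary layout and the field names `subject_type` and `object_type` are assumptions made for illustration, not necessarily the dataset's exact format.

```python
# Hypothetical in-memory form of the schema above, keyed by predicate.
SCHEMA = {
    "妻子": {"S_TYPE": "人物", "O_TYPE": {"@value": "人物"}},
}


def satisfies_schema(spo, schema=SCHEMA):
    """Return True if the triple's predicate is known and its types match."""
    rule = schema.get(spo["predicate"])
    if rule is None:
        return False
    return (spo["subject_type"] == rule["S_TYPE"]
            and spo["object_type"] == rule["O_TYPE"])


# Illustrative triple with placeholder entity names.
triple = {
    "subject": "A", "subject_type": "人物",
    "predicate": "妻子",
    "object": {"@value": "B"}, "object_type": {"@value": "人物"},
}
print(satisfies_schema(triple))
```

A triple whose predicate is missing from the schema, or whose subject or object type differs from the schema entry, is rejected.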

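This example shows how to use PaddleNLP to quickly build an entity-relation extraction system and enter the leaderboard of the [Qianyan information extraction: relation extraction competition](https://aistudio.baidu.com/aistudio/competition/detail/46).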




In [2]:
# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp

%cd relation_extraction/

Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already up-to-date: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.0.2)
/home/aistudio/relation_extraction


In [3]:
!ls -l

total 92
drwxr-xr-x 2 aistudio aistudio  4096 Jun 14 17:37 checkpoints
drwxr-xr-x 3 aistudio aistudio  4096 Jun 14 17:37 data
-rw-r--r-- 1 aistudio aistudio 13221 Jun 13 19:00 data_loader.py
-rw-r--r-- 1 aistudio aistudio  4675 Jun 13 19:00 extract_chinese_and_punct.py
-rw-r--r-- 1 aistudio aistudio   333 Jun 14 18:05 predict.sh
drwxr-xr-x 2 aistudio aistudio  4096 Jun 14 06:31 __pycache__
-rw-r--r-- 1 aistudio aistudio  6626 Jun 13 19:00 README.md
-rw-r--r-- 1 aistudio aistudio  9849 Jun 13 19:00 re_official_evaluation.py
-rw-r--r-- 1 aistudio aistudio 13477 Jun 14 12:28 run_duie.py
-rw-r--r-- 1 aistudio aistudio   640 Jun 13 19:00 train.sh
-rw-r--r-- 1 aistudio aistudio  8308 Jun 13 19:00 utils.py


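## Introduction to relation extraction

The DuIE2.0 task requires extracting multiple, possibly overlapping SPO triples from one sentence, so the competition extends the standard 'BIO' tagging scheme.
Each token receives one of three kinds of tags (B, I, or O) according to its position in an entity span, and the B tag is further split by the predicate the entity takes part in. Given a schema set with N distinct predicates, and the two roles an entity can play (head entity or tail entity), there are 2*N distinct B tags. Adding the I tag and the O tag gives (2*N+2) labels per token, as shown in the figure below.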


<div align="center">
<img src="https://ai-studio-static-online.cdn.bcebos.com/f984664777b241a9b43ef843c9b752f33906c8916bc146a69f7270b5858bee63" width="500" height="400" alt="Tagging scheme" align=center />
</div>
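
The label count can be checked with a few lines of code. This is a toy sketch: the two-predicate label map is invented for illustration, and the tag names and their ordering are assumptions rather than the layout used by `data_loader.py`.

```python
def num_token_labels(num_predicates):
    """2 B-tags per predicate (head/tail entity) plus one I tag and one O tag."""
    return 2 * num_predicates + 2


# Toy label map with two non-predicate entries ("O", "I") and N = 2 predicates.
toy_label_map = {"O": 0, "I": 1, "妻子": 2, "丈夫": 3}
n_predicates = len(toy_label_map) - 2

# One possible (assumed) enumeration of the tag set.
labels = ["O", "I"]
for predicate in list(toy_label_map)[2:]:
    labels.append("B-" + predicate + "-head")
    labels.append("B-" + predicate + "-tail")

print(num_token_labels(n_predicates), labels)
```

The same arithmetic appears in the notebook as `(len(label_map) - 2) * 2 + 2`, where the `- 2` removes the non-predicate entries before doubling.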

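### Evaluation

The SPO triples output by a participating system on the test set are matched exactly against the manually annotated SPO triples, and F1 is used as the evaluation metric. Note that for an SPO with a complex O value type, every slot must match exactly for the SPO to count as correct. Because some texts refer to entities by aliases, an alias dictionary from the Baidu knowledge graph is used to assist the evaluation. F1 is computed as follows:

F1 = (2 * P * R) / (P + R), where

- P = number of correctly predicted SPOs over all test sentences / number of predicted SPOs over all test sentences
- R = number of correctly predicted SPOs over all test sentences / number of manually annotated SPOs over all test sentences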
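
The formulas above can be sketched as follows. This is a simplified illustration, not the official `re_official_evaluation.py`: it represents each SPO as a plain tuple, and it ignores the alias dictionary and complex object slots.

```python
def precision_recall_f1(predicted, gold):
    """predicted / gold: one set of (subject, predicate, object) tuples per sentence."""
    correct = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two toy sentences with placeholder entity names.
gold = [{("A", "妻子", "B"), ("B", "丈夫", "A")}, {("C", "妻子", "D")}]
pred = [{("A", "妻子", "B")}, {("C", "妻子", "D"), ("C", "妻子", "E")}]
p, r, f1 = precision_recall_f1(pred, gold)
print(p, r, f1)
```

Counts are accumulated over all sentences before dividing, so sentences with many SPOs weigh more than sentences with few.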

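### Step 1: Build the model

The task can be treated as a sequence labeling task, so the baseline uses an ERNIE sequence labeling model.

**PaddleNLP provides ERNIE pretrained models together with commonly used sequence labeling heads, which can be loaded in one call by specifying the model name. To simplify data processing, PaddleNLP also ships a matching Tokenizer for each pretrained model, which handles tokenization, conversion to token IDs, truncation to a maximum length, and similar operations.**

Calling the tokenizer on the text directly produces the input data the model needs.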



In [None]:
import os
import sys
import json

import paddle
from paddlenlp.transformers import ErnieGramTokenizer
from paddlenlp.transformers import ErnieGramForTokenClassification


label_map_path = os.path.join('data', "predicate2id.json")

if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)):
    sys.exit("{} does not exist or is not a file.".format(label_map_path))
with open(label_map_path, 'r', encoding='utf8') as fp:
    label_map = json.load(fp)
    
num_classes = (len(label_map.keys()) - 2) * 2 + 2

# Fill in the code: understand the TokenClassification interface, the relation extraction tagging scheme, and where the number of classes comes from
model = ErnieGramForTokenClassification.from_pretrained('ernie-gram-zh', num_classes=num_classes)
tokenizer = ErnieGramTokenizer.from_pretrained("ernie-gram-zh")

inputs = tokenizer(text="请输入测试样例", max_seq_len=20)

[2021-06-14 12:51:00,208] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_gram_zh/ernie_gram_zh.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-gram-zh
[2021-06-14 12:51:00,211] [    INFO] - Downloading ernie_gram_zh.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_gram_zh/ernie_gram_zh.pdparams
100%|██████████| 583566/583566 [00:15<00:00, 38382.83it/s]
[2021-06-14 12:51:25,475] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_gram_zh/vocab.txt
100%|██████████| 78/78 [00:00<00:00, 3079.81it/s]


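### Step 2: Load and process the data


Download the dataset from the competition website, extract it into the data/ directory, and rename the files to train.json, dev.json, and test.json, which are the names that the code in this notebook and predict.sh load.

A custom dataset can be loaded by subclassing [`paddle.io.Dataset`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/Dataset_cn.html#dataset) and implementing the two methods `__getitem__` and `__len__`.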


In [None]:
from typing import Dict
from typing import List
from typing import Optional
from typing import Union


import numpy as np
import paddle
from tqdm import tqdm
from paddlenlp.utils.log import logger

from data_loader import DataCollator
from data_loader import convert_example_to_feature
from data_loader import parse_label
from extract_chinese_and_punct import ChineseAndPunctuationExtractor


class DuIEDataset(paddle.io.Dataset):
    """
    Dataset of DuIE.
    """

    def __init__(
            self,
            input_ids: List[Union[List[int], np.ndarray]],
            seq_lens: List[Union[List[int], np.ndarray]],
            tok_to_orig_start_index: List[Union[List[int], np.ndarray]],
            tok_to_orig_end_index: List[Union[List[int], np.ndarray]],
            labels: List[Union[List[int], np.ndarray, List[str], List[Dict]]]):
        super(DuIEDataset, self).__init__()

        self.input_ids = input_ids
        self.seq_lens = seq_lens
        self.tok_to_orig_start_index = tok_to_orig_start_index
        self.tok_to_orig_end_index = tok_to_orig_end_index
        self.labels = labels

    def __len__(self):
        if isinstance(self.input_ids, np.ndarray):
            return self.input_ids.shape[0]
        else:
            return len(self.input_ids)

    def __getitem__(self, item):
        return {
            "input_ids": np.array(self.input_ids[item]),
            "seq_lens": np.array(self.seq_lens[item]),
            "tok_to_orig_start_index":
            np.array(self.tok_to_orig_start_index[item]),
            "tok_to_orig_end_index": np.array(self.tok_to_orig_end_index[item]),
            # If model inputs is generated in `collate_fn`, delete the data type casting.
            "labels": np.array(
                self.labels[item], dtype=np.float32),
        }

    @classmethod
    def from_file(cls,
                  file_path: Union[str, os.PathLike],
                  tokenizer: ErnieGramTokenizer,
                  max_length: Optional[int]=512,
                  pad_to_max_length: Optional[bool]=None):
        
        assert os.path.exists(file_path) and os.path.isfile(
            file_path), f"{file_path} does not exist or is not a file."
        label_map_path = os.path.join(os.path.dirname(file_path), "predicate2id.json")
        assert os.path.exists(label_map_path) and os.path.isfile(
            label_map_path), f"{label_map_path} does not exist or is not a file."
        with open(label_map_path, 'r', encoding='utf8') as fp:
            label_map = json.load(fp)
        
        chineseandpunctuationextractor = ChineseAndPunctuationExtractor()

        input_ids, seq_lens, \
        tok_to_orig_start_index, \
        tok_to_orig_end_index, labels = ([] for _ in range(5))

        logger.info("Preprocessing data, loaded from %s" % file_path)
        
        with open(file_path, "r", encoding="utf-8") as fp:
            lines = fp.readlines()
            for line in tqdm(lines):
                example = json.loads(line)
                input_feature = convert_example_to_feature(
                    example=example, 
                    tokenizer=tokenizer, 
                    chineseandpunctuationextractor=chineseandpunctuationextractor,
                    label_map=label_map, 
                    max_length=max_length, 
                    pad_to_max_length=pad_to_max_length
                )
                input_ids.append(input_feature.input_ids)
                seq_lens.append(input_feature.seq_len)
                tok_to_orig_start_index.append(
                    input_feature.tok_to_orig_start_index
                )
                tok_to_orig_end_index.append(
                    input_feature.tok_to_orig_end_index
                )
                labels.append(input_feature.labels)

        return cls(
            input_ids, 
            seq_lens, 
            tok_to_orig_start_index,
            tok_to_orig_end_index, labels
        )


In [None]:
data_path = 'data'
batch_size = 32
max_seq_length = 128

# train Dataset
train_file_path = os.path.join(data_path, 'train.json')
train_dataset = DuIEDataset.from_file(
    file_path=train_file_path, 
    tokenizer=tokenizer, 
    max_length=max_seq_length, 
    pad_to_max_length=True
)
# train BatchSampler
train_batch_sampler = paddle.io.BatchSampler(
    dataset=train_dataset, 
    batch_size=batch_size, 
    shuffle=True, 
    drop_last=True
)
# train DataLoader
collator = DataCollator()
train_data_loader = paddle.io.DataLoader(
    dataset=train_dataset,
    batch_sampler=train_batch_sampler,
    collate_fn=collator
)

# dev Dataset
eval_file_path = os.path.join(data_path, 'dev.json')
test_dataset = DuIEDataset.from_file(
    file_path=eval_file_path, 
    tokenizer=tokenizer, 
    max_length=max_seq_length, 
    pad_to_max_length=True
)
# dev BatchSampler
test_batch_sampler = paddle.io.BatchSampler(
    dataset=test_dataset, 
    batch_size=batch_size, 
    shuffle=False, 
    drop_last=True
)
# dev DataLoader
test_data_loader = paddle.io.DataLoader(
    dataset=test_dataset,
    batch_sampler=test_batch_sampler,
    collate_fn=collator
)

[2021-06-14 12:52:28,550] [    INFO] - Preprocessing data, loaded from data/train.json
100%|██████████| 171293/171293 [05:16<00:00, 540.79it/s]
[2021-06-14 12:57:45,740] [    INFO] - Preprocessing data, loaded from data/dev.json
100%|██████████| 20674/20674 [00:37<00:00, 548.41it/s]


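### Step 3: Define the loss function and optimizer, then train

We use binary cross-entropy computed from logits (see `BCELossForDuIE` below, which masks out positions whose token IDs are 0, 1, or 2) as the loss function and [`paddle.optimizer.AdamW`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/adamw/AdamW_cn.html#adamw) as the optimizer.



During training, model checkpoints are saved to the checkpoints folder in the current directory. The official evaluation script is also run during training and reports P/R/F1.
F1 on the validation set can reach 69.42.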


In [None]:
import paddle.nn as nn

class BCELossForDuIE(nn.Layer):
    def __init__(self, ):
        super(BCELossForDuIE, self).__init__()
        self.criterion = nn.BCEWithLogitsLoss(reduction='none')

    def forward(self, logits, labels, mask):
        loss = self.criterion(logits, labels)
        mask = paddle.cast(mask, 'float32')
        loss = loss * mask.unsqueeze(-1)
        loss = paddle.sum(loss.mean(axis=2), axis=1) / paddle.sum(mask, axis=1)
        loss = loss.mean()
        return loss
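
To make the reduction order in `BCELossForDuIE` explicit, here is a dependency-free sketch of the same computation on nested lists. It is an illustration only: the helper names are made up, and it mirrors the class above (mean over classes, masked sum over tokens divided by the number of unmasked tokens, then mean over the batch) without using Paddle.

```python
import math


def bce_with_logits(x, y):
    """Numerically stable binary cross-entropy for one logit x and target y."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


def masked_bce_loss(logits, labels, mask):
    """logits/labels: [batch][seq][classes]; mask: [batch][seq] with 0/1 entries."""
    per_example = []
    for ex_logits, ex_labels, ex_mask in zip(logits, labels, mask):
        total = 0.0
        for tok_logits, tok_labels, m in zip(ex_logits, ex_labels, ex_mask):
            # Mean over the class axis, as in loss.mean(axis=2).
            tok_loss = sum(
                bce_with_logits(x, y) for x, y in zip(tok_logits, tok_labels)
            ) / len(tok_logits)
            total += tok_loss * m
        # Divide by the number of unmasked tokens in this example.
        per_example.append(total / sum(ex_mask))
    return sum(per_example) / len(per_example)


# One example, two tokens, two classes; the second token is masked out.
toy_logits = [[[0.0, 0.0], [5.0, -5.0]]]
toy_labels = [[[1.0, 0.0], [0.0, 1.0]]]
toy_mask = [[1, 0]]
toy_loss = masked_bce_loss(toy_logits, toy_labels, toy_mask)
print(toy_loss)
```

Because the masked token contributes nothing, the result depends only on the first token, whose zero logits give a loss of ln 2 per class.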

In [None]:
import paddle.nn.functional as F

from utils import decoding
from utils import get_precision_recall_f1
from utils import write_prediction_results

@paddle.no_grad()
def evaluate(model, criterion, data_loader, file_path, mode):
    """
    mode eval:
    evaluate on the development set and compute P/R/F1; called during training.
    mode predict:
    run on the development / test set, then write predictions to
        predictions.json and predictions.json.zip
        under /home/aistudio/relation_extraction/data for later submission or evaluation.
    """
    example_all = []
    with open(file_path, "r", encoding="utf-8") as fp:
        for line in fp:
            example_all.append(json.loads(line))
    
    id2spo_path = os.path.join(os.path.dirname(file_path), "id2spo.json")
    with open(id2spo_path, 'r', encoding='utf8') as fp:
        id2spo = json.load(fp)

    model.eval()

    loss_all = 0
    eval_steps = 0
    formatted_outputs = []
    current_idx = 0

    for batch in tqdm(data_loader, total=len(data_loader)):
        
        eval_steps += 1

        input_ids, seq_len, \
        tok_to_orig_start_index, \
        tok_to_orig_end_index, labels = batch

        logits = model(input_ids=input_ids)
        mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2))
        loss = criterion(logits, labels, mask)
        loss_all += loss.numpy().item()

        probs = F.sigmoid(logits)
        logits_batch = probs.numpy()

        seq_len_batch = seq_len.numpy()
        tok_to_orig_start_index_batch = tok_to_orig_start_index.numpy()
        tok_to_orig_end_index_batch = tok_to_orig_end_index.numpy()

        formatted_outputs.extend(
            decoding(
                example_all[current_idx: current_idx+len(logits)],
                id2spo,
                logits_batch,
                seq_len_batch,
                tok_to_orig_start_index_batch,
                tok_to_orig_end_index_batch
            )
        )
        current_idx = current_idx+len(logits)
        
    loss_avg = loss_all / eval_steps
    print("eval loss: %f" % (loss_avg))

    if mode == "predict":
        predict_file_path = os.path.join("/home/aistudio/relation_extraction/data", 'predictions.json')
    else:
        predict_file_path = os.path.join("/home/aistudio/relation_extraction/data", 'predict_eval.json')

    predict_zipfile_path = write_prediction_results(formatted_outputs, predict_file_path)

    if mode == "eval":
        precision, recall, f1 = get_precision_recall_f1(file_path, predict_zipfile_path)
        os.system('rm {} {}'.format(predict_file_path, predict_zipfile_path))
        return precision, recall, f1
    elif mode != "predict":
        raise Exception("wrong mode for eval func")
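
Since the model is trained with a per-label sigmoid, each token can carry several labels at once. The sketch below shows one way to turn logits into active label ids. The threshold of 0.5 and the function name are assumptions for illustration; the real `decoding` helper in `utils.py` also maps tokens back to character spans and assembles SPO triples, which is not shown here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def active_labels(token_logits, threshold=0.5):
    """Return, per token, the label ids whose probability exceeds the threshold."""
    return [
        [k for k, x in enumerate(logits_k) if sigmoid(x) > threshold]
        for logits_k in token_logits
    ]


# Two tokens, four labels each; the first token is active for two labels.
toy_token_logits = [[-3.0, -2.0, 4.0, 1.5], [2.0, -4.0, -1.0, -0.5]]
print(active_labels(toy_token_logits))
```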

In [None]:
from paddlenlp.transformers import LinearDecayWithWarmup


learning_rate = 2e-5
num_train_epochs = 12
warmup_ratio = 0.06

# Loss
criterion = BCELossForDuIE()

# Defines learning rate strategy.
steps_by_epoch = len(train_data_loader)
num_training_steps = steps_by_epoch * num_train_epochs
# Learning rate Scheduler
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_ratio)

# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
)
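
The shape of a linear-decay-with-warmup schedule can be sketched in plain Python. This is an approximation written for illustration, not PaddleNLP's implementation of `LinearDecayWithWarmup`; the function name and the rounding of the warmup length are assumptions. The step counts come from the training log (5352 steps per epoch, 12 epochs).

```python
import math


def linear_decay_with_warmup(step, base_lr, total_steps, warmup):
    """Learning rate at `step`; `warmup` is a ratio (float) or a step count (int)."""
    if isinstance(warmup, int):
        warmup_steps = warmup
    else:
        warmup_steps = int(math.floor(warmup * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


total_steps = 5352 * 12
for step in (0, 1926, 3853, 34038, 64224):
    print(step, linear_decay_with_warmup(step, 2e-5, total_steps, 0.06))
```

With a warmup ratio of 0.06, the rate in this sketch climbs for roughly the first 3853 steps, peaks at 2e-5, and then falls linearly to zero at the last step.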

In [None]:
# Directory where model parameters are saved
!mkdir -p checkpoints

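### Step 4: Submit predictions

Load the model checkpoint saved during training and run prediction.

**NOTE:** Be sure to set the path of the model parameters used for prediction.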

In [None]:
import time
import paddle.nn.functional as F

# Starts training.
global_step = 0
logging_steps = 100
save_steps = 5000
num_train_epochs = 12
output_dir = 'checkpoints'
tic_train = time.time()
model.train()
for epoch in range(num_train_epochs):
    print("\n=====start training of %d epochs=====" % epoch)
    tic_epoch = time.time()
    for step, batch in enumerate(train_data_loader):
        input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch
        logits = model(input_ids=input_ids)
        mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and(
            (input_ids != 2))
        loss = criterion(logits, labels, mask)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_gradients()
        loss_item = loss.numpy().item()

        if global_step % logging_steps == 0:
            print(
                "epoch: %d / %d, steps: %d / %d, loss: %f, speed: %.2f step/s"
                % (epoch, num_train_epochs, step, steps_by_epoch,
                    loss_item, logging_steps / (time.time() - tic_train)))
            tic_train = time.time()

        if global_step % save_steps == 0 and global_step != 0:
            print("\n=====start evaluating ckpt of %d steps=====" %
                    global_step)
            precision, recall, f1 = evaluate(
                model, criterion, test_data_loader, eval_file_path, "eval")
            print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" %
                    (100 * precision, 100 * recall, 100 * f1))
            print("saving checkpoing model_%d.pdparams to %s " %
                    (global_step, output_dir))
            paddle.save(model.state_dict(),
                        os.path.join(output_dir, 
                                        "model_%d.pdparams" % global_step))
            model.train()

        global_step += 1
    tic_epoch = time.time() - tic_epoch
    print("epoch time footprint: %d hour %d min %d sec" %
            (tic_epoch // 3600, (tic_epoch % 3600) // 60, tic_epoch % 60))

# Does final evaluation.
print("\n=====start evaluating last ckpt of %d steps=====" %
        global_step)
precision, recall, f1 = evaluate(model, criterion, test_data_loader,
                                    eval_file_path, "eval")
print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" %
        (100 * precision, 100 * recall, 100 * f1))
paddle.save(model.state_dict(),
            os.path.join(output_dir,
                            "model_%d.pdparams" % global_step))
print("\n=====training complete=====")


=====start training of 0 epochs=====
epoch: 0 / 12, steps: 0 / 5352, loss: 0.744464, speed: 202.94 step/s
epoch: 0 / 12, steps: 200 / 5352, loss: 0.508270, speed: 4.14 step/s
epoch: 0 / 12, steps: 300 / 5352, loss: 0.313107, speed: 4.13 step/s
epoch: 0 / 12, steps: 400 / 5352, loss: 0.236216, speed: 4.13 step/s
epoch: 0 / 12, steps: 500 / 5352, loss: 0.196758, speed: 4.13 step/s
epoch: 0 / 12, steps: 600 / 5352, loss: 0.168844, speed: 4.11 step/s
epoch: 0 / 12, steps: 700 / 5352, loss: 0.144960, speed: 4.10 step/s
epoch: 0 / 12, steps: 800 / 5352, loss: 0.123290, speed: 4.13 step/s
epoch: 0 / 12, steps: 900 / 5352, loss: 0.103853, speed: 4.08 step/s
epoch: 0 / 12, steps: 1000 / 5352, loss: 0.089236, speed: 4.14 step/s
epoch: 0 / 12, steps: 1100 / 5352, loss: 0.077180, speed: 4.13 step/s
epoch: 0 / 12, steps: 1200 / 5352, loss: 0.065765, speed: 4.14 step/s
epoch: 0 / 12, steps: 1300 / 5352, loss: 0.056916, speed: 4.11 step/s
epoch: 0 / 12, steps: 1400 / 5352, loss: 0.048526, speed: 4.1

100%|██████████| 646/646 [01:00<00:00, 10.73it/s]


eval loss: 0.004111
precision: 60.04	 recall: 35.59	 f1: 44.69	
saving checkpoing model_5000.pdparams to checkpoints 
epoch: 0 / 12, steps: 5100 / 5352, loss: 0.005252, speed: 1.07 step/s
epoch: 0 / 12, steps: 5200 / 5352, loss: 0.004375, speed: 4.13 step/s
epoch: 0 / 12, steps: 5300 / 5352, loss: 0.004112, speed: 4.05 step/s
epoch time footprint: 0 hour 22 min 47 sec

=====start training of 1 epochs=====
epoch: 1 / 12, steps: 48 / 5352, loss: 0.006314, speed: 4.07 step/s
epoch: 1 / 12, steps: 148 / 5352, loss: 0.003308, speed: 4.00 step/s
epoch: 1 / 12, steps: 248 / 5352, loss: 0.003674, speed: 4.13 step/s
epoch: 1 / 12, steps: 348 / 5352, loss: 0.003547, speed: 4.11 step/s
epoch: 1 / 12, steps: 448 / 5352, loss: 0.003775, speed: 4.11 step/s
epoch: 1 / 12, steps: 548 / 5352, loss: 0.003131, speed: 4.15 step/s
epoch: 1 / 12, steps: 648 / 5352, loss: 0.004093, speed: 4.09 step/s
epoch: 1 / 12, steps: 748 / 5352, loss: 0.003733, speed: 4.13 step/s
epoch: 1 / 12, steps: 848 / 5352, loss: 

100%|██████████| 646/646 [01:00<00:00, 10.60it/s]


eval loss: 0.002381
precision: 64.60	 recall: 63.21	 f1: 63.90	
saving checkpoing model_10000.pdparams to checkpoints 
epoch: 1 / 12, steps: 4748 / 5352, loss: 0.002981, speed: 1.05 step/s
epoch: 1 / 12, steps: 4848 / 5352, loss: 0.002436, speed: 4.10 step/s
epoch: 1 / 12, steps: 4948 / 5352, loss: 0.002184, speed: 4.11 step/s
epoch: 1 / 12, steps: 5048 / 5352, loss: 0.002903, speed: 4.10 step/s
epoch: 1 / 12, steps: 5148 / 5352, loss: 0.002977, speed: 4.07 step/s
epoch: 1 / 12, steps: 5248 / 5352, loss: 0.003068, speed: 4.04 step/s
epoch: 1 / 12, steps: 5348 / 5352, loss: 0.003506, speed: 4.03 step/s
epoch time footprint: 0 hour 22 min 58 sec

=====start training of 2 epochs=====
epoch: 2 / 12, steps: 96 / 5352, loss: 0.002204, speed: 4.06 step/s
epoch: 2 / 12, steps: 196 / 5352, loss: 0.002520, speed: 4.06 step/s
epoch: 2 / 12, steps: 296 / 5352, loss: 0.001738, speed: 4.02 step/s
epoch: 2 / 12, steps: 396 / 5352, loss: 0.001697, speed: 4.02 step/s
epoch: 2 / 12, steps: 496 / 5352, l

100%|██████████| 646/646 [01:01<00:00, 10.58it/s]


eval loss: 0.002238
precision: 61.74	 recall: 70.87	 f1: 65.99	
saving checkpoing model_15000.pdparams to checkpoints 
epoch: 2 / 12, steps: 4396 / 5352, loss: 0.002477, speed: 1.05 step/s
epoch: 2 / 12, steps: 4496 / 5352, loss: 0.003344, speed: 4.08 step/s
epoch: 2 / 12, steps: 4596 / 5352, loss: 0.001533, speed: 4.10 step/s
epoch: 2 / 12, steps: 4696 / 5352, loss: 0.002968, speed: 4.10 step/s
epoch: 2 / 12, steps: 4796 / 5352, loss: 0.002064, speed: 4.09 step/s
epoch: 2 / 12, steps: 4896 / 5352, loss: 0.001616, speed: 4.09 step/s
epoch: 2 / 12, steps: 4996 / 5352, loss: 0.001847, speed: 4.07 step/s
epoch: 2 / 12, steps: 5096 / 5352, loss: 0.002440, speed: 4.09 step/s
epoch: 2 / 12, steps: 5196 / 5352, loss: 0.002296, speed: 4.09 step/s
epoch: 2 / 12, steps: 5296 / 5352, loss: 0.003625, speed: 4.11 step/s
epoch time footprint: 0 hour 23 min 6 sec

=====start training of 3 epochs=====
epoch: 3 / 12, steps: 44 / 5352, loss: 0.001991, speed: 4.08 step/s
epoch: 3 / 12, steps: 144 / 5352,

100%|██████████| 646/646 [01:01<00:00, 10.54it/s]


eval loss: 0.002223
precision: 56.64	 recall: 75.31	 f1: 64.65	
saving checkpoing model_20000.pdparams to checkpoints 
epoch: 3 / 12, steps: 4044 / 5352, loss: 0.001771, speed: 1.04 step/s
epoch: 3 / 12, steps: 4144 / 5352, loss: 0.001909, speed: 4.01 step/s
epoch: 3 / 12, steps: 4244 / 5352, loss: 0.001293, speed: 4.01 step/s
epoch: 3 / 12, steps: 4344 / 5352, loss: 0.001260, speed: 4.01 step/s
epoch: 3 / 12, steps: 4444 / 5352, loss: 0.001909, speed: 4.03 step/s
epoch: 3 / 12, steps: 4544 / 5352, loss: 0.001677, speed: 4.10 step/s
epoch: 3 / 12, steps: 4644 / 5352, loss: 0.002507, speed: 4.10 step/s
epoch: 3 / 12, steps: 4744 / 5352, loss: 0.002020, speed: 4.11 step/s
epoch: 3 / 12, steps: 4844 / 5352, loss: 0.002141, speed: 4.15 step/s
epoch: 3 / 12, steps: 4944 / 5352, loss: 0.001549, speed: 4.08 step/s
epoch: 3 / 12, steps: 5044 / 5352, loss: 0.001560, speed: 4.08 step/s
epoch: 3 / 12, steps: 5144 / 5352, loss: 0.001358, speed: 4.10 step/s
epoch: 3 / 12, steps: 5244 / 5352, loss: 

100%|██████████| 646/646 [01:01<00:00, 10.43it/s]


eval loss: 0.002097
precision: 62.42	 recall: 72.18	 f1: 66.94	
saving checkpoing model_25000.pdparams to checkpoints 
epoch: 4 / 12, steps: 3692 / 5352, loss: 0.001687, speed: 1.03 step/s
epoch: 4 / 12, steps: 3792 / 5352, loss: 0.001492, speed: 4.09 step/s
epoch: 4 / 12, steps: 3892 / 5352, loss: 0.001914, speed: 4.13 step/s
epoch: 4 / 12, steps: 3992 / 5352, loss: 0.001211, speed: 4.12 step/s
epoch: 4 / 12, steps: 4092 / 5352, loss: 0.001887, speed: 4.13 step/s
epoch: 4 / 12, steps: 4192 / 5352, loss: 0.001293, speed: 4.10 step/s
epoch: 4 / 12, steps: 4292 / 5352, loss: 0.001368, speed: 4.11 step/s
epoch: 4 / 12, steps: 4392 / 5352, loss: 0.001580, speed: 4.13 step/s
epoch: 4 / 12, steps: 4492 / 5352, loss: 0.001692, speed: 4.11 step/s
epoch: 4 / 12, steps: 4592 / 5352, loss: 0.002473, speed: 4.12 step/s
epoch: 4 / 12, steps: 4692 / 5352, loss: 0.001635, speed: 4.10 step/s
epoch: 4 / 12, steps: 4792 / 5352, loss: 0.001872, speed: 4.14 step/s
epoch: 4 / 12, steps: 4892 / 5352, loss: 

100%|██████████| 646/646 [01:01<00:00, 10.45it/s]


eval loss: 0.002137
precision: 62.20	 recall: 73.73	 f1: 67.48	
saving checkpoing model_30000.pdparams to checkpoints 
epoch: 5 / 12, steps: 3440 / 5352, loss: 0.001758, speed: 4.04 step/s
epoch: 5 / 12, steps: 3540 / 5352, loss: 0.001161, speed: 4.03 step/s
epoch: 5 / 12, steps: 3640 / 5352, loss: 0.001489, speed: 4.03 step/s
epoch: 5 / 12, steps: 3740 / 5352, loss: 0.002413, speed: 4.05 step/s
epoch: 5 / 12, steps: 3840 / 5352, loss: 0.001611, speed: 4.04 step/s
epoch: 5 / 12, steps: 3940 / 5352, loss: 0.001594, speed: 4.04 step/s
epoch: 5 / 12, steps: 4040 / 5352, loss: 0.001453, speed: 4.03 step/s
epoch: 5 / 12, steps: 4140 / 5352, loss: 0.001401, speed: 4.06 step/s
epoch: 5 / 12, steps: 4240 / 5352, loss: 0.001436, speed: 4.04 step/s
epoch: 5 / 12, steps: 4340 / 5352, loss: 0.001072, speed: 4.04 step/s
epoch: 5 / 12, steps: 4440 / 5352, loss: 0.002107, speed: 4.02 step/s
epoch: 5 / 12, steps: 4540 / 5352, loss: 0.001223, speed: 4.05 step/s
epoch: 5 / 12, steps: 4640 / 5352, loss: 

100%|██████████| 646/646 [01:01<00:00, 10.43it/s]


eval loss: 0.002271
precision: 60.82	 recall: 76.28	 f1: 67.68	
saving checkpoing model_35000.pdparams to checkpoints 
epoch: 6 / 12, steps: 2988 / 5352, loss: 0.001169, speed: 1.03 step/s
epoch: 6 / 12, steps: 3088 / 5352, loss: 0.001444, speed: 4.07 step/s
epoch: 6 / 12, steps: 3188 / 5352, loss: 0.001378, speed: 4.03 step/s
epoch: 6 / 12, steps: 3288 / 5352, loss: 0.000606, speed: 4.05 step/s
epoch: 6 / 12, steps: 3488 / 5352, loss: 0.001637, speed: 4.08 step/s
epoch: 6 / 12, steps: 3588 / 5352, loss: 0.000640, speed: 4.07 step/s
epoch: 6 / 12, steps: 3688 / 5352, loss: 0.001274, speed: 4.09 step/s
epoch: 6 / 12, steps: 3788 / 5352, loss: 0.001535, speed: 4.08 step/s
epoch: 6 / 12, steps: 3888 / 5352, loss: 0.001255, speed: 4.08 step/s
epoch: 6 / 12, steps: 3988 / 5352, loss: 0.000838, speed: 4.06 step/s
epoch: 6 / 12, steps: 4088 / 5352, loss: 0.001374, speed: 4.06 step/s
epoch: 6 / 12, steps: 4188 / 5352, loss: 0.001600, speed: 4.04 step/s
epoch: 6 / 12, steps: 4288 / 5352, loss: 

100%|██████████| 646/646 [01:00<00:00, 10.60it/s]


eval loss: 0.002340
precision: 60.74	 recall: 75.61	 f1: 67.37	
saving checkpoing model_40000.pdparams to checkpoints 
epoch: 7 / 12, steps: 2636 / 5352, loss: 0.000892, speed: 1.05 step/s
epoch: 7 / 12, steps: 2736 / 5352, loss: 0.002022, speed: 4.13 step/s
epoch: 7 / 12, steps: 2836 / 5352, loss: 0.001166, speed: 4.10 step/s
epoch: 7 / 12, steps: 2936 / 5352, loss: 0.001355, speed: 4.15 step/s
epoch: 7 / 12, steps: 3036 / 5352, loss: 0.001363, speed: 4.16 step/s
epoch: 7 / 12, steps: 3136 / 5352, loss: 0.001914, speed: 4.11 step/s
epoch: 7 / 12, steps: 3236 / 5352, loss: 0.000735, speed: 4.14 step/s
epoch: 7 / 12, steps: 3336 / 5352, loss: 0.001152, speed: 4.12 step/s
epoch: 7 / 12, steps: 3436 / 5352, loss: 0.000925, speed: 4.16 step/s
epoch: 7 / 12, steps: 3536 / 5352, loss: 0.002039, speed: 4.14 step/s
epoch: 7 / 12, steps: 3636 / 5352, loss: 0.001439, speed: 4.12 step/s
epoch: 7 / 12, steps: 3736 / 5352, loss: 0.000926, speed: 4.16 step/s
epoch: 7 / 12, steps: 3836 / 5352, loss: 

100%|██████████| 646/646 [01:01<00:00, 10.59it/s]


eval loss: 0.002329
precision: 62.39	 recall: 75.04	 f1: 68.13	
saving checkpoing model_45000.pdparams to checkpoints 
epoch: 8 / 12, steps: 2284 / 5352, loss: 0.001252, speed: 1.05 step/s
epoch: 8 / 12, steps: 2384 / 5352, loss: 0.001372, speed: 4.15 step/s
epoch: 8 / 12, steps: 2484 / 5352, loss: 0.001230, speed: 4.15 step/s
epoch: 8 / 12, steps: 2584 / 5352, loss: 0.001131, speed: 4.18 step/s
epoch: 8 / 12, steps: 2684 / 5352, loss: 0.000817, speed: 4.10 step/s
epoch: 8 / 12, steps: 2784 / 5352, loss: 0.001098, speed: 4.14 step/s
epoch: 8 / 12, steps: 2884 / 5352, loss: 0.000908, speed: 4.15 step/s
epoch: 8 / 12, steps: 2984 / 5352, loss: 0.000800, speed: 4.16 step/s
epoch: 8 / 12, steps: 3084 / 5352, loss: 0.000513, speed: 4.18 step/s
epoch: 8 / 12, steps: 3184 / 5352, loss: 0.001152, speed: 4.16 step/s
epoch: 8 / 12, steps: 3284 / 5352, loss: 0.001025, speed: 4.15 step/s
epoch: 8 / 12, steps: 3384 / 5352, loss: 0.001106, speed: 4.18 step/s
epoch: 8 / 12, steps: 3484 / 5352, loss: 

100%|██████████| 646/646 [01:02<00:00, 10.39it/s]


eval loss: 0.002533
precision: 62.26	 recall: 75.64	 f1: 68.30	
saving checkpoing model_50000.pdparams to checkpoints 
epoch: 9 / 12, steps: 1932 / 5352, loss: 0.001091, speed: 1.03 step/s
epoch: 9 / 12, steps: 2032 / 5352, loss: 0.001153, speed: 4.07 step/s
epoch: 9 / 12, steps: 2132 / 5352, loss: 0.001275, speed: 4.05 step/s
epoch: 9 / 12, steps: 2232 / 5352, loss: 0.000759, speed: 4.07 step/s
epoch: 9 / 12, steps: 2332 / 5352, loss: 0.000609, speed: 4.07 step/s
epoch: 9 / 12, steps: 2432 / 5352, loss: 0.001258, speed: 4.02 step/s
epoch: 9 / 12, steps: 2532 / 5352, loss: 0.000748, speed: 4.06 step/s
epoch: 9 / 12, steps: 2632 / 5352, loss: 0.001124, speed: 4.02 step/s
epoch: 9 / 12, steps: 2732 / 5352, loss: 0.000946, speed: 4.06 step/s
epoch: 9 / 12, steps: 2832 / 5352, loss: 0.001167, speed: 4.08 step/s
epoch: 9 / 12, steps: 2932 / 5352, loss: 0.001401, speed: 4.03 step/s
epoch: 9 / 12, steps: 3032 / 5352, loss: 0.001138, speed: 4.02 step/s
epoch: 9 / 12, steps: 3132 / 5352, loss: 

100%|██████████| 646/646 [01:01<00:00, 10.43it/s]


eval loss: 0.002527
precision: 61.12	 recall: 76.11	 f1: 67.79	
saving checkpoing model_55000.pdparams to checkpoints 
epoch: 10 / 12, steps: 1680 / 5352, loss: 0.000837, speed: 4.05 step/s
epoch: 10 / 12, steps: 1780 / 5352, loss: 0.000787, speed: 4.16 step/s
epoch: 10 / 12, steps: 1880 / 5352, loss: 0.000778, speed: 4.17 step/s
epoch: 10 / 12, steps: 1980 / 5352, loss: 0.000749, speed: 4.16 step/s
epoch: 10 / 12, steps: 2080 / 5352, loss: 0.000603, speed: 4.21 step/s
epoch: 10 / 12, steps: 2180 / 5352, loss: 0.000807, speed: 4.12 step/s
epoch: 10 / 12, steps: 2280 / 5352, loss: 0.001261, speed: 4.16 step/s
epoch: 10 / 12, steps: 2380 / 5352, loss: 0.001749, speed: 4.14 step/s
epoch: 10 / 12, steps: 2480 / 5352, loss: 0.001250, speed: 4.13 step/s
epoch: 10 / 12, steps: 2580 / 5352, loss: 0.000929, speed: 4.13 step/s
epoch: 10 / 12, steps: 2680 / 5352, loss: 0.000821, speed: 4.12 step/s
epoch: 10 / 12, steps: 2780 / 5352, loss: 0.001921, speed: 4.20 step/s
epoch: 10 / 12, steps: 2880 /

100%|██████████| 646/646 [01:01<00:00, 10.57it/s]


eval loss: 0.002589
precision: 62.56	 recall: 75.80	 f1: 68.55	
saving checkpoing model_60000.pdparams to checkpoints 
epoch: 11 / 12, steps: 1228 / 5352, loss: 0.001184, speed: 1.05 step/s
epoch: 11 / 12, steps: 1328 / 5352, loss: 0.001324, speed: 4.08 step/s
epoch: 11 / 12, steps: 1428 / 5352, loss: 0.000902, speed: 4.04 step/s
epoch: 11 / 12, steps: 1528 / 5352, loss: 0.000969, speed: 4.04 step/s
epoch: 11 / 12, steps: 1628 / 5352, loss: 0.000848, speed: 4.05 step/s
epoch: 11 / 12, steps: 1728 / 5352, loss: 0.000628, speed: 4.07 step/s
epoch: 11 / 12, steps: 1828 / 5352, loss: 0.001002, speed: 4.06 step/s
epoch: 11 / 12, steps: 1928 / 5352, loss: 0.000728, speed: 4.06 step/s
epoch: 11 / 12, steps: 2028 / 5352, loss: 0.000603, speed: 4.08 step/s
epoch: 11 / 12, steps: 2128 / 5352, loss: 0.001005, speed: 4.07 step/s
epoch: 11 / 12, steps: 2228 / 5352, loss: 0.000832, speed: 4.07 step/s
epoch: 11 / 12, steps: 2328 / 5352, loss: 0.000931, speed: 4.07 step/s
epoch: 11 / 12, steps: 2428 /

100%|██████████| 646/646 [01:01<00:00, 10.48it/s]


eval loss: 0.002565
precision: 61.94	 recall: 76.13	 f1: 68.30	

=====training complete=====


In [23]:
!bash predict.sh

+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export BATCH_SIZE=32
+ BATCH_SIZE=32
+ export CKPT=./checkpoints/model_64224.pdparams
+ CKPT=./checkpoints/model_64224.pdparams
+ export DATASET_FILE=./data/test.json
+ DATASET_FILE=./data/test.json
+ python run_duie.py --do_predict --init_checkpoint ./checkpoints/model_64224.pdparams --predict_data_file ./data/test.json --max_seq_length 128 --batch_size 32 --device gpu
[2021-06-14 21:26:50,629] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-gram-zh/ernie_gram_zh.pdparams
W0614 21:26:50.630327 20241 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0614 21:26:50.635741 20241 device_context.cc:422] device: 0, cuDNN Version: 7.6.


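The predictions are saved to data/predictions.json and data/predictions.json.zip, in the same format as the original dataset files.

You can then use the official evaluation script to evaluate the trained model on dev_data.json, for example: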

```shell
python re_official_evaluation.py --golden_file=dev_data.json  --predict_file=predictions.json.zip [--alias_file alias_dict]
```
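The output metrics are Precision, Recall, and F1. The alias file contains the valid entity aliases; it is used in the final evaluation and is not provided here.

Finally, run prediction on test_data.json and submit the prediction result (the .zip file) to the [Qianyan evaluation page](https://aistudio.baidu.com/aistudio/competition/detail/46).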





## Qianyan dataset (information extraction): submission result and leaderboard screenshots

![](https://ai-studio-static-online.cdn.bcebos.com/5cb578f9f7b14e20ae92fc608d7f05db7d45e2d11af641c2a28dc55961eb68b7)

![](https://ai-studio-static-online.cdn.bcebos.com/1cd2574968864b3c8c45689d9b624bec17d46221059440adac98b16c416097c6)





## Tricks

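### Try more pretrained models

The baseline uses ERNIE (here `ernie-gram-zh`) as its pretrained model. PaddleNLP provides many other pretrained models, such as BERT, RoBERTa, Electra, and XLNet;
see the [pretrained model documentation](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html).

For example, you can switch to the Chinese RoBERTa large model to improve results; only the model and the tokenizer need to be replaced.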

In [None]:
from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer

model = RobertaForTokenClassification.from_pretrained(
    "roberta-wwm-ext-large",
    num_classes=(len(label_map) - 2) * 2 + 2)
tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large")

### Model ensembling

Train several models, run prediction with each, and fuse their predictions.
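
The fusion strategy is left open here. One simple option, shown purely as an illustration, is majority voting over the SPO triples predicted for a sentence; the function name and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def vote_spo(predictions_per_model, min_votes):
    """Keep an SPO triple if at least `min_votes` models predicted it."""
    votes = Counter()
    for model_prediction in predictions_per_model:
        # Each model votes at most once per triple.
        votes.update(set(model_prediction))
    return {spo for spo, n in votes.items() if n >= min_votes}


# Predictions of three models for one sentence (placeholder entity names).
model_a = {("A", "妻子", "B"), ("C", "妻子", "D")}
model_b = {("A", "妻子", "B")}
model_c = {("A", "妻子", "B"), ("C", "妻子", "E")}
fused = vote_spo([model_a, model_b, model_c], min_votes=2)
print(fused)
```

Raising `min_votes` tends to favor precision, while lowering it favors recall, so it can be tuned on the development set.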

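The baseline above is built on PaddleNLP. Maintaining open source is not easy, so your support is appreciated.
**Remember to give [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) a Star⭐ to keep up with the latest news and features.**

GitHub: [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)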
