# CLUE-CLUENER 细粒度命名实体识别

本数据是在清华大学开源的文本分类数据集THUCTC基础上，选出部分数据进行细粒度命名实体标注，原数据来源于Sina News RSS.

训练集：10748 验证集：1343

标签类别：
数据分为10个标签类别，分别为: 地址（address），书名（book），公司（company），游戏（game），政府（goverment），电影（movie），姓名（name），组织机构（organization），职位（position），景点（scene）

数据下载地址：https://github.com/CLUEbenchmark/CLUENER2020

排行榜地址：https://cluebenchmarks.com/ner.html

|模型|线上效果f1|
|------|------:|
|Bert-base|78.82|
|RoBERTa-wwm-large-ext|80.42|
|Bi-Lstm + CRF|70.00|

In [1]:
%reload_ext autoreload
%autoreload 2

## 1. 数据观察

In [2]:
import json
import pandas
from tqdm import tqdm
from loguru import logger
import numpy as np
from collections import Counter

seg_len=0
seg_backoff=0
fold = 0



In [3]:
train_file = './data/rawdata/train.json'
test_file = './data/rawdata/test.json'
dev_file = './data/rawdata/dev.json'

In [4]:
def load_json_data(json_file):
    rd = open(json_file, 'r')
    lines = rd.readlines()
    rd.close()
    json_data = []
    for line in tqdm(lines):
        line = line.strip()
        line_data = json.loads(line)
        json_data.append(line_data)
    print(f"Total: {len(json_data)}")
    print(json_data[:5])
    return json_data

In [5]:
train_data = load_json_data(train_file)
test_data = load_json_data(test_file)
dev_data = load_json_data(dev_file)

100%|██████████| 10748/10748 [00:00<00:00, 169188.25it/s]
100%|██████████| 1345/1345 [00:00<00:00, 319912.61it/s]
100%|██████████| 1343/1343 [00:00<00:00, 33401.63it/s]

Total: 10748
[{'text': '浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，', 'label': {'name': {'叶老桂': [[9, 11]]}, 'company': {'浙商银行': [[0, 3]]}}}, {'text': '生生不息CSOL生化狂潮让你填弹狂扫', 'label': {'game': {'CSOL': [[4, 7]]}}}, {'text': '那不勒斯vs锡耶纳以及桑普vs热那亚之上呢？', 'label': {'organization': {'那不勒斯': [[0, 3]], '锡耶纳': [[6, 8]], '桑普': [[11, 12]], '热那亚': [[15, 17]]}}}, {'text': '加勒比海盗3：世界尽头》的去年同期成绩死死甩在身后，后者则即将赶超《变形金刚》，', 'label': {'movie': {'加勒比海盗3：世界尽头》': [[0, 11]], '《变形金刚》': [[33, 38]]}}}, {'text': '布鲁京斯研究所桑顿中国中心研究部主任李成说，东亚的和平与安全，是美国的“核心利益”之一。', 'label': {'address': {'美国': [[32, 33]]}, 'organization': {'布鲁京斯研究所桑顿中国中心': [[0, 12]]}, 'name': {'李成': [[18, 19]]}, 'position': {'研究部主任': [[13, 17]]}}}]
Total: 1345
[{'id': 0, 'text': '四川敦煌学”。近年来，丹棱县等地一些不知名的石窟迎来了海内外的游客，他们随身携带着胡文和的著作。'}, {'id': 1, 'text': '尼日利亚海军发言人当天在阿布贾向尼日利亚通讯社证实了这一消息。'}, {'id': 2, 'text': '销售冠军：辐射3-Bethesda'}, {'id': 3, 'text': '所以大多数人都是从巴厘岛南部开始环岛之旅。'}, {'id': 4, 'text': '备受瞩目的动作及冒险类大作《迷失》在其英文版上市之初就受到了全球玩家的大力追捧。'}]
Total: 1343
[{'text': '彭小




### 1.1 样本数量分布

In [6]:
all_data = train_data + dev_data

### 1.2 样本长度分布

In [7]:
lengths = [ len(x['text']) for x in tqdm(all_data)]
logger.info(f"***** Text Lengths *****")
logger.info(f"mean: {np.mean(lengths):.2f}")
logger.info(f"std: {np.mean(lengths):.2f}")
logger.info(f"max: {np.max(lengths)}")
logger.info(f"min: {np.min(lengths)}")

100%|██████████| 12091/12091 [00:00<00:00, 2158932.72it/s]
2020-06-01 13:48:13.393 | INFO     | __main__:<module>:2 - ***** Text Lengths *****
2020-06-01 13:48:13.395 | INFO     | __main__:<module>:3 - mean: 37.39
2020-06-01 13:48:13.396 | INFO     | __main__:<module>:4 - std: 37.39
2020-06-01 13:48:13.398 | INFO     | __main__:<module>:5 - max: 50
2020-06-01 13:48:13.399 | INFO     | __main__:<module>:6 - min: 2


### 1.3 样本标签

In [8]:
all_labels = []
for text_data in tqdm(all_data):
    labels = text_data['label']
    for k, v in labels.items():
        all_labels.append(k)
print(f"{Counter(all_labels)}")

100%|██████████| 12091/12091 [00:00<00:00, 1282745.15it/s]

Counter({'name': 3199, 'position': 2811, 'company': 2494, 'address': 2363, 'game': 2123, 'organization': 2100, 'government': 1651, 'scene': 1070, 'book': 1029, 'movie': 880})





In [9]:
ner_labels = np.unique(all_labels).tolist()
ner_labels

['address',
 'book',
 'company',
 'game',
 'government',
 'movie',
 'name',
 'organization',
 'position',
 'scene']

## 2. 模型构建

In [10]:
import os, sys, json, random
from collections import Counter
from tqdm import tqdm
from loguru import logger
from pathlib import Path

from theta.utils import init_theta, split_train_eval_examples
from theta.modeling.ner_span import NerTrainer, load_model, load_examples, init_labels


### 2.1 模型输入数据

In [11]:
def clean_text(text):
    text = text.strip()
    return text


def labeling_text_bios(text, entities):
    """
    text: str

        "万通地产设计总监刘克峰；"

    entities: [(role, mention, s, e), ...]
            
        [("name", 刘克峰", 8, 10), ("company", "万通地产", 0, 3), ...]

    output:

         
    """
    words = [w for w in text]
    labels = ['O'] * len(words)

    for entity in entities:
        role, mention, s, e = entity
        assert s <= e
        mention_len = e - s + 1
        if mention_len == 1:
            labels[s] = f"S-{role}"
        else:
            labels[s] = f"B-{role}"
            for j0 in range(1, mention_len):
                labels[s + j0] = f"I-{role}"
    return labels

def labeling_text_span(text, entities):
    """
    text: str

        "万通地产设计总监刘克峰；"

    entities: [(role, mention, s, e), ...]
            
        [("name", 刘克峰", 8, 10), ("company", "万通地产", 0, 3), ...]

    output:

         
    """
    labels = []

    for entity in entities:
        role, mention, s, e = entity
        assert s <= e
        labels.append((role, s, e))
        
    return labels

def train_data_generator(args, train_file, seg_len=0, seg_backoff=0):
    """
    每行一条json格式数据。
    """

#     guid = 0
#     examples = []
#     with open(args.train_file, 'r') as fr:
#         lines = fr.readlines()
#         for i, line in enumerate(tqdm(lines, desc=f"train & eval")):
#             d = json.loads(line)

#             # -------------------- 自定义json格式 --------------------
#             #  {
#             #      "text": "万通地产设计总监刘克峰；",
#             #      "label": {
#             #          "name": {
#             #              "刘克峰": [[8, 10]]
#             #          },
#             #          "company": {
#             #              "万通地产": [[0, 3]]
#             #          },
#             #          "position": {
#             #              "设计总监": [[4, 7]]
#             #          }
#             #      }
#             #  }
        
    all_data = train_data+dev_data
    all_labels = []
    
    total_examples = len(all_data)
    num_sample_examples = int(total_examples * args.train_sample_rate)
    logger.warning(
        f"Sample {num_sample_examples}/{total_examples} ({args.train_sample_rate*100:.1f}%) train examples."
    )

    for i, d in enumerate(tqdm(all_data, desc="train")):
        if i >= num_sample_examples:
            break        

        text = d['text']
        text = clean_text(text)

        entities = []
        classes = d['label'].keys()
        for c in classes:
            c_labels = d['label'][c]
            #  logger.debug(f"c_labels:{c_labels}")
            for label, span in c_labels.items():
                x0, x1 = span[0]
                entities.append((c, x0, x1))
                guid = f"{i}"

                #examples.append({
                #    'guid': guid,
                #    'text': text,
                #    'entities': entities
                #})
                yield guid, text, None, entities
                    
    
def load_train_val_examples(args, seg_len=0, seg_backoff=0):
    train_base_examples = load_examples(args,
                                        train_data_generator,
                                        args.train_file,
                                        seg_len=seg_len,
                                        seg_backoff=seg_backoff)
#     logger.debug(f"{train_base_examples[:10]}")

    train_examples, val_examples = split_train_eval_examples(
        train_base_examples,
        train_rate=args.train_rate,
        fold=args.fold,
        shuffle=True,
        random_state=args.seed)

    logger.info(
        f"Loaded {len(train_examples)} train examples, {len(val_examples)} val examples."
    )
    return train_examples, val_examples
    

def load_test_examples(args, seg_len=0, seg_backoff=0):
    from theta.modeling.ner import InputExample

    test_examples = []
    with open(args.test_file, 'r') as fr:
        lines = fr.readlines()
        for i, line in enumerate(tqdm(lines, desc=f"train & eval")):
            d = json.loads(line)

            # -------------------- 自定义json格式 --------------------
            #  {
            #      "id": 1,
            #      "text": "尼日利亚海军发言人当天在阿布贾向尼日利亚通讯社证实了这一消息。"
            #  }

            guid = str(d['id'])
            text = d['text']
            text = clean_text(text)

            test_examples.append(
                InputExample(guid=guid, text_a=text, labels=None))

    logger.info(f"Loaded {len(test_examples)} test examples.")
    return test_examples


### 2.2 模型输出结果

In [12]:
def save_predict_results(args, pred_results, test_examples):
    from theta.utils import get_pred_results_file
    pred_results_file = get_pred_results_file(args)

    test_results = {}
    for json_d, example in tqdm(zip(pred_results, test_examples)):
        guid = example.guid
        text = ''.join(example.text_a)

        if guid not in test_results:
            test_results[guid] = {
                "guid": guid,
                "content": "",
                "events": [],
                "tagged_text": ""
            }

        s0 = 0
        tagged_text = test_results[guid]['tagged_text']
        text_offset = len(test_results[guid]['content'])
        for entity in json_d['entities']:
            event_type = entity[0]
            s = entity[1]
            e = entity[2] + 1
            entity_text = text[s:e]
            test_results[guid]['events'].append(
                (event_type, entity_text, text_offset + s, text_offset + e))

            tagged_text += f"{text[s0:s]}\n"
            tagged_text += f"【{event_type} | {entity_text}】\n"
            s0 = e

        tagged_text += f"{text[s0:]}\n"
        test_results[guid]['tagged_text'] = tagged_text
        test_results[guid]['content'] += text

    json.dump(test_results,
              open(f"{pred_results_file}", 'w'),
              ensure_ascii=False,
              indent=2)
    logger.info(f"Saved predict results to {pred_results_file}")


### 2.4 自定义模型
Theta对每类任务都有缺省模型，通常情况下不需要自定义模型。训练器Trainer中传入参数build_model=None即可。

In [13]:
# -------------------- Model --------------------


def build_model(args):
    """
    自定义模型
    规格要求返回模型(model)、优化器(optimizer)、调度器(scheduler)三元组。
    """
    
    # -------- model --------
    from theta.modeling.ner_span import load_pretrained_model
    model = load_pretrained_model(args)
    model.to(args.device)

    # -------- optimizer --------
    from transformers.optimization import AdamW
    from theta.modeling.trainer import get_default_optimizer_parameters
    optimizer_parameters = get_default_optimizer_parameters(
        model, args.weight_decay)
    optimizer = AdamW(optimizer_parameters,
                      lr=args.learning_rate,
                      correct_bias=False)

    # -------- scheduler --------
    from transformers.optimization import get_linear_schedule_with_warmup
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.total_steps * args.warmup_rate,
        num_training_steps=args.total_steps)

    return model, optimizer, scheduler

### 2.5 自定训练器

训练器也不是必须定义的，可以直接用NerTrainer实例化训练器。

自定义训练器通常是为了使用自定义模型或重载训练、评估、推理过程的关键节点，便于输出、调试等。

In [14]:
# -------------------- Trainer --------------------

from theta.modeling.ner_span import NerTrainer

class AppTrainer(NerTrainer):
    def __init__(self, args):
        # 使用自定义模型时，传入build_model参数。
        super(AppTrainer, self).__init__(args, build_model=None)


### 2.6 主控流程

In [15]:
def main(args):
    init_theta(args)
    init_labels(args, ner_labels)

    trainer = AppTrainer(args)

    # --------------- train phase ---------------
    if args.do_train:
        train_examples, val_examples = load_train_val_examples(
            args, seg_len=seg_len, seg_backoff=seg_backoff)
        trainer.train(args, train_examples, val_examples)
    # --------------- predict phase ---------------
    if args.do_predict:
        test_examples = load_test_examples(args,
                                           seg_len=seg_len,
                                           seg_backoff=seg_backoff)

        model = load_model(args)
        trainer.predict(args, model, test_examples)
        save_predict_results(args, trainer.pred_results, f"{args.dataset_name}_predict.json",
                             test_examples)
    # --------------- evaluate phase ---------------
    if args.do_eval:
#        eval_examples = load_eval_examples(args,
#                                           seg_len=seg_len,
#                                           seg_backoff=seg_backoff)
        train_examples, eval_examples = load_train_val_examples(
            args, seg_len=seg_len, seg_backoff=seg_backoff)

        model = load_model(args)
        trainer.evaluate(args, model, eval_examples)



## 3. 运行

### 3.1 全局参数

In [16]:
#def add_special_args(parser):
#    return parser

#from theta.modeling.glue.args import get_args
#args = get_args([add_special_args])

import sys, argparse

def get_init_args():
    parser = argparse.ArgumentParser()
    for arg in sys.argv:
        if arg.startswith('-'):
            parser.add_argument(arg, type=str)
    args = parser.parse_args()
    return args

#import argparse
#parser = argparse.ArgumentParser()
#parser.add_argument("-f",type=str)
#args = parser.parse_args()

args = get_init_args()

DATASET_NAME="cluener"
DATA_DIR="./data"
OUTPUT_DIR=f"output_{DATASET_NAME}"
CHECKPOINT_MODEL=f"{OUTPUT_DIR}/best"

TRAIN_FILE = "./data/rawdata/train.json"
TEST_FILE = "./data/rawdatda/test.json"
EVAL_FILE = "./data/rawdata/eval.json"

EPOCHS=10
TRAIN_SAMPLE_RATE=1.0

MODEL_TYPE="bert"
PRETRAINED_MODEL="/opt/share/pretrained/pytorch/bert-base-chinese"
LEARNING_RATE=2e-5
TRAIN_MAX_SEQ_LENGTH=64
EVAL_MAX_SEQ_LENGTH=64
TRAIN_BATCH_SIZE=128
EVAL_BATCH_SIZE=64
PREDICT_BATCH_SIZE=64

args.do_train=False
args.do_predict=False
args.do_eval=False
args.train_max_seq_length = TRAIN_MAX_SEQ_LENGTH
args.eval_max_seq_length = EVAL_MAX_SEQ_LENGTH
args.num_train_epochs = EPOCHS
args.learning_rate = LEARNING_RATE
args.per_gpu_train_batch_size = TRAIN_BATCH_SIZE
args.per_gpu_eval_batch_size = EVAL_BATCH_SIZE
args.per_gpu_predict_batch_size = EVAL_BATCH_SIZE

args.data_dir = DATA_DIR
args.dataset_name = DATASET_NAME
args.train_file = TRAIN_FILE
args.eval_file = EVAL_FILE
args.test_file = TEST_FILE

args.output_dir = OUTPUT_DIR
args.pred_output_dir = OUTPUT_DIR

args.model_type = MODEL_TYPE
args.model_path = PRETRAINED_MODEL
args.overwrite_cache = True
args.train_sample_rate = TRAIN_SAMPLE_RATE
args.seed = 8864
args.local_rank=-1
args.no_cuda = None
args.do_lower_case=True
args.cache_dir = None
args.train_rate=0.8
args.fold = 0
args.gradient_accumulation_steps = 1
args.max_steps = 0
args.focalloss_gamma = 1.5
args.focalloss_alpha = None
args.weight_decay = 0.0
args.warmup_rate = 0.1
args.fp16 = True
args.fp16_opt_level = 'O1'
args.max_grad_norm = 1.0
args.save_checkpoints = False
args.no_eval_on_each_epoch=False

args.soft_label = True
args.loss_type = 'CrossEntropyLoss'
#args.loss_type = 'FocalLoss'


### 3.2 启动训练

In [None]:
args.do_train=True
args.do_predict=False
args.do_eval=False

main(args)

2020-06-01 13:48:19.481 | INFO     | theta.modeling.ner_span.dataset:init_labels:100 - args.label2id: {'[unused1]': 0, 'address': 1, 'book': 2, 'company': 3, 'game': 4, 'government': 5, 'movie': 6, 'name': 7, 'organization': 8, 'position': 9, 'scene': 10}
2020-06-01 13:48:19.482 | INFO     | theta.modeling.ner_span.dataset:init_labels:101 - args.id2label: {0: '[unused1]', 1: 'address', 2: 'book', 3: 'company', 4: 'game', 5: 'government', 6: 'movie', 7: 'name', 8: 'organization', 9: 'position', 10: 'scene'}
2020-06-01 13:48:19.483 | INFO     | theta.modeling.ner_span.dataset:init_labels:102 - args.num_labels: 11
train: 100%|██████████| 12091/12091 [00:00<00:00, 71062.90it/s]
2020-06-01 13:48:19.695 | INFO     | theta.modeling.ner_span.dataset:load_examples:88 - Loaded 26320 examples.
2020-06-01 13:48:19.701 | INFO     | __main__:load_train_val_examples:136 - Loaded 21057 train examples, 5263 val examples.
2020-06-01 13:48:19.703 | INFO     | theta.modeling.trainer:train:136 - Start trai

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic




Epoch(1/10)   1/165 [..............................] - ETA: 8:59 - lr: 0.00e+00 - loss: 2.5604Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0






2020-06-01 13:49:28.829 | INFO     | theta.modeling.trainer:train:314 - Epoch(1/10) evaluating.
Tokenize: 100%|██████████| 5263/5263 [00:00<00:00, 65643.97it/s]
2020-06-01 13:49:29.147 | INFO     | theta.modeling.trainer:evaluate:397 - Start evaluating ...
2020-06-01 13:49:29.148 | INFO     | theta.modeling.trainer:evaluate:398 -   Num examples    = 5263
2020-06-01 13:49:29.149 | INFO     | theta.modeling.trainer:evaluate:399 -   Num epoch steps = 83
2020-06-01 13:49:29.150 | INFO     | theta.modeling.trainer:evaluate:400 -   Batch size = 64




2020-06-01 13:51:50.647 | INFO     | theta.utils.ner_utils:get_ner_results:14 -                                    acc    recall f1    
2020-06-01 13:51:50.648 | INFO     | theta.utils.ner_utils:get_ner_results:15 - -------------------------------------------------------
2020-06-01 13:51:50.649 | INFO     | theta.utils.ner_utils:get_ner_results:26 - name                             | 0.8916 0.9273 0.9091
2020-06-01 13:51:50.649 | INFO     | theta.utils.ner_utils:get_ner_results:26 - book                             | 0.8323 0.9177 0.8729
2020-06-01 13:51:50.650 | INFO     | theta.utils.ner_utils:get_ner_results:26 - company                          | 0.8681 0.8430 0.8553
2020-06-01 13:51:50.650 | INFO     | theta.utils.ner_utils:get_ner_results:26 - game                             | 0.8454 0.8603 0.8528
2020-06-01 13:51:50.651 | INFO     | theta.utils.ner_utils:get_ner_results:26 - position                         | 0.8269 0.8787 0.8521
2020-06-01 13:51:50.652 | INFO     | theta.utils

{"eval_acc": "0.833036", "eval_recall": "0.852470", "eval_f1": "0.842641", "learning_rate": "0.000020", "loss": "0.355189", "step": 165}




 


2020-06-01 13:52:56.754 | INFO     | theta.modeling.trainer:train:314 - Epoch(2/10) evaluating.
Tokenize: 100%|██████████| 5263/5263 [00:00<00:00, 61145.32it/s]
2020-06-01 13:52:57.082 | INFO     | theta.modeling.trainer:evaluate:397 - Start evaluating ...
2020-06-01 13:52:57.083 | INFO     | theta.modeling.trainer:evaluate:398 -   Num examples    = 5263
2020-06-01 13:52:57.083 | INFO     | theta.modeling.trainer:evaluate:399 -   Num epoch steps = 83
2020-06-01 13:52:57.084 | INFO     | theta.modeling.trainer:evaluate:400 -   Batch size = 64




2020-06-01 13:55:19.301 | INFO     | theta.utils.ner_utils:get_ner_results:14 -                                    acc    recall f1    
2020-06-01 13:55:19.302 | INFO     | theta.utils.ner_utils:get_ner_results:15 - -------------------------------------------------------
2020-06-01 13:55:19.303 | INFO     | theta.utils.ner_utils:get_ner_results:26 - name                             | 0.9255 0.9618 0.9433
2020-06-01 13:55:19.303 | INFO     | theta.utils.ner_utils:get_ner_results:26 - book                             | 0.9219 0.9393 0.9305
2020-06-01 13:55:19.304 | INFO     | theta.utils.ner_utils:get_ner_results:26 - company                          | 0.8922 0.9209 0.9064
2020-06-01 13:55:19.304 | INFO     | theta.utils.ner_utils:get_ner_results:26 - position                         | 0.8748 0.9383 0.9054
2020-06-01 13:55:19.305 | INFO     | theta.utils.ner_utils:get_ner_results:26 - government                       | 0.8847 0.9217 0.9029
2020-06-01 13:55:19.306 | INFO     | theta.utils

{"eval_acc": "0.887618", "eval_recall": "0.920240", "eval_f1": "0.903635", "learning_rate": "0.000018", "loss": "0.035475", "step": 330}
 


2020-06-01 13:56:24.987 | INFO     | theta.modeling.trainer:train:314 - Epoch(3/10) evaluating.
Tokenize: 100%|██████████| 5263/5263 [00:00<00:00, 65440.03it/s]
2020-06-01 13:56:25.309 | INFO     | theta.modeling.trainer:evaluate:397 - Start evaluating ...
2020-06-01 13:56:25.310 | INFO     | theta.modeling.trainer:evaluate:398 -   Num examples    = 5263
2020-06-01 13:56:25.311 | INFO     | theta.modeling.trainer:evaluate:399 -   Num epoch steps = 83
2020-06-01 13:56:25.312 | INFO     | theta.modeling.trainer:evaluate:400 -   Batch size = 64




### 3.3 启动推理

In [None]:
args.do_train=False
args.do_predict=True
args.do_eval=False
args.model_path=CHECKPOINT_MODEL
main(args)