# CLUE-CLUENER 细粒度命名实体识别

本数据是在清华大学开源的文本分类数据集THUCTC基础上，选出部分数据进行细粒度命名实体标注，原数据来源于Sina News RSS.

训练集：10748 验证集：1343

标签类别：
数据分为10个标签类别，分别为: 地址（address），书名（book），公司（company），游戏（game），政府（goverment），电影（movie），姓名（name），组织机构（organization），职位（position），景点（scene）

数据下载地址：https://github.com/CLUEbenchmark/CLUENER2020

排行榜地址：https://cluebenchmarks.com/ner.html

|模型|线上效果f1|
|------|------:|
|Bert-base|78.82|
|RoBERTa-wwm-large-ext|80.42|
|Bi-Lstm + CRF|70.00|

In [4]:
%reload_ext autoreload
%autoreload 2

## 1. 数据观察

In [5]:
import json
import pandas
from tqdm import tqdm
from loguru import logger
import numpy as np
from collections import Counter

seg_len=0
seg_backoff=0
fold = 0

In [6]:
train_file = './data/rawdata/train.json'
test_file = './data/rawdata/test.json'
dev_file = './data/rawdata/dev.json'

In [7]:
def load_json_data(json_file):
    rd = open(json_file, 'r')
    lines = rd.readlines()
    rd.close()
    json_data = []
    for line in tqdm(lines):
        line = line.strip()
        line_data = json.loads(line)
        json_data.append(line_data)
    print(f"Total: {len(json_data)}")
    print(json_data[:5])
    return json_data

In [8]:
train_data = load_json_data(train_file)
test_data = load_json_data(test_file)
dev_data = load_json_data(dev_file)

100%|██████████| 10748/10748 [00:00<00:00, 87197.78it/s]
100%|██████████| 1345/1345 [00:00<00:00, 289047.44it/s]
100%|██████████| 1343/1343 [00:00<00:00, 22562.31it/s]

Total: 10748
[{'text': '浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，', 'label': {'name': {'叶老桂': [[9, 11]]}, 'company': {'浙商银行': [[0, 3]]}}}, {'text': '生生不息CSOL生化狂潮让你填弹狂扫', 'label': {'game': {'CSOL': [[4, 7]]}}}, {'text': '那不勒斯vs锡耶纳以及桑普vs热那亚之上呢？', 'label': {'organization': {'那不勒斯': [[0, 3]], '锡耶纳': [[6, 8]], '桑普': [[11, 12]], '热那亚': [[15, 17]]}}}, {'text': '加勒比海盗3：世界尽头》的去年同期成绩死死甩在身后，后者则即将赶超《变形金刚》，', 'label': {'movie': {'加勒比海盗3：世界尽头》': [[0, 11]], '《变形金刚》': [[33, 38]]}}}, {'text': '布鲁京斯研究所桑顿中国中心研究部主任李成说，东亚的和平与安全，是美国的“核心利益”之一。', 'label': {'address': {'美国': [[32, 33]]}, 'organization': {'布鲁京斯研究所桑顿中国中心': [[0, 12]]}, 'name': {'李成': [[18, 19]]}, 'position': {'研究部主任': [[13, 17]]}}}]
Total: 1345
[{'id': 0, 'text': '四川敦煌学”。近年来，丹棱县等地一些不知名的石窟迎来了海内外的游客，他们随身携带着胡文和的著作。'}, {'id': 1, 'text': '尼日利亚海军发言人当天在阿布贾向尼日利亚通讯社证实了这一消息。'}, {'id': 2, 'text': '销售冠军：辐射3-Bethesda'}, {'id': 3, 'text': '所以大多数人都是从巴厘岛南部开始环岛之旅。'}, {'id': 4, 'text': '备受瞩目的动作及冒险类大作《迷失》在其英文版上市之初就受到了全球玩家的大力追捧。'}]
Total: 1343
[{'text': '彭小




### 1.1 样本数量分布

In [9]:
all_data = train_data + dev_data

### 1.2 样本长度分布

In [10]:
lengths = [ len(x['text']) for x in tqdm(all_data)]
logger.info(f"***** Text Lengths *****")
logger.info(f"mean: {np.mean(lengths):.2f}")
logger.info(f"std: {np.mean(lengths):.2f}")
logger.info(f"max: {np.max(lengths)}")
logger.info(f"min: {np.min(lengths)}")

100%|██████████| 12091/12091 [00:00<00:00, 1621996.09it/s]
2020-06-16 18:15:09.576 | INFO     | __main__:<module>:2 - ***** Text Lengths *****
2020-06-16 18:15:09.578 | INFO     | __main__:<module>:3 - mean: 37.39
2020-06-16 18:15:09.579 | INFO     | __main__:<module>:4 - std: 37.39
2020-06-16 18:15:09.581 | INFO     | __main__:<module>:5 - max: 50
2020-06-16 18:15:09.582 | INFO     | __main__:<module>:6 - min: 2


### 1.3 样本标签

In [11]:
all_labels = []
for text_data in tqdm(all_data):
    labels = text_data['label']
    for k, v in labels.items():
        all_labels.append(k)
print(f"{Counter(all_labels)}")

100%|██████████| 12091/12091 [00:00<00:00, 1129372.21it/s]

Counter({'name': 3199, 'position': 2811, 'company': 2494, 'address': 2363, 'game': 2123, 'organization': 2100, 'government': 1651, 'scene': 1070, 'book': 1029, 'movie': 880})





In [12]:
ner_labels = np.unique(all_labels).tolist()
ner_labels

['address',
 'book',
 'company',
 'game',
 'government',
 'movie',
 'name',
 'organization',
 'position',
 'scene']

## 2. 模型构建

In [14]:
import os, sys, json, random
from collections import Counter
from tqdm import tqdm
from loguru import logger
from pathlib import Path

from theta.utils import load_json_file, split_train_eval_examples
from theta.modeling import LabeledText, load_ner_examples, load_ner_labeled_examples, save_ner_preds, show_ner_datainfo

from theta.modeling.ner_span import load_model, NerTrainer, get_args
#from theta.modeling.ner import load_model, NerTrainer, get_args

### 2.1 模型输入数据

In [15]:
def clean_text(text):
    if text:
        text = text.strip()
        #  text = re.sub('\t', ' ', text)
    return text


def train_data_generator(train_file):

    lines = load_json_file(train_file)

    for i, x in enumerate(tqdm(lines)):
        guid = str(i)
        text = clean_text(x['text'])
        sl = LabeledText(guid, text)

        # -------------------- 训练数据json格式 --------------------
        #  {
        #      "text": "万通地产设计总监刘克峰；",
        #      "label": {
        #          "name": {
        #              "刘克峰": [[8, 10]]
        #          },
        #          "company": {
        #              "万通地产": [[0, 3]]
        #          },
        #          "position": {
        #              "设计总监": [[4, 7]]
        #          }
        #      }
        #  }

        entities = []
        classes = x['label'].keys()
        for c in classes:
            c_labels = x['label'][c]
            #  logger.debug(f"c_labels:{c_labels}")
            for label, span in c_labels.items():
                x0, x1 = span[0]
                sl.add_entity(c, x0, x1)

        yield str(i), text, None, sl.entities


def load_train_val_examples(args):
    lines = []
    for guid, text, _, entities in train_data_generator(args.train_file):
        sl = LabeledText(guid, text, entities)
        lines.append({'guid': guid, 'text': text, 'entities': entities})

    allow_overlap = args.allow_overlap
    if args.num_augements > 0:
        allow_overlap = False

    train_base_examples = load_ner_labeled_examples(
        lines,
        ner_labels,
        seg_len=args.seg_len,
        seg_backoff=args.seg_backoff,
        num_augements=args.num_augements,
        allow_overlap=allow_overlap)

    train_examples, val_examples = split_train_eval_examples(
        train_base_examples,
        train_rate=args.train_rate,
        fold=args.fold,
        shuffle=True,
        random_state=args.seed)

    logger.info(f"Loaded {len(train_examples)} train examples, "
                f"{len(val_examples)} val examples.")
    return train_examples, val_examples


def test_data_generator(test_file):

    lines = load_json_file(test_file)
    for i, s in enumerate(tqdm(lines)):
        guid = str(i)
        text_a = clean_text(s['text'])

        yield guid, text_a, None, None


def load_test_examples(args):
    test_base_examples = load_ner_examples(test_data_generator,
                                           args.test_file,
                                           seg_len=args.seg_len,
                                           seg_backoff=args.seg_backoff)

    logger.info(f"Loaded {len(test_base_examples)} test examples.")
    return test_base_examples


### 2.2 模型输出结果

In [16]:
def generate_submission(args):
    reviews_file = f"{args.output_dir}/{args.dataset_name}_reviews_fold{args.fold}.json"
    reviews = json.load(open(reviews_file, 'r'))

    submission_file = f"{args.dataset_name}_predict.json"
    test_results = {}
    for guid, json_data in reviews.items():
        text = json_data['text']

        if guid not in test_results:
            test_results[guid] = {
                "guid": guid,
                "content": "",
                "events": [],
                "tagged_text": "",
            }

        s0 = 0
        tagged_text = test_results[guid]['tagged_text']
        for json_entity in json_data['entities']:
            event_type = json_entity['category']
            entity_text = json_entity['mention']
            s = json_entity['start']
            e = json_entity['end']
            test_results[guid]['events'].append(
                (event_type, entity_text, s, e))

            tagged_text += f"{text[s0:s]}\n"
            tagged_text += f"【{event_type} | {entity_text}】\n"
            test_results[guid]['tagged_text'] = tagged_text
            test_results[guid]['content'] += text

            s0 = e

        test_results[guid]['events'] = sorted(test_results[guid]['events'],
                                              key=lambda x: x[3])

    json.dump(test_results,
              open(f"{submission_file}", 'w'),
              ensure_ascii=False,
              indent=2)

    logger.info(f"Saved {len(reviews)} lines in {submission_file}")




### 2.4 自定义模型
Theta对每类任务都有缺省模型，通常情况下不需要自定义模型。训练器Trainer中传入参数build_model=None即可。

### 2.5 自定训练器

训练器也不是必须定义的，可以直接用NerTrainer实例化训练器。

自定义训练器通常是为了使用自定义模型或重载训练、评估、推理过程的关键节点，便于输出、调试等。

In [17]:
# -------------------- Trainer --------------------

class AppTrainer(NerTrainer):
    def __init__(self, args, ner_labels):
        super(AppTrainer, self).__init__(args, ner_labels, build_model=None)


### 2.6 主控流程

In [23]:
def main(args):

    if args.generate_submission:
        generate_submission(args)
    else:
        trainer = AppTrainer(args, ner_labels)

        if args.do_eda:
            show_ner_datainfo(ner_labels, train_data_generator,
                              args.train_file, test_data_generator,
                              args.test_file)

        elif args.do_train:
            train_examples, val_examples = load_train_val_examples(args)
            trainer.train(args, train_examples, val_examples)

        elif args.do_eval:
            _, eval_examples = load_train_val_examples(args)
            model = load_model(args)
            trainer.evaluate(args, model, eval_examples)

        elif args.do_predict:
            test_examples = load_test_examples(args)
            model = load_model(args)
            trainer.predict(args, model, test_examples)
            save_ner_preds(args, trainer.pred_results, test_examples)


## 3. 运行

### 3.1 全局参数

In [42]:
#def add_special_args(parser):
#    return parser

#from theta.modeling.glue.args import get_args
#args = get_args([add_special_args])

import sys, argparse

def get_init_args():
    parser = argparse.ArgumentParser()
    for arg in sys.argv:
        if arg.startswith('-'):
            parser.add_argument(arg, type=str)
    parser.add_argument('--do_eda', action="store_true")
    parser.add_argument('--generate_submission', action="store_true")
    parser.add_argument('--allow_overlap', action="store_true")
    args = parser.parse_args()
    return args

#import argparse
#parser = argparse.ArgumentParser()
#parser.add_argument("-f",type=str)
#args = parser.parse_args()

args = get_init_args()
FOLD=0
DATASET_NAME="cluener"
DATA_DIR="./data"
OUTPUT_DIR=f"output_{DATASET_NAME}"
CHECKPOINT_MODEL=f"{OUTPUT_DIR}/best_fold{FOLD}"

TRAIN_FILE = "./data/rawdata/train.json"
TEST_FILE = "./data/rawdata/test.json"
EVAL_FILE = "./data/rawdata/eval.json"

EPOCHS=3
TRAIN_SAMPLE_RATE=1.0

MODEL_TYPE="bert"
PRETRAINED_MODEL="/opt/share/pretrained/pytorch/bert-base-chinese"
LEARNING_RATE=2e-5
TRAIN_MAX_SEQ_LENGTH=256
EVAL_MAX_SEQ_LENGTH=256
TRAIN_BATCH_SIZE=12
EVAL_BATCH_SIZE=12
PREDICT_BATCH_SIZE=12

args.do_train=False
args.do_predict=False
args.do_eval=False
args.train_max_seq_length = TRAIN_MAX_SEQ_LENGTH
args.eval_max_seq_length = EVAL_MAX_SEQ_LENGTH
args.num_train_epochs = EPOCHS
args.learning_rate = LEARNING_RATE
args.per_gpu_train_batch_size = TRAIN_BATCH_SIZE
args.per_gpu_eval_batch_size = EVAL_BATCH_SIZE
args.per_gpu_predict_batch_size = EVAL_BATCH_SIZE

args.data_dir = DATA_DIR
args.dataset_name = DATASET_NAME
args.train_file = TRAIN_FILE
args.eval_file = EVAL_FILE
args.test_file = TEST_FILE

args.output_dir = OUTPUT_DIR
args.pred_output_dir = OUTPUT_DIR

args.model_type = MODEL_TYPE
args.model_path = PRETRAINED_MODEL
args.overwrite_cache = True
args.train_sample_rate = TRAIN_SAMPLE_RATE
args.seed = 8864
args.local_rank=-1
args.no_cuda = None
args.do_lower_case=True
args.cache_dir = None
args.train_rate=0.8
args.fold = 0
args.gradient_accumulation_steps = 1
args.max_steps = 0
args.focalloss_gamma = 1.5
args.focalloss_alpha = None
args.weight_decay = 0.0
args.warmup_rate = 0.1
args.fp16 = True
args.fp16_opt_level = 'O1'
args.max_grad_norm = 1.0
args.save_checkpoints = False
args.no_eval_on_each_epoch=False
args.num_augements = 0
args.seg_len = 254
args.seg_backoff=64
args.enable_kd=False


args.soft_label = False
args.loss_type = 'CrossEntropyLoss'
#args.loss_type = 'FocalLoss'


### 3.2 启动训练

In [40]:
args.do_train=True
args.do_predict=False
args.do_eval=False

main(args)

2020-06-16 18:22:49.696 | INFO     | theta.modeling.ner_span.trainer:init_labels:283 - args.label2id: {'[unused1]': 0, 'address': 1, 'book': 2, 'company': 3, 'game': 4, 'government': 5, 'movie': 6, 'name': 7, 'organization': 8, 'position': 9, 'scene': 10}
2020-06-16 18:22:49.697 | INFO     | theta.modeling.ner_span.trainer:init_labels:284 - args.id2label: {0: '[unused1]', 1: 'address', 2: 'book', 3: 'company', 4: 'game', 5: 'government', 6: 'movie', 7: 'name', 8: 'organization', 9: 'position', 10: 'scene'}
2020-06-16 18:22:49.698 | INFO     | theta.modeling.ner_span.trainer:init_labels:285 - args.num_labels: 11
100%|██████████| 10748/10748 [00:00<00:00, 32330.91it/s]
100%|██████████| 10748/10748 [00:00<00:00, 163443.08it/s]
100%|██████████| 10748/10748 [00:00<00:00, 139091.90it/s]
2020-06-16 18:22:50.218 | INFO     | theta.modeling.ner_utils:load_ner_labeled_examples:437 - Loaded 10748 examples.
2020-06-16 18:22:50.226 | INFO     | __main__:load_train_val_examples:70 - Loaded 8598 trai

Total: 10748
[{'text': '浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，', 'label': {'name': {'叶老桂': [[9, 11]]}, 'company': {'浙商银行': [[0, 3]]}}}, {'text': '生生不息CSOL生化狂潮让你填弹狂扫', 'label': {'game': {'CSOL': [[4, 7]]}}}, {'text': '那不勒斯vs锡耶纳以及桑普vs热那亚之上呢？', 'label': {'organization': {'那不勒斯': [[0, 3]], '锡耶纳': [[6, 8]], '桑普': [[11, 12]], '热那亚': [[15, 17]]}}}, {'text': '加勒比海盗3：世界尽头》的去年同期成绩死死甩在身后，后者则即将赶超《变形金刚》，', 'label': {'movie': {'加勒比海盗3：世界尽头》': [[0, 11]], '《变形金刚》': [[33, 38]]}}}, {'text': '布鲁京斯研究所桑顿中国中心研究部主任李成说，东亚的和平与安全，是美国的“核心利益”之一。', 'label': {'address': {'美国': [[32, 33]]}, 'organization': {'布鲁京斯研究所桑顿中国中心': [[0, 12]]}, 'name': {'李成': [[18, 19]]}, 'position': {'研究部主任': [[13, 17]]}}}]


Tokenize: 100%|██████████| 8598/8598 [00:00<00:00, 26317.03it/s]
2020-06-16 18:22:51.158 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:151 - all_input_ids.shape: (8598, 256)
2020-06-16 18:22:51.282 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:153 - all_attention_mask.shape: (8598, 256)
2020-06-16 18:22:51.405 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:155 - all_token_type_ids.shape: (8598, 256)
2020-06-16 18:22:51.530 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:156 - all_start_ids.shape: (8598, 256)
2020-06-16 18:22:51.654 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:157 - all_end_ids.shape: (8598, 256)
2020-06-16 18:22:51.660 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:158 - all_subjects_ids.shape: (8598,)
2020-06-16 18:22:51.661 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:159 - all_input_lens.shape: (8598,)
2020-06-16 18:22:52.173 | INFO     | theta.modeling.trainer:tra

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Epoch(1/3)   1/717 [..............................] - ETA: 1:49 - lr: 0.00e+00 - loss: 2.6310Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Epoch(1/3)   2/717 [..............................] - ETA: 1:39 - lr: 4.65e-08 - loss: 2.6897Gradient overflow.  Skipping step, loss scaler 



Epoch(1/3)   3/717 [..............................] - ETA: 2:49 - lr: 9.30e-08 - loss: 2.6476Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0


2020-06-16 18:24:47.644 | INFO     | theta.modeling.trainer:train:362 - Epoch(1/3) evaluating.
Tokenize: 100%|██████████| 2150/2150 [00:00<00:00, 71416.44it/s]
2020-06-16 18:24:47.818 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:151 - all_input_ids.shape: (2150, 256)
2020-06-16 18:24:47.849 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:153 - all_attention_mask.shape: (2150, 256)
2020-06-16 18:24:47.879 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:155 - all_token_type_ids.shape: (2150, 256)
2020-06-16 18:24:47.910 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:156 - all_start_ids.shape: (2150, 256)
2020-06-16 18:24:47.941 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:157 - all_end_ids.shape: (2150, 256)
2020-06-16 18:24:47.943 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:158 - all_subjects_ids.shape: (2150,)
2020-06-16 18:24:47.945 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:159 



2020-06-16 18:25:44.645 | INFO     | theta.utils.ner_utils:get_ner_results:14 -                                    acc    recall f1    
2020-06-16 18:25:44.646 | INFO     | theta.utils.ner_utils:get_ner_results:15 - -------------------------------------------------------
2020-06-16 18:25:44.647 | INFO     | theta.utils.ner_utils:get_ner_results:26 - name                             | 0.7709 0.9280 0.8422
2020-06-16 18:25:44.648 | INFO     | theta.utils.ner_utils:get_ner_results:26 - book                             | 0.8689 0.7571 0.8092
2020-06-16 18:25:44.649 | INFO     | theta.utils.ner_utils:get_ner_results:26 - position                         | 0.8155 0.7896 0.8024
2020-06-16 18:25:44.650 | INFO     | theta.utils.ner_utils:get_ner_results:26 - game                             | 0.7543 0.8506 0.7996
2020-06-16 18:25:44.651 | INFO     | theta.utils.ner_utils:get_ner_results:26 - company                          | 0.7532 0.8179 0.7842
2020-06-16 18:25:44.652 | INFO     | theta.utils

{"eval_acc": "0.767849", "eval_recall": "0.756745", "eval_f1": "0.762257", "learning_rate": "0.000015", "loss": "0.146133", "step": 717}
 


2020-06-16 18:27:42.665 | INFO     | theta.modeling.trainer:train:362 - Epoch(2/3) evaluating.
Tokenize: 100%|██████████| 2150/2150 [00:00<00:00, 68638.71it/s]
2020-06-16 18:27:42.843 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:151 - all_input_ids.shape: (2150, 256)
2020-06-16 18:27:42.874 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:153 - all_attention_mask.shape: (2150, 256)
2020-06-16 18:27:42.904 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:155 - all_token_type_ids.shape: (2150, 256)
2020-06-16 18:27:42.936 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:156 - all_start_ids.shape: (2150, 256)
2020-06-16 18:27:42.967 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:157 - all_end_ids.shape: (2150, 256)
2020-06-16 18:27:42.969 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:158 - all_subjects_ids.shape: (2150,)
2020-06-16 18:27:42.970 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:159 



2020-06-16 18:28:39.704 | INFO     | theta.utils.ner_utils:get_ner_results:14 -                                    acc    recall f1    
2020-06-16 18:28:39.705 | INFO     | theta.utils.ner_utils:get_ner_results:15 - -------------------------------------------------------
2020-06-16 18:28:39.705 | INFO     | theta.utils.ner_utils:get_ner_results:26 - name                             | 0.8329 0.9076 0.8687
2020-06-16 18:28:39.706 | INFO     | theta.utils.ner_utils:get_ner_results:26 - book                             | 0.9244 0.7571 0.8325
2020-06-16 18:28:39.707 | INFO     | theta.utils.ner_utils:get_ner_results:26 - game                             | 0.8071 0.8420 0.8242
2020-06-16 18:28:39.707 | INFO     | theta.utils.ner_utils:get_ner_results:26 - position                         | 0.7961 0.8147 0.8053
2020-06-16 18:28:39.708 | INFO     | theta.utils.ner_utils:get_ner_results:26 - company                          | 0.7884 0.8091 0.7986
2020-06-16 18:28:39.709 | INFO     | theta.utils

{"eval_acc": "0.776024", "eval_recall": "0.785236", "eval_f1": "0.780603", "learning_rate": "0.000007", "loss": "0.037003", "step": 1434}
 


2020-06-16 18:30:38.477 | INFO     | theta.modeling.trainer:train:362 - Epoch(3/3) evaluating.
Tokenize: 100%|██████████| 2150/2150 [00:00<00:00, 70480.38it/s]
2020-06-16 18:30:38.650 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:151 - all_input_ids.shape: (2150, 256)
2020-06-16 18:30:38.681 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:153 - all_attention_mask.shape: (2150, 256)
2020-06-16 18:30:38.711 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:155 - all_token_type_ids.shape: (2150, 256)
2020-06-16 18:30:38.742 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:156 - all_start_ids.shape: (2150, 256)
2020-06-16 18:30:38.773 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:157 - all_end_ids.shape: (2150, 256)
2020-06-16 18:30:38.775 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:158 - all_subjects_ids.shape: (2150,)
2020-06-16 18:30:38.776 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:159 



2020-06-16 18:31:35.400 | INFO     | theta.utils.ner_utils:get_ner_results:14 -                                    acc    recall f1    
2020-06-16 18:31:35.400 | INFO     | theta.utils.ner_utils:get_ner_results:15 - -------------------------------------------------------
2020-06-16 18:31:35.401 | INFO     | theta.utils.ner_utils:get_ner_results:26 - name                             | 0.8194 0.9062 0.8606
2020-06-16 18:31:35.402 | INFO     | theta.utils.ner_utils:get_ner_results:26 - game                             | 0.8101 0.8312 0.8205
2020-06-16 18:31:35.403 | INFO     | theta.utils.ner_utils:get_ner_results:26 - book                             | 0.8333 0.7857 0.8088
2020-06-16 18:31:35.403 | INFO     | theta.utils.ner_utils:get_ner_results:26 - position                         | 0.8017 0.8097 0.8056
2020-06-16 18:31:35.404 | INFO     | theta.utils.ner_utils:get_ner_results:26 - company                          | 0.7886 0.7968 0.7927
2020-06-16 18:31:35.405 | INFO     | theta.utils

{"eval_acc": "0.774483", "eval_recall": "0.783510", "eval_f1": "0.778970", "learning_rate": "0.000000", "loss": "0.025855", "step": 2151}
 


### 3.3 启动推理

In [43]:
args.do_train=False
args.do_predict=True
args.do_eval=False
args.model_path=CHECKPOINT_MODEL
main(args)

2020-06-16 18:32:31.465 | INFO     | theta.modeling.ner_span.trainer:init_labels:283 - args.label2id: {'[unused1]': 0, 'address': 1, 'book': 2, 'company': 3, 'game': 4, 'government': 5, 'movie': 6, 'name': 7, 'organization': 8, 'position': 9, 'scene': 10}
2020-06-16 18:32:31.466 | INFO     | theta.modeling.ner_span.trainer:init_labels:284 - args.id2label: {0: '[unused1]', 1: 'address', 2: 'book', 3: 'company', 4: 'game', 5: 'government', 6: 'movie', 7: 'name', 8: 'organization', 9: 'position', 10: 'scene'}
2020-06-16 18:32:31.468 | INFO     | theta.modeling.ner_span.trainer:init_labels:285 - args.num_labels: 11
100%|██████████| 1345/1345 [00:00<00:00, 260473.68it/s]
100%|██████████| 1345/1345 [00:00<00:00, 356122.65it/s]
2020-06-16 18:32:31.501 | INFO     | theta.modeling.ner_utils:load_ner_examples:410 - Loaded 1345 examples.
2020-06-16 18:32:31.501 | INFO     | __main__:load_test_examples:91 - Loaded 1345 test examples.
2020-06-16 18:32:31.504 | INFO     | theta.modeling.ner_span.tra

Total: 1345
[{'id': 0, 'text': '四川敦煌学”。近年来，丹棱县等地一些不知名的石窟迎来了海内外的游客，他们随身携带着胡文和的著作。'}, {'id': 1, 'text': '尼日利亚海军发言人当天在阿布贾向尼日利亚通讯社证实了这一消息。'}, {'id': 2, 'text': '销售冠军：辐射3-Bethesda'}, {'id': 3, 'text': '所以大多数人都是从巴厘岛南部开始环岛之旅。'}, {'id': 4, 'text': '备受瞩目的动作及冒险类大作《迷失》在其英文版上市之初就受到了全球玩家的大力追捧。'}]


Tokenize: 100%|██████████| 1345/1345 [00:00<00:00, 70311.08it/s]
2020-06-16 18:32:36.226 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:151 - all_input_ids.shape: (1345, 256)
2020-06-16 18:32:36.245 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:153 - all_attention_mask.shape: (1345, 256)
2020-06-16 18:32:36.265 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:155 - all_token_type_ids.shape: (1345, 256)
2020-06-16 18:32:36.284 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:156 - all_start_ids.shape: (1345, 256)
2020-06-16 18:32:36.303 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:157 - all_end_ids.shape: (1345, 256)
2020-06-16 18:32:36.304 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:158 - all_subjects_ids.shape: (1345, 0)
2020-06-16 18:32:36.305 | DEBUG    | theta.modeling.ner_span.dataset:encode_examples:159 - all_input_lens.shape: (1345,)
2020-06-16 18:32:36.383 | INFO     | theta.modeling.trainer:p



1345it [00:00, 44595.56it/s]
2020-06-16 18:33:11.564 | INFO     | theta.modeling.ner_utils:save_ner_preds:174 - Reviews file: output_cluener/cluener_reviews_fold0.json
2020-06-16 18:33:11.567 | INFO     | theta.modeling.ner_utils:save_ner_preds:185 - Total 10 categories and 2123 mentions saved to output_cluener/cluener_category_mentions_fold0.txt


### 3.4 生成提交结果文件

In [44]:
args.do_train=False
args.do_predict=False
args.do_eval=False
args.model_path=CHECKPOINT_MODEL
args.generate_submission = True
main(args)

2020-06-16 18:33:14.318 | INFO     | __main__:generate_submission:43 - Saved 1345 lines in cluener_predict.json
