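# NLP in Practice: Named Entity Recognition

The previous six hands-on sessions, from image classification to object detection, all dealt with images.

Starting with this session, we move on to another major area of artificial intelligence: natural language processing (NLP).

NLP is one of the most important and most difficult areas of AI. Its tasks fall roughly into the following categories:

- Lexical analysis: word segmentation, part-of-speech tagging, spelling correction, etc.
- Classification: text classification, sentiment analysis, etc.
- Information extraction: named entity recognition, entity disambiguation, term extraction, relation extraction, etc.
- High-level tasks: machine translation, text summarization, question answering, reading comprehension, etc.

Over the next four sessions we will work through four NLP tasks. We hope you are looking forward to them!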

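First, set up the environment.

### Enter ModelArts

Open https://www.huaweicloud.com/product/modelarts.html to reach the ModelArts home page. Click "Use Now" (立即使用), log in with your username and password, and you will land on the ModelArts console.

### Create a ModelArts notebook

Next, create a notebook development environment in ModelArts. A ModelArts notebook provides a web-based Python development environment in which you can write and run code and inspect the results.

Step 1: On the ModelArts main page, click "Development Environment" (开发环境), then "Create" (创建).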

![create_nb_create_button](./img/create_nb_create_button.png)

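Step 2: Fill in the notebook parameters:

| Parameter | Description |
| --- | --- |
| Billing mode | Pay-per-use |
| Name | Name of the notebook instance |
| Work environment | Python3 |
| Resource pool | Select "Public resource pool" |
| Type | Select "GPU" |
| Flavor | Select "8 cores &#124; 64GiB &#124; 1*p100" |
| Storage | Select EVS with a 5GB disk |

Step 3: After configuring the notebook parameters, click Next to preview the notebook settings. Once everything looks correct, click "Create Now" (立即创建).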

![create_nb_creation_summary](./img/create_nb_creation_summary.png)

Step 4: Return to the Development Environment page, wait until the notebook has been created, then open it and continue with the next step.
![modelarts_notebook_index](./img/modelarts_notebook_index.png)

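### Create the development environment in ModelArts

Next, create the actual development environment used in the remaining steps.

Step 1: Click the "Open" (打开) button shown below to enter the notebook you just created.
![enter_dev_env](img/enter_dev_env.png)

Step 2: Create a notebook with a Python3 environment. Click "New" in the upper-right corner and create a TensorFlow 1.13.1 development environment.

Step 3: Click the file name "Untitled" in the upper-left corner and enter a name related to this exercise.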

![notebook_untitled_filename](./img/notebook_untitled_filename.png)
![notebook_name_the_ipynb](./img/notebook_name_the_ipynb.png)


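### Write and run code in the notebook

In the notebook, enter a simple print statement and click the Run button at the top to see the result:
![run_helloworld](./img/run_helloworld.png)


The development environment is ready, so we can start writing code.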


### Prepare the source code and data

The source code and data for this case are stored in OBS. We use the ModelArts SDK to download them to the local environment.

In [1]:
from modelarts.session import Session
session = Session()

if session.region_name == 'cn-north-1':
    bucket_path = 'modelarts-labs/notebook/DL_nlp_ner/ner.tar.gz'
elif session.region_name == 'cn-north-4':
    bucket_path = 'modelarts-labs-bj4/notebook/DL_nlp_ner/ner.tar.gz'
else:
    # Fail early: without a bucket path the download below cannot work.
    raise ValueError("Please switch the region to cn-north-1 (Beijing 1) or cn-north-4 (Beijing 4)")

session.download_data(bucket_path=bucket_path, path='./ner.tar.gz')

!ls -la   

Successfully download file modelarts-labs/notebook/DL_nlp_ner/ner.tar.gz from OBS to local ./ner.tar.gz
total 375220
drwxrwsrwx  4 ma-user ma-group      4096 Sep  6 13:34 .
drwsrwsr-x 22 ma-user ma-group      4096 Sep  6 13:03 ..
drwxr-s---  2 ma-user ma-group      4096 Sep  6 13:33 .ipynb_checkpoints
-rw-r-----  1 ma-user ma-group     45114 Sep  6 13:33 ner.ipynb
-rw-r-----  1 ma-user ma-group 384157325 Sep  6 13:35 ner.tar.gz
drwx--S---  2 ma-user ma-group      4096 Sep  6 13:03 .Trash-1000


Extract the archive downloaded from OBS, then delete the archive.

In [2]:
# Extract the archive
!tar xf ./ner.tar.gz

# Delete the archive
!rm ./ner.tar.gz

!ls -la    

total 68
drwxrwsrwx  5 ma-user ma-group  4096 Sep  6 13:35 .
drwsrwsr-x 22 ma-user ma-group  4096 Sep  6 13:03 ..
drwxr-s---  2 ma-user ma-group  4096 Sep  6 13:33 .ipynb_checkpoints
drwxr-s---  8 ma-user ma-group  4096 Sep  6 00:24 ner
-rw-r-----  1 ma-user ma-group 45114 Sep  6 13:33 ner.ipynb
drwx--S---  2 ma-user ma-group  4096 Sep  6 13:03 .Trash-1000
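
The same extract-then-delete step can also be done from Python with the standard library. This is only a sketch of an alternative to the shell commands above; the helper name `extract_and_remove` is made up for illustration.

```python
import os
import tarfile


def extract_and_remove(archive_path, dest_dir="."):
    """Extract a tar archive into dest_dir, then delete the archive."""
    # tarfile.open detects gzip compression automatically in the default mode.
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    os.remove(archive_path)


# Example call (commented out so the sketch does not touch real files):
# extract_and_remove("./ner.tar.gz")
```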


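## Introduction to named entity recognition

Named entity recognition is one of the most fundamental NLP tasks. It lays the groundwork for higher-level tasks such as information extraction, information retrieval, machine translation, and question answering.

Person names, place names, organization names, and similar items in a text are collectively called named entities.

This exercise uses BIO tagging: each element is tagged "B-X", "I-X", or "O".

- B-PER, I-PER: the first character and the non-first characters of a person name
- B-LOC, I-LOC: the first character and the non-first characters of a place name
- B-ORG, I-ORG: the first character and the non-first characters of an organization name
- O: not part of a named entity

An example: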

![](./img/BIO.png)
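
As a small illustration of the scheme, the sketch below derives character-level BIO tags from entity spans. The sentence, the spans, and the helper name `bio_tags` are hypothetical and are not taken from the dataset.

```python
def bio_tags(sentence, entities):
    """Tag each character of `sentence`.

    `entities` is a list of (start, end, type) spans, with `end` exclusive.
    """
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining characters
    return tags


sentence = "明天张亮要去体育场打篮球。"
entities = [(2, 4, "PER"), (6, 9, "LOC")]   # 张亮 (person), 体育场 (place)
tags = bio_tags(sentence, entities)
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```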

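## Named entity labeling in ModelArts

This part introduces the named entity labeling feature of ModelArts, which labels entity fields in text such as "time" and "place".

Log in to the ModelArts management console, choose `Data Labeling` (数据标注) in the left menu, and open the `Datasets` (数据集) management page.

Click `Create Dataset` (创建数据集) to prepare the text data to be labeled.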

![](./img/data_tagging.png)

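#### Prepare the unlabeled dataset

First create a dataset in OBS. Later operations, such as labeling data and publishing the dataset, all work on the datasets you create and manage.

OBS is available here: https://www.huaweicloud.com/product/obs0.html

The data labeling feature needs permission to access OBS and cannot be used before delegated authorization has been granted. On the `Data Labeling` (数据标注) page, click `Service Authorization` (服务授权); once an account with the necessary rights chooses `Agree` (同意授权), the feature becomes available.

Create an OBS bucket and folders to store the data. In this exercise the bucket is named `ner-tagging`. **Create a new bucket with a name of your own. OBS bucket names are globally unique, so if the name is already taken, choose another one.**

After the bucket is created, create an input folder and an output folder for labeling in it, and upload the text file to be labeled to the input folder.

Requirements for text labeling files: **the file format must be txt or csv, the file size must not exceed 8M, line breaks are used as separators, and each line represents one labeling object.**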
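
These requirements can be checked locally before uploading. A minimal sketch, assuming that "8M" means 8 MiB and that the file is UTF-8 encoded (neither is stated above); `check_labeling_file` is a hypothetical helper, not part of ModelArts.

```python
import os

MAX_BYTES = 8 * 1024 * 1024  # assumption: "8M" is read as 8 MiB


def check_labeling_file(path):
    """Return (problems, objects) for a candidate labeling file."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".txt", ".csv"):
        problems.append("extension must be .txt or .csv")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 8M")
    # One labeling object per non-empty line (UTF-8 is an assumption).
    with open(path, encoding="utf-8") as f:
        objects = [line.strip() for line in f if line.strip()]
    return problems, objects
```

An empty `problems` list means the file passed both checks; `objects` holds the lines that would become labeling objects.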

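The sample file `text.txt` used in this exercise can be [downloaded here](https://modelarts-labs.obs.cn-north-1.myhuaweicloud.com/notebook/DL_nlp_ner/text.tar.gz). After extracting it, upload it to the input folder and follow the steps in this case.

The folder structure created in this exercise looks like this: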

```
tagging
   │
   ├─input
   │       └─text.txt
   └─output
```

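where

- `input` is the input folder for named entity labeling
- `text.txt` is the input text file for named entity labeling
- `output` is the output folder for named entity labeling



Create the dataset for the named entity task, as shown below.

![](./img/tagging_ner_1.png)

Note the following creation parameters:

- Name: any dataset name you like; this case uses `ner-tagging`
- Dataset input location: `/ner-tagging/tagging/input/` in this case
- Dataset output location: `/ner-tagging/tagging/output/` in this case
- Labeling scenario: select `Text` (文本)
- Labeling type: select `Named Entity` (命名实体)
- Label set: you can choose the label names, their number, and their colors. This case defines three labels: `人物` (person) in blue, `时间` (time) in green, and `地点` (place) in red.

After completing these settings, click `Create` (创建) in the lower-right corner. Once the named entity dataset has been created, the system jumps to the dataset management page automatically.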

![](./img/tagging_ner_2.png)

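Click the dataset name to open the labeling page. Select an unlabeled object and click a label to annotate it, as shown below.

![](./img/tagging_ner_3.png)

Select the labeling object `明天张亮要去体育场打篮球。` ("Tomorrow Zhang Liang is going to the stadium to play basketball."):

- Select "明天" (tomorrow) and choose the label `时间` (time)
- Select "张亮" (Zhang Liang) and choose the label `人物` (person)
- Select "体育场" (stadium) and choose the label `地点` (place)

Then click `Save Current Page` (保存当前页) below to save.

Continue with the other labeling objects in the same way. To add more data to label, click `Add File` (添加文件) in the upper-left corner. After all data has been labeled (this sample provides only three named entity texts), click `Labeled` (已标注) to view the results.

![](./img/tagging_ner_4.png)

Click `Back to Dataset` (返回数据集); the dataset now shows as fully labeled.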

![](./img/tagging_ner_5.png)

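A newly created dataset has no version information until it is published, and it must be published before it can be used for model development or training.

Click `Publish` (发布). You can edit the version name; this case keeps the default `V001`.

![](./img/tagging_ner_6.png)

A successful publication looks like this.

![](./img/tagging_ner_7.png)

You can view the "Name", "Status", "Total files", and "Labeled files" of the dataset version, and check the publication time of each version under "Evolution" on the left.

The labeled dataset is then ready to use; the labeling results are stored in the `output` folder.

ModelArts will later release an **intelligent labeling** feature for this task. You have probably already tried intelligent image labeling in the second session, which completes labeling quickly and saves more than 70% of the labeling time. Intelligent labeling takes the labels produced so far, trains on them, and uses a model already available in the system to label the remaining data quickly. Stay tuned for updates to the data labeling feature.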

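## Dataset

This exercise uses the People's Daily 1998 annotated Chinese corpus (《人民日报1998年中文标注语料库》).

Dataset format: each line contains a character followed by its tag, separated by a space, and sentences are separated by a blank line, as shown below: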

![](./img/数据集示例.png)
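
A file in this format can be read with a few lines of Python. The sketch below is an assumption about how such a reader might look, not the `NerProcessor` code shipped in the archive, and the sample lines are invented.

```python
def read_sentences(lines):
    """Group "char tag" lines into (chars, tags) pairs, one per sentence."""
    sentences, chars, tags = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if chars:
                sentences.append((chars, tags))
                chars, tags = [], []
            continue
        ch, tag = line.rsplit(" ", 1)   # split on the last space only
        chars.append(ch)
        tags.append(tag)
    if chars:                            # last sentence without trailing blank line
        sentences.append((chars, tags))
    return sentences


sample = ["张 B-PER", "亮 I-PER", "来 O", "", "北 B-LOC", "京 I-LOC"]
sentences = read_sentences(sample)
print(sentences)
```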

In [3]:
!pip install tensorflow==1.11.0

!pip install tensorflow-gpu==1.11.0

You are using pip version 9.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 9.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


#### Import Python libraries

In [4]:
import os
import json
import numpy as np
import tensorflow as tf
import codecs
import pickle
import collections
from ner.bert import modeling, optimization, tokenization

#### Define paths and parameters

In [5]:
data_dir = "./ner/data"    
output_dir = "./ner/output"    
vocab_file = "./ner/chinese_L-12_H-768_A-12/vocab.txt"    
data_config_path = "./ner/chinese_L-12_H-768_A-12/bert_config.json"    
init_checkpoint = "./ner/chinese_L-12_H-768_A-12/bert_model.ckpt"    
max_seq_length = 128    
batch_size = 64    
num_train_epochs = 5.0    

#### Define the processor class to load the data, and print the labels

In [6]:
tf.logging.set_verbosity(tf.logging.INFO)
from ner.src.models import InputFeatures, InputExample, DataProcessor, NerProcessor

processors = {"ner": NerProcessor }
processor = processors["ner"](output_dir)

label_list = processor.get_labels()
print("labels:", label_list)

labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]']
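
The order of this list matters later: `convert_single_example` numbers the labels starting from 1, so id 0 stays free for padding positions, and the model is built with `len(label_list) + 1` classes. The mapping can be sketched on its own as follows.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]']

# Ids start at 1; id 0 is reserved for padding.
label2id = {label: i for i, label in enumerate(label_list, 1)}
id2label = {i: label for label, i in label2id.items()}

# One extra class accounts for the padding id 0.
num_labels = len(label_list) + 1

print(label2id)
print(num_labels)
```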


#### Load the pretrained parameters

In [7]:
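# Use a context manager so the config file is closed after reading.
with codecs.open(data_config_path, 'r', encoding='utf-8') as config_file:
    data_config = json.load(config_file)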
train_examples = processor.get_train_examples(data_dir)    
num_train_steps = int(len(train_examples) / batch_size * num_train_epochs)    
num_warmup_steps = int(num_train_steps * 0.1)   
data_config['num_train_steps'] = num_train_steps
data_config['num_warmup_steps'] = num_warmup_steps
data_config['num_train_size'] = len(train_examples)

print("Configuration:")
for key, value in data_config.items():
    print('{key}:{value}'.format(key=key, value=value))

bert_config = modeling.BertConfig.from_json_file(data_config_path)
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

# Run configuration for tf.estimator
run_config = tf.estimator.RunConfig(
    model_dir=output_dir,
    save_summary_steps=1000,
    save_checkpoints_steps=1000,
    session_config=tf.ConfigProto(
        log_device_placement=False,
        inter_op_parallelism_threads=0,
        intra_op_parallelism_threads=0,
        allow_soft_placement=True
    )
)

Configuration:
attention_probs_dropout_prob:0.1
directionality:bidi
hidden_act:gelu
hidden_dropout_prob:0.1
hidden_size:768
initializer_range:0.02
intermediate_size:3072
max_position_embeddings:512
num_attention_heads:12
num_hidden_layers:12
pooler_fc_size:768
pooler_num_attention_heads:12
pooler_num_fc_layers:3
pooler_size_per_head:128
pooler_type:first_token_transform
type_vocab_size:2
vocab_size:21128
num_train_steps:1630
num_warmup_steps:163
num_train_size:20864
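
The two step counts printed above follow directly from the dataset size, the batch size, and the number of epochs. The sketch below redoes that arithmetic and adds a hypothetical `learning_rate_at` helper. It assumes the schedule of the upstream BERT `optimization.py` (linear warmup followed by linear decay to zero); the copy in `ner.bert.optimization` may differ. The initial rate of 5e-5 is the value set in the model cell further down.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Reproduce the step arithmetic of the cell above."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * (1.0 - min(step, num_train_steps) / num_train_steps)


steps, warmup = training_schedule(20864, 64, 5.0)
print(steps, warmup)
for step in (0, warmup, steps):
    print(step, learning_rate_at(step, 5e-5, steps, warmup))
```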


#### Read the data and build the input features

In [8]:
def convert_single_example(ex_index, example, label_list, max_seq_length, 
                           tokenizer, output_dir, mode):
    label_map = {}
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    if not os.path.exists(os.path.join(output_dir, 'label2id.pkl')):
        with codecs.open(os.path.join(output_dir, 'label2id.pkl'), 'wb') as w:
            pickle.dump(label_map, w)

    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:  
                labels.append("X")
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # mark the start of the sentence with [CLS]
    segment_ids.append(0)
    label_ids.append(label_map["[CLS]"])  
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # mark the end of the sentence with [SEP]
    segment_ids.append(0)
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)  
    input_mask = [1] * len(input_ids)

    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
        ntokens.append("**NULL**")

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length

    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
    )
   
    return feature

def filed_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file, mode=None):
    writer = tf.python_io.TFRecordWriter(output_file)
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, output_dir, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
    # Close the writer so that all records are flushed to disk.
    writer.close()

train_file = os.path.join(output_dir, "train.tf_record")

# Convert the training examples into features used as training input
filed_based_convert_examples_to_features(
            train_examples, label_list, max_seq_length, tokenizer, output_file=train_file)

INFO:tensorflow:Writing example 0 of 20864
INFO:tensorflow:Writing example 5000 of 20864
INFO:tensorflow:Writing example 10000 of 20864
INFO:tensorflow:Writing example 15000 of 20864
INFO:tensorflow:Writing example 20000 of 20864
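
To make the layout of one training record concrete, here is a stripped-down sketch of what `convert_single_example` produces. It works on characters directly and uses a toy vocabulary instead of the BERT tokenizer, so the ids are illustrative only; `build_features` and `toy_vocab` are hypothetical names.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'X', '[CLS]', '[SEP]']
label2id = {label: i for i, label in enumerate(label_list, 1)}

# Toy vocabulary, for illustration only (the real ids come from vocab.txt).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "张": 4, "亮": 5}


def build_features(chars, tags, max_seq_length):
    # Leave room for the [CLS] and [SEP] markers.
    chars = chars[:max_seq_length - 2]
    tags = tags[:max_seq_length - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    labels = ["[CLS]"] + tags + ["[SEP]"]

    input_ids = [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]
    label_ids = [label2id[l] for l in labels]
    input_mask = [1] * len(input_ids)     # 1 marks a real token

    # Pad everything with zeros up to max_seq_length.
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad
    label_ids += [0] * pad
    segment_ids = [0] * max_seq_length    # single-sentence input
    return input_ids, input_mask, segment_ids, label_ids


features = build_features(["张", "亮"], ["B-PER", "I-PER"], max_seq_length=6)
print(features)
```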


#### Add a BiLSTM+CRF layer as the downstream model

In [9]:
learning_rate = 5e-5 
dropout_rate = 1.0   
lstm_size=1    
cell='lstm'
num_layers=1

from ner.src.models import BLSTM_CRF
from tensorflow.contrib.layers.python.layers import initializers

def create_model(bert_config, is_training, input_ids, input_mask,
                 segment_ids, labels, num_labels, use_one_hot_embeddings,
                 dropout_rate=dropout_rate, lstm_size=1, cell='lstm', num_layers=1):
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings
    )
    embedding = model.get_sequence_output()
    max_seq_length = embedding.shape[1].value
    used = tf.sign(tf.abs(input_ids))
    lengths = tf.reduce_sum(used, reduction_indices=1)  
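    # Pass the function arguments through instead of hard-coding them.
    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=lstm_size, cell_type=cell, num_layers=num_layers,
                          dropout_rate=dropout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    # Judging by the flag name, crf_only=True puts only the CRF layer on top of the BERT output.
    rst = blstm_crf.add_blstm_crf_layer(crf_only=True)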
    return rst

def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps,use_one_hot_embeddings=False):
    # Build the model
    def model_fn(features, labels, mode, params):
        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]

        print('shape of input_ids', input_ids.shape)
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)

        total_loss, logits, trans, pred_ids = create_model(
            bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
            num_labels, False, dropout_rate, lstm_size, cell, num_layers)

        tvars = tf.trainable_variables()

        if init_checkpoint:
            (assignment_map, initialized_variable_names) = \
                 modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
        
        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = optimization.create_optimizer(
                 total_loss, learning_rate, num_train_steps, num_warmup_steps, False)
            hook_dict = {}
            hook_dict['loss'] = total_loss
            hook_dict['global_steps'] = tf.train.get_or_create_global_step()
            logging_hook = tf.train.LoggingTensorHook(
                hook_dict, every_n_iter=100)

            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                training_hooks=[logging_hook])

        elif mode == tf.estimator.ModeKeys.EVAL:
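            def metric_fn(label_ids, pred_ids):
                # "eval_loss" is the mean squared error between the gold and
                # predicted label ids, not the loss used for training.
                return {
                    "eval_loss": tf.metrics.mean_squared_error(labels=label_ids, predictions=pred_ids),
                }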
            
            eval_metrics = metric_fn(label_ids, pred_ids)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics
            )
        else:
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions=pred_ids
            )
        return output_spec

    return model_fn
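
`create_model` derives the true length of each sequence from `input_ids`: padding positions hold id 0, so `sign(abs(ids))` is 1 for real tokens and 0 for padding, and the row sum is the sequence length. A NumPy sketch of the same computation, with invented ids:

```python
import numpy as np

# Two padded sequences; the ids are made up, 0 is padding.
input_ids = np.array([
    [101, 2476, 778, 102, 0, 0],
    [101, 1266, 102, 0, 0, 0],
])

used = np.sign(np.abs(input_ids))   # 1 for real tokens, 0 for padding
lengths = used.sum(axis=1)          # number of real tokens per sequence

print(used)
print(lengths)
```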


#### Create the model and start training

In [10]:
model_fn = model_fn_builder(
        bert_config=bert_config,
        num_labels=len(label_list) + 1,
        init_checkpoint=init_checkpoint,
        learning_rate=learning_rate,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        use_one_hot_embeddings=False)

def file_based_input_fn_builder(input_file, seq_length, is_training, drop_remainder):
    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "label_ids": tf.FixedLenFeature([seq_length], tf.int64),
    }

    def _decode_record(record, name_to_features):
        example = tf.parse_single_example(record, name_to_features)
        for name in list(example.keys()):
            t = example[name]
            if t.dtype == tf.int64:
                t = tf.to_int32(t)
            example[name] = t
        return example

    def input_fn(params):
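        # Note: this overrides the batch size passed in through the Estimator's
        # params (64), so training actually runs with batches of 32.
        params["batch_size"] = 32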
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=300)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder
        ))
        return d

    return input_fn

# Training input
train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=max_seq_length,
            is_training=True,
            drop_remainder=True)

num_train_size = len(train_examples)

tf.logging.info("***** Running training *****")
tf.logging.info("  Num examples = %d", num_train_size)
tf.logging.info("  Batch size = %d", batch_size)
tf.logging.info("  Num steps = %d", num_train_steps)

# Estimator used for training, evaluation, and prediction
estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=run_config,
        params={
        'batch_size':batch_size
    })

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)


INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 20864
INFO:tensorflow:  Batch size = 64
INFO:tensorflow:  Num steps = 1630
INFO:tensorflow:Using config: {'_model_dir': './ner/output', '_tf_random_seed': None, '_save_summary_steps': 1000, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fca68ba6748>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, sh

shape of input_ids (32, 128)


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./ner/output/model.ckpt.
INFO:tensorflow:loss = 159.27417, step = 0
INFO:tensorflow:global_steps = 0, loss = 159.27417
INFO:tensorflow:global_step/sec: 1.43142
INFO:tensorflow:loss = 52.92234, step = 100 (69.862 sec)
INFO:tensorflow:global_steps = 100, loss = 52.92234 (69.862 sec)
INFO:tensorflow:global_step/sec: 1.81746
INFO:tensorflow:loss = 45.81129, step = 200 (55.022 sec)
INFO:tensorflow:global_steps = 200, loss = 45.81129 (55.022 sec)
INFO:tensorflow:global_step/sec: 1.82186
INFO:tensorflow:loss = 48.826424, step = 300 (54.890 sec)
INFO:tensorflow:global_steps = 300, loss = 48.826424 (54.890 sec)
INFO:tensorflow:global_step/sec: 1.82109
INFO:tensorflow:loss = 44.61993, step = 400 (54.910 sec)
INFO:tensorflow:global_steps = 400, loss = 

<tensorflow.python.estimator.estimator.Estimator at 0x7fca68ad65c0>
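
The log reports a batch size of 64, but the printed input shape is `(32, 128)` because `input_fn` sets `params["batch_size"] = 32`. Since `num_train_steps` was computed for batches of 64, the run covers fewer passes over the data than `num_train_epochs` suggests. A quick check of that arithmetic, using the numbers from the log:

```python
num_examples = 20864          # "Num examples" in the log
num_train_steps = 1630        # "Num steps" in the log
effective_batch_size = 32     # first dimension of the printed input shape

examples_seen = num_train_steps * effective_batch_size
effective_epochs = examples_seen / num_examples

print(examples_seen, effective_epochs)
```

Under these numbers the model sees each training example about 2.5 times rather than 5.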

#### Evaluate the model on the validation set

In [11]:
eval_examples = processor.get_dev_examples(data_dir)
eval_file = os.path.join(output_dir, "eval.tf_record")
filed_based_convert_examples_to_features(
                eval_examples, label_list, max_seq_length, tokenizer, eval_file)
data_config['eval.tf_record_path'] = eval_file
data_config['num_eval_size'] = len(eval_examples)
num_eval_size = data_config.get('num_eval_size', 0)

tf.logging.info("***** Running evaluation *****")
tf.logging.info("  Num examples = %d", num_eval_size)
tf.logging.info("  Batch size = %d", batch_size)

eval_steps = None
eval_drop_remainder = False
eval_input_fn = file_based_input_fn_builder(
            input_file=eval_file,
            seq_length=max_seq_length,
            is_training=False,
            drop_remainder=eval_drop_remainder)

result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
output_eval_file = os.path.join(output_dir, "eval_results.txt")
with codecs.open(output_eval_file, "w", encoding='utf-8') as writer:
    tf.logging.info("***** Eval results *****")
    for key in sorted(result.keys()):
        tf.logging.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

if not os.path.exists(data_config_path):
    with codecs.open(data_config_path, 'a', encoding='utf-8') as fd:
        json.dump(data_config, fd)


INFO:tensorflow:Writing example 0 of 4631
INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:  Num examples = 4631
INFO:tensorflow:  Batch size = 64
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = label_ids, shape = (?, 128)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)


shape of input_ids (?, 128)


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-09-06-05:53:54
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./ner/output/model.ckpt-1630
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-09-06-05:54:35
INFO:tensorflow:Saving dict for global step 1630: eval_loss = 0.042222254, global_step = 1630, loss = 35.88692
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1630: ./ner/output/model.ckpt-1630
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_loss = 0.042222254
INFO:tensorflow:  global_step = 1630
INFO:tensorflow:  loss = 35.88692
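
Note that `eval_loss` is the mean squared error between gold and predicted label ids, so its value depends on how the labels happen to be numbered. The toy example below (invented ids, NumPy instead of TensorFlow) contrasts it with token accuracy over non-padding positions, which is usually easier to interpret.

```python
import numpy as np

# One padded sequence: [CLS]=9, B-PER=2, I-PER=3, O=1, [SEP]=10, padding=0.
label_ids = np.array([[9, 2, 3, 1, 10, 0, 0]])
pred_ids = np.array([[9, 2, 1, 1, 10, 0, 0]])   # one wrong tag (3 -> 1)

# What the "eval_loss" metric measures: squared difference of the ids.
mse = np.mean((label_ids - pred_ids) ** 2)

# Token accuracy, ignoring padding positions.
mask = label_ids != 0
accuracy = np.mean(pred_ids[mask] == label_ids[mask])

print(mse, accuracy)
```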


#### Test on the test set

In [12]:
token_path = os.path.join(output_dir, "token_test.txt")
if os.path.exists(token_path):
    os.remove(token_path)

with codecs.open(os.path.join(output_dir, 'label2id.pkl'), 'rb') as rf:
    label2id = pickle.load(rf)
    id2label = {value: key for key, value in label2id.items()}

predict_examples = processor.get_test_examples(data_dir)
predict_file = os.path.join(output_dir, "predict.tf_record")
filed_based_convert_examples_to_features(predict_examples, label_list,
                                                 max_seq_length, tokenizer,
                                                 predict_file, mode="test")

tf.logging.info("***** Running prediction*****")
tf.logging.info("  Num examples = %d", len(predict_examples))
tf.logging.info("  Batch size = %d", batch_size)
    
predict_drop_remainder = False
predict_input_fn = file_based_input_fn_builder(
            input_file=predict_file,
            seq_length=max_seq_length,
            is_training=False,
            drop_remainder=predict_drop_remainder)

predicted_result = estimator.evaluate(input_fn=predict_input_fn)
output_eval_file = os.path.join(output_dir, "predicted_results.txt")
with codecs.open(output_eval_file, "w", encoding='utf-8') as writer:
    tf.logging.info("***** Predict results *****")
    for key in sorted(predicted_result.keys()):
        tf.logging.info("  %s = %s", key, str(predicted_result[key]))
        writer.write("%s = %s\n" % (key, str(predicted_result[key])))

result = estimator.predict(input_fn=predict_input_fn)
output_predict_file = os.path.join(output_dir, "label_test.txt")

def result_to_pair(writer):
    for predict_line, prediction in zip(predict_examples, result):
        idx = 0
        line = ''
        line_token = str(predict_line.text).split(' ')
        label_token = str(predict_line.label).split(' ')
        if len(line_token) != len(label_token):
            tf.logging.info(predict_line.text)
            tf.logging.info(predict_line.label)
        for id in prediction:
            if id == 0:
                continue
            curr_labels = id2label[id]
            if curr_labels in ['[CLS]', '[SEP]']:
                continue
            try:
                line += line_token[idx] + ' ' + label_token[idx] + ' ' + curr_labels + '\n'
            except Exception as e:
                tf.logging.info(e)
                tf.logging.info(predict_line.text)
                tf.logging.info(predict_line.label)
                line = ''
                break
            idx += 1
        writer.write(line + '\n')
            
from ner.src.conlleval import return_report

with codecs.open(output_predict_file, 'w', encoding='utf-8') as writer:
    result_to_pair(writer)
eval_result = return_report(output_predict_file)
for line in eval_result:
    print(line)

INFO:tensorflow:Writing example 0 of 68
INFO:tensorflow:***** Running prediction*****
INFO:tensorflow:  Num examples = 68
INFO:tensorflow:  Batch size = 64
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = label_ids, shape = (?, 128)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)


shape of input_ids (?, 128)


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-09-06-05:55:20
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./ner/output/model.ckpt-1630
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-09-06-05:55:24
INFO:tensorflow:Saving dict for global step 1630: eval_loss = 0.017003676, global_step = 1630, loss = 34.176826
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1630: ./ner/output/model.ckpt-1630
INFO:tensorflow:***** Predict results *****
INFO:tensorflow:  eval_loss = 0.017003676
INFO:tensorflow:  global_step = 1630
INFO:tensorflow:  loss = 34.176826
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = label_ids, shape = (?, 128)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)


shape of input_ids (?, 128)


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./ner/output/model.ckpt-1630
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


processed 2270 tokens with 78 phrases; found: 82 phrases; correct: 76.

accuracy:  99.56%; precision:  92.68%; recall:  97.44%; FB1:  95.00

              LOC: precision:  97.83%; recall: 100.00%; FB1:  98.90  46

              ORG: precision:  66.67%; recall: 100.00%; FB1:  80.00  12

              PER: precision:  95.83%; recall:  92.00%; FB1:  93.88  24
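
The overall figures in this report can be recomputed from the three counts in its first line: 78 gold phrases, 82 predicted phrases, and 76 correct ones. A short sketch (the helper name `prf` is ours):

```python
def prf(num_gold, num_found, num_correct):
    """Phrase-level precision, recall, and F1 from raw counts."""
    precision = num_correct / num_found
    recall = num_correct / num_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = prf(num_gold=78, num_found=82, num_correct=76)
print("precision: %.2f%%; recall: %.2f%%; FB1: %.2f" % (100 * p, 100 * r, 100 * f))
```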



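### Online named entity recognition

Use the model trained above for an online test: enter any sentence and the entities in it will be recognized.

Enter "再见" ("goodbye") to end the online session.

<span style="color:red">If the program below fails to run, GPU memory is still occupied after training. Restart the kernel, then run the %run command.</span>

To release the resources: Menu > Kernel > Restart  

![Release resources](./img/释放资源.png)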
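
Turning a predicted BIO tag sequence back into entity strings is a simple grouping step. The sketch below shows one way to do it; it is an assumption about the general technique, not the code in `terminal_predict.py`, and the sentence and tags are invented.

```python
from collections import defaultdict


def extract_entities(chars, tags):
    """Group characters into entities according to their BIO tags."""
    entities = defaultdict(list)
    current, etype = [], None

    def flush():
        if current:
            entities[etype].append("".join(current))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()                               # close the previous entity
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(ch)                    # continue the open entity
        else:
            flush()                               # "O" or an inconsistent tag
            current, etype = [], None
    flush()
    return dict(entities)


chars = list("张亮在北京体育馆")
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "B-LOC", "I-LOC", "I-LOC"]
found = extract_entities(chars, tags)
print(found)
```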


In [1]:
%run ner/src/terminal_predict.py

checkpoint path:./ner/output/checkpoint
going to restore checkpoint
INFO:tensorflow:Restoring parameters from ./ner/output/model.ckpt-1630
{1: 'O', 2: 'B-PER', 3: 'I-PER', 4: 'B-ORG', 5: 'I-ORG', 6: 'B-LOC', 7: 'I-LOC', 8: 'X', 9: '[CLS]', 10: '[SEP]'}
输入句子:
中国男篮与委内瑞拉队在北京五棵松体育馆展开小组赛最后一场比赛的争夺，赵继伟12分4助攻3抢断、易建联11分8篮板、周琦8分7篮板2盖帽。
[['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
LOC, 北京, 五棵松体育馆
PER, 赵继伟, 易建联, 周琦
ORG, 中国男篮, 委内瑞拉队
time used: 0.908481 sec
输入句子:
周杰伦（Jay Chou），1979年1月18日出生于台湾省新北市，毕业于淡江中学，中国台湾流行乐男歌手。
[['B-PER', 'I-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'