# Natural Language Processing in Practice: Text Similarity

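In natural language processing (NLP), we often need to measure how similar two texts are. Semantic textual similarity (STS) is the task of judging whether, or how closely, two pieces of text match in meaning. Similarity can be expressed at different granularities: as a score from the set {1, 2, 3, 4, 5}, where a larger value means more similar, or as a binary label from {0, 1}, where 1 means similar and 0 means not similar. This case uses the binary {0, 1} labels. Text similarity is used in NLP tasks such as information retrieval, question answering, machine translation, and automatic summarization.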

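Methods for measuring text similarity fall into three groups:

1. Traditional methods based on keyword matching, such as N-gram similarity;

2. Methods that map text into a vector space and then apply a measure such as cosine similarity;

3. Deep learning methods, such as DSSM, a deep semantic matching model trained on user click data, and ConvNet, which is based on convolutional neural networks.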
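
To make the first two families concrete, here is a minimal sketch using only the Python standard library. It is not the method used in this case (that is BERT); the function names `char_ngrams`, `ngram_jaccard`, and `cosine_similarity` are illustrative, and the choice of character bigrams and raw character counts is an assumption made to keep the example short.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the list of character n-grams in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_jaccard(a, b, n=2):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    vec_a, vec_b = Counter(a), Counter(b)
    dot = sum(vec_a[ch] * vec_b[ch] for ch in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = "好不难受。"
s2 = "很不难受。"
print(ngram_jaccard(s1, s2))
print(cosine_similarity(s1, s2))
```

Both scores come purely from surface overlap, so they cannot tell whether a swapped character changes the meaning. That limitation is one motivation for the learned model used in the rest of this case.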

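This case uses the deep learning model **BERT** to compute text similarity.

By text length, Chinese similarity can be measured between characters, words, sentences, paragraphs, or whole documents.

This case presents an embedding-based method for computing the similarity of short Chinese sentences.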

### Open ModelArts

Go to https://www.huaweicloud.com/product/modelarts.html to open the ModelArts home page. Click "立即使用" (Use Now), sign in with your username and password, and you will land on the ModelArts console.

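### Create a ModelArts notebook

Next, create a notebook development environment in ModelArts. A ModelArts notebook provides a web-based Python environment where you can write and run code and view the results.

Step 1: On the ModelArts main page, click "开发环境" (Development Environment) and then "创建" (Create).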

![create_nb_create_button](./img/create_nb_create_button.png)

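Step 2: Fill in the notebook parameters:

| Parameter | Description |
| ----- | ----- |
| Billing mode | Pay-per-use |
| Name | Name of the notebook instance, for example text_sentiment_analysis |
| Work environment | Python3 |
| Resource pool | Select "公共资源池" (public resource pool) |
| Type | This case uses a fairly complex deep neural network that needs substantial compute, so select "GPU" |
| Specification | Select "8核 &#124; 64GiB &#124; 1*p100" (8 cores, 64 GiB memory, one P100 GPU) |
| Storage | Select EVS with a 5 GB disk |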

Step 3: After configuring the notebook parameters, click Next to preview the notebook details. Once everything looks correct, click "立即创建" (Create Now).

![create_nb_creation_summary](./img/create_nb_creation_summary.png)

Step 4: Return to the development environment page, wait for the notebook to finish being created, then open it to continue.
![modelarts_notebook_index](./img/modelarts_notebook_index.png)

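### Create a development environment in ModelArts

Next, create the actual development environment used in the following steps.

Step 1: Click the "打开" (Open) button shown below to enter the notebook you just created.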
![inter_dev_env](img/enter_dev_env.png)

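Step 2: Create a Python 3 notebook. Click "New" in the upper-right corner and create a TensorFlow 1.13.1 environment.

Step 3: Click the file name "Untitled" in the upper-left corner and enter a name related to this experiment.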
![notebook_untitled_filename](./img/notebook_untitled_filename.png)
![notebook_name_the_ipynb](./img/notebook_name_the_ipynb.png)


### Write and run code in the notebook

In the notebook, enter a simple print statement and click the run button at the top to see the result:
![run_helloworld](./img/run_helloworld.png)


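## Computing text similarity

### Dataset

This case uses the Chinese text similarity corpus provided by Xi'an University of Science and Technology. Each sentence pair carries a similarity label from {0, 1}: 0 means not similar and 1 means similar.

Data format:

| Field | Quality | #1 ID | #2 ID | #1 String | #2 String |
| ---- | ------- | ---------- | ------------ | ---------- | ---------- |
| Meaning | Similarity label | ID of the first sentence | ID of the second sentence | Text of the first sentence | Text of the second sentence |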
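
As an illustration of this layout, the sketch below parses a few tab-separated rows with Python's `csv` module. The two data rows are copied from the training-set preview printed later in this tutorial; the helper name `parse_pairs` and the in-memory `SAMPLE_TSV` string are assumptions made for the example, and the notebook itself reads `train.tsv` through its own `read_tsv` function.

```python
import csv
import io

# Header row plus two rows copied from the training-set preview in this
# tutorial. The real data lives in train.tsv and dev.tsv.
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t930\t3125\t好不难受。\t很不难受。\n"
    "0\t387\t1930624\t小明在家务上给妈妈帮了不少忙。\t中华民国的绥远省政府至此彻底消亡。\n"
)


def parse_pairs(tsv_text):
    """Parse tab-separated text into (label, sentence_1, sentence_2) tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [(int(row["Quality"]), row["#1 String"], row["#2 String"])
            for row in reader]


pairs = parse_pairs(SAMPLE_TSV)
for label, first, second in pairs:
    print(label, first, second)
```

Only the label and the two sentence texts are needed for training; the two ID columns identify the sentences but are not fed to the model.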


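### The BERT model

This exercise uses **BERT**, a pretrained Transformer-based language model that can be fine-tuned for sentence-pair classification.

The Chinese pretrained model **BERT-Base, Chinese** can be downloaded from [BERT-Base, Chinese](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip).

#### Prepare the source code and data

The source code and data for this case are stored in OBS. We use the ModelArts SDK to download them to the local notebook environment.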

In [1]:
from modelarts.session import Session
session = Session()

if session.region_name == 'cn-north-1':
    bucket_path = 'modelarts-labs/notebook/DL_nlp_text_similarity/text_similarity.tar.gz'
    
elif session.region_name == 'cn-north-4':
    bucket_path = 'modelarts-labs-bj4/notebook/DL_nlp_text_similarity/text_similarity.tar.gz'
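else:
    # Stop here: without a bucket path the download below cannot run.
    raise RuntimeError("Unsupported region: switch to cn-north-1 (Beijing 1) or cn-north-4 (Beijing 4)")

session.download_data(bucket_path=bucket_path, path='./text_similarity.tar.gz')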

!ls -la

Successfully download file modelarts-labs/notebook/DL_nlp_text_similarity/text_similarity.tar.gz from OBS to local ./text_similarity.tar.gz
total 375896
drwxrwxrwx  4 ma-user ma-group      4096 Sep 29 10:12 .
drwsrwsr-x 22 ma-user ma-group      4096 Sep 29 10:10 ..
drwxr-x---  2 ma-user ma-group      4096 Sep 29 10:00 .ipynb_checkpoints
-rw-r-----  1 ma-user ma-group   1855715 Sep 29 10:10 text_similarity.ipynb
-rw-r-----  1 ma-user ma-group 383037805 Sep 29 10:12 text_similarity.tar.gz
drwx------  2 ma-user ma-group      4096 Sep 29 10:12 .Trash-1000
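
The region check in the download cell can also be written as a dictionary lookup, which makes the unsupported-region case explicit. This is only a sketch: `BUCKET_PATHS` and `resolve_bucket_path` are hypothetical names, the two paths are copied from the cell above, and no ModelArts SDK call is made here.

```python
# Bucket paths copied from the download cell, keyed by region name.
BUCKET_PATHS = {
    "cn-north-1": "modelarts-labs/notebook/DL_nlp_text_similarity/text_similarity.tar.gz",
    "cn-north-4": "modelarts-labs-bj4/notebook/DL_nlp_text_similarity/text_similarity.tar.gz",
}


def resolve_bucket_path(region_name):
    """Return the OBS path for a supported region, or raise ValueError."""
    try:
        return BUCKET_PATHS[region_name]
    except KeyError:
        supported = ", ".join(sorted(BUCKET_PATHS))
        raise ValueError(
            "Unsupported region %r; switch to one of: %s" % (region_name, supported)
        ) from None


print(resolve_bucket_path("cn-north-4"))
```

In the notebook, the returned value would be passed as `bucket_path` to `session.download_data`.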


Extract the archive downloaded from OBS, then delete it.

In [2]:
!tar xf ./text_similarity.tar.gz

!rm ./text_similarity.tar.gz

!ls -la

total 1836
drwxrwxrwx  5 ma-user ma-group    4096 Sep 29 10:12 .
drwsrwsr-x 22 ma-user ma-group    4096 Sep 29 10:10 ..
drwxr-x---  2 ma-user ma-group    4096 Sep 29 10:00 .ipynb_checkpoints
drwxr-x---  6 ma-user ma-group    4096 Sep 24 18:12 text_similarity
-rw-r-----  1 ma-user ma-group 1855715 Sep 29 10:10 text_similarity.ipynb
drwx------  2 ma-user ma-group    4096 Sep 29 10:12 .Trash-1000
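
The same extract-then-delete step can be done from Python instead of shell commands. The sketch below uses the standard `tarfile` module on a throwaway archive built in a temporary directory, so it runs anywhere; `extract_and_remove` is a hypothetical helper, not part of the case's source code. Note that `extractall` should only be used on archives you trust.

```python
import os
import tarfile
import tempfile


def extract_and_remove(archive_path, target_dir="."):
    """Extract a tar archive into target_dir, then delete the archive."""
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target_dir)
    os.remove(archive_path)


# Self-contained demonstration on a small archive in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "hello.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("hello")
    archive_path = os.path.join(workdir, "demo.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(sample, arcname="demo/hello.txt")

    extract_and_remove(archive_path, target_dir=workdir)
    extracted_ok = os.path.exists(os.path.join(workdir, "demo", "hello.txt"))
    archive_gone = not os.path.exists(archive_path)
    print(extracted_ok, archive_gone)
```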


#### Import dependencies

In [3]:
import tensorflow as tf
import os
import csv
import collections
from text_similarity.bert import modeling, optimization, tokenization

tf.logging.set_verbosity(tf.logging.INFO)

#### Define data and model paths

In [4]:
# BERT model configuration, vocabulary, and pretrained checkpoint
bert_config_file = 'text_similarity/chinese_L-12_H-768_A-12/bert_config.json'
vocab_file = 'text_similarity/chinese_L-12_H-768_A-12/vocab.txt'
init_checkpoint = 'text_similarity/chinese_L-12_H-768_A-12/bert_model.ckpt'

# Dataset path
data_dir = 'text_similarity/data/'

# Output location for the trained model
output_dir = 'text_similarity/output/'

#### Set TensorFlow runtime parameters

In [5]:
label_list = ["0", "1"]
do_lower_case = False
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
use_tpu = False
tpu_cluster_resolver = None
master = None





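#### Set model parameters

In [6]:
train_batch_size = 32          # examples per training step
eval_batch_size = 8            # examples per evaluation step
predict_batch_size = 8         # examples per prediction step
num_epochs = 5.0               # passes over the training set
warmup_proportion = 0.1        # fraction of training steps used for learning-rate warmup
learning_rate = 2e-5           # base learning rate for fine-tuning
max_seq_length = 128           # maximum tokens per input, including [CLS] and [SEP]
save_checkpoints_steps = 1000  # save a checkpoint every N steps
iterations_per_loop = 1000     # steps per estimator loop (passed to TPUConfig)
num_gpu_cores = 1              # passed as num_shards in TPUConfig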

#### Load the Chinese vocabulary of the pretrained BERT model

In [7]:
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer.tokenize("今天的天气真好！")

['今', '天', '的', '天', '气', '真', '好', '！']
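
The output above shows that the Chinese sentence is split into single characters. The following sketch imitates only that visible behavior: it isolates CJK characters and punctuation and keeps other runs of characters together. It does not apply WordPiece sub-word splitting (which is why the real tokenizer can produce pieces such as `164 ##1` in later logs), and `simple_tokenize` and `is_cjk` are illustrative names, not functions from `tokenization`.

```python
import unicodedata


def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def simple_tokenize(text):
    """Split CJK characters and punctuation into single tokens.

    Runs of other non-space characters (for example Latin letters or digits)
    are kept together. WordPiece sub-word splitting is NOT applied here.
    """
    tokens, buffer = [], []
    for ch in text:
        standalone = is_cjk(ch) or unicodedata.category(ch).startswith("P")
        if standalone or ch.isspace():
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
            if standalone:
                tokens.append(ch)
        else:
            buffer.append(ch)
    if buffer:
        tokens.append("".join(buffer))
    return tokens


print(simple_tokenize("今天的天气真好！"))
```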

#### Create the input data classes

In [8]:
class InputExample(object):

  def __init__(self, guid, text_a, text_b=None, label=None):
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label

class InputFeatures(object):

  def __init__(self,
               input_ids,
               input_mask,
               segment_ids,
               label_id,
               is_real_example=True):
    self.input_ids = input_ids
    self.input_mask = input_mask
    self.segment_ids = segment_ids
    self.label_id = label_id
    self.is_real_example = is_real_example
    
    
class PaddingInputExample(object):
    pass

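#### Read the training set

Each row has the format: similarity label (Quality), ID of the first sentence (#1 ID), ID of the second sentence (#2 ID), text of the first sentence (#1 String), and text of the second sentence (#2 String).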


In [9]:
def read_tsv(input_file, quotechar=None):
    with tf.gfile.Open(input_file, "r") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = []
        i = 0
        for line in reader:
            lines.append(line)
            if i < 50:
                print('line', i, ':', line)
                i += 1
        return lines
    
def create_examples(lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[3])
      text_b = tokenization.convert_to_unicode(line[4])
      if set_type == "test":
        label = "0"
      else:
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

def get_train_examples(data_dir):
    return create_examples(read_tsv(os.path.join(data_dir, "train.tsv")), "train")

train_examples = get_train_examples(data_dir)

line 0 : ['Quality', '#1 ID', '#2 ID', '#1 String', '#2 String']
line 1 : ['0', '636', '3175303', '老太太的情绪不稳定。', '这个模式对很多领域都很实用，尤其是数位版权管理方面，因为它的最小描述单元层次，可利用来指定识别码进行个别的筛选与授权使用。']
line 2 : ['0', '922', '4606174', '事情几乎没办成。', '作为拜仁慕尼黑的青训产物，穆勒在2009至10赛季被时任主教练路易斯·范加尔提拔至一线队，他在该赛季几乎参加了队内的所有比赛，为球队赢得了联赛及杯赛双冠王，并且晋身欧洲冠军联赛决赛。']
line 3 : ['0', '387', '1930624', '小明在家务上给妈妈帮了不少忙。', '中华民国的绥远省政府至此彻底消亡。']
line 4 : ['1', '930', '3125', '好不难受。', '很不难受。']
line 5 : ['0', '327', '1633734', '他下午也许来不了。', '从20世界90年代开始，哮喘的得病率在发达国家趋于平稳，而在发展中国家快速增长。']
line 6 : ['1', '832', '2617', '假设真的没有文明和文化，那么这个世界就像个未成品。', '如果真的没有文明和文化，这个世界便像个未成品。']
line 7 : ['1', '804', '2464', '当官不为民做主，不如回家卖红薯。', '假如当官不为民做主，还不如回家卖红薯。']
line 8 : ['0', '801', '4000941', '必须尽快改变现状，否则我真的没有出路了。', '另外，由于每位选手必须要最少投球一次（捕手除外），故参加者多是全能球员。']
line 9 : ['0', '574', '2866598', '路上有许多人在赶路。', '后来，站台上部玻璃板开始松动，考虑到玻璃板可能断裂车站立即紧急疏散。']
line 10 : ['1', '788', '2089', '这一次，投资者仅仅是“有望收回成本”，换句话说，很可能赔本！', '这一次，投资者仅仅是“有望收回成本”，就是说，很可能赔本！']

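#### Convert to BERT input features

Print the first 5 examples together with their tokens, token IDs (input_ids), attention mask (input_mask), segment IDs (segment_ids), and labels.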

In [10]:
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):

    if isinstance(example, PaddingInputExample):
        return InputFeatures(
            input_ids=[0] * max_seq_length,
            input_mask=[0] * max_seq_length,
            segment_ids=[0] * max_seq_length,
            label_id=0,
            is_real_example=False)
    
    label_map = {}
    for (i, label) in enumerate(label_list):
        label_map[label] = i

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

    if tokens_b:
        truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    tokens = []
    segment_ids = []
    tokens.append("[CLS]") # Add the [CLS] marker at the start of the sequence
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]") # Add a [SEP] marker at the end of the first sentence
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)  
    input_mask = [1] * len(input_ids)

    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    label_id = label_map[example.label]
    
    if ex_index < 5:
        tf.logging.info("*** Example ***")
        tf.logging.info("guid: %s" % (example.guid)) 
        tf.logging.info("tokens: %s" % " ".join([tokenization.printable_text(x) for x in tokens])) 
        tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))  
        tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 
        tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 
        tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 

    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_id=label_id,
        is_real_example=True)
    return feature


def file_based_convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_file):
    writer = tf.python_io.TFRecordWriter(output_file)

    for (ex_index, example) in enumerate(examples):
        if ex_index % 10000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer)
        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature([feature.label_id])
        features["is_real_example"] = create_int_feature([int(feature.is_real_example)])

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
    writer.close()

train_file = os.path.join(output_dir, "train.tf_record")

file_based_convert_examples_to_features(train_examples, label_list, max_seq_length, tokenizer, train_file)

INFO:tensorflow:Writing example 0 of 9416
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-1
INFO:tensorflow:tokens: [CLS] 老 太 太 的 情 绪 不 稳 定 。 [SEP] 这 个 模 式 对 很 多 领 域 都 很 实 用 ， 尤 其 是 数 位 版 权 管 理 方 面 ， 因 为 它 的 最 小 描 述 单 元 层 次 ， 可 利 用 来 指 定 识 别 码 进 行 个 别 的 筛 选 与 授 权 使 用 。 [SEP]
INFO:tensorflow:input_ids: 101 5439 1922 1922 4638 2658 5328 679 4937 2137 511 102 6821 702 3563 2466 2190 2523 1914 7566 1818 6963 2523 2141 4500 8024 2215 1071 3221 3144 855 4276 3326 5052 4415 3175 7481 8024 1728 711 2124 4638 3297 2207 2989 6835 1296 1039 2231 3613 8024 1377 1164 4500 3341 2900 2137 6399 1166 4772 6822 6121 702 1166 4638 5033 6848 680 2956 3326 886 4500 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
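
The layout visible in these logs (`[CLS]` sentence A `[SEP]` sentence B `[SEP]`, then zero padding) can be reproduced in a few lines of plain Python. This is a simplified sketch of what `convert_single_example` does for a sentence pair. The toy `vocab` is an assumption: only the IDs 101 for `[CLS]` and 102 for `[SEP]` match the logs above, and the other IDs are made up for illustration.

```python
def build_features(tokens_a, tokens_b, vocab, max_seq_length):
    """Assemble input_ids, input_mask, and segment_ids for a sentence pair."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve three positions for [CLS], [SEP], [SEP]; trim the longer side first.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)

    # Zero-pad all three lists to the fixed sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Toy vocabulary: the character IDs are illustrative, not taken from vocab.txt.
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "好": 1, "很": 2, "不": 3, "难": 4, "受": 5, "。": 6}
input_ids, input_mask, segment_ids = build_features(
    list("好不难受。"), list("很不难受。"), vocab, max_seq_length=16)
print(input_ids)
print(input_mask)
print(segment_ids)
```

The mask marks real tokens with 1 and padding with 0, while the segment IDs mark which sentence each token belongs to. Padding positions get segment ID 0, as in the notebook code.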

#### Load the model configuration and build the model

In [11]:
bert_config = modeling.BertConfig.from_json_file(bert_config_file)

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))

    output_bias = tf.get_variable("output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        if is_training:
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)

        return (loss, per_example_loss, logits, probabilities)

def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):

  def model_fn(features, labels, mode, params):

    tf.logging.info("*** Features ***")
    for name in sorted(features.keys()):
      tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]
    is_real_example = None
    if "is_real_example" in features:
      is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
    else:
      is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    (total_loss, per_example_loss, logits, probabilities) = create_model(
        bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
        num_labels, use_one_hot_embeddings)

    tvars = tf.trainable_variables()
    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      (assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
      tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

    output_spec = None

    if mode == tf.estimator.ModeKeys.TRAIN:

      train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          train_op=train_op,
          scaffold_fn=scaffold_fn)

    elif mode == tf.estimator.ModeKeys.EVAL:

      def metric_fn(per_example_loss, label_ids, logits, is_real_example):
        predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
        accuracy = tf.metrics.accuracy(
            labels=label_ids, predictions=predictions, weights=is_real_example)
        loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
        return {
            "eval_accuracy": accuracy,
            "eval_loss": loss,
        }

      eval_metrics = (metric_fn, [per_example_loss, label_ids, logits, is_real_example])
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          eval_metrics=eval_metrics,
          scaffold_fn=scaffold_fn)

    else:
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          predictions={"probabilities": probabilities},
          scaffold_fn=scaffold_fn)
    return output_spec
  return model_fn


num_train_steps = int(len(train_examples) / train_batch_size * num_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=init_checkpoint,
    learning_rate=learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=use_tpu,
    use_one_hot_embeddings=use_tpu)
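
The loss computed in `create_model` is a softmax cross-entropy over two classes. The sketch below restates that arithmetic in plain Python, without TensorFlow, so the individual terms are easy to inspect. The logits are hypothetical values chosen for the example, and `log_softmax` and `cross_entropy` are illustrative helpers rather than functions from the case's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy(batch_logits, labels):
    """Return (mean_loss, per_example_loss) for integer class labels."""
    per_example = [-log_softmax(logits)[label]
                   for logits, label in zip(batch_logits, labels)]
    return sum(per_example) / len(per_example), per_example


# Hypothetical logits for two sentence pairs: [score_dissimilar, score_similar].
batch_logits = [[-3.0, 3.0], [2.0, -2.0]]
labels = [1, 0]
mean_loss, per_example = cross_entropy(batch_logits, labels)
probabilities = [[math.exp(v) for v in log_softmax(row)] for row in batch_logits]
print(mean_loss)
print(probabilities)
```

Selecting the log-probability at the label index is equivalent to multiplying by a one-hot label vector and summing, which is how the notebook code expresses it.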

#### Train the model

In [12]:
def file_based_input_fn_builder(input_file, seq_length, is_training, drop_remainder):

    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "label_ids": tf.FixedLenFeature([], tf.int64),
        "is_real_example": tf.FixedLenFeature([], tf.int64),
    }

    def _decode_record(record, name_to_features):
        example = tf.parse_single_example(record, name_to_features)

        for name in list(example.keys()):
            t = example[name]
            if t.dtype == tf.int64:
                t = tf.to_int32(t)
            example[name] = t
        return example

    def input_fn(params):
        batch_size = params["batch_size"]

        d = tf.data.TFRecordDataset(input_file)        
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)

        d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                drop_remainder=drop_remainder))

        return d
    return input_fn


run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=master,
    model_dir=output_dir,
    save_checkpoints_steps=save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=iterations_per_loop,
        num_shards=num_gpu_cores,
        per_host_input_for_training=is_per_host))

train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=max_seq_length,
    is_training=True,
    drop_remainder=False) 


estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=train_batch_size,
    eval_batch_size=eval_batch_size,
    predict_batch_size=predict_batch_size)

tf.logging.info("***** Running training *****")
tf.logging.info("  Num examples = %d", len(train_examples))
tf.logging.info("  Batch size = %d", train_batch_size)
tf.logging.info("  Num steps = %d", num_train_steps)

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

INFO:tensorflow:Using config: {'_model_dir': 'text_similarity/output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f42c4092240>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=1, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=N

INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*

INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKP

INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/pooler/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = output_weights:0, shape = (2, 768)
INFO:tensorflow:  name = output_bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into text_similarity/output/model.ckpt.
INFO:tensorflow:global_step/sec: 1.85518
INFO:tensorflow:examples/sec: 59.3657
INFO:tensorflow:global_step/sec: 2.08442

<tensorflow.contrib.tpu.python.tpu.tpu_estimator.TPUEstimator at 0x7f42c406bf98>
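
The number of training steps follows directly from the parameters set earlier: 9416 training examples, a batch size of 32, and 5 epochs. The sketch below repeats that arithmetic and adds a learning-rate schedule with linear warmup followed by linear decay to zero. The schedule is an assumption based on the reference BERT optimizer; the bundled `optimization.create_optimizer` may differ, and `training_steps` and `learning_rate_at` are hypothetical helper names.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Mirror the step arithmetic used before building the model function."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Linear warmup followed by linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr * max(0.0, 1.0 - step / num_train_steps)


steps, warmup = training_steps(9416, 32, 5.0, 0.1)
print(steps, warmup)
for step in (0, warmup, steps // 2, steps):
    print(step, learning_rate_at(step, 2e-5, steps, warmup))
```

The resulting 1471 steps match the `global_step` reported by the evaluation and the checkpoint name `model.ckpt-1471` restored in the online test.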

#### Read the validation set

In [13]:
def get_dev_examples(data_dir):
    return create_examples(read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

eval_examples = get_dev_examples(data_dir)

eval_file = os.path.join(output_dir, "eval.tf_record")

file_based_convert_examples_to_features(eval_examples, label_list, max_seq_length, tokenizer, eval_file)

INFO:tensorflow:Writing example 0 of 2000
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: dev-1
INFO:tensorflow:tokens: [CLS] 他 花 光 了 钱 。 [SEP] 崇 祯 十 四 年 （ 164 ##1 年 ） ， 李 自 成 数 次 围 攻 开 封 ， 丁 启 睿 督 催 左 良 玉 、 虎 大 威 、 杨 德 政 、 方 国 安 、 傅 宗 龙 等 人 率 兵 解 围 。 [SEP]
INFO:tensorflow:input_ids: 101 800 5709 1045 749 7178 511 102 2300 4875 1282 1724 2399 8020 10048 8148 2399 8021 8024 3330 5632 2768 3144 3613 1741 3122 2458 2196 8024 672 1423 4729 4719 998 2340 5679 4373 510 5988 1920 2014 510 3342 2548 3124 510 3175 1744 2128 510 987 2134 7987 5023 782 4372 1070 6237 1741 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


line 0 : ['Quality', '#1 ID', '#2 ID', '#1 String', '#2 String']
line 1 : ['0', '662', '3307482', '他花光了钱。', '崇祯十四年（1641年），李自成数次围攻开封，丁启睿督催左良玉、虎大威、杨德政、方国安、傅宗龙等人率兵解围。']
line 2 : ['0', '662', '3306833', '他花光了钱。', '这种抗生素的临床实验开始于1960年代，并成功的治疗急性白血病和淋巴瘤。']
line 3 : ['1', '818', '2516', '有我杨某在，你就别想翻天！', '只要有我杨某在，你就别想翻天！']
line 4 : ['0', '559', '2791731', '余德利抬头发现李东宝的目光很慌。', '后母回家后，发现叶限抱树而睡，便没有追究。']
line 5 : ['0', '828', '4138697', '我不常逛街，因为我老没时间。', '齐格飞（Siegfried，齐格鲁德的德语写法，为同一人）在杀掉法夫纳时就全身浴血，但因为有一片树叶黏在背后，所以造成有一小块皮肤没有沾到血，而成为他唯一的弱点。']
line 6 : ['0', '78', '387909', '嘴干死了。', '此文先后对诗，赋，碑，诔，铭，箴，颂，论，奏，说十种进行分析。']
line 7 : ['1', '881', '2386', '塑料不腐烂分解是一大长处，因为当塑料垃圾被深埋时，他永远不会变成任何有毒的化学物质污染人类生存的环境，而且即便被焚烧，大部分塑料也不会释放出有毒气体。', '塑料垃圾被深埋时，他永远不会变成任何有毒的化学物质污染人类生存的环境，而且即便被焚烧，大部分塑料也不会释放出有毒气体，故塑料不腐烂分解是一大长处。']
line 8 : ['0', '421', '2101313', '静静地坐着。', '这个发现令许多人想进一步了解海马区在记忆及学习机制的作用，因而成为一种流行，无论在神经解剖学、生理学、行为学等等各种不同领域，都对海马区做了相当丰富的研究。']
line 9 : ['0', '846', '4228944', '阳春四月，平原地区的桃花早就凋谢了，可是这里却仍然是一片绯红，桃花含苞欲放，艳丽多姿。', '在大陆地

#### Evaluate the model on the validation set

In [14]:
num_actual_eval_examples = len(eval_examples)

tf.logging.info("***** Running evaluation *****")
tf.logging.info("  Num examples = %d (%d actual, %d padding)",
                len(eval_examples), num_actual_eval_examples,
                len(eval_examples) - num_actual_eval_examples)
tf.logging.info("  Batch size = %d", eval_batch_size)


eval_input_fn = file_based_input_fn_builder(
    input_file=eval_file,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

result = estimator.evaluate(input_fn=eval_input_fn, steps=None)

print("\n打印文本相似度评估指标")
for key in result:
    print(key+' : '+str(result[key]))


INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:  Num examples = 2000 (2000 actual, 0 padding)
INFO:tensorflow:  Batch size = 8
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running eval on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = is_real_example, shape = (?,)
INFO:tensorflow:  name = label_ids, shape = (?,)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (21128, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*

INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FRO

INFO:tensorflow:  name = bert/encoder/layer_8/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*


打印文本相似度评估指标
eval_accuracy : 0.9895
eval_loss : 0.037643015
loss : 0.037643015
global_step : 1471
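
The two evaluation metrics are weighted averages in which the `is_real_example` flag acts as the weight, so padding rows do not count. Below is a minimal plain-Python sketch of that computation. The logits, labels, and losses are made-up values, and `weighted_accuracy` and `weighted_mean` are illustrative helpers rather than the `tf.metrics` functions used in the notebook.

```python
def weighted_accuracy(batch_logits, labels, weights):
    """Accuracy of argmax predictions, counting each row by its weight."""
    correct = 0.0
    for logits, label, weight in zip(batch_logits, labels, weights):
        prediction = max(range(len(logits)), key=lambda i: logits[i])
        correct += weight * (prediction == label)
    return correct / sum(weights)


def weighted_mean(values, weights):
    """Weighted mean, used here for the per-example loss."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up batch of four rows; the last row is padding (weight 0).
batch_logits = [[0.2, 1.5], [2.0, -1.0], [0.0, 0.3], [0.0, 0.0]]
labels = [1, 0, 0, 0]
losses = [0.24, 0.05, 0.85, 0.69]
is_real_example = [1.0, 1.0, 1.0, 0.0]

accuracy = weighted_accuracy(batch_logits, labels, is_real_example)
mean_loss = weighted_mean(losses, is_real_example)
print(accuracy, mean_loss)
```

For scale, the reported `eval_accuracy` of 0.9895 on 2000 validation pairs corresponds to 1979 correctly classified pairs.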


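### Online test

Use the model trained above for an interactive test: enter any two sentences to analyze their similarity.

If either sentence is left empty, the interactive session ends.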

In [15]:
from text_similarity.bert import similarity
sim = similarity.BertSim()

print("在线测试\n")
sim.set_mode(tf.estimator.ModeKeys.PREDICT)
predict = 1
while predict is not None:
    sentence1 = input('\n输入句子1: ')
    sentence2 = input('\n输入句子2: ')
    predict = sim.predict(sentence1, sentence2)
    if predict is not None:
        print('\n相似度是：{}'.format(predict[0][1]))

INFO:tensorflow:Using config: {'_model_dir': './text_similarity/output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.9
  allow_growth: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f42bbb764e0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


在线测试



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./text_similarity/output/model.ckpt-1471
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.



输入句子1: 我曾经帮这位教授整理过稿子。

输入句子2: 这位教授的稿子我帮着整理过。


INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test-0
INFO:tensorflow:tokens: [CLS] 我 曾 经 帮 这 位 教 授 整 理 过 稿 子 。 [SEP] 这 位 教 授 的 稿 子 我 帮 着 整 理 过 。 [SEP]
INFO:tensorflow:input_ids: 101 2769 3295 5307 2376 6821 855 3136 2956 3146 4415 6814 4943 2094 511 102 6821 855 3136 2956 4638 4943 2094 2769 2376 4708 3146 4415 6814 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)



相似度是：0.9986897110939026

输入句子1: 我习惯喝咖啡不放糖。

输入句子2: 他边打工挣学费边上学。


INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test-0
INFO:tensorflow:tokens: [CLS] 我 习 惯 喝 咖 啡 不 放 糖 。 [SEP] 他 边 打 工 挣 学 费 边 上 学 。 [SEP]
INFO:tensorflow:input_ids: 101 2769 739 2679 1600 1476 1565 679 3123 5131 511 102 800 6804 2802 2339 2914 2110 6589 6804 677 2110 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)



相似度是：0.0003190733550582081

输入句子1: 

输入句子2: 

再见
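
The value printed as the similarity is `predict[0][1]`, that is, the probability the model assigns to label 1 (similar) for the first and only input pair. If a hard decision is needed, the score can be compared against a threshold. The sketch below assumes a threshold of 0.5, which is not specified anywhere in this case, and `similarity_label` is a hypothetical helper.

```python
def similarity_label(probabilities, threshold=0.5):
    """Map [p_dissimilar, p_similar] to (score, label) using a threshold."""
    score = probabilities[1]
    return score, int(score >= threshold)


# Scores copied from the two interactive runs above; the complementary
# probability is filled in so that each row sums to 1.
runs = [
    [1.0 - 0.9986897110939026, 0.9986897110939026],
    [1.0 - 0.0003190733550582081, 0.0003190733550582081],
]
results = [similarity_label(row) for row in runs]
for score, label in results:
    print(score, label)
```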
