GitHub - jojolin/albert_zh: Fine-tuning albert_zh on a vertical-domain LAC task

albert lac

This project fine-tunes the pretrained albert_zh model for LAC tagging, exports the result as a SavedModel, and serves it with tf.serving (docker).
See also README.md

Main code added

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):

  # ......

    # Flatten to [batch * seq_len, hidden_size] so one matmul scores every token.
    output_layer = tf.reshape(output_layer, [-1, hidden_size])
    logits = tf.matmul(output_layer, output_weight, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    # Restore the per-token layout: [batch, max_seq_length, num_labels].
    logits = tf.reshape(logits, [-1, FLAGS.max_seq_length, num_labels])

    # Token-level cross-entropy against the one-hot gold labels.
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    # Summed (not averaged) over every token position in the batch.
    loss = tf.reduce_sum(per_example_loss)
    probabilities = tf.nn.softmax(logits, axis=-1)
    return (loss, per_example_loss, logits, probabilities)


def main(_):

  # ......

  if FLAGS.do_export:    # export model for serving

    def serving_input_receiver_fn():
      input_ids  = tf.placeholder(tf.int32, 
          shape=[None, FLAGS.max_seq_length], name="input_ids")
      input_mask = tf.placeholder(tf.int32, 
          shape=[None, FLAGS.max_seq_length], name="input_mask")
      segment_ids = tf.placeholder(tf.int32, 
          shape=[None, FLAGS.max_seq_length], name="segment_ids")
      label_ids = tf.placeholder(tf.int32, 
          shape=[None, FLAGS.max_seq_length], name="label_ids")
      features = {
        "input_ids": input_ids,
        "input_mask": input_mask,
        "segment_ids": segment_ids,
        "label_ids": label_ids
      }
      receiver_tensors = {
        "input_ids": input_ids,
        "input_mask": input_mask,
        "segment_ids": segment_ids,
        "label_ids": label_ids
      }
      return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)


def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,

    # ......

      predictions={"probabilities": probabilities} 
      output = {'serving_default': tf.estimator.export.PredictOutput(predictions)}
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          predictions=predictions,
          scaffold_fn=scaffold_fn,
          export_outputs=output)
    return output_spec
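To check the loss in create_model by hand, the following is a minimal pure-Python sketch of the same arithmetic for a single sequence: a log-softmax over the label dimension, a one-hot selection per token, then a sum over all tokens. The toy logits and label ids are made up for illustration, and the helper names are hypothetical rather than taken from the repository.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one token's logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def token_losses(logits, labels):
    """Cross-entropy per token: -log p(gold label) at each position.

    logits: [seq_len][num_labels] floats; labels: [seq_len] ints.
    Same result as -sum(one_hot(labels) * log_probs, axis=-1).
    """
    return [-log_softmax(row)[y] for row, y in zip(logits, labels)]


# Toy example (hypothetical values): 3 tokens, 4 labels.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 3.0, 0.0]]
labels = [0, 2, 2]

per_token = token_losses(logits, labels)
loss = sum(per_token)  # mirrors tf.reduce_sum(per_example_loss)
print(per_token, loss)
```

Because the loss is a sum rather than a mean, its magnitude grows with batch size and sequence length, which is worth keeping in mind when reading the raw loss values reported later.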

Train & Test

Download the pretrained model; see albert_tiny
Data preparation
- File structure

├── dev.tsv
├── label_class.txt
├── test.tsv
└── train.tsv

*.tsv
- Here ^B stands for \x02; the sentence and its labels are separated by \t
- The LAC labels reuse those of Baidu PaddleNLP

text_a	label
苦^B瓜^B味^B苦^B，^B性^B寒^B，^B归^B心^B、^B肺^B、^B脾^B、^B胃^B经^B，^B具^B有^B消^B暑^B清^B热^B，^B解^B毒^B健^B胃^B，^B除^B邪^B热^B，^B聪^B耳^B明^B目^B，^B润^B泽^B肌^B肤^B，^B强^B身^B，^B使^B人^B精^B力^B旺^B盛^B，^B不^B易^B衰^B老^B的^B功^B效^B，^B还^B有^B降^B血^B糖^B、^B抗^B肿^B瘤^B、^B抗^B病^B毒^B、^B抗^B菌^B、^B促^B进^B免^B疫^B力^B等^B作^B用^B。  n-B^Bn-I^Ba-B^Ba-I^Bw-B^Bn-B^Ba-B^Bw-B^Bv-B^Bv-I^Bw-B^Bn-B^Bw-B^Bn-B^Bw-B^Bn-B^Bn-I^Bw-B^Bv-B^Bv-I^Bv-B^Bv-I^Bv-B^Bv-I^Bw-B^Bv-B^Bv-I^Bv-B^Bv-I^Bw-B^Bp-B^Ban-B^Ban-I^Bw-B^Bv-B^Bv-I^Bv-I^Bv-I^Bw-B^Bv-B^Bv-I^Bn-B^Bn-I^Bw-B^Bv-B^Bv-I^Bw-B^Bn-B^Bn-I^Bn-B^Bn-I^Ba-B^Ba-I^Bw-B^Bad-B^Bad-I^Bv-B^Bv-I^Bu-B^Bn-B^Bn-I^Bw-B^Bv-B^Bv-I^Bvn-B^Bvn-I^Bvn-I^Bw-B^Bv-B^Bn-B^Bn-I^Bw-B^Bnz-B^Bnz-I^Bnz-I^Bw-B^Bvn-B^Bvn-I^Bw-B^Bv-B^Bv-I^Bn-B^Bn-I^Bn-I^Bu-B^Bn-B^Bn-I^Bw-B
黄^B豆^B芽^B富^B含^B蛋^B白^B质^B、^B维^B生^B素^B、^B粗^B纤^B维^B、^B胡^B萝^B卜^B素^B、^B钙^B、^B磷^B、^B铁^B等^B营^B养^B元^B素^B。^B其^B所^B含^B的^B维^B生^B素^Bc^B能^B营^B养^B毛^B发^B，^B使^B头^B发^B保^B持^B乌^B黑^B光^B亮^B，^B对^B面^B部^B雀^B斑^B有^B较^B好^B的^B淡^B化^B作^B用^B。^B其^B所^B含^B的^B维^B生^B素^Be^B能^B保^B护^B皮^B肤^B和^B毛^B细^B血^B管^B，^B防^B止^B动^B脉^B硬^B化^B，^B防^B治^B老^B年^B高^B血^B压^B。  nz-B^Bnz-I^Bnz-I^Bv-B^Bv-I^Bnz-B^Bnz-I^Bnz-I^Bw-B^Bn-B^Bn-I^Bn-I^Bw-B^Ba-B^Bn-B^Bn-I^Bw-B^Bnz-B^Bnz-I^Bnz-I^Bnz-I^Bw-B^Bn-B^Bw-B^Bn-B^Bw-B^Bn-B^Bu-B^Bn-B^Bn-I^Bn-B^Bn-I^Bw-B^Br-B^Bu-B^Bv-B^Bu-B^Bnz-B^Bnz-I^Bnz-I^Bnz-I^Bv-B^Bn-B^Bn-I^Bn-B^Bn-I^Bw-B^Bv-B^Bn-B^Bn-I^Bv-B^Bv-I^Ba-B^Ba-I^Ba-B^Ba-I^Bw-B^Bp-B^Bn-B^Bn-I^Bn-B^Bn-I^Bv-B^Ba-B^Ba-I^Bu-B^Bvn-B^Bvn-I^Bn-B^Bn-I^Bw-B^Br-B^Bu-B^Bv-B^Bu-B^Bn-B^Bn-I^Bn-I^Bxc-B^Bv-B^Bv-B^Bv-I^Bn-B^Bn-I^Bc-B^Bn-B^Bn-I^Bn-I^Bn-I^Bw-B^Bv-B^Bv-I^Bvn-B^Bvn-I^Bvn-I^Bvn-I^Bw-B^Bv-B^Bv-I^Bn-B^Bn-I^Bn-B^Bn-I^Bn-I^Bw-B
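The row format described above (text and labels separated by a tab, individual characters and labels separated by \x02) can be read with a few lines of Python. This is a sketch under that stated format only; parse_line is a hypothetical helper, not a function from the repository, and the header row (text_a, label) would need to be skipped by the caller.

```python
SEP = "\x02"  # the ^B control character placed between characters and between labels


def parse_line(line):
    """Split one data row into (chars, labels) and check they line up."""
    text, label = line.rstrip("\n").split("\t")
    chars = text.split(SEP)
    labels = label.split(SEP)
    if len(chars) != len(labels):
        raise ValueError(
            "got %d characters but %d labels" % (len(chars), len(labels)))
    return chars, labels


# Build a short row in the same format as the samples above.
row = SEP.join("苦瓜味苦") + "\t" + SEP.join(["n-B", "n-I", "a-B", "a-I"])
chars, labels = parse_line(row)
print(list(zip(chars, labels)))
```

Raising on a length mismatch is a deliberate choice here: a row whose characters and labels do not align would otherwise silently shift every tag that follows.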

label_class.txt

a-B
a-I
ad-B
ad-I
an-B
an-I
c-B
c-I
d-B
...

Run fine-tuning: sh run_lac_remy.sh train
Run prediction: sh run_lac_remy.sh test

Serving

Export the model: sh run_lac_remy.sh export
- Inspect the output directory: cd export_serving_remy_lac

export_serving_remy_lac/
├── 1574843457
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
├── 1574903339
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
└── models.config

Here models.config is a file added by hand, and 1574843457 and 1574903339 are the version numbers of the exported models.

model_config_list {
  config {
    name: 'albert_remy_lac'
    base_path: '/models/albert_remy_lac'
    model_platform: 'tensorflow'
    model_version_policy {
        specific {
            versions: 1574843457
            versions: 1574903339
        }
    }
  }
}
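Since models.config is written by hand, the version list can drift from what is actually on disk. As one possible way to keep them in sync, the sketch below scans an export directory for numeric version folders and renders a config in the same shape as the one above. render_models_config is a hypothetical helper, and the demo runs against a temporary directory rather than the real export folder.

```python
import os
import tempfile


def render_models_config(export_dir, name, base_path):
    """Build models.config text listing every numeric version directory."""
    versions = sorted(
        int(d) for d in os.listdir(export_dir)
        if d.isdigit() and os.path.isdir(os.path.join(export_dir, d)))
    lines = [
        "model_config_list {",
        "  config {",
        "    name: '%s'" % name,
        "    base_path: '%s'" % base_path,
        "    model_platform: 'tensorflow'",
        "    model_version_policy {",
        "        specific {",
    ]
    lines += ["            versions: %d" % v for v in versions]
    lines += ["        }", "    }", "  }", "}"]
    return "\n".join(lines) + "\n"


# Demo on a throwaway directory that mimics the export layout.
with tempfile.TemporaryDirectory() as tmp:
    for v in ("1574843457", "1574903339"):
        os.makedirs(os.path.join(tmp, v))
    # A non-numeric entry such as models.config itself is ignored.
    open(os.path.join(tmp, "models.config"), "w").close()
    config_text = render_models_config(
        tmp, "albert_remy_lac", "/models/albert_remy_lac")
print(config_text)
```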

Serve the model with tf.serving in docker
- cd albert_lac_ipynbs
- Create a symlink: ln -s ../export_serving_remy_lac albert_remy_lac
- Run tf.serving without GPU: sh start-tf1.14.0-rc.sh albert_remy_lac
- Run tf.serving with GPU: sh start-tfgpu1.14.0.sh albert_remy_lac

albert_lac_ipynbs/
├── label_class.txt
├── main.ipynb
├── main.py
├── merge_lac.py
├── start-tf1.14.0-rc.sh
├── start-tfgpu1.14.0.sh
└── tokenization.py

Client

Inspect the model signature: saved_model_cli show --dir "$dir1" --tag_set serve --signature_def serving_default
- $dir1: 'export_serving_remy_lac/1574843457'

The given SavedModel SignatureDef contains the following input(s):
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: input_ids:0
  inputs['input_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: input_mask:0
  inputs['label_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: label_ids:0
  inputs['segment_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: segment_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['probabilities'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 128, 59)
      name: loss/Softmax:0
Method name is: tensorflow/serving/predict
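The signature above requires four int32 inputs, each of shape (-1, 128), including label_ids even at prediction time. A minimal sketch of padding one tokenized sentence to that shape follows. It assumes the token ids already come from the repository's tokenizer (tokenization.py) with any special tokens added, that padding uses id 0, and that all-zero segment_ids and label_ids are acceptable dummies; build_features and the example ids are hypothetical.

```python
MAX_SEQ_LENGTH = 128  # matches shape (-1, 128) in the signature


def build_features(token_ids, max_len=MAX_SEQ_LENGTH):
    """Truncate or pad token ids and build the four inputs the signature lists."""
    ids = list(token_ids)[:max_len]
    pad = max_len - len(ids)
    return {
        "input_ids": ids + [0] * pad,
        "input_mask": [1] * len(ids) + [0] * pad,  # 1 = real token, 0 = padding
        "segment_ids": [0] * max_len,              # single-sentence input
        "label_ids": [0] * max_len,                # dummy values, assumed unused for prediction
    }


features = build_features([101, 2345, 3456, 102])  # hypothetical vocabulary ids
print({k: v[:6] for k, v in features.items()})
```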

Build the REST API request. See albert_lac_ipynbs/main.ipynb or albert_lac_ipynbs/main.py

import json

import numpy as np
import requests

# Assumed to match the model name registered in models.config.
MODEL_NAME = "albert_remy_lac"


def predict(input_ids, input_mask, segment_ids, label_ids, hostport_http, version="1"):
    """POST one instance to the TF Serving REST API; return the argmax label id per token."""
    data = json.dumps({
        "signature_name": "serving_default",
        "instances": [{
            "input_ids": input_ids,
            "input_mask": input_mask,
            "label_ids": label_ids,
            "segment_ids": segment_ids,
        }],
    })
    headers = {"content-type": "application/json"}
    # version must be one of the exported versions, e.g. 1574843457.
    url = 'http://{}/v1/models/{}/versions/{}:predict'.format(hostport_http, MODEL_NAME, version)
    json_response = requests.post(url, data=data, headers=headers)
    # predictions has shape [batch, max_seq_length, num_labels].
    predictions = json.loads(json_response.text)['predictions']
    return [np.argmax(x) for pred in predictions for x in pred]
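The predict function returns one label id for every position, padding included. A small sketch of turning the raw 'predictions' field into label strings while dropping padded positions is shown here. It assumes label ids follow the line order of label_class.txt, which the source does not state; decode and the toy data are hypothetical.

```python
def decode(predictions, input_masks, id2label):
    """Map probabilities to label strings for the non-padding positions.

    predictions: [batch][seq_len][num_labels] floats, as in the JSON response.
    input_masks: [batch][seq_len] ints, 1 for real tokens and 0 for padding.
    """
    results = []
    for probs, mask in zip(predictions, input_masks):
        labels = []
        for row, keep in zip(probs, mask):
            if keep:
                best = max(range(len(row)), key=row.__getitem__)  # argmax
                labels.append(id2label[best])
        results.append(labels)
    return results


id2label = ["a-B", "a-I", "n-B", "n-I"]  # toy subset of label_class.txt
fake_predictions = [[[0.1, 0.1, 0.7, 0.1],
                     [0.1, 0.1, 0.1, 0.7],
                     [0.25, 0.25, 0.25, 0.25]]]  # last row is padding
decoded = decode(fake_predictions, [[1, 1, 0]], id2label)
print(decoded)  # [['n-B', 'n-I']]
```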

Model prediction results

tiny:

['查/v', '一下/m', '餐/nz', '包/n', '有/v', '哪些/r', '做法/n']
['瘦肉末/nz', '或者/c', '猪肉丸/nz', '能/v', '做/v', '什么/r']
['猪肉/n', '丁/nz', '可以/v', '和/p', '什么/r', '一起/d', '煮汤/vn']
['看看/v', '蒜/n', '香/a', '排骨/v', '的/u', '做法/n']
['如何/d', '调理/v', '不/d', '正常/a', '的/u', '日常/a', '膳食/n']
['能/v', '介绍/v', '些/q', '养颜/vn', '美容/vn', '的/u', '食物/n', '吗/xc']
['健康/a', '饮食：/n', '怎样/d', '吃/v', '海鲜/n', '更/d', '安全/a']
['吃/v', '大闸蟹/nz', '应该/v', '注意/v', '什么/r']

base:
['查一/nz', '下/f', '餐包/n', '有/v', '哪些/r', '做法/n']
['瘦肉末/nz', '或者/c', '猪肉丸/nz', '能/v', '做/v', '什么/r']
['猪肉丁/nz', '可以/v', '和/p', '什么/r', '一起/d', '煮汤/v']
['看看/v', '蒜香排骨/nz', '的/u', '做法/n']
['如何/c', '调理/v', '不正常/a', '的/u', '日常/a', '膳食/n']
['能/v', '介绍/v', '些/q', '养颜/vn', '美容/vn', '的/u', '食物/n', '吗/xc']
['健康/a', '饮食/n', '：/w', '怎样/d', '吃/v', '海鲜/n', '更/a', '安全/a']
['吃/v', '大闸蟹/nz', '应该/v', '注意/v', '什么/r']
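The word/tag lists above are the character-level -B/-I labels merged back into words. The repository ships merge_lac.py for this; the sketch below is not that file but an illustrative guess at the merging rule: a -B label starts a new word, and an -I label extends the previous word when the tag matches.

```python
def merge_tags(chars, labels):
    """Group characters into 'word/tag' items using the -B / -I suffixes."""
    words = []
    for ch, label in zip(chars, labels):
        tag, _, pos = label.rpartition("-")
        if pos == "I" and words and words[-1][1] == tag:
            words[-1][0] += ch       # continue the current word
        else:
            words.append([ch, tag])  # start a new word (also for a stray -I)
    return ["%s/%s" % (w, t) for w, t in words]


chars = list("吃大闸蟹")
labels = ["v-B", "nz-B", "nz-I", "nz-I"]
print(merge_tags(chars, labels))  # ['吃/v', '大闸蟹/nz']
```

Treating an -I label with no matching predecessor as the start of a new word is one of several reasonable choices; the actual script may handle that case differently.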

Prediction performance
- gpu: GeForce RTX 2080-ti
- tiny: single thread, 200 requests, ~5.6s/~4.5s (no gpu/gpu)
- base: single thread, 200 requests, ~20s/~6.2s (no gpu/gpu)
- gpu: GeForce RTX 2080-ti
- base: single thread, 1000 requests, ~25s/~12.5s (gpu, http/grpc)
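The source does not show how these timings were taken. A single-threaded measurement of this kind could be sketched as follows, with benchmark as a hypothetical helper and a no-op stand-in where the real client call (for example predict) would go.

```python
import time


def benchmark(send_request, n_requests=200):
    """Call send_request() n_requests times in one thread; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_requests):
        send_request()
    return time.perf_counter() - start


# Stand-in for a real request; replace the lambda with the actual client call.
calls = []
elapsed = benchmark(lambda: calls.append(1), n_requests=200)
print("200 requests took %.4fs" % elapsed)
```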
Accuracy

Data size

    2001 dev.tsv
    1190 test.tsv
   10000 train.tsv

tiny:
Eval results albert_tiny_remy_lac_checkpoints/model.ckpt-18748 
eval_accuracy = 0.95755637
eval_loss = 0.32848656
global_step = 18748
loss = 335.19763

base:
Eval results albert_base_remy_lac_checkpoints/model.ckpt-18748
eval_accuracy = 0.96240944
eval_loss = 0.41811892
global_step = 18748
loss = 426.6612

large (batch_size=16):
Eval results albert_large_remy_lac_checkpoints/model.ckpt-37496
eval_accuracy = 0.96251875
eval_loss = 0.45184305
global_step = 37496
loss = 461.07434
