

Fine-tune BERT for Chinese

Like computer vision, NLP has been shown to benefit from fine-tuning, which improves performance on vertical (domain-specific) tasks. BERT itself is extremely demanding on compute resources, and training it from scratch is unthinkable for most developers. The motivation for fine-tuning is to fit a domain-specific corpus better while saving resources and avoiding training from scratch.

The BERT documentation already describes fine-tuning in fair detail, but for engineers unfamiliar with the official benchmark datasets it can still be hard to get started. With the open-sourcing of Bert as service, using the byproduct of BERT classification or reading comprehension, namely the learned embedding space, has become an even more practical direction.

This document therefore uses a single example to walk through the process of fine-tuning on a domain-specific corpus and obtaining the fine-tuned model. For BERT internals or Bert as service, please refer to the official documentation.

Requirements

python==3.6
tensorflow>=1.11.0
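
A quick optional check (not part of the original setup) that the installed TensorFlow meets this requirement:

import tensorflow as tf
print(tf.__version__)  # the scripts here use the TF 1.x API; expect 1.11.0 or later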

Pre-trained model

  • Download BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Data preparation

  • train.tsv: training set
  • dev.tsv: validation (dev) set

Data format

The first column is the label and the second column is the text, separated by a tab. Since the model processes Chinese at the character level, no word segmentation is needed.

fashion	衬衫和它一起穿,让你减龄十岁!越活越年轻!太美了!...
houseliving	95㎡简约美式小三居,过精美别致、悠然自得的小日子! 屋主的客...
game	赛季末用他们两天上一段,7.20最强LOL上分英雄推荐! 各位小伙...

Sample data location: data

The exact data format depends on your use case; you can also adapt the data-loading code later to match your own format.
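
As a purely illustrative sketch (the write_tsv helper and the two sample records are hypothetical, drawn from the example rows above), data from your own source could be dumped into this tab-separated layout roughly as follows:

import os

def write_tsv(records, path):
    """Write (label, text) pairs as tab-separated lines, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in records:
            # Keep each example on a single line: collapse tabs/newlines inside the text.
            f.write("%s\t%s\n" % (label, " ".join(text.split())))

samples = [
    ("fashion", "衬衫和它一起穿,让你减龄十岁!越活越年轻!太美了!"),
    ("game", "赛季末用他们两天上一段,7.20最强LOL上分英雄推荐!"),
]
os.makedirs("data", exist_ok=True)
write_tsv(samples, os.path.join("data", "train.tsv"))
write_tsv(samples, os.path.join("data", "dev.tsv"))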

Steps

git clone https://github.com/google-research/bert.git
cd bert

BERT fine-tuning has two main application scenarios: classification and reading comprehension. Since labeled samples for classification are easier to obtain, the following uses classification as the fine-tuning example:

Modify run_classifier.py

Define a custom DataProcessor

class DemoProcessor(DataProcessor):
    """Processor for the Demo data set."""

    def __init__(self):
        self.labels = set()

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        # return list(self.labels)
        return ["fashion", "houseliving", "game"]  # customize to match your own labels

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev, and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(line[1])  # second column: text
            label = tokenization.convert_to_unicode(line[0])   # first column: label
            self.labels.add(label)
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
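
As a quick, hypothetical sanity check, once DemoProcessor has been added to run_classifier.py you can load the sample examples directly (run from inside the cloned bert/ directory, with the demo data under ./data):

from run_classifier import DemoProcessor

processor = DemoProcessor()
train_examples = processor.get_train_examples("data")  # reads data/train.tsv
print(len(train_examples), "training examples")
print(train_examples[0].label, train_examples[0].text_a[:20])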

Add DemoProcessor to processors

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "demo": DemoProcessor,
  }

Start training

export BERT_Chinese_DIR=/path/to/bert/chinese_L-12_H-768_A-12
export Demo_DIR=/path/to/DemoData

python run_classifier.py \
  --task_name=demo \
  --do_train=true \
  --do_eval=true \
  --data_dir=$Demo_DIR \
  --vocab_file=$BERT_Chinese_DIR/vocab.txt \
  --bert_config_file=$BERT_Chinese_DIR/bert_config.json \
  --init_checkpoint=$BERT_Chinese_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/Demo_output/

If everything goes well, you will see output like the following:

***** Eval results *****
  eval_accuracy = xx
  eval_loss = xx
  global_step = xx
  loss = xx

Finally, the fine-tuned model is saved in the directory specified by output_dir.
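
As a minimal sketch (assuming the /tmp/Demo_output/ path used above), you can confirm that a checkpoint was actually written using standard TensorFlow utilities; the resulting checkpoint can then be passed as --init_checkpoint for further fine-tuning or prediction runs:

import tensorflow as tf

ckpt = tf.train.latest_checkpoint("/tmp/Demo_output/")
print("latest checkpoint:", ckpt)
# List a few saved variables to verify the fine-tuned weights are present.
for name, shape in tf.train.list_variables(ckpt)[:5]:
    print(name, shape)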

Summary

Fine-tuning a pre-trained BERT model is very efficient: it saves time while improving the model's performance on domain-specific corpora. The fine-tuning process itself is not difficult; the harder parts are data preparation and pipeline design. From a business perspective, the emphasis should be on demonstrating the model's effectiveness after fine-tuning and on how it will be applied in the actual business scenario. If the evaluation metrics and the business use case are already sorted out, it is well worth a try.

References
