<a href="https://colab.research.google.com/github/hellozhaojian/transformers/blob/master/bert_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

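**Introduction**

This notebook shows how to fine-tune the BERT base model on a TPU.
Prerequisites:
1. A Google Cloud project.
   In this tutorial the project is named pre-train-bert-sogou.
2. The following data, prepared in advance:
   1. The base model: [BERT Chinese base model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)
   2. Text data. Format: one sentence per line; no blank lines between sentences of the same document; one blank line between documents.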
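The text format described above can be produced with a few lines of Python. This is a minimal sketch: the function name `to_pretraining_text` and the sample sentences are hypothetical, and it assumes each document is already split into sentences.

```python
def to_pretraining_text(documents):
    """Join documents (lists of sentences) into the expected input format.

    One sentence per line, no blank line inside a document,
    exactly one blank line between documents.
    """
    blocks = ["\n".join(sentences) for sentences in documents if sentences]
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample: two documents, already sentence-split.
docs = [
    ["Doc one, sentence one.", "Doc one, sentence two."],
    ["Doc two, sentence one."],
]
text = to_pretraining_text(docs)
print(text)
```

Empty documents are skipped so that the output never contains two consecutive blank lines.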


The main steps are:

1. Upload the base model and the data to the Google Cloud project.

2. Convert the data to TFRecord format.

3. Train the model.
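For step 1, the base-model files need to end up under the bucket path that the rest of the notebook reads from. The sketch below only builds `gsutil cp` command strings and runs nothing. The helper names `gcs_destination` and `upload_commands` and the local paths are hypothetical; the bucket and folder names are the ones this notebook configures further down.

```python
import os

BUCKET_NAME = "bert-sogou-pretrain"        # bucket used later in this notebook
BASE_MODEL_DIR = "fine_tuning/base_model"  # folder holding the unzipped base model
MODEL_NAME = "chinese_L-12_H-768_A-12"


def gcs_destination(local_path, bucket=BUCKET_NAME, folder=BASE_MODEL_DIR, model=MODEL_NAME):
    """Return the gs:// URI a local base-model file would be copied to."""
    return "gs://{}/{}/{}/{}".format(bucket, folder, model, os.path.basename(local_path))


def upload_commands(local_paths):
    """Build one `gsutil cp` command per local file (nothing is executed here)."""
    return ["gsutil cp {} {}".format(path, gcs_destination(path)) for path in local_paths]


# Hypothetical local paths after unzipping the base model archive.
commands = upload_commands([
    "chinese_L-12_H-768_A-12/vocab.txt",
    "chinese_L-12_H-768_A-12/bert_config.json",
])
for command in commands:
    print(command)
```

Printing the commands first makes it easy to check the destination paths before copying anything.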



**Preparing the data, configuration, and model**


* Log in to Google Cloud

In [1]:
! gcloud auth application-default login
! gcloud auth login
from google.colab import auth


Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=Cl3EXLnI74Nt_Ycbw8KZc0d9BqsUkLg7ITc9qd-4hr0&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/tAERhDT9cnSatGDsnpAbbrlzxQF0yXVbN7gntm19gdUUIQt-0o1PBNg

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth

* Get the code

In [2]:
! git clone https://github.com/google-research/bert
! ls ./

Cloning into 'bert'...
remote: Enumerating objects: 336, done.
remote: Total 336 (delta 0), reused 0 (delta 0), pack-reused 336
Receiving objects: 100% (336/336), 283.23 KiB | 3.99 MiB/s, done.
Resolving deltas: 100% (185/185), done.
bert  sample_data


* Prepare the data and model

In [31]:
import tensorflow as tf
GOOGLE_CLOUD_PROJECT_NAME = "pre-train-bert-sogou" #@param {type: "string" }
BUCKET_NAME = "bert-sogou-pretrain"  #@param {type: "string"}
BASE_MODEL_DIR = "fine_tuning/base_model" #@param {type: "string"}
NEW_MODEL_DIR = "fine_tuning/model" #@param {type: "string"}
MODEL_NAME = "chinese_L-12_H-768_A-12" #@param {type: "string"}
#
# Name under which the fine-tuned model is saved
NEW_MODEL_NAME = "chinese_L-12_H-768_A-12_sample" #@param {type: "string"}
# Location of the input data files.
# One sentence per line; no blank lines within a document; one blank line between documents.
INPUT_DATA_DIR = "fine_tuning/data/sample" #@param {type: "string"}

PROCESSES = 4 #@param {type: "integer"}
DO_LOWER_CASE = True
MAX_SEQ_LENGTH = 128 #@param {type : "integer"}
MASKED_LM_PROB = 0.15 #@param {type: "number" }
# Maximum number of masked LM predictions per sequence
MAX_PREDICTIONS = 20 #@param {type: "integer"}

# Google's base model is stored in this directory. It contains these files:
#  bert_config.json; bert_model.ckpt.data-00000-of-00001; bert_model.ckpt.index;
#  bert_model.ckpt.meta; vocab.txt

base_model_name = "gs://{}/{}/{}".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)

INIT_CHECKPOINT = "{}/bert_model.ckpt".format(base_model_name)

BERT_GCS_DIR = "gs://{}/{}/{}_latest".format(BUCKET_NAME, NEW_MODEL_DIR, NEW_MODEL_NAME)
LOCAL_BERT_GCS_DIR = "{}/{}_latest".format(NEW_MODEL_DIR, NEW_MODEL_NAME)
VOCAB_FILE = "gs://{}/{}/{}/vocab.txt".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)
LOCAL_TF_RECORD_DIR = "{}_tfrecord".format( INPUT_DATA_DIR)
TF_RECORD_DIR = "gs://{}/{}_tfrecord".format(BUCKET_NAME, INPUT_DATA_DIR)
CONFIG_FILE = "gs://{}/{}/{}/bert_config.json".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)

! gcloud config set project $GOOGLE_CLOUD_PROJECT_NAME
print(BERT_GCS_DIR)
! gsutil ls $BERT_GCS_DIR
! mkdir -p tmp
! touch tmp/.tmp
! gsutil -m cp -r tmp/ $BERT_GCS_DIR
! gsutil -m cp -r tmp/ $TF_RECORD_DIR
! gsutil rm $BERT_GCS_DIR/.tmp
! gsutil rm $TF_RECORD_DIR/.tmp




Updated property [core/project].
gs://bert-sogou-pretrain/fine_tuning/model/chinese_L-12_H-768_A-12_sample_latest
gs://bert-sogou-pretrain/fine_tuning/model/chinese_L-12_H-768_A-12_sample_latest/.tmp
Copying file://tmp/.tmp [Content-Type=application/octet-stream]...
Copying file://tmp/zz [Content-Type=application/octet-stream]...
/ [2/2 files][    0.0 B/    0.0 B]                                              
Operation completed over 2 objects.                                              
Copying file://tmp/zz [Content-Type=application/octet-stream]...
Copying file://tmp/.tmp [Content-Type=application/octet-stream]...
/ [2/2 files][    0.0 B/    0.0 B]                                              
Operation completed over 2 objects.                                              
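The cell above sets `MASKED_LM_PROB` and `MAX_PREDICTIONS`, which together decide how many tokens are masked in each training sequence. The sketch below shows that arithmetic as I understand it from `create_pretraining_data.py` in the cloned repository; treat the formula as an assumption and check the script if exact behavior matters. The function name `num_masked_tokens` is hypothetical.

```python
def num_masked_tokens(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of tokens masked in a sequence of `num_tokens` tokens.

    At least one token is masked, and never more than max_predictions_per_seq.
    """
    return min(max_predictions_per_seq, max(1, int(round(num_tokens * masked_lm_prob))))


for length in (2, 128, 512):
    print(length, num_masked_tokens(length))
```

With the defaults used here, a full 128-token sequence stays just under the cap, so a value of 20 for `MAX_PREDICTIONS` wastes little padding. For longer sequences the cap would need to grow roughly in proportion to `MAX_SEQ_LENGTH * MASKED_LM_PROB`.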


**Prepare the TFRecord data**

In [32]:
from google.cloud import storage
from google.colab import auth, drive

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)

file_partitions = [[]]
index = 0

def list_files(bucketFolder):
    """List all files in GCP bucket."""
    files = bucket.list_blobs(prefix=bucketFolder, max_results=1000)
    fileList = [file.name for file in files ]
    return fileList


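# Base names of input files that already have a .tfrecord output.
processed_set = set()

for filename in list_files(INPUT_DATA_DIR):
    # The prefix also matches the "<INPUT_DATA_DIR>_tfrecord" folder, where
    # outputs are named "<input file name>.tfrecord".
    if filename.endswith(".tfrecord"):
        org_filename = filename.split("/")[-1][:-len(".tfrecord")]
        processed_set.add(org_filename)


for filename in list_files(INPUT_DATA_DIR):
    # Skip everything in the TFRecord folder and folder placeholder objects.
    if filename.find("tf") != -1 or filename.endswith("/"):
        continue

    # Skip inputs that were already converted.
    if filename.split("/")[-1] in processed_set:
        continue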

    if len(file_partitions[index]) == PROCESSES:
        file_partitions.append([])
        index += 1
    file_partitions[index].append("gs://{}/{}".format(BUCKET_NAME, filename))

! gsutil ls $TF_RECORD_DIR


index = 0
for partition in file_partitions:
    
    for filename in partition:
        print(filename, "----", index)
    index += 1
    XARGS_CMD = ("gsutil ls {} | "
             "awk 'BEGIN{{FS=\"/\"}}{{print $NF}}' | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=gs://{}/{}/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")


    XARGS_CMD = XARGS_CMD.format(" ".join(partition),
                             PROCESSES, '{}',  BUCKET_NAME, INPUT_DATA_DIR, '{}', 
                             TF_RECORD_DIR, '{}',
                             VOCAB_FILE, DO_LOWER_CASE, 
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

    print (XARGS_CMD)

    ! $XARGS_CMD



gs://bert-sogou-pretrain/fine_tuning/data/sample_tfrecord/.tmp
gs://bert-sogou-pretrain/fine_tuning/data/sample_tfrecord/zz
gs://bert-sogou-pretrain/fine_tuning/data/sample/sample.txt ---- 0
gsutil ls gs://bert-sogou-pretrain/fine_tuning/data/sample/sample.txt | awk 'BEGIN{FS="/"}{print $NF}' | xargs -n 1 -P 4 -I{} python3 bert/create_pretraining_data.py --input_file=gs://bert-sogou-pretrain/fine_tuning/data/sample/{} --output_file=gs://bert-sogou-pretrain/fine_tuning/data/sample_tfrecord/{}.tfrecord --vocab_file=gs://bert-sogou-pretrain/fine_tuning/base_model/chinese_L-12_H-768_A-12/vocab.txt --do_lower_case=True --max_predictions_per_seq=20 --max_seq_length=128 --masked_lm_prob=0.15 --random_seed=34 --dupe_factor=5


W1112 03:33:04.569111 140152605525888 module_wrapper.py:139] From bert/create_pretraining_data.py:437: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1112 03:33:04.569376 140152605525888 module_wrapper.py:139] F
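The conversion cell groups the input files into partitions of at most `PROCESSES` files, so that each `xargs -P` invocation handles one group. The grouping can be written as a small standalone function. This is a sketch with a hypothetical helper name `partition` and made-up file names. Unlike the cell above, it returns an empty list (not a list holding one empty partition) when there are no files.

```python
def partition(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical input files; PROCESSES = 4 as configured earlier.
files = ["gs://bucket/data/part-{}.txt".format(i) for i in range(10)]
groups = partition(files, 4)
for index, group in enumerate(groups):
    print(index, len(group))
```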

**Train the model**

* Connect to the TPU


In [33]:
import os
import logging
import tensorflow as tf

log = logging.getLogger("pre-train-bert")
auth.authenticate_user()

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  with tf.Session(TPU_ADDRESS) as session:
    print(TPU_ADDRESS)
    log.info('TPU address is ' + TPU_ADDRESS)
    tf.contrib.cloud.configure_gcs(session)
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False
print(USE_TPU)


grpc://10.119.220.250:8470
True
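The TPU check above only depends on the `COLAB_TPU_ADDR` environment variable, so it can be isolated into a function that is easy to test without a TPU. This is a minimal sketch; the helper name `tpu_address` and the sample address are hypothetical. Returning `None` when no TPU is present gives later cells a defined value to check.

```python
import os


def tpu_address(env):
    """Return the gRPC address of the Colab TPU, or None if no TPU is attached."""
    addr = env.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


# In Colab this would be called with the real environment.
address = tpu_address(os.environ)
print("Using TPU" if address else "Not connected to a TPU runtime")
```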


* Set the training parameters

In [34]:
from bert import modeling, optimization, tokenization

# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 250 #@param {type:"integer"}
NUM_TPU_CORES = 8

TMP_INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
if TMP_INIT_CHECKPOINT is not None:
    INIT_CHECKPOINT = TMP_INIT_CHECKPOINT


bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(TF_RECORD_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))

log.info("Using {} data shards".format(len(input_files)))

! gsutil ls $INIT_CHECKPOINT*



gs://bert-sogou-pretrain/fine_tuning/base_model/chinese_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
gs://bert-sogou-pretrain/fine_tuning/base_model/chinese_L-12_H-768_A-12/bert_model.ckpt.index
gs://bert-sogou-pretrain/fine_tuning/base_model/chinese_L-12_H-768_A-12/bert_model.ckpt.meta
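A quick sanity check of the training parameters set above: the global batch is split evenly across the TPU cores, and a checkpoint is written every `SAVE_CHECKPOINTS_STEPS` steps. The sketch below only does that arithmetic with the values from this notebook; the derived variable names are hypothetical.

```python
TRAIN_BATCH_SIZE = 128
NUM_TPU_CORES = 8
SAVE_CHECKPOINTS_STEPS = 250
TRAIN_STEPS = 1000000

# The global batch must divide evenly across the cores.
assert TRAIN_BATCH_SIZE % NUM_TPU_CORES == 0

per_core_batch = TRAIN_BATCH_SIZE // NUM_TPU_CORES
sequences_per_checkpoint = TRAIN_BATCH_SIZE * SAVE_CHECKPOINTS_STEPS
checkpoints_written = TRAIN_STEPS // SAVE_CHECKPOINTS_STEPS

print("per-core batch size:", per_core_batch)
print("sequences between checkpoints:", sequences_per_checkpoint)
print("checkpoints written over the full run:", checkpoints_written)
```

For a small sample corpus, `TRAIN_STEPS = 1000000` is far more than needed; the run can be interrupted and resumed because the parameter cell picks up the latest checkpoint from `BERT_GCS_DIR`.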


* Train the model

In [0]:
import sys
sys.path.append("bert")
from bert.run_pretraining import input_fn_builder, model_fn_builder
from bert import modeling, optimization, tokenization


model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
  
train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=True)

estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)
                

INFO:tensorflow:Using config: {'_model_dir': 'gs://bert-sogou-pretrain/fine_tuning/model/chinese_L-12_H-768_A-12_sample_latest', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 250, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.119.220.250:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4c592c1be0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.119.220.250:8470', '_evaluation_master': 'grpc://10.1
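The training cell passes `learning_rate`, `num_train_steps`, and `num_warmup_steps=10` to `model_fn_builder`. As far as I understand `optimization.py` in the cloned repository, these produce a linear warmup followed by a linear decay to zero. The plain-Python sketch below illustrates that schedule under this assumption; it is not the library's code, and the function name `bert_learning_rate` is hypothetical.

```python
def bert_learning_rate(step, init_lr=2e-5, num_train_steps=1000000, num_warmup_steps=10):
    """Approximate learning rate at a given global step.

    Linear warmup from 0 to init_lr over num_warmup_steps,
    then linear decay from init_lr to 0 at num_train_steps.
    """
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)


for step in (0, 5, 10, 500000, 1000000):
    print(step, bert_learning_rate(step))
```

With only 10 warmup steps out of 1,000,000, the warmup phase is almost negligible; a longer warmup may be worth trying if the loss is unstable at the start.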