<a href="https://colab.research.google.com/github/hellozhaojian/transformers/blob/master/bert_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

This notebook shows how to fine-tune the BERT base model on a TPU.

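Prerequisites:
1. A Google Cloud project.
   In this tutorial the project is named pre-train-bert-sogou.
2. The following data, prepared in advance:
   1. The base model: [BERT Chinese base model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)
   2. Text data, formatted with one sentence per line. Sentences within a document have no blank lines between them; consecutive documents are separated by one blank line.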
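
As a minimal sketch of the text format described above, the following joins a list of documents into one string with one sentence per line and a single blank line between documents. The helper name `to_pretraining_text` and the sample sentences are hypothetical and not part of the original tutorial.

```python
def to_pretraining_text(documents):
    """Lay out documents as one sentence per line, one blank line between documents.

    `documents` is a list of documents; each document is a list of sentences.
    Empty documents are skipped so that no double blank lines appear.
    """
    blocks = ["\n".join(sentences) for sentences in documents if sentences]
    return "\n\n".join(blocks) + "\n"


docs = [
    ["今天天气很好。", "我们去公园散步。"],  # document 1: two sentences
    ["BERT是一种预训练语言模型。"],          # document 2: one sentence
]
text = to_pretraining_text(docs)
print(text)
```

The resulting string can be written to a file and uploaded to the input data directory in the bucket.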


The main steps are:

1. Upload the base model and the data to the Google Cloud project.

2. Convert the data to TFRecord format.

3. Train the model.



**Preparing the data, configuration, and model**


* Log in to Google Cloud

In [0]:
! gcloud auth application-default login
! gcloud auth login
from google.colab import auth


* Get the code

In [2]:
! git clone https://github.com/google-research/bert
! ls ./

fatal: destination path 'bert' already exists and is not an empty directory.
bert  fine_tuning  sample_data	tmp


* Set up the data and model paths

In [0]:
import tensorflow as tf
GOOGLE_CLOUD_PROJECT_NAME = "pre-train-bert-sogou" #@param {type: "string" }
BUCKET_NAME = "bert-sogou-pretrain"  #@param {type: "string"}
BASE_MODEL_DIR = "fine_tuning/base_model" #@param {type: "string"}
NEW_MODEL_DIR = "fine_tuning/model" #@param {type: "string"}
MODEL_NAME = "chinese_L-12_H-768_A-12" #@param {type: "string"}
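#
# Name under which the fine-tuned model is saved
NEW_MODEL_NAME = "chinese_L-12_H-768_A-12_tutorial" #@param {type: "string"}
# Directory that holds the input text files.
# One sentence per line; no blank lines within a document; one blank line between documents.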
INPUT_DATA_DIR = "fine_tuning/data/tutorial" #@param {type: "string"}

PROCESSES = 4 #@param {type: "integer"}
DO_LOWER_CASE = True
MAX_SEQ_LENGTH = 128 #@param {type : "integer"}
MASKED_LM_PROB = 0.15 #@param {type: "number" }
# Maximum number of masked-LM predictions per sequence
MAX_PREDICTIONS = 20 #@param {type: "integer"}

# Google's base model is stored in this directory. It contains:
#  bert_config.json;  bert_model.ckpt.data-00000-of-00001;  bert_model.ckpt.index;
#  bert_model.ckpt.meta;  vocab.txt

base_model_name = "gs://{}/{}/{}".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)

INIT_CHECKPOINT = "{}/bert_model.ckpt".format(base_model_name)

BERT_GCS_DIR = "gs://{}/{}/{}_latest".format(BUCKET_NAME, NEW_MODEL_DIR, NEW_MODEL_NAME)
VOCAB_FILE = "gs://{}/{}/{}/vocab.txt".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)
TF_RECORD_DIR = "gs://{}/{}_tfrecord".format(BUCKET_NAME, INPUT_DATA_DIR)
CONFIG_FILE = "gs://{}/{}/{}/bert_config.json".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)

! gcloud config set project $GOOGLE_CLOUD_PROJECT_NAME
print(BERT_GCS_DIR)
# Trick: create the "directories" in the bucket by copying a placeholder file into them
! gsutil ls $BERT_GCS_DIR
! mkdir -p need_remove
! touch need_remove/.tmp
! gsutil -m cp -r need_remove/ $BERT_GCS_DIR
! gsutil -m cp -r need_remove/ $TF_RECORD_DIR
! gsutil rm $BERT_GCS_DIR/.tmp
! gsutil rm $TF_RECORD_DIR/.tmp
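
To make the path layout in the cell above easier to check, here is a small sketch that builds the same `gs://` locations in one place and prints them. The function name `build_gcs_paths` and the dictionary keys are hypothetical; the format strings are the ones used in the cell.

```python
def build_gcs_paths(bucket, base_model_dir, model_name,
                    new_model_dir, new_model_name, input_data_dir):
    """Return the gs:// locations used in this tutorial as a dict."""
    base_model = "gs://{}/{}/{}".format(bucket, base_model_dir, model_name)
    return {
        # Files that ship with the downloaded base model
        "init_checkpoint": base_model + "/bert_model.ckpt",
        "vocab_file": base_model + "/vocab.txt",
        "config_file": base_model + "/bert_config.json",
        # Where new checkpoints are written during training
        "model_dir": "gs://{}/{}/{}_latest".format(bucket, new_model_dir, new_model_name),
        # Where the generated TFRecord files are written
        "tf_record_dir": "gs://{}/{}_tfrecord".format(bucket, input_data_dir),
    }


paths = build_gcs_paths(
    bucket="bert-sogou-pretrain",
    base_model_dir="fine_tuning/base_model",
    model_name="chinese_L-12_H-768_A-12",
    new_model_dir="fine_tuning/model",
    new_model_name="chinese_L-12_H-768_A-12_tutorial",
    input_data_dir="fine_tuning/data/tutorial",
)
for name, value in sorted(paths.items()):
    print(name, "=", value)
```

Printing the paths before running any `gsutil` command is a cheap way to catch a mistyped bucket or directory name.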




**Preparing the TFRecord data**

In [0]:
from google.cloud import storage
from google.colab import auth, drive

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)

file_partitions = [[]]
index = 0

def list_files(bucketFolder):
    """List all files in GCP bucket."""
    files = bucket.list_blobs(prefix=bucketFolder, max_results=1000)
    fileList = [file.name for file in files ]
    return fileList


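processed_set = set()

# Collect the names of input files that already have a TFRecord output.
# Outputs are written as "<input name>.tfrecord", so strip only that suffix;
# this keeps names such as "a.txt" intact for the comparison below.
for filename in list_files(INPUT_DATA_DIR):
    if filename.endswith(".tfrecord"):
        processed_set.add(filename.split("/")[-1][:-len(".tfrecord")])


for filename in list_files(INPUT_DATA_DIR):
    # Skip TFRecord outputs and directory placeholders
    if filename.find("tf") != -1 or filename.endswith("/"):
        continue

    # Skip inputs that have already been converted
    if filename.split("/")[-1] in processed_set:
        continue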

    if len(file_partitions[index]) == PROCESSES:
        file_partitions.append([])
        index += 1
    file_partitions[index].append("gs://{}/{}".format(BUCKET_NAME, filename))

! gsutil ls $TF_RECORD_DIR


index = 0
for partition in file_partitions:
    
    for filename in partition:
        print(filename, "----", index)
    index += 1
    XARGS_CMD = ("gsutil ls {} | "
             "awk 'BEGIN{{FS=\"/\"}}{{print $NF}}' | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=gs://{}/{}/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")


    XARGS_CMD = XARGS_CMD.format(" ".join(partition),
                             PROCESSES, '{}',  BUCKET_NAME, INPUT_DATA_DIR, '{}', 
                             TF_RECORD_DIR, '{}',
                             VOCAB_FILE, DO_LOWER_CASE, 
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

    print (XARGS_CMD)

    ! $XARGS_CMD
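
The cell above groups the input files into partitions of at most `PROCESSES` files, then converts each partition in parallel. A minimal sketch of just the grouping step is shown below; the function name `partition` and the file names are hypothetical. Unlike the loop in the cell, this version returns an empty list (not a list holding one empty partition) when there are no files.

```python
def partition(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical input files, for illustration only
files = ["gs://my-bucket/data/part-{}.txt".format(i) for i in range(6)]
chunks = partition(files, 4)
for index, chunk in enumerate(chunks):
    print(index, len(chunk), chunk)
```

With six files and a partition size of 4, the first partition holds four files and the second holds the remaining two.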



**Training the model**

* Connect to the TPU


In [0]:
import os
import logging
import tensorflow as tf

log = logging.getLogger("pre-train-bert")
auth.authenticate_user()

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  with tf.Session(TPU_ADDRESS) as session:
    print(TPU_ADDRESS)
    log.info('TPU address is ' + TPU_ADDRESS)
    tf.contrib.cloud.configure_gcs(session)
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False
print(USE_TPU)
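
The detection logic in the cell above can be isolated into a small function that is easy to test without a TPU. This is a sketch only: the name `resolve_tpu` is hypothetical, and unlike the cell it treats an empty `COLAB_TPU_ADDR` value as "no TPU" and returns `None` for the address so that a missing TPU is explicit.

```python
import os


def resolve_tpu(environ):
    """Return (use_tpu, tpu_address) based on the COLAB_TPU_ADDR variable."""
    addr = environ.get("COLAB_TPU_ADDR")
    if addr:
        return True, "grpc://" + addr
    return False, None


use_tpu, tpu_address = resolve_tpu(os.environ)
print(use_tpu, tpu_address)
```

Passing the environment in as an argument lets the same function be exercised with a plain dictionary, for example `resolve_tpu({"COLAB_TPU_ADDR": "10.0.0.2:8470"})`.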


* Set the training parameters

In [0]:
from bert import modeling, optimization, tokenization

# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 250 #@param {type:"integer"}
NUM_TPU_CORES = 8

TMP_INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
if TMP_INIT_CHECKPOINT is not None:
    INIT_CHECKPOINT = TMP_INIT_CHECKPOINT


bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(TF_RECORD_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))

log.info("Using {} data shards".format(len(input_files)))

! gsutil ls $INIT_CHECKPOINT*
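
The values of `MAX_PREDICTIONS`, `MAX_SEQ_LENGTH`, and `MASKED_LM_PROB` are related: the BERT README suggests setting `max_predictions_per_seq` to around `max_seq_length * masked_lm_prob`, and the same value has to be passed to both the data-generation and the training step. The sketch below shows that arithmetic. The function name is hypothetical, and rounding up is a choice made here, not something the BERT scripts do for you.

```python
import math


def suggested_max_predictions(max_seq_length, masked_lm_prob):
    """Round max_seq_length * masked_lm_prob up to the next integer."""
    return int(math.ceil(max_seq_length * masked_lm_prob))


# 128 * 0.15 = 19.2, which rounds up to the value used in this tutorial
print(suggested_max_predictions(128, 0.15))
```

If `MAX_SEQ_LENGTH` or `MASKED_LM_PROB` is changed, `MAX_PREDICTIONS` should be revisited and the TFRecord files regenerated with the new value.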


* Train the model

In [0]:
import sys
sys.path.append("bert")
from bert.run_pretraining import input_fn_builder, model_fn_builder
from bert import modeling, optimization, tokenization


model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
  
train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=True)

estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)
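
To see what `learning_rate`, `num_train_steps`, and `num_warmup_steps` mean in practice, the sketch below approximates the schedule in pure Python. It assumes the behaviour of `optimization.py` in the BERT repository as commonly described: a linear warmup from zero over the warmup steps, then a linear decay to zero at `num_train_steps`. The function name `learning_rate_at` is hypothetical and the code is an illustration, not the library's implementation.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Approximate learning rate at a given global step."""
    if num_warmup_steps and step < num_warmup_steps:
        # Linear warmup: scale the initial rate by the fraction of warmup completed
        return init_lr * step / num_warmup_steps
    # Linear decay towards zero, clamped once training is finished
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr * (1.0 - progress)


for step in (0, 5, 10, 500000, 1000000):
    print(step, learning_rate_at(step, 2e-5, 1000000, 10))
```

With only 10 warmup steps out of 1,000,000, the warmup phase is almost negligible, so the rate is close to its peak right from the start of training.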
                

**Training on a TPU machine**

Copy the following code to the TPU machine, download the latest BERT code, and run the script with python3.


```python
"""
@file: bert_with_tpu.py
@time: 2019/11/10 7:07 AM
"""
#######
## Run "git clone https://github.com/google-research/bert" before running this script
######

import sys
import os
import tensorflow as tf
import logging
sys.path.append("bert")
from bert import modeling, optimization, tokenization
from bert.run_pretraining import input_fn_builder, model_fn_builder

USE_TPU=True
TPU_ADDRESS = "taey2113"
GOOGLE_CLOUD_PROJECT_NAME = "pre-train-bert-sogou" #@param {type: "string" }
BUCKET_NAME = "bert-sogou-pretrain"  #@param {type: "string"}
BASE_MODEL_DIR = "fine_tuning/base_model" #@param {type: "string"}
NEW_MODEL_DIR = "fine_tuning/model" #@param {type: "string"}
MODEL_NAME = "chinese_L-12_H-768_A-12" #@param {type: "string"}
## The next two variables are what distinguish one Chinese pre-training run from another
## 1. Model output directory
NEW_MODEL_NAME = "chinese_L-12_H-768_A-12_tutorial" #@param {type: "string"}
## 2. Data directory
INPUT_DATA_DIR = "fine_tuning/data/tutorial" #@param {type: "string"}

PROCESSES = 4 #@param {type: "integer"}
DO_LOWER_CASE = True
MAX_SEQ_LENGTH = 128 #@param {type : "integer"}
MASKED_LM_PROB = 0.15 #@param {type: "number" }
MAX_PREDICTIONS = 20 #@param {type: "integer"}


base_model_name = "gs://{}/{}/{}".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)

INIT_CHECKPOINT = "{}/bert_model.ckpt".format(base_model_name)
BERT_GCS_DIR = "gs://{}/{}/{}_latest".format(BUCKET_NAME, NEW_MODEL_DIR, NEW_MODEL_NAME)
VOCAB_FILE = "gs://{}/{}/{}/vocab.txt".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)
TF_RECORD_DIR = "gs://{}/{}_tfrecord".format(BUCKET_NAME, INPUT_DATA_DIR)
CONFIG_FILE = "gs://{}/{}/{}/bert_config.json".format(BUCKET_NAME, BASE_MODEL_DIR, MODEL_NAME)


# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-7
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 250 #@param {type:"integer"}
NUM_TPU_CORES = 8



#"gs://bert-sogou-pretrain/fine_tuning/base_model/chinese_L-12_H-768_A-12/bert_model.ckpt"
TMP_INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
if TMP_INIT_CHECKPOINT is not None:
    INIT_CHECKPOINT = TMP_INIT_CHECKPOINT


log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(TF_RECORD_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))

log.info("Using {} data shards".format(len(input_files)))


model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=TRAIN_STEPS,
    num_warmup_steps=10,
    use_tpu=USE_TPU,
    use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

train_input_fn = input_fn_builder(
    input_files=input_files,
    max_seq_length=MAX_SEQ_LENGTH,
    max_predictions_per_seq=MAX_PREDICTIONS,
    is_training=True)

estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

```
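
As a quick sanity check on the training settings in the script, the following sketch works out a few derived numbers. It assumes that the global batch is split evenly across the TPU cores, which is why the batch size is checked for divisibility. The function name `training_plan` and the dictionary keys are hypothetical.

```python
def training_plan(train_batch_size, num_tpu_cores, train_steps, save_checkpoints_steps):
    """Derive per-core batch size, checkpoint count, and total sequences processed."""
    if train_batch_size % num_tpu_cores != 0:
        raise ValueError("train_batch_size must be divisible by num_tpu_cores")
    return {
        "per_core_batch_size": train_batch_size // num_tpu_cores,
        "num_checkpoint_saves": train_steps // save_checkpoints_steps,
        "sequences_seen": train_batch_size * train_steps,
    }


plan = training_plan(
    train_batch_size=128,
    num_tpu_cores=8,
    train_steps=1000000,
    save_checkpoints_steps=250,
)
print(plan)
```

With the values from the script, each core processes 16 sequences per step and a checkpoint is written 4000 times over the full run, so a larger `SAVE_CHECKPOINTS_STEPS` may be worth considering for long runs.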

**Notice**

The Colab notebook needs permission to access the Google Cloud bucket. If the TPU training job cannot access files in the bucket, grant the job access in the bucket's permission settings in Google Cloud.