<a href="https://colab.research.google.com/github/kiking0501/Cantonese-Chinese-Translation/blob/master/code/v2/COLAB_Train_Cantonese_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **An Example to Train Cantonese-BERT**

- This notebook is designed to run in a Colab Jupyter environment with a GCS bucket. If you are unsure how to set this up, see <a href="https://medium.com/fenwicks/tutorial-0-setting-up-google-colab-tpu-runtime-and-cloud-storage-b88d34aa9dcb" target="_blank">this tutorial</a>.

- To resume training after a disconnection, run only the cells whose headings are marked with <!>.


## **<!>Specify Tensorflow 1.X version**

In [0]:
%tensorflow_version 1.x

## **<!>Setup GCS bucket name**



In [0]:
BUCKET_NAME = "bert_cantonese" #@param {type:"string"}
BUCKET_PATH = "gs://{}".format(BUCKET_NAME)

## **<!>Authorize to GCS**

In [0]:
import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

auth.authenticate_user()
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False

## **Download BERT**


In [0]:
!git clone https://github.com/google-research/bert

sys.path.append("bert")
from bert import modeling, optimization, tokenization
from bert.run_pretraining import input_fn_builder, model_fn_builder

### **Or, if you have a customized BERT folder in your GCS bucket**

In [0]:
!gsutil -m cp -r $BUCKET_PATH/code/v2/bert .

## **Download Wikipedia Data, WikiExtractor**

In [0]:
DUMP_FILE = "zh_yuewiki-20200301-pages-articles-multistream.xml.bz2"
!wget https://dumps.wikimedia.org/zh_yuewiki/20200301/$DUMP_FILE

In [0]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip
!unzip master.zip

--2020-03-20 22:15:07--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/master [following]
--2020-03-20 22:15:07--  https://codeload.github.com/attardi/wikiextractor/zip/master
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [ <=>                ] 249.29K  --.-KB/s    in 0.1s    

2020-03-20 22:15:08 (2.37 MB/s) - ‘master.zip’ saved [255270]



## **Extract Wiki Files**

In [0]:
!python wikiextractor-master/WikiExtractor.py -o . --json -b 500k zh_yuewiki-20200301-pages-articles-multistream.xml.bz2

In [0]:
!ls ./AA

wiki_00  wiki_11  wiki_22  wiki_33  wiki_44  wiki_55  wiki_66  wiki_77	wiki_88
wiki_01  wiki_12  wiki_23  wiki_34  wiki_45  wiki_56  wiki_67  wiki_78	wiki_89
wiki_02  wiki_13  wiki_24  wiki_35  wiki_46  wiki_57  wiki_68  wiki_79	wiki_90
wiki_03  wiki_14  wiki_25  wiki_36  wiki_47  wiki_58  wiki_69  wiki_80	wiki_91
wiki_04  wiki_15  wiki_26  wiki_37  wiki_48  wiki_59  wiki_70  wiki_81	wiki_92
wiki_05  wiki_16  wiki_27  wiki_38  wiki_49  wiki_60  wiki_71  wiki_82	wiki_93
wiki_06  wiki_17  wiki_28  wiki_39  wiki_50  wiki_61  wiki_72  wiki_83	wiki_94
wiki_07  wiki_18  wiki_29  wiki_40  wiki_51  wiki_62  wiki_73  wiki_84	wiki_95
wiki_08  wiki_19  wiki_30  wiki_41  wiki_52  wiki_63  wiki_74  wiki_85	wiki_96
wiki_09  wiki_20  wiki_31  wiki_42  wiki_53  wiki_64  wiki_75  wiki_86
wiki_10  wiki_21  wiki_32  wiki_43  wiki_54  wiki_65  wiki_76  wiki_87


In [0]:
!mv ./AA ./json
!ls

adc.json    sample_data
bert	    wikiextractor-master
json	    zh_yuewiki-20200301-pages-articles-multistream.xml.bz2
master.zip


## **Save Clean Wiki Files**

In [0]:
import os
from collections import defaultdict
import json
import re
import jieba

DATA_PATH = "."
WIKI_PATH = DATA_PATH
WIKI_ORI_PATH = os.path.join(DATA_PATH, "json")

def read_wiki(file_name):
    data = []
    with open(os.path.join(WIKI_ORI_PATH, file_name), 'r') as f:
        for json_obj in f:
            data.append(json.loads(json_obj))
    return data


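def save_clean_wiki(file_name, verbose=True):
    """Write one segmented sentence per line for every article in a WikiExtractor file."""
    json_list = read_wiki(file_name)
    for json_obj in json_list:
        # Name each output file "<id>_<title>"; a '/' in the title would otherwise act as a path separator.
        output_path = os.path.join(".", "clean", "%s_%s" % (json_obj['id'], json_obj['title'].replace('/', '-').strip()))
        with open(output_path, "w") as f:
            for line in json_obj['text'].split('\n'):
                # Split on the Chinese full stop and segment each sentence with jieba.
                for l in line.split('。'):
                    if l:
                        f.write(' '.join(jieba.cut(l, cut_all=False)) + ' 。\n')
        if verbose:
            print("%s saved." % output_path)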


def read_clean_wiki(file_code):
    with open(os.path.join(WIKI_PATH, "clean", file_code)) as f:
        return [l.strip().split(' ') for l in f]


def save_clean_wikipedia(output_file="wiki_yue_overview.csv", verbose=True):
    jieba.initialize()  # load jieba's default dictionary once, before segmenting
    for (_, _, filenames) in sorted(os.walk(WIKI_ORI_PATH)):
        for file_name in sorted(filenames):
            save_clean_wiki(file_name)
    total = 0
    with open(os.path.join(WIKI_PATH, output_file), "w") as f:
        file_codes = []
        for (_, _, filenames) in os.walk(os.path.join(".", "clean")):
            file_codes.extend(filenames)
        for ind, code in enumerate(sorted(file_codes)):
            f.write('%s,%s\n' % (ind, code))
        total += len(file_codes)
    if verbose:
        print("Total: %d. %s saved." % (total, output_file))


def read_clean_wikipedia(overview_csv="wiki_yue_overview.csv"):
    with open(os.path.join(WIKI_PATH, overview_csv)) as f:
        for line in f.readlines():
            # print(line)
            _, file_code = line.partition(',')[0], line.partition(',')[2].strip()
            if file_code.startswith('wiki'):
                continue
            # print("Reading %s.." % file_code)
            for sen in read_clean_wiki(file_code):
                yield sen
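
The cleaning functions above write one segmented sentence per line, splitting each paragraph on the full stop `。`. A minimal sketch of that splitting logic follows. The `segment` argument is a hypothetical stand-in for `jieba.cut` (by default it just splits a string into characters), so the example runs without extra packages and does not reproduce jieba's word boundaries.

```python
def split_sentences(text, segment=list):
    """Split text on the Chinese full stop and join tokens with spaces.

    `segment` is a hypothetical stand-in for jieba.cut: any callable that
    maps a string to an iterable of tokens. The default splits into characters.
    """
    lines = []
    for paragraph in text.split("\n"):
        for sentence in paragraph.split("。"):
            if sentence:  # skip empty pieces, e.g. after a trailing full stop
                lines.append(" ".join(segment(sentence)) + " 。")
    return lines


sample = "我哋去飲茶。好唔好？\n聽日見。"
result = split_sentences(sample)
for line in result:
    print(line)
```

As in the notebook's own loop, every piece gets a trailing ` 。`, including pieces that already end in other punctuation such as `？`.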

In [0]:
!mkdir clean
save_clean_wikipedia()

output_path = "wiki_dataset.txt"
with open(output_path, "w") as f:
    for sen in read_clean_wikipedia():
        f.write("%s\n" % " ".join(sen))
print("%s saved." % output_path)

## **Create Vocab Files**

In [0]:
!gsutil cp $BUCKET_PATH/data/embedding/cantonese/custom_wiki.bin .

In [0]:
# bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("./custom_wiki.bin", binary=True)
print(model.index2word[:1000])

bert_vocab = model.index2word

In [0]:
ctrl_symbols = ["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab
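
The vocabulary is the list of control symbols followed by the word2vec word list. BERT pads input ids with 0, so `[PAD]` should stay at index 0. The sketch below illustrates one way to assemble such a list while dropping blank and duplicate entries; `build_vocab` and the sample words are hypothetical, and the deduplication is an extra precaution that the notebook itself does not perform.

```python
CTRL_SYMBOLS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(words):
    """Return control symbols followed by unique, non-blank words in first-seen order."""
    seen = set(CTRL_SYMBOLS)
    vocab = list(CTRL_SYMBOLS)
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab


# Hypothetical word list standing in for model.index2word.
words = ["香港", "係", "香港", " ", "螺絲"]
vocab = build_vocab(words)
token_to_id = {token: idx for idx, token in enumerate(vocab)}
print(len(vocab), token_to_id["[PAD]"], token_to_id["香港"])
```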

In [0]:
#bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print(len(bert_vocab))
VOC_SIZE = len(bert_vocab)

70035


In [0]:
VOC_FNAME = "vocab.txt" #@param {type:"string"}

with open(VOC_FNAME, "w") as fo:
  for token in bert_vocab:
    fo.write(token+"\n")

In [0]:
!head -n 50 $VOC_FNAME

In [0]:
testcase = "香港士巴拿係一種架生，作用係方便上緊或者扭鬆正方形同六角形嘅螺絲頭同螺絲帽，手柄畀人揸住用力。"

In [0]:
bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
bert_tokenizer.tokenize(testcase)

## **Create Local Shard, Generate PreTraining Data**

In [0]:
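PRC_DATA_FPATH = "wiki_dataset.txt"  # assumed: the sentence-per-line dataset written in "Save Clean Wiki Files"
!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_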
!ls ./shards/

shard_0000  shard_0001
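
The `split` command above cuts the dataset into files of 256000 lines each, named with a four-digit numeric suffix. For readers without GNU `split`, a rough Python equivalent might look like the sketch below. `split_into_shards` is a hypothetical helper, and the demo runs on a small temporary file rather than the real dataset.

```python
import os
import tempfile


def split_into_shards(input_path, output_dir, lines_per_shard):
    """Write consecutive chunks of `lines_per_shard` lines to shard_0000, shard_0001, ..."""
    os.makedirs(output_dir, exist_ok=True)
    shard_paths = []
    out = None
    with open(input_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src):
            if line_no % lines_per_shard == 0:
                # Start a new shard every `lines_per_shard` lines.
                if out is not None:
                    out.close()
                path = os.path.join(output_dir, "shard_%04d" % len(shard_paths))
                shard_paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths


# Demo on a throwaway file: 5 lines at 2 lines per shard gives 3 shards.
work_dir = tempfile.mkdtemp()
demo_input = os.path.join(work_dir, "demo.txt")
with open(demo_input, "w", encoding="utf-8") as f:
    f.write("".join("sentence %d\n" % i for i in range(5)))

shards = split_into_shards(demo_input, os.path.join(work_dir, "shards"), lines_per_shard=2)
shard_names = [os.path.basename(p) for p in shards]
print(shard_names)
```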


In [0]:
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20 #@param {type:"integer"}
DO_LOWER_CASE = True #@param {type:"boolean"}
PROCESSES = 8 #@param {type:"integer"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
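
The values of `MAX_PREDICTIONS`, `MAX_SEQ_LENGTH` and `MASKED_LM_PROB` are related: the BERT repository's README suggests setting `max_predictions_per_seq` to around `max_seq_length * masked_lm_prob`. A quick check of the values used here, written as a small sketch:

```python
import math

MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15

# Expected number of masked positions in a full-length sequence, rounded up.
suggested_max_predictions = math.ceil(MAX_SEQ_LENGTH * MASKED_LM_PROB)
print(suggested_max_predictions)
```

This agrees with the `MAX_PREDICTIONS = 20` setting above, since 128 × 0.15 = 19.2.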

In [0]:
XARGS_CMD = ("ls ./shards/ | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=./shards/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")

XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}', 
                             VOC_FNAME, DO_LOWER_CASE, 
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)
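
The template above is filled in by Python's `str.format`, but `xargs` itself needs literal `{}` placeholders for the shard name. Passing the string `'{}'` as a format argument puts the braces back. The sketch below shows this on a shortened, hypothetical command (using `echo` instead of the real script), together with the alternative of escaping braces as `{{}}`.

```python
# Shortened, hypothetical template: same placeholder pattern as XARGS_CMD.
template = ("ls ./shards/ | xargs -n 1 -P {} -I{} "
            "echo --input_file=./shards/{} --output_file={}/{}.tfrecord")

# Passing '{}' as an argument re-emits a literal {} for xargs to substitute.
command = template.format(8, "{}", "{}", "pretraining_data", "{}")

# Equivalent approach: escape literal braces as {{}} and name the real fields.
escaped = ("ls ./shards/ | xargs -n 1 -P {procs} -I{{}} "
           "echo --input_file=./shards/{{}} --output_file={out}/{{}}.tfrecord")
command_escaped = escaped.format(procs=8, out="pretraining_data")

print(command)
```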

In [0]:
tf.gfile.MkDir(PRETRAINING_DIR)
!$XARGS_CMD

### **Or, if you already have PreTraining Data from GCS**

In [6]:
!gsutil -m cp -r $BUCKET_PATH/code/v2/$PRETRAINING_DIR .


## **Create Trained Model Directory**

In [0]:
MODEL_DIR = "bert_model" #@param {type:"string"}
tf.gfile.MkDir(MODEL_DIR)

In [0]:
# use this for BERT-base

bert_base_config = {
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu", 
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
  json.dump(bert_base_config, fo, indent=2)
  
with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w") as fo:
  for token in bert_vocab:
    fo.write(token+"\n")
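
Some values in this configuration have to agree with each other and with the data: BERT's attention layers split `hidden_size` evenly across `num_attention_heads`, `vocab_size` should match the number of entries in the vocabulary file, and the training sequence length cannot exceed `max_position_embeddings`. The sketch below checks these three conditions; `check_bert_config` is a hypothetical helper, not part of the BERT repository, and the demo dictionary repeats only the relevant keys.

```python
def check_bert_config(config, vocab_size, max_seq_length):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        problems.append("hidden_size is not a multiple of num_attention_heads")
    if config["vocab_size"] != vocab_size:
        problems.append("vocab_size does not match the vocabulary file")
    if max_seq_length > config["max_position_embeddings"]:
        problems.append("max_seq_length exceeds max_position_embeddings")
    return problems


# Relevant keys from the BERT-base configuration above (vocabulary size as printed earlier).
demo_config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 512,
    "vocab_size": 70035,
}

problems = check_bert_config(demo_config, vocab_size=70035, max_seq_length=128)
head_size = demo_config["hidden_size"] // demo_config["num_attention_heads"]
print(problems, head_size)
```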

### **Backup Directories to GCS Bucket: PreTraining Data and Trained Model**

In [0]:
!gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME

Copying file://bert_model/vocab.txt [Content-Type=text/plain]...
Copying file://bert_model/bert_config.json [Content-Type=application/json]...
Copying file://pretraining_data/shard_0001.tfrecord [Content-Type=application/octet-stream]...
Copying file://pretraining_data/shard_0000.tfrecord [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite o

## **<!> Set Training Configuration**

In [0]:
import os

BUCKET_NAME = "bert_cantonese" #@param {type:"string"}
MODEL_DIR = "bert_model" #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
VOC_FNAME = "vocab.txt" #@param {type:"string"}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500 #@param {type:"integer"}
NUM_TPU_CORES = 8

if BUCKET_NAME:
  BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
else:
  BUCKET_PATH = "."

BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

VOCAB_FILE = os.path.join(BERT_GCS_DIR, VOC_FNAME)
CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")

INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))

2020-03-21 02:12:18,351 :  From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

2020-03-21 02:12:18,497 :  Using checkpoint: gs://bert_cantonese/bert_model/model.ckpt-28500
2020-03-21 02:12:18,498 :  Using 2 data shards
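
The log above shows training resuming from `model.ckpt-28500`; the trailing number is the global step stored in the checkpoint. Because `max_steps` in `Estimator.train` is a total step count rather than an increment, a resumed run only covers the remaining steps. A small sketch of that arithmetic, where `checkpoint_step` is a hypothetical helper:

```python
import re


def checkpoint_step(checkpoint_path):
    """Extract the global step from a path like '.../model.ckpt-28500' (0 if absent)."""
    if checkpoint_path is None:
        return 0
    match = re.search(r"ckpt-(\d+)$", checkpoint_path)
    return int(match.group(1)) if match else 0


TRAIN_STEPS = 1000000
resumed_from = checkpoint_step("gs://bert_cantonese/bert_model/model.ckpt-28500")
remaining_steps = TRAIN_STEPS - resumed_from
print(resumed_from, remaining_steps)
```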


In [0]:
model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
  
train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=True)

2020-03-21 02:12:26,362 :  Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f3d7eb56f28>) includes params argument, but params are not passed to Estimator.
2020-03-21 02:12:26,364 :  Using config: {'_model_dir': 'gs://bert_cantonese/bert_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 2500, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.7.192.10:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3d7eb23f60>, '_t

In [0]:
estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

2020-03-21 02:12:29,257 :  Querying Tensorflow master (grpc://10.7.192.10:8470) for TPU system metadata.
2020-03-21 02:12:29,278 :  Found TPU system:
2020-03-21 02:12:29,279 :  *** Num TPU Cores: 8
2020-03-21 02:12:29,280 :  *** Num TPU Workers: 1
2020-03-21 02:12:29,280 :  *** Num TPU Cores Per Worker: 8
2020-03-21 02:12:29,281 :  *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 18073774665154168877)
2020-03-21 02:12:29,283 :  *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 17709391651440839388)
2020-03-21 02:12:29,284 :  *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 5318324718595205416)
2020-03-21 02:12:29,284 :  *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 4836610270761407413)
2020-03-21 02:12:29,286 :  *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:T