# BERT pretraining on Japanese wiki

This notebook was written to run on Google Colaboratory with a TPU runtime.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14Ky8w5NodVyfk7tm13u6vdaGPl5qvPxL)


In [1]:
import tensorflow as tf

In [2]:
tf.__version__

'1.12.0'

## Set input and output

Put the `all-maxseq(128|512).tfrecord` pre-training data in your GCS bucket.  
Training outputs such as checkpoints are saved to the GCS bucket you specify.

In [3]:
INPUT_DATA_GCS = '/home/ubuntu/work/data2/asahi/kiji_bert'

In [4]:
! ls /home/ubuntu/work/data/wiki/AA

all-maxseq128.tfrecord	wiki_12  wiki_27  wiki_42  wiki_57  wiki_72  wiki_87
all-maxseq512.tfrecord	wiki_13  wiki_28  wiki_43  wiki_58  wiki_73  wiki_88
all.txt			wiki_14  wiki_29  wiki_44  wiki_59  wiki_74  wiki_89
wiki_00			wiki_15  wiki_30  wiki_45  wiki_60  wiki_75  wiki_90
wiki_01			wiki_16  wiki_31  wiki_46  wiki_61  wiki_76  wiki_91
wiki_02			wiki_17  wiki_32  wiki_47  wiki_62  wiki_77  wiki_92
wiki_03			wiki_18  wiki_33  wiki_48  wiki_63  wiki_78  wiki_93
wiki_04			wiki_19  wiki_34  wiki_49  wiki_64  wiki_79  wiki_94
wiki_05			wiki_20  wiki_35  wiki_50  wiki_65  wiki_80  wiki_95
wiki_06			wiki_21  wiki_36  wiki_51  wiki_66  wiki_81  wiki_96
wiki_07			wiki_22  wiki_37  wiki_52  wiki_67  wiki_82  wiki_97
wiki_08			wiki_23  wiki_38  wiki_53  wiki_68  wiki_83  wiki_98
wiki_09			wiki_24  wiki_39  wiki_54  wiki_69  wiki_84  wiki_99
wiki_10			wiki_25  wiki_40  wiki_55  wiki_70  wiki_85
wiki_11			wiki_26  wiki_41  wiki_56  wiki_71  wiki_86


In [5]:
TARGET_DIRS = [
  'AA',
  'AB',
  'AC',
  'AD',
  'AE',
  'AF',
  'AG',
  'AH',
  'AI',
  'AJ',
  'AK',
  'AL',
  'AM',
  'AN',
  'AO',
  'AP',
  'AQ',
  'AR',
  'AS',
  'AT',
  'AU',
  'AV',
  'AW',
  'AX',
  'AY',
  'AZ',
  'BA',
  'BB'
]
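The directory names above follow a regular two-letter pattern, so the list can also be generated instead of typed out. The following is a minimal sketch; `wiki_dir_names` and `TARGET_DIRS_GENERATED` are hypothetical names introduced here for illustration, and the sketch assumes the directories run without gaps from `AA` to the last name given.

```python
from itertools import product
from string import ascii_uppercase


def wiki_dir_names(last='BB'):
    """Return two-letter names 'AA', 'AB', ... up to and including `last`."""
    names = []
    for first, second in product(ascii_uppercase, repeat=2):
        name = first + second
        names.append(name)
        if name == last:
            return names
    # Reached only if `last` is not a two-letter uppercase name.
    raise ValueError('unknown directory name: {}'.format(last))


# Hypothetical variable; it holds the same 28 names as TARGET_DIRS above.
TARGET_DIRS_GENERATED = wiki_dir_names('BB')
print(TARGET_DIRS_GENERATED)
```

Keeping the explicit list has one advantage: individual directories can be commented out while experimenting.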

In [6]:
# MAX_SEQ_LEN = 128
MAX_SEQ_LEN = 512

In [7]:
# Comma-separated list of TFRecord paths, one per target directory.
INPUT_FILE = ','.join(
    '{}/{}/all-maxseq{}.tfrecord'.format(INPUT_DATA_GCS, elem, MAX_SEQ_LEN)
    for elem in TARGET_DIRS
)
INPUT_FILE

'/home/ubuntu/work/data/wiki/AA/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AB/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AC/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AD/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AE/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AF/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AG/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AH/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AI/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AJ/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AK/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AL/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AM/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AN/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AO/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AP/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AQ/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki/AR/all-maxseq512.tfrecord,/home/ubuntu/work/data/wiki
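Because a long pre-training run is expensive to restart, it can help to confirm that every path in the comma-separated list exists before launching it. The sketch below assumes local filesystem paths, as in the output above; `missing_input_files` is a hypothetical helper, not part of the repository. For `gs://` paths, `os.path.exists` would not work and a GCS-aware check (for example `tf.gfile.Exists` in TensorFlow 1.x) would be needed instead.

```python
import os


def missing_input_files(input_file):
    """Split a comma-separated path list and return the paths that do not exist locally."""
    paths = [p.strip() for p in input_file.split(',') if p.strip()]
    return [p for p in paths if not os.path.exists(p)]


# Example with made-up paths; in the notebook, pass INPUT_FILE instead.
example = '/tmp/hypothetical-a.tfrecord,/tmp/hypothetical-b.tfrecord'
for path in missing_input_files(example):
    print('missing:', path)
```

An empty result means every listed file was found.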

In [8]:
OUTPUT_GCS = '/home/ubuntu/work/bert-japanese/pretrain/asahi'

## Execute pre-training

Note: you must grant `service-xxx@cloud-tpu.iam.gserviceaccount.com` the following roles on the specified GCS bucket:
- Storage Legacy Bucket Reader
- Storage Legacy Bucket Writer
- Storage Legacy Object Reader
- Storage Object Viewer


In [9]:
# TPU variant, kept for reference. TPU_ADDRESS must be defined before this is run.
# !python bert-japanese/src/run_pretraining.py \
#   --input_file={INPUT_FILE} \
#   --output_dir={OUTPUT_GCS} \
#   --use_tpu=True \
#   --tpu_name={TPU_ADDRESS} \
#   --num_tpu_cores=8 \
#   --do_train=True \
#   --do_eval=True \
#   --train_batch_size=256 \
#   --max_seq_length={MAX_SEQ_LEN} \
#   --max_predictions_per_seq=20 \
#   --num_train_steps=1000000 \
#   --num_warmup_steps=10000 \
#   --save_checkpoints_steps=10000 \
#   --learning_rate=1e-4
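The `--learning_rate`, `--num_warmup_steps`, and `--num_train_steps` flags together determine the learning-rate schedule. The sketch below assumes that `run_pretraining.py` keeps the schedule of the original BERT `optimization.py`: linear warmup from zero, then linear decay to zero over the full run. `bert_learning_rate` is a hypothetical function written for illustration; its defaults mirror the CPU/GPU command in the next cell.

```python
def bert_learning_rate(step, init_lr=1e-4, num_train_steps=1400000,
                       num_warmup_steps=10000):
    """Approximate learning rate at `step`: linear warmup, then linear decay to zero."""
    step = float(step)
    if step < num_warmup_steps:
        # Warmup: scale the peak rate by the fraction of warmup completed.
        return init_lr * step / num_warmup_steps
    # Decay is measured from step 0, not from the end of warmup (assumption
    # taken from the original BERT implementation).
    step = min(step, float(num_train_steps))
    return init_lr * (1.0 - step / num_train_steps)


for s in (0, 5000, 10000, 700000, 1400000):
    print(s, bert_learning_rate(s))
```

With these values the rate peaks just below `1e-4` at the end of warmup, because the decay term has already been running for 10,000 steps.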

In [13]:
%%time

!python ../src/run_pretraining.py \
  --input_file={INPUT_FILE} \
  --output_dir={OUTPUT_GCS} \
  --use_tpu=False \
  --do_train=True \
  --do_eval=True \
  --train_batch_size=8 \
  --max_seq_length={MAX_SEQ_LEN} \
  --max_predictions_per_seq=20 \
  --num_train_steps=1400000 \
  --num_warmup_steps=10000 \
  --save_checkpoints_steps=10000 \
  --learning_rate=1e-4

INFO:tensorflow:*** Input Files ***
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AA/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AB/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AC/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AD/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AE/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AF/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AG/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AH/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AI/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AJ/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AK/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AL/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubuntu/work/data/wiki/AM/all-maxseq512.tfrecord
INFO:tensorflow:  /home/ubun

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-15-08:01:20
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./pretrain/model/model.ckpt-1400000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2019-03-15-08:01:32
INFO:tensorflow:Saving dict for global step 1400000: global_step = 1400000, loss = 2.0181375, masked_lm_accuracy = 0.60637784, masked_lm_loss = 1.9834335, next_sentence_accuracy = 0.9875, next_sentence_loss = 0.033180095
INFO:tensorflow:Saving 'checkpoint_path' summary for global ste
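The `Saving dict` line above carries the final evaluation metrics. As a rough aid to reading them, the sketch below parses that line and converts the masked-LM loss to a perplexity, assuming the loss is a mean cross-entropy in nats. `parse_eval_metrics` is a hypothetical helper, and the log text is copied from the output above.

```python
import math
import re


def parse_eval_metrics(log_line):
    """Extract `name = value` pairs from an Estimator 'Saving dict' log line."""
    pairs = re.findall(r'(\w+) = ([-+0-9.eE]+)', log_line)
    return {name: float(value) for name, value in pairs}


line = ('INFO:tensorflow:Saving dict for global step 1400000: '
        'global_step = 1400000, loss = 2.0181375, '
        'masked_lm_accuracy = 0.60637784, masked_lm_loss = 1.9834335, '
        'next_sentence_accuracy = 0.9875, next_sentence_loss = 0.033180095')

metrics = parse_eval_metrics(line)
# Perplexity of the masked-token predictions (assumes loss in nats).
perplexity = math.exp(metrics['masked_lm_loss'])
print('masked LM accuracy:', metrics['masked_lm_accuracy'])
print('masked LM perplexity:', round(perplexity, 2))
```

For this run the masked-LM perplexity comes out at about 7.3, alongside a masked-token accuracy of about 61% and a next-sentence accuracy of 98.75%.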