<a href="https://colab.research.google.com/github/vietai/dab/blob/master/colab/T2T_translate_vi%3C_%3Een_tiny_tpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this colab, we will train a translation model between English and Vietnamese (in either direction) using the Transformer architecture and the Tensor2Tensor library. You will be shown how to use a GPU or TPUv2 runtime and how to connect to your Google Drive or Google Cloud Storage, all for free! To perform back-translation, you will need to run the entire colab a second time with the translation direction reversed.

**MIT License**

Copyright (c) [2019] [Trieu H. Trinh](https://thtrieu.github.io/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Install dependencies.

* `tensor2tensor`: a library with all necessary tools to perform training/inference.

In [0]:
# Install Tensor2tensor
!pip install -q -U tensor2tensor

print('All done.')


# Setup some options.

In [0]:
# Imports we need.
import tensorflow as tf
import numpy as np
import os
import collections
import json
import pprint

#@markdown 1. The problem is either `translate_vien_iwslt32k` or `translate_envi_iwslt32k`

problem = 'translate_envi_iwslt32k'  # @param

#@markdown 2. We use the tiny setting of the transformer by default.

hparams_set = 'transformer_tiny'  # @param

#@markdown 3. Next we specify the directory where all data involving this colab will be stored (training data, checkpoints, decoded text etc.)

#@markdown * For GPU we use Google Drive Storage (free for everyone with a Google account, no need to install any payment method).

google_drive_dir = 'back_translate'  # @param


#@markdown * With TPU, unfortunately only Google Cloud Storage is usable (free trial with a payment method required). Here we specify a Storage bucket.

google_cloud_bucket = 'vien-translation'  # @param

#@markdown Please note that only one of the two options above will be used depending on which runtime setting you are using.

#@markdown 4. Now we specify all sub-directories:

#@markdown * Data tfrecords (train/valid) to train/eval on will be generated to:
data_dir = 'data/translate_envi_iwslt32k'  # @param

#@markdown * Save/load checkpoints to here:
logdir = 'checkpoints/translate_envi_iwslt32k_tiny'  # @param

#@markdown * The temporary dir to store all the temp files during data generation (e.g. downloads from the internet).

tmp_dir = 'raw'  # @param

is_demo = True  # @param {type: "boolean"}

# Create and mount to Cloud/Drive Storage, connect to GPU or TPU runtime.

Now we create all the directories and mount them in the colab so that Python functions here (e.g. `os.path.exists`) can see and work with them.

* For TPUs, we will have access to a cluster of 2x2 chips (i.e. 8 cores, because each chip has 2 cores). One complete replica of the TF graph is placed on each core, which is how data parallelism is achieved.

* We also need the address of the TPUs on the cloud, which is passed to tensor2tensor during training.
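
To make the data-parallel setup more concrete, here is a minimal sketch in plain Python (it is not Tensor2Tensor or TPU code) of how a global batch could be divided evenly across the 8 cores, one shard per replica. The names `NUM_CORES` and `shard_batch` are hypothetical and exist only for this illustration; in the real setup the splitting is handled for you by the TPU runtime.

```python
NUM_CORES = 8  # 2x2 chips with 2 cores each (hypothetical constant for this sketch)


def shard_batch(batch, num_shards=NUM_CORES):
  """Split a list of examples into num_shards equally sized shards."""
  if len(batch) % num_shards != 0:
    raise ValueError('Batch size must be divisible by the number of shards.')
  per_shard = len(batch) // num_shards
  return [batch[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]


global_batch = list(range(32))  # 32 toy "examples"
shards = shard_batch(global_batch)
for core_id, shard in enumerate(shards):
  print('core {}: {}'.format(core_id, shard))
```

Each replica would then process only its own shard, so the work per step is spread over all cores.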

In [0]:
# Check if the runtime is set to TPU or GPU:
use_tpu = 'COLAB_TPU_ADDR' in os.environ


def setup_gpu():
  # Mount "My Drive" into /content/drive
  from google.colab import drive
  drive.mount('/content/drive')
  # All data for this colab lives in a sub-directory of "My Drive".
  mount_point = '/content/drive/My Drive/{}'.format(google_drive_dir)
  return mount_point
  
  
def setup_tpu():
  from google.colab import auth
  auth.authenticate_user()

  # First we connect to the TPU.
  tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  print('TPU address is', tpu_address)
  with tf.Session(tpu_address) as session:
    devices = session.list_devices()
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

  print('TPU devices:')
  pprint.pprint(devices)

  # Mount the bucket in colab so that Python's os module can access it.
  # First we install gcsfuse to be able to mount Google Cloud Storage in Colab.
  print('\nInstalling gcsfuse')
  !echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
  !curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
  !apt -qq update
  !apt -qq install gcsfuse

  bucket = google_cloud_bucket
  print('Mounting bucket {} to local.'.format(bucket))
  mount_point = '/content/{}'.format(bucket)
  if not os.path.exists(mount_point):
    tf.gfile.MakeDirs(mount_point)
  
  !fusermount -u $mount_point
  !gcsfuse --implicit-dirs $bucket $mount_point
  print('\nMount point content:')
  !ls $mount_point

  return mount_point, tpu_address


if not use_tpu:
  mount_point = setup_gpu()
  tpu_address = ''
else:
  mount_point, tpu_address = setup_tpu()
  
print('\nMount point: {}'.format(mount_point))
print('TPU address: {}'.format(tpu_address))

Now we create all the directories.

In [0]:
# Now we make all the paths absolute.
logdir = os.path.join(mount_point, logdir)
data_dir = os.path.join(mount_point, data_dir)
tmp_dir = os.path.join(mount_point, tmp_dir)
tf.gfile.MakeDirs(logdir)
tf.gfile.MakeDirs(data_dir)
tf.gfile.MakeDirs(tmp_dir)

if is_demo:
  run_logdir = os.path.join(logdir, 'demo')
  if tf.gfile.Exists(run_logdir):
    tf.gfile.DeleteRecursively(run_logdir)
else:
  run_logdir = logdir

print('log dir: {}'.format(run_logdir))
print('data dir: {}'.format(data_dir))
print('temp dir: {}'.format(tmp_dir))

# Clone or pull the source code from our GitHub repo `vietai/dab`



In [0]:
src = '/content/dab'
if not os.path.exists(src):
  # Clone explicitly into $src so the result does not depend on the current directory.
  !git clone https://github.com/vietai/dab.git $src
else:
  %cd $src
  !git pull
  %cd /

print('\n Source code:')
!ls $src

# Generate Training and Validation datasets

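First let's look at what the training data looks like in its original text format. If this is your first run, these files only appear in `tmp_dir` after the data generation command further below has downloaded them.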

In [0]:
!head -n 10 $tmp_dir/train.en
print('=' * 10)
!head -n 10 $tmp_dir/train.vi

# Now we count the number of lines:
print('\nNumber of training text lines:')
!wc -l $tmp_dir/train.en
!wc -l $tmp_dir/train.vi
print('\nNumber of validation text lines:')
!wc -l $tmp_dir/tst2012.en
!wc -l $tmp_dir/tst2012.vi

`tensor2tensor` requires the data to be in a specific format so that its training/inference pipeline can handle it efficiently (e.g. parallel access, shuffling, distributed storage). Use the following command to:

* Download raw text training/dev data from the internet into `tmp_dir`.
* Preprocess and tokenize the raw text to build a vocabulary.
* Convert the original text data into the expected `tfrecords` format and store the result in `data_dir`.

In [0]:
!python $src/t2t_datagen.py --data_dir=$data_dir \
--tmp_dir=$tmp_dir --problem=$problem

print('\nGenerated TF records:')
!ls $data_dir

Let's also look at the vocabulary file. Each token is on its own line, and the tokens are listed in decreasing order of frequency.

In [0]:
vocab = os.path.join(data_dir, 'vocab.{}.32768.subwords'.format(problem))
!head -n 10 $vocab
!tail -n 10 $vocab
!wc -l $vocab
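
As a rough illustration of a frequency-ordered vocabulary, the sketch below counts tokens in a toy corpus and lists them from most to least frequent. It assumes plain whitespace tokens rather than the subword units that `tensor2tensor` actually builds, and the `corpus` and `vocab_tokens` names are made up for this example.

```python
import collections

# A toy corpus; the real vocabulary is built from the downloaded training text.
corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
]

counts = collections.Counter(
    token for line in corpus for token in line.split())

# Sort by decreasing count, breaking ties alphabetically for a stable order.
vocab_tokens = [
    token for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]

for token in vocab_tokens:
  print(token, counts[token])
```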

# Run Training

Instead of using the default option (250K training steps), we train for 50K steps: the model only needs ~50K steps to converge (and it overfits without regularization).

* On GPU, this takes about half a day. Evaluation on the validation set runs intermittently during training, every 1000 steps.

* On TPU, this takes about half an hour. Evaluation on the validation set runs once training is finished.

In [0]:
train_steps = 1000 if is_demo else 50000

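if use_tpu:
  # TPU wants the address to begin with gs://
  train_output_dir = run_logdir.replace(mount_point, 'gs://{}'.format(google_cloud_bucket))
  train_data_dir = data_dir.replace(mount_point, 'gs://{}'.format(google_cloud_bucket))
else:
  # On GPU, train directly on the mounted Google Drive paths.
  train_output_dir = run_logdir
  train_data_dir = data_dir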

!python $src/t2t_trainer.py --model='transformer' --hparams_set=$hparams_set \
--hparams='learning_rate_cosine_cycle_steps=50000' \
--train_steps=$train_steps --eval_steps=10 \
--problem=$problem --data_dir=$train_data_dir \
--output_dir=$train_output_dir --use_tpu=$use_tpu --cloud_tpu_name=$tpu_address
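
The path conversion used for the TPU case can be sketched as a small helper. This is only an illustration under the assumption that the bucket is mounted at a local directory as set up earlier; `to_gcs_path` is a hypothetical name that does not exist in this colab or in `tensor2tensor`. Unlike a plain `str.replace`, it only rewrites the leading mount-point prefix.

```python
def to_gcs_path(local_path, mount_point, bucket):
  """Map a path under the local mount point to its gs:// equivalent."""
  if not local_path.startswith(mount_point):
    raise ValueError('{} is not under {}'.format(local_path, mount_point))
  # Keep everything after the mount point and prepend the bucket address.
  return 'gs://{}'.format(bucket) + local_path[len(mount_point):]


example = to_gcs_path(
    '/content/vien-translation/data/translate_envi_iwslt32k',
    '/content/vien-translation',
    'vien-translation')
print(example)
```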

# Launch Tensorboard

Launch TensorBoard before starting the training cell above, so that it can update its content in real time as training progresses.

In [0]:
%load_ext tensorboard

print('Reading events from {}'.format(run_logdir))
%tensorboard --logdir=$run_logdir

# Download test data

So far we have trained and evaluated the model on Train/Dev sets. Now we download the Test set and perform decoding on it.

In [0]:
%cd $tmp_dir
!wget "https://github.com/stefan-it/nmt-en-vi/raw/master/data/test-2013-en-vi.tgz"
!tar -xzf test-2013-en-vi.tgz
%cd /

print('\nSample test data:')
!head -n 10 $tmp_dir/tst2013.en
print('=' * 10)
!head -n 10 $tmp_dir/tst2013.vi

print('\nTest data size:')
!wc -l $tmp_dir/tst2013.en
!wc -l $tmp_dir/tst2013.vi

# Compute Test set BLEU score using the final checkpoint

The BLEU (BiLingual Evaluation Understudy) score measures the n-gram overlap between the translated text and the reference text (i.e. ground truth). It has been shown to correlate well with human judgement. Although it has drawn some criticism, BLEU remains one of the most widely used automatic metrics for evaluating translation models.
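
To give a feel for what "n-gram overlap" means, here is a simplified sketch of a corpus-level BLEU computation: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization and a single reference per sentence, and the function names `ngram_counts` and `simple_bleu` are invented for this example. It is not the implementation behind `t2t-bleu`, which applies its own tokenization, so the numbers will not match exactly.

```python
import collections
import math


def ngram_counts(tokens, n):
  """Count all n-grams of length n in a list of tokens."""
  return collections.Counter(
      tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(translations, references, max_n=4):
  """Simplified corpus-level BLEU (0-100) with one reference per line."""
  matches = [0] * max_n
  totals = [0] * max_n
  trans_len = 0
  ref_len = 0
  for trans, ref in zip(translations, references):
    t, r = trans.split(), ref.split()
    trans_len += len(t)
    ref_len += len(r)
    for n in range(1, max_n + 1):
      t_counts, r_counts = ngram_counts(t, n), ngram_counts(r, n)
      # Clip each n-gram count by how often it occurs in the reference.
      matches[n - 1] += sum((t_counts & r_counts).values())
      totals[n - 1] += sum(t_counts.values())
  if trans_len == 0 or min(matches) == 0:
    return 0.0
  # Geometric mean of the n-gram precisions.
  log_precision = sum(
      math.log(float(m) / t) for m, t in zip(matches, totals)) / max_n
  # Penalize translations that are shorter than the reference.
  brevity_penalty = min(1.0, math.exp(1.0 - float(ref_len) / trans_len))
  return 100.0 * brevity_penalty * math.exp(log_precision)


print(simple_bleu(['the cat sat on a mat'], ['the cat sat on the mat']))
```

A translation identical to its reference scores 100, and one that shares no 4-gram with the reference scores 0 in this simplified version.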

There are two steps to compute BLEU score:

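1. Translate the source text file to the target language. With the default problem `translate_envi_iwslt32k`, this means translating `tst2013.en` to Vietnamese.
2. Compare the output of Step 1 with the reference (`tst2013.vi` for the default problem).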

Use the following command for Step 1:

In [0]:
decode_from_file = os.path.join(tmp_dir, 'tst2013.en')
decode_to_file = os.path.join(tmp_dir, 'tiny.tst2013.en2vi.txt')
ref_file = os.path.join(tmp_dir, 'tst2013.vi')
  
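if use_tpu:
  # TPU wants the paths to begin with gs://
  ckpt_dir = logdir.replace(mount_point, 'gs://{}'.format(google_cloud_bucket))
else:
  # On GPU, checkpoints are read directly from the mounted Google Drive path.
  ckpt_dir = logdir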

print('Decode to file {}'.format(decode_to_file))
!python $src/t2t_decoder.py \
--data_dir=$train_data_dir --problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--decode_hparams="beam_size=4,alpha=0.6"  \
--decode_from_file=$decode_from_file \
--decode_to_file=$decode_to_file  \
--output_dir=$ckpt_dir \
--use_tpu=$use_tpu \
--cloud_tpu_name=$tpu_address

Now let's look at the translated text and compare it to the reference text.

In [0]:
!wc -l $decode_to_file
!wc -l $ref_file

!head -n 5 $decode_to_file
!tail -n 5 $decode_to_file
print('=' * 10)
!head -n 5 $ref_file
!tail -n 5 $ref_file

Now use the following command to compute the BLEU score (Step 2):

In [0]:
print('\nCompare {} with reference {}'.format(decode_to_file, ref_file))
!t2t-bleu --translation=$decode_to_file --reference=$ref_file

# Compute BLEU score by averaging the latest 20 checkpoints.

A simple and effective technique to improve the test performance of neural networks is to average the last few checkpoints. Let's try it and see if there is any improvement. First we use `t2t-avg-all` from the `tensor2tensor` library to average all available checkpoints. There are 20 of them, which is the number of checkpoints kept during training.

In [0]:
decode_to_file = os.path.join(tmp_dir, 'tiny.avg.tst2013.en2vi.txt')
 
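if use_tpu:
  # TPU wants the paths to begin with gs://
  ckpt_dir = logdir.replace(mount_point, 'gs://{}'.format(google_cloud_bucket))
else:
  # On GPU, checkpoints are read directly from the mounted Google Drive path.
  ckpt_dir = logdir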

avg_dir = os.path.join(ckpt_dir, 'avg')
avg_ckpt = os.path.join(avg_dir, 'model.ckpt-50000.index')

print('Averaging..')
if not tf.gfile.Exists(avg_ckpt):
  !t2t-avg-all --model_dir=$logdir --output_dir=$avg_dir
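
Conceptually, checkpoint averaging takes the element-wise mean of every variable across the saved checkpoints. The sketch below shows the idea on toy data, where each "checkpoint" is just a dictionary from variable name to a flat list of floats. The `average_checkpoints` function and the toy values are assumptions made for illustration; this is not how `t2t-avg-all` is implemented.

```python
def average_checkpoints(checkpoints):
  """Average a list of {variable_name: list_of_floats} dicts element-wise."""
  if not checkpoints:
    raise ValueError('Need at least one checkpoint.')
  num = float(len(checkpoints))
  averaged = {}
  for name in checkpoints[0]:
    # Group the i-th value of this variable from every checkpoint together.
    columns = zip(*(ckpt[name] for ckpt in checkpoints))
    averaged[name] = [sum(values) / num for values in columns]
  return averaged


toy_checkpoints = [
    {'w': [1.0, 2.0], 'b': [0.0]},
    {'w': [3.0, 4.0], 'b': [1.0]},
    {'w': [5.0, 6.0], 'b': [2.0]},
]
print(average_checkpoints(toy_checkpoints))
```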

Now we repeat the same steps as above to compute the BLEU score, this time restoring the model from `avg_dir`, which holds the averaged checkpoint.

In [0]:
print('Decoding..')
!python $src/t2t_decoder.py --data_dir=$data_dir \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer --decode_hparams="beam_size=4,alpha=0.6"  \
--decode_from_file=$decode_from_file \
--decode_to_file=$decode_to_file  \
--output_dir=$avg_dir \
--use_tpu=$use_tpu \
--cloud_tpu_name=$tpu_address

print('Compute BLEU score..')
!t2t-bleu --translation=$decode_to_file \
--reference=$ref_file

If you did everything right, there should be at least a +1.0 improvement in BLEU score for the settings in this Colab! And that's the end of this tutorial, happy training!

# Acknowledgements

This work is made possible by [VietAI](http://vietai.org/). Special thanks to [Thang Luong](http://thangluong.com), Le Cao Thang, and Hoang Quy Phat for collaborating and giving comments.

# References

1. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

2. Izmailov, Pavel, et al. "Averaging weights leads to wider optima and better generalization." arXiv preprint arXiv:1803.05407 (2018).

3. Vaswani, Ashish, et al. "Tensor2tensor for neural machine translation." arXiv preprint arXiv.