# Offline Translation
Authors:
- Alexander Wolf

created_at: 11/04/2017   
last_commit: 03/05/2018

### TL;DR 

This notebook shows how to build a translator with the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) library. Tensor2Tensor was designed to be driven by shell scripts in the terminal; this notebook reverse engineers the library slightly so that you can train T2T models from code. The library supplies datasets and code for building translators between English and the following languages:

1. French
2. German
3. Chinese
4. Czech
5. Macedonian

Tensor2Tensor consists of three main components to train advanced neural networks with minimal code.

1. Problems
    - This specifies the input modality of the network and the task you are trying to solve.
2. Model
    - T2T includes many state-of-the-art models for translation and other tasks, which are independent of the input/output modality. For translation tasks the *Transformer* is recommended.
3. Hyperparameter Sets
    - These are predefined hparam sets for training a model on a given T2T problem. They can also be modified.

If you want to use one of Tensor2Tensor's prebuilt models or its training framework on your own dataset or a new problem type, you can extend the library yourself. See [here](https://github.com/tensorflow/tensor2tensor/blob/0d464ff699029cd604fdb11fb532f64bebc9134c/docs/new_problem.md) for more info.

##### Import needed Libraries

In [1]:
import json
import os

import tensorflow as tf

##### Get Tensorflow Flags

In [2]:
# Flags for Tensorflow 1.5 +
FLAGS = tf.flags

# Flags for Tensorflow < 1.5
# FLAGS = tf.flags.FLAGS

##### Import needed T2T Classes and Functions

In [3]:
from tensor2tensor.utils import trainer_lib
from tensor2tensor.utils.trainer_lib import create_run_config, create_experiment, create_hparams
from tensor2tensor.utils import registry
from tensor2tensor import models, problems

##### Initialize Notebook Params

In [4]:
PARAMS = {}

## Define T2T Problem, Model, and HPARAMS
These Tensor2Tensor parameters determine the pipeline used for training, evaluating, and decoding your T2T model. Read on to see how to set them.

#### Problem

A Tensor2Tensor translation problem specifies the languages you are translating from and to, along with the vocabulary source and size for your translation model. It must be set for data collection, data preparation, and training. Add '_rev' to the end of the problem name to translate from the other language into English instead of from English into the other language.

In Tensor2Tensor, a translation problem name has the following format:
- 'translate_en' + OTHER_LANGUAGE_CODE + '_wmt' + VOCAB_SIZE/TYPE *+ (optional) '_rev'*
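
The naming scheme can be sketched as a small helper. `build_problem_name` is a hypothetical function written for illustration, not part of Tensor2Tensor, and it is modeled on the problem name used in this notebook. Not every combination it can produce is a registered problem, so check the result against the list of available problems.

```python
def build_problem_name(lang_code, vocab="32k", reverse=False):
    """Compose a T2T-style translation problem name (hypothetical helper).

    lang_code: code of the non-English language, e.g. "fr".
    vocab:     vocabulary size/type suffix, e.g. "32k".
    reverse:   if True, append "_rev" to translate into English.
    """
    name = "translate_en{}_wmt{}".format(lang_code, vocab)
    if reverse:
        name += "_rev"
    return name


print(build_problem_name("fr", "32k", reverse=True))  # translate_enfr_wmt32k_rev
```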

You can view all the problems available in T2T by executing the code below. If the problem you need is not built in, you can add your own; see [here](https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md) for more details.

```python
from tensor2tensor import problems
problems.available()
```

In [5]:
PARAMS['T2T_Problem'] = 'translate_enfr_wmt32k_rev' # T2T Problem name for French -> English model with 32k word vocab size

#### Model

Tensor2Tensor ships with a lot of prebuilt code for neural sequence processing. For translation, the *'transformer'* model is advised: it trains very quickly compared to other models such as RNNs, and it converges well.

To see all available T2T models, execute the code below. Note that not all of them can be used for translation tasks.
```python
from tensor2tensor.utils import registry
registry.list_models()
```

In [6]:
PARAMS['T2T_Model'] = 'transformer' # Setting the Transformer Model for Translation

#### HPARAM Set

Tensor2Tensor contains different prebuilt hyperparameter sets that define the number of hidden layers in the model, learning rates, regularization terms, and much more. They are preconfigured to help you produce state-of-the-art models, but you can modify them however you want later in the notebook.

For translation on a single GPU, it is advised to start with the hparam set *transformer_base*. To view all prebuilt hparam sets for your model, look at the Tensor2Tensor source code. For example, all the Transformer hparam sets are defined at the bottom of this [py file](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py).

In [7]:
PARAMS['T2T_HPARAMS'] = 'transformer_base' # Hyperparameter set for the Transformer model

## Generate Data for Problem
Tensor2Tensor supplies data generators for each problem in its library. You only need to generate the data into a *DATA_DIR* once. This process can take many hours, so the call that generates the data is commented out by default.

Make sure to import the correct data generator for the problem you're using in the T2T library.

In [8]:
from tensor2tensor.data_generators.translate_enfr import TranslateEnfrWmt32k

### Set Temp and Data Directory
Set the two directories below.
- *TMP_DIR*
    - Stores the zipped text data downloaded from the internet.
    - If you are working internally at Dataiku on dku11, keep the *TMP_DIR* below for any T2T translation problem; the data is already downloaded there.
- *DATA_DIR*
    - Stores the preprocessed data for your specific T2T problem.
    - You will need to set a new one for each new translation task.
    - Can be reused for later training runs on the same problem.

In [9]:
PARAMS['TMP_DIR'] = '/data.nfs/shared/translation_t2t_data'

In [10]:
PARAMS['DATA_DIR'] = '/data.nfs/awolf/translation/enfr_rev_train_data/translate_fr_en_wmt32k' 
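
Before generating data, it can help to confirm that both directories exist. The following is a minimal sketch, not part of the original notebook: `ensure_dirs` is a hypothetical helper, and throwaway temporary paths stand in for the real `TMP_DIR` and `DATA_DIR` values.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory if it is missing and return the paths."""
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)
    return list(paths)


# Throwaway locations so the sketch can run anywhere; in the notebook you
# would pass PARAMS['TMP_DIR'] and PARAMS['DATA_DIR'] instead.
base = tempfile.mkdtemp()
tmp_dir = os.path.join(base, "t2t_tmp")
data_dir = os.path.join(base, "t2t_data")

created = ensure_dirs(tmp_dir, data_dir)
print(all(os.path.isdir(p) for p in created))  # True
```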

Now Generate the Data

In [11]:
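# Uncomment to generate the data (needed only once; can take many hours).
# generate_data is an instance method, so the problem class is instantiated first.
# TranslateEnfrWmt32k().generate_data(PARAMS['DATA_DIR'], PARAMS['TMP_DIR'])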

### Set Train Directory
You need to set a Train Dir which Tensor2Tensor will use to store the model checkpoints in. This can be used to load models and continue training from those checkpoints if stopped. The checkpoint files can get very large, because there are so many parameters in the model, so keep that in mind when selecting a directory.

Tensor2Tensor will *overwrite* model checkpoints stored in this directory, so if you don't want your old checkpoints erased, make sure to change this directory when training a model. 

Set the *PARAMS['keep\_checkpoint_max']* parameter below to let T2T know how many checkpoints to keep in the *TRAIN_DIR* before overwriting the oldest one.

In [12]:
PARAMS['TRAIN_DIR'] = '/data.nfs/awolf/translation/models/translate_enfr_wmt32k_rev/transformer_train' 

## Get TF Hparams Object
Tensor2Tensor creates a [TensorFlow HParams](https://www.tensorflow.org/api_docs/python/tf/contrib/training/HParams) object for training. If you have already set the *T2T_HPARAMS* value, the object is created for you automatically below, and you can then change it manually.

In [13]:
hparams = create_hparams(PARAMS['T2T_HPARAMS'])

### Make Any Hyperparameter Changes
View your hparams below and make any changes as needed.

In [14]:
hparams.batch_size = 1024
hparams.learning_rate_warmup_steps = 45000
hparams.learning_rate = .4
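
To check which values differ from the preset after edits like these, one option is to snapshot the hyperparameters as a plain dictionary before and after the changes (for example with `json.loads(hparams.to_json())`) and compare the two. The sketch below assumes such dictionaries are available. `diff_settings` is a hypothetical helper, and the "before" values are placeholders, not the real preset defaults.

```python
def diff_settings(before, after):
    """Return {name: (old, new)} for every setting whose value changed."""
    changed = {}
    for name, new_value in after.items():
        old_value = before.get(name)
        if old_value != new_value:
            changed[name] = (old_value, new_value)
    return changed


# Placeholder values for illustration only, not the real preset defaults.
before = {"batch_size": 2048, "learning_rate": 0.1, "hidden_size": 512}
after = dict(before, batch_size=1024, learning_rate=0.4)

for name, (old, new) in sorted(diff_settings(before, after).items()):
    print("{}: {} -> {}".format(name, old, new))
```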

View all Hparams for Training

In [15]:
json.loads(hparams.to_json())

{u'attention_dropout': 0.1,
 u'attention_dropout_broadcast_dims': u'',
 u'attention_key_channels': 0,
 u'attention_value_channels': 0,
 u'batch_size': 1024,
 u'clip_grad_norm': 0.0,
 u'compress_steps': 0,
 u'daisy_chain_variables': True,
 u'dropout': 0.2,
 u'eval_drop_long_sequences': False,
 u'eval_run_autoregressive': False,
 u'factored_logits': False,
 u'ffn_layer': u'dense_relu_dense',
 u'filter_size': 2048,
 u'force_full_predict': False,
 u'grad_noise_scale': 0.0,
 u'hidden_size': 512,
 u'initializer': u'uniform_unit_scaling',
 u'initializer_gain': 1.0,
 u'input_modalities': u'default',
 u'kernel_height': 3,
 u'kernel_width': 1,
 u'label_smoothing': 0.1,
 u'layer_postprocess_sequence': u'da',
 u'layer_prepostprocess_dropout': 0.1,
 u'layer_prepostprocess_dropout_broadcast_dims': u'',
 u'layer_preprocess_sequence': u'n',
 u'learning_rate': 0.4,
 u'learning_rate_cosine_cycle_steps': 250000,
 u'learning_rate_decay_rate': 1.0,
 u'learning_rate_decay_scheme': u'none',
 u'learning_rate_

## Set Options for Training

Keep these flags as they are

In [16]:
FLAGS.problems = PARAMS['T2T_Problem']
FLAGS.model = PARAMS['T2T_Model']
FLAGS.schedule = "train_and_evaluate"

### Train and Eval Steps
Set how many steps to train the model for. It is easiest to set a high number of train_steps and stop training once you reach the desired performance.

Each time a checkpoint is saved, the model runs evaluation for *eval_steps* steps; 100 is a good default.

In [17]:
PARAMS['train_steps'] = 1000000
PARAMS['eval_steps'] = 100

### How often to save Checkpoint/Evaluate Model

Use the flags below to control how often a model checkpoint is saved. Note that evaluation runs after each checkpoint is saved. You can control the frequency by:

1. Steps
	- Set *FLAGS.local\_eval_frequency* to the number of training steps to run between checkpoint saves and evaluations.
	- *FLAGS.save\_checkpoints_secs* must be set to 0.
2. Time
	- Set *FLAGS.save\_checkpoints_secs* to save a checkpoint and evaluate every x seconds while training.
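
The rule above can be written down as a small check. This is only a sketch that mirrors the description in this notebook, not Tensor2Tensor's own logic, and `checkpoint_trigger` is a hypothetical helper.

```python
def checkpoint_trigger(save_checkpoints_secs, local_eval_frequency):
    """Describe which setting drives checkpointing, per the rule above."""
    if save_checkpoints_secs:
        # A non-zero number of seconds selects time-based checkpointing.
        return "every {} seconds".format(save_checkpoints_secs)
    if local_eval_frequency:
        # With seconds set to 0, the step frequency takes over.
        return "every {} steps".format(local_eval_frequency)
    raise ValueError("Set either a step frequency or a number of seconds.")


print(checkpoint_trigger(0, 2000))  # every 2000 steps
```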

In [18]:
FLAGS.save_checkpoints_secs = 0
FLAGS.local_eval_frequency = 2000

PARAMS['local_eval_frequency'] = FLAGS.local_eval_frequency
PARAMS['save_checkpoints_secs'] = FLAGS.save_checkpoints_secs
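
As a quick sanity check on these settings, the arithmetic below estimates how many evaluation rounds a full run would trigger. It assumes evaluation runs exactly once per checkpoint, as described above, and uses the values set in this notebook.

```python
train_steps = 1000000        # PARAMS['train_steps']
eval_steps = 100             # PARAMS['eval_steps']
local_eval_frequency = 2000  # FLAGS.local_eval_frequency

# One checkpoint (and one evaluation round) every local_eval_frequency steps.
eval_rounds = train_steps // local_eval_frequency
total_eval_steps = eval_rounds * eval_steps

print(eval_rounds)       # 500
print(total_eval_steps)  # 50000
```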

### Control how many Checkpoints to Save in the *train_dir*

Set the parameter *PARAMS['keep_checkpoint_max']* below to control how many model checkpoints are kept in the *train_dir*. Once this number is reached during training, the oldest checkpoint is overwritten.

These checkpoint files can get very large, so keep that in mind when choosing how many to keep at once. The TensorBoard logs for all eval/train steps are kept in the *train_dir* regardless of this parameter.

In [19]:
PARAMS['keep_checkpoint_max'] = 3
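
The effect of this limit can be illustrated with a bounded queue. This is a simulation of the rotation behaviour only, not TensorFlow's checkpoint code; `simulate_checkpoint_rotation` is a hypothetical helper, and the save steps assume a checkpoint every 2000 steps.

```python
from collections import deque


def simulate_checkpoint_rotation(save_steps, keep_max):
    """Return the checkpoint names left on disk after all saves."""
    kept = deque(maxlen=keep_max)  # the oldest entry drops out when full
    for step in save_steps:
        kept.append("model.ckpt-{}".format(step))
    return list(kept)


# Five saves at steps 2000, 4000, ..., 10000 with room for three checkpoints.
remaining = simulate_checkpoint_rotation(range(2000, 12000, 2000), keep_max=3)
print(remaining)  # ['model.ckpt-6000', 'model.ckpt-8000', 'model.ckpt-10000']
```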

### GPU/Distributed Training Flags
- Use these to control single- and multi-GPU support.
- For more info, see Tensor2Tensor's [documentation](https://github.com/tensorflow/tensor2tensor/blob/master/docs/distributed_training.md).

In [20]:
os.environ["CUDA_VISIBLE_DEVICES"] = "1" # Set this to control which GPU is visible during training
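
To expose more than one GPU, the variable takes a comma-separated list of device indices. The sketch below uses a hypothetical helper, `select_gpus`, to build that string. The variable has to be set before TensorFlow initializes the GPUs, and the visible devices are renumbered from 0 inside the process, so physical GPU 1 appears as `gpu:0`.

```python
import os


def select_gpus(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES from a list of GPU indices and return it."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


print(select_gpus([1]))     # 1
print(select_gpus([0, 1]))  # 0,1
```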

In [21]:
FLAGS.gpu_memory_fraction = .99
FLAGS.worker_gpu = 1
FLAGS.ps_gpu = 2
FLAGS.log_device_placement = True
FLAGS.worker_replicas = 2

## Create Tensorflow Experiment Object
Tensor2Tensor trains with TensorFlow by creating an [experiment](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment) for you.

The first step is to create the run_config.

In [22]:
RUN_CONFIG = create_run_config(
      model_dir=PARAMS['TRAIN_DIR'],
      keep_checkpoint_max=PARAMS['keep_checkpoint_max'],
      save_checkpoints_secs=PARAMS['save_checkpoints_secs'],
      gpu_mem_fraction=FLAGS.gpu_memory_fraction
)

INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']


## Training the Model with Tensor2Tensor
Use the dictionary below to refer to all the important variables for this model's training.

In [23]:
PARAMS

{'DATA_DIR': '/data.nfs/awolf/translation/enfr_rev_train_data/translate_fr_en_wmt32k',
 'T2T_HPARAMS': 'transformer_base',
 'T2T_Model': 'transformer',
 'T2T_Problem': 'translate_enfr_wmt32k_rev',
 'TMP_DIR': '/data.nfs/shared/translation_t2t_data',
 'TRAIN_DIR': '/data.nfs/awolf/translation/models/translate_enfr_wmt32k_rev/transformer_train',
 'eval_steps': 100,
 'keep_checkpoint_max': 3,
 'local_eval_frequency': 2000,
 'save_checkpoints_secs': 0,
 'train_steps': 1000000}

To run the training function more than once without restarting the notebook kernel, run the commented-out code below before each additional training run.

In [24]:
# del hparams.data_dir
# del hparams.train_steps
# del hparams.eval_steps
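
A plain `del` raises an `AttributeError` if the attribute is not there, for example on the very first run. A more defensive variant only removes the attributes that exist. The sketch below is an illustration: `clear_attrs` is a hypothetical helper, and `FakeHParams` is a stand-in object used so the example runs without Tensor2Tensor.

```python
class FakeHParams(object):
    """Hypothetical stand-in for the hparams object used in this notebook."""
    pass


def clear_attrs(obj, names):
    """Delete the named attributes that exist on obj; return those removed."""
    removed = []
    for name in names:
        if hasattr(obj, name):
            delattr(obj, name)
            removed.append(name)
    return removed


fake = FakeHParams()
fake.data_dir = "/some/data_dir"  # placeholder value
fake.train_steps = 1000000

# eval_steps was never set, so it is skipped instead of raising an error.
print(clear_attrs(fake, ["data_dir", "train_steps", "eval_steps"]))
# ['data_dir', 'train_steps']
```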

Check the logs below to follow the model's loss and BLEU score.

In [None]:
exp_fn = create_experiment(
        run_config=RUN_CONFIG,
        hparams=hparams,
        model_name=PARAMS['T2T_Model'],
        problem_name=PARAMS['T2T_Problem'],
        data_dir=PARAMS['DATA_DIR'],
        train_steps=PARAMS['train_steps'],
        eval_steps=PARAMS['eval_steps']
    )
exp_fn.train_and_evaluate()

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 3, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0xa29d0d0>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.99
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 0, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': '', '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_is_chief': True, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 1000, '_environment': 'local', '_master': '', '_model_dir': '/data.nfs/awolf/translation/models/translate_enfr_wmt32k_rev/transformer_train', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0xa28cd10>, '_save_summary_steps': 100}
INFO:tensorflow:Using Vali