<a href="https://colab.research.google.com/github/jenh/epub-ocr-and-translate/blob/master/onmt-helpers/EOAT_OpenNMT_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

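Install OpenNMT-py and connect to Google Drive. The `onmt_preprocess` command used below belongs to the 1.x releases of OpenNMT-py (preprocessing was removed in 2.0), so pin a 1.x version if the latest release does not provide it: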

In [0]:
!pip install opennmt-py

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Get some sample data

Let's clone the OpenNMT-py project to grab some sample data to process. In Google Colab the clone lands in /content, so the sample data resides in /content/OpenNMT-py/data.

In [0]:
!git clone https://github.com/OpenNMT/OpenNMT-py

# Preprocess

Preprocess the files. This command is vanilla: no BPE, SentencePiece, or any other special sauce, so we're training straight from the raw corpus. Customize it as needed.

In [0]:
!onmt_preprocess -train_src /content/OpenNMT-py/data/src-train.txt \
-train_tgt /content/OpenNMT-py/data/tgt-train.txt \
-valid_src /content/OpenNMT-py/data/src-val.txt \
-valid_tgt /content/OpenNMT-py/data/tgt-val.txt \
-save_data /content/drive/My\ Drive/my_train_data --lower \
--share_vocab
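
Parallel corpora are aligned by line, so a source file and a target file that differ in length usually point to a data problem. As a minimal sketch (the helper names `count_lines` and `check_parallel` are hypothetical, not part of OpenNMT-py), a check like this could be run before preprocessing:

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise ValueError if the files differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}")
    return n_src


if __name__ == "__main__":
    data = Path("/content/OpenNMT-py/data")  # location of the cloned sample data
    if data.is_dir():
        for split in ("train", "val"):
            n = check_parallel(data / f"src-{split}.txt", data / f"tgt-{split}.txt")
            print(f"{split}: {n} sentence pairs")
```

If the check raises, fix the corpus first; preprocessing misaligned files would pair sentences with the wrong translations.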


# Check GPU

Verify GPU status.

In [0]:
!nvidia-smi
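
If you would rather check for a GPU from Python (for example, to stop early when the runtime has none), something like the following could work. It is only a sketch: `list_gpus` is a hypothetical helper that shells out to `nvidia-smi` and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU visible; check Runtime > Change runtime type")
```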


# Train (First run)

Create and run a training command like the following. Yours may differ depending on your aims; I cribbed these settings from a lot of Googling, and they are, as always, subject to change. On Google Colab, you want -world_size 1 (we have one GPU available to us) and -gpu_ranks 0 (its ID is 0).

The step counts are also really low here, just so that you get quicker results with the sample data. The second training run uses more realistic numbers.

In [0]:
!onmt_train -data /content/drive/My\ Drive/my_train_data \
-save_model /content/drive/My\ Drive/my-model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 3000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 \
-batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 \
-decay_method noam -learning_rate 2 -max_grad_norm 0 -param_init 0 \
-param_init_glorot -label_smoothing 0.1 -valid_steps 1000 -save_checkpoint_steps 1000 \
-report_every 1000 -world_size 1 -gpu_ranks 0
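
With -save_checkpoint_steps 1000, the run writes checkpoints named after the -save_model prefix, such as my-model_step_1000.pt. To find the most recent one without hard-coding the step, a small helper along these lines might do. This is a sketch: `latest_checkpoint` is a hypothetical function, and it assumes the `<prefix>_step_<N>.pt` naming seen in this notebook.

```python
import re
from pathlib import Path


def latest_checkpoint(directory, prefix="my-model"):
    """Return the Path of the highest-step '<prefix>_step_<N>.pt' file, or None."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    pattern = re.compile(re.escape(prefix) + r"_step_(\d+)\.pt$")
    best_step, best_path = -1, None
    for path in directory.glob(f"{prefix}_step_*.pt"):
        match = pattern.match(path.name)
        # Compare steps numerically so that step 3000 beats step 200.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("/content/drive/My Drive"))
```

The steps are compared as integers because a plain alphabetical sort would rank `step_200` above `step_1000`.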


# Train (from model)

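If training gets interrupted, restart it with -train_from, pointing at a saved checkpoint, to pick up where you left off. The example below resumes from the step-1000 checkpoint of the first run and saves new checkpoints under a different name (my-2nd-model).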

In [0]:
!onmt_train -data /content/drive/My\ Drive/my_train_data \
-save_model /content/drive/My\ Drive/my-2nd-model \
-train_from /content/drive/My\ Drive/my-model_step_1000.pt \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 \
-batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 \
-decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 \
-param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 3000 \
-report_every 1000 -world_size 1 -gpu_ranks 0
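
A note on -learning_rate 2 with -decay_method noam: the value acts as a multiplier on a schedule, not as the learning rate itself. The sketch below follows the Noam formula from the original Transformer paper, with linear warmup followed by inverse square root decay. It assumes that OpenNMT-py scales that formula by -learning_rate and uses -rnn_size as the model dimension; `noam_lr` is a hypothetical helper, not a library function.

```python
def noam_lr(step, learning_rate=2.0, model_size=512, warmup_steps=8000):
    """Learning rate at a given step (step >= 1) under the Noam schedule."""
    scale = model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * scale


# Defaults mirror the flags above: -learning_rate 2, -rnn_size 512, -warmup_steps 8000.
for step in (1, 1000, 4000, 8000, 200000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

Under these assumptions the rate climbs linearly until step 8000 and then decays, so a short run never gets past the warmup phase.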

# Translate

Now that you've got a model, you're ready to translate! Let's use the sample data from the OpenNMT-py project.

In [0]:
!onmt_translate --model /content/drive/My\ Drive/my-model_step_1000.pt \
--src /content/OpenNMT-py/data/src-test.txt --output /content/drive/My\ Drive/translation-output.txt \
--replace_unk -verbose

In [0]:
!cat /content/drive/My\ Drive/translation-output.txt
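
Reading the raw output on its own makes it hard to judge quality. Here is a minimal sketch for viewing the first few source sentences next to their translations. It assumes the output file holds one translation per source line, in the same order; `side_by_side` is a hypothetical helper.

```python
from itertools import islice
from pathlib import Path


def side_by_side(src_path, hyp_path, limit=5):
    """Return up to `limit` (source, translation) pairs, matched by line number."""
    with Path(src_path).open(encoding="utf-8") as src, \
         Path(hyp_path).open(encoding="utf-8") as hyp:
        pairs = zip((s.rstrip("\n") for s in src), (h.rstrip("\n") for h in hyp))
        # Materialize the pairs while both files are still open.
        return list(islice(pairs, limit))


src_file = Path("/content/OpenNMT-py/data/src-test.txt")
out_file = Path("/content/drive/My Drive/translation-output.txt")
if src_file.exists() and out_file.exists():
    for source, translation in side_by_side(src_file, out_file):
        print(f"SRC: {source}\nOUT: {translation}\n")
```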