# Machine Translation English-German Example Using SageMaker Seq2Seq

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Download dataset and preprocess](#Download-dataset-and-preprocess)
4. [Training the Machine Translation model](#Training-the-Machine-Translation-model)
5. [Inference](#Inference)

## Introduction

Welcome to our Machine Translation end-to-end example! In this demo, we will train an English-German translation model and test its predictions on a few examples.

The SageMaker Seq2Seq algorithm is built on top of [Sockeye](https://github.com/awslabs/sockeye), a sequence-to-sequence framework for Neural Machine Translation based on MXNet. SageMaker Seq2Seq implements state-of-the-art encoder-decoder architectures, which can be used for tasks such as Abstractive Summarization in addition to Machine Translation.

To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on.

## Setup

Let's start by specifying:
- The S3 bucket and prefix that you want to use for training and model data. **This should be within the same region as the Notebook Instance, training, and hosting.**
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call in the cells below with the appropriate full IAM role ARN string(s).

In [1]:
import sagemaker

sagemaker_session = sagemaker.Session()

# S3 bucket and prefix
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-seq2seq"
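As a rough illustration of how the bucket and prefix combine into S3 locations, the sketch below builds URIs with a hypothetical helper, `s3_uri`. The helper is not part of the SageMaker SDK, and the bucket name and the `train` and `validation` sub-paths are placeholders chosen for illustration.

```python
def s3_uri(bucket, prefix, *parts):
    """Join a bucket, a key prefix, and optional sub-paths into an s3:// URI."""
    # Strip stray slashes so that "a/" + "/b" does not produce "a//b".
    segments = [s.strip("/") for s in (prefix, *parts) if s and s.strip("/")]
    return "s3://" + "/".join([bucket.strip("/"), *segments])


# Placeholder bucket name; in the notebook this comes from default_bucket().
example_bucket = "my-example-bucket"
example_prefix = "sagemaker/DEMO-seq2seq"

train_uri = s3_uri(example_bucket, example_prefix, "train")
validation_uri = s3_uri(example_bucket, example_prefix, "validation")
print(train_uri)
print(validation_uri)
```

Keeping all paths under one prefix makes it easier to find, and later clean up, everything this example writes to the bucket.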

In [2]:
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()
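If you supply a role ARN string by hand instead of calling `get_execution_role()`, a quick format check can catch typos before a training job is launched. The following is a minimal sketch: the helper `is_role_arn`, the regular expression, and the example ARN (placeholder account ID and role name) are assumptions for illustration. The check covers only the shape of the string, not whether the role exists or has the right permissions.

```python
import re

# Assumed shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account ID>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-zA-Z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value):
    """Return True if value looks like a full IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Hypothetical ARN with a placeholder account ID and role name.
explicit_role = "arn:aws:iam::123456789012:role/MySageMakerRole"
print(is_role_arn(explicit_role))
print(is_role_arn("MySageMakerRole"))  # a bare role name is not a full ARN
```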

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [3]:
from time import gmtime, strftime
import time
import numpy as np
import os
import json

# For plotting attention matrix later on
import matplotlib

%matplotlib inline
import matplotlib.pyplot as plt

## Download dataset and preprocess

In this notebook, we will train an English-to-German translation model on a dataset from the
[Conference on Machine Translation (WMT) 2017](http://www.statmt.org/wmt17/).

In [4]:
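%%bash
# Download the German and English training corpora in parallel, then wait for both.
wget http://data.statmt.org/wmt17/translation-task/preprocessed/de-en/corpus.tc.de.gz & \
wget http://data.statmt.org/wmt17/translation-task/preprocessed/de-en/corpus.tc.en.gz & wait
# Decompress both corpora in parallel.
gunzip corpus.tc.de.gz & \
gunzip corpus.tc.en.gz & wait
# Download and unpack the validation (dev) sets; -p avoids an error if the directory already exists.
mkdir -p validation
wget http://data.statmt.org/wmt17/translation-task/preprocessed/de-en/dev.tgz
tar xzf dev.tgz -C validation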
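
Once the files are unpacked, it can be worth checking that the two sides of the parallel corpus are line-aligned, since each English line is expected to pair with the German line at the same position. The sketch below is one way to do that. The helper names `count_lines` and `is_line_aligned` are hypothetical, and the tiny temporary files stand in for the real corpus so the example runs on its own.

```python
import os
import tempfile


def count_lines(path):
    """Count lines in a text file without loading it all into memory."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def is_line_aligned(source_path, target_path):
    """Return True if both files have the same number of lines."""
    return count_lines(source_path) == count_lines(target_path)


# Tiny stand-in files so the sketch runs without the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "sample.en")
    de_path = os.path.join(tmp, "sample.de")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("hello world\ngood morning\n")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("hallo Welt\nguten Morgen\n")
    aligned = is_line_aligned(en_path, de_path)

print(aligned)
```

On the downloaded data, the equivalent call would be `is_line_aligned("corpus.tc.en", "corpus.tc.de")`. A mismatch would suggest a truncated download or a failed decompression.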

 84450K .......... .......... .......... .......... .......... 32%  153M 12s
 84500K .......... .......... .......... .......... .......... 32%  168M 12s
 84550K .......... .......... .......... .......... .......... 32%  145M 12s
 84600K .......... .......... .......... .......... .......... 32%  167M 12s
 84650K .......... .......... .......... .......... .......... 32% 44.1M 12s
 84700K .......... .......... .......... .......... .......... 32%  126M 12s
 84750K .......... .......... .......... .......... .......... 32%  163M 12s
 84800K .......... .......... .......... .......... .......... 32% 31.9M 12s
 84850K .......... .......... ......

 86650K .......... .......... .......... .......... .......... 33%  111M 12s
 86700K .......... .......... 28% 21.2M 16s
 83200K .......... .......... .......... .......... .......... 28% 36.2M 16s
 83250K .......... .......... .......... .......... .......... 28% 41.1M 16s
 83300K .......... ........ .......... .......... .......... .......... 33% 8.12M 12s
 86750K .......... .......... .......... .......... .......... 33%  113M 12s
 86800K .......... .......... .......... .......... .......... 33% 86.9M 12s
 86850K .......... .......... .......... .......... .......... 33%  102M 12s
 86900K .......... .......... .......... .......... .......... 33% 90.3M 12s
 86950K .......... .......... .......... .......... .......... 33% 79.2M 12s
 87000K .......... .......... .......... .......... .......... 33% 64.3M 12s
 87050K .......... .......... .......... .......... .......... 33%  117M 12s
 87100K .......... .......... .......... .......... .......... 33%  106M 12s
 87150K .......... ....

 85550K .......... .......... .......... .......... .......... 29% 94.1M 16s
 85600K .......... .......... .......... .......... .......... 29% 79.6M 16s
 85650K .......... .......... .......... .......... .......... 29% 76.9M 15s
 85700K .......... .......... .......... .......... .......... 29%  178M 15s
 85750K .......... .......... .............. .......... 34% 2.37M 12s
 89650K .......... ........ .......... .......... 29% 63.7M 15s
 85800K .......... .......... .......... .......... .......... 29% 82.3M 15s
 85850K .......... .......... .......... .......... .......... 29%  161M 15s
 85900K .......... .......... .......... .......... .......... 29% 85.5M 15s
 85950K .......... .......... .......... .......... .......... 29% 42.5M 15s
 86000K .......... .......... .......... ........ .......... .......... .......... 34% 1.64M 12s
 89700K .......... .......... .......... .......... .......... 34% 62.8M 12s
 89750K .......... .......... .......... .......... .......... 34%  193M 12s

 92400K .......... .......... .......... .......... .......... 35% 25.4M 12s
 92450K .......... .......... .......... .......... .......... 35% 42.5M 12s
 92500K .......... .......... .......... .......... .......... 35% 34.3M 12s
 92550K .......... .......... .......... .......... .......... 35% 19.9M 12s
 92600K .......... .......... .......... .......... .......... 36% 5.40M 12s
 92650K .......... .......... .... .......... .......... 30% 1018K 15s
 88150K .......... .......... .......... .......... .......... 30% 77.3M 15s
 88200K .......... .......... .......... .......... .......... 30%  166M 15s
 88250K .......... .......... .......... .......... .......... 30%  144M 15s
 88300K .......... .......... .......... .......... .......... 30%  107M 15s
 88350K .......... .......... .......... .......... .......... 30% 59.6M 15s
 88400K .......... .......... .......... .......... .......... 30% 84.2M 15s
 88450K .......... .......... .......... .......... .......... 30%  179M 15s
 8850

 90900K .......... .......... .......... .......... .......... 31%  116M 15s
 90950K .......... .......... .......... .......... ...... .......... .......... 31% 85.5M 15s
 91000K .......... .......... .......... .......... .......... 31%  111M 15s
 91050K .......... .......... .......... .......... .......... 31% 75.4M 15s
 91100K .......... .......... .......... .......... .......... 31%  205M 15s
 91150K .......... .......... .......... ............ 36% 1.47M 11s
 95000K .......... .......... .......... .......... .......... 36% 31.5M 11s
 95050K .......... .......... .......... .......... .......... 36% 4.51M 11s
 95100K .......... .......... .......... .......... .......... 36% 6.72M 11s
 95150K .......... .......... .......... .......... .......... 37% 31.9M 11s
 95200K .......... .......... .......... .......... .......... 37% 38.2M 11s
 95250K .......... .......... .......... .......... .......... 37% 23.3M 11s
 95300K .......... .......... .......... .......... .......... 37% 

 98000K .......... .............. .......... 31% 16.0M 15s
 93250K .......... .......... .......... .......... .......... 31% 45.8M 15s
 93300K .......... .......... .......... .......... .......... 31% 67.0M 15s
 93350K .......... .......... .......... .......... .......... 31% 26.7M 15s
 93400K .......... .......... .......... .......... .......... 31% 46.0M 15s
 93450K .......... .......... .......... .......... .......... 31% 29.2M 15s
 93500K .......... .......... .......... .......... .......... 31%  126M 15s
 93550K .......... .......... .......... .......... .......... 31% 30.1M 15s
 93600K .......... .......... .......... .......... .......... 31% 29.5M 15s
 93650K .......... .......... .......... .......... .......... 32% 30.2M 15s
 93700K .......... .......... .......... .......... ........ .......... ........ 32% 44.4M 15s
 93750K .......... .......... .......... .......... .......... 32% 60.2M 15s
 93800K .......... .......... .......... .......... .......... 32% 66.9M 15s

100500K .......... .......... .......... .......... .......... 39%  178M 11s
100550K .......... .......... .......... .......... .......... 39%  108M 11s
100600K .......... .......... .......... .......... .......... 39% 67.8M 11s
100650K .......... .......... .......... .......... .......... 39% 53.3M 11s
100700K .......... .......... .......... .......... .......... 39% 34.7M 11s
100750K .......... .......... .......... .......... .......... 39% 79.9M 11s
100800K .......... .......... .......... .......... .......... 39% 30.4M 11s
100850K .......... .......... .......... .......... .......... 39% 69.4M 11s
100900K .......... .......... .......... .......... .......... 39% 30.7M 11s
100950K .......... .......... .......... .......... .......... 39%  118M 11s
101000K .......... .......... .... .......... .......... 32% 1.03M 14s
 96100K .......... .......... .......... .......... .......... 32%  124M 14s
 96150K .......... .......... .......... .......... .......... 32% 57.6M 14s
 9620

102800K .......... .......... .......... .......... .......... 39% 86.8M 11s
102850K .......... .......... .......... .......... .......... 40% 92.1M 11s
102900K .......... .......... .......... .......... .......... 40% 72.6M 11s
102950K .......... .......... .......... .......... .......... 40%  106M 11s
103000K .......... .......... .......... .......... .......... 40% 73.8M 11s
103050K .......... .......... .......... .......... .......... 40%  129M 11s
103100K .......... .......... .......... .......... .......... 40% 99.0M 11s
103150K .......... .......... .......... .......... .......... 40% 78.5M 11s
103200K .......... .......... .......... .......... .......... 40% 64.5M 11s
103250K .......... .......... .......... .......... .......... 40% 69.2M 11s
103300K .......... .......... .......... .......... .......... 40% 76.0M 11s
103350K .......... .......... .......... .......... .......... 40% 62.7M 11s
103400K .......... .......... .......... .......... .......... 40%  109M 11s

105400K .......... .......... .......... .......... .......... 40%  147M 10s
105450K .......... .......... .......... .......... .......... 41% 57.4M 10s
105500K .......... .......... .......... .......... .......... 41%  123M 10s
105550K .......... .......... .......... .......... .......... 41%  107M 10s
105600K .......... .......... .......... .......... .......... 41% 83.2M 10s
105650K .......... .......... .......... .......... .......... 41%  109M 10s
105700K .......... .......... .......... .......... .......... 41% 77.1M 10s
105750K .......... .......... .......... .......... .... .......... 34% 1011K 14s
101900K .......... .......... .......... .......... .......... 34%  104M 14s
101950K .......... ................ 41% 41.5M 10s
105800K .......... .......... .......... ............ .......... .......... .......... 34% 67.3M 14s
102000K .......... ...... .......... 41% 43.7M 10s
105850K .......... .......... ...... .......... .......... ................ .......... .......... 41

104700K .......... .......... .......... .......... .......... 35%  169M 13s
104750K .......... .......... .......... .......... .......... 35% 89.2M 13s
104800K .......... .......... .......... .......... .......... 35% 97.1M 13s
104850K .......... .......... .......... .......... .......... 35% 51.4M 13s
104900K .......... .......... .............. .......... .......... .......... .......... 41% 1.48M 10s
107950K .......... .......... .......... .......... .......... 41% 5.98M 10s
108000K .......... .......... .......... .......... .......... 42% 3.14M 10s
108050K .......... ...... .......... .......... 35% 1.06M 13s
104950K .......... .......... .......... .......... .......... 35% 51.3M 13s
105000K .......... .......... .......... .......... .......... 35%  125M 13s
105050K .......... .......... .......... .......... .......... 35% 44.8M 13s
105100K .......... .......... .......... .......... .......... 35% 41.4M 13s
105150K .......... .......... .......... .......... .......... 35

Process is interrupted.



Note that it is common practice to split words into subwords using Byte Pair Encoding (BPE). Refer to [this tutorial](https://github.com/awslabs/sockeye/tree/master/tutorials/wmt) if you are interested in performing BPE.
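
To give a feel for what BPE does, here is a minimal toy sketch of learning merge rules from a handful of words. It is only an illustration under simplifying assumptions (whitespace-free words, an assumed `</w>` end-of-word marker, a hypothetical `learn_bpe` helper); it is not the tooling used in the linked tutorial and is not part of this notebook's pipeline.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent every word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


toy_words = ["low", "low", "lower", "newest", "newest", "widest"]
bpe_merges = learn_bpe(toy_words, num_merges=3)
```

Each entry in `bpe_merges` is a pair of symbols that were fused into one subword unit; applying the merges in order to new text splits rare words into known pieces.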

Since training on the whole dataset might take several hours or even days, this demo trains on the **first 10,000 lines only**. Don't run the next cell if you want to train on the complete dataset.

In [5]:
!head -n 10000 corpus.tc.en > corpus.tc.en.small
!head -n 10000 corpus.tc.de > corpus.tc.de.small
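
The `head` commands fail if the corpus files are missing, for example when the download was interrupted. As a hedged alternative, the sketch below does the same truncation in Python and only runs when both files exist. The helper name `take_first_lines` is hypothetical and not part of the notebook.

```python
import itertools
import os


def take_first_lines(src_path, dst_path, n):
    """Copy the first `n` lines of `src_path` to `dst_path`; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines without reading the rest of the file.
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written


# Only run when both corpus files were downloaded successfully.
if os.path.exists("corpus.tc.en") and os.path.exists("corpus.tc.de"):
    n_en = take_first_lines("corpus.tc.en", "corpus.tc.en.small", 10000)
    n_de = take_first_lines("corpus.tc.de", "corpus.tc.de.small", 10000)
    # A parallel corpus must stay aligned line by line.
    assert n_en == n_de, "source and target subsets differ in length"
```

Checking that both subsets have the same number of lines is a cheap guard against a truncated download on one side of the parallel corpus.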

head: cannot open ‘corpus.tc.en’ for reading: No such file or directory



head: cannot open ‘corpus.tc.de’ for reading: No such file or directory



Now, let's use the preprocessing script `create_vocab_proto.py` (provided with this notebook) to create vocabulary mappings (strings to integers) and convert these files to x-recordio-protobuf, as required for training by SageMaker Seq2Seq.  
Uncomment the line in the cell below and run it to check the arguments this script expects.

In [6]:
%%bash
# python3 create_vocab_proto.py -h
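
As a rough illustration of what a "strings to integers" vocabulary mapping looks like, here is a minimal sketch. It is not the code in `create_vocab_proto.py`; the function names `build_vocab` and `encode` are hypothetical, and the reserved symbols and their ids are assumptions made for this example.

```python
from collections import Counter

# Assumed reserved symbols: padding, unknown word, sentence start and end.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]


def build_vocab(sentences, min_count=1):
    """Map each whitespace-separated token to an integer id."""
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent tokens get the lowest ids; ties are broken alphabetically.
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to a list of ids, using <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in sentence.split()]


toy_vocab = build_vocab(["the cat sat", "the dog sat down"])
toy_ids = encode("the bird sat", toy_vocab)
```

Here `bird` never appears in the toy training sentences, so it is mapped to the id of `<unk>`. The same fallback is why subword segmentation such as BPE helps: fewer tokens end up unknown.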


The cell below does the preprocessing. If you are using the complete dataset, the script might take around 10-15 minutes on an m4.xlarge notebook instance. Remove ".small" from the file names to train on the full dataset.

In [7]:
%%time
%%bash
python3 create_vocab_proto.py \
        --train-source corpus.tc.en.small \
        --train-target corpus.tc.de.small \
        --val-source validation/newstest2014.tc.en \
        --val-target validation/newstest2014.tc.de

154100K .......... .......... .............. .......... .......... .......... 60% 13.8M 6s
156900K .......... .... .......... .......... 52% 59.7M 9s
154150K .......... .......... .......... .......... .......... 52% 85.8M 9s
154200K .......... .......... ............ .......... .......... .......... 61% 32.4M 6s

156250K .......... .......... ...... 62% 77.0M 6s
159700K .......... .......... .......... .......... 53% 64.9M 9s
156300K .......... .......... .......... .......... .......... 53% 59.0M 9s
156350K .......... ................ .......... .......... ...... .......... .......... .......... 53% 78.6M 9s
156400K .......... .......... .......... .......... .......... 53% 30.9M 9s
156450K .......... .......... .......... .......... .......... 53% 83.6M 9s
156500K .......... .......... .......... .......... .......... 62% 11.0M 6s
159750K .......... .......... .......... .......... .......... 62% 74.5M 6s
159800K .............. 53% 40.6M 9s
156550K .......... .......... .......... .......... .......... 53% 42.2M 9s
156600K .......... .......... .......... .......... .......... 53% 91.5M 9s
156650K .......... .......... .......... .......... .......... 53% 57.0M 9s
156700K .......... .......... .......... .......... .......... 53% 34.4M 9s
156750K .......... .......... .......... .......... ..

162450K .......... .......... ................ 54% 49.0M 9s
158950K .......... .......... .......... .......... .......... 54% 83.7M 9s
159000K ........ .......... ........ .......... .......... .............. .......... 54% 87.2M 9s
159050K ...... 63% 35.2M 6s
162500K .......... ...... .......... .......... ................ .......... .......... 54% 46.5M 9s
159100K .......... .......... .......... ........ .......... ........ .......... 54% 61.1M 9s
159150K .......... .......... .......... .......... .......... 54%  110M 9s...... 63% 19.4M 6s
162550K .......... .......... .......... .......... .......... 63%  126M 6s
162600K ........
159200K .......... .......... .......... .......... .......... 54% 41.3M 9s
159250K .......... ...... .......... .......... .......... .......... 63% 46.2M 6s
162650K ............ .......... .......... .......... 54% 51.4M 9s
159300K .......... ...... .......... .......... .......... .......... .......... 54% 50.6M 9s
159350K .......... .......... ......

161850K .......... .......... .......... .......... .......... 55% 68.8M 9s
161900K .......... .......... .......... .......... .......... 55% 80.2M 9s
161950K ...... .......... .......... .......... .......... .......... .......... 55% 63.4M 9s
162000K .......... .......... .......... .......... .......... 55% 56.1M 9s
162050K .......... .......... .......... .......... .......... 55% 59.3M 9s
162100K .......... .......... .......... .......... ............ 64% 5.99M 6s
165000K .............. 55% 95.3M 9s
162150K .......... .......... ........ .......... .......... ...... ............ 55% 51.5M 9s
162200K .......... .......... .......... ................ .......... .......... 55% 18.6M 9s
162250K .......... .......... .......... .......... .......... 55%  191M 9s
162300K .......... .......... .......... .......... .......... 55%  230M 9s
162350K .......... .......... .......... .......... .......... 55%  148M 9s
162400K .......... .......... .......... .......... .......... 55%  166M 

167450K .......... .......... .......... .......... .......... 65%  258M 6s
167500K .......... .......... 56% 1.19M 8s
164800K .......... .......... .......... .......... .......... 56% 79.2M 8s
164850K .......... .......... ................ .......... .......... .......... 56% 80.9M 8s
164900K .......... .. .......... .......... .......... 65% 23.6M 6s
167550K .......... .......... .......... .......... .......... 65%  156M 6s
167600K .......... .............. .......... .......... .......... 56% 19.5M 8s
164950K .......... .......... .......... .......... .......... 56%  111M 8s
165000K .......... .......... .......... .............. .......... .......... .......... 65% 16.4M 6s
167650K .......... .......... .......... .......... .......... 65%  169M 6s
167700K .......... .......... .......... .......... .... .......... 56% 17.2M 8s
165050K .......... .......... .......... .......... .......... 56% 95.3M 8s
165100K .......... .......... .......... .......... .......... 56%  201M 8s
1

170000K .......... .......... .......... .......... .......... 66%  146M 6s
170050K .......... .......... .......... .......... .......... 66%  182M 6s
170100K .......... .......... .......... .......... .......... 66% 88.3M 6s
170150K .......... .......... .......... .......... .......... 66%  133M 6s
170200K .......... .......... .......... .......... .......... 66% 90.5M 6s
170250K .......... .......... .......... .......... .......... 66%  109M 6s
170300K .......... .......... .......... .......... .......... 66% 85.0M 6s
170350K .......... .......... .......... .......... .......... 66% 75.5M 6s
170400K .......... .......... .......... .......... .......... 66% 86.2M 6s
170450K .......... .......... .......... .......... .......... 66% 81.3M 6s
170500K .......... .......... .......... .......... .......... 66%  123M 6s
170550K .......... .......... .......... .......... .......... 66% 97.4M 6s
170600K .......... .......... .......... .......... .......... 66% 48.0M 6s
170650K ....

170300K .......... .......... .......... ........ .......... ........ .......... 58% 72.4M 8s
170350K .......... .......... .......... .......... .......... 58% 80.2M 8s
170400K .......... .......... .......... ........ .......... 67% 7.24M 5s........ .......... 58% 85.1M 8s
170450K .......... ....
172800K .......... .......... .......... ........ .......... .......... .......... 58% 60.8M 8s
170500K .......... .......... .......... .......... .......... 58%  124M 8s
170550K .......... .......... .......... .......... .......... 58%  101M 8s
170600K ................ .......... .......... .......... .......... .......... 58% 73.0M 8s
170650K .......... .......... .......... .......... 67% 2.01M 5s
172850K .......... .......... .......... .......... .......... 67% 8.87M 5s
172900K .......... .......... .......... .......... .......... 67% 35.9M 5s
172950K .......... .......... .......... .......... .......... 67% 4.92M 5s
173000K .......... .......... .......... .......... .......... 67%

175750K .......... .......... .......... .......... 59% 1.23M 8s
172750K .......... .......... .......... .......... .......... 59% 69.0M 8s.... .......... ..
172800K .......... .......... .......... .......... .......... 59% 22.9M 8s
172850K .......... .......... .......... .......... .......... 59% 85.9M 8s
172900K .......... .......... .......... .......... .......... 59% 30.1M 8s
172950K .......... .......... .......... .......... .......... 59%  123M 8s
173000K .......... .......... .......... .......... .......... 59%  140M 8s
173050K .......... .......... .......... .......... .......... 59% 29.8M 8s
173100K .......... .......... .......... .......... .......... 59%  120M 8s
173150K .......... .......... .......... ............ 68% 5.04M 5s
175800K .......... .......... .......... .......... .......... 68% 28.5M 5s
175850K ............ .......... 59% 21.0M 8s
173200K .......... .......... .......... .......... .......... 59% 79.8M 8s
173250K .......... .......... .............. 

175600K .......... .......... .......... .......... ................ .......... .......... 69% 6.70M 5s
178300K .......... .......... .......... .......... .......... 69% 75.6M 5s
178350K .......... .......... .......... .......... .......... 69% 34.3M 5s
178400K .......... .......... .......... .......... .......... 69% 23.6M 5s
178450K .......... .......... .......... .......... .......... 69% 35.5M 5s
178500K .......... .......... .......... .......... .......... 69% 6.91M 5s
178550K .......... .......... .......... .......... .......... 69%  114M 5s
178600K .......... .......... .......... .......... .......... 69% 5.66M 5s
178650K .......... .......... .......... .......... .......... 69% 2.44M 5s
178700K .......... .......... .......... ........ 60% 1.04M 8s
175650K .......... .......... .......... .......... .......... 60%  111M 8s
175700K .......... .......... ............ .......... 69% 34.6M 5s
178750K .......... .......... .......... ...... .......... .......... 60% 45.5M 8s

178000K .......... .......... .......... .......... .......... 60% 78.6M 7s
178050K .......... .......... .......... .......... .......... 60% 79.0M 7s
178100K .......... .......... .......... .......... .......... 60% 89.8M 7s
178150K .......... .......... .......... .......... .......... 60%  117M 7s
178200K .......... .......... .......... .......... .......... 60% 89.4M 7s
178250K .......... .......... .......... .......... .......... 60% 93.2M 7s
178300K .......... .......... .......... .......... .......... 60%  173M 7s
178350K .......... .......... .......... .......... .......... 60% 99.0M 7s
178400K .......... .......... .......... .......... .......... 60% 71.2M 7s
178450K .......... .......... .......... .......... .......... 60%  101M 7s
178500K .......... .......... .......... .......... .......... 61% 93.5M 7s
178550K .......... .......... .......... .......... .......... 61% 95.8M 7s
178600K .......... .......... .......... .......... .......... 61%  133M 7s
178650K ....

183950K ...... .......... 61% 1.35M 7s
180750K .......... .......... .......... .......... .......... 71% 65.5M 5s
184000K .......... .......... .......... .................. ........ .......... .......... .......... .......... 61% 39.4M 7s
180800K .......... .......... .......... .......... .......... 61% 41.3M 7s
180850K .......... .......... ............ 71% 18.4M 5s
184050K .......... .......... .......... .......... .......... 71%  114M 5s
184100K .......... .......... .......... .......... .......... 71%  123M 5s
184150K ......... .......... .......... 61% 26.3M 7s
180900K .......... ........... .......... .......... .... .......... .......... .......... .......... 71% 46.3M 5s
184200K .............. 61% 26.9M 7s
180950K .......... .......... .......... .......... .......... 61%  165M 7s
181000K .......... .......... .......... .......... .......... 71% 21.6M 5s
184250K .......... .......... .... .......... .......... .......... .......... 61% 31.0M 7s
181050K .......... ........

186500K .......... .......... .......... .......... .......... 72% 88.6M 4s
186550K .......... .......... .......... .......... .......... 72% 65.5M 4s
186600K .......... .......... .......... .......... .......... 72% 51.4M 4s
186650K .......... .......... .......... .......... .......... 72% 46.3M 4s
186700K .......... .......... .......... .......... .......... 72% 98.2M 4s
186750K .......... .......... .......... .......... .......... 72% 52.4M 4s
186800K .......... .......... .......... .......... .......... 72% 29.7M 4s
186850K .......... .......... ..........
183600K .......... .......... .......... .......... ........ .......... .... 62% 1.06M 7s
183650K .......... .............. .............. 72% 48.8M 4s
186900K .......... .......... .......... .......... .......... 72%  131M 4s
186950K .......... .......... .......... 62% 42.8M 7s
183700K .......... .......... .......... .......... .......... 62%  131M 7s
183750K .......... .......... .......... .......... .. .......... ...

188900K .......... .......... .......... .......... .......... 73%  113M 4s
188950K .......... .......... .......... .......... .......... 73% 80.1M 4s
189000K .......... .......... .......... .......... .......... 73%  117M 4s
189050K .......... .......... .......... .......... .......... 73% 68.2M 4s
189100K .......... .......... .......... .......... .......... 73%  161M 4s
189150K .......... .......... .......... .......... .......... 73%  138M 4s
189200K .......... .......... .......... .......... .......... 73% 74.5M 4s
189250K .......... .......... .......... .......... .......... 73% 25.3M 4s
189300K .......... .......... .......... .......... .......... 73%  138M 4s
189350K .......... .......... .......... .......... .......... 73%  118M 4s
189400K .......... .......... .......... .......... .......... 73%  118M 4s
189450K .......... .......... .......... .............. .......... .......... 63% 1.17M 7s
186650K .......... .......... .......... .......... ...... .......... 63%

191350K .......... .......... .......... .......... .......... 74% 3.78M 4s
191400K .......... .......... .......... .......... .......... 74%  111M 4s
191450K .......... .......... .......... .......... .......... 74%  130M 4s
191500K .......... .......... .......... .......... .......... 74%  108M 4s
191550K .......... .......... .......... .......... .......... 74% 94.3M 4s
191600K .......... .......... .......... .......... .......... 74% 85.6M 4s
191650K .......... .......... .......... .......... .......... 74% 83.3M 4s
191700K .......... .......... .......... .......... .......... 74%  102M 4s
191750K .......... .......... .......... .......... .......... 74% 90.5M 4s
191800K .......... .......... .......... .......... .......... 74% 93.7M 4s
191850K .......... .......... .......... .......... .......... 74%  105M 4s
191900K .......... .......... .......... .......... .......... 74% 97.0M 4s
191950K .......... .......... .......... .......... .......... 74% 78.1M 4s
192000K ....

192100K .......... .......... .......... .......... .......... 65%  124M 6s
192150K .......... .......... .......... .......... .......... 65%  103M 6s
192200K .......... .......... .......... .......... .......... 65% 92.0M 6s
192250K .......... .......... .......... .......... .......... 65% 92.2M 6s
192300K .......... .......... .......... .......... .......... 65%  114M 6s
192350K .......... ................ .......... .......... .......... .......... 65% 65.6M 6s
192400K .......... .......... .......... .......... .......... 65% 91.9M 6s
192450K .......... .......... .......... .......... .......... 65%..   115M.. 6s..
192500K. ........... 75%.... 4.17M  4s..
194300K ............ .......... .......... .......... 65% 57.2M 6s
192550K .......... .......... .......... .......... .......... 65% 37.6M 6s
192600K .............. .......... .......... .......... .......... 75% 2.39M 4s
194350K .......... .......... .......... .......... .......... 75% 5.96M 4s
194400K .......... .........

197150K .......... .......... .......... .......... .......... 76% 27.1M 4s
197200K .......... .......... .......... .......... .......... .......... 66% 1.02M 6s
194700K .......... .............. .......... .......... ........ .......... .......... .......... 66% 64.8M 6s
194750K .......... .......... .......... .......... .......... 66% 94.9M 6s.. 76% 2.82M 4s
197250K .......... ....
194800K .......... .......... .......... .......... .......... 66% 75.2M 6s
194850K .......... .......... .......... .......... .......... 66% 88.2M 6s
194900K .......... .......... .......... .......... .......... 66%  155M 6s
194950K .......... .......... .......... .......... .......... 66%  119M 6s
195000K .......... .......... .......... .......... .......... 66% 84.3M 6s
195050K .......... .......... .......... .......... .......... 66%  133M 6s
195100K .......... .......... .......... .......... .......... 66%  123M 6s
195150K .......... .............. ............ .......... .......... ..........

197350K .......... .......... .......... .......... .......... 67%  108M 6s
197400K .......... .......... .......... .......... .......... 67%  105M 6s
197450K .......... .......... .......... .......... .......... 67% 30.4M 6s
197500K .......... .......... .......... .......... .......... 67% 67.0M 6s
197550K .......... ................ .......... .......... 77% 3.31M 4s
199900K .......... ............ .......... .......... .......... 67% 65.0M 6s
197600K .......... .......... .......... .......... .......... 67% 33.2M 6s
197650K .......... .... .......... .......... .......... 77% 10.3M 4s
199950K .......... .......... .......... .......... .......... 77%  110M 4s
200000K .......... .......... .......... .......... .......... 77% 4.63M 4s
200050K .......... .......... .......... .......... .......... 77% 20.8M 4s
200100K .......... .......... .......... .......... .......... 77% 4.86M 4s
200150K .......... .......... .......... .......... .......... 77% 21.4M 4s
200200K .......... ..

199850K .......... .......... .......... .............. .......... 78% 39.8M 3s
202800K .......... .......... .......... .......... .......... .......... 68% 51.5M 6s
199900K .......... .......... .......... .......... .......... 68% 96.6M 6s
199950K .... 78% 43.2M 3s
202850K .......... ............ .......... .......... .......... .......... 68%  121M 6s
200000K .......... .......... .......... ........ .......... .......... .......... 78% 35.0M 3s
202900K .......... .......... .......... 68% 43.7M 6s
200050K .......... .......... .......... .......... .......... 68% 83.7M 6s
200100K .......... .......... .......... .......... .......... 68% 63.2M 6s
200150K .......... .......... .......... .......... .......... ................ 68% 32.7M 6s
200200K .......... .......... .......... .......... .......... 68% 83.7M 6s
200250K .......... .......... .......... .......... .......... 68%  108M 6s
200300K .......... ............ .......... .... .......... .......... .......... 68% 31.7M 6s
2

203050K ........ .......... .......... .......... .......... 79% 44.9M 3s
205050K .......... .......... .......... .......... .......... 69% 54.0M 6s
203100K ........ .......... .......... .......... .......... 79% 43.2M 3s
205100K .......... .......... .......... .......... .......... 69% 50.8M 6s
203150K .......... .......... .......... .......... .......... .......... .......... .......... 79% 42.5M 3s
205150K .......... .............. .......... 69% 36.7M 6s
203200K .......... .......... .......... .......... .......... .......... .......... .......... 79% 42.2M 3s
205200K .......... .......... .......... .... 69% 42.2M 6s
203250K .......... .......... .................. .......... 79% 61.8M 3s
205250K .......... .... .......... .......... 69% 75.6M 6s
203300K .......... .......... .......... .......... .......... 69% 70.2M 6s
203350K .......... .......... ............ .......... .......... .......... 79% 31.3M 3s
205300K .......... .......... ............ .......... .......... 69%

207550K .......... ............ ......... ............... .... .......... .......... .......... ........ 70% 18.8M 6s
205950K. ......... 80% 68.8M 3s........ .......... .......... .......... .......... 70% 83.9M 5s
206000K .......... .......... .......... .......... .......... 70% 76.7M 5s
206050K ......
207600K ............ ...... .......... .......... .......... .......... 80% 17.0M 3s
207650K .......... .......... ................ .......... .......... ...... .......... .......... 80% 67.1M 3s
207700K .......... ...... 70% 19.1M 5s
206100K .......... .......... ................ .......... ...... ............ ........ .......... 70% 94.4M 5s
206150K .......... .......... ............ 80% 40.7M 3s
207750K .......... .......... .......... .......... ...... .......... .......... 70% 44.4M 5s
206200K .......... .......... .......... .................. 80% 32.2M 3s
207800K .......... .......... .......... .......... .......... 80% 60.7M 3s
207850K ...... .......... 70% 24.5M 5s
206250K ..

208850K .......... .......... .......... .......... .......... 71% 93.3M 5s
208900K .......... ..
210000K .......... .......... .......... .......... .......... 81% 8.39M 3s
210050K .......... .......... .......... .......... .......... 81% 96.0M 3s
210100K .......... .......... .......... .......... .......... 81% 29.6M 3s
210150K .......... .......... .......... .......... .......... 81% 89.8M 3s
210200K .......... .......... .......... .......... .......... 81%  123M 3s
210250K .............. .......... .......... .......... 71% 11.0M 5s
208950K .......... .......... .......... .......... .......... 71% 48.4M 5s
209000K .......... .......... .......... .............. .......... .......... .......... .......... 81% 17.4M 3s
210300K .......... .......... .......... 71% 29.3M 5s
209050K .......... .......... .......... .......... .......... 71%  113M 5s
209100K .......... .......... .......... .......... .......... 71% 65.9M 5s
209150K .......... .......... .......... .......... ......

212700K .......... .......... .......... .......... .......... 82% 10.0M 3s
212750K .......... .......... .......... .......... .......... 82% 4.14M 3s
212800K .......... .......... .......... .......... .......... 82% 34.3M 3s
212850K .......... .......... .......... .......... .......... 82% 3.85M 3s
212900K .......... .......... .......... .......... .......... 82% 28.3M 3s
212950K .......... .......... .......... .......... .... .......... ............ 82% 46.2M 3s
213000K .............. .......... 72% 1003K 5s
211600K .......... .......... .......... .......... .......... 72%  120M 5s
211650K .......... .......... .......... .......... .......... 72%  100M 5s
211700K .......... .......... .......... .......... .......... 72%  110M 5s
211750K .......... .......... .......... .......... .......... 72% 85.0M 5s
211800K .......... .......... .......... .......... .......... 72%  147M 5s
211850K .......... .......... .......... .......... .......... 72%  113M 5s
211900K .......... ....

214050K .......... .......... .......... .......... .......... 73%  103M 5s
214100K .......... .......... .......... .......... .......... 73%  111M 5s
214150K .......... .......... .......... .......... .......... 73% 73.0M 5s
214200K .......... .......... .......... .......... .......... 73%  125M 5s
214250K .......... .......... .......... .......... .......... 73% 93.6M 5s
214300K .......... .......... .......... .......... .......... 73%  125M 5s
214350K .......... .......... .......... .......... .......... 73%  117M 5s
214400K .......... .......... .......... .......... .......... 73% 59.3M 5s
214450K .......... .......... .......... .......... .......... 73% 87.9M 5s
214500K .......... .......... .......... .......... ............ .......... 83% 2.24M 3s
215650K .......... .......... .......... .......... .......... 83% 4.58M 3s
215700K .......... .......... .......... .......... .......... 83% 25.2M 3s
215750K .......... .......... .......... .......... .......... 83% 3.68M 3s

216550K .......... .......... .......... .......... .......... 74% 37.6M 5s
216600K .......... .......... .......... .......... .......... 74% 46.5M 5s
216650K .......... .......... .......... .......... .......... .......... 84% 2.30M 2s
218550K .......... .......... .......... .......... .......... 84% 11.8M 2s
218600K .......... .......... .......... .......... .......... 85% 5.64M 2s
218650K .......... .......... .... .......... 74% 1.37M 5s
216700K .......... .......... .......... .......... .......... 74% 85.5M 5s
216750K .......... .......... .......... .......... .......... 74% 45.7M 5s
216800K .......... .......... .......... .......... .......... 74% 73.0M 5s
216850K .......... .......... .......... .......... .......... 74% 34.5M 5s
216900K .......... .......... .......... .......... .......... 74% 75.6M 5s
216950K .......... .......... .......... .......... .......... 74%  176M 5s
217000K .......... .......... .......... .......... .......... 74% 29.0M 5s
217050K ..........

219350K .......... .......... ............ .......... 85% 32.9M 2s
221200K .......... .......... .......... .......... .......... 86%  102M 2s
221250K .......... .......... .............. .......... .......... 74% 34.6M 5s
219400K .......... .......... .......... .......... .......... 74% 87.7M 5s
219450K .......... .......... .......... .......... .......... 75%  101M 5s
219500K .... .......... .......... 86% 28.4M 2s
221300K .......... .......... .......... .......... .......... 86%  113M 2s
221350K .......... .......... ............ .......... .......... .......... .......... 75% 39.2M 5s
219550K .......... ............ .......... .......... 86% 37.6M 2s
221400K .......... .......... .......... .......... .......... 86% 81.7M 2s
221450K .......... .......... .......... .......... .......... 86%  186M 2s
221500K .......... .......... .......... .......... .......... 86%  106M 2s
221550K .......... .......... .......... .......... .......... 86% 27.7M 2s
221600K .......... .......... 

The script will output 4 files, namely:
- train.rec : Contains source and target sentences for training in protobuf format
- val.rec : Contains source and target sentences for validation in protobuf format
- vocab.src.json : Vocabulary mapping (string to int) for source language (English in this example)
- vocab.trg.json : Vocabulary mapping (string to int) for target language (German in this example)
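Since the vocabulary files are described as JSON string-to-int mappings, they can be inspected or inverted with the standard library. The following is a minimal sketch under that assumption; `invert_vocab` is a hypothetical helper, and `sample_vocab` (including its special-token names) is a made-up stand-in, not the real contents of `vocab.src.json`.

```python
import json


def invert_vocab(vocab):
    """Build the reverse (int -> string) mapping from a string -> int vocabulary."""
    return {index: token for token, index in vocab.items()}


# Hypothetical stand-in for the parsed contents of a vocabulary file.
sample_vocab = json.loads('{"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}')

id_to_token = invert_vocab(sample_vocab)

# Map a list of token ids back to their string tokens.
decoded = [id_to_token[i] for i in [2, 3]]
print(decoded)
```

To apply the same idea to the real file, load it with `json.load(open("vocab.src.json"))` in place of the sample dictionary.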

Let's upload the pre-processed dataset and vocabularies to S3.

In [8]:
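def upload_to_s3(bucket, prefix, channel, file):
    """Upload a local file to s3://<bucket>/<prefix>/<channel>/<file>."""
    s3 = boto3.resource("s3")
    key = prefix + "/" + channel + "/" + file
    # Use a context manager so the file handle is closed after the upload
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)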


upload_to_s3(bucket, prefix, "train", "train.rec")
upload_to_s3(bucket, prefix, "validation", "val.rec")
upload_to_s3(bucket, prefix, "vocab", "vocab.src.json")
upload_to_s3(bucket, prefix, "vocab", "vocab.trg.json")
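Each file lands under a per-channel folder, and the training job later reads each channel from the matching S3 prefix. The sketch below only illustrates that naming scheme; `s3_key` and `channel_uri` are hypothetical helpers, and `example-bucket` is a placeholder bucket name.

```python
def s3_key(prefix, channel, filename):
    """Object key used for an uploaded file: <prefix>/<channel>/<filename>."""
    return "/".join([prefix, channel, filename])


def channel_uri(bucket, prefix, channel):
    """S3 URI of the folder that holds all files for one channel."""
    return "s3://{}/{}/{}/".format(bucket, prefix, channel)


# Placeholder values for illustration only.
example_key = s3_key("sagemaker/DEMO-seq2seq", "train", "train.rec")
example_uri = channel_uri("example-bucket", "sagemaker/DEMO-seq2seq", "train")
print(example_key)
print(example_uri)
```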

In [9]:
region_name = boto3.Session().region_name

In [10]:
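# Note: in SageMaker Python SDK v2 this helper is deprecated in favor of
# sagemaker.image_uris.retrieve("seq2seq", region_name).
from sagemaker.amazon.amazon_estimator import get_image_uri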

container = get_image_uri(region_name, "seq2seq")

print("Using SageMaker Seq2Seq container: {} ({})".format(container, region_name))

## Training the Machine Translation model

In [None]:
job_name = "DEMO-seq2seq-en-de-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

create_training_params = {
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "File"},
    "RoleArn": role,
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/".format(bucket, prefix)},
    "ResourceConfig": {
        # Seq2Seq does not support multiple machines. Currently, it only supports single machine, multiple GPUs
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",  # We suggest one of ["ml.p2.16xlarge", "ml.p2.8xlarge", "ml.p2.xlarge"]
        "VolumeSizeInGB": 50,
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        # Please refer to the documentation for complete list of parameters
        "max_seq_len_source": "60",
        "max_seq_len_target": "60",
        "optimized_metric": "bleu",
        "batch_size": "64",  # Please use a larger batch size (256 or 512) if using ml.p2.8xlarge or ml.p2.16xlarge
        "checkpoint_frequency_num_batches": "1000",
        "rnn_num_hidden": "512",
        "num_layers_encoder": "1",
        "num_layers_decoder": "1",
        "num_embed_source": "512",
        "num_embed_target": "512",
        "checkpoint_threshold": "3",
        # Training will stop after 2100 iterations/batches.
        # This is just for demo purposes. Remove the parameter below if you want a better model.
        "max_num_batches": "2100",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 48 * 3600},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        },
        {
            "ChannelName": "vocab",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/vocab/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/validation/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        },
    ],
}

sagemaker_client = boto3.Session().client(service_name="sagemaker")
sagemaker_client.create_training_job(**create_training_params)

status = sagemaker_client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
print(status)
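The three entries in `InputDataConfig` differ only in the channel name, so they could be generated rather than written out by hand. This is a sketch of one way to do that, not part of the original notebook; `make_channel` is a hypothetical helper and `example-bucket` is a placeholder.

```python
def make_channel(name, bucket, prefix):
    """Build one InputDataConfig entry for a fully replicated S3 prefix channel."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/{}/".format(bucket, prefix, name),
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder bucket and prefix; in the notebook these come from the Setup cells.
input_data_config = [
    make_channel(name, "example-bucket", "sagemaker/DEMO-seq2seq")
    for name in ("train", "vocab", "validation")
]
print([channel["ChannelName"] for channel in input_data_config])
```

The resulting list has the same shape as the hand-written `InputDataConfig` value above.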

In [18]:
import time

status = sagemaker_client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]

while status == "InProgress":
    time.sleep(60)
    status = sagemaker_client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]


print(status)
# if the job failed, determine why
if status == "Failed":
    message = sagemaker_client.describe_training_job(TrainingJobName=job_name)["FailureReason"]
    print("Training failed with the following error: {}".format(message))
    raise Exception("Training job failed")


> Now wait for the training job to complete and proceed to the next step after you see model artifacts in your S3 bucket.
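The polling loop above runs until the status leaves `InProgress`. A slightly more defensive variant, sketched below, also treats `Stopping` as a non-final state and can give up after a fixed number of polls. `wait_for_training_job` and its parameters are hypothetical, and the `describe` argument stands in for `sagemaker_client.describe_training_job`; a fake is used here so the sketch runs without AWS access.

```python
import time


def wait_for_training_job(describe, job_name, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll a describe_training_job-style callable until the job reaches a final status."""
    polls = 0
    status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
    # "InProgress" and "Stopping" are the non-final training job statuses.
    while status in ("InProgress", "Stopping"):
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("Gave up waiting for training job {}".format(job_name))
        sleep(poll_seconds)
        polls += 1
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
    return status


# Fake describe call that reports two in-progress polls and then completion.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_statuses)}


final_status = wait_for_training_job(fake_describe, "demo-job", sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the call would look like `wait_for_training_job(sagemaker_client.describe_training_job, job_name)`. boto3 also ships a built-in waiter for this, available via `sagemaker_client.get_waiter("training_job_completed_or_stopped")`.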

You can jump to [Use a pretrained model](#Use-a-pretrained-model) as training might take some time.

## Inference

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means translating sentence(s) from English to German.
This section involves several steps:
- Create model - Create a model using the artifact (model.tar.gz) produced by training
- Create Endpoint Configuration - Create a configuration defining an endpoint, using the above model
- Create Endpoint - Use the configuration to create an inference endpoint.
- Perform Inference - Perform inference on some input data using the endpoint.

### Create model
We now create a SageMaker Model from the training output. Using the model, we can then create an Endpoint Configuration.

In [15]:
use_pretrained_model = False

### Use a pretrained model
#### Please uncomment and run the cell below if you want to use a pretrained model, as training might take several hours/days to complete.

In [16]:
# use_pretrained_model = True
# model_name = "DEMO-pretrained-en-de-model"
# s3 = boto3.client("s3")
# s3.download_file("sagemaker-sample-files", "models/seq2seq-data/model.tar.gz", "model.tar.gz")
# s3.download_file("sagemaker-sample-files", "models/seq2seq-data/vocab.src.json", "vocab.src.json")
# s3.download_file("sagemaker-sample-files", "models/seq2seq-data/vocab.trg.json", "vocab.trg.json")
# upload_to_s3(bucket, prefix, 'pretrained_model', 'model.tar.gz')
# model_data = "s3://{}/{}/pretrained_model/model.tar.gz".format(bucket, prefix)

In [17]:
%%time

sage = boto3.client("sagemaker")

if not use_pretrained_model:
    info = sage.describe_training_job(TrainingJobName=job_name)
    model_name = job_name
    model_data = info["ModelArtifacts"]["S3ModelArtifacts"]

print(model_name)
print(model_data)

primary_container = {"Image": container, "ModelDataUrl": model_data}

create_model_response = sage.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response["ModelArn"])


### Create endpoint configuration
Use the model to create an endpoint configuration. The endpoint configuration also contains information about the type and number of EC2 instances to use when hosting the model.

Because SageMaker Seq2Seq is a neural network model, a GPU instance such as ml.p2.xlarge could be used for hosting, but this example uses ml.m4.xlarge, which is free tier eligible.

In [13]:
from time import gmtime, strftime

endpoint_config_name = "DEMO-Seq2SeqEndpointConfig-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

DEMO-Seq2SeqEndpointConfig-2023-06-08-07-22-16



### Create endpoint
Lastly, we create the endpoint that serves the model by specifying the name and the configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 10-15 minutes to complete.

In [None]:
%%time
import time

endpoint_name = "DEMO-Seq2SeqEndpoint-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sage.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sage.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

# wait until the status has changed
sage.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

# print the status of the endpoint
endpoint_response = sage.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response["EndpointStatus"]
print("Endpoint creation ended with EndpointStatus = {}".format(status))

if status != "InService":
    raise Exception("Endpoint creation failed.")

If you see the message,
> Endpoint creation ended with EndpointStatus = InService

then congratulations! You now have a functioning inference endpoint. You can confirm the endpoint configuration and status by navigating to the "Endpoints" tab in the AWS SageMaker console.  
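
If the endpoint does not reach `InService`, the `describe_endpoint` response usually carries a `FailureReason` that is more useful than a bare exception. The following is a minimal sketch of a wait helper that surfaces it. The function name `wait_for_endpoint` and its default delay and attempt values are assumptions chosen for illustration; `get_waiter("endpoint_in_service")`, `WaiterConfig`, and `describe_endpoint` are the boto3 calls already used or available on the `sage` client.

```python
def wait_for_endpoint(client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Wait for an endpoint and raise with the reported reason if it is not InService.

    `client` is expected to behave like boto3.client("sagemaker").
    Hypothetical helper; the defaults are illustrative, not SDK defaults.
    """
    waiter = client.get_waiter("endpoint_in_service")
    try:
        waiter.wait(
            EndpointName=endpoint_name,
            WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
        )
    except Exception:
        # With boto3 this is typically botocore.exceptions.WaiterError. It is
        # caught broadly here so the sketch has no extra imports; the status
        # check below reports what actually happened.
        pass

    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response["EndpointStatus"]
    if status != "InService":
        reason = response.get("FailureReason", "no reason reported")
        raise RuntimeError(
            "Endpoint {} ended with status {}: {}".format(endpoint_name, status, reason)
        )
    return status
```

With the names from this notebook, the call would be `wait_for_endpoint(sage, endpoint_name)`.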

We will finally create a runtime object from which we can invoke the endpoint.

In [None]:
runtime = boto3.client(service_name="runtime.sagemaker")

## Perform Inference

### Using JSON format for inference (Suggested for a single or small number of data instances)

#### Note that you don't have to convert strings to integers using the vocabulary mapping for inference using JSON mode

In [None]:
sentences = ["you are so good !", "can you drive a car ?", "i want to watch a movie ."]

payload = {"instances": []}
for sent in sentences:
    payload["instances"].append({"data": sent})

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

response = response["Body"].read().decode("utf-8")
response = json.loads(response)
print(response)
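
The request and response handling above can be factored into two small functions. This is a sketch under stated assumptions: the names `build_json_payload` and `extract_translations` are hypothetical, and the response is assumed to have the shape used by the attention matrix example in this notebook, namely a `predictions` list whose items carry a `target` string.

```python
import json


def build_json_payload(sentences, attention_matrix=False):
    """Serialize sentences into the {"instances": [{"data": ...}]} request body."""
    instances = []
    for sentence in sentences:
        instance = {"data": sentence}
        if attention_matrix:
            # Same per-instance option the attention matrix example passes.
            instance["configuration"] = {"attention_matrix": "true"}
        instances.append(instance)
    return json.dumps({"instances": instances})


def extract_translations(response_body):
    """Return the translated strings from a decoded JSON response body.

    Assumes the body looks like {"predictions": [{"target": "..."}, ...]}.
    """
    predictions = json.loads(response_body)["predictions"]
    return [prediction["target"] for prediction in predictions]
```

The output of `build_json_payload(sentences)` can be passed as `Body` to `runtime.invoke_endpoint`, and the decoded response body can be passed to `extract_translations`.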

### Retrieving the Attention Matrix

Passing `"attention_matrix":"true"` in `configuration` of the data instance will return the attention matrix.

In [None]:
sentence = "can you drive a car ?"

payload = {"instances": [{"data": sentence, "configuration": {"attention_matrix": "true"}}]}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

response = response["Body"].read().decode("utf-8")
response = json.loads(response)["predictions"][0]

source = sentence
target = response["target"]
attention_matrix = np.array(response["matrix"])

print("Source: %s \nTarget: %s" % (source, target))

In [None]:
# Define a function for plotting the attention matrix
def plot_matrix(attention_matrix, target, source):
    source_tokens = source.split()
    target_tokens = target.split()
    assert attention_matrix.shape[0] == len(target_tokens)
    plt.imshow(attention_matrix.transpose(), interpolation="nearest", cmap="Greys")
    plt.xlabel("target")
    plt.ylabel("source")
    plt.gca().set_xticks([i for i in range(0, len(target_tokens))])
    plt.gca().set_yticks([i for i in range(0, len(source_tokens))])
    plt.gca().set_xticklabels(target_tokens)
    plt.gca().set_yticklabels(source_tokens)
    plt.tight_layout()

In [None]:
plot_matrix(attention_matrix, target, source)
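
Besides plotting, the matrix can be read numerically. The sketch below lists, for each target token, the source position that received the largest attention weight. It relies on the layout checked by the assertion in `plot_matrix` (one row per target token) and additionally assumes that each column corresponds to a source position. The function name `strongest_alignments` is hypothetical, and the input is a plain list of rows, for example `attention_matrix.tolist()`.

```python
def strongest_alignments(attention_rows, target_tokens, source_tokens):
    """Pair each target token with its most strongly attended source token.

    attention_rows: one list of weights per target token, one weight per
    source position. Returns a list of (target_token, source_token, weight).
    """
    alignments = []
    for target_token, row in zip(target_tokens, attention_rows):
        # Index of the largest weight in this row.
        best = max(range(len(row)), key=lambda i: row[i])
        # The matrix may have more columns than whitespace-separated source
        # tokens (an assumption, e.g. an extra end-of-sentence position),
        # so guard the lookup instead of indexing blindly.
        if best < len(source_tokens):
            source_token = source_tokens[best]
        else:
            source_token = "<pos {}>".format(best)
        alignments.append((target_token, source_token, row[best]))
    return alignments
```

With the variables from the cells above, a call would be `strongest_alignments(attention_matrix.tolist(), target.split(), source.split())`.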

### Using Protobuf format for inference (Suggested for efficient bulk inference)

Read the vocabulary mappings first, since this mode of inference accepts lists of integers and returns lists of integers.

In [None]:
import io
import tempfile
from record_pb2 import Record
from create_vocab_proto import (
    vocab_from_json,
    reverse_vocab,
    write_recordio,
    list_to_record_bytes,
    read_next,
)

source = vocab_from_json("vocab.src.json")
target = vocab_from_json("vocab.trg.json")

source_rev = reverse_vocab(source)
target_rev = reverse_vocab(target)

In [None]:
sentences = [
    "this is so cool",
    "i am having dinner .",
    "i am sitting in an aeroplane .",
    "come let us go for a long drive .",
]

Converting the string to integers, followed by protobuf encoding:

In [None]:
# Convert strings to integers using source vocab mapping. Out-of-vocabulary strings are mapped to 1 - the mapping for <unk>
sentences = [[source.get(token, 1) for token in sentence.split()] for sentence in sentences]
f = io.BytesIO()
for sentence in sentences:
    record = list_to_record_bytes(sentence, [])
    write_recordio(f, record)
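
To make the token-to-integer step concrete, here is a small self-contained sketch with a toy vocabulary. The vocabulary contents and the helper names `encode` and `decode` are made up for illustration; the real mappings come from `vocab.src.json` and `vocab.trg.json`. The only detail taken from the notebook is that out-of-vocabulary tokens map to 1, the id of `<unk>`.

```python
UNK_ID = 1  # id the notebook uses for out-of-vocabulary tokens (<unk>)


def encode(sentence, vocab, unk_id=UNK_ID):
    """Map whitespace-separated tokens to integer ids, falling back to unk_id."""
    return [vocab.get(token, unk_id) for token in sentence.split()]


def decode(token_ids, id_to_token, unk_token="<unk>"):
    """Map integer ids back to a space-joined string."""
    return " ".join(id_to_token.get(token_id, unk_token) for token_id in token_ids)


# Toy vocabulary, purely illustrative.
toy_vocab = {"<unk>": 1, "i": 4, "am": 5, "having": 6, "dinner": 7, ".": 8}
toy_id_to_token = {index: token for token, index in toy_vocab.items()}

ids = encode("i am having lunch .", toy_vocab)
print(ids)                           # [4, 5, 6, 1, 8]
print(decode(ids, toy_id_to_token))  # i am having <unk> .
```

The word "lunch" is not in the toy vocabulary, so it is encoded as 1 and comes back as `<unk>`, which is the behaviour to expect for unseen words in the real vocabulary as well.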

In [None]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/x-recordio-protobuf", Body=f.getvalue()
)

response = response["Body"].read()

Now, parse the protobuf response and convert the lists of integers back to strings:

In [None]:
def _parse_proto_response(received_bytes):
    target_sentences = []
    # Write the response to a temporary file so read_next can consume it record
    # by record; the context manager closes and removes the file afterwards.
    with tempfile.NamedTemporaryFile() as output_file:
        output_file.write(received_bytes)
        output_file.flush()
        with open(output_file.name, "rb") as datum:
            while True:
                next_record = read_next(datum)
                if not next_record:
                    break
                rec = Record()
                rec.ParseFromString(next_record)
                # Each record holds one translated sentence as a list of target ids
                target = list(rec.features["target"].int32_tensor.values)
                target_sentences.append(target)
    return target_sentences

In [None]:
targets = _parse_proto_response(response)
resp = [" ".join([target_rev.get(token, "<unk>") for token in sentence]) for sentence in targets]
print(resp)

## Stop / Close the Endpoint (Optional)

Finally, delete the endpoint before closing the notebook so that it stops incurring charges.

In [None]:
sage.delete_endpoint(EndpointName=endpoint_name)
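
The call above removes only the endpoint. If the endpoint configuration and the model created earlier are no longer needed either, they can be deleted with the same client. The following is a minimal sketch; the helper name `delete_hosting_resources` is hypothetical, while `delete_endpoint`, `delete_endpoint_config`, and `delete_model` are operations of the boto3 SageMaker client.

```python
def delete_hosting_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model.

    `client` is expected to behave like boto3.client("sagemaker").
    Hypothetical convenience wrapper around three existing client calls.
    """
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)
```

With the names from this notebook, the call would be `delete_hosting_resources(sage, endpoint_name, endpoint_config_name, model_name)`. Use it instead of, not in addition to, the single `delete_endpoint` call, since deleting an endpoint that no longer exists raises an error.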