diff --git a/README.md b/README.md
index 349220f..c9c7fde 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,321 @@
 ## Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
-This is a repo for the ACL 2019 paper ["Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"](https://arxiv.org/abs/1905.09418).
-Code of the model will appear by the time of publication.
+
+
+
+This is the official repo for the ACL 2019 paper ["Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"](https://arxiv.org/abs/1905.09418).
+
+It may be worth first looking at the [blog post](https://lena-voita.github.io/posts/acl19_heads.html).
+
+#### Bibtex
+```
+@inproceedings{voita-etal-2019-analyzing,
+    title = "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned",
+    author = "Voita, Elena and
+      Talbot, David and
+      Moiseev, Fedor and
+      Sennrich, Rico and
+      Titov, Ivan",
+    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+    month = jul,
+    year = "2019",
+    address = "Florence, Italy",
+    publisher = "Association for Computational Linguistics",
+}
+```
+
+## Introduction
+
+Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are the last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
+
+In this repo, we provide code and describe the steps needed to reproduce our experiments with L0 head pruning.
+
+## Pruning Attention Heads
+
+In the standard Transformer, the results of different attention heads in a layer are concatenated:
+
+```MultiHead(Q, K, V) = Concat(head_i)W^O.```
+
+We would like to disable less important heads completely, i.e. ideally apply `L0` regularization to the number of heads. We modify the original Transformer architecture by multiplying the representation computed by each `head_i` by a scalar gate `g_i`:
+
+```MultiHead(Q, K, V) = Concat(g_i * head_i)W^O.```
+
+Unlike usual gates, `g_i` are parameters specific to heads and are independent of the input (i.e. the sentence). Each gate `g_i` is a random variable drawn independently from a head-specific [Hard Concrete distribution](https://openreview.net/pdf?id=H1Y8hhg0b). The distributions have non-zero probability mass at 0 and 1; see the illustration below.
+
+![concrete_gif](./resources/concrete_crop.gif)
+
+We use the sum of the probabilities of heads being non-zero (`L_C`) as a stochastic relaxation of the non-differentiable `L0` norm. The resulting training objective is:
+
+```L = L_xent + λ * L_C.```
+
+By varying the coefficient `λ` in the optimized objective, we obtain models with different numbers of retained heads.
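+
+To make this concrete, below is a small NumPy sketch (an illustration only, not the repo's code; the actual TensorFlow implementation lives in `lib/layers/concrete_gate.py`) of how a Hard Concrete gate can be sampled and how the probability of a head being non-zero, which is summed into `L_C`, can be computed. The temperature and stretch limits match the defaults used in the code; the function names here are illustrative.
+
+```
+import numpy as np
+
+def sample_gate(log_a, temperature=0.33, limits=(-0.1, 1.1), eps=1e-6):
+    """Sample a stretched-and-clipped concrete gate g in [0, 1] for one head."""
+    low, high = limits
+    u = np.random.uniform(eps, 1.0 - eps)  # uniform noise for the concrete relaxation
+    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_a) / temperature))
+    return float(np.clip(s * (high - low) + low, 0.0, 1.0))  # has mass exactly at 0 and 1
+
+def p_head_open(log_a, temperature=0.33, limits=(-0.1, 1.1)):
+    """P(g_i != 0): the per-head term that is summed into the L_C regularizer."""
+    low, high = limits
+    return 1.0 / (1.0 + np.exp(-(log_a - temperature * np.log(-low / high))))
+
+# One layer with 8 heads; log_a are the learned per-head parameters (zeros here for illustration).
+log_a_per_head = [0.0] * 8
+L_C = sum(p_head_open(a) for a in log_a_per_head)
+# The training loss is L = L_xent + λ * L_C: increasing λ pushes log_a down for
+# unimportant heads, driving their gates (and hence those heads) to exactly zero.
+```
+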
+Below we show how the probabilities of encoder heads being completely closed (P(g_i = 0)) change during training for different values of `λ` (pruning starts from a converged model). White color denotes P(g_i = 0) = 1, which means that a head is completely removed from the model.
+
+![enc_head_gif](./resources/enc_head_gif_delay7-min.gif)
+
+(The gif is for a model trained on EN-RU WMT. For other datasets, suitable values of `λ` can be different.)
+
+We observe that the model converges to solutions where gates are either almost completely closed or completely open. This means that at test time we can treat the model as a standard Transformer and use only a subset of heads.
+
+---
+# Experiments
+
+## Requirements
+
+__Operating System:__ This implementation works on the most popular Linux distributions (tested on Ubuntu 14, 16). It is also likely to work on Mac OS. For other operating systems we recommend using Docker.
+
+__Hardware:__ The model can be trained on one or several GPUs. Training on CPU is also supported.
+
+__OpenMPI (optional):__ To train on several GPUs, you have to install OpenMPI. The code was tested on [OpenMPI 3.1.2 (download)](https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz). See build instructions [here](https://www.open-mpi.org/faq/?category=building#easy-build).
+
+__Python:__ The code works with Python 3.5 and 3.6; we recommend using [anaconda](https://www.anaconda.com/). Install the rest of the Python packages with `pip install -r requirements.txt`. If you haven't built OpenMPI, remove `horovod` from the list of requirements.
+
+## Data preprocessing
+The model training config requires the data to be preprocessed, i.e. tokenized and BPE-ized.
+### Tokenization
+Here is an example of how to tokenize (and lowercase) your data:
+```
+cat text_lines.en | moses-tokenizer en | python3 -c "import sys; print(sys.stdin.read().lower())" > text_lines.en.tok
+```
+
+For the OpenSubtitles18 dataset, you do not need this step since the data is already tokenized.
+
+### BPE-ization
+Learn BPE rules:
+```
+subword-nmt learn-bpe -s 32000 < text_lines.en.tok > bpe_rules.en
+```
+Apply BPE rules to your data:
+```
+/path_to_this_repo/lib/tools/apply_bpe.py --bpe_rules ./bpe_rules.en < text_lines.en.tok > text_lines.en.bpeized
+```
+---
+## Model training
+
+In the [scripts](./scripts) folder you can find the files `train_baseline.sh`, `train_concrete_heads.sh` and `train_fixed_alive_heads.sh` with configs for training the baseline, the model with head pruning using the relaxation of the L0 penalty, and the model with a fixed configuration of open and closed heads.
+
+To launch an experiment, do the following (the example is for the head pruning experiment):
+```
+mkdir exp_dir_name && cd exp_dir_name
+cp the-story-of-heads_dir/scripts/train_concrete_heads.sh .
+bash train_concrete_heads.sh
+```
+
+After that, checkpoints will be in the `exp_dir_name/build/checkpoint` directory, the summary for TensorBoard in `exp_dir_name/build/summary`, and translations of the dev set for checkpoints (if specified; see below) in `exp_dir_name/build/translations`.
+
+---
+## Notebooks: how to use a model
+In the [notebooks](./notebooks) folder you can find notebooks showing how to work with your trained model. Each notebook's name should make its content clear, but here is a short summary just in case.
+
+[1_Load_model_and_translate](./notebooks/1_Load_model_and_translate.ipynb) - how to load a model and translate sentences;
+
+[2_Look_at_attention_maps](./notebooks/2_Look_at_attention_maps.ipynb) - how to draw attention maps for encoder heads;
+
+[3_Look_which_heads_are_dead](./notebooks/3_Look_which_heads_are_dead.ipynb) - if you are pruning heads, you might want to know which ones ended up dead; this notebook shows you how to find out.
+
+
+---
+## Training config tour
+
+Each training script contains a thorough description of the parameters and an explanation of what you need to change for your experiment. Here we provide a tour of the config files and explain the parameters once again.
+
+
+### Data
+First, you need to specify the directory of the [the-story-of-heads](./) repo, your data directory, and the train/dev file names.
+```
+REPO_DIR="../" # insert the path to the the-story-of-heads repo
+DATA_DIR="../" # insert your datadir
+
+NMT="${REPO_DIR}/scripts/nmt.py"
+
+# path to preprocessed data (tokenized, bpe-ized)
+train_src="${DATA_DIR}/train.src"
+train_dst="${DATA_DIR}/train.dst"
+dev_src="${DATA_DIR}/dev.src"
+dev_dst="${DATA_DIR}/dev.dst"
+```
+After that, in the config you'll see the code for creating vocabularies from your data and shuffling the data.
+
+---
+### Model
+```
+params=(
+...
+--model lib.task.seq2seq.models.transformer_head_gates.Model
+...)
+```
+This is the Transformer model with extra options for attention head gates: stochastic gates, fixed gates, or no extra parameters for the baseline. Model hyperparameters are split into groups:
+* main model hp,
+* minor model hp (you probably do not want to change these),
+* regularization and label smoothing,
+* inference params (beam search with a beam of 4),
+* head gate parameters (nothing here for the baseline).
+
+For the baseline, the parameters are as follows:
+```
+hp = {
+     "num_layers": 6,
+     "num_heads": 8,
+     "ff_size": 2048,
+     "ffn_type": "conv_relu",
+     "hid_size": 512,
+     "emb_size": 512,
+     "res_steps": "nlda",
+
+     "rescale_emb": True,
+     "inp_emb_bias": True,
+     "normalize_out": True,
+     "share_emb": False,
+     "replace": 0,
+
+     "relu_dropout": 0.1,
+     "res_dropout": 0.1,
+     "attn_dropout": 0.1,
+     "label_smoothing": 0.1,
+
+     "translator": "ingraph",
+     "beam_size": 4,
+     "beam_spread": 3,
+     "len_alpha": 0.6,
+     "attn_beta": 0,
+  }
+```
+This set of parameters corresponds to Transformer-base [(Vaswani et al., 2017)](https://papers.nips.cc/paper/7181-attention-is-all-you-need).
+
+To train a model with head pruning, you need to specify the types of attention heads you want to prune. For encoder self-attention heads only,
+```
+     "concrete_heads": {"enc-self"},
+```
+and for all attention types, it's
+```
+     "concrete_heads": {"enc-self", "dec-self", "dec-enc"},
+```
+
+For a fixed head configuration, specify the gate values for each head:
+```
+     "alive_heads": {"enc-self": [[1,0,1,0,1,0,1,0],
+                                  [1,1,1,1,1,1,1,1],
+                                  [0,0,0,0,0,0,0,0],
+                                  [1,1,1,0,0,1,0,0],
+                                  [0,0,0,0,1,1,1,1],
+                                  [0,0,1,1,0,0,1,1]],
+                    },
+```
+In this case, only encoder self-attention heads will be masked. For all attention types, specify all gates:
+```
+     "alive_heads": {"enc-self": [[1,0,1,0,1,0,1,0],
+                                  [1,1,1,1,1,1,1,1],
+                                  ...
+                                  [0,0,1,1,0,0,1,1]],
+                     "dec-self": [[...],
+                                  ...,
+                                  [...]],
+                     "dec-enc": [[...],
+                                 ...,
+                                 [...]],
+                    },
+```
+---
+### Problem (loss function)
+You need to set the training objective for your model. For the baseline and the fixed head configuration, it's the standard cross-entropy loss with no extra options:
+```
+params=(
+    ...
+    --problem lib.task.seq2seq.problems.default.DefaultProblem
+    --problem-opts '{}'
+    ...)
+```
+For pruning heads, the loss function is `L = L_xent + λ * L_C`. You need to set a different problem and specify the value of `λ`:
+```
+params=(
+    ...
+    --problem lib.task.seq2seq.problems.concrete.ConcreteProblem
+    --problem-opts '{'"'"'concrete_coef'"'"': 0.1,}'
+    ...)
+```
+---
+### Starting checkpoint
+If you start model training from an already trained model (for example, we start pruning heads from the trained baseline model), specify the initial checkpoint:
+```
+params=(
+    ...
+    --pre-init-model-checkpoint 'dir_to_your_trained_baseline_checkpoint.npz'
+    ...)
+```
+You do not need this if you start from scratch.
+
+---
+### Variables to optimize
+If you want to freeze some sets of parameters in the model (for example, when pruning encoder heads we freeze the decoder parameters to ensure that head functions do not move to the decoder), you have to specify which parameters you **want** to optimize. To optimize only the encoder, add `variables` to `--optimizer-opts`:
+```
+params=(
+    ...
+    --optimizer-opts '{'"'"'beta1'"'"': 0.9, '"'"'beta2'"'"': 0.998,
+                       '"'"'variables'"'"': ['"'"'mod/emb_inp*'"'"',
+                                             '"'"'mod/enc*'"'"',],}'
+    ...)
+```
+(Here `beta1` and `beta2` are parameters of the Adam optimizer.)
+
+---
+### Batch size
+It has been shown that the Transformer's performance depends heavily on batch size (see, for example, [Popel and Bojar, 2018](https://content.sciendo.com/view/journals/pralin/110/1/article-p43.xml)), and we chose a large batch size to ensure that models show their best performance. In our experiments, each training batch contained a set of translation pairs with approximately 16000 source tokens. This can be achieved either by using several GPUs or by accumulating the gradients for several batches and then making an update. Our implementation supports both options.
+
+The batch size per GPU is set like this:
+```
+params=(
+    ...
+    --batch-len 4000
+    ...)
+```
+The effective batch size will then be `batch-len * num_gpus`. For example, with `--batch-len 4000` and 4 GPUs you would get the desired batch size of 16000.
+
+If you do not have several GPUs (often, neither do we :) ), you can still train models of proper quality by accumulating the gradients for several batches and then making an update. Add `average_grads: True` and `sync_every_steps: N` to the optimizer options like this:
+```
+params=(
+    ...
+    --optimizer-opts '{'"'"'beta1'"'"': 0.9, '"'"'beta2'"'"': 0.998,
+                       '"'"'sync_every_steps'"'"': 4,
+                       '"'"'average_grads'"'"': True, }'
+    ...)
+```
+The effective batch size will then be `batch-len * sync_every_steps`. For example, with `--batch-len 4000` and `sync_every_steps: 4` you would get the desired batch size of 16000.
+
+
+---
+### Other options
+If you want to see the dev BLEU score on your TensorBoard:
+```
+params=(
+    ...
+    --translate-dev
+    --translate-dev-every 2048
+    ...)
+```
+Specify how often you want to save a checkpoint:
+```
+params=(
+    ...
+    --checkpoint-every-steps 2048
+    ...)
+```
+Specify how often you want to score the dev set (eval loss values):
+```
+params=(
+    ...
+    --score-dev-every 256
+    ...)
+```
+How many of the latest checkpoints to keep:
+```
+params=(
+    ...
+    --keep-checkpoints-max 10
+    ...)
+``` + +--- + +# Comments +* `lib.task.seq2seq.models.transformer_head_gates` model enables you to train baseline as well as other versions, but if you want Transformer model without any modifications, you can find it here: `lib.task.seq2seq.models.transformer`. diff --git a/lib/__init__.py b/lib/__init__.py new file mode 100644 index 0000000..a700fb1 --- /dev/null +++ b/lib/__init__.py @@ -0,0 +1,6 @@ +from . import train +from .meta import * +from .util import * +from .ops import * +from .task import * +from .layers import * diff --git a/lib/data.py b/lib/data.py new file mode 100644 index 0000000..8599ceb --- /dev/null +++ b/lib/data.py @@ -0,0 +1,270 @@ +import numpy as np +import tensorflow as tf + +import bintrees +import os +import sys +import random +import threading +import itertools +from .util import nested_pack, nested_flatten +from .ops import mpi + + +class TfUploader: + + def __init__(self, iterator, capacity, dtypes=None, shapes=None, session=None): + self.session = session if session is not None else tf.get_default_session() + self.empty = False + + # Detect dtypes from first iterator element + if dtypes is None or shapes is None: + # We need to wrap iterator access in session because it may call TF operations + with self.session.as_default(): + try: + first = next(iterator) + except StopIteration: + self.empty = True + return + + self.structure = first + self.dtypes = tuple(e.dtype for e in nested_flatten(first)) + self.shapes = tuple(tuple(map(lambda x: None, e.shape)) for e in nested_flatten(first)) + self.iterator = itertools.chain([first], iterator) + else: + self.structure = dtypes + self.dtypes = tuple(nested_flatten(dtypes)) + self.shapes = tuple(nested_flatten(shapes)) + self.iterator = iterator + + self.session_close_lock = threading.Lock() + self.session_closed = False + + with tf.name_scope("uploader"): + self.queue = tf.FIFOQueue(dtypes=self.dtypes, capacity=capacity) + + self.enqueue_inputs = [tf.placeholder(dtype=dt) for dt in self.dtypes] + self.enqueue_op = self.queue.enqueue(self.enqueue_inputs) + self.close_op = self.queue.close() + + def __enter__(self): + if not self.empty: + self.thread = threading.Thread(target=self._thread_main) + self.thread.daemon = True + self.thread.start() + return self + + def __exit__(self, *args): + if not self.empty: + with self.session_close_lock: + if not self.session_closed: + self.session.run(self.queue.close(True)) + self.session_closed = True + self.thread.join(1) + return False + + def get_next(self): + if self.empty: + raise tf.errors.OutOfRangeError(None, None, "Queue is empty") + res = self.queue.dequeue() + if isinstance(res, list): + for t, sh in zip(res, self.shapes): + t.set_shape(sh) + res = nested_pack(res, self.structure) + return res + + def _thread_main(self): + try: + # We need to wrap iterator access in session because it may call TF operations + with self.session.graph.as_default(), self.session.as_default(): + for t in self.iterator: + self.session.run(self.enqueue_op, feed_dict=dict(zip(self.enqueue_inputs, tuple(nested_flatten(t))))) + + with self.session_close_lock: + self.session.run(self.close_op) + self.session_closed = True + except tf.errors.CancelledError: + pass + + +class LastElement(object): + """ + Class wrapping last element in RoundRobinIterator + """ + def __init__(self, element=None): + self.element = element + +class RoundRobinIterator(object): + """ + Class implementing Round-Robin iterator between coordinator and workers + """ + def __init__(self, iterator=None, is_train=True, 
with_cost=False): + self.iterator = iterator + self.is_train = is_train + self.with_cost = with_cost + self.mpi_rank = os.getenv('OMPI_COMM_WORLD_RANK') or '0' + self.mpi_size = os.getenv('OMPI_COMM_WORLD_SIZE') or '1' + self.finish = False + + def __iter__(self): + self.finish = False + return self + + def __next__(self): + if self.finish: + # we should quit iterator + raise StopIteration + + buf = None + if self.mpi_rank == '0' or self.mpi_rank is None: + # fill buffer with elements to scatter + try: + buf = [] + for _ in range(int(self.mpi_size)): + batch = next(self.iterator) + if self.with_cost: + if len(buf) == 0: # On first element save coordinator cost + coord_cost = batch[-1] + batch = batch + (coord_cost,) + buf.append(batch) + except StopIteration: + # if iterator is out, scatter None values (during training) and + # add None to missing workers + if self.is_train: + buf = [LastElement()] * int(self.mpi_size) + else: + for i in range(len(buf)): + buf[i] = LastElement(buf[i]) + buf += [LastElement()] * (int(self.mpi_size) - len(buf)) + + # scatter objects between workers + value = mpi.scatter_obj(buf) + if isinstance(value, LastElement): + if value.element is None: + raise StopIteration + # remember to quit iterator at the next step + self.finish = True + value = value.element + return value + + +class CostBufferIterator(object): + """ + Class implementing CostBuffer iterator for fast finding of the batch with + desired cost (useful for balancing batches) + + We assume inputs from the iterator passed in the constructor in the form: + + """ + def __init__(self, iterator=None, buf_size=1000): + self.iterator = iterator + self.buf_size = buf_size + self.tree = bintrees.FastRBTree() + self.coord_costs = [] + self.rng = random.Random(42) + + def __iter__(self): + self.tree = bintrees.FastRBTree() + self.coord_costs = [] + self.rng = random.Random(42) + return self + + def __next__(self): + # Warming up + while len(self.coord_costs) < self.buf_size: + try: + batch, cost, coord_cost = next(self.iterator) + except StopIteration: + break + if cost in self.tree: + self.tree[cost].append(batch) + else: + self.tree[cost] = [batch] + self.coord_costs.append(coord_cost) + + # No elements left - finish iteration + if len(self.coord_costs) == 0: + raise StopIteration + + # generate cost to choose and choose relevant batch + index = self.rng.randrange(len(self.coord_costs)) + best_cost = self._find_best_match(self.coord_costs[index]) + batch = self.tree[best_cost][0] + + # remove selected items from structures + del self.coord_costs[index] + del self.tree[best_cost][0] + if len(self.tree[best_cost]) == 0: + del self.tree[best_cost] + + return batch + + def _find_best_match(self, cost): + min_cost = self.tree.min_key() + max_cost = self.tree.max_key() + if cost <= min_cost: + return min_cost + if cost >= max_cost: + return max_cost + floor_cost = self.tree.floor_key(cost) + ceil_cost = self.tree.ceiling_key(cost) + return floor_cost if abs(ceil_cost - cost) < abs(floor_cost - cost) else ceil_cost + + +class ShuffleIterator(object): + """ + Class implementing shuffling iterator via auxiliary buffer + """ + def __init__(self, iterator, buf_size=1000): + self.iterator = iterator + self.buf_size = buf_size + self.buf = [] + self.rng = random.Random(42) + + def __iter__(self): + self.buf = [] + self.rng = random.Random(42) + return self + + def __next__(self): + # Return element from the previously shuffled buffer + if len(self.buf) > 0: + value = self.buf.pop() + return value + + # Keep elements in the 
buffer + while len(self.buf) < self.buf_size: + try: + value = next(self.iterator) + except StopIteration: + break + self.buf.append(value) + + # No elements left - finish iteration + if len(self.buf) == 0: + raise StopIteration + + # Shuffle and return element from the buffer + self.rng.shuffle(self.buf) + value = self.buf.pop() + return value + + +def pad_seq_list(array, sentinel): + """ + Add padding, compose lengths + """ + # Compute max length. + maxlen = 0 + for seq in array: + maxlen = max(maxlen, len(seq)) + + # Pad. + padded = [] + lens = [] + for seq in array: + padding = maxlen - len(seq) + padded.append(seq + [sentinel] * padding) + lens.append(len(seq)) + + return padded, lens diff --git a/lib/layers/__init__.py b/lib/layers/__init__.py new file mode 100644 index 0000000..877360d --- /dev/null +++ b/lib/layers/__init__.py @@ -0,0 +1,3 @@ +from .basic import * +from .attn import * +from .lrp import * diff --git a/lib/layers/attn.py b/lib/layers/attn.py new file mode 100644 index 0000000..aae72ab --- /dev/null +++ b/lib/layers/attn.py @@ -0,0 +1,374 @@ + +import tensorflow as tf +import math + +import lib +from lib.ops.basic import is_dropout_enabled, dropout +from lib.ops import record_activations as rec +from .basic import Dense, LRP + +from lib.layers.concrete_gate import ConcreteGate + + +class MultiHeadAttn: + """ + Multihead scaled-dot-product attention with input/output transformations + """ + ATTN_BIAS_VALUE = -1e9 + + def __init__( + self, name, inp_size, + key_depth, value_depth, output_depth, + num_heads, attn_dropout, attn_value_dropout, + kv_inp_size=None, _format='combined' + ): + self.name = name + self.key_depth = key_depth + self.value_depth = value_depth + self.num_heads = num_heads + self.attn_dropout = attn_dropout + self.attn_value_dropout = attn_value_dropout + self.format = _format + kv_inp_size = kv_inp_size or inp_size + + with tf.variable_scope(name) as scope: + self.scope = scope + + if self.format == 'use_kv': + self.query_conv = Dense( + 'query_conv', + inp_size, key_depth, + activ=lambda x: x, + bias_initializer=tf.zeros_initializer(), + ) + + self.kv_conv = Dense( + 'mem_conv', + kv_inp_size, key_depth + value_depth, + activ=lambda x: x, + bias_initializer=tf.zeros_initializer(), + ) + + if kv_inp_size == inp_size: + self.combined_conv = Dense( + 'combined_conv', + inp_size, key_depth * 2 + value_depth, + activ=lambda x: x, + matrix=tf.concat([self.query_conv.W, self.kv_conv.W], axis=1), + bias=tf.concat([self.query_conv.b, self.kv_conv.b], axis=0), + ) + + elif self.format == 'combined': + assert inp_size == kv_inp_size, 'combined format is only supported when inp_size == kv_inp_size' + self.combined_conv = Dense( + 'mem_conv', # old name for compatibility + inp_size, key_depth * 2 + value_depth, + activ=lambda x: x, + bias_initializer=tf.zeros_initializer()) + + self.query_conv = Dense( + 'query_conv', + inp_size, key_depth, + activ=lambda x: x, + matrix=self.combined_conv.W[:, :key_depth], + bias=self.combined_conv.b[:key_depth], + ) + + self.kv_conv = Dense( + 'kv_conv', + kv_inp_size, key_depth + value_depth, + activ=lambda x: x, + matrix=self.combined_conv.W[:, key_depth:], + bias=self.combined_conv.b[key_depth:], + ) + else: + raise Exception("Unexpected format: " + self.format) + + self.out_conv = Dense( + 'out_conv', + value_depth, output_depth, + activ=lambda x: x, + bias_initializer=tf.zeros_initializer()) + + def attention_core(self, q, k, v, attn_mask): + """ Core math operations of multihead attention layer """ + q = 
self._split_heads(q) # [batch_size * n_heads * n_q * (k_dim/n_heads)] + k = self._split_heads(k) # [batch_size * n_heads * n_kv * (k_dim/n_heads)] + v = self._split_heads(v) # [batch_size * n_heads * n_kv * (v_dim/n_heads)] + + key_depth_per_head = self.key_depth / self.num_heads + q = q / math.sqrt(key_depth_per_head) + + # Dot-product attention + # logits: (batch_size * n_heads * n_q * n_kv) + attn_bias = MultiHeadAttn.ATTN_BIAS_VALUE * (1 - attn_mask) + logits = tf.matmul( + q, + tf.transpose(k, perm=[0, 1, 3, 2])) + attn_bias + weights = tf.nn.softmax(logits) + + tf.add_to_collection("AttnWeights", weights) + tf.add_to_collection(lib.meta.ATTENTIONS, lib.meta.Attention(self.scope, weights, logits, attn_mask)) + + if is_dropout_enabled(): + weights = dropout(weights, 1.0 - self.attn_dropout) + x = tf.matmul( + weights, # [batch_size * n_heads * n_q * n_kv] + v # [batch_size * n_heads * n_kv * (v_deph/n_heads)] + ) + combined_x = self._combine_heads(x) + + if is_dropout_enabled(): + combined_x = dropout(combined_x, 1.0 - self.attn_value_dropout) + return combined_x + + def __call__(self, query_inp, attn_mask, kv_inp=None, kv=None): + """ + query_inp: [batch_size * n_q * inp_dim] + attn_mask: [batch_size * 1 * n_q * n_kv] + kv_inp: [batch_size * n_kv * inp_dim] + ----------------------------------------------- + results: [batch_size * n_q * output_depth] + """ + assert kv is None or kv_inp is None, "please only feed one of kv or kv_inp" + + with tf.variable_scope(self.scope), tf.name_scope(self.name) as scope: + rec.save_activation('kv', kv) + if kv_inp is not None or kv is not None: + q = self.query_conv(query_inp) + if kv is None: + kv = self.kv_conv(kv_inp) + k, v = tf.split(kv, [self.key_depth, self.value_depth], axis=2) + rec.save_activation('is_combined', False) + else: + combined = self.combined_conv(query_inp) + q, k, v = tf.split(combined, [self.key_depth, self.key_depth, self.value_depth], axis=2) + rec.save_activation('is_combined', True) + + rec.save_activations(q=q, k=k, v=v, attn_mask=attn_mask) + combined_x = self.attention_core(q, k, v, attn_mask) + outputs = self.out_conv(combined_x) + + return outputs + + def relprop(self, R): + with tf.variable_scope(self.scope): + assert rec.get_activation('kv') is None, "relprop through translatemodelfast is not implemented" + R = self.out_conv.relprop(R) + q, k, v, attn_mask = rec.get_activations('q', 'k', 'v', 'attn_mask') + # TODO relprop with taylor expansion? 
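+            # Reuse the forward attention computation inside the generic LRP rule (see lib/layers/lrp.py)
+            # to redistribute the output relevance R over q, k and v.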
+ Rq, Rk, Rv = LRP.relprop(lambda q, k, v: self.attention_core(q, k, v, attn_mask), None, R, q, k, v) + + if rec.get_activation('is_combined'): + Rqkv = tf.concat([Rq, Rk, Rv], axis=2) # [batch, time, 3 * hid_size] + Rinp = self.combined_conv.relprop(Rqkv) + return Rinp + else: + Rkv = tf.concat([Rk, Rv], axis=2) # [batch, time, 2 * hid_size] + Rkvinp = self.kv_conv.relprop(Rkv) + Rqinp = self.query_conv.relprop(Rq) + return {'query_inp': Rqinp, 'kv_inp': Rkvinp} + + def _split_heads(self, x): + """ + Split channels (dimension 3) into multiple heads (dimension 1) + input: (batch_size * ninp * inp_dim) + output: (batch_size * n_heads * ninp * (inp_dim/n_heads)) + """ + old_shape = x.get_shape().dims + dim_size = old_shape[-1] + new_shape = old_shape[:-1] + [self.num_heads] + [dim_size // self.num_heads if dim_size else None] + ret = tf.reshape(x, tf.concat([tf.shape(x)[:-1], [self.num_heads, tf.shape(x)[-1] // self.num_heads]], 0)) + ret.set_shape(new_shape) + return tf.transpose(ret, [0, 2, 1, 3]) # [batch_size * n_heads * ninp * (hid_dim//n_heads)] + + def _combine_heads(self, x): + """ + Inverse of split heads + input: (batch_size * n_heads * ninp * (inp_dim/n_heads)) + out: (batch_size * ninp * inp_dim) + """ + x = tf.transpose(x, [0, 2, 1, 3]) + old_shape = x.get_shape().dims + a, b = old_shape[-2:] + new_shape = old_shape[:-2] + [a * b if a and b else None] + ret = tf.reshape(x, tf.concat([tf.shape(x)[:-2], [tf.shape(x)[-2] * tf.shape(x)[-1]]], 0)) + ret.set_shape(new_shape) + return ret + + + +class MultiHeadAttnConcrete(MultiHeadAttn): + """ + Multihead scaled-dot-product attention with input/output transformations. + This is the modification with scalar gates to each head, which enables head pruning introduced in https://arxiv.org/abs/1905.09418 + """ + + def __init__( + self, name, inp_size, + key_depth, value_depth, output_depth, + num_heads, attn_dropout, attn_value_dropout, + kv_inp_size=None, _format='combined', + gate_hp={'l0_penalty': 1.0}, + ): + super().__init__(name, inp_size, + key_depth, value_depth, output_depth, + num_heads, attn_dropout, attn_value_dropout, + kv_inp_size=kv_inp_size, _format=_format) + + self.gate_hp = gate_hp + + with tf.variable_scope(name): + self.scope = tf.get_variable_scope() + self.gate = ConcreteGate('gate', shape=[1, self.num_heads, 1, 1], **self.gate_hp) + + + def __call__(self, query_inp, attn_mask, kv_inp=None, kv=None): + """ + query_inp: [batch_size * n_q * inp_dim] + attn_mask: [batch_size * 1 * n_q * n_kv] + kv_inp: [batch_size * n_kv * inp_dim] + ----------------------------------------------- + results: [batch_size * n_q * output_depth] + """ + assert kv is None or kv_inp is None, "please only feed one of kv or kv_inp" + with tf.name_scope(self.name) as scope: + if kv_inp is not None or kv is not None: + q = self.query_conv(query_inp) + if kv is None: + kv = self.kv_conv(kv_inp) + k, v = tf.split(kv, [self.key_depth, self.value_depth], axis=2) + else: + combined = self.combined_conv(query_inp) + q, k, v = tf.split(combined, [self.key_depth, self.key_depth, self.value_depth], axis=2) + q = self._split_heads(q) # [batch_size * n_heads * n_q * (k_dim/n_heads)] + k = self._split_heads(k) # [batch_size * n_heads * n_kv * (k_dim/n_heads)] + v = self._split_heads(v) # [batch_size * n_heads * n_kv * (v_dim/n_heads)] + + key_depth_per_head = self.key_depth / self.num_heads + q = q / math.sqrt(key_depth_per_head) + + # Dot-product attention + # logits: (batch_size * n_heads * n_q * n_kv) + attn_bias = MultiHeadAttn.ATTN_BIAS_VALUE * (1 - 
attn_mask) + logits = tf.matmul( + q, + tf.transpose(k, perm=[0, 1, 3, 2])) + attn_bias + weights = tf.nn.softmax(logits) + + tf.add_to_collection("AttnWeights", weights) + + tf.add_to_collection(lib.meta.ATTENTIONS, lib.meta.Attention(scope, weights, logits, attn_mask)) + + if is_dropout_enabled(): + weights = dropout(weights, 1.0 - self.attn_dropout) + x = tf.matmul( + weights, # [batch_size * n_heads * n_q * n_kv] + v # [batch_size * n_heads * n_kv * (v_deph/n_heads)] + ) + # x: [batch, n_heads, n_q, (v_deph/n_heads)] + + # ======================== apply the gate ======================== + gated_x = self.gate(x) + + tf.add_to_collection("CONCRETE", self.gate.get_sparsity_rate()) + tf.add_to_collection("GATEVALUES", self.gate.get_gates(False)) + # ================================================================== + + combined_x = self._combine_heads(gated_x) + + if is_dropout_enabled(): + combined_x = dropout(combined_x, 1.0 - self.attn_value_dropout) + + outputs = self.out_conv(combined_x) + + return outputs + + + +class MultiHeadAttnFixedAliveHeads(MultiHeadAttn): + """ + Multihead scaled-dot-product attention with input/output transformations. + This is the modification with constant binary gates for each head, + which specify which heads are present. + Need to pass 'head_gate' parameter, which the list of num_heads values, one for each head. + """ + + def __init__( + self, name, inp_size, + key_depth, value_depth, output_depth, + num_heads, attn_dropout, attn_value_dropout, + kv_inp_size=None, _format='combined', + head_gate=None, + ): + super().__init__(name, inp_size, + key_depth, value_depth, output_depth, + num_heads, attn_dropout, attn_value_dropout, + kv_inp_size=kv_inp_size, _format=_format) + + assert head_gate is not None, "You must feed values for head gates" + self.head_gate = head_gate + + with tf.variable_scope(name): + self.scope = tf.get_variable_scope() + self.gate = tf.constant(self.head_gate, dtype=tf.float32)[None, :, None, None] + + + def __call__(self, query_inp, attn_mask, kv_inp=None, kv=None): + """ + query_inp: [batch_size * n_q * inp_dim] + attn_mask: [batch_size * 1 * n_q * n_kv] + kv_inp: [batch_size * n_kv * inp_dim] + ----------------------------------------------- + results: [batch_size * n_q * output_depth] + """ + assert kv is None or kv_inp is None, "please only feed one of kv or kv_inp" + with tf.name_scope(self.name) as scope: + if kv_inp is not None or kv is not None: + q = self.query_conv(query_inp) + if kv is None: + kv = self.kv_conv(kv_inp) + k, v = tf.split(kv, [self.key_depth, self.value_depth], axis=2) + else: + combined = self.combined_conv(query_inp) + q, k, v = tf.split(combined, [self.key_depth, self.key_depth, self.value_depth], axis=2) + q = self._split_heads(q) # [batch_size * n_heads * n_q * (k_dim/n_heads)] + k = self._split_heads(k) # [batch_size * n_heads * n_kv * (k_dim/n_heads)] + v = self._split_heads(v) # [batch_size * n_heads * n_kv * (v_dim/n_heads)] + + key_depth_per_head = self.key_depth / self.num_heads + q = q / math.sqrt(key_depth_per_head) + + # Dot-product attention + # logits: (batch_size * n_heads * n_q * n_kv) + attn_bias = MultiHeadAttn.ATTN_BIAS_VALUE * (1 - attn_mask) + logits = tf.matmul( + q, + tf.transpose(k, perm=[0, 1, 3, 2])) + attn_bias + weights = tf.nn.softmax(logits) + + tf.add_to_collection("AttnWeights", weights) + + tf.add_to_collection(lib.meta.ATTENTIONS, lib.meta.Attention(scope, weights, logits, attn_mask)) + + if is_dropout_enabled(): + weights = dropout(weights, 1.0 - self.attn_dropout) + x = 
tf.matmul( + weights, # [batch_size * n_heads * n_q * n_kv] + v # [batch_size * n_heads * n_kv * (v_deph/n_heads)] + ) + # x: [batch, n_heads, n_q, (v_deph/n_heads)] + + # ======================== apply the gate ======================== + gated_x = self.gate * x + # ================================================================== + + combined_x = self._combine_heads(gated_x) + + if is_dropout_enabled(): + combined_x = dropout(combined_x, 1.0 - self.attn_value_dropout) + + outputs = self.out_conv(combined_x) + + return outputs + diff --git a/lib/layers/basic.py b/lib/layers/basic.py new file mode 100644 index 0000000..51fe6a7 --- /dev/null +++ b/lib/layers/basic.py @@ -0,0 +1,392 @@ +# Basic NN layers + +import lib + +import tensorflow as tf +from ..util import nop_ctx +from ..ops import record_activations as rec +from .lrp import LRP +from ..ops.basic import * + +############################################################################### +# # +# LAYERS # +# # +############################################################################### + + + +## ---------------------------------------------------------------------------- +# Dense +class Dense: + def __init__( + self, name, + inp_size, out_size, activ=tf.tanh, + matrix=None, bias=None, + matrix_initializer=None, bias_initializer=None): + + """ + /W + /b + + User can explicitly specify matrix to use instead of W (/W is + not created then), but this is not recommended to external users. + """ + self.name = name + self.activ = activ + self.inp_size = inp_size + self.out_size = out_size + + with tf.variable_scope(name) as self.scope: + if matrix is None: + self.W = get_model_variable('W', shape=[inp_size, out_size], initializer=matrix_initializer) + else: + self.W = matrix + + if bias is None: + self.b = get_model_variable('b', shape=[out_size], initializer=bias_initializer) + else: + self.b = bias + + def __call__(self, inp): + """ + inp: [..., inp_size] + -------------------- + Ret: [..., out_size] + """ + with tf.variable_scope(self.scope): + out = self.activ(dot(inp, self.W) + self.b) + out.set_shape([None] * (out.shape.ndims - 1) + [self.out_size]) + if rec.is_recorded(): + rec.save_activations(inp=inp, out=out) + return out + + def relprop(self, output_relevance): + """ + computes input relevance given output_relevance + :param output_relevance: relevance w.r.t. 
layer output, [*dims, out_size] + notation from DOI:10.1371/journal.pone.0130140, Eq 60 + """ + # make two copies of the layer: one with positive params and one with negative + clone_self = lambda W, b, activ: Dense(self.name, self.inp_size, self.out_size, + activ, matrix=W, bias=b) + f_positive = clone_self(tf.maximum(self.W, LRP.eps), b=0, activ=nop) + f_negative = clone_self(tf.minimum(self.W, -LRP.eps), b=0, activ=nop) # the dark side of me + + with tf.variable_scope(self.scope): + inp, out = rec.get_activations('inp', 'out') + # inp: [*dims, inp_size], out: [*dims, out_size] + input_relevance = LRP.relprop(f_positive, f_negative, output_relevance, inp) + return input_relevance + + @property + def input_size(self): + return self.inp_size + + @property + def output_size(self): + return self.out_size + +## ---------------------------------------------------------------------------- +# Embedding + +class Embedding: + def __init__(self, name, voc_size, emb_size, matrix=None, initializer=None, device=''): + """ + Parameters: + + /mat + """ + self.name = name + self.voc_size = voc_size + self.emb_size = emb_size + self.device = device + + if matrix is not None: + self.mat = matrix + else: + with tf.variable_scope(name), (tf.device(device) if device is not None else nop_ctx()): + self.mat = get_model_variable('mat', shape=[voc_size, emb_size], initializer=initializer) + + def __call__(self, inp, gumbel=False): + """ + inp: [...] + -------------------- + Ret: [..., emb_size] + """ + with tf.name_scope(self.name), (tf.device(self.device) if self.device is not None else nop_ctx()): + return tf.gather(self.mat, inp) if not gumbel else dot(inp, self.mat) + +## ---------------------------------------------------------------------------- +# LayerNorm + +class LayerNorm: + """ + Performs Layer Normalization + """ + def __init__(self, name, inp_size, epsilon=1e-6): + self.name = name + self.epsilon = epsilon + + with tf.variable_scope(name): + self.scale = get_model_variable('scale', shape=[inp_size], initializer=tf.ones_initializer()) + self.bias = get_model_variable('bias', shape=[inp_size], initializer=tf.zeros_initializer()) + + def __call__(self, inp): + with tf.variable_scope(self.name): + mean = tf.reduce_mean(inp, axis=[-1], keep_dims=True) + variance = tf.reduce_mean(tf.square(inp - mean), axis=[-1], keep_dims=True) + norm_x = (inp - mean) * tf.rsqrt(variance + self.epsilon) + return norm_x * self.scale + self.bias + + def relprop(self, R): + #TODO find out the "canonic" way to relrop through layernorm + return R + +## ---------------------------------------------------------------------------- +# ResidualWrapper + + +class Wrapper: + """ Reflection-style wrapper, code from http://code.activestate.com/recipes/577555-object-wrapper-class/ """ + def __init__(self, wrapped_layer): + self.wrapped_layer = wrapped_layer + + def __getattr__(self, attr): + if attr in self.__dict__: + return getattr(self, attr) + return getattr(self.wrapped_layer, attr) + + +class ResidualLayerWrapper(Wrapper): + def __init__(self, name, wrapped_layer, inp_size, out_size, steps='ldan', dropout=0, dropout_seed=None): + """ + Applies any number of residual connection, dropout and/or layer normalization before or after wrapped layer + :param steps: a sequence of operations to perform, containing any combination of: + - 'l' - call wrapped [l]ayer, this operation should be used exactly once + - 'd' - apply [d]ropout with p = dropout and seed = dropout_seed + - 'a' - [a]dd inputs to output (residual connection) + - 'n' - 
apply layer [n]ormalization here, can only be done once + """ + assert steps.count('l') == 1, "residual wrapper must call wrapped layer exactly once" + assert steps.count('n') <= 1, "in the current implementaion, there can be at most one layer normalization step" + assert inp_size == out_size or 'a' not in steps, "residual step only works if inp_size == out_size" + self.name = name + self.wrapped_layer = wrapped_layer + + if 'n' in steps: + ln_size = inp_size if steps.index('n') < steps.index('l') else out_size + with tf.variable_scope(name) as self.scope: + self.norm_layer = LayerNorm("layer_norm", ln_size) + + self.steps = steps + self.preprocess_steps = steps[:steps.index('l')] + self.postprocess_steps = steps[steps.index('l') + 1:] + self.dropout = dropout + self.dropout_seed = dropout_seed + + def __call__(self, inp, *args, **kwargs): + out = self.preprocess(inp) + out = self.wrapped_layer(out, *args, **kwargs) + out = self.postprocess(out, inp) + return out + + def preprocess(self, inp): + return self._perform(self.preprocess_steps, inp) + + def postprocess(self, out, inp=None): + return self._perform(self.postprocess_steps, out, inp=inp) + + def _perform(self, steps, out, inp=None): + with tf.variable_scope(self.scope): + if inp is None: + inp = out + + for s in steps: + if s == 'd': + if is_dropout_enabled(): + out = lib.ops.dropout(out, 1.0 - self.dropout, seed=self.dropout_seed) + elif s == 'a': + rec.save_activations(inp=inp, out_pre_residual=out) + out += inp + elif s == 'n': + out = self.norm_layer(out) + else: + raise RuntimeError("Unknown process step: %s" % s) + return out + + def relprop(self, R, main_key=None): + original_scale = tf.reduce_sum(abs(R)) + with tf.variable_scope(self.scope): + Rinp_residual = 0.0 + for s in self.steps[::-1]: + if s == 'l': + R = self.wrapped_layer.relprop(R) + if isinstance(R, dict): + assert main_key is not None + R_dict = R + R = R_dict[main_key] + elif s == 'a': + inp, out = rec.get_activations('inp', 'out_pre_residual') + Rinp_residual, R = LRP.relprop(lambda a, b: a + b, None, R, inp, out) + elif s == 'n': + R = self.norm_layer.relprop(R) + + pre_residual_scale = tf.reduce_sum(abs(R) + abs(Rinp_residual)) + + R = R + Rinp_residual + R = R * pre_residual_scale / tf.reduce_sum(tf.abs(R)) + if main_key is not None: + R_dict = dict(R_dict) + R_dict[main_key] = R + total_scale = sum(tf.reduce_sum(abs(relevance)) for relevance in R_dict.values()) + R_dict = {key: value * original_scale / total_scale + for key, value in R_dict.items()} + return R_dict + else: + return R + + +############################################################################### +# # +# SEQUENCE LOSSES # +# # +############################################################################### + + +class SequenceLossBase: + def rdo_to_logits(self, *args, **kwargs): + raise NotImplementedError() + + def rdo_to_logits__predict(self, *args, **kwargs): + return self.rdo_to_logits(*args, **kwargs) + + +class LossXent(SequenceLossBase): + def __init__( + self, name, rdo_size, voc, hp, + matrix=None, bias=None, + matrix_initializer=None, bias_initializer=tf.zeros_initializer(), + ): + """ + Parameters: + + Dense: /logits + """ + if 'lm_path' in hp: + raise NotImplementedError("LM fusion not implemented") + + self.name = name + self.rdo_size = rdo_size + self.voc_size = voc.size() + + self.bos = voc.bos + self.label_smoothing = hp.get('label_smoothing', 0) + + with tf.variable_scope(name): + self._rdo_to_logits = Dense( + 'logits', rdo_size, self.voc_size, activ=nop, + 
matrix=matrix, bias=bias, + matrix_initializer=matrix_initializer, bias_initializer=bias_initializer) + + def __call__(self, rdo, out, out_len): + """ + rdo: [batch_size, ninp, rdo_size] + out: [batch_size, ninp], dtype=int + out_len: [batch_size] + inp_words: [batch_size, ninp], dtype=string + attn_P_argmax: [batch_size, ninp], dtype=int + -------------------------- + Ret: [batch_size] + """ + logits = self.rdo_to_logits(rdo, out, out_len) # [batch_size, ninp, voc_size] + return self.logits2loss(logits, out, out_len) + + def rdo_to_logits(self, rdo, out, out_len): + """ + compute logits in training mode + :param rdo: pre-final activations float32[batch, num_outputs, hid_size] + :param out: output sequence, padded with EOS int64[batch, num_outputs] + :param out_len: lengths of outputs in :out: excluding padding, int64[batch] + """ + return self._rdo_to_logits(rdo) + + def logits2loss(self, logits, out, out_len, reduce_rows=True): + if self.label_smoothing: + voc_size = tf.shape(logits)[-1] + smooth_positives = 1.0 - self.label_smoothing + smooth_negatives = self.label_smoothing / tf.to_float(voc_size - 1) + onehot_labels = tf.one_hot(out, depth=voc_size, on_value=smooth_positives, off_value=smooth_negatives) + + losses = tf.nn.softmax_cross_entropy_with_logits( + labels=onehot_labels, + logits=logits, + name="xentropy") + + # Normalizing constant is the best cross-entropy value with soft targets. + # We subtract it just for readability, makes no difference on learning. + normalizing = -(smooth_positives * tf.log(smooth_positives) + + tf.to_float(voc_size - 1) * smooth_negatives * tf.log(smooth_negatives + 1e-20)) + losses -= normalizing + else: + losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=out) + + losses *= tf.sequence_mask(out_len, maxlen=tf.shape(out)[1], dtype=logits.dtype) + + if reduce_rows: + return tf.reduce_sum(losses, axis=1) + else: + return losses + + def rdo_to_logits__predict(self, rdo, prefix): + """ like rdo_to_logits, but used in beam search """ + return self._rdo_to_logits(rdo) + + +LossXentLm = LossXent # alias + + +class FFN: + """ + Feed-forward layer + """ + + def __init__(self, name, + inp_size, hid_size, out_size, + relu_dropout): + assert isinstance(hid_size, int), "List of hidden sizes not is not supported" + self.name = name + self.relu_dropout = relu_dropout + + with tf.variable_scope(name): + self.first_conv = Dense( + 'conv1', + inp_size, hid_size, + activ=tf.nn.relu, + bias_initializer=tf.zeros_initializer()) + + self.second_conv = Dense( + 'conv2', + hid_size, out_size, + activ=lambda x: x, + bias_initializer=tf.zeros_initializer()) + + def __call__(self, inputs, summarize_preactivations=False): + """ + inp: [batch_size * ninp * inp_dim] + --------------------------------- + out: [batch_size * ninp * out_dim] + """ + with tf.variable_scope(self.name): + hidden = self.first_conv(inputs) + if is_dropout_enabled(): + hidden = dropout(hidden, 1.0 - self.relu_dropout) + + outputs = self.second_conv(hidden) + + return outputs + + def relprop(self, R): + R = self.second_conv.relprop(R) + R = self.first_conv.relprop(R) + return R \ No newline at end of file diff --git a/lib/layers/concrete_gate.py b/lib/layers/concrete_gate.py new file mode 100644 index 0000000..b5d5a01 --- /dev/null +++ b/lib/layers/concrete_gate.py @@ -0,0 +1,97 @@ +import tensorflow as tf +from warnings import warn +import lib + + +class ConcreteGate: + """ + A gate made of stretched concrete distribution (using experimental Stretchable Concrete™) + Can be 
applied to sparsify neural network activations or weights. + Example usage: https://gist.github.com/justheuristic/1118a14a798b2b6d47789f7e6f511abd + :param shape: shape of gate variable. can be broadcasted. + e.g. if you want to apply gate to tensor [batch, length, units] over units axis, + your shape should be [1, 1, units] + :param temperature: concrete sigmoid temperature, should be in (0, 1] range + lower values yield better approximation to actual discrete gate but train longer + :param stretch_limits: min and max value of gate before it is clipped to [0, 1] + min value should be negative in order to compute l0 penalty as in https://arxiv.org/pdf/1712.01312.pdf + however, you can also use tf.nn.sigmoid(log_a) as regularizer if min, max = 0, 1 + :param l0_penalty: coefficient on the regularizer that minimizes l0 norm of gated value + :param l2_penalty: coefficient on the regularizer that minimizes l2 norm of gated value + :param eps: a small additive value used to avoid NaNs + :param hard: if True, gates are binarized to {0, 1} but backprop is still performed as if they were concrete + :param local_rep: if True, samples a different gumbel noise tensor for each sample in batch, + by default, noise is sampled using shape param as size. + + """ + + def __init__(self, name, shape, temperature=0.33, stretch_limits=(-0.1, 1.1), + l0_penalty=0.0, l2_penalty=0.0, eps=1e-6, hard=False, local_rep=False): + self.name = name + self.temperature, self.stretch_limits, self.eps = temperature, stretch_limits, eps + self.l0_penalty, self.l2_penalty = l0_penalty, l2_penalty + self.hard, self.local_rep = hard, local_rep + with tf.variable_scope(name): + self.log_a = lib.ops.get_model_variable("log_a", shape=shape) + + def __call__(self, values, is_train=None, axis=None, reg_collection=tf.GraphKeys.REGULARIZATION_LOSSES): + """ applies gate to values, if is_train, adds regularizer to reg_collection """ + is_train = lib.layers.basic.is_dropout_enabled() if is_train is None else is_train + gates = self.get_gates(is_train, shape=tf.shape(values) if self.local_rep else None) + + if self.l0_penalty != 0 or self.l2_penalty != 0: + reg = self.get_penalty(values=values, axis=axis) + tf.add_to_collection(reg_collection, tf.identity(reg, name='concrete_gate_reg')) + return values * gates + + def get_gates(self, is_train, shape=None): + """ samples gate activations in [0, 1] interval """ + low, high = self.stretch_limits + with tf.name_scope(self.name): + if is_train: + shape = tf.shape(self.log_a) if shape is None else shape + noise = tf.random_uniform(shape, self.eps, 1.0 - self.eps) + concrete = tf.nn.sigmoid((tf.log(noise) - tf.log(1 - noise) + self.log_a) / self.temperature) + else: + concrete = tf.nn.sigmoid(self.log_a) + + stretched_concrete = concrete * (high - low) + low + clipped_concrete = tf.clip_by_value(stretched_concrete, 0, 1) + if self.hard: + hard_concrete = tf.to_float(tf.greater(clipped_concrete, 0.5)) + clipped_concrete = clipped_concrete + tf.stop_gradient(hard_concrete - clipped_concrete) + return clipped_concrete + + def get_penalty(self, values=None, axis=None): + """ + Computes l0 and l2 penalties. 
For l2 penalty one must also provide the sparsified values + (usually activations or weights) before they are multiplied by the gate + Returns the regularizer value that should to be MINIMIZED (negative logprior) + """ + if self.l0_penalty == self.l2_penalty == 0: + warn("get_penalty() is called with both penalties set to 0") + low, high = self.stretch_limits + assert low < 0.0, "p_gate_closed can be computed only if lower stretch limit is negative" + with tf.name_scope(self.name): + # compute p(gate_is_closed) = cdf(stretched_sigmoid < 0) + p_open = tf.nn.sigmoid(self.log_a - self.temperature * tf.log(-low / high)) + p_open = tf.clip_by_value(p_open, self.eps, 1.0 - self.eps) + + total_reg = 0.0 + if self.l0_penalty != 0: + if values != None and self.local_rep: + p_open += tf.zeros_like(values) # broadcast shape to account for values + l0_reg = self.l0_penalty * tf.reduce_sum(p_open, axis=axis) + total_reg += tf.reduce_mean(l0_reg) + + if self.l2_penalty != 0: + assert values is not None + l2_reg = 0.5 * self.l2_penalty * p_open * tf.reduce_sum(values ** 2, axis=axis) + total_reg += tf.reduce_mean(l2_reg) + + return total_reg + + def get_sparsity_rate(self, is_train=False): + """ Computes the fraction of gates which are now active (non-zero) """ + is_nonzero = tf.not_equal(self.get_gates(is_train), 0.0) + return tf.reduce_mean(tf.to_float(is_nonzero)) \ No newline at end of file diff --git a/lib/layers/lrp.py b/lib/layers/lrp.py new file mode 100644 index 0000000..566da1c --- /dev/null +++ b/lib/layers/lrp.py @@ -0,0 +1,58 @@ +import tensorflow as tf +from ..ops import record_activations as rec + + +class LRP: + """ Helper class for layerwise relevance propagation """ + alpha = 1.0 + beta = 0.0 + eps = 1e-7 + crop_function = abs + + @classmethod + def relprop(cls, f_positive, f_negative, output_relevance, *inps): + """ + computes input relevance given output_relevance using z+ rule + works for linear layers, convolutions, poolings, etc. + notation from DOI:10.1371/journal.pone.0130140, Eq 60 + :param f_positive: forward function with positive weights (if any) and no nonlinearities + :param f_negative: forward function with negative weights and no nonlinearities + if there's no weights, set f_negative to None. Only used for alpha-beta LRP + :param output_relevance: relevance w.r.t. 
layer output + :param inps: a list of layer inputs + """ + assert len(inps) > 0, "please provide at least one input" + with rec.do_not_record(): + alpha, beta, eps = cls.alpha, cls.beta, cls.eps + inps = [inp + eps for inp in inps] + + # ouput relevance: [*dims, out_size] + z_positive = f_positive(*inps) + s_positive = cls.alpha * output_relevance / z_positive # [*dims, out_size] + positive_relevances = tf.gradients(z_positive, inps, grad_ys=s_positive) + # ^-- list of [*dims, inp_size] + + if cls.beta != 0 and f_negative is not None: + z_negative = f_negative(*inps) + s_negative = -cls.beta * output_relevance / z_negative # [*dims, out_size] + negative_relevances = tf.gradients(z_negative, inps, grad_ys=s_negative) + # ^-- list of [*dims, inp_size] + else: + negative_relevances = [0.0] * len(inps) + + inp_relevances = [ + inp * (rel_pos + rel_neg) + for inp, rel_pos, rel_neg in zip(inps, positive_relevances, negative_relevances) + ] + + return cls.rescale(output_relevance, *inp_relevances) + + + @classmethod + def rescale(cls, reference, *inputs, axis=None): + inputs = [cls.crop_function(inp) for inp in inputs] + ref_scale = tf.reduce_sum(reference, axis=axis, keep_dims=axis is not None) + inp_scales = [tf.reduce_sum(inp, axis=axis, keep_dims=axis is not None) for inp in inputs] + total_inp_scale = sum(inp_scales) + cls.eps + inputs = [inp * (ref_scale / total_inp_scale) for inp in inputs] + return inputs[0] if len(inputs) == 1 else inputs diff --git a/lib/meta.py b/lib/meta.py new file mode 100644 index 0000000..5d57e60 --- /dev/null +++ b/lib/meta.py @@ -0,0 +1,46 @@ +import tensorflow as tf +import sys + +from collections import namedtuple +from contextlib import contextmanager + +## Collection keys + +# Collection of tensors representing layer activations in network +ACTIVATIONS = tf.GraphKeys.ACTIVATIONS + +# Collection of Attention objects +ATTENTIONS = "attentions" +SUMMARIES_ZOO = "summaries_zoo" +PARAMS_SUMMARIES = "params_summaries" + + +Attention = namedtuple('Attention', ['name', 'weights', 'logits', 'mask']) + + +def get_indexed_collection(coll, scope, root_scope=None): + if root_scope is None: + root_scope = tf.contrib.framework.get_name_scope() + + full_scope = root_scope + '/' + scope + + def normalize_name(n): + n = n[len(full_scope)+1:] + if n.endswith(':0'): + n = n[:-2] + if n.endswith('/'): + n = n[:-1] + return n + + return dict((normalize_name(t.name), t) for t in tf.get_collection(coll, full_scope + '/.*')) + + +@contextmanager +def lock_collections(collections): + collection_states = [tf.get_collection(coll) for coll in collections] + yield + for coll, old_coll_state in zip(collections, collection_states): + new_coll_state = tf.get_collection_ref(coll) + if old_coll_state != new_coll_state: + print("! Changes in collection %s will be ignored!" % coll, flush=True, file=sys.stderr) + new_coll_state[:] = old_coll_state # Replace collection state with old one diff --git a/lib/ops/__init__.py b/lib/ops/__init__.py new file mode 100644 index 0000000..ba74a64 --- /dev/null +++ b/lib/ops/__init__.py @@ -0,0 +1,2 @@ +from . 
import basic, mpi, sliced_argmax, devices, record_activations +from .basic import * \ No newline at end of file diff --git a/lib/ops/basic.py b/lib/ops/basic.py new file mode 100644 index 0000000..63b5e84 --- /dev/null +++ b/lib/ops/basic.py @@ -0,0 +1,164 @@ +# Basic TF operations +import threading +from contextlib import contextmanager + +import tensorflow as tf +import hashlib +from copy import copy + + +def get_seed_from_name(name): + full_name = '/'.join([tf.get_variable_scope().name, name]) + return int(hashlib.md5(full_name.encode()).hexdigest()[:8], 16) + + +def default_initializer(seed, dtype): + scope_initializer = tf.get_variable_scope().initializer + if scope_initializer is not None: + return scope_initializer + try: + return tf.initializers.glorot_uniform(seed, dtype) + except: + return tf.glorot_uniform_initializer(seed, dtype) + + +def get_model_variable(name, **kwargs): + """ Get variable from MODEL_VARIABLES collection with initializer seeded from its name, not id """ + + if kwargs.get('initializer') is None: + kwargs['initializer'] = default_initializer(seed=get_seed_from_name(name), dtype=kwargs.get('dtype', tf.float32)) + elif hasattr(kwargs['initializer'], 'seed') and kwargs['initializer'].seed is None: + kwargs['initializer'] = copy(kwargs['initializer']) + kwargs['initializer'].seed = get_seed_from_name(name) + + return tf.contrib.framework.model_variable(name, **kwargs) + + +def dot(x, y): + """ + x: [..., a] + y: [a, ...] + ------------- + Ret: [..., ...] + """ + x_ndim = x.get_shape().ndims + y_ndim = y.get_shape().ndims + etc_x = tf.slice(tf.shape(x), [0], [x_ndim-1]) + etc_y = tf.slice(tf.shape(y), [1], [-1]) + a = tf.shape(y)[0] + + # Reshape forth. + if x_ndim != 2: + x = tf.reshape(x, [-1, a]) + if y_ndim != 2: + y = tf.reshape(y, [a, -1]) + + # Compute + ret = tf.matmul(x, y) + + # Reshape back. 
+ if x_ndim != 2 or y_ndim != 2: + ret = tf.reshape(ret, tf.concat([etc_x, etc_y], 0)) + + return ret + + +def sequence_mask(lengths, dtype, maxlen=None): + """ + WARNING: THis func produces Time-major tensor + lengths: [batch_size] + ------- + out: [maxlen, batch_size] + """ + lengths = tf.cast(lengths, tf.int32) + if maxlen is not None: + maxlen = tf.cast(maxlen, tf.int32) + return tf.transpose(tf.sequence_mask(lengths, dtype=dtype, maxlen=maxlen)) + + +def infer_length(seq, eos=1, time_major=False): + """ + compute length given output indices and eos code + :param seq: tf matrix [time,batch] if time_major else [batch,time] + :param eos: integer index of end-of-sentence token + :returns: lengths, int32 vector of [batch_size] + """ + axis = 0 if time_major else 1 + is_eos = tf.cast(tf.equal(seq, eos), 'int32') + count_eos = tf.cumsum(is_eos, axis=axis, exclusive=True) + lengths = tf.reduce_sum(tf.cast(tf.equal(count_eos, 0), 'int32'), axis=axis) + return lengths + + +def infer_mask(seq, eos=1, time_major=False, dtype=tf.bool): + """ + compute mask + :param seq: tf matrix [time,batch] if time_major else [batch,time] + :param eos: integer index of end-of-sentence token + :returns: mask, matrix of same shape as seq and of given dtype (bool by default) + """ + lengths = infer_length(seq, eos=eos, time_major=time_major) + mask_fn = sequence_mask if time_major else tf.sequence_mask + maxlen = tf.shape(seq)[0 if time_major else 1] + return mask_fn(lengths, dtype=dtype, maxlen=maxlen) + + +def dropout(x, keep_prob, *args, **kwargs): + """This is a hack to save memory if there is no dropout""" + if keep_prob >= 1: + return x + return tf.nn.dropout(x, keep_prob, *args, **kwargs) + + +def group(*ops): + """ + Like tf.group(), but returns tf.constant(0) instead of tf.no_op(), + which makes it suitable for use in tf.cond(). + """ + with tf.control_dependencies(ops): + return tf.constant(0) + + +def select_values_over_last_axis(values, indices): + """ + Auxiliary function to select logits corresponding to chosen tokens. 
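+    For example (illustrative shapes): for values of shape [2, 3, 4] and indices of
+    shape [2, 3], the result r has shape [2, 3] with r[b, t] == values[b, t, indices[b, t]].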
+ :param values: logits for all actions: float32[batch,tick,action] + :param indices: action ids int32[batch,tick] + :returns: values selected for the given actions: float[batch,tick] + """ + assert values.shape.ndims == 3 and indices.shape.ndims == 2 + batch_size, seq_len = tf.shape(indices)[0], tf.shape(indices)[1] + + time_i, batch_i = tf.meshgrid(tf.range(0, seq_len, dtype=indices.dtype), + tf.range(0, batch_size, dtype=indices.dtype)) + + indices_nd = tf.stack([batch_i, time_i, indices], axis=-1) + + return tf.gather_nd(values, indices_nd) + + +def nop(x): + return x + + +def kl_divergence_with_logits(p_logits, q_logits): + return tf.reduce_sum(tf.nn.softmax(p_logits) * (tf.nn.log_softmax(p_logits) - tf.nn.log_softmax(q_logits)), axis=-1) + + +_tls = threading.local() + + +def is_dropout_enabled(): + if not hasattr(_tls, 'dropout_enabled'): + _tls.dropout_enabled = True + return _tls.dropout_enabled + + +@contextmanager +def dropout_scope(enabled): + was_enabled = is_dropout_enabled() + _tls.dropout_enabled = enabled + try: + yield + finally: + _tls.dropout_enabled = was_enabled \ No newline at end of file diff --git a/lib/ops/devices.py b/lib/ops/devices.py new file mode 100644 index 0000000..08936bf --- /dev/null +++ b/lib/ops/devices.py @@ -0,0 +1,15 @@ +import tensorflow as tf + + +def list_devices(session=None): + if session is None: + session = session or tf.get_default_session() + return session.list_devices() + + +def list_gpu_devices(session=None): + return [x for x in list_devices(session) if x.device_type == 'GPU'] + + +def have_gpu(): + return len(list_gpu_devices()) != 0 diff --git a/lib/ops/mpi/__init__.py b/lib/ops/mpi/__init__.py new file mode 100644 index 0000000..2ed8327 --- /dev/null +++ b/lib/ops/mpi/__init__.py @@ -0,0 +1,372 @@ +# Copyright 2016 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +# pylint: disable=g-short-docstring-punctuation +"""## Communicating Between Processes with MPI + +TensorFlow natively provides inter-device communication through send and +receive ops and inter-node communication through Distributed TensorFlow, based +on the same send and receive abstractions. On HPC clusters where Infiniband or +other high-speed node interconnects are available, these can end up being +insufficient for synchronous data-parallel training (without asynchronous +gradient descent). This module implements a variety of MPI ops which can take +advantage of hardware-specific MPI libraries for efficient communication. + +In order to use this module, TensorFlow must be built with an MPI library, +which can be provided to the `./configure` script at build time. As a user of +TensorFlow, you will need to build TensorFlow yourself to select the MPI +library to use; to do so, follow the [instructions for building TensorFlow from +source](https://www.tensorflow.org/get_started/os_setup#installing_from_sources). 
+ +### Utility Ops + +In addition to reductions and gathers, this module provides utility operations +for detecting the running MPI configuration. + +Example: + +```python +from tensorflow.contrib import mpi + +# Use `mpi.Session` instead of `tf.Session` +with mpi.Session() as session: + rank = session.run(mpi.rank()) + print("My MPI Rank:", rank) + + if rank == 0: + print("MPI Size:", session.run(mpi.size())) +``` + +@@rank +@@size + +### Ring Allreduce and Allgather + +When summing or averaging tensors across many processes, communication can +easily become a bottleneck. A naive implementation will send all the tensor +values to the same process, perform the reduction, and then broadcast the +values back to all other processes, effectively creating a synchronous +parameter server in one process. However, the process responsible for +performing the reduction will have to receive and send a massive amount of data +which scales with the number of processes *and* the number of parameters in the +model. + +Instead of centralizing the reduction and having one primary reducer, we can +implement a distributed allreduce or allgather. A bandwidth-optimal allreduce +will end up sending 2(N - 1) values for every value in the input tensor, +and can be implemented with a ring allreduce [1]. (Intuitively, a linear reduce +requires at least (N - 1) sends between the different nodes, and a broadcast of +the result also requires (N - 1) sends, for a total of 2 (N - 1); these two +steps cannot be combined in a clever way to reduce the number of required +sends.) This module implements bandwidth-optimal ring allreduce and ring +allgather operations using MPI; by choosing a hardware-appropriate MPI +implementation (such as OpenMPI with CUDA-IPC support), you can train large +models with synchronous gradient descent with minimal communication overhead. + +In addition to the `allreduce` and `allgather` functions, a convenience +`DistributedOptimizer` wrapper is provided to simplify using these functions +for reducing model gradients. + +Example: + +```python +import tensorflow as tf +from tensorflow.contrib import mpi + +# Construct a simple linear regression model to optimize +W = tf.get_variable("W", shape=[20, 1], dtype=tf.float32) +B = tf.get_variable("B", shape=[1, 1], dtype=tf.float32) +inputs = tf.placeholder("Inputs", shape=[None, 20]) +outputs = tf.placeholder("Outputs", shape=[None, 1]) +loss = tf.nn.l2_loss(tf.matmul(inputs, W) + B - outputs) + +# Training using MPI allreduce with DistributedOptimizer +optimizer = mpi.DistributedOptimizer(tf.train.AdamOptimizer()) +train = optimizer.minimize(loss) + +# Average loss over all ranks, for printing. +# Do not pass this to an optimizer! +avg_loss = mpi.allreduce(loss) + +# On different ranks, feed different input data. +with mpi.Session() as session: + rank = session.run(mpi.rank()) + batch_inputs, batch_outputs = construct_batch_for_rank(rank) + feed_dict = {inputs: batch_inputs, outputs: batch_outputs} + _, l = session.run([train, avg_loss], feed_dict=feed_dict) + print("Average Loss:", l) +``` + +[1] Patarasuk, Pitch and Yuan, Xin. "Bandwidth Optimal All-reduce Algorithms +for Clusters of Workstations". 
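+
+(Concrete illustration of the 2(N - 1) figure above: with N = 4 processes and a tensor of
+M values, the centralized scheme makes one process receive (N - 1) * M = 3M values and then
+broadcast another 3M, whereas the ring allreduce spreads the same 2(N - 1) sends per value
+around the ring, so each process transfers only about 2(N - 1) / N * M = 1.5M values, a
+quantity that stays below 2M however many processes participate.)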
+ +@@Session +@@DistributedOptimizer +@@allreduce +@@allgather +""" + +import tensorflow as tf + +import threading +import importlib +import os + + +_provider = None + + +def get_provider(): + global _provider + if _provider is None: + set_provider('horovod' if os.getenv('OMPI_COMM_WORLD_SIZE') is not None else 'dummy') + return _provider + + +def set_provider(provider, force=False): + global _provider + if _provider is not None and not force: + raise RuntimeError("%r already set as provider" % _provider) + _provider = importlib.import_module('lib.ops.mpi.%s_provider' % provider) + + +def is_master(): + """ + Helper function to identify master + """ + mpi_rank = os.getenv('OMPI_COMM_WORLD_RANK') + return mpi_rank is None or mpi_rank == '0' + + +def is_distributed(): + """ + Helper function to identify if we are in distributed mode + """ + mpi_size = os.getenv('OMPI_COMM_WORLD_SIZE') + return mpi_size is not None and int(mpi_size) > 1 + + +class Session(tf.Session): + """A class for running TensorFlow operations, with copies of the same graph + running distributed across different MPI nodes. + + The primary difference between `tf.Session` and `tf.contrib.mpi.Session` is + that the MPI `Session` ensures that the `Session` options are correct for + use with `tf.contrib.mpi`, and initializes MPI immediately upon the start + of the session. + """ + + def __init__(self, gpu_group=None, gpu_group_size=1, target='', graph=None, config=None): + """Creates a new TensorFlow MPI session. + + Unlike a normal `tf.Session`, an MPI Session may only use a single GPU, + which must be specified in advance before the session is initialized. + In addition, it only uses a single graph evaluation thread, and + initializes MPI immediately upon starting. + + If no `graph` argument is specified when constructing the session, + the default graph will be launched in the session. If you are + using more than one graph (created with `tf.Graph()` in the same + process, you will have to use different sessions for each graph, + but each graph can be used in multiple sessions. In this case, it + is often clearer to pass the graph to be launched explicitly to + the session constructor. + + Args: + gpu: (Optional.) The GPU index to use, or None for CPU only MPI. + graph: (Optional.) The `Graph` to be launched (described above). + config: (Optional.) A `ConfigProto` protocol buffer with configuration + options for the session. + """ + if config is None: + config = tf.ConfigProto() + + if gpu_group is not None: + config.gpu_options.visible_device_list = ','.join(str(gpu_group*gpu_group_size + d) for d in range(gpu_group_size)) + + super(Session, self).__init__(target, graph, config=config) + + # Initialize MPI on the relevant device. + with self.as_default(): + self.run(init()) + + # Setup finalize status and lock to prevent double finalize call + self._mpi_finalized = False + self._mpi_finalize_lock = threading.Lock() + + def close(self): + with self._mpi_finalize_lock: + if not self._mpi_finalized: + # Finalize MPI on the relevant device + self.run(finalize()) + self._mpi_finalized = True + + super(Session, self).close() + + +############################################################################### +# +# TensorFlow MPI operations +# +############################################################################### + + +def size(name=None): + """An op which returns the number of MPI processes. + + This is equivalent to running `MPI_Comm_size(MPI_COMM_WORLD, ...)` to get the + size of the global communicator. 
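+  For example, when launched with `mpirun -np 4` (so OMPI_COMM_WORLD_SIZE=4),
+  `session.run(size())` evaluates to 4; without MPI the dummy provider returns 1.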
+ + Returns: + An integer scalar containing the number of MPI processes. + """ + return get_provider().size(name) + + +def rank(name=None): + """An op which returns the MPI rank of the calling process. + + This is equivalent to running `MPI_Comm_rank(MPI_COMM_WORLD, ...)` to get the + rank of the current process in the global communicator. + + Returns: + An integer scalar with the MPI rank of the calling process. + """ + return get_provider().rank(name) + + +def local_rank(name=None): + """An op which returns the local MPI rank of the calling process, within the + node that it is running on. For example, if there are seven processes running + on a node, their local ranks will be zero through six, inclusive. + + This is equivalent to running `MPI_Comm_rank(...)` on a new communicator + which only includes processes on the same node. + + Returns: + An integer scalar with the local MPI rank of the calling process. + """ + return get_provider().local_rank(name=name) + + +def init(name=None): + """An op which initializes MPI on the device on which it is run. + + All future MPI ops must be run on the same device that the `init` op was run + on. + """ + return get_provider().init(name) + + +def finalize(name=None): + """An op which finalizes MPI on the device on which it is run. + + No future MPI ops must be run on the same device that the `finalize` op was run + on. + """ + return get_provider().finalize(name=name) + + +def allreduce(tensor, average=True, name=None): + """Perform an MPI allreduce on a tf.Tensor or tf.IndexedSlices. + + Arguments: + tensor: tf.Tensor, tf.Variable, or tf.IndexedSlices to reduce. + The shape of the input must be identical across all ranks. + average: If True, computes the average over all ranks. + Otherwise, computes the sum over all ranks. + + This function performs a bandwidth-optimal ring allreduce on the input + tensor. If the input is an tf.IndexedSlices, the function instead does an + allgather on the values and the indices, effectively doing an allreduce on + the represented tensor. + """ + return get_provider().allreduce(tensor, average, name) + + +def allgather(tensor, name=None): + """An op which concatenates the input tensor with the same input tensor on + all other MPI processes. + + The concatenation is done on the first dimension, so the input tensors on the + different processes must have the same rank and shape, except for the first + dimension, which is allowed to be different. + + Returns: + A tensor of the same type as `tensor`, concatenated on dimension zero + across all processes. The shape is identical to the input shape, except for + the first dimension, which may be greater and is the sum of all first + dimensions of the tensors in different MPI processes. + """ + return get_provider().allgather(tensor, name) + + +def broadcast(tensor, name=None): + """Broadcasts value of given tensor from coordinator node to all the others. + + Returns: + Result of broadcast, same shape as `tensor` + """ + return get_provider().broadcast(tensor, name=name) + + +def broadcast_var(ref, allow_uninitialized=False, name=None): + """Broadcasts value of given variable from coordinator node to all the others. 
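+  (Typically used to synchronize initial parameter values from rank 0 to all other
+  workers before training, so that every replica starts from the same model.)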
+ + Returns: + A mutable `tensor`, same as `ref` + """ + return get_provider().broadcast_var(ref, allow_uninitialized=allow_uninitialized, name=name) + + +############################################################################### +# +# Specific MPI operations on Python objects +# +############################################################################### + + +def broadcast_obj(obj, name=None): + """ + Returns: + Broadcasted object, same as input + """ + return get_provider().broadcast_obj(obj, name) + + +def gather_obj(obj, name=None): + """Gathers given Python object from all workers on the coordinator + + Returns: + Gathered object on the coordinator (on all other workers None) + """ + return get_provider().gather_obj(obj, name) + + +def scatter_obj(obj_array, name=None): + """Scatters given array of Python objects to all workers from the coordinator + + Returns: + Object on each worker + """ + return get_provider().scatter_obj(obj_array, name) + + +def allgather_obj(obj, name=None): + """Performs ALLGATHER on the given Python object + + Returns: + Gathered object on all workers + """ + return get_provider().allgather_obj(obj, name) diff --git a/lib/ops/mpi/dummy_provider.py b/lib/ops/mpi/dummy_provider.py new file mode 100644 index 0000000..5faaf77 --- /dev/null +++ b/lib/ops/mpi/dummy_provider.py @@ -0,0 +1,66 @@ +import tensorflow as tf + +############################################################################### +# +# TensorFlow MPI operations +# +############################################################################### + + +def size(name=None): + return tf.constant(1, name=name) + + +def rank(name=None): + return tf.constant(0, name=name) + + +def local_rank(name=None): + return tf.constant(0, name=name) + + +def init(name=None): + return tf.no_op(name=name) + + +def finalize(name=None): + return tf.no_op(name=name) + + +def allreduce(tensor, average=True, name=None): + return tf.stop_gradient(tensor, name=name) # Stop gradient propagation, as in distributed mode + + +def allgather(tensor, name=None): + return tf.stop_gradient(tensor, name=name) + + +def broadcast(tensor, name=None): + return tf.stop_gradient(tensor, name=name) + + +def broadcast_var(ref, allow_uninitialized=False, name=None): + return ref + + +############################################################################### +# +# Specific MPI operations on Python objects +# +############################################################################### + + +def broadcast_obj(obj, name=None): + return obj + + +def gather_obj(obj, name=None): + return [obj] + + +def scatter_obj(obj_array, name=None): + return obj_array[0] + + +def allgather_obj(obj, name=None): + return [obj] diff --git a/lib/ops/mpi/horovod_provider.py b/lib/ops/mpi/horovod_provider.py new file mode 100644 index 0000000..2623bd8 --- /dev/null +++ b/lib/ops/mpi/horovod_provider.py @@ -0,0 +1,152 @@ +import tensorflow as tf +import horovod.tensorflow as hvd +import pickle +import os +import threading + + +############################################################################### +# +# Horovod MPI operations +# +############################################################################### + + +def size(name=None): + return tf.constant(int(os.getenv('OMPI_COMM_WORLD_SIZE', 1)), name=name, dtype=tf.int32) + + +def rank(name=None): + return tf.constant(int(os.getenv('OMPI_COMM_WORLD_RANK', 0)), name=name, dtype=tf.int32) + + +def local_rank(name=None): + return tf.constant(int(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK', 0)), 
name=name, dtype=tf.int32) + + +def init(name=None): + hvd.init() + return tf.no_op(name=name) + + +def finalize(name=None): + return tf.no_op(name=name) + + +def allreduce(tensor, average=True, name=None): + return hvd.allreduce(tensor, average=average) + + +def allgather(tensor, name=None): + return hvd.allgather(tensor) + + +def broadcast(tensor, name=None): + return hvd.broadcast(tensor, root_rank=0) + + +def broadcast_var(ref, allow_uninitialized=False, name=None): + if allow_uninitialized: + raise RuntimeError("allow_uninitialized is not supported in Horovod implementation") + return tf.assign(ref, broadcast(ref)) + + +############################################################################### +# +# Specific MPI operations on Python objects +# +############################################################################### + + +def broadcast_obj(obj, name=None): + if name is None: + name = 'broadcast_obj' + return allgather_obj(obj, name)[0] + + +def gather_obj(obj, name=None): + if name is None: + name = 'gather_obj' + + res = allgather_obj(obj, name) + if int(os.getenv('OMPI_COMM_WORLD_RANK', 0)) == 0: + return res + else: + return None + + +def scatter_obj(inps, name=None): + if name is None: + name = 'scatter_obj' + + if int(os.getenv('OMPI_COMM_WORLD_RANK', 0)) == 0: + assert len(inps) == int(os.getenv('OMPI_COMM_WORLD_SIZE', 1)) + else: + inps = None + + outs = allgather_obj(inps, name) + return outs[0][int(os.getenv('OMPI_COMM_WORLD_RANK', 0))] + + +def allgather_obj(obj, name=None): + if name is None: + name = 'allgather_obj' + + encoded = _encode_obj(obj) + encoded_size = len(encoded) + + graph_ops = _get_graph_ops(name) + + sizes, encoded_res = tf.get_default_session().run([graph_ops.allgather_obj_size_result, graph_ops.allgather_obj_result], feed_dict={ + graph_ops.allgather_obj_size_inp: [encoded_size], + graph_ops.allgather_obj_inp: encoded + }) + + res = [] + pos = 0 + for sz in sizes: + res.append(_decode_obj(encoded_res[pos:pos+sz])) + pos += sz + return res + + +## Implementation details + +class _GraphOps: + def __init__(self, name): + self.name = name + + with tf.name_scope("horovod_python_ops/" + name): + self.allgather_obj_size_inp = tf.placeholder(name="allgather_obj_size", dtype=tf.int32, shape=[None]) + self.allgather_obj_inp = tf.placeholder(name="allgather_obj", dtype=tf.uint8, shape=[None]) + + self.allgather_obj_size_result = hvd.allgather(self.allgather_obj_size_inp) + self.allgather_obj_result = hvd.allgather(self.allgather_obj_inp) + + +_graph_ops_collection = "HOROVOD_GRAPH_OPS" +_graph_ops_lock = threading.Lock() + + +def _encode_obj(obj): + return list(pickle.dumps(obj)) + + +def _decode_obj(data): + return pickle.loads(bytes(data)) + + +def _get_graph_ops(name): + """ + Returns lazy-initialized hash of graph operations required to implement allgather_obj/scatter_obj. 
+ These operations stored in graph collection to avoid binding parallelism to specific graph + """ + + found = tf.get_collection(_graph_ops_collection, name) + if len(found) > 0: + return found[0] + + with _graph_ops_lock: + ops = _GraphOps(name) + tf.add_to_collection(_graph_ops_collection, ops) + return ops diff --git a/lib/ops/record_activations.py b/lib/ops/record_activations.py new file mode 100644 index 0000000..05ca2c5 --- /dev/null +++ b/lib/ops/record_activations.py @@ -0,0 +1,93 @@ +from warnings import warn +from collections import defaultdict +from contextlib import contextmanager +import tensorflow as tf + +# Idea: we need to store layer activations to do things like relevance propagation, +# let's build a single-use collection that one can store layer-wise activations in +# Here's how it should work: +# with record_activations() as saved_activations: +# y = model(x) # saves activations in... saved_activations +# x_rel = model.relprop(y) # uses activations stored on forward pass +# +# print('btw, activation tensors are', activations) +# note: why not just use tf collections? because they are global and you can never be sure +# what's left in there since previous run + +# this will be a dictionary: { layer name -> a dict of saved activations } +RECORDED_ACTIVATIONS = None +WARN_IF_NO_COLLECTION = False + + +@contextmanager +def recording_activations(existing_state_dict=None, subscope_key=None): + """ A special context that allows you to store any forward pass activations """ + assert isinstance(existing_state_dict, (dict, type(None))) + global RECORDED_ACTIVATIONS + prev_collection = RECORDED_ACTIVATIONS + RECORDED_ACTIVATIONS = existing_state_dict or defaultdict(dict) + if subscope_key: + assert is_recorded() and existing_state_dict is None + prev_collection[subscope_key] = RECORDED_ACTIVATIONS + + try: + yield RECORDED_ACTIVATIONS + finally: + RECORDED_ACTIVATIONS = prev_collection + + +@contextmanager +def do_not_record(): + """ Temporarily disables recording activations within context """ + global RECORDED_ACTIVATIONS + prev_collection = RECORDED_ACTIVATIONS + RECORDED_ACTIVATIONS = None + try: + yield + finally: + RECORDED_ACTIVATIONS = prev_collection + + +def is_recorded(): + return RECORDED_ACTIVATIONS is not None + + +def save_activation(key, value, scope=None, overwrite=False): + """ Saves value in current recorded activations (if it exists) under current name scope """ + scope = scope or tf.get_variable_scope().name or tf.contrib.framework.get_name_scope() + if is_recorded(): + if scope in RECORDED_ACTIVATIONS and key in RECORDED_ACTIVATIONS[scope] and not overwrite: + raise ValueError('Recorded activations already contain key "{}" for scope "{}". ' + 'Make sure you run your network only once inside recording_activations context. ' + 'If a layer is called multiple times, make sure each call happens in a separate ' + ' tf.name_scope .'.format(key, scope)) + + RECORDED_ACTIVATIONS[scope][key] = value + elif WARN_IF_NO_COLLECTION: + warn('Tried to save under key "{}" in scope "{}" without recording_activations context. ' + 'As the fox says, the context is important'.format(key, scope)) + + +def save_activations(**kwargs): + """ convenience function to save multiple activations. 
see save_activation """ + scope, overwrite = kwargs.pop('scope', None), kwargs.pop('overwrite', False) + assert isinstance(scope, (str, type(None))) + assert isinstance(overwrite, bool) + for key, value in kwargs.items(): + save_activation(key, value, scope=scope, overwrite=overwrite) + + +def get_activation(key, scope=None): + """ gets one activation from current scope or freaks out if there isn't any """ + scope = scope or tf.get_variable_scope().name or tf.contrib.framework.get_name_scope() + assert is_recorded(), "can't get activations if used outside recording_activations context." + assert scope in RECORDED_ACTIVATIONS, 'no saved activations in scope "{}". Is scope name correct?'.format(scope) + assert key in RECORDED_ACTIVATIONS[scope], 'no saved activation for "{}" in scope "{}". Existing keys: {}'.format( + key, scope, list(RECORDED_ACTIVATIONS[scope].keys()) + ) + return RECORDED_ACTIVATIONS[scope][key] + + +def get_activations(*keys, scope=None): + """ convenience function to get multiple activations from current scope, see get_activation """ + return [get_activation(key, scope=scope) for key in keys] diff --git a/lib/ops/sliced_argmax.py b/lib/ops/sliced_argmax.py new file mode 100644 index 0000000..c0409b6 --- /dev/null +++ b/lib/ops/sliced_argmax.py @@ -0,0 +1,168 @@ +import numpy as np +import tensorflow as tf + + +def hypo_to_batch_index(n_hypos, slices): + """ + Computes index in batch (input sequence index) for each hypothesis given slices. + :param n_hypos: number of hypotheses (tf int scalar) + :param slices: indices of first hypo for each input in batch + It should guaranteed that + - slices[0]==0 (first hypothesis starts at index 0), otherwise output[:slices[0]] will be -1 + - if batch[i] is terminated, then batch[i]==batch[i+1] + """ + is_next_sent_at_t = tf.bincount(slices, minlength=n_hypos, maxlength=n_hypos) + hypo_to_index = tf.cumsum(is_next_sent_at_t) - 1 + return hypo_to_index + + +def sliced_argmax_naive(logits, slices, k): + """ + Computes top-k of values in each slice. + :param values: matrix of shape [m,n] + :param slices: vector of shape [m] containing start indices for each slice. + :param k: take this many elements with largest values from each slice + :returns: batch_scores,batch_indices: + - batch_scores[m,k] - top-beam_size values from logP corresponding to + - batch_indices[m,k] - indices of batch_scores in each respective slice (first value in each slice has index 0!) + + For any slice contains less than k elements, batch_scores would be padded with -inf, batch_indices - with -1 + If values.shape[1] != 1, batch_indices will still be 1-dimensional, satisfying the following property: + - batch_scores,batch_indices = sliced_argmax(values,slices,k) + - start, end = slices[i], slices[i+1] + - tf.equals(batch_scores == tf.reshape(values[start:end,:],[-1])[batch_indices]) #this is True for all indices + + Examples + -------- + >>> logp = tf.constant(np.array([[1, 2, 3, 4, 5, 6], + [6, 5, 4, 3, 2, 1]],'float32').T) + >>> slices = tf.constant([0,2,5]) + >>> best_scores, best_indices = sliced_argmax(logp,slices,tf.constant(4)) + >>> print('scores:\n%s\nindices:\n%s'%(best_scores.eval(), best_indices.eval())) + scores: + [[ 6. 5. 2. 1.] + [ 5. 4. 4. 3.] + [ 6. 1. 
-inf -inf]] + indices: + [[ 1 3 2 0] + [ 4 1 2 3] + [ 0 1 -1 -1]] + """ + + assert logits.shape.ndims == 2, "logits must be [batch*beam, num_tokens]" + assert slices.shape.ndims == 1, "slices must be 1d indices" + n_slices, n_hypos, voc_size = tf.shape(slices)[0], tf.shape(logits)[0], tf.shape(logits)[1] + slices_incl = tf.concat([slices, [n_hypos]], axis=0) + offsets = slices_incl[1:] - slices_incl[:-1] + slice_indices = hypo_to_batch_index(n_hypos, slices) # [n_hypos], index of slice the value belongs to + + # step 1: flatten logits[n_hypos, voc_size] into [n_slices, max_slice_length * voc_size] + # by putting all logits within slice on the same row and padding with -inf + flat_shape = [n_slices, (tf.reduce_max(offsets)) * voc_size] + flat_row_index = tf.reshape(tf.tile(slice_indices[:, None], [1, voc_size]), [-1]) + flat_col_index = tf.range(n_hypos * voc_size) - tf.gather(slices_incl * voc_size, flat_row_index) + flat_index_2d = tf.stack([flat_row_index, flat_col_index], axis=1) + mask = tf.less(tf.range(flat_shape[1]), (offsets * voc_size)[:, None]) + flat_logits = tf.where(mask, + tf.scatter_nd(flat_index_2d, tf.reshape(logits, [-1]), flat_shape), + tf.fill(flat_shape, -float('inf')) + ) # shape: [n_slices, max_slice_length * voc_size] + + flat_indices = tf.where(mask, + tf.scatter_nd(flat_index_2d, flat_col_index, flat_shape), + tf.fill(flat_shape, -1) + ) # shape: [n_slices, max_slice_length * voc_size] + + # step 2: top-k for each slice and gather respectrive indices + sliced_top_k = tf.nn.top_k(flat_logits, k=k) + original_values = sliced_top_k.values + + original_indices_flat = tf.gather_nd(flat_indices, + tf.stack([tf.range(n_slices * k) // k, + tf.reshape(sliced_top_k.indices, [-1])], axis=1)) + original_indices = tf.reshape(original_indices_flat, tf.shape(original_values)) + + # set shapes + out_shape = (logits.shape[0], k if isinstance(k, int) else None) + original_values.set_shape(out_shape) + original_indices.set_shape(out_shape) + return original_values, original_indices + + +def sliced_argmax(logits, slices, k, staged=None): + """ + Computes top-k of values in each slice. + :param values: matrix of shape [m,n] + :param slices: vector of shape [m] containing start indices for each slice. + :param k: take this many elements with largest values from each slice + :param staged: if True, computes sliced argmax in two stages: + (1) select top-k for each row and + (2) global top-k among all rows in slice + if False, runs second stage only + if None (default), defaults to True unless logits.shape[1] / k < 10 + :returns: batch_scores,batch_indices: + - batch_scores[m,k] - top-beam_size values from logP corresponding to + - batch_indices[m,k] - indices of batch_scores in each respective slice (first value in each slice has index 0!) + + For any slice contains less than k elements, batch_scores would be padded with -inf, batch_indices - with -1 + If values.shape[1] != 1, batch_indices will still be 1-dimensional, satisfying the following property: + - batch_scores,batch_indices = sliced_argmax(values,slices,k) + - start, end = slices[i], slices[i+1] + - tf.equals(batch_scores == tf.reshape(values[start:end,:],[-1])[batch_indices]) #this is True for all indices + + Examples + -------- + >>> logp = tf.constant(np.array([[1, 2, 3, 4, 5, 6], + [6, 5, 4, 3, 2, 1]],'float32').T) + >>> slices = tf.constant([0,2,5]) + >>> best_scores, best_indices = sliced_argmax(logp,slices,tf.constant(4)) + >>> print('scores:\n%s\nindices:\n%s'%(best_scores.eval(), best_indices.eval())) + scores: + [[ 6. 
5. 2. 1.] + [ 5. 4. 4. 3.] + [ 6. 1. -inf -inf]] + indices: + [[ 1 3 2 0] + [ 4 1 2 3] + [ 0 1 -1 -1]] + """ + + assert logits.shape.ndims == 2, "logits must be [batch*beam, num_tokens]" + assert slices.shape.ndims == 1, "slices must be 1d indices" + if staged is None: + staged = (logits.shape[1].value is None) or (float(logits.shape[1].value) / k >= 10.0) + + if staged: + # two-step process: (1) select top-k for each row and (2) global top-k among all rows in slice + # this version is slightly slower but a lot more memory-efficient + logits_topk = tf.nn.top_k(logits, k=k) # [n_hypos, k] + best_values, best_indices_in_top = sliced_argmax_naive(logits_topk.values, slices, k=k) + + best_hypo_ix = tf.where(tf.not_equal(best_indices_in_top, -1), + best_indices_in_top // k + slices[:, None], + best_indices_in_top) + + best_token_ix_in_top = tf.where(tf.not_equal(best_indices_in_top, -1), + best_indices_in_top % k, + best_indices_in_top) + + best_token_indices_original = tf.gather_nd( + logits_topk.indices, + tf.maximum(0, tf.reshape(tf.stack([best_hypo_ix, best_token_ix_in_top], axis=-1), [-1, 2])) + ) + best_token_indices_original = tf.where(tf.not_equal(tf.reshape(best_hypo_ix, [-1]), -1), + best_token_indices_original, + tf.fill(tf.shape(best_token_indices_original), -1)) + + best_token_indices_original = tf.reshape(best_token_indices_original, + tf.shape(best_token_ix_in_top)) + best_hypo_ix_within_slice = tf.where( + tf.not_equal(best_indices_in_top, -1), + best_indices_in_top // k, + tf.zeros_like(best_indices_in_top, dtype=best_indices_in_top.dtype)) + # ^-- use 0 cuz best_token_indices_original is already -1 and they are added + + best_indices_original = best_token_indices_original + best_hypo_ix_within_slice * tf.shape(logits)[1] + return best_values, best_indices_original + else: + return sliced_argmax_naive(logits, slices, k) diff --git a/lib/session.py b/lib/session.py new file mode 100644 index 0000000..1dbc1c5 --- /dev/null +++ b/lib/session.py @@ -0,0 +1,181 @@ +import tensorflow as tf +from tensorflow.python import ops +import lib +import sys +import os +import threading +from contextlib import contextmanager +from tensorflow.python.framework import * +from tensorflow.contrib.tfprof import * +from tensorflow.python.client import timeline, session +from collections import namedtuple + +# tfprof-oriented Session object. +# More about tfprof: +# https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/tfprof +# +# !! Attention !! 
For using need to append +# /usr/local/cuda-8.0/extras/CUPTI/lib64 to $LD_LIBRARY_PATH + + +PROFILE_SUPER_VERBOSE = 666 + +_tls = threading.local() +def get_profile_level(): + if not hasattr(_tls, 'profile_level'): + _tls.profile_level = PROFILE_SUPER_VERBOSE # Never profile most of sess.run + return _tls.profile_level + +def set_profile_level(level): + _tls.profile_level = level + +@contextmanager +def profile_scope(level=1): + prev_level = get_profile_level() + _tls.profile_level = level + try: + yield + finally: + _tls.profile_level = prev_level + + +MemTimelineRecord = namedtuple('MemTimelineRecord', ['ts', 'node_name', 'bytes_in_use', 'live_bytes']) + + +class SessionWrapper(session.SessionInterface): + + def __init__(self, session): + self._sess = session + + @property + def graph(self): + return self._sess.graph + + @property + def sess_str(self): + return self._sess.sess_str + + def run(self, *a, **kwa): + return self._sess.run(*a, **kwa) + + def partial_run_setup(self, *a, **kwa): + raise RuntimeError("Not supported in session wrapper") + + def partial_run(self, *a, **kwa): + raise RuntimeError("Not supported in session wrapper") + + def make_callable(self, *a, **kwa): + raise RuntimeError("Not supported in session wrapper") + + def as_default(self): + return ops.default_session(self) + + def __getattr__(self, attr): + return getattr(self._sess, attr) + + def __enter__(self): + if self._default_session_context_manager is None: + self._default_session_context_manager = self.as_default() + return self._default_session_context_manager.__enter__() + + def __exit__(self, *exc): + self._default_session_context_manager.__exit__(*exc) + + def __del__(self): + self._sess.__del__() + + +class ProfilableSessionWrapper(SessionWrapper): + def __init__(self, session, log_dir, skip_first_nruns=0, profile_level=0): + super(ProfilableSessionWrapper, self).__init__(session) + + self.run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) + self.run_metadata = tf.RunMetadata() + self.run_counter = 0 + self.nruns_threshold = skip_first_nruns + self.profile_level = profile_level + + self.log_dir = log_dir + os.makedirs(log_dir, exist_ok=True) + + self.op_log = None + tf.profiler.write_op_log( + tf.get_default_graph(), + log_dir=log_dir, + op_log=self.op_log, + run_meta=self.run_metadata + ) + + def _write_log(self): + print("* --------------------------------------", file=sys.stderr) + print("* RUN: %d" % self.run_counter, file=sys.stderr) + + # 1. Fetch memory usage and timing stat + time_stat_options = model_analyzer.PRINT_ALL_TIMING_MEMORY + time_stat_options['output'] = 'file:outfile=%s/time_stat.run_%d.txt' % (self.log_dir, self.run_counter) + time_stat_options['select'] = ['device', 'micros', 'bytes'] + time_stat_options['order_by'] = 'micros' + tf.profiler.profile( + tf.get_default_graph(), + run_meta=self.run_metadata, + op_log=self.op_log, + options=time_stat_options + ) + + # 2. Create timeline.json file. It can be load in chrome://tracing + time_data = timeline.Timeline(self.run_metadata.step_stats) + trace = time_data.generate_chrome_trace_format(show_memory=True) + timeline_fname = '%s/timeline.run_%d.json' % (self.log_dir, self.run_counter) + with open(timeline_fname, 'w') as f: + f.write(trace) + + # 3. Get peak memory + mem_timelines = self._build_memory_timelines() + peak_memory = self._compute_peak_memory(mem_timelines) + print("Peak memory: %s" % str(peak_memory), file=sys.stderr) + + # 4. 
Print memory timelines + for allocator, tl in mem_timelines.items(): + memory_fname = '%s/memory.%s.run_%d.txt' % (self.log_dir, allocator, self.run_counter) + with open(memory_fname, 'w') as f: + print("ts,node_name,bytes_in_use,live_bytes", file=f) + for r in tl: + print("%d,%s,%d,%d" % (r.ts, r.node_name, r.bytes_in_use, r.live_bytes), file=f) + + def run(self, fetches, feed_dict=None, options=None, run_metadata=None): + do_profile = self.run_counter >= self.nruns_threshold and self.profile_level >= get_profile_level() + result = super(ProfilableSessionWrapper, self).run( + fetches, feed_dict, + options=self.run_options if do_profile else None, + run_metadata=self.run_metadata if do_profile else None + ) + # For earch invocation of `run()` or `eval()` methods dump log to new file + if do_profile and lib.ops.mpi.is_master(): + self._write_log() + self.run_counter += 1 + return result + + def _compute_peak_memory(self, mem_timelines): + res = {} + for k, tl in mem_timelines.items(): + res[k] = max([r.bytes_in_use for r in tl]) + return res + + def _build_memory_timelines(self): + timelines = {} + + for dev in self.run_metadata.step_stats.dev_stats: + for node in dev.node_stats: + ts = node.all_start_micros + for mem in node.memory: + if mem.allocator_name not in timelines: + timelines[mem.allocator_name] = [] + timelines[mem.allocator_name].append(MemTimelineRecord(ts, node.node_name, mem.allocator_bytes_in_use, mem.live_bytes)) + + for tl in timelines.values(): + tl.sort() + + return timelines + + def _simplify_device_name(self, device_name): + return '/' + device_name.split('device:')[1] diff --git a/lib/task/__init__.py b/lib/task/__init__.py new file mode 100644 index 0000000..a6c3bca --- /dev/null +++ b/lib/task/__init__.py @@ -0,0 +1 @@ +from . import seq2seq diff --git a/lib/task/seq2seq/__init__.py b/lib/task/seq2seq/__init__.py new file mode 100644 index 0000000..6590432 --- /dev/null +++ b/lib/task/seq2seq/__init__.py @@ -0,0 +1 @@ +from . 
import inference, problems, models, data, bleu, summary, tickers, voc \ No newline at end of file diff --git a/lib/task/seq2seq/bleu.py b/lib/task/seq2seq/bleu.py new file mode 100644 index 0000000..c91af9b --- /dev/null +++ b/lib/task/seq2seq/bleu.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python3 +# coding: utf-8 + +import argparse +from collections import Counter, namedtuple +import math +import os.path +import sys +import numpy as np + +sys.path += [os.path.dirname(sys.argv[0])] + +from .strutils import tokenize, al_num, all_chars__punct_tokens, all_chars__punct_tokens__foldcase +from .strutils import al_num__foldcase, chinese_tok, split_by_char_tok, equal_to_framework + +SEED = 51 +BleuResult = namedtuple('BleuResult', 'BLEU brevity_penalty ratio hyp_len ref_len BLEU_for_ngrams') + + +def best_match_length(references, cand, verbose=False): + spl_cand_length = len(cand) + diff = sys.maxsize + for ref in references: + spl_ref_length = len(ref) + if not spl_ref_length: + continue + if spl_ref_length == spl_cand_length: + return spl_ref_length + elif abs(diff) == abs(spl_cand_length - spl_ref_length): + diff = max(diff, spl_cand_length - spl_ref_length) + elif abs(diff) > abs(spl_cand_length - spl_ref_length): + diff = spl_cand_length - spl_ref_length + best_len = max(spl_cand_length - diff, 0) + if verbose and not best_len: + print('WARNING: empty reference: ', repr((references, cand)), file=sys.stderr) + return best_len + + +def brev_penalty(cand_length, best_match_length): + if cand_length > best_match_length: + return 1 + else: + return math.exp(1 - float(best_match_length) / float(cand_length)) + + +def split_into_ngrams(text, n): + if n <= 0: + raise ValueError('n should be a positive number!') + return [tuple(text[i:i+n]) for i in range(len(text) - n + 1)] + + +def compute_length_for_n(text, n_for_ngram): + ''' + # split into words and count: + # count - n + ''' + unigram_count = len(text) + if n_for_ngram > unigram_count: + return 0 + else: + return unigram_count - n_for_ngram + 1 + + +def mod_precision_for_n(refs, cand, n, smoothed=False): + cand_counter = Counter(split_into_ngrams(cand, n)) + ref_counters = [Counter(split_into_ngrams(ref, n)) for ref in refs] + total_sum = 0 + for ngram, count_in_cand in cand_counter.items(): + max_count_in_refs = max(counter[ngram] for counter in ref_counters) + total_sum += min(max_count_in_refs, count_in_cand) + if smoothed and n > 1: + return total_sum + 1, compute_length_for_n(cand, n) + 1 + return total_sum, compute_length_for_n(cand, n) + + +def logarithm(x): + if x == 0: + return -sys.maxsize - 1 + else: + return math.log(x) + + +def print_summary(bleu_vals): + bleu_mean, bleu_std = np.mean(bleu_vals), np.std(bleu_vals) + summary_string = ("Mean BLEU: %.4f; 95%% CI: [%.4f, %.4f]; std=%.4f" % + (bleu_mean, bleu_mean - 1.96 * bleu_std, bleu_mean + 1.96 * bleu_std, bleu_std)) + print(summary_string) + + +class Bleu(object): + def __init__(self, normalize_func=None, smoothed=False, cached=False, language=None, verbose=False): + self.cand_len = 0 + self.best_ref_len = 0 + self.brevity_penalty = 0 + self.mod_precision = [[0, 0], [0, 0], [0, 0], [0, 0]] + self.normalize_func = normalize_func + self.smoothed = smoothed + self.cached = cached + self.language = language + if cached: + self.cand_len_vals = [] + self.best_ref_len_vals = [] + self.mod_precision_vals = [] + self.verbose = verbose + + def process_next(self, cand, refs, **kwargs): + if self.normalize_func is not None: + cand = tokenize(self.normalize_func(cand, self.language)) + refs = 
[tokenize(self.normalize_func(ref, self.language)) for ref in refs] + else: + cand = tokenize(cand) + refs = [tokenize(ref) for ref in refs] + self.last__cand_len = compute_length_for_n(cand, 1) + self.cand_len += self.last__cand_len + self.last__best_ref_len = best_match_length(refs, cand, verbose=self.verbose) + self.best_ref_len += self.last__best_ref_len + self.last_mp = [] + for i in range(4): + self.last_mp.append(mod_precision_for_n(refs, cand, i + 1, smoothed=self.smoothed)) + self.mod_precision[i][0] += self.last_mp[i][0] + self.mod_precision[i][1] += self.last_mp[i][1] + + if self.cached: + self.cand_len_vals.append(self.last__cand_len) + self.best_ref_len_vals.append(self.last__best_ref_len) + self.mod_precision_vals.append(self.last_mp) + + def _compute_bleu(self, cand_len, best_ref_len, mod_precision, sentence_level=False): + brevity_penalty = brev_penalty(cand_len, best_ref_len) + bleu_for_ngram = [0, 0, 0, 0] + for i in range(4): + if mod_precision[i][0] > 0.0 and mod_precision[i][1] > 0.0 : + bleu_for_ngram[i] = round(float(mod_precision[i][0]) / float(mod_precision[i][1]), 4) + else: + bleu_for_ngram[i] = 0.0 + average = 0 + for i in range(4): + if sentence_level: + nonzero = mod_precision[i][1] > 0.0 + else: + nonzero = mod_precision[i][0] > 0.0 and mod_precision[i][1] > 0.0 + if not nonzero: + average += 0.25 * (-sys.maxsize) + if nonzero: + average += 0.25 * logarithm(float(mod_precision[i][0]) / float(mod_precision[i][1])) + total_bleu = round(brevity_penalty * math.exp(average), 4) + return BleuResult(total_bleu, brevity_penalty, round(float(cand_len) / float(best_ref_len), 4), cand_len, best_ref_len, bleu_for_ngram) + + def result_for_last(self): + return self._compute_bleu(self.last__cand_len, self.last__best_ref_len, self.last_mp, True) + + def total(self): + return self._compute_bleu(self.cand_len, self.best_ref_len, self.mod_precision) + + def bootstrap_sample(self, n_times=1000, seed=None): + rng = np.random.RandomState(seed) + if not self.cached: + return None + bleu_vals = [] + for i in range(n_times): + inds = rng.randint(0, len(self.cand_len_vals), len(self.cand_len_vals)) + cand_len = sum([self.cand_len_vals[i] for i in inds]) + best_ref_len = sum([self.best_ref_len_vals[i] for i in inds]) + mod_precision = sum([np.array(self.mod_precision_vals[i]) for i in inds]) + bleu_vals.append(self._compute_bleu(cand_len, best_ref_len, mod_precision)[0]) + return np.array(bleu_vals) + + +if __name__ == '__main__': + t_options = {'simple': al_num__foldcase, + 'case-sensitive': al_num, + 'punctuation': all_chars__punct_tokens__foldcase, + 'c-s-punctuation': all_chars__punct_tokens, + 'ch': chinese_tok, + 'split-by-char': split_by_char_tok, + 'framework': equal_to_framework} + parser = argparse.ArgumentParser() + parser.add_argument('-t', '--tokenization', help='''Tokenization options: + default - split text by spaces + simple = alphanumerics only, + case-sensitive = with small letters, + punctuation = with punctuation marks as separate tokens, + c-s-punctuation = case-sensitive + punctuation, + split-by-char = set space between all characters, + framework = lang-specific replacements + unicode category tokenization''', + choices=t_options.keys()) + parser.add_argument('-c', '--candidate', type=int, nargs='+', help='Hypothesis column number.', required=True) + parser.add_argument('-r', '--reference', help='Reference column number (range or int)') + parser.add_argument('--all', help='Bleu scores for all queries.', action='store_true') + parser.add_argument('-s', 
'--smoothed', action='store_true', default=False, help='Use to compute smoothed BLEU') + parser.add_argument('-l', '--language', help='Dst-side language') + parser.add_argument('--bootstrap-sampling-n', type=int, + help='Run bootstrap sampling n times for BLEU CI estimate.', default=0) + parser.add_argument('--compare', help='Compare Bleu scores for two MT systems', action='store_true') + args = parser.parse_args() + + if args.compare and len(args.candidate) != 2: + raise AssertionError('It should specify 2 hypothesis columns if `--compare` flag used') + if args.compare and args.all: + raise AssertionError('Could not evaluate BLEU score for each query if `--compare` flag used') + + if ':' in args.reference: + r_start, r_end = args.reference.split(':') + reference = slice(int(r_start), int(r_end) if len(r_end) > 0 else None) + else: + reference = int(args.reference) + + bleu_opts = { + 'normalize_func': t_options[args.tokenization] if args.tokenization else None, + 'smoothed': args.smoothed, + 'cached': bool(args.bootstrap_sampling_n) or args.compare, + 'language': args.language, + } + + b_first = Bleu(verbose=True, **bleu_opts) + if args.compare: + b_second = Bleu(verbose=True, **bleu_opts) + + for i, line in enumerate(sys.stdin): # for candidate and set of references in corpus compute process_next + line = line.rstrip('\n') + if not line: + continue + text_data = line.rstrip().split('\t') + refs = [text_data[reference]] if isinstance(reference, int) else text_data[reference] + refs = [ref for ref in refs if ref] + if not refs: + print('Error: no data found in {} column, line {}'.format(args.reference, i + 1), file=sys.stderr) + continue + if len(text_data) < 2: + text_data += [''] + cand_first = text_data[args.candidate[0]] + b_first.process_next(cand_first, refs) + if args.compare: + cand_second = text_data[args.candidate[1]] + b_second.process_next(cand_second, refs) + + #if not cand: + # print >> sys.stderr, 'Error: no data found in %d column, line %i' % (args.r, i + 1) + # sys.exit(1) + if args.all: + print(i, b_first.result_for_last()[0]) + + if not args.compare: + print(b_first.total()) + if args.bootstrap_sampling_n: + bleu_vals = b_first.bootstrap_sample(args.bootstrap_sampling_n, seed=SEED) + print_summary(bleu_vals) + else: + sampling_n = args.bootstrap_sampling_n if args.bootstrap_sampling_n > 0 else 1000 + + print("---\nFirst system stats:" ) + print(b_first.total()) + bleu_vals_first = b_first.bootstrap_sample(sampling_n, seed=SEED) + print_summary(bleu_vals_first) + + print("---\nSecond system stats:" ) + print(b_second.total()) + bleu_vals_second = b_second.bootstrap_sample(sampling_n, seed=SEED) + print_summary(bleu_vals_second) + + delta = bleu_vals_first - bleu_vals_second + bootstrap_p_value = np.mean(delta > 0) + print("---\nSystem %d is better. 
Significance test results:" % (1 if bootstrap_p_value > 0.5 else 2)) + print("Paired boostrap p-value = %.3f" % min(bootstrap_p_value, 1 - bootstrap_p_value)) + + diff --git a/lib/task/seq2seq/data.py b/lib/task/seq2seq/data.py new file mode 100644 index 0000000..c22e73a --- /dev/null +++ b/lib/task/seq2seq/data.py @@ -0,0 +1,324 @@ +import sys +import random +from sortedcontainers import SortedList +import numpy as np +import math +import itertools +import tensorflow as tf + +from lib.data import pad_seq_list + + +def srclen(item): + return item[0].count(' ') + 1 + + +def dstlen(item): + return item[1].count(' ') + 1 + + +def maxlen(item): + return max(srclen(item), dstlen(item)) + + +def sumlen(item): + return srclen(item) + dstlen(item) + + +def form_batches(data, batch_size): + seq = iter(data) + done = False + while not done: + batch = [] + for _ in range(batch_size): + try: + batch.append(next(seq)) + except StopIteration: + done = True + if batch: + yield batch + + +def locally_sorted_by_len(seq, window, weight_func=maxlen, alterate=False): + reverse = False + for batch in form_batches(seq, window): + batch = sorted(batch, key=weight_func, reverse=reverse) + for x in batch: + yield x + if alterate: + reverse = not reverse + + +def form_adaptive_batches(data, batch_len, batch_size_max=0): + seq = iter(data) + prev = [] + max_len = 0 + done = False + while not done: + batch = prev + try: + while True: + item = next(seq) + max_len = max(max_len, maxlen(item)) + if (len(batch) + 1) * max_len > batch_len or (batch_size_max and len(batch) >= batch_size_max): + prev, max_len = [item], maxlen(item) + break + batch.append(item) + except StopIteration: + done = True + if batch: + yield batch + + +def form_adaptive_batches_windowed(data, weight_func=maxlen, max_size=5000, split_len=10000, batch_size_max=0): + rng = random.Random(42) + buf = [] + last_chunk = [] + reverse = False + for p in data: + if len(buf) >= split_len: + # Last chunk may contain fewer sentences than others - let's return in to the miller + buf += last_chunk + + buf = sorted(buf, key=weight_func, reverse=reverse) + chunks = list(form_adaptive_batches(buf, max_size, batch_size_max=batch_size_max)) + + last_chunk = chunks.pop() + buf = [] + + reverse = not reverse + + rng.shuffle(chunks) + for chunk in chunks: + yield chunk + buf.append(p) + + buf += last_chunk + buf = sorted(buf, key=weight_func, reverse=reverse) + chunks = list(form_adaptive_batches(buf, max_size, batch_size_max=batch_size_max)) + rng.shuffle(chunks) + for chunk in chunks: + yield chunk + + +def batch_cost(x): + return len(x) + 2 * (max(len(i[0]) for i in x) + max(len(i[1]) for i in x)) + + +def load_parallel(src, dst, cycle=False): + # Load data. 
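+    # Each pass below re-opens both files and yields aligned (src_line, dst_line) pairs
+    # with trailing newlines stripped; with cycle=True the corpus is repeated indefinitely.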
+ for i in itertools.count(): + if i > 0 and not cycle: + break + data = zip(open(src), open(dst)) + for l, r in data: + yield l.rstrip('\n'), r.rstrip('\n') + + +def filter_by_len(data, max_srclen=sys.maxsize, max_dstlen=sys.maxsize, batch_len=None): + def item_ok(item): + ok = (srclen(item) <= max_srclen and dstlen(item) <= max_dstlen) + if batch_len is not None: + return ok and maxlen(item) <= batch_len + return ok + + return filter(item_ok, data) + + +# ======= Random block reader ===================== +def _read_file_part(fd, file_size, part_id, nparts): + begin_pos = (part_id * file_size) // nparts + end_pos = ((part_id + 1) * file_size) // nparts + fd.seek(begin_pos) + + if part_id > 0: + _prefix = fd.readline() # skip first line + + current_pos = fd.tell() # get offset + if current_pos < end_pos: + body = fd.readlines(end_pos - current_pos) + elif current_pos == end_pos and end_pos < file_size: + body = [fd.readline()] + else: + return [] + + return body + + +def _grouper(data, n=5): + pool = [] + for item in data: + if len(pool) == n: + yield pool + pool = [] + pool.append(item) + yield pool + + +def random_block_reader(fname, part_size=64 * 1024, parallel=5, infinite=False, seed=42, encoding='utf-8'): + fd = open(fname, 'rb') + fd.seek(0, 2) + file_size = fd.tell() + + nparts = math.ceil(file_size / part_size) + rng = np.random.RandomState(seed) + if infinite: + part_ids = (rng.randint(0, nparts) for _ in itertools.count()) + else: + part_ids = np.arange(nparts) + rng.shuffle(part_ids) + + for group in _grouper(part_ids, parallel): + lines = [] + for part_id in group: + part_lines = _read_file_part(fd, file_size, part_id, nparts) + lines += part_lines + + rng.shuffle(lines) + + for line in lines: + yield line.decode(encoding).rstrip('\n') + + fd.close() + + +# ======= Adaptive batches with sorted data ======= +class FastSplitter2d: + def __init__(self, max_size=5000, chunk_count=5): + self.max_size = max_size + self.max_x = 0 + self.points = SortedList(key=lambda p: -p[1]) + self.chunk_count = chunk_count + + def add_to_pack(self, p): + self.max_x = max(self.max_x, p[0]) + new_pos = self.points.bisect_right(p) + self.points.insert(new_pos, p) + + offset = 0 + bs_vec = [] + while offset < len(self.points): + bs = self.max_size // (self.max_x + self.points[offset][1]) + bs = min(len(self.points) - offset, bs) + bs_vec.append(bs) + offset += bs + + return new_pos, bs_vec + + def make_chunk_gen(self, points): + prev_bs_vec = [0] + for p in sorted(list(points), key=lambda p: p[0], reverse=True): + new_pos, bs_vec = self.add_to_pack(p) + + if len(bs_vec) > len(prev_bs_vec): + if len(prev_bs_vec) >= self.chunk_count: + self.points.pop(new_pos) + offset = 0 + for sz in prev_bs_vec: + yield self.points[offset:offset + sz] + offset += sz + self.points.clear() + self.points.add(p) + prev_bs_vec = [1] + self.max_x = p[0] + prev_bs_vec = bs_vec + offset = 0 + for sz in prev_bs_vec: + yield self.points[offset:offset + sz] + offset += sz + + +def _in_conv(item, lead_inp_len=False): + x_len = item[0].count(' ') + 1 + y_len = item[1].count(' ') + 1 + if lead_inp_len: + return item.__class__((x_len, y_len)) + item + else: + return item.__class__((y_len, x_len)) + item + + +def _out_chunk_conv(chunk): + return [item[2:] for item in chunk] + + +def form_adaptive_batches_split2d(data, max_size=5000, split_len=10000, chunk_count=5, lead_inp_len=False): + rng = random.Random(42) + buf = [] + for p in data: + if len(buf) >= split_len: + splitter = FastSplitter2d(max_size=max_size, 
chunk_count=chunk_count) + chunks = list(splitter.make_chunk_gen(buf)) + rng.shuffle(chunks) + for chunk in chunks: + if len(chunk) == 0: + print("SPLIT2D: empty chunk", file=sys.stderr) + continue + yield _out_chunk_conv(chunk) + buf = [] + + buf.append(_in_conv(p, lead_inp_len=lead_inp_len)) + + splitter = FastSplitter2d(max_size=max_size, chunk_count=chunk_count) + chunks = list(splitter.make_chunk_gen(buf)) + rng.shuffle(chunks) + for chunk in chunks: + if len(chunk) == 0: + print("SPLIT2D: empty chunk", file=sys.stderr) + continue + yield _out_chunk_conv(chunk) + + +## ============================================================================ +# Integration + +def words_from_line(line, voc, bos=0, eos=1): + line = line.rstrip('\n') + words = [token for token in line.split(' ') if token] + return voc.words([voc.bos]) * bos + words + voc.words([voc.eos]) * eos + + +def words_from_ids(ids, voc): + return [ + word if (id not in [voc.bos, voc.eos]) else None + for id, word in zip(ids, voc.words(ids)) + ] + + +def lines2ids(lines, voc, **kwargs): + # Read as-is, without padding. + ids_all = [] + for line in lines: + words = words_from_line(line, voc, **kwargs) + ids = voc.ids(words) + ids_all.append(ids) + + # Pad and transpose. + ids_all, ids_len = pad_seq_list(ids_all, voc.eos) + return ids_all, ids_len + + +def make_batch_data(batch, inp_voc, out_voc, force_bos=False, **kwargs): + inp_lines, out_lines = zip(*batch) + inp, inp_len = lines2ids(inp_lines, inp_voc, bos=int(force_bos)) + out, out_len = lines2ids(out_lines, out_voc, bos=int(force_bos)) + + batch_data = dict( + inp=np.array(inp, dtype=np.int32), + inp_len=np.array(inp_len, dtype=np.int32), + out=np.array(out, dtype=np.int32), + out_len=np.array(out_len, dtype=np.int32)) + + return batch_data + + +def make_batch_placeholder(batch_data): + batch_placeholder = { + k: tf.placeholder(v.dtype, [None] * len(v.shape)) + for k, v in batch_data.items()} + return batch_placeholder + + +class BatchIndexer: + pass + + diff --git a/lib/task/seq2seq/inference.py b/lib/task/seq2seq/inference.py new file mode 100644 index 0000000..407b796 --- /dev/null +++ b/lib/task/seq2seq/inference.py @@ -0,0 +1,1046 @@ +import sys +from collections import namedtuple +from warnings import warn + +import tensorflow as tf + +import lib.util +from lib.ops import infer_length, infer_mask +from lib.ops.sliced_argmax import sliced_argmax +from lib.util import nested_map, is_scalar +import numpy as np + + +def translate_lines(lines, translator, model, out_voc, replace_unk=False, unbpe=False, dumper=None): + """ + tokenize, translate and detokenize strings using specified model and translator + :param lines: an iterable of strings + :type translator: something that can .translate_batch(batch_dict) -> out, attnP, ... 
+ :param model: a model from lib.task.seq2seq.models.ModelBase + :param out_voc: destination language dictionary + :param replace_unk: if True, forbids sampling UNK from the model + :param unbpe: if True, concatenates bpe subword units together + :return: yields one translation line at a time + """ + batch = [(l, "") for l in lines] + batch_data = model.make_feed_dict(batch, add_inp_words=True) + kwargs = {} + + if dumper is not None: + kwargs['batch_dumpers'] = dumper.create_batch_dumpers(batch) + + out_ids, attnP = translator.translate_batch(batch_data, **kwargs)[:2] + + for i in range(len(out_ids)): + ids = list(out_ids[i]) + words = out_voc.words(ids) + words = [w for w, out_id in zip(words, ids) if out_id not in [out_voc.bos, out_voc.eos]] + + if replace_unk: + where = [(w and '_UNK_' in w) for w in words] + if any(w for w in where): + inp_words = batch_data['inp_words'][i][:batch_data['inp_len'][i]] + + # select attention weights for non-BOS/EOS tokens, shape=[num_outputs, num_inputs] + attns = np.array([a for a, out_id in zip(attnP[i], ids) + if out_id not in [out_voc.bos, out_voc.eos]])[:len(words), :len(inp_words)] + + # forbid attns to special tokens if there are normal tokens in inp + inp_mask = np.array([w not in ['_BOS_', '_EOS_'] for w in inp_words]) + attns = np.where(inp_mask[None, :], attns, -np.inf) + + words = copy_argmax(inp_words, words, attns, where) + + out_line = " ".join(words) + if unbpe: + out_line = out_line.replace('` ', '') + yield out_line + + +def copy_argmax(inp, out, attnP, where): + """ + inp: [ninp] + out: [nout] + attnP: [nout, ninp] + where: [nout] + """ + # Check shapes. + if len(inp) != attnP.shape[1]: + msg = 'len(inp) is %i, but attnP.shape[1] is %i' + raise ValueError(msg % (len(inp), attnP.shape[1])) + if len(out) != attnP.shape[0]: + msg = 'len(out) is %i, but attnP.shape[0] is %i' + raise ValueError(msg % (len(out), attnP.shape[0])) + + # Copy in every requested position. + new_out = [] + for o in range(len(out)): + # Output as-is. + if not where[o]: + new_out.append(out[o]) + continue + + # Copy from input. + i = np.argmax(attnP[o]) + new_out.append(inp[i]) + + return new_out + + +class TranslateModel: + + def __init__(self, name, inp_voc, out_voc, loss, **hp): + """ Each model must have name, vocabularies and a hyperparameter dict """ + self.name = name + self.inp_voc = inp_voc + self.out_voc = out_voc + self.loss = loss + self.hp = hp + + def encode(self, batch, **flags): + """ + Encodes symbolic input and returns initial state of decode + :param batch: { + inp: int32 matrix [batch,time] or whatever your model can encode + inp_len: int vector [batch_size] + } + -------------------------------------------------- + :returns: dec_state, nested structure of tensors, batch-major + """ + raise NotImplementedError() + + def decode(self, dec_state, words, **flags): + """ + Performs decode step on given words. + + dec_state: nested structure of tensors, batch-major + words: int vector [batch_size] + ------------------------------------------------------ + :returns: new_dec_state, nested structure of tensors, batch-major + """ + raise NotImplementedError() + + def shuffle(self, dec_state, hypo_indices): + """ + Selects hypotheses from model decoder state by given indices. 
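+ (illustrative example) hypo_indices=[0, 0, 2] would keep two copies of hypothesis 0 and one copy of hypothesis 2,
+ since every tensor in dec_state is gathered along its first (batch x beam) axis.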
+ :param dec_state: a nested structure of tensors representing model state + :param hypo_indices: int32 vector of indices to select + :returns: dec state elements for given flat_indices only + """ + return nested_map(lambda x: tf.gather(x, hypo_indices), dec_state) + + def switch(self, condition, state_on_true, state_on_false): + """ + Composes a new stack.best_dec_state out of new dec state when new_is_better and old dec state otherwise + :param condition: a boolean condition vector of shape [batch_size] + """ + return nested_map(lambda x, y: tf.where(condition, x, y), state_on_true, state_on_false) + + def sample(self, dec_state, base_scores, slices, k, sampling_strategy='greedy', sampling_temperature=None, **flags): + """ + Samples top-K new words for each hypothesis from a beam. + Decoder states and base_scores of hypotheses for different inputs are concatenated like this: + [x0_hypo0, x0_hypo1, ..., x0_hypoN, x1_hypo0, ..., x1_hypoN, ..., xM_hypoN + + :param dec_state: nested structure of tensors, batch-major + :param base_scores: [batch_size], log-probabilities of hypotheses in dec_state with additive penalties applied + :param slices: start indices of each input + :param k: [], int, how many hypotheses to sample per input + :returns: best_hypos, words, scores, + best_hypos: in-beam hypothesis index for each sampled token, [batch_size / slice_size, k], int + words: new tokens for each hypo, [batch_size / slice_size, k], int + scores: log P(words | best_hypos), [batch_size / slice_size, k], float32 + """ + rdo = self.get_rdo(dec_state) + if isinstance(rdo, (tuple, list)) or lib.util.is_namedtuple(rdo): + logits = self.loss.rdo_to_logits__predict(*rdo) + else: + logits = self.loss.rdo_to_logits__predict(rdo) + + n_hypos, voc_size = tf.shape(logits)[0], tf.shape(logits)[1] + batch_size = tf.shape(slices)[0] + + if sampling_strategy == 'random': + if sampling_temperature is not None: + logits /= sampling_temperature + + logp = tf.nn.log_softmax(logits, 1) + + best_hypos = tf.range(0, n_hypos)[:, None] + + best_words = tf.cast(tf.multinomial(logp, k), tf.int32) + best_words_flat = (tf.range(0, batch_size) * voc_size)[:, None] + best_words + + best_delta_scores = tf.gather(tf.reshape(logp, [-1]), best_words_flat) + + elif sampling_strategy == 'greedy': + logp = tf.nn.log_softmax(logits, 1) + base_scores[:, None] + best_scores, best_indices = sliced_argmax(logp, slices, k) + + # If best_hypos == -1, best_scores == -inf, set best_hypos to 0 to avoid runtime IndexError + best_hypos = tf.where(tf.not_equal(best_indices, -1), + tf.floordiv(best_indices, voc_size) + slices[:, None], + tf.fill(tf.shape(best_indices), -1)) + best_words = tf.where(tf.not_equal(best_indices, -1), + tf.mod(best_indices, voc_size), + tf.fill(tf.shape(best_indices), -1)) + + best_delta_scores = best_scores - tf.gather(base_scores, tf.maximum(0, best_hypos)) + else: + raise ValueError("sampling_strategy must be in ['random','greedy']") + + return (best_hypos, best_words, best_delta_scores) + + def get_rdo(self, dec_state): + if hasattr(dec_state, 'rdo'): + return dec_state.rdo + raise NotImplementedError() + + def get_attnP(self, dec_state): + """ + Returns attnP + + dec_state: [..., batch_size, ...] + --------------------------------- + Ret: attnP + attnP: [batch_size, ninp] + """ + if hasattr(dec_state, 'attnP'): + return dec_state.attnP + raise NotImplementedError() + + +class GreedyDecoder: + """ + Inference that encodes input sequence, then iteratively samples and decodes output sequence. 
+ :type model: lib.task.seq2seq.inference.translate_model.TranslateModel + :param batch: a dictionary that contains symbolic tensor {'inp': input token ids, shape [batch_size,time]} + :param max_len: maximum length of output sequence, symbolic or numeric integer + if scalar, sets global length; if vector[batch_size], sets length for each input; + if None, defaults to 2*inp_len + 3 + :param force_bos: if True, forces zero-th output to be model.out_voc.bos. Otherwise lets model decide. + :param force_eos: if True, any token past initial EOS is guaranteed to be EOS + :param get_tracked_outputs: callback that returns whatever tensor(s) you want to track on each time-step + :param crop_last_step: if True, does not perform additional decode __after__ last eos + ensures all tensors have equal time axis + :param back_prop: see tf.while_loop back_prop param + :param swap_memory: see tf.while_loop swap_memory param + :param **flags: you can add any amount of tags that encode and decode understands. + e.g. greedy=True or is_train=True + + """ + + Stack = namedtuple('Stack', + ['out', 'out_len', 'scores', 'finished', 'dec_state', 'attnP', 'tracked']) + + def __init__(self, model, batch_placeholder, max_len=None, force_bos=False, force_eos=True, + get_tracked_outputs=lambda dec_state: [], crop_last_step=True, + back_prop=True, swap_memory=False, **flags): + self.batch_placeholder = batch_placeholder + self.get_tracked_outputs = get_tracked_outputs + + inp_len = batch_placeholder.get('inp_len', infer_length(batch_placeholder['inp'], model.out_voc.eos)) + max_len = max_len if max_len is not None else (2 * inp_len + 3) + + first_stack = self.create_initial_stack(model, batch_placeholder, force_bos=force_bos, **flags) + shape_invariants = nested_map(lambda v: tf.TensorShape([None for _ in v.shape]), first_stack) + + # Actual decoding + def should_continue_translating(*stack): + stack = self.Stack(*stack) + return tf.reduce_any(tf.less(stack.out_len, max_len)) & tf.reduce_any(~stack.finished) + + def inference_step(*stack): + stack = self.Stack(*stack) + return self.greedy_step(model, stack, **flags) + + final_stack = tf.while_loop( + cond=should_continue_translating, + body=inference_step, + loop_vars=first_stack, + shape_invariants=shape_invariants, + swap_memory=swap_memory, + back_prop=back_prop, + ) + + outputs, _, scores, _, dec_states, attnP, tracked_outputs = final_stack + if crop_last_step: + attnP = attnP[:, :-1] + tracked_outputs = nested_map(lambda out: out[:, :-1], tracked_outputs) + + if force_eos: + out_mask = infer_mask(outputs, model.out_voc.eos) + outputs = tf.where(out_mask, outputs, tf.fill(tf.shape(outputs), model.out_voc.eos)) + + self.best_out = outputs + self.best_attnP = attnP + self.best_scores = scores + self.dec_states = dec_states + self.tracked_outputs = tracked_outputs + + def translate_batch(self, batch_data, **optional_feed): + """ + Translates NUMERIC batch of data + :param batch_data: dict {'inp':np.array int32[batch,time]} + :optional_feed: any additional values to be fed into graph. e.g. 
if you used placeholder for max_len at __init__ + :return: best hypotheses' outputs[batch, out_len] and attnP[batch, out_len, inp_len] + """ + feed_dict = {placeholder: batch_data[k] for k, placeholder in self.batch_placeholder.items()} + for k, v in optional_feed.items(): + feed_dict[k] = v + + out_ids, attnP = tf.get_default_session().run( + [self.best_out, self.best_attnP], + feed_dict=feed_dict) + + return out_ids, attnP + + def create_initial_stack(self, model, batch_placeholder, force_bos=False, **flags): + inp = batch_placeholder['inp'] + batch_size = tf.shape(inp)[0] + + initial_state = model.encode(batch_placeholder, **flags) + initial_attnP = model.get_attnP(initial_state)[:, None] + initial_tracked = nested_map(lambda x: x[:, None], self.get_tracked_outputs(initial_state)) + + if force_bos: + initial_outputs = tf.cast(tf.fill((batch_size, 1), model.out_voc.bos), inp.dtype) + initial_state = model.decode(initial_state, initial_outputs[:, 0], **flags) + second_attnP = model.get_attnP(initial_state)[:, None] + initial_attnP = tf.concat([initial_attnP, second_attnP], axis=1) + initial_tracked = nested_map(lambda x, y: tf.concat([x, y[:, None]], axis=1), + initial_tracked, + self.get_tracked_outputs(initial_state),) + else: + initial_outputs = tf.zeros((batch_size, 0), dtype=inp.dtype) + + initial_scores = tf.zeros([batch_size], dtype='float32') + initial_finished = tf.zeros_like([batch_size], dtype='bool') + initial_len = tf.shape(initial_outputs)[1] + + return self.Stack(initial_outputs, initial_len, initial_scores, initial_finished, + initial_state, initial_attnP, initial_tracked) + + def greedy_step(self, model, stack, **flags): + """ + :type model: lib.task.seq2seq.inference.translate_model.TranslateModel + :param stack: beam search stack + :return: new beam search stack + """ + out_seq, out_len, scores, finished, dec_states, attnP, tracked = stack + + # 1. sample + batch_size = tf.shape(out_seq)[0] + phony_slices = tf.range(batch_size) + _, new_outputs, logp_next = model.sample(dec_states, scores, phony_slices, k=1, **flags) + + out_seq = tf.concat([out_seq, new_outputs], axis=1) + scores = scores + logp_next[:, 0] * tf.cast(~finished, 'float32') + is_eos = tf.equal(new_outputs[:, 0], model.out_voc.eos) + finished = tf.logical_or(finished, is_eos) + + # 2. decode + new_states = model.decode(dec_states, new_outputs[:, 0], **flags) + attnP = tf.concat([attnP, model.get_attnP(new_states)[:, None]], axis=1) + tracked = nested_map(lambda seq, new: tf.concat([seq, new[:, None]], axis=1), + tracked, self.get_tracked_outputs(new_states) + ) + return self.Stack(out_seq, out_len + 1, scores, finished, new_states, attnP, tracked) + + +class BeamSearchDecoder: + """ + Performs ingraph beam search for given input sequences (inp) + Supports custom penalizing, pruning against best score and best score in beam (via beam_spread) + :param model: something that implements TranslateModel + :param batch_placeholder: whatever model can .encode, + by default should be {'inp': int32 matrix [batch_size x time]} + :param max_len: maximum hypothesis length to consider, symbolic or numeric integer + if scalar, sets global length; if vector[batch_size], sets length for each input; + if None, defaults to 2*inp_len + 3; float('inf') means unlimited + :param min_len: minimum valid output length. None means min_len=inp_len // 4 - 1; Same type as min_len + :param beam_size: maximum number of hypotheses that can pass from one beam search step to another. + The rest is pruned. 
+ :param beam_spread: maximum difference in score between a hypothesis and current best hypothesis. + Anything below that is pruned. + :param force_bos: if True, forces zero-th output to be model.out_voc.bos. Otherwise lets model decide. + :param if_no_eos: if 'last', will return unfinished hypos if there are no finished hypos by max_len + elif 'initial', returns empty hypothesis + :param back_prop: see tf.while_loop back_prop param + :param swap_memory: see tf.while_loop swap_memory param + + :param **flags: whatever else you want to feed into model. This will be passed to encode, decode, etc. + is_train - if True (default), enables dropouts and similar training-only stuff + sampling_strategy - if "random", samples hypotheses proportionally to softmax(logits) + otherwise(default) - takes top K hypotheses + sampling_temperature - if sampling_strategy == "random", + performs sampling ~ softmax(logits/sampling_temperature) + + """ + Stack = namedtuple('Stack', [ + # per hypo values + 'out', # [batch_size x beam_size, nout], int + 'scores', # [batch_size x beam_size ] + 'raw_scores', # [batch_size x beam_size ] + 'attnP', # [batch_size x beam_size, nout+1, ninp] + 'dec_state', # TranslateModel DecState nested structure of [batch_size x beam_size, ...] + + # per beam values + 'slices', # indices of first hypo for each sentence [batch_size ] + 'out_len', # total (maximum) length of a stack [], int + 'best_out', # [batch_size, nout], int, padded with EOS + 'best_scores', # [batch_size] + 'best_raw_scores', # [batch_size] + 'best_attnP', # [batch_size, nout+1, ninp], padded with EOS + 'best_dec_state', # TranslateModel DecState; nested structure of [batch_size, ...] + + # Auxilary data for extension classes. + 'ext' # Dict[StackExtType, StackExtType()] + ]) + + def __init__(self, model, batch_placeholder, min_len=None, max_len=None, + beam_size=12, beam_spread=3, beam_spread_raw=None, force_bos=False, + if_no_eos='last', back_prop=True, swap_memory=False, **flags + ): + assert if_no_eos in ['last', 'initial'] + assert np.isfinite(beam_spread) or max_len != float('inf'), "Must set maximum length if beam_spread is infinite" + # initialize fields + self.batch_placeholder = batch_placeholder + inp_len = batch_placeholder.get('inp_len', infer_length(batch_placeholder['inp'], model.out_voc.eos)) + self.min_len = min_len if min_len is not None else inp_len // 4 - 1 + self.max_len = max_len if max_len is not None else 2 * inp_len + 3 + self.beam_size, self.beam_spread = beam_size, beam_spread + if beam_spread_raw is None: + self.beam_spread_raw = beam_spread + else: + self.beam_spread_raw = beam_spread_raw + self.force_bos, self.if_no_eos = force_bos, if_no_eos + + # actual beam search + first_stack = self.create_initial_stack(model, batch_placeholder, force_bos=force_bos, **flags) + shape_invariants = nested_map(lambda v: tf.TensorShape([None for _ in v.shape]), first_stack) + + def should_continue_translating(*stack): + stack = self.Stack(*stack) + should_continue = self.should_continue_translating(model, stack) + return tf.reduce_any(should_continue) + + def expand_hypos(*stack): + return self.beam_search_step(model, self.Stack(*stack), **flags) + + last_stack = tf.while_loop( + cond=should_continue_translating, + body=expand_hypos, + loop_vars=first_stack, + shape_invariants=shape_invariants, + back_prop=back_prop, + swap_memory=swap_memory, + ) + + # crop unnecessary EOSes that occur if no hypothesis is updated on several last steps + actual_length = infer_length(last_stack.best_out, 
model.out_voc.eos) + max_length = tf.reduce_max(actual_length) + last_stack = last_stack._replace(best_out=last_stack.best_out[:, :max_length]) + + self.best_out = last_stack.best_out + self.best_attnP = last_stack.best_attnP + self.best_scores = last_stack.best_scores + self.best_raw_scores = last_stack.best_raw_scores + self.best_state = last_stack.best_dec_state + + def translate_batch(self, batch_data, **optional_feed): + """ + Translates NUMERIC batch of data + :param batch_data: dict {'inp':np.array int32[batch,time]} + :optional_feed: any additional values to be fed into graph. e.g. if you used placeholder for max_len at __init__ + :return: best hypotheses' outputs[batch, out_len] and attnP[batch, out_len, inp_len] + """ + feed_dict = {placeholder: batch_data[k] for k, placeholder in self.batch_placeholder.items()} + for k, v in optional_feed.items(): + feed_dict[k] = v + + out_ids, attnP = tf.get_default_session().run( + [self.best_out, self.best_attnP], + feed_dict=feed_dict) + + return out_ids, attnP + + def create_initial_stack(self, model, batch, **flags): + """ + Creates initial stack for beam search by encoding inp and optionally forcing BOS as first output + :type model: lib.task.seq2seq.inference.TranslateModel + :param batch: model inputs - whatever model can eat for self.encode(batch,**tags) + :param force_bos: if True, forces zero-th output to be model.out_voc.bos. Otherwise lets model decide. + """ + + dec_state = dec_state_0 = model.encode(batch, **flags) + attnP_0 = model.get_attnP(dec_state_0) + batch_size = tf.shape(attnP_0)[0] + + out_len = tf.constant(0, 'int32') + out = tf.zeros(shape=(batch_size, 0), dtype=tf.int32) # [batch_size, nout = 0] + + if self.force_bos: + bos = tf.fill(value=model.out_voc.bos, dims=(batch_size,)) + dec_state = dec_state_1 = model.decode(dec_state_0, bos, **flags) + attnP_1 = model.get_attnP(dec_state_1) + attnP = tf.stack([attnP_0, attnP_1], axis=1) # [batch_size, 2, ninp] + out_len += 1 + out = tf.concat([out, bos[:, None]], axis=1) + + else: + attnP = attnP_0[:, None, :] # [batch_size, 1, ninp] + + slices = tf.range(0, batch_size) + empty_out = tf.fill(value=model.out_voc.eos, dims=(batch_size, tf.shape(out)[1])) + + # Create stack. + stack = self.Stack( + out=out, + scores=tf.zeros(shape=(batch_size,)), + raw_scores=tf.zeros(shape=(batch_size,)), + attnP=attnP, + dec_state=dec_state, + slices=slices, + out_len=out_len, + best_out=empty_out, + best_scores=tf.fill(value=-float('inf'), dims=(batch_size,)), + best_raw_scores=tf.fill(value=-float('inf'), dims=(batch_size,)), + best_attnP=attnP, + best_dec_state=dec_state, + ext={} + ) + + return stack + + def should_continue_translating(self, model, stack): + """ + Returns a bool vector for all hypotheses where True means hypo should be kept, 0 means it should be dropped. + A hypothesis is dropped if it is either finished or pruned by beam_spread or by beam_size + Note: this function assumes hypotheses for each input sample are sorted by scores(best first)!!! 
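+ Illustration (hypothetical numbers): with beam_spread=3, once the best finished hypothesis for an input
+ scores -4.0, any live hypothesis for that input scoring below -7.0 satisfies scores + beam_spread < best_scores
+ and is pruned.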
+ """ + + # drop finished hypotheses + should_keep = tf.logical_not( + tf.reduce_any(tf.equal(stack.out, model.out_voc.eos), axis=-1)) # [batch_size x beam_size] + + n_hypos = tf.shape(stack.out)[0] + batch_size = tf.shape(stack.best_out)[0] + batch_indices = hypo_to_batch_index(n_hypos, stack.slices) + + # prune by length + if self.max_len is not None: + within_max_length = tf.less_equal(stack.out_len, self.max_len) + + # if we're given one max_len per each sentence, repeat it for each batch + if not is_scalar(self.max_len): + within_max_length = tf.gather(within_max_length, batch_indices) + + should_keep = tf.logical_and( + should_keep, + within_max_length, + ) + + # prune by beam spread + if self.beam_spread is not None: + best_scores_for_hypos = tf.gather(stack.best_scores, batch_indices) + pruned_by_spread = tf.less(stack.scores + self.beam_spread, best_scores_for_hypos) + should_keep = tf.logical_and(should_keep, tf.logical_not(pruned_by_spread)) + + if self.beam_spread_raw: + best_raw_scores_for_hypos = tf.gather(stack.best_raw_scores, batch_indices) + pruned_by_raw_spread = tf.less(stack.raw_scores + self.beam_spread_raw, best_raw_scores_for_hypos) + should_keep = tf.logical_and(should_keep, + tf.logical_not(pruned_by_raw_spread)) + + + # pruning anything exceeding beam_size + if self.beam_size is not None: + # This code will use a toy example to explain itself: slices=[0,2,5,5,8], n_hypos=10, beam_size=2 + # should_keep = [1,1,1,0,1,1,1,1,0,1] (two hypotheses have been pruned/finished) + + # 1. compute index of each surviving hypothesis globally over full batch, [0,1,2,3,3,4,5,6,7,7] + survived_hypo_id = tf.cumsum(tf.cast(should_keep, 'int32'), exclusive=True) + # 2. compute number of surviving hypotheses for each batch sample, [2,2,3,1] + survived_hypos_per_input = tf.bincount(batch_indices, weights=tf.cast(should_keep, 'int32'), + minlength=batch_size, maxlength=batch_size) + # 3. compute the equivalent of slices for hypotheses excluding pruned: [0,2,4,4,7] + slices_exc_pruned = tf.cumsum(survived_hypos_per_input, exclusive=True) + # 4. compute index of surviving hypothesis within one sample (for each sample) + # index of input sentence in batch: inp0 /inp_1\ /inp_2\, /inp_3\ + # index of hypothesis within input: [0, 1, 0, 1, 1, 0, 1, 2, 0, 0, 1] + # 'e' = pruned earlier, 'x' - pruned now: 'e' 'x' 'e' + beam_index = survived_hypo_id - tf.gather(slices_exc_pruned, batch_indices) + + # 5. prune hypotheses with index exceeding beam_size + pruned_by_beam_size = tf.greater_equal(beam_index, self.beam_size) + should_keep = tf.logical_and(should_keep, tf.logical_not(pruned_by_beam_size)) + + return should_keep + + def beam_search_step_expand_hypos(self, model, stack, **flags): + """ + Performs one step of beam search decoding. Samples new hypothesis to stack. + :type model: lib.task.seq2seq.inference.TranslateModel + :type stack: BeamSearchDecoder.BeamSearchStack + """ + + # Prune + # - Against best completed hypo + # - Against best hypo in beam + # - EOS translations + # - Against beam size + + should_keep = self.should_continue_translating(model, stack) + + hypo_indices = tf.where(should_keep)[:, 0] + stack = self.shuffle_beam(model, stack, hypo_indices) + + # Compute penalties, if any + base_scores = self.compute_base_scores(model, stack, **flags) + + # Get top-beam_size new hypotheses for each input. + # Note: we assume sample returns hypo_indices from highest score to lowest, therefore hypotheses + # are automatically sorted by score within each slice. 
+ hypo_indices, words, delta_raw_scores = model.sample(stack.dec_state, base_scores, stack.slices, + self.beam_size, **flags + ) + + # hypo_indices, words and delta_raw_scores may contain -1/-1/-inf triples for non-available hypotheses. + # This can only happen if for some input there were 0 surviving hypotheses OR beam_size > n_hypos*vocab_size + # In either case, we want to prune such hypotheses + valid_indices = tf.where(tf.not_equal(tf.reshape(hypo_indices, [-1]), -1))[:, 0] + hypo_indices = tf.gather(tf.reshape(hypo_indices, [-1]), valid_indices) + words = tf.gather(tf.reshape(words, [-1]), valid_indices) + delta_raw_scores = tf.gather(tf.reshape(delta_raw_scores, [-1]), valid_indices) + + stack = self.shuffle_beam(model, stack, hypo_indices) + dec_state = model.decode(stack.dec_state, words, **flags) + step_attnP = model.get_attnP(dec_state) + # step_attnP shape: [batch_size * beam_size, ninp] + + # collect stats for the next step + attnP = tf.concat([stack.attnP, step_attnP[:, None, :]], axis=1) # [batch * beam_size, nout, ninp] + out = tf.concat([stack.out, words[..., None]], axis=-1) + out_len = stack.out_len + 1 + + raw_scores = stack.raw_scores + delta_raw_scores + + return stack._replace( + out=out, + raw_scores=raw_scores, + attnP=attnP, + out_len=out_len, + dec_state=dec_state, + ) + + def beam_search_step_update_best(self, model, stack, maintain_best_state=False, **flags): + """ + Performs one step of beam search decoding. Removes hypothesis from (beam_size ** 2) stack. + :type model: lib.task.seq2seq.inference.TranslateModel + :type stack: BeamSearchDecoder.BeamSearchStack + """ + + # Compute sample id for each hypo in stack + n_hypos = tf.shape(stack.out)[0] + batch_indices = hypo_to_batch_index(n_hypos, stack.slices) + + # Mark finished hypos + finished = tf.equal(stack.out[:, -1], model.out_voc.eos) + + if self.min_len is not None: + below_min_length = tf.less(stack.out_len, self.min_len) + if not is_scalar(self.min_len): + below_min_length = tf.gather(below_min_length, batch_indices) + + finished = tf.logical_and(finished, tf.logical_not(below_min_length)) + + if self.if_no_eos == 'last': + # No hypos finished with EOS, but len == max_len, consider unfinished hypos + reached_max_length = tf.equal(stack.out_len, self.max_len) + if not is_scalar(self.max_len): + reached_max_length = tf.gather(reached_max_length, batch_indices) + + have_best_out = tf.reduce_any(tf.not_equal(stack.best_out, model.out_voc.eos), 1) + no_finished_alternatives = tf.gather(tf.logical_not(have_best_out), batch_indices) + allow_unfinished_hypo = tf.logical_and(reached_max_length, no_finished_alternatives) + + finished = tf.logical_or(finished, allow_unfinished_hypo) + + # select best finished hypo for each input in batch (if any) + finished_scores = tf.where(finished, stack.scores, tf.fill(tf.shape(stack.scores), -float('inf'))) + best_scores, best_indices = sliced_argmax(finished_scores[:, None], stack.slices, 1) + best_scores, best_indices = best_scores[:, 0], stack.slices + best_indices[:, 0] + best_indices = tf.clip_by_value(best_indices, 0, tf.shape(stack.out)[0] - 1) + + stack_is_nonempty = tf.not_equal(tf.shape(stack.out)[0], 0) + + # take the better one of new best hypotheses or previously existing ones + new_is_better = tf.greater(best_scores, stack.best_scores) + best_scores = tf.where(new_is_better, best_scores, stack.best_scores) + + new_best_raw_scores = tf.cond(stack_is_nonempty, + lambda:tf.gather(stack.raw_scores, best_indices), + lambda:stack.best_raw_scores) + + best_raw_scores = 
tf.where(new_is_better, new_best_raw_scores, stack.best_raw_scores) + + + batch_size = tf.shape(stack.best_out)[0] + eos_pad = tf.fill(value=model.out_voc.eos, dims=(batch_size, 1)) + padded_best_out = tf.concat([stack.best_out, eos_pad], axis=1) + new_out = tf.cond(stack_is_nonempty, + lambda: tf.gather(stack.out, best_indices), + lambda: tf.gather(padded_best_out, best_indices) # dummy out, best indices are zeros + ) + best_out = tf.where(new_is_better, new_out, padded_best_out) + + zero_attnP = tf.zeros_like(stack.best_attnP[:, :1, :]) + padded_best_attnP = tf.concat([stack.best_attnP, zero_attnP], axis=1) + new_attnP = tf.cond(stack_is_nonempty, + lambda: tf.gather(stack.attnP, best_indices), + lambda: tf.gather(padded_best_attnP, best_indices), # dummy attnP, best indices are zeros + ) + best_attnP = tf.where(new_is_better, new_attnP, padded_best_attnP) + + # if better translation is reached, update it's state too + best_dec_state = stack.best_dec_state + if maintain_best_state: + new_best_dec_state = model.shuffle(stack.dec_state, best_indices) + best_dec_state = model.switch(new_is_better, new_best_dec_state, stack.best_dec_state) + + return stack._replace( + best_out=best_out, + best_scores=best_scores, + best_attnP=best_attnP, + best_raw_scores=best_raw_scores, + best_dec_state=best_dec_state, + ) + + def beam_search_step(self, model, stack, **flags): + stack = self.beam_search_step_expand_hypos(model, stack, **flags) + stack = stack._replace( + scores=self.compute_scores(model, stack, **flags) + ) + is_beam_not_empty = tf.not_equal(tf.shape(stack.raw_scores)[0], 0) + return self.beam_search_step_update_best(model, stack, **flags) + + def compute_scores(self, model, stack, **flags): + """ + Compute hypothesis scores given beam search stack. Applies any penalties necessary. + For quick prototyping, you can store whatever penalties you need in stack.dec_state + :type model: lib.task.seq2seq.inference.TranslateModel + :type stack: BeamSearchDecoder.BeamSearchStack + :return: float32 vector (one score per hypo) + """ + return stack.raw_scores + + def compute_base_scores(self, model, stack, **flags): + """ + Compute hypothesis scores to be used as base_scores for model.sample. + This is usually same as compute_scores but scaled to the magnitude of log-probabilities + :type model: lib.task.seq2seq.inference.TranslateModel + :type stack: BeamSearchDecoder.BeamSearchStack + :return: float32 vector (one score per hypo) + """ + return self.compute_scores(model, stack, **flags) + + def shuffle_beam(self, model, stack, flat_indices): + """ + Selects hypotheses by index from entire BeamSearchStack + Note: this method assumes that both stack and flat_indices are sorted by sample index + (i.e. first are indices for input0 are, then indices for input1, then 2, ... 
then input[batch_size-1] + """ + n_hypos = tf.shape(stack.out)[0] + batch_size = tf.shape(stack.best_out)[0] + + # compute new slices: + # step 1: get index of inptut sequence (in batch) for each hypothesis in flat_indices + sample_ids_for_slices = tf.gather(hypo_to_batch_index(n_hypos, stack.slices), flat_indices) + # step 2: compute how many hypos per flat_indices + n_hypos_per_sample = tf.bincount(sample_ids_for_slices, minlength=batch_size, maxlength=batch_size) + # step 3: infer slice start indices + new_slices = tf.cumsum(n_hypos_per_sample, exclusive=True) + + # shuffle everything else + return stack._replace( + out=tf.gather(stack.out, flat_indices), + scores=tf.gather(stack.scores, flat_indices), + raw_scores=tf.gather(stack.raw_scores, flat_indices), + attnP=tf.gather(stack.attnP, flat_indices), + dec_state=model.shuffle(stack.dec_state, flat_indices), + ext=nested_map(lambda x: tf.gather(x, flat_indices), stack.ext), + slices=new_slices, + ) + + +class PenalizedBeamSearchDecoder(BeamSearchDecoder): + """ + Performs ingraph beam search for given input sequences (inp) + Implements length and coverage penalties + """ + PenalizedExt = namedtuple('PenalizedExt', [ + 'attnP_sum', # [batch_size x beam_size, ninp] + ]) + + def beam_search_step_expand_hypos(self, model, stack, **flags): + new_stack = super().beam_search_step_expand_hypos(model, stack, **flags) + new_stack_ext = new_stack.ext[self.PenalizedExt] + + step_attnP = model.get_attnP(new_stack.dec_state) + # step_attnP shape: [batch_size * beam_size, ninp] + + new_stack.ext[self.PenalizedExt] = new_stack_ext._replace( + attnP_sum=new_stack_ext.attnP_sum + step_attnP) + return new_stack + + def create_initial_stack(self, model, batch, **flags): + stack = super().create_initial_stack(model, batch, **flags) + stack.ext[self.PenalizedExt] = self.PenalizedExt( + attnP_sum=tf.reduce_sum(stack.attnP, axis=1)) + return stack + + def compute_scores(self, model, stack, len_alpha=1, attn_beta=0, **flags): + """ + Computes scores after length and coverage penalty + :param len_alpha: coefficient for length penalty, score / ( [5 + len(output_sequence)] / 6) ^ len_alpha + :param attn_beta: coefficient for coverage penalty (additive) + attn_beta * sum_i {log min(1.0, sum_j {attention_p[x_i,y_j] } )} + :return: float32 vector (one score per hypo) + """ + stack_ext = stack.ext[self.PenalizedExt] + + if attn_beta: + warn("whenever attn_beta !=0, this code works as in http://bit.ly/2ziK5a8," + "which may or may not be correct depending on your definition.") + + scores = stack.raw_scores + if len_alpha: + length_penalty = tf.pow((1. + tf.to_float(stack.out_len) / 6.), len_alpha) + scores /= length_penalty + if attn_beta: + times_translated = tf.minimum(stack_ext.attnP_sum, 1) + coverage_penalty = tf.reduce_sum( + tf.log(times_translated + sys.float_info.epsilon), + axis=-1) * attn_beta + scores += coverage_penalty + return scores + + def compute_base_scores(self, model, stack, len_alpha=1, **flags): + """ + Compute hypothesis scores to be used as base_scores for model.sample + :return: float32 vector (one score per hypo) + """ + scores = self.compute_scores(model, stack, len_alpha=len_alpha, **flags) + if len_alpha: + length_penalty = tf.pow((1. + tf.to_float(stack.out_len) / 6.), len_alpha) + scores *= length_penalty + return scores + + +def get_words_attnP(step_attnP, inp_words_mask, slices, src_word_attn_aggregation='max'): + # Helper function to extract word-level alignment aggregation on src. 
+ # For parameter description see AlignmentPenaltyBeamSearchDecoder.AlignmentPenaltyExt + + def _get_words_attnP(step_attnP, inp_words_mask, slices): + max_words_len = np.max(np.sum(inp_words_mask, axis=1)) + words_attnP = np.zeros((step_attnP.shape[0], max_words_len)) + slices = slices.tolist() + [step_attnP.shape[0]] + for words_mask, (b, e) in zip(inp_words_mask, + zip(slices[:-1], slices[1:])): + words_ind = np.where(words_mask)[0].tolist() + [len(words_mask)] + for i, (wb, we) in enumerate(zip(words_ind[:-1], words_ind[1:])): + if src_word_attn_aggregation == 'max': + words_attnP[b:e, i] = np.max(step_attnP[b:e, wb:we], axis=1) + elif src_word_attn_aggregation == 'sum': + words_attnP[b:e, i] = np.sum(step_attnP[b:e, wb:we], axis=1) + else: + raise ValueError('Unknown src_word_attn_aggregation mode: %s' % src_word_attn_aggregation) + return words_attnP.astype(np.float32) + + words_attnP = tf.py_func(_get_words_attnP, [step_attnP, inp_words_mask, slices], tf.float32, stateful=False) + words_attnP.set_shape([None, None]) + return tf.stop_gradient(words_attnP) + + +class AlignmentPenaltyBeamSearchDecoder(BeamSearchDecoder): + AlignmentPenaltyExt = namedtuple('AlignmentPenaltyExt', [ + 'attnP_aggregated_src', # [batch_size x beam_size, ninp|ninp_words] + 'attnP_aggregated_dst', # [batch_size x beam_size, nout] + + 'inp_words_mask', # Does bpe token start a new word? [batch_size, ninp], bool + ]) + + def __init__(self, *args, + len_alpha=1, + attn_beta=0, src_attn_aggregation='max', + src_word_attn_aggregation=None, + dst_attn_beta=0, dst_attn_aggregation='max', + **kwargs ): + # We need to initialize them all to create initial stack + self.len_alpha = len_alpha + self.attn_beta = attn_beta + self.src_attn_aggregation = src_attn_aggregation + self.src_word_attn_aggregation = src_word_attn_aggregation + self.dst_attn_beta = dst_attn_beta + self.dst_attn_aggregation = dst_attn_aggregation + super().__init__(*args, **kwargs) + + def beam_search_step_expand_hypos(self, model, stack, **flags): + stack = super().beam_search_step_expand_hypos(model, stack, **flags) + stack_ext = stack.ext[self.AlignmentPenaltyExt] + + step_attnP = model.get_attnP(stack.dec_state) + # step_attnP shape: [batch_size * beam_size, ninp] + + # updating attnP_aggregated_src + step_attnP_word = step_attnP + if self.src_word_attn_aggregation: + step_attnP_word = get_words_attnP( + step_attnP_word, stack_ext.inp_words_mask, + stack.slices, self.src_word_attn_aggregation) + + max_words_num = tf.shape(stack_ext.attnP_aggregated_src)[1] + paddings = max_words_num - tf.shape(step_attnP_word)[1] + step_attnP_word = tf.pad(step_attnP_word, [[0, 0], [0, paddings]]) + + if self.attn_beta: + if self.src_attn_aggregation == 'max': + attnP_aggregated_src = tf.maximum(stack_ext.attnP_aggregated_src, + step_attnP_word) + elif self.src_attn_aggregation == 'sum': + attnP_aggregated_src = stack_ext.attnP_aggregated_src + step_attnP_word + else: + raise ValueError + else: + attnP_aggregated_src = stack_ext.attnP_aggregated_src + + # updating attnP_aggregated_dst + if self.dst_attn_beta: + if self.dst_attn_aggregation == 'max': + dst_attnP_aggregated = tf.reduce_max(step_attnP_word, axis=-1)[:, None] + elif self.dst_attn_aggregation == 'sum': + dst_attnP_aggregated = tf.reduce_sum(step_attnP_word, axis=-1)[:, None] + else: + raise ValueError + + attnP_aggregated_dst = tf.concat( + [stack_ext.attnP_aggregated_dst, dst_attnP_aggregated], + axis=1) + else: + attnP_aggregated_dst = stack_ext.attnP_aggregated_dst + + 
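+ # write the updated per-source and per-target aggregated attention back into the stack extension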
stack.ext[self.AlignmentPenaltyExt] = stack_ext._replace( + attnP_aggregated_src=attnP_aggregated_src, + attnP_aggregated_dst=attnP_aggregated_dst) + return stack + + def create_initial_stack(self, model, batch, **flags): + stack = super().create_initial_stack(model, batch, **flags) + + words_attnP = tf.squeeze(stack.attnP, axis=1) + + # Calc inp_words_mask and aggregate data. + if self.src_word_attn_aggregation: + def is_new_word(inp_words): + return np.array([[not v.startswith(b'`') for v in l] for l in inp_words]) + + inp_words_mask = tf.py_func(is_new_word, [batch['inp_words']], bool, stateful=False) + inp_words_mask.set_shape(batch['inp_words'].shape) + inp_words_mask = tf.stop_gradient(inp_words_mask) + + words_attnP = get_words_attnP( + words_attnP, inp_words_mask, stack.slices, + self.src_word_attn_aggregation) + else: + inp_words_mask = tf.fill(tf.shape(batch['inp']), 1.0) + + + if self.attn_beta: + if self.src_attn_aggregation in ('max', 'sum'): + attnP_aggregated_src = words_attnP + else: + raise ValueError + else: + attnP_aggregated_src = tf.fill(tf.shape(batch['inp']), 0.0) + + # Calc attnP_aggregated_dst + if self.dst_attn_beta: + if self.dst_attn_aggregation == 'max': + attnP_aggregated_dst = tf.reduce_max(words_attnP, axis=-1) + elif self.dst_attn_aggregation == 'sum': + attnP_aggregated_dst = tf.reduce_sum(words_attnP, axis=-1) + else: + raise ValueError + else: + attnP_aggregated_dst = tf.fill((tf.shape(batch['inp'])[0],), 0.0) + attnP_aggregated_dst = attnP_aggregated_dst[:, None] + + stack.ext[self.AlignmentPenaltyExt] = self.AlignmentPenaltyExt( + attnP_aggregated_src=attnP_aggregated_src, + attnP_aggregated_dst=attnP_aggregated_dst, + inp_words_mask=inp_words_mask + ) + return stack + + def compute_scores(self, model, stack, **flags): + """ + Computes scores after length and coverage penalty + :param len_alpha: coefficient for length penalty, score / ( [5 + len(output_sequence)] / 6) ^ len_alpha + :param attn_beta: coefficient for coverage penalty (additive) + attn_beta * sum_i {log min(1.0, {src_attn_aggregation}_j {attention_p[x_i,y_j] } )} + :param src_attn_aggregation: aggregation for src coverage penalty. + Possible values are 'max', 'sum'. + :param src_word_attn_aggregation: should we aggregate src coverage penalty by words? + Possible values are None/max/sum. + :param dst_attn_beta: coefficient for coverage penalty on dst side: + attn_beta * sum_j {log min(1.0, {dst_attn_aggregation}_i {attention_p[x_i,y_j] } )} + :param dst_attn_aggregation: aggregation for dst coverage penalty. + Possible values are 'max', 'sum'. + :return: float32 vector (one score per hypo) + """ + + stack_ext = stack.ext[self.AlignmentPenaltyExt] + + scores = stack.raw_scores + if self.len_alpha: + length_penalty = tf.pow((1. 
+ tf.to_float(stack.out_len) / 6.), self.len_alpha) + scores /= length_penalty + if self.attn_beta: + coverage_penalty = tf.reduce_sum( + tf.log(tf.minimum(stack_ext.attnP_aggregated_src, 1) + sys.float_info.epsilon), + axis=-1) + scores += coverage_penalty * self.attn_beta + if self.dst_attn_beta: + coverage_penalty = tf.reduce_sum( + tf.log(tf.minimum(stack_ext.attnP_aggregated_dst, 1) + sys.float_info.epsilon), + axis=-1) + scores += coverage_penalty * self.dst_attn_beta + + return scores + + def compute_base_scores(self, model, stack, **flags): + """ + Compute hypothesis scores to be used as base_scores for model.sample + :return: float32 vector (one score per hypo) + """ + scores = self.compute_scores(model, stack, **flags) + if self.len_alpha: + length_penalty = tf.pow((1. + tf.to_float(stack.out_len) / 6.), self.len_alpha) + scores *= length_penalty + return scores + + +def hypo_to_batch_index(n_hypos, slices): + """ + Computes index in batch (input sequence index) for each hypothesis given slices. + :param n_hypos: number of hypotheses (tf int scalar) + :param slices: indices of first hypo for each input in batch + + It should guaranteed that + - slices[0]==0 (first hypothesis starts at index 0), otherwise output[:slices[0]] will be -1 + - if batch[i] is terminated, then batch[i]==batch[i+1] + """ + is_next_sent_at_t = tf.bincount(slices, minlength=n_hypos, maxlength=n_hypos) + hypo_to_index = tf.cumsum(is_next_sent_at_t) - 1 + return hypo_to_index diff --git a/lib/task/seq2seq/models/__init__.py b/lib/task/seq2seq/models/__init__.py new file mode 100644 index 0000000..7a93f7e --- /dev/null +++ b/lib/task/seq2seq/models/__init__.py @@ -0,0 +1,95 @@ +from ..inference import translate_lines +from lib.task.seq2seq.inference import TranslateModel, GreedyDecoder, PenalizedBeamSearchDecoder +from ..data import make_batch_data, make_batch_placeholder +from functools import lru_cache +from itertools import chain, islice + + +class ModelBase: + def encode_decode(self, batch, is_train): + """ Encode input sequence and decode rdo for output sequence """ + raise NotImplementedError() + + def _get_batch_sample(self): + return [("i saw a cat", "i write the code")] + + def make_feed_dict(self, batch, **kwargs): + batch_data = make_batch_data(batch, self.inp_voc, self.out_voc, force_bos=self.hp.get('force_bos', True), **kwargs) + return batch_data + + +class TranslateModelBase(TranslateModel, ModelBase): + """ + A base class that most seq2seq models depend on. + Must have following fields: name, inp_voc, out_voc, loss + """ + def translate_lines(self, lines, ingraph=True, ingraph_mode='beam_search', + unbpe=True, batch_size=None, dumper=None, **flags): + """ Translate multiple lines with the model """ + if ingraph: + translator = self.get_ingraph_translator(mode=ingraph_mode, back_prop=False, **flags) + else: + translator = self.get_translator(**flags) + + replace_unk = flags.get('replace', self.hp.get('replace', False)) + + if batch_size is None: + lines_batched = [lines] + else: + lines = iter(lines) + lines_batched = list(iter(lambda: tuple(islice(lines, batch_size)), ())) + + outputs = (translate_lines(batch_lines, translator, self, self.out_voc, replace_unk, unbpe, dumper=dumper) + for batch_lines in lines_batched) + + return list(chain(*outputs)) + + def predict(self): + self.get_predictor().main() + + @lru_cache() + def get_ingraph_translator(self, mode='beam_search', **flags): + """ + Creates a symbolic translation graph on a batch of placeholders. + Used to translate numeric data. 
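+ Usage sketch (mirrors translate_lines above; batch_data is a numeric feed dict):
+ translator = model.get_ingraph_translator(mode='beam_search', back_prop=False)
+ out_ids, attnP = translator.translate_batch(batch_data)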
+ :param mode: 'greedy', 'sample', or 'beam_search' + :param flags: anything else you want to pass to decoder, encode, decode, sample, etc. + :return: a class with .best_out, .best_scores containing symbolic tensors for translations + """ + batch_data_sample = self.make_feed_dict(self._get_batch_sample()) + batch_placeholder = make_batch_placeholder(batch_data_sample) + return self.symbolic_translate(batch_placeholder, mode, **flags) + + def symbolic_translate(self, batch_placeholder, mode='beam_search', **flags): + """ + A function that takes a dict of symbolic inputs and outputs symolic translations + :param batch_placeholder: a dict of symbolic inputs {'inp':int32[batch, time]} + :param mode: str: 'greedy', 'sample', 'beam_search' or a decoder class + :param flags: anything else you want to pass to decoder, encode, decode, sample, etc. + :return: a class with .best_out, .best_scores containing symbolic tensors for translations + """ + flags = dict(self.hp, **flags) + + if mode in ('greedy', 'sample'): + flags['sampling_strategy'] = 'random' if mode == 'sample' else 'greedy' + return GreedyDecoder( + model=self.get_translate_model(), + batch_placeholder=batch_placeholder, + **flags + ) + elif mode == 'beam_search': + return PenalizedBeamSearchDecoder( + model=self.get_translate_model(), + batch_placeholder=batch_placeholder, + **flags + ) + elif callable(mode): + return mode(self.get_translate_model(), batch_placeholder, **flags) + else: + raise ValueError("Invalid mode : %s" % mode) + + def get_translate_model(self): + if hasattr(self, 'translate_model'): + return self.translate_model + + return self diff --git a/lib/task/seq2seq/models/transformer.py b/lib/task/seq2seq/models/transformer.py new file mode 100644 index 0000000..4c7015b --- /dev/null +++ b/lib/task/seq2seq/models/transformer.py @@ -0,0 +1,623 @@ +#!/usr/bin/env python3 +from lib.layers import * +from lib.ops import * +from ..models import TranslateModelBase, TranslateModel +from ..data import * +from collections import namedtuple + + +class Transformer: + def __init__( + self, name, + inp_voc, out_voc, + *_args, + emb_size=None, hid_size=512, + key_size=None, value_size=None, + inner_hid_size=None, # DEPRECATED. 
Left for compatibility with older experiments + ff_size=None, + num_heads=8, num_layers=6, + attn_dropout=0.0, attn_value_dropout=0.0, relu_dropout=0.0, res_dropout=0.1, + share_emb=False, inp_emb_bias=False, rescale_emb=False, + dst_reverse=False, dst_rand_offset=False, summarize_preactivations=False, + res_steps='ldan', normalize_out=False, multihead_attn_format='v1', + emb_inp_device='', emb_out_device='', + **_kwargs + ): + + if isinstance(ff_size, str): + ff_size = [int(i) for i in ff_size.split(':')] + + if _args: + raise Exception("Unexpected positional arguments") + + emb_size = emb_size if emb_size else hid_size + key_size = key_size if key_size else hid_size + value_size = value_size if value_size else hid_size + if key_size % num_heads != 0: + raise Exception("Bad number of heads") + if value_size % num_heads != 0: + raise Exception("Bad number of heads") + + self.name = name + self.num_layers_enc = num_layers + self.num_layers_dec = num_layers + self.res_dropout = res_dropout + self.emb_size = emb_size + self.hid_size = hid_size + self.rescale_emb = rescale_emb + self.summarize_preactivations = summarize_preactivations + self.dst_reverse = dst_reverse + self.dst_rand_offset = dst_rand_offset + self.normalize_out = normalize_out + + with tf.variable_scope(name): + max_voc_size = max(inp_voc.size(), out_voc.size()) + + self.emb_inp = Embedding( + 'emb_inp', max_voc_size if share_emb else inp_voc.size(), emb_size, + initializer=tf.random_normal_initializer(0, emb_size ** -.5), + device=emb_inp_device) + + self.emb_out = Embedding( + 'emb_out', max_voc_size if share_emb else out_voc.size(), emb_size, + matrix=self.emb_inp.mat if share_emb else None, + initializer=tf.random_normal_initializer(0, emb_size ** -.5), + device=emb_out_device) + + self.emb_inp_bias = 0 + if inp_emb_bias: + self.emb_inp_bias = get_model_variable('emb_inp_bias', shape=[1, 1, emb_size]) + + def get_layer_params(layer_prefix, layer_idx): + layer_name = '%s-%i' % (layer_prefix, layer_idx) + inp_out_size = emb_size if layer_idx == 0 else hid_size + return layer_name, inp_out_size + + def attn_layer(layer_prefix, layer_idx, **kwargs): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + MultiHeadAttn( + layer_name, + inp_size=inp_out_size, + key_depth=key_size, + value_depth=value_size, + output_depth=hid_size, + num_heads=num_heads, + attn_dropout=attn_dropout, + attn_value_dropout=attn_value_dropout, + **kwargs), + inp_size=inp_out_size, + out_size=inp_out_size, + steps=res_steps, + dropout=res_dropout) + + def ffn_layer(layer_prefix, layer_idx, ffn_hid_size): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + FFN( + layer_name, + inp_size=inp_out_size, + hid_size=ffn_hid_size, + out_size=hid_size, + relu_dropout=relu_dropout), + inp_size=inp_out_size, + out_size=hid_size, + steps=res_steps, + dropout=res_dropout) + + # Encoder/decoder layer params + enc_ffn_hid_size = ff_size if ff_size else (inner_hid_size if inner_hid_size else hid_size) + dec_ffn_hid_size = ff_size if ff_size else hid_size + dec_enc_attn_format = 'use_kv' if multihead_attn_format == 'v1' else 'combined' + + # Encoder Layers + self.enc_attn = [attn_layer('enc_attn', i) for i in range(self.num_layers_enc)] + + self.enc_ffn = [ffn_layer('enc_ffn', i, enc_ffn_hid_size) for i in range(self.num_layers_enc)] + + if self.normalize_out: + self.enc_out_norm = LayerNorm('enc_out_norm', + inp_size=emb_size if 
self.num_layers_enc == 0 else hid_size) + + # Decoder layers + self.dec_attn = [attn_layer('dec_attn', i) for i in range(self.num_layers_dec)] + self.dec_enc_attn = [attn_layer('dec_enc_attn', i, _format=dec_enc_attn_format) for i in + range(self.num_layers_dec)] + + self.dec_ffn = [ffn_layer('dec_ffn', i, dec_ffn_hid_size) for i in range(self.num_layers_dec)] + + if self.normalize_out: + self.dec_out_norm = LayerNorm('dec_out_norm', + inp_size=emb_size if self.num_layers_dec == 0 else hid_size) + + def encode(self, inp, inp_len, is_train): + with dropout_scope(is_train), tf.name_scope('mod_enc') as scope: + + # Embeddings + emb_inp = self.emb_inp(inp) # [batch_size * ninp * emb_dim] + if self.rescale_emb: + emb_inp *= self.emb_size ** .5 + emb_inp += self.emb_inp_bias + + # Prepare decoder + enc_attn_mask = self._make_enc_attn_mask(inp, inp_len) # [batch_size * 1 * 1 * ninp] + + enc_inp = self._add_timing_signal(emb_inp) + + # Apply dropouts + if is_dropout_enabled(): + enc_inp = tf.nn.dropout(enc_inp, 1.0 - self.res_dropout) + + tf.add_to_collection("LayerEmbeddings", enc_inp) + + # Encoder + for layer in range(self.num_layers_enc): + enc_inp = self.enc_attn[layer](enc_inp, enc_attn_mask) + enc_inp = self.enc_ffn[layer](enc_inp, summarize_preactivations=self.summarize_preactivations) + tf.add_to_collection("LayerEmbeddings", enc_inp) + + if self.normalize_out: + enc_inp = self.enc_out_norm(enc_inp) + + tf.add_to_collection(lib.meta.ACTIVATIONS, tf.identity(enc_inp, name=scope)) + + return enc_inp, enc_attn_mask + + def decode(self, out, out_len, out_reverse, enc_out, enc_attn_mask, is_train): + with dropout_scope(is_train), tf.name_scope('mod_dec') as scope: + # Embeddings + emb_out = self.emb_out(out) # [batch_size * nout * emb_dim] + if self.rescale_emb: + emb_out *= self.emb_size ** .5 + + # Shift right; drop embedding for last word + emb_out = tf.pad(emb_out, [[0, 0], [1, 0], [0, 0]])[:, :-1, :] + + # Prepare decoder + dec_attn_mask = self._make_dec_attn_mask(out) # [1 * 1 * nout * nout] + + offset = 'random' if self.dst_rand_offset else 0 + dec_inp = self._add_timing_signal(emb_out, offset=offset, inp_reverse=out_reverse) + # Apply dropouts + if is_dropout_enabled(): + dec_inp = dropout(dec_inp, 1.0 - self.res_dropout) + + # bypass info from Encoder to avoid None gradients for num_layers_dec == 0 + if self.num_layers_dec == 0: + inp_mask = tf.squeeze(tf.transpose(enc_attn_mask, perm=[3, 1, 2, 0]), 3) + dec_inp += tf.reduce_mean(enc_out * inp_mask, axis=[0, 1], keep_dims=True) + + # Decoder + for layer in range(self.num_layers_dec): + dec_inp = self.dec_attn[layer](dec_inp, dec_attn_mask) + dec_inp = self.dec_enc_attn[layer](dec_inp, enc_attn_mask, enc_out) + dec_inp = self.dec_ffn[layer](dec_inp, summarize_preactivations=self.summarize_preactivations) + + if self.normalize_out: + dec_inp = self.dec_out_norm(dec_inp) + + tf.add_to_collection(lib.meta.ACTIVATIONS, tf.identity(dec_inp, name=scope)) + + return dec_inp + + def relprop_decode(self, R): + """ propagates relevances from rdo to output embeddings and encoder state """ + R_enc = 0.0 + R_enc_scale = 0.0 + for layer in range(self.num_layers_dec)[::-1]: + R = self.dec_ffn[layer].relprop(R) + + relevance_dict = self.dec_enc_attn[layer].relprop(R, main_key='query_inp') + R = relevance_dict['query_inp'] + R_enc += relevance_dict['kv_inp'] + R_enc_scale += tf.reduce_sum(abs(relevance_dict['kv_inp'])) + + R = self.dec_attn[layer].relprop(R) + + # shift left: compensate for right shift + R = LRP.rescale(R, tf.pad(R, [[0, 0], [0, 1], 
[0, 0]])[:, 1:, :]) + return {'emb_out': R, 'enc_out': R_enc * R_enc_scale / tf.reduce_sum(abs(R_enc))} + + def relprop_encode(self, R): + """ propagates relevances from enc_out to emb_inp """ + for layer in range(self.num_layers_enc)[::-1]: + R = self.enc_ffn[layer].relprop(R) + R = self.enc_attn[layer].relprop(R) + return R + + def relprop_encode_decode(self, R): + """ propagates relevances from rdo to input and optput embeddings """ + relevances = self.relprop_decode(R) + relevances['emb_inp'] = self.relprop_encode(relevances['enc_out']) + return relevances + + def _make_enc_attn_mask(self, inp, inp_len, dtype=tf.float32): + """ + inp = [batch_size * ninp] + inp_len = [batch_size] + + attn_mask = [batch_size * 1 * 1 * ninp] + """ + with tf.variable_scope("make_enc_attn_mask"): + inp_mask = tf.sequence_mask(inp_len, dtype=dtype, maxlen=tf.shape(inp)[1]) + + attn_mask = inp_mask[:, None, None, :] + return attn_mask + + def _make_dec_attn_mask(self, out, dtype=tf.float32): + """ + out = [baatch_size * nout] + + attn_mask = [1 * 1 * nout * nout] + """ + with tf.variable_scope("make_dec_attn_mask"): + length = tf.shape(out)[1] + lower_triangle = tf.matrix_band_part(tf.ones([length, length], dtype=dtype), -1, 0) + attn_mask = tf.reshape(lower_triangle, [1, 1, length, length]) + return attn_mask + + def _add_timing_signal(self, inp, min_timescale=1.0, max_timescale=1.0e4, offset=0, inp_reverse=None): + """ + inp: (batch_size * ninp * hid_dim) + :param offset: add this number to all character positions. + if offset == 'random', picks this number uniformly from [-32000,32000] integers + :type offset: number, tf.Tensor or 'random' + """ + with tf.variable_scope("add_timing_signal"): + ninp = tf.shape(inp)[1] + hid_size = tf.shape(inp)[2] + + position = tf.to_float(tf.range(ninp))[None, :, None] + + if offset == 'random': + BIG_LEN = 32000 + offset = tf.random_uniform(tf.shape(position), minval=-BIG_LEN, maxval=BIG_LEN, dtype=tf.int32) + + # force broadcasting over batch axis + if isinstance(offset * 1, tf.Tensor): # multiply by 1 to also select variables, special generators, etc. 
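+ # reshape offset to rank 3 by appending singleton dims so it broadcasts against position of shape [1, ninp, 1]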
+ assert offset.shape.ndims in (0, 1, 2) + new_shape = [tf.shape(offset)[i] for i in range(offset.shape.ndims)] + new_shape += [1] * (3 - len(new_shape)) + offset = tf.reshape(offset, new_shape) + + position += tf.to_float(offset) + + if inp_reverse is not None: + position = tf.multiply( + position, + tf.where( + tf.equal(inp_reverse, 0), + tf.ones_like(inp_reverse, dtype=tf.float32), + -1.0 * tf.ones_like(inp_reverse, dtype=tf.float32) + )[:, None, None] # (batch_size * ninp * dim) + ) + num_timescales = hid_size // 2 + log_timescale_increment = ( + math.log(float(max_timescale) / float(min_timescale)) / + (tf.to_float(num_timescales) - 1)) + inv_timescales = min_timescale * tf.exp( + tf.to_float(tf.range(num_timescales)) * -log_timescale_increment) + + # scaled_time: [ninp * hid_dim] + scaled_time = position * inv_timescales[None, None, :] + signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=-1) + signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(hid_size, 2)]]) + return inp + signal + + +# ============================================================================ +# Transformer model + +class Model(TranslateModelBase): + + def __init__(self, name, inp_voc, out_voc, **hp): + self.name = name + self.inp_voc = inp_voc + self.out_voc = out_voc + self.hp = hp + + # Parameters + self.transformer = Transformer(name, inp_voc, out_voc, **hp) + + projection_matrix = None + if hp.get('dwwt', False): + projection_matrix = tf.transpose(self.transformer.emb_out.mat) + + self.loss = LossXent( + hp.get('loss_name', 'loss_xent_lm'), + hp['hid_size'], + out_voc, + hp, + matrix=projection_matrix, + bias=None if hp.get("loss_bias", False) else 0) + + inference_mode = hp.get("inference_mode", "fast") + if inference_mode == 'fast': + self.translate_model = TranslateModelFast(self.name, self.transformer, self.loss, self.inp_voc, + self.out_voc) + elif inference_mode == 'lazy': + self.translate_model = TranslateModelLazy(self.name, self.transformer, self.loss, self.inp_voc, + self.out_voc) + else: + raise NotImplementedError("inference_mode %s is not supported" % inference_mode) + + # Train interface + def encode_decode(self, batch, is_train, score_info=False): + inp = batch['inp'] # [batch_size * ninp] + out = batch['out'] # [batch_size * nout] + inp_len = batch.get('inp_len', infer_length(inp, self.inp_voc.eos, time_major=False)) # [batch] + out_len = batch.get('out_len', infer_length(out, self.out_voc.eos, time_major=False)) # [batch] + + out_reverse = tf.zeros_like(inp_len) # batch['out_reverse'] + + # rdo: [batch_size * nout * hid_dim] + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train) + rdo = self.transformer.decode(out, out_len, out_reverse, enc_out, enc_attn_mask, is_train) + + return rdo + + def make_feed_dict(self, batch, **kwargs): + feed_dict = make_batch_data(batch, self.inp_voc, self.out_voc, + force_bos=self.hp.get('force_bos', False), + **kwargs) + return feed_dict + + # ======== TranslateModel for Inference ============ + def encode(self, batch, **flags): + """ + :param batch: a dict of {string:symbolic tensor} that model understands. + By default it should accept {'inp': int32 matrix[batch,time]} + :return: initial decoder state + """ + return self.translate_model.encode(batch, **flags) + + def decode(self, dec_state, words=None, **flags): + """ + Performs decoding step given words and previous state. + :param words: previous output tokens, int32[batch_size]. 
if None, uses zero embeddings (first step) + :returns: next state + """ + return self.translate_model.decode(dec_state, words, **flags) + + def sample(self, dec_state, base_scores, slices, k, **kwargs): + return self.translate_model.sample(dec_state, base_scores, slices, k, **kwargs) + + def get_rdo(self, dec_state, **kwargs): + return self.translate_model.get_rdo(dec_state, **kwargs) + + def get_attnP(self, dec_state, **kwargs): + return self.translate_model.get_attnP(dec_state, **kwargs) + + +class ScopedModel(Model): + + def __init__(self, name, inp_voc, out_voc, **hp): + with tf.variable_scope(name): + super(ScopedModel, self).__init__(name, inp_voc, out_voc, **hp) + + def encode_decode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).encode_decode(*args, **kwargs) + + def encode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).encode(*args, **kwargs) + + def decode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).decode(*args, **kwargs) + + def sample(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).sample(*args, **kwargs) + + +# ============================================================================ +# Transformer inference + +class TranslateModelFast(TranslateModel): + DecState = namedtuple("transformer_state", ['enc_out', 'enc_attn_mask', 'attnP', 'rdo', 'out_seq', 'offset', + 'emb', 'dec_layers', 'dec_enc_kv', 'dec_dec_kv']) + + def __init__(self, name, transformer, loss, inp_voc, out_voc): + """ + A translation model that performs quick (n^2) inference for transformer + with manual implementation of 1-step decoding + """ + self.name = name + self.transformer = transformer + self.loss = loss + self.inp_voc = inp_voc + self.out_voc = out_voc + + def encode(self, batch, is_train=False, **kwargs): + """ + :param batch: a dict containing 'inp':int32[batch_size * ninp] and optionally inp_len:int32[batch_size] + :param is_train: if True, enables dropouts + """ + inp = batch['inp'] + inp_len = batch.get('inp_len', infer_length(inp, self.inp_voc.eos, time_major=False)) + with dropout_scope(is_train), tf.name_scope(self.transformer.name): + # Encode. + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train=False) + + # Decoder dummy input/output + ninp = tf.shape(inp)[1] + batch_size = tf.shape(inp)[0] + hid_size = tf.shape(enc_out)[-1] + out_seq = tf.zeros([batch_size, 0], dtype=inp.dtype) + rdo = tf.zeros([batch_size, hid_size], dtype=enc_out.dtype) + + attnP = tf.ones([batch_size, ninp]) / tf.to_float(inp_len)[:, None] + + offset = tf.zeros((batch_size,)) + if self.transformer.dst_rand_offset: + BIG_LEN = 32000 + random_offset = tf.random_uniform(tf.shape(offset), minval=-BIG_LEN, maxval=BIG_LEN, dtype=tf.int32) + offset += tf.to_float(random_offset) + + trans = self.transformer + empty_emb = tf.zeros([batch_size, 0, trans.emb_size]) + empty_dec_layers = [tf.zeros([batch_size, 0, trans.hid_size])] * trans.num_layers_dec + input_layers = [empty_emb] + empty_dec_layers[:-1] + + # prepare kv parts for all decoder attention layers. 
Note: we do not preprocess enc_out + # for each layer because ResidualLayerWrapper only preprocesses first input (query) + dec_enc_kv = [layer.kv_conv(enc_out) + for i, layer in enumerate(trans.dec_enc_attn)] + dec_dec_kv = [layer.kv_conv(layer.preprocess(input_layers[i])) + for i, layer in enumerate(trans.dec_attn)] + + new_state = self.DecState(enc_out, enc_attn_mask, attnP, rdo, out_seq, offset, + empty_emb, empty_dec_layers, dec_enc_kv, dec_dec_kv) + + # perform initial decode (instead of force_bos) with zero embeddings + new_state = self.decode(new_state, is_train=is_train) + return new_state + + def decode(self, dec_state, words=None, is_train=False, **kwargs): + """ + Performs decoding step given words and previous state. + Returns next state. + + :param words: previous output tokens, int32[batch_size]. if None, uses zero embeddings (first step) + :param is_train: if True, enables dropouts + """ + trans = self.transformer + enc_out, enc_attn_mask, attnP, rdo, out_seq, offset, prev_emb = dec_state[:7] + prev_dec_layers = dec_state.dec_layers + dec_enc_kv = dec_state.dec_enc_kv + dec_dec_kv = dec_state.dec_dec_kv + + batch_size = tf.shape(rdo)[0] + if words is not None: + out_seq = tf.concat([out_seq, tf.expand_dims(words, 1)], 1) + + with dropout_scope(is_train), tf.name_scope(trans.name): + # Embeddings + if words is None: + # initial step: words are None + emb_out = tf.zeros((batch_size, 1, trans.emb_size)) + else: + emb_out = trans.emb_out(words[:, None]) # [batch_size * 1 * emb_dim] + if trans.rescale_emb: + emb_out *= trans.emb_size ** .5 + + # Prepare decoder + dec_inp_t = trans._add_timing_signal(emb_out, offset=offset) + # Apply dropouts + if is_dropout_enabled(): + dec_inp_t = tf.nn.dropout(dec_inp_t, 1.0 - trans.res_dropout) + + # bypass info from Encoder to avoid None gradients for num_layers_dec == 0 + if trans.num_layers_dec == 0: + inp_mask = tf.squeeze(tf.transpose(enc_attn_mask, perm=[3, 1, 2, 0]), 3) + dec_inp_t += tf.reduce_mean(enc_out * inp_mask, axis=[0, 1], keep_dims=True) + + # Decoder + new_emb = tf.concat([prev_emb, dec_inp_t], axis=1) + _out = tf.pad(out_seq, [(0, 0), (0, 1)]) + dec_attn_mask = trans._make_dec_attn_mask(_out)[:, :, -1:, :] # [1, 1, n_q=1, n_kv] + + new_dec_layers = [] + new_dec_dec_kv = [] + + for layer in range(trans.num_layers_dec): + # multi-head self-attention: use only the newest time-step as query, + # but all time-steps up to newest one as keys/values + next_dec_kv = trans.dec_attn[layer].kv_conv(trans.dec_attn[layer].preprocess(dec_inp_t)) + new_dec_dec_kv.append(tf.concat([dec_dec_kv[layer], next_dec_kv], axis=1)) + dec_inp_t = trans.dec_attn[layer](dec_inp_t, dec_attn_mask, kv=new_dec_dec_kv[layer]) + + dec_inp_t = trans.dec_enc_attn[layer](dec_inp_t, enc_attn_mask, kv=dec_enc_kv[layer]) + dec_inp_t = trans.dec_ffn[layer](dec_inp_t, summarize_preactivations=trans.summarize_preactivations) + + new_dec_inp = tf.concat([prev_dec_layers[layer], dec_inp_t], axis=1) + new_dec_layers.append(new_dec_inp) + + if trans.normalize_out: + dec_inp_t = trans.dec_out_norm(dec_inp_t) + + rdo = dec_inp_t[:, -1] + + new_state = self.DecState(enc_out, enc_attn_mask, attnP, rdo, out_seq, offset + 1, + new_emb, new_dec_layers, dec_enc_kv, new_dec_dec_kv) + return new_state + + def get_rdo(self, dec_state, **kwargs): + return dec_state.rdo, dec_state.out_seq + + def get_attnP(self, dec_state, **kwargs): + return dec_state.attnP + + +class TranslateModelLazy(TranslateModel): + def __init__(self, name, transformer, loss, inp_voc, out_voc): + """ + 
Automatically implements O(n^3) decoding by using trans.decode + """ + self.name = name + self.transformer = transformer + self.loss = loss + self.inp_voc = inp_voc + self.out_voc = out_voc + + def encode(self, batch, is_train=False, **kwargs): + """ + :param batch: a dict of placeholders + inp: [batch_size * ninp] + inp_len; [batch_size] + """ + inp = batch['inp'] + inp_len = batch['inp_len'] + with dropout_scope(is_train), tf.name_scope(self.transformer.name): + # Encode. + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train=False) + + # Decoder dummy input/output + ninp = tf.shape(inp)[1] + batch_size = tf.shape(inp)[0] + hid_size = tf.shape(enc_out)[-1] + out_seq = tf.zeros([batch_size, 0], dtype=inp.dtype) + rdo = tf.zeros([batch_size, hid_size], dtype=enc_out.dtype) + + attnP = tf.ones([batch_size, ninp]) / tf.to_float(inp_len)[:, None] + + return self._decode_impl((enc_out, enc_attn_mask, attnP, out_seq, rdo), **kwargs) + + def decode(self, dec_state, words, **kwargs): + """ + Performs decoding step given words + + words: [batch_size] + """ + with tf.name_scope(self.transformer.name): + (enc_out, enc_attn_mask, attnP, prev_out_seq, rdo) = dec_state + out_seq = tf.concat([prev_out_seq, tf.expand_dims(words, 1)], 1) + return self._decode_impl((enc_out, enc_attn_mask, attnP, out_seq, rdo), **kwargs) + + def _decode_impl(self, dec_state, is_train=False, **kwargs): + (enc_out, enc_attn_mask, attnP, out_seq, rdo) = dec_state + + with dropout_scope(is_train): + out = tf.pad(out_seq, [(0, 0), (0, 1)]) + out_len = tf.fill(dims=(tf.shape(out)[0],), value=tf.shape(out_seq)[1]) + out_reverse = tf.zeros_like(out_len) # batch['out_reverse'] + dec_out = self.transformer.decode(out, out_len, out_reverse, enc_out, enc_attn_mask, is_train=False) + rdo = dec_out[:, -1, :] # [batch_size * hid_dim] + + attnP = enc_attn_mask[:, 0, 0, :] # [batch_size * ninp ] + attnP /= tf.reduce_sum(attnP, axis=1, keep_dims=True) + + return (enc_out, enc_attn_mask, attnP, out_seq, rdo) + + def get_rdo(self, dec_state, **kwargs): + rdo = dec_state[4] + out = dec_state[3] + return rdo, out + + def get_attnP(self, dec_state, **kwargs): + return dec_state[2] + diff --git a/lib/task/seq2seq/models/transformer_head_gates.py b/lib/task/seq2seq/models/transformer_head_gates.py new file mode 100644 index 0000000..1f0f63c --- /dev/null +++ b/lib/task/seq2seq/models/transformer_head_gates.py @@ -0,0 +1,651 @@ +#!/usr/bin/env python3 +from lib.layers import * +from lib.ops import * +from ..models import TranslateModelBase, TranslateModel +from ..data import * +from collections import namedtuple + + +class Transformer: + def __init__( + self, name, + inp_voc, out_voc, + *_args, + emb_size=None, hid_size=512, + key_size=None, value_size=None, + inner_hid_size=None, # DEPRECATED. 
Left for compatibility with older experiments + ff_size=None, + num_heads=8, num_layers=6, + attn_dropout=0.0, attn_value_dropout=0.0, relu_dropout=0.0, res_dropout=0.1, + share_emb=False, inp_emb_bias=False, rescale_emb=False, + dst_reverse=False, dst_rand_offset=False, summarize_preactivations=False, + res_steps='nlda', normalize_out=False, multihead_attn_format='v1', + emb_inp_device='', emb_out_device='', + concrete_heads={}, # any subset of {enc-self, dec-self, dec-enc} + alive_heads={}, # {enc-self: [[1,1,1,0,1,0,0,0], [0,0,0,1,0,1,0,1], ..., [0,1,0,1,0,0,0,0]], + # dec-self: [...], + # dec-enc: [...]} + num_layers_enc=0, + num_layers_dec=0, + **_kwargs + ): + + for attn_type in ['enc-self', 'dec-self', 'dec-enc']: + assert not (attn_type in concrete_heads and attn_type in alive_heads),\ + "'{}' is passed as both with trainable concrete gates heads and fixed gates".format(attn_type) + + if isinstance(ff_size, str): + ff_size = [int(i) for i in ff_size.split(':')] + + if _args: + raise Exception("Unexpected positional arguments") + + emb_size = emb_size if emb_size else hid_size + key_size = key_size if key_size else hid_size + value_size = value_size if value_size else hid_size + if key_size % num_heads != 0: + raise Exception("Bad number of heads") + if value_size % num_heads != 0: + raise Exception("Bad number of heads") + + self.name = name + self.num_layers_enc = num_layers if num_layers_enc == 0 else num_layers_enc + self.num_layers_dec = num_layers if num_layers_dec == 0 else num_layers_dec + self.res_dropout = res_dropout + self.emb_size = emb_size + self.hid_size = hid_size + self.rescale_emb = rescale_emb + self.summarize_preactivations = summarize_preactivations + self.dst_reverse = dst_reverse + self.dst_rand_offset = dst_rand_offset + self.normalize_out = normalize_out + + with tf.variable_scope(name): + max_voc_size = max(inp_voc.size(), out_voc.size()) + + self.emb_inp = Embedding( + 'emb_inp', max_voc_size if share_emb else inp_voc.size(), emb_size, + initializer=tf.random_normal_initializer(0, emb_size ** -.5), + device=emb_inp_device) + + self.emb_out = Embedding( + 'emb_out', max_voc_size if share_emb else out_voc.size(), emb_size, + matrix=self.emb_inp.mat if share_emb else None, + initializer=tf.random_normal_initializer(0, emb_size ** -.5), + device=emb_out_device) + + self.emb_inp_bias = 0 + if inp_emb_bias: + self.emb_inp_bias = get_model_variable('emb_inp_bias', shape=[1, 1, emb_size]) + + def get_layer_params(layer_prefix, layer_idx): + layer_name = '%s-%i' % (layer_prefix, layer_idx) + inp_out_size = emb_size if layer_idx == 0 else hid_size + return layer_name, inp_out_size + + def attn_layer(layer_prefix, layer_idx, **kwargs): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + MultiHeadAttn( + layer_name, + inp_size=inp_out_size, + key_depth=key_size, + value_depth=value_size, + output_depth=hid_size, + num_heads=num_heads, + attn_dropout=attn_dropout, + attn_value_dropout=attn_value_dropout, + **kwargs), + inp_size=inp_out_size, + out_size=inp_out_size, + steps=res_steps, + dropout=res_dropout) + + def attn_layer_concrete_heads(layer_prefix, layer_idx, **kwargs): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + MultiHeadAttnConcrete( + layer_name, + inp_size=inp_out_size, + key_depth=key_size, + value_depth=value_size, + output_depth=hid_size, + num_heads=num_heads, + attn_dropout=attn_dropout, + 
attn_value_dropout=attn_value_dropout, + **kwargs), + inp_size=inp_out_size, + out_size=inp_out_size, + steps=res_steps, + dropout=res_dropout) + + def attn_layer_fixed_alive_heads(layer_prefix, layer_idx, head_gate, **kwargs): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + MultiHeadAttnFixedAliveHeads( + layer_name, + inp_size=inp_out_size, + key_depth=key_size, + value_depth=value_size, + output_depth=hid_size, + num_heads=num_heads, + attn_dropout=attn_dropout, + attn_value_dropout=attn_value_dropout, + head_gate=head_gate, + **kwargs), + inp_size=inp_out_size, + out_size=inp_out_size, + steps=res_steps, + dropout=res_dropout) + + def ffn_layer(layer_prefix, layer_idx, ffn_hid_size): + layer_name, inp_out_size = get_layer_params(layer_prefix, layer_idx) + return ResidualLayerWrapper( + layer_name, + FFN( + layer_name, + inp_size=inp_out_size, + hid_size=ffn_hid_size, + out_size=hid_size, + relu_dropout=relu_dropout), + inp_size=inp_out_size, + out_size=hid_size, + steps=res_steps, + dropout=res_dropout) + + # Encoder/decoder layer params + enc_ffn_hid_size = ff_size if ff_size else (inner_hid_size if inner_hid_size else hid_size) + dec_ffn_hid_size = ff_size if ff_size else hid_size + dec_enc_attn_format = 'use_kv' if multihead_attn_format == 'v1' else 'combined' + + # Encoder Layers + self.enc_attn = [attn_layer_concrete_heads('enc_attn', i) if 'enc-self' in concrete_heads else + attn_layer('enc_attn', i) if not 'enc-self' in alive_heads else + attn_layer_fixed_alive_heads('enc_attn', i, alive_heads['enc-self'][i]) + for i in range(self.num_layers_enc)] + + self.enc_ffn = [ffn_layer('enc_ffn', i, enc_ffn_hid_size) for i in range(self.num_layers_enc)] + + if self.normalize_out: + self.enc_out_norm = LayerNorm('enc_out_norm', + inp_size=emb_size if self.num_layers_enc == 0 else hid_size) + + # Decoder layers + self.dec_attn = [attn_layer_concrete_heads('dec_attn', i) if 'dec-self' in concrete_heads else + attn_layer('dec_attn', i) if not 'dec-self' in alive_heads else + attn_layer_fixed_alive_heads('dec_attn', i, alive_heads['dec-self'][i]) + for i in range(self.num_layers_dec)] + + self.dec_enc_attn = [attn_layer_concrete_heads('dec_enc_attn', i, _format=dec_enc_attn_format) \ + if 'dec-enc' in concrete_heads else \ + attn_layer('dec_enc_attn', i, _format=dec_enc_attn_format) if \ + not 'dec-enc' in alive_heads else \ + attn_layer_fixed_alive_heads('dec_enc_attn', i, alive_heads['dec-enc'][i], _format=dec_enc_attn_format) + for i in range(self.num_layers_enc)] + + self.dec_ffn = [ffn_layer('dec_ffn', i, dec_ffn_hid_size) for i in range(self.num_layers_dec)] + + if self.normalize_out: + self.dec_out_norm = LayerNorm('dec_out_norm', + inp_size=emb_size if self.num_layers_dec == 0 else hid_size) + + def encode(self, inp, inp_len, is_train): + with dropout_scope(is_train), tf.name_scope('mod_enc') as scope: + + # Embeddings + emb_inp = self.emb_inp(inp) # [batch_size * ninp * emb_dim] + if self.rescale_emb: + emb_inp *= self.emb_size ** .5 + emb_inp += self.emb_inp_bias + + # Prepare decoder + enc_attn_mask = self._make_enc_attn_mask(inp, inp_len) # [batch_size * 1 * 1 * ninp] + + enc_inp = self._add_timing_signal(emb_inp) + + # Apply dropouts + if is_dropout_enabled(): + enc_inp = tf.nn.dropout(enc_inp, 1.0 - self.res_dropout) + + # Encoder + for layer in range(self.num_layers_enc): + enc_inp = self.enc_attn[layer](enc_inp, enc_attn_mask) + enc_inp = self.enc_ffn[layer](enc_inp, 
summarize_preactivations=self.summarize_preactivations) + + if self.normalize_out: + enc_inp = self.enc_out_norm(enc_inp) + + tf.add_to_collection(lib.meta.ACTIVATIONS, tf.identity(enc_inp, name=scope)) + + return enc_inp, enc_attn_mask + + def decode(self, out, out_len, out_reverse, enc_out, enc_attn_mask, is_train): + with dropout_scope(is_train), tf.name_scope('mod_dec') as scope: + # Embeddings + emb_out = self.emb_out(out) # [batch_size * nout * emb_dim] + if self.rescale_emb: + emb_out *= self.emb_size ** .5 + + # Shift right; drop embedding for last word + emb_out = tf.pad(emb_out, [[0, 0], [1, 0], [0, 0]])[:, :-1, :] + + # Prepare decoder + dec_attn_mask = self._make_dec_attn_mask(out) # [1 * 1 * nout * nout] + + offset = 'random' if self.dst_rand_offset else 0 + dec_inp = self._add_timing_signal(emb_out, offset=offset, inp_reverse=out_reverse) + # Apply dropouts + if is_dropout_enabled(): + dec_inp = tf.nn.dropout(dec_inp, 1.0 - self.res_dropout) + + # bypass info from Encoder to avoid None gradients for num_layers_dec == 0 + if self.num_layers_dec == 0: + inp_mask = tf.squeeze(tf.transpose(enc_attn_mask, perm=[3, 1, 2, 0]), 3) + dec_inp += tf.reduce_mean(enc_out * inp_mask, axis=[0, 1], keep_dims=True) + + # Decoder + for layer in range(self.num_layers_dec): + dec_inp = self.dec_attn[layer](dec_inp, dec_attn_mask) + dec_inp = self.dec_enc_attn[layer](dec_inp, enc_attn_mask, enc_out) + dec_inp = self.dec_ffn[layer](dec_inp, summarize_preactivations=self.summarize_preactivations) + + if self.normalize_out: + dec_inp = self.dec_out_norm(dec_inp) + + tf.add_to_collection(lib.meta.ACTIVATIONS, tf.identity(dec_inp, name=scope)) + + return dec_inp + + def _make_enc_attn_mask(self, inp, inp_len, dtype=tf.float32): + """ + inp = [batch_size * ninp] + inp_len = [batch_size] + + attn_mask = [batch_size * 1 * 1 * ninp] + """ + with tf.variable_scope("make_enc_attn_mask"): + inp_mask = tf.sequence_mask(inp_len, dtype=dtype, maxlen=tf.shape(inp)[1]) + + attn_mask = inp_mask[:, None, None, :] + return attn_mask + + def _make_dec_attn_mask(self, out, dtype=tf.float32): + """ + out = [baatch_size * nout] + + attn_mask = [1 * 1 * nout * nout] + """ + with tf.variable_scope("make_dec_attn_mask"): + length = tf.shape(out)[1] + lower_triangle = tf.matrix_band_part(tf.ones([length, length], dtype=dtype), -1, 0) + attn_mask = tf.reshape(lower_triangle, [1, 1, length, length]) + return attn_mask + + def _add_timing_signal(self, inp, min_timescale=1.0, max_timescale=1.0e4, offset=0, inp_reverse=None): + """ + inp: (batch_size * ninp * hid_dim) + :param offset: add this number to all character positions. + if offset == 'random', picks this number uniformly from [-32000,32000] integers + :type offset: number, tf.Tensor or 'random' + """ + with tf.variable_scope("add_timing_signal"): + ninp = tf.shape(inp)[1] + hid_size = tf.shape(inp)[2] + + position = tf.to_float(tf.range(ninp))[None, :, None] + + if offset == 'random': + BIG_LEN = 32000 + offset = tf.random_uniform(tf.shape(position), minval=-BIG_LEN, maxval=BIG_LEN, dtype=tf.int32) + + # force broadcasting over batch axis + if isinstance(offset * 1, tf.Tensor): # multiply by 1 to also select variables, special generators, etc. 
+ assert offset.shape.ndims in (0, 1, 2) + new_shape = [tf.shape(offset)[i] for i in range(offset.shape.ndims)] + new_shape += [1] * (3 - len(new_shape)) + offset = tf.reshape(offset, new_shape) + + position += tf.to_float(offset) + + if inp_reverse is not None: + position = tf.multiply( + position, + tf.where( + tf.equal(inp_reverse, 0), + tf.ones_like(inp_reverse, dtype=tf.float32), + -1.0 * tf.ones_like(inp_reverse, dtype=tf.float32) + )[:, None, None] # (batch_size * ninp * dim) + ) + num_timescales = hid_size // 2 + log_timescale_increment = ( + math.log(float(max_timescale) / float(min_timescale)) / + (tf.to_float(num_timescales) - 1)) + inv_timescales = min_timescale * tf.exp( + tf.to_float(tf.range(num_timescales)) * -log_timescale_increment) + + # scaled_time: [ninp * hid_dim] + scaled_time = position * inv_timescales[None, None, :] + signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=-1) + signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(hid_size, 2)]]) + return inp + signal + + +# ============================================================================ +# Transformer model + +class Model(TranslateModelBase): + + def __init__(self, name, inp_voc, out_voc, **hp): + self.name = name + self.inp_voc = inp_voc + self.out_voc = out_voc + self.hp = hp + + # Parameters + self.transformer = Transformer(name, inp_voc, out_voc, **hp) + + projection_matrix = None + if hp.get('dwwt', False): + projection_matrix = tf.transpose(self.transformer.emb_out.mat) + + self.loss = LossXent( + hp.get('loss_name', 'loss_xent_lm'), + hp['hid_size'], + out_voc, + hp, + matrix=projection_matrix, + bias=None if hp.get("loss_bias", False) else 0) + + inference_mode = hp.get("inference_mode", "fast") + if inference_mode == 'fast': + self.translate_model = TranslateModelFast(self.name, self.transformer, self.loss, self.inp_voc, + self.out_voc) + elif inference_mode == 'lazy': + self.translate_model = TranslateModelLazy(self.name, self.transformer, self.loss, self.inp_voc, + self.out_voc) + else: + raise NotImplementedError("inference_mode %s is not supported" % inference_mode) + + # Train interface + def encode_decode(self, batch, is_train, score_info=False): + inp = batch['inp'] # [batch_size * ninp] + out = batch['out'] # [batch_size * nout] + inp_len = batch.get('inp_len', infer_length(inp, self.inp_voc.eos, time_major=False)) # [batch] + out_len = batch.get('out_len', infer_length(out, self.out_voc.eos, time_major=False)) # [batch] + + out_reverse = tf.zeros_like(inp_len) # batch['out_reverse'] + + # rdo: [batch_size * nout * hid_dim] + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train) + rdo = self.transformer.decode(out, out_len, out_reverse, enc_out, enc_attn_mask, is_train) + + return rdo + + def make_feed_dict(self, batch, **kwargs): + feed_dict = make_batch_data(batch, self.inp_voc, self.out_voc, + force_bos=self.hp.get('force_bos', False), + **kwargs) + return feed_dict + + + + # ======== TranslateModel for Inference ============ + def encode(self, batch, **flags): + """ + :param batch: a dict of {string:symbolic tensor} that model understands. + By default it should accept {'inp': int32 matrix[batch,time]} + :return: initial decoder state + """ + return self.translate_model.encode(batch, **flags) + + def decode(self, dec_state, words=None, **flags): + """ + Performs decoding step given words and previous state. + :param words: previous output tokens, int32[batch_size]. 
if None, uses zero embeddings (first step) + :returns: next state + """ + return self.translate_model.decode(dec_state, words, **flags) + + def sample(self, dec_state, base_scores, slices, k, **kwargs): + return self.translate_model.sample(dec_state, base_scores, slices, k, **kwargs) + + def get_rdo(self, dec_state, **kwargs): + return self.translate_model.get_rdo(dec_state, **kwargs) + + def get_attnP(self, dec_state, **kwargs): + return self.translate_model.get_attnP(dec_state, **kwargs) + + +class ScopedModel(Model): + + def __init__(self, name, inp_voc, out_voc, **hp): + with tf.variable_scope(name): + super(ScopedModel, self).__init__(name, inp_voc, out_voc, **hp) + + def encode_decode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).encode_decode(*args, **kwargs) + + def encode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).encode(*args, **kwargs) + + def decode(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).decode(*args, **kwargs) + + def sample(self, *args, **kwargs): + with tf.name_scope(self.name): + return super(ScopedModel, self).sample(*args, **kwargs) + + +# ============================================================================ +# Transformer inference + +class TranslateModelFast(TranslateModel): + DecState = namedtuple("transformer_state", ['enc_out', 'enc_attn_mask', 'attnP', 'rdo', 'out_seq', 'offset', + 'emb', 'dec_layers', 'dec_enc_kv', 'dec_dec_kv']) + + def __init__(self, name, transformer, loss, inp_voc, out_voc): + """ + A translation model that performs quick (n^2) inference for transformer + with manual implementation of 1-step decoding + """ + self.name = name + self.transformer = transformer + self.loss = loss + self.inp_voc = inp_voc + self.out_voc = out_voc + + def encode(self, batch, is_train=False, **kwargs): + """ + :param batch: a dict containing 'inp':int32[batch_size * ninp] and optionally inp_len:int32[batch_size] + :param is_train: if True, enables dropouts + """ + inp = batch['inp'] + inp_len = batch.get('inp_len', infer_length(inp, self.inp_voc.eos, time_major=False)) + with dropout_scope(is_train), tf.name_scope(self.transformer.name): + # Encode. + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train=False) + + # Decoder dummy input/output + ninp = tf.shape(inp)[1] + batch_size = tf.shape(inp)[0] + hid_size = tf.shape(enc_out)[-1] + out_seq = tf.zeros([batch_size, 0], dtype=inp.dtype) + rdo = tf.zeros([batch_size, hid_size], dtype=enc_out.dtype) + + attnP = tf.ones([batch_size, ninp]) / tf.to_float(inp_len)[:, None] + + offset = tf.zeros((batch_size,)) + if self.transformer.dst_rand_offset: + BIG_LEN = 32000 + random_offset = tf.random_uniform(tf.shape(offset), minval=-BIG_LEN, maxval=BIG_LEN, dtype=tf.int32) + offset += tf.to_float(random_offset) + + trans = self.transformer + empty_emb = tf.zeros([batch_size, 0, trans.emb_size]) + empty_dec_layers = [tf.zeros([batch_size, 0, trans.hid_size])] * trans.num_layers_dec + input_layers = [empty_emb] + empty_dec_layers[:-1] + + # prepare kv parts for all decoder attention layers. 
Note: we do not preprocess enc_out + # for each layer because ResidualLayerWrapper only preprocesses first input (query) + dec_enc_kv = [layer.kv_conv(enc_out) + for i, layer in enumerate(trans.dec_enc_attn)] + dec_dec_kv = [layer.kv_conv(layer.preprocess(input_layers[i])) + for i, layer in enumerate(trans.dec_attn)] + + new_state = self.DecState(enc_out, enc_attn_mask, attnP, rdo, out_seq, offset, + empty_emb, empty_dec_layers, dec_enc_kv, dec_dec_kv) + + # perform initial decode (instead of force_bos) with zero embeddings + new_state = self.decode(new_state, is_train=is_train) + return new_state + + def decode(self, dec_state, words=None, is_train=False, **kwargs): + """ + Performs decoding step given words and previous state. + Returns next state. + + :param words: previous output tokens, int32[batch_size]. if None, uses zero embeddings (first step) + :param is_train: if True, enables dropouts + """ + trans = self.transformer + enc_out, enc_attn_mask, attnP, rdo, out_seq, offset, prev_emb = dec_state[:7] + prev_dec_layers = dec_state.dec_layers + dec_enc_kv = dec_state.dec_enc_kv + dec_dec_kv = dec_state.dec_dec_kv + + batch_size = tf.shape(rdo)[0] + if words is not None: + out_seq = tf.concat([out_seq, tf.expand_dims(words, 1)], 1) + + with dropout_scope(is_train), tf.name_scope(trans.name): + # Embeddings + if words is None: + # initial step: words are None + emb_out = tf.zeros((batch_size, 1, trans.emb_size)) + else: + emb_out = trans.emb_out(words[:, None]) # [batch_size * 1 * emb_dim] + if trans.rescale_emb: + emb_out *= trans.emb_size ** .5 + + # Prepare decoder + dec_inp_t = trans._add_timing_signal(emb_out, offset=offset) + # Apply dropouts + if is_dropout_enabled(): + dec_inp_t = tf.nn.dropout(dec_inp_t, 1.0 - trans.res_dropout) + + # bypass info from Encoder to avoid None gradients for num_layers_dec == 0 + if trans.num_layers_dec == 0: + inp_mask = tf.squeeze(tf.transpose(enc_attn_mask, perm=[3, 1, 2, 0]), 3) + dec_inp_t += tf.reduce_mean(enc_out * inp_mask, axis=[0, 1], keep_dims=True) + + # Decoder + new_emb = tf.concat([prev_emb, dec_inp_t], axis=1) + _out = tf.pad(out_seq, [(0, 0), (0, 1)]) + dec_attn_mask = trans._make_dec_attn_mask(_out)[:, :, -1:, :] # [1, 1, n_q=1, n_kv] + + new_dec_layers = [] + new_dec_dec_kv = [] + + for layer in range(trans.num_layers_dec): + # multi-head self-attention: use only the newest time-step as query, + # but all time-steps up to newest one as keys/values + next_dec_kv = trans.dec_attn[layer].kv_conv(trans.dec_attn[layer].preprocess(dec_inp_t)) + new_dec_dec_kv.append(tf.concat([dec_dec_kv[layer], next_dec_kv], axis=1)) + dec_inp_t = trans.dec_attn[layer](dec_inp_t, dec_attn_mask, kv=new_dec_dec_kv[layer]) + + dec_inp_t = trans.dec_enc_attn[layer](dec_inp_t, enc_attn_mask, kv=dec_enc_kv[layer]) + dec_inp_t = trans.dec_ffn[layer](dec_inp_t, summarize_preactivations=trans.summarize_preactivations) + + new_dec_inp = tf.concat([prev_dec_layers[layer], dec_inp_t], axis=1) + new_dec_layers.append(new_dec_inp) + + if trans.normalize_out: + dec_inp_t = trans.dec_out_norm(dec_inp_t) + + rdo = dec_inp_t[:, -1] + + new_state = self.DecState(enc_out, enc_attn_mask, attnP, rdo, out_seq, offset + 1, + new_emb, new_dec_layers, dec_enc_kv, new_dec_dec_kv) + return new_state + + def get_rdo(self, dec_state, **kwargs): + return dec_state.rdo, dec_state.out_seq + + def get_attnP(self, dec_state, **kwargs): + return dec_state.attnP + + +class TranslateModelLazy(TranslateModel): + def __init__(self, name, transformer, loss, inp_voc, out_voc): + """ + 
Automatically implements O(n^3) decoding by using trans.decode + """ + self.name = name + self.transformer = transformer + self.loss = loss + self.inp_voc = inp_voc + self.out_voc = out_voc + + def encode(self, batch, is_train=False, **kwargs): + """ + :param batch: a dict of placeholders + inp: [batch_size * ninp] + inp_len; [batch_size] + """ + inp = batch['inp'] + inp_len = batch['inp_len'] + with dropout_scope(is_train), tf.name_scope(self.transformer.name): + # Encode. + enc_out, enc_attn_mask = self.transformer.encode(inp, inp_len, is_train=False) + + # Decoder dummy input/output + ninp = tf.shape(inp)[1] + batch_size = tf.shape(inp)[0] + hid_size = tf.shape(enc_out)[-1] + out_seq = tf.zeros([batch_size, 0], dtype=inp.dtype) + rdo = tf.zeros([batch_size, hid_size], dtype=enc_out.dtype) + + attnP = tf.ones([batch_size, ninp]) / tf.to_float(inp_len)[:, None] + + return self._decode_impl((enc_out, enc_attn_mask, attnP, out_seq, rdo), **kwargs) + + def decode(self, dec_state, words, **kwargs): + """ + Performs decoding step given words + + words: [batch_size] + """ + with tf.name_scope(self.transformer.name): + (enc_out, enc_attn_mask, attnP, prev_out_seq, rdo) = dec_state + out_seq = tf.concat([prev_out_seq, tf.expand_dims(words, 1)], 1) + return self._decode_impl((enc_out, enc_attn_mask, attnP, out_seq, rdo), **kwargs) + + def _decode_impl(self, dec_state, is_train=False, **kwargs): + (enc_out, enc_attn_mask, attnP, out_seq, rdo) = dec_state + + with dropout_scope(is_train): + out = tf.pad(out_seq, [(0, 0), (0, 1)]) + out_len = tf.fill(dims=(tf.shape(out)[0],), value=tf.shape(out_seq)[1]) + out_reverse = tf.zeros_like(out_len) # batch['out_reverse'] + dec_out = self.transformer.decode(out, out_len, out_reverse, enc_out, enc_attn_mask, is_train=False) + rdo = dec_out[:, -1, :] # [batch_size * hid_dim] + + attnP = enc_attn_mask[:, 0, 0, :] # [batch_size * ninp ] + attnP /= tf.reduce_sum(attnP, axis=1, keep_dims=True) + + return (enc_out, enc_attn_mask, attnP, out_seq, rdo) + + def get_rdo(self, dec_state, **kwargs): + rdo = dec_state[4] + out = dec_state[3] + return rdo, out + + def get_attnP(self, dec_state, **kwargs): + return dec_state[2] + diff --git a/lib/task/seq2seq/problems/__init__.py b/lib/task/seq2seq/problems/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/lib/task/seq2seq/problems/concrete.py b/lib/task/seq2seq/problems/concrete.py new file mode 100644 index 0000000..4298799 --- /dev/null +++ b/lib/task/seq2seq/problems/concrete.py @@ -0,0 +1,131 @@ + +from ..summary import * +from lib.layers.basic import * +from lib.train.problem import Problem +from lib.task.seq2seq.problems.default import word_dropout + + +class ConcreteProblem(Problem): + def __init__(self, models, dump_dir=None, dump_first_n=None, sum_loss=False, use_small_batch_multiplier=False, + inp_word_dropout=0, out_word_dropout=0, word_dropout_method='unk', concrete_coef=1., + ): + assert len(models) == 1 + + self.models = models + self.model = list(self.models.values())[0] + + self.inp_voc = self.model.inp_voc + self.out_voc = self.model.out_voc + + self.dump_dir = dump_dir + self.dump_first_n = dump_first_n + self.sum_loss = sum_loss + self.use_small_batch_multiplier = use_small_batch_multiplier + + self.inp_word_dropout = inp_word_dropout + self.out_word_dropout = out_word_dropout + self.word_dropout_method = word_dropout_method + + # ======================== for concrete gates ========================================= + self.concrete_coef = concrete_coef + # 
======================================================================================== + + if self.use_small_batch_multiplier: + self.max_batch_size_var = tf.get_variable("max_batch_size", shape=[], initializer=tf.ones_initializer(), + trainable=False) + + def _make_encdec_batch(self, batch, is_train): + encdec_batch = copy(batch) + + if is_train and self.inp_word_dropout > 0: + encdec_batch['inp'] = word_dropout(encdec_batch['inp'], encdec_batch['inp_len'], self.inp_word_dropout, + self.word_dropout_method, self.model.inp_voc) + + if is_train and self.out_word_dropout > 0: + encdec_batch['out'] = word_dropout(encdec_batch['out'], encdec_batch['out_len'], self.out_word_dropout, + self.word_dropout_method, self.model.out_voc) + + return encdec_batch + + def batch_counters(self, batch, is_train): + if hasattr(self.model, 'batch_counters'): + return self.model.batch_counters(batch, is_train) + + # ======================== for concrete gates ========================================= + tf.get_default_graph().clear_collection("CONCRETE") + tf.get_default_graph().clear_collection(tf.GraphKeys.REGULARIZATION_LOSSES) + + rdo = self.model.encode_decode(self._make_encdec_batch(batch, is_train), is_train) + + sparsity_rate = tf.reduce_mean(tf.get_collection("CONCRETE")) + concrete_reg = tf.reduce_mean(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)) + # ======================================================================================== + + with lib.layers.basic.dropout_scope(is_train): + logits = self.model.loss.rdo_to_logits(rdo, batch['out'], + batch['out_len']) # [batch_size * nout * ovoc_size] + loss_values = self.model.loss.logits2loss(logits, batch['out'], batch['out_len']) + # loss_values /= math.log(2.0) # TODO: move to loss or to model + + if self.dump_dir: + dump_map = batch + + loss_values = tf_dump( + loss_values, + dump_map, + self.dump_dir + '/batch_dump_{}.npz', + first_n=self.dump_first_n) + + counters = dict( + loss=tf.reduce_sum(loss_values), + out_len=tf.to_float(tf.reduce_sum(batch['out_len'])), + # ======================== for concrete gates ========================================= + sparsity_rate=sparsity_rate, + concrete_reg=concrete_reg, + # ======================================================================================== + ) + append_counters_common_metrics(counters, logits, batch['out'], batch['out_len'], is_train) + append_counters_xent(counters, loss_values, batch['out_len']) + append_counters_io(counters, batch['inp'], batch['out'], batch['inp_len'], batch['out_len']) + return counters + + def loss_multibatch(self, counters, is_train): + if self.sum_loss: + value = tf.reduce_sum(counters['loss']) + else: + value = tf.reduce_sum(counters['loss']) / tf.reduce_sum(counters['out_len']) + + if self.use_small_batch_multiplier and is_train: + batch_size = tf.reduce_sum(counters['out_len']) + max_batch_size = tf.maximum(self.max_batch_size_var, batch_size) + with tf.control_dependencies([tf.assign(self.max_batch_size_var, max_batch_size)]): + small_batch_multiplier = batch_size / max_batch_size + value = value * small_batch_multiplier + + # ======================== for concrete gates ========================================= + value += self.concrete_coef * tf.reduce_mean(counters['concrete_reg']) + # ======================================================================================== + + return value + + def summary_multibatch(self, counters, prefix, is_train): + res = [] + # ======================== for concrete gates 
========================================= + res.append(tf.summary.scalar(prefix + "/concrete_reg", tf.reduce_mean(counters['concrete_reg']))) + res.append(tf.summary.scalar(prefix + "/sparsity_rate", tf.reduce_mean(counters['sparsity_rate']))) + # ======================================================================================== + + res += summarize_common_metrics(counters, prefix) + res += summarize_xent(counters, prefix) + res += summarize_io(counters, prefix) + return res + + def params_summary(self): + if hasattr(self.model, 'params_summary'): + return self.model.params_summary() + return [] + + def make_feed_dict(self, batch, **kwargs): + return self.model.make_feed_dict(batch, **kwargs) + + diff --git a/lib/task/seq2seq/problems/default.py b/lib/task/seq2seq/problems/default.py new file mode 100644 index 0000000..04ad923 --- /dev/null +++ b/lib/task/seq2seq/problems/default.py @@ -0,0 +1,122 @@ +from ..summary import * +from lib.layers.basic import * +from lib.train.problem import Problem + + +def word_dropout(inp, inp_len, dropout, method, voc): + inp_shape = tf.shape(inp) + + border = tf.fill([inp_shape[0], 1], False) + + mask = tf.sequence_mask(inp_len - 2, inp_shape[1] - 2) + mask = tf.concat((border, mask, border), axis=1) + mask = tf.logical_and(mask, tf.random_uniform(inp_shape) < dropout) + + if method == 'unk': + replacement = tf.fill(inp_shape, tf.cast(voc._unk, inp.dtype)) + elif method == 'random_word': + replacement = tf.random_uniform(inp_shape, minval=max(voc.bos, voc.eos, voc._unk)+1, maxval=voc.size(), dtype=inp.dtype) + else: + raise ValueError("Unknown word dropout method: %r" % method) + + return tf.where(mask, replacement, inp) + + +class DefaultProblem(Problem): + + def __init__(self, models, dump_dir=None, dump_first_n=None, sum_loss=False, use_small_batch_multiplier=False, + inp_word_dropout=0, out_word_dropout=0, word_dropout_method='unk', + ): + assert len(models) == 1 + + self.models = models + self.model = list(self.models.values())[0] + + self.inp_voc = self.model.inp_voc + self.out_voc = self.model.out_voc + + self.dump_dir = dump_dir + self.dump_first_n = dump_first_n + self.sum_loss = sum_loss + self.use_small_batch_multiplier = use_small_batch_multiplier + + self.inp_word_dropout = inp_word_dropout + self.out_word_dropout = out_word_dropout + self.word_dropout_method = word_dropout_method + + if self.use_small_batch_multiplier: + self.max_batch_size_var = tf.get_variable("max_batch_size", shape=[], initializer=tf.ones_initializer(), trainable=False) + + def _make_encdec_batch(self, batch, is_train): + encdec_batch = copy(batch) + + if is_train and self.inp_word_dropout > 0: + encdec_batch['inp'] = word_dropout(encdec_batch['inp'], encdec_batch['inp_len'], self.inp_word_dropout, self.word_dropout_method, self.model.inp_voc) + + if is_train and self.out_word_dropout > 0: + encdec_batch['out'] = word_dropout(encdec_batch['out'], encdec_batch['out_len'], self.out_word_dropout, self.word_dropout_method, self.model.out_voc) + + return encdec_batch + + def batch_counters(self, batch, is_train): + if hasattr(self.model, 'batch_counters'): + return self.model.batch_counters(batch, is_train) + + rdo = self.model.encode_decode(self._make_encdec_batch(batch, is_train), is_train) + + with dropout_scope(is_train): + logits = self.model.loss.rdo_to_logits(rdo, batch['out'], batch['out_len']) # [batch_size * nout * ovoc_size] + loss_values = self.model.loss.logits2loss(logits, batch['out'], batch['out_len']) + + counters = dict( + 
loss=tf.reduce_sum(loss_values), + out_len=tf.to_float(tf.reduce_sum(batch['out_len'])), + ) + append_counters_common_metrics(counters, logits, batch['out'], batch['out_len'], is_train) + append_counters_xent(counters, loss_values, batch['out_len']) + append_counters_io(counters, batch['inp'], batch['out'], batch['inp_len'], batch['out_len']) + return counters + + def get_xent(self, batch, is_train): + if hasattr(self.model, 'batch_counters'): + return self.model.batch_counters(batch, is_train) + + rdo = self.model.encode_decode(self._make_encdec_batch(batch, is_train), is_train) + + with dropout_scope(is_train): + logits = self.model.loss.rdo_to_logits(rdo, batch['out'], + batch['out_len']) # [batch_size * nout * ovoc_size] + loss_values = self.model.loss.logits2loss(logits, batch['out'], batch['out_len']) + + return loss_values + + def loss_multibatch(self, counters, is_train): + if self.sum_loss: + value = tf.reduce_sum(counters['loss']) + else: + value = tf.reduce_sum(counters['loss']) / tf.reduce_sum(counters['out_len']) + + if self.use_small_batch_multiplier and is_train: + batch_size = tf.reduce_sum(counters['out_len']) + max_batch_size = tf.maximum(self.max_batch_size_var, batch_size) + with tf.control_dependencies([tf.assign(self.max_batch_size_var, max_batch_size)]): + small_batch_multiplier = batch_size / max_batch_size + value = value * small_batch_multiplier + + return value + + def summary_multibatch(self, counters, prefix, is_train): + res = [] + res += summarize_common_metrics(counters, prefix) + res += summarize_xent(counters, prefix) + res += summarize_io(counters, prefix) + return res + + def params_summary(self): + if hasattr(self.model, 'params_summary'): + return self.model.params_summary() + + return [] + + def make_feed_dict(self, batch, **kwargs): + return self.model.make_feed_dict(batch, **kwargs) diff --git a/lib/task/seq2seq/strutils.py b/lib/task/seq2seq/strutils.py new file mode 100644 index 0000000..f9c5e81 --- /dev/null +++ b/lib/task/seq2seq/strutils.py @@ -0,0 +1,134 @@ +# coding: utf-8 + +from codecs import iterdecode +import re +import sys +import unicodedata + + +def normalize_table_lang(text, lang=None): + """According to normalization done in framework""" + if lang == 'ru': + # replace capital and small letters IO -> IE + return text.replace(u'\u0401', u'\u0415').replace(u'\u0451', u'\u0435') + elif lang == 'ro': + # replace capital and small letters S and T with cedilla -> comma below + return text.replace(u'\u015F', u'\u0219').replace(u'\u015E', + u'\u0218').replace(u'\u0163', u'\u021b').replace(u'\u0162', u'\u021a') + elif lang == 'tr': + # replace capital and small letters with circumflex + return text.replace(u'\u00C2', u'\u0041').replace(u'\u00E2', + u'\u0061').replace(u'\u00CE', u'\u0049').replace(u'\u00EE', + u'\u0069').replace(u'\u00DB', u'\u0055').replace(u'\u00FB', u'\u0075') + else: + return text + + +def unicode_category_tokenize(text, lang=None): + import regex + re_for_split = regex.compile( + u'(?u)[\p{Punctuation}\p{Separator}\p{Other}\p{Sm}\p{So}\p{Sc}]+') + return u' '.join(tok for tok in re_for_split.split(text) if tok) + + +def chinese_tok(text, lang=None): + """ставит между всеми символами пробелы""" + # '''from meteor_ext import make_tmp_file, get_random_filename + # import os + # + # tmpfile = make_tmp_file(pre=get_random_filename()) + # tmpfile.write(text.encode('utf-8')) + # args = ['/place/framework/metrics/stanford-segmenter/segment.sh', 'ctb', tmpfile.name, encoding, '0'] + # p = Popen(args, stdout=subprocess.PIPE, 
stderr=subprocess.PIPE) + # out_data, err_data = p.communicate() + # tmpfile.close() + # os.unlink(tmpfile.name) + # return out_data''' + # from itertools import cycle + return ' '.join(text) + +def split_by_char_tok(text, lang=None): + return ' '.join(text) + +def tokenize(text): + return text.split() + +def lower(text, lang=None): + return text.lower() + +def upper(text, lang=None): + return text.upper() + +def foldcase(text, lang=None): + """приводит текст к одному регистру (верхнему)""" + # folds case according to language + # TODO: set locale by lang so that some letters are folded correctly + # (e.g. turkish i without dot) + return upper(text) + +def join_tokens(text): + return u' '.join(tokenize(text)) + + +def separate_punctuation(text, lang=None): + """отделяет пунктуацию и символы (Po/So/Ps/Pe/Sc/-) двумя пробелами, после ' ставит один пробел""" + new_chars = [] + for character in text: + if character == u"'": + new_chars.append(character + u' ') + elif unicodedata.category(character) in ('Po', 'So', 'Ps', 'Pe', 'Sc')\ + or character == u"-": + new_chars.append(u' ' + character + u' ') + else: + new_chars.append(character) + return "".join(new_chars) + + +def alphanum(text, lang=None): + """заменяет все не-alphanumeric (\W, Unicode) на пробел""" + #TODO: do not remove currency signs + non_alphanum = re.compile(u'\W', re.UNICODE) + text = non_alphanum.sub(' ', text) + return text + + +def func_chain(*funcs): + """Returns a function that chains parameter functions""" + def result_func(text, lang=None): + result = text + for func in funcs: + result = func(result, lang) + return result + return result_func + + +def normalize_space(u_text, lang=None): + """стирает пробельные символы, заменяя их на один пробел""" + return ' '.join(u_text.split()) + +def xlines(fileobj, encoding='utf_8_sig', keepends=False): + for line in iterdecode(fileobj, encoding): + if not keepends: + line = line.rstrip('\r\n') + yield line + +# only alphanumeric characters are kept +al_num = func_chain(alphanum, normalize_space) +# only alphanumeric characters are kept, the rest is case-folded +al_num__foldcase = func_chain(foldcase, alphanum, normalize_space) +all_chars__as_is = func_chain() +# all characters are folded in case +all_chars__foldcase = func_chain(foldcase, normalize_space) +# punctuation becomes separate tokens +all_chars__punct_tokens = func_chain(separate_punctuation, normalize_space) +# punctuation becomes separate tokens, all characters are folded in case +all_chars__punct_tokens__foldcase = func_chain(foldcase, separate_punctuation, normalize_space) +# as is in eval framework +equal_to_framework = func_chain(normalize_table_lang, foldcase, unicode_category_tokenize) + +if __name__ == '__main__': + funcs = {'-s': al_num__foldcase, '-p': all_chars__punct_tokens__foldcase, + '-cs': al_num, '-csp': all_chars__punct_tokens} + a = sys.argv[1] + for line in xlines(sys.stdin): + print(u''.join(map(funcs[a], line.split('\t')))) \ No newline at end of file diff --git a/lib/task/seq2seq/summary.py b/lib/task/seq2seq/summary.py new file mode 100644 index 0000000..fa31885 --- /dev/null +++ b/lib/task/seq2seq/summary.py @@ -0,0 +1,159 @@ +import tensorflow as tf +from ...ops.basic import select_values_over_last_axis + + +def append_counters_accuracy(counters, logits, out, out_len): + with tf.variable_scope("summary_accuracy"): + predictions = tf.argmax(logits, axis=2) + acc_values = predictions2accuracy(predictions, out, out_len) + acc_top5_values = logits2accuracy_top_k(logits, out, out_len, k=5) + 
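+        # each acc_* tensor is a 0/1 indicator masked by out_len (per token, per-token top-5,
+        # and per whole sequence); the sums collected below are normalized in summarize_accuracy()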
acc_per_seq_values = predictions2accuracy_per_sequence(predictions, out, out_len) + + node = dict( + accuracy=tf.reduce_sum(acc_values), + accuracy_top5=tf.reduce_sum(acc_top5_values), + accuracy_per_sequence=tf.reduce_sum(acc_per_seq_values), + out_len=tf.to_float(tf.reduce_sum(out_len)), + seqs=tf.to_float(tf.shape(out_len)[0]), + ) + + _append_counters(counters, "summarize_accuracy", node) + + +def append_counters_common_metrics(counters, logits, out, out_len, is_train): + append_counters_accuracy(counters, logits, out, out_len) + + +def append_counters_xent(counters, xent_values, out_len): + with tf.variable_scope("summary_xent"): + node = dict( + xent=tf.reduce_sum(xent_values), + out_len=tf.to_float(tf.reduce_sum(out_len)), + ) + _append_counters(counters, "summarize_xent", node) + + +def append_counters_io(counters, inp, out, inp_len, out_len): + with tf.variable_scope("summary_io"): + node = dict( + batch_size=tf.to_float(tf.shape(inp))[0], + inp_len=tf.to_float(tf.reduce_sum(inp_len)), + out_len=tf.to_float(tf.reduce_sum(out_len)), + ninp=tf.to_float(tf.shape(inp)[1]), + nout=tf.to_float(tf.shape(out)[1]), + ) + _append_counters(counters, "summarize_io", node) + + +def summarize_accuracy(counters, prefix): + node = counters['summarize_accuracy'] + summaries = [ + tf.summary.scalar("%s_metrics/Acc" % prefix, tf.reduce_sum(node['accuracy']) / tf.reduce_sum(node['out_len'])), + tf.summary.scalar("%s_metrics/AccTop5" % prefix, tf.reduce_sum(node['accuracy_top5']) / tf.reduce_sum(node['out_len'])), + tf.summary.scalar("%s_metrics/AccPerSeq" % prefix, tf.reduce_sum(node['accuracy_per_sequence']) / tf.reduce_sum(node['seqs'])), + ] + return summaries + + +def summarize_common_metrics(counters, prefix): + return summarize_accuracy(counters, prefix) + + +def summarize_xent(counters, prefix): + node = counters['summarize_xent'] + return [ + tf.summary.scalar("%s_metrics/Xent" % prefix, tf.reduce_sum(node['xent']) / tf.reduce_sum(node['out_len'])), + ] + + +def summarize_io(counters, prefix): + node = counters['summarize_io'] + return [ + tf.summary.scalar("%s_IO/BatchSize" % prefix, tf.reduce_sum(node['batch_size'])), + tf.summary.scalar("%s_IO/InpLenAvg" % prefix, tf.reduce_sum(node['inp_len']) / tf.reduce_sum(node['batch_size'])), + tf.summary.scalar("%s_IO/OutLenAvg" % prefix, tf.reduce_sum(node['out_len']) / tf.reduce_sum(node['batch_size'])), + tf.summary.scalar("%s_IO/InpLenSum" % prefix, tf.reduce_sum(node['inp_len'])), + tf.summary.scalar("%s_IO/OutLenSum" % prefix, tf.reduce_sum(node['out_len'])), + + tf.summary.scalar( + "%s_IO/InpNoPadRate" % prefix, + tf.reduce_sum(node['inp_len']) / tf.reduce_sum(node['ninp'] * node['batch_size'])), + tf.summary.scalar( + "%s_IO/OutNoPadRate" % prefix, + tf.reduce_sum(node['out_len']) / tf.reduce_sum(node['nout'] * node['batch_size'])), + ] + + +def _append_counters(counters, key, value): + if isinstance(counters, dict): + if key in counters: + raise Exception('Duplicate key "{}" in counters'.format(key)) + counters[key] = value + else: + raise Exception('Unexpected type: {}. 
Counters should be dict'.format(counters.__class__.__name__)) + + +def logits2accuracy(logits, out, out_len, dtype=tf.float32): + """ + logits : [batch_size * nout * voc_size] + out : [batch_size * nout] + out_len: [batch_size] + + results: [batch_size * nout] + """ + predictions = tf.argmax(logits, axis=2) + return predictions2accuracy(predictions, out, out_len, dtype=dtype) + + +def predictions2accuracy(predictions, out, out_len, dtype=tf.float32): + """ + predictions: [batch_size * nout] + out : [batch_size * nout] + out_len: [batch_size] + + results: [batch_size * nout] + """ + out_equals = tf.equal(tf.cast(predictions, dtype=out.dtype), out) + out_mask = tf.sequence_mask(out_len, dtype=dtype, maxlen=tf.shape(out)[1]) + acc_values = tf.cast(out_equals, dtype=dtype) * out_mask + + return acc_values + + +def logits2accuracy_top_k(logits, out, out_len, k, dtype=tf.float32): + """ + logits: [batch_size * nout * ntokens] + out : [batch_size * nout] + out_len: [batch_size] + + results: [batch_size * nout] + """ + out_logits = select_values_over_last_axis(logits, tf.to_int32(out)) + out_logits = tf.expand_dims(out_logits, axis=-1) + + greater_mask = tf.greater(logits, out_logits) + greater_ranks = tf.reduce_sum(tf.to_int32(greater_mask), axis=-1) + hit_mask = greater_ranks < k + out_mask = tf.sequence_mask(out_len, dtype=dtype, maxlen=tf.shape(out)[1]) + acc_values = tf.to_float(hit_mask) * out_mask + + return acc_values + + +def predictions2accuracy_per_sequence(predictions, out, out_len, dtype=tf.float32): + """ + predictions: [batch_size * nout] + out: [batch_size * nout] + out_len: [batch_size] + + results: [batch_size] + """ + not_correct = tf.not_equal(tf.cast(predictions, dtype=out.dtype), out) + out_mask = tf.sequence_mask(out_len, dtype=dtype, maxlen=tf.shape(out)[1]) + correct_seq = 1.0 - tf.minimum(1.0, tf.reduce_sum(tf.cast(not_correct, dtype=dtype) * out_mask, axis=1)) + return tf.cast(correct_seq, dtype=dtype) + + + + + diff --git a/lib/task/seq2seq/tickers.py b/lib/task/seq2seq/tickers.py new file mode 100644 index 0000000..0b158bf --- /dev/null +++ b/lib/task/seq2seq/tickers.py @@ -0,0 +1,107 @@ +import os +import sys +import tensorflow as tf + +from ...train.tickers import DistributedTicker, _IsItTimeYet +import lib +from .bleu import Bleu + +# - TranslateTicker - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + +def unbpe(sent): + return sent.replace(' `', '') + + +class TranslateTicker(DistributedTicker): + """ + - Translate devset once in a while. + - Print BLEU to stderr after each translation. 
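+    - If `folder` is given, dump translations to translations<suffix>_<step>.txt there.
+    - Write BLEU and the translations to the tensorboard summary.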
+ """ + def __init__(self, model_name, devset, name='Dev', every_steps=None, every_minutes=None, initial=False, folder=None, + suffix=None, device=None, parallel=True): + self.model_name = model_name + self.devset = devset + self.every_steps = every_steps + self.every_minutes = every_minutes + self.folder = folder + self.name = name + self.initial = initial + self.device = device + self.parallel = parallel + self.suffix = suffix if suffix is not None else model_name + if self.suffix: # add underscore if we add suffix + self.suffix = '_' + self.suffix + + def on_started(self, context): + self.devset_batches = list(self.devset) + self.context = context + self.model = context.get_model(self.model_name) + + self.bleu = tf.placeholder(tf.float32) + self.translations = tf.placeholder(tf.string, shape=[None]) + + self.summary = tf.summary.merge([ + tf.summary.scalar(("%s/BLEU" % self.name) + self.suffix, self.bleu), + tf.summary.text(("%s/Translations" % self.name) + self.suffix, self.translations)]) + + self.is_it_time_yet = _IsItTimeYet( + context, self.every_steps, self.every_minutes) + + # Score devset after initialization if option passed (and we are not loading some non-init checkpoint) + if self.initial and context.get_global_step() == 0: + self._score() + + def after_train_batch(self, ingraph_result): + if self.is_it_time_yet(): + self._score() + + def _score(self): + if lib.ops.mpi.is_master(): + print('Translating', end='', file=sys.stderr, flush=True) + + translations = None + + if self.parallel or lib.ops.mpi.is_master(): + translations = [] + with tf.device(self.device) if self.device is not None else lib.util.nop_ctx(): + for batch in self.devset_batches: + trans = self.model.translate_lines([line[0] for line in batch]) + for index in range(len(batch)): + src = unbpe(batch[index][0]) + ethalon = unbpe(batch[index][1]) + translations.append(src + '\t' + ethalon + '\t' + unbpe(trans[index])) + + if self.parallel: + translations = lib.ops.mpi.gather_obj(translations) + if translations is not None: + translations = [x for t in translations for x in t] + + if translations is not None: + # compute BLEU only on the master + + if self.folder is not None: + global_step = self.context.get_global_step() + self._dump_translations( + translations, + fname='translations{}_{}.txt'.format(self.suffix, global_step) + ) + + bleu = Bleu() + for translation in translations: + src, ethalon, trans = translation.split('\t') + bleu.process_next(trans, [ethalon]) + bleu_value = 100 * (bleu.total()[0]) + + print('BLEU %f' % bleu_value, file=sys.stderr, flush=True) + + summary = tf.get_default_session().run(self.summary, feed_dict={self.bleu: bleu_value, + self.translations: translations}) + + self.context.get_summary_writer().add_summary(summary, self.context.get_global_step()) + + def _dump_translations(self, translations, fname): + if not os.path.isdir(self.folder): + os.mkdir(self.folder) + fout = open(os.path.join(self.folder, fname), 'w') + for translation in translations: + print(translation, file=fout) diff --git a/lib/task/seq2seq/voc.py b/lib/task/seq2seq/voc.py new file mode 100644 index 0000000..1ec3aed --- /dev/null +++ b/lib/task/seq2seq/voc.py @@ -0,0 +1,105 @@ +import collections +import sys + + +class BaseVoc: + @property + def bos(self): + raise NotImplementedError() + + @property + def eos(self): + raise NotImplementedError() + + def ids(self, words): + raise NotImplementedError() + + def words(self, ids): + raise NotImplementedError() + + def size(self): + raise NotImplementedError() + + 
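+# Hypothetical usage sketch for the Voc class below (the corpus path is a
+# placeholder, not a file from this repo):
+#
+#   voc = Voc.compile('corpus.en.bpeized', max_words=32000)
+#   ids = voc.ids('a transformer model'.split())   # OOV words map to voc._unk (id 2)
+#   words = voc.words(ids)
+#
+# Ids 0 and 1 are reserved for _BOS_ and _EOS_; id 2 is _UNK_.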
+class Voc: + @property + def bos(self): + return 0 + + @property + def eos(self): + return 1 + + @property + def _unk(self): + return 2 + + def ids(self, words): + if isinstance(words, (list, tuple)): + return [self.ids(word) for word in words] + return self._voc.get(words, self._unk) + + def words(self, ids): + if isinstance(ids, (list, tuple)): + return [self.words(id) for id in ids] + return self._ivoc[ids] + + def size(self): + return self._size + + @staticmethod + def compile(corpus_filename, max_words, index=0): + # Accumulate frequencies. + freqs = collections.defaultdict(int) + with open(corpus_filename) as corpus: + for line in corpus: + line = line.strip('\n') + if not line: + continue + for word in line.split(' '): + freqs[word.split('|||')[index]] += 1 + + # Sort by frequency. + freq_and_word = lambda item: item[::-1] + most_frequent = sorted(freqs.items(), key=freq_and_word, reverse=True) + + # Create voc. + obj = Voc() + voc = { '_BOS_': obj.bos, '_EOS_': obj.eos } + id = 3 + total_covered_freq = 0 + for word, freq in most_frequent[:max_words]: + voc[word] = id + id += 1 + total_covered_freq += freq + + # Report coverage. + total_freq = sum(freqs.values()) + msg = 'Voc %r: %i words, %.3f%% coverage' % ( + corpus_filename, + id, + total_covered_freq * 100 / total_freq, + ) + print(msg, file=sys.stderr, flush=True) + + # Return. + obj.__setstate__((voc,)) + return obj + + def __getstate__(self): + return self._voc, + + def __setstate__(self, state): + # Load direct vocabulary. + self._voc, = state + + # Fill inverse vocabulary. + self._ivoc = {} + for k, v in self._voc.items(): + self._ivoc[v] = k + self._ivoc[self.bos] = '_BOS_' + self._ivoc[self.eos] = '_EOS_' + self._ivoc[self._unk] = '_UNK_' + + # Compute size + self._size = max(self._voc.values()) + 1 diff --git a/lib/tools/__init__.py b/lib/tools/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/lib/tools/apply_bpe.py b/lib/tools/apply_bpe.py new file mode 100755 index 0000000..497daa8 --- /dev/null +++ b/lib/tools/apply_bpe.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 +import argparse +import sys +import numpy as np +from collections import defaultdict + + +class BPEizer: + def __init__(self, path, separator=' `'): + """ + A tool that converts tokenized strings into BPE units given bpe rules + Works by iteratively merging subword pairs with lowest priority, starting from individual characters + :param path: path to a file with bpe merging rules. 
+            Either from subword_nmt or yandex internal bpe tool
+            subword_nmt: file should start with #version: {some version} header and contain "{left_part right_part}" rules
+            yandex internal: file should contain lines with "{left_part}\t{right_part}\t{priority}"
+        :param separator: a string that separates segments of a word;
+            Note: subword_nmt's default separator is "@@ " (mind the space)
+
+        Usage:
+        >>> bpeizer = BPEizer(path='./data/ru.bpe.voc')
+        >>> bpeizer.bpeize_token('транспонировали')
+        'тран `сп `он `ир `овали'
+        >>> bpeizer(['тридцать три треугольных матрицы транспонировали - транспонировали', ', да не вытранспонировали !'])
+        ['тридцать три треуголь `ных мат `рицы тран `сп `он `ир `овали - тран `сп `он `ир `овали',
+         ', да не выт `ран `сп `он `ир `овали !']
+        """
+        self.bpe_rules = defaultdict(lambda: float('inf'))
+        self.separator = separator
+
+        if self.is_yandex_bpe(path):
+            self.mode = 'yandex'
+            self.begin, self.end = '^$'
+            for left, right, index in map(str.split, open(path)):
+                self.bpe_rules[left, right] = int(index)
+
+        elif self.is_rsenrich_bpe(path):
+            self.mode = 'rsenrich'
+            self.begin, self.end = '', ''
+            f_rules = open(path)
+            f_rules.readline()
+            for i, (left, right) in enumerate(map(str.split, f_rules)):
+                self.bpe_rules[left, right] = i
+        else:
+            raise NotImplementedError("BPE rules header is compatible with neither subword_nmt nor yandex bpe")
+
+        self.escape_chars = {self.begin: chr(0x110000 - 2), self.end: chr(0x110000 - 1)}
+        self.unescape_chars = {v: k for k, v in self.escape_chars.items()}
+
+    def bpeize_token(self, chars: str):
+        """ split a single token (str) into bpe units """
+        tokens = [self.begin] + [self.escape_chars.get(c, c) for c in chars] + [self.end]
+        if self.mode == 'rsenrich':
+            last = tokens.pop()
+            tokens[-1] += last  # automatically merge with previous token
+
+        while len(tokens) > 1:
+            # find the highest-priority (lowest-index) rule that applies to some adjacent pair
+            bpe_rule_priorities = [self.bpe_rules[prev, cur] for prev, cur in zip(tokens[:-1], tokens[1:])]
+
+            chosen_ix = np.argmin(bpe_rule_priorities)
+            if bpe_rule_priorities[chosen_ix] == float('inf'):
+                break  # no applicable merge rule is left
+
+            # apply it
+            tokens = tokens[:chosen_ix] + [tokens[chosen_ix] + tokens[chosen_ix + 1]] + tokens[chosen_ix + 2:]
+
+        assert tokens[0].startswith(self.begin) and tokens[-1].endswith(self.end)
+        tokens[0] = tokens[0][len(self.begin):]
+        tokens[-1] = tokens[-1][:-len(self.end) or None]  # `or None` keeps the last token intact when self.end == ''
+        tokens = [''.join([self.unescape_chars.get(c, c) for c in bpe])
+                  for bpe in tokens if len(bpe) != 0]
+        return self.separator.join(filter(len, tokens))
+
+    def bpeize_line(self, line: str):
+        """ converts a tokenized line into a bpe-ized line """
+        return ' '.join(map(self.bpeize_token, line.split()))
+
+    def __call__(self, text):
+        if isinstance(text, (list, tuple)):
+            return list(map(self, text))
+        elif isinstance(text, str):
+            return self.bpeize_line(text)
+        else:
+            raise ValueError("Expected string or list/tuple of strings but found {}".format(type(text)))
+
+    @staticmethod
+    def is_rsenrich_bpe(bpe_rules_path):
+        """ Check if bpe rules were learned by https://github.com/rsennrich/subword-nmt """
+        header = open(bpe_rules_path).readline()
+        return header.startswith('#version:')
+
+    @staticmethod
+    def is_yandex_bpe(bpe_rules_path):
+        """ Check if bpe rules were learned by internal Yandex tool """
+        try:
+            header = open(bpe_rules_path).readline()
+            l, r, i = header.split('\t')  # header must consist of exactly three tab-separated fields
+            return True
+        except Exception:
+            return False
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--bpe_rules', required=True)
+    args = parser.parse_args()
+
+    bpeizer = BPEizer(args.bpe_rules)
+    for l in sys.stdin:
+        print(bpeizer.bpeize_line(l))
diff --git a/lib/tools/average_npz.py b/lib/tools/average_npz.py
new file mode 100755
index 0000000..e325a4c
--- /dev/null
+++ b/lib/tools/average_npz.py
@@ -0,0 +1,71 @@
+#!/usr/bin/env python3
+
+"""
+Averaging NPZ files
+"""
+
+import argparse
+import os
+import numpy as np
+
+
+def get_last_checkpoints(folder, num_checkpoints):
+    labels = []
+    for fname in os.listdir(folder):
+        if not fname.startswith('model-') or not fname.endswith('.npz'):
+            continue
+        label = fname[len('model-'):-len('.npz')]
+        if not label.isdigit():
+            continue
+        labels.append(int(label))
+    labels = sorted(labels, reverse=True)[0:num_checkpoints]
+
+    files = []
+    for label in labels:
+        filename = os.path.join(folder, 'model-%d.npz' % label)
+        files += [filename]
+    return files
+
+
+def average_npzs(files):
+    out = {}
+    for filename in files:
+        model = np.load(filename)
+        for var in model:
+            if var in out:
+                out[var] += model[var]
+            else:
+                out[var] = model[var]
+    for var in out:
+        out[var] /= len(files)
+
+    return out
+
+
+def _parse_args():
+    p = argparse.ArgumentParser()
+    p.add_argument('--oname', '-O', required=True, help='output file name')
+    p.add_argument('--ncheckpoints', '-n', type=int, help='number of checkpoints to use')
+    p.add_argument('--folder', type=str, help='path to checkpoints')
+    p.add_argument('files', nargs='*')
+
+    args = p.parse_args()
+    if (args.folder is None) != (args.ncheckpoints is None):
+        raise Exception("--folder and --ncheckpoints should be specified together")
+
+    if (args.folder is not None) and len(args.files):
+        raise Exception("Use one of two modes:\n