## In this notebook, you will learn how to translate text fragments with CADec

If you do not yet know how to load a model or work with vocabularies, see the notebook __1_Load_model_and_translate_baseline__.

### Load trained model

In [1]:
import sys

sys.path.insert(0, 'path_to_good_translation_wrong_in_context') # insert your local path to the repo

Load vocabularies.

In [2]:
import pickle
import numpy as np

DATA_PATH = # insert your path
VOC_PATH =  # insert your path

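# open the vocabulary files in context managers so they are closed after loading
with open(VOC_PATH + 'src.voc', 'rb') as f:
    inp_voc = pickle.load(f)
with open(VOC_PATH + 'dst.voc', 'rb') as f:
    out_voc = pickle.load(f)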

Load model.

In [3]:
%env CUDA_VISIBLE_DEVICES=0

import tensorflow as tf
import lib
import lib.task.seq2seq.cadec.model as tr

tf.reset_default_graph()
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.99, allow_growth=True)
sess = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

hp = {
     "num_layers": 6,
     "num_heads": 8,
     "ff_size": 2048,
     "ffn_type": "conv_relu",
     "hid_size": 512,
     "emb_size": 512,
     "res_steps": "nlda", 
    
     "rescale_emb": True,
     "inp_emb_bias": True,
     "normalize_out": True,
     "share_emb": False,
     "replace": 0,
    
     "relu_dropout": 0.1,
     "res_dropout": 0.1,
     "attn_dropout": 0.1,
     "label_smoothing": 0.1,
    
     "translator": "ingraph",
     "beam_size": 4,
     "beam_spread": 3,
     "len_alpha": 0.6,
     "attn_beta": 0,
    
     "dec1_attn_mode": "rdo_and_emb",
     "share_loss": False,
     "decoder2_name": "decoder2",
     "max_ctx_sents": 3,
     "use_dst_ctx": True,
}

model = tr.CADecModel('mod', inp_voc, out_voc, inference_mode='lazy', **hp)

env: CUDA_VISIBLE_DEVICES=0
MODEL: use_dst_ctx  True
reusing...
DEC2: use_dst_ctx True


#### Load checkpoint

In [4]:
path_to_ckpt = # insert path to the final checkpoint
var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
lib.train.saveload.load(path_to_ckpt, var_list)

## Translate <a name="translate"></a>

In [5]:
# load test set
path_to_testset = # path to your data
test_src = open(path_to_testset + 'test.src').readlines()

There are groups of consecutive sentences in our test set (sentences are separated with the `_eos` token):

In [6]:
test_src[:3]

["someone broke into my house last night and stole that ugly shirt . _eos yeah ... _eos i don 't believe that . _eos oh , it 's bo `xy on me , dude .\n",
 'is she in danger ? _eos no . _eos things are different there . _eos different ?\n',
 "it can 't be you . _eos but he 's not dead . _eos well , he 'll have to be , right ? _eos otherwise , he 'll tell them the truth .\n"]

To translate a group of sentences, CADec translates one sentence at a time, relying on the translations generated so far. The first sentence has no context, so its translation comes from the baseline model.

In [7]:
def num_sents(text):
    return len(text.split(' _eos '))

def get_first_n_sents(lines, N):
    """ Take groups of first N sentences from ' _eos '-separated fragments """
    return [' _eos '.join(line.split(' _eos ')[:N]) for line in lines]

def expand_groups(groups, sents):
    """Expand groups of sentences with new sentences"""
    assert len(groups) == len(sents), "Numbers of groups and new sentences do not match"
    return [group.strip() + ' _eos ' + sent for group, sent in zip(groups, sents)]

def make_cadec_batch_data(src_lines, dst_lines):
    """
    src_lines contain groups of N sentences, last of which is to be translated (' _eos '-separated)
    dst_lines contain translations of the first N-1 sentences from src_lines (' _eos '-separated)
    """
    assert len(src_lines) == len(dst_lines), "Different number of text fragments"
    batch = []
    for src, dst in zip(src_lines, dst_lines):
        assert num_sents(src) == num_sents(dst) + 1
        src_elems = src.strip().split(' _eos ')
        batch.append(' _eos '.join([src_elems[-1]] + src_elems[:-1]) + ' _eos_eos ' + dst)
    return batch

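def translate_batch(src_lines):
    """Translate groups of sentences; all groups in src_lines must have the same number of sentences."""
    # produce translations of the first sentences in the groups using the base model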
    current_translations = model.model1.translate_lines(get_first_n_sents(src_lines, N=1))
    # consecutively translate sentences using previously translated context sentences
    for iter_ in range(1, num_sents(src_lines[0])):
        batch_data = make_cadec_batch_data(src_lines=get_first_n_sents(src_lines, iter_ + 1),
                                           dst_lines=current_translations)
        new_translations = model.translate_lines(batch_data)
        current_translations = expand_groups(current_translations, new_translations)
    return current_translations
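
To make the input format concrete, here is a minimal, self-contained sketch that repeats the `make_cadec_batch_data` logic from the cell above (without the assertions) and applies it to a toy example. The strings `s1`, `s2`, `s3`, `t1`, `t2` are placeholders invented for illustration, not real data or model output.

```python
def make_cadec_batch_data(src_lines, dst_lines):
    """Same logic as the helper above: the sentence to translate goes first,
    followed by the source context, then ' _eos_eos ' and the target context."""
    batch = []
    for src, dst in zip(src_lines, dst_lines):
        src_elems = src.strip().split(' _eos ')
        batch.append(' _eos '.join([src_elems[-1]] + src_elems[:-1]) + ' _eos_eos ' + dst)
    return batch

# Toy placeholders: three source sentences and translations of the first two.
src = ['s1 _eos s2 _eos s3']
dst = ['t1 _eos t2']

cadec_input = make_cadec_batch_data(src, dst)
print(cadec_input[0])
# prints: s3 _eos s1 _eos s2 _eos_eos t1 _eos t2
```

The current sentence (`s3`) is moved to the front, and the already translated context follows the ` _eos_eos ` delimiter.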

In [8]:
translate_batch(test_src[:3])

new batch sample


['кто-то вломился в мой дом прошлой ночью и украл эту урод `ливую рубашку . _eos да ... _eos я в это не верю . _eos о , это ужасно , чувак .',
 'она в опасности ? _eos нет . _eos там все по-другому . _eos по-другому ?',
 'это не можешь быть ты . _eos но он не умер . _eos ну , он должен быть , верно ? _eos иначе он расскажет им правду .']

To translate a whole test set, apply `translate_batch` to a sequence of batches (a batch of 50-100 sentences is fine).
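
A batching loop could be sketched as follows. This is an illustration, not part of the repository: `translate_in_batches` and `fake_translate` are hypothetical names, and `fake_translate` only stands in for `translate_batch` so that the sketch runs without a model. Because `translate_batch` reads the number of sentences from the first line of a batch, the sketch first groups lines by sentence count.

```python
from collections import defaultdict

def translate_in_batches(src_lines, translate_fn, batch_size=64):
    """Translate all lines batch by batch and return results in the original order."""
    # Group line indices by the number of ' _eos '-separated sentences.
    by_len = defaultdict(list)
    for idx, line in enumerate(src_lines):
        by_len[len(line.strip().split(' _eos '))].append(idx)

    results = [None] * len(src_lines)
    for indices in by_len.values():
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            outputs = translate_fn([src_lines[i] for i in chunk])
            for i, out in zip(chunk, outputs):
                results[i] = out
    return results

# Hypothetical stand-in for translate_batch: upper-cases each line.
def fake_translate(lines):
    return [line.strip().upper() for line in lines]

demo = ['a _eos b\n', 'c\n', 'd _eos e\n']
demo_result = translate_in_batches(demo, fake_translate, batch_size=2)
print(demo_result)
```

With the real model, one would pass `translate_batch` as `translate_fn` and pick `batch_size` to fit GPU memory.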

**As with the baseline, do not forget to undo the BPE segmentation (unbpe) of your translations before evaluating the BLEU score!**
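
A minimal sketch of this post-processing step is shown below. It assumes that the BPE continuation marker is a space followed by a backtick, as it appears in the sample outputs above (for example in `урод `ливую`); if your BPE setup uses a different marker, change `separator` accordingly. The helper names `unbpe` and `split_group` are hypothetical.

```python
def unbpe(line, separator=' `'):
    """Merge BPE pieces back into words by removing the separator.
    The default separator is an assumption based on the sample outputs."""
    return line.replace(separator, '')

def split_group(line):
    """Split an ' _eos '-separated group into individual sentences."""
    return [sent.strip() for sent in line.strip().split(' _eos ')]

sample = 'кто-то вломился в мой дом прошлой ночью и украл эту урод `ливую рубашку . _eos да ...'
sentences = split_group(unbpe(sample))
print(sentences)
```

Splitting the groups back into sentences makes it possible to score each sentence against its own reference.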

### Comment

Note how the translations generated by CADec differ from those generated by the base model.

Let's take one of the text examples:

In [9]:
test_src[1]

'is she in danger ? _eos no . _eos things are different there . _eos different ?\n'

Here is what we get when translating these sentences in isolation with the baseline model:

In [10]:
' _eos '.join(model.model1.translate_lines(test_src[1].split(' _eos ')))

'она в опасности ? _eos нет . _eos там все по-другому . _eos другой ?'

The translations of the word `different` are inconsistent, although they should match: in a clarification question, the speaker repeats the form of the word being questioned.

__она в опасности ? \_eos нет . \_eos там все <span style="color:blue">по-другому</span> . \_eos <span style="color:red">другой</span> ?__

Here is the translation using CADec:

In [11]:
translate_batch(test_src[1:2])

['она в опасности ? _eos нет . _eos там все по-другому . _eos по-другому ?']

This translation is consistent!

__она в опасности ? \_eos нет . \_eos там все <span style="color:blue">по-другому</span> . \_eos <span style="color:blue">по-другому</span> ?__
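
To find such differences across many fragments, the two outputs can be compared sentence by sentence. The sketch below uses a hypothetical helper, `differing_sentences`, and the two translations shown above as input.

```python
def differing_sentences(translation_a, translation_b):
    """Return (index, sentence_a, sentence_b) for every position where
    two ' _eos '-separated translations disagree."""
    a = translation_a.strip().split(' _eos ')
    b = translation_b.strip().split(' _eos ')
    assert len(a) == len(b), "Translations have different numbers of sentences"
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

baseline = 'она в опасности ? _eos нет . _eos там все по-другому . _eos другой ?'
cadec = 'она в опасности ? _eos нет . _eos там все по-другому . _eos по-другому ?'

diffs = differing_sentences(baseline, cadec)
print(diffs)
# prints: [(3, 'другой ?', 'по-другому ?')]
```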