## In this notebook, you will learn how to score translations with the baseline model

You can use this code to score consistency test sets with your trained model.

(If you don't know how to load a model or how to work with vocabularies yet, see the notebook __1_Load_model_and_translate_baseline__.)

### Load trained model

In [1]:
import sys

sys.path.insert(0, 'path_to_good_translation_wrong_in_context') # insert your local path to the repo

Load vocabularies.

In [2]:
import pickle
import numpy as np

DATA_PATH = # insert your path
VOC_PATH =  # insert your path

inp_voc = pickle.load(open(VOC_PATH + 'src.voc', 'rb'))
out_voc = pickle.load(open(VOC_PATH + 'dst.voc', 'rb'))

Load model.

In [3]:
%env CUDA_VISIBLE_DEVICES=0

import tensorflow as tf
import lib
import lib.task.seq2seq.models.transformer as tr

tf.reset_default_graph()
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.99, allow_growth=True)
sess = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

hp = {
     "num_layers": 6,
     "num_heads": 8,
     "ff_size": 2048,
     "ffn_type": "conv_relu",
     "hid_size": 512,
     "emb_size": 512,
     "res_steps": "nlda", 
    
     "rescale_emb": True,
     "inp_emb_bias": True,
     "normalize_out": True,
     "share_emb": False,
     "replace": 0,
    
     "relu_dropout": 0.1,
     "res_dropout": 0.1,
     "attn_dropout": 0.1,
     "label_smoothing": 0.1,
    
     "translator": "ingraph",
     "beam_size": 4,
     "beam_spread": 3,
     "len_alpha": 0.6,
     "attn_beta": 0,
}

model = tr.Model('mod', inp_voc, out_voc, inference_mode='fast', **hp)

env: CUDA_VISIBLE_DEVICES=0


#### Load checkpoint

In [4]:
path_to_ckpt = # insert path to the final checkpoint
var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
lib.train.saveload.load(path_to_ckpt, var_list)

## Score <a name="translate"></a>

In [5]:
# load consistency test set (for example, we'll take deixis)
path_to_testset = # path to your data
test_src = open(path_to_testset + 'deixis_dev.src').readlines()
test_dst = open(path_to_testset + 'deixis_dev.dst').readlines()

Our test set contains groups of consecutive sentences and their translations (sentences are separated with the `_eos` token):

In [6]:
test_src[:4]

["just leave them outside the door . _eos the rooms need to be cleaned , once a week in minimum . _eos - that 's a policy ... _eos - didn 't i clear your policy ?\n",
 "just leave them outside the door . _eos the rooms need to be cleaned , once a week in minimum . _eos - that 's a policy ... _eos - didn 't i clear your policy ?\n",
 "just leave them outside the door . _eos the rooms need to be cleaned , once a week in minimum . _eos - that 's a policy ... _eos - didn 't i clear your policy ?\n",
 "just leave them outside the door . _eos the rooms need to be cleaned , once a week in minimum . _eos - that 's a policy ... _eos - didn 't i clear your policy ?\n"]

In [7]:
test_dst[:4]

['просто оставьте их за дверью . _eos номера должны быть очищ `ены , раз в неделю минимум . _eos - это политика ... . _eos - разве я не ваша политика ?\n',
 'просто оставьте их за дверью . _eos номера должны быть очищ `ены , раз в неделю минимум . _eos - это политика ... . _eos - разве я не твоя политика ?\n',
 'просто оставь их за дверью . _eos номера должны быть очищ `ены , раз в неделю минимум . _eos - это политика ... . _eos - разве я не твоя политика ?\n',
 'просто оставь их за дверью . _eos номера должны быть очищ `ены , раз в неделю минимум . _eos - это политика ... . _eos - разве я не ваша политика ?\n']

To score sentences, we need the loss values that the problem computes for each example in a batch.

In [8]:
from lib.task.seq2seq.problems.default import DefaultProblem
from lib.task.seq2seq.data import make_batch_placeholder
problem = DefaultProblem({'mod': model})
batch_ph = make_batch_placeholder(model.make_feed_dict(model._get_batch_sample()))
loss_values = problem.loss_values(batch=batch_ph, is_train=False)

Since the baseline can process only one sentence at a time, we score only the last sentence of each group.

In [9]:
def num_sents(text):
    return len(text.split(' _eos '))

def make_baseline_batch_data(src_lines, dst_lines):
    """
    src_lines contain groups of N sentences, last of which is to be translated (' _eos '-separated)
    dst_lines contain translations of sentences in src_lines (' _eos '-separated)
    
    returns: list of pairs (src, dst) which one can give to a model
    """
    assert len(src_lines) == len(dst_lines), "Different number of text fragments"
    batch_src = []
    batch_dst = []
    for src, dst in zip(src_lines, dst_lines):
        assert num_sents(src) == num_sents(dst)
        batch_src.append(src.split(' _eos ')[-1])
        batch_dst.append(dst.split(' _eos ')[-1])
    return list(zip(batch_src, batch_dst))

def score_batch(src_lines, dst_lines):
    feed_dict = model.make_feed_dict(make_baseline_batch_data(src_lines, dst_lines))
    feed = {batch_ph[k]: feed_dict[k] for k in batch_ph}
    scores = sess.run(loss_values, feed)
    return scores
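
To make the last-sentence extraction concrete, here is a minimal sketch on toy data. The helper `last_sentences` and the toy strings are hypothetical and only mirror what `make_baseline_batch_data` does for a single pair; they are not part of the repo.

```python
# Toy illustration with made-up data (not taken from the test set).
SEP = ' _eos '

def last_sentences(src, dst, sep=SEP):
    """Return the final sentence of a source group and of its translation."""
    src_sents = src.split(sep)
    dst_sents = dst.split(sep)
    assert len(src_sents) == len(dst_sents), "Different number of sentences"
    return src_sents[-1], dst_sents[-1]

toy_src = "how are you ? _eos i am fine .\n"
toy_dst = "как дела ? _eos у меня все хорошо .\n"

pair = last_sentences(toy_src, toy_dst)
# The trailing newline left by readlines() is kept, as in the cell above.
print(pair)
```

Only the second sentence of each side survives; the context sentence is dropped before scoring.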

In [10]:
score_batch(test_src[:4], test_dst[:4])

array([10.607619, 12.479008, 12.479008, 10.60762 ], dtype=float32)
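
The first and fourth candidates end with the same final sentence, as do the second and third, so a model that sees only the last sentence gives each pair (almost) the same score. If you want to see which candidate the model prefers within each group, one option is sketched below. It assumes that candidates for one source come in consecutive groups of a fixed size and that a lower loss means a more preferred translation; both the helper `best_per_group` and the group size are assumptions for illustration, and the numbers are made up.

```python
import numpy as np

def best_per_group(scores, group_size):
    """Index (within each group) of the candidate with the lowest loss.

    Assumes len(scores) is a multiple of group_size and that candidates
    of the same group are stored consecutively.
    """
    scores = np.asarray(scores)
    assert len(scores) % group_size == 0, "Incomplete group"
    return scores.reshape(-1, group_size).argmin(axis=1)

# Hypothetical losses for two groups of four candidates each.
toy_scores = np.array([10.6, 12.5, 12.5, 10.7,
                       3.0, 2.0, 5.0, 4.0], dtype=np.float32)
preferred = best_per_group(toy_scores, group_size=4)
print(preferred)
```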

To score the whole test set, call `score_batch` on consecutive batches and collect the results.
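
A minimal sketch of such a loop follows. The helper `score_in_batches`, its `batch_size` default, and the stand-in scorer `dummy_score` are hypothetical; the scoring function is passed as an argument so the example runs without the model.

```python
import numpy as np

def score_in_batches(src_lines, dst_lines, score_fn, batch_size=32):
    """Apply score_fn to consecutive slices and concatenate the results."""
    assert len(src_lines) == len(dst_lines), "Different number of text fragments"
    chunks = []
    for start in range(0, len(src_lines), batch_size):
        end = start + batch_size
        chunks.append(np.asarray(score_fn(src_lines[start:end], dst_lines[start:end])))
    if not chunks:
        return np.array([], dtype=np.float32)
    return np.concatenate(chunks)

# Stand-in scorer so the sketch is runnable on its own:
# it just counts the tokens in each target line.
def dummy_score(src_lines, dst_lines):
    return np.array([len(d.split()) for d in dst_lines], dtype=np.float32)

toy_src = ['a b', 'c', 'd e f', 'g', 'h i']
toy_dst = ['x', 'y z', 'w', 'u v t', 'q']
all_scores = score_in_batches(toy_src, toy_dst, dummy_score, batch_size=2)
print(all_scores)
```

In this notebook you would pass the real scorer instead, for example `score_in_batches(test_src, test_dst, score_batch)`, choosing a batch size that fits in GPU memory.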