## Setup


Set up directory structure: 

*   *data/en-de/*

    Directory containing the datasets: English sentences, translations and scores for training, validation and testing.

*   *tensors/\<model\>/*

    Directory where the tensors for the sentences and translations generated with each language model (or custom features) are saved.

    Generating and saving all the tensors once allows faster loading of the data for training and testing.

*   *tensors/features/*

    Directory where the tensors for the features of the sentences and translations are saved.

*   *saved_models/*

    Directory where the best models are saved.






In [0]:
!mkdir -p data/en-de/
!mkdir -p tensors/spacy tensors/bert tensors/word2vec tensors/bpemb tensors/features
!mkdir -p saved_models/
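
The same layout can also be created from Python, which is handy outside a notebook and safe to re-run. This is a minimal sketch: the helper name `make_project_dirs` and its `root` argument are assumptions made for illustration, not part of the notebook.

```python
import os
import tempfile

# Sub-directories used by the notebook (taken from the list above)
PROJECT_DIRS = [
    "data/en-de",
    "tensors/spacy",
    "tensors/bert",
    "tensors/word2vec",
    "tensors/bpemb",
    "tensors/features",
    "saved_models",
]


def make_project_dirs(root="."):
    """Create every project directory under `root`; existing ones are kept."""
    created = []
    for rel_path in PROJECT_DIRS:
        path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory so nothing is written to the working directory
with tempfile.TemporaryDirectory() as tmp:
    paths = make_project_dirs(tmp)
    all_exist = all(os.path.isdir(p) for p in paths)
```

Calling `make_project_dirs()` with no argument would create the directories relative to the current working directory.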

## Word Embeddings
We tried out several word embeddings for our models: **word2vec**, **spaCy**, **BPEmb** and **BERT**.

### Word2Vec

The following cells set up and load the word2vec models for English and German. The models are quite large, so we download them from Google Drive because a manual file upload fails. To generate word2vec embeddings, first copy [this Google News file](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?usp=sharing) and [this German model file](http://cloud.devmount.de/d2bc5672c523b086/) to your Google Drive, then replace the `<YOUR_X_ID_HERE>` placeholders with the IDs of your files.

In [0]:
!pip install PyDrive
import re

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# German word2vec model
german = drive.CreateFile({'id': '<YOUR_GERMAN_MODEL_ID_HERE>'})
german.GetContentFile('german.model')

# English (Google News) word2vec model
english = drive.CreateFile({'id': '<YOUR_GOOGLE_NEWS_ID_HERE>'})
english.GetContentFile('GoogleNews-vectors-negative300.bin.gz')

### spaCy

The following cells set up spaCy for GloVe embeddings. We use the `en_core_web_md` and `de_core_news_md` models to generate word embeddings.

The runtime needs to be restarted after installing both models.

In [0]:
!pip install spacy
!python -m spacy download en_core_web_md
!python -m spacy download de_core_news_md

### BPemb

The following cells install the **bpemb** library for byte pair encoding sub-word embeddings. 

In [0]:
!pip install bpemb

### BERT

In order to generate the sentence embeddings for BERT, we will use [bert-as-service](https://github.com/hanxiao/bert-as-service) for which you must have downloaded [this BERT model](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip). Follow the instructions on the `README` of [bert-as-service](https://github.com/hanxiao/bert-as-service) using `max_seq_len=None` to generate the embeddings and place these into the `tensors/bert` folder.

# Imports

In [0]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

import keras
from keras.models import Model
from keras.layers.wrappers import Bidirectional, TimeDistributed
from keras.layers.core import Lambda, Masking
from keras.layers.recurrent import GRU, LSTM
from keras.layers.merge import Concatenate
from keras.callbacks import TerminateOnNaN, ModelCheckpoint, EarlyStopping, \
    TensorBoard
import keras.backend as K
import keras.layers as L
from keras.utils import Sequence, to_categorical, multi_gpu_model
from keras import metrics

from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.model_selection import KFold
from gensim.models import KeyedVectors

from sklearn.model_selection import ParameterSampler

import numpy as np

import spacy

# Preprocessing

Define global constants used to build the file names for saving and loading data. Also define the constants needed for preprocessing and building models, such as the dimension of the embeddings.

In [0]:
features_dir = "features"
tensor_dir = "tensors"
sentences_file = "sentences_tensor.npy"
translations_file = "translations_tensor.npy"
sents_files = {"train": 'data/en-de/train.ende.src', "val": "data/en-de/dev.ende.src", "test": "data/en-de/test.ende.src"}
trans_files = {"train": "data/en-de/train.ende.mt", "val": "data/en-de/dev.ende.mt", "test": "data/en-de/test.ende.mt"}
MAX_WORDS = 60
BATCH_SIZE = 100
VECTOR_DIM = 300

## File I/O Utils

Functions to read the datasets and scores and load saved tensors

In [0]:
def load_data(file):
    with open(file, 'r') as f:
        return [x.strip() for x in f.readlines()]


def load_scores(scores_file):
    scores = load_data(scores_file)
    return np.array(scores, dtype=np.float32)


def load_tensors(t_dir, tensor_file, model="spacy", dataset="train"):
    tensor_file = f'{t_dir}/{model}/{dataset}_{tensor_file}'
    return np.load(tensor_file)


def load_inputs(dataset="train", model="spacy"):

    sentences = load_tensors(tensor_dir, sentences_file, model=model, dataset=dataset)
    translations = load_tensors(tensor_dir, translations_file, model=model, dataset=dataset)
    
    return sentences, translations 
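
The loaders above expect each tensor at `<tensor_dir>/<model>/<dataset>_<file name>`. A small round trip illustrates the convention. In this sketch `tensor_path` is a hypothetical helper, and the array is dummy data written to a temporary directory.

```python
import os
import tempfile

import numpy as np


def tensor_path(t_dir, tensor_file, model="spacy", dataset="train"):
    """Build the path used by load_tensors: <t_dir>/<model>/<dataset>_<file>."""
    return f"{t_dir}/{model}/{dataset}_{tensor_file}"


with tempfile.TemporaryDirectory() as tmp:
    path = tensor_path(tmp, "sentences_tensor.npy", model="spacy", dataset="val")
    os.makedirs(os.path.dirname(path), exist_ok=True)

    # Dummy batch: 2 sentences, 60 word slots, 300-dimensional vectors
    dummy = np.zeros((2, 60, 300), dtype=np.float32)
    np.save(path, dummy)

    loaded = np.load(path)
    relative = os.path.relpath(path, tmp)
```

Here `relative` is the path of the saved file below the tensor directory, which is the part that `model` and `dataset` control.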

## word2vec preprocessing functions

In [0]:
def load_word2vec_models():
    # Load vectors directly from the file
    english_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
    german_model =  KeyedVectors.load_word2vec_format("german.model", binary=True)
    return english_model, german_model


def preprocess_inputs(word2vec_english_model, word2vec_german_model,
                            english_data, german_data):
    train_english = load_data(english_data)
    train_german = load_data(german_data)

    english_embeddings = embed_sentences_w2v(word2vec_english_model, train_english)
    german_embeddings = embed_sentences_w2v(word2vec_german_model, train_german)

    return english_embeddings, german_embeddings


def embed_sentences_w2v(word2vec_model, train, max_words=MAX_WORDS):
    # Unused word slots stay at zero and act as padding
    result = np.zeros((len(train), max_words, VECTOR_DIM))

    for sentence_index, sentence in enumerate(train):
        # Words beyond max_words are dropped
        for word_index, word in enumerate(sentence.split(' ')[:max_words]):
            word = word.strip('\n')
            word = re.sub(r'\W+', '', word)
            try:
                result[sentence_index, word_index] = word2vec_model[word]
            except KeyError:
                # Out-of-vocabulary words keep a zero vector
                continue

    return result


def load_word2vec_vecs(word2vec_english_model, word2vec_german_model, dataset="train", save=True):
    sents_f = sents_files[dataset]
    translations_f = trans_files[dataset]
    sentences_tensor, translations_tensor = preprocess_inputs(word2vec_english_model,
                                                              word2vec_german_model,
                                                              sents_f,
                                                              translations_f)
    if save:
        # Save under tensors/word2vec/ so load_tensors(model="word2vec") can find the files
        sents_file = f"{dataset}_{sentences_file}"
        trans_file = f"{dataset}_{translations_file}"
        sents_file = f'{tensor_dir}/word2vec/{sents_file}'
        trans_file = f'{tensor_dir}/word2vec/{trans_file}'
        np.save(sents_file, sentences_tensor)
        np.save(trans_file, translations_tensor)

    return sentences_tensor, translations_tensor
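
To see what the padded tensor looks like without downloading the models, the sketch below swaps the gensim model for a plain dictionary with 3-dimensional vectors. Both the dictionary and the `embed_toy` helper are assumptions made for illustration. Out-of-vocabulary words and unused slots stay at zero.

```python
import re

import numpy as np

# Toy stand-in for a word2vec model: word -> vector (3 dims instead of 300)
toy_model = {
    "the": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "sat": np.array([0.0, 0.0, 1.0]),
}


def embed_toy(model, sentences, max_words=4, dim=3):
    """Zero-padded (n_sentences, max_words, dim) tensor of word vectors."""
    result = np.zeros((len(sentences), max_words, dim))
    for s_idx, sentence in enumerate(sentences):
        for w_idx, word in enumerate(sentence.split(" ")[:max_words]):
            word = re.sub(r"\W+", "", word)  # strip punctuation before the lookup
            if word in model:
                result[s_idx, w_idx] = model[word]
    return result


toy_tensor = embed_toy(toy_model, ["the cat sat.", "the dog"])
print(toy_tensor.shape)
```

The first sentence fills three of its four slots, while "dog" in the second sentence is unknown to the toy model and is left as a zero vector.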

In [0]:
word2vec_english_model, word2vec_german_model = load_word2vec_models()

## spaCy preprocessing functions

In [0]:
def load_spacy_models():
    nlp_english = spacy.load("en_core_web_md")
    nlp_german = spacy.load("de_core_news_md")
    return nlp_english, nlp_german

def spacy_preprocess_inputs(nlp_english, nlp_german,
                            english_data, german_data, max_words=MAX_WORDS):
    train_english = load_data(english_data)
    train_german = load_data(german_data)

    doc_english = get_doc(nlp_english, train_english)
    doc_german = get_doc(nlp_german, train_german)

    sentences = embed_sentences(doc_english, nlp_english, max_words)
    translations = embed_sentences(doc_german, nlp_german, max_words)

    return sentences, translations


def get_doc(nlp, train):
    # n_threads is omitted: it has been deprecated since spaCy 2.1
    doc = list(nlp.pipe(train, batch_size=32))
    return doc


def embed_sentences(doc, nlp, max_words):
    unknown_vector = nlp.vocab['unk'].vector
    
    result = np.zeros((len(doc), max_words, len(unknown_vector)), dtype=np.float32)

    for sentence_index, sentence in enumerate(doc):
        # Tokens beyond max_words are dropped so the index stays in bounds
        for token_index, token in enumerate(sentence[:max_words]):
            if token.has_vector:
                result[sentence_index, token_index] = token.vector
            else:
                result[sentence_index, token_index] = unknown_vector

    return result


def load_spacy_vecs(nlp_english, nlp_german, dataset="train", save=True):
    sents_f = sents_files[dataset]
    translations_f = trans_files[dataset]
    sentences_tensor, translations_tensor = spacy_preprocess_inputs(nlp_english,
                                                                    nlp_german,
                                                                    sents_f,
                                                                    translations_f)
    if save:
        sents_file = f"{dataset}_{sentences_file}"
        trans_file = f"{dataset}_{translations_file}"
        sents_file = f'{tensor_dir}/spacy/{sents_file}'
        trans_file = f'{tensor_dir}/spacy/{trans_file}'
        np.save(sents_file, sentences_tensor)
        np.save(trans_file, translations_tensor)

    return sentences_tensor, translations_tensor

In [0]:
# Load spacy language models only once
nlp_english, nlp_german = load_spacy_models()

## BPE preprocessing functions

In [0]:
from bpemb import BPEmb
import numpy as np
import torch


def load_bp_models():
    bpemb_en = BPEmb(lang='en', vs=100000)
    bpemb_de = BPEmb(lang='de', vs=100000)
    return bpemb_en, bpemb_de


def bpe_embed_sentences(bpemb_model, train):
    max_byte_encoding_length = -1
    max_embedding_dimension = 100
    embeddings = []
    for sentence in train:
        embedding = bpemb_model.embed(sentence)
        embeddings.append(embedding)
        max_byte_encoding_length = max(max_byte_encoding_length, embedding.shape[0])

    # Zero-pad every sentence to the longest sub-word sequence in this dataset
    result = np.zeros((len(train), max_byte_encoding_length, max_embedding_dimension))

    for sentence_index, embedding in enumerate(embeddings):
        result[sentence_index, :embedding.shape[0]] = embedding

    return result


def bpe_preprocess_inputs(bpemb_en, bpemb_de,
                            english_data, german_data):
    train_english = load_data(english_data)
    train_german = load_data(german_data)

    english_embeddings = bpe_embed_sentences(bpemb_en, train_english)
    german_embeddings = bpe_embed_sentences(bpemb_de, train_german)

    sentences = torch.tensor(english_embeddings, dtype=torch.float32)
    translations = torch.tensor(german_embeddings, dtype=torch.float32)
    return sentences, translations


def load_bpe_vecs(bpemb_en, bpemb_de, dataset="train", save=True):
    # Use the models passed in rather than loading them again
    sents_f = sents_files[dataset]
    translations_f = trans_files[dataset]
    sentences_tensor, translations_tensor = bpe_preprocess_inputs(bpemb_en,
                                                                    bpemb_de,
                                                                    sents_f,
                                                                    translations_f)
    
    if save:
        sents_file = f"{dataset}_{sentences_file}"
        trans_file = f"{dataset}_{translations_file}"
        sents_file = f'{tensor_dir}/bpemb/{sents_file}'
        trans_file = f'{tensor_dir}/bpemb/{trans_file}'
        np.save(sents_file, sentences_tensor)
        np.save(trans_file, translations_tensor)
    
    return sentences_tensor, translations_tensor

In [0]:
bpemb_en, bpemb_de = load_bp_models()

## Feature extraction

Extract sentence-level features from sentences:

*   Number of tokens in each source sentence
*   Number of tokens in each translation
*   Ratio of the number of tokens in source to translation
*   Average source token length
*   Average translation token length
*   Number of named entities in source
*   Number of named entities in translation
*   Number of punctuation marks in source
*   Number of punctuation marks in translation
*   Frequencies of different POS tags in source sentence
*   Frequencies of different POS tags in translation
*   Frequencies of different fine-grained POS tags in source sentence
*   Frequencies of different fine-grained POS tags in translation
*   Frequencies of different dependency tags in source sentence
*   Frequencies of different dependency tags in translation
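
As a rough illustration of the length-based features in this list, the sketch below computes token counts, their ratio and average token lengths for one sentence pair. It uses whitespace splitting instead of spaCy tokenization, so the numbers will differ from those produced by the notebook. `length_features` is a hypothetical helper.

```python
def length_features(source, translation):
    """Token counts, count ratio and mean token lengths for one sentence pair."""
    src_tokens = source.split()
    mt_tokens = translation.split()

    def mean_len(tokens):
        # Guard against empty sentences to avoid dividing by zero
        return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

    return {
        "num_src_tokens": len(src_tokens),
        "num_mt_tokens": len(mt_tokens),
        "token_ratio": len(src_tokens) / len(mt_tokens) if mt_tokens else 0.0,
        "avg_src_token_len": mean_len(src_tokens),
        "avg_mt_token_len": mean_len(mt_tokens),
    }


feats = length_features("The cat sat down", "Die Katze setzte sich")
```

The guards for empty sentences are an addition in this sketch, so an empty line in a dataset yields zeros rather than a division error.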





Set of all possible tag values for spaCy POS tags and dependency tags:

In [0]:
pos = {'ADJ': 0, 'ADP': 1, 'ADV': 2, 'AUX': 3, 'CONJ': 4, 'CCONJ': 5, 'DET': 6, 'INTJ': 7, 'NOUN': 8, 'NUM': 9, 'PART': 10, 'PRON': 11, 'PROPN': 12, 'PUNCT': 13, 'SCONJ': 14, 'SYM': 15, 'VERB': 16, 'X': 17, 'EOL': 18, 'SPACE': 19}

en_dep = {'acl': 0, 'acomp': 1, 'advcl': 2, 'advmod': 3, 'agent': 4, 'amod': 5, 'appos': 6, 'attr': 7, 'aux': 8, 'auxpass': 9, 'case': 10, 'cc': 11, 'ccomp': 12, 'compound': 13, 'conj': 14, 'cop': 15, 'csubj': 16, 'csubjpass': 17, 'dative': 18, 'dep': 19, 'det': 20, 'dobj': 21, 'expl': 22, 'intj': 23, 'mark': 24, 'meta': 25, 'neg': 26, 'nn': 27, 'nounmod': 28, 'npmod': 29, 'nsubj': 30, 'nsubjpass': 31, 'nummod': 32, 'oprd': 33, 'obj': 34, 'obl': 35, 'parataxis': 36, 'pcomp': 37, 'pobj': 38, 'poss': 39, 'preconj': 40, 'prep': 41, 'prt': 42, 'punct': 43, 'quantmod': 44, 'relcl': 45, 'root': 46, 'xcomp': 47, 'clf': 48, 'discourse': 49, 'dislocated': 50, 'fixed': 51, 'flat': 52, 'goeswith': 53, 'iobj': 54, 'list': 55, 'nmod': 56, 'orphan': 57, 'reparandum': 58, 'vocative': 59, 'npadvmod': 60, 'subtok': 61, 'predet': 62}
en_tag = {'$': 0, '``': 1, "''": 2, ',': 3, '-LRB-': 4, '-RRB-': 5, '.': 6, ':': 7, 'ADD': 8, 'AFX': 9, 'PRP$': 10, '_SP': 11, 'WP$': 12, 'CC': 13, 'CD': 14, 'DT': 15, 'EX': 16, 'FW': 17, 'GW': 18, 'HYPH': 19, 'IN': 20, 'JJ': 21, 'JJR': 22, 'JJS': 23, 'LS': 24, 'MD': 25, 'NFP': 26, 'NIL': 27, 'NN': 28, 'NNP': 29, 'NNPS': 30, 'NNS': 31, 'PDT': 32, 'POS': 33, 'PRP': 34, 'RB': 35, 'RBR': 36, 'RBS': 37, 'RP': 38, 'SP': 39, 'SYM': 40, 'TO': 41, 'UH': 42, 'VB': 43, 'VBD': 44, 'VBG': 45, 'VBN': 46, 'VBP': 47, 'VBZ': 48, 'WDT': 49, 'WP': 50, 'WRB': 51, 'XX': 52}
de_dep = {'ac': 0, 'adc': 1, 'ag': 2, 'ams': 3, 'app': 4, 'avc': 5, 'cc': 6, 'cd': 7, 'cj': 8, 'cm': 9, 'cp': 10, 'cvc': 11, 'da': 12, 'dm': 13, 'ep': 14, 'ju': 15, 'mnr': 16, 'mo': 17, 'ng': 18, 'nk': 19, 'nmc': 20, 'oa': 21, 'oa2': 22, 'oc': 23, 'og': 24, 'op': 25, 'par': 26, 'pd': 27, 'pg': 28, 'ph': 29, 'pm': 30, 'pnc': 31, 'punct': 32, 'rc': 33, 're': 34, 'rs': 35, 'sb': 36, 'sbp': 37, 'sp': 38, 'svp': 39, 'uc': 40, 'vo': 41, 'ROOT': 42, 'root': 43, 'subtok': 44, 'dep': 45}
de_tag = {'$(': 0, '$,': 1, '$.': 2, 'ADJA': 3, 'ADJD': 4, 'ADV': 5, 'APPO': 6, 'APPR': 7, 'APPRART': 8, 'APZR': 9, 'ART': 10, 'CARD': 11, 'FM': 12, 'ITJ': 13, 'KOKOM': 14, 'KON': 15, 'KOUI': 16, 'KOUS': 17, 'NE': 18, 'NN': 19, 'NNE': 20, 'PDAT': 21, 'PDS': 22, 'PIAT': 23, 'PIS': 24, 'PPER': 25, 'PPOSAT': 26, 'PPOSS': 27, 'PRELAT': 28, 'PRELS': 29, 'PRF': 30, 'PROAV': 31, 'PTKA': 32, 'PTKANT': 33, 'PTKNEG': 34, 'PTKVZ': 35, 'PTKZU': 36, 'PWAT': 37, 'PWAV': 38, 'PWS': 39, 'TRUNC': 40, 'VAFIN': 41, 'VAIMP': 42, 'VAINF': 43, 'VAPP': 44, 'VMFIN': 45, 'VMINF': 46, 'VMPP': 47, 'VVFIN': 48, 'VVIMP': 49, 'VVINF': 50, 'VVIZU': 51, 'VVPP': 52, 'XY': 53, '_SP': 54}

Functions to extract features from sentences

In [0]:
def num_tokens_in_sent(doc):
    return len(doc)

def avg_token_len(doc):
    return (sum(len(tok.text) for tok in doc) / len(doc))

def num_named_entities(doc):
    return sum(1 for tok in doc if tok.ent_type_ != '')

def num_punctuation_marks(doc):
    return sum(1 for tok in doc if tok.is_punct)

def pos_tag_freqs(doc):
    pos_freqs = np.zeros((len(pos),))
    for tok in doc:
        pos_idx = pos[tok.pos_]
        pos_freqs[pos_idx] += 1
    return pos_freqs

def fine_grained_pos_tag_freqs(doc, lang='en'):    
    tags = en_tag if lang == 'en' else de_tag
    fg_pos_freqs = np.zeros((len(tags),))
    
    for tok in doc:
        fg_pos_idx = tags[tok.tag_]
        fg_pos_freqs[fg_pos_idx] += 1
    return fg_pos_freqs

def dependency_tag_freqs(doc, lang='en'):
    dep_tags = en_dep if lang == 'en' else de_dep
    dep_freqs = np.zeros((len(dep_tags),))
    
    for tok in doc:
        dep_idx = dep_tags[tok.dep_.lower()]
        dep_freqs[dep_idx] += 1
    return dep_freqs
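
Each of the tag functions above turns a sequence of tags into a fixed-length count vector indexed by one of the lookup tables. The sketch below shows the idea on a plain list of tags with a three-entry table, both made up for illustration. Unlike the notebook functions, it skips tags that are missing from the table instead of raising `KeyError`.

```python
import numpy as np

# Hypothetical miniature tag table: tag -> position in the count vector
toy_tags = {"DET": 0, "NOUN": 1, "VERB": 2}


def tag_freqs(tags, tag_index):
    """Count how often each known tag occurs; unknown tags are ignored."""
    freqs = np.zeros((len(tag_index),))
    for tag in tags:
        idx = tag_index.get(tag)
        if idx is not None:
            freqs[idx] += 1
    return freqs


counts = tag_freqs(["DET", "NOUN", "VERB", "DET", "NOUN", "FOO"], toy_tags)
```

Whether to ignore unknown tags or fail loudly is a design choice: ignoring them is more robust to model updates, while a `KeyError` makes an incomplete table visible early.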

In [0]:
def extract_features(nlp_en, nlp_de, dataset='train', save=True):
    sents_f = sents_files[dataset]
    translations_f = trans_files[dataset]

    sentences = load_data(sents_f)
    translations = load_data(translations_f)

    sent_docs = get_doc(nlp_en, sentences)
    trans_docs = get_doc(nlp_de, translations)


    features = [pairwise_features(sent, trans) for sent, trans in zip(sent_docs, trans_docs)]
    sent_features, trans_features = tuple(zip(*features))
    sentence_features = np.array(sent_features)
    translation_features = np.array(trans_features)

    if save:
        np.save(f'{tensor_dir}/{features_dir}/{dataset}_{sentences_file}', sentence_features)
        np.save(f'{tensor_dir}/{features_dir}/{dataset}_{translations_file}', translation_features)
    return sentence_features, translation_features

def pairwise_features(sentence, translation):  
    sentence_features, trans_features = get_features(sentence, lang='en'), get_features(translation, lang='de')
    num_sent_tokens = sentence_features[-1]
    num_trans_tokens = trans_features[-1]
    num_tokens_ratio = num_sent_tokens[0] / num_trans_tokens[0]
    sentence_features.append([num_tokens_ratio])
    trans_features.append([num_tokens_ratio])

    return np.concatenate(sentence_features, axis=0), np.concatenate(trans_features, axis=0)

def get_features(sentence, lang='en'):
    features = []
    features.append(pos_tag_freqs(sentence))
    features.append(fine_grained_pos_tag_freqs(sentence, lang=lang))
    features.append(dependency_tag_freqs(sentence, lang=lang))
    features.append([num_named_entities(sentence)])
    features.append([num_punctuation_marks(sentence)])
    features.append([avg_token_len(sentence)])
    features.append([num_tokens_in_sent(sentence)])
    return features


## Generating tensors

Comment/uncomment the appropriate lines according to which embeddings you would like to generate.

In [0]:
# Generate vectors for training, testing and validation once, saving them so
# they can be loaded faster

spacy_train_sentences, spacy_train_translations = load_spacy_vecs(nlp_english, nlp_german, dataset="train", save=True)
print("spacy train generated")
spacy_val_sentences, spacy_val_translations = load_spacy_vecs(nlp_english, nlp_german, dataset="val", save=True)
print("spacy val generated")
spacy_test_sentences, spacy_test_translations = load_spacy_vecs(nlp_english, nlp_german, dataset="test", save=True)
print("spacy test generated")

# w2v_train_sentences, w2v_train_translations = load_word2vec_vecs(word2vec_english_model, word2vec_german_model, dataset='train', save=True)
# print("word2vec train generated")
# w2v_val_sentences, w2v_val_translations = load_word2vec_vecs(word2vec_english_model, word2vec_german_model, dataset='val', save=True)
# print("word2vec val generated")
# w2v_test_sentences, w2v_test_translations = load_word2vec_vecs(word2vec_english_model, word2vec_german_model, dataset='test', save=True)
# print("word2vec test generated")

# bpe_train_sentences, bpe_train_translations = load_bpe_vecs(bpemb_en, bpemb_de, dataset="train", save=True)
# print("bpemb train loaded")
# bpe_val_sentences, bpe_val_translations = load_bpe_vecs(bpemb_en, bpemb_de, dataset="val", save=True)
# print("bpemb val loaded")
# bpe_test_sentences, bpe_test_translations = load_bpe_vecs(bpemb_en, bpemb_de, dataset="test", save=True)
# print("bpemb test loaded")

train_sent_features, train_trans_features = extract_features(nlp_english, nlp_german, dataset="train", save=True)
print("train features generated")
val_sent_features, val_trans_features = extract_features(nlp_english, nlp_german, dataset='val', save=True)
print("val features generated")
test_sent_features, test_trans_features = extract_features(nlp_english, nlp_german, dataset='test', save=True)
print("test features generated")

# Models

### MLP models

In [0]:
def build_mlp_model(architecture):
    layers = architecture['layers']
    activation = architecture['activation']
    num_features = architecture['num_features']

    # Inputs are sentence embeddings
    sentence = L.Input(shape=(num_features,), name='sentence')
    translation = L.Input(shape=(num_features,), name='translation')

    concat_sent_trans = Concatenate(axis=1)([sentence, translation])

    dense_in = concat_sent_trans
    for layer in layers:
        dense_in = L.Dense(layer, activation=activation)(dense_in)
    # Using dense_in here also works when `layers` is empty
    output = L.Dense(1)(dense_in)

    model = Model([sentence, translation], output)

    return model
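
The `architecture` dictionary fully determines the size of this network. As a worked example, the sketch below counts the trainable parameters of such an MLP: the two inputs are concatenated into `2 * num_features` values, each `Dense` layer adds `inputs * units + units` parameters, and the final `Dense(1)` adds `inputs + 1`. The example values (300 features, hidden layers of 128 and 64 units) are assumptions, not settings from our experiments, and `count_mlp_params` is a hypothetical helper.

```python
def count_mlp_params(architecture):
    """Number of weights and biases in the MLP described by `architecture`."""
    fan_in = 2 * architecture["num_features"]  # sentence + translation, concatenated
    total = 0
    for units in architecture["layers"]:
        total += fan_in * units + units  # kernel + bias of one Dense layer
        fan_in = units
    total += fan_in * 1 + 1  # final Dense(1) regression output
    return total


example_architecture = {
    "layers": [128, 64],
    "activation": "relu",
    "num_features": 300,
}

n_params = count_mlp_params(example_architecture)
print(n_params)  # 85249
```

The same dictionary shape is what `build_mlp_model` reads: `layers`, `activation` and `num_features`.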

In [0]:
def build_features_model(architecture):
    layers = architecture['layers']
    activation = architecture['activation']
    num_features = architecture['num_features_sent']
    num_features_trans = architecture.get('num_features_trans', num_features)

    # Inputs are the extracted feature vectors (see Feature extraction)
    sentence = L.Input(shape=(num_features,), name='sentence')
    translation = L.Input(shape=(num_features_trans,), name='translation')

    concat_sent_trans = Concatenate(axis=1)([sentence, translation])

    dense_in = concat_sent_trans
    for layer in layers:
        dense_in = L.Dense(layer, activation=activation)(dense_in)
    # Using dense_in here also works when `layers` is empty
    output = L.Dense(1)(dense_in)

    model = Model([sentence, translation], output)

    return model

### Recurrent models

In [0]:
def build_gru_model(architecture):
    gru_units = architecture['gru_units']
    layers = architecture['layers']
    activation = architecture['activation']
    num_features = architecture['num_features']

    # Inputs are word embeddings
    sentence = L.Input(shape=(None, num_features), name='sentence')
    translation = L.Input(shape=(None, num_features), name='translation')

    masked_sent = Masking()(sentence)
    masked_sent = Bidirectional(GRU(units=gru_units, return_sequences=True))(masked_sent)
    masked_sent = Bidirectional(GRU(units=gru_units, return_sequences=False))(masked_sent)
    
    masked_trans = Masking()(translation)
    masked_trans = Bidirectional(GRU(units=gru_units, return_sequences=True))(masked_trans)
    masked_trans = Bidirectional(GRU(units=gru_units, return_sequences=False))(masked_trans)

    concat_sent_trans = Concatenate(axis=1)([masked_sent, masked_trans])

    dense_in = concat_sent_trans
    for layer in layers:
        dense_in = L.Dense(layer, activation=activation)(dense_in)
    # Using dense_in here also works when `layers` is empty
    output = L.Dense(1)(dense_in)

    model = Model([sentence, translation], output)

    return model

In [0]:
def build_lstm_model(architecture):
    lstm_units = architecture['lstm_units']
    layers = architecture['layers']
    activation = architecture['activation']
    num_features = architecture['num_features']

    # Inputs are word embeddings
    sentence = L.Input(shape=(None, num_features), name='sentence')
    translation = L.Input(shape=(None, num_features), name='translation')

    masked_sent = Masking()(sentence)
    masked_sent = Bidirectional(LSTM(units=lstm_units, return_sequences=True))(masked_sent)
    masked_sent = Bidirectional(LSTM(units=lstm_units, return_sequences=False))(masked_sent)
    
    masked_trans = Masking()(translation)
    masked_trans = Bidirectional(LSTM(units=lstm_units, return_sequences=True))(masked_trans)
    masked_trans = Bidirectional(LSTM(units=lstm_units, return_sequences=False))(masked_trans)


    concat_sent_trans = Concatenate(axis=1)([masked_sent, masked_trans])

    dense_in = concat_sent_trans
    for layer in layers:
        dense_in = L.Dense(layer, activation=activation)(dense_in)
    # Using dense_in here also works when `layers` is empty
    output = L.Dense(1)(dense_in)

    model = Model([sentence, translation], output)

    return model

### Attention models

In [0]:
def build_attention_model(architecture):
    lstm_units = architecture['lstm_units']
    layers = architecture['layers']
    activation = architecture['activation']
    num_features = architecture['num_features']

    sentence = L.Input(shape=(None, num_features), name='sentence')
    translation = L.Input(shape=(None, num_features), name='translation')

    attention_sent = attention(sentence, key_size=300, val_size=num_features)
    sent_summary = Bidirectional(LSTM(units=lstm_units, return_sequences=True, dropout=0.01))(attention_sent)
    sent_summary = Bidirectional(LSTM(units=lstm_units, return_sequences=False, dropout=0.01))(sent_summary)
    
    attention_trans = attention(translation, key_size=300, val_size=num_features)
    trans_summary = Bidirectional(LSTM(units=lstm_units, return_sequences=True, dropout=0.01))(attention_trans)
    trans_summary = Bidirectional(LSTM(units=lstm_units, return_sequences=False, dropout=0.01))(trans_summary)
    
    concat_sent_trans = Concatenate(axis=1)([sent_summary, trans_summary])

    dense_in = concat_sent_trans
    for layer in layers:
        dense_in = L.Dense(layer, activation=activation)(dense_in)
    # Using dense_in here also works when `layers` is empty
    output = L.Dense(1)(dense_in)

    model = Model([sentence, translation], output)

    return model

def attention(hidden_states, key_size=20, val_size=300):
    # hidden_states: (batch, seq_len, width)
    print("hidden", hidden_states.shape)
    att_key = L.Dense(key_size)(hidden_states)
    # att_key: (batch, seq_len, key_size)
    print("key",  att_key.shape)
    att_q = L.Dense(key_size)(hidden_states)
    # att_q: (batch, seq_len, key_size)
    print("query", att_q.shape)
    # Pass both tensors as layer inputs instead of capturing att_q in the closure
    att_w = L.Lambda(lambda kq: K.batch_dot(kq[0], kq[1], axes=[2, 2]))([att_key, att_q])
    att_w = L.Softmax()(att_w)
    # att_w: (batch, seq_len, seq_len)
    print("weights", att_w.shape)
    att_v = L.Dense(val_size)(hidden_states)
    # att_v: (batch, seq_len, val_size)
    print("values", att_v.shape)
    att_out = L.Lambda(lambda wv: K.batch_dot(wv[0], wv[1]))([att_w, att_v])
    # att_out: (batch, seq_len, val_size)
    print("output", att_out.shape)
    return att_out
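
The shape bookkeeping in `attention` can be checked with plain NumPy. This sketch mirrors the same steps (key, query and value projections, a softmax over the last axis of the key-query products, then a weighted sum of the values) using random matrices in place of the learned `Dense` layers and omitting biases. All sizes and the `attention_np` name are arbitrary choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_np(hidden_states, w_key, w_query, w_value):
    """Self-attention over (batch, seq_len, width) inputs."""
    att_key = hidden_states @ w_key    # (batch, seq_len, key_size)
    att_q = hidden_states @ w_query    # (batch, seq_len, key_size)
    # Dot product of every key with every query: (batch, seq_len, seq_len)
    att_w = softmax(att_key @ att_q.transpose(0, 2, 1), axis=-1)
    att_v = hidden_states @ w_value    # (batch, seq_len, val_size)
    return att_w, att_w @ att_v        # output: (batch, seq_len, val_size)


rng = np.random.default_rng(0)
batch, seq_len, width, key_size, val_size = 2, 5, 8, 4, 6
x = rng.normal(size=(batch, seq_len, width))
weights, out = attention_np(
    x,
    rng.normal(size=(width, key_size)),
    rng.normal(size=(width, key_size)),
    rng.normal(size=(width, val_size)),
)
```

Each row of `weights` sums to one, so every output position is a convex combination of the value vectors of the whole sequence.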

# Training

The following cells define the classes and functions to train models and evaluate their performance. We also define a function for hyper-parameter tuning. 

Define `Sequence` class to split dataset into batches for training

In [0]:
class InputSequence(Sequence):
    def __init__(self, sentences, translations, scores, batch_size):
        self.sentences = sentences
        self.translations = translations
        self.scores = scores
        self.batch_size = batch_size
        
        self.total_batch_size = len(self.sentences)
        self.indexes = np.arange(len(self.sentences))
        # self.on_epoch_end()

    def __len__(self):
        """Returns number of batches"""
        return int(np.ceil(self.total_batch_size / float(self.batch_size)))

    def __getitem__(self, index):
        """Returns the tuple containing list of sentence and translation
        tensors and the expected scores for the translations"""

        start_idx = index * self.batch_size
        end_idx = min(start_idx + self.batch_size, self.total_batch_size)
        current_batch_size = end_idx - start_idx

        # Get slice with indexes for batch
        batch_idx_slice = self.indexes[start_idx: end_idx]

        sentences_batch = self.sentences[batch_idx_slice]
        translations_batch = self.translations[batch_idx_slice]
        scores_batch = self.scores[batch_idx_slice]

        return [sentences_batch, translations_batch], scores_batch
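
The batching arithmetic used by `InputSequence` can be checked in isolation. The sketch below lists the index range of every batch for a given dataset size. The last batch is shorter when the batch size does not divide the number of samples. `batch_bounds` is a hypothetical helper, not part of the notebook.

```python
import math


def batch_bounds(n_samples, batch_size):
    """Return (start, end) index pairs, one per batch, end exclusive."""
    n_batches = math.ceil(n_samples / batch_size)
    bounds = []
    for index in range(n_batches):
        start = index * batch_size
        end = min(start + batch_size, n_samples)  # clip the final batch
        bounds.append((start, end))
    return bounds


bounds = batch_bounds(n_samples=250, batch_size=100)
print(bounds)  # [(0, 100), (100, 200), (200, 250)]
```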

Define utilities to load datasets for training and testing

In [0]:
def shuffle_and_split(d1, d2, indices=None, size=7000):
    """
    Combines two datasets, shuffles them and returns two partitions: the
    first with `size` rows and the second with the remaining rows, together
    with the indices used for the shuffling.

    If indices are not provided, then a random shuffle is applied. Otherwise,
    the provided indices are used for the shuffle.
    """
    if indices is None:
        indices = np.arange(len(d1) + len(d2))
        np.random.shuffle(indices)
    all_data = np.concatenate((d1, d2), axis=0)
    d1_indices, d2_indices = indices[:size], indices[size:]
    return all_data[d1_indices], all_data[d2_indices], indices
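
Reusing the returned `indices` is what keeps sentences, translations and scores aligned after shuffling. The sketch below re-implements the same idea on tiny arrays and checks that row `i` of each shuffled array still refers to the same original sample. The data, the `seed` argument and the `split_with_indices` name are made up for illustration.

```python
import numpy as np


def split_with_indices(d1, d2, indices=None, size=None, seed=0):
    """Concatenate d1 and d2, permute the rows, split after `size` rows."""
    if size is None:
        size = len(d1)
    if indices is None:
        indices = np.random.default_rng(seed).permutation(len(d1) + len(d2))
    all_data = np.concatenate((d1, d2), axis=0)
    return all_data[indices[:size]], all_data[indices[size:]], indices


# Sample i has feature row [i, i] and score 10 * i, so alignment is easy to verify
features_a = np.arange(6).repeat(2).reshape(6, 2)
features_b = np.arange(6, 9).repeat(2).reshape(3, 2)
scores_a = np.arange(6) * 10
scores_b = np.arange(6, 9) * 10

train_x, val_x, idx = split_with_indices(features_a, features_b)
train_y, val_y, _ = split_with_indices(scores_a, scores_b, indices=idx)
```

Calling the function a second time without `indices` would draw a fresh permutation and break the pairing between features and scores.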

def get_train_data(embeddings_model="spacy", avg_word_embeddings=False, shuffle=False):
    train_sentences, train_translations = load_inputs(dataset="train", model=embeddings_model)
    # The dev scores loaded below belong to the "val" tensors (dev.ende.*)
    test_sentences, test_translations = load_inputs(dataset="val", model=embeddings_model)
    train_scores = load_scores('data/en-de/train.ende.scores')
    test_scores = load_scores('data/en-de/dev.ende.scores')

    if avg_word_embeddings:
        # Make sure embeddings are word embeddings, not sentence embeddings
        train_sentences = np.average(train_sentences, axis=1)
        train_translations = np.average(train_translations, axis=1)
        test_sentences = np.average(test_sentences, axis=1)
        test_translations = np.average(test_translations, axis=1)
    
    if shuffle:
        train_sentences, test_sentences, indices = shuffle_and_split(train_sentences, test_sentences)
        train_translations, test_translations, _ = shuffle_and_split(train_translations, test_translations, indices=indices)
        train_scores, test_scores, _ = shuffle_and_split(train_scores, test_scores, indices=indices)
      
    return (train_sentences, train_translations, train_scores), (test_sentences, test_translations, test_scores)

def get_train_sequences(train_data, val_data, batch_size):
    train_sentences, train_translations, train_scores = train_data
    test_sentences, test_translations, test_scores = val_data

    train_seq = InputSequence(train_sentences,
                                    train_translations,
                                    train_scores,
                                    batch_size)

    val_seq = InputSequence(test_sentences,
                                test_translations,
                                test_scores,
                                batch_size)
    
    return train_seq, val_seq

### Training loop

We define below a function to train a model using training and validation datasets.

In [0]:
def fit_model(model, train_seq, val_seq, n_epochs):
    """Trains the model"""
    with tf.device('/GPU:0'):
        callbacks = [TerminateOnNaN(),
                    ModelCheckpoint("saved_models/model_best.hdf5",
                                    monitor='val_loss', verbose=1,
                                    save_best_only=True, save_weights_only=True),
                    ModelCheckpoint("saved_models/model.hdf5", verbose=0,
                                    save_best_only=False, save_weights_only=True),
                    EarlyStopping(monitor='val_loss', patience=7, verbose=1,
                                restore_best_weights=True),
                    TensorBoard(log_dir="saved_models/logs", write_images=True)]

        model.fit_generator(generator=train_seq, epochs=n_epochs,
                            validation_data=val_seq, shuffle=True,
                            callbacks=callbacks)
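
The `EarlyStopping` callback above stops training once `val_loss` has gone 7 epochs without improving. As a rough illustration of that patience rule, here is a simplified pure-Python sketch. The function name `early_stopping_epoch` and the loss values are made up for this example, and details such as a minimum improvement threshold are ignored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; `best_epoch` is where the lowest loss was seen.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses, one per epoch (0-indexed)
toy_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=3)
```

Here the lowest loss occurs at epoch 2 and training stops at epoch 5, after three epochs without improvement. With `restore_best_weights=True`, the weights from the best epoch are the ones kept.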

The function below carries out k-fold cross-validation on a neural network with a given architecture to evaluate its performance. The training data is split into training and validation sets over 7 folds. The validation set is used to monitor the model's performance on unseen data, and training stops early once the validation loss has stopped improving. Each trained model is then evaluated on the test dataset provided (usually the dev dataset).

A new neural network is constructed at the beginning of each fold, using the `build_model` function and the neural network parameters `params`. Tunable parameters for the model depend on its architecture, but some of the ones we used include:
*    `layers`: list containing the number of outputs for hidden `Dense` layers in the network.
*    `num_features`: number of input features for the network.
*    `activation`: activation function to use in the `Dense` layers.
*    `gru_units`/`lstm_units`: dimension of the hidden state of the recurrent layers.
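
To make the 7-fold split concrete, the following sketch runs scikit-learn's `KFold` on a toy index array. The sample count of 21 and the fixed `random_state` are assumptions chosen for this illustration so that the folds divide evenly and the run is repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 21  # toy size, chosen so that 7 folds divide evenly
sample_indexes = np.arange(n_samples)

# random_state is added here only to make the illustration repeatable
k_fold = KFold(n_splits=7, shuffle=True, random_state=0)

fold_sizes = []
seen_in_validation = []
for train_index, val_index in k_fold.split(sample_indexes):
    fold_sizes.append((len(train_index), len(val_index)))
    seen_in_validation.extend(val_index.tolist())
```

Each fold trains on 18 samples and validates on the remaining 3, and every sample appears in a validation set exactly once across the 7 folds.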



In [0]:
def perform_cross_validation(build_model, params, train_data, test_data, lr, batch_size, indices=None, shuffle=False):
    pearson_scores = []
    mae_scores = []
    mse_scores = []

    best_score = 0.0
    
    n_epochs = 25

    train_sentences, train_translations, train_scores = train_data
    test_sentences, test_translations, test_scores = test_data

    if shuffle:
        train_sentences, test_sentences, indices = shuffle_and_split(train_sentences, test_sentences)
        train_translations, test_translations, _ = shuffle_and_split(train_translations, test_translations, indices=indices)
        train_scores, test_scores, _ = shuffle_and_split(train_scores, test_scores, indices=indices)

    train_data_indexes = np.arange(len(train_sentences))

    k_fold = KFold(n_splits=7, shuffle=True)
    
    test_seq = InputSequence(test_sentences,
                            test_translations,
                            test_scores,
                            batch_size)

    for idx, (train_index, test_index) in enumerate(k_fold.split(train_data_indexes, train_data_indexes)):
        # if idx % 2 == 0:
        #     continue
        # print(f"Fold {(idx + 1) // 2} / 7")
        print(f"Fold {idx + 1} / 7")
        X_train = (train_sentences[train_index], train_translations[train_index])
        X_val = (train_sentences[test_index], train_translations[test_index])
        y_train = train_scores[train_index]
        y_val = train_scores[test_index]
        
        print(params)
        model = build_model(params)
        model.compile(optimizer=keras.optimizers.Adam(lr=lr), loss='mean_squared_error',
                    metrics=['mae'])
        
        train_seq, val_seq = get_train_sequences(X_train + (y_train,), X_val + (y_val,), batch_size)

        fit_model(model, train_seq, val_seq, n_epochs)

        predicted_scores = model.predict(test_seq)
        mae_score = mean_absolute_error(test_scores, predicted_scores.squeeze())
        print(f'\nAverage MAE: {mae_score}')
        pearson_score = pearsonr(test_scores, predicted_scores.squeeze())
        print(f'Average Pearson: {pearson_score}')
        mse_score = (mean_squared_error(test_scores, predicted_scores.squeeze())) ** (0.5)
        print(f'Average RMSE: {mse_score}')
        pearson_scores.append(pearson_score)
        mae_scores.append(mae_score)
        mse_scores.append(mse_score)
        
    return pearson_scores, mae_scores, mse_scores
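
The three per-fold metrics can also be written out with NumPy alone, which may help when reading the results. This is a minimal sketch on made-up toy values. The helper name `fold_metrics` is introduced here and is not part of the notebook.

```python
import numpy as np


def fold_metrics(y_true, y_pred):
    """Return (mae, rmse, pearson_r) for two 1-D arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, rmse, pearson_r


toy_true = [0.0, 1.0, 2.0, 3.0]
toy_pred = [0.5, 1.5, 2.5, 3.5]  # every prediction is off by a constant 0.5
mae, rmse, pearson_r = fold_metrics(toy_true, toy_pred)
```

In this toy case MAE and RMSE are both 0.5 while the Pearson correlation is 1, which shows why the three metrics are reported together: correlation ignores a constant offset that the error metrics penalise.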


In [0]:
# Example cross-validation function call for MLP neural network

# These parameters can be changed depending on which model you are building.
architecture = {'num_features': 300, 'layers': [100], 'activation': 'relu'}
(train_data, test_data) = get_train_data(avg_word_embeddings=True)
learning_rate = 0.0005
batch_size = 50

perform_cross_validation(build_mlp_model, architecture, train_data, test_data, learning_rate, batch_size)

### Hyperparameter tuning

The `parameter_search` function performs hyperparameter tuning by random search. Its parameters are as follows:


*   `build_model`: Function to construct the model.
*   `parameters`: Tunable hyperparameters for training, such as `batch_size` and `learning_rate`.
*   `architecture`: Tunable hyperparameters used to change the architecture of the model.
*   `embeddings_model`: The embeddings model to be used - either `spacy`, `word2vec` or `bert`.
*   `avg_word_embeddings`: Whether or not the word embeddings should be averaged to form an averaged sentence embedding.
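
The search samples in two stages: candidate architectures are drawn first, and the resulting list is then treated as one more hyperparameter alongside the training parameters. The sketch below shows that nesting with scikit-learn's `ParameterSampler` on deliberately small toy grids. The grid values and `n_iter` settings here are illustrative assumptions, not the ones used in the notebook.

```python
from sklearn.model_selection import ParameterSampler

toy_architecture = {
    'layers': [[50], [100], [50, 100]],
    'activation': ['relu', 'tanh'],
    'num_features': [300],
}
toy_parameters = {
    'learning_rate': [0.001, 0.0005, 0.002],
    'batch_size': [32, 64],
}

# Stage 1: draw candidate architectures (each one is a plain dict)
sampled_arch = list(ParameterSampler(toy_architecture, n_iter=4, random_state=42))

# Stage 2: treat each sampled architecture as one more value to draw from
toy_parameters['architecture'] = sampled_arch
sampled_params = list(ParameterSampler(toy_parameters, n_iter=5, random_state=42))

for candidate in sampled_params:
    print(candidate['learning_rate'], candidate['batch_size'], candidate['architecture'])
```

Each element of `sampled_params` is a dict holding a learning rate, a batch size and one complete architecture dict, so a single draw fully specifies one cross-validation run.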




In [0]:
def parameter_search(build_model, parameters, architecture, embeddings_model='spacy', avg_word_embeddings=False):
    train_data, val_data = get_train_data(embeddings_model=embeddings_model, avg_word_embeddings=avg_word_embeddings)

    n_epochs = 10  # NOTE: unused here; perform_cross_validation sets its own n_epochs
    sampled_arch_params = list(ParameterSampler(architecture, n_iter=20, random_state=42)) 
    parameters['architecture'] = sampled_arch_params

    sampled_params = list(ParameterSampler(parameters, n_iter=7, random_state=42))
    
    models = []
    scores = []
    best_model = None
    best_score = 0.0
    for i, params in enumerate(sampled_params):
        print(f'Parameter set {i + 1}')
        print(params)

        pearson, mae, mse = perform_cross_validation(build_model, params['architecture'], train_data, val_data, params['learning_rate'], params['batch_size'], shuffle=True)
        mae_score = np.average(mae)
        mse_score = np.average(mse)
        pearson_score = np.average(pearson, axis=0)

        print(f'MAE: {mae_score}')
        print(f'Pearson: {pearson_score}')
        print(f'RMSE: {mse_score}')  # perform_cross_validation returns root-mean-squared errors
        if (pearson_score[0] > best_score):
            best_score = pearson_score[0]
        scores.append((mae_score, mse_score, pearson_score))

    return best_score, scores, sampled_params
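
`pearsonr` returns a pair (correlation, p-value), so the list of per-fold Pearson results is two-dimensional. The short sketch below, using made-up numbers, shows what averaging with `axis=0` produces and why the code reads `pearson_score[0]`.

```python
import numpy as np

# Toy per-fold results shaped like pearsonr output: (correlation, p-value)
toy_pearson = [(0.10, 0.04), (0.20, 0.02), (0.30, 0.03)]
toy_mae = [0.50, 0.60, 0.70]

mean_pearson = np.average(toy_pearson, axis=0)  # averages each column separately
mean_mae = np.average(toy_mae)                  # a single scalar
```

`mean_pearson` is a length-2 array whose first entry is the mean correlation across folds and whose second entry is the mean p-value.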

The functions below perform hyperparameter tuning for each of the models.

In [0]:
def parameter_search_mlp():
    parameters = {
        'learning_rate': [0.001, 0.0005, 0.002],
        'batch_size': [32, 64, 50, 100],
    }

    architecture = {
        'layers': [[50], [50, 100], [25, 100], [150], [512, 256, 256], [256, 128, 128, 256],[100, 100, 50, 50], [400], [75, 50, 25], [100, 100], [100, 200], [400, 200, 100], [150, 100, 50], [200, 100, 100], [100, 50]], 
        'activation': ["relu", "tanh"],
        'num_features': [300]
    }
        
    return parameter_search(build_mlp_model, parameters, architecture, avg_word_embeddings=True) 

In [0]:
def parameter_search_features():
    parameters = {
        'learning_rate': [0.001, 0.0005, 0.002],
        'batch_size': [32, 64, 50, 100],
    }

    architecture = {
        'layers': [[512, 256, 256], [256, 128, 128, 256],[100, 100, 50, 50], [400], [75, 50, 25], [100, 100], [100, 200], [400, 200, 100], [150, 100, 50], [200, 100, 100], [100, 50]], 
        'activation': ["relu", "tanh"],
        'num_features_sent': [141],
        'num_features_trans': [126],
    }
        
    return parameter_search(build_features_model, parameters, architecture, embeddings_model='features', avg_word_embeddings=False) 

In [0]:
def parameter_search_gru():
    parameters = {
        'learning_rate': [0.001, 0.0005, 0.002],
        'batch_size': [32, 64, 50, 100],
    }

    architecture = {
        'gru_units': [64, 32, 128, 96], 
        'layers': [[512, 256, 256], [256, 128, 128, 256],[100, 100, 50, 50], [400], [75, 50, 25], [100, 100], [100, 200], [400, 200, 100], [150, 100, 50], [200, 100, 100], [100, 50]], 
        'activation': ["relu", "tanh"],
        'num_features': [300]
    }
        
    return parameter_search(build_gru_model, parameters, architecture) 

In [0]:
def parameter_search_lstm():
    parameters = {
        'learning_rate': [0.001, 0.0005, 0.002],
        'batch_size': [32, 64, 50, 100],
    }

    architecture = {
        'lstm_units': [64, 32, 128, 96], 
        'layers': [[512, 256, 256], [256, 128, 128, 256],[100, 100, 50, 50], [400], [75, 50, 25], [100, 100], [100, 200], [400, 200, 100], [150, 100, 50], [200, 100, 100], [100, 50]], 
        'activation': ["relu", "tanh"],
        'num_features': [300]
    }

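    # NOTE: this passes `lstm_units` to build_gru_model; if an LSTM builder is defined
    # earlier in the notebook, it should probably be used here instead.
    return parameter_search(build_gru_model, parameters, architecture)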

In [0]:
def parameter_search_attention(use_cosine=False):
    parameters = {
        'learning_rate': [0.001, 0.0005, 0.002],
        'batch_size': [32, 64, 50, 100],
    }

    architecture = {
        'gru_units': [64, 32, 128, 92],
        'layers': [[512, 256, 256], [256, 128, 128, 256],[100, 100, 50, 50], [400], [75, 50, 25], [100, 100], [100, 200], [400, 200, 100], [150, 100, 50], [200, 100, 100], [100, 50]], 
        'activation': ["relu", "tanh"],
        'num_features': [300]
    }
    
    return parameter_search(build_attention_model, parameters, architecture) 

In [0]:
# Example of hyperparameter tuning call
parameter_search_mlp()
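
`parameter_search` returns the best Pearson score, the list of per-candidate scores and the sampled parameter sets, but not the winning parameter set itself. One way to recover it is sketched below. The values are made-up placeholders laid out like the `scores` and `sampled_params` return values, not results from the notebook.

```python
import numpy as np

# Toy values: (mae, rmse, [pearson_r, p_value]) per candidate
toy_scores = [
    (0.55, 0.80, np.array([0.08, 0.01])),
    (0.52, 0.78, np.array([0.15, 0.001])),
    (0.54, 0.79, np.array([0.11, 0.005])),
]
toy_sampled_params = [
    {'learning_rate': 0.001, 'batch_size': 32},
    {'learning_rate': 0.0005, 'batch_size': 64},
    {'learning_rate': 0.002, 'batch_size': 50},
]

# Pick the candidate with the highest mean Pearson correlation
best_index = max(range(len(toy_scores)), key=lambda i: toy_scores[i][2][0])
best_params = toy_sampled_params[best_index]
```

Because `scores` and `sampled_params` are built in the same order, the index of the best score identifies the matching parameter set.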

## Running the model on the test dataset

Build the model with the best architecture and train it on the training and validation datasets combined.

In [0]:
def get_test_predictions(build_model, params, lr, embedding_model='spacy', avg_word_embeddings=False):
    train_data, val_data = get_train_data(embeddings_model=embedding_model, avg_word_embeddings=avg_word_embeddings)

    test_sentences, test_translations = load_inputs(dataset='test', model=embedding_model)

    if avg_word_embeddings:
        test_sentences = np.average(test_sentences, axis=1)
        test_translations = np.average(test_translations, axis=1)

    n_epochs = 2

    model = build_model(params)
    model.compile(optimizer=keras.optimizers.Adam(lr=lr), loss='mean_squared_error',
                metrics=['mae'])
    
    train_sentences, train_translations, train_scores = train_data
    val_sentences, val_translations, val_scores = val_data

    print(train_sentences.shape)
    print(val_sentences.shape)
    train_sentences = np.concatenate((train_sentences, val_sentences), axis=0)
    train_translations = np.concatenate((train_translations, val_translations), axis=0)
    train_scores = np.concatenate((train_scores, val_scores), axis=0)

    train_seq = InputSequence(train_sentences,
                                    train_translations,
                                    train_scores,
                                    64)
    
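    # No scores are loaded for the test set, so placeholder targets are passed;
    # they are ignored at prediction time.
    test_seq = InputSequence(test_sentences,
                             test_translations,
                             np.arange(len(test_translations)),
                             8000)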

    model.fit_generator(generator=train_seq, epochs=n_epochs, shuffle=True)

    predicted_scores = model.predict(test_seq)
    return predicted_scores
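
The two array operations used above, averaging over the token axis and merging the training and validation splits, are easiest to follow by tracking shapes. The sketch below uses random toy tensors. The sizes (5 and 2 sentences, 6 tokens, 300 features) are assumptions for illustration only.

```python
import numpy as np

# Toy tensors: 5 training and 2 validation sentences, 6 tokens, 300 features
toy_train = np.random.rand(5, 6, 300)
toy_val = np.random.rand(2, 6, 300)

# Averaging over the token axis yields one vector per sentence
toy_train_avg = np.average(toy_train, axis=1)
toy_val_avg = np.average(toy_val, axis=1)

# Stacking along axis 0 merges the two splits into a single training set
toy_combined = np.concatenate((toy_train_avg, toy_val_avg), axis=0)
```

If the saved tensors are zero-padded to a fixed length, a plain average like this also counts the padding positions, which is worth keeping in mind when comparing sentences of very different lengths.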

In [0]:
def write_test_predictions(predicted_scores):
    with open("predictions.txt", "w") as f:
        for score in predicted_scores:
            f.write(f"{score[0]}\n")
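
As a quick sanity check of the output format, the sketch below writes a few made-up predictions in the same one-score-per-line layout and reads them back. It writes to a temporary directory rather than to `predictions.txt`, and the values are placeholders.

```python
import os
import tempfile
import numpy as np

# Shape (n, 1), matching the per-row `score[0]` access used above
toy_predictions = np.array([[0.25], [-0.5], [1.0]])

out_path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
with open(out_path, "w") as f:
    for score in toy_predictions:
        f.write(f"{score[0]}\n")

# Read the file back: one float per line, in the original order
with open(out_path) as f:
    read_back = [float(line) for line in f]
```

The file should contain exactly one line per test sentence, in the same order as the inputs.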

In [0]:
# Example of testing the custom features
architecture = {'num_features_sent': 141, 'num_features_trans': 126, 'layers': [400], 'activation': 'tanh'}
predicted_scores = get_test_predictions(build_features_model, architecture, 0.002, embedding_model='features', avg_word_embeddings=False)

write_test_predictions(predicted_scores)