### Word Embeddings
For our first attempt, we used **spacy** word embeddings as the input features for our model.

**Spacy**

The following cells set up spacy. We use the `en_core_web_md` and `de_core_news_md` models to generate word embeddings.

The runtime needs to be restarted after installing both models.

In [0]:
!pip install spacy
!python -m spacy download en_core_web_md
!python -m spacy download de_core_news_md

### Imports

The following cell imports all the necessary libraries and loads the `en_core_web_md` and `de_core_news_md` models using **spacy**.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.svm import SVR
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
from mlxtend.plotting import plot_decision_regions
import numpy as np
import spacy
from scipy.stats import pearsonr

nlp_english = spacy.load("en_core_web_md")
nlp_german = spacy.load("de_core_news_md")

### Data

Loads the English sentences with their corresponding German translations, as well as the translation scores provided. This dataset is used for *training*.

In [0]:
with open('data/en-de/train.ende.src', 'r') as f:
    train_english = []
    for line in f:
        train_english.append(line)

with open('data/en-de/train.ende.mt', 'r') as f:
    train_german = []
    for line in f:
        train_german.append(line)

with open('data/en-de/train.ende.scores', 'r') as f:
    train_scores = []
    for line in f:
        train_scores.append(line)

# Strip the newline and convert each score to a float, as is done for the test scores.
train_scores = [float(score.replace('\n', '')) for score in train_scores]
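
The three loading loops above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not the code used in this notebook: `read_lines` is a hypothetical name, the temporary file and its scores are made up for the demonstration, and unlike the cell above the helper strips the trailing newline from every line.

```python
import os
import tempfile


def read_lines(path, convert=str):
    """Read a text file into a list, stripping the newline and applying `convert` to each line."""
    with open(path, 'r', encoding='utf-8') as f:
        return [convert(line.rstrip('\n')) for line in f]


# Demonstration on a temporary file with made-up scores.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.scores')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('0.5\n-1.25\n')
    demo_scores = read_lines(demo_path, convert=float)
    demo_text = read_lines(demo_path)

print(demo_scores)  # [0.5, -1.25]
```

With such a helper, the score file would be read with `convert=float` and the sentence files with the default `convert=str`.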

### Preprocess

Converts the *training* source sentences and their translations into **spacy** `Doc` objects, each of which is a sequence of `Token` objects.

In [0]:
# n_threads is omitted: it has been deprecated since spaCy 2.1.
doc_english = list(nlp_english.pipe(train_english, batch_size=32))
doc_german = list(nlp_german.pipe(train_german, batch_size=32))

### Embeddings

#### Word Embeddings Method

Uses **spacy** to vectorise the input list of parsed sentences into a zero-padded **numpy** array of shape `[NUMBER OF SENTENCES, MAX SENTENCE LENGTH, EMBEDDING DIMENSION]`. Tokens without a vector are mapped to the vector of the vocabulary entry `unk`.

In [0]:
def embed_sentences(doc, nlp, max_words):
    # Fallback vector for tokens that have no embedding in the model.
    unknown_vector = nlp.vocab['unk'].vector

    # Zero-padded output: max_words rows of word vectors per sentence.
    result = np.zeros((len(doc), max_words, len(unknown_vector)))

    for sentence_index, sentence in enumerate(doc):
        for token_index, token in enumerate(sentence):
            if token.has_vector:
                result[sentence_index, token_index] = token.vector
            else:
                result[sentence_index, token_index] = unknown_vector

    return result
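
To make the padding and the unknown-word fallback concrete without loading a spacy model, here is a small sketch of the same idea on a toy vocabulary. Everything in it is an assumption made for illustration: `toy_vectors`, `toy_unknown` and `embed_toy_sentences` are hypothetical names, and the 3-dimensional vectors are invented.

```python
import numpy as np

# Hypothetical 3-dimensional toy vocabulary; real spacy vectors are much larger.
toy_vectors = {
    'good': np.array([1.0, 0.0, 0.0]),
    'day': np.array([0.0, 1.0, 0.0]),
}
toy_unknown = np.array([0.5, 0.5, 0.5])  # stand-in for nlp.vocab['unk'].vector


def embed_toy_sentences(sentences, max_words):
    """Zero-pad each tokenised sentence to max_words rows of word vectors."""
    result = np.zeros((len(sentences), max_words, len(toy_unknown)))
    for sentence_index, sentence in enumerate(sentences):
        for token_index, token in enumerate(sentence):
            # Fall back to the unknown vector for out-of-vocabulary tokens.
            result[sentence_index, token_index] = toy_vectors.get(token, toy_unknown)
    return result


toy_sentences = [['good', 'day'], ['good', 'xyzzy', 'day']]
toy_embedded = embed_toy_sentences(toy_sentences, max_words=4)

print(toy_embedded.shape)  # (2, 4, 3)
print(toy_embedded[0, 2])  # [0. 0. 0.] (padding row)
```

The out-of-vocabulary token `xyzzy` receives the unknown vector, and every position past the end of a sentence stays at zero.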

#### English and German Embeddings

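Calculates the maximum length (in tokens) of the *training* sentences and then converts them into word embeddings using the method above. The 3-dimensional word embeddings are then reduced to 2-dimensional sentence embeddings by averaging the word vectors over the token axis. Because this average also counts the zero-padded positions, shorter sentences yield vectors of smaller magnitude. Finally, the English and German sentence embeddings are concatenated into one feature vector per sentence pair.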

In [0]:
max_words_english = max(map(len, doc_english))
max_words_german = max(map(len, doc_german))

max_words = max(max_words_english, max_words_german)

english_embeddings = embed_sentences(doc_english, nlp_english, max_words)
german_embeddings = embed_sentences(doc_german, nlp_german, max_words)

english_embeddings = np.average(english_embeddings, axis=1)
german_embeddings = np.average(german_embeddings, axis=1)

word_embeddings = np.concatenate((english_embeddings, german_embeddings), axis=1)
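
The effect of averaging over a zero-padded array can be checked on a toy example. The sketch below uses invented 2-dimensional vectors and a hypothetical `lengths` array. It compares the average used in the cell above with a length-normalised mean, which is shown only as a possible alternative and is not what this notebook computes.

```python
import numpy as np

# One sentence with 2 real tokens, zero-padded to 4 positions (toy 2-dimensional vectors).
padded = np.array([[[2.0, 4.0],
                    [4.0, 8.0],
                    [0.0, 0.0],
                    [0.0, 0.0]]])
lengths = np.array([2])  # number of real tokens per sentence

# Average over the token axis, as in the cell above: padding rows count too.
padded_mean = np.average(padded, axis=1)

# Alternative: divide the sum by the true sentence length instead.
true_mean = padded.sum(axis=1) / lengths[:, None]

print(padded_mean)  # [[1.5 3. ]]
print(true_mean)    # [[3. 6.]]
```

Here the padded average is half the length-normalised mean, because only 2 of the 4 positions hold real tokens.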

### Data

Loads the English sentences of the development (`dev`) split with their corresponding German translations, as well as the translation scores provided. This dataset is used for *testing*.

In [0]:
with open('data/en-de/dev.ende.src', 'r') as f:
    test_english = []
    for line in f:
        test_english.append(line)

with open('data/en-de/dev.ende.mt', 'r') as f:
    test_german = []
    for line in f:
        test_german.append(line)
        
with open('data/en-de/dev.ende.scores', 'r') as f:
    test_scores = []
    for line in f:
        test_scores.append(line)

test_scores = [float(score.replace('\n', '')) for score in test_scores]

### Preprocess

Converts the *testing* source sentences and their translations into **spacy** `Doc` objects, each of which is a sequence of `Token` objects.

In [0]:
# n_threads is omitted: it has been deprecated since spaCy 2.1.
doc_english = list(nlp_english.pipe(test_english, batch_size=32))
doc_german = list(nlp_german.pipe(test_german, batch_size=32))

#### English and German Embeddings

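Calculates the maximum length of the *test* input sentences and then converts the input sentences into word embeddings using the same method as above. The padded length is never shorter than the one used for training, so the averaged vectors are scaled the same way as the training features as long as no test sentence is longer than the longest training sentence.

In [0]:
max_words_english = max(map(len, doc_english))
max_words_german = max(map(len, doc_german))

# max_words still holds the training value at this point; reusing it keeps the
# padding, and therefore the scale of the averaged vectors, consistent with training.
max_words = max(max_words, max_words_english, max_words_german)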

english_embeddings = embed_sentences(doc_english, nlp_english, max_words)
german_embeddings = embed_sentences(doc_german, nlp_german, max_words)

english_embeddings = np.average(english_embeddings, axis=1)
german_embeddings = np.average(german_embeddings, axis=1)

word_embeddings_test = np.concatenate((english_embeddings, german_embeddings), axis=1)

In [0]:
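# Make sure both targets are numeric arrays before fitting the regressors.
train_scores = np.array(train_scores, dtype=float)
test_scores = np.array(test_scores, dtype=float)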

### Statistical Machine Learning Models

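Uses several regression models from **sklearn** to predict the translation quality score. Each model is trained on the averaged word embeddings and the scores of the training data, then evaluated on the test data using the Pearson correlation, the mean absolute error (MAE) and the root mean squared error (RMSE). The model with the highest Pearson correlation is kept as the best model.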


In [0]:
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge, Lasso, LassoLars
from sklearn.ensemble import RandomForestRegressor
import math

regression_models = [
    LinearRegression(),
    Ridge(),
    BayesianRidge(),
    SVR(),
]

best_mae = 1000
best_model = None
best_pearson = -1000
best_rmse = 1000

for index, model in enumerate(regression_models):
    print(f'START TRAINING MODEL NO. {index + 1}...')
    model.fit(word_embeddings, train_scores)
    prediction = model.predict(word_embeddings_test)
    mae = mean_absolute_error(test_scores, prediction)
    pearson_score = pearsonr(prediction, test_scores)
    rmse = math.sqrt(mean_squared_error(test_scores, prediction))
    
    print(f'Model: {str(model)}\n')
    print(f'Pearson Score: {pearson_score}\n')
    print(f'MAE: {mae}\n')
    print(f'RMSE: {rmse}\n')

    if pearson_score[0] > best_pearson:
        best_pearson = pearson_score[0]
        best_model = model
        best_mae = mae
        best_rmse = rmse
    print('FINISHED TRAINING.\n')

print(f'Best Model: {str(best_model)}\n')
print(f'Best Pearson Score: {best_pearson}\n')
print(f'Best MAE: {best_mae}\n')
print(f'Best RMSE: {best_rmse}\n')
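
For reference, the three metrics reported above can be reproduced by hand. This is a minimal sketch with made-up numbers, written in plain **numpy** rather than with the `sklearn` and `scipy` functions used in the cell above. The arrays `y_true` and `y_pred` are hypothetical and are not results from this notebook.

```python
import numpy as np

# Made-up scores, purely for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))                # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
pearson = np.corrcoef(y_pred, y_true)[0, 1]  # Pearson correlation coefficient

print(mae, rmse)          # 0.5 0.5
print(round(pearson, 4))  # approximately 0.8944
```

MAE and RMSE measure how far the predictions are from the true scores, whereas the Pearson correlation only measures how well the two move together, which is why a model can rank first on one metric and not on the others.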