Link to the Weights & Biases report visualizing the training: https://api.wandb.ai/links/jacspa/k45ni2lb

### Environment preparation

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Define constants used in the whole notebook
PATH = "/content/drive/MyDrive/Thesis"
FILEPATH = "pol.txt"
NUM_SAMPLE = 50000
VOCAB_SIZE = 20000
BATCH_SIZE = 64
GLOVE_DIM = 100
SOS = "<sos>"
EOS = "<eos>"

In [3]:
# Import other python scripts
import sys
sys.path.insert(0, PATH)
import preprocessing
import models

In [None]:
# set-up Weights & Biases
!pip install wandb -qqq
import wandb
wandb.login()
from wandb.keras import WandbCallback

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [4]:
# Other imports
import tensorflow as tf
from tensorflow import keras
from keras import callbacks
import numpy as np
import pandas as pd
import string

### Prepare data

In [5]:
# Import data
!wget http://www.manythings.org/anki/pol-eng.zip
!unzip -n -q pol-eng.zip 

--2023-05-24 13:40:40--  http://www.manythings.org/anki/pol-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1868937 (1.8M) [application/zip]
Saving to: ‘pol-eng.zip’


2023-05-24 13:40:42 (1.94 MB/s) - ‘pol-eng.zip’ saved [1868937/1868937]



Function to generate embeddings matrix from pre-trained embeddings.
GloVe embeddings will be used (https://nlp.stanford.edu/projects/glove/)

In [6]:
def get_embedding_matrix(max_tokens, embed_dim):
    # Load the pre-trained GloVe vectors from file: one word and its vector per line
    glove_embeddings = {}
    with open(f"{PATH}/Data/glove.6B.{embed_dim}d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            glove_embeddings[word] = vec

    # Map the GloVe vectors onto our vocabulary (relies on the global eng_vectorization)
    vocabulary = eng_vectorization.get_vocabulary()
    embedding_matrix = np.zeros((max_tokens, embed_dim))
    for i, word in enumerate(vocabulary[:max_tokens]):
        embedding_vector = glove_embeddings.get(word)
        # Words missing from GloVe keep an all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix
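
To see what the mapping does, and how many vocabulary tokens actually receive a pre-trained vector, here is a minimal sketch in plain Python. The toy GloVe lines, the toy vocabulary and the helper names `parse_glove_line` and `build_matrix` are made up for illustration; only the file layout (a word followed by its vector components) and the leading `""` and `"[UNK]"` vocabulary entries follow GloVe and Keras `TextVectorization` conventions.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (word, vector as a list of floats)."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]


def build_matrix(vocabulary, glove, max_tokens, embed_dim):
    """Row i holds the GloVe vector of vocabulary[i], or zeros if the word is unknown."""
    matrix = [[0.0] * embed_dim for _ in range(max_tokens)]
    hits = 0
    for i, word in enumerate(vocabulary[:max_tokens]):
        vector = glove.get(word)
        if vector is not None:
            matrix[i] = vector
            hits += 1
    return matrix, hits


# Hypothetical stand-ins for the GloVe file and the vectorization vocabulary
toy_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
toy_glove = dict(parse_glove_line(line) for line in toy_lines)
toy_vocab = ["", "[UNK]", "the", "cat", "tomek"]

matrix, hits = build_matrix(toy_vocab, toy_glove, max_tokens=5, embed_dim=3)
coverage = hits / len(toy_vocab)
print(f"{hits} of {len(toy_vocab)} tokens have a pre-trained vector ({coverage:.0%})")
```

Tokens without a GloVe entry (padding, `[UNK]`, names, rare words) keep an all-zero row, so a low coverage would mean the pre-trained embeddings contribute little.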

Pre-process data

In [7]:
english_text, polish_text = preprocessing.read_file(FILEPATH, NUM_SAMPLE)
eng_train, eng_val, eng_test, pol_train, pol_val, pol_test = preprocessing.split_data(english_text, polish_text)
eng_vectorization, pol_vectorization, eng_len, pol_len = preprocessing.create_vectorization(eng_train, pol_train, VOCAB_SIZE)
train_ds, val_ds = preprocessing.create_datasets(eng_train, eng_val, pol_train, pol_val, eng_vectorization, pol_vectorization, BATCH_SIZE)
embedding_matrix = get_embedding_matrix(VOCAB_SIZE, GLOVE_DIM)

### Models

We have 8 basic models defined:

*   GRU
*   Bidirectional GRU
*   LSTM
*   Bidirectional LSTM
*   GRU with pretrained embeddings
*   Bidirectional GRU with pretrained embeddings
*   LSTM with pretrained embeddings
*   Bidirectional LSTM with pretrained embeddings

In the first stage, all of them will be trained with the same hyperparameters to choose the best architecture. Then the hyperparameters will be fine-tuned for the selected model. Ideally, hyperparameters should be fine-tuned for all architectures and only then should the best model be selected, but within the scope of this project that approach is infeasible given the available resources.

The shared hyperparameters were not tuned; they are based on Chollet's implementation. As he had a larger dataset (English-Spanish language pair), the embedding dimension and the number of cells in the RNN layers were halved to limit the number of parameters. For bidirectional layers, the number of cells in each of the two layers (forward and backward) was halved again.
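
The effect of halving the cells in bidirectional layers can be checked with a little arithmetic. The sketch below counts the trainable parameters of a single recurrent layer using the standard Keras formulas (GRU with the default `reset_after=True`). It assumes the values `EMBED_DIM = 128` and `LATENT_DIM = 512` from the shared hyperparameters and covers only one recurrent layer, not the whole models defined in `models.py`; the helper names are made up for illustration.

```python
def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and one bias vector
    return 4 * (units * (input_dim + units) + units)


EMBED_DIM, LATENT_DIM = 128, 512

uni_gru = gru_params(EMBED_DIM, LATENT_DIM)
bi_gru = 2 * gru_params(EMBED_DIM, LATENT_DIM // 2)   # forward + backward layer
uni_lstm = lstm_params(EMBED_DIM, LATENT_DIM)
bi_lstm = 2 * lstm_params(EMBED_DIM, LATENT_DIM // 2)

print(f"GRU: {uni_gru:,}  bidirectional GRU: {bi_gru:,}")
print(f"LSTM: {uni_lstm:,}  bidirectional LSTM: {bi_lstm:,}")
```

Under these assumptions a bidirectional layer with halved cells has fewer parameters than its unidirectional counterpart, while its concatenated output keeps the same width of 512.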


In [8]:
# Shared hyperparameters
EMBED_DIM = 128
LATENT_DIM = 512
DROPOUT = 0.5 
EPOCHS = 100
OPTIMIZER = "rmsprop"

Create all the models

In [None]:
models_basic = {
    # at this step we do not need separate encoder and decoder, only the end-to-end model
    "gru": models.create_gru_model(VOCAB_SIZE, EMBED_DIM, LATENT_DIM, DROPOUT, OPTIMIZER)[0],
    "bi_gru": models.create_bi_gru_model(VOCAB_SIZE, EMBED_DIM, int(LATENT_DIM/2), DROPOUT, OPTIMIZER)[0],
    "lstm": models.create_lstm_model(VOCAB_SIZE, EMBED_DIM, LATENT_DIM, DROPOUT, OPTIMIZER)[0],
    "bi_lstm": models.create_bi_lstm_model(VOCAB_SIZE, EMBED_DIM, int(LATENT_DIM/2), DROPOUT, OPTIMIZER)[0],
    "gru_glove": models.create_gru_glove_model(VOCAB_SIZE, GLOVE_DIM, LATENT_DIM, DROPOUT, OPTIMIZER, embedding_matrix)[0],
    # assumption: models.py defines create_bi_gru_glove_model alongside the other bi_* factories
    "bi_gru_glove": models.create_bi_gru_glove_model(VOCAB_SIZE, GLOVE_DIM, int(LATENT_DIM/2), DROPOUT, OPTIMIZER, embedding_matrix)[0],
    "lstm_glove": models.create_lstm_glove_model(VOCAB_SIZE, GLOVE_DIM, LATENT_DIM, DROPOUT, OPTIMIZER, embedding_matrix)[0],
    "bi_lstm_glove": models.create_bi_lstm_glove_model(VOCAB_SIZE, GLOVE_DIM, int(LATENT_DIM/2), DROPOUT, OPTIMIZER, embedding_matrix)[0],
}

In [None]:
early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=5)

In [None]:
# Apart from plotting training results in Weights & Biases,
# the best val_accuracy is saved for each model to verify the results numerically
models_basic_history = {}

for model_name, model in models_basic.items():

    # Start a separate Weights & Biases run for each model
    wandb.init(
        project="Machine_translation",
        name=model_name,
        config={"epochs": EPOCHS})
    logging_callback = WandbCallback(log_evaluation=True, save_model=False)

    r = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds, callbacks=[early_stopping, logging_callback], verbose=2)
    models_basic_history[model_name] = max(r.history["val_accuracy"])
    wandb.finish()  # close the run before the next model starts

In [None]:
# Save best val_accuracy for each model into csv file for future reference
df = pd.DataFrame(models_basic_history.items(), columns=["model", "val_accuracy"])
df.to_csv(f"{PATH}/Results/basic_models.csv", index=False)

Judging by the graphs in "report.pdf" and the table in "basic_models.csv", the best architecture is a simple GRU with pretrained GloVe embeddings. This model is selected for further fine-tuning.

In [11]:
GRU_GLOVE_MODEL_NAME = "gru_glove"

In [None]:
# hyperparameter values for fine-tuning
latent_dims = [512, 1024]
dropouts = [0.5, 0.8]
optimizers = ["rmsprop", "adam"]
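
The three lists above span a grid of 2 x 2 x 2 = 8 configurations. As a minimal sketch, the grid and the resulting run names can be enumerated with `itertools.product`; the name pattern is copied from the training loop in this notebook, and the variable `run_names` is introduced only for illustration.

```python
from itertools import product

latent_dims = [512, 1024]
dropouts = [0.5, 0.8]
optimizers = ["rmsprop", "adam"]

# One run name per combination, in the same order as three nested for-loops
run_names = [
    f"gru_glove_{latent_dim}_units_{dropout}_drop_{optimizer}"
    for latent_dim, dropout, optimizer in product(latent_dims, dropouts, optimizers)
]

print(len(run_names))
for name in run_names:
    print(name)
```

Each name doubles as the Weights & Biases run name and as the checkpoint file name, which is how the best configuration can later be reloaded from its hyperparameter values alone.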

In [None]:
gru_models_history = {}

for latent_dim in latent_dims:
    for dropout in dropouts:
        for optimizer in optimizers:

            model = models.create_gru_glove_model(VOCAB_SIZE, GLOVE_DIM, latent_dim, dropout, optimizer, embedding_matrix)[0]
            model_name = f"{GRU_GLOVE_MODEL_NAME}_{latent_dim}_units_{dropout}_drop_{optimizer}"

            # Weights & Biases callback
            wandb.init(
                project="Machine_translation",
                name=model_name,
                config={
                    "epochs": EPOCHS,
                    "latent_dims": latent_dim,
                    "dropouts": dropout,
                    "optimizers": optimizer})
            config = wandb.config
            logging_callback = WandbCallback(log_evaluation=True, save_model=False)

            # Save best model for each hyperparameter config
            checkpoint_filepath = f"{PATH}/Models/{model_name}.h5"
            model_checkpoint_callback = callbacks.ModelCheckpoint(
                filepath=checkpoint_filepath,
                save_weights_only=True,
                monitor='val_accuracy',
                mode='max',
                save_best_only=True)
            
            r = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds, 
                          callbacks=[early_stopping, logging_callback, model_checkpoint_callback], 
                          verbose=2)
            gru_models_history[model_name] = max(r.history["val_accuracy"])

In [None]:
# Save best val_accuracy for each model into csv file for future reference
df = pd.DataFrame(gru_models_history.items(), columns=["model", "val_accuracy"])
df.to_csv(f"{PATH}/Results/gru_glove_models.csv", index=False)

### Evaluation

The best set of hyperparameters, which allowed the model to reach a validation accuracy of more than 60%, is:

*   latent_dim = 1024
*   dropout = 0.8
*   optimizer = adam

In [13]:
# The best values for hyperparameters
best_latent_dim = 1024
best_dropout = 0.8
best_optimizer = "adam"
best_model_name = f"{GRU_GLOVE_MODEL_NAME}_{best_latent_dim}_units_{best_dropout}_drop_{best_optimizer}"

In [14]:
model, encoder, decoder = models.create_gru_glove_model(VOCAB_SIZE, GLOVE_DIM, best_latent_dim, best_dropout, best_optimizer, embedding_matrix)
model.load_weights(f"{PATH}/Models/{best_model_name}.h5")

In [20]:
# Function for translating one sentence using the encoder-decoder pair
pol_vocab = pol_vectorization.get_vocabulary()
pol_index_lookup = dict(enumerate(pol_vocab))
sos_index = pol_vocab.index(SOS)

def decode_sequence_gru(input_sentence):
    input_sentence_tok = eng_vectorization([input_sentence])
    # The encoder output is the initial decoder state
    states = encoder.predict(input_sentence_tok, verbose=0)
    target_sentence = np.zeros((1, 1))
    target_sentence[0, 0] = sos_index
    decoded_sentence = []
    # Greedy decoding: feed the most probable token back in, one step at a time
    for _ in range(pol_len):
        decoder_output, states = decoder.predict(
            [target_sentence, states],
            verbose=0
        )
        next_token_index = np.argmax(decoder_output[0, 0, :])
        next_token = pol_index_lookup[next_token_index]
        if next_token == EOS:
            break
        decoded_sentence.append(next_token)
        target_sentence[0, 0] = next_token_index
    return ' '.join(decoded_sentence)
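
The function above performs greedy decoding: at every step the most probable token is chosen and fed back into the decoder until the end-of-sequence token or the length limit is reached. The sketch below isolates that loop from Keras. The `toy_step` function and the four-word vocabulary are hypothetical stand-ins for the trained decoder, not part of this project.

```python
def greedy_decode(step_fn, initial_state, sos_index, eos_index, max_len):
    """Feed the last predicted token back in until EOS or max_len is reached."""
    token, state, output = sos_index, initial_state, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_index:
            break
        output.append(token)
    return output


# Hypothetical toy "decoder": always prefers the token that follows the previous one
VOCAB = ["<sos>", "tom", "idzie", "<eos>"]

def toy_step(token, state):
    scores = [0.0] * len(VOCAB)
    scores[(token + 1) % len(VOCAB)] = 1.0
    return scores, state + 1  # the state here only counts decoding steps


indices = greedy_decode(toy_step, initial_state=0, sos_index=0, eos_index=3, max_len=10)
print(" ".join(VOCAB[i] for i in indices))
```

Greedy decoding is cheap but commits to the locally best token at each step; beam search would be a common alternative, at a higher cost per sentence.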

Create translations

In [23]:
gru_translations = []
for eng_sentence in eng_test:
    gru_translations.append(decode_sequence_gru(eng_sentence))

In [26]:
# save translations to file for future reference
import pickle
with open(f"{PATH}/Results/gru_translations.pickle", "wb") as fp:
    pickle.dump(gru_translations, fp)

Translations using pre-trained model

To see how well our custom model really performs, it is useful to compare it with an available open-source model for this task.

In this case I will use the mT5 model fine-tuned for Polish-English and English-Polish translation, shared by Sławomir Dadas. Link to his git repo: https://github.com/sdadas/polish-nlp-resources#t5-based-models

In [None]:
# !pip install sentencepiece
!pip install transformers
from transformers import pipeline

In [44]:
from transformers import pipeline
generator = pipeline("translation", model="sdadas/mt5-base-translator-en-pl")
translations_mt5 = generator(eng_test, max_length=pol_len)
translations_mt5 = [preprocessing.clean_sentence(translation["translation_text"])
                    for translation in translations_mt5]

In [47]:
# save translations to file for future reference
with open(f"{PATH}/Results/translations_mt5.pickle", "wb") as fp:
    pickle.dump(translations_mt5, fp)

##### Qualitative evaluation

Let's look at first ten translations and inspect them manually.

Custom GRU model

In [52]:
for i in range(10):
    print(f"English sentence: {eng_test[i]}")
    print(f"GRU model translation: {gru_translations[i]}")
    print(f"Ground truth translation: {preprocessing.clean_sentence(pol_test[i])}")
    print("-----")

English sentence: She arrived when we were about to leave.
GRU model translation: poszła na niego czekać
Ground truth translation: przyjechała akurat w momencie gdy mieliśmy wychodzić
-----
English sentence: Tom is walking.
GRU model translation: tom jest zajęty
Ground truth translation: tom idzie
-----
English sentence: I didn't want you to know.
GRU model translation: nie chciałem żebyś wiedział
Ground truth translation: nie chciałem żebyś wiedział
-----
English sentence: Do you need time to think it over?
GRU model translation: czy potrzebujesz tego czasu
Ground truth translation: czy potrzebujesz czasu żeby to przemyśleć
-----
English sentence: You look like an imbecile.
GRU model translation: wyglądasz na wykończonego
Ground truth translation: wyglądasz na durnia
-----
English sentence: A lot of people are going to tell you that you shouldn't have done that.
GRU model translation: wiele osób o tym co chcesz zrobić to co zrobił
Ground truth translation: wiele osób ci powie że nie p

Some translations are very precise, and even some that are not completely correct show signs of learning. For example, in the last one the model correctly predicted "nasz nauczyciel" but failed on the rest of the sentence.

Pre-trained mT5 model

In [54]:
for i in range(10):
    print(f"English sentence: {eng_test[i]}")
    print(f"mT5 model translation: {translations_mt5[i]}")
    print(f"Ground truth translation: {preprocessing.clean_sentence(pol_test[i])}")
    print("-----")

English sentence: She arrived when we were about to leave.
mT5 model translation: przyszła kiedy mieliśmy wyjechać
Ground truth translation: przyjechała akurat w momencie gdy mieliśmy wychodzić
-----
English sentence: Tom is walking.
mT5 model translation: tom idzie
Ground truth translation: tom idzie
-----
English sentence: I didn't want you to know.
mT5 model translation: nie chciałem żebyś wiedział
Ground truth translation: nie chciałem żebyś wiedział
-----
English sentence: Do you need time to think it over?
mT5 model translation: potrzebujesz czasu żeby to przemyśleć
Ground truth translation: czy potrzebujesz czasu żeby to przemyśleć
-----
English sentence: You look like an imbecile.
mT5 model translation: wyglądasz jak imbecyl
Ground truth translation: wyglądasz na durnia
-----
English sentence: A lot of people are going to tell you that you shouldn't have done that.
mT5 model translation: wielu ludzi powie ci że nie powinieneś tego robić
Ground truth translation: wiele osób ci p

The pretrained model is almost always correct. Even when a translation does not match the "ground truth", a Polish speaker can judge that it is correct.

##### Quantitative evaluation

The metric that will be used for final evaluation is BLEU score: https://en.wikipedia.org/wiki/BLEU

The 1-gram score will be reported, together with cumulative scores for 2-, 3- and 4-grams.
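
To make the metric concrete, here is a simplified, dependency-free sketch of sentence-level BLEU with a single reference and uniform weights: the geometric mean of the clipped n-gram precisions for n = 1..N, multiplied by a brevity penalty. It is an illustration of the definition, not the NLTK implementation used in the cells below, and the function names are made up. The example sentences are taken from the translations shown earlier.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def cumulative_bleu(reference, candidate, max_n):
    """Geometric mean of the 1..max_n precisions times the brevity penalty (no smoothing)."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


ref_short = "tom idzie".split()
cand_short = "tom jest zajęty".split()
ref_long = "czy potrzebujesz czasu żeby to przemyśleć".split()
cand_long = "potrzebujesz czasu żeby to przemyśleć".split()

print(cumulative_bleu(ref_short, cand_short, 1))  # 1 of 3 words matches
print(cumulative_bleu(ref_short, cand_short, 2))  # no matching bigram
print(cumulative_bleu(ref_long, cand_long, 2))    # perfect precision, penalised for being shorter
```

The second example shows why cumulative scores drop quickly on short sentences: without smoothing, a single n-gram order with no match sets the whole score to zero.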

In [70]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

In [103]:
reference = [preprocessing.clean_sentence(translation).split() for translation in pol_test]
translations_len = len(reference)

In [113]:
# For custom model translations
individual_scores_gru_1 = np.zeros(translations_len)
individual_scores_gru_2 = np.zeros(translations_len)
individual_scores_gru_3 = np.zeros(translations_len)
individual_scores_gru_4 = np.zeros(translations_len)
candidates_gru = [translation.split() for translation in gru_translations]

for i in range(translations_len):
    individual_scores_gru_1[i] = sentence_bleu([reference[i]], candidates_gru[i], weights=(1, 0, 0, 0))
    individual_scores_gru_2[i] = sentence_bleu([reference[i]], candidates_gru[i], weights=(0.5, 0.5, 0, 0))
    individual_scores_gru_3[i] = sentence_bleu([reference[i]], candidates_gru[i], weights=(0.33, 0.33, 0.33, 0))
    individual_scores_gru_4[i] = sentence_bleu([reference[i]], candidates_gru[i], weights=(0.25, 0.25, 0.25, 0.25))

In [114]:
print(f"BLEU 1-gram score for GRU translations: {np.mean(individual_scores_gru_1)}")
print(f"BLEU 2-gram score for GRU translations: {np.mean(individual_scores_gru_2)}")
print(f"BLEU 3-gram score for GRU translations: {np.mean(individual_scores_gru_3)}")
print(f"BLEU 4-gram score for GRU translations: {np.mean(individual_scores_gru_4)}")

BLEU 1-gram score for GRU translations: 0.42312221459802907
BLEU 2-gram score for GRU translations: 0.24776813091318406
BLEU 3-gram score for GRU translations: 0.1296367321528724
BLEU 4-gram score for GRU translations: 0.05721072461345969


Let's look at the individual 1-gram scores.

In [111]:
np.set_printoptions(suppress=True)
gru_scores, counts = np.unique(individual_scores_gru_1, return_counts=True)
print(np.asarray((gru_scores, counts)).T)

[[  0.         590.        ]
 [  0.04511176   1.        ]
 [  0.04931939   1.        ]
 [  0.05882353   1.        ]
 [  0.06062469   1.        ]
 [  0.06666667   1.        ]
 [  0.06690768   1.        ]
 [  0.07142857   1.        ]
 [  0.0716262    1.        ]
 [  0.07243303   1.        ]
 [  0.07357589   1.        ]
 [  0.07692308   6.        ]
 [  0.07961459   1.        ]
 [  0.08333333   2.        ]
 [  0.08556952   2.        ]
 [  0.08786571   1.        ]
 [  0.09048374   1.        ]
 [  0.09090909   6.        ]
 [  0.09196986   2.        ]
 [  0.09306272   2.        ]
 [  0.09622061   1.        ]
 [  0.0973501    2.        ]
 [  0.09942659   2.        ]
 [  0.1         11.        ]
 [  0.1073539    5.        ]
 [  0.10976233   3.        ]
 [  0.11031211   5.        ]
 [  0.11111111  18.        ]
 [  0.11409269   1.        ]
 [  0.11540662   1.        ]
 [  0.11764706   1.        ]
 [  0.11809164   4.        ]
 [  0.11942189   6.        ]
 [  0.12262648   2.        ]
 [  0.1238397 

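We can see there were 368 sentences that scored 1.0, meaning the model produced exactly the words of the reference (the 1-gram score ignores word order), and 590 that scored 0, sharing no word with the reference.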

The same evaluation for mT5.

In [117]:
# For pretrained mT5 model translations
individual_scores_mt5_1 = np.zeros(translations_len)
individual_scores_mt5_2 = np.zeros(translations_len)
individual_scores_mt5_3 = np.zeros(translations_len)
individual_scores_mt5_4 = np.zeros(translations_len)
candidates_mt5 = [translation.split() for translation in translations_mt5]

for i in range(translations_len):
    individual_scores_mt5_1[i] = sentence_bleu([reference[i]], candidates_mt5[i], weights=(1, 0, 0, 0))
    individual_scores_mt5_2[i] = sentence_bleu([reference[i]], candidates_mt5[i], weights=(0.5, 0.5, 0, 0))
    individual_scores_mt5_3[i] = sentence_bleu([reference[i]], candidates_mt5[i], weights=(0.33, 0.33, 0.33, 0))
    individual_scores_mt5_4[i] = sentence_bleu([reference[i]], candidates_mt5[i], weights=(0.25, 0.25, 0.25, 0.25))

In [118]:
print(f"BLEU 1-gram score for mT5 translations: {np.mean(individual_scores_mt5_1)}")
print(f"BLEU 2-gram score for mT5 translations: {np.mean(individual_scores_mt5_2)}")
print(f"BLEU 3-gram score for mT5 translations: {np.mean(individual_scores_mt5_3)}")
print(f"BLEU 4-gram score for mT5 translations: {np.mean(individual_scores_mt5_4)}")

BLEU 1-gram score for mT5 translations: 0.688217699755092
BLEU 2-gram score for mT5 translations: 0.5484003852145302
BLEU 3-gram score for mT5 translations: 0.39269416392431133
BLEU 4-gram score for mT5 translations: 0.2364343227115686


Even the pretrained model only reached around 0.69 on 1-grams, but as we saw, a translation can be correct without exactly matching the "ground truth". The difference from the custom model is most visible on the 2-, 3- and 4-gram scores, which are significantly higher for mT5 (for example 0.24 versus 0.06 on 4-grams).

##### Evaluate on Wikipedia

The notebook wikipedia.ipynb created a file with a translated Wikipedia article about Artificial Intelligence.

In [123]:
df = pd.read_csv(f"{PATH}/Data/wikipedia_translation.csv", index_col=0)
df.head()

|   | english | polish |
|---|---------|--------|
| 0 | Artificial intelligence (AI) is intelligence—p... | Sztuczna inteligencja (AI) to inteligencja - p... |
| 1 | Example tasks in which this is done include sp... | Przykładowe zadania, w których jest to wykonyw... |
| 2 | AI applications include advanced web search en... | Zastosowania sztucznej inteligencji obejmują z... |
| 3 | As machines become increasingly capable, tasks... | W miarę jak maszyny stają się coraz bardziej w... |
| 4 | For instance, optical character recognition is... | Na przykład optyczne rozpoznawanie znaków jest... |


In [126]:
eng_wiki = np.asarray(df["english"])
pol_wiki = np.asarray(df["polish"])

In [133]:
gru_translations_wiki = []
for eng_sentence in eng_wiki:
    gru_translations_wiki.append(decode_sequence_gru(eng_sentence))

In [134]:
with open(f"{PATH}/Results/gru_translations_wiki.pickle", "wb") as fp:
    pickle.dump(gru_translations_wiki, fp)

In [None]:
translations_mt5_wiki = generator(list(eng_wiki), max_length=400)
translations_mt5_wiki = [preprocessing.clean_sentence(translation["translation_text"])
                        for translation in translations_mt5_wiki]

In [141]:
with open(f"{PATH}/Results/translations_mt5_wiki.pickle", "wb") as fp:
    pickle.dump(translations_mt5_wiki, fp)

In [142]:
for i in range(5):
    print(f"English sentence: {eng_wiki[i]}")
    print(f"GRU model translation: {gru_translations_wiki[i]}")
    print(f"Ground truth translation: {preprocessing.clean_sentence(pol_wiki[i])}")
    print("-----")

English sentence: Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans or by other animals.
GRU model translation: nikiel to że nie da się wyuczyć języka obcego jest niewinny ale wsadzili go do samobójstwa
Ground truth translation: sztuczna inteligencja ai to inteligencja  postrzeganie synteza i wnioskowanie informacji  demonstrowana przez maszyny w przeciwieństwie do inteligencji wykazywanej przez ludzi lub inne zwierzęta
-----
English sentence: Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.
GRU model translation: te dane w dzisiejszych czasach niewiele osób mówiących się z rodzimymi użytkownikami
Ground truth translation: przykładowe zadania w których jest to wykonywane obejmują rozpoznawanie mowy widzenie komputerowe tłumaczenie między naturalnymi językami a

Unfortunately, the translations generated by the custom GRU model are of very poor quality.

In [143]:
for i in range(5):
    print(f"English sentence: {eng_wiki[i]}")
    print(f"mT5 model translation: {translations_mt5_wiki[i]}")
    print(f"Ground truth translation: {preprocessing.clean_sentence(pol_wiki[i])}")
    print("-----")

English sentence: Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans or by other animals.
mT5 model translation: sztuczna inteligencja ai to inteligencja – postrzeganie synteza i wyciąganie informacji – demonstrowana przez maszyny w przeciwieństwie do inteligencji wyświetlanej przez ludzi lub inne zwierzęta
Ground truth translation: sztuczna inteligencja ai to inteligencja  postrzeganie synteza i wnioskowanie informacji  demonstrowana przez maszyny w przeciwieństwie do inteligencji wykazywanej przez ludzi lub inne zwierzęta
-----
English sentence: Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.
mT5 model translation: przykładowe zadania w których jest to wykonywane to rozpoznawanie mowy widzenie komputerowe tłumaczenie między naturalnymi językami a także inne 

In [144]:
reference_wiki = [preprocessing.clean_sentence(translation).split() for translation in pol_wiki]
translations_wiki_len = len(reference_wiki)

In [168]:
individual_scores_gru_wiki = np.zeros(translations_wiki_len)
candidates_gru_wiki = [translation.split() for translation in gru_translations_wiki]

for i in range(translations_wiki_len):
    individual_scores_gru_wiki[i] = sentence_bleu([reference_wiki[i]], candidates_gru_wiki[i], weights=(1, 0, 0, 0))

print(f"BLEU 1-gram score for GRU translations: {np.mean(individual_scores_gru_wiki)}")

BLEU 1-gram score for GRU translations: 0.06544866793155198


As expected after manually inspecting the translations, the custom GRU model does not perform well on more complex sentences.

In [167]:
individual_scores_mt5_wiki = np.zeros(translations_wiki_len)
candidates_mt5_wiki = [translation.split() for translation in translations_mt5_wiki]

for i in range(translations_wiki_len):
    individual_scores_mt5_wiki[i] = sentence_bleu([reference_wiki[i]], candidates_mt5_wiki[i], weights=(1, 0, 0, 0))

print(f"BLEU 1-gram score for mT5 translations: {np.mean(individual_scores_mt5_wiki)}")

BLEU 1-gram score for mT5 translations: 0.723360264694758


Somewhat surprisingly, the 1-gram BLEU score of the mT5 model is higher here than on the initial dataset (0.72 versus 0.69). However, with the short sentences in the initial dataset, we saw that the mT5 model produced correct translations that did not match the "ground truth". Perhaps in the longer Wikipedia sentences this ambiguity is less present, so the translations match the "ground truth" more closely.