Human Language Technologies project for the academic year 2020/2021.
English-Italian
Europarl Corpus or the English-Italian dataset from http://www.manythings.org/anki/
- https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt
- https://www.tensorflow.org/text/tutorials/transformer
- https://keras.io/examples/nlp/neural_machine_translation_with_transformer/
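As a rough illustration of preparing the English-Italian sentence pairs mentioned above, the sketch below reads a tab-separated file. It assumes the manythings.org layout of one pair per line (English, a tab, Italian, and optionally a third attribution column); the function name `read_pairs` and the sample sentences are hypothetical and are not taken from the project code.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (english, italian) pairs from a tab-separated text stream.

    Assumed layout: english<TAB>italian[<TAB>attribution]. Any columns
    after the second are ignored.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        english, italian = fields[0].strip(), fields[1].strip()
        if english and italian:
            yield english, italian


# In-memory sample standing in for the downloaded file (hypothetical content).
sample = io.StringIO(
    "Hi.\tCiao!\tCC-BY 2.0 (attribution text)\n"
    "Thank you.\tGrazie.\n"
    "\n"
)
pairs = list(read_pairs(sample))
print(pairs)
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points to the extracted text file.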
The baselines against which we compared the results of our models were chosen from:
- MarianMT, https://huggingface.co/Helsinki-NLP/opus-mt-en-it
- DeltaLM, https://arxiv.org/pdf/2106.13736.pdf
We used https://huggingface.co/dbmdz/bert-base-italian-cased as the Italian tokenizer for each of our models. For the source language, each model uses the tokenizer that matches its encoder.
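The tokenizer setup described above could be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the source-side tokenizer can be loaded from the same checkpoint name as the encoder, and `bert-base-cased` is only a hypothetical example of such an encoder. Loading relies on `AutoTokenizer.from_pretrained` from the Hugging Face `transformers` library and needs network access the first time it runs.

```python
from dataclasses import dataclass

# Italian tokenizer checkpoint named in the text; shared by every model.
TARGET_TOKENIZER = "dbmdz/bert-base-italian-cased"


@dataclass(frozen=True)
class TokenizerPair:
    source: str  # checkpoint name of the encoder (source language)
    target: str  # checkpoint name of the Italian tokenizer


def tokenizer_pair(encoder_checkpoint: str) -> TokenizerPair:
    """Pair an encoder's own tokenizer with the shared Italian tokenizer."""
    return TokenizerPair(source=encoder_checkpoint, target=TARGET_TOKENIZER)


def load_pair(pair: TokenizerPair):
    """Download and return (source_tokenizer, target_tokenizer)."""
    # Imported lazily so the pairing logic works without transformers installed.
    from transformers import AutoTokenizer

    return (
        AutoTokenizer.from_pretrained(pair.source),
        AutoTokenizer.from_pretrained(pair.target),
    )


# "bert-base-cased" is a hypothetical encoder choice, used only as an example.
example = tokenizer_pair("bert-base-cased")
print(example)
```

Calling `load_pair(example)` would then return the two tokenizer objects, which can be applied to the source and target sentences respectively.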