## Translation Model - Demo Run 

This notebook:
* runs the pre-trained 't5-small' model (English -> French, German, Romanian)
* runs a fine-tuned model (trained on the English -> French 'opus_books' data)
* runs a pre-trained English -> Turkish translation model (Helsinki-NLP/opus-mt-tc-big-en-tr)
* evaluates model performance (BLEU score)

In [4]:
# Import Libraries
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import MarianMTModel, MarianTokenizer
import evaluate

In [2]:
# Translate Function: pre-process, tokenize, predict, and detokenize

def translate(text, prefix, model, tokenizer, print_detail=0):
    text_wp = prefix + text
    text_tokenized = tokenizer(text_wp, return_tensors='pt')
    out_tokenized = model.generate(**text_tokenized, max_length=128)
    text_translated = tokenizer.decode(out_tokenized[0], skip_special_tokens=True)

    # Optionally print every intermediate step for debugging
    if print_detail:
        print('Input text: ', text)
        print('Text with prefix: ', text_wp)
        print('Text tokenized: ', text_tokenized)
        print('Output tokenized: ', out_tokenized)
        print('Translated text: ', text_translated)

    return text_translated 
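
The function above handles one sentence per call. A minimal sketch of a batched variant follows, assuming the same Hugging Face tokenizer and model objects are passed in. The name `translate_batch` and the `max_new_tokens` default are illustrative choices, not part of the original notebook.

```python
def translate_batch(texts, prefix, model, tokenizer, max_new_tokens=128):
    """Translate a list of strings with a single generate() call.

    Hypothetical helper: assumes `model` and `tokenizer` are Hugging Face
    seq2seq objects like the ones loaded in this notebook.
    """
    # Add the task prefix to every input (use "" for models that need none)
    texts_wp = [prefix + t for t in texts]

    # padding=True pads to the longest input so the batch forms one tensor;
    # truncation=True cuts inputs that exceed the tokenizer's model_max_length
    batch = tokenizer(texts_wp, return_tensors='pt', padding=True, truncation=True)

    # max_new_tokens bounds the length of each generated translation
    out_tokenized = model.generate(**batch, max_new_tokens=max_new_tokens)

    # Decode every sequence in the batch back to plain text
    return tokenizer.batch_decode(out_tokenized, skip_special_tokens=True)
```

With the objects defined in this notebook, a call might look like `translate_batch(['Welcome!', 'Thank you!'], "translate English to French: ", model_t5s, tokenizer_t5s)`. If long translations come back cut off, raising `max_new_tokens` is one thing to try.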

### Pre-trained 't5-small' model (English -> French, German, Romanian)

In [3]:
# Load model and tokenizer

model_t5s = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer_t5s = AutoTokenizer.from_pretrained("t5-small", model_max_length=128)

In [4]:
# Translate English -> French

prefix = "translate English to French: "
text = 'Education is not the learning of facts, but the training of the mind to think!'

translate(text, prefix, model_t5s, tokenizer_t5s)

"L'éducation n'est pas l'apprentissage des faits, mais la formation de l'esprit à penser!"

In [5]:
# Translate English -> German

prefix = "translate English to German: "
text = 'Education is not the learning of facts, but the training of the mind to think!'

translate(text, prefix, model_t5s, tokenizer_t5s)

'Bildung ist nicht das Erlernen von Fakten, sondern die Ausbildung des Geistes zum Denken!'

In [52]:
# Translate English -> Romanian

prefix = "translate English to Romanian: "
text = 'Education is not the learning of facts, but the training of the mind to think!'

translate(text, prefix, model_t5s, tokenizer_t5s)

'Educaţia nu înseamnă învăţarea faptelor, ci formarea spiritului de gândire!'
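
The three cells above differ only in the target language named in the task prefix. As a rough sketch, the prefix can be built in a loop. The helper `translate_to_targets` and its `translate_fn` argument are hypothetical names introduced here for illustration.

```python
def translate_to_targets(text, targets, translate_fn, source="English"):
    """Return {target_language: translation} using T5-style task prefixes.

    `translate_fn(text, prefix)` is assumed to wrap a translate call that
    already knows which model and tokenizer to use.
    """
    results = {}
    for target in targets:
        # Same prefix pattern as the cells above, e.g. "translate English to German: "
        prefix = f"translate {source} to {target}: "
        results[target] = translate_fn(text, prefix)
    return results
```

In this notebook the wrapper could be `lambda t, p: translate(t, p, model_t5s, tokenizer_t5s)`, called with `['French', 'German', 'Romanian']` as the targets.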

### Fine-tuned model (English -> French)

The 'translatorModel3' model is fine-tuned from the pre-trained 't5-small' model on the 'opus_books' dataset.

PyTorch is used for training.

The Hugging Face translation guide is followed:

    https://huggingface.co/docs/transformers/tasks/translation



In [5]:
# Load fine-tuned model and tokenizer

#model_directory = './translatorModel3' 
model_directory = 'kamileyagci/t5small-finetuned-opusbooks-en-fr'
model_tuned = AutoModelForSeq2SeqLM.from_pretrained(model_directory)
tokenizer_tuned = AutoTokenizer.from_pretrained(model_directory, model_max_length=128)

In [6]:
# Translate

prefix = "translate English to French: "
text = 'Education is not the learning of facts, but the training of the mind to think!'

translate(text, prefix, model_tuned, tokenizer_tuned)

"L'éducation n'est pas l'apprentissage des faits, mais la formation de l'esprit à penser!"

In [9]:
# Translate

prefix = "translate English to French: "
text = 'Welcome!'

translate(text, prefix, model_tuned, tokenizer_tuned)

'Bienvenue!'

### Pre-trained English-Turkish Translation (Helsinki-NLP/opus-mt-tc-big-en-tr)

In [7]:
# Load model and tokenizer

tr_model_name = "Helsinki-NLP/opus-mt-tc-big-en-tr"

model_tr = MarianMTModel.from_pretrained(tr_model_name)
tokenizer_tr = MarianTokenizer.from_pretrained(tr_model_name)

In [10]:
#Translate English -> Turkish

prefix = ""
text = 'Education is not the learning of facts, but the training of the mind to think!'

translate(text, prefix, model_tr, tokenizer_tr)

'Eğitim, olguların öğrenilmesi değil, zihnin düşünme eğitimidir!'

### Model performance with BLEU score

In [10]:
metric = evaluate.load("sacrebleu")

In [11]:
books = load_dataset("opus_books", "en-fr")

#### Evaluate t5-small pre-trained model

In [12]:
#### Example 1 with t5-small

prefix = "translate English to French: "
text = 'Education is not the learning of facts, but the training of the mind to think!'
ref_text = "L'éducation n'est pas l'apprentissage de faits, mais l'entraînement de l'esprit à penser." #from Google translate

pred_text = translate(text, prefix, model_t5s, tokenizer_t5s)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  Education is not the learning of facts, but the training of the mind to think!
Reference text:  L'éducation n'est pas l'apprentissage de faits, mais l'entraînement de l'esprit à penser.
Predicted text:  L'éducation n'est pas l'apprentissage des faits, mais la formation de l'esprit à penser!

SacreBLEU result:


{'score': 40.48411918659966,
 'counts': [11, 8, 5, 2],
 'totals': [15, 14, 13, 12],
 'precisions': [73.33333333333333,
  57.142857142857146,
  38.46153846153846,
  16.666666666666668],
 'bp': 1.0,
 'sys_len': 15,
 'ref_len': 14}
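
To make the fields in this result easier to read, here is a small sketch that recomputes the score from `counts`, `totals`, `sys_len`, and `ref_len` using the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions). The function name `bleu_from_counts` is made up for this example. It does not reproduce sacreBLEU's smoothing, so it refuses inputs with a zero n-gram count.

```python
import math

def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Recompute a BLEU score (0-100) from matched/total n-gram counts."""
    if min(counts) == 0:
        # sacreBLEU smooths zero counts; this sketch deliberately does not
        raise ValueError("zero n-gram count: smoothing is not handled here")

    # Geometric mean of the 1- to 4-gram precisions, computed in log space
    log_precisions = [math.log(c / t) for c, t in zip(counts, totals)]
    geo_mean = math.exp(sum(log_precisions) / len(log_precisions))

    # Brevity penalty: only applies when the prediction is shorter than the reference
    bp = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)

    return 100.0 * bp * geo_mean

# Values taken from the result printed above; this gives roughly 40.48
print(bleu_from_counts([11, 8, 5, 2], [15, 14, 13, 12], sys_len=15, ref_len=14))
```

Because the prediction here (15 tokens) is longer than the reference (14 tokens), the brevity penalty is 1.0 and the score is simply the geometric mean of the four precisions.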

In [14]:
# Example 2 with t5-small

id = 20

prefix = "translate English to French: "
text = books["train"][id]['translation']['en']
ref_text = books["train"][id]['translation']['fr']
pred_text = translate(text, prefix, model_t5s, tokenizer_t5s)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

# Evaluate BLEU score

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  And that quiet countryside - the school, old Father Martin's field, with its three walnut trees, the garden daily invaded on the stroke of four by women paying calls - all this is, in my memory, forever stirred and transformed by the presence of him who upset all our youth and whose sudden flight even did not leave us in peace.
Reference text:  Tout ce paysage paisible – l’école, le champ du père Martin, avec ses trois noyers, le jardin dès quatre heures envahi chaque jour par des femmes en visite – est à jamais, dans ma mémoire, agité, transformé par la présence de celui qui bouleversa toute notre adolescence et dont la fuite même ne nous a pas laissé de repos.
Predicted text:  Et cette campagne tranquille - l'école, le champ du père Martin, avec ses trois noix, le jardin envahi quotidiennement à l'arrière-plan de quatre par des femmes qui paient des appels - c'est, à ma mémoire, à jamais étonné et transformé par la présence de lui qui a bouleversé toute notre jeunesse et

{'score': 29.182367007439346,
 'counts': [42, 25, 15, 9],
 'totals': [68, 67, 66, 65],
 'precisions': [61.76470588235294,
  37.3134328358209,
  22.727272727272727,
  13.846153846153847],
 'bp': 1.0,
 'sys_len': 68,
 'ref_len': 66}

#### Evaluate fine-tuned model

In [15]:
#### Example 1 with fine-tuned

prefix = "translate English to French: "
text = 'Education is not the learning of facts, but the training of the mind to think!'
ref_text = "L'éducation n'est pas l'apprentissage de faits, mais l'entraînement de l'esprit à penser." #from Google translate

pred_text = translate(text, prefix, model_tuned, tokenizer_tuned)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  Education is not the learning of facts, but the training of the mind to think!
Reference text:  L'éducation n'est pas l'apprentissage de faits, mais l'entraînement de l'esprit à penser.
Predicted text:  L'éducation n'est pas l'apprentissage des faits, mais la formation de l'esprit à penser!

SacreBLEU result:


{'score': 40.48411918659966,
 'counts': [11, 8, 5, 2],
 'totals': [15, 14, 13, 12],
 'precisions': [73.33333333333333,
  57.142857142857146,
  38.46153846153846,
  16.666666666666668],
 'bp': 1.0,
 'sys_len': 15,
 'ref_len': 14}

In [85]:
# Example 2 with fine-tuned

id = 20

prefix = "translate English to French: "
text = books["train"][id]['translation']['en']
ref_text = books["train"][id]['translation']['fr']
pred_text = translate(text, prefix, model_tuned, tokenizer_tuned)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

# Evaluate BLEU score

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  And that quiet countryside - the school, old Father Martin's field, with its three walnut trees, the garden daily invaded on the stroke of four by women paying calls - all this is, in my memory, forever stirred and transformed by the presence of him who upset all our youth and whose sudden flight even did not leave us in peace.
Reference text:  Tout ce paysage paisible – l’école, le champ du père Martin, avec ses trois noyers, le jardin dès quatre heures envahi chaque jour par des femmes en visite – est à jamais, dans ma mémoire, agité, transformé par la présence de celui qui bouleversa toute notre adolescence et dont la fuite même ne nous a pas laissé de repos.
Predicted text:  Et cette campagne tranquille - l'école, le champ du père Martin, avec ses trois noix, le jardin envahi quotidiennement à l'arrière-plan de quatre par des femmes qui s'encouragèrent - tout cela est, à ma mémoire, étonné et transformé par la présence de lui qui a bouleversé toute notre jeunesse et do

{'score': 28.580345320702776,
 'counts': [42, 23, 15, 9],
 'totals': [68, 67, 66, 65],
 'precisions': [61.76470588235294,
  34.32835820895522,
  22.727272727272727,
  13.846153846153847],
 'bp': 1.0,
 'sys_len': 68,
 'ref_len': 66}

**Note: Why doesn't the fine-tuned model perform better?**
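
One way to probe this question is to score many sentences at once, since BLEU on a single sentence is noisy and two examples cannot separate the models reliably. The sketch below is an assumption about how such a comparison could be set up, not something run in this notebook. `corpus_score` is a hypothetical helper that takes the metric and a translation callable as arguments.

```python
def corpus_score(pairs, translate_fn, metric):
    """Score a collection of (source_text, reference_text) pairs in one call.

    `translate_fn(source_text)` is assumed to return the predicted string;
    `metric` is assumed to be the object from evaluate.load("sacrebleu").
    """
    predictions, references = [], []
    for src, ref in pairs:
        predictions.append(translate_fn(src))
        references.append([ref])  # one list of references per prediction

    # A single compute() call aggregates n-gram counts over all sentences
    return metric.compute(predictions=predictions, references=references)

# Possible usage with the objects from this notebook (the sample size of 100 is arbitrary):
# sample = books["train"].select(range(100))
# pairs = [(ex["translation"]["en"], ex["translation"]["fr"]) for ex in sample]
# prefix = "translate English to French: "
# corpus_score(pairs, lambda s: translate(s, prefix, model_t5s, tokenizer_t5s), metric)
# corpus_score(pairs, lambda s: translate(s, prefix, model_tuned, tokenizer_tuned), metric)
```

Note that if the fine-tuning split was drawn from `books["train"]`, as in the Hugging Face guide, some sampled sentences may have been seen during training, so the result should be read with that caveat.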


#### Evaluate English -> Turkish translator

In [16]:
### Example 1 with English -> Turkish translator

prefix = ""
text = 'Education is not the learning of facts, but the training of the mind to think!'
ref_text = "Eğitim, gerçeklerin öğrenilmesi değil, düşünmek için zihnin eğitilmesidir." #from Google translate

pred_text = translate(text, prefix, model_tr, tokenizer_tr)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  Education is not the learning of facts, but the training of the mind to think!
Reference text:  Eğitim, gerçeklerin öğrenilmesi değil, düşünmek için zihnin eğitilmesidir.
Predicted text:  Eğitim, olguların öğrenilmesi değil, zihnin düşünme eğitimidir!

SacreBLEU result:


{'score': 18.60045401920258,
 'counts': [6, 3, 1, 0],
 'totals': [10, 9, 8, 7],
 'precisions': [60.0, 33.333333333333336, 12.5, 7.142857142857143],
 'bp': 0.9048374180359595,
 'sys_len': 10,
 'ref_len': 11}

In [17]:
### Example 2 with English -> Turkish translator

prefix = ""
text = books["train"][id]['translation']['en']
# Adjacent string literals are joined, so no indentation leaks into the reference
ref_text = ("Ve o sessiz kırsal bölge - okul, yaşlı Peder Martin'in tarlası, "
            "üç ceviz ağacıyla, her gün dört kişi telefon eden kadınlar tarafından istila edilen bahçe - tüm bunlar, "
            "onun varlığıyla sonsuza dek karıştırılmış ve dönüştürülmüştür. bütün gençlerimizi üzen, "
            "ani kaçışı bile bizi rahat bırakmayan.") #Google translate

pred_text = translate(text, prefix, model_tr, tokenizer_tr)

print('Input text: ', text)
print('Reference text: ', ref_text)
print('Predicted text: ', pred_text)

print('\nSacreBLEU result:')

metric = evaluate.load("sacrebleu")
metric.compute(predictions=[pred_text], references=[[ref_text]])

Input text:  And that quiet countryside - the school, old Father Martin's field, with its three walnut trees, the garden daily invaded on the stroke of four by women paying calls - all this is, in my memory, forever stirred and transformed by the presence of him who upset all our youth and whose sudden flight even did not leave us in peace.
Reference text:  Ve o sessiz kırsal bölge - okul, yaşlı Peder Martin'in tarlası,     üç ceviz ağacıyla, her gün dört kişi telefon eden kadınlar tarafından istila edilen bahçe - tüm bunlar,         onun varlığıyla sonsuza dek karıştırılmış ve dönüştürülmüştür. bütün gençlerimizi üzen,             ani kaçışı bile bizi rahat bırakmayan.
Predicted text:  Ve o sessiz kırsal - okul, yaşlı Peder Martin'in tarlası, üç ceviz ağacıyla, bahçe günlük olarak çağrı yapan kadınlar tarafından dört vuruşla işgal edildi - tüm bunlar, hafızamda, tüm gençlerimizi üzen ve ani uçuşu bile bizi huzur içinde bırakmayan kişinin varlığıyla sonsuza kadar karıştırıldı ve dönüşt

{'score': 36.33014355647905,
 'counts': [35, 21, 14, 11],
 'totals': [52, 51, 50, 49],
 'precisions': [67.3076923076923, 41.1764705882353, 28.0, 22.448979591836736],
 'bp': 1.0,
 'sys_len': 52,
 'ref_len': 51}