# Example Run
* One possible way to conduct the translation task
* In the real run, pairs will be chosen deliberately and, depending on the translator, we may run on batches of pairs or on all of them at once
* In this demo, we choose some pairs at random

## Translation Task

* Code used for translation, namely `data_management`, `util`, `translators` and `task`, MUST NOT CHANGE during or after translation.
* We have to decide at which commit the code is considered `fixed`; after that, these modules must remain untouched.
* If Git still tracks changes to them, those changes must not alter how the code behaves.
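
One way to check that the frozen modules stay untouched is to record a hash of each file at the chosen commit and compare against it later. The following is a minimal sketch, not part of the repository: the helper names `snapshot` and `changed_files` are hypothetical, and the file paths passed in are whatever the modules are called locally.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each file path to the SHA-256 digest of its current content."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(frozen, paths):
    """Return the paths whose content differs from the frozen snapshot."""
    current = snapshot(paths)
    return sorted(p for p in current if current[p] != frozen.get(p))
```

Storing the result of `snapshot` next to the translations would let a later run confirm that `changed_files` returns an empty list.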

In [None]:
from scripts.task import TranslationTask
from scripts.data_management import EuroParlManager
from scripts.translators import GPT4Client, DeeplClient
from scripts.util import MyLogger
from os.path import join
from random import sample, seed
seed(42)
possible = [tuple(pair.split('-')) for pair in EuroParlManager.EP_PAIRS]
extended = [(pair[1], pair[0]) for pair in possible]
possible = possible + extended
some_pairs = sample(sorted(possible), k=8)

example_folder = 'exmpl'
mt_folder_gpt = join(example_folder, 'gpt41')
mt_folder_deepl = join(example_folder, 'deepl')

dm = EuroParlManager()
logger = MyLogger(logfile=join(example_folder, 'log.jsonl'))
client_gpt = GPT4Client(logger=logger)
client_deepl = DeeplClient(logger=logger)

In [2]:
some_pairs

[('nl', 'de'),
 ('de', 'fi'),
 ('da', 'es'),
 ('pt', 'es'),
 ('en', 'fr'),
 ('en', 'de'),
 ('el', 'pt'),
 ('de', 'nl')]

In [3]:
task_gpt = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_gpt,
    logger=logger,
    mt_folder=mt_folder_gpt,
    num_of_sents=50
)

task_deepl = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_deepl,
    logger=logger,
    mt_folder=mt_folder_deepl,
    num_of_sents=50
)

In [4]:
task_deepl.run()

Document for pair nl-de has been translated already.
50 translated from nl to de
Document for pair de-fi has been translated already.
50 translated from de to fi
Document for pair da-es has been translated already.
50 translated from da to es
Document for pair pt-es has been translated already.
50 translated from pt to es
Document for pair en-fr has been translated already.
50 translated from en to fr
Document for pair en-de has been translated already.
50 translated from en to de
Document for pair el-pt has been translated already.
50 translated from el to pt
Document for pair de-nl has been translated already.
50 translated from de to nl


In [5]:
task_gpt.run()

Document for pair nl-de has been translated already.
50 translated from nl to de
Document for pair de-fi has been translated already.
50 translated from de to fi
Document for pair da-es has been translated already.
50 translated from da to es
Document for pair pt-es has been translated already.
50 translated from pt to es
Document for pair en-fr has been translated already.
50 translated from en to fr
Document for pair en-de has been translated already.
50 translated from en to de
Document for pair el-pt has been translated already.
50 translated from el to pt
Document for pair de-nl has been translated already.
50 translated from de to nl


* "has been translated already" shows up because the code checks the `mt_folder` for existing files and if it finds them, it will not make the API call
* This makes it safer overall to run API-calling code within Jupyter notebooks, since re-running a cell does not trigger new requests
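
The skip-if-present behaviour can be sketched as follows. This is an illustration under assumptions, not the code in `scripts.task`: the function `translate_once` and its `translate_fn` parameter are hypothetical, and the `{src}-{tgt}.txt` file name follows the naming seen in the MT folders.

```python
from pathlib import Path


def translate_once(src_lang, tgt_lang, sents, mt_folder, translate_fn):
    """Translate only if no output file for this pair exists yet."""
    out_file = Path(mt_folder) / f'{src_lang}-{tgt_lang}.txt'
    if out_file.exists():
        print(f'Document for pair {src_lang}-{tgt_lang} has been translated already.')
        with out_file.open(encoding='utf-8') as f:
            return [ln.rstrip('\n') for ln in f]
    mt_sents = translate_fn(sents)  # the only place where an API call would happen
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open('w', encoding='utf-8') as f:
        f.writelines(s + '\n' for s in mt_sents)
    return mt_sents
```

Because the check is based on files on disk, it also survives a kernel restart, which an in-memory cache would not.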

## Logs
* We can print our logs within the notebook, but it is safer to store them externally.
* This notebook can be re-run after translation: API calls will not be made, but the logs shown in the notebook will change
* Externally stored logs reflect the state at the time of translation and can be viewed through Python or Unix commands
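
As an example of viewing the stored logs through Python, the sketch below counts entries and sums the request time per translator. It assumes each line of the log file is a JSON object with a `translator` and a `time` field; the function name `summarize_log` is hypothetical.

```python
import json
from collections import defaultdict


def summarize_log(path):
    """Return {translator: (number of entries, total seconds)} for a JSONL log."""
    summary = defaultdict(lambda: [0, 0.0])
    with open(path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            stats = summary[entry['translator']]
            stats[0] += 1
            stats[1] += entry['time']
    return {name: (n, secs) for name, (n, secs) in summary.items()}
```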

In [6]:
!cat $example_folder/log.jsonl | head -n 1
!cat $example_folder/log.jsonl | tail -n 1

{"translator": "deepl_document", "src_lang": "nl", "tgt_lang": "de", "start": 1745503704.618983, "id": "56b7f065-4230-4cc4-99aa-54aa00e459b2", "in_lines": 50, "in_sents": 56, "timestamp": "2025-04-24 16:08:24.635019+02:00", "in_chars": 7097, "in_tokens": 1561, "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 50, "start_idx": 0, "split": "train[:500]"}, "end": 1745503710.2351432, "time": 5.616160154342651, "out_chars": 7092, "out_lines": 50, "out_sents": 56, "out_tokens": 1533, "status": "done", "error_msg": null}
{"translator": "gpt-4.1", "src_lang": "de", "tgt_lang": "nl", "start": 1745503943.4546154, "id": "63441a16-ae9d-433b-8a44-e7adf2ed54fc", "in_lines": 50, "in_sents": 53, "timestamp": "2025-04-24 16:12:23.465651+02:00", "in_chars": 6889, "in_tokens": 1516, "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 50, "start_idx": 0, "split": "train[:500]"}, "end": 1745503963.2170389, "time": 19.762423515319824, "out_chars": 6836, "out_lines": 50, "out_sents": 53, 

* Logs for DeepL and GPT-4.1 can differ based on what the respective API provides in the response body.
* DeepL provides a `"status"` field with values such as `done`, which tells us that the request has been processed fully.
* GPT-4.1's response body contains the actual number of tokens used (we also estimate them using tiktoken) and a `"finish_reason"`. The value `"stop"` tells us that the output came from a request that was processed fully; `"length"` would instead mean the output was likely cut off because the token limit was reached
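
A check for fully processed requests could look like the sketch below. The `status` key appears in the DeepL log entry; the GPT log line is truncated in the output above, so storing the API's `finish_reason` under a key of the same name is an assumption, as is the helper name `is_complete`.

```python
def is_complete(entry):
    """Check whether a log entry describes a fully processed request."""
    if 'finish_reason' in entry:           # assumed key for GPT entries
        return entry['finish_reason'] == 'stop'
    return entry.get('status') == 'done'   # key seen in the DeepL entry
```

Entries for which this returns `False` would be candidates for a retry or for re-alignment.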

In [7]:
from scripts.stats import GPT41_RATE
import json
with open(join(example_folder, 'log.jsonl')) as f:
    log_data = [json.loads(ln) for ln in f]

# GPT41_RATE: (input rate, output rate), applied per token
total_est_cost = 0
total_real_cost = 0
for log in log_data:
    if log['translator'] == 'gpt-4.1':
        # Estimated cost: token counts estimated with tiktoken
        est_cost = GPT41_RATE[0]*log['in_tokens'] + GPT41_RATE[1]*log['out_tokens']
        # Real cost: token counts reported by the API
        real_cost = GPT41_RATE[0]*log['in_model_tokens'] + \
            GPT41_RATE[1]*log['out_model_tokens']
        total_est_cost += est_cost
        total_real_cost += real_cost
print(f'Total est. cost:\t{total_est_cost:.4f}')
print(f'Total real cost:\t{total_real_cost:.4f}')
print(f'Ratio\t{total_est_cost/total_real_cost:.4f}')

Total est. cost:	0.1289
Total real cost:	0.1298
Ratio	0.9934


# Post Processing
* This example is an ideal case, as the number of input and output lines stayed the same.
    * For DeepL, this is likely.
    * For GPT, this can go wrong: we may get back malformed output that we have to align again.
* Because the line counts match, we perform a direct alignment.
* Code for post-processing can change at any time; **the last committed version counts**
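
Conceptually, a direct alignment pairs line *i* of the source, the reference and the machine translation. The sketch below illustrates the idea; it is not the implementation of `direct_triplet_align`, and the function name `write_triplets` is hypothetical. The `mt`/`ref`/`src` keys and the `{src}-{ref}.jsonl` file name mirror the output files produced in this notebook.

```python
import json
from pathlib import Path


def write_triplets(src_sents, ref_sents, mt_sents, src_lang, ref_lang, folder_path):
    """Pair line i of source, reference and MT output and store them as JSONL."""
    if not (len(src_sents) == len(ref_sents) == len(mt_sents)):
        raise ValueError('Direct alignment needs equally long lists')
    out_file = Path(folder_path) / f'{src_lang}-{ref_lang}.jsonl'
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open('w', encoding='utf-8') as f:
        for src, ref, mt in zip(src_sents, ref_sents, mt_sents):
            record = {'mt': mt, 'ref': ref, 'src': src}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
    return out_file
```

The length check is what makes this approach valid only in the ideal case: as soon as the translator returns a different number of lines, positions no longer correspond.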

In [8]:
from scripts.post_process import direct_triplet_align
from scripts.util import load_sents

# Same alignment for both translators; only input and output folders differ
for mt_folder, out_folder in [(mt_folder_gpt, 'tmp_gpt'), (mt_folder_deepl, 'tmp_deepl')]:
    for s, t in some_pairs:
        src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=50)
        mt_sents = load_sents(mt_folder, s, t)
        direct_triplet_align(
            mt_sents=mt_sents,
            ref_sents=tgt_sents,
            src_sents=src_sents,
            src_lang=s,
            ref_lang=t,
            folder_path=out_folder
        )

In [9]:
# Example of directly aligned sentences and translations
!cat tmp_deepl/de-fi.jsonl | head -n 1

{"mt": "Istunnon jatkaminen", "ref": "Istuntokauden uudelleenavaaminen", "src": "Wiederaufnahme der Sitzungsperiode"}


# Eval

In [10]:
from scripts.scoring import ResultProducer
import os
l2f_deepl = {f.replace('.jsonl', ''): join('tmp_deepl', f) for f in os.listdir('tmp_deepl') if f.endswith('.jsonl')}
l2f_gpt = {f.replace('.jsonl', ''): join('tmp_gpt', f)
             for f in os.listdir('tmp_gpt') if f.endswith('.jsonl')}


rp_deepl = ResultProducer(label2files=l2f_deepl)
rp_gpt = ResultProducer(label2files=l2f_gpt)
rp_deepl.compute_results()
rp_gpt.compute_results()

In [11]:
rp_deepl.display_results()
print()
rp_gpt.display_results()

   Label       BLEU       chrF
0  da-es  33.483544  58.715067
1  de-fi  22.054097  55.634523
2  de-nl  28.670945  55.509332
3  el-pt  34.546446  59.584432
4  en-de  26.702647  57.385844
5  en-fr  39.130502  63.101555
6  nl-de  22.364264  52.949496
7  pt-es  37.786870  62.199097

   Label       BLEU       chrF
0  da-es  30.808541  57.883883
1  de-fi  16.372493  51.396547
2  de-nl  21.865457  51.125152
3  el-pt  33.344611  58.294434
4  en-de  22.727958  54.300195
5  en-fr  32.233153  59.738210
6  nl-de  18.566548  52.401442
7  pt-es  33.013480  61.545275


## Effect of Bertalign
* In the following, we show a case where Bertalign is used to fix alignments
* In this example this is redundant, but we can still observe an effect

In [None]:
from os.path import join
from scripts.data_management import EuroParlManager
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('de', 'fi', num_of_sents=50)

with open(join(mt_folder_gpt, 'de-fi.txt'), 'r') as f:
    mt_sents = [ln.strip() for ln in f]

len(mt_sents)

50

* `mt_sents` has 50 lines because the GPT translator's output was well-formed. In some cases this goes wrong, and then Bertalign may be required.
* Here, we just re-align something that is already considered aligned.

In [13]:
from scripts.post_process import align_src_mt_sents
_ = align_src_mt_sents(
    src_lang='de',
    mt_lang='fi',
    src_sents=src_sents,
    mt_sents=mt_sents,
    folder_path='tmp'
)

Source language: de, Number of sentences: 55
Target language: fi, Number of sentences: 55
Embedding source and target text using paraphrase-multilingual-MiniLM-L12-v2 ...
Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligned 55 de sentences to 55 fi sentences



* Bertalign detected 55 sentences because one **line** can contain multiple sentences
* This can be due to many-to-one alignments in the corpus or to mistakes made by the corpus authors
* For evaluation, the impact of this should not be too severe
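
The difference between lines and sentences can be illustrated with a naive punctuation-based splitter. This is only a stand-in for illustration: Bertalign uses its own sentence splitting, and the function name `count_sentences` and the sample lines are made up.

```python
import re

# Naive splitter: break after ., ! or ? when followed by whitespace.
_SENT_END = re.compile(r'(?<=[.!?])\s+')


def count_sentences(lines):
    """Count sentences over all lines with a naive punctuation-based split."""
    total = 0
    for line in lines:
        parts = [p for p in _SENT_END.split(line.strip()) if p]
        total += len(parts)
    return total


lines = ['Die Sitzung ist eröffnet. Ich danke Ihnen!',
         'Wiederaufnahme der Sitzungsperiode']
print(len(lines), count_sentences(lines))  # 2 lines, 3 sentences
```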

In [14]:
from scripts.post_process import post_triplet_align

with open(join('tmp', 'de-fi.de'), 'r') as f:
    src_sents_ali = [ln.strip() for ln in f]

with open(join('tmp', 'de-fi.fi'), 'r') as f:
    mt_sents_ali = [ln.strip() for ln in f]


post_triplet_align(
    src_sents_org=src_sents,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents,
    src_lang='de',
    ref_lang='fi',
    folder_path='tmp'
)

47 sents aligned for de and fi


* Bertalign detected 55 sentences in both languages, but when we tried to re-align the triplets, some sentences were left empty
* Those sentences were removed, leaving only 47 aligned sentences
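
The removal step can be pictured as a filter over triplets. How `post_triplet_align` does this internally is not shown in this notebook, so the following is only a sketch of the idea with a hypothetical helper name.

```python
def drop_incomplete(triplets):
    """Keep only (src, ref, mt) triplets in which every member has text."""
    return [t for t in triplets if all(part.strip() for part in t)]
```

For instance, `drop_incomplete([('a', 'b', 'c'), ('d', '', 'f')])` keeps only the first triplet.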

In [15]:
from scripts.scoring import ResultProducer
l2f = {'de-fi' : join('tmp', 'de-fi.jsonl')}
rp = ResultProducer(label2files=l2f)
rp.compute_results()
rp.display_results()

   Label       BLEU       chrF
0  de-fi  15.560334  50.320907


* There is a visible impact on the BLEU score (15.56 here vs. 16.37 for GPT de-fi with direct alignment), but this is also because we are working with only 47-55 sentences