# Example Run
* One possible way to conduct the translation task.
* In the real run, pairs will be chosen deliberately, and depending on the translator we may run on batches of pairs or on all of them.
* In this demo, we choose a few pairs at random.

## Translation Task

* Code used for translation, namely from `data_management`, `util`, `translators` and `task`, MUST NOT CHANGE during or after translation.
* We have to decide at which commit the code is considered `fixed`; after that, those four modules must remain untouched.
* If Git still tracks changes to them, those changes must not alter the behaviour of the code in any way.
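
One way to check the rule above after the fact is to compare the `git_hash` field that each log entry carries. The following is a minimal sketch, assuming the JSONL log layout used in this notebook; the helper `hashes_in_log` and the shortened sample entries are hypothetical and not part of `scripts`.

```python
import json


def hashes_in_log(lines):
    """Return the set of distinct git hashes found in JSONL log lines."""
    return {json.loads(ln)["git_hash"] for ln in lines if ln.strip()}


# Two shortened, hypothetical entries in the same layout as log.jsonl
sample_log = [
    '{"translator": "deepl_document", "git_hash": "76cf461"}',
    '{"translator": "gpt-4.1", "git_hash": "76cf461"}',
]

hashes = hashes_in_log(sample_log)
# A single distinct hash means every logged run used the same commit
same_commit = len(hashes) == 1
print(same_commit)
```

If more than one hash shows up, the runs were produced by different commits and should be inspected before their outputs are compared.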

In [1]:
from scripts.task import TranslationTask
from scripts.data_management import EuroParlManager
from scripts.translators import GPTClient, DeeplClient
from scripts.util import MyLogger
from os.path import join
from random import sample, seed
seed(64)
possible = [tuple(pair.split('-')) for pair in EuroParlManager.EP_PAIRS]
extended = [(pair[1], pair[0]) for pair in possible]
possible = possible + extended
some_pairs = sample(sorted(possible), k=4)

example_folder = 'exmpl'
mt_folder_gpt = join(example_folder, 'gpt41')
mt_folder_deepl = join(example_folder, 'deepl')

dm = EuroParlManager()  
logger = MyLogger(logfile=join(example_folder, 'log.jsonl'))
client_gpt = GPTClient(logger=logger)  
client_deepl = DeeplClient(logger=logger)

In [2]:
some_pairs

[('fr', 'da'), ('de', 'fr'), ('nl', 'da'), ('it', 'pt')]

In [3]:
task_gpt = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_gpt,
    logger=logger,
    mt_folder=mt_folder_gpt,
    num_of_sents=50
)

task_deepl = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_deepl,
    logger=logger,
    mt_folder=mt_folder_deepl,
    num_of_sents=50
)

In [4]:
task_deepl.run()

Document for pair fr-da has been translated already.
50 translated from fr to da
Document for pair de-fr has been translated already.
50 translated from de to fr
Document for pair nl-da has been translated already.
50 translated from nl to da
Document for pair it-pt has been translated already.
50 translated from it to pt


In [5]:
task_gpt.run()

Document for pair fr-da has been translated already.
50 translated from fr to da
Document for pair de-fr has been translated already.
50 translated from de to fr
Document for pair nl-da has been translated already.
50 translated from nl to da
Document for pair it-pt has been translated already.
50 translated from it to pt


* The message *Document for pair {src_lang}-{tgt_lang} has been translated already* shows up because the code checks `mt_folder` for existing files; if it finds them, it does not call the API.
* This makes it safer to run API-calling code in Jupyter notebooks, because the notebook can be re-run without triggering new requests.
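
The skip-if-exists behaviour described above can be sketched as follows. This is not the code in `task.py`; `translate_once` and `fake_api` are hypothetical names, and the `{src}-{tgt}.txt` file naming is an assumption based on the files read later in this notebook.

```python
import os
import tempfile


def translate_once(src_lang, tgt_lang, mt_folder, translate_fn):
    """Call translate_fn only if no output file exists yet for this pair."""
    out_path = os.path.join(mt_folder, f"{src_lang}-{tgt_lang}.txt")
    if os.path.exists(out_path):
        print(f"Document for pair {src_lang}-{tgt_lang} has been translated already.")
        return False
    os.makedirs(mt_folder, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(translate_fn())
    return True


calls = []


def fake_api():
    """Stand-in for a paid API call; records how often it is invoked."""
    calls.append(1)
    return "placeholder translation\n"


with tempfile.TemporaryDirectory() as folder:
    first = translate_once("fr", "da", folder, fake_api)   # calls the API
    second = translate_once("fr", "da", folder, fake_api)  # skipped
```

Because the check is based on files on disk, it also survives a kernel restart, which an in-memory cache would not.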

**NOTE**: Everything that comes after this will NOT be part of the official translation task notebooks. 
* The translation tasks will be performed as soon as the relevant code (`data_management.py`, `util.py`, `translators.py`, `task.py`) is deemed stable and safe enough.
* The following will contain post-processing and analysis steps that belong to the second part of the project, where code can be still developed further.

## Logs
* We can print our logs within the notebook, but it is safer to store them externally.
* This notebook can be re-run after translation: API calls will not be made, but any logs printed in the notebook will change.
* Externally stored logs represent the logs created at the time of translation and can be viewed through Python or Unix commands.

In [6]:
!cat $example_folder/log.jsonl | head -n 1
!cat $example_folder/log.jsonl | tail -n 1

{"translator": "deepl_document", "src_lang": "fr", "tgt_lang": "da", "start": 1745851892.2054236, "id": "cb59119b-7f06-4fc1-9f83-a2d36957fb00", "in_lines": 50, "in_sents": 54, "timestamp": "2025-04-28 16:51:32.223822+02:00", "in_chars": 7457, "in_tokens": 1644, "git_hash": "76cf461", "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 50, "start_idx": 0, "split": "train[:500]"}, "end": 1745851898.7286563, "time": 6.523232698440552, "out_chars": 6571, "out_lines": 50, "out_sents": 53, "out_tokens": 1815, "status": "done", "error_msg": null}
{"translator": "gpt-4.1", "src_lang": "it", "tgt_lang": "pt", "start": 1745852021.4017231, "id": "9bf55f53-44ec-4351-90d3-a9ab01e2fc44", "in_lines": 50, "in_sents": 53, "timestamp": "2025-04-28 16:53:41.459331+02:00", "in_chars": 7494, "in_tokens": 1823, "git_hash": "76cf461", "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 50, "start_idx": 0, "split": "train[:500]"}, "end": 1745852037.7404916, "time": 16.338768482208252, "out_c

* Logs for DeepL and GPT-4.1 can differ, depending on what the respective API provides in the response body.
* DeepL provides a `"status"` field with values such as `done`, which indicates that the request has been fully processed.
* GPT-4.1's response body contains the number of tokens the model actually used (our own counts are estimates made with tiktoken) as well as a `"finish_reason"`. The value `"stop"` tells us that the request was processed fully; `"length"` would mean the output was likely cut off at the output token limit.
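
The two completion signals above can be checked with one small function. This is a sketch only: `is_complete` is a hypothetical helper, and it assumes that `"status"`, `"finish_reason"` and `"error_msg"` appear as top-level keys of a log entry, which the truncated GPT entry above does not fully confirm.

```python
def is_complete(entry):
    """Return True if a log entry looks like a fully processed request.

    A missing "status" or "finish_reason" key is treated as neutral,
    since each translator only logs the field its API provides.
    """
    if entry.get("error_msg") is not None:
        return False
    status_ok = entry.get("status", "done") == "done"
    finish_ok = entry.get("finish_reason", "stop") == "stop"
    return status_ok and finish_ok


# Hypothetical, shortened log entries
deepl_entry = {"translator": "deepl_document", "status": "done", "error_msg": None}
gpt_cut_off = {"translator": "gpt-4.1", "finish_reason": "length", "error_msg": None}

results = [is_complete(deepl_entry), is_complete(gpt_cut_off)]
print(results)
```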

In [7]:
from scripts.stats import GPT41_RATE
import json
with open(join(example_folder, 'log.jsonl')) as f:
    log_data = [json.loads(ln) for ln in f]

total_est_cost = 0
total_real_cost = 0
for log in log_data:
    if log['translator'] == 'gpt-4.1':
        est_cost = GPT41_RATE[0]*log['in_tokens'] + GPT41_RATE[1]*log['out_tokens']
        real_cost = GPT41_RATE[0]*log['in_model_tokens'] + \
            GPT41_RATE[1]*log['out_model_tokens']
        total_est_cost += est_cost
        total_real_cost += real_cost
print(f'Total est. cost:\t{total_est_cost:.4f}')
print(f'Total real cost:\t{total_real_cost:.4f}')
print(f'Ratio\t{total_est_cost/total_real_cost:.4f}')

Total est. cost:	0.0670
Total real cost:	0.0674
Ratio	0.9937


# Post Processing
* This example is an ideal case, as the number of input and output lines remained the same.
    * For DeepL this is likely.
    * For GPT this can go wrong: we may get back malformed output that we have to align again.
* Since this is the ideal case, we perform a direct alignment.
* Code for post-processing can change at any time; **the last committed version counts**.
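
Direct alignment in this sense simply pairs line *i* of the source, reference and machine translation. A minimal sketch, assuming the `mt`/`ref`/`src` JSONL layout that COMET expects; `triplets_to_jsonl` is a hypothetical helper and not the implementation of `direct_triplet_align`.

```python
import json


def triplets_to_jsonl(mt_sents, ref_sents, src_sents):
    """Pair lines by position and serialise each triplet as one JSON line."""
    if not (len(mt_sents) == len(ref_sents) == len(src_sents)):
        raise ValueError("direct alignment needs equal line counts")
    return [
        json.dumps({"mt": m, "ref": r, "src": s}, ensure_ascii=False)
        for m, r, s in zip(mt_sents, ref_sents, src_sents)
    ]


rows = triplets_to_jsonl(
    mt_sents=["Reprise de la session"],
    ref_sents=["Reprise de la session"],
    src_sents=["Wiederaufnahme der Sitzungsperiode"],
)
print(rows[0])
```

The length check is what makes the ideal case explicit: as soon as the counts differ, positional pairing is no longer trustworthy and a re-alignment step is needed.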

In [8]:
from scripts.post_process import direct_triplet_align
from scripts.util import load_sents

# Align the GPT and DeepL outputs with the same routine
for mt_folder, out_folder in [(mt_folder_gpt, 'tmp_gpt'), (mt_folder_deepl, 'tmp_deepl')]:
    for s, t in some_pairs:
        src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=50)
        mt_sents = load_sents(mt_folder, s, t)
        direct_triplet_align(
            mt_sents=mt_sents,
            ref_sents=tgt_sents,
            src_sents=src_sents,
            src_lang=s,
            ref_lang=t,
            folder_path=out_folder
        )

In [9]:
# Example of direct aligned sentences and translations in COMET format
!cat tmp_deepl/de-fr.jsonl | head -n 1

{"mt": "Reprise de la session", "ref": "Reprise de la session", "src": "Wiederaufnahme der Sitzungsperiode"}


# Eval

In [10]:
from scripts.scoring import ResultProducer
import os
l2f_deepl = {f.replace('.jsonl', ''): join('tmp_deepl', f) for f in os.listdir('tmp_deepl') if f.endswith('.jsonl')}
l2f_gpt = {f.replace('.jsonl', ''): join('tmp_gpt', f)
             for f in os.listdir('tmp_gpt') if f.endswith('.jsonl')}


rp_deepl = ResultProducer(label2files=l2f_deepl)
rp_gpt = ResultProducer(label2files=l2f_gpt)
rp_deepl.compute_results()
rp_gpt.compute_results()

In [11]:
rp_deepl.display_results()
print()
rp_gpt.display_results()

   Label       BLEU       chrF
0  de-fr  35.901115  60.228306
1  fr-da  32.608040  58.539183
2  it-pt  33.270872  57.956424
3  nl-da  26.663987  54.279791

   Label       BLEU       chrF
0  de-fr  27.803709  55.916120
1  fr-da  30.738095  56.870821
2  it-pt  30.903444  56.830762
3  nl-da  26.754190  52.946628


## Effect of Bertalign
* In the following, we show a case where Bertalign is used to fix alignments.
* In this example that is redundant, but we can still observe an effect.

In [12]:
from os.path import join
from scripts.data_management import EuroParlManager
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('nl', 'da', num_of_sents=50)

with open(join(mt_folder_gpt, 'nl-da.txt'), 'r') as f:
    mt_sents = [ln.strip() for ln in f]

len(mt_sents)

50

In [13]:
!rm -rf tmp/nl-da.nl
!rm -rf tmp/nl-da.da

* `mt_sents` has 50 lines because the GPT translator's output was well formed. When the output is malformed, Bertalign may be required.
* Here, we just re-align something that is already considered aligned.

In [14]:
from scripts.post_process import align_src_mt_sents
_ = align_src_mt_sents(
    src_lang='nl',
    mt_lang='da',
    src_sents=src_sents,
    mt_sents=mt_sents,
    folder_path='tmp'
)

Source language: nl, Number of sentences: 58
Target language: da, Number of sentences: 58
Embedding source and target text using paraphrase-multilingual-MiniLM-L12-v2 ...
Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligned 58 nl sentences to 58 da sentences



* Bertalign detected 58 sentences because one **line** can contain multiple sentences.
* This can happen either because of many-to-one alignments or because of mistakes made by the corpus authors.
* For evaluation, the impact of this should not be too severe.
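
The gap between line count and sentence count can be illustrated with a deliberately naive splitter. This is not the sentence splitter Bertalign uses; `naive_sentence_count` and the two sample lines are invented for illustration.

```python
import re


def naive_sentence_count(line):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return len([p for p in parts if p])


# Invented sample lines: the second line holds two sentences
lines = [
    "Resumption of the session",
    "I declare the session resumed. Thank you.",
]

n_lines = len(lines)
n_sents = sum(naive_sentence_count(ln) for ln in lines)
print(n_lines, n_sents)  # 2 3
```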

In [15]:
from scripts.post_process import post_triplet_align

with open(join('tmp', 'nl-da.nl'), 'r') as f:
    src_sents_ali = [ln.strip() for ln in f]

with open(join('tmp', 'nl-da.da'), 'r') as f:
    mt_sents_ali = [ln.strip() for ln in f]


post_triplet_align(
    src_sents_org=src_sents,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents,
    src_lang='nl',
    ref_lang='da',
    folder_path='tmp'
)

44 sents aligned for nl and da


* Bertalign detected 58 sentences in both languages, but when we tried to re-align the triplets, some sentences were left empty.
* Those sentences were removed, leaving only 44 aligned sentences.
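
The removal step can be pictured as a filter over triplets. This is a sketch under the assumption that a triplet is dropped as soon as any of its three parts is empty; `drop_incomplete` is hypothetical and may differ from what `post_triplet_align` actually does.

```python
def drop_incomplete(triplets):
    """Keep only (src, mt, ref) triplets in which every part is non-empty."""
    return [t for t in triplets if all(part.strip() for part in t)]


# Invented placeholder triplets
triplets = [
    ("src 1", "mt 1", "ref 1"),
    ("src 2", "", "ref 2"),      # machine translation left empty
    ("src 3", "mt 3", "   "),    # reference is whitespace only
]

kept = drop_incomplete(triplets)
print(len(kept), "of", len(triplets), "triplets kept")
```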

In [16]:
from scripts.scoring import ResultProducer
l2f = {'nl-da' : join('tmp', 'nl-da.jsonl')}
rp = ResultProducer(label2files=l2f)
rp.compute_results()
rp.display_results()

   Label       BLEU       chrF
0  nl-da  28.385311  53.761071


* There is a visible impact on the BLEU score (28.39 here versus 26.75 for GPT's `nl-da` with direct alignment), but this is also because we are working with only 44-50 sentences.