# Example Run
* One possible way to conduct the translation task
* In the real run, pairs will be chosen deliberately; depending on the translator, we may run on batches of pairs or on all of them
* In this demo, we choose a few pairs at random

## Translation Task

* Code used for translation, namely from `data_management`, `util`, `translators` and `task`, MUST NOT CHANGE during or after translation.
* It has to be decided at which commit the code is considered `fixed`; after that commit, those files must remain untouched.
* If Git still tracks changes to them, those changes must not make the code behave differently from before.
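One way to make the `fixed` state checkable is to record a fingerprint of the translation code once and compare it before later runs. The following is only a minimal sketch and not part of the project: `fingerprint` is a hypothetical helper, and the file paths in `FIXED_FILES` are assumed from the module names above.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return one SHA-256 digest over the names and contents of all given files."""
    digest = hashlib.sha256()
    # Sort so that the result does not depend on the order of `paths`.
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Assumed file layout; adjust to the actual repository.
FIXED_FILES = [
    "scripts/data_management.py",
    "scripts/util.py",
    "scripts/translators.py",
    "scripts/task.py",
]

if all(Path(p).exists() for p in FIXED_FILES):
    print(fingerprint(FIXED_FILES))
```

If the digest printed at the `fixed` commit is stored next to the logs, any later difference indicates that one of the files changed.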

In [1]:
from scripts.task import TranslationTask
from scripts.data_management import EuroParlManager
from scripts.translators import GPTClient, DeeplClient
from scripts.logger import MyLogger
from os.path import join
from random import sample, seed
seed(64)
possible = [tuple(pair.split('-')) for pair in EuroParlManager.EP_PAIRS]
extended = [(pair[1], pair[0]) for pair in possible]
possible = possible + extended
some_pairs = sample(sorted(possible), k=4)

example_folder = 'exmpl'
logfile = join(example_folder, 'log.jsonl')


dm = EuroParlManager()
logger = MyLogger(logfile=logfile)
client_gpt = GPTClient(logger=logger)
client_deepl = DeeplClient(logger=logger)

mt_folder_gpt = join(example_folder, client_gpt.model)
mt_folder_deepl = join(example_folder, client_deepl.model)


In [2]:
some_pairs

[('fr', 'da'), ('de', 'fr'), ('nl', 'da'), ('it', 'pt')]

In [3]:
task_gpt = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_gpt,
    logger=logger,
    mt_folder=mt_folder_gpt,
    num_of_sents=400
)

task_deepl = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_deepl,
    logger=logger,
    mt_folder=mt_folder_deepl,
    num_of_sents=400
)

In [4]:
task_deepl.run()

Document for pair fr-da has been translated already.
400 translated from fr to da
Document for pair de-fr has been translated already.
400 translated from de to fr
Document for pair nl-da has been translated already.
400 translated from nl to da
Document for pair it-pt has been translated already.
400 translated from it to pt


In [5]:
task_gpt.run()

Document for pair fr-da has been translated already.
400 translated from fr to da
Document for pair de-fr has been translated already.
400 translated from de to fr
Document for pair nl-da has been translated already.
125 translated from nl to da
Document for pair it-pt has been translated already.
400 translated from it to pt


* *Document for pair {src_lang}-{tgt_lang} has been translated already* shows up because the code checks the `mt_folder` for existing files; if it finds them, it does not call the API.
* This makes it safer to run API-calling code within Jupyter notebooks, because the notebook can be re-run without triggering new API calls.
* **The choice of language pairs was actually non-trivial: we observed that GPT4.1 occasionally does not return 400 sentences for the pair nl-da. In such cases, we have to either check whether the 400 sentences are still present within the 125 returned lines or run the task again.**
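The skip-if-present behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code in `scripts.task`: `translate_if_missing` and its `translate` callback are hypothetical, and the `{src_lang}-{tgt_lang}.txt` naming is taken from the file opened later in this notebook.

```python
from pathlib import Path


def translate_if_missing(mt_folder, src_lang, tgt_lang, src_sents, translate):
    """Call `translate` only if no output file for the pair exists yet."""
    out_file = Path(mt_folder) / f"{src_lang}-{tgt_lang}.txt"
    if out_file.exists():
        print(f"Document for pair {src_lang}-{tgt_lang} has been translated already.")
    else:
        out_file.parent.mkdir(parents=True, exist_ok=True)
        mt_sents = translate(src_sents)  # the only place where the API would be called
        out_file.write_text("\n".join(mt_sents) + "\n", encoding="utf-8")
    # Always report what is on disk, so re-runs print the same summary.
    with out_file.open(encoding="utf-8") as f:
        num_lines = sum(1 for _ in f)
    print(f"{num_lines} translated from {src_lang} to {tgt_lang}")
    return num_lines
```

Because the line count is read from disk, a re-run also reveals an incomplete earlier output, such as the 125 lines for nl-da.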

## Logs
* We can print our logs within the notebook, but it is safer to store them externally.
* This notebook can be re-run after translation: API calls will not be made, but the logs shown in the notebook will change.
* Externally stored logs represent the logs created at the time of translation and can be viewed through Python or Unix commands.
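For viewing the stored logs through Python, a small equivalent of `tail` might look like this. It is a sketch; `tail_jsonl` is a hypothetical helper and only assumes that the log file holds one JSON object per line.

```python
import json
from collections import deque


def tail_jsonl(path, n=2):
    """Return the last `n` records of a JSON Lines file as dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        # A bounded deque keeps only the most recent `n` non-empty lines.
        last_lines = deque((ln for ln in f if ln.strip()), maxlen=n)
    return [json.loads(ln) for ln in last_lines]
```

Calling `tail_jsonl(logfile)` would return the two most recent log entries without printing the long raw lines.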

In [6]:
!cat $example_folder/log.jsonl | tail -n 2

{"git_hash": "3c7d40c", "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 400, "start_idx": 0, "split": "train[:500]"}, "translation": {"translator": "gpt-4.1", "src_lang": "nl", "tgt_lang": "da", "start": 1745938912.6919239, "id": "c1328927-1072-457c-b64f-f02097d04639", "in_lines": 400, "in_sents": 460, "start_timestamp": "2025-04-29 17:01:52.804929+02:00", "in_chars": 67089, "in_tokens": 14372, "end": 1745938977.7328722, "end_timestamp": "2025-04-29 17:02:57.732872+02:00", "time": 65.0409483909607, "out_chars": 17276, "out_lines": 125, "out_sents": 152, "out_tokens": 4627, "in_model_tokens": 14421, "out_model_tokens": 4628, "finish_reason": "stop"}}
{"git_hash": "3c7d40c", "dataset": {"name": "Helsinki-NLP/europarl", "num_of_sents": 400, "start_idx": 0, "split": "train[:500]"}, "translation": {"translator": "gpt-4.1", "src_lang": "it", "tgt_lang": "pt", "start": 1745938977.7837636, "id": "7d87d2a8-8062-4c29-accf-21a35e0fb2c2", "in_lines": 400, "in_sents": 431, "start_times

* Logs for DeepL and GPT4.1 can differ based on what the respective API provides in the response body.
* DeepL provides a `"status"` field that contains values such as `done`, indicating that the request has been processed fully.
* GPT4.1's response body contains the number of tokens the model actually used, which we otherwise only estimate using tiktoken. It also contains a `"finish_reason"`; the value `"stop"` tells us that the request was processed fully. If the value were `"length"` instead, the output was likely cut off because a token limit was reached.

In [7]:
import json
with open(logfile, 'r') as f:
    log_data = [json.loads(ln) for ln in f]

tl_log = log_data[-2]['translation']
tl_log['in_lines'], tl_log['in_sents'], tl_log['out_lines'], tl_log['out_sents'], tl_log['finish_reason'], tl_log['id']

(400, 460, 125, 152, 'stop', 'c1328927-1072-457c-b64f-f02097d04639')

* `in_sents` and `out_sents` contain the "real number of sentences" across all "sentence entries" (lines). That is, one entry in the dataset can hold multiple actual sentences once we apply a sentence splitter.
* This can occur because the dataset authors made a mistake (using multiple sentences in one entry) or because of 1-to-many alignments.
* Regardless, the 125 output lines contain only 152 sentences, far fewer than the 460 sentences in the input. GPT4.1 therefore provided incomplete output, and we have no choice but to re-run the translation.
* This is somewhat of a mystery, as we would expect the finish reason for a truncated output to be `length` rather than `stop`.
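The manual inspection above can be turned into a small check over a log entry. The sketch below is an assumption about how such a check could look, not part of `scripts`: `check_translation` is a hypothetical helper, and the 10% `tolerance` on sentence counts is an arbitrary choice.

```python
def check_translation(tl_log, tolerance=0.1):
    """Return a list of reasons why a translation log entry looks incomplete."""
    reasons = []
    # DeepL entries have no finish_reason, so a missing field counts as fine.
    if tl_log.get("finish_reason", "stop") != "stop":
        reasons.append(f"finish_reason is {tl_log['finish_reason']!r}")
    if tl_log["out_lines"] != tl_log["in_lines"]:
        reasons.append(
            f"{tl_log['out_lines']} output lines for {tl_log['in_lines']} input lines"
        )
    # Sentence counts may differ slightly after translation, so allow some slack.
    if tl_log["out_sents"] < (1 - tolerance) * tl_log["in_sents"]:
        reasons.append(
            f"{tl_log['out_sents']} output sentences for {tl_log['in_sents']} input sentences"
        )
    return reasons


# Values taken from the nl-da log entry shown above.
nl_da = {"in_lines": 400, "in_sents": 460, "out_lines": 125,
         "out_sents": 152, "finish_reason": "stop"}
for reason in check_translation(nl_da):
    print(reason)
```

An empty list means the entry passed all checks; the line and sentence counts catch cases like this one, where `finish_reason` alone would not.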

## Re-Running
* In the following, we demonstrate how a re-run is conducted.
* In general, we re-run only AFTER all tasks have been completed rather than directly after the first completed task.
* We do it directly here because this is a demonstration.

In [8]:
from scripts.logger import ReRun

# The re-run is linked to the incomplete nl-da entry through its log id.
rerun = ReRun(
    pairs=[('nl', 'da')],
    reasons=['output too small'],
    log_ids=[tl_log['id']]
)
new_logger = MyLogger(logfile=logfile, rerun=rerun)
client = GPTClient(logger=new_logger)
task = TranslationTask(
    target_pairs=[('nl', 'da')],
    dm=dm,
    client=client,
    logger=new_logger,
    mt_folder=join(mt_folder_gpt, 'rerun'),
    num_of_sents=400,
    rerun=True
)

In [9]:
task.run()

125 translated from nl to da


In [11]:
with open(logfile, 'r') as f:
    log_data = [json.loads(ln) for ln in f]

log_data[-1]['rerun'], log_data[-1]['translation']

({'log_id': 'c1328927-1072-457c-b64f-f02097d04639',
  'reason': 'output too small'},
 {'translator': 'gpt-4.1',
  'src_lang': 'nl',
  'tgt_lang': 'da',
  'start': 1745940064.5667317,
  'id': 'de6e754d-35a8-4fb1-8950-7d8d527434b7',
  'in_lines': 400,
  'in_sents': 460,
  'start_timestamp': '2025-04-29 17:21:04.743725+02:00',
  'in_chars': 67089,
  'in_tokens': 14372,
  'end': 1745940158.0418706,
  'end_timestamp': '2025-04-29 17:22:38.041870+02:00',
  'time': 93.47513890266418,
  'out_chars': 17345,
  'out_lines': 126,
  'out_sents': 153,
  'out_tokens': 4651,
  'in_model_tokens': 14421,
  'out_model_tokens': 4652,
  'finish_reason': 'stop'})
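Since every re-run entry carries the `log_id` of the entry it replaces, the two can be compared programmatically. The following sketch assumes the log structure shown in the outputs above; `compare_rerun` is a hypothetical helper, not part of `scripts.logger`.

```python
def compare_rerun(log_data):
    """Pair every re-run entry with the original entry it refers to."""
    by_id = {entry["translation"]["id"]: entry["translation"] for entry in log_data}
    comparisons = []
    for entry in log_data:
        rerun = entry.get("rerun")
        if not rerun or rerun["log_id"] not in by_id:
            continue  # not a re-run, or the original entry is missing
        original, new = by_id[rerun["log_id"]], entry["translation"]
        comparisons.append({
            "pair": f"{new['src_lang']}-{new['tgt_lang']}",
            "reason": rerun["reason"],
            "out_lines_before": original["out_lines"],
            "out_lines_after": new["out_lines"],
            "complete": new["out_lines"] == new["in_lines"],
        })
    return comparisons
```

Applied to the entries above, this would report 125 output lines before and 126 after the re-run for nl-da, so the re-run shown here is still short of the 400 input lines.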

# Post Processing
* This example case is an ideal case, as the number of input and output lines remained the same.
    * For DeepL this is likely.
    * For GPT, this can go wrong, and we may get back malformed output that we have to align again.
* Because this is an ideal case, we perform a direct alignment.
* Code for post-processing can change at any time; **the last committed version counts**.
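Conceptually, a direct alignment pairs the i-th source, machine-translated and reference line and writes them as one record. The sketch below illustrates that idea with the `src`, `mt` and `ref` keys that COMET expects; `write_triplets` is a hypothetical helper and may differ from what `direct_triplet_align` actually does.

```python
import json
from pathlib import Path


def write_triplets(src_sents, mt_sents, ref_sents, out_file):
    """Write line-aligned sentences as JSON Lines with `src`, `mt` and `ref` keys."""
    if not (len(src_sents) == len(mt_sents) == len(ref_sents)):
        raise ValueError(
            f"Cannot align directly: {len(src_sents)} src, "
            f"{len(mt_sents)} mt and {len(ref_sents)} ref sentences"
        )
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", encoding="utf-8") as f:
        for src, mt, ref in zip(src_sents, mt_sents, ref_sents):
            record = {"src": src, "mt": mt, "ref": ref}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The length check matters because `zip` silently truncates to the shortest list, which would hide an incomplete translation instead of reporting it.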

In [None]:
from scripts.post_process import direct_triplet_align
from scripts.util import load_sents

# Align source, reference and machine translation line by line,
# first for GPT and then for DeepL, with one output folder per translator.
for mt_folder, out_folder in [(mt_folder_gpt, 'tmp_gpt'), (mt_folder_deepl, 'tmp_deepl')]:
    for s, t in some_pairs:
        src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
        mt_sents = load_sents(mt_folder, s, t)
        direct_triplet_align(
            mt_sents=mt_sents,
            ref_sents=tgt_sents,
            src_sents=src_sents,
            src_lang=s,
            ref_lang=t,
            folder_path=out_folder
        )

In [None]:
# Example of direct aligned sentences and translations in COMET format
!cat tmp_deepl/de-fr.jsonl | head -n 1

# Eval

In [None]:
from scripts.scoring import ResultProducer
import os
# Map each label (e.g. 'de-fr') to its aligned JSONL file.
l2f_deepl = {f.replace('.jsonl', ''): join('tmp_deepl', f)
             for f in os.listdir('tmp_deepl') if f.endswith('.jsonl')}
l2f_gpt = {f.replace('.jsonl', ''): join('tmp_gpt', f)
           for f in os.listdir('tmp_gpt') if f.endswith('.jsonl')}


rp_deepl = ResultProducer(label2files=l2f_deepl)
rp_gpt = ResultProducer(label2files=l2f_gpt)
rp_deepl.compute_results()
rp_gpt.compute_results()

In [None]:
rp_deepl.display_results()
print()
rp_gpt.display_results()

## Effect of Bertalign
* In the following, we show a case where Bertalign is used to fix alignments.
* In this example the re-alignment is redundant, but we can still observe an effect.

In [None]:
from os.path import join
from scripts.data_management import EuroParlManager
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('nl', 'da', num_of_sents=50)

with open(join(mt_folder_gpt, 'nl-da.txt'), 'r') as f:
    mt_sents = [ln.strip() for ln in f]

print(len(mt_sents))
mt_sents = mt_sents[:50]

In [None]:
!rm -rf tmp_src_ref/nl-da.nl
!rm -rf tmp_src_ref/nl-da.da
!rm -rf tmp_src_mt/nl-da.nl
!rm -rf tmp_src_mt/nl-da.da

* `mt_sents` has 50 entries because the GPT translator's output was good for these lines. In some cases it can go wrong, and then Bertalign may be required.
* Here, we simply re-align something that is already considered aligned.
* The re-alignment requires us to align the original source with the reference and then the source with the machine translation.
* We then use the source sentences as a key to align all three together with `post_triplet_align`.
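Using the source as a key amounts to a join of the two alignments on identical source strings. The sketch below shows the idea under that assumption; `join_on_src` is a hypothetical helper, and `post_triplet_align` may match sentences differently.

```python
def join_on_src(src_ref_pairs, src_mt_pairs):
    """Join (src, ref) and (src, mt) pairs on identical source strings."""
    mt_by_src = {src: mt for src, mt in src_mt_pairs}
    triplets = []
    for src, ref in src_ref_pairs:
        if src in mt_by_src:  # sources without a partner are discarded
            triplets.append((src, mt_by_src[src], ref))
    return triplets


# Toy data: "b" has no machine translation and "d" has no reference.
src_ref = [("a", "ref-a"), ("b", "ref-b"), ("c", "ref-c")]
src_mt = [("a", "mt-a"), ("c", "mt-c"), ("d", "mt-d")]
print(join_on_src(src_ref, src_mt))
```

Only sources present in both alignments survive, which is why a triplet alignment can end up with fewer sentences than either of its inputs.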

In [None]:
from scripts.post_process import align_sents
_ = align_sents(
    src_lang='nl',
    tgt_lang='da',
    src_sents=src_sents,
    tgt_sents=tgt_sents,
    folder_path='tmp_src_ref'
)

* 58 nl sents were aligned to 51 da sents, which implies:
    * Some sents lost their partners
    * Bertalign accounted for 1-to-many alignments

In [None]:
from scripts.post_process import align_sents
_ = align_sents(
    src_lang='nl',
    tgt_lang='da',
    src_sents=src_sents,
    tgt_sents=mt_sents,
    folder_path='tmp_src_mt'
)

* This time bertalign aligned 58 nl sents to 58 da sents
* The original da sents were only 51 though
* This implies that during triplet alignment, some sents will be discarded

In [None]:
from scripts.post_process import post_triplet_align

with open(join('tmp_src_ref', 'nl-da.nl'), 'r') as f:
    src_sents_org = [ln.strip() for ln in f]

with open(join('tmp_src_ref', 'nl-da.da'), 'r') as f:
    tgt_sents_org = [ln.strip() for ln in f]

with open(join('tmp_src_mt', 'nl-da.nl'), 'r') as f:
    src_sents_ali = [ln.strip() for ln in f]

with open(join('tmp_src_mt', 'nl-da.da'), 'r') as f:
    mt_sents_ali = [ln.strip() for ln in f]


post_triplet_align(
    src_sents_org=src_sents_org,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents_org,
    src_lang='nl',
    ref_lang='da',
    folder_path='tmp'
)

* Note that we could, in theory, run this directly on the `tgt_sents` from the data manager rather than aligning twice.


In [None]:
post_triplet_align(
    src_sents_org=src_sents,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents,
    src_lang='nl',
    ref_lang='da',
    folder_path='tmp_'
)

* However, because of the issue observed earlier, this can result in more sentence loss. Hence we align twice to recover as many alignments as possible.

In [None]:
from scripts.scoring import ResultProducer
l2f = {'nl-da_full_fix': join('tmp', 'nl-da.jsonl'),
       'nl-da_half_fix': join('tmp_', 'nl-da.jsonl')}
rp = ResultProducer(label2files=l2f)
rp.compute_results()
rp.display_results()

* There is a visible impact on the BLEU score, but this is partly because we are working with only 44-58 sentences.
* Observe how the full fix brought us closer to the original case, where 50 sents were aligned with 50 other sents.