# Example Run
* One possible way to conduct the translation task
* In the real run, pairs will be chosen deliberately; depending on the translator, we may run on batches of pairs or on all of them
* In this demo, we simply pick a few pairs at random

## Translation Task

* Code used for translation, namely `data_management`, `util`, `translators` and `task`, MUST NOT CHANGE during or after translation.
* We have to decide at which commit the code is considered `fixed`; after that, those four files must remain untouched.
* If Git still records changes to them, those changes must not alter how the code behaves.
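
One way to verify the "must not change" rule is to record a digest of each relevant file at the fixed commit and compare against it later. The following is a minimal sketch, not part of the project code; the helper names `file_digest` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, frozen_digests):
    """List the files whose current digest differs from the recorded one.

    frozen_digests maps str(path) -> digest recorded at the fixed commit.
    A file without a recorded digest is reported as changed.
    """
    return [p for p in paths if file_digest(p) != frozen_digests.get(str(p))]
```

At the fixed commit one would store `{str(p): file_digest(p) for p in paths}` (for example as JSON) and, before any later run, assert that `changed_files(paths, frozen)` is empty. Note that this flags every byte-level change, including comments and whitespace, so it is stricter than the "no behavioural change" rule above.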

In [20]:
from scripts.task import TranslationTask
from scripts.data_management import EuroParlManager
from scripts.translators import GPTClient, DeeplClient
from scripts.logger import MyLogger
from os.path import join

some_pairs = [('de', 'en'), ('en', 'de'), ('en', 'el'), ('el', 'en')]

example_folder = 'exmpl'
mt_folder_gpt = join(example_folder, 'gpt41')
mt_folder_deepl = join(example_folder, 'deepl')

dm = EuroParlManager()  
logfile_path = join(example_folder, 'log.jsonl')
logger = MyLogger(logfile=logfile_path)


client_gpt = GPTClient(logger=logger)  
client_deepl = DeeplClient(logger=logger)

In [None]:
task_gpt = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_gpt,
    logger=logger,
    mt_folder=mt_folder_gpt,
    num_of_sents=400
)

task_deepl = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_deepl,
    logger=logger,
    mt_folder=mt_folder_deepl,
    num_of_sents=400
)

In [None]:
task_deepl.run()

In [None]:
task_gpt.run()

* The message *Document for pair {src_lang}-{tgt_lang} has been translated already* shows up because the code checks `mt_folder` for existing files; if it finds them, it does not call the API
* This makes it safer to run API-calling code within Jupyter Notebooks: the notebook can be re-run without triggering new requests
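
The skip-if-present check can be sketched as follows. This is an illustration only, not the code in `task.py`; the function names are hypothetical, and the `{src}-{tgt}.txt` naming is an assumption based on the file names used later in this notebook.

```python
from os.path import exists, join


def translation_exists(mt_folder, src_lang, tgt_lang):
    """True if an output file for the pair is already stored in mt_folder."""
    return exists(join(mt_folder, f'{src_lang}-{tgt_lang}.txt'))


def pairs_to_translate(mt_folder, target_pairs):
    """Keep only the pairs that would still need an API call."""
    return [(s, t) for s, t in target_pairs
            if not translation_exists(mt_folder, s, t)]
```

Because the check depends only on what is stored on disk, re-running a cell is idempotent: pairs that already have an output file are filtered out before any request is made.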

**NOTE**: Everything that comes after this will NOT be part of the official translation task notebooks. 
* The translation tasks will be performed as soon as the relevant code (`data_management.py`, `util.py`, `translators.py`, `task.py`) is deemed stable and safe enough. 
* The following will contain post-processing and analysis steps that belong to the second part of the project, where code can be still developed further.

## Re-Running
* In some cases, translations may fail and the API call has to be repeated. To ensure transparency, we account for this with rigorous logging.
* Let us assume that the translation for `de-en` by GPT4.1 contains only 200 lines instead of 400, and that these 200 lines correspond to roughly 200 sentences rather than roughly 400.
* GPT sometimes places multiple sentences on the same line. In those cases we do not need to re-run the translation; we can use bertalign to align the source and target sentences.
* When sentences are actually missing, as assumed here, we do the following:


In [None]:
import json
with open(logfile_path, 'r') as f:
    log_data = [json.loads(ln) for ln in f]

for log in log_data:
    tl_log = log['translation']
    src_l = tl_log['src_lang']
    tgt_l = tl_log['tgt_lang']
    if src_l == 'de' and tgt_l == 'en':
        print(tl_log['id'])

In [None]:
from scripts.logger import ReRun
# Re-use the log file defined above and record why the pair is translated again
new_logger = MyLogger(
    logfile=logfile_path,
    rerun=ReRun(pairs=['de-en'], reasons=['logger demonstration'], log_ids=[])
)
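
Whether a stored translation is a candidate for a re-run can be flagged by comparing its line count with the expected number of sentences. The sketch below is a hypothetical helper, not part of the project; the `tolerance` parameter is an assumption. A flagged file still needs manual inspection, because too few lines can also mean several sentences on one line, which bertalign can repair.

```python
def count_lines(mt_path):
    """Count the non-empty lines of a stored translation."""
    with open(mt_path, encoding='utf-8') as f:
        return sum(1 for ln in f if ln.strip())


def is_suspicious(mt_path, expected, tolerance=0.1):
    """Flag files that have clearly fewer lines than expected sentences."""
    return count_lines(mt_path) < expected * (1 - tolerance)
```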


## Logs
* We can print our logs within the notebook, but it is safer to store them externally.
* This notebook can be re-run after translation; no API calls will be made, but the logs printed in the notebook will change
* Externally stored logs reflect the state at translation time and can be viewed through Python or Unix commands

In [None]:
!cat $example_folder/log.jsonl | head -n 1
!cat $example_folder/log.jsonl | tail -n 1

* Logs for DeepL and GPT4.1 can differ based on what the respective API provides in the response body.
* DeepL provides a `"status"` field with values such as `done`, which indicates that the request has been processed fully.
* GPT4.1's response body contains the actual number of tokens used (we additionally estimate them using tiktoken) and a `"finish_reason"`. The value `"stop"` tells us that the request was processed fully; `"length"` would instead mean that the output was likely cut off because a token limit was reached.
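
These completion signals can be checked per log entry. The sketch below assumes that the fields are stored under the top-level keys `finish_reason` and `status`; the actual layout of the entries written by `MyLogger` may differ, and `is_complete` is a hypothetical helper.

```python
def is_complete(log):
    """Return True if a log entry signals a fully processed request."""
    if 'finish_reason' in log:   # GPT-style entry (assumed key)
        return log['finish_reason'] == 'stop'
    if 'status' in log:          # DeepL-style entry (assumed key)
        return log['status'] == 'done'
    return False                 # no signal found: treat as incomplete
```

Entries for which this returns `False` would be natural candidates for the re-run procedure described earlier.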

In [None]:
from scripts.stats import GPT41_RATE
import json
with open(join(example_folder, 'log.jsonl')) as f:
    log_data = [json.loads(ln) for ln in f]

total_est_cost = 0
total_real_cost = 0
for log in log_data:
    if log['translator'] == 'gpt-4.1':
        est_cost = GPT41_RATE[0]*log['in_tokens'] + GPT41_RATE[1]*log['out_tokens']
        real_cost = GPT41_RATE[0]*log['in_model_tokens'] + \
            GPT41_RATE[1]*log['out_model_tokens']
        total_est_cost += est_cost
        total_real_cost += real_cost
print(f'Total est. cost:\t{total_est_cost:.4f}')
print(f'Total real cost:\t{total_real_cost:.4f}')
print(f'Ratio\t{total_est_cost/total_real_cost:.4f}')

# Post Processing
* This example was an ideal case, as the number of input and output sentences remained the same. 
    * For DeepL this is likely. 
    * For GPT, this can go wrong: we may get back malformed output that we have to align again. 
* Because this is an ideal case, we perform a direct alignment. 
* Code for post-processing can change at any time, **the last committed version counts**
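
Direct alignment pairs the i-th source, machine translation and reference sentence and stores them in the `src`/`mt`/`ref` record format that COMET expects. The following is a minimal sketch of that idea, not the project's `direct_triplet_align`; both function names are hypothetical.

```python
import json


def direct_align(src_sents, mt_sents, ref_sents):
    """Pair sentences by position; only valid if all lists are equally long."""
    if not (len(src_sents) == len(mt_sents) == len(ref_sents)):
        raise ValueError('direct alignment needs equally long lists')
    return [{'src': s, 'mt': m, 'ref': r}
            for s, m, r in zip(src_sents, mt_sents, ref_sents)]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + '\n')
```

The length check is what makes this the "ideal case" only: as soon as the translator merges or drops sentences, positional pairing is no longer valid and a real alignment step is needed.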

In [None]:
from scripts.post_process import direct_triplet_align
from scripts.util import load_sents

for pair in some_pairs:
    s, t = pair
    src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
    mt_sents = load_sents(mt_folder_gpt, s, t)
    direct_triplet_align(
        mt_sents=mt_sents,
        ref_sents=tgt_sents,
        src_sents=src_sents,
        src_lang=s,
        ref_lang=t,
        folder_path='tmp_gpt'
    )

for pair in some_pairs:
    s, t = pair
    src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
    mt_sents = load_sents(mt_folder_deepl, s, t)
    direct_triplet_align(
        mt_sents=mt_sents,
        ref_sents=tgt_sents,
        src_sents=src_sents,
        src_lang=s,
        ref_lang=t,
        folder_path='tmp_deepl'
    )

In [None]:
# Example of direct aligned sentences and translations in COMET format
# (de-en is one of the pairs in some_pairs; de-fr was not translated in this demo)
!cat tmp_deepl/de-en.jsonl | head -n 1

# Eval

In [None]:
from scripts.scoring import ResultProducer
import os
l2f_deepl = {f.replace('.jsonl', ''): join('tmp_deepl', f) for f in os.listdir('tmp_deepl') if f.endswith('.jsonl')}
l2f_gpt = {f.replace('.jsonl', ''): join('tmp_gpt', f)
             for f in os.listdir('tmp_gpt') if f.endswith('.jsonl')}


rp_deepl = ResultProducer(label2files=l2f_deepl)
rp_gpt = ResultProducer(label2files=l2f_gpt)
rp_deepl.compute_results()
rp_gpt.compute_results()

In [None]:
rp_deepl.display_results()
print()
rp_gpt.display_results()

## Effect of Bertalign
* In the following, we show a case where Bertalign is used to fix alignments
* In this example the fix is redundant, but we can still observe an effect

In [None]:
from os.path import join
from scripts.data_management import EuroParlManager
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('nl', 'da', num_of_sents=50)

with open(join(mt_folder_gpt, 'nl-da.txt'), 'r') as f:
    mt_sents = [ln.strip() for ln in f]

print(len(mt_sents))
mt_sents = mt_sents[:50]

In [None]:
!rm -rf tmp_src_ref/nl-da.nl
!rm -rf tmp_src_ref/nl-da.da
!rm -rf tmp_src_mt/nl-da.nl
!rm -rf tmp_src_mt/nl-da.da

* `mt_sents` contains 50 lines because the GPT translator's output was good. In some cases it can go wrong, and then bertalign may be required.
* Here, we just re-align something that is already considered aligned. 
* The re-alignment requires us to first align the original source with the reference and then align the source with the machine translation
* Then we simply use the source sentence as a key to bring all three together with `post_triplet_align`
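
Using the source sentence as a key can be sketched as a simple join over the two alignments. This is an illustration under simplifying assumptions, not the project's `post_triplet_align`: the function name `join_on_source` is hypothetical, inputs are lists of `(src, ref)` and `(src, mt)` tuples, and source segments are matched by exact string equality.

```python
def join_on_source(src_ref_pairs, src_mt_pairs):
    """Combine two alignments into src/mt/ref triplets via the source text.

    Source segments that appear in only one of the two alignments are
    discarded. If a source segment occurs several times in src_mt_pairs,
    the last occurrence wins.
    """
    src_to_mt = dict(src_mt_pairs)
    return [{'src': s, 'mt': src_to_mt[s], 'ref': r}
            for s, r in src_ref_pairs if s in src_to_mt]
```

If the aligner merges two source sentences in one alignment but keeps them separate in the other, the keys no longer match and those segments drop out, which is one way sentences can get lost during triplet alignment.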

In [None]:
from scripts.post_process import align_sents
_ = align_sents(
    src_lang='nl',
    tgt_lang='da',
    src_sents=src_sents,
    tgt_sents=tgt_sents,
    folder_path='tmp_src_ref'
)

* The output, 58 nl sents aligned to 51 da sents, implies:
    * Some sents lost their partners
    * bertalign accounted for 1-to-many alignments 

In [None]:
from scripts.post_process import align_sents
_ = align_sents(
    src_lang='nl',
    tgt_lang='da',
    src_sents=src_sents,
    tgt_sents=mt_sents,
    folder_path='tmp_src_mt'
)

* This time bertalign aligned 58 nl sents to 58 da sents
* The original da sents were only 51 though
* This implies that during triplet alignment, some sents will be discarded

In [None]:
from scripts.post_process import post_triplet_align

with open(join('tmp_src_ref', 'nl-da.nl'), 'r') as f:
    src_sents_org = [ln.strip() for ln in f]

with open(join('tmp_src_ref', 'nl-da.da'), 'r') as f:
    tgt_sents_org = [ln.strip() for ln in f]

with open(join('tmp_src_mt', 'nl-da.nl'), 'r') as f:
    src_sents_ali = [ln.strip() for ln in f]

with open(join('tmp_src_mt', 'nl-da.da'), 'r') as f:
    mt_sents_ali = [ln.strip() for ln in f]


post_triplet_align(
    src_sents_org=src_sents_org,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents_org,
    src_lang='nl',
    ref_lang='da',
    folder_path='tmp'
)

* Note that we could in theory run this directly on the `src_sents` and `tgt_sents` from the data manager rather than aligning twice


In [None]:
post_triplet_align(
    src_sents_org=src_sents,
    src_sents_ali=src_sents_ali,
    mt_sents_ali=mt_sents_ali,
    ref_sents_org=tgt_sents,
    src_lang='nl',
    ref_lang='da',
    folder_path='tmp_'
)

* But because of the issue observed above, this can result in more sentence loss. Hence we align twice to recover as many alignments as possible.

In [None]:
from scripts.scoring import ResultProducer
l2f = {'nl-da_full_fix': join('tmp', 'nl-da.jsonl'),
       'nl-da_half_fix': join('tmp_', 'nl-da.jsonl')}
rp = ResultProducer(label2files=l2f)
rp.compute_results()
rp.display_results()

* There is a visible impact on the BLEU score, but this is partly because we are working with only 44-58 sentences
* Observe how the full fix brought us closer to the original case, where 50 source sents were aligned with 50 target sents.