# Preparation
**IMPORTANT**: This notebook assumes that the `tasks` folder is in the same folder as the notebook, together with the logs `proc1.log` to `proc6.log`, `it-el.txt`, `it-el.log` and `it-el.mhtml`. Otherwise it serves as documentation only and should not be run.

* The translation tasks have been completed, and all translations are now stored in a fixed hierarchy within the `tasks` folder, which was backed up together with the associated logs. At the time of writing, it is unclear what can be made public: the logs contain IP addresses in some cases because of how errors were logged, whereas the translations could be uploaded to Google Drive and shared publicly. For now, we plan to provide them only to the evaluators of this project.

* In this notebook, we restructure the translation data to make it easier to work with (alignment and evaluation).
* First, we extract timing information from the proc1-3 logs, because the durations derived from the start times stored in the proc1-3 JSONL files include every automatic retry triggered by OpenAI's code.
* Second, we rename the files and move all translations into a single folder. The hierarchy is preserved through filename prefixes, and procedure-related information is omitted because it is not required for alignment and evaluation.

## Getting More Precise Timestamps
* In proc1 to proc3, there were cases where GPT-4.1 took longer than expected.
* The structured logs in the JSONL file store the start time when the method that calls the API is invoked and the end time when the response is received.
* However, this does not account for retries performed by OpenAI's code, so the retries were included in the time calculation.
* In this notebook, we use the raw logs to extract start and end times that are closer to the real translation time.
* We do this by looking at the `DEBUG` logs created by OpenAI's code, which contain an entry recording exactly when a request was sent.
* The end time comes from our own logs, via the `Translated X sents for src-tgt` message.
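The pairing rule described above can be sketched in plain Python. This is a minimal illustration rather than the code used in the cells of this notebook: the helper `extract_times` and the constant names are hypothetical, and the three sample lines are taken from `proc3.log`.

```python
import re
from datetime import datetime

TIME_PAT = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
DONE_PAT = re.compile(r"Translated \d+ sents for (\w\w-\w\w)")
FMT = "%Y-%m-%d %H:%M:%S"


def extract_times(lines):
    """Map each pair to the (start, end) datetimes of its final attempt."""
    times = {}
    last_request = None  # timestamp of the most recent request line
    for line in lines:
        stamp = TIME_PAT.search(line)
        if stamp is None:
            continue
        when = datetime.strptime(stamp.group(1), FMT)
        if "Sending HTTP Request" in line:
            # A retry overwrites the timestamp of the earlier attempt
            last_request = when
            continue
        done = DONE_PAT.search(line)
        if done and last_request is not None:
            times[done.group(1)] = (last_request, when)
            last_request = None
    return times


sample = [
    "DEBUG: 2025-05-09 13:26:42 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions",
    "DEBUG: 2025-05-09 13:31:43 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions",
    "INFO: 2025-05-09 13:36:34 - [✔️]: Translated 400 sents for fr-el",
]
start, end = extract_times(sample)["fr-el"]
duration = (end - start).total_seconds()
print(f"fr-el: {duration:.0f} s")
```

For `fr-el`, only the second request (13:31:43) is paired with the completion message (13:36:34), so the first attempt at 13:26:42 does not count towards the duration.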

In [1]:
!cat proc3.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | head -n 20

DEBUG: 2025-05-09 13:18:06 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:22:10 - [✔️]: Translated 400 sents for fr-da
DEBUG: 2025-05-09 13:22:10 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:26:42 - [✔️]: Translated 400 sents for fr-de
DEBUG: 2025-05-09 13:26:42 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:31:43 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:36:34 - [✔️]: Translated 400 sents for fr-el
DEBUG: 2025-05-09 13:36:35 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:39:17 - [✔️]: Translated 400 sents for fr-es
DEBUG: 2025-05-09 13:39:17 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:49:07 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:54:08 - Sending HTTP Reque

In [2]:
import json
from os.path import join
from datetime import datetime

with open(join('tasks', 'proc3.jsonl'), 'r') as f:
    logs = [json.loads(ln) for ln in f.readlines()]

pairs = [('sv', 'es'), ('pt', 'da')]

for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    if pair in pairs:
        print(pair)
        start = datetime.fromtimestamp(
            log['start']).strftime('%Y-%m-%d %H:%M:%S')
        end = datetime.fromtimestamp(log['end']).strftime('%Y-%m-%d %H:%M:%S')
        print('start', start)
        print('end', end)
        print('duration', log['end'] - log['start'])
        print()


('pt', 'da')
start 2025-05-09 19:19:38
end 2025-05-09 19:28:39
duration 541.1454193592072

('sv', 'es')
start 2025-05-09 21:21:34
end 2025-05-09 21:24:14
duration 160.7068531513214



* In some cases, the structured logs from `proc3.jsonl` got the time right because the automatic retries exceeded OpenAI's limit of 2 retries and triggered the retries implemented by us, which restarted the timing.
* However, in other cases, such as `pt-da`, the structured logs captured the start time `19:19:38`, which corresponds to the first attempt by OpenAI's code rather than the last, so the recorded duration is longer than the actual translation time.
* Our goal now is to go through the proc1 to proc3 logs and store the real start and end times. We lose some precision because the raw logs only have one-second resolution, unlike the Unix timestamps in the structured logs; since we mainly work in seconds, this should not make a big difference.
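To make the precision loss concrete, the following sketch compares an exact duration with the one obtained after dropping the fractional seconds, as happens when only `HH:MM:SS` timestamps are available. The epoch values are hypothetical, and the assumption that the logger truncates (rather than rounds) to whole seconds is ours.

```python
import math


def coarse_duration(start, end):
    """Duration computed from timestamps truncated to whole seconds."""
    return math.floor(end) - math.floor(start)


# Hypothetical Unix timestamps with sub-second precision
start, end = 1000.90, 1160.10

exact = end - start                    # about 159.2 s
coarse = coarse_duration(start, end)   # 1160 - 1000 = 160 s
error = coarse - exact                 # about +0.8 s

print(f"exact={exact:.2f}s coarse={coarse}s error={error:+.2f}s")
```

Truncating both endpoints shifts a duration by less than one second in either direction, which is small compared with translation times of several minutes.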

In [3]:
!cat proc2.log proc3.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" > tmp_proc2-3.log

In [4]:
with open('tmp_proc2-3.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

In [5]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        pair2log_idx[pair] = i

In [6]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    idx = pair2log_idx[pair]
    start = re.search(time_pat, logs[idx-1]).group(1)
    end = re.search(time_pat, logs[idx]).group(1)
    start = datetime.strptime(start, fmt)
    start_unix = time.mktime(start.timetuple())
    end = datetime.strptime(end, fmt)
    end_unix = time.mktime(end.timetuple())
    pair2time[pair] = {'start': start_unix, 'end': end_unix}


In [7]:
# Number of translations generated by Proc2
!ls tasks/proc2/europarl/gpt-4.1-2025-04-14/*.txt | wc -l

44


In [8]:
# Number of translations generated by Proc 3
!ls tasks/proc3/europarl-*/gpt-4.1-2025-04-14/*.txt | wc -l

42


In [9]:
proc2_success = 44
proc3_success = 42
# Every successful proc2/proc3 translation should have exactly one time entry
assert len(pair2time) == proc2_success + proc3_success
proc2_3_gpt_ep = pair2time

* For Proc1, we need to be a bit smarter: Proc1 involved EuroParl, Opus100 and Flores+, so we cannot use the pair alone as the key.

In [10]:
!cat proc1.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" | grep -P -A1 "^(DEBUG)" > tmp_proc1.log

In [11]:
!cat tmp_proc1.log | head -n 10

DEBUG: 2025-05-07 13:55:46 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-07 13:59:58 - [✔️]: Translated 398 sents for de-en
DEBUG: 2025-05-07 13:59:58 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-07 14:03:12 - [✔️]: Translated 400 sents for en-de
DEBUG: 2025-05-07 14:03:12 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-07 14:05:34 - [✔️]: Translated 400 sents for da-en
DEBUG: 2025-05-07 14:05:34 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-07 14:08:09 - [✔️]: Translated 400 sents for en-da
DEBUG: 2025-05-07 14:08:09 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-07 14:12:09 - [✔️]: Translated 400 sents for el-en


In [12]:
with open('tmp_proc1.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

* We exploit the fact that the logs are in chronological order.
* We know which dataset was used with GPT-4.1 first, second and third, so we can rely on the indices.

In [13]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        if pair not in pair2log_idx:
            pair2log_idx[pair] = [i]
        else:
            pair2log_idx[pair].append(i)

In [14]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    indices = pair2log_idx[pair]
    pair2time[pair] = []
    for idx in indices:
        if logs[idx-1].startswith('DEBUG'):
            start = re.search(time_pat, logs[idx-1]).group(1)
            end = re.search(time_pat, logs[idx]).group(1)
            start = datetime.strptime(start, fmt)
            start_unix = time.mktime(start.timetuple())
            end = datetime.strptime(end, fmt)
            end_unix = time.mktime(end.timetuple())
            pair2time[pair].append({'start': start_unix, 'end': end_unix})


* Since Proc1 was successful throughout, we expect 3 time values for each pair.

In [15]:
check = [len(times) == 3 for times in pair2time.values()]
all(check)

True

In [16]:
!cat tasks/proc1/europarl/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'
!cat tasks/proc1/flores_plus/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'
!cat tasks/proc1/opus-100/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'

    "task_id": "9549e927-4f63-4205-82bb-5c7ccabfe943",
    "task_id": "fdbf6190-d061-4154-ae94-d7b11199d043",
    "task_id": "fa628bc1-4f85-446d-9167-1a8d99ccc493",


In [17]:
!cat proc1.log | grep -P "Starting task (9549e927-|fdbf6190-|fa628bc1-)"

INFO: 2025-05-07 13:55:45 - [🏁]: Starting task 9549e927-4f63-4205-82bb-5c7ccabfe943 on commit 66ef922
INFO: 2025-05-07 15:01:53 - [🏁]: Starting task fa628bc1-4f85-446d-9167-1a8d99ccc493 on commit 66ef922
INFO: 2025-05-07 15:41:35 - [🏁]: Starting task fdbf6190-d061-4154-ae94-d7b11199d043 on commit 66ef922


* So we infer that the first one is EuroParl, the second is Opus100 and the third is FloresPlus
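The same inference can be made programmatically by sorting the task ids by their start time. This is only a sketch: the ids and timestamps are copied from the two outputs above, and the short dataset names (`ep`, `flores`, `opus`) follow the folder each `task.json` was read from.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# task_id -> dataset, in the order the task.json files were printed
task_dataset = {
    "9549e927-4f63-4205-82bb-5c7ccabfe943": "ep",
    "fdbf6190-d061-4154-ae94-d7b11199d043": "flores",
    "fa628bc1-4f85-446d-9167-1a8d99ccc493": "opus",
}

# task_id -> start time, from the "Starting task" lines in proc1.log
task_start = {
    "9549e927-4f63-4205-82bb-5c7ccabfe943": "2025-05-07 13:55:45",
    "fa628bc1-4f85-446d-9167-1a8d99ccc493": "2025-05-07 15:01:53",
    "fdbf6190-d061-4154-ae94-d7b11199d043": "2025-05-07 15:41:35",
}

# Sort the task ids chronologically and translate them to dataset names
ordered_ids = sorted(task_start, key=lambda t: datetime.strptime(task_start[t], FMT))
dataset_order = [task_dataset[t] for t in ordered_ids]
print(dataset_order)
```

The resulting order is the one used to label the three time entries per pair.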

In [18]:
pair2time_modified = {}
for pair, times in pair2time.items():
    pair2time_modified[pair] = {'ep': times[0], 'opus': times[1], 'flores': times[2]}

proc1_gpt = pair2time_modified

* For Proc 4 to 6, we can rely on structured logs as there were no automatic retries from OpenAI's side anymore.
* We can still confirm as a sanity check.

In [19]:
!cat proc4.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" > tmp_proc4.log

In [20]:
with open('tmp_proc4.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

In [21]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        pair2log_idx[pair] = i

In [22]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    idx = pair2log_idx[pair]
    start = re.search(time_pat, logs[idx-1]).group(1)
    end = re.search(time_pat, logs[idx]).group(1)
    start = datetime.strptime(start, fmt)
    start_unix = time.mktime(start.timetuple())
    end = datetime.strptime(end, fmt)
    end_unix = time.mktime(end.timetuple())
    pair2time[pair] = {'start': start_unix, 'end': end_unix}

In [23]:
len(pair2time)

90

In [24]:
with open(join('tasks', 'proc4.jsonl'), 'r') as f:
    logs = [json.loads(ln) for ln in f.readlines()]
len(logs)

90

In [25]:
pair_keys_1 = set(pair2time.keys())
pair_keys_2 = set([f'{log["src_lang"]}-{log["tgt_lang"]}' for log in logs])
pair_keys_1 == pair_keys_2

True

In [26]:
diffs = []
for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    pair_key = '-'.join(pair)
    structured_dur = log['end'] - log['start']
    unstructured_dur = pair2time[pair_key]['end'] - pair2time[pair_key]['start']
    diff = structured_dur - unstructured_dur
    diffs.append(diff)

print(f'Max diff: {max(diffs):.2f} s')
print(f'Min diff: {min(diffs):.2f} s')
print(f'Mean diff: {sum(diffs)/len(diffs):.2f} s')

Max diff: 0.79 s
Min diff: -1.21 s
Mean diff: -0.19 s


* The differences stay within roughly ±1.2 s, which is negligible for our purposes.

In [27]:
from random import sample
selected_pairs = sample(list(pair_keys_1), 5)
for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    pair_key = '-'.join(pair)
    if pair_key in selected_pairs:
        print(pair_key)
        print('duration (structured log)', f'{log["end"] - log["start"]:.2f} s')
        print('duration (unstructured log)', f'{pair2time[pair_key]["end"] - pair2time[pair_key]["start"]:.2f} s')
        print()


da-de
duration (structured log) 179.83 s
duration (unstructured log) 180.00 s

de-fi
duration (structured log) 290.98 s
duration (unstructured log) 292.00 s

el-sv
duration (structured log) 160.37 s
duration (unstructured log) 160.00 s

it-pt
duration (structured log) 155.24 s
duration (unstructured log) 155.00 s

it-sv
duration (structured log) 196.79 s
duration (unstructured log) 196.00 s



## Preparing Files for Analysis
* At the moment, all files are stored hierarchically in the `tasks` folder

```
.
├── proc1
│   ├── europarl
│   │   ├── deepl_document
│   │   │   ├── da-en.txt
│   │   │   ├── ...
│   │   │   └── task.json
│   │   └── gpt-4.1-2025-04-14
│   │       ├── da-en.txt
│   │       ├── ...
│   │       └── task.json
│   ├── flores_plus
│   │   ├── deepl_document
│   │   └── gpt-4.1-2025-04-14
│   └── opus-100
│       ├── deepl_document
│       └── gpt-4.1-2025-04-14
├── proc1.jsonl
├── proc2
│   └── europarl
│       └── gpt-4.1-2025-04-14
├── proc2.jsonl
├── proc3
│   ├── europarl-fr
│   │   └── gpt-4.1-2025-04-14
│   ├── europarl-it
│   │   └── gpt-4.1-2025-04-14
│   ├── europarl-nl
│   │   └── gpt-4.1-2025-04-14
│   ├── europarl-pt
│   │   └── gpt-4.1-2025-04-14
│   └── europarl-sv
│       └── gpt-4.1-2025-04-14
├── proc3.jsonl
├── proc4
│   ├── flores_plus-da
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-de
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-el
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-es
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-fi
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-fr
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-it
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-nl
│   │   └── gpt-4.1-2025-04-14
│   ├── flores_plus-pt
│   │   └── gpt-4.1-2025-04-14
│   └── flores_plus-sv
│       └── gpt-4.1-2025-04-14
├── proc4.jsonl
├── proc5
│   ├── europarl
│   │   └── gpt-4.1-2025-04-14
│   └── flores_plus
│       └── gpt-4.1-2025-04-14
│           ├── de-fi.txt
│           └── task.json
├── proc5.jsonl
├── proc6
│   ├── europarl
│   │   └── deepl_document
│   └── flores_plus
│       └── deepl_document
└── proc6.jsonl
```


* This is not very convenient for analysis, so we store them all in a single folder and preserve the hierarchical information through filename prefixes.
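A minimal sketch of the naming scheme, assuming paths of the form `tasks/<proc>/<dataset>/<translator>/<pair>.txt`. The helper `make_prefix` is hypothetical and takes a slightly different route from the notebook's own cell: it strips the source-language suffix whenever the folder name is not a known dataset, instead of branching on the procedure name.

```python
from pathlib import PurePosixPath

LONG2SHORT = {
    "gpt-4.1-2025-04-14": "gpt",
    "deepl_document": "deepl",
    "opus-100": "opus",
    "europarl": "ep",
    "flores_plus": "flores",
}


def make_prefix(path):
    """Turn tasks/<proc>/<dataset>/<translator>/<pair>.txt into a flat prefix."""
    parts = PurePosixPath(path).parts
    dataset, translator = parts[-3], parts[-2]
    pair = PurePosixPath(parts[-1]).stem
    # Some folders carry a source-language suffix, e.g. europarl-fr
    if dataset not in LONG2SHORT:
        dataset = dataset.rsplit("-", 1)[0]
    return f"{LONG2SHORT[dataset]}-{LONG2SHORT[translator]}-{pair}"


print(make_prefix("tasks/proc3/europarl-fr/gpt-4.1-2025-04-14/fr-da.txt"))
print(make_prefix("tasks/proc1/opus-100/deepl_document/da-en.txt"))
```

Because `opus-100` is itself a key of the mapping, its hyphen is not mistaken for a language suffix.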

In [28]:
import os
from os.path import join
import re
import json

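# Short names used as filename prefixes
long2short = {
    'gpt-4.1-2025-04-14': 'gpt',
    'deepl_document': 'deepl',
    'opus-100': 'opus',
    'europarl': 'ep',
    'flores_plus': 'flores'
}

prefix2file = {}

# Every entry in `tasks` that is not a JSONL log is a procedure folder (proc1 to proc6);
# sorting makes the processing order deterministic
procs = sorted(p for p in os.listdir('tasks') if not p.endswith('.jsonl'))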
for proc in procs:
    with open(join('tasks', f'{proc}.jsonl'), 'r') as fi:
        logs = [json.loads(ln) for ln in fi]
    
    proc_path = join('tasks', proc)
    datasets = os.listdir(proc_path)
    for dataset in datasets:
        dataset_path = join(proc_path, dataset)
        addon = None
        
        # Account for different procedures
        if proc in ['proc1', 'proc2', 'proc5', 'proc6']:
            short_ds = long2short[dataset]
        else:
            short_ds = long2short[dataset.split('-')[0]]
            addon = dataset.split('-')[1]
        
        translators = os.listdir(dataset_path)
        for translator in translators:
            translator_path = join(dataset_path, translator)
            short_tl = long2short[translator]

            translations = [t for t in os.listdir(translator_path) if t.endswith('.txt')]
            num_of_translations = len(translations)
            if num_of_translations == 0:
                continue
            
            
            with open(join(translator_path, 'task.json')) as f:
                task = json.load(f)
            
            for translation in translations:
                pair = translation.replace('.txt', '')
                # A failed pair is one that contains substantially more or less than 400 sentences
                if 'fail' in pair:
                    print(f'Skip failed entry: {proc}: {short_ds}-{short_tl}-{pair}')
                    num_of_translations -= 1
                    continue

                prefix = f'{short_ds}-{short_tl}-{pair}'
                prefix2file[prefix] = {'file': join(translator_path, translation), 'procedure': proc, 'task':task}
                # Attach the structured log entry that matches this translation
                for log in logs:
                    log_pair = f'{log["src_lang"]}-{log["tgt_lang"]}'
                    log_translator = log['translator']
                    log_dataset = long2short[log['dataset'].split('/')[1]]
                    if log_pair == pair and log_translator==translator and log_dataset==short_ds:
                        prefix2file[prefix]['log'] = log
            # Source-language suffix of the dataset folder, if any, shown in the summary line
            add = f'({addon})' if addon else ''
            print(f'Transfer {num_of_translations} translations from {proc}: task={short_ds}-{short_tl} {add}\n')

Transfer 20 translations from proc1: task=ep-deepl 

Transfer 20 translations from proc1: task=ep-gpt 

Transfer 20 translations from proc1: task=flores-deepl 

Transfer 20 translations from proc1: task=flores-gpt 

Transfer 20 translations from proc1: task=opus-deepl 

Transfer 20 translations from proc1: task=opus-gpt 

Transfer 44 translations from proc2: task=ep-gpt 

Transfer 9 translations from proc3: task=ep-gpt (fr)

Transfer 7 translations from proc3: task=ep-gpt (it)

Transfer 9 translations from proc3: task=ep-gpt (nl)

Transfer 9 translations from proc3: task=ep-gpt (pt)

Transfer 8 translations from proc3: task=ep-gpt (sv)

Transfer 9 translations from proc4: task=flores-gpt (da)

Skip failed entry: proc4: flores-gpt-de-fi_fail2
Transfer 8 translations from proc4: task=flores-gpt (de)

Transfer 9 translations from proc4: task=flores-gpt (el)

Transfer 9 translations from proc4: task=flores-gpt (es)

Transfer 9 translations from proc4: task=flores-gpt (fi)

Transfer 9 trans

Reminder:
* proc1 had no issues; no translation was skipped
* proc2 failed to translate `fi-el`, which was skipped, hence 44 instead of the expected 45
* proc3 failed to translate `it-el`, `it-fi` and `sv-el`, hence 7 and 8 instead of the expected 9
* proc4 failed on `de-fi`: no error occurred, but the output did not contain the expected number of sentences
* proc5 was used to translate the 5 skipped pairs; 4 (3 ep, 1 flores) succeeded, and `it-el` failed again
* proc6 had no issues

In [29]:
opus = 0
flores = 0
ep = 0
deepl = 0
gpt = 0
for prefix, info in prefix2file.items():
    p1 = prefix.split('-')[0]
    p2 = prefix.split('-')[1]
    if p2 == 'deepl':
        deepl+=1
    if p2 == 'gpt':
        gpt+=1
    if p1 == 'opus':
        opus+=1
    if p1 == 'ep':
        ep+=1
    if p1 == 'flores':
        flores+=1


print(f'opus:{opus}\nep:{ep}\nflores:{flores}\n')
print(f'deepl:{deepl}\ngpt:{gpt}')

opus:40
ep:219
flores:220

deepl:240
gpt:239
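
As a cross-check, the GPT-4.1 total can be reproduced from the per-procedure notes. The sketch below only restates numbers that appear in this notebook (20 pairs per proc1 dataset, 90 pairs in the proc4 log); the dictionary `gpt_counts` is introduced here for illustration.

```python
# Successful GPT-4.1 translations per procedure
gpt_counts = {
    "proc1": 3 * 20,  # three datasets, 20 pairs each
    "proc2": 44,      # 45 expected, fi-el skipped
    "proc3": 42,      # 45 expected, it-el, it-fi and sv-el skipped
    "proc4": 90 - 1,  # de-fi had an unexpected number of sentences
    "proc5": 4,       # retried the 5 skipped pairs, it-el failed again
}

total_gpt = sum(gpt_counts.values())
print(f"gpt: {total_gpt}")
```

The sum matches the 239 GPT-4.1 entries counted above, one short of the 240 DeepL entries; the missing pair is `it-el`.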


In [30]:
# Update times
for pair, datasets in proc1_gpt.items():
    for dataset, times in datasets.items():
        key = f'{dataset}-gpt-{pair}'
        diff1 = prefix2file[key]['log']['end'] - prefix2file[key]['log']['start']
        diff2 = times['end'] - times['start']
        prefix2file[key]['log']['start'] = times['start']
        prefix2file[key]['log']['end'] = times['end']
        if round(diff1, 2) != round(diff2, 2):
            if abs(diff1-diff2) > 2:
                add = '[!]'
            else:
                add = ''
            print(f'{key}: {diff1:.2f}s -> {diff2:.2f}s {add}')

ep-gpt-de-en: 251.73s -> 252.00s 
opus-gpt-de-en: 76.07s -> 76.00s 
flores-gpt-de-en: 107.51s -> 108.00s 
ep-gpt-en-de: 193.44s -> 194.00s 
opus-gpt-en-de: 114.27s -> 114.00s 
flores-gpt-en-de: 169.91s -> 170.00s 
ep-gpt-da-en: 141.60s -> 142.00s 
opus-gpt-da-en: 98.62s -> 99.00s 
flores-gpt-da-en: 128.14s -> 129.00s 
ep-gpt-en-da: 155.34s -> 155.00s 
opus-gpt-en-da: 136.54s -> 137.00s 
flores-gpt-en-da: 186.58s -> 187.00s 
ep-gpt-el-en: 239.32s -> 240.00s 
opus-gpt-el-en: 98.73s -> 98.00s 
flores-gpt-el-en: 220.27s -> 220.00s 
ep-gpt-en-el: 238.58s -> 239.00s 
opus-gpt-en-el: 217.24s -> 217.00s 
flores-gpt-en-el: 599.17s -> 299.00s [!]
ep-gpt-pt-en: 88.44s -> 89.00s 
opus-gpt-pt-en: 85.96s -> 86.00s 
flores-gpt-pt-en: 169.84s -> 170.00s 
ep-gpt-en-pt: 120.94s -> 121.00s 
opus-gpt-en-pt: 105.46s -> 106.00s 
flores-gpt-en-pt: 187.91s -> 188.00s 
ep-gpt-sv-en: 110.78s -> 111.00s 
opus-gpt-sv-en: 64.26s -> 64.00s 
flores-gpt-sv-en: 133.35s -> 133.00s 
ep-gpt-en-sv: 105.27s -> 105.00s 
opu

* In theory, we could have used a more sophisticated regular expression to handle only the cases where the duration is reduced substantially, but for the sake of completeness we updated all proc1 GPT tasks.
* There is a slight loss in precision because the structured logs store Unix epoch times directly, whereas the raw logs use timestamps with one-second resolution.
* The important cases, where the reduction was greater than 2 seconds, are marked with `[!]`.

In [31]:
for pair, times in proc2_3_gpt_ep.items():
    key = f'ep-gpt-{pair}'
    diff1 = prefix2file[key]['log']['end'] - prefix2file[key]['log']['start']
    diff2 = times['end'] - times['start']
    prefix2file[key]['log']['start'] = times['start']
    prefix2file[key]['log']['end'] = times['end']
    if round(diff1, 2) != round(diff2, 2):
        if abs(diff1-diff2) > 2:
            add = '[!]'
        else:
            add = ''
        print(f'{key}: {diff1:.2f}s -> {diff2:.2f}s {add}')

ep-gpt-da-de: 156.66s -> 157.00s 
ep-gpt-da-el: 278.81s -> 279.00s 
ep-gpt-da-es: 218.03s -> 218.00s 
ep-gpt-da-fi: 588.78s -> 288.00s [!]
ep-gpt-da-fr: 166.79s -> 167.00s 
ep-gpt-da-it: 232.04s -> 232.00s 
ep-gpt-da-nl: 244.76s -> 245.00s 
ep-gpt-da-pt: 213.71s -> 214.00s 
ep-gpt-da-sv: 202.09s -> 202.00s 
ep-gpt-de-da: 178.06s -> 178.00s 
ep-gpt-de-el: 569.56s -> 269.00s [!]
ep-gpt-de-es: 201.55s -> 202.00s 
ep-gpt-de-fi: 205.06s -> 205.00s 
ep-gpt-de-fr: 227.59s -> 228.00s 
ep-gpt-de-it: 173.11s -> 173.00s 
ep-gpt-de-nl: 209.11s -> 209.00s 
ep-gpt-de-pt: 211.15s -> 211.00s 
ep-gpt-de-sv: 511.33s -> 192.00s [!]
ep-gpt-el-da: 280.89s -> 281.00s 
ep-gpt-el-de: 230.38s -> 231.00s 
ep-gpt-el-es: 220.42s -> 221.00s 
ep-gpt-el-fi: 256.99s -> 257.00s 
ep-gpt-el-fr: 219.91s -> 220.00s 
ep-gpt-el-it: 278.71s -> 279.00s 
ep-gpt-el-nl: 170.33s -> 171.00s 
ep-gpt-el-pt: 236.71s -> 237.00s 
ep-gpt-el-sv: 200.67s -> 201.00s 
ep-gpt-es-da: 243.37s -> 244.00s 
ep-gpt-es-de: 194.29s -> 194.00s 
ep-gp

* For proc2 and proc3 the effect is much more noticeable: more entries are marked with `[!]`.
* Updating the times gives a clearer picture of the 'true translation time', that is, the time of the final attempt without the preceding retries.

### Adding `it-el`
* One pair had to be obtained by other means; see the last 3 cells of `it-el.ipynb`.
* We add this pair to `prefix2file` as well.

In [32]:
prefix = 'ep-gpt-it-el'
file_path = 'it-el.txt'
with open(join('tmp-it-el', 'task.json')) as f:
    task = json.load(f)

In [33]:
!cat it-el.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Failed)"
!cat it-el.log | grep -P "Task took"

DEBUG: 2025-05-16 10:50:52 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-16 10:55:53 - [⏭️]: Failed 0 times, skipping it-el...
INFO: 2025-05-16 10:55:53 - [🏁]: Task took 301.63s


In [34]:
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
start_stamp = "2025-05-16 10:50:52"
end_stamp = "2025-05-16 10:55:53"
start = datetime.strptime(start_stamp, fmt).timestamp()
end = datetime.strptime(end_stamp, fmt).timestamp()
time_entries = {'start': start, 'end': end}
# The following is just perfectionism

In [35]:
# Single Web Page file of the log page from OpenAI Platform
!cat it-el.mhtml | grep -P 'chatcmp' | head -n2
!cat it-el.mhtml | grep -P 'Input'
!cat it-el.mhtml | grep -P -A1 'Output'

Snapshot-Content-Location: https://platform.openai.com/logs/chatcmpl-BXl4IuCVnFZTSuF2Nwy7nMkqNRndr
Subject: Logs - chatcmpl-BXl4IuCVnFZTSuF2Nwy7nMkqNRndr - OpenAI API
edium">Input</span></div><span class=3D"font-medium">17,445t</span></div><d=
x items-center gap-1"><span class=3D"font-medium">Output</span></div><span =
class=3D"font-medium">24,907t</span></div><div class=3D"SzU-2"><div class=


In [36]:
from scripts.data_management import EuroParlManager
from scripts.util import split_sents
from scripts.logger import TranslationLogger
dm = EuroParlManager()
with open('it-el.txt', 'r') as f:
    mt_lines = [ln.strip() for ln in f]

src_lines, tgt_lines = dm.get_sentence_pairs(src_lang='it', tgt_lang='el', num_of_sents=400)
src_text = '\n'.join(src_lines)
mt_text = '\n'.join(mt_lines)
src_sents = split_sents(src_text, lang='it')
mt_sents = split_sents(mt_text, lang='el')

logger = TranslationLogger(logfile='tmp')

logger.start(src_lang='it', tgt_lang='el', src_text=src_text)
logger.finish(tgt_text=mt_text, in_model_tokens=17445, out_model_tokens=24907)
retry = {
    'prev_id' : None,
    'reason': 'exceptional case; obtained through OpenAI logs, id=chatcmpl-BXl4IuCVnFZTSuF2Nwy7nMkqNRndr'
}
logger.add_entry(manual_retry=retry)
logger.add_entry(translator='gpt-4.1-2025-04-14')
logger.add_entry(dataset='Helsinki-NLP/europarl')
log = logger.curr_log
log.update(time_entries)
log.update({'manual_retry':True})
log.update({'error_msg':'504 Gateway Timeout'})
log['end']-log['start']  # should be around 5min

301.0

In [37]:
prefix2file[prefix] = {'file':file_path, 'log':log, 'task':task}

### Transfer

In [38]:
# Copy all files from tasks into a folder called translations (the originals stay in place)
import shutil
os.makedirs('translations', exist_ok=True)
for prefix, info in prefix2file.items():
    shutil.copy(info['file'], join('translations', f'{prefix}.txt'))

with open(join('translations', 'info.json'), 'w') as f:
    json.dump(prefix2file, fp=f, indent=4)

In [39]:
check = os.listdir('translations')
# Should contain 240 DeepL + 240 GPT translations + info.json = 481
len(check) == 481

True

Again, all files required for this notebook to run will be provided to the evaluators of this project. The notebook simply documents how the `translations` folder was created, which is used for alignment and evaluation in the remainder of this project. The `translations` folder may be made available to the public.

## Alignment Consideration
* We can roughly guess which translations may yield low BLEU scores due to misalignment: those whose number of output lines differs from the expected 400.

In [40]:
from os.path import join
import json
with open(join('translations', 'info.json'), 'r') as f:
    prefix2file = json.load(f)

num = 0
for prefix, info in prefix2file.items():
    outlines = info['log']['out_lines']
    if outlines != 400:
        print(prefix, outlines)
        num += 1

print(f'{num} pairs likely need to be re-aligned')

ep-gpt-de-en 398
ep-gpt-en-fi 401
ep-gpt-es-en 401
opus-gpt-de-en 372
opus-gpt-pt-en 402
ep-gpt-da-it 399
ep-gpt-de-fi 399
ep-gpt-el-it 399
ep-gpt-el-nl 399
ep-gpt-es-el 399
ep-gpt-es-nl 399
ep-gpt-fi-it 399
ep-gpt-fi-nl 399
ep-gpt-pt-fi 401
ep-gpt-sv-fi 399
flores-gpt-es-el 402
flores-gpt-fi-da 402
flores-gpt-fi-el 402
flores-gpt-fi-es 402
flores-gpt-fi-nl 402
flores-gpt-fi-pt 402
flores-gpt-it-fr 402
ep-gpt-it-fi 399
23 pairs likely need to be re-aligned
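
To prioritise re-alignment, the deviations from 400 lines can be ranked. The sketch below is illustrative only: it uses a hand-copied subset of the pairs listed above, and the names `out_lines`, `deviation` and `ranked` are introduced here.

```python
EXPECTED = 400

# Subset of the output-line counts printed above
out_lines = {
    "ep-gpt-de-en": 398,
    "opus-gpt-de-en": 372,
    "opus-gpt-pt-en": 402,
    "ep-gpt-da-it": 399,
    "flores-gpt-fi-da": 402,
}

# Signed deviation: negative means lines are missing, positive means extra lines
deviation = {prefix: n - EXPECTED for prefix, n in out_lines.items()}

# Largest absolute deviation first
ranked = sorted(deviation, key=lambda prefix: abs(deviation[prefix]), reverse=True)
for prefix in ranked:
    print(f"{prefix}: {deviation[prefix]:+d}")
```

In this subset, `opus-gpt-de-en` stands out with 28 missing lines, while the other pairs are off by one or two.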
