## Preparation
* The translation tasks have been completed, and all translations are now stored in a certain hierarchy within the `tasks` folder, which was backed up together with its associated logs. At the time of writing, it is unclear what can be made public: the logs contain IP addresses in some cases because of how errors were logged, whereas the translations could be uploaded to Google Drive and shared with everyone. For now, we plan to provide them only to the evaluators of this project.

* In this notebook, we restructure the translation data to make it easier to work with (alignment and evaluation).
* First, we extract timing information from the proc1-3 logs, because the start times stored in the proc1-3 JSONL files precede any automatic retries triggered by OpenAI's code, so the recorded durations include them.
* Second, we rename the files and move all translations into a single folder. We preserve the hierarchy through filename prefixes and omit procedure-related information, as it is not required for alignment and evaluation.

## Getting More Precise Timestamps
* In proc 1 to 3, there were cases where GPT4.1 took longer than expected.
* The structured logs in the JSONL file stored the start time at which the method that calls the API was invoked and the end time at which the response was received.
* However, this did not account for retries performed by OpenAI's code, so the retries were included in the time calculation.
* In this notebook, we use the logs to extract start and end times that are more in line with the real translation time.
* We do this by looking at the `DEBUG` logs created by OpenAI's code, which contain an entry for the exact moment a request was sent.
* We obtain the end time from our own logs by looking for the `Translated X sents for src-tgt` message.
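
The pairing described above can also be sketched in plain Python. This is a minimal sketch, assuming log lines follow the `LEVEL: YYYY-MM-DD HH:MM:SS - message` shape seen in the proc logs; `extract_times` is a hypothetical helper and not part of the project code.

```python
import re
from datetime import datetime

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
REQ = re.compile(r"^DEBUG: " + TS + r" - Sending HTTP Request: POST")
DONE = re.compile(r"^INFO: " + TS + r" - .*Translated \d+ sents for (\w\w-\w\w)")
FMT = "%Y-%m-%d %H:%M:%S"


def extract_times(lines):
    """Map each language pair to the (start, end) of its final request.

    The start is the last 'Sending HTTP Request' line seen before the
    'Translated ...' line, so earlier automatic retries are ignored.
    """
    times = {}
    last_request = None
    for line in lines:
        m = REQ.match(line)
        if m:
            last_request = datetime.strptime(m.group(1), FMT)
            continue
        m = DONE.match(line)
        if m and last_request is not None:
            end = datetime.strptime(m.group(1), FMT)
            times[m.group(2)] = (last_request, end)
            last_request = None
    return times


# Three lines taken from proc3.log: one retried request, then success
sample = [
    "DEBUG: 2025-05-09 13:26:42 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions",
    "DEBUG: 2025-05-09 13:31:43 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions",
    "INFO: 2025-05-09 13:36:34 - [✔️]: Translated 400 sents for fr-el",
]
start, end = extract_times(sample)["fr-el"]
print((end - start).total_seconds())  # 291.0
```

Only the request at `13:31:43` counts as the start, so the retry that began at `13:26:42` does not inflate the duration.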

In [33]:
!cat proc3.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | head -n 20

DEBUG: 2025-05-09 13:18:06 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:22:10 - [✔️]: Translated 400 sents for fr-da
DEBUG: 2025-05-09 13:22:10 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:26:42 - [✔️]: Translated 400 sents for fr-de
DEBUG: 2025-05-09 13:26:42 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:31:43 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:36:34 - [✔️]: Translated 400 sents for fr-el
DEBUG: 2025-05-09 13:36:35 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
INFO: 2025-05-09 13:39:17 - [✔️]: Translated 400 sents for fr-es
DEBUG: 2025-05-09 13:39:17 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:49:07 - Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
DEBUG: 2025-05-09 13:54:08 - Sending HTTP Reque

In [34]:
import json
from os.path import join
from datetime import datetime

with open(join('tasks', 'proc3.jsonl'), 'r') as f:
    logs = [json.loads(ln) for ln in f.readlines()]

pairs = [('sv', 'es'), ('pt', 'da')]

for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    if pair in pairs:
        print(pair)
        start = datetime.fromtimestamp(
            log['start']).strftime('%Y-%m-%d %H:%M:%S')
        end = datetime.fromtimestamp(log['end']).strftime('%Y-%m-%d %H:%M:%S')
        print('start', start)
        print('end', end)
        print('duration', log['end'] - log['start'])
        print()


('pt', 'da')
start 2025-05-09 19:19:38
end 2025-05-09 19:28:39
duration 541.1454193592072

('sv', 'es')
start 2025-05-09 21:21:34
end 2025-05-09 21:24:14
duration 160.7068531513214



* In some cases, the structured logs from `proc3.jsonl` got the time right because the automatic retries exceeded OpenAI's limit of 2 retries and triggered the retries implemented by us, which restarted the timing.
* However, in other cases, such as `pt-da`, we observe that the structured logs captured the start time `19:19:38`, which corresponds to the first try of OpenAI's code, not the last; hence the recorded duration is longer than it really was.
* Our goal now is to go through the proc1 to proc3 logs and store the real start and end times. We lose precision because we are no longer working with the original Unix timestamps; however, since we mainly work at the granularity of seconds, it should not make a big difference.
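
To illustrate why second-level precision is acceptable: if both endpoints are cut to whole seconds, the resulting duration differs from the exact one by less than one second in either direction. The sketch below assumes that the text logs truncate timestamps to whole seconds; the timestamps are made up for illustration and `truncated_duration` is a hypothetical helper.

```python
import math


def truncated_duration(start: float, end: float) -> int:
    """Duration computed from timestamps cut to whole seconds, as in text logs."""
    return math.floor(end) - math.floor(start)


# Hypothetical Unix timestamps with sub-second parts (not taken from the logs)
start, end = 1746800000.95, 1746800180.05
exact = end - start                      # about 179.1 s
coarse = truncated_duration(start, end)  # 180 s
print(f"exact={exact:.2f}s coarse={coarse}s error={coarse - exact:+.2f}s")

# Flooring each endpoint moves it by less than 1 s, so the error stays below 1 s
assert abs(coarse - exact) < 1.0
```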

In [35]:
!cat proc2.log proc3.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" > tmp_proc2-3.log

In [36]:
with open('tmp_proc2-3.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

In [37]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        pair2log_idx[pair] = i

In [38]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    idx = pair2log_idx[pair]
    start = re.search(time_pat, logs[idx-1]).group(1)
    end = re.search(time_pat, logs[idx]).group(1)
    start = datetime.strptime(start, fmt)
    start_unix = time.mktime(start.timetuple())
    end = datetime.strptime(end, fmt)
    end_unix = time.mktime(end.timetuple())
    pair2time[pair] = {'start': start_unix, 'end': end_unix}


In [39]:
# Number of translations generated by Proc2
!ls tasks/proc2/europarl/gpt-4.1-2025-04-14/*.txt | wc -l

44


In [40]:
# Number of translations generated by Proc 3
!ls tasks/proc3/europarl-*/gpt-4.1-2025-04-14/*.txt | wc -l

42


In [41]:
proc2_success = 44 
proc3_success = 42
len(pair2time) == proc2_success + proc3_success

True

In [42]:
import json
with open('proc2-3-europarl-gpt.json', 'w') as f:
    json.dump(pair2time, fp=f, indent=4)

* For Proc1, we need to be a bit smarter: Proc1 covered EuroParl, Opus100 and Flores+, so we cannot use the language pair as the sole key.

In [43]:
!cat proc1.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" | grep -P -A1 "^(DEBUG)" > tmp_proc1.log

In [44]:
with open('tmp_proc1.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

* We exploit the fact that the logs are in chronological order
* We know which dataset was used with GPT4.1 first, second and third, so we can rely on the indices.

In [45]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        if pair not in pair2log_idx:
            pair2log_idx[pair] = [i]
        else:
            pair2log_idx[pair].append(i)

In [46]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    indices = pair2log_idx[pair]
    pair2time[pair] = []
    for idx in indices:
        if logs[idx-1].startswith('DEBUG'):
            start = re.search(time_pat, logs[idx-1]).group(1)
            end = re.search(time_pat, logs[idx]).group(1)
            start = datetime.strptime(start, fmt)
            start_unix = time.mktime(start.timetuple())
            end = datetime.strptime(end, fmt)
            end_unix = time.mktime(end.timetuple())
            pair2time[pair].append({'start': start_unix, 'end': end_unix})


* Since Proc1 was successful throughout, we expect 3 time values for each pair.

In [47]:
check = [len(times) == 3 for times in pair2time.values()]
all(check)

True

In [48]:
!cat tasks/proc1/europarl/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'
!cat tasks/proc1/flores_plus/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'
!cat tasks/proc1/opus-100/gpt-4.1-2025-04-14/task.json | grep -P 'task_id'

    "task_id": "9549e927-4f63-4205-82bb-5c7ccabfe943",
    "task_id": "fdbf6190-d061-4154-ae94-d7b11199d043",
    "task_id": "fa628bc1-4f85-446d-9167-1a8d99ccc493",


In [49]:
!cat proc1.log | grep -P "Starting task (9549e927-|fdbf6190-|fa628bc1-)"

INFO: 2025-05-07 13:55:45 - [🏁]: Starting task 9549e927-4f63-4205-82bb-5c7ccabfe943 on commit 66ef922
INFO: 2025-05-07 15:01:53 - [🏁]: Starting task fa628bc1-4f85-446d-9167-1a8d99ccc493 on commit 66ef922
INFO: 2025-05-07 15:41:35 - [🏁]: Starting task fdbf6190-d061-4154-ae94-d7b11199d043 on commit 66ef922


* So we infer that the first one is EuroParl, the second is Opus100 and the third is FloresPlus
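
The same inference can be made programmatically by sorting the `Starting task` lines by timestamp and mapping each task ID to its dataset folder. A minimal sketch, assuming the task IDs and log lines shown above; `dataset_order` is a hypothetical helper.

```python
import re
from datetime import datetime

# Task IDs as found in the three task.json files
task2dataset = {
    "9549e927-4f63-4205-82bb-5c7ccabfe943": "europarl",
    "fdbf6190-d061-4154-ae94-d7b11199d043": "flores_plus",
    "fa628bc1-4f85-446d-9167-1a8d99ccc493": "opus-100",
}

# 'Starting task' lines from proc1.log, deliberately given out of order
start_lines = [
    "INFO: 2025-05-07 15:41:35 - [🏁]: Starting task fdbf6190-d061-4154-ae94-d7b11199d043 on commit 66ef922",
    "INFO: 2025-05-07 13:55:45 - [🏁]: Starting task 9549e927-4f63-4205-82bb-5c7ccabfe943 on commit 66ef922",
    "INFO: 2025-05-07 15:01:53 - [🏁]: Starting task fa628bc1-4f85-446d-9167-1a8d99ccc493 on commit 66ef922",
]

PAT = re.compile(
    r"^INFO: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - .*Starting task ([0-9a-f-]{36})"
)


def dataset_order(lines, task2dataset):
    """Return dataset names sorted by the time their task was started."""
    started = []
    for line in lines:
        m = PAT.match(line)
        if m and m.group(2) in task2dataset:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            started.append((ts, task2dataset[m.group(2)]))
    return [name for _, name in sorted(started)]


order = dataset_order(start_lines, task2dataset)
print(order)  # ['europarl', 'opus-100', 'flores_plus']
```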

In [50]:
pair2time_modified = {}
for pair, times in pair2time.items():
    pair2time_modified[pair] = {'ep': times[0], 'opus': times[1], 'flores': times[2]}

In [51]:
import json 
with open('proc1-gpt.json', 'w') as f:
    json.dump(pair2time_modified, fp=f, indent=4)

* For Proc 4 to 6, we can rely on the structured logs, as there were no more automatic retries from OpenAI's side.
* We still confirm this as a sanity check, using Proc 4.

In [52]:
!cat proc4.log | grep -P "(Sending HTTP Request: POST https://api.openai.com/v1/chat/completions|Translated \d+ sents for \w\w-\w\w)" | grep -P -B1 "(Translated \d+ sents for \w\w-\w\w)" > tmp_proc4.log

In [53]:
with open('tmp_proc4.log', 'r') as f:
    logs = [ln for ln in f if ln.startswith('DEBUG') or ln.startswith('INFO')]

In [54]:
import re
pat = r"Translated \d+ sents for (\w\w-\w\w)"
pair2log_idx = {}
for i, log in enumerate(logs):
    if log.startswith('INFO'):
        pair = re.search(pat, log).group(1)
        pair2log_idx[pair] = i

In [55]:
import time
pair2time = {}
time_pat = r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)"
fmt = '%Y-%m-%d %H:%M:%S'
for pair in pair2log_idx:
    idx = pair2log_idx[pair]
    start = re.search(time_pat, logs[idx-1]).group(1)
    end = re.search(time_pat, logs[idx]).group(1)
    start = datetime.strptime(start, fmt)
    start_unix = time.mktime(start.timetuple())
    end = datetime.strptime(end, fmt)
    end_unix = time.mktime(end.timetuple())
    pair2time[pair] = {'start': start_unix, 'end': end_unix}

In [56]:
len(pair2time)

90

In [57]:
with open(join('tasks', 'proc4.jsonl'), 'r') as f:
    logs = [json.loads(ln) for ln in f.readlines()]
len(logs)

90

In [58]:
pair_keys_1 = set(pair2time.keys())
pair_keys_2 = set([f'{log["src_lang"]}-{log["tgt_lang"]}' for log in logs])
pair_keys_1 == pair_keys_2

True

In [59]:
diffs = []
for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    pair_key = '-'.join(pair)
    structured_dur = log['end'] - log['start']
    unstructured_dur = pair2time[pair_key]['end'] - pair2time[pair_key]['start']
    diff = structured_dur - unstructured_dur
    diffs.append(diff)

print(f'Max diff: {max(diffs):.2f} s')
print(f'Min diff: {min(diffs):.2f} s')
print(f'Mean diff: {sum(diffs)/len(diffs):.2f} s')

Max diff: 0.79 s
Min diff: -1.21 s
Mean diff: -0.19 s


* Not too tragic: the structured and log-derived durations differ by at most about 1.2 s.

In [60]:
from random import sample
selected_pairs = sample(list(pair_keys_1), 5)
for log in logs:
    pair = (log['src_lang'], log['tgt_lang'])
    pair_key = '-'.join(pair)
    if pair_key in selected_pairs:
        print(pair_key)
        print('duration (structured log)', f'{log["end"] - log["start"]:.2f} s')
        print('duration (unstructured log)', f'{pair2time[pair_key]["end"] - pair2time[pair_key]["start"]:.2f} s')
        print()


da-de
duration (structured log) 179.83 s
duration (unstructured log) 180.00 s

el-es
duration (structured log) 155.98 s
duration (unstructured log) 156.00 s

es-nl
duration (structured log) 155.98 s
duration (unstructured log) 156.00 s

fi-es
duration (structured log) 153.05 s
duration (unstructured log) 153.00 s

nl-pt
duration (structured log) 172.44 s
duration (unstructured log) 172.00 s



## Preparing Files for Analysis
* At the moment, all files are stored hierarchically in the `tasks` folder.
* This is not very convenient for analysis, so we store them all in a single folder and preserve the hierarchical information through a filename prefix.
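
The prefix scheme can be summarized as `<dataset>-<translator>-<src>-<tgt>`. The sketch below shows how such a prefix might be built from path components and split again. It is a minimal sketch: `make_prefix` and `parse_prefix` are hypothetical helpers, and the folder name `europarl-retry` is made up to stand in for the suffixed `europarl-*` folders.

```python
import os
import re

# Short names mirroring the mapping used in this notebook
LONG2SHORT = {
    "gpt-4.1-2025-04-14": "gpt",
    "deepl_document": "deepl",
    "opus-100": "opus",
    "europarl": "ep",
    "flores_plus": "flores",
}


def make_prefix(dataset_dir: str, translator_dir: str, filename: str) -> str:
    """Build '<dataset>-<translator>-<src>-<tgt>' from one path's components."""
    stem = os.path.splitext(filename)[0]
    pair = re.sub(r"_fail\d+", "", stem)  # drop retry markers such as '_fail2'
    # Suffixed folders ('europarl-<suffix>') fall back to the part before '-'
    dataset = LONG2SHORT.get(dataset_dir) or LONG2SHORT[dataset_dir.split("-")[0]]
    return f"{dataset}-{LONG2SHORT[translator_dir]}-{pair}"


def parse_prefix(prefix: str) -> dict:
    """Invert make_prefix; relies on the short names containing no hyphen."""
    dataset, translator, src, tgt = prefix.split("-")
    return {"dataset": dataset, "translator": translator, "src": src, "tgt": tgt}


print(make_prefix("opus-100", "deepl_document", "de-en.txt"))
print(make_prefix("europarl-retry", "gpt-4.1-2025-04-14", "pt-da_fail1.txt"))
print(parse_prefix("flores-gpt-en-el"))
```

Because every short name is hyphen-free, splitting on `-` always yields exactly four parts, which is what the later cells rely on when they call `prefix.split('-')`.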

In [61]:
import os
from os.path import join
import re
import json

procs = [f'proc{i}' for i in range(1, 7)]

long2short = {
    'gpt-4.1-2025-04-14': 'gpt',
    'deepl_document': 'deepl',
    'opus-100': 'opus',
    'europarl': 'ep',
    'flores_plus': 'flores'
}

prefix2file = {}
for p in procs:
    folder = join('tasks', p)
    folders = os.listdir(folder)
    for fo in folders:
        tl_folders = os.listdir(join(folder, fo))
        for t in tl_folders:
            tl_files = [f for f in os.listdir(join(folder, join(fo, t))) if f.endswith('.txt')]
            for tf in tl_files:
                pair = tf.replace('.txt', '')
                if 'fail' in pair:
                    pair = re.sub(r'_fail\d+', '', pair)
                if p in ['proc1', 'proc2', 'proc5', 'proc6']:
                    sf = long2short[fo]
                else:
                    sf = long2short[fo.split('-')[0]]
                st = long2short[t]
                prefix = f'{sf}-{st}-{pair}'
                prefix2file[prefix] = {'file': join(folder, join(fo, join(t, tf))), 'procedure': p}
                with open(join('tasks', f'{p}.jsonl'), 'r') as fi:
                    logs = [json.loads(ln) for ln in fi.readlines()]
                for log in logs:
                    log_pair = f'{log["src_lang"]}-{log["tgt_lang"]}'
                    log_translator = log['translator']
                    if log_pair == pair and log_translator==t:
                        prefix2file[prefix]['log'] = log


In [62]:
opus = 0
flores = 0
ep = 0
deepl = 0
gpt = 0
for prefix, info in prefix2file.items():
    p1 = prefix.split('-')[0]
    p2 = prefix.split('-')[1]
    if p2 == 'deepl':
        deepl+=1
    if p2 == 'gpt':
        gpt+=1
    if p1 == 'opus':
        opus+=1
    if p1 == 'ep':
        ep+=1
    if p1 == 'flores':
        flores+=1


print(opus, flores, ep)

40 220 219


In [63]:
print(deepl, gpt)

240 239


* 40 translations for OPUS: 20 DeepL, 20 GPT4.1
* 220 translations for Flores+: 110 DeepL, 110 GPT4.1
* 219 translations for Europarl: 110 DeepL, 109 GPT4.1 (because `it-el` kept failing)
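
A per-combination breakdown like this can be computed directly from the prefixes with a `Counter`. This is a minimal sketch: the prefix list is a small hypothetical sample rather than the full `prefix2file` dictionary, and `tally` is an illustrative helper. The final assertions only check that the reported breakdown is consistent with the totals printed above.

```python
from collections import Counter


def tally(prefixes):
    """Count prefixes per (dataset, translator) combination."""
    return Counter(tuple(p.split("-")[:2]) for p in prefixes)


# Hypothetical handful of prefixes, standing in for prefix2file.keys()
prefixes = [
    "ep-gpt-de-en", "ep-deepl-de-en", "opus-gpt-de-en",
    "flores-deepl-da-en", "flores-gpt-da-en", "flores-gpt-en-el",
]
counts = tally(prefixes)
print(counts[("flores", "gpt")])  # 2

# Breakdown reported in the bullets above
reported = {
    ("opus", "deepl"): 20, ("opus", "gpt"): 20,
    ("flores", "deepl"): 110, ("flores", "gpt"): 110,
    ("ep", "deepl"): 110, ("ep", "gpt"): 109,
}
by_translator = Counter()
for (_, translator), n in reported.items():
    by_translator[translator] += n

assert by_translator == {"deepl": 240, "gpt": 239}
assert sum(reported.values()) == 40 + 220 + 219
```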

In [64]:
import json
# Update times
with open('proc1-gpt.json', 'r') as f:
    proc1_times = json.load(f)

for pair, datasets in proc1_times.items():
    for dataset, times in datasets.items():
        key = f'{dataset}-gpt-{pair}'
        diff1 = prefix2file[key]['log']['end'] - prefix2file[key]['log']['start']
        diff2 = times['end'] - times['start']
        prefix2file[key]['log']['start'] = times['start']
        prefix2file[key]['log']['end'] = times['end']
        if round(diff1) != round(diff2):
            print(f'{key}: {diff1:.2f}s -> {diff2:.2f}s')

ep-gpt-de-en: 107.51s -> 252.00s
opus-gpt-de-en: 107.51s -> 76.00s
ep-gpt-en-de: 169.91s -> 194.00s
opus-gpt-en-de: 169.91s -> 114.00s
ep-gpt-da-en: 128.14s -> 142.00s
opus-gpt-da-en: 128.14s -> 99.00s
flores-gpt-da-en: 128.14s -> 129.00s
ep-gpt-en-da: 186.58s -> 155.00s
opus-gpt-en-da: 186.58s -> 137.00s
ep-gpt-el-en: 220.27s -> 240.00s
opus-gpt-el-en: 220.27s -> 98.00s
ep-gpt-en-el: 599.17s -> 239.00s
opus-gpt-en-el: 599.17s -> 217.00s
flores-gpt-en-el: 599.17s -> 299.00s
ep-gpt-pt-en: 169.84s -> 89.00s
opus-gpt-pt-en: 169.84s -> 86.00s
ep-gpt-en-pt: 187.91s -> 121.00s
opus-gpt-en-pt: 187.91s -> 106.00s
ep-gpt-sv-en: 133.35s -> 111.00s
opus-gpt-sv-en: 133.35s -> 64.00s
ep-gpt-en-sv: 134.15s -> 105.00s
opus-gpt-en-sv: 134.15s -> 83.00s
ep-gpt-es-en: 102.99s -> 116.00s
opus-gpt-es-en: 102.99s -> 86.00s
ep-gpt-en-es: 135.10s -> 159.00s
opus-gpt-en-es: 135.10s -> 94.00s
ep-gpt-fi-en: 156.83s -> 139.00s
opus-gpt-fi-en: 156.83s -> 95.00s
ep-gpt-en-fi: 223.18s -> 275.00s
opus-gpt-en-fi: 223

In [65]:
with open('proc2-3-europarl-gpt.json') as f:
    proc2_3_times = json.load(f)

for pair, times in proc2_3_times.items():
    key = f'ep-gpt-{pair}'
    diff1 = prefix2file[key]['log']['end'] - prefix2file[key]['log']['start']
    diff2 = times['end'] - times['start']
    prefix2file[key]['log']['start'] = times['start']
    prefix2file[key]['log']['end'] = times['end']
    if round(diff1) != round(diff2):
        print(f'{key}: {diff1:.2f}s -> {diff2:.2f}s')

ep-gpt-da-fi: 588.78s -> 288.00s
ep-gpt-de-el: 569.56s -> 269.00s
ep-gpt-de-sv: 511.33s -> 192.00s
ep-gpt-el-de: 230.38s -> 231.00s
ep-gpt-el-es: 220.42s -> 221.00s
ep-gpt-el-nl: 170.33s -> 171.00s
ep-gpt-es-da: 243.37s -> 244.00s
ep-gpt-es-el: 254.19s -> 255.00s
ep-gpt-es-fr: 241.38s -> 242.00s
ep-gpt-fi-nl: 300.36s -> 301.00s
ep-gpt-fi-sv: 269.22s -> 270.00s
ep-gpt-fr-da: 243.23s -> 244.00s
ep-gpt-fr-el: 592.28s -> 291.00s
ep-gpt-fr-fi: 1164.01s -> 273.00s
ep-gpt-fr-it: 221.33s -> 222.00s
ep-gpt-it-da: 868.21s -> 267.00s
ep-gpt-it-de: 193.62s -> 193.00s
ep-gpt-it-es: 182.50s -> 182.00s
ep-gpt-it-pt: 210.34s -> 211.00s
ep-gpt-nl-da: 562.67s -> 263.00s
ep-gpt-nl-de: 535.17s -> 234.00s
ep-gpt-nl-el: 820.63s -> 218.00s
ep-gpt-nl-pt: 203.56s -> 203.00s
ep-gpt-nl-sv: 293.55s -> 293.00s
ep-gpt-pt-da: 541.15s -> 241.00s
ep-gpt-pt-fi: 894.12s -> 291.00s
ep-gpt-pt-sv: 234.47s -> 235.00s


In [66]:
import shutil
os.makedirs('translations', exist_ok=True)
for prefix, info in prefix2file.items():
    shutil.copy(info['file'], join('translations', f'{prefix}.txt'))

# json.dump expects the file handle as `fp`, not `file`
with open(join('translations', 'info.json'), 'w') as f:
    json.dump(prefix2file, fp=f, indent=4)

In [None]:
check = os.listdir('translations')
len(check) == 480

## Time Analysis

In [None]:
import json
prefix2time = {}
with open(join('translations', 'info.json'), 'r') as f:
    prefix2file = json.load(f)

for prefix, info in prefix2file.items():
    prefix2time[prefix] = info['log']['end'] - info['log']['start']

In [None]:
with open('proc2-3-europarl-gpt.json') as f:
    pair2time = json.load(f)

for prefix in prefix2time:
    if prefix.startswith('ep-gpt'):
        info = prefix2file[prefix]
        if info['procedure'] == 'proc2' or info['procedure'] == 'proc3':
            pair = '-'.join([prefix.split('-')[2], prefix.split('-')[3]])
            prefix2time[prefix] = pair2time[pair]['end'] - pair2time[pair]['start']


In [None]:
with open('proc1-gpt.json') as f:
    pair2time = json.load(f)

for prefix in prefix2time:
    data, tl = prefix.split('-')[:2]
    if tl == 'gpt':
        info = prefix2file[prefix]
        if info['procedure'] == 'proc1':
            pair = '-'.join([prefix.split('-')[2], prefix.split('-')[3]])
            prefix2time[prefix] = pair2time[pair][data]['end'] - pair2time[pair][data]['start']

In [None]:
times = []
for prefix in sorted(prefix2time, key=lambda x: prefix2time[x], reverse=True):
    if 'gpt' in prefix:
        print(prefix, f'{prefix2time[prefix]:.2f}')
        times.append(prefix2time[prefix])

In [None]:
mean = sum(times)/len(times)
print(f'Mean: {mean:.2f}s')
print(f'Max: {max(times):.2f}s')
print(f'Min: {min(times):.2f}s')

## Alignment Scheduling
* We can roughly guess which translations have to be aligned again by checking whether the number of output lines (`out_lines`) differs from 400.
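
As a cross-check that does not depend on the logged value, the lines of each translation file can be counted directly. This is a minimal sketch: `needs_realignment` is a hypothetical helper, it assumes that the logged `out_lines` equals the number of lines in the `.txt` file, and the demo runs on throwaway files rather than the real `translations` folder.

```python
import os
import tempfile

EXPECTED_LINES = 400  # each task translated 400 sentences


def needs_realignment(folder, expected=EXPECTED_LINES):
    """Return {filename: line_count} for .txt files whose line count differs."""
    flagged = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            n = sum(1 for _ in f)
        if n != expected:
            flagged[name] = n
    return flagged


# Demo on temporary files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    for name, n in [("ep-gpt-de-en.txt", 400), ("ep-gpt-fr-el.txt", 398)]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("sentence\n" * n)
    flagged = needs_realignment(tmp)

print(flagged)  # {'ep-gpt-fr-el.txt': 398}
```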

In [None]:
from os.path import join
import json
with open(join('translations', 'info.json'), 'r') as f:
    prefix2file = json.load(f)

num = 0
for prefix, info in prefix2file.items():
    outlines = info['log']['out_lines']
    if outlines != 400:
        print(prefix, outlines)
        num += 1

print(f'{num} pairs likely need to be re-aligned')