## Direct Triplets
* No alignment is performed on the text. We simply place source, reference and translation side by side, line by line, in the order in which they appear in the dataset and in the translation output.
* That is, we assume that the translators preserved the alignment.

In [1]:
from scripts.util import get_env_variables
from os.path import join
from scripts.data_management import EuroParlManager, FloresPlusManager, Opus100Manager
triplet_folder = get_env_variables('TRIPLETS')
dst_path = join(triplet_folder, 'direct_triplets')
parts = {
    'opus': {'dm': Opus100Manager(), 'pairs': Opus100Manager.get_pairs()},
    'ep': {'dm': EuroParlManager(), 'pairs': EuroParlManager.get_pairs()},
    'flores': {'dm': FloresPlusManager(), 'pairs': FloresPlusManager.get_pairs()}
}
translators = ['gpt', 'deepl']

In [2]:
from scripts.post_process import direct_triplet_align, load_sents_from_file
fn2align_cnt_direct = {}
for dataset, content in parts.items():
    dm = content['dm']
    pairs = content['pairs']
    for pair in pairs:
        s, t = pair
        for translator in translators:
            filename = f'{dataset}-{translator}-{s}-{t}'
            mt_sents = load_sents_from_file(
                folder='translations', filename=filename)
            src_sents, tgt_sents = dm.get_sentence_pairs(
                s, t, num_of_sents=400)
            cnt = direct_triplet_align(
                mt_sents=mt_sents,
                src_sents=src_sents,
                ref_sents=tgt_sents,
                folder_path=dst_path,
                filename=filename
            )
            fn2align_cnt_direct[filename] = cnt

In [3]:
from collections import defaultdict
cnt_freq = defaultdict(int)
for k, v in fn2align_cnt_direct.items():
    cnt_freq[v] += 1

for k in sorted(cnt_freq):
    print(k, cnt_freq[k])

372 1
398 1
399 20
400 458


In [4]:
import json
with open('direct_cnt.json', 'w') as f:
    json.dump(fn2align_cnt_direct, f, indent=4)

* Direct alignment only removes triplets that contain an empty string, if there are any.
* In most cases (458 of 480 files), we obtain all 400 triplets.
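
The line-by-line pairing described in this section can be sketched as follows. This is only a minimal illustration and not the actual `direct_triplet_align` from `scripts.post_process`; the function name `direct_triplets_sketch`, the dictionary keys, and the third toy sentence are assumptions.

```python
def direct_triplets_sketch(src_sents, ref_sents, mt_sents):
    """Pair sentences by position and drop triplets with an empty member."""
    triplets = []
    # zip stops at the shortest list, so a translation file with fewer
    # lines silently shortens the result.
    for src, ref, mt in zip(src_sents, ref_sents, mt_sents):
        if src.strip() and ref.strip() and mt.strip():
            triplets.append({'src': src, 'ref': ref, 'mt': mt})
    return triplets


# Toy input: the second translation line is empty.
src = ['Tack.', 'Fru talman!', 'Ja.']
ref = ['Vielen Dank.', 'Frau Präsidentin!', 'Ja.']
mt = ['Vielen Dank.', '', 'Ja.']

triplets = direct_triplets_sketch(src, ref, mt)
print(len(triplets))
```

Under these assumptions, a file with one empty translation line yields one triplet fewer, which would be consistent with the many `399` counts above.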

## Aligned Triplets WITHOUT Sentence Splitting
* Alignments were computed in this notebook: [Alignments_No_Sent_Split.ipynb](https://colab.research.google.com/drive/1867NBRM7ixgiVmeznqRf4oh9nYDd4D5S?usp=sharing)

In [5]:
from scripts.post_process import post_triplet_align, load_aligned_sents_from_file
from scripts.util import get_env_variables
import os
from os.path import join
aligned_folder = get_env_variables('ALIGNMENTS')
triplet_folder = get_env_variables('TRIPLETS')
src2hyp_split_fo = join(aligned_folder, 'source2translations_no_sent_split')
dst_path = join(triplet_folder, 'aligned_triplets_no_sent_split')
filenames = [f.replace('.jsonl', '') for f in os.listdir(src2hyp_split_fo)]
len(filenames)

480

In [6]:
from scripts.data_management import EuroParlManager, FloresPlusManager, Opus100Manager

dms = {
    'ep': EuroParlManager(),
    'flores': FloresPlusManager(),
    'opus': Opus100Manager()
}

fn2align_cnt_no_sent_split = {}
fn2discard_no_sent_split = {}
cases = ['ep-gpt', 'ep-deepl', 'flores-gpt',
         'flores-deepl', 'opus-gpt', 'opus-deepl']
case2align_cnts_no_sent_split = {c: [] for c in cases}

for fn in filenames:
    dataset, translator, s, t = fn.split('-')
    src_sents_a, mt_sents_a = load_aligned_sents_from_file(
        fn, folder=src2hyp_split_fo)
    dm = dms[dataset]
    src_sents_o, ref_sents_o = dm.get_sentence_pairs(s, t, num_of_sents=400)
    align_cnt, dis = post_triplet_align(
        src_sents_org=src_sents_o,
        src_sents_ali=src_sents_a,
        ref_sents_org=ref_sents_o,
        mt_sents_ali=mt_sents_a,
        folder_path=dst_path,
        filename=fn)

    fn2align_cnt_no_sent_split[fn] = align_cnt
    fn2discard_no_sent_split[fn] = dis
    case2align_cnts_no_sent_split[f'{dataset}-{translator}'].append(align_cnt)

for t, ac in case2align_cnts_no_sent_split.items():
    max_cnt = max(ac)
    min_cnt = min(ac)
    mean = sum(ac) / len(ac)
    print(t)
    print(f'min: {min_cnt}')
    print(f'max: {max_cnt}')
    print(f'mean: {mean:.2f}')
    print()

ep-gpt
min: 383
max: 398
mean: 396.31

ep-deepl
min: 369
max: 398
mean: 395.77

flores-gpt
min: 372
max: 400
mean: 396.65

flores-deepl
min: 374
max: 400
mean: 396.61

opus-gpt
min: 372
max: 400
mean: 397.85

opus-deepl
min: 395
max: 400
mean: 398.70



In [7]:
from collections import defaultdict
cnt_freq = defaultdict(int)
for k, v in fn2align_cnt_no_sent_split.items():
    cnt_freq[v] += 1

for k in sorted(cnt_freq):
    print(k, cnt_freq[k])

369 1
372 2
374 19
383 2
384 1
391 1
392 2
393 8
394 13
395 38
396 36
397 144
398 47
399 32
400 134


In [8]:
import json
with open('aligned_cnt_no_sent_split.json', 'w') as f:
    json.dump(fn2align_cnt_no_sent_split, f, indent=4)

### Investigating Loss of Text
* If we apply `bertalign` WITHOUT sentence splitting to all translations, we lose text in some cases.
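* The largest loss occurs for `ep-deepl-sv-de`: of the 400 translated lines, 391 aligned pairs remain after `bertalign`, and only 369 triplets survive the matching step (391-369=22 sentences lost).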

In [9]:
!cat translations/ep-deepl-sv-de.txt | wc -l

400


In [10]:
!cat $src2hyp_split_fo/ep-deepl-sv-de.jsonl | wc -l

391


In [11]:
!cat $dst_path/ep-deepl-sv-de.jsonl | wc -l

369


In [12]:
from scripts.data_management import EuroParlManager
from scripts.post_process import load_aligned_sents_from_file, load_sents_from_file
mt_sents = load_sents_from_file('ep-deepl-sv-de', 'translations')
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('sv', 'de', num_of_sents=400)
src_sents[100:106]

['Vi skall rösta om begäran från PPE-DE-gruppen som syftar till att stryka den muntliga frågan om kapitalskatt från föredragningslistan.',
 '(Parlamentet avslog begäran med 164 röster för, 166 emot. 7 ledamöter avstod från att rösta.)',
 'Fru talman! Jag skulle vilja tacka Poettering för att han just gjort reklam för denna debatt.',
 'Tack.',
 'Fru talman!',
 'Jag undrar om även min röst har räknats, trots att den inte kunde avges på elektronisk väg, eftersom jag inte har något kort?']

In [13]:
tgt_sents[100:106]

['Wir stimmen jetzt über den Antrag der PPE/DE-Fraktion ab, die mündliche Anfrage über die Kapitalsteuer von der Tagesordnung abzusetzen.',
 '(Das Parlament lehnt den Antrag mit 164 Ja-Stimmen, 166 Nein-Stimmen und 7 Enthaltungen ab.)',
 'Frau Präsidentin, ich möchte Herrn Poettering für das Rühren der Werbetrommel zugunsten dieser Aussprache danken.',
 'Vielen Dank.',
 'Frau Präsidentin!',
 'Ist meine Stimme mitgezählt worden? Ich konnte sie nämlich nicht elektronisch abgeben, weil ich die Karte nicht habe.']

In [14]:
mt_sents[100:106]

['Wir werden über den Antrag der Fraktion der Europäischen Volkspartei (Christdemokraten) und europäischer Demokraten abstimmen, die mündliche Anfrage zur Gesellschaftssteuer von der Tagesordnung abzusetzen.',
 '(Das Parlament lehnt den Antrag mit 164 gegen 166 Stimmen ab. 7 Abgeordnete enthalten sich der Stimme).',
 'Frau Präsidentin, ich möchte Herrn Poettering dafür danken, dass er diese Aussprache soeben angekündigt hat.',
 'Vielen Dank, Herr Pöttering.',
 'Frau Präsidentin, ich möchte Folgendes fragen',
 'Ich frage mich, ob meine Stimme auch gezählt worden ist, obwohl sie nicht elektronisch abgegeben werden konnte, weil ich keine Karte habe?']

* The direct alignment for this specific instance is correct. However, we already observe a slight difference between reference and translation:
* `Vielen Dank` vs `Vielen Dank, Herr Pöttering.`

In [15]:
src_sents_ali, mt_sents_ali = load_aligned_sents_from_file('ep-deepl-sv-de', src2hyp_split_fo)
src_sents_ali[100:107]

['(Parlamentet avslog begäran med 164 röster för, 166 emot. 7 ledamöter avstod från att rösta.)',
 'Fru talman! Jag skulle vilja tacka Poettering för att han just gjort reklam för denna debatt.',
 '',
 'Tack. Fru talman!',
 'Jag undrar om även min röst har räknats, trots att den inte kunde avges på elektronisk väg, eftersom jag inte har något kort?',
 'Jag röstade "för".',
 'Om man lägger till de två kolleger som yttrat sig blir resultatet...']

In [16]:
mt_sents_ali[100:107]

['(Das Parlament lehnt den Antrag mit 164 gegen 166 Stimmen ab. 7 Abgeordnete enthalten sich der Stimme).',
 'Frau Präsidentin, ich möchte Herrn Poettering dafür danken, dass er diese Aussprache soeben angekündigt hat.',
 'Vielen Dank, Herr Pöttering.',
 'Frau Präsidentin, ich möchte Folgendes fragen',
 'Ich frage mich, ob meine Stimme auch gezählt worden ist, obwohl sie nicht elektronisch abgegeben werden konnte, weil ich keine Karte habe?',
 'Ich habe mit "Ja" gestimmt.',
 'Wenn Sie die beiden Kollegen, die sich zu Wort gemeldet haben, hinzuzählen, lautet das Ergebnis...']

* Alignment with `bertalign` works, but the result depends on the translation. Here, it merged two source sentences to get a 2-1 alignment:
```
'Tack. Fru talman!' <->  'Vielen Dank, Herr Pöttering.',
```
* As a consequence, we see an empty string in the aligned source sentences.
* Furthermore, since the original source sentences do not contain the string `'Tack. Fru talman!'`, this pair is discarded, because triplets are created by matching the aligned source strings against the original source strings.
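
The discard described above can be reproduced with a small sketch. It assumes that matching is an exact string lookup of each aligned source segment among the original source sentences; `match_triplets_sketch` is a hypothetical stand-in and not the actual `post_triplet_align`.

```python
def match_triplets_sketch(src_org, ref_org, src_ali, mt_ali):
    """Attach references to aligned (source, translation) pairs by exact source match."""
    src2ref = {}
    for src, ref in zip(src_org, ref_org):
        src2ref.setdefault(src, ref)  # assumption: first occurrence wins

    triplets, discarded = [], []
    for src, mt in zip(src_ali, mt_ali):
        if src in src2ref and mt:
            triplets.append({'src': src, 'ref': src2ref[src], 'mt': mt})
        else:
            # Merged or empty source segments have no exact match.
            discarded.append((src, mt))
    return triplets, discarded


question = ('Jag undrar om även min röst har räknats, trots att den inte kunde '
            'avges på elektronisk väg, eftersom jag inte har något kort?')
src_org = ['Tack.', 'Fru talman!', question]
ref_org = ['Vielen Dank.', 'Frau Präsidentin!',
           'Ist meine Stimme mitgezählt worden? Ich konnte sie nämlich nicht '
           'elektronisch abgeben, weil ich die Karte nicht habe.']
src_ali = ['Tack. Fru talman!', question]
mt_ali = ['Vielen Dank, Herr Pöttering.',
          'Ich frage mich, ob meine Stimme auch gezählt worden ist, obwohl sie '
          'nicht elektronisch abgegeben werden konnte, weil ich keine Karte habe?']

triplets, discarded = match_triplets_sketch(src_org, ref_org, src_ali, mt_ali)
print(len(triplets), len(discarded))
```

In this sketch the merged segment ends up in `discarded`, so both original sentences `'Tack.'` and `'Fru talman!'` are missing from the triplets, even though `bertalign` itself did not drop any text.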

## Aligned Triplets WITH Sentence Splitting
* The alignments were computed in this notebook: [Alignments.ipynb](https://colab.research.google.com/drive/1xlwQPctsOGjZB2NpB9WNtzWPae_Oj4gt?usp=sharing)
* Note: With sentence splitting, we had to align source to reference AND source to translation. Before, aligning source to translation was enough, because we could use the source-reference alignments provided by the respective datasets directly.

In [17]:
from scripts.post_process import post_triplet_align, load_aligned_sents_from_file
import os
from scripts.util import get_env_variables
from os.path import join
aligned_folder = get_env_variables('ALIGNMENTS')
triplet_folder = get_env_variables('TRIPLETS')
# Aligned source to translation
src2hyp_split_fo = join(aligned_folder, 'source2translations_sent_split')
# Aligned source to reference
src2ref_split_fo = join(aligned_folder, 'source2reference_sent_split')
dst_path = join(triplet_folder, 'aligned_triplets_sent_split')
filenames = [f.replace('.jsonl', '') for f in os.listdir(src2hyp_split_fo)]
len(filenames)

480

In [18]:
fn2align_cnt_sent_split = {}
cases = ['ep-gpt', 'ep-deepl', 'flores-gpt',
         'flores-deepl', 'opus-gpt', 'opus-deepl']
case2align_cnts_sent_split = {c: [] for c in cases}

for fn in filenames:
    dataset, translator, s, t = fn.split('-')
    src_sents_a, mt_sents_a = load_aligned_sents_from_file(
        fn, folder=src2hyp_split_fo)
    src_sents_o, ref_sents_o = load_aligned_sents_from_file(
        f'{dataset}-{s}-{t}', folder=src2ref_split_fo)
    align_cnt, dis = post_triplet_align(
        src_sents_org=src_sents_o,
        src_sents_ali=src_sents_a,
        ref_sents_org=ref_sents_o,
        mt_sents_ali=mt_sents_a,
        folder_path=dst_path,
        filename=fn)

    fn2align_cnt_sent_split[fn] = align_cnt
    case2align_cnts_sent_split[f'{dataset}-{translator}'].append(align_cnt)

for t, ac in case2align_cnts_sent_split.items():
    max_cnt = max(ac)
    min_cnt = min(ac)
    mean = sum(ac) / len(ac)
    print(t)
    print(f'min: {min_cnt}')
    print(f'max: {max_cnt}')
    print(f'mean: {mean:.2f}')
    print()

ep-gpt
min: 329
max: 396
mean: 374.74

ep-deepl
min: 328
max: 395
mean: 377.38

flores-gpt
min: 390
max: 428
mean: 417.58

flores-deepl
min: 390
max: 428
mean: 417.23

opus-gpt
min: 361
max: 399
mean: 382.95

opus-deepl
min: 361
max: 398
mean: 383.85



In [19]:
from collections import defaultdict
cnt_freq = defaultdict(int)
for k, v in fn2align_cnt_sent_split.items():
    cnt_freq[v] += 1

for k in sorted(cnt_freq):
    print(k, cnt_freq[k])

328 1
329 1
333 2
334 1
335 1
336 1
338 3
339 5
342 2
343 2
344 2
345 1
347 1
348 2
349 1
351 2
356 1
357 1
360 2
361 4
362 2
363 1
364 2
365 1
366 6
367 2
368 3
369 4
370 3
371 1
372 5
373 2
374 5
375 5
376 4
377 11
378 14
379 7
380 7
381 7
382 7
383 12
384 13
385 4
386 18
387 8
388 12
389 11
390 16
391 13
392 6
393 4
394 7
395 3
396 3
397 3
398 4
399 2
400 2
401 1
403 4
404 1
405 3
406 4
407 1
408 2
409 2
410 1
411 6
412 7
413 8
414 4
415 5
416 3
417 6
418 10
419 13
420 14
421 20
422 13
423 30
424 20
425 10
426 8
427 5
428 3


In [20]:
import json
with open('aligned_cnt_sent_split.json', 'w') as f:
    json.dump(fn2align_cnt_sent_split, f, indent=4)

## Scoring
* For all three triplet formations (no alignment, alignment without sentence splitting, and alignment with sentence splitting), we compute BLEU scores.

In [21]:
from scripts.scoring import ResultProducer
from scripts.util import get_env_variables
import os
from os.path import join
triplet_folder = get_env_variables('TRIPLETS')
results_folder = get_env_variables('RESULTS')

result_types = {'direct_results':join(triplet_folder, 'direct_triplets'), 
                'aligned_results_no_sent_split': join(triplet_folder, 'aligned_triplets_no_sent_split'),
                'aligned_results_sent_split': join(triplet_folder, 'aligned_triplets_sent_split')}

for rt in result_types:
    files = os.listdir(result_types[rt])
    folder_path = join(results_folder, rt)
    os.makedirs(folder_path, exist_ok=True)

    cases = ['ep-gpt', 'ep-deepl', 'flores-gpt',
             'flores-deepl', 'opus-gpt', 'opus-deepl']

    for case in cases:
        l2f = {f.replace(f'{case}-', '').replace('.jsonl', '')
                         : join(result_types[rt], f) for f in files if f.startswith(case)}
        rp = ResultProducer(label2files=l2f)
        rp.compute_results()
        rp.store_results(join(folder_path, f'{case}.csv'))

In [22]:
from scripts.presentation import Presenter
p_direct = Presenter(results_folder=join(results_folder, 'direct_results'))
p_no_sent_split = Presenter(results_folder=join(results_folder, 'aligned_results_no_sent_split'))
p_sent_split = Presenter(results_folder=join(results_folder, 'aligned_results_sent_split'))

## Difference between Alignment WITH and WITHOUT Sentence Splitting

In [23]:
# reload cnts 
import json
with open('aligned_cnt_no_sent_split.json', 'r') as f:
    fn2align_cnt_no_sent_split = json.load(f)

with open('aligned_cnt_sent_split.json', 'r') as f:
    fn2align_cnt_sent_split = json.load(f)

with open('direct_cnt.json', 'r') as f:
    fn2align_cnt_direct = json.load(f)

In [24]:
for x in p_no_sent_split.cases:
    df1 = p_no_sent_split.cases[x]['BLEU']
    df2 = p_sent_split.cases[x]['BLEU']
    diff = df1 - df2
    print(x)
    diff_mean = diff.values.mean()

    diff_flat = diff.stack()
    top15_max = diff_flat.nlargest(15)
    print('Largest Positive Differences (Sent Splits made things worse)')
    print('Label: ΔBLEU (no sent split cnt-> sent split cnt)')
    for (s, t), score in top15_max.items():
        fn = f'{x}-{s}-{t}'
        no_sent_split = fn2align_cnt_no_sent_split[fn]
        sent_split = fn2align_cnt_sent_split[fn]
        print(f'{s}-{t}: {score:.1f} ({no_sent_split}->{sent_split})')

    top15_min = diff_flat.nsmallest(15)
    print('Largest Negative Differences (Sent Splits made things better)')
    print('Label: ΔBLEU (no sent split cnt-> sent split cnt)')
    for (s, t), score in top15_min.items():
        fn = f'{x}-{s}-{t}'
        no_sent_split = fn2align_cnt_no_sent_split[fn]
        sent_split = fn2align_cnt_sent_split[fn]
        print(f'{s}-{t}: {score:.1f} ({no_sent_split}->{sent_split})')
    print()

ep-deepl
Largest Positive Differences (Sent Splits made things worse)
Label: ΔBLEU (no sent split cnt-> sent split cnt)
da-es: 0.9 (395->383)
fr-fi: 0.4 (397->390)
fi-es: 0.3 (395->388)
fr-it: 0.3 (397->372)
es-da: 0.2 (396->390)
fr-el: 0.2 (396->381)
fr-de: 0.1 (397->391)
sv-fi: 0.1 (393->376)
es-fr: 0.1 (396->389)
fr-da: 0.1 (397->389)
pt-es: 0.1 (397->386)
da-el: 0.1 (395->385)
en-it: 0.1 (397->366)
fr-es: 0.1 (397->387)
pt-de: 0.1 (397->384)
Largest Negative Differences (Sent Splits made things better)
Label: ΔBLEU (no sent split cnt-> sent split cnt)
nl-fi: -2.5 (397->345)
nl-da: -2.5 (397->343)
nl-sv: -2.2 (395->333)
nl-en: -2.2 (397->339)
nl-pt: -2.1 (396->336)
nl-es: -2.0 (397->338)
nl-fr: -2.0 (397->339)
nl-de: -1.9 (397->342)
nl-it: -1.5 (397->328)
nl-el: -1.3 (397->339)
it-pt: -0.9 (398->370)
it-fr: -0.9 (398->378)
it-de: -0.8 (398->378)
it-es: -0.7 (398->380)
it-en: -0.7 (398->378)

ep-gpt
Largest Positive Differences (Sent Splits made things worse)
Label: ΔBLEU (no sent sp

* Here, negative differences are the cases where alignment WITH sentence splitting yielded higher BLEU scores than alignment WITHOUT.
* However, the alignment counts suggest that this may not be related to alignment quality, but rather to the fact that the sentence-split variant works with fewer sentences overall.
* Additionally, the largest difference we observe is -3.4, which is not too dramatic.
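
To make the point about alignment counts concrete, the following sketch compares how many triplets are lost in each group. It only uses the first five `ep-deepl` rows of each group from the output above; the helper `mean_drop` is introduced here for illustration.

```python
from statistics import mean

# (no_sent_split_cnt, sent_split_cnt), copied from the `ep-deepl` output above.
worse = {'da-es': (395, 383), 'fr-fi': (397, 390), 'fi-es': (395, 388),
         'fr-it': (397, 372), 'es-da': (396, 390)}
better = {'nl-fi': (397, 345), 'nl-da': (397, 343), 'nl-sv': (395, 333),
          'nl-en': (397, 339), 'nl-pt': (396, 336)}


def mean_drop(group):
    """Average number of triplets lost when switching to sentence splitting."""
    return mean(before - after for before, after in group.values())


print(f'mean drop where splitting scored lower:  {mean_drop(worse):.1f}')
print(f'mean drop where splitting scored higher: {mean_drop(better):.1f}')
```

For these rows, the pairs where sentence splitting scored higher lose far more triplets, which fits the suspicion that the gain comes from scoring fewer sentences rather than from better alignment.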

## Difference between Direct and Alignment WITHOUT Sentence Splitting

In [25]:
import json
mismatches = {}

with open(join('translations', 'info.json'), 'r') as f:
    prefix2file = json.load(f)
    
for prefix, info in prefix2file.items():
    dataset, translator, s, t = prefix.split('-')
    key = f'{dataset}-{translator}'
    outlines = info['log']['out_lines']
    if outlines != 400:
        mismatches[prefix] = outlines

In [26]:
mismatch_stats = {}



for x in p_direct.cases:
    df1 = p_direct.cases[x]['BLEU']
    df2 = p_no_sent_split.cases[x]['BLEU']
    
    diff = df1 - df2 
    print(x)
    diff_mean = diff.values.mean()
    
    diff_flat = diff.stack()
    top15_max = diff_flat.nlargest(15)
    print('Largest Positive Differences (Alignment made things worse)')
    for (s, t), score in top15_max.items():
        fn = f'{x}-{s}-{t}'
        no_sent_split = fn2align_cnt_no_sent_split[fn]
        direct = fn2align_cnt_direct[fn]
        mark = ''
        if fn in mismatches:
            mismatch_stats[fn] = [score, direct, no_sent_split]
            mark = "🎯"
        if abs(score) > 0.01:
            print(f'{s}-{t}: {score:.1f} ({direct}->{no_sent_split}) {mark}')
        
    top15_min = diff_flat.nsmallest(15)
    print('Largest Negative Differences (Alignment made things better)')
    for (s, t), score in top15_min.items():
        fn = f'{x}-{s}-{t}'
        no_sent_split = fn2align_cnt_no_sent_split[fn]
        direct = fn2align_cnt_direct[fn]
        mark = ''
        if fn in mismatches:
            mismatch_stats[fn] = [score, direct, no_sent_split]
            mark = "🎯"
        if abs(score) > 0.01:
            print(f'{s}-{t}: {score:.1f} ({direct}->{no_sent_split}) {mark}')
    print()

ep-deepl
Largest Positive Differences (Alignment made things worse)
da-fi: 0.2 (400->395) 
sv-es: 0.2 (400->393) 
nl-sv: 0.1 (400->395) 
sv-fr: 0.1 (400->393) 
da-de: 0.1 (400->395) 
fr-fi: 0.1 (400->397) 
pt-fi: 0.1 (400->397) 
fr-pt: 0.1 (400->397) 
da-en: 0.1 (400->395) 
de-fr: 0.1 (400->397) 
el-sv: 0.1 (400->394) 
de-da: 0.1 (400->397) 
en-fr: 0.1 (400->397) 
nl-pt: 0.1 (400->396) 
pt-da: 0.1 (400->395) 
Largest Negative Differences (Alignment made things better)
da-el: -0.1 (400->395) 
sv-de: -0.0 (400->369) 
el-en: 0.0 (400->397) 
el-it: 0.0 (400->398) 
it-el: 0.0 (400->398) 
fi-el: 0.0 (400->397) 
es-en: 0.0 (400->396) 
de-el: 0.0 (400->397) 

ep-gpt
Largest Positive Differences (Alignment made things worse)
fr-nl: 0.1 (400->395) 
sv-de: 0.1 (400->383) 
de-sv: 0.1 (400->384) 
nl-pt: 0.1 (400->396) 
it-de: 0.1 (400->398) 
el-de: 0.1 (400->397) 
nl-fi: 0.1 (400->397) 
pt-nl: 0.1 (400->397) 
en-de: 0.1 (400->397) 
it-nl: 0.1 (400->398) 
fi-de: 0.1 (400->396) 
es-fi: 0.1 (400->395)

* The biggest differences occur only for the marked files, i.e., the ones that we suspect to have low BLEU scores due to misalignment!

In [27]:
print('Label: ΔBLEU (direct_cnt->no_sent_split_cnt)')
for k, v in mismatch_stats.items():
    print(f'{k}: {v[0]:.1f} ({v[1]}->{v[2]})')

Label: ΔBLEU (direct_cnt->no_sent_split_cnt)
ep-gpt-es-en: -34.1 (399->397)
ep-gpt-de-en: -25.1 (398->394)
ep-gpt-de-fi: -15.6 (399->394)
ep-gpt-fi-nl: -15.4 (399->395)
ep-gpt-es-el: -15.0 (399->395)
ep-gpt-fi-it: -14.8 (399->394)
ep-gpt-el-it: -14.3 (399->396)
ep-gpt-en-fi: -13.1 (400->397)
ep-gpt-pt-fi: -12.6 (400->397)
ep-gpt-sv-fi: -12.4 (399->394)
ep-gpt-es-nl: -12.3 (399->395)
ep-gpt-el-nl: -12.2 (399->395)
ep-gpt-da-it: -4.9 (399->396)
ep-gpt-it-fi: -1.9 (399->395)
flores-gpt-it-fr: -32.7 (399->400)
flores-gpt-fi-pt: -31.7 (399->400)
flores-gpt-fi-da: -29.9 (399->398)
flores-gpt-fi-nl: -24.6 (399->400)
flores-gpt-fi-es: -22.2 (399->400)
flores-gpt-fi-el: -20.9 (399->400)
flores-gpt-es-el: -19.6 (399->400)
opus-gpt-pt-en: -37.6 (399->400)
opus-gpt-de-en: -24.3 (372->372)


* If we compare alignment WITHOUT sentence splitting to direct alignment (i.e., no additional alignment), we observe that differences occur mainly where we assumed to have a mismatch anyway.
* It is noteworthy that we see cases like `399->400`. This occurs because `direct_alignment` excludes triplets where source, reference or translation is missing. After alignment, nothing was missing anymore because the alignment managed to fix the gap. The true counts from `translations` are added as well.
* It is also noteworthy that the biggest difference in aligned sentence count, `sv-de: 400->369`, had no difference in BLEU scores.
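
Why a single missing line can cost so many BLEU points under direct pairing can be illustrated with a toy example. The sketch below uses a simplified corpus BLEU (whitespace tokens, no smoothing); it is neither `compute_bleu` from `scripts.scoring` nor a full reference implementation, and the English sentences are made up.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(refs, hyps, max_n=4):
    """Simplified corpus BLEU (whitespace tokens, no smoothing), scaled to 0-100."""
    matches, totals = [0] * max_n, [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            h_counts = ngram_counts(h, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & ngram_counts(r, n)).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


refs = ['the house is very small', 'we vote on the request now',
        'thank you very much madam president', 'my vote was not counted today']
hyps = list(refs)               # toy "perfect" translation
shifted = hyps[:1] + hyps[2:]   # second line missing, later lines move up

aligned_score = bleu_sketch(refs, hyps)
shifted_score = bleu_sketch(refs, shifted)  # zip pairs the first three lines only
print(f'aligned: {aligned_score:.1f}, shifted: {shifted_score:.1f}')
```

Every line after the gap is compared against the wrong reference, so the score collapses although each individual translation is perfect. Re-aligning the files before scoring avoids exactly this effect.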

## Consider LaBSE
* LaBSE alignments were computed in this notebook: [Alignments_No_Sent_Split_LaBSE.ipynb](https://colab.research.google.com/drive/1ieADAugVQ2nVq0Sqr9a299rsTjs9eMsB?usp=sharing)
* LaBSE is computationally more expensive and a bit trickier to use, even with Google Colab, and it occasionally runs into memory issues. However, since we used paraphrase-multilingual-MiniLM-L12-v2 for everything, we can now compare it against LaBSE on just the files that we plan to align for sure and see whether it makes a big difference.

In [28]:
from scripts.post_process import post_triplet_align, load_aligned_sents_from_file
import os
from scripts.util import get_env_variables
from os.path import join
aligned_folder = get_env_variables('ALIGNMENTS')
triplet_folder = get_env_variables('TRIPLETS')
src2hyp_split_fo = join(aligned_folder, 'source2translations_no_sent_split_LaBSE')
dst_path = join(triplet_folder, 'aligned_triplets_no_sent_split_LaBSE')
filenames = [f.replace('.jsonl', '') for f in os.listdir(src2hyp_split_fo)]
len(filenames)

23

In [29]:
from scripts.data_management import EuroParlManager, FloresPlusManager, Opus100Manager

dms = {
    'ep': EuroParlManager(),
    'flores': FloresPlusManager(),
    'opus': Opus100Manager()
}

fn2align_cnt_no_sent_split_LaBSE = {}
fn2discard_no_sent_split_LaBSE = {}
cases = ['ep-gpt', 'ep-deepl', 'flores-gpt',
         'flores-deepl', 'opus-gpt', 'opus-deepl']
case2align_cnts_no_sent_split_LaBSE = {c: [] for c in cases}

for fn in filenames:
    dataset, translator, s, t = fn.split('-')
    src_sents_a, mt_sents_a = load_aligned_sents_from_file(
        fn, folder=src2hyp_split_fo)
    dm = dms[dataset]
    src_sents_o, ref_sents_o = dm.get_sentence_pairs(s, t, num_of_sents=400)
    align_cnt, dis = post_triplet_align(
        src_sents_org=src_sents_o,
        src_sents_ali=src_sents_a,
        ref_sents_org=ref_sents_o,
        mt_sents_ali=mt_sents_a,
        folder_path=dst_path,
        filename=fn)

    fn2align_cnt_no_sent_split_LaBSE[fn] = align_cnt


for k, v in fn2align_cnt_no_sent_split_LaBSE.items():
    other = fn2align_cnt_no_sent_split[k]
    if v!=other:
        print(k, v, other)

ep-gpt-de-fi 396 394
ep-gpt-el-nl 396 395
ep-gpt-sv-fi 393 394
flores-gpt-fi-da 400 398


* We observe only a few disagreements (4 of 23 files) in terms of alignment count between LaBSE and paraphrase-multilingual-MiniLM-L12-v2.

In [30]:
import os
from os.path import join
from scripts.post_process import load_aligned_sents_from_file
from scripts.scoring import compute_bleu
from scripts.util import get_env_variables
triplet_folder = get_env_variables('TRIPLETS')
paraphrase = join(triplet_folder, 'aligned_triplets_no_sent_split')
laBSE = join(triplet_folder, 'aligned_triplets_no_sent_split_LaBSE')
filenames = [f.replace('.jsonl', '') for f in os.listdir(laBSE)]
diff_cnt = 0
total_diff = 0
max_diff = 0
min_diff = 100
for fn in filenames:
    ref_sents_p, mt_sents_p = load_aligned_sents_from_file(
        fn, folder=paraphrase, src_label='ref', tgt_label='mt')
    ref_sents_l, mt_sents_l = load_aligned_sents_from_file(fn, folder=laBSE,
                                                           src_label='ref', tgt_label='mt')
    bleu_p = compute_bleu(ref_sents_p, mt_sents_p)
    bleu_l = compute_bleu(ref_sents_l, mt_sents_l)
    no_diff = bleu_l == bleu_p
    if no_diff:
        continue
    else:
        mark = '❌'
        diff_cnt += 1
        total_diff += abs(bleu_l - bleu_p)
        max_diff = max(max_diff, abs(bleu_l - bleu_p))
        min_diff = min(min_diff, abs(bleu_l - bleu_p))
        print(fn)
        print(f'BLEU Paraphrase: {bleu_p:.3f}')
        print(f'BLEU LaBSE: {bleu_l:.3f} {mark}')
        print()


print()
print(f'Mean Difference: {total_diff/diff_cnt:.3f}')
print(f'Max Difference: {max_diff:.3f}')
print(f'Min Difference: {min_diff:.3f}')

ep-gpt-de-fi
BLEU Paraphrase: 20.383
BLEU LaBSE: 20.356 ❌

ep-gpt-el-nl
BLEU Paraphrase: 25.019
BLEU LaBSE: 25.032 ❌

ep-gpt-sv-fi
BLEU Paraphrase: 19.018
BLEU LaBSE: 18.983 ❌

flores-gpt-fi-da
BLEU Paraphrase: 30.189
BLEU LaBSE: 30.124 ❌

flores-gpt-fi-es
BLEU Paraphrase: 22.857
BLEU LaBSE: 22.858 ❌

flores-gpt-fi-pt
BLEU Paraphrase: 32.266
BLEU LaBSE: 32.265 ❌


Mean Difference: 0.024
Max Difference: 0.065
Min Difference: 0.001


* 6 out of 23 alignments yielded slightly different BLEU scores.
* 4 of these 6 are the files whose alignment counts differed, so they likely differ because LaBSE aligned them differently.
* **OVERALL CONCLUSION**: Using paraphrase-multilingual-MiniLM-L12-v2 for alignment instead of LaBSE should not be an issue.

## Final Triplets
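* After these alignment experiments, we can create the final triplets that we use for evaluation.
* For files whose translation output did not have 400 lines (`mismatches`), we take the triplets aligned WITHOUT sentence splitting. For all other files, we keep the direct triplets.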

In [31]:
len(mismatches)

23

In [32]:
import re
import os
from os.path import join
import shutil
from scripts.util import get_env_variables
triplet_folder = get_env_variables('TRIPLETS')

direct_folder = join(triplet_folder, 'direct_triplets')
direct_files = os.listdir(direct_folder)
aligned_folder = join(triplet_folder, 'aligned_triplets_no_sent_split')
dst_folder = join(triplet_folder, 'final_triplets')

cases = ['ep-gpt', 'ep-deepl', 'flores-gpt',
         'flores-deepl', 'opus-gpt', 'opus-deepl']

case2cnt = {c: [] for c in cases}
os.makedirs(dst_folder, exist_ok=True)
for fn in direct_files:
    name = fn.replace('.jsonl', '')
    if name not in mismatches:
        src_file = join(direct_folder, fn)
    else:
        src_file = join(aligned_folder, fn)

    with open(src_file, 'r') as f:
        lines = f.readlines()
        key = re.search(r'\w+-\w+', fn).group(0)
        case2cnt[key].append(len(lines))

    dst_file = join(dst_folder, fn)
    shutil.copy(src_file, dst_file)

In [33]:
from collections import Counter
for c, cnts in case2cnt.items():
    print(c)
    print(Counter(cnts))
    print()

ep-gpt
Counter({400: 96, 395: 5, 394: 4, 397: 3, 396: 2})

ep-deepl
Counter({400: 110})

flores-gpt
Counter({400: 109, 398: 1})

flores-deepl
Counter({400: 109, 399: 1})

opus-gpt
Counter({400: 19, 372: 1})

opus-deepl
Counter({400: 20})

