# Demo

* Shows a possible translation process using `GPTClient`.
* Some failures occurred while making this demo; they serve as a good guideline for how to deal with such failures.

In [1]:
from scripts.translators import GPTClient
gpt = GPTClient()
print(gpt.user_prompt('en', 'de', 'Hello World'))

Translate the following English sentences into German.
Please make sure to keep the same formatting, do not add more newlines.
You are not allowed to omit anything.
Here is the text:
Hello World
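
The prompt printed above can be reproduced with a small template function. This is only a minimal sketch: `build_user_prompt` and the `LANG_NAMES` mapping are hypothetical names introduced for illustration, and the real `GPTClient.user_prompt` may be implemented differently.

```python
# Hypothetical sketch; NOT the actual GPTClient implementation.
# Assumed mapping from language codes to names (only two entries shown).
LANG_NAMES = {'en': 'English', 'de': 'German'}


def build_user_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Assemble a translation prompt with the same wording as the output above."""
    lines = [
        f"Translate the following {LANG_NAMES[src_lang]} sentences "
        f"into {LANG_NAMES[tgt_lang]}.",
        "Please make sure to keep the same formatting, do not add more newlines.",
        "You are not allowed to omit anything.",
        "Here is the text:",
        text,
    ]
    return "\n".join(lines)


print(build_user_prompt('en', 'de', 'Hello World'))
```

Keeping the instructions in a fixed template like this makes it easier to compare outputs across language pairs, since only the language names and the text change.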


## Translation Task

* Set up the task

In [2]:
from scripts.task import TranslationTask
from scripts.data_management import EuroParlManager
from scripts.translators import GPTClient
from scripts.logger import MyLogger
from os.path import join
from random import sample, seed

some_pairs = [('da', 'en'), ('de', 'fr'), ('de', 'pt'), ('fi', 'it'), 
              ('fr', 'da'), ('it', 'pt'), ('nl', 'da'), ('sv', 'es')]

example_folder = 'exmpl'
logfile = join(example_folder, 'log.jsonl')


dm = EuroParlManager()
logger = MyLogger(logfile=logfile)
client_gpt = GPTClient(logger=logger)

mt_folder_gpt = join(example_folder, client_gpt.model)

In [3]:
task_gpt = TranslationTask(
    target_pairs=some_pairs,
    dm=dm,
    client=client_gpt,
    logger=logger,
    mt_folder=mt_folder_gpt,
    num_of_sents=400
)

In [4]:
task_gpt.run()
# This cell has been run multiple times in the past with the provided input
# See logs in log.jsonl

Document for pair da-en has been translated already.
400 translated from da to en
Document for pair de-fr has been translated already.
400 translated from de to fr
Document for pair de-pt has been translated already.
400 translated from de to pt
Document for pair fi-it has been translated already.
399 translated from fi to it
Document for pair fr-da has been translated already.
400 translated from fr to da
Document for pair it-pt has been translated already.
400 translated from it to pt
Document for pair nl-da has been translated already.
400 translated from nl to da
Document for pair sv-es has been translated already.
400 translated from sv to es


## Post-Processing
* Phase1 and Phase2 will not involve post-processing directly; both are translation phases.
* In this demo, we show what the post-processing step will look like.

In [5]:
from scripts.post_process import direct_triplet_align
from scripts.util import load_sents

for pair in some_pairs:
    s, t = pair
    src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
    mt_sents = load_sents(mt_folder_gpt, s, t)
    direct_triplet_align(
        mt_sents=mt_sents,
        ref_sents=tgt_sents,
        src_sents=src_sents,
        src_lang=s,
        ref_lang=t,
        folder_path='tmp_gpt'
    )

### Dealing with Malformatted Output

* The pair `fi-it` has only 399 translated sentences instead of 400. This is still sufficient for evaluation, but the output cannot be directly aligned.
* We use bertalign to correct the alignment.
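
Before choosing between direct alignment and re-alignment, it can help to check which pairs returned the expected number of sentences. The following is a minimal sketch; `split_by_count` is a hypothetical helper that is not part of `scripts`, and the counts are copied from the output of `task_gpt.run()` above.

```python
def split_by_count(counts, expected):
    """Split language pairs into those with the expected sentence count and the rest."""
    ok = [pair for pair, n in counts.items() if n == expected]
    mismatched = [pair for pair, n in counts.items() if n != expected]
    return ok, mismatched


# Sentence counts as reported by the translation task above (subset shown).
counts = {('da', 'en'): 400, ('fi', 'it'): 399, ('sv', 'es'): 400}

ok, mismatched = split_by_count(counts, expected=400)
print("direct alignment:", ok)
print("needs re-alignment:", mismatched)
```

Pairs in the second list are candidates for the bertalign-based correction, while the others can be aligned line by line.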

In [6]:
from scripts.post_process import align_sents
from scripts.util import load_sents
mt_sents = load_sents(mt_folder_gpt, 'fi', 'it')
print(len(mt_sents))

# For demonstration, we run bertalign only on the first 50 sentences
mt_sents = mt_sents[:50]
len(mt_sents)


399


50

In [7]:
from scripts.data_management import EuroParlManager
dm = EuroParlManager()
src_sents, tgt_sents = dm.get_sentence_pairs('fi', 'it', num_of_sents=50)

src_sents_a, mt_sents_a = align_sents(
    src_sents=src_sents,
    tgt_sents=mt_sents,
    src_lang='fi',
    tgt_lang='it',
    folder_path='tmp'
)

Source language: fi, Number of sentences: 53
Target language: it, Number of sentences: 53
Embedding source and target text using paraphrase-multilingual-MiniLM-L12-v2 ...
Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligned 53 fi sentences to 53 it sentences



* Observe that 50 sentences from the EuroParl corpus can correspond to 53 sentences in reality.
* This can be caused by 1-to-many alignments or by author blunders.

In [8]:
!rm -rf tmp

In [9]:
from scripts.post_process import post_triplet_align
post_triplet_align(
    src_sents_org=src_sents,
    src_sents_ali=src_sents_a,
    ref_sents_org=tgt_sents,
    mt_sents_ali=mt_sents_a,
    src_lang='fi',
    ref_lang='it',
    folder_path='tmp'
)

49 sents aligned for fi and it


## Eval

* Evaluation is done on JSONL files in COMET format, where each line contains the machine translation, the reference and the source text.
* In this notebook we compute only BLEU and chrF scores, but the format allows us to compute COMET scores as well (this will be done on Google Colab because it requires GPUs).
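
As a rough illustration of this format, the sketch below builds one JSON object per triplet with the keys `mt`, `ref` and `src`. The helper names `to_comet_records` and `dumps_jsonl` are hypothetical and are not the functions used in `scripts.post_process`.

```python
import json


def to_comet_records(src_sents, ref_sents, mt_sents):
    """Zip aligned sentences into dicts with the keys 'mt', 'ref' and 'src'."""
    if not (len(src_sents) == len(ref_sents) == len(mt_sents)):
        raise ValueError("All three sentence lists must have the same length.")
    return [
        {"mt": mt, "ref": ref, "src": src}
        for src, ref, mt in zip(src_sents, ref_sents, mt_sents)
    ]


def dumps_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


records = to_comet_records(
    src_sents=["Wiederaufnahme der Sitzungsperiode"],
    ref_sents=["Reprise de la session"],
    mt_sents=["Reprise de la période de session"],
)
print(dumps_jsonl(records))
```

Note that `json.dumps` escapes non-ASCII characters by default, so `é` is written as `\u00e9`; passing `ensure_ascii=False` would keep the characters readable in the file.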

In [10]:
!head -n 1 tmp_gpt/de-fr.jsonl

{"mt": "Reprise de la p\u00e9riode de session", "ref": "Reprise de la session", "src": "Wiederaufnahme der Sitzungsperiode"}


In [11]:

from scripts.scoring import ResultProducer
import os
l2f_gpt = {f.replace('.jsonl', ''): join('tmp_gpt', f)
             for f in os.listdir('tmp_gpt') if f.endswith('.jsonl')}

l2f_gpt['fi-it-fixed'] = join('tmp', 'fi-it.jsonl')

rp_gpt = ResultProducer(label2files=l2f_gpt)
rp_gpt.compute_results()

In [12]:
rp_gpt.display_results()

         Label       BLEU       chrF
0        da-en  34.999449  61.202820
1        de-fr  30.537712  58.877412
2        de-pt  26.598626  54.204422
3        fi-it   7.331725  30.730069
4        fr-da  32.340128  59.683924
5        it-pt  26.317238  55.412811
6        nl-da  29.005539  55.651123
7        sv-es  31.480298  58.966272
8  fi-it-fixed  20.754495  52.785338


* We observe that re-alignment improves the BLEU score for `fi-it` (from 7.33 to 20.75).
* The score may still be low because the re-alignment was run only on the first 50 sentences rather than on all of the roughly 400.
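
One plausible reason why a single missing sentence hurts the score so much is that positional pairing shifts every later translation onto the wrong reference. The toy sketch below illustrates this with sentence IDs instead of text; it assumes that the missing line was simply dropped, which is not verified for the `fi-it` output.

```python
def positional_matches(mt_ids, ref_ids):
    """Count how many items are paired with their own reference when zipped by position."""
    return sum(m == r for m, r in zip(mt_ids, ref_ids))


ref_ids = list(range(10))                 # references 0..9
mt_ids = [i for i in ref_ids if i != 3]   # translation of sentence 3 is missing

# Every translation after the gap is compared against the wrong reference.
print(positional_matches(mt_ids, ref_ids), "of", len(mt_ids))  # prints "3 of 9"
```

In this toy case only the sentences before the gap remain correctly paired, which is why re-aligning before scoring matters.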