# GPT Testing
* Showcases how different prompts affect the output
* We chose the prompt used for `gpt1` together with the model `gpt-4.1`, as this combination seems to be the most stable
* Note: some issues can be dealt with using `bertalign`; others can only be solved by re-running the translation, as GPT is never fully deterministic.
* `temperature=0` makes the output as close to deterministic as possible, but random sampling is still present behind the scenes

In [1]:
from scripts.data_management import FloresPlusManager
from scripts.translators import GPTClient
from scripts.util import MyLogger, LANG_ISO, split_sents
from io import StringIO
logfile = StringIO()
logger = MyLogger(logfile=logfile)

dm = FloresPlusManager()
de_sents, en_sents = dm.get_sentence_pairs('de', 'en', num_of_sents=100)
real_de_sents = split_sents('\n'.join(de_sents), 'de')
real_en_sents = split_sents('\n'.join(en_sents), 'en')
len(de_sents), len(en_sents), len(real_de_sents), len(real_en_sents)

(100, 100, 106, 105)

* Case1: Default prompt, includes both System & User prompt

In [2]:
gpt1 = GPTClient(logger=logger)
out = gpt1.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(100, 106)

* Case2: Includes only System prompt

In [3]:
gpt2 = GPTClient(logger=logger)
gpt2.user_prompt = lambda src_lang, tgt_lang, text: text
out = gpt2.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(100, 106)

* Case3: Includes only modified System prompt

In [4]:
def sys_prompt(src_lang, tgt_lang):
    p1 = f"You are a {LANG_ISO[src_lang]}-to-{LANG_ISO[tgt_lang]} translator."
    p2 = "Please make sure to keep the same formatting, do not add more newlines."
    return '\n'.join([p1, p2])

gpt3 = GPTClient(logger=logger)
gpt3.user_prompt = lambda src_lang, tgt_lang, text: text
gpt3.sys_prompt = sys_prompt

out = gpt3.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(100, 106)

* Case4: Default prompt but with GPT-4o

In [5]:
gpt4 = GPTClient(logger=logger, model='gpt-4o')
out = gpt4.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(1, 104)

* Case5: Only System prompt with GPT-4o

In [6]:
gpt5 = GPTClient(logger=logger, model='gpt-4o')
gpt5.user_prompt = lambda src_lang, tgt_lang, text: text
out = gpt5.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(63, 136)

* Case6: Only modified System prompt with GPT-4o

In [7]:
def sys_prompt(src_lang, tgt_lang):
    p1 = f"You are a {LANG_ISO[src_lang]}-to-{LANG_ISO[tgt_lang]} translator."
    p2 = "Please make sure to keep the same formatting, do not add more newlines."
    return '\n'.join([p1, p2])

gpt6 = GPTClient(logger=logger, model='gpt-4o')
gpt6.user_prompt = lambda src_lang, tgt_lang, text: text
gpt6.sys_prompt = sys_prompt

out = gpt6.translate_document(
    text=de_sents,
    src_lang='de',
    tgt_lang='en'
)
real_sents = split_sents('\n'.join(out), 'en')
len(out), len(real_sents)

(1, 104)

* This notebook also demonstrates how the logger can be used. In this case, no logfile exists on disk and everything is stored in memory within a `StringIO` object.

In [None]:
import json
logfile_str = logfile.getvalue()
log_data = [json.loads(ln) for ln in logfile_str.splitlines()]
for log in log_data:
    print(f"{log['translator']:<8} {log['time']:.2f}s {log['in_lines']} {log['out_lines']}")

gpt-4.1  31.86s 100 100
gpt-4.1  55.64s 100 100
gpt-4.1  40.34s 100 100
gpt-4o   39.95s 100 1
gpt-4o   35.27s 100 63
gpt-4o   33.22s 100 1
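
The runs where the output structure broke can also be picked out of the log programmatically. This is a small sketch, assuming the log is one JSON object per line with the keys `translator`, `time`, `in_lines`, and `out_lines` that the cell above reads. `find_line_mismatches` is a hypothetical helper, and the sample entries are copied from three of the rows printed above.

```python
import json
from io import StringIO


def find_line_mismatches(logfile_str: str) -> list:
    """Return the log entries whose output line count differs from the input."""
    mismatches = []
    for ln in logfile_str.splitlines():
        if not ln.strip():
            continue  # skip blank lines
        entry = json.loads(ln)
        if entry["in_lines"] != entry["out_lines"]:
            mismatches.append(entry)
    return mismatches


# Build an in-memory sample log from rows shown in the output above.
sample = StringIO()
rows = [
    ("gpt-4.1", 31.86, 100, 100),
    ("gpt-4o", 39.95, 100, 1),
    ("gpt-4o", 35.27, 100, 63),
]
for translator, time_s, in_lines, out_lines in rows:
    entry = {
        "translator": translator,
        "time": time_s,
        "in_lines": in_lines,
        "out_lines": out_lines,
    }
    sample.write(json.dumps(entry) + "\n")

bad = find_line_mismatches(sample.getvalue())
for entry in bad:
    print(f"{entry['translator']:<8} {entry['in_lines']} -> {entry['out_lines']}")
```

In the notebook itself, `find_line_mismatches(logfile.getvalue())` would apply the same check to the real log.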


## Limitations
* It is difficult to illustrate every variation of the prompt-engineering issues we faced, as they are not fully reproducible.
* In some instances the issues are language-dependent, but they cannot be attributed to one specific language.
* In some instances they are input-length dependent: an issue that is not observable for 50 sentences may occur for 100 or 200.
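
One way to look for the length dependence is to translate increasingly long prefixes of the same input and compare line counts. The sketch below is illustrative only: `probe_input_lengths` and `stub_translate` are hypothetical, and the stub's behaviour (collapsing anything longer than 50 lines) is an invented failure mode, not a measured result.

```python
from typing import Callable, Dict, List, Sequence


def probe_input_lengths(
    translate_fn: Callable[[List[str]], List[str]],
    sents: List[str],
    sizes: Sequence[int] = (50, 100, 200),
) -> Dict[int, int]:
    """Map each tested input size to the number of lines that came back."""
    results: Dict[int, int] = {}
    for n in sizes:
        if n > len(sents):
            continue  # not enough sentences available for this size
        results[n] = len(translate_fn(sents[:n]))
    return results


def stub_translate(sents: List[str]) -> List[str]:
    # Invented failure mode for the demo: long inputs collapse into one line.
    if len(sents) > 50:
        return [" ".join(sents)]
    return list(sents)


report = probe_input_lengths(stub_translate, [f"s{i}" for i in range(100)])
for size, n_out in report.items():
    status = "ok" if size == n_out else "MISMATCH"
    print(f"{size:>4} in, {n_out:>4} out  {status}")
```

Because the real model is not deterministic, a single probe per size only gives an indication; a size that passes once may still fail on another run.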