# Costs
* 110 directed pairs among the 11 European languages used in Philipp Koehn's paper

In [2]:
from scripts.data_management import FloresPlusManager, EuroParlManager, Opus100Manager
from scripts.stats import get_deepl_cost, get_gpt41_cost
langs = EuroParlManager.EURO_LANGS

# All ordered (source, target) pairs with source != target
pairs = [(x, y) for x in langs for y in langs if x != y]

In [3]:
run_1 = []
run_2 = []

for pair in pairs:
    if 'en' in pair:
        run_1.append(pair)
    else:
        run_2.append(pair)

len(run_1), len(run_2)

(20, 90)
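The split above can be sanity-checked without the project's data managers. This is a minimal sketch, assuming the 11 language codes listed below; the actual contents of `EuroParlManager.EURO_LANGS` are not shown in this notebook, so `assumed_langs` is a stand-in rather than the real list.

```python
from itertools import permutations

# Assumed language codes; the real list lives in EuroParlManager.EURO_LANGS.
assumed_langs = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]

# Ordered (source, target) pairs with source != target: n * (n - 1) of them.
all_pairs = list(permutations(assumed_langs, 2))

# Directions that involve English go to Phase 1, all others to Phase 2.
with_en = [p for p in all_pairs if "en" in p]
without_en = [p for p in all_pairs if "en" not in p]

n = len(assumed_langs)
print(len(all_pairs), len(with_en), len(without_en))
```

With n languages there are n * (n - 1) directions in total and 2 * (n - 1) of them involve English, which matches the 20 / 90 split for n = 11.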

## Phase 1
* Compute translations from and into English for 10 European languages across 3 corpora and 2 translators

In [4]:
dm_ep = EuroParlManager()
dm_flo = FloresPlusManager()
dm_opus = Opus100Manager()
dms = [dm_ep, dm_flo, dm_opus]

deepl_cost = []
gpt41_cost = []

for pair in run_1:
    s, t = pair
    for dm in dms:
        src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
        deepl_cost.append(get_deepl_cost(src_sents))
        gpt41_cost.append(get_gpt41_cost(src_sents))

print(f'Deepl Cost for Run 1: €{sum(deepl_cost):.2f}')
print(f'GPT4.1 Cost for Run 1: ${sum(gpt41_cost):.2f}')
len(deepl_cost), len(gpt41_cost)

Deepl Cost for Run 1: €58.42
GPT4.1 Cost for Run 1: $6.78


(60, 60)

## Phase 2
* Compute translations for the remaining 90 language directions to obtain a full matrix of 110 BLEU scores for each of 2 corpora and 2 translators

In [None]:
# Phase 2 uses only Europarl and FLORES+, so no OPUS-100 manager is created here
dm_ep = EuroParlManager()
dm_flo = FloresPlusManager()
dms = [dm_ep, dm_flo]
deepl_cost = []
gpt41_cost = []

for pair in run_2:
    s, t = pair
    for dm in dms:
        src_sents, tgt_sents = dm.get_sentence_pairs(s, t, num_of_sents=400)
        deepl_cost.append(get_deepl_cost(src_sents))
        gpt41_cost.append(get_gpt41_cost(src_sents))

print(f'Deepl Cost for Run 2: €{sum(deepl_cost):.2f}')
print(f'GPT4.1 Cost for Run 2: ${sum(gpt41_cost):.2f}')

Deepl Cost for Run 2: €219.94
GPT4.1 Cost for Run 2: $27.91


* Thus, the cost for the whole project would be around €280 for DeepL (58.42 + 219.94 = 278.36) and around $35 for GPT-4.1 (6.78 + 27.91 = 34.69).
* However, the token counts are estimated with tiktoken. The actual input token count is slightly higher because the system prompt and user prompt are added to the input text, while the output token count usually tends to be lower than the input count. Raising the estimate by roughly 10% should bring it close to the truth, so the GPT-4.1 cost would be around $40.
* Note: Yes, I tend to round up. Overestimating the cost does no harm, since it guarantees that the code runs to completion and doesn't halt because we ran out of budget.
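The totals in the bullets above can be reproduced with a few lines of arithmetic. This is a sketch only: the per-run figures are copied from the cell outputs, while `GPT_MARGIN` and the `round_up` helper are illustrative names introduced here, not part of `scripts.stats`.

```python
import math

# Per-run estimates copied from the cell outputs above.
deepl_eur = {"run_1": 58.42, "run_2": 219.94}
gpt41_usd = {"run_1": 6.78, "run_2": 27.91}

# Assumed safety margin for prompt overhead (the rough 10% mentioned above).
GPT_MARGIN = 0.10


def round_up(value, step):
    """Round value up to the next multiple of step."""
    return math.ceil(value / step) * step


deepl_total = sum(deepl_eur.values())
gpt41_total = sum(gpt41_usd.values())
gpt41_with_margin = gpt41_total * (1 + GPT_MARGIN)

print(f"DeepL:   €{deepl_total:.2f}, budget €{round_up(deepl_total, 10)}")
print(f"GPT-4.1: ${gpt41_total:.2f}, with margin ${gpt41_with_margin:.2f}, "
      f"budget ${round_up(gpt41_with_margin, 5)}")
```

Rounding up to the next multiple of 10 (DeepL) or 5 (GPT-4.1) is an arbitrary choice that happens to reproduce the €280 and $40 figures; any step that keeps the budget above the estimate serves the same purpose.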

## Storage Cost
* Based on `Demo.ipynb`, one translated pair produces a `.txt` file of 7-8 KB. Let us assume we always get 10 KB per file (an overestimate).
* We plan to translate 400 sentences rather than 50, i.e. 8 times as many, so we get 10 * 8 = 80 KB per file.
* For Phase 1, we have 20 pairs, 3 datasets and 2 translators, so 20 * 3 * 2 = 120 files. 120 * 80 = 9600 KB.
* For Phase 2, we have 90 pairs, 2 datasets and 2 translators, so 90 * 2 * 2 = 360 files. 360 * 80 = 28800 KB.
* In total, we expect to store roughly 9600 + 28800 = 38400 KB = 38.4 MB (overestimated)
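The same storage estimate, written out as a short calculation. This is a sketch under the assumptions stated in the bullets (10 KB per 50-sentence file, file size scaling linearly with sentence count, 1 MB = 1000 KB); the variable names are illustrative.

```python
# Overestimate of the 7-8 KB observed for a 50-sentence file in Demo.ipynb.
KB_PER_FILE_AT_50_SENTS = 10
SCALE = 400 // 50  # 400 sentences instead of 50, assuming linear growth

kb_per_file = KB_PER_FILE_AT_50_SENTS * SCALE

# (language pairs, datasets, translators) per phase, as listed above.
phases = {
    "Phase 1": (20, 3, 2),
    "Phase 2": (90, 2, 2),
}

total_kb = 0
for name, (n_pairs, n_datasets, n_translators) in phases.items():
    n_files = n_pairs * n_datasets * n_translators
    phase_kb = n_files * kb_per_file
    total_kb += phase_kb
    print(f"{name}: {n_files} files, {phase_kb} KB")

total_mb = total_kb / 1000
print(f"Total: {total_kb} KB = {total_mb} MB")
```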
