In this notebook, we look at how a large GPT-2 model (gpt2-xl) handles a translation task when the prompt is prefixed with a few fr=en example pairs, as described in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).

We'll be using the OPUS-100 dataset of English-French phrase pairs:

https://opus.nlpl.eu/opus-100.php

https://data.statmt.org/opus-100-corpus/v1.0/supervised/en-fr/

Download the two files of the test split, one English and one French:

https://data.statmt.org/opus-100-corpus/v1.0/supervised/en-fr/opus.en-fr-test.en

https://data.statmt.org/opus-100-corpus/v1.0/supervised/en-fr/opus.en-fr-test.fr

Then copy both files to the folder ../data/OPUS-100/


In [5]:
import random
from gptbench import Sample, empty_config

In [6]:
ben = Sample(seed=0xDAD5CAFE)

cfg = empty_config()

cfg.model.set(dtype='bfloat16')

# the next sample config settings are important:
# top=1 will only emit the most probable token on each step (argmax) - we want accuracy, not randomness
# emit_start=False will skip emitting the initial context
# emit_until='.' will stop generating after the first '.', as we're translating single phrases
cfg.sample.set(top=1, emit_start=False, emit_until='.')

# if you get an out of memory error, try 'gpt2', the smaller model:
ben.init_pretrained('gpt2-xl', cfg)

Initializing model from gpt2-xl
Dataset: dummy 0 tokens
Dataset: loading uint16 tokens
Expanding initial dataset size of 1 (less than block_size+1) by 1025 times to size of 1025
Dataset train_path: dummy empty dataset, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 1557.61M
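
To illustrate what the `top=1` and `emit_until='.'` settings above amount to, here is a toy sketch of greedy decoding with a stop token. It is not how gptbench implements sampling: the function `greedy_decode` and the score table `toy` are made up for illustration, with the table standing in for the model's next-token scores.

```python
def greedy_decode(next_scores, start, stop='.', max_steps=20):
    """Pick the highest-scoring next token at each step; stop after emitting `stop`."""
    out, last = [], start
    for _ in range(max_steps):
        scores = next_scores[last]           # dict: candidate token -> score
        last = max(scores, key=scores.get)   # argmax, i.e. what top=1 selects
        out.append(last)
        if last == stop:                     # mirrors emit_until='.'
            break
    return out

# Hypothetical bigram "model" with invented scores (an assumption, not GPT-2 output).
toy = {
    '=': {'A': 0.7, 'The': 0.3},
    'A': {'man': 0.9, 'dog': 0.1},
    'man': {'.': 0.6, 'said': 0.4},
    '.': {'A': 1.0},
}

tokens = greedy_decode(toy, '=')
print(tokens)
```

Because the most probable token is always chosen, the same prompt always yields the same output; any variation between runs has to come from the prompt itself.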


In [7]:
with open('../data/OPUS-100/opus.en-fr-test.fr', 'r', encoding='utf-8') as f:
    fr_lines = [l.strip() for l in f.readlines()]

with open('../data/OPUS-100/opus.en-fr-test.en', 'r', encoding='utf-8') as f:
    en_lines = [l.strip() for l in f.readlines()]

print(fr_lines[:3])
print(en_lines[:3])

["- Vous étiez en train de vous embrasser à l'arrêt de bus!", 'Avec une ironie farceuse, le jeune artiste tchèque Krištof Kintera chamboule l’art et la vie.', 'Qui va parler au vendeur de voitures ? Big Freddy.']
['- You were at a bus stop kissing him!', 'With irony and mischief the young Czech artist Krištof Kintera turns art and life on their heads.', "- Who's going to talk to the used car salesman?"]


Each line in one list is matched to its translation at the same index in the other list.

Before the French phrase to be translated, we'll pass several examples in the form french_phrase=english_phrase, which set the context. Then we'll pass the French phrase itself and end the prompt with an '='. So the prompt will be:

```
fr1=en1\n
fr2=en2\n
...
french_phrase=
```

The model should then complete the prompt, after the final '=', with the English translation of french_phrase.
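
A minimal sketch of this prompt layout, using two made-up example pairs. The helper name `build_prompt` and the sample sentences are illustrative assumptions, not part of gptbench or the dataset:

```python
def build_prompt(pairs, french_phrase):
    """Join (fr, en) pairs as 'fr=en' lines, then append the phrase to translate and '='."""
    context = ''.join(f"{fr}={en}\n" for fr, en in pairs)
    return context + french_phrase + '='

# Hypothetical example pairs, for illustration only.
pairs = [("Bonjour.", "Hello."), ("Merci beaucoup.", "Thank you very much.")]
prompt = build_prompt(pairs, "Où est la gare ?")
print(prompt)
```

The prompt ends right after '=', so whatever the model generates next is read as the English side of the last pair.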

The next function builds a context of random pairs, taking care that they fit in the block_size context window that the model can process, while also reserving space for the French phrase to be translated.

In [8]:
def build_random_context(fr_lines, en_lines, token_reserve, dataset):
    """
    Build 'fr=en\n' lines that, when encoded to tokens, fit within the model's block_size (1024) minus token_reserve
    """
    out = ''
    while True:
        index = random.randrange(len(fr_lines))
        fr, en = fr_lines[index], en_lines[index]
        new_out = out + fr + '=' + en + '\n'
        new_enc = dataset.encode(new_out)
        if len(new_enc) > dataset.get_block_size() - token_reserve:
            return out

        out = new_out
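
Note that the function above draws indices with replacement, so the same pair can appear twice, and the phrase being translated could itself be picked if it happens to be in the dataset. A hedged sketch of a variant that avoids both follows. The names `fill_context` and `count_words` are hypothetical, and the word counter is only a stand-in for the real GPT-2 token count:

```python
import random

def fill_context(pairs, budget, count_tokens, rng, exclude=None):
    """Greedily add shuffled "fr=en" lines until the next one would exceed `budget` tokens."""
    # Drop the phrase to be translated so its reference translation cannot leak in.
    candidates = [p for p in pairs if p[0] != exclude]
    rng.shuffle(candidates)  # shuffling samples each pair at most once
    out = ''
    for fr, en in candidates:
        new_out = out + fr + '=' + en + '\n'
        if count_tokens(new_out) > budget:
            break
        out = new_out
    return out

def count_words(text):
    # Stand-in tokenizer: counts whitespace-separated words, NOT GPT-2 BPE tokens.
    return len(text.split())

# Hypothetical example pairs, for illustration only.
pairs = [("Bonjour.", "Hello."),
         ("Merci beaucoup.", "Thank you very much."),
         ("Bonne nuit.", "Good night.")]
ctx = fill_context(pairs, budget=8, count_tokens=count_words,
                   rng=random.Random(0), exclude="Bonne nuit.")
print(ctx)
```

In the notebook, `count_tokens` would be something like `lambda s: len(dataset.encode(s))` and `budget` would be the block size minus the reserve, matching the check in `build_random_context`.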

In [35]:
# context reserving 100 tokens
ctx = build_random_context(fr_lines,en_lines, 100, ben.train_dataset)
ctx

"La séance vise à établir un dialogue interactif entre les membres du Conseil de sécurité et les représentants des organisations régionales et sous-régionales.=The meeting is intended to be an interactive dialogue between the Security Council members and representatives of regional and sub-regional organizations.\nLes activités d'engagement du public devraient inciter les Canadiens :=Public engagement activities should encourage Canadians to:\nVoilà un matraquage datant de Juillet.=Here's a bludgeoning from july.\nLe dossier de travail du personnel indique que le contrôleur avait pris deux pauses; cependant, celui-ci ne se rappelait pas en avoir pris.=The personnel utilization record shows that the controller had taken two breaks; however, he could not recall having taken a break.\nUne mycose ?=Athlete's foot?\nTélévision par satellite avec 20 chaînes dans plusieurs langues en moyenne=Satellite TV with 20 channels in different languages\nVotre plan avec le gamin ?=What was the plan wit

In [30]:
# some convenience functions

def prepare(fr):
    ds = ben.train_dataset
    reserve = len(ds.encode(fr)) + 1 # for '='
    ctx = build_random_context(fr_lines,en_lines, reserve, ds)
    ctx_len = len(ds.encode(ctx))
    total = ctx + fr + '=' # the context, then the French phrase and finish the prompt with '='
    total_len = len(ds.encode(total))
    print(f"ctx={ctx_len} + fr={reserve} -> total={total_len} tokens")
    return total

def translate(fr, **kwargs):
    print('========================== French input:')
    print(fr)
    q = prepare(fr)
    print('========================== Random translations context + input:')
    print(q[:500], '(...)')
    print('========================== English output:')
    ben.sample(q, **kwargs)

In [31]:
fr = "Un homme a expliqué que l’opération gratuite qu’il avait subie pour soigner une hernie lui permettrait de travailler à nouveau."
translate(fr)

Un homme a expliqué que l’opération gratuite qu’il avait subie pour soigner une hernie lui permettrait de travailler à nouveau.
ctx=929 + fr=48 -> total=977 tokens
o Observation de la Terre depuis l'espace (OT)?L'objectif de l'activité de programme est de développer et d'opérationnaliser l'utilisation de l'observation spatiale de la Terre pour le bénéfice des Canadiens, particulièrement en matière d'environnement, de gestion des ressources et d'utilisation des terres, ainsi que de sécurité et de politique étrangère.=Space Based Earth Observation (EO)?The program activity objective is to develop and make operational the use of space Earth Observation for th (...)
A man explained that he would have preferred to work for a better salary.

In [33]:
fr = "Le ministre esquiva la question s'il a obtenu de la part de l'Inde la promesse de s'occuper de l'affaire."
translate(fr)

Le ministre esquiva la question s'il a obtenu de la part de l'Inde la promesse de s'occuper de l'affaire.
ctx=970 + fr=39 -> total=1009 tokens
Pourquoi précisément Anvers et Rotterdam?=Why specifically Antwerp or Rotterdam?
Le paragraphe104(1.1) est modifié de façon à préciser qu’il s’applique malgré le paragraphe248(25) de la Loi. Ce dernier paragrapheprévoit, pour l’application de la Loi, les circonstances dans lesquelles une personne ou une société de personnes est réputée avoir un «droit de bénéficiaire» dans une fiducie.=Subsection104(1.1) is amended to clarify that it applies notwithstanding subsection248(25) of the Act. Subsecti (...)
The Minister is obliged to answer the question whether the Inde is responsible for the incident.

In [34]:
fr = "Dernière visite 26 septembre 2007 04:17."
translate(fr)

Dernière visite 26 septembre 2007 04:17.
ctx=937 + fr=17 -> total=954 tokens
Nous vous avons invités à la fête.=We invited you to the party.
J'ai dit : "J'y touche pas". Et là, il a refusé.=I told him I didn't want to touch it, so he turned him down.
Départs d’étrangers par sexe, groupe d’âge, pays de citoyenneté et date d’expiration du visa ou permis actuel=Foreigners departing by sex, age group, country of citizenship and expiration date of current visa or permit
Source : «Les «sauveurs» de l’Irak», par Manlio Dinucci, Traduction Marie-Ange Patrizio, Il Manifesto (Ital (...)
The first visit to 26 September 2007 04:17.

If you run the queries above several times, the translations change, since they depend heavily on the randomly built prior context.

Still, the translated phrases are mostly coherent, and the model is clearly able to translate portions of the input into the corresponding English words.