# train GPT2 not (only) in English

mlugs, 2020-09-15  
Andreas Madsack

## agenda

1. what is GPT2?
2. huggingface transformers
3. karpathy minGPT

## what is GPT2

- developed by OpenAI
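- architecture: decoder-only transformer language model
- training objective: next word (token) prediction
- GPT2 is about 10x GPT -- 1.5 billion vs. 117 million parameters
- GPT3 is more than 100x GPT2 -- 175 billion parameters
- newest version (GPT3) is closed source: no released weights, access only through OpenAI's API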
- GPT3 paper: https://arxiv.org/abs/2005.14165
- training data (of GPT3): >500GB of text
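
"Next word prediction" means the model is applied in a loop: predict a distribution over the next word, sample one, append it, repeat. A toy sketch of that loop, with a bigram count table standing in for the transformer (the corpus and the names `train_bigram` and `generate` are made up for illustration, this is not how GPT2 is implemented):

```python
import random
from collections import Counter, defaultdict


def train_bigram(words):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_words, rng):
    """Autoregressive loop: sample the next word, append it, repeat."""
    out = [start]
    for _ in range(max_new_words):
        followers = counts.get(out[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out


corpus = "the cat sat on the mat and the cat ate the fish".split()
counts = train_bigram(corpus)
print(" ".join(generate(counts, "the", 8, random.Random(0))))
```

GPT2 replaces the count table with a transformer that conditions on the whole context instead of only the previous word; the generation loop stays the same.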

### openai code / forks

- https://github.com/openai/gpt-2 -- unmaintained, hard to use
- https://github.com/nshepperd/gpt-2 -- everyone used this fork
- example of retraining GPT2 on jpop (Japanese) text:  
  https://medium.com/@ngwaifoong92/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f

## huggingface transformers

- https://huggingface.co/
- all open source, good documentation; the focus is on BERT, but GPT2 is supported
- PyTorch and TensorFlow!
- really good: https://github.com/huggingface/tokenizers - rust+python
- ready-to-use BPE (Byte Pair Encoding) tokenizer -- best solution for unknown words, which get split into known subword pieces
- needed because of Zipf's law: `The probability of occurrence of words or other items starts high and tapers off. Thus, a few occur very often while many others occur rarely.`
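
How BPE learns its vocabulary, as a minimal sketch: start from single characters and repeatedly merge the most frequent adjacent pair. The word counts are made up, and this is the plain character-level variant; GPT2's tokenizer works on bytes and the huggingface `tokenizers` library is far more optimized, so treat this only as an illustration of the idea.

```python
from collections import Counter


def most_frequent_pair(vocab):
    """vocab maps a tuple of symbols (one word) to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge rules and the re-segmented vocabulary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(vocab)
        if pair is None:  # nothing left to merge
            break
        vocab = merge_pair(vocab, pair)
        merges.append(pair)
    return merges, vocab


# hypothetical toy corpus statistics: word -> count
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, 3)
print(merges)
print(vocab)
```

In this toy corpus the frequent suffix `est` becomes a single symbol after two merges. A rare or unseen word is never "unknown": in the worst case it falls back to single characters (bytes for GPT2).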

In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7f45841bdd00>

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

In [3]:
text = "My name is Bob, my favorite"
tokens = tokenizer.encode(text)
tokens = torch.tensor([tokens])
tokens = model.generate(tokens, max_length=20+tokens.shape[1], do_sample=True)
tokens = tokens[0].tolist()

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


In [4]:
tokenizer.decode(tokens)

'My name is Bob, my favorite and I\'m on my way."\n\nThe show began with what the creator told the Daily News'

### pretrained models in huggingface model zoo

- gpt2, gpt2-medium, gpt2-large, gpt2-xl (small to extra large)
- distilgpt2 (smaller, but nearly the same results)
- pierreguillou/gpt2-small-portuguese -- https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787
- anonymous-german-nlp/german-gpt2

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for GPT2-style models
tokenizer_de = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")
model_de = AutoModelForCausalLM.from_pretrained("anonymous-german-nlp/german-gpt2")



In [6]:
text = "Mein name ist Hans, am liebsten "
tokens = tokenizer_de.encode(text)
tokens = torch.tensor([tokens])
tokens = model_de.generate(tokens, max_length=20+tokens.shape[1], do_sample=True)
tokens = tokens[0].tolist()

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


In [7]:
tokenizer_de.decode(tokens)

'Mein name ist Hans, am liebsten!!\nMein Name ist Peter, am liebsten!!\nIch bin 26 Jahre alt, hellwach'

Main problem? -> training data!

### another example using huggingface

- https://github.com/mgrankin/ru_transformers
- very good description of what has to be done
- training time: `Nvidia Titan RTX an epoch takes 70 minutes and the same epoch takes 12.5 minutes on TPU v3-8`
- 80+ epochs!
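
What these numbers add up to, as back-of-the-envelope arithmetic. 80 epochs is taken as a lower bound for "80+"; everything else comes from the figures quoted above.

```python
EPOCHS = 80  # "80+ epochs", so this is a lower bound
MINUTES_PER_EPOCH = {"Nvidia Titan RTX": 70.0, "TPU v3-8": 12.5}


def total_hours(minutes_per_epoch, epochs):
    """Total wall-clock training time in hours."""
    return minutes_per_epoch * epochs / 60.0


for device, minutes in MINUTES_PER_EPOCH.items():
    print(f"{device}: at least {total_hours(minutes, EPOCHS):.1f} hours")

speedup = MINUTES_PER_EPOCH["Nvidia Titan RTX"] / MINUTES_PER_EPOCH["TPU v3-8"]
print(f"TPU speedup: {speedup:.1f}x")
```

That is close to four days on the single GPU versus well under one day on the TPU.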

## karpathy minGPT

- https://github.com/karpathy/minGPT/
- PyTorch-only implementation of GPT2
- no BPE support (yet)
- character-level generation is a lot better than with a "classical" RNN generator

### Example for Latin

- copy the `mingpt` folder from https://github.com/karpathy/minGPT/ into this directory
- download la_dedup.txt.gz from https://oscar-public.huma-num.fr/shuffled/la_dedup.txt.gz
- training time: about 1 day on an RTX 2070

In [8]:
from mingpt.model import GPT, GPTConfig
from mingpt.utils import sample
import torch

#### data

In [9]:
from torch.utils.data import Dataset
import numpy as np
import gzip
import math

block_size = 200  # spatial extent of the model for its context
batch_size = 64  # maximum possible for GPU

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print("data has %d characters, %d unique." % (data_size, vocab_size))

        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i : i + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
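
What `__getitem__` returns, shown on a toy string without torch: `y` is `x` shifted by one character, so every position in the block is a training example "context so far -> next character". The string, the block size and the fixed start index are made up for illustration (the real class picks the start at random).

```python
toy_text = "hello world"
toy_block_size = 4

chars = sorted(set(toy_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

start = 0  # fixed here; CharDataset draws this index at random
chunk = toy_text[start : start + toy_block_size + 1]
ids = [stoi[ch] for ch in chunk]
x, y = ids[:-1], ids[1:]  # inputs and targets, offset by one position

for t in range(toy_block_size):
    context = "".join(itos[i] for i in x[: t + 1])
    print(f"context {context!r} -> target {itos[y[t]]!r}")
```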

In [10]:
# read the whole corpus into memory and close the file afterwards
with gzip.open("la_dedup.txt.gz", "rb") as f:
    text = f.read().decode("utf-8")
train_dataset = CharDataset(text, block_size)

data has 8622770 characters, 2232 unique.
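
What "one epoch" means with this dataset, worked out from the character count printed above and the `__len__` formula of `CharDataset`. The batch count assumes the DataLoader keeps the last partial batch (the PyTorch default); that part is an assumption about the trainer, not something shown in this notebook.

```python
import math

data_size = 8_622_770  # characters, from the output above
block_size = 200
batch_size = 64
max_epochs = 200

# CharDataset.__len__: number of (random) samples drawn per epoch
samples_per_epoch = math.ceil(data_size / (block_size + 1))
# assumes the last, partially filled batch is kept
batches_per_epoch = math.ceil(samples_per_epoch / batch_size)
tokens_per_epoch = samples_per_epoch * block_size
total_tokens = max_epochs * tokens_per_epoch

print(samples_per_epoch, batches_per_epoch, tokens_per_epoch, total_tokens)
```

Because each sample starts at a random position, an epoch covers roughly one corpus worth of characters, but not every character exactly once.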


#### training

In [11]:
from mingpt.trainer import Trainer, TrainerConfig

mconf = GPTConfig(
    vocab_size=train_dataset.vocab_size,
    block_size=train_dataset.block_size,
    n_layer=8,
    n_head=8,
    n_embd=512,
)

tconf = TrainerConfig(
    max_epochs=200,
    batch_size=batch_size,
    learning_rate=6e-4,
    lr_decay=True,
    warmup_tokens=batch_size * 20,
    final_tokens=200 * len(train_dataset) * block_size,
    num_workers=4,
)
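# train a fresh minGPT model, not the huggingface `model` from the cells above
model_latin = GPT(mconf)
trainer = Trainer(model_latin, train_dataset, None, tconf)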
# trainer.train()
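
What `lr_decay`, `warmup_tokens` and `final_tokens` do, as a sketch: linear warmup followed by cosine decay down to 10% of the base learning rate, counted in processed tokens. This is an assumption about how minGPT's trainer behaves, not a copy of its code; `lr_multiplier` is a made-up name, so check `mingpt/trainer.py` for the real logic.

```python
import math


def lr_multiplier(tokens_seen, warmup_tokens, final_tokens, floor=0.1):
    """Factor applied to the base learning rate after `tokens_seen` tokens."""
    if tokens_seen < warmup_tokens:
        # linear warmup from 0 to 1
        return tokens_seen / max(1, warmup_tokens)
    # cosine decay from 1 down to `floor`
    progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))


learning_rate = 6e-4
warmup_tokens = 64 * 20  # batch_size * 20, as in the config above
# 200 epochs * len(train_dataset) * block_size, with len() computed as in CharDataset
final_tokens = 200 * math.ceil(8_622_770 / 201) * 200

for tokens in (0, warmup_tokens, final_tokens // 2, final_tokens):
    print(tokens, learning_rate * lr_multiplier(tokens, warmup_tokens, final_tokens))
```

With these settings the warmup (1280 tokens) is shorter than a single batch of 64 x 200 tokens, so under this assumption the schedule is effectively pure cosine decay.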

#### prediction

In [12]:
model_latin = GPT(mconf)
model_latin.load_state_dict(torch.load("latin_char_model.pt", map_location=torch.device('cpu')))

context = "Delectus magni accusamus ut qui."
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None, ...]
y = sample(model_latin.cpu(), x, 200, temperature=0.9, sample=True, top_k=5)[0]
completion = "".join([train_dataset.itos[int(i)] for i in y])

In [13]:
completion

'Delectus magni accusamus ut qui.\nMinus id quod rem faciendi. Aliquam deleniti dolorem excepturi deserunt saepe ut error consequatur. Eos aut ea non in aliquid. Non numquam eius modi tempora incidunt. Illic quodam facilis et Iustus q'
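
What `temperature=0.9` and `top_k=5` do during sampling, as a pure-Python sketch on a list of logits: scale the logits, keep only the `k` largest, apply softmax to those and draw one index. `sample_top_k` and the toy logits are made up for illustration; minGPT's `sample` does the equivalent on torch tensors.

```python
import math
import random


def sample_top_k(logits, k, temperature, rng):
    """Draw one index from the k highest logits after temperature scaling."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    scaled = [value / temperature for value in logits]
    # indices of the k largest scaled logits
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # softmax over the survivors (shifted by the max for numerical stability)
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, -1.0, 3.0]  # hypothetical scores for a 4-character vocabulary
rng = random.Random(0)
print([sample_top_k(toy_logits, 2, 0.9, rng) for _ in range(10)])
```

A lower temperature sharpens the distribution towards the best candidate, and a small `top_k` removes the long tail of unlikely characters, which matters here with 2232 unique characters.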