## Train a character-level GPT on some text data

The inputs here are plain text files, which we split into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn, though it doesn't roll off the tongue quite as well. In this example we feed it some Shakespeare and train it to predict the text one character at a time.
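
As a rough illustration of what splitting text into characters means, the sketch below builds a character vocabulary from a short string and round-trips it through integer ids. The sample string and the `encode`/`decode` helpers are made up for this example; only the `stoi`/`itos` naming follows the dataset class defined later in the notebook.

```python
# Hypothetical sample text; any string works.
text = "to be or not to be"

# The vocabulary is the sorted set of distinct characters.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s):
    """Map a string to a list of integer ids (illustrative helper)."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of integer ids back to a string (illustrative helper)."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(len(chars), ids, decode(ids))
```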

In [1]:
# Added by Marxav: import minGPT library (only 3 .py files) and import a dataset (input.txt file)
!wget -N https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/model.py
!wget -N https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/utils.py
!wget -N https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/trainer.py
!wget -N https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
!mkdir -p mingpt
!mv *.py mingpt

--2020-10-08 10:02:19--  https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/model.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8455 (8.3K) [text/plain]
Saving to: ‘model.py’


Last-modified header missing -- time-stamps turned off.
2020-10-08 10:02:19 (82.8 MB/s) - ‘model.py’ saved [8455/8455]

--2020-10-08 10:02:20--  https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1718 (1.7K) [text/plain]
Saving to: ‘utils.py’


Last-modified header missing -- time

In [2]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [3]:
# Added by Marxav: add info about the GPU
print('device_name:', torch.cuda.get_device_name())
print('device_count:', torch.cuda.device_count())

device_name: Tesla P4
device_count: 1


In [4]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [5]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
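
The "hello" example from the docstring can be checked by hand. The sketch below uses plain Python strings instead of tensors; `block_size = 4` comes from the docstring's example, not from the setting used in this notebook.

```python
# Plain-Python sketch of the input/target offset (no torch needed).
chunk = "hello"          # block_size + 1 characters
block_size = 4

x, y = chunk[:-1], chunk[1:]   # "hell" and "ello"

# Position t of y is the target for the prefix x[:t+1],
# so one chunk yields block_size separate training examples.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```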


In [6]:
block_size = 128 # spatial extent of the model for its context

In [7]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
text = text[0:200000] # downsized by XM for a 1-minute demo
train_dataset = CharDataset(text, block_size) # text was already truncated above; one line of poem is roughly 50 characters

data has 200000 characters, 62 unique.
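
The number of training examples follows directly from `__len__`: every start index yields one overlapping window of `block_size + 1` characters. A quick check of the arithmetic, assuming the batch size of 512 used in the training cell and that the last partial batch is kept:

```python
import math

data_size = 200000   # characters kept after truncation
block_size = 128
batch_size = 512     # value used in the training cell

num_windows = data_size - block_size                    # what CharDataset.__len__ returns
iters_per_epoch = math.ceil(num_windows / batch_size)   # assumes the last partial batch is kept

print(num_windows, iters_per_epoch)
```

The second number lines up with the 391 iterations shown in the training progress bar.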


In [8]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=2, n_head=2, n_embd=128)
                  #n_layer=8, n_head=8, n_embd=512) # downsized by XM for a 1-minute demo
model = GPT(mconf)

In [9]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
#tconf = TrainerConfig(max_epochs=2, batch_size=512, learning_rate=6e-4, # downsized by XM for a 1-minute demo
tconf = TrainerConfig(max_epochs=1, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      #num_workers=4) # downsized by XM for a 1-minute demo
                      num_workers=1)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 390: train loss 1.92737. lr 3.000943e-04: 100%|██████████| 391/391 [01:17<00:00,  5.04it/s]
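
The final learning rate of about 3.0e-4 is half the configured 6e-4. That is what one would expect if the trainer applies linear warmup followed by cosine decay over `final_tokens`, since `final_tokens` is sized for two epochs while only one is run. The sketch below reproduces that arithmetic. The schedule formula, including the 10% floor, is an assumption about what `mingpt/trainer.py` did at the time, and `lr_at` is a hypothetical helper, not part of the library.

```python
import math

def lr_at(tokens, base_lr, warmup_tokens, final_tokens):
    """Assumed schedule: linear warmup, then cosine decay floored at 10%."""
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

block_size = 128
num_windows = 200000 - block_size             # len(train_dataset)
tokens_seen = num_windows * block_size        # tokens processed in one epoch
final_tokens = 2 * num_windows * block_size   # as passed to TrainerConfig

lr = lr_at(tokens_seen, 6e-4, 512 * 20, final_tokens)
print("%e" % lr)
```

Under these assumptions the schedule is almost exactly halfway through its decay when training stops, so the multiplier is close to 0.5.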


In [10]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God!
COMINIUS:
What's me had be ceturys seats motch the dows. Than the sow, tursts,-
He cas he the back; faten sand the have arge se will. My, helch come have sen that mis the hou pone.

Them he when so, the peer: not to he brught. I think mee my will to shath beers,
Thalt thes: sto he word all bearrt his in at then but o of you,
Your that apppire, son feeace
Wen true to of my beant the word, both he sute at bear their is for
Made thir that, the think and word o mell bether.
Out wath what words your have the coms, a wich he cof hadsere thy pace
Thim, are's swere counss maccis,.
Second Say shim mes ase warch as ortthe boody thy
Trrught sprices, wall say that with that in he and the have
Auffir the ince hey, the to they pack ser,
Whes in shild th hom shall to an sto with cour of ine wen,
Then brow, as warrd yee he prartionsster
Sher brateng a fart my and is atend
My complenes.

SICINIUS:
Noh, hing sit.

SICINIUS:
A wish.

MENENIUS:
Sto to

Bet say sto wer tre or and harrk.

Stiz
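
The `temperature` and `top_k` arguments in the `sample` call control how each next character is drawn from the model's output scores. The sketch below shows the general technique in plain Python on a made-up score vector. It is not the code in `mingpt/utils.py`, and `sample_next` and `toy_scores` are hypothetical names. The loop also illustrates the train/test asymmetry described in the dataset docstring: generation needs one forward pass per new symbol.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from logits after temperature scaling and top-k filtering."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep only the k highest-scoring candidates (ties at the cutoff are kept too).
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # Softmax weights, with the maximum subtracted for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_scores(context):
    """Stand-in for a model's forward pass: fixed scores over 5 symbols."""
    return [2.0, 1.0, 0.5, -1.0, -3.0]

rng = random.Random(42)
generated = [0]
for _ in range(20):                      # one "forward pass" per new symbol
    generated.append(sample_next(toy_scores(generated), top_k=2, rng=rng))
print(generated)
```

With `top_k=2` only the two highest-scoring symbols can ever be drawn. Lowering `temperature` below 1.0 sharpens the distribution toward the top candidate, and raising it flattens the distribution.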

In [11]:
# well that was fun