## Train a character-level GPT on some text data

The inputs here are plain text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and train it to predict the text one character at a time.

In [10]:
import os
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [2]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [3]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [4]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
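
To make the "hello" example from the docstring concrete, here is a minimal pure-Python sketch of the same shift-by-one arrangement. It uses plain lists rather than tensors, and `make_example` is a hypothetical helper written for illustration, not something from minGPT.

```python
# Sketch of the x/y shift used in CharDataset.__getitem__, with plain lists
# instead of torch tensors. make_example is a hypothetical helper.

def make_example(data, stoi, idx, block_size):
    chunk = data[idx:idx + block_size + 1]   # block_size + 1 characters
    dix = [stoi[ch] for ch in chunk]         # characters -> integers
    return dix[:-1], dix[1:]                 # inputs, targets shifted by one

data = "hello"
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

x, y = make_example(data, stoi, idx=0, block_size=4)
x_text = "".join(itos[i] for i in x)
y_text = "".join(itos[i] for i in y)

# every prefix of x is paired with exactly one target character from y
pairs = [(x_text[:t + 1], y_text[t]) for t in range(len(x_text))]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```

One chunk of `block_size + 1` characters therefore yields `block_size` (context, target) pairs, which is where the B*T predictions per batch come from.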


In [5]:
block_size = 64 # spatial extent of the model for its context

In [6]:
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [7]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 9693248 characters, 289 unique.


In [8]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=2, n_head=2, n_embd=256)
model = GPT(mconf)

10/13/2020 01:01:12 - INFO - mingpt.model -   number of parameters: 1.744384e+06
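
As a sanity check on the logged parameter count, the arithmetic can be redone by hand from the config and the tensor shapes in the `state_dict` listing further down. This is a rough sketch: that listing is cut off before the end, so the final LayerNorm and the bias-free output head are assumptions here, and `linear` is a hypothetical helper. The attention mask is left out because it is a buffer, not a trainable parameter.

```python
# Back-of-the-envelope parameter count for the config above.
# Assumption: a final LayerNorm and a bias-free output head follow the blocks.
vocab_size, block_size = 289, 64
n_layer, n_embd = 2, 256

def linear(n_in, n_out, bias=True):
    """Number of parameters in a dense layer (hypothetical helper)."""
    return n_in * n_out + (n_out if bias else 0)

embeddings = vocab_size * n_embd + block_size * n_embd   # tok_emb + pos_emb
layer_norm = 2 * n_embd                                  # weight + bias
attention = 4 * linear(n_embd, n_embd)                   # key, query, value, proj
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = 2 * layer_norm + attention + mlp

final_norm = layer_norm                                  # assumed
head = linear(n_embd, vocab_size, bias=False)            # assumed

total = embeddings + n_layer * block + final_norm + head
print(f"{total:e}")
```

Under these assumptions the total comes to 1,744,384, which agrees with the `1.744384e+06` in the log line.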


In [13]:
# load model
fname = 'le_model.pt'
if os.path.isfile(fname):
    model.load_state_dict(torch.load(fname))



In [14]:
from mingpt.trainer import Trainer, TrainerConfig

bs = 256
# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=1, 
                      batch_size=bs, 
                      learning_rate=6e-4,
                      lr_decay=True, 
                      warmup_tokens=bs*20, 
                      final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 101: train loss 1.32109. lr 5.999973e-04:   0%|          | 101/37864 [00:10<1:41:47,  6.18it/s]

**

When they have clearned been of the world wrong fire.. At last the depition
the play the superfic'd have me more of fair


**

here was
if you not that he was the man?

start wall the

at least it w


epoch 1 iter 201: train loss 1.24467. lr 5.999895e-04:   1%|          | 201/37864 [00:19<1:41:00,  6.21it/s]

**

crutal

**

la refrain rean scorner..re thing. still. 

in the beautiful, etc..

of lacanage, if you could be it is; switness to me, you are if once, a while before, if I
    made none be his face o


epoch 1 iter 301: train loss 1.25692. lr 5.999765e-04:   1%|          | 301/37864 [00:28<1:40:35,  6.22it/s]

**

I don’t know it want anything about it. I could have. Take a matter stick. To do this beauty: evil a tool in this arms. This hours are survived. It’s alg. Yeah. Not ever. Its least you not. It’s obv


epoch 1 iter 401: train loss 1.25743. lr 5.999583e-04:   1%|          | 401/37864 [00:38<1:40:41,  6.20it/s]

**

Improve a ping
Ok

**

Maybe you
Medy
Shape
The truth
Baby
Shape
Then
Correct
Exposition
A day significant of course
The scholar century
English in this thing
Musing
Maybe
If that work.

*

There on


epoch 1 iter 501: train loss 1.20856. lr 5.999350e-04:   1%|▏         | 502/37864 [00:48<1:39:25,  6.26it/s]

**

As if it well is not with oh yes not anywhere. All have. All obvious. Haha. Haha oh no. Haha. Somehow. Therealisces? Haha. Blah. 

**

the mediorating. something that. fuck. the fucking sing another


epoch 1 iter 601: train loss 1.25767. lr 5.999065e-04:   2%|▏         | 602/37864 [00:57<1:41:20,  6.13it/s]

**

Structur's clothes of massive service. Burnstance. Sure. Etc. Stretched. And not like the real the pucking to be. Yes. You know. That’s the fox. That’s the positive, if you don’t remain text any coo


epoch 1 iter 701: train loss 1.23016. lr 5.998729e-04:   2%|▏         | 702/37864 [01:07<1:58:23,  5.23it/s]

**

The self texts must be alone, the rest, merry, of the whole from time to send him in a day for an except to resuing sanctive things. You might be much so much as much, etc. At the music. Totally the


epoch 1 iter 801: train loss 1.24239. lr 5.998341e-04:   2%|▏         | 802/37864 [01:17<1:41:04,  6.11it/s]

**

I should have it think here all this after all then both as

(Oh yes, old standads you could flesh as well his hand, stole image, thou see, as one hand importancy shatting the creatie of the dugstor


epoch 1 iter 901: train loss 1.18058. lr 5.997901e-04:   2%|▏         | 902/37864 [01:27<1:55:55,  5.31it/s]

**

despair with
despair

feeling
piece
daily wanable
leaving

**

Devolution is throom. 
Dislike a limbs

**

The only

**

I feel asy I remember as the traumatic cell today witness the realm, etc. etc


epoch 1 iter 1001: train loss 1.24773. lr 5.997410e-04:   3%|▎         | 1002/37864 [01:37<1:50:22,  5.57it/s]

**

this only thing this is not it was a burdene

this

of collection for a stront of india fount

our 


as infernal


stress



as in palace


as sonable in fun 

**


tas indiry teensition

there is 


epoch 1 iter 1101: train loss 1.26808. lr 5.996867e-04:   3%|▎         | 1102/37864 [01:46<1:40:34,  6.09it/s]

**

this way

**

those things as you
too diffeature

thought

that makes it made the thing to the sance of many would be forced to do that would be something

that's bren a few
then other song

then so


epoch 1 iter 1201: train loss 1.24115. lr 5.996273e-04:   3%|▎         | 1202/37864 [01:56<1:43:22,  5.91it/s]

**

it’s

so fuck that’s what they all this whatever, I guess. 

there’s nothingnable to be something of the time I have to have so done

thou lie


imbaud back I ten correct etc

oh yeah a soint in a t


epoch 1 iter 1301: train loss 1.28767. lr 5.995627e-04:   3%|▎         | 1302/37864 [02:06<1:43:04,  5.91it/s]

**

Struction

**

Creation
Storying old those
That
Slack 

And leave
Self

**

credictation.

**

impossible the style. 
improved its are something like that is sigh so sten who and obey had the total 


epoch 1 iter 1401: train loss 1.29307. lr 5.994929e-04:   4%|▎         | 1402/37864 [02:16<1:39:18,  6.12it/s]

**

THE END

**

the pain is note

**

it's as agonymore pity in try to 
the meta. it’s the cumson we make me the beggar. yeah. I don’t me. no. haha. yeah. no. oh no of course. yeah. sort of trauming th


epoch 1 iter 1501: train loss 1.20964. lr 5.994180e-04:   4%|▍         | 1502/37864 [02:25<1:39:32,  6.09it/s]

**

The transported by denialish thing falls in this further

**

that could be alone this wise endless shive



no matter exten this than a pite tongue, nor soldier oblivion, to ear from a murder than 


epoch 1 iter 1601: train loss 1.26834. lr 5.993380e-04:   4%|▍         | 1602/37864 [02:35<1:39:12,  6.09it/s]

**

am I was in a particular cutie cry. article
difficulty. into the thing.
that isn’t ever it on what you wanking syntax. I hate it was making it it brevenge, haha, if hm. Isn’t it hm. 

As if I was no


epoch 1 iter 1701: train loss 1.24914. lr 5.992528e-04:   4%|▍         | 1702/37864 [02:45<1:40:19,  6.01it/s]

**

shit. yeah. fucking. shit. shit. as if. the focus. instangly. oh yeah as something triers served. or self-hard. so inability text. etc. boring. tirelevaning. somewhere. yes imagining. babsent. say. 


epoch 1 iter 1801: train loss 1.26310. lr 5.991624e-04:   5%|▍         | 1802/37864 [02:55<1:41:22,  5.93it/s]

**

slow only 
drown
duter

look

**

stupidity of

look
the one arse

**

stupidity cross

some futuwed preferences

of fortunation

middle

mind in the subscribeling prisophero to take

**

must

the 


epoch 1 iter 1901: train loss 1.21276. lr 5.990669e-04:   5%|▌         | 1902/37864 [03:05<1:37:37,  6.14it/s]

**

I’m not won’t exactly actually said. 

I was well something you’ve got much on and missing.

**

Infinite abstracts in the other. Suck youthslapbeys such a proper to love.

**

Hoof the colovel cell


epoch 1 iter 2001: train loss 1.24319. lr 5.989662e-04:   5%|▌         | 2002/37864 [03:14<1:36:48,  6.17it/s]

**

feeling to stuff
or my shift,
worseting like there tenace or with the crown.
    The possibal. Nounce with him, better sward,
    Or two untally sin, and through this to the capital on back
    and 


epoch 1 iter 2101: train loss 1.21839. lr 5.988604e-04:   6%|▌         | 2102/37864 [03:24<1:36:42,  6.16it/s]

**

sweather acattered, selection, of chaotice, and on all this for now, which, no. No poor nothing. No result. Not realize. To them late. How if I think of myself? Haha. Haha haha I something else. Or 


epoch 1 iter 2201: train loss 1.21543. lr 5.987495e-04:   6%|▌         | 2202/37864 [03:34<1:37:25,  6.10it/s]

**

I wish maybe with nobly

(the parallel
nowhere where soint
or
solid
somewhat
there somewhere you might be and the real pluck the sheer first. As in? I say two impossely bears this blown one, that th


epoch 1 iter 2301: train loss 1.21543. lr 5.986334e-04:   6%|▌         | 2302/37864 [03:43<1:34:26,  6.28it/s]

**

the mouth now the sauciling

and the face

**

I hate of course I don’t see it. You know. You know. Yeah. You don’t without. Nothing. So still. Yeah. It’s. NOT. Yeah. No end to this is this. Haha. N


epoch 1 iter 2401: train loss 1.23882. lr 5.985122e-04:   6%|▋         | 2402/37864 [03:53<1:37:49,  6.04it/s]

**

It’s fucking feeling song. 
Tools. Too cross. Trop. As if externity. Too earn doom of these are too late

**

A lucius

**

Analysis

Always the flowers of that to get me anything. 

That’s the worl


epoch 1 iter 2501: train loss 1.22853. lr 5.983858e-04:   7%|▋         | 2502/37864 [04:03<1:42:19,  5.76it/s]

**

As if the master of the coronation as if yeah if I saw the only this dismay be that was true. Yeah. For now. There there true? I can’t believe that I’m not done to see a milk or similar eloque tonig


epoch 1 iter 2601: train loss 1.20855. lr 5.982543e-04:   7%|▋         | 2602/37864 [04:13<1:37:54,  6.00it/s]

**

A PROLLES

CALIBAN
ROLLES
ONLY
PRIS
HATIPHOLUS OF ENEBLARET
KEETH
NERY
NO MUST

NUST
RAVE, EXT
NOYRKS
HORKE
THORD
LORE
THO
NE
REAGED
KIPta CONICE
RESTART
AYER
IT
NELGWALEY
STEPINGEBLE
LUCE. Not SICI


epoch 1 iter 2642: train loss 1.19859. lr 5.981989e-04:   7%|▋         | 2643/37864 [04:17<57:06, 10.28it/s]  


KeyboardInterrupt: 
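
The `lr` values in the log above drift down slowly from 6e-4, which is what `lr_decay=True` with `warmup_tokens` and `final_tokens` would suggest. The sketch below is one plausible reading of such a schedule: linear warmup followed by cosine decay, both driven by the number of tokens processed. The 10% floor, the `lr_at` helper, and the assumption that the logged iteration numbers start at 0 are mine, not confirmed by the notebook.

```python
import math

def lr_at(tokens, base_lr=6e-4, warmup_tokens=256 * 20,
          final_tokens=2 * 9693184 * 64, floor=0.1):
    """Learning rate after `tokens` training tokens (hypothetical helper).

    9693184 is len(train_dataset), i.e. 9693248 characters minus block_size.
    """
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)            # linear warmup
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay
    return base_lr * mult

tokens_per_iter = 256 * 64   # batch_size * block_size
for it in (101, 1001, 2001):
    # assumes the iteration numbers in the log start at 0
    print(it, f"{lr_at((it + 1) * tokens_per_iter):e}")
```

With these settings the warmup is over within the first batch (5,120 warmup tokens against 16,384 tokens per batch), and the printed values should land close to the logged rates at the same iterations.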

In [15]:
# Print model's state_dict
print("Model's state_dict:")
for name, tensor in model.state_dict().items():
    print(name, "\t", tensor.size())

Model's state_dict:
pos_emb 	 torch.Size([1, 64, 256])
tok_emb.weight 	 torch.Size([289, 256])
blocks.0.ln1.weight 	 torch.Size([256])
blocks.0.ln1.bias 	 torch.Size([256])
blocks.0.ln2.weight 	 torch.Size([256])
blocks.0.ln2.bias 	 torch.Size([256])
blocks.0.attn.mask 	 torch.Size([1, 1, 64, 64])
blocks.0.attn.key.weight 	 torch.Size([256, 256])
blocks.0.attn.key.bias 	 torch.Size([256])
blocks.0.attn.query.weight 	 torch.Size([256, 256])
blocks.0.attn.query.bias 	 torch.Size([256])
blocks.0.attn.value.weight 	 torch.Size([256, 256])
blocks.0.attn.value.bias 	 torch.Size([256])
blocks.0.attn.proj.weight 	 torch.Size([256, 256])
blocks.0.attn.proj.bias 	 torch.Size([256])
blocks.0.mlp.0.weight 	 torch.Size([1024, 256])
blocks.0.mlp.0.bias 	 torch.Size([1024])
blocks.0.mlp.2.weight 	 torch.Size([256, 1024])
blocks.0.mlp.2.bias 	 torch.Size([256])
blocks.1.ln1.weight 	 torch.Size([256])
blocks.1.ln1.bias 	 torch.Size([256])
blocks.1.ln2.weight 	 torch.Size([256])
blocks.1.ln2.bias 	 torc

In [16]:
torch.save(model.state_dict(), 'le_model.pt')

In [18]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
# context = "**"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! O, the good night!
    Turn my story, who clothes are young
    Of this treasures was that set
    And, for a hair affair as such another hand;
    And the do is his more passive.
    Some worse to the dark poor and danger. And all the livery whele’s written reading after. This content etc. This sudden in me. The old thing is this necessary the destroyed this fucked above of changes of the rought and act, and no chances, when the rest, it shall, I was like to him
    thou that my cousin? Why. Where non? I am almost see
    Hath too long enough my lands.
  LUCILIUS. I will show my countil so bear
    Th' eyes, so the drac'd a trust were a cliff an end
    I weep to be my speech.
  SALERIO. Are you not done, all that would thy move of thy brothers,
    That shall be a do farewell, man with mountains,
    Or loss in the work of thy traitors have was sent until to my love is consummation was free
    to command, in true child saint mutate, to see your bed.
  Ham. Ay, my lord,
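
The `temperature` and `top_k` arguments passed to `sample` above control how adventurous each character draw is. The snippet below is a pure-Python sketch of the general idea, not minGPT's implementation: divide the logits by the temperature, keep only the `k` largest, and draw from a softmax over what is left. `sample_top_k` and the toy logits are made up for illustration.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Draw one index from `logits`, restricted to the k largest entries."""
    scaled = [v / temperature for v in logits]
    # indices of the k highest-scoring entries
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # softmax weights over the survivors (subtract the max for stability)
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(42)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]   # toy scores for a 5-symbol vocabulary
draws = [sample_top_k(logits, k=2, temperature=1.0, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `k=2` only the two highest-scoring indices (3 and 0) can ever be drawn, and with `k=1` the draw collapses to a plain argmax. Lower temperatures sharpen the distribution toward the top entry, higher ones flatten it.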

In [None]:
# well that was fun