
## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT. This copy trains on the Machado de Assis corpus (Portuguese) distributed with NLTK.

In [None]:
import nltk


In [None]:
nltk.download('machado')

[nltk_data] Downloading package machado to /root/nltk_data...


True

In [None]:
from nltk.corpus import machado

In [None]:
text=machado.raw()

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  14840456


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

Conto, Contos Fluminenses, 1870

Contos Fluminenses

Texto-fonte:

Obra Completa, Machado de Assis, vol. II,

Rio de Janeiro: Nova Aguilar, 1994.

Publicado originalmente pela
Editora Garnier, Rio de Janeiro, em 1870.

ÍNDICE

MISS DOLLAR

LUÍS
SOARES

A MULHER DE
PRETO

O
SEGREDO DE AUGUSTA

CONFISSÕES DE UMA VIÚVA MOÇA

LINHA
RETA E LINHA CURVA

FREI
SIMÃO

MISS
DOLLAR

ÍNDICE

Capítulo Primeiro

Capítulo II

Capítulo iii

Capítulo iv

Capítulo v

Capítulo vI

Capítulo vII

CAPÍTULO VIII

CAPÍTULO PRIMEIRO

Era conveniente ao romance que o leitor
ficasse muito tempo sem saber quem era Miss Dollar. Mas por outro lado,
sem a apresentação de Miss Dollar, seria o autor obrigado a longas
digressões, que encheriam o papel sem adiantar a ação. Não há hesitação
possível: vou apresentar-lhes Miss Dollar.

Se o leitor é rapaz e dado ao gênio
melancólico, imagina que Miss Dollar é uma inglesa pálida e delgada,
escassa de carnes e de sangue, abrindo à flor do rosto dois grandes olhos azuis
e sac

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

	
 !"$%&'()*+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz ¡§ª«­°´º»½¿ÀÁÂÃÇÈÉÊËÍÓÔÕÚÛÜàáâãäçèéêëìíîïñòóôõöùúûü
147


In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("olá isso é um teste"))
print(decode(encode("olá isso é um teste")))

[75, 72, 124, 2, 69, 79, 79, 75, 2, 130, 2, 81, 73, 2, 80, 65, 79, 80, 65]
olá isso é um teste
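
The same character-level scheme can be sketched on a tiny stand-in corpus to show one property worth keeping in mind: `encode` only knows characters that occurred in the text the vocabulary was built from. The names `sample_text`, `encode_sample`, and `decode_sample` are illustrative and not part of the notebook, and the short string stands in for the full Machado corpus.

```python
# Illustrative stand-in corpus (assumption): the notebook builds its vocabulary
# from the full Machado text, not from this short string.
sample_text = "olá isso é um teste"

sample_chars = sorted(set(sample_text))                      # unique characters, sorted by code point
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}   # character -> integer id
sample_itos = {i: ch for ch, i in sample_stoi.items()}       # integer id -> character

def encode_sample(s):
    """Map a string to a list of integer ids (raises KeyError on unseen characters)."""
    return [sample_stoi[c] for c in s]

def decode_sample(ids):
    """Map a list of integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)

round_trip = decode_sample(encode_sample("teste"))

# A character that never occurred in the corpus has no id.
try:
    encode_sample("x")
    unseen_ok = True
except KeyError:
    unseen_ok = False
```

Any prompt later passed to the trained model has to be restricted to characters in the vocabulary, or mapped to a fallback first.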


In [None]:
itos

{0: '\t',
 1: '\n',
 2: ' ',
 3: '!',
 4: '"',
 5: '$',
 6: '%',
 7: '&',
 8: "'",
 9: '(',
 10: ')',
 11: '*',
 12: '+',
 13: ',',
 14: '-',
 15: '.',
 16: '/',
 17: '0',
 18: '1',
 19: '2',
 20: '3',
 21: '4',
 22: '5',
 23: '6',
 24: '7',
 25: '8',
 26: '9',
 27: ':',
 28: ';',
 29: '=',
 30: '?',
 31: 'A',
 32: 'B',
 33: 'C',
 34: 'D',
 35: 'E',
 36: 'F',
 37: 'G',
 38: 'H',
 39: 'I',
 40: 'J',
 41: 'K',
 42: 'L',
 43: 'M',
 44: 'N',
 45: 'O',
 46: 'P',
 47: 'Q',
 48: 'R',
 49: 'S',
 50: 'T',
 51: 'U',
 52: 'V',
 53: 'W',
 54: 'X',
 55: 'Y',
 56: 'Z',
 57: '[',
 58: ']',
 59: '_',
 60: '`',
 61: 'a',
 62: 'b',
 63: 'c',
 64: 'd',
 65: 'e',
 66: 'f',
 67: 'g',
 68: 'h',
 69: 'i',
 70: 'j',
 71: 'k',
 72: 'l',
 73: 'm',
 74: 'n',
 75: 'o',
 76: 'p',
 77: 'q',
 78: 'r',
 79: 's',
 80: 't',
 81: 'u',
 82: 'v',
 83: 'w',
 84: 'x',
 85: 'y',
 86: 'z',
 87: '\x85',
 88: '\x91',
 89: '\x92',
 90: '\x93',
 91: '\x94',
 92: '\x96',
 93: '\x97',
 94: '\x9c',
 95: '\xa0',
 96: '¡',
 97: '§',
 

In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([14840456]) torch.int64
tensor([ 33,  75,  74,  80,  75,  13,   2,  33,  75,  74,  80,  75,  79,   2,
         36,  72,  81,  73,  69,  74,  65,  74,  79,  65,  79,  13,   2,  18,
         25,  24,  17,   1,   1,  33,  75,  74,  80,  75,  79,   2,  36,  72,
         81,  73,  69,  74,  65,  74,  79,  65,  79,   1,   1,  50,  65,  84,
         80,  75,  14,  66,  75,  74,  80,  65,  27,   1,   1,  45,  62,  78,
         61,   2,  33,  75,  73,  76,  72,  65,  80,  61,  13,   2,  43,  61,
         63,  68,  61,  64,  75,   2,  64,  65,   2,  31,  79,  79,  69,  79,
         13,   2,  82,  75,  72,  15,   2,  39,  39,  13,   1,   1,  48,  69,
         75,   2,  64,  65,   2,  40,  61,  74,  65,  69,  78,  75,  27,   2,
         44,  75,  82,  61,   2,  31,  67,  81,  69,  72,  61,  78,  13,   2,
         18,  26,  26,  21,  15,   1,   1,  46,  81,  62,  72,  69,  63,  61,
         64,  75,   2,  75,  78,  69,  67,  69,  74,  61,  72,  73,  65,  74,
         80,  65,   2,  76,  

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [None]:
block_size = 8
train_data[:block_size+1]

tensor([33, 75, 74, 80, 75, 13,  2, 33, 75])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([33]) the target: 75
when input is tensor([33, 75]) the target: 74
when input is tensor([33, 75, 74]) the target: 80
when input is tensor([33, 75, 74, 80]) the target: 75
when input is tensor([33, 75, 74, 80, 75]) the target: 13
when input is tensor([33, 75, 74, 80, 75, 13]) the target: 2
when input is tensor([33, 75, 74, 80, 75, 13,  2]) the target: 33
when input is tensor([33, 75, 74, 80, 75, 13,  2, 33]) the target: 75
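
The same eight training examples are easier to read as characters. The sketch below decodes the ids from the output above with a subset of the `itos` table printed earlier; `chunk`, `itos_subset`, and `pairs` are names introduced here for illustration only.

```python
# Token ids copied from the output above (block_size + 1 = 9 ids).
chunk = [33, 75, 74, 80, 75, 13, 2, 33, 75]
# Subset of the itos table printed earlier in the notebook.
itos_subset = {2: ' ', 13: ',', 33: 'C', 74: 'n', 75: 'o', 80: 't'}

block_size = len(chunk) - 1
pairs = []
for t in range(block_size):
    context = "".join(itos_subset[i] for i in chunk[:t + 1])  # first t+1 characters
    target = itos_subset[chunk[t + 1]]                        # the character that follows
    pairs.append((context, target))

for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

One chunk of `block_size + 1` tokens therefore yields `block_size` training examples, with context lengths from 1 up to `block_size`.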


In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[76, 78, 65, 65, 74, 64, 65, 81],
        [78, 73, 69, 80, 65, 13,  2, 82],
        [65, 67, 75, 63, 69, 61, 74, 80],
        [75, 73, 65, 81,  2, 65,  2, 75]])
targets:
torch.Size([4, 8])
tensor([[78, 65, 65, 74, 64, 65, 81,  2],
        [73, 69, 80, 65, 13,  2, 82, 69],
        [67, 75, 63, 69, 61, 74, 80, 65],
        [73, 65, 81,  2, 65,  2, 75, 72]])
----
when input is [76] the target: 78
when input is [76, 78] the target: 65
when input is [76, 78, 65] the target: 65
when input is [76, 78, 65, 65] the target: 74
when input is [76, 78, 65, 65, 74] the target: 64
when input is [76, 78, 65, 65, 74, 64] the target: 65
when input is [76, 78, 65, 65, 74, 64, 65] the target: 81
when input is [76, 78, 65, 65, 74, 64, 65, 81] the target: 2
when input is [78] the target: 73
when input is [78, 73] the target: 69
when input is [78, 73, 69] the target: 80
when input is [78, 73, 69, 80] the target: 65
when input is [78, 73, 69, 80, 65] the target: 13
when inpu

In [None]:
print(xb) # our input to the transformer

tensor([[76, 78, 65, 65, 74, 64, 65, 81],
        [78, 73, 69, 80, 65, 13,  2, 82],
        [65, 67, 75, 63, 69, 61, 74, 80],
        [75, 73, 65, 81,  2, 65,  2, 75]])
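
A quick bounds check on `get_batch`, as a pure-Python sketch with toy sizes (the numbers 20 and 8 are assumptions chosen for illustration). `torch.randint` treats its upper bound as exclusive, so the largest start index is `len(data) - block_size - 1`, which leaves room for the targets shifted by one.

```python
data_len, block_size = 20, 8           # toy sizes, for illustration only
data = list(range(data_len))

high = data_len - block_size           # exclusive upper bound passed to torch.randint
worst = high - 1                       # largest start index randint can return

x = data[worst:worst + block_size]             # inputs for the last possible window
y = data[worst + 1:worst + block_size + 1]     # targets, shifted right by one position
```

Even for the last possible window, both slices have the full `block_size` length and the final target is the last token of the split.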


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            print(idx.shape) # debug: the sequence grows by one token per step
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([32, 147])
tensor(5.6646, grad_fn=<NllLossBackward0>)
torch.Size([1, 2])
torch.Size([1, 3])
...
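
The initial loss of 5.6646 printed above can be compared with a simple baseline: a model that assigns equal probability to all 147 characters has cross-entropy ln(147) ≈ 4.99. The sketch below only does that arithmetic; the variable names are illustrative.

```python
import math

vocab_size = 147                      # from the notebook output above
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess, in nats
observed_loss = 5.6646                # initial loss printed above
gap = observed_loss - uniform_loss    # how far the untrained model is from the uniform baseline

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"gap to observed:  {gap:.4f}")
```

The untrained model is worse than uniform because `nn.Embedding` initializes its weights from a standard normal distribution, so the initial logits are random rather than flat.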

In [None]:
m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)

torch.Size([1, 2])
torch.Size([1, 3])
...

tensor([[  0, 136,  15,  96,  88,  96,  77,  34,  11,  83,  32, 108,  13,  14,
         124,   6, 139, 124,  61,   0, 132, 144,  60,  17, 113,  33,  82,  22,
          96, 130, 106,  45,  29,  76,  74,  51, 111, 116, 103,   4, 104, 110,
         127, 125,   1, 113,  55, 100,  13, 112, 145, 134,  40,  31,  80,  13,
         112, 124,  45,  60, 113,  27,  61,  69,  58,  57, 126,  55,   2,  84,
         126,  25, 131,  97, 100,  19,  56,  42, 104,   5, 120, 100, 139, 139,
           3, 110,  97,  81,  97,  39,  22,  79,  28, 106,   3,  37, 140,  56,
          10, 105,  66]])

In [None]:
idx = torch.zeros((1, 1), dtype=torch.long)

In [None]:
logits, loss = m.forward(idx)

In [None]:
logits.shape

torch.Size([1, 1, 147])

In [None]:
logits = logits[:, -1, :] # becomes (B, C)

In [None]:
logits.shape

torch.Size([1, 147])

In [None]:
probs = F.softmax(logits, dim=-1) # (B, C)

In [None]:
probs

tensor([[0.0043, 0.0033, 0.0025, 0.0014, 0.0067, 0.0037, 0.0093, 0.0038, 0.0051,
         0.0115, 0.0009, 0.0022, 0.0045, 0.0028, 0.0014, 0.0168, 0.0138, 0.0031,
         0.0048, 0.0094, 0.0005, 0.0059, 0.0159, 0.0065, 0.0041, 0.0008, 0.0011,
         0.0026, 0.0056, 0.0016, 0.0165, 0.0441, 0.0018, 0.0028, 0.0099, 0.0041,
         0.0042, 0.0112, 0.0011, 0.0027, 0.0022, 0.0014, 0.0062, 0.0008, 0.0011,
         0.0064, 0.0020, 0.0018, 0.0186, 0.0016, 0.0139, 0.0027, 0.0008, 0.0294,
         0.0569, 0.0006, 0.0153, 0.0008, 0.0082, 0.0029, 0.0078, 0.0166, 0.0179,
         0.0024, 0.0016, 0.0065, 0.0034, 0.0034, 0.0022, 0.0057, 0.0028, 0.0012,
         0.0267, 0.0021, 0.0045, 0.0072, 0.0009, 0.0089, 0.0041, 0.0045, 0.0433,
         0.0011, 0.0099, 0.0063, 0.0007, 0.0010, 0.0072, 0.0029, 0.0051, 0.0025,
         0.0058, 0.0017, 0.0024, 0.0023, 0.0013, 0.0024, 0.0007, 0.0009, 0.0030,
         0.0094, 0.0037, 0.0017, 0.0027, 0.0036, 0.0030, 0.0031, 0.0048, 0.0143,
         0.0040, 0.0456, 0.0

In [None]:
torch.argmax(probs).view([1,1])

tensor([[54]])

In [None]:
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

In [None]:
idx_next.shape

torch.Size([1, 1])

In [None]:
idx_next

tensor([[80]])

In [None]:
probs[0,idx_next]

tensor([[0.0433]], grad_fn=<IndexBackward0>)

In [None]:
torch.argmax(probs,dim=1)

tensor([54])
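
The cells above pick the next token in two ways: `torch.argmax` returns the single most likely index (54), while `torch.multinomial` draws an index at random in proportion to its probability (80 in this run). A minimal sketch of the difference, using a toy three-class distribution that is an assumption for illustration rather than the model's 147-way output:

```python
import torch

torch.manual_seed(0)
probs = torch.tensor([[0.1, 0.7, 0.2]])   # toy (B=1, C=3) distribution, illustrative only

# Greedy choice: the same index every time.
greedy = torch.argmax(probs, dim=-1, keepdim=True)   # shape (B, 1)

# Sampling: draw many times and measure how often each index appears.
samples = torch.multinomial(probs, num_samples=10000, replacement=True)  # shape (B, 10000)
freq = torch.bincount(samples[0], minlength=3).float() / samples.shape[1]

print(greedy)
print(freq)   # empirical frequencies, close to probs[0]
```

Greedy decoding is deterministic and tends to repeat itself; sampling keeps the variety of the learned distribution.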

In [None]:
# get the predictions
logits, loss = m.forward(idx)
# focus only on the last time step
logits = logits[:, -1, :] # becomes (B, C)
# apply softmax to get probabilities
probs = F.softmax(logits, dim=-1) # (B, C)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
# append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32
for steps in range(100): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


5.3323235511779785


In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

torch.Size([1, 2])
torch.Size([1, 3])
...

## The mathematical trick in self-attention
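
The section title most likely refers to the lower-triangular masking that `Head.forward` uses further down (`self.tril`, `masked_fill`, `softmax`); that reading is an assumption. Under it, the trick is that a masked softmax followed by a matrix multiply computes, for every position, a weighted average over that position and the ones before it. The sketch uses all-zero scores, so the weights are uniform and the result can be checked against an explicit loop. The sizes `T = 4` and `C = 2` are illustrative.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
T, C = 4, 2                      # toy sequence length and channel count (illustrative)
x = torch.randn(T, C)

tril = torch.tril(torch.ones(T, T))                            # 1s on and below the diagonal
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))  # future positions get -inf
wei = F.softmax(wei, dim=-1)     # row t is uniform over positions 0..t, zero elsewhere
out = wei @ x                    # (T, T) @ (T, C) -> (T, C)

# Reference: average of the first t+1 rows, computed one position at a time.
ref = torch.stack([x[:t + 1].mean(0) for t in range(T)])
```

In a real attention head the zeros are replaced by query-key scores, so the average becomes data-dependent, but the masking and the matrix multiply stay the same.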

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # per-example mean over the feature dimension
    xvar = x.var(1, keepdim=True) # per-example variance over the feature dimension
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize each row to zero mean, unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])
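
A short check of what this normalization does, and how it relates to `nn.LayerNorm`, which the full model below uses. One detail the sketch highlights: `Tensor.var` applies Bessel's correction by default (divides by N-1), whereas `nn.LayerNorm` uses the biased estimator (divides by N), so the hand-written class matches the built-in layer only approximately. Variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
x = torch.randn(32, 100)   # same shape as in the cell above
eps = 1e-5

mean = x.mean(1, keepdim=True)
var_unbiased = x.var(1, keepdim=True)                  # default: divides by N-1
var_biased = x.var(1, keepdim=True, unbiased=False)    # divides by N

manual_unbiased = (x - mean) / torch.sqrt(var_unbiased + eps)  # what LayerNorm1d computes
manual_biased = (x - mean) / torch.sqrt(var_biased + eps)      # what nn.LayerNorm computes

with torch.no_grad():
    reference = nn.LayerNorm(100, eps=eps)(x)   # freshly initialized: weight=1, bias=0

print(manual_unbiased.mean(1)[:3], manual_unbiased.std(1)[:3])  # per-row mean ~0, std ~1
```

With 100 features the two variants differ by a factor of about sqrt(99/100), which is small but large enough to fail a strict equality check.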

### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

import nltk
nltk.download('machado')
from nltk.corpus import machado
text=machado.raw()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# GPT-style Transformer language model (the class keeps its earlier bigram name)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings feed the Transformer blocks; lm_head maps the result to logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

    def generate_final(self, idx, max_new_tokens):
        # greedy variant of generate: always append the most likely next token
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # pick the most likely token for each sequence (no sampling)
            idx_next = torch.argmax(probs, dim=-1, keepdim=True) # (B, 1)
            # append chosen index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


[nltk_data] Downloading package machado to /root/nltk_data...
[nltk_data]   Package machado is already up-to-date!


0.220307 M parameters
step 0: train loss 5.2092, val loss 5.2124
step 100: train loss 2.6881, val loss 2.8650
step 200: train loss 2.5287, val loss 2.6908
step 300: train loss 2.4631, val loss 2.6037
step 400: train loss 2.4085, val loss 2.5562
step 500: train loss 2.3622, val loss 2.5132
step 600: train loss 2.2941, val loss 2.4093
step 700: train loss 2.2577, val loss 2.3856
step 800: train loss 2.2295, val loss 2.3596
step 900: train loss 2.1980, val loss 2.3245
step 1000: train loss 2.1836, val loss 2.2906
step 1100: train loss 2.1533, val loss 2.2548
step 1200: train loss 2.1311, val loss 2.2709
step 1300: train loss 2.1132, val loss 2.2251
step 1400: train loss 2.0820, val loss 2.1964
step 1500: train loss 2.0621, val loss 2.1777
step 1600: train loss 2.0427, val loss 2.1612
step 1700: train loss 2.0286, val loss 2.1643
step 1800: train loss 2.0177, val loss 2.1460
step 1900: train loss 2.0086, val loss 2.1315
step 2000: train loss 2.0069, val loss 2.1302
step 2100: train loss 1.
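
The `0.220307 M parameters` figure printed above can be reproduced by hand from the hyperparameters, which is a useful sanity check on the architecture. The sketch below is plain arithmetic; the helper `linear` is a name introduced here, not part of the notebook.

```python
vocab_size, n_embd, block_size = 147, 64, 32   # values from the notebook
n_head, n_layer = 4, 4
head_size = n_embd // n_head

def linear(n_in, n_out, bias=True):
    """Parameter count of an nn.Linear layer."""
    return n_in * n_out + (n_out if bias else 0)

# token embedding table + position embedding table
embeddings = vocab_size * n_embd + block_size * n_embd
# key, query, value (no bias) for each head, plus the output projection
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
# two linear layers with a 4x hidden expansion
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
# ln1 and ln2, each with a weight and a bias vector
layer_norms = 2 * (2 * n_embd)

per_block = attention + feed_forward + layer_norms
# blocks + final layer norm (ln_f) + lm_head
total = embeddings + n_layer * per_block + 2 * n_embd + linear(n_embd, vocab_size)

print(total, total / 1e6, "M parameters")
```

The `tril` mask does not appear in the count because it is registered as a buffer, not a parameter.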

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

	ilgusto.

Senguém, seja daquela histene umpíessesse pileva
  vustinha. No fesa
dela, mas sabubas senhus
elentro a queria,

A ioHRAX. BBRoura os capazeiros
esterueus
impulsos dizên e própio
de capaz aqueltes felizsa! Parecebito, filhavam entrão se com um fexsaça sunhã. A veita Calagência desale viveral e cripiossões da azejai,

Reu V'......                       
     moçe o
seguíla aprosindido
a casaz valhem aindâmito foi
Talenção, não tamberam digou-mendo
dorado. Talizara as tusaís e outroas
os compunsos. EÚlem de idor com algumas penhamos, é dito mansta, pobulo envinte.

A violte de uma irmã correlou ao coleiras, sônito deselizo nos exenstas coitas, como meu, não morreu; aindar ele as iervamenee,
que oijante a assado. Ossantes, não a repelinam ser mulher os
  nossos nadas dincluir a baça. Iso aliber-me que crgue lemantos, não essa de às espatites. Mari de sógreira ná vhciência, a noiteina vivolta que e conchei um doimente, ou dois tambos do éscientes vinte mais
no láprio; nem espíri

In [None]:
max_iters = 50000

In [None]:
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 1.6760, val loss 1.8749
step 100: train loss 1.6862, val loss 1.8575
step 200: train loss 1.6822, val loss 1.8654
step 300: train loss 1.6803, val loss 1.8529
step 400: train loss 1.6837, val loss 1.8590
step 500: train loss 1.6829, val loss 1.8627
step 600: train loss 1.6713, val loss 1.8641
step 700: train loss 1.6750, val loss 1.8599
step 800: train loss 1.6719, val loss 1.8505
step 900: train loss 1.6655, val loss 1.8547
step 1000: train loss 1.6749, val loss 1.8564
step 1100: train loss 1.6637, val loss 1.8518
step 1200: train loss 1.6674, val loss 1.8434
step 1300: train loss 1.6783, val loss 1.8522
step 1400: train loss 1.6698, val loss 1.8516
step 1500: train loss 1.6561, val loss 1.8450
step 1600: train loss 1.6723, val loss 1.8495
step 1700: train loss 1.6575, val loss 1.8162
step 1800: train loss 1.6630, val loss 1.8468
step 1900: train loss 1.6620, val loss 1.8400
step 2000: train loss 1.6638, val loss 1.8440
step 2100: train loss 1.6574, val loss 1.8358


KeyboardInterrupt: ignored

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

	idade. Um vinte do passa coisa, na corresta

     Tomou por pobre há se trouxer o cas que se pode creio? Ele se avi, os prezinos do
promenho. Número maiorino para o deslimar? sejanve talentos complandim, a acadendendo aos nossos
troneiros, andas mentraram por carafigar o próprio disponto que os
  sentos media as socarei. Toda a sua pessoa dos mesmos repetismo, ou antes dos eliva-me foram os
que ossos da pronola
  para o femponte do astordarária ao pé das mãos feitos do
mérico. Um sabe bonção, disse nada moça. 41 se desejos sogravam.
Na meia coisa. Era
misterioso.

No vento cacrifical e deles de um inspirar a minha poliz política e palavra belo do coração de Caixo. Que vai saiu
de algum com a esperar do alvanião. Recreitaramente tinham ao que estava mesmo
  sucidência. Talvez que fala ao feliz do ouvir no conversamento. Vão
incônciplicária, que leva bonista com
que fala que bem de ventado a quarta da carto toda
  se rehtendia de um origentexo que não é grande, um caso deprodimento é e 