## Contents
1. Load (our) GPT model from pre-trained weights
2. Sample from the model

In [1]:
import torch
from torch.nn import functional as F
import tiktoken
from gpt2 import GPT # our GPT class

In [2]:
model = GPT.from_pretrained("gpt2")
print(f'Model object type: {type(model)}\n')
print(model)

loading weights from pretrained gpt: gpt2
Model object type: <class 'gpt2.GPT'>

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='tanh')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
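
As a sanity check on the printout above, the layer shapes can be turned into a parameter count. This is a plain-Python sketch, not part of the notebook: the variable names are illustrative, the LayerNorm count assumes one weight and one bias per feature, and the "tied" total assumes `lm_head` shares its weight matrix with `wte` (as in the original GPT-2), which the printout alone does not show.

```python
# Shapes read off the printed module tree (illustrative names, not taken from gpt2.py).
vocab_size, block_size, n_embd, n_layer = 50257, 1024, 768, 12

def linear(n_in, n_out, bias=True):
    """Parameter count of a Linear layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)

# Assumption: each LayerNorm holds one weight and one bias per feature.
layer_norm = 2 * n_embd

per_block = (
    layer_norm                      # ln_1
    + linear(n_embd, 3 * n_embd)    # attn.c_attn
    + linear(n_embd, n_embd)        # attn.c_proj
    + layer_norm                    # ln_2
    + linear(n_embd, 4 * n_embd)    # mlp.c_fc
    + linear(4 * n_embd, n_embd)    # mlp.c_proj
)

embeddings = vocab_size * n_embd + block_size * n_embd  # wte + wpe
lm_head = linear(n_embd, vocab_size, bias=False)

# Assumption: lm_head reuses the wte matrix, so it adds no parameters of its own.
total_tied = embeddings + n_layer * per_block + layer_norm  # last term is ln_f
total_untied = total_tied + lm_head

print(f"per block:      {per_block:,}")
print(f"total (tied):   {total_tied:,}")
print(f"total (untied): {total_untied:,}")
```

Under the tying assumption the total comes to roughly 124M parameters, the size usually quoted for the smallest GPT-2 checkpoint.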


In [3]:
model.eval(); # good practice even if you do not use dropout or batchnorm
model.to('cuda');

In [4]:
num_return_sequences = 5
max_length = 30

In [5]:
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model,")
tokens = torch.tensor(tokens, dtype=torch.long) # (8, )
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1) # (5, 8)
print(tokens) # this will be the first input to the auto-regressive model

tensor([[15496,    11,   314,  1101,   257,  3303,  2746,    11],
        [15496,    11,   314,  1101,   257,  3303,  2746,    11],
        [15496,    11,   314,  1101,   257,  3303,  2746,    11],
        [15496,    11,   314,  1101,   257,  3303,  2746,    11],
        [15496,    11,   314,  1101,   257,  3303,  2746,    11]])


In [6]:
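torch.manual_seed(42)
torch.cuda.manual_seed(42)
x = tokens.to('cuda') # (B, T) = (5, 8)
# append one sampled token per step until every row holds max_length tokens
while x.size(1) < max_length:
    with torch.no_grad(): # inference only, no autograd graph needed
        logits, _ = model(x) # logits: (B, T, vocab_size); the loss is unused here
        logits = logits[:, -1, :] # keep only the last position: (B, vocab_size)
        probs = F.softmax(logits, dim=-1) # (B, vocab_size)
        # top-k sampling with k = 50: keep the 50 most likely tokens per row
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1) # (B, 50)
        # torch.multinomial treats each row as weights, so the rows need not sum to 1
        ix = torch.multinomial(topk_probs, 1) # (B, 1), positions within the top 50
        xcol = torch.gather(topk_indices, 1, ix) # (B, 1), mapped back to vocabulary ids
        x = torch.cat((x, xcol), dim=1) # (B, T + 1)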
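
To make the top-k step in the loop above concrete, here is a minimal pure-Python sketch of the same idea for a single row of logits. It is not the notebook's code: `top_k_sample`, the toy logits, and `k=3` are illustrative assumptions, and the standard-library `random` module stands in for `torch.multinomial`.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-probability entries of `logits`."""
    # softmax over the whole vocabulary (subtract the max for numerical stability)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the k most likely token ids, in the spirit of torch.topk
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_probs = [probs[i] for i in top_ids]
    # draw a position within the top k; the weights need not sum to 1
    pos = rng.choices(range(k), weights=top_probs, k=1)[0]
    # map the position back to a vocabulary id, in the spirit of torch.gather
    return top_ids[pos]

rng = random.Random(42)
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]  # made-up values for a 6-token vocabulary
samples = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only ids 0, 3 and 5 can appear
```

Tokens outside the top k receive zero sampling weight, so at each step the notebook's loop can only pick among the 50 most likely tokens.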

In [7]:
for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    print(enc.decode(tokens))

Hello, I'm a language model, not a program.

So this morning I started studying for the interview in the lab. This was not
Hello, I'm a language model, and one of the main things that bothers me when they create languages is how easy it becomes to create something that
Hello, I'm a language model, and I wrote it off on the grounds that a language model would make me more fluent. But I'm not
Hello, I'm a language model, I really like languages. I like languages because like, they're good. And the way we talk about languages
Hello, I'm a language model, a language model I'm using for data modelling. All I did was test the results and then I wrote some


The samples match those from the Hugging Face GPT-2 model!