### Build a GPT from scratch

GPT is a decoder-only Transformer: the model has no encoder stack. Like other Transformer models, GPT uses self-attention to process all positions of the input sequence in parallel, which makes training large language models much more efficient.



Downloading the dataset!

In [13]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-03-25 10:03:29--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-03-25 10:03:29 (27.6 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [14]:
with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [15]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [16]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Total unique characters:", vocab_size, end="\n\n")
print(chars)
print(''.join(chars))

Total unique characters: 65

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


The vocabulary contains only one digit, "3". Let's count how often it appears in the text:

In [17]:
text.count("3")

27

In [18]:
char_to_number = {ch:i for i,ch in enumerate(chars)}
number_to_char = {i:ch for i,ch in enumerate(chars)}

# encoder: take a string, output a list of integers
encode = lambda s: [char_to_number[c] for c in s]

# decoder: take a list of integers, output a string
decode = lambda l: ''.join([number_to_char[i] for i in l])

encoded_tokens = encode("This is GPT.")
print(encoded_tokens)

decoded_tokens = decode(encoded_tokens)
print(decoded_tokens)

[32, 46, 47, 57, 1, 47, 57, 1, 19, 28, 32, 8]
This is GPT.
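
A quick way to sanity-check a character-level tokenizer like the one above is a round-trip test: decoding an encoded string should return the original string. The sketch below rebuilds the mappings on a short stand-in string (`sample_text` is an assumption for illustration, so the integer IDs differ from the ones printed above). It also shows that a character missing from the vocabulary cannot be encoded.

```python
# Stand-in corpus (an assumption for illustration; the notebook uses input.txt)
sample_text = "This is GPT.\nHello, GPT!"

chars = sorted(set(sample_text))
char_to_number = {ch: i for i, ch in enumerate(chars)}
number_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer token IDs."""
    return [char_to_number[c] for c in s]

def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return ''.join(number_to_char[i] for i in ids)

# Round trip: decoding the encoding returns the original string
round_trip = decode(encode("This is GPT."))
print(round_trip)

# "-" and "4" are not in the vocabulary, so encode raises KeyError
try:
    encode("This is GPT-4")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)
```

With the full Shakespeare vocabulary the same limitation applies: only the 65 characters seen in the text have an ID.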


## Dataset

The first 90% of the tokens are used as training data and the remaining 10% as validation data.

In [19]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)

In [23]:
# torch.long (int64) is the integer index dtype that nn.Embedding and the F.cross_entropy targets expect

print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


In [24]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [25]:
decode(train_data[:100].tolist())

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [26]:
context_window = 8
sample_data = train_data[:context_window+1]
print(sample_data, decode(sample_data.tolist()), sep=" = ")

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]) = First Cit


In [27]:
x = train_data[:context_window] # every token except the final, (context_window+1)th one
y = train_data[1:context_window+1]
for t in range(context_window):
    context = x[:t+1].tolist()
    target = y[t]
    print(f"When input is {context} the target: {target}")

When input is [18] the target: 47
When input is [18, 47] the target: 56
When input is [18, 47, 56] the target: 57
When input is [18, 47, 56, 57] the target: 58
When input is [18, 47, 56, 57, 58] the target: 1
When input is [18, 47, 56, 57, 58, 1] the target: 15
When input is [18, 47, 56, 57, 58, 1, 15] the target: 47
When input is [18, 47, 56, 57, 58, 1, 15, 47] the target: 58


In [28]:
for t in range(context_window):
    context = x[:t+1].tolist()
    target = y[t].tolist()
    print(f"{decode(context)} → {decode([target])}")

F → i
Fi → r
Fir → s
Firs → t
First →  
First  → C
First C → i
First Ci → t


In [29]:
train_data.shape

torch.Size([1003854])

In [30]:
n_samples = 4       # batch_size
context_window = 8  # block_size
torch.manual_seed(1337)

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - context_window, (n_samples,))
    x = torch.stack([data[i:i+context_window] for i in ix])
    y = torch.stack([data[i+1:i+context_window+1] for i in ix])
    return x, y

In [36]:
xb, yb = get_batch('train')
print('Inputs:')
print(xb.shape)
print(xb)
print('\nTargets:')
print(yb.shape)
print(yb)

Inputs:
torch.Size([4, 8])
tensor([[52, 42,  8,  0,  0, 23, 21, 26],
        [45, 53, 42, 57,  0, 23, 43, 43],
        [52,  1, 61, 39, 57,  1, 51, 53],
        [39, 49, 12,  1, 27,  1, 58, 56]])

Targets:
torch.Size([4, 8])
tensor([[42,  8,  0,  0, 23, 21, 26, 19],
        [53, 42, 57,  0, 23, 43, 43, 54],
        [ 1, 61, 39, 57,  1, 51, 53, 56],
        [49, 12,  1, 27,  1, 58, 56, 39]])


In [37]:
print('Inputs:')
for decoded_xb in map(lambda t: decode(t.tolist()), xb):
    print("[", decoded_xb.replace("\n","\\n"), "]", sep='')

print('\nTargets:')
for decoded_yb in map(lambda t: decode(t.tolist()), yb):
    print("[", decoded_yb.replace("\n","\\n"), "]", sep='')

Inputs:
[nd.\n\nKIN]
[gods\nKee]
[n was mo]
[ak? O tr]

Targets:
[d.\n\nKING]
[ods\nKeep]
[ was mor]
[k? O tra]
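
The printed batches show that each target row is its input row shifted by one character. A minimal sketch of that property follows, using a stand-in token stream (`torch.arange(100)` is an assumption so that the shift is easy to see; the notebook samples from `train_data`).

```python
import torch

torch.manual_seed(0)

# Stand-in token stream 0..99 (an assumption; the notebook uses the encoded text)
data = torch.arange(100, dtype=torch.long)
context_window = 8
n_samples = 4

def get_batch(data):
    """Sample n_samples random windows and their one-step-shifted targets."""
    ix = torch.randint(len(data) - context_window, (n_samples,))
    x = torch.stack([data[i:i + context_window] for i in ix])
    y = torch.stack([data[i + 1:i + context_window + 1] for i in ix])
    return x, y

xb, yb = get_batch(data)

# Each target row equals its input row shifted left by one position
shifted = torch.equal(xb[:, 1:], yb[:, :-1])
# Because data is 0..99, every target is simply input + 1
plus_one = torch.equal(yb, xb + 1)
print(shifted, plus_one)
```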


## Bigram LM



*   The model only looks at the previous token.
*   Based on that token, it selects the corresponding row of the table (in the manual case) and picks the next character from the distribution stored in that row.



In [38]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x7812ccb9da10>

In [39]:
class BigramLM(nn.Module):
    """
    In the very next cell --> implement full BigramLM
    """

    def __init__(self, vocab_size):
        super().__init__()

        # The simple lookup table...
        self.embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        '''
        Just take the index of the "current" character
        and then use that to get the probability dist
        for the next one using the `embedding_table`.
        '''

        logits = self.embedding_table(idx)

        return logits # Form of (B, T, C)

In [40]:
model = BigramLM(vocab_size)
logits = model(xb, yb)
logits.shape # B, T, C

torch.Size([4, 8, 65])



*   4: Number of samples (batch size)
*   8: Context window (one position per token)
*   65: The vocab size (the logits over the next token, one set for each token in the context window)
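
The `(4, 8, 65)` shape arises because the embedding lookup is plain row selection: each token ID picks one row of the `(65, 65)` weight matrix. A small sketch of this equivalence, using random stand-in token IDs rather than the notebook's `xb`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65

embedding_table = nn.Embedding(vocab_size, vocab_size)

# A batch of token IDs with shape (B, T) = (4, 8); random stand-in values
idx = torch.randint(vocab_size, (4, 8))

logits = embedding_table(idx)        # shape (B, T, C) = (4, 8, 65)
rows = embedding_table.weight[idx]   # plain indexing into the (65, 65) weight matrix

print(logits.shape)
print(torch.equal(logits, rows))
```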



*   Input: You have a text corpus on which you train your model.
*   Tokenization: You tokenize the text into characters. Each character becomes a token.
*   Character Distribution: For each character in the corpus, you analyze the distribution of the next character that follows it.
*   Sampling: When generating new text, you start with an initial character and sample the next character based on the distribution you've learned for that character. Repeat this process iteratively to generate a sequence of characters.
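
The four steps above can be sketched without any training by counting character pairs directly. This is an illustrative count-based variant, not the learned model built in this notebook; the tiny corpus, the names `stoi`/`itos`, and the add-one smoothing are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Tiny stand-in corpus (an assumption; the notebook trains on Tiny Shakespeare)
corpus = "hello hello hello"
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
V = len(chars)

# counts[i, j] = number of times character j follows character i
counts = torch.zeros((V, V))
for a, b in zip(corpus, corpus[1:]):
    counts[stoi[a], stoi[b]] += 1

# Normalize each row into a probability distribution
# (add-one smoothing keeps rows with no observations well-defined)
probs = (counts + 1) / (counts + 1).sum(dim=1, keepdim=True)

# Generate: look up the row for the current character, sample, repeat
current = stoi["h"]
out = ["h"]
for _ in range(20):
    current = torch.multinomial(probs[current], num_samples=1).item()
    out.append(itos[current])
print("".join(out))
```

The learned `nn.Embedding` table plays the same role as `counts` here, except that its rows hold logits that are fitted by gradient descent instead of normalized counts.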






## Building Bigram model

In [41]:
class BigramLM(nn.Module):
    """
    This class takes `vocab_size` as its only argument: it is the
    "simplest" model, so we won't need anything else.

    It creates a `vocab_size` x `vocab_size` lookup table whose rows
    are indexed by token ID.

    The forward function takes the input `idx` and performs the forward
    pass on it; when `targets` is given, it also returns the loss. The
    nuances are explained in the following markdown cells.
    """

    def __init__(self, vocab_size):
        super().__init__()

        self.embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        '''
        Just take the index of the "current" character
        and then use that to get the probability dist
        for the next one using the `embedding_table`.
        '''

        logits = self.embedding_table(idx)

        if targets is None: # inferencing and not training
            loss=None
        else:               # training
            # Refer: Change [1]
            B, T, C = logits.shape
            logits = logits.view(B*T, C)

            # Refer: Change [1]
            targets = targets.view(B*T)

            # For the given logits and *correct* targets, pick the pre-
            # dicted logits for the given target to calculate the
            # negative log likelihood loss.
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        '''
        Take the index of the token and guess the next
        token based on the embeddings!
        '''
        for _ in range(max_new_tokens):
            # call the `forward` method
            logits, loss = self(idx) # It's possible bc we have inherited `nn.Module`

            # Take the logits at the last time step (the most recent
            # token) and use its distribution to get the next token
            logits = logits[:, -1, :] ## refer: Change [2]

            # Convert the logits into the probabilities
            probs = F.softmax(logits, dim=-1) ## dim=-1: along the last dimension ~ here `1`

            # Take the next idx!
            next_idx = torch.multinomial(probs, num_samples=1)

            # Here we will ALWAYS append, there is NO shrinking!
            idx = torch.cat((idx, next_idx), dim=1)

        return idx

In [42]:
# Generation from an untrained, randomly initialized model
model = BigramLM(vocab_size)
logits, loss = model(xb, yb)
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), dtype=torch.long),
        max_new_tokens=100)[0].tolist()
)
print(output)


hbH

:CLP.A!fq'3ggt!O!T?X!!SA?W&TrpvYybSE3w&S BXUhmiKYyTmWMPhhmnHKj!!btgnwNNULuEzRuYyiWEQxPX!$3C'MBj
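
The output above is gibberish because the model has not been trained yet. A useful reference point for the loss at this stage is a uniform guess over the 65 characters, which gives a cross-entropy of ln(65) ≈ 4.17; a freshly initialized model usually starts somewhat above that. The sketch below checks the uniform value and also shows the `(B, T, C)` to `(B*T, C)` flattening that `F.cross_entropy` needs. The all-zero logits and random targets are stand-ins, not values from the notebook.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65  # batch size, context window, vocab size

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros((B, T, C))
# Arbitrary stand-in targets; under a uniform prediction their values do not matter
targets = torch.randint(C, (B, T))

# F.cross_entropy expects (N, C) logits and (N,) class indices, hence the flattening
uniform_loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T)).item()

print(uniform_loss, math.log(C))
```

A training loss well below this value indicates that the model has learned something about which characters tend to follow which.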


## Optimization

In [43]:
model = BigramLM(vocab_size)

In [44]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

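The model has a single layer, so `model.parameters()` yields just one tensor: the embedding table of shape `(65, 65)`. Calling `len` on it returns the size of its first dimension, 65, not the total number of parameters (65 × 65 = 4,225).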

In [45]:
param = model.parameters()
for p in param:
    print(len(p))

65
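
To see the full shape and the total parameter count rather than only the first dimension, `p.shape` and `p.numel()` are more informative than `len(p)`. A minimal sketch on a standalone embedding table of the same size (it stands in for the notebook's `model`):

```python
import torch.nn as nn

vocab_size = 65
embedding_table = nn.Embedding(vocab_size, vocab_size)

for name, p in embedding_table.named_parameters():
    # len(p) is only the first dimension; p.shape and p.numel() give the full picture
    print(name, tuple(p.shape), len(p), p.numel())

# Total number of trainable values: 65 * 65 = 4225
total = sum(p.numel() for p in embedding_table.parameters())
print(total)
```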


## Training loop

In [46]:
n_samples = 32 # batch size

for steps in range(20_000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)

    loss.backward()

    optimizer.step()

print(loss.item())

2.420738458633423
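
The value printed above is the loss on the last training batch only, so it is a noisy estimate. One common way to get a steadier number, and to compare training against validation data, is to average the loss over several batches with gradients disabled. The sketch below is self-contained: the random token streams, the bare `table` model, and the helper name `estimate_loss` are assumptions for illustration and stand in for the notebook's `train_data`, `val_data`, and `model`.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, context_window, n_samples = 65, 8, 32

# Random stand-in token streams (an assumption; the notebook uses train_data / val_data)
splits = {
    "train": torch.randint(vocab_size, (1000,)),
    "val": torch.randint(vocab_size, (200,)),
}

def get_batch(split):
    """Sample a batch of inputs and one-step-shifted targets from a split."""
    data = splits[split]
    ix = torch.randint(len(data) - context_window, (n_samples,))
    x = torch.stack([data[i:i + context_window] for i in ix])
    y = torch.stack([data[i + 1:i + context_window + 1] for i in ix])
    return x, y

# Stand-in for the bigram model: a single (vocab_size, vocab_size) lookup table
table = nn.Embedding(vocab_size, vocab_size)

@torch.no_grad()  # evaluation only, so no gradients are tracked
def estimate_loss(eval_iters=50):
    """Average the cross-entropy over eval_iters random batches per split."""
    out = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = table(xb)  # (B, T, C)
            B, T, C = logits.shape
            losses[k] = F.cross_entropy(logits.view(B * T, C), yb.view(B * T))
        out[split] = losses.mean().item()
    return out

losses = estimate_loss()
print(losses)
```

Calling such a helper every few thousand steps inside the training loop would show whether the validation loss tracks the training loss.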


In [47]:
output = decode(
    model.generate(
        idx = torch.zeros((1, 1), dtype=torch.long),
        max_new_tokens=500)[0].tolist()
)
print(output)


Toul sil prangir sis.

Wh I whise brthit RD:
Gom.
I INomere y ghesen cond
YCouchin, w?

Te counere ne ung;
GE n;
II t.

He verve o was.


Hes aift NTHEDIO:
Sl:
Whinke ngt?
MAms NGO shemo too mo anthatinthakes f utous as Agonteopr botherore thind spat PTEShiat ureraierio pr son me LO:
Wats me S:
To tllingewe lley ayom

Mo;
Latanssuromas:

Y:
PE:
Therucover, min ld te o e, un rd o s hthecals,

WI thank m:


NTharu t irlendoucin,
Y: Tisteriomad t.
Yor f.

S:
G, g w,

Hee:


NGLI,
JUTeden t t IVo sl
