This is my implementation of a Transformer language model, loosely following Andrej Karpathy's video lecture [Let's Build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1866s).

Some key differences between his lecture and this notebook:
- This notebook uses Keras/Tensorflow instead of PyTorch
- This notebook uses Keras' MultiHeadAttention and LayerNormalization layers instead of implementing them from scratch
- This notebook is meant to use TikToken for tokenization (TODO); for now it uses a character-level vocabulary

In [1]:
import tensorflow as tf
import numpy as np

In [4]:
with open('../input.txt', 'r', encoding='utf8') as f:
    text = f.read()
    
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print('Vocab size:', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


In [7]:
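# Character-level tokenizer: map each character to an integer id and back
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]          # string -> list of token ids
decode = lambda l: ''.join(itos[i] for i in l)   # list of token ids -> string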
print(encode('hello world'))
print(decode(encode('hello world')))

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world
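
One property of the character-level tokenizer above is that it can only encode characters that appear in the text it was built from. The following is a minimal sketch of that behaviour on a toy string; the `toy_*` names are hypothetical and exist only for this illustration, so they do not touch the notebook's `stoi`/`itos`.

```python
# Build a toy character vocabulary from a short string (illustrative only).
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

def toy_encode(s):
    """Map each character to its integer id; raises KeyError for unseen characters."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

# Round trip works for characters in the vocabulary.
print(toy_decode(toy_encode("hello")))

# 'H' never appears in toy_text, so encoding fails.
try:
    toy_encode("Hello")
    unknown_failed = False
except KeyError:
    unknown_failed = True
print("unknown character rejected:", unknown_failed)
```

For the Shakespeare text this is rarely a problem, but a prompt containing a character outside the 65-symbol vocabulary would raise the same `KeyError`.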


In [8]:
data = tf.constant(encode(text))
print(data[:200])

tf.Tensor(
[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1 39 56 43  1 39 50 50  1 56 43 57 53 50 60 43 42  1 56 39
 58 46 43 56  1 58 53  1 42 47 43  1 58 46 39 52  1 58 53  1 44 39 51 47
 57 46 12  0  0 13 50 50 10  0 30 43 57 53 50 60 43 42  8  1 56 43 57 53
 50 60 43 42  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 18 47
 56 57 58  6  1 63 53 59], shape=(200,), dtype=int32)


In [10]:
split_index = int(0.9 * len(data))
train_data = data[:split_index]
val_data = data[split_index:]

In [18]:
block_size = 64
batch_size = 32

def get_batch(split):
    # Sample batch_size random windows of block_size tokens.
    # Targets are the same windows shifted one token ahead, one-hot encoded.
    data = train_data if split == 'train' else val_data
    ix = tf.experimental.numpy.random.randint(low=0, high=len(data) - block_size, size=batch_size)
    x = tf.stack([data[i:i+block_size] for i in ix])
    y = tf.one_hot(tf.stack([data[i+1:i+block_size+1] for i in ix]), depth=vocab_size)
    return x, y
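
To make the input/target relationship in `get_batch` concrete, here is a small sketch in plain Python (no TensorFlow) that samples windows from a toy token list. The helper name `sample_windows` and the toy data are assumptions made for this illustration, and the targets are left as integer ids rather than one-hot vectors.

```python
import random

def sample_windows(tokens, block_size, batch_size, rng):
    """Return (inputs, targets); each target window is its input shifted one token ahead."""
    # The largest valid start leaves room for block_size inputs plus one extra target token.
    starts = [rng.randrange(0, len(tokens) - block_size) for _ in range(batch_size)]
    inputs = [tokens[i:i + block_size] for i in starts]
    targets = [tokens[i + 1:i + block_size + 1] for i in starts]
    return inputs, targets

# Toy "corpus": the token at position p is simply p, so the shift is easy to see.
toy_tokens = list(range(20))
xs, ys = sample_windows(toy_tokens, block_size=4, batch_size=3, rng=random.Random(0))
for x, y in zip(xs, ys):
    print(x, "->", y)
```

Each position in a target window holds the token that follows the corresponding input position, so a single window supplies `block_size` next-token training examples.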

In [21]:
def create_model(depth):
    # Lower-triangular mask: position t may attend only to positions <= t
    causal_mask = np.tril(np.ones((block_size, block_size)), 0)

    inputs = tf.keras.layers.Input(shape=(block_size,))
    # Token embedding only (no positional embedding); the embedding width equals vocab_size
    x = tf.keras.layers.Embedding(vocab_size, vocab_size)(inputs)

    for _ in range(depth):
        # Attention
        mha = tf.keras.layers.MultiHeadAttention(num_heads=1, 
                                                 key_dim=vocab_size, 
                                                 value_dim=vocab_size)

        att = mha(query=x, value=x, attention_mask=causal_mask)
        x = tf.keras.layers.LayerNormalization()(x + att)

        # Feed Forward
        ff1 = tf.keras.layers.Dense(units=vocab_size, activation='relu')(x)
        ff2 = tf.keras.layers.Dense(units=vocab_size)(ff1)
        x = tf.keras.layers.LayerNormalization()(x + ff2)

    # No separate output projection: the vocab_size-wide features go straight through softmax
    x = tf.keras.layers.Softmax()(x)

    model = tf.keras.Model(inputs=inputs, outputs=x)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
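
The effect of the causal mask in `create_model` can be seen on a tiny example. The sketch below is written in plain Python rather than Keras; `causal_mask` here is a hypothetical helper that builds the same lower-triangular pattern as `np.tril(np.ones((T, T)))`, and `masked_softmax` is an illustrative stand-in for the masking step inside attention, not the Keras implementation.

```python
import math

def causal_mask(size):
    """Lower-triangular 0/1 matrix: row t has ones in columns 0..t."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]

def masked_softmax(scores, mask):
    """Softmax over each row, giving zero weight to masked-out (future) positions."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_mask = causal_mask(4)
# Equal scores everywhere, so each row spreads its weight evenly over the visible positions.
toy_scores = [[0.0] * 4 for _ in range(4)]
toy_weights = masked_softmax(toy_scores, toy_mask)
for row in toy_weights:
    print([round(w, 2) for w in row])
```

Every row still sums to one, but no weight ever lands on a later position, which is what prevents the model from looking ahead at the tokens it is trying to predict.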

In [28]:
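def generate(model, input_string, num_new_tokens):
    # input_string must be exactly block_size characters: the model's input length is fixed
    s = input_string
    x = encode(input_string)
    for _ in range(num_new_tokens):
        out = model.predict(tf.constant([x]), verbose=0)
        # Sample the next token from the distribution predicted at the last position
        next_token = np.random.choice(np.arange(0, vocab_size), p=out[0][-1])
        x.append(next_token)
        x = x[1:]  # slide the context window forward by one token
        s += decode([next_token])
    return s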
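
`generate` keeps the context at a fixed length by appending the sampled token and dropping the oldest one. A minimal sketch of that sliding window, extended to crop prompts longer than the block size, might look like this; `crop_context` and `slide_window` are hypothetical helpers and are not part of the notebook.

```python
def crop_context(token_ids, block_size):
    """Keep only the most recent block_size tokens of a prompt."""
    return list(token_ids[-block_size:])

def slide_window(context, next_token, block_size):
    """Append a new token and drop the oldest ones so the length stays bounded."""
    return (list(context) + [next_token])[-block_size:]

# A 6-token prompt cropped to a window of 4, then extended by two new tokens.
context = crop_context([1, 2, 3, 4, 5, 6], block_size=4)
for new_token in [7, 8]:
    context = slide_window(context, new_token, block_size=4)
print(context)
```

Cropping only handles prompts that are too long. Since the model above is built with a fixed input length, a prompt shorter than `block_size` would still need padding or a different model definition.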

In [23]:
def train(model, train_steps):
    for step in range(train_steps):
        # Each fit call runs a single gradient step on one freshly sampled batch
        xb, yb = get_batch('train')
        model.fit(xb, yb)

In [24]:
model = create_model(depth=6)

In [26]:
train(model, 100)



In [30]:
output = generate(model, text[:64], 100)
print(output)

First Citizen:
Before we proceed any further, hear me speak.

Al?
SAqundl merowet ollangh, Zmyce yonoasinXE UL3 st&enemnt,MICe FZEd hy, owamal ngh th! mXEgfNY:
V,
S
