# Transformer Text Generation

This is my implementation of a Transformer language model, loosely following Andrej Karpathy's video lecture [Let's Build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1866s).

Some key differences between his lecture and this notebook:
- This notebook uses Keras/Tensorflow instead of PyTorch
- This notebook uses Keras' MultiHeadAttention and LayerNormalization layers instead of implementing them from scratch
- This notebook is meant to use TikToken for tokenization (TODO; for now it uses the character-level vocabulary built below)

ToDo List:
- Implement TikToken
- Write custom train loop
- Validation
- Dropout
- Read the transformer literature and improve the architecture?

In [3]:
import tensorflow as tf
import numpy as np

In [4]:
with open('../input.txt', 'r', encoding='utf8') as f:
    text = f.read()
    
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print('Vocab size:', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


In [6]:
stoi = { ch:i for i, ch in enumerate(chars) }
itos = {i:ch for i, ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
print(encode('hello world'))
print(decode(encode('hello world')))

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world
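
The character-level mapping can be sanity-checked with a round trip. This is a minimal sketch that assumes a tiny stand-in corpus (`sample_text` is hypothetical; the notebook builds the same tables from `input.txt`), so the ids differ from the ones printed above.

```python
# Stand-in corpus (an assumption for illustration, not the Shakespeare file).
sample_text = "hello world"

# Build the vocabulary and the two lookup tables, as in the cell above.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return ''.join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

A character that never occurs in the corpus is missing from `stoi`, so `encode` raises a `KeyError` for it.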


In [7]:
data = tf.constant(encode(text))
print(data[:200])

tf.Tensor(
[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1 39 56 43  1 39 50 50  1 56 43 57 53 50 60 43 42  1 56 39
 58 46 43 56  1 58 53  1 42 47 43  1 58 46 39 52  1 58 53  1 44 39 51 47
 57 46 12  0  0 13 50 50 10  0 30 43 57 53 50 60 43 42  8  1 56 43 57 53
 50 60 43 42  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 18 47
 56 57 58  6  1 63 53 59], shape=(200,), dtype=int32)


In [8]:
split_index = int(0.9 * len(data))
train_data = data[:split_index]
val_data = data[split_index:]

In [27]:
block_size = 256
batch_size = 64
embed_dim = 32

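def get_batch(split):
    # Sample batch_size random windows of block_size tokens from the chosen split.
    data = train_data if split == 'train' else val_data
    ix = tf.experimental.numpy.random.randint(low=0, high=len(data) - block_size, size=batch_size)
    x = tf.stack([data[i:i+block_size] for i in ix])
    # Targets are the inputs shifted one token to the right, one-hot encoded
    # to match the 'categorical_crossentropy' loss used by the model.
    y = tf.one_hot(tf.stack([data[i+1:i+block_size+1] for i in ix]), depth=vocab_size)
    return x, y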
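
The relationship between inputs and targets in `get_batch` is easiest to see on toy data. The sketch below uses NumPy only and returns integer targets instead of one-hot vectors; `get_batch_np`, `toy_data` and `rng` are hypothetical names introduced for illustration.

```python
import numpy as np

def get_batch_np(data, block_size, batch_size, rng):
    """Sample (inputs, integer targets); targets are the inputs shifted by one token."""
    # The exclusive upper bound leaves room for block_size inputs plus one extra target token.
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix])
    y = np.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

rng = np.random.default_rng(0)
toy_data = np.arange(100)  # stand-in for the encoded text
xb, yb = get_batch_np(toy_data, block_size=8, batch_size=4, rng=rng)
print(xb.shape, yb.shape)  # (4, 8) (4, 8)
```

Integer targets like these would pair with Keras' `sparse_categorical_crossentropy` loss, which avoids materializing a `(batch_size, block_size, vocab_size)` one-hot tensor. The notebook itself uses one-hot targets.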

In [28]:
np.arange(block_size)

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [33]:
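def create_model(depth):
    # Lower-triangular mask: position t may attend only to positions <= t.
    causal_mask = np.tril(np.ones((block_size, block_size)), 0)

    inputs = tf.keras.layers.Input(shape=(block_size,))
    # Token embedding. The embedding width is vocab_size (embed_dim is not used here),
    # so the last block's output can be fed to the softmax without an extra projection.
    tok_emb = tf.keras.layers.Embedding(vocab_size, vocab_size)(inputs)
    # Positional embedding, one vector per position in the block.
    # Note: the layer is called on a NumPy array rather than on a Keras tensor, so its
    # output likely enters the graph as a constant and may not be updated during training.
    pos_emb_layer = tf.keras.layers.Embedding(block_size, vocab_size)
    pos_emb = pos_emb_layer(np.arange(block_size))
    x = tok_emb + pos_emb

    for _ in range(depth):
        # Causal self-attention, followed by a residual connection and layer norm.
        mha = tf.keras.layers.MultiHeadAttention(num_heads=6,
                                                 key_dim=vocab_size,
                                                 value_dim=vocab_size)

        att = mha(query=x, value=x, attention_mask=causal_mask)
        x = tf.keras.layers.LayerNormalization()(x + att)

        # Position-wise feed-forward network, again with residual connection and layer norm.
        ff1 = tf.keras.layers.Dense(units=vocab_size, activation='relu')(x)
        ff2 = tf.keras.layers.Dense(units=vocab_size)(ff1)
        x = tf.keras.layers.LayerNormalization()(x + ff2)

    # Turn the final activations into a distribution over the vocabulary at each position.
    x = tf.keras.layers.Softmax()(x)

    model = tf.keras.Model(inputs=inputs, outputs=x)
    # The model outputs probabilities and the targets are one-hot, hence categorical_crossentropy.
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model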
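
The effect of the `np.tril` causal mask can be illustrated in plain NumPy. This is only a sketch of the idea on random stand-in scores, not Keras' actual `MultiHeadAttention` implementation; `T`, `scores` and `weights` are illustrative names.

```python
import numpy as np

T = 4  # tiny sequence length for illustration
causal_mask = np.tril(np.ones((T, T)))

# Stand-in attention scores; row t holds the scores of query position t.
scores = np.random.default_rng(0).normal(size=(T, T))

# Masked-out (future) positions get -inf, so they receive zero weight after the softmax.
masked = np.where(causal_mask == 1, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(causal_mask)
print(np.round(weights, 2))
```

Every row of `weights` sums to 1 and is zero above the diagonal, so position `t` only mixes information from positions `0..t`.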

In [34]:
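def generate(model, input_string, num_new_tokens):
    # input_string must encode to exactly block_size tokens, because the model
    # input has a fixed length of block_size.
    s = input_string
    x = encode(input_string)
    for _ in range(num_new_tokens):
        out = model.predict(tf.constant([x]), verbose=0)
        # Sample the next token from the distribution predicted at the last position.
        next_token = np.random.choice(np.arange(0, vocab_size), p=out[0][-1])
        x.append(next_token)
        # Drop the oldest token so the context window stays at block_size.
        x = x[1:]
        s += decode([next_token])
    return s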
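
`generate` samples directly from the predicted distribution. A common extension, which is not part of this notebook, is to rescale that distribution with a temperature before sampling. The helper below is a hypothetical sketch in NumPy; `sample_next` and its parameters are illustrative names.

```python
import numpy as np

def sample_next(probs, temperature=1.0, rng=None):
    """Sample a token id from probs after temperature scaling.

    temperature=1.0 keeps the distribution essentially unchanged; values below 1
    sharpen it toward the most likely token, values above 1 flatten it.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small epsilon guards against log(0).
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
probs = [0.1, 0.7, 0.2]
print(sample_next(probs, temperature=1.0, rng=rng))
print(sample_next(probs, temperature=0.01, rng=rng))
```

Inside `generate`, a call such as `sample_next(out[0][-1], temperature=0.8)` could stand in for the `np.random.choice` line.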

In [35]:
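def train(model, train_steps):
    for step in range(train_steps):
        xb, yb = get_batch('train')
        # Each fit() call runs one epoch over this single sampled batch.
        # Note: fit() defaults to batch_size=32, so a 64-sequence batch is
        # processed as two mini-batches.
        model.fit(xb, yb)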
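
The ToDo list mentions a custom train loop and validation. One way this could look is sketched below with `tf.GradientTape`. It assumes the model outputs probabilities and the batches carry one-hot targets, as in this notebook; `train_custom`, `get_batch_fn`, `eval_every` and `learning_rate` are hypothetical names, and the learning rate is an arbitrary choice.

```python
import tensorflow as tf

def train_custom(model, get_batch_fn, train_steps, eval_every=100, learning_rate=1e-3):
    """Run one optimizer step per sampled batch and periodically record the validation loss."""
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    # The model outputs probabilities (softmax) and targets are one-hot.
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    history = []
    for step in range(train_steps):
        xb, yb = get_batch_fn('train')
        with tf.GradientTape() as tape:
            probs = model(xb, training=True)
            loss = loss_fn(yb, probs)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if step % eval_every == 0:
            # A single validation batch gives a noisy but cheap estimate.
            xv, yv = get_batch_fn('val')
            val_loss = loss_fn(yv, model(xv, training=False))
            history.append((step, float(loss), float(val_loss)))
    return history
```

With the notebook's own functions this would be called as `train_custom(model, get_batch, 5000)`. Averaging the validation loss over several batches would give a steadier estimate.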

In [36]:
model = create_model(depth=6)
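
`create_model` sets every layer width to `vocab_size` and leaves `embed_dim` unused. A possible variant, sketched here under assumptions rather than taken from the notebook, uses `embed_dim` internally, wraps both embeddings in a layer so the positional table is part of the model, and adds a final `Dense` projection to vocabulary logits. `TokenAndPositionEmbedding`, `create_model_v2`, the head count and the 4x feed-forward width are illustrative choices. The `use_causal_mask` argument assumes a reasonably recent TensorFlow/Keras version.

```python
import tensorflow as tf

class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Sum of a token embedding and a learned positional embedding."""

    def __init__(self, block_size, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.tok = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos = tf.keras.layers.Embedding(block_size, embed_dim)

    def call(self, token_ids):
        positions = tf.range(tf.shape(token_ids)[-1])
        # The positional term has shape (T, embed_dim) and broadcasts over the batch.
        return self.tok(token_ids) + self.pos(positions)

def create_model_v2(block_size, vocab_size, embed_dim=32, num_heads=4, depth=2):
    inputs = tf.keras.layers.Input(shape=(block_size,), dtype='int32')
    x = TokenAndPositionEmbedding(block_size, vocab_size, embed_dim)(inputs)
    for _ in range(depth):
        mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                 key_dim=embed_dim // num_heads)
        att = mha(x, x, use_causal_mask=True)  # query and value are both x
        x = tf.keras.layers.LayerNormalization()(x + att)
        ff = tf.keras.layers.Dense(4 * embed_dim, activation='relu')(x)
        ff = tf.keras.layers.Dense(embed_dim)(ff)
        x = tf.keras.layers.LayerNormalization()(x + ff)
    # Project back to the vocabulary; the outputs are unnormalized logits.
    logits = tf.keras.layers.Dense(vocab_size)(x)
    model = tf.keras.Model(inputs=inputs, outputs=logits)
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    return model
```

This variant expects integer targets instead of one-hot vectors, and its logits need a softmax (for example `tf.nn.softmax`) before they can be sampled from in `generate`.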

In [None]:
train(model, 5000)

In [None]:
output = generate(model, text[:block_size], 1000)

In [40]:
print(output)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
Have me; sir, his is a ster?

PAULINA:
I'll be contience too, Hastile Coriolanus
As men: as I have no what never, your majesty
That go your bents there your saves,
And that was to have you. Come! Which indeed you?

CORIOLANUS:
Then yield say there
all all it full with sea roughs throw ask
Thormening away from my fashiling grief;
Who the gods so, conveniency, seance, and,
Which iat they not, and actions strangely
Have the herself proppert her: here say
Is were this Gaster one, lovethour shall
The tears men boan to-day poind or spart.

First Murderer:
When it, tucketh you are copens
For callenty many head?

LUCIO:
Ten 'made me all dirot deliver and of a back;
The vpervel with likecowiens
to give you are giverse, being my fleave-by in
Un