# shakespeare.ai

A fun character-level text generator, built on a transformer, that produces text in the style of William Shakespeare's English.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(1618)

This cell imports the necessary modules from the PyTorch library to work with neural networks. The `torch` module is the main package for tensor computations, while `torch.nn` provides support for defining and training neural networks. The `torch.nn.functional` module contains various functions for implementing neural network operations.

Additionally, the line `torch.manual_seed(1618)` sets the random seed to 1618. With a fixed seed, PyTorch's random number generator produces the same sequence of values each time the code runs in the same environment, which is useful for debugging and for reproducing results.
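As a quick illustration of what the seed does, the sketch below re-seeds the generator and draws a random tensor twice. The tensor shape is arbitrary and chosen only for the demo.

```python
import torch

# Re-seeding with the same value restarts the random sequence,
# so the two draws below are identical.
torch.manual_seed(1618)
first = torch.rand(3)

torch.manual_seed(1618)
second = torch.rand(3)

print(torch.equal(first, second))  # True
```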

In [None]:
# hyperparameters
batchSize = 64
blockSize = 256
maxIters = 5000
evalInter = 500
learningRate = 3e-4
# device = "cuda" if torch.cuda.is_available() else "cpu"
evalIters = 200
nEmbd = 384
nHead = 6
nLayer = 6
dropout = 0.2

In this cell, various hyperparameters for the model are defined. These hyperparameters control different aspects of the training and architecture of the neural network. Here's a brief explanation of each hyperparameter:

- `batchSize`: The number of samples in each training batch.
- `blockSize`: The context length, i.e. the number of characters in each training sequence.
- `maxIters`: The maximum number of training iterations.
- `evalInter`: The interval, in training iterations, at which the model is evaluated.
- `learningRate`: The learning rate used in the optimization algorithm.
- `evalIters`: The number of batches averaged per split each time the loss is estimated.
- `nEmbd`: The dimensionality of the embedding layer.
- `nHead`: The number of attention heads in the transformer model.
- `nLayer`: The number of layers in the transformer model.
- `dropout`: The probability of dropout to apply during training for regularization.

Feel free to adjust these hyperparameters to suit your requirements and the characteristics of your dataset. This particular configuration amounts to roughly 10 million parameters, so I'd suggest not training it on a CPU.
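The "about 10 million" figure can be checked with some back-of-the-envelope arithmetic over the layers defined later in the notebook. The sketch below is only an estimate under one assumption: a vocabulary of 65 characters, which is not known until the corpus is loaded. The snake_case names are local to this example.

```python
# Rough parameter count for the configuration above.
# Assumption: vocab_size = 65 (the real value depends on the corpus).
vocab_size = 65
block_size, n_embd, n_head, n_layer = 256, 384, 6, 6
head_size = n_embd // n_head

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
attention = n_head * 3 * n_embd * head_size             # key, query, value (no bias)
projection = n_embd * n_embd + n_embd                   # output projection with bias
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * 2 * n_embd                            # two LayerNorms per block
per_block = attention + projection + feed_forward + layer_norms

final_norm = 2 * n_embd                                 # weight + bias
lm_head = n_embd * vocab_size + vocab_size              # output layer with bias

total = embeddings + n_layer * per_block + final_norm + lm_head
print(f"{total / 1e6:.2f}M parameters")
```

Almost all of the parameters sit in the transformer blocks, so `nEmbd` and `nLayer` are the settings that change the model size the most.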

In [None]:
# reading and viewing the text corpus
with open("/content/drive/MyDrive/shakespeare.txt", "r") as file:
    text = file.read()

print(f"Length of the dataset: {len(text)}")
print("First 1000 characters:")
print(text[:1000])

The cell reads a text corpus from a file named "shakespeare.txt" and stores its content in a variable called `text`. The `with open()` statement ensures proper handling of file resources by automatically closing the file once it's done being read.

The length of the text corpus is then printed using the `len()` function, providing an indication of the total number of characters in the dataset.

Lastly, the first 1000 characters of the corpus are printed using `print(text[:1000])`, allowing a glimpse into the content of the dataset.

This code is useful for loading and examining the text corpus before further processing or training. It helps ensure that the data is correctly loaded and provides an initial understanding of the dataset's structure and content.

In [None]:
# all the chars that occur in the doc
chars = sorted(list(set(text)))
vocabSize = len(chars)
print("Charset:", "".join(chars))
print(f"{vocabSize=}")

# creating mappings from chars to ints
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

The cell performs the following tasks:

1. It creates a sorted list of unique characters present in the `text` corpus using `set(text)` to eliminate duplicates and `sorted()` to sort them in ascending order. The resulting list is assigned to the variable `chars`.
2. The variable `vocabSize` is then assigned the length of the `chars` list, representing the total number of unique characters in the corpus.
3. The line `print("Charset:", "".join(chars))` outputs the sorted list of characters as a string, displaying all the unique characters present in the dataset.
4. The line `print(f"{vocabSize=}")` prints the value of `vocabSize`, which represents the number of unique characters in the dataset.

The next part of the code snippet involves creating two mappings: `stoi` (string to integer) and `itos` (integer to string). These mappings are dictionaries that associate each character with a unique index or vice versa.

- The line `stoi = {ch: i for i, ch in enumerate(chars)}` creates the `stoi` dictionary, where each character from `chars` is mapped to its corresponding index.
- The line `itos = {i: ch for i, ch in enumerate(chars)}` creates the `itos` dictionary, where each index is mapped to its corresponding character.

These mappings are useful for encoding and decoding characters during text generation or any other tasks that require converting between characters and their corresponding integer representations.

In [None]:
# encoder function; string to list(int)
def encode(s):
    return [stoi[c] for c in s]

# decoder function; list(int) to string
def decode(l):
    return "".join(itos[i] for i in l)

The cell defines two functions:

1. `encode(s)`: This function takes a string `s` as input and returns a list of integers representing the encoded version of the string. The function utilizes a list comprehension and the `stoi` dictionary to map each character in `s` to its corresponding integer index.

2. `decode(l)`: This function takes a list of integers `l` as input and returns the decoded string. It uses a generator expression and the `itos` dictionary to map each integer in `l` to its corresponding character, then joins the characters to form a string.

These functions are useful for converting between string and integer representations of text. The `encode()` function is typically used to convert input text into a numerical representation that can be fed into a neural network, while the `decode()` function is used to convert the output of a neural network (a list of predicted integers) back into a readable text format.
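To make the round trip concrete, here is a minimal sketch on a toy corpus. The `toy_` names and the sample sentence are assumptions for the demo; the notebook's own `stoi` and `itos` are built from the full text.

```python
# Toy corpus standing in for the real text (an assumption for the demo).
toy_text = "to be or not to be"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # string -> list of integer ids
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # list of integer ids -> string
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("not to be")
print(ids)
print(toy_decode(ids))  # not to be
```

A character that never appears in the corpus has no entry in the mapping, so encoding it raises a `KeyError`.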

In [None]:
print("***Test***")
temp = encode("Carpe diem!")
print("Encoded vector for 'Carpe diem!':", temp)
print(decode(temp))

# encoding the entire doc
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data encoding metadata: {data.shape=}, {data.dtype=}")

In this cell, a test case is performed to demonstrate the usage of the `encode()` and `decode()` functions.

- The line `temp = encode("Carpe diem!")` encodes the string "Carpe diem!" using the `encode()` function, resulting in a list of integers representing the encoded version of the text.
- The line `print("Encoded vector for 'Carpe diem!':", temp)` prints the encoded vector.
- The line `print(decode(temp))` decodes the encoded vector using the `decode()` function and prints the decoded string.

After the test case, the entire `text` corpus is encoded using `encode(text)`, and the resulting list of integers is converted into a `torch.tensor` using `torch.tensor(encode(text), dtype=torch.long)`. The resulting `data` tensor represents the encoded version of the entire text corpus.

The last line `print(f"Data encoding metadata: {data.shape=}, {data.dtype=}")` outputs the shape and data type information of the `data` tensor, providing metadata about the encoded data.

In [None]:
# train & val split
n = int(0.9*len(data))
train = data[:n]
val = data[n:]

The cell performs a train-validation split on the encoded data.

The encoded data, stored in the `data` tensor, is split into two parts: a training set (`train`) and a validation set (`val`). The split is done by determining the index `n` which represents 90% of the data length.

- `train` is created by taking the portion of the `data` tensor from the beginning up to index `n`, representing the first 90% of the data.
- `val` is created by taking the portion of the `data` tensor from index `n` until the end, representing the remaining 10% of the data.

This train-validation split is commonly used in machine learning to divide the data into two separate sets for training and evaluating a model, respectively. Adjusting the split percentage allows for different proportions of data allocated for training and validation.

In [None]:
# loading data
def getBatch(split):
    data = train if split == "train" else val
    idx = torch.randint(len(data) - blockSize, (batchSize,))
    x = torch.stack([data[i: i + blockSize] for i in idx])
    y = torch.stack([data[i + 1: i + blockSize + 1] for i in idx])
    # x, y = x.to(device), y.to(device)

    return x, y

The `getBatch` function retrieves a batch of data for either the training or validation split.

- `getBatch(split)` is a function that takes a `split` argument indicating whether to fetch a batch for the "train" or "val" (validation) split.
- Inside the function, the `data` variable is assigned the `train` tensor if the `split` is "train", and the `val` tensor otherwise.
- Random start indices `idx` are drawn from the range `[0, len(data) - blockSize)`, so every input slice of length `blockSize` and its one-step-shifted target fit inside `data`.
- The input sequences `x` are constructed by selecting slices of length `blockSize` from `data` based on the generated indices `idx`.
- The target sequences `y` are constructed similarly to `x`, but with an offset of one.
- Finally, the function returns the input sequences `x` and the corresponding target sequences `y`.

Note: The commented-out line `# x, y = x.to(device), y.to(device)` suggests that the code was originally written to run on a specific device, such as a CUDA-capable GPU, but it is currently disabled. To use a GPU, uncomment this line along with the `device` definition in the hyperparameters cell and the other commented-out `device` references.
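The one-step offset between inputs and targets is easier to see on a tiny example. The sketch below uses plain Python lists and made-up values (`sequence`, `block`, and `start` are assumptions for the demo) in place of the encoded corpus and a random index.

```python
# Toy stand-ins: a short sequence and a small block size.
sequence = [10, 11, 12, 13, 14, 15]
block = 4
start = 1

x = sequence[start : start + block]          # inputs
y = sequence[start + 1 : start + block + 1]  # targets, shifted by one

# Each position t yields one training example:
# the context x[0..t] should predict y[t].
for t in range(block):
    print(f"context {x[: t + 1]} -> target {y[t]}")
```

A single slice of length `blockSize` therefore packs `blockSize` training examples, one for every context length from 1 up to `blockSize`.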

In [None]:
# creating necessary classes

class Head(nn.Module):
    # one head of self-attention

    def __init__(self, headSize):
        super().__init__()
        self.key = nn.Linear(nEmbd, headSize, bias=False)
        self.query = nn.Linear(nEmbd, headSize, bias=False)
        self.value = nn.Linear(nEmbd, headSize, bias=False)
        self.register_buffer("tril", torch.tril(
            torch.ones(blockSize, blockSize)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, headSize)
        q = self.query(x)  # (B, T, headSize)

        # computing attention weights or affinities, scaled by 1/sqrt(headSize)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**(-0.5)
        # (B, T, headSize) @ (B, headSize, T) --> (B, T, T)

        wei = wei.masked_fill(
            self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        # causal (decoder-style) mask: it prevents a token from attending to future tokens;
        # without it, every token could attend to both past and future tokens

        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        # computing weighted sum of the values
        v = self.value(x)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    # multiple heads of self-attention in parallel

    def __init__(self, numHeads, headSize):
        super().__init__()
        self.heads = nn.ModuleList([Head(headSize)
                                    for _ in range(0, numHeads)])
        self.proj = nn.Linear(nEmbd, nEmbd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)

        return out


class FeedForward(nn.Module):
    # simple linear layer followed by non-linearity

    def __init__(self, nEmbd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(nEmbd, 4 * nEmbd),
            nn.ReLU(),
            nn.Linear(4 * nEmbd, nEmbd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    # transformer block: communication followed by computation

    def __init__(self, nEmbd, nHead):
        # nEmbd: embedding dimension, nHead: number of heads we'd like
        super().__init__()
        headSize = nEmbd // nHead
        # nHead heads of headSize dimensional self-attention
        self.sa = MultiHeadAttention(nHead, headSize)
        self.ffwd = FeedForward(nEmbd)
        self.ln1 = nn.LayerNorm(nEmbd)  # layer norm 1
        self.ln2 = nn.LayerNorm(nEmbd)  # layer norm 2

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))

        return x


class BigramLanguageModel(nn.Module):
    # character-level transformer language model (despite the class name, it attends over the full context)

    def __init__(self):
        super().__init__()
        # token and position embeddings, each of dimension nEmbd
        self.tokenEmbeddingTable = nn.Embedding(vocabSize, nEmbd)
        self.positionEmbeddingTable = nn.Embedding(blockSize, nEmbd)
        self.blocks = nn.Sequential(
            *[Block(nEmbd, nHead=nHead) for _ in range(0, nLayer)])
        self.lnF = nn.LayerNorm(nEmbd)  # final layer norm (a second positional argument would be read as eps)
        self.lmHead = nn.Linear(nEmbd, vocabSize)

    def forward(self, idx, targets=None):
        # idx & targets are both (B, T) tensors of integers
        B, T = idx.shape

        tokEmb = self.tokenEmbeddingTable(idx)  # (B, T, C)
        posEmb = self.positionEmbeddingTable(
            torch.arange(T, device=idx.device))  # (T, C)
        x = tokEmb + posEmb  # (B, T, C)
        x = self.blocks(x)  # transformer blocks, (B, T, C)
        x = self.lnF(x)  # (B, T, C)
        logits = self.lmHead(x)  # (B, T, vocabSize)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    # autoregressive sampling: each new token is predicted from up to blockSize previous tokens
    def generate(self, idx, maxNewTokens):
        # idx is a (B, T) array of indices of the current context
        for _ in range(0, maxNewTokens):
            # cropping idx to the last blockSize tokens
            idxCond = idx[:, -blockSize:]
            # getting the predictions
            logits, loss = self(idxCond)
            # focusing only on the last time step, whose logits predict the next token
            logits = logits[:, -1, :]
            # applying softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            # sampling from a multinomial distribution
            idxNext = torch.multinomial(probs, num_samples=1)
            # appending the predicted index to a running vector
            idx = torch.cat((idx, idxNext), dim=1)

        return idx

The cell defines several classes necessary for the language model:

1. `Head`: Represents one head of self-attention.
2. `MultiHeadAttention`: Contains multiple heads of self-attention in parallel.
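3. `FeedForward`: Implements two linear layers with a ReLU non-linearity in between, followed by dropout.
4. `Block`: Represents a transformer block: self-attention (communication) followed by a feed-forward layer (computation), each preceded by layer normalization and wrapped in a residual connection.
5. `BigramLanguageModel`: The full language model. Despite its name, it is a decoder-only transformer that conditions on up to `blockSize` previous characters, not only the last one.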

Each class has a `forward` method that defines the forward pass of the model. Here's a brief summary of each class:

- `Head`:
  - `__init__(self, headSize)`: Initializes the head with linear layers for key, query, and value, as well as dropout and a buffer for the lower triangular mask.
  - `forward(self, x)`: Computes self-attention using key, query, and value and returns the output.

- `MultiHeadAttention`:
  - `__init__(self, numHeads, headSize)`: Initializes multiple heads of self-attention with the specified number of heads and head size.
  - `forward(self, x)`: Runs each head on the input, concatenates their outputs along the last dimension, then applies the output projection and dropout.

- `FeedForward`:
  - `__init__(self, nEmbd)`: Initializes a simple feed-forward module with linear layers and dropout.
  - `forward(self, x)`: Passes the input through the linear layers and returns the output.

- `Block`:
  - `__init__(self, nEmbd, nHead)`: Initializes a transformer block with self-attention and feed-forward layers.
  - `forward(self, x)`: Performs self-attention and feed-forward computations on the input and returns the output.

- `BigramLanguageModel`:
  - `__init__(self)`: Initializes the model with token and position embedding tables, a stack of transformer blocks, a final layer norm, and a linear output head.
  - `forward(self, idx, targets=None)`: Computes the forward pass of the model, returning the logits and, if targets are provided, the cross-entropy loss (otherwise `None`).
  - `generate(self, idx, maxNewTokens)`: Generates `maxNewTokens` new tokens one at a time, feeding each sampled token back into the context.

These classes provide the necessary building blocks for the language model architecture, including self-attention, feed-forward layers, and transformer blocks.
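The masking step inside `Head.forward` is the part that makes this a decoder. The following minimal sketch isolates it, assuming placeholder affinities of all zeros and a sequence length of 4 (both chosen only for the demo), so the effect of the mask on the softmax is easy to read off.

```python
import torch
import torch.nn.functional as F

T = 4                                   # toy sequence length
scores = torch.zeros(T, T)              # placeholder affinities (all equal)
mask = torch.tril(torch.ones(T, T))     # lower-triangular: 1 = allowed, 0 = future

# Future positions get -inf, so softmax assigns them a weight of exactly 0.
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)     # each row sums to 1
print(weights)
```

With equal affinities, row `t` spreads its weight evenly over positions `0..t` and puts nothing on later positions. In the real model the affinities come from `q @ k.transpose(-2, -1)`, so the weights within the allowed region are no longer uniform.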

In [None]:
xb, yb = getBatch("train")
# the below code is to analyse & interpret the inputs and targets
# print(f"***Inputs***\nShape: {xb.shape}\n{xb}")
# print(f"***Targets***\nShape: {yb.shape}\n{yb}")

# for b in range(batchSize):
#     for t in range(blockSize):
#         context = xb[b, : t + 1]
#         target = yb[b, t]
#         print(f"when input is {context.tolist()} the target: {target}")

The cell defines a snippet that generates a batch of inputs and targets using the `getBatch` function. It includes commented-out code that can be used to analyze and interpret the inputs and targets.

The code initializes `xb` and `yb` by calling `getBatch` with the argument "train" to obtain a batch of inputs and targets for training.

The commented-out code provides an example of how to analyze and interpret the inputs and targets. It shows how to print the shape and contents of the inputs and targets. Additionally, it includes nested loops to iterate over each element in the batch and timestep, printing the context (input sequence) and target for each timestep.

This code can be useful for understanding the structure and content of the inputs and targets in the training batch.

In [None]:
model = BigramLanguageModel()
# m = model.to(device)

The cell creates an instance of the `BigramLanguageModel` class (a transformer, despite the name) and assigns it to the variable `model`.

The commented-out line shows how the model could be moved to a specific device with the `to` method, where `device` would hold the target device (e.g., "cuda" for GPU or "cpu" for CPU). It is disabled here, so the model stays on the default device.

In [None]:
# function for estimating losses
@torch.no_grad()
def estimateLoss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(evalIters)
        for k in range(0, evalIters):
            X, y = getBatch(split)
            logits, loss = model(X, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()

    return out

The cell defines a function named `estimateLoss` for estimating losses during evaluation.

The function is decorated with `@torch.no_grad()` to ensure that no gradients are computed during the evaluation.

Inside the function:
- The dictionary `out` is initialized to store the loss values.
- The model is set to evaluation mode using `model.eval()`.
- For each split (either "train" or "val"):
  - A tensor `losses` is created to store the losses for the current split.
  - For `evalIters` number of iterations:
    - Input sequences `X` and target sequences `y` are obtained using the `getBatch` function.
    - The model is called with `X` and `y` to obtain logits and loss.
    - The loss value is extracted using `loss.item()` and stored in the `losses` tensor.
  - The mean of the `losses` tensor is calculated and assigned to the `split` key in the `out` dictionary.
- Finally, the model is set back to train mode using `model.train()`, and the `out` dictionary containing the mean losses for each split is returned.

This function is useful for estimating losses during evaluation, providing insights into the performance of the model on the training and validation splits.

In [None]:
# preliminary analysis before training
print("***Preliminary analysis***")
logits, loss = model(xb, yb)
print(f"{logits.shape=}, {loss=}")  # expected loss ≈ -ln(1/65) = ln(65) ≈ 4.17

print(decode(model.generate(idx=torch.zeros(
    (1, 1), dtype=torch.long), maxNewTokens=100)[0].tolist()))

The cell performs preliminary analysis before training the `model`. It includes two print statements to analyze the shape of the `logits` tensor and the value of the `loss` variable. Additionally, it generates a sequence of text using the `generate` method of the `model` and prints the decoded text.

The first print statement displays the shape of the `logits` tensor and the value of the `loss` variable. It helps in understanding the dimensions of the `logits` tensor and the initial value of the loss.

The second print statement generates a sequence of text by calling the `generate` method of the `model`. It starts from an input tensor of shape `(1, 1)` filled with zeros (index 0, the first character in the sorted charset) and generates 100 new tokens. The result is decoded with the `decode` function and printed. Because the model is still untrained at this point, the output should look like random characters, which gives a baseline to compare against after training.
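The initial loss can be sanity-checked by hand: an untrained model that spreads probability roughly uniformly over the vocabulary should score a cross-entropy of about the natural log of the vocabulary size. The sketch below assumes a vocabulary of 65 characters, the figure used in the cell's comment.

```python
import math

vocab_size = 65  # assumption: charset size, as in the comment in the cell above
expected_loss = -math.log(1 / vocab_size)  # same as math.log(vocab_size)
print(f"{expected_loss:.2f}")  # 4.17
```

An initial loss far above this value would suggest a problem with initialization or with the loss computation.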

In [None]:
print("***Training begins***")

# creating an optimizer object
optimizer = torch.optim.AdamW(model.parameters(), lr=learningRate)

for iter in range(0, maxIters):
    # evaluating train & val loss once in a while
    if iter % evalInter == 0:
        losses = estimateLoss()
        print(
            f"{iter=}: train loss - {losses['train']:.4f}, val loss - {losses['val']:.4f}")

    # sampling a batch of data
    xb, yb = getBatch("train")

    # evaluating the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

The cell starts the training process for the model. It includes a loop that iterates over a specified number of iterations (`maxIters`). Within each iteration, it performs the following steps:

1. Every `evalInter` iterations, it calls the `estimateLoss` function to evaluate the training and validation loss. The current iteration number, training loss, and validation loss are printed.
2. It obtains a batch of training data by calling the `getBatch` function.
3. It calculates the logits and loss by calling the `model` with the input batch (`xb`) and target batch (`yb`).
4. It clears the gradients left over from the previous iteration using `optimizer.zero_grad(set_to_none=True)`, so they do not accumulate across steps.
5. It performs backpropagation by calling `loss.backward()` to compute the gradients.
6. It updates the model parameters by calling `optimizer.step()` to perform an optimization step.

The code initializes an optimizer object (`AdamW`) to optimize the model parameters. Then, in each iteration, it performs training steps, evaluates the loss periodically, and updates the model parameters using backpropagation and optimization.
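The same forward, zero-grad, backward, step sequence can be tried in isolation on a problem small enough to run instantly. The sketch below is not the notebook's model: it fits a single linear layer to a made-up line, and the `toy_` names, the data, and the learning rate are all assumptions for the demo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy problem: fit y = 2x + 1 with one linear layer.
inputs = torch.linspace(-1, 1, 32).unsqueeze(1)
targets = 2 * inputs + 1

toy_model = nn.Linear(1, 1)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.05)

initial_loss = F.mse_loss(toy_model(inputs), targets).item()
for _ in range(200):
    loss = F.mse_loss(toy_model(inputs), targets)  # forward pass
    toy_optimizer.zero_grad(set_to_none=True)      # clear old gradients
    loss.backward()                                # compute new gradients
    toy_optimizer.step()                           # update parameters
final_loss = loss.item()

print(f"{initial_loss:.4f} -> {final_loss:.4f}")
```

The loss should drop well below its starting value, which is the same behaviour to look for in the periodic train and validation losses printed by the real loop.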

In [None]:
# sampling from the model
context = torch.zeros((1, 1), dtype=torch.long)  # add device=device when running on a GPU
print(decode(model.generate(context, maxNewTokens=100)[0].tolist()))

The code cell generates a text sample from the trained model. It initializes a context tensor of shape (1, 1) containing zeros, then calls the `generate` method of the `model`, which appends 100 new tokens (characters) to this context, one at a time.

The generated text is then printed.
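Each step of `generate` turns the logits for the last position into probabilities and then samples one index from them. The sketch below reproduces that single step in plain Python so the arithmetic is visible; the logits are hypothetical values for a 4-character vocabulary, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

random.seed(0)

# Hypothetical logits for a 4-character vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.1, -1.0]

# Softmax: exponentiate (shifted by the max for numerical stability) and normalise.
peak = max(toy_logits)
exps = [math.exp(v - peak) for v in toy_logits]
probs = [e / sum(exps) for e in exps]

# Draw one index in proportion to its probability.
next_index = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_index)
```

Sampling, as opposed to always picking the most likely character, is what lets repeated runs produce different text from the same starting context.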