# Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this example is to train a simplified transformer model. The primary differences between this implementation and a production LLM are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we use a single consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.


In [None]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn
import torch.nn.functional as F

## Part 1: Bigram MLP for TinyShakespeare

Create a list `chars` that contains all unique characters in `text`

Implement `encode(s: str) -> list[int]`

Implement `decode(ids: list[int]) -> str`

Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one-hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs:
```
he
el
ll
lo
```
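
As a minimal sketch of how such pairs can be enumerated, the snippet below uses a hypothetical helper `bigram_pairs` (not part of the assignment) that zips a string with itself shifted by one character:

```python
def bigram_pairs(word: str) -> list[str]:
    """Return every pair of consecutive characters in `word`."""
    # zip stops at the shorter sequence, so the last character is never an input
    return [a + b for a, b in zip(word, word[1:])]

pairs = bigram_pairs("hello")
print(pairs)  # ['he', 'el', 'll', 'lo']
```

The first character of each pair is the model input and the second is the target.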

Implement `BigramOneHotMLP`, a 2-layer MLP that predicts the next token. Specifically, implement the constructor, `forward`, and `generate`. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`.

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.
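
To make the role of raw logits concrete, here is a small sketch (with made-up shapes and targets, chosen only for illustration) that compares `F.cross_entropy` against a manual computation. It assumes a PyTorch version that accepts class-probability targets, which is what the one-hot variant relies on:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)               # raw scores: 4 examples, 5 classes, no softmax applied
target_ids = torch.tensor([0, 2, 1, 4])  # class-index targets

# cross_entropy applies log_softmax internally, then averages the negative
# log-probability assigned to each correct class.
loss_builtin = F.cross_entropy(logits, target_ids)
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(4), target_ids].mean()

# One-hot (probability) targets give the same value for hard labels.
target_one_hot = F.one_hot(target_ids, num_classes=5).float()
loss_one_hot = F.cross_entropy(logits, target_one_hot)

print(loss_builtin.item(), loss_manual.item(), loss_one_hot.item())
```

Because the softmax happens inside the loss, applying an activation to the last layer before calling it would distort the result.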

Train the BigramOneHotMLP for 1000 steps.

Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.
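
The reason integer ids are enough for an embedding layer can be sketched as follows: looking up a row of `nn.Embedding` gives the same result as multiplying a one-hot vector by the embedding weight matrix. The sizes below are illustrative assumptions, not values from the assignment:

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
vocab_size, emb_dim = 5, 8               # illustrative sizes
embedding = nn.Embedding(vocab_size, emb_dim)

ids = torch.tensor([3, 0, 4])
one_hot = F.one_hot(ids, num_classes=vocab_size).float()

looked_up = embedding(ids)               # row lookup, shape (3, 8)
multiplied = one_hot @ embedding.weight  # same rows via a matrix product

print(looked_up.shape, torch.allclose(looked_up, multiplied))
```

The lookup avoids materializing the one-hot matrix, which matters once the vocabulary is large.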

Implement and train `BigramEmbeddingMLP`, a 2-layer MLP that predicts the next token. Specifically, implement the constructor, `forward`, and `generate` functions. The output dimension of the first layer should be 8. Use `torch.optim`.


In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-02-26 01:24:46--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-02-26 01:24:47 (18.5 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [None]:
chars = sorted(list(set(text)))  # implement

def encode(s: str) -> list[int]:
    return [chars.index(c) for c in s]

def decode(ids: list[int]) -> str:

    return ''.join([chars[i] for i in ids if i < len(chars)])

def create_one_hot_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]:
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    input_one_hot = torch.zeros((len(text)-1, len(chars)), dtype = torch.float32)
    output_one_hot = torch.zeros((len(text)-1, len(chars)), dtype = torch.float32)
    for i in range(len(text)-1):
        input_one_hot[i, char_to_idx[text[i]]] = 1
        output_one_hot[i, char_to_idx[text[i+1]]] = 1

    return input_one_hot, output_one_hot


inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()

class BigramOneHotMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.chars = chars  # keep the character set
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}  # character -> index mapping
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}  # index -> character mapping

        self.linear1 = nn.Linear(len(chars),8)
        self.linear2 = nn.Linear(8,len(chars))
        self.activation = nn.LeakyReLU()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=0.01,momentum = 0.9)
    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        # Generate text starting from the 'start' character
        result = [start]

        for _ in range(max_new_tokens):
            # Get the last character and encode+one-hot it
            last_char = result[-1]
            x = torch.zeros(len(chars))
            x[chars.index(last_char)] = 1

            # Forward pass to get logits
            with torch.no_grad():  # No need to track gradients during generation
                logits = self(x)

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=0)

            # Sample the next character from the probability distribution
            next_idx = min(torch.multinomial(probs, num_samples=1).item(),len(chars)-1)
            next_char = chars[next_idx]

            # Add to result
            result.append(next_char)

        return ''.join(result)



bigram_one_hot_mlp = BigramOneHotMLP()

# training loop
torch.cuda.empty_cache()
for step in range(1000):
    #zero gradients
    bigram_one_hot_mlp.optimizer.zero_grad()
    logits = bigram_one_hot_mlp(inputs_one_hot)
    loss = nn.functional.cross_entropy(logits, outputs_one_hot)

    loss.backward()
    bigram_one_hot_mlp.optimizer.step()
    if step % 100 == 0:
        print(f'step {step} loss {loss.item()}')



print(bigram_one_hot_mlp.generate())

step 0 loss 3.8161284923553467
step 100 loss 3.4338643550872803
step 200 loss 3.2288928031921387
step 300 loss 3.147360324859619
step 400 loss 3.1018383502960205
step 500 loss 3.060556173324585
step 600 loss 3.0190250873565674
step 700 loss 2.9771745204925537
step 800 loss 2.9342856407165527
step 900 loss 2.8903801441192627
aas e t pcfske
iteelit s revt wou:'i o: tirnaludetrot aW uelos r'
w
ar

abs
ito, pwode ureceve it zrs


In [None]:
def create_embedding_inputs_and_outputs() -> list[torch.Tensor]:
    # Get all consecutive character pairs
    input_chars = text[:-1]
    output_chars = text[1:]

    # Encode to IDs
    input_ids = torch.tensor(encode(input_chars))
    output_ids = torch.tensor(encode(output_chars))

    return [input_ids, output_ids]

input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()
print(f"Embedding input shape: {input_ids.shape}, Output shape: {outputs_one_hot.shape}")


class BigramEmbeddingMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.vocab_size = len(chars)

        # Embedding layer: vocab_size tokens -> 8-dimensional embeddings
        self.embedding = nn.Embedding(self.vocab_size, 8)

        # Output layer: 8 -> vocab_size
        self.fc = nn.Linear(8, self.vocab_size)
    def forward(self, x):
       # Convert IDs to embeddings
        embeddings = self.embedding(x)  # (batch_size, 8)

        # Get logits from embeddings
        logits = self.fc(embeddings)  # (batch_size, vocab_size)
        return logits

    def generate(self, start='a', max_new_tokens=100) -> str:
        # Start with the given character
        idx = torch.tensor([encode(start)[0]])
        result_indices = [idx.item()]

        # Generate one character at a time
        for _ in range(max_new_tokens):
            # Get logits from the model
            with torch.no_grad():
                logits = self.forward(idx)

            # Get the most likely next character
            idx = torch.argmax(logits, dim=1)
            result_indices.append(idx.item())

        # Convert indices back to characters
        return decode(result_indices)

bigram_embedding_mlp = BigramEmbeddingMLP()
print(f"Embedding model structure: {bigram_embedding_mlp}")

# Training loop
optimizer = torch.optim.Adam(bigram_embedding_mlp.parameters(), lr=0.01)
for step in range(1000):
    # Forward pass
    logits = bigram_embedding_mlp(input_ids)

    # Calculate loss
    loss = F.cross_entropy(logits, outputs_one_hot.long())

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print progress
    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")



print(bigram_embedding_mlp.generate())

Embedding input shape: torch.Size([999]), Output shape: torch.Size([999])
Embedding model structure: BigramEmbeddingMLP(
  (embedding): Embedding(46, 8)
  (fc): Linear(in_features=8, out_features=46, bias=True)
)
Step 0, Loss: 3.9985
Step 100, Loss: 2.3765
Step 200, Loss: 2.2096
Step 300, Loss: 2.1677
Step 400, Loss: 2.1492
Step 500, Loss: 2.1355
Step 600, Loss: 2.1244
Step 700, Loss: 2.1157
Step 800, Loss: 2.1093
Step 900, Loss: 2.1047
an the the the the the the the the the the the the the the the the the the the the the the the the th


## Part 2: Generative Pretrained Transformer


In [None]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Wed Feb 26 01:24:54 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string.
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string
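
One possible way to implement these three steps is sketched below. It uses dictionary lookups instead of `list.index`, which avoids a linear scan per character. The names `sample_text`, `stoi`, `itos`, `encode_fast`, and `decode_fast` are hypothetical and chosen only for this illustration:

```python
sample_text = "hello world"  # stand-in for the real corpus

vocab = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> character

def encode_fast(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode_fast(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode_fast("hello")
print(ids, decode_fast(ids))  # [3, 2, 4, 4, 5] hello
```

On the full TinyShakespeare text the difference is noticeable, since `encode` is called on more than a million characters.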


In [None]:
chars = sorted(list(set(text)))  # implement

def encode(s: str) -> list[int]:
    return [chars.index(c) for c in s]

def decode(ids: list[int]) -> str:
    return ''.join([chars[i] for i in ids])

In [None]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [None]:
block_size = 16
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [None]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,

In [None]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


### Single Self Attention Head
![](https://i.ibb.co/GWR1XG0/head.png)

In [None]:
class SelfAttentionHead(nn.Module):
    def __init__(self, n_embd,head_size,block_size,dropout = 0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.scaling = head_size ** -0.5
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        attention = q @ k.transpose(-2,-1) * self.scaling

        masked_attention = attention.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        masked_attention = F.softmax(masked_attention, dim=-1)
        masked_attention = self.dropout(masked_attention)
        out = masked_attention @ v
        return out
BATCH = 4
SEQ_LEN = 8
EMBED_DIM = 16
HEAD_SIZE = 8
BLOCK_SIZE = SEQ_LEN

x = torch.randn(BATCH, SEQ_LEN, EMBED_DIM)  # 4 samples, each with 8 tokens, each token 16-dimensional
attn = SelfAttentionHead(n_embd=EMBED_DIM, head_size=HEAD_SIZE, block_size=BLOCK_SIZE)
output = attn(x)

print(output.shape)

torch.Size([4, 8, 8])
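
To see what the causal mask in `SelfAttentionHead` does to the attention weights, here is a small standalone sketch. It skips the learned projections and uses random `q` and `k` tensors directly, which is an assumption made only to keep the example short:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8
q = torch.randn(T, head_size)
k = torch.randn(T, head_size)

scores = (q @ k.T) * head_size ** -0.5                 # scaled dot-product scores, shape (T, T)
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))  # hide future positions
weights = F.softmax(scores, dim=-1)

print(weights)
```

Each row sums to 1 and every entry above the diagonal is 0, so token `t` only mixes information from tokens `0..t`.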


### Multihead Self Attention

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)
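
The shape bookkeeping for the concatenation step can be sketched with random stand-ins for the per-head outputs. The sizes are illustrative assumptions:

```python
import torch

B, T, n_embd, num_heads = 2, 5, 16, 4
head_size = n_embd // num_heads   # 4 features per head

# Stand-ins for the per-head outputs, each of shape (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

concatenated = torch.cat(head_outputs, dim=-1)  # (B, T, num_heads * head_size)
proj = torch.nn.Linear(n_embd, n_embd)          # final linear projection
out = proj(concatenated)

print(concatenated.shape, out.shape)
```

Choosing `head_size = n_embd // num_heads` is what makes the concatenated width equal to `n_embd` again, so the output can be added back to the residual stream.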

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, num_heads, block_size, dropout=0.1):
        super().__init__()
        self.head_size = n_embd // num_heads
        self.num_heads = num_heads
        # SelfAttentionHead needs the embedding size, the head size, and the block size
        self.heads = nn.ModuleList(
            [SelfAttentionHead(n_embd, self.head_size, block_size, dropout) for _ in range(num_heads)]
        )
        # linear projection layer
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)


    def forward(self, x):
        head_outputs = torch.cat([head(x) for head in self.heads],dim = -1)
        out = self.proj(head_outputs)
        out = self.dropout(out)
        return out


## MLP
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [None]:
# implement
class MLP(nn.Module):
    def __init__(self,  input_dim=64, hidden_dim=256, output_dim=64, dropout_prob=0.5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # input size must match fc1's output size
        self.dropout = nn.Dropout(dropout_prob)


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x


## Transformer block

Layer normalization helps training stability by normalizing the outputs of neurons within a single layer across all features for each individual data point, not across a full batch or a specific feature.
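
As a rough check of that description, the sketch below compares `nn.LayerNorm` with a manual computation that takes the mean and variance over the feature dimension only. The tensor sizes and the shift and scale applied to `x` are arbitrary choices for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8) * 5 + 10   # (batch, tokens, features), deliberately off-center
ln = nn.LayerNorm(8)                # learnable scale starts at 1 and shift at 0

y = ln(x)

# Manual version: statistics are taken over the last (feature) dimension only.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, y_manual, atol=1e-5))
print(y.mean(dim=-1))  # close to 0 for every token
```

Since no statistic crosses the batch dimension, the result for one sequence does not depend on what else is in the batch.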

Dropout is a form of regularization to prevent overfitting.
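
A short sketch of how `nn.Dropout` behaves in training versus evaluation mode, using an all-ones input so the effect is easy to read off (the size and `p` are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # entries are either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
y_eval = drop(x)    # dropout is a no-op in eval mode

print(y_train.unique(), torch.equal(y_eval, x))
```

This is why `model.eval()` should be called before generating text: otherwise activations are still being dropped at random.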

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [None]:
class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, ff_hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        self.n_embd = n_embd
        self.n_head = n_head
        # Multi-head self-attention; batch_first=True so inputs are read as (B, T, C)
        self.attention = nn.MultiheadAttention(n_embd, num_heads=n_head, dropout=dropout, batch_first=True)
        # Layer normalization
        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, n_embd),
        )
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Causal mask: True marks future positions that a token must not attend to
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        # Self-attention + residual, then LayerNorm (post-LN ordering)
        attn_out, _ = self.attention(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        x = self.norm1(x)
        # Feed-forward + residual, then LayerNorm
        ff_out = self.ffn(x)
        x = x + self.dropout(ff_out)
        x = self.norm2(x)
        return x

## GPT

`constructor`
1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`.

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss, if targets is not None. If targets is None, return the logits and None.
1. get the token embeddings by using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layernorm layer, and the final linear layer
5. compute the loss
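
The steps above can be sketched without the transformer blocks to show the tensor shapes involved. All names and sizes here (`tok_emb_table`, `pos_emb_table`, `lm_head`, and so on) are hypothetical stand-ins rather than the required implementation:

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
B, T, vocab_size, n_embd, block_size = 2, 4, 10, 8, 16  # illustrative sizes

tok_emb_table = nn.Embedding(vocab_size, n_embd)
pos_emb_table = nn.Embedding(block_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size)

idx = torch.randint(0, vocab_size, (B, T))
targets = torch.randint(0, vocab_size, (B, T))

tok = tok_emb_table(idx)                 # (B, T, n_embd)
pos = pos_emb_table(torch.arange(T))     # (T, n_embd), broadcast over the batch
x = tok + pos                            # (B, T, n_embd); the blocks would go here

logits = lm_head(x)                      # (B, T, vocab_size)
# cross_entropy expects (N, C) logits and (N,) targets, so merge batch and time
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

print(logits.shape, loss.item())
```

Flattening to `(B * T, vocab_size)` is equivalent to permuting the logits to `(B, vocab_size, T)`; both put the class dimension where `cross_entropy` expects it.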

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str`
1. implement top-p, top-k, and temperature for sampling
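
One way to combine the three sampling controls is sketched below for a single 1-D logits vector. The helper name `filter_logits` and the example logits are assumptions made for illustration; a batched version inside `generate` would apply the same idea along the last dimension:

```python
import torch
import torch.nn.functional as F

def filter_logits(logits: torch.Tensor, temperature: float = 1.0,
                  top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """Return a filtered copy of 1-D `logits`; removed tokens are set to -inf."""
    logits = logits / temperature          # <1 sharpens, >1 flattens the distribution
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        # shift right so the first token that crosses top_p is still kept
        remove[1:] = remove[:-1].clone()
        remove[0] = False
        # argsort of the sort indices maps the mask back to the original token order
        logits = logits.masked_fill(remove[torch.argsort(sorted_idx)], float("-inf"))
    return logits

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
kept_k = filter_logits(logits, top_k=2)
kept_p = filter_logits(logits, top_p=0.8)
next_id = torch.multinomial(F.softmax(kept_k, dim=-1), num_samples=1)
print(kept_k, kept_p, next_id)
```

With these example logits, both `top_k=2` and `top_p=0.8` keep only the two most likely tokens, so the sampled id is always 0 or 1.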



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [None]:
class GPT(nn.Module):
    def __init__(self, n_embd, n_head,vocab_size,num_blocks,block_size,dropout= 0.1 ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.block_size = block_size
        # The third argument of Block is the feed-forward hidden size, not the block size
        self.blocks = nn.ModuleList([Block(n_embd, n_head, 4 * n_embd, dropout) for _ in range(num_blocks)])
        self.norm = nn.LayerNorm(n_embd)
        self.fc = nn.Linear(n_embd, vocab_size)
        self.dropout = nn.Dropout(dropout)

        self.register_buffer("position_ids", torch.arange(block_size).unsqueeze(0),persistent = False)
    def forward(self, idx, targets=None):
        print(f"Debug idx.shape = {idx.shape}")
        idx = idx.squeeze()
        print(f"Debug after squeeze = {idx.shape}")
        if idx.dim() == 1:
            idx = idx.unsqueeze(0)
        elif idx.dim() == 0:
            idx = idx.unsqueeze(0).unsqueeze(0)
        #print(f"Debug after unsqueeze = {idx.shape}")

        B, T = idx.shape
        position_embeddings = self.position_embedding(self.position_ids[:, :T])
        token_embeddings = self.token_embedding(idx)
        #T = idx.size(1)  # get the length of the input sequence
        #position_embeddings = self.position_embedding(torch.arange(T, device=idx.device))

        # make sure position_embeddings has the same shape as token_embeddings
        if position_embeddings.shape[0] != B:
             position_embeddings = position_embeddings.expand(B, -1, -1)  # expand to (B, T, n_embd)
        x = token_embeddings + position_embeddings
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.fc(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.permute(0, 2, 1), targets)


        return logits, loss

    @torch.no_grad()
    def generate(self, start_char, max_new_tokens, top_p, top_k, temperature):
        # encode() returns a list of ids, so wrapping it once gives shape (1, len(start_char))
        idx = torch.tensor([encode(start_char)], dtype=torch.long, device=self.token_embedding.weight.device)


        for _ in range(max_new_tokens):
            idx = idx[:, -self.block_size:]
            logits, _ = self.forward(idx)
            logits = logits[:, -1, :] / temperature  # Scale by temperature

            # Top-k filtering
            if top_k > 0:
                values, indices = torch.topk(logits, top_k)
                logits[logits < values[:, [-1]]] = float('-inf')

            # Top-p (nucleus) filtering
            if top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                sorted_indices_to_remove = cumulative_probs > top_p

                sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
                sorted_indices_to_remove[:, 0] = False
                logits.scatter_(1, sorted_indices, torch.where(sorted_indices_to_remove, float('-inf'), sorted_logits))

            probs = F.softmax(logits, dim=-1)
            probs[torch.isnan(probs)] = 0
            next_token = torch.multinomial(probs, num_samples=1)
            next_token = next_token.unsqueeze(0) if next_token.dim() == 1 else next_token
            print(f"Debug before cat: idx.shape={idx.shape}, next_token.shape={next_token.shape}")

            idx = torch.cat([idx, next_token], dim=1)

        return idx.squeeze(0).tolist()



In [None]:
vocab_size = len(chars)
block_size = 16
device = torch.device("cpu")  # Change device to CPU
model = GPT(n_embd=256, n_head=8, vocab_size=vocab_size, num_blocks=6, block_size=block_size).to(device)
idx = torch.randint(0, vocab_size, (2, 10), dtype=torch.long).to(device)
logits, loss = model(idx)
print(f"logits.shape = {logits.shape}, loss = {loss}")

Debug idx.shape = torch.Size([2, 10])
Debug after squeeze = torch.Size([2, 10])
logits.shape = torch.Size([2, 10, 65]), loss = None


### Training loop

Implement the training loop.

In [None]:
model = GPT(n_embd = 64, n_head = 4, vocab_size = len(chars), num_blocks=4, block_size=16,dropout=0.1).to('cuda')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
max_iters = 5000


for iter in range(max_iters):
    x,y = get_batch()
    x,y = x.to('cuda'), y.to('cuda')
    #forward pass
    logits, loss = model(x,y)
    #backward pass
    optimizer.zero_grad()
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    # Optimizer step
    optimizer.step()
    # Logging every 100 iterations
    if iter % 100 == 0:
        print(f"Iteration {iter}/{max_iters}, Loss: {loss.item():.4f}")
print("Completed.")

Streaming output truncated; only the last 5000 lines are shown.
Debug after squeeze = torch.Size([64, 16])
Debug idx.shape = torch.Size([64, 16])

### Generate text


Print some text that your model generates.

In [None]:
# eval mode on
model.eval()

start_char = "A"
# using GPT to generate text
generated_text = model.generate(
    start_char=start_char,
    max_new_tokens=500,  # 500 max new tokens
    top_p=0.9,           # nucleus sampling
    top_k=40,            # top-k sampling
    temperature=0.8
)

print(f"Generated text starting with '{start_char}':\n")
print(generated_text)
decoded_text = decode(generated_text)
print(decoded_text)
print(f"len(chars) = {len(chars)}")
print(f"max token id = {max(generated_text)}")

Debug idx.shape = torch.Size([1, 1])
Debug after squeeze = torch.Size([])
Debug before cat: idx.shape=torch.Size([1, 1]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 2])
Debug after squeeze = torch.Size([2])
Debug before cat: idx.shape=torch.Size([1, 2]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 3])
Debug after squeeze = torch.Size([3])
Debug before cat: idx.shape=torch.Size([1, 3]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 4])
Debug after squeeze = torch.Size([4])
Debug before cat: idx.shape=torch.Size([1, 4]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 5])
Debug after squeeze = torch.Size([5])
Debug before cat: idx.shape=torch.Size([1, 5]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 6])
Debug after squeeze = torch.Size([6])
Debug before cat: idx.shape=torch.Size([1, 6]), next_token.shape=torch.Size([1, 1])
Debug idx.shape = torch.Size([1, 7])
Debug afte