# Tutorial 1: nanoGPT - Word-Level Text Generation

Welcome to the nanoGPT tutorial! In this notebook, we will train a simple word-level language model to generate text.

**Goal:** Understand how to train a small transformer model for text generation on a CPU.

**Key Concepts:**
- Word-level tokenization
- Preparing data for language modeling
- A basic PyTorch training loop
- Generating text with the trained model

## 1. Setup and Imports

First, let's import the necessary libraries. We'll need `torch` for building and training our model.

In [1]:
import torch # PyTorch: a library for deep learning
from torch.nn import functional as F # functional: a module for built functions
from transformers import GPT2LMHeadModel, GPT2Config # transformers: a library for untrained and pre-trained models
import random # random: a module for generating random numbers
import numpy as np # numpy: a library for numerical computing

# For reproducibility: set the random seed to 1
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

<torch._C.Generator at 0x10ecceb90>

## 2. Configuration

We'll define some hyperparameters for our model and training process. Since we're running on CPU and want this to be quick for a tutorial, these values will be small.

In [2]:
# Hyperparameters
batch_size = 32       # How many independent sequences will we process in parallel 
block_size = 10      # What is the maximum context length for predictions?
max_iters = 5000      # Total training iterations
eval_interval = 500   # How often to evaluate on validation set
learning_rate = 5e-4  # Learning rate for the optimizer
device = 'cpu'        # Explicitly set to CPU
eval_iters = 100      # Number of iterations for evaluation
n_embd = 128          # Embedding dimension (reduced for CPU)
n_head = 4            # Number of attention heads (reduced for CPU)
n_layer = 4           # Number of transformer blocks (reduced for CPU)
dropout = 0.0          # Dropout rate>0 to prevent overfitting

## 3. Data Preparation

We'll use a small text file as our dataset. For this tutorial, let's use a snippet of Shakespeare's writings.

### 3.1 Load Data

In [3]:
# We'll use a small snippet of text for this tutorial.
# You can replace this with the path to a larger .txt file if you wish.
text = """To stale 't a little more.

First Citizen:
Well, I'll hear it, sir: yet you must not think to
fob off our disgrace with a tale: but, an 't please
you, deliver.

MENENIUS:
There was a time when all the body's members
Rebell'd against the belly, thus accused it:
That only like a gulf it did remain
I' the midst o' the body, idle and unactive,
Still cupboarding the viand, never bearing
Like labour with the rest, where the other instruments
Did see and hear, devise, instruct, walk, feel,
And, mutually participate, did minister
Unto the appetite and affection common
Of the whole body. The belly answer'd--

First Citizen:
Well, sir, what answer made the belly?

MENENIUS:
Sir, I shall tell you. With a kind of smile,
Which ne'er came from the lungs, but even thus--
For, look you, I may make the belly smile
As well as speak--it tauntingly replied
To the discontented members, the mutinous parts
That envied his receipt; even so most fitly
As you malign our senators for that
They are not such as you."""

### 3.2 Word-level Tokenization

Since this is a word-level model, our vocabulary will consist of all unique words present in the text. We'll create mappings from words to integers (encode) and integers to words (decode).

In [4]:
# Get all unique words in the text by regular expression
import re 

# Simple word tokenization - split on whitespace and punctuation
words = re.findall(r'\b\w+\b|[^\w\s]', text.lower())
unique_words = sorted(list(set(words)))
vocab_size = len(unique_words)

print("Sample words:", unique_words[:20])
print(f"Vocabulary size: {vocab_size}")

Sample words: ["'", ',', '-', '.', ':', ';', '?', 'a', 'accused', 'affection', 'against', 'all', 'an', 'and', 'answer', 'appetite', 'are', 'as', 'bearing', 'belly']
Vocabulary size: 123


In [5]:
# Create a mapping from words to integers and vice-versa
stoi = {word: i for i, word in enumerate(unique_words)}
itos = {i: word for i, word in enumerate(unique_words)}

def encode(s):
    """Encoder: take a string, output a list of integers"""
    words = re.findall(r'\b\w+\b|[^\w\s]', s.lower())
    return [stoi[word] for word in words if word in stoi]

def decode(l):
    """Decoder: take a list of integers, output a string"""
    words = [itos[i] for i in l]
    # Simple reconstruction - join with spaces, handle punctuation
    result = ""
    for i, word in enumerate(words):
        if word in ".,!?;:-'":
            result += word
        else:
            result += " " + word
    return result

# Test encoding and decoding
test_text = "what answer made you"
encoded = encode(test_text)
decoded = decode(encoded)
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

Original: what answer made you
Encoded: [115, 14, 56, 122]
Decoded:  what answer made you


#### Explanation: Tokenizer and Tokens

A **tokenizer** is a crucial component in Natural Language Processing (NLP). Its primary job is to break down a piece of text (like a sentence or a paragraph) into smaller units called **tokens**. These tokens can be words, sub-words, or even individual characters, depending on the type of tokenizer used.

Think of it like this: computers don't understand words and sentences directly. They understand numbers. So, a tokenizer first segments the text and then, in conjunction with a vocabulary mapping (like our `stoi` and `itos` dictionaries), converts these tokens into numerical representations that a machine learning model can process.

In this notebook, we are using a **word-level tokenizer**. This means each token is a unique word or punctuation mark found in our text.
- The `re.findall(r'\b\w+\b|[^\w\s]', text.lower())` line is our simple tokenizer. It uses regular expressions to find sequences of word characters (`\w+`) or any character that is not a word character or whitespace (`[^\w\s]`). This helps us capture both words and punctuation marks as separate tokens.
- `stoi` (string to integer) maps each unique token (word/punctuation) to a unique integer.
- `itos` (integer to string) does the reverse, mapping integers back to their original tokens.

This process is fundamental because it allows the model to learn patterns and relationships between these numerical representations of words.

### 3.3 Create Training and Validation Splits

We'll split our dataset into a training set and a validation set. The model learns from the training set, and we use the validation set to check how well it's generalizing.

In [6]:
# Encode the entire text dataset and store it into a torch.Tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data shape: {data.shape}, Data type: {data.dtype}")
print(f"Sample encoded text: {data[:20].tolist()}")
print(f"Sample decoded text: {decode(data[:20].tolist())}")

# Split up the data into train and validation sets
n = int(0.5*len(data)) # first 50% will be train, rest val
train_data = data[:n]
val_data = data[n:]

print(f"Training data length: {len(train_data)} words")
print(f"Validation data length: {len(val_data)} words")

Data shape: torch.Size([238]), Data type: torch.int64
Sample encoded text: [108, 94, 0, 97, 7, 52, 64, 3, 36, 23, 4, 114, 1, 44, 0, 53, 42, 48, 1, 90]
Sample decoded text:  to stale' t a little more. first citizen: well, i' ll hear it, sir
Training data length: 119 words
Validation data length: 119 words


### 3.4 Data Loader

We need a way to feed data to our model in batches. The `get_batch` function will randomly sample `batch_size` chunks of `block_size` length from the data.

In [7]:
def get_batch(split): #  Data loading function, we would generate (X,Y)'s from the text just like normal ML training data
    # Generate a small batch of data of inputs x and targets y
    # X: the context of the words as the input
    # Y: the next word as the target, Y[i] is the next word of X[i]
    data = train_data if split == 'train' else val_data
    # batch size: the number of training examples in a batch for speeding up the training process
    # block size: the maximum number of words in the context (X)
    # If the block size is too large, there would be a large computational cost; if the block size is too small, the model would not be able to capture the context of the words
    ix = torch.randint(len(data) - block_size, (batch_size,)) # randomly select the starting point of the context with the length of block_size
    x = torch.stack([data[i:i+block_size] for i in ix]) # X: the context of the words as the input
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # Y: the next word as the target, Y[i] is the next word of X[i]
    x, y = x.to(device), y.to(device)
    return x, y

In [8]:
xb, yb = get_batch('train')
# Example of a batch
print("Example of a batch:")
for b in range(min(1, batch_size)): 
    print(f"Context (X): {decode(xb[b].tolist())}")
    print(f"Target (Y): {decode(yb[b].tolist())}")


print("What happens when we train the model: each (X,Y) would be broken down into a sequence of (X'_t,Y'_t) pairs as the the real training data")
print("X': the context of the words as the input X'_t=X[:t+1] for each t")
print("Y': the next word as the target, Y'_t=Y[t] is the next word of X[:t+1] for each t")
for b in range(min(1, batch_size)):  # Show1 examples
    for t in range(min(8, block_size)):  # Show first 8 positions
        context_words = decode(xb[b, :t+1].tolist())
        target_word = itos[yb[b,t].item()]
        print(f"Context (X'_{t}): '{context_words}' -> Target (Y'_{t}): '{target_word}'")
        if t == 7:  # Only show first 8 examples
            break
    print()
    if b == 1:  # Only show for first 2 batch items
        break

Example of a batch:
Context (X):  instruments did see and hear, devise, instruct,
Target (Y):  did see and hear, devise, instruct, walk
What happens when we train the model: each (X,Y) would be broken down into a sequence of (X'_t,Y'_t) pairs as the the real training data
X': the context of the words as the input X'_t=X[:t+1] for each t
Y': the next word as the target, Y'_t=Y[t] is the next word of X[:t+1] for each t
Context (X'_0): ' instruments' -> Target (Y'_0): 'did'
Context (X'_1): ' instruments did' -> Target (Y'_1): 'see'
Context (X'_2): ' instruments did see' -> Target (Y'_2): 'and'
Context (X'_3): ' instruments did see and' -> Target (Y'_3): 'hear'
Context (X'_4): ' instruments did see and hear' -> Target (Y'_4): ','
Context (X'_5): ' instruments did see and hear,' -> Target (Y'_5): 'devise'
Context (X'_6): ' instruments did see and hear, devise' -> Target (Y'_6): ','
Context (X'_7): ' instruments did see and hear, devise,' -> Target (Y'_7): 'instruct'



#### Explanation: `get_batch` - Why X and Y have the same length

In the `get_batch` function, both the input `x` and the target `y` are sequences of `block_size` length. This might seem counterintuitive at first, as we are trying to predict the *next* word. Let's clarify:

The task of a language model is to predict the next token in a sequence, given the preceding tokens.
- `x` represents the **input context**: a sequence of words the model sees.
- `y` represents the **target output**: for each position in `x`, `y` contains the word that *immediately follows* it in the original text.

Consider a `block_size` of 10.
If `x` is `[word_1, word_2, ..., word_10]`,
then `y` will be `[word_2, word_3, ..., word_11]`.

So, for each element `x[i]` in the input sequence (at a specific time step `t` within the block), the corresponding `y[i]` is the actual next word that the model should have predicted.

For example, when the model sees:
- `x[0]` (i.e., `word_1`), it tries to predict `y[0]` (i.e., `word_2`).
- `x[0:1]` (i.e., `word_1, word_2`), it tries to predict `y[1]` (i.e., `word_3`).
- ...
- `x[0:9]` (i.e., `word_1, ..., word_10`), it tries to predict `y[9]` (i.e., `word_11`).

The model makes predictions for *every position* in the `block_size`. The loss function then compares all these predictions with the actual next words in `y`. This way, a single block of data provides `block_size` individual training examples for the model to learn from. The "context" for predicting `y[t]` is `x[0...t]`.

## 4. Model Definition (Simplified nanoGPT Style)

Now we'll build a simplified version of the GPT model.

In [9]:
# Define the GPT-2 model configuration using transformers library
config = GPT2Config(
    vocab_size=vocab_size, # the number of unique words in the vocabulary, i.e., unique classes for the output layer
    n_positions=block_size,  # the maximum number of words in the context (X)
    n_embd=n_embd, # the number of embedding dimensions
    n_layer=n_layer, # the number of layers
    n_head=n_head, # the number of attention heads
    resid_pdrop=dropout, # the dropout rate for the residual connections
    embd_pdrop=dropout, # the dropout rate for the embedding layer
    attn_pdrop=dropout, # the dropout rate for the attention layer
    bos_token_id=None,  # No beginning of sentence token for word model
    eos_token_id=None   # No end of sentence token for word model
)

# Instantiate the model: we use GPT2LMHeadModel, which is an (untrained) GPT-2 model for language modeling/next-token prediction
model = GPT2LMHeadModel(config)
model.to(device)

# Print the number of parameters in the model
print(f"{sum(p.numel() for p in model.parameters())/1e6:.2f} M parameters")

0.81 M parameters


#### (Optional) Explanation: Why No BOS/EOS Tokens? 

You might notice `bos_token_id` (Beginning of Sequence) and `eos_token_id` (End of Sequence) are set to `None`. This is a deliberate simplification for this tutorial.

**Why They Are Crucial for Real LLMs:**
- **To Define Boundaries:** In massive datasets with millions of documents, `BOS` and `EOS` tell the model where one document ends and another begins, preventing it from nonsensically connecting unrelated texts.
- **To Control Generation:** A real chatbot or assistant needs to know when to stop talking. It learns to generate an `EOS` token when its response is complete.
- **For Task Formatting:** These tokens separate the user's prompt from the model's desired output, helping the model learn the structure of a conversation or task.

**Why We Don't Need Them Here:**
- **Continuous Data:** We train on a single, continuous stream of text. The model isn't learning from separate documents where boundaries are important.
- **Controlled Generation:** We manually start generation with a prompt and stop it after a fixed number of words. We don't rely on the model to decide when to stop.

#### (Optional) How `dropout` helps as a regularization technique:
Dropout is a simple yet effective regularization technique to combat overfitting in neural networks. Here's how it works during training:
1.  At each training step, for every neuron in a layer where dropout is applied, it is "dropped out" (i.e., temporarily removed from the network) with a certain probability (`dropout` rate).
2.  This means the neuron, along with all its incoming and outgoing connections, is ignored during the forward pass and backward pass for that particular training iteration.
3.  The choice of which neurons to drop is random for each training iteration and each input example.

**Why does this help?**
-   **Prevents Co-adaptation:** Neurons become less reliant on specific other neurons because they can't be sure which ones will be active. This encourages each neuron to learn more robust features that are useful in conjunction with different random subsets of other neurons.
-   **Ensemble Effect (Implicit):** Training with dropout can be seen as training a large number of thinned networks (networks with different subsets of neurons). At test time (when dropout is turned off), using the full network can be viewed as an approximation of averaging the predictions of these many thinned networks.

## 5. Training the Model

### 5.1 Optimizer
We'll use the AdamW optimizer, a common choice for training transformer models.

In [10]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

### 5.2 Loss Estimation
We need a function to estimate the loss on the training and validation sets without calculating gradients, which is useful for monitoring training progress.

In [11]:
@torch.no_grad() # Decorator to disable gradient calculation 
def estimate_loss():
    out = {}
    model.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters) #eval_iters: the number of iterations for evaluation
        for k in range(eval_iters):
            X, Y = get_batch(split)
            outputs = model(X, labels=Y)
            loss = outputs.loss # loss: the loss of the model, i.e., the difference between the predicted and the actual labels values. This is automatically calculated by the model.
            losses[k] = loss.item() # loss.item(): convert the loss to a scalar value
        out[split] = losses.mean()
    model.train() # Set model back to training mode
    return out

### 5.3 Training Loop
This is the main loop where the model learns. For each iteration, we:
1. Get a batch of data.
2. Perform a forward pass (get model predictions).
3. Calculate the loss (how wrong the predictions are).
4. Perform a backward pass (calculate gradients).
5. Update the model's parameters using the optimizer.

In [12]:
print(f"Training on {device}...")

for iter_num in range(max_iters):

    # Evaluation part
    if iter_num % eval_interval == 0 or iter_num == max_iters - 1: 
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Training part
    xb, yb = get_batch('train') # 1. Sample a batch of data for training

    
    outputs = model(xb, labels=yb) # 2. Foward pass: the model predicts the next word of the context
    loss = outputs.loss # 3. Calculate the loss of the model
    
    #4 and 5. Backward pass and optimization step
    optimizer.zero_grad(set_to_none=True) # Zero out gradients from previous step
    loss.backward() # Compute gradients for the current batch
    optimizer.step() # Update model parameters

print("Training finished!")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Training on cpu...
step 0: train loss 4.8432, val loss 4.8423
step 500: train loss 0.1071, val loss 6.8329
step 1000: train loss 0.0881, val loss 7.4115
step 1500: train loss 0.0837, val loss 7.8524
step 2000: train loss 0.0821, val loss 8.1622
step 2500: train loss 0.0795, val loss 8.4942
step 3000: train loss 0.0800, val loss 8.7935
step 3500: train loss 0.0789, val loss 8.9024
step 4000: train loss 0.0810, val loss 8.9835
step 4500: train loss 0.0770, val loss 9.2195
step 4999: train loss 0.0838, val loss 9.3415
Training finished!


#### (Optional) Explanation: The Training Loop - How a Model Learns

The training loop is the core process where the model "learns" by repeatedly guessing, checking how wrong its guess is, and then slightly improving its guess. To understand this, let's use a simple analogy: **finding the average of a list of numbers** using Gradient Descent.

Imagine our "dataset" is the list of numbers `[1, 2, 3, 4, 5]`. The true average is 3.
Our "model" is not a complex LLM, but just a single parameter: our current **guess** for the average, which we'll call `m`.

The training process follows four key steps, repeated over and over.

**Step 1: Initialize**
First, we make a random guess. Let's say we initialize our model with `m = 0`. This is the equivalent of the LLM's millions of parameters starting as random numbers.

**Step 2: The Forward Pass (Predict & Calculate Loss)**
We see how wrong our guess is. We take a data point (e.g., the number `4`), compare it to our prediction (`m=0`), and calculate the error, or **loss**. A common loss function is the squared error.

-   **Prediction:** Our current guess, `0`.
-   **Actual Value:** The data point, `4`.
-   **Loss:** `(prediction - actual)^2 = (0 - 4)^2 = 16`.
-   *LLM Equivalent:* This is `outputs = model(xb, labels=yb)`. The model takes an input `xb`, makes a prediction, and the loss is calculated by comparing it to the correct answer `yb`. This entire process is the **Forward Pass**.

**Step 3: The Backward Pass (Calculate Gradients)**
Now for the magic: how do we improve our guess? The **gradient** tells us the direction of the error. For our loss `(m - 4)^2`, the gradient is `2 * (m - 4)`.

-   **Gradient:** `2 * (0 - 4) = -8`.
-   **What this means:** The negative sign tells us we need to *increase* `m` to reduce the error. The magnitude (`8`) tells us we are far from the correct answer.
-   *LLM Equivalent:* `loss.backward()` does this for every single one of the model's millions of parameters. It calculates whether each parameter should be nudged up or down to reduce the final loss. This is the **Backward Pass**.

**Step 4: The Update Step (Adjust the Model)**
We use the gradient to update our guess. We take a small step in the right direction, controlled by a `learning_rate` (e.g., 0.1).

-   **Update Rule:** `new_m = old_m - learning_rate * gradient`
-   **New Guess:** `m = 0 - 0.1 * (-8) = 0.8`.
-   Our new guess, 0.8, is much closer to the true average of 3. We have learned!
-   *LLM Equivalent:* `optimizer.step()` performs this update for all parameters. (AdamW is just a more advanced version of this same principle).

The **training loop** simply repeats these three steps (Forward Pass, Backward Pass, Update) thousands of times. With each iteration, the model's parameters get slightly less wrong, until they converge to values that produce accurate and coherent text.

#### Explanation: The Overfitting Issue

**Overfitting** is a common problem in machine learning. It occurs when a model learns the training data *too well*, including its noise and specific details, to the point where it performs poorly on new, unseen data (like our validation set or real-world data).

Imagine a student who memorizes the answers to all the questions in a specific textbook (the training data) but doesn't understand the underlying concepts. When given a new exam with slightly different questions (the validation or test data), the student would likely fail. This is analogous to an overfitted model.

**Signs of Overfitting:**
-   **Training loss continues to decrease, while validation loss starts to increase or plateaus.** This is a classic indicator. It means the model is getting better at fitting the training data but worse (or no better) at generalizing to unseen data.
-   The model might produce excellent results on examples it has seen during training but generate nonsensical or poor outputs for new inputs.


**In our case, there is another reason (and might be the major reason)** for the poor validation loss: since the training data (the first part of the text) is very different from the testing data (the second part), the LLM cannot learn to predict the testing data from training. This is also called the out-of-distribution (OOD) issue.


## 6. Generating Text

Now that our model is trained, let's use it to generate some text. We'll start with a context (e.g., a newline word) and ask the model to predict the next words sequentially.

In [13]:
# Generation function - create a simple generation loop


# Start with the first few words from our training data
text_test="sir: yet you must not think to fob off our disgrace with a tale"
context = torch.tensor(encode(text_test), 
                      dtype=torch.long, device=device).unsqueeze(0) # unsqueeze(0): add a dimension at the beginning of the tensor to match the format of the training data
#Training data shape for X: (batch_size, block_size)
#Context shape in X: (1, block_size)

print(f"Input context: '{decode(context[0].tolist())}'")



Input context: ' sir: yet you must not think to fob off our disgrace with a tale'


In [None]:
print(f"Input context: '{decode(context[0].tolist())}'")
# Generation loop to generate text token by token

generated = context #initialize the generated sequence with the context

model.eval()
with torch.no_grad():
    for _ in range(50):  # Generate 50 more words
        # Get predictions for the current context
        # Crop to last block_size tokens if context gets too long
        context_cropped = generated[:, -block_size:] if generated.size(1) > block_size else generated
        
        outputs = model(context_cropped) # forward pass: the model predicts the next word of the context
        logits = outputs.logits[:, -1, :] # logits: the output of the model, the scores of each word in the vocabulary to be the next word of the context
        
        # Apply temperature for sampling (Optional)
        # temperature = 1
        # logits = logits / temperature
        
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1) # softmax: convert the logits to probabilities over the vocabulary, exp(logits_i) / sum_j(exp(logits_j))
        next_token = torch.multinomial(probs, num_samples=1) # multinomial: sample from the distribution, i.e., the next word of the context
        
        # Append to generated sequence
        generated = torch.cat([generated, next_token], dim=1) # concatenate the generated sequence and the next word of the context

        
# Decode the generated text
generated_text = decode(generated[0].tolist())
print("--- Generated Text ---")
print(generated_text)
print("--- Original Text ---")
print('''sir: yet you must not think to fob off our disgrace with a tale: but, an 't please you, deliver. MENENIUS: There was a time when all the body's members Rebell'd against the belly, thus accused it: That only like a gulf it did remain''')
print("----------------------")

Input context: ' sir: yet you must not think to fob off our disgrace with a tale'


--- Generated Text ---
 sir: yet you must not think to fob off our disgrace with a tale but an t you deliver menenius there a more first: was time all body s rebell d the,' against belly thus it remain' midst' accused:, accused:,' hear,,,,,,,,: was' hear,
--- Original Text ---
sir: yet you must not think to fob off our disgrace with a tale: but, an 't please you, deliver. MENENIUS: There was a time when all the body's members Rebell'd against the belly, thus accused it: That only like a gulf it did remain
----------------------


#### Explanation: Sampling in Text Generation

When our model generates text, it doesn't just pick the single most probable next word. Instead, it often uses a technique called **sampling**, sometimes influenced by a parameter called **temperature**.

1.  **Logits and Probabilities:**
    After processing the input context, the model outputs **logits**. These are raw, unnormalized scores for each word in the vocabulary. To turn these logits into probabilities (i.e., values between 0 and 1 that sum up to 1), we apply a **softmax function**. The word with the highest probability is the one the model thinks is most likely to come next.

2.  **Sampling:**
    Instead of always picking the word with the absolute highest probability (which is called greedy decoding and can lead to repetitive or boring text), we can *sample* from the probability distribution. This means a word that is slightly less probable might still be chosen, introducing randomness and creativity into the generated text.
    `torch.multinomial(probs, num_samples=1)` is the function that performs this sampling. It takes the probability distribution (`probs`) and picks one word based on these probabilities.

3.  **(Optional) Temperature:**
    Temperature is a hyperparameter that controls the randomness of the sampling. It's applied to the logits *before* the softmax function.
    -   **Low Temperature (e.g., < 1.0, closer to 0):** Dividing logits by a small number makes the differences between them larger. When softmax is applied, the probability distribution becomes "sharper" or "spikier." The model becomes more confident and deterministic, tending to pick the most likely words. This can lead to more focused and coherent text, but potentially less creative.
    -   **High Temperature (e.g., > 1.0):** Dividing logits by a larger number makes the differences between them smaller. The resulting softmax probability distribution becomes "flatter" or "softer." The model becomes less confident, and less likely words have a higher chance of being selected. This increases randomness and creativity, but can also lead to more errors or nonsensical text.
    -   **Temperature = 1.0:** This is the standard setting where the original probabilities are used for sampling.

In our generation code:
`logits = logits / temperature`
`probs = F.softmax(logits, dim=-1)`

By adjusting the `temperature` value, you can control the trade-off between coherence and creativity in the generated text. For this tutorial, we use a temperature of 1.0, which means we sample directly from the model's learned probability distribution.