# How LLMs Choose the Next Token

In this notebook we will explore how the final stage of our LLM (a decoder-only transformer) works!

More specifically, we'll explore how we go from the decoder stack to a "next token"!

In order to better understand why we use the loss we do - we'll start here, with generation, to get a sense of what the model is doing "under the hood".

Let's jump right in!

## Dependencies

Today we'll be using a classic minimalist implementation of a decoder-only transformer model called `nanoGPT`, built by the one-and-only Andrej Karpathy - found [here](https://github.com/karpathy/nanoGPT/tree/master)!

It does require a few dependencies - though most are covered by the default Colab environment.

> NOTE: You will need to make sure you're in a GPU-enabled environment for effective use of this notebook.

In [None]:
!pip install -qU datasets tiktoken wandb tqdm

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m31.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.0/289.0 kB[0m [31m28.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━

In [None]:
!git clone https://github.com/karpathy/nanoGPT.git

Cloning into 'nanoGPT'...
remote: Enumerating objects: 671, done.
remote: Total 671 (delta 0), reused 0 (delta 0), pack-reused 671
Receiving objects: 100% (671/671), 947.92 KiB | 2.49 MiB/s, done.
Resolving deltas: 100% (379/379), done.


In [None]:
%cd nanoGPT

/content/nanoGPT


## Generating Tokens!

Let's just try to do some inference and see what happens before we dig in.

In [None]:
!python sample.py \
    --init_from=gpt2-xl \
    --start="What is the answer to life, the universe, and everything?" \
    --num_samples=1 --max_new_tokens=100

Overriding: init_from = gpt2-xl
Overriding: start = What is the answer to life, the universe, and everything?
Overriding: num_samples = 1
Overriding: max_new_tokens = 100
loading weights from pretrained gpt: gpt2-xl
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 1555.97M
No meta.pkl found, assuming GPT-2 encodings...
What is the answer to life, the universe, and everything?

One possibility is that they have a universe-wide perspective that allows for no contradictions. And, if so, then why is there contradiction?

Another possibility is that the universe is one big joke. They are not the only joke in the universe. But it is not an empty universe.

Hence the problems with the universe's Big Bang theory, which tells us that the universe started out in a singularity with a zero initial mass. The Big Bang is essentially consistent with
---------------


You'll notice that we pass *in* text - and we receive *back* text from our model.

## How Does the LLM Generate Tokens?

So, we pass in text - and get text back - but how do we actually generate each token?

You might have heard the terms "auto-regressive" or "causal" kicking around when reading about LLMs - and what those terms mean, in a simplified sense, is straightforward enough:

- They take an input, and generate a single token
- They append that token to the input and repeat this process for as long as we want it to repeat (or use heuristics to determine when to stop, such as when we see a stop token)

Let's take a look at the function that does this in the `nanoGPT` repository.



```python
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
```

### What is a Logit?!

Technically - a logit is a "raw unnormalized score".

However, we can think of them as scores for each token in our vocabulary. These scores aren't probabilities in and of themselves - but they can be easily converted to probabilities through the softmax function.
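
For reference, the softmax conversion can be written out explicitly. The notation here is introduced purely for illustration: $z_i$ is the logit for token $i$, and $V$ is the vocabulary size.

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}
$$

Every $p_i$ is positive and the $p_i$ sum to 1, which is what lets us treat the result as a probability distribution over the vocabulary.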

### What is Temperature Doing?!

While something like `top_k` makes intuitive sense - what in the heck is temperature doing here?

In order to understand - let's look at a few examples!

Starting with an easy `temperature = 1.0`.

> NOTE: We'll also define our softmax function!

In [None]:
import numpy as np

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

In [None]:
import numpy as np

temperature = 1.0

logits = np.array([6, 2, 7, 0.1, -8, 9])

temp_scaled_logits = logits / temperature
print(f"Scaled Logits: {temp_scaled_logits}")

softmaxed_logits = softmax(temp_scaled_logits)
print(f"Softmax-ed Logits: {softmaxed_logits}")

Scaled Logits: [ 6.   2.   7.   0.1 -8.   9. ]
Softmax-ed Logits: [4.19729385e-02 7.68761185e-04 1.14094276e-01 1.14982549e-04
 3.49017038e-08 8.43049007e-01]


As you can see - with a temperature of `1.0` our logits are unchanged, and our softmax output spans a wide range - from roughly `3.5e-08` to `8.4e-01`. The index with the score `9` is the most likely to be selected (about 84%), but it's not absurdly likely.

Let's look at an example with a very low temperature next!

In [None]:
temperature = 0.1

logits = np.array([6, 2, 7, 0.1, -8, 9])

temp_scaled_logits = logits / temperature
print(f"Scaled Logits: {temp_scaled_logits}")

softmaxed_logits = softmax(temp_scaled_logits)
print(f"Softmax-ed Logits: {softmaxed_logits}")

Scaled Logits: [ 60.  20.  70.   1. -80.  90.]
Softmax-ed Logits: [9.35762295e-14 3.97544973e-31 2.06115362e-09 2.22736356e-39
 1.47889750e-74 9.99999998e-01]


As you can see - now that we changed our temperature to be very low - the index with score `9` is *vastly* more likely than any other option.

Dividing by a low (<1) temperature makes our logits larger in magnitude, which widens the gaps between them - resulting in a sharper probability distribution after softmax.

Let's look at a final example with a higher temperature.

In [None]:
temperature = 10

logits = np.array([6, 2, 7, 0.1, -8, 9])

temp_scaled_logits = logits / temperature
print(f"Scaled Logits: {temp_scaled_logits}")

softmaxed_logits = softmax(temp_scaled_logits)
print(f"Softmax-ed Logits: {softmaxed_logits}")

Scaled Logits: [ 0.6   0.2   0.7   0.01 -0.8   0.9 ]
Softmax-ed Logits: [0.20299317 0.13607039 0.22434215 0.11252466 0.0500575  0.27401212]


Now, while our index with score `9` is still the most likely, the probabilities are much closer together! A high (>1) temperature shrinks the gaps between our logits - resulting in a flatter probability distribution after softmax.
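
To see the trend in one place, here is a small sketch that sweeps over several temperatures using the same example logits. The `entropy` helper and the particular list of temperatures are our own additions for illustration - they are not part of `nanoGPT`.

```python
import numpy as np

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

def entropy(p):
    # Shannon entropy in nats - higher means a flatter distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([6, 2, 7, 0.1, -8, 9])

for temperature in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = softmax(logits / temperature)
    print(f"T={temperature:>4}: top prob={probs.max():.4f}, entropy={entropy(probs):.4f}")
```

As the temperature rises, the top probability should fall and the entropy should rise - the distribution gets flatter, so sampling becomes more varied.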

### Pseudo-Code For Generation

Now that we have an intuition for what logits and temperature are doing - let's see what our generation code is doing in simpler terms:

1. For some number of new tokens (user decided), we repeat steps 2 through 8
2. We check that our current sequence of indices fits in our block size - if it doesn't, we trim it down to the most recent `block_size` indices
3. We get the logits for the provided indices, and keep only the logits at the final position
4. We scale the logits by the user defined temperature (the default is 1)
5. We optionally crop the logits by our top k - meaning we only keep the top k values in our logits, and set the rest to `-Inf`. This effectively limits our choices to only the k most likely tokens
6. We apply softmax to convert our logits into a probability distribution
7. We sample from that probability distribution
8. We append the sampled index to our input indices (auto-regressive, anyone?)
9. Once the loop finishes, we return the full sequence of indices - we're done!
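
The steps above can also be sketched in plain NumPy. This is a simplified illustration, not the `nanoGPT` implementation: `sample_next`, this `generate`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in that simply favours "the previous token plus one" instead of running a real network.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Steps 4-7: scale, optionally keep the top k, softmax, sample."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        k = min(top_k, scaled.size)
        kth_largest = np.sort(scaled)[-k]
        # anything below the k-th largest score gets zero probability
        scaled = np.where(scaled < kth_largest, -np.inf, scaled)
    probs = softmax(scaled)
    return int(rng.choice(len(probs), p=probs))

def generate(model, idx, max_new_tokens, block_size, **sampling_kwargs):
    """Steps 1-9, with `model` standing in for the real network."""
    idx = list(idx)
    for _ in range(max_new_tokens):                          # step 1
        context = idx[-block_size:]                          # step 2
        logits = model(context)                              # step 3
        idx.append(sample_next(logits, **sampling_kwargs))   # steps 4-8
    return idx                                               # step 9

# Hypothetical stand-in model over a 6-token vocabulary.
def toy_model(context, vocab_size=6):
    logits = np.zeros(vocab_size)
    logits[(context[-1] + 1) % vocab_size] = 5.0
    return logits

rng = np.random.default_rng(0)
print(generate(toy_model, [0], max_new_tokens=8, block_size=4, top_k=1, rng=rng))
# -> [0, 1, 2, 3, 4, 5, 0, 1, 2]
```

With `top_k=1` only the single highest-scoring token survives, so the output is deterministic. Dropping `top_k` (or raising the temperature) lets the other tokens through some of the time.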

## How do we get to Logits?

There's a question that might be nagging at you - how do we get from our decoder stack output to a series of scores for each token in our vocabulary?

That's where our `lm_head` comes in - in this case, a linear layer!

Let's take a look at this layer as it's defined in `nanoGPT`.

```python
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
```

We can see that the `lm_head` is a linear layer that has input size `n_embd` (the internal dimension of our model), and output size `vocab_size` (the vocabulary size).

So what this means is that the `lm_head` is a linear projection onto the vocabulary - it takes the output of the stack of decoder blocks that we have passed our inputs through, and provides one logit (or score) for each token in our vocabulary.

Essentially, this linear layer acts as a translation between our model's internal representation and our desired output format which, in this case, is tokens!
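
To make the shapes concrete, here is a minimal NumPy sketch of that projection. The sizes are toy values chosen for illustration, the random arrays stand in for real activations and trained weights, and the variable names are our own. It relies on the fact that a bias-free `nn.Linear(n_embd, vocab_size)` stores a weight of shape `(vocab_size, n_embd)` and computes `x @ weight.T`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration - the gpt2-xl run above uses vocab_size=50257.
batch, seq_len, n_embd, vocab_size = 1, 4, 8, 10

# Stand-in for the decoder stack's output: one vector per input position.
hidden_states = rng.normal(size=(batch, seq_len, n_embd))

# Stand-in for the lm_head weight, laid out as (vocab_size, n_embd).
lm_head_weight = rng.normal(size=(vocab_size, n_embd))

# The projection: one score per vocabulary entry, for every position.
logits = hidden_states @ lm_head_weight.T
print(logits.shape)       # (1, 4, 10)

# Generation only needs the scores at the final position.
last_logits = logits[:, -1, :]
print(last_logits.shape)  # (1, 10)
```

Each row of the weight matrix can be read as a "direction" for one token - the logit for that token is the dot product between its row and the final hidden state.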

## Logits and Loss

Okay, so now we have a better understanding of how a model generates the next token:

1. The decoder blocks take our input and use attention to compute an internal representation (a vector of size `n_embd`) for each position
2. We project that representation from our internal model dimension onto our vocabulary with the `lm_head`
3. We use the obtained "raw unnormalized scores" (logits!) to find a probability distribution (through softmax, after some potential processing)
4. We sample from that probability distribution to find our next token!
5. We append the token to our input
6. RINSE AND REPEAT

So - how does this relate to loss?

Well - there's a question lurking in there, which is:

"How do we know that the scores we assigned are what they're supposed to be? Or even close?"

For that - we'll need loss!