# Playing with Top_K, Top_P, and temperature for LLMs

This notebook aims to describe what those parameters are and how changing them impacts text generation.

Before describing them, we need to know about two ways LLMs can do sampling.

- **Greedy sampling**: At each step, it chooses the token with the highest probability based on the preceding tokens.
  - Pros:
    - Produces output that is often highly coherent and reflects the most common patterns in the training data.
    - Is deterministic, meaning the same prompt will always result in the same output.
  - Cons:
    - Can lead to repetitive or dull text because it never deviates from the most likely path.
    - It's a short-sighted approach, as the best next word might not be part of the best overall sequence.


- **Random sampling**: It randomly selects a token from a probability distribution. Tokens with higher probabilities are more likely to be chosen, but even low-probability tokens have a chance of being selected.
  - Pros:
    - Generates more diverse and creative output.
    - Can help avoid the repetitive loops that greedy sampling is prone to.
  - Cons:
    - Can produce lower-quality or less coherent output compared to greedy sampling.
    - Is non-deterministic; running it multiple times will likely yield different results.

Whether a model behaves more like greedy or random sampling is controlled by the `top_k`, `top_p` and `temperature` parameters.
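
The difference between the two strategies can be sketched in plain Python. This is only a toy illustration: the probabilities are made up, and `greedy_pick` and `random_pick` are hypothetical helpers, not part of any library.

```python
import random

# Toy next-token distribution (hypothetical values, not real model output)
probs = {" in": 0.50, " not": 0.30, " now": 0.15, " here": 0.05}

def greedy_pick(probs):
    """Return the single most probable token."""
    return max(probs, key=probs.get)

def random_pick(probs, rng):
    """Draw one token, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
print("greedy:", [greedy_pick(probs) for _ in range(5)])
print("random:", [random_pick(probs, rng) for _ in range(5)])
```

The greedy list repeats the same token five times, while the random list can contain any token from the distribution, with the more probable ones showing up more often.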

## Top_K

Limits the model's output to the top-k most probable tokens at each generation step.

### Example

For the prompt `The future of AI is`, with `top_k=5` we get the following tokens:

```
Token: ' in' (ID: 11) | Logprob: -2.1674 | %: 11.45%
Token: ' not' (ID: 45) | Logprob: -3.1830 | %: 4.15%
Token: ' now' (ID: 122) | Logprob: -3.4174 | %: 3.28%
Token: ' here' (ID: 259) | Logprob: -3.4330 | %: 3.23%
Token: ' a' (ID: 10) | Logprob: -3.4955 | %: 3.03%
```

As you can see we got the list of the 5 most probable tokens.
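
The filtering step itself can be sketched in a few lines of plain Python. This is a minimal illustration with made-up probabilities and a hypothetical helper `top_k_filter`; it is not how vLLM implements it internally.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token probabilities, for illustration only
probs = {" in": 0.40, " not": 0.25, " now": 0.15, " here": 0.10, " a": 0.06, " the": 0.04}

filtered = top_k_filter(probs, 3)
for token, p in filtered.items():
    print(f"Token: {token!r} | %: {p * 100:.2f}%")
```

Only the three most probable tokens survive, and their probabilities are rescaled so that sampling still draws from a valid distribution.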

## Top_P

Keeps the most probable tokens until their cumulative probability reaches the `top_p` threshold, and filters out the rest.
Unlike top_k, which always keeps a fixed number of tokens, top_p keeps as many tokens as needed for the cumulative probability to reach the threshold, so the number of candidates can change from step to step.

### Example

- `top_k=5` keeps the 5 tokens with the highest probability.
- `top_p=0.3` keeps tokens until their cumulative probability reaches 0.3 (30%). Given a list of tokens, each with its own probability, we add the probabilities up from highest to lowest and stop when we reach the threshold.

```
Token: ' the'       | Prob:  8.86% | Cumul:  8.86%
Token: ' a'         | Prob:  7.61% | Cumul: 16.48%
Token: ' not'       | Prob:  3.81% | Cumul: 20.29%
Token: ' in'        | Prob:  3.11% | Cumul: 23.40%
Token: ' being'     | Prob:  2.16% | Cumul: 25.56%
Token: ' one'       | Prob:  1.36% | Cumul: 26.93%
Token: ' home'      | Prob:  1.36% | Cumul: 28.29%
Token: ' now'       | Prob:  1.29% | Cumul: 29.58%
Token: ' getting'   | Prob:  1.24% | Cumul: 30.82%
Top_p threshold reached (30.82%/30.0%). Skipping remaining vocab
```
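
The same cumulative cut-off can be sketched in plain Python. The helper `top_p_filter` and both distributions below are hypothetical and only meant to show that the number of surviving tokens depends on how peaked the distribution is.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the survivors so they sum to 1
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical distributions: one where the model is confident, one where it is not
peaked = {" Paris": 0.90, " the": 0.05, " a": 0.03, " in": 0.02}
flat = {" the": 0.30, " a": 0.25, " in": 0.20, " not": 0.15, " now": 0.10}

print(len(top_p_filter(peaked, 0.8)))  # 1 token is enough to reach 0.8
print(len(top_p_filter(flat, 0.8)))    # 4 tokens are needed to reach 0.8
```

With the same threshold, the confident distribution keeps a single token while the flat one keeps four, which is the behaviour a fixed `top_k` cannot reproduce.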

## Temperature

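Adjusts the randomness of token selection by dividing the logits (the model's raw scores) by the temperature before they are turned into probabilities. Higher temperatures flatten the distribution and lead to more varied, creative outputs, while lower temperatures sharpen it and lead to more predictable responses.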
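
A small sketch of temperature scaling, assuming the common formulation where logits are divided by the temperature before the softmax. The logits are made up and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature grows, the probability of the top token shrinks and the others gain weight. A temperature of exactly 0 cannot be divided by, so it has to be handled as a special case (always pick the top token).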


## vLLM

We will be using the vLLM Python bindings, where these parameters are set via the [SamplingParams](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams) class. Below are the descriptions for each parameter.

- [temperature](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.temperature) – Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.

- [top_p](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.top_p) – Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

- [top_k](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.top_k) – Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
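
To see how the three parameters can interact, here is a toy sampler in plain Python. It assumes one common ordering (temperature, then top_k, then top_p, then a random draw); the function `sample_next_token` and the logits are hypothetical, and real engines such as vLLM may differ in the details.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=-1, top_p=1.0, rng=random):
    """Pick one token from a {token: logit} dict using temperature, top_k and top_p."""
    if temperature == 0:
        # Greedy: the filters can never remove the top token, so skip them
        return max(logits, key=logits.get)

    # 1. Temperature: scale the logits, then apply a stable softmax
    scaled = {t: x / temperature for t, x in logits.items()}
    highest = max(scaled.values())
    exps = {t: math.exp(x - highest) for t, x in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # 2. top_k: keep the k most probable tokens (-1 keeps all), then renormalize
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(p for _, p in ranked)
        ranked = [(t, p / kept_mass) for t, p in ranked]

    # 3. top_p: keep tokens until the cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 4. Draw from the survivors (random.choices normalizes the weights)
    tokens = [t for t, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits, for illustration only
logits = {" Paris": 3.0, " the": 2.0, " a": 1.5, " in": 1.0, " not": 0.5}
print(sample_next_token(logits, temperature=0))
print(sample_next_token(logits, temperature=0.9, top_k=3, top_p=0.9, rng=random.Random(0)))
```

With `temperature=0` the result is always the top token, whatever `top_k` and `top_p` are set to. With a positive temperature the result is drawn from whatever candidates survive both filters.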







# Deploy vLLM

In [None]:
!uv venv --python 3.12 --seed --clear
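# Note: each `!` command runs in its own subshell, so this activation does not carry over to later commands or to the notebook kernel
!source .venv/bin/activate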
!uv pip install vllm --torch-backend=auto


# Load model

In [None]:
from vllm import LLM, SamplingParams

# max_logprobs=-1 will help us see token distribution/cumulative probabilities later
llm = LLM(model="facebook/opt-125m", max_logprobs=-1)


# Define prompt

In [None]:
prompts = [
    "The capital of France is",
]

# Top_K


## Test 1 - Temperature 0, top_k 10

In this test we set `top_k=10`, which restricts the candidates to the 10 most probable tokens at each step, and `temperature=0`, which makes the model always pick the single most probable token. Since that token is always among the top 10, `top_k` has no visible effect here. We will run 5 generations and see that we always get the same output.
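
A quick way to convince yourself that `top_k` cannot change a greedy result is the toy check below. The probabilities are made up and `top_k_candidates` is a hypothetical helper.

```python
# Hypothetical next-token probabilities, for illustration only
probs = {" Paris": 0.35, " the": 0.20, " a": 0.15, " in": 0.12, " not": 0.10, " now": 0.08}

def top_k_candidates(probs, k):
    """Return the k most probable tokens, highest first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

for k in (1, 3, 6):
    candidates = top_k_candidates(probs, k)
    # Greedy selection: the most probable token among the candidates
    print(f"top_k={k}: {max(candidates, key=probs.get)!r}")
```

Whatever value of `k` is used, the most probable token is always in the candidate list, so the greedy choice stays the same.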






In [None]:
sampling_params = SamplingParams(temperature=0, top_k=10)

# Run 5 generations with the same parameters
for i in range(5):
    print(f"Generation {i+1}")
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


## Test 2 - Temperature 0.9, top_k 10

In this test we again use `top_k=10` to restrict the candidates to the 10 most probable tokens, but this time the temperature is 0.9, which should lead to different generations. The model will pick one of the top 10 tokens in a less predictable way. We will run 5 generations and will likely see a different output for each one.






In [None]:
sampling_params = SamplingParams(temperature=0.9, top_k=10)

# Run 5 generations with the same parameters
for i in range(5):
    print(f"Generation {i+1}")
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


## Test 3 - Printing probability for each token

In the previous tests we saw how top_k restricts the model to the `k` tokens with the highest probabilities. This time we will do the same, but we will also print the probabilities.

The first step always shows the same probabilities, because they depend only on the prompt, but since temperature is 0.9 the sampled token can differ. Once a different token is sampled, later steps condition on different text, so their probabilities can differ too. Try running the cell multiple times and compare the first step with the final output. Then try changing the temperature to 0 and you will see the same output every time: at each step the chosen token is the one with the highest probability.

In [None]:
import math

# logprobs=-1 would return probabilities for the model's whole vocabulary.
# We only care about the 10 most probable tokens, so we set logprobs=10.
sampling_params = SamplingParams(temperature=0.9, top_k=10, logprobs=10)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    # logprobs is a list of dictionaries, one per generated token:
    # output.outputs[0].logprobs[i] corresponds to the i-th generated token
    token_logprobs = output.outputs[0].logprobs
    print("Probabilities:")
    for i, top_k_dict in enumerate(token_logprobs):
        print(f"  Step {i+1}:")
        # Note: In newer vLLM versions, the dict keys are token IDs (integers)
        for token_id, logprob_obj in top_k_dict.items():
            # logprob_obj contains 'logprob', 'rank', and 'decoded_token' (if available)
            print(f"    Token: {logprob_obj.decoded_token!r} (ID: {token_id}) | Logprob: {logprob_obj.logprob:.4f} | %: {math.exp(logprob_obj.logprob)*100:.2f}%")

# Top_P

## Test 1 - Temperature 0, top_p 0.9

In this test we set `top_p=0.9`, so at each step the candidate list holds the most probable tokens until their cumulative probability reaches 0.9 (90%). The number of candidate tokens therefore varies from step to step: sometimes there are many tokens to choose from, and other times just a few. This will be easier to understand when we output the probabilities later.

At the same time we are using `temperature=0`, which means that from that list we still choose the token with the highest probability. If we run the generation below 5 times, we will see the same output each time.

---



In [None]:
sampling_params = SamplingParams(temperature=0, top_p=0.9)

for i in range(0,5):
  print(f"Generation {i+1}")
  outputs = llm.generate(prompts, sampling_params)
  for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

## Test 2 - Temperature 0.9, top_p 0.9

Same test as above, but this time we set the temperature to 0.9. This means we will likely get different outputs on each run.


In [None]:
sampling_params = SamplingParams(temperature=0.9, top_p=0.9)

for i in range(0,5):
  print(f"Generation {i+1}")
  outputs = llm.generate(prompts, sampling_params)
  for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

## Test 3 - Printing probability for each token and cumulative probability

Here we will see the two parameters involved in token selection. On one hand, we will see how the candidate list for each step is built until the top_p threshold is reached. On the other, we will see that from that list the model picks the token with the highest probability (temperature=0).

In [None]:
import math

# logprobs=-1 returns probabilities for every token in the model's vocabulary;
# we stop printing once we reach the top_p threshold
sampling_params = SamplingParams(temperature=0, top_p=0.3, logprobs=-1)
# We will use this to stop printing distribution/cumulative probabilities once we reached this threshold
target_p = 0.3
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
  prompt = output.prompt
  generated_text = output.outputs[0].text
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  token_logprobs_list = output.outputs[0].logprobs
  for i, top_k_dict in enumerate(token_logprobs_list):
    print(f"  Step {i+1}:")
    # Sort by probability (High -> Low)
    sorted_preds = sorted(
      top_k_dict.values(),
      key=lambda x: x.logprob,
      reverse=True
    )
    current_cumulative = 0.0
    for logprob_obj in sorted_preds:
      p = math.exp(logprob_obj.logprob)
      current_cumulative += p
      print(f"    Token: {logprob_obj.decoded_token!r:<12} | Prob: {p*100:5.2f}% | Cumul: {current_cumulative*100:5.2f}%")

      # We stop showing probabilities if we reach the threshold
      if current_cumulative >= target_p:
        print(f"    Top_p threshold reached ({current_cumulative*100:5.2f}%/{target_p*100}%). Skipping remaining vocab")
        break
