# Playing with Top_K, Top_P, and temperature for LLMs
Before trying this Notebook, read the blog post [here](https://linuxera.org/turning-the-knobs-of-llm-text-generation/)

## vLLM

We will use the vLLM Python bindings. The sampling parameters are set through the [SamplingParams](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams) class; each one is described below.

- [temperature](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.temperature) – Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. This notebook uses values between 0 and 1.

- [top_p](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.top_p) – Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

- [top_k](https://docs.vllm.ai/en/stable/api/vllm/#vllm.SamplingParams.top_k) – Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
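
To make the effect of `temperature` concrete, here is a minimal pure-Python sketch. It is not vLLM's internal implementation: the logits are made up, the helper name `softmax_with_temperature` is hypothetical, and it assumes the common formulation where logits are divided by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first.

    Hypothetical helper for illustration; temperature must be > 0 here
    (temperature 0 is treated as greedy argmax and handled separately).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for four candidate tokens
toy_logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.2, 0.9, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

With a low temperature almost all of the probability mass lands on the top token; with a high temperature the distribution flattens and the less likely tokens get a real chance of being sampled.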







# Deploy vLLM

In [None]:
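# Each `!` command runs in its own subshell, so the `source` line below does not
# carry over to the next command; the log shows the packages being installed into /usr.
!uv venv --python 3.12 --seed --clear
!source .venv/bin/activate
!uv pip install vllm --torch-backend=auto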


Using CPython 3.12.12 interpreter at: /usr/bin/python3
Creating virtual environment with seed packages at: .venv
 + pip==25.3
Activate with: source .venv/bin/activate
Using Python 3.12.12 environment at: /usr
Resolved 152 packages in 5.73s
Prepared 50 packages in 13.91s
Uninstalled 7 packages in 113ms
Installed 50 packages in 245ms
 + anthropic==0.71.0
 + apache-tvm-ffi==0.1.3
 + astor==0.8.1
 + blake3==1.0.8
 + cbor2==5.7.1
 - click==8.3.1
 + click==8.2.1
 + compressed-tensors==0.12.2
 + depyf==0.20.0
 + diskcache==5.6.3
 + dnspython==2.8.

# Load model

In [None]:
from vllm import LLM, SamplingParams

# max_logprobs=-1 removes the cap on how many logprobs can be requested,
# which lets us inspect the token distribution and cumulative probabilities later
llm = LLM(model="facebook/opt-125m", max_logprobs=-1)


INFO 11-25 16:45:08 [utils.py:253] non-default args: {'max_logprobs': -1, 'disable_log_stats': True, 'model': 'facebook/opt-125m'}


INFO 11-25 16:45:29 [model.py:631] Resolved architecture: OPTForCausalLM
INFO 11-25 16:45:29 [model.py:1745] Using max model len 2048
INFO 11-25 16:45:32 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.


INFO 11-25 16:48:58 [llm.py:352] Supported tasks: ['generate']


# Define prompt

In [None]:
prompts = [
    "The capital of France is",
]

# Top_K


## Test 1 - Temperature 0, top_k 10

In this test we set top_k to 10, so only the 10 tokens with the highest probability are candidates at each step. We also set the temperature to 0, which means greedy sampling: the model always picks the single most likely token, so the top_k restriction has no visible effect. We run 4 generations and get the same output every time.
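
Why top_k makes no difference at temperature 0 can be sketched in a few lines of plain Python. The probabilities are rounded values taken from the step-1 output shown in Test 3; the helper `greedy_pick` is hypothetical and is not part of vLLM.

```python
def greedy_pick(candidates, top_k):
    """Keep the top_k most likely tokens, then return the single most likely one.

    Hypothetical helper: with temperature 0 the choice is the argmax,
    so the result does not depend on top_k (for any top_k >= 1).
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:top_k]
    return kept[0][0]

# Rounded probabilities taken from the step-1 output shown in Test 3
step_1 = {" a": 0.0761, " the": 0.0886, " not": 0.0381, " in": 0.0311}

for k in (1, 2, 4):
    print(f"top_k={k}: picked {greedy_pick(step_1, k)!r}")
```

Whatever value of `top_k` is used, the most likely token is always inside the kept set, so the greedy choice never changes.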






In [None]:
sampling_params = SamplingParams(temperature=0, top_k=10)

for i in range(4):
    print(f"Generation {i+1}")
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


Generation 1
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 2
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 3
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 4
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'


## Test 2 - Temperature 0.9, top_k 10

In this test top_k again limits the candidates to the 10 tokens with the highest probability, but this time the temperature is 0.9, which should lead to different generations. The model picks one of the top 10 tokens in a less predictable way. We run 4 generations and will likely see a different output for each one.
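
As a rough sketch of what sampling from a top-k pool looks like, the snippet below draws tokens with the standard library's `random.choices`. The probabilities are made up and the helper `sample_top_k` is hypothetical; this illustrates the idea and is not vLLM's sampler.

```python
import random

def sample_top_k(candidates, top_k, rng):
    """Sample one token from the top_k most likely candidates.

    Hypothetical helper: random.choices treats the weights as relative,
    so the kept probabilities do not need to sum to 1.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:top_k]
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up next-token probabilities (assumed to be already temperature-scaled)
toy_probs = {" the": 0.30, " a": 0.25, " not": 0.15, " in": 0.10, " home": 0.05}

rng = random.Random()  # unseeded on purpose: repeated runs may differ
draws = [sample_top_k(toy_probs, top_k=3, rng=rng) for _ in range(8)]
print(draws)
```

Every draw comes from the three most likely tokens, but which one is picked can change from run to run, which is why the generations in this test differ.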






In [None]:
sampling_params = SamplingParams(temperature=0.9, top_k=10)

for i in range(4):
    print(f"Generation {i+1}")
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


Generation 1
Prompt: 'The capital of France is', Generated text: ' the capital of Europe. The people of France and Europe are the most beautiful countries'
Generation 2
Prompt: 'The capital of France is', Generated text: ' the capital of the European Union with a population of about 10.6 million people'
Generation 3
Prompt: 'The capital of France is', Generated text: ' the largest country in the world and the second largest nation in Europe after China,'
Generation 4
Prompt: 'The capital of France is', Generated text: " a city of about 7 billion people, the capital of the world's second highest"


## Test 3 - Printing probability for each token

In the previous test we saw how top_k restricts the model to the `k` tokens with the highest probabilities. This time we do the same, but we also print the probabilities.

You will see that, for the same preceding text, the tokens always get the same probabilities, but since the temperature is 0.9 the outputs will differ. Run the cell multiple times: step 1 always shows the same probabilities, while the generated text changes. Then try changing the temperature to 0: you will get the same output every time, and at each step it matches the token with the highest probability.
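
vLLM reports log probabilities, so the cell below converts them to percentages with `math.exp`. As a small worked example of that conversion, using three logprob values copied from this test's step-1 output:

```python
import math

# Log probabilities copied from the step-1 output of this test
logprobs = {" the": -2.4231, " a": -2.5755, " not": -3.2669}

# exp() undoes the natural logarithm, giving a probability between 0 and 1
probs = {token: math.exp(lp) for token, lp in logprobs.items()}

for token, p in probs.items():
    print(f"{token!r}: logprob={logprobs[token]:.4f} -> {p * 100:.2f}%")
```

A logprob closer to 0 means a more likely token; the values are negative because they are logarithms of numbers below 1.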

In [None]:
import math

# logprobs=-1 would return probabilities for the model's whole vocabulary.
# We only care about the 10 most likely tokens, so we set logprobs=10.
sampling_params = SamplingParams(temperature=0.9, top_k=10, logprobs=10)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    # The logprobs are a list of dictionaries (one dict per generated token);
    # output.outputs[0].logprobs[i] corresponds to the i-th generated token
    token_logprobs = output.outputs[0].logprobs
    print("Probabilities:")
    for i, top_k_dict in enumerate(token_logprobs):
        # Note: in newer vLLM versions, the dict keys are token IDs (integers)
        print(f"  Step {i+1}:")
        for token_id, logprob_obj in top_k_dict.items():
            # logprob_obj contains 'logprob', 'rank', and 'decoded_token' (if available)
            print(f"    Token: {logprob_obj.decoded_token!r} (ID: {token_id}) | Logprob: {logprob_obj.logprob:.4f} | %: {math.exp(logprob_obj.logprob)*100:.2f}%")


Prompt: 'The capital of France is', Generated text: ' a very important country for the EU. If I had to describe how I feel'
Probabilities:
  Step 1:
    Token: ' a' (ID: 10) | Logprob: -2.5755 | %: 7.61%
    Token: ' the' (ID: 5) | Logprob: -2.4231 | %: 8.86%
    Token: ' not' (ID: 45) | Logprob: -3.2669 | %: 3.81%
    Token: ' in' (ID: 11) | Logprob: -3.4700 | %: 3.11%
    Token: ' being' (ID: 145) | Logprob: -3.8333 | %: 2.16%
    Token: ' one' (ID: 65) | Logprob: -4.2942 | %: 1.36%
    Token: ' home' (ID: 184) | Logprob: -4.2981 | %: 1.36%
    Token: ' now' (ID: 122) | Logprob: -4.3489 | %: 1.29%
    Token: ' getting' (ID: 562) | Logprob: -4.3919 | %: 1.24%
    Token: ' going' (ID: 164) | Logprob: -4.4505 | %: 1.17%
  Step 2:
    Token: ' very' (ID: 182) | Logprob: -3.9039 | %: 2.02%
    Token: ' city' (ID: 343) | Logprob: -2.7789 | %: 6.21%
    Token: ' country' (ID: 247) | Logprob: -3.1618 | %: 4.24%
    Token: ' capital' (ID: 812) | Logprob: -3.9860 | %: 1.86%
    Token: ' place'

# Top_P

## Test 1 - Temperature 0, top_p 0.9

In this test we configure top_p so that tokens are added to the candidate list, most likely first, until their cumulative probability reaches 0.9 (~90%). This means the number of candidate tokens at each step depends on how the probability is distributed: some steps leave many tokens to choose from, others just a few. This will be easier to understand when we print the probabilities later.

At the same time we set temperature=0, so from that list we still pick the token with the highest probability. If we run the generation below 5 times, we will see the same output every time.
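
The way the candidate count varies per step can be sketched in plain Python. The helper `nucleus` is hypothetical and not part of vLLM; `flat_step` holds rounded step-1 probabilities recorded in this notebook's outputs, and the threshold of 0.3 mirrors the value used in Test 3 of this section rather than the 0.9 used here, so the cut-off is visible with only ten values.

```python
def nucleus(probs_desc, top_p):
    """Return the shortest prefix of probs_desc whose cumulative sum reaches top_p.

    Hypothetical helper; probs_desc must already be sorted high -> low.
    """
    kept, cumulative = [], 0.0
    for p in probs_desc:
        kept.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

# Rounded step-1 probabilities recorded in this notebook's outputs
flat_step = [0.0886, 0.0761, 0.0381, 0.0311, 0.0216,
             0.0136, 0.0136, 0.0129, 0.0124, 0.0117]
# Only the first value comes from the recorded output; the rest are made up
peaked_step = [0.3830, 0.05, 0.04]

print(len(nucleus(flat_step, 0.3)))    # flat distribution: many candidates survive
print(len(nucleus(peaked_step, 0.3)))  # peaked distribution: one candidate is enough
```

A flat distribution needs many tokens to reach the threshold, while a confident step reaches it with a single token.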

---



In [None]:
sampling_params = SamplingParams(temperature=0, top_p=0.9)

for i in range(0,5):
  print(f"Generation {i+1}")
  outputs = llm.generate(prompts, sampling_params)
  for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Generation 1
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 2
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 3
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 4
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
Generation 5
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'


## Test 2 - Temperature 0.9, top_p 0.9

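Same test as above, but this time we set the temperature to 0.9. Tokens are now sampled from the top_p candidate list instead of always taking the most likely one, so we will likely get a different output for each generation.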


In [None]:
sampling_params = SamplingParams(temperature=0.9, top_p=0.9)

for i in range(0,5):
  print(f"Generation {i+1}")
  outputs = llm.generate(prompts, sampling_params)
  for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Generation 1
Prompt: 'The capital of France is', Generated text: " just outside of Paris and you'll find many beautiful places in both city. If"
Generation 2
Prompt: 'The capital of France is', Generated text: ' seen as the highest concentrations of terrorism victims in the world, according to figures published'
Generation 3
Prompt: 'The capital of France is', Generated text: ' awash with laddies. From Paris to Bragat, and all'
Generation 4
Prompt: 'The capital of France is', Generated text: ' the city in the heart of Paris. It is one of the biggest, most'
Generation 5
Prompt: 'The capital of France is', Generated text: " not one of the best states in the world. I don't think the French"


## Test 3 - Printing probability for each token and cumulative probability

Here we can see both parameters at work in token selection. On one side, we see how the candidate list at each step grows until it reaches the top_p threshold (0.3 in this cell); on the other, we see how the token with the highest probability is picked from that list (temperature=0).

In [None]:
import math

# logprobs=-1 returns probabilities for every token in the model's vocabulary;
# we limit what we print once the top_p threshold is reached
sampling_params = SamplingParams(temperature=0, top_p=0.3, logprobs=-1)
# Used to stop printing the distribution/cumulative probabilities once this threshold is reached
target_p = 0.3
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
  prompt = output.prompt
  generated_text = output.outputs[0].text
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  token_logprobs_list = output.outputs[0].logprobs
  for i, top_k_dict in enumerate(token_logprobs_list):
    print(f"  Step {i+1}:")
    # Sort by probability (High -> Low)
    sorted_preds = sorted(
      top_k_dict.values(),
      key=lambda x: x.logprob,
      reverse=True
    )
    current_cumulative = 0.0
    for logprob_obj in sorted_preds:
      p = math.exp(logprob_obj.logprob)
      current_cumulative += p
      print(f"    Token: {logprob_obj.decoded_token!r:<12} | Prob: {p*100:5.2f}% | Cumul: {current_cumulative*100:5.2f}%")

      # We stop showing probabilities if we reach the threshold
      if current_cumulative >= target_p:
        print(f"    Top_p threshold reached ({current_cumulative*100:5.2f}%/{target_p*100}%). Skipping remaining vocab")
        break



Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital'
  Step 1:
    Token: ' the'       | Prob:  8.86% | Cumul:  8.86%
    Token: ' a'         | Prob:  7.61% | Cumul: 16.48%
    Token: ' not'       | Prob:  3.81% | Cumul: 20.29%
    Token: ' in'        | Prob:  3.11% | Cumul: 23.40%
    Token: ' being'     | Prob:  2.16% | Cumul: 25.56%
    Token: ' one'       | Prob:  1.36% | Cumul: 26.93%
    Token: ' home'      | Prob:  1.36% | Cumul: 28.29%
    Token: ' now'       | Prob:  1.29% | Cumul: 29.58%
    Token: ' getting'   | Prob:  1.24% | Cumul: 30.82%
    Top_p threshold reached (30.82%/30.0%). Skipping remaining vocab
  Step 2:
    Token: ' capital'   | Prob: 38.30% | Cumul: 38.30%
    Top_p threshold reached (38.30%/30.0%). Skipping remaining vocab
  Step 3:
    Token: ' of'        | Prob: 89.68% | Cumul: 89.68%
    Top_p threshold reached (89.68%/30.0%). Skipping remaining vocab
  Step 4:
    Token: ' the'