# LLMs and Uncertainty

Estimating uncertainty with LLMs is considerably more complicated than with regular regression or classification models.

The output of a classification model is just a probability distribution over $c$ classes:

$$
\hat{y}=[\hat{y}_1,\dots,\hat{y}_c]
$$
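As a small illustration (not part of the original notebook), the sketch below turns a made-up logit vector for $c=3$ classes into the distribution $\hat{y}$ and reads off two common uncertainty summaries: one minus the top probability, and the Shannon entropy. The logit values and the helper names `softmax` and `entropy` are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Arbitrary logits for a 3-class problem (illustrative values only)
logits = [2.0, 0.5, -1.0]
y_hat = softmax(logits)

least_confidence = 1.0 - max(y_hat)  # 0 when the model is fully confident
print("y_hat:", [round(p, 4) for p in y_hat])
print("1 - max prob:", round(least_confidence, 4))
print("entropy:", round(entropy(y_hat), 4))
```

With a single distribution per input, either number can serve as an uncertainty score; the difficulty with LLMs is that there is one such distribution per generated token.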

LLMs, instead, generate text sequentially.
Text is composed of tokens, which make up words, which in turn make up sentences.

We start with a prompt composed of a variable number, say $p$, of tokens $\tau$:

$$
\text{prompt} = [\tau_1,\dots,\tau_p]
$$

and the model then generates up to $k$ tokens:

$$
\hat{y} = [\hat{\tau}_1,\dots,\hat{\tau}_k]
$$

Each output token $\hat{\tau}_j$ is determined by a probability distribution over the full dictionary (vocabulary) of tokens, which resembles the usual classification setting.
At each generation step, say $j$, the model input consists of the initial prompt plus all the tokens generated up to step $j-1$:

$$
\text{prompt}_j = \text{prompt} + [\hat{\tau}_1,\dots,\hat{\tau}_{j-1}]
$$

The generation process can therefore be viewed as a sequential classification problem.
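The loop below is a toy sketch of this sequential view. The function `toy_next_token_distribution` is a hypothetical stand-in for an LLM forward pass (it follows a fixed made-up rule, not a trained model), and the five-token vocabulary is invented for the example. What matters is the structure: at every step the context is the prompt plus all previously generated tokens, and each step is a classification over the vocabulary.

```python
import random

VOCAB = ["Paris", "is", "the", "capital", "<eos>"]

def toy_next_token_distribution(context):
    """Hypothetical stand-in for a model: one probability per vocabulary entry."""
    favoured = len(context) % len(VOCAB)  # made-up rule, for illustration only
    probs = [0.05] * len(VOCAB)
    probs[favoured] = 1.0 - 0.05 * (len(VOCAB) - 1)
    return probs

def generate(prompt_tokens, max_new_tokens, rng):
    """Generate up to max_new_tokens tokens, one classification step at a time."""
    generated = []
    for _ in range(max_new_tokens):
        # prompt_j = prompt + [tau_1, ..., tau_{j-1}]
        context = list(prompt_tokens) + [tok for tok, _ in generated]
        probs = toy_next_token_distribution(context)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        generated.append((token, probs[VOCAB.index(token)]))
        if token == "<eos>":
            break
    return generated

steps = generate(["What", "is", "the", "capital", "of", "France", "?"],
                 max_new_tokens=5, rng=random.Random(0))
for token, prob in steps:
    print(f"{token!r}: probability {prob:.2f}")
```

Each generated token comes with the probability it was assigned at its own step, which is the raw material for the token-level uncertainty measures discussed later.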

The second complication, after the sequential nature of generation, is that an LLM task often has no single "correct" solution: a question can be answered in several different ways. Each token has its own probability, so answers with the same level of "correctness" can end up with different uncertainties.
This also makes it hard to tell whether the model has low confidence in a token because several continuations are valid, or because it truly lacks knowledge about the specific question or topic.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [1]:
from huggingface_hub import login as huggingface_login
from utils import decrypt_huggingface_token

huggingface_login(token=decrypt_huggingface_token())

Hugging Face token decrypted successfully.


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

### Loading the model

We load one of the many pre-trained models available on Hugging Face.
Among the several options, we use Qwen2.5-7B-Instruct-1M, a ChatGPT-like model with a relatively small memory footprint.


In [4]:
model_name = "Qwen/Qwen2.5-7B-Instruct-1M" 

# For larger models, consider quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    #bnb_4bit_use_double_quant=True, # Often helps
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # Apply quantization if desired
    # torch_dtype=torch.bfloat16, # Or torch.float16
    device_map="auto" # Automatically distribute model layers across available devices
)

Loading checkpoint shards: 100%|██████████| 4/4 [02:01<00:00, 30.43s/it]


In [32]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-05)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-05)
      )
    )
    (norm): Qwen2RMSNorm((3584,), 
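The layer shapes printed above allow a rough parameter count and a back-of-the-envelope estimate of the weight memory. The sketch below is only an approximation under stated assumptions: the output head (cut off in the printout) is assumed to be an untied linear layer with the same shape as the embedding, only the `Linear4bit` layers are assumed to be stored in 4 bits, everything else is assumed to be kept in 16 bits, and quantization metadata, activations, and the KV cache are ignored.

```python
# Dimensions read from the printed model structure
HIDDEN, KV_DIM, MLP_DIM, VOCAB, LAYERS = 3584, 512, 18944, 152064, 28

def linear_params(in_features, out_features, bias):
    """Number of parameters of a linear layer."""
    return in_features * out_features + (out_features if bias else 0)

attention = (
    linear_params(HIDDEN, HIDDEN, bias=True)        # q_proj
    + 2 * linear_params(HIDDEN, KV_DIM, bias=True)  # k_proj, v_proj
    + linear_params(HIDDEN, HIDDEN, bias=False)     # o_proj
)
mlp = (
    2 * linear_params(HIDDEN, MLP_DIM, bias=False)  # gate_proj, up_proj
    + linear_params(MLP_DIM, HIDDEN, bias=False)    # down_proj
)
norms_per_layer = 2 * HIDDEN                        # two RMSNorm weight vectors

embedding = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN  # ASSUMPTION: untied head, same shape as the embedding

quantized = LAYERS * (attention + mlp)              # Linear4bit layers
unquantized = LAYERS * norms_per_layer + embedding + HIDDEN + lm_head
total = quantized + unquantized

# ASSUMPTION: 0.5 bytes per 4-bit weight, 2 bytes per 16-bit weight
approx_gib = (quantized * 0.5 + unquantized * 2) / 2**30

print(f"Total parameters: {total / 1e9:.2f} B")
print(f"Approximate weight memory: {approx_gib:.1f} GiB")
```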

### Regular text generation

Let us first see how the model can be prompted for a response.

In [5]:
from transformers import pipeline

# Using the pipeline for simplicity
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Write a short story about a cat who learns to fly."
generated_text = generator(prompt, max_new_tokens=100, num_return_sequences=1)

print(generated_text[0]["generated_text"])

Device set to use cuda:0


Write a short story about a cat who learns to fly. In the quiet town of Willowbrook, nestled between rolling hills and whispering woods, lived a curious calico cat named Whiskers. She was no ordinary cat. With fur that shimmered in the sunlight and eyes that sparkled like stars, Whiskers was always eager for new adventures. But what set her apart from other cats was her insatiable curiosity and a peculiar fascination with the sky.

One crisp autumn morning, as the leaves danced in the breeze, Whiskers found herself


### Generate tokens

Plain text output is not enough for our purposes: we need the model to return the scores (logits) associated with each generated token. To get them, along with other information on the output, we call the model's `generate` method directly.

The cell below shows an example of how to inspect the output.
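One detail is worth keeping in mind when reading such output. With `output_scores=True`, Hugging Face `generate` returns the processed scores, that is, the logits after temperature scaling and top-k/top-p filtering. Filtered tokens are set to minus infinity, so they receive probability exactly zero after the softmax. The toy sketch below reproduces this effect in pure Python; the helper `process_scores` and the logit values are assumptions made for the example, and it only mimics temperature and top-k (not top-p).

```python
import math

def process_scores(logits, temperature=0.7, top_k=2):
    """Toy temperature scaling followed by top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Keep the top_k largest values and mask the rest (ties may keep a few more)
    threshold = sorted(scaled, reverse=True)[top_k - 1]
    return [x if x >= threshold else float("-inf") for x in scaled]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

raw_logits = [4.0, 3.0, 1.0, -2.0]  # illustrative values only
raw_probs = softmax(raw_logits)
processed_probs = softmax(process_scores(raw_logits))

print("raw:      ", [round(p, 4) for p in raw_probs])
print("processed:", [round(p, 4) for p in processed_probs])
```

The processed distribution is sharper than the raw one and assigns zero mass to the filtered tokens, so uncertainty computed from processed scores reflects the sampling configuration as well as the model.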

In [6]:
prompt = "What is the capital of France?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

generation_output = model.generate(
    input_ids,
    max_new_tokens=15, # Generate up to 15 new tokens
    return_dict_in_generate=True,
    output_scores=True, # Return the per-step scores (logits after temperature/top-k/top-p processing)
    do_sample=True,     # Sample from the distribution instead of decoding greedily
    temperature=0.7,    # Temperature below 1 sharpens the distribution (less randomness)
    top_k=50,           # Top-k sampling
    top_p=0.95          # Top-p (nucleus) sampling
)

generated_ids = generation_output.sequences[0]
generated_scores = generation_output.scores

start_index_of_new_tokens = input_ids.shape[1]
new_generated_ids = generated_ids[start_index_of_new_tokens:]

print(f"Prompt (token IDs): {input_ids[0].tolist()}")
print(f"Generated sequence (full token IDs): {generated_ids.tolist()}")
print(f"Newly generated tokens (IDs): {new_generated_ids.tolist()}")

print("\n--- Detailed Output ---")
decoded_tokens_with_softmax = []

# Process each newly generated token and its corresponding scores
for i, token_id in enumerate(new_generated_ids):
    # Get the logits for the i-th generated token
    # scores[i] corresponds to the logits for predicting the (i+1)th generated token
    # (after the first i tokens were generated)
    logits_for_current_token = generated_scores[i][0] # [0] because batch_size is 1

    # Apply softmax to get probabilities
    probabilities = torch.softmax(logits_for_current_token, dim=-1)

    # Get the probability of the *chosen* token
    chosen_token_prob = probabilities[token_id].item()

    # Get the top N probable tokens and their probabilities for this step
    top_k_values, top_k_indices = torch.topk(probabilities, k=5) # Get top 5

    # Decode the chosen token
    decoded_chosen_token = tokenizer.decode(token_id)

    print(f"\nToken {i+1}: '{decoded_chosen_token}' (ID: {token_id})")
    print(f"Probability of chosen token: {chosen_token_prob:.4f}")
    print("Top 5 predictions for this step:")
    for j in range(top_k_values.shape[0]):
        top_prob = top_k_values[j].item()
        top_token_id = top_k_indices[j].item()
        top_decoded_token = tokenizer.decode(top_token_id)
        print(f"  - '{top_decoded_token}' (ID: {top_token_id}): {top_prob:.4f}")

    decoded_tokens_with_softmax.append({
        'token_id': token_id.item(),
        'decoded_token': decoded_chosen_token,
        'probability_of_chosen': chosen_token_prob,
        'top_predictions': [
            {'token_id': top_k_indices[j].item(), 'decoded_token': tokenizer.decode(top_k_indices[j].item()), 'probability': top_k_values[j].item()}
            for j in range(top_k_values.shape[0])
        ]
    })

print("\n--- Final Structured Output ---")
import json
print(json.dumps(decoded_tokens_with_softmax, indent=2, ensure_ascii=False))

# To get the full generated text from the token IDs
full_decoded_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(f"\nFull generated text: {full_decoded_text}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Prompt (token IDs): [3838, 374, 279, 6722, 315, 9625, 30]
Generated sequence (full token IDs): [3838, 374, 279, 6722, 315, 9625, 30, 576, 6722, 315, 9625, 374, 12095, 13, 12095, 702, 1012, 279, 6722, 2474, 220, 20]
Newly generated tokens (IDs): [576, 6722, 315, 9625, 374, 12095, 13, 12095, 702, 1012, 279, 6722, 2474, 220, 20]

--- Detailed Output ---

Token 1: ' The' (ID: 576)
Probability of chosen token: 0.9424
Top 5 predictions for this step:
  - ' The' (ID: 576): 0.9424
  - ' What' (ID: 3555): 0.0300
  - ' (' (ID: 320): 0.0142
  - ' -' (ID: 481): 0.0135
  - '!' (ID: 0): 0.0000

Token 2: ' capital' (ID: 6722)
Probability of chosen token: 1.0000
Top 5 predictions for this step:
  - ' capital' (ID: 6722): 1.0000
  - '#' (ID: 2): 0.0000
  - '!' (ID: 0): 0.0000
  - '$' (ID: 3): 0.0000
  - '"' (ID: 1): 0.0000

Token 3: ' of' (ID: 315)
Probability of chosen token: 1.0000
Top 5 predictions for this step:
  - ' of' (ID: 315): 1.0000
  - '#' (ID: 2): 0.0000
  - '!' (ID: 0): 0.0000
  - '$' (ID

Let's create a helper function that returns the logits alongside the generated text.

In [7]:
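def generate_with_logits(prompt, model, tokenizer, max_new_tokens=15):
    """Sample a response and return the generation output plus the decoded text."""
    # Tokenize and keep the attention mask, so that `generate` does not have to infer it
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    generation_output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
        output_scores=True,  # per-step scores for each generated token
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
    )

    # sequences[0] holds the prompt tokens followed by the generated tokens
    generated_ids = generation_output.sequences[0]
    full_decoded_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

    return generation_output, full_decoded_text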

### Verbalized uncertainty

The first thing we can do is prompt the model to state its own uncertainty. Here is an example prompt, taken from the paper [On Verbalized Confidence Scores for LLMs, Yang et al., 2024](https://arxiv.org/abs/2412.14737):

In [8]:
NO_ANSWER_TEXT = "NO ANSWER"
PROMPT_NO_ANSWER = f"If you cannot provide an answer, answer with `{NO_ANSWER_TEXT}`."

PATTERN_SEP = r"\n(.*\n)*?"
PATTERN_ANSWER = r".+"
PATTERN_FLOAT = r"\d*\.?\d+"

VERBALIZED_UNCERTAINTY_PROMPT = f"After your answer, provide a confidence score in percentage which measures how confident you are in your answer. Use the following format to respond:\n```\nAnswer: [Write your answer here.]\nConfidence: [Write your confidence score here.]\n```\n{PROMPT_NO_ANSWER}"

In [9]:
prompt_complete = prompt + "\n" + VERBALIZED_UNCERTAINTY_PROMPT
generator(prompt_complete, max_new_tokens=100, num_return_sequences=1)

[{'generated_text': 'What is the capital of France?\nAfter your answer, provide a confidence score in percentage which measures how confident you are in your answer. Use the following format to respond:\n```\nAnswer: [Write your answer here.]\nConfidence: [Write your confidence score here.]\n```\nIf you cannot provide an answer, answer with `NO ANSWER`. Confidence score should be between 0 and 100.\n```\nAnswer: Paris\nConfidence: 100\n```'}]
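To use the verbalized score programmatically, the answer and the confidence have to be extracted from the generated text. The function below is a minimal sketch of such a parser; `parse_verbalized_confidence` is a hypothetical helper (it is not taken from the paper or from the notebook), and it assumes the model follows the `Answer:` / `Confidence:` format requested in the prompt. It takes the last matching pair, because the generated text also contains the format template from the prompt itself.

```python
import re

def parse_verbalized_confidence(text, no_answer_text="NO ANSWER"):
    """Return (answer, confidence in [0, 1]), or (None, None) if nothing usable is found."""
    pattern = re.compile(r"Answer:\s*(.+?)\s*\n\s*Confidence:\s*(\d*\.?\d+)")
    matches = pattern.findall(text)
    if not matches:
        return None, None
    answer, confidence = matches[-1]  # the last pair is the model's actual reply
    if answer == no_answer_text:
        return None, None
    # Percentages are mapped to [0, 1]; values above 100 are clipped
    return answer, min(float(confidence) / 100.0, 1.0)

generated_text = (
    "What is the capital of France?\n"
    "Use the following format to respond:\n```\n"
    "Answer: [Write your answer here.]\n"
    "Confidence: [Write your confidence score here.]\n```\n"
    "```\nAnswer: Paris\nConfidence: 100\n```"
)
print(parse_verbalized_confidence(generated_text))
```

The template lines are skipped automatically because their `Confidence:` field does not contain a number.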

### Token uncertainty

Let us switch our attention to token-level uncertainty.
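Within the `uncertainty` library I have created, you will find 3 different implementations of uncertainty. In the formulas below, $\max(\tau_j)$ is shorthand for the single-token confidence at generation step $j$, i.e. the largest probability in the predicted distribution: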

1. **Naive token uncertainty (1 - product of single-token confidence)**

$$
\text{Uncertainty}_{\text{naive}} = 1 - \prod_{j=1}^{k} \max(\tau_j)
$$

2. **Vanilla token uncertainty (1 - average of single-token confidence)**

$$
\text{Uncertainty}_{\text{vanilla}} = 1 - \frac{\sum_{j=1}^{k} \max(\tau_j)}{k}
$$

3. **LogTokU (from [Estimating LLM Uncertainty with Evidence, Ma et al., 2025](https://arxiv.org/abs/2502.00290))**

LogTokU disentangles per-token aleatoric and epistemic uncertainty in the **logit space** of tokens.

We retrieve the logits associated with the token prediction $\tau_j$ and strip their negative part by applying an element-wise ReLU.
Remember that, at each generation step, the logits form a real-valued vector over the whole dictionary of size $V$:

![](imgs/logits.png)

We restrict ourselves to the logit vector $l^{(j)}$ of a single generation step $j$.

- We select only the top-$\kappa$ values: $\alpha^{(j)} \doteq \text{top}\kappa_{v\in\{1,\dots,V\}}(l^{(j)}_v)$
- We suppress the negative coefficients by means of element-wise ReLU: $\alpha^{(j)} \leftarrow \text{ReLU}(\alpha^{(j)})$
- We define the total evidence associated with the vector $\alpha^{(j)}$: $\alpha_0^{(j)} \doteq \sum_{t=1}^{\kappa}\alpha_t^{(j)}$

Then we can disentangle aleatoric and epistemic uncertainty:

$$
\text{Aleatoric}_j = - \sum_{t=1}^{\kappa} \frac{\alpha^{(j)}_t}{\alpha_0^{(j)}}\left(\digamma(\alpha_t^{(j)} + 1 ) - \digamma(\alpha_0^{(j)} + 1 )\right),
$$

where $\digamma$ denotes the [digamma function](https://en.wikipedia.org/wiki/Digamma_function), and

$$
\text{Epistemic}_j = \frac{\kappa}{\sum_{t=1}^{\kappa} (\alpha_t^{(j)} + 1)}.
$$

The interpretation of the aleatoric and epistemic uncertainty is as follows:

- An output with high aleatoric uncertainty indicates indecision of the model among several candidate tokens when determining the "correct" answer to a problem; this can also happen because there are multiple correct options
- An output with high epistemic uncertainty indicates a general lack of knowledge of the model in the specific domain of the prompt, since little evidence supports any of the candidate tokens

The authors propose to combine aleatoric and epistemic uncertainty into an unreliability metric:

$$
\text{Unreliability}_j = \text{Epistemic}_j\cdot\text{Aleatoric}_j.
$$
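The per-token quantities above can be sketched in a few lines. This is a dependency-free illustration of the formulas as written in this section, not the implementation of the `uncertainty` library or of the paper's code. The `digamma` helper is a standard numerical approximation (recurrence plus asymptotic series) included only to avoid external dependencies; in practice `torch.digamma` or `scipy.special.digamma` would be the natural choice. The guard for the case where no logit is positive is a choice of this sketch, since the aleatoric formula is undefined there.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift the argument up, using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def logtoku_token(logits, kappa):
    """Aleatoric, epistemic, and unreliability scores for one generation step."""
    if not 1 <= kappa <= len(logits):
        raise ValueError("kappa must be between 1 and the number of logits")
    top = sorted(logits, reverse=True)[:kappa]  # top-kappa logits
    alpha = [max(v, 0.0) for v in top]          # element-wise ReLU
    alpha_0 = sum(alpha)                        # total evidence
    if alpha_0 == 0.0:
        raise ValueError("no positive logit: aleatoric term is undefined")
    aleatoric = -sum(
        (a / alpha_0) * (digamma(a + 1.0) - digamma(alpha_0 + 1.0)) for a in alpha
    )
    epistemic = kappa / sum(a + 1.0 for a in alpha)
    return aleatoric, epistemic, aleatoric * epistemic

# Illustrative logit vectors (made-up values)
peaked = logtoku_token([20.0, 1.0, 0.5, -3.0], kappa=3)   # one strong candidate
flat = logtoku_token([20.0, 20.0, 20.0, -3.0], kappa=3)   # several strong candidates
weak = logtoku_token([1.0, 0.9, 0.8, -3.0], kappa=3)      # little evidence overall
for name, (au, eu, unrel) in [("peaked", peaked), ("flat", flat), ("weak", weak)]:
    print(f"{name}: aleatoric={au:.3f} epistemic={eu:.3f} unreliability={unrel:.3f}")
```

In this toy comparison the flat vector has the largest aleatoric term and the weak vector the largest epistemic term, matching the roles of the two formulas.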

For a generic response, we can then compute the total unreliability of the answer as

$$
\text{Unreliability} = \frac{\sum_{j=1}^{k} \text{Unreliability}_j}{k},
$$

even though the authors suggest restricting the calculation to the $K$ most unreliable tokens, where $K$ is a separate parameter from the $\kappa$ used to select the logits.
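The response-level aggregations in this section (naive, vanilla, and averaged unreliability) can be sketched as follows. The function names are hypothetical and do not necessarily match the `uncertainty` library. The confidences in the first example are the probabilities of the first three generated tokens printed earlier (0.9424, 1.0, 1.0); the per-token unreliability values are made up. Clamping $K$ to the response length is a choice of this sketch.

```python
import math

def naive_uncertainty(confidences):
    """1 - product of the single-token confidences."""
    return 1.0 - math.prod(confidences)

def vanilla_uncertainty(confidences):
    """1 - average of the single-token confidences."""
    return 1.0 - sum(confidences) / len(confidences)

def mean_unreliability(per_token, top_k=None):
    """Average unreliability, optionally over the top_k most unreliable tokens only."""
    values = sorted(per_token, reverse=True)
    if top_k is not None:
        values = values[: min(top_k, len(values))]  # never ask for more than available
    return sum(values) / len(values)

short = [0.9424, 1.0, 1.0]   # confidences of the first three tokens shown earlier
long = [0.9] * 10            # ten tokens, each fairly confident (made-up values)

print("short:", naive_uncertainty(short), vanilla_uncertainty(short))
print("long: ", naive_uncertainty(long), vanilla_uncertainty(long))
print("top-2 unreliability:", mean_unreliability([0.1, 0.4, 0.2], top_k=2))
```

The naive score grows with the length of the response, because a product of numbers below one keeps shrinking, while the vanilla score stays at the per-token level. This is one reason to prefer averaged or top-$K$ aggregations for long answers.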

In [1]:
from uncertainty import token_uncertainty_naive, token_uncertainty_vanilla, logTokU, get_token_confidence


In [2]:
PROMPT_TOKEN_UNCERTAINTY = "Respond to the question with a short answer. If you are prompted to provide a single answer, just respond with that answer. Do not add any extra detail."

In [3]:
prompt = f"{PROMPT_TOKEN_UNCERTAINTY}\nWhat is the capital of France?"
generation_output, full_decoded_text = generate_with_logits(prompt, model, tokenizer, max_new_tokens=100)
print(full_decoded_text)

NameError: name 'generate_with_logits' is not defined

In [147]:
logTokU(generation_output, top_k_inconfident=10)

RuntimeError: selected index k out of range

In [136]:
prompt = "What is the capital of Plamplamping?"
generation_output, full_decoded_text = generate_with_logits(prompt, model, tokenizer, max_new_tokens=100)
print(full_decoded_text)

What is the capital of Plamplamping? - Answers\nMath and Arithmetic\nWhat is the capital of Plamplamping?\nAsked by Wiki User\nBe the first to answer!\n🙏\n01:42\nHD\n серьезн\nAnswer\n🙏\n01:42\nHD\n-seri-зн\nRelated Questions\nWhat is the capital of Plamplamping?\nWhat is the capital of Swaziland?\nWhat is the capital of the capital of the capital of the capital of the


In [137]:
logTokU(generation_output, top_k_inconfident=10)

tensor(-0.1575)

In [138]:
prompt = "Could you give me one name of president?"
generation_output, full_decoded_text = generate_with_logits(prompt, model, tokenizer, max_new_tokens=100)
print(full_decoded_text)

Could you give me one name of president? Sure! Abraham Lincoln was the 16th president of the United States. If you need more information or another example, feel free to ask! 😊


In [139]:
logTokU(generation_output, top_k_inconfident=10)

tensor(-0.0751)