# Testing New White-Box Scorers in UQEnsemble

This notebook demonstrates that UQEnsemble now supports all 9 white-box scorers (previously only 2 were supported).

In [None]:
# Import required libraries
from uqlm.scorers.shortform.ensemble import UQEnsemble
from langchain_openai import ChatOpenAI

# Note: You'll need to set your OpenAI API key, for example:
# import os
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

## 1. Test Single-Generation Scorers

These scorers use a single response with its logprobs:
- `min_probability` - Minimum token probability
- `sequence_probability` - Overall sequence probability (length-normalized)
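
As a rough illustration of what these two scores measure, the sketch below computes them directly from a list of per-token log-probabilities. The function names and the `length_normalize` flag are illustrative assumptions, not the library's internals, and the input values are made up.

```python
import math


def min_probability(logprobs):
    """Smallest per-token probability in the response."""
    return min(math.exp(lp) for lp in logprobs)


def sequence_probability(logprobs, length_normalize=True):
    """Joint token probability; the geometric mean when length-normalized."""
    total = sum(logprobs)
    if length_normalize:
        total /= len(logprobs)
    return math.exp(total)


# Hypothetical token log-probabilities for a four-token response
token_logprobs = [math.log(p) for p in (0.99, 0.90, 0.999, 0.95)]

print(f"Min probability: {min_probability(token_logprobs):.4f}")
print(f"Sequence probability: {sequence_probability(token_logprobs):.4f}")
```

A single low-probability token drags `min_probability` down sharply, while the length-normalized sequence probability averages it out across the response.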

In [2]:
# Create LLM with logprobs enabled
llm = ChatOpenAI(temperature=0.7, model="gpt-3.5-turbo", logprobs=True)

# Test single-generation scorers
ensemble_single = UQEnsemble(llm=llm, scorers=["min_probability", "sequence_probability"], device="cpu")

print("✅ Single-generation scorers accepted!")
print(f"White-box components: {ensemble_single.white_box_components}")
print(f"WhiteBoxUQ scorers: {ensemble_single.white_box_object.scorers}")

✅ Single-generation scorers accepted!
White-box components: ['min_probability', 'sequence_probability']
WhiteBoxUQ scorers: ['min_probability', 'sequence_probability']


In [4]:
# Generate and score with single-generation scorers
prompts = ["What is the capital of France?"]

result_single = await ensemble_single.generate_and_score(
    prompts=prompts,
    num_responses=1,  # Only need 1 response for single-generation scorers
    show_progress_bars=True,
)

print("\n📊 Results:")
print(f"Response: {result_single.data['responses'][0]}")
print(f"Min Probability: {result_single.data['min_probability'][0]:.4f}")
print(f"Sequence Probability: {result_single.data['sequence_probability'][0]:.4f}")
print(f"Ensemble Score: {result_single.data['ensemble_scores'][0]:.4f}")


📊 Results:
Response: The capital of France is Paris.
Min Probability: 0.9996
Sequence Probability: 0.9999
Ensemble Score: 0.9998


## 2. Test Top-Logprobs Scorers

These scorers use the top-k alternative tokens at each position:
- `min_token_negentropy` - Minimum negentropy across tokens
- `mean_token_negentropy` - Average negentropy across tokens
- `probability_margin` - Mean difference between top-2 token probabilities
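
To make these definitions concrete, here is a minimal sketch that works on the top-k log-probabilities at each token position. The normalization used for negentropy (one minus entropy divided by log k, over the renormalized top-k distribution) is an assumption chosen so that scores fall in [0, 1]; the library may compute it differently. All names and numbers are hypothetical.

```python
import math


def token_negentropy(top_logprobs):
    """1 - H / log(k) over the renormalized top-k distribution (assumed form)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def token_margin(top_logprobs):
    """Gap between the two most likely tokens at one position."""
    probs = sorted((math.exp(lp) for lp in top_logprobs), reverse=True)
    return probs[0] - probs[1]


# Hypothetical top-3 log-probabilities at two token positions
per_token_top = [
    [math.log(0.97), math.log(0.02), math.log(0.01)],  # confident token
    [math.log(0.60), math.log(0.30), math.log(0.10)],  # uncertain token
]

negentropies = [token_negentropy(t) for t in per_token_top]
margins = [token_margin(t) for t in per_token_top]

print(f"Min token negentropy: {min(negentropies):.4f}")
print(f"Mean token negentropy: {sum(negentropies) / len(negentropies):.4f}")
print(f"Probability margin: {sum(margins) / len(margins):.4f}")
```

Under this form, a uniform top-k distribution scores 0 and a distribution concentrated on one token approaches 1.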

In [5]:
# Test top-logprobs scorers (will show beta warning)
ensemble_top = UQEnsemble(llm=llm, scorers=["min_token_negentropy", "mean_token_negentropy", "probability_margin"], device="cpu")

print("✅ Top-logprobs scorers accepted!")
print(f"White-box components: {ensemble_top.white_box_components}")

✅ Top-logprobs scorers accepted!
White-box components: ['min_token_negentropy', 'mean_token_negentropy', 'probability_margin']




In [6]:
# Generate and score with top-logprobs scorers
result_top = await ensemble_top.generate_and_score(prompts=prompts, num_responses=1, show_progress_bars=True)

print("\n📊 Results:")
print(f"Response: {result_top.data['responses'][0]}")
print(f"Min Token Negentropy: {result_top.data['min_token_negentropy'][0]:.4f}")
print(f"Mean Token Negentropy: {result_top.data['mean_token_negentropy'][0]:.4f}")
print(f"Probability Margin: {result_top.data['probability_margin'][0]:.4f}")
print(f"Ensemble Score: {result_top.data['ensemble_scores'][0]:.4f}")



📊 Results:
Response: The capital of France is Paris.
Min Token Negentropy: 0.9985
Mean Token Negentropy: 0.9997
Probability Margin: 0.9999
Ensemble Score: 0.9994


## 3. Test Sampled-Logprobs Scorers

These scorers require multiple sampled responses:
- `semantic_negentropy` - Entropy based on semantic clustering
- `semantic_density` - Density-based confidence
- `monte_carlo_probability` - Average sequence probability
- `consistency_and_confidence` - Cosine similarity × response probability
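
The sketch below illustrates two of these ideas in simplified form. It assumes the sampled responses have already been assigned semantic cluster IDs (the real scorer needs a model for that step, and may also weight clusters by response probability), and it assumes a negentropy normalized by the log of the number of samples. Names and values are hypothetical.

```python
import math
from collections import Counter


def monte_carlo_probability(sequence_probs):
    """Average sequence probability across the sampled responses."""
    return sum(sequence_probs) / len(sequence_probs)


def cluster_negentropy(cluster_ids):
    """1 - H / log(n) over the empirical cluster distribution (assumed form)."""
    n = len(cluster_ids)
    if n < 2:
        return 1.0  # a single sample carries no disagreement
    counts = Counter(cluster_ids)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)


# Hypothetical results for five sampled responses to one prompt
sampled_sequence_probs = [0.98, 0.97, 0.99, 0.60, 0.96]
agreeing_clusters = [0, 0, 0, 0, 0]   # all samples mean the same thing
scattered_clusters = [0, 1, 2, 3, 4]  # every sample means something different

print(f"Monte Carlo probability: {monte_carlo_probability(sampled_sequence_probs):.4f}")
print(f"Negentropy (agreeing): {cluster_negentropy(agreeing_clusters):.4f}")
print(f"Negentropy (scattered): {cluster_negentropy(scattered_clusters):.4f}")
```

When every sample lands in one cluster the score is 1, and when every sample lands in its own cluster it drops to 0.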

In [7]:
# Test sampled-logprobs scorers
# Only two of the four sampled-logprobs scorers listed above are exercised here
ensemble_sampled = UQEnsemble(llm=llm, scorers=["semantic_negentropy", "monte_carlo_probability"], device="cpu")

print("✅ Sampled-logprobs scorers accepted!")
print(f"White-box components: {ensemble_sampled.white_box_components}")

✅ Sampled-logprobs scorers accepted!
White-box components: ['semantic_negentropy', 'monte_carlo_probability']


In [8]:
# Generate and score with sampled-logprobs scorers
# Note: This will generate multiple responses automatically
result_sampled = await ensemble_sampled.generate_and_score(
    prompts=prompts,
    num_responses=5,  # Need multiple responses for these scorers
    show_progress_bars=True,
)

print("\n📊 Results:")
print(f"Response: {result_sampled.data['responses'][0]}")
print(f"Semantic Negentropy: {result_sampled.data['semantic_negentropy'][0]:.4f}")
print(f"Monte Carlo Probability: {result_sampled.data['monte_carlo_probability'][0]:.4f}")
print(f"Ensemble Score: {result_sampled.data['ensemble_scores'][0]:.4f}")



📊 Results:
Response: The capital of France is Paris.
Semantic Negentropy: 1.0000
Monte Carlo Probability: 0.9999
Ensemble Score: 1.0000


## 4. Test P(True) Scorer

This scorer asks the LLM to estimate the probability that its response is true.
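
One common way to implement this idea is to show the model its own answer, ask whether it is true or false, and read the probability of the first answer token. The sketch below shows only that conversion step. The prompt template and helper name are hypothetical, and the library's actual prompt and parsing may differ.

```python
import math

# Hypothetical self-evaluation template; the library's wording may differ.
P_TRUE_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer True or False? Reply with one word."
)


def p_true_from_logprob(first_token, logprob):
    """Convert the first answer token and its log-probability into P(True)."""
    prob = math.exp(logprob)
    token = first_token.strip().lower()
    if token == "true":
        return prob
    if token == "false":
        return 1.0 - prob
    raise ValueError(f"Unexpected answer token: {first_token!r}")


prompt = P_TRUE_TEMPLATE.format(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
)

# Hypothetical model replies: (first token, log-probability of that token)
print(f"{p_true_from_logprob('True', math.log(0.98)):.4f}")
print(f"{p_true_from_logprob('False', math.log(0.80)):.4f}")
```

A confident "False" maps to a low score, so the value can be read as confidence in the original response in both cases.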

In [9]:
# Test p_true scorer
ensemble_ptrue = UQEnsemble(llm=llm, scorers=["p_true"], device="cpu")

print("✅ P(True) scorer accepted!")
print(f"White-box components: {ensemble_ptrue.white_box_components}")

✅ P(True) scorer accepted!
White-box components: ['p_true']


In [10]:
# Generate and score with p_true scorer
result_ptrue = await ensemble_ptrue.generate_and_score(prompts=prompts, num_responses=1, show_progress_bars=True)

print("\n📊 Results:")
print(f"Response: {result_ptrue.data['responses'][0]}")
print(f"P(True): {result_ptrue.data['p_true'][0]:.4f}")
print(f"Ensemble Score: {result_ptrue.data['ensemble_scores'][0]:.4f}")



📊 Results:
Response: The capital of France is Paris.
P(True): 1.0000
Ensemble Score: 1.0000


## 5. Test Combined Ensemble

Scorers of different types, including black-box scorers, can be combined in a single ensemble.

In [11]:
# Combine multiple scorer types
ensemble_combined = UQEnsemble(
    llm=llm,
    scorers=[
        "sequence_probability",  # single-generation
        "min_token_negentropy",  # top-logprobs
        "monte_carlo_probability",  # sampled-logprobs
        "p_true",  # p_true
        "exact_match",  # black-box (for comparison)
    ],
    device="cpu",
)

print("✅ Combined ensemble created!")
print(f"White-box components: {ensemble_combined.white_box_components}")
print(f"Black-box components: {ensemble_combined.black_box_components}")
print(f"All components: {ensemble_combined.component_names}")

✅ Combined ensemble created!
White-box components: ['sequence_probability', 'min_token_negentropy', 'monte_carlo_probability', 'p_true']
Black-box components: ['exact_match']
All components: ['sequence_probability', 'min_token_negentropy', 'monte_carlo_probability', 'p_true', 'exact_match']




In [12]:
# Generate and score with combined ensemble
result_combined = await ensemble_combined.generate_and_score(prompts=prompts, num_responses=5, show_progress_bars=True)

print("\n📊 Combined Results:")
print(f"Response: {result_combined.data['responses'][0]}")
print("\nScores:")
print(f"  Sequence Probability: {result_combined.data['sequence_probability'][0]:.4f}")
print(f"  Min Token Negentropy: {result_combined.data['min_token_negentropy'][0]:.4f}")
print(f"  Monte Carlo Probability: {result_combined.data['monte_carlo_probability'][0]:.4f}")
print(f"  P(True): {result_combined.data['p_true'][0]:.4f}")
print(f"  Exact Match: {result_combined.data['exact_match'][0]:.4f}")
print(f"\n  Ensemble Score: {result_combined.data['ensemble_scores'][0]:.4f}")



📊 Combined Results:
Response: The capital of France is Paris.

Scores:
  Sequence Probability: 0.9999
  Min Token Negentropy: 0.9987
  Monte Carlo Probability: 0.9999
  P(True): 0.9999
  Exact Match: 1.0000

  Ensemble Score: 0.9997


## 6. Print Ensemble Weights

See how each scorer contributes to the final ensemble score:

In [13]:
ensemble_combined.print_ensemble_weights()
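
The ensemble score is a weighted combination of the component scores. The sketch below assumes a plain weighted average with equal weights when none are supplied; that default is an assumption rather than something the notebook confirms, although it is consistent with the combined results printed in section 5. The helper name is hypothetical, and the inputs are the rounded values from that output.

```python
def weighted_average(scores, weights=None):
    """Combine component scores; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


# Rounded component scores copied from the combined results in section 5
component_scores = {
    "sequence_probability": 0.9999,
    "min_token_negentropy": 0.9987,
    "monte_carlo_probability": 0.9999,
    "p_true": 0.9999,
    "exact_match": 1.0000,
}

ensemble_score = weighted_average(list(component_scores.values()))
print(f"Ensemble score: {ensemble_score:.4f}")  # Ensemble score: 0.9997
```

Passing explicit `weights` shifts the result toward the scorers you trust more, which is the effect tuned weights would have.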

## Summary

✅ **All 9 white-box scorers now work with UQEnsemble!**

Previously supported (2):
- `min_probability`
- ~~`normalized_probability`~~ (deprecated)

Newly enabled (9):
- `sequence_probability`
- `min_token_negentropy`
- `mean_token_negentropy`
- `probability_margin`
- `semantic_negentropy`
- `semantic_density`
- `monte_carlo_probability`
- `consistency_and_confidence`
- `p_true`