Open-source tools for autograding rubrics with LLMs.
- Define and evaluate rubrics for LLM-generated responses using YAML configurations.
- Support for custom scoring strategies (binary, continuous, free-text, etc.).
- Flexible aggregation methods (mean, median, mode, custom LLM-based aggregators).
- Asynchronous and synchronous evaluation workflows.
- Visualization of rubric structure and evaluation progress.
- Integration with HealthBench dataset for medical dialogue evaluation.
- Provider-agnostic via LiteLLM; built-in token-aware rate limiting and request caching.
Requirements: Python 3.10 or higher. First, install uv (a drop-in, faster replacement for pip):
# via the official install script
curl -Ls https://astral.sh/uv/install.sh | sh
# or with Homebrew
brew install astral-sh/uv/uv
Then install the package and core dependencies:
git clone https://github.com/jacobphillips99/protorubric
cd protorubric
uv pip install -r requirements.txt # core deps
uv pip install -e . # editable install
- (Optional) Install visualization dependencies:
uv pip install -r requirements-viz.txt
- Install Graphviz for network diagrams:
- macOS:
brew install graphviz
- Ubuntu/Debian:
sudo apt-get install graphviz
- Set API keys as environment variables (inferred from configured providers): OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY (see the sketch after this list)
- Optional environment variables:
  - PROTORUBRIC_LOG_LEVEL (default: ERROR)
  - PROTORUBRIC_INVALIDATE_CACHE (set to True to bypass the on-disk cache)
- Rate limits and available providers/models come from rate_limits.yaml.
- Token-aware rate limits are enforced per rate_limits.yaml with the llm-rate-limiter package. See LLM Rate Limiter.
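For example, a minimal sketch of setting these variables from Python before running an evaluation (the variable names come from the list above; exporting them in your shell works equally well):
import os
# Optional: raise logging verbosity (default is ERROR)
os.environ["PROTORUBRIC_LOG_LEVEL"] = "INFO"
# Optional: bypass the on-disk request cache for this run
os.environ["PROTORUBRIC_INVALIDATE_CACHE"] = "True"
# Provider keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY) should already
# be exported in your environment; check before running an evaluation.
assert os.environ.get("OPENAI_API_KEY"), "set an API key for your configured provider"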
Create a rubric file, either in YAML or in code, describing scoring, evaluators, aggregators, and requirements. See assets/examples/example_configs/test_rubric.yaml or scripts/healthbench/healthbench_to_open_rubric_utils.py for examples.
You can also create a rubric in code:
Style-guide Rubric example
Construct a rubric for grading a response based on grammar and tone.
from protorubric.configs.evaluating import ModelEvaluatorConfig
from protorubric.configs.query import QueryConfig
from protorubric.configs.requirement import RequirementConfig
from protorubric.configs.aggregating import WeightedAverageAggregatingConfig
from protorubric.rubric import Rubric
llm_judge = ModelEvaluatorConfig(model="gpt-4o", provider="openai")
grammar_requirement = RequirementConfig(
name="grammar",
query=QueryConfig(instruction="Is the response grammatically correct?", scoring_config="binary"),
evaluator=llm_judge,
)
tone_requirement = RequirementConfig(
name="tone",
query=QueryConfig(instruction="What tone does the response have?", scoring_config="unit_scalar"),
evaluator=llm_judge,
)
overall_score_requirement = RequirementConfig(
name="overall_score",
aggregator=WeightedAverageAggregatingConfig(weights=[0.9, 0.1]),
dependency_names=["grammar", "tone"],
)
rubric = Rubric(requirements=[grammar_requirement, tone_requirement, overall_score_requirement])
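Once constructed, this rubric can be solved directly against a response, just like a YAML-defined rubric (a minimal sketch; the conversation string below is a made-up example, and solve usage mirrors the quickstart later in this README):
# Hypothetical conversation to grade with the style-guide rubric
inputs = "role: user: Can you review my email?\nrole: assistant: Sure! Here is a polished version..."
results = rubric.solve(inputs)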
Job Rubric example
Construct a rubric for determining whether the valuation of Scale AI is over 25 billion dollars.
from protorubric.configs.evaluating import ModelEvaluatorConfig, PassThroughEvaluatorConfig
from protorubric.configs.query import QueryConfig, NullQueryConfig
from protorubric.configs.requirement import RequirementConfig
from protorubric.configs.aggregating import AllAggregatingConfig, LLMAggregatingConfig
from protorubric.rubric import Rubric
llm_judge = ModelEvaluatorConfig(model="gpt-4o", provider="openai")
research_requirement = RequirementConfig(
name="research",
query=QueryConfig(instruction="research the company", scoring_config="free_text"),
evaluator=llm_judge,
)
arr_requirement = RequirementConfig(
name="arr",
query=QueryConfig(instruction="determine the ARR of the company", scoring_config="free_text"),
evaluator=llm_judge,
dependency_names=["research"],
)
arr_multiples_requirement = RequirementConfig(
name="arr_multiples",
query=QueryConfig(instruction="determine ARR multiples of similar companies", scoring_config="free_text"),
evaluator=llm_judge,
dependency_names=["research"],
)
valuation_requirement = RequirementConfig(
name="valuation",
query=QueryConfig(instruction="determine the valuation of the company", scoring_config="free_text"),
evaluator=llm_judge,
dependency_names=["arr", "arr_multiples"],
)
# Aggregate boolean conclusion using dependent results
bool_final = RequirementConfig(
name="is_over_25b",
query=NullQueryConfig(),
dependency_names=["valuation"],
evaluator=PassThroughEvaluatorConfig(),
aggregator=AllAggregatingConfig(),
)
# Free-text explanation combining dependent results
default_summary_prompt = (
"Summarize the available information and conclude in one sentence."
)
text_final = RequirementConfig(
name="explanation",
query=NullQueryConfig(),
dependency_names=["valuation", "arr", "arr_multiples", "research"],
evaluator=PassThroughEvaluatorConfig(),
aggregator=LLMAggregatingConfig(model="gpt-4o", aggregation_prompt=default_summary_prompt),
)
rubric = Rubric(requirements=[
research_requirement, arr_requirement, arr_multiples_requirement,
valuation_requirement, bool_final, text_final
])
Rubrics take in an "input" object, typically a conversation between a user and an assistant or a blob of text. The rubric then evaluates the input against its requirements. Evaluation runs asynchronously: we determine a topological ordering of the requirements and evaluate them in that order, which finishes the evaluation in the shortest time allowed by the critical path. For example, the dependency graph for a given rubric may look like this:
{a: [], b: [], c:[a], d:[a,b], e:[c, d]}
Instead of evaluating the requirements one by one, we conduct a topological level-finding sort and then evaluate each level's requirements concurrently:
Level 0: a, b
Level 1: c, d
Level 2: e
This makes evaluation significantly faster, as we can evaluate the requirements in parallel, especially for large rubrics that are much wider than they are deep.
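As an illustrative sketch (not the library's internal implementation, which lives under src/protorubric/utils/), the level computation for a dependency map like the one above can be done with a simple iterative pass:
def topological_levels(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group nodes into levels where every node's dependencies sit in earlier levels."""
    remaining = dict(deps)
    resolved: set[str] = set()
    levels: list[list[str]] = []
    while remaining:
        # Nodes whose dependencies are all already resolved form the next level
        level = [n for n, ds in remaining.items() if all(d in resolved for d in ds)]
        if not level:
            raise ValueError("Cycle detected in requirement dependencies")
        levels.append(sorted(level))
        resolved.update(level)
        for n in level:
            remaining.pop(n)
    return levels

print(topological_levels({"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}))
# [['a', 'b'], ['c', 'd'], ['e']]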
from protorubric.rubric import Rubric
# Load rubric from YAML
rubric = Rubric.from_yaml("my_rubric.yaml")
# Prepare inputs (e.g., conversation string or text)
inputs = "role: user: Hello, how are you?\nrole: assistant: I'm fine, thank you!"
# Run evaluation synchronously
results = rubric.solve(inputs)
Generate visual representations of the rubric DAG and component usage:
from protorubric.viz.visualize import visualize_rubric
# visualize and save outputs under assets/viz_outputs/
visualizer, rubric = visualize_rubric(rubric=rubric, inputs=inputs, output_dir="assets/viz_outputs")
See scripts/test_viz.py for a runnable example.
Use RubricWithAnswers to compare rubric evaluation against known answers, or set teacher_force=True to force the rubric to use the known answers when evaluating dependent requirements. This lets you evaluate the full end-to-end performance of a rubric or model, or break the evaluation into parts.
from protorubric.eval.rubric_with_answers import RubricWithAnswers, generate_random_answers
rubric = Rubric.from_yaml("my_rubric.yaml")
answers = generate_random_answers(rubric)
# Teacher-forced evaluation
rwa_tf = RubricWithAnswers.from_rubric_and_answers(rubric, answers, teacher_force=True)
rwa_tf.solve(inputs)
# Standard evaluation
rwa = RubricWithAnswers.from_rubric_and_answers(rubric, answers, teacher_force=False)
rwa.solve(inputs)
We build on OpenAI’s HealthBench scripts to make the evaluation more modular and rubric‑first. If you’re new to HealthBench, start with the original scripts here: OpenAI HealthBench scripts.
What we add on top:
- Rubric → structured requirements: Convert HealthBench rubric items into explicit Requirements with binary scoring and multiple aggregations (mode, weighted average, weighted sum). See scripts/healthbench/healthbench_to_open_rubric_utils.py and the sketch after this list.
- End‑to‑end runner: Download data, generate assistant completions, build a rubric, and evaluate, with a clear separation between a sampler model and a grader model. See scripts/healthbench/run.py and scripts/healthbench/setup_healthbench.py.
- Meta‑HealthBench: Take paragraph‑style rubrics, use an LLM to decompose them into yes/no checks, and aggregate into both a boolean and a short text answer. See scripts/healthbench/setup_meta_healthbench.py and scripts/healthbench/run_meta.py.
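For intuition, here is a hedged sketch of the conversion idea using the config classes shown earlier. The real conversion lives in scripts/healthbench/healthbench_to_open_rubric_utils.py and may differ in details; the item tuples below are illustrative, not actual HealthBench data.
from protorubric.configs.evaluating import ModelEvaluatorConfig
from protorubric.configs.query import QueryConfig
from protorubric.configs.requirement import RequirementConfig
from protorubric.configs.aggregating import WeightedAverageAggregatingConfig
from protorubric.rubric import Rubric

grader = ModelEvaluatorConfig(model="gpt-4o", provider="openai")

# Illustrative HealthBench-style rubric items: (criterion text, points)
items = [
    ("Advises the user to seek emergency care for chest pain", 5),
    ("Avoids giving a definitive diagnosis", 3),
]

# One binary requirement per rubric item, graded by the LLM judge
requirements = [
    RequirementConfig(
        name=f"item_{i}",
        query=QueryConfig(instruction=criterion, scoring_config="binary"),
        evaluator=grader,
    )
    for i, (criterion, _points) in enumerate(items)
]

# Aggregate the per-item results into an overall score, weighted by points
total = sum(points for _, points in items)
requirements.append(
    RequirementConfig(
        name="overall",
        aggregator=WeightedAverageAggregatingConfig(weights=[p / total for _, p in items]),
        dependency_names=[f"item_{i}" for i in range(len(items))],
    )
)

rubric = Rubric(requirements=requirements)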
Try it:
# Standard HealthBench: downloads/samples, builds rubric, evaluates 1 row
python -m scripts.healthbench.run
# Meta‑HealthBench: LLM‑decomposed rubric → requirements → evaluation
python -m scripts.healthbench.run_meta
Notes:
- Data caches under assets/examples/healthbench/ on first run.
- Configure model credentials via your environment (compatible with litellm providers).
- Default models and sample size are set at the top of each script.
- src/protorubric/ — core library modules
  - configs/ — data classes for scoring, evaluating, aggregating, query, and requirement
  - models/ — LiteLLM request plumbing, caching, and types
  - eval/ — evaluation helpers like RubricWithAnswers and metrics
  - viz/ — visualization utilities
  - utils/ — graph utilities (topological levels, etc.)
  - rubric.py — orchestrates DAG execution over requirements
- assets/ — images, example configs, cache outputs
  - examples/example_configs/ — YAML examples (scoring/evaluator/aggregator/rubric)
  - viz_outputs/ — visualization output directory
  - eval/ — pickled evaluation artifacts
- scripts/ — runnable examples and HealthBench utilities
  - test_viz.py, tester.py
  - healthbench/ — dataset setup, rubric conversion, and runners
- tests/ — lightweight examples used during development
- rate_limits.yaml — provider/model RPM and TPM settings
- notebooks/ — exploratory notebooks
- scoring_configs: how a requirement is graded (binary, unit_scalar, continuous, categorical, free_text, or custom)
- evaluator_configs: how a query is answered (e.g., llm, llm-ensemble, pass-through)
- aggregator_configs: how multiple answers are combined (mean, median, mode, all, any, weighted_sum, weighted_average, llm)
- requirements: list of requirement objects with name, query, evaluator, optional dependency_names, and optional aggregator
- Configs can recursively include other YAMLs; see assets/examples/example_configs/test_rubric.yaml
- Responses are cached to assets/request_cache.db and keyed by request hash
- Set PROTORUBRIC_INVALIDATE_CACHE=True to bypass the cache
- Token-aware rate limits are enforced per rate_limits.yaml with the llm-rate-limiter package. See LLM Rate Limiter.
This project is licensed under the MIT License - see the LICENSE file for details.
References:
- protorubric repo: https://github.com/jacobphillips99/protorubric
- llm-rate-limiter repo: https://github.com/jacobphillips99/llm-rate-limiter