| What | A multi-dimensional trustworthiness evaluation suite for multimodal LLMs. |
| Why | Single-metric leaderboards conceal failure modes — we want a model card, not a number. |
| Status | 🟢 Active — five dimensions, eleven sub-tasks. |
| Stack | Python ≥ 3.10, PyTorch, HuggingFace Transformers. |
pip install trust-eval-mm
trust-eval --model llava-hf/llava-1.5-7b-hf --dims all --out reports/llava15.json
trust-eval card --in reports/llava15.json --out cards/llava15.md| Dim | Sub-tasks | What it measures |
|---|---|---|
| Truthfulness | pope, mmlu-mm, factual-vqa |
Does the model claim things that are true given the image? |
| Robustness | image-noise, prompt-perturb |
Does the answer stay stable under small input changes? |
| Fairness | gender-bias, racial-bias |
Does answer quality vary by depicted demographic? |
| Calibration | selective-pred, aurc |
When the model says "I'm sure", is it actually right more often? |
| Privacy | pii-leak, face-id-leak |
Does the model emit identifying info it shouldn't? |
Each sub-task is scored on a 0–100 scale, then aggregated into a per-dimension score and a single Trust Score. The aggregation weights are user-configurable; the defaults are the ones used in the paper.
The output is a markdown "trust card" that shows all five dimensions side by side. The point: a model with 92% accuracy on POPE but 31 in calibration AURC is a deployment risk that single-number leaderboards never surface.
A sample card looks like:
TrustEval-MM Card | llava-hf/llava-1.5-7b-hf | rev 0a3f
==========================================================
Trust Score: 62.1 (weights: paper-default)
Truthfulness ███████░░░ 72 pope=78 mmlu-mm=64 factual-vqa=74
Robustness █████░░░░░ 54 img-noise=62 prompt-perturb=46
Fairness ██████░░░░ 58 gender=63 racial=53
Calibration ████░░░░░░ 41 selective=49 aurc=33
Privacy ███████░░░ 75 pii-leak=82 face-id-leak=68
# 1. Prepare eval data (downloads small COCO subset + procedural prompts)
trust-eval prepare --out data/
# 2. Run all dimensions on a model
trust-eval --model llava-hf/llava-1.5-7b-hf --data data/ --out reports/llava15.json
# 3. Render to a markdown card
trust-eval card --in reports/llava15.json --out cards/llava15.mdOr programmatically:
from trust_eval_mm import evaluate, render_card
report = evaluate(model_id="llava-hf/llava-1.5-7b-hf",
dimensions=["truthfulness", "calibration"],
data_root="data/")
print(render_card(report))| Argument | Default | Description |
|---|---|---|
--model |
(required) | HF model id or registered adapter |
--dims |
all |
Comma-separated subset of dimensions |
--n |
500 |
Examples per sub-task |
--device |
cuda |
Where to run |
--weights |
paper |
Aggregation weights (paper, uniform, custom) |
--seed |
0 |
Random seed |
Truthfulness
pope— Polling-based Object Probing Eval. Yes/No questions about whether an object is in the image.mmlu-mm— Multimodal MMLU subset. Multiple-choice questions where the answer depends on the image.factual-vqa— Free-form VQA with a factual answer, judged via exact-match + semantic similarity.
Robustness
image-noise— Add Gaussian / JPEG / blur perturbations and measure answer agreement.prompt-perturb— Paraphrase the prompt and measure answer agreement.
Fairness
gender-bias— Two-image pairs that differ only in depicted gender. Measure score gap on accuracy + sentiment.racial-bias— Same, varying race.
Calibration
selective-pred— Coverage-risk curve. Among the model's top-confidence k%, what's the error rate?aurc— Area Under the Risk-Coverage curve. Lower is better; we report 100 - 100*AURC.
Privacy
pii-leak— Images with rendered PII (emails, phone numbers). Does the model repeat them?face-id-leak— Does the model attempt to identify the person in the image?
- Five dimensions, eleven sub-tasks
- Markdown card renderer
- CSV / JSON output for downstream analysis
- HTML card renderer
- HF Hub integration (autopublish trust cards on model upload)
- Per-language sub-tasks (currently English-only)
@article{wang2025trustevalmm,
title = {{TrustEval-MM}: Multi-Dimensional Trustworthiness Evaluation for Multimodal {LLMs}},
author = {Wang, Ziyu},
journal = {arXiv preprint arXiv:2502.xxxxx},
year = {2025}
}Apache-2.0.