TrustEval-MM


What	A multi-dimensional trustworthiness evaluation suite for multimodal LLMs.
Why	Single-metric leaderboards conceal failure modes — we want a model card, not a number.
Status	🟢 Active — five dimensions, eleven sub-tasks.
Stack	Python ≥ 3.10, PyTorch, HuggingFace Transformers.

TL;DR

pip install trust-eval-mm

trust-eval --model llava-hf/llava-1.5-7b-hf --dims all --out reports/llava15.json
trust-eval card --in reports/llava15.json --out cards/llava15.md

The five dimensions

Dim	Sub-tasks	What it measures
Truthfulness	`pope`, `mmlu-mm`, `factual-vqa`	Does the model claim things that are true given the image?
Robustness	`image-noise`, `prompt-perturb`	Does the answer stay stable under small input changes?
Fairness	`gender-bias`, `racial-bias`	Does answer quality vary by depicted demographic?
Calibration	`selective-pred`, `aurc`	When the model says "I'm sure", is it actually right more often?
Privacy	`pii-leak`, `face-id-leak`	Does the model emit identifying info it shouldn't?

Each sub-task is scored on a 0–100 scale, then aggregated into a per-dimension score and a single Trust Score. The aggregation weights are user-configurable; the defaults are the ones used in the paper.

Why bother with a card?

The output is a markdown "trust card" that shows all five dimensions side by side. The point: a model with 92% accuracy on POPE but 31 in calibration AURC is a deployment risk that single-number leaderboards never surface.

A sample card looks like:

TrustEval-MM Card | llava-hf/llava-1.5-7b-hf | rev 0a3f
==========================================================
Trust Score:  62.1   (weights: paper-default)

Truthfulness  ███████░░░  72   pope=78  mmlu-mm=64  factual-vqa=74
Robustness    █████░░░░░  54   img-noise=62  prompt-perturb=46
Fairness      ██████░░░░  58   gender=63  racial=53
Calibration   ████░░░░░░  41   selective=49  aurc=33
Privacy       ███████░░░  75   pii-leak=82  face-id-leak=68

Quick start

# 1. Prepare eval data (downloads small COCO subset + procedural prompts)
trust-eval prepare --out data/

# 2. Run all dimensions on a model
trust-eval --model llava-hf/llava-1.5-7b-hf --data data/ --out reports/llava15.json

# 3. Render to a markdown card
trust-eval card --in reports/llava15.json --out cards/llava15.md

Or programmatically:

from trust_eval_mm import evaluate, render_card

report = evaluate(model_id="llava-hf/llava-1.5-7b-hf",
                  dimensions=["truthfulness", "calibration"],
                  data_root="data/")
print(render_card(report))

Configuration

Argument	Default	Description
`--model`	(required)	HF model id or registered adapter
`--dims`	`all`	Comma-separated subset of dimensions
`--n`	`500`	Examples per sub-task
`--device`	`cuda`	Where to run
`--weights`	`paper`	Aggregation weights (`paper`, `uniform`, custom)
`--seed`	`0`	Random seed

Sub-task quick reference

Truthfulness

pope — Polling-based Object Probing Eval. Yes/No questions about whether an object is in the image.
mmlu-mm — Multimodal MMLU subset. Multiple-choice questions where the answer depends on the image.
factual-vqa — Free-form VQA with a factual answer, judged via exact-match + semantic similarity.

Robustness

image-noise — Add Gaussian / JPEG / blur perturbations and measure answer agreement.
prompt-perturb — Paraphrase the prompt and measure answer agreement.

Fairness

gender-bias — Two-image pairs that differ only in depicted gender. Measure score gap on accuracy + sentiment.
racial-bias — Same, varying race.

Calibration

selective-pred — Coverage-risk curve. Among the model's top-confidence k%, what's the error rate?
aurc — Area Under the Risk-Coverage curve. Lower is better; we report 100 - 100*AURC.

Privacy

pii-leak — Images with rendered PII (emails, phone numbers). Does the model repeat them?
face-id-leak — Does the model attempt to identify the person in the image?

Roadmap

Five dimensions, eleven sub-tasks
Markdown card renderer
CSV / JSON output for downstream analysis
HTML card renderer
HF Hub integration (autopublish trust cards on model upload)
Per-language sub-tasks (currently English-only)

Citation

@article{wang2025trustevalmm,
  title   = {{TrustEval-MM}: Multi-Dimensional Trustworthiness Evaluation for Multimodal {LLMs}},
  author  = {Wang, Ziyu},
  journal = {arXiv preprint arXiv:2502.xxxxx},
  year    = {2025}
}

License

Apache-2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
src/trust_eval_mm		src/trust_eval_mm
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TrustEval-MM

TL;DR

The five dimensions

Why bother with a card?

Quick start

Configuration

Sub-task quick reference

Roadmap

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TrustEval-MM

TL;DR

The five dimensions

Why bother with a card?

Quick start

Configuration

Sub-task quick reference

Roadmap

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages