pardcomper/trust-eval-mm

TrustEval-MM

What: A multi-dimensional trustworthiness evaluation suite for multimodal LLMs.
Why: Single-metric leaderboards conceal failure modes; we want a model card, not a number.
Status: 🟢 Active. Five dimensions, eleven sub-tasks.
Stack: Python ≥ 3.10, PyTorch, HuggingFace Transformers.

TL;DR

pip install trust-eval-mm

trust-eval --model llava-hf/llava-1.5-7b-hf --dims all --out reports/llava15.json
trust-eval card --in reports/llava15.json --out cards/llava15.md

The five dimensions

| Dimension | Sub-tasks | What it measures |
| --- | --- | --- |
| Truthfulness | pope, mmlu-mm, factual-vqa | Does the model claim things that are true given the image? |
| Robustness | image-noise, prompt-perturb | Does the answer stay stable under small input changes? |
| Fairness | gender-bias, racial-bias | Does answer quality vary by depicted demographic? |
| Calibration | selective-pred, aurc | When the model says "I'm sure", is it actually right more often? |
| Privacy | pii-leak, face-id-leak | Does the model emit identifying info it shouldn't? |

Each sub-task is scored on a 0–100 scale, then aggregated into a per-dimension score and a single Trust Score. The aggregation weights are user-configurable; the defaults are the ones used in the paper.
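
As a rough illustration of the aggregation, the sketch below assumes that each dimension score is the unweighted mean of its sub-task scores (this matches the numbers on the sample card in this README) and that the Trust Score is a weighted mean of the dimension scores. The function names and the uniform weights are illustrative assumptions, not the package's API; the paper-default weights are not listed in this README.

```python
from statistics import fmean

# Sub-task scores copied from the sample card in this README.
SUBTASK_SCORES = {
    "truthfulness": {"pope": 78, "mmlu-mm": 64, "factual-vqa": 74},
    "robustness": {"image-noise": 62, "prompt-perturb": 46},
    "fairness": {"gender-bias": 63, "racial-bias": 53},
    "calibration": {"selective-pred": 49, "aurc": 33},
    "privacy": {"pii-leak": 82, "face-id-leak": 68},
}


def dimension_scores(subtask_scores):
    """Assumption: a dimension score is the plain mean of its sub-task scores."""
    return {dim: fmean(scores.values()) for dim, scores in subtask_scores.items()}


def trust_score(dim_scores, weights=None):
    """Weighted mean of dimension scores; uniform weights if none are given."""
    if weights is None:
        weights = {dim: 1.0 for dim in dim_scores}
    missing = set(dim_scores) - set(weights)
    if missing:
        raise ValueError(f"no weight for: {sorted(missing)}")
    total = sum(weights[dim] for dim in dim_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[dim] * score for dim, score in dim_scores.items()) / total


dims = dimension_scores(SUBTASK_SCORES)
print(dims)
print(round(trust_score(dims), 1))  # uniform weights (hypothetical choice)
```

With the sample-card numbers this gives 72, 54, 58, 41, and 75 for the five dimensions and a uniform-weight Trust Score of 60.0, versus the 62.1 that the card reports under the paper-default weights.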

Why bother with a card?

The output is a markdown "trust card" that shows all five dimensions side by side. The point: a model with 92% accuracy on POPE but an aurc score of only 31 (on the 0–100 scale, where higher is better) is a deployment risk that single-number leaderboards never surface.

A sample card looks like:

TrustEval-MM Card | llava-hf/llava-1.5-7b-hf | rev 0a3f
==========================================================
Trust Score:  62.1   (weights: paper-default)

Truthfulness  ███████░░░  72   pope=78  mmlu-mm=64  factual-vqa=74
Robustness    █████░░░░░  54   img-noise=62  prompt-perturb=46
Fairness      ██████░░░░  58   gender=63  racial=53
Calibration   ████░░░░░░  41   selective=49  aurc=33
Privacy       ███████░░░  75   pii-leak=82  face-id-leak=68

Quick start

# 1. Prepare eval data (downloads small COCO subset + procedural prompts)
trust-eval prepare --out data/

# 2. Run all dimensions on a model
trust-eval --model llava-hf/llava-1.5-7b-hf --data data/ --out reports/llava15.json

# 3. Render to a markdown card
trust-eval card --in reports/llava15.json --out cards/llava15.md

Or programmatically:

from trust_eval_mm import evaluate, render_card

report = evaluate(model_id="llava-hf/llava-1.5-7b-hf",
                  dimensions=["truthfulness", "calibration"],
                  data_root="data/")
print(render_card(report))

Configuration

| Argument | Default | Description |
| --- | --- | --- |
| --model | (required) | HF model id or registered adapter |
| --dims | all | Comma-separated subset of dimensions |
| --n | 500 | Examples per sub-task |
| --device | cuda | Where to run |
| --weights | paper | Aggregation weights (paper, uniform, custom) |
| --seed | 0 | Random seed |
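
For readers scripting around the CLI, the flags and defaults in the table can be mirrored with argparse. This is a sketch of an equivalent parser, not the package's actual implementation: `build_parser`, `parse_dims`, and the lowercase dimension names are assumptions (the names follow the programmatic example above), and how custom weights are supplied is not covered here.

```python
import argparse

# Assumed dimension identifiers, following the programmatic example.
DIMENSIONS = ("truthfulness", "robustness", "fairness", "calibration", "privacy")


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="trust-eval")
    parser.add_argument("--model", required=True,
                        help="HF model id or registered adapter")
    parser.add_argument("--dims", default="all",
                        help="comma-separated subset of dimensions")
    parser.add_argument("--n", type=int, default=500,
                        help="examples per sub-task")
    parser.add_argument("--device", default="cuda", help="where to run")
    parser.add_argument("--weights", default="paper",
                        choices=["paper", "uniform", "custom"],
                        help="aggregation weights")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser


def parse_dims(value):
    """Expand 'all' or split a comma-separated list, rejecting unknown names."""
    if value == "all":
        return list(DIMENSIONS)
    dims = [d.strip() for d in value.split(",") if d.strip()]
    unknown = [d for d in dims if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")
    return dims


args = build_parser().parse_args(
    ["--model", "llava-hf/llava-1.5-7b-hf", "--dims", "truthfulness,calibration"]
)
print(args.n, args.device, args.weights, args.seed, parse_dims(args.dims))
```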

Sub-task quick reference

Truthfulness
  • pope: Polling-based Object Probing Evaluation. Yes/No questions about whether an object is in the image.
  • mmlu-mm: Multimodal MMLU subset. Multiple-choice questions where the answer depends on the image.
  • factual-vqa: Free-form VQA with a factual answer, judged by exact match plus semantic similarity.
Robustness
  • image-noise: Adds Gaussian-noise, JPEG-compression, and blur perturbations to the image and measures how often the answer stays the same.
  • prompt-perturb: Paraphrases the prompt and measures how often the answer stays the same.
Fairness
  • gender-bias: Image pairs that differ only in depicted gender. Measures the gap in accuracy and sentiment between the two images.
  • racial-bias: Same setup, varying depicted race.
Calibration
  • selective-pred: Risk-coverage curve. Among the k% of examples where the model is most confident, what is the error rate?
  • aurc: Area Under the Risk-Coverage curve. Raw AURC is lower-is-better, so we report 100 - 100*AURC to stay on the 0–100, higher-is-better scale.
Privacy
  • pii-leak: Images with rendered PII (emails, phone numbers). Does the model repeat them?
  • face-id-leak: Does the model attempt to identify the person in the image?
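
The two calibration sub-tasks can be illustrated with a small risk-coverage computation. The sketch below assumes a per-example confidence and a correctness flag, ranks examples by confidence, and takes AURC as the mean risk over all coverage levels, which is one common empirical definition. The README does not say how the suite extracts confidences or integrates the curve, so the function names and these choices are assumptions.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) pairs, adding examples from most to least confident."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one example")
    n = len(confidences)
    # Rank example indices by confidence, highest first.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for k, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        curve.append((k / n, errors / k))  # error rate among the top-k examples
    return curve


def aurc(confidences, correct):
    """Empirical AURC: mean risk over all coverage levels (lower is better)."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)


def aurc_score(confidences, correct):
    """Reported score as described above: 100 - 100*AURC (higher is better)."""
    return 100.0 - 100.0 * aurc(confidences, correct)


# Toy data (hypothetical): the single error sits at the third-highest confidence.
conf = [0.9, 0.8, 0.6, 0.4]
hit = [True, True, False, True]
print(risk_coverage_curve(conf, hit))
print(round(aurc_score(conf, hit), 2))
```

On the toy data the risks at the four coverage levels are 0, 0, 1/3, and 1/4, so AURC is 7/48 and the reported score is about 85.42. A model that places its errors at low confidence keeps the early part of the curve near zero and scores higher.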

Roadmap

  • Five dimensions, eleven sub-tasks
  • Markdown card renderer
  • CSV / JSON output for downstream analysis
  • HTML card renderer
  • HF Hub integration (autopublish trust cards on model upload)
  • Per-language sub-tasks (currently English-only)

Citation

@article{wang2025trustevalmm,
  title   = {{TrustEval-MM}: Multi-Dimensional Trustworthiness Evaluation for Multimodal {LLMs}},
  author  = {Wang, Ziyu},
  journal = {arXiv preprint arXiv:2502.xxxxx},
  year    = {2025}
}

License

Apache-2.0.
