A unified ranking framework that learns directly from public leaderboard interactions to recommend the best pretrained model for an unseen dataset — without ever running a candidate on the target task.
This repository contains the official implementation of ModelLens, the metric-aware ranking framework introduced in our paper "ModelLens: Finding the Best for Your Task from Myriads of Models".
Left: the learned model–dataset atlas — a single embedding space, trained on 1.62M public benchmark records, that co-locates every model and every dataset. Models from the same family (BERT / LLaMA / T5 / ViT / Whisper / …) cluster together, and datasets from the same domain (NLP / Vision / Speech / Retrieval / Multimodal / Math & Code) form their own neighborhoods. The geometry reflects what works on what, not just text similarity. Right: given an unseen target dataset (here: MMMU), ModelLens returns top-K candidates that are task-appropriate — multimodal LMs such as Gemini-2.5-Pro, Step-3-VL-108B and Qwen3-VL-235B — in stark contrast to the nearest text-embedding neighbors (DeBERTa-MNLI, mDeBERTa-Vietnamese, MiniLM-IMDb) which match the description but solve the wrong problem.
The open-source model ecosystem is exploding. HuggingFace alone now hosts hundreds of thousands of pretrained models across thousands of architectures, and a practitioner facing a new task has to answer one deceptively simple question:
Which of these myriad models will do best on my dataset?
Existing answers are unsatisfying for very different reasons:
- AutoML / fine-tune-and-rank. Train every candidate on the target task and pick the winner. Optimal in the limit, infeasible at the scale of hundreds of thousands of models.
- Transferability estimation (LEEP, NCE, LogME, …). Cheaper than full fine-tuning, but still requires a forward pass per candidate on the target dataset. The cost grows linearly with the candidate pool, and most estimators assume a single, well-defined task setup.
- Model routing (RouterBench, RouteLLM, …). Fast at inference, but presupposes a tiny, hand-curated pool of ~5–30 models. Asks "which of these few?", not "which of these many?".
- Metadata-only retrieval. Embed the model card and the dataset description with a frozen text encoder, return nearest neighbors. Cheap and scalable, but as the right panel of the teaser shows, text similarity is not task similarity: a Vietnamese DeBERTa is among the nearest text-neighbors of MMMU but a hopeless choice for solving it.
ModelLens reframes model selection as a ranking problem over
(model, dataset, task, metric) tuples, learned directly from the
large-scale but noisy trace of public benchmark records. Once trained, it
ranks unseen models on unseen datasets zero-shot, using only metadata
(names, descriptions, model size, architecture family) — no forward pass
on the target dataset, no curated pool.
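Concretely, each leaderboard record in this framing reduces to a scored tuple. A minimal illustration (the identifiers and score below are invented for the example, not drawn from the corpus):

```python
# One benchmark record as ModelLens consumes it: a (model, dataset, task,
# metric) tuple with its observed score. All values here are illustrative.
record = {
    "model":   "meta-llama/Llama-2-7b-hf",   # hypothetical model id
    "dataset": "hellaswag",                   # hypothetical dataset id
    "task":    "multiple-choice QA",
    "metric":  "accuracy",
    "score":   0.77,                          # made-up score
}
```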
On a benchmark of 1.62M evaluation records spanning ~47K models and ~9.6K datasets, ModelLens surpasses both metadata-only and forward-pass transferability baselines, and its recommended Top-K pools improve five representative routers by 21%–81% across QA benchmarks.
A useful side-effect of training a single ranker over all
(model, dataset) interactions is that we can inspect the resulting
latent space directly. Each star below is a model, colored by
architecture family; the surrounding scatter / mesh shows the
datasets it has been evaluated on, colored by task domain.
The two atlases tell the same story from opposite ends:
- The semantic-only atlas (left) shows that text similarity alone produces a tangled mass: families overlap heavily in the centre, and many task-relevant distinctions (e.g. encoder-only LMs vs decoder-only LMs, multimodal vs vision-only) collapse together because their descriptions read similarly.
- The full-data atlas (right), driven by actual evaluation interactions, untangles this geometry: speech models (orange) detach cleanly from the text continent, retrieval embedders (green) form their own arc, and vision / multimodal models bridge the vision–text boundary. Family structure is recovered from co-evaluation patterns, not supplied as a label.
The practical consequence is the right panel of the teaser: in the learned space, nearest-neighbor in fact means task-appropriate, while in the semantic-only space it means text-similar. ModelLens's recommendation quality is, in large part, a downstream effect of having the right geometry to begin with.
```
ModelLens/
├── config/
│   ├── FinalModel_unified_augmented.yaml   # main model config (Table 1)
│   ├── method_ablation/                    # loss-objective ablations
│   ├── ablation_information/               # structural/semantic/interaction ablations
│   ├── ablation_size/                      # size-prior / size-feature ablations
│   └── ablation_family/                    # family-prior / family-holdout ablations
├── module/
│   ├── data/                               # leaderboard corpus loader, name tokenizer
│   ├── model/                              # MLP backbone, MLPMetric, MLPMetricFull (the paper model)
│   ├── procedure/                          # listwise / pairwise / pointwise / ensemble training loops
│   └── utils/                              # metrics (weighted Kendall τ, NDCG@K, Hit@K, Rec@K), family extractor
├── src/main.py                             # entry point: parse YAML, build model, train, evaluate
├── figures/                                # teaser & atlas figures used in this README
└── scripts/                                # one-shot training and ablation drivers
```
The recommended setup is conda — it pins both Python and CUDA-capable PyTorch:

```bash
conda env create -f environment.yml
conda activate modellens
```

If you prefer pip / venv:

```bash
# Python 3.10+ recommended; install PyTorch separately to match your CUDA.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

GPU training requires a CUDA-capable PyTorch build. Distributed (DDP) training is supported out of the box; see scripts/train.sh.
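After activating either environment, a quick check with standard PyTorch calls (nothing repo-specific) confirms that a GPU-enabled build is installed:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```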
The training corpus and pretrained checkpoints are not included in this repository — they are large and partially derived from third-party sources (HuggingFace, Open LLM Leaderboard, Papers-with-Code) whose licences must be respected when redistributing.
The expected layout under data/<data_name>/ is:
```
data/unified_augmented/
├── data.csv                     # (model_id, dataset_id, task_id, metric_id, score) records
├── model2id.json                # model name -> integer id
├── task2id.json                 # task type -> integer id
├── metric2id.json               # metric name -> integer id
├── model2family.json            # model name -> architecture family
├── model_profile.json           # canonical model metadata (size in B params, family, ...)
├── model2desp_embeddings.npz    # frozen text embeddings of model cards
├── dataset2desp.json            # dataset description text (per dataset id)
├── train/ val/ test/            # split-specific (model, dataset, metric, score) files
└── new_dataset_evaluation/      # held-out unseen-dataset / unseen-model splits
```
Once available, place the data under ./data/unified_augmented/ (or set
data_name in the YAML to point at a different subdirectory).
Where to get the data. We are preparing a public release of the 1.62M-record corpus and pretrained ModelLens checkpoints on HuggingFace Datasets. A download script will be added to this repo when the release is finalised. In the meantime, please contact the authors.
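Once the files are in place, the following minimal sanity check (which assumes `data.csv` carries the columns listed in the tree above; adjust if the released schema differs) verifies that the corpus loads:

```python
# Minimal sanity check for a local copy of the corpus. Column names are
# assumed to match the (model_id, dataset_id, task_id, metric_id, score)
# layout described above; adapt if the released files differ.
import json

import pandas as pd

root = "data/unified_augmented"

records = pd.read_csv(f"{root}/data.csv")
with open(f"{root}/model2id.json") as f:
    model2id = json.load(f)

print(f"{len(records):,} records covering "
      f"{records['model_id'].nunique():,} models and "
      f"{records['dataset_id'].nunique():,} datasets")
print(records.head())
```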
Once data is in place:
```bash
# Train the full ModelLens model (ensemble loss, all features)
bash scripts/train.sh

# or, equivalently, single-GPU
python src/main.py --config config/FinalModel_unified_augmented.yaml

# Multi-GPU (DDP). nproc_per_node should match your number of devices.
USE_DDP=1 NPROC=4 bash scripts/train.sh
```

Reproduce the loss-objective and information-source ablations:

```bash
bash scripts/run_method_ablations.sh
bash scripts/run_feature_ablations.sh
```

Outputs:

- Checkpoints — `checkpoint/mlp/<data_name>/<trail_name>/`
- Logs — `log/mlp/<data_name>/<trail_name>/train.log`
- Optional W&B run — controlled by `use_wandb` in the YAML
All hyperparameters live in YAML. Key knobs (see
config/FinalModel_unified_augmented.yaml for defaults):
| Field | Meaning |
|---|---|
| `model_name` | One of `MLP`, `MLPMetric`, `MLPMetricFull` (the paper model). |
| `loss_type` | `ensemble`, `listwise`, `pairwise`, `pairwise_pointwise`, `listwise_pointwise`, `listwise_pairwise`. |
| `id_dropout_rate` | Probability of masking a learned model/dataset ID with `[UNK]`. |
| `use_size_prior`, `use_family_prior` | Toggle the structural-prior head terms. |
| `use_size_feature` | If `False`, drops the size embedding from both backbone and prior. |
| `use_dataset_id_as_desp` | When `True`, the dataloader passes a global dataset id in the dataset-description slot, which the model intercepts to look up both a learned dataset embedding and a frozen description embedding. Required by `MLPMetricFull`. |
| `lambda_list`, `lambda_pair`, `point_loss_weight` | Loss weights λ_list, λ_pair, λ_point. |
| `tau` | Initial value of the learnable temperature τ. |
| `topk` | List of K values for Hit@K / NDCG@K / Rec@K. |
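For orientation, an illustrative YAML fragment wiring these knobs together (field names follow the table above; the values are placeholders rather than the paper's defaults, which live in `config/FinalModel_unified_augmented.yaml`):

```yaml
# Illustrative fragment only; the actual defaults are in
# config/FinalModel_unified_augmented.yaml.
model_name: MLPMetricFull
loss_type: ensemble
id_dropout_rate: 0.1          # placeholder value
use_size_prior: true
use_family_prior: true
use_size_feature: true
use_dataset_id_as_desp: true  # required by MLPMetricFull
lambda_list: 1.0              # placeholder loss weights
lambda_pair: 1.0
point_loss_weight: 1.0
tau: 1.0                      # initial learnable temperature
topk: [1, 5, 10]              # K values for Hit@K / NDCG@K / Rec@K
```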
ModelLens supports the two settings from Section 4.2.1 of the paper:
- Performance completion — randomly mask entries from a partially observed `(model × dataset)` matrix and predict their values; a toy sketch of this masking is shown below.
- Cold-start generalisation — hold out entire datasets or entire models (the `new_dataset_evaluation` / `new_model_evaluation` split modes) and score them zero-shot.
Ranking quality is reported with the weighted Kendall correlation τ_w (the primary metric, emphasising top-rank correctness) together with NDCG@K, Hit@K and Rec@K, all implemented in `module/utils/metric.py`.
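For reference, a compact sketch of how Hit@K and NDCG@K can be computed for a single dataset's ranking; the repository's own implementations are in `module/utils/metric.py` and may differ in details such as tie handling:

```python
# Reference sketch only; the canonical implementations live in
# module/utils/metric.py and may handle ties and normalisation differently.
import numpy as np

def hit_at_k(pred_scores, true_scores, k):
    """1.0 if the truly best model appears in the predicted top-K."""
    topk = np.argsort(pred_scores)[::-1][:k]
    return float(np.argmax(true_scores) in topk)

def ndcg_at_k(pred_scores, true_scores, k):
    """NDCG@K using the true benchmark scores as graded relevance."""
    order = np.argsort(pred_scores)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float(np.sum(true_scores[order] * discounts))
    ideal = np.sort(true_scores)[::-1][: len(order)]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```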
If you find ModelLens useful in your research, please cite:
```bibtex
@article{cai2026modellens,
  title   = {{ModelLens}: Finding the Best for Your Task from Myriads of Models},
  author  = {Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao and
             Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
  journal = {arXiv preprint},
  year    = {2026}
}
```

Released under the MIT License — see LICENSE.


