jpaillard/ensemble_vim

Aggregate Models, Not Explanations: Improving Feature Importance Estimation

Code for reproducing the experiments from the paper "Aggregate Models, Not Explanations: Improving Feature Importance Estimation".

@inproceedings{paillard2026aggregate,
  title={Aggregate Models, Not Explanations: Improving Feature Importance Estimation},
  author={Paillard, Joseph and Reyero Lobo, Angel and Engemann, Denis A. and Thirion, Bertrand},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026},
}

Requirements

All required packages can be installed using pip:

pip install -e .

For the TabICL foundation model experiment:

pip install -e ".[tabicl]"

For generating figures:

pip install -e ".[plots]"

Usage

Simulation benchmarks

ensemble_vim/simulation.py runs variable importance estimation on synthetic datasets. ensemble_vim/asymptotic.py computes asymptotic ground-truth importance at large sample size.

python ensemble_vim/simulation.py \
    --n_samples 128 256 512 1024 2048 \
    --seed 1 \
    --n_jobs 4 \
    --n_splits 5 \
    --snr 1 \
    --n_ensemble 10 \
    --results_dir ./results \
    --dataset_name friedman1 \
    --n_features 20 \
    --model_name mlp \
    --ensemble bagging \
    --sage

Available models: mlp, mlp256, rf, linear, tabicl. Available datasets: friedman1, ishigami, g_function, nonlinear. LOCO, CFI, and PFI are always computed; SAGE is computed only when the --sage flag is passed.
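To run several model and dataset combinations, the command lines can be generated programmatically. The following is a minimal sketch, not part of the repository: build_command is a hypothetical helper, it uses only flags shown in the examples in this README, and it assumes that omitted flags fall back to defaults, as the shorter examples suggest.

```python
import itertools
import shlex

# Values taken from the lists above; the choice of sweep is illustrative.
MODELS = ["mlp", "rf", "linear"]
DATASETS = ["friedman1", "ishigami"]


def build_command(model_name, dataset_name, seed, results_dir="./results"):
    """Return the argv list for one simulation run (hypothetical helper)."""
    return [
        "python", "ensemble_vim/simulation.py",
        "--model_name", model_name,
        "--dataset_name", dataset_name,
        "--ensemble", "bagging",
        "--n_ensemble", "10",
        "--seed", str(seed),
        "--results_dir", results_dir,
    ]


# One command per (model, dataset) pair, all with the same seed.
commands = [
    build_command(model, dataset, seed=1)
    for model, dataset in itertools.product(MODELS, DATASETS)
]

if __name__ == "__main__":
    for cmd in commands:
        # Print shell-quoted commands; nothing is executed here.
        print(shlex.join(cmd))
```

Each entry of commands can be passed to subprocess.run(cmd, check=True) to execute it, or printed and pasted into a shell.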

High-dimensional simulation (d=100)

python ensemble_vim/simulation.py \
    --dataset_name nonlinear --n_features 100 --model_name mlp256 \
    --ensemble bagging --n_ensemble 10 --snr 1 --seed 1

TabICL foundation model

python ensemble_vim/simulation.py \
    --model_name tabicl --dataset_name friedman1 \
    --ensemble bagging --n_ensemble 5 --n_samples 512 --seed 1

This experiment requires the tabicl extra, installed with pip install -e ".[tabicl]".

BRCA experiment

ensemble_vim/run_brca.py compares LOCO importance computed from the ensemble with LOCO importance computed from its sub-models on the TCGA BRCA gene expression dataset, using 10 validated driver genes as ground truth.

python ensemble_vim/run_brca.py --model_name mlp --seed 0
python ensemble_vim/run_brca.py --model_name logreg_l2 --seed 0

Download the BRCA dataset (572 patients, 50 genes) from Catav et al. 2021 and place it at ./data/BRCA.csv.

Cluster execution

The scripts ensemble_vim/run_simulation.slurm and ensemble_vim/run_asymptotic.slurm can be used to submit the experiments to a SLURM cluster. They use job arrays to parallelize over random seeds.
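For reference, SLURM exposes the index of each job array task through the SLURM_ARRAY_TASK_ID environment variable, which is one way to map a task to a random seed. The sketch below shows that mapping in Python. seed_from_slurm is a hypothetical helper, and the repository's .slurm scripts may instead do this directly in shell.

```python
import os


def seed_from_slurm(default=0, environ=None):
    """Return the seed for the current job array task (hypothetical helper).

    SLURM sets SLURM_ARRAY_TASK_ID for every task of a job array. Outside
    of a job array the variable is absent and `default` is returned.
    """
    env = os.environ if environ is None else environ
    return int(env.get("SLURM_ARRAY_TASK_ID", default))


if __name__ == "__main__":
    print(f"--seed {seed_from_slurm()}")
```

The environ argument exists only so the function can be exercised without a SLURM allocation.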

Results

Results are saved in the specified results_dir:

results/
    ├── <dataset>_<model>_n<n>_p<d>_<ensemble><B>/
    │   ├── models/
    │   ├── scores_<dataset>_<seed>.csv
    │   ├── support_<dataset>_<seed>.npy
    │   ├── loco_<dataset>_<seed>.csv
    │   ├── cfi_<dataset>_<seed>.csv
    │   ├── pfi_<dataset>_<seed>.csv
    │   └── sage_<dataset>_<seed>.csv
    └── ...
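When post-processing results, it can help to build the expected file paths from the run parameters. The sketch below follows the layout shown above. run_dir and result_files are hypothetical helpers, and the exact formatting of each placeholder (for example, plain integers without padding) is an assumption to check against the files actually produced.

```python
from pathlib import Path


def run_dir(results_dir, dataset, model, n, d, ensemble, B):
    """Directory of one run: <dataset>_<model>_n<n>_p<d>_<ensemble><B>."""
    return Path(results_dir) / f"{dataset}_{model}_n{n}_p{d}_{ensemble}{B}"


def result_files(results_dir, dataset, model, n, d, ensemble, B, seed, sage=False):
    """Map each output name to its expected path for a given seed."""
    base = run_dir(results_dir, dataset, model, n, d, ensemble, B)
    names = ["scores", "loco", "cfi", "pfi"]
    if sage:
        # The SAGE file is only expected when the run used --sage.
        names.append("sage")
    files = {name: base / f"{name}_{dataset}_{seed}.csv" for name in names}
    files["support"] = base / f"support_{dataset}_{seed}.npy"
    return files


if __name__ == "__main__":
    for name, path in result_files(
        "results", "friedman1", "mlp", 512, 20, "bagging", 10, seed=1
    ).items():
        print(name, path)
```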

Figures

Figure scripts are in ensemble_vim/figures/. Set the results_dir variable at the top of each script to point to your results directory.

  • figure_2.py, figure_3.py, figure_4.py — main paper figures
  • figure_supplement.py — supplementary learning curves (LOCO, SAGE, CFI)
  • plot_brca.py — BRCA driver gene recovery
  • compute_stability.py — Spearman rank correlation stability
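As background for the stability metric, the sketch below computes the Spearman rank correlation between two importance vectors and averages it over all pairs of runs. It uses the standard definition (Pearson correlation of average ranks) and only the Python standard library. It is not the code from compute_stability.py, which may pair runs or aggregate differently.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two importance vectors."""
    return pearson(ranks(x), ranks(y))


def stability(importances):
    """Mean pairwise Spearman correlation across runs (one vector per run)."""
    return mean(spearman(a, b) for a, b in combinations(importances, 2))
```

Here importances would hold one importance vector per seed, for example the per-feature LOCO values read from the loco_<dataset>_<seed>.csv files.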

UK Biobank experiment

The UK Biobank experiment code is in ensemble_vim/script_ukbb.py. It requires access to the UK Biobank proteomics data.
