Sheng-Yu Wang1, Aaron Hertzmann2, Alexei A. Efros3, Richard Zhang2, Jun-Yan Zhu1.
Carnegie Mellon University1, Adobe Research2, UC Berkeley3
In NeurIPS, 2025.
Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
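At deployment time, the retrieval over precomputed embeddings can be served with an off-the-shelf similarity index. The following is a minimal sketch using FAISS exact inner-product search; the file paths refer to the precomputed features described in the data section below, and this illustrates the indexing-and-search idea rather than the repository's actual serving code.
import faiss  # pip install faiss-cpu
import numpy as np

# Precomputed embeddings: any [N, D] float32 matrix works; paths follow the data layout below.
train_emb = np.load("data/coco/feats/dino+clip_text/data_feats.npy").astype("float32")
query_emb = np.load("data/coco/feats/dino+clip_text/test_query_feats.npy").astype("float32")

# Build an exact inner-product index over the training embeddings.
index = faiss.IndexFlatIP(train_emb.shape[1])
index.add(train_emb)

# Retrieve the top-20 most similar training images for every query in a single call.
scores, ids = index.search(query_emb, 20)
print(ids.shape)  # (num_queries, 20)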
Create a conda/micromamba environment with all dependencies:
# Using conda
conda env create -f environment.yaml
conda activate fastgda
# Or using micromamba (faster)
micromamba env create -f environment.yaml
micromamba activate fastgda
(We mainly tested the environment with micromamba.)
All data files (pretrained weights, precomputed features, COCO images, influence rankings, and query images) are available on Hugging Face:
Download everything with a single command:
# Use the download script (~15GB total)
bash scripts/download_data.sh
(If you hit Hugging Face rate limits during the download, rerun the script until it completes.)
The download script will automatically:
- Download precomputed features for all feature types (DINO, CLIP, CLIP text, DINO+CLIP text)
- Download and extract COCO train2017 images (concatenates split tar files)
- Download ground truth influence rankings (influence_train.pkl, influence_test.pkl)
- Download query images with precomputed latents and text embeddings
- Download pretrained weight files
Expected data structure after download:
data/coco/
├── train2017/ # COCO training images
├── feats/ # Precomputed features
│ ├── dino+clip_text/
│ │ ├── data_feats.npy # [118287, 1280] training features
│ │ ├── train_query_feats.npy # [5000, 1280] training queries
│ │ └── test_query_feats.npy # [1000, 1280] test queries
│ ├── dino/
│ ├── clip/
│ └── clip_text/
├── influence_train.pkl # Ground truth training influences
├── influence_test.pkl # Ground truth test influences
├── query_train/ # Training query images
│ ├── images/
│ ├── latents.npy
│ ├── text_embeddings.npy
│ └── nn_dino.pkl
└── query_test/ # Test query images
├── images/
├── latents.npy
└── text_embeddings.npy
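As a quick sanity check after downloading, the precomputed features and influence rankings can be inspected directly. The snippet below is a minimal sketch that only relies on the file names and shapes listed above; the exact structure of the .pkl files is best verified by inspection.
import pickle
import numpy as np

data_root = "data/coco"

# Precomputed training and query features for the dino+clip_text variant.
data_feats = np.load(f"{data_root}/feats/dino+clip_text/data_feats.npy")
test_query_feats = np.load(f"{data_root}/feats/dino+clip_text/test_query_feats.npy")
print(data_feats.shape)        # expected: (118287, 1280)
print(test_query_feats.shape)  # expected: (1000, 1280)

# Ground truth influence rankings (inspect the object structure before relying on it).
with open(f"{data_root}/influence_test.pkl", "rb") as f:
    influence_test = pickle.load(f)
print(type(influence_test))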
Launch a Gradio demo to explore image attributions:
python demo.py \
--checkpoint weights/dino+clip_text.pth \
--data_dir data/coco \
--feature_dir data/coco/feats/dino+clip_text
The demo will:
- Load a generated image and its caption
- Compute calibrated features using the trained model
- Rank all training images by influence score
- Display the top-k most influential training images
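Conceptually, the ranking step boils down to a similarity search in the calibrated embedding space. The sketch below illustrates only that step; load_calibration_model and extract_query_feature are hypothetical helpers, and using a plain dot product as the influence score is an assumption for illustration rather than the demo's exact code.
import numpy as np
import torch

# Hypothetical helpers standing in for the demo's checkpoint loading and raw feature extraction.
model = load_calibration_model("weights/dino+clip_text.pth")         # assumed helper
query_feat = extract_query_feature("query.png", "a dog on a beach")  # assumed helper, raw 1280-dim feature

# Project raw features into the calibrated embedding space (assuming the model is a feature projector).
train_feats = np.load("data/coco/feats/dino+clip_text/data_feats.npy")       # [118287, 1280]
with torch.no_grad():
    train_emb = model(torch.from_numpy(train_feats).float())                 # [118287, out_feat_dim]
    query_emb = model(torch.from_numpy(query_feat).float().unsqueeze(0))     # [1, out_feat_dim]

# Rank training images by similarity to the query embedding and keep the top-k.
scores = (train_emb @ query_emb.T).squeeze(1).numpy()
top_k = np.argsort(-scores)[:20]
print("Most influential training image indices:", top_k)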
Before training, you need to run the data download script scripts/download_data.sh.
Train a model using DINO + CLIP text features with this script:
bash scripts/train_coco.sh
Dataset Settings:
- --ftype: Feature type (e.g., dino+clip_text, dino, clip)
- --data_dir: Directory containing feature files
- --rank_file: Path to ground truth influence rankings (.pkl)
Model Architecture:
- --hidden_sizes: Hidden layer sizes (default: [768, 768, 768])
- --input_norm: Use layer normalization on input
- --dropout: Dropout probability (default: 0.1)
- --out_feat_dim: Output feature dimension (default: 768)
Training Hyperparameters:
- --epochs: Number of epochs (default: 10)
- --batch_size: Batch size (default: 4096)
- --lr: Learning rate (default: 0.001)
Logging:
- --wandb: Enable Weights & Biases logging
- --wandb_project: W&B project name (default: fastgda)
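For reference, the Model Architecture options above describe a small MLP feature projector. The snippet below is one plausible PyTorch realization of the defaults (1280-dim dino+clip_text input, three hidden layers of 768 units, input layer normalization, dropout 0.1, 768-dim output); the repository's actual module may differ in details such as the activation choice.
import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Illustrative MLP matching the default flags; not the repo's exact module."""

    def __init__(self, in_dim=1280, hidden_sizes=(768, 768, 768),
                 out_feat_dim=768, dropout=0.1, input_norm=True):
        super().__init__()
        layers = [nn.LayerNorm(in_dim)] if input_norm else []
        prev = in_dim
        for h in hidden_sizes:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, out_feat_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Example: project a batch of dino+clip_text features (1280-dim) to 768-dim embeddings.
model = FeatureProjector()
emb = model(torch.randn(4, 1280))
print(emb.shape)  # torch.Size([4, 768])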
Evaluate a trained model on the test set:
bash scripts/eval_coco.sh
This computes mAP@k (mean average precision at k) for different values of k, measuring how well the model ranks truly influential training images.
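To make the metric concrete, here is a minimal sketch of average precision at k for a single query, given a predicted ranking and a set of ground-truth influential training images; the evaluation script may normalize slightly differently, so treat this as an illustration of the idea.
import numpy as np

def average_precision_at_k(ranked_indices, relevant_set, k):
    """AP@k for one query: mean of precision@i over ranks i where a relevant item appears."""
    hits, precisions = 0, []
    for i, idx in enumerate(ranked_indices[:k], start=1):
        if idx in relevant_set:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Example: ground truth says training images {3, 7, 42} are influential for this query.
ranking = [7, 10, 3, 99, 42, 1]
print(average_precision_at_k(ranking, {3, 7, 42}, k=5))  # ≈ 0.756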
If you want to generate features and influence rankings from scratch, follow these steps:
If you haven't downloaded the data yet, run bash scripts/download_data.sh.
Extract DINO, CLIP, and text features from images:
cd feature_extraction
# Extract all features (takes ~60-90 minutes on A100)
bash extract_coco.sh
# This will generate:
# - dino features (768-dim)
# - clip features (512-dim)
# - clip_text features (512-dim)
# - dino+clip_text features (1280-dim)
cd ..
The features will be stored in data/coco/feats_test by default. You can change the output location by setting the FEAT_DIR argument in feature_extraction/extract_coco.sh.
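The 1280-dim dino+clip_text features are consistent with concatenating a 768-dim DINO image feature and a 512-dim CLIP text feature. The sketch below shows that concatenation under the assumption that each modality is L2-normalized first; the paths and preprocessing here are assumptions, and feature_extraction/extract_coco.sh is the authoritative reference.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Assumed inputs: per-image DINO features and per-caption CLIP text features.
dino_feats = np.load("data/coco/feats_test/dino/data_feats.npy")            # [N, 768]
clip_text_feats = np.load("data/coco/feats_test/clip_text/data_feats.npy")  # [N, 512]

# Normalize each modality, then concatenate into the combined 1280-dim feature.
combined = np.concatenate([l2_normalize(dino_feats), l2_normalize(clip_text_feats)], axis=1)
print(combined.shape)  # (N, 1280)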
Compute the (expensive) ground truth influence scores using AttributeByUnlearning (AbU). See abu/coco/README.md for detailed documentation.
We thank Simon Niklaus for his help with the LAION image retrieval. We thank Ruihan Gao, Maxwell Jones, and Gaurav Parmar for helpful discussions and feedback on drafts. Sheng-Yu Wang is supported by the Google PhD Fellowship. The project was partly supported by Adobe Inc., the Packard Fellowship, the IITP grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project), NSF IIS-2239076, and NSF ISS-2403303.
If you use FastGDA in your research, please cite:
@inproceedings{wang2025fastgda,
title={Fast Data Attribution for Text-to-Image Models},
author={Wang, Sheng-Yu and Hertzmann, Aaron and Efros, Alexei A and Zhang, Richard and Zhu, Jun-Yan},
booktitle={NeurIPS},
year={2025},
}