Sheng-Yu Wang1, Aaron Hertzmann2, Alexei A. Efros3, Richard Zhang2, Jun-Yan Zhu1.
Carnegie Mellon University1, Adobe Research2, UC Berkeley3
In NeurIPS, 2025.
Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
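At deployment time, the retrieval over precomputed embeddings can be served with an off-the-shelf similarity index. The following is a minimal sketch using FAISS exact inner-product search; the file paths refer to the precomputed features described in the data section below, and this illustrates the indexing-and-search idea rather than the repository's actual serving code.
import faiss  # pip install faiss-cpu
import numpy as np

# Precomputed embeddings: any [N, D] float32 matrix works; paths follow the data layout below.
train_emb = np.load("data/coco/feats/dino+clip_text/data_feats.npy").astype("float32")
query_emb = np.load("data/coco/feats/dino+clip_text/test_query_feats.npy").astype("float32")

# Build an exact inner-product index over the training embeddings.
index = faiss.IndexFlatIP(train_emb.shape[1])
index.add(train_emb)

# Retrieve the top-20 most similar training images for every query in a single call.
scores, ids = index.search(query_emb, 20)
print(ids.shape)  # (num_queries, 20)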
Create a conda/micromamba environment with all dependencies:
# Using conda
conda env create -f environment.yaml
conda activate fastgda
# Or using micromamba (faster)
micromamba env create -f environment.yaml
micromamba activate fastgda
(We mainly tested the environment with micromamba.)
All data files (pretrained weights, precomputed features, COCO images, influence rankings, and query images) are available on Hugging Face:
Download everything with a single command:
# Use the download script (~15GB total)
bash scripts/download_data.sh
(If you hit Hugging Face rate limits during the download, rerun the script until it completes.)
The download script will automatically:
- Download precomputed features for all feature types (DINO, CLIP, CLIP text, DINO+CLIP text)
- Download and extract COCO train2017 images (concatenates split tar files)
- Download ground truth influence rankings (influence_train.pkl, influence_test.pkl)
- Download query images with precomputed latents and text embeddings
- Download pretrained weight files
Expected data structure after download:
data/coco/
├── train2017/ # COCO training images
├── feats/ # Precomputed features
│ ├── dino+clip_text/
│ │ ├── data_feats.npy # [118287, 1280] training features
│ │ ├── train_query_feats.npy # [5000, 1280] training queries
│ │ └── test_query_feats.npy # [1000, 1280] test queries
│ ├── dino/
│ ├── clip/
│ └── clip_text/
├── influence_train.pkl # Ground truth training influences
├── influence_test.pkl # Ground truth test influences
├── query_train/ # Training query images
│ ├── images/
│ ├── latents.npy
│ ├── text_embeddings.npy
│ └── nn_dino.pkl
└── query_test/ # Test query images
├── images/
├── latents.npy
└── text_embeddings.npy
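As a quick sanity check after downloading, the precomputed features and influence rankings can be inspected directly. The snippet below is a minimal sketch that only relies on the file names and shapes listed above; the exact structure of the .pkl files is best verified by inspection.
import pickle
import numpy as np

data_root = "data/coco"

# Precomputed training and query features for the dino+clip_text variant.
data_feats = np.load(f"{data_root}/feats/dino+clip_text/data_feats.npy")
test_query_feats = np.load(f"{data_root}/feats/dino+clip_text/test_query_feats.npy")
print(data_feats.shape)        # expected: (118287, 1280)
print(test_query_feats.shape)  # expected: (1000, 1280)

# Ground truth influence rankings (inspect the object structure before relying on it).
with open(f"{data_root}/influence_test.pkl", "rb") as f:
    influence_test = pickle.load(f)
print(type(influence_test))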
Launch a Gradio demo to explore image attributions:
python demo.py \
--checkpoint weights/dino+clip_text.pth \
--data_dir data/coco \
--feature_dir data/coco/feats/dino+clip_text
The demo will:
- Load a generated image and its caption
- Compute calibrated features using the trained model
- Rank all training images by influence score
- Display the top-k most influential training images
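Conceptually, the ranking step boils down to a similarity search in the calibrated embedding space. The sketch below illustrates only that step; load_calibration_model and extract_query_feature are hypothetical helpers, and using a plain dot product as the influence score is an assumption for illustration rather than the demo's exact code.
import numpy as np
import torch

# Hypothetical helpers standing in for the demo's checkpoint loading and raw feature extraction.
model = load_calibration_model("weights/dino+clip_text.pth")         # assumed helper
query_feat = extract_query_feature("query.png", "a dog on a beach")  # assumed helper, raw 1280-dim feature

# Project raw features into the calibrated embedding space (assuming the model is a feature projector).
train_feats = np.load("data/coco/feats/dino+clip_text/data_feats.npy")       # [118287, 1280]
with torch.no_grad():
    train_emb = model(torch.from_numpy(train_feats).float())                 # [118287, out_feat_dim]
    query_emb = model(torch.from_numpy(query_feat).float().unsqueeze(0))     # [1, out_feat_dim]

# Rank training images by similarity to the query embedding and keep the top-k.
scores = (train_emb @ query_emb.T).squeeze(1).numpy()
top_k = np.argsort(-scores)[:20]
print("Most influential training image indices:", top_k)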
Before training, you need to run the data download script scripts/download_data.sh.
Train a model using DINO + CLIP text features with this script:
bash scripts/train_coco.sh
Dataset Settings:
- --ftype: Feature type (e.g., dino+clip_text, dino, clip)
- --data_dir: Directory containing feature files
- --rank_file: Path to ground truth influence rankings (.pkl)
Model Architecture:
- --hidden_sizes: Hidden layer sizes (default: [768, 768, 768])
- --input_norm: Use layer normalization on input
- --dropout: Dropout probability (default: 0.1)
- --out_feat_dim: Output feature dimension (default: 768)
Training Hyperparameters:
- --epochs: Number of epochs (default: 10)
- --batch_size: Batch size (default: 4096)
- --lr: Learning rate (default: 0.001)
Logging:
- --wandb: Enable Weights & Biases logging
- --wandb_project: W&B project name (default: fastgda)
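For reference, the Model Architecture options above describe a small MLP feature projector. The snippet below is one plausible PyTorch realization of the defaults (1280-dim dino+clip_text input, three hidden layers of 768 units, input layer normalization, dropout 0.1, 768-dim output); the repository's actual module may differ in details such as the activation choice.
import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Illustrative MLP matching the default flags; not the repo's exact module."""

    def __init__(self, in_dim=1280, hidden_sizes=(768, 768, 768),
                 out_feat_dim=768, dropout=0.1, input_norm=True):
        super().__init__()
        layers = [nn.LayerNorm(in_dim)] if input_norm else []
        prev = in_dim
        for h in hidden_sizes:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, out_feat_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Example: project a batch of dino+clip_text features (1280-dim) to 768-dim embeddings.
model = FeatureProjector()
emb = model(torch.randn(4, 1280))
print(emb.shape)  # torch.Size([4, 768])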
Evaluate a trained model on the test set:
bash scripts/eval_coco.sh
This computes mAP@k (mean average precision at k) for different values of k, measuring how well the model ranks truly influential training images.
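To make the metric concrete, here is a minimal sketch of average precision at k for a single query, given a predicted ranking and a set of ground-truth influential training images; the evaluation script may normalize slightly differently, so treat this as an illustration of the idea.
import numpy as np

def average_precision_at_k(ranked_indices, relevant_set, k):
    """AP@k for one query: mean of precision@i over ranks i where a relevant item appears."""
    hits, precisions = 0, []
    for i, idx in enumerate(ranked_indices[:k], start=1):
        if idx in relevant_set:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Example: ground truth says training images {3, 7, 42} are influential for this query.
ranking = [7, 10, 3, 99, 42, 1]
print(average_precision_at_k(ranking, {3, 7, 42}, k=5))  # ≈ 0.756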
If you want to generate features and influence rankings from scratch, follow these steps:
If you haven't downloaded the data yet, run bash scripts/download_data.sh.
Extract DINO, CLIP, and text features from images:
cd feature_extraction
# Extract all features (takes ~60-90 minutes on A100)
bash extract_coco.sh
# This will generate:
# - dino features (768-dim)
# - clip features (512-dim)
# - clip_text features (512-dim)
# - dino+clip_text features (1280-dim)
cd ..
The features will be stored in data/coco/feats_test by default. You can change the output location by setting the FEAT_DIR argument in feature_extraction/extract_coco.sh.
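The 1280-dim dino+clip_text features are consistent with concatenating a 768-dim DINO image feature and a 512-dim CLIP text feature. The sketch below shows that concatenation under the assumption that each modality is L2-normalized first; the paths and preprocessing here are assumptions, and feature_extraction/extract_coco.sh is the authoritative reference.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Assumed inputs: per-image DINO features and per-caption CLIP text features.
dino_feats = np.load("data/coco/feats_test/dino/data_feats.npy")            # [N, 768]
clip_text_feats = np.load("data/coco/feats_test/clip_text/data_feats.npy")  # [N, 512]

# Normalize each modality, then concatenate into the combined 1280-dim feature.
combined = np.concatenate([l2_normalize(dino_feats), l2_normalize(clip_text_feats)], axis=1)
print(combined.shape)  # (N, 1280)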
Compute the (expensive) ground truth influence scores using AttributeByUnlearning (AbU). See abu/coco/README.md for detailed documentation.
We thank Simon Niklaus for his help with the LAION image retrieval. We thank Ruihan Gao, Maxwell Jones, and Gaurav Parmar for helpful discussions and feedback on drafts. Sheng-Yu Wang is supported by the Google PhD Fellowship. The project was partly supported by Adobe Inc., the Packard Fellowship, the IITP grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project), NSF IIS-2239076, and NSF ISS-2403303.
If you use FastGDA in your research, please cite:
@inproceedings{wang2025fastgda,
title={Fast Data Attribution for Text-to-Image Models},
author={Wang, Sheng-Yu and Hertzmann, Aaron and Efros, Alexei A and Zhang, Richard and Zhu, Jun-Yan},
booktitle={NeurIPS},
year={2025},
}