ImageNet 21k based filtered dataset #83

Open
isidentical opened this issue May 12, 2024 · 1 comment

Comments

@isidentical

Image-based filtering. We select a subset of examples whose visual content overlaps with ImageNet
classes. After applying English language (fasttext) and caption length filtering, we cluster the
image embeddings extracted by the OpenAI ViT-L/14 model for each image into 100K groups using
Faiss [75]. We then find the nearest neighbor group for every ImageNet training example, and keep
examples belonging to these groups. We apply this procedure using either ImageNet-21K (14M
images) or ImageNet-1K (1.2M images), forming two subsets.
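
For reference, the procedure quoted above can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the DataComp implementation: it swaps Faiss for a tiny pure-Python k-means, uses a crude evenly spaced initialization, and assumes embeddings are plain lists or tuples of floats. The names `kmeans`, `nearest`, and `image_based_filter` are hypothetical.

```python
import math


def nearest(centroids, vec):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vec))


def kmeans(points, k, iters=10):
    """Toy Lloyd's k-means, a stand-in for the Faiss clustering in the paper.

    Assumption: centroids start from evenly spaced points in the pool, which
    keeps the sketch deterministic but is not how Faiss initializes.
    """
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(centroids, p)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(bucket) for dim in zip(*bucket)]
    return centroids


def image_based_filter(pool_embs, imagenet_embs, k):
    """Return indices of pool examples that fall in a cluster which is the
    nearest cluster for at least one ImageNet example."""
    centroids = kmeans(pool_embs, k)
    kept_clusters = {nearest(centroids, e) for e in imagenet_embs}
    return [i for i, e in enumerate(pool_embs)
            if nearest(centroids, e) in kept_clusters]


# Two well-separated toy "clusters"; the single ImageNet embedding sits in the first.
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
kept = image_based_filter(pool, [(0.05, 0.05)], k=2)
```

Under this reading, the only difference between the two subsets is which set of ImageNet embeddings is passed as `imagenet_embs` (ImageNet-21K or ImageNet-1K); the clustering of the pool stays the same.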

In the paper, the description of the "Image filters" says that either ImageNet-21K or ImageNet-1K can be used. Looking at the code, however, and especially at DataComp 1B, it looks like only IN1K is used. Is there a version of DataComp 1B built with IN21K?

@sagadre
Collaborator

sagadre commented May 15, 2024

Hi @isidentical, thanks for the question! In our scaling experiments we scaled both the IN1k and IN21k strategies up to the large pool (filtering 1.28B samples). Comparing the rows Image-based clustering (ImageNet1k) and Image-based clustering (ImageNet21k) in Table 27 of the paper, we saw average performance of 0.481 vs. 0.471, respectively. Hence, we only scaled the IN1k strategy up to the xlarge pool (filtering 12.8B samples). Unfortunately, we don't have an IN21k version of DataComp 1B on hand.
