Require OCR-CC information (image IDs) #5

Closed
prajwalgatti opened this issue Oct 4, 2021 · 1 comment
Comments

@prajwalgatti

Hello @zyang-ur and all,

Thanks for this work, it is quite interesting.

I'm trying to obtain the OCR-CC dataset, but because of my constraints I can't download the full 1.7TB dataset.
However, I already have the CC dataset, so I could extract the subset of images that are in OCR-CC myself.

Could you please share the image IDs of CC that were used to construct OCR-CC?

Thanks in advance!

@zyang-ur
Contributor

zyang-ur commented Oct 5, 2021

Hi @prajwalgatti,

Thank you for your interest.

In this case, you can download only the index files with:
path/to/azcopy copy https://tapvqacaption.blob.core.windows.net/data/data/imdb/cc <local_path>/data --recursive

The "image_name" entries in the index files are the CC image IDs. Thank you.
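
For anyone who wants to turn the downloaded index files into a list of CC image IDs, a minimal sketch follows. The thread does not state the index file format, so the loader assumes the files are pickled NumPy object arrays of dict records; that is an assumption, and `collect_image_names` and `load_index_file` are hypothetical helper names. Only the "image_name" key comes from the reply above.

```python
from pathlib import Path
from typing import Iterable, Set


def collect_image_names(records: Iterable) -> Set[str]:
    """Return the distinct "image_name" values found in dict records."""
    names: Set[str] = set()
    for record in records:
        # Skip anything that is not a dict or lacks the field
        # (for example a metadata header entry).
        if isinstance(record, dict) and "image_name" in record:
            names.add(str(record["image_name"]))
    return names


def load_index_file(path: Path) -> Set[str]:
    """Read one index file and collect its image names.

    ASSUMPTION: the file is a .npy object array of dicts. If the real
    files use another format, replace only this loader.
    """
    import numpy as np  # third-party; imported lazily so the helper above stays stdlib-only

    # allow_pickle=True executes pickled data; only use it on trusted files.
    records = np.load(path, allow_pickle=True)
    return collect_image_names(records)
```

The resulting set can then be intersected with the image IDs of a local CC copy to select the OCR-CC subset.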
