Improving ImageNet-1k support #749

Closed
sayakpaul opened this issue Aug 30, 2022 · 7 comments

Comments

@sayakpaul
Contributor

sayakpaul commented Aug 30, 2022

With respect to the current ImageNet-1k support, there are a few things we can improve:

  • First, let's start leveraging TFDS. It significantly reduces the work expected to be done by a user. Let's walk through an example.

First, the user needs to place the ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar archives at this path: gs://[BUCKET-NAME]/tensorflow_datasets/downloads/manual.

  • Once this is done, the user does the following:
import tensorflow_datasets as tfds

data_dir = "gs://[BUCKET-NAME]/tensorflow_datasets"
builder = tfds.builder("imagenet2012", data_dir=data_dir)
builder.download_and_prepare()

builder.download_and_prepare() takes some time, but less than the current process of obtaining the initial TFRecords.

  • Then the user can load the ImageNet-1k dataset with tfds.load("imagenet2012", data_dir=data_dir), and that is it.

The steps above assume the user already has access to the GCS bucket and the privileges needed to write data into it.
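Before running builder.download_and_prepare(), it may help to confirm that both archives are where TFDS expects them. The snippet below is only a minimal sketch of such a check: the helper names manual_dir and missing_archives are hypothetical and not part of TFDS, and the bucket name in the comment is a placeholder.

```python
import posixpath

# Archive names taken from the steps above.
REQUIRED_ARCHIVES = ("ILSVRC2012_img_train.tar", "ILSVRC2012_img_val.tar")


def manual_dir(data_dir):
    """Return the manual-download directory under a TFDS data_dir.

    Hypothetical helper: it only joins path segments, so it works for
    both local paths and gs:// URLs.
    """
    return posixpath.join(data_dir, "downloads", "manual")


def missing_archives(present_filenames):
    """Return the required archives that are absent from a file listing."""
    present = {posixpath.basename(name) for name in present_filenames}
    return [name for name in REQUIRED_ARCHIVES if name not in present]


# Example with a placeholder bucket name (an assumption, not a real bucket):
# manual_dir("gs://my-bucket/tensorflow_datasets")
```

The listing passed to missing_archives could come from something like tf.io.gfile.listdir(manual_dir(data_dir)), which handles GCS paths; that call is left out here so the sketch runs without TensorFlow.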

General recommendations

W.r.t

dataset = tf.data.TFRecordDataset(filenames=filenames)

enable interleaved reading by setting num_parallel_reads=tf.data.AUTOTUNE.

W.r.t

enable prefetching of a few batches so that the accelerator doesn't have to wait by using dataset.prefetch(tf.data.AUTOTUNE).
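Put together, the two recommendations above could look roughly like this. It is a sketch rather than the repository's actual input pipeline: the function name build_input_pipeline and the default batch size are assumptions, and record parsing and augmentation are omitted.

```python
import tensorflow as tf


def build_input_pipeline(filenames, batch_size=2):
    """Hypothetical helper combining interleaved reads with prefetching."""
    # Read from several TFRecord files concurrently rather than one by one.
    dataset = tf.data.TFRecordDataset(
        filenames=filenames,
        num_parallel_reads=tf.data.AUTOTUNE,
    )
    # Parsing, decoding, and augmentation would normally go here.
    dataset = dataset.batch(batch_size)
    # Prepare upcoming batches while the accelerator consumes the current one.
    return dataset.prefetch(tf.data.AUTOTUNE)
```

Placing prefetch last means the whole upstream pipeline overlaps with the training step, which is usually what you want on TPUs.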

@bhack
Contributor

bhack commented Aug 30, 2022

I agree. Honestly, I didn't understand the reply/rationale to my question at #735 (comment).

@sebastian-sz
Contributor

Agreed with TFDS approach for simplicity.

I think it's also possible to use local path instead of GCS bucket.

@sayakpaul
Contributor Author

> Agreed with TFDS approach for simplicity.
>
> I think it's also possible to use local path instead of GCS bucket.

Yes, it's possible. However, keeping the data inside a GCS bucket is necessary to leverage TPU-based training runs, so the two options serve different purposes.

@bhack
Contributor

bhack commented Sep 9, 2022

#774

@tanzhenyu
Contributor

tfds still requires you to download the dataset manually. Are you referring to the process of converting from .tar.gz to TFRecords?

@tanzhenyu tanzhenyu added the "stat: awaiting external input" label Oct 27, 2022

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Feb 13, 2024

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
