Added TFRecord support as a preprocessing cache format #1194
Conversation
@tgaddair On my cluster with 2x 2080Ti GPUs and a 32-core CPU, I observe a constant 5.5-6 iter/s throughout training with batch_size=256. As a baseline, training with ParquetDataset was ~2.5 iter/s in early epochs, decreasing to 5-7 s/iter and finally 9 s/iter.
Nice! This is very awesome. Just a few comments.
if shard_count > 1:
    dataset = dataset.shard(shard_count, cur_shard)
total_samples = self.size
local_samples = int(total_samples / shard_count) if shard_count else total_samples
This is a fine heuristic. Basically, we want the biggest buffer that will fit in memory, but not larger.
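A minimal sketch of that idea, assuming the shuffle buffer is simply capped by the local shard size plus a hard memory cap (`max_buffer` is a hypothetical parameter, not from the PR):

```python
import tensorflow as tf

def shard_and_shuffle(dataset, total_samples, shard_count, cur_shard, max_buffer=100_000):
    # Keep only this worker's shard of the TFRecord data.
    if shard_count > 1:
        dataset = dataset.shard(shard_count, cur_shard)
    # Shuffle with the biggest buffer that still fits in memory: the local
    # shard size, bounded by a hard cap.
    local_samples = int(total_samples / shard_count) if shard_count else total_samples
    return dataset.shuffle(buffer_size=min(local_samples, max_buffer))
```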
compression_level=compression_level)
def get_schema(df, columns=None):
This should be fine for now. One thing we can improve in the future is to use Ludwig's training_set_metadata
to more precisely obtain the data type of each column. But if this works, this is fine for now.
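For reference, a minimal sketch of a dtype-based get_schema, assuming it only needs coarse int64/float32/bytes buckets; the PR's actual mapping (or a future training_set_metadata-based one) may differ:

```python
import numpy as np
import pandas as pd

def get_schema(df: pd.DataFrame, columns=None):
    """Infer a coarse TFRecord feature type for each DataFrame column."""
    schema = {}
    for col in (columns if columns is not None else df.columns):
        dtype = df[col].dtype
        if np.issubdtype(dtype, np.integer):
            schema[col] = "int64"
        elif np.issubdtype(dtype, np.floating):
            schema[col] = "float32"
        else:
            schema[col] = "bytes"  # strings, serialized images, object columns
    return schema
```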
    lambda x: tf.data.TFRecordDataset(x, compression_type="GZIP"),
    num_parallel_calls=tf.data.AUTOTUNE)
# Fetch one element to get the parser.
features, feature_lists = self._detect_schema(dataset)
Instead of detecting the schema here, what if we just stored it in the metadata at write time?
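A hedged sketch of that alternative: persist the schema next to the TFRecord shards at write time and load it back at read time (the file name and JSON layout here are assumptions, not the PR's format):

```python
import json
import os

SCHEMA_FILE = "tfrecord_schema.json"  # hypothetical file name

def write_schema(schema: dict, cache_dir: str) -> None:
    # Written once by the writer, alongside the *.tfrecord shards.
    with open(os.path.join(cache_dir, SCHEMA_FILE), "w") as f:
        json.dump(schema, f)

def read_schema(cache_dir: str) -> dict:
    # Read by the dataset at training time instead of peeking at one record.
    with open(os.path.join(cache_dir, SCHEMA_FILE)) as f:
        return json.load(f)
```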
A couple of observations in testing:
Got an error when writing out the test set:
Does Parquet actually read and serialize the images? If it only saves a link (i.e. the file path) to the image, then this is expected, since TFRecords shift the reading overhead from training time to dataset construction time.
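To illustrate the trade-off being asked about: a TFRecord cache embeds the image bytes themselves in each tf.train.Example, so the file read happens once at cache-construction time rather than on every epoch. A sketch, not the PR's actual writer:

```python
import tensorflow as tf

def image_to_example(image_path: str) -> tf.train.Example:
    # Read the raw encoded image once; tf.io.gfile also handles remote paths.
    with tf.io.gfile.GFile(image_path, "rb") as f:
        image_bytes = f.read()
    return tf.train.Example(features=tf.train.Features(feature={
        "image_bytes": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
    }))
```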
Yeah, it would be easy to add S3 support. Does the Parquet dataset support S3?
For my current test, I'm using a tabular dataset consisting of many small rows. I think the bottleneck is the per-row serialization. The current Parquet writer/reader supports S3 and other distributed filesystems through …
Just a random note: a critical performance factor for serialization is setting the df_engine parallelism, such as here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/backend/ray.py#L44. In my case it is set to the number of CPUs Ray auto-detected (which is 20), and serialization takes <3 minutes on the pet-adoption-finder dataset...
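A small sketch of that note, assuming the parallelism is simply derived from the CPU count Ray detects; the exact config plumbing in ludwig/backend/ray.py may differ:

```python
import ray

ray.init(ignore_reinit_error=True)
# Match the dataframe engine's partition count to the CPUs Ray auto-detected.
parallelism = int(ray.cluster_resources().get("CPU", 1))
print(f"df_engine parallelism: {parallelism}")
```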
All comments are addressed except this test set issue. Could you let me know the procedure to reproduce it so I can take a look?
Nice work @zhisbug. I'll try to get a small repro together for the issue above. In the meantime, it looks like there are a few failing tests we should fix: https://github.com/ludwig-ai/ludwig/pull/1194/checks?check_run_id=2751504369#step:9:319
# interleave the tfrecord files for parallel reading
dataset = files.interleave(
    lambda x: tf.data.TFRecordDataset(x, compression_type=self.compression_type),
Looks like this should work with S3 and other remote paths natively: https://blog.min.io/hyper-scale-machine-learning-with-minio-and-tensorflow/
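A minimal sketch of reading the cached shards straight from S3, assuming TensorFlow's S3 filesystem support as described in the linked post; the bucket and prefix are placeholders:

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("s3://my-bucket/ludwig-cache/*.tfrecord.gz")
dataset = files.interleave(
    lambda path: tf.data.TFRecordDataset(path, compression_type="GZIP"),
    num_parallel_calls=tf.data.AUTOTUNE,
)
```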
@zhisbug I fixed the issue above, it was caused by an empty test set. The check lazy load issue has also been resolved. We should be good to land once tests pass.
This PR adds support for TFRecordDataset when the backend is Ray. The functionality is complete, but several problems need to be addressed before merging:

- I have to comment out this line in order to make the Ray backend + Dask work on images. This issue is unrelated to the feature introduced by this PR; I did some diagnosis and found that a previous commit 30d164e7cc3fa7d1c45286727c0183f8eefa8e39 caused it.
- There is still an unknown issue when writing images into tfrecords. This line will die when dumping multiple Dask DF partitions to disk (a sketch of this write step is shown after this list). It only happens when we use images; still under my investigation.
- [Fixed] Some minor bugs on type conversion, which I will fix after running some tests.
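For context on the second item, here is a hedged sketch of what writing one Dask dataframe partition out as a GZIP-compressed TFRecord shard can look like; the helper name and schema format are illustrative, not the PR's actual code:

```python
import tensorflow as tf

def write_partition(df_partition, path: str, schema: dict) -> None:
    # One shard per partition, GZIP-compressed like the reader expects.
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(path, options=options) as writer:
        for _, row in df_partition.iterrows():
            feature = {}
            for col, kind in schema.items():
                value = row[col]
                if kind == "int64":
                    feature[col] = tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))
                elif kind == "float32":
                    feature[col] = tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))
                else:  # bytes: strings, serialized images, etc.
                    raw = value if isinstance(value, bytes) else str(value).encode("utf-8")
                    feature[col] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw]))
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```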