Enable predicate pushdown #4

Conversation
self.streaming_dataset = streaming_dataset[self.split]
if not self.streaming_dataset.features:
    self.streaming_dataset = self.streaming_dataset._resolve_features()
this is the ultimate way to be 100% sure we have the features: for some data formats like JSON Lines, we need to stream a few rows to infer the features
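To illustrate why JSON Lines needs this, here is a minimal, hypothetical sketch (not the `datasets` library's actual `_resolve_features` implementation): since JSON Lines carries no up-front schema, the only way to know the column types is to peek at a few records.

```python
# Hypothetical sketch: infer a {column: type_name} mapping by streaming
# the first few rows of a JSON Lines source. The real library resolves
# proper Features objects; this only shows the "peek to infer" idea.
import io
import json

def infer_features(jsonl_stream, n_rows=5):
    """Peek at up to n_rows lines and record each column's Python type."""
    features = {}
    for i, line in enumerate(jsonl_stream):
        if i >= n_rows:
            break
        for key, value in json.loads(line).items():
            features.setdefault(key, type(value).__name__)
    return features

raw = io.StringIO('{"text": "hello", "label": 0}\n{"text": "world", "label": 1}\n')
print(infer_features(raw))  # {'text': 'str', 'label': 'int'}
```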
yield from pa_table.select(columns).to_batches()

if self.streaming_dataset:
    shard = self.streaming_dataset.shard(num_shards=self.streaming_dataset.num_shards, index=partition.index)
    if shard._ex_iterable.iter_arrow:
minor edit: some streaming datasets don't have iter_arrow, like WebDataset-format datasets, for which we stream Python objects
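A minimal sketch of the fallback this comment describes, with stand-in objects rather than the PR's actual classes: prefer the Arrow iterator when the shard exposes one, otherwise fall back to pruning plain Python rows (as needed for WebDataset-style shards). The `batch.select(columns)` call assumes a pyarrow-like batch API.

```python
# Illustrative sketch, not the PR's exact code.
def read_shard(shard, columns):
    """Yield column-pruned output, preferring the Arrow path when present."""
    iter_arrow = getattr(shard, "iter_arrow", None)
    if iter_arrow:
        # Arrow path: batches support efficient column selection
        # (assumes a pyarrow-style .select on each batch).
        for batch in iter_arrow():
            yield batch.select(columns)
    else:
        # Python-object path (e.g. WebDataset shards): prune row by row.
        for row in shard:
            yield {k: row[k] for k in columns}

# A shard without iter_arrow falls back to plain dict rows:
rows = [{"text": "a", "label": 0, "extra": None}]
print(list(read_shard(iter(rows), ["text", "label"])))
# [{'text': 'a', 'label': 0}]
```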
self.builder.download_and_prepare()
dataset = self.builder.as_dataset(self.split)
minor edit: reuse the builder instead of calling load_dataset again
if "path" not in options or not options["path"]:
    raise Exception("You must specify a dataset name.")

kwargs = dict(self.options)
is it ok to pass all the remaining options as kwargs to the builder? or should I set an allow-list / disallow-list?
This should be fine. Just note that all values in options are strings.
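Since all option values arrive as strings, a hypothetical coercion helper (names are mine, not from the PR) could convert obvious booleans and numbers before forwarding them as kwargs:

```python
# Hypothetical sketch: Spark data source options are all strings, so
# values like "True" or "4" may need coercion before reaching the builder.
def coerce_option(value: str):
    """Best-effort conversion of an option string to bool/int/float."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave anything else as a plain string

options = {"streaming": "True", "num_proc": "4", "config": "en"}
kwargs = {k: coerce_option(v) for k, v in options.items()}
print(kwargs)  # {'streaming': True, 'num_proc': 4, 'config': 'en'}
```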
This is awesome! We currently don't support filter pushdown and column pruning for Python data sources, and this is a nice workaround!
if self.split not in streaming_dataset:
    raise Exception(f"Split {self.split} is invalid. Valid options are {list(streaming_dataset)}")

self.streaming_dataset = streaming_dataset[self.split]
Just want to make sure: is streaming_dataset (or the dataset builder) pickleable?
yes, everything is pickleable :) for example:

I also moved most of the logic inside HuggingFaceDatasets.__init__, because otherwise for some datasets it couldn't get a schema(), and did some minor edits
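A minimal sketch of the pickle round-trip Spark relies on when shipping reader state to executors. `ReaderState` and its fields are stand-ins for illustration, not the PR's actual class:

```python
# Minimal sketch: the reader's state must survive pickling so Spark can
# send it to executors. ReaderState is a hypothetical stand-in.
import pickle

class ReaderState:
    def __init__(self, path, split):
        self.path = path
        self.split = split

state = ReaderState("rotten_tomatoes", "train")
clone = pickle.loads(pickle.dumps(state))
print(clone.path, clone.split)  # rotten_tomatoes train
```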