
Conversation

@lhoestq lhoestq (Member) commented Dec 13, 2024

for example:

>>> spark.read.format("huggingface") \
...     .option("filters", '[("language_score", ">", 0.99)]') \
...     .option("columns", '["text", "language_score"]') \
...     .load("HuggingFaceFW/fineweb-edu") \
...     .show()
+--------------------+------------------+                                       
|                text|    language_score|
+--------------------+------------------+
|died Aug. 28, 181...|0.9901925325393677|
|Coyotes spend a g...|0.9902171492576599|
|...                 |               ...|
+--------------------+------------------+

I also moved most of the logic into HuggingFaceDatasets.__init__, since otherwise schema() failed for some datasets, and made some minor edits.
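Since Spark data source options are plain strings, the `filters` value in the example above has to be parsed back into tuples before the `datasets` library can use it. A minimal sketch of such parsing (the helper name is hypothetical, not from the PR):

```python
import ast

# Hypothetical helper: turn the string-valued "filters" option back into a
# list of (column, op, value) tuples, as expected by the datasets library.
def parse_filters(filters_option: str):
    parsed = ast.literal_eval(filters_option)
    return [tuple(f) for f in parsed]

parse_filters('[("language_score", ">", 0.99)]')
# returns [("language_score", ">", 0.99)]
```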


self.streaming_dataset = streaming_dataset[self.split]
if not self.streaming_dataset.features:
    self.streaming_dataset = self.streaming_dataset._resolve_features()
Member Author
this is the only way to be 100% sure we have the features: for some data formats like JSON Lines, we need to stream a few rows to infer them
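To illustrate why streaming is needed (a toy sketch, not the `datasets` internals): JSON Lines files carry no schema, so the column names and types are only discoverable by reading a few rows.

```python
import io
import json

# Toy illustration: infer column names/types from the first rows of a
# JSON Lines stream, since the format itself has no schema.
def infer_features(jsonl_stream, n_rows=2):
    features = {}
    for _, line in zip(range(n_rows), jsonl_stream):
        for key, value in json.loads(line).items():
            features.setdefault(key, type(value).__name__)
    return features

sample = io.StringIO(
    '{"text": "died Aug. 28...", "language_score": 0.99}\n'
    '{"text": "Coyotes spend...", "language_score": 0.98}\n'
)
infer_features(sample)  # {'text': 'str', 'language_score': 'float'}
```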

yield from pa_table.select(columns).to_batches()
if self.streaming_dataset:
    shard = self.streaming_dataset.shard(num_shards=self.streaming_dataset.num_shards, index=partition.index)
    if shard._ex_iterable.iter_arrow:
Member Author
minor edit: some streaming datasets don't have iter_arrow, e.g. WebDataset-format datasets, for which we stream Python objects

Comment on lines +157 to +158
self.builder.download_and_prepare()
dataset = self.builder.as_dataset(self.split)
Member Author
minor edit: reuse the builder instead of calling load_dataset

if "path" not in options or not options["path"]:
    raise Exception("You must specify a dataset name.")

kwargs = dict(self.options)
Member Author
is it ok to pass all the remaining options as kwargs to the builder? Or should I set an allow-list / disallow-list?

Contributor
This should be fine. Just note that all values in options are strings.
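Since every option value arrives as a string, a small coercion step may be needed before forwarding them as builder kwargs. A hypothetical sketch (not from the PR):

```python
import ast

# Hypothetical coercion: Spark passes every option as a string, so values
# like "True" or "4" are parsed back into Python literals; anything that
# isn't a literal stays a string.
def coerce_option(value: str):
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value

options = {"streaming": "True", "name": "default", "num_proc": "4"}
kwargs = {k: coerce_option(v) for k, v in options.items()}
# kwargs == {"streaming": True, "name": "default", "num_proc": 4}
```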

@allisonwang-db allisonwang-db (Contributor) left a comment
This is awesome! We currently don't support filter pushdown and column pruning for Python data sources, and this is a nice workaround!

if self.split not in streaming_dataset:
    raise Exception(f"Split {self.split} is invalid. Valid options are {list(streaming_dataset)}")

self.streaming_dataset = streaming_dataset[self.split]
Contributor
Just want to make sure: is streaming_dataset (or the dataset builder) pickleable?

Member Author
yes everything is pickleable :)
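That guarantee can be checked with a plain pickle round-trip (toy stand-in object; Spark pickles the reader when shipping it to executor processes):

```python
import pickle

# Toy stand-in for the reader state: a pickle round-trip must succeed for
# Spark to ship it from the driver to the executors.
class ReaderState:
    def __init__(self, split, columns):
        self.split = split
        self.columns = columns

restored = pickle.loads(pickle.dumps(ReaderState("train", ["text"])))
# restored.split == "train", restored.columns == ["text"]
```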

@lhoestq lhoestq merged commit df34a46 into main Dec 16, 2024