
Conversation

@lhoestq lhoestq (Member) commented Dec 13, 2024

for example:

>>> spark.read.format("huggingface") \
...     .option("filters", '[("language_score", ">", 0.99)]') \
...     .option("columns", '["text", "language_score"]') \
...     .load("HuggingFaceFW/fineweb-edu") \
...     .show()
+--------------------+------------------+                                       
|                text|    language_score|
+--------------------+------------------+
|died Aug. 28, 181...|0.9901925325393677|
|Coyotes spend a g...|0.9902171492576599|
|...                 |               ...|
+--------------------+------------------+

I also moved most of the logic into HuggingFaceDatasets.__init__, since otherwise schema() failed for some datasets, and made some minor edits.
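Since Spark data source options are plain strings, the `filters` value in the example above has to be parsed back into tuples before the `datasets` library can use it. A minimal sketch of such parsing (the helper name is hypothetical, not from the PR):

```python
import ast

# Hypothetical helper: turn the string-valued "filters" option back into a
# list of (column, op, value) tuples, as expected by the datasets library.
def parse_filters(filters_option: str):
    parsed = ast.literal_eval(filters_option)
    return [tuple(f) for f in parsed]

parse_filters('[("language_score", ">", 0.99)]')
# returns [("language_score", ">", 0.99)]
```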


self.streaming_dataset = streaming_dataset[self.split]
if not self.streaming_dataset.features:
    self.streaming_dataset = self.streaming_dataset._resolve_features()
Member Author
this is the only way to be 100% sure we have the features: for some data formats like JSON Lines, we need to stream a few rows to infer them
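To illustrate why streaming is needed (a toy sketch, not the `datasets` internals): JSON Lines files carry no schema, so the column names and types are only discoverable by reading a few rows.

```python
import io
import json

# Toy illustration: infer column names/types from the first rows of a
# JSON Lines stream, since the format itself has no schema.
def infer_features(jsonl_stream, n_rows=2):
    features = {}
    for _, line in zip(range(n_rows), jsonl_stream):
        for key, value in json.loads(line).items():
            features.setdefault(key, type(value).__name__)
    return features

sample = io.StringIO(
    '{"text": "died Aug. 28...", "language_score": 0.99}\n'
    '{"text": "Coyotes spend...", "language_score": 0.98}\n'
)
infer_features(sample)  # {'text': 'str', 'language_score': 'float'}
```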

yield from pa_table.select(columns).to_batches()
if self.streaming_dataset:
    shard = self.streaming_dataset.shard(num_shards=self.streaming_dataset.num_shards, index=partition.index)
    if shard._ex_iterable.iter_arrow:
Member Author
minor edit: some streaming datasets don't have iter_arrow, e.g. WebDataset-format datasets, for which we stream Python objects

Comment on lines +157 to +158
self.builder.download_and_prepare()
dataset = self.builder.as_dataset(self.split)
Member Author
minor edit: reuse the builder instead of calling load_dataset

if "path" not in options or not options["path"]:
    raise Exception("You must specify a dataset name.")

kwargs = dict(self.options)
Member Author
is it ok to pass all the remaining options as kwargs to the builder? Or should I set an allow-list / disallow-list?

Contributor
This should be fine. Just note that all values in options are strings.
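Since every option value arrives as a string, a small coercion step may be needed before forwarding them as builder kwargs. A hypothetical sketch (not from the PR):

```python
import ast

# Hypothetical coercion: Spark passes every option as a string, so values
# like "True" or "4" are parsed back into Python literals; anything that
# isn't a literal stays a string.
def coerce_option(value: str):
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value

options = {"streaming": "True", "name": "default", "num_proc": "4"}
kwargs = {k: coerce_option(v) for k, v in options.items()}
# kwargs == {"streaming": True, "name": "default", "num_proc": 4}
```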

@allisonwang-db allisonwang-db (Contributor) left a comment
This is awesome! We currently don't support filter pushdown and column pruning for Python data sources, and this is a nice workaround!

if self.split not in streaming_dataset:
    raise Exception(f"Split {self.split} is invalid. Valid options are {list(streaming_dataset)}")

self.streaming_dataset = streaming_dataset[self.split]
Contributor
Just want to make sure: is streaming_dataset (or the dataset builder) pickleable?

Member Author
yes everything is pickleable :)
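That guarantee can be checked with a plain pickle round-trip (toy stand-in object; Spark pickles the reader when shipping it to executor processes):

```python
import pickle

# Toy stand-in for the reader state: a pickle round-trip must succeed for
# Spark to ship it from the driver to the executors.
class ReaderState:
    def __init__(self, split, columns):
        self.split = split
        self.columns = columns

restored = pickle.loads(pickle.dumps(ReaderState("train", ["text"])))
# restored.split == "train", restored.columns == ["text"]
```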

@lhoestq lhoestq merged commit df34a46 into main Dec 16, 2024