
Conversation

@allisonwang-db (Contributor) commented on Nov 25, 2024

This PR ports the existing Hugging Face data source implementation from pyspark-data-sources and adds a demo notebook.

TODOs:

  • Implement the partitions method in DataSourceReader that leverages the new num_shards parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard
  • Add more data source options
    • config/subset: allow users to specify a subset
    • custom caching locations
  • Improve performance (use caching for repeated df.load() operations).
  • Yield Arrow record batches directly to reduce serialization overhead (requires Spark master branch build).
  • Provide a better progress indicator during df.load(). Currently, it lacks the nice progress bar available when using load_dataset.
  • Implement a Hugging Face Sink (DataSourceWriter).

Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)

[Screenshot: 2024-11-25 at 7:34:45 PM]

@lhoestq (Member) left a comment

Awesome 🔥 LGTM :)

Implement the partitions method in DataSourceReader that leverages the new num_shards parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard

Link to IterableDataset.shard docs: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.IterableDataset.shard

(load_dataset with streaming returns an IterableDataset, not a Dataset)
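A rough sketch of what that partitions method could look like against the streaming API (the class name, constructor, the "split" option and its default are assumptions for illustration; only the "path" option appears in this PR's snippets):

from pyspark.sql.datasource import DataSourceReader, InputPartition

class HuggingFaceDatasetsReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def partitions(self):
        from datasets import load_dataset
        ds = load_dataset(
            self.options["path"],
            split=self.options.get("split", "train"),
            streaming=True,
        )
        # n_shards is the number of data files backing the streaming dataset;
        # expose one Spark input partition per shard.
        return [InputPartition(i) for i in range(ds.n_shards)]

    def read(self, partition):
        from datasets import load_dataset
        ds = load_dataset(
            self.options["path"],
            split=self.options.get("split", "train"),
            streaming=True,
        )
        # Each Spark task iterates only over its own shard.
        shard = ds.shard(num_shards=ds.n_shards, index=partition.value)
        for example in shard:
            yield tuple(example.values())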

Add more data source options, such as custom caching locations.

There is no caching when streaming with datasets, or did you mean some caching on the Spark side?

Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)

YES 1000% !

def schema(self):
    from datasets import load_dataset_builder
    dataset_name = self.options["path"]
    ds_builder = load_dataset_builder(dataset_name)
@lhoestq (Member) commented:

Some datasets have configs/subsets that can be loaded like load_dataset_builder(dataset_name, subset_name)

Some functions that we can use (see the sketch after this list):

  • get_dataset_config_names
  • get_dataset_default_config_name

and we can also validate the split name using

  • get_dataset_split_names
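
A rough sketch of how these helpers could back a config/subset option (the dataset name and the fallback/validation logic below are purely illustrative):

from datasets import (
    get_dataset_config_names,
    get_dataset_default_config_name,
    get_dataset_split_names,
)

dataset_name = "rajpurkar/squad"  # illustrative dataset

# List the available configs/subsets and fall back to the default config
# when the user does not pass one explicitly.
configs = get_dataset_config_names(dataset_name)
config = get_dataset_default_config_name(dataset_name) or configs[0]

# Validate a user-supplied split before kicking off any Spark job.
splits = get_dataset_split_names(dataset_name, config)
if "train" not in splits:
    raise ValueError(f"Split 'train' not found; available splits: {splits}")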

@allisonwang-db (Contributor, Author) replied:

Nice, we can add an additional data source option for config/subset.

Comment on lines 67 to 71
schema = StructType()
for key, value in features.items():
    # For simplicity, use string for all values.
    schema.add(StructField(key, StringType(), True))
return schema
@lhoestq (Member) suggested:

or simply this? :)

Suggested change
-    schema = StructType()
-    for key, value in features.items():
-        # For simplicity, use string for all values.
-        schema.add(StructField(key, StringType(), True))
-    return schema
+    return from_arrow_schema(features.arrow_schema)

provided this is imported:

from pyspark.sql.pandas.types import from_arrow_schema

feel free to try in another PR if you prefer
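
For context, a sketch of what the simplified schema() could look like end to end (the features lookup via ds_builder.info.features is an assumption about how the surrounding code is written):

from pyspark.sql.pandas.types import from_arrow_schema

def schema(self):
    from datasets import load_dataset_builder
    ds_builder = load_dataset_builder(self.options["path"])
    features = ds_builder.info.features
    # Map the Hugging Face feature types to a Spark schema via Arrow.
    return from_arrow_schema(features.arrow_schema)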

@allisonwang-db (Contributor, Author) replied:

Good to know! So much easier to convert the schema now :)

@allisonwang-db merged commit a7d719d into main on Nov 26, 2024