
Conversation

@allisonwang-db (Contributor) commented on Nov 25, 2024

This PR ports the existing Hugging Face data source implementation from pyspark-data-sources and adds a demo notebook.

TODOs:

  • Implement the partitions method in DataSourceReader that leverages the new num_shards parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard
  • Add more data source options
    • config/subset: allow users to specify a subset
    • custom caching locations
  • Improve performance (use caching for repeated df.load() operations).
  • Yield Arrow record batches directly to reduce serialization overhead (requires Spark master branch build).
  • Provide a better progress indicator during df.load(). Currently, it lacks the nice progress bar available when using load_dataset.
  • Implement a Hugging Face Sink (DataSourceWriter).

Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)

[Screenshot: 2024-11-25 at 7:34:45 PM]

@lhoestq (Member) left a comment

Awesome 🔥 LGTM :)

Implement the partitions method in DataSourceReader that leverages the new num_shards parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard

Link to IterableDataset.shard docs: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.IterableDataset.shard

(load_dataset with streaming returns an IterableDataset, not a Dataset)
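A rough sketch of what that partitions method could look like against the streaming API (the class name, constructor, the "split" option and its default are assumptions for illustration; only the "path" option appears in this PR's snippets):

from pyspark.sql.datasource import DataSourceReader, InputPartition

class HuggingFaceDatasetsReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def partitions(self):
        from datasets import load_dataset
        ds = load_dataset(
            self.options["path"],
            split=self.options.get("split", "train"),
            streaming=True,
        )
        # n_shards is the number of data files backing the streaming dataset;
        # expose one Spark input partition per shard.
        return [InputPartition(i) for i in range(ds.n_shards)]

    def read(self, partition):
        from datasets import load_dataset
        ds = load_dataset(
            self.options["path"],
            split=self.options.get("split", "train"),
            streaming=True,
        )
        # Each Spark task iterates only over its own shard.
        shard = ds.shard(num_shards=ds.n_shards, index=partition.value)
        for example in shard:
            yield tuple(example.values())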

Add more data source options, such as custom caching locations.

There is no caching when streaming with datasets, or did you mean some caching on the Spark side?

Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)

YES 1000% !

def schema(self):
    from datasets import load_dataset_builder
    dataset_name = self.options["path"]
    ds_builder = load_dataset_builder(dataset_name)
@lhoestq (Member) commented:

Some datasets have configs/subsets that can be loaded like load_dataset_builder(dataset_name, subset_name)

Some functions that we can use (see the sketch after this list):

  • get_dataset_config_names
  • get_dataset_default_config_name

and we can also validate the split name using

  • get_dataset_split_names
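
A rough sketch of how these helpers could back a config/subset option (the dataset name and the fallback/validation logic below are purely illustrative):

from datasets import (
    get_dataset_config_names,
    get_dataset_default_config_name,
    get_dataset_split_names,
)

dataset_name = "rajpurkar/squad"  # illustrative dataset

# List the available configs/subsets and fall back to the default config
# when the user does not pass one explicitly.
configs = get_dataset_config_names(dataset_name)
config = get_dataset_default_config_name(dataset_name) or configs[0]

# Validate a user-supplied split before kicking off any Spark job.
splits = get_dataset_split_names(dataset_name, config)
if "train" not in splits:
    raise ValueError(f"Split 'train' not found; available splits: {splits}")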

@allisonwang-db (Contributor, Author) replied:

Nice, we can add an additional data source option for config/subset.

Comment on lines 67 to 71
schema = StructType()
for key, value in features.items():
    # For simplicity, use string for all values.
    schema.add(StructField(key, StringType(), True))
return schema
@lhoestq (Member) suggested:

or simply this? :)

Suggested change
-    schema = StructType()
-    for key, value in features.items():
-        # For simplicity, use string for all values.
-        schema.add(StructField(key, StringType(), True))
-    return schema
+    return from_arrow_schema(features.arrow_schema)

provided this is imported:

from pyspark.sql.pandas.types import from_arrow_schema

feel free to try in another PR if you prefer
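
For context, a sketch of what the simplified schema() could look like end to end (the features lookup via ds_builder.info.features is an assumption about how the surrounding code is written):

from pyspark.sql.pandas.types import from_arrow_schema

def schema(self):
    from datasets import load_dataset_builder
    ds_builder = load_dataset_builder(self.options["path"])
    features = ds_builder.info.features
    # Map the Hugging Face feature types to a Spark schema via Arrow.
    return from_arrow_schema(features.arrow_schema)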

@allisonwang-db (Contributor, Author) replied:

Good to know! So much easier to convert the schema now :)

@allisonwang-db merged commit a7d719d into main on Nov 26, 2024