Add basic HuggingFace Data Source Implementation #1
Conversation
Awesome 🔥 LGTM :)
> Implement the `partitions` method in `DataSourceReader` that leverages the new `num_shards` parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard

Link to the `IterableDataset.shard` docs: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.IterableDataset.shard
(`load_dataset` with `streaming=True` returns an `IterableDataset`, not a `Dataset`)
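For reference, a minimal sketch of what that could look like, assuming the reader keeps the dataset name in `self.options["path"]` and streams a fixed `train` split (the class name and details here are illustrative, not the PR's actual code):

```python
from pyspark.sql.datasource import DataSourceReader, InputPartition


class HuggingFaceDatasetsReader(DataSourceReader):  # hypothetical name
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def partitions(self):
        from datasets import load_dataset

        ds = load_dataset(self.options["path"], split="train", streaming=True)
        # One Spark input partition per shard of the IterableDataset.
        # (Older `datasets` releases expose this as `ds.n_shards`.)
        return [InputPartition(i) for i in range(ds.num_shards)]

    def read(self, partition):
        from datasets import load_dataset

        ds = load_dataset(self.options["path"], split="train", streaming=True)
        # Each Spark task streams only its own shard.
        shard = ds.shard(num_shards=ds.num_shards, index=partition.value)
        for example in shard:
            yield tuple(example.values())
```

Re-opening the dataset inside each task avoids having to pickle the `IterableDataset` itself; whether that matters depends on how the reader ends up being serialized.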
> Add more data source options, such as custom caching locations.

There is no caching when streaming with `datasets`. Or did you mean some caching on the Spark side?
> Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)

YES, 1000%!
```python
def schema(self):
    from datasets import load_dataset_builder
    dataset_name = self.options["path"]
    ds_builder = load_dataset_builder(dataset_name)
```
Some datasets have configs/subsets that can be loaded like `load_dataset_builder(dataset_name, subset_name)`.

Some functions that we can use:
- `get_dataset_config_names`
- `get_dataset_default_config_name`

and we can also validate the split name using
- `get_dataset_split_names`
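For example (a quick sketch; the dataset name is just an illustration):

```python
from datasets import (
    get_dataset_config_names,
    get_dataset_default_config_name,
    get_dataset_split_names,
)

name = "rajpurkar/squad"  # illustrative dataset
configs = get_dataset_config_names(name)           # e.g. ["plain_text"]
default = get_dataset_default_config_name(name)    # None when there is no default
splits = get_dataset_split_names(name, default or configs[0])
assert "train" in splits  # validate a user-supplied split against this list
```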
Nice, we can add an additional data source option for the config/subset.
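Something like this inside `schema()` could work (a sketch; `"config"` is an assumed option name):

```python
dataset_name = self.options["path"]
config_name = self.options.get("config")  # hypothetical option for the subset/config
# Passing None falls back to the dataset's default config.
ds_builder = load_dataset_builder(dataset_name, config_name)
```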
pyspark_huggingface/huggingface.py (Outdated)
```python
schema = StructType()
for key, value in features.items():
    # For simplicity, use string for all values.
    schema.add(StructField(key, StringType(), True))
return schema
```
or simply this? :)

```diff
-schema = StructType()
-for key, value in features.items():
-    # For simplicity, use string for all values.
-    schema.add(StructField(key, StringType(), True))
-return schema
+return from_arrow_schema(features.arrow_schema)
```
provided this is imported:

```python
from pyspark.sql.pandas.types import from_arrow_schema
```
feel free to try in another PR if you prefer
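A quick way to try the conversion end to end (a sketch; the dataset name is just an example):

```python
from datasets import load_dataset_builder
from pyspark.sql.pandas.types import from_arrow_schema

features = load_dataset_builder("stanfordnlp/imdb").info.features  # example dataset
spark_schema = from_arrow_schema(features.arrow_schema)
print(spark_schema)  # a StructType with properly typed fields instead of all-strings
```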
Good to know! So much easier to convert the schema now :)
This PR ports the existing Hugging Face data source implementation from pyspark-data-sources and adds a demo notebook.
TODOs:
- Implement the `partitions` method in `DataSourceReader` that leverages the new `num_shards` parameter from the Dataset: https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.shard
- Add more data source options, such as custom caching locations, for `load_dataset`.
- Once Spark 4 is released, we can add Spark as an option in "Use this dataset" :)