Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HF Dataset components #8

Merged
merged 11 commits into from
Mar 10, 2023
Merged

Add HF Dataset components #8

merged 11 commits into from
Mar 10, 2023

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Mar 8, 2023

This PR adds boilerplate HFDatasetLoaderComponent and HFDatasetTransformComponent classes.

Questions:

  • in the Datasets library, there's a distinction between a Dataset and a DatasetDict. The latter is just a dictionary of several Dataset instances, with keys like "train", "validation", "test". I assume the index could be stored as a Dataset, whereas data sources can be either a Dataset or a DatasetDict.
  • if you use load_dataset without providing a split (like "train", "validation"), this method returns a DatasetDict instead of a Dataset. So was wondering how we could handle these splits a user might have.

To do:

  • add an example



# pylint: disable=too-few-public-methods
class HFDatasetsDataset(ExpressDataset[List[str], Union[datasets.Dataset, datasets.DatasetDict]]):
Copy link
Contributor Author

@NielsRogge NielsRogge Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also was wondering whether we could have a better name (DatasetsDataset might not be the best name)

Copy link
Member

@RobbeSneyders RobbeSneyders Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to just go for HFDataset?

tmp_dir
)

data_source_hf_datasets = load_dataset("parquet", data_dir=local_parquet_path, split="train")
Copy link
Contributor Author

@NielsRogge NielsRogge Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_dataset also supports things like multiprocessing (if there are multiple CPU cores) and streaming (loading data on-the-fly), not sure if we want to leverage those too

@NielsRogge
Copy link
Contributor Author

@RobbeSneyders so the pipeline fails because datasets is not in the dependencies of pyproject.toml. However I don't think it makes sense to make this a hard dependency, which is why I was arguing for soft dependency checks depending on which components a user might want to use. Thoughts?

@RobbeSneyders
Copy link
Member

I think handling soft dependencies would be good indeed. I would add it as a dependency in this PR, and then open a new PR where we implement the soft dependency checks for both datasets and pandas.

@RobbeSneyders
Copy link
Member

Seems like I was a bit too strict with the allowed licenses. We can allow weak-copyleft licenses as well.

You can add the following ones and it will fix the pipeline:

  • GNU Library or Lesser General Public License (LGPL)
  • GNU Lesser General Public License v2 (LGPLv2)
  • GNU Lesser General Public License v2 or later (LGPLv2+)
  • GNU Lesser General Public License v3 (LGPLv3)
  • GNU Lesser General Public License v3 or later (LGPLv3+)
  • Mozilla Public License 1.0 (MPL)
  • Mozilla Public License 1.1 (MPL 1.1)
  • Mozilla Public License 2.0 (MPL 2.0)
  • The Unlicense (Unlicense)

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM.

Can you just remove the pylint: disables and resolve the conflict?

@RobbeSneyders RobbeSneyders merged commit 08190db into main Mar 10, 2023
PhilippeMoussalli added a commit that referenced this pull request Mar 14, 2023
This PR adds boilerplate `HFDatasetLoaderComponent` and
`HFDatasetTransformComponent` classes.
@RobbeSneyders RobbeSneyders deleted the add_hf_dataset_component branch May 15, 2023 16:30
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR adds boilerplate `HFDatasetLoaderComponent` and
`HFDatasetTransformComponent` classes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants