Commit

Merge remote-tracking branch 'upstream/main' into remove-metrics
albertvillanova committed Jun 20, 2024
2 parents e076faf + a6ccf94 commit b6ec0c4
Showing 6 changed files with 39 additions and 43 deletions.
24 changes: 6 additions & 18 deletions README.md
@@ -9,24 +9,12 @@
</p>

<p align="center">
<a href="https://github.com/huggingface/datasets/actions/workflows/ci.yml?query=branch%3Amain">
<img alt="Build" src="https://github.com/huggingface/datasets/actions/workflows/ci.yml/badge.svg?branch=main">
</a>
<a href="https://github.com/huggingface/datasets/blob/main/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/datasets/index.html">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/datasets/index.html.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/datasets/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/datasets.svg">
</a>
<a href="https://huggingface.co/datasets/">
<img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen">
</a>
<a href="CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
</a>
<a href="https://github.com/huggingface/datasets/actions/workflows/ci.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/huggingface/datasets/actions/workflows/ci.yml/badge.svg?branch=main"></a>
<a href="https://github.com/huggingface/datasets/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue"></a>
<a href="https://huggingface.co/docs/datasets/index.html"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/datasets/index.html.svg?down_color=red&down_message=offline&up_message=online"></a>
<a href="https://github.com/huggingface/datasets/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/datasets.svg"></a>
<a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
<a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg"></a>
<a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>

2 changes: 1 addition & 1 deletion docs/source/dataset_script.mdx
@@ -12,7 +12,7 @@ as long as your dataset repository has a [required structure](./repository_struc

<Tip warning=true>

In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
For security reasons, 🤗 Datasets do not allow running dataset loading scripts by default, and you have to pass `trust_remote_code=True` to load datasets that require running a dataset script.

</Tip>
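The default-deny policy described in this tip can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: `resolve_trust_remote_code`, its parameters, and the repository name are hypothetical and chosen for the example.

```python
from typing import Optional


def resolve_trust_remote_code(trust_remote_code: Optional[bool], repo_id: str) -> bool:
    """Hypothetical helper: decide whether a dataset loading script may run.

    Mirrors the default-deny policy: anything other than an explicit True
    (including an omitted value, represented here as None) is rejected.
    """
    if trust_remote_code is True:
        return True
    raise ValueError(
        f"The repository {repo_id} contains a loading script that must be executed "
        "to load the dataset. Pass trust_remote_code=True only after reviewing it."
    )


# An explicit opt-in is accepted.
allowed = resolve_trust_remote_code(True, "username/my_dataset")

# Omitting the flag is rejected with an error rather than a warning.
try:
    resolve_trust_remote_code(None, "username/my_dataset")
    rejected = False
except ValueError:
    rejected = True
```

The point of the design is that the caller has to opt in explicitly; forgetting the argument fails closed instead of silently running remote code.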

4 changes: 2 additions & 2 deletions docs/source/load_hub.mdx
@@ -106,7 +106,7 @@ Certain datasets repositories contain a loading script with the Python code used
Those datasets are generally exported to Parquet by Hugging Face, so that 🤗 Datasets can load the dataset fast and without running a loading script.

Even if a Parquet export is not available, you can still use any dataset with Python code in its repository with `load_dataset`.
All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get a warning:
All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get an error:

```py
>>> from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
```
@@ -120,6 +120,6 @@ All files and code uploaded to the Hub are scanned for malware (refer to the Hub

<Tip warning=true>

In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
For security reasons, 🤗 Datasets do not allow running dataset loading scripts by default, and you have to pass `trust_remote_code=True` to load datasets that require running a dataset script.

</Tip>
8 changes: 7 additions & 1 deletion src/datasets/formatting/formatting.py
@@ -187,14 +187,20 @@ def _arrow_array_to_numpy(self, pa_array: pa.Array) -> np.ndarray:
else:
zero_copy_only = _is_zero_copy_only(pa_array.type) and not _is_array_with_nulls(pa_array)
array: List = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()

if len(array) > 0:
if any(
(isinstance(x, np.ndarray) and (x.dtype == object or x.shape != array[0].shape))
or (isinstance(x, float) and np.isnan(x))
for x in array
):
if np.lib.NumpyVersion(np.__version__) >= "2.0.0b1":
return np.asarray(array, dtype=object)
return np.array(array, copy=False, dtype=object)
return np.array(array, copy=False)
if np.lib.NumpyVersion(np.__version__) >= "2.0.0b1":
return np.asarray(array)
else:
return np.array(array, copy=False)


class PandasArrowExtractor(BaseArrowExtractor[pd.DataFrame, pd.Series, pd.DataFrame]):
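The version check in the hunk above reflects a NumPy 2.0 behaviour change: `np.array(..., copy=False)` now raises `ValueError` when a copy cannot be avoided, while `np.asarray` copies only when needed. A minimal sketch of the same guard, assuming NumPy is installed; `to_numpy_no_forced_copy` is a hypothetical helper name used for illustration and is not part of the library.

```python
import numpy as np


def to_numpy_no_forced_copy(values, dtype=None):
    """Hypothetical helper: build an array, copying only when necessary."""
    if np.lib.NumpyVersion(np.__version__) >= "2.0.0b1":
        # NumPy 2.x: copy=False means "never copy" and raises if a copy is
        # required, so np.asarray (copy only if needed) is the equivalent call.
        return np.asarray(values, dtype=dtype)
    # NumPy 1.x: copy=False already means "copy only if needed".
    return np.array(values, copy=False, dtype=dtype)


# A list of equal-shape arrays becomes one 2-D array.
stacked = to_numpy_no_forced_copy([np.array([1, 2]), np.array([3, 4])])

# Ragged arrays have to be stored in an object array, as in the hunk above.
ragged = to_numpy_no_forced_copy([np.array([1, 2]), np.array([3])], dtype=object)
```

Keeping the check in one helper means the rest of the code does not need to know which NumPy major version is installed.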
8 changes: 4 additions & 4 deletions src/datasets/hub.py
@@ -42,15 +42,15 @@ def convert_to_parquet(
`<org>/<dataset_name>`.
revision (`str`, *optional*): Branch of the source Hub dataset repository. Defaults to the `"main"` branch.
token (`bool` or `str`, *optional*): Authentication token for the Hugging Face Hub.
trust_remote_code (`bool`, defaults to `True`): Whether you trust the remote code of the Hub script-based
trust_remote_code (`bool`, defaults to `False`): Whether you trust the remote code of the Hub script-based
dataset to be executed locally on your machine. This option should only be set to `True` for repositories
where you have read the code and which you trust.
<Tip warning={true}>
<Changed version="2.20.0">
`trust_remote_code` will default to False in the next major release.
`trust_remote_code` defaults to `False` if not specified.
</Tip>
</Changed>
Returns:
`huggingface_hub.CommitInfo`
36 changes: 19 additions & 17 deletions src/datasets/load.py
@@ -157,8 +157,7 @@ def init_dynamic_modules(


def import_main_class(module_path) -> Optional[Type[DatasetBuilder]]:
"""Import a module at module_path and return its main class: a DatasetBuilder
"""
"""Import a module at module_path and return its main class: a DatasetBuilder"""
module = importlib.import_module(module_path)
# Find the main class in our imported module
module_main_cls = None
@@ -1485,18 +1484,19 @@ def dataset_module_factory(
Directory to read/write data. Defaults to `"~/.cache/huggingface/datasets"`.
<Added version="2.16.0"/>
trust_remote_code (`bool`, defaults to `True`):
trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.
<Tip warning={true}>
<Added version="2.16.0"/>
`trust_remote_code` will default to False in the next major release.
<Changed version="2.20.0">
</Tip>
`trust_remote_code` defaults to `False` if not specified.
</Changed>
<Added version="2.16.0"/>
**download_kwargs (additional keyword arguments): optional attributes for DownloadConfig() which will override
the attributes in download_config if supplied.
@@ -1742,18 +1742,19 @@ def load_dataset_builder(
**Experimental**. Key/value pairs to be passed on to the dataset file-system backend, if any.
<Added version="2.11.0"/>
trust_remote_code (`bool`, defaults to `True`):
trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.
<Tip warning={true}>
<Added version="2.16.0"/>
<Changed version="2.20.0">
`trust_remote_code` will default to False in the next major release.
`trust_remote_code` defaults to `False` if not specified.
</Tip>
</Changed>
<Added version="2.16.0"/>
**config_kwargs (additional keyword arguments):
Keyword arguments to be passed to the [`BuilderConfig`]
and used in the [`DatasetBuilder`].
@@ -2003,18 +2004,19 @@ def load_dataset(
**Experimental**. Key/value pairs to be passed on to the dataset file-system backend, if any.
<Added version="2.11.0"/>
trust_remote_code (`bool`, defaults to `True`):
trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.
<Tip warning={true}>
<Added version="2.16.0"/>
`trust_remote_code` will default to False in the next major release.
<Changed version="2.20.0">
</Tip>
`trust_remote_code` defaults to `False` if not specified.
</Changed>
<Added version="2.16.0"/>
**config_kwargs (additional keyword arguments):
Keyword arguments to be passed to the `BuilderConfig`
and used in the [`DatasetBuilder`].