Dataset infos in yaml (#4926)
* wip

* fix Features yaml

* splits to yaml

* add _to_yaml_list

* style

* example: conll2000

* example: crime_and_punish

* add pyyaml dependency

* remove unused imports

* remove validation tests

* style

* allow dataset_infos to be struct or list in YAML

* fix test

* style

* update "datasets-cli test" + remove "version"

* remove config definitions in conll2000 and crime_and_punish

* remove versions for conll2000 and crime_and_punish

* move conll2000 and cap dummy data

* fix test

* add tests

* comments and tests

* more test

* don't mention the dataset_infos.json file in docs

* nit in docs

* docs

* dataset_infos -> dataset_info

* again

* use id2label in class_label

* update conll2000

* fix utf-8 yaml dump

* --save_infos -> --save_info

* Apply suggestions from code review

Co-authored-by: Polina Kazakova <polina@huggingface.co>

* style

* fix reloading a single dataset_info

* push info to README.md in push_to_hub

* update test

Co-authored-by: Polina Kazakova <polina@huggingface.co>
lhoestq and polinaeterna committed Oct 3, 2022
1 parent 55924c5 commit 67e65c9
Showing 35 changed files with 1,035 additions and 1,113 deletions.
8 changes: 3 additions & 5 deletions .github/PULL_REQUEST_TEMPLATE/add_dataset.md
@@ -6,11 +6,9 @@

### Checkbox

- [ ] Create the dataset script `/datasets/my_dataset/my_dataset.py` using the template
- [ ] Create the dataset script `./my_dataset/my_dataset.py` using the template
- [ ] Fill the `_DESCRIPTION` and `_CITATION` variables
- [ ] Implement `_infos()`, `_split_generators()` and `_generate_examples()`
- [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()`
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is filled with the different configurations of the dataset and that the `BUILDER_CONFIG_CLASS` is specified if there is a custom config class.
- [ ] Generate the metadata file `dataset_infos.json` for all configurations
- [ ] Generate the dummy data `dummy_data.zip` files to have the dataset script tested and that they don't weigh too much (<50KB)
- [ ] Add the dataset card `README.md` using the template : fill the tags and the various paragraphs
- [ ] Both tests for the real data and the dummy data pass.
- [ ] Optional - test the dataset using `datasets-cli test ./dataset_name --save_info`
17 changes: 9 additions & 8 deletions ADD_NEW_DATASET.md
@@ -63,8 +63,6 @@ You are now ready to start the process of adding the dataset. We will create the

- a **dataset script** which contains the code to download and pre-process the dataset: e.g. `squad.py`,
- a **dataset card** with tags and information on the dataset in a `README.md`.
- a **metadata file** (automatically created) which contains checksums and information about the dataset to guarantee that the loading went fine: `dataset_infos.json`
- a **dummy-data file** (automatically created) which contains small examples from the original files to test and guarantee that the script is working well in the future: `dummy_data.zip`

2. Let's start by creating a new branch to hold your development changes with the name of your dataset:

@@ -166,15 +164,18 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
- if some of your dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature.


**Last step:** To check that your dataset works correctly and to create its `dataset_infos.json` file run the command:
#### Tests (optional)

To check that your dataset works correctly and to create its `dataset_info` metadata in the dataset card, run the command:


```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_info --all_configs
```

**Note:** If your dataset requires manually downloading the data and having the user provide the path to the dataset you can run the following command:
```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
datasets-cli test datasets/<your-dataset-folder> --save_info --all_configs --data_dir your/manual/dir
```
This makes the configs use the path from `--data_dir` when generating them.

@@ -229,13 +230,13 @@ Now that your dataset script runs and create a dataset with the format you expec
```
to enable the slow tests, instead of `RUN_SLOW=1`.

3. If all tests pass, your dataset works correctly. You can finally create the metadata JSON by running the command:
3. If all tests pass, your dataset works correctly. You can finally create the metadata by running the command:

```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --all_configs
```

This first command should create a `dataset_infos.json` file in your dataset folder.
This command should create a `README.md` file containing the metadata if this file doesn't exist already, or add the metadata to an existing `README.md` file in your dataset folder.
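For illustration, the YAML block added to the dataset card looks roughly like the sketch below — the feature name and all byte/example counts are placeholders, and the exact fields depend on your dataset:

```yaml
---
dataset_info:
  features:
  - name: text           # placeholder feature name
    dtype: string
  splits:
  - name: train
    num_bytes: 123456    # placeholder: size of the split in Arrow format, in bytes
    num_examples: 1000   # placeholder: number of examples in the split
  download_size: 65432   # placeholder: size of the raw downloaded files, in bytes
  dataset_size: 123456   # placeholder: total size of the prepared dataset, in bytes
---
```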


You have now finished the coding part, congratulation! 🎉 You are Awesome! 😎
92 changes: 91 additions & 1 deletion datasets/conll2000/README.md
@@ -3,6 +3,96 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: pos_tags
    sequence:
      class_label:
        names:
          0: ''''''
          1: '#'
          2: $
          3: (
          4: )
          5: ','
          6: .
          7: ':'
          8: '``'
          9: CC
          10: CD
          11: DT
          12: EX
          13: FW
          14: IN
          15: JJ
          16: JJR
          17: JJS
          18: MD
          19: NN
          20: NNP
          21: NNPS
          22: NNS
          23: PDT
          24: POS
          25: PRP
          26: PRP$
          27: RB
          28: RBR
          29: RBS
          30: RP
          31: SYM
          32: TO
          33: UH
          34: VB
          35: VBD
          36: VBG
          37: VBN
          38: VBP
          39: VBZ
          40: WDT
          41: WP
          42: WP$
          43: WRB
  - name: chunk_tags
    sequence:
      class_label:
        names:
          0: O
          1: B-ADJP
          2: I-ADJP
          3: B-ADVP
          4: I-ADVP
          5: B-CONJP
          6: I-CONJP
          7: B-INTJ
          8: I-INTJ
          9: B-LST
          10: I-LST
          11: B-NP
          12: I-NP
          13: B-PP
          14: I-PP
          15: B-PRT
          16: I-PRT
          17: B-SBAR
          18: I-SBAR
          19: B-UCP
          20: I-UCP
          21: B-VP
          22: I-VP
  splits:
  - name: test
    num_bytes: 1201151
    num_examples: 2013
  - name: train
    num_bytes: 5356965
    num_examples: 8937
  download_size: 3481560
  dataset_size: 6558116
---

# Dataset Card for "conll2000"
@@ -173,4 +263,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@vblagoje](https://github.com/vblagoje), [@jplu](https://github.com/jplu) for adding this dataset.
Thanks to [@vblagoje](https://github.com/vblagoje), [@jplu](https://github.com/jplu) for adding this dataset.
16 changes: 0 additions & 16 deletions datasets/conll2000/conll2000.py
@@ -53,25 +53,9 @@
_TEST_FILE = "test.txt"


class Conll2000Config(datasets.BuilderConfig):
    """BuilderConfig for Conll2000"""

    def __init__(self, **kwargs):
        """BuilderConfig forConll2000.
        Args:
            **kwargs: keyword arguments forwarded to super.
        """
        super(Conll2000Config, self).__init__(**kwargs)


class Conll2000(datasets.GeneratorBasedBuilder):
    """Conll2000 dataset."""

    BUILDER_CONFIGS = [
        Conll2000Config(name="conll2000", version=datasets.Version("1.0.0"), description="Conll2000 dataset"),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
1 change: 0 additions & 1 deletion datasets/conll2000/dataset_infos.json

This file was deleted.

12 changes: 11 additions & 1 deletion datasets/crime_and_punish/README.md
@@ -3,6 +3,16 @@ language:
- en
paperswithcode_id: null
pretty_name: CrimeAndPunish
dataset_info:
  dataset_size: 1270540
  download_size: 1201735
  features:
  - dtype: string
    name: line
  splits:
  - name: train
    num_bytes: 1270540
    num_examples: 21969
---

# Dataset Card for "crime_and_punish"
@@ -144,4 +154,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
46 changes: 7 additions & 39 deletions datasets/crime_and_punish/crime_and_punish.py
@@ -8,36 +8,7 @@
_DATA_URL = "https://raw.githubusercontent.com/patrickvonplaten/datasets/master/crime_and_punishment.txt"


class CrimeAndPunishConfig(datasets.BuilderConfig):
    """BuilderConfig for Crime and Punish."""

    def __init__(self, data_url, **kwargs):
        """BuilderConfig for BlogAuthorship
        Args:
            data_url: `string`, url to the dataset (word or raw level)
            **kwargs: keyword arguments forwarded to super.
        """
        super(CrimeAndPunishConfig, self).__init__(
            version=datasets.Version(
                "1.0.0",
            ),
            **kwargs,
        )
        self.data_url = data_url


class CrimeAndPunish(datasets.GeneratorBasedBuilder):

    VERSION = datasets.Version("0.1.0")
    BUILDER_CONFIGS = [
        CrimeAndPunishConfig(
            name="crime-and-punish",
            data_url=_DATA_URL,
            description="word level dataset. No processing is needed other than replacing newlines with <eos> tokens.",
        ),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
@@ -58,17 +29,14 @@ def _info(self):
    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""

        if self.config.name == "crime-and-punish":
            data = dl_manager.download_and_extract(self.config.data_url)
        data = dl_manager.download_and_extract(_DATA_URL)

            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={"data_file": data, "split": "train"},
                ),
            ]
        else:
            raise ValueError(f"{self.config.name} does not exist")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_file": data, "split": "train"},
            ),
        ]

    def _generate_examples(self, data_file, split):

1 change: 0 additions & 1 deletion datasets/crime_and_punish/dataset_infos.json

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/beam.mdx
@@ -31,7 +31,7 @@ echo "apache_beam" >> /tmp/beam_requirements.txt
```
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
85 changes: 84 additions & 1 deletion docs/source/dataset_card.mdx
@@ -24,4 +24,87 @@ Feel free to take a look at these dataset card examples to help you get started:
- [CNN / DailyMail](https://huggingface.co/datasets/cnn_dailymail)
- [Allociné](https://huggingface.co/datasets/allocine)

You can also check out the (similar) documentation about [dataset cards on the Hub side](https://huggingface.co/docs/hub/datasets-cards).
You can also check out the (similar) documentation about [dataset cards on the Hub side](https://huggingface.co/docs/hub/datasets-cards).

## More YAML tags

You can use the `dataset_info` YAML fields to define additional metadata for the dataset. Here is an example for SQuAD:

```YAML
pretty_name: SQuAD
language:
- en
...
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400
```

This metadata used to be included in the `dataset_infos.json` file, which is now deprecated.

### Feature types

Using the `features` field you can explicitly define the feature types of your dataset.
This is especially useful when the types cannot easily be inferred from the data.
For example, if a column has only one non-empty example in a 1TB dataset, type inference cannot determine that column's type without going through the full dataset.
In this case, specifying the `features` field resolves the types explicitly instead of relying on inference.
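As a minimal sketch — the column names are hypothetical — an explicit declaration for a dataset whose `notes` column is empty in almost every example could look like this:

```YAML
dataset_info:
  features:
  - name: id
    dtype: string
  - name: notes    # hypothetical, mostly empty column whose type would be hard to infer
    dtype: string
```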

### Split sizes

Specifying the split sizes with `num_examples` enables TQDM progress bars (otherwise the loader doesn't know how many examples are left).
It also enables integrity verification: if a split doesn't contain the expected number of examples, an error is raised.

Additionally, you can add `num_bytes` to specify how big each split is.
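For example, reusing the CoNLL-2000 numbers shown earlier in this commit, a `splits` declaration with both fields looks like this:

```YAML
dataset_info:
  splits:
  - name: train
    num_bytes: 5356965    # size of the split in Arrow format, in bytes
    num_examples: 8937    # expected number of examples, used for progress bars and verification
  - name: test
    num_bytes: 1201151
    num_examples: 2013
```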

### Dataset size

When [`load_dataset`] is called, it first downloads the raw data files of the dataset, and then it prepares the dataset in Arrow format.

You can specify how many bytes are required to download the raw data files with `download_size`, and use `dataset_size` for the size of the dataset in Arrow format.
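Continuing the same CoNLL-2000 example, the two sizes are declared side by side:

```YAML
dataset_info:
  download_size: 3481560   # bytes required to download the raw data files
  dataset_size: 6558116    # size of the prepared dataset in Arrow format, in bytes
```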

### Multiple configurations

Certain datasets like `glue` have several configurations (`cola`, `sst2`, etc.) that can be loaded using `load_dataset("glue", "cola")` for example.

Each configuration can have different features, splits and sizes.
You can specify those fields per configuration using a YAML list:

```YAML
dataset_info:
- config_name: cola
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
- config_name: sst2
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
```

1 comment on commit 67e65c9

@github-actions commented with automated benchmark results for this commit, comparing new vs. old timings for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json under PyArrow==6.0.0 and PyArrow==latest.