Dataset infos in yaml (#4926)
* wip

* fix Features yaml

* splits to yaml

* add _to_yaml_list

* style

* example: conll2000

* example: crime_and_punish

* add pyyaml dependency

* remove unused imports

* remove validation tests

* style

* allow dataset_infos to be struct or list in YAML

* fix test

* style

* update "datasets-cli test" + remove "version"

* remove config definitions in conll2000 and crime_and_punish

* remove versions for conll2000 and crime_and_punish

* move conll2000 and cap dummy data

* fix test

* add tests

* comments and tests

* more test

* don't mention the dataset_infos.json file in docs

* nit in docs

* docs

* dataset_infos -> dataset_info

* again

* use id2label in class_label

* update conll2000

* fix utf-8 yaml dump

* --save_infos -> --save_info

* Apply suggestions from code review

Co-authored-by: Polina Kazakova <polina@huggingface.co>

* style

* fix reloading a single dataset_info

* push info to README.md in push_to_hub

* update test

Co-authored-by: Polina Kazakova <polina@huggingface.co>
lhoestq and polinaeterna committed Oct 3, 2022
1 parent 55924c5 commit 67e65c9
Showing 35 changed files with 1,035 additions and 1,113 deletions.
8 changes: 3 additions & 5 deletions .github/PULL_REQUEST_TEMPLATE/add_dataset.md
@@ -6,11 +6,9 @@

### Checkbox

- [ ] Create the dataset script `/datasets/my_dataset/my_dataset.py` using the template
- [ ] Create the dataset script `./my_dataset/my_dataset.py` using the template
- [ ] Fill the `_DESCRIPTION` and `_CITATION` variables
- [ ] Implement `_infos()`, `_split_generators()` and `_generate_examples()`
- [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()`
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is filled with the different configurations of the dataset and that the `BUILDER_CONFIG_CLASS` is specified if there is a custom config class.
- [ ] Generate the metadata file `dataset_infos.json` for all configurations
- [ ] Generate the dummy data `dummy_data.zip` files to have the dataset script tested and that they don't weigh too much (<50KB)
- [ ] Add the dataset card `README.md` using the template : fill the tags and the various paragraphs
- [ ] Both tests for the real data and the dummy data pass.
- [ ] Optional - test the dataset using `datasets-cli test ./dataset_name --save_info`
17 changes: 9 additions & 8 deletions ADD_NEW_DATASET.md
@@ -63,8 +63,6 @@ You are now ready to start the process of adding the dataset. We will create the

- a **dataset script** which contains the code to download and pre-process the dataset: e.g. `squad.py`,
- a **dataset card** with tags and information on the dataset in a `README.md`.
- a **metadata file** (automatically created) which contains checksums and information about the dataset to guarantee that the loading went fine: `dataset_infos.json`
- a **dummy-data file** (automatically created) which contains small examples from the original files to test and guarantee that the script is working well in the future: `dummy_data.zip`

2. Let's start by creating a new branch to hold your development changes with the name of your dataset:

@@ -166,15 +164,18 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
- if some of your dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature.


**Last step:** To check that your dataset works correctly and to create its `dataset_infos.json` file run the command:
#### Tests (optional)

To check that your dataset works correctly and to create its `dataset_info` metadata in the dataset card, run the command:


```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_info --all_configs
```

**Note:** If your dataset requires manually downloading the data and having the user provide the path to the dataset you can run the following command:
```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
datasets-cli test datasets/<your-dataset-folder> --save_info --all_configs --data_dir your/manual/dir
```
This makes the configs use the path from `--data_dir` when generating them.

@@ -229,13 +230,13 @@ Now that your dataset script runs and create a dataset with the format you expec
```
to enable the slow tests, instead of `RUN_SLOW=1`.

3. If all tests pass, your dataset works correctly. You can finally create the metadata JSON by running the command:
3. If all tests pass, your dataset works correctly. You can finally create the metadata by running the command:

```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --all_configs
```

This first command should create a `dataset_infos.json` file in your dataset folder.
This command should create a `README.md` file containing the metadata if this file doesn't exist already, or add the metadata to an existing `README.md` file in your dataset folder.
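For illustration, the YAML block added to the dataset card looks roughly like the sketch below — the feature name and all byte/example counts are placeholders, and the exact fields depend on your dataset:

```yaml
---
dataset_info:
  features:
  - name: text           # placeholder feature name
    dtype: string
  splits:
  - name: train
    num_bytes: 123456    # placeholder: size of the split in Arrow format, in bytes
    num_examples: 1000   # placeholder: number of examples in the split
  download_size: 65432   # placeholder: size of the raw downloaded files, in bytes
  dataset_size: 123456   # placeholder: total size of the prepared dataset, in bytes
---
```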


You have now finished the coding part, congratulation! 🎉 You are Awesome! 😎
92 changes: 91 additions & 1 deletion datasets/conll2000/README.md
@@ -3,6 +3,96 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: pos_tags
    sequence:
      class_label:
        names:
          0: ''''''
          1: '#'
          2: $
          3: (
          4: )
          5: ','
          6: .
          7: ':'
          8: '``'
          9: CC
          10: CD
          11: DT
          12: EX
          13: FW
          14: IN
          15: JJ
          16: JJR
          17: JJS
          18: MD
          19: NN
          20: NNP
          21: NNPS
          22: NNS
          23: PDT
          24: POS
          25: PRP
          26: PRP$
          27: RB
          28: RBR
          29: RBS
          30: RP
          31: SYM
          32: TO
          33: UH
          34: VB
          35: VBD
          36: VBG
          37: VBN
          38: VBP
          39: VBZ
          40: WDT
          41: WP
          42: WP$
          43: WRB
  - name: chunk_tags
    sequence:
      class_label:
        names:
          0: O
          1: B-ADJP
          2: I-ADJP
          3: B-ADVP
          4: I-ADVP
          5: B-CONJP
          6: I-CONJP
          7: B-INTJ
          8: I-INTJ
          9: B-LST
          10: I-LST
          11: B-NP
          12: I-NP
          13: B-PP
          14: I-PP
          15: B-PRT
          16: I-PRT
          17: B-SBAR
          18: I-SBAR
          19: B-UCP
          20: I-UCP
          21: B-VP
          22: I-VP
  splits:
  - name: test
    num_bytes: 1201151
    num_examples: 2013
  - name: train
    num_bytes: 5356965
    num_examples: 8937
  download_size: 3481560
  dataset_size: 6558116
---

# Dataset Card for "conll2000"
@@ -173,4 +263,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@vblagoje](https://github.com/vblagoje), [@jplu](https://github.com/jplu) for adding this dataset.
Thanks to [@vblagoje](https://github.com/vblagoje), [@jplu](https://github.com/jplu) for adding this dataset.
16 changes: 0 additions & 16 deletions datasets/conll2000/conll2000.py
@@ -53,25 +53,9 @@
_TEST_FILE = "test.txt"


class Conll2000Config(datasets.BuilderConfig):
    """BuilderConfig for Conll2000"""

    def __init__(self, **kwargs):
        """BuilderConfig forConll2000.
        Args:
            **kwargs: keyword arguments forwarded to super.
        """
        super(Conll2000Config, self).__init__(**kwargs)


class Conll2000(datasets.GeneratorBasedBuilder):
    """Conll2000 dataset."""

    BUILDER_CONFIGS = [
        Conll2000Config(name="conll2000", version=datasets.Version("1.0.0"), description="Conll2000 dataset"),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
1 change: 0 additions & 1 deletion datasets/conll2000/dataset_infos.json

This file was deleted.

12 changes: 11 additions & 1 deletion datasets/crime_and_punish/README.md
@@ -3,6 +3,16 @@ language:
- en
paperswithcode_id: null
pretty_name: CrimeAndPunish
dataset_info:
  dataset_size: 1270540
  download_size: 1201735
  features:
  - dtype: string
    name: line
  splits:
  - name: train
    num_bytes: 1270540
    num_examples: 21969
---

# Dataset Card for "crime_and_punish"
@@ -144,4 +154,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
46 changes: 7 additions & 39 deletions datasets/crime_and_punish/crime_and_punish.py
@@ -8,36 +8,7 @@
_DATA_URL = "https://raw.githubusercontent.com/patrickvonplaten/datasets/master/crime_and_punishment.txt"


class CrimeAndPunishConfig(datasets.BuilderConfig):
    """BuilderConfig for Crime and Punish."""

    def __init__(self, data_url, **kwargs):
        """BuilderConfig for BlogAuthorship
        Args:
            data_url: `string`, url to the dataset (word or raw level)
            **kwargs: keyword arguments forwarded to super.
        """
        super(CrimeAndPunishConfig, self).__init__(
            version=datasets.Version(
                "1.0.0",
            ),
            **kwargs,
        )
        self.data_url = data_url


class CrimeAndPunish(datasets.GeneratorBasedBuilder):

    VERSION = datasets.Version("0.1.0")
    BUILDER_CONFIGS = [
        CrimeAndPunishConfig(
            name="crime-and-punish",
            data_url=_DATA_URL,
            description="word level dataset. No processing is needed other than replacing newlines with <eos> tokens.",
        ),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
@@ -58,17 +29,14 @@ def _info(self):
    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""

        if self.config.name == "crime-and-punish":
            data = dl_manager.download_and_extract(self.config.data_url)
        data = dl_manager.download_and_extract(_DATA_URL)

            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={"data_file": data, "split": "train"},
                ),
            ]
        else:
            raise ValueError(f"{self.config.name} does not exist")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_file": data, "split": "train"},
            ),
        ]

    def _generate_examples(self, data_file, split):

1 change: 0 additions & 1 deletion datasets/crime_and_punish/dataset_infos.json

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/beam.mdx
@@ -31,7 +31,7 @@ echo "apache_beam" >> /tmp/beam_requirements.txt
```
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
85 changes: 84 additions & 1 deletion docs/source/dataset_card.mdx
@@ -24,4 +24,87 @@ Feel free to take a look at these dataset card examples to help you get started:
- [CNN / DailyMail](https://huggingface.co/datasets/cnn_dailymail)
- [Allociné](https://huggingface.co/datasets/allocine)

You can also check out the (similar) documentation about [dataset cards on the Hub side](https://huggingface.co/docs/hub/datasets-cards).
You can also check out the (similar) documentation about [dataset cards on the Hub side](https://huggingface.co/docs/hub/datasets-cards).

## More YAML tags

You can use the `dataset_info` YAML fields to define additional metadata for the dataset. Here is an example for SQuAD:

```YAML
pretty_name: SQuAD
language:
- en
...
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400
```

This metadata used to be included in the `dataset_infos.json` file, which is now deprecated.

### Feature types

Using the `features` field you can explicitly define the feature types of your dataset.
This is especially useful when the types cannot easily be inferred from the data.
For example, if a column has only one non-empty example in a 1TB dataset, type inference cannot determine that column's type without going through the full dataset.
In this case, specifying the `features` field resolves the types explicitly instead of relying on inference.
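As a minimal sketch — the column names are hypothetical — an explicit declaration for a dataset whose `notes` column is empty in almost every example could look like this:

```YAML
dataset_info:
  features:
  - name: id
    dtype: string
  - name: notes    # hypothetical, mostly empty column whose type would be hard to infer
    dtype: string
```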

### Split sizes

Specifying the split sizes with `num_examples` enables TQDM progress bars (otherwise the loader doesn't know how many examples are left).
It also enables integrity verification: if a split doesn't contain the expected number of examples, an error is raised.

Additionally, you can add `num_bytes` to specify how big each split is.
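For example, reusing the CoNLL-2000 numbers shown earlier in this commit, a `splits` declaration with both fields looks like this:

```YAML
dataset_info:
  splits:
  - name: train
    num_bytes: 5356965    # size of the split in Arrow format, in bytes
    num_examples: 8937    # expected number of examples, used for progress bars and verification
  - name: test
    num_bytes: 1201151
    num_examples: 2013
```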

### Dataset size

When [`load_dataset`] is called, it first downloads the raw data files of the dataset, and then it prepares the dataset in Arrow format.

You can specify how many bytes are required to download the raw data files with `download_size`, and use `dataset_size` for the size of the dataset in Arrow format.
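Continuing the same CoNLL-2000 example, the two sizes are declared side by side:

```YAML
dataset_info:
  download_size: 3481560   # bytes required to download the raw data files
  dataset_size: 6558116    # size of the prepared dataset in Arrow format, in bytes
```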

### Multiple configurations

Certain datasets like `glue` have several configurations (`cola`, `sst2`, etc.) that can be loaded using `load_dataset("glue", "cola")` for example.

Each configuration can have different features, splits and sizes.
You can specify those fields per configuration using a YAML list:

```YAML
dataset_info:
- config_name: cola
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
- config_name: sst2
  features:
    ...
  splits:
    ...
  download_size: ...
  dataset_size: ...
```

1 comment on commit 67e65c9

@github-actions commented with automated benchmark results for this commit, comparing new vs. old timings for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json under PyArrow==6.0.0 and PyArrow==latest.