Commit

Add The Pile dataset and PubMed Central subset (#3287)
* Add The Pile dataset and PubMed Central subset

* Fix style

* Fix style

* Add README

* Make the all config streamable

* Add dummy data

* Add more info to README

* Fix dummy data
albertvillanova committed Dec 1, 2021
1 parent 5efcab5 commit d5724c7
Showing 3 changed files with 337 additions and 0 deletions.
174 changes: 174 additions & 0 deletions datasets/the_pile/README.md
@@ -0,0 +1,174 @@
---
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- en
licenses:
- other
multilinguality:
- monolingual
pretty_name: The Pile
size_categories:
- unknown
source_datasets:
- original
task_categories:
- sequence-modeling
task_ids:
- language-modeling
---

# Dataset Card for The Pile

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://pile.eleuther.ai/
- **Repository:** https://github.com/EleutherAI/the-pile
- **Paper:** [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
- **Leaderboard:**
- **Point of Contact:** [EleutherAI](mailto:contact@eleuther.ai)

### Dataset Summary

The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality
datasets combined together.
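
As a quick orientation, a subset can be loaded with the `datasets` library roughly as follows (a minimal sketch, assuming the loading script is available under the `the_pile` name; streaming is used so the full data does not have to be downloaded up front):

```
from datasets import load_dataset

# Stream the PubMed Central subset instead of downloading the whole archive first.
pubmed = load_dataset("the_pile", "pubmed_central", split="train", streaming=True)

# Each example is a dict with "id" and "text" keys (see Data Fields below).
print(next(iter(pubmed))["id"])
```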


### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

#### all
```
{
'meta': {'pile_set_name': 'Pile-CC'},
'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
```

#### pubmed_central
```
{
'id': 'PMC5595690',
'text': 'Introduction {#acel12642-sec-0001}\n============\n\nAlzheimer\\\'s disease (AD), the most common cause of...'
}
```

### Data Fields

#### all

- `meta` (dict): Metadata of the data instance, with keys:
  - `pile_set_name` (str): Name of the Pile subset the document comes from.
- `text` (str): Text of the document.

#### pubmed_central

- `id` (str): Identifier of the PubMed Central article (e.g. `PMC5595690`).
- `text` (str): Text of the article.

### Data Splits

The "all" configuration is composed of 3 splits: train, validation and test.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Please refer to the specific license depending on the subset you use:
- PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)

### Citation Information

```
@misc{gao2020pile,
title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy},
year={2020},
eprint={2101.00027},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.
Binary file added datasets/the_pile/dummy/all/0.0.0/dummy_data.zip
Binary file not shown.
163 changes: 163 additions & 0 deletions datasets/the_pile/the_pile.py
@@ -0,0 +1,163 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The Pile dataset."""

import json

import datasets


_CITATION = """\
@misc{gao2020pile,
title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy},
year={2020},
eprint={2101.00027},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""

_DESCRIPTION = """\
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality
datasets combined together.
"""

_HOMEPAGE = "https://pile.eleuther.ai/"

_LICENSES = {
"all": "MIT License",
"pubmed_central": "MIT License",
}

_DATA_URLS = {
"all": {
"train": [f"https://the-eye.eu/public/AI/pile/train/{i:0>2}.jsonl.zst" for i in range(30)],
"validation": ["https://the-eye.eu/public/AI/pile/val.jsonl.zst"],
"test": ["https://the-eye.eu/public/AI/pile/test.jsonl.zst"],
},
"pubmed_central": "https://the-eye.eu/public/AI/pile_preliminary_components/PMC_extracts.tar.gz",
}

_FEATURES = {
"all": datasets.Features(
{
"text": datasets.Value("string"),
"meta": {"pile_set_name": datasets.Value("string")},
}
),
"pubmed_central": datasets.Features(
{
"id": datasets.Value("string"),
"text": datasets.Value("string"),
}
),
}


class ThePileConfig(datasets.BuilderConfig):
"""BuilderConfig for The Pile."""

def __init__(self, *args, subsets, **kwargs):
"""BuilderConfig for The Pile.
Args:
subsets (:obj:`List[str]`): List of subsets to load.
**kwargs: keyword arguments forwarded to super.
"""
super().__init__(
*args,
name="+".join(subsets),
**kwargs,
)
self.subsets = subsets


class ThePile(datasets.GeneratorBasedBuilder):
"""The Pile dataset."""

VERSION = datasets.Version("1.1.0")

BUILDER_CONFIG_CLASS = ThePileConfig
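    # One predefined config per top-level key in _DATA_URLS ("all" and "pubmed_central");
    # custom combinations of subsets can be requested through the `subsets` argument.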
BUILDER_CONFIGS = [ThePileConfig(subsets=[subset]) for subset in _DATA_URLS]
DEFAULT_CONFIG_NAME = "all"

def _info(self):
"""Give information and typings for the dataset."""
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
description=_DESCRIPTION,
# This defines the different columns of the dataset and their types
features=_FEATURES[self.config.name],
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=None,
# Homepage of the dataset for documentation
homepage=_HOMEPAGE,
# License for the dataset if available
license=_LICENSES[self.config.name],
# Citation for the dataset
citation=_CITATION,
)

def _split_generators(self, dl_manager):
"""Return SplitGenerators."""
if self.config.name == "all":
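            # The "all" config points to 30 zstandard-compressed JSON Lines shards for train plus
            # one file each for validation and test; download() returns the same nested structure
            # with local paths.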
data_dir = dl_manager.download(_DATA_URLS[self.config.name])
return [
datasets.SplitGenerator(
name=split,
gen_kwargs={
"files": data_dir[split],
},
)
for split in [datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST]
]
else:
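            # Single-subset configs (e.g. "pubmed_central") download the subset archive(s) and
            # expose a single train split, iterating over archive members lazily.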
data_urls = {subset: _DATA_URLS[subset] for subset in self.config.subsets}
archive = dl_manager.download(data_urls)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={
"files": {subset: dl_manager.iter_archive(archive[subset]) for subset in self.config.subsets},
},
),
]

def _generate_examples(self, files):
"""Yield examples as (key, example) tuples."""
key = 0
if isinstance(files, list):
import zstandard as zstd

for path in files:
with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
for row in f:
data = json.loads(row)
yield key, data
key += 1
else:
for subset in files:
if subset == "pubmed_central":
for path, file in files[subset]:
id_ = path.split("/")[-1].split(".")[0]
text = file.read().decode("utf-8")
yield key, {
"id": id_,
"text": text,
}
key += 1

1 comment on commit d5724c7

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.087712 / 0.011353 (0.076359) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004469 / 0.011008 (-0.006540) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.036522 / 0.038508 (-0.001986) |
| read_batch_unformated after write_array2d | 0.042709 / 0.023109 (0.019600) |
| read_batch_unformated after write_flattened_sequence | 0.365515 / 0.275898 (0.089617) |
| read_batch_unformated after write_nested_sequence | 0.429897 / 0.323480 (0.106417) |
| read_col_formatted_as_numpy after write_array2d | 0.104966 / 0.007986 (0.096980) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004050 / 0.004328 (-0.000278) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010510 / 0.004250 (0.006260) |
| read_col_unformated after write_array2d | 0.049700 / 0.037052 (0.012648) |
| read_col_unformated after write_flattened_sequence | 0.364118 / 0.258489 (0.105629) |
| read_col_unformated after write_nested_sequence | 0.435275 / 0.293841 (0.141434) |
| read_formatted_as_numpy after write_array2d | 0.110380 / 0.128546 (-0.018167) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010243 / 0.075646 (-0.065403) |
| read_formatted_as_numpy after write_nested_sequence | 0.301036 / 0.419271 (-0.118235) |
| read_unformated after write_array2d | 0.057475 / 0.043533 (0.013942) |
| read_unformated after write_flattened_sequence | 0.347468 / 0.255139 (0.092330) |
| read_unformated after write_nested_sequence | 0.419677 / 0.283200 (0.136478) |
| write_array2d | 0.107992 / 0.141683 (-0.033690) |
| write_flattened_sequence | 2.075749 / 1.452155 (0.623594) |
| write_nested_sequence | 2.186159 / 1.492716 (0.693443) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.271380 / 0.018006 (0.253374) |
| get_batch_of_1024_rows | 0.488657 / 0.000490 (0.488167) |
| get_first_row | 0.005029 / 0.000200 (0.004829) |
| get_last_row | 0.000092 / 0.000054 (0.000037) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.042722 / 0.037411 (0.005311) |
| shard | 0.026245 / 0.014526 (0.011719) |
| shuffle | 0.032485 / 0.176557 (-0.144072) |
| sort | 0.231805 / 0.737135 (-0.505330) |
| train_test_split | 0.033589 / 0.296338 (-0.262750) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.492239 / 0.215209 (0.277030) |
| read 50000 | 4.898708 / 2.077655 (2.821053) |
| read_batch 50000 10 | 2.135369 / 1.504120 (0.631249) |
| read_batch 50000 100 | 1.890482 / 1.541195 (0.349287) |
| read_batch 50000 1000 | 1.986909 / 1.468490 (0.518419) |
| read_formatted numpy 5000 | 0.491744 / 4.584777 (-4.093033) |
| read_formatted pandas 5000 | 6.056241 / 3.745712 (2.310529) |
| read_formatted tensorflow 5000 | 2.835395 / 5.269862 (-2.434467) |
| read_formatted torch 5000 | 1.025768 / 4.565676 (-3.539908) |
| read_formatted_batch numpy 5000 10 | 0.072631 / 0.424275 (-0.351645) |
| read_formatted_batch numpy 5000 1000 | 0.016908 / 0.007607 (0.009301) |
| shuffled read 5000 | 0.697003 / 0.226044 (0.470959) |
| shuffled read 50000 | 6.333007 / 2.268929 (4.064079) |
| shuffled read_batch 50000 10 | 2.659898 / 55.444624 (-52.784727) |
| shuffled read_batch 50000 100 | 2.245713 / 6.876477 (-4.630764) |
| shuffled read_batch 50000 1000 | 2.367535 / 2.142072 (0.225463) |
| shuffled read_formatted numpy 5000 | 0.631460 / 4.805227 (-4.173767) |
| shuffled read_formatted_batch numpy 5000 10 | 0.137121 / 6.500664 (-6.363543) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.068091 / 0.075469 (-0.007378) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 2.077931 / 1.841788 (0.236143) |
| map fast-tokenizer batched | 15.147737 / 8.074308 (7.073429) |
| map identity | 32.159411 / 10.191392 (21.968019) |
| map identity batched | 0.845688 / 0.680424 (0.165264) |
| map no-op batched | 0.628507 / 0.534201 (0.094306) |
| map no-op batched numpy | 0.441553 / 0.579283 (-0.137730) |
| map no-op batched pandas | 0.647656 / 0.434364 (0.213292) |
| map no-op batched pytorch | 0.345164 / 0.540337 (-0.195174) |
| map no-op batched tensorflow | 0.390981 / 1.386936 (-0.995955) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.086333 / 0.011353 (0.074980) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004291 / 0.011008 (-0.006717) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.035801 / 0.038508 (-0.002707) |
| read_batch_unformated after write_array2d | 0.041208 / 0.023109 (0.018099) |
| read_batch_unformated after write_flattened_sequence | 0.343649 / 0.275898 (0.067751) |
| read_batch_unformated after write_nested_sequence | 0.409753 / 0.323480 (0.086273) |
| read_col_formatted_as_numpy after write_array2d | 0.104721 / 0.007986 (0.096735) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005388 / 0.004328 (0.001060) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008537 / 0.004250 (0.004287) |
| read_col_unformated after write_array2d | 0.042730 / 0.037052 (0.005678) |
| read_col_unformated after write_flattened_sequence | 0.340974 / 0.258489 (0.082485) |
| read_col_unformated after write_nested_sequence | 0.403377 / 0.293841 (0.109536) |
| read_formatted_as_numpy after write_array2d | 0.107811 / 0.128546 (-0.020735) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010283 / 0.075646 (-0.065364) |
| read_formatted_as_numpy after write_nested_sequence | 0.357163 / 0.419271 (-0.062109) |
| read_unformated after write_array2d | 0.053759 / 0.043533 (0.010226) |
| read_unformated after write_flattened_sequence | 0.342743 / 0.255139 (0.087604) |
| read_unformated after write_nested_sequence | 0.382071 / 0.283200 (0.098871) |
| write_array2d | 0.093303 / 0.141683 (-0.048380) |
| write_flattened_sequence | 2.014948 / 1.452155 (0.562793) |
| write_nested_sequence | 2.205804 / 1.492716 (0.713087) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.304654 / 0.018006 (0.286648) |
| get_batch_of_1024_rows | 0.489223 / 0.000490 (0.488733) |
| get_first_row | 0.029596 / 0.000200 (0.029396) |
| get_last_row | 0.000337 / 0.000054 (0.000283) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.039155 / 0.037411 (0.001743) |
| shard | 0.024229 / 0.014526 (0.009703) |
| shuffle | 0.029379 / 0.176557 (-0.147178) |
| sort | 0.228317 / 0.737135 (-0.508819) |
| train_test_split | 0.030866 / 0.296338 (-0.265472) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.493863 / 0.215209 (0.278654) |
| read 50000 | 4.946715 / 2.077655 (2.869061) |
| read_batch 50000 10 | 2.113937 / 1.504120 (0.609817) |
| read_batch 50000 100 | 1.875551 / 1.541195 (0.334356) |
| read_batch 50000 1000 | 1.969112 / 1.468490 (0.500622) |
| read_formatted numpy 5000 | 0.496494 / 4.584777 (-4.088283) |
| read_formatted pandas 5000 | 6.373163 / 3.745712 (2.627450) |
| read_formatted tensorflow 5000 | 2.380648 / 5.269862 (-2.889214) |
| read_formatted torch 5000 | 1.019401 / 4.565676 (-3.546276) |
| read_formatted_batch numpy 5000 10 | 0.060066 / 0.424275 (-0.364209) |
| read_formatted_batch numpy 5000 1000 | 0.013349 / 0.007607 (0.005742) |
| shuffled read 5000 | 0.624472 / 0.226044 (0.398427) |
| shuffled read 50000 | 6.271915 / 2.268929 (4.002987) |
| shuffled read_batch 50000 10 | 2.707502 / 55.444624 (-52.737123) |
| shuffled read_batch 50000 100 | 2.275242 / 6.876477 (-4.601235) |
| shuffled read_batch 50000 1000 | 2.351023 / 2.142072 (0.208950) |
| shuffled read_formatted numpy 5000 | 0.623032 / 4.805227 (-4.182196) |
| shuffled read_formatted_batch numpy 5000 10 | 0.136531 / 6.500664 (-6.364133) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066766 / 0.075469 (-0.008703) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.952464 / 1.841788 (0.110677) |
| map fast-tokenizer batched | 14.598882 / 8.074308 (6.524574) |
| map identity | 31.473210 / 10.191392 (21.281818) |
| map identity batched | 0.920353 / 0.680424 (0.239929) |
| map no-op batched | 0.612935 / 0.534201 (0.078734) |
| map no-op batched numpy | 0.444930 / 0.579283 (-0.134353) |
| map no-op batched pandas | 0.623742 / 0.434364 (0.189378) |
| map no-op batched pytorch | 0.368848 / 0.540337 (-0.171490) |
| map no-op batched tensorflow | 0.355596 / 1.386936 (-1.031340) |

