Update doc links to point to new docs (#3116)
* Update README links

* Update docs

* Update add dataset template

* Add Features docstring
mariosasko committed Oct 22, 2021
1 parent 1a9380a commit ac0d1d1
Showing 8 changed files with 44 additions and 17 deletions.
8 changes: 4 additions & 4 deletions ADD_NEW_DATASET.md
@@ -86,7 +86,7 @@ Now let's get coding :-)

The dataset script is the main entry point to load and process the data. It is a python script under `datasets/<your_dataset_name>/<your_dataset_name>.py`.

There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/add_dataset.html).
There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/master/about_dataset_load.html).

Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`).

@@ -96,7 +96,7 @@ To add a new dataset, you can start from the empty template which is [in the `te
cp ./templates/new_dataset_script.py ./datasets/<your_dataset_name>/<your_dataset_name>.py
```

And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/add_dataset.html).
And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/master/dataset_script.html).
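
If it helps to see the overall shape before working through the template, here is a minimal sketch of the three methods you will fill in (the class name, data URL and feature schema below are placeholders, not part of the template):

```python
import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """Placeholder sketch of a dataset script; replace names, URL and features with your own."""

    def _info(self):
        # Declare the typed schema of the examples this script produces
        return datasets.DatasetInfo(
            description="TODO: add a short description",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # Download the raw data and declare the available splits
        data_path = dl_manager.download_and_extract("https://example.com/data.txt")  # placeholder URL
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": data_path}),
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs that match the features declared in _info
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```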

You can also start (or copy any part) from one of the reference datasets listed below. The main criterion for choosing among these reference datasets is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether or not you need several configurations (see the explanations on configurations above). Feel free to reuse any parts of the following examples and adapt them to your case:

@@ -137,7 +137,7 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
**Some rules to follow when adding the dataset**:

- try to give access to all the data, columns, features and information in the dataset. If the dataset contains various sub-parts with differing formats, create several configurations to give access to all of them.
- datasets in the `datasets` library are typed. Take some time to carefully think about the `features` (see an introduction [here](https://huggingface.co/docs/datasets/exploring.html#features-and-columns) and the full list of possible features [here](https://huggingface.co/docs/datasets/features.html))
- datasets in the `datasets` library are typed. Take some time to carefully think about the `features` (see an introduction [here](https://huggingface.co/docs/datasets/about_dataset_features.html) and the full list of possible features [here](https://huggingface.co/docs/datasets/package_reference/main_classes.html#features))
- if some of your dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature (see the sketch below).
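
For instance, a typed schema with a `ClassLabel` could look like the following minimal sketch (the column names and label set are purely illustrative):

```python
from datasets import ClassLabel, Features, Sequence, Value

# Illustrative schema: a text column, a categorical label and a list of tokens
features = Features(
    {
        "text": Value("string"),
        "label": ClassLabel(names=["negative", "positive"]),
        "tokens": Sequence(Value("string")),
    }
)
```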


@@ -179,7 +179,7 @@ Now that your dataset script runs and creates a dataset with the format you expec
datasets-cli dummy_data datasets/<your-dataset-folder>
```

If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/share_dataset.html#adding-dummy-data).
If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/dataset_script.html#dummy-data).

If you've been fighting with dummy data creation without success for some time and can't seem to make it work: go to the next step (open a Pull Request) and we'll help you cross the finish line 🙂.

16 changes: 8 additions & 8 deletions README.md
@@ -46,7 +46,7 @@
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.

馃 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 馃 Datasets and `tfds` can be found in the section [Main differences between 馃 Datasets and `tfds`](#main-differences-between-馃-datasets-and-tfds).
馃 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 馃 Datasets and `tfds` can be found in the section [Main differences between 馃 Datasets and `tfds`](#main-differences-between--datasets-and-tfds).

# Installation

@@ -74,7 +74,7 @@ For more details on installation, check the installation page in the documentati

If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html

# Usage

@@ -113,12 +113,12 @@ tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```

For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html and the specific pages on:
For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart.html and the specific pages on:

- Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
- What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
- Loading a dataset https://huggingface.co/docs/datasets/loading.html
- What's in a Dataset: https://huggingface.co/docs/datasets/access.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script.html
- etc.
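
As a quick illustration of the loading and processing steps those pages cover (a sketch only; the dataset and column names are just examples):

```python
from datasets import load_dataset

# Download and cache a dataset from the Hub, then inspect one example and its typed schema
squad = load_dataset("squad", split="train")
print(squad[0]["question"])
print(squad.features)
```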

Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
@@ -130,7 +130,7 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number

You will find [the step-by-step guide here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md) to add a dataset to this repository.

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share.html).

# Main differences between 🤗 Datasets and `tfds`

2 changes: 1 addition & 1 deletion docs/source/installation.md
@@ -3,7 +3,7 @@
Before you start, you will need to set up your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.6+**.

```{seealso}
If you want to use 🤗 Datasets with TensorFlow or PyTorch, you will need to install them separately. Refer to the [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
If you want to use 🤗 Datasets with TensorFlow or PyTorch, you will need to install them separately. Refer to the [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework.
```

## Virtual environment
2 changes: 1 addition & 1 deletion docs/source/process.rst
@@ -408,7 +408,7 @@ Data augmentation

With batch processing, you can even augment your dataset with additional examples. In the following example, you will generate additional words for a masked token in a sentence.

Load the `RoBERTa <https://huggingface.co/roberta-base>`_ model for use in the 🤗 Transformers `FillMaskPipeline <https://huggingface.co/transformers/main_classes/pipelines.html?#transformers.FillMaskPipeline>`_:
Load the `RoBERTa <https://huggingface.co/roberta-base>`_ model for use in the 🤗 Transformers `FillMaskPipeline <https://huggingface.co/transformers/main_classes/pipelines.html#transformers.FillMaskPipeline>`_:

.. code-block::
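
    # A hedged sketch only: the original example body is not shown in this diff excerpt.
    # This assumes the standard fill-mask pipeline API from the Transformers library.
    from transformers import pipeline

    fillmask = pipeline("fill-mask", model="roberta-base")
    mask_token = fillmask.tokenizer.mask_token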
2 changes: 1 addition & 1 deletion docs/source/quickstart.rst
@@ -76,7 +76,7 @@ Format the dataset

Depending on whether you are using PyTorch, TensorFlow, or JAX, you will need to format the dataset accordingly. There are three changes you need to make to the dataset:

1. Rename the ``label`` column to ``labels``, the expected input name in `BertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html?#transformers.BertForSequenceClassification.forward>`__ or `TFBertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html?#tfbertforsequenceclassification>`__:
1. Rename the ``label`` column to ``labels``, the expected input name in `BertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForSequenceClassification.forward>`__ or `TFBertForSequenceClassification <https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification>`__:

.. code::
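
    # A hedged sketch only: the exact snippet from this page is not shown in the diff.
    # One way to rename the column with the datasets API (not necessarily the one used here):
    dataset = dataset.rename_column("label", "labels")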
2 changes: 1 addition & 1 deletion docs/source/share.rst
@@ -71,7 +71,7 @@ Create the repository
^^^^^^^^^^^^^^^^^^^^^

Sharing a community dataset will require you to create an account on `hf.co <https://huggingface.co/join>`_ if you don't have one yet.
You can directly create a `new dataset repository <https://huggingface.co/new-dataset>`_ from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.
You can directly create a `new dataset repository <https://huggingface.co/login?next=%2Fnew-dataset>`_ from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.

1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

27 changes: 27 additions & 0 deletions src/datasets/features/features.py
@@ -888,6 +888,33 @@ def list_of_np_array_to_pyarrow_listarray(l_arr: List[np.ndarray], type: pa.Data


class Features(dict):
"""A special dictionary that defines the internal structure of a dataset.
Instantiated with a dictionary of type ``dict[str, FieldType]``, where keys are the desired column names,
and values are the type of that column.
``FieldType`` can be one of the following:
- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``
- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels
associated to them and will be stored as integers in the dataset
- a python :obj:`dict` which specifies that the field is a nested field containing a mapping of sub-field names to sub-field
  features. It's possible to have nested fields of nested fields in an arbitrary manner
- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python
:obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature
type hosted in this list
.. note::
A :class:`datasets.Sequence` with an internal dictionary feature will be automatically converted into a dictionary of
lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be
unwanted in some cases. If you don't want this behavior, you can use a python :obj:`list` instead of the
:class:`datasets.Sequence`.
- a :class:`Array2D`, :class:`Array3D`, :class:`Array4D` or :class:`Array5D` feature for multidimensional arrays
- a :class:`datasets.Audio` feature stores the path to an audio file and can extract audio data from it
- :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`, the two features specific to Machine Translation
"""

@property
def type(self):
"""
2 changes: 1 addition & 1 deletion src/datasets/load.py
@@ -1498,7 +1498,7 @@ def load_dataset(
Processing scripts are small python scripts that define the citation, info and format of the dataset,
contain the URL to the original data files and the code to load examples from the original data files.
You can find some of the scripts here: https://github.com/huggingface/datasets/datasets
You can find some of the scripts here: https://github.com/huggingface/datasets/tree/master/datasets
and easily upload yours to share them using the CLI ``huggingface-cli``.
You can find the complete list of datasets in the Datasets Hub at https://huggingface.co/datasets

1 comment on commit ac0d1d1

@github-actions



PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009046 / 0.011353 (-0.002307) 0.003829 / 0.011008 (-0.007179) 0.031726 / 0.038508 (-0.006783) 0.035319 / 0.023109 (0.012209) 0.294737 / 0.275898 (0.018839) 0.406894 / 0.323480 (0.083414) 0.007684 / 0.007986 (-0.000301) 0.004721 / 0.004328 (0.000393) 0.009009 / 0.004250 (0.004759) 0.037858 / 0.037052 (0.000806) 0.295518 / 0.258489 (0.037029) 0.334510 / 0.293841 (0.040669) 0.023773 / 0.128546 (-0.104774) 0.008261 / 0.075646 (-0.067385) 0.256729 / 0.419271 (-0.162543) 0.047031 / 0.043533 (0.003498) 0.298975 / 0.255139 (0.043836) 0.319067 / 0.283200 (0.035867) 0.084135 / 0.141683 (-0.057548) 1.699933 / 1.452155 (0.247778) 1.725804 / 1.492716 (0.233087)

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.203988 / 0.018006 (0.185982) |
| get_batch_of_1024_rows | 0.435049 / 0.000490 (0.434560) |
| get_first_row | 0.005692 / 0.000200 (0.005492) |
| get_last_row | 0.000130 / 0.000054 (0.000076) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036560 / 0.037411 (-0.000851) |
| shard | 0.022281 / 0.014526 (0.007756) |
| shuffle | 0.027498 / 0.176557 (-0.149059) |
| sort | 0.126638 / 0.737135 (-0.610497) |
| train_test_split | 0.029593 / 0.296338 (-0.266745) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.423170 / 0.215209 (0.207961) 4.235989 / 2.077655 (2.158334) 1.959111 / 1.504120 (0.454991) 1.751527 / 1.541195 (0.210332) 1.806348 / 1.468490 (0.337858) 0.374851 / 4.584777 (-4.209926) 4.671009 / 3.745712 (0.925297) 0.879231 / 5.269862 (-4.390631) 0.831568 / 4.565676 (-3.734109) 0.041025 / 0.424275 (-0.383250) 0.004842 / 0.007607 (-0.002765) 0.533253 / 0.226044 (0.307208) 5.313452 / 2.268929 (3.044524) 2.377745 / 55.444624 (-53.066880) 2.005812 / 6.876477 (-4.870664) 2.011469 / 2.142072 (-0.130603) 0.480115 / 4.805227 (-4.325113) 0.102168 / 6.500664 (-6.398496) 0.050539 / 0.075469 (-0.024930)

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.539189 / 1.841788 (-0.302599) |
| map fast-tokenizer batched | 12.654936 / 8.074308 (4.580628) |
| map identity | 27.700883 / 10.191392 (17.509491) |
| map identity batched | 0.802002 / 0.680424 (0.121578) |
| map no-op batched | 0.520520 / 0.534201 (-0.013681) |
| map no-op batched numpy | 0.225931 / 0.579283 (-0.353352) |
| map no-op batched pandas | 0.503459 / 0.434364 (0.069095) |
| map no-op batched pytorch | 0.190536 / 0.540337 (-0.349802) |
| map no-op batched tensorflow | 0.202975 / 1.386936 (-1.183961) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009164 / 0.011353 (-0.002189) 0.003868 / 0.011008 (-0.007140) 0.031605 / 0.038508 (-0.006903) 0.034851 / 0.023109 (0.011742) 0.283199 / 0.275898 (0.007301) 0.320169 / 0.323480 (-0.003311) 0.007856 / 0.007986 (-0.000130) 0.004821 / 0.004328 (0.000493) 0.009264 / 0.004250 (0.005014) 0.043346 / 0.037052 (0.006294) 0.280910 / 0.258489 (0.022421) 0.324817 / 0.293841 (0.030976) 0.024437 / 0.128546 (-0.104110) 0.008417 / 0.075646 (-0.067229) 0.255198 / 0.419271 (-0.164074) 0.047787 / 0.043533 (0.004254) 0.290293 / 0.255139 (0.035154) 0.314677 / 0.283200 (0.031478) 0.086350 / 0.141683 (-0.055333) 1.717138 / 1.452155 (0.264984) 1.809649 / 1.492716 (0.316933)

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.324396 / 0.018006 (0.306389) |
| get_batch_of_1024_rows | 0.438402 / 0.000490 (0.437912) |
| get_first_row | 0.046491 / 0.000200 (0.046291) |
| get_last_row | 0.000403 / 0.000054 (0.000349) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034565 / 0.037411 (-0.002847) |
| shard | 0.021114 / 0.014526 (0.006588) |
| shuffle | 0.026075 / 0.176557 (-0.150482) |
| sort | 0.123443 / 0.737135 (-0.613692) |
| train_test_split | 0.026923 / 0.296338 (-0.269415) |

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.418453 / 0.215209 (0.203244) 4.141372 / 2.077655 (2.063718) 1.811950 / 1.504120 (0.307830) 1.599226 / 1.541195 (0.058032) 1.608073 / 1.468490 (0.139583) 0.375272 / 4.584777 (-4.209505) 4.782414 / 3.745712 (1.036702) 0.883024 / 5.269862 (-4.386838) 0.829629 / 4.565676 (-3.736048) 0.041051 / 0.424275 (-0.383224) 0.004829 / 0.007607 (-0.002778) 0.521300 / 0.226044 (0.295256) 5.217215 / 2.268929 (2.948286) 2.215123 / 55.444624 (-53.229501) 1.846728 / 6.876477 (-5.029749) 1.857763 / 2.142072 (-0.284310) 0.481495 / 4.805227 (-4.323732) 0.102269 / 6.500664 (-6.398395) 0.051324 / 0.075469 (-0.024145)

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.537262 / 1.841788 (-0.304526) |
| map fast-tokenizer batched | 12.621657 / 8.074308 (4.547349) |
| map identity | 26.272518 / 10.191392 (16.081126) |
| map identity batched | 0.746616 / 0.680424 (0.066193) |
| map no-op batched | 0.515982 / 0.534201 (-0.018219) |
| map no-op batched numpy | 0.226246 / 0.579283 (-0.353037) |
| map no-op batched pandas | 0.505247 / 0.434364 (0.070883) |
| map no-op batched pytorch | 0.183750 / 0.540337 (-0.356587) |
| map no-op batched tensorflow | 0.198922 / 1.386936 (-1.188014) |

