Datasets README improvements (#1741)
Added dataset formats and requirements sections. 
Comments are welcome.
Vechtomov committed Feb 19, 2023
1 parent b701e4f commit d048248
Showing 2 changed files with 74 additions and 21 deletions.
93 changes: 72 additions & 21 deletions openassistant/datasets/README.md
@@ -1,10 +1,63 @@
# **Datasets**
## **Overview**

This repository aims to provide a diverse and accessible collection of datasets
that can be used to train OpenAssistant models.<br/> Our goal is to cover a wide
range of topics, languages and tasks.

### **Current Progress**

To see the datasets people are currently working on, please refer to
**[the spreadsheet](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk)**.

### **Repository Structure**

- Each dataset is organized into its own folder, which may include notebooks,
  processing scripts, markdown files, and other materials that explain the
  dataset creation process (a sketch of a possible layout follows this list)
- The dataset files themselves are stored on Hugging Face
- The root `__init__.py` lists the dataset names and corresponding Hugging Face
datasets
- The final version of each dataset is pushed to the
  [OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant) organization
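
As a sketch, a dataset folder might look like this (the names are illustrative,
not prescribed):

```
openassistant/datasets/
├── __init__.py            # maps dataset names to Hugging Face dataset repos
└── your_dataset/
    ├── README.md          # notes on how the dataset was built
    └── prepare.ipynb      # processing / upload notebook
```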

## **Dataset Formats**

To simplify the training process, all datasets must be stored as Parquet files
with the option `row_group_size=100`.<br/> Two types of datasets are accepted:
instruction and text-only.

### **Instruction format**

Instruction datasets are designed to align language models with human
interactions. These can take the form of question-answer, request-response,
task-solution pairs, and so on. The instruction dataset must include the
following columns (a minimal example follows the list):

1. **INSTRUCTION** (string): Instruction text
2. **RESPONSE** (string): Expected response to the instruction
3. **SOURCE** (string): Original data source short name, e.g. "wikipedia"
4. **METADATA** (JSON string, optional): Any other useful information stored in
JSON<br/> For example, NSFW content can be marked as `{"nsfw": true}`
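
For illustration, here is a minimal sketch of writing a conforming instruction
dataset with pandas and pyarrow; the row content and file name are made up, and
`json.dumps` is used because **METADATA** is a JSON string:

```python
import json

import pandas as pd

rows = [
    {
        "INSTRUCTION": "Summarize the water cycle in two sentences.",
        "RESPONSE": "Water evaporates and condenses into clouds, then returns as precipitation.",
        "SOURCE": "wikipedia",
        "METADATA": json.dumps({"nsfw": False}),  # optional column, serialized JSON
    }
]

df = pd.DataFrame(rows)
# row_group_size=100 is required for all datasets in this repository.
df.to_parquet("instruction_dataset.parquet", row_group_size=100, engine="pyarrow")
```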

### **Text-only format**

For datasets that do not fit the instruction format, a text-only format is
provided (a read-back check is sketched after the list). The text-only dataset
must include the following columns:

1. **TEXT** (string)
2. **SOURCE** (string)
3. **METADATA** (JSON string, optional)
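
A small sketch of reading such a file back and checking the required columns
(the file name is again illustrative):

```python
import json

import pandas as pd

df = pd.read_parquet("text_dataset.parquet")
assert {"TEXT", "SOURCE"}.issubset(df.columns)  # METADATA is optional

# When METADATA is present, each non-null value is a JSON string.
if "METADATA" in df.columns:
    metadata = df["METADATA"].dropna().map(json.loads)
```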

## **Dataset Requirements**

The dataset must adhere to the following requirements:

- Must have a permissive license
- Must not contain child sexual abuse materials
- Must not contain materials with private individuals' personal information
  (e.g. name, address, phone number, government ID, or medical information)

## **How to Contribute**

To add a new dataset to OpenAssistant, follow these steps:

@@ -20,11 +73,11 @@ To add a new dataset to OpenAssistant, follow these steps:
link the issue in the pull request description. For more information, see
[below](#making-a-pull-request).

### **Creating a Dataset on Hugging Face**

To create a new dataset on Hugging Face, follow these steps:

#### 1. Convert your dataset file(s) to the Parquet format using [pandas](https://pandas.pydata.org/) and [pyarrow](https://pypi.org/project/pyarrow/) libraries:

```python
import pandas as pd

# The middle of this snippet is collapsed in the diff view; the lines below are
# an assumed completion based on the requirements stated above.
df = pd.read_json("dataset.json")  # load your source data by any means
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
```

@@ -53,7 +106,7 @@ login:

```
huggingface-cli login
```

- in Jupyter notebook (currently does not work in
[Visual Studio Code](https://github.com/huggingface/huggingface_hub/issues/752))

```python
# Assumed body of this collapsed snippet: huggingface_hub's notebook login.
from huggingface_hub import notebook_login

notebook_login()
```

@@ -69,13 +122,13 @@

```python
from datasets import Dataset  # assumed import, collapsed in the diff view

ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
```

#### 5. Update the Hugging Face `README.md` file

Update the `README.md` file of your dataset by visiting this link:
https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
(substituting your Hugging Face username and dataset name)

### **Making a Pull Request**

#### 1. Fork this repository

@@ -84,18 +137,16 @@
#### 3. Add your dataset to the repository

- Create a folder with the name of your dataset.
- Add files that describe your dataset and its creation, such as a README,
notebooks, scrapers, etc.
- Add your dataset to the parent `__init__.py`, for example:

```python
INSTRUCTION_DATASETS = {
    ...,
    "dataset_name": "your_huggingface_name/dataset_name",
}
```
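
A quick sanity check for the new entry might look like the following; this
assumes the repository's package layout and uses the standard `datasets`
loader:

```python
from datasets import load_dataset

from openassistant.datasets import INSTRUCTION_DATASETS

# Hypothetical check that the registered Hugging Face repo loads correctly.
ds = load_dataset(INSTRUCTION_DATASETS["dataset_name"])
print(ds["train"])
```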

#### 4. Stage your changes and run the pre-commit hook

2 changes: 2 additions & 0 deletions openassistant/datasets/__init__.py
@@ -1 +1,3 @@
TEXT_DATASETS = {}

INSTRUCTION_DATASETS = {"grade-school-math-instructions": "qwedsacf/grade-school-math-instructions"}
