UTF-8 warning (quickfix) (#2777)
- added UTF-8 as a requirement to the datasets' README.md
- added a few lines to the docs as well, although they're mostly outdated
sedthh committed Apr 20, 2023
1 parent a700700 · commit 99301cc
Showing 2 changed files with 12 additions and 5 deletions.
`data/datasets/README.md` (9 changes: 6 additions & 3 deletions)
````diff
@@ -19,13 +19,14 @@ To see the datasets people are currently working on, please refer to
   datasets
 - The final version of each dataset is pushed to the
   [OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant)
+- All data **must** be `UTF-8` encoded to simplify training!
 
 ## **Dataset Formats**
 
-To simplify the training process, all datasets must be stored in one of the two
-formats:
+To simplify the training process, all datasets must be `UTF-8` encoded and
+stored in either one of these two formats:
 
-- parquet with the option `row_group_size=100`
+- parquet with the option `row_group_size=100` and `index=False`
 - jsonl or jsonl.gz
 
 ## **Dataset Types**
````

````diff
@@ -183,6 +184,8 @@ df = pd.read_json(...) # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
````
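
Since both files now point contributors at the same export step, here is a minimal end-to-end sketch of what the requirement amounts to. This is an editor's illustration, not part of the commit: the column names are placeholders rather than any official schema, and the encoding loop is just one way to surface badly decoded text before export.

```python
import pandas as pd

# Toy dataframe; the column names here are illustrative placeholders.
df = pd.DataFrame(
    {"INSTRUCTION": ["What encoding is required?"], "RESPONSE": ["UTF-8."]}
)

# Rough sanity check: every string must round-trip through UTF-8.
# str.encode raises UnicodeEncodeError on lone surrogates left behind
# by mis-decoded input, which is the usual failure mode.
for col in df.select_dtypes(include="object"):
    for value in df[col]:
        if isinstance(value, str):
            value.encode("utf-8")

# Accepted format 1: parquet with the options the README requires.
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)

# Accepted format 2: (gzipped) JSON lines; force_ascii=False keeps raw UTF-8.
df.to_json("dataset.jsonl.gz", orient="records", lines=True,
           compression="gzip", force_ascii=False)
```

`index=False` keeps pandas from writing its row index into the file as an extra column, which is presumably why the commit adds it alongside `row_group_size=100`.
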
`docs/docs/data/datasets.md` (8 changes: 6 additions & 2 deletions)
````diff
@@ -7,6 +7,8 @@ github repository aims to provide a diverse and accessible collection of
 datasets that can be used to train OpenAssistant models.<br/> Our goal is to
 cover a wide range of topics, languages and tasks.
 
+To simplify the training process, all data must be `UTF-8` encoded.
+
 ### **Current Progress**
 
 To see the datasets people are currently working on, please refer to
````

````diff
@@ -26,8 +28,8 @@ To see the datasets people are currently working on, please refer to
 ## **Dataset Formats**
 
 To simplify the training process, all datasets must be stored as Parquet files
-with the option `row_group_size=100`.<br/> There are two types of datasets
-accepted: instruction and text-only.
+with the option `row_group_size=100` and `index=False`.<br/> There are two types
+of datasets accepted: instruction and text-only.
 
 ### **Instruction format**
 
````

````diff
@@ -92,6 +94,8 @@ df = pd.read_json(...) # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
````
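
Neither changed file shows how to check an already-written file against these rules. The sketch below is a hedged suggestion, assuming the `dataset.parquet` path from the snippet above and using only standard pyarrow calls (`ParquetFile`, `metadata.row_group`):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")

# row_group_size=100 caps each row group at 100 rows.
for i in range(pf.metadata.num_row_groups):
    assert pf.metadata.row_group(i).num_rows <= 100, f"row group {i} is too large"

# Parquet defines its string type as UTF-8, so reading the table back
# doubles as a coarse check that the text data decodes cleanly.
table = pf.read()
print(f"{table.num_rows} rows passed")
```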
