UTF-8 warning (quickfix) (#2777)
- added UTF-8 as a requirement to the datasets' README.md
- added a few lines to the docs as well, although they're mostly outdated
sedthh committed Apr 20, 2023
1 parent a700700 · commit 99301cc
Showing 2 changed files with 12 additions and 5 deletions.
`data/datasets/README.md` (9 changes: 6 additions & 3 deletions)
````diff
@@ -19,13 +19,14 @@ To see the datasets people are currently working on, please refer to
   datasets
 - The final version of each dataset is pushed to the
   [OpenAssistant Hugging Face](https://huggingface.co/OpenAssistant)
+- All data **must** be `UTF-8` encoded to simplify training!
 
 ## **Dataset Formats**
 
-To simplify the training process, all datasets must be stored in one of the two
-formats:
+To simplify the training process, all datasets must be `UTF-8` encoded and
+stored in either one of these two formats:
 
-- parquet with the option `row_group_size=100`
+- parquet with the option `row_group_size=100` and `index=False`
 - jsonl or jsonl.gz
 
 ## **Dataset Types**
````

````diff
@@ -183,6 +184,8 @@ df = pd.read_json(...) # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
````
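
Since both files now point contributors at the same export step, here is a minimal end-to-end sketch of what the requirement amounts to. This is an editor's illustration, not part of the commit: the column names are placeholders rather than any official schema, and the encoding loop is just one way to surface badly decoded text before export.

```python
import pandas as pd

# Toy dataframe; the column names here are illustrative placeholders.
df = pd.DataFrame(
    {"INSTRUCTION": ["What encoding is required?"], "RESPONSE": ["UTF-8."]}
)

# Rough sanity check: every string must round-trip through UTF-8.
# str.encode raises UnicodeEncodeError on lone surrogates left behind
# by mis-decoded input, which is the usual failure mode.
for col in df.select_dtypes(include="object"):
    for value in df[col]:
        if isinstance(value, str):
            value.encode("utf-8")

# Accepted format 1: parquet with the options the README requires.
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)

# Accepted format 2: (gzipped) JSON lines; force_ascii=False keeps raw UTF-8.
df.to_json("dataset.jsonl.gz", orient="records", lines=True,
           compression="gzip", force_ascii=False)
```

`index=False` keeps pandas from writing its row index into the file as an extra column, which is presumably why the commit adds it alongside `row_group_size=100`.
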
`docs/docs/data/datasets.md` (8 changes: 6 additions & 2 deletions)
````diff
@@ -7,6 +7,8 @@ github repository aims to provide a diverse and accessible collection of
 datasets that can be used to train OpenAssistant models.<br/> Our goal is to
 cover a wide range of topics, languages and tasks.
 
+To simplify the training process, all data must be `UTF-8` encoded.
+
 ### **Current Progress**
 
 To see the datasets people are currently working on, please refer to
````

````diff
@@ -26,8 +28,8 @@ To see the datasets people are currently working on, please refer to
 ## **Dataset Formats**
 
 To simplify the training process, all datasets must be stored as Parquet files
-with the option `row_group_size=100`.<br/> There are two types of datasets
-accepted: instruction and text-only.
+with the option `row_group_size=100` and `index=False`.<br/> There are two types
+of datasets accepted: instruction and text-only.
 
 ### **Instruction format**
 
````

````diff
@@ -92,6 +94,8 @@ df = pd.read_json(...) # or any other way
 df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
 ```
 
+Make sure the text data in the dataframe is properly encoded as `UTF-8`!
+
 #### 2. Install Hugging Face Hub
 
 ```bash
````
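
Neither changed file shows how to check an already-written file against these rules. The sketch below is a hedged suggestion, assuming the `dataset.parquet` path from the snippet above and using only standard pyarrow calls (`ParquetFile`, `metadata.row_group`):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")

# row_group_size=100 caps each row group at 100 rows.
for i in range(pf.metadata.num_row_groups):
    assert pf.metadata.row_group(i).num_rows <= 100, f"row group {i} is too large"

# Parquet defines its string type as UTF-8, so reading the table back
# doubles as a coarse check that the text data decodes cleanly.
table = pf.read()
print(f"{table.num_rows} rows passed")
```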
