# Build Data Recipes

Data-Juicer leverages a `data recipe` in YAML format to orchestrate and streamline the entire data processing pipeline.


Let's begin with the fundamental data recipe format outlined below.

  <img src="https://img.alicdn.com/imgextra/i1/O1CN01uXgjgj1khWKOigYww_!!6000000004715-0-tps-1745-871.jpg" alt="Basic config example of format and definition" title="Basic config file example" width="60%" />

This recipe consists of two main parts:
- Global arguments
- Operator list

Next, we'll explore the definitions of these arguments.

### Global Arguments
  Global arguments typically include a set of required arguments as well as various optional arguments for optimization and debugging purposes.
  
  Here are the required arguments:

| Name                   | Value                          | Description                             |
|------------------------|--------------------------------|----------------------------------------------------|
| `dataset_path`         | /path/to/your/dataset          | Path to your dataset directory or file |
| `export_path`          | /path/to/result/dataset.jsonl  | Path to processed result dataset. Supported suffixes include ['jsonl', 'json', 'parquet']. |
| `np`                   | 4                              | Number of subprocesses to process your dataset. |

For other optional arguments, please refer to [Optional Arguments](#Optional-Arguments).
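
As a minimal sketch of what the required arguments imply, the snippet below checks a recipe that has already been loaded into a plain Python dict. The helper `check_required_args` is hypothetical and is not part of Data-Juicer. The required names and supported suffixes are taken from the table above, and the rule that `np` must be a positive integer is an assumption.

```python
from pathlib import PurePosixPath

# Required global arguments and export suffixes, taken from the table above.
REQUIRED_ARGS = ("dataset_path", "export_path", "np")
SUPPORTED_SUFFIXES = {"jsonl", "json", "parquet"}


def check_required_args(recipe: dict) -> list:
    """Hypothetical helper: return a list of problems found in the recipe."""
    problems = [f"missing required argument: {name}"
                for name in REQUIRED_ARGS if name not in recipe]

    export_path = recipe.get("export_path")
    if export_path is not None:
        suffix = PurePosixPath(str(export_path)).suffix.lstrip(".")
        if suffix not in SUPPORTED_SUFFIXES:
            problems.append(f"unsupported export suffix: {suffix!r}")

    # Assumption: `np` should be a positive integer number of subprocesses.
    np_value = recipe.get("np")
    if np_value is not None and (not isinstance(np_value, int) or np_value < 1):
        problems.append("np must be a positive integer")
    return problems


recipe = {
    "dataset_path": "/path/to/your/dataset",
    "export_path": "/path/to/result/dataset.jsonl",
    "np": 4,
}
print(check_required_args(recipe))  # an empty list means no problems were found
```

Data-Juicer performs its own validation when it loads a recipe, so a check like this is only useful as a quick pre-flight step in your own tooling.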


### Operator List

Data-Juicer offers an extensive array of operators for data manipulation, encompassing modification, cleansing, filtering, and deduplication tasks.

A data recipe must list the operators it needs together with their arguments. Data-Juicer runs the operators sequentially, in the order they appear in the list.

A sample operator list is shown below:

```yaml
process:
  - clean_email_mapper:           # remove emails from text.
  - image_blur_mapper:            # mapper to blur images.
      p: 0.2                      # probability of the image being blurred
      blur_type: 'gaussian'       # type of blur kernel, including ['mean', 'box', 'gaussian']
      radius: 2                   # radius of blur kernel
  - video_duration_filter:        # Keep data samples whose videos' durations are within a specified range.
      min_duration: 0             # the min video duration of filter range (in seconds)
      max_duration: 10            # the max video duration of filter range (in seconds)
      any_or_all: any             # keep this sample when any/all videos meet the filter condition
  - image_deduplicator:           # deduplicate samples using exact matching of images between documents.
      method: phash               # hash method for image. One of [phash, dhash, whash, ahash]
```
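
To illustrate the sequential ordering, here is a rough sketch of how the sample `process` list looks once parsed into Python, and how it can be walked in order. The structure assumes a typical YAML parser, where an operator with no arguments maps to `None`. The helper `iter_operators` is hypothetical and does not reflect how Data-Juicer loads operators internally.

```python
# The `process` list from the sample above, written as the Python structure
# a YAML parser would typically produce (an op with no arguments maps to None).
process = [
    {"clean_email_mapper": None},
    {"image_blur_mapper": {"p": 0.2, "blur_type": "gaussian", "radius": 2}},
    {"video_duration_filter": {"min_duration": 0, "max_duration": 10,
                               "any_or_all": "any"}},
    {"image_deduplicator": {"method": "phash"}},
]


def iter_operators(process_list):
    """Hypothetical helper: yield (op_name, op_args) in recipe order."""
    for entry in process_list:
        if len(entry) != 1:
            raise ValueError(f"each entry must name exactly one operator: {entry}")
        (name, args), = entry.items()
        # Normalize "no arguments" (None) to an empty dict.
        yield name, dict(args or {})


ordered_ops = list(iter_operators(process))
for position, (name, args) in enumerate(ordered_ops, start=1):
    print(position, name, args)
```

Because the list order is the execution order, moving a cheap filter ahead of an expensive mapper changes how much data the mapper has to process.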

## Build Method
There are two approaches to constructing a data recipe.

### Customize the Default Configuration File

The [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml) contains all operators and their default arguments. 

You only need to **remove** the operators you won't use and adjust the arguments of those you keep.
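
The same trimming can also be done programmatically. The sketch below is one possible approach, not something Data-Juicer provides. It assumes the operator list has already been parsed into Python, the `customize` helper is hypothetical, and `full_process` is a tiny stand-in for the much longer list in `config_all.yaml`.

```python
def customize(process_list, keep, overrides=None):
    """Hypothetical helper: keep only the ops in `keep`, then apply overrides."""
    overrides = overrides or {}
    result = []
    for entry in process_list:
        (name, args), = entry.items()
        if name not in keep:
            continue  # drop operators that will not be used
        # Later keys win, so overrides replace the default argument values.
        merged = {**(args or {}), **overrides.get(name, {})}
        result.append({name: merged})
    return result


# A tiny stand-in for the full operator list; the real file is much longer.
full_process = [
    {"clean_email_mapper": None},
    {"image_blur_mapper": {"p": 0.2, "blur_type": "gaussian", "radius": 2}},
    {"image_deduplicator": {"method": "phash"}},
]

trimmed = customize(
    full_process,
    keep={"clean_email_mapper", "image_deduplicator"},
    overrides={"image_deduplicator": {"method": "dhash"}},
)
print(trimmed)
```

The original order of the kept operators is preserved, which matters because operators run sequentially.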

### Create a New Configuration from Scratch

You can refer to our example config file [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml), the [operator documentation](https://github.com/modelscope/data-juicer/blob/main/docs/Operators.md), and the advanced [Build-Up Guide for developers](https://github.com/modelscope/data-juicer/blob/main/docs/DeveloperGuide.md#build-your-own-configs).

## A Demo to Check Data Recipe

Once constructed, a recipe can serve as input for an executor or analyzer, or it can be used to perform a preliminary check.


In [None]:
# Install data-juicer package if you are NOT in the Playground
!pip3 install py-data-juicer

# Or use newest code of data-juicer
!pip install git+https://github.com/modelscope/data-juicer

Here, we create a temporary recipe for demonstration purposes.


In [None]:
import tempfile
config_str = """
project_name: 'test_demo'
dataset_path: './demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocesses to process your dataset

export_path: './outputs/demo/demo-processed.parquet'

# process schedule
# a list of several process operators with their arguments
process:
  - whitespace_normalization_mapper:
  - language_id_score_filter:
      lang: 'zh'
  - document_deduplicator: # deduplicate text samples using md5 hashing exact matching method
      lowercase: false   # whether to convert text to lower case
      ignore_non_character: false
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as recipe:
    recipe.write(config_str)
    recipe.flush()

Load and check the recipe:

In [None]:
from data_juicer.config import init_configs
cfg = init_configs(args=f'--config {recipe.name}'.split())
print(f'np = {cfg.np}')

Finally, delete the temporary recipe:

In [None]:
import os
if os.path.exists(recipe.name):
    os.unlink(recipe.name)

## Optional Arguments


A recipe also accepts many optional arguments. They mainly cover system performance, multimodal processing, and the tracer.

### Arguments for System Performance

These arguments are primarily related to system performance.

| Name                   | Value                          | Description                             |
|------------------------|--------------------------------|----------------------------------------------------|
| `export_shard_size`    | 0                              | Shard size of the exported dataset, in bytes. |
| `export_in_parallel`   | false                          | Whether to export the result dataset in parallel to a single file. |
| `use_cache`            | true                           | Whether to use the cache management of Hugging Face datasets. |
| `ds_cache_dir`         | null                           | Cache dir for Hugging Face datasets. |
| `use_checkpoint`       | false                          | Whether to use checkpoint management to save the latest version of the dataset to the work dir when processing. |
| `temp_dir`             | null                           | The path to the temporary directory to store intermediate caches when cache is disabled. |
| `op_fusion`            | false                          | Whether to fuse operators that share the same intermediate variables automatically. |
| `cache_compress`       | null                           | The compression method of the cache file. |
| `save_stats_in_one_file`| false                         | Whether to store all stats results into one file. |
| `ray_address`          | auto                           | The address of the Ray cluster. |


### Arguments for Text-Based Interleaved Data Format
Due to the large format diversity among different multimodal datasets, Data-Juicer proposes a novel intermediate text-based interleaved data format for multimodal datasets, which is based on chunk-wise formats such as the MMC4 dataset.

In the Data-Juicer format, a multimodal sample or document is based on text, which consists of several text chunks. Each chunk is a semantic unit, and all the multimodal information in a chunk should talk about the same thing and be aligned with each other.

You can modify these keys to suit your custom dataset.

| Name                   | Value                          | Description                             |
|------------------------|--------------------------------|----------------------------------------------------|
| `text_keys`            | `text`                      | The key name of the field where the sample texts to be processed are located.|
| `image_key`            | `images`                     | Key name of the field to store the list of sample image paths. |
| `image_special_token`  | `<__dj__image>`              | The special token that represents an image in the text.     |
| `audio_key`            | `audios`                     | Key name of the field to store the list of sample audio paths. |
| `audio_special_token`  | `<__dj__audio>`              | The special token that represents audio in the text.        |
| `video_key`            | `videos`                     | Key name of the field to store the list of sample video paths. |
| `video_special_token`  | `<__dj__video>`              | The special token that represents a video in the text.      |
| `eoc_special_token`    | `<\|__dj__eoc\|>`              | The special token that represents the end of a chunk in the text. |



### Arguments for Tracer

The tracer helps you track how the data changes during processing, for example which samples were modified, which were filtered out, and which were identified as duplicates.

| Name                   | Value                          | Description |
|------------------------|--------------------------------|----------------------------|
| `open_tracer`          | false                          | Whether to open the tracer to trace changes during processing. |
| `op_list_to_trace`     | []                             | Only operations in this list will be traced by the tracer. |
| `trace_num`            | 10                             | Number of samples to show the differences between datasets before and after each operation. |
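
Conceptually, tracing compares the dataset before and after an operator and keeps a limited number of differences. The sketch below illustrates that idea only. It is not Data-Juicer's tracer implementation, and both functions are hypothetical. It assumes that a mapper keeps samples aligned one-to-one and that a filter only removes samples.

```python
def trace_changes(before, after, trace_num=10, text_key="text"):
    """Collect up to `trace_num` samples whose text was changed by a mapper."""
    changes = []
    for original, processed in zip(before, after):
        if original[text_key] != processed[text_key]:
            changes.append({"original": original[text_key],
                            "processed": processed[text_key]})
            if len(changes) >= trace_num:
                break
    return changes


def trace_filtered(before, after, trace_num=10, text_key="text"):
    """Collect up to `trace_num` samples that a filter removed."""
    kept = {s[text_key] for s in after}
    return [s for s in before if s[text_key] not in kept][:trace_num]


before = [{"text": "mail me at a@b.com"}, {"text": "hello  world"}, {"text": "ok"}]
after_mapper = [{"text": "mail me at "}, {"text": "hello  world"}, {"text": "ok"}]
after_filter = [{"text": "mail me at "}, {"text": "ok"}]

print(trace_changes(before, after_mapper, trace_num=10))
print(trace_filtered(after_mapper, after_filter, trace_num=10))
```

In a recipe, the `trace_num` argument plays the role of the cap shown here, and `op_list_to_trace` restricts which operators are traced.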

## Reusable Recipes

Data-Juicer offers dozens of [pre-built data processing recipes](https://github.com/modelscope/data-juicer/blob/main/configs/data_juicer_recipes/README.md) covering pre-training, fine-tuning, English, Chinese, and other scenarios.

### Reproduced RedPajama

We have reproduced the processing flow of some RedPajama datasets. Please refer to the [reproduced_redpajama](https://github.com/modelscope/data-juicer/blob/main/configs/reproduced_redpajama/README.md) folder for details.

### Reproduced BLOOM

We have reproduced the processing flow of some BLOOM datasets. Please refer to the [reproduced_bloom](https://github.com/modelscope/data-juicer/blob/main/configs/reproduced_bloom/README.md) folder for details.

### Data-Juicer Recipes
We have refined some open-source datasets (including CFT datasets) using Data-Juicer and provide configuration files for the refinement flow. Please refer to the [data_juicer_recipes](https://github.com/modelscope/data-juicer/blob/main/configs/data_juicer_recipes/README.md) folder for details.


## Awesome LLM Data 

We provide a tag-based categorization to help readers navigate the many materials and quickly see each entry's key focus areas. Soon we will provide a dynamic table of contents with features such as search, filter, and sort.

For more details, please refer to [Awesome LLM Data](https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md).

