# Process Data

This tutorial demonstrates how to process data with Data-Juicer using a given data recipe.

To process your dataset with Data-Juicer, run the process_data.py tool with your config as the argument.

``` shell
python tools/process_data.py --config configs/demo/process.yaml
```

The [configs/demo/process.yaml](https://github.com/modelscope/data-juicer/blob/main/configs/demo/process.yaml) used here is the given data recipe.

In [None]:
# You can run this in your CLI
data_juicer_path = '..'
%cd {data_juicer_path}
!python tools/process_data.py --config configs/demo/process.yaml

The [process_data.py](https://github.com/modelscope/data-juicer/blob/main/tools/process_data.py) will call the executor.run() method to process the data.

``` python
# tools/process_data.py

from loguru import logger

from data_juicer.config import init_configs
from data_juicer.core import Executor

@logger.catch
def main():
    cfg = init_configs()
    # We run the single Executor here to process data using the given cfg
    if cfg.executor_type == 'default':
        executor = Executor(cfg)
    executor.run()


if __name__ == '__main__':
    main()

```


Below we provide simple examples to show how the data is processed. 

In [None]:
# we init the corresponding config
from loguru import logger
from data_juicer.config import init_configs
cfg = init_configs(['--config', 'configs/demo/process.yaml'])

In [None]:
from data_juicer.core import Executor
executor = Executor(cfg)
dataset = executor.run()

Now, we explain the key executor.run() method in [executor.py](https://github.com/modelscope/data-juicer/blob/main/data_juicer/core/executor.py) step by step.

First, the method loads and formats the data.

The dataset is loaded either from a checkpoint saved by a previous run or from the data formatter.

``` python 
    def run(self, load_data_np=None):
        ...
        # 1. format data
        if self.cfg.use_checkpoint and self.ckpt_manager.ckpt_available:
            logger.info('Loading dataset from checkpoint...')
            dataset = self.ckpt_manager.load_ckpt()
        else:
            logger.info('Loading dataset from data formatter...')
            if load_data_np is None:
                load_data_np = self.cfg.np
            dataset = self.formatter.load_dataset(load_data_np, self.cfg)
        ...
```
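The branch above can be reduced to a small decision function. The following is a minimal sketch, not Data-Juicer's implementation: `load_or_resume` and its callable parameters are hypothetical stand-ins for `ckpt_manager.load_ckpt` and `formatter.load_dataset`, and the samples are made up.

``` python
from typing import Callable, List


def load_or_resume(use_checkpoint: bool,
                   ckpt_available: bool,
                   load_ckpt: Callable[[], List[dict]],
                   load_fresh: Callable[[], List[dict]]) -> List[dict]:
    """Hypothetical helper: resume from a checkpoint only when checkpointing
    is enabled AND a checkpoint exists; otherwise load the raw data."""
    if use_checkpoint and ckpt_available:
        return load_ckpt()
    return load_fresh()


# Toy stand-ins for the checkpoint manager and the formatter.
def toy_load_ckpt() -> List[dict]:
    return [{'text': 'from checkpoint'}]


def toy_load_fresh() -> List[dict]:
    return [{'text': 'from formatter'}]


resumed = load_or_resume(True, True, toy_load_ckpt, toy_load_fresh)
fresh = load_or_resume(True, False, toy_load_ckpt, toy_load_fresh)
print(resumed[0]['text'])  # from checkpoint
print(fresh[0]['text'])    # from formatter
```

Note that enabling checkpoints alone is not enough to resume: with no checkpoint available, the sketch falls back to loading from the formatter, as the excerpt does.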

You can run the code below to see the dataset here interactively.

In [3]:
loaded_dataset = executor.formatter.load_dataset(cfg.np, cfg)
loaded_dataset

Dataset({
    features: ['text', 'meta'],
    num_rows: 6
})

Then the method loads the operators listed in the process field of the given cfg file.

``` python

    def run(self, load_data_np=None):
        ...
        # 2. extract processes
        logger.info('Preparing process operators...')
        self.process_list, self.ops = load_ops(self.cfg.process,
                                               self.cfg.op_fusion)
        ...

```

You can run the code below to see the process_list and ops.

In [4]:
process_list = executor.process_list
process_list

[{'language_id_score_filter': {'lang': 'zh',
   'min_score': 0.8,
   'text_key': 'text',
   'image_key': 'images',
   'audio_key': 'audios',
   'video_key': 'videos'}}]

In [5]:
ops = executor.ops
ops

[<data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter at 0x2e31f1be0>]
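To make the relationship between process_list and ops concrete, here is a minimal sketch of how a list of `{op_name: op_args}` entries could be turned into operator instances. It assumes a simple name-to-class registry; `OP_REGISTRY`, `build_ops`, and `ToyLanguageFilter` are hypothetical names for illustration, and Data-Juicer's real `load_ops` may resolve operators differently.

``` python
class ToyLanguageFilter:
    """Hypothetical stand-in for an operator class; it only stores its args."""

    def __init__(self, lang='', min_score=0.0, **kwargs):
        self.lang = lang
        self.min_score = min_score


# Hypothetical registry mapping op names (as written in the config) to classes.
OP_REGISTRY = {'language_id_score_filter': ToyLanguageFilter}


def build_ops(process_list):
    """Turn a list of {op_name: op_args} dicts into op instances."""
    ops = []
    for op_cfg in process_list:
        # Each entry holds exactly one op name and its argument dict.
        op_name, op_args = list(op_cfg.items())[0]
        ops.append(OP_REGISTRY[op_name](**op_args))
    return ops


demo_process_list = [
    {'language_id_score_filter': {'lang': 'zh', 'min_score': 0.8}},
]
demo_ops = build_ops(demo_process_list)
print(type(demo_ops[0]).__name__, demo_ops[0].lang, demo_ops[0].min_score)
# ToyLanguageFilter zh 0.8
```

The two lists stay aligned by position, which is what lets the executor later iterate over them together with `zip`.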

Using the loaded self.process_list and self.ops, the method then applies each op to the data. We use the mapper and filter ops as examples.

``` python
    def run(self, load_data_np=None):
        ...
        # 3. data process
        # - If tracer is open, trace each op after it's processed
        # - If checkpoint is open, clean the cache files after each process
        logger.info('Processing data...')
        ...
        for op_cfg, op in zip(self.process_list, self.ops):
            op_name, op_args = list(op_cfg.items())[0]
            if isinstance(op, Mapper):
                dataset = dataset.map(op.process)
            # ...
            elif isinstance(op, Filter):
                ...
                dataset = dataset.map(op.compute_stats)
                dataset = dataset.filter(op.process) 
            ...
        ...
```

Here, all data samples are processed through the list of ops.
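The mapper/filter loop can be tried without Data-Juicer installed. The sketch below uses plain Python lists in place of a dataset; the `Mapper` and `Filter` base classes, `StripMapper`, `MinLengthFilter`, `run_ops`, and the `stats` key are all hypothetical toy names, not the library's classes. It mirrors only the pattern in the excerpt: a mapper transforms each sample, while a filter first computes stats and then decides which samples to keep.

``` python
class Mapper:
    def process(self, sample):
        raise NotImplementedError


class Filter:
    def compute_stats(self, sample):
        raise NotImplementedError

    def process(self, sample):
        raise NotImplementedError


class StripMapper(Mapper):
    """Toy mapper: trims surrounding whitespace from the text."""

    def process(self, sample):
        return {**sample, 'text': sample['text'].strip()}


class MinLengthFilter(Filter):
    """Toy filter: keeps samples whose text has at least min_len characters."""

    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats(self, sample):
        # Store the stat on the sample so process() only has to compare.
        return {**sample, 'stats': {'text_len': len(sample['text'])}}

    def process(self, sample):
        return sample['stats']['text_len'] >= self.min_len


def run_ops(samples, ops):
    """Apply each op in order, as in step 3 of the excerpt."""
    for op in ops:
        if isinstance(op, Mapper):
            samples = [op.process(s) for s in samples]
        elif isinstance(op, Filter):
            samples = [op.compute_stats(s) for s in samples]
            samples = [s for s in samples if op.process(s)]
    return samples


toy_samples = [{'text': '  hello world  '}, {'text': ' hi '}]
kept = run_ops(toy_samples, [StripMapper(), MinLengthFilter(min_len=5)])
print(kept)  # [{'text': 'hello world', 'stats': {'text_len': 11}}]
```

Because the filter writes its stats onto each sample before filtering, the surviving samples carry an extra field, which is consistent with the `__dj__stats__` feature that appears in the exported dataset.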

After all the ops have been applied, the method dumps the dataset to the given export path.

``` python
    def run(self, load_data_np=None):
        ...
        # 4. data export
        logger.info('Exporting dataset to disk...')
        self.exporter.export(dataset)
        ...
```

You can check the processed dataset at the export path set in [configs/demo/process.yaml](https://github.com/modelscope/data-juicer/blob/main/configs/demo/process.yaml):

``` yaml
export_path: './outputs/demo-process/demo-processed.jsonl'
```

In [6]:
# the exported dataset after run
dataset

Dataset({
    features: ['text', 'meta', '__dj__stats__'],
    num_rows: 2
})
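To inspect the exported file itself rather than the in-memory dataset, you can read it line by line. This is a minimal sketch that assumes the export is in JSON Lines format (one JSON object per line), as the `.jsonl` extension of the export path suggests; `read_jsonl` is a hypothetical helper, and the demo rows are made up so the snippet runs on its own.

``` python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Self-contained demo on a temporary file with made-up rows.
demo_rows = [{'text': 'first'}, {'text': 'second'}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        for row in demo_rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')
    loaded_rows = read_jsonl(demo_path)

print(len(loaded_rows))  # 2
```

To look at real output, call `read_jsonl` with the export_path from your own config after the run has finished.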