### Text Data Processing Usage

For the available text processing methods, see the [Processor Documentation](../../docs/text_process.md).

Below is a simple example YAML configuration, in the format of `configs/process/text_process_example.yaml`:

```yaml
model_cache_path: '../ckpt' # Path for model caching
dependencies: [text]
save_path: "./processed.jsonl"

data:
  text:
    use_hf: False # Whether to load a Hugging Face dataset; if True, the local data_path below is ignored
    dataset_name: 'yahma/alpaca-cleaned'
    dataset_split: 'train'  
    name: 'default' 
    revision: null
    data_path: 'demos/text_process/fineweb_5_samples.json'  # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type

    keys: 'text' # Key name to process; for SFT data, it can be specified as ['instruction','input','output']
```
The `data` section specifies where the data comes from (a Hugging Face dataset or a local file/folder), which formatter loads it, and which keys are processed.
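
To illustrate the `keys` option, here is a minimal sketch of how a single key (pretraining-style text) and a list of keys (SFT data) might be used to select fields from a record. The helper `select_fields` and the sample records are hypothetical and written only for this example; they are not part of DataFlow's API, and the library's own formatter may behave differently.

```python
from typing import Dict, List, Union


def select_fields(record: Dict[str, str], keys: Union[str, List[str]]) -> Dict[str, str]:
    """Return only the fields named by `keys` (hypothetical helper, not DataFlow's API)."""
    # Accept either a single key ('text') or a list of keys (['instruction', ...]).
    wanted = [keys] if isinstance(keys, str) else list(keys)
    return {k: record[k] for k in wanted if k in record}


# Made-up records for illustration only.
pretrain_record = {"text": "Hello   world", "url": "https://example.com"}
sft_record = {"instruction": "Add the numbers", "input": "1 2", "output": "3"}

print(select_fields(pretrain_record, "text"))
print(select_fields(sft_record, ["instruction", "input", "output"]))
```

With `keys: 'text'` only the `text` field is selected; with the list form, all three SFT fields are selected.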
```yaml
processors:
  RemoveExtraSpacesRefiner: {} # Removes extra spaces
  CCNetDeduplicator: 
    bit_length: 64 
  NgramFilter:
    min_score: 0.99
    max_score: 1.0
    scorer_args:
      ngrams: 5 # Specifies the n-gram value
```
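The `processors` section lists the processors to run, each with its parameters; an empty mapping (`{}`) passes no extra parameters. In the run shown below, the processors are applied in the order listed.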
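
As a rough mental model of this refine, deduplicate, filter chain, the following standalone sketch mimics the three steps with the standard library only. It is an approximation under stated assumptions, not DataFlow's implementation: the function names are hypothetical, the deduplication assumes a truncated SHA-1 digest of the lowercased text (with `bit_length` controlling the truncation), and the n-gram score is assumed to be the fraction of distinct word n-grams. The real processors may compute these differently.

```python
import hashlib
import re
from typing import List


def remove_extra_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedup_by_hash(texts: List[str], bit_length: int = 64) -> List[str]:
    """Keep the first occurrence of each text, comparing truncated SHA-1 digests.

    Assumption: exact-match deduplication on the lowercased text.
    """
    seen = set()
    kept = []
    for t in texts:
        digest = hashlib.sha1(t.lower().encode("utf-8")).digest()
        key = digest[: bit_length // 8]  # keep only the first `bit_length` bits
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept


def ngram_uniqueness(text: str, n: int = 5) -> float:
    """Fraction of distinct word n-grams; 1.0 means no n-gram repeats."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 1.0  # assumption: texts shorter than n words count as fully unique
    return len(set(grams)) / len(grams)


def run_pipeline(texts: List[str], min_score: float = 0.99,
                 max_score: float = 1.0, n: int = 5) -> List[str]:
    """Apply the three steps in order: refine, deduplicate, filter."""
    refined = [remove_extra_spaces(t) for t in texts]
    unique = dedup_by_hash(refined)
    return [t for t in unique if min_score <= ngram_uniqueness(t, n) <= max_score]


# Made-up samples for illustration only.
samples = [
    "The quick  brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate after refining
    "buy now " * 10,                                 # highly repetitive
    "A short, clean sentence with no repeated phrases at all.",
]
result = run_pipeline(samples)
print(len(samples), "->", len(result))
```

Note that the order matters in this sketch: the first two samples only become identical after whitespace is normalized, so refining before deduplicating removes one of them.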

In [None]:

import sys
import os

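# Switch to the project root (two levels above the current directory) so that
# the relative paths in the configuration file resolve correctly.
target_dir = os.path.abspath('../..')
if os.getcwd() != target_dir:
    os.chdir(target_dir)

# Make the `dataflow` package importable from the project root.
sys.path.insert(0, target_dir)
# Emulate the command-line arguments, since there is no real command line in a notebook.
sys.argv = ['notebook', '--config', 'configs/process/text_process_example.yaml']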

from dataflow.utils.utils import process
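
The cell above overwrites `sys.argv` because a notebook has no command line of its own. The sketch below shows why that works for any function that parses command-line arguments. It assumes an `argparse`-style parser with a `--config` option; whether `process()` uses `argparse` internally is an assumption, and `parse_config_path` is a hypothetical helper written only for this illustration.

```python
import argparse
import sys
from typing import List, Optional


def parse_config_path(argv: Optional[List[str]] = None) -> str:
    """Return the value of --config (hypothetical helper, not DataFlow's API)."""
    parser = argparse.ArgumentParser(prog="notebook")
    parser.add_argument("--config", required=True,
                        help="Path to the YAML configuration file")
    # When argv is None, argparse reads sys.argv[1:].
    args = parser.parse_args(argv)
    return args.config


# Emulate a command line, as the notebook cell does.
sys.argv = ["notebook", "--config", "configs/process/text_process_example.yaml"]
print(parse_config_path())
```

The first element of `sys.argv` is the program name and is skipped by the parser, which is why a placeholder such as `'notebook'` is sufficient.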



In [2]:
process()


RemoveExtraSpacesRefiner {'num_workers': 1, 'model_cache_dir': '../ckpt'}


Implementing RemoveExtraSpacesRefiner: 100%|██████████| 5/5 [00:00<00:00, 1969.34it/s]

Implemented RemoveExtraSpacesRefiner. 4 data refined.
CCNetDeduplicator {'bit_length': 64, 'num_workers': 1, 'model_cache_dir': '../ckpt'}





Module dataflow.process.text.refiners has no attribute CCNetDeduplicator
Module dataflow.process.text.filters has no attribute CCNetDeduplicator


Implementing CCNetDeduplicator: 100%|██████████| 5/5 [00:00<00:00, 14112.73it/s]

Implemented CCNetDeduplicator. Data Number: 5 -> 4
NgramFilter {'min_score': 0.99, 'max_score': 1.0, 'scorer_args': {'ngrams': 5}, 'num_workers': 1, 'model_cache_dir': '../ckpt'}





Module dataflow.process.text.refiners has no attribute NgramFilter


Evaluating NgramScore: 100%|██████████| 4/4 [00:00<00:00, 337.88it/s]

Implemented NgramFilter. Data Number: 4 -> 3
Data saved to ./processed.jsonl
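
To inspect the result, the saved file can be read back line by line. This is a minimal sketch that assumes only that `./processed.jsonl` follows the JSON Lines convention of one JSON object per line; `read_jsonl` is a hypothetical helper, and the field names in each record depend on the input data.

```python
import json
import os
from typing import Dict, List


def read_jsonl(path: str) -> List[Dict]:
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


output_path = "./processed.jsonl"  # matches `save_path` in the configuration
if os.path.exists(output_path):
    records = read_jsonl(output_path)
    print(f"{len(records)} records kept")
    if records:
        print("Fields in the first record:", sorted(records[0].keys()))
else:
    print(f"{output_path} not found; run process() first")
```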



