### Text Data Evaluation Usage

For details on the methods used to evaluate text data, see the
[Evaluation Algorithm Documentation](../../docs/text_metrics.md).

Below is a simple example YAML configuration file, `configs/eval/text_scorer_pt_example1.yaml`:

```yaml
# Only some example scorers are listed here. Please refer to all_scorers.yaml for all scorers

model_cache_path: '../ckpt' # cache path for models
dependencies: [text]
save_path: './scores.json'

data:
  text:
    use_hf: False # Whether to load a Hugging Face dataset; if True, the local data_path below is ignored
    dataset_name: 'yahma/alpaca-cleaned'
    dataset_split: 'train'
    name: 'default'
    revision: null
    data_path: 'demos/text_eval/fineweb_5_samples.json'  # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type

    keys: 'text' # Key(s) to evaluate; for SFT data, this can be a list such as ['instruction','input','output']
```

The `data` section specifies the data source (a Hugging Face dataset or a local file/folder path), the formatter used to load it, and the key(s) to evaluate.

```yaml
scorers: # You can select multiple text scorers from all_scorers.yaml and put their configuration information here
  NgramScorer:
    ngrams: 5
  FineWebEduScorer:
    model_name: 'HuggingFaceTB/fineweb-edu-classifier'
    device: 'cuda:0'
```
The configuration under `scorers` sets the parameters for each selected scorer.
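One way to picture this mapping from scorer name to parameters is a small registry that passes each parameter block to a constructor as keyword arguments. The sketch below is illustrative only: `NgramRatioSketch`, `REGISTRY`, and `build_scorers` are hypothetical names, and the unique n-gram ratio is an assumed stand-in, since the formula `NgramScorer` actually uses is not stated here.

```python
class NgramRatioSketch:
    """Hypothetical stand-in scorer: unique n-grams / total n-grams."""

    def __init__(self, ngrams=5):
        self.n = ngrams

    def score(self, text):
        tokens = text.split()  # assumption: plain whitespace tokenization
        total = len(tokens) - self.n + 1
        if total <= 0:
            return 0.0  # text is shorter than one n-gram
        grams = {tuple(tokens[i:i + self.n]) for i in range(total)}
        return len(grams) / total


# Hypothetical registry: config name -> scorer class.
REGISTRY = {"NgramScorer": NgramRatioSketch}


def build_scorers(scorers_cfg):
    """Instantiate one scorer per entry in the `scorers` mapping."""
    built = {}
    for name, params in scorers_cfg.items():
        if name not in REGISTRY:
            raise KeyError(f"Unknown scorer: {name}")
        # An empty YAML block parses to None, so fall back to no kwargs.
        built[name] = REGISTRY[name](**(params or {}))
    return built


scorers = build_scorers({"NgramScorer": {"ngrams": 5}})
```

Here the `ngrams: 5` entry from the YAML becomes the `ngrams=5` constructor argument.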

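```python
import sys
import os

# Switch to the directory two levels up (expected to be the repository root)
# so that relative paths such as the config path below resolve.
target_dir = os.path.abspath('../..')
current_dir = os.getcwd()

if current_dir != target_dir:
    os.chdir(target_dir)

# Add the path two levels above the current working directory to the import search path.
dataflow_path = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
sys.path.insert(0, dataflow_path)

# Emulate the command-line arguments so that the config file is picked up.
sys.argv = ['notebook', '--config', 'configs/eval/text_scorer_pt_example1.yaml']

from dataflow.utils.utils import calculate_score
```
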
```python
calculate_score()
```


Output:

```text
NgramScorer {'ngrams': 5, 'num_workers': 1, 'model_cache_dir': '../ckpt'}
Evaluating NgramScore: 100%|██████████| 5/5 [00:01<00:00,  4.91it/s]
LexicalDiversityScorer {'metrics_to_keep': {'mtld': True, 'hdd': True}, 'num_workers': 1, 'model_cache_dir': '../ckpt'}
Evaluating LexicalDiversityScore: 100%|██████████| 5/5 [00:00<00:00,  5.06it/s]
scores_len:5
Scores saved to ./scores.json
```
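
The layout of the saved scores file is not described here, so a cautious first step is to inspect its shape rather than assume particular keys. The sketch below is a generic helper (`describe` is a hypothetical name, not part of DataFlow) that summarizes the structure of any JSON value; it reads `./scores.json` only if that file exists.

```python
import json
from pathlib import Path


def describe(value, depth=2):
    """Return a short string describing the shape of a JSON value."""
    if isinstance(value, dict):
        if depth == 0:
            return "dict"
        inner = ", ".join(f"{k}: {describe(v, depth - 1)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, list):
        if not value or depth == 0:
            return f"list[{len(value)}]"
        # Only the first element is inspected; lists are assumed homogeneous.
        return f"list[{len(value)}] of {describe(value[0], depth - 1)}"
    return type(value).__name__


scores_path = Path("./scores.json")  # matches save_path in the config above
if scores_path.exists():
    print(describe(json.loads(scores_path.read_text(encoding="utf-8"))))
```

Increasing `depth` reveals more nesting levels once the top-level structure is known.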



