# RecDP LLM - Dataset Score Assessment

This notebook shows how to use RecDP's LLM operations to assess a dataset's quality, diversity, toxicity, perplexity, and ROUGE scores.

# Get Started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
!wget https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/alpaca_sample_10.parquet

## 3. Score Assessment

In [2]:
import os  # used by os.path.join in section 3.2

from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/
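
The message above indicates that a default JDK path is used when `JAVA_HOME` is unset. If your JDK is installed elsewhere, one option is to set the variable yourself before running the import cell. The following is a minimal sketch; the helper name `ensure_java_home` is hypothetical and not part of pyrecdp, and the default path is simply the one reported in the log.

```python
import os

# Assumption: this is the path reported in the log message above;
# change it if your JDK lives in a different location.
DEFAULT_JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64/"


def ensure_java_home(default=DEFAULT_JAVA_HOME, env=os.environ):
    """Set JAVA_HOME to `default` only if it is unset; return the value in effect."""
    return env.setdefault("JAVA_HOME", default)


print("JAVA_HOME =", ensure_java_home())
```

An existing `JAVA_HOME` value is left untouched, so the sketch is safe to run on machines that are already configured.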




### 3.1 Process with toxicity and perplexity scorer

In [5]:
pipeline = ResumableTextPipeline()
pipeline.enable_statistics()
ops = [
    ParquetReader("/content/test_data/"),
    TextPerplexityScore(),
    TextToxicity(huggingface_config_path="/root/.cache/huggingface/hub/models--xlm-roberta-base"),
    ParquetWriter("ResumableTextPipeline_output-1")
]
pipeline.add_operations(ops)
ret = pipeline.execute()
del pipeline

2023-11-13 07:03:27.160 | INFO     | pyrecdp.core.model_utils:prepare_sentencepiece_model:107 - Loading sentencepiece model...
2023-11-13 07:03:27.238 | INFO     | pyrecdp.core.model_utils:prepare_kenlm_model:125 - Loading kenlm language model...
[DatasetReader, PerfileSourcedParquetReader, TextPerplexityScore, TextToxicity, PerfileParquetWriter]
init ray with total mem of 324137904537


2023-11-13 07:03:33,059	INFO worker.py:1642 -- Started a local Ray instance.


(pid=59708) Parquet Files Sample 0:   0%|          | 0/1 [00:00<?, ?it/s]

[2m[36m(_get_reader pid=59708)[0m   pq_ds.pieces, **prefetch_remote_args
[2m[36m(_get_reader pid=59708)[0m   self._pq_pieces = [_SerializedPiece(p) for p in pq_ds.pieces]
[2m[36m(_get_reader pid=59708)[0m   self._pq_paths = [p.path for p in pq_ds.pieces]
2023-11-13 07:03:38,953	INFO read_api.py:406 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.

  0%|          | 0/1 [00:00<?, ?it/s]
ResumableTextPipeline, current on alpaca_sample_10.parquet:   0%|          | 0/1 [00:00<?, ?it/s]
2023-11-13 07:03:38,999	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(192)] -> TaskPoolMapOperator[Map(<lambda>)]
2023-11-13 07:03:39,004	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbos

Running 0:   0%|          | 0/36864 [00:00<?, ?it/s]

2023-11-13 07:03:41,105	INFO dataset.py:2380 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2023-11-13 07:03:41,109	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(192)] -> TaskPoolMapOperator[Map(<lambda>)->Map(<lambda>)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2023-11-13 07:03:41,112	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-11-13 07:03:41,113	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/36864 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/36864 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/36864 [00:00<?, ?it/s]

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

ResumableTextPipeline, current on alpaca_sample_10.parquet:   0%|          | 0/1 [00:15<?, ?it/s]


KeyboardInterrupt: 

### 3.2 Process with QualityScorer, Diversity and Rouge scorer

In [16]:
pipeline = ResumableTextPipeline()
pipeline.enable_statistics()
out_dir = "ResumableTextPipeline_output-2"
ops = [
    ParquetReader("/content/test_data/alpaca_sample_10.parquet"),
    TextDiversityIndicate(out_dir=out_dir, language="en", first_sent=False),
    TextQualityScorer(text_key="text", model="gpt3"),
    RougeScoreDedup(max_ratio=0.7, batch_size=10, score_store_path=os.path.join(out_dir, 'RougeScorefiltered.parquet')),
    ParquetWriter(out_dir)
]
pipeline.add_operations(ops)
ret = pipeline.execute()
del pipeline

[DatasetReader, PerfileSourcedParquetReader, TextDiversityIndicate, TextQualityScorer, RougeScoreDedup, PerfileParquetWriter]
Will assign 48 cores and 412162 M memory for spark
per core memory size is 8.385 GB and shuffle_disk maximum capacity is 8589934592.000 GB
execute with spark for global tasks started ...
DatasetReader
2023-11-13 07:18:08.664 | INFO     | pyrecdp.LLM.TextPipeline:op_summary:281 - DatasetReader: A total of 0 rows of data were processed, using 0 seconds, with 0 rows modified or removed, 0 rows of data remaining.
execute with spark for global tasks took 0.003243283019401133 sec
PerfileSourcedParquetReader


ResumableTextPipeline, current on alpaca_sample_10.parquet:   0%|          | 0/1 [00:00<?, ?it/s]

alpaca_sample_10.parquet
TextDiversityIndicate
statistics_decorator spark


                                                                                

TextQualityScorer
statistics_decorator spark
model_name is gpt3
2023-11-13 07:18:18.588 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:prepare_model:122 - Preparing scorer model in [/root/.cache/recdp/models/gpt3_quality_model]...
real_model_path is /root/.cache/recdp/models/gpt3_quality_model


                                                                                

2023-11-13 07:18:20.270 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:predict:252 - Start scoring dataset...
RougeScoreDedup
statistics_decorator spark



  0%|          | 0/1 [00:00<?, ?it/s]

Round 0 started ...




2023-11-13 07:18:26.178 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 0: total processing num_samples is 45, detected high score num_samples is 0


                                                                                
100%|██████████| 1/1 [00:06<00:00,  6.03s/it]


Round 0 took 6.029559508984676 sec
generate_connected_components => duplicates started ...



0it [00:00, ?it/s]

generate_connected_components => duplicates took 0.017523382004583254 sec





2023-11-13 07:18:27.515 | INFO     | pyrecdp.LLM.TextPipeline:op_summary:277 - TextDiversityIndicate: A total of 10 rows of data were processed, using 9.549101114273071 seconds, Get max diversity types 10, Get average diversity types 1.9,Get the std of diversity types 2.8460498941515415
2023-11-13 07:18:27.519 | INFO     | pyrecdp.LLM.TextPipeline:op_summary:277 - TextQualityScorer: A total of 10 rows of data were processed, using 2.182723045349121 seconds, Get average quality score 0.9541670706928078
2023-11-13 07:18:27.527 | INFO     | pyrecdp.LLM.TextPipeline:op_summary:277 - RougeScoreDedup: A total of 10 rows of data were processed, using 6.3509862422943115 seconds, A duplication list containing 0 found, around 0.0% of total data, Sampled, duplication preview: Empty DataFrame
Columns: [id_1, id_2, id_pair, similarity_left, similari

ResumableTextPipeline, current on alpaca_sample_10.parquet: 100%|██████████| 1/1 [00:18<00:00, 18.68s/it]

2023-11-13 07:18:27.539 | INFO     | pyrecdp.LLM.TextPipeline:execute:421 - Completed! ResumableTextPipeline will not return dataset, please check ResumableTextPipeline_output-2 for verification.





### 3.3 View scores

In [14]:
import json

# The perplexity and toxicity statistics stay disabled here: those operations are not part of
# the pipeline in 3.2, so printing them would raise a NameError.
# ppl_score = json.load(open("ResumableTextPipeline_output-2/TextPerplexityScore-statistics", "r"))
# toxicity_score = json.load(open("ResumableTextPipeline_output-2/TextToxicity-statistics", "r"))
quality_score = json.load(open("ResumableTextPipeline_output-2/TextQualityScorer-statistics", "r"))
diversity_score = json.load(open("ResumableTextPipeline_output-2/TextDiversityIndicate-statistics", "r"))
rouge_score = json.load(open("ResumableTextPipeline_output-2/RougeScoreDedup-statistics", "r"))

# print("Perplexity scores: ", ppl_score)
# print("Toxicity scores: ", toxicity_score)
print("Quality scores: ", quality_score)
print("Diversity scores: ", diversity_score)
print("Rouge scores: ", rouge_score)

Quality scores:  {'mean': 0.9541670706928078}
Diversity scores:  {'max': nan, 'mean': nan, 'std': nan}
Rouge scores:  {'dup_num': 0, 'dup_ratio': 0.0}
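
Since a statistics file may be missing (for example, when a pipeline run is interrupted) and some fields can come back as `nan`, a more defensive way to read the scores might look like the sketch below. It assumes only what the cell above shows: each operation writes a JSON file named `<OperationName>-statistics` into the output directory. The helpers `load_statistics` and `nan_fields` are hypothetical names introduced for illustration, not pyrecdp functions.

```python
import json
import math
import os


def load_statistics(out_dir, op_name):
    """Return the parsed `<out_dir>/<op_name>-statistics` JSON, or None if the file is absent."""
    path = os.path.join(out_dir, f"{op_name}-statistics")
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        return json.load(f)


def nan_fields(stats):
    """List the keys whose value is a float NaN (an empty list for None or empty stats)."""
    return [k for k, v in (stats or {}).items() if isinstance(v, float) and math.isnan(v)]


for name in ["TextQualityScorer", "TextDiversityIndicate", "RougeScoreDedup"]:
    stats = load_statistics("ResumableTextPipeline_output-2", name)
    if stats is None:
        print(f"{name}: no statistics file found")
    else:
        print(f"{name}: {stats} (NaN fields: {nan_fields(stats)})")
```

Returning `None` for a missing file keeps the loop running, so one absent operation does not hide the scores of the others.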
