# RecDP LLM - ROUGE score deduplication

Remove near-duplicate text samples by computing ROUGE scores between them, using the Python library [rouge-score](https://pypi.org/project/rouge-score/).
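To make the idea of a ROUGE-based similarity score concrete, here is a minimal pure-Python sketch of the ROUGE-L F-measure (longest common subsequence over tokens). It is an illustration only: the helper names `lcs_length` and `rouge_l_f1` are hypothetical, the tokenizer is a plain lowercase whitespace split, and it is not the implementation used by the rouge-score package or by RecDP.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure over lowercase whitespace tokens (simplified tokenizer)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
c = "completely unrelated sentence about databases"
score_ab = rouge_l_f1(a, b)  # near-duplicates: 8 of 9 tokens are shared in order
score_ac = rouge_l_f1(a, c)  # no shared tokens
print(score_ab, score_ac)
```

A pair whose score is above a chosen threshold would be treated as near-duplicate; the threshold itself is a tuning choice.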


### Use case
* Expected input format: a folder of `*.jsonl` files.
* Expected output format: a folder of `*.jsonl` files after duplicate reduction.
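As a rough illustration of the expected input layout, the sketch below writes a small JSON Lines file into a folder and reads it back. The records, the temporary folder, and the file name `sample.jsonl` are made up for the example; the `text` field is an assumption that matches the `text_key='text'` argument used later in this notebook.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; "text" matches the text_key used later in this notebook.
records = [
    {"text": "A first short example document."},
    {"text": "A completely different document."},
]

input_dir = Path(tempfile.mkdtemp())  # stand-in for a real input folder
sample_path = input_dir / "sample.jsonl"

# JSON Lines: one JSON object per line.
with sample_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read every *.jsonl file in the folder back, line by line.
loaded = [
    json.loads(line)
    for path in sorted(input_dir.glob("*.jsonl"))
    for line in path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
print(len(loaded), loaded[0]["text"])
```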


# Get started

## 1. Install pyrecdp and dependencies

In [4]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
openjdk-8-jre is already the newest version (8u382-ga-1~22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


## 2. Prepare your own data

In [2]:
%mkdir -p /content/test_data
%cd /content/test_data
file_names = ['tiny_c4_sample.jsonl']
# Build the full download URL for each sample file.
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/{i}" for i in file_names]
!wget -P /content/test_data {" ".join(file_list)}

/content/test_data
--2023-11-10 07:49:21--  https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/tiny_c4_sample.jsonl
Connecting to proxy-prc.intel.com (proxy-prc.intel.com)|10.240.252.16|:912... connected.
Proxy request sent, awaiting response... 200 OK
Length: 1062126 (1.0M) [text/plain]
Saving to: ‘/content/test_data/tiny_c4_sample.jsonl’


2023-11-10 07:49:23 (922 KB/s) - ‘/content/test_data/tiny_c4_sample.jsonl’ saved [1062126/1062126]

FINISHED --2023-11-10 07:49:23--
Total wall clock time: 2.2s
Downloaded: 1 files, 1.0M in 1.1s (922 KB/s)


## 3. Fuzzy deduplication (separate detection and reduction)

In [3]:
! ls /content/test_data

tiny_c4_sample.jsonl


### 3.1 Pipeline-based API

In [2]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

pipeline = ResumableTextPipeline()
# Statistics are optional: enable_statistics() prints summary info for each operation,
# while leaving it disabled gives better performance.
pipeline.enable_statistics()

# Add operations to the pipeline
ops = [
    JsonlReader("/content/test_data/"),
    RougeScoreDedup(text_key='text', max_ratio=0.7, batch_size=20),
]
pipeline.add_operations(ops)
# pipeline.plot()



{0: {'children': None, 'op': 'DatasetReader', 'config': {}},
 1: {'children': [0], 'op': 'JsonlReader', 'config': {'input_dir': '/content/test_data/'}},
 2: {'children': [1], 'op': 'RougeScoreDedup', 'config': {'text_key': 'text', 'max_ratio': 0.7, 'batch_size': 20, 'score_store_path': 'RougeScorefiltered.parquet'}}}

In [3]:
ret = pipeline.execute()

[DatasetReader, PerfileSourcedJsonlReader, RougeScoreDedup, PerfileParquetWriter]
Will assign 48 cores and 412162 M memory for spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/10 07:51:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

per core memory size is 8.385 GB and shuffle_disk maximum capacity is 8589934592.000 GB
execute with spark for global tasks started ...
DatasetReader
2023-11-10 07:51:58.406 | INFO     | pyrecdp.LLM.TextPipeline:execute:365 - DatasetReader: A total of 0 rows of data were processed, using 0 seconds, with 0 rows modified or removed, 0 rows of data remaining.
execute with spark for global tasks took 0.003309955000077025 sec
PerfileSourcedJsonlReader

ResumableTextPipeline, current on tiny_c4_sample.jsonl:   0%|          | 0/1 [00:00<?, ?it/s]

tiny_c4_sample.jsonl
RougeScoreDedup
statistics_decorator spark

Round 0 started ...
2023-11-10 07:52:35.658 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 0: total processing num_samples is 8770, detected high score num_samples is 11
Round 0 took 32.78911198000003 sec
Round 1 started ...
2023-11-10 07:53:14.611 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 1: total processing num_samples is 8370, detected high score num_samples is 0
Round 1 took 37.82962588499993 sec
Round 2 started ...
2023-11-10 07:54:20.294 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 2: total processing num_samples is 7970, detected high score num_samples is 0
Round 2 took 65.57384711899999 sec
Round 3 started ...
2023-11-10 07:54:39.516 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 3: total processing num_samples is 7570, detected high score num_samples is 0
Round 3 took 19.224398283000028 sec
Round 4 started ...
2023-11-10 07:54:58.154 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 4: total processing num_samples is 7170, detected high score num_samples is 2
Round 4 took 18.62540297399994 sec
Round 5 started ...
2023-11-10 07:55:12.462 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 5: total processing num_samples is 6770, detected high score num_samples is 0
Round 5 took 14.275545535999981 sec
Round 6 started ...
2023-11-10 07:55:24.737 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 6: total processing num_samples is 6370, detected high score num_samples is 0
Round 6 took 12.261500141999932 sec
Round 7 started ...
2023-11-10 07:55:38.443 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 7: total processing num_samples is 5970, detected high score num_samples is 0
Round 7 took 13.671373291000009 sec
Round 8 started ...
2023-11-10 07:55:50.146 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 8: total processing num_samples is 5570, detected high score num_samples is 5
Round 8 took 11.786905280999918 sec
Round 9 started ...
2023-11-10 07:55:59.721 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 9: total processing num_samples is 5170, detected high score num_samples is 1
Round 9 took 9.56197019800004 sec
Round 10 started ...
2023-11-10 07:56:10.393 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 10: total processing num_samples is 4770, detected high score num_samples is 3
Round 10 took 10.65314421000005 sec
Round 11 started ...
2023-11-10 07:56:24.142 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 11: total processing num_samples is 4370, detected high score num_samples is 0
Round 11 took 13.806791524000005 sec
Round 12 started ...
2023-11-10 07:56:38.419 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 12: total processing num_samples is 3970, detected high score num_samples is 0
Round 12 took 14.27520691899997 sec
Round 13 started ...
2023-11-10 07:56:47.479 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 13: total processing num_samples is 3570, detected high score num_samples is 0
Round 13 took 8.987865287999966 sec
Round 14 started ...
2023-11-10 07:57:07.598 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 14: total processing num_samples is 3170, detected high score num_samples is 0
Round 14 took 20.170672734999926 sec
Round 15 started ...
2023-11-10 07:57:22.569 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 15: total processing num_samples is 2770, detected high score num_samples is 0
Round 15 took 14.940632277000077 sec
Round 16 started ...
2023-11-10 07:57:30.755 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 16: total processing num_samples is 2370, detected high score num_samples is 4
Round 16 took 8.218632835000108 sec
Round 17 started ...
2023-11-10 07:57:39.984 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 17: total processing num_samples is 1970, detected high score num_samples is 0
Round 17 took 9.267777621999812 sec
Round 18 started ...
2023-11-10 07:57:46.405 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 18: total processing num_samples is 1570, detected high score num_samples is 1
Round 18 took 6.425793315999954 sec
Round 19 started ...
2023-11-10 07:58:13.327 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 19: total processing num_samples is 1170, detected high score num_samples is 0
Round 19 took 26.911595658000124 sec
Round 20 started ...
2023-11-10 07:58:24.568 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 20: total processing num_samples is 770, detected high score num_samples is 0
Round 20 took 11.270802099999855 sec
Round 21 started ...
2023-11-10 07:58:28.352 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 21: total processing num_samples is 370, detected high score num_samples is 0
Round 21 took 3.779994233000025 sec
Round 22 started ...
2023-11-10 07:58:31.054 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:105 - Round 22: total processing num_samples is 36, detected high score num_samples is 0
Round 22 took 2.671257287999879 sec
generate_connected_components => duplicates started ...
100%|██████████| 27/27 [00:00<00:00, 16104.41it/s]
2023-11-10 07:58:32.416 | INFO     | pyrecdp.primitives.operations.text_compare_dedup:process_spark:130 - Finally detected duplicated num_samples is 12
generate_connected_components => duplicates took 0.2964779440001166 sec
2023-11-10 07:58:32.930 | INFO     | pyrecdp.LLM.TextPipeline:execute:404 - RougeScoreDedup: A total of 449 rows of data were processed, using 388.623343706131 seconds, A duplication list containing 437 found, around 97.32739420935413% of total data, Sampled, duplication preview:
    id_1  id_2     id_pair                                    similarity_left  \
0    351    19   19 :: 351  After tipping 25 tokens in a day, you'll be ab...
1    201     4    4 :: 201  After tipping 25 tokens in a day, you'll be ab...
2    351     4    4 :: 351  After tipping 25 tokens in a day, you'll be ab...
3     19     4     4 :: 19  After tipping 25 tokens in a day, you'll be

ResumableTextPipeline, current on tiny_c4_sample.jsonl: 100%|██████████| 1/1 [06:28<00:00, 388.99s/it]

2023-11-10 07:58:32.939 | INFO     | pyrecdp.LLM.TextPipeline:execute:409 - Completed! ResumableTextPipeline will not return dataset, please check ResumableTextPipeline_output_20231110075154 for verification.
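The per-round `total processing num_samples` values in the log (8770, 8370, ..., 370, 36) are consistent with a batched all-pairs comparison: with 449 rows and `batch_size=20`, each round would score one batch against itself and against every later row. This is an inference from the logged numbers, not a confirmed description of pyrecdp internals, and `pairs_per_round` is a hypothetical helper written for this check.

```python
from math import comb


def pairs_per_round(num_rows, batch_size):
    """Pairs compared per round if each batch is scored against itself and all later rows."""
    counts = []
    remaining = num_rows
    while remaining > 0:
        batch = min(batch_size, remaining)
        later = remaining - batch
        # Pairs inside the batch plus pairs between the batch and the later rows.
        counts.append(comb(batch, 2) + batch * later)
        remaining = later
    return counts


counts = pairs_per_round(num_rows=449, batch_size=20)
print(len(counts), counts[0], counts[1], counts[-1], sum(counts))
```

Under this assumption the total work is every unordered pair of rows, so the cost grows quadratically with the number of rows in a file.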





In [None]:
pipeline.plot()
del pipeline
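The execution log mentions a `generate_connected_components => duplicates` step that turns the detected high-score pairs into a final duplicate list. The sketch below shows one common way to do this with a union-find structure, using the four id pairs from the duplication preview in the log. It is a sketch under assumptions: the function names are hypothetical, and keeping the smallest id of each group is an arbitrary choice for illustration, not necessarily what pyrecdp does.

```python
def connected_components(pairs):
    """Group ids that are linked, directly or transitively, by duplicate pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def rows_to_drop(pairs):
    """Keep the smallest id of each component and drop the rest (illustrative policy)."""
    drop = set()
    for group in connected_components(pairs):
        drop |= group - {min(group)}
    return drop


# (id_1, id_2) pairs taken from the duplication preview in the log.
pairs = [(351, 19), (201, 4), (351, 4), (19, 4)]
print(sorted(rows_to_drop(pairs)))
```

Grouping by connected components means that if A matches B and B matches C, all three end up in one group even when A and C were never flagged as a pair.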