# RecDP LLM - Fuzzy Deduplication

Fuzzy deduplication is a tool to detect near-duplicate documents in the data and remove them.

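### We support two types of input and output:

example 1:
* Expected input format: a folder of *.jsonl files.
* Expected output format: a folder of *.jsonl files after reduction.

example 2:
* Expected input format: a Spark DataFrame.
* Expected output format: a Spark DataFrame after reduction.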
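
As an illustration of the *.jsonl input, here is a minimal sketch that writes and reads a tiny file. It assumes each line is one JSON object with a `text` field and a `meta` field, as in the NIH sample printed later in this notebook; the helper name `read_jsonl` and the sample records are hypothetical and not part of pyrecdp.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file and return its records as a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: every document carries its content in a "text" field.
            if "text" not in record:
                raise ValueError(f"line {line_no} has no 'text' field")
            records.append(record)
    return records


# Hypothetical two-document sample, shaped like the NIH records.
sample = [
    {"meta": {"APPLICATION_ID": 1}, "text": "first document"},
    {"meta": {"APPLICATION_ID": 2}, "text": "second document"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    loaded = read_jsonl(sample_path)

print(f"loaded {len(loaded)} documents")
```

Running a check like this before deduplication can catch malformed lines early.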

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
file_names = ['NIH_sample.jsonl']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/PILE/{i}" for i in file_names]
!wget -P /content/test_data {" ".join(file_list)}

## 3. Fuzzy deduplicate (separate detection and reduction)

In [27]:
! ls /content/test_data

NIH_sample.jsonl


In [28]:
data_files = ["/content/test_data/NIH_sample.jsonl"]
dup_dir = "/content/fuzzy_dedup"

ngram_size = 13 # number of words per n-gram used for comparison
num_perm = 256 # number of MinHash permutations used to represent each document
# bands and ranges affect the probabilities of false positives and false negatives.
ranges = 13
bands = 9

from pyrecdp.primitives.llmutils import near_dedup
import pandas as pd

near_dedup(data_files, dup_dir, ngram_size, num_perm, bands, ranges)

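# Validate the detection result
import pickle
print("Detected duplications are:")
# The pickle holds the duplicate clusters (as index lists), the pair count, and the index-to-id mapping
with open(f"{dup_dir}/connected_components.pickle", 'rb') as f:
    connects, num_pair, index_list = pickle.load(f)
# Map each cluster's indices back to document ids of the form "<file name>@<row id>"
connected_component_reverse = [[index_list[j] for j in i] for i in connects]
connected_component_reverse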

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
Load data with RowID started ...
Load data with RowID took 2.592839109000124 sec
num_bands is 9, ranges is 13
generate minHashLsh started ...
generate minHashLsh took 63.16276241699961 sec
generate_connected_components all started ...
Started graph building


loop on file: 100%|██████████| 2/2 [00:00<00:00, 5312.61it/s]


length of the set of duplicates: 13 0.01398324966430664


100%|██████████| 13/13 [00:00<00:00, 71275.75it/s]


number of connected components: 7 0.02257537841796875
Graph generated duplicates list!!! 0.022832632064819336
generate_connected_components all took 0.02385874599985982 sec
generate_duplicates_dict all started ...
Processing duplicates!!!


100%|██████████| 7/7 [00:00<00:00, 46163.72it/s]

number of duplicate documents that will be removed: 12
generate_duplicates_dict all took 0.009320153000317077 sec
Completed!!
    total processed 10000 documents
    total detected 12 duplicated documents
    duplicate ratio is 0.0012





Detected duplications are:


[['NIH_sample.jsonl@1769', 'NIH_sample.jsonl@1764', 'NIH_sample.jsonl@1765'],
 ['NIH_sample.jsonl@245',
  'NIH_sample.jsonl@243',
  'NIH_sample.jsonl@246',
  'NIH_sample.jsonl@248',
  'NIH_sample.jsonl@244',
  'NIH_sample.jsonl@247'],
 ['NIH_sample.jsonl@1191', 'NIH_sample.jsonl@1190'],
 ['NIH_sample.jsonl@7746', 'NIH_sample.jsonl@7745'],
 ['NIH_sample.jsonl@9026', 'NIH_sample.jsonl@8561'],
 ['NIH_sample.jsonl@8200', 'NIH_sample.jsonl@7354'],
 ['NIH_sample.jsonl@3037', 'NIH_sample.jsonl@3024']]
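
The detection cell notes that `bands` and `ranges` affect false positives and false negatives. The sketch below shows why, assuming the standard MinHash LSH banding scheme, where two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b` for `b` bands of `r` rows each. Whether pyrecdp follows exactly this scheme internally is an assumption; the function names are hypothetical.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents agree on every row of at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Similarity around which the candidate probability rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


bands, ranges = 9, 13  # values used in the cell above

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    p = candidate_probability(s, bands, ranges)
    print(f"similarity {s:.2f} -> candidate probability {p:.4f}")

print(f"approximate threshold: {approximate_threshold(bands, ranges):.3f}")
```

More bands raise the chance that a pair is flagged (fewer false negatives, more false positives). More rows per band have the opposite effect.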

In [29]:
# Apply the duplicate list to the original data to remove duplicates

from pyrecdp.primitives.llmutils import shrink_document_MP
import os

data_dir = "/content/test_data/"
dup_dir = "/content/fuzzy_dedup"
dup_dict = os.path.join(dup_dir, "duplicates.pickle")
out_dir = os.path.join(dup_dir, "output")

shrink_document_MP(data_dir, dup_dict, out_dir)

# validate
print("\nReduction is completed, checkout the new jsonl filesize")
! ls "/content/fuzzy_dedup/output"
! cat /content/fuzzy_dedup/output/* | wc -l

resetting to 1 for number of processes
parallelize with 1 processes


100%|██████████| 1/1 [00:00<00:00, 12.83it/s]


Reduction is completed, checkout the new jsonl filesize





NIH_sample.jsonl
9988
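
The line count of 9988 can be cross-checked against the detection result. The sketch below copies the seven connected components printed earlier and assumes that one document is kept per component while the rest are removed; that assumption matches the reported numbers, though which member is kept is not shown in this notebook.

```python
# Connected components copied from the detection output above.
components = [
    ["NIH_sample.jsonl@1769", "NIH_sample.jsonl@1764", "NIH_sample.jsonl@1765"],
    ["NIH_sample.jsonl@245", "NIH_sample.jsonl@243", "NIH_sample.jsonl@246",
     "NIH_sample.jsonl@248", "NIH_sample.jsonl@244", "NIH_sample.jsonl@247"],
    ["NIH_sample.jsonl@1191", "NIH_sample.jsonl@1190"],
    ["NIH_sample.jsonl@7746", "NIH_sample.jsonl@7745"],
    ["NIH_sample.jsonl@9026", "NIH_sample.jsonl@8561"],
    ["NIH_sample.jsonl@8200", "NIH_sample.jsonl@7354"],
    ["NIH_sample.jsonl@3037", "NIH_sample.jsonl@3024"],
]

total_documents = 10000  # "total processed 10000 documents" in the log

# Assumption: keep one document per component, remove the others.
removed = sum(len(component) - 1 for component in components)
remaining = total_documents - removed

print(f"removed: {removed}, remaining: {remaining}")  # removed: 12, remaining: 9988
```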


In [30]:
# Visually compare the documents detected as duplicates
print(f"First duplication is {connected_component_reverse[0]}")
print("You'll see the similar content in above documents")

for f_id in connected_component_reverse[0]:
  print(f_id)
  f_name, rid = f_id.split("@")
  ! sed -n {rid}p {f_name}


First duplication is ['NIH_sample.jsonl@1769', 'NIH_sample.jsonl@1764', 'NIH_sample.jsonl@1765']
You'll see the similar content in above documents
NIH_sample.jsonl@1769
{"meta": {"APPLICATION_ID": 2044519}, "text": "The overall aim of study is to test the \"matching hypothesis\" that alcohol treatment effectiveness can be increased by assigning clients with certain characteristics to particular treatments. The present application proposes to continue work initiated and conducted over the past five years. The specific aims of study are: to test primary and secondary a priori matching hypotheses over the course of 15 months of follow-up; to conduct psychometric and other analyses of patient, treatment process, and outcome variables to test these matching hypotheses; to examine alternative analytic strategies and variables for testing matching; and to determine the extent to which matching effects persist over a three year period following treatment completion. Data sets collected from th
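
Besides reading the documents side by side, the similarity of two texts can be quantified. The following is a minimal sketch of exact Jaccard similarity over word n-grams, which is the quantity that MinHash approximates. It assumes simple lowercase whitespace tokenization, which may differ from the tokenization pyrecdp uses; the function names and sample texts are hypothetical.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of a text (assumes whitespace tokenization)."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(text_a: str, text_b: str, n: int = 13) -> float:
    """Exact Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty texts are considered identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical texts that differ in a single word.
doc_a = "the overall aim of the study is to test the matching hypothesis"
doc_b = "the overall aim of the study is to evaluate the matching hypothesis"

print(f"similarity with 3-word n-grams: {jaccard(doc_a, doc_b, n=3):.3f}")
```

A larger `n` makes the comparison stricter, because a single changed word invalidates every n-gram that contains it.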

## 4. Fuzzy deduplicate (unified detection and reduction)

In [36]:
from pyrecdp.core import SparkDataProcessor
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as F
from pyrecdp.primitives.llmutils import near_dedup_spk

data_files = ["/content/test_data/NIH_sample.jsonl"]
dup_dir = "/content/fuzzy_dedup_spark"

ngram_size = 13
num_perm = 256
bands = 9
ranges = 13
rdp = SparkDataProcessor()
spark = rdp.spark
schema = StructType([
    StructField("text", StringType(), True),
    StructField("meta", StringType(), True)
])
spark_df = spark.read.text(data_files)
spark_df = spark_df.withColumn('jsonData', F.from_json(F.col('value'), schema)).select("jsonData.*")
print("input is ")
spark_df.show()
print(f"Total num_rows of input is {spark_df.count()}")

ret_df = near_dedup_spk(spark_df, ngram_size, num_perm, bands, ranges)

print("output is")
ret_df.show()
print(f"Total num_rows of output is {ret_df.count()}")
del rdp

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
input is 
+--------------------+--------------------+
|                text|                meta|
+--------------------+--------------------+
|The National Dome...|{"APPLICATION_ID"...|
|The Office of Pla...|{"APPLICATION_ID"...|
|Improving outcome...|{"APPLICATION_ID"...|
|This project is i...|{"APPLICATION_ID"...|
|The CCDF Policies...|{"APPLICATION_ID"...|
|The overall purpo...|{"APPLICATION_ID"...|
|This contract wil...|{"APPLICATION_ID"...|
|The purpose of th...|{"APPLICATION_ID"...|
|The purpose of th...|{"APPLICATION_ID"...|
|Intimate partner ...|{"APPLICATION_ID"...|
|ACF's Office of R...|{"APPLICATION_ID"...|
|The Temporary Ass...|{"APPLICATION_ID"...|
|Investing in Qual...|{"APPLICATION_ID"...|
|Current developme...|{"APPLICATION_ID"...|
|The proposed diss...|{"APPLICATION_ID"...|
|As the US populat...|{"APPLICATION_ID"...|
|Through employi

100%|██████████| 13/13 [00:00<00:00, 69283.29it/s]


generate_connected_components => duplicates took 0.8337461669998447 sec
deduplicate input data started ...
deduplicate input data took 0.9185702700001457 sec
Completed!!
    total processed 10000 documents
    total detected 12 duplicated documents, exact deduplicated counts is 12
    duplicate ratio is 0.0012
output is
+--------------+--------------------+--------------------+
|filename_docid|                text|                meta|
+--------------+--------------------+--------------------+
|   global_id@0|The National Dome...|{"APPLICATION_ID"...|
|   global_id@1|The Office of Pla...|{"APPLICATION_ID"...|
|   global_id@2|Improving outcome...|{"APPLICATION_ID"...|
|   global_id@3|This project is i...|{"APPLICATION_ID"...|
|   global_id@4|The CCDF Policies...|{"APPLICATION_ID"...|
|   global_id@5|The overall purpo...|{"APPLICATION_ID"...|
|   global_id@6|This contract wil...|{"APPLICATION_ID"...|
|   global_id@7|The purpose of th...|{"APPLICATION_ID"...|
|   global_id@8|The purpose o