# RecDP LLM - TextBytesize

TextBytesize is a tool that computes the total byte size of the text in a dataset.
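As a rough illustration of what "total byte size of text" means, the sketch below sums the UTF-8 encoded length of a text field over a list of records. It is not the pyrecdp implementation: the helper name `total_text_bytes` and the default key `text` are assumptions made for this example.

```python
def total_text_bytes(records, text_key="text"):
    """Sum the UTF-8 encoded length of the text field across all records.

    Hypothetical helper for illustration; not part of pyrecdp.
    """
    return sum(len(str(rec.get(text_key, "")).encode("utf-8")) for rec in records)


records = [
    {"text": "hello"},    # 5 ASCII characters -> 5 bytes
    {"text": "héllo"},    # "é" takes 2 bytes in UTF-8 -> 6 bytes
    {"meta": "no text"},  # a missing text field counts as 0 bytes
]
print(total_text_bytes(records))  # 11
```

Byte size and character count differ as soon as the text contains non-ASCII characters, which is why the second record contributes 6 bytes rather than 5.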

### We support two types of input and output:

Example 1:
* Expected input format: a folder of *.jsonl files.
* Expected output format: a folder of *.jsonl files after reduction.
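To make the input format concrete, here is a minimal sketch that writes a folder containing one `*.jsonl` file and reads it back, one JSON object per line. The helpers `write_jsonl` and `read_jsonl_folder`, the file name `sample.jsonl`, and the example URLs are hypothetical; the `text` and `meta` fields mirror the columns that appear in the sample data used later in this notebook.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl_folder(folder):
    """Read every *.jsonl file in a folder into a list of dicts."""
    rows = []
    for path in sorted(Path(folder).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    rows.append(json.loads(line))
    return rows


folder = tempfile.mkdtemp()
write_jsonl(Path(folder) / "sample.jsonl", [
    {"text": "first document", "meta": {"url": "https://example.com/a"}},
    {"text": "second document", "meta": {"url": "https://example.com/b"}},
])
rows = read_jsonl_folder(folder)
print(len(rows), rows[0]["text"])  # 2 first document
```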

# Get started

## 1. Install pyrecdp and dependencies

In [1]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'


## 2. Prepare your own data

In [2]:
%mkdir -p /content/test_data
%cd /content/test_data
# Sample files to download from the RecDP test data folder
file_names = ['tiny_c4_sample.jsonl']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/{name}" for name in file_names]
!wget -P /content/test_data {" ".join(file_list)}

/content/test_data
--2023-10-12 23:32:37--  https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/tiny_c4_sample.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1062126 (1.0M) [text/plain]
Saving to: ‘/content/test_data/tiny_c4_sample.jsonl’


2023-10-12 23:32:38 (4.96 MB/s) - ‘/content/test_data/tiny_c4_sample.jsonl’ saved [1062126/1062126]
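Before running the pipeline, it can be worth confirming that the download is complete. The sketch below compares a file's size on disk with the `Length` value reported by wget above (1062126 bytes). The helper `has_expected_size` is a hypothetical name, and the demo runs on a throwaway temporary file so that it works outside the notebook.

```python
import os
import tempfile

EXPECTED_BYTES = 1062126  # "Length" reported in the wget output above


def has_expected_size(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Demo on a 10-byte temporary file instead of the real download
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10)
demo_path = tmp.name

ok_small = has_expected_size(demo_path, expected_bytes=10)
ok_default = has_expected_size(demo_path)
print(ok_small, ok_default)  # True False
os.remove(demo_path)
```

In the notebook itself, the call would be `has_expected_size("/content/test_data/tiny_c4_sample.jsonl")`.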



## 3. User Defined Filter

In [3]:
! ls /content/test_data

tiny_c4_sample.jsonl


### 3.1 PIPELINE based API

In [4]:
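# Define your filter condition.
# The argument is the value of the column selected by text_key
# (here the quality score in 'doc_score'), not the raw text.
def cond(score):
    return score > 0.9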
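The condition is an ordinary Python predicate, so it can be tried on a few values before it is plugged into a pipeline. A minimal sketch follows: the first two scores are taken from the sample output of this notebook, while 0.9 and 0.42 are hypothetical values added to show the rejected cases.

```python
def cond(score):
    # Keep only rows whose score is strictly greater than 0.9
    return score > 0.9


sample_scores = [0.999997, 0.941116, 0.9, 0.42]  # last two are hypothetical
kept = [s for s in sample_scores if cond(s)]
print(kept)  # [0.999997, 0.941116]
```

Because the comparison is strict, a score of exactly 0.9 is dropped; use `>=` if the boundary value should be kept.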

In [5]:
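# Plug the filter into the pipeline
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

pipeline = ResumableTextPipeline()
pipeline.enable_statistics()  # log per-operation row counts and timings
ops = [
    JsonlReader("/content/test_data"),                    # read the *.jsonl files in the folder
    TextQualityScorer(),                                  # score each document (doc_score column)
    TextCustomerFilter(cond, text_key='doc_score'),       # keep rows where cond(doc_score) is True
    PerfileParquetWriter("ResumableTextPipeline_output")  # write one Parquet file per input file
]
pipeline.add_operations(ops)
pipeline.execute()
del pipeline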

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/




Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB


ResumableTextPipeline, current on tiny_c4_sample.jsonl:   0%|          | 0/1 [00:00<?, ?it/s]

model_name is gpt3
2023-10-12 23:33:17.586 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:prepare_model:122 - Preparing scorer model in [/root/.cache/recdp/models/gpt3_quality_model]...
2023-10-12 23:33:26.797 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:predict:252 - Start scoring dataset...


ResumableTextPipeline, current on tiny_c4_sample.jsonl: 100%|██████████| 1/1 [00:21<00:00, 21.74s/it]

2023-10-12 23:33:39.327 | INFO     | pyrecdp.LLM.TextPipeline:execute:323 - TextQualityScorer: A total of 0 rows of data were processed, using 0 seconds, with 0 rows modified or removed, 0 rows of data remaining.
2023-10-12 23:33:39.330 | INFO     | pyrecdp.LLM.TextPipeline:execute:323 - TextCustomerFilter: A total of 449 rows of data were processed, using 0.6762804985046387 seconds, with 139 rows modified or removed, 310 rows of data remaining.
2023-10-12 23:33:39.334 | INFO     | pyrecdp.LLM.TextPipeline:execute:323 - PerfileParquetWriter: A total of 0 rows of data were processed, using 0 seconds, with 0 rows modified or removed, 0 rows of data remaining.
2023-10-12 23:33:39.339 | INFO     | pyrecdp.LLM.TextPipeline:execute:325 - Completed! ResumableTextPipeline will not return datas




In [6]:
# View output
! ls ResumableTextPipeline_output

pipeline.json  pipeline.log  status.log  tiny_c4_sample.jsonl


In [7]:
# Inspect the rows that passed the filter
import pandas as pd
pd.read_parquet("ResumableTextPipeline_output/tiny_c4_sample.jsonl")

| | text | meta | source_id | doc_score | should_keep |
|---|---|---|---|---|---|
| 0 | It is possible to love someone who does not lo... | {"timestamp":"2019-04-23T06:32:35Z","url":"htt... | tiny_c4_sample.jsonl | 0.999997 | 1 |
| 1 | Canon PIXMA TS9520 All-in-One Print / Scan / C... | {"timestamp":"2019-04-25T17:03:36Z","url":"htt... | tiny_c4_sample.jsonl | 0.941116 | 1 |
| 2 | For those who plan on buying an iPad this Satu... | {"timestamp":"2019-04-22T22:39:52Z","url":"htt... | tiny_c4_sample.jsonl | 0.999765 | 1 |
| 3 | After tipping 25 tokens in a day, you'll be ab... | {"timestamp":"2019-04-20T00:25:13Z","url":"htt... | tiny_c4_sample.jsonl | 0.939119 | 1 |
| 4 | When cute redhead Lola Fae gets caught flickin... | {"timestamp":"2019-04-19T10:57:45Z","url":"htt... | tiny_c4_sample.jsonl | 0.907368 | 1 |
| ... | ... | ... | ... | ... | ... |
| 305 | This dark haired angel really loves to play wi... | {"timestamp":"2019-04-25T17:41:41Z","url":"htt... | tiny_c4_sample.jsonl | 0.904412 | 1 |
| 306 | Who were the first two guys in the scene. The ... | {"timestamp":"2019-04-23T06:35:03Z","url":"htt... | tiny_c4_sample.jsonl | 0.908248 | 1 |
| 307 | Home / Business / #Exploitation: Coca Cola is ... | {"timestamp":"2019-04-24T18:04:45Z","url":"htt... | tiny_c4_sample.jsonl | 0.998307 | 1 |
| 308 | Here's a brief schedule for 2016 as requested ... | {"timestamp":"2019-04-18T10:15:11Z","url":"htt... | tiny_c4_sample.jsonl | 0.999769 | 1 |
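A quick sanity check can tie the logged statistics to the displayed rows. The sketch below uses only numbers that appear in this notebook: the row counts from the `TextCustomerFilter` log line and the `doc_score` values visible in the output. The variable names are illustrative.

```python
# Row counts reported by the TextCustomerFilter statistics line
processed = 449
removed = 139
remaining = 310

# The counts should be internally consistent
assert processed - removed == remaining

retention = remaining / processed
print(f"Retention rate: {retention:.1%}")  # Retention rate: 69.0%

# doc_score values visible in the displayed output
displayed_scores = [
    0.999997, 0.941116, 0.999765, 0.939119, 0.907368,
    0.904412, 0.908248, 0.998307, 0.999769,
]
all_pass = all(score > 0.9 for score in displayed_scores)
print(all_pass)  # True
```

With the pandas DataFrame loaded in the notebook, the same check could be written as `(df["doc_score"] > 0.9).all()`, assuming the DataFrame is bound to a variable named `df`.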
