# RecDP LLM - TextBytesize

TextBytesize is an operation that computes the byte size of the text in each record and stores it in a `bytesize` column.
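To make "byte size" concrete, here is a minimal sketch. It assumes the size is measured on UTF-8 encoded text, which this notebook does not state explicitly, and `text_bytesize` is a hypothetical helper, not part of pyrecdp.

```python
def text_bytesize(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


ascii_sample = "hello"
accented_sample = "caf\u00e9"  # "café" with a precomposed e-acute

# For plain ASCII, the byte size equals the character count.
print(len(ascii_sample), text_bytesize(ascii_sample))

# Non-ASCII characters take more than one byte in UTF-8,
# so the byte size can exceed the character count.
print(len(accented_sample), text_bytesize(accented_sample))
```

The distinction matters for multilingual data, where character counts can underestimate the storage size considerably.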

### Supported input and output:

example 1:
* Expected input format: a folder of *.jsonl files.
* Expected output format: the input records with an added `bytesize` column.
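A folder of *.jsonl input can be sketched as follows. The `text` and `meta` field names follow the sample output shown later in this notebook; the file name and the records themselves are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: one JSON object per line, each with a "text" field.
records = [
    {"text": "first document", "meta": {"APPLICATION_ID": 1}},
    {"text": "second document", "meta": {"APPLICATION_ID": 2}},
]

# Write the records into a temporary folder as a single .jsonl file.
data_dir = Path(tempfile.mkdtemp())
jsonl_path = data_dir / "sample.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line to confirm the format round-trips.
with jsonl_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

A folder like `data_dir` is the kind of path that would be passed to the reader in place of `/content/test_data/`.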

# Get started

## 1. Install pyrecdp and dependencies

In [1]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libfontenc1 libgail-common libgail18 libgtk2.0-0
  libgtk2.0-bin libgtk2.0-common librsvg2-common libxkbfile1 libxtst6
  libxxf86dga1 openjdk-8-jre-headless x11-utils
Suggested packages:
  gvfs libnss-mdns fonts-nanum fonts-ipafont-gothic fonts-ipafont-mincho
  fonts-wqy-microhei fonts-wqy-zenhei fonts-indic mesa-utils
The following NEW packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libfontenc1 libgail-common libgail18 libgtk2.0-0
  libgtk2.0-bin libgtk2.0-common librsvg2-common libxkbfile1 libxtst6
  libxxf86dga1 openjdk-8-jre openjdk-8-jre-headless x11-utils
0 upgraded, 17 newly installed, 0 to remove and 18 not upgraded.
Need to get 36.7 MB of archives.
After this operation, 123 MB of ad

## 2. Prepare your own data

In [2]:
%mkdir -p /content/test_data
%cd /content/test_data
file_names = ['NIH_sample.jsonl']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/PILE/{i}" for i in file_names]
!wget -P /content/test_data {" ".join(file_list)}

/content/test_data
--2023-10-11 18:10:37--  https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/PILE/NIH_sample.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21626664 (21M) [text/plain]
Saving to: ‘/content/test_data/NIH_sample.jsonl’


2023-10-11 18:10:38 (114 MB/s) - ‘/content/test_data/NIH_sample.jsonl’ saved [21626664/21626664]



## 3. Get bytesize

In [3]:
! ls /content/test_data

NIH_sample.jsonl


### 3.1 Pipeline-based API

In [5]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

pipeline = TextPipeline()
ops = [
    JsonlReader("/content/test_data/"),
    TextBytesize(),
]
pipeline.add_operations(ops)
ret = pipeline.execute()
ret.to_pandas()

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/




init ray
init ray with total mem of 8167961395, total core of 1


2023-10-11 18:11:29,354	INFO worker.py:1642 -- Started a local Ray instance.


execute with ray started ...


2023-10-11 18:11:34,510	INFO read_api.py:406 -- To satisfy the requested parallelism of 20, each read task output is split into 20 smaller blocks.
2023-10-11 18:11:34,568	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText->SplitBlocks(20)] -> TaskPoolMapOperator[Map(convert_json)->Map(<lambda>)]
2023-10-11 18:11:34,576	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-10-11 18:11:34,582	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/400 [00:00<?, ?it/s]



execute with ray took 17.940069602999984 sec


| | meta | text | bytesize |
|---|---|---|---|
| 0 | {'APPLICATION_ID': 100065} | The National Domestic Violence Hotline (NDVH) ... | 1460 |
| 1 | {'APPLICATION_ID': 100066} | The Office of Planning, Research and Evaluatio... | 3181 |
| 2 | {'APPLICATION_ID': 100067} | Improving outcomes for low-income fathers and ... | 1777 |
| 3 | {'APPLICATION_ID': 100068} | This project is implementing 36-month follow-u... | 1760 |
| 4 | {'APPLICATION_ID': 100069} | The CCDF Policies Database is a source of info... | 2157 |
| ... | ... | ... | ... |
| 9995 | {'APPLICATION_ID': 2120612} | Project: Research and produce a videotape that... | 1241 |
| 9996 | {'APPLICATION_ID': 2120613} | While relapse prevention has been studied and ... | 1281 |
| 9997 | {'APPLICATION_ID': 2120616} | The proposed study on recruitment, adherence a... | 2867 |
| 9998 | {'APPLICATION_ID': 2120620} | Recent studies suggest that HIV epidemics are ... | 2424 |
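Since the result holds one `bytesize` value per record, a dataset-level total has to be aggregated from that column. A minimal sketch, using only the first five `bytesize` values shown above as sample input rather than the full result:

```python
from statistics import mean

# bytesize values of the first five rows shown above (sample input only)
bytesizes = [1460, 3181, 1777, 1760, 2157]

total_bytes = sum(bytesizes)
print(f"total: {total_bytes} bytes")
print(f"mean:  {mean(bytesizes):.1f} bytes")
print(f"max:   {max(bytesizes)} bytes")
```

On the actual result, the same total should be obtainable from the pandas DataFrame, for example with `ret.to_pandas()["bytesize"].sum()`.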


### 3.2 Operation-based API

#### Prepare Ray and Spark context

In [6]:
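import psutil
import ray
from pyrecdp.core import SparkDataProcessor
from pyspark.sql import DataFrame
from pyrecdp.core.cache_utils import RECDP_MODELS_CACHE
# JsonlReader is used by both context classes below
from pyrecdp.primitives.operations import JsonlReader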

total_mem = int(psutil.virtual_memory().total * 0.5)
total_cores = psutil.cpu_count(logical=False)

class RayContext:
    def __init__(self, dataset_path):
        self.dataset_path = dataset_path

    def __enter__(self):
        if not ray.is_initialized():
            try:
                ray.init(object_store_memory=total_mem, num_cpus=total_cores)
            except Exception:
                # Fall back to Ray's default resource settings
                ray.init()

        reader = JsonlReader(self.dataset_path)
        self.ds = reader.process_rayds(None)
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        if ray.is_initialized():
            ray.shutdown()

    def show(self, ds):
        # Convert to pandas for display; avoid shadowing the usual `pd` alias
        df = ds.to_pandas()
        display(df)

class SparkContext:
    def __init__(self, dataset_path):
        self.dataset_path = dataset_path
        self.rdp = SparkDataProcessor()

    def __enter__(self):
        self.spark = self.rdp.spark
        reader = JsonlReader(self.dataset_path)
        self.ds = reader.process_spark(self.spark)
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        pass

    def show(self, ds):
        # Convert to pandas for display; avoid shadowing the usual `pd` alias
        df = ds.toPandas()
        display(df)

In [7]:
# Ray mode

from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline

op = TextBytesize()
with RayContext("/content/test_data/") as ctx:
    ctx.show(op.process_rayds(ctx.ds))

2023-10-11 18:16:49,727	INFO read_api.py:406 -- To satisfy the requested parallelism of 20, each read task output is split into 20 smaller blocks.
2023-10-11 18:16:49,743	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText->SplitBlocks(20)] -> TaskPoolMapOperator[Map(convert_json)->Map(<lambda>)]
2023-10-11 18:16:49,748	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-10-11 18:16:49,752	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/400 [00:00<?, ?it/s]

| | meta | text | bytesize |
|---|---|---|---|
| 0 | {'APPLICATION_ID': 100065} | The National Domestic Violence Hotline (NDVH) ... | 1460 |
| 1 | {'APPLICATION_ID': 100066} | The Office of Planning, Research and Evaluatio... | 3181 |
| 2 | {'APPLICATION_ID': 100067} | Improving outcomes for low-income fathers and ... | 1777 |
| 3 | {'APPLICATION_ID': 100068} | This project is implementing 36-month follow-u... | 1760 |
| 4 | {'APPLICATION_ID': 100069} | The CCDF Policies Database is a source of info... | 2157 |
| ... | ... | ... | ... |
| 9995 | {'APPLICATION_ID': 2120612} | Project: Research and produce a videotape that... | 1241 |
| 9996 | {'APPLICATION_ID': 2120613} | While relapse prevention has been studied and ... | 1281 |
| 9997 | {'APPLICATION_ID': 2120616} | The proposed study on recruitment, adherence a... | 2867 |
| 9998 | {'APPLICATION_ID': 2120620} | Recent studies suggest that HIV epidemics are ... | 2424 |


In [9]:
# Spark mode

from pyrecdp.primitives.operations import *
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline

op = TextBytesize()
with SparkContext("/content/test_data/") as ctx:
    ctx.show(op.process_spark(ctx.spark, ctx.ds))

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB


| | text | meta | bytesize |
|---|---|---|---|
| 0 | The National Domestic Violence Hotline (NDVH) ... | {"APPLICATION_ID":100065} | 1460 |
| 1 | The Office of Planning, Research and Evaluatio... | {"APPLICATION_ID":100066} | 3181 |
| 2 | Improving outcomes for low-income fathers and ... | {"APPLICATION_ID":100067} | 1777 |
| 3 | This project is implementing 36-month follow-u... | {"APPLICATION_ID":100068} | 1760 |
| 4 | The CCDF Policies Database is a source of info... | {"APPLICATION_ID":100069} | 2157 |
| ... | ... | ... | ... |
| 9995 | Project: Research and produce a videotape that... | {"APPLICATION_ID":2120612} | 1241 |
| 9996 | While relapse prevention has been studied and ... | {"APPLICATION_ID":2120613} | 1281 |
| 9997 | The proposed study on recruitment, adherence a... | {"APPLICATION_ID":2120616} | 2867 |
| 9998 | Recent studies suggest that HIV epidemics are ... | {"APPLICATION_ID":2120620} | 2424 |
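The two modes return the same `bytesize` values, but the `meta` column differs: the Ray output shows it as a dict, while the Spark output appears to hold it as a JSON string. Assuming that reading of the output is right, a minimal sketch for turning the Spark-side value back into a dict:

```python
import json

# meta value as it appears in the first row of the Spark output above
meta_from_spark = '{"APPLICATION_ID":100065}'

# Parse the JSON string into a Python dict, matching the Ray-mode shape
meta = json.loads(meta_from_spark)
print(meta["APPLICATION_ID"])
```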
