# RecDP LLM - Reader

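RecDP LLM provides several Reader operations for different purposes:
* JsonlReader: reads JSONL files
* ParquetReader: reads Parquet files
* SourcedJsonlReader: reads JSONL files and adds a `source_id` column with the source file name
* SourcedParquetReader: reads Parquet files and adds a `source_id` column with the source file name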

# Get Started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare test data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/PILE/NIH_sample.jsonl
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/PILE/NIH_sample.parquet
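
Before running the readers, it can help to confirm that both downloads are what they claim to be. The sketch below uses only the Python standard library; the helper names `peek_jsonl` and `looks_like_parquet` are hypothetical and not part of pyrecdp. It relies on two facts: a JSONL file holds one JSON document per line, and a regular (unencrypted) Parquet file starts and ends with the 4-byte magic `PAR1`.

```python
import json
import os


def peek_jsonl(path, n=3):
    """Parse and return up to the first n non-empty lines of a JSONL file."""
    records = []
    if n <= 0:
        return records
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))  # raises ValueError on a malformed line
            if len(records) >= n:
                break
    return records


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 8:
        return False  # too small to hold both magic markers
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

For the files downloaded above, one would call `peek_jsonl("/content/test_data/NIH_sample.jsonl")` and `looks_like_parquet("/content/test_data/NIH_sample.parquet")`. A failed parse or a `False` result usually points to an incomplete or wrong download.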

## 3. Reader

### 3.1 JsonlReader

In [2]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# A quick example that uses a single operation;
# see the RecDP LLM README for the full list of operations.

pipeline = TextPipeline()
ops = [
    JsonlReader("/content/test_data/NIH_sample.jsonl"),
]
pipeline.add_operations(ops)
result = pipeline.execute().to_pandas()
del pipeline
result

init ray
init ray with total mem of 8167961395, total core of 1


2023-10-11 20:05:59,792	INFO worker.py:1642 -- Started a local Ray instance.


execute with ray started ...


2023-10-11 20:06:02,237	INFO read_api.py:406 -- To satisfy the requested parallelism of 20, each read task output is split into 20 smaller blocks.
2023-10-11 20:06:02,278	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText->SplitBlocks(20)] -> TaskPoolMapOperator[Map(convert_json)]
2023-10-11 20:06:02,288	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-10-11 20:06:02,292	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/400 [00:00<?, ?it/s]

execute with ray took 4.602625061000026 sec


| | meta | text |
|---|---|---|
| 0 | {'APPLICATION_ID': 100065} | The National Domestic Violence Hotline (NDVH) ... |
| 1 | {'APPLICATION_ID': 100066} | The Office of Planning, Research and Evaluatio... |
| 2 | {'APPLICATION_ID': 100067} | Improving outcomes for low-income fathers and ... |
| 3 | {'APPLICATION_ID': 100068} | This project is implementing 36-month follow-u... |
| 4 | {'APPLICATION_ID': 100069} | The CCDF Policies Database is a source of info... |
| ... | ... | ... |
| 9995 | {'APPLICATION_ID': 2120612} | Project: Research and produce a videotape that... |
| 9996 | {'APPLICATION_ID': 2120613} | While relapse prevention has been studied and ... |
| 9997 | {'APPLICATION_ID': 2120616} | The proposed study on recruitment, adherence a... |
| 9998 | {'APPLICATION_ID': 2120620} | Recent studies suggest that HIV epidemics are ... |


### 3.2 ParquetReader

In [3]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# A quick example that uses a single operation;
# see the RecDP LLM README for the full list of operations.

pipeline = TextPipeline()
ops = [
    ParquetReader("/content/test_data/NIH_sample.parquet"),
]
pipeline.add_operations(ops)
result = pipeline.execute().to_pandas()
del pipeline
result

init ray
init ray with total mem of 8167961395, total core of 1


2023-10-11 20:06:11,882	INFO worker.py:1642 -- Started a local Ray instance.


execute with ray started ...


[2m[36m(_get_reader pid=8691)[0m   pq_ds.pieces, **prefetch_remote_args
[2m[36m(_get_reader pid=8691)[0m   self._pq_pieces = [_SerializedPiece(p) for p in pq_ds.pieces]
[2m[36m(_get_reader pid=8691)[0m   self._pq_paths = [p.path for p in pq_ds.pieces]


[2m[36m(pid=8691) [0mParquet Files Sample 0:   0%|          | 0/1 [00:00<?, ?it/s]

2023-10-11 20:06:18,562	INFO read_api.py:406 -- To satisfy the requested parallelism of 40, each read task output is split into 40 smaller blocks.


Read progress 0:   0%|          | 0/1 [00:00<?, ?it/s]

execute with ray took 5.992776564999986 sec


| | text | meta |
|---|---|---|
| 0 | The National Domestic Violence Hotline (NDVH) ... | {'APPLICATION_ID': 100065} |
| 1 | The Office of Planning, Research and Evaluatio... | {'APPLICATION_ID': 100066} |
| 2 | Improving outcomes for low-income fathers and ... | {'APPLICATION_ID': 100067} |
| 3 | This project is implementing 36-month follow-u... | {'APPLICATION_ID': 100068} |
| 4 | The CCDF Policies Database is a source of info... | {'APPLICATION_ID': 100069} |
| ... | ... | ... |
| 9995 | Project: Research and produce a videotape that... | {'APPLICATION_ID': 2120612} |
| 9996 | While relapse prevention has been studied and ... | {'APPLICATION_ID': 2120613} |
| 9997 | The proposed study on recruitment, adherence a... | {'APPLICATION_ID': 2120616} |
| 9998 | Recent studies suggest that HIV epidemics are ... | {'APPLICATION_ID': 2120620} |


### 3.3 SourcedJsonlReader

In [1]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# A quick example that uses a single operation;
# see the RecDP LLM README for the full list of operations.

pipeline = TextPipeline()
ops = [
    SourcedJsonlReader("/content/test_data/NIH_sample.jsonl"),
]
pipeline.add_operations(ops)
result = pipeline.execute().to_pandas()
del pipeline
result

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/




init ray
init ray with total mem of 8167961395, total core of 1


2023-10-11 20:08:28,776	INFO worker.py:1642 -- Started a local Ray instance.


execute with ray started ...


2023-10-11 20:08:33,099	INFO read_api.py:406 -- To satisfy the requested parallelism of 20, each read task output is split into 20 smaller blocks.
2023-10-11 20:08:33,140	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText->SplitBlocks(20)] -> TaskPoolMapOperator[Map(<lambda>)]
2023-10-11 20:08:33,146	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-10-11 20:08:33,151	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/400 [00:00<?, ?it/s]



execute with ray took 17.068184013999826 sec


| | meta | text | source_id |
|---|---|---|---|
| 0 | {'APPLICATION_ID': 100065} | The National Domestic Violence Hotline (NDVH) ... | NIH_sample.jsonl |
| 1 | {'APPLICATION_ID': 100066} | The Office of Planning, Research and Evaluatio... | NIH_sample.jsonl |
| 2 | {'APPLICATION_ID': 100067} | Improving outcomes for low-income fathers and ... | NIH_sample.jsonl |
| 3 | {'APPLICATION_ID': 100068} | This project is implementing 36-month follow-u... | NIH_sample.jsonl |
| 4 | {'APPLICATION_ID': 100069} | The CCDF Policies Database is a source of info... | NIH_sample.jsonl |
| ... | ... | ... | ... |
| 9995 | {'APPLICATION_ID': 2120612} | Project: Research and produce a videotape that... | NIH_sample.jsonl |
| 9996 | {'APPLICATION_ID': 2120613} | While relapse prevention has been studied and ... | NIH_sample.jsonl |
| 9997 | {'APPLICATION_ID': 2120616} | The proposed study on recruitment, adherence a... | NIH_sample.jsonl |
| 9998 | {'APPLICATION_ID': 2120620} | Recent studies suggest that HIV epidemics are ... | NIH_sample.jsonl |
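
In the output above, `source_id` carries the name of the file each row came from. The sketch below illustrates that idea in plain Python, assuming the id is simply the file's base name, which matches the single-file result shown here. It is not pyrecdp's implementation, and `read_jsonl_with_source` is a hypothetical helper.

```python
import json
import os


def read_jsonl_with_source(path):
    """Yield one dict per non-empty JSONL line, tagged with the file's base name.

    Assumes every line is a JSON object (not a list or scalar).
    """
    source_id = os.path.basename(path)  # assumption: the id is the file name only
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                record["source_id"] = source_id
                yield record
```

Because the function is a generator, rows are produced lazily; wrapping the call in `list(...)` materializes them all at once.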


### 3.4 SourcedParquetReader

In [2]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# A quick example that uses a single operation;
# see the RecDP LLM README for the full list of operations.

pipeline = TextPipeline()
ops = [
    SourcedParquetReader("/content/test_data/NIH_sample.parquet"),
]
pipeline.add_operations(ops)
result = pipeline.execute().to_pandas()
del pipeline
result

init ray
init ray with total mem of 8167961395, total core of 1


2023-10-11 20:09:00,233	INFO worker.py:1642 -- Started a local Ray instance.


execute with ray started ...


[2m[36m(_get_reader pid=10028)[0m   pq_ds.pieces, **prefetch_remote_args
[2m[36m(_get_reader pid=10028)[0m   self._pq_pieces = [_SerializedPiece(p) for p in pq_ds.pieces]
[2m[36m(_get_reader pid=10028)[0m   self._pq_paths = [p.path for p in pq_ds.pieces]


[2m[36m(pid=10028) [0mParquet Files Sample 0:   0%|          | 0/1 [00:00<?, ?it/s]

2023-10-11 20:09:05,688	INFO read_api.py:406 -- To satisfy the requested parallelism of 40, each read task output is split into 40 smaller blocks.
2023-10-11 20:09:05,762	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(40)] -> TaskPoolMapOperator[Map(<lambda>)]
2023-10-11 20:09:05,764	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-10-11 20:09:05,766	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/1600 [00:00<?, ?it/s]



execute with ray took 21.284709955000153 sec


| | text | meta | source_id |
|---|---|---|---|
| 0 | The National Domestic Violence Hotline (NDVH) ... | {'APPLICATION_ID': 100065} | NIH_sample.parquet |
| 1 | The Office of Planning, Research and Evaluatio... | {'APPLICATION_ID': 100066} | NIH_sample.parquet |
| 2 | Improving outcomes for low-income fathers and ... | {'APPLICATION_ID': 100067} | NIH_sample.parquet |
| 3 | This project is implementing 36-month follow-u... | {'APPLICATION_ID': 100068} | NIH_sample.parquet |
| 4 | The CCDF Policies Database is a source of info... | {'APPLICATION_ID': 100069} | NIH_sample.parquet |
| ... | ... | ... | ... |
| 9995 | Project: Research and produce a videotape that... | {'APPLICATION_ID': 2120612} | NIH_sample.parquet |
| 9996 | While relapse prevention has been studied and ... | {'APPLICATION_ID': 2120613} | NIH_sample.parquet |
| 9997 | The proposed study on recruitment, adherence a... | {'APPLICATION_ID': 2120616} | NIH_sample.parquet |
| 9998 | Recent studies suggest that HIV epidemics are ... | {'APPLICATION_ID': 2120620} | NIH_sample.parquet |
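
A `source_id` column becomes useful once rows from several files are mixed, for example to check how many rows each file contributed. Below is a minimal sketch with the standard library, assuming the rows are available as dictionaries (for instance via pandas' `result.to_dict("records")`). `count_by_source` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from collections import Counter


def count_by_source(records, key="source_id"):
    """Count rows per source; rows missing the key are grouped under None."""
    return Counter(record.get(key) for record in records)


# Made-up rows standing in for a result that mixes two input files.
rows = [
    {"text": "first", "source_id": "a.parquet"},
    {"text": "second", "source_id": "a.parquet"},
    {"text": "third", "source_id": "b.parquet"},
]
counts = count_by_source(rows)
print(counts.most_common())  # [('a.parquet', 2), ('b.parquet', 1)]
```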
