# RecDP LLM - User Defined Map Function

This notebook shows how to plug a user-defined function into a RecDP LLM pipeline and run it with Ray or Spark.

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
file_names = ['tiny_c4_sample.jsonl']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/{name}" for name in file_names]
!wget -P /content/test_data {" ".join(file_list)}
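
Before running the pipeline, it can help to confirm that the downloaded file parses as JSON Lines. The sketch below is one way to do that with the standard library only; `peek_jsonl` is a hypothetical helper written for this notebook, not part of pyrecdp, and it assumes one JSON object per line.

```python
import json

def peek_jsonl(path, n=3):
    """Return up to n parsed records from a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if line:
                # json.loads raises json.JSONDecodeError on a malformed line
                records.append(json.loads(line))
    return records
```

For example, `peek_jsonl("/content/test_data/tiny_c4_sample.jsonl", n=1)` returns a list holding the first record; a `json.JSONDecodeError` would point to a truncated or failed download.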

## 3. User Defined Function

In [3]:
! ls /content/test_data

tiny_c4_sample.jsonl


### 3.1 PIPELINE based API

In [4]:
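# Define your own function.
# TextCustomerMap passes in the value of the column named by text_key
# (here the doc_score produced by TextQualityScorer), not the raw document text.
def classify(text):
    return 1 if text > 0.8 else 0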
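
Because the comparison is strict, a score of exactly 0.8 is labelled 0. A quick check of the boundary, run outside the pipeline, makes the threshold behavior explicit; the sample scores below are made up for illustration.

```python
def classify(text):
    # Same rule as above: only scores strictly greater than 0.8 map to 1.
    return 1 if text > 0.8 else 0

# Hypothetical scores chosen to straddle the threshold.
boundary_scores = [0.79, 0.8, 0.81]
boundary_labels = [classify(s) for s in boundary_scores]
print(boundary_labels)  # [0, 0, 1]
```

If the threshold should be inclusive, change `>` to `>=` before plugging the function into the pipeline.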

In [5]:
# Plug the function into the pipeline
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

pipeline = ResumableTextPipeline()
ops = [
    JsonlReader("/content/test_data"),
    TextQualityScorer(),
    TextCustomerMap(classify, text_key='doc_score'),
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
pipeline.execute()
del pipeline

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB

ResumableTextPipeline, current on tiny_c4_sample.jsonl:   0%|          | 0/1 [00:00<?, ?it/s]

model_name is gpt3
2023-10-12 22:33:46.431 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:prepare_model:122 - Preparing scorer model in [/root/.cache/recdp/models/gpt3_quality_model]...
2023-10-12 22:34:03.479 | INFO     | pyrecdp.primitives.operations.text_qualityscorer:predict:252 - Start scoring dataset...

ResumableTextPipeline, current on tiny_c4_sample.jsonl: 100%|██████████| 1/1 [00:24<00:00, 24.62s/it]

2023-10-12 22:34:11.050 | INFO     | pyrecdp.LLM.TextPipeline:execute:325 - Completed! ResumableTextPipeline will not return dataset, please check ResumableTextPipeline_output for verification.

In [7]:
# View output
! ls ResumableTextPipeline_output

pipeline.json  pipeline.log  status.log  tiny_c4_sample.jsonl


In [9]:
import pandas as pd
pd.read_parquet("ResumableTextPipeline_output/tiny_c4_sample.jsonl")

| | text | meta | source_id | doc_score | should_keep | classify_text |
|---|---|---|---|---|---|---|
| 0 | lorazepam nombre comercial mexico From an inte... | {"timestamp":"2019-04-24T02:17:53Z","url":"htt... | tiny_c4_sample.jsonl | 0.139534 | 0 | 0 |
| 1 | It is possible to love someone who does not lo... | {"timestamp":"2019-04-23T06:32:35Z","url":"htt... | tiny_c4_sample.jsonl | 0.999997 | 1 | 1 |
| 2 | Canon PIXMA TS9520 All-in-One Print / Scan / C... | {"timestamp":"2019-04-25T17:03:36Z","url":"htt... | tiny_c4_sample.jsonl | 0.941116 | 1 | 1 |
| 3 | For those who plan on buying an iPad this Satu... | {"timestamp":"2019-04-22T22:39:52Z","url":"htt... | tiny_c4_sample.jsonl | 0.999765 | 1 | 1 |
| 4 | After tipping 25 tokens in a day, you'll be ab... | {"timestamp":"2019-04-20T00:25:13Z","url":"htt... | tiny_c4_sample.jsonl | 0.939119 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 444 | Sunrise is an equal opportunity employer. Vete... | {"timestamp":"2019-04-22T10:28:15Z","url":"htt... | tiny_c4_sample.jsonl | 0.834727 | 0 | 1 |
| 445 | Home / Business / #Exploitation: Coca Cola is ... | {"timestamp":"2019-04-24T18:04:45Z","url":"htt... | tiny_c4_sample.jsonl | 0.998307 | 1 | 1 |
| 446 | I got really surprised when I saw that I recei... | {"timestamp":"2019-04-26T08:57:28Z","url":"htt... | tiny_c4_sample.jsonl | 0.864012 | 1 | 1 |
| 447 | Here's a brief schedule for 2016 as requested ... | {"timestamp":"2019-04-18T10:15:11Z","url":"htt... | tiny_c4_sample.jsonl | 0.999769 | 1 | 1 |
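
The `classify_text` column holds the result of applying `classify` to each row's `doc_score`, so it can differ from `should_keep`, which appears to come from TextQualityScorer's own rule (row 444 has a score of 0.834727 but a `should_keep` of 0). The pure-Python sketch below recomputes the mapping for a few rows copied from the table; it only illustrates the per-row logic and is not how pyrecdp executes the function on Spark or Ray.

```python
def classify(text):
    # Same rule as the notebook's user-defined function.
    return 1 if text > 0.8 else 0

# (row index, doc_score, should_keep) copied from the output table.
rows = [
    (0, 0.139534, 0),
    (1, 0.999997, 1),
    (444, 0.834727, 0),
    (446, 0.864012, 1),
]

# Recompute the user-defined column, then list rows where it disagrees with should_keep.
recomputed = {idx: classify(score) for idx, score, _ in rows}
disagree = [idx for idx, score, keep in rows if classify(score) != keep]
print(recomputed)  # {0: 0, 1: 1, 444: 1, 446: 1}
print(disagree)    # [444]
```

On the full output, the same comparison can be made in pandas by filtering the DataFrame on `classify_text != should_keep`.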
