# RecDP LLM - ClassifyWriter

We provide multiple writers, one for each output format:
* ClassifyJsonlWriter
* ClassifyParquetWriter
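
To illustrate what "classify" means here, the following is a minimal pure-Python sketch of writing records into one `key=value` subdirectory per distinct value of a field, which is the layout shown in the results later in this notebook. It is not pyrecdp's implementation. The function name `write_classified_jsonl` and the file name `part.jsonl` are hypothetical and chosen only for this example.

```python
import json
from collections import defaultdict
from pathlib import Path


def write_classified_jsonl(records, out_dir, key):
    """Group records by records[key]; write each group to out_dir/<key>=<value>/part.jsonl.

    Hypothetical helper for illustration only, not part of pyrecdp.
    Returns the sorted list of distinct values that were written.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    for value, recs in groups.items():
        part_dir = Path(out_dir) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # One JSON object per line (JSONL).
        with (part_dir / "part.jsonl").open("w", encoding="utf-8") as fh:
            for rec in recs:
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return sorted(groups)
```

Calling it with `key="language"` on records that carry a `language` field would produce directories such as `language=eng_Latn`, one per detected language.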

# Get Started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare test data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/tiny_c4_sample.jsonl

## 3. Classify Data based on language

### 3.1 Write output as parquet

In [13]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# Below is a quick example that uses only a few of the available operations.
# For the full list of operations, please refer to the RecDP LLM README.

pipeline = TextPipeline()
ops = [
    JsonlReader("/content/test_data/tiny_c4_sample.jsonl"),
    LanguageIdentify(),
    ClassifyParquetWriter('output/tiny_c4_sample_parquet', 'language'),
]
pipeline.add_operations(ops)
result = pipeline.execute()
del pipeline

init spark
Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
execute with spark started ...

execute with spark took 22.19131917599998 sec


In [14]:
## View Result
! ls output/tiny_c4_sample_parquet

'language=eng_Latn'  'language=kor_Hang'  'language=yue_Hant'   _SUCCESS
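
The listing above shows one `language=<value>` directory per detected language plus a `_SUCCESS` marker file. As a small sketch, the detected languages can be recovered programmatically from those directory names using only the standard library. The helper name `list_partitions` is hypothetical and is not part of pyrecdp.

```python
from pathlib import Path


def list_partitions(root, key="language"):
    """Return the sorted values found as `<key>=<value>` subdirectories of root.

    Hypothetical helper for illustration only.
    """
    prefix = f"{key}="
    values = []
    for child in Path(root).iterdir():
        # Skip marker files such as _SUCCESS and anything that is not a partition directory.
        if child.is_dir() and child.name.startswith(prefix):
            values.append(child.name[len(prefix):])
    return sorted(values)


# Only runs if the pipeline output from the cell above exists.
root = Path("output/tiny_c4_sample_parquet")
if root.exists():
    print(list_partitions(root))
```

For the directory listed above, this would return `eng_Latn`, `kor_Hang`, and `yue_Hant`.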


In [15]:
import pandas as pd
pd.read_parquet("output/tiny_c4_sample_parquet/language=eng_Latn")

Unnamed: 0,text,meta
0,lorazepam nombre comercial mexico From an inte...,"{""timestamp"":""2019-04-24T02:17:53Z"",""url"":""htt..."
1,It is possible to love someone who does not lo...,"{""timestamp"":""2019-04-23T06:32:35Z"",""url"":""htt..."
2,Canon PIXMA TS9520 All-in-One Print / Scan / C...,"{""timestamp"":""2019-04-25T17:03:36Z"",""url"":""htt..."
3,For those who plan on buying an iPad this Satu...,"{""timestamp"":""2019-04-22T22:39:52Z"",""url"":""htt..."
4,"After tipping 25 tokens in a day, you'll be ab...","{""timestamp"":""2019-04-20T00:25:13Z"",""url"":""htt..."
...,...,...
416,Sunrise is an equal opportunity employer. Vete...,"{""timestamp"":""2019-04-22T10:28:15Z"",""url"":""htt..."
417,Home / Business / #Exploitation: Coca Cola is ...,"{""timestamp"":""2019-04-24T18:04:45Z"",""url"":""htt..."
418,I got really surprised when I saw that I recei...,"{""timestamp"":""2019-04-26T08:57:28Z"",""url"":""htt..."
419,Here's a brief schedule for 2016 as requested ...,"{""timestamp"":""2019-04-18T10:15:11Z"",""url"":""htt..."


In [16]:
import pandas as pd
pd.read_parquet("output/tiny_c4_sample_parquet/language=yue_Hant")

Unnamed: 0,text,meta
0,KATALEIYA . LaraEvans4U. PreciousFoxx. FreakyA...,"{""timestamp"":""2019-04-22T09:05:46Z"",""url"":""htt..."
1,Nightqueen9 . MattPayton. xPerfectGameX. Diamo...,"{""timestamp"":""2019-04-21T21:00:47Z"",""url"":""htt..."
2,Fall Out Boy -Death Valley (Part 8 of 11) feat...,"{""timestamp"":""2019-04-18T16:45:48Z"",""url"":""htt..."
3,KateDoll . brattandjolinna. DanmBBy69. AlinaAn...,"{""timestamp"":""2019-04-20T05:07:53Z"",""url"":""htt..."
4,AlissonX23 . xYourSexyDreamx. VelourDomme. Alm...,"{""timestamp"":""2019-04-22T08:15:06Z"",""url"":""htt..."
5,Xialove Kyle1 jasmin. MillieClark. promorpheus...,"{""timestamp"":""2019-04-21T10:16:22Z"",""url"":""htt..."
6,LittlePussa LiamHolecum jasmin. MorbiidBitch. ...,"{""timestamp"":""2019-04-20T11:05:20Z"",""url"":""htt..."
7,Aeitta . pablobeda2. Marinaorlova. straponin.\...,"{""timestamp"":""2019-04-22T10:16:41Z"",""url"":""htt..."
8,"Advice, Animals, and Dad: DON'T GET JEALOUS IT...","{""timestamp"":""2019-04-18T15:53:04Z"",""url"":""htt..."
9,LoudLove . DaringSandy. Whippedcreamer. Diamon...,"{""timestamp"":""2019-04-24T02:42:05Z"",""url"":""htt..."


In [18]:
import pandas as pd
pd.read_parquet("output/tiny_c4_sample_parquet/language=kor_Hang")

Unnamed: 0,text,meta
0,Image Title: Buy Leonardo Motion Sensor Faucet...,"{""timestamp"":""2019-04-22T02:29:16Z"",""url"":""htt..."
1,Feltrinelli Pordenone. Novita Mondadori 2016. ...,"{""timestamp"":""2019-04-22T12:45:56Z"",""url"":""htt..."
2,Feltrinelli Pordenone. Novita Mondadori 2016. ...,"{""timestamp"":""2019-04-22T12:45:56Z"",""url"":""htt..."
3,Cialis 20 Mg Cpr 8 About Cialis Super Active O...,"{""timestamp"":""2019-04-26T16:53:08Z"",""url"":""htt..."


### 3.2 Write output as JSONL

In [19]:
from pyrecdp.LLM import TextPipeline, ResumableTextPipeline
from pyrecdp.primitives.operations import *

# Below is a quick example that uses only a few of the available operations.
# For the full list of operations, please refer to the RecDP LLM README.

pipeline = TextPipeline()
ops = [
    JsonlReader("/content/test_data/tiny_c4_sample.jsonl"),
    LanguageIdentify(),
    ClassifyJsonlWriter('output/tiny_c4_sample_jsonl', 'language'),
]
pipeline.add_operations(ops)
result = pipeline.execute()
del pipeline

init spark
Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
execute with spark started ...

execute with spark took 34.57764651000002 sec


In [20]:
## View Result
! ls output/tiny_c4_sample_jsonl

'language=eng_Latn'  'language=kor_Hang'  'language=yue_Hant'   _SUCCESS
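
To inspect the records in one of these JSONL partitions, a minimal sketch like the following can be used. It assumes each data file inside a partition directory is uncompressed text with one JSON object per line; the file names inside the directories are not shown in this notebook, so the sketch simply reads every regular file and skips marker or hidden files. The helper name `read_jsonl_partition` is hypothetical and is not part of pyrecdp.

```python
import json
from pathlib import Path


def read_jsonl_partition(partition_dir):
    """Yield one dict per non-empty line from every data file in partition_dir.

    Hypothetical helper for illustration only. Assumes plain-text JSONL files.
    """
    for path in sorted(Path(partition_dir).iterdir()):
        # Skip subdirectories and marker/hidden files (names starting with "_" or ".").
        if not path.is_file() or path.name.startswith(("_", ".")):
            continue
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, `list(read_jsonl_partition("output/tiny_c4_sample_jsonl/language=eng_Latn"))` would collect the English records into a list of dicts, which can then be passed to `pd.DataFrame` for display.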
