## Install pyrecdp from GitHub

In [None]:
! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## Install a JDK so PySpark can run

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
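
When `JAVA_HOME` is unset, importing pyrecdp reports that it falls back to `/usr/lib/jvm/java-8-openjdk-amd64/` (see the import output further down). If Java lives elsewhere, the variable can be set before the import. The following is a minimal sketch; `ensure_java_home` is a hypothetical helper, not part of pyrecdp, and the default path is assumed to match the package installed above.

```python
import os

# Assumption: this is where the openjdk-8-jre package installs Java on this image.
DEFAULT_JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64/"

def ensure_java_home(default=DEFAULT_JAVA_HOME, env=os.environ):
    """Set JAVA_HOME to `default` only if it is not already set.

    Returns the value that is in effect afterwards.
    """
    return env.setdefault("JAVA_HOME", default)
```

Calling `ensure_java_home()` before `import pyrecdp` leaves an existing `JAVA_HOME` untouched and only fills in the default when the variable is missing.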

## Prepare test data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/arxiv_sample_100.jsonl
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/github_sample_50.jsonl
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/pii_test.jsonl
!wget -P /content/test_data https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/tiny_c4_sample.jsonl
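
Before filtering, it can help to check that the downloads are valid JSON Lines files and to see which fields they carry. The sketch below is not part of pyrecdp; `summarize_jsonl` is a hypothetical helper that accepts any iterable of lines, such as an open file.

```python
import json

def summarize_jsonl(lines):
    """Return (record_count, sorted field names of the first record).

    `lines` is any iterable of strings, e.g. an open .jsonl file.
    Blank lines are skipped; a malformed line raises json.JSONDecodeError.
    """
    count = 0
    first_keys = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if count == 0:
            first_keys = sorted(record)
        count += 1
    return count, first_keys
```

For example, `with open("/content/test_data/pii_test.jsonl", encoding="utf-8") as f: print(summarize_jsonl(f))` prints the number of records and the field names of the first one.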

## Import filter functions

In [8]:
from pyrecdp.primitives.llmutils import filter_by_blocklist, filter_by_bad_words, filter_by_length, profanity_filter

JAVA_HOME is not set, use default value of /usr/lib/jvm/java-8-openjdk-amd64/




## Specify input data path and output path

In [9]:
data_dir = "/content/test_data"
out_dir = "/content/output"

## Filter out documents containing URLs from a [blacklist](https://dsi.ut-capitole.fr/blacklists/)



In [10]:
filter_by_blocklist(data_dir, out_dir)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
Download and load blocklist started ...
Download and load blocklist took 24.48501735399998 sec
Load data from josnl file started ...
Load data from josnl file took 1.9845490640000207 sec
Filter out data according to blocked domains started ...
Filter out data according to blocked domains took 19.93875635300003 sec
Completed!!
    Load total 602 documents
    Load total 4561580 blocked domains
    Removed 348 documents according to blacklist
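
The log above shows the three stages: load the blocked domains, load the documents, then drop documents that reference a blocked domain. The core check can be sketched in plain Python as follows. This is an illustration only, not pyrecdp's implementation: the URL regex, the helper names, and the decision to treat subdomains of a blocked domain as blocked are all assumptions.

```python
import re
from urllib.parse import urlparse

# Assumption: a simple pattern for http(s) URLs embedded in free text.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def extract_domains(text):
    """Return the set of lowercase hostnames of all http(s) URLs in `text`."""
    domains = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal); ignore it.
            continue
        if host:
            domains.add(host.lower())
    return domains

def is_blocked(text, blocked_domains):
    """True if any URL in `text` points at a blocked domain or a subdomain of one."""
    for host in extract_domains(text):
        parts = host.split(".")
        # Check the host itself, then each parent domain:
        # a.b.example.com -> b.example.com -> example.com
        for i in range(max(len(parts) - 1, 1)):
            if ".".join(parts[i:]) in blocked_domains:
                return True
    return False
```

Keeping the blocked domains in a `set` makes each lookup constant time, which matters when the list holds millions of entries as it does here.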


## Filter out data containing bad words
The bad words list comes from [List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words).



In [11]:
filter_by_bad_words(data_dir, out_dir)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
Load bad words list and create pattern started ...
Load bad words list and create pattern took 0.0006733719999374443 sec
Load data from josnl file started ...
Load data from josnl file took 0.9559050610000668 sec
Filter out data according to bad words started ...
Filter out data according to bad words took 3.0107793629999833 sec
Completed!!
    Load total 602 documents
    Load total 403 blocked domains
    Removed 228 documents according to blacklist
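
The log mentions that the word list is turned into a pattern before the documents are scanned. One way to build such a pattern is sketched below. It is an assumption about the general technique, not pyrecdp's actual code; the helper names and the whole-token, case-insensitive matching rule are illustrative choices.

```python
import re

def build_bad_words_pattern(bad_words):
    """Compile one case-insensitive regex matching any listed word or phrase
    as a whole token (not as part of a longer word)."""
    cleaned = [w.strip() for w in bad_words if w.strip()]
    if not cleaned:
        raise ValueError("bad_words must contain at least one non-empty entry")
    # Longest entries first so multi-word phrases are tried before their prefixes.
    cleaned.sort(key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in cleaned)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

def contains_bad_words(text, pattern):
    """True if `pattern` matches anywhere in `text`."""
    return pattern.search(text) is not None
```

Compiling a single alternation once and reusing it for every document avoids looping over the word list per document.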


## Filter out data based on length limit

In [12]:
filter_by_length(data_dir, out_dir, minimum_length=100, maximum_length=10000)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
Load data from josnl file started ...
Load data from josnl file took 1.6685367840000254 sec
Filter out data according to length limit started ...
Filter out data according to length limit took 3.836184722000098 sec
Completed!!
    Load total 602 documents
    Removed 128 documents according to length limit
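
The rule applied here can be illustrated without Spark. The sketch below assumes that length means the character count of a `text` field and that both bounds are inclusive; pyrecdp may define these details differently, and the helper names are hypothetical.

```python
def within_length(text, minimum_length=100, maximum_length=10000):
    """True if the character count of `text` lies in [minimum_length, maximum_length]."""
    return minimum_length <= len(text) <= maximum_length

def filter_records_by_length(records, text_key="text", **limits):
    """Keep only the records whose `text_key` field passes `within_length`.

    Records missing the field are treated as empty text.
    """
    return [r for r in records if within_length(r.get(text_key, ""), **limits)]
```

For instance, `filter_records_by_length(records, minimum_length=100, maximum_length=10000)` mirrors the limits used in the cell above.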


## Filter out data containing profanity
This filter mainly relies on the [alt-profanity-check](https://pypi.org/project/alt-profanity-check/) library.



In [13]:
profanity_filter(data_dir, out_dir)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
Load data from josnl file started ...
Load data from josnl file took 0.5557802509999874 sec
Filter out data containing profanity started ...
Filter out data containing profanity took 5.6580194180000944 sec
Save data started ...
Save data took 5.870488991999991 sec
Completed!!
    Load total 602 documents
    Removed 5 documents with profanity
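
A classifier-based filter of this kind usually scores each document and drops those above a threshold. The sketch below shows that step in isolation. It is not pyrecdp's code: `drop_profane` is a hypothetical helper, the 0.5 threshold is an assumption, and the scoring function is passed in so the example does not depend on any particular model.

```python
def drop_profane(texts, score_fn, threshold=0.5):
    """Keep the texts whose profanity score is below `threshold`.

    `score_fn` takes a list of strings and returns one probability per string.
    """
    texts = list(texts)
    scores = score_fn(texts)
    return [t for t, s in zip(texts, scores) if s < threshold]
```

With alt-profanity-check installed, its `predict_prob` function can serve as the scorer: `from profanity_check import predict_prob`, then `drop_profane(docs, predict_prob)`.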
