##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
```

In [1]:
%%capture
# This install is shown for reference only.
# Users and application developers should pin the tag of the latest release on PyPI.
%pip install "data-prep-toolkit-transforms[ray,pii_redactor]==1.0.0a5"

##### **** Configure the transform parameters. 
```
  --pii_redactor_entities PII_ENTITIES
                        List of PII entities to capture, for example: ["PERSON", "EMAIL_ADDRESS"]
  --pii_redactor_operator REDACTOR_OPERATOR
                        Redaction technique to apply. Two are supported: replace (default) and redact
  --pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME
                        Name of the column to which the transformed contents are written. This argument is required.
  --pii_redactor_score_threshold SCORE_THRESHOLD
                        Minimum confidence score required for an entity to be considered a match. Provide a value above 0.6
```
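To make the operator and threshold parameters concrete, here is a minimal, self-contained sketch of what they mean. It is a toy illustration and not the transform's implementation: the `Detection` class, the `apply_operator` function, and the score of 0.85 are hypothetical, and it assumes that `replace` substitutes an `<ENTITY>` placeholder (as in the sample output of this notebook), that `redact` removes the matched text, and that a detection counts as a match when its score is at least the threshold.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical record of one detected entity."""
    start: int    # index of the first character of the match
    end: int      # index one past the last character of the match
    entity: str   # entity type, e.g. "PERSON"
    score: float  # detector confidence between 0 and 1


def apply_operator(text, detections, operator="replace", score_threshold=0.6):
    """Toy version of the replace/redact operators with a score threshold."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Keep only detections that reach the minimum confidence score (assumed >=).
    kept = [d for d in detections if d.score >= score_threshold]
    # Rewrite from the end of the string so earlier offsets stay valid.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        substitute = f"<{d.entity}>" if operator == "replace" else ""
        text = text[:d.start] + substitute + text[d.end:]
    return text


text = "I am Tom Chandler"
hits = [Detection(start=5, end=17, entity="PERSON", score=0.85)]  # made-up score

replaced = apply_operator(text, hits, operator="replace")
redacted = apply_operator(text, hits, operator="redact")
ignored = apply_operator(text, hits, score_threshold=0.9)  # 0.85 is below 0.9

print(repr(replaced))
print(repr(redacted))
print(repr(ignored))
```

Raising the threshold trades recall for precision: fewer low-confidence detections are rewritten, so less text is altered but more PII may slip through.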

##### **** Import required classes and modules

In [2]:
from dpk_pii_redactor.ray.transform import PIIRedactor
from data_processing.utils import GB

##### **** Set up runtime parameters and invoke the transform

In [3]:
%%capture
PIIRedactor(
    input_folder="ray/test-data/input",
    output_folder="output",
    run_locally=True,
    num_cpus=0.8,
    memory=2 * GB,
    runtime_num_workers=3,
    runtime_creation_delay=0,
    pii_redactor_entities=["PERSON", "EMAIL_ADDRESS"],  # entity types to capture
    pii_redactor_operator="replace",  # "replace" (default) or "redact"
    pii_redactor_transformed_contents="title",  # column that receives the transformed text
).transform()

17:15:38 INFO - pipeline id pipeline_id
17:15:38 INFO - code location None
17:15:38 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}
17:15:38 INFO - actor creation delay 0
17:15:38 INFO - job details {'job category': 'preprocessing', 'job name': 'pii_redactor', 'job type': 'ray', 'job id': 'job_id'}
17:15:38 INFO - data factory data_ is using local data access: input_folder - ray/test-data/input output_folder - output
17:15:38 INFO - data factory data_ max_files -1, n_sample -1
17:15:38 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:15:38 INFO - Running locally
2025-01-16 17:15:39,562 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
17:16:09 INFO - Completed execution in 0.513 min, execution result 0


##### **** The output folder now contains the transformed parquet files.

In [4]:
import glob
glob.glob("output/*")

['output/metadata.json', 'output/test1.parquet']
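Besides the parquet file, the listing shows a `metadata.json`. The sketch below is one way to inspect both using only the standard library. The helper name `summarize_output` is hypothetical, and nothing is assumed about the metadata schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_output(folder):
    """Return the parquet file names in `folder` and the parsed metadata.json, if any."""
    root = Path(folder)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return [], None
    parquet_files = sorted(p.name for p in root.glob("*.parquet"))
    metadata = None
    metadata_path = root / "metadata.json"
    if metadata_path.is_file():
        with metadata_path.open(encoding="utf-8") as f:
            metadata = json.load(f)
    return parquet_files, metadata


files, metadata = summarize_output("output")
print(files)
if metadata is not None:
    # Pretty-print whatever the run recorded; the structure is not assumed here.
    print(json.dumps(metadata, indent=2))
```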

In [5]:
import pandas as pd
pd.read_parquet('ray/test-data/input/test1.parquet', engine='pyarrow')

Unnamed: 0,contents,doc_id
0,I am Tom Chandler,doc1
1,My website is www.tomchandler.com,doc2
2,Contact me at greek@yahoo.com,doc3


In [6]:
pd.read_parquet('output/test1.parquet', engine='pyarrow')

Unnamed: 0,detected_pii,title,contents,doc_id
0,[PERSON],I am <PERSON>,I am Tom Chandler,doc1
1,[],My website is www.tomchandler.com,My website is www.tomchandler.com,doc2
2,[EMAIL_ADDRESS],Contact me at <EMAIL_ADDRESS>,Contact me at greek@yahoo.com,doc3
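A quick way to check a run is to tally the entity types recorded in `detected_pii` and list the documents that were left unchanged. The following is a minimal sketch on hand-written records that mirror the rows shown above. The helper `count_entities` is hypothetical, and with a real DataFrame the records could be obtained with `df.to_dict("records")`.

```python
from collections import Counter


def count_entities(records, column="detected_pii"):
    """Count how often each entity type appears across all rows."""
    counts = Counter()
    for row in records:
        counts.update(row[column])
    return counts


# Records copied by hand from the output table above.
records = [
    {"doc_id": "doc1", "detected_pii": ["PERSON"]},
    {"doc_id": "doc2", "detected_pii": []},
    {"doc_id": "doc3", "detected_pii": ["EMAIL_ADDRESS"]},
]

entity_counts = count_entities(records)
untouched = [r["doc_id"] for r in records if len(r["detected_pii"]) == 0]

print(dict(entity_counts))
print(untouched)
```

Here `doc2` is reported as untouched: its text contains a website address, which is neither a `PERSON` nor an `EMAIL_ADDRESS`, the only two entity types requested in this run.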
