## PII Redactor Example Notebook


**Author**: Pooja Holkar,
**Email**: poholkar@in.ibm.com

Click the badge to open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)


### What is a PII Redactor?

A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:

- Names
- Email addresses
- Phone numbers
- Addresses
- Financial details (e.g., credit card numbers)
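To make the idea of redaction concrete, here is a minimal sketch that masks two of the PII types listed above with regular expressions. The patterns and the `redact` helper are illustrative assumptions; they are not how the pii_redactor transform detects entities, and real detection needs much more than two regexes.

```python
import re

# Illustrative patterns only (assumption): they cover common email and phone shapes,
# not every valid format.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a placeholder such as <EMAIL_ADDRESS>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


sample = "Email: jane.doe@example.com, Phone: +1 555-123-4567"
print(redact(sample))
```

Names and addresses cannot be matched reliably this way, which is why a dedicated redactor relies on trained entity-recognition models rather than patterns alone.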

### Overview of the use case
In this use case, the PII Redactor is applied to text extracted from invoices so that sensitive customer information is not exposed during processing, sharing, or storage.

 **Workflow Overview**

- **Extracting and Converting Text:** The content of the invoice, originally in PDF format, is processed using the pdf2parquet transform to extract the text and convert it into a structured Parquet file, enabling easier handling and downstream processing.

- **Redacting Sensitive Information:** The generated Parquet file serves as the input for the pii_redactor_transform. This step scans the extracted text for personally identifiable information (PII) and replaces each detected entity with a placeholder tag such as `<PERSON>` or `<EMAIL_ADDRESS>`.

- **Creating the Final Output:** After redaction, a new Parquet file is written to the **output-redacted** folder. It contains the structured data together with the redacted text, in which the detected entities are masked.
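The workflow above is a two-stage chain in which each stage reads the folder the previous one wrote. The sketch below models that with the folder names used in this notebook; the `Stage` class and `is_chained` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one transform step (not part of data-prep-kit)."""
    name: str
    input_folder: str
    output_folder: str


# Folder names are the ones used later in this notebook.
STAGES = [
    Stage("pdf2parquet", "input-data", "output"),
    Stage("pii_redactor", "output", "output-redacted"),
]


def is_chained(stages: list[Stage]) -> bool:
    """True when every stage reads from the folder the previous stage wrote to."""
    return all(prev.output_folder == nxt.input_folder
               for prev, nxt in zip(stages, stages[1:]))


for stage in STAGES:
    print(f"{stage.name}: {stage.input_folder} -> {stage.output_folder}")
print("chained:", is_chained(STAGES))
```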


### Why is PII Redaction Important?

- **Data Privacy Compliance**: Helps meet regulations such as GDPR, HIPAA, and CCPA, which require safeguarding personal information.

- **Risk Mitigation**: Reduces the risk of unauthorized access to, or misuse of, sensitive data.

- **Automation Benefits**: Simplifies and accelerates securing information in large-scale document handling.


### Pre-req: Install data-prep-kit dependencies

In [1]:
%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install 'data-prep-toolkit-transforms[pdf2parquet]==0.2.2'

### Detect the Runtime Environment

In [2]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab
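The same check can be wrapped in a small function that accepts the environment as an argument, which makes it easy to exercise without actually running in Colab. This is a sketch; the function name and the sample tag value are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def running_in_colab(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when COLAB_RELEASE_TAG is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return bool(env.get("COLAB_RELEASE_TAG"))


# The tag value below is a made-up placeholder.
print(running_in_colab({"COLAB_RELEASE_TAG": "some-release-tag"}))  # True
print(running_in_colab({}))                                         # False
```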


### Download Data if running on Google Colab

In [3]:
if RUNNING_IN_COLAB:
  !mkdir -p 'input-data'
  !wget -O 'input-data/Invoice.pdf' 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/input-data/Invoice.pdf'
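Outside Colab, where the `!mkdir` and `!wget` shell magics may not be available, the download could be done with the standard library instead. The helper below is a sketch under that assumption; `download_if_missing` and its `fetch` parameter are hypothetical names, and the download is skipped when the file already exists.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def download_if_missing(
    url: str,
    dest: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> Path:
    """Download url to dest unless the file already exists; return the destination path."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    if not target.exists():
        fetch(url, str(target))
    return target


# Example call (needs network access); INVOICE_URL stands for the URL in the wget line above:
# download_if_missing(INVOICE_URL, "input-data/Invoice.pdf")
```

Passing `fetch` as a parameter keeps the function testable without network access.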

## Step 1: Configuration

### Import necessary libraries

In [4]:
import ast
import os
import sys
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

### Create input/output directories

In [5]:
# PDFs are read from input-data; the generated Parquet is written to output
input_folder = os.path.join("input-data")
output_folder = os.path.join("output")
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
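Before launching the transform, it can help to confirm that the input folder actually contains files to process. The following sketch lists matching files by extension; `find_inputs` is a hypothetical helper, not part of data-prep-kit.

```python
from pathlib import Path


def find_inputs(folder: str, extensions=(".pdf",)) -> list[Path]:
    """Return files in folder whose suffix matches one of the extensions (case-insensitive)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    wanted = {ext.lower() for ext in extensions}
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() in wanted)


pdfs = find_inputs("input-data")
print(f"Found {len(pdfs)} PDF file(s):", [p.name for p in pdfs])
```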


### Setup runtime parameters for the transform

In [6]:
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf','.docx','.pptx','.zip']"),
    # execution info
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    # pdf2parquet params
    "pdf2parquet_double_precision": 0,
}
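Two details of the cell above are worth unpacking: `ast.literal_eval` parses a Python literal from a string, and the parameter dictionary is later turned into command-line style arguments. The sketch below shows the first with the standard library and imitates the second with `to_cli_args`, a hypothetical stand-in; the exact argument format produced by `ParamsUtils.dict_to_req` is not shown in this notebook and is assumed here.

```python
import ast


def to_cli_args(params: dict) -> list[str]:
    """Hypothetical stand-in: render each key/value pair as a --key=value argument."""
    return [f"--{key}={value}" for key, value in params.items()]


# ast.literal_eval safely evaluates a string containing a Python literal.
files_to_use = ast.literal_eval("['.pdf','.docx','.pptx','.zip']")
print(files_to_use)

args = to_cli_args({"runtime_job_id": "job_id", "pdf2parquet_double_precision": 0})
print(args)
```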

## Step 2: Invoke Pdf2Parquet transform

In [7]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())
launcher.launch()

18:47:39 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 0}
18:47:39 INFO - pipeline id pipeline_id
18:47:39 INFO - code location None
18:47:39 INFO - data factory data_ is using local data access: input_folder - input-data output_folder - output
18:47:39 INFO - data factory data_ max_files -1, n_sample -1
18:47:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']
18:47:39 INFO - orchestrator pdf2parquet started at 2024-12-22 18:47:39
18:47:39 INFO - Number of files is 1, source profile {'max_file_size': 0.03161430358886719, 'min_file_s

### Verify that the Parquet file (the redactor's input) was created in the output folder

In [8]:
import glob
glob.glob("output/*")

['output/Invoice.parquet', 'output/metadata.json']
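The listing also shows a `metadata.json` file next to the Parquet output. A minimal sketch for inspecting it follows; it assumes the file holds a JSON object, and since its schema is not documented in this notebook, only the top-level keys are printed. `load_metadata` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_metadata(folder: str) -> dict:
    """Load metadata.json from folder, or return an empty dict if the file is absent."""
    path = Path(folder) / "metadata.json"
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


metadata = load_metadata("output")
print(sorted(metadata))  # top-level keys only
```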

## Step 3: Import necessary PIIRedactor libraries

In [9]:
from pii_redactor_transform import doc_transformed_contents_cli_param
from pii_redactor_transform_python import PIIRedactorPythonTransformConfiguration


# The redactor reads the Parquet files in output and writes to output-redacted
input_folder = os.path.abspath(os.path.join(os.getcwd(), "output"))
output_folder = os.path.abspath(os.path.join(os.getcwd(), "output-redacted"))
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

## Step 4: Configure and invoke the PII Redactor transform

In [10]:
import pandas as pd
from pii_redactor_transform import PIIRedactorTransform


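# Example settings. Note: this dict is not passed to the launcher in the next
# cell, which sets only pii_redactor_transformed_contents.
config = {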
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    "operator": "replace",
    "transformed_contents": "redacted_contents",
    "score_threshold": 0.7
}
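To illustrate what settings like `entities`, `operator: "replace"`, and `score_threshold` conceptually control, the sketch below filters hand-written detections by entity type and score, then replaces the accepted spans with `<ENTITY>` tags. Everything here is an assumption made for illustration: the `Detection` type, the `apply_replace` helper, the sample scores, and the use of `>=` for the threshold comparison. It does not reproduce the transform's internals.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record: entity type, character span, confidence score."""
    entity: str
    start: int
    end: int
    score: float


def apply_replace(text, detections, entities, score_threshold):
    """Replace accepted spans with <ENTITY>, working right to left so offsets stay valid."""
    accepted = [d for d in detections
                if d.entity in entities and d.score >= score_threshold]
    for d in sorted(accepted, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity}>" + text[d.end:]
    return text


text = "Customer Name: Jane Doe, City: Springfield"
name_start = text.index("Jane Doe")
city_start = text.index("Springfield")
detections = [
    Detection("PERSON", name_start, name_start + len("Jane Doe"), 0.95),
    Detection("LOCATION", city_start, city_start + len("Springfield"), 0.60),  # below 0.7
]
print(apply_replace(text, detections, ["PERSON", "LOCATION"], 0.7))
```

With a threshold of 0.7 only the name is masked; lowering the threshold to 0.5 would mask the city as well.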

In [11]:
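params = {
    # name of the output column that receives the redacted text
    "pii_redactor_transformed_contents": "new_contents",
    # local input/output folders defined in Step 3
    "data_local_config": local_conf,
}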
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=PIIRedactorPythonTransformConfiguration())
launcher.launch()

18:47:46 INFO - pipeline id pipeline_id
18:47:46 INFO - code location None
18:47:46 INFO - data factory data_ is using local data access: input_folder - /Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/GITHUBCOPY/poojalocalupdated/data-prep-kit/examples/notebooks/PII/output output_folder - /Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/GITHUBCOPY/poojalocalupdated/data-prep-kit/examples/notebooks/PII/output-redacted
18:47:46 INFO - data factory data_ max_files -1, n_sample -1
18:47:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:47:46 INFO - orchestrator pii_redactor started at 2024-12-22 18:47:46
18:47:46 INFO - Number of files is 1, source profile {'max_file_size': 0.012392044067382812, 'min_file_size': 0.012392044067382812, 'total_file_size': 0.012392044067382812}
18:47:46 INFO - Loading model from flair/ner-english-large


2024-12-22 18:47:59,263 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


18:48:00 INFO - Completed 1 files (100.0%) in 0.007 min
18:48:00 INFO - Done processing 1 files, waiting for flush() completion.
18:48:00 INFO - done flushing in 0.0 sec
18:48:00 INFO - Completed execution in 0.225 min, execution result 0


0

## Step 5: Display the redacted output in a readable format

In [12]:
data = pd.read_parquet('output-redacted/Invoice.parquet')
print(data["new_contents"][0])
print(data["detected_pii"][0])

<ORGANIZATION>.

Invoice Details:

Invoice Number: INV-2024-001

Invoice Date: November 15, 2024

Invoice Date: November 15, 2024

Due Date: November 30, 2024

Billing Information:

Customer Name: <PERSON>

Customer Name: <PERSON>

Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704

Email: <EMAIL_ADDRESS>

Phone: <PHONE_NUMBER>

Shipping Information:

Recipient Name: <PERSON>

Recipient Name: <PERSON>

Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704

## Item Details:

| Description               | Quantity   | Unit Price   | Total                               |
|---------------------------|------------|--------------|-------------------------------------|
| MacBook Air (13-inch, M2) | 1          | $999.00      | $999.00                             |
| 1                         |            | $199.00      | <ORGANIZATION>+ for MacBook Air  $199.00 |

## INVOICE

Transaction ID: 9876543210ABCDE

Notes:

Thank you for your purchase!

For assistance, please contac
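A quick way to summarize a redacted document like the one printed above is to count the placeholder tags it contains. The sketch below does this with a regular expression on a short sample string; `count_placeholders` is a hypothetical helper, and it assumes placeholders are written as upper-case names in angle brackets, as in the output shown.

```python
import re
from collections import Counter

# Matches tags such as <PERSON> or <EMAIL_ADDRESS>; lower-case HTML tags are ignored.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def count_placeholders(text: str) -> Counter:
    """Count placeholder tags such as <PERSON> in redacted text."""
    return Counter(PLACEHOLDER.findall(text))


sample = "Customer Name: <PERSON>\nEmail: <EMAIL_ADDRESS>\nRecipient Name: <PERSON>"
print(count_placeholders(sample))
```

In the notebook, the same function could be applied to `data["new_contents"][0]`.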


### This notebook demonstrated how to convert a PDF invoice to Parquet and redact the PII entities detected in it