Extracting Text from PDF and Configuring PII Redactor

What is a PII Redactor?
A PII (Personally Identifiable Information) Redactor is a tool or system designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:

Names
Email addresses
Phone numbers
Physical or shipping addresses
Financial details (e.g., credit card numbers)

Use Case in This Project
In this project, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.

Workflow Overview

1. Text Extraction: The text of the invoice (a PDF document in this case) is extracted using the pdfplumber library.
2. Redactor Configuration: The system is configured to recognize specific PII entities relevant to invoices, such as customer names, email addresses, phone numbers, and shipping addresses.
3. PII Detection and Redaction: The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.
4. Output: The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.
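
The four stages above can be sketched end to end with the standard library alone. This is a simplified illustration, not the project's implementation: it swaps the NER-based redactor for two deliberately simple regular expressions, works on a made-up sample string instead of a PDF, and the names `PATTERNS` and `redact_with_patterns` are assumptions introduced for this example.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs far more than this.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
}


def redact_with_patterns(text):
    """Return (redacted_text, labels), replacing each match with <LABEL>."""
    matches = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), label))
    matches.sort()  # order detections by position in the text
    labels = [label for _, _, label in matches]
    # Replace from the end of the text so earlier offsets stay valid.
    # Overlapping matches are not handled in this sketch.
    for start, end, label in reversed(matches):
        text = text[:start] + f"<{label}>" + text[end:]
    return text, labels


# Made-up sample standing in for text extracted from an invoice.
sample = "Email: jane.doe@example.com\nPhone: 555-123-4567"
redacted, labels = redact_with_patterns(sample)
print(redacted)
print(labels)
```

The returned label list plays the role of the audit summary in stage 4: it records what was found without exposing the original values.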

Why is PII Redaction Important?

Data Privacy Compliance: Supports compliance with regulations such as GDPR, HIPAA, or CCPA, which require safeguarding personal information.
Risk Mitigation: Reduces the risk of unauthorized access to, or misuse of, sensitive data.
Automation Benefits: Simplifies and accelerates securing information in large-scale document handling.


In [8]:
import pdfplumber
#from data_processing.transform.table_transform import AbstractTableTransform
#from data_processing.transform import AbstractTableTransform, TransformConfiguration
from pii_redactor_transform import PIIRedactorTransform


Setup: Set the PDF Path and Install Dependencies

In [9]:

pdf_path = "/Users/poojaholkar/Downloads/Invoice.pdf"  # Replace with the path to your PDF

In [None]:
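# Uncomment to install the dependencies. The %pip magic installs into the notebook's kernel.
# %pip install flair
# %pip install spacy
# %pip install presidio_anonymizer==2.2.355
# %pip install numpy==1.26.4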


In [10]:
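# Remove the installed NumPy so that a different version can be installed.
# The cells below need NumPy (via flair), so reinstall a compatible version before continuing.
!pip uninstall numpy --yes
#!pip install numpy==1.19.3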


Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4


Step 1: Extract Text from PDF

This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string.

In [11]:
with pdfplumber.open(pdf_path) as pdf:
    # extract_text() may return None for a page without a text layer, so fall back to "".
    text = "\n".join((page.extract_text() or "") for page in pdf.pages)
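
Before redacting, it can be useful to check that every page actually yielded text, since image-only (scanned) pages give the redactor nothing to scan. A minimal sketch, assuming only that each page object exposes `extract_text()`; the helper `pages_without_text` and the stand-in `FakePage` class are hypothetical and exist only so the example runs without a PDF file.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extracted text is missing or blank."""
    return [
        number
        for number, page in enumerate(pages, start=1)
        if not (page.extract_text() or "").strip()
    ]


class FakePage:
    """Stand-in for a pdfplumber page, used only for this illustration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("INVOICE"), FakePage(None), FakePage("   ")]
empty_pages = pages_without_text(pages)
print(empty_pages)
```

With pdfplumber, the same helper could be called as `pages_without_text(pdf.pages)` inside the `with pdfplumber.open(pdf_path) as pdf:` block.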



Step 2: Configure the PII Redactor



This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text.

In [12]:

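config = {
    # Entity types to detect and redact
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    # Replace each detected entity with a placeholder such as <PERSON>
    "operator": "replace",
    # Name of the output field that holds the redacted text
    "transformed_contents": "redacted_contents",
    # Minimum confidence score for a detection to be redacted
    "score_threshold": 0.6
}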
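
To illustrate what `entities` and `score_threshold` control, the sketch below filters a list of candidate detections the way a threshold-based recognizer typically does. The `Detection` tuple, the sample candidates and scores, and `filter_detections` are hypothetical, and keeping scores at or above the threshold is an assumption; the transform's internal logic may differ.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    entity_type: str
    text: str
    score: float


def filter_detections(detections, entities, score_threshold):
    """Keep detections of a configured type whose score meets the threshold."""
    allowed = set(entities)
    return [
        d for d in detections
        if d.entity_type in allowed and d.score >= score_threshold
    ]


example_config = {
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    "score_threshold": 0.6,
}

# Made-up candidates for illustration only.
candidates = [
    Detection("PERSON", "Jane Doe", 0.95),
    Detection("LOCATION", "Apt", 0.40),            # dropped: score below 0.6
    Detection("ORGANIZATION", "Apple Inc.", 0.90),  # dropped: type not configured
    Detection("EMAIL_ADDRESS", "jane.doe@example.com", 1.0),
]

kept = filter_detections(
    candidates,
    example_config["entities"],
    example_config["score_threshold"],
)
print([d.entity_type for d in kept])
```

Raising the threshold trades missed PII for fewer false positives, so it is worth tuning against a few representative invoices.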

Step 3: Initialize the PII Redactor


This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text.

In [13]:

redactor = PIIRedactorTransform(config)


20:33:16 INFO - Loading model from flair/ner-english-large


2024-11-24 20:33:33,105 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
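
The tag dictionary in the log above follows the BIOES scheme: B, I, and E mark the beginning, inside, and end of a multi-token entity, S marks a single-token entity, and O marks tokens outside any entity. The sketch below shows how such tags can be grouped into entity spans. The function `decode_bioes` and the sample tokens are made up for illustration; flair performs its own decoding internally.

```python
def decode_bioes(tokens, tags):
    """Group BIOES-tagged tokens into (entity_type, text) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            spans.append((etype, token))
            current, current_type = [], None
        elif prefix == "B":
            # Start a new multi-token entity
            current, current_type = [token], etype
        elif prefix in ("I", "E") and current and etype == current_type:
            current.append(token)
            if prefix == "E":
                spans.append((etype, " ".join(current)))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any partial span
            current, current_type = [], None
    return spans


# Made-up sentence and tags for illustration.
tokens = ["Jane", "Doe", "lives", "in", "Springfield"]
tags = ["B-PER", "E-PER", "O", "O", "S-LOC"]
spans = decode_bioes(tokens, tags)
print(spans)
```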


Step 4: Apply the Redactor to Text Data


This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities.

In [14]:

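# _redact_pii is underscore-prefixed (private by convention), so its signature may change between versions.
redacted_text, detected_entities = redactor._redact_pii(text)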



Step 5: Display the Redaction Results


This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.


In [15]:
# Step 5: Print the Results
print("Redacted Text:\n", redacted_text)
print("Detected Entities:\n", detected_entities)

Redacted Text:
 INVOICE
Apple Inc.
Invoice Details:
Invoice Number: INV-2024-001
Invoice Date: November 15, 2024
Due Date: November 30, 2024
Billing Information:
Customer Name: <PERSON>
Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704
Email: <EMAIL_ADDRESS>
Phone: <PHONE_NUMBER>
Shipping Information:
Recipient Name: <PERSON>
Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704
Item Details:
Description Quantity Unit Price Total
MacBook Air (13-inch, M2) 1 $999.00 $999.00
AppleCare+ for MacBook Air 1 $199.00 $199.00
Subtotal: $1,198.00
Tax (8%): $95.84
Total Amount Due: $1,293.84
Payment Method: Credit Card (Visa)
Transaction ID: 9876543210ABCDE
Notes:
Thank you for your purchase!
For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.
Detected Entities:
 ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']
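
For auditing, the flat list of labels can be condensed into per-type counts. A minimal sketch using the standard library's `collections.Counter`; the list is copied from the output shown above, and the variable names `labels` and `summary` are chosen for this example.

```python
from collections import Counter

# Labels copied from the "Detected Entities" output above.
labels = [
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PHONE_NUMBER",
]

summary = Counter(labels)
for entity_type, count in summary.most_common():
    print(f"{entity_type}: {count}")
print("Total redactions:", sum(summary.values()))
```

In the notebook itself, `Counter(detected_entities)` would give the same summary without retyping the list.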
