## Extracting Text from a PDF and Configuring the PII Redactor


**Author**: Pooja Holkar,
**Email**: poholkar@in.ibm.com

Click the link to open the notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)


### What is a PII Redactor?

A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:

- Names
- Email addresses
- Phone numbers
- Addresses
- Financial details (e.g., credit card numbers)

### Overview of the use case
In this use case, the PII Redactor is applied to text extracted from invoices so that sensitive customer information is not exposed during processing, sharing, or storage.

 **Workflow Overview**

The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.

 **Redactor Configuration**

The system is configured to recognize specific PII entities relevant to invoices, such as:

- Customer names
- Email addresses
- Phone numbers
- Shipping addresses

 **PII Detection and Redaction**

The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.

 **Output**

The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.
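
To make the "replace with placeholders" idea concrete, here is a minimal sketch that uses two regular expressions instead of the transform's own detectors. The patterns, the `redact` helper, and the sample sentence are assumptions made for illustration; they are not part of data-prep-kit.

```python
import re

# Illustrative patterns only; production detectors (NER models, validators) are far more robust.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each match with <ENTITY_TYPE>; return (redacted_text, found_types)."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        text, count = pattern.subn(f"<{entity_type}>", text)
        found.extend([entity_type] * count)
    return text, found


sample = "Contact Jane at jane.doe@example.com or +1 555-010-9999."
redacted, found = redact(sample)
print(redacted)  # Contact Jane at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
print(found)     # ['EMAIL_ADDRESS', 'PHONE_NUMBER']
```

The name "Jane" is left untouched: names and locations cannot be matched reliably by patterns, which is why a model-based detector is used for them.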

### Why is PII Redaction Important?

 **Data Privacy Compliance**: Helps meet regulations such as GDPR, HIPAA, and CCPA, which mandate safeguarding personal information.

 **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.

 **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.


### Prerequisite: Install data-prep-kit dependencies

In [1]:
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[all]==0.2.2'
!pip install pdfplumber 
!pip install flair 
!pip install spacy 
!pip install presidio_analyzer 
!pip install presidio_anonymizer==2.2.355

Collecting numpy<1.29.0 (from data-prep-toolkit==0.2.2)
  Using cached numpy-1.26.4-cp310-cp310-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting argparse (from data-prep-toolkit==0.2.2)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached numpy-1.26.4-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse, numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
blis 1.0.1 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
Successfully installed 
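
The log above shows pip downgrading numpy from 2.0.2 to 1.26.4 and reporting conflicts with blis and thinc. After installing, it can help to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; the helper name `installed_version` is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names as used in the pip commands above.
for name in ["data-prep-toolkit", "pdfplumber", "flair", "spacy",
             "presidio_analyzer", "presidio_anonymizer", "numpy"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```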

In [2]:
import pdfplumber
from pii_redactor_transform import PIIRedactorTransform


### Step 1: Inspect the Data 

We will use a simple invoice PDF:

[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)

In [3]:
!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'

--2024-12-06 19:24:29--  https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33150 (32K) [application/octet-stream]
Saving to: ‘Invoice.pdf.1’


2024-12-06 19:24:30 (744 KB/s) - ‘Invoice.pdf.1’ saved [33150/33150]
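
Note that wget saved this copy as `Invoice.pdf.1`, which it does when a file named `Invoice.pdf` already exists in the working directory; the cells that follow read `Invoice.pdf`. One way to avoid piling up numbered copies on re-runs is to download only when the file is missing. This is a minimal sketch using the standard library; `download_if_missing` is a hypothetical helper, not part of data-prep-kit.

```python
import os
import urllib.request


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return True if a download happened."""
    if os.path.exists(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# Uncomment to fetch the invoice (requires network access):
# download_if_missing(
#     "https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf",
#     "Invoice.pdf",
# )
```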



In [4]:
pdf_path = "Invoice.pdf"  # path to the downloaded invoice

### Step 2: Extract Text from PDF

This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string.

In [5]:
with pdfplumber.open(pdf_path) as pdf:
    # extract_text() may return None for a page without a text layer, so fall back to "".
    text = "\n".join((page.extract_text() or "") for page in pdf.pages)
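
Depending on the pdfplumber version, `extract_text()` may return `None` or an empty string for a page that has no extractable text (for example, a scanned image), and `"\n".join` fails on `None`. The sketch below shows a defensive join. `FakePage` is a hypothetical stand-in for a pdfplumber page so that the example runs without a PDF file.

```python
def join_page_texts(pages):
    """Join the text of every page, treating pages without extractable text as empty."""
    return "\n".join((page.extract_text() or "") for page in pages)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Invoice #1"), FakePage(None), FakePage("Total: 10")]
print(repr(join_page_texts(pages)))  # 'Invoice #1\n\nTotal: 10'
```

With a real document, `join_page_texts(pdf.pages)` can be called inside the `with pdfplumber.open(...)` block.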



### Step 3: Configure the PII Redactor



This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text.

In [6]:

config = {
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    "operator": "replace",
    "transformed_contents": "redacted_contents",
    "score_threshold": 0.6
}
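
The sketch below illustrates how the `entities` and `score_threshold` settings plausibly interact with the `replace` operator: detections are kept only if their type was requested and their confidence reaches the threshold, and each kept span is swapped for a placeholder. The detections, offsets, scores, and helper names are invented for illustration, and the "keep if score >= threshold" rule is an assumption about the underlying behavior, not a statement of the transform's implementation.

```python
ALLOWED = {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"}
THRESHOLD = 0.6

text = "Bill to John Doe, jd@example.com, due 2024-12-06"

# Hypothetical detections, invented for illustration; not output from the notebook.
detections = [
    {"entity_type": "PERSON", "start": 8, "end": 16, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 18, "end": 32, "score": 1.0},
    {"entity_type": "LOCATION", "start": 0, "end": 4, "score": 0.4},     # below threshold
    {"entity_type": "DATE_TIME", "start": 38, "end": 48, "score": 0.95},  # type not requested
]


def select(detections, allowed, threshold):
    """Keep detections whose type is requested and whose score reaches the threshold."""
    return [d for d in detections
            if d["entity_type"] in allowed and d["score"] >= threshold]


def replace_spans(text, detections):
    """Replace each span with <ENTITY_TYPE>, right to left so earlier offsets stay valid."""
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        text = text[:d["start"]] + f"<{d['entity_type']}>" + text[d["end"]:]
    return text


kept = select(detections, ALLOWED, THRESHOLD)
print(replace_spans(text, kept))  # Bill to <PERSON>, <EMAIL_ADDRESS>, due 2024-12-06
```

Raising the threshold trades missed PII for fewer false positives; lowering it does the opposite.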

### Step 4: Initialize and Run the PII Redactor


This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text.

In [None]:

redactor = PIIRedactorTransform(config)


19:24:30 INFO - Loading model from flair/ner-english-large


### Step 5: Apply the Redactor to Text Data


This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities.

In [None]:

# _redact_pii is an internal (underscore-prefixed) method, called directly here for a quick demo.
redacted_text, detected_entities = redactor._redact_pii(text)



### Step 6: Display the Redaction Results


This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.


In [None]:
# Print the redacted text and the detected entities
print("Redacted Text:\n", redacted_text)
print("Detected Entities:\n", detected_entities)
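
For auditing, a per-type count is often easier to read than the raw list. The sketch below assumes that `detected_entities` is an iterable of entity-type strings; if the transform returns richer objects, the counting key would need to be adapted. The example list is hypothetical and is not output from this notebook.

```python
from collections import Counter


def summarize_entities(entity_types):
    """Count how often each entity type was detected."""
    return Counter(entity_types)


# Hypothetical value standing in for `detected_entities`.
example_entities = ["PERSON", "EMAIL_ADDRESS", "PERSON", "PHONE_NUMBER"]
for entity_type, count in summarize_entities(example_entities).most_common():
    print(f"{entity_type}: {count}")
```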

<br>
<br>

### Summary

This notebook demonstrated how to extract text from a PDF invoice with pdfplumber and redact PII entities from it with the PII Redactor transform.