# AWS Comprehend

This notebook explores [AWS Comprehend](https://aws.amazon.com/blogs/machine-learning/detecting-and-redacting-pii-using-amazon-comprehend/) to identify `Personally Identifiable Information` (a.k.a. `PII`) entities.


> ℹ️ To learn more, check the [Comprehend Examples](https://docs.aws.amazon.com/code-library/latest/ug/python_3_comprehend_code_examples.html).

> ⚠️ Requires the `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` environment variables.
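
Before calling the API, it can help to confirm that both variables are actually visible to the kernel. The following is a minimal sketch, not part of the original notebook: `missing_aws_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration. Note that `boto3` can also pick up credentials from other sources (for example a shared credentials file), so a missing variable here is a hint rather than proof that authentication will fail.

```python
import os

# Variable names taken from the requirement above
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("AWS credentials found in the environment.")
```

If the credentials live in a `.env` file, run this check after `load_dotenv()` so the values have been loaded into the process environment.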

In [1]:
# Install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Install python deps
!uv pip install --system boto3 python-dotenv

downloading uv 0.6.10 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!
Using Python 3.11.7 environment at: /usr
Resolved 7 packages in 112ms
Prepared 2 packages in 5ms
Installed 2 packages in 1ms
 + jmespath==1.0.1
 + six==1.17.0


## 📚 Data

In [2]:
from pathlib import Path

DATA_DIR = Path("/datasets/client-data-us/")

# rglob already searches recursively, so the "**/" prefix is not needed
md_docs = list(DATA_DIR.rglob("*.md"))
print(f"Total markdown documents: {len(md_docs)}")

# read_text() opens and closes the file, so no handle is left open
text = md_docs[0].read_text()
print(text[:200])

Total markdown documents: 60
#### **MASTER LICENSE AGREEMENT**

**Between Customer and Supply Chain Consultants, Inc. d/b/a Arkieva Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 Telephone: 302-738-9215 Fed ID/TIN: 
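
Long contracts may exceed the size limit that Comprehend places on the `Text` argument of `detect_pii_entities`. That limit is believed to be 100 KB of UTF-8 and is worth confirming against the API reference. The sketch below is an illustration under that assumption: `chunk_text` is a hypothetical helper, not part of the notebook or of `boto3`, and it splits on line boundaries so that entities are less likely to be cut in half.

```python
def chunk_text(text, max_bytes=100_000):
    """Split text on line boundaries into chunks of roughly max_bytes UTF-8 bytes.

    A single line longer than max_bytes is kept whole, so such a chunk
    can still exceed the limit and would need further splitting.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would pass the limit
        if current and size + line_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


# Tiny limit used only to make the splitting visible
print(chunk_text("a\nb\nc\n", max_bytes=4))
```

Offsets returned for a chunk are relative to that chunk, so they would need to be shifted by the chunk's starting position before being applied to the full document.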


In [13]:
import boto3
from dotenv import load_dotenv


load_dotenv()


def detect_and_redact_pii(text):
    # Initialize the Comprehend client
    comprehend_client = boto3.client(
        'comprehend', region_name='us-east-1'
    )  # Replace with your AWS region

    # Call detect_pii_entities
    response = comprehend_client.detect_pii_entities(
        Text=text,
        LanguageCode='en'
    )

    # Extract PII entities
    pii_entities = response['Entities']

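    # Redact in reverse order so that replacing one span does not shift the
    # offsets of spans still to be processed (assumes entities are ordered by offset)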
    for entity in pii_entities[::-1]:
        entity_type = entity['Type']
        begin_offset = entity['BeginOffset']
        end_offset = entity['EndOffset']
        # Replace the PII entity with a placeholder
        text = text[:begin_offset] + f'[{entity_type} REDACTED]' + text[end_offset:]

    return text, pii_entities


# Detect and redact PII
# This call must run at least once: `detected_pii` is used below
redacted_text, detected_pii = detect_and_redact_pii(text)

# print("Redacted Text:")
# print(redacted_text)
print("\nDetected PII Entities:")
for entity in detected_pii:
    print(f"Type: {entity['Type']}, "
          f"Text: '{text[entity['BeginOffset']:entity['EndOffset']]}', "
          f"Confidence: {entity['Score']:.2f}")


Detected PII Entities:
Type: NAME, Text: 'Arkieva Linden', Confidence: 0.80
Type: ADDRESS, Text: '5460 Fairmont Drive Wilmington, DE 19808', Confidence: 1.00
Type: PHONE, Text: '302-738-9215', Confidence: 1.00
Type: ADDRESS, Text: '1256 North McLean Boulevard, Memphis Tennessee 38108-1241', Confidence: 1.00
Type: ADDRESS, Text: 'U.S.A.', Confidence: 0.78
Type: PHONE, Text: '(901) 278-0330', Confidence: 1.00
Type: ADDRESS, Text: 'Delaware', Confidence: 0.99
Type: NAME, Text: 'Sujit K. Singh', Confidence: 1.00
Type: DATE_TIME, Text: 'August 17, 2012', Confidence: 1.00
Type: DATE_TIME, Text: 'August 17 , 2012', Confidence: 1.00
Type: NAME, Text: 'Sujit K. Singh', Confidence: 1.00
Type: NAME, Text: 'COO, Arkieva', Confidence: 0.50
Type: DATE_TIME, Text: 'August 17, 2012', Confidence: 0.89


In [17]:
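# Strings are immutable, so a plain assignment is enough (no copy needed)
redacted_text = text

# Redact in reverse order so that replacing one span does not shift the
# offsets of spans still to be processed (assumes entities are ordered by offset)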
for entity in detected_pii[::-1]:
    entity_type = entity['Type']
    begin_offset = entity['BeginOffset']
    end_offset = entity['EndOffset']
    # Replace the PII entity with a placeholder
    redacted_text = redacted_text[:begin_offset] + f'[{entity_type} REDACTED]' + redacted_text[end_offset:]
    
    
print(redacted_text)

#### **MASTER LICENSE AGREEMENT**

**Between Customer and Supply Chain Consultants, Inc. d/b/a [NAME REDACTED] Green Center [ADDRESS REDACTED] Telephone: [PHONE REDACTED] Fed ID/TIN: 51-035 0007**

Customer Name: BUCKMAN LABORATORIES INTERNATIONAL, INC.

Address: [ADDRESS REDACTED], [ADDRESS REDACTED] Telephone: [PHONE REDACTED]

ATTN:

| 1.0  | DEFINITIONS<br>3             |
|------|------------------------------|
| 2.0  | LICENSE3                     |
| 3.0  | USE3                         |
| 4.0  | PAYMENT<br>4                 |
| 5.0  | SOFTWARE MAINTENANCE4        |
| 6.0  | WARRANTIES AND REMEDIES<br>4 |
| 7.0  | INDEMNITY5                   |
| 8.0  | LIMITATION OF LIABILITY5     |
| 9.0  | OWNERSHIP6                   |
| 10.0 | TERMINATION OF DISTRIBUTOR6  |
| 11.0 | CONFIDENTIALITY<br>6         |
| 12.0 | TERMINATION7                 |
| 13.0 | QUALITY CONTROL7             |
| 14.0 | ASSIGNMENT7                  |
| 15.0 | U.S. EXPORT RESTRICTIONS7    |
| 16.0 | MISCELLANEOU