peterhubina/anonymization

README: Advanced Data Anonymizer with Custom Entity Recognition

This project is a data anonymization tool, implemented as a Jupyter notebook (mask.ipynb), that processes PDF files and redacts personally identifiable information (PII) using Presidio, PyMuPDF, and Tesseract OCR. It supports multiple anonymization approaches and custom entity recognition.


Prerequisites

  1. Python Version: Ensure you have Python 3.9 or later installed.

  2. Tesseract OCR: Required for image redaction and OCR processing.

    • macOS: Install via Homebrew:
      brew install tesseract
    • Ubuntu: Install via APT:
      sudo apt install tesseract-ocr
    • Windows: Download and install from Tesseract GitHub.
  3. Jupyter Notebook: Required to run the main implementation.

    pip install jupyter

Installation

  1. Clone the Repository:

    git clone <repository-url>
    cd anonymization  # the directory git clone creates; the folder tree in this README calls it anonymize/
  2. Create a Virtual Environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Dependencies: All required packages are installed within the notebook. The main dependencies include:

    • presidio-analyzer and presidio-anonymizer
    • presidio-image-redactor
    • PyMuPDF (imported in code as fitz)
    • pytesseract
    • python-docx
    • opencv-python
    • spacy with en_core_web_lg model

Folder Structure

The project uses the following folder structure:

anonymize/
├── mask.ipynb              # Main Jupyter notebook with all implementations
├── original_files/         # Input PDF files to be anonymized
│   ├── dokumentacia.pdf    # Sample PDF file
│   └── test.pdf            # Sample PDF file
├── anonymized_files/       # Output folder for anonymized content
├── results/                # Additional output folder
│   └── static/             # Subfolder for processed files
├── requirements.txt        # Python dependencies
└── readme.md               # This file

Use Cases Implemented

1. Custom Entity Recognition

The notebook demonstrates three approaches to custom entity recognition:

a) List-based Recognition (Deny List)

  • Detects entities from predefined lists (e.g., titles like "Mr.", "Dr.", "Professor")
  • Example: Title recognition for formal addresses
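
As a library-free illustration of what a deny-list recognizer does, the sketch below matches a fixed list of titles with Python's `re` module. It is not the notebook's Presidio code; the `TITLES` list and the `find_titles` name are hypothetical and only reuse the three example titles mentioned above.

```python
import re

# Hypothetical deny list, taken from the example titles in this README.
TITLES = ["Mr.", "Dr.", "Professor"]

# Escape each entry and join into one alternation. Longer entries go first so
# that a longer title wins over a shorter one sharing the same prefix.
_alternatives = "|".join(
    re.escape(title) for title in sorted(TITLES, key=len, reverse=True)
)
# Reject matches that are glued to a word character on either side.
TITLE_RE = re.compile(rf"(?<!\w)(?:{_alternatives})(?!\w)")


def find_titles(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every deny-list hit."""
    return [(m.start(), m.end(), m.group()) for m in TITLE_RE.finditer(text)]


for hit in find_titles("Dr. Novak met Professor Kral and Mr. Smith."):
    print(hit)
```

The character offsets returned here play the same role as the start/end positions a recognizer reports to the anonymizer.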

b) Regex Pattern Recognition

  • Uses regular expressions to detect structured data
  • Example: Employee ID recognition (EMP-12345)
  • Supports confidence scoring for pattern matches
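
One way to picture regex recognition with confidence scoring is a list of patterns, each carrying its own score, where a stricter pattern earns more confidence than a looser one. The sketch below uses only the standard library; the `ScoredPattern` class, the two patterns, and their scores are assumptions for illustration, not values taken from the notebook.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPattern:
    """A regex paired with the confidence assigned to its matches."""
    name: str
    regex: str
    score: float


# Hypothetical patterns built around the EMP-12345 example format.
# The looser pattern is more ambiguous, so it gets a lower score.
EMPLOYEE_ID_PATTERNS = [
    ScoredPattern("strict", r"\bEMP-\d{5}\b", 0.9),
    ScoredPattern("loose", r"(?i)\bemp[- ]?\d{5}\b", 0.4),
]


def find_employee_ids(text: str, min_score: float = 0.0) -> list[dict]:
    """Return one hit per span, keeping the highest score seen for that span."""
    best: dict[tuple[int, int], float] = {}
    for pattern in EMPLOYEE_ID_PATTERNS:
        for match in re.finditer(pattern.regex, text):
            span = match.span()
            best[span] = max(best.get(span, 0.0), pattern.score)
    return [
        {"start": start, "end": end, "text": text[start:end], "score": score}
        for (start, end), score in sorted(best.items())
        if score >= min_score
    ]


sample = "Contact EMP-12345 or emp 54321."
print(find_employee_ids(sample))                 # every hit
print(find_employee_ids(sample, min_score=0.5))  # only confident hits
```

Raising `min_score` trades recall for precision, which is the tuning knob the confidence score exists to provide.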

c) Rule-based Recognition with Custom Logic

  • Implements custom logic using spaCy NLP features
  • Example: Number detection using token analysis
  • Allows for complex entity detection beyond simple patterns

2. Text Extraction and Multi-format Output

PDF Text Extraction with Dual Output

  • Extracts text from PDFs using OCR
  • Anonymizes detected entities using Presidio
  • Saves results in both formats:
    • .txt file for plain text output
    • .docx file with proper formatting, timestamps, and document structure
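
The anonymization step listed above comes down to replacing each detected span with a placeholder before the text is written out. The following is a minimal standard-library sketch of that replacement, assuming the findings are non-overlapping `(start, end, entity_type)` tuples; the `replace_spans` name and the sample findings are hypothetical, and the notebook itself delegates this work to Presidio.

```python
def replace_spans(text: str, findings: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with <ENTITY_TYPE>.

    Assumes the spans do not overlap.
    """
    result = text
    # Process spans from the end of the string backwards so that the
    # offsets of spans not yet replaced remain valid.
    for start, end, entity_type in sorted(findings, reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result


text = "Jana Novak, jana@example.com"
# Hand-written findings standing in for analyzer output.
findings = [(0, 10, "PERSON"), (12, 28, "EMAIL_ADDRESS")]
print(replace_spans(text, findings))
```

The resulting string is what would then be saved to the .txt file and used as the body of the .docx document.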

3. Image-based PDF Processing

a) PNG Conversion Approach

  • Converts PDF pages to PNG images
  • Processes each PNG with Presidio Image Redactor
  • Reconstructs anonymized PDF from masked PNGs
  • Preserves original layout and visual appearance

b) In-Memory Image Processing

  • Same functionality as the PNG approach, but images are kept in memory
  • More efficient: no intermediate files are written to disk
  • Faster processing for large documents
  • Converts PDF → image → mask → PDF in a single workflow

4. Comprehensive PDF Processing

Text + Image Extraction with Organized Output

  • Extracts both text and embedded images from PDFs
  • Processes text through OCR and Presidio anonymization
  • Processes images through OCR + visual redaction
  • Organized output structure:
    • Anonymized text saved as formatted .docx document
    • Masked images saved as separate files in organized folders
    • Maintains page-by-page organization

5. Development Approaches (In Progress)

Advanced PDF Reconstruction

  • Extracts text and images from PDFs
  • Applies anonymization to both content types
  • Attempts to reconstruct a new PDF with:
    • Preserved text selectability
    • Proper image embedding
    • Original document structure

Status: Under development. Text overflow and image positioning issues are still being resolved.


Key Features

  1. Multiple Entity Types:

    • Built-in entities: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, etc.
    • Custom entities: Employee IDs, license plates, document codes
    • Configurable confidence thresholds
  2. Flexible Output Formats:

    • Plain text files (.txt)
    • Formatted Word documents (.docx)
    • Reconstructed PDFs with visual masking
    • Organized image collections
  3. Advanced Processing Options:

    • OCR-based text extraction from image-heavy PDFs
    • Visual redaction with customizable colors
    • Batch processing capabilities
    • Memory-efficient processing options
  4. Custom Entity Configuration:

    • Context-aware entity detection
    • Multiple pattern support per entity
    • Keyword-based recognition
    • Confidence scoring and adjustment
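
Context-aware detection and score adjustment can be sketched as follows: a weak match gets a higher score when a related keyword appears nearby. This is a standard-library illustration under stated assumptions; the function name, the 40-character window, and the 0.25 boost are hypothetical and are not the values Presidio or the notebook uses.

```python
import re


def boost_with_context(
    text: str,
    start: int,
    end: int,
    score: float,
    context_words: list[str],
    window: int = 40,
    boost: float = 0.25,
) -> float:
    """Raise the score if a context word occurs within `window` characters
    before or after the matched span; otherwise return it unchanged."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    nearby_tokens = set(re.findall(r"\w+", f"{before} {after}".lower()))
    if nearby_tokens & {word.lower() for word in context_words}:
        return min(1.0, score + boost)
    return score


context = ["employee", "badge"]
with_context = "Employee badge 54321 was issued."
without_context = "Order 54321 shipped."

# The five-digit number sits at offsets 15-20 and 6-11 respectively.
print(boost_with_context(with_context, 15, 20, 0.5, context))
print(boost_with_context(without_context, 6, 11, 0.5, context))
```

A bare five-digit number is ambiguous on its own, so a low base score plus a context boost keeps false positives down while still catching labeled identifiers.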

How to Use

  1. Start Jupyter Notebook:

    jupyter notebook mask.ipynb
  2. Place Input Files:

    • Add PDF files to the original_files/ folder
    • The notebook includes sample files for testing
  3. Run Notebook Cells:

    • Execute the "Prerequisites" cell to install dependencies
    • Run the "Imports" cell to load required libraries
    • Choose and execute the specific use case cells you need
  4. Select Your Use Case:

    • Text Extraction: Use the "Save extracted text as DOCX" section
    • Image Processing: Use the "In memory PDF masking" section
    • Comprehensive Processing: Use the "Extract Text, save as DOCX, extract images" section
    • Custom Entities: Configure using the "Custom PII entity" section
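
For running any of these use cases over every file in original_files/, a small helper can pair each input PDF with an output path. This is a sketch only: the `plan_batch` name and the `_anonymized` file suffix are hypothetical, and the notebook may name its outputs differently.

```python
from pathlib import Path


def plan_batch(
    input_dir: str = "original_files",
    output_dir: str = "anonymized_files",
) -> list[tuple[Path, Path]]:
    """Pair every PDF in input_dir with a target path in output_dir.

    Nothing is created or modified; the caller decides what to do per pair.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    return [
        (pdf, out_dir / f"{pdf.stem}_anonymized.pdf")
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]


for source, target in plan_batch():
    print(f"{source} -> {target}")
```

Each `(source, target)` pair can then be passed to whichever masking routine from the notebook fits the document.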

Configuration Options

  • Entity Types: Modify the entity list in analyzer calls
  • Masking Colors: Change RGB values in redaction functions
  • OCR Quality: Adjust DPI settings for image processing
  • Output Paths: Customize folder structures and file names
  • Custom Patterns: Add regex patterns for domain-specific entities

Performance Notes

  • In-memory processing is recommended for better performance
  • High DPI settings (e.g., 300) improve OCR accuracy but slow down processing
  • Custom entities may require confidence threshold tuning
  • Large PDFs benefit from batch processing approaches
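
The DPI trade-off can be estimated with simple arithmetic: PDF page sizes are given in points (72 per inch), so the rendered pixel count grows with the square of the DPI. The sketch below assumes an A4 page of 595 × 842 points and an uncompressed 8-bit RGB bitmap; actual memory use in the notebook depends on the libraries involved.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def rgb_bytes(width_pt: float, height_pt: float, dpi: int) -> int:
    """Approximate size of the uncompressed RGB bitmap (3 bytes per pixel)."""
    width_px, height_px = raster_size(width_pt, height_pt, dpi)
    return width_px * height_px * 3


A4 = (595, 842)  # assumed page size in points
for dpi in (150, 300):
    width_px, height_px = raster_size(*A4, dpi)
    mib = rgb_bytes(*A4, dpi) / 2**20
    print(f"{dpi} DPI: {width_px}x{height_px} px, about {mib:.1f} MiB per page")
```

Doubling the DPI roughly quadruples the per-page bitmap size, which is why lowering the DPI or processing pages one at a time helps with memory pressure.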

Troubleshooting

  • OCR Issues: Ensure Tesseract is properly installed and in PATH
  • Memory Issues: Use lower DPI settings or process pages individually
  • Entity Detection: Adjust confidence thresholds or add context words
  • Output Formatting: Check file permissions in output directories
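
For the OCR item above, a quick check from Python shows whether the `tesseract` executable is visible on PATH. This sketch uses only the standard library; the helper names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_path() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if absent."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output.strip() else ""


path = tesseract_path()
if path is None:
    print("tesseract was not found on PATH")
else:
    print(f"found {path}: {tesseract_version(path)}")
```

If the executable is installed but not found, adding its install directory to PATH (or restarting the Jupyter kernel after installation) is the usual next step.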

Future Development

  • Improved PDF reconstruction with selectable text
  • Additional output formats (HTML, XML)
  • Batch processing interface
  • API endpoint for programmatic access
  • Support for additional document formats
