GitHub - peterhubina/anonymization

README: Advanced Data Anonymizer with Custom Entity Recognition

This project is an advanced data anonymizer, implemented as a Jupyter notebook (mask.ipynb), that processes PDF files and redacts personally identifiable information (PII) using Presidio, PyMuPDF, and Tesseract OCR. It supports several anonymization approaches and custom entity recognition.

Prerequisites

Python Version: Ensure you have Python 3.9 or later installed.
Tesseract OCR: Required for image redaction and OCR processing.
- macOS: Install via Homebrew:
```
brew install tesseract
```
- Ubuntu: Install via APT:
```
sudo apt install tesseract-ocr
```
- Windows: Download and install from Tesseract GitHub.
Jupyter Notebook: Required to run the main implementation.
```
pip install jupyter
```

Installation

Clone the Repository:
```
git clone <repository-url>
cd anonymize
```

Create a Virtual Environment:

```
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install Dependencies: All required packages are installed within the notebook. The main dependencies include:
- presidio-analyzer and presidio-anonymizer
- presidio-image-redactor
- PyMuPDF (imported as fitz)
- pytesseract
- python-docx
- opencv-python
- spaCy with the en_core_web_lg model

Folder Structure

The project uses the following folder structure:

anonymize/
├── mask.ipynb              # Main Jupyter notebook with all implementations
├── original_files/         # Input PDF files to be anonymized
│   ├── dokumentacia.pdf    # Sample PDF file
│   └── test.pdf            # Sample PDF file
├── anonymized_files/       # Output folder for anonymized content
├── results/                # Additional output folder
│   └── static/             # Subfolder for processed files
├── requirements.txt        # Python dependencies
└── readme.md               # This file

Use Cases Implemented

1. Custom Entity Recognition

The notebook demonstrates three approaches to custom entity recognition:

a) List-based Recognition (Deny List)

Detects entities from predefined lists (e.g., titles like "Mr.", "Dr.", "Professor")
Example: Title recognition for formal addresses
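
A deny-list recognizer of this kind can be sketched with Presidio's `PatternRecognizer`. The entity label `TITLE` and the helper names below are illustrative assumptions, not names taken from the notebook. Presidio is imported lazily so the file loads even where it is not installed.

```python
# Titles to flag; the values come from the examples above.
TITLES = ["Mr.", "Dr.", "Professor"]


def build_title_recognizer():
    """Return a Presidio recognizer that flags every entry of TITLES."""
    # Lazy import: only needed when the recognizer is actually built.
    from presidio_analyzer import PatternRecognizer

    # "TITLE" is an assumed entity label; choose whatever suits your data.
    return PatternRecognizer(supported_entity="TITLE", deny_list=TITLES)


def find_titles(text):
    """Return (start, end, matched_text) for each title found in text."""
    recognizer = build_title_recognizer()
    results = recognizer.analyze(text=text, entities=["TITLE"])
    return [(r.start, r.end, text[r.start:r.end]) for r in results]
```

Calling `find_titles("Dr. Novak met Professor Lee")` runs the recognizer on its own, without a full analyzer or spaCy model.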

b) Regex Pattern Recognition

Uses regular expressions to detect structured data
Example: Employee ID recognition (EMP-12345)
Supports confidence scoring for pattern matches
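
As a minimal sketch, an employee-ID recognizer could pair a regular expression with a fixed confidence score using Presidio's `Pattern` and `PatternRecognizer`. The exact ID format (five digits), the score of 0.85, the entity label, and the context words are assumptions made for illustration.

```python
import re

# Assumed format: "EMP-" followed by exactly five digits, e.g. EMP-12345.
EMPLOYEE_ID_REGEX = r"\bEMP-\d{5}\b"


def contains_employee_id(text):
    """Quick stdlib check of the regex, independent of Presidio."""
    return re.search(EMPLOYEE_ID_REGEX, text) is not None


def build_employee_id_recognizer(score=0.85):
    """Return a Presidio recognizer for the assumed employee-ID format."""
    # Lazy import: only needed when the recognizer is actually built.
    from presidio_analyzer import Pattern, PatternRecognizer

    pattern = Pattern(name="employee_id", regex=EMPLOYEE_ID_REGEX, score=score)
    return PatternRecognizer(
        supported_entity="EMPLOYEE_ID",  # assumed entity label
        patterns=[pattern],
        # Assumed context words; inside an AnalyzerEngine they may raise
        # the score of a match when they appear near it.
        context=["employee", "badge", "staff"],
    )
```

The recognizer can then be registered on an existing analyzer with `analyzer.registry.add_recognizer(build_employee_id_recognizer())`.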

c) Rule-based Recognition with Custom Logic

Implements custom logic using spaCy NLP features
Example: Number detection using token analysis
Allows for complex entity detection beyond simple patterns
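
A rule-based recognizer along these lines might subclass Presidio's `EntityRecognizer` and inspect the spaCy tokens that the analyzer passes in. This is a sketch under assumptions: the `NUMBER` label, the 0.7 score, and the helper names are illustrative and not taken from the notebook.

```python
def find_number_spans(tokens):
    """Return (start, end) character spans of number-like tokens.

    Works on any objects exposing the spaCy Token attributes
    `like_num`, `idx` and `text`.
    """
    return [(t.idx, t.idx + len(t.text)) for t in tokens if t.like_num]


def build_numbers_recognizer(score=0.7):
    """Return a Presidio recognizer that tags number-like tokens."""
    # Lazy import: only needed when the recognizer is actually built.
    from presidio_analyzer import EntityRecognizer, RecognizerResult

    class NumbersRecognizer(EntityRecognizer):
        def load(self):
            # Nothing to load: tokens are supplied by the analyzer's NLP engine.
            pass

        def analyze(self, text, entities, nlp_artifacts):
            return [
                RecognizerResult(entity_type="NUMBER", start=s, end=e, score=score)
                for s, e in find_number_spans(nlp_artifacts.tokens)
            ]

    return NumbersRecognizer(supported_entities=["NUMBER"])
```

Keeping the token logic in a plain function makes it testable without loading a spaCy model; the recognizer class only wraps the spans in `RecognizerResult` objects.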

2. Text Extraction and Multi-format Output

PDF Text Extraction with Dual Output

Extracts text from PDFs using OCR
Anonymizes detected entities using Presidio
Saves results in both formats:
- .txt file for plain text output
- .docx file with proper formatting, timestamps, and document structure
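
The anonymize-then-save part of this flow can be sketched as follows, assuming the text has already been extracted. The function names, the heading text, and the timestamp format are hypothetical; the Presidio and python-docx calls use the libraries' default settings.

```python
from datetime import datetime
from pathlib import Path


def anonymize_text(text, language="en"):
    """Detect PII with Presidio's defaults and return the anonymized text."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    results = AnalyzerEngine().analyze(text=text, language=language)
    return AnonymizerEngine().anonymize(text=text, analyzer_results=results).text


def save_txt(text, out_dir, stem):
    """Write text to <out_dir>/<stem>.txt and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stem}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def save_docx(text, out_dir, stem):
    """Write text to <out_dir>/<stem>.docx with a heading and a timestamp."""
    from docx import Document  # provided by python-docx

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    doc = Document()
    doc.add_heading(f"Anonymized text: {stem}", level=1)
    doc.add_paragraph(f"Generated {datetime.now():%Y-%m-%d %H:%M:%S}")
    # One Word paragraph per blank-line-separated block of text.
    for block in text.split("\n\n"):
        if block.strip():
            doc.add_paragraph(block.strip())
    path = out_dir / f"{stem}.docx"
    doc.save(str(path))
    return path
```

A typical call sequence would be `clean = anonymize_text(raw)` followed by `save_txt(clean, "anonymized_files", "test")` and `save_docx(clean, "anonymized_files", "test")`.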

3. Image-based PDF Processing

a) PNG Conversion Approach

Converts PDF pages to PNG images
Processes each PNG with Presidio Image Redactor
Reconstructs anonymized PDF from masked PNGs
Preserves original layout and visual appearance

b) In-Memory Image Processing

Same functionality as the PNG approach, but images are handled in memory
More efficient: no intermediate files are written to disk
Faster processing for large documents
Converts PDF → Image → Mask → PDF in a single workflow
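
The in-memory workflow above could look roughly like the sketch below. It assumes Pillow is available (it is normally installed alongside presidio-image-redactor); the function name and default arguments are hypothetical, and page rotation is ignored.

```python
import io


def mask_pdf_in_memory(src_path, dst_path, dpi=300, fill=(0, 0, 0)):
    """Render each page, redact detected PII, and rebuild a PDF without temp files."""
    # Lazy imports: heavy third-party dependencies.
    import fitz  # PyMuPDF
    from PIL import Image
    from presidio_image_redactor import ImageRedactorEngine

    redactor = ImageRedactorEngine()
    src = fitz.open(src_path)
    out = fitz.open()  # new, empty PDF

    for page in src:
        # PDF page -> PNG bytes -> PIL image, all in memory.
        pix = page.get_pixmap(dpi=dpi)
        image = Image.open(io.BytesIO(pix.tobytes("png")))

        # OCR, PII detection, and drawing of filled boxes over matches.
        masked = redactor.redact(image, fill=fill)

        # PIL image -> PNG bytes -> page of the same size in the output PDF.
        buf = io.BytesIO()
        masked.save(buf, format="PNG")
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        new_page.insert_image(new_page.rect, stream=buf.getvalue())

    out.save(dst_path)
    out.close()
    src.close()
```

Because every page becomes an image, the output PDF keeps the original layout but its text is no longer selectable.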

4. Comprehensive PDF Processing

Text + Image Extraction with Organized Output

Extracts both text and embedded images from PDFs
Processes text through OCR and Presidio anonymization
Processes images through OCR + visual redaction
Organized output structure:
- Anonymized text saved as formatted .docx document
- Masked images saved as separate files in organized folders
- Maintains page-by-page organization
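
One way to get the page-by-page image layout is sketched below with PyMuPDF. The folder and file naming scheme (`page_003/image_01.png`) is an assumption for illustration, and the masking step itself is left out.

```python
from pathlib import Path


def image_output_path(out_dir, page_number, image_index, ext):
    """Build <out_dir>/page_NNN/image_NN.<ext> from 1-based numbers."""
    return Path(out_dir) / f"page_{page_number:03d}" / f"image_{image_index:02d}.{ext}"


def extract_images(pdf_path, out_dir):
    """Save every embedded image of a PDF, grouped into one folder per page."""
    import fitz  # PyMuPDF, imported lazily

    saved = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for image_index, info in enumerate(page.get_images(full=True), start=1):
                xref = info[0]  # first field is the image's cross-reference number
                data = doc.extract_image(xref)  # dict with raw bytes and extension
                target = image_output_path(out_dir, page_number, image_index, data["ext"])
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(data["image"])
                saved.append(target)
    return saved
```

Each saved file can then be passed through the image redactor before or instead of being written in its original form.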

5. Development Approaches (In Progress)

Advanced PDF Reconstruction

Extracts text and images from PDFs
Applies anonymization to both content types
Attempts to reconstruct a new PDF with:
- Preserved text selectability
- Proper image embedding
- Original document structure

Status: Under development. Text overflow and image positioning issues are still being resolved.

Key Features

Multiple Entity Types:
- Built-in entities: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, etc.
- Custom entities: Employee IDs, license plates, document codes
- Configurable confidence thresholds
Flexible Output Formats:
- Plain text files (.txt)
- Formatted Word documents (.docx)
- Reconstructed PDFs with visual masking
- Organized image collections
Advanced Processing Options:
- OCR-based text extraction from image-heavy PDFs
- Visual redaction with customizable colors
- Batch processing capabilities
- Memory-efficient processing options
Custom Entity Configuration:
- Context-aware entity detection
- Multiple pattern support per entity
- Keyword-based recognition
- Confidence scoring and adjustment

How to Use

Start Jupyter Notebook:
```
jupyter notebook mask.ipynb
```
Place Input Files:
- Add PDF files to the original_files/ folder
- The notebook includes sample files for testing
Run Notebook Cells:
- Execute the "Prerequisites" cell to install dependencies
- Run the "Imports" cell to load required libraries
- Choose and execute the specific use case cells you need
Select Your Use Case:
- Text Extraction: Use the "Save extracted text as DOCX" section
- Image Processing: Use the "In memory PDF masking" section
- Comprehensive Processing: Use the "Extract Text, save as DOCX, extract images" section
- Custom Entities: Configure using the "Custom PII entity" section

Configuration Options

Entity Types: Modify the entity list in analyzer calls
Masking Colors: Change RGB values in redaction functions
OCR Quality: Adjust DPI settings for image processing
Output Paths: Customize folder structures and file names
Custom Patterns: Add regex patterns for domain-specific entities
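
These options could be gathered in one place, for example in a small dataclass. Everything here is a hypothetical sketch: the class name, field names, and default values are assumptions rather than settings read from the notebook.

```python
from dataclasses import dataclass


@dataclass
class RedactionConfig:
    """Assumed bundle of the tunable settings listed above."""
    entities: tuple = ("PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER")
    score_threshold: float = 0.4          # minimum confidence to keep a match
    fill_rgb: tuple = (0, 0, 0)           # masking color, black by default
    dpi: int = 300                        # rendering resolution for OCR
    input_dir: str = "original_files"
    output_dir: str = "anonymized_files"


def analyze_with_config(text, config):
    """Run Presidio restricted to the configured entities and threshold."""
    from presidio_analyzer import AnalyzerEngine  # lazy import

    return AnalyzerEngine().analyze(
        text=text,
        language="en",
        entities=list(config.entities),
        score_threshold=config.score_threshold,
    )
```

A faster, lower-resolution run would then be `RedactionConfig(dpi=150)`, leaving the other defaults unchanged.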

Performance Notes

In-memory processing is recommended because it avoids writing intermediate PNG files to disk
High DPI settings (e.g. 300) improve OCR accuracy but slow processing and increase memory use
Custom entities may require confidence threshold tuning
Large PDFs benefit from batch processing approaches
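
The DPI trade-off can be made concrete with a rough estimate of the uncompressed size of one rendered page. This is a back-of-the-envelope sketch that assumes 3 bytes per pixel (RGB) and an A4 page of 595 x 842 points; actual memory use depends on the libraries involved.

```python
def rendered_page_bytes(width_pt, height_pt, dpi, channels=3):
    """Approximate uncompressed size in bytes of one page rendered at `dpi`."""
    # PDF units are points: 72 points per inch.
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px * height_px * channels


A4 = (595, 842)  # page size in points
for dpi in (150, 300):
    mib = rendered_page_bytes(*A4, dpi) / 2**20
    print(f"{dpi} dpi: about {mib:.1f} MiB per A4 page")
```

Doubling the DPI roughly quadruples the pixel count, which is why halving it is an effective first step when memory runs short.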

Troubleshooting

OCR Issues: Ensure Tesseract is properly installed and in PATH
Memory Issues: Use lower DPI settings or process pages individually
Entity Detection: Adjust confidence thresholds or add context words
Output Formatting: Check file permissions in output directories
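
For the OCR case, a quick check that Tesseract is reachable from Python can be sketched with the standard library alone. The function name is hypothetical.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    output = proc.stdout or proc.stderr
    lines = output.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "Tesseract not found on PATH")
```

If Tesseract is installed but not on PATH, pytesseract can be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.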

Future Development

Improved PDF reconstruction with selectable text
Additional output formats (HTML, XML)
Batch processing interface
API endpoint for programmatic access
Support for additional document formats
