kilchwein/Redactly

Redactly

Local, privacy-first PDF redaction. Detect and remove personal information from PDFs — text-based and scanned — without any data leaving your device. Redaction is reversible via an encrypted key file only you control.

Built for the Privacy & Open Source AI Tools challenge track.


Features

  • Hybrid PII detection — regex patterns for structured data (IBANs, SSNs, emails, phones, ...) + multilingual BERT NER for names, locations, and organizations
  • True PDF redaction — PII is removed from the PDF content stream, not just drawn over
  • Image & scanned PDF support — embedded images are OCR'd with Tesseract and redacted at the pixel level
  • Reversible — an encrypted .gocalma key file lets the document owner restore original values at any time
  • 5 European languages — English, German, French, Italian, Spanish out of the box
  • Optional external LLM — connect Ollama, OpenAI, LM Studio, or any OpenAI-compatible provider as an alternative NER backend
  • Zero data transmission — everything runs on localhost, nothing phones home

Quick Start

macOS / Linux:

./start.sh

Windows:

start.bat

The script creates a virtual environment, installs dependencies, and opens the app at http://localhost:8000. The NER model (~680 MB) downloads automatically on first launch.

OCR support (for scanned PDFs):

OS       Install
macOS    brew install tesseract
Linux    apt install tesseract-ocr
Windows  UB Mannheim installer

Manual Setup

python3 -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
cd src && uvicorn server:app --host 127.0.0.1 --port 8000   # bind to loopback only, so the app is not reachable from the network

How It Works

Upload PDF
    │
    ▼
┌──────────────────────────────┐
│  Text Extraction (PyMuPDF)   │──── per-word bounding boxes
│  Image OCR (Tesseract)       │──── embedded images extracted + OCR'd
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  PII Detection               │
│  ├─ Regex: emails, phones,   │
│  │  IBANs, AHV, SSNs, etc.  │
│  ├─ NER: names, locations,   │
│  │  orgs (multilingual BERT) │
│  └─ Priority merge + dedup   │
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  User Review                 │──── toggle entities on/off in the UI
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  Redaction                   │
│  ├─ Text: true removal from  │
│  │  PDF content streams      │
│  └─ Images: black boxes over │
│     PII in embedded images   │
└──────────────────────────────┘
    │
    ▼
  redacted.pdf  +  key.gocalma (AES-256-GCM encrypted)

Un-redaction: upload the .gocalma key file and enter your password to view the original values.
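
A password-protected key file of this kind can be sketched with the cryptography package, using PBKDF2 for key derivation and AES-256-GCM for encryption. Everything specific here is an assumption for illustration: the function names, the salt + nonce + ciphertext layout, the JSON payload, SHA-256 as the PBKDF2 hash, and the iteration count. The real .gocalma format may differ.

```python
import json
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

ITERATIONS = 600_000  # assumed work factor, not taken from the project


def _derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte (256-bit) AES key from the password."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=ITERATIONS)
    return kdf.derive(password.encode("utf-8"))


def encrypt_mapping(mapping: dict, password: str) -> bytes:
    """Encrypt a placeholder -> original-value mapping."""
    salt = os.urandom(16)
    nonce = os.urandom(12)  # 96-bit nonce, the usual size for GCM
    plaintext = json.dumps(mapping).encode("utf-8")
    ciphertext = AESGCM(_derive_key(password, salt)).encrypt(nonce, plaintext, None)
    return salt + nonce + ciphertext


def decrypt_mapping(blob: bytes, password: str) -> dict:
    """Reverse encrypt_mapping; raises InvalidTag on a wrong password."""
    salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
    plaintext = AESGCM(_derive_key(password, salt)).decrypt(nonce, ciphertext, None)
    return json.loads(plaintext)
```

Because GCM is authenticated, a wrong password or a tampered file fails loudly instead of yielding garbage values.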


PII Detection

Detection runs two independent passes over the extracted text, then merges the results.

Pass 1 — Regex (instant, deterministic)

Pattern-based detection for structured PII with validation where possible:

Type           Example                      Validation
Email          max@example.com              Format check
Phone          +41 79 123 45 67             7–15 digits, rejects date-like strings
IBAN           CH93 0076 2011 6238 5295 7   Country code + check digits
AHV/AVS        756.1234.5678.90             Swiss social security format
Credit card    4111 1111 1111 1111          Luhn checksum
Insurance no.  INS-CH-550-229-104           Prefixed identifier patterns
Passport       XK0002147                    1–2 letters + 6–9 digits
SSN            123-45-6789                  US format
Date of birth  15.03.1985, 1986-05-29       Valid day/month/year ranges
IP address     192.168.1.1                  Dotted quad
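
Two of the validations in the table are standard checksums and can be sketched in a few lines. The function names luhn_valid and iban_valid are illustrative, and the length limits are assumptions; the project's detector.py may validate differently.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed minimum length for a card number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check-digit test."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Both example values from the table pass: luhn_valid("4111 1111 1111 1111") and iban_valid("CH93 0076 2011 6238 5295 7") return True, while changing the last digit of either makes the check fail.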

Pass 2 — NER (ML model, multilingual)

A fine-tuned multilingual BERT model (Davlan/bert-base-multilingual-cased-ner-hrl) detects unstructured PII that regex can't catch — person names, locations, and organizations across EN, DE, FR, IT, ES and more. Text is chunked at 512 tokens with proper word-boundary splits.
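
Chunking on word boundaries can be sketched as below. This is an assumption-laden illustration: chunk_text and count_tokens are hypothetical names, and the default counter treats each whitespace-separated word as one token. A real implementation would count subword tokens with the model's tokenizer and reserve room for special tokens.

```python
import re
from typing import Callable, Iterator


def chunk_text(
    text: str,
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda word: 1,
) -> Iterator[tuple[int, str]]:
    """Yield (start_offset, chunk) pairs without ever splitting a word."""
    start = None  # character offset where the current chunk begins
    end = 0       # character offset just past the last word in the chunk
    used = 0      # tokens consumed by the current chunk
    for match in re.finditer(r"\S+", text):
        cost = count_tokens(match.group())
        # Close the current chunk if this word would exceed the budget.
        if start is not None and used + cost > max_tokens:
            yield start, text[start:end]
            start, used = None, 0
        if start is None:
            start = match.start()
        end = match.end()
        used += cost
    if start is not None:
        yield start, text[start:end]
```

Returning the start offset with each chunk lets entity offsets reported per chunk be shifted back into document coordinates. A single word that costs more than max_tokens ends up alone in an oversized chunk, which a real implementation would need to handle.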

Alternatively, you can enable an external LLM (Ollama, OpenAI, LM Studio, or any OpenAI-compatible endpoint) via the settings gear icon in the UI. The LLM replaces only the NER pass; the regex pass always runs.

Priority Merge

When both passes flag the same or overlapping text spans, the merger resolves the conflict by entity type:

Priority 10  AHV/AVS, IBAN, SSN, Credit card     highly specific identifiers
Priority  9  Passport, Email, Insurance, Patient ID
Priority  8  Phone
Priority  6  Person name (NER)
Priority  5  Location, Address (NER)
Priority  4  Organization (NER)
Priority  3  Date of birth

Higher priority wins. At equal priority, higher confidence wins. This ensures a Swiss AHV number is never misclassified as a phone number, and an IBAN isn't partially consumed by a phone match.
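
The rule can be sketched as a greedy merge: rank all candidates by priority and confidence, then keep each one only if it does not overlap something already kept. The label strings, the Entity class, and the greedy strategy are assumptions for illustration; only the priority numbers come from the table above.

```python
from dataclasses import dataclass

# Priorities from the table above; the label spellings are hypothetical.
PRIORITY = {
    "AHV": 10, "IBAN": 10, "SSN": 10, "CREDIT_CARD": 10,
    "PASSPORT": 9, "EMAIL": 9, "INSURANCE": 9, "PATIENT_ID": 9,
    "PHONE": 8, "PERSON": 6, "LOCATION": 5, "ADDRESS": 5,
    "ORGANIZATION": 4, "DATE_OF_BIRTH": 3,
}


@dataclass(frozen=True)
class Entity:
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    label: str
    score: float  # detector confidence in [0, 1]


def merge(entities):
    """Keep the best entity for every group of overlapping spans."""
    ranked = sorted(
        entities,
        key=lambda e: (-PRIORITY.get(e.label, 0), -e.score, e.start),
    )
    kept = []
    for e in ranked:
        # Accept only if it overlaps nothing that outranks it.
        if all(e.end <= k.start or e.start >= k.end for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e.start)
```

With an AHV match at characters 0–16 and a phone match at 4–16, only the AHV entity survives, because the phone span is rejected the moment it overlaps a higher-priority span.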

Coordinate Mapping

The NER model returns character offsets in plain text, but we need PDF bounding boxes for redaction. During extraction we build a char_offset → word_index map. When NER returns an entity at characters 42–56, we look up which words those characters belong to and union their bounding boxes — giving us the precise PDF rectangle to redact.
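
A minimal sketch of that lookup, assuming words are joined with single spaces when the plain text is built. Word, build_text_and_map, and span_to_bbox are hypothetical names, and the real extractor.py may store the map differently or split multi-line entities into one rectangle per line.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    text: str
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in PDF points


def build_text_and_map(words: list[Word]) -> tuple[str, list[int]]:
    """Join words with spaces and record which word owns each character."""
    parts: list[str] = []
    char_to_word: list[int] = []
    for i, word in enumerate(words):
        if i:
            parts.append(" ")
            char_to_word.append(-1)  # separator belongs to no word
        parts.append(word.text)
        char_to_word.extend([i] * len(word.text))
    return "".join(parts), char_to_word


def span_to_bbox(words, char_to_word, start: int, end: int) -> Optional[tuple]:
    """Union the boxes of all words touched by characters [start, end)."""
    indices = {char_to_word[c] for c in range(start, end)} - {-1}
    boxes = [words[i].bbox for i in sorted(indices)]
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes), min(b[1] for b in boxes),
        max(b[2] for b in boxes), max(b[3] for b in boxes),
    )
```

For the words "Name:", "Anna", "Muster", an entity covering "Anna Muster" resolves to the union of the second and third word boxes and leaves "Name:" untouched.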

For scanned PDFs, the same mapping works in pixel space: Tesseract returns per-word bounding boxes, and coordinates are translated from image pixels to PDF page points using the image's placement transform.
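
For the common case of an unrotated image, the pixel-to-page translation is a scale plus an offset. The sketch below assumes the image is placed axis-aligned in a known page rectangle; pixel_box_to_page is a hypothetical name, and rotated or sheared placements would need the full transformation matrix instead.

```python
def pixel_box_to_page(box_px, image_size, placement):
    """Map a box in image pixels to PDF page points.

    box_px:     (x0, y0, x1, y1) in pixels, as reported by OCR
    image_size: (width, height) of the image in pixels
    placement:  (x0, y0, x1, y1) rectangle the image occupies on the page
    """
    px0, py0, px1, py1 = box_px
    img_w, img_h = image_size
    x0, y0, x1, y1 = placement
    sx = (x1 - x0) / img_w  # page points per pixel, horizontally
    sy = (y1 - y0) / img_h  # page points per pixel, vertically
    return (x0 + px0 * sx, y0 + py0 * sy, x0 + px1 * sx, y0 + py1 * sy)
```

A 1000×500 pixel scan placed at (100, 200, 350, 325) has a scale of 0.25 points per pixel in both directions, so the pixel box (400, 100, 600, 140) maps to (200, 225, 250, 235) on the page.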


Tech Stack

Component                   Library
Server                      FastAPI + uvicorn
PDF extraction & redaction  PyMuPDF (content-stream removal)
NER                         HuggingFace Transformers (multilingual BERT)
External LLM (optional)     Any OpenAI-compatible API via httpx
OCR                         Tesseract via pytesseract
Encryption                  AES-256-GCM + PBKDF2 key derivation (cryptography)
Frontend                    HTML + Tailwind CSS + vanilla JS

Project Structure

src/
  server.py        FastAPI routes + static file serving
  detector.py      PII detection (regex + NER + priority merge)
  llm_detector.py  External LLM NER backend (OpenAI-compatible)
  extractor.py     PDF text extraction with word-level bounding boxes
  redactor.py      Text redaction (PyMuPDF) + image redaction (Pillow)
  crypto.py        Key file encryption / decryption (AES-256-GCM)
  ocr.py           Tesseract OCR for scanned PDFs + embedded images
  models.py        Pydantic models
static/
  index.html       Single-page dark-themed UI
  styles.css       Minimal custom CSS (animations, badges)
  app.js           Upload → review → download logic

External LLM Configuration

By default Redactly uses the local BERT model for NER. To use an external provider instead, click the gear icon in the UI or set environment variables before starting:

export GOCALMA_LLM_ENABLED=true
export GOCALMA_LLM_API_URL=http://localhost:11434/v1   # Ollama default
export GOCALMA_LLM_MODEL=llama3
# export GOCALMA_LLM_API_KEY=sk-...                    # only for remote providers
./start.sh

The UI shows a privacy warning when the API URL points to a non-local server.
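
One way such a check could work is to read the variables above and test whether the URL's host is a loopback address. This is a sketch under assumptions: load_llm_config and is_local_url are hypothetical names, the defaults are invented, and the project may parse the variables or define "local" differently.

```python
import ipaddress
import os
from urllib.parse import urlparse


def load_llm_config(env=os.environ) -> dict:
    """Read the GOCALMA_LLM_* variables; the defaults here are assumptions."""
    return {
        "enabled": env.get("GOCALMA_LLM_ENABLED", "false").lower() == "true",
        "api_url": env.get("GOCALMA_LLM_API_URL", ""),
        "model": env.get("GOCALMA_LLM_MODEL", ""),
        "api_key": env.get("GOCALMA_LLM_API_KEY"),  # None for local providers
    }


def is_local_url(url: str) -> bool:
    """Return True if the URL points at localhost or a loopback IP."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname.
        return False
```

With the Ollama default from the example, is_local_url returns True and no warning would be needed; a URL such as https://api.openai.com/v1 returns False.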


License

MIT
