Redactly

Local, privacy-first PDF redaction. Detect and remove personal information from PDFs — text-based and scanned — without any data leaving your device. Redaction is reversible via an encrypted key file only you control.

Built for the Privacy & Open Source AI Tools challenge track.

Features

Hybrid PII detection — regex patterns for structured data (IBANs, SSNs, emails, phones, ...) + multilingual BERT NER for names, locations, and organizations
True PDF redaction — PII is removed from the PDF content stream, not just drawn over
Image & scanned PDF support — embedded images are OCR'd with Tesseract and redacted at the pixel level
Reversible — an encrypted .gocalma key file lets the document owner restore original values at any time
5 European languages — English, German, French, Italian, Spanish out of the box
Optional external LLM — connect Ollama, OpenAI, LM Studio, or any OpenAI-compatible provider as an alternative NER backend
Zero data transmission — everything runs on localhost, nothing phones home

Quick Start

macOS / Linux:

./start.sh

Windows:

start.bat

The script creates a virtual environment, installs dependencies, and opens the app at http://localhost:8000. The NER model (~680 MB) downloads automatically on first launch.

OCR support (for scanned PDFs):

| OS | Install |
| --- | --- |
| macOS | `brew install tesseract` |
| Linux | `apt install tesseract-ocr` |
| Windows | UB Mannheim installer |

Manual Setup

python3 -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
cd src && uvicorn server:app --host 127.0.0.1 --port 8000   # bind to loopback so the app is reachable only from this machine

How It Works

Upload PDF
    │
    ▼
┌──────────────────────────────┐
│  Text Extraction (PyMuPDF)   │──── per-word bounding boxes
│  Image OCR (Tesseract)       │──── embedded images extracted + OCR'd
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  PII Detection               │
│  ├─ Regex: emails, phones,   │
│  │  IBANs, AHV, SSNs, etc.  │
│  ├─ NER: names, locations,   │
│  │  orgs (multilingual BERT) │
│  └─ Priority merge + dedup   │
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  User Review                 │──── toggle entities on/off in the UI
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  Redaction                   │
│  ├─ Text: true removal from  │
│  │  PDF content streams      │
│  └─ Images: black boxes over │
│     PII in embedded images   │
└──────────────────────────────┘
    │
    ▼
  redacted.pdf  +  key.gocalma (AES-256-GCM encrypted)

Un-redaction: upload the .gocalma key file + your password to view the original values.
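
The key-file round trip can be sketched as follows, using AES-256-GCM with a PBKDF2-derived key as listed under Tech Stack. This is a minimal sketch and not the project's actual crypto.py: the function names, the salt + nonce + ciphertext layout, the JSON payload, and the iteration count are all assumptions made for illustration.

```python
import json
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

ITERATIONS = 600_000  # assumed work factor, not taken from the project
SALT_LEN = 16         # assumed salt size in bytes
NONCE_LEN = 12        # 96-bit nonce, the usual size for GCM


def _derive_key(password: str, salt: bytes) -> bytes:
    """Stretch the password into a 32-byte (256-bit) AES key."""
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(), length=32, salt=salt, iterations=ITERATIONS
    )
    return kdf.derive(password.encode("utf-8"))


def encrypt_mapping(mapping: dict, password: str) -> bytes:
    """Encrypt a placeholder -> original-value mapping (hypothetical layout)."""
    salt = os.urandom(SALT_LEN)
    nonce = os.urandom(NONCE_LEN)
    key = _derive_key(password, salt)
    plaintext = json.dumps(mapping).encode("utf-8")
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return salt + nonce + ciphertext


def decrypt_mapping(blob: bytes, password: str) -> dict:
    """Reverse encrypt_mapping; raises InvalidTag on a wrong password."""
    salt = blob[:SALT_LEN]
    nonce = blob[SALT_LEN:SALT_LEN + NONCE_LEN]
    ciphertext = blob[SALT_LEN + NONCE_LEN:]
    key = _derive_key(password, salt)
    return json.loads(AESGCM(key).decrypt(nonce, ciphertext, None))
```

Because GCM is authenticated, a wrong password or a tampered file fails with `cryptography.exceptions.InvalidTag` instead of returning garbage.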

PII Detection

Detection runs two independent passes over the extracted text, then merges the results.

Pass 1 — Regex (instant, deterministic)

Pattern-based detection for structured PII with validation where possible:

| Type | Example | Validation |
| --- | --- | --- |
| Email | `max@example.com` | Format check |
| Phone | `+41 79 123 45 67` | 7–15 digits, rejects date-like strings |
| IBAN | `CH93 0076 2011 6238 5295 7` | Country code + check digits |
| AHV/AVS | `756.1234.5678.90` | Swiss social security format |
| Credit card | `4111 1111 1111 1111` | Luhn checksum |
| Insurance no. | `INS-CH-550-229-104` | Prefixed identifier patterns |
| Passport | `XK0002147` | 1–2 letters + 6–9 digits |
| SSN | `123-45-6789` | US format |
| Date of birth | `15.03.1985`, `1986-05-29` | Valid day/month/year ranges |
| IP address | `192.168.1.1` | Dotted quad |
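
To illustrate what "validation" means for two of the rows above, here is a minimal sketch of the Luhn checksum for card numbers and the mod-97 check for IBANs. The function names are hypothetical and the code in detector.py may differ; only the two checksum algorithms themselves are standard.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check-digit test."""
    compact = re.sub(r"\s+", "", iban).upper()
    # Two-letter country code, two check digits, then the account part.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the first four characters to the end, then map A=10 ... Z=35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Both example values from the table pass these checks, while changing a single digit in either one makes the check fail. That is what lets a checksum filter out digit runs that merely look like a card number or IBAN.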

Pass 2 — NER (ML model, multilingual)

A fine-tuned multilingual BERT model (Davlan/bert-base-multilingual-cased-ner-hrl) detects unstructured PII that regex can't catch — person names, locations, and organizations across EN, DE, FR, IT, ES and more. Text is chunked at 512 tokens with proper word-boundary splits.
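
The chunking step can be sketched like this. It is an illustration only: `chunk_text` and its `count_tokens` parameter are hypothetical names, and the default of one token per word is a stand-in for the model's real subword tokenizer, which usually produces more tokens than words.

```python
import re
from typing import Callable, List, Tuple


def chunk_text(
    text: str,
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda word: 1,
) -> List[Tuple[int, str]]:
    """Split text into (start_offset, chunk) pairs without cutting words."""
    chunks: List[Tuple[int, str]] = []
    start = None  # character offset where the current chunk begins
    end = 0       # character offset just past the current chunk's last word
    used = 0      # token budget consumed by the current chunk
    for match in re.finditer(r"\S+", text):
        cost = count_tokens(match.group())
        # Close the current chunk if this word would exceed the budget.
        if start is not None and used + cost > max_tokens:
            chunks.append((start, text[start:end]))
            start, used = None, 0
        if start is None:
            start = match.start()
        end = match.end()
        used += cost
    if start is not None:
        chunks.append((start, text[start:end]))
    return chunks
```

Returning the start offset with each chunk means an entity found at offset `k` inside a chunk can be mapped back to offset `start + k` in the full text. A single word that costs more than `max_tokens` ends up in a chunk of its own in this sketch.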

Alternatively, you can enable an external LLM (Ollama, OpenAI, LM Studio, or any OpenAI-compatible endpoint) via the settings gear icon in the UI. The LLM replaces the NER pass; regex always runs regardless.

Priority Merge

When both passes flag the same text span, the merger resolves overlaps:

Priority 10  AHV/AVS, IBAN, SSN, Credit card     highly specific identifiers
Priority  9  Passport, Email, Insurance, Patient ID
Priority  8  Phone
Priority  6  Person name (NER)
Priority  5  Location, Address (NER)
Priority  4  Organization (NER)
Priority  3  Date of birth

Higher priority wins. At equal priority, higher confidence wins. This ensures a Swiss AHV number is never misclassified as a phone number, and an IBAN isn't partially consumed by a phone match.
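
The rule above can be sketched as a greedy merge: rank all candidates by priority and then confidence, and keep a candidate only if it does not overlap one already kept. The `Entity` fields and the `merge_entities` name are assumptions for illustration, not the project's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    label: str
    priority: int
    confidence: float


def merge_entities(candidates: List[Entity]) -> List[Entity]:
    """Resolve overlapping spans: higher priority wins, then confidence."""
    ranked = sorted(candidates, key=lambda e: (-e.priority, -e.confidence))
    kept: List[Entity] = []
    for cand in ranked:
        # Two half-open spans overlap when each starts before the other ends.
        overlaps = any(cand.start < k.end and k.start < cand.end for k in kept)
        if not overlaps:
            kept.append(cand)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda e: e.start)
```

With this scheme, a phone match that covers part of an AHV number is dropped because the AHV span is ranked first and already occupies those characters.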

Coordinate Mapping

The NER model returns character offsets in plain text, but we need PDF bounding boxes for redaction. During extraction we build a char_offset → word_index map. When NER returns an entity at characters 42–56, we look up which words those characters belong to and union their bounding boxes — giving us the precise PDF rectangle to redact.
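
A minimal sketch of that lookup, assuming words are joined with single spaces and each word carries an `(x0, y0, x1, y1)` box. The function names are hypothetical, and extractor.py may build its map differently.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Word = Tuple[str, BBox]


def build_text_and_map(words: List[Word]) -> Tuple[str, Dict[int, int]]:
    """Join words with spaces and map each character offset to a word index."""
    parts: List[str] = []
    char_to_word: Dict[int, int] = {}
    offset = 0
    for index, (word, _bbox) in enumerate(words):
        if index > 0:
            parts.append(" ")  # separator characters map to no word
            offset += 1
        for i in range(len(word)):
            char_to_word[offset + i] = index
        parts.append(word)
        offset += len(word)
    return "".join(parts), char_to_word


def span_to_bbox(
    start: int, end: int, char_to_word: Dict[int, int], words: List[Word]
) -> BBox:
    """Union the boxes of every word touched by characters [start, end)."""
    indices = {char_to_word[i] for i in range(start, end) if i in char_to_word}
    if not indices:
        raise ValueError("span does not cover any word")
    boxes = [words[i][1] for i in indices]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

One caveat of this simplified version: an entity that wraps across two lines would produce a single rectangle spanning both lines, so a real implementation would likely emit one rectangle per line.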

For scanned PDFs, the same mapping works in pixel space: Tesseract returns per-word bounding boxes, and coordinates are translated from image pixels to PDF page points using the image's placement transform.
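
The pixel-to-point translation can be sketched as a scale and offset. This assumes the simplest case, where the image is placed axis-aligned and unrotated inside a known page rectangle with a top-left origin; `pixel_bbox_to_page` and its parameters are hypothetical names, and a rotated or skewed placement would need the full transform matrix instead.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixel_bbox_to_page(
    pixel_bbox: BBox, image_size: Tuple[int, int], image_rect: BBox
) -> BBox:
    """Map a box in image pixels to page points for an axis-aligned image."""
    px0, py0, px1, py1 = pixel_bbox
    width_px, height_px = image_size
    rx0, ry0, rx1, ry1 = image_rect  # where the image sits on the page
    # Points per pixel along each axis.
    sx = (rx1 - rx0) / width_px
    sy = (ry1 - ry0) / height_px
    return (rx0 + px0 * sx, ry0 + py0 * sy, rx0 + px1 * sx, ry0 + py1 * sy)
```

For example, a 400 x 200 pixel image placed at page rectangle (100, 200, 300, 300) has a scale of 0.5 points per pixel on both axes, so a pixel box of (50, 100, 150, 200) lands at (125, 250, 175, 300) on the page.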

Tech Stack

| Component | Library |
| --- | --- |
| Server | FastAPI + uvicorn |
| PDF extraction & redaction | PyMuPDF (content-stream removal) |
| NER | HuggingFace Transformers (multilingual BERT) |
| External LLM (optional) | Any OpenAI-compatible API via httpx |
| OCR | Tesseract via pytesseract |
| Encryption | AES-256-GCM + PBKDF2 key derivation (cryptography) |
| Frontend | HTML + Tailwind CSS + vanilla JS |

Project Structure

src/
  server.py        FastAPI routes + static file serving
  detector.py      PII detection (regex + NER + priority merge)
  llm_detector.py  External LLM NER backend (OpenAI-compatible)
  extractor.py     PDF text extraction with word-level bounding boxes
  redactor.py      Text redaction (PyMuPDF) + image redaction (Pillow)
  crypto.py        Key file encryption / decryption (AES-256-GCM)
  ocr.py           Tesseract OCR for scanned PDFs + embedded images
  models.py        Pydantic models
static/
  index.html       Single-page dark-themed UI
  styles.css       Minimal custom CSS (animations, badges)
  app.js           Upload → review → download logic

External LLM Configuration

By default, Redactly uses the local BERT model for NER. To use an external provider instead, click the gear icon in the UI or set these environment variables before starting (the variables and the key file extension keep the GOCALMA / .gocalma naming):

export GOCALMA_LLM_ENABLED=true
export GOCALMA_LLM_API_URL=http://localhost:11434/v1   # Ollama default
export GOCALMA_LLM_MODEL=llama3
# export GOCALMA_LLM_API_KEY=sk-...                    # only for remote providers
./start.sh

The UI shows a privacy warning when the API URL points to a non-local server.
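
How such a locality check might work is sketched below, using only the standard library. The function names are hypothetical, and the rule used here (only `localhost` and loopback IP addresses count as local) is an assumption; the project may also treat private network ranges as local or parse the flag differently.

```python
import ipaddress
import os
from typing import Mapping
from urllib.parse import urlparse


def is_local_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as remote.
        return False


def load_llm_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the GOCALMA_LLM_* variables shown above into a settings dict."""
    api_url = env.get("GOCALMA_LLM_API_URL", "")
    return {
        "enabled": env.get("GOCALMA_LLM_ENABLED", "false").lower() == "true",
        "api_url": api_url,
        "model": env.get("GOCALMA_LLM_MODEL", ""),
        "api_key": env.get("GOCALMA_LLM_API_KEY"),
        # Flag used to decide whether to show a privacy warning.
        "is_remote": bool(api_url) and not is_local_url(api_url),
    }
```

With the Ollama default from the example above, `is_remote` is false; pointing the URL at a hosted provider flips it to true.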

License

MIT
