Local, privacy-first PDF redaction. Detect and remove personal information from PDFs — text-based and scanned — without any data leaving your device. Redaction is reversible via an encrypted key file only you control.
Built for the Privacy & Open Source AI Tools challenge track.
- Hybrid PII detection — regex patterns for structured data (IBANs, SSNs, emails, phones, ...) + multilingual BERT NER for names, locations, and organizations
- True PDF redaction — PII is removed from the PDF content stream, not just drawn over
- Image & scanned PDF support — embedded images are OCR'd with Tesseract and redacted at the pixel level
- Reversible — an encrypted .gocalma key file lets the document owner restore original values at any time
- 5 European languages — English, German, French, Italian, Spanish out of the box
- Optional external LLM — connect Ollama, OpenAI, LM Studio, or any OpenAI-compatible provider as an alternative NER backend
- Zero data transmission — everything runs on localhost, nothing phones home
macOS / Linux:
./start.sh
Windows:
start.bat
The script creates a virtual environment, installs dependencies, and opens the app at http://localhost:8000. The NER model (~680 MB) downloads automatically on first launch.
OCR support (for scanned PDFs):
| OS | Install |
|---|---|
| macOS | brew install tesseract |
| Linux | apt install tesseract-ocr |
| Windows | UB Mannheim installer |
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cd src && uvicorn server:app --host 0.0.0.0 --port 8000
Upload PDF
│
▼
┌──────────────────────────────┐
│ Text Extraction (PyMuPDF) │──── per-word bounding boxes
│ Image OCR (Tesseract) │──── embedded images extracted + OCR'd
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ PII Detection │
│ ├─ Regex: emails, phones, │
│ │ IBANs, AHV, SSNs, etc. │
│ ├─ NER: names, locations, │
│ │ orgs (multilingual BERT) │
│ └─ Priority merge + dedup │
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ User Review │──── toggle entities on/off in the UI
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Redaction │
│ ├─ Text: true removal from │
│ │ PDF content streams │
│ └─ Images: black boxes over │
│ PII in embedded images │
└──────────────────────────────┘
│
▼
redacted.pdf + key.gocalma (AES-256-GCM encrypted)
Un-redaction: upload the .gocalma key file + your password to view the original values.
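
As an illustration of the key-file step, here is a minimal sketch of password-based AES-256-GCM encryption with PBKDF2 key derivation, using the `cryptography` package. The function names, the JSON payload, the `salt + nonce + ciphertext` layout, the salt and nonce sizes, and the iteration count are all assumptions made for this example; the actual .gocalma format may differ.

```python
import json
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

SALT_LEN = 16   # assumed salt size in bytes
NONCE_LEN = 12  # 96-bit nonce, the usual choice for GCM


def _derive_key(password: str, salt: bytes, iterations: int) -> bytes:
    """Derive a 32-byte (AES-256) key from the password with PBKDF2-HMAC-SHA256."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=iterations)
    return kdf.derive(password.encode("utf-8"))


def encrypt_mapping(mapping: dict, password: str, iterations: int = 600_000) -> bytes:
    """Encrypt a placeholder -> original-value mapping (hypothetical payload)."""
    salt = os.urandom(SALT_LEN)
    nonce = os.urandom(NONCE_LEN)
    key = _derive_key(password, salt, iterations)
    ciphertext = AESGCM(key).encrypt(nonce, json.dumps(mapping).encode("utf-8"), None)
    return salt + nonce + ciphertext


def decrypt_mapping(blob: bytes, password: str, iterations: int = 600_000) -> dict:
    """Reverse encrypt_mapping; raises InvalidTag on a wrong password or tampering."""
    salt = blob[:SALT_LEN]
    nonce = blob[SALT_LEN:SALT_LEN + NONCE_LEN]
    ciphertext = blob[SALT_LEN + NONCE_LEN:]
    key = _derive_key(password, salt, iterations)
    return json.loads(AESGCM(key).decrypt(nonce, ciphertext, None))
```

Because GCM is authenticated, a wrong password fails loudly instead of returning garbage, which is the property an un-redaction step needs.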
Detection runs two independent passes over the extracted text, then merges the results.
Pattern-based detection for structured PII with validation where possible:
| Type | Example | Validation |
|---|---|---|
| Email | max@example.com | Format check |
| Phone | +41 79 123 45 67 | 7–15 digits, rejects date-like strings |
| IBAN | CH93 0076 2011 6238 5295 7 | Country code + check digits |
| AHV/AVS | 756.1234.5678.90 | Swiss social security format |
| Credit card | 4111 1111 1111 1111 | Luhn checksum |
| Insurance no. | INS-CH-550-229-104 | Prefixed identifier patterns |
| Passport | XK0002147 | 1–2 letters + 6–9 digits |
| SSN | 123-45-6789 | US format |
| Date of birth | 15.03.1985, 1986-05-29 | Valid day/month/year ranges |
| IP address | 192.168.1.1 | Dotted quad |
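
Two of the validations in the table, the Luhn checksum for credit cards and the IBAN check digits, are standard algorithms and can be sketched as follows. The function names and length limits are illustrative and are not taken from detector.py.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed lower bound for card-like numbers
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check-digit test."""
    s = "".join(iban.split()).upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    # Move country code + check digits to the end, map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

Running a checksum after the regex match is what lets a detector reject arbitrary 16-digit numbers or IBAN-shaped strings that are not real identifiers.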
A fine-tuned multilingual BERT model (Davlan/bert-base-multilingual-cased-ner-hrl) detects unstructured PII that regex can't catch — person names, locations, and organizations across EN, DE, FR, IT, ES and more. Text is chunked at 512 tokens with proper word-boundary splits.
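
The chunking step could look roughly like the sketch below. It assumes a caller-supplied `count_tokens` function (hypothetical; with a Hugging Face tokenizer something like `lambda w: len(tokenizer.tokenize(w))` would be one option) and keeps the character offset of each chunk so entity positions can be mapped back to the full text. In practice the budget would likely sit slightly below 512 to leave room for the model's special tokens.

```python
import re
from typing import Callable, List, Tuple


def chunk_words(text: str,
                count_tokens: Callable[[str], int],
                max_tokens: int = 512) -> List[Tuple[int, str]]:
    """Split `text` into (char_offset, chunk) pairs without cutting through a word.

    A single word that exceeds `max_tokens` on its own still becomes one
    (oversized) chunk; handling that case is left out of this sketch.
    """
    chunks: List[Tuple[int, str]] = []
    chunk_start = None  # char offset where the current chunk begins
    chunk_end = 0       # char offset just past the last word in the chunk
    used = 0            # tokens consumed by the current chunk
    for match in re.finditer(r"\S+", text):
        n = count_tokens(match.group())
        if chunk_start is not None and used + n > max_tokens:
            chunks.append((chunk_start, text[chunk_start:chunk_end]))
            chunk_start, used = None, 0
        if chunk_start is None:
            chunk_start = match.start()
        chunk_end = match.end()
        used += n
    if chunk_start is not None:
        chunks.append((chunk_start, text[chunk_start:chunk_end]))
    return chunks
```

An entity found at offset `k` inside a chunk then sits at `char_offset + k` in the original text.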
Alternatively, you can enable an external LLM (Ollama, OpenAI, LM Studio, or any OpenAI-compatible endpoint) via the settings gear icon in the UI. The LLM replaces the NER pass; regex always runs regardless.
When both passes flag the same text span, the merger resolves overlaps:
| Priority | Entity types |
|---|---|
| 10 | AHV/AVS, IBAN, SSN, Credit card (highly specific identifiers) |
| 9 | Passport, Email, Insurance, Patient ID |
| 8 | Phone |
| 6 | Person name (NER) |
| 5 | Location, Address (NER) |
| 4 | Organization (NER) |
| 3 | Date of birth |
Higher priority wins. At equal priority, higher confidence wins. This ensures a Swiss AHV number is never misclassified as a phone number, and an IBAN isn't partially consumed by a phone match.
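
One way to implement this rule is a greedy pass over candidates sorted by priority and confidence, sketched below. The `Entity` class, the label strings, and the greedy strategy are assumptions for illustration; the merger in detector.py may work differently.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Priorities from the list above; the label names are hypothetical.
PRIORITY = {
    "AHV": 10, "IBAN": 10, "SSN": 10, "CREDIT_CARD": 10,
    "PASSPORT": 9, "EMAIL": 9, "INSURANCE": 9, "PATIENT_ID": 9,
    "PHONE": 8, "PERSON": 6, "LOCATION": 5, "ADDRESS": 5,
    "ORG": 4, "DOB": 3,
}


@dataclass(frozen=True)
class Entity:
    start: int          # first character of the span
    end: int            # one past the last character
    label: str
    confidence: float


def merge_entities(candidates: Iterable[Entity]) -> List[Entity]:
    """Keep the best entity for each overlapping group of spans."""
    ranked = sorted(
        candidates,
        key=lambda e: (-PRIORITY.get(e.label, 0), -e.confidence, e.start),
    )
    kept: List[Entity] = []
    for cand in ranked:
        # Accept a candidate only if it overlaps nothing already accepted.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda e: e.start)
```

With this ordering, a phone match that falls inside an IBAN span is discarded because the IBAN is accepted first.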
The NER model returns character offsets in plain text, but we need PDF bounding boxes for redaction. During extraction we build a char_offset → word_index map. When NER returns an entity at characters 42–56, we look up which words those characters belong to and union their bounding boxes — giving us the precise PDF rectangle to redact.
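
A minimal sketch of that lookup, assuming the extracted words are joined with single spaces and each word carries an `(x0, y0, x1, y1)` box. The function names and the data layout are illustrative and not taken from extractor.py.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Word = Tuple[str, BBox]


def build_text_and_offsets(words: List[Word]) -> Tuple[str, List[int]]:
    """Join words with single spaces; return the text and each word's start offset."""
    starts, pos = [], 0
    for text, _bbox in words:
        starts.append(pos)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(text for text, _ in words), starts


def span_to_rect(words: List[Word], starts: List[int],
                 start: int, end: int) -> Optional[BBox]:
    """Union the boxes of all words overlapping the character span [start, end)."""
    first = max(bisect_right(starts, start) - 1, 0)
    rect: Optional[BBox] = None
    for i in range(first, len(words)):
        if starts[i] >= end:
            break
        text, (x0, y0, x1, y1) = words[i]
        if starts[i] + len(text) <= start:
            continue  # span begins in the space after this word
        if rect is None:
            rect = (x0, y0, x1, y1)
        else:
            rect = (min(rect[0], x0), min(rect[1], y0),
                    max(rect[2], x1), max(rect[3], y1))
    return rect
```

A span that wraps across lines would produce one large rectangle with this simple union, so a fuller version would probably group words by line first.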
For scanned PDFs, the same mapping works in pixel space: Tesseract returns per-word bounding boxes, and coordinates are translated from image pixels to PDF page points using the image's placement transform.
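
For the simplest case, an unrotated image placed in an axis-aligned rectangle on the page, the translation reduces to a scale and an offset. The sketch below assumes exactly that, with a top-left origin for both pixel and page coordinates; a rotated or skewed placement would need the full transform matrix. The function name and parameters are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixel_box_to_page(box_px: Box,
                      image_size: Tuple[int, int],
                      placement: Box) -> Box:
    """Map an OCR word box from image pixels to PDF page points.

    box_px     -- word box in image pixels
    image_size -- (width, height) of the image in pixels
    placement  -- rectangle the image occupies on the page, in points
    """
    px0, py0, px1, py1 = box_px
    width_px, height_px = image_size
    x0, y0, x1, y1 = placement
    sx = (x1 - x0) / width_px    # points per pixel, horizontal
    sy = (y1 - y0) / height_px   # points per pixel, vertical
    return (x0 + px0 * sx, y0 + py0 * sy, x0 + px1 * sx, y0 + py1 * sy)
```

For example, a 1000 x 2000 pixel scan placed at `(50, 100, 550, 1100)` has a scale of 0.5 points per pixel in both directions.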
| Component | Library |
|---|---|
| Server | FastAPI + uvicorn |
| PDF extraction & redaction | PyMuPDF (content-stream removal) |
| NER | HuggingFace Transformers (multilingual BERT) |
| External LLM (optional) | Any OpenAI-compatible API via httpx |
| OCR | Tesseract via pytesseract |
| Encryption | AES-256-GCM + PBKDF2 key derivation (cryptography) |
| Frontend | HTML + Tailwind CSS + vanilla JS |
src/
server.py FastAPI routes + static file serving
detector.py PII detection (regex + NER + priority merge)
llm_detector.py External LLM NER backend (OpenAI-compatible)
extractor.py PDF text extraction with word-level bounding boxes
redactor.py Text redaction (PyMuPDF) + image redaction (Pillow)
crypto.py Key file encryption / decryption (AES-256-GCM)
ocr.py Tesseract OCR for scanned PDFs + embedded images
models.py Pydantic models
static/
index.html Single-page dark-themed UI
styles.css Minimal custom CSS (animations, badges)
app.js Upload → review → download logic
By default GoCalma uses the local BERT model for NER. To use an external provider instead, click the gear icon in the UI or set environment variables before starting:
export GOCALMA_LLM_ENABLED=true
export GOCALMA_LLM_API_URL=http://localhost:11434/v1 # Ollama default
export GOCALMA_LLM_MODEL=llama3
# export GOCALMA_LLM_API_KEY=sk-... # only for remote providers
./start.sh
The UI shows a privacy warning when the API URL points to a non-local server.
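
A check of this kind can be sketched with the standard library alone. The function name and the exact definition of "local" (here: `localhost` names and loopback addresses only) are assumptions; the app's own rule may be broader or narrower.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(api_url: str) -> bool:
    """Return True if `api_url` points at this machine (localhost or loopback)."""
    host = urlparse(api_url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname: treat as remote.
        return False
```

Under this definition a LAN address such as `192.168.1.20` counts as non-local, since the document text would still leave the device.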