A surgical tool for transforming image-only historical manuscripts into fully searchable PDFs, with a manual transcription editor and multi-format export.
Transcriptum is a self-hosted web application designed for historians, archivists, and genealogical researchers working with scanned historical documents. It runs entirely on your local network — no cloud, no accounts, no subscriptions.
The core workflow:
- Upload a scanned PDF (image-only, or one that already has a text layer)
- Navigate page by page using the high-resolution viewer
- Manually transcribe the handwritten content into the sidebar editor
- Inject the transcription as an invisible text layer (PDF Text Rendering Mode 3) into the PDF
- Download a fully searchable PDF that works with Cmd+F / Ctrl+F in any standard viewer
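The injection step above relies on PDF text rendering mode 3, which tells a viewer to position the glyphs without painting them, so the text stays searchable and selectable but never shows on the page. The sketch below is not Transcriptum's own code. It only illustrates, using the Python standard library, what such text looks like in a raw PDF content stream. The function names, the font resource name `F1`, and the coordinates are assumptions chosen for illustration.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(text: str, x: float, y: float,
                       font: str = "F1", size: float = 10.0) -> str:
    """Build content-stream operators that place `text` invisibly at (x, y).

    `font` is a hypothetical font resource name; a real page must declare it
    in its resource dictionary. Non-ASCII text would also need a suitable
    font encoding, which this sketch does not handle.
    """
    return (
        "BT\n"                                   # begin text object
        f"/{font} {size:g} Tf\n"                 # select font and size
        "3 Tr\n"                                 # rendering mode 3: invisible
        f"{x:g} {y:g} Td\n"                      # move to the text position
        f"({escape_pdf_string(text)}) Tj\n"      # show the string
        "ET\n"                                   # end text object
    )


if __name__ == "__main__":
    print(invisible_text_ops("Anno (1742)", 72, 720))
```

In practice a PDF library writes these operators for you; the point here is only that `3 Tr` is what makes the layer invisible.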
| Feature | Description |
|---|---|
| PDF Viewer | High-resolution rendering via PDF.js with selectable text layer |
| Per-page transcription | Each page has its own independent text editor |
| Invisible text injection | Injects text at render mode 3 (invisible, fully searchable — PDF spec §9.3.6) |
| Smart text detection | Detects existing text layers on upload and pre-fills the editor automatically |
| Selective injection | Build searchable PDF for the full document or just the active page |
| Zoom with margin scroll | Zoom in and scroll to document margins for edge annotations |
| Reference marker | Static horizontal guide line for tracking your reading position |
| Document search | Internal search across all transcribed pages with highlighted snippets |
| Export .txt | Full transcription with APA-style header, organised by page number |
| Export .docx | Word-compatible export with the same APA header and page sections |
| Work Notes | Ephemeral browser-local scratchpad (never sent to server) |
| Dark & Sepia themes | Eye-friendly themes for long research sessions |
| Persistence | All transcription data survives page refresh and browser restarts |
| Deep file deletion | Removes source, all outputs, and metadata in one action |
| No login required | Single-user local tool — open and use immediately |
| Docker-ready | Runs on TrueNAS SCALE, Proxmox LXC, Unraid, or any Linux host |
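The document search feature in the table above can be pictured as a case-insensitive scan over the per-page transcriptions that returns a short snippet around each match. This is a minimal sketch under assumptions, not the application's actual implementation: the function name `search_pages`, the `{page_number: text}` input shape, and the square-bracket highlighting are all hypothetical.

```python
import re


def search_pages(pages: dict[int, str], query: str, context: int = 30) -> list[dict]:
    """Return one hit per occurrence of `query`, in page order.

    Each hit carries the page number, the character offset of the match,
    and a snippet with the match wrapped in square brackets.
    """
    if not query:
        return []
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page_no in sorted(pages):
        text = pages[page_no]
        for match in pattern.finditer(text):
            left = max(0, match.start() - context)
            right = min(len(text), match.end() + context)
            snippet = (
                f"{text[left:match.start()]}"
                f"[{match.group(0)}]"
                f"{text[match.end():right]}"
            )
            hits.append({"page": page_no, "offset": match.start(), "snippet": snippet})
    return hits


if __name__ == "__main__":
    demo = {2: "Baptised at St Mary", 1: "John, son of Mary Smith"}
    for hit in search_pages(demo, "mary"):
        print(hit["page"], hit["snippet"])
```

A web UI would typically replace the brackets with highlight markup when rendering the snippet.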
transcriptum/
├── app.py ← Flask backend (all API routes)
├── requirements.txt ← Python dependencies
├── Dockerfile ← Container build instructions
├── docker-compose.yml ← Service definition with volume mounts
├── .gitignore
├── README.md
├── templates/
│   └── index.html ← Complete single-page UI
├── static/ ← Reserved for future static assets
└── data/ ← Created at runtime (not committed)
    ├── uploads/ ← Source PDFs uploaded by the user
    ├── outputs/ ← Generated searchable PDFs
    └── meta/ ← Per-document transcription JSON files
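The `meta/` folder in the tree above holds one JSON file per document. The real schema is not documented here, so the following is only a sketch of how per-page transcriptions could be stored and read back. The `doc_id` naming, the `{"doc_id": ..., "pages": {...}}` layout, and both function names are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def save_transcription(meta_dir, doc_id: str, pages: dict[int, str]) -> Path:
    """Write one JSON file per document (hypothetical schema).

    The file is written to a temporary name first and then renamed, so an
    interrupted save does not leave a half-written file behind.
    """
    meta_dir = Path(meta_dir)
    meta_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "doc_id": doc_id,
        # JSON object keys must be strings, so page numbers are stringified.
        "pages": {str(number): text for number, text in sorted(pages.items())},
    }
    target = meta_dir / f"{doc_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=meta_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, target)
    return target


def load_transcription(meta_dir, doc_id: str) -> dict[int, str]:
    """Read the file back and restore integer page numbers."""
    path = Path(meta_dir) / f"{doc_id}.json"
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    return {int(number): text for number, text in payload["pages"].items()}
```

Keeping each document in its own file means a single corrupted file cannot affect the transcriptions of other documents.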
- Docker and Docker Compose installed on the host
- A folder on persistent storage for the data/ directory
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git
cd transcriptum
# 2. Create data directories
mkdir -p data/uploads data/outputs data/meta
# 3. Build and start
docker compose up -d --build
# 4. Open in your browser
http://localhost:5000

To stop: docker compose down
To update: git pull && docker compose up -d --build
TrueNAS SCALE supports Docker Compose via the Shell. The data directories should live on a dataset so they survive container recreation.
# 1. SSH into your TrueNAS machine (or open the Shell in the UI)
# 2. Create a dataset for the app (via UI or CLI)
# Example path: /mnt/tank/apps/transcriptum
# 3. Navigate to the dataset
cd /mnt/tank/apps/transcriptum
# 4. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git .
# 5. Create data directories on the dataset (persistent storage)
mkdir -p data/uploads data/outputs data/meta
# 6. Build and start
docker compose up -d --build
# 7. Access from any device on your network
http://TRUENAS_IP:5000

Important on TrueNAS SCALE:
- Use a ZFS dataset (not the boot pool) for the data/ folder
- The docker-compose.yml already maps ./data/* to /data/* inside the container
- If you use a custom dataset path, edit the volumes: section of docker-compose.yml:
volumes:
- /mnt/tank/apps/transcriptum/data/uploads:/data/uploads
- /mnt/tank/apps/transcriptum/data/outputs:/data/outputs
- /mnt/tank/apps/transcriptum/data/meta:/data/meta

Recommended: create an Ubuntu 24.04 LXC container with Docker installed.
# Inside the LXC container:
# 1. Install Docker (if not already present)
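# On stock Ubuntu 24.04 the Compose v2 plugin is packaged as docker-compose-v2;
# docker-compose-plugin is only available from Docker's own apt repository.
apt update && apt install -y docker.io docker-compose-v2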
systemctl enable --now docker
# 2. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git
cd transcriptum
# 3. Create data directories
mkdir -p data/uploads data/outputs data/meta
# 4. Build and start
docker compose up -d --build
# 5. Access from the Proxmox network
http://LXC_IP:5000

To expose on a specific port, change ports in docker-compose.yml:
ports:
- "8080:5000" # access on port 8080 instead

- Install the Community Applications plugin if not present
- Install the Docker Compose Manager plugin
- Create a new compose stack, paste the contents of docker-compose.yml
- Update the volume paths to point to your Unraid array:
volumes:
- /mnt/user/appdata/transcriptum/uploads:/data/uploads
- /mnt/user/appdata/transcriptum/outputs:/data/outputs
- /mnt/user/appdata/transcriptum/meta:/data/meta

- Start the stack and access at http://UNRAID_IP:5000
The container runs as root internally. The data/ directories need to be readable and writable by the Docker process.
# Set permissions on the data directories (run on the host)
chmod -R 755 data/
# or if you encounter permission errors:
chown -R 1000:1000 data/

On TrueNAS SCALE, if you get permission errors:
- Go to Storage → Datasets → transcriptum/data
- Edit Permissions → set User to root, Group to root
- Check "Apply permissions recursively"
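Before changing ownership or modes, it can help to confirm which directory is actually the problem. The sketch below, which is not part of Transcriptum, checks that the three data folders exist and are accessible. It tests access for the user running the script, which is not necessarily the user the container runs as, so treat the result as a hint only. The function name and defaults are assumptions.

```python
import os
from pathlib import Path


def check_data_dirs(base="data", names=("uploads", "outputs", "meta")) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name in names:
        path = Path(base) / name
        if not path.is_dir():
            problems.append(f"{path}: missing")
        # A directory needs read, write, and execute (traverse) permission
        # before files can be listed, created, and opened inside it.
        elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
            problems.append(f"{path}: not readable/writable by the current user")
    return problems


if __name__ == "__main__":
    for problem in check_data_dirs():
        print(problem)
```

Run it on the host from the folder that contains `data/`, before starting the container.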
# Pull latest changes
git pull
# Rebuild and restart (data is preserved in the mounted volumes)
docker compose up -d --build

All user data lives in three folders:
| Folder | Contents | Priority |
|---|---|---|
| data/meta/ | JSON transcription files (your work) | Critical |
| data/uploads/ | Original source PDFs | High |
| data/outputs/ | Generated searchable PDFs | Recoverable |
Minimum backup: just data/meta/. It contains all your transcription text; together with the source PDFs in data/uploads/, it lets you regenerate the outputs at any time.
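A minimum backup can be as simple as archiving the `meta/` folder with a timestamp. The following is one possible sketch using only the Python standard library. The `backups` destination folder, the archive naming scheme, and the function name are assumptions, not something the project prescribes.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def backup_meta(data_dir="data", dest_dir="backups") -> Path:
    """Archive data/meta/ into a timestamped .tar.gz and return its path."""
    meta = Path(data_dir) / "meta"
    if not meta.is_dir():
        raise FileNotFoundError(f"no meta directory at {meta}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = dest / f"transcriptum-meta-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to "meta/" so the archive restores cleanly
        # regardless of where the data directory lived on the host.
        tar.add(meta, arcname="meta")
    return archive


if __name__ == "__main__":
    print(f"Backup written to {backup_meta()}")
```

To restore, extract the archive inside the `data/` directory so that `meta/` ends up in its original place.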
| Layer | Technology |
|---|---|
| Backend | Python 3.11 + Flask |
| PDF processing | PyMuPDF (fitz) |
| PDF viewer | PDF.js v3.11 |
| Image import | Pillow |
| Word export | python-docx |
| Container | Docker + Docker Compose |
MIT License — free to use, modify, and distribute.
This project was built for personal archival research. Pull requests and issues are welcome.
