jobaugit/transcriptum

Transcriptum

A surgical tool for transforming image-only historical manuscripts into fully searchable PDFs, with a manual transcription editor and multi-format export.

What is Transcriptum?

Transcriptum is a self-hosted web application designed for historians, archivists, and genealogical researchers working with scanned historical documents. It runs entirely on your local network — no cloud, no accounts, no subscriptions.

The core workflow:

  1. Upload a scanned PDF (image-only, or one that already has a text layer)
  2. Navigate page by page using the high-resolution viewer
  3. Manually transcribe the handwritten content into the sidebar editor
  4. Inject the transcription as an invisible text layer (PDF Text Rendering Mode 3) into the PDF
  5. Download a fully searchable PDF that works with Cmd+F / Ctrl+F in any standard viewer
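
As a rough illustration of step 4, the sketch below builds the raw PDF content-stream operators for one line of invisible text. It is not Transcriptum's actual code (the tech stack lists PyMuPDF for PDF processing); the function names and the /F1 font resource name are assumptions made for this example.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(text: str, x: float, y: float, font_size: float = 12) -> str:
    """Return content-stream operators that place `text` invisibly at (x, y).

    /F1 is a hypothetical font resource name; a real page would have to
    declare it in its resource dictionary.
    """
    return "\n".join(
        [
            "BT",                                  # begin text object
            f"/F1 {font_size:g} Tf",               # select font and size
            "3 Tr",                                # rendering mode 3: no fill, no stroke
            f"{x:g} {y:g} Td",                     # move to the text position
            f"({escape_pdf_string(text)}) Tj",     # show the string
            "ET",                                  # end text object
        ]
    )


print(invisible_text_ops("Anno Domini (1687)", 72, 720))
```

The `3 Tr` operator is what makes the text invisible while leaving it available to search and selection. This sketch only handles plain ASCII input; accented or non-Latin text would need a suitable font and encoding.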

Features

| Feature | Description |
| --- | --- |
| PDF Viewer | High-resolution rendering via PDF.js with selectable text layer |
| Per-page transcription | Each page has its own independent text editor |
| Invisible text injection | Injects text at render mode 3 (invisible, fully searchable; see PDF spec §9.3.6) |
| Smart text detection | Detects existing text layers on upload and pre-fills the editor automatically |
| Selective injection | Build a searchable PDF for the full document or just the active page |
| Zoom with margin scroll | Zoom in and scroll to document margins for edge annotations |
| Reference marker | Static horizontal guide line for tracking your reading position |
| Document search | Internal search across all transcribed pages with highlighted snippets |
| Export .txt | Full transcription with APA-style header, organised by page number |
| Export .docx | Word-compatible export with the same APA header and page sections |
| Work Notes | Ephemeral browser-local scratchpad (never sent to server) |
| Dark & Sepia themes | Eye-friendly themes for long research sessions |
| Persistence | All transcription data survives page refresh and browser restarts |
| Deep file deletion | Removes source, all outputs, and metadata in one action |
| No login required | Single-user local tool: open and use immediately |
| Docker-ready | Runs on TrueNAS SCALE, Proxmox LXC, Unraid, or any Linux host |
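
To make the document search feature more concrete, here is a minimal sketch of a case-insensitive search over per-page transcriptions that returns short snippets with the match marked in square brackets. The function name, the page dictionary, and the bracket highlighting are assumptions for illustration, not the application's actual implementation.

```python
import re


def search_pages(pages: dict[int, str], query: str, context: int = 30) -> list[dict]:
    """Case-insensitive search over per-page text, returning marked snippets."""
    if not query:
        return []
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page_no in sorted(pages):
        text = pages[page_no]
        for match in pattern.finditer(text):
            # Keep up to `context` characters on each side of the match.
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            snippet = (
                text[start:match.start()]
                + "[" + match.group(0) + "]"
                + text[match.end():end]
            )
            hits.append({"page": page_no, "snippet": snippet})
    return hits
```

Calling `search_pages({1: "Juan Perez, witness"}, "juan")` would yield one hit for page 1 whose snippet marks the matched name.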

Project Structure

transcriptum/
├── app.py                  ← Flask backend (all API routes)
├── requirements.txt        ← Python dependencies
├── Dockerfile              ← Container build instructions
├── docker-compose.yml      ← Service definition with volume mounts
├── .gitignore
├── README.md
├── templates/
│   └── index.html          ← Complete single-page UI
├── static/                 ← Reserved for future static assets
└── data/                   ← Created at runtime (not committed)
    ├── uploads/            ← Source PDFs uploaded by the user
    ├── outputs/            ← Generated searchable PDFs
    └── meta/               ← Per-document transcription JSON files
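
The per-document JSON files in data/meta/ could be read and written along the following lines. This is a sketch under assumptions: the file naming (`<doc_id>.json`), the `pages` key, and both function names are hypothetical, and the real schema used by app.py may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def save_transcription(meta_dir: Path, doc_id: str, pages: dict[int, str]) -> Path:
    """Write one document's page texts to <meta_dir>/<doc_id>.json atomically."""
    meta_dir.mkdir(parents=True, exist_ok=True)
    target = meta_dir / f"{doc_id}.json"
    # JSON object keys must be strings, so page numbers are stored as text.
    payload = {"pages": {str(n): text for n, text in pages.items()}}
    fd, tmp_name = tempfile.mkstemp(dir=meta_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, target)  # swap the finished file into place in one step
    return target


def load_transcription(meta_dir: Path, doc_id: str) -> dict[int, str]:
    """Read the page texts back, or return an empty dict for a new document."""
    target = meta_dir / f"{doc_id}.json"
    if not target.exists():
        return {}
    data = json.loads(target.read_text(encoding="utf-8"))
    return {int(n): text for n, text in data["pages"].items()}
```

Writing to a temporary file and then renaming it means an interrupted save leaves the previous JSON file intact, which matters when this folder holds the only copy of the transcription work.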

Installation

Prerequisites

  • Docker and Docker Compose installed on the host
  • A folder on persistent storage for the data/ directory

Option 1 — Any Linux Host (simplest)

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git
cd transcriptum

# 2. Create data directories
mkdir -p data/uploads data/outputs data/meta

# 3. Build and start
docker compose up -d --build

# 4. Open in your browser: http://localhost:5000

To stop: docker compose down
To update: git pull && docker compose up -d --build


Option 2 — TrueNAS SCALE

TrueNAS SCALE supports Docker Compose via the Shell. The data directories should live on a dataset so they survive container recreation.

# 1. SSH into your TrueNAS machine (or open the Shell in the UI)

# 2. Create a dataset for the app (via UI or CLI)
#    Example path: /mnt/tank/apps/transcriptum

# 3. Navigate to the dataset
cd /mnt/tank/apps/transcriptum

# 4. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git .

# 5. Create data directories on the dataset (persistent storage)
mkdir -p data/uploads data/outputs data/meta

# 6. Build and start
docker compose up -d --build

# 7. Access from any device on your network: http://TRUENAS_IP:5000

Important on TrueNAS SCALE:

  • Use a ZFS dataset (not the boot pool) for the data/ folder
  • The docker-compose.yml already maps ./data/* to /data/* inside the container
  • If you use a custom dataset path, edit the volumes: section of docker-compose.yml:
volumes:
  - /mnt/tank/apps/transcriptum/data/uploads:/data/uploads
  - /mnt/tank/apps/transcriptum/data/outputs:/data/outputs
  - /mnt/tank/apps/transcriptum/data/meta:/data/meta

Option 3 — Proxmox (LXC Container)

Recommended: create an Ubuntu 24.04 LXC container with Docker installed.

# Inside the LXC container:

# 1. Install Docker (if not already present)
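# (docker-compose-v2 is the Compose plugin package in Ubuntu's own repositories;
#  docker-compose-plugin is only available from Docker's apt repository)
apt update && apt install -y docker.io docker-compose-v2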
systemctl enable --now docker

# 2. Clone the repository
git clone https://github.com/YOUR_USERNAME/transcriptum.git
cd transcriptum

# 3. Create data directories
mkdir -p data/uploads data/outputs data/meta

# 4. Build and start
docker compose up -d --build

# 5. Access from the Proxmox network: http://LXC_IP:5000

To expose the app on a different host port, change the ports mapping in docker-compose.yml:

ports:
  - "8080:5000"   # access on port 8080 instead

Option 4 — Unraid

  1. Install the Community Applications plugin if not present
  2. Install the Docker Compose Manager plugin
  3. Create a new compose stack, paste the contents of docker-compose.yml
  4. Update the volume paths to point to your Unraid array:
volumes:
  - /mnt/user/appdata/transcriptum/uploads:/data/uploads
  - /mnt/user/appdata/transcriptum/outputs:/data/outputs
  - /mnt/user/appdata/transcriptum/meta:/data/meta
  5. Start the stack and access at http://UNRAID_IP:5000

Folder Permissions

The container runs as root internally. The data/ directories need to be readable and writable by the Docker process.

# Set permissions on the data directories (run on the host)
chmod -R 755 data/
# or if you encounter permission errors:
chown -R 1000:1000 data/
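
Before changing ownership, it can help to see which directories are actually the problem. The sketch below checks the three data subdirectories for read and write access; the function name and report strings are assumptions, and it tests access for the user running the script on the host, which is not necessarily the user the container runs as.

```python
import os
from pathlib import Path

DATA_SUBDIRS = ("uploads", "outputs", "meta")


def check_data_dirs(data_root: Path) -> dict[str, str]:
    """Report, per subdirectory, whether the current user can read and write it."""
    report = {}
    for name in DATA_SUBDIRS:
        path = data_root / name
        if not path.is_dir():
            report[name] = "missing"
        elif os.access(path, os.R_OK | os.W_OK | os.X_OK):
            report[name] = "ok"
        else:
            report[name] = "not readable/writable"
    return report


print(check_data_dirs(Path("data")))
```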

On TrueNAS SCALE, if you get permission errors:

  • Go to Storage → Datasets → transcriptum/data
  • Edit Permissions → set User to root, Group to root
  • Check "Apply permissions recursively"

Updating

# Pull latest changes
git pull

# Rebuild and restart (data is preserved in the mounted volumes)
docker compose up -d --build

Data Backup

All user data lives in three folders:

| Folder | Contents | Priority |
| --- | --- | --- |
| data/meta/ | JSON transcription files (your work) | Critical |
| data/uploads/ | Original source PDFs | High |
| data/outputs/ | Generated searchable PDFs | Recoverable |

Minimum backup: just data/meta/. It contains all your transcription text, so the searchable outputs can be regenerated at any time as long as the original source PDFs are still available.
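
A minimum backup of data/meta/ can be scripted with the standard library alone. The following is a sketch, with a hypothetical function name and archive naming scheme, that packs the folder into a timestamped .tar.gz file.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def backup_meta(data_root: Path, backup_dir: Path) -> Path:
    """Archive <data_root>/meta into a timestamped .tar.gz inside backup_dir."""
    meta_dir = data_root / "meta"
    if not meta_dir.is_dir():
        raise FileNotFoundError(f"no meta directory at {meta_dir}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"transcriptum-meta-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under "meta/" so the archive unpacks back into data/.
        tar.add(meta_dir, arcname="meta")
    return archive
```

For example, `backup_meta(Path("data"), Path("backups"))` would write the archive into a backups/ folder next to data/. Keeping that folder on a different disk or dataset is what makes it a real backup.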


Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.11 + Flask |
| PDF processing | PyMuPDF (fitz) |
| PDF viewer | PDF.js v3.11 |
| Image import | Pillow |
| Word export | python-docx |
| Container | Docker + Docker Compose |

License

MIT License — free to use, modify, and distribute.


Screenshots

Main view


Contributing

This project was built for personal archival research. Pull requests and issues are welcome.
