This is a markdown cell. It describes the purpose of this notebook and its code.

This notebook transforms a PDF scan of a Latin record into a Word doc (Latin), then translates it into English, including its numerals. The final output is a Word doc (English).

Last updated by Kuba Kowalski on 11/11/2025, 23:00.

## Applying OCR to raw scans

This step embeds selectable text into the PDF to make conversion to Word possible. Note that this is needed because, while the original scans appear to have selectable text, that text is actually full of errors and cannot be trusted.

OCR is the standard used for other Winchester Pipe Rolls. No need to reinvent the wheel. 

In [11]:
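# Install the Python packages used in this notebook
!pip install ocrmypdf pdfplumber python-docx

# Linux (Debian/Ubuntu) only: install Tesseract with the Latin language pack.
# On Windows this line fails; follow the Chocolatey steps further down instead.
!sudo apt-get install -y tesseract-ocr tesseract-ocr-lat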




[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip
Sudo is disabled on this machine. To enable it, go to the Developer Settings page in the Settings app


In [12]:
# Import packages
import os
import subprocess
from pathlib import Path
import pdfplumber
from docx import Document

### IMPORTANT BEFORE RUNNING

The cell below will fail if tesseract and ghostscript are not correctly installed or not visible on your Windows install. To fix it, follow the steps listed below (for a Linux solution, ask ChatGPT):
1. Open Windows PowerShell as admin
2. Run "Get-ExecutionPolicy". If it returns "Restricted", run "Set-ExecutionPolicy Bypass -Scope Process -Force". Otherwise, continue to step 3
3. Install Chocolatey (Run as single line): Set-ExecutionPolicy Bypass -Scope Process -Force; `
[System.Net.ServicePointManager]::SecurityProtocol = `
[System.Net.ServicePointManager]::SecurityProtocol -bor 3072; `
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
4. Verify install: "choco --version"
5. Open Command Prompt as admin
6. Install tesseract: "choco install tesseract"
7. Input "Y" in Command Prompt
8. Install ghostscript: "choco install ghostscript"
9. Input "Y" in Command Prompt
10. Download the Latin package for tesseract from here: [lat.traineddata](https://github.com/tesseract-ocr/tessdata_best/blob/main/lat.traineddata). Save it to "C:\Program Files\Tesseract-OCR\tessdata\lat.traineddata", or to the tessdata folder of wherever tesseract is installed. To check the install location, run "tesseract --print-parameters | findstr /C:tessdata" in Command Prompt

Note: Chocolatey is a convenient package manager for Windows. Ghostscript is required by ocrmypdf, which uses it to process the PDFs. There is a manual way of installing tesseract & ghostscript without Chocolatey, but it is not covered here.
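
Before running the OCR cell, it can help to confirm that the tools from the steps above are actually visible to Python. The sketch below is only an assumption about how one might check this: the helper names `find_tools` and `has_latin` are hypothetical, and the Ghostscript executable names (`gswin64c`, `gswin32c`, `gs`) are the usual ones but may differ on your install.

```python
import shutil
import subprocess

# Executable names are assumptions: Ghostscript usually ships as gswin64c
# (or gswin32c) on Windows and as gs on Linux/macOS.
REQUIRED_TOOLS = {
    "tesseract": ["tesseract"],
    "ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def find_tools(which=shutil.which):
    """Map each tool to the first matching executable on PATH, or None."""
    found = {}
    for tool, candidates in REQUIRED_TOOLS.items():
        paths = (which(name) for name in candidates)
        found[tool] = next((path for path in paths if path), None)
    return found


def has_latin(tesseract_path):
    """Return True if `tesseract --list-langs` reports the `lat` language."""
    result = subprocess.run(
        [tesseract_path, "--list-langs"], capture_output=True, text=True
    )
    # Some Tesseract versions print the list to stderr, so check both streams.
    return "lat" in (result.stdout + result.stderr).split()


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If tesseract is found, `has_latin(find_tools()["tesseract"])` should return `True` once step 10 has been done correctly.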

In [None]:
# Runtime: about 45 sec per input file.

# Example shell command: !ocrmypdf --force-ocr -l lat+eng Cuxham_1288.pdf Cuxham_1288_ocr.pdf
    ## --force-ocr is needed to overwrite the original text layer
    ## -l lat+eng is needed to handle the mix of English & Latin, suggested by GPT-5
    ## Cuxham_1288.pdf is the input
    ## Cuxham_1288_ocr.pdf is the output

# Paths

## Input
input_dir = Path(r"C:\Users\kubak\OneDrive - Wageningen University & Research\WUR\2024-2025\research_assistant_ox\workfolder\input\Latin\Cuxham")

## Output: where to save OCR’d files (can be same as input)
output_dir = Path(r"C:\Users\kubak\OneDrive - Wageningen University & Research\WUR\2024-2025\research_assistant_ox\workfolder\output\Latin\Cuxham")
output_dir.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing parent folders

# The bash command above is rebuilt for Python use below
    ## Most convenient way for looping through multiple input PDFs (Latin)
for pdf_file in input_dir.glob("*.pdf"):
    output_pdf = output_dir / f"{pdf_file.stem}_ocr.pdf"
    
    print(f"OCR processing input: {pdf_file.name} -> {output_pdf.name}")

    cmd = [
        "ocrmypdf",
        "--force-ocr",
        "-l", "lat+eng",
        str(pdf_file),
        str(output_pdf)
    ]

    try:
        subprocess.run(cmd, check=True)
        print(f"Task successful: {pdf_file.name}")
    except subprocess.CalledProcessError as e:
        print(f"Task failed (check your directories or follow the 10 steps above): {pdf_file.name}: {e}")

OCR processing input: Cuxham_1276.pdf -> Cuxham_1276_ocr.pdf
Task successful: Cuxham_1276.pdf
OCR processing input: Cuxham_1288.pdf -> Cuxham_1288_ocr.pdf
Task successful: Cuxham_1288.pdf
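
At roughly 45 seconds per file, re-running the OCR cell over a large folder gets slow, and if the output folder is the same as the input folder, `glob("*.pdf")` will also pick up the `*_ocr.pdf` files from an earlier run. The sketch below shows one possible way to list only the files that still need OCR. The helper name `pending_jobs` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def pending_jobs(input_dir, output_dir):
    """Return (input_pdf, output_pdf) pairs that have no OCR output yet."""
    jobs = []
    for pdf_file in sorted(Path(input_dir).glob("*.pdf")):
        # Skip files that are themselves OCR outputs from an earlier run.
        if pdf_file.stem.endswith("_ocr"):
            continue
        output_pdf = Path(output_dir) / f"{pdf_file.stem}_ocr.pdf"
        if not output_pdf.exists():
            jobs.append((pdf_file, output_pdf))
    return jobs
```

The loop in the cell above could then iterate over `pending_jobs(input_dir, output_dir)` instead of `input_dir.glob("*.pdf")`. To redo a file, delete its `_ocr.pdf` output first.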


In [14]:
# Path
word_dir = output_dir / "word"
word_dir.mkdir(exist_ok=True)

# Function for conversion of OCR PDF to Word
def pdf_to_docx(pdf_path: Path, docx_path: Path):
    doc = Document()
    doc.add_heading(pdf_path.stem, level=1)

    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            text = text.strip()
            if not text:
                continue

            # Add a page marker
            doc.add_paragraph(f"[Page {page_no}]")

            # Split into paragraphs on newline
            for line in text.split("\n"):
                if line.strip():
                    doc.add_paragraph(line.strip())

            # Add a blank line between pages
            doc.add_paragraph("")

    doc.save(docx_path)
    print(f"Saved: {docx_path}")

# Conversion of OCR PDFs to Word
for pdf_file in output_dir.glob("*_ocr.pdf"):
    docx_name = pdf_file.stem.replace("_ocr", "") + ".docx"
    docx_path = word_dir / docx_name
    print(f"Converting {pdf_file.name} → {docx_name}")
    pdf_to_docx(pdf_file, docx_path)


Converting Cuxham_1276_ocr.pdf → Cuxham_1276.docx
Saved: C:\Users\kubak\OneDrive - Wageningen University & Research\WUR\2024-2025\research_assistant_ox\workfolder\output\Latin\Cuxham\word\Cuxham_1276.docx
Converting Cuxham_1288_ocr.pdf → Cuxham_1288.docx
Saved: C:\Users\kubak\OneDrive - Wageningen University & Research\WUR\2024-2025\research_assistant_ox\workfolder\output\Latin\Cuxham\word\Cuxham_1288.docx


## Latin Translation

In [None]:
# Install for convenience
!pip install openai

In [None]:
# Packages
from openai import OpenAI

The example line of code below shows how you could set the OpenAI API key directly in the Jupyter Notebook. I am not doing this myself since the GitHub repo is public (for now).

If anyone reading this is ever tempted to do it, better not.
You will be charged for any usage by anyone who finds your key, and in a public repo it is visible to literally everyone in the world with Internet access.

In [None]:
# Don't do this

# os.environ["OPENAI_API_KEY"] = "sk-your-real-key-here"

You can find your key through this link by logging in: https://platform.openai.com/api-keys

Instructions for setting the OpenAI API key:
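1. Open PowerShell
2. Run with your key: setx OPENAI_API_KEY "sk-..."
3. Restart Jupyter (and the terminal it was launched from); setx only affects processes started afterwards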
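
To confirm that the key from the steps above is visible to the notebook without ever printing it, something like the following could be used. This is a minimal sketch, and the function name `describe_api_key` is hypothetical.

```python
import os


def describe_api_key(env=os.environ):
    """Report whether OPENAI_API_KEY is set, without revealing its value."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        return "OPENAI_API_KEY is not set"
    return f"OPENAI_API_KEY is set ({len(key)} characters)"


print(describe_api_key())
```

If this reports that the key is not set right after running setx, the notebook was most likely started before the variable was created.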

In [None]:
# Safe alternative for setting the API key, follow instructions in Markdown cell above
client = OpenAI() # This takes the key from your env

In [None]:
ocr_dir = Path(r"C:\Users\kubak\OneDrive - Wageningen University & Research\WUR\2024-2025\research_assistant_ox\workfolder\output\Latin\Cuxham")

def chunk_pages(pages, max_chars=6000):
    chunks = []
    current = []
    current_len = 0

    for page_no, text in enumerate(pages, start=1):
        header = f"[Page {page_no}]\n"
        block = header + text + "\n\n"
        if current_len + len(block) > max_chars and current:
            chunks.append("".join(current))
            current = [block]
            current_len = len(block)
        else:
            current.append(block)
            current_len += len(block)

    if current:
        chunks.append("".join(current))

    return chunks

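# Assumption: extract_pages was not defined anywhere in this notebook.
# This minimal version mirrors the text extraction used in pdf_to_docx.
def extract_pages(pdf_path):
    """Return the text of each page of an OCR'd PDF as a list of strings."""
    with pdfplumber.open(pdf_path) as pdf:
        return [(page.extract_text() or "").strip() for page in pdf.pages]

# Quick check on the first OCR'd PDF in the folder
pdf_path = next(ocr_dir.glob("*_ocr.pdf"))
pages = extract_pages(pdf_path)
chunks = chunk_pages(pages)
len(chunks)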
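
To see how the chunking behaves, here is a small worked example on toy pages. It repeats `chunk_pages` from the cell above so that it runs on its own; the page contents and the `max_chars` values are made up for illustration.

```python
def chunk_pages(pages, max_chars=6000):
    chunks = []
    current = []
    current_len = 0

    for page_no, text in enumerate(pages, start=1):
        header = f"[Page {page_no}]\n"
        block = header + text + "\n\n"
        if current_len + len(block) > max_chars and current:
            chunks.append("".join(current))
            current = [block]
            current_len = len(block)
        else:
            current.append(block)
            current_len += len(block)

    if current:
        chunks.append("".join(current))

    return chunks


# Each block is 9 (header) + 20 (text) + 2 (blank line) = 31 characters.
toy_pages = ["a" * 20, "b" * 20, "c" * 20]
toy_chunks = chunk_pages(toy_pages, max_chars=70)
print(len(toy_chunks))  # 2: pages 1 and 2 fit together (62 chars), page 3 starts a new chunk

# A single page longer than max_chars is never split; it becomes its own chunk.
oversized = chunk_pages(["x" * 100], max_chars=50)
print(len(oversized), len(oversized[0]))  # 1 111
```

So `max_chars` is a soft limit: pages are never cut in half, which keeps each `[Page N]` marker together with its text.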

In [None]:
SYSTEM_PROMPT = """
You are a professional translator of medieval Latin manorial accounts into modern English.

Tasks, in order:
1. Silently fix obvious OCR errors in the Latin (split words, wrong letters, stray punctuation), 
   but do NOT modernize spelling unnecessarily.
2. Translate the corrected Latin into clear, modern English.
3. Preserve all ROMAN NUMERALS exactly as written (e.g. xiiij, xxvij, MCCXL).
4. Preserve any page markers like [Page 3].
5. If some lines are already in English (editorial headings, notes), keep them as they are.
"""

def translate_chunk(text_block, target_language="English"):
    response = client.responses.create(
        model="gpt-4.1-mini",  # or e.g. "gpt-5" if costs aren't an issue
        input=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text_block},
        ],
    )
    return response.output_text

def translate_pdf_to_blocks(pdf_path):
    pages = extract_pages(pdf_path)
    chunks = chunk_pages(pages)
    
    translated_chunks = []
    for i, chunk in enumerate(chunks, start=1):
        print(f"Translating chunk {i}/{len(chunks)} for {pdf_path.name}...")
        translated = translate_chunk(chunk)
        translated_chunks.append(translated)
    return translated_chunks

translated_chunks = translate_pdf_to_blocks(pdf_path)
len(translated_chunks)
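
Since the prompt asks the model to preserve Roman numerals exactly, a rough check on each translated chunk can catch cases where it did not. The sketch below is a heuristic only, not part of the notebook: the names `roman_numerals` and `missing_numerals` are hypothetical, the comparison ignores case, and short Latin words that happen to look like numerals (for example "vi" or "mi") will be flagged too, so treat the result as a list of places to check by hand.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z]+")
# Well-formed numerals, allowing medieval additive forms (iiii) and a final j.
NUMERAL = re.compile(
    r"m*(?:cm|cd|d?c{0,4})(?:xc|xl|l?x{0,4})(?:ix|iv|v?i{0,4}j?)"
)


def roman_numerals(text):
    """Count the tokens in text that look like Roman numerals (lowercased)."""
    tokens = (token.lower() for token in TOKEN.findall(text))
    return Counter(token for token in tokens if NUMERAL.fullmatch(token))


def missing_numerals(latin, english):
    """Return numerals that occur in the Latin but not in the translation."""
    lost = roman_numerals(latin) - roman_numerals(english)
    return sorted(lost.elements())


latin = "Et de xiiij s. vj d. de redditu."
print(missing_numerals(latin, "And xiiij s. vj d. from rent."))  # []
print(missing_numerals(latin, "And 14 s. 6 d. from rent."))      # ['vj', 'xiiij']
```

Inside `translate_pdf_to_blocks`, this could be called as `missing_numerals(chunk, translated)` after each chunk to print a warning when the list is not empty.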


In [None]:
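# Assumption: save_translation_docx was not defined anywhere in this notebook.
# This minimal version writes each non-empty translated line as a paragraph.
def save_translation_docx(translated_chunks, out_path, title):
    doc = Document()
    doc.add_heading(title, level=1)
    for chunk in translated_chunks:
        for line in chunk.split("\n"):
            if line.strip():
                doc.add_paragraph(line.strip())
    doc.save(out_path)

def translate_ocr_pdf(pdf_path, out_dir):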
    translated_chunks = translate_pdf_to_blocks(pdf_path)
    out_name = pdf_path.stem.replace("_ocr", "") + "_translation.docx"
    out_path = out_dir / out_name
    save_translation_docx(translated_chunks, out_path, title=pdf_path.stem)
    print(f"Saved translation to: {out_path}")
    return out_path

for pdf_file in ocr_dir.glob("*_ocr.pdf"):
    translate_ocr_pdf(pdf_file, ocr_dir)
