This project defines a deterministic, folder-based workflow to convert a PDF exported from PowerPoint into a print-ready PDF for inside pages only. Covers are explicitly out of scope.
The workflow is designed for Linux, fully scriptable, and suitable for execution by an AI agent.
It is written specifically for Ubuntu LTS; the install.sh installer relies on APT and will not work on other systems.
The core principles are:
- Deterministic and reproducible output
- One folder per step, no in-place modification
- Filenames always derived from the original input filename
- Full audit trail with reports per step
- Vector pages stay vector whenever possible
- Rasterization and upscaling only where strictly necessary
The workflow assumes:
- The PowerPoint source already uses the correct final page size.
- Minimum safe margin of 5 mm is already respected inside PowerPoint.
- The PDF in 00-input was exported directly from PowerPoint using high-quality settings.
- Inside pages are delivered as single pages, not spreads.
- Target effective resolution for raster content is ≥300 dpi at final size.
Out of scope:
- Editing layout, text, or margins
- Fixing PowerPoint design mistakes
- Cover, spine, bleed, or binding calculations
- Reflowing or reconstructing vector content
00-input/
01-validate/
02-analyze-dpi/
03-extract-images/
04-upscale-images/
05-verify-images/
06-resize-images/
07-resize-smasks/
08-replace-images/
09-normalize-pdf/
10-pdf-x4/
11-output/
preflight.sh (stdout only)
Each step reads only from earlier steps and writes only to its own folder.
./convert.sh 00-input/boek.pdf
Runs steps 1–11 in order and stops on the first failure. Then runs preflight on both the PDF/X-4 and PDF/X-1a outputs. All script output is streamed to the terminal.
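The fail-fast behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the contents of convert.sh: `run_steps` is a hypothetical helper, and it assumes each step is available as a command or shell function that returns non-zero on failure.

```bash
# Hypothetical helper: run each step command in order and stop at the first failure.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! "${step}"; then
      echo "FAILED: ${step}" >&2
      return 1
    fi
  done
  echo "all steps completed"
}
```

Because the loop returns as soon as one step fails, later steps never see partial output from a failed predecessor.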
./remove-converted.sh 00-input/boek.pdf
Removes all pipeline outputs for the specified input (e.g., folders like 03-extract-images/boek/ and files like 08-replace-images/boek.*), while keeping the original file in 00-input/.
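As an illustration of which paths such a cleanup would touch, the sketch below only lists the matching outputs instead of deleting them. `list_pipeline_outputs` is a hypothetical helper, and it assumes it is run from the project root with step folders named as in the list above.

```bash
# Hypothetical helper: print every pipeline output that belongs to one input PDF.
# Assumes the current directory is the project root; 00-input/ is never listed.
list_pipeline_outputs() {
  local base dir path
  base="$(basename "$1" .pdf)"
  for dir in [0-9][0-9]-*/; do
    [ "${dir}" = "00-input/" ] && continue
    # Match both the per-document folder (boek/) and per-document files (boek.*).
    for path in "${dir}${base}" "${dir}${base}".*; do
      [ -e "${path}" ] && echo "${path}"
    done
  done
  return 0
}
```

Reviewing the printed list before removing anything keeps the original in 00-input/ safe.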
Input file:
00-input/boek.pdf
Derived names always preserve the base name boek.
Image-based outputs are grouped in a folder named after the document:
03-extract-images/boek/...
04-upscale-images/boek/...
05-verify-images/boek/...
06-resize-images/boek/...
07-resize-smasks/boek/...
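The naming rule can be sketched with a few small helpers. The function names are hypothetical and only illustrate how every derived path keeps the base name of the input file.

```bash
# Hypothetical helpers illustrating the naming rule; not taken from the scripts.

# Base name of the input document, e.g. 00-input/boek.pdf -> boek
doc_base() { basename "$1" .pdf; }

# Path of a per-step file, e.g. step_file 01-validate boek validated.txt
step_file() { printf '%s/%s.%s\n' "$1" "$2" "$3"; }

# Path of the per-document image folder used by the image steps.
image_dir() { printf '%s/%s\n' "$1" "$2"; }
```

For example, `step_file 01-validate "$(doc_base 00-input/boek.pdf)" validated.txt` yields 01-validate/boek.validated.txt.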
Purpose Starting point. Contains the PDF exported from PowerPoint.
Rules
- Treat as read-only.
- No renaming after ingestion.
Contents
00-input/boek.pdf
Purpose Sanity check and baseline metadata extraction. No mutation.
Checks
- Page count
- Page size consistency
- Encryption
- Basic color space detection
- File hash
Outputs
01-validate/boek.validated.txt
Fail if
- Page sizes differ
- PDF is encrypted or unreadable
- Page count is zero
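One possible way to implement the page count and page size checks is to parse the per-page lines that pdfinfo prints when a page range is given. The sketch below assumes lines of the form `Page    1 size: 595.276 x 841.89 pts (A4)` on stdin; `pages_consistent` is a hypothetical name.

```bash
# Reads `pdfinfo -f 1 -l <pages> file.pdf` output on stdin.
# Succeeds only if at least one page is listed and all pages share one size.
pages_consistent() {
  awk '
    $1 == "Page" && $3 == "size:" { pages++; sizes[$4 "x" $6] = 1 }
    END {
      n = 0
      for (s in sizes) n++
      if (pages == 0) { print "no pages found"; exit 1 }
      if (n != 1)     { print "page sizes differ (" n " distinct sizes)"; exit 1 }
      print pages " pages, all " s " pts"
    }'
}
```

Here `<pages>` would be the page count reported by a plain pdfinfo call; the printed summary line could go straight into boek.validated.txt.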
Purpose Identify embedded raster images below the target DPI.
Definition
Effective DPI = pixel resolution of a raster image relative to the physical size it is placed at on the page.
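As a worked example of this definition: one PDF point is 1/72 inch, so an image that is 1500 pixels wide and placed at 360 pt (5 inches) has an effective resolution of 300 dpi. The helper below is a sketch for a single axis; `effective_dpi` is a hypothetical name, and a real check would presumably evaluate both axes and use the lower value.

```bash
# effective_dpi <pixels> <placed_size_pt>
# One PDF point is 1/72 inch, so DPI = pixels / (points / 72).
effective_dpi() {
  awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.1f\n", px / (pt / 72) }'
}

# effective_dpi 1500 360   -> 300.0
# effective_dpi 800 288    -> 200.0
```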
Outputs
02-analyze-dpi/boek.dpi.csv
02-analyze-dpi/boek.lowdpi.images.csv
boek.lowdpi.images.csv contains one image per line with page number, object id, and DPI.
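A list like this could be derived from `pdfimages -list`, which reports the object number, ID, and x/y ppi for each embedded image. The sketch below is one possible approach, not necessarily what the pipeline does; the column positions are an assumption based on the `pdfimages -list` header and should be checked against the installed Poppler version.

```bash
# Reads `pdfimages -list file.pdf` output on stdin and prints
# "page,object,id,dpi" for each image whose lower ppi value is below the target.
# Assumed columns: page=1, type=3, object=11, ID=12, x-ppi=13, y-ppi=14.
low_dpi_rows() {
  local target="${1:-300}"
  awk -v target="${target}" '
    NR <= 2 { next }               # skip the header line and the dashed rule
    $3 != "image" { next }         # ignore smask and other non-image rows
    {
      dpi = ($13 < $14) ? $13 : $14
      if (dpi + 0 < target + 0) print $1 "," $11 "," $12 "," dpi
    }'
}
```

A call such as `pdfimages -list 00-input/boek.pdf | low_dpi_rows 300` would then produce rows suitable for the low-DPI CSV.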
Fail if
- DPI analysis cannot be computed
If no low-DPI images are found in step 02:
- Steps 03, 04, 05, 06, 07, and 08 are skipped.
- The pipeline continues directly with step 09 (normalize PDF) using the original PDF.
In other words:
02-analyze-dpi
├─ low-DPI images found → 03 → 04 → 05 → 06 → 07 → 08 → 09 → 10 → 11
└─ no low-DPI images → 09 → 10 → 11
After step 11, run `preflight.sh` on the output PDF you want to validate.
This ensures:
- No unnecessary rasterization
- Output remains fully vector where possible
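The branching above can be sketched as a small decision helper. `next_steps` is a hypothetical name, and the sketch assumes the low-DPI CSV has one image per line and no header row.

```bash
# Hypothetical helper: decide which steps to run after step 02.
# Assumes the low-DPI CSV has one image per line and no header row.
next_steps() {
  local csv="$1" count
  [ -f "${csv}" ] || { echo "missing ${csv}" >&2; return 1; }
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  count="$(grep -c . "${csv}" || true)"
  if [ "${count}" -gt 0 ]; then
    echo "03 04 05 06 07 08 09 10 11"
  else
    echo "09 10 11"
  fi
}
```

A missing CSV is treated as a failure rather than as "no low-DPI images", in line with the fail-fast principle.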
Purpose Extract embedded raster images from pages that contain low-DPI content.
Inputs
- Images listed in boek.lowdpi.images.csv
Outputs
03-extract-images/boek/obj-<object>-<id>.png
03-extract-images/boek.images.csv
Rules
- Lossless output (PNG or TIFF)
- Preserve original pixel dimensions
Purpose Upscale only the extracted images so effective resolution meets or exceeds target DPI.
Strategy
- Compute required scale factor per image
- Clamp to reasonable bounds (e.g. max x4)
- Do not blindly upscale everything
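The strategy above amounts to a small calculation per image. The sketch below is illustrative; `required_scale` is a hypothetical name, and the clamp bounds follow the "max x4" example.

```bash
# required_scale <current_dpi> <target_dpi> <max_upscale>
# Prints the factor needed to reach the target, clamped to [1, max_upscale].
required_scale() {
  awk -v cur="$1" -v target="$2" -v max="$3" 'BEGIN {
    scale = target / cur
    if (scale < 1)   scale = 1     # already at or above target: leave untouched
    if (scale > max) scale = max   # cap the factor; later steps handle any shortfall
    printf "%.2f\n", scale
  }'
}

# required_scale 150 300 4.0   -> 2.00
# required_scale 50 300 4.0    -> 4.00 (clamped)
```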
Outputs
04-upscale-images/boek/obj-<object>-<id>.up.png
Fail if
- Upscaler fails
- Output dimensions do not match expected scale
Purpose Verify that upscaled images meet the target DPI. Images that still fall short are copied to 05-verify-images/boek/ so that step 06 can resize them.
Outputs
05-verify-images/boek.verify.csv
05-verify-images/boek/obj-<object>-<id>.up.png
Purpose Non-AI resize for any images that still miss target DPI after AI upscaling.
Outputs
06-resize-images/boek/obj-<object>-<id>.up.png
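For a conventional resize, the required pixel size follows directly from the placed size and the target DPI. The helper below is a sketch for one axis; `target_pixels` is a hypothetical name.

```bash
# target_pixels <placed_size_pt> <target_dpi>
# Smallest whole pixel count that reaches the target DPI at the placed size.
target_pixels() {
  awk -v pt="$1" -v dpi="$2" 'BEGIN {
    px = pt / 72 * dpi
    rounded = int(px)
    if (rounded < px) rounded++    # round up so the result never falls short
    print rounded
  }'
}

# target_pixels 360 300   -> 1500
# target_pixels 100 300   -> 417
```

The resulting width and height could then be handed to a resampling tool such as ImageMagick.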
Purpose Resize soft masks (SMask) to match the resized image dimensions.
Outputs
07-resize-smasks/boek/obj-<object>-<id>.up.png
Purpose Replace the original low-DPI image objects in the PDF with the upscaled versions, preserving vector content. Replacement images are converted to CMYK to avoid color conversion during normalization.
Outputs
08-replace-images/boek.pdf
08-replace-images/boek.replace.txt
Purpose Prepare final print-deliverable PDF.
For PDF/X-4, normalization embeds the required output intent (ICC profile) without rasterizing or resampling content.
Typical actions
- Flatten transparency
- Convert to CMYK
- Export to required PDF standard (PDF/X-4)
- Disable resampling
Outputs
09-normalize-pdf/boek.pdf
09-normalize-pdf/boek.normalize.txt
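As a partial illustration of the "Convert to CMYK" and "Disable resampling" actions, the sketch below assembles a Ghostscript pdfwrite argument list. It deliberately leaves out the PDF/X output intent and transparency handling, which need additional setup not shown here; `gs_normalize_args` is a hypothetical helper, and the exact options used by the pipeline may differ.

```bash
# Hypothetical helper: print one Ghostscript argument per line.
# Covers CMYK conversion and disabling image downsampling only;
# PDF/X output intent handling is intentionally not included.
gs_normalize_args() {
  local in="$1" out="$2"
  printf '%s\n' \
    -dBATCH -dNOPAUSE \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=CMYK \
    -dDownsampleColorImages=false \
    -dDownsampleGrayImages=false \
    -dDownsampleMonoImages=false \
    "-sOutputFile=${out}" \
    "${in}"
}
```

In bash the list could be consumed with `mapfile -t args < <(gs_normalize_args in.pdf out.pdf)` followed by `gs "${args[@]}"`, and the same list can be written to boek.normalize.txt for the audit trail.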
Purpose Set print trim and bleed boxes. Uses a 3 mm trim inset by default.
Outputs
10-pdf-x4/boek.pdf
10-pdf-x4/boek.trim.txt
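The trim inset is specified in millimetres while PDF boxes are expressed in points (1 mm = 72 / 25.4 pt), so 3 mm is roughly 8.504 pt. The sketch below computes the box coordinates; `trim_box` is a hypothetical name, and it assumes the page box starts at 0 0 and the inset is the same on all four sides.

```bash
# trim_box <page_width_pt> <page_height_pt> [inset_mm]
# Prints "llx lly urx ury" in points for a box inset evenly from the page edges.
# Assumes the page box starts at 0 0; 1 mm = 72 / 25.4 pt.
trim_box() {
  awk -v w="$1" -v h="$2" -v mm="${3:-3}" 'BEGIN {
    inset = mm * 72 / 25.4
    printf "%.3f %.3f %.3f %.3f\n", inset, inset, w - inset, h - inset
  }'
}

# trim_box 612 792 3   -> 8.504 8.504 603.496 783.496
```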
Purpose Convert PDF/X-4 output to PDF/X-1a.
Outputs
11-output/boek.pdf
Purpose Final verification.
Checks
- Page count and sizes
- Color space
- Remaining RGB objects
- Remaining low-DPI issues
- File integrity
Outputs
Prints to stdout only.
Fail if
- Page sizes differ
- Low-DPI issues remain
Use preflight directly on a concrete file:
./preflight.sh 10-pdf-x4/boek.pdf
./preflight.sh 11-output/boek.pdf
The default final deliverable is 10-pdf-x4/boek.pdf, with optional PDF/X-1a output in 11-output/.
All tunable values must live in a single config file:
TARGET_DPI=300
RASTERIZE_DPI=400
MAX_UPSCALE=4.0
UPSCALER_MODEL=RealESRGAN_x4plus
IMAGE_FORMAT=png
COLOR_PROFILE=/usr/share/color/icc/colord/FOGRA39L_coated.icc
Each report must log the effective configuration used.
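One simple way to meet both requirements is to source the config file and echo the effective values into each report. The sketch below assumes the config uses plain KEY=value shell syntax as shown above; `log_effective_config` and the default file name are hypothetical.

```bash
# Hypothetical helper: load the config file and print the effective values
# so they can be appended to a step report. The default file name is an assumption.
log_effective_config() {
  local config="${1:-./pipeline.conf}"
  . "${config}" || return 1
  printf 'TARGET_DPI=%s\n'     "${TARGET_DPI:-<unset>}"
  printf 'RASTERIZE_DPI=%s\n'  "${RASTERIZE_DPI:-<unset>}"
  printf 'MAX_UPSCALE=%s\n'    "${MAX_UPSCALE:-<unset>}"
  printf 'UPSCALER_MODEL=%s\n' "${UPSCALER_MODEL:-<unset>}"
  printf 'IMAGE_FORMAT=%s\n'   "${IMAGE_FORMAT:-<unset>}"
  printf 'COLOR_PROFILE=%s\n'  "${COLOR_PROFILE:-<unset>}"
}
```

Printing `<unset>` for missing keys makes an incomplete config visible in the report instead of silently falling back to defaults.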
These requirements are summarized from New Energy’s Dutch print delivery specifications and are provided for convenience (inside pages only; covers are out of scope here).
- Images should be ≥300 dpi; below 240 dpi risks visible quality loss. Avoid web-sourced images because of their low quality and possible rights issues.
- Add 3 mm bleed for inside pages; bleed artwork must extend into the bleed area.
- Minimum line thickness: 0.1 mm (or 0.4 mm for foil finishes).
- Export as PDF/X-4 for print delivery in this pipeline (transparency preserved).
- Use CMYK only (no RGB); total ink coverage should not exceed 280%.
- Deep black (typically for covers): C50 M40 Y40 K100. Text/line art in body should be K100 only.
- Include trim marks on export; keep offsets outside the bleed.
- Deliver cPDF (certified PDF) and export using PDF/X‑4 presets.
Profile used by default: /usr/share/color/icc/colord/FOGRA39L_coated.icc (Coated FOGRA39 / ISO 12647-2:2004).
An AI agent working on this repository must:
- Never modify files in place
- Always write outputs to the next step folder
- Fail fast on invariant violations
- Skip steps cleanly when no-op conditions apply
- Produce human-readable reports at every step
The following tools are required by the workflow as described.
These are needed to run the pipeline end to end.
- poppler-utils, used for:
  - pdfinfo: page size and page count
  - pdfimages: extracting embedded raster images
  - pdftoppm: rasterizing pages
- ghostscript, used for:
  - transparency flattening
  - CMYK conversion
  - PDF/X export
  - final print normalization
- qpdf, used for:
  - safe PDF object inspection and replacement
  - sanity checks
  - metadata inspection
- ImageMagick, used for:
  - verifying image dimensions
  - confirming DPI after rasterization and upscaling
  - format conversion (PNG, TIFF)
- exiftool, used for:
  - inspecting image metadata
  - validating DPI and color info in images
- Real-ESRGAN (external, not via apt), used for:
  - upscaling extracted raster images that fall below target DPI
  - GPU-accelerated where available
  Recommended model: RealESRGAN_x4plus
- bash: primary orchestration language.
- coreutils (cut, sort, uniq, wc) plus sed, awk, and grep, used for:
  - parsing reports
  - page list generation
  - control flow decisions
- jq (optional but recommended): used if DPI reports are emitted as JSON instead of CSV.
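Before running the pipeline, it can help to confirm that the required commands are on PATH. The sketch below is a generic check; `missing_tools` is a hypothetical helper, and the command names passed to it depend on how each tool is installed.

```bash
# Prints the name of every listed command that is not on PATH;
# returns non-zero if at least one is missing.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "${tool}"
      missing=1
    fi
  done
  return "${missing}"
}
```

A typical call might be `missing_tools pdfinfo pdfimages pdftoppm gs qpdf convert exiftool`; the Real-ESRGAN executable name varies by distribution, so it is left out of this example.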