# 📓 The GenAI Revolution Cookbook

**Title:** GPT-4o transcription: How to Preserve Word Document Layouts in Python

**Description:** Preserve Word layouts, tables, and media with a hybrid Python and GPT-4o workflow that combines text extraction and page images for accurate transcription.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Why This Approach Works

Most DOCX-to-Markdown converters rely on embedded XML or heuristics that fail when documents use complex layouts, tables, or custom styles. This pipeline combines three complementary tools to preserve structure:

- **LibreOffice** converts DOCX to PDF, freezing the layout into a stable format.
- **Poppler and PyMuPDF** work on the same PDF: Poppler (through `pdf2image`) renders each page as an image, and PyMuPDF extracts the embedded text.
- **GPT-4o** uses both the text and the image to reconstruct Markdown that respects visual hierarchy, tables, and formatting.

By the end of this guide, you'll have a Colab-ready notebook that takes any DOCX file and outputs per-page Markdown files plus a merged document.

## How It Works (High-Level Overview)

The pipeline runs in four stages:

1. **DOCX → PDF**: LibreOffice converts the DOCX to PDF in headless mode, preserving layout.
2. **PDF → Images**: Poppler renders each page as a high-DPI image.
3. **PDF → Text**: PyMuPDF extracts embedded text per page.
4. **Multimodal Transcription**: GPT-4o receives both the text and the image for each page and outputs structured Markdown.

Because the model sees both inputs, it can fall back on the visual layout when the extracted text is incomplete or out of order, and on the extracted text when characters in the image are hard to read.

## Setup & Installation

Run this cell first to install system dependencies and Python packages in Colab:

In [None]:
!apt-get update -qq
!apt-get install -y libreoffice poppler-utils
!pip install pdf2image PyMuPDF pillow openai python-dotenv

Next, verify that LibreOffice and Poppler are available:

In [None]:
!soffice --headless --version
!pdftoppm -v

If both commands return version information, you're ready to proceed.
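
The same check can also be done from Python, which may be convenient when the notebook runs outside Colab. The sketch below relies only on the standard library's `shutil.which`; the helper name `find_missing_tools` is an assumption made for illustration and is not part of the original notebook.

```python
import shutil
from typing import Iterable, List


def find_missing_tools(tools: Iterable[str] = ("soffice", "pdftoppm")) -> List[str]:
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing_tools = find_missing_tools()
if missing_tools:
    print(f"Not found on PATH: {', '.join(missing_tools)}")
else:
    print("LibreOffice and Poppler binaries found.")
```

An empty list means both binaries are discoverable; it does not prove that they run correctly, so the version commands above remain the stronger check.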

Now, securely load your OpenAI API key from Colab userdata:

In [None]:
import os
from google.colab import userdata
from google.colab.userdata import SecretNotFoundError

keys = ["OPENAI_API_KEY"]
missing = []
for k in keys:
    value = None
    try:
        value = userdata.get(k)
    except SecretNotFoundError:
        pass

    os.environ[k] = value if value is not None else ""

    if not os.environ[k]:
        missing.append(k)

if missing:
    raise EnvironmentError(f"Missing keys: {', '.join(missing)}. Add them in Colab → Settings → Secrets.")

print("All keys loaded.")

Finally, initialize libraries and constants:

In [None]:
import sys
import io
import base64
import subprocess
import logging
from pathlib import Path
from typing import List, Optional

from openai import OpenAI
from pdf2image import convert_from_path
import fitz
from PIL import Image

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    # Raise a descriptive error rather than exiting the interpreter from inside a notebook.
    raise EnvironmentError("OPENAI_API_KEY is not set.")

client = OpenAI()

LIBREOFFICE_BIN = os.getenv("LIBREOFFICE_BIN", "soffice")
POPPLER_PATH = os.getenv("POPPLER_PATH")

## Step-by-Step Implementation

### Step 1: Convert DOCX to PDF

This function calls LibreOffice in headless mode to convert the DOCX file to PDF. It validates that the input exists, runs the conversion, and checks that the output PDF was created.

In [None]:
def docx_to_pdf(docx_path: Path, out_dir: Optional[Path] = None) -> Path:
    docx_path = Path(docx_path).resolve()
    if not docx_path.exists():
        raise FileNotFoundError(f"Input DOCX not found: {docx_path}")

    out_dir = Path(out_dir or docx_path.parent).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    cmd = [
        LIBREOFFICE_BIN,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=False
        )
    except FileNotFoundError as e:
        raise RuntimeError(
            "LibreOffice not found. Install LibreOffice and ensure 'soffice' is on PATH."
        ) from e

    if result.returncode != 0:
        logging.error(f"LibreOffice stderr: {result.stderr.strip()}")
        raise RuntimeError(
            f"LibreOffice conversion failed. Code {result.returncode}. "
            f"stderr: {result.stderr.strip()}"
        )

    pdf_name = docx_path.with_suffix(".pdf").name
    pdf_path = out_dir / pdf_name
    if not pdf_path.exists():
        raise RuntimeError("Expected PDF not created. Check LibreOffice logs and file permissions.")
    logging.info(f"Converted DOCX to PDF: {pdf_path}")
    return pdf_path
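
The `subprocess.run` call above has no time limit, so a stalled conversion would block the notebook indefinitely. One possible guard is the `timeout` argument of `subprocess.run`. The sketch below is an illustration under assumptions: the helper name `run_with_timeout` and the 120-second default are hypothetical, and the Python interpreter stands in for LibreOffice so the example runs anywhere.

```python
import subprocess
import sys
from typing import List


def run_with_timeout(cmd: List[str], timeout_s: float = 120.0) -> subprocess.CompletedProcess:
    """Run a command, raising RuntimeError if it exceeds timeout_s seconds."""
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, check=False, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as e:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        raise RuntimeError(f"Command timed out after {timeout_s}s: {cmd[0]}") from e


# Demonstration: the Python interpreter stands in for the LibreOffice binary.
result = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
print(result.stdout.strip())
```

Inside `docx_to_pdf`, the same idea would amount to passing `timeout=` to the existing `subprocess.run` call and handling `subprocess.TimeoutExpired` next to the `FileNotFoundError` branch.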

Test the conversion with a sample DOCX file:

In [None]:
sample_docx = Path("sample.docx")
pdf_path = docx_to_pdf(sample_docx)
print(f"PDF created at: {pdf_path}")

### Step 2: Render PDF Pages as Images

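This function uses Poppler (through `pdf2image`) to render each page of the PDF as a high-resolution image. Higher DPI keeps small text and table rules legible for the model, but it also increases memory usage and the size of the image sent to the API.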

In [None]:
def pdf_to_images(pdf_path: Path, dpi: int = 300) -> List[Image.Image]:
    pdf_path = Path(pdf_path).resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    images = convert_from_path(
        str(pdf_path),
        dpi=dpi,
        poppler_path=POPPLER_PATH if POPPLER_PATH else None
    )
    logging.info(f"Rendered {len(images)} pages from PDF at {dpi} DPI.")
    return images
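
To make the DPI trade-off concrete, the arithmetic below estimates pixel dimensions and in-memory size per page. It is a rough sketch that assumes a US Letter page (8.5 × 11 inches) and uncompressed 8-bit RGB, three bytes per pixel; the helper names are hypothetical and actual memory use depends on the document's page size.

```python
from typing import Tuple


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Approximate size of an uncompressed 8-bit RGB image in bytes."""
    return width_px * height_px * 3


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (150, 300, 600):
    w, h = page_pixels(8.5, 11, dpi)
    mib = rgb_bytes(w, h) / (1024 ** 2)
    print(f"{dpi} DPI: {w} x {h} px, about {mib:.1f} MiB per page")
```

Doubling the DPI quadruples the pixel count, so a long document rendered at a high DPI can exhaust memory when every page is held in a list at once.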

Render the PDF and preview the first page:

In [None]:
images = pdf_to_images(pdf_path, dpi=300)
print(f"Rendered {len(images)} pages.")
images[0]

### Step 3: Extract Embedded Text Per Page

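This function uses PyMuPDF to extract text from each page. The text comes back in the order it is stored in the PDF, which usually matches reading order but can be incomplete or out of sequence for complex layouts such as multi-column pages and tables.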

In [None]:
def extract_text_per_page(pdf_path: Path) -> List[str]:
    pdf_path = Path(pdf_path).resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    texts: List[str] = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            t = page.get_text("text")
            texts.append(t.strip())
    logging.info(f"Extracted text from {len(texts)} PDF pages.")
    return texts

Extract text and print the first 20 lines:

In [None]:
texts = extract_text_per_page(pdf_path)
print("\n".join(texts[0].split("\n")[:20]))

### Step 4: Convert Images to Data URLs

This helper function converts a PIL Image to a PNG data URL for use in GPT-4o multimodal prompts.

In [None]:
def pil_image_to_data_url(img: Image.Image, png_compress_level: int = 6) -> str:
    buf = io.BytesIO()
    img.save(buf, format="PNG", compress_level=png_compress_level)
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{b64}"
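
Base64 encoding inflates the payload: every 3 bytes of PNG data become 4 characters in the data URL. The sketch below estimates the resulting length from the encoded PNG size and checks the estimate against a real encoding. The helper name `data_url_length` is an assumption for illustration, and the 1000-byte payload is placeholder data rather than a real image.

```python
import base64


def data_url_length(raw_size: int, mime: str = "image/png") -> int:
    """Length in characters of a base64 data URL that carries raw_size bytes."""
    prefix = f"data:{mime};base64,"
    # Base64 emits 4 characters per 3-byte group, padding the final group.
    encoded = 4 * ((raw_size + 2) // 3)
    return len(prefix) + encoded


# Placeholder payload standing in for PNG bytes.
payload = bytes(1000)
actual_url = "data:image/png;base64," + base64.b64encode(payload).decode("ascii")
print(data_url_length(len(payload)), len(actual_url))
```

Comparing this estimate with the size limits of the API you are calling is one way to decide whether a page should be rendered at a lower DPI before it is sent.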

### Step 5: Define the System Prompt

This prompt instructs GPT-4o to produce clean Markdown that preserves structure, including headings, lists, tables, and figure placeholders.

In [None]:
SYSTEM_PROMPT = """You are a meticulous document transcriber. Produce clean Markdown that preserves structure.
Rules:
- Keep the original reading order.
- Use #, ##, ### for headings that reflect visual hierarchy.
- Preserve lists, bold, italics, and footnotes.
- Reconstruct tables as Markdown tables with headers when present.
- For images or figures, insert a placeholder like [Figure N: short caption] with concise alt text.
- Include headers and footers only if they contain meaningful information.
- Do not hallucinate content. If text is illegible or absent, write [Unclear].
- Output only Markdown for the page, no extra commentary."""

### Step 6: Transcribe a Single Page with GPT-4o

This function sends both the extracted text and the rendered image to GPT-4o and returns structured Markdown for the page.

In [None]:
def transcribe_page_with_gpt4o(page_text: str, page_image: Image.Image, temperature: float = 0.0) -> str:
    image_url = pil_image_to_data_url(page_image)

    user_content = [
        {"type": "text", "text": "Here is the exact text extracted from this page:"},
        {"type": "text", "text": page_text or "[No embedded text extracted]"},
        {"type": "text", "text": "Here is the rendered image of the same page to infer layout:"},
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "Return structured Markdown for this single page only."},
    ]

    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    # message.content can be None (for example on a refusal), so fall back to an empty string.
    return (resp.choices[0].message.content or "").strip()

Test transcription on the first page:

In [None]:
md_page_1 = transcribe_page_with_gpt4o(texts[0], images[0])
print(md_page_1)
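
Despite the "output only Markdown" rule in the prompt, a model may occasionally wrap its whole answer in a fenced block tagged `markdown`. The post-processing sketch below removes such a wrapper when it is present. It is a hedged illustration: the helper name `strip_markdown_wrapper` is hypothetical, and only wrappers tagged `markdown` or `md` are removed so that pages that legitimately begin and end with code blocks are left alone.

```python
import re

# Built programmatically so this example contains no literal fence line.
FENCE = "`" * 3

_WRAPPER_RE = re.compile(
    rf"^{FENCE}(?:markdown|md)[ \t]*\n(.*)\n{FENCE}$",
    re.DOTALL | re.IGNORECASE,
)


def strip_markdown_wrapper(text: str) -> str:
    """Remove an outer fence tagged markdown/md; return other text unchanged (stripped)."""
    stripped = text.strip()
    match = _WRAPPER_RE.match(stripped)
    return match.group(1).strip() if match else stripped


wrapped = f"{FENCE}markdown\n# Title\n\nBody text.\n{FENCE}"
print(strip_markdown_wrapper(wrapped))
```

In this notebook the helper could be applied to the return value of `transcribe_page_with_gpt4o` before the page is stored.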

### Step 7: Add Retry Logic for Rate Limits

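This wrapper retries the transcription call with exponential backoff and random jitter when the API returns a rate limit error or a 5xx server error. Errors that carry no status code, such as connection failures, are treated as retryable.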

In [None]:
import time
import random
from openai import APIError, RateLimitError

def safe_transcribe_page(page_text: str, page_image: Image.Image, max_retries: int = 5) -> str:
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            return transcribe_page_with_gpt4o(page_text, page_image)
        except RateLimitError as e:
            logging.warning(f"Rate limit error on attempt {attempt}: {e}")
            if attempt == max_retries:
                raise
            time.sleep(delay + random.random())
            delay *= 2
        except APIError as e:
            status = getattr(e, "status_code", 500)
            if 500 <= status < 600 and attempt < max_retries:
                logging.warning(f"API error (status {status}) on attempt {attempt}: {e}")
                time.sleep(delay + random.random())
                delay *= 2
                continue
            raise

### Step 8: Orchestrate the Full Pipeline

This function coordinates all stages: convert DOCX to PDF, render images, extract text, and transcribe each page with GPT-4o.

In [None]:
def transcribe_docx(docx_path: Path, dpi: int = 300) -> List[str]:
    pdf_path = docx_to_pdf(docx_path)
    images = pdf_to_images(pdf_path, dpi=dpi)
    texts = extract_text_per_page(pdf_path)

    if len(images) != len(texts):
        raise RuntimeError(
            f"Page count mismatch. images={len(images)} texts={len(texts)}"
        )

    outputs: List[str] = []
    for idx, (img, txt) in enumerate(zip(images, texts), start=1):
        logging.info(f"Transcribing page {idx}/{len(images)} ...")
        md = safe_transcribe_page(txt, img)
        outputs.append(md)
    return outputs

Run the full pipeline:

In [None]:
outputs = transcribe_docx(sample_docx, dpi=300)
print(f"Transcribed {len(outputs)} pages.")
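
The loop in `transcribe_docx` handles pages one at a time. For longer documents, a bounded thread pool is one way to overlap API calls while keeping pages in order. The sketch below is an illustration under assumptions: `transcribe_pages_concurrently` and `fake_transcribe` are hypothetical names, the stand-in function replaces the real API call, and a sensible `max_workers` value depends on your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def transcribe_pages_concurrently(
    pages: Sequence[Tuple[str, object]],
    transcribe: Callable[[str, object], str],
    max_workers: int = 4,
) -> List[str]:
    """Transcribe (text, image) pairs with at most max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(lambda page: transcribe(*page), pages))


def fake_transcribe(text: str, image: object) -> str:
    """Stand-in for the real transcription call, used only for this demo."""
    return f"# {text}"


demo_pages = [("page one", None), ("page two", None), ("page three", None)]
print(transcribe_pages_concurrently(demo_pages, fake_transcribe, max_workers=2))
```

In the notebook, the equivalent call would pass `list(zip(texts, images))` and `safe_transcribe_page`, so that each worker still benefits from the retry logic.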

### Step 9: Write Outputs to Disk

This function writes each page's Markdown to a separate file and creates a merged document.

In [None]:
def write_outputs(docx_path: Path, outputs: List[str], out_dir: Optional[Path] = None) -> Path:
    base = Path(out_dir or docx_path.parent) / (docx_path.stem + "_transcription")
    base.mkdir(parents=True, exist_ok=True)
    merged = []
    for i, md in enumerate(outputs, start=1):
        p = base / f"page_{i:04d}.md"
        p.write_text(md, encoding="utf-8")
        merged.append(f"<!-- Page {i} -->\n\n{md}\n")
    merged_path = base / "full_document.md"
    merged_path.write_text("\n\n".join(merged), encoding="utf-8")
    logging.info(f"Wrote {len(outputs)} page files and merged Markdown to {base}")
    return base

Write the outputs:

In [None]:
out_dir = write_outputs(sample_docx, outputs)
print(f"Wrote transcription to {out_dir}")

## Run and Validate

Display the first page image alongside its generated Markdown:

In [None]:
from IPython.display import display, Markdown

print("Original Page Image:")
display(images[0])
print("\nGenerated Markdown:")
display(Markdown(outputs[0]))

Verify that page counts match and detect empty pages:

In [None]:
assert len(images) == len(texts) == len(outputs), "Page count mismatch detected."
empty_pages = [i+1 for i, txt in enumerate(texts) if not txt.strip()]
if empty_pages:
    print(f"Warning: Empty text on pages {empty_pages}")
else:
    print("All pages contain text.")

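Check for tables in the Markdown output. This check only looks for a pipe character, so treat the result as a rough signal rather than proof that a table was reconstructed: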

In [None]:
table_pages = [i+1 for i, md in enumerate(outputs) if "|" in md]
if table_pages:
    print(f"Tables detected on pages: {table_pages}")
else:
    print("No tables detected. Verify if tables were expected.")
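
A pipe character can also appear in ordinary prose or code, so a somewhat stricter check is to look for a table delimiter row (for example `| --- | --- |`) directly under a line that contains a pipe. The sketch below is a heuristic, not a Markdown parser; the helper name `has_markdown_table` is hypothetical, and the pattern deliberately requires at least two columns so that a plain horizontal rule is not mistaken for a table.

```python
import re

# Delimiter row of a pipe table, e.g. "| --- | :---: |" or "---|---".
# Requires at least two columns so a bare "---" rule does not match.
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def has_markdown_table(md: str) -> bool:
    """Return True if a header line with a pipe is followed by a delimiter row."""
    lines = md.splitlines()
    for header, delimiter in zip(lines, lines[1:]):
        if "|" in header and _DELIMITER_ROW.match(delimiter):
            return True
    return False


sample_table = "| Name | Qty |\n| --- | ---: |\n| Bolt | 4 |"
sample_prose = "Run `cat file | sort` to order the lines."
print(has_markdown_table(sample_table), has_markdown_table(sample_prose))
```

With `outputs` from the pipeline, `[i + 1 for i, md in enumerate(outputs) if has_markdown_table(md)]` would list the pages where a table was detected.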

## Conclusion

You now have a working DOCX-to-Markdown pipeline that preserves layout, tables, and formatting. The hybrid approach combines LibreOffice for stable conversion, Poppler and PyMuPDF for dual extraction, and GPT-4o for intelligent reconstruction.

To extend this pipeline, consider adding hash-based caching to skip re-transcription of unchanged pages, or switch to `gpt-4o-mini` for faster, cheaper tests. You can also add image resizing to keep data URL payloads below API limits, or integrate a worker pool with bounded concurrency to parallelize transcription while respecting rate limits.
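
As a starting point for the hash-based caching idea, the sketch below derives a cache key from the model name, the page text, and the image bytes, then stores each result as a file named after that key. Everything here is an assumption made for illustration: the names `page_cache_key` and `cached_transcribe`, the on-disk layout, and the stand-in `fake_compute` function that replaces the real API call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def page_cache_key(page_text: str, image_bytes: bytes, model: str = "gpt-4o") -> str:
    """SHA-256 hex digest over the model name, page text, and image bytes."""
    h = hashlib.sha256()
    for part in (model.encode("utf-8"), page_text.encode("utf-8"), image_bytes):
        # A length prefix keeps ("ab", "") and ("a", "b") from hashing identically.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def cached_transcribe(cache_dir: Path, key: str, compute: Callable[[], str]) -> str:
    """Return the cached Markdown for key, calling compute() only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    result = compute()
    path.write_text(result, encoding="utf-8")
    return result


calls: List[int] = []


def fake_compute() -> str:
    """Stand-in for the real transcription call; records each invocation."""
    calls.append(1)
    return "# Page"


with tempfile.TemporaryDirectory() as tmp:
    key = page_cache_key("hello", b"placeholder image bytes")
    first = cached_transcribe(Path(tmp), key, fake_compute)
    second = cached_transcribe(Path(tmp), key, fake_compute)

print(first == second, len(calls))
```

In the pipeline, `compute` could be a lambda that calls `safe_transcribe_page(txt, img)`, with the image bytes taken from the same PNG buffer that feeds the data URL. Including the model name in the key means that switching models does not return stale results.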