A picture is worth a thousand words. We made each one worth about 33,333 tokens, and GPT-5.5 read them back.
What is a token, really? A model's text window stops at a hard number, 272,000 tokens for GPT-5.5 through the Codex endpoint. So we asked a simple question: if you render the text as images instead of sending it as text, does that number still apply? The vision path reads pixels, and pixels are cheap. How much text can you actually get a model to read by photographing it?
The answer, for this one proven run: 1,000,000 tokens of source text, packed into 30 page images, every fact recovered exactly. That is 3.14 tokens of source text carried for every input token the API billed.
We took 1,000,000 tokenizer-counted tokens of text (the source is standard Gutenberg books), hid three exact "needle" markers inside it, rendered the whole thing as 30 dense grayscale page images using the hyperlegible Atkinson Mono font, and sent the images to GPT-5.5. The model found all three needles, character for character.
| Metric | Value |
|---|---|
| Source text rendered | 1,000,000 tokens |
| Page images | 30 (3000x3000 px) |
| Source tokens per image | ~33,333 |
| Billed input tokens | 318,283 |
| Source-to-billed ratio | 3.14x |
| Needles recovered | 3 / 3 exact |
| One image more (31) | rejected: context_length_exceeded |
Evidence: experiments/atkinson_10_1M-api/atkinson_10_1M.summary.json.
This is the vision path used as a lossy, OCR-like transport layer. Images carry far more source text than the text window would normally hold, as long as the text is rendered densely, legibly, and verified with retrieval probes.
The interesting result is that "the endpoint accepted it" and "the model read it correctly" are two separate things, and they fail for different reasons.
Gate 1, the context gate, counts image patches. The endpoint slices each 3000x3000 image into 32x32 patches: 94 x 94 = 8,836 patches per page. The context limit is enforced on that patch count, near 272,000, and the billed input_tokens figure (318,283) is a separate billing number that is allowed to exceed it. The boundary is clean:
| Pages | Patches | Result |
|---|---|---|
| 30 | 265,080 | accepted |
| 31 | 273,916 | context_length_exceeded |
272,000 sits exactly between them. Thirty pages is the wall for this page size. Evidence: experiments/codex-gpt55-image-maximize-2026-06-20/api/t836811-p31-c10-fs10-m0-g0-ext562.summary.json.
Gate 2, the retrieval gate, is OCR. Passing the context gate only means the bytes fit. The model still has to read pixels. While standard Courier New layouts struggled with minor character spelling drops at ~900k tokens (dropping a single 'E' in QUEEQUEG), switching the design to the hyperlegible Atkinson Mono at size 10 and 8px line height completely solved these OCR limits. The model achieved perfect, character-for-character retrieval of all needles even at 1,000,000 source tokens.
Four scripts, run in order:
render_token_images.py text -> dense page PNGs + manifest
build_codex_request.py PNGs -> base64 Codex request with a strict-JSON output schema
send_codex_request.py request -> SSE stream -> classified summary
verify_repo.py sanity-check the packaged evidence
The renderer trims text to an exact token budget with tiktoken, inserts the three needle markers at 20% / 50% / 90% offsets, reflows paragraphs, and packs them into multi-column pages. It computes layout capacity and fails preflight if the text would overflow, so you never pay for an image that dropped characters off the page.
The winning layout is intentionally plain:
- 3000x3000 grayscale PNG pages
- 8 columns, 1 px margins, 3 px gutters
- Atkinson Mono at 10 px, 62 characters per line, 8 px line height
- 30 pages, 81,686 wrapped lines, 97.71% average rendered ink width
The reflow step is the core trick. Early attempts preserved Project Gutenberg's hard wraps, which left text in a narrow strip down the left side and wasted most of the page. Reflowing each paragraph to fill the full column width is what turned that wasted whitespace into dense, readable pages. Switching from Courier New to the hyperlegible Atkinson Mono then let the line height drop to 8 px without the model dropping characters, which is what carried the run past the old Courier New ceiling of 810,549 tokens to a clean 1,000,000.
You need uv. Nothing else installs globally.
Verify the packaged evidence:
uv run --with pillow --with tiktoken python scripts/verify_repo.pyRe-render the proven best source:
uv run --with pillow --with tiktoken python scripts/render_token_images.py \
--source-text experiments/atkinson_10_1M/source-1000000-tokens.txt \
--target-tokens 1000000 \
--pages 30 --columns 8 --font-size 10 \
--chars-per-line 62 --line-height 8 \
--margin 1 --gutter 3 \
--font experiments/color-layer-overlap-2026-06-20/fonts/AtkinsonMono.ttf \
--out runs/repro-1000000Build the request, then send it with your Codex auth:
python scripts/build_codex_request.py \
--render-dir runs/repro-1000000 \
--output runs/repro-1000000.request.json \
--redacted-output runs/repro-1000000.request.redacted.json
python scripts/send_codex_request.py \
--request runs/repro-1000000.request.json \
--manifest runs/repro-1000000/manifest.json \
--out-dir runs/repro-1000000-apisend_codex_request.py reads ~/.codex/auth.json by default, or accepts CODEX_ACCESS_TOKEN and CHATGPT_ACCOUNT_ID.
Good fits:
- Long-document triage where finding anchored evidence matters more than perfect transcription.
- Retrieval evals with inserted needles across huge payloads.
- Agent-memory or compaction experiments comparing text transport against image transport.
- Stress-testing multimodal context windows and billing behavior.
Bad fits:
- Anything where one wrong character is dangerous.
- Legal, medical, or financial work without a second verifier.
- Code execution or patch generation from image text.
- Private data, unless you are comfortable with the endpoint and its retention behavior.
scripts/render_token_images.py render source text into page PNGs
scripts/build_codex_request.py build a Responses-style Codex request body
scripts/send_codex_request.py send the request and summarize the SSE result
scripts/verify_repo.py check the packaged evidence and scripts
docs/experiment-log.md best result first, then superseded and failed runs
docs/request-shape.md observed Codex request shape and limits
docs/blob-context-probe.md earlier blob / base64 / image probes
experiments/ preserved summaries and the best run's images
examples/gutenberg-cache/ public-domain source texts
This is an experiment package, not an official SDK. The Codex endpoint, headers, model behavior, and limits can change. Treat the numbers here as grounded evidence for the recorded run, not a contract. Always test with needles, always inspect the marker crops, and only trust a layout that passes the retrieval task you actually care about.