jaredboynton/tokenfiche

tokenfiche

A picture is worth a thousand words. We made each one worth about 33,333 tokens, and GPT-5.5 read them back.

What is a token, really? A model's text window stops at a hard number, 272,000 tokens for GPT-5.5 through the Codex endpoint. So we asked a simple question: if you render the text as images instead of sending it as text, does that number still apply? The vision path reads pixels, and pixels are cheap. How much text can you actually get a model to read by photographing it?

The answer, for this one proven run: 1,000,000 tokens of source text, packed into 30 page images, every fact recovered exactly. That is 3.14 tokens of source text carried for every input token the API billed.

TL;DR

We took 1,000,000 tokenizer-counted tokens of text (the source is public-domain Project Gutenberg books), hid three exact "needle" markers inside it, rendered the whole thing as 30 dense grayscale page images using the hyperlegible Atkinson Mono font, and sent the images to GPT-5.5. The model found all three needles, character for character.

Metric                    Value
Source text rendered      1,000,000 tokens
Page images               30 (3000x3000 px)
Source tokens per image   ~33,333
Billed input tokens       318,283
Source-to-billed ratio    3.14x
Needles recovered         3 / 3 exact
One image more (31)       rejected: context_length_exceeded

Evidence: experiments/atkinson_10_1M-api/atkinson_10_1M.summary.json.
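The two derived rows in the summary can be checked with a few lines of arithmetic. This is only a sanity check on the published figures; the constant names are ours, not the repo's.

```python
# Figures copied from the summary table; names are illustrative.
SOURCE_TOKENS = 1_000_000
PAGES = 30
BILLED_INPUT_TOKENS = 318_283

# Average source tokens carried by one page image.
tokens_per_page = SOURCE_TOKENS / PAGES

# Source tokens carried per billed input token.
source_to_billed = SOURCE_TOKENS / BILLED_INPUT_TOKENS

print(f"source tokens per page: {tokens_per_page:,.0f}")
print(f"source-to-billed ratio: {source_to_billed:.2f}x")
```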

This is the vision path used as a lossy, OCR-like transport layer. Images carry far more source text than the text window would normally hold, as long as the text is rendered densely and legibly and the result is verified with retrieval probes.

The two gates

The interesting result is that "the endpoint accepted it" and "the model read it correctly" are two separate things, and they fail for different reasons.

Gate 1, the context gate, counts image patches. The endpoint slices each 3000x3000 image into 32x32 patches. 3000 / 32 = 93.75, which is consistent with the 94 patches per side if the partial edge patch is counted, giving 94 x 94 = 8,836 patches per page. The context limit is enforced on that patch count, near 272,000. The billed input_tokens figure (318,283) is a separate billing number and is allowed to exceed it. The boundary is clean:

Pages   Patches   Result
30      265,080   accepted
31      273,916   context_length_exceeded

272,000 falls between them. Thirty pages is the wall for this page size. Evidence: experiments/codex-gpt55-image-maximize-2026-06-20/api/t836811-p31-c10-fs10-m0-g0-ext562.summary.json.
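The patch arithmetic behind the context gate can be sketched as follows. Two things here are assumptions rather than documented endpoint behavior: that a partial patch at the image edge counts as a whole patch (which is what makes 3000 px come out to 94 per side), and that the limit is exactly 272,000 rather than merely near it.

```python
import math

PATCH_SIDE = 32          # patch size stated in the text
CONTEXT_LIMIT = 272_000  # assumed exact; the text only says "near 272,000"

def patches_per_page(width: int, height: int, patch: int = PATCH_SIDE) -> int:
    # Assumption: a partial patch at the edge is counted as a full patch.
    return math.ceil(width / patch) * math.ceil(height / patch)

def max_pages(width: int, height: int, limit: int = CONTEXT_LIMIT) -> int:
    # Largest page count whose total patch count stays within the limit.
    return limit // patches_per_page(width, height)

per_page = patches_per_page(3000, 3000)
for pages in (30, 31):
    total = pages * per_page
    verdict = "fits" if total <= CONTEXT_LIMIT else "over the limit"
    print(pages, total, verdict)

print("max pages at 3000x3000:", max_pages(3000, 3000))
```

The same function gives a quick estimate for other page sizes before spending a request on them, though only a real request confirms where the endpoint draws the line.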

Gate 2, the retrieval gate, is OCR. Passing the context gate only means the bytes fit. The model still has to read the pixels. Courier New layouts dropped occasional characters at ~900k tokens (for example, a single 'E' in QUEEQUEG). Switching to the hyperlegible Atkinson Mono at 10 px with an 8 px line height removed those drops in the recorded run: the model retrieved all three needles character for character, even at 1,000,000 source tokens.
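A retrieval gate of this kind comes down to an exact string comparison per needle. The sketch below is not the repo's checker; the function names, the dictionary layout, and the marker keys are hypothetical. It shows how a single dropped character, like the missing 'E' described above, turns into a failed needle with a located mismatch.

```python
def first_mismatch(expected: str, recovered: str):
    """Return the index of the first differing character, or None if identical."""
    for i, (a, b) in enumerate(zip(expected, recovered)):
        if a != b:
            return i
    if len(expected) != len(recovered):
        # One string is a strict prefix of the other.
        return min(len(expected), len(recovered))
    return None

def score_needles(expected: dict, recovered: dict) -> dict:
    """Compare each expected needle with what the model returned."""
    report = {}
    for key, want in expected.items():
        got = recovered.get(key, "")
        report[key] = {
            "exact": got == want,
            "first_mismatch": first_mismatch(want, got),
        }
    return report

# Hypothetical markers: the second one lost an 'E' on the way back.
expected = {"needle_1": "ALPHA-7731", "needle_2": "QUEEQUEG"}
recovered = {"needle_1": "ALPHA-7731", "needle_2": "QUEQUEG"}
report = score_needles(expected, recovered)
print(report)
```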

How it works

Four scripts, run in order:

render_token_images.py   text -> dense page PNGs + manifest
build_codex_request.py   PNGs -> base64 Codex request with a strict-JSON output schema
send_codex_request.py    request -> SSE stream -> classified summary
verify_repo.py           sanity-check the packaged evidence
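For the "PNGs -> base64" step, one plausible shape is the public Responses API convention of `input_text` and `input_image` content parts, with each image passed as a data URL. Whether build_codex_request.py emits exactly this is an assumption; docs/request-shape.md is the authority on the observed request body. The helper names below are hypothetical.

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    # Assumed shape: a Responses-style image part carrying a base64 data URL.
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {"type": "input_image", "image_url": f"data:image/png;base64,{encoded}"}

def build_content(prompt: str, pages: list) -> list:
    # One text part with the instructions, then one image part per page, in order.
    return [{"type": "input_text", "text": prompt}] + [image_part(p) for p in pages]

# Stand-in bytes for the demo; a real run would read each rendered page PNG from disk.
fake_png = b"\x89PNG\r\n\x1a\n" + b"demo"
content = build_content("Find the three markers.", [fake_png])
print(len(content), content[1]["image_url"][:30])
```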

The renderer trims text to an exact token budget with tiktoken, inserts the three needle markers at 20% / 50% / 90% offsets, reflows paragraphs, and packs them into multi-column pages. It computes layout capacity and fails preflight if the text would overflow, so you never pay for an image that dropped characters off the page.
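The trim-and-insert step can be sketched on a plain list of tokens. This is a simplified stand-in, not the renderer's code: words play the role of tokens (the real script counts with tiktoken), each needle is treated as a single item, and needles are added on top of the budget rather than counted inside it. All three choices are assumptions made for illustration.

```python
def place_needles(tokens, needles, fractions=(0.20, 0.50, 0.90), budget=None):
    """Trim `tokens` to `budget`, then splice each needle in at a fractional offset."""
    if len(needles) != len(fractions):
        raise ValueError("need exactly one fraction per needle")
    body = list(tokens[:budget]) if budget is not None else list(tokens)
    n = len(body)
    # Compute every position against the trimmed length before inserting anything.
    positions = [int(n * f) for f in fractions]
    # Insert from the last position to the first so earlier indices stay valid.
    for pos, needle in sorted(zip(positions, needles), reverse=True):
        body.insert(pos, needle)
    return body

words = [f"w{i}" for i in range(12)]  # stand-in tokens
result = place_needles(words, ["N1", "N2", "N3"], budget=10)
print(result)
```

Fixing the positions before inserting keeps the offsets tied to the source text, so adding a needle early in the document does not shift where the later ones land relative to it.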

The winning layout is intentionally plain:

  • 3000x3000 grayscale PNG pages
  • 8 columns, 1 px margins, 3 px gutters
  • Atkinson Mono at 10 px, 62 characters per line, 8 px line height
  • 30 pages, 81,686 wrapped lines, 97.71% average rendered ink width
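The layout numbers above are enough to sketch the kind of capacity check the preflight performs. The class and function names are hypothetical, and the geometry assumes the margin applies on every edge with no extra padding, which the source does not spell out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layout:
    page_px: int = 3000
    columns: int = 8
    margin_px: int = 1
    gutter_px: int = 3
    line_height_px: int = 8
    pages: int = 30

    def lines_per_column(self) -> int:
        # Assumption: the margin is applied at both the top and the bottom.
        return (self.page_px - 2 * self.margin_px) // self.line_height_px

    def column_width_px(self) -> float:
        # Width left for one column after side margins and inter-column gutters.
        usable = self.page_px - 2 * self.margin_px - (self.columns - 1) * self.gutter_px
        return usable / self.columns

    def line_capacity(self) -> int:
        return self.lines_per_column() * self.columns * self.pages

def preflight(layout: Layout, wrapped_lines: int) -> None:
    # Fail before rendering if the text cannot fit on the pages.
    if wrapped_lines > layout.line_capacity():
        raise ValueError(
            f"{wrapped_lines} lines exceed capacity {layout.line_capacity()}"
        )

layout = Layout()
preflight(layout, 81_686)  # the recorded run's wrapped line count
print(layout.lines_per_column(), layout.line_capacity(), layout.column_width_px())
```

Under these assumptions each column is about 372 px wide, which leaves roughly 6 px per character at 62 characters per line.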

The reflow step is the core trick. Early attempts preserved Project Gutenberg's hard wraps, which left text in a narrow strip down the left side and wasted most of the page. Reflowing each paragraph to fill the full column width is what turned that wasted whitespace into dense, readable pages. Switching from Courier New to the hyperlegible Atkinson Mono then let the line height drop to 8 px without the model dropping characters, which is what carried the run past the old Courier New ceiling of 810,549 tokens to a clean 1,000,000.
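The reflow idea can be illustrated with the standard library's textwrap module. This is a minimal sketch, not the renderer's implementation: it assumes paragraphs are separated by blank lines and drops those blank lines, which the real script may handle differently.

```python
import textwrap

def reflow(text: str, width: int = 62) -> list[str]:
    """Join each hard-wrapped paragraph, then rewrap it to the full column width."""
    lines = []
    for para in text.split("\n\n"):
        joined = " ".join(para.split())  # collapse hard wraps and runs of spaces
        if joined:
            lines.extend(textwrap.wrap(joined, width=width))
    return lines

sample = (
    "Call me Ishmael. Some years ago,\n"
    "never mind how long precisely,\n"
    "having little or no money in my purse,\n"
    "and nothing particular to interest me on shore.\n"
    "\n"
    "It is a way I have\n"
    "of driving off the spleen."
)

for line in reflow(sample):
    print(f"{len(line):2d} | {line}")
```

The hard-wrapped sample occupies six short lines; reflowed to 62 characters it needs fewer, and each line uses most of the column.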

Reproduce it

You need uv. Nothing else installs globally.

Verify the packaged evidence:

uv run --with pillow --with tiktoken python scripts/verify_repo.py

Re-render the proven best source:

uv run --with pillow --with tiktoken python scripts/render_token_images.py \
  --source-text experiments/atkinson_10_1M/source-1000000-tokens.txt \
  --target-tokens 1000000 \
  --pages 30 --columns 8 --font-size 10 \
  --chars-per-line 62 --line-height 8 \
  --margin 1 --gutter 3 \
  --font experiments/color-layer-overlap-2026-06-20/fonts/AtkinsonMono.ttf \
  --out runs/repro-1000000

Build the request, then send it with your Codex auth:

python scripts/build_codex_request.py \
  --render-dir runs/repro-1000000 \
  --output runs/repro-1000000.request.json \
  --redacted-output runs/repro-1000000.request.redacted.json

python scripts/send_codex_request.py \
  --request runs/repro-1000000.request.json \
  --manifest runs/repro-1000000/manifest.json \
  --out-dir runs/repro-1000000-api

send_codex_request.py reads ~/.codex/auth.json by default, or accepts CODEX_ACCESS_TOKEN and CHATGPT_ACCOUNT_ID.

When this is worth it

Good fits:

  • Long-document triage where finding anchored evidence matters more than perfect transcription.
  • Retrieval evals with inserted needles across huge payloads.
  • Agent-memory or compaction experiments comparing text transport against image transport.
  • Stress-testing multimodal context windows and billing behavior.

Bad fits:

  • Anything where one wrong character is dangerous.
  • Legal, medical, or financial work without a second verifier.
  • Code execution or patch generation from image text.
  • Private data, unless you are comfortable with the endpoint and its retention behavior.

Repo map

scripts/render_token_images.py   render source text into page PNGs
scripts/build_codex_request.py   build a Responses-style Codex request body
scripts/send_codex_request.py    send the request and summarize the SSE result
scripts/verify_repo.py           check the packaged evidence and scripts
docs/experiment-log.md           best result first, then superseded and failed runs
docs/request-shape.md            observed Codex request shape and limits
docs/blob-context-probe.md       earlier blob / base64 / image probes
experiments/                     preserved summaries and the best run's images
examples/gutenberg-cache/        public-domain source texts

Status

This is an experiment package, not an official SDK. The Codex endpoint, headers, model behavior, and limits can change. Treat the numbers here as grounded evidence for the recorded run, not a contract. Always test with needles, always inspect the marker crops, and only trust a layout that passes the retrieval task you actually care about.

