Xsn/mtmd placeholder chunks by ngxson · Pull Request #106 · ngxson/llama.cpp

ngxson · 2026-05-30T16:37:07Z

For AI review

Mirror upstream ggml-org#23913

Summary by CodeRabbit

New Features
- Added OpenAI-compatible token-counting endpoints for chat completions and responses; router mode now proxies these.
- Media handling upgraded with placeholder-aware bitmaps and explicit size/buffer APIs for safer image/audio preprocessing and conversion.
Documentation
- Added docs and examples for the new token-counting endpoints.
Tests
- Added unit tests for token-counting (chat and vision).
Bug Fixes
- Improved placeholder detection and validation to prevent invalid media from being encoded.

coderabbitai · 2026-05-30T16:37:25Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5ea937dd-4462-49fd-af3a-dd87b23e9805

📥 Commits

Reviewing files that changed from the base of the PR and between 1945165 and c72ef5c.

📒 Files selected for processing (1)

tools/mtmd/mtmd.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

tools/mtmd/mtmd.cpp

📝 Walkthrough


Refactors CLIP/MTMD image and bitmap types to encapsulated accessors with placeholder support; updates image preprocessing, CLIP usage across vision graphs, media ingestion, removes legacy CLIP helpers, and adds server-side token-counting endpoints with tests and documentation.

Changes

Image Container Encapsulation and CLIP API Update

| Layer / File(s) | Summary |
| --- | --- |
| Image container refactor `tools/mtmd/clip-impl.h`, `tools/mtmd/clip.h` | `clip_image_u8` and `clip_image_f32` move from public fields to private storage with accessor/mutator methods including placeholder detection, size queries, pixel get/set, conversion, normalization, and `clip_image_size::operator==`. |
| CLIP API removals and public function updates `tools/mtmd/clip.h`, `tools/mtmd/clip.cpp` | Removes deprecated functions (`clip_embd_nbytes`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`); updates debug writers and conversion helpers to use accessor-based APIs. |
| Core CLIP function and image handling updates `tools/mtmd/clip.cpp` | Debug PPM/BMP writers, f32→u8 conversion, vision patch-count/position-embedding math, raw input tensor creation, warmup sizing, and batch encode paths now use the `get_size()`/`nx()`/`ny()`/`get_ro_buf()`/`get_pixel()` APIs. |

Media Bitmap and Tokenization Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Bitmap class refactoring and placeholder support `tools/mtmd/mtmd.h`, `tools/mtmd/mtmd.cpp`, `tools/mtmd/mtmd-helper.h`, `tools/mtmd/mtmd-helper.cpp` | `mtmd_bitmap` becomes an initialized container that copies input data, exposes `get_ro_buf()`, `is_placeholder()`, and `n_bytes()`; helper initializers gain a `bool placeholder` parameter. |
| Token object placeholder detection and initialization `tools/mtmd/mtmd.cpp` | `mtmd_image_tokens` and `mtmd_audio_tokens` add `is_placeholder()` helpers; `mtmd_encode_chunk` and `mtmd_encode` reject null or placeholder batches. |
| Image/audio preprocessing and media encoding `tools/mtmd/mtmd.cpp` | Image/audio ingestion now validates bitmap dimensions, uses `set_size()`/`cpy_buf()` and `get_ro_buf()` for population, and marks mel/image buffers as placeholders when appropriate; debug helpers updated accordingly. |
| Image preprocessing tool refactoring `tools/mtmd/mtmd-image.cpp` | `img_u8_to_f32`, `resize`, `crop`, `composite`, `fill`, and resizing algorithms refactored to use `get_size()`, `get_pixel()`, `set_pixel()`, `cpy_buf()` and to handle placeholders; various preprocessors updated to use the accessor API. |

Vision Graph and Model Updates

| Layer / File(s) | Summary |
| --- | --- |
| Vision graph and batch encoding refactor `tools/mtmd/clip.cpp` | Patch-count, token-count, projector math, and `clip_image_batch_encode` vision/audio staging updated to read per-entry `nx()`/`ny()` and `get_ro_buf()`; conversion and sizing use `set_size()`/`cpy_buf()`. |
| Model graph builder dimension accessor updates `tools/mtmd/models/*` | Vision model graph builders (conformer, glm4v, granite-speech, kimik25, mimovl, qwen2vl, qwen3vl, whisper-enc) updated to call `img.nx()`/`img.ny()` instead of reading public fields. |
| CLI media loading updates `tools/mtmd/mtmd-cli.cpp` | `mtmd_cli_context::load_media` now calls `mtmd_helper_bitmap_init_from_file(..., false)` with explicit placeholder argument. |

Server Token Counting Endpoints

| Layer / File(s) | Summary |
| --- | --- |
| Token counting route handlers and implementation `tools/server/server-context.cpp`, `tools/server/server-context.h` | Adds `post_chat_completions_tok` and `post_responses_tok_oai` handlers; implements `handle_count_tokens()` to parse requests, convert payloads (OAI/Anthropic/Responses), extract prompts, and compute `input_tokens` via MTMD or fallback tokenization. |
| Server endpoint registration and wiring `tools/server/server.cpp` | Registers `POST /chat/completions/input_tokens`, `POST /responses/input_tokens` (and `/v1` variants), wires proxy routes in router mode, and groups token-counting routes. |
| Token counting helper and documentation `tools/server/server-common.h`, `tools/server/server-common.cpp`, `tools/server/README.md` | `process_mtmd_prompt` now takes `const` references and an optional `is_placeholder` flag; README documents the new OpenAI-compatible token-counting endpoints and example responses. |
| Unit tests for token counting endpoints `tools/server/tests/unit/test_chat_completion.py`, `tools/server/tests/unit/test_vision_api.py` | New tests exercise `/chat/completions/input_tokens` for text-only and text+image payloads and assert successful responses with non-trivial token counts. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through pixels, hid widths and heights away,

Accessors whisper where the buffers play,
Placeholders nap until token counts call,
Routes listen closely and tests check them all,
A tiny rabbit applauds this tidy refactor day.

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is severely incomplete, providing only 'For AI review' and a link to upstream. It lacks all required template sections (Overview, Additional information details, and proper Requirements acknowledgment). | Complete the description with: (1) Overview explaining what the changes do and why, (2) clear rationale for mirroring the upstream PR, (3) explicit AI usage disclosure, (4) confirmation of contributing guidelines agreement, and (5) verification of AI content restrictions per AGENTS.md. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.30%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Xsn/mtmd placeholder chunks' is vague and uses branch naming convention rather than describing the actual change; it lacks clarity about what problem is being solved or what feature is being added. | Use a more descriptive title that clearly summarizes the main change, such as 'Add placeholder bitmap support for MTMD token counting' or 'Refactor CLIP image accessors with placeholder detection'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mtmd/clip-impl.h (1)
7-15: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make tools/mtmd/clip-impl.h self-contained.

tools/mtmd/clip-impl.h throws std::runtime_error (lines 435-437, 544-546) but doesn’t include <stdexcept>. tools/mtmd/clip.cpp includes clip-impl.h before its own <stdexcept>, so the build currently relies on a transitive include rather than on the header being self-contained.
Proposed fix
 #include <array>
 #include <climits>
+#include <stdexcept>
 #include <cstdarg>
 #include <cinttypes>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/clip-impl.h` around lines 7 - 15, The header
tools/mtmd/clip-impl.h is not self-contained because it throws
std::runtime_error (see the throw sites) but doesn't include <stdexcept>; add
`#include <stdexcept>` at the top of clip-impl.h so the code that throws
std::runtime_error compiles without relying on transitive includes (ensure the
include sits with the other standard headers already present).

🧹 Nitpick comments (2)

tools/server/tests/unit/test_vision_api.py (1)

101-117: ⚡ Quick win

Verify image content actually affects token counting.

input_tokens > 10 is a weak proxy. Add a text-only baseline and assert the multimodal request counts more tokens.

Proposed assertion upgrade

 def test_vision_chat_completion_token_count():
     global server
     server.start()
     res = server.make_request("POST", "/chat/completions/input_tokens", data={
         "temperature": 0.0,
         "top_k": 1,
         "messages": [
             {"role": "user", "content": [
                 {"type": "text", "text": "What is this:"},
                 {"type": "image_url", "image_url": {
                     "url": get_img_url("IMG_URL_0"),
                 }},
             ]},
         ],
     })
     assert res.status_code == 200
+    assert res.body["object"] == "response.input_tokens"
     assert res.body["input_tokens"] > 10
+
+    text_only = server.make_request("POST", "/chat/completions/input_tokens", data={
+        "messages": [{"role": "user", "content": "What is this:"}],
+    })
+    assert text_only.status_code == 200
+    assert res.body["input_tokens"] > text_only.body["input_tokens"]
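
One way to build the text-only baseline suggested here is to derive it from the multimodal payload, so that the two requests differ only by the image part. The sketch below is a hypothetical helper, not part of the PR: the name `text_only_messages` and the example URL are assumptions made for illustration, and it only handles the OpenAI-style `content` shapes that appear in this test.

```python
from copy import deepcopy


def text_only_messages(messages):
    """Return a copy of `messages` with every non-text content part removed."""
    stripped = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            # Plain-string content is already text-only.
            stripped.append(deepcopy(msg))
            continue
        # Keep only the parts whose type is "text"; drop image_url and others.
        text_parts = [deepcopy(p) for p in content if p.get("type") == "text"]
        stripped.append({**msg, "content": text_parts})
    return stripped


# Hypothetical payload mirroring the structure used in the test.
multimodal = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is this:"},
        {"type": "image_url", "image_url": {"url": "https://example.invalid/cat.png"}},
    ]},
]
baseline = text_only_messages(multimodal)
```

In the test, `baseline` could be posted to the same endpoint with the same sampling parameters, and its `input_tokens` compared against the multimodal count.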

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_vision_api.py` around lines 101 - 117, The test
test_vision_chat_completion_token_count currently only asserts
res.body["input_tokens"] > 10; add a text-only baseline request using
server.make_request to the same "/chat/completions/input_tokens" endpoint with
an equivalent messages payload that contains only the text part (e.g.,
{"role":"user","content":[{"type":"text","text":"What is this:"}]}) and capture
its input_tokens, then assert the multimodal response's input_tokens is greater
than the text-only baseline (res.body["input_tokens"] >
baseline["input_tokens"]). Ensure you reuse the same request parameters
(temperature, top_k) and message ordering so the only difference is the
image_url content.

tools/server/tests/unit/test_chat_completion.py (1)

578-592: ⚡ Quick win

Strengthen token-count contract assertions.

This test currently only checks status and a loose lower bound. It should also validate the response discriminator and deterministic count across identical requests.

Proposed test hardening

 def test_chat_completions_token_count():
     global server
     server.start()
-    # make sure cache can be reused across multiple choices and multiple requests
-    # ref: https://github.com/ggml-org/llama.cpp/pull/18663
+    counts = []
     for _ in range(2):
         res = server.make_request("POST", "/chat/completions/input_tokens", data={
             "messages": [
                 {"role": "system", "content": "Book"},
                 {"role": "user", "content": "What is the best book"},
             ],
         })
         assert res.status_code == 200
+        assert res.body["object"] == "response.input_tokens"
+        assert isinstance(res.body["input_tokens"], int)
         assert res.body["input_tokens"] > 5
+        counts.append(res.body["input_tokens"])
+    assert counts[0] == counts[1]
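
The assertions in this diff could also be collected into a small reusable check. The following is a minimal sketch, assuming the response body has the shape used in the proposed diff (an `object` field equal to `"response.input_tokens"` plus an integer `input_tokens`). The helper name `input_tokens_problems` is hypothetical, and the field values are taken from the diff rather than verified against the server.

```python
EXPECTED_OBJECT = "response.input_tokens"  # assumption: value taken from the proposed diff


def input_tokens_problems(body, min_tokens=1):
    """Return a list of human-readable problems found in a token-count body."""
    problems = []
    if body.get("object") != EXPECTED_OBJECT:
        problems.append("unexpected object field")
    count = body.get("input_tokens")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(count, bool) or not isinstance(count, int):
        problems.append("input_tokens is not an integer")
    elif count < min_tokens:
        problems.append("input_tokens is below the expected minimum")
    return problems


# Hypothetical bodies for illustration only.
good = {"object": "response.input_tokens", "input_tokens": 12}
bad = {"input_tokens": "12"}
print(input_tokens_problems(good))
print(input_tokens_problems(bad))
```

A test could then assert that the returned list is empty, which keeps the failure message descriptive when the contract changes.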

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_chat_completion.py` around lines 578 - 592, The
test test_chat_completions_token_count only asserts status and a loose lower
bound; update it to also assert the response discriminator and deterministic
token counts: after calling server.make_request("POST",
"/chat/completions/input_tokens", ...) verify the discriminator field (the
proposed diff uses res.body["object"] == "response.input_tokens"; confirm the
exact key and value against the API) and that res.body["input_tokens"] is an
integer, capture res.body["input_tokens"] on the first request, then assert on
the second identical request that res.body["input_tokens"] equals the first
captured value (in addition to the existing > 5 check) to ensure deterministic
counts across identical requests. Keep the changes inside
test_chat_completions_token_count.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mtmd/clip.cpp`:
- Around line 3432-3439: The GGML_ASSERT is comparing bytes to element count:
change the assertion to check element counts (use GGML_ASSERT(n_step * n_mel ==
buf.size());) because mel_inp->get_ro_buf() returns a std::vector<float> where
buf.size() is number of floats; keep the memcpy as-is (it should still copy
n_step * n_mel * sizeof(float) bytes into inp_raw).

In `@tools/server/server-context.cpp`:
- Around line 4838-4843: The token-counting path for /input_tokens currently
calls process_mtmd_prompt(mctx, prompt.get<std::string>(), files) which runs
full MTMD preprocessing; change this to the placeholder-mode path so counting
uses cheap placeholder chunks (e.g., call a placeholder variant such as
process_mtmd_prompt_placeholder(mctx, prompt.get<std::string>(), files) or add a
boolean flag to process_mtmd_prompt like process_mtmd_prompt(mctx,
prompt.get<std::string>(), files, /*placeholder=*/true)) so that when mctx is
non-null in the /input_tokens flow you compute n_tokens from the
placeholder-mode result instead of performing full preprocessing.

In `@tools/server/server.cpp`:
- Around line 192-193: Duplicate POST route registration for
"/responses/input_tokens" exists: locate the ctx_http.post calls that reference
routes.post_responses_tok_oai (the two entries registering
"/responses/input_tokens") and remove the redundant registration so only a
single ctx_http.post("/responses/input_tokens",
ex_wrapper(routes.post_responses_tok_oai)) remains; also scan the nearby block
(the other occurrence around the 207-211 region) to ensure no other duplicate
registrations remain and consolidate them to a single registration to avoid
route ambiguity.

---

Outside diff comments:
In `@tools/mtmd/clip-impl.h`:
- Around line 7-15: The header tools/mtmd/clip-impl.h is not self-contained
because it throws std::runtime_error (see the throw sites) but doesn't include
<stdexcept>; add `#include <stdexcept>` at the top of clip-impl.h so the code
that throws std::runtime_error compiles without relying on transitive includes
(ensure the include sits with the other standard headers already present).

---

Nitpick comments:
In `@tools/server/tests/unit/test_chat_completion.py`:
- Around line 578-592: The test test_chat_completions_token_count only asserts
status and a loose lower bound; update it to also assert the response
discriminator and deterministic token counts: after calling
server.make_request("POST", "/chat/completions/input_tokens", ...) verify the
discriminator field (the proposed diff uses res.body["object"] ==
"response.input_tokens"; confirm the exact key and value against the API) and
that res.body["input_tokens"] is an integer, capture res.body["input_tokens"]
on the first request, then assert on the second identical request that
res.body["input_tokens"] equals the first captured value (in addition to the
existing > 5 check) to ensure deterministic counts across identical requests.
Keep the changes inside test_chat_completions_token_count.

In `@tools/server/tests/unit/test_vision_api.py`:
- Around line 101-117: The test test_vision_chat_completion_token_count
currently only asserts res.body["input_tokens"] > 10; add a text-only baseline
request using server.make_request to the same "/chat/completions/input_tokens"
endpoint with an equivalent messages payload that contains only the text part
(e.g., {"role":"user","content":[{"type":"text","text":"What is this:"}]}) and
capture its input_tokens, then assert the multimodal response's input_tokens is
greater than the text-only baseline (res.body["input_tokens"] >
baseline["input_tokens"]). Ensure you reuse the same request parameters
(temperature, top_k) and message ordering so the only difference is the
image_url content.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abc7b4b0-c5df-4e97-862f-f86950f9c5c9

📥 Commits

Reviewing files that changed from the base of the PR and between d38d50e and 447e418.

📒 Files selected for processing (25)

tools/mtmd/clip-impl.h
tools/mtmd/clip.cpp
tools/mtmd/clip.h
tools/mtmd/models/conformer.cpp
tools/mtmd/models/glm4v.cpp
tools/mtmd/models/granite-speech.cpp
tools/mtmd/models/kimik25.cpp
tools/mtmd/models/mimovl.cpp
tools/mtmd/models/qwen2vl.cpp
tools/mtmd/models/qwen3vl.cpp
tools/mtmd/models/whisper-enc.cpp
tools/mtmd/mtmd-cli.cpp
tools/mtmd/mtmd-helper.cpp
tools/mtmd/mtmd-helper.h
tools/mtmd/mtmd-image.cpp
tools/mtmd/mtmd.cpp
tools/mtmd/mtmd.h
tools/server/README.md
tools/server/server-common.cpp
tools/server/server-common.h
tools/server/server-context.cpp
tools/server/server-context.h
tools/server/server.cpp
tools/server/tests/unit/test_chat_completion.py
tools/server/tests/unit/test_vision_api.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mtmd/mtmd-image.cpp (1)
1265-1270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate the placeholder flag into dst.

Line 1265 hardcodes dst as non-placeholder, but the function immediately returns on placeholder inputs. That breaks the new placeholder flow for the Step3VL preprocessing path by leaving dst shaped like a real image without populated pixels. Pass src.is_placeholder() through when sizing dst instead of hardcoding false.
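
The intended behaviour can be modelled outside the codebase. The sketch below is written in Python purely for illustration: `Image`, `set_size`, and `resize_to` are simplified, hypothetical stand-ins for the C++ types in tools/mtmd (the real `set_size` call quoted in this review takes different arguments). It shows the flag being propagated, so that a placeholder source yields a placeholder destination with the target dimensions and no pixel buffer.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """Hypothetical stand-in for an RGB image container with a placeholder flag."""
    nx: int = 0
    ny: int = 0
    placeholder: bool = False
    pixels: bytearray = field(default_factory=bytearray)

    def set_size(self, nx, ny, placeholder):
        self.nx, self.ny, self.placeholder = nx, ny, placeholder
        # A placeholder carries dimensions only; a real image gets an RGB buffer.
        self.pixels = bytearray() if placeholder else bytearray(nx * ny * 3)


def resize_to(src, target_nx, target_ny):
    """Nearest-neighbour resize that preserves the placeholder state of `src`."""
    dst = Image()
    # Propagate the flag instead of hardcoding False.
    dst.set_size(target_nx, target_ny, src.placeholder)
    if src.placeholder:
        return dst  # fast path: there are no pixels to resample
    for y in range(target_ny):
        sy = y * src.ny // target_ny
        for x in range(target_nx):
            sx = x * src.nx // target_nx
            s = (sy * src.nx + sx) * 3
            d = (y * target_nx + x) * 3
            dst.pixels[d:d + 3] = src.pixels[s:s + 3]
    return dst
```

With the flag hardcoded to `False`, the early return would leave `dst` looking like a real image whose buffer was never filled, which is the inconsistency this comment describes.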
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/mtmd-image.cpp` around lines 1265 - 1270, The dst image is always
being created as non-placeholder by calling dst.set_size({target_width,
target_height}, false, false) even when src.is_placeholder() is true; change the
call in the function that handles Step3VL preprocessing so that the placeholder
flag is propagated (pass src.is_placeholder() as the placeholder argument to
dst.set_size) instead of hardcoding false, ensuring dst retains the placeholder
state when src.is_placeholder() returns true.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tools/mtmd/mtmd-image.cpp`:
- Around line 1265-1270: The dst image is always being created as
non-placeholder by calling dst.set_size({target_width, target_height}, false,
false) even when src.is_placeholder() is true; change the call in the function
that handles Step3VL preprocessing so that the placeholder flag is propagated
(pass src.is_placeholder() as the placeholder argument to dst.set_size) instead
of hardcoding false, ensuring dst retains the placeholder state when
src.is_placeholder() returns true.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c10f5ec2-d209-4a85-886d-8d9c18d2baff

📥 Commits

Reviewing files that changed from the base of the PR and between 447e418 and 8f67dfb.

📒 Files selected for processing (4)

tools/mtmd/clip-impl.h
tools/mtmd/clip.cpp
tools/mtmd/mtmd-image.cpp
tools/mtmd/mtmd.cpp

🚧 Files skipped from review as they are similar to previous changes (2)

tools/mtmd/clip-impl.h
tools/mtmd/mtmd.cpp

ngxson added 9 commits May 30, 2026 16:18

mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing

924bbab

fast path skip preproc for placeholder

064c2d7

fix build

d1a098d

correct the api

58171a6

add server endpoint + tests

f1503cf

add object name

aec9eff

update docs

035d72c

add proxy handling

3cb2d8c

fix build

447e418

github-actions Bot added examples python server labels May 30, 2026

coderabbitai Bot reviewed May 30, 2026


Comment thread tools/mtmd/clip.cpp Outdated

Comment thread tools/server/server-context.cpp

Comment thread tools/server/server.cpp Outdated
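
The first of these threads (tools/mtmd/clip.cpp) concerned an assertion that compared a byte count with an element count. The distinction can be shown with Python's standard `array` module, used here only as an analogy for `std::vector<float>`. The name `mel_buffer_matches` is hypothetical and the sizes are made up for the example.

```python
from array import array


def mel_buffer_matches(n_step, n_mel, buf):
    """True when `buf` holds exactly n_step * n_mel elements (not bytes)."""
    return n_step * n_mel == len(buf)


n_step, n_mel = 3, 4
buf = array("f", [0.0] * (n_step * n_mel))

n_elements = len(buf)            # what a size() call on a float vector reports
n_bytes = n_elements * buf.itemsize  # what a raw memory copy needs

# Comparing the byte count against len(buf) fails for any non-empty buffer,
# because each float occupies more than one byte.
wrong_check = n_step * n_mel * buf.itemsize == len(buf)
right_check = mel_buffer_matches(n_step, n_mel, buf)
```

The assertion should use the element count, while the copy length stays in bytes.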

ngxson added 3 commits May 30, 2026 18:58

fix audio input path

8f67dfb

use is_placeholder in process_mtmd_prompt()

8351aaf

nits

1945165

coderabbitai Bot reviewed May 30, 2026


nits (2)

c72ef5c

ngxson wants to merge 13 commits into ngxson:master from ggml-org:xsn/mtmd_placeholder_chunks
