Xsn/mtmd placeholder chunks #106

Open
ngxson wants to merge 13 commits into ngxson:master from ggml-org:xsn/mtmd_placeholder_chunks
ngxson (Owner) commented May 30, 2026

For AI review

Mirror upstream ggml-org#23913

Summary by CodeRabbit

  • New Features

    • Added OpenAI-compatible token-counting endpoints for chat completions and responses; router mode now proxies these.
    • Media handling upgraded with placeholder-aware bitmaps and explicit size/buffer APIs for safer image/audio preprocessing and conversion.
  • Documentation

    • Added docs and examples for the new token-counting endpoints.
  • Tests

    • Added unit tests for token-counting (chat and vision).
  • Bug Fixes

    • Improved placeholder detection and validation to prevent invalid media from being encoded.

coderabbitai Bot commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5ea937dd-4462-49fd-af3a-dd87b23e9805

📥 Commits

Reviewing files that changed from the base of the PR and between 1945165 and c72ef5c.

📒 Files selected for processing (1)
  • tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/mtmd/mtmd.cpp

📝 Walkthrough

Refactors CLIP/MTMD image and bitmap types into encapsulated containers with accessors and placeholder support; updates image preprocessing, CLIP usage across vision graphs, and media ingestion; removes legacy CLIP helpers; and adds server-side token-counting endpoints with tests and documentation.

Changes

Image Container Encapsulation and CLIP API Update

| Layer | File(s) | Summary |
|---|---|---|
| Image container refactor | tools/mtmd/clip-impl.h, tools/mtmd/clip.h | clip_image_u8 and clip_image_f32 move from public fields to private storage with accessor/mutator methods including placeholder detection, size queries, pixel get/set, conversion, normalization, and clip_image_size::operator==. |
| CLIP API removals and public function updates | tools/mtmd/clip.h, tools/mtmd/clip.cpp | Removes deprecated functions (clip_embd_nbytes, clip_image_u8_get_data, clip_build_img_from_pixels, clip_encode_float_image, clip_image_f32_batch_add_mel); updates debug writers and conversion helpers to use accessor-based APIs. |
| Core CLIP function and image handling updates | tools/mtmd/clip.cpp | Debug PPM/BMP writers, f32→u8 conversion, vision patch-count/position-embedding math, raw input tensor creation, warmup sizing, and batch encode paths now use get_size()/nx()/ny()/get_ro_buf()/get_pixel() APIs. |

Media Bitmap and Tokenization Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Bitmap class refactoring and placeholder support | tools/mtmd/mtmd.h, tools/mtmd/mtmd.cpp, tools/mtmd/mtmd-helper.h, tools/mtmd/mtmd-helper.cpp | mtmd_bitmap becomes an initialized container that copies input data, exposes get_ro_buf(), is_placeholder(), and n_bytes(); helper initializers gain a bool placeholder parameter. |
| Token object placeholder detection and initialization | tools/mtmd/mtmd.cpp | mtmd_image_tokens and mtmd_audio_tokens add is_placeholder() helpers; mtmd_encode_chunk and mtmd_encode reject null or placeholder batches. |
| Image/audio preprocessing and media encoding | tools/mtmd/mtmd.cpp | Image/audio ingestion now validates bitmap dimensions, uses set_size()/cpy_buf() and get_ro_buf() for population, and marks mel/image buffers as placeholders when appropriate; debug helpers updated accordingly. |
| Image preprocessing tool refactoring | tools/mtmd/mtmd-image.cpp | img_u8_to_f32, resize, crop, composite, fill, and resizing algorithms refactored to use get_size(), get_pixel(), set_pixel(), cpy_buf() and to handle placeholders; various preprocessors updated to use the accessor API. |
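
As a rough illustration of the placeholder idea summarized above, the sketch below models a bitmap that records its dimensions but may carry no pixel data. It is written in Python for brevity and is not the mtmd API: `Bitmap`, `count_image_tokens`, `encode`, and the 14-pixel patch size are hypothetical, and the premise that the token count depends only on image dimensions is an assumption made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bitmap:
    """Hypothetical stand-in for a media bitmap; not the real mtmd_bitmap."""
    nx: int
    ny: int
    data: Optional[bytes] = None  # None marks a placeholder (size only)

    def is_placeholder(self) -> bool:
        return self.data is None

    def n_bytes(self) -> int:
        # A placeholder owns no pixel storage.
        return 0 if self.data is None else len(self.data)


def count_image_tokens(bmp: Bitmap, patch: int = 14) -> int:
    """Assumed patch-grid count: one token per patch, computed from size alone."""
    if bmp.nx <= 0 or bmp.ny <= 0:
        raise ValueError("bitmap dimensions must be positive")
    return math.ceil(bmp.nx / patch) * math.ceil(bmp.ny / patch)


def encode(bmp: Bitmap) -> bytes:
    """Encoding needs real pixels, so placeholders are rejected up front."""
    if bmp.is_placeholder():
        raise ValueError("cannot encode a placeholder bitmap")
    return bmp.data


real = Bitmap(28, 28, bytes(28 * 28 * 3))
placeholder = Bitmap(28, 28)
# Both yield the same count, but only the real bitmap can be encoded.
same_count = count_image_tokens(real) == count_image_tokens(placeholder)
```

Under these assumptions a token-counting path could build placeholders from image headers alone and never decode pixels, while the encode path keeps rejecting them.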

Vision Graph and Model Updates

| Layer | File(s) | Summary |
|---|---|---|
| Vision graph and batch encoding refactor | tools/mtmd/clip.cpp | Patch-count, token-count, projector math, and clip_image_batch_encode vision/audio staging updated to read per-entry nx()/ny() and get_ro_buf(); conversion and sizing use set_size()/cpy_buf(). |
| Model graph builder dimension accessor updates | tools/mtmd/models/* | Vision model graph builders (conformer, glm4v, granite-speech, kimik25, mimovl, qwen2vl, qwen3vl, whisper-enc) updated to call img.nx()/img.ny() instead of reading public fields. |
| CLI media loading updates | tools/mtmd/mtmd-cli.cpp | mtmd_cli_context::load_media now calls mtmd_helper_bitmap_init_from_file(..., false) with explicit placeholder argument. |

Server Token Counting Endpoints

| Layer | File(s) | Summary |
|---|---|---|
| Token counting route handlers and implementation | tools/server/server-context.cpp, tools/server/server-context.h | Adds post_chat_completions_tok and post_responses_tok_oai handlers; implements handle_count_tokens() to parse requests, convert payloads (OAI/Anthropic/Responses), extract prompts, and compute input_tokens via MTMD or fallback tokenization. |
| Server endpoint registration and wiring | tools/server/server.cpp | Registers POST /chat/completions/input_tokens, POST /responses/input_tokens (and /v1 variants), wires proxy routes in router mode, and groups token-counting routes. |
| Token counting helper and documentation | tools/server/server-common.h, tools/server/server-common.cpp, tools/server/README.md | process_mtmd_prompt now takes const references and an optional is_placeholder flag; README documents the new OpenAI-compatible token-counting endpoints and example responses. |
| Unit tests for token counting endpoints | tools/server/tests/unit/test_chat_completion.py, tools/server/tests/unit/test_vision_api.py | New tests exercise /chat/completions/input_tokens for text-only and text+image payloads and assert successful responses with non-trivial token counts. |
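
For orientation, a client call to the new chat token-counting route might look like the sketch below. It assumes a server listening on `http://localhost:8080` and a JSON response carrying an integer `input_tokens` field, which is what the tests in this PR assert; `build_count_request` and `count_input_tokens` are illustrative helper names, not part of the server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed local server address


def build_count_request(messages, base_url=BASE_URL):
    """Build a POST to the chat token-counting route without sending it."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions/input_tokens",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_input_tokens(messages, base_url=BASE_URL):
    """Send the request and return the reported prompt token count."""
    req = build_count_request(messages, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["input_tokens"]


request = build_count_request([
    {"role": "system", "content": "Book"},
    {"role": "user", "content": "What is the best book"},
])
```

Building the request separately from sending it keeps the payload inspectable without a running server; `count_input_tokens` only works once one is reachable at the assumed address.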

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through pixels, hid widths and heights away,

Accessors whisper where the buffers play,
Placeholders nap until token counts call,
Routes listen closely and tests check them all,
A tiny rabbit applauds this tidy refactor day.

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is severely incomplete, providing only 'For AI review' and a link to upstream. It lacks all required template sections (Overview, Additional information details, and proper Requirements acknowledgment). | Complete the description with: (1) Overview explaining what the changes do and why, (2) clear rationale for mirroring the upstream PR, (3) explicit AI usage disclosure, (4) confirmation of contributing guidelines agreement, and (5) verification of AI content restrictions per AGENTS.md. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.30%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Xsn/mtmd placeholder chunks' is vague and uses branch naming convention rather than describing the actual change; it lacks clarity about what problem is being solved or what feature is being added. | Use a more descriptive title that clearly summarizes the main change, such as 'Add placeholder bitmap support for MTMD token counting' or 'Refactor CLIP image accessors with placeholder detection'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mtmd/clip-impl.h (1)

7-15: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make tools/mtmd/clip-impl.h self-contained.

tools/mtmd/clip-impl.h throws std::runtime_error (lines 435-437 and 544-546) but doesn’t include <stdexcept>. tools/mtmd/clip.cpp includes clip-impl.h before its own <stdexcept>, so the build currently depends on transitive include order rather than on the header itself.

Proposed fix
 #include <array>
 #include <climits>
+#include <stdexcept>
 #include <cstdarg>
 #include <cinttypes>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/clip-impl.h` around lines 7 - 15, The header
tools/mtmd/clip-impl.h is not self-contained: it throws std::runtime_error
(around lines 435-437 and 544-546) but doesn't include <stdexcept>. Add
`#include <stdexcept>` alongside the other standard headers at the top of
clip-impl.h so the throw sites compile without relying on transitive includes.
🧹 Nitpick comments (2)
tools/server/tests/unit/test_vision_api.py (1)

101-117: ⚡ Quick win

Verify image content actually affects token counting.

input_tokens > 10 is a weak proxy. Add a text-only baseline and assert the multimodal request counts more tokens.

Proposed assertion upgrade
 def test_vision_chat_completion_token_count():
     global server
     server.start()
     res = server.make_request("POST", "/chat/completions/input_tokens", data={
         "temperature": 0.0,
         "top_k": 1,
         "messages": [
             {"role": "user", "content": [
                 {"type": "text", "text": "What is this:"},
                 {"type": "image_url", "image_url": {
                     "url": get_img_url("IMG_URL_0"),
                 }},
             ]},
         ],
     })
     assert res.status_code == 200
+    assert res.body["object"] == "response.input_tokens"
     assert res.body["input_tokens"] > 10
+
+    text_only = server.make_request("POST", "/chat/completions/input_tokens", data={
+        "messages": [{"role": "user", "content": "What is this:"}],
+    })
+    assert text_only.status_code == 200
+    assert res.body["input_tokens"] > text_only.body["input_tokens"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_vision_api.py` around lines 101 - 117, The test
test_vision_chat_completion_token_count currently only asserts
res.body["input_tokens"] > 10; add a text-only baseline request using
server.make_request to the same "/chat/completions/input_tokens" endpoint with
an equivalent messages payload that contains only the text part (e.g.,
{"role":"user","content":[{"type":"text","text":"What is this:"}]}) and capture
its input_tokens, then assert the multimodal response's input_tokens is greater
than the text-only baseline (res.body["input_tokens"] >
baseline["input_tokens"]). Ensure you reuse the same request parameters
(temperature, top_k) and message ordering so the only difference is the
image_url content.
tools/server/tests/unit/test_chat_completion.py (1)

578-592: ⚡ Quick win

Strengthen token-count contract assertions.

This test currently only checks status and a loose lower bound. It should also validate the response discriminator and deterministic count across identical requests.

Proposed test hardening
 def test_chat_completions_token_count():
     global server
     server.start()
-    # make sure cache can be reused across multiple choices and multiple requests
-    # ref: https://github.com/ggml-org/llama.cpp/pull/18663
-    for _ in range(2):
+    counts = []
+    for _ in range(2):
         res = server.make_request("POST", "/chat/completions/input_tokens", data={
             "messages": [
                 {"role": "system", "content": "Book"},
                 {"role": "user", "content": "What is the best book"},
             ],
         })
         assert res.status_code == 200
-        assert res.body["input_tokens"] > 5
+        assert res.body["object"] == "response.input_tokens"
+        assert isinstance(res.body["input_tokens"], int)
+        assert res.body["input_tokens"] > 5
+        counts.append(res.body["input_tokens"])
+    assert counts[0] == counts[1]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/server/tests/unit/test_chat_completion.py` around lines 578 - 592, The
test test_chat_completions_token_count only asserts status and a loose lower
bound; update it to also assert the response discriminator and deterministic
token counts. After each server.make_request("POST",
"/chat/completions/input_tokens", ...) call, verify the discriminator field (the
proposed diff assumes res.body["object"] == "response.input_tokens"; confirm the
exact key and value against the API), keep the existing
res.body["input_tokens"] > 5 check, and record res.body["input_tokens"]. After
the second identical request, assert that the two recorded counts are equal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mtmd/clip.cpp`:
- Around line 3432-3439: The GGML_ASSERT compares a byte count against an
element count. mel_inp->get_ro_buf() returns a std::vector<float>, so
buf.size() is the number of floats, not bytes. Change the assertion to
GGML_ASSERT(n_step * n_mel == buf.size()); and keep the memcpy as-is (it should
still copy n_step * n_mel * sizeof(float) bytes into inp_raw).
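
The element-versus-byte mismatch is easy to reproduce outside C++. The sketch below uses Python's standard `array` module as a stand-in for `std::vector<float>`; the values of `n_step` and `n_mel` are made up for illustration and the variable names only echo the ones in the comment.

```python
from array import array

n_step, n_mel = 3, 4  # illustrative dimensions only
buf = array("f", [0.0] * (n_step * n_mel))  # C floats, like std::vector<float>

n_elements = len(buf)              # what buf.size() reports in C++
n_bytes = len(buf) * buf.itemsize  # what memcpy needs as its length

# The size check must compare element counts ...
elements_match = (n_step * n_mel == n_elements)
# ... while comparing a byte count against the element count fails.
bytes_match = (n_step * n_mel * buf.itemsize == n_elements)
```

Here `elements_match` holds and `bytes_match` does not, which mirrors why the assertion and the memcpy length need different units.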

In `@tools/server/server-context.cpp`:
- Around line 4838-4843: The token-counting path for /input_tokens currently
calls process_mtmd_prompt(mctx, prompt.get<std::string>(), files) which runs
full MTMD preprocessing; change this to the placeholder-mode path so counting
uses cheap placeholder chunks (e.g., call a placeholder variant such as
process_mtmd_prompt_placeholder(mctx, prompt.get<std::string>(), files) or add a
boolean flag to process_mtmd_prompt like process_mtmd_prompt(mctx,
prompt.get<std::string>(), files, /*placeholder=*/true)) so that when mctx is
non-null in the /input_tokens flow you compute n_tokens from the
placeholder-mode result instead of performing full preprocessing.

In `@tools/server/server.cpp`:
- Around line 192-193: Duplicate POST route registration for
"/responses/input_tokens" exists: locate the ctx_http.post calls that reference
routes.post_responses_tok_oai (the two entries registering
"/responses/input_tokens") and remove the redundant registration so only a
single ctx_http.post("/responses/input_tokens",
ex_wrapper(routes.post_responses_tok_oai)) remains; also scan the nearby block
(the other occurrence around the 207-211 region) to ensure no other duplicate
registrations remain and consolidate them to a single registration to avoid
route ambiguity.
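
One way to catch this class of mistake early is to make registration itself reject duplicates. The sketch below is a hypothetical Python registry, not the server's `ctx_http` API; `RouteTable`, its methods, and `count_tokens_handler` are invented for illustration.

```python
class RouteTable:
    """Hypothetical registry that refuses to register a (method, path) twice."""

    def __init__(self):
        self._handlers = {}

    def post(self, path, handler):
        key = ("POST", path)
        if key in self._handlers:
            raise ValueError(f"duplicate route: POST {path}")
        self._handlers[key] = handler

    def paths(self):
        return sorted(path for _, path in self._handlers)


def count_tokens_handler(request):
    # Stub handler; a real one would tokenize the request.
    return {"input_tokens": 0}


routes = RouteTable()
routes.post("/responses/input_tokens", count_tokens_handler)
routes.post("/v1/responses/input_tokens", count_tokens_handler)

try:
    routes.post("/responses/input_tokens", count_tokens_handler)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Failing loudly at registration time turns a silent route ambiguity into an immediate startup error.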


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abc7b4b0-c5df-4e97-862f-f86950f9c5c9

📥 Commits

Reviewing files that changed from the base of the PR and between d38d50e and 447e418.

📒 Files selected for processing (25)
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip.h
  • tools/mtmd/models/conformer.cpp
  • tools/mtmd/models/glm4v.cpp
  • tools/mtmd/models/granite-speech.cpp
  • tools/mtmd/models/kimik25.cpp
  • tools/mtmd/models/mimovl.cpp
  • tools/mtmd/models/qwen2vl.cpp
  • tools/mtmd/models/qwen3vl.cpp
  • tools/mtmd/models/whisper-enc.cpp
  • tools/mtmd/mtmd-cli.cpp
  • tools/mtmd/mtmd-helper.cpp
  • tools/mtmd/mtmd-helper.h
  • tools/mtmd/mtmd-image.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd.h
  • tools/server/README.md
  • tools/server/server-common.cpp
  • tools/server/server-common.h
  • tools/server/server-context.cpp
  • tools/server/server-context.h
  • tools/server/server.cpp
  • tools/server/tests/unit/test_chat_completion.py
  • tools/server/tests/unit/test_vision_api.py

Comment thread tools/mtmd/clip.cpp Outdated
Comment thread tools/server/server-context.cpp
Comment thread tools/server/server.cpp Outdated

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mtmd/mtmd-image.cpp (1)

1265-1270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate the placeholder flag into dst.

Line 1265 hardcodes dst as non-placeholder, but the function immediately returns on placeholder inputs. That breaks the new placeholder flow for the Step3VL preprocessing path by leaving dst shaped like a real image without populated pixels. Pass src.is_placeholder() through when sizing dst instead of hardcoding false.
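
The intended behaviour can be sketched as follows. This is a hypothetical Python model, not the code in `mtmd-image.cpp`: `Image`, `set_size`, and `resize_to` only echo the names used in this comment, `set_size` is simplified to a single placeholder flag, and the nearest-neighbour sampling is an assumption chosen to keep the example short.

```python
class Image:
    """Hypothetical image container with an explicit placeholder state."""

    def __init__(self):
        self.nx = self.ny = 0
        self.placeholder = False
        self.pixels = []

    def set_size(self, nx, ny, placeholder):
        self.nx, self.ny, self.placeholder = nx, ny, placeholder
        # Placeholders keep their shape but allocate no pixel storage.
        self.pixels = [] if placeholder else [0] * (nx * ny)

    def is_placeholder(self):
        return self.placeholder


def resize_to(src, dst, target_w, target_h):
    # Propagate the flag instead of hardcoding False.
    dst.set_size(target_w, target_h, src.is_placeholder())
    if src.is_placeholder():
        return  # nothing to sample from
    for y in range(target_h):
        for x in range(target_w):
            sx = x * src.nx // target_w  # nearest-neighbour source column
            sy = y * src.ny // target_h  # nearest-neighbour source row
            dst.pixels[y * target_w + x] = src.pixels[sy * src.nx + sx]


src = Image()
src.set_size(8, 8, placeholder=True)
dst = Image()
resize_to(src, dst, 4, 4)
```

With the flag propagated, `dst` ends up as a 4x4 placeholder with no pixel storage, so later stages can tell it apart from a real but unpopulated image.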

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mtmd/mtmd-image.cpp` around lines 1265 - 1270, The dst image is always
being created as non-placeholder by calling dst.set_size({target_width,
target_height}, false, false) even when src.is_placeholder() is true; change the
call in the function that handles Step3VL preprocessing so that the placeholder
flag is propagated (pass src.is_placeholder() as the placeholder argument to
dst.set_size) instead of hardcoding false, ensuring dst retains the placeholder
state when src.is_placeholder() returns true.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c10f5ec2-d209-4a85-886d-8d9c18d2baff

📥 Commits

Reviewing files that changed from the base of the PR and between 447e418 and 8f67dfb.

📒 Files selected for processing (4)
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/mtmd-image.cpp
  • tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • tools/mtmd/clip-impl.h
  • tools/mtmd/mtmd.cpp
