feat(llama-cpp): video input support (mtmd #24269) #10216
Merged
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… paths) Wire request->videos() into grpc-server.cpp mirroring the existing image and audio handling: a video_data build + non-template files extraction, and input_video chat chunks on the tokenizer-template path. allow_video is auto-set at model load by the vendored upstream chat_params. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the image/audio attachment path for video: emit video_url content parts, accept video/* in the picker, keep video files as base64, show a film icon badge, and render attached video inline with a <video> player. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() fclose()s the same pointer again -> heap corruption that aborts the backend on any base64 input_video request (the CLI --video file path is unaffected). Vendor a one-line fix (null sp->stdin_file after fclose) via prepare.sh's patches/ until upstream merges it. Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Owner
upstream patch: ggml-org/llama.cpp#24313
Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc handling"), which includes the same `sp->stdin_file = nullptr` guard our patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to that branch head and drop patches/0001 - it's now redundant. Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode and the model answers correctly (red clip -> "Red", blue -> "Blue"). NOTE: #24316 is not yet merged, so this pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands, otherwise `git fetch` may lose the commit after the branch is deleted. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
What
Adds video input support to the llama.cpp backend, so video-capable multimodal models (e.g. SmolVLM2-Video) can be sent a video in a chat request - mirroring how images and audio already work end to end. This is video understanding input, distinct from the existing text-to-video generation endpoint.
Tracks the upstream landing of video in mtmd: ggml-org/llama.cpp#24269 (merged `8f83d6c`).
Why
The Go core was already fully wired for video input (proto `Videos = 45`, `video_url` request parsing, `opts.Videos` forwarding, `<__media__>` marker counting), but the C++ backend never read the field and the pinned llama.cpp predated mtmd video. This closes that gap and surfaces it in the chat UI.
Changes
- `9e3b928` → `8f83d6c` (9 commits) to pick up mtmd video support. `MTMD_VIDEO` defaults `ON` upstream; it only needs `ffmpeg`/`ffprobe` on PATH, which the runtime image already ships (`Dockerfile`).
- `grpc-server.cpp`: forward `request->videos()` into the mtmd `files` vector on both request paths (template + non-template), in both the `PredictStream` and `Predict` mirror blocks:
  - non-template path: `video_data` build + base64-decode into `files`;
  - tokenizer-template path: `{"type":"input_video","input_video":{"data": ...}}` chat chunks, with videos included in the multimodal guard.
- `allow_video` is auto-set at model load by the vendored upstream `chat_params` (`mtmd_helper_support_video(mctx)`), so no manual gating is added: video is accepted only when the loaded mmproj supports it.
- UI: accept `video/*` in the picker, keep video files as base64, show a film-icon badge, render attached video inline with a `<video controls>` player, and emit `video_url` content parts.
Data flow
`UI video_url` → `request.go StringVideos` → `inference.go videos[]` → `llm.go opts.Videos` → gRPC `Videos` → `grpc-server.cpp input_video` → mtmd (ffmpeg frame extraction) → model.
Notes / scope
… `vite build`); C++ wiring is a structurally-verified mirror of the working audio path. The native backend build is left to CI.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
🤖 Generated with Claude Code
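To illustrate the non-template step listed under Changes (base64-decode each video and append it to the `files` vector), here is a minimal C++ sketch. The names `raw_buffer`, `base64_decode`, and `append_media` are hypothetical and are not taken from `grpc-server.cpp`; the real backend may well use a different decoder and buffer type.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the raw media buffers handed to the multimodal layer.
using raw_buffer = std::vector<std::uint8_t>;

// Decode standard base64 (RFC 4648 alphabet).
// Decoding stops at the first '=' padding character or any other non-alphabet byte.
raw_buffer base64_decode(const std::string & in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    raw_buffer out;
    std::uint32_t acc = 0;  // bit accumulator; only the low bits are ever read
    int bits = 0;           // number of valid, not yet emitted bits in acc
    for (unsigned char c : in) {
        const auto pos = alphabet.find(static_cast<char>(c));
        if (pos == std::string::npos) {
            break;
        }
        acc = (acc << 6) | static_cast<std::uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((acc >> bits) & 0xFF));
        }
    }
    return out;
}

// Append each decoded payload in request order (assumption: one payload per video).
void append_media(const std::vector<std::string> & b64_items,
                  std::vector<raw_buffer> & files) {
    for (const auto & item : b64_items) {
        files.push_back(base64_decode(item));
    }
}
```

The sketch preserves request order on the assumption that decoded buffers are matched positionally to the `<__media__>` markers counted in the prompt.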
Status: draft — e2e-verified, with one upstream caveat
Built the backend locally and ran a real video chat against `gemma-4-e2b-it-qat-q4_0` (mmproj, tokenizer-template path).
Found + fixed an upstream crash. llama.cpp's new video code (#24269) double-`fclose`s the ffmpeg/ffprobe stdin: `mtmd_helper_video::feed_stdin()` closes the FILE returned by `subprocess_stdin()` (which is `sp->stdin_file`), then `subprocess_destroy()` closes the same pointer again → heap corruption that aborts the backend on any base64 `input_video` (the CLI `--video <file>` path is unaffected, which is likely why it shipped). This hits upstream's own `llama-server` too. Worked around here with a vendored one-line patch (`backend/cpp/llama-cpp/patches/0001-*.patch`, applied by `prepare.sh`); an upstream PR will follow.
After the patch, video works end to end: ffmpeg extracts frames, the model sees them, and it answers correctly (solid-red clip → "Red", solid-blue → "Blue").
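The double-close and the null-after-close guard can be sketched in isolation. This is a simplified illustration, not the upstream code: `subprocess_like`, `feed_and_close`, and `destroy_like` are hypothetical stand-ins, and the sketch assumes the cleanup routine closes the stream only when the pointer is non-null.

```cpp
#include <cstdio>

// Hypothetical stand-in for the subprocess handle; only the stdin stream matters here.
struct subprocess_like {
    std::FILE * stdin_file = nullptr;
};

// Write the payload, then close the stream so the child process would see EOF.
// Resetting the member afterwards is the guard: the owner no longer holds a dangling FILE*.
bool feed_and_close(subprocess_like & sp, const char * data, std::size_t len) {
    if (sp.stdin_file == nullptr) {
        return false;
    }
    const std::size_t written = std::fwrite(data, 1, len, sp.stdin_file);
    std::fclose(sp.stdin_file);
    sp.stdin_file = nullptr;  // without this line, destroy_like() would fclose the same pointer again
    return written == len;
}

// Cleanup closes only what is still open, so calling it after feed_and_close() is safe.
void destroy_like(subprocess_like & sp) {
    if (sp.stdin_file != nullptr) {
        std::fclose(sp.stdin_file);
        sp.stdin_file = nullptr;
    }
}
```

Calling `fclose` twice on the same `FILE *` is undefined behavior in C and C++, which is consistent with the heap corruption described above.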
Known limitation (separate upstream issue, not blocking): within a single server process, repeated video requests that share an identical text prompt can reuse the first video's frames. This is a prompt/KV-cache collision: image bitmaps carry an fnv-hash id so the cache distinguishes them, but the lazy video's expanded frames don't. Workaround: use distinct prompts or disable the prompt cache. See ggml-org/llama.cpp#24303.
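A toy model of that collision, under stated assumptions: `cache_key` is a hypothetical key builder, not the server's actual cache logic, and FNV-1a (64-bit) is used only for illustration since the exact hash variant upstream is not stated here. The point is that a key built from the prompt text alone cannot tell two videos apart, while a key that includes a content id can.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over a byte buffer.
std::uint64_t fnv1a64(const std::vector<std::uint8_t> & bytes) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (std::uint8_t b : bytes) {
        h ^= b;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Hypothetical cache key: the text prompt plus one entry per media item.
// With with_ids == false every media item looks identical to the cache.
std::string cache_key(const std::string & prompt,
                      const std::vector<std::vector<std::uint8_t>> & media,
                      bool with_ids) {
    std::string key = prompt;
    for (const auto & m : media) {
        key += "|media";
        if (with_ids) {
            key += ":" + std::to_string(fnv1a64(m));
        }
    }
    return key;
}
```

With `with_ids` set to `false`, two requests that share a prompt but carry different frames produce the same key, which mirrors the frame reuse described above.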
Marking draft until the upstream crash fix is merged (or we're comfortable shipping with the vendored patch).
Update: dropped the vendored patch, re-pinned to upstream #24316
Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor, ggml-org/llama.cpp#24316 ("mtmd: refactor video subproc handling"), which contains the same `sp->stdin_file = nullptr` guard the vendored patch added, plus join-before-destroy ordering. So `LLAMA_VERSION` is now re-pinned to that change and `patches/0001-*` is removed.
Re-verified e2e (no patch): no crash, red clip → "Red", blue → "Blue".
Caveat: #24316 isn't merged yet, so this currently pins to its branch-head commit (`28ca1e60`). Re-pin to the squash-merge commit on `master` once it lands, or `git fetch` may lose the commit after the branch is deleted. The secondary prompt-cache frame-reuse note above is unaffected by #24316 (still open as a separate item).