feat(llama-cpp): video input support (mtmd #24269) by localai-bot · Pull Request #10216 · mudler/LocalAI

localai-bot · 2026-06-08T15:15:09Z

What

Adds video input support to the llama.cpp backend, so video-capable multimodal models (e.g. SmolVLM2-Video) can be sent a video in a chat request - mirroring how images and audio already work end to end. This is video understanding input, distinct from the existing text-to-video generation endpoint.

Tracks the upstream landing of video in mtmd: ggml-org/llama.cpp#24269 (merged 8f83d6c).

Why

The Go core was already fully wired for video input (proto Videos = 45, video_url request parsing, opts.Videos forwarding, <__media__> marker counting), but the C++ backend never read the field and the pinned llama.cpp predated mtmd video. This closes that gap and surfaces it in the chat UI.

Changes

- Bump llama.cpp 9e3b928 → 8f83d6c (9 commits) to pick up mtmd video support. MTMD_VIDEO defaults ON upstream; it only needs ffmpeg/ffprobe on PATH, which the runtime image already ships (Dockerfile).
- grpc-server.cpp: forward request->videos() into the mtmd files vector on both request paths (template + non-template), in both the PredictStream and Predict mirror blocks:
  - non-template: a video_data build + base64-decode into files;
  - template: emit {"type":"input_video","input_video":{"data": ...}} chat chunks and include videos in the multimodal guard.
  - allow_video is auto-set at model load by the vendored upstream chat_params (mtmd_helper_support_video(mctx)), so no manual gating is added - video is accepted only when the loaded mmproj supports it.
- React chat UI: accept video/*, keep video files as base64, show a film-icon badge, render attached video inline with a <video controls> player, and emit video_url content parts.
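
For readers unfamiliar with the non-template path, here is a minimal sketch of the "base64-decode into files" step. It is not the code in grpc-server.cpp: `decode_base64` and `append_videos` are hypothetical names, and the `files` vector is modelled as a plain vector of byte buffers rather than the backend's real type.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the backend's actual decoder): decodes standard
// base64 (RFC 4648 alphabet, '=' padding) into raw bytes.
// Returns false if a character outside the alphabet is found.
bool decode_base64(const std::string & in, std::vector<std::uint8_t> & out) {
    auto value_of = [](char c) -> int {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    };

    out.clear();
    std::uint32_t acc = 0; // bit accumulator (older bits wrap away harmlessly)
    int bits = 0;          // number of not-yet-emitted bits in acc
    for (char c : in) {
        if (c == '=') break; // padding marks the end of the payload
        const int v = value_of(c);
        if (v < 0) return false;
        acc = (acc << 6) | static_cast<std::uint32_t>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((acc >> bits) & 0xFF));
        }
    }
    return true;
}

// Assumed shape of the forwarding step: each base64 video payload is decoded
// and appended to the same buffer list that images and audio already use.
bool append_videos(const std::vector<std::string> & videos_b64,
                   std::vector<std::vector<std::uint8_t>> & files) {
    for (const auto & v : videos_b64) {
        std::vector<std::uint8_t> raw;
        if (!decode_base64(v, raw)) return false;
        files.push_back(std::move(raw));
    }
    return true;
}
```

The sketch stops at the first invalid payload instead of skipping it; whether the real backend rejects or ignores a malformed video is not stated in this PR.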

Data flow

UI video_url → request.go StringVideos → inference.go videos[] → llm.go opts.Videos → gRPC Videos → grpc-server.cpp input_video → mtmd (ffmpeg frame extraction) → model.

Notes / scope

No Docker change (ffmpeg already present). No proto change (field pre-existed). No Go core change (already wired).
Verification: React UI builds clean (vite build); C++ wiring is a structurally-verified mirror of the working audio path. The native backend build is left to CI.
Out of scope (follow-ons): adding a video-capable GGUF to the gallery + a full e2e run.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

🤖 Generated with Claude Code

Status: draft — e2e-verified, with one upstream caveat

Built the backend locally and ran a real video chat against gemma-4-e2b-it-qat-q4_0 (mmproj, tokenizer-template path).

Found + fixed an upstream crash. llama.cpp's new video code (#24269) double-fcloses the ffmpeg/ffprobe stdin: mtmd_helper_video::feed_stdin() closes the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() closes the same pointer again → heap corruption that aborts the backend on any base64 input_video (the CLI --video <file> path is unaffected, which is likely why it shipped). This hits upstream's own llama-server too. Worked around here with a vendored one-line patch (backend/cpp/llama-cpp/patches/0001-*.patch, applied by prepare.sh); an upstream PR will follow.
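
The ownership bug and the one-line guard can be illustrated in isolation. This is a sketch under stated assumptions, not the upstream code: `fake_subprocess`, `feed_and_close`, and `destroy` are hypothetical stand-ins, and only the `stdin_file` field name is taken from the description above.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the subprocess handle. Only the stdin_file
// field name comes from the PR text; everything else is illustrative.
struct fake_subprocess {
    std::FILE * stdin_file = nullptr;
};

// Writes the payload, closes the stream, and clears the pointer so that
// no later cleanup can close the same FILE a second time.
bool feed_and_close(fake_subprocess & sp, const char * data, std::size_t len) {
    if (sp.stdin_file == nullptr) return false;
    const bool ok = std::fwrite(data, 1, len, sp.stdin_file) == len;
    std::fclose(sp.stdin_file);
    sp.stdin_file = nullptr; // the guard: without this line, destroy() closes again
    return ok;
}

// Cleanup that only closes streams it still owns.
// Returns how many fclose calls it performed (0 or 1).
int destroy(fake_subprocess & sp) {
    int closed = 0;
    if (sp.stdin_file != nullptr) {
        std::fclose(sp.stdin_file);
        sp.stdin_file = nullptr;
        closed = 1;
    }
    return closed;
}
```

Calling `fclose` twice on the same `FILE *` is undefined behaviour in C and C++, which is consistent with the heap corruption reported above. Nulling the pointer at the point where ownership ends makes the second close a no-op.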

After the patch: video works end to end — ffmpeg extracts frames, the model sees them, and answers correctly (solid-red clip → "Red", solid-blue → "Blue").

Known limitation (separate upstream issue, not blocking): within a single server process, repeated video requests that share an identical text prompt can reuse the first video's frames (prompt/KV-cache collision — image bitmaps carry an fnv-hash id so the cache distinguishes them, but the lazy video's expanded frames don't). Workaround: distinct prompts or disabling the prompt cache. ggml-org/llama.cpp#24303
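
To make the collision concrete, the sketch below shows how a content hash keeps two requests with the same prompt apart. It assumes 64-bit FNV-1a; the exact hash variant and key layout used upstream are not confirmed here, and `cache_key` is a hypothetical function, not an mtmd API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over a byte buffer (assumed variant, see note above).
std::uint64_t fnv1a_64(const std::uint8_t * data, std::size_t len) {
    std::uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;               // FNV prime
    }
    return h;
}

// Hypothetical cache key: the prompt text followed by one content id per
// media item. If the ids are omitted (as described for expanded video
// frames), two requests with the same prompt produce the same key.
std::string cache_key(const std::string & prompt,
                      const std::vector<std::vector<std::uint8_t>> & media) {
    std::string key = prompt;
    for (const auto & m : media) {
        key += '#';
        key += std::to_string(fnv1a_64(m.data(), m.size()));
    }
    return key;
}
```

With ids attached, "What colour is this?" plus a red clip and the same prompt plus a blue clip map to different keys; without them, both collapse to the prompt alone, which matches the frame-reuse behaviour described above.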

Marking draft until the upstream crash fix is merged (or we're comfortable shipping with the vendored patch).

Update: dropped the vendored patch, re-pinned to upstream #24316

Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor — ggml-org/llama.cpp#24316 ("mtmd: refactor video subproc handling") — which contains the same sp->stdin_file = nullptr guard the vendored patch added, plus join-before-destroy ordering. So LLAMA_VERSION is now re-pinned to that change and patches/0001-* is removed.

Re-verified e2e (no patch): no crash, red clip → "Red", blue → "Blue".

Caveat: #24316 isn't merged yet, so this currently pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands; otherwise git fetch may no longer be able to retrieve the pinned commit after the branch is deleted. The secondary prompt-cache frame-reuse note above is unaffected by #24316 (still open as a separate item).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… paths) Wire request->videos() into grpc-server.cpp mirroring the existing image and audio handling: a video_data build + non-template files extraction, and input_video chat chunks on the tokenizer-template path. allow_video is auto-set at model load by the vendored upstream chat_params. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Mirror the image/audio attachment path for video: emit video_url content parts, accept video/* in the picker, keep video files as base64, show a film icon badge, and render attached video inline with a <video> player. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() fclose()s the same pointer again -> heap corruption that aborts the backend on any base64 input_video request (the CLI --video file path is unaffected). Vendor a one-line fix (null sp->stdin_file after fclose) via prepare.sh's patches/ until upstream merges it. Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler · 2026-06-08T16:46:44Z

upstream patch: ggml-org/llama.cpp#24313

Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc handling"), which includes the same `sp->stdin_file = nullptr` guard our patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to that branch head and drop patches/0001 - it's now redundant. Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode and the model answers correctly (red clip -> "Red", blue -> "Blue"). NOTE: #24316 is not yet merged, so this pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands, otherwise `git fetch` may lose the commit after the branch is deleted. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 4 commits June 8, 2026 15:02

chore(llama-cpp): bump to 8f83d6c for mtmd video input support

37158c2

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot marked this pull request as draft June 8, 2026 16:30

mudler marked this pull request as ready for review June 8, 2026 16:47

mudler mentioned this pull request Jun 8, 2026

mtmd: fix double-close of ffmpeg/ffprobe stdin in video helper ggml-org/llama.cpp#24313

Closed

mudler merged commit 9323f4b into master Jun 8, 2026
67 of 68 checks passed

mudler deleted the feat/llama-cpp-video-input branch June 8, 2026 21:18

mudler merged 5 commits into master from feat/llama-cpp-video-input
