v0.4.1 — VLM UI + Persistence + HTTP by magicnight · Pull Request #35 · magicnight/Mac-MLX

magicnight · 2026-05-10T06:08:31Z

Summary

Third and final v0.4.1 PR — lights up the user-facing surfaces for vision-language models. Closes the v0.4.1 rollout begun in #33 (Foundation) and #34 (Engine).

Plan: `docs/superpowers/plans/2026-05-10-v0.4.1-vlm.md`.

What lands

Chat input — image picker + thumbnail strip

Paperclip button in `ChatInputView` opens SwiftUI `.fileImporter` (image UTTypes only: jpeg / png / webp / gif / heic / bmp).
Horizontal thumbnail strip above the text field; click × on any thumbnail to drop it.
Disabled (with explanatory tooltip) when the loaded model isn't a VLM.
Image-only messages on a VLM are now legitimate sends.
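
The gating rules above can be sketched as a small value type. This is a minimal illustration, not the project's code: `ChatInputState` and its fields are hypothetical stand-ins for whatever `ChatInputView` and `ChatViewModel` actually hold.

```swift
import Foundation

/// Hypothetical snapshot of the input bar; not the project's real type.
struct ChatInputState {
    var text: String
    var attachedImages: [URL]
    var modelIsVLM: Bool

    /// The paperclip is only usable when a vision-language model is loaded.
    var canAttachImages: Bool { modelIsVLM }

    /// Send is allowed for non-empty text, or for staged images on a VLM.
    var canSend: Bool {
        let hasText = !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty
        let hasImages = canAttachImages && !attachedImages.isEmpty
        return hasText || hasImages
    }
}
```

Keeping the rule in one computed property means the send button and the keyboard shortcut can share the same check.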

Chat bubbles — inline thumbnails

`ChatMessageView` renders a 96pt `LazyVGrid` of attached images above the text bubble.
Click a thumbnail → open in Preview via `NSWorkspace`.

Conversation persistence

`StoredMessage.images: [ImageAttachment]` round-trips through `ConversationStore` (decoder defaults to empty when key absent — pre-v0.4.1 chats unchanged).
`save(_:)` internalises external image URLs into `<directory>/<conv-uuid>/images/<image-uuid>.<ext>` so chats survive the user moving the picked file.
`delete(id:)` tears down the per-conversation directory.
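
For the backward-compatible decode, one plausible shape is a custom `init(from:)` that uses `decodeIfPresent`. The types below are simplified, hypothetical stand-ins (the real `StoredMessage` and `ImageAttachment` likely carry more fields), and `decodeStoredMessage` is a helper invented for the example.

```swift
import Foundation

// Hypothetical, simplified stand-ins for the project's types.
struct ImageAttachment: Codable {
    var url: URL
}

struct StoredMessage: Codable {
    var role: String
    var content: String
    var images: [ImageAttachment]

    enum CodingKeys: String, CodingKey { case role, content, images }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        role = try container.decode(String.self, forKey: .role)
        content = try container.decode(String.self, forKey: .content)
        // An absent key (pre-v0.4.1 JSON) decodes as "no attachments".
        images = try container.decodeIfPresent([ImageAttachment].self, forKey: .images) ?? []
    }
}

/// Illustrative helper: decodes one stored message from raw JSON.
func decodeStoredMessage(_ json: String) throws -> StoredMessage {
    try JSONDecoder().decode(StoredMessage.self, from: Data(json.utf8))
}
```

Only decoding is customised here; encoding is still synthesised, so new saves always write the `images` key.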

OpenAI multimodal HTTP

`/v1/chat/completions` accepts OpenAI's `content` array shape:
```json
{"role":"user","content":[
{"type":"text","text":"What's this?"},
{"type":"image_url","image_url":{"url":"data:image/png;base64,…"}}
]}
```
Plain-string `content` continues to work — decoder tries string first, falls through to `[Part]`.
base64 data URLs → tmpfile-backed `ImageAttachment`. Caps: 10 MB / image, 4 images / message; `http(s)://` and `file://` URLs are not fetched.
Ollama `/api/chat` / `/api/generate` stay text-only: Ollama uses a separate top-level `images: [base64]` field, which is left for a follow-up.
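
The string-first, array-second decoding described above might look roughly like the following. This is a sketch under assumptions: the case names, the nested `Part` and `ImageURL` types, the `imageURLs` accessor, and the choice to join text parts with no separator are all illustrative, not taken from the project.

```swift
import Foundation

// Hypothetical sketch; the project's real MultimodalContent may differ.
enum MultimodalContent: Decodable {
    struct ImageURL: Decodable { let url: String }

    struct Part: Decodable {
        let type: String
        let text: String?
        let imageURL: ImageURL?

        enum CodingKeys: String, CodingKey {
            case type, text
            case imageURL = "image_url"
        }
    }

    case string(String)
    case parts([Part])

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        // Try the legacy plain-string form first, then the parts array.
        if let plain = try? container.decode(String.self) {
            self = .string(plain)
        } else {
            self = .parts(try container.decode([Part].self))
        }
    }

    /// Text of the message; parts are joined with no separator (an assumption).
    var text: String {
        switch self {
        case .string(let plain): return plain
        case .parts(let parts):
            return parts.filter { $0.type == "text" }.compactMap(\.text).joined()
        }
    }

    /// Raw URL strings of all image parts, in order.
    var imageURLs: [String] {
        guard case .parts(let parts) = self else { return [] }
        return parts.filter { $0.type == "image_url" }.compactMap { $0.imageURL?.url }
    }
}

struct Message: Decodable {
    let role: String
    let content: MultimodalContent
}
```

If neither form matches, the second `decode` throws, so a malformed `content` surfaces as an ordinary decoding error.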

Test plan

`swift test --package-path MacMLXCore` — 115/115 green (incl. 4 new `ConversationStoreImagesTests`)
Local `xcodebuild macMLX` — green
Existing `chatCompletionsNonStreaming` test confirms string-form fallback path on the multimodal decoder
Manual smoke (post-merge, on hardware):
- `macmlx pull mlx-community/SmolVLM-Instruct-4bit`
- Load it in the GUI, attach a JPEG, send "what is this?", get a coherent response.
- `curl localhost:8000/v1/chat/completions -d '{ "model": "SmolVLM-Instruct-4bit", "messages": [{"role":"user","content":[{"type":"text","text":"describe"},{"type":"image_url","image_url":{"url":"data:image/png;base64,…"}}]}]}'` returns a sensible reply.

Stack

Builds on `main` after:

- #33: VLM Foundation (data model + library detection)
- #34: VLM Engine (MLXSwiftEngine VLM branch via `MLXVLM.VLMModelFactory`)

After this PR merges, v0.4.1 is feature-complete and a tag is reasonable.

🤖 Generated with Claude Code

- `StoredMessage.images: [ImageAttachment]` mirrors `ChatMessage.images` added in PR #33. Custom decoder defaults to empty when the key is absent, so pre-v0.4.1 conversation JSON loads unchanged.
- `save(_:)` internalises image URLs: any attachment outside the conversation's own images dir gets copied to `<directory>/<conv-uuid>/images/<image-uuid>.<ext>`, then the URL is rewritten to point there. Best-effort: a copy failure logs to stderr and falls through with the original URL preserved.
- `delete(id:)` tears down both the JSON sidecar and the per-conversation directory (recursive remove). Pre-v0.4.1 conversations with no per-conversation directory no-op cleanly.
- 4 new tests in `ConversationStoreImagesTests` (`.serialized` for tmpdir safety): external-image copy, idempotent internal-URL preservation, delete-tears-down-conv-dir, legacy-JSON-decode.

115/115 Core green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
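
The copy-then-rewrite step in `save(_:)` could be approximated as below. This is a hedged sketch: `internalise(_:conversationID:root:fileManager:)` is a hypothetical free function written for illustration, and the real `ConversationStore` may structure this differently.

```swift
import Foundation

/// Copies `source` into `<root>/<conversationID>/images/` unless it already
/// lives there, and returns the URL the stored message should reference.
/// Hypothetical helper; not the project's actual API.
func internalise(_ source: URL, conversationID: UUID, root: URL,
                 fileManager: FileManager = .default) -> URL {
    let imagesDir = root
        .appendingPathComponent(conversationID.uuidString)
        .appendingPathComponent("images")

    // Already internal: leave untouched so repeated saves are idempotent.
    let parent = source.deletingLastPathComponent().standardizedFileURL.path
    if parent == imagesDir.standardizedFileURL.path {
        return source
    }

    // Fresh UUID name, keeping the original extension when there is one.
    let ext = source.pathExtension
    let name = ext.isEmpty ? UUID().uuidString : "\(UUID().uuidString).\(ext)"
    let destination = imagesDir.appendingPathComponent(name)

    do {
        try fileManager.createDirectory(at: imagesDir, withIntermediateDirectories: true)
        try fileManager.copyItem(at: source, to: destination)
        return destination
    } catch {
        // Best-effort: report on stderr and keep the original URL.
        FileHandle.standardError.write(Data("image copy failed: \(error)\n".utf8))
        return source
    }
}
```

Returning the original URL on failure mirrors the "falls through" behaviour in the commit message: the chat still saves, it just keeps pointing at the external file.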

- `UIChatMessage` gains `images: [ImageAttachment]`, mirroring `StoredMessage` + Core `ChatMessage`. Hydrated from stored messages on conversation reload; stripped to `images: []` on send when the loaded model isn't a VLM.
- `ChatViewModel.attachedImages` staging bag; `canAttachImages` / `attachImage(at:)` / `removeAttachedImage(at:)` / `clearAttachedImages()` helpers wired to the input view and to the model-modality gate (`ChatViewModel.canAttachImages` is `coordinator.currentModel.format == .mlxVLM`). `send()` picks the bag up and clears it. `generate()`'s `ChatMessage` map now passes images through to Core / engine.
- `ChatInputView` gets a horizontal thumbnail strip above the text field, a paperclip button driving SwiftUI `.fileImporter` (image UTTypes only: png/jpeg/webp/gif/heic/bmp), and an enabled-state gate with explanatory tooltip when the loaded model is text-only. Send button now also enables when the user has staged images but no text (image-only ask is legitimate on a VLM).
- `ChatMessageView` renders an inline `LazyVGrid` of 96pt thumbnails above the bubble for any message that has attachments. Click a thumbnail to open the file in Preview via `NSWorkspace.shared.open`.
- `AsyncThumbnailImage` helper (`NSImage`-backed) lives next to `ChatInputView` and is reused by `ChatMessageView`.

Local Xcode App Build green; 115/115 Core tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`ChatCompletionRequest.Message.content` now decodes either:

- a plain string (every existing client)
- or an OpenAI multimodal array of `{type, text|image_url}` parts

Implementation lives in a new `MultimodalContent` enum that tries `String` first and falls through to `[Part]`, so legacy callers keep working unchanged.

`handleChatCompletions` extracts text via `.content.text` (concatenated text parts) and images via `.content.extractImages()`:

- `data:<mime>;base64,<bytes>` URLs decode to a tmpfile-backed `ImageAttachment` (jpeg/png/webp/gif/heic/bmp). Caps: 4 images per message, 10 MB per image. Oversized / unknown-MIME parts silently drop.
- `http(s)://` and `file://` are not fetched (defence-in-depth, even though the server is localhost-bound).

Decoded `ImageAttachment`s flow through `ChatMessage.images` → engine (VLM model receives them; LLM model logs + drops, per PR #34).

Ollama `/api/chat` / `/api/generate` stays text-only: Ollama uses a separate top-level field that's a follow-up.

115/115 Core tests still green; existing `chatCompletionsNonStreaming` proves the string-form fallback path still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
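
To make the data-URL handling and its caps concrete, here is a rough sketch. Everything in it is an assumption layered on the description above: `decodeDataURL` and the free function `extractImages(from:)` are hypothetical, the MIME-to-extension table is a guess, "10 MB" is read as 10 MiB, and images beyond the fourth are assumed to be dropped rather than rejected.

```swift
import Foundation

// Limits quoted in the PR text.
let maxImagesPerMessage = 4
let maxImageBytes = 10 * 1024 * 1024   // assumes "10 MB" means 10 MiB

// MIME type -> file extension; the exact table is an assumption.
let allowedImageTypes = [
    "image/jpeg": "jpg", "image/png": "png", "image/webp": "webp",
    "image/gif": "gif", "image/heic": "heic", "image/bmp": "bmp",
]

/// Decodes one `data:<mime>;base64,<bytes>` URL; returns nil for anything else,
/// including http(s):// and file:// URLs, which are never fetched.
func decodeDataURL(_ string: String) -> (ext: String, bytes: Data)? {
    guard string.hasPrefix("data:"),
          let comma = string.firstIndex(of: ",") else { return nil }
    // Header is the "<mime>;base64" portion between "data:" and the comma.
    let header = string[string.index(string.startIndex, offsetBy: 5)..<comma]
    guard header.hasSuffix(";base64") else { return nil }
    let mime = header.dropLast(";base64".count).lowercased()
    guard let ext = allowedImageTypes[mime],
          let bytes = Data(base64Encoded: String(string[string.index(after: comma)...])),
          bytes.count <= maxImageBytes else { return nil }
    return (ext, bytes)
}

/// Writes up to `maxImagesPerMessage` valid images to temp files;
/// invalid, oversized, or surplus entries are dropped silently.
func extractImages(from urls: [String]) -> [URL] {
    var result: [URL] = []
    for url in urls {
        guard result.count < maxImagesPerMessage,
              let image = decodeDataURL(url) else { continue }
        let file = FileManager.default.temporaryDirectory
            .appendingPathComponent("\(UUID().uuidString).\(image.ext)")
        if (try? image.bytes.write(to: file)) != nil { result.append(file) }
    }
    return result
}
```

Because non-`data:` schemes fail the very first guard, no network or arbitrary file access can happen on this path.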

Closes the v0.4.1 vision-language model rollout: PR #33 Foundation + PR #34 Engine + this PR's UI / persistence / HTTP triple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

magicnight and others added 4 commits May 10, 2026 13:01

docs: v0.4.1 part-3 changelog entry (UI + persistence + HTTP)

dfdadcf


magicnight merged commit 79a5f62 into main May 10, 2026
2 checks passed

magicnight deleted the feat/v0.4.1-vlm-ui branch May 10, 2026 06:16

magicnight mentioned this pull request May 10, 2026

v0.5 — LoRA Foundation (LocalAdapter + AdapterStore) #36

Merged

3 tasks

