feat: pluggable image/text encoder layer with OpenCLIP and SigLIP backends (#226)
Merged
Conversation
Introduce an ImageTextEncoder abstraction with three backends (OpenAI CLIP, OpenCLIP, SigLIP/SigLIP 2) selectable per-album via a new Album.encoder_spec field. Default remains openai-clip:ViT-B/32 so existing albums and .npz caches behave identically. The .npz cache is now stamped with the model_id and embedding_dim it was built with; loaders raise EmbeddingCacheMismatch when the active encoder disagrees, instead of silently returning garbage similarity scores. Old caches without the field default to the legacy spec. OpenCLIP and SigLIP dependencies are gated behind optional extras (open-clip, siglip) to keep the base install lean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
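A minimal sketch of the stamp check described above, assuming the spec is stored under a ``model_id`` key in the ``.npz`` (the loader's real shape may differ):

```python
import numpy as np

LEGACY_SPEC = "openai-clip:ViT-B/32"  # caches without a stamp predate this PR


class EmbeddingCacheMismatch(Exception):
    """Raised when a .npz cache was built with a different encoder."""


def check_cache_compatibility(npz_path: str, active_spec: str) -> None:
    """Refuse to serve embeddings built with a different encoder."""
    with np.load(npz_path, allow_pickle=False) as cache:
        stored = str(cache["model_id"]) if "model_id" in cache else LEGACY_SPEC
    if stored != active_spec:
        raise EmbeddingCacheMismatch(
            f"index built with {stored!r}, album configured for {active_spec!r}"
        )
```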
build_encoder() forwarded download_root unconditionally, which only OpenAIClipEncoder accepts; SigLIP and OpenCLIP raised TypeError as soon as Embeddings._build_encoder() invoked them. Replace the **kwargs passthrough with an explicit cache_dir parameter that the factory maps to each backend's native option (download_root for OpenAI CLIP, cache_dir for OpenCLIP, ignored for SigLIP since transformers uses HF_HOME). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
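A sketch of the factory mapping; the three encoder class names are the PR's, but their constructors here are stand-in stubs that only show which cache option each backend accepts:

```python
class OpenAIClipEncoder:
    # openai/CLIP's clip.load() takes download_root
    def __init__(self, model_id: str, download_root: str | None = None): ...


class OpenClipEncoder:
    # open_clip's model factory takes cache_dir
    def __init__(self, model_id: str, cache_dir: str | None = None): ...


class SiglipEncoder:
    # no cache argument: transformers resolves its cache from HF_HOME
    def __init__(self, model_id: str): ...


def build_encoder(spec: str, cache_dir: str | None = None):
    backend, _, model_id = spec.partition(":")
    if backend == "openai-clip":
        return OpenAIClipEncoder(model_id, download_root=cache_dir)
    if backend == "open-clip":
        return OpenClipEncoder(model_id, cache_dir=cache_dir)
    if backend == "siglip":
        return SiglipEncoder(model_id)
    raise ValueError(f"unknown encoder backend: {backend!r}")
```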
Avoids the `operator torchvision::nms does not exist` ABI mismatch when torch and torchvision versions drift apart at install time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, changing an album's encoder_spec while a stale .npz still
existed surfaced an EmbeddingCacheMismatch and forced the user to
manually delete the index file. The reindex background task now peeks
at the existing index's stored model_id, and on mismatch it logs a
warning, surfaces a status line through progress_tracker so the UI
shows what's happening, deletes the stale .npz, and falls through to
a fresh create_index_async.
Embeddings.update_index{,_async}() keep their strict mismatch behavior
so CLI tools and library callers still fail loud.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
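A sketch of the tolerant reindex path, assuming illustrative names for the album attribute (``index_path``) and the progress API (``set_status``); ``peek_encoder_spec`` and the ``create_index_async`` / ``update_index_async`` pair are the commit's own:

```python
import logging
import os

logger = logging.getLogger(__name__)


async def reindex_album(album, embeddings, progress_tracker):
    if os.path.exists(album.index_path):
        stored_spec = embeddings.peek_encoder_spec(album.index_path)
        if stored_spec != album.encoder_spec:
            logger.warning(
                "index %s built with %s, album now uses %s; rebuilding",
                album.index_path, stored_spec, album.encoder_spec,
            )
            progress_tracker.set_status(
                f"Encoder changed ({stored_spec} -> {album.encoder_spec}); "
                "rebuilding index from scratch"
            )
            os.remove(album.index_path)  # drop the stale cache
            await embeddings.create_index_async(album)
            return
    # specs agree (or no index exists yet): the strict update path is safe
    await embeddings.update_index_async(album)
```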
In transformers >= 5, ``Siglip{,2}Model.get_image_features`` and
``get_text_features`` return ``BaseModelOutputWithPooling`` instead of
a bare tensor. Add ``_unwrap_pooled`` to accept either shape so the
SigLIP encoder works on old and new transformers without pinning a
specific version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
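A sketch of the shim, assuming the wrapped output exposes the pooled embedding as ``pooler_output``:

```python
import torch


def _unwrap_pooled(out) -> torch.Tensor:
    """Accept a bare tensor (older transformers) or a pooling output
    object (newer transformers) and return the pooled embedding."""
    if isinstance(out, torch.Tensor):
        return out
    return out.pooler_output
```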
Production indexing was CPU-bound per-image (PIL decode + EXIF + metadata extraction) with the GPU mostly idle waiting for the next image. Two changes:

1. Batched encoding: ``_process_images_batch`` buffers up to ``batch_size`` decoded images and calls ``encoder.encode_images(buf)`` once per batch, amortizing per-call overhead.
2. Parallel CPU loaders: a bounded ThreadPoolExecutor of ``num_workers`` threads decodes images concurrently while the main thread feeds batches to the GPU. A sliding window of ``num_workers * 2 + batch_size`` futures keeps memory in check on large collections.

Defaults (``DEFAULT_BATCH_SIZE=8``, ``DEFAULT_NUM_WORKERS=4``) chosen from benchmark_encoders.py. On the test corpus they yield ~3x speedup for CLIP and ~1.8x for the larger encoders. Workers >4 regress due to GIL contention.

The async ``_process_images_batch_async`` collapses to a single ``asyncio.to_thread`` delegation: both sync and async paths now share the parallel pipeline, and the FastAPI event loop stays free. Per-image error handling, progress callbacks, and result ordering are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
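A simplified sketch of the pipeline; ``decode_image`` stands in for the real per-image CPU work (decode + EXIF + metadata), and per-image error handling plus progress callbacks are elided:

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_BATCH_SIZE = 8
DEFAULT_NUM_WORKERS = 4


def process_images_batch(paths, encoder, decode_image,
                         batch_size=DEFAULT_BATCH_SIZE,
                         num_workers=DEFAULT_NUM_WORKERS):
    results, buf = [], []
    window = num_workers * 2 + batch_size  # bound on in-flight decodes

    def flush():
        # one encoder call per batch amortizes per-call overhead
        embeddings = encoder.encode_images([img for _, img in buf])
        results.extend(zip((p for p, _ in buf), embeddings))
        buf.clear()

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = []  # futures kept in submission order => stable output order
        for path in paths:
            pending.append((path, pool.submit(decode_image, path)))
            if len(pending) >= window:
                p, fut = pending.pop(0)
                buf.append((p, fut.result()))  # blocks until that decode lands
                if len(buf) >= batch_size:
                    flush()
        for p, fut in pending:  # drain the tail of the window
            buf.append((p, fut.result()))
            if len(buf) >= batch_size:
                flush()
        if buf:
            flush()
    return results
```

The async variant then reduces to ``await asyncio.to_thread(process_images_batch, ...)``, which is what keeps the FastAPI event loop free.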
Standalone CLI (``python benchmark_encoders.py <image_dir>``) that sweeps over encoder specs, batch sizes, and worker counts and reports end-to-end indexing time + throughput. Used to derive the production defaults (batch=8, workers=4) committed alongside. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bounded sliding window in ``_process_images_batch`` is responsible for re-aligning futures with their submission order before flushing each batch to the encoder. This test swaps in a deterministic stub encoder, replicates the bundled test images so multiple batches are processed, and asserts that workers=4 produces byte-identical output to workers=1 — same filenames in the same order, same embeddings, same modtimes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
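A sketch of that determinism check, reusing the pipeline sketch above (the stub encoder and decode step are illustrative):

```python
class StubEncoder:
    def encode_images(self, images):
        return list(images)  # identity "embeddings": deterministic by construction


def test_workers_do_not_change_output(image_paths):
    def decode(path):  # stand-in for the real decode + EXIF + metadata step
        with open(path, "rb") as f:
            return f.read()

    serial = process_images_batch(image_paths, StubEncoder(), decode,
                                  num_workers=1)
    parallel = process_images_batch(image_paths, StubEncoder(), decode,
                                    num_workers=4)
    assert parallel == serial  # same files, same order, same embeddings
```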
Surfaces the per-album ``encoder_spec`` setting in the UI. Both the "Add Album" and "Edit Album" forms now expose a select with the three bundled encoder options; specs not in the list (e.g. set via config.yaml) are appended as "(custom)" so the dropdown reflects reality. A short hint warns users that switching the encoder will trigger a from-scratch rebuild on the next index update, which the auto-rebuild flow added earlier in this PR handles gracefully. Backend: ``create_album()`` now accepts an optional ``encoder_spec`` and the ``update_album`` route forwards it from the request payload, so edits round-trip through ``/update_album/`` correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pluggable encoder layer now ships with three first-class backends (OpenAI CLIP, OpenCLIP, SigLIP). Hiding two of them behind optional ``[open-clip]`` / ``[siglip]`` extras meant the album-manager dropdown could surface choices that ImportError on selection. Move ``open_clip_torch`` and ``transformers`` into the main dependency list, drop the now-unused extras, and trim the install hints from the encoder ImportError messages — they referenced commands that no longer apply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /available_albums/ route was hand-constructing each album dict and forgot to include ``encoder_spec``. The album-manager loads its cards from this endpoint and passes the loaded album object directly into editAlbum(), which then populated the encoder dropdown from the missing field. The dropdown therefore always fell back to the default (openai-clip) regardless of what the album was actually configured — or last indexed — with. Surface ``encoder_spec`` in the listing response and add a regression test covering POST /add_album → GET /available_albums/ → POST /update_album round-trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues users hit when searching with the SigLIP backend:

1. Text searches returned zero hits at the default 0.2 threshold even for queries that should match plenty of images. SigLIP's training objective produces image-text cosines around 0.05-0.15 for matching pairs — a CLIP-tuned threshold like 0.2 filters out almost every true match. The model exposes ``logit_scale`` and ``logit_bias`` precisely to recover calibrated probabilities via ``sigmoid(cos * exp(scale) + bias)``. Add a per-encoder ``calibrate_similarity()`` hook (identity for CLIP/OpenCLIP, sigmoid transform for SigLIP) and apply it in ``search_images_by_text_and_image`` before threshold comparison.
2. Every search rebuilt and tore down the encoder, which for SigLIP meant ``AutoModel.from_pretrained`` plus a flurry of HF Hub HEAD checks per request — slow and noisy. Add ``get_cached_encoder()`` that memoizes encoders by ``(spec, cache_dir)`` and use it from the search path. Indexing keeps building/closing its own short-lived encoder since that's a one-shot operation.

Verified on siglip2-base: a matching cos of 0.139 was below the user threshold; after calibration it lands at ~0.25, while non-matches collapse to ~1e-6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
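A sketch of both pieces; the scale/bias values are illustrative stand-ins for the checkpoint's real ``logit_scale`` / ``logit_bias``, and ``build_encoder`` is the factory sketched earlier:

```python
import math
from functools import lru_cache

import numpy as np


def siglip_calibrate(cos, logit_scale=math.log(113.0), logit_bias=-17.0):
    # sigmoid(cos * exp(logit_scale) + logit_bias) -> calibrated match probability
    return 1.0 / (1.0 + np.exp(-(cos * math.exp(logit_scale) + logit_bias)))


def clip_calibrate(cos):
    return cos  # identity: CLIP/OpenCLIP cosines are already on the expected scale


@lru_cache(maxsize=None)
def get_cached_encoder(spec: str, cache_dir: str | None = None):
    # memoized by (spec, cache_dir): one model load per process, not per request
    return build_encoder(spec, cache_dir=cache_dir)
```

With these stand-in values, a matching cosine of 0.139 maps to sigmoid(0.139 · 113 − 17) ≈ 0.22, consistent with the ~0.25 the commit reports from the real checkpoint, while near-zero cosines collapse toward zero.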
The previous fix applied SigLIP's sigmoid calibration to every search result. With ``exp(logit_scale) ~ 113`` and ``logit_bias ~ -17``, any image-image cosine above ~0.15 saturated to 1.0 — so image search returned 100 matches all scoring 1.0. The calibration was trained on image-text pairs. Image-image cosines are routinely 0.4+ and don't need (or want) the same transform. Apply calibration only when the query has no image component (``image_weight == 0``); otherwise pass raw cosines through, which matches the user-confirmed pre-fix behavior for image search. Includes a regression test that drives the search code with a stub encoder whose ``calibrate_similarity`` records calls, asserting it fires for text-only queries and stays silent for image queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two redundant frontend filters were overriding state.minSearchScore:

- search-ui.js dropped any combined-search result below calculate_search_score_cutoff(...), a weighted blend of hardcoded TEXT_SCORE_CUTOFF=0.2 and IMAGE_SCORE_CUTOFF=0.75. Pure-text searches were therefore stuck at 0.2 even if the user lowered the Settings UI threshold to 0.1.
- searchWithImage hardcoded >= 0.6, ignoring the user's setting outright.

The backend already honors state.minSearchScore via SearchWithTextAndImageRequest.min_search_score → minimum_score, so both frontend filters were dead-weight at best and silently contradicting the user at worst. Drop them, delete the now-unused calculate_search_score_cutoff helper and its Jest block (which only encoded the bug), and clean up a top_k=500 arg that searchTextAndImage never accepted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SigLIP's text encoder was trained on caption-shaped inputs, so a bare
noun like "woman" produces a less-specific embedding than the
caption-style phrases the model expects. Combined with SigLIP's
narrow cosine band and steep sigmoid calibration, that small
specificity gap pushed single-noun queries below the threshold even
when the album was full of relevant matches. Mixed-media libraries
(drawings, illustrations, paintings alongside photos) made it worse:
a single ``a photo of {x}`` template would systematically penalize
non-photo content.
Ensemble each text query across five modality-spanning templates
(``a photo of``, ``a drawing of``, ``an illustration of``, ``a painting
of``, and the bare query), L2-normalize each per-template embedding so
no single phrasing dominates by magnitude, mean-pool across templates,
and re-normalize. The averaged direction sits close to all canonical
descriptions of the concept, so single nouns pick up the specificity
they were missing without locking in any modality assumption.
Confirmed on siglip2-base: ``"woman"`` and ``"middle-aged woman"``
ensembled embeddings cosine to 0.945 against each other; both stay
distinct from unrelated queries.
Scoped to ``SiglipEncoder.encode_text`` only — CLIP and OpenCLIP work
fine on bare queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
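A sketch of the ensembling math, with ``encode_one`` standing in for the underlying single-string text encode:

```python
import numpy as np

TEMPLATES = (
    "a photo of {}",
    "a drawing of {}",
    "an illustration of {}",
    "a painting of {}",
    "{}",  # the bare query keeps untemplated phrasing in the mix
)


def encode_text_ensembled(query: str, encode_one) -> np.ndarray:
    vecs = []
    for template in TEMPLATES:
        v = np.asarray(encode_one(template.format(query)), dtype=np.float32)
        # L2-normalize per template so no single phrasing dominates by magnitude
        vecs.append(v / np.linalg.norm(v))
    pooled = np.mean(vecs, axis=0)          # mean-pool across templates
    return pooled / np.linalg.norm(pooled)  # re-normalize to unit length
```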
Initial real-world results from the prompt-ensembling change were mixed, so make it opt-in via SIGLIP_USE_PROMPT_ENSEMBLING (default False) while we evaluate. The legacy single-encode path is restored when the flag is off; the existing unit test flips the flag locally so it still exercises the ensembling math. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old combine built a single weighted query vector in embedding
space, normalized it, and dotted with stored embeddings. That made
the user's slider semantics dishonest: image-image cosines and
image-text cosines live on very different scales (≈0.75 vs ≈0.28
for CLIP, ≈0.50 vs ≈0.12 for SigLIP), so a "50/50 image and text"
query was actually ≈73/27 image-dominant for CLIP and ≈80/20 for
SigLIP. The geometry of the unnormalized combined vector also let
cross-term alignment between v_img and v_pos quietly inflate or
deflate the resulting absolute scores.
Replace with a per-modality combine:
cos_img = stored · v_img
cos_pos = encoder.calibrate_similarity(stored · v_pos)
cos_neg = encoder.calibrate_similarity(stored · v_neg)
similarity = (w_img·cos_img + w_pos·cos_pos) / (w_img + w_pos) - w_neg·cos_neg
Text cosines now go through encoder.calibrate_similarity in every
mode (no longer gated to the text-only branch as a workaround for
the embedding-space saturation problem). For SigLIP that's the
sigmoid that brings text cosines onto the same scale as image
cosines; for CLIP/OpenCLIP it stays an identity, so existing
behavior on those backends is essentially preserved (just with
honest weight semantics on mixed queries).
The old test that asserted calibrate is skipped when image_weight
> 0 still passes — it was checking image-only and text-only modes,
and in the new code calibrate is still only called for cos_pos /
cos_neg, never cos_img. New test locks in the score-space math for
mixed and negative queries with a stub encoder whose calibrate
halves text cosines (a detectable transform).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
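A sketch of the score-space combine, assuming ``stored`` is the (n, d) unit-normalized embedding matrix and each query vector is unit-normalized or None:

```python
import numpy as np


def combine_scores(stored, v_img, v_pos, v_neg, w_img, w_pos, w_neg, encoder):
    weighted, denom = np.zeros(len(stored)), 0.0
    if v_img is not None and w_img > 0:
        weighted += w_img * (stored @ v_img)  # cos_img stays a raw cosine
        denom += w_img
    if v_pos is not None and w_pos > 0:
        # text cosines are calibrated in every mode now
        weighted += w_pos * encoder.calibrate_similarity(stored @ v_pos)
        denom += w_pos
    similarity = weighted / denom
    if v_neg is not None and w_neg > 0:
        similarity = similarity - w_neg * encoder.calibrate_similarity(stored @ v_neg)
    return similarity
```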
Add three Album fields persisted with the album config:

- ``min_search_score`` — defaults to 0.005 for SigLIP albums (cosine band is ~10x narrower than CLIP) and 0.2 otherwise. Resolved from ``encoder_spec`` by a model_validator when the value is omitted.
- ``max_search_results`` — defaults to 100; was previously a global setting in localStorage with a hard ceiling of 500. Now per-album with no upper clamp.
- ``use_query_optimization`` — defaults to True; SigLIP-only knob to toggle prompt-template ensembling at search time. Ignored for CLIP/OpenCLIP.

Round-trip these through to_dict / from_dict, the create_album helper, the /update_album/ route, and the /available_albums/ listing so the frontend can both load defaults and persist edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
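A sketch of the resolved default as a Pydantic v2 validator; the field names are the commit's, while the ``siglip`` spec-prefix check is an assumption about how SigLIP albums are detected:

```python
from pydantic import BaseModel, model_validator


class Album(BaseModel):
    encoder_spec: str = "openai-clip:ViT-B/32"
    min_search_score: float | None = None  # None => resolve from encoder_spec
    max_search_results: int = 100          # previously a global localStorage setting
    use_query_optimization: bool = True    # SigLIP-only ensembling toggle

    @model_validator(mode="after")
    def _default_min_score(self) -> "Album":
        if self.min_search_score is None:
            # SigLIP's cosine band is ~10x narrower than CLIP's
            self.min_search_score = (
                0.005 if self.encoder_spec.startswith("siglip") else 0.2
            )
        return self
```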
SigLIP's prompt-template ensembling was guarded by a single module-level constant, which couldn't reflect a per-album choice. Lift it to ``SiglipEncoder.use_ensembling`` (still defaulting from the module flag for direct callers like CLI tools) and have ``Embeddings.search_images_by_text_and_image`` mutate it from the incoming ``use_query_optimization`` argument before encoding text. The /search_with_text_and_image route now accepts ``use_query_optimization`` in the request body; the frontend sources it from the album's per-album setting (committed alongside). A regression test using a stub encoder confirms the flag round-trips through search: True → on, False → off, None → leave alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
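A sketch of the three-valued contract the regression test locks in (everything but the flag handling elided):

```python
def search_images_by_text_and_image(encoder, query: str,
                                    use_query_optimization: bool | None = None):
    if use_query_optimization is not None:
        # True -> force ensembling on, False -> off; None leaves the
        # encoder's current setting (the module default) untouched
        encoder.use_ensembling = use_query_optimization
    text_embedding = encoder.encode_text(query)
    ...  # scoring elided
```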
The "Minimum search score" and "Maximum # of search results" inputs used to live in Settings as a single global preference saved in localStorage. They now live in the search dialog itself as per-album controls, alongside a new "Query optimization (SigLIP only)" checkbox that toggles the prompt-template ensembling — disabled for non-SigLIP albums so it's clear the toggle has no effect there. Wiring: - state.js drops localStorage save/restore for minSearchScore / maxSearchResults; both values (and the new useQueryOptimization flag) are loaded from the active album on every setAlbum() and persisted back via /update_album/ on edit. Persistence is debounced 400ms to collapse rapid input changes into a single network write. - state dispatches ``albumSearchSettingsLoaded`` when album config finishes loading; the search dialog listens and refreshes its controls (and toggles the SigLIP-only checkbox's enabled state). - search.js sends ``use_query_optimization`` in the request body so the backend toggles SigLIP ensembling per request. - settings.html and settings.js shed the now-redundant Search accordion and its supporting JS; page-visibility's critical-state backup drops the two now-per-album fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenCLIP ViT-L-14 with DFN-2B weights is a better general-purpose
default than legacy CLIP: noticeably stronger recall, while keeping
CLIP-style cosine semantics that work robustly on cluttered family
photos. SigLIP's steeper calibration is brilliant for clean
caption-shaped queries against single-subject images (e.g. AI-generated
content) but tends to misfire on real-world photo collections.
Splits the encoder spec constant to avoid conflating two roles:
- ``DEFAULT_ENCODER_SPEC`` — default for *new* albums. Now
``open-clip:ViT-L-14/dfn2b_s39b``. Used by Album.encoder_spec
Pydantic field default and the frontend's add-album dropdown.
- ``LEGACY_ENCODER_SPEC`` — what an .npz cache or YAML album that
predates the encoder swap layer was actually built with. Pinned to
``openai-clip:ViT-B/32`` because that was the only option then. Used
by every cache-fallback path: peek_encoder_spec, _open_npz_file,
_check_cache_compatibility, IndexResult.model_id default,
Embeddings.encoder_spec default, Album.from_dict missing-field
fallback, and the update_index{,_async} existing-cache reads.
Without this split, legacy caches that lack a ``model_id`` field
would be claimed as OpenCLIP-built, and _check_cache_compatibility
would trigger an auto-rebuild on every legacy .npz it loaded.
Album-manager dropdown re-ordered so OpenCLIP-DFN is the first
option (the default for new albums); legacy CLIP relegated to the
bottom and re-labeled accordingly.
Tests: split ``test_default_spec_is_legacy_clip`` into
``test_default_spec_for_new_albums`` and ``test_legacy_spec_unchanged``;
fix ``test_build_encoder_none_uses_default`` to mock OpenClipEncoder
since ``build_encoder(None)`` now routes there. Pin the
``new_album`` fixture to legacy CLIP so search-threshold assertions
stay stable across default-encoder changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
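The split, sketched with the commit's constant names; the stamp lookup mirrors the fallback applied at each cache-read site:

```python
DEFAULT_ENCODER_SPEC = "open-clip:ViT-L-14/dfn2b_s39b"  # default for NEW albums
LEGACY_ENCODER_SPEC = "openai-clip:ViT-B/32"            # what pre-PR caches used


def peek_encoder_spec(cache) -> str:
    # A cache with no model_id stamp predates the encoder layer, so it must
    # be attributed to the legacy encoder; claiming DEFAULT here would force
    # a spurious auto-rebuild of every legacy .npz on load.
    return str(cache["model_id"]) if "model_id" in cache else LEGACY_ENCODER_SPEC
```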
Three decimals collapsed every SigLIP photo-album match to "0.000" once the calibrated probabilities dropped under 0.001 — useful matches became indistinguishable from non-matches in the UI. Bump score formatting from toFixed(3) to toFixed(4) in both the search-result overlay and the seek slider so sub-0.001 differences are visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The text inputs in ``.search-fields-column`` extended to the panel's right content edge, putting them under the absolutely-positioned X close button at top/right 0.7em. Reserve 1.8em on the right of the grid column so the inputs and weight sliders stop short of the close button without disturbing the column's label/wrapper alignment or the image column on the left. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions tied together:

1. New ``user-guide/encoders.md`` introducing the three bundled encoders, their relative strengths and weaknesses, and concrete guidance on which to pick. Documents how the encoder is set per-album from the Album Manager (and as a config.yaml fallback), and what happens when the encoder is changed on an existing album (auto-rebuild on next index).
2. ``user-guide/search.md`` gains a "Tuning Search Per-Album" section covering the in-dialog Min. score / Max. results / Query optimization controls, with encoder-aware default thresholds and guidance on the SigLIP-only ensembling toggle. The score-scale note now points readers to the per-album tuning knobs, and the image+text+negative-text section is rewritten to describe the score-space combine (weights mean what they say) instead of the pre-refactor advice.
3. ``user-guide/albums.md`` gets a new bullet for the encoder dropdown in the Add/Edit Album form, with a pointer to the encoders page for the trade-offs and a note that encoder changes trigger a from-scratch rebuild.

Wired into ``mkdocs.yml`` as a User Guide entry between Albums and Configuration. ``mkdocs build --strict`` succeeds with no broken links.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the single hard-coded OpenAI CLIP encoder with a pluggable image/text encoder layer that supports three backends — original CLIP, OpenCLIP (DFN-2B weights), and SigLIP 2 — selectable per album from the UI. Existing albums and `.npz` caches keep working with no migration required.
The PR also picks up several adjacent improvements that surfaced while building it: a search-indexing perf rewrite (batching + parallel CPU loaders → ~2-3× speedup), a score-space refactor of the search-combine math so weight sliders mean what they say, and a documentation page introducing users to the encoders and how to tune them per album.
Headline changes
Behavior preserved
Notable bug fixes along the way
Tests
Test plan
🤖 Generated with Claude Code