feat: pluggable image/text encoder layer with OpenCLIP and SigLIP backends (#226)
Merged
Conversation
Introduce an ImageTextEncoder abstraction with three backends (OpenAI CLIP, OpenCLIP, SigLIP/SigLIP 2) selectable per-album via a new Album.encoder_spec field. Default remains openai-clip:ViT-B/32 so existing albums and .npz caches behave identically. The .npz cache is now stamped with the model_id and embedding_dim it was built with; loaders raise EmbeddingCacheMismatch when the active encoder disagrees, instead of silently returning garbage similarity scores. Old caches without the field default to the legacy spec. OpenCLIP and SigLIP dependencies are gated behind optional extras (open-clip, siglip) to keep the base install lean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
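A minimal sketch of the stamp check described above, assuming the spec is stored under a ``model_id`` key in the ``.npz`` (the loader's real shape may differ):

```python
import numpy as np

LEGACY_SPEC = "openai-clip:ViT-B/32"  # caches without a stamp predate this PR


class EmbeddingCacheMismatch(Exception):
    """Raised when a .npz cache was built with a different encoder."""


def check_cache_compatibility(npz_path: str, active_spec: str) -> None:
    """Refuse to serve embeddings built with a different encoder."""
    with np.load(npz_path, allow_pickle=False) as cache:
        stored = str(cache["model_id"]) if "model_id" in cache else LEGACY_SPEC
    if stored != active_spec:
        raise EmbeddingCacheMismatch(
            f"index built with {stored!r}, album configured for {active_spec!r}"
        )
```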
build_encoder() forwarded download_root unconditionally, which only OpenAIClipEncoder accepts; SigLIP and OpenCLIP raised TypeError as soon as Embeddings._build_encoder() invoked them. Replace the **kwargs passthrough with an explicit cache_dir parameter that the factory maps to each backend's native option (download_root for OpenAI CLIP, cache_dir for OpenCLIP, ignored for SigLIP since transformers uses HF_HOME). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
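A sketch of the factory mapping; the three encoder class names are the PR's, but their constructors here are stand-in stubs that only show which cache option each backend accepts:

```python
class OpenAIClipEncoder:
    # openai/CLIP's clip.load() takes download_root
    def __init__(self, model_id: str, download_root: str | None = None): ...


class OpenClipEncoder:
    # open_clip's model factory takes cache_dir
    def __init__(self, model_id: str, cache_dir: str | None = None): ...


class SiglipEncoder:
    # no cache argument: transformers resolves its cache from HF_HOME
    def __init__(self, model_id: str): ...


def build_encoder(spec: str, cache_dir: str | None = None):
    backend, _, model_id = spec.partition(":")
    if backend == "openai-clip":
        return OpenAIClipEncoder(model_id, download_root=cache_dir)
    if backend == "open-clip":
        return OpenClipEncoder(model_id, cache_dir=cache_dir)
    if backend == "siglip":
        return SiglipEncoder(model_id)
    raise ValueError(f"unknown encoder backend: {backend!r}")
```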
Avoids the `operator torchvision::nms does not exist` ABI mismatch when torch and torchvision versions drift apart at install time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, changing an album's encoder_spec while a stale .npz still
existed surfaced an EmbeddingCacheMismatch and forced the user to
manually delete the index file. The reindex background task now peeks
at the existing index's stored model_id, and on mismatch it logs a
warning, surfaces a status line through progress_tracker so the UI
shows what's happening, deletes the stale .npz, and falls through to
a fresh create_index_async.
Embeddings.update_index{,_async}() keep their strict mismatch behavior
so CLI tools and library callers still fail loud.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
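A sketch of the tolerant reindex path, assuming illustrative names for the album attribute (``index_path``) and the progress API (``set_status``); ``peek_encoder_spec`` and the ``create_index_async`` / ``update_index_async`` pair are the commit's own:

```python
import logging
import os

logger = logging.getLogger(__name__)


async def reindex_album(album, embeddings, progress_tracker):
    if os.path.exists(album.index_path):
        stored_spec = embeddings.peek_encoder_spec(album.index_path)
        if stored_spec != album.encoder_spec:
            logger.warning(
                "index %s built with %s, album now uses %s; rebuilding",
                album.index_path, stored_spec, album.encoder_spec,
            )
            progress_tracker.set_status(
                f"Encoder changed ({stored_spec} -> {album.encoder_spec}); "
                "rebuilding index from scratch"
            )
            os.remove(album.index_path)  # drop the stale cache
            await embeddings.create_index_async(album)
            return
    # specs agree (or no index exists yet): the strict update path is safe
    await embeddings.update_index_async(album)
```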
In transformers >= 5, ``Siglip{,2}Model.get_image_features`` and
``get_text_features`` return ``BaseModelOutputWithPooling`` instead of
a bare tensor. Add ``_unwrap_pooled`` to accept either shape so the
SigLIP encoder works on old and new transformers without pinning a
specific version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
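A sketch of the shim, assuming the wrapped output exposes the pooled embedding as ``pooler_output``:

```python
import torch


def _unwrap_pooled(out) -> torch.Tensor:
    """Accept a bare tensor (older transformers) or a pooling output
    object (newer transformers) and return the pooled embedding."""
    if isinstance(out, torch.Tensor):
        return out
    return out.pooler_output
```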
Production indexing was CPU-bound per-image (PIL decode + EXIF + metadata extraction) with the GPU mostly idle waiting for the next image. Two changes:

1. Batched encoding: ``_process_images_batch`` buffers up to ``batch_size`` decoded images and calls ``encoder.encode_images(buf)`` once per batch, amortizing per-call overhead.
2. Parallel CPU loaders: a bounded ThreadPoolExecutor of ``num_workers`` threads decodes images concurrently while the main thread feeds batches to the GPU. A sliding window of ``num_workers * 2 + batch_size`` futures keeps memory in check on large collections.

Defaults (``DEFAULT_BATCH_SIZE=8``, ``DEFAULT_NUM_WORKERS=4``) chosen from benchmark_encoders.py. On the test corpus they yield ~3x speedup for CLIP and ~1.8x for the larger encoders. Workers >4 regress due to GIL contention.

The async ``_process_images_batch_async`` collapses to a single ``asyncio.to_thread`` delegation: both sync and async paths now share the parallel pipeline, and the FastAPI event loop stays free. Per-image error handling, progress callbacks, and result ordering are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
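A simplified sketch of the pipeline; ``decode_image`` stands in for the real per-image CPU work (decode + EXIF + metadata), and per-image error handling plus progress callbacks are elided:

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_BATCH_SIZE = 8
DEFAULT_NUM_WORKERS = 4


def process_images_batch(paths, encoder, decode_image,
                         batch_size=DEFAULT_BATCH_SIZE,
                         num_workers=DEFAULT_NUM_WORKERS):
    results, buf = [], []
    window = num_workers * 2 + batch_size  # bound on in-flight decodes

    def flush():
        # one encoder call per batch amortizes per-call overhead
        embeddings = encoder.encode_images([img for _, img in buf])
        results.extend(zip((p for p, _ in buf), embeddings))
        buf.clear()

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = []  # futures kept in submission order => stable output order
        for path in paths:
            pending.append((path, pool.submit(decode_image, path)))
            if len(pending) >= window:
                p, fut = pending.pop(0)
                buf.append((p, fut.result()))  # blocks until that decode lands
                if len(buf) >= batch_size:
                    flush()
        for p, fut in pending:  # drain the tail of the window
            buf.append((p, fut.result()))
            if len(buf) >= batch_size:
                flush()
        if buf:
            flush()
    return results
```

The async variant then reduces to ``await asyncio.to_thread(process_images_batch, ...)``, which is what keeps the FastAPI event loop free.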
Standalone CLI (``python benchmark_encoders.py <image_dir>``) that sweeps over encoder specs, batch sizes, and worker counts and reports end-to-end indexing time + throughput. Used to derive the production defaults (batch=8, workers=4) committed alongside. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bounded sliding window in ``_process_images_batch`` is responsible for re-aligning futures with their submission order before flushing each batch to the encoder. This test swaps in a deterministic stub encoder, replicates the bundled test images so multiple batches are processed, and asserts that workers=4 produces byte-identical output to workers=1 — same filenames in the same order, same embeddings, same modtimes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
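A sketch of that determinism check, reusing the pipeline sketch above (the stub encoder and decode step are illustrative):

```python
class StubEncoder:
    def encode_images(self, images):
        return list(images)  # identity "embeddings": deterministic by construction


def test_workers_do_not_change_output(image_paths):
    def decode(path):  # stand-in for the real decode + EXIF + metadata step
        with open(path, "rb") as f:
            return f.read()

    serial = process_images_batch(image_paths, StubEncoder(), decode,
                                  num_workers=1)
    parallel = process_images_batch(image_paths, StubEncoder(), decode,
                                    num_workers=4)
    assert parallel == serial  # same files, same order, same embeddings
```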
Surfaces the per-album ``encoder_spec`` setting in the UI. Both the "Add Album" and "Edit Album" forms now expose a select with the three bundled encoder options; specs not in the list (e.g. set via config.yaml) are appended as "(custom)" so the dropdown reflects reality. A short hint warns users that switching the encoder will trigger a from-scratch rebuild on the next index update, which the auto-rebuild flow added earlier in this PR handles gracefully. Backend: ``create_album()`` now accepts an optional ``encoder_spec`` and the ``update_album`` route forwards it from the request payload, so edits round-trip through ``/update_album/`` correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pluggable encoder layer now ships with three first-class backends (OpenAI CLIP, OpenCLIP, SigLIP). Hiding two of them behind optional ``[open-clip]`` / ``[siglip]`` extras meant the album-manager dropdown could surface choices that ImportError on selection. Move ``open_clip_torch`` and ``transformers`` into the main dependency list, drop the now-unused extras, and trim the install hints from the encoder ImportError messages — they referenced commands that no longer apply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /available_albums/ route was hand-constructing each album dict and forgot to include ``encoder_spec``. The album-manager loads its cards from this endpoint and passes the loaded album object directly into editAlbum(), which then populated the encoder dropdown from the missing field. The dropdown therefore always fell back to the default (openai-clip) regardless of what the album was actually configured — or last indexed — with. Surface ``encoder_spec`` in the listing response and add a regression test covering POST /add_album → GET /available_albums/ → POST /update_album round-trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues users hit when searching with the SigLIP backend:

1. Text searches returned zero hits at the default 0.2 threshold even for queries that should match plenty of images. SigLIP's training objective produces image-text cosines around 0.05-0.15 for matching pairs — a CLIP-tuned threshold like 0.2 filters out almost every true match. The model exposes ``logit_scale`` and ``logit_bias`` precisely to recover calibrated probabilities via ``sigmoid(cos * exp(scale) + bias)``. Add a per-encoder ``calibrate_similarity()`` hook (identity for CLIP/OpenCLIP, sigmoid transform for SigLIP) and apply it in ``search_images_by_text_and_image`` before threshold comparison.
2. Every search rebuilt and tore down the encoder, which for SigLIP meant ``AutoModel.from_pretrained`` plus a flurry of HF Hub HEAD checks per request — slow and noisy. Add ``get_cached_encoder()`` that memoizes encoders by ``(spec, cache_dir)`` and use it from the search path. Indexing keeps building/closing its own short-lived encoder since that's a one-shot operation.

Verified on siglip2-base: a matching cos of 0.139 was below the user threshold; after calibration it lands at ~0.25, while non-matches collapse to ~1e-6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
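A sketch of both pieces; the scale/bias values are illustrative stand-ins for the checkpoint's real ``logit_scale`` / ``logit_bias``, and ``build_encoder`` is the factory sketched earlier:

```python
import math
from functools import lru_cache

import numpy as np


def siglip_calibrate(cos, logit_scale=math.log(113.0), logit_bias=-17.0):
    # sigmoid(cos * exp(logit_scale) + logit_bias) -> calibrated match probability
    return 1.0 / (1.0 + np.exp(-(cos * math.exp(logit_scale) + logit_bias)))


def clip_calibrate(cos):
    return cos  # identity: CLIP/OpenCLIP cosines are already on the expected scale


@lru_cache(maxsize=None)
def get_cached_encoder(spec: str, cache_dir: str | None = None):
    # memoized by (spec, cache_dir): one model load per process, not per request
    return build_encoder(spec, cache_dir=cache_dir)
```

With these stand-in values, a matching cosine of 0.139 maps to sigmoid(0.139 · 113 − 17) ≈ 0.22, consistent with the ~0.25 the commit reports from the real checkpoint, while near-zero cosines collapse toward zero.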
The previous fix applied SigLIP's sigmoid calibration to every search result. With ``exp(logit_scale) ~ 113`` and ``logit_bias ~ -17``, any image-image cosine above ~0.15 saturated to 1.0 — so image search returned 100 matches all scoring 1.0. The calibration was trained on image-text pairs. Image-image cosines are routinely 0.4+ and don't need (or want) the same transform. Apply calibration only when the query has no image component (``image_weight == 0``); otherwise pass raw cosines through, which matches the user-confirmed pre-fix behavior for image search. Includes a regression test that drives the search code with a stub encoder whose ``calibrate_similarity`` records calls, asserting it fires for text-only queries and stays silent for image queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two redundant frontend filters were overriding state.minSearchScore:

- search-ui.js dropped any combined-search result below calculate_search_score_cutoff(...), a weighted blend of hardcoded TEXT_SCORE_CUTOFF=0.2 and IMAGE_SCORE_CUTOFF=0.75. Pure-text searches were therefore stuck at 0.2 even if the user lowered the Settings UI threshold to 0.1.
- searchWithImage hardcoded >= 0.6, ignoring the user's setting outright.

The backend already honors state.minSearchScore via SearchWithTextAndImageRequest.min_search_score → minimum_score, so both frontend filters were dead-weight at best and silently contradicting the user at worst. Drop them, delete the now-unused calculate_search_score_cutoff helper and its Jest block (which only encoded the bug), and clean up a top_k=500 arg that searchTextAndImage never accepted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SigLIP's text encoder was trained on caption-shaped inputs, so a bare
noun like "woman" produces a less-specific embedding than the
caption-style phrases the model expects. Combined with SigLIP's
narrow cosine band and steep sigmoid calibration, that small
specificity gap pushed single-noun queries below the threshold even
when the album was full of relevant matches. Mixed-media libraries
(drawings, illustrations, paintings alongside photos) made it worse:
a single ``a photo of {x}`` template would systematically penalize
non-photo content.
Ensemble each text query across five modality-spanning templates
(``a photo of``, ``a drawing of``, ``an illustration of``, ``a painting
of``, and the bare query), L2-normalize each per-template embedding so
no single phrasing dominates by magnitude, mean-pool across templates,
and re-normalize. The averaged direction sits close to all canonical
descriptions of the concept, so single nouns pick up the specificity
they were missing without locking in any modality assumption.
Confirmed on siglip2-base: ``"woman"`` and ``"middle-aged woman"``
ensembled embeddings cosine to 0.945 against each other; both stay
distinct from unrelated queries.
Scoped to ``SiglipEncoder.encode_text`` only — CLIP and OpenCLIP work
fine on bare queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
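A sketch of the ensembling math, with ``encode_one`` standing in for the underlying single-string text encode:

```python
import numpy as np

TEMPLATES = (
    "a photo of {}",
    "a drawing of {}",
    "an illustration of {}",
    "a painting of {}",
    "{}",  # the bare query keeps untemplated phrasing in the mix
)


def encode_text_ensembled(query: str, encode_one) -> np.ndarray:
    vecs = []
    for template in TEMPLATES:
        v = np.asarray(encode_one(template.format(query)), dtype=np.float32)
        # L2-normalize per template so no single phrasing dominates by magnitude
        vecs.append(v / np.linalg.norm(v))
    pooled = np.mean(vecs, axis=0)          # mean-pool across templates
    return pooled / np.linalg.norm(pooled)  # re-normalize to unit length
```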
Initial real-world results from the prompt-ensembling change were mixed, so make it opt-in via SIGLIP_USE_PROMPT_ENSEMBLING (default False) while we evaluate. The legacy single-encode path is restored when the flag is off; the existing unit test flips the flag locally so it still exercises the ensembling math. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old combine built a single weighted query vector in embedding
space, normalized it, and dotted with stored embeddings. That made
the user's slider semantics dishonest: image-image cosines and
image-text cosines live on very different scales (≈0.75 vs ≈0.28
for CLIP, ≈0.50 vs ≈0.12 for SigLIP), so a "50/50 image and text"
query was actually ≈73/27 image-dominant for CLIP and ≈80/20 for
SigLIP. The geometry of the unnormalized combined vector also let
cross-term alignment between v_img and v_pos quietly inflate or
deflate the resulting absolute scores.
Replace with a per-modality combine:
cos_img = stored · v_img
cos_pos = encoder.calibrate_similarity(stored · v_pos)
cos_neg = encoder.calibrate_similarity(stored · v_neg)
similarity = (w_img·cos_img + w_pos·cos_pos) / (w_img + w_pos) - w_neg·cos_neg
Text cosines now go through encoder.calibrate_similarity in every
mode (no longer gated to the text-only branch as a workaround for
the embedding-space saturation problem). For SigLIP that's the
sigmoid that brings text cosines onto the same scale as image
cosines; for CLIP/OpenCLIP it stays an identity, so existing
behavior on those backends is essentially preserved (just with
honest weight semantics on mixed queries).
The old test that asserted calibrate is skipped when image_weight
> 0 still passes — it was checking image-only and text-only modes,
and in the new code calibrate is still only called for cos_pos /
cos_neg, never cos_img. New test locks in the score-space math for
mixed and negative queries with a stub encoder whose calibrate
halves text cosines (a detectable transform).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
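A sketch of the score-space combine, assuming ``stored`` is the (n, d) unit-normalized embedding matrix and each query vector is unit-normalized or None:

```python
import numpy as np


def combine_scores(stored, v_img, v_pos, v_neg, w_img, w_pos, w_neg, encoder):
    weighted, denom = np.zeros(len(stored)), 0.0
    if v_img is not None and w_img > 0:
        weighted += w_img * (stored @ v_img)  # cos_img stays a raw cosine
        denom += w_img
    if v_pos is not None and w_pos > 0:
        # text cosines are calibrated in every mode now
        weighted += w_pos * encoder.calibrate_similarity(stored @ v_pos)
        denom += w_pos
    similarity = weighted / denom
    if v_neg is not None and w_neg > 0:
        similarity = similarity - w_neg * encoder.calibrate_similarity(stored @ v_neg)
    return similarity
```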
Add three Album fields persisted with the album config:

- ``min_search_score`` — defaults to 0.005 for SigLIP albums (cosine band is ~10x narrower than CLIP) and 0.2 otherwise. Resolved from ``encoder_spec`` by a model_validator when the value is omitted.
- ``max_search_results`` — defaults to 100; was previously a global setting in localStorage with a hard ceiling of 500. Now per-album with no upper clamp.
- ``use_query_optimization`` — defaults to True; SigLIP-only knob to toggle prompt-template ensembling at search time. Ignored for CLIP/OpenCLIP.

Round-trip these through to_dict / from_dict, the create_album helper, the /update_album/ route, and the /available_albums/ listing so the frontend can both load defaults and persist edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
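A sketch of the resolved default as a Pydantic v2 validator; the field names are the commit's, while the ``siglip`` spec-prefix check is an assumption about how SigLIP albums are detected:

```python
from pydantic import BaseModel, model_validator


class Album(BaseModel):
    encoder_spec: str = "openai-clip:ViT-B/32"
    min_search_score: float | None = None  # None => resolve from encoder_spec
    max_search_results: int = 100          # previously a global localStorage setting
    use_query_optimization: bool = True    # SigLIP-only ensembling toggle

    @model_validator(mode="after")
    def _default_min_score(self) -> "Album":
        if self.min_search_score is None:
            # SigLIP's cosine band is ~10x narrower than CLIP's
            self.min_search_score = (
                0.005 if self.encoder_spec.startswith("siglip") else 0.2
            )
        return self
```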
SigLIP's prompt-template ensembling was guarded by a single module-level constant, which couldn't reflect a per-album choice. Lift it to ``SiglipEncoder.use_ensembling`` (still defaulting from the module flag for direct callers like CLI tools) and have ``Embeddings.search_images_by_text_and_image`` mutate it from the incoming ``use_query_optimization`` argument before encoding text. The /search_with_text_and_image route now accepts ``use_query_optimization`` in the request body; the frontend sources it from the album's per-album setting (committed alongside). A regression test using a stub encoder confirms the flag round-trips through search: True → on, False → off, None → leave alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
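A sketch of the three-valued contract the regression test locks in (everything but the flag handling elided):

```python
def search_images_by_text_and_image(encoder, query: str,
                                    use_query_optimization: bool | None = None):
    if use_query_optimization is not None:
        # True -> force ensembling on, False -> off; None leaves the
        # encoder's current setting (the module default) untouched
        encoder.use_ensembling = use_query_optimization
    text_embedding = encoder.encode_text(query)
    ...  # scoring elided
```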
The "Minimum search score" and "Maximum # of search results" inputs used to live in Settings as a single global preference saved in localStorage. They now live in the search dialog itself as per-album controls, alongside a new "Query optimization (SigLIP only)" checkbox that toggles the prompt-template ensembling — disabled for non-SigLIP albums so it's clear the toggle has no effect there. Wiring: - state.js drops localStorage save/restore for minSearchScore / maxSearchResults; both values (and the new useQueryOptimization flag) are loaded from the active album on every setAlbum() and persisted back via /update_album/ on edit. Persistence is debounced 400ms to collapse rapid input changes into a single network write. - state dispatches ``albumSearchSettingsLoaded`` when album config finishes loading; the search dialog listens and refreshes its controls (and toggles the SigLIP-only checkbox's enabled state). - search.js sends ``use_query_optimization`` in the request body so the backend toggles SigLIP ensembling per request. - settings.html and settings.js shed the now-redundant Search accordion and its supporting JS; page-visibility's critical-state backup drops the two now-per-album fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenCLIP ViT-L-14 with DFN-2B weights is a better general-purpose
default than legacy CLIP: noticeably stronger recall, while keeping
CLIP-style cosine semantics that work robustly on cluttered family
photos. SigLIP's steeper calibration is brilliant for clean
caption-shaped queries against single-subject images (e.g. AI-generated
content) but tends to misfire on real-world photo collections.
Splits the encoder spec constant to avoid conflating two roles:
- ``DEFAULT_ENCODER_SPEC`` — default for *new* albums. Now
``open-clip:ViT-L-14/dfn2b_s39b``. Used by Album.encoder_spec
Pydantic field default and the frontend's add-album dropdown.
- ``LEGACY_ENCODER_SPEC`` — what an .npz cache or YAML album that
predates the encoder swap layer was actually built with. Pinned to
``openai-clip:ViT-B/32`` because that was the only option then. Used
by every cache-fallback path: peek_encoder_spec, _open_npz_file,
_check_cache_compatibility, IndexResult.model_id default,
Embeddings.encoder_spec default, Album.from_dict missing-field
fallback, and the update_index{,_async} existing-cache reads.
Without this split, legacy caches that lack a ``model_id`` field
would be claimed as OpenCLIP-built, and _check_cache_compatibility
would trigger an auto-rebuild on every legacy .npz it loaded.
Album-manager dropdown re-ordered so OpenCLIP-DFN is the first
option (the default for new albums); legacy CLIP relegated to the
bottom and re-labeled accordingly.
Tests: split ``test_default_spec_is_legacy_clip`` into
``test_default_spec_for_new_albums`` and ``test_legacy_spec_unchanged``;
fix ``test_build_encoder_none_uses_default`` to mock OpenClipEncoder
since ``build_encoder(None)`` now routes there. Pin the
``new_album`` fixture to legacy CLIP so search-threshold assertions
stay stable across default-encoder changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
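The split, sketched with the commit's constant names; the stamp lookup mirrors the fallback applied at each cache-read site:

```python
DEFAULT_ENCODER_SPEC = "open-clip:ViT-L-14/dfn2b_s39b"  # default for NEW albums
LEGACY_ENCODER_SPEC = "openai-clip:ViT-B/32"            # what pre-PR caches used


def peek_encoder_spec(cache) -> str:
    # A cache with no model_id stamp predates the encoder layer, so it must
    # be attributed to the legacy encoder; claiming DEFAULT here would force
    # a spurious auto-rebuild of every legacy .npz on load.
    return str(cache["model_id"]) if "model_id" in cache else LEGACY_ENCODER_SPEC
```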
Three decimals collapsed every SigLIP photo-album match to "0.000" once the calibrated probabilities dropped under 0.001 — useful matches became indistinguishable from non-matches in the UI. Bump score formatting from toFixed(3) to toFixed(4) in both the search-result overlay and the seek slider so sub-0.001 differences are visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The text inputs in ``.search-fields-column`` extended to the panel's right content edge, putting them under the absolutely-positioned X close button at top/right 0.7em. Reserve 1.8em on the right of the grid column so the inputs and weight sliders stop short of the close button without disturbing the column's label/wrapper alignment or the image column on the left. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions tied together:

1. New ``user-guide/encoders.md`` introducing the three bundled encoders, their relative strengths and weaknesses, and concrete guidance on which to pick. Documents how the encoder is set per-album from the Album Manager (and as a config.yaml fallback), and what happens when the encoder is changed on an existing album (auto-rebuild on next index).
2. ``user-guide/search.md`` gains a "Tuning Search Per-Album" section covering the in-dialog Min. score / Max. results / Query optimization controls, with encoder-aware default thresholds and guidance on the SigLIP-only ensembling toggle. The score-scale note now points readers to the per-album tuning knobs, and the image+text+negative-text section is rewritten to describe the score-space combine (weights mean what they say) instead of the pre-refactor advice.
3. ``user-guide/albums.md`` gets a new bullet for the encoder dropdown in the Add/Edit Album form, with a pointer to the encoders page for the trade-offs and a note that encoder changes trigger a from-scratch rebuild.

Wired into ``mkdocs.yml`` as a User Guide entry between Albums and Configuration. ``mkdocs build --strict`` succeeds with no broken links.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the single hard-coded OpenAI CLIP encoder with a pluggable image/text encoder layer that supports three backends — original CLIP, OpenCLIP (DFN-2B weights), and SigLIP 2 — selectable per album from the UI. Existing albums and `.npz` caches keep working with no migration required.
The PR also picks up several adjacent improvements that surfaced while building it: a search-indexing perf rewrite (batching + parallel CPU loaders → ~2-3× speedup), a score-space refactor of the search-combine math so weight sliders mean what they say, and a documentation page introducing users to the encoders and how to tune them per album.
Headline changes
Behavior preserved
Notable bug fixes along the way
Tests
Test plan
🤖 Generated with Claude Code