fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271
Open
lstein wants to merge 1 commit into
Open
fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271lstein wants to merge 1 commit into
lstein wants to merge 1 commit into
Conversation
… CLI kwarg
Three small but real defects in embeddings.py / imagetool.py — the
last three backend items from the 2026-05-20 code review.
## 1. _get_new_and_missing_images: case-insensitive Path comparison
The new-vs-missing diff used plain ``set[Path]`` operations. Python's
``Path.__eq__`` is *always* case-sensitive, regardless of the
underlying filesystem. On the dominant user platforms — Windows NTFS
and macOS HFS+/APFS-CI — that meant the same image referenced with
different case in its filename (``Photo.JPG`` in the cache, ``photo.jpg``
on disk) was classified as **both new and missing**:
new = {photo.jpg} ← re-encoded (whole CLIP pass)
missing = {Photo.JPG} ← orphaned in the cache
Net effect: silent double-encoding on every re-index, plus a stale
row that ``_filter_missing_images`` then drops, plus stale UMAP and
cluster-label caches keyed off the bogus filename count.
The fix funnels both sides through ``_path_compare_key``, which
returns ``p.as_posix().casefold()``. The casefolded keys drive the
set diff; the original-case Paths flow back through ``live_by_key``
and ``existing_by_key`` so downstream filesystem operations see what
the disk actually holds.
Trade-off on case-sensitive Linux: two files in the same directory
whose names differ only in case collapse to one entry. That collision
is unusual and harmless — the image is simply indexed once instead
of twice — and far less destructive than the silent double-encoding
the previous Path-equality diff produced on the majority of user
platforms.
## 2. search_images_by_text_and_image: VRAM release was a no-op
The ``finally`` block in this hot search path tried to drop GPU
tensor references before calling ``_cleanup_cuda_memory``:
for name in ("image_embedding", "pos_emb", ...):
if name in locals():
del locals()[name]
self._cleanup_cuda_memory(device)
``locals()`` returns a *snapshot dict* in functions — mutating it
doesn't touch the actual local bindings. The ``del`` was a no-op,
the local refs lived on until the function frame was destroyed, and
``torch.cuda.empty_cache()`` couldn't reclaim memory pinned by those
refs. VRAM accumulated across search queries until the search method
returned.
Pre-declare all ten potentially-allocated tensors / arrays to
``None`` at the start of the try (so they're unconditionally bound),
then ``del`` them explicitly in the finally before
``_cleanup_cuda_memory``. The allocator can now actually reclaim
the freed memory.
## 3. imagetool.do_search: TypeError on every first call
The ``search_images`` CLI passed ``query_image_path=args.search`` to
``Embeddings.search_images_by_text_and_image``. That function's
signature has no ``query_image_path`` parameter — it takes
``query_image_data: Image.Image``. Python raised ``TypeError: got
an unexpected keyword argument 'query_image_path'`` before the
encoder was ever touched, so the script has been broken since the
parameter was renamed.
Fix: open the path with PIL inside a context manager and pass
``query_image_data=img``. Verified by a new test that intercepts
the underlying call and asserts the kwarg name.
## Test coverage added
* ``tests/backend/test_embeddings_diff.py`` — 9 tests covering
``_path_compare_key`` directly plus four ``_get_new_and_missing_images``
scenarios, including the case-only-rename regression.
* ``tests/backend/test_imagetool.py`` — 9 tests covering
``do_search`` / ``do_text_search`` kwarg shapes and ``main()``'s
basename → handler dispatch.
297 backend tests pass (was 279).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three small but real defects
The last three backend items from the 2026-05-20 code review, in one focused PR (~80 lines of fixes, ~320 lines of tests).
1. `_get_new_and_missing_images`: case-insensitive Path comparison
The new-vs-missing diff used plain `set[Path]` operations. Python's `Path.eq` is always case-sensitive, regardless of the underlying filesystem. On Windows NTFS and macOS HFS+/APFS-CI, the same image referenced with different case in its filename (`Photo.JPG` in the cache, `photo.jpg` on disk) was classified as both new and missing:
```
new = {photo.jpg} ← re-encoded (full CLIP pass)
missing = {Photo.JPG} ← orphaned in the cache
```
Net effect: silent double-encoding on every re-index, plus a stale row that `_filter_missing_images` then drops, plus stale UMAP / cluster-label caches keyed off the bogus filename count.
The fix funnels both sides through `_path_compare_key`, which returns `p.as_posix().casefold()`. Casefolded keys drive the set diff; original-case Paths flow back through `live_by_key` / `existing_by_key` so downstream filesystem operations see what the disk actually holds.
Trade-off on case-sensitive Linux: two files in the same directory whose names differ only in case collapse to one entry. That collision is unusual and harmless — the image is simply indexed once instead of twice — and far less destructive than the silent double-encoding the previous Path-equality diff produced on the majority of user platforms.
2. `search_images_by_text_and_image`: VRAM release was a no-op
The `finally` block in this hot search path tried to drop GPU tensor references before calling `_cleanup_cuda_memory`:
```python
for name in ("image_embedding", "pos_emb", ...):
if name in locals():
del locals()[name]
self._cleanup_cuda_memory(device)
```
`locals()` returns a snapshot dict in functions — mutating it doesn't touch the actual local bindings. The `del` was a no-op, the local refs lived on until the function frame was destroyed, and `torch.cuda.empty_cache()` couldn't reclaim memory pinned by those refs. VRAM accumulated across search queries until `search_images_by_text_and_image` returned.
Pre-declared all ten potentially-allocated tensors / arrays to `None` at the start of the try (so they're unconditionally bound), then `del` them explicitly in the finally before `_cleanup_cuda_memory`. The allocator can now actually reclaim the freed memory.
3. `imagetool.do_search`: TypeError on every first call
The `search_images` CLI passed `query_image_path=args.search` to `Embeddings.search_images_by_text_and_image`. That function's signature has no `query_image_path` parameter — it takes `query_image_data: Image.Image`. Python raised:
```
TypeError: search_images_by_text_and_image() got an unexpected keyword argument 'query_image_path'
```
before the encoder was ever touched, so the `search_images` script has been broken since the parameter was renamed. Fix: open the path with PIL inside a `with` block and pass `query_image_data=img`.
Test coverage added
Test plan
Net +399 / −29 lines across 4 files; most of that is new tests.
🤖 Generated with Claude Code