Skip to content

fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271

Open
lstein wants to merge 1 commit into
masterfrom
lstein/fix/embeddings-cli-hygiene
Open

fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271
lstein wants to merge 1 commit into
masterfrom
lstein/fix/embeddings-cli-hygiene

Conversation

@lstein
Copy link
Copy Markdown
Owner

@lstein lstein commented May 23, 2026

Three small but real defects

The last three backend items from the 2026-05-20 code review, in one focused PR (~80 lines of fixes, ~320 lines of tests).

1. `_get_new_and_missing_images`: case-insensitive Path comparison

The new-vs-missing diff used plain `set[Path]` operations. Python's `Path.eq` is always case-sensitive, regardless of the underlying filesystem. On Windows NTFS and macOS HFS+/APFS-CI, the same image referenced with different case in its filename (`Photo.JPG` in the cache, `photo.jpg` on disk) was classified as both new and missing:

```
new = {photo.jpg} ← re-encoded (full CLIP pass)
missing = {Photo.JPG} ← orphaned in the cache
```

Net effect: silent double-encoding on every re-index, plus a stale row that `_filter_missing_images` then drops, plus stale UMAP / cluster-label caches keyed off the bogus filename count.

The fix funnels both sides through `_path_compare_key`, which returns `p.as_posix().casefold()`. Casefolded keys drive the set diff; original-case Paths flow back through `live_by_key` / `existing_by_key` so downstream filesystem operations see what the disk actually holds.

Trade-off on case-sensitive Linux: two files in the same directory whose names differ only in case collapse to one entry. That collision is unusual and harmless — the image is simply indexed once instead of twice — and far less destructive than the silent double-encoding the previous Path-equality diff produced on the majority of user platforms.

2. `search_images_by_text_and_image`: VRAM release was a no-op

The `finally` block in this hot search path tried to drop GPU tensor references before calling `_cleanup_cuda_memory`:

```python
for name in ("image_embedding", "pos_emb", ...):
if name in locals():
del locals()[name]
self._cleanup_cuda_memory(device)
```

`locals()` returns a snapshot dict in functions — mutating it doesn't touch the actual local bindings. The `del` was a no-op, the local refs lived on until the function frame was destroyed, and `torch.cuda.empty_cache()` couldn't reclaim memory pinned by those refs. VRAM accumulated across search queries until `search_images_by_text_and_image` returned.

Pre-declared all ten potentially-allocated tensors / arrays to `None` at the start of the try (so they're unconditionally bound), then `del` them explicitly in the finally before `_cleanup_cuda_memory`. The allocator can now actually reclaim the freed memory.

3. `imagetool.do_search`: TypeError on every first call

The `search_images` CLI passed `query_image_path=args.search` to `Embeddings.search_images_by_text_and_image`. That function's signature has no `query_image_path` parameter — it takes `query_image_data: Image.Image`. Python raised:

```
TypeError: search_images_by_text_and_image() got an unexpected keyword argument 'query_image_path'
```

before the encoder was ever touched, so the `search_images` script has been broken since the parameter was renamed. Fix: open the path with PIL inside a `with` block and pass `query_image_data=img`.

Test coverage added

File Tests What
`tests/backend/test_embeddings_diff.py` 9 `_path_compare_key` directly + `_get_new_and_missing_images` scenarios including the case-only-rename regression
`tests/backend/test_imagetool.py` 9 `do_search` / `do_text_search` kwarg shapes + `main()` basename → handler dispatch parametrized across all six CLI scripts

Test plan

  • `ruff check` — clean
  • `pytest tests/backend` — 299 passed (was 281 on master after fix(indexing): gate image indexing on pixel dimensions, not file bytes #269; +18 new tests)
  • Manual: `search_images path/to/query.jpg --embeddings album.npz` — confirm it runs instead of TypeError
  • Manual: re-index an album that contains a file the cache stored with different case (e.g. `mv photo.JPG Photo.jpg` and `update_images` afterward) — confirm the file isn't re-encoded
  • Manual (CUDA users): run 100 consecutive searches and watch `nvidia-smi` — VRAM should not climb across queries

Net +399 / −29 lines across 4 files; most of that is new tests.

🤖 Generated with Claude Code

… CLI kwarg

Three small but real defects in embeddings.py / imagetool.py — the
last three backend items from the 2026-05-20 code review.

## 1. _get_new_and_missing_images: case-insensitive Path comparison

The new-vs-missing diff used plain ``set[Path]`` operations. Python's
``Path.__eq__`` is *always* case-sensitive, regardless of the
underlying filesystem. On the dominant user platforms — Windows NTFS
and macOS HFS+/APFS-CI — that meant the same image referenced with
different case in its filename (``Photo.JPG`` in the cache, ``photo.jpg``
on disk) was classified as **both new and missing**:

    new      = {photo.jpg}  ← re-encoded (whole CLIP pass)
    missing  = {Photo.JPG}  ← orphaned in the cache

Net effect: silent double-encoding on every re-index, plus a stale
row that ``_filter_missing_images`` then drops, plus stale UMAP and
cluster-label caches keyed off the bogus filename count.

The fix funnels both sides through ``_path_compare_key``, which
returns ``p.as_posix().casefold()``. The casefolded keys drive the
set diff; the original-case Paths flow back through ``live_by_key``
and ``existing_by_key`` so downstream filesystem operations see what
the disk actually holds.

Trade-off on case-sensitive Linux: two files in the same directory
whose names differ only in case collapse to one entry. That collision
is unusual and harmless — the image is simply indexed once instead
of twice — and far less destructive than the silent double-encoding
the previous Path-equality diff produced on the majority of user
platforms.

## 2. search_images_by_text_and_image: VRAM release was a no-op

The ``finally`` block in this hot search path tried to drop GPU
tensor references before calling ``_cleanup_cuda_memory``:

    for name in ("image_embedding", "pos_emb", ...):
        if name in locals():
            del locals()[name]
    self._cleanup_cuda_memory(device)

``locals()`` returns a *snapshot dict* in functions — mutating it
doesn't touch the actual local bindings. The ``del`` was a no-op,
the local refs lived on until the function frame was destroyed, and
``torch.cuda.empty_cache()`` couldn't reclaim memory pinned by those
refs. VRAM accumulated across search queries until the search method
returned.

Pre-declare all ten potentially-allocated tensors / arrays to
``None`` at the start of the try (so they're unconditionally bound),
then ``del`` them explicitly in the finally before
``_cleanup_cuda_memory``. The allocator can now actually reclaim
the freed memory.

## 3. imagetool.do_search: TypeError on every first call

The ``search_images`` CLI passed ``query_image_path=args.search`` to
``Embeddings.search_images_by_text_and_image``. That function's
signature has no ``query_image_path`` parameter — it takes
``query_image_data: Image.Image``. Python raised ``TypeError: got
an unexpected keyword argument 'query_image_path'`` before the
encoder was ever touched, so the script has been broken since the
parameter was renamed.

Fix: open the path with PIL inside a context manager and pass
``query_image_data=img``. Verified by a new test that intercepts
the underlying call and asserts the kwarg name.

## Test coverage added

* ``tests/backend/test_embeddings_diff.py`` — 9 tests covering
  ``_path_compare_key`` directly plus four ``_get_new_and_missing_images``
  scenarios, including the case-only-rename regression.
* ``tests/backend/test_imagetool.py`` — 9 tests covering
  ``do_search`` / ``do_text_search`` kwarg shapes and ``main()``'s
  basename → handler dispatch.

297 backend tests pass (was 279).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant