fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg by lstein · Pull Request #271 · lstein/PhotoMapAI

lstein · 2026-05-23T00:06:14Z

Three small but real defects

The last three backend items from the 2026-05-20 code review, in one focused PR (~80 lines of fixes, ~320 lines of tests).

1. `_get_new_and_missing_images`: case-insensitive Path comparison

The new-vs-missing diff used plain `set[Path]` operations. Python's `Path.eq` is always case-sensitive, regardless of the underlying filesystem. On Windows NTFS and macOS HFS+/APFS-CI, the same image referenced with different case in its filename (`Photo.JPG` in the cache, `photo.jpg` on disk) was classified as both new and missing:

```
new = {photo.jpg} ← re-encoded (full CLIP pass)
missing = {Photo.JPG} ← orphaned in the cache
```

Net effect: silent double-encoding on every re-index, plus a stale row that `_filter_missing_images` then drops, plus stale UMAP / cluster-label caches keyed off the bogus filename count.

The fix funnels both sides through `_path_compare_key`, which returns `p.as_posix().casefold()`. Casefolded keys drive the set diff; original-case Paths flow back through `live_by_key` / `existing_by_key` so downstream filesystem operations see what the disk actually holds.

Trade-off on case-sensitive Linux: two files in the same directory whose names differ only in case collapse to one entry. That collision is unusual and harmless — the image is simply indexed once instead of twice — and far less destructive than the silent double-encoding the previous Path-equality diff produced on the majority of user platforms.

2. `search_images_by_text_and_image`: VRAM release was a no-op

The `finally` block in this hot search path tried to drop GPU tensor references before calling `_cleanup_cuda_memory`:

```python
for name in ("image_embedding", "pos_emb", ...):
if name in locals():
del locals()[name]
self._cleanup_cuda_memory(device)
```

`locals()` returns a snapshot dict in functions — mutating it doesn't touch the actual local bindings. The `del` was a no-op, the local refs lived on until the function frame was destroyed, and `torch.cuda.empty_cache()` couldn't reclaim memory pinned by those refs. VRAM accumulated across search queries until `search_images_by_text_and_image` returned.

Pre-declared all ten potentially-allocated tensors / arrays to `None` at the start of the try (so they're unconditionally bound), then `del` them explicitly in the finally before `_cleanup_cuda_memory`. The allocator can now actually reclaim the freed memory.

3. `imagetool.do_search`: TypeError on every first call

The `search_images` CLI passed `query_image_path=args.search` to `Embeddings.search_images_by_text_and_image`. That function's signature has no `query_image_path` parameter — it takes `query_image_data: Image.Image`. Python raised:

```
TypeError: search_images_by_text_and_image() got an unexpected keyword argument 'query_image_path'
```

before the encoder was ever touched, so the `search_images` script has been broken since the parameter was renamed. Fix: open the path with PIL inside a `with` block and pass `query_image_data=img`.

Test coverage added

File	Tests	What
`tests/backend/test_embeddings_diff.py`	9	`_path_compare_key` directly + `_get_new_and_missing_images` scenarios including the case-only-rename regression
`tests/backend/test_imagetool.py`	9	`do_search` / `do_text_search` kwarg shapes + `main()` basename → handler dispatch parametrized across all six CLI scripts

Test plan

`ruff check` — clean
`pytest tests/backend` — 299 passed (was 281 on master after fix(indexing): gate image indexing on pixel dimensions, not file bytes #269; +18 new tests)
Manual: `search_images path/to/query.jpg --embeddings album.npz` — confirm it runs instead of TypeError
Manual: re-index an album that contains a file the cache stored with different case (e.g. `mv photo.JPG Photo.jpg` and `update_images` afterward) — confirm the file isn't re-encoded
Manual (CUDA users): run 100 consecutive searches and watch `nvidia-smi` — VRAM should not climb across queries

Net +399 / −29 lines across 4 files; most of that is new tests.

🤖 Generated with Claude Code

… CLI kwarg Three small but real defects in embeddings.py / imagetool.py — the last three backend items from the 2026-05-20 code review. ## 1. _get_new_and_missing_images: case-insensitive Path comparison The new-vs-missing diff used plain ``set[Path]`` operations. Python's ``Path.__eq__`` is *always* case-sensitive, regardless of the underlying filesystem. On the dominant user platforms — Windows NTFS and macOS HFS+/APFS-CI — that meant the same image referenced with different case in its filename (``Photo.JPG`` in the cache, ``photo.jpg`` on disk) was classified as **both new and missing**: new = {photo.jpg} ← re-encoded (whole CLIP pass) missing = {Photo.JPG} ← orphaned in the cache Net effect: silent double-encoding on every re-index, plus a stale row that ``_filter_missing_images`` then drops, plus stale UMAP and cluster-label caches keyed off the bogus filename count. The fix funnels both sides through ``_path_compare_key``, which returns ``p.as_posix().casefold()``. The casefolded keys drive the set diff; the original-case Paths flow back through ``live_by_key`` and ``existing_by_key`` so downstream filesystem operations see what the disk actually holds. Trade-off on case-sensitive Linux: two files in the same directory whose names differ only in case collapse to one entry. That collision is unusual and harmless — the image is simply indexed once instead of twice — and far less destructive than the silent double-encoding the previous Path-equality diff produced on the majority of user platforms. ## 2. search_images_by_text_and_image: VRAM release was a no-op The ``finally`` block in this hot search path tried to drop GPU tensor references before calling ``_cleanup_cuda_memory``: for name in ("image_embedding", "pos_emb", ...): if name in locals(): del locals()[name] self._cleanup_cuda_memory(device) ``locals()`` returns a *snapshot dict* in functions — mutating it doesn't touch the actual local bindings. The ``del`` was a no-op, the local refs lived on until the function frame was destroyed, and ``torch.cuda.empty_cache()`` couldn't reclaim memory pinned by those refs. VRAM accumulated across search queries until the search method returned. Pre-declare all ten potentially-allocated tensors / arrays to ``None`` at the start of the try (so they're unconditionally bound), then ``del`` them explicitly in the finally before ``_cleanup_cuda_memory``. The allocator can now actually reclaim the freed memory. ## 3. imagetool.do_search: TypeError on every first call The ``search_images`` CLI passed ``query_image_path=args.search`` to ``Embeddings.search_images_by_text_and_image``. That function's signature has no ``query_image_path`` parameter — it takes ``query_image_data: Image.Image``. Python raised ``TypeError: got an unexpected keyword argument 'query_image_path'`` before the encoder was ever touched, so the script has been broken since the parameter was renamed. Fix: open the path with PIL inside a context manager and pass ``query_image_data=img``. Verified by a new test that intercepts the underlying call and asserts the kwarg name. ## Test coverage added * ``tests/backend/test_embeddings_diff.py`` — 9 tests covering ``_path_compare_key`` directly plus four ``_get_new_and_missing_images`` scenarios, including the case-only-rename regression. * ``tests/backend/test_imagetool.py`` — 9 tests covering ``do_search`` / ``do_text_search`` kwarg shapes and ``main()``'s basename → handler dispatch. 297 backend tests pass (was 279). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271

fix(embeddings,cli): case-insensitive path diff, VRAM release, broken CLI kwarg#271
lstein wants to merge 1 commit into
masterfrom
lstein/fix/embeddings-cli-hygiene

lstein commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lstein commented May 23, 2026

Three small but real defects

1. `_get_new_and_missing_images`: case-insensitive Path comparison

2. `search_images_by_text_and_image`: VRAM release was a no-op

3. `imagetool.do_search`: TypeError on every first call

Test coverage added

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant