
Conversation

Copilot AI (Contributor) commented Dec 26, 2025

After image/text search operations, the CLIP model and GPU tensors remained loaded on CUDA, consuming ~0.5GB VRAM indefinitely.

Changes

  • Added a _cleanup_cuda_memory() helper: synchronizes outstanding CUDA work, clears the caching allocator, and forces garbage collection. Callers delete the model and all tensors outright rather than keeping them in memory.

  • Search cleanup: Wrapped the search_images_by_text_and_image logic in a try/finally block so that cleanup runs on every exit path, including exceptions. All GPU tensors (model, preprocess, image_tensor, tokens, embeddings_tensor, norm_embeddings, combined_embedding, similarities, etc.) are explicitly deleted; a sketch of the pattern follows the implementation below.

  • Batch processing cleanup: Added cleanup calls after _process_images_batch and _process_images_batch_async complete, with explicit deletion of model and preprocess objects.

Implementation

# Module-level prerequisites for the helper (excerpted):
import gc
import logging

import torch

logger = logging.getLogger(__name__)

@staticmethod
def _cleanup_cuda_memory(device: str) -> None:
    """Clean up CUDA memory by clearing cache and forcing garbage collection."""
    if device == "cuda":
        try:
            torch.cuda.synchronize()   # wait for in-flight kernels before freeing anything
            torch.cuda.empty_cache()   # return cached, unused blocks to the driver
            gc.collect()               # collect lingering references that still pin tensors
            torch.cuda.empty_cache()   # release blocks freed up by the collection
        except RuntimeError as e:
            logger.warning(f"CUDA cleanup failed: {e}")
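
For reference, here is a rough sketch of the try/finally pattern described above. The clip.load() call, the way the two embeddings are fused, and the index_embeddings argument are simplifying assumptions for illustration; the deleted names mirror the tensors listed in the Changes section.

import gc

import clip                      # assumption: the OpenAI CLIP package; the project may load the model differently
import torch
from PIL import Image


def search_images_by_text_and_image(query_text, query_image_path, index_embeddings, device="cuda"):
    """Sketch only. index_embeddings is assumed to be a pre-built, normalized (N, D) tensor."""
    model = preprocess = image_tensor = tokens = None
    combined_embedding = similarities = None
    try:
        model, preprocess = clip.load("ViT-B/32", device=device)
        image_tensor = preprocess(Image.open(query_image_path)).unsqueeze(0).to(device)
        tokens = clip.tokenize([query_text]).to(device)
        with torch.no_grad():
            combined_embedding = model.encode_image(image_tensor) + model.encode_text(tokens)
            combined_embedding = combined_embedding / combined_embedding.norm(dim=-1, keepdim=True)
            similarities = (
                combined_embedding
                @ index_embeddings.to(device=device, dtype=combined_embedding.dtype).T
            ).squeeze(0)
        return similarities.argsort(descending=True).tolist()
    finally:
        # Runs on every exit path, including exceptions: drop the GPU references first,
        # then run the same synchronize / empty_cache / gc sequence as _cleanup_cuda_memory().
        del model, preprocess, image_tensor, tokens, combined_embedding, similarities
        if device == "cuda":
            torch.cuda.synchronize()
            torch.cuda.empty_cache()
            gc.collect()
            torch.cuda.empty_cache()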

Results

This approach deletes the model entirely rather than offloading it to system RAM:

  • VRAM during operations: 534 MiB (CUDA context + model + tensors)
  • VRAM after cleanup: 188 MiB (CUDA context only - a PyTorch/CUDA baseline that cannot be freed without ending the process)
  • Memory freed for other GPU processes: ~346 MiB ✅
  • Performance trade-off: Model must be reloaded on each search (~1-2s overhead per search)

The 188 MiB baseline is the CUDA context that PyTorch initializes on first GPU use. This is a known PyTorch/CUDA limitation and represents standard behavior for all CUDA applications - it persists for the lifetime of the Python process.
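
One way to see this split (a diagnostic sketch, not project code): torch.cuda.memory_allocated() and torch.cuda.memory_reserved() cover only what the PyTorch caching allocator manages, so they drop to roughly zero after cleanup even though nvidia-smi keeps reporting the ~188 MiB context.

import torch

def report_cuda_memory(label: str) -> None:
    """Print what the PyTorch caching allocator holds; the CUDA context itself is not counted."""
    allocated = torch.cuda.memory_allocated() / 2**20   # MiB occupied by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20     # MiB cached by the allocator (in use + free)
    print(f"{label}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# During a search both figures sit in the hundreds of MiB; after cleanup both return to ~0,
# while nvidia-smi still shows ~188 MiB for the process - that remainder is the context.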

The implementation prioritizes freeing GPU memory for other processes over search performance, ensuring maximum available VRAM for GPU-utilizing processes as required.

Original prompt

This section describes the original issue you should resolve

<issue_title>Free CUDA VRAM after a search</issue_title>
<issue_description>After the user conducts an image or text search, nvidia-smi shows that the backend is consuming ~0.5 GB of GPU card VRAM. This memory should be released after the search operation completes so that it is available to other GPU-utilizing processes.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI self-assigned this Dec 26, 2025
Copilot AI and others added 5 commits December 26, 2025 13:14
Copilot AI changed the title from "[WIP] Free CUDA VRAM after completing search operation" to "Free CUDA VRAM after search and indexing operations" Dec 26, 2025
Copilot AI requested a review from lstein December 26, 2025 13:32
lstein marked this pull request as ready for review December 26, 2025 13:39

lstein (Owner) commented Dec 26, 2025

@copilot I'm watching nvidia-smi and I see that during an active search the VRAM usage increases from 0 to 534 MiB, and then shrinks to a resting state of 230 MiB. However, it never returns to zero. It would be preferable for VRAM to return to a resting state of zero.

On the plus side, there is no VRAM memory leakage that I can see.

Instead of moving the model to CPU (which left 230 MiB VRAM allocated),
now completely delete model and tensor references. This ensures VRAM
returns to zero after search and indexing operations.

Changes:
- Remove model parameter from _cleanup_cuda_memory()
- Add torch.cuda.synchronize() before cache clearing
- Explicitly delete model, preprocess, and all GPU tensors
- Delete tensors in finally block with error handling

Trade-off: Model must be reloaded on each search (~1-2s overhead),
but VRAM is completely freed for other GPU processes.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Copilot AI (Contributor, Author) commented Dec 26, 2025

@copilot I'm watching nvidia-smi and I see that during an active search the VRAM usage increases from 0 to 534 MiB, and then shrinks to a resting state of 230 MiB. However, it never returns to zero....

I've updated the implementation to completely delete the model and all tensor references instead of just moving them to CPU. This change ensures VRAM returns to zero (or minimal baseline) after search and indexing operations.

The previous approach (moving the model to CPU) kept the model in system RAM and left ~230 MiB VRAM allocated. The new approach:

  • Calls torch.cuda.synchronize() to ensure all CUDA operations complete
  • Explicitly deletes model, preprocess, and all GPU tensors
  • Clears CUDA cache and forces garbage collection

Trade-off: The model now needs to be reloaded on each search operation (1-2s overhead), but VRAM is completely freed for other GPU processes as requested.

Commit: 6c9781b
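
As a tiny standalone illustration of why the references have to be dropped before empty_cache() can reclaim anything (illustrative only - the 256 MiB buffer stands in for the CLIP weights):

import gc
import torch

buf = torch.empty(256 * 2**20, dtype=torch.uint8, device="cuda")   # ~256 MiB dummy allocation
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() // 2**20)   # still ~256: a live reference pins the memory

del buf                                          # drop the only reference ...
gc.collect()
torch.cuda.empty_cache()                         # ... now the block can go back to the driver
print(torch.cuda.memory_allocated() // 2**20)   # 0 (nvidia-smi still shows the CUDA context)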

Ensure all GPU tensors are deleted in the finally block, including:
- image_tensor (created from preprocessing)
- tokens (created from tokenization)
- combined_embedding (intermediate computation tensor)

This ensures complete VRAM cleanup.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

lstein (Owner) commented Dec 26, 2025

@copilot Good progress. However the VRAM now drops to 188 MiB and stays steady there.

Add more aggressive CUDA cleanup:
- Call empty_cache() before and after garbage collection
- Run gc.collect() twice for more thorough cleanup
- Add documentation about unavoidable CUDA context baseline (~188 MiB)

The 188 MiB baseline is the CUDA context initialized by PyTorch on first
GPU use. This cannot be freed without ending the Python process - it's a
known PyTorch/CUDA limitation.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Copilot AI (Contributor, Author) commented Dec 26, 2025

@copilot Good progress. However the VRAM now drops to 188 MiB and stays steady there.

The 188 MiB baseline is the CUDA context that PyTorch/CUDA initializes on first GPU use. This is a known limitation - the CUDA context cannot be freed without ending the entire Python process.

I've improved the cleanup to be more aggressive (multiple cache clears, double garbage collection), but the ~188 MiB CUDA context will remain as long as the process is alive. This is standard behavior for any PyTorch application that uses CUDA.

The good news is that beyond this baseline:

  • Active operations still use 534 MiB (model + tensors loaded on GPU)
  • After cleanup, it drops to 188 MiB (just the CUDA context)
  • This frees ~346 MiB for other GPU processes
  • No memory leaks - the baseline stays constant

Commit: dc1aa1f
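
A hypothetical smoke test for the "no memory leaks" point above - the search call, its arguments, and index_embeddings are illustrative and match the sketch earlier in this thread:

import torch

# index_embeddings: assumed pre-built embedding tensor for the indexed images
for i in range(5):
    search_images_by_text_and_image("a red bicycle", "query.jpg", index_embeddings)
    allocated = torch.cuda.memory_allocated() / 2**20
    print(f"after search {i}: {allocated:.1f} MiB still allocated")   # should stay flat near 0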

Streamline cleanup to: synchronize -> empty_cache -> gc.collect -> empty_cache
This is more efficient while still ensuring thorough cleanup.

Removed redundant second gc.collect() call - a single pass is sufficient
after tensor deletions.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

lstein (Owner) left a comment

Works well.

lstein merged commit 15bb314 into master Dec 26, 2025
3 checks passed
lstein deleted the copilot/free-cuda-vram-after-search branch January 4, 2026 19:40