
Add GPU support and update Docker configuration and documentation#1

Merged
heshinth merged 9 commits into main from gpu-support on Apr 24, 2026

Conversation

@heshinth
Owner

No description provided.

Copilot AI review requested due to automatic review settings April 24, 2026 12:19
@heshinth heshinth merged commit ebb58ea into main Apr 24, 2026
2 checks passed

Copilot AI left a comment


Pull request overview

Adds GPU-capable packaging and deployment options for LocalEmbed, including a GPU Docker image and compose file, while updating runtime model loading behavior and documentation to support CPU/GPU workflows.

Changes:

  • Introduce cpu/gpu optional dependency extras and update the uv.lock accordingly.
  • Add LRU model caching plus optional CUDA provider configuration during model initialization.
  • Add a GPU Dockerfile + compose config, and update release workflow + README/.env sample for GPU usage.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds locked entries for fastembed-gpu and onnxruntime-gpu; updates project metadata to expose cpu/gpu extras. |
| pyproject.toml | Moves embedding backend into optional extras (cpu, gpu) and adds a poe task. |
| app/services/embedder.py | Implements LRU model cache with locking and adds GPU provider selection + provider logging. |
| app/config.py | Adds USE_GPU and MODEL_CACHE_LIMIT settings. |
| Dockerfile | Updates build to install the cpu extra explicitly. |
| Dockerfile.gpu | Introduces a CUDA-based GPU runtime image that installs the gpu extra and enables GPU mode via env. |
| docker-compose.yml | Documents MODEL_CACHE_LIMIT environment variable option. |
| docker-compose.gpu.yml | Adds a GPU compose file using the latest-gpu image with gpus: all. |
| README.md | Documents CPU vs GPU Docker/compose usage, tag scheme, and local dev extras. |
| .github/workflows/release.yml | Builds/pushes a GPU image variant and updates release notes examples. |
| .env.sample | Adds MODEL_CACHE_LIMIT and comments out thread/batch settings for customization. |
| app/main.py | Updates the root endpoint payload description. |
Comments suppressed due to low confidence (1)

pyproject.toml:15

  • fastembed (or fastembed-gpu) is no longer in the base dependencies, but the application still imports it (from fastembed import TextEmbedding) in, e.g., app/services/embedder.py and app/services/model_registery.py. A project installed without --extra cpu or --extra gpu will therefore fail with ModuleNotFoundError as soon as those modules are imported. Consider keeping a working default by either (a) adding fastembed>=0.8.0 back to base dependencies and keeping gpu as an extra, or (b) making the application code handle missing extras with a clear startup error and documenting that an extra is required.
dependencies = [
    "fastapi[standard]<1.0.0,>=0.116.1",
    "loguru>=0.7.3",
    "pydantic>=2.12.5",
    "pydantic-settings>=2.13.1",
]
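
Option (b) could look roughly like the following. This is a minimal sketch, not code from the PR: the helper name `require_embedding_backend` and the error message are hypothetical, and it assumes both extras expose the same importable module name (`fastembed`).

```python
import importlib.util


def require_embedding_backend(module_name: str = "fastembed") -> None:
    """Fail fast with an actionable message when the backend extra is missing.

    Hypothetical helper for illustration; intended to be called once at startup.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"The '{module_name}' package is not installed. "
            "Install one of the optional extras first, for example "
            "`uv sync --extra cpu` or `uv sync --extra gpu`."
        )
```

Calling such a check before the first `from fastembed import ...` would turn a bare ModuleNotFoundError into a message that names the missing extra.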


Comment thread app/services/embedder.py
Comment on lines +25 to +41
  with model_cache_lock:
      cached_model = model_cache.get(model_id)
      if cached_model is not None:
          model_cache.move_to_end(model_id)
          return cached_model

      logger.info(f"Loading embedding model into memory: {model_id}")
-     model_cache[model_id] = TextEmbedding(
-         model_id, threads=settings.EMBEDDING_THREADS

      # Configure providers based on GPU setting
      providers = None
      if settings.USE_GPU:
          providers = ["CUDAExecutionProvider"]
          logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")

      model = TextEmbedding(
          model_id, threads=settings.EMBEDDING_THREADS, providers=providers
      )

Copilot AI Apr 24, 2026

get_model() holds model_cache_lock while performing the potentially expensive model download/load (TextEmbedding(...)). This serializes concurrent requests and can stall the whole service during first-load or cache-miss storms. Consider using a double-checked approach (check cache under lock, release lock to load, then re-acquire to insert/move-to-end with a second check) so unrelated requests can continue while one thread loads a model.
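
The double-checked approach described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the class name `LRUModelCache` and the injected `loader` callable (standing in for the `TextEmbedding(...)` constructor) are hypothetical.

```python
import threading
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

M = TypeVar("M")


class LRUModelCache(Generic[M]):
    """LRU cache that does not hold its lock while a model is loading."""

    def __init__(self, limit: int, loader: Callable[[str], M]) -> None:
        self._limit = limit
        self._loader = loader  # hypothetical stand-in for the model constructor
        self._items: "OrderedDict[str, M]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, model_id: str) -> M:
        # First check: fast path for cache hits, lock held only briefly.
        with self._lock:
            if model_id in self._items:
                self._items.move_to_end(model_id)
                return self._items[model_id]

        # Slow path: load without the lock so other requests are not blocked.
        model = self._loader(model_id)

        # Second check: another thread may have inserted the same model meanwhile.
        with self._lock:
            if model_id in self._items:
                self._items.move_to_end(model_id)
                return self._items[model_id]
            self._items[model_id] = model
            while len(self._items) > self._limit:
                self._items.popitem(last=False)  # evict the least recently used
            return model
```

One trade-off of this shape: two threads that miss on the same model at the same time may both load it, and the second check discards the duplicate. A per-model lock or a shared future would avoid that extra work at the cost of more bookkeeping.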

Copilot uses AI. Check for mistakes.
Comment thread app/services/embedder.py
Comment on lines +36 to +41
    providers = ["CUDAExecutionProvider"]
    logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")

model = TextEmbedding(
    model_id, threads=settings.EMBEDDING_THREADS, providers=providers
)

Copilot AI Apr 24, 2026

When USE_GPU is enabled, providers are set to only ['CUDAExecutionProvider']. If CUDA libraries/providers aren’t available (driver/toolkit mismatch, container started without GPUs, etc.), ONNX Runtime typically fails session creation rather than falling back. Consider including CPUExecutionProvider as a fallback (ordered after CUDA), or catching provider init errors and retrying with CPU to keep the service available.

Suggested change
-     providers = ["CUDAExecutionProvider"]
-     logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")
- model = TextEmbedding(
-     model_id, threads=settings.EMBEDDING_THREADS, providers=providers
- )
+     providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
+     logger.info(
+         "GPU acceleration preferred (CUDAExecutionProvider) with CPUExecutionProvider fallback enabled."
+     )
+ try:
+     model = TextEmbedding(
+         model_id, threads=settings.EMBEDDING_THREADS, providers=providers
+     )
+ except Exception as e:
+     if settings.USE_GPU:
+         logger.warning(
+             f"Failed to initialize model {model_id} with GPU-enabled providers {providers}: {e}. Retrying with CPUExecutionProvider only."
+         )
+         model = TextEmbedding(
+             model_id,
+             threads=settings.EMBEDDING_THREADS,
+             providers=["CPUExecutionProvider"],
+         )
+     else:
+         raise

Comment thread app/services/embedder.py
Comment on lines +43 to 46
resolved_providers = model.model.model.get_providers()
logger.info(
    f"Model {model_id} loaded successfully with providers: {resolved_providers}"
)

Copilot AI Apr 24, 2026

resolved_providers = model.model.model.get_providers() relies on internal/private attributes of fastembed’s TextEmbedding implementation (multiple nested .model). This is brittle across library versions and can break startup even if embedding works. Prefer a public API for provider reporting (if available), or guard this log line so provider introspection failures don’t prevent the model from loading.
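
Guarding the log line might look like the sketch below. The helper name `describe_providers` is hypothetical; the attribute path is copied from the snippet under review and is treated as an internal detail that may not exist in every library version.

```python
from typing import Any, List, Optional


def describe_providers(model: Any) -> Optional[List[str]]:
    """Best-effort provider lookup for logging; returns None instead of raising."""
    try:
        # Internal attribute path taken from the reviewed code; it is not a
        # documented API, so any failure here is treated as "unknown".
        return list(model.model.model.get_providers())
    except Exception:
        # Introspection is only used for a log message, so it must never
        # prevent a successfully constructed model from being returned.
        return None
```

A caller could then log the result when it is not None and fall back to a generic "model loaded" message otherwise.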

Comment thread app/main.py
-     return {"Project": "LocalEmbed", "description": "LocalEmbed"}
+     return {
+         "Project": "LocalEmbed",
+         "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",

Copilot AI Apr 24, 2026

The description string has a trailing space at the end of the sentence, which will be reflected in the JSON output. Trim the extra whitespace for a cleaner response payload.

Suggested change
- "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",
+ "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint.",

Comment thread Dockerfile.gpu
ENV PYTHONUNBUFFERED=1

# Magic trick: Copy the 'uv' binary directly from Astral's official image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

Copilot AI Apr 24, 2026

COPY --from=ghcr.io/astral-sh/uv:latest ... makes the GPU image build non-reproducible and can break unexpectedly when the upstream latest tag changes. Prefer pinning uv to a specific version tag or digest (matching what’s used in the CPU Dockerfile) to keep builds deterministic.

Suggested change
- COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+ COPY --from=ghcr.io/astral-sh/uv:0.4.30 /uv /uvx /bin/

Comment thread README.md
- `MODEL_CACHE_LIMIT`: max number of models kept in memory (LRU eviction)
- `EMBEDDING_THREADS`: CPU threads for embedding computation
- `BATCH_SIZE`: number of inputs processed per batch
- `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs

Copilot AI Apr 24, 2026

The README suggests USE_GPU can be set to force CUDA “in local/non-GPU-image runs”, but enabling it without installing the GPU extra (and without a working CUDA runtime) will typically fail model initialization. Clarify that USE_GPU=true requires the gpu extra (uv sync --extra gpu / fastembed-gpu) and a CUDA-capable environment, or document the service’s fallback behavior if CUDA isn’t available.
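
One way to make the requirement explicit at startup is a pre-flight check along these lines. This is a sketch only: `resolve_providers` is a hypothetical helper, and the `available` argument is assumed to come from something like `onnxruntime.get_available_providers()` in a real deployment.

```python
from typing import List, Optional, Sequence


def resolve_providers(use_gpu: bool, available: Sequence[str]) -> Optional[List[str]]:
    """Return the provider list to request, or None to keep the library default."""
    if not use_gpu:
        # Mirrors the reviewed code, which passes providers=None in CPU mode.
        return None
    if "CUDAExecutionProvider" not in available:
        raise RuntimeError(
            "USE_GPU=true but CUDAExecutionProvider is not available. "
            "Install the gpu extra (`uv sync --extra gpu`) and run in a "
            "CUDA-capable environment, or set USE_GPU=false."
        )
    return ["CUDAExecutionProvider"]
```

Raising here trades availability for clarity; a deployment that prefers to stay up could instead log a warning and return a CPU-only list.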

Suggested change
- - `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs
+ - `USE_GPU`: set `true` to request the CUDA provider in local/non-GPU-image runs. This requires the GPU extra to be installed (`uv sync --extra gpu` or `fastembed-gpu`) and a working CUDA-capable environment; otherwise model initialization may fail.

Comment thread .env.sample
Comment on lines 13 to +15
  #Number of threads to use for embedding generation.
  #Adjust based on your CPU capabilities.
- EMBEDDING_THREADS=8
+ # EMBEDDING_THREADS=8

Copilot AI Apr 24, 2026

Minor formatting: the comment #Number of threads... is missing a space after #, and there are trailing spaces on these comment lines. Cleaning this up improves readability in the sample env file.
