Add GPU support and update Docker configuration and documentation by heshinth · Pull Request #1 · heshinth/LocalEmbed

heshinth · 2026-04-24T12:19:29Z

No description provided.

Copilot

Pull request overview

Adds GPU-capable packaging and deployment options for LocalEmbed, including a GPU Docker image and compose file, while updating runtime model loading behavior and documentation to support CPU/GPU workflows.

Changes:

- Introduce `cpu`/`gpu` optional dependency extras and update `uv.lock` accordingly.
- Add LRU model caching plus optional CUDA provider configuration during model initialization.
- Add a GPU Dockerfile and compose config, and update the release workflow, README, and `.env.sample` for GPU usage.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `uv.lock` | Adds locked entries for `fastembed-gpu` and `onnxruntime-gpu`; updates project metadata to expose `cpu`/`gpu` extras. |
| `pyproject.toml` | Moves embedding backend into optional extras (`cpu`, `gpu`) and adds a poe task. |
| `app/services/embedder.py` | Implements LRU model cache with locking and adds GPU provider selection + provider logging. |
| `app/config.py` | Adds `USE_GPU` and `MODEL_CACHE_LIMIT` settings. |
| `Dockerfile` | Updates build to install the `cpu` extra explicitly. |
| `Dockerfile.gpu` | Introduces a CUDA-based GPU runtime image that installs the `gpu` extra and enables GPU mode via env. |
| `docker-compose.yml` | Documents `MODEL_CACHE_LIMIT` environment variable option. |
| `docker-compose.gpu.yml` | Adds a GPU compose file using the `latest-gpu` image with `gpus: all`. |
| `README.md` | Documents CPU vs GPU Docker/compose usage, tag scheme, and local dev extras. |
| `.github/workflows/release.yml` | Builds/pushes a GPU image variant and updates release notes examples. |
| `.env.sample` | Adds `MODEL_CACHE_LIMIT` and comments out thread/batch settings for customization. |
| `app/main.py` | Updates the root endpoint payload description. |

Comments suppressed due to low confidence (1)

pyproject.toml:15

`fastembed` (or `fastembed-gpu`) is no longer in the base dependencies, but the application still runs `from fastembed import TextEmbedding` (e.g., in `app/services/embedder.py` and `app/services/model_registery.py`). Installing the project without `--extra cpu` or `--extra gpu` leaves that import unresolved, so the application will raise `ModuleNotFoundError` at startup. Consider keeping a working default by either (a) adding `fastembed>=0.8.0` back to the base dependencies and keeping `gpu` as an extra, or (b) making the application code handle missing extras with a clear startup error and documenting that an extra is required.

dependencies = [
    "fastapi[standard]<1.0.0,>=0.116.1",
    "loguru>=0.7.3",
    "pydantic>=2.12.5",
    "pydantic-settings>=2.13.1",
]
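
Option (b) could look roughly like the sketch below. It is only an illustration: `require_backend` is a hypothetical helper name, not something in the PR, and the error wording is an assumption. The idea is to translate a missing backend into a message that names the extras.

```python
import importlib
from types import ModuleType


def require_backend(module_name: str = "fastembed") -> ModuleType:
    """Import the embedding backend or fail with an actionable message.

    Hypothetical helper for illustration; not part of the PR.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Only translate the error when the backend itself is missing,
        # not when one of the backend's own dependencies fails to import.
        if exc.name != module_name:
            raise
        raise RuntimeError(
            f"Embedding backend '{module_name}' is not installed. "
            "Install one of the extras, e.g. `uv sync --extra cpu` "
            "or `uv sync --extra gpu`."
        ) from exc
```

Calling such a helper once at startup, before any model is requested, would surface the problem immediately rather than on the first embedding request.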

Copilot · 2026-04-24T12:23:41Z

+    with model_cache_lock:
+        cached_model = model_cache.get(model_id)
+        if cached_model is not None:
+            model_cache.move_to_end(model_id)
+            return cached_model
+
        logger.info(f"Loading embedding model into memory: {model_id}")
-        model_cache[model_id] = TextEmbedding(
-            model_id, threads=settings.EMBEDDING_THREADS
+
+        # Configure providers based on GPU setting
+        providers = None
+        if settings.USE_GPU:
+            providers = ["CUDAExecutionProvider"]
+            logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")
+
+        model = TextEmbedding(
+            model_id, threads=settings.EMBEDDING_THREADS, providers=providers
+        )


`get_model()` holds `model_cache_lock` while performing the potentially expensive model download/load (`TextEmbedding(...)`). This serializes concurrent requests and can stall the whole service during first-load or cache-miss storms. Consider using a double-checked approach (check cache under lock, release lock to load, then re-acquire to insert/move-to-end with a second check) so unrelated requests can continue while one thread loads a model.
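
A minimal sketch of that double-checked pattern, combined with LRU eviction, is shown below. The `ModelCache` class and its `loader` callable are hypothetical stand-ins (the loader plays the role of `TextEmbedding(...)`); the PR's actual `get_model()` may be structured differently.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """LRU cache that does not hold its lock while a model is loading."""

    def __init__(self, loader: Callable[[str], Any], limit: int = 2) -> None:
        self._loader = loader  # hypothetical stand-in for TextEmbedding(...)
        self._limit = limit
        self._lock = threading.Lock()
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        # First check: fast path under the lock.
        with self._lock:
            cached = self._models.get(model_id)
            if cached is not None:
                self._models.move_to_end(model_id)
                return cached

        # Slow path: load outside the lock so other requests can proceed.
        model = self._loader(model_id)

        # Second check: another thread may have inserted the same model.
        with self._lock:
            existing = self._models.get(model_id)
            if existing is not None:
                self._models.move_to_end(model_id)
                return existing
            self._models[model_id] = model
            while len(self._models) > self._limit:
                self._models.popitem(last=False)  # evict least recently used
            return model

    def keys(self) -> List[str]:
        with self._lock:
            return list(self._models)
```

One trade-off of this sketch: two threads that miss on the same model at the same time will both load it, and one copy is discarded. If duplicate loads are too costly, a per-model lock or an in-flight future would avoid them at the price of more bookkeeping.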

Copilot · 2026-04-24T12:23:41Z

+            providers = ["CUDAExecutionProvider"]
+            logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")
+
+        model = TextEmbedding(
+            model_id, threads=settings.EMBEDDING_THREADS, providers=providers
+        )


When `USE_GPU` is enabled, providers are set to only `['CUDAExecutionProvider']`. If CUDA libraries/providers aren’t available (driver/toolkit mismatch, container started without GPUs, etc.), ONNX Runtime typically fails session creation rather than falling back. Consider including `CPUExecutionProvider` as a fallback (ordered after CUDA), or catching provider init errors and retrying with CPU to keep the service available.

Suggested change

-            providers = ["CUDAExecutionProvider"]
-            logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")
-
-        model = TextEmbedding(
-            model_id, threads=settings.EMBEDDING_THREADS, providers=providers
-        )
+            providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
+            logger.info(
+                "GPU acceleration preferred (CUDAExecutionProvider) with CPUExecutionProvider fallback enabled."
+            )
+
+        try:
+            model = TextEmbedding(
+                model_id, threads=settings.EMBEDDING_THREADS, providers=providers
+            )
+        except Exception as e:
+            if settings.USE_GPU:
+                logger.warning(
+                    f"Failed to initialize model {model_id} with GPU-enabled providers {providers}: {e}. Retrying with CPUExecutionProvider only."
+                )
+                model = TextEmbedding(
+                    model_id,
+                    threads=settings.EMBEDDING_THREADS,
+                    providers=["CPUExecutionProvider"],
+                )
+            else:
+                raise

Copilot · 2026-04-24T12:23:41Z

+        resolved_providers = model.model.model.get_providers()
+        logger.info(
+            f"Model {model_id} loaded successfully with providers: {resolved_providers}"
        )


`resolved_providers = model.model.model.get_providers()` relies on internal/private attributes of fastembed’s `TextEmbedding` implementation (multiple nested `.model`). This is brittle across library versions and can break startup even if embedding works. Prefer a public API for provider reporting (if available), or guard this log line so provider introspection failures don’t prevent the model from loading.
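
Guarding the log line might look like the sketch below. `describe_providers` is a hypothetical helper name; it assumes the nested `.model.model` layout quoted in the diff and simply gives up if that layout is absent.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def describe_providers(model: Any) -> Optional[List[str]]:
    """Best-effort lookup of the execution providers behind a model.

    Assumes the nested `.model.model` layout quoted in the diff;
    returns None instead of raising if that layout is not present.
    """
    try:
        return list(model.model.model.get_providers())
    except Exception as exc:  # introspection must never block model loading
        logger.debug("Could not determine execution providers: %s", exc)
        return None
```

With this shape the success log can report `describe_providers(model)` and print `None` when the internals change, instead of failing the load.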

Copilot · 2026-04-24T12:23:42Z

-    return {"Project": "LocalEmbed", "description": "LocalEmbed"}
+    return {
+        "Project": "LocalEmbed",
+        "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",


The description string has a trailing space at the end of the sentence, which will be reflected in the JSON output. Trim the extra whitespace for a cleaner response payload.

Suggested change

-        "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",
+        "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint.",

Copilot · 2026-04-24T12:23:42Z

+ENV PYTHONUNBUFFERED=1
+
+# Magic trick: Copy the 'uv' binary directly from Astral's official image
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/


`COPY --from=ghcr.io/astral-sh/uv:latest ...` makes the GPU image build non-reproducible and can break unexpectedly when the upstream `latest` tag changes. Prefer pinning `uv` to a specific version tag or digest (matching what’s used in the CPU Dockerfile) to keep builds deterministic.

Suggested change

-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:0.4.30 /uv /uvx /bin/

Copilot · 2026-04-24T12:23:42Z

+- `MODEL_CACHE_LIMIT`: max number of models kept in memory (LRU eviction)
+- `EMBEDDING_THREADS`: CPU threads for embedding computation
+- `BATCH_SIZE`: number of inputs processed per batch
+- `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs


The README suggests `USE_GPU` can be set to force CUDA “in local/non-GPU-image runs”, but enabling it without installing the GPU extra (and without a working CUDA runtime) will typically fail model initialization. Clarify that `USE_GPU=true` requires the `gpu` extra (`uv sync --extra gpu` / `fastembed-gpu`) and a CUDA-capable environment, or document the service’s fallback behavior if CUDA isn’t available.

Suggested change

-- `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs
+- `USE_GPU`: set `true` to request the CUDA provider in local/non-GPU-image runs. This requires the GPU extra to be installed (`uv sync --extra gpu` or `fastembed-gpu`) and a working CUDA-capable environment; otherwise model initialization may fail.
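
The requirement could also be checked in code before any model is built. The sketch below is an assumption-laden illustration: `choose_providers` is a hypothetical helper, and its `available` argument is expected to come from `onnxruntime.get_available_providers()`. It is passed in as a plain sequence here so the function stays independent of which extra is installed.

```python
from typing import List, Optional, Sequence


def choose_providers(use_gpu: bool, available: Sequence[str]) -> Optional[List[str]]:
    """Pick an execution-provider list from what the runtime reports.

    Hypothetical helper for illustration. Returns None when GPU mode is
    off, mirroring the `providers = None` default in the diff.
    """
    if not use_gpu:
        return None
    if "CUDAExecutionProvider" not in available:
        raise RuntimeError(
            "USE_GPU=true but CUDAExecutionProvider is not available. "
            "Install the gpu extra (`uv sync --extra gpu`) and run in a "
            "CUDA-capable environment."
        )
    # Prefer CUDA, keep CPU as an ordered fallback.
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

Whether to raise, as here, or to log a warning and continue on CPU is a design choice the README would need to document either way.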

Copilot · 2026-04-24T12:23:42Z

 #Number of threads to use for embedding generation. 
 #Adjust based on your CPU capabilities.
-EMBEDDING_THREADS=8
+# EMBEDDING_THREADS=8


Minor formatting: the comment `#Number of threads...` is missing a space after `#`, and there are trailing spaces on these comment lines. Cleaning this up improves readability in the sample env file.

heshinth and others added 9 commits April 23, 2026 19:46

Add GPU support with Dockerfile and docker-compose for localembed

b44f5f1

Add USE_GPU environment variable and config setting for CUDA support

722fb91

Refactor GPU support configuration and update environment variables i…

8dec3cb

…n Docker setup

Enhance GPU support in release workflow and update README with Docker…

2f8cd77

… pulls badge Co-authored-by: Copilot <copilot@github.com>

Update README to enhance GPU deployment instructions and add new envi…

2be903e

…ronment variables Co-authored-by: Copilot <copilot@github.com>

Update root endpoint description for clarity and detail

14e8c09

Update token_count comment

6e34855

Add MODEL_CACHE_LIMIT to configuration and update documentation

d7dbd64

Co-authored-by: Copilot <copilot@github.com>

Update .env.sample to clarify EMBEDDING_THREADS and BATCH_SIZE settings

1ac4daf

Copilot AI review requested due to automatic review settings April 24, 2026 12:19

Copilot started reviewing on behalf of heshinth April 24, 2026 12:19 View session

heshinth merged commit ebb58ea into main Apr 24, 2026
2 checks passed

Copilot AI reviewed Apr 24, 2026

heshinth merged 9 commits into main from gpu-support
