Add GPU support and update Docker configuration and documentation #1
… pulls badge Co-authored-by: Copilot <copilot@github.com>
…ronment variables Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Adds GPU-capable packaging and deployment options for LocalEmbed, including a GPU Docker image and compose file, while updating runtime model loading behavior and documentation to support CPU/GPU workflows.
Changes:
- Introduce `cpu`/`gpu` optional dependency extras and update the `uv.lock` accordingly.
- Add LRU model caching plus optional CUDA provider configuration during model initialization.
- Add a GPU Dockerfile + compose config, and update release workflow + README/.env sample for GPU usage.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| uv.lock | Adds locked entries for fastembed-gpu and onnxruntime-gpu; updates project metadata to expose cpu/gpu extras. |
| pyproject.toml | Moves embedding backend into optional extras (cpu, gpu) and adds a poe task. |
| app/services/embedder.py | Implements LRU model cache with locking and adds GPU provider selection + provider logging. |
| app/config.py | Adds USE_GPU and MODEL_CACHE_LIMIT settings. |
| Dockerfile | Updates build to install the cpu extra explicitly. |
| Dockerfile.gpu | Introduces a CUDA-based GPU runtime image that installs the gpu extra and enables GPU mode via env. |
| docker-compose.yml | Documents MODEL_CACHE_LIMIT environment variable option. |
| docker-compose.gpu.yml | Adds a GPU compose file using the latest-gpu image with gpus: all. |
| README.md | Documents CPU vs GPU Docker/compose usage, tag scheme, and local dev extras. |
| .github/workflows/release.yml | Builds/pushes a GPU image variant and updates release notes examples. |
| .env.sample | Adds MODEL_CACHE_LIMIT and comments out thread/batch settings for customization. |
| app/main.py | Updates the root endpoint payload description. |
Comments suppressed due to low confidence (1)
pyproject.toml:15
`fastembed` (or `fastembed-gpu`) is no longer in the base `dependencies`, but the application imports `from fastembed import TextEmbedding` (e.g., `app/services/embedder.py`, `app/services/model_registery.py`). Installing the project without `--extra cpu/gpu` will raise `ModuleNotFoundError`. Consider keeping a working default by either (a) adding `fastembed>=0.8.0` back to base dependencies and keeping `gpu` as an extra, or (b) making the application code handle missing extras with a clear startup error and documenting that an extra is required.
dependencies = [
"fastapi[standard]<1.0.0,>=0.116.1",
"loguru>=0.7.3",
"pydantic>=2.12.5",
"pydantic-settings>=2.13.1",
]
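Option (b) above could look roughly like the following. This is only a sketch: the helper name `require_embedding_backend` is hypothetical and not part of the PR, and it assumes that both the `cpu` and `gpu` extras provide an importable `fastembed` module, as the review comment implies.

```python
import importlib.util


def require_embedding_backend(module_name: str = "fastembed") -> None:
    """Fail fast with a descriptive message if the embedding backend is missing.

    Hypothetical helper: intended to be called once at application startup,
    before any module runs `from fastembed import TextEmbedding`.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"'{module_name}' is not installed. Install one of the optional "
            "extras first, e.g. `uv sync --extra cpu` or `uv sync --extra gpu`."
        )
```

Calling this early turns a bare `ModuleNotFoundError` deep in an import chain into a message that tells the operator which extra to install.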
with model_cache_lock:
    cached_model = model_cache.get(model_id)
    if cached_model is not None:
        model_cache.move_to_end(model_id)
        return cached_model

    logger.info(f"Loading embedding model into memory: {model_id}")
    model_cache[model_id] = TextEmbedding(
        model_id, threads=settings.EMBEDDING_THREADS

    # Configure providers based on GPU setting
    providers = None
    if settings.USE_GPU:
        providers = ["CUDAExecutionProvider"]
        logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")

    model = TextEmbedding(
        model_id, threads=settings.EMBEDDING_THREADS, providers=providers
    )
get_model() holds model_cache_lock while performing the potentially expensive model download/load (TextEmbedding(...)). This serializes concurrent requests and can stall the whole service during first-load or cache-miss storms. Consider using a double-checked approach (check cache under lock, release lock to load, then re-acquire to insert/move-to-end with a second check) so unrelated requests can continue while one thread loads a model.
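A minimal sketch of the double-checked approach described above, under stated assumptions: the `ModelCache` class and its `loader` callable are hypothetical stand-ins (the PR uses module-level `model_cache` and `model_cache_lock`, and the loader would wrap `TextEmbedding(...)`), and `limit` plays the role of `MODEL_CACHE_LIMIT`.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """LRU cache that performs the expensive load outside the lock."""

    def __init__(self, loader: Callable[[str], Any], limit: int = 2) -> None:
        self._loader = loader  # hypothetical: would wrap TextEmbedding(...)
        self._limit = limit    # plays the role of MODEL_CACHE_LIMIT
        self._lock = threading.Lock()
        self._cache: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        # First check: cheap lookup while holding the lock.
        with self._lock:
            cached = self._cache.get(model_id)
            if cached is not None:
                self._cache.move_to_end(model_id)
                return cached

        # Cache miss: load without the lock so other requests are not blocked.
        model = self._loader(model_id)

        # Second check: another thread may have inserted this model meanwhile.
        with self._lock:
            existing = self._cache.get(model_id)
            if existing is not None:
                self._cache.move_to_end(model_id)
                return existing
            self._cache[model_id] = model
            while len(self._cache) > self._limit:
                self._cache.popitem(last=False)  # evict least recently used
            return model
```

The trade-off is that two threads missing on the same model at the same time may both load it, and one copy is discarded. If duplicate loads are too costly, a per-model lock or an in-flight future map would be the next step.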
        providers = ["CUDAExecutionProvider"]
        logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")

    model = TextEmbedding(
        model_id, threads=settings.EMBEDDING_THREADS, providers=providers
    )
When USE_GPU is enabled, providers are set to only ['CUDAExecutionProvider']. If CUDA libraries/providers aren’t available (driver/toolkit mismatch, container started without GPUs, etc.), ONNX Runtime typically fails session creation rather than falling back. Consider including CPUExecutionProvider as a fallback (ordered after CUDA), or catching provider init errors and retrying with CPU to keep the service available.
-    providers = ["CUDAExecutionProvider"]
-    logger.info("GPU acceleration (CUDAExecutionProvider) enabled.")
-model = TextEmbedding(
-    model_id, threads=settings.EMBEDDING_THREADS, providers=providers
-)
+    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
+    logger.info(
+        "GPU acceleration preferred (CUDAExecutionProvider) with CPUExecutionProvider fallback enabled."
+    )
+try:
+    model = TextEmbedding(
+        model_id, threads=settings.EMBEDDING_THREADS, providers=providers
+    )
+except Exception as e:
+    if settings.USE_GPU:
+        logger.warning(
+            f"Failed to initialize model {model_id} with GPU-enabled providers {providers}: {e}. Retrying with CPUExecutionProvider only."
+        )
+        model = TextEmbedding(
+            model_id,
+            threads=settings.EMBEDDING_THREADS,
+            providers=["CPUExecutionProvider"],
+        )
+    else:
+        raise
resolved_providers = model.model.model.get_providers()
logger.info(
    f"Model {model_id} loaded successfully with providers: {resolved_providers}"
)
resolved_providers = model.model.model.get_providers() relies on internal/private attributes of fastembed’s TextEmbedding implementation (multiple nested .model). This is brittle across library versions and can break startup even if embedding works. Prefer a public API for provider reporting (if available), or guard this log line so provider introspection failures don’t prevent the model from loading.
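One way to guard the log line, sketched under assumptions: `describe_providers` is a hypothetical helper, the attribute chain is copied from the PR rather than from any documented fastembed API, and the standard `logging` module stands in for the project's loguru logger to keep the example self-contained.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("localembed")  # hypothetical logger name


def describe_providers(model: object) -> Optional[List[str]]:
    """Best-effort lookup of the resolved execution providers; never raises."""
    try:
        # Private attribute chain taken from the PR; it may change between
        # fastembed versions, which is why the lookup is wrapped.
        return list(model.model.model.get_providers())
    except Exception as exc:
        logger.warning("Could not determine execution providers: %s", exc)
        return None
```

With this in place, a change in fastembed's internals would downgrade to a warning in the logs instead of failing model loading.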
-return {"Project": "LocalEmbed", "description": "LocalEmbed"}
+return {
+    "Project": "LocalEmbed",
+    "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",
The description string has a trailing space at the end of the sentence, which will be reflected in the JSON output. Trim the extra whitespace for a cleaner response payload.
-    "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint. ",
+    "description": "A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint.",
ENV PYTHONUNBUFFERED=1

# Magic trick: Copy the 'uv' binary directly from Astral's official image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
COPY --from=ghcr.io/astral-sh/uv:latest ... makes the GPU image build non-reproducible and can break unexpectedly when the upstream latest tag changes. Prefer pinning uv to a specific version tag or digest (matching what’s used in the CPU Dockerfile) to keep builds deterministic.
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:0.4.30 /uv /uvx /bin/
- `MODEL_CACHE_LIMIT`: max number of models kept in memory (LRU eviction)
- `EMBEDDING_THREADS`: CPU threads for embedding computation
- `BATCH_SIZE`: number of inputs processed per batch
- `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs
The README suggests `USE_GPU` can be set to force CUDA “in local/non-GPU-image runs”, but enabling it without installing the GPU extra (and without a working CUDA runtime) will typically fail model initialization. Clarify that `USE_GPU=true` requires the `gpu` extra (`uv sync --extra gpu` / `fastembed-gpu`) and a CUDA-capable environment, or document the service’s fallback behavior if CUDA isn’t available.
-- `USE_GPU`: set `true` to force CUDA provider in local/non-GPU-image runs
+- `USE_GPU`: set `true` to request the CUDA provider in local/non-GPU-image runs. This requires the GPU extra to be installed (`uv sync --extra gpu` or `fastembed-gpu`) and a working CUDA-capable environment; otherwise model initialization may fail.
 #Number of threads to use for embedding generation.
 #Adjust based on your CPU capabilities.
-EMBEDDING_THREADS=8
+# EMBEDDING_THREADS=8
Minor formatting: the comment #Number of threads... is missing a space after #, and there are trailing spaces on these comment lines. Cleaning this up improves readability in the sample env file.
No description provided.