[Feature] Blackwell GPU (RTX 50-series, sm_120) support for immich-machine-learning:release-cuda #28582
Closed
ekropotin started this conversation in Feature Request
Replies: 1 comment
This discussion has automatically been closed as it is likely a duplicate. We get a lot of duplicate threads each day, which is why we ask you in the template to confirm that you searched for duplicates before opening one. If you're sure this is not a duplicate, please leave a comment and we will reopen the thread if necessary.
I have searched the existing feature requests, both open and closed, to make sure this is not a duplicate request.
The feature
Opening this as a Discussion rather than an Issue because the previous Issue (#28031) was auto-closed by the duplicate-detection bot with no actual duplicate identified, and my reply asking for the duplicate link went unanswered. A Discussion seems like the right venue to consolidate the problem, the workaround, and the open question on how upstream wants to track this.
Problem
The ghcr.io/immich-app/immich-machine-learning:release-cuda image silently falls back to CPU on Blackwell GPUs (RTX 50-series, compute capability sm_120).
Reproduced on RTX 5060 Ti (driver 580.126.18, CUDA 13.0 capable). Also reported in #28032 review and by @yopichy on a separate RTX 50-series setup. The DGX Spark GB10 subthread in #10647 hits the same sm_120/sm_121 family from the arm64 side.
Root cause
The prod-cuda stage is built on nvidia/cuda:12.2.2-runtime-ubuntu22.04. CUDA 12.2 predates Blackwell: sm_120 support landed in CUDA 12.8 (Feb 2025).
Solution
A naive bump to 12.8 doesn't fully work. The prod-cuda stage installs libcudnn9-cuda-12=9.10.2.21-1 (kept at 9.10 because 9.11 drops Pascal). Its apt dependency chain drags in cuda-cudart-12-2 / cuda-libraries-12-2 and runs update-alternatives, so /usr/local/cuda is silently re-pointed at /usr/local/cuda-12.2/, leaving the image effectively on CUDA 12.2 again. apt-get autoremove in the later prod stage re-triggers update-alternatives, so even an explicit ln -sfn or update-alternatives --set in prod-cuda gets overridden.
A working mitigation (currently running in production on my fork): reinstall cuda-cudart-12-8 after cuDNN, apt-mark manual it so autoremove doesn't strip it, set it as the primary alternative, and re-pin the symlink as a guard in the final prod stage. Full Dockerfile and root-cause notes: ghcr.io/ekropotin/immich-machine-learning
Asks
find a public reference.
release-cuda-blackwell variant tag (or build-arg-driven matrix) be acceptable? Happy to draft a PR that conforms to the template / changelog-label rules this time.
libcudnn9/cudart apt trap above is non-obvious and bites anyone who tries the version bump locally; worth a note in the ML Dockerfile regardless of which direction this goes.
Platform
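For anyone trying the version bump locally, the mitigation steps from the Solution section might look roughly like the shell sketch below. This is not the Dockerfile from the fork, only an illustration under assumptions: the function names repin_cuda and guard_cuda_link are hypothetical, the install prefix /usr/local/cuda-12.8 is assumed, and so is the idea that the NVIDIA packages register an update-alternatives group named cuda. It also assumes apt package lists are still available when repin_cuda runs.

```shell
#!/bin/sh
# Sketch only: keep /usr/local/cuda pinned to CUDA 12.8 after cuDNN's apt
# dependencies (and a later autoremove) try to re-point it at 12.2.
set -eu

# Assumed paths; override via the environment if the image differs.
CUDA_ROOT="${CUDA_ROOT:-/usr/local/cuda-12.8}"
CUDA_LINK="${CUDA_LINK:-/usr/local/cuda}"

# Intended for the prod-cuda stage, after libcudnn9-cuda-12 is installed.
repin_cuda() {
    # Reinstall the 12.8 runtime so its maintainer scripts run again.
    apt-get install -y --reinstall --no-install-recommends cuda-cudart-12-8
    # Mark it manual so a later `apt-get autoremove` does not strip it.
    apt-mark manual cuda-cudart-12-8
    # Assumption: the alternative group is named "cuda".
    update-alternatives --set cuda "$CUDA_ROOT"
}

# Intended as the last step of the final prod stage, after autoremove.
guard_cuda_link() {
    # Re-pin the symlink, then verify where it actually resolves.
    ln -sfn "$CUDA_ROOT" "$CUDA_LINK"
    target="$(readlink -f "$CUDA_LINK")"
    expected="$(readlink -f "$CUDA_ROOT")"
    if [ "$target" != "$expected" ]; then
        echo "error: $CUDA_LINK resolves to $target, expected $expected" >&2
        return 1
    fi
    echo "ok: $CUDA_LINK -> $target"
}
```

In a Dockerfile these would presumably be called from RUN steps in the respective stages. The guard fails the build instead of letting the image fall back to CPU silently.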