Compute Capability Misidentification with PhysX cudart library #4008
Comments
I'll try to find it by code inspection, but could you share a server log with OLLAMA_DEBUG=1 set?
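(For anyone gathering that log on Windows, a minimal sketch, assuming a plain cmd prompt; the log filename is arbitrary:)

```
REM enable verbose logging for this cmd session only
set OLLAMA_DEBUG=1
REM restart the server and capture its output to a file
ollama serve > ollama-debug.log 2>&1
```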
Yikes, yeah, those responses are definitely incorrect.
One data point that may help: can you search on your system for other instances of the cudart library (cudart64_110.dll)? In addition, can you share the output of nvidia-smi?
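(A sketch of how that search might look from cmd, assuming the stock where.exe; /R recurses from the given root, so it can take a while on a large drive:)

```
REM recursively search the system drive for stray copies of the runtime
where /R C:\ cudart64_110.dll
REM the PhysX directory named later in this thread is one place a copy ships
dir "c:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64*"
```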
One more data point. On my test system running Win 11 Pro, I have driver 546.12 and CUDA v12.3 installed (as well as v11). Our bundled v11…
No longer able to replicate the Compute Capability issue; updating CUDA and restarting a couple of times may have done it 😅😅 By bundled, are you referring to the .dll stored in Programs/Ollama?
However, it seems that while the model is loaded into VRAM, all the compute is done on the CPU. I'm looking into solutions; any suggestions? I see that not all layers are sent to the GPU. I tried using phi3 as a smaller model (in case llama3 was not being fully loaded), but it is definitely being run on the CPU. Is shared memory with the integrated GPU (the other 20 GB) not sufficient as VRAM?

Phi3:
CPU being maxed, GPU doing no compute:
Ok, I assume the issue is the same as #3201. However, I'm not sure why phi3 will not fit all of its layers in VRAM (30 of 33 layers), given that it is a 2.3 GB model and I have 4 GB of VRAM. Task Manager also shows dedicated GPU memory at 3.0/4.0 GB. Any suggestions on how to fit the full model / get Ollama to use a little more VRAM? I understand the question is unrelated to the original thread (apologies for that); thanks for the help!
Ran into the Compute Capability issue again this morning (Compute Capability 1.0), and saw that Ollama was reading the cudart from the PhysX Common directory. Then I checked nvidia-smi and saw that the CUDA version was a major version lower than it was last night (??? thanks, Windows). Updated the GeForce drivers; the CUDA version is now 12.4. Also added Ollama further up in PATH. That seems to have done the trick: the GPU is in use, and the cudart used is the Ollama one. 16 tokens per second on Phi3, 30/33 layers allocated in VRAM (3.0/4.0 GB in use). Lots of CPU usage; I guess it is what it is, then?
Happy to hear you got it running on GPU. I'm still trying to get to the bottom of why that PhysX cudart library behaves strangely. I'm sort of wondering if it's exposing some sort of "virtual" GPU. We include a copy of cudart v11 in the distribution to try to make it easier for users to install without having to add the cuda libraries on their host. There's some combination of factors that causes that bundled version to not work for some users which we're still trying to get to the bottom of. As to the layers question - we're continuing to refine our prediction algorithm to maximize VRAM usage without hitting OOM crashes. Model architecture, context size and other factors can influence the actual VRAM usage at runtime compared to the on-disk size of the model. I'd like to keep this issue tracking the unexplained PhysX cudart behavior leading to misidentification as CC 1.0.
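(To put rough numbers on the layers point, an illustrative back-of-the-envelope only, assuming Phi-3-mini's 32 layers and 3072-wide hidden state with a 2048-token context at f16: the KV cache alone needs about 2 × 32 × 2048 × 3072 × 2 bytes ≈ 0.75 GiB on top of the ~2.3 GB of weights, and compute/scratch buffers come on top of that, so a model whose on-disk size looks like it fits in 4 GB of VRAM can still need a few layers held back on the CPU.)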
I found 0.1.33 loaded significantly more lib files from the system than 0.1.32 did, comparing the 0.1.32 DLL list to the 0.1.33 DLL list.
We adjusted the behavior in 0.1.33 to try to use CUDA libraries on the host system if found, in the hope that would resolve some other issues we've seen with our bundled library not working in some cases. We weren't anticipating a cudart library successfully loading and enumerating a GPU but providing incorrect information about memory and CC version. I'd definitely like to get this fixed ASAP for the next release; we just need to figure out what the best approach is.
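(A quick way to check which copy wins on a given machine, as a sketch; `where` without /R lists matches from the current directory and PATH in resolution order, which approximates, but doesn't exactly match, the full Windows DLL search order:)

```
REM list every cudart64_110.dll reachable via PATH; the first match is the likely pick
where cudart64_110.dll
```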
FWIW, I think I have the same issue. Edit: actually, not the same, but similar: the GPU is not being used. I have a very similar graphics card to the OP, a GTX 1050 Ti. Noticed my Ollama embeddings take a very long time. Server logs seem to point to Ollama's version of the cudart DLL (cudart64_110.dll) despite the fact that an up-to-date CUDA is installed. Removed Ollama, removed all NVIDIA software, then reinstalled NVIDIA first and Ollama after, but still the same.
I also have the same issue; although I removed the PhysX directory from PATH, I still can't use the GPU. #3969
Managed to make it use the GPU by forcing Ollama to the front of PATH before starting the server:

set PATH=C:\tools\scoop\apps\ollama\current;%PATH%
ollama serve
time=2024-05-06T21:00:09.798+03:00 level=INFO source=images.go:828 msg="total blobs: 7"
time=2024-05-06T21:00:09.799+03:00 level=INFO source=images.go:835 msg="total unused blobs removed: 0"
time=2024-05-06T21:00:09.800+03:00 level=INFO source=routes.go:1071 msg="Listening on 127.0.0.1:11434 (version 0.1.33)"
time=2024-05-06T21:00:09.800+03:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11.3 rocm_v5.7 cpu cpu_avx cpu_avx2]"
time=2024-05-06T21:00:09.800+03:00 level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-05-06T21:00:09.833+03:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=C:\tools\scoop\apps\ollama\current\cudart64_110.dll count=1
......
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4155.99 MiB

nvidia-smi:
Mon May 6 20:56:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.22 Driver Version: 552.22 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 WDDM | 00000000:01:00.0 On | N/A |
| 47% 38C P8 35W / 350W | 8023MiB / 12288MiB | 7% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
.....
| 0 N/A N/A 23524 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
+-----------------------------------------------------------------------------------------+
What is the issue?
Ollama server incorrectly identifies the Compute Capability of my GPU (detects 1.0 instead of 5.2). It seems to me that this is due to a recent change in gpu/gpu.go. Thanks!
Previously: CUDART CUDA Compute Capability detected: 5.2
Now: CUDA GPU is too old. Compute Capability detected: 1.0
OS
Windows
GPU
Nvidia
CPU
Intel
Ollama version
0.1.33-rc5
Workaround
Remove c:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\ from your PATH environment variable so Ollama does not use this CUDA runtime library.
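(A session-only variant of the same workaround, as a sketch; the Ollama install directory below is assumed and may differ, e.g. a scoop install lives elsewhere as shown in the log above:)

```
REM prepend Ollama's own directory so its bundled cudart is found before the PhysX copy
set PATH=%LOCALAPPDATA%\Programs\Ollama;%PATH%
ollama serve
```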