Slow model load and cache RAM does not free #6807
Comments
@pisoiu you can check out the PCIe bus speed + lane width using
As @rick-github mentioned, the cache tables will stay dirty until they're reused for something else. Also, starting in Ollama 0.3.11 you'll be able to
LMK the bus speed / lane width data and either I can close the issue or see if it's something more serious.
Hi all, thanks for the info.
Hi, the output of lspci -s 67:00.0 -vvv (the NVMe controller) is: 67:00.0 Non-Volatile memory controller: Phison Electronics Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01) (prog-if 02 [NVM Express]) Does this contain the information you need?
@pisoiu I think it's saying the lane width is 4x so the theoretical maximum would be 1GB/s?
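As a rough sanity check on that figure (an illustrative sketch of my own, not from the thread): usable PCIe bandwidth per direction is roughly the per-lane data rate times the lane count, so a Gen4 x4 link is closer to ~7.9 GB/s than 1 GB/s; ~1 GB/s would correspond to about Gen3 x1.

```shell
# Rough theoretical PCIe bandwidth per direction, after line encoding,
# ignoring protocol overhead. Per-lane rates in MB/s:
# Gen3 ~985 (8 GT/s, 128b/130b), Gen4 ~1969 (16 GT/s), Gen5 ~3938 (32 GT/s).
pcie_gbs() {  # usage: pcie_gbs <gen: 3|4|5> <lanes>
  local per_lane
  case "$1" in
    3) per_lane=985 ;;
    4) per_lane=1969 ;;
    5) per_lane=3938 ;;
    *) echo "unknown gen" >&2; return 1 ;;
  esac
  awk -v p="$per_lane" -v n="$2" 'BEGIN { printf "%.1f\n", p * n / 1000 }'
}
pcie_gbs 4 4    # Gen4 x4 NVMe link: prints 7.9 (GB/s)
pcie_gbs 4 16   # Gen4 x16 GPU slot: prints 31.5 (GB/s)
```

A DRAM-less Gen4 x4 NVMe drive will not reach that ceiling in practice, but the link itself is not the 1 GB/s limiter.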
I don't think it can be that low. Disk is this one:
I have some more information that may help. I asked on another forum about this issue and was told to capture iostat -x during both operations, the benchmark and the model transfer (Linux 6.8.0-44-generic, x86_64, 64 CPU). This is a bit cryptic to me, but a user on that forum commented that the aqu-sz column indicates the problem: during the benchmark the nvme has 309.4 requests queued on average, while during model transfer only 1.16. This is why I came here with the question, because it suggests to me that this is not a hardware issue; I hope I'm not wrong.
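To unpack what aqu-sz means (a sketch of my own, not from the thread; the numbers below are hypothetical): by Little's law, average queue size equals request rate times average per-request latency. A benchmark issues many reads in parallel and keeps the SSD's internal parallelism saturated (deep queue), while a loader that reads one block at a time sits near queue depth 1 and is bound by per-request latency rather than raw bandwidth.

```shell
# Little's law: avg queue size = arrival rate (req/s) x avg latency (s).
# iostat's aqu-sz column is this product for the device.
aqu_sz() {  # usage: aqu_sz <requests_per_sec> <avg_await_ms>
  awk -v r="$1" -v w="$2" 'BEGIN { printf "%.2f\n", r * w / 1000 }'
}
# Hypothetical figures for illustration:
aqu_sz 20000 15   # parallel benchmark-style load -> prints 300.00
aqu_sz 1700 0.7   # serial block-at-a-time reads  -> prints 1.19
```

So an aqu-sz near 1 during model transfer points at a mostly serial read pattern in the loader, consistent with the disk being capable of much more under parallel load.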
If your concern is model loading speed (as opposed to reading data off disk), then PCIe bandwidth may be the bottleneck. A block of model data is read from disk and cached in RAM; that block is then written to VRAM over the PCIe bus. Model loading cannot be faster than the slowest link in that chain, so if your PCIe bus or VRAM has less bandwidth than the NVMe read path, that slower link sets the read rate you observe from the NVMe. What's the PCI config of your GPU devices?
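One way to gather that without root (a sketch; device paths vary per system) is to read the negotiated link speed and width from sysfs, then compare against the LnkCap lines in `sudo lspci -vvv` to spot links that trained below their capability:

```shell
# Print negotiated PCIe link speed/width for every device that reports one.
# Bridges and some devices lack these files; skip them silently.
for dev in /sys/bus/pci/devices/*; do
  [ -r "$dev/current_link_speed" ] || continue
  printf '%s  %s  x%s\n' "${dev##*/}" \
    "$(cat "$dev/current_link_speed" 2>/dev/null)" \
    "$(cat "$dev/current_link_width" 2>/dev/null)"
done
```

A GPU slot negotiated at x8 instead of x16, or at a lower generation, shows up immediately in this listing.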
@rick-github, this is the result of the first command: 01:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1) (prog-if 00 [VGA controller]) All GPUs are identical (Nvidia RTX A4000); there are 5 of them, soon I will install another 2, and then all the PCIe slots on the board will be filled. Two of them run at x8 lanes, the others at x16, all PCIe 4.0.
Do you have a graph from when you are running the benchmark? What benchmark program are you using? What's the output of
Result is: I don't have a graph from the benchmark because running it does not make the graph move, I don't know why. In Ubuntu's Disks utility there is an option under the three-vertical-dot button, 'Benchmark Disk'; that's the one I used to benchmark some time ago. What is strange is that when I repeated the test just now it gave lower results, around 2.8GB/s. When I got 5GB/s I had only 3 GPUs installed; now I have 5, and that's the only difference, but I don't understand why that should matter for disk speed. AFAIK they're not on shared PCIe lanes.
What is the issue?
Hi all. My system: AMD TR PRO 3975WX CPU, 512G RAM DDR4 ECC, 3xRTX A4000 (48G VRAM) GPU, 4TB Nvme corsair mp600 core xt, Ubuntu 22.04.1 LTS
I'm not a specialist in Linux, so don't throw stones.
Problem 1: According to various tests, the transfer speed of DDR4 can reach about 25GB/s. According to a benchmark of my local NVMe disk, read speed is around 6GB/s. However, when I start 'ollama run llama3.1:70b' from the terminal, the system monitor shows constant disk activity during the model transfer and read speed tops out around 1.7GB/s, no more. Why isn't it loaded faster if both the disk and RAM can do much more? The system isn't doing anything else. This is tolerable with 70b, but with 405b it is really annoying.
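To separate raw disk speed from the loader's read pattern (an illustrative sketch; the helper name and blob path are mine, not ollama's), one can time a cold-cache sequential read of the model file itself and compare against the ~1.7 GB/s seen during loading:

```shell
# Time a sequential read of a file; prints dd's final throughput line.
# For a true cold-cache number, drop clean page cache first (root):
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
measure_read() {  # usage: measure_read <file>
  dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}
# e.g. measure_read ~/.ollama/models/blobs/sha256-...   (path illustrative)
```

If dd reaches ~6 GB/s on the same file while the model load sits at 1.7 GB/s, the bottleneck is in how the data is read or forwarded, not in the disk.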
Problem 2: 48G of VRAM is enough to fit the :70b model. When I start 'ollama run llama3.1:70b', it is first loaded into RAM; in the system monitor window I see 'cache' jumping up. After the model is completely transferred to RAM, I see it pushed into the VRAM of the GPUs for inference. The 'memory' section of the system monitor indicates '7.3GiB (1.5%) of 503.5GiB, cache 44.6GiB'. When I'm done with the model and send '/bye' to ollama, I can see VRAM still filled for a few more minutes, then it is freed. But not the 'cache' in RAM: it stays at 44.6GiB forever if I'm not doing anything else (I waited >30 min). This becomes problematic when I load a different model, which piles on top of the models already sitting in cache and increases its size. Loading more models progressively fills it to the top, and eventually data goes into swap. Old models are never removed from cache even when newer ones need the memory. Why?
Thank you.
LE: one detail which may or may not be important. One ollama is installed directly and I run it from the terminal prompt; there is another installation in a docker container, installed from open-webui with built-in ollama support, which serves inference over the network. Both behave the same with regard to cache memory.
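On the cache question (a sketch describing standard Linux behavior, not anything ollama-specific): page cache is reclaimable memory, not a leak. The kernel keeps file data cached as long as nothing else needs the RAM and evicts least-recently-used pages under memory pressure, which is why MemAvailable stays high even when 'cache' looks full. You can inspect this, and for testing force the cache to drop:

```shell
# MemAvailable already counts reclaimable page cache as free-for-use,
# so a large "Cached" value does not mean the RAM is unavailable:
grep -E '^(MemTotal|MemAvailable|Cached):' /proc/meminfo
# To force-drop clean page cache (root; harmless, but the next model
# load will have to re-read everything from disk):
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```

So an old model's cached blocks should be evicted automatically when a new model needs the space; if the machine genuinely swaps while MemAvailable is large, that is worth reporting separately.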
OS
Linux
GPU
Nvidia
CPU
AMD
Ollama version
0.3.8