Zero copy is not used when torch memory is sparse and/or has not been garbage collected #115
Is there any way we could reproduce with
There is no true zero-copy for GPU since, as far as I know, there is no memory mapping possible directly on the GPU, so there needs to be at least 1 copy to GPU (just making sure everything is clear). There is that: https://devblogs.microsoft.com/directx/directstorage-api-available-on-pc/ but since I don't own a Windows machine I haven't really looked into it.
Stuff that might be important to know:
All in all:
Tl;dr: Do you have a script / a few scripts that can be used to reproduce this? Ideally without
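To make the copy semantics concrete, here is a minimal sketch of the two load paths being discussed (the filename is a placeholder):

```python
from safetensors.torch import load_file

# CPU: the file can be memory-mapped, so tensors reference the mapped
# pages and no copy needs to be materialized.
cpu_tensors = load_file("model.safetensors", device="cpu")

# GPU: there is no mmap on the device, so at least one host -> device
# copy is unavoidable.
gpu_tensors = load_file("model.safetensors", device="cuda:0")
```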
I was able to narrow the scope of the pollution to Latent Diffusion. The above behaviour happens if the LDM preparation process is interrupted after it nearly completes or finishes. And yes, it's pretty consistent that waiting 10+ seconds results in a fast load. These are the modules; the next step for me is to make a replicating script:
It looks like this is just creating tensors with a CPU map location, so it's confusing why this behaviour is observed.
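If that is right, the suspect step boils down to a plain checkpoint load, something like this hypothetical call (the path is made up):

```python
import torch

# Deserialize the checkpoint into host memory; map_location="cpu" means
# no CUDA allocation should happen at this point.
state_dict = torch.load("ldm_checkpoint.ckpt", map_location="cpu")
```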
I made a script to recreate this issue, including the dependencies. Try it in a Windows environment: https://gist.github.com/pattontim/864469ef1f7cb7ebab8ef810b2dc6b3d
Ok, tried on Linux, couldn't see anything bad. I'm going to try on Windows (takes me a bit longer, I need to create the setup).
Just for sanity: running your script a few times in succession should trigger the bug, right?
We tried on Windows and could not reproduce, unfortunately. Could it maybe be linked to this?
Hi @pattontim, thanks for the report and the reproduction. I just tried to reproduce the error you described, but unfortunately I didn't manage to get the 45 sec loading. The maximum I was able to get was around 4 secs, and most of the time it's around 2.7/2.8 secs. Anything else that might have an impact we should try?
It happens every time I run the script. Sometimes the model loading speeds up before it finishes, about 3/4 of the way through. It should be noted I'm also using NVENC with Moonlight, encoding a 1080p stream of the screen while conducting the test. The model resides on a HDD which is not the main Windows drive. I'm re-using the venv that comes with webui, which installs pytorch-lightning as well. And I have 16 GB of RAM. It should also be noted that my Windows install loves to utilize the onboard shared memory of my board; for the duration of load_file, shared memory utilization rises by 50 MB or so. Apparently I can turn this off in the BIOS.
Finally, compare the speeds with the first part of the code commented out. It could be that my 2012 CPU and DDR3-1600 RAM result in this slow speed when the CPU is used instead, as others with a fast CPU have reported normal load-to-CPU times as short as 5s.
For me, the instantiate-from-config step always takes the same amount of time, but load_file varies in how fast it runs.
Could you try without it?
Maybe; RAM speed is indeed involved during sending to the GPU. Just for your information, what SAFETENSORS_FAST_GPU does is this (in gist, it's not accurate code):

```python
buffer = memmap(filename)
# Figure out tensor shape
tensor = torch.empty(shape, device="cuda:0")
cudaMemcpy(buffer, tensor.data_ptr())
```

Looking at your graphs, it seems the memory of the GPU is continuously increasing, so it's during cudaMemcpy that it is slow. What is weird is that model loading has an impact, since it doesn't do much aside from allocating memory (afaik).

Another option would be to check your disk I/O. If somehow the files are flushed from the disk cache, it would then have to spin your disk again, which is indeed much slower than fetching from memory (even more so on older computers!).

Finally, could you try without

A third option you could try is to totally ignore

A 4th one would be to compare against

Knowing which versions are affected by the slowdown and which are not might help understand the root cause better.

Note: it's really hard to fix/understand without being able to reproduce, but I absolutely believe your numbers.
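Expanding that pseudocode into something runnable, under the assumption of one raw float32 blob with no header (a real .safetensors file has a header and per-tensor offsets, so this is only a sketch of the copy path, not of the library):

```python
import numpy as np
import torch

def fast_gpu_load(filename, shape):
    # Memory-map the file copy-on-write ("c"): pages are faulted in
    # lazily, and the mapping stays writable, which torch.from_numpy
    # requires. Assumes the file is one raw float32 blob (no header).
    mapped = np.memmap(filename, dtype=np.float32, mode="c").reshape(shape)
    src = torch.from_numpy(mapped)

    # Allocate the destination directly on the GPU and perform the
    # single host -> device copy (the cudaMemcpy step above), which is
    # the step suspected to be slow in this issue.
    dst = torch.empty(shape, dtype=torch.float32, device="cuda:0")
    dst.copy_(src)
    return dst
```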
Yes, I'm using NVENC H.264 1080p 4:2:0, with 500 MB and 7% or so GPU utilization at idle. When I let the test run without it, fast load kicks in within the first 10 seconds. I would give credence to
Though other users quote similar times, so either they're using NVENC or other processes which strain this part of the memory bus. I will try your other suggestions. If you want to strain it in a similar way, I recommend Parsec.
All in all, I believe the two aforementioned factors result in long load times: the memory bus is constrained due to NVENC, and the disk cache clears when loading LatentDiffusion. safetensors works as expected.
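One quick way to separate those two factors is to time repeated loads: the first run may hit a cold disk cache, while later runs should be served from memory, so a large gap between run 0 and the rest points at disk I/O rather than the host -> device copy. A hypothetical timing script ("model.safetensors" is a placeholder):

```python
import time

import torch
from safetensors.torch import load_file

for i in range(3):
    start = time.perf_counter()
    tensors = load_file("model.safetensors", device="cuda:0")
    print(f"run {i}: {time.perf_counter() - start:.1f}s")
    # Drop references and release cached GPU blocks between runs.
    del tensors
    torch.cuda.empty_cache()
```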
Used `os.environ['SAFETENSORS_FAST_GPU'] = '1'`.
I observed on the webui that despite setting device='cuda' with this flag and using safetensors.torch.load_file, it was taking almost 45 seconds to load a 4GB .safetensors file. However, when trying to replicate it in a separate program but using the same libraries, the model loads fast and only takes a few seconds. These can be represented by these curves, each a 60s time chunk:
[Figure: Long load time observed, CPU fallback. The program is executed at the red line.]
It appears that due to some pollution of memory, webui always falls back to loading from CPU by default. This pollution appears to persist, and if you terminate the webui and then run a separate program which uses safetensors, it also falls back to loading using slow CPU copy.
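For reference, the load path under test looks roughly like this (my sketch; the path is a placeholder, and the flag is set before the load call):

```python
import os

os.environ['SAFETENSORS_FAST_GPU'] = '1'  # opt in to the direct-to-GPU path

from safetensors.torch import load_file

# ~4GB checkpoint: expected to take a few seconds, but observed to take
# almost 45 seconds inside webui after the "pollution".
state_dict = load_file("model.safetensors", device="cuda")
```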
Steps to replicate:
*Does not replicate if you interrupt the second external program and try again.
[Figure: Cancel to slow copy after pollution]
safetensors tested: 0.2.5
Next steps:

- print how much memory CUDA sees as available at runtime (see the sketch below)
- disable Windows 10 Hardware-accelerated GPU scheduling
- figure out how the webui pollutes space (I think it loads a few models with torch to CPU, 5GB to memory)
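For the first item, a small check using torch.cuda.mem_get_info, which wraps cudaMemGetInfo and reports free/total device memory in bytes:

```python
import torch

free, total = torch.cuda.mem_get_info()
print(f"CUDA reports {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```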