Zero copy is not used when torch memory is sparse and/or has not been garbage collected #115

Closed
pattontim opened this issue Nov 28, 2022 · 13 comments


pattontim commented Nov 28, 2022

Used

os.environ['SAFETENSORS_FAST_GPU'] = '1'

I observed in the webui that, despite setting device='cuda' with this flag and using safetensors.torch.load_file, it was taking almost 45 seconds to load a 4 GB .safetensors file. However, when I tried to replicate this in a separate program using the same libraries, the model loaded in only a few seconds. The two cases are shown in the curves below, each covering a 60-second window:

[Graph 1: long load time observed, CPU fallback]

[Graph 2: fast load in the separate program]

The program is executed at the red line.

It appears that, due to some pollution of memory, the webui always falls back to loading via the CPU. This pollution appears to persist: if you terminate the webui and then run a separate program that uses safetensors, it also falls back to the slow CPU copy.

Steps to replicate:

  1. Run a program that loads various large torch files to CPU (regular torch files with torch.load) and loads a large file to GPU (safetensors).
  2. Terminate it a few seconds after it has loaded the torch CPU files and started the GPU load.
  3. Within 10 seconds, launch an external program that loads a safetensors file to GPU with the fast GPU flag (a sketch is shown after this list). The program resorts to a slow copy via the CPU.
    *Does not replicate if you interrupt the second external program and try again.
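
The external program in step 3 boils down to something like the following (a rough sketch; the .safetensors path is a placeholder, and the env var is set before importing safetensors, mirroring how I set it above):

import os
import time

os.environ["SAFETENSORS_FAST_GPU"] = "1"  # set before importing safetensors

import torch
from safetensors.torch import load_file

path = "model.safetensors"  # placeholder: any ~4 GB .safetensors checkpoint

start = time.perf_counter()
state_dict = load_file(path, device="cuda:0")  # loads tensors straight onto the GPU
torch.cuda.synchronize()
print(f"load_file took {time.perf_counter() - start:.1f}s")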

[Graph: cancel to slow copy after pollution]

safetensors tested: 0.2.5

Next steps:

  • print how much memory CUDA sees as available at runtime (see the snippet after this list)

  • disable Windows 10 hardware-accelerated GPU scheduling

  • figure out how the webui pollutes memory (I think it loads a few models with torch to CPU, about 5 GB to memory)
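
For the first point, a quick check would be something like this (sketch; torch.cuda.mem_get_info needs torch >= 1.12):

import torch

free, total = torch.cuda.mem_get_info(0)  # bytes as reported by the CUDA driver
print(f"CUDA free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB total")
print(f"reserved by torch: {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")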


Narsil commented Nov 28, 2022

Is there any way we could reproduce this with safetensors only? (Maybe with torch.load if it causes issues, but I doubt it.)

zero copy is not used when torch memory is sparse and/or has not been garbage collected

There is no true zero-copy for the GPU: as far as I know, there is no way to memory-map a file directly onto the GPU, so there needs to be at least one copy to the GPU (just making sure everything is clear).

There is this: https://devblogs.microsoft.com/directx/directstorage-api-available-on-pc/ but since I don't own a Windows machine I haven't really looked into it, nor into whether torch supports it in any way; I'm going to guess it's not used.

Stuff that might be important to know:

  • For zero-copy on CPU, it works only after the file has already been accessed at least once (it needs to live in the disk cache, i.e. the portions of the disk the OS keeps in RAM for faster access, meaning the data is already loaded somehow). If you are loading too many different things from disk, other parts of the disk cache might get flushed, meaning those files can't be read again without slow disk access. This is very OS dependent.
  • For GPU access: pytorch uses an arena (caching) allocator, where the first GPU allocation creates a pretty large buffer. This means the first GPU access is always a bit slow for a given program, but since most memory is then already allocated, all further allocations are fast (which is why it's done that way). See the snippet after this list.
  • Freeing/zeroing memory takes time. Most programs don't see that because the OS makes it transparent: once you free memory, the OS gives you back control very quickly and can zero the memory in the background or only when someone else asks for it.
    I'm not sure how the NVIDIA drivers handle this on the GPU, but I'm guessing it's similar.
    The reason for clearing memory is that you don't want some rogue program reading what was left behind by a previous program (at least that's my understanding). The memory is not necessarily zeroed, it could just be randomized, but that still requires an operation which might be slow.
  • All of the above could very well be OS dependent. As it happens I haven't used a Windows machine in a long while, so something else might be at play here.
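
To see the arena effect concretely, here is a rough sketch (exact numbers depend on GPU, driver and torch version):

import time
import torch

# The first CUDA allocation also initializes the context and the caching
# allocator's arena, so it is noticeably slower than the ones that follow.
for label in ("first", "second"):
    start = time.perf_counter()
    x = torch.empty(256 * 2**20, dtype=torch.uint8, device="cuda:0")  # 256 MiB
    torch.cuda.synchronize()
    print(f"{label} alloc: {time.perf_counter() - start:.3f}s, "
          f"reserved: {torch.cuda.memory_reserved(0) / 2**20:.0f} MiB")
    del x  # the block goes back to torch's arena, not to the driver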

All in all:

  • Is there any reason why you can't use the same program and simply load models back and forth? As mentioned in all the points above, it would make memory management easier both for the program and for the OS. (I'm guessing no, but still worth mentioning.)
  • Those 10-ish seconds before it's fast again are consistent, I'm guessing? Then it could very well be memory cleanup as you mentioned (and/or memory partitioning, which is also a good guess).
  • "The program resorts to slow copy via CPU." I don't think it would ever do that. What safetensors does is memory-map the file and copy its contents onto the GPU. There can be a CPU copy beforehand if the file is not in the disk cache, but that's not controlled by the program, and CPU copies shouldn't make the program that slow. In my benchmarks it could be up to 2x slower with an extra CPU copy, but rarely more (sometimes much less).

Tl;dr: do you have a script / a few scripts that can be used to reproduce this? Ideally without the webui being involved. (If you cannot reproduce without the webui, that's a great signal that the bug might be over there, for instance.)


pattontim commented Dec 1, 2022

I was able to narrow the scope of the pollution to Latent Diffusion. The above behaviour happens if the LDM preparation process is interrupted after it nearly completes or finishes. And yes, it's pretty consistent that waiting 10+ seconds results in a fast load.

These are the modules; my next step is to make a reproducing script:

loading model:  D:\SD\models\Stable-diffusion None None
importing model modules.codeformer_model
importing model modules.esrgan_model
importing model modules.gfpgan_model
importing model modules.ldsr_model
importing model modules.realesrgan_model
importing model modules.scunet_model
importing model modules.swinir_model
loading model:  D:\SD\models\ESRGAN https://github.com/cszn/KAIR/releases/download/v1.0/ESRGAN.pth None
loading model:  D:\SD\models\ScuNET https://github.com/cszn/KAIR/releases/download/v1.0/scunet_color_real_gan.pth None
loading model:  D:\SD\models\SwinIR https://github.com/JingyunLiang/SwinIR/releases/download/v0.0/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth  None
Loading config from: D:\SD\v1-inference.yaml
inst model from config {'base_learning_rate': 0.0001, 'target': 'ldm.models.diffusion.ddpm.LatentDiffusion', 'params': {'linear_start': 0.00085, 'linear_end': 0.012, 'num_timesteps_cond': 1, 'log_every_t': 200, 'timesteps': 1000, 'first_stage_key': 'jpg', 'cond_stage_key': 'txt', 'image_size': 64, 'channels': 4, 'cond_stage_trainable': False, 'conditioning_key': 'crossattn', 'monitor': 'val/loss_simple_ema', 'scale_factor': 0.18215, 'use_ema': False, 'scheduler_config': {'target': 'ldm.lr_scheduler.LambdaLinearScheduler', 'params': {'warm_up_steps': [10000], 'cycle_lengths': [10000000000000], 'f_start': [1e-06], 'f_max': [1.0], 'f_min': [1.0]}}, 'unet_config': {'target': 'ldm.modules.diffusionmodules.openaimodel.UNetModel', 'params': {'image_size': 32, 'in_channels': 4, 'out_channels': 4, 'model_channels': 320, 'attention_resolutions': [4, 2, 1], 'num_res_blocks': 2, 'channel_mult': [1, 2, 4, 4], 'num_heads': 8, 'use_spatial_transformer': True, 'transformer_depth': 1, 'context_dim': 768, 'use_checkpoint': True, 'legacy': False}}, 'first_stage_config': {'target': 'ldm.models.autoencoder.AutoencoderKL', 'params': {'embed_dim': 4, 'monitor': 'val/rec_loss', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}, 'lossconfig': {'target': 'torch.nn.Identity'}}}, 'cond_stage_config': {'target': 'ldm.modules.encoders.modules.FrozenCLIPEmbedder'}}}
LatentDiffusion: Running in eps-prediction mode

It looks like this is just creating tensors with a CPU map location, so it's confusing why this behaviour is observed.

@pattontim

I made a script to recreate this issue, including the dependencies; try it in a Windows environment:

https://gist.github.com/pattontim/864469ef1f7cb7ebab8ef810b2dc6b3d


Narsil commented Dec 1, 2022

OK, I tried on Linux and couldn't see anything bad. I'm going to try on Windows (takes me a bit longer, I need to create the setup).


Narsil commented Dec 1, 2022

Just for sanity: running your script a few times in succession should trigger the bug, right?


Narsil commented Dec 1, 2022

We tried on Windows and unfortunately could not reproduce it.

Could it maybe be linked to this?
https://www.reddit.com/r/StableDiffusion/comments/z8mnak/comment/iyfnq83/?utm_source=share&utm_medium=web2x&context=3

@mfuntowicz

Hi @pattontim,

Thanks for the report and the reproduction. I just tried to reproduce the error you described, but unfortunately I didn't manage to get the 45-second load.

The maximum I was able to get was around 4 seconds, and most of the time it's around 2.7/2.8 seconds.

Is there anything else that might have an impact that we should try?


pattontim commented Dec 1, 2022

Just for sanity: running your script a few times in succession should trigger the bug, right?

It happens every time I run the script. Sometimes the model loading speeds up before it finishes, about 3/4 of the way through.

It should be noted that I'm also using NVENC with Moonlight, encoding a 1080p stream of the screen, while conducting the test. The model resides on an HDD which is not the main Windows drive.

I'm re-using the venv that comes with webui, which installs pytorch lightning as well. And I have 16 GB of RAM.

It should also be noted that my Windows install loves to use the onboard shared memory of my board; during load_file, shared memory usage grows by 50 MB or so. Apparently I can turn this off in the BIOS with:

Advanced > Video Configuration Menu > Onboard Video Memory Size

Finally, compare the speeds with the first part of the code commented out.

It could be that my 2012 CPU and DDR3-1600 RAM result in this slow speed when the CPU is used instead, as others with fast CPUs have reported normal CPU-load times as short as 5s.

@pattontim

We tried on Windows and unfortunately could not reproduce it.

Could it maybe be linked to this? https://www.reddit.com/r/StableDiffusion/comments/z8mnak/comment/iyfnq83/?utm_source=share&utm_medium=web2x&context=3

For me, the instantiate-from-config step always takes the same amount of time, but load_file varies in how fast it runs.


Narsil commented Dec 1, 2022

It should be noted that I'm also using NVENC with Moonlight, encoding a 1080p stream of the screen, while conducting the test. The model resides on an HDD which is not the main Windows drive.

Could you try without it?
We have tried both on NVMe and a regular HDD, and results were similar in terms of timings (everything was relatively consistent the whole time).

It could be that my 2012 CPU and DDR3-1600 RAM result in this slow speed when the CPU is used instead, as others with fast CPUs have reported normal CPU-load times as short as 5s.

Maybe; RAM speed is indeed involved when sending to the GPU.

Just for your information, what SAFETENSORS_FAST_GPU does is essentially this (a gist, not accurate code):

buffer = memmap(filename)                   # memory-map the file on the host
# figure out each tensor's shape/dtype from the header
tensor = torch.empty(shape, device="cuda:0")
cudaMemcpy(tensor.data_ptr(), buffer)       # one host -> GPU copy per tensor

Looking at your graphs, it seems the GPU memory is continuously increasing, so it's the cudaMemcpy step that is slow.
The RAM <-> GPU bus is usually quite slow, so if it's somehow being used by other parts of the program (like NVENC), maybe that's what's slowing it down.

What is weird is that model loading has an impact, since it doesn't do much aside from allocating memory (afaik).
Maybe it's indeed your (CPU) RAM that is being strained somehow.

Another option would be to check your disk I/O. If the files somehow get flushed from the disk cache, the load would then have to spin up your disk again, which is much slower than fetching from memory (even more so on older computers!).

Finally, could you try without SAFETENSORS_FAST_GPU=1 and report what numbers you see?
It could help pinpoint whether it is the cause or not.

A third option you could try is to ignore the safetensors library entirely and directly use this:
https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282

A fourth one would be to compare against torch.load.

Knowing which of these variants are affected by the slowdown and which are not might help us understand the root cause better.
This is the sort of experiment I would try to do if I were able to reproduce.
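
Something along these lines, running each variant in a fresh process so that the env var and the allocator state don't interfere (a sketch; the file paths are placeholders):

# compare_load.py -- run as:
#   python compare_load.py safetensors_fast
#   python compare_load.py safetensors
#   python compare_load.py torch
import os
import sys
import time

variant = sys.argv[1]
if variant == "safetensors_fast":
    os.environ["SAFETENSORS_FAST_GPU"] = "1"  # set before importing safetensors

import torch
from safetensors.torch import load_file

SAFETENSORS_PATH = "model.safetensors"  # placeholder
CKPT_PATH = "model.ckpt"                # placeholder

start = time.perf_counter()
if variant == "torch":
    sd = torch.load(CKPT_PATH, map_location="cpu")  # CPU deserialization only
else:
    sd = load_file(SAFETENSORS_PATH, device="cuda:0")
    torch.cuda.synchronize()
print(f"{variant}: {time.perf_counter() - start:.1f}s")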

Note: It's really hard to fix/understand without being able to reproduce, but I absolutely believe your numbers.


pattontim commented Dec 1, 2022

It should be noted that I'm also using NVENC with Moonlight, encoding a 1080p stream of the screen, while conducting the test. The model resides on an HDD which is not the main Windows drive.

Could you try without it? We have tried both on NVMe and a regular HDD, and results were similar in terms of timings (everything was relatively consistent the whole time).

Yes, I'm using NVENC H.264 1080p 4:2:0, which sits at around 500 MB and 7% or so GPU utilization at idle. When I let the test run without it, the fast load kicks in within the first 10 seconds.

[Graph: test run without NVENC]

I would give credence to:

The RAM <-> GPU bus is usually quite slow, so if it's somehow being used by other parts of the program (like NVENC), maybe that's what's slowing it down.

Though other users report similar times, so they're either using NVENC or running other processes which strain this part of the memory bus.

I will try your other suggestions.

If you want to strain it in a similar way, I recommend Parsec.


pattontim commented Dec 1, 2022

[Disk I/O graph during operation (no changes)]

[Disk I/O graph (first half of the script commented out)]

The difference is that, with the inpainting hijack and instantiate-from-config included, it always hits the disk.

Here's what I believe is happening, and it doesn't seem to be related to safetensors; I'm not 100% sure, so correct me if I'm wrong.

When I instantiate from config, it almost depletes the RAM; some of that RAM is used by the OS for disk caching, so the cache is flushed and the model then has to be reloaded from disk. When I don't use instantiate-from-config, the model remains in the cache and is re-used. I confirmed this by swapping out the loaded model and seeing that a lot of time is spent again.

[Disk I/O graph after swapping the loaded model]
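
One way to check the disk-cache theory is to watch available RAM around the instantiate-from-config step, e.g. with psutil (sketch):

import psutil

def log_ram(tag):
    vm = psutil.virtual_memory()
    print(f"{tag}: {vm.available / 2**30:.1f} GiB available of {vm.total / 2**30:.1f} GiB")

log_ram("before instantiate_from_config")
# ... run the instantiate-from-config step here ...
log_ram("after instantiate_from_config")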

This may not be a safetensors issue then.

@pattontim

All in all, I believe the two aforementioned factors result in the long load times: the memory bus is constrained by NVENC, and the disk cache is cleared when loading LatentDiffusion. Safetensors works as expected.
