out of memory(cpu) when finetuning with 50M text image pairs #1129

Open
jucic opened this issue Feb 21, 2024 · 6 comments
Comments

@jucic
jucic commented Feb 21, 2024

Thanks for your nice work. There is an "out of memory (CPU)" error when fine-tuning with 50M text-image pairs: after loading the data (which takes about 2 days), the process dies at the beginning of training. We found the cause is running out of CPU memory; see the screenshot below for details. For now I am trying to split the dataset into 10 shards and load one shard per epoch. Is there any other solution to support huge datasets such as 100M text-image pairs?
[screenshot: out-of-memory (CPU) error at the start of training]
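(For reference, a minimal sketch of the per-epoch sharding workaround described above; the metadata path, shard count, and helper name are hypothetical and not part of the actual scripts:)

from typing import Any, Dict
import json
import random

def load_metadata_shard(metadata_path: str, num_shards: int, epoch: int) -> Dict[str, Any]:
    # load the full metadata JSON ({image_key: {...}, ...}), then keep only 1/num_shards of the keys
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)

    keys = sorted(metadata.keys())
    random.Random(0).shuffle(keys)                  # fixed seed so the shard split is stable across epochs
    shard_keys = keys[(epoch % num_shards)::num_shards]  # every num_shards-th key, offset by the epoch
    return {k: metadata[k] for k in shard_keys}

# e.g. build the dataset for epoch 3 from shard 3 of 10 instead of all 50M entries:
# metadata = load_metadata_shard("meta_lat.json", num_shards=10, epoch=3)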

@kohya-ss
Owner

I'll test with pseudo 50M text-image pairs. How much CPU RAM does your system have?

@jucic
Copy link
Author

jucic commented Feb 22, 2024

I'll test with pseudo 50M text-image pairs. How much CPU RAM does your system have?

@kohya-ss Thanks, about 1007 GiB of RAM on each machine.

@kohya-ss
Owner

Thank you! To be honest, the script is not intended to handle that large a volume of images... However, it should work with appropriate options...

If you cache the latents to memory with --cache_latents but without --cache_latents_to_disk, the amount of memory used by the latents will be H * W * C * sizeof(float) * num_images = 128 * 128 * 4 * 4 * 50M ≈ 13 TB. So I guess you are either not caching the latents or are using --cache_latents_to_disk. Is that correct?
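(Just to double-check that number, assuming 128×128×4 float32 latents:)

bytes_per_image = 128 * 128 * 4 * 4          # H * W * C * sizeof(float32) = 256 KiB per latent
total_bytes = bytes_per_image * 50_000_000   # 50M images
print(total_bytes / 1000**4, "TB")           # ~13.1 TB
print(total_bytes / 1024**4, "TiB")          # ~11.9 TiB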

@dill-shower

dill-shower commented Feb 22, 2024

If you cache the latents to memory with --cache_latents but without --cache_latents_to_disk, the amount of memory used by the latents will be H * W * C * sizeof(float) * num_images = 128 * 128 * 4 * 4 * 50M ≈ 13 TB.

If we cache the latents to disk, will the same 13TB capacity be used on disk?

Is there any way to speed up loading the data into the script? Most GPU servers use hourly billing, and it is inefficient to spend 2 days just loading images into RAM without even starting training.

@kohya-ss
Owner

kohya-ss commented Feb 22, 2024

If we cache the latents to disk, will the same 13TB capacity be used on disk?

It should be. Therefore, I think it may be better to disable latent caching for large-scale fine-tuning.

The script doesn't load the images if caching is disabled or the cache has already been created on disk in advance, but it still checks the size of every image. That also takes a long time (though 2 days is too long, so I suspect the latents are being cached).

@thojmr

thojmr commented Feb 23, 2024

You can always add multiprocessing to the file-loading steps to speed things up.

You can add it anywhere the images/captions/metadata are loaded:

  • image size loading (like below)
  • processing metadata
  • checking cache validity

train_util.py

from multiprocessing import Pool
from typing import Any, Tuple

from PIL import Image
from tqdm import tqdm

# fetch the size of one image; must be a top-level function so the Pool can pickle it
def load_image_size(image_data) -> Tuple[str, Any]:
    image_key, image_info = image_data

    if image_info.image_size is not None:
        # return the current size if it is already known
        return image_key, image_info.image_size

    with Image.open(image_info.absolute_path) as image:
        return image_key, image.size

...
logger.info("loading image sizes (multiprocessed).")
with Pool() as pool:  # create a multiprocessing Pool (one worker per CPU core)
    iterator = pool.imap(load_image_size, self.image_data.items())  # process the (key, info) pairs in parallel
    for key, image_size in tqdm(iterator, total=len(self.image_data), smoothing=0.01):
        self.image_data[key].image_size = image_size

Synchronous image loading is a pain, so the above change gave the largest time savings.

Edit: off the top of my head, I think it now takes about 20 minutes to load a 3M-image dataset if the images are pre-cached (NVMe drive).
