
[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initialization #1178

Merged
kohya-ss merged 13 commits into kohya-ss:dataset-cache on Mar 24, 2024

Conversation

KohakuBlueleaf
Contributor

For large-scale datasets, sd-scripts suffers from a long wait while reading image sizes and other metadata.

So I propose 2 improvements:

  1. Cache the built dataset group object to disk so we don't need to compute it multiple times.
  • For DDP or experiments across settings, this is SUPER helpful.
  2. Use the imagesize library to read image sizes instead of PIL, which is overkill for this (see the sketch at the end of this comment).
  • This gives a 5~10× speedup for reading image sizes on an NVMe SSD.

With my cache script, I needed only half an hour to build the dataset groups (which would cost 4 hours if I directly ran 4-card training).
Loading the cached dataset groups also works fine: I did a quick sanity check that the first few images are the same, but it needs more checking from the community.
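A minimal sketch of the imagesize point, for illustration (the file name is hypothetical): imagesize.get() parses only the header bytes of the file, while PIL constructs a full Image object before exposing .size.

```python
import imagesize
from PIL import Image

path = "example.png"  # hypothetical image path

# Fast: reads only the header bytes to get dimensions
width, height = imagesize.get(path)

# Slow: opens the file and builds a full PIL Image object first
with Image.open(path) as img:
    width, height = img.size
```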

@kohya-ss
Owner

Thank you for this! It seems really useful. I was thinking about how to reduce the waiting time for preprocessing, and I had not thought of pickling the dataset group object.

However, pickling feels a bit aggressive to me. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

@KohakuBlueleaf
Contributor Author

> Thank you for this! It seems really useful. I was thinking about how to reduce the waiting time for preprocessing, and I had not thought of pickling the dataset group object.
>
> However, pickling feels a bit aggressive to me. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

Actually it's not only the size; the absolute path and bucket are cached too, so we don't need to wait for listdir and the image checks.

I can confirm the startup time with a cached dataset is under 1 minute, from pressing Enter to seeing the tqdm progress bar.

Pickling is aggressive, I just used it to show how much it helps at first XD

BTW, absolute path + size + bucket for 5 million images in pickle only costs me 3 GB.
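A minimal sketch of the pickle approach described above; the cache path and the metadata layout are illustrative stand-ins, not the PR's actual format:

```python
import pickle

# Stand-in for the built dataset group: the per-image metadata
# mentioned above (absolute path, image size, assigned bucket).
dataset_group = [
    {"path": "/data/images/0001.png", "size": (1024, 1024), "bucket": (1024, 1024)},
]

cache_path = "dataset_group.pkl"  # hypothetical cache location

# First run: build the group once, then dump it to disk.
with open(cache_path, "wb") as f:
    pickle.dump(dataset_group, f)

# Later runs (each DDP rank, or another experiment on the same data):
# load in seconds instead of re-scanning millions of files.
with open(cache_path, "rb") as f:
    dataset_group = pickle.load(f)
```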

@KohakuBlueleaf
Contributor Author

> Thank you for this! It seems really useful. I was thinking about how to reduce the waiting time for preprocessing, and I had not thought of pickling the dataset group object.
>
> However, pickling feels a bit aggressive to me. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

I will implement a version that caches the absolute path list and image size for each subset, so the loading procedure will be the same; it just skips the listdir and imagesize steps.

@kohya-ss
Owner

> I will implement a version that caches the absolute path list and image size for each subset, so the loading procedure will be the same; it just skips the listdir and imagesize steps.

That's nice! I think it is straightforward :)

@KohakuBlueleaf
Contributor Author

@kohya-ss I have finished the implementation; it caches only the image path/caption and the image size. With cached metadata, a dataset with 32768 images needs only 4 seconds from program start to finishing the dataset setup (including creating the buckets).

I have only implemented it for DreamBoothDataset at first, but I think you can copy the implementation to the other 2 dataset classes easily.
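A minimal sketch of that flow, with hypothetical names (load_subset_meta, the cache file name, the sidecar-.txt caption convention): on a warm start the cached list is returned directly, so listdir and the per-file size reads are skipped entirely.

```python
import os
import pickle

import imagesize

def load_subset_meta(image_dir, cache_name="meta_cache.pkl"):
    """Return [(abs_path, caption, (width, height)), ...] for one subset."""
    cache_path = os.path.join(image_dir, cache_name)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # warm start: no listdir, no size reads

    meta = []
    for name in sorted(os.listdir(image_dir)):  # cold start only
        if not name.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
            continue
        path = os.path.abspath(os.path.join(image_dir, name))
        txt = os.path.splitext(path)[0] + ".txt"  # sidecar caption file
        caption = ""
        if os.path.exists(txt):
            with open(txt, encoding="utf-8") as cf:
                caption = cf.read().strip()
        meta.append((path, caption, imagesize.get(path)))

    with open(cache_path, "wb") as f:
        pickle.dump(meta, f)
    return meta
```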

@kohya-ss
Owner

Thank you for the update! This is really nice. I will copy it to the other datasets :)

I may change the format to JSON or something else for future-proofing. That makes the metadata three times bigger or more, but I believe it is no problem. I appreciate your understanding.
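For illustration, a JSON version of the same metadata could look like the following (field names are hypothetical, not the final format); the text encoding costs roughly the 3× size increase mentioned, but stays human-readable and easy to version:

```python
import json

meta = {
    "format_version": 1,  # hypothetical field for future-proofing
    "images": [
        {"path": "/data/images/0001.png", "caption": "a photo", "size": [1024, 1024]},
    ],
}

# Write the cache as text; larger than pickle but inspectable and portable.
with open("dataset_meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f)

# Read it back on later runs.
with open("dataset_meta.json", encoding="utf-8") as f:
    meta = json.load(f)
```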

@kohya-ss kohya-ss changed the base branch from dev to dataset-cache March 24, 2024 06:35
@kohya-ss kohya-ss merged commit ae97c8b into kohya-ss:dataset-cache Mar 24, 2024
1 check passed
deepdelirious pushed a commit to deepdelirious/sd-scripts that referenced this pull request Mar 29, 2024
[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initialization (kohya-ss#1178)

* support meta cached dataset

* add cache meta scripts

* random ip_noise_gamma strength

* random noise_offset strength

* use correct settings for parser

* cache path/caption/size only

* revert mess up commit

* revert mess up commit

* Update requirements.txt

* Add arguments for meta cache.

* remove pickle implementation

* Return sizes when enable cache

---------

Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>