Precalculating the text encoder embeddings can reduce VRAM usage: the text encoders only need to be loaded while the dataset is being preprocessed. The same applies to the VAE, so the only thing that has to be loaded during training is the main diffusion model (UNet / DiT).
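A rough sketch of the preprocessing pass. All names here are hypothetical, and the real encoder (e.g. UMT5-XXL) is replaced by a stub so the example is self-contained; in practice you would load the encoder, run this once over the dataset, then unload it before training starts:

```python
import hashlib
import pathlib
import pickle

def encode_caption(caption: str) -> list[float]:
    # Stand-in for the real text encoder: in the actual trainer this would
    # be a forward pass through e.g. T5-XXL / UMT5-XXL on the GPU.
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:8]]  # fake 8-dim "embedding"

def precompute_embeds(captions: dict[str, str], out_dir: str) -> dict[str, str]:
    """Encode every caption once, write each result to disk, and return a
    mapping of sample name -> embedding file path for the metadata file."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: dict[str, str] = {}
    for name, caption in captions.items():
        path = out / f"{name}.embed.pkl"
        with open(path, "wb") as f:
            pickle.dump(encode_caption(caption), f)
        paths[name] = str(path)
    return paths
```

The same pattern would apply to the VAE: run the images/videos through it once, save the latents next to the embeddings, and never load the VAE again during training.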
Something like a modified metadata.csv that maps each video/image name to its text encoder embed path and its latent path, so the trainer can locate the cached embed and latent for each sample.
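A minimal sketch of what that modified metadata.csv could look like, and how the trainer side could look entries up by file name. The column names and paths are illustrative, not an existing format in the repo:

```python
import csv
import io

# Preprocessing side: one row per sample, pointing at the cached files.
rows = [
    {"file_name": "clip_0001.mp4", "caption": "a cat",
     "embed_path": "cache/clip_0001.embed.pt",
     "latent_path": "cache/clip_0001.latent.pt"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["file_name", "caption", "embed_path", "latent_path"]
)
writer.writeheader()
writer.writerows(rows)

# Trainer side: build a file_name -> row lookup so each batch can load the
# precomputed embedding and latent instead of running the encoders.
lookup = {r["file_name"]: r for r in csv.DictReader(io.StringIO(buf.getvalue()))}
```

Keeping the original caption column alongside the paths leaves the door open to re-encoding later if the caching scheme changes.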
This should apply to all models, so it can benefit the entire repo (it notably helps the models that use T5-XXL / UMT5-XXL, since those text encoders are very large). The main flaw is the loss of dynamic tag-based dropout, since the caption is already baked into a fixed embedding; whole-caption dropout could still work, though, by keeping a precalculated empty-string embedding.
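The whole-caption dropout idea can be sketched like this. The function name and paths are hypothetical; the point is just that, at batch time, the trainer swaps in the precalculated empty-string embedding with some probability instead of dropping individual tags:

```python
import random

def pick_embed_path(embed_path: str, empty_embed_path: str,
                    dropout_p: float, rng: random.Random) -> str:
    """With probability dropout_p, return the path to the precalculated
    empty-string embedding instead of the caption's embedding. This gives
    whole-caption dropout even though per-tag dropout is no longer possible
    once embeddings are cached."""
    return empty_embed_path if rng.random() < dropout_p else embed_path
```

Since the empty-string embedding is computed once during preprocessing, this keeps the text encoder entirely out of the training loop.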