Currently, we do not implement framewise encoding/decoding in the LTX Video VAE. Supporting it would considerably reduce peak memory usage, which would benefit both inference and training.
LoRA finetuning LTX Video on 49x512x768 videos can be done in under 6 GB if prompts and latents are pre-computed, but the pre-computation itself requires about 12 GB of memory because the VAE encodes/decodes the full video in one pass. Framewise encoding/decoding could reduce this considerably and lower the bar to entry for video model finetuning (see the sketch below). Our friends with potatoes need you!
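For anyone interested in picking this up, here is a minimal sketch of what temporally chunked encoding could look like, assuming a diffusers-style `vae.encode` that returns a latent distribution. The function name `encode_framewise` and the `chunk_size` parameter are hypothetical, and this naive version is only meant to illustrate the memory idea: since the LTX Video VAE uses causal 3D convolutions that mix information across frames, a real implementation would need to carry conv state across chunk boundaries (or overlap and blend chunks) to match the non-chunked output, and chunk sizes should align with the VAE's temporal compression factor.

```python
import torch

def encode_framewise(vae, video: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    # Hypothetical sketch: encode a video in temporal chunks instead of all at
    # once, so peak activation memory scales with `chunk_size` rather than the
    # full frame count. `video` is (batch, channels, frames, height, width).
    #
    # NOTE: this naive version encodes each chunk independently. A causal 3D
    # VAE mixes information across frames, so results at chunk boundaries will
    # differ from a single full-video pass unless conv state is carried over
    # (or chunks are overlapped and blended).
    latent_chunks = []
    for start in range(0, video.shape[2], chunk_size):
        chunk = video[:, :, start : start + chunk_size]
        with torch.no_grad():
            # Assumes a diffusers-style API returning a latent distribution.
            latent_chunks.append(vae.encode(chunk).latent_dist.sample())
        # Activations for this chunk are freed before the next one is encoded,
        # which is where the memory savings come from.
    return torch.cat(latent_chunks, dim=2)
```

A framewise `decode` would follow the same pattern in reverse, iterating over the latent frame dimension. The main design question is how to handle the causal temporal receptive field at chunk boundaries without re-encoding overlapping frames more than necessary.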
As always, contributions are welcome 🤗 Happy new year!