Encountering persistently poor reconstruction ability of DiffAE when used for 9-frame videos #54

evaprakash · 2023-04-15T07:32:56Z

Hello,

I'm experimenting with extending the DiffAE model to capture the semantic information of short video clips and then reconstruct them. The input is video clips of 9 three-channel RGB frames, and the semantic code is a 512-dimensional vector. However, the loss plateaus quickly at around 10e-3/10e-4 within just 20 epochs, at which point I stopped training. I found that the semantic code is not learning anything. Thinking that I needed a higher-quality semantic encoder for videos, I swapped the half-unit semantic encoder for an off-the-shelf pre-initialized model. But this does not seem to help: the model plateaus at a similar loss in the same number of epochs.

Would you have any suggestions for model modifications that might help the semantic encoder learn better? One thought I had was that the decoder may not be deep enough to decode the complex semantic codes of a video. How did you decide the dimensions of the stochastic encoder/decoder unit?
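One way to make "the semantic code is not learning anything" concrete is to check whether the 512-dimensional codes actually differ between clips. The sketch below is illustrative only and is not part of DiffAE: `code_collapse_stats` is a hypothetical helper, and it assumes the codes have already been extracted from the encoder as plain lists of floats. The synthetic `varied` and `collapsed` batches are made-up data standing in for real encoder outputs.

```python
import math
import random


def code_collapse_stats(codes):
    """Summarize how much a batch of semantic codes varies across inputs.

    codes: list of equal-length lists of floats, one code per clip.
    Returns (mean per-dimension std, mean pairwise cosine similarity).
    """
    n, dim = len(codes), len(codes[0])

    # Per-dimension (population) standard deviation across the batch.
    stds = []
    for d in range(dim):
        col = [c[d] for c in codes]
        mu = sum(col) / n
        stds.append(math.sqrt(sum((v - mu) ** 2 for v in col) / n))
    mean_std = sum(stds) / dim

    # Cosine similarity between two codes; eps guards against zero vectors.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-12)

    # Average over all distinct pairs of codes in the batch.
    sims = [cos(codes[i], codes[j]) for i in range(n) for j in range(i + 1, n)]
    return mean_std, sum(sims) / len(sims)


# Synthetic stand-ins for encoder outputs (assumption: 8 clips, 512-dim codes).
rng = random.Random(0)
varied = [[rng.gauss(0, 1) for _ in range(512)] for _ in range(8)]
base = [rng.gauss(0, 1) for _ in range(512)]
collapsed = [[v + rng.gauss(0, 1e-3) for v in base] for _ in range(8)]

print("varied:   ", code_collapse_stats(varied))
print("collapsed:", code_collapse_stats(collapsed))
```

A near-zero per-dimension spread together with a pairwise cosine similarity close to 1 would suggest that the encoder emits almost the same code for every clip, so the decoder is effectively ignoring it.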

Thanks!
