You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm experimenting enhancing the DiffAE model to capture the semantic information of small video clips and then to reconstruct them. The input is video clips with 9 3-channel RGB frames and the semantic code is a 512 dimensional vector. However, the model stabilizes quickly to a loss of 10e-3/10e-4 in just 20 epochs, at which point I stopped the training. I discovered that the semantic code is not learning anything. Thinking that I needed a higher quality semantic encoder for videos, I swapped the half-unit semantic encoder for an off-the-shelf preinitialized model. But this does not seem to help - the model stabilizes to a similar loss in the same number of epochs. Would you have any suggestions on likely model modifications to help the semantic encoder learn better? One thought I had was that possibly the decoder is not deep enough to decode the complex semantic codes of a video. How did you decide the dimensions of the stochastic encoder/decoder unit?
Thanks!
The text was updated successfully, but these errors were encountered:
Hello,
I'm experimenting enhancing the DiffAE model to capture the semantic information of small video clips and then to reconstruct them. The input is video clips with 9 3-channel RGB frames and the semantic code is a 512 dimensional vector. However, the model stabilizes quickly to a loss of 10e-3/10e-4 in just 20 epochs, at which point I stopped the training. I discovered that the semantic code is not learning anything. Thinking that I needed a higher quality semantic encoder for videos, I swapped the half-unit semantic encoder for an off-the-shelf preinitialized model. But this does not seem to help - the model stabilizes to a similar loss in the same number of epochs. Would you have any suggestions on likely model modifications to help the semantic encoder learn better? One thought I had was that possibly the decoder is not deep enough to decode the complex semantic codes of a video. How did you decide the dimensions of the stochastic encoder/decoder unit?
Thanks!
The text was updated successfully, but these errors were encountered: