Hi, thanks for your implementation. I am confused about how you implement Audio + Video. In the paper, I see that the audio and video modalities are fused at the input, and that "We achieve this by concatenating a learned, modality-specific encoding to each input."
Could you give an example of using your "learned, modality-specific encoding" to concatenate these two modalities? What should the input look like so that I can feed the data into your Perceiver model?
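To make the question concrete, here is a minimal NumPy sketch of my current understanding. All shapes, names, and the padding step are my assumptions, not taken from your code: I assume each modality is flattened to `(num_tokens, feature_dim)`, features are zero-padded to a common width, a per-modality learned vector is concatenated to every token, and the two token sequences are then concatenated along the token axis before going into the Perceiver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: each modality flattened to (num_tokens, feature_dim).
video = rng.standard_normal((16, 64))  # e.g. 16 video tokens, 64 features
audio = rng.standard_normal((50, 32))  # e.g. 50 audio tokens, 32 features

# Assumed size of the learned modality-specific encoding. In a real model
# these would be learned parameters (e.g. nn.Parameter); fixed arrays here.
mod_dim = 8
video_enc = rng.standard_normal(mod_dim)
audio_enc = rng.standard_normal(mod_dim)

def tag(tokens, enc):
    # Concatenate the modality encoding onto every token's feature vector.
    expanded = np.broadcast_to(enc, (tokens.shape[0], enc.shape[0]))
    return np.concatenate([tokens, expanded], axis=-1)

# Zero-pad features to a common width so both modalities share one
# channel dimension before tagging (my assumption about how this is handled).
max_dim = max(video.shape[1], audio.shape[1])
def pad(tokens):
    return np.pad(tokens, ((0, 0), (0, max_dim - tokens.shape[1])))

# Fuse along the token axis: the Perceiver then attends over all 66 tokens.
fused = np.concatenate(
    [tag(pad(video), video_enc), tag(pad(audio), audio_enc)], axis=0
)
print(fused.shape)  # → (66, 72): 16+50 tokens, 64+8 channels each
```

Is this roughly what the fused input should look like, or does your implementation tag and combine the modalities differently?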