Reference Image Concatenation #3

Open
kradkfl opened this issue Jun 12, 2024 · 3 comments

kradkfl commented Jun 12, 2024

Hi!

Thanks for open-sourcing your code; it's helpful to see a reference implementation. I did have a question, though:

In your paper, you don't mention concatenating a "reference image" as an additional input to the stage-1 model, but the code seems to have this. Is it required to achieve results similar to the demos? If so, did you find any benefit to using more than one?

liutaocode (Owner) commented

Yes, your observation is correct. We also tested the number of reference frames concatenated in this first stage, ranging from 0 to 10, and found that the difference is not significant. There are a few key reasons:

  1. Most of the HDTF dataset consists of frontal faces, with few multi-angle shots, so the dataset is not very challenging.
  2. The 512-dimensional motion latent already encodes mouth-related information, so providing more reference frames adds little.

Since any talking-head model that supports arbitrary speakers will have at least one frame available as a reference, all of our results use a first-stage model with a reference-image count of 1 (N=1) at inference time, to make the most of the available information.

My suggestions are as follows:

  • If your final scenario requires multiple angles, or the expanded mouth mask includes some background, I recommend feeding as many frames as possible to the rendering stage for reference (a sketch of this concatenation follows after this list).
  • Otherwise, you can proceed without adding reference frames.
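
For concreteness, here is a minimal PyTorch sketch of what channel-wise reference concatenation could look like at the input of the first-stage denoiser. The function name, tensor shapes, and the choice to stack along channels are all illustrative assumptions, not this repository's actual API:

```python
import torch

def build_denoiser_input(noisy_target, masked_frame, ref_frames):
    """Stack the noisy target, the masked frame, and N reference frames
    along the channel axis, giving 3 * (2 + N) input channels.
    (Hypothetical helper; the repo's real input layout may differ.)"""
    # ref_frames: (B, N, 3, H, W) -> (B, 3*N, H, W)
    b, n, c, h, w = ref_frames.shape
    refs = ref_frames.reshape(b, n * c, h, w)
    return torch.cat([noisy_target, masked_frame, refs], dim=1)

# Example with N = 1, the setting used for the results above.
b, h, w = 2, 256, 256
x = build_denoiser_input(
    torch.randn(b, 3, h, w),     # noisy target at the current diffusion step
    torch.randn(b, 3, h, w),     # target frame with the mouth region masked out
    torch.randn(b, 1, 3, h, w),  # one reference frame (N = 1)
)
print(x.shape)  # torch.Size([2, 9, 256, 256])
```

With this layout, going from N=1 to N=10 only changes the first convolution's input channels, which is consistent with it being cheap to ablate.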


kradkfl commented Jun 18, 2024

Thanks! Did you find that, for in-the-wild predictions, the additional reference frames helped even with a model trained solely on HDTF? Or did training on HDTF encourage the model to ignore the reference frames?

liutaocode (Owner) commented

Hello. We haven't conducted experiments beyond HDTF, but I can try to analyze the situation logically.

Intuitively, adding references during the diffusion rendering stage seems effective. The 512-dimensional motion latent space we use encodes both motion and color information, so it is challenging for the latent alone to reconstruct the masked area perfectly in both respects. In theory, using some reference frames should yield better reconstruction: the color information can be derived from the references, freeing the latent space to focus more on motion.

This analysis is supported by the 7th row of Table 1 in reference [1], which shows that a 512-dimensional latent space alone is not sufficient to reconstruct an arbitrary facial image. However, when modeling smaller regions, such as the mouth area, the situation may differ:

(1) For in-the-wild datasets, the latent space may struggle to accurately reproduce the area within the mask, since it must accommodate facial imagery from any individual. I suggest adding reference concatenation so the latent space can focus more on motion (a sketch follows below).

(2) For HDTF, given that our tests with N ranging from 0 to 10 showed no difference, it is likely unnecessary. Since HDTF features only about 300 people, the diffusion model might easily learn the distribution of these individuals' mouth regions, leading to overfitting. This could be exactly what you described: "training with HDTF encourages the model to ignore the reference frames."
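
To make the color-vs-motion split concrete, below is a hedged PyTorch sketch (again, not the actual module in this repo) of one way a 512-d latent could modulate the renderer's features FiLM-style, while appearance/color enters through concatenated reference channels. `LatentModulatedBlock` and all shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LatentModulatedBlock(nn.Module):
    """Conv block whose feature maps are scaled/shifted by a 512-d latent
    (FiLM-style conditioning). With appearance supplied by concatenated
    reference channels, the latent is free to specialize in mouth motion.
    (Hypothetical module, not the repository's real architecture.)"""
    def __init__(self, in_ch, out_ch, latent_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.to_scale_shift = nn.Linear(latent_dim, 2 * out_ch)

    def forward(self, x, z):
        h = self.conv(x)
        # Per-channel scale and shift derived from the motion latent.
        scale, shift = self.to_scale_shift(z).chunk(2, dim=1)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

# Masked target (3 ch) + one reference frame (3 ch) concatenated = 6 channels.
block = LatentModulatedBlock(in_ch=6, out_ch=64)
x = torch.randn(2, 6, 256, 256)  # masked target + reference, channel-stacked
z = torch.randn(2, 512)          # 512-d motion latent
print(block(x, z).shape)         # torch.Size([2, 64, 256, 256])
```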

Reference:
[1] Preechakul, K., Chatthee, N., Wizadwongsa, S., et al. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 10619–10629.
