PHASE 1 - APPROACH
Take the audio file, train the audio attention model to predict the frame / head mask, and repeat per frame (rough sketch below).
Once you have masks for all the frames,
run them through an SD pipeline / LoRAs / models etc.
The problem with the existing code is that it seems far-fetched that it will actually converge. Taking this off-the-shelf model instead should get an EMO look-alike quite quickly.
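A minimal sketch of what the audio-attention stage could look like, assuming per-frame audio features are already available (e.g. wav2vec2 or EnCodec embeddings); `AudioToMaskModel`, the feature dimension, and the mask resolution are made-up placeholders, not code from this repo:

```python
import torch
import torch.nn as nn

class AudioToMaskModel(nn.Module):
    """Hypothetical sketch: attend over a window of audio features and
    predict a coarse head/face mask for the current video frame."""
    def __init__(self, audio_dim=768, hidden_dim=256, mask_size=64):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.to_mask = nn.Sequential(
            nn.Linear(hidden_dim, mask_size * mask_size),
            nn.Sigmoid(),  # per-pixel mask probability
        )
        self.mask_size = mask_size

    def forward(self, audio_window):
        # audio_window: (batch, frames_in_window, audio_dim)
        x = self.proj(audio_window)
        # self-attention over the audio context window
        x, _ = self.attn(x, x, x)
        # pool the window and decode a mask for the centre frame
        pooled = x.mean(dim=1)
        mask = self.to_mask(pooled)
        return mask.view(-1, 1, self.mask_size, self.mask_size)

# usage: features for a 16-frame audio window, e.g. from wav2vec2
feats = torch.randn(2, 16, 768)
masks = AudioToMaskModel()(feats)  # (2, 1, 64, 64)
```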
```python
import torch
import torch.nn as nn
from diffusers import AutoencoderKL, ControlNetModel

device = "cuda" if torch.cuda.is_available() else "cpu"

class FramesEncodingVAE(nn.Module):
    """
    FramesEncodingVAE combines the encoding of reference and motion frames with additional
    components such as ReferenceNet, SpeedEncoder, and ControlNetMediaPipeFace, as depicted
    in the Frames Encoding part of the diagram.
    """
    def __init__(self, latent_dim, img_size, reference_net):
        super().__init__()
        self.vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
        self.vae.to(device)  # move the model to the appropriate device (e.g. GPU)
        self.latent_dim = latent_dim
        self.img_size = img_size
        self.reference_net = reference_net

        # SpeedEncoder can be implemented as needed (assumed to be defined elsewhere)
        num_speed_buckets = 9
        speed_embedding_dim = 64
        self.speed_encoder = SpeedEncoder(num_speed_buckets, speed_embedding_dim)

        # Initialize ControlNetMediaPipeFace
        self.controlnet = ControlNetModel.from_pretrained(
            "CrucibleAI/ControlNetMediaPipeFace",
            subfolder="diffusion_sd15",
            torch_dtype=torch.float16,
        )

    def forward(self, reference_image, motion_frames, speed_value):
        # Process reference and motion frames through ControlNetMediaPipeFace
        # (note: ControlNetModel returns conditioning residuals, not images - see below)
        reference_controlnet_output = self.controlnet(reference_image)
        motion_controlnet_outputs = [self.controlnet(frame) for frame in motion_frames]

        # Encode reference and motion frames using the VAE
        reference_latents = self.vae.encode(reference_controlnet_output).latent_dist.sample()
        motion_latents = [
            self.vae.encode(output).latent_dist.sample()
            for output in motion_controlnet_outputs
        ]

        # Scale the latent vectors (optional, depends on the VAE scaling factor)
        reference_latents = reference_latents * 0.18215
        motion_latents = [latent * 0.18215 for latent in motion_latents]

        # Process reference features with ReferenceNet
        reference_features = self.reference_net(reference_latents)

        # Embed speed value
        speed_embedding = self.speed_encoder(speed_value)

        # Combine features
        combined_features = torch.cat([reference_features] + motion_latents + [speed_embedding], dim=1)

        # Decode the combined features
        reconstructed_frames = self.vae.decode(combined_features).sample
        return reconstructed_frames

    def vae_loss(self, recon_frames, reference_image, motion_frames):
        # Compute VAE loss (note: diffusers' AutoencoderKL does not provide a loss_function;
        # a reconstruction + KL term would need to be implemented here)
        loss = self.vae.loss_function(recon_frames, torch.cat([reference_image] + motion_frames, dim=1))
        return loss["loss"]
```
When you initialize ControlNetMediaPipeFace using ControlNetModel.from_pretrained(), it returns an instance of the ControlNetModel. This model, when executed, produces an output of type ControlNetOutput, which contains two main components:
down_block_res_samples: A tuple of downsampled activations at different resolutions for each downsampling block in the model. These activations are tensors representing the features at various stages of the model's downsampling path.
mid_block_res_sample: A tensor representing the activation of the middle block (the lowest sample resolution) in the model.
The ControlNetModel does not directly return a latent representation or an image. Instead, it outputs features that can be used to condition a generative model like a UNet. These features are not in a latent space suitable for direct sampling, nor are they pixel data like a PIL image. They are intermediate representations that help guide the generative process in a diffusion model setup.
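For illustration only (not from this repo), this is roughly how those residuals get consumed in diffusers; the SD 1.5 UNet checkpoint ID and the dummy tensor shapes are assumptions, and a CUDA GPU is assumed for the fp16 weights:

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

controlnet = ControlNetModel.from_pretrained(
    "CrucibleAI/ControlNetMediaPipeFace", subfolder="diffusion_sd15", torch_dtype=torch.float16
).to("cuda")
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

# dummy inputs: SD 1.5 latents are 4x64x64, text embeddings are 77x768
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([10], device="cuda")
text_emb = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")
# the conditioning input is a rendered MediaPipe face-landmark image, not a latent
landmark_image = torch.randn(1, 3, 512, 512, dtype=torch.float16, device="cuda")

# ControlNet returns the down/mid residuals described above
down_res, mid_res = controlnet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    controlnet_cond=landmark_image,
    return_dict=False,
)

# those residuals condition the UNet's noise prediction
noise_pred = unet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```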
Slight problem with this approach: no lip, teeth, or tongue position.
https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace
I have this code that may be a starting point - it uses Facebook AudioCraft / SD:
https://github.com/johndpope/Emote-hack/tree/main/junk/AudioAttention
Will wire the above model in over the coming days.
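This is not the repo's AudioAttention code, just a hedged sketch of pulling audio features with EnCodec (the neural codec AudioCraft builds on) via transformers; the checkpoint, the "speech.wav" path, and the per-frame pooling are assumptions:

```python
import torchaudio
from transformers import AutoProcessor, EncodecModel

# EnCodec latents/codes are one possible per-frame audio feature
# to feed the audio attention model sketched earlier
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform, sr = torchaudio.load("speech.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(
    raw_audio=waveform[0].numpy(),
    sampling_rate=processor.sampling_rate,
    return_tensors="pt",
)
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
# encoded.audio_codes: discrete codes at ~75 Hz; slice / pool them per video frame
print(encoded.audio_codes.shape)
```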
https://github.com/Klopolupka007/ImageCreator/blob/bf0cbb1033d6e3db4f3adef6c1aee8f51137264a/neural_apis/stable_diff_api/models/stable_diff_face_model.py#L14
I asked Claude 3 to plug this in - the result is the FramesEncodingVAE code and the ControlNet output notes above.
https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace/discussions/20