
consider AudioAttention + ControlNetMediaPipeFace = EMO #23

Closed
johndpope opened this issue Mar 20, 2024 · 0 comments

Comments

johndpope (Owner) commented Mar 20, 2024

https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace

PHASE 1 - APPROACH
Take the audio file and train the audio attention model to predict the frame / head mask; repeat for each frame.
Once you have masks for all the frames, run them through the SD pipeline / LoRAs / models etc.

[Screenshot from 2024-03-21 06-23-30]

The problem with the existing code is that it seems far-fetched it will actually converge. Taking this off-the-shelf model should get an EMO look-alike quite quickly - see the sketch below.
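For reference, the off-the-shelf checkpoint drops straight into the standard diffusers ControlNet pipeline. A minimal sketch (the landmark image filename and the prompt are placeholders, not anything from this repo):

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Load the MediaPipe face ControlNet and attach it to a stock SD 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "CrucibleAI/ControlNetMediaPipeFace",
    subfolder="diffusion_sd15",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# "face_landmarks.png" is a placeholder: a rendered MediaPipe face-mesh annotation
# for one driving frame (one annotation per predicted mask/frame).
condition = load_image("face_landmarks.png")
frame = pipe(
    "a photo of a person talking, studio lighting",
    image=condition,
    num_inference_steps=25,
).images[0]
frame.save("frame_0000.png")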

I have this code that may be a starting point - it uses Facebook's audiocraft / SD:
https://github.com/johndpope/Emote-hack/tree/main/junk/AudioAttention

Will wire the above model in over the coming days.
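To make the audio-attention step concrete, here is a hypothetical sketch of an audio-to-mask cross-attention module - the class name, dimensions and upstream audio encoder are all illustrative, not the actual AudioAttention code:

import torch
import torch.nn as nn

class AudioToMaskAttention(nn.Module):
    """Cross-attend a grid of mask queries to audio features and predict a head mask per frame."""

    def __init__(self, audio_dim=768, hidden_dim=256, mask_size=64, num_heads=4):
        super().__init__()
        self.mask_size = mask_size
        # One learned query per mask pixel.
        self.mask_queries = nn.Parameter(torch.randn(mask_size * mask_size, hidden_dim))
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.to_logits = nn.Linear(hidden_dim, 1)

    def forward(self, audio_features):
        # audio_features: (batch, audio_frames, audio_dim), e.g. from an audiocraft / wav2vec-style encoder.
        b = audio_features.shape[0]
        audio = self.audio_proj(audio_features)                     # (b, T, hidden)
        queries = self.mask_queries.unsqueeze(0).expand(b, -1, -1)  # (b, H*W, hidden)
        attended, _ = self.cross_attn(queries, audio, audio)        # (b, H*W, hidden)
        logits = self.to_logits(attended)                           # (b, H*W, 1)
        return logits.view(b, 1, self.mask_size, self.mask_size)    # head-mask logits per frame

# Usage: masks = torch.sigmoid(AudioToMaskAttention()(audio_feats)), trained with BCE
# against ground-truth head masks extracted from the training video.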

https://github.com/Klopolupka007/ImageCreator/blob/bf0cbb1033d6e3db4f3adef6c1aee8f51137264a/neural_apis/stable_diff_api/models/stable_diff_face_model.py#L14

I asked Claude 3 to plug this in -

import torch
import torch.nn as nn
from diffusers import AutoencoderKL, ControlNetModel  # ControlNetModel lives in diffusers, not a standalone "controlnet" package

device = "cuda" if torch.cuda.is_available() else "cpu"

class FramesEncodingVAE(nn.Module):
    """
    FramesEncodingVAE combines the encoding of reference and motion frames with additional components
    such as ReferenceNet, SpeedEncoder, and ControlNetMediaPipeFace as depicted in the Frames Encoding part of the diagram.
    """

    def __init__(self, latent_dim, img_size, reference_net):
        super().__init__()
        self.vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
        self.vae.to(device)  # Move the model to the appropriate device (e.g., GPU)
        self.latent_dim = latent_dim
        self.img_size = img_size
        self.reference_net = reference_net

        # SpeedEncoder is assumed to be defined elsewhere in the repo.
        num_speed_buckets = 9
        speed_embedding_dim = 64
        self.speed_encoder = SpeedEncoder(num_speed_buckets, speed_embedding_dim)

        # Initialize ControlNetMediaPipeFace
        self.controlnet = ControlNetModel.from_pretrained(
            "CrucibleAI/ControlNetMediaPipeFace",
            subfolder="diffusion_sd15",
            torch_dtype=torch.float16
        ).to(device)

    def forward(self, reference_image, motion_frames, speed_value):
        # NOTE: ControlNetModel's forward actually expects noisy latents, a timestep,
        # text embeddings and the conditioning image (see the explanation below), so
        # calling it on a raw image like this is a placeholder, not working code.
        reference_controlnet_output = self.controlnet(reference_image)
        motion_controlnet_outputs = [self.controlnet(frame) for frame in motion_frames]

        # Encode reference and motion frames using the VAE
        reference_latents = self.vae.encode(reference_controlnet_output).latent_dist.sample()
        motion_latents = [self.vae.encode(output).latent_dist.sample() for output in motion_controlnet_outputs]

        # Scale the latent vectors by the SD 1.x VAE scaling factor
        reference_latents = reference_latents * 0.18215
        motion_latents = [latent * 0.18215 for latent in motion_latents]

        # Process reference features with ReferenceNet
        reference_features = self.reference_net(reference_latents)

        # Embed speed value
        speed_embedding = self.speed_encoder(speed_value)

        # Combine features (the shapes would need to be reconciled before concatenation)
        combined_features = torch.cat([reference_features] + motion_latents + [speed_embedding], dim=1)

        # Decode the combined features
        reconstructed_frames = self.vae.decode(combined_features).sample

        return reconstructed_frames

    def vae_loss(self, recon_frames, reference_image, motion_frames):
        # diffusers' AutoencoderKL has no built-in loss_function; a reconstruction + KL
        # term would have to be implemented here.
        loss = self.vae.loss_function(recon_frames, torch.cat([reference_image] + motion_frames, dim=1))
        return loss["loss"]

When you initialize ControlNetMediaPipeFace using ControlNetModel.from_pretrained(), it returns an instance of the ControlNetModel. This model, when executed, produces an output of type ControlNetOutput, which contains two main components:

down_block_res_samples: A tuple of downsampled activations at different resolutions for each downsampling block in the model. These activations are tensors representing the features at various stages of the model's downsampling path.

mid_block_res_sample: A tensor representing the activation of the middle block (the lowest sample resolution) in the model.

The ControlNetModel does not directly return a latent representation or an image. Instead, it outputs features that can be used to condition a generative model like a UNet. These features are not in a latent space suitable for direct sampling, nor are they pixel data like a PIL image. They are intermediate representations that help guide the generative process in a diffusion model setup.
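Concretely, those residuals get added into the UNet's skip connections via the diffusers API - roughly like this (dummy tensors stand in for the latents, timestep and text embeddings; only the ControlNet/UNet call signatures are the real API):

import torch
from diffusers import ControlNetModel, UNet2DConditionModel

controlnet = ControlNetModel.from_pretrained(
    "CrucibleAI/ControlNetMediaPipeFace", subfolder="diffusion_sd15"
)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

noisy_latents = torch.randn(1, 4, 64, 64)   # VAE latents for a 512x512 image
timestep = torch.tensor([500])
text_embeds = torch.randn(1, 77, 768)       # CLIP text encoder output
face_cond = torch.randn(1, 3, 512, 512)     # rendered MediaPipe face annotation

down_res, mid_res = controlnet(
    noisy_latents,
    timestep,
    encoder_hidden_states=text_embeds,
    controlnet_cond=face_cond,
    return_dict=False,
)

# The residuals are not decoded directly; they are added into the UNet's
# down/mid blocks to steer the denoising prediction.
noise_pred = unet(
    noisy_latents,
    timestep,
    encoder_hidden_states=text_embeds,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample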

A slight problem with this approach: no lips, teeth or tongue position.
[Screenshot from 2024-03-21 18-25-16]

https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace/discussions/20

[Screenshot from 2024-03-22 08-06-26]
