PHASE 1 - APPROACH
Take the audio file, train the audio attention model to predict the frame / head mask, and repeat per frame (rough sketch below).
Once you have masks for all the frames,
run them through an SD pipeline / LoRAs / models etc.
The problem with the existing code is that it seems far-fetched that it will actually converge. Taking this off-the-shelf model instead should get an EMO look-alike quite quickly.
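A minimal sketch of what the audio-attention stage could look like, assuming per-frame audio features are already available (e.g. wav2vec2 or EnCodec embeddings); `AudioToMaskModel`, the feature dimension, and the mask resolution are made-up placeholders, not code from this repo:

```python
import torch
import torch.nn as nn

class AudioToMaskModel(nn.Module):
    """Hypothetical sketch: attend over a window of audio features and
    predict a coarse head/face mask for the current video frame."""
    def __init__(self, audio_dim=768, hidden_dim=256, mask_size=64):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.to_mask = nn.Sequential(
            nn.Linear(hidden_dim, mask_size * mask_size),
            nn.Sigmoid(),  # per-pixel mask probability
        )
        self.mask_size = mask_size

    def forward(self, audio_window):
        # audio_window: (batch, frames_in_window, audio_dim)
        x = self.proj(audio_window)
        # self-attention over the audio context window
        x, _ = self.attn(x, x, x)
        # pool the window and decode a mask for the centre frame
        pooled = x.mean(dim=1)
        mask = self.to_mask(pooled)
        return mask.view(-1, 1, self.mask_size, self.mask_size)

# usage: features for a 16-frame audio window, e.g. from wav2vec2
feats = torch.randn(2, 16, 768)
masks = AudioToMaskModel()(feats)  # (2, 1, 64, 64)
```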
```python
import torch
import torch.nn as nn
from diffusers import AutoencoderKL, ControlNetModel

device = "cuda" if torch.cuda.is_available() else "cpu"

class FramesEncodingVAE(nn.Module):
    """
    FramesEncodingVAE combines the encoding of reference and motion frames with additional
    components such as ReferenceNet, SpeedEncoder, and ControlNetMediaPipeFace, as depicted
    in the Frames Encoding part of the diagram.
    """
    def __init__(self, latent_dim, img_size, reference_net):
        super().__init__()
        self.vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
        self.vae.to(device)  # move the model to the appropriate device (e.g. GPU)
        self.latent_dim = latent_dim
        self.img_size = img_size
        self.reference_net = reference_net

        # SpeedEncoder can be implemented as needed (assumed to be defined elsewhere)
        num_speed_buckets = 9
        speed_embedding_dim = 64
        self.speed_encoder = SpeedEncoder(num_speed_buckets, speed_embedding_dim)

        # Initialize ControlNetMediaPipeFace
        self.controlnet = ControlNetModel.from_pretrained(
            "CrucibleAI/ControlNetMediaPipeFace",
            subfolder="diffusion_sd15",
            torch_dtype=torch.float16,
        )

    def forward(self, reference_image, motion_frames, speed_value):
        # Process reference and motion frames through ControlNetMediaPipeFace
        # (note: ControlNetModel returns conditioning residuals, not images - see below)
        reference_controlnet_output = self.controlnet(reference_image)
        motion_controlnet_outputs = [self.controlnet(frame) for frame in motion_frames]

        # Encode reference and motion frames using the VAE
        reference_latents = self.vae.encode(reference_controlnet_output).latent_dist.sample()
        motion_latents = [
            self.vae.encode(output).latent_dist.sample()
            for output in motion_controlnet_outputs
        ]

        # Scale the latent vectors (optional, depends on the VAE scaling factor)
        reference_latents = reference_latents * 0.18215
        motion_latents = [latent * 0.18215 for latent in motion_latents]

        # Process reference features with ReferenceNet
        reference_features = self.reference_net(reference_latents)

        # Embed speed value
        speed_embedding = self.speed_encoder(speed_value)

        # Combine features
        combined_features = torch.cat([reference_features] + motion_latents + [speed_embedding], dim=1)

        # Decode the combined features
        reconstructed_frames = self.vae.decode(combined_features).sample
        return reconstructed_frames

    def vae_loss(self, recon_frames, reference_image, motion_frames):
        # Compute VAE loss (note: diffusers' AutoencoderKL does not provide a loss_function;
        # a reconstruction + KL term would need to be implemented here)
        loss = self.vae.loss_function(recon_frames, torch.cat([reference_image] + motion_frames, dim=1))
        return loss["loss"]
```
When you initialize ControlNetMediaPipeFace using ControlNetModel.from_pretrained(), it returns an instance of the ControlNetModel. This model, when executed, produces an output of type ControlNetOutput, which contains two main components:
down_block_res_samples: A tuple of downsampled activations at different resolutions for each downsampling block in the model. These activations are tensors representing the features at various stages of the model's downsampling path.
mid_block_res_sample: A tensor representing the activation of the middle block (the lowest sample resolution) in the model.
The ControlNetModel does not directly return a latent representation or an image. Instead, it outputs features that can be used to condition a generative model like a UNet. These features are not in a latent space suitable for direct sampling, nor are they pixel data like a PIL image. They are intermediate representations that help guide the generative process in a diffusion model setup.
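For illustration only (not from this repo), this is roughly how those residuals get consumed in diffusers; the SD 1.5 UNet checkpoint ID and the dummy tensor shapes are assumptions, and a CUDA GPU is assumed for the fp16 weights:

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

controlnet = ControlNetModel.from_pretrained(
    "CrucibleAI/ControlNetMediaPipeFace", subfolder="diffusion_sd15", torch_dtype=torch.float16
).to("cuda")
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

# dummy inputs: SD 1.5 latents are 4x64x64, text embeddings are 77x768
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([10], device="cuda")
text_emb = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")
# the conditioning input is a rendered MediaPipe face-landmark image, not a latent
landmark_image = torch.randn(1, 3, 512, 512, dtype=torch.float16, device="cuda")

# ControlNet returns the down/mid residuals described above
down_res, mid_res = controlnet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    controlnet_cond=landmark_image,
    return_dict=False,
)

# those residuals condition the UNet's noise prediction
noise_pred = unet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```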
Slight problem with this approach: no lip, teeth, or tongue position.
https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace
I have this code that may be a starting point - it uses Facebook AudioCraft / SD:
https://github.com/johndpope/Emote-hack/tree/main/junk/AudioAttention
Will wire the above model in over the coming days.
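This is not the repo's AudioAttention code, just a hedged sketch of pulling audio features with EnCodec (the neural codec AudioCraft builds on) via transformers; the checkpoint, the "speech.wav" path, and the per-frame pooling are assumptions:

```python
import torchaudio
from transformers import AutoProcessor, EncodecModel

# EnCodec latents/codes are one possible per-frame audio feature
# to feed the audio attention model sketched earlier
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform, sr = torchaudio.load("speech.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(
    raw_audio=waveform[0].numpy(),
    sampling_rate=processor.sampling_rate,
    return_tensors="pt",
)
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
# encoded.audio_codes: discrete codes at ~75 Hz; slice / pool them per video frame
print(encoded.audio_codes.shape)
```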
https://github.com/Klopolupka007/ImageCreator/blob/bf0cbb1033d6e3db4f3adef6c1aee8f51137264a/neural_apis/stable_diff_api/models/stable_diff_face_model.py#L14
I asked Claude 3 to plug this in - the result is the FramesEncodingVAE code and the ControlNet output notes above.
https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace/discussions/20