
Conversation

@a-r-r-o-w
Contributor

@a-r-r-o-w a-r-r-o-w commented Feb 4, 2024

What does this PR do?

Link to Colab notebook

Fixes #6688.

This PR adds MotionCtrl to diffusers. These changes are not really in a mergeable state yet. It's still a WIP, and I wanted to get a working example with diffusers before figuring out how best to add it to core/community. Currently, I've just hacked through the UNet code, adding MotionCtrl-specific stuff.

Thanks to ModelsLab for providing GPU support.

Before submitting

Who can review?

@DN6 @sayakpaul @patrickvonplaten

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Feb 4, 2024

Link to converted SVD model: https://huggingface.co/a-r-r-o-w/motionctrl-svd/

@wzhouxiff @jiangyzy @xinntao Thank you for your amazing work! Maybe it makes sense to move this to one of the authors' accounts or under the TencentARC organization.

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Feb 4, 2024

I'd appreciate some help with debugging.

Testing code
import torch

from diffusers.pipelines.stable_video_diffusion.pipeline_stable_video_motionctrl_diffusion import StableVideoMotionCtrlDiffusionPipeline
from diffusers.utils import load_image, export_to_gif

pipe = StableVideoMotionCtrlDiffusionPipeline.from_pretrained(
    "a-r-r-o-w/motionctrl-svd", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

camera_pose = ...  # use one of the camera pose json files from the authors' repo
num_frames = 14

frames = pipe(
    image=image,
    camera_pose=camera_pose[:num_frames],
    num_frames=num_frames,
    num_inference_steps=25,
    decode_chunk_size=4,
    motion_bucket_id=127,
    min_guidance_scale=1.0,
    max_guidance_scale=2.5,  # matches the authors' sampler config (min_scale 1.0, max_scale 2.5)
    generator=torch.manual_seed(42),
).frames[0]

export_to_gif(frames, "animation.gif")
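
For completeness, here is a rough sketch of how one of those camera pose json files could be loaded. This is my own illustration, not code from the PR: the filename is hypothetical, and it assumes each file is a list of per-frame flattened 3x4 RT matrices (12 floats per frame); adjust if the actual format differs.

import json

import torch

# Hypothetical example: load a camera pose json file from the authors' repo.
# Assumes a list of per-frame flattened 3x4 RT matrices (12 floats per frame).
with open("test_camera_poses.json") as f:  # hypothetical filename
    camera_pose = torch.tensor(json.load(f))  # shape: (num_frames, 12)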
Results with the DDIM and EulerDiscrete schedulers (result GIFs not reproduced here).

The sampling config used in the authors' implementation is:

    sampler_config:
      target: sgm.modules.diffusionmodules.sampling.EulerEDMSampler
      params:
        num_steps: 25
        discretization_config:
          target: sgm.modules.diffusionmodules.discretizer.EDMDiscretization
          params:
            sigma_max: 700.0

        guider_config:
          target: sgm.modules.diffusionmodules.guiders.LinearPredictionGuider
          params:
            num_frames: 14
            max_scale: 2.5
            min_scale: 1.0

I think the sigma_max property is not yet supported by the EulerDiscreteScheduler in diffusers, so that could be one reason for the bad results. I'm hoping the model conversion went correctly, since there were no unexpected/missing key errors with strict mode, but it would be great if anyone could verify.
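
For reference, here is a minimal sketch of the sigma schedule that EDMDiscretization with sigma_max=700 produces. This is just the standard Karras/EDM formula; rho=7 and sigma_min=0.002 are assumed defaults, since the config above only overrides sigma_max.

import numpy as np

# Hedged sketch of the EDM/Karras sigma schedule (Karras et al. discretization).
def edm_sigmas(num_steps=25, sigma_min=0.002, sigma_max=700.0, rho=7.0):
    ramp = np.linspace(0, 1, num_steps)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    # Interpolate in sigma^(1/rho) space, then raise back to the rho-th power.
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

print(edm_sigmas()[:3])  # starts near sigma_max=700 and decays towards sigma_min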

@Yanting-K

Yanting-K commented Feb 5, 2024

I'd appreciate some help with debugging. […]

If there are any other questions about this PR, I'm ready to try reproducing MotionCtrl myself. Maybe I can give a little help.

@a-r-r-o-w a-r-r-o-w changed the title from [WIP] MotionCtrl to [WIP] MotionCtrl SVD on Feb 5, 2024
@a-r-r-o-w
Contributor Author

If there are any other questions about this PR, I'm ready to try reproducing MotionCtrl myself. Maybe I can give a little help.

Thanks for trying to look into this @Yanting-K! Nope, I do not have any other questions regarding the implementation. Just haven't found time to debug and fix this yet :(

cc @DN6

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Feb 7, 2024

The mistake above that caused bad results was using the wrong image encoder checkpoint. Since the authors freeze all layers of SVD and just train the attn2 and cc_projection layers in the UNet, we can reuse the image encoder/vae from SVD.
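
In case it helps anyone redoing the conversion, here is a rough sketch of what reusing the frozen SVD components could look like. This is only an illustration under assumptions, not the actual conversion or pipeline code from this PR: it assumes the base checkpoint stabilityai/stable-video-diffusion-img2vid and the pipeline class introduced here.

import torch
from transformers import CLIPVisionModelWithProjection
from diffusers.models import AutoencoderKLTemporalDecoder
from diffusers.pipelines.stable_video_diffusion.pipeline_stable_video_motionctrl_diffusion import StableVideoMotionCtrlDiffusionPipeline

# Reuse the frozen image encoder and VAE from the base SVD checkpoint, since
# MotionCtrl only trains the attn2 and cc_projection layers of the UNet.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    subfolder="image_encoder",
    torch_dtype=torch.float16,
    variant="fp16",
)
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    subfolder="vae",
    torch_dtype=torch.float16,
    variant="fp16",
)

pipe = StableVideoMotionCtrlDiffusionPipeline.from_pretrained(
    "a-r-r-o-w/motionctrl-svd",
    image_encoder=image_encoder,
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")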

Some results (manually downscaled):

Unfortunately, there is not much object movement here because there are no checkpoints for the Object Motion Control module (OMCM) for SVD. The authors only demonstrate the camera control module, which works great as we can see, but it isn't the level of control the community wants with SVD. More controllability exists in DragNUWA with SVD, which is something I've been working on in parallel and will open a PR for shortly. Maybe the OMCM for VideoCrafter can be used, but I haven't found the time to experiment or look into the details. I will be working on adding the Crafter family of models very soon, and will take a look at the OMCM stuff then.

@DN6 @sayakpaul @patrickvonplaten I believe this is ready for an initial review. I know the changes are not ideal because of the MotionCtrl-specific additions to the UNet/attention code. Let me know how you'd like to go about adding it to community/core. Thanks!

@a-r-r-o-w a-r-r-o-w changed the title from [WIP] MotionCtrl SVD to MotionCtrl SVD on Feb 7, 2024
@a-r-r-o-w
Contributor Author

There is one thing I do not understand here though... maybe @wzhouxiff could help me out. Why do you multiply camera_poses[:, :-1] by [3, 1, 4] for rescaling? The speed makes sense to me, but this just seemed a little arbitrary. The results seem to be the same with/without it.

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Feb 14, 2024

@sayakpaul @DN6 I have verified that the implementation is faithful to the original repository for SVD, so it should be a good candidate for a community version. Since this pipeline involves some light modification to the UNet attention layers, how would you suggest I convert it to a single-file pipeline? Does a somewhat hacky override of the forward method, pushing the cc_projection layers into the appropriate places, sound okay? This is how it's actually done in the original repository.
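
To make the idea concrete, here is a minimal sketch of the kind of modification involved. This is my own illustration, not code from this PR or the original repo: the class and argument names are hypothetical, and the pose is assumed to be a flattened 3x4 RT matrix per frame.

import torch
import torch.nn as nn

class CameraPoseProjection(nn.Module):
    # Sketch of the "cc_projection" idea: concatenate a per-frame camera pose
    # (flattened 3x4 RT matrix, 12 values) to the hidden states entering an
    # attention layer, then project back to the original channel dimension with
    # a learned linear layer. Only this layer and attn2 would be trained.
    def __init__(self, hidden_dim: int, pose_dim: int = 12):
        super().__init__()
        self.cc_projection = nn.Linear(hidden_dim + pose_dim, hidden_dim)

    def forward(self, hidden_states: torch.Tensor, camera_pose: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch * num_frames, seq_len, hidden_dim)
        # camera_pose:   (batch * num_frames, pose_dim), broadcast over the sequence
        pose = camera_pose[:, None, :].expand(-1, hidden_states.shape[1], -1)
        return self.cc_projection(torch.cat([hidden_states, pose], dim=-1))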

@sayakpaul
Member

For now, you can add it to research_projects with all the modeling changes and pipelining code. This is how it's done for ControlNetXS, for example.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w
Contributor Author

Thanks for running the tests! Seems like everything passed; the failing tests are unrelated, which means I (hopefully) haven't broken any existing code.

@a-r-r-o-w a-r-r-o-w mentioned this pull request Feb 18, 2024
@a-r-r-o-w a-r-r-o-w closed this Feb 21, 2024