From d20dfaf8c314f7fceade7ac6218dd88e64e8a164 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 6 Feb 2024 08:00:36 -0800 Subject: [PATCH 1/8] use cases --- docs/source/en/_toctree.yml | 2 + docs/source/en/using-diffusers/ip_adapter.md | 251 ++++++++++++++++++ .../en/using-diffusers/loading_adapters.md | 2 +- src/diffusers/loaders/ip_adapter.py | 5 + 4 files changed, 259 insertions(+), 1 deletion(-) create mode 100644 docs/source/en/using-diffusers/ip_adapter.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index a7f938fa662f..ab42a9a1b00e 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -58,6 +58,8 @@ - sections: - local: using-diffusers/textual_inversion_inference title: Textual inversion + - local: using-diffusers/ip_adapter + title: IP-Adapter - local: training/distributed_inference title: Distributed inference with multiple GPUs - local: using-diffusers/reusing_seeds diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md new file mode 100644 index 000000000000..2b87050f6621 --- /dev/null +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -0,0 +1,251 @@ + + +# IP-Adapter + +[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet. The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features. + +> [!TIP] +> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide. + +This guide will walk you through using IP-Adapter for various tasks and use cases. + +## General tasks + +Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. Feel free to use another pipeline such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff! + + + + +Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results. + +Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. + +The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. 
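Because the best value depends on how closely you want the output to follow the image prompt, it can help to sweep a few scales and compare the results side by side. The snippet below is a minimal sketch of such a sweep; it reuses the SDXL checkpoint, IP-Adapter weights, and image prompt from the example that follows, and the scale values are only illustrative.

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")

# generate with the same seed at several scales to compare how strongly the image prompt is applied
for scale in [0.3, 0.5, 0.7]:
    pipeline.set_ip_adapter_scale(scale)
    generator = torch.Generator(device="cpu").manual_seed(0)
    result = pipeline(
        prompt="a polar bear sitting in a chair drinking a milkshake",
        ip_adapter_image=image,
        generator=generator,
    ).images[0]
    result.save(f"ip_adapter_scale_{scale}.png")
```

The example below then walks through a single generation at a fixed scale of `0.6`.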
+ +```py +from diffusers import AutoPipelineForText2Image +from diffusers.utils import load_image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") +pipeline.set_ip_adapter_scale(0.6) +``` + +Create a text prompt and load an image prompt before passing them to the pipeline to generate an image. + +```py +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png") +generator = torch.Generator(device="cpu").manual_seed(0) +images = pipeline( + prompt="a polar bear sitting in a chair drinking a milkshake", + ip_adapter_image=image, + negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", + num_inference_steps=100, + generator=generator, +).images +images[0] +``` + +
+
+ +
IP-Adapter image
+
+
+ +
generated image
+
+
+ +
+ + +IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt. + +Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. + +The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. + +```py +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import load_image +import torch + +pipeline = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") +pipeline.set_ip_adapter_scale(0.6) +``` + +Pass the original image and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality. + +```py +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png") +ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_2.png") + +generator = torch.Generator(device="cpu").manual_seed(4) +images = pipeline( + prompt="best quality, high quality", + image=image, + ip_adapter_image=ip_image, + generator=generator, + strength=0.6, +).images +images[0] +``` + +
+
+ +
original image
+
+
+ +
IP-Adapter image
+
+
+ +
generated image
+
+
+ +
+ + +IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate. + +Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. + +The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. + +```py +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image +import torch + +pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16).to("cuda") +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") +pipeline.set_ip_adapter_scale(0.6) +``` + +Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality. + +```py +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png") +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png") +ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png") + +generator = torch.Generator(device="cpu").manual_seed(4) +images = pipeline( + prompt="a cute gummy bear waving", + image=image, + mask_image=mask_image, + ip_adapter_image=ip_image, + generator=generator, + num_inference_steps=100, +).images +images[0] +``` + +
+
+ +
original image
+
+
+ +
IP-Adapter image
+
+
+ +
generated image
+
+
+ +
+ + +IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with it's motion adapter, and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. + +The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. + +> [!WARNING] +> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline. + +```py +import torch +from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter +from diffusers.utils import export_to_gif +from diffusers.utils import load_image + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) +pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16) +scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, +) +pipeline.scheduler = scheduler +pipeline.enable_vae_slicing() + +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") +pipeline.set_ip_adapter_scale(0.6) +pipeline.enable_model_cpu_offload() +``` + +Pass a prompt and an image prompt to the pipeline to generate a short video. + +```py +ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png") + +output = pipeline( + prompt="A cute gummy bear waving", + negative_prompt="bad quality, worse quality, low resolution", + ip_adapter_image=ip_adapter_image, + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=torch.Generator(device="cpu").manual_seed(0), +) +frames = output.frames[0] +export_to_gif(frames, "animation1.gif") +``` + +
+
+ +
IP-Adapter image
+
+
+ +
generated video
+
+
+ +
+
+ +## Specific use cases + +IP-Adapters image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. Let's have a look at some of them. + +### Face model + +### Multi IP-Adapter + +### Instant generation + +### Structural control diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md index 086527245946..fcd6f8c7158d 100644 --- a/docs/source/en/using-diffusers/loading_adapters.md +++ b/docs/source/en/using-diffusers/loading_adapters.md @@ -308,7 +308,7 @@ image = pipeline(prompt=prompt).images[0] image ``` -## IP-Adapter +## IP-Adapter [IP-Adapter](https://ip-adapter.github.io/) is an effective and lightweight adapter that adds image prompting capabilities to a diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs. diff --git a/src/diffusers/loaders/ip_adapter.py b/src/diffusers/loaders/ip_adapter.py index b2581a60c1ab..adf9ce25d435 100644 --- a/src/diffusers/loaders/ip_adapter.py +++ b/src/diffusers/loaders/ip_adapter.py @@ -181,6 +181,11 @@ def load_ip_adapter( unet._load_ip_adapter_weights(state_dicts) def set_ip_adapter_scale(self, scale): + """ + Sets the conditioning scale between text and image. + """ + if not isinstance(scale, list): + scale = [scale] unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet for attn_processor in unet.attn_processors.values(): if isinstance(attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)): From c70e76a5a45286d91eaaaa0a44c9ceb689b434e2 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Wed, 7 Feb 2024 10:45:50 -0800 Subject: [PATCH 2/8] first draft --- docs/source/en/using-diffusers/ip_adapter.md | 216 +++++++++- .../en/using-diffusers/loading_adapters.md | 402 +----------------- 2 files changed, 223 insertions(+), 395 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index 2b87050f6621..fd693056570d 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -202,7 +202,6 @@ pipeline.scheduler = scheduler pipeline.enable_vae_slicing() pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -pipeline.set_ip_adapter_scale(0.6) pipeline.enable_model_cpu_offload() ``` @@ -221,7 +220,7 @@ output = pipeline( generator=torch.Generator(device="cpu").manual_seed(0), ) frames = output.frames[0] -export_to_gif(frames, "animation1.gif") +export_to_gif(frames, "gummy_bear.gif") ```
@@ -230,7 +229,7 @@ export_to_gif(frames, "animation1.gif")
IP-Adapter image
- +
generated video
@@ -242,10 +241,219 @@ export_to_gif(frames, "animation1.gif") IP-Adapters image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. Let's have a look at some of them. -### Face model +### Face model + +Generating accurate faces is challenging because they are complex and nuanced. [IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) is a face-specific IP-Adapter trained with face ID embeddings instead of CLIP image embeddings, allowing you to generate more consistent faces in different contexts and styles. + +For face models, use the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models. + +```py +import torch +from diffusers import StableDiffusionPipeline, DDIMScheduler +from diffusers.utils import load_image + +pipeline = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, +).to("cuda") +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin") + +pipeline.set_ip_adapter_scale(0.5) + +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png") +generator = torch.Generator(device="cpu").manual_seed(26) + +image = pipeline( + prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant", + ip_adapter_image=image, + negative_prompt="lowres, bad anatomy, worst quality, low quality", + num_inference_steps=100, + generator=generator, +).images[0] +image +``` + +
+
+ +
IP-Adapter image
+
+
+ +
generated image
+
+
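The face model example above uses [`DDIMScheduler`]; since [`EulerDiscreteScheduler`] is the other recommended option, swapping it in is a one-line change. A quick sketch, reusing the `pipeline` defined above:

```py
from diffusers import EulerDiscreteScheduler

# swap the scheduler on the already-created face model pipeline
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
```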
### Multi IP-Adapter +More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. Let's try this out! + +IP-Adapter uses an image encoder to generate the image features. The image encoder is automatically loaded if the image encoder model weights exist in an `image_encoder` subfolder in your repository. You don't need to explicitly load the image encoder if this subfolder exists. Otherwise, you'll need to load the image encoder weights into [`~transformers.CLIPVisionModelWithProjection`] and pass that to your pipeline. In this example, you'll manually load the image encoder just to see how it's done. + +```py +import torch +from diffusers import AutoPipelineForText2Image, DDIMScheduler +from transformers import CLIPVisionModelWithProjection +from diffusers.utils import load_image + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "h94/IP-Adapter", + subfolder="models/image_encoder", + torch_dtype=torch.float16, +) +``` + +Next, you'll load a base model, scheduler, and IP-Adapter's. The IP-Adapter's to use are passed as a list to the `weight_name` parameter: + +* ip-adapter-plus_sdxl_vit-h uses patch embeddings and a ViT H image encoder +* ip-adapter-plus-face_sdxl_vit-h has the same architecture but it is conditioned with images of cropped faces + +```py +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + image_encoder=image_encoder, +) +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"] +) +pipeline.set_ip_adapter_scale([0.7, 0.3]) +pipeline.enable_model_cpu_offload() +``` + +Load an image prompt and a folder containing images of a certain style you want to use. + +```py +face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png") +style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy" +style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] +``` + +
+
+ +
IP-Adapter image of face
+
+
+ +
IP-Adapter style images
+
+
+ +Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline! + +```py +generator = torch.Generator(device="cpu").manual_seed(0) + +image = pipeline( + prompt="wonderwoman", + ip_adapter_image=[style_images, face_image], + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + num_inference_steps=50, num_images_per_prompt=1, + generator=generator, +).images[0] +``` + +
+    +
generated image
+
+ ### Instant generation +[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with a LCM feels "instantaneous". IP-Adapter's can be plugged into a LCM-LoRA model to instantly generate images with an image prompt. + +The IP-Adapter weights need to be loaded first, then you can use [`~StableDiffusionPipeline.load_lora_weights`] to load the LoRA style and weight you want to apply to your image. + +```py +from diffusers import DiffusionPipeline, LCMScheduler +import torch +from diffusers.utils import load_image + +model_id = "sd-dreambooth-library/herge-style" +lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5" + +pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) + +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") +pipeline.load_lora_weights(lcm_lora_id) +pipeline.scheduler = LCMScheduler.from_config(pipe.scheduler.config) +pipeline.enable_model_cpu_offload() + +prompt = "best quality, high quality" +ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") +image = pipeline( + prompt=prompt, + ip_adapter_image=ip_adapter_image, + num_inference_steps=4, + guidance_scale=1, +).images[0] +image +``` + +
+    +
generated image
+
+ ### Structural control + +To control image generation to an even greater degree, you can combine IP-Adapter with a model like [ControlNet](../using-diffusers/controlnet). A ControlNet is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image. The control image can be depth maps, edge maps, pose estimations, and more. + +Load a [`ControlNetModel`] checkpoint conditioned on depth maps, insert it into a diffusion model, and load the IP-Adapter. + +```py +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +import torch +from diffusers.utils import load_image + +controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth" +controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16) + +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16) +pipeline.to("cuda") +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") +``` + +Now load the IP-Adapter image and depth map. + +```py +ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png") +depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png") +``` + +
+
+ +
IP-Adapter image
+
+
+ +
depth map
+
+
+ +Pass the depth map and IP-Adapter image to the pipeline to generate an image. + +```py +generator = torch.Generator(device="cpu").manual_seed(33) +image = pipeline( + prompt="best quality, high quality", + image=depth_map, + ip_adapter_image=ip_adapter_image, + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + num_inference_steps=50, + generator=generator, +).image[0] +image +``` + +
+    +
generated image
+
diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md index fcd6f8c7158d..67c01de19954 100644 --- a/docs/source/en/using-diffusers/loading_adapters.md +++ b/docs/source/en/using-diffusers/loading_adapters.md @@ -310,38 +310,31 @@ image ## IP-Adapter -[IP-Adapter](https://ip-adapter.github.io/) is an effective and lightweight adapter that adds image prompting capabilities to a diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs. +[IP-Adapter](https://ip-adapter.github.io/) is a lightweight adapter that enables image prompting for any diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs. -IP-Adapter works with most of our pipelines, including Stable Diffusion, Stable Diffusion XL (SDXL), ControlNet, T2I-Adapter, AnimateDiff. And you can use any custom models finetuned from the same base models. It also works with LCM-Lora out of box. +You can learn more about how to use IP-Adapter for different tasks and specific use cases in the [IP-Adapter](../using-diffusers/ip_adapter) guide. +> [!TIP] +> Diffusers currently only supports IP-Adapter for some of the most popular pipelines. Feel free to open a feature request if you have a cool use case and want to integrate IP-Adapter with an unsupported pipeline! +> Official IP-Adapter checkpoints are available from [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter). - - -You can find official IP-Adapter checkpoints in [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter). - -IP-Adapter was contributed by [okotaku](https://github.com/okotaku). - - - -Let's first create a Stable Diffusion Pipeline. +To start, load a Stable Diffusion checkpoint. ```py from diffusers import AutoPipelineForText2Image import torch from diffusers.utils import load_image - pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") ``` -Now load the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) weights with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. +Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. ```py pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") ``` - -IP-Adapter relies on an image encoder to generate the image features, if your IP-Adapter weights folder contains a "image_encoder" subfolder, the image encoder will be automatically loaded and registered to the pipeline. Otherwise you can so load a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to a Stable Diffusion pipeline when you create it. +IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains a `image_encoder` subfolder, the image encoder is automatically loaded and registed to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. 
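If you aren't sure which case applies to the checkpoint you loaded, you can check the pipeline before loading anything by hand; the `image_encoder` attribute name here is assumed from how the loader registers the model.

```py
# prints a CLIPVisionModelWithProjection if an encoder was registered automatically, otherwise None
print(pipeline.image_encoder)
```

If it comes back empty, load the encoder explicitly as shown below.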
```py from diffusers import AutoPipelineForText2Image @@ -356,12 +349,10 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained( pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, torch_dtype=torch.float16).to("cuda") ``` - -IP-Adapter allows you to use both image and text to condition the image generation process. For example, let's use the bear image from the [Textual Inversion](#textual-inversion) section as the image prompt (`ip_adapter_image`) along with a text prompt to add "sunglasses". 😎 +Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process. ```py -pipeline.set_ip_adapter_scale(0.6) image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png") generator = torch.Generator(device="cpu").manual_seed(33) images = pipeline( @@ -370,381 +361,10 @@ images = pipeline(     negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",     num_inference_steps=50,     generator=generator, -).images -images[0] -``` - -
-    -
- - - -You can use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to adjust the text prompt and image prompt condition ratio.  If you're only using the image prompt, you should set the scale to `1.0`. You can lower the scale to get more generation diversity, but it'll be less aligned with the prompt. -`scale=0.5` can achieve good results in most cases when you use both text and image prompts. - - -IP-Adapter also works great with Image-to-Image and Inpainting pipelines. See below examples of how you can use it with Image-to-Image and Inpaint. - - - - -```py -from diffusers import AutoPipelineForImage2Image -import torch -from diffusers.utils import load_image - -pipeline = AutoPipelineForImage2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") - -image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/vermeer.jpg") -ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/river.png") - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -generator = torch.Generator(device="cpu").manual_seed(33) -images = pipeline( -    prompt='best quality, high quality', -    image = image, -    ip_adapter_image=ip_image, -    num_inference_steps=50, -    generator=generator, -    strength=0.6, -).images -images[0] -``` - - - - -```py -from diffusers import AutoPipelineForInpaint -import torch -from diffusers.utils import load_image - -pipeline = AutoPipelineForInpaint.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float).to("cuda") - -image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/inpaint_image.png") -mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/mask.png") -ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/girl.png") - -image = image.resize((512, 768)) -mask = mask.resize((512, 768)) - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") - -generator = torch.Generator(device="cpu").manual_seed(33) -images = pipeline( - prompt='best quality, high quality', - image = image, - mask_image = mask, - ip_adapter_image=ip_image, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, - generator=generator, - strength=0.5, -).images -images[0] -``` - - - - -IP-Adapters can also be used with [SDXL](../api/pipelines/stable_diffusion/stable_diffusion_xl.md) - -```python -from diffusers import AutoPipelineForText2Image -from diffusers.utils import load_image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16 -).to("cuda") - -image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/watercolor_painting.jpeg") - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") - -generator = torch.Generator(device="cpu").manual_seed(33) -image = pipeline( - prompt="best quality, high quality", - ip_adapter_image=image, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=25, - generator=generator, -).images[0] -image.save("sdxl_t2i.png") -``` - -
-
- -
input image
-
-
- -
adapted image
-
-
- -You can use the IP-Adapter face model to apply specific faces to your images. It is an effective way to maintain consistent characters in your image generations. -Weights are loaded with the same method used for the other IP-Adapters. - -```python -# Load ip-adapter-full-face_sd15.bin -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin") -``` - - - -It is recommended to use `DDIMScheduler` and `EulerDiscreteScheduler` for face model. - - - - -```python -import torch -from diffusers import StableDiffusionPipeline, DDIMScheduler -from diffusers.utils import load_image - -pipeline = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, -).to("cuda") -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin") - -pipeline.set_ip_adapter_scale(0.7) - -image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ai_face2.png") - -generator = torch.Generator(device="cpu").manual_seed(33) - -image = pipeline( - prompt="A photo of a girl wearing a black dress, holding red roses in hand, upper body, behind is the Eiffel Tower", - ip_adapter_image=image, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, num_images_per_prompt=1, width=512, height=704, - generator=generator, ).images[0] +images ``` -
-
- -
input image
-
-
- -
output image
-
-
- - -You can load multiple IP-Adapter models and use multiple reference images at the same time. In this example we use IP-Adapter-Plus face model to create a consistent character and also use IP-Adapter-Plus model along with 10 images to create a coherent style in the image we generate. - -```python -import torch -from diffusers import AutoPipelineForText2Image, DDIMScheduler -from transformers import CLIPVisionModelWithProjection -from diffusers.utils import load_image - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "h94/IP-Adapter", - subfolder="models/image_encoder", - torch_dtype=torch.float16, -) - -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - image_encoder=image_encoder, -) -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.load_ip_adapter( - "h94/IP-Adapter", - subfolder="sdxl_models", - weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"] -) -pipeline.set_ip_adapter_scale([0.7, 0.3]) -pipeline.enable_model_cpu_offload() - -face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png") -style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy" -style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] - -generator = torch.Generator(device="cpu").manual_seed(0) - -image = pipeline( - prompt="wonderwoman", - ip_adapter_image=[style_images, face_image], - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, num_images_per_prompt=1, - generator=generator, -).images[0] -```
-    -
style input image
-
- -
-
- -
face input image
-
-
- -
output image
-
-
- - -### LCM-Lora - -You can use IP-Adapter with LCM-Lora to achieve "instant fine-tune" with custom images. Note that you need to load IP-Adapter weights before loading the LCM-Lora weights. - -```py -from diffusers import DiffusionPipeline, LCMScheduler -import torch -from diffusers.utils import load_image - -model_id = "sd-dreambooth-library/herge-style" -lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5" - -pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) - -pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -pipe.load_lora_weights(lcm_lora_id) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) -pipe.enable_model_cpu_offload() - -prompt = "best quality, high quality" -image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") -images = pipe( - prompt=prompt, - ip_adapter_image=image, - num_inference_steps=4, - guidance_scale=1, -).images[0] -``` - -### Other pipelines - -IP-Adapter is compatible with any pipeline that (1) uses a text prompt and (2) uses Stable Diffusion or Stable Diffusion XL checkpoint. To use IP-Adapter with a different pipeline, all you need to do is to run `load_ip_adapter()` method after you create the pipeline, and then pass your image to the pipeline as `ip_adapter_image` - - - -🤗 Diffusers currently only supports using IP-Adapter with some of the most popular pipelines, feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) if you have a cool use-case and require integrating IP-adapters with a pipeline that does not support it yet! - - - -You can find below examples on how to use IP-Adapter with ControlNet and AnimateDiff. - - - - -``` -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel -import torch -from diffusers.utils import load_image - -controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth" -controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16) - -pipeline = StableDiffusionControlNetPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16) -pipeline.to("cuda") - -image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png") -depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png") - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") - -generator = torch.Generator(device="cpu").manual_seed(33) -images = pipeline( - prompt='best quality, high quality', - image=depth_map, - ip_adapter_image=image, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, - generator=generator, -).images -images[0] -``` -
-
- -
input image
-
-
- -
adapted image
-
+   
- -
- - -```py -# animate diff + ip adapter -import torch -from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler -from diffusers.utils import export_to_gif, load_image - -# Load the motion adapter -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -# load SD 1.5 based finetuned model -model_id = "Lykon/DreamShaper" -pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) - -# scheduler -scheduler = DDIMScheduler( - clip_sample=False, - beta_start=0.00085, - beta_end=0.012, - beta_schedule="linear", - timestep_spacing="trailing", - steps_offset=1 -) -pipe.scheduler = scheduler - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -# load ip_adapter -pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") - -# load motion adapters -pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out") -pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up") -pipe.load_lora_weights("guoyww/animatediff-motion-lora-pan-left", adapter_name="pan-left") - -seed = 42 -image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") -images = [image] * 3 -prompts = ["best quality, high quality"] * 3 -negative_prompt = "bad quality, worst quality" -adapter_weights = [[0.75, 0.0, 0.0], [0.0, 0.0, 0.75], [0.0, 0.75, 0.75]] - -# generate -output_frames = [] -for prompt, image, adapter_weight in zip(prompts, images, adapter_weights): - pipe.set_adapters(["zoom-out", "tilt-up", "pan-left"], adapter_weights=adapter_weight) - output = pipe( - prompt= prompt, - num_frames=16, - guidance_scale=7.5, - num_inference_steps=30, - ip_adapter_image = image, - generator=torch.Generator("cpu").manual_seed(seed), - ) - frames = output.frames[0] - output_frames.extend(frames) - -export_to_gif(output_frames, "test_out_animation.gif") -``` - - -
- From 058d291c94ce0776af8b4a6d9eab7c5d8e478db4 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Wed, 7 Feb 2024 11:33:49 -0800 Subject: [PATCH 3/8] fix image links --- docs/source/en/using-diffusers/ip_adapter.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index fd693056570d..7675554dc9fe 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -229,7 +229,7 @@ export_to_gif(frames, "gummy_bear.gif")
IP-Adapter image
- +
generated video
@@ -339,7 +339,7 @@ style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]
IP-Adapter image of face
- +
IP-Adapter style images
From 05a2a54a67b3e79a174657c81a69ab1b71d71067 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Wed, 7 Feb 2024 17:17:01 -0800 Subject: [PATCH 4/8] lcm-lora --- docs/source/en/using-diffusers/ip_adapter.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index 7675554dc9fe..aa103c26aff6 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -383,8 +383,16 @@ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-a pipeline.load_lora_weights(lcm_lora_id) pipeline.scheduler = LCMScheduler.from_config(pipe.scheduler.config) pipeline.enable_model_cpu_offload() +``` + +Try using with a lower IP-Adapter scale to condition image generation more on the herge_style checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style. + +```py +pipeline.set_ip_adapter_scale(0.4) + +prompt = "herge_style woman in armor, best quality, high quality" +generator = torch.Generator(device="cpu").manual_seed(0) -prompt = "best quality, high quality" ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") image = pipeline( prompt=prompt, @@ -396,7 +404,7 @@ image ```
-    +   
generated image
From 7632a41a82821f2932b363e37b5d7074c46f0cd8 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 9 Feb 2024 12:45:39 -0800 Subject: [PATCH 5/8] feedback --- docs/source/en/using-diffusers/ip_adapter.md | 23 +++++------- .../en/using-diffusers/loading_adapters.md | 35 ++++++++++--------- src/diffusers/loaders/ip_adapter.py | 2 -- 3 files changed, 28 insertions(+), 32 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index aa103c26aff6..fd756b1b9292 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -12,16 +12,18 @@ specific language governing permissions and limitations under the License. # IP-Adapter -[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet. The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features. +[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features. > [!TIP] -> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide. +> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder. This guide will walk you through using IP-Adapter for various tasks and use cases. ## General tasks -Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. Feel free to use another pipeline such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff! +Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff! + +In all the following examples, you'll see the [`~IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. 
Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. @@ -30,8 +32,6 @@ Crafting the precise text prompt to generate the image you want can be difficult Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. -The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. - ```py from diffusers import AutoPipelineForText2Image from diffusers.utils import load_image @@ -75,8 +75,6 @@ IP-Adapter can also help with image-to-image by guiding the model to generate an Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. -The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. - ```py from diffusers import AutoPipelineForImage2Image from diffusers.utils import load_image @@ -126,8 +124,6 @@ IP-Adapter is also useful for inpainting because the image prompt allows you to Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL. -The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. - ```py from diffusers import AutoPipelineForInpainting from diffusers.utils import load_image @@ -177,8 +173,6 @@ images[0] IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with it's motion adapter, and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. -The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. - > [!WARNING] > If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. 
When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline. @@ -289,7 +283,8 @@ image More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. Let's try this out! -IP-Adapter uses an image encoder to generate the image features. The image encoder is automatically loaded if the image encoder model weights exist in an `image_encoder` subfolder in your repository. You don't need to explicitly load the image encoder if this subfolder exists. Otherwise, you'll need to load the image encoder weights into [`~transformers.CLIPVisionModelWithProjection`] and pass that to your pipeline. In this example, you'll manually load the image encoder just to see how it's done. +> [!TIP] +> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder. ```py import torch @@ -306,8 +301,8 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained( Next, you'll load a base model, scheduler, and IP-Adapter's. The IP-Adapter's to use are passed as a list to the `weight_name` parameter: -* ip-adapter-plus_sdxl_vit-h uses patch embeddings and a ViT H image encoder -* ip-adapter-plus-face_sdxl_vit-h has the same architecture but it is conditioned with images of cropped faces +* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT H image encoder +* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces ```py pipeline = AutoPipelineForText2Image.from_pretrained( diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md index 67c01de19954..a4083a92dd01 100644 --- a/docs/source/en/using-diffusers/loading_adapters.md +++ b/docs/source/en/using-diffusers/loading_adapters.md @@ -334,22 +334,6 @@ Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders. pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") ``` -IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains a `image_encoder` subfolder, the image encoder is automatically loaded and registed to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. - -```py -from diffusers import AutoPipelineForText2Image -from transformers import CLIPVisionModelWithProjection -import torch - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "h94/IP-Adapter", - subfolder="models/image_encoder", - torch_dtype=torch.float16, -).to("cuda") - -pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, torch_dtype=torch.float16).to("cuda") -``` - Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process. ```py @@ -368,3 +352,22 @@ images
   
+ +### IP-Adapter Plus + +IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains a `image_encoder` subfolder, the image encoder is automatically loaded and registed to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder. + +```py +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "h94/IP-Adapter", + subfolder="models/image_encoder", + torch_dtype=torch.float16) + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + image_encoder=image_encoder, + torch_dtype=torch.float16 +).to("cuda") + +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors") +``` diff --git a/src/diffusers/loaders/ip_adapter.py b/src/diffusers/loaders/ip_adapter.py index adf9ce25d435..76cc23b2a0fe 100644 --- a/src/diffusers/loaders/ip_adapter.py +++ b/src/diffusers/loaders/ip_adapter.py @@ -184,8 +184,6 @@ def set_ip_adapter_scale(self, scale): """ Sets the conditioning scale between text and image. """ - if not isinstance(scale, list): - scale = [scale] unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet for attn_processor in unet.attn_processors.values(): if isinstance(attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)): From a16f7015e98b8ea17be8014e74c2c59b79eb3658 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 9 Feb 2024 14:40:07 -0800 Subject: [PATCH 6/8] review --- docs/source/en/api/loaders/ip_adapter.md | 4 +-- docs/source/en/using-diffusers/ip_adapter.md | 27 +++++++++---------- .../en/using-diffusers/loading_adapters.md | 7 +++-- src/diffusers/loaders/ip_adapter.py | 6 +++++ 4 files changed, 26 insertions(+), 18 deletions(-) diff --git a/docs/source/en/api/loaders/ip_adapter.md b/docs/source/en/api/loaders/ip_adapter.md index 18da05122fc4..8805092d92fc 100644 --- a/docs/source/en/api/loaders/ip_adapter.md +++ b/docs/source/en/api/loaders/ip_adapter.md @@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License. # IP-Adapter -[IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder. Files generated from IP-Adapter are only ~100MBs. +[IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder. -Learn how to load an IP-Adapter checkpoint and image in the [IP-Adapter](../../using-diffusers/loading_adapters#ip-adapter) loading guide. +Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading](../../using-diffusers/loading_adapters#ip-adapter) guide, and you can see how to use it in the [usage](../../using-diffusers/ip_adapter) guide. 
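As a quick reference, the two mixin methods the guides rely on look like this in practice; a minimal sketch using the Stable Diffusion v1.5 checkpoint and IP-Adapter weights from those guides:

```py
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# load_ip_adapter() attaches the adapter weights (and the bundled image encoder, if present) to the pipeline
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

# set_ip_adapter_scale() balances image versus text conditioning (1.0 = condition only on the image prompt)
pipeline.set_ip_adapter_scale(0.5)
```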
diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md
index fd756b1b9292..e94a7e1e6175 100644
--- a/docs/source/en/using-diffusers/ip_adapter.md
+++ b/docs/source/en/using-diffusers/ip_adapter.md
@@ -23,14 +23,14 @@ This guide will walk you through using IP-Adapter for various tasks and use case

Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!

-In all the following examples, you'll see the [`~IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
+In all the following examples, you'll see the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.

Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results.

-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.

```py
from diffusers import AutoPipelineForText2Image
@@ -73,7 +73,7 @@ images[0]

IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt.

-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.

```py
from diffusers import AutoPipelineForImage2Image
@@ -122,7 +122,7 @@ images[0]

IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate.

-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.

```py
from diffusers import AutoPipelineForInpainting
@@ -134,7 +134,7 @@ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=
pipeline.set_ip_adapter_scale(0.6)
```

-Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality.
+Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image.

```py
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png")
@@ -171,7 +171,7 @@ images[0]

-IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with it's motion adapter, and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method.
+IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with it's motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.

> [!WARNING]
> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
@@ -233,7 +233,7 @@ export_to_gif(frames, "gummy_bear.gif")

## Specific use cases

-IP-Adapters image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases.
+IP-Adapters image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with!

### Face model
@@ -281,11 +281,13 @@ image

### Multi IP-Adapter

-More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. Let's try this out!
+More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style.

> [!TIP]
> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.

+Load the image encoder with [`~transformers.CLIPVisionModelWithProjection`].

```py
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
@@ -299,9 +301,9 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
)
```

-Next, you'll load a base model, scheduler, and IP-Adapter's. The IP-Adapter's to use are passed as a list to the `weight_name` parameter:
+Next, you'll load a base model, scheduler, and the IP-Adapter's. The IP-Adapter's to use are passed as a list to the `weight_name` parameter:

-* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT H image encoder
+* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces

```py
@@ -355,7 +357,6 @@ image = pipeline(
-    generated image
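As a brief aside on the multi-adapter example above, here is a minimal sketch of weighting the two loaded IP-Adapters individually; the values are illustrative and assume both checkpoints were passed to `weight_name` as a list, as described earlier:

```py
# one scale per loaded IP-Adapter, in load order (illustrative values):
# lean on the style adapter, keep the face adapter subtle
pipeline.set_ip_adapter_scale([0.7, 0.3])
```

Passing one value per adapter lets the style adapter do most of the work while keeping the face guidance gentle.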
### Instant generation
@@ -380,7 +381,7 @@ pipeline.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipeline.enable_model_cpu_offload()
```

-Try using with a lower IP-Adapter scale to condition image generation more on the herge_style checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style.
+Try using a lower IP-Adapter scale to condition image generation more on the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style.

```py
pipeline.set_ip_adapter_scale(0.4)
@@ -400,7 +401,6 @@ image
-    generated image
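To make the few-step behavior concrete, here is a sketch of the generation call once the LCM-LoRA and IP-Adapter above are loaded; the step count and guidance value are typical LCM-LoRA settings rather than values from the original guide, and `ip_adapter_image` is assumed to be a reference image loaded earlier:

```py
image = pipeline(
    prompt="herge_style woman in armor, best quality, high quality",
    ip_adapter_image=ip_adapter_image,
    num_inference_steps=4,  # LCM-LoRA only needs a handful of steps
    guidance_scale=1.0,     # low guidance pairs well with LCM-LoRA
).images[0]
```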
### Structural control
@@ -458,5 +458,4 @@ image
-    generated image

diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md
index a4083a92dd01..fdf5ddceb9f3 100644
--- a/docs/source/en/using-diffusers/loading_adapters.md
+++ b/docs/source/en/using-diffusers/loading_adapters.md
@@ -355,13 +355,16 @@ images

### IP-Adapter Plus

-IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.
+IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline.
+
+This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.

```py
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
-    torch_dtype=torch.float16)
+    torch_dtype=torch.float16
+)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",

diff --git a/src/diffusers/loaders/ip_adapter.py b/src/diffusers/loaders/ip_adapter.py
index 76cc23b2a0fe..513054702a54 100644
--- a/src/diffusers/loaders/ip_adapter.py
+++ b/src/diffusers/loaders/ip_adapter.py
@@ -183,6 +183,12 @@ def load_ip_adapter(
    def set_ip_adapter_scale(self, scale):
        """
        Sets the conditioning scale between text and image.
+
+        Example:
+
+        ```py
+        pipeline.set_ip_adapter_scale(0.5)
+        ```
        """
        unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
        for attn_processor in unet.attn_processors.values():
            if isinstance(attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)):

From d2bc16ce1e8bfd420a586b2e5e726aa78fd528a9 Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Wed, 14 Feb 2024 11:03:04 -0800
Subject: [PATCH 7/8] feedback

---
 docs/source/en/using-diffusers/ip_adapter.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md
index e94a7e1e6175..aba737472a8d 100644
--- a/docs/source/en/using-diffusers/ip_adapter.md
+++ b/docs/source/en/using-diffusers/ip_adapter.md
@@ -171,7 +171,7 @@ images[0]

-IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with it's motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.
+IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.

> [!WARNING]
> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
@@ -233,7 +233,7 @@ export_to_gif(frames, "gummy_bear.gif")

## Specific use cases

-IP-Adapters image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with!
+IP-Adapter's image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with!

### Face model
@@ -301,7 +301,7 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
)
```

-Next, you'll load a base model, scheduler, and the IP-Adapter's. The IP-Adapter's to use are passed as a list to the `weight_name` parameter:
+Next, you'll load a base model, scheduler, and the IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter:

* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces
@@ -361,7 +361,7 @@ image = pipeline(

### Instant generation

-[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with a LCM feels "instantaneous". IP-Adapter's can be plugged into a LCM-LoRA model to instantly generate images with an image prompt.
+[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with an LCM feels "instantaneous". IP-Adapters can be plugged into an LCM-LoRA model to instantly generate images with an image prompt.

The IP-Adapter weights need to be loaded first, then you can use [`~StableDiffusionPipeline.load_lora_weights`] to load the LoRA style and weight you want to apply to your image.
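In practice, the load order sketched below is what matters; the checkpoint names are common Stable Diffusion v1.5 choices and are assumptions here, not lines from the patch:

```py
# load the IP-Adapter weights first...
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
# ...then the LCM-LoRA weights
pipeline.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
```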

From c32eaec7cb5874128a5b1b48e5974997c932a670 Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Wed, 14 Feb 2024 12:06:40 -0800
Subject: [PATCH 8/8] feedback

---
 docs/source/en/using-diffusers/ip_adapter.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md
index aba737472a8d..b37ef15fc6af 100644
--- a/docs/source/en/using-diffusers/ip_adapter.md
+++ b/docs/source/en/using-diffusers/ip_adapter.md
@@ -237,9 +237,15 @@ IP-Adapter's image prompting and compatibility with other adapters and models ma

### Face model

-Generating accurate faces is challenging because they are complex and nuanced. [IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) is a face-specific IP-Adapter trained with face ID embeddings instead of CLIP image embeddings, allowing you to generate more consistent faces in different contexts and styles.
+Generating accurate faces is challenging because they are complex and nuanced. Diffusers supports two IP-Adapter checkpoints specifically trained to generate faces:

-For face models, use the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.
+* [ip-adapter-full-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-full-face_sd15.safetensors) is conditioned with images of cropped faces and removed backgrounds
+* [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces

+> [!TIP]
+> [IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) is a face-specific IP-Adapter trained with face ID embeddings instead of CLIP image embeddings, allowing you to generate more consistent faces in different contexts and styles. Try out this popular [community pipeline](https://github.com/huggingface/diffusers/tree/main/examples/community#ip-adapter-face-id) and see how it compares to the other face IP-Adapters.
+
+For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.

```py
import torch
@@ -281,7 +287,7 @@ image

### Multi IP-Adapter

-More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style.
+More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-Face to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style.

> [!TIP]
> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.
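To round off the face-model guidance above, here is a minimal end-to-end sketch; the base checkpoint and scale value are illustrative assumptions, while the face weight name and scheduler choice follow the recommendations in the hunk above:

```py
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# DDIM is one of the schedulers recommended for face models
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.safetensors")
pipeline.set_ip_adapter_scale(0.6)
```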