# Generating images and text with UniDiffuser

UniDiffuser was introduced in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://arxiv.org/abs/2303.06555).

In this notebook, we will show how the [UniDiffuser pipeline](https://huggingface.co/docs/diffusers/api/pipelines/unidiffuser) in 🧨 diffusers can be used for:

* Unconditional image generation
* Unconditional text generation
* Text-to-image generation
* Image-to-text generation
* Image variation
* Text variation

One pipeline to rule six use cases 🤯

Let's start!

<div align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/unidiffuser.gif" width=350/>
</div>

## Setup

In [1]:
!pip install -q git+https://github.com/huggingface/diffusers
!pip install transformers accelerate -q


## Unconditional image and text generation

Throughout this notebook, we'll be using the ["thu-ml/unidiffuser-v1"](https://huggingface.co/thu-ml/unidiffuser-v1) checkpoint. UniDiffuser comes with two checkpoints:

* ["thu-ml/unidiffuser-v1"](https://huggingface.co/thu-ml/unidiffuser-v1)
* ["thu-ml/unidiffuser-v0"](https://huggingface.co/thu-ml/unidiffuser-v0)

In [2]:
import torch
from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)

No inputs or latents have been supplied, and mode has not been manually set, defaulting to mode 'joint'.



A small white car parked up in parking lot


You can also generate only an image or only text (which the UniDiffuser paper calls “marginal” generation since we sample from the marginal distribution of images and text, respectively):

In [3]:
# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance

# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]

# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

To reset a mode, call: `pipe.reset_mode()`. 
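
To make the "joint" and "marginal" terminology concrete, the generation tasks covered in this notebook can be summarized in probabilistic notation. The symbols $x$ (an image) and $y$ (a text) are our own shorthand rather than anything from the pipeline's API, and the marginalization is written as an integral purely for simplicity.

```latex
\begin{align*}
\text{joint generation:} \quad & (x, y) \sim p(x, y) \\
\text{image-only generation:} \quad & x \sim p(x) = \int p(x, y)\, dy \\
\text{text-only generation:} \quad & y \sim p(y) = \int p(x, y)\, dx \\
\text{text-to-image generation:} \quad & x \sim p(x \mid y) \\
\text{image-to-text generation:} \quad & y \sim p(y \mid x)
\end{align*}
```

The two remaining use cases, image variation and text variation, chain the two conditional samplers one after the other.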

## Text-to-image generation

The `UniDiffuserPipeline` can infer the right mode of execution from the inputs passed to it. However, once a mode has been set manually, subsequent calls are executed in that mode until it is changed or reset. Since the previous cell left the pipeline in text-only mode (`set_text_mode()`), we now set the mode explicitly to generate images from text.

In [4]:
pipe.set_text_to_image_mode()

In [5]:
# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")
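
The mode handling described in this section can be summarized in a small standalone sketch. It is not the library's implementation: `resolve_mode` and its arguments are hypothetical names, the mode labels other than `'joint'` are illustrative, and the function only mirrors the behaviour seen in this notebook (a manually set mode wins; otherwise a prompt implies text-to-image, an image implies image-to-text, and no inputs fall back to joint generation).

```python
from typing import Any, Optional


def resolve_mode(
    prompt: Optional[str] = None,
    image: Optional[Any] = None,
    manual_mode: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick a generation mode as described in this notebook."""
    # A mode set explicitly (e.g. via pipe.set_text_to_image_mode()) takes
    # precedence and stays active until it is changed or reset.
    if manual_mode is not None:
        return manual_mode
    # Sketch-only choice: treat "both inputs given" as ambiguous.
    if prompt is not None and image is not None:
        raise ValueError("Pass either a prompt or an image, or set the mode manually.")
    if prompt is not None:
        return "text2img"
    if image is not None:
        return "img2text"
    # No inputs and no manual mode: fall back to joint generation.
    return "joint"


print(resolve_mode())                                    # joint
print(resolve_mode(prompt="an elephant under the sea"))  # text2img
print(resolve_mode(image=object()))                      # img2text
print(resolve_mode(manual_mode="img"))                   # img
```

In this picture, `pipe.reset_mode()` corresponds to setting `manual_mode` back to `None`, after which the inputs decide the mode again.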

## Image-to-text generation

In [6]:
pipe.set_image_to_text_mode()

In [7]:
from diffusers.utils import load_image

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

An image of an astronaut flying over the Earth
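
Calling `.resize((512, 512))` directly stretches any input that is not square. If that matters for your images, one option is to center-crop to a square before resizing. The helper below is a hypothetical sketch (the name `center_square_box` is ours, not part of diffusers); it only computes the crop box, which can then be passed to PIL's `Image.crop`.

```python
from typing import Tuple


def center_square_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


print(center_square_box(640, 480))  # (80, 0, 560, 480)
```

With a PIL image `img`, this could be used as `init_image = img.crop(center_square_box(*img.size)).resize((512, 512))`.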


## Image variation

For image variation, we follow the "round-trip" method suggested in the paper: we first generate a caption from a given image and then use that caption to generate a new image.

In [8]:
# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
pipe.set_text_to_image_mode()
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")

An astronaut floating in                                                               
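
The two steps of this round trip can be wrapped in a small function. This is a minimal sketch, not part of diffusers: `image_variation` is a hypothetical name, and it assumes `pipe` is a `UniDiffuserPipeline` loaded as earlier in this notebook. Since generated captions may carry padding whitespace, the sketch strips it before reusing the caption as a prompt.

```python
def image_variation(pipe, image, num_inference_steps=20, guidance_scale=8.0):
    """Hypothetical wrapper: caption an image, then generate a new image from the caption."""
    # Step 1: image-to-text generation.
    pipe.set_image_to_text_mode()
    caption = pipe(
        image=image,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    ).text[0].strip()  # drop padding whitespace around the generated caption

    # Step 2: text-to-image generation from that caption.
    pipe.set_text_to_image_mode()
    new_image = pipe(
        prompt=caption,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    ).images[0]
    return caption, new_image
```

It would be called as `caption, varied_image = image_variation(pipe, init_image)`.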


## Text variation

The same round-trip methodology can be applied here. 

In [9]:
# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

pipe.set_text_to_image_mode()
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
pipe.set_image_to_text_mode()
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)

A baby elephant in a aquarium with a fish
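
Because each round trip is sampled, repeating it can yield several different rewordings of the same prompt. The sketch below collects them; `text_variations` is a hypothetical helper (not a diffusers function), it assumes `pipe` is the `UniDiffuserPipeline` used throughout this notebook, and it assumes that no fixed random generator is passed, so that repeated calls can produce different samples.

```python
def text_variations(pipe, prompt, n=3, num_inference_steps=20, guidance_scale=8.0):
    """Hypothetical helper: run the text -> image -> text round trip n times."""
    variations = []
    for _ in range(n):
        # Text-to-image generation from the original prompt.
        pipe.set_text_to_image_mode()
        image = pipe(
            prompt=prompt,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
        ).images[0]

        # Image-to-text generation from the intermediate image.
        pipe.set_image_to_text_mode()
        text = pipe(
            image=image,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
        ).text[0].strip()

        # Keep each distinct variation once, in order of appearance.
        if text not in variations:
            variations.append(text)
    return variations
```

A call such as `text_variations(pipe, "an elephant under the sea", n=3)` would return up to three distinct captions.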
