# DIY Colab on text-to-image with FLUX.1-schnell from Black Forest Labs
In August 2024, Black Forest Labs introduced their new pretrained FLUX.1 model. It is now also used by Grok on X (formerly Twitter).

The model has 12B parameters and needs a GPU to run in a reasonable time. At least 12 GB of CPU RAM and 12 GB of GPU memory (VRAM) are recommended.
Below you can work through two different ways to get FLUX running:

1) An already quantized version of the transformer.
2) The original model components from Black Forest Labs, as uploaded to HuggingFace, which are then quantized. The pipeline is built one by one. You will also learn how to save quantized models with torch.
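
The memory figures above follow from simple arithmetic on the parameter count. As a rough sketch (it only counts the raw transformer weights and ignores the text encoders, the VAE and activations; the helper name `weights_size_gb` is made up for this illustration):

```python
def weights_size_gb(num_params, bytes_per_param):
    """Approximate size of the raw weights in GB (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 12e9  # 12B parameters, as stated above

# bfloat16 stores 2 bytes per parameter, an 8-bit float format stores 1 byte.
for name, nbytes in [("bfloat16", 2), ("float8", 1)]:
    print(f"{name}: ~{weights_size_gb(NUM_PARAMS, nbytes):.0f} GB")
```

This gives about 24 GB for the original bfloat16 transformer and about 12 GB for the 8-bit version, which matches the file sizes quoted further down in this notebook.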

## Install requirements and import libraries

## 1) Run already quantized version

First, we are going to install the necessary requirements and import the libraries we are going to use.

### Requirements

In [None]:
!pip install pip --upgrade
!pip install numpy==1.26.4
!pip install accelerate
!pip install git+https://github.com/huggingface/diffusers
!pip install optimum-quanto
!pip install transformers --upgrade 

import torch # necessary to check the device
# identify which device is used (cuda = GPU, cpu = CPU only, mps = Mac)
device: str = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
if device == 'cpu':
    !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
elif device == 'cuda':
    !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
elif device == 'mps':
    !pip3 install torch torchvision torchaudio
else:
    print("device unknown")
# exception: on Google Colab the cu124 wheels are necessary, no matter whether a T4 GPU is enabled or the runtime is CPU only

###  Libraries

In [None]:
import torch

import accelerate

from optimum.quanto import freeze, qfloat8, quantize

from diffusers import FluxTransformer2DModel 
from diffusers import FluxPipeline

from transformers import T5EncoderModel

import os

If you run this notebook in Google Colab, uncomment and execute the following cell:

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

### Load and quantize different pipeline components

Define loading and saving path for models:

In [2]:
cache_dir = './models/text-to-image/flux.1-schnell' # saving path
model = "black-forest-labs/FLUX.1-schnell" # official FLUX.1-schnell model from Black Forest Labs (not quantized)
model_tr = "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-schnell-fp8.safetensors" # quantized transformer from HuggingFace

Create necessary folders:

In [None]:
# saving folder for models (exist_ok=True: no error if the folder already exists)
os.makedirs(cache_dir, exist_ok=True)
# saving folder for images
os.makedirs('./figs', exist_ok=True)

Load and requantize the transformer.
Note: You may run out of CPU memory here, because the file is about 12 GB in size and is first loaded completely into CPU memory (more than 12 GB of CPU RAM is necessary). If you run out of memory, try method 2).

In [None]:
transformer = FluxTransformer2DModel.from_single_file(model_tr, 
                                                        torch_dtype=torch.bfloat16,
                                                        cache_dir = cache_dir,
                                                        #local_files_only=True # once you have downloaded the model, you can force the use of these downloaded models instead of downloading them each time you run the program.
)
quantize(transformer, weights=qfloat8)
freeze(transformer)
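
To find out in advance whether the machine has enough RAM for the roughly 12 GB file, the total physical memory can be queried before running the loading cell. The following is a minimal sketch that uses only the standard library; `total_ram_gb` and `enough_ram` are hypothetical helper names, and the `os.sysconf` keys are assumed to be available, which is typically the case on Linux (including Colab) but not on Windows.

```python
import os

def total_ram_gb():
    """Return the total physical RAM in GiB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. on Windows, where os.sysconf is not available
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1024**3

def enough_ram(required_gb, available_gb):
    """True only if the amount of RAM is known and exceeds the requirement."""
    return available_gb is not None and available_gb > required_gb

ram = total_ram_gb()
if not enough_ram(12, ram):
    print(f"RAM: {ram} GB. Loading the 12 GB file may fail, consider method 2).")
```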

Load and quantize text_encoder_2:

In [None]:
text_encoder_2 = T5EncoderModel.from_pretrained(model,
                                                subfolder="text_encoder_2",
                                                torch_dtype=torch.bfloat16,
                                                cache_dir=cache_dir,
                                                #local_files_only=True
)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

### Set up pipeline

Set up the pipeline with the main model and the two quantized models (transformer and text_encoder_2). When running on CUDA (GPU), there are some more "tricks" to lower the memory usage.

In [None]:
pipe = FluxPipeline.from_pretrained(model,
                                    transformer=None,
                                    text_encoder_2=None,
                                    torch_dtype=torch.bfloat16
)

pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2

The following applies to CUDA (GPU) ONLY. It saves VRAM so that the code also runs with less than 16 GB of VRAM. Depending on your GPU, use either "enable_model_cpu_offload" on its own or the other three lines of code together. Try out which option runs faster (or runs at all, since the model is very GPU-hungry). "enable_model_cpu_offload" on its own tends to be faster, but it needs more GPU memory.

In [7]:
if device == 'cuda':
    # pipe.enable_model_cpu_offload() # offloads whole models to CPU, one model at a time; faster but needs more VRAM
    pipe.enable_sequential_cpu_offload() # offloads to CPU on a submodule level (rather than model level); slower, makes non-quantized versions run with VRAM 4-32 GB
    pipe.vae.enable_slicing() # when using non-quantized versions to make it run with VRAM 4-32 GB
    pipe.vae.enable_tiling() # when using non-quantized versions to make it run with VRAM 4-32 GB
else: 
    pipe.to(device)
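
The choice between the two offloading options can also be made in code. This is only a sketch: `apply_memory_savers` is a hypothetical helper, and the 12 GB threshold is an illustrative assumption rather than a measured value, so adjust it after trying both options on your GPU.

```python
def apply_memory_savers(pipe, free_vram_gb, model_offload_min_gb=12.0):
    """Apply one of the two offloading options and return its name.

    `model_offload_min_gb` is an illustrative threshold, not a measured value.
    """
    if free_vram_gb >= model_offload_min_gb:
        # Faster option: whole models are moved between CPU and GPU.
        pipe.enable_model_cpu_offload()
        return "model_cpu_offload"
    # Slower option with the lowest VRAM usage, plus the two VAE savers.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return "sequential_cpu_offload"
```

On CUDA, the free VRAM in bytes is the first value returned by `torch.cuda.mem_get_info()`, so a call could look like `apply_memory_savers(pipe, torch.cuda.mem_get_info()[0] / 1024**3)`.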

### Define and create image

Define the parameters for the image. The most important one is the prompt, which should describe the picture as closely as possible. You can also describe what is in the foreground, in the background, etc., and define the style, e.g. photorealistic, high definition, watercolor style, and so on.

In [None]:
prompt = "Dog in Space on a flying carpet. Behind there are cats. In the background there is a snow covered mountain and the moon."
height, width = 1024, 1024 # standard = 1024x1024
num_inference_steps = 4  # number of iterations, 4 gives decent results and should be considered as minimum; people on HuggingFace, GitHub and Reddit: ~15-50 iterations. Check for yourself to get a good tradeoff between speed and quality
generator = torch.Generator(device).manual_seed(12345) # set seed for repeatable results

Image generation:

In [None]:
image = pipe(
    prompt=prompt,
    guidance_scale=0.0, # must be 0.0 for flux1.-schnell, may be 3.5 for flux1.-dev but up to 7.0 --> higher guidance scale forces the model to keep closer to the prompt at the expense of image quality and may introduce artefacts
    height=height,
    width=width,
    num_inference_steps=num_inference_steps,
    max_sequence_length=256, #256 is max for flux1.-schnell; maximum sequence length to use with the prompt
    generator=generator
).images[0]
image
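
To find a good tradeoff between speed and quality, it can help to time a few different step counts and compare the resulting images side by side. The sketch below is independent of the pipeline: `time_step_counts` is a hypothetical helper, and `generate` is assumed to be any function you define that takes `num_inference_steps` and returns an image.

```python
import time

def time_step_counts(generate, step_counts):
    """Call generate(num_inference_steps=n) for each n.

    Returns a dict {n: (result, seconds)} in the order of `step_counts`.
    """
    results = {}
    for n in step_counts:
        start = time.perf_counter()
        result = generate(num_inference_steps=n)
        results[n] = (result, time.perf_counter() - start)
    return results

# Possible use in this notebook (assumes pipe, prompt, height, width and device exist):
# def generate(num_inference_steps):
#     return pipe(prompt=prompt, guidance_scale=0.0, height=height, width=width,
#                 num_inference_steps=num_inference_steps, max_sequence_length=256,
#                 generator=torch.Generator(device).manual_seed(12345)).images[0]
# for n, (img, secs) in time_step_counts(generate, [1, 2, 4]).items():
#     print(f"{n} steps: {secs:.1f} s")
```

Re-creating the generator with the same seed inside `generate` keeps the runs comparable, so differences between the images come from the step count only.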


Save image

In [None]:
image.save(f"figs/Kijai_qt-qte2_{num_inference_steps}_{height}_{width}.png")

## 2) Run original models and define pipeline components one by one, then quantize them manually

This code is based on the work from https://gist.github.com/AmericanPresidentJimmyCarter/873985638e1f3541ba8b00137e7dacd9

As in the first version, we are going to install all necessary requirements and import the corresponding libraries.

### Requirements

In [None]:
!pip install pip --upgrade
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 # if cuda 12.4 does not work, go to https://pytorch.org/get-started/locally/ and select the version that fits your OS.
!pip install transformers --upgrade
!pip install sentencepiece
!pip install protobuf
!pip install accelerate
!pip install git+https://github.com/huggingface/diffusers
!pip install optimum-quanto
!pip install -U bitsandbytes

###  Libraries

In [None]:
import torch

from optimum.quanto import freeze, qfloat8, quantize #, qint4

from diffusers import FlowMatchEulerDiscreteScheduler, AutoencoderKL
from diffusers import FluxTransformer2DModel
from diffusers import FluxPipeline
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast, AutoModelForCausalLM
#from safetensors.torch import save_file, load_file

import os

If you run this notebook in Google Colab, uncomment and execute the following cell:

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

### Load and quantize different pipeline components

In [None]:
device: str = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')

Define loading and saving path for models:

In [11]:
bfl_repo = "black-forest-labs/FLUX.1-schnell" # official FLUX.1-schnell model from Black Forest Labs (not quantized)
revision = "refs/pr/7" #refs/pr/1 works
model_tr = "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-schnell-fp8.safetensors" # quantized transformer from HuggingFace
cache_dir = './models/text-to-image/flux.1-schnell' # saving path

Create necessary folders:

In [12]:
# saving folder for models (exist_ok=True: no error if the folder already exists)
os.makedirs(cache_dir, exist_ok=True)
# saving folder for images
os.makedirs('./figs', exist_ok=True)

Let's try to quantize the original transformer introduced by Black Forest Labs. It is about 24 GB in size.

In [None]:
# original, not quantized transformer from flux schnell = 24 GB

print("start loading transformer...")
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo,
                                                     subfolder="transformer",
                                                     torch_dtype=torch.bfloat16,
                                                     revision=revision,
                                                     cache_dir = cache_dir,
                                                     #local_files_only=True
)

print("start quantizing transformer...")
# quantizing qfloat8 works, you may also want to try qint4 and see if it works
#quantize(transformer, weights=qint4, exclude=["proj_out", "x_embedder", "norm_out", "context_embedder"])
quantize(transformer, weights=qfloat8)

print("start freezing transformer...")
freeze(transformer)

As an alternative, we can still use the already quantized transformer (12 GB file size) and requantize it.

In [None]:
# fp8 quantized transformer = 12 GB

print("start loading transformer...")
transformer = FluxTransformer2DModel.from_single_file(model_tr,
                                                      torch_dtype=torch.bfloat16,
                                                      cache_dir = cache_dir,
                                                      #local_files_only=True
)

print("start quantizing transformer...")
quantize(transformer, weights=qfloat8)

print("start freezing transformer...")
freeze(transformer)

Let's save the quantized transformer (if you went for the already quantized transformer, you can skip the next two cells)...

In [None]:
# save(transformer)
torch.save(transformer, f'{cache_dir}' + '/' + 'transformer.pt')

... and load it again.

In [None]:
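# loading(transformer)
# weights_only=False: the file holds a whole pickled model, not just tensors (PyTorch >= 2.6 defaults to True); only load files you trust
transformer = torch.load(f'{cache_dir}' + '/' + 'transformer.pt', weights_only=False)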
transformer.eval()

Now we will quantize the text_encoder_2.

In [None]:
print("start loading text_encoder_2...")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo,
                                                subfolder="text_encoder_2", 
                                                torch_dtype=torch.bfloat16, 
                                                revision=revision
)

print("start quantizing text_encoder_2...")
quantize(text_encoder_2, weights=qfloat8)

print("start freezing text_encoder_2...")
freeze(text_encoder_2)

Saving the quantized text_encoder_2.

In [None]:
# saving (text_encoder_2)
torch.save(text_encoder_2, f'{cache_dir}' + '/' + 'text_encoder_2.pt')

Loading quantized text_encoder_2.

In [None]:
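# loading (text_encoder_2)
# weights_only=False: the file holds a whole pickled model, not just tensors (PyTorch >= 2.6 defaults to True); only load files you trust
text_encoder_2 = torch.load(f'{cache_dir}' + '/' + 'text_encoder_2.pt', weights_only=False)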
text_encoder_2.eval()

### Load remaining pipeline components, one by one.

This time we load all the other pipeline components one by one instead of loading them in a single call. This means you also get to see all the components of the flux pipeline. Cool!

In [None]:
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler", revision=revision)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16)
# tokenizers hold no weights, so they do not need a torch_dtype
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2", revision=revision)
vae = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=torch.bfloat16, revision=revision)


### Set up pipeline

Since we have all model components loaded, we can now set up the pipeline.

In [None]:
pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=None,
)

pipe.text_encoder_2 = text_encoder_2
pipe.transformer = transformer

As before, we can apply some tricks to use less VRAM when using a GPU:

In [None]:
if device == 'cuda':
    # pipe.enable_model_cpu_offload() # offloads whole models to CPU, one model at a time; faster but needs more VRAM
    pipe.enable_sequential_cpu_offload() # offloads to CPU on a submodule level (rather than model level); slower, makes non-quantized versions run with VRAM 4-32 GB
    pipe.vae.enable_slicing() # when using non-quantized versions to make it run with VRAM 4-32 GB
    pipe.vae.enable_tiling() # when using non-quantized versions to make it run with VRAM 4-32 GB
else: 
    pipe.to(device)

### Define and create image

Parameter definition for the image:

In [None]:
prompt = "Dog in Space on a flying carpet. Behind there are cats. In the background there is a snow covered mountain and the moon."
height, width = 1024, 1024 # standard = 1024x1024
num_inference_steps = 4  # number of iterations, 4 gives decent results and should be considered as minimum; people on HuggingFace, GitHub and Reddit: ~15-50 iterations. Check for yourself to get a good tradeoff between speed and quality
generator = torch.Generator(device).manual_seed(12345) # set seed for repeatable results

Image generation:

In [None]:
image = pipe(
    prompt=prompt,
    guidance_scale=0.0, # must be 0.0 for flux1.-schnell, may be 3.5 for flux1.-dev but up to 7.0 --> higher guidance scale forces the model to keep closer to the prompt at the expense of image quality
    height=height,
    width=width,
    #output_type="pil",
    num_inference_steps=num_inference_steps,
    max_sequence_length=128, #256 is max for flux1.-schnell; maximum sequence length to use with the prompt
    generator=generator
).images[0]

image


Saving the image:

In [None]:
image.save(f"figs/OneByOne_qt-qte2_{num_inference_steps}_{height}_{width}.png")