# 🍫 Building a ControlNet dataset with Fondant

> ⚠️ Please note that this notebook **is not** compatible with **Google Colab**. To complete the tutorial, you must
> initiate Docker containers. Starting Docker containers within Google Colab is not supported.


### Pipeline overview


The pipeline consists of five components:

1. [**Prompt Generation**](components/generate_prompts): This component generates a set of seed prompts using a rule-based approach that combines various rooms and styles together, like “a photo of a {room_type} in the style of {style_type}”. As input, it takes in a list of room types (bedroom, kitchen, laundry room, ..), a list of room styles (contemporary, minimalist, art deco, ...) and a list of prefixes (comfortable, luxurious, simple). These lists can be easily adapted to other domains. The output of this component is a list of seed prompts.

2. [**Image URL Retrieval**](https://github.com/ml6team/fondant/tree/main/components/prompt_based_laion_retrieval): This component retrieves images from the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset based on the seed prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset. This component doesn’t return the actual images yet, only the URLs. The next component in the pipeline will then download these images.

3. [**Download Images**](https://github.com/ml6team/fondant/tree/main/components/download_images): This component downloads the actual images based on the URLs retrieved by the previous component. It takes in the URLs as input and returns the actual images, along with some metadata (like their height and width).

4. **Caption Filtering**: The LAION dataset contains many low-quality captions. Your task is to experiment with filtering out captions that are too long, too short, or contain irrelevant data.

5. **Conditioning Creation**: In this step, your task is to write a component that creates a conditioning image from a given image. Unless you have a GPU available locally, it's advised to stick to CPU-friendly transformations, like those found in [OpenCV](https://opencv.org/). Get creative and experiment!

Here are some examples of conditioning transformations:
- Canny edges
- pixelation
- smoothed-out hue, saturation, lightness, etc.
- Hough maps
- blob detection
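
To make the idea of a conditioning transformation concrete, here is a minimal, dependency-free sketch of pixelation on a grayscale image stored as a nested list. The function name `pixelate` and the list-based image format are assumptions for illustration only. A real component would more likely decode the image bytes and use a library such as OpenCV.

```python
def pixelate(image, block_size):
    """Replace every block_size x block_size tile by its average intensity.

    `image` is a hypothetical grayscale image given as a list of rows,
    each row being a list of integer pixel values.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # copy so the input stays untouched
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            rows = range(top, min(top + block_size, height))
            cols = range(left, min(left + block_size, width))
            tile = [image[r][c] for r in rows for c in cols]
            mean = sum(tile) // len(tile)
            for r in rows:
                for c in cols:
                    result[r][c] = mean
    return result


sample = [
    [0, 10, 100, 110],
    [20, 30, 120, 130],
    [40, 50, 200, 210],
    [60, 70, 220, 230],
]
for row in pixelate(sample, block_size=2):
    print(row)
```

The same structure (image in, transformed image out) applies to the other transformations in the list; only the body of the function changes.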

## Environment

#### This section checks the prerequisites of your environment. Read any errors or warnings carefully.

**Ensure a Python between version 3.8 and 3.10 is available**

In [None]:
import sys
if sys.version_info < (3, 8, 0) or sys.version_info >= (3, 11, 0):
    raise Exception(f"A Python version between 3.8 and 3.10 is required. You are running {sys.version}")

**Check if docker compose is installed and the docker daemon is running**

In [None]:
!docker compose version >/dev/null
!docker info >/dev/null

**Make sure Fondant is installed**

In [None]:
!pip install -r ../requirements.txt -q --disable-pip-version-check

## Implement the pipeline

### Creating a pipeline

First of all, we need to initialize the pipeline, which includes specifying a name for your pipeline, providing a description, and setting a base_path. The base_path is used to store the pipeline artifacts and data generated by the components.

In [182]:
from pathlib import Path

from fondant.pipeline import ComponentOp, Pipeline

BASE_PATH = "./data_dir"
Path(BASE_PATH).mkdir(parents=True, exist_ok=True)

pipeline = Pipeline(
    pipeline_name="controlnet-pipeline",
    pipeline_description="Pipeline that collects and processes data to train ControlNet",
    base_path=BASE_PATH
)

### Adding a (custom) component

The first component of our pipeline is the `generate_prompts` component, which generates a set of prompts to query the LAION knn index with. This is a custom component implemented in this repository. You can find it at [./components/generate_prompts](./components/generate_prompts).

To create an operation for a custom component, we create a `ComponentOp` and pass in the `component_dir` where the component is located.

We can pass in arguments to change the behavior of the component. Here we are passing in `n_rows_to_load: 10`, which limits the amount of data that is generated for the purpose of this example.

For an overview of the available arguments, you can check the [`fondant_component.yaml`](./components/generate_prompts/fondant_component.yaml) specification.

In [183]:
generate_prompts_op = ComponentOp(
    component_dir="components/generate_prompts",
    arguments={
        "n_rows_to_load": 10,
    },
)

Once we've created an operation for our component, we can add it to our pipeline.

In [184]:
pipeline.add_op(generate_prompts_op)

Now, our pipeline consists of a single component that generates prompts.
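
For intuition, the rule-based generation could look roughly like the sketch below. This is not the actual implementation of the `generate_prompts` component: the function name, the way the prefix is placed in the sentence, and the example lists are assumptions. Only the overall template and the `n_rows_to_load` limit come from the description above.

```python
import itertools


def generate_prompts(room_types, styles, prefixes, n_rows_to_load=None):
    """Combine rooms, styles and prefixes into seed prompts.

    `n_rows_to_load` caps the number of prompts, mirroring the component
    argument of the same name (None means no cap).
    """
    combinations = itertools.product(room_types, styles, prefixes)
    prompts = (
        f"a photo of a {prefix} {room} in the style of {style}"
        for room, style, prefix in combinations
    )
    return list(itertools.islice(prompts, n_rows_to_load))


prompts = generate_prompts(
    room_types=["bedroom", "kitchen"],
    styles=["minimalist", "art deco"],
    prefixes=["comfortable", "simple"],
    n_rows_to_load=10,
)
for prompt in prompts:
    print(prompt)
```

Because the prompts are a plain Cartesian product of the input lists, adapting the pipeline to another domain only requires swapping the lists.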

### Adding more (reusable) components

We can now proceed to add more components. 

We will use some components available on the [Fondant Hub](https://fondant.ai/en/latest/components/hub/), for which we can create operations using the `ComponentOp.from_registry(...)` method.

In [185]:
# custom component
retrieval_component_op = ComponentOp(
    component_dir="components/prompt_based_laion_retrieval",
    arguments={
        "num_images": 2,
        "aesthetic_score": 9,
        "aesthetic_weight": 0.5,
        "url": "https://knn.laion.ai/knn-service"
    },
    cache=False
)


# reusable hub component
download_images_op = ComponentOp.from_registry(
    name="download_images",
    arguments={
        "timeout": 1,
        "retries": 0,
        "image_size": 512,
        "resize_mode": "center_crop",
        "resize_only_if_bigger": False,
        "min_image_size": 0,
        "max_aspect_ratio": 2.5,
    },
)

# custom component
filter_component_op = ComponentOp(
    component_dir="components/filter_component",
    arguments={
        "max_length": 100
    },
    cache=False
)

# custom component
conditioing_component_op = ComponentOp(
    component_dir="components/conditioning_component",
)
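
The `filter_component` above is one of the components you write yourself. As a starting point, here is a hedged sketch of a caption filter. Only `max_length=100` comes from the arguments above; the function name `keep_caption`, the `min_length` threshold and the letter-ratio heuristic are illustrative assumptions.

```python
def keep_caption(caption, min_length=10, max_length=100, min_letter_ratio=0.7):
    """Return True if a caption looks usable for training."""
    if caption is None:
        return False
    text = caption.strip()
    if not text:
        return False
    if not (min_length <= len(text) <= max_length):
        return False
    # Captions dominated by digits and symbols are often file names or IDs.
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) >= min_letter_ratio


captions = [
    "A bright minimalist bedroom with wooden floor",
    "IMG_20190312_0042.jpg",
    "short",
]
print([c for c in captions if keep_caption(c)])
```

Inside a component, a predicate like this would typically be applied to the caption column of the dataframe to drop the rows that fail it.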


Now, we can use the components in our pipeline. We will chain them into a pipeline by defining dependencies between the different pipeline steps.

In [186]:

pipeline.add_op(retrieval_component_op, dependencies=generate_prompts_op)
pipeline.add_op(filter_component_op, dependencies=retrieval_component_op)
pipeline.add_op(download_images_op, dependencies=filter_component_op)
pipeline.add_op(conditioing_component_op, dependencies=download_images_op)
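
The `dependencies` arguments define a graph, and the compiler log further down reports that it sorts this graph topologically before running anything. The sketch below illustrates that idea with Kahn's algorithm. It is not Fondant's own code; the function name is hypothetical and the component names are taken from the log output of this notebook.

```python
from collections import deque


def execution_order(dependencies):
    """Order components so that each one comes after its dependencies."""
    remaining = {name: len(deps) for name, deps in dependencies.items()}
    dependents = {name: [] for name in dependencies}
    for name, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(name)

    # Start with the components that depend on nothing.
    ready = deque(name for name, count in remaining.items() if count == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in dependents[current]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("Dependency cycle detected")
    return order


graph = {
    "condition_images": ["download_images"],
    "download_images": ["filter_prompts"],
    "filter_prompts": ["laion_retrieval"],
    "laion_retrieval": ["generate_prompts"],
    "generate_prompts": [],
}
print(execution_order(graph))
```

The order in which the operations are declared does not matter; only the dependency edges determine the execution order.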


## Writing the dataset to the Hugging Face Hub 

To write the final dataset to HF hub, we will use the `write_to_hf_hub` component from the [Fondant Hub](https://fondant.ai/en/latest/components/hub/).

You'll need a Hugging Face Hub account for this. If you don't have one, you can either create one, or skip this step.

In [187]:
USERNAME = "khaerens"  # Replace with your own Hugging Face username
HF_TOKEN = "<your-hf-token>"  # Replace with your own access token and keep it out of version control

`write_to_hf_hub` is a special type of reusable Fondant component which is **generic**. This means that it can handle different data schemas, but we have to tell it which schema to use.

We do this by overwriting its `fondant_component.yaml` file with the schema of the data we want it to write. To achieve this, we create a `fondant_component.yaml` file in the directory `components/write_to_hub_controlnet` with the following content:

In [188]:
%%writefile components/write_to_hub_controlnet/fondant_component.yaml
name: Write to hub
description: Component that writes a dataset to the hub
image: fndnt/write_to_hf_hub:0.6.2  # We use a docker image from the Fondant Hub instead of implementing our own.

consumes:  # We fill in our data schema here. The component will write this data to the Hugging Face Hub.
  images:
    fields:
      data:
        type: binary

  conditionings:
    fields:
      data:
        type: binary

  captions:
    fields:
      text:
        type: string

args:  # We repeat the arguments from the original `fondant_component.yaml`
  hf_token:
    description: The hugging face token used to write to the hub
    type: str
  username:
    description: The username under which to upload the dataset
    type: str
  dataset_name:
    description: The name of the dataset to upload
    type: str
  image_column_names:
    description: A list containing the image column names. Used to format the images to HF hub format
    type: list
    default: []
  column_name_mapping:
    description: Mapping of the consumed fondant column names to the written hub column names
    type: dict
    default: {}

Overwriting components/write_to_hub_controlnet/fondant_component.yaml


We then create an operation for it, as if it were a custom component:

In [189]:
write_to_hub_controlnet = ComponentOp(
    component_dir="components/write_to_hub_controlnet",
    arguments={
        "username": USERNAME,
        "hf_token": HF_TOKEN,
        "dataset_name": "controlnet-interior-design",
        # Only the binary image columns belong here; `captions_text` is a plain string column.
        "image_column_names": ["images_data", "conditionings_data"],
    },
)

Finally, we add it to the pipeline:

In [190]:
pipeline.add_op(write_to_hub_controlnet, dependencies=conditioing_component_op)
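
The column names used in `image_column_names` follow a `<subset>_<field>` pattern, which also shows up in the log output of the filter component (`images_url`). The sketch below derives those names from a `consumes` section like the one written above. The dictionary is a hand-copied stand-in for the YAML and the helper names are hypothetical.

```python
# Hand-copied from the `consumes` section of fondant_component.yaml above.
consumes = {
    "images": {"data": "binary"},
    "conditionings": {"data": "binary"},
    "captions": {"text": "string"},
}


def column_names(consumes, field_type=None):
    """List `<subset>_<field>` names, optionally restricted to one type."""
    return [
        f"{subset}_{field}"
        for subset, fields in consumes.items()
        for field, type_ in fields.items()
        if field_type is None or type_ == field_type
    ]


print(column_names(consumes))
print(column_names(consumes, field_type="binary"))
```

Restricting the list to `binary` fields gives the candidates that can hold image bytes, which is a quick sanity check before filling in `image_column_names`.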

## Running the pipeline

This pipeline will generate prompts, retrieve matching images from the LAION dataset, filter their captions, download the images, and create a conditioning image for each of them. If you added the optional `write_to_hf_hub` component, it will also write the resulting dataset to the HF hub.

Fondant provides multiple runners to run our pipeline:
- A Docker runner for local execution
- A Vertex AI runner for managed execution on Google Cloud
- A Kubeflow Pipelines runner for execution anywhere

Here we will use the `DockerRunner` for local execution, which uses Docker Compose under the hood.

The runner will first build the custom components and download the reusable components from the component hub. Afterwards, you will see the components execute one by one.

In [191]:
# If you are using a MacBook with an M1 processor, make sure to set the Docker default platform to linux/amd64
import os
os.environ["DOCKER_DEFAULT_PLATFORM"] = "linux/amd64"

In [192]:
from fondant.compiler import DockerCompiler
from fondant.runner import DockerRunner

# Compile the pipeline to a Docker Compose file, then run it locally.
DockerCompiler().compile(pipeline=pipeline, output_path="docker-compose.yml")
DockerRunner().run("docker-compose.yml")

[2023-11-30 02:31:07,997 | fondant.compiler | INFO] Compiling controlnet-pipeline to docker-compose.yml
[2023-11-30 02:31:07,997 | fondant.compiler | INFO] Base path found on local system, setting up ./data_dir as mount volume
[2023-11-30 02:31:07,998 | fondant.pipeline | INFO] Sorting pipeline component graph topologically.
[2023-11-30 02:31:08,063 | fondant.pipeline | INFO] All pipeline component specifications match.
[2023-11-30 02:31:08,064 | fondant.compiler | INFO] Compiling service for generate_prompts
[2023-11-30 02:31:08,065 | fondant.compiler | INFO] Found Dockerfile for generate_prompts, adding build step.
[2023-11-30 02:31:08,065 | fondant.compiler | INFO] Compiling service for laion_retrieval
[2023-11-30 02:31:08,067 | fondant.compiler | INFO] Found Dockerfile for laion_retrieval, adding build step.
[2023-11-30 02:31:08,067 | fondant.compiler | INFO] Compiling service for filter_prompts
[2023-11-30 02:31:08,068 | fondant.compiler | INFO] Found Dockerfile for filter_prompts

#0 building with "default" instance using docker driver

#1 [generate_prompts internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [generate_prompts internal] load build definition from Dockerfile
#2 transferring dockerfile: 538B done
#2 DONE 0.0s

#3 [generate_prompts internal] load metadata for docker.io/library/python:3.8-slim
#3 DONE 0.0s

#4 [generate_prompts 1/8] FROM docker.io/library/python:3.8-slim
#4 DONE 0.0s

#5 [generate_prompts internal] load build context
#5 transferring context: 133B done
#5 DONE 0.0s

#6 [generate_prompts 2/8] RUN apt-get update &&     apt-get upgrade -y &&     apt-get install git -y
#6 CACHED

#7 [generate_prompts 4/8] RUN python3 -m pip install --upgrade pip
#7 CACHED

#8 [generate_prompts 7/8] COPY src/ .
#8 CACHED

#9 [generate_prompts 3/8] COPY requirements.txt /
#9 CACHED

#10 [generate_prompts 5/8] RUN pip3 install --no-cache-dir -r requirements.txt
#10 CACHED

#11 [generate_prompts 6/8] WORKDIR /component/src
#11 CACHE

 Container controlnet-pipeline-generate_prompts-1  Recreate
 Container controlnet-pipeline-generate_prompts-1  Recreated
 Container controlnet-pipeline-laion_retrieval-1  Recreate
 Container controlnet-pipeline-laion_retrieval-1  Recreated
 Container controlnet-pipeline-filter_prompts-1  Recreate
 Container controlnet-pipeline-filter_prompts-1  Recreated
 Container controlnet-pipeline-download_images-1  Recreate
 Container controlnet-pipeline-download_images-1  Recreated
 Container controlnet-pipeline-condition_images-1  Recreate
 Container controlnet-pipeline-condition_images-1  Recreated
 Container controlnet-pipeline-write_to_hub-1  Recreate
 Container controlnet-pipeline-write_to_hub-1  Recreated


Attaching to controlnet-pipeline-condition_images-1, controlnet-pipeline-download_images-1, controlnet-pipeline-filter_prompts-1, controlnet-pipeline-generate_prompts-1, controlnet-pipeline-laion_retrieval-1, controlnet-pipeline-write_to_hub-1


controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,682 | fondant.cli | INFO] Component `GeneratePromptsComponent` found in module main
controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,690 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,709 | fondant.executor | INFO] Matching execution detected for component. The last execution of the component originated from `controlnet-pipeline-20231130014353`.
controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,709 | fondant.executor | INFO] Skipping component execution
controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,710 | fondant.executor | INFO] Saving output manifest to /data_dir/controlnet-pipeline/controlnet-pipeline-20231130023107/generate_prompts/manifest.json
controlnet-pipeline-generate_prompts-1  | [2023-11-30 01:31:11,71

controlnet-pipeline-generate_prompts-1 exited with code 0


controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,497 | fondant.cli | INFO] Component `LAIONRetrievalComponent` found in module main
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,511 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,515 | fondant.executor | INFO] Caching disabled for the component
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,516 | root | INFO] Executing component
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,555 | fondant.data_io | INFO] Loading subset prompts with fields ['text']...
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,563 | fondant.data_io | INFO] The number of partitions of the input dataframe is 1. The available number of workers is 8.
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:13,564 | fondant

[########################################] | 100% Completed | 1.81 s


controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:15,416 | fondant.executor | INFO] Saving output manifest to /data_dir/controlnet-pipeline/controlnet-pipeline-20231130023107/laion_retrieval/manifest.json
controlnet-pipeline-laion_retrieval-1   | [2023-11-30 01:31:15,416 | fondant.executor | INFO] Writing cache key to /data_dir/controlnet-pipeline/cache/f0fac4258b6bfa7d0b9a87504a83f845.txt


controlnet-pipeline-laion_retrieval-1 exited with code 0


controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,774 | fondant.cli | INFO] Component `FilterComponent` found in module main
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,792 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,802 | fondant.executor | INFO] Caching disabled for the component
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,802 | root | INFO] Executing component
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,881 | fondant.data_io | INFO] Loading subset images with fields ['url']...
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,905 | fondant.data_io | INFO] Loading subset captions with fields ['text']...
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:17,931 | root | INFO] Columns of dataframe: ['images_url', 'captions_t

[########################################] | 100% Completed | 409.25 ms


controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:18,394 | fondant.executor | INFO] Saving output manifest to /data_dir/controlnet-pipeline/controlnet-pipeline-20231130023107/filter_prompts/manifest.json
controlnet-pipeline-filter_prompts-1    | [2023-11-30 01:31:18,394 | fondant.executor | INFO] Writing cache key to /data_dir/controlnet-pipeline/cache/ed0ab64b7d4d7f4fd5b42d9e32832c97.txt


controlnet-pipeline-filter_prompts-1 exited with code 0


controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,657 | fondant.cli | INFO] Component `DownloadImagesComponent` found in module main
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,663 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,665 | fondant.executor | INFO] Previous component `filter_prompts` is not cached. Invalidating cache for current and subsequent components
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,665 | fondant.executor | INFO] Caching disabled for the component
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,665 | root | INFO] Executing component
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:21,732 | fondant.data_io | INFO] Loading subset images with fields ['url']...
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:2

[########################################] | 100% Completed | 12.65 s


controlnet-pipeline-download_images-1   | [2023-11-30 01:31:34,897 | fondant.executor | INFO] Saving output manifest to /data_dir/controlnet-pipeline/controlnet-pipeline-20231130023107/download_images/manifest.json
controlnet-pipeline-download_images-1   | [2023-11-30 01:31:34,897 | fondant.executor | INFO] Writing cache key to /data_dir/controlnet-pipeline/cache/2f5b3ae9d0b06ce05c544feac812a00b.txt


controlnet-pipeline-download_images-1 exited with code 0


controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,231 | fondant.cli | INFO] Component `ConditioningComponent` found in module main
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,241 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,252 | fondant.executor | INFO] Previous component `download_images` is not cached. Invalidating cache for current and subsequent components
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,252 | fondant.executor | INFO] Caching disabled for the component
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,252 | root | INFO] Executing component
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:37,329 | fondant.data_io | INFO] Loading subset images with fields ['data']...
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:3

[########################################] | 100% Completed | 19.47 s


controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:56,866 | fondant.executor | INFO] Saving output manifest to /data_dir/controlnet-pipeline/controlnet-pipeline-20231130023107/condition_images/manifest.json
controlnet-pipeline-condition_images-1  | [2023-11-30 01:31:56,866 | fondant.executor | INFO] Writing cache key to /data_dir/controlnet-pipeline/cache/a18eda28c948a59afdb40d58979fbebc.txt


controlnet-pipeline-condition_images-1 exited with code 0


controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,448 | fondant.cli | INFO] Component `WriteToHubComponent` found in module main
controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,452 | fondant.executor | INFO] Dask default local mode will be used for further executions.Our current supported options are limited to 'local' and 'default'.
controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,456 | fondant.executor | INFO] Previous component `condition_images` is not cached. Invalidating cache for current and subsequent components
controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,456 | fondant.executor | INFO] Caching disabled for the component
controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,456 | root | INFO] Executing component
controlnet-pipeline-write_to_hub-1      | [2023-11-30 01:31:59,685 | main | INFO] Creating HF dataset repository under ID: 'khaerens/controlnet-interior-design'
controlnet-pipeline-write_to_hub-1      

controlnet-pipeline-write_to_hub-1 exited with code 137
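
Note that the last container exited with code 137, whereas every other component exited with code 0. By POSIX shell convention, an exit code above 128 means the process was terminated by signal `code - 128`. The small helper below (a hypothetical name, using only the standard `signal` module) decodes such codes on a POSIX system.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short description."""
    if code == 0:
        return "success"
    if code > 128:
        number = code - 128
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by signal {number}"
    return f"failed with status {code}"


print(describe_exit_code(137))
```

For Docker containers, a `SIGKILL` is often sent by the kernel's out-of-memory killer, although the truncated log above does not confirm the cause in this run. Raising Docker's memory limit or reducing the amount of data is a reasonable first thing to try.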


## Exploring the dataset 

You can also explore the dataset using the Fondant explorer, which lets you visualize the output dataset at each component step. Use the side panel on the left to browse through the steps and subsets.

**If Docker throws an error, run the cleanup command in the cell after the explorer cell (this can happen if the previous container didn't shut down correctly)**

In [None]:
from fondant.explorer import run_explorer_app


run_explorer_app(
    base_path=BASE_PATH,
    container="fndnt/data_explorer",
    tag="0.6.2",
    port=8501,
)


[2023-11-30 01:38:18,885 | root | INFO] Using local base path: ./data_dir
[2023-11-30 01:38:18,886 | root | INFO] This directory will be mounted to /artifacts in the container.
[2023-11-30 01:38:18,887 | root | INFO] Running image from registry: fndnt/data_explorer with tag: 0.6.2 on port: 8501
[2023-11-30 01:38:18,888 | root | INFO] Access the explorer at http://localhost:8501
0.6.2: Pulling from fndnt/data_explorer
Digest: sha256:8f317b795798f24f37cb287355d6223c9cca94eb6f12e3535790d1faa79735ec
Status: Image is up to date for fndnt/data_explorer:0.6.2


KeyboardInterrupt: 

In [None]:
!docker rm $(docker stop $(docker ps -a -q --filter ancestor=fndnt/data_explorer:0.6.2 --format="{{.ID}}"))

Error response from daemon: No such container: 90a4cf0d42f4


To stop the Explorer and continue the notebook, press the stop button at the top of the notebook.

## Scaling up

If you're happy with your dataset, it's time to scale up. Check [our documentation](https://fondant.ai/en/latest/pipeline/#compiling-and-running-a-pipeline) for more information about the available runners.