## 🍫 Building a DataComp filtering pipeline with Fondant

[DataComp](https://www.datacomp.ai/) is a competition organized by the University of Washington and
others to come up with the best possible image-text dataset to train a fixed CLIP model. Hence, it's
an ideal use case for Fondant, as we can leverage reusable components to filter large, noisy
image-text datasets.

In this example, we build a pipeline that filters the dataset using the T-MARS data filtering
approach. For more information on T-MARS, check out
the [official paper](https://arxiv.org/pdf/2307.03132.pdf).

The pipeline is built from the following components:

1. [**Load from hf hub**](components/load_from_hf_hub): The pipeline begins by loading the initial
   DataComp data, which we host on the Hugging Face Hub.

2. [**Download images**](https://github.com/ml6team/fondant/tree/main/components/download_images):
   This component downloads the actual images based on the URLs retrieved by the previous component.
   It takes in the URLs as input and returns the actual images.

3. [**Resize images**](https://github.com/ml6team/fondant/tree/main/components/resize_images): This
   component resizes the images to a fixed size. It takes in the images as input and returns the
   resized images.

4. [**Detect text**](components/detect_text): This component detects text in the images using
   an [mmocr model](https://github.com/locuslab/T-MARS/tree/main/dataset2metadata/text_detection).
   It takes in the images as input and returns the bounding boxes of the detected text.

5. [**Mask images**](components/mask_images): This component masks the detected text in the images.
   It takes in the images and the bounding boxes as input and returns the masked images.

6. [**Add clip score**](components/add_clip_score): This component adds a CLIP score to the images.
   The CLIP score is estimated as the dot product between the CLIP embeddings of the masked images
   and the original image captions.

7. [**Filter clip score**](components/filter_clip_score): This component filters the images based on
   their CLIP score. It takes in the images and the CLIP scores as input and returns the filtered
   indexes.
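
The last two steps can be sketched in plain Python. This is only an illustration, not the code of the `add_clip_score` or `filter_clip_score` components: the helper names are hypothetical, the embeddings are toy 3-dimensional vectors, and normalizing the embeddings and keeping scores at or above the threshold are both assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def clip_score(image_embedding, text_embedding):
    """Dot product of the two normalized embeddings (normalization is an assumption)."""
    image_unit = l2_normalize(image_embedding)
    text_unit = l2_normalize(text_embedding)
    return sum(i * t for i, t in zip(image_unit, text_unit))


def filter_by_score(scores, threshold):
    """Return the indexes whose score is at or above the threshold (assumed inclusive)."""
    return [index for index, score in scores.items() if score >= threshold]


# Toy (masked image embedding, caption embedding) pairs, made up for illustration.
samples = {
    "a": ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction
    "b": ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
    "c": ([3.0, 4.0, 0.0], [4.0, 3.0, 0.0]),  # partially aligned
}
scores = {key: clip_score(img, txt) for key, (img, txt) in samples.items()}
kept = filter_by_score(scores, threshold=0.19)
print(kept)
```

Only the indexes are returned, which mirrors the description of the filter component above: the images themselves do not need to be passed along once the score is known.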

**Prerequisites:**

- Ensure Python version 3.8 to 3.10 is installed on your system.
- Install and configure Docker on your system.
- Ensure that you have a GPU for running the GPU-based component of the pipeline.


## Environment

### This section checks the prerequisites of your environment. Read any errors or warnings carefully.

**Ensure a Python version between 3.8 and 3.10 is available**

In [1]:
import sys
if sys.version_info < (3, 8, 0) or sys.version_info >= (3, 11, 0):
    raise Exception(f"A Python version between 3.8 and 3.10 is required. You are running {sys.version}")

**Check if docker compose is installed and the docker daemon is running**

In [2]:
!docker compose version >/dev/null
!docker info >/dev/null

**Check if GPU is available**

In [3]:
import logging
import subprocess

try:
    subprocess.check_output('nvidia-smi')
    logging.info("Found GPU, using it!")
    number_of_accelerators = 1
    accelerator_name = "GPU"
except Exception:
    logging.warning("We recommend running this pipeline on a GPU, but none could be found, using CPU instead")
    number_of_accelerators = None
    accelerator_name = None



**Install Fondant**

In [2]:
!pip install -r ../requirements.txt


## Implement the pipeline

First, we need to initialize the pipeline by specifying a name for it, providing a description, and setting a `base_path`. The `base_path` is where the pipeline artifacts and the data generated by the components are stored.

In [16]:
from pathlib import Path

from fondant.pipeline import ComponentOp, Pipeline, Resources

IMAGE_SIZE = 256

BASE_PATH = "./data_dir"
Path(BASE_PATH).mkdir(parents=True, exist_ok=True)

pipeline = Pipeline(
    pipeline_name="datacomp-filtering-pipeline",
    pipeline_description="A pipeline for filtering the DataComp dataset",
    base_path=BASE_PATH
)

To start off, we will use the `load_from_hub_op` component to load the initial [dataset](https://huggingface.co/datasets/nielsr/datacomp-small-with-text-embeddings):


In [17]:
load_component_column_mapping = {
    "url": "images_url",
    "original_width": "images_width",
    "original_height": "images_height",
    "face_bboxes": "images_face_bboxes",
    "sha256": "images_sha256",
    "text": "text_data",
    "uid": "image_text_uid",
    "clip_b32_similarity_score": "image_text_clip_b32_similarity_score",
    "clip_l14_similarity_score": "image_text_clip_l14_similarity_score",
}


load_from_hub_op = ComponentOp(
    component_dir="components/load_from_hf_hub",
    arguments={
        "dataset_name": "nielsr/datacomp-small-with-text-embeddings",
        "column_name_mapping": load_component_column_mapping,
        "n_rows_to_load": 10,
    },
)

pipeline.add_op(load_from_hub_op)
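
To make the effect of `column_name_mapping` concrete, here is a minimal sketch of renaming the columns of a single row with such a dictionary. The `rename_columns` helper and the example row are hypothetical, and keeping unmapped columns unchanged is an assumption; the actual component may handle them differently.

```python
def rename_columns(row, mapping):
    """Return a copy of `row` with its keys renamed according to `mapping`.

    Keys that are missing from the mapping are kept as-is (an assumption).
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# A subset of the mapping used in the cell above.
column_mapping = {
    "url": "images_url",
    "text": "text_data",
    "uid": "image_text_uid",
}

# Hypothetical row as it might come out of the source dataset.
row = {"url": "https://example.com/cat.jpg", "text": "a cat", "uid": "0001"}
renamed = rename_columns(row, column_mapping)
print(renamed)
```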

Our pipeline now consists of a single component that loads the dataset from the Hugging Face Hub. We can proceed to add the other components. The reusable components available on the hub are loaded using the `ComponentOp.from_registry(...)` method.

In [18]:
download_images_op = ComponentOp.from_registry(
    name="download_images",
    arguments={
        "retries": 2,
        "min_image_size": 0,
    },
)

resize_images = ComponentOp(
    component_dir="components/resize_images",
    arguments={
        "resize_width": IMAGE_SIZE,
        "resize_height": IMAGE_SIZE,
    },
)

detect_text_op = ComponentOp(
    component_dir="components/detect_text",
    arguments={
        "batch_size": 8,
        "image_size": IMAGE_SIZE,
    },
    resources=Resources(
        accelerator_number=number_of_accelerators,
        accelerator_name=accelerator_name,
    ),
)
mask_images_op = ComponentOp(
    component_dir="components/mask_images",
)

embed_images_op = ComponentOp.from_registry(
    name="embed_images",
    arguments={
        "batch_size": 8,
    },
    resources=Resources(
        accelerator_number=number_of_accelerators,
        accelerator_name=accelerator_name,
    ),
)
add_clip_score_op = ComponentOp(
    component_dir="components/add_clip_score",
)

filter_clip_score_op = ComponentOp(
    component_dir="components/filter_clip_score",
    arguments={
        "threshold_score": 0.19,
    },
)
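
The `mask_images` component takes no arguments; as described in the component list, it masks the detected text given the bounding boxes. The idea can be sketched as follows. This is not the component's implementation: the `mask_boxes` helper is hypothetical, the image is a tiny grid of integers, and both the `(x_min, y_min, x_max, y_max)` box format and the constant fill value are assumptions.

```python
def mask_boxes(image, boxes, fill=0):
    """Return a copy of `image` with every box region overwritten by `fill`.

    `image` is a list of rows of pixel values. Each box is assumed to be
    (x_min, y_min, x_max, y_max) with exclusive maxima; boxes are clipped
    to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for x_min, y_min, x_max, y_max in boxes:
        for y in range(max(0, y_min), min(height, y_max)):
            for x in range(max(0, x_min), min(width, x_max)):
                masked[y][x] = fill
    return masked


# A 3x4 toy image where every pixel has the value 9.
image = [[9] * 4 for _ in range(3)]
masked = mask_boxes(image, [(1, 0, 3, 2)])
for row in masked:
    print(row)
```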

Now we can add the components to our pipeline. Each step declares the step it depends on, which determines the order in which the components run.

In [19]:
pipeline.add_op(download_images_op, dependencies=load_from_hub_op)
pipeline.add_op(resize_images, dependencies=download_images_op)
pipeline.add_op(detect_text_op, dependencies=resize_images)
pipeline.add_op(mask_images_op, dependencies=detect_text_op)
pipeline.add_op(embed_images_op, dependencies=mask_images_op)
pipeline.add_op(add_clip_score_op, dependencies=embed_images_op)
pipeline.add_op(filter_clip_score_op, dependencies=add_clip_score_op)
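
As an illustration of how such dependencies determine an execution order, the sketch below derives the order from a plain dictionary with a simple topological sort. It does not use Fondant's API: the `execution_order` helper and the step names are hypothetical stand-ins for the operations added above.

```python
from collections import deque


def execution_order(dependencies):
    """Order steps so that every step comes after the steps it depends on.

    `dependencies` maps each step name to the list of steps it depends on.
    """
    remaining = {step: set(parents) for step, parents in dependencies.items()}
    ready = deque(sorted(step for step, parents in remaining.items() if not parents))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        # Release every step that was only waiting on the one just scheduled.
        for other, parents in remaining.items():
            if step in parents:
                parents.remove(step)
                if not parents:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("dependency cycle or unknown dependency detected")
    return order


# Mirrors the chain of add_op calls above, using hypothetical step names.
dependencies = {
    "filter_clip_score": ["add_clip_score"],
    "add_clip_score": ["embed_images"],
    "embed_images": ["mask_images"],
    "mask_images": ["detect_text"],
    "detect_text": ["resize_images"],
    "resize_images": ["download_images"],
    "download_images": ["load_from_hub"],
    "load_from_hub": [],
}
order = execution_order(dependencies)
print(" -> ".join(order))
```

Because every step here depends on exactly one predecessor, the result is a single linear chain from loading to filtering.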

## Execute the pipeline

The pipeline loads the DataComp data, downloads and resizes the images, detects and masks any text in them, embeds the masked images, computes a CLIP score for each image-caption pair, and finally filters the dataset based on that score.

We can now execute the pipeline. Fondant provides various executors; in this case, we use the `DockerRunner` for local execution, which relies on Docker Compose under the hood.

In [None]:
from fondant.pipeline.compiler import DockerCompiler
from fondant.pipeline.runner import DockerRunner

DockerCompiler().compile(pipeline=pipeline, output_path="docker-compose.yml")
DockerRunner().run("docker-compose.yml")


## Exploring the dataset 

You can also explore the dataset using the Fondant explorer, which lets you visualize the output dataset at each component step.

In [None]:
from fondant.explore import run_explorer_app

run_explorer_app(
    base_path=BASE_PATH,
    container="fndnt/data_explorer",
    tag="latest",
    port=8501,
)