## 🍫 Building a DataComp filtering pipeline with Fondant

[DataComp](https://www.datacomp.ai/) is a competition organized by the University of Washington and
others to come up with the best possible image-text dataset to train a fixed CLIP model. Hence, it's
an ideal use case for Fondant, as we can leverage reusable components to filter large, noisy
image-text datasets.

In this example, we build a pipeline for filtering the dataset using the T-MARS data filtering
approach. For more information on T-MARS, check out
the [official paper](https://arxiv.org/pdf/2307.03132.pdf).

The pipeline is built from the following components:

1. [**Load from hf hub**](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub): The
   pipeline begins by loading the initial DataComp data, which we host on the Hugging Face Hub.

2. [**Download images**](https://github.com/ml6team/fondant/tree/main/components/download_images):
   This component downloads the actual images based on the URLs retrieved by the previous component.
   It takes in the URLs as input and returns the actual images.

3. [**Resize images**](https://github.com/ml6team/fondant/tree/main/components/resize_images): This
   component resizes the images to a fixed size. It takes in the images as input and returns the
   resized images.

4. [**Detect text**](components/detect_text): This component detects text in the images using
   an [MMOCR model](https://github.com/locuslab/T-MARS/tree/main/dataset2metadata/text_detection).
   It takes in the images as input and returns the bounding boxes of the detected text.

5. [**Mask images**](components/mask_images): This component masks the detected text in the images.
   It takes in the images and the bounding boxes as input and returns the masked images.

6. [**Add clip score**](components/add_clip_score): This component adds a CLIP score to the images.
   The CLIP score is estimated as the dot product between the CLIP embeddings of the masked images
   (computed in the pipeline by the reusable `embed_images` component) and the embeddings of the
   original image captions.

7. [**Filter clip score**](components/filter_clip_score): This component filters the images based on
   their CLIP score. It takes in the images and the CLIP scores as input and returns the filtered
   indexes.
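
To make the masking step more concrete, here is a minimal sketch of what "masking the detected text" can look like on a tiny grayscale image stored as nested lists. The box layout `(x_min, y_min, x_max, y_max)`, the `mask_boxes` helper and the constant fill value are assumptions made for illustration. The actual `mask_images` component may represent boxes differently and use another fill strategy.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max), max exclusive.
Box = Tuple[int, int, int, int]


def mask_boxes(image: List[List[int]], boxes: Sequence[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of a 2D grayscale image with every box region set to `fill`."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # copy so the input image stays untouched
    for x_min, y_min, x_max, y_max in boxes:
        # Clip each box to the image bounds before filling it.
        for y in range(max(0, y_min), min(height, y_max)):
            for x in range(max(0, x_min), min(width, x_max)):
                masked[y][x] = fill
    return masked


# A 6x4 white image with one hypothetical text box covering columns 1-3 of rows 1-2.
image = [[255] * 6 for _ in range(4)]
masked = mask_boxes(image, boxes=[(1, 1, 4, 3)])
for row in masked:
    print(row)
```

The CLIP score is later computed on the masked image, so text rendered inside the image no longer contributes to the image-caption similarity.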

**Prerequisites:**

- Ensure Python version 3.8 to 3.10 is installed on your system.
- Install and configure Docker on your system.
- Ensure that you have a GPU for running the GPU-based component of the pipeline.
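
If you want to verify the Python requirement programmatically, a small check like the one below can help. It is only a sketch: the `is_supported_python` helper is a hypothetical name, and the bounds simply mirror the 3.8 to 3.10 range listed above.

```python
import sys
from typing import Optional, Sequence

SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 10)


def is_supported_python(version: Optional[Sequence[int]] = None) -> bool:
    """Return True if the (major, minor) version lies within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if not is_supported_python():
    print("Warning: this notebook expects Python 3.8 to 3.10.")
```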


In [1]:
# Setup your environment 
!pip install "fondant[docker]==0.8.0" -q

## Implement the pipeline

First of all, we need to initialize the pipeline, which includes specifying a name for your pipeline, providing a description, and setting a `base_path`. The `base_path` is used to store the pipeline artifacts and the data generated by the components.

In [16]:
import logging
from pathlib import Path
import pyarrow as pa
import fsspec
import subprocess

from fondant.pipeline import Pipeline, Resources

# Check GPU
try:
    subprocess.check_output('nvidia-smi')
    number_of_accelerators = 1
    accelerator_name = "GPU"
except Exception:
    logging.warning("We recommend running this pipeline on a GPU, but none could be found.")
    number_of_accelerators = None
    accelerator_name = None
    
# General configs
BASE_PATH = "./fondant-artifacts"
N_ROWS_TO_LOAD = 10  # Set to None to load all rows
IMAGE_SIZE = 256

# Create data directory if it doesn't exist and if it's a local path
if fsspec.core.url_to_fs(BASE_PATH)[0].protocol == ('file', 'local'):
    Path(BASE_PATH).mkdir(parents=True, exist_ok=True)

pipeline = Pipeline(
    name="datacomp-filtering-pipeline",
    description="A pipeline for filtering the Datacomp dataset",
    base_path=BASE_PATH
)

To start off, we will use the `load_from_hf_hub` component to load the initial [dataset](https://huggingface.co/datasets/nielsr/datacomp-small-with-text-embeddings):


In [17]:
dataset_from_hf_hub = pipeline.read(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "nielsr/datacomp-small-with-text-embeddings",
        "n_rows_to_load": N_ROWS_TO_LOAD,
    },
    produces={
        "url": pa.string(),
        "original_width": pa.int64(),
        "original_height": pa.int64(),
        "face_bboxes": pa.list_(pa.list_(pa.float64())),
        "sha256": pa.string(),
        "text": pa.string(),
        "uid": pa.string(),
        "clip_b32_similarity_score": pa.float32(),
        "clip_l14_similarity_score": pa.float32(),
        "clip_l14_text_embedding": pa.list_(pa.float64())
    },
    cache=True
)

Now, our pipeline consists of a single component that loads the dataset from the Hugging Face Hub. We can proceed to add the other components. To add a new reusable component, use the `apply` method. We have to pass the name of the component we want to use, as well as the component arguments.

The `consumes` argument defines which columns of the dataset will be passed to the component.


In [18]:
images = dataset_from_hf_hub.apply(
    "download_images",
    consumes={
        "image_url": "url"
    },
    arguments={
        "retries": 2,
        "min_image_size": 0,
    },
)
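
To build some intuition for the `consumes` mapping used above, the sketch below mimics it with plain dictionaries. It reads the mapping as "component field name → dataset column name", so the component's `image_url` field is fed from the dataset's `url` column. The `select_consumed_columns` helper and the example rows are hypothetical and only illustrate the mapping semantics. This is not how Fondant implements it internally.

```python
from typing import Any, Dict, List


def select_consumed_columns(
    rows: List[Dict[str, Any]], consumes: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Project each row onto the component's fields: {component_field: row[dataset_column]}."""
    return [
        {field: row[column] for field, column in consumes.items()}
        for row in rows
    ]


# Hypothetical rows that use the dataset's column names.
dataset_rows = [
    {"url": "https://example.com/a.jpg", "text": "a cat"},
    {"url": "https://example.com/b.jpg", "text": "a dog"},
]

# Same mapping as in the `download_images` step above.
component_input = select_consumed_columns(dataset_rows, consumes={"image_url": "url"})
print(component_input)
```

Columns that are not listed in the mapping (here `text`) are not part of what the component sees in this sketch.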

We can utilize the `apply` method to incorporate custom components. For this, it is necessary to provide the path to the implementation of the custom component.

In [19]:
resized_images = images.apply(
    "components/resize_images",
    arguments={
        "resize_width": IMAGE_SIZE,
        "resize_height": IMAGE_SIZE,
    }
)

detected_text = resized_images.apply(
    "components/detect_text",
    arguments={
        "batch_size": 8,
        "image_size": IMAGE_SIZE,
    },
    resources=Resources(accelerator_name="GPU", accelerator_number=1),
    cache=False
)

mask_images = detected_text.apply(
    "components/mask_images", 
    cache=False
)

embedded_images = mask_images.apply(
    "embed_images",
    arguments={
        "batch_size": 8,
    },
    resources=Resources(accelerator_name="GPU", accelerator_number=1)
)

images_with_clip_score = embedded_images.apply(
    "components/add_clip_score",
    consumes={
        "text_embedding": "clip_l14_text_embedding"
    }
)

filtered_clip_score_op = images_with_clip_score.apply(
    "components/filter_clip_score",
    arguments={
        "threshold_score": 0.19
    }
)
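
The last two steps can be sketched in plain Python. The score is the dot product between an image embedding and a text embedding, and the filter keeps the indexes whose score reaches the threshold. Normalizing the embeddings first and keeping scores at or above the threshold are assumptions of this sketch, and the helper names are hypothetical. The custom components in `components/add_clip_score` and `components/filter_clip_score` may differ in these details.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def clip_score(image_embedding: Sequence[float], text_embedding: Sequence[float]) -> float:
    """Dot product of the normalized embeddings, i.e. their cosine similarity."""
    image_unit = l2_normalize(image_embedding)
    text_unit = l2_normalize(text_embedding)
    return sum(i * t for i, t in zip(image_unit, text_unit))


def filter_by_score(scores: Sequence[float], threshold: float = 0.19) -> List[int]:
    """Return the indexes of the rows whose score is at or above the threshold."""
    return [index for index, score in enumerate(scores) if score >= threshold]


# Toy 2-dimensional embeddings; real CLIP embeddings have hundreds of dimensions.
image_embeddings = [[3.0, 4.0], [1.0, 0.0]]
text_embeddings = [[4.0, 3.0], [0.0, 1.0]]

scores = [clip_score(i, t) for i, t in zip(image_embeddings, text_embeddings)]
kept_indexes = filter_by_score(scores, threshold=0.19)
print(scores, kept_indexes)
```

In this toy example the first pair is well aligned and is kept, while the second pair is orthogonal and is dropped.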

## Execute the pipeline

The pipeline loads the DataComp data from the Hugging Face Hub, downloads and resizes the images, detects and masks the text in them, computes a CLIP score between the masked images and their captions, and finally filters the samples based on that score.

We can execute our pipeline. Fondant provides various executors, and in this case, we are using the `DockerRunner` for local execution, which utilizes docker-compose under the hood.

In [20]:
from fondant.pipeline.runner import DockerRunner
DockerRunner().run(input=pipeline)

[2023-12-20 09:23:27,598 | root | INFO] Found reference to un-compiled pipeline... compiling
[2023-12-20 09:23:27,599 | fondant.pipeline.compiler | INFO] Compiling datacomp-filtering-pipeline to .fondant/compose.yaml
[2023-12-20 09:23:27,600 | fondant.pipeline.compiler | INFO] Base path found on local system, setting up ./fondant-artifacts as mount volume
[2023-12-20 09:23:27,601 | fondant.pipeline.pipeline | INFO] Sorting pipeline component graph topologically.
[2023-12-20 09:23:27,612 | fondant.pipeline.pipeline | INFO] All pipeline component specifications match.
[2023-12-20 09:23:27,613 | fondant.pipeline.compiler | INFO] Compiling service for load_from_hugging_face_hub
[2023-12-20 09:23:27,614 | fondant.pipeline.compiler | INFO] Compiling service for download_images
[2023-12-20 09:23:27,616 | fondant.pipeline.compiler | INFO] Compiling service for resize_images
[2023-12-20 09:23:27,617 | fondant.pipeline.compiler | INFO] Found Dockerfile for resize_images, adding build step.
[2023

Starting pipeline run...
Finished pipeline run.
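
Since the `DockerRunner` relies on Docker Compose, it can be useful to check that Compose is available before starting a run. The sketch below assumes that Compose is installed as the `docker compose` CLI subcommand. If your setup uses the standalone `docker-compose` binary instead, adapt the command accordingly. The `docker_compose_available` helper is a hypothetical name and not part of Fondant.

```python
import shutil
import subprocess


def docker_compose_available() -> bool:
    """Return True if `docker compose version` runs successfully on this machine."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "compose", "version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not docker_compose_available():
    print("Docker Compose does not seem to be available; the local run may fail.")
```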



## Exploring the dataset 

You can also explore the dataset using the Fondant explorer, which lets you visualize your output dataset at each component step.

In [None]:
from fondant.explore import run_explorer_app
run_explorer_app(base_path=BASE_PATH)

## Scaling up

If you're happy with your dataset, it's time to scale up. Check [our documentation](https://fondant.ai/en/latest/) for more information about the available runners.