# 🍫 Creative Commons license dataset


This sample pipeline demonstrates how to use a Creative Commons image dataset within a Fondant pipeline. The dataset comprises images from diverse sources and is available in various data formats.
[The dataset](https://huggingface.co/datasets/fondant-ai/fondant-cc-25m) itself is available on the Hugging Face Hub.



## Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable
components to load an image dataset from the Hugging Face Hub and download all images.

Pipeline steps:

- [Load from Hugging Face Hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub):
  The pipeline begins by loading the image dataset from the Hugging Face Hub.
- [Download Images](https://github.com/ml6team/fondant/tree/main/components/download_images):
  The download images component downloads the images and stores them in Parquet files.

In [None]:
# Set up your environment
!pip install "fondant[docker]==0.6.2"

## Implement the pipeline

First, we initialize the pipeline by specifying a name, a description, and a `base_path`. The `base_path` is where the pipeline artifacts and the data generated by the components are stored.
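
The pipeline stores its data in `./data-dir`, and a later step writes a component specification into `components/load_from_hf_hub`. Assuming neither directory exists yet (an assumption about your environment, not a documented Fondant requirement), a minimal sketch for creating them up front could look like this:

```python
from pathlib import Path

# Directories used later in this notebook. Creating them here is an assumed
# convenience step, not part of the Fondant API.
base_path = Path("./data-dir")
component_dir = Path("components/load_from_hf_hub")

for directory in (base_path, component_dir):
    # exist_ok=True makes this cell safe to re-run
    directory.mkdir(parents=True, exist_ok=True)
    print(f"ready: {directory}")
```

Creating the component directory first also avoids a failure when a file is written into a folder that does not exist yet.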

In [None]:
%%writefile pipeline.py
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="filter-creative-commons",  # Add a unique pipeline name to easily track your progress and data
    pipeline_description="Load cc image dataset",
    base_path="./data-dir",  # The demo pipeline uses a local directory to store the data.
)

For demonstration purposes, we use a dataset available on the Hugging Face Hub and load it with the reusable Fondant component `load_from_hf_hub`. Because this component is generic, we still need to customize its component specification file: the dataframe schema defined in the `produces` section has to match the dataset we load.

To achieve this, we can create a `fondant_component.yaml` file in the directory `components/load_from_hf_hub` with the following content:

In [None]:
%%writefile components/load_from_hf_hub/fondant_component.yaml
name: Load from hub
description: Component that loads a dataset from the hub
image: fndnt/load_from_hf_hub:0.6.2

produces:
  images:
    fields:
      alt+text:
        type: string
      url:
        type: string
      license+location:
        type: string
      license+type:
        type: string
      webpage+url:
        type: string
      surt+url:
        type: string
      top+level+domain:
        type: string

args:
  dataset_name:
    description: Name of dataset on the hub
    type: str
  column_name_mapping:
    description: Mapping of the consumed hub dataset to fondant column names
    type: dict
    default: {}
  image_column_names:
    description: Optional argument, a list containing the original image column names in case the 
      dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
    type: list
    default: []
  n_rows_to_load:
    description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
    type: int
  index_column:
    description: Column to set index to in the load component, if not specified a default globally unique index will be set
    type: str
    default: None

Afterwards, we can initialize the component and add it to our pipeline.
Note that `load_component_column_mapping` defines how the columns of the Hugging Face dataset are mapped to the schema of the dataset that Fondant operates on.
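
The target names in this mapping appear to follow a `<subset>_<field>` pattern, where the subset (`images`) and the fields correspond to the `produces` section of the component specification. The sketch below checks that assumption in plain Python. The helper `split_column_name` is hypothetical and not part of Fondant.

```python
# Fields declared under `produces.images.fields` in fondant_component.yaml
spec_fields = {
    "alt+text", "url", "license+location", "license+type",
    "webpage+url", "surt+url", "top+level+domain",
}

# Hugging Face column name -> Fondant column name (same mapping as the pipeline)
column_mapping = {
    "alt_text": "images_alt+text",
    "image_url": "images_url",
    "license_location": "images_license+location",
    "license_type": "images_license+type",
    "webpage_url": "images_webpage+url",
    "surt_url": "images_surt+url",
    "top_level_domain": "images_top+level+domain",
}


def split_column_name(column: str) -> tuple[str, str]:
    """Split '<subset>_<field>' at the first underscore (assumed convention)."""
    subset, field = column.split("_", 1)
    return subset, field


mapped = [split_column_name(target) for target in column_mapping.values()]
subsets = {subset for subset, _ in mapped}
mapped_fields = {field for _, field in mapped}

# Every mapped field should be declared in the spec, and vice versa.
missing_in_spec = mapped_fields - spec_fields
unmapped_fields = spec_fields - mapped_fields
print("subsets:", subsets)
print("missing in spec:", missing_in_spec)
print("declared but unmapped:", unmapped_fields)
```

A check like this catches typos early, before a mismatch between the mapping and the specification surfaces as a runtime error inside a container.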

In [None]:
%%writefile -a pipeline.py
# Load from hub component
load_component_column_mapping = {
    "alt_text": "images_alt+text",
    "image_url": "images_url",
    "license_location": "images_license+location",
    "license_type": "images_license+type",
    "webpage_url": "images_webpage+url",
    "surt_url": "images_surt+url",
    "top_level_domain": "images_top+level+domain",
}

load_from_hf_hub = ComponentOp(
    component_dir="components/load_from_hf_hub",
    arguments={
        "dataset_name": "fondant-ai/fondant-cc-25m",
        "column_name_mapping": load_component_column_mapping,
        "n_rows_to_load": 100,  # Here you can modify the number of images you want to download.
    }
)

pipeline.add_op(load_from_hf_hub)

Currently, our pipeline comprises a single component that loads the dataset from the Hugging Face Hub. We can add further components to the pipeline. Here, the goal is to download all the images, for which we use the reusable component `download_images`. Reusable components are loaded with the `ComponentOp.from_registry(...)` method.

In [None]:
%%writefile -a pipeline.py
download_images = ComponentOp.from_registry(
    name="download_images",
    arguments={"input_partition_rows": 100, "resize_mode": "no"},
)

Reusable components accept arguments that change how they operate. Here, for example, we set `"resize_mode": "no"` so that the images are not resized after they are downloaded. To learn more about components and their arguments, refer to our [documentation](https://fondant.ai) and explore the [ComponentHub](https://hub.fondant.ai).


Now we can add the component to our pipeline. Note the `dependencies` argument, which declares that `download_images` runs after `load_from_hf_hub`.
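
Dependencies determine the order in which components run. As an illustration only (this uses Python's standard `graphlib` module, not Fondant's internals), the two components of this pipeline can be modelled as a small dependency graph:

```python
from graphlib import TopologicalSorter

# Each key is a component; the value is the set of components it depends on.
dependencies = {
    "load_from_hf_hub": set(),
    "download_images": {"load_from_hf_hub"},
}

# A topological sort yields an order in which every component comes after
# all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

With only two components the order is obvious, but the same idea scales to pipelines with many steps.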

In [None]:
%%writefile -a pipeline.py
pipeline.add_op(download_images, dependencies=[load_from_hf_hub])

## Execute the pipeline

Now we are ready to execute our pipeline.
Fondant provides several runners. Here we use the `LocalRunner`, which uses Docker under the hood.
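
Since the local runner relies on Docker, it can help to confirm that the `docker` executable is on the `PATH` before starting the run. A minimal check using only the standard library is sketched below. It only looks for the executable and does not verify that the Docker daemon is running.

```python
import shutil

# shutil.which returns the full path of the executable, or None if not found
docker_path = shutil.which("docker")

if docker_path is None:
    print("docker not found on PATH; install Docker before running the pipeline")
else:
    print(f"docker found at {docker_path}")
```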

In [None]:
!fondant run local pipeline.py
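
Once the run finishes, the output should be somewhere under the `base_path`. The exact directory layout is not described here, so the following sketch simply searches `./data-dir` recursively for Parquet files (an assumption about where and how the output is written):

```python
from pathlib import Path

base_path = Path("./data-dir")

# Recursively collect Parquet files; the list is empty if none are found
parquet_files = sorted(base_path.rglob("*.parquet"))

for path in parquet_files:
    print(path)
print(f"{len(parquet_files)} parquet file(s) found under {base_path}")
```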