# 🍫 Creative Commons license dataset

> ⚠️ Please note that this notebook is not compatible with Google Colab. To complete the tutorial, you must initiate Docker containers. Starting Docker containers within Google Colab is not supported.

This sample pipeline demonstrates how to use a Creative Commons image dataset
within a Fondant pipeline. The dataset comprises images from diverse sources and is available in various data formats.
[The dataset](https://huggingface.co/datasets/fondant-ai/fondant-cc-25m) itself is available on the Hugging Face Hub.




### Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable
components to load an image dataset from the Hugging Face Hub, download all images, and filter them by resolution.

Pipeline steps:

- [Load from Hugging Face Hub](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub):
  The pipeline begins by loading the image dataset from the Hugging Face Hub.
- [Download Images](https://github.com/ml6team/fondant/tree/main/components/download_images):
  The download images component downloads the images and stores them as parquet.
- [Filter Images](https://github.com/ml6team/fondant/tree/main/components/filter_image_resolution):
  The filter image component filters images based on their resolution.

## Environment

This section checks the prerequisites of your environment. Read any errors or warnings carefully.

**Ensure a Python version between 3.8 and 3.10 is available**

In [None]:
import sys
if sys.version_info < (3, 8, 0) or sys.version_info >= (3, 11, 0):
    raise Exception(f"A Python version between 3.8 and 3.10 is required. You are running {sys.version}")

**Check if docker compose is installed and the docker daemon is running**

In [None]:
!docker compose version >/dev/null
!docker info >/dev/null
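
The same two checks can also be run from plain Python, which gives a single yes/no answer instead of shell output. This is a minimal sketch using only the standard library; `docker_ready` is a hypothetical helper name and is not part of Fondant.

```python
import shutil
import subprocess


def docker_ready(timeout_seconds: float = 10.0) -> bool:
    """Return True if the docker CLI exists and both checks exit with status 0."""
    if shutil.which("docker") is None:
        return False  # the docker CLI is not on PATH

    # The same commands as in the cell above, with their output discarded.
    for command in (["docker", "compose", "version"], ["docker", "info"]):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_seconds,
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired):
            return False
        if result.returncode != 0:
            return False
    return True


print("Docker ready:", docker_ready())
```

If this prints `Docker ready: False`, rerun the shell commands above without the redirect to see the underlying error message.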

**Install Fondant**

In [None]:
!pip install -r ../requirements.txt

## Implement the pipeline

First, we initialize the pipeline by giving it a name, a description, and a `base_path`. The `base_path` is where the pipeline artifacts and the data generated by the components are stored.

In [None]:
from pathlib import Path

import pyarrow as pa

from fondant.pipeline import Pipeline

BASE_PATH = "./fondant-artifacts"

# Create data directory if it doesn't exist
Path(BASE_PATH).mkdir(parents=True, exist_ok=True)

pipeline = Pipeline(
    name="filter-creative-commons",  # Add a unique pipeline name to easily track your progress and data
    description="Load cc image dataset",
    base_path=BASE_PATH,  # The demo pipeline uses a local directory to store the data.
)

For demonstration purposes, we will use a dataset available on the Hugging Face Hub and load it with the reusable Fondant component `load_from_hf_hub`. Because `load_from_hf_hub` is a generic component, we still need to specify its `produces` section, which explicitly names and types the fields the component will generate. You can find the available fields on the [Hugging Face dataset page](https://huggingface.co/datasets/fondant-ai/fondant-cc-25m).


Add the following to your pipeline:

In [None]:
# Load from hub component
raw_data = pipeline.read(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "fondant-ai/fondant-cc-25m",
        "n_rows_to_load": 100,  # Modify the number of images you want to download.
    },
    produces={
        "alt_text": pa.string(),
        "image_url": pa.string(),
        "license_location": pa.string(),
        "license_type": pa.string(),
        "webpage_url": pa.string(),
        "surt_url": pa.string(),
        "top_level_domain": pa.string(),
    }
)

So far, our pipeline consists of a single component that loads the dataset from the Hugging Face Hub. We can add further components to the pipeline. Here, our objective is to download all the images, so we use the reusable `download_images` component. We apply it to the output of the `load_from_hf_hub` component, which tells Fondant that this step should follow the previous one.

In [None]:
# Download images component
images = raw_data.apply(
    "download_images",
    arguments={
        "input_partition_rows": 100,
        "resize_mode": "no",
    }
)

Reusable components offer various arguments that typically affect the component's operations. In this case, we have set, for example, `"resize_mode": "no"`. This setting ensures that the images will not be resized after they are downloaded. If you would like to learn more about components and their arguments, please refer to our [documentation](https://fondant.ai) and explore the [ComponentHub](https://hub.fondant.ai).
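
The chaining described above, where `apply` is called on the result of the previous step, can be pictured with a toy model. The sketch below is not Fondant's implementation; `ToyPipeline` and `ToyDataset` are hypothetical classes that only record which step follows which.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: records steps and their parents."""

    name: str
    # Maps each step name to the step it depends on (None for the first step).
    dependencies: Dict[str, Optional[str]] = field(default_factory=dict)

    def read(self, component: str) -> "ToyDataset":
        # The first step has no parent.
        self.dependencies[component] = None
        return ToyDataset(pipeline=self, last_step=component)

    def execution_order(self) -> List[str]:
        # Dicts keep insertion order, and each step is registered after its parent.
        return list(self.dependencies)


@dataclass
class ToyDataset:
    """Hypothetical handle that remembers which step produced it."""

    pipeline: ToyPipeline
    last_step: str

    def apply(self, component: str) -> "ToyDataset":
        # Calling apply on a dataset makes the new step depend on the producing step.
        self.pipeline.dependencies[component] = self.last_step
        return ToyDataset(pipeline=self.pipeline, last_step=component)


toy = ToyPipeline(name="demo")
toy_raw = toy.read("load_from_hf_hub")
toy_images = toy_raw.apply("download_images")
print(toy.execution_order())
```

The point of the design is that the order of steps is never declared separately: it follows from which dataset handle each `apply` call is made on.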

Let's add one more step to our pipeline. We will filter the downloaded images based on their resolution, using the `filter_image_resolution` component. This component requires two arguments: `min_image_dim` and `max_aspect_ratio`. We will set these arguments so that only images with a minimum dimension of 512 pixels and an aspect ratio of at most 2.5 are kept.

In [None]:
# Filter images component
big_images = images.apply(
    "filter_image_resolution",
    arguments={
        "min_image_dim": 512,
        "max_aspect_ratio": 2.5,
    }
)
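
To make the two arguments concrete, here is a minimal sketch of the kind of rule they describe. It assumes that `min_image_dim` is compared against the shorter side of the image and that the aspect ratio is the longer side divided by the shorter side; `keep_image` is a hypothetical helper for illustration, not the component's actual code.

```python
def keep_image(
    width: int,
    height: int,
    min_image_dim: int = 512,
    max_aspect_ratio: float = 2.5,
) -> bool:
    """Return True if an image of the given size passes both thresholds."""
    shortest, longest = min(width, height), max(width, height)
    if shortest <= 0:
        return False  # guard against empty or invalid images
    # Assumption: aspect ratio is defined as longer side / shorter side.
    return shortest >= min_image_dim and longest / shortest <= max_aspect_ratio


for size in [(1024, 768), (400, 800), (512, 2000)]:
    print(size, keep_image(*size))
```

Under these assumptions, a 1024x768 image is kept, a 400x800 image is dropped because its shorter side is below 512, and a 512x2000 image is dropped because its aspect ratio exceeds 2.5.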

## Execute the pipeline

Now we are ready to execute our pipeline.
Fondant provides various runners, and in this case, we are using the `DockerRunner`, which runs the pipeline locally using Docker.

In [None]:
# If you are using a MacBook with an M1 processor, make sure to set the Docker default platform to linux/amd64
import os
os.environ["DOCKER_DEFAULT_PLATFORM"]="linux/amd64"
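
If you would rather set the variable only on ARM machines, the platform can be detected first. This is a sketch under the assumption that `platform.machine()` reports `arm64` or `aarch64` on such hardware; `docker_platform_override` is a hypothetical helper name.

```python
import os
import platform
from typing import Optional


def docker_platform_override(machine: str) -> Optional[str]:
    """Return "linux/amd64" for ARM machine names, otherwise None."""
    return "linux/amd64" if machine.lower() in {"arm64", "aarch64"} else None


override = docker_platform_override(platform.machine())
if override is not None:
    # Only touch the environment when an override is actually needed.
    os.environ["DOCKER_DEFAULT_PLATFORM"] = override
```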

In [None]:
from fondant.pipeline.runner import DockerRunner

DockerRunner().run(input=pipeline)

## Exploring the dataset

You can also explore the dataset using the Fondant explorer, which lets you visualize the output dataset at each component step.


In [None]:
from fondant.explore import run_explorer_app

run_explorer_app(
    base_path=BASE_PATH,
    container="fndnt/data_explorer",
    tag="latest",
    port=8501
)