## 🍫 Building a RAG indexing pipeline with Fondant

This repository demonstrates a Fondant data pipeline that ingests text
data into a vector database. The pipeline uses four reusable Fondant components.
We also provide a Docker Compose setup for Weaviate, which enables local testing and
development.

### Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable
components to load, chunk and embed text, and to ingest the resulting text embeddings into a vector
database.

Pipeline steps:

- [Data Loading](https://github.com/ml6team/fondant/tree/main/components/load_from_parquet): The
  pipeline begins by loading text data from a Parquet file, which serves as the
  source for subsequent processing. For this minimal example, we use a dataset from the
  Hugging Face Hub, loaded with the `load_from_hf_hub` component.
- [Text Chunking](https://github.com/ml6team/fondant/tree/main/components/chunk_text): Text data is
  chunked into manageable sections to prepare it for embedding. This step is crucial for
  performant RAG systems.
- [Text Embedding](https://github.com/ml6team/fondant/tree/main/components/embed_text): We use
  a small Hugging Face model to generate the text embeddings.
  The `embed_text` component also makes it easy to use different models.
- [Write to Weaviate](https://github.com/ml6team/fondant/tree/main/components/index_weaviate): The
  final step of the pipeline writes the embedded text data to
  a Weaviate database.

In [None]:
# Setup your environment 
!pip install "fondant[docker]==0.6.2"

## Implement the pipeline

First, we need to initialize the pipeline, which includes specifying a name for the pipeline, providing a description, and setting a `base_path`. The `base_path` is used to store the pipeline artifacts and the data generated by the components.

In [None]:
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="ingestion-pipeline",  # Add a unique pipeline name to easily track your progress and data
    pipeline_description="Pipeline to prepare and process data for building a RAG solution",
    base_path="./data-dir",  # The demo pipeline uses a local directory to store the data.
)

For demonstration purposes, we will use a dataset available on the Hugging Face Hub. To load it, we use the reusable Fondant component `load_from_hf_hub`. The `load_from_hf_hub` component is generic, which means that we still need to customize its component specification file: we have to modify the dataframe schema defined in the `produces` section of the component.

To achieve this, we can create a `fondant_component.yaml` file in the directory `components/load_from_hf_hub` with the following content. Note that the directory must already exist for `%%writefile` to succeed.

In [None]:
%%writefile components/load_from_hf_hub/fondant_component.yaml
name: Load from huggingface hub
description: Component that loads a dataset from huggingface hub
image: fndnt/load_from_hf_hub:0.6.2

produces:
  text:
    fields:
      data:
        type: string

args:
  dataset_name:
    description: Name of dataset on the hub
    type: str
  column_name_mapping:
    description: Mapping of the consumed hub dataset to fondant column names
    type: dict
    default: {}
  image_column_names:
    description: Optional argument, a list containing the original image column names in case the 
      dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
    type: list
    default: []
  n_rows_to_load:
    description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
    type: int
    default: None
  index_column:
    description: Column to set index to in the load component, if not specified a default globally unique index will be set
    type: str
    default: None


Afterwards, we can initialize the component and add it to our pipeline.

In [None]:
load_from_hf_hub = ComponentOp(
    component_dir="components/load_from_hf_hub",
    arguments={
        # Add arguments
        "dataset_name": "wikitext@~parquet",
        # Define the column mapping between the huggingface dataset and the Fondant dataframe
        "column_name_mapping": {
            "text": "text_data"
        },
        "n_rows_to_load": 10
    }
)

pipeline.add_op(load_from_hf_hub)
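
To make the `column_name_mapping` argument more concrete, here is a minimal sketch of the renaming it describes, written in plain Python. It only mimics the idea and is not Fondant's actual implementation. The helper `rename_columns` and the sample row are hypothetical. The target name `text_data` appears to combine the `text` subset and the `data` field declared in the `produces` section of the component specification above, but treat that reading as an assumption.

```python
# Illustrative only: mimics the renaming, not Fondant's internal code.
column_name_mapping = {"text": "text_data"}


def rename_columns(row: dict, mapping: dict) -> dict:
    """Return a copy of `row` with its keys renamed according to `mapping`.

    Keys that do not appear in `mapping` are kept unchanged.
    """
    return {mapping.get(name, name): value for name, value in row.items()}


# Hypothetical sample row, shaped like a record of the hub dataset.
hub_row = {"text": "Fondant is a data framework."}
fondant_row = rename_columns(hub_row, column_name_mapping)
print(fondant_row)
```

If the mapping does not match the schema in the component specification, the downstream components would likely not find the column they expect, so it is worth keeping the two in sync.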

Our pipeline now consists of a single component that loads the dataset from the Hugging Face Hub. We can proceed to add the other components. All of them are reusable components, so we can initialize them with the `ComponentOp.from_registry(...)` method.

In [None]:
chunk_text_op = ComponentOp.from_registry(
    name="chunk_text",
    arguments={
        "chunk_size": 512,
        "chunk_overlap": 32,
    }
)

embed_text_op = ComponentOp.from_registry(
    name="embed_text",
    arguments={
        "model_provider": "huggingface",
        "model": "all-MiniLM-L6-v2",
    }
)

index_weaviate_op = ComponentOp.from_registry(
    name="index_weaviate",
    arguments={
        "weaviate_url": "http://host.docker.internal:8080",
        "class_name": "index",  # Add a unique class name to show up on the leaderboard
    }
)
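
To illustrate what `chunk_size` and `chunk_overlap` control, the following sketch splits a string with a sliding window. It is a simplified, character-based illustration under our own assumptions: the real `chunk_text` component may split differently (for example on separators), and the function `sliding_window_chunks` is a hypothetical helper, not part of Fondant.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters, so content cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small values make the overlap visible; the pipeline above uses 512 and 32.
example_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

With these toy values, each chunk starts with the last character of the previous one. A larger overlap keeps more shared context between chunks at the cost of storing and embedding more text.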

Now we can add the components to our pipeline. Note the `dependencies` argument: it defines the order of the pipeline steps, so that each component runs after the component it depends on.

In [None]:
pipeline.add_op(chunk_text_op, dependencies=load_from_hf_hub)
pipeline.add_op(embed_text_op, dependencies=chunk_text_op)
pipeline.add_op(index_weaviate_op, dependencies=embed_text_op)
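
The dependencies declared above form a small directed graph. As a rough sketch of how such a graph determines an execution order, the snippet below uses Python's standard `graphlib` module (Python 3.9+). It does not use Fondant itself, and the `dependencies` dictionary is our own restatement of the `add_op` calls, keyed by component name.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps it depends on.
dependencies = {
    "load_from_hf_hub": set(),
    "chunk_text": {"load_from_hf_hub"},
    "embed_text": {"chunk_text"},
    "index_weaviate": {"embed_text"},
}

# A topological sort yields every step after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because this pipeline is a linear chain, there is only one valid order: load, chunk, embed, index.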

## Execute the pipeline

The pipeline will load and process text data, then ingest the processed data into a vector database. Before executing the pipeline, we need to start the Weaviate database. Otherwise, the pipeline execution will fail.

To do this, we can utilize the Docker setup provided in the `weaviate` folder.

In [None]:
!docker compose -f weaviate/docker-compose.yaml up --detach
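
Since the pipeline fails if Weaviate is not running, it can help to check that the database is reachable first. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) with the standard library. The helper `weaviate_is_ready` is hypothetical, and the default URL `http://localhost:8080` assumes that the provided Docker Compose file publishes Weaviate on port 8080 of the host. From inside the pipeline containers, the same instance is addressed as `http://host.docker.internal:8080`, as configured above.

```python
import urllib.request


def weaviate_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if the Weaviate readiness endpoint answers with HTTP 200.

    Connection errors, HTTP errors, timeouts, and malformed URLs are all
    treated as "not ready".
    """
    url = f"{base_url}/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for malformed URLs.
        return False


if __name__ == "__main__":
    print("Weaviate ready:", weaviate_is_ready())
```

If this prints `False` right after starting the containers, waiting a few seconds and retrying is usually enough, since the database needs a moment to start up.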

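Finally, we can execute our pipeline. Fondant provides various runners, and in this case we are using the `DockerRunner`, which uses Docker Compose under the hood: the `DockerCompiler` first compiles the pipeline to a `docker-compose.yaml` file, which the runner then executes.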

In [None]:
from fondant.compiler import DockerCompiler
from fondant.runner import DockerRunner

DockerCompiler().compile(pipeline, output_path="docker-compose.yaml")
DockerRunner().run("docker-compose.yaml")

## Exploring the dataset

You can also explore the dataset using the Fondant explorer, which lets you visualize your output dataset at each component step.

In [None]:
from fondant.explorer import run_explorer_app

run_explorer_app(
    base_path="./data-dir",
    container="fndnt/data_explorer",
    tag="latest",
    port=8501
)