## 🍫 Building a RAG indexing pipeline with Fondant

> ⚠️ Please note that this notebook **is not** compatible with **Google Colab**. To complete the tutorial, you must
> start Docker containers, which is not supported within Google Colab.

This repository demonstrates a Fondant data pipeline that ingests text
data into a vector database. The pipeline uses four reusable Fondant components.  
Additionally, we provide a Docker Compose setup for Weaviate, enabling local testing and
development.

### Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable
components to load, chunk, and embed text, and to ingest the resulting embeddings into a vector
database.

Pipeline steps:

- [Data Loading](https://github.com/ml6team/fondant/tree/main/components/load_from_parquet): The
  pipeline begins by loading text data, which serves as the source for subsequent processing.
  The linked component reads from a Parquet file. For this minimal example, we load a dataset
  from the Hugging Face Hub with the `load_from_hf_hub` component.
- [Text Chunking](https://github.com/ml6team/fondant/tree/main/components/chunk_text): Text data is
  chunked into manageable sections to prepare it for embedding. This step is crucial for
  performant RAG systems.
- [Text Embedding](https://github.com/ml6team/fondant/tree/main/components/embed_text): We use
  a small Hugging Face model to generate the text embeddings.
  The `embed_text` component also makes it easy to use different models.
- [Write to Weaviate](https://github.com/ml6team/fondant/tree/main/components/index_weaviate): The
  final step of the pipeline involves writing the embedded text data to
  a Weaviate database.

## Environment

This section checks the prerequisites of your environment. Read any errors or warnings carefully.

**Ensure a Python version between 3.8 and 3.10 is available**

In [None]:
import sys
if sys.version_info < (3, 8, 0) or sys.version_info >= (3, 11, 0):
    raise Exception(f"A Python version between 3.8 and 3.10 is required. You are running {sys.version}")

**Check if docker compose is installed and the docker daemon is running**

In [None]:
!docker compose version >/dev/null
!docker info >/dev/null

**Check if GPU is available**

In [3]:
import logging
import subprocess

try:
    subprocess.check_output('nvidia-smi')
    logging.info("Found GPU, using it!")
    number_of_accelerators = 1
    accelerator_name = "GPU"
except Exception:
    logging.warning("We recommend running this pipeline on a GPU, but none could be found. Using CPU instead.")
    number_of_accelerators = None
    accelerator_name = None



**Install Fondant**

In [4]:
!pip install -r ../requirements.txt


## Implement the pipeline

First, we need to initialize the pipeline. This includes specifying a name for the pipeline, providing a description, and setting a `base_path`. The `base_path` is used to store the pipeline artifacts and the data generated by the components.

In [8]:
from pathlib import Path
from fondant.pipeline import Pipeline, Resources

BASE_PATH = "./data-dir"
Path(BASE_PATH).mkdir(parents=True, exist_ok=True)

pipeline = Pipeline(
    name="ingestion-pipeline",  # Add a unique pipeline name to easily track your progress and data
    description="Pipeline to prepare and process data for building a RAG solution",
    base_path=BASE_PATH,  # The demo pipeline uses a local directory to store the data.
)

For demonstration purposes, we will use a dataset available on Hugging Face, which we load with the reusable Fondant component `load_from_hf_hub`. Note that the `load_from_hf_hub` component does not define a fixed schema for the data it produces, which means we need to provide it ourselves with the `produces` argument. It takes a mapping from field names to `pyarrow` types.

In [6]:
import pyarrow as pa

text = pipeline.read(
    "load_from_hf_hub",
    arguments={
        # Add arguments
        "dataset_name": "wikitext@~parquet",
        "column_name_mapping": {"text": "text"},
        "n_rows_to_load": 1000,
    },
    produces={
        "text": pa.string()
    }
)

This method doesn't execute the component yet. Instead, it adds the component to the execution graph of the pipeline and returns a lazy `Dataset` instance. We can now chain additional components from the [Fondant Hub](https://fondant.ai/en/latest/components/hub/) using `Dataset.apply()`.

In [9]:
chunks = text.apply(
    "chunk_text",
    arguments={
        "chunk_size": 512,
        "chunk_overlap": 32,
    }
)

embeddings = chunks.apply(
    "embed_text",
    arguments={
        "model_provider": "huggingface",
        "model": "all-MiniLM-L6-v2"
    },
    resources=Resources(
        accelerator_number=number_of_accelerators,
        accelerator_name=accelerator_name,
    ),
)

embeddings.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://host.docker.internal:8080",
        "class_name": "index",
    }
)

Our pipeline now looks as follows:

`load_from_hf_hub` -> `chunk_text` -> `embed_text` -> `index_weaviate`
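
To make the `chunk_size` and `chunk_overlap` arguments more concrete, here is a minimal character-based sketch of overlapping chunking. The helper `chunk_with_overlap` is hypothetical and written only for illustration. It is not the implementation of the `chunk_text` component, which may split on different boundaries.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is an
    illustrative helper, not part of Fondant.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values used in this pipeline (`chunk_size=512`, `chunk_overlap=32`), a sliding window like this would advance 480 characters at a time, so neighbouring chunks share some context at their boundaries.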

## Running the pipeline

The pipeline will load and process text data, then ingest the processed data into a vector database. Before executing the pipeline, we need to start the Weaviate database. Otherwise the pipeline execution will fail.

To do this, we can utilize the Docker setup provided in the `weaviate` folder.

In [None]:
# If you are using a MacBook with an M1 processor, make sure to set the Docker default platform to linux/amd64
import os
os.environ["DOCKER_DEFAULT_PLATFORM"]="linux/amd64"

In [None]:
!docker compose -f weaviate/docker-compose.yaml up --detach
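
Because the pipeline fails if Weaviate is not reachable, it can help to wait until the database reports that it is ready before starting the run. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library. It assumes that the Docker Compose setup publishes Weaviate on `http://localhost:8080` on the host. The helper names `weaviate_is_ready` and `wait_for_weaviate` are hypothetical.

```python
import time
import urllib.request


def weaviate_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v1/.well-known/ready", timeout=timeout
        ) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error responses.
        return False


def wait_for_weaviate(base_url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the readiness endpoint up to `attempts` times."""
    for _ in range(attempts):
        if weaviate_is_ready(base_url):
            return True
        time.sleep(delay)
    return False


# Example usage (assumed host URL, adjust it to your setup):
# wait_for_weaviate("http://localhost:8080")
```

Note that this URL is the one seen from the notebook host. The components themselves run inside containers and reach the database through `host.docker.internal`, as configured in the `index_weaviate` arguments.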

Finally, we can execute our pipeline.
Fondant provides multiple runners to run our pipeline:

- A Docker runner for local execution
- A Vertex AI runner for managed execution on Google Cloud
- A Sagemaker runner for managed execution on AWS
- A Kubeflow Pipelines runner for execution anywhere

Here we will use the `DockerRunner` for local execution, which uses Docker Compose under the hood.

The runner will download the reusable components from the component hub. Afterwards, you will see the components execute one by one.

In [None]:
from fondant.pipeline.runner import DockerRunner

DockerRunner().run(pipeline)

## Exploring the dataset

You can also explore the dataset using the Fondant explorer, which lets you visualize the output dataset at each component step. It might take a while to start the first time, as it needs to download the explorer Docker image first.

In [None]:
from fondant.explore import run_explorer_app

run_explorer_app(base_path=BASE_PATH)

To stop the Explorer and continue the notebook, press the stop button at the top of the notebook.

## Create your own component

You can also create your own custom components and use them in the pipeline. Let's build a component that cleans our text articles. For demo purposes, we will implement a component that removes all empty lines.

To implement a custom component, a couple of files need to be defined:

- Fondant component specification
- main.py script in a src folder
- Dockerfile
- requirements.txt

If you want to learn more about creating custom components, check out [our documentation](https://fondant.ai/en/latest/components/custom_component/).


### Component specification

The component specification is represented by a single `fondant_component.yaml` file. There you can define which fields your component consumes and produces. 

In [None]:
%%writefile components/text_cleaning/fondant_component.yaml
name: Text cleaning component
description: Clean text passages
image: ghcr.io/ml6team/text_cleaning:dev

consumes:
  text:
    type: string

produces:
  text:
    type: string

### Main.py script

The core logic of the component should be implemented in a `main.py` script in a folder called `src`. We can implement the text cleaning logic as a class. We will inherit from the base class `PandasTransformComponent`. The `PandasTransformComponent` operates on pandas dataframes. 

In [None]:
%%writefile components/text_cleaning/src/main.py
import pandas as pd
from fondant.component import PandasTransformComponent


class TextCleaningComponent(PandasTransformComponent):
    def __init__(self, **kwargs):
        """Initialize your component"""

    def remove_empty_lines(self, text):
        lines = text.split("\n")
        non_empty_lines = [line.strip() for line in lines if line.strip()]
        return "\n".join(non_empty_lines)

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["text"] = dataframe["text"].apply(
            self.remove_empty_lines
        )
        return dataframe
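
To see what the cleaning logic does without building the component, the same function can be tried on a plain string. This is a standalone copy of `remove_empty_lines` for illustration only, and the sample text is made up.

```python
def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    lines = text.split("\n")
    # Each kept line is also stripped of leading and trailing whitespace.
    non_empty_lines = [line.strip() for line in lines if line.strip()]
    return "\n".join(non_empty_lines)


sample = "First paragraph\n\n   \n  Second paragraph  \nThird"
cleaned = remove_empty_lines(sample)
print(repr(cleaned))  # 'First paragraph\nSecond paragraph\nThird'
```

Note that, besides removing empty lines, the function also strips surrounding whitespace from every remaining line, so indentation in the source text is not preserved.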

### Dockerfile

The Dockerfile defines how to build the component into a Docker image. You can use the following:

In [None]:
%%writefile components/text_cleaning/Dockerfile
FROM --platform=linux/amd64 python:3.8-slim

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["fondant", "execute", "main"]

### Requirements.txt

In `requirements.txt`, we define all dependencies of the component.

In [None]:
%%writefile components/text_cleaning/requirements.txt
fondant[component]==0.8.dev4

### Add the new component to the pipeline

Now we can add the new component to the pipeline with the `Dataset.apply` function. We just specify the path to the directory containing the custom component instead of the name of the reusable component.

In [None]:
import pyarrow as pa
from fondant.pipeline import Pipeline


pipeline = Pipeline(
    name="ingestion-pipeline",
    description="Pipeline to prepare and process data for building a RAG solution",
    base_path=BASE_PATH,  # The demo pipeline uses a local directory to store the data.
)

text = pipeline.read(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "wikitext@~parquet",
        "column_name_mapping": {"text": "text"},
        "n_rows_to_load": 1000,
    },
    produces={
        "text": pa.string()
    }
)

cleaned_text = text.apply(
    "components/text_cleaning",  # Path to custom component
)

chunks = cleaned_text.apply(
    "chunk_text",
    arguments={
        "chunk_size": 512,
        "chunk_overlap": 32,
    },
)

embeddings = chunks.apply(
    "embed_text",
    arguments={
        "model_provider": "huggingface",
        "model": "all-MiniLM-L6-v2",
    },
)

embeddings.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://host.docker.internal:8080",
        "class_name": "index",
    },
)

If you now run your pipeline, the new changes will be picked up and Fondant will automatically re-build the component with the changes included.

In [None]:
DockerRunner().run(pipeline)

If you check the logs, you will see th

If you restart the Explorer, you'll see that you can now select a second pipeline and inspect your new dataset.

In [None]:
run_explorer_app(base_path=BASE_PATH)

## Clean up your environment

After your pipeline has run successfully, you should clean up your environment and stop the Weaviate database.

In [None]:
!docker compose -f weaviate/docker-compose.yaml down

## Scaling up
If you're happy with your dataset, it's time to scale up. Check [our documentation](https://fondant.ai/en/latest/pipeline/#compiling-and-running-a-pipeline) for more information about the available runners.

