## 🍫 Build and Evaluate a RAG Pipeline with Fondant

> ⚠️ Please note that this notebook **is not** compatible with **Google Colab**. To complete the tutorial, you must
> start Docker containers, which is not supported within Google Colab.

This repository demonstrates a complete RAG pipeline. 

The RAG pipeline consists of two Fondant pipelines: an indexing pipeline that ingests text data into a vector database, and an evaluation pipeline that performs retrieval on the vector database and evaluates the results. The resulting scores are then displayed so you can judge the performance of the chosen RAG configuration.

Additionally, we provide a Docker Compose setup for Weaviate, enabling local testing and
development, and an evaluation dataset containing questions over the first 1000 rows of the [loaded text data](https://huggingface.co/datasets/wikitext).

> ⚠️ Pay attention to the "TODO"s in the notebook as they require your input to run the pipelines.

## Why Fondant for RAG?

At first glance, this notebook may look like yet another basic RAG pipeline notebook. So what is different here? Fondant!

Fondant helps you reach the best RAG configuration for your use case quickly. Finding that configuration takes several runs with different parameters (chunk size, embedding model, retrieval strategy, data processing steps, and so on). A single run can already take a while, so 10 or 50 runs add up fast.

Fondant enables you to:
- Easily perform several runs, where distributed processing and caching will be automatically managed to run the pipelines as efficiently as possible.
- Explore each pipeline configuration
- Compare them with different metrics
- Finally, choose your best RAG configuration, just like hyperparameter tuning!

This notebook is a first look at Fondant's capabilities in a RAG context, and it can certainly be extended as RAG keeps evolving with new techniques!

> 💡 If you want to do a Grid-Search over different RAG configurations, take a look at our `grid_search` notebook!
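
To make the idea of "several runs with different parameters" concrete, here is a minimal sketch of how a set of configurations could be enumerated before running one pipeline per configuration. The `search_space` values and the `expand_grid` helper are illustrative assumptions, not part of Fondant or of the `grid_search` notebook.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations.
search_space = {
    "chunk_size": [256, 512],
    "chunk_overlap": [16, 32],
    "top_k": [2, 3],
}


def expand_grid(space):
    """Return one dict per combination of the parameter values in `space`."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


configurations = expand_grid(search_space)
print(len(configurations))  # 2 * 2 * 2 = 8 combinations
print(configurations[0])
```

Each resulting dictionary could then be passed as keyword arguments when creating the indexing and evaluation pipelines.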

## Environment

**Check if docker compose is installed and the docker daemon is running**

In [None]:
# installation
!docker compose version >/dev/null
!docker info >/dev/null
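
The shell commands above print nothing when everything is fine. If you prefer an explicit status, the same two checks can be sketched in Python. The `command_succeeds` helper is a hypothetical name introduced here; it only wraps the two commands already used above.

```python
import shutil
import subprocess


def command_succeeds(args, timeout=30):
    """Return True if the executable exists and the command exits with status 0."""
    if shutil.which(args[0]) is None:
        return False
    try:
        result = subprocess.run(
            args,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


checks = {
    "docker compose": command_succeeds(["docker", "compose", "version"]),
    "docker daemon": command_succeeds(["docker", "info"]),
}
for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'NOT available'}")
```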

**Install Fondant**

In [None]:
!pip install -r ../requirements.txt

**Start the Weaviate vector database**

In [None]:
# If you are using a MacBook with an M1 processor, set the Docker default platform to linux/amd64
import os
os.environ["DOCKER_DEFAULT_PLATFORM"]="linux/amd64"

In [None]:
# Run Weaviate with Docker compose
!docker compose -f weaviate/docker-compose.yaml up --detach

In [None]:
# Make sure the vectorDB is running and accessible
import weaviate
local_weaviate_client = weaviate.Client("http://localhost:8080")
local_weaviate_client.schema.get()
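
Because the containers are started with `--detach`, the database may not accept connections right away, and the check above can fail on the first try. A minimal polling sketch is shown below. `wait_until` is a hypothetical helper, and the `check` callable is whatever readiness test you choose (for instance a call to the client's readiness method, if your client version provides one).

```python
import time


def wait_until(check, timeout=60.0, interval=2.0):
    """Call `check()` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors (e.g. connection refused) as "not ready yet" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check that becomes ready on the third call.
attempts = {"count": 0}


def fake_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(fake_check, timeout=5, interval=0.01))
```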

## Run the RAG Pipeline and Evaluate

### Pipeline overview

The RAG pipeline is divided into two pipelines:

- **Indexing Pipeline**: This pipeline processes text data and loads it in an indexed vector database.
- **Evaluation Pipeline**: This pipeline performs retrieval from the vector database (based on a set of queries) and evaluates the results using [RAGAS](https://docs.ragas.io/en/latest/index.html), a RAG evaluation framework.

**Import the pipeline creators and the pipeline runner**

In [None]:
from fondant.pipeline.runner import DockerRunner
import pipeline_index, pipeline_eval

### Set up RAG parameters and run the pipelines

**Set up shared parameters**

The indexing and evaluation pipelines share the parameters set up below.

In [None]:
# Get the host IP address so that the vector database can be reached from inside a Docker container
import socket

def get_host_ip():
    """Return the IP address of the host's outbound interface, or None on failure."""
    try:
        # Connecting a UDP socket sends no packets; it only makes the OS pick
        # the local interface that would be used to reach the given address.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))
            return s.getsockname()[0]
    except OSError as e:
        print(f"Error while retrieving host IP address: {e}")
        return None

# Example usage
host_ip = get_host_ip()
print(f"Host IP address: {host_ip}")

In [None]:
fixed_args = {
    "pipeline_dir":"./data-dir",
    "embed_model_provider":"huggingface",
    "embed_model":"all-MiniLM-L6-v2",
    "weaviate_url":f"http://{host_ip}:8080", # IP address 
    "weaviate_class_name":"Pipeline1",
}
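
Note that `get_host_ip` can return `None`, in which case the URL above silently becomes `http://None:8080`. A small guard, sketched here with a hypothetical `build_weaviate_url` helper, fails early instead:

```python
def build_weaviate_url(host_ip, port=8080):
    """Return the Weaviate URL, failing early if the host IP is missing."""
    if not host_ip:
        raise ValueError("Host IP could not be determined; set it manually.")
    return f"http://{host_ip}:{port}"


# The address below is only an example value.
print(build_weaviate_url("192.168.1.10"))  # http://192.168.1.10:8080
```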

**Set up and run the indexing pipeline**

The Indexing pipeline created can be found in `pipeline_index.py`. It is composed of 4 steps/components:

- [Data Loading](https://github.com/ml6team/fondant/tree/main/components/load_from_parquet): The pipeline begins by loading text data from a Parquet file, which serves as the source for subsequent processing. For the minimal example we are using a dataset from Huggingface.
- [Text Chunking](https://github.com/ml6team/fondant/tree/main/components/chunk_text): Text data is chunked into manageable sections to prepare it for embedding. This step is crucial for performant RAG systems.
- [Text Embedding](https://github.com/ml6team/fondant/tree/main/components/embed_text): We are using a small HuggingFace model for the generation of text embeddings. The `embed_text` component easily allows the usage of different models as well. It can be modified in the `fixed_args` dictionary above.
- [Write to Weaviate](https://github.com/ml6team/fondant/tree/main/components/index_weaviate): The final step of the pipeline involves writing the embedded text data to a Weaviate database.
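
To give an intuition for the chunking step and its two parameters, here is a simplified character-based sliding window. This is only a sketch under the assumption that `chunk_size` and `chunk_overlap` are counted in characters; the actual `chunk_text` component may split text differently (for example on separators), and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

A larger overlap keeps more context shared between neighbouring chunks at the cost of storing and embedding more text.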

The arguments of the indexing pipeline are specified below and can be modified. Keep in mind that changing the dataset also requires changing the evaluation dataset used in the evaluation pipeline.

In [None]:
indexing_args = {
    "hf_dataset_name":"wikitext@~parquet",
    "data_column_name":"text",
    "n_rows_to_load":1000,
    "chunk_size":512,
    "chunk_overlap":32
}

indexing_pipeline = pipeline_index.create_pipeline(**fixed_args, **indexing_args)

# Equivalent call with every argument spelled out:
# indexing_pipeline = pipeline_index.create_pipeline(
#     pipeline_dir="./data-dir",
#     embed_model_provider="huggingface",
#     embed_model="all-MiniLM-L6-v2",
#     weaviate_url=f"http://{host_ip}:8080", # IP address
#     weaviate_class_name="Pipeline1",
#     hf_dataset_name="wikitext@~parquet",
#     data_column_name="text",
#     n_rows_to_load=1000,
#     chunk_size=512,
#     chunk_overlap=32
# )

In [None]:
def run_indexing_pipeline(runner, index_pipeline, host_ip, weaviate_class_name):
    runner.run(index_pipeline)
    docker_weaviate_client = weaviate.Client(f"http://{host_ip}:8080")
    return docker_weaviate_client.schema.get(weaviate_class_name)

runner = DockerRunner()
weaviate_class_name = fixed_args["weaviate_class_name"]

run_indexing_pipeline(
    runner=runner,
    index_pipeline=indexing_pipeline,
    host_ip=host_ip,
    weaviate_class_name=weaviate_class_name
)

**Set up and run the evaluation pipeline**

The Evaluation Pipeline created can be found in `pipeline_eval.py`. It is composed of 5 steps/components:

- **Evaluation Data Loading**: The pipeline begins by loading the evaluation dataset from a CSV file. The dataset contains a set of questions over the HuggingFace loaded dataset that will be used for the retrieval and the evaluation. 
- **Text Embedding**: The questions' embeddings are generated using the same model as for the HuggingFace loaded dataset.
- **Retrieve Relevant Documents**: For each question embedding, the documents most relevant to answering it are retrieved from the vector database.
- **Evaluate the RAG configuration**: This component evaluates the RAG pipeline with metrics from the [RAGAS](https://docs.ragas.io/en/latest/index.html) evaluation framework. The metrics computed are the [precision](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html) and the [relevancy](https://docs.ragas.io/en/latest/concepts/metrics/context_relevancy.html) of the retrieved context.
- **Aggregate Evaluation Scores**: This last component aggregates the scores over all questions for each metric computed.
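
The retrieval step can be pictured as ranking stored chunks by their similarity to the question embedding and keeping the `top_k` best. The sketch below does this with a brute-force cosine similarity over toy two-dimensional vectors; a vector database answers the same question with an index rather than a full scan. All names and data here are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_embedding, documents, top_k):
    """Return the texts of the `top_k` documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:top_k]]


# Toy documents with made-up embeddings.
documents = [
    {"text": "chunk about cats", "embedding": [1.0, 0.0]},
    {"text": "chunk about dogs", "embedding": [0.8, 0.6]},
    {"text": "chunk about cars", "embedding": [0.0, 1.0]},
]
print(retrieve_top_k([1.0, 0.1], documents, top_k=2))
# ['chunk about cats', 'chunk about dogs']
```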

The arguments of the evaluation pipeline are specified below and can be modified. Keep in mind that if you changed the dataset loaded in the indexing pipeline, you must change the evaluation dataset as well.

In [None]:
evaluation_args = {
    "csv_dataset_uri":"/data/wikitext_1000_q.csv",
    "csv_column_separator":";",
    "question_column_name":"question",
    "top_k":3,
    "module": "langchain.llms",
    "llm_name":"OpenAI",
    "llm_kwargs":{"openai_api_key": ""}, #TODO Specify your key if you're using OpenAI
    "metrics":["context_precision", "context_relevancy"]
}

evaluation_pipeline = pipeline_eval.create_pipeline(**fixed_args, **evaluation_args)

# Equivalent call with every argument spelled out:
# evaluation_pipeline = pipeline_eval.create_pipeline(
#     pipeline_dir="./data-dir",
#     embed_model_provider="huggingface",
#     embed_model="all-MiniLM-L6-v2",
#     weaviate_url=f"http://{host_ip}:8080", # IP address
#     weaviate_class_name="Pipeline1",
#     csv_dataset_uri="/data/wikitext_1000_q.csv", # make sure it matches the mounted file
#     csv_column_separator=";",
#     question_column_name="question",
#     top_k=3,
#     module="langchain.llms",
#     llm_name="OpenAI",
#     llm_kwargs={"openai_api_key": ""},
#     metrics=["context_precision", "context_relevancy"]
# )


In [None]:
def run_evaluation_pipeline(runner, eval_pipeline, extra_volumes):
    runner.run(input=eval_pipeline, extra_volumes=extra_volumes)

runner = DockerRunner()
local_folder_absolute_path = "fondant-usecase-RAG/src/local_file" #TODO Replace with absolute path
extra_volumes = [f"{local_folder_absolute_path}:/data"]

run_evaluation_pipeline(
    runner=runner,
    eval_pipeline=evaluation_pipeline,
    extra_volumes=extra_volumes
)
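
Docker volume mounts expect an absolute host path, which is why the TODO above asks for one. If you would rather derive it from a relative path, a sketch with `pathlib` could look like this. `make_volume` is a hypothetical helper, and the `./local_file` folder name is only an example; the helper does not check that the folder exists.

```python
from pathlib import Path


def make_volume(local_folder, container_path="/data"):
    """Return a 'host_path:container_path' string with an absolute host path."""
    host_path = Path(local_folder).expanduser().resolve()
    return f"{host_path}:{container_path}"


# Resolved relative to the current working directory.
print(make_volume("./local_file"))
```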

## Exploring the dataset

You can explore your results using the Fondant explorer, which lets you visualize the output dataset at each component step. It might take a while to start the first time, as it needs to download the explorer Docker image.

Enjoy the exploration! 🍫 

In [None]:
from fondant.explore import run_explorer_app

run_explorer_app(base_path=fixed_args["pipeline_dir"])

**Read Latest Evaluated Pipeline Score**

You can also read the latest dataset containing the results of the scoring of your RAG pipeline. 

In [None]:
# Read latest chosen component
import os
from datetime import datetime

import pandas as pd


def read_latest_data(base_path: str, pipeline_name: str, component_name: str):
    # Specify the path to the 'data' directory
    data_directory = f"{base_path}/{pipeline_name}"

    # Get a list of all subdirectories in the 'data' directory
    subdirectories = [
        d
        for d in os.listdir(data_directory)
        if os.path.isdir(os.path.join(data_directory, d))
    ]

    # keep pipeline directories
    valid_entries = [
        entry for entry in subdirectories if entry.startswith(pipeline_name)
    ]
    # keep pipeline folders containing a parquet file in the component folder
    valid_entries = [
        folder
        for folder in valid_entries
        if has_parquet_file(data_directory, folder, component_name)
    ]
    # nothing to read if no run produced output for this component
    if not valid_entries:
        return None
    # keep the latest folder
    latest_folder = max(valid_entries, key=extract_timestamp)

    # If a valid folder is found, proceed to read all Parquet files in the component folder
    if latest_folder:
        # Find the path to the component folder
        component_folder = os.path.join(data_directory, latest_folder, component_name)

        # Get a list of all Parquet files in the component folder
        parquet_files = [
            f for f in os.listdir(component_folder) if f.endswith(".parquet")
        ]

        if parquet_files:
            # Read all Parquet files and concatenate them into a single DataFrame
            dfs = [
                pd.read_parquet(os.path.join(component_folder, file))
                for file in parquet_files
            ]
            return pd.concat(dfs, ignore_index=True)
        return None
    return None


def has_parquet_file(data_directory, entry, component_name):
    component_folder = os.path.join(data_directory, entry, component_name)
    # Check if the component exists
    if not os.path.exists(component_folder) or not os.path.isdir(component_folder):
        return False
    parquet_files = [
        file for file in os.listdir(component_folder) if file.endswith(".parquet")
    ]
    return bool(parquet_files)


def extract_timestamp(folder_name):
    # Extract the timestamp part from the folder name
    timestamp_str = folder_name.split("-")[-1]
    # Convert the timestamp string to a datetime object
    return datetime.strptime(timestamp_str, "%Y%m%d%H%M%S")
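
The helper above implies that run folders end in a `YYYYmmddHHMMSS` timestamp after the last hyphen. The short sketch below illustrates how the latest run is selected under that assumption; the folder names are made up for the example, and `extract_timestamp` is repeated so the snippet runs on its own.

```python
from datetime import datetime


def extract_timestamp(folder_name):
    """Parse the trailing YYYYmmddHHMMSS timestamp of a run folder name."""
    return datetime.strptime(folder_name.split("-")[-1], "%Y%m%d%H%M%S")


# Hypothetical run folders for a pipeline named "evaluation-pipeline".
folders = [
    "evaluation-pipeline-20240101093000",
    "evaluation-pipeline-20240315180000",
    "evaluation-pipeline-20231224235959",
]
latest = max(folders, key=extract_timestamp)
print(latest)  # evaluation-pipeline-20240315180000
```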

**Read aggregated results**

In [None]:
pipeline_dir = "./data-dir"
pipeline_name = "evaluation-pipeline"
component_name = "aggregate_eval_results"

read_latest_data(
    base_path=pipeline_dir,
    pipeline_name=pipeline_name,
    component_name=component_name,
)
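
The table read above holds the scores aggregated over all questions for each metric. As an illustration of what such an aggregation means, the sketch below averages per-question scores, assuming a plain mean per metric; the actual component may aggregate differently, and the numbers are invented.

```python
from statistics import mean

# Invented per-question scores for the two metrics used in this notebook.
per_question_scores = [
    {"context_precision": 1.0, "context_relevancy": 0.5},
    {"context_precision": 0.5, "context_relevancy": 0.25},
    {"context_precision": 0.0, "context_relevancy": 0.75},
]


def aggregate_scores(rows):
    """Return the mean of every metric over all rows."""
    metrics = rows[0].keys()
    return {metric: mean(row[metric] for row in rows) for metric in metrics}


print(aggregate_scores(per_question_scores))
# {'context_precision': 0.5, 'context_relevancy': 0.5}
```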

## Clean up your environment

After your pipelines have run successfully, clean up your environment and stop the Weaviate database.

In [None]:
!docker compose -f weaviate/docker-compose.yaml down