# 🍫 Auto-tune your RAG data pipeline using parameter search

> ⚠️ This notebook can be run on your local machine or on a virtual machine and requires [Docker Compose](https://docs.docker.com/desktop/).
> Please note that it is **not compatible with Google Colab** as the latter does not support Docker.

In this notebook we demonstrate **how to perform parameter search and automatically tune a Retrieval-Augmented Generation (RAG)** system using [Fondant](https://fondant.ai).

We will:

1. Set up an environment and a [Weaviate](https://weaviate.io/platform) Vector Store
2. Define the sets of parameters that should be tried
3. Run the parameter search which automatically:
    * Runs an indexing pipeline for each combination of parameters to be tested
    * Runs an evaluation pipeline for each index
    * Collects results
4. Explore results

<div align="center">
<img src="../art/iteration.png" width="900"/>
</div>

We will use [**Fondant**](https://fondant.ai), **a hub and framework for easy and shareable data processing**, which has the following advantages for RAG evaluation:

- **Speed**
    - Reusable RAG components from the [Fondant Hub](https://fondant.ai/en/latest/components/hub/) to quickly build RAG pipelines
    - [Pipeline caching](https://fondant.ai/en/latest/caching/) to speed up subsequent runs
    - Parallel processing out of the box to speed up processing of large datasets
- **Ease-of-use**
    - Change parameters and swap [components](https://fondant.ai/en/latest/components/hub/) by changing only a few lines of code
    - Create your own [custom components](https://fondant.ai/en/latest/components/custom_component/) (e.g. with different chunking strategies) and plug them into your pipeline
    - Reuse your processing components in different pipelines and share them with the [community](https://discord.gg/HnTdWhydGp)
- **Production-readiness**
    - Full data lineage and a [data explorer](https://fondant.ai/en/latest/data_explorer/) to check the evolution of data after each step
    - Ready to deploy to (managed) platforms such as _Vertex, SageMaker and Kubeflow_
 
Share your experiences or let us know how we can improve through our [**Discord**](https://discord.gg/HnTdWhydGp) or on [**GitHub**](https://github.com/ml6team/fondant). And of course feel free to give us a [**star ⭐**](https://github.com/ml6team/fondant-usecase-RAG) if you like what we are doing!

## Set up environment

> ⚠️ This section checks the prerequisites of your environment. Read any errors or warnings carefully.

Ensure that a **Python version between 3.8 and 3.10** is available

In [1]:
import sys
if sys.version_info < (3, 8, 0) or sys.version_info >= (3, 11, 0):
    raise Exception(f"A Python version between 3.8 and 3.10 is required. You are running {sys.version}")

Check if **docker compose** is installed and the **docker daemon** is running

In [2]:
!docker compose version
!docker info && echo "Docker running"

Docker Compose version v2.23.0-desktop.1
Client:
 Version:    24.0.6
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2-desktop.5
    Path:     C:\Program Files\Docker\cli-plugins\docker-buildx.exe
  compose: Docker Compose (Docker Inc.)
    Version:  v2.23.0-desktop.1
    Path:     C:\Program Files\Docker\cli-plugins\docker-compose.exe
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.0
    Path:     C:\Program Files\Docker\cli-plugins\docker-dev.exe
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.20
    Path:     C:\Program Files\Docker\cli-plugins\docker-extension.exe
  init: Creates Docker-related starter files for your project (Docker Inc.)
    Version:  v0.1.0-beta.9
    Path:     C:\Program Files\Docker\cli-plugins\docker-init.exe
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     C:\Program Files\Docker\cli



Make sure that **logs** are displayed

In [3]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("test")

INFO:root:test


**Set up Fondant**

In [4]:
!pip install -q -r ../requirements.txt --disable-pip-version-check && echo "Success"

"Success"


## Spin up the Weaviate vector store

> ⚠️ For **Apple M1/M2** chip users:
> 
> - In Docker Desktop Dashboard `Settings -> Features in development`, make sure to **un**check `Use containerd` for pulling and storing images. More info [here](https://docs.docker.com/desktop/settings/mac/#beta-features)
> - Make sure that Docker uses linux/amd64 platform and not arm64 (cell below should take care of that)

In [5]:
import os
os.environ["DOCKER_DEFAULT_PLATFORM"]="linux/amd64"

Run **Weaviate** with Docker compose

In [6]:
!docker compose -f weaviate/docker-compose.yaml up --detach

 Container weaviate-weaviate-1  Running
 Container weaviate-contextionary-1  Running


Make sure you have **Weaviate client v3**

In [7]:
!pip install -q "weaviate-client==3.*" --disable-pip-version-check && echo "Weaviate client installed successfully"

"Weaviate client installed successfully"


Make sure the vectorDB is running and accessible

In [8]:
import logging
import weaviate

try:
    weaviate_client = weaviate.Client("http://localhost:8080")
    logging.info("Connected to Weaviate instance")
except weaviate.WeaviateStartUpError:
    logging.error("Cannot connect to weaviate instance, is it running?")

INFO:root:Connected to Weaviate instance
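
If the containers were started only a moment ago, Weaviate may still be booting when the check above runs. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library. The helper name `weaviate_is_ready` and its retry defaults are illustrative assumptions, not part of this notebook's utilities.

```python
import http.client
import time
import urllib.request


def weaviate_is_ready(base_url="http://localhost:8080", retries=5, delay=2.0, timeout=3.0):
    """Return True once the readiness endpoint answers with HTTP 200, else False."""
    ready_url = f"{base_url}/v1/.well-known/ready"
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(ready_url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or a non-2xx answer: not ready yet.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Calling `weaviate_is_ready()` before creating the client gives the containers a few seconds to come up instead of failing on the first attempt.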


## Parameter search

Parameter search allows you to **try out different configurations of pipelines and compare their performance**

`pipeline_index.py` processes text data and loads it into the vector database

<div align="center">
<img src="../art/indexing_ltr.png" width="800"/>
</div>

- [**Load data**](https://github.com/ml6team/fondant/tree/main/components/load_from_parquet): loads data from the Hugging Face Hub
- [**Chunk data**](https://github.com/ml6team/fondant/tree/main/components/chunk_text): divides the text into sections of a certain size and with a certain overlap
- [**Embed chunks**](https://github.com/ml6team/fondant/tree/main/components/embed_text): embeds each chunk as a vector
- [**Index vector store**](https://github.com/ml6team/fondant/tree/main/components/index_weaviate): writes data and embeddings to the vector store
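
To make the roles of chunk size and overlap concrete, here is a minimal character-based chunker. It is a sketch for illustration only, assuming a plain sliding window; the actual chunk text component may split text differently (for example on separators), and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap produces more chunks for the same text, which increases indexing time and storage. In this sketch the overlap must stay strictly below the chunk size, otherwise the window would never advance.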

`pipeline_eval.py` evaluates retrieval performance using the questions provided in your test dataset

<div align="center">
<img src="../art/evaluation_ltr.png" width="800"/>
</div>

- [**Load eval data**](https://github.com/ml6team/fondant/tree/main/components/load_from_csv): loads the evaluation dataset (questions) from a csv file
- [**Embed questions**](https://github.com/ml6team/fondant/tree/main/components/embed_text): embeds each question as a vector
- [**Query vector store**](https://github.com/ml6team/fondant/tree/main/components/retrieve_from_weaviate): retrieves the most relevant chunks for each question from the vector store
- [**Evaluate**](https://github.com/ml6team/fondant/tree/0.8.0/components/evaluate_ragas): evaluates the retrieved chunks for each question, e.g. using [RAGAS](https://docs.ragas.io/en/latest/index.html)
- [**Aggregate**](https://github.com/ml6team/fondant-usecase-RAG/tree/main/src/components/aggregate_eval_results): calculates aggregated results
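
The retrieval step can be pictured as ranking chunk embeddings by similarity to the question embedding and keeping the top `k`. The sketch below does this by brute force with cosine similarity in plain Python. It is a conceptual stand-in, not how Weaviate works internally (Weaviate uses an approximate nearest-neighbour index), and the names `cosine_similarity` and `retrieve_top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(question_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the question."""
    ranked = sorted(
        chunk_vectors.items(),
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy 2-dimensional "embeddings", for illustration only.
chunk_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(retrieve_top_k([1.0, 0.1], chunk_vectors, k=2))
# ['a', 'c']
```

A higher `k` hands more chunks to the evaluation step, which can surface the relevant passage more often but also adds less relevant context.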

> 💡 This notebook defaults to the first 1000 rows of the [wikitext](https://huggingface.co/datasets/wikitext) dataset for demonstration purposes, but you can load your own dataset using one of the other load components available on the [**Fondant Hub**](https://fondant.ai/en/latest/components/hub/#component-hub) or by creating your own [**custom load component**](https://fondant.ai/en/latest/guides/implement_custom_components/). Keep in mind that changing the dataset implies that you also need to change the [evaluation dataset](evaluation_datasets/wikitext_1000_q.csv) used in the evaluation pipeline.

### Set up parameter search

**Choose parameters over which to search**

- `chunk_size`: size of each text chunk, in number of characters ([chunk text](https://github.com/ml6team/fondant/tree/main/components/chunk_text) component)
- `chunk_overlap`: overlap between chunks ([chunk text](https://github.com/ml6team/fondant/tree/main/components/chunk_text) component)
- `embed_model`: model used to embed ([embed text](https://github.com/ml6team/fondant/tree/main/components/embed_text) component)
- `retrieval_top_k`: number of retrieved chunks taken into account for evaluation ([retrieve](https://github.com/ml6team/fondant/tree/main/components/retrieve_from_weaviate) component)

**Choose a search method**
- [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization): tries all possible combinations
- progressive search (recommended): identifies the best option at each step. It is much more efficient because the number of runs grows linearly with the number of search options, rather than exponentially as with grid search, while achieving similar performance
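
The difference in cost can be sketched in a few lines. The code below assumes that progressive search behaves like a greedy, one-parameter-at-a-time search: start from the first option of every parameter, vary one parameter, keep the best value, then move on. The `ParameterSearch` class used in this notebook may be implemented differently, and `toy_score` is a made-up stand-in for a real evaluation run.

```python
from math import prod


def progressive_search(search_space, evaluate):
    """Greedy one-parameter-at-a-time search; returns (best_params, best_score, n_runs)."""
    best_params = {name: options[0] for name, options in search_space.items()}
    best_score = evaluate(best_params)
    n_runs = 1
    for name, options in search_space.items():
        for option in options[1:]:
            candidate = {**best_params, name: option}
            score = evaluate(candidate)
            n_runs += 1
            if score > best_score:
                best_params, best_score = candidate, score
    return best_params, best_score, n_runs


def grid_search_size(search_space):
    """Number of runs a full grid search needs: the product of all option counts."""
    return prod(len(options) for options in search_space.values())


search_space = {
    "chunk_size": [192, 256, 320],
    "chunk_overlap": [64, 128, 192],
    "embed_model": ["all-MiniLM-L6-v2", "BAAI/bge-base-en-v1.5"],
    "retrieval_top_k": [2, 4, 8],
}


def toy_score(params):
    # Made-up score that peaks at chunk_size=256 and retrieval_top_k=4.
    return -abs(params["chunk_size"] - 256) - abs(params["retrieval_top_k"] - 4)


best_params, best_score, n_runs = progressive_search(search_space, toy_score)
print(n_runs, "runs vs", grid_search_size(search_space), "for grid search")
# 8 runs vs 54 for grid search
```

Under this assumption the greedy search needs 8 runs where the full grid needs 54. The trade-off is that a greedy search can miss combinations whose parameters only work well together.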

In [11]:
searchable_index_params = {
    'chunk_size' : [192, 256, 320],
    'chunk_overlap' : [64, 128, 192],
}
searchable_shared_params = {
    'embed_model' : [("huggingface","all-MiniLM-L6-v2"), ("huggingface", "BAAI/bge-base-en-v1.5")]
}
searchable_eval_params = {
    'retrieval_top_k' : [2, 4, 8]
}

search_method = 'progressive_search' # 'grid_search', 'progressive_search'
target_metric = 'context_precision' # relevant for 'smart' methods that use previous results to determine params, e.g. progressive search

⚠️ If you want to use **ChatGPT for evaluation** you will need an [OpenAI API key](https://platform.openai.com/docs/quickstart) (see TODO below)

In [12]:
# configurable parameters
shared_args = {
    "base_path":"./data", # where data goes
}
index_args = {
    "n_rows_to_load":1000,
}
eval_args = {
    "evaluation_set_path" : "./evaluation_datasets",
    "evaluation_set_filename" : "wikitext_1000_q.csv",
    "evaluation_set_separator" : ";",
    "evaluation_module" : "langchain.llms",
    "evaluation_llm" :"OpenAI",
    "evaluation_llm_kwargs" : {
                              "openai_api_key": os.environ["OPENAI_KEY"], #TODO Specify your key if you're using OpenAI
                              # "model_name" : "gpt-3.5-turbo"
    },
    "evaluation_metrics" : ["context_precision", "context_relevancy"]
}
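
Reading the key with `os.environ["OPENAI_KEY"]` raises a bare `KeyError` when the variable is missing. A small guard such as the one below fails with a clearer message; `require_env` is a hypothetical helper, not part of this notebook's `utils` module.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it before starting the notebook."
        )
    return value
```

You could then pass `require_env("OPENAI_KEY")` as the `openai_api_key` value in `eval_args`.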

### Run parameter search

> 💡 The first time you run a pipeline, you need to **download a docker image for each component** which may take a minute.

> 💡 Use a **GPU** or an external API to speed up the embedding step

> 💡 Steps that have been processed before are **cached** and will be skipped in subsequent runs which speeds up processing.


In [None]:
from utils import ParameterSearch

mysearch = ParameterSearch(
    searchable_index_params = searchable_index_params,
    searchable_shared_params = searchable_shared_params,
    searchable_eval_params = searchable_eval_params,
    shared_args = shared_args,
    index_args = index_args,
    eval_args = eval_args,
    search_method = search_method,
    target_metric = target_metric
)

parameter_search_results = mysearch.run()

## Display Results

**Compare the performance** of your runs below. The default evaluation component uses [Ragas](https://github.com/explodinggradients/ragas) and provides two performance measures: [context precision](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html) and [context relevancy](https://docs.ragas.io/en/latest/concepts/metrics/context_relevancy.html).

In [9]:
parameter_search_results

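
To find the best configuration quickly, you can sort the runs by the target metric. The sketch below assumes that `parameter_search_results` is a pandas DataFrame with one row per run and one column per parameter and metric, as the plotting cells later in this notebook suggest. The rows here are placeholders for illustration, not results from an actual run, and `rank_runs` is a hypothetical helper.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT real evaluation results.
example_results = pd.DataFrame(
    [
        {"chunk_size": 192, "chunk_overlap": 64, "retrieval_top_k": 2, "context_precision": 0.1},
        {"chunk_size": 256, "chunk_overlap": 64, "retrieval_top_k": 4, "context_precision": 0.3},
        {"chunk_size": 320, "chunk_overlap": 128, "retrieval_top_k": 8, "context_precision": 0.2},
    ]
)


def rank_runs(results, metric):
    """Return the runs sorted from best to worst on the given metric."""
    return results.sort_values(metric, ascending=False).reset_index(drop=True)


ranked = rank_runs(example_results, "context_precision")
print(ranked.head(1))  # the best-scoring configuration
```

With real results, `rank_runs(parameter_search_results, target_metric)` would put the best run in the first row.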

## Visualize Results

Make sure **Plotly** is installed

In [None]:
!pip install -q "plotly" --disable-pip-version-check && echo "Plotly installed successfully"

Show legend of **embedding models** used

In [None]:
from utils import add_embed_model_numerical_column, show_legend_embed_models

parameter_search_results = add_embed_model_numerical_column(parameter_search_results)
show_legend_embed_models(parameter_search_results)

**Plot results**

In [None]:
import plotly.express as px

dimensions = ['chunk_size', 'chunk_overlap', 'embed_model_numerical', 'retrieval_top_k', 'context_precision']
fig = px.parallel_coordinates(parameter_search_results, color="context_precision",
                              dimensions=dimensions,
                              color_continuous_scale=px.colors.sequential.Bluered)
fig.show()

## Explore data

You can also check your data and results at each step in the pipelines using the **Fondant data explorer**. The first time you run the data explorer, you need to download the docker image which may take a minute. Then you can access the data explorer at: **http://localhost:8501/**

Enjoy the exploration! 🍫 

Press the ◼️ in the notebook toolbar to **stop the explorer**.

In [None]:
from fondant.explore import run_explorer_app

run_explorer_app(base_path=shared_args["base_path"])

INFO:root:Using local base path: ./data
INFO:root:This directory will be mounted to /artifacts in the container.
INFO:root:Running image from registry: fndnt/data_explorer with tag: 0.8.0 on port: 8501
INFO:root:Access the explorer at http://localhost:8501


## Clean up your environment

After your pipeline has run successfully, you can **clean up** your environment and stop the Weaviate database.

In [None]:
!docker compose -f weaviate/docker-compose.yaml down

## Feedback

Please share your experience or **let us know how we can improve** through our 
* [**Discord**](https://discord.gg/HnTdWhydGp) 
* [**GitHub**](https://github.com/ml6team/fondant)

And of course feel free to give us a [**star** ⭐](https://github.com/ml6team/fondant) if you like what we are doing!