# RAG QA Evaluation

This notebook evaluates the RAG pipeline. It loads the prepared dataset and embeddings, then runs either the whole pipeline or only the retrieval step and evaluates the results.

## Colab Setup

If you are running this notebook in Colab, the cell below first clones the repository and installs the requirements. Outside Colab, these steps are skipped to save execution time, on the assumption that your environment is already set up.

In [None]:
import sys
IS_COLAB = "google.colab" in sys.modules

if IS_COLAB:
    import os

    REPO_URL = "https://github.com/lucas937-code/rag-qa"
    REPO_DIR = "rag-qa"
    BRANCH = "dev"
    MODE = "DEBUG"

    # Clone repo only if it does not exist yet
    if not os.path.isdir(REPO_DIR):
        print(f"Cloning repository from {REPO_URL}...")
        !git clone {REPO_URL} {REPO_DIR}
    else:
        print(f"Repository '{REPO_DIR}' already exists, skipping clone.")

    # Change into repo directory
    %cd {REPO_DIR}

    # Checkout the correct branch
    if BRANCH != "main":
        remote_branches = !git branch -r
        print(remote_branches)
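        # Compare against whole branch names so that e.g. "dev" does not match "origin/develop"
        if f"origin/{BRANCH}" not in [b.strip() for b in remote_branches]: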
            raise ValueError(f"Branch {BRANCH} does not exist")
        !git checkout {BRANCH}

    # Install dependencies
    !pip install -r requirements.txt

## Setup Configuration

Initialize the configuration for the execution environment (local with the HuggingFace API, local with an Ollama server, or Colab with the HuggingFace API) and create the necessary directories.

In [None]:
from src.config import OllamaConfig, LocalConfig, ColabConfig

USE_OLLAMA = True

if USE_OLLAMA:
    # Adjust the host to the address of your own Ollama server
    OLLAMA_HOST = "172.19.176.1"
    OLLAMA_PORT = 11434  # Ollama's default port
    OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}/api/chat"
    config = OllamaConfig(ollama_url=OLLAMA_URL)
else:
    config = ColabConfig() if IS_COLAB else LocalConfig()
    
config.ensure_dirs()

## Retrieval Evaluation

Retrieve chunks for a given number of samples and check, for each sample, whether at least one of the retrieved chunks contains the answer.  
You can specify the number of samples, the number of candidates retrieved before reranking is applied, and a tuple of k values to evaluate.
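
The per-sample check described above can be sketched as a hit rate at k. This is only an illustration, not the implementation behind `run_evaluation`: the helper names `hit_at_k` and `hit_rates`, the `(ranked_chunks, answers)` sample layout, and the case-insensitive substring match are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical sample layout: (chunks in ranked order, accepted answer strings)
Sample = Tuple[List[str], List[str]]


def hit_at_k(ranked_chunks: Sequence[str], answers: Sequence[str], k: int) -> bool:
    """True if any of the top-k chunks contains any accepted answer.

    Assumption: matching is a case-insensitive substring test.
    """
    top_chunks = [chunk.lower() for chunk in ranked_chunks[:k]]
    return any(answer.lower() in chunk for chunk in top_chunks for answer in answers)


def hit_rates(samples: Sequence[Sample], top_k: Sequence[int] = (1, 3, 5, 10)) -> Dict[int, float]:
    """Fraction of samples with at least one hit, for every k in top_k."""
    if not samples:
        return {k: 0.0 for k in top_k}
    return {
        k: sum(hit_at_k(chunks, answers, k) for chunks, answers in samples) / len(samples)
        for k in top_k
    }


demo_samples = [
    (["Paris is the capital of France.", "Berlin is in Germany."], ["Paris"]),
    (["Berlin is in Germany.", "The Seine flows through Paris."], ["Paris"]),
    (["Rome is in Italy."], ["Paris"]),
]
print(hit_rates(demo_samples, top_k=(1, 2)))
```

Because a hit within the top k is also a hit within any larger k, the hit rate can only stay equal or grow as k increases.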

In [None]:
from src.evaluate_retrieve import run_evaluation

run_evaluation(config=config,
               sample_limit=100,
               candidates=100,
               top_k=(1, 3, 5, 10),
               data_dirs=(config.train_dir,
                          config.val_dir,
                          config.test_dir))

## Full RAG Evaluation

Execute the whole RAG pipeline for a given number of samples and evaluate the results with two metrics:

**Exact Match (EM):** a simple metric that indicates how often the LLM returned exactly the golden answer.

**F1:** a more fine-grained metric that measures, on average, how similar the generated answer is to the golden answer, based on the tokens they share. It is computed as follows:
1. Normalize prediction and golden answer:
    - convert to lowercase
    - remove punctuation
    - remove "a"/"an"/"the"
    - normalize whitespace
2. Tokenize prediction and golden answer
3. Count the tokens that occur in both ($num_{same}$)
4. Calculate precision and recall:
    - $precision = \frac{num_{same}}{|tokens_{prediction}|}$
    - $recall = \frac{num_{same}}{|tokens_{golden}|}$
5. Calculate the F1 score: $F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$ ($0$ if $num_{same} = 0$)
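
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the code used by `run_full_rag_eval`: the function names are hypothetical, tokenization is assumed to be a whitespace split, shared tokens are counted with multiplicity, and applying the same normalization before the exact-match comparison is also an assumption.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, remove punctuation, remove a/an/the, normalize whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golden: str) -> float:
    """1.0 if both strings are identical after normalization (assumption), else 0.0."""
    return float(normalize(prediction) == normalize(golden))


def f1_score(prediction: str, golden: str) -> float:
    """Token-level F1 between prediction and golden answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(golden).split()
    # Multiset intersection: each shared token counts as often as it occurs in both
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower!", "Eiffel Tower"))
print(round(f1_score("The Eiffel Tower in Paris", "Eiffel Tower"), 3))
```

In the second call, the normalized prediction has four tokens and the golden answer has two, both of which are shared, so precision is 0.5, recall is 1.0, and F1 is about 0.667.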

In [None]:
from src.evaluate_rag_full import run_full_rag_eval

run_full_rag_eval(config=config, 
                  max_questions=100, 
                  top_k=5,
                  save_file=None)