## Setting up an Evaluation Pipeline for Research Assistant v2
- Utilize the LlamaIndex library
- Swap out the LLM and embedding model

**Useful Links**

**Llama Index RAG Implementation**
- [Starter] https://medium.com/mitb-for-all/a-gentle-introduction-to-the-llm-multiverse-part-3-llamaindex-798344050c49
- [Docs] Basic and Advanced RAG systems https://www.llamaindex.ai/blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b 
- [Docs] Customising LLMs within LlamaIndex Abstractions https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom/ 
- [Docs] Customising Embedding Model for Vector store Indexing https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/ 

**Evaluation Framework**
- [Docs] https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-3/Evaluating_RAG_Systems/#correctness-evaluator
- [Ragas] https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/

In [1]:
# set autoreload for modules
%load_ext autoreload
%autoreload 2

# import dependencies
import os
import openai
from dotenv import load_dotenv, find_dotenv
import warnings
import nest_asyncio

_ = load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
nest_asyncio.apply()
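
The setup cell above relies on `load_dotenv` to supply the OpenAI credentials, and a missing `.env` file does not raise an error by itself. A minimal sketch of an explicit check, assuming the key is stored under the conventional `OPENAI_API_KEY` variable name; the helper `require_env` is hypothetical and not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early with a readable message instead of a later authentication error
        raise RuntimeError(
            f"{name} is not set; check that your .env file was found and loaded"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(...)` would surface a configuration problem before any LLM call is made.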

### Developing Evaluation Dataset (with question-answer pairs relating to context)
**1. *First, let us configure the LLM and a custom embedding model, "hkunlp/instructor-large", for our RAG system.***

In [2]:
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.llms.openai import OpenAI

# Configure LLM
Settings.llm = OpenAI(model="gpt-4o-mini")

In [3]:
# Use custom embedding model - “hkunlp/instructor-large”
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# load embedding model (try) - loads https://huggingface.co/hkunlp/instructor-large
embed_model = HuggingFaceEmbedding(model_name="hkunlp/instructor-large")

In [3]:
# Test embedding model
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

768
[-0.048856671899557114, -0.00874248519539833, -0.013273396529257298, -0.011948740109801292, 0.020680604502558708]


In [4]:
# Set embedding model
Settings.embed_model = embed_model

**2. *Next, ingest the documents/context and generate the RAG evaluation dataset.***

In [6]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# load documents from pdf format
docs = SimpleDirectoryReader("../RAG-webscraper/docs/").load_data(show_progress=True)

# input llm and docs to instantiate RAG dataset generator
data_gen = RagDatasetGenerator.from_documents(
    docs,
    llm=Settings.llm,
    question_gen_query="You are a teacher/professor. Using the provided context, formulate a single question and its answer",
    num_questions_per_chunk=10
)

Loading files: 100%|██████████| 1/1 [00:07<00:00,  7.55s/it]


In [None]:
# run generate RAG dataset
rag_dataset = data_gen.generate_dataset_from_nodes()

In [28]:
rag_dataset



*Save the evaluation dataset so that it can be reused for future evaluations after later implementations.*

***Adapted from:** https://github.com/tituslhy/ideal-palm-tree/blob/main/notebooks/1a.%20training_dataset_gen.ipynb*

In [None]:
import json
import pandas as pd

# 1. Build a list of flat dicts, serializing each field properly
records = []
for ex in rag_dataset.examples:
    records.append({
        "query": ex.query,
        # JSON-encode the list of contexts
        "reference_contexts": json.dumps(ex.reference_contexts),
        "reference_answer": ex.reference_answer,
        # JSON-encode the CreatedBy objects
        "query_by": ex.query_by.model_dump_json(),
        "reference_answer_by": ex.reference_answer_by.model_dump_json(),
    })

# 2. Turn into a DataFrame and write to CSV (create the output folder if needed)
df = pd.DataFrame.from_records(records)
os.makedirs("data", exist_ok=True)
df.to_csv("data/eval_dataset.csv", index=False)

*Load the evaluation dataset back from the CSV file.*

In [6]:
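import json
import pandas as pd

from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    LabelledRagDataExample,
    CreatedBy,
)


def get_rag_dataset_from_csv(csv_path: str) -> LabelledRagDataset:
    """Rebuild a LabelledRagDataset from a CSV written by the save cell."""
    # Decode the JSON-encoded columns back into Python objects while reading
    converters = {
        "reference_contexts": lambda s: json.loads(s),
        "query_by": lambda s: CreatedBy.model_validate_json(s),
        "reference_answer_by": lambda s: CreatedBy.model_validate_json(s),
    }
    df = pd.read_csv(csv_path, converters=converters)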
    examples = []
    for _, row in df.iterrows():
        examples.append(
            LabelledRagDataExample(
                query=row["query"],
                query_by=row["query_by"],                      # now a CreatedBy
                reference_contexts=row["reference_contexts"],   # now a List[str]
                reference_answer=row["reference_answer"],
                reference_answer_by=row["reference_answer_by"], # now a CreatedBy
            )
        )

    # Create the dataset
    dataset = LabelledRagDataset(examples=examples)
    return dataset

In [7]:
eval_dataset = get_rag_dataset_from_csv("data/eval_dataset.csv")
len(eval_dataset.examples)

55
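
The save and load cells above depend on JSON-encoding list-valued fields, since a CSV cell can only hold a single string. The sketch below illustrates that round trip using only the standard-library `csv` module as a stand-in for pandas; the function name `roundtrip_contexts` and the sample row are hypothetical:

```python
import csv
import io
import json


def roundtrip_contexts(rows):
    """Write rows to an in-memory CSV and read them back, JSON-encoding the list field."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["query", "reference_contexts"])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "query": row["query"],
            # A list cannot be stored in one CSV cell, so serialize it to a JSON string
            "reference_contexts": json.dumps(row["reference_contexts"]),
        })

    buffer.seek(0)
    restored = []
    for row in csv.DictReader(buffer):
        restored.append({
            "query": row["query"],
            # Decode the JSON string back into a list of strings
            "reference_contexts": json.loads(row["reference_contexts"]),
        })
    return restored


sample = [{
    "query": "Who is the author?",
    "reference_contexts": ["chunk, with a comma", 'chunk with "quotes"'],
}]
assert roundtrip_contexts(sample) == sample
```

Commas and quotes inside a context survive the round trip because the JSON string is stored as a single quoted CSV field.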

### Build Query Engine (RAG System) and Run Baseline Evaluation
*With our evaluation dataset in place, we first run the evaluation on "Research Assistant v2", our baseline RAG system, which pairs the custom embedding model with "llama3.2:1b" (served via Ollama) as the answer generator and uses GPT-4o-mini as the judge/evaluator LLM.*

In [5]:
from llama_index.llms.ollama import Ollama

# Instantiate query engine LLM
llm = Ollama(model="llama3.2:1b", request_timeout=120)

In [None]:
# Input documents (in index), embedding model and LLM to generate query engine (RAG system)
docs = SimpleDirectoryReader("../RAG-webscraper/docs/").load_data(show_progress=True) # load documents from pdf format
index = VectorStoreIndex.from_documents(docs, embed_model=Settings.embed_model)
query_engine = index.as_query_engine(similarity_top_k=6, llm=llm)

Loading files: 100%|██████████| 1/1 [00:07<00:00,  7.41s/it]
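
The `similarity_top_k=6` argument asks the retriever to hand the six best-matching chunks to the LLM for each query. The toy sketch below shows the idea with plain Python lists. It is not LlamaIndex's implementation, and it assumes cosine similarity as the scoring function; the names `cosine_similarity` and `top_k` are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real ones from the model above have 768 dimensions
chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], chunks, k=2))
```

A larger `k` gives the LLM more context to draw on, at the cost of longer prompts and a higher chance of including loosely related chunks.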


In [11]:
# Test out query engine
response = query_engine.query("How did Snape support Harry despite being a deatheater? On top of that, how did he hide his allegiance with the order from Voldermort?")
print(response)

Based on the provided context information, it appears that Snape supported Harry because of Dumbledore's words and the understanding that Snape's true allegiance lies with the Order of the Phoenix. 

In Chapter 33, Snape is told by Dumbledore to suggest to the Order of the Phoenix that they use decoys, specifically Polyjuice Potion, and identical Potters. This means that Snape deliberately chose to support Harry despite being a Death Eater, as it directly benefits the greater good (Harry's safety) rather than advancing Voldemort's interests.
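
To judge whether an answer like the one above is grounded in the documents, it can help to look at the chunks that were retrieved for it. A minimal sketch, assuming the response object exposes a `source_nodes` list whose items carry a `score` and a `node` with `get_content()`, as LlamaIndex responses generally do; the helper `summarize_sources` and the stand-in objects are hypothetical:

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (rank, score, text preview) for each retrieved chunk behind a response."""
    summary = []
    for rank, source in enumerate(response.source_nodes, start=1):
        preview = source.node.get_content()[:max_chars]
        summary.append((rank, source.score, preview))
    return summary


# Stand-in objects that mimic the shape of a response, for illustration only
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.91,
            node=SimpleNamespace(get_content=lambda: "First retrieved chunk"),
        ),
        SimpleNamespace(
            score=0.87,
            node=SimpleNamespace(get_content=lambda: "Second retrieved chunk"),
        ),
    ]
)
for row in summarize_sources(fake_response):
    print(row)
```

In the notebook, the same helper could be called as `summarize_sources(response)` on the result of `query_engine.query(...)`.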



**Note:** Although GPT-4o-mini is a capable LLM, we should NOT use the same LLM as both the judge/evaluator and the answer generator in the RAG system. Hence, in keeping with our earlier demonstration of "Research Assistant v2", we kept "llama3.2:1b" as the answer generator LLM in our query engine.

*It can also be useful to have a more powerful LLM (here, GPT-4o-mini) judge the answer generator LLM, so that the resulting benchmark is more precise and critical.* - *Refer to https://github.com/tituslhy/ideal-palm-tree/blob/main/notebooks/2.%20llama32_1bn_RAFT.ipynb*

In [None]:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

# Instantiate RAG Evaluator - input query engine, evaluation dataset, judge LLM & embeddings model
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=eval_dataset,  # the dataset reloaded from data/eval_dataset.csv
    judge_llm=Settings.llm,  # GPT-4o-mini, the same LLM that generated the dataset, acts as judge
    embed_model=Settings.embed_model
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Processing /Users/jinkettyee/Desktop/my_GitHub/great-things/RAG-evaluation/pack
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: llama-index-packs-rag-evaluator
  Building wheel for llama-index-packs-rag-evaluator (pyproject.toml): started
  Building wheel for llama-index-packs-rag-evaluator (pyproject.toml): finished with status 'done'
  Created wheel for llama-index-packs-rag-evaluator: filename=llama_index_packs_rag_evaluator-0.3.1-py3-none-any.whl size=4935 sha256=dad027ca78706e90f46566970ee58c33a8d05a5fec0ef37e09c13ed24e871a8c
  Stored in directory: /private/var/folders/nr/6b6zx3jn687ghmtz2_2dw_b40000gn/T/pip-ephem-wheel-cache-41ft15r7/wheels/c5/b3/f2/e8724b5fcdbbb7

You should consider upgrading via the '/Users/jinkettyee/.pyenv/versions/great_things/bin/python -m pip install --upgrade pip' command.


In [26]:
# Run evaluation/benchmarking
benchmark_df = rag_evaluator.run()

2it [00:11,  5.99s/it]
2it [00:12,  6.44s/it]
2it [00:11,  5.79s/it]
2it [00:12,  6.14s/it]
2it [00:11,  5.84s/it]
2it [00:12,  6.22s/it]
2it [00:12,  6.05s/it]
2it [00:12,  6.03s/it]
2it [00:12,  6.12s/it]
2it [00:11,  5.73s/it]
2it [00:12,  6.27s/it]
2it [00:11,  5.92s/it]
2it [00:12,  6.22s/it]
2it [00:12,  6.04s/it]
2it [00:11,  5.80s/it]
2it [00:11,  5.80s/it]
2it [00:11,  5.88s/it]
2it [00:11,  5.67s/it]
2it [00:10,  5.11s/it]
2it [00:09,  4.97s/it]
2it [00:09,  4.78s/it]
2it [00:10,  5.28s/it]
2it [00:09,  4.85s/it]
2it [00:11,  5.90s/it]
2it [00:16,  8.32s/it]
2it [00:18,  9.09s/it]
2it [00:12,  6.35s/it]
1it [00:05,  5.91s/it]


In [27]:
# Review scores
print(benchmark_df)

rag                            base_rag
metrics                                
mean_correctness_score         4.218182
mean_relevancy_score           0.981818
mean_faithfulness_score        0.836364
mean_context_similarity_score  0.952509


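*These are fairly decent baseline results: mean correctness is about 4.22, relevancy 0.98, faithfulness 0.84 and context similarity 0.95. Factors likely contributing to them include the quality of the evaluation dataset and the retriever (embedding model).*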
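
Each reported figure is a mean over the 55 evaluation examples. As a worked illustration, the sketch below assumes the relevancy and faithfulness evaluators give each example a pass/fail score of 1.0 or 0.0. Under that assumption, the pass counts of 54 and 46 are inferred from the reported means; they are not read from the evaluator output:

```python
def mean_score(scores):
    """Arithmetic mean of per-example scores."""
    return sum(scores) / len(scores)


n_examples = 55

# Hypothetical per-example scores chosen to be consistent with the reported means
relevancy = [1.0] * 54 + [0.0] * (n_examples - 54)
faithfulness = [1.0] * 46 + [0.0] * (n_examples - 46)

print(round(mean_score(relevancy), 6))     # 0.981818
print(round(mean_score(faithfulness), 6))  # 0.836364
```

Read this way, a mean faithfulness of 0.836364 would correspond to roughly nine answers that the judge considered unsupported by the retrieved context, which is a natural place to start when inspecting individual failures.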