# Evaluation of RAG systems: LLM evaluator vs Semantic similarity
RAG systems have many moving parts. You can test each component individually or test the system end to end. This notebook shows two methods to **evaluate end-to-end systems**: an LLM evaluator and semantic similarity.
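As a rough illustration of the semantic-similarity idea, the sketch below scores two answers by the cosine similarity of their embedding vectors. Cosine similarity is a common choice, but it is an assumption here; the vectors are toy values standing in for real embeddings, and the names `golden_answer_vec` and `generated_answer_vec` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values, not model output).
golden_answer_vec = [1.0, 0.0, 1.0]
generated_answer_vec = [1.0, 0.0, 0.0]

score = cosine_similarity(golden_answer_vec, generated_answer_vec)
print(f"similarity: {score:.3f}")
```

A score near 1 means the two vectors point in nearly the same direction; what counts as "close enough" is a threshold you would need to pick for your own data.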

## Requirements
To follow along, you will need an OpenAI API key. To get an OpenAI API key, go to ...

## Golden dataset
The golden dataset is a curated set of **questions**, **chunks**, and **answers**. This notebook uses the [] dataset.

To evaluate your own system, you must **own** or **create** a golden dataset. To create a synthetic golden dataset from a corpus, you can use LlamaIndex's `generate_qa_embedding_pairs` utility, then use the chunks and questions to generate answers with GPT-4.
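The answer-generation step could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual method: `build_answer_prompt`, `add_golden_answers`, and the prompt wording are hypothetical, and `complete` stands for any callable that maps a prompt string to a completion (for example, a thin wrapper around a GPT-4 call). A stub is used here so the example runs without an API key.

```python
def build_answer_prompt(question, context):
    """Build a prompt asking a model to answer only from the given chunk."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def add_golden_answers(pairs, complete):
    """Return new entries that add an 'answer' field produced by `complete`.

    `complete` is an assumed callable: prompt string in, completion string out.
    The input entries are left unmodified.
    """
    return [
        {
            **pair,
            "answer": complete(
                build_answer_prompt(pair["question"], pair["context"])
            ),
        }
        for pair in pairs
    ]


def fake_complete(prompt):
    """Stub standing in for a real LLM call (hypothetical)."""
    return "stub answer"


pairs = [
    {"question": "Who wrote the essay?", "context": "An essay by Paul Graham."}
]
golden = add_golden_answers(pairs, fake_complete)
print(golden[0]["answer"])
```

Passing the model call in as a function keeps the dataset-building logic testable and makes it easy to swap models later.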

In [3]:
from dotenv import load_dotenv
import os
import openai

load_dotenv()

openai.api_key = os.environ["OPENAI_API_KEY"]

# Create an index from a widely available dataset.
FILE_URL = "https://raw.githubusercontent.com/jerryjliu/llama_index/main/examples/paul_graham_essay/data/paul_graham_essay.txt"

# Create qa embedding pairs.
from llama_index import VectorStoreIndex, download_loader, SimpleDirectoryReader
# See https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html


In [4]:
# We want to do these tests with HF embeddings.
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import ServiceContext


# Load the Hugging Face embedding model through LlamaIndex.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Default LLM is OpenAI model. Thus, it requires setting an OpenAI API key.
service_context = ServiceContext.from_defaults(embed_model=embed_model)



In [5]:
RemoteReader = download_loader("RemoteReader")
documents = RemoteReader().load_data(FILE_URL)
# index = VectorStoreIndex.from_documents(
#     documents,
#     service_context=service_context,
#     show_progress=True,
# )

Parsing documents into nodes: 100%|██████████| 1/1 [00:00<00:00, 19.19it/s]
Generating embeddings: 100%|██████████| 19/19 [00:03<00:00,  5.65it/s]


In [1]:
from llama_index.finetuning import generate_qa_embedding_pairs

In [6]:
# Parse nodes from documents
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents)

In [13]:
# From Llamaindex
import random
def subsample(data, ratio):
    "Subsample a list to a given ratio."
    if not 0 <= ratio <= 1:
        raise ValueError("Ratio must be between 0 and 1")

    # Calculate the number of items to retain in the subsample
    num_items_to_retain = int(len(data) * ratio)

    # Randomly select items to retain
    subsampled_data = random.sample(data, num_items_to_retain)

    return subsampled_data

SUBSAMPLE_RATIO = 0.5

subsampled_nodes = subsample(nodes, SUBSAMPLE_RATIO)
print('Subsampled {} nodes into {} nodes'.format(len(nodes), len(subsampled_nodes)))


Subsampled 19 nodes into 9 nodes
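Because `random.sample` draws a different subset on every run, the generated questions will also differ between runs. If you want a repeatable subsample, one option is to draw from a seeded generator. The sketch below is a variation on the helper above; the name `subsample_seeded` and the seed value are illustrative assumptions, and a plain list of integers stands in for the nodes.

```python
import random


def subsample_seeded(data, ratio, seed):
    """Subsample a list to a given ratio, reproducibly for a fixed seed."""
    if not 0 <= ratio <= 1:
        raise ValueError("Ratio must be between 0 and 1")
    # A local generator avoids changing the global random state.
    rng = random.Random(seed)
    return rng.sample(data, int(len(data) * ratio))


items = list(range(19))  # stand-in for the 19 nodes
first = subsample_seeded(items, 0.5, seed=42)
second = subsample_seeded(items, 0.5, seed=42)
print(len(first), first == second)
```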


In [14]:
# Uses gpt-3.5-turbo by default
synthetic_dataset = generate_qa_embedding_pairs(subsampled_nodes, num_questions_per_chunk=2)

100%|██████████| 9/9 [00:33<00:00,  3.71s/it]


In [36]:
synthetic_data = []
node_id_to_text = {node.id_: node.text for node in subsampled_nodes}

for query_id, context_ids in synthetic_dataset.relevant_docs.items():
    query = synthetic_dataset.queries[query_id]
    golden_context = node_id_to_text[context_ids[0]]
    entry = {
        "question": query,
        "context": golden_context,
    }
    synthetic_data.append(entry)

In [37]:
import json
with open("question_context_pairs.json", "w") as f:
    json.dump(synthetic_data, f, indent=2)  # write the pairs to disk

synthetic_data  # display the pairs


[{'question': "How did the author's experience with the IBM 1401 influence their interest in programming?",
  'context': 'What I Worked On\n\n\n\n\n\nFebruary 2021\n\n\n\n\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\n\n\n\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised fl
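Before running an evaluation, it can help to read the saved file back and confirm that every entry has the expected fields. The sketch below assumes the file is a JSON list of objects with `question` and `context` keys, as built above; `load_pairs` is a hypothetical helper, and a temporary directory with a toy entry is used so the example does not touch your real file.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"question", "context"}


def load_pairs(path):
    """Load question/context pairs and check that each entry has both keys."""
    with open(path, "r", encoding="utf-8") as f:
        pairs = json.load(f)
    for i, pair in enumerate(pairs):
        missing = REQUIRED_KEYS - set(pair)
        if missing:
            raise ValueError(f"Entry {i} is missing keys: {sorted(missing)}")
    return pairs


# Round trip with a toy entry (not real dataset content).
sample = [{"question": "toy question", "context": "toy context"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "question_context_pairs.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    loaded = load_pairs(path)

print(len(loaded), "pairs loaded")
```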

## Evaluation