# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

#####❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
#!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [3]:
#!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml 

### Provide OpenAI API Key

In [2]:
#import os
#import getpass

#os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")
import os
from dotenv import load_dotenv

load_dotenv()

True

## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [3]:
!mkdir data

mkdir: data: File exists


In [4]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31413    0 31413    0     0   116k      0 --:--:-- --:--:-- --:--:--  116k


In [5]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70272    0 70272    0     0   129k      0 --:--:-- --:--:-- --:--:--  129k


In [6]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [8]:
training_documents = text_splitter.split_documents(text_loader.load())

In [9]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [10]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [11]:
training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:102-12]
test_split_documents = training_documents[102-12:]
print(len(training_split_documents))
print(len(val_split_documents))
print(len(test_split_documents))

78
12
12


## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [12]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [14]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [15]:
import tqdm
import re

async def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  ### YOUR CODE HERE
  # Process each document to generate questions
  for doc in tqdm.tqdm(documents, desc="Processing documents"):
        # Generate n_questions for each document using the question generation chain
        generated_questions = question_generation_chain.invoke({
            "context": doc.page_content,
            "n_questions": n_questions
        })

        # For each generated question
        for question in generated_questions.content.split('\n'):
            #remove the question number at the begining
            question = re.sub(r"(^\d+. )", "", question)
            
            # Generate a unique ID for the question
            question_id = str(uuid.uuid4())

            # Add to questions dictionary
            questions[question_id] = question.strip()
                
            # Add to relevant docs dictionary - link question to source document
            relevant_docs[question_id] = [doc.metadata["id"]]
              
  return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [16]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Processing documents: 100%|██████████| 78/78 [01:29<00:00,  1.15s/it]


We'll use the function to generate training, validation, and test data.

In [17]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing documents: 100%|██████████| 12/12 [00:18<00:00,  1.52s/it]


In [18]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)
print(test_questions)
print(test_relevant_contexts)

Processing documents: 100%|██████████| 12/12 [00:13<00:00,  1.09s/it]

{'e766257a-5620-4650-8510-36bd822f7741': 'What are some potential societal impacts of the technology mentioned in the context?  ', '9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac': 'Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?', '3a2a223d-5f76-42ed-9156-6a528f5128c0': 'What are some reasons people dislike LLMs according to the context?', 'a86fd938-aa11-4500-b0ca-5e699075feb5': 'Why is it important to discuss the criticisms of LLMs?', '1001c601-96e3-488b-bc49-838666f7c584': 'What are some potential consequences of making decisions based on hype and misinformation?  ', 'a449a292-7d6f-4390-bbb3-b0709d4a4f91': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', '974ba56d-26a0-447e-8cc3-1fa236345123': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  ", '94aa9911-47b2-40ce-8e63-0e8483e436a1': 'What does the au




In [22]:
for query_id, query in test_questions.items():
    print(f"{query_id} : {query}")

e766257a-5620-4650-8510-36bd822f7741 : What are some potential societal impacts of the technology mentioned in the context?  
9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac : Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?
3a2a223d-5f76-42ed-9156-6a528f5128c0 : What are some reasons people dislike LLMs according to the context?
a86fd938-aa11-4500-b0ca-5e699075feb5 : Why is it important to discuss the criticisms of LLMs?
1001c601-96e3-488b-bc49-838666f7c584 : What are some potential consequences of making decisions based on hype and misinformation?  
a449a292-7d6f-4390-bbb3-b0709d4a4f91 : Why is it important to acknowledge good applications of certain tools before making decisions about their use?
974ba56d-26a0-447e-8cc3-1fa236345123 : What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  
94aa9911-47b2-40ce-8e63-0e8483e436a1 : What does the author believe is necessary to he

### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [23]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [24]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [25]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [26]:
!uv pip install -qU sentence_transformers datasets pyarrow

In [27]:
#SentenceTransformer loads the pre-trained embedding model
#This model will be used to generate embeddings for text queries and documents.
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [28]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [29]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [30]:
#Prepare the Training Data
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [31]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [32]:
#Defines a two-step loss function:
#MultipleNegativesRankingLoss:
#Encourages the model to bring queries closer to their correct documents while pushing away unrelated ones.
#MatryoshkaLoss:
#Ensures that embeddings retain meaning even when reduced to lower dimensions (useful for efficient storage & fast inference).
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

Why are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

**ANSWER**

**MultipleNegativesRankingLoss** helps train sentence embeddings so that:

The correct (positive) sentence is placed closer to the query in the embedding space. Other (negative) sentences are pushed further away.
Instead of manually selecting "wrong" answers (negatives), it automatically treats all other sentences in the batch as negatives—this is what makes it so efficient.

🛠 How Does It Work?
- Input Setup:

You have a batch of queries and their correct answers (positives).
Each pair consists of (Query, Correct Answer).

- Computing Similarities:

The model generates embeddings (numerical representations) for all queries and answers.
It then compares each query with every answer in the batch using cosine similarity.

- Ranking the Correct Answer Higher:

The loss function rewards the model when the similarity between the query and its correct answer is high.
It penalizes the model when the query mistakenly ranks other answers higher.

- In-Batch Negatives Trick:

Instead of needing separate wrong answers, the model automatically considers every other answer in the batch as a negative.
This makes training much faster and more efficient.

🤔 Why is it Useful?

✅ Efficient Learning – No need to manually pick negative samples.

✅ Better Retrieval Models – Helps the model understand which sentences are most relevant.

✅ Works Well in Large Batches – The more sentences in a batch, the more negatives, improving performance

--------------------

**MatryoshkaLoss** is a loss function that trains embeddings to be useful at multiple dimensions (sizes) at the same time.

1️⃣ Encourages embeddings to store important meaning at the start of the vector.

2️⃣ Makes sure truncated embeddings (shorter versions) still work well.

3️⃣ Allows models to be flexible—you can use smaller embeddings for speed while keeping good performance.

🛠 How Does It Work?
- Start with a high-dimensional embedding (e.g., 768 dimensions).
- Create smaller versions by chopping off some of the last dimensions (e.g., reduce to 512, 256, 128, 64).
- Train each version separately using a ranking loss (e.g., MultipleNegativesRankingLoss).
- Ensure that smaller versions still perform well—this forces the model to store the most important meaning in the first few dimensions.

🤔 Why Is This Useful?

✅ More flexibility – You can use smaller embeddings if needed without losing too much quality.

✅ Faster inference – If you need speed, use a lower-dimension version.

✅ Efficient storage – Store compact embeddings while still capturing important meaning.

✅ Works well for retrieval & search – Smaller, optimized embeddings can make search systems faster.


Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [33]:
#Loads validation data from val_dataset.
#Creates an InformationRetrievalEvaluator, which:
#Measures how well the model ranks relevant documents for given queries.
#Uses metrics like Mean Average Precision (MAP) (How well the model ranks relevant documents across queries.)
# and Recall@K. (Measures if the correct document appears in the top K results.)
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [34]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [35]:
!uv pip install -qU wandb 'accelerate>=0.26.0' 'ipywidgets>=8.1.5'

In [36]:
#Disables Weights & Biases (wandb) tracking for this experiment.
#Useful when debugging or running tests without logging.

import wandb
wandb.init(mode="disabled") # Completely disables logging

In [37]:
#Warm-up steps prevent sudden learning rate spikes.
#Formula:
#len(loader) * EPOCHS = total number of training steps.
#0.1 = 10% of steps used for gradual learning rate increase.
warmup_steps = int(len(loader) * EPOCHS * 0.1)

#Trains the model using the train_loss function.
#Fine-tunes it over multiple epochs to improve retrieval accuracy.
#Evaluates performance every 50 steps using the evaluator.
#Saves the fine-tuned model to 'finetuned_arctic_ft'.
model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
32,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
48,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
50,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
64,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
80,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.979167,0.972222,0.972222
96,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.979167,0.972222,0.972222
100,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.963789,0.951389,0.951389
112,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.979167,0.972222,0.972222
128,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.979167,0.972222,0.972222


In [38]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [39]:
hf_username = "kcheng0816"

In [40]:
# Uploads the fine-tuned model. Makes the model publicly available for reusability.
model.push_to_hub(f"{hf_username}/finetuned_arctic_kc")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/kcheng0816/aie5_assignment8/commit/c287feaa2b7d4b87c3d6b521117d43fabe5d4d09'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [41]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [48]:
#Creates a FAISS-based vector search index.
#Converts the corpus into vector embeddings.
#Retrieves top-K most similar documents for each query.
#Checks if the correct document is in the retrieved results (is_hit flag).
#Returns evaluation results (how often the correct document appears in the top-K results).
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  #print(questions)
  for id, question in questions.items():
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [43]:
!uv pip install -qU faiss-cpu

In [45]:
print(test_dataset)

{'questions': {'e766257a-5620-4650-8510-36bd822f7741': 'What are some potential societal impacts of the technology mentioned in the context?  ', '9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac': 'Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?', '3a2a223d-5f76-42ed-9156-6a528f5128c0': 'What are some reasons people dislike LLMs according to the context?', 'a86fd938-aa11-4500-b0ca-5e699075feb5': 'Why is it important to discuss the criticisms of LLMs?', '1001c601-96e3-488b-bc49-838666f7c584': 'What are some potential consequences of making decisions based on hype and misinformation?  ', 'a449a292-7d6f-4390-bbb3-b0709d4a4f91': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', '974ba56d-26a0-447e-8cc3-1fa236345123': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  ", '94aa9911-47b2-40ce-8e63-0e8483e436a1': 'Wh

In [49]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

{'e766257a-5620-4650-8510-36bd822f7741': 'What are some potential societal impacts of the technology mentioned in the context?  ', '9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac': 'Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?', '3a2a223d-5f76-42ed-9156-6a528f5128c0': 'What are some reasons people dislike LLMs according to the context?', 'a86fd938-aa11-4500-b0ca-5e699075feb5': 'Why is it important to discuss the criticisms of LLMs?', '1001c601-96e3-488b-bc49-838666f7c584': 'What are some potential consequences of making decisions based on hype and misinformation?  ', 'a449a292-7d6f-4390-bbb3-b0709d4a4f91': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', '974ba56d-26a0-447e-8cc3-1fa236345123': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  ", '94aa9911-47b2-40ce-8e63-0e8483e436a1': 'What does the au

In [50]:
te3_results_df = pd.DataFrame(te3_results)

In [51]:
#Uses OpenAI’s text-embedding-3-small to embed documents.
#Evaluates retrieval accuracy.
#Computes hit rate (percentage of queries where the correct document appears in the top results).
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(1.0)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [53]:
!uv pip install -qU langchain_huggingface

In [76]:
#Uses Hugging Face’s Snowflake/snowflake-arctic-embed-l embeddings.
#Computes retrieval performance and compares it to OpenAI’s model.
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

{'e766257a-5620-4650-8510-36bd822f7741': 'What are some potential societal impacts of the technology mentioned in the context?  ', '9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac': 'Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?', '3a2a223d-5f76-42ed-9156-6a528f5128c0': 'What are some reasons people dislike LLMs according to the context?', 'a86fd938-aa11-4500-b0ca-5e699075feb5': 'Why is it important to discuss the criticisms of LLMs?', '1001c601-96e3-488b-bc49-838666f7c584': 'What are some potential consequences of making decisions based on hype and misinformation?  ', 'a449a292-7d6f-4390-bbb3-b0709d4a4f91': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', '974ba56d-26a0-447e-8cc3-1fa236345123': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  ", '94aa9911-47b2-40ce-8e63-0e8483e436a1': 'What does the au

In [77]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [78]:

arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.9166666666666666)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [79]:
#Evaluates the fine-tuned Arctic embeddings.
#Measures how well the fine-tuned model improves retrieval accuracy.
finetune_embeddings = HuggingFaceEmbeddings(model_name="kcheng0816/finetuned_arctic_kc")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/275 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/29.1k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/641 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at kcheng0816/finetuned_arctic_kc and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

{'e766257a-5620-4650-8510-36bd822f7741': 'What are some potential societal impacts of the technology mentioned in the context?  ', '9aa330aa-32d7-49c3-b43e-0ab7d73ea8ac': 'Why does the author believe that the knowledge gap between tech enthusiasts and the general population is unhealthy?', '3a2a223d-5f76-42ed-9156-6a528f5128c0': 'What are some reasons people dislike LLMs according to the context?', 'a86fd938-aa11-4500-b0ca-5e699075feb5': 'Why is it important to discuss the criticisms of LLMs?', '1001c601-96e3-488b-bc49-838666f7c584': 'What are some potential consequences of making decisions based on hype and misinformation?  ', 'a449a292-7d6f-4390-bbb3-b0709d4a4f91': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', '974ba56d-26a0-447e-8cc3-1fa236345123': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?  ", '94aa9911-47b2-40ce-8e63-0e8483e436a1': 'What does the au

In [80]:
finetune_results_df = pd.DataFrame(finetune_results)

In [81]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(1.0)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [82]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [83]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [84]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [85]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [86]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [87]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is an infuriatingly vague term that generally refers to AI systems that can act on your behalf. There are two main interpretations: one sees agents as systems that go and perform tasks for you (like a travel agent), while the other views them as LLMs (large language models) that have access to tools and can run processes in a loop to solve problems. However, the term lacks a single, clear definition, leading to confusion about its meaning and utility.'

In [88]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [89]:
base_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [90]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [91]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [92]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [93]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" in the context of AI refers to a system that can act on behalf of a user, but the term lacks a single, clear definition. There are two main interpretations: one sees agents as entities that perform tasks for users, similar to a travel agent, while the other views them as LLMs (Large Language Models) that utilize tools to solve problems in a loop. However, the concept remains vague and is often associated with challenges such as gullibility, where these systems may struggle to distinguish truth from fiction. Overall, the term "agent" is still evolving and has not yet been fully realized in practical applications.'

In [94]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better models than GPT-3 include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [95]:
finetune_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [96]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

**ANSWER**

finetune_rag_chain answered the 1st and 2nd questions same good as base_rag_chain and also it did answer the 4th query related to Simon correctly. Because the embedding model kcheng0816/finetuned_arctic_kc used in the finetune_retriever was fine tuned by the specific information in Simon's web pages, the basic embedding model used in base_retriever does not have the information from Simon's web pages. 

Both chains did not answer the 3rd question because the provided contexts did not provide the related information. This result matched the rag prompt template.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

### Install Dependencies

In [97]:
### YOUR CODE HERE
!uv pip install -qU ragas==0.2.12 langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0


### Input RAGAS API key

In [98]:
import os
from getpass import getpass

os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

### Load downloaded web pages from data folder

In [99]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Create your embeding model wrapper

In [100]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

### Generate Synthetic Data 

In [101]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [102]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What role has Microsoft Research played in the...,[Code may be the best application The ethics o...,Microsoft Research is one of the organizations...,single_hop_specifc_query_synthesizer
1,What are some challenges associated with evalu...,[Based Development As a computer scientist and...,Evaluating LLMs is challenging because there a...,single_hop_specifc_query_synthesizer
2,Wht were the key developments in AI during 2023?,[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What is the significance of OpenAI in the cont...,[easy to follow. The rest of the document incl...,OpenAI is a significant entity in the context ...,single_hop_specifc_query_synthesizer
4,How do the ethics of AI training data and the ...,[<1-hop>\n\nCode may be the best application T...,The intersection of the ethics of AI training ...,multi_hop_abstract_query_synthesizer
5,How do the ethics of AI training data and the ...,[<1-hop>\n\nCode may be the best application T...,The ethics of AI training data and the environ...,multi_hop_abstract_query_synthesizer
6,How have advancements in Large Language Models...,[<1-hop>\n\nCode may be the best application T...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
7,How have advancements in Large Language Models...,[<1-hop>\n\nCode may be the best application T...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
8,How have advancements in GPT-4o and its pricin...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Advancements in GPT-4o have significantly impa...,multi_hop_specific_query_synthesizer
9,What are some of the major advancements and ch...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2024, significant advancements in Large Lan...",multi_hop_specific_query_synthesizer


In [103]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/145473f5-76da-40fd-83a2-53496fe809c5


'https://app.ragas.io/dashboard/alignment/testset/145473f5-76da-40fd-83a2-53496fe809c5'

### Call the base RAG chain with synthetic data and save the repsonses and the retrieved contexts

In [107]:
for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [108]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What role has Microsoft Research played in the...,"[Prompt injection explained, with video, slide...",[Code may be the best application The ethics o...,I do not know.,Microsoft Research is one of the organizations...,single_hop_specifc_query_synthesizer
1,What are some challenges associated with evalu...,[I get it. There are plenty of reasons to disl...,[Based Development As a computer scientist and...,Some challenges associated with evaluating LLM...,Evaluating LLMs is challenging because there a...,single_hop_specifc_query_synthesizer
2,Wht were the key developments in AI during 2023?,[The legal arguments here are complex. I’m not...,[Simon Willison’s Weblog Subscribe Stuff we fi...,The key developments in AI during 2023 include...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What is the significance of OpenAI in the cont...,[The legal arguments here are complex. I’m not...,[easy to follow. The rest of the document incl...,The provided context does not specifically men...,OpenAI is a significant entity in the context ...,single_hop_specifc_query_synthesizer
4,How do the ethics of AI training data and the ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nCode may be the best application T...,The ethics of AI training data and the ease of...,The intersection of the ethics of AI training ...,multi_hop_abstract_query_synthesizer
5,How do the ethics of AI training data and the ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nCode may be the best application T...,The ethics of AI training data and the environ...,The ethics of AI training data and the environ...,multi_hop_abstract_query_synthesizer
6,How have advancements in Large Language Models...,[I get it. There are plenty of reasons to disl...,[<1-hop>\n\nCode may be the best application T...,Advancements in Large Language Models (LLMs) h...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
7,How have advancements in Large Language Models...,[I get it. There are plenty of reasons to disl...,[<1-hop>\n\nCode may be the best application T...,The provided context does not contain specific...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
8,How have advancements in GPT-4o and its pricin...,[That same laptop that could just about run a ...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Advancements in GPT-4 and its pricing have sig...,Advancements in GPT-4o have significantly impa...,multi_hop_specific_query_synthesizer
9,What are some of the major advancements and ch...,[Even the openly licensed ones are still the w...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2024, some major advancements in Large Lang...","In 2024, significant advancements in Large Lan...",multi_hop_specific_query_synthesizer


### Convert to Evaluation Dataset

In [109]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

### Select a judge model

In [110]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

### Evaluate the base RAG chain

In [111]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=1440)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[2]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.3995, 'faithfulness': 0.6499, 'factual_correctness': 0.3300, 'answer_relevancy': 0.7198, 'context_entity_recall': 0.2865, 'noise_sensitivity_relevant': 0.2296}

### Call the base RAG chain with synthetic data and save the repsonses and the retrieved contexts

In [112]:
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [113]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What role has Microsoft Research played in the...,[Large Language Models\nThey’re actually quite...,[Code may be the best application The ethics o...,Microsoft Research has contributed to the deve...,Microsoft Research is one of the organizations...,single_hop_specifc_query_synthesizer
1,What are some challenges associated with evalu...,[I find I have to work with an LLM for a few w...,[Based Development As a computer scientist and...,Some challenges associated with evaluating LLM...,Evaluating LLMs is challenging because there a...,single_hop_specifc_query_synthesizer
2,Wht were the key developments in AI during 2023?,[Stuff we figured out about AI in 2023\n\n\n\n...,[Simon Willison’s Weblog Subscribe Stuff we fi...,The key developments in AI during 2023 include...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What is the significance of OpenAI in the cont...,[Here’s the sequel to this post: Things we lea...,[easy to follow. The rest of the document incl...,OpenAI is significant in the context of large ...,OpenAI is a significant entity in the context ...,single_hop_specifc_query_synthesizer
4,How do the ethics of AI training data and the ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nCode may be the best application T...,The ethics of AI training data and the ease of...,The intersection of the ethics of AI training ...,multi_hop_abstract_query_synthesizer
5,How do the ethics of AI training data and the ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nCode may be the best application T...,The ethics of AI training data and the environ...,The ethics of AI training data and the environ...,multi_hop_abstract_query_synthesizer
6,How have advancements in Large Language Models...,[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nCode may be the best application T...,Advancements in Large Language Models (LLMs) h...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
7,How have advancements in Large Language Models...,[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nCode may be the best application T...,Advancements in Large Language Models (LLMs) h...,Advancements in Large Language Models (LLMs) h...,multi_hop_abstract_query_synthesizer
8,How have advancements in GPT-4o and its pricin...,[The GPT-4 barrier was comprehensively broken\...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Advancements in GPT-4o have significantly impa...,Advancements in GPT-4o have significantly impa...,multi_hop_specific_query_synthesizer
9,What are some of the major advancements and ch...,[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2024, some major advancements in large lang...","In 2024, significant advancements in Large Lan...",multi_hop_specific_query_synthesizer


In [114]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

### Evaluate the fine tune RAG chain

In [115]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=1440)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.6041, 'faithfulness': 0.8392, 'factual_correctness': 0.4667, 'answer_relevancy': 0.9604, 'context_entity_recall': 0.3552, 'noise_sensitivity_relevant': 0.3281}