# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?


##### Answer:

Q&D pairs is all about training the model to align user queries with relevant chunks, while inter-document pairs/related sentences is all about training the model to capture semantic relationships between chunks. Thus, the use-case for Q&D pairs training is RAG (which is what we've been focusing on this bootcamp) while a use case for inter-document pairs/related sentences is about clustering (e.g. think about clustering legal documents/chunks by topic. This is a good example because legal has so much jargon that fine-tuning the embedding model will probably lead to much better results and that for legal, you would probably really want to have all your chunks clustered by topics (although I'm not a lawyer but this seems like something of high value)).

For Q&D pairs, the training data is positive pairs (query, relevant chunk) and negative pairs (query, irrelavant chunk), although as we see in this notebook, you don't have to stress too much about the negative pairs because once we have all our positive pairs, for a specific query, all the OTHER chunks (not part of its positive pair) form that query's negative pairs. For inter-document pairs/related sentences, the training data is (anchor chunk, similar chunk, dissimilar chunk) triplets, where "anchor" here just means our baseline chunk (as in, we start with a chunk, then we need another chunk that is similar to it and another chunk that is similar to it and now we form a triplet).

For Q&D pairs, a caveat is that synthetic question generation will introduce bias (and give us worse performance) if the questions don't reflect real user intent (e.g. realistic questions we would get from all our users). For inter-document pairs/related sentences, a caveat is that it's not as good for query-document retrieval (and hence RAG).

For Q&D pairs, some special considerations for what kind of Q's we should use are:

(a) question quality (questions should reflect real user queries in both phrasing AND intent)

(b) domain relevant (use the domain-specific terminalogy/jargon)

(c) diversity (use both factual questions and conceptual questions)


## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU "langchain_openai>=0.3.4" "langchain_huggingface" "langchain_core>=0.3.34" "langchain>=0.3.18" "langchain_community>=0.3.17" "langchain-text-splitters>=0.3.6" "datasets>=3.2.0"

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/62.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.8/62.8 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/437.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m437.6/437.6 kB[0m [31m33.2 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.0 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m62.6 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m107.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.8/472.8 kB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m81.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m117.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m169.4/169.4 kB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
[?25h

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31554    0 31554    0     0  36332      0 --:--:-- --:--:-- --:--:-- 36310


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70721    0 70721    0     0  56100      0 --:--:--  0:00:01 --:--:-- 56127


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [15]:
"""
Generate a random UUID (identifier) for each chunk. Checks if the identifier is already
in the set (unlikely but possible), and if it is, it generates a new one. Then add the
unique ID to the chunk's metadata.
"""

import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [16]:
"""
training gets all chunks except for the last 24, validation gets the next 12 chunks
(used for validation during training to monitor performance and prevent overfitting),
test set gets the last 12 chunks (reserved for final evaluation after training is
complete)
"""

training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:102-12]
test_split_documents = training_documents[102-12:]

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4.1-mini`

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [17]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [18]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [19]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [20]:
import tqdm
import asyncio

"""
Sample Usage of TQDM:

for i in tqdm.tqdm(range(10)):
  time.sleep(1)
"""

async def create_questions(documents, n_questions):
    # Initialize dictionaries for the 2 return objects
    questions = {}
    relevant_docs = {}
    # The point of tqdm is to show a progress bar because this is a long process
    for doc in tqdm.tqdm(documents, desc="Generating questions"):
        # Prepare the input for the chain
        input_context = doc.page_content
        doc_id = doc.metadata["id"]

        # Call the question generation chain
        response = await question_generation_chain.ainvoke({"context": input_context, "n_questions": n_questions})

        # Split the response into a list of questions via newline bc the response is a
        # string with each question on a new line because of how we formed the prompt
        generated_questions = response.content.split("\n")
        # Strip whitespace and empty strings
        generated_questions = [q.strip() for q in generated_questions if q.strip()]

        # Some outputs might be numbered like "1. What is ...?", so clean numbering
        cleaned_questions = []
        for q in generated_questions:
            # In case question starts with a number and a period
            if q[0].isdigit() and q[1] == '.':
                cleaned_questions.append(q[2:].strip())
            # In case question starts with a number and a space
            elif q[0].isdigit() and q[1] == ' ':
                cleaned_questions.append(q[1:].strip())
            else:
                cleaned_questions.append(q)

        # Now save each question
        for q in cleaned_questions:
            # Generate a random UUID for the question
            question_id = str(uuid.uuid4())
            # Add the question to the questions dictionary
            questions[question_id] = q
            # Add the document ID to the relevant_docs dictionary
            relevant_docs[question_id] = [doc_id]

    return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [21]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Generating questions: 100%|██████████| 78/78 [02:11<00:00,  1.68s/it]


We'll use the function to generate training, validation, and test data.

In [22]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Generating questions: 100%|██████████| 12/12 [00:16<00:00,  1.40s/it]


In [23]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Generating questions: 100%|██████████| 12/12 [00:19<00:00,  1.63s/it]


### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [24]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

# Save the training dataset to a JSONL file in JSON format
with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [25]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [26]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [26]:
!pip install -qU sentence_transformers pyarrow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m345.7/345.7 kB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 MB[0m [31m56.6 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.
pylibcudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.[0m[31m
[0m

In [27]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/85.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [28]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [29]:
# Batch size is the number of training examples processed together in one
# forward/backward pass during model training.

BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [30]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

# Create the list of positive pairs in the format expected by the model
examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [31]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [32]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

##### Answer:

##### MultipleNegativesRankingLoss:

MultipleNegativesRankingLoss optimizes embedding models by training them to distinguish positive pairs from in-batch negatives. For each anchor (i.e. query), it forces the model to assign the highest similarity score to its corresponding positive example (i.e. associated chunk) compared to all other candidates in the batch.

Thus, since we are using batch size of 10, that means we have 9 negatives for each query (and 1 positive).

Key features of MultipleNegativesRankingLoss:

(a) uses cosine similarity (or dot product) scaled by a temperature parameter

(b) leverages automatic in-batch negative sampling

(c) ideal for retrieval tasks like question-answer matching


##### MatryoshkaLoss:

MatryoshkaLoss enables training embeddings that remain effective when truncated to smaller dimensions. It modifies existing losses like MultipleNegativesRankingLoss by

(a) applying the base loss at multiple predefined dimensions simultaneously

(b) randomly sampling subsets of dimensions during training for efficiency

(c) weighting losses across dimensions (default: equal weights)



High-level way of thinking how MultipleNegativesRankingLoss and MatryoshkaLoss work together: First, just think of MultipleNegativesRankingLoss as our loss function to train our embedding model (for each query, it has one positive chunk and batch_size - 1 negative chunks). Then just think of MatryoshkaLoss as something special that allows us to do our MultipleNegativesRankingLoss training in a special way such that we actually do our training at different embedding dimension lengths. What I mean is that we end up having a MultipleNegativesRankingLoss-trained embedding model for different embedding dimensions (e.g. one embedding model which has 768 dimensions, one embedding model which has 512 dimensions, and so on). This is useful when we are dealing with a crazy amount of data, then for our business, it's worth using smaller sized embedding models for cost purposes (it may be slightly worse but worth it when you consider how much money we save)


Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [33]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# We are setting up our evaluator to evaluate our model on the validation set
# (not the training set)
corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if we had a significant amount more data.

In [34]:
"""
An epoch is one complete pass through the entire training dataset. During each epoch,
every training example (query-chunk pair) is processed once. Setting EPOCHS = 10 means
the embedding model will see and learn from the complete set of training examples 10
times.
"""

EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [35]:
import wandb
wandb.init(mode="disabled") # prevent any data logging to wandb servers

> NOTE: You may not see direct improvement during the training cycles - this is absolutely expected. We will verify performance later in the notebook.

In [36]:
"""
In PyTorch, the len(loader) is the number of batches in one complete epoch.
Thus, total training steps = number of batches * number of epochs.
Thus, warmup_steps = 10% of total training steps.
Warmup steps is a training technique that helps stabilize early training when
gradients might be erratic.
"""

warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.963789,0.951389,0.951389
32,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
48,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
50,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
64,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
80,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
96,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
100,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
112,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
128,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167


In [37]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [38]:
hf_username = "kamkol"

In [39]:
"""
We push my fine-tuned model to my HuggingFace account. Note that 'legal' is in the
name probably because in a previous cohort, a bunch of legal documents were used
instead of the course corpus (not sure but probably).
"""

import uuid

model.push_to_hub(f"{hf_username}/legal-ft-{uuid.uuid4()}")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/kamkol/legal-ft-0f6bea46-12bf-4ffc-9906-4c67500cf104/commit/3b8eaadd0dbd699d2fbff2ef326a5465449c5384'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [28]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [29]:
"""
This function evaluates the performance of a given embedding model by comparing the
retrieved chunk IDs with the expected chunk ID for each question in the dataset that
we are evaluating on. Note that if the expected chunk ID is in the list of retrieved
chunk IDs, then we have a "hit", otherwise we don't (True or False, which we'll
later score as 1 or 0). Default is retrieving top 5 chunks
"""

def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [30]:
# Note that we use the test dataset to compare the different embedding models (which is the correct way of doing things to get a fair comparison)

te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:13<00:00,  1.74it/s]


In [31]:
te3_results_df = pd.DataFrame(te3_results)

In [32]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(1.0)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [33]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/85.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

100%|██████████| 24/24 [00:00<00:00, 49.43it/s]


In [34]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [35]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.8333333333333334)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [40]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="kamkol/legal-ft-0f6bea46-12bf-4ffc-9906-4c67500cf104")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/281 [00:00<?, ?B/s]






README.md:   0%|          | 0.00/29.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/584 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at kamkol/legal-ft-0f6bea46-12bf-4ffc-9906-4c67500cf104 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

100%|██████████| 24/24 [00:00<00:00, 49.26it/s]


In [41]:
finetune_results_df = pd.DataFrame(finetune_results)

In [42]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.9166666666666666)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [43]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [44]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [45]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [46]:
rag_llm =  ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [47]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [48]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'Based on the provided context, an "agent" in the context of AI refers to systems that are often described as capable of acting on your behalf, such as travel agents or digital assistants. However, the term is highly vague and lacks a clear, universally accepted definition. The discussions highlight that many claims about AI agents are often ambiguous, and their actual utility is questioned due to issues like gullibility and the difficulty in distinguishing truth from fiction. Therefore, an agent can be broadly understood as an AI system purported to perform tasks or make decisions on behalf of a user, but the precise meaning varies and remains somewhat unclear.'

In [49]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [50]:
base_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context does not specify a particular time of year that is considered the "laziest" for AI.'

In [51]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context does not specify the name "Simon" or details about the largest model he has run on his phone. Therefore, I do not know the answer.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [52]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [53]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [54]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'Based on the provided context, an "agent" is a term that lacks a clear, universally accepted definition. It generally refers to systems that act on your behalf, such as travel agents or AI systems that can perform tasks or make decisions. However, the term is often used vaguely, and there is skepticism about their current utility due to issues like gullibility and the difficulty in distinguishing truth from fiction. Overall, an agent can be thought of as an AI system designed to act or make decisions on behalf of a user, but the precise meaning varies and remains somewhat ambiguous.'

In [55]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [56]:
finetune_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context suggests that AI models, such as ChatGPT, may become less useful or "lazy" during certain times, specifically around the holidays in December. Therefore, the laziest time of the year for AI appears to be December.'

In [57]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is Mistral 7B.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

#### Answer:

Doing the vibe check, it's clear that the 2nd LCEL RAG chain (the one with our fine-tuned embedding model) did better (vs the 1st LCEL RAG chain, which is using the base snowflake-arctic-embed-l embedding model). Here's a breakdown of why:

Question 1: roughly even (vibe-check level)

Question 2: roughly even (vibe-check level)

Question 3: 1st LCEL RAG chain wasn't able to find any relevent context from our chunks, whereas the 2nd LCEL RAG chain was actually able to find relevent context. Thus, the 2nd LCEL RAG chain did a better job answering this question (giving the answer December and explaining what in the context made it answer December)

Question 4: 1st LCEL RAG chain wasn't able to find any relevent context from our chunks, whereas the 2nd LCEL RAG chain was actually able to find relevent context. Thus, the 2nd LCEL RAG chain did a better job answering this question (it answered that the largest model that Simon has ran on his phone is Mistral 7B, which it got from the context retrieved).

Thus, looking at my vibe check above, the 2nd LCEL RAG chain (with our fine-tuned embedding model) answered the questions better because 2 of the questions they were tied and 2 of the questions the 2nd LCEL RAG chain clearly did better so overall, 2nd LCEL RAG chain answered the questions better




## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [61]:
!pip install -qU ragas==0.2.10


[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m20.2 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/71.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [62]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m63.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.5/43.5 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m95.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m88.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.2/137.2 kB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 kB[0m [31m4.0 MB/s[0m eta [36m0

In [63]:
# Got rate limit errors 1st time so Chris recommended in office hours to use 4.1-nano

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [65]:
docs = text_loader.load()

In [69]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)



Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [70]:
dataset.to_pandas()


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Google is like what in the big picture of AI a...,[Code may be the best application The ethics o...,Google is listed among organizations that have...,single_hop_specifc_query_synthesizer
1,As an AI Researcher and Responsible AI Advocat...,[Based Development As a computer scientist and...,"Based on the context, LLMs are infuriating bec...",single_hop_specifc_query_synthesizer
2,What is AI and why is it important in 2023 acc...,[Stuff we figured out about AI in 2023 Simon W...,AI refers to the latest and most interesting d...,single_hop_specifc_query_synthesizer
3,Is it okay to train models on peoples content ...,[easy to follow. The rest of the document incl...,The context discusses ethical questions about ...,single_hop_specifc_query_synthesizer
4,How do the evaluation difficulties and benchma...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, the evaluation difficulties and bench...",multi_hop_abstract_query_synthesizer
5,How can we make LLMs more accessable and run t...,[<1-hop>\n\nCode may be the best application T...,"The context shows that in 2023, it became incr...",multi_hop_abstract_query_synthesizer
6,Considering the capabilities and limitations o...,[<1-hop>\n\nBased Development As a computer sc...,"The gullibility of AI agents, which causes the...",multi_hop_abstract_query_synthesizer
7,"How do the 2024 developments in AI, such as th...",[<1-hop>\n\nThings we learned about LLMs in 20...,"The 2024 developments in AI, highlighted by th...",multi_hop_abstract_query_synthesizer
8,Wha is Claude?,[<1-hop>\n\nThings we learned about LLMs in 20...,"Based on the context, Claude refers to a serie...",multi_hop_specific_query_synthesizer
9,How do the recent advances in Large Language M...,[<1-hop>\n\nThings we learned about LLMs in 20...,"The advances in 2024, including models with lo...",multi_hop_specific_query_synthesizer


In [71]:
import time

for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(10) # To try to avoid rate limiting.


In [72]:
dataset.to_pandas()


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Google is like what in the big picture of AI a...,"[Prompt injection explained, with video, slide...",[Code may be the best application The ethics o...,I do not know.,Google is listed among organizations that have...,single_hop_specifc_query_synthesizer
1,As an AI Researcher and Responsible AI Advocat...,[The legal arguments here are complex. I’m not...,[Based Development As a computer scientist and...,As an AI Researcher and Responsible AI Advocat...,"Based on the context, LLMs are infuriating bec...",single_hop_specifc_query_synthesizer
2,What is AI and why is it important in 2023 acc...,[Things we learned about LLMs in 2024\n\n\n\n\...,[Stuff we figured out about AI in 2023 Simon W...,According to Simon Willison’s Weblog and the s...,AI refers to the latest and most interesting d...,single_hop_specifc_query_synthesizer
3,Is it okay to train models on peoples content ...,[The legal arguments here are complex. I’m not...,[easy to follow. The rest of the document incl...,I do not know.,The context discusses ethical questions about ...,single_hop_specifc_query_synthesizer
4,How do the evaluation difficulties and benchma...,[Even the openly licensed ones are still the w...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, the criticism and assessment of LLMs ...","In 2024, the evaluation difficulties and bench...",multi_hop_abstract_query_synthesizer
5,How can we make LLMs more accessable and run t...,[Even the openly licensed ones are still the w...,[<1-hop>\n\nCode may be the best application T...,Making LLMs more accessible and enabling them ...,"The context shows that in 2023, it became incr...",multi_hop_abstract_query_synthesizer
6,Considering the capabilities and limitations o...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nBased Development As a computer sc...,The gullibility of AI agents significantly imp...,"The gullibility of AI agents, which causes the...",multi_hop_abstract_query_synthesizer
7,"How do the 2024 developments in AI, such as th...",[The legal arguments here are complex. I’m not...,[<1-hop>\n\nThings we learned about LLMs in 20...,"The 2024 developments, including the release o...","The 2024 developments in AI, highlighted by th...",multi_hop_abstract_query_synthesizer
8,Wha is Claude?,"[Prompt injection explained, with video, slide...",[<1-hop>\n\nThings we learned about LLMs in 20...,Claude is an AI language model developed by An...,"Based on the context, Claude refers to a serie...",multi_hop_specific_query_synthesizer
9,How do the recent advances in Large Language M...,[Large Language Models\nThey’re actually quite...,[<1-hop>\n\nThings we learned about LLMs in 20...,"The recent advances in 2024, such as longer co...","The advances in 2024, including models with lo...",multi_hop_specific_query_synthesizer


In [73]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())


In [74]:
# Got rate limit errors 1st time so Chris recommended in office hours to use 4.1-nano

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))


In [75]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=600)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[61]: AttributeError('StringIO' object has no attribute 'sentences')
ERROR:ragas.executor:Exception raised in Job[65]: AttributeError('StringIO' object has no attribute 'sentences')


{'context_recall': 0.8333, 'faithfulness': 0.8710, 'factual_correctness': 0.5900, 'answer_relevancy': 0.6767, 'context_entity_recall': 0.2846, 'noise_sensitivity_relevant': 0.1617}

In [76]:
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(10) # To try to avoid rate limiting.

In [77]:
dataset.to_pandas()


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Google is like what in the big picture of AI a...,[google\n 360\n\n\n ai\n...,[Code may be the best application The ethics o...,The provided context does not explicitly descr...,Google is listed among organizations that have...,single_hop_specifc_query_synthesizer
1,As an AI Researcher and Responsible AI Advocat...,[The two main categories I see are people who ...,[Based Development As a computer scientist and...,The term LLMs (Large Language Models) relates ...,"Based on the context, LLMs are infuriating bec...",single_hop_specifc_query_synthesizer
2,What is AI and why is it important in 2023 acc...,[Stuff we figured out about AI in 2023\n\n\n\n...,[Stuff we figured out about AI in 2023 Simon W...,"According to Simon Willison’s Weblog in ""Stuff...",AI refers to the latest and most interesting d...,single_hop_specifc_query_synthesizer
3,Is it okay to train models on peoples content ...,[The legal arguments here are complex. I’m not...,[easy to follow. The rest of the document incl...,The provided context does not explicitly addre...,The context discusses ethical questions about ...,single_hop_specifc_query_synthesizer
4,How do the evaluation difficulties and benchma...,[LLMs somehow got even harder to use\nA drum I...,[<1-hop>\n\nThings we learned about LLMs in 20...,The evaluation difficulties and benchmarking c...,"In 2024, the evaluation difficulties and bench...",multi_hop_abstract_query_synthesizer
5,How can we make LLMs more accessable and run t...,"[This unleashed a whirlwind of innovation, whi...",[<1-hop>\n\nCode may be the best application T...,Making LLMs more accessible and enabling them ...,"The context shows that in 2023, it became incr...",multi_hop_abstract_query_synthesizer
6,Considering the capabilities and limitations o...,[I think this is because of gullibility.\nCan ...,[<1-hop>\n\nBased Development As a computer sc...,The gullibility of AI agents significantly imp...,"The gullibility of AI agents, which causes the...",multi_hop_abstract_query_synthesizer
7,"How do the 2024 developments in AI, such as th...",[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nThings we learned about LLMs in 20...,"The 2024 developments in AI, including the rel...","The 2024 developments in AI, highlighted by th...",multi_hop_abstract_query_synthesizer
8,Wha is Claude?,[Evals really matter\nAnthropic’s Amanda Askel...,[<1-hop>\n\nThings we learned about LLMs in 20...,Claude is a language model developed by Anthro...,"Based on the context, Claude refers to a serie...",multi_hop_specific_query_synthesizer
9,How do the recent advances in Large Language M...,[Things we learned about LLMs in 2024\n\n\n\n\...,[<1-hop>\n\nThings we learned about LLMs in 20...,"The recent advances in 2024, such as increased...","The advances in 2024, including models with lo...",multi_hop_specific_query_synthesizer


In [78]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())


In [79]:
custom_run_config = RunConfig(timeout=600)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[25]: AttributeError('StringIO' object has no attribute 'sentences')
ERROR:ragas.executor:Exception raised in Job[29]: AttributeError('StringIO' object has no attribute 'sentences')
ERROR:ragas.executor:Exception raised in Job[55]: AttributeError('StringIO' object has no attribute 'sentences')


{'context_recall': 0.9583, 'faithfulness': 0.8943, 'factual_correctness': 0.7125, 'answer_relevancy': 0.5855, 'context_entity_recall': 0.4354, 'noise_sensitivity_relevant': 0.2219}

**Analysis**:

First, let's copy paste the actual scores:

base_rag_chain:

{'context_recall': 0.8333, 'faithfulness': 0.8710, 'factual_correctness': 0.5900, 'answer_relevancy': 0.6767, 'context_entity_recall': 0.2846, 'noise_sensitivity_relevant': 0.1617}

finetune_rag_chain:

{'context_recall': 0.9583, 'faithfulness': 0.8943, 'factual_correctness': 0.7125, 'answer_relevancy': 0.5855, 'context_entity_recall': 0.4354, 'noise_sensitivity_relevant': 0.2219}

Looking at the results for each metric:

context_recall: our fine-tuned embedding rag was much better

faithfulness: our fine-tuned embedding rag was better, but just barely (maybe could be noise). We can say even because probably within margin of error

factual_correctness: our fine-tuned embedding rag was much better

answer_relevancy: our fine-tuned embedding rag was actually worse by a decent amount (~9% absolute)

context_entity_recall: our fine-tuned embedding rag was much better

noise_sensitivity_relevant: our fine-tuned embedding rag was better

Thus, it looks like our fine-tuned embedding rag was better in 4 of the 6 metrics, worse in 1 of the 6 metrics, and about even in 1 of the 6 metrics.

Thus, looking at this overvall, it's fair (and objective!) to say that our fine-tuned embedding rag was better on our Ragas evaluation overall