<a href="https://colab.research.google.com/github/philocifer/AIE5/blob/main/09_Finetuning_Embeddings/Fine_tuning_Embedding_Models_for_RAG_Solution_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll begin with the lowest-hanging fruit for improving RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
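
To make "closer together" concrete, here is a minimal sketch using made-up 3-dimensional vectors and cosine similarity. The vectors and the "nudge" are assumptions for illustration only; real fine-tuning updates the model weights so that both embeddings shift, rather than moving a vector directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
question = [1.0, 0.0, 0.0]          # "Who drives the bus?"
document_before = [0.2, 0.9, 0.4]   # document embedding from the base model

# Pretend training nudged the document halfway toward the question.
document_after = [d + 0.5 * (q - d) for q, d in zip(question, document_before)]

similarity_before = cosine_similarity(question, document_before)
similarity_after = cosine_similarity(question, document_after)
print(f"before: {similarity_before:.3f}  after: {similarity_after:.3f}")
```

A retriever ranks documents by this kind of similarity score, so raising it for the correct (question, document) pair is what moves the right chunk up the ranking.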

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
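
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking with overlap. The function `chunk_with_overlap` is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraph breaks, newlines, and spaces before falling back to raw characters.

```python
def chunk_with_overlap(text, chunk_size=750, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Placeholder text; any 2000-character string behaves the same way.
sample_text = "x" * 2000
sample_chunks = chunk_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 20 characters of one chunk are repeated at the start of the next.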

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = str(uuid.uuid4())  # keep IDs as strings so the set membership check stays consistent
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [13]:
# Derive the split boundaries from the actual length instead of hard-coding 102
n_docs = len(training_documents)
training_split_documents = training_documents[:n_docs - 24]
val_split_documents = training_documents[n_docs - 24:n_docs - 12]
test_split_documents = training_documents[n_docs - 12:]
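
The same slicing can be wrapped in a small helper so the split sizes are explicit. This is a sketch; `split_train_val_test` and its parameter names are hypothetical, and a list of integers stands in for the 102 chunks.

```python
def split_train_val_test(items, n_val=12, n_test=12):
    """Slice a list into train/val/test, taking val and test from the end."""
    n_items = len(items)
    if n_val + n_test >= n_items:
        raise ValueError("not enough items to leave a non-empty training split")
    train_end = n_items - n_val - n_test
    val_end = n_items - n_test
    return items[:train_end], items[train_end:val_end], items[val_end:]

# Stand-in for the 102 chunks produced above.
fake_chunks = list(range(102))
train_part, val_part, test_part = split_train_val_test(fake_chunks)
print(len(train_part), len(val_part), len(test_part))
```

Note that this split is sequential rather than shuffled, and adjacent chunks share up to 20 characters of overlap, so the chunks on either side of a split boundary are not fully independent.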

## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (see the [announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are question `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are `List[str]` lists of the associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [17]:
import asyncio
from tqdm import tqdm

# Return a UUID string that has not been used yet.
# The value is always cast to str so it compares correctly against the strings in id_set.
def get_uuid():
    id = str(uuid.uuid4())
    while id in id_set:
        id = str(uuid.uuid4())  # regenerate (as a string) on the rare collision
    id_set.add(id)
    return id

async def process_document(document, n_questions):
    """Process a single document to generate questions and relevant context mappings.

    Args:
        document: Langchain Document object with page_content and metadata
        n_questions: Number of questions to generate per document

    Returns:
        Tuple of (questions dict, relevant_docs dict) for this document"""

    doc_questions = {}
    doc_relevant_docs = {}

    # Generate questions using LLM chain
    questions_generated = await question_generation_chain.ainvoke({
        "context": document.page_content,
        "n_questions": n_questions
    })

    # Process each generated question line
    for line in questions_generated.content.split("\n"):
        line = line.strip()
        # Skip blank lines and lines without "N." numbering
        if not line or "." not in line:
            continue

        # Create unique ID for question
        question_id = get_uuid()

        # Drop the leading "N." numbering but keep any periods inside the question
        doc_questions[question_id] = line.split(".", 1)[1].strip()

        # Link question to document's UUID
        doc_relevant_docs[question_id] = [document.metadata["id"]]

    return doc_questions, doc_relevant_docs

async def create_questions(documents, n_questions):
    """Orchestrate parallel processing of documents to generate questions.

    Args:
        documents: List of Langchain Document objects
        n_questions: Number of questions per document

    Returns:
        Tuple of aggregated (questions dict, relevant_docs dict)"""

    questions = {}
    relevant_docs = {}

    # Create async tasks for all documents
    tasks = [process_document(doc, n_questions) for doc in documents]

    # Process tasks with progress bar
    for task in tqdm(asyncio.as_completed(tasks),
                    total=len(documents),
                    desc="Processing Documents"):
        doc_questions, doc_relevant_docs = await task

        # Aggregate results from all documents
        questions.update(doc_questions)
        relevant_docs.update(doc_relevant_docs)

    return questions, relevant_docs
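
To see the parsing step in isolation, without calling the LLM, here is a small sketch that turns a numbered list into the two dictionaries described in the activity. The function `parse_numbered_questions`, the sample text, and the ID `doc-123` are all hypothetical.

```python
import uuid

def parse_numbered_questions(llm_text, document_id):
    """Turn '1. ...' / '2. ...' lines into (questions, relevant_docs) dicts."""
    questions = {}
    relevant_docs = {}
    for line in llm_text.split("\n"):
        line = line.strip()
        if not line or "." not in line:
            continue  # skip blank lines and anything without 'N.' numbering
        numbering, question_text = line.split(".", 1)
        if not numbering.strip().isdigit():
            continue  # skip preamble lines that are not numbered items
        question_id = str(uuid.uuid4())
        questions[question_id] = question_text.strip()
        relevant_docs[question_id] = [document_id]
    return questions, relevant_docs

# Hand-written stand-in for a model response, including a blank line.
fake_llm_output = "1. What is an agent?\n\n2. Why is Llama 3.2 notable?"
parsed_questions, parsed_relevant_docs = parse_numbered_questions(fake_llm_output, "doc-123")
for question_id, question_text in parsed_questions.items():
    print(question_text, "->", parsed_relevant_docs[question_id])
```

Splitting on the first period only keeps text such as "3.2" intact, and skipping blank lines avoids storing empty questions.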

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [18]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Processing Documents: 100%|██████████| 78/78 [00:12<00:00,  6.45it/s]


We'll use the function to generate training, validation, and test data.

In [19]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing Documents: 100%|██████████| 12/12 [00:05<00:00,  2.34it/s]


In [20]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing Documents: 100%|██████████| 12/12 [00:01<00:00,  6.90it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [21]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [22]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [23]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
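
Before training, it can be worth reloading a saved file and checking that every question points at a context that exists in the corpus. The sketch below does this on a tiny hand-written dataset; `check_dataset` is a hypothetical helper. Although the files above use a `.jsonl` extension, `json.dump` writes a single JSON object, so they are read back with `json.load` rather than line by line.

```python
import json
import os
import tempfile

def check_dataset(dataset):
    """Return a list of problems found in a questions/relevant_contexts/corpus dict."""
    problems = []
    for question_id in dataset["questions"]:
        if question_id not in dataset["relevant_contexts"]:
            problems.append(f"question {question_id} has no relevant context")
    for question_id, context_ids in dataset["relevant_contexts"].items():
        for context_id in context_ids:
            if context_id not in dataset["corpus"]:
                problems.append(f"context {context_id} missing from corpus")
    return problems

# Tiny dataset with the same three keys as the files saved above.
toy_dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

# Round-trip through a temporary file to mimic saving and reloading.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(path, "w") as f:
        json.dump(toy_dataset, f)
    with open(path) as f:
        reloaded = json.load(f)

toy_problems = check_dataset(reloaded)
print(toy_problems)
```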

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [24]:
!pip install -qU sentence_transformers datasets pyarrow

In [25]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [26]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [27]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [28]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [29]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [30]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

#### ✅ Answer:
#### 1. MultipleNegativesRankingLoss (Base Loss)
**What it does**:  
Teaches the model to recognize good matches between questions and answers/documents.

**How it works**:
- Shows the model many (Question, Correct Answer) pairs
- For each pair, treats all other answers in the batch as "wrong answers"
- Uses a grading system (cross-entropy) to:
  - Reward the model when question & correct answer are similar
  - Penalize the model when question matches wrong answers

**Why it matters**:  
Makes related items (like good Q&A pairs) cluster close together in the embedding space.

#### 2. MatryoshkaLoss (Wrapper Loss)
**What it does**:  
Like Russian nesting dolls - trains the model to work at multiple detail levels simultaneously.

**How it works**:
- Teaches the model to make good matches using:
  - The largest trained size (768 dimensions; the model's native output is 1024-dimensional)
  - Progressively simpler versions (512, 256, 128, 64 dimensions)
- Combines performance scores from all detail levels
- Forces the model to preserve important patterns at every scale

**Why it matters**:
- Lets you choose embedding size later based on needs:
  - Small (64-dim) for fast operations
  - Larger (768-dim) for higher accuracy
- Makes the model more efficient without retraining
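
The two summaries above can be sketched in plain Python. This is a simplified illustration on made-up 4-dimensional vectors, not the `sentence-transformers` implementation: it assumes cosine similarity scaled by 20 for the inner loss and an unweighted sum over the truncated sizes, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(question_vecs, document_vecs, scale=20.0):
    """Mean cross-entropy where document i is the only correct match for question i."""
    losses = []
    for i, question in enumerate(question_vecs):
        # Scaled similarity of this question to every document in the batch.
        logits = [scale * cosine_similarity(question, doc) for doc in document_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        # Cross-entropy with the matching document as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

def matryoshka_loss(question_vecs, document_vecs, dims):
    """Sum the inner loss over embeddings truncated to each size in dims."""
    total = 0.0
    for dim in dims:
        truncated_questions = [vec[:dim] for vec in question_vecs]
        truncated_documents = [vec[:dim] for vec in document_vecs]
        total += multiple_negatives_ranking_loss(truncated_questions, truncated_documents)
    return total

# Hypothetical embeddings for a batch of three (question, document) pairs.
toy_questions = [[1.0, 0.1, 0.0, 0.2], [0.0, 1.0, 0.1, 0.0], [0.1, 0.0, 1.0, 0.3]]
toy_documents = [[0.9, 0.0, 0.1, 0.1], [0.1, 0.8, 0.0, 0.2], [0.0, 0.2, 0.9, 0.1]]

aligned_loss = matryoshka_loss(toy_questions, toy_documents, dims=[4, 2])
# Rotating the documents pairs every question with the wrong document.
rotated_documents = toy_documents[1:] + toy_documents[:1]
misaligned_loss = matryoshka_loss(toy_questions, rotated_documents, dims=[4, 2])
print(f"aligned: {aligned_loss:.3f}  misaligned: {misaligned_loss:.3f}")
```

Because every other document in the batch acts as a negative, larger batches give each question more "wrong answers" to be pushed away from, which is one reason larger batch sizes are typical for this loss.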

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [31]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if you had significantly more data.

In [32]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [33]:
import wandb
wandb.init(mode="disabled")

In [34]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)
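
As a worked example of the warm-up arithmetic, the sketch below assumes 156 training pairs (78 chunks with 2 questions each), the batch size of 10, and 10 epochs. It also assumes a linear warm-up followed by linear decay; `linear_warmup_multiplier` is a hypothetical helper that mimics that schedule, not a library function.

```python
import math

def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed sizes, taken from the split and settings used in this notebook.
n_examples = 78 * 2
batch_size = 10
epochs = 10

steps_per_epoch = math.ceil(n_examples / batch_size)  # number of batches per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)
print(steps_per_epoch, total_steps, warmup_steps)

# Multiplier applied to the base learning rate at a few points in training.
for step in (0, 8, 16, 88, 160):
    print(step, round(linear_warmup_multiplier(step, warmup_steps, total_steps), 3))
```

Under these assumptions the learning rate ramps up over the first 16 steps, which is the first epoch, and then decays for the remainder of training.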

| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine NDCG@10 | Cosine MRR@10 | Cosine MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.927577 | 0.902778 | 0.902778 |
| 32 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.933033 | 0.909722 | 0.909722 |
| 48 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.930144 | 0.90625 | 0.90625 |
| 50 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.930144 | 0.90625 | 0.90625 |
| 64 | No log | No log | 0.875 | 0.916667 | 1.0 | 1.0 | 0.875 | 0.305556 | 0.2 | 0.1 | 0.875 | 0.916667 | 1.0 | 1.0 | 0.937178 | 0.916667 | 0.916667 |
| 80 | No log | No log | 0.875 | 0.958333 | 1.0 | 1.0 | 0.875 | 0.319444 | 0.2 | 0.1 | 0.875 | 0.958333 | 1.0 | 1.0 | 0.940067 | 0.920139 | 0.920139 |
| 96 | No log | No log | 0.875 | 0.958333 | 1.0 | 1.0 | 0.875 | 0.319444 | 0.2 | 0.1 | 0.875 | 0.958333 | 1.0 | 1.0 | 0.940067 | 0.920139 | 0.920139 |
| 100 | No log | No log | 0.875 | 0.958333 | 1.0 | 1.0 | 0.875 | 0.319444 | 0.2 | 0.1 | 0.875 | 0.958333 | 1.0 | 1.0 | 0.940067 | 0.920139 | 0.920139 |
| 112 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.942955 | 0.923611 | 0.923611 |
| 128 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.948411 | 0.930556 | 0.930556 |
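
To help read the evaluator's metrics, here is a minimal sketch of Accuracy@k, Precision@k, and reciprocal rank for the case where each query has exactly one relevant document, as in our dataset. The rankings are made up, and the helper names are hypothetical rather than the evaluator's own functions.

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def precision_at_k(ranked_ids, relevant_id, k):
    """Fraction of the top k results that are relevant (at most 1/k here)."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id == relevant_id) / k

def reciprocal_rank_at_k(ranked_ids, relevant_id, k):
    """1/rank of the relevant document if it is in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical rankings for three queries, each with one relevant document.
toy_rankings = [
    (["a", "b", "c"], "a"),  # relevant document ranked 1st
    (["d", "e", "f"], "e"),  # relevant document ranked 2nd
    (["g", "h", "i"], "i"),  # relevant document ranked 3rd
]

toy_accuracy_at_1 = mean(accuracy_at_k(r, rel, 1) for r, rel in toy_rankings)
toy_accuracy_at_3 = mean(accuracy_at_k(r, rel, 3) for r, rel in toy_rankings)
toy_precision_at_3 = mean(precision_at_k(r, rel, 3) for r, rel in toy_rankings)
toy_mrr_at_3 = mean(reciprocal_rank_at_k(r, rel, 3) for r, rel in toy_rankings)
print(toy_accuracy_at_1, toy_accuracy_at_3, toy_precision_at_3, toy_mrr_at_3)
```

With a single relevant document per query, Precision@k can never exceed 1/k, which is why the training log shows Precision@3 topping out at 0.333 and Precision@10 at 0.1 even when Accuracy at the same k is 1.0.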


In [38]:
from huggingface_hub import notebook_login

notebook_login()


In [39]:
hf_username = "philocifer"

In [40]:
model.push_to_hub(f"{hf_username}/legal-ft-2", exist_ok=True)


'https://huggingface.co/philocifer/legal-ft-2/commit/c4fb6915ab799a110aac8ddf55e02699219ee521'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [41]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [42]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
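
The hit-rate logic can be tried without any embedding model. The sketch below swaps in a toy keyword-overlap retriever (an assumption for illustration, not what FAISS does) and a hand-written dataset with the same three keys; `keyword_retrieve` and `hit_rate` are hypothetical names.

```python
def keyword_retrieve(question, corpus, top_k):
    """Toy retriever: rank documents by how many question words they contain."""
    question_words = set(question.lower().replace("?", "").split())
    scored = []
    for doc_id, text in corpus.items():
        doc_words = set(text.lower().replace(".", "").split())
        scored.append((len(question_words & doc_words), doc_id))
    # Highest overlap first; ties keep corpus order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

def hit_rate(dataset, top_k):
    """Share of questions whose expected document is among the top_k results."""
    hits = []
    for question_id, question in dataset["questions"].items():
        retrieved_ids = keyword_retrieve(question, dataset["corpus"], top_k)
        expected_id = dataset["relevant_contexts"][question_id][0]
        hits.append(expected_id in retrieved_ids)
    return sum(hits) / len(hits)

# Hand-written dataset for illustration.
toy_eval_dataset = {
    "corpus": {
        "d1": "The bus was driven by Kyle.",
        "d2": "Embeddings map text to vectors.",
        "d3": "FAISS stores vectors for similarity search.",
    },
    "questions": {
        "q1": "Who drives the bus?",
        "q2": "What stores vectors?",
        "q3": "Which vectors does similarity search use?",
    },
    "relevant_contexts": {"q1": ["d1"], "q2": ["d3"], "q3": ["d2"]},
}

hit_rate_at_1 = hit_rate(toy_eval_dataset, top_k=1)
hit_rate_at_2 = hit_rate(toy_eval_dataset, top_k=2)
print(hit_rate_at_1, hit_rate_at_2)
```

Hit rate only rises or stays flat as `top_k` grows, so with a 12-chunk test corpus and `top_k=5` the metric is fairly forgiving.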

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [43]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:09<00:00,  2.60it/s]


In [44]:
te3_results_df = pd.DataFrame(te3_results)

In [45]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [46]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_l_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:00<00:00, 47.40it/s]


In [47]:
arctic_embed_l_results_df = pd.DataFrame(arctic_embed_l_results)

In [48]:
arctic_embed_l_hit_rate = arctic_embed_l_results_df["is_hit"].mean()
arctic_embed_l_hit_rate

0.9166666666666666

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [49]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:00<00:00, 46.32it/s]


In [50]:
finetune_results_df = pd.DataFrame(finetune_results)

In [51]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [52]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (not fine-tuned) retrieval model.

#### R - Retrieval

In [53]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [54]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [55]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [56]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
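
The data flow through the chain above can be mirrored in plain Python. This sketch uses stand-in functions instead of the real retriever and LLM; `stub_retriever`, `stub_llm`, and `run_rag` are hypothetical names, and the returned strings are placeholders.

```python
def stub_retriever(question):
    """Hypothetical retriever: returns a fixed list of context strings."""
    return ["Agents are LLMs that run tools in a loop."]

def stub_llm(prompt):
    """Hypothetical LLM: reports the size of the prompt it received."""
    return f"(answer generated from a {len(prompt)}-character prompt)"

PROMPT_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

def run_rag(inputs):
    # Step 1: fan the input out into retrieved context plus the original question.
    state = {
        "context": stub_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: format the prompt, call the model, and keep the context for inspection.
    prompt = PROMPT_TEMPLATE.format(context=state["context"], question=state["question"])
    return {"response": stub_llm(prompt), "context": state["context"]}

rag_output = run_rag({"question": "What is an agent?"})
print(sorted(rag_output))
```

Returning the context alongside the response is what lets us index the result with `["response"]` while still being able to inspect which chunks were retrieved. The `RunnablePassthrough.assign(context=itemgetter("context"))` step in the chain appears to reassign `context` to itself, so it has no counterpart in this sketch.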

In [57]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is an infuriatingly vague term that generally refers to AI systems that can act on your behalf. There are two main interpretations: one sees agents as systems that go and perform tasks for you (like a travel agent), while the other views them as LLMs (large language models) that have access to tools and can run processes in a loop to solve problems. However, the term lacks a clear and widely understood definition, leading to confusion about its meaning and utility.'

In [58]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Better-than-GPT-3 class models have been produced by Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several other organizations.'

In [59]:
base_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [60]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [61]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [62]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [63]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" in the context of AI refers to a system that can act on behalf of a user, often described in two main categories: one where agents operate like a travel agent, taking actions for you, and another where LLMs (Large Language Models) are given access to tools to solve problems in a loop. However, the term is considered vague and lacks a clear, widely understood definition. There are concerns about the utility of such agents due to issues like gullibility, where LLMs may believe false information, impacting their ability to make meaningful decisions.'

In [64]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [65]:
finetune_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [66]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'Simon has run the Llama 3.2 3B model on his iPhone.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

#### ✅ Answer:
**The fine-tuned RAG chain (`finetune_rag_chain`) demonstrated better performance because:**

1. **Precision Improvements**  
   - For _"What is the largest model that Simon has run on his phone?"_, the fine-tuned chain correctly identified "Llama 3.2 3B" while the base chain returned "I do not know"
   - Shows improved ability to retrieve domain-specific details from the context

2. **Hallucination Reduction**  
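   - For _"What is the laziest AI month?"_, the fine-tuned chain answered "I do not know." instead of guessing, and for _"Who has produced better models than GPT-3?"_ it returned a specific list of organizations (Anthropic, Mistral, Google, Meta, and others), as shown in the outputs above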
   - Demonstrates better alignment between retrieved context and generated response

3. **Retrieval Metrics Validation**  
   - Our earlier evaluation showed 100% hit rate for fine-tuned vs 87.5% for base Arctic-L
   - This directly translates to more reliable context for answer generation

4. **Question Understanding**  
   - The fine-tuned embeddings better captured semantic relationships in domain-specific language (e.g., recognizing "Agent" as a technical term rather than generic English)

**Why This Matters:**  
The fine-tuned model's training on Q&D pairs specifically from our domain data enables it to better understand the relationship between user questions and relevant context passages.
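
A vibe-check like this is easier to read when both chains answer the same questions side by side. The following is a minimal sketch, not part of the original notebook: `compare_chains` and `CannedChain` are hypothetical names, and the canned answers are placeholders rather than real outputs. It only assumes that each chain exposes `invoke()` and returns a dict with a `"response"` key, as the chains above do.

```python
from typing import Any, Dict, List, Protocol


class Chain(Protocol):
    """Anything with an LCEL-style invoke() returning a dict with a "response" key."""

    def invoke(self, inputs: Dict[str, Any]) -> Dict[str, Any]: ...


def compare_chains(chains: Dict[str, Chain], questions: List[str]) -> List[Dict[str, str]]:
    """Ask every chain every question and collect the answers row by row."""
    rows = []
    for question in questions:
        row = {"question": question}
        for name, chain in chains.items():
            row[name] = chain.invoke({"question": question})["response"]
        rows.append(row)
    return rows


class CannedChain:
    """Hypothetical stand-in so the sketch runs without a vector store or an LLM."""

    def __init__(self, answers: Dict[str, str]) -> None:
        self.answers = answers

    def invoke(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder behaviour: unknown questions get a fixed fallback answer.
        answer = self.answers.get(inputs["question"], "I do not know.")
        return {"response": answer, "context": []}


# Placeholder answers only; these are not results from the notebook.
demo_rows = compare_chains(
    {
        "base": CannedChain({}),
        "finetuned": CannedChain({"What is an Agent?": "A system that acts on a user's behalf."}),
    },
    ["What is an Agent?", "What is the laziest AI month?"],
)
for row in demo_rows:
    print(row)
```

In the notebook itself, the same helper could presumably be called as `compare_chains({"base": base_rag_chain, "finetuned": finetune_rag_chain}, questions)`.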

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

In [67]:
!pip install -qU ragas==0.2.10 unstructured==0.16.12


#### Load Data

In [68]:
docs = text_loader.load()

#### Synthetic Data Generation

In [69]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [70]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)


### Base RAG Evaluation

In [71]:
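# Run every synthetic question through the base chain and record its outputs
# on the sample so RAGAS can score them.
for test_row in dataset:
    response = base_rag_chain.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    # RAGAS expects plain strings, so unwrap each retrieved Document.
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]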

In [72]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | What insights does the Chatbot Arena Leaderboa... | [Then in December, the Chatbot Arena team intr... | [Prompt driven app generation is a commodity a... | The Chatbot Arena Leaderboard indicates that t... | The Chatbot Arena Leaderboard reveals that 18 ... | single_hop_specifc_query_synthesizer |
| 1 | What is the pricing for Anthropic's Claude 3 H... | [If you can gather the right data, and afford ... | [gets you OpenAI’s most expensive model, o1. G... | I do not know. | Anthropic’s Claude 3 Haiku model is priced at ... | single_hop_specifc_query_synthesizer |
| 2 | What recent feature has Google Gemini introduc... | [A year ago the single most notable example of... | [a lot) is live video. ChatGPT voice mode now ... | Google Gemini has introduced a live video feat... | Google Gemini has introduced a preview of a fe... | single_hop_specifc_query_synthesizer |
| 3 | What MLX do for Mac users? | [While MLX is a game changer, Apple’s own “App... | [about it since September 2022. I’m beginning ... | MLX supports running a wide range of MLX-compa... | Apple's MLX library is an array framework for ... | single_hop_specifc_query_synthesizer |
| 4 | How do the challenges of understanding and con... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nThe ethics of this space remain di... | The challenges of understanding and controllin... | The challenges of understanding and controllin... | multi_hop_abstract_query_synthesizer |
| 5 | How do AI capabilities influence employment, a... | [The legal arguments here are complex. I’m not... | [<1-hop>\n\nThe ethics of this space remain di... | AI capabilities, particularly those of Large L... | AI capabilities, particularly those of Large L... | multi_hop_abstract_query_synthesizer |
| 6 | How AI capabilities affect employment and what... | [The legal arguments here are complex. I’m not... | [<1-hop>\n\nThe ethics of this space remain di... | AI capabilities, particularly those of Large L... | AI capabilities, particularly those of large l... | multi_hop_abstract_query_synthesizer |
| 7 | How do the challenges of understanding and con... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nThe ethics of this space remain di... | The challenges of understanding and controllin... | The challenges of understanding and controllin... | multi_hop_abstract_query_synthesizer |
| 8 | How has Anthropic's approach to model evaluati... | [The legal arguments here are complex. I’m not... | [<1-hop>\n\nPrompt driven app generation is a ... | The provided context does not contain specific... | Anthropic's approach to model evaluation and d... | multi_hop_specific_query_synthesizer |
| 9 | How have advancements in multi-modal LLMs like... | [The GPT-4 barrier was comprehensively broken\... | [<1-hop>\n\ngets you OpenAI’s most expensive m... | The advancements in multi-modal LLMs, such as ... | Advancements in multi-modal LLMs such as GPT-4... | multi_hop_specific_query_synthesizer |


In [73]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [74]:
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [75]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


ERROR:ragas.executor:Exception raised in Job[8]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.5556, 'faithfulness': 0.7399, 'factual_correctness': 0.5136, 'answer_relevancy': 0.7061, 'context_entity_recall': 0.3029, 'noise_sensitivity_relevant': 0.2332}

### Fine-tuned RAG Evaluation

In [76]:
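# Re-run the same synthetic questions through the fine-tuned chain.
# Note: this overwrites the base chain's responses stored on `dataset`.
for test_row in dataset:
    response = finetune_rag_chain.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    # RAGAS expects plain strings, so unwrap each retrieved Document.
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]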
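
The loop above writes the fine-tuned chain's outputs onto the same `dataset` objects that previously held the base chain's outputs, so the base responses are no longer available on `dataset` afterwards. One way to keep both runs around is to build a fresh list of samples per chain instead of mutating in place. The sketch below illustrates that idea only: `Sample`, `populate`, and the two fake answer functions are hypothetical stand-ins, not RAGAS or LangChain APIs, and contexts are simplified to plain strings.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """Simplified, hypothetical stand-in for an evaluation sample."""

    user_input: str
    response: str = ""
    retrieved_contexts: List[str] = field(default_factory=list)


def populate(samples: List[Sample], answer: Callable[[str], Dict]) -> List[Sample]:
    """Return new samples with response and contexts filled in; inputs stay untouched."""
    filled = []
    for sample in samples:
        result = answer(sample.user_input)
        filled.append(
            replace(
                sample,
                response=result["response"],
                retrieved_contexts=list(result["context"]),
            )
        )
    return filled


# Hypothetical answer functions standing in for base_rag_chain / finetune_rag_chain.
def fake_base(question: str) -> Dict:
    return {"response": "base answer", "context": ["base chunk"]}


def fake_finetuned(question: str) -> Dict:
    return {"response": "finetuned answer", "context": ["finetuned chunk"]}


questions = [Sample(user_input="What MLX do for Mac users?")]
base_run = populate(questions, fake_base)
finetuned_run = populate(questions, fake_finetuned)

# Both runs coexist, and the original question list is still empty of answers.
print(base_run[0].response, "|", finetuned_run[0].response, "|", repr(questions[0].response))
```

Keeping the runs separate makes it possible to inspect or re-score the base outputs after the fine-tuned evaluation has finished.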

In [77]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | What insights does the Chatbot Arena Leaderboa... | [Then there’s the rest. If you browse the Chat... | [Prompt driven app generation is a commodity a... | The Chatbot Arena Leaderboard indicates signif... | The Chatbot Arena Leaderboard reveals that 18 ... | single_hop_specifc_query_synthesizer |
| 1 | What is the pricing for Anthropic's Claude 3 H... | [Today $30/mTok gets you OpenAI’s most expensi... | [gets you OpenAI’s most expensive model, o1. G... | Anthropic's Claude 3 Haiku model is priced at ... | Anthropic’s Claude 3 Haiku model is priced at ... | single_hop_specifc_query_synthesizer |
| 2 | What recent feature has Google Gemini introduc... | [Your browser does not support the audio eleme... | [a lot) is live video. ChatGPT voice mode now ... | Google Gemini has introduced a live video feat... | Google Gemini has introduced a preview of a fe... | single_hop_specifc_query_synthesizer |
| 3 | What MLX do for Mac users? | [Apple’s mlx-lm Python library supports runnin... | [about it since September 2022. I’m beginning ... | MLX supports running a wide range of MLX-compa... | Apple's MLX library is an array framework for ... | single_hop_specifc_query_synthesizer |
| 4 | How do the challenges of understanding and con... | [Here’s the sequel to this post: Things we lea... | [<1-hop>\n\nThe ethics of this space remain di... | The challenges of understanding and controllin... | The challenges of understanding and controllin... | multi_hop_abstract_query_synthesizer |
| 5 | How do AI capabilities influence employment, a... | [The legal arguments here are complex. I’m not... | [<1-hop>\n\nThe ethics of this space remain di... | AI capabilities, particularly those of Large L... | AI capabilities, particularly those of Large L... | multi_hop_abstract_query_synthesizer |
| 6 | How AI capabilities affect employment and what... | [The legal arguments here are complex. I’m not... | [<1-hop>\n\nThe ethics of this space remain di... | AI capabilities, particularly those of Large L... | AI capabilities, particularly those of large l... | multi_hop_abstract_query_synthesizer |
| 7 | How do the challenges of understanding and con... | [Large Language Models\nThey’re actually quite... | [<1-hop>\n\nThe ethics of this space remain di... | The challenges of understanding and controllin... | The challenges of understanding and controllin... | multi_hop_abstract_query_synthesizer |
| 8 | How has Anthropic's approach to model evaluati... | [To understand more about inference scaling I ... | [<1-hop>\n\nPrompt driven app generation is a ... | The context provided does not contain specific... | Anthropic's approach to model evaluation and d... | multi_hop_specific_query_synthesizer |
| 9 | How have advancements in multi-modal LLMs like... | [260 input tokens, 92 output tokens. Cost appr... | [<1-hop>\n\ngets you OpenAI’s most expensive m... | Advancements in multi-modal LLMs, such as GPT-... | Advancements in multi-modal LLMs such as GPT-4... | multi_hop_specific_query_synthesizer |


In [78]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [79]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


{'context_recall': 0.6972, 'faithfulness': 0.6701, 'factual_correctness': 0.5042, 'answer_relevancy': 0.8624, 'context_entity_recall': 0.5183, 'noise_sensitivity_relevant': 0.1707}

### RAGAS Evaluation Comparison

**Key Improvements in Fine-tuned Model (percentages in parentheses are relative changes versus the base score):**
1. **Context Recall (+25.5% relative improvement)**  
   - Base: 55.6% → Fine-tuned: 69.7%  
   - The fine-tuned model retrieves more complete contextual information due to better question-document alignment

2. **Answer Relevancy (+22.1% relative improvement)**  
   - Base: 70.6% → Fine-tuned: 86.2%  
   - Responses better match user intent through domain-specific embedding relationships

3. **Context Entity Recall (+71.1% relative improvement)**  
   - Base: 30.3% → Fine-tuned: 51.8%  
   - Better capture of domain-specific entities/keyphrases from training data

**Tradeoffs:**
- **Faithfulness (-9.4% relative)**  
  - Base: 74.0% → Fine-tuned: 67.0%  
  - Increased retrieval breadth from smaller chunks (600 vs 750) introduces more contextual variance that the generator LLM must reconcile
- **Noise Sensitivity (-26.8% relative; lower is better, so this is an improvement rather than a tradeoff)**  
  - Base: 23.3% → Fine-tuned: 17.1%  
  - Shows reduced sensitivity to irrelevant information. This is likely influenced by the differing chunk overlap (50 vs 20) between the two setups
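
The relative changes quoted above can be reproduced from the two result dictionaries printed by `evaluate`. This is a small worked check rather than part of the original notebook; `relative_change_pct` is a helper name introduced here for illustration, and the scores are copied from the outputs above.

```python
# Scores copied from the two RAGAS result dictionaries shown earlier.
base = {
    "context_recall": 0.5556,
    "faithfulness": 0.7399,
    "factual_correctness": 0.5136,
    "answer_relevancy": 0.7061,
    "context_entity_recall": 0.3029,
    "noise_sensitivity_relevant": 0.2332,
}
finetuned = {
    "context_recall": 0.6972,
    "faithfulness": 0.6701,
    "factual_correctness": 0.5042,
    "answer_relevancy": 0.8624,
    "context_entity_recall": 0.5183,
    "noise_sensitivity_relevant": 0.1707,
}


def relative_change_pct(before: float, after: float) -> float:
    """Relative change of `after` versus `before`, expressed in percent."""
    return (after - before) / before * 100.0


changes = {
    metric: round(relative_change_pct(base[metric], finetuned[metric]), 1)
    for metric in base
}

for metric, pct in changes.items():
    print(f"{metric:28s} {base[metric]:.4f} -> {finetuned[metric]:.4f}  ({pct:+.1f}%)")
```

Note that for `noise_sensitivity_relevant` a lower score is the better one, so a negative change there counts in the fine-tuned model's favour.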

**Why Fine-tuned Performs Better Overall:**  
The fine-tuned model demonstrates substantial gains in core retrieval metrics (context recall and context entity recall) while maintaining comparable factual correctness (51.4% → 50.4%). The 86% answer relevancy score indicates significantly better alignment with user questions. The slight faithfulness tradeoff is acceptable given the domain-specific nature of the content, where complete context capture is prioritized over strict adherence of every claim to the retrieved context.

**Critical Insight:**  
1. Fine-tuning particularly helped with:  
   - Recognizing domain-specific terminology (e.g., "Matryoshka embeddings")  
   - Linking colloquial phrases to technical concepts  
   - Distinguishing between similar entity references in the corpus
2. ***This evaluation comparison conflates two variables: fine-tuning model improvements and chunking strategy changes.***

**Recommendation:**  
To isolate the impact of fine-tuning, we should repeat the evaluation process with the same chunking strategy used to train the fine-tuned model. This would allow us to directly compare the performance of the fine-tuned model against the base model using the same retrieval context.
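
One way to set up that controlled comparison is to enumerate every combination of embedding model and chunking configuration, then compare models only within the same chunking configuration. The sketch below is illustrative: the model labels are hypothetical, and the pairing of chunk size with overlap (750 with 20, 600 with 50) is an assumption inferred from the numbers mentioned above, not something the notebook states.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class ChunkConfig(NamedTuple):
    chunk_size: int
    chunk_overlap: int


class Run(NamedTuple):
    embedding_model: str
    chunking: ChunkConfig


def build_runs(models: List[str], configs: List[ChunkConfig]) -> List[Run]:
    """Full grid: every embedding model is evaluated under every chunking config."""
    return [Run(model, config) for model, config in product(models, configs)]


def group_by_chunking(runs: List[Run]) -> Dict[ChunkConfig, List[str]]:
    """Group runs so models are only compared when chunking is held fixed."""
    grouped: Dict[ChunkConfig, List[str]] = {}
    for run in runs:
        grouped.setdefault(run.chunking, []).append(run.embedding_model)
    return grouped


# Hypothetical labels; the (size, overlap) pairing is an assumption.
models = ["base-arctic-embed-l", "finetuned-arctic-embed-l"]
configs = [ChunkConfig(750, 20), ChunkConfig(600, 50)]

runs = build_runs(models, configs)
comparisons = group_by_chunking(runs)

for config, compared in comparisons.items():
    print(f"chunk_size={config.chunk_size}, overlap={config.chunk_overlap}: {compared}")
```

Each group then holds exactly one variable constant, so any score difference inside a group can be attributed to the embedding model rather than to chunking.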