# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll start with the lowest-hanging fruit for improving RAG:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `gte-large-en-v1.5`
  - Task 5: Evaluating our Retriever

- 🤝 Breakout Room #2:
  - Task 1: Vibe Checking Our LCEL RAG Chain
  - Task 2: Ragas Evaluation



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
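
To make "closer together" concrete, here is a minimal sketch using made-up 2-D vectors. Real embeddings have hundreds of dimensions, and these numbers are invented for illustration rather than taken from any model. It only shows that the cosine similarity between a question and its document rises when the question embedding moves toward the document embedding.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for illustration only.
document = [1.0, 0.0]
question_before = [0.0, 1.0]  # question embedding from the base model
question_after = [0.8, 0.6]   # same question after fine-tuning

similarity_before = cosine_similarity(question_before, document)
similarity_after = cosine_similarity(question_after, document)
print(similarity_before, similarity_after)
```

A retriever ranks documents by this kind of similarity score, so raising it for the correct question/document pairs is what improves retrieval.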

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

In [3]:
!pip install -qU langchain_experimental langchain_openai langchain_huggingface langchain_core==0.2.38 langchain langchain_community langchain-text-splitters huggingface_hub


In [4]:
!pip install -qU faiss-cpu unstructured==0.15.7 python-pptx==1.0.2 nltk==3.9.1 pymupdf


### Provide OpenAI API Key

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")
#os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter Your LangChain API Key: ")
#os.environ["QDRANT_API_KEY"] = getpass.getpass("Enter Your Qdrant API Key: ")

os.environ["LANGCHAIN_TRACING_V2"] = 'false'
os.environ["QDRANT_URL"] = 'https://6db11fca-6840-43a7-9aa0-96fa9b3c0320.europe-west3-0.gcp.cloud.qdrant.io'

## Task 2: Loading Data

We'll be using a recent document released by the EU 'laying down harmonised rules on artificial intelligence and amending Regulations'.

The data can be found [here](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689), though we will be using an HTML version which was collected into the AIM DataRepository.

First, we'll clone and then `cd` into the DataRepository.

In [6]:
!git clone https://github.com/ledgerW/policy-rag.git

Cloning into 'policy-rag'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 46 (delta 10), reused 43 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (46/46), 12.22 MiB | 36.06 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [7]:
%cd policy-rag

/content/policy-rag


In [3]:
!ls

README.md            environment.yml    golden_dataset.yml  sdg.py
chunk_experiment.py  experiments        policy_rag          tests
data                 fine_tuning.ipynb  scratch.ipynb


In [4]:
data_dir = 'data'

In [5]:
from policy_rag.text_utils import DocLoader

In [6]:
loader = DocLoader()
docs = loader.load_dir(data_dir)
print(len(docs))

137


Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [7]:
from policy_rag.text_utils import get_recursive_token_chunks, get_semantic_chunks

Next we can load/split these documents as follows.

In [8]:
training_documents = get_recursive_token_chunks(
    docs=docs,
    model_name='gpt-4',
    chunk_size=100,
    chunk_overlap=25
)

In [9]:
len(training_documents)

1222
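
As a rough illustration of what `chunk_size` and `chunk_overlap` mean, here is a minimal sliding-window sketch. It assumes a pre-tokenized list of strings rather than the `gpt-4` tokenizer, and `chunk_tokens` is a hypothetical helper; the repository's `get_recursive_token_chunks` may well split differently (for example, on separators first).

```python
from typing import List

def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical stand-in for a tokenized document.
tokens = [f"tok{i}" for i in range(250)]
chunks = chunk_tokens(tokens, chunk_size=100, chunk_overlap=25)
print([len(chunk) for chunk in chunks])
```

The overlap means the last 25 tokens of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.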

Next, we're going to associate each of our chunks with a unique identifier.

In [10]:
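import uuid

id_set = set()

for document in training_documents:
  # uuid4 collisions are vanishingly unlikely, but we guard against them anyway
  doc_id = str(uuid.uuid4())
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id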

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [71]:
training_split_documents = training_documents[:700]
val_split_documents = training_documents[700:775]
test_split_documents = training_documents[775:850]

In [12]:
training_split_documents[0]

Document(metadata={'source': 'data/Blueprint-for-an-AI-Bill-of-Rights.pdf', 'file_path': 'data/Blueprint-for-an-AI-Bill-of-Rights.pdf', 'page': 0, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe Illustrator 26.3 (Macintosh)', 'producer': 'iLovePDF', 'creationDate': "D:20220920133035-04'00'", 'modDate': "D:20221003104118-04'00'", 'trapped': '', 'id': 'ecbd3a8c-01c7-47ce-89f6-f426d9d15848'}, page_content='BLUEPRINT FOR AN \nAI BILL OF \nRIGHTS \nMAKING AUTOMATED \nSYSTEMS WORK FOR \nTHE AMERICAN PEOPLE \nOCTOBER 2022')

## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [July 18th, 2024](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [40]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [39]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate 1 question based only on the provided context.

Context:
{context}

Question:
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [41]:
question_generation_chain = qa_prompt_template | qa_chat_model

In [42]:
question_generation_chain.invoke({'context': 'What types of accessible formats are available for persons with disabilities?'})

AIMessage(content='What are some examples of accessible formats that can be provided for individuals with disabilities?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 41, 'total_tokens': 57, 'completion_tokens_details': {'reasoning_tokens': 0}, 'prompt_tokens_details': {'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f85bea6784', 'finish_reason': 'stop', 'logprobs': None}, id='run-226cc984-fef8-45e7-b314-71a7ac9a3872-0', usage_metadata={'input_tokens': 41, 'output_tokens': 16, 'total_tokens': 57})

Next we need a function that builds the dataset. There's a lot going on in it, so let's take a deeper look:

1. First, we provide a list of documents and a number of questions (`n_questions`).
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are question `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are `List[str]`: the list of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.
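
One possible shape for such a function is sketched below. To stay self-contained it takes a plain `{context_id: text}` dictionary instead of LangChain `Document` objects, and it uses a stub in place of the LLM call; `build_question_context_maps` and `fake_generator` are hypothetical names, not part of the notebook or the repository.

```python
import uuid
from typing import Callable, Dict, List, Tuple

def build_question_context_maps(
    chunks: Dict[str, str],
    generate_questions: Callable[[str, int], List[str]],
    n_questions: int = 2,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (question_id -> question, question_id -> [context_id])."""
    questions: Dict[str, str] = {}
    relevant_contexts: Dict[str, List[str]] = {}
    for context_id, text in chunks.items():
        for question in generate_questions(text, n_questions):
            question_id = str(uuid.uuid4())
            questions[question_id] = question
            relevant_contexts[question_id] = [context_id]
    return questions, relevant_contexts

def fake_generator(text: str, n: int) -> List[str]:
    """Stub standing in for the LLM question-generation chain."""
    return [f"Question {i + 1} about: {text[:30]}" for i in range(n)]

# Hypothetical chunks keyed by their UUIDs.
toy_chunks = {
    str(uuid.uuid4()): "Accessible formats for persons with disabilities.",
    str(uuid.uuid4()): "Marking and detecting synthetic content.",
}
question_object, context_object = build_question_context_maps(toy_chunks, fake_generator)
print(len(question_object), len(context_object))
```

In the notebook, the stub would be replaced by a call into the question generation chain, and the dictionary by the `id` stored in each document's `metadata`.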

In [22]:
question_data = {
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}

context_data = {
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}

In [17]:
from pydantic import BaseModel, RootModel, field_validator
from typing import List, Dict
from uuid import UUID

class QuestionObject(RootModel[Dict[str, str]]):
    model_config = {'validate_assignment': True}

    @field_validator('root')
    def validate_key_is_uuid(cls, value):
        for key in value.keys():
            try:
                u = UUID(key)
                if u.version != 4:
                    raise ValueError(f"{key} is not UUID v4")
            except ValueError as e:
                raise ValueError(f"{key} is not UUID v4")
        return value


class ContextObject(RootModel[Dict[str, List[str]]]):
    model_config = {'validate_assignment': True}

    @field_validator('root')
    def validate_key_is_uuid(cls, value):
        for key in value.keys():
            try:
                u = UUID(key)
                if u.version != 4:
                    raise ValueError(f"{key} is not UUID v4")
            except ValueError as e:
                raise ValueError(f"{key} is not UUID v4")
        return value

    @field_validator('root')
    def validate_values_are_uuid(cls, value):
        for key, val in value.items():
            for v in val:
                try:
                    u = UUID(v)
                    if u.version != 4:
                        raise ValueError(f"{v} is not UUID v4")
                except ValueError:
                    # Report the offending context ID, not the question key
                    raise ValueError(f"{v} (context ID for question {key}) is not UUID v4")
        return value

### Prep Data

In [57]:
from typing import List, Tuple

from langchain_core.documents import Document

def create_question_context_pairs(
        documents: List[Document]
    ) -> Tuple[List[str], List[str]]:
    questions: List[str] = []
    contexts: List[str] = []

    for doc in documents:
        question = question_generation_chain.invoke({'context': doc.page_content})
        questions.append(question.content)
        contexts.append(doc.page_content)

    return questions, contexts

We'll use the function to generate training, validation, and test data, with one question per chunk.

In [72]:
questions, contexts = create_question_context_pairs(training_split_documents)
val_questions, val_contexts = create_question_context_pairs(val_split_documents)
test_questions, test_contexts = create_question_context_pairs(test_split_documents)

In [73]:
import pandas as pd

# lines=True writes one JSON object per line (JSON Lines), matching the .jsonl extension
pd.DataFrame({'id': range(0, len(questions)), 'anchor': questions, 'positive': contexts}).to_json('training.jsonl', orient='records', lines=True)

train_val_len = len(questions)+len(val_questions)
pd.DataFrame({'id': range(len(questions), train_val_len), 'anchor': val_questions, 'positive': val_contexts}).to_json('validation.jsonl', orient='records', lines=True)

train_val_test_len = train_val_len + len(test_questions)
pd.DataFrame({'id': range(train_val_len, train_val_test_len), 'anchor': test_questions, 'positive': test_contexts}).to_json('test.jsonl', orient='records', lines=True)

In [64]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets


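train_dataset = load_dataset("json", data_files="training.jsonl", split="train")
val_dataset = load_dataset("json", data_files="validation.jsonl", split="train")
test_dataset = load_dataset("json", data_files="test.jsonl", split="train")
# The corpus must contain every chunk so retrieval is evaluated against the full set
corpus_dataset = concatenate_datasets([train_dataset, val_dataset, test_dataset])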
 
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(val_dataset["id"], val_dataset["anchor"])
)  # Our queries (qid => question)
 
# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]
 
 
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs
)

Generating train split: 500 examples [00:00, 30169.64 examples/s]
Generating train split: 75 examples [00:00, 14845.34 examples/s]
Generating train split: 75 examples [00:00, 13127.99 examples/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

> NOTE: If you ran into issues creating the data - you can use the data from the DataRepository. It's simply called: `train_dataset.jsonl`, etc.
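
For reference, the `.jsonl` extension conventionally denotes JSON Lines: one JSON object per line. The sketch below writes and re-reads two records with the `id` / `anchor` / `positive` fields using only the standard library. The file name `toy_training.jsonl` and the second record are made up for illustration.

```python
import json
import os
import tempfile

# Toy records in the anchor (question) / positive (context) layout.
records = [
    {"id": 0, "anchor": "Who drives the bus?",
     "positive": "The bus was driven by Kyle, the Bus Driver."},
    {"id": 1, "anchor": "What color is the bus?",  # hypothetical example
     "positive": "The bus is painted bright yellow."},
]

path = os.path.join(tempfile.mkdtemp(), "toy_training.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, skipping any blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded[0]["anchor"])
```

Because each line is independent, files in this layout can be appended to or streamed without loading the whole dataset into memory.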

## Task 4: Fine-tuning `gte-large-en-v1.5`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Alibaba-NLP's [`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

In [1]:
!pip install -qU sentence_transformers datasets pyarrow optimum[exporters]

In [38]:
from sentence_transformers import SentenceTransformer

model_id = "Alibaba-NLP/gte-large-en-v1.5"
model = SentenceTransformer(model_id, trust_remote_code=True)


We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [39]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [40]:
BATCH_SIZE = 20

Let's move our dataset into the expected format for training.

In [43]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets


train_dataset = load_dataset("json", data_files="training.jsonl", split="train")
val_dataset = load_dataset("json", data_files="validation.jsonl", split="train")
test_dataset = load_dataset("json", data_files="test.jsonl", split="train")
corpus_dataset = concatenate_datasets([train_dataset, val_dataset, test_dataset])
 
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(val_dataset["id"], val_dataset["anchor"])
)  # Our queries (qid => question)
 
# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]
 
 
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs
)

In [5]:
from datasets import load_dataset

train_dataset = load_dataset("json", data_files="training.jsonl", split="train")

In [7]:
train_dataset[50]

{'id': 50,
 'anchor': 'What steps should be taken to ensure that individuals impacted by a system are notified of significant use case or key functionality changes?',
 'positive': 'system functioning and the role automation plays, notice that such systems are in use, the individual or organiza\xad\ntion responsible for the system, and explanations of outcomes that are clear, timely, and accessible. Such notice \nshould be kept up-to-date and people impacted by the system should be notified of significant use case or key \nfunctionality changes. You should know how and why an outcome impacting you was determined by an'}

In [4]:
train_dataset[50]

{'queries': 'Discuss the implications of biased automated sentiment analyzers as highlighted in the context. How can such biases affect online discourse and the representation of marginalized groups?',
 'corpus': None,
 'relevant_docs': ['5c3c3c94-f47d-4e85-bc8a-61f5d9e9014b'],
 'mode': 'text'}

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [44]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [1024, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

LW:
**Multiple Negatives Ranking Loss** is a more efficient variation of the standard Triplet Loss commonly used in contrastive learning. Each training example is only an Anchor-Positive pair; for a given Anchor, the Positives belonging to the other Anchors in the same batch act as its negatives. The loss computes the cosine similarity between the Anchor and every Positive in the batch, then applies a softmax cross-entropy over those similarities so that the similarity of the true Anchor-Positive pair is maximized while the similarities to the in-batch negatives are minimized.

**Matryoshka Loss** wraps another loss and sums it across several truncated sizes of the output embedding (here 1024, 512, 256, 128, and 64 dimensions). This pushes the most important information into the leading dimensions, so shorter prefixes of the embedding remain useful on their own while the full-length embedding adds finer detail.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!
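
To make the two summaries concrete, here is a small pure-Python sketch of the idea behind each loss. It is a simplified illustration, not the `sentence-transformers` implementation: it assumes cosine similarity with a scale factor of 20 and equal weights for every Matryoshka dimension, and the toy 4-D embeddings are invented.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, positives, scale: float = 20.0) -> float:
    """Cross-entropy where positives[i] is the correct 'class' for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses: List[float] = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        top = max(scores)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)

def matryoshka_loss(anchors, positives, dims: Sequence[int]) -> float:
    """Sum the inner loss over embeddings truncated to each size in dims."""
    return sum(
        multiple_negatives_ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Invented toy batch: each anchor points roughly the same way as its positive.
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]

aligned = multiple_negatives_ranking_loss(anchors, positives)
swapped = multiple_negatives_ranking_loss(anchors, list(reversed(positives)))
print(aligned, swapped, matryoshka_loss(anchors, positives, dims=[4, 2]))
```

The loss is close to zero when each anchor is most similar to its own positive and grows when the pairing is wrong, which is the signal that pulls questions toward their documents during training.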

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

We'll train this model for 5 epochs, though you could increase this number if you had significantly more data.

In [46]:
EPOCHS = 5

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [None]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers
 
finetuned_model_name = 'policy_gte_large_6'

# load train dataset again
train_dataset = load_dataset("json", data_files="training.jsonl", split="train")
 
# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir=finetuned_model_name, # output directory and hugging face model ID
    num_train_epochs=EPOCHS,                         # number of epochs
    per_device_train_batch_size=32,             # train batch size
    gradient_accumulation_steps=16,             # for a global batch size of 512
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=True,                                  # use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
)
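
To show what `warmup_ratio=0.1` combined with a cosine scheduler does to the learning rate, here is a simplified sketch of linear warm-up followed by cosine decay. `lr_at_step` is a hypothetical helper that approximates the schedule; the value of `total_steps` is made up, and the trainer's actual per-step values may differ in detail.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_ratio: float = 0.1) -> float:
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Samples per optimizer update: per-device batch size x accumulation steps.
effective_batch_size = 32 * 16

total_steps = 100  # hypothetical number of optimizer steps
for step in (0, 5, 10, 55, 100):
    print(step, lr_at_step(step, total_steps))
```

The ramp avoids large, noisy updates while the optimizer statistics are still settling, and the decay lets the model take smaller steps as training finishes.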

In [None]:
from sentence_transformers import SentenceTransformerTrainer
 
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select_columns(["positive", "anchor"]),
    loss=train_loss,
    evaluator=evaluator,
)

In [None]:
trainer.train()
trainer.save_model()

In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [4]:
!optimum-cli export onnx --help


### Export to ONNX from local

In [None]:
!optimum-cli export onnx --model policy_gte_large_6/ policy_gte_large_6/onnx/ --task feature-extraction --trust-remote-code --framework pt

### Export to ONNX from HF Hub

In [5]:
!optimum-cli export onnx --model lw2134/policy_gte_large onnx/ --task feature-extraction --trust-remote-code --framework pt


In [49]:
trainer.model.push_to_hub(finetuned_model_name, local_model_path=finetuned_model_name, exist_ok=True)

'https://huggingface.co/lw2134/policy_gte_large/commit/acb3c2ec1c7bca362e6c957a97cdf8db9fc413c3'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [None]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [None]:
import tqdm

def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
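
The metric this function feeds is a hit rate: the fraction of questions whose expected document appears among the top `k` retrieved results. The sketch below isolates that calculation with invented IDs and no vector store; `hit_rate_at_k` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List

def hit_rate_at_k(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose expected document ID is in the top-k results."""
    hits = 0
    for question_id, expected_id in expected.items():
        if expected_id in retrieved.get(question_id, [])[:k]:
            hits += 1
    return hits / len(expected)

# Invented retrieval results: ranked document IDs per question.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d4", "d5", "d6"],
}
expected = {"q1": "d1", "q2": "d2"}

print(hit_rate_at_k(retrieved, expected, k=3))
print(hit_rate_at_k(retrieved, expected, k=1))
```

Hit rate ignores where in the top `k` the document lands, so two retrievers with the same hit rate can still differ in ranking quality.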

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-m`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [None]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 50/50 [00:07<00:00,  7.14it/s]


In [None]:
te3_results_df = pd.DataFrame(te3_results)

In [None]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-m` (base)

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-m")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 50/50 [00:00<00:00, 55.86it/s]


In [None]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [None]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.86

### `Snowflake/snowflake-arctic-embed-m` (fine-tuned)

In [None]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 50/50 [00:00<00:00, 83.70it/s]


In [None]:
finetune_results_df = pd.DataFrame(finetune_results)

In [None]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0
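
With only 50 test questions, these hit rates carry some sampling uncertainty. As a rough sanity check, the sketch below computes a 95% Wilson score interval for each reported hit rate. The `wilson_interval` helper is hypothetical (it is not part of the notebook), and the sample size of 50 is read off the progress bars above.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hit rates copied from the outputs above.
reported_hit_rates = {
    "text-embedding-3-small": 1.0,
    "snowflake-arctic-embed-m (base)": 0.86,
    "snowflake-arctic-embed-m (fine-tuned)": 1.0,
}

for name, rate in reported_hit_rates.items():
    low, high = wilson_interval(rate, n=50)
    print(f"{name}: {rate:.2f} (95% interval roughly {low:.2f} to {high:.2f})")
```

The intervals are fairly wide at this sample size, so a larger test set would give a firmer comparison between the models.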

# 🤝 Breakout Room #2

## Task 1: Vibe Checking the RAG Pipeline

Now that we've fine-tuned our embedding model, let's vibe check our RAG pipeline on a few sample questions!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=50,
    length_function=len
)

training_documents = text_splitter.split_documents(training_documents_loaded.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [None]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [None]:
rag_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [None]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
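
To see what data flows through each stage of the chain, here is a plain-Python sketch of the same three steps. `fake_retriever` and `fake_llm` are hypothetical stand-ins (no LangChain or API calls are involved), so only the shape of the inputs and outputs is meaningful.

```python
def fake_retriever(question):
    # Stand-in for base_retriever: returns a list of "documents" (plain strings here).
    return [f"chunk about: {question}"]

def fake_llm(prompt):
    # Stand-in for rag_prompt_template | rag_llm | StrOutputParser().
    return f"answer based on {len(prompt)} prompt characters"

PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

def sketch_rag_chain(inputs):
    # Step 1: the leading dict retrieves context and passes the question along.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: RunnablePassthrough.assign(context=itemgetter("context")) re-assigns
    # `context` to itself, so the state is effectively unchanged.
    state = {**state, "context": state["context"]}
    # Step 3: the trailing dict produces the answer and also returns the context.
    prompt = PROMPT.format(context=state["context"], question=state["question"])
    return {"response": fake_llm(prompt), "context": state["context"]}

output = sketch_rag_chain({"question": "Why does the EU want to regulate AI?"})
print(sorted(output.keys()))
```

Returning `context` alongside `response` is what later lets an evaluation step inspect which chunks were retrieved for each answer.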

In [None]:
base_rag_chain.invoke({"question" : "Why does the EU want to regulate AI?"})["response"]

'The EU wants to regulate AI to promote a human-centric approach to AI, ensure the development of secure, trustworthy, and ethical AI, protect ethical principles, and facilitate the protection of natural persons, democracy, the rule of law, and environmental protection. Additionally, the regulation aims to boost innovation and employment, making the Union a leader in the uptake of trustworthy AI.'

In [None]:
base_rag_chain.invoke({"question" : "What are the codes of practice?"})["response"]

'I do not know.'

In [None]:
base_rag_chain.invoke({"question" : "How many parameters is too many parameters?"})["response"]

'The context suggests that models with at least a billion parameters are considered to display significant generality and competence in performing a wide range of tasks. Therefore, it can be inferred that having a billion parameters is a threshold for being considered a model with "too many" parameters in this context.'

In [None]:
base_rag_chain.invoke({"question" : "What is an emotion recognition system and why is it important?"})["response"]

'An emotion recognition system is a type of artificial intelligence (AI) technology designed to identify and interpret human emotions based on various inputs, such as facial expressions, voice tone, body language, or biometric data. These systems analyze patterns in the data to infer the emotional state of an individual.\n\nThe importance of emotion recognition systems lies in their potential applications across various fields, including mental health, customer service, security, and human-computer interaction. They can enhance user experiences, improve communication, and provide insights into emotional well-being. However, there are significant concerns regarding their reliability, specificity, and potential for discriminatory outcomes, particularly given the variability of emotional expression across different cultures and individuals.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [None]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [None]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [None]:
finetune_rag_chain.invoke({"question" : "Why does the EU want to regulate AI?"})["response"]

'The EU wants to regulate AI to establish harmonized rules on artificial intelligence, ensuring safety, accountability, and ethical standards in the deployment of AI systems. This regulation aims to address the risks associated with high-risk AI systems and to protect the rights and interests of individuals and society as a whole.'

In [None]:
finetune_rag_chain.invoke({"question" : "What are the codes of practice?"})["response"]

'I do not know.'

In [None]:
finetune_rag_chain.invoke({"question" : "How many parameters is too many parameters?"})["response"]

'I do not know.'

In [None]:
finetune_rag_chain.invoke({"question" : "What is an emotion recognition system and why is it important?"})["response"]

'An emotion recognition system is an AI system designed to identify or infer the emotions or intentions of natural persons based on their biometric data. This includes recognizing emotions such as happiness, sadness, anger, surprise, disgust, embarrassment, excitement, shame, contempt, satisfaction, and amusement. It does not encompass physical states like pain or fatigue.\n\nThe importance of emotion recognition systems lies in their potential applications across various fields, such as mental health, customer service, security, and human-computer interaction. By accurately identifying emotions, these systems can enhance user experiences, improve communication, and provide insights into human behavior, which can be crucial for developing responsive and empathetic AI technologies.'

#####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?  

LW: Surprisingly, the base model appears to have answered better. The fine-tuned chain answered "I do not know" for 2 of the 4 questions, as opposed to only 1 of 4 for the base chain. Otherwise the answers are about the same, with the base chain perhaps marginally better at an anecdotal glance.
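
One way to make this vibe check slightly less anecdotal is to tally how often each chain abstains. The sketch below assumes a chain object with an `.invoke({"question": ...})` method that returns a dict with a `"response"` key, as the chains above do. `count_abstentions` and `StubChain` are hypothetical names introduced only for illustration.

```python
ABSTAIN_PHRASE = "i do not know"

def count_abstentions(chain, questions):
    """Return the questions for which the chain's response contains the abstain phrase."""
    abstained = []
    for question in questions:
        response = chain.invoke({"question": question})["response"]
        if ABSTAIN_PHRASE in response.lower():
            abstained.append(question)
    return abstained

class StubChain:
    """Hypothetical stand-in for a RAG chain, returning canned responses."""
    def __init__(self, canned):
        self._canned = canned

    def invoke(self, inputs):
        return {"response": self._canned.get(inputs["question"], "I do not know.")}

vibe_questions = [
    "Why does the EU want to regulate AI?",
    "What are the codes of practice?",
    "How many parameters is too many parameters?",
    "What is an emotion recognition system and why is it important?",
]

stub = StubChain({"Why does the EU want to regulate AI?": "To establish harmonized rules."})
print(count_abstentions(stub, vibe_questions))
```

With the real chains, `count_abstentions(base_rag_chain, vibe_questions)` and `count_abstentions(finetune_rag_chain, vibe_questions)` would be the calls to compare; note that each call issues one LLM request per question.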

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe checks, but let's use RAGAS to provide more insight into how things are improving!

In [None]:
!pip install -qU ragas

### RAGAS Synthetic Testset Generation

First things first, we need to generate some data to test our model on.

Let's use our test data that we created before as a base!

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

In [None]:
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [None]:
testset = generator.generate_with_langchain_docs(
    test_split_documents,
    test_size=20,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    raise_exceptions=False
)
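
The `distributions` argument sets the share of each question type. Assuming the generator allocates questions in proportion to these weights (the exact rounding inside Ragas may differ), the nominal split for `test_size=20` works out as follows. `nominal_counts` is a hypothetical helper, and plain strings stand in for the evolution objects.

```python
def nominal_counts(test_size, distributions):
    """Nominal number of questions per evolution type, assuming proportional allocation."""
    return {name: round(test_size * share) for name, share in distributions.items()}

shares = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}

# The shares should sum to 1 so that the whole test set is accounted for.
assert abs(sum(shares.values()) - 1.0) < 1e-9

print(nominal_counts(20, shares))
```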

In [None]:
testset.to_pandas().head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What role do competent authorities play in sup... | [( 50 ) and (EU) 2016/97 ( 51 ) of the Europea... | Competent authorities play a crucial role in s... | simple | [{'source': 'eu_ai_act.html', 'id': '51ea4d92-... | True |
| 1 | How can natural or legal persons lodge a compl... | [(170) Union and national law already provide ... | Natural or legal persons can lodge a complaint... | simple | [{'source': 'eu_ai_act.html', 'id': '9f53f575-... | True |
| 2 | How can the Commission and market surveillance... | [(160) The market surveillance authorities and... | The Commission and market surveillance authori... | simple | [{'source': 'eu_ai_act.html', 'id': 'f360da98-... | True |
| 3 | What requirements are established for internal... | [undertakings and insurance holding companies ... | Requirements regarding internal governance, ar... | simple | [{'source': 'eu_ai_act.html', 'id': '95e97d50-... | True |
| 4 | How are Member States' experts involved in the... | [the preparation of delegated acts, the Europe... | Member States' experts are involved in the pre... | simple | [{'source': 'eu_ai_act.html', 'id': '14147088-... | True |


### Generating Answer Datasets

For each of our pipelines, let's generate answers to these questions!

Once we have our questions, answers, contexts, and ground truths, we can move on to evaluating our datasets!

In [None]:
from datasets import Dataset

def generate_answers(chain, testset):
  """Run `chain` on every testset question and collect the results for evaluation."""
  testset_df = testset.to_pandas()  # convert once and reuse
  questions = testset_df["question"].values.tolist()
  ground_truths = testset_df["ground_truth"].values.tolist()
  answers = []
  contexts = []

  for question in tqdm.tqdm(questions):
    answer = chain.invoke({"question" : question})
    answers.append(answer["response"])
    # Keep only the text of each retrieved document.
    contexts.append([context.page_content for context in answer["context"]])

  return Dataset.from_dict({
      "question" : questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : ground_truths
  })

In [None]:
base_dataset = generate_answers(base_rag_chain, testset)

100%|██████████| 19/19 [00:15<00:00,  1.20it/s]


In [None]:
finetune_dataset = generate_answers(finetune_rag_chain, testset)

100%|██████████| 19/19 [00:18<00:00,  1.03it/s]


### Evaluating Using the Test Set

Now that we have a test set - it's time to evaluate our pipelines with it!

In [None]:
from ragas.metrics import (
    context_recall,
    context_precision,
)
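
As a rough guide to what these two metrics measure, the sketch below implements simplified versions using hand-supplied flags. In Ragas the relevance and attribution judgments come from an LLM, so `context_precision_at_k` and `context_recall_fraction` are hypothetical helpers that illustrate the arithmetic only, and the flags are made up.

```python
def context_precision_at_k(relevance_flags):
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    relevant_seen = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / rank  # precision@rank
    return precision_sum / relevant_seen if relevant_seen else 0.0

def context_recall_fraction(attributed_flags):
    """Share of ground-truth statements that can be attributed to the retrieved context."""
    if not attributed_flags:
        return 0.0
    return sum(attributed_flags) / len(attributed_flags)

# Same number of relevant chunks (2 of 6), but ranked differently.
print(context_precision_at_k([1, 0, 1, 0, 0, 0]))  # relevant chunks near the top
print(context_precision_at_k([0, 0, 1, 1, 0, 0]))  # relevant chunks further down

# One of two ground-truth statements is supported by the retrieved context.
print(context_recall_fraction([1, 0]))
```

Precision rewards placing relevant chunks early in the ranking, while recall only asks whether the needed information was retrieved at all.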

In [None]:
from ragas import evaluate

result = evaluate(
    base_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)

In [None]:
result

{'context_precision': 0.7086, 'context_recall': 0.6898}

In [None]:
result.to_pandas().head()

| | question | contexts | answer | ground_truth | context_precision | context_recall |
|---|---|---|---|---|---|---|
| 0 | What role do competent authorities play in sup... | [of the high-risk AI system with the requireme... | I do not know. | Competent authorities play a crucial role in s... | 0.691667 | 1.0 |
| 1 | How can natural or legal persons lodge a compl... | [of protection or the need for compliance with... | Natural or legal persons can lodge a complaint... | Natural or legal persons can lodge a complaint... | 0.25 | 0.5 |
| 2 | How can the Commission and market surveillance... | [of protection or the need for compliance with... | The Commission and market surveillance authori... | The Commission and market surveillance authori... | 0.5 | 0.5 |
| 3 | What requirements are established for internal... | [For deployers that are financial institutions... | The requirements established for internal gove... | Requirements regarding internal governance, ar... | 1.0 | 0.5 |
| 4 | How are Member States' experts involved in the... | [access to meetings of Commission expert group... | Member States' experts are involved in the pre... | Member States' experts are involved in the pre... | 1.0 | 1.0 |


In [None]:
result = evaluate(
    finetune_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)

In [None]:
result

{'context_precision': 0.7467, 'context_recall': 0.8947}

In [None]:
result.to_pandas().head()

| | question | contexts | answer | ground_truth | context_precision | context_recall |
|---|---|---|---|---|---|---|
| 0 | What role do competent authorities play in sup... | [as defined in Regulation (EU) No 575/2013 of ... | Competent authorities are designated to superv... | Competent authorities play a crucial role in s... | 1.0 | 1.0 |
| 1 | How can natural or legal persons lodge a compl... | [(170) Union and national law already provide ... | Natural or legal persons can lodge a complaint... | Natural or legal persons can lodge a complaint... | 0.833333 | 1.0 |
| 2 | How can the Commission and market surveillance... | [(160) The market surveillance authorities and... | The Commission and market surveillance authori... | The Commission and market surveillance authori... | 1.0 | 1.0 |
| 3 | What requirements are established for internal... | [(158) Union financial services law includes i... | The requirements established for internal gove... | Requirements regarding internal governance, ar... | 1.0 | 0.5 |
| 4 | How are Member States' experts involved in the... | [access to meetings of Commission expert group... | Member States' experts are involved in the pre... | Member States' experts are involved in the pre... | 1.0 | 1.0 |
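
To put the two runs side by side, the following sketch computes the absolute and relative change for each metric. The scores are copied from the two `result` outputs in this section; nothing is re-evaluated, and `score_deltas` is a hypothetical helper.

```python
# Aggregate scores copied from the outputs above.
base_scores = {"context_precision": 0.7086, "context_recall": 0.6898}
finetune_scores = {"context_precision": 0.7467, "context_recall": 0.8947}

def score_deltas(before, after):
    """Absolute and relative change per metric when moving from `before` to `after`."""
    return {
        metric: {
            "absolute": after[metric] - before[metric],
            "relative": (after[metric] - before[metric]) / before[metric],
        }
        for metric in before
    }

for metric, delta in score_deltas(base_scores, finetune_scores).items():
    print(f"{metric}: {delta['absolute']:+.4f} absolute, {delta['relative']:+.1%} relative")
```

On these numbers, context recall improves by about 0.20 while context precision moves by about 0.04, so most of the gain from fine-tuning shows up in recall.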


#### 🏗️ Activity #3:

Discuss changes that you'd make to this pipeline based on the performance improvements that you see with RAGAS and the fine-tuning.

Come up with 3 changes, and then we'll discuss these options as a group!

1. We could implement additional retrieval techniques, such as Contextual Compression and `MultiQueryRetriever`, to address precision and recall, like we did in the past. With LangChain, these cost very little to implement.
2. We could attempt to improve the quality of the fine-tune by, for example, increasing the value of `n_questions` in the question generator chain and/or simply increasing the number of training examples. On a related note, I'm curious about the comparative performance of the fine-tuning vs. non-fine-tuning approaches (i.e. option 2 vs. option 1 above), especially from a cost-benefit perspective.
3. We could take another look at the actual fine-tuning training process, including the choice of base model and embedding size, and basically go bigger.