# RAG Series Part 1: How to choose the right embedding model for your RAG application

This notebook evaluates the [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) model.

## Step 1: Install required libraries

* **datasets**: Library for accessing datasets available on the Hugging Face Hub
* **angle-emb**: Library for working with AnglE embedding models
* **sentence-transformers**: Framework for working with text and image embeddings

In [1]:
! pip install -qU datasets angle-emb sentence-transformers

## Step 3: Download the evaluation dataset

We will use MongoDB's [cosmopedia-wikihow-chunked](https://huggingface.co/datasets/MongoDB/cosmopedia-wikihow-chunked) dataset, which contains chunked versions of WikiHow articles from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset released by Hugging Face. The dataset is large, so we will only take the first 25k records for testing.

In [2]:
from datasets import load_dataset
import pandas as pd

data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(25000)
df = pd.DataFrame(data_head)

# Use this if you want the full dataset
# data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train")
# df = pd.DataFrame(data)


## Step 4: Data Analysis

Check that the dataset has the expected 25k rows, preview the data, and drop rows where the text is missing.

In [3]:
# Ensuring length of dataset is what we expect i.e. 25k
len(df)

25000

In [4]:
# Previewing the contents of the data
df.head()

| | doc_id | chunk_id | text_token_length | text |
|---|---|---|---|---|
| 0 | 0 | 0 | 180 | `Title: How to Create and Maintain a Compost Pi...` |
| 1 | 0 | 1 | 141 | `**Step 2: Gather Materials**\nGather brown (ca...` |
| 2 | 0 | 2 | 182 | `_Key guideline:_ For every volume of green mat...` |
| 3 | 0 | 3 | 188 | `_Key tip:_ Chop large items like branches and ...` |
| 4 | 0 | 4 | 157 | `**Step 7: Maturation and Use**\nAfter 3-4 mont...` |


In [5]:
# Only keep records where the text field is not null
df = df[df["text"].notna()]

In [6]:
# Number of unique documents in the dataset
df.doc_id.nunique()

4335
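Beyond the unique-document count, it can help to see how many chunks each document contributes. The sketch below uses a handful of made-up `(doc_id, chunk_id)` pairs rather than the real dataset, and the helper name `chunks_per_doc` is hypothetical. With the DataFrame above, `df.groupby("doc_id").size()` should give the same kind of summary.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def chunks_per_doc(rows: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Count how many chunks belong to each doc_id."""
    return dict(Counter(doc_id for doc_id, _chunk_id in rows))


# Made-up (doc_id, chunk_id) rows for illustration only, not taken from the dataset
sample_rows = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]

counts = chunks_per_doc(sample_rows)
print(counts)
print("unique documents:", len(counts))
```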

## Step 5: Creating Embeddings

Define the embedding function for UAE-Large-V1 and run a sanity check on the embedding dimensions.

In [7]:
from angle_emb import AnglE, Prompts
from typing import List

In [8]:
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
angle.set_prompt(prompt=Prompts.C)

INFO:AnglE:Prompt is set, the prompt will be automatically applied during the encoding phase. To disable prompt setting, please configure set_prompt(prompt=None)


In [9]:
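def get_angle_embeddings(docs: List[str]) -> List[List[float]]:
    # With a prompt set, each input is passed as a dict with a "text" key
    docs = [{"text": doc} for doc in docs]
    # to_numpy=True returns the embeddings as a NumPy array, one row per input
    response = angle.encode(docs, to_numpy=True)
    return response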

In [10]:
# Testing embedding using the AnglE embedding model
test_angle_embed = get_angle_embeddings([df.iloc[0]["text"]])

In [11]:
# Sanity check to make sure embedding dimensions are as expected i.e. 1024
len(test_angle_embed[0])

1024

## Step 6: Measuring Embedding Latency

Embed the entire dataset in batches and collect the results in a local index (list). The tqdm progress bar reports the elapsed time and the rate per batch.

In [12]:
from tqdm.auto import tqdm

In [13]:
texts = df["text"].tolist()

In [14]:
batch_size = 128

In [None]:
embeddings = []
for i in tqdm(range(0, len(texts), batch_size)):
    end = min(len(texts), i+batch_size)
    batch = texts[i:end]
    batch_embeddings = get_angle_embeddings(batch)
    embeddings.extend(batch_embeddings)
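
The loop above relies on the progress bar for timing. If you want explicit numbers, one option is to time each batch with `time.perf_counter`. The sketch below is only an illustration: `fake_embed` is a hypothetical stand-in for `get_angle_embeddings` so that the block runs without a GPU, and `time_batches` is a helper name introduced here.

```python
import time
from typing import Callable, List, Sequence, Tuple


def fake_embed(docs: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for get_angle_embeddings: one 4-dim vector per doc."""
    return [[float(len(doc)), 0.0, 0.0, 0.0] for doc in docs]


def time_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int,
) -> Tuple[List[List[float]], List[float]]:
    """Embed texts in batches and record the wall-clock seconds spent on each batch."""
    embeddings: List[List[float]] = []
    latencies: List[float] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]  # slicing clips the final, shorter batch
        start = time.perf_counter()
        embeddings.extend(embed_fn(batch))
        latencies.append(time.perf_counter() - start)
    return embeddings, latencies


sample_texts = [f"chunk {n}" for n in range(10)]
embs, lats = time_batches(sample_texts, fake_embed, batch_size=4)
print(f"{len(embs)} embeddings in {len(lats)} batches")
print(f"mean seconds per batch: {sum(lats) / len(lats):.6f}")
```

Swapping `fake_embed` for `get_angle_embeddings` and `sample_texts` for `texts` would time the real model under the same batching scheme.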

## Step 7: Measuring Retrieval Quality

* Create an embedding for the user query
* Get the top 3 most similar documents from the local vector store using cosine similarity as the similarity metric
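
To make the ranking step concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. The vectors are toy values and the function names `cosine` and `top_k_indices` are hypothetical; the notebook itself uses `cos_sim` from sentence-transformers. Cosine similarity is the dot product divided by both vector norms, so the two metrics coincide when the embeddings are normalized to unit length.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 3
) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-dimensional vectors for illustration only
docs = [
    [1.0, 0.0],    # same direction as the query
    [10.0, 10.0],  # large norm, 45 degrees away from the query
    [0.0, 1.0],    # orthogonal to the query
    [2.0, 0.5],    # close in direction to the query
]
query_vec = [1.0, 0.0]

print(top_k_indices(query_vec, docs, k=2))  # [0, 3]
```

A raw dot product would rank the second vector first here because of its large norm, which is why the choice of metric matters when embeddings are not normalized.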

In [16]:
import numpy as np
from sentence_transformers.util import cos_sim

In [None]:
embeddings = np.asarray(embeddings)

In [None]:
def query(query: str, top_k: int=3):
    query_emb = np.asarray(get_angle_embeddings([query]))
    scores = cos_sim(query_emb, embeddings)[0]
    idxs = np.argsort(-scores)[:top_k]

    print(f"Query: {query}")
    for idx in idxs:
        print(f"Score: {scores[idx]:.4f}")
        print(texts[idx])
        print("--------")

In [None]:
query("Hello World")