# RAG Series Part 1: How to choose the right embedding model for your RAG application

This notebook evaluates the [voyage-lite-02-instruct](https://docs.voyageai.com/embeddings/) model.

## Step 1: Install required libraries

* **datasets**: Library to get access to datasets available on Hugging Face Hub
* **voyageai**: Library to interact with Voyage AI's models via their APIs
* **sentence-transformers**: Framework for working with text and image embeddings

In [1]:
! pip install -qU datasets voyageai sentence-transformers

## Step 2: Set up prerequisites

Prompt for the Voyage API key and use it to initialize the Voyage AI client.

Steps to obtain a Voyage AI API Key can be found [here](https://docs.voyageai.com/docs/api-key-and-installation).

In [None]:
import os
import getpass
import voyageai

In [None]:
VOYAGE_API_KEY = getpass.getpass("Voyage API Key:")
voyage_client = voyageai.Client(api_key=VOYAGE_API_KEY)
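
To avoid typing the key on every run, one option is to look for it in an environment variable first and prompt only when it is missing. The sketch below assumes the variable is named `VOYAGE_API_KEY`; the helper `resolve_api_key` and its `prompt` parameter are hypothetical names introduced here for illustration.

```python
import os
import getpass
from typing import Callable


def resolve_api_key(
    env_var: str = "VOYAGE_API_KEY",  # assumed variable name
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key from the environment, or prompt for it if unset."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt that does not echo input
        key = prompt("Voyage API Key:")
    return key
```

The returned value can then be passed to `voyageai.Client(api_key=...)` as in the cell above.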

## Step 3: Download the evaluation dataset

We will use MongoDB's [cosmopedia-wikihow-chunked](https://huggingface.co/datasets/MongoDB/cosmopedia-wikihow-chunked) dataset, which contains chunked versions of WikiHow articles from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset released by Hugging Face. The dataset is large, so we stream it and take only the first 25k records for testing.

In [None]:
from datasets import load_dataset
import pandas as pd

data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(25000)
df = pd.DataFrame(data_head)

# Use this if you want the full dataset
# data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train")
# df = pd.DataFrame(data)

## Step 4: Data Analysis

Confirm that the dataset has the expected 25k rows, preview the data, and drop records with missing text.

In [None]:
# Ensuring length of dataset is what we expect i.e. 25k
len(df)

In [None]:
# Previewing the contents of the data
df.head()

In [None]:
# Only keep records where the text field is not null
df = df[df["text"].notna()]

In [None]:
# Number of unique documents in the dataset
df.doc_id.nunique()
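
Since each article is split into several chunks, it can also help to look at how many chunks each document contributes. The following is a minimal sketch on a toy DataFrame; it assumes only the `doc_id` and `text` columns used above, and the sample rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "doc_id": [0, 0, 0, 1, 1, 2],
        "text": ["a", "b", None, "c", "d", "e"],
    }
)

# Drop rows with missing text, mirroring the cleanup above
toy_df = toy_df[toy_df["text"].notna()]

# Number of chunks per document, then summary statistics
chunks_per_doc = toy_df.groupby("doc_id").size()
print(chunks_per_doc.describe())
```

Running the same `groupby` on `df` would show whether a few long articles dominate the sample.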

## Step 5: Creating Embeddings

In [None]:
from typing import List

In [None]:
def get_voyage_embeddings(docs: List[str], input_type: str, model: str="voyage-lite-02-instruct") -> List[List[float]]:
    """Embed a list of texts; input_type is "document" or "query"."""
    response = voyage_client.embed(docs, model=model, input_type=input_type)
    return response.embeddings

In [None]:
test_voyageai_embed = get_voyage_embeddings([df.iloc[0]["text"]], "document")

In [None]:
# Sanity check to make sure embedding dimensions are as expected i.e. 1024
len(test_voyageai_embed[0])

## Step 6: Measuring Embedding Latency

Create a local vector store (a list) of embeddings for the entire dataset, embedding the texts in batches. The `tqdm` progress bar shows the elapsed time.

In [None]:
from tqdm.auto import tqdm

In [None]:
texts = df["text"].tolist()

In [None]:
batch_size = 128

In [None]:
embeddings = []
for i in tqdm(range(0, len(texts), batch_size)):
    end = min(len(texts), i+batch_size)
    batch = texts[i:end]
    batch_embeddings = get_voyage_embeddings(batch, "document")
    embeddings.extend(batch_embeddings)
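
The progress bar reports elapsed time, but latency can also be recorded explicitly. The sketch below times each batch with `time.perf_counter`. To stay runnable without an API key, it uses a hypothetical stand-in, `fake_embed`, in place of `get_voyage_embeddings`; `embed_with_timing` is likewise a name introduced here for illustration.

```python
import time
from typing import Callable, List, Tuple


def fake_embed(docs: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding API call."""
    return [[float(len(doc))] for doc in docs]


def embed_with_timing(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,
) -> Tuple[List[List[float]], List[float]]:
    """Embed texts in batches and record the wall-clock time of each batch."""
    embeddings: List[List[float]] = []
    batch_seconds: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        t0 = time.perf_counter()
        embeddings.extend(embed_fn(batch))
        batch_seconds.append(time.perf_counter() - t0)
    return embeddings, batch_seconds


sample_texts = [f"doc {i}" for i in range(300)]
vectors, timings = embed_with_timing(sample_texts, fake_embed)
print(f"{len(vectors)} texts in {len(timings)} batches, "
      f"total {sum(timings):.4f}s")
```

With the real client, `embed_fn` could be `lambda batch: get_voyage_embeddings(batch, "document")`, which keeps the timing logic separate from the API call.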

## Step 7: Measuring Retrieval Quality

* Create an embedding for the user query
* Get the top 3 most similar documents from the local vector store using cosine similarity as the similarity metric

In [None]:
import numpy as np
from sentence_transformers.util import cos_sim

In [None]:
embeddings = np.asarray(embeddings)

In [None]:
def query(query: str, top_k: int=3):
    query_emb = np.asarray(get_voyage_embeddings([query], "query"))
    scores = cos_sim(query_emb, embeddings)[0]
    # Indices of the top_k highest-scoring documents
    idxs = np.argsort(-scores)[:top_k]

    print(f"Query: {query}")
    for idx in idxs:
        print(f"Score: {scores[idx]:.4f}")
        print(texts[idx])
        print("--------")

In [None]:
query("Hello World")
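
To make the ranking step concrete, here is a minimal sketch of cosine-similarity retrieval on toy vectors, written in plain Python so the arithmetic is visible. The helper names `cosine` and `top_k_indices` and the sample vectors are illustrative assumptions, not part of the notebook.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms.

    Assumes both vectors are non-zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-dimensional "embeddings" (illustrative values only)
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [2.0, 0.1]
print(top_k_indices(query_vec, docs, k=2))
```

Cosine similarity ignores vector length, so for unit-length embeddings it produces the same ranking as a plain dot product.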