# RAG Series Part 1: How to choose the right embedding model for your RAG application

This notebook evaluates the [voyage-lite-02-instruct](https://docs.voyageai.com/embeddings/) model.

## Step 1: Install required libraries

* **datasets**: Python library to get access to datasets available on Hugging Face Hub
* **voyageai**: Python library to interact with Voyage AI APIs
* **sentence-transformers**: Framework for working with text and image embeddings
* **numpy**: Python library that provides tools to perform mathematical operations on arrays
* **pandas**: Python library for data analysis, exploration and manipulation
* **tqdm**: Python module to show a progress meter for loops

In [2]:
! pip install -qU datasets voyageai sentence-transformers numpy pandas tqdm

## Step 2: Set up prerequisites

Enter your Voyage API key when prompted, then initialize the Voyage AI client with it.

Steps to obtain a Voyage AI API Key can be found [here](https://docs.voyageai.com/docs/api-key-and-installation).

In [3]:
import os
import getpass
import voyageai

In [4]:
VOYAGE_API_KEY = getpass.getpass("Voyage API Key:")
voyage_client = voyageai.Client(api_key=VOYAGE_API_KEY)

Voyage API Key:········
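
If you would rather not type the key on every run, one option is to look for it in an environment variable first and only prompt when it is missing. The following is a minimal sketch: the helper name `resolve_api_key` and the variable name `VOYAGE_API_KEY` are assumptions made for illustration and are not part of the Voyage AI client.

```python
import getpass
import os


def resolve_api_key(env_var: str = "VOYAGE_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    # env_var is a hypothetical variable name chosen for this sketch
    key = os.environ.get(env_var)
    if not key:
        # Fall back to the same interactive prompt used in the notebook
        key = getpass.getpass("Voyage API Key:")
    return key
```

The returned value can then be passed to `voyageai.Client(api_key=...)` in the same way as `VOYAGE_API_KEY` above.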


## Step 3: Download the evaluation dataset

We will use MongoDB's [cosmopedia-wikihow-chunked](https://huggingface.co/datasets/MongoDB/cosmopedia-wikihow-chunked) dataset, which has chunked versions of WikiHow articles from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset released by Hugging Face. The dataset is large, so we will only use the first 25k records for testing.

In [5]:
from datasets import load_dataset
import pandas as pd

# Use streaming=True to load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
# Get first 25k records from the dataset
data_head = data.take(25000)
df = pd.DataFrame(data_head)

# Use this if you want the full dataset
# data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train")
# df = pd.DataFrame(data)

## Step 4: Data analysis

Confirm that the dataset has the expected 25k records, preview the data, and drop any records with missing text.

In [6]:
# Ensuring length of dataset is what we expect i.e. 25k
len(df)

25000

In [7]:
# Previewing the contents of the data
df.head()

Unnamed: 0,doc_id,chunk_id,text_token_length,text
0,0,0,180,Title: How to Create and Maintain a Compost Pi...
1,0,1,141,**Step 2: Gather Materials**\nGather brown (ca...
2,0,2,182,_Key guideline:_ For every volume of green mat...
3,0,3,188,_Key tip:_ Chop large items like branches and ...
4,0,4,157,**Step 7: Maturation and Use**\nAfter 3-4 mont...


In [8]:
# Only keep records where the text field is not null
df = df[df["text"].notna()]

In [9]:
# Number of unique documents in the dataset
df.doc_id.nunique()

4335
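
Since the 25k chunks are spread across a few thousand documents, it can also help to check how many chunks each document has and how large the chunks are. The sketch below runs on a small made-up frame that only borrows the column names from the preview above; the values are invented for illustration, and the same two calls could be applied to `df`.

```python
import pandas as pd

# Toy frame with the same column names as the dataset preview (values are made up)
toy_df = pd.DataFrame(
    {
        "doc_id": [0, 0, 0, 1, 1],
        "chunk_id": [0, 1, 2, 0, 1],
        "text_token_length": [180, 141, 182, 120, 95],
        "text": ["a", "b", "c", "d", "e"],
    }
)

# Number of chunks per document
chunks_per_doc = toy_df.groupby("doc_id")["chunk_id"].count()
# Summary statistics of chunk sizes in tokens
token_stats = toy_df["text_token_length"].describe()

print(chunks_per_doc.to_dict())
print(token_stats[["min", "max", "mean"]].to_dict())
```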

## Step 5: Creating embeddings

Define the embedding function, and run a quick test.

In [10]:
from typing import List

In [11]:
def get_embeddings(docs: List[str], input_type: str, model:str="voyage-lite-02-instruct") -> List[List[float]]:
    """
    Get embeddings using the Voyage AI API.
    
    Args:
        docs (List[str]): List of texts to embed
        input_type (str): Type of input to embed. Can be "document" or "query".
        model (str, optional): Model name. Defaults to "voyage-lite-02-instruct".

    Returns:
        List[List[float]]: Array of embeddings
    """
    response = voyage_client.embed(docs, model=model, input_type=input_type)
    return response.embeddings

In [12]:
# Generating a test embedding
test_voyageai_embed = get_embeddings([df.iloc[0]["text"]], "document")

In [13]:
# Sanity check to make sure embedding dimensions are as expected i.e. 1024
len(test_voyageai_embed[0])

1024
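
Calls to a hosted embedding API can fail transiently, for example because of rate limits or network errors. One common way to handle this is to retry with exponential backoff. The helper below is a hypothetical sketch, not something the Voyage AI client or this notebook provides; `with_retries`, `max_attempts`, and `base_delay` are names introduced here for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: re-raise the last error to the caller
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Assuming the `get_embeddings` function defined above, a call could be wrapped as `with_retries(lambda: get_embeddings(batch, "document"))`. Catching every `Exception` keeps the sketch short; in practice you would likely narrow this to the errors you expect to be transient.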

## Step 6: Evaluation

### Measuring embedding latency

Create a local vector store (list) of embeddings for the entire dataset.

In [14]:
from tqdm.auto import tqdm

In [15]:
texts = df["text"].tolist()

In [16]:
batch_size = 128

In [17]:
embeddings = []
# Generate embeddings in batches
for i in tqdm(range(0, len(texts), batch_size)):
    end = min(len(texts), i+batch_size)
    batch = texts[i:end]
    # Generate embeddings for current batch
    batch_embeddings = get_embeddings(batch, "document")
    # Add to the list of embeddings
    embeddings.extend(batch_embeddings)

  0%|          | 0/196 [00:00<?, ?it/s]
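
The progress bar reports the total elapsed time, but it does not keep per-batch numbers. If you want to record latency explicitly, one possible approach is to time each batch with `time.perf_counter`. This is a sketch under stated assumptions: `embed_with_timing` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so that the example runs without an API key.

```python
import time
from typing import Callable, List, Tuple


def embed_with_timing(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,
) -> Tuple[List[List[float]], List[float]]:
    """Embed texts in batches and record the wall-clock time of each batch."""
    embeddings: List[List[float]] = []
    latencies: List[float] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        start = time.perf_counter()
        embeddings.extend(embed_fn(batch))
        latencies.append(time.perf_counter() - start)
    return embeddings, latencies


# Stand-in embedder (hypothetical): one-dimensional "embedding" per text
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


vectors, latencies = embed_with_timing(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(f"{len(latencies)} batches, {sum(latencies):.6f}s total")
```

With the real client, `embed_fn` could be `lambda batch: get_embeddings(batch, "document")`. Note that the measured time then includes network round trips, not only model inference.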

### Measuring retrieval quality

* Create an embedding for the user query
* Get the top 3 most similar documents (the default `top_k`) from the local vector store, using cosine similarity as the similarity metric

In [20]:
import numpy as np
from sentence_transformers.util import cos_sim

In [21]:
# Convert the embeddings list to a NumPy array, which is required to calculate cosine similarity
embeddings = np.asarray(embeddings)

In [22]:
def query(query: str, top_k: int=3) -> None:
    """
    Query the local vector store and print the top_k most relevant documents.

    Args:
        query (str): User query
        top_k (int, optional): Number of documents to return. Defaults to 3.
    """
    # Generate embedding for the user query
    query_emb = np.asarray(get_embeddings([query], "query"))
    # Calculate cosine similarity
    scores = cos_sim(query_emb, embeddings)[0]
    # Get indices of the top k records
    idxs = np.argsort(-scores)[:top_k]

    print(f"Query: {query}")
    for idx in idxs:
        print(f"Score: {scores[idx]:.4f}")
        print(texts[idx])
        print("--------")
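
To make the ranking step concrete, here is a minimal NumPy-only sketch of what cosine-similarity retrieval computes: normalize the vectors, take dot products, and sort in descending order. The function name `top_k_cosine` and the toy vectors are assumptions for illustration; the sketch also assumes that no vector has zero length.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3):
    """Return (indices, scores) of the top_k rows most similar to query_vec."""
    # Normalize to unit length so that the dot product equals cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    # Sort in descending order of score and keep the first top_k indices
    idxs = np.argsort(-scores)[:top_k]
    return idxs, scores[idxs]


# Toy two-dimensional "embeddings" (made up for this example)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idxs, scores = top_k_cosine(np.array([1.0, 0.0]), docs, top_k=2)
print(idxs, scores)
```

Here the first document points in the same direction as the query and ranks first, followed by the third document, which sits at a 45-degree angle to it.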

In [26]:
query("Give me some tips to improve my mental health.")

Query: Give me some tips to improve my mental health.
Score: 0.9284
Key Tips:

* Learn to recognize early signs of stress and address them proactively.
* Share concerns with trusted friends, family members, or mental health professionals.
* Develop a list of coping mechanisms to deploy during high-stress periods.

Step 6: Cultivate Social Connections
Isolation can worsen depression. Nurture relationships with loved ones and participate in social events to foster a sense of belonging.

* Join clubs, groups, or communities centered around shared interests.
* Schedule regular phone calls or video chats with distant friends and relatives.
* Volunteer for causes close to your heart.

Key Guidelines:

* Set boundaries when necessary to protect your emotional well-being.
* Communicate openly about your struggles with trusted confidants.
* Seek professional guidance if social anxiety impedes relationship development.
--------
Score: 0.9247
It's crucial to consult a licensed therapist, psychiat

In [28]:
query("Give me some tips for writing good code.")

Query: Give me some tips for writing good code.
Score: 0.9201
Step 6: Improve Code Quality
Strive for clean, readable, maintainable code. Adopt consistent naming conventions, indentation styles, and formatting rules. Utilize version control systems like Git to track changes and collaborate effectively. Leverage linters and static analyzers to enforce style guides automatically. Document your work using comments and dedicated documentation tools. High-quality code facilitates collaboration, promotes longevity, and simplifies troubleshooting.

Step 7: Embrace Best Practices
Follow established best practices relevant to your chosen language and domain. Examples include Object-Oriented Design Principles, SOLID principles, Test-Driven Development (TDD), Dependency Injection, Asynchronous Programming, etc. While seemingly overwhelming initially, integrating them gradually enhances design patterns, scalability, and extensibility. Consult authoritative blogs, books, and articles to stay update

In [1]:
query("How to create a basic webpage?")

In [39]:
query("What are some environment-friendly practices I can incorporate in everyday life?")

Query: What are some environment-friendly practices I can incorporate in everyday life?
Score: 0.9389
By consistently implementing these steps, every individual can actively contribute to helping the world become a cleaner, greener, and more resilient place for future generations.
--------
Score: 0.9352
Step 9: Recycle Properly
Familiarize yourself with local recycling programs and sort materials accordingly. Rinse containers and remove caps if necessary. Key tip: Never place non-recyclable items in bins. Guideline: Educate family members on proper recycling techniques.

Step 10: Green Transportation Options
Choose walking, cycling, public transit, carpooling, or electric vehicles over gasoline-powered cars. Combine errands to minimize trips. Key tip: Schedule regular vehicle maintenance checks to maximize efficiency. Guideline: Investigate incentives for green transportation options in your area.

Step 11: Energy Efficiency Upgrades
Replace incandescent bulbs with LED lights, install 