[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/evals/openai-embeddings-eval.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/choose-embedding-model-rag/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)

# How to choose the right embedding model for your RAG application

This notebook evaluates the [text-embedding-3-large](https://openai.com/blog/new-embedding-models-and-api-updates) model.


## Step 1: Install required libraries

- **datasets**: Python library to get access to datasets available on Hugging Face Hub
- **openai**: Python library to interact with OpenAI APIs
- **numpy**: Python library that provides tools to perform mathematical operations on arrays
- **pandas**: Python library for data analysis, exploration and manipulation
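- **tqdm**: Python module to show a progress meter for loops
- **sentence-transformers**: Python library that provides the `cos_sim` utility used in Step 6


In [49]:
! pip install -qU datasets openai numpy pandas tqdm sentence-transformers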

## Step 2: Set up prerequisites

Set the OpenAI API key as an environment variable and initialize the OpenAI client.

Steps to obtain an OpenAI API key can be found [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).


In [50]:
import getpass
import os

from openai import OpenAI

In [51]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
openai_client = OpenAI()

OpenAI API Key: ········
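When the notebook is re-run often, it may be convenient to prompt only if the key is not already set. The following is a minimal sketch of that idea; `ensure_api_key` is a hypothetical helper (not part of the OpenAI SDK), and the injectable `prompt` argument is an assumption made so the function can be exercised without user input.

```python
import getpass
import os
from typing import Callable


def ensure_api_key(
    var_name: str = "OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Ask once and store the value so later cells can reuse it
        key = prompt("OpenAI API Key:")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` before `OpenAI()` would leave an existing `OPENAI_API_KEY` untouched.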


## Step 3: Download the evaluation dataset

We will use MongoDB's [cosmopedia-wikihow-chunked](https://huggingface.co/datasets/MongoDB/cosmopedia-wikihow-chunked) dataset, which contains chunked versions of WikiHow articles from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset released by Hugging Face. The dataset is large, so we will only take the first 2,000 records for testing.


In [52]:
import pandas as pd
from datasets import load_dataset

# Use streaming=True to load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
# Get first 2k records from the dataset
data_head = data.take(2000)
df = pd.DataFrame(data_head)

# Use this if you want the full dataset
# data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train")
# df = pd.DataFrame(data)

## Step 4: Data analysis

Check that the dataset has the expected number of rows (2,000), preview the data, and drop rows with missing text.


In [53]:
# Ensuring length of dataset is what we expect i.e. 2k
len(df)

2000

In [54]:
# Previewing the contents of the data
df.head()

| | doc_id | chunk_id | text_token_length | text |
|---|---|---|---|---|
| 0 | 0 | 0 | 180 | Title: How to Create and Maintain a Compost Pi... |
| 1 | 0 | 1 | 141 | **Step 2: Gather Materials**\nGather brown (ca... |
| 2 | 0 | 2 | 182 | _Key guideline:_ For every volume of green mat... |
| 3 | 0 | 3 | 188 | _Key tip:_ Chop large items like branches and ... |
| 4 | 0 | 4 | 157 | **Step 7: Maturation and Use**\nAfter 3-4 mont... |


In [55]:
# Only keep records where the text field is not null
df = df[df["text"].notna()]

In [56]:
# Number of unique documents in the dataset
df.doc_id.nunique()

352

## Step 5: Creating embeddings

Define the embedding function, and run a quick test.


In [57]:
from typing import List

In [58]:
def get_embeddings(text: str) -> List[float]:
    """
    Generate embeddings using the OpenAI API.

    Args:
        text (str): Text to embed

    Returns:
        List[float]: Embedding vector
    """
    text = text.replace("\n", " ")
    response = openai_client.embeddings.create(
        input=text, model="text-embedding-3-large"
    )
    return response.data[0].embedding

In [59]:
# Generating a test embedding
test_openai_embed = get_embeddings(df.iloc[0]["text"])

In [60]:
# Sanity check to make sure embedding dimensions are as expected i.e. 3072
len(test_openai_embed)

3072
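The embeddings endpoint also accepts a list of strings in a single request, which can reduce the number of round trips when embedding many chunks. The sketch below shows only the batching logic. `embed_batch` is a hypothetical callable standing in for a wrapper around `openai_client.embeddings.create(input=batch, model="text-embedding-3-large")`, and the default batch size of 100 is an arbitrary assumption rather than a documented limit.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        # Same newline normalization as in get_embeddings
        cleaned = [t.replace("\n", " ") for t in batch]
        result = embed_batch(cleaned)
        if len(result) != len(cleaned):
            raise ValueError("embed_batch returned an unexpected number of vectors")
        vectors.extend(result)
    return vectors
```

Passing the embedding call in as a parameter keeps the batching logic independent of the client, so it can be checked with a stub before spending API credits.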

## Step 6: Evaluation


### Measuring embedding latency

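Create a local vector store (a Python list) of embeddings for the entire dataset. The elapsed time and per-item rate shown by the `tqdm` progress bar serve as a rough measure of embedding latency.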


In [61]:
import numpy as np
from tqdm.auto import tqdm

In [62]:
texts = df["text"].tolist()

In [63]:
embeddings = []
# Generate embeddings
for text in tqdm(texts):
    embedding = get_embeddings(text)
    # Add to the list of embeddings
    embeddings.append(np.array(embedding))

  0%|          | 0/2000 [00:00<?, ?it/s]
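For a per-request view of latency, each call can also be timed explicitly. This is a minimal sketch, assuming `embed_fn` is any callable that embeds a single string (for example `get_embeddings`). `measure_latency` is a hypothetical helper, and the 95th percentile is a simple nearest-rank estimate.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def measure_latency(
    texts: Sequence[str], embed_fn: Callable[[str], List[float]]
) -> Dict[str, float]:
    """Time each embedding call and summarize the latencies in seconds."""
    if not texts:
        raise ValueError("texts must not be empty")
    latencies = []
    for text in texts:
        start = time.perf_counter()
        embed_fn(text)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank 95th percentile over the sorted latencies
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "count": float(len(latencies)),
        "mean_s": sum(latencies) / len(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }
```

Running it on a small sample such as `measure_latency(texts[:50], get_embeddings)` would give an estimate without re-embedding the whole dataset.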

### Measuring retrieval quality

- Create embedding for the user query
- Get the top 3 most similar documents (the default `top_k`) from the local vector store, using cosine similarity as the similarity metric
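
For reference, cosine similarity and top-k selection can also be written directly in NumPy. This is a sketch under the assumption that the stored embeddings form a 2-D array with one row per document. `top_k_cosine` is a hypothetical helper and is not the `cos_sim` function that the notebook cells use.

```python
from typing import Tuple

import numpy as np


def top_k_cosine(
    query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3
) -> Tuple[np.ndarray, np.ndarray]:
    """Return (indices, scores) of the top_k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    # Sort descending and keep the first top_k indices
    idxs = np.argsort(-scores)[:top_k]
    return idxs, scores[idxs]
```

The function returns the indices and scores instead of printing them, which makes the results easier to inspect or compare across embedding models.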


In [None]:
from sentence_transformers.util import cos_sim

In [43]:
# Convert the list of embeddings to a NumPy array, which is required to calculate cosine similarity
embeddings = np.asarray(embeddings)

In [44]:
def query(user_query: str, top_k: int = 3) -> None:
    """
    Query the local vector store for the top_k most relevant documents.

    Args:
        user_query (str): User query
        top_k (int, optional): Number of documents to return. Defaults to 3.
    """
    # Generate embedding for the user query
    query_emb = np.asarray(get_embeddings(user_query))
    # Calculate cosine similarity; cos_sim returns a tensor of shape (1, num_docs),
    # so take the first row and convert it to a NumPy array
    scores = cos_sim(query_emb, embeddings)[0].numpy()
    # Get indices of the top k records, highest score first
    idxs = np.argsort(-scores)[:top_k]

    print(f"Query: {user_query}")
    for idx in idxs:
        print(f"Score: {scores[idx]:.4f}")
        print(texts[idx])
        print("--------")

In [45]:
query("Give me some tips to improve my mental health.")

Query: Give me some tips to improve my mental health.
Score: 0.5319
Step 2: Reach Out to Someone Trustworthy
Connect with someone who cares about you—a friend, family member, mental health professional, or support group. Talking openly about your struggles can lighten your emotional burden, making it easier to manage. Sharing your feelings may also lead to helpful suggestions and advice from others.

Guideline: Make sure to choose someone trustworthy who has shown empathy towards you before. If possible, reach out to more than one person to build a strong support network around you.

Step 3: Engage in Physical Activity
Exercise releases endorphins, which improve mood and reduce stress levels. Go for a walk, jog, bike ride, swim, or engage in any physical activity that suits your abilities. Exercising regularly can significantly impact overall wellbeing by reducing symptoms of depression and anxiety.

Key Tip: Start small - aim for just five minutes of exercise if needed, then gradually

In [46]:
query("Give me some tips for writing good code.")

Query: Give me some tips for writing good code.
Score: 0.5893
Step 6: Improve Code Quality
Strive for clean, readable, maintainable code. Adopt consistent naming conventions, indentation styles, and formatting rules. Utilize version control systems like Git to track changes and collaborate effectively. Leverage linters and static analyzers to enforce style guides automatically. Document your work using comments and dedicated documentation tools. High-quality code facilitates collaboration, promotes longevity, and simplifies troubleshooting.

Step 7: Embrace Best Practices
Follow established best practices relevant to your chosen language and domain. Examples include Object-Oriented Design Principles, SOLID principles, Test-Driven Development (TDD), Dependency Injection, Asynchronous Programming, etc. While seemingly overwhelming initially, integrating them gradually enhances design patterns, scalability, and extensibility. Consult authoritative blogs, books, and articles to stay update

In [47]:
query("How do I create a basic webpage?")

Query: How do I create a basic webpage?
Score: 0.4838
Step 6: Choose a Template
After logging in, select a template that suits your preferences by browsing through various categories such as business, personal, blog, etc., located on the left sidebar under "Template Categories". Once you find a suitable design, hover over it and click on the green "Use this template" button below the preview image.

Step 7: Customize Your Site
You can customize different aspects of your site like its layout, color scheme, background image, font styles, and more via the editor dashboard on the left panel. Remember to save changes made before navigating away from any editing screen.

Step 8: Add Pages
To add pages to your website, go to the "Pages" tab on the editor dashboard. Here, choose between predefined page types (e.g., Home, About Us, Services) or create custom ones according to your requirements. Don't forget to assign appropriate titles and URL slugs to these pages.
--------
Score: 0.4658
Step 9

In [48]:
query(
    "What are some environment-friendly practices I can incorporate in everyday life?"
)

Query: What are some environment-friendly practices I can incorporate in everyday life?
Score: 0.6116
a) Use public transportation, carpool, bike, walk, or telecommute whenever possible to minimize fuel consumption.
b) If purchasing a vehicle, consider electric or hybrid options that emit fewer greenhouse gases than traditional gasoline-powered cars.
c) Improve home energy efficiency through insulation, LED lighting, Energy Star appliances, and renewable energy sources like solar panels.
d) Limit air travel and opt for video conferencing when feasible.
e) Be mindful of your dietary choices – consume less red meat, choose locally sourced foods, and reduce food waste.
f) Plant trees and support reforestation projects as they absorb CO2 from the atmosphere.
g) Advocate for policies that promote clean energy and reduced emissions at local, national, and international levels.

Key Tip: Calculate your carbon footprint using online tools (such as the EPA's Household Carbon Footprint Calculato