[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/evals/openai-embeddings-eval.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/choose-embedding-model-rag/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)

# How to choose the right embedding model for your RAG application

This notebook evaluates the [gemini-embedding-001](https://ai.google.dev/gemini-api/docs/embeddings) model.


## Step 1: Install required libraries

- **datasets**: Python library to get access to datasets available on Hugging Face Hub
- **google-genai**: Google’s GenAI Python SDK
- **numpy**: Python library that provides tools to perform mathematical operations on arrays
- **pandas**: Python library for data analysis, exploration and manipulation
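- **sentence-transformers**: Python library used here for its cosine similarity utility (`cos_sim`)
- **tqdm**: Python module to show a progress meter for loops


In [None]:
! pip install -qU datasets google-genai numpy pandas sentence-transformers tqdm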

## Step 2: Set up prerequisites

Set the Gemini API key as an environment variable, and initialize the Gemini client.

You can create a Gemini API key in [Google AI Studio](https://aistudio.google.com/app/apikey).


In [None]:
import getpass
import os

from google import genai

In [None]:
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Gemini API Key:")
gemini_client = genai.Client()
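
If you re-run the notebook often, being prompted for the key every time gets tedious. A minimal sketch of one alternative, assuming the key may already be exported as `GOOGLE_API_KEY`: reuse it when present and prompt only when it is missing. The helper name `load_api_key` is hypothetical and not part of the SDK.

```python
import getpass
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is missing.

    `load_api_key` is an illustrative helper, not part of google-genai.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt; the input is not echoed
        key = getpass.getpass("Gemini API Key:")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` before `genai.Client()` leaves the environment variable set either way, so the client initialization above stays unchanged.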

## Step 3: Download the evaluation dataset

We will use MongoDB's [cosmopedia-wikihow-chunked](https://huggingface.co/datasets/MongoDB/cosmopedia-wikihow-chunked) dataset, which has chunked versions of WikiHow articles from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset released by Hugging Face. Since the full dataset is large, we will only use the first 25k records for testing.


In [None]:
import pandas as pd
from datasets import load_dataset

# Use streaming=True to load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
# Get first 25k records from the dataset
data_head = data.take(25000)
df = pd.DataFrame(data_head)

# Use this if you want the full dataset
# data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train")
# df = pd.DataFrame(data)

## Step 4: Data analysis

Check that the dataset has the expected number of records (25k), preview the data, and drop records with missing text.


In [None]:
# Ensuring length of dataset is what we expect i.e. 25k
len(df)

In [None]:
# Previewing the contents of the data
df.head()

In [None]:
# Only keep records where the text field is not null
df = df[df["text"].notna()]

In [None]:
# Number of unique documents in the dataset
df.doc_id.nunique()
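
Since each document is split into several chunks, it can also help to see how many chunks each document contributes. The sketch below runs the same checks on a toy DataFrame; the values are made up, and only the `doc_id` and `text` column names come from the dataset used above.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration
toy_df = pd.DataFrame(
    {
        "doc_id": [1, 1, 1, 2, 2, 3],
        "text": ["a", "b", None, "c", "d", "e"],
    }
)

# Only keep records where the text field is not null
toy_df = toy_df[toy_df["text"].notna()]

# Number of chunks that each document contributes
chunks_per_doc = toy_df.groupby("doc_id").size()

# Summary statistics (min, mean, max, ...) of chunks per document
summary = chunks_per_doc.describe()
print(summary)
```

On the real data, replacing `toy_df` with `df` gives the same summary for the 25k-record sample.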

## Step 5: Creating embeddings

Define the embedding function, and run a quick test.


In [None]:
from typing import List

In [None]:
def get_embeddings(docs: List[str]) -> List[List[float]]:
    """
    Get embeddings using the Gemini API.

    Args:
        docs (List[str]): List of texts to embed

    Returns:
        List[List[float]]: List of embeddings, one per input text
    """
    response = gemini_client.models.embed_content(
        model="gemini-embedding-001", contents=docs
    )
    # Each item in response.embeddings wraps its vector in a `.values` attribute
    return [embedding.values for embedding in response.embeddings]

In [None]:
# Generating a test embedding
test_gemini_embed = get_embeddings([df.iloc[0]["text"]])

In [None]:
# Sanity check to make sure embedding dimensions are as expected i.e. 3072
len(test_gemini_embed[0])
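
The same sanity check can be applied to a whole batch before embedding the full dataset. This is a rough sketch, assuming each vector is a plain list of floats (for example, the `.values` of each returned embedding); `check_embeddings` is a hypothetical helper, and the default of 3072 is the dimension expected above.

```python
import math
from typing import List


def check_embeddings(vectors: List[List[float]], expected_dim: int = 3072) -> None:
    """Raise ValueError if any vector has the wrong length or a non-finite value.

    `check_embeddings` is an illustrative helper, not part of any SDK.
    """
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Embedding {i} has {len(vec)} dimensions, expected {expected_dim}"
            )
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"Embedding {i} contains NaN or infinite values")
```

Failing early on a small batch is cheaper than discovering a malformed vector after embedding all 25k records.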

## Step 6: Evaluation


### Measuring embedding latency

Create a local vector store (list) of embeddings for the entire dataset. The progress bar reports the elapsed time, which gives a rough measure of embedding latency.


In [None]:
import numpy as np
from tqdm.auto import tqdm

In [None]:
texts = df["text"].tolist()

In [None]:
batch_size = 100

In [None]:
embeddings = []
# Generate embeddings in batches
for i in tqdm(range(0, len(texts), batch_size)):
    end = min(len(texts), i + batch_size)
    batch = texts[i:end]
    # Generate embeddings for current batch
    batch_embeddings = get_embeddings(batch)
    # Add to the list of embeddings
    embeddings.extend(np.array(batch_embeddings))
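
To record latency as a number rather than reading it off the progress bar, the batching loop can be wrapped in a timer. The following is a minimal sketch, not the notebook's own method: `timed_batch_embed` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real API call so the example runs offline.

```python
import time
from typing import Callable, List, Tuple


def timed_batch_embed(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> Tuple[List[List[float]], float]:
    """Embed texts in batches and return (embeddings, elapsed seconds)."""
    embeddings: List[List[float]] = []
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        # Slicing past the end of a list is safe, so no explicit min() is needed
        batch = texts[i : i + batch_size]
        embeddings.extend(embed_fn(batch))
    elapsed = time.perf_counter() - start
    return embeddings, elapsed


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding API: one 2-d vector per text."""
    return [[float(len(text)), 0.0] for text in batch]


vectors, seconds = timed_batch_embed(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(f"Embedded {len(vectors)} texts in {seconds:.4f}s")
```

Passing a function that returns plain lists of floats in place of `fake_embed` would time the real API calls, network round trips included.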

### Measuring retrieval quality

- Create an embedding for the user query
- Get the top 3 most similar documents from the local vector store using cosine similarity as the similarity metric
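
For intuition, the two steps above can be sketched with NumPy alone: normalize the vectors, take dot products, and sort. This is an illustrative sketch on toy 2-d vectors, not the implementation used in this notebook; `top_k_cosine` is a hypothetical helper.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3):
    """Return (indices, scores) of the top_k rows most similar to query_vec."""
    # After L2-normalization, a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort in descending order of similarity and keep the first top_k indices
    idxs = np.argsort(-scores)[:top_k]
    return idxs, scores[idxs]


# Toy "vector store" with three 2-d document embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idxs, scores = top_k_cosine(np.array([1.0, 0.0]), docs, top_k=2)
print(idxs, scores)
```

A brute-force scan like this is fine for 25k vectors; larger collections usually call for an approximate nearest neighbor index instead.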


In [None]:
from sentence_transformers.util import cos_sim

In [None]:
# Converting the embeddings list to a NumPy array, which is required to calculate cosine similarity
embeddings = np.asarray(embeddings)

In [None]:
def query(query: str, top_k: int = 3) -> None:
    """
    Query the local vector store for the `top_k` most relevant documents.

    Args:
        query (str): User query
        top_k (int, optional): Number of documents to return. Defaults to 3.
    """
    # Generate embedding for the user query
    query_emb = np.asarray(get_embeddings([query]))
    # Calculate cosine similarity
    scores = cos_sim(query_emb, embeddings)[0]
    # Get indices of the top k records
    idxs = np.argsort(-scores)[:top_k]

    print(f"Query: {query}")
    for idx in idxs:
        print(f"Score: {scores[idx]:.4f}")
        print(texts[idx])
        print("--------")

In [None]:
query("Give me some tips to improve my mental health.")

In [None]:
query("Give me some tips for writing good code.")

In [None]:
query("How do I create a basic webpage?")

In [None]:
query(
    "What are some environment-friendly practices I can incorporate in everyday life?"
)