<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-python-sdk/blob/main/examples/embedding_comparison.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Optimizing Your RAG - Choose an Embedding Model That Fits Your Data

Get your Okareo API token from [https://app.okareo.com/](https://app.okareo.com/) (you will need to register first) and set it in the cell below, together with the OpenAI API key used later for the `text-embedding-3-large` evaluation. 👇

In [1]:
OKAREO_API_KEY = "<YOUR-OKAREO-API-TOKEN>"
OPENAI_API_KEY = "<YOUR-OPENAI-API-KEY>"

#### Install Okareo, ChromaDB, and pandas
- Okareo is used to evaluate and compare model performance.
- ChromaDB is used for query-to-document similarity search, which lets us compare embedding models.
- pandas is used to load the example documents from a .jsonl file.

In [None]:
%pip install okareo
%pip install chromadb
%pip install pandas

### Generate Your RAG Questions
- We use a list of user questions, each paired with the IDs of its relevant documents, to compare the performance of different embedding models. The document IDs point to documents that your RAG would store and retrieve, usually from a vector database. This list is our evaluation scenario.
- The similarity-search performance of your RAG stack depends on the type of data it stores and how that data is retrieved. The goal here is to see which embedding model does a better job of matching your kind of user queries to your documents.
- The example evaluation scenario below is based on WebBizz, a fictitious web business. The WebBizz questions were created with the [generate_retrieval_scenario.ipynb](https://github.com/okareo-ai/okareo-python-sdk/blob/main/examples/generate_retrieval_scenario.ipynb) notebook and downloaded from https://app.okareo.com/ as a .jsonl file.
- To evaluate your own RAG data, modify the [generate_retrieval_scenario.ipynb](https://github.com/okareo-ai/okareo-python-sdk/blob/main/examples/generate_retrieval_scenario.ipynb) notebook to generate synthetic user questions from your own documents, then use them in place of the example scenario below.
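
As a rough sketch of what such a scenario file might look like, the snippet below builds two question rows as JSONL and parses them back. The field names (`input`, `question`, `result`) are assumptions inferred from how the code later in this notebook reads its data, and the questions and document IDs are made up for illustration; check the downloaded .jsonl file for the actual layout.

```python
import json

# Hypothetical scenario rows. The field names are assumptions inferred from
# this notebook's code, and the questions and document IDs are invented.
rows = [
    {"input": {"question": "How do I return an item?"}, "result": ["doc-001"]},
    {"input": {"question": "Which payment methods are accepted?"}, "result": ["doc-002", "doc-007"]},
]

# JSONL stores one JSON object per line
jsonl_text = "\n".join(json.dumps(row) for row in rows)

# Parse the text back into Python dictionaries, skipping empty lines
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

for row in parsed:
    print(row["input"]["question"], "->", row["result"])
```

Each row pairs one question with the list of document IDs that a good retriever should return for it.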

In [None]:
import os  # needed for os.popen below
import tempfile
from okareo import Okareo
import random
import string

# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)
random_string = ''.join(random.choices(string.ascii_letters, k=5))

# Download questions from Okareo's GitHub repository
webbizz_embedding_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_embedding_questions.jsonl').read()

with tempfile.NamedTemporaryFile(suffix="webbizz_embedding_questions.jsonl", mode="w+", delete=True) as temp_file:
    temp_file.write(webbizz_embedding_questions)
    temp_file.seek(0) # Move the file pointer to the beginning

    # Upload the questions to Okareo from the temporary file
    scenario = okareo.upload_scenario_set(file_path=temp_file.name, scenario_name=f"RAG Embedding Comparison Questions - {random_string}")

### Prepare Your Document Database
- We load the WebBizz documents into ChromaDB. The helper functions below load documents into ChromaDB and query them.
- To compare performance, we use a different embedding model each time to encode both the documents in the database and the user questions.

In [None]:
import os
import chromadb
from chromadb.utils import embedding_functions
import random
import string
from io import StringIO  
import pandas as pd
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation

# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)
random_string = ''.join(random.choices(string.ascii_letters, k=5))

# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_30_articles.jsonl').read()

# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)

# Create a ChromaDB client
chroma_client = chromadb.Client()

# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
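        # ChromaDB returns a cosine distance (1 - cosine similarity) in the range [0, 2];
        # map it to a score in [0, 1], where 1 means the most similar
        score = (2 - results['distances'][0][i]) / 2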
        parsed_ids_with_scores.append(
            {
                "id": results['ids'][0][i],
                "score": score,
                "metadata": {'document' : results['documents'][0][i]},
                "label": f"WebBizz Article w/ ID: {results['ids'][0][i]}"
            }
        )
    return parsed_ids_with_scores

# Implement the Okareo CustomModel API using the ChromaDB collection to retrieve documents
# For each question, invoke() returns the 5 documents most similar to the query under the current embedding function
# Note: invoke() reads the global `collection` variable, which is assigned in the evaluation cells below
class RetrievalModel(CustomModel):
    def invoke(self, input: dict) -> ModelInvocation:
        # Query the collection with the input text
        results = collection.query(
            query_texts=[input["question"]],
            n_results=5
        )
        # Return formatted query results and the model response context
        return ModelInvocation(model_prediction=query_results_to_score(results), raw_model_output=results)

def create_vector_collection(embedding_model_name, embedding_function):
    # Create a ChromaDB collection
    # The collection will be used to store the documents as vector embeddings
    # We want to measure the similarity between questions and documents using cosine similarity 
    collection = chroma_client.get_or_create_collection(name=embedding_model_name + "-comparison", 
                                             metadata={"hnsw:space": "cosine"}, 
                                             embedding_function=embedding_function)

    # Add the documents to the collection with the corresponding metadata
    collection.add(
        documents=list(jsonObj.input),
        ids=list(jsonObj.result),
    )

    # Register the model being evaluated with Okareo
    # This will return a model if it already exists or create a new one if it doesn't
    model_under_test = okareo.register_model(name=embedding_model_name, model=RetrievalModel(name=embedding_model_name))
    return model_under_test, collection
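
To make the score conversion in `query_results_to_score` concrete, here is a small standalone sketch. It assumes that the distance reported for a collection configured with `"hnsw:space": "cosine"` is `1 - cosine similarity`; the vectors are toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distance_to_score(distance):
    # Same mapping as in query_results_to_score: distance in [0, 2] -> score in [0, 1]
    return (2 - distance) / 2

query = [1.0, 0.0]
examples = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
]

scores = {}
for label, doc in examples:
    distance = 1 - cosine_similarity(query, doc)  # assumed cosine distance
    scores[label] = distance_to_score(distance)
    print(f"{label}: distance={distance}, score={scores[label]}")
```

A document pointing in the same direction as the query gets a score of 1.0, an orthogonal one 0.5, and an opposite one 0.0, so a higher score always means a closer match.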


### Evaluate Performance of **all-MiniLM-L6-v2** embedding model from SentenceTransformers
- Model Card: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [None]:
from okareo_api_client.models import TestRunType

embedding_model_name = "all-MiniLM-L6-v2"
# This is the default SentenceTransformer model that ChromaDB uses to embed the documents
default_sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
model_under_test, collection = create_vector_collection(embedding_model_name, default_sentence_transformer_ef)

# Perform a test run using the uploaded scenario 
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario uploaded earlier in this notebook
    name=f"RAG Comparison {embedding_model_name} - {random_string}", 
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running a retrieval test
)

# Print a link back to Okareo app for evaluation visualization
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")
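
Retrieval evaluations of this kind score the ranked document IDs returned by the model against the relevant IDs in the scenario. The sketch below shows two commonly used measures, recall@k and reciprocal rank, on made-up IDs. It is an illustration of the idea only, not Okareo's implementation, and the function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant documents that appear in the top k results
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant document, or 0.0 if none was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0

# Made-up ranking for one question, ordered from most to least similar
retrieved = ["d3", "d1", "d9", "d4", "d7"]
relevant = ["d1", "d4"]

print(recall_at_k(retrieved, relevant, k=1))   # 0.0
print(recall_at_k(retrieved, relevant, k=5))   # 1.0
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Averaging such values over all questions in the scenario gives one number per embedding model, which is what makes the models comparable.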

### Evaluate Performance of **text-embedding-3-large** embedding model from OpenAI
- Model Announcement: https://openai.com/index/new-embedding-models-and-api-updates/

In [None]:
from okareo_api_client.models import TestRunType

embedding_model_name = "text-embedding-3-large"
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name=embedding_model_name)
model_under_test, collection = create_vector_collection(embedding_model_name, openai_ef)

# Perform a test run using the uploaded scenario 
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario uploaded earlier in this notebook
    name=f"RAG Comparison {embedding_model_name} - {random_string}", 
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running a retrieval test
)

# Print a link back to Okareo app for evaluation visualization
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")

### Evaluate Performance of **gte-small** embedding model from Alibaba DAMO Academy
- Model Card: https://huggingface.co/thenlper/gte-small

In [None]:

from okareo_api_client.models import TestRunType

embedding_model_name = "thenlper/gte-small"
gte_small_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model_name)
model_under_test, collection = create_vector_collection("gte-small", gte_small_ef)

# Perform a test run using the uploaded scenario 
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario uploaded earlier in this notebook
    name=f"RAG Comparison {embedding_model_name} - {random_string}", 
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running a retrieval test
)

# Print a link back to Okareo app for evaluation visualization
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")