# LanceDB Experiment Example

## Installations

In [None]:
!pip install --quiet --force-reinstall prompttools

## No setup required

In [1]:
from prompttools.experiment import LanceDBExperiment


## Run an experiment

One common use case is to compare two embedding functions and see how they affect document retrieval. We define the embedding functions we'd like to test here.

Note: If you haven't previously downloaded these embedding models, this may kick off downloads.

In [2]:
from sentence_transformers import SentenceTransformer

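# Two sentence-transformers models to compare.
DEFAULT = SentenceTransformer("paraphrase-MiniLM-L3-v2")
MINILM_L6 = SentenceTransformer("all-MiniLM-L6-v2")

def default_embed_func(batch):
    # Encode each sentence in the batch with paraphrase-MiniLM-L3-v2.
    return [DEFAULT.encode(sentence) for sentence in batch]

def minilm_l6_embed_func(batch):
    # Encode each sentence in the batch with all-MiniLM-L6-v2.
    return [MINILM_L6.encode(sentence) for sentence in batch]

# The keys are the labels shown in the results. Note that "openai-ada-002" maps to
# paraphrase-MiniLM-L3-v2 (not an OpenAI model) and "default" maps to all-MiniLM-L6-v2.
emb_fns = {"openai-ada-002": default_embed_func, "default": minilm_l6_embed_func}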
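
Each embedding function above takes a batch of strings and returns one vector per string. The sketch below illustrates only that input/output shape, using a hypothetical `toy_embed_func` built on token hashing instead of a real model. The name, the dimension `DIM`, and the hashing scheme are assumptions for illustration, and whether `LanceDBExperiment` accepts plain Python lists rather than NumPy arrays is not confirmed here.

```python
import hashlib
import math

DIM = 8  # hypothetical, tiny embedding dimension for illustration


def toy_embed_func(batch):
    """Map each sentence in `batch` to a fixed-length, unit-norm vector."""
    vectors = []
    for sentence in batch:
        vec = [0.0] * DIM
        for token in sentence.lower().split():
            # Hash each token into one of DIM buckets and count occurrences.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % DIM] += 1.0
        # Normalize so that every non-empty sentence has length 1.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


embeddings = toy_embed_func(["This is a document", "This is another document"])
print(len(embeddings), len(embeddings[0]))
```

A function like this would not produce meaningful rankings; it is only meant to show what a batch embedding callable looks like.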

Next, we create our test inputs. In this case, we would like to create a new LanceDB table.

During the experiment, a new LanceDB table is temporarily created for each embedding function, and the documents are added to it. We then query the table and examine the results.

In [5]:
import pandas as pd
use_existing_table = False  # Specify that we want to create a new table during the experiment

# Documents that will be added into the database. LanceDB also accepts other dataset formats like pydict, pyarrow, Pydantic etc.
# Learn more here - https://lancedb.github.io/lancedb/guides/tables/

data = pd.DataFrame({
    "text": ["This is a document", "This is another document", "This is the document."],
    "metadatas": [{"source": "my_source"}, {"source": "my_source"}, {"source": "my_source"}],
    "ids": ["id1", "id2", "id3"],
})

query_args = {"text": ["This is a query document", "This is a another query document"], "metric": ["cosine", "l2"]}


# Set up the experiment
experiment = LanceDBExperiment(
    data=data,
    embedding_fns=emb_fns,
    query_args=query_args,
)

# [Optional] Advanced query args
# Our test queries, along with optional query args. A LanceDB query accepts a few args to customize your search:
# metric: "l2", "cosine", or "dot" (cosine by default)
# filter: SQL WHERE clause used to filter the vector search results before the limit is applied (None by default)
# limit: number of results to return (3 by default)
"""
query_args_adv = {
                "text": ["This is a query document", "This is a another query document"], 
                "metric": ["cosine", "l2", "dot"],
                "filter": ["text IS NOT NULL" , "text LIKE '%document.%'"]
                }
experiment = LanceDBExperiment(
    data=data,
    embedding_fns=emb_fns,
    query_args=query_args_adv,
 
)
"""
print("")




We can then run the experiment to get results.

In [6]:
experiment.run()
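
The run appears to evaluate every combination of query text, metric, and embedding function. The sketch below is not prompttools code; it only reproduces that Cartesian product with `itertools.product`. The ordering (embedding function varying fastest) and the `embed_fn` key name are assumptions inferred from the visualized output.

```python
from itertools import product

query_args = {
    "text": ["This is a query document", "This is a another query document"],
    "metric": ["cosine", "l2"],
}
embedding_fn_names = ["openai-ada-002", "default"]  # the keys of emb_fns

# One combination per (text, metric, embedding function) triple.
combinations = [
    {"text": text, "metric": metric, "embed_fn": name}
    for text, metric, name in product(
        query_args["text"], query_args["metric"], embedding_fn_names
    )
]

for i, combo in enumerate(combinations):
    print(i, combo)
```

With two query texts, two metrics, and two embedding functions, this yields 2 × 2 × 2 = 8 combinations, which matches the eight rows in the results.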



We can visualize the results. In the output below, the two embedding functions rank the documents differently for the second query, "This is a another query document", under the cosine metric:

`openai-ada-002` label (paraphrase-MiniLM-L3-v2): [id2, id3, id1]

`default` label (all-MiniLM-L6-v2): [id1, id3, id2]

In [7]:
experiment.visualize()

index,text,metric,embed_fn,top doc ids,distances,documents
0,This is a query document,cosine,openai-ada-002,"[id2, id3, id1]","[0.7633732557296753, 0.773878812789917, 0.7882261872291565]","[This is another document, This is the document., This is a document]"
1,This is a query document,cosine,default,"[id1, id3, id2]","[0.8099705576896667, 0.8289484977722168, 0.8308900594711304]","[This is a document, This is the document., This is another document]"
2,This is a query document,l2,openai-ada-002,"[id3, id1, id2]","[45.84406280517578, 49.12738037109375, 49.839256286621094]","[This is the document., This is a document, This is another document]"
3,This is a query document,l2,default,"[id1, id3, id2]","[1.619940996170044, 1.6578971147537231, 1.6617801189422607]","[This is a document, This is the document., This is another document]"
4,This is a another query document,cosine,openai-ada-002,"[id2, id3, id1]","[0.7633732557296753, 0.773878812789917, 0.7882261872291565]","[This is another document, This is the document., This is a document]"
5,This is a another query document,cosine,default,"[id1, id3, id2]","[0.8099705576896667, 0.8289484977722168, 0.8308900594711304]","[This is a document, This is the document., This is another document]"
6,This is a another query document,l2,openai-ada-002,"[id3, id1, id2]","[45.84406280517578, 49.12738037109375, 49.839256286621094]","[This is the document., This is a document, This is another document]"
7,This is a another query document,l2,default,"[id1, id3, id2]","[1.619940996170044, 1.6578971147537231, 1.6617801189422607]","[This is a document, This is the document., This is another document]"
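
The values in the `distances` column depend on the chosen metric. As a rough illustration, the sketch below uses the textbook definitions: cosine distance as one minus cosine similarity, and L2 as Euclidean distance. LanceDB's exact conventions (for example, whether L2 values are squared) are not confirmed here, and the vectors are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance; depends on vector length as well as direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 2.0]
same_direction = [2.0, 4.0]  # same direction as the query, twice the length

print(cosine_distance(query, same_direction))  # close to 0: the angle is zero
print(l2_distance(query, same_direction))      # clearly non-zero: lengths differ
```

Cosine distance ignores vector length while L2 does not, which is one plausible reason the `l2` distances for the two models above are on such different scales.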


## Evaluate the model response

To evaluate the results, we'll define an evaluation function. Sometimes you know what the order of the most relevant documents should be for a given query, and you can compute the correlation between the expected ranking and the actual ranking.

In [8]:
import scipy.stats as stats

# For each query, you can define what the expected ranking is.
EXPECTED_RANKING = {
    "This is a query document": ["id1", "id3", "id2"],
    "This is a another query document": ["id2", "id3", "id1"],
}


def measure_correlation(input_query: str, results: dict, metadata: dict) -> float:
    """
    A simple test that compares the expected ranking for a given query with the actual ranking produced
    by the embedding function being tested.
    """
    correlation, _ = stats.spearmanr(results["ids"], EXPECTED_RANKING[input_query])
    return correlation
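
To make the score concrete, here is a hand-rolled sketch of Spearman's rank correlation for rankings without ties, using the formula 1 - 6·Σd² / (n(n² - 1)). The helpers `ordinal_ranks` and `spearman_no_ties` are hypothetical names, not part of prompttools or SciPy. The sketch assumes each ID is ranked by its sorted order, which is consistent with the `ranking_correlation` values shown in the results.

```python
def ordinal_ranks(ids):
    """Replace each ID by its 1-based position in sorted order (no duplicates)."""
    order = {doc_id: rank for rank, doc_id in enumerate(sorted(ids), start=1)}
    return [order[doc_id] for doc_id in ids]


def spearman_no_ties(actual, expected):
    """Spearman's rho for two equally long rankings without ties."""
    ranks_actual = ordinal_ranks(actual)
    ranks_expected = ordinal_ranks(expected)
    n = len(ranks_actual)
    # Sum of squared differences between paired ranks.
    d_squared = sum((a - e) ** 2 for a, e in zip(ranks_actual, ranks_expected))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


expected = ["id1", "id3", "id2"]  # expected ranking for "This is a query document"

print(spearman_no_ties(["id1", "id3", "id2"], expected))  # 1.0 (identical ranking)
print(spearman_no_ties(["id2", "id3", "id1"], expected))  # 0.5
print(spearman_no_ties(["id3", "id1", "id2"], expected))  # -1.0
```

A score of 1.0 means the actual ranking matches the expected one exactly, and lower scores indicate less agreement with the expected order.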

Finally, we can evaluate and visualize the results.

In [9]:
experiment.evaluate("ranking_correlation", measure_correlation, input_key="text")

In [12]:
experiment.visualize()

index,ranking_correlation,text,metric,embed_fn,top doc ids,distances,documents
0,0.5,This is a query document,cosine,openai-ada-002,"[id2, id3, id1]","[0.7633732557296753, 0.773878812789917, 0.7882261872291565]","[This is another document, This is the document., This is a document]"
1,1.0,This is a query document,cosine,default,"[id1, id3, id2]","[0.8099705576896667, 0.8289484977722168, 0.8308900594711304]","[This is a document, This is the document., This is another document]"
2,-1.0,This is a query document,l2,openai-ada-002,"[id3, id1, id2]","[45.84406280517578, 49.12738037109375, 49.839256286621094]","[This is the document., This is a document, This is another document]"
3,1.0,This is a query document,l2,default,"[id1, id3, id2]","[1.619940996170044, 1.6578971147537231, 1.6617801189422607]","[This is a document, This is the document., This is another document]"
4,1.0,This is a another query document,cosine,openai-ada-002,"[id2, id3, id1]","[0.7633732557296753, 0.773878812789917, 0.7882261872291565]","[This is another document, This is the document., This is a document]"
5,0.5,This is a another query document,cosine,default,"[id1, id3, id2]","[0.8099705576896667, 0.8289484977722168, 0.8308900594711304]","[This is a document, This is the document., This is another document]"
6,-0.5,This is a another query document,l2,openai-ada-002,"[id3, id1, id2]","[45.84406280517578, 49.12738037109375, 49.839256286621094]","[This is the document., This is a document, This is another document]"
7,0.5,This is a another query document,l2,default,"[id1, id3, id2]","[1.619940996170044, 1.6578971147537231, 1.6617801189422607]","[This is a document, This is the document., This is another document]"


You can also use auto evaluation. We will add an example of this in the near future.