Goals

1. How to set up LanceDB locally
2. Looking at metrics such as recall and MRR, and manually calculating some examples


# Introduction

Before starting this part, make sure that you have run the `setup.py` file so that we have a LanceDB database populated with the first 1000 entries of the MS-Marco dataset. Depending on your internet connection, this might take a while, so make sure you have completed this step before the workshop.

In [11]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
import numpy as np

In [4]:
# We can create a db by using the connect function

db = lancedb.connect("./lance-db")

In [7]:
# We can then create tables based on a simple pydantic schema.
# The embedding function below calls the OpenAI API, so OPENAI_API_KEY must be set.
func = get_registry().get("openai").create(name="text-embedding-3-small")

class Entry(LanceModel):
    vector: Vector(func.ndims()) = func.VectorField()
    text: str = func.SourceField()

table = db.create_table("sample_table",schema=Entry)

In [8]:
sample_data = [
    "The Capital of France is Paris",
    "How long do you need for sydney and surrounding areas",
    "Twitter is a popular web application"
]
table.add([{"text":item} for item in sample_data])

In [21]:
results = table.search(np.random.random((1536))) \
    .limit(10) \
    .to_list()

for result in results:
    print(f"text: {result['text']}, vector: {result['vector'][:2]}, distance: {round(result['_distance'],3)}\n")
    

text: The Capital of France is Paris, vector: [0.026081105694174767, 0.020630236715078354], distance: 509.652

text: Twitter is a popular web application, vector: [0.004896957892924547, -0.04718351364135742], distance: 510.056

text: How long do you need for sydney and surrounding areas, vector: [0.00892411544919014, 0.022230438888072968], distance: 512.427



In [20]:
results = table.search("Paris is a nice city to visit") \
    .limit(10) \
    .to_list()

for result in results:
    print(f"text: {result['text']}, vector: {result['vector'][:2]}, distance: {round(result['_distance'],3)}\n")
    

text: The Capital of France is Paris, vector: [0.026081105694174767, 0.020630236715078354], distance: 0.94

text: How long do you need for sydney and surrounding areas, vector: [0.00892411544919014, 0.022230438888072968], distance: 1.611

text: Twitter is a popular web application, vector: [0.004896957892924547, -0.04718351364135742], distance: 1.719
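
The two result sets differ hugely in scale: roughly 510 for the random query versus roughly 1 to 2 for the text query. One plausible reading, assuming `_distance` is a squared Euclidean (L2) distance and that the stored embeddings are approximately unit-length, is that a random vector with components in [0, 1) has a squared norm of about 1536 / 3 = 512, which dominates the distance. The sketch below reproduces that order of magnitude with stand-in vectors only; it does not call LanceDB or OpenAI, and the names `query`, `embedding` and `squared_l2` are illustrative.

```python
import math
import random

random.seed(0)
DIM = 1536  # same dimensionality as the random query vector above

# Stand-in for the random query: each component uniform in [0, 1)
query = [random.random() for _ in range(DIM)]

# Stand-in for a stored embedding (assumption): a random direction of unit length
raw = [random.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(sum(x * x for x in raw))
embedding = [x / norm for x in raw]


def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))


random_query_distance = squared_l2(query, embedding)
print(round(random_query_distance, 1))  # lands in the neighbourhood of DIM / 3
```

Under the same assumptions, two unit-length vectors can never be more than 4 apart in squared L2 distance, which is consistent with the much smaller values returned for the text query.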



In [24]:
import shutil

shutil.rmtree("./lance-db")

**Summary**: LanceDB provides an easy way to run full-text search (FTS, as a simple baseline) and embedding search. It also handles batching and, out of the box, offers filtering and integrations with DuckDB and Arrow tables.

# Metrics

Now that we've seen how our vector database works, let's look at some metrics to consider. Note that these metrics are always computed at a cutoff `k`, i.e. over the top `k` results only.

The right value of `k` depends on the specific use case (e.g. if a menu only has room to show five items, we need to focus on the top 5 results).

## Reciprocal Rank

Reciprocal rank highlights the importance of quickly surfacing at least one relevant document, with an emphasis on the efficiency of relevance delivery. It matters a lot when there are only a few items we can show to the user at any given time.

The formula is $$\frac{1}{\text{Rank of First Relevant Item}}$$ with a value of 0 when no relevant item is retrieved.


In [25]:
def calculate_reciprocal_rank(predictions, labels):
    for index, prediction in enumerate(predictions):
        if prediction in labels:
            return 1 / (index + 1)
    return 0

In [27]:
predictions = [1,2,3,4,5]
labels = [2,4]

calculate_reciprocal_rank(predictions,labels) # 1/2 = 0.5 since the earliest relevant item is at rank 2 (index 1)

0.5

In [28]:
predictions = [1,2,3,4,5]
labels = [10,20]

calculate_reciprocal_rank(predictions,labels) # No Relevant Items

0
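
The goals mention MRR (mean reciprocal rank), which is simply the reciprocal rank averaged over a set of queries. A minimal sketch follows; `calculate_mean_reciprocal_rank` is a hypothetical helper name, and `calculate_reciprocal_rank` is repeated so that the cell runs on its own.

```python
def calculate_reciprocal_rank(predictions, labels):
    # Same function as above, repeated so this cell is self-contained
    for index, prediction in enumerate(predictions):
        if prediction in labels:
            return 1 / (index + 1)
    return 0


def calculate_mean_reciprocal_rank(all_predictions, all_labels):
    # Hypothetical helper: average the reciprocal rank over every query
    scores = [
        calculate_reciprocal_rank(predictions, labels)
        for predictions, labels in zip(all_predictions, all_labels)
    ]
    return sum(scores) / len(scores) if scores else 0


# Three example queries, each with its own ranked predictions and labels
all_predictions = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
all_labels = [[2, 4], [10, 20], [1]]

mrr = calculate_mean_reciprocal_rank(all_predictions, all_labels)
print(mrr)  # (1/2 + 0 + 1) / 3 = 0.5
```

A query with no relevant item in its results contributes 0, so it pulls the average down rather than being skipped.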

## Recall

Recall measures the system's capability to retrieve all relevant documents within the top K results, emphasizing the breadth of relevant information captured.

Formula is $$\frac{\text{Number of Retrieved Relevant Items}}{\text{Total Number of Relevant Items}}$$

In [29]:
def calculate_recall(predictions, labels):
    correct_predictions = sum(1 for prediction in predictions if prediction in labels)
    if labels:
        return correct_predictions / len(labels)
    return 0


In [31]:
predictions = [1,2,3,4,5]
labels = [2,4]

calculate_recall(predictions,labels) # 2/2

1.0

In [32]:
predictions = [1,2,3,4,5]
labels = [200,20]

calculate_recall(predictions,labels) # 0/2 = 0

0.0

In [34]:
predictions = [1,2,3,4,5]
labels = [2,10]

calculate_recall(predictions,labels) # 0.5

0.5
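
Since these metrics are always taken at a value of `k`, one simple way to do that is to truncate the ranked predictions to the top `k` before scoring. The sketch below assumes that convention; `recall_at_k` and `reciprocal_rank_at_k` are illustrative names, not part of any library.

```python
def recall_at_k(predictions, labels, k):
    # Only the top-k predictions count towards recall
    if not labels:
        return 0
    top_k = predictions[:k]
    return sum(1 for prediction in top_k if prediction in labels) / len(labels)


def reciprocal_rank_at_k(predictions, labels, k):
    # A relevant item ranked below k is treated as not retrieved
    for index, prediction in enumerate(predictions[:k]):
        if prediction in labels:
            return 1 / (index + 1)
    return 0


predictions = [1, 2, 3, 4, 5]
labels = [2, 4]

for k in (1, 3, 5):
    print(k, recall_at_k(predictions, labels, k), reciprocal_rank_at_k(predictions, labels, k))
# k=1: recall 0.0, reciprocal rank 0
# k=3: recall 0.5, reciprocal rank 0.5
# k=5: recall 1.0, reciprocal rank 0.5
```

Recall can only stay flat or grow as `k` increases, while reciprocal rank stops changing once the first relevant item falls inside the cutoff.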

These metrics let us measure the performance of our system and quantify the effect of incremental improvements over time. There are more metrics that you can track (see our article [here](https://jxnl.co/writing/2024/02/05/when-to-lgtm-at-k/)).

In short, think of the two metrics as follows:

- Recall: How many relevant items did we surface?
- Reciprocal Rank: Where was the first relevant item that we care about?

# Evaluating Our Data

We've generated a .jsonl file with the queries from the [MS-Marco](https://huggingface.co/datasets/ms_marco) dataset.

In [None]:
import json


def load_jsonl_file(file_path):
    # Parse one JSON object per non-empty line of the file
    data = []
    with open(file_path, "r") as file:
        for line in file:
            if not line.strip():
                continue
            json_obj = json.loads(line.strip())
            data.append(json_obj)
    return data

data = 