# Upgrading our System

In this notebook, we look at how we can use the previous building blocks that we put together to test different pipeline combinations. 



In our previous notebook, we saw how we could evaluate a RAG system using some simple metrics such as recall. 

Let's see how we can use this to quickly experiment with different pipelines and get a quantitative idea of how they stack up against each other.
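As a refresher, here is a minimal sketch of what `recall@k` and `mrr@k` measure for a single query. The function names `recall_at_k` and `reciprocal_rank_at_k` are hypothetical and used only for illustration; the helpers in `lib.eval` may be implemented differently.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for rid in relevant_ids if rid in top_k) / len(relevant_ids)


def reciprocal_rank_at_k(retrieved_ids, relevant_ids, k):
    """1 / rank of the first relevant id within the top-k, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for rank, rid in enumerate(retrieved_ids[:k], start=1):
        if rid in relevant:
            return 1 / rank
    return 0.0


# Toy example: the only relevant chunk sits at rank 6
retrieved = ["a", "b", "c", "d", "e", "f"]
relevant = ["f"]

print(recall_at_k(retrieved, relevant, 3))                      # 0.0
print(recall_at_k(retrieved, relevant, 10))                     # 1.0
print(round(reciprocal_rank_at_k(retrieved, relevant, 10), 3))  # 0.167
```

Averaging these per-query values over all queries gives one number per metric and cutoff, which is what the tables in this notebook report.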

In [1]:
%load_ext autoreload
%autoreload 2

In [6]:
from lib.models import QueryItem
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate

# Define Scoring Parameters
sizes = [3, 5, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:30]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco")

In [5]:
from lib.query import fts_search, vector_search

# Define Candidate Search Methods
candidates = {"Full Text Search": fts_search, "Semantic Search": vector_search}

results = {}

for candidate, search_fn in candidates.items():
    search_results = search_fn(table, queries, 25)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{candidate}"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Executing Full Text Search now...: 100%|█████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 37.68it/s]
Generating Embeddings for 30 queries: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7326.30it/s]
Executing Vector Search now...: 100%|████████████████████████████████████████████████████████████████████| 30/30 [00:02<00:00, 12.25it/s]

+-----------+--------------------+-------------------+
|           |   Full Text Search |   Semantic Search |
| mrr@3     |               0.25 |              0.37 |
+-----------+--------------------+-------------------+
| mrr@5     |               0.29 |              0.41 |
+-----------+--------------------+-------------------+
| mrr@10    |               0.34 |              0.42 |
+-----------+--------------------+-------------------+
| mrr@15    |               0.34 |              0.42 |
+-----------+--------------------+-------------------+
| mrr@25    |               0.34 |              0.42 |
+-----------+--------------------+-------------------+
| recall@3  |               0.33 |              0.67 |
+-----------+--------------------+-------------------+
| recall@5  |               0.5  |              0.87 |
+-----------+--------------------+-------------------+
| recall@10 |               0.8  |              0.93 |
+-----------+--------------------+-------------------+
| recall@1




We can immediately notice a few different things about our system:

- We can get close to the `recall@5` of Semantic Search by using a larger value of `k` for Full Text Search. The `recall@10` for Full Text Search is 0.8 while the `recall@5` for Semantic Search is 0.87
- The MRR for Full Text Search is lower at every value of `k`, and it never quite matches the performance of Semantic Search.

What other options do we have?

# Our Evaluation Dataset

What exactly have we been benchmarking our results against, and what's included in the MS MARCO dataset?

In [59]:
from datasets import load_dataset

dataset = load_dataset("ms_marco", "v1.1", split="train", streaming=True).take(4)
dataset

IterableDataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    n_shards: 1
})

In [63]:
item = next(iter(dataset))
item

{'answers': ['Results-Based Accountability is a disciplined way of thinking and taking action that communities can use to improve the lives of children, youth, families, adults and the community as a whole.'],
 'passages': {'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  'passage_text': ["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.",
   "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the C

In [64]:
item["passages"]

{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'passage_text': ["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.",
  "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydn

The MS MARCO dataset benchmarks the ability of models to do information retrieval. 

Each item in the dataset consists of a few different bits of information:

- `query` : This is the query that the original user made on Bing
- `passages` 
  - `is_selected` : This is a binary label that indicates whether a human annotator found this passage relevant when composing a response to the query. A 1 indicates that it's relevant and a 0 indicates that it's not. **Note that multiple passages can be selected as relevant**
  - `passage_text` : This is the list of passages returned for the query, in the same order as the `is_selected` labels

We then compute a unique chunk_id for each passage using the `hashlib` library with the `md5` hash. This then allows us to use each item and its corresponding hash to generate two small test files in `.jsonl` format.

- `queries_multi_label` : This contains a mapping of query to one or more selected passages

    ```
    {"query": "in animals somatic cells are produced by and gametic cells are produced by", "selected_chunk_ids":  ["4ff0ed20afce65c76bdb0df809ef5025", "0f556defd6442767c1fee0e729f942fb"]}
    ```
  
- `queries_single_label`: This contains a mapping of a query to a single selected passage

    ```
    {"query": "how much time can you go between oil changes", "selected_chunk_ids": "c944888dc7bb30a1a01cf5c19776a19b"}
    ```

In [2]:
import hashlib


def compute_md5_hash(input_string: str) -> str:
    """
    Compute the MD5 hash of a given string.

    Parameters:
    input_string (str): The input string to hash.

    Returns:
    str: The MD5 hash of the input string.
    """
    md5_hash = hashlib.md5(input_string.encode())
    return md5_hash.hexdigest()


compute_md5_hash("hello world"), compute_md5_hash("HEllo world")

('5eb63bbbe01eeed093cb22bb8f5acdc3', '68df723b3e61541ec5af9c6053357942')
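To connect the hash back to the dataset, the sketch below shows one way a multi-label record could be built from an item with the shape shown earlier. The helper `selected_chunk_ids` and the toy item are hypothetical; the script that actually produced the `.jsonl` files is not shown in this notebook and may differ.

```python
import hashlib
import json


def selected_chunk_ids(item: dict) -> list[str]:
    """Return the MD5 chunk ids of the passages marked as selected."""
    passages = item["passages"]
    return [
        hashlib.md5(text.encode()).hexdigest()
        for text, selected in zip(passages["passage_text"], passages["is_selected"])
        if selected == 1
    ]


# Toy item with the same shape as an MS MARCO row (contents are made up)
toy_item = {
    "query": "toy query",
    "passages": {
        "is_selected": [0, 1],
        "passage_text": ["hello world", "HEllo world"],
    },
}

record = {
    "query": toy_item["query"],
    "selected_chunk_ids": selected_chunk_ids(toy_item),
}
print(json.dumps(record))
```

Because the id is derived from the passage text alone, the same passage always maps to the same `chunk_id`, which is what lets us match retrieved rows against the labels.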

## Hybrid Search

If we look at the `query_type` for LanceDB, we'll notice that there's an additional query_type called `hybrid`. What's this new query type and how does it perform?

In [8]:
from lib.models import QueryItem
from lib.query import fts_search, vector_search, hybrid_search
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate

# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:30]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco")

# Define Candidate Search Methods
candidates = {
    "Full Text Search": fts_search,
    "Semantic Search": vector_search,
    "Hybrid Search": hybrid_search,
}

results = {}

for candidate, search_fn in candidates.items():
    search_results = search_fn(table, queries, 25)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{candidate}"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Executing Full Text Search now...: 100%|█████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 39.98it/s]
Generating Embeddings for 30 queries: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7876.63it/s]
Executing Vector Search now...: 100%|████████████████████████████████████████████████████████████████████| 30/30 [00:02<00:00, 12.02it/s]
Executing Hybrid Search now...: 100%|████████████████████████████████████████████████████████████████████| 30/30 [00:18<00:00,  1.65it/s]

+-----------+--------------------+-------------------+-----------------+
|           |   Full Text Search |   Semantic Search |   Hybrid Search |
| mrr@3     |               0.25 |              0.37 |            0.36 |
+-----------+--------------------+-------------------+-----------------+
| mrr@10    |               0.34 |              0.42 |            0.42 |
+-----------+--------------------+-------------------+-----------------+
| mrr@15    |               0.34 |              0.42 |            0.42 |
+-----------+--------------------+-------------------+-----------------+
| mrr@25    |               0.34 |              0.42 |            0.42 |
+-----------+--------------------+-------------------+-----------------+
| recall@3  |               0.33 |              0.67 |            0.63 |
+-----------+--------------------+-------------------+-----------------+
| recall@10 |               0.8  |              0.93 |            0.97 |
+-----------+--------------------+-----------------




But what's happening under the hood? Why are our new results in between Semantic Search and Full Text Search?

It turns out that LanceDB uses a `LinearCombinationReranker` under the hood. It combines the semantic search and full text search scores, with a default weight of 0.7 for semantic search and 0.3 for full text search. [Link to Documentation](https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/#arguments)

Can we tune this weight hyper-parameter and get better results using this naive linear combination reranker?

```python
from lancedb.rerankers import LinearCombinationReranker

reranker = LinearCombinationReranker(weight=0.3) # Use 0.3 as the weight for vector search

# We can pass in a linear reranker here to do the re-ranking
results = table.search("rebel", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```
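To build some intuition for what the weight does, here is a minimal pure-Python sketch of a weighted linear combination. It assumes both inputs are relevance scores where higher is better and min-max normalizes them first; the helper names are hypothetical, and LanceDB's actual `LinearCombinationReranker` may differ in details such as how distances are converted and how documents missing from one result list are scored.

```python
def min_max_normalize(scores):
    """Scale scores into [0, 1] so the two result lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (value - lo) / (hi - lo) for doc_id, value in scores.items()}


def linear_combination(vector_scores, fts_scores, weight=0.7):
    """Rank doc ids by weight * vector score + (1 - weight) * full text score.

    Assumption: a document missing from one result list scores 0 there.
    """
    vec = min_max_normalize(vector_scores)
    fts = min_max_normalize(fts_scores)
    combined = {
        doc_id: weight * vec.get(doc_id, 0.0) + (1 - weight) * fts.get(doc_id, 0.0)
        for doc_id in set(vec) | set(fts)
    }
    return sorted(combined, key=combined.get, reverse=True)


# Made-up scores for four documents
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
fts_scores = {"c": 12.0, "b": 6.0, "d": 3.0}

print(linear_combination(vector_scores, fts_scores, weight=0.7))  # ['a', 'b', 'c', 'd']
print(linear_combination(vector_scores, fts_scores, weight=0.3))  # ['c', 'b', 'a', 'd']
```

With a weight of 0.7 the ranking follows the vector scores, while a weight of 0.3 lets the full text scores dominate.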



In [19]:
from lib.query import linear_combination_search, semantic_search
from lib.data import get_labels
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.models import EmbeddedPassage, QueryItem
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

db = lancedb.connect("../lance")

weightage = [0.3, 0.7, 0.9]

queries = get_labels("../data/queries_single_label.jsonl")[:30]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco", EmbeddedPassage)

# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

# Run test_data against candidates
results = {}

for weight in weightage:
    search_results = linear_combination_search(table, queries, 25, weight)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"Linear Combination ({weight})"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Linear Combination (Weight 0.3): 100%|███████████████████████████████████████████████████████████████████| 30/30 [00:20<00:00,  1.50it/s]
Linear Combination (Weight 0.7): 100%|███████████████████████████████████████████████████████████████████| 30/30 [00:21<00:00,  1.39it/s]
Linear Combination (Weight 0.9): 100%|███████████████████████████████████████████████████████████████████| 30/30 [00:21<00:00,  1.41it/s]

+-----------+----------------------------+----------------------------+----------------------------+
|           |   Linear Combination (0.3) |   Linear Combination (0.7) |   Linear Combination (0.9) |
| mrr@3     |                       0.34 |                       0.36 |                       0.36 |
+-----------+----------------------------+----------------------------+----------------------------+
| mrr@10    |                       0.4  |                       0.42 |                       0.42 |
+-----------+----------------------------+----------------------------+----------------------------+
| mrr@15    |                       0.4  |                       0.42 |                       0.42 |
+-----------+----------------------------+----------------------------+----------------------------+
| mrr@25    |                       0.4  |                       0.42 |                       0.42 |
+-----------+----------------------------+----------------------------+--------------------




# Re-Rankers

Re-rankers are often much more accurate at finding relevant documents than simple embedding search because they're able to extract significantly more information from the query (and the text itself).

Typically we'd use a re-ranker as the second stage of a two-stage retrieval process:

1. First fetch the relevant chunks
2. Pass them into a re-ranker
3. Then return a subset of the re-ranked elements

This lets us use the fast retrieval of embedding search to quickly narrow down the set of candidate chunks, while still benefiting from the increased accuracy of a re-ranker.
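The steps above can be sketched with toy stand-ins for both stages. Everything here is hypothetical: `keyword_overlap` plays the role of the fast first-stage retriever and `jaccard` plays the role of the slower, more careful re-ranker. A real pipeline would swap in a database search and a re-ranking model.

```python
def keyword_overlap(query, corpus):
    """Cheap first stage: order documents by the number of words shared with the query."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)


def jaccard(query, doc):
    """Toy re-ranking score: word overlap divided by the size of the word union."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


def retrieve_then_rerank(query, corpus, first_stage, rerank_score, fetch_k=25, top_n=3):
    # 1. Fetch a generous set of candidates with the fast retriever
    candidates = first_stage(query, corpus)[:fetch_k]
    # 2. Score only those candidates with the re-ranker
    reranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    # 3. Return a subset of the re-ranked candidates
    return reranked[:top_n]


corpus = [
    "the reserve bank of australia is the central bank",
    "bank of australia",
    "a dandelion is a flower",
]

top = retrieve_then_rerank(
    "central bank of australia", corpus, keyword_overlap, jaccard, fetch_k=2, top_n=1
)
print(top)  # ['bank of australia']
```

The first stage ranks the longer passage first, but the re-ranker only ever sees the two fetched candidates and is free to reorder them before the final cut.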

## Cohere Re-Ranker

In [26]:
from lib.query import linear_combination_search, semantic_search
from lib.data import get_labels
from lib.models import EmbeddedPassage
from lib.db import get_table
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall

db = lancedb.connect("../lance")
test_data = get_labels("../data/queries_single_label.jsonl")[:20]
table = get_table(db, "ms_marco", EmbeddedPassage)

sizes = [3,5,10]
eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

In [21]:
test_data[0]

{'query': 'what is rba',
 'selected_chunk_ids': 'ca869ae1ed3f5021cb5b2a0b78cc846c'}

In [27]:
def normal_search(query, table, limit):
    return [
        item["chunk_id"]
        for item in table.search(query, query_type="fts")
        .limit(limit)
        .to_list()
    ]


query = test_data[0]
retrieved_chunk_ids = normal_search(query["query"], table, 25)
# Wrap the single label in a list, matching how QueryItem stores it elsewhere
evaluation_metrics = score_retrieval(
    retrieved_chunk_ids, [query["selected_chunk_ids"]], sizes, eval_fns
)
evaluation_metrics

{'mrr@3': 0,
 'mrr@5': 0,
 'mrr@10': 0.167,
 'recall@3': 0.0,
 'recall@5': 0.0,
 'recall@10': 1.0}

In [28]:
from lancedb.rerankers import CohereReranker


def rerank_search(query, table, limit):
    reranker = CohereReranker(
        model_name="rerank-english-v2.0"
    )  # Rerank with Cohere's rerank-english-v2.0 model
    return [
        item["chunk_id"]
        for item in table.search(query, query_type="fts")
        .limit(limit)
        .rerank(reranker=reranker)
        .to_list()
    ]


query = test_data[0]
retrieved_chunk_ids = rerank_search(query["query"], table, 25)
# Wrap the single label in a list, matching how QueryItem stores it elsewhere
evaluation_metrics = score_retrieval(
    retrieved_chunk_ids, [query["selected_chunk_ids"]], sizes, eval_fns
)
evaluation_metrics

{'mrr@3': 1.0,
 'mrr@5': 1.0,
 'mrr@10': 1.0,
 'recall@3': 1.0,
 'recall@5': 1.0,
 'recall@10': 1.0}

How does a reranker improve the quality of our retrieval vs simple full text search for MRR and Recall?

In [29]:
from lib.models import QueryItem
from lib.query import fts_search
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate

# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:30]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco")

In [30]:
from lib.query import fts_search, cohere_rerank_search

# Define Candidate Search Methods
candidates = {"Full Text Search": fts_search, "Cohere Rerank": cohere_rerank_search}

results = {}

queries = queries[:10]
for candidate, search_fn in candidates.items():
    search_results = search_fn(table, queries, 25)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{candidate}"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Executing Full Text Search now...: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 38.04it/s]
Cohere Reranker (rerank-english-v2.0): 100%|█████████████████████████████████████████████████████████████| 10/10 [00:06<00:00,  1.61it/s]

+-----------+--------------------+-----------------+
|           |   Full Text Search |   Cohere Rerank |
| mrr@3     |               0.25 |            0.67 |
+-----------+--------------------+-----------------+
| mrr@10    |               0.34 |            0.69 |
+-----------+--------------------+-----------------+
| mrr@15    |               0.34 |            0.69 |
+-----------+--------------------+-----------------+
| mrr@25    |               0.34 |            0.69 |
+-----------+--------------------+-----------------+
| recall@3  |               0.4  |            0.9  |
+-----------+--------------------+-----------------+
| recall@10 |               0.9  |            1    |
+-----------+--------------------+-----------------+
| recall@15 |               0.9  |            1    |
+-----------+--------------------+-----------------+
| recall@25 |               1    |            1    |
+-----------+--------------------+-----------------+





### Sample Evaluation : Which Model to use?

Cohere ships a few different re-ranker models (`rerank-english-v3.0`, `rerank-multilingual-v3.0`, `rerank-english-v2.0`, `rerank-multilingual-v2.0`) that perform slightly differently depending on the use case. How can we determine which one works best for ours?

In [None]:
from lib.models import QueryItem
from lib.query import cohere_rerank_search, vector_search
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate

# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:30]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco")

# Define Candidate Search Methods
model_names = {
    "rr-eng-3": "rerank-english-v3.0",
    "rr-mul-3": "rerank-multilingual-v3.0",
    "rr-eng-2": "rerank-english-v2.0",
    "rr-mul-2": "rerank-multilingual-v2.0",
}

results = {}

# Evaluate each Cohere model in turn, re-ranking the top 25 candidates
for header_name, model_name in model_names.items():
    search_results = cohere_rerank_search(table, queries, 25, model_name)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[header_name] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

In [43]:
from lib.query import cohere_rerank_search, vector_search
from lib.data import get_labels
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.models import EmbeddedPassage, QueryItem
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}

results = {}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:5]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table = get_table(db, "ms_marco")

# Define Candidate Search Methods
model_names = {
    "rr-eng-3": "rerank-english-v3.0",
    "rr-mul-3": "rerank-multilingual-v3.0",
    "rr-eng-2": "rerank-english-v2.0",
    "rr-mul-2": "rerank-multilingual-v2.0",
}

# Do Semantic Search
search_results = vector_search(table, queries, 25)
chunk_ids = [
    [item["chunk_id"] for item in retrieved_items]
    for retrieved_items in search_results
]
evaluation_metrics = [
    score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
    for retrieved_chunk_ids, query in zip(chunk_ids, queries)
]
results["Semantic Search"] = pd.DataFrame(evaluation_metrics).mean()

for header_name, model_name in model_names.items():
    search_results = cohere_rerank_search(table, queries, 50, model_name,"hybrid")
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{header_name}"] = pd.DataFrame(evaluation_metrics).mean()


# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Generating Embeddings for 5 queries: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2807.43it/s]
Executing Vector Search now...: 100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  9.56it/s]
Cohere Reranker (rerank-english-v3.0): 100%|███████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.09s/it]
Cohere Reranker (rerank-multilingual-v3.0): 100%|██████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.39s/it]
Cohere Reranker (rerank-english-v2.0): 100%|███████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.39s/it]
Cohere Reranker (rerank-multilingual-v2.0): 100%|██████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.21s/it]

+-----------+-------------------+------------+------------+------------+------------+
|           |   Semantic Search |   rr-eng-3 |   rr-mul-3 |   rr-eng-2 |   rr-mul-2 |
| mrr@3     |              0.47 |       0.4  |       0.77 |       0.6  |       0.53 |
+-----------+-------------------+------------+------------+------------+------------+
| mrr@10    |              0.47 |       0.52 |       0.77 |       0.67 |       0.53 |
+-----------+-------------------+------------+------------+------------+------------+
| mrr@15    |              0.47 |       0.52 |       0.77 |       0.67 |       0.53 |
+-----------+-------------------+------------+------------+------------+------------+
| mrr@25    |              0.47 |       0.52 |       0.77 |       0.67 |       0.53 |
+-----------+-------------------+------------+------------+------------+------------+
| recall@3  |              1    |       0.4  |       1    |       0.6  |       1    |
+-----------+-------------------+------------+--------




# Metadata Ingestion

We've looked at different ways that we can set up our retrieval pipeline. Let's now switch gears and see how we can experiment with metadata ingestion to improve the quality of our search

## Creating Question-Answer Pairs

We previously used Instructor to generate synthetic questions and answers. Let's see how we can expand on this to improve our retrieval pipeline.

In [11]:
import lancedb

db = lancedb.connect("../lance")
table = db.open_table("ms_marco")

In [52]:
chunks = table.search("this is a random chunk",query_type="fts").select(["text"]).limit(3).to_list()
# Now for each chunk, we'll generate a question and answer pair that we'll embed into our database
chunk = chunks[0]
chunk

{'text': 'Between any two values of a continuous random variable, there are an infinite number of other valid values. This is not the case for discrete random variables, because between any two discrete values, there is an integer number (0, 1, 2, ...) of',
 'score': 12.768003463745117}

In [53]:
from lib.synthethic import generate_question_batch

question = await generate_question_batch([chunk], 20)
question

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.38s/it]


[{'response': QuestionAnswerResponse(chain_of_thought='The text discusses the difference between continuous and discrete random variables, specifically focusing on the fact that continuous random variables have an infinite number of possible values between any two values, whereas discrete random variables have a finite set of values between any two points. This detail will help in crafting a specific question and answer pair.', question='What is the difference between continuous and discrete random variables in terms of the possible values between any two points?', answer='Between any two values of a continuous random variable, there are an infinite number of other valid values. In contrast, for discrete random variables, between any two discrete values, there is an integer number (0, 1, 2, ...) of valid values.'),
  'source': {'text': 'Between any two values of a continuous random variable, there are an infinite number of other valid values. This is not the case for discrete random va

In [45]:
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-3-small")


class EmbeddedPassageWithQA(LanceModel):
    vector: Vector(func.ndims()) = func.VectorField()
    chunk_id: str
    text: str = func.SourceField()
    source_text: str


table_withqa = db.create_table(
    "ms_marco_qa_4", schema=EmbeddedPassageWithQA, mode="overwrite"
)

We generated ~1,600 questions ahead of time with GPT-4o.

Ingesting the resulting ~11k items (the original chunks plus the generated questions and answers) took me ~1.5 minutes on my local computer with a semaphore of 5.
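For reference, a semaphore is a simple way to cap how many asynchronous calls are in flight at once. The sketch below is a generic illustration using only the standard library; `gather_bounded` and `fake_insert` are hypothetical names and this is not the code that was used to generate the questions.

```python
import asyncio


async def gather_bounded(coroutine_fns, max_concurrency=5):
    """Run zero-argument coroutine functions with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(fn):
        # Each task waits here until one of the slots is free
        async with semaphore:
            return await fn()

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run(fn) for fn in coroutine_fns))


async def fake_insert(batch):
    # Stand-in for an I/O-bound call such as an API request or a table insert
    await asyncio.sleep(0.01)
    return len(batch)


batches = [list(range(i, i + 3)) for i in range(0, 12, 3)]
tasks = [lambda b=b: fake_insert(b) for b in batches]
inserted = asyncio.run(gather_bounded(tasks, max_concurrency=2))
print(inserted)  # [3, 3, 3, 3]
```

Inside a notebook, where an event loop is already running, you would write `await gather_bounded(tasks, max_concurrency=2)` instead of calling `asyncio.run`.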

In [48]:
import hashlib
import lancedb
from itertools import batched  # batched requires Python 3.12+
from tqdm.asyncio import tqdm_asyncio as asyncio
import time
from lib.data import get_labels
from asyncio import Semaphore

cached_questions = get_labels("../data/synth-questions-4o.jsonl")

def get_chunks(questions,chunks):
    for chunk in chunks:
        chunk_id = hashlib.md5(chunk.encode()).hexdigest()
        yield {
            "chunk_id": chunk_id,
            "text": chunk,
            "source_text": chunk,
        }
        
    for question in questions:
        chunk_id = hashlib.md5(question["chunk"].encode()).hexdigest()
        yield {
            "chunk_id": chunk_id,
            "text": question["question"],
            "source_text": question["chunk"],
        }
        yield {
            "chunk_id": chunk_id,
            "text": question["answer"],
            "source_text": question["chunk"],
        }

text_chunks = map(lambda x:str(x),table.to_arrow()['text'])
items = list(get_chunks(cached_questions,text_chunks))

batched_items = batched(items,500)
async_db = await lancedb.connect_async("../lance")
async_tbl = await async_db.open_table("ms_marco_qa_4")

start = time.time()
await asyncio.gather(*[async_tbl.add(list(batch)) for batch in batched_items])
end = time.time()

print(f"Inserted {len(items)} items in {end-start}s")

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [01:33<00:00,  4.05s/it]

Inserted 11471 items in 93.18293404579163s





In [50]:
table_withqa.create_fts_index(["text", "source_text"],replace=True)

In [7]:
from itertools import product

from lib.models import QueryItem
from lib.query import fts_search
import lancedb
from lib.eval import score_retrieval, calculate_reciprocal_rank, calculate_recall
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate

sizes = [3, 10, 15, 25]
eval_fns = {"mrr": calculate_reciprocal_rank, "recall": calculate_recall}


db = lancedb.connect("../lance")
queries = get_labels("../data/queries_single_label.jsonl")[:100]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["selected_chunk_ids"]]}
    )
    for item in queries
]
table_withqa = get_table(db, "ms_marco_qa_4")
table = get_table(db, "ms_marco")


candidates = {
    "FTS": fts_search,
}

tables = {"With Q": table_withqa, "W/o  Q": table}

results = {}

for (
    candidate_fn_tuple,
    db_table_tuple,
) in product(candidates.items(), tables.items()):
    db_name, db_table = db_table_tuple
    candidate_name, search_fn = candidate_fn_tuple
    search_results = search_fn(db_table, queries, 25)
    chunk_ids = [
        [item["chunk_id"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{candidate_name} ({db_name})"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Executing Full Text Search now...: 100%|███████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 44.59it/s]
Executing Full Text Search now...: 100%|███████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 42.56it/s]

+-----------+----------------+----------------+
|           |   FTS (With Q) |   FTS (W/o  Q) |
+===========+================+================+
| mrr@3     |           0.28 |           0.35 |
+-----------+----------------+----------------+
| mrr@10    |           0.36 |           0.42 |
+-----------+----------------+----------------+
| mrr@15    |           0.36 |           0.42 |
+-----------+----------------+----------------+
| mrr@25    |           0.37 |           0.42 |
+-----------+----------------+----------------+
| recall@3  |           0.34 |           0.5  |
+-----------+----------------+----------------+
| recall@10 |           0.75 |           0.88 |
+-----------+----------------+----------------+
| recall@15 |           0.86 |           0.89 |
+-----------+----------------+----------------+
| recall@25 |           0.93 |           0.93 |
+-----------+----------------+----------------+





In [10]:
table_withqa = get_table(db, "ms_marco_qa_4", EmbeddedPassageWithQA)
desired_result = "9dc539403752581ac234e01d2eb77877"
queried_results = (
    table_withqa.search("what species is a dandelion", query_type="fts")
    .select(["chunk_id", "text", "source_text"])
    .limit(20)
    .to_pandas()
)
queried_results

Unnamed: 0,chunk_id,text,source_text,score
0,9dc539403752581ac234e01d2eb77877,The common species of Taraxacum that rapidly c...,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,31.960682
1,9dc539403752581ac234e01d2eb77877,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,31.862806
2,89c9edbed59513ee50cb83e1cfb1ebbe,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,31.336384
3,591eae9a842ee783b157f2b033519a2e,Both species are edible in their entirety. The...,Both species are edible in their entirety. The...,30.430489
4,0853e1b91ecae08717962457740453fa,Description. There are considered to be about ...,Description. There are considered to be about ...,29.815567
5,ea830e20c43da3d767c180fee0f061de,"About this species. Dandelions are well-known,...","About this species. Dandelions are well-known,...",29.55406
6,a5872798dffbe3077255ed0dd0dd7a12,"Many Taraxacum species, such as the common dan...",Each single flower in a head is called a flore...,29.211123
7,a75f014e6c6f31495e04f78e8476b54d,Plant Description. Hundreds of species of dand...,Plant Description. Hundreds of species of dand...,29.060694
8,89c9edbed59513ee50cb83e1cfb1ebbe,What is the lifecycle of the common dandelion ...,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,28.738657
9,a5872798dffbe3077255ed0dd0dd7a12,Each single flower in a head is called a flore...,Each single flower in a head is called a flore...,28.075951
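
In the output above the same `chunk_id` shows up more than once, because a chunk's original text and its synthetic question are stored as separate rows. Repeats take up slots in the top-k list, which may be one reason recall at small k is lower for the table with questions. If the scoring functions do not already handle repeats, one option is to collapse them before scoring. This is a sketch only; `dedupe_preserving_order` is a hypothetical helper, and the ids are shortened to their first four characters.

```python
def dedupe_preserving_order(chunk_ids):
    """Keep the first occurrence of each chunk id and preserve ranking order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(chunk_ids))


# First ten chunk ids from the result above, abbreviated to four characters
retrieved = ["9dc5", "9dc5", "89c9", "591e", "0853", "ea83", "a587", "a75f", "89c9", "a587"]
unique = dedupe_preserving_order(retrieved)
print(len(retrieved), len(unique))  # 10 7
```

After deduplication the ten rows collapse to seven distinct chunks, so a top-10 cut would need more rows fetched to fill all ten slots.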


## Creating Metadata using GPT-4o

Now that we've seen how to create synthetic questions, let's look at another method of improving our search pipeline: creating metadata when ingesting data.

Documents exist in relation to one another, so metadata is especially useful for queries that depend on categories, dates, or other attributes that the text of a single chunk doesn't capture.

In [10]:
from lib.query import fts_search,vector_search,metadata_search
import lancedb
from lib.eval import score_retrieval, calculate_precision
from lib.data import get_labels
from lib.db import get_table
import pandas as pd
from tabulate import tabulate
from lib.models import QueryItem


# Define Scoring Parameters
sizes = [3, 10, 15, 25]

eval_fns = {"precision": calculate_precision}

# Setup Test Files
db = lancedb.connect("../lance")
queries = get_labels("../data/category_questions.jsonl")[:20]
queries = [
    QueryItem(
        **{"query": item["query"], "selected_chunk_ids": [item["category"]]}
    )
    for item in queries
]

candidates = {
    "SS": vector_search,
    "FTS": fts_search,
}

results = {}

table = db.open_table("arxiv_papers")

search_results = await metadata_search(table, queries, 25)
chunk_ids = [
    [item["category"] for item in retrieved_items]
    for retrieved_items in search_results
]
evaluation_metrics = [
    score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
    for retrieved_chunk_ids, query in zip(chunk_ids, queries)
]

results["FTS w Metadata"] = pd.DataFrame(evaluation_metrics).mean()

for candidate, search_fn in candidates.items():
    search_results = search_fn(table, queries, 25)
    chunk_ids = [
        [item["category"] for item in retrieved_items]
        for retrieved_items in search_results
    ]
    evaluation_metrics = [
        score_retrieval(retrieved_chunk_ids, query.selected_chunk_ids, sizes, eval_fns)
        for retrieved_chunk_ids, query in zip(chunk_ids, queries)
    ]
    results[f"{candidate}"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 12.80it/s]
20it [00:01, 12.61it/s]
Generating Embeddings for 20 queries: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5504.34it/s]
Executing Vector Search now...: 100%|████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 45.07it/s]
Executing Full Text Search now...: 100%|█████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 43.63it/s]

+--------------+------------------+------+-------+
|              |   FTS w Metadata |   SS |   FTS |
+==============+==================+======+=======+
| precision@3  |             0.95 | 0.75 |  0.72 |
+--------------+------------------+------+-------+
| precision@10 |             0.95 | 0.61 |  0.52 |
+--------------+------------------+------+-------+
| precision@15 |             0.95 | 0.57 |  0.47 |
+--------------+------------------+------+-------+
| precision@25 |             0.95 | 0.53 |  0.42 |
+--------------+------------------+------+-------+
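
The flat 0.95 in the metadata column is worth a second look. One reading consistent with the numbers is that a category filter makes every returned row share the predicted category, so each query scores either 1.0 or 0.0 at every cutoff. This is an inference from the table, not something the notebook verifies. The sketch below uses a hypothetical `precision_at_k`, which may differ from `lib.eval.calculate_precision`.

```python
def precision_at_k(retrieved_labels, relevant_labels, k):
    """Fraction of the top-k retrieved labels that belong to the relevant set."""
    top_k = retrieved_labels[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for label in top_k if label in relevant_labels)
    return hits / len(top_k)


# With a category filter, all 25 rows carry the predicted category
filtered_correct = ["cs.AI"] * 25  # prediction matches the labelled category
filtered_wrong = ["cs.IR"] * 25    # prediction does not match

for k in (3, 10, 15, 25):
    print(k, precision_at_k(filtered_correct, {"cs.AI"}, k), precision_at_k(filtered_wrong, {"cs.AI"}, k))

# Assumed split for illustration: 19 of 20 queries classified correctly
per_query = [1.0] * 19 + [0.0]
print(sum(per_query) / len(per_query))  # 0.95
```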





### What's happening over here?

Metadata lets us filter on information that cannot be captured within a single embedding vector, such as the date attached to a document.

In [11]:
documents = [
    {"price" : "The price was 24", "date": "24-02-2024"},
    {"price":"The price was 400", "date": "14-02-2024"},
    {"price":"The price was -200","date": "07-02-2024"}
]

query = "What was the price updated to last week?"  # Assuming today is 02-03-2024 (a Saturday)

In [17]:
from openai import OpenAI

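# Dedicated client for embeddings; `client` is rebound to an instructor client in a later cell
embedding_client = OpenAI()

def generate_embeddings(data):
    # Embed the "price" text of every item in a single request
    return embedding_client.embeddings.create(
        model="text-embedding-3-small",
        input=[item["price"] for item in data]
    ).data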

embeddings = generate_embeddings(documents)
documents = [
    {
        **document,
        "embedding": embedding.embedding
    }
    for document,embedding in zip(documents,embeddings)
]


In [14]:
import numpy as np

def similarity_search_w_document(query,documents):
    query_embedding = generate_embeddings([{"price":query}])[0].embedding
    similarity_scores = [
        (document, np.inner(query_embedding, document['embedding']))
        for document in documents
    ]
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    return similarity_scores

[(item[0]['price'],item[1]) for item in similarity_search_w_document(query,documents)]


[('The price was -200', 0.44006616151637845),
 ('The price was 400', 0.41957457665426845),
 ('The price was 24', 0.39200806304293756)]

In [15]:
from pydantic import BaseModel
import instructor
from openai import OpenAI

class Query(BaseModel):
    start_date: str
    end_date: str

client = instructor.from_openai(OpenAI())
query = "What was the price updated to last week?"

query_obj = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role":"system",
            "content":"You are an expert query understanding Artificial Intelligence system.( Today is 02-03-2024 and it is a Saturday ). You will need to extract the dates from the query and return the start and end dates for items that are going to be relevant to the user's query. Your start date should be a Monday and the end date a Sunday"
        },
        {
            "role":"user",
            "content":query
        }
    ],
    response_model=Query
)
query_obj


Query(start_date='2024-02-19', end_date='2024-02-25')
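
As a sanity check on the model's answer, the same Monday-to-Sunday window can be computed deterministically from the reference date in the system prompt (02-03-2024, a Saturday). This is a minimal sketch; `previous_week_range` is a hypothetical helper and not part of the notebook's pipeline.

```python
from datetime import date, datetime, timedelta


def previous_week_range(today):
    """Return (Monday, Sunday) of the calendar week before the one containing `today`."""
    this_monday = today - timedelta(days=today.weekday())  # Monday is weekday 0
    last_monday = this_monday - timedelta(days=7)
    return last_monday, last_monday + timedelta(days=6)


start, end = previous_week_range(date(2024, 3, 2))
print(start, end)  # 2024-02-19 2024-02-25

# Same DD-MM-YYYY strings as the `documents` list defined earlier
doc_dates = ["24-02-2024", "14-02-2024", "07-02-2024"]
in_range = [d for d in doc_dates if start <= datetime.strptime(d, "%d-%m-%Y").date() <= end]
print(in_range)  # ['24-02-2024']
```

The computed range agrees with the extracted `Query`, and only the document dated 24-02-2024 falls inside it.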

In [18]:
from datetime import datetime

def similarity_search_w_document_and_metadata(query,documents):
    query_embedding = generate_embeddings([{"price":query}])[0].embedding
    client = instructor.from_openai(OpenAI())

    query_obj = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role":"system",
                "content":"You are an expert query understanding Artificial Intelligence system.( Today is 02-03-2024 and it is a Saturday ). You will need to extract the dates from the query and return the start and end dates for items that are going to be relevant to the user's query. Your start date should be a Monday and the end date a Sunday"
            },
            {
                "role":"user",
                "content":query
            }
        ],
        response_model=Query
    )
    print(query_obj)

    start_date = datetime.strptime(query_obj.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(query_obj.end_date, '%Y-%m-%d')

    filtered_documents = [
        document for document in documents
        if start_date <= datetime.strptime(document['date'], '%d-%m-%Y') <= end_date
    ]

    similarity_scores = [
        (document, np.inner(query_embedding, document['embedding']))
        for document in filtered_documents
    ]
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    return similarity_scores

[(item[0]['price'],item[1]) for item in similarity_search_w_document_and_metadata(query,documents)]


start_date='2024-02-19' end_date='2024-02-25'


[('The price was 24', 0.39200775073752137)]

In [19]:
similarity_search_w_document_and_metadata("What was the price like last year?",documents)

start_date='2023-02-27' end_date='2023-03-05'


[]

### What Happened in our Example?

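Let's look at what is happening under the hood of the metadata search evaluated above. The cells below walk through the same idea by hand on the `arxiv_papers` table: classify the query into a category, then restrict full text search to that category.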

In [27]:
import lancedb
from lib.db import get_table
from lib.models import ArxivPaper

db = lancedb.connect("../lance")
table = get_table(db, "arxiv_papers", ArxivPaper)

In [29]:
table.to_pandas().head(4)

Unnamed: 0,title,authors,category,abstract,text,vector,chunk_id
0,Calculating Valid Domains for BDD-Based Intera...,"Tarik Hadzic, Rune Moller Jensen, Henrik Reif ...",cs.AI,In these notes we formally describe the func...,Title:Calculating Valid Domains for BDD-Based ...,"[0.035932083, 0.018666347, 0.061117645, -0.023...",e53b8c4c0077270bbcd70953a75d2628
1,A study of structural properties on profiles HMMs,"Juliana S Bernardes, Alberto Davila, Vitor San...",cs.AI,Motivation: Profile hidden Markov Models (pH...,Title:A study of structural properties on prof...,"[0.007371153, 0.014932285, -0.0031393894, -0.0...",95632f153798757abe735d5882c657b3
2,Bayesian approach to rough set,Tshilidzi Marwala and Bodie Crossingham,cs.AI,This paper proposes an approach to training ...,Title:Bayesian approach to rough set\nAbstract...,"[-0.002857872, -0.0032559216, 0.047268394, 0.0...",d6dea0edd048c82504e98cd7e019d9de
3,Comparing Robustness of Pairwise and Multiclas...,"J. Uglov, V. Schetinin, C. Maple",cs.AI,"Noise, corruptions and variations in face im...",Title:Comparing Robustness of Pairwise and Mul...,"[-0.0033912328, -0.014125003, 0.036404647, -0....",8119b6693be1f30dd152dac35b8eaa73


In [30]:
table.to_pandas()['category'].value_counts()

category
cs.AI      100
cs.IR      100
stat.ML    100
Name: count, dtype: int64

In [37]:
from pydantic import BaseModel,Field
from typing import Literal
import openai
import instructor

client  = instructor.from_openai(openai.OpenAI())

category_description = """
This represents a categorization of the user's query

- stat.ML : Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding
- cs.AI : Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
- cs.IR : Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
""".strip()

class Category(BaseModel):
    category: Literal['cs.AI','cs.IR','stat.ML'] = Field(...,description=category_description)

query = "What potential application does Quantum Computation (QC) have to Artificial Intelligence (AI) according to the text?"

def categorize_item(query):
    return client.chat.completions.create(
        messages = [
            {
                "role": "system",
                "content": "You are an expert topic classifier, your job is to classify the following title into the categories provided in the response object. Make sure to classify it into one of the individual categories provided",
            },
            {
                "role": "user",
                "content": f"The title is {query}"
            }
        ],
        model = "gpt-4o",
        response_model = Category
    )

q = categorize_item(query)
print(q)

category='cs.AI'
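
The predicted category can then be turned into the SQL-style filter string that a `.where(...)` call expects. The sketch below is one way to do that while rejecting unexpected values; `build_category_filter` and `ALLOWED_CATEGORIES` are hypothetical names, and `lib.query.metadata_search` may build its filter differently.

```python
# The three categories present in the arxiv_papers table
ALLOWED_CATEGORIES = {"cs.AI", "cs.IR", "stat.ML"}


def build_category_filter(category):
    """Return a filter expression for a known category, or raise on anything else."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"Unknown category: {category!r}")
    return f"category = '{category}'"


print(build_category_filter("cs.AI"))  # category = 'cs.AI'
```

Checking against a fixed set keeps arbitrary model output from being interpolated into the filter expression.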


In [39]:
table.search(query,query_type="fts").select(['category']).limit(25).to_list()

[{'category': 'cs.AI', 'score': 44.642051696777344},
 {'category': 'cs.AI', 'score': 16.717119216918945},
 {'category': 'cs.AI', 'score': 15.190481185913086},
 {'category': 'cs.AI', 'score': 11.489946365356445},
 {'category': 'cs.AI', 'score': 10.238595008850098},
 {'category': 'cs.AI', 'score': 9.757229804992676},
 {'category': 'cs.AI', 'score': 9.154081344604492},
 {'category': 'cs.AI', 'score': 8.645560264587402},
 {'category': 'cs.IR', 'score': 8.546699523925781},
 {'category': 'stat.ML', 'score': 8.312337875366211},
 {'category': 'cs.AI', 'score': 8.290803909301758},
 {'category': 'cs.AI', 'score': 8.105945587158203},
 {'category': 'cs.IR', 'score': 7.876897811889648},
 {'category': 'cs.IR', 'score': 7.408782005310059},
 {'category': 'cs.AI', 'score': 7.327653884887695},
 {'category': 'cs.AI', 'score': 7.243310928344727},
 {'category': 'cs.AI', 'score': 7.066513538360596},
 {'category': 'cs.IR', 'score': 6.783278465270996},
 {'category': 'cs.AI', 'score': 6.648003578186035},
 {'ca

In [42]:
table.search(query,query_type="fts").where("category = 'cs.AI'", prefilter=True).select(['category']).limit(25).to_list()

[{'category': 'cs.AI', 'score': 44.642051696777344},
 {'category': 'cs.AI', 'score': 16.717119216918945},
 {'category': 'cs.AI', 'score': 15.190481185913086},
 {'category': 'cs.AI', 'score': 11.489946365356445},
 {'category': 'cs.AI', 'score': 10.238595008850098},
 {'category': 'cs.AI', 'score': 9.757229804992676},
 {'category': 'cs.AI', 'score': 9.154081344604492},
 {'category': 'cs.AI', 'score': 8.645560264587402},
 {'category': 'cs.AI', 'score': 8.290803909301758},
 {'category': 'cs.AI', 'score': 8.105945587158203},
 {'category': 'cs.AI', 'score': 7.327653884887695},
 {'category': 'cs.AI', 'score': 7.243310928344727},
 {'category': 'cs.AI', 'score': 7.066513538360596},
 {'category': 'cs.AI', 'score': 6.648003578186035},
 {'category': 'cs.AI', 'score': 6.570124626159668},
 {'category': 'cs.AI', 'score': 6.534304618835449},
 {'category': 'cs.AI', 'score': 6.524085521697998}]
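
Whether a filter runs before or after the top-k cut changes how many rows come back. The toy example below illustrates the general distinction with plain lists. It is not a description of LanceDB's implementation, and the rows are made up for illustration.

```python
# Toy ranking: (category, score) pairs sorted by descending score
ranked = [
    ("cs.AI", 9.0), ("cs.IR", 8.0), ("cs.AI", 7.0),
    ("stat.ML", 6.0), ("cs.AI", 5.0), ("cs.AI", 4.0),
]


def post_filter(ranked_rows, category, k):
    """Take the top k rows first, then drop rows from other categories."""
    return [row for row in ranked_rows[:k] if row[0] == category]


def pre_filter(ranked_rows, category, k):
    """Drop rows from other categories first, then take the top k."""
    return [row for row in ranked_rows if row[0] == category][:k]


print(len(post_filter(ranked, "cs.AI", 3)))  # 2
print(len(pre_filter(ranked, "cs.AI", 3)))   # 3
```

Filtering after the cut can return fewer than k rows, while filtering before the cut fills all k slots whenever enough matching rows exist.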