# Upgrading our System

In this notebook, we look at how we can use the previous building blocks that we put together to test different pipeline combinations. 



In our previous notebook, we saw how we could evaluate a RAG system using some simple metrics such as recall. 

Let's see how we can use this to quickly experiment with different pipelines and get a quantitative idea of how well they stack up against each other.
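As a quick refresher, the sketch below shows one way `recall@k` and `mrr@k` can be computed for a single query. It is a minimal stand-in and not the actual `lib.eval.score` implementation: the function name `score_sketch`, the default cut-offs and the handling of single versus multiple labels are assumptions chosen to mirror the metric names used in this notebook.

```python
def score_sketch(retrieved_ids, relevant_ids, ks=(3, 5, 10, 15, 25)):
    """Compute mrr@k and recall@k for one query (hypothetical helper)."""
    # Accept either a single relevant id or a collection of them.
    if isinstance(relevant_ids, str):
        relevant_ids = [relevant_ids]
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant id is required")

    metrics = {}
    for k in ks:
        top_k = retrieved_ids[:k]
        # MRR@k: reciprocal rank of the first relevant hit in the top k, else 0.
        reciprocal_rank = 0.0
        for rank, chunk_id in enumerate(top_k, start=1):
            if chunk_id in relevant:
                reciprocal_rank = 1.0 / rank
                break
        metrics[f"mrr@{k}"] = reciprocal_rank
        # Recall@k: fraction of the relevant ids that appear in the top k.
        metrics[f"recall@{k}"] = len(relevant & set(top_k)) / len(relevant)
    return metrics


# The relevant chunk sits at rank 6, so it only counts from k=10 onwards.
example = score_sketch(["a", "b", "c", "d", "e", "f"], "f")
print(example["mrr@5"], example["mrr@10"], example["recall@10"])
```

Averaging these per-query dictionaries over the whole test set gives one column of the comparison tables in this notebook.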

In [1]:
from lib.data import get_labels
from lib.query import full_text_search, semantic_search
from lib.eval import score
from lib.models import EmbeddedPassage
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

db = lancedb.connect("../lance")

candidates = {"Semantic Search": semantic_search, "Full Text Search": full_text_search}

test_data = get_labels("../data/queries_single_label.jsonl")
table = get_table(db, "ms_marco", EmbeddedPassage)

# Run test_data against candidates
results = {}

for candidate, search_fn in candidates.items():
    search_results = search_fn(table, test_data, 25)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[candidate] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|██████████| 11/11 [00:00<00:00, 14907.06it/s]
100%|██████████| 219/219 [00:13<00:00, 16.79it/s]
100%|██████████| 219/219 [00:06<00:00, 35.46it/s]

+-----------+-------------------+--------------------+
|           |   Semantic Search |   Full Text Search |
| mrr@3     |              0.44 |               0.35 |
+-----------+-------------------+--------------------+
| mrr@5     |              0.49 |               0.4  |
+-----------+-------------------+--------------------+
| mrr@10    |              0.51 |               0.42 |
+-----------+-------------------+--------------------+
| mrr@15    |              0.51 |               0.43 |
+-----------+-------------------+--------------------+
| mrr@25    |              0.51 |               0.43 |
+-----------+-------------------+--------------------+
| recall@3  |              0.64 |               0.5  |
+-----------+-------------------+--------------------+
| recall@5  |              0.86 |               0.71 |
+-----------+-------------------+--------------------+
| recall@10 |              0.99 |               0.9  |
+-----------+-------------------+--------------------+
| recall@1




We can immediately notice a few things about our system:

- We can match the `recall@5` of Semantic Search by using a larger value of `k` for Full Text Search. The `recall@10` for Full Text Search is 0.9, while the `recall@5` for Semantic Search is 0.86.
- The MRR for Full Text Search is lower at every cut-off, and it never quite matches the performance of Semantic Search.

What other options do we have?

# Our Evaluation Dataset

What exactly have we been benchmarking our results against, and what's included in the MS MARCO dataset?

In [59]:
from datasets import load_dataset

dataset = load_dataset("ms_marco", "v1.1", split="train", streaming=True).take(4)
dataset

IterableDataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    n_shards: 1
})

In [63]:
item = next(iter(dataset))
item

{'answers': ['Results-Based Accountability is a disciplined way of thinking and taking action that communities can use to improve the lives of children, youth, families, adults and the community as a whole.'],
 'passages': {'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  'passage_text': ["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.",
   "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the C

In [64]:
item["passages"]

{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'passage_text': ["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.",
  "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydn

MS MARCO is a dataset for benchmarking how well models perform information retrieval.

Each item in the dataset consists of a few different bits of information:

- `query`: the query that the original user made on Bing
- `passages`
  - `is_selected`: a binary label that indicates whether a human annotator found this passage relevant when composing a response to the query. A 1 indicates that it's relevant and a 0 indicates that it's not. **Note that multiple passages can be selected as relevant.**
  - `passage_text`: the list of passages that were returned for the query. Each entry lines up by index with the corresponding entry in `is_selected`.

We then compute a unique `chunk_id` for each passage by taking the MD5 hash of its text with the `hashlib` library. Each item and its corresponding hashes are then used to generate two small test files in `.jsonl` format:

- `queries_multi_label` : This contains a mapping of query to one or more selected passages

    ```
    {"query": "in animals somatic cells are produced by and gametic cells are produced by", "selected_chunk_ids":  ["4ff0ed20afce65c76bdb0df809ef5025", "0f556defd6442767c1fee0e729f942fb"]}
    ```
  
- `queries_single_label`: This contains a mapping of a query to a single selected passage

    ```
    {"query": "how much time can you go between oil changes", "selected_chunk_id": "c944888dc7bb30a1a01cf5c19776a19b"}
    ```
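To make the mapping concrete, here is a minimal sketch of how one dataset item could be turned into these label records. The helper name `build_label_records` and the rule for when a single-label record is emitted (exactly one selected passage) are assumptions for illustration; the preprocessing script that actually produced the files is not shown in this notebook.

```python
import hashlib
import json


def build_label_records(item):
    """Turn one MS MARCO item into (multi_label, single_label) records."""
    passages = item["passages"]
    # chunk_id is the MD5 hex digest of the passage text.
    selected_ids = [
        hashlib.md5(text.encode()).hexdigest()
        for text, flag in zip(passages["passage_text"], passages["is_selected"])
        if flag == 1
    ]
    if not selected_ids:
        return None, None  # nothing was marked relevant for this query

    multi = {"query": item["query"], "selected_chunk_ids": selected_ids}
    # Assumption: only queries with exactly one selected passage
    # get a single-label record.
    single = None
    if len(selected_ids) == 1:
        single = {"query": item["query"], "selected_chunk_id": selected_ids[0]}
    return multi, single


toy_item = {
    "query": "hello",
    "passages": {
        "is_selected": [0, 1],
        "passage_text": ["HEllo world", "hello world"],
    },
}
multi, single = build_label_records(toy_item)
print(json.dumps(single))
```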

In [2]:
import hashlib


def compute_md5_hash(input_string: str) -> str:
    """
    Compute the MD5 hash of a given string.

    Parameters:
    input_string (str): The input string to hash.

    Returns:
    str: The MD5 hash of the input string.
    """
    md5_hash = hashlib.md5(input_string.encode())
    return md5_hash.hexdigest()


compute_md5_hash("hello world"), compute_md5_hash("HEllo world")

('5eb63bbbe01eeed093cb22bb8f5acdc3', '68df723b3e61541ec5af9c6053357942')

## Hybrid Search

If we look at the `query_type` options for LanceDB, we'll notice that there's an additional one called `hybrid`. What's this new query type and how does it perform?

In [58]:
from lib.data import get_labels
from lib.query import full_text_search, semantic_search, hybrid_search
from lib.eval import score
from lib.models import EmbeddedPassage
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

db = lancedb.connect("../lance")

candidates = {
    "Semantic Search": semantic_search,
    "Full Text Search": full_text_search,
    "Hybrid Search": hybrid_search,
}

test_data = get_labels("../data/queries_single_label.jsonl")
table = get_table(db, "ms_marco", EmbeddedPassage)

# Run test_data against candidates
results = {}

for candidate, search_fn in candidates.items():
    search_results = search_fn(table, test_data, 25)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[candidate] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 40865.67it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:09<00:00, 23.42it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:05<00:00, 37.06it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████| 219/219 [02:01<00:00,  1.80it/s]

+-----------+-------------------+--------------------+-----------------+
|           |   Semantic Search |   Full Text Search |   Hybrid Search |
| mrr@3     |              0.44 |               0.35 |            0.37 |
+-----------+-------------------+--------------------+-----------------+
| mrr@5     |              0.49 |               0.4  |            0.42 |
+-----------+-------------------+--------------------+-----------------+
| mrr@10    |              0.51 |               0.42 |            0.45 |
+-----------+-------------------+--------------------+-----------------+
| mrr@15    |              0.51 |               0.43 |            0.45 |
+-----------+-------------------+--------------------+-----------------+
| mrr@25    |              0.51 |               0.43 |            0.45 |
+-----------+-------------------+--------------------+-----------------+
| recall@3  |              0.64 |               0.5  |            0.55 |
+-----------+-------------------+------------------




But what's happening under the hood? Why are our new results in between Semantic Search and Full Text Search?

It turns out that LanceDB uses a `LinearCombinationReranker` under the hood. It combines the semantic search and full text search scores, with a default weight of 0.7 for semantic search and 0.3 for full text search. [Link to Documentation](https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/#arguments)
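Conceptually, a linear combination blends the two scores for each document into a single number. The sketch below is a simplified, pure-Python illustration and not LanceDB's implementation: it assumes both scores have already been normalised to the range [0, 1] with higher meaning more relevant, and it treats a document that is missing from one result list as scoring 0 there. The function name `linear_combination` is hypothetical.

```python
def linear_combination(vector_scores, fts_scores, weight=0.7):
    """Blend two {chunk_id: score} dicts; `weight` applies to the vector score."""
    combined = {}
    for chunk_id in set(vector_scores) | set(fts_scores):
        v = vector_scores.get(chunk_id, 0.0)  # assumption: missing -> 0
        f = fts_scores.get(chunk_id, 0.0)     # assumption: missing -> 0
        combined[chunk_id] = weight * v + (1 - weight) * f
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)


vector_scores = {"a": 0.9, "b": 0.4}
fts_scores = {"b": 1.0, "c": 0.8}

# A high weight favours the vector ranking, a low weight favours full text search.
print(linear_combination(vector_scores, fts_scores, weight=0.7))
print(linear_combination(vector_scores, fts_scores, weight=0.3))
```

In this toy example the top result changes from `a` to `b` as the weight drops from 0.7 to 0.3, which is why the weight is worth treating as a tunable parameter.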

Can we tune this weight hyper-parameter and get better results using this naive linear combination reranker?

```python
from lancedb.rerankers import LinearCombinationReranker

reranker = LinearCombinationReranker(weight=0.3) # Use 0.3 as the weight for vector search

# We can pass in a linear reranker here to do the re-ranking
results = table.search("rebel", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```



In [6]:
from lib.query import linear_combination_search, semantic_search
from lib.data import get_labels
from lib.eval import score
from lib.models import EmbeddedPassage
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

db = lancedb.connect("../lance")

weightage = [0.3, 0.7, 0.9]

test_data = get_labels("../data/queries_single_label.jsonl")[:20]
table = get_table(db, "ms_marco", EmbeddedPassage)

# Run test_data against candidates
results = {}

for weight in weightage:
    search_results = linear_combination_search(table, test_data, 25, weight)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[f"Linear Combination ({weight})"] = pd.DataFrame(evaluation_metrics).mean()

# Do Semantic Search
search_results = semantic_search(table, test_data, 25)
evaluation_metrics = [
    score(retrieved_chunk_ids, query["selected_chunk_id"])
    for retrieved_chunk_ids, query in zip(search_results, test_data)
]
results["Semantic Search"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

Linear Combination (Weight 0.3): 100%|██████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.97it/s]
Linear Combination (Weight 0.7): 100%|██████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.99it/s]
Linear Combination (Weight 0.9): 100%|██████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.84it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3663.15it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 33.91it/s]

+-----------+----------------------------+----------------------------+----------------------------+-------------------+
|           |   Linear Combination (0.3) |   Linear Combination (0.7) |   Linear Combination (0.9) |   Semantic Search |
| mrr@3     |                       0.31 |                       0.35 |                       0.35 |              0.4  |
+-----------+----------------------------+----------------------------+----------------------------+-------------------+
| mrr@5     |                       0.37 |                       0.4  |                       0.4  |              0.45 |
+-----------+----------------------------+----------------------------+----------------------------+-------------------+
| mrr@10    |                       0.4  |                       0.43 |                       0.43 |              0.45 |
+-----------+----------------------------+----------------------------+----------------------------+-------------------+
| mrr@15    |                   




# Re-Rankers

Re-rankers are often much more accurate at finding relevant documents than simple embedding search, because they're able to extract significantly more information from the query and the text itself.

Typically we'd use a re-ranker as the second stage of a two-stage retrieval process:

1. First, fetch the candidate chunks
2. Pass them to a re-ranker
3. Return a subset of the re-ranked chunks

This lets us take advantage of the fast retrieval of embedding search to quickly narrow down the set of chunks to evaluate, while still benefiting from the increased accuracy of a re-ranker.
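The steps above can be sketched as a small, self-contained pipeline. Everything here is a toy stand-in: `cheap_retrieve` uses word overlap in place of embedding search, and `toy_rerank_score` is a hypothetical scoring function standing in for a real re-ranker model.

```python
def cheap_retrieve(query, corpus, limit):
    """Stage 1: fast, rough candidate selection (toy word-overlap score)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda cid: len(query_words & set(corpus[cid].lower().split())),
        reverse=True,
    )
    return ranked[:limit]


def toy_rerank_score(query, text):
    """Stage 2 stand-in: rewards passages that contain the exact query phrase."""
    overlap = len(set(query.lower().split()) & set(text.lower().split()))
    return overlap + (10 if query.lower() in text.lower() else 0)


def retrieve_then_rerank(query, corpus, fetch_k, top_n):
    # 1. Fetch candidates with the cheap retriever.
    candidates = cheap_retrieve(query, corpus, fetch_k)
    # 2. Re-rank only those candidates with the more expensive scorer.
    reranked = sorted(
        candidates,
        key=lambda cid: toy_rerank_score(query, corpus[cid]),
        reverse=True,
    )
    # 3. Return a subset of the re-ranked candidates.
    return reranked[:top_n]


corpus = {
    "c1": "changes in oil prices between 2010 and 2020",
    "c2": "you can go 5000 miles between oil changes",
    "c3": "cooking with olive oil",
}
query = "between oil changes"
print(cheap_retrieve(query, corpus, limit=1))
print(retrieve_then_rerank(query, corpus, fetch_k=2, top_n=1))
```

The first stage alone puts `c1` on top because it shares the same words as the query, while the second stage promotes `c2`, which actually contains the phrase. The re-ranker only ever sees `fetch_k` candidates, which is what keeps the expensive step affordable.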

## Cohere Re-Ranker

In [10]:
from lib.query import linear_combination_search, semantic_search
from lib.data import get_labels
from lib.models import EmbeddedPassage
from lib.db import get_table
import lancedb


db = lancedb.connect("../lance")

weightage = [0.3, 0.7, 0.9]

test_data = get_labels("../data/queries_single_label.jsonl")[:20]
table = get_table(db, "ms_marco", EmbeddedPassage)

In [11]:
test_data[0]

{'query': 'what is rba',
 'selected_chunk_id': 'ca869ae1ed3f5021cb5b2a0b78cc846c'}

In [12]:
def normal_search(query, table, limit):
    return [
        item["chunk_id"]
        for item in table.search(query, query_type="fts")
        .limit(limit)
        .to_list()
    ]


query = test_data[0]
retrieved_chunk_ids = normal_search(query["query"], table, 25)
evaluation_metrics = score(retrieved_chunk_ids, query["selected_chunk_id"])
evaluation_metrics

{'mrr@3': 0,
 'mrr@5': 0,
 'mrr@10': 0.167,
 'mrr@15': 0.167,
 'mrr@25': 0.167,
 'recall@3': 0.0,
 'recall@5': 0.0,
 'recall@10': 1.0,
 'recall@15': 1.0,
 'recall@25': 1.0}

In [13]:
from lancedb.rerankers import CohereReranker


def rerank_search(query, table, limit):
    reranker = CohereReranker(
        model_name="rerank-english-v2.0"
    )  # Use Cohere's rerank-english-v2.0 model
    return [
        item["chunk_id"]
        for item in table.search(query, query_type="fts")
        .limit(limit)
        .rerank(reranker=reranker)
        .to_list()
    ]


query = test_data[0]
retrieved_chunk_ids = rerank_search(query["query"], table, 25)
evaluation_metrics = score(retrieved_chunk_ids, query["selected_chunk_id"])
evaluation_metrics

{'mrr@3': 1.0,
 'mrr@5': 1.0,
 'mrr@10': 1.0,
 'mrr@15': 1.0,
 'mrr@25': 1.0,
 'recall@3': 1.0,
 'recall@5': 1.0,
 'recall@10': 1.0,
 'recall@15': 1.0,
 'recall@25': 1.0}

In [9]:
from tqdm import tqdm


scores = []
for query in tqdm(test_data):
    retrieved_chunk_ids = rerank_search(query["query"], table, 25)
    evaluation_metrics = score(retrieved_chunk_ids, query["selected_chunk_id"])
    scores.append(evaluation_metrics)

df = pd.DataFrame(scores)
df.mean()

100%|██████████| 20/20 [00:10<00:00,  1.89it/s]


mrr@3        0.5833
mrr@5        0.6058
mrr@10       0.6058
mrr@15       0.6058
mrr@25       0.6058
recall@3     0.8000
recall@5     0.9000
recall@10    0.9000
recall@15    0.9000
recall@25    0.9000
dtype: float64

In [23]:
scores = []
for query in tqdm(test_queries):
    retrieved_chunk_ids = rerank_search(
        query["query"], table, 50
    )  # What if we increase the number of candidates passed to the re-ranker?
    evaluation_metrics = score(retrieved_chunk_ids, query["selected_chunk_id"])
    scores.append(evaluation_metrics)

df = pd.DataFrame(scores)
df.mean()

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [02:28<00:00,  1.32it/s]


mrr@3        0.522066
mrr@5        0.565179
mrr@10       0.575582
mrr@15       0.575582
mrr@25       0.575898
recall@3     0.714286
recall@5     0.897959
recall@10    0.964286
recall@15    0.964286
recall@25    0.969388
dtype: float64

### Sample Evaluation: Which Model to Use?

Cohere ships a few different re-ranker models (`rerank-english-v3.0`, `rerank-multilingual-v3.0`, `rerank-english-v2.0`, `rerank-multilingual-v2.0`) that perform slightly differently depending on the use case. How can we determine which one might work best for ours?

In [6]:
from lib.query import cohere_rerank_search, semantic_search
from lib.data import get_labels
from lib.eval import score
from lib.models import EmbeddedPassage
from lib.db import get_table
import pandas as pd
import lancedb
from tabulate import tabulate

db = lancedb.connect("../lance")
model_names = {
    "rr-eng-3": "rerank-english-v3.0",
    "rr-mul-3": "rerank-multilingual-v3.0",
    "rr-eng-2": "rerank-english-v2.0",
    "rr-mul-2": "rerank-multilingual-v2.0",
}

test_data = get_labels("../data/queries_single_label.jsonl")[:5]
table = get_table(db, "ms_marco", EmbeddedPassage)

# Run test_data against candidates
results = {}

# Do Semantic Search
search_results = semantic_search(table, test_data, 25)
evaluation_metrics = [
    score(retrieved_chunk_ids, query["selected_chunk_id"])
    for retrieved_chunk_ids, query in zip(search_results, test_data)
]
results["Semantic Search"] = pd.DataFrame(evaluation_metrics).mean()


for header_name, model_name in model_names.items():
    search_results = cohere_rerank_search(table, test_data, 50, model_name)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[f"({header_name})"] = pd.DataFrame(evaluation_metrics).mean()


# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2262.30it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 32.17it/s]
Cohere Reranker (rerank-english-v3.0): 100%|██████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.20it/s]
Cohere Reranker (rerank-multilingual-v3.0): 100%|█████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.15it/s]
Cohere Reranker (rerank-english-v2.0): 100%|██████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.05it/s]
Cohere Reranker (rerank-multilingual-v2.0): 100%|█████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.08it/s]

+-----------+-------------------+--------------+--------------+--------------+--------------+
|           |   Semantic Search |   (rr-eng-3) |   (rr-mul-3) |   (rr-eng-2) |   (rr-mul-2) |
| mrr@3     |              0.47 |         0.4  |         0.77 |         0.6  |         0.53 |
+-----------+-------------------+--------------+--------------+--------------+--------------+
| mrr@5     |              0.47 |         0.49 |         0.77 |         0.64 |         0.53 |
+-----------+-------------------+--------------+--------------+--------------+--------------+
| mrr@10    |              0.47 |         0.52 |         0.77 |         0.67 |         0.53 |
+-----------+-------------------+--------------+--------------+--------------+--------------+
| mrr@15    |              0.47 |         0.52 |         0.77 |         0.67 |         0.53 |
+-----------+-------------------+--------------+--------------+--------------+--------------+
| mrr@25    |              0.47 |         0.52 |         0.7




# Metadata Ingestion

We've looked at different ways that we can set up our retrieval pipeline. Let's now switch gears and see how we can experiment with metadata ingestion to improve the quality of our search.

## Creating Question-Answer Pairs

We previously used Instructor to generate synthetic questions and answers. Let's see how we can expand on this to improve our retrieval pipeline.

In [11]:
import lancedb

db = lancedb.connect("../lance")
table = db.open_table("ms_marco")

In [2]:
chunks = table.to_pandas()["text"].tolist()

In [3]:
# Now for each chunk, we'll generate a question and answer pair that we'll embed into our database

chunk = chunks[0]
chunk

"Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site."

In [4]:
from lib.synthethic import generate_question_batch

question = await generate_question_batch([chunk], 20)
question

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.49s/it]


[{'response': QuestionAnswerResponse(chain_of_thought="The text chunk highlights the impact of the 'Securency' or NPA scandal on the RBA since 2007, describes the RBA's assets and net worth, and provides information about the employees. A hypothetical user might search for details regarding the scandal, the bank's net worth, or the distribution of its employees. The question and answer pair should therefore focus on these unique aspects of the text.", question="What scandal has affected the RBA's reputation since 2007 and how many of its employees work at the headquarters?", answer="Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal, which involved bribing overseas officials to win note-printing contracts. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales, and at the Business Resumption Site."),
  'source': "Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. 

In [12]:
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-3-small")


class EmbeddedPassageWithQA(LanceModel):
    vector: Vector(func.ndims()) = func.VectorField()
    chunk_id: str
    text: str = func.SourceField()
    source_text: str


table_withqa = db.create_table(
    "ms_marco_qa_4", schema=EmbeddedPassageWithQA, mode="overwrite"
)

In [6]:
questions = await generate_question_batch(chunks, 40)

100%|███████████████████████████████████████████████████████████████████████████████████████| 1624/1624 [02:26<00:00, 11.12it/s]


In [8]:
import json

with open("../data/synth-questions-4o.jsonl", "w") as file:
    for question in questions:
        question_obj = {
            "question": question["response"].question,
            "answer": question["response"].answer,
            "chunk": question["source"],
        }
        file.write(json.dumps(question_obj) + "\n")

In [15]:
from lib.data import get_labels

cached_questions = get_labels("../data/synth-questions-4o.jsonl")

In [19]:
import hashlib

from itertools import batched
from tqdm import tqdm


def get_chunks(questions):
    for question in questions:
        chunk_id = hashlib.md5(question["chunk"].encode()).hexdigest()
        yield {
            "chunk_id": chunk_id,
            "text": question["chunk"],
            "source_text": question["chunk"],
        }
        yield {
            "chunk_id": chunk_id,
            "text": question["question"],
            "source_text": question["chunk"],
        }
        yield {
            "chunk_id": chunk_id,
            "text": question["answer"],
            "source_text": question["chunk"],
        }


batches = batched(get_chunks(cached_questions), 40)

for batch in tqdm(batches):
    table_withqa.add(list(batch))

122it [03:15,  1.60s/it]


In [20]:
table_withqa.create_fts_index(["text", "source_text"])

In [13]:
from lib.data import get_labels
from lib.eval import score
from tqdm import tqdm
import pandas as pd
from lib.db import get_table
from lib.models import EmbeddedPassageWithQA, EmbeddedPassage
from lib.query import full_text_search, semantic_search
from itertools import product
from tabulate import tabulate

test_data = get_labels("../data/queries_single_label.jsonl")[:30]
table_withqa = get_table(db, "ms_marco_qa_4", EmbeddedPassageWithQA)
table = get_table(db, "ms_marco", EmbeddedPassage)


candidates = {
    "SS": semantic_search,
    "FTS": full_text_search,
}

tables = {"With Q": table_withqa, "W/o  Q": table}

results = {}

for (
    candidate_fn_tuple,
    db_table_tuple,
) in product(candidates.items(), tables.items()):
    db_name, db_table = db_table_tuple
    candidate_name, search_fn = candidate_fn_tuple
    search_results = search_fn(db_table, test_data, 25)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[f"{candidate_name} ({db_name})"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|██████████| 2/2 [00:00<00:00, 6898.53it/s]
100%|██████████| 30/30 [00:01<00:00, 21.84it/s]
100%|██████████| 2/2 [00:00<00:00, 18040.02it/s]
100%|██████████| 30/30 [00:00<00:00, 33.34it/s]
100%|██████████| 30/30 [00:00<00:00, 40.45it/s]
100%|██████████| 30/30 [00:00<00:00, 45.55it/s]

+-----------+---------------+---------------+----------------+----------------+
|           |   SS (With Q) |   SS (W/o  Q) |   FTS (With Q) |   FTS (W/o  Q) |
| mrr@3     |          0.25 |          0.37 |           0.18 |           0.26 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@5     |          0.31 |          0.41 |           0.22 |           0.31 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@10    |          0.34 |          0.43 |           0.26 |           0.34 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@15    |          0.35 |          0.43 |           0.27 |           0.35 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@25    |          0.35 |          0.43 |           0.27 |           0.35 |
+-----------+---------------+---------------+----------------+----------------+
| recall@3  |          0.4  |          0




In [14]:
from lib.data import get_labels
from lib.eval import score
from tqdm import tqdm
import pandas as pd
from lib.db import get_table
from lib.models import EmbeddedPassageWithQA, EmbeddedPassage
from lib.query import full_text_search, semantic_search
from itertools import product
from tabulate import tabulate

test_data = get_labels("../data/queries_multi_label.jsonl")[:30]
table_withqa = get_table(db, "ms_marco_qa_4", EmbeddedPassageWithQA)
table = get_table(db, "ms_marco", EmbeddedPassage)

candidates = {
    "SS": semantic_search,
    "FTS": full_text_search,
}

tables = {"With Q": table_withqa, "W/o  Q": table}

results = {}

for (
    candidate_fn_tuple,
    db_table_tuple,
) in product(candidates.items(), tables.items()):
    db_name, db_table = db_table_tuple
    candidate_name, search_fn = candidate_fn_tuple
    search_results = search_fn(db_table, test_data, 25)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_ids"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[f"{candidate_name} ({db_name})"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers="keys", tablefmt="grid"))

100%|██████████| 2/2 [00:00<00:00, 5777.28it/s]
100%|██████████| 30/30 [00:01<00:00, 23.80it/s]
100%|██████████| 2/2 [00:00<00:00, 11765.23it/s]
100%|██████████| 30/30 [00:01<00:00, 29.77it/s]
100%|██████████| 30/30 [00:00<00:00, 33.90it/s]
100%|██████████| 30/30 [00:00<00:00, 43.59it/s]

+-----------+---------------+---------------+----------------+----------------+
|           |   SS (With Q) |   SS (W/o  Q) |   FTS (With Q) |   FTS (W/o  Q) |
| mrr@3     |          0.27 |          0.39 |           0.21 |           0.3  |
+-----------+---------------+---------------+----------------+----------------+
| mrr@5     |          0.33 |          0.42 |           0.25 |           0.35 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@10    |          0.36 |          0.43 |           0.29 |           0.38 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@15    |          0.37 |          0.44 |           0.31 |           0.38 |
+-----------+---------------+---------------+----------------+----------------+
| mrr@25    |          0.37 |          0.44 |           0.31 |           0.38 |
+-----------+---------------+---------------+----------------+----------------+
| recall@3  |          0.4  |          0




In [54]:
table_withqa = get_table(db, "ms_marco_qa_4", EmbeddedPassageWithQA)
desired_result = "9dc539403752581ac234e01d2eb77877"
queried_results = (
    table_withqa.search("what species is a dandelion", query_type="fts")
    .select(["chunk_id", "text", "source_text"])
    .limit(20)
    .to_pandas()
)
queried_results[queried_results["chunk_id"] == desired_result]

| | chunk_id | text | source_text | score |
|---|---|---|---|---|
| 0 | 9dc539403752581ac234e01d2eb77877 | The common species of Taraxacum that rapidly c... | Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg... | 29.076788 |
| 1 | 9dc539403752581ac234e01d2eb77877 | Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg... | Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg... | 28.786659 |
| 11 | 9dc539403752581ac234e01d2eb77877 | What is the common species of Taraxacum that r... | Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg... | 22.830486 |
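Because each chunk is now stored up to three times (as the passage, the question and the answer), the same `chunk_id` can appear several times in one result list, as in the output above. One possible post-processing step is to collapse duplicates before scoring, so that repeated hits don't push other chunks out of the top `k`. Whether the helpers in `lib.query` already do this isn't shown here, so treat it as an assumption; `dedupe_chunk_ids` is a hypothetical helper and the ids below are placeholders.

```python
def dedupe_chunk_ids(chunk_ids):
    """Keep only the first (highest-ranked) occurrence of each chunk_id."""
    seen = set()
    unique = []
    for chunk_id in chunk_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            unique.append(chunk_id)
    return unique


# Placeholder ids: "chunk-a" is returned three times by the search.
ranked = ["chunk-a", "chunk-a", "chunk-b", "chunk-a", "chunk-c"]
print(dedupe_chunk_ids(ranked))
```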


In [52]:
table = get_table(db, "ms_marco", EmbeddedPassage)

desired_result = "9dc539403752581ac234e01d2eb77877"
queried_results = (
    table.search("what species is a dandelion", query_type="fts")
    .select(["chunk_id", "text"])
    .limit(20)
    .to_pandas()
)
# queried_results[queried_results['chunk_id'] == '9dc539403752581ac234e01d2eb77877']
queried_results

Unnamed: 0,chunk_id,text,score
0,9dc539403752581ac234e01d2eb77877,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,14.361879
1,89c9edbed59513ee50cb83e1cfb1ebbe,Taraxacum /təˈraeksəkʉm/ təˈræksəkʉm is a larg...,14.211342
2,591eae9a842ee783b157f2b033519a2e,Both species are edible in their entirety. The...,13.762878
3,0853e1b91ecae08717962457740453fa,Description. There are considered to be about ...,13.485062
4,ea830e20c43da3d767c180fee0f061de,"About this species. Dandelions are well-known,...",13.234418
5,a75f014e6c6f31495e04f78e8476b54d,Plant Description. Hundreds of species of dand...,12.953527
6,a5872798dffbe3077255ed0dd0dd7a12,Each single flower in a head is called a flore...,12.775234
7,c97bb0fb297bff6c2ea72c07962cf198,The common name Dandelion is given to members ...,9.375005
8,8f3042881945d16694e29a710681255e,Level 3 Bonding can be described as a biologic...,6.27281
9,f982e5c138dc4c59c837a71a577de593,While you’ll never hear me use the phrase “It ...,5.874215


## Creating Metadata using GPT-4o

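Now that we've seen how to create synthetic questions, let's look at another way to improve our search pipeline: generating metadata for each chunk at ingestion time. The idea is to have a model extract keywords and hypothetical search phrases from each passage, then index them alongside the text so that full-text search has more terms to match. The cells below use `instructor` with `gpt-3.5-turbo` to do this.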

In [4]:
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())

class Metadata(BaseModel):
    """
    This is a model which represents some metadata that we want to generate from a given text. 
    
    Make sure to expand on the text by extracting out any acronyms, context or phrases that users might search for later on \
    when trying to retrieve this specific chunk and model the metadata in a way that allows us to retrieve the most relevant chunks when searching for the query
    """

    keywords: list[str] = Field(
        ...,
        description="This is a field which represents keywords that a user might use to search for this text",
    )
    hypothetical_phrases: list[str] = Field(
        ...,
        description="This is a field which represents hypothetical phrases that a user might use to search for this text",
    )


def enhance_query(text_chunk: str):
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Metadata,
        messages=[
            {
                "role": "system",
                "content": "You are a world class query indexing system. You are about to be passed a text chunk and you'll need to generate some metadata that will allow you to retrieve this specific chunk when the user makes a relevant query",
            },
            {"role": "user", "content": f"The text chunk is {text_chunk}"},
        ],
    )


enhance_query("Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.")

Metadata(keywords=['RBA', 'outstanding reputation', 'Securency', 'NPA scandal', 'subsidairies', 'bribing', 'note-printing contracts', 'assets', 'gold reserves', 'foreign exchange reserves', 'net worth', 'A$101 billion', 'employees', 'headquarters', 'Sydney', 'Business Resumption Site'], hypothetical_phrases=['RBA scandal', 'Australia note-printing contracts', 'RBA assets', 'RBA employees', 'RBA headquarters'])

In [2]:
from lib.synthethic import generate_metadata_batch
import lancedb

db = lancedb.connect("../lance")
table = db.open_table("ms_marco")


chunks = table.to_pandas()["text"].tolist()
res = await generate_metadata_batch(chunks[:2], 20)
for item in res:
    print(item)

100%|██████████| 2/2 [00:01<00:00,  1.01it/s]

{'response': Metadata(keywords=['RBA', 'outstanding reputation', 'Securency scandal', 'NPA scandal', 'bribing', 'overseas officials', 'note-printing contracts', 'gold reserves', 'foreign exchange reserves', 'net worth', 'A$101 billion', 'employees', 'Sydney', 'New South Wales', 'Business Resumption Site'], hypothetical_phrases=['RBA scandal', 'bank assets', 'RBA headquarters', 'Australia reserves', 'RBA employees']), 'source': "Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site."}
{'response': Metadata(keywords=['Reserve Bank of Australia', 'RBA', 'centr




In [3]:
from lib.data import save_labels

# Generate metadata for every chunk in the table
metadata_batch = await generate_metadata_batch(chunks, 20)

data = []

for metadata in metadata_batch:
    # Convert the pydantic Metadata object into a plain dict
    response_data = metadata["response"].model_dump()

    # Store the source passage alongside its generated fields
    data.append({"source": metadata["source"], **response_data})

save_labels(data, "../data/synth-metadata-gpt3.jsonl")

100%|██████████| 1624/1624 [00:57<00:00, 28.47it/s] 

Labels saved to ../data/synth-metadata-gpt3.jsonl
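
The `save_labels` and `get_labels` helpers from `lib.data` are not shown in this notebook. Judging by the `.jsonl` extension, they presumably write and read one JSON object per line. The sketch below shows what such a round trip could look like; `write_jsonl` and `read_jsonl` are hypothetical stand-ins, not the actual implementation.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line (hypothetical stand-in for save_labels)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative record shaped like the rows saved above.
records = [
    {
        "source": "example passage",
        "keywords": ["RBA"],
        "hypothetical_phrases": ["RBA scandal"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.jsonl"
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(loaded == records)
```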





In [10]:
from lib.data import get_labels

data = get_labels("../data/synth-metadata-gpt3.jsonl")
print(data[:2])


[{'source': "Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.", 'keywords': ['RBA', 'Securency', 'NPA scandal', 'bribery', 'note-printing contracts', 'gold reserves', 'foreign exchange reserves', 'net worth', 'Sydney', 'New South Wales', 'Business Resumption Site'], 'hypothetical_phrases': ['RBA outstanding reputation', 'RBA subsidiaries scandal', 'Australia note-printing contracts', 'RBA assets', 'RBA headquarters Sydney', 'RBA employees']}, {'source': "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central ban

In [14]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-3-small")

db = lancedb.connect("../lance")


class EmbeddedPassageWithMetadata(LanceModel):
    vector: Vector(func.ndims()) = func.VectorField()
    chunk_id: str
    text: str = func.SourceField()
    keywords: str
    search_queries: str

table_with_metadata = db.create_table(
    "ms_marco_metadata", schema=EmbeddedPassageWithMetadata, mode="overwrite"
)

In [None]:
from lib.synthethic import generate_metadata_batch

db = lancedb.connect("../lance")
table = db.open_table("ms_marco")


chunks = table.to_pandas()["text"].tolist()
await generate_metadata_batch(chunks[:10], 10)

100%|██████████| 10/10 [00:02<00:00,  4.35it/s]


[{'response': Metadata(keywords=['RBA', 'Securency', 'NPA scandal', 'Australia', 'gold reserves', 'foreign exchange reserves', 'Sydney', 'New South Wales', 'Business Resumption Site'], hypothetical_phrases=['RBA outstanding reputation', 'bribing overseas officials', 'note-printing contracts', 'net worth of A$101 billion', 'RBA headquarters', 'RBA employees']),
  'source': "Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site."},
 {'response': Metadata(keywords=['Reserve Bank of Australia', 'RBA', 'central bank', 'banknote issuing authority', 'Reserve Bank 

In [16]:
import hashlib
from itertools import batched  # itertools.batched requires Python 3.12+
from tqdm import tqdm


def format_metadata(metadata):
    """Flatten one metadata record into a row for EmbeddedPassageWithMetadata."""
    return {
        # Hash the passage text so the id is deterministic
        "chunk_id": hashlib.md5(metadata["source"].encode()).hexdigest(),
        "text": metadata["source"],
        # The schema stores these lists as comma-separated strings
        "keywords": ",".join(metadata["keywords"]),
        "search_queries": ",".join(metadata["hypothetical_phrases"]),
    }


def metadata_generator(metadata_batch):
    for metadata in metadata_batch:
        yield format_metadata(metadata)


# Insert rows in groups of 20
metadata_batch = batched(metadata_generator(data), 20)

for item in tqdm(metadata_batch):
    table_with_metadata.add(list(item))


82it [01:45,  1.29s/it]
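
Note that `itertools.batched` was added in Python 3.12, so the cell above fails with an `ImportError` on older interpreters. A minimal fallback, following the recipe style used in the `itertools` documentation, could look like the sketch below; `batched_compat` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched_compat(iterable: Iterable, n: int) -> Iterator[tuple]:
    """Yield successive tuples of up to n items; the last one may be shorter."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls at most n items per pass; an empty tuple ends the loop.
    while batch := tuple(islice(iterator, n)):
        yield batch


batches = list(batched_compat(range(5), 2))
print(batches)
```

On Python versions before 3.12, `batched_compat(metadata_generator(data), 20)` could replace the `batched(...)` call in the cell above.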


In [17]:
table_with_metadata.create_fts_index(["text","keywords","search_queries"])

In [25]:
from lib.data import get_labels
from lib.eval import score
from tqdm import tqdm
import pandas as pd
from lib.db import get_table
from lib.models import EmbeddedPassageWithQA, EmbeddedPassage, EmbeddedPassageWithMetadata
from lib.query import full_text_search, semantic_search
from itertools import product
from tabulate import tabulate

test_data = get_labels("../data/queries_single_label.jsonl")[:60]
table_with_metadata = get_table(db, "ms_marco_metadata", EmbeddedPassageWithMetadata)
table = get_table(db, "ms_marco", EmbeddedPassage)

candidates = {
    "FTS": full_text_search,
}

tables = {"With Metadata": table_with_metadata, "W/o  Metadata": table}

results = {}

# Evaluate every (search function, table) combination
for (candidate_name, search_fn), (db_name, db_table) in product(
    candidates.items(), tables.items()
):
    search_results = search_fn(db_table, test_data, 25)
    evaluation_metrics = [
        score(retrieved_chunk_ids, query["selected_chunk_id"])
        for retrieved_chunk_ids, query in zip(search_results, test_data)
    ]
    results[f"{candidate_name} ({db_name})"] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(3), headers="keys", tablefmt="grid"))

100%|██████████| 60/60 [00:02<00:00, 25.79it/s]
100%|██████████| 60/60 [00:01<00:00, 39.90it/s]

+-----------+-----------------------+-----------------------+
|           |   FTS (With Metadata) |   FTS (W/o  Metadata) |
| mrr@3     |                 0.344 |                 0.339 |
+-----------+-----------------------+-----------------------+
| mrr@5     |                 0.399 |                 0.387 |
+-----------+-----------------------+-----------------------+
| mrr@10    |                 0.428 |                 0.415 |
+-----------+-----------------------+-----------------------+
| mrr@15    |                 0.429 |                 0.418 |
+-----------+-----------------------+-----------------------+
| mrr@25    |                 0.432 |                 0.419 |
+-----------+-----------------------+-----------------------+
| recall@3  |                 0.5   |                 0.517 |
+-----------+-----------------------+-----------------------+
| recall@5  |                 0.733 |                 0.733 |
+-----------+-----------------------+-----------------------+
| recall
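
The tables in this notebook report `mrr@k` and `recall@k`, but `lib.eval.score` itself is not shown here. As a rough guide to reading those numbers, the sketch below computes both metrics for a query with a single relevant chunk. The function names and the cutoffs in `ks` are assumptions chosen to mirror the table rows; the real `score` may differ.

```python
def reciprocal_rank_at_k(retrieved_ids: list[str], relevant_id: str, k: int) -> float:
    """Return 1/rank of the relevant id within the top k, or 0.0 if absent."""
    for rank, chunk_id in enumerate(retrieved_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved_ids: list[str], relevant_id: str, k: int) -> float:
    """With one relevant chunk, recall@k is 1.0 on a hit and 0.0 otherwise."""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0


def score_sketch(
    retrieved_ids: list[str],
    relevant_id: str,
    ks: tuple[int, ...] = (3, 5, 10, 15, 25),  # assumed cutoffs
) -> dict[str, float]:
    """Compute both metrics at each cutoff for one query."""
    metrics: dict[str, float] = {}
    for k in ks:
        metrics[f"mrr@{k}"] = reciprocal_rank_at_k(retrieved_ids, relevant_id, k)
    for k in ks:
        metrics[f"recall@{k}"] = recall_at_k(retrieved_ids, relevant_id, k)
    return metrics


# Toy example: the relevant chunk "d" is ranked fourth.
example = score_sketch(["b", "c", "a", "d"], "d")
print(example["mrr@3"], example["mrr@5"], example["recall@5"])
```

Averaging these per-query dictionaries over the test set, as the evaluation cells do with `pd.DataFrame(evaluation_metrics).mean()`, gives one value per metric.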


