# Pipeline Upgrades

In this section, we use the building blocks from the previous sections to test different pipeline combinations.

## Experimentation

In our previous example, we saw how to evaluate a RAG system using simple metrics such as recall. Let's use the same approach to quickly experiment with different pipelines and get a quantitative idea of how they stack up against each other.

> Make sure the environment variable `COHERE_API_KEY` is set in your shell for this section.

### Re-Rankers

How do re-rankers affect the final returned results?
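
Conceptually, a re-ranker takes the candidates returned by a first-stage search and reorders them using a second, usually more expensive, relevance score. The sketch below illustrates only that reordering step. It is not how `CohereReranker` works internally: `overlap_score` and `rerank` are hypothetical names, and the token-overlap score is a toy stand-in for a learned relevance model.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(query: str, candidates: list[dict], score_fn=overlap_score) -> list[dict]:
    """Reorder first-stage candidates by descending score (ties keep their order)."""
    return sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)


# Hypothetical first-stage results, in the order the initial search returned them
candidates = [
    {"chunk_id": "a", "text": "The Commonwealth Bank is a retail bank"},
    {"chunk_id": "b", "text": "The Reserve Bank of Australia is the central bank"},
]
reranked = rerank("reserve bank of australia", candidates)
print([c["chunk_id"] for c in reranked])
```

Here the second candidate matches every query token, so it moves to the top. A real re-ranker plays the same role with a much stronger scoring function.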

In [2]:
from lib.data import get_labels
from lib.db import get_table
from lib.models import EmbeddedPassage
import lancedb

db = lancedb.connect("../lance")
test_data = get_labels("../queries_single_label.json")
table = get_table(db,"ms_marco",EmbeddedPassage)

In [12]:
from lancedb.rerankers import CohereReranker
from lib.eval import score

def rerank_search(query,table,limit,query_type):
    reranker = CohereReranker()
    return [
        item["chunk_id"]
        for item in table.search(query, query_type=query_type).limit(limit).rerank(reranker=reranker).to_list()
    ]

query = test_data[0]
retrieved_chunk_ids = rerank_search(query['query'],table,25,"fts")
evaluation_metrics = score(retrieved_chunk_ids,query['selected_chunk_id'])
evaluation_metrics

{'mrr@3': 1.0,
 'mrr@5': 1.0,
 'mrr@10': 1.0,
 'mrr@15': 1.0,
 'mrr@25': 1.0,
 'recall@3': 1.0,
 'recall@5': 1.0,
 'recall@10': 1.0,
 'recall@15': 1.0,
 'recall@25': 1.0}
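
The dictionary above reports a reciprocal rank and a recall value at several cutoffs. As a rough guide to reading those numbers, here is a minimal sketch of how such metrics could be computed when each query has exactly one relevant chunk. `score_sketch` and its helpers are hypothetical and are not the implementation of `lib.eval.score`, which is not shown in this section.

```python
def reciprocal_rank_at_k(retrieved: list[str], relevant_id: str, k: int) -> float:
    """Return 1/rank of the relevant chunk if it appears in the top k, else 0."""
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: list[str], relevant_id: str, k: int) -> float:
    """With a single relevant chunk, recall@k is 1 if it is in the top k, else 0."""
    return 1.0 if relevant_id in retrieved[:k] else 0.0


def score_sketch(retrieved: list[str], relevant_id: str, ks=(3, 5, 10, 15, 25)) -> dict:
    """Compute both metrics at every cutoff for a single query."""
    metrics = {f"mrr@{k}": reciprocal_rank_at_k(retrieved, relevant_id, k) for k in ks}
    metrics.update({f"recall@{k}": recall_at_k(retrieved, relevant_id, k) for k in ks})
    return metrics


# The relevant chunk "c7" is ranked second, so its reciprocal rank is 1/2
example = score_sketch(["c2", "c7", "c1"], "c7")
print(example["mrr@3"], example["recall@3"])
```

For a single query the "mrr" entries are just reciprocal ranks. The mean in MRR comes from averaging these values across many queries.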

In [18]:
from lib.query import full_text_search,semantic_search
import pandas as pd
from tabulate import tabulate

test_data = get_labels("../queries_single_label.json")[:10]

def rerank_semantic(table,queries,top_k):
    return [rerank_search(query['query'],table,top_k,"vector") for query in queries]

def rerank_fts(table,queries,top_k):
    return [rerank_search(query['query'],table,top_k,"fts") for query in queries]

candidates = {
    "Full Text Search": full_text_search,
    "Semantic Search": semantic_search,
    "Reranked Full Text Search" : rerank_fts,
    "Reranked Semantic":rerank_semantic
}
results = {}

for candidate,search_fn in candidates.items():
    search_results = search_fn(table,test_data,25)
    evaluation_metrics = [
        score(retrieved_chunk_ids,query['selected_chunk_id']) 
        for retrieved_chunk_ids,query in zip(search_results,test_data)
    ]
    results[candidate] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers='keys', tablefmt='grid'))

100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 43.82it/s]
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 43.71it/s]


+-----------+--------------------+-------------------+-----------------------------+---------------------+
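|           |   Full Text Search |   Semantic Search |   Reranked Full Text Search |   Reranked Semantic |
+===========+====================+===================+=============================+=====================+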
| mrr@3     |               0.25 |               0.5 |                        0.67 |                0.67 |
+-----------+--------------------+-------------------+-----------------------------+---------------------+
| mrr@5     |               0.34 |               0.5 |                        0.69 |                0.69 |
+-----------+--------------------+-------------------+-----------------------------+---------------------+
| mrr@10    |               0.37 |               0.5 |                        0.69 |                0.69 |
+-----------+--------------------+-------------------+-----------------------------+---------------------+
| mrr@15    |               0.37 |               0.5 |                        0.69 |                0.69 |
+-----------+--------------------+---

## Metadata Ingestion

Now let's see how we can use some metadata to improve our retrieval performance.

## Embedding Questions

We previously used Instructor to generate synthetic questions and answers. Let's see how we can expand on this to improve our retrieval pipeline.

In [34]:
import lancedb

db = lancedb.connect("../lance")
table = db.open_table("ms_marco")

In [2]:
chunks = table.to_pandas()['text'].tolist()
chunks[:2]

["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.",
 "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site."]

In [53]:
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-3-small")

class EmbeddedText(LanceModel):
    vector: Vector(dim=func.ndims()) = func.VectorField()  # type: ignore
    chunk_id: str
    text: str = func.SourceField()
    source_text: str


table_withqa = db.create_table("ms_marco_v2", schema=EmbeddedText,exist_ok=True)

In [18]:
from lib.synthethic import generate_question_batch

questions = await generate_question_batch(chunks,20)

100%|██████████████████████████████████████████| 814/814 [02:41<00:00,  5.05it/s]


In [23]:
from lib.data import save_labels
def get_questions(questions):
    for row in questions:
        for question in row["response"]:
            yield {
                "question":question.question,
                "source":row["source"]
            }

save_labels(get_questions(questions),"../synth_questions.jsonl")

Labels saved to ../synth_questions.jsonl


In [54]:
from lib.data import get_labels
from itertools import batched
import hashlib
from tqdm import tqdm


data = get_labels("../synth_questions.jsonl")

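def question_generator(questions):
    # Iterate over the argument rather than the module-level `data`
    for row in questions:
        # Generated questions share the chunk_id of their source passage,
        # so a hit on a question maps back to the passage it came from
        chunk_id = hashlib.md5(row['source'].encode()).hexdigest()
        # Row that embeds the passage itself. It is emitted once per question,
        # so a passage with several questions is stored more than once.
        yield {
            'chunk_id': chunk_id,
            'source_text':row['source'],
            'text': row['source']
        }
        # Row that embeds the generated question
        yield {
            'chunk_id': chunk_id,
            'source_text':row['source'],
            'text': row['question']
        }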

        

batched_data = batched(question_generator(data),30)

for batch in tqdm(batched_data):
    table_withqa.add(list(batch))

163it [04:04,  1.50s/it]
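
The ingestion cell above relies on `itertools.batched`, which was added in Python 3.12. On older interpreters, a small stand-in along these lines could be used instead. `batched_fallback` is a hypothetical name, and this is only a sketch of the same batching behaviour.

```python
from itertools import islice


def batched_fallback(iterable, n):
    """Yield successive tuples of up to n items from iterable."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls at most n items per pass; an empty tuple ends the loop
    while batch := tuple(islice(iterator, n)):
        yield batch


print(list(batched_fallback(range(7), 3)))
```

The last batch is shorter when the input length is not a multiple of `n`, which matches how the loop above passes each batch to `table_withqa.add`.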


In [63]:
table_withqa.create_fts_index("text")
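
Because each question row reuses the `chunk_id` of its source passage, a search over this table may return the same `chunk_id` more than once in a single result list. Whether the helpers in `lib.query` already handle this is not shown here. If they do not, one option is to collapse duplicates while keeping the best rank of each chunk before scoring. The helper below is a hypothetical sketch of that step.

```python
def dedupe_preserving_rank(chunk_ids: list[str]) -> list[str]:
    """Keep the first (highest-ranked) occurrence of each chunk_id."""
    seen = set()
    unique = []
    for chunk_id in chunk_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            unique.append(chunk_id)
    return unique


# A passage and two of its questions all map to chunk "a"
print(dedupe_preserving_rank(["a", "a", "b", "a", "c"]))
```

Deduplicating before truncating to the top k means that repeated hits on one chunk do not crowd other chunks out of the evaluated list.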

In [64]:
from lib.query import semantic_search,full_text_search
import pandas as pd
from tabulate import tabulate
from lib.eval import score

test_data = get_labels("../queries_single_label.json")

candidates = {
    "Semantic Search on Metadata": semantic_search,
    "Full Text Search on Metadata": full_text_search
}

table_withqa = db.open_table("ms_marco_v2")

results = {}

for candidate,search_fn in candidates.items():
    search_results = search_fn(table_withqa,test_data,25)
    evaluation_metrics = [
        score(retrieved_chunk_ids,query['selected_chunk_id']) 
        for retrieved_chunk_ids,query in zip(search_results,test_data)
    ]
    results[candidate] = pd.DataFrame(evaluation_metrics).mean()

# Convert the dictionary to a DataFrame
df = pd.DataFrame(results)

# Print the table
print(tabulate(df.round(2), headers='keys', tablefmt='grid'))

100%|██████████████████████████████████████████| 111/111 [00:06<00:00, 17.57it/s]
100%|██████████████████████████████████████████| 111/111 [00:02<00:00, 47.57it/s]

+-----------+-------------------------------+--------------------------------+
|           |   Semantic Search on Metadata |   Full Text Search on Metadata |
+===========+===============================+================================+
| mrr@3     |                          0.33 |                           0.3  |
+-----------+-------------------------------+--------------------------------+
| mrr@5     |                          0.36 |                           0.33 |
+-----------+-------------------------------+--------------------------------+
| mrr@10    |                          0.38 |                           0.36 |
+-----------+-------------------------------+--------------------------------+
| mrr@15    |                          0.39 |                           0.37 |
+-----------+-------------------------------+--------------------------------+
| mrr@25    |                          0.4  |                           0.38 |
+-----------+-------------------------------+--------------------------------+
| recall@3  |                          0.44 |       


