# Semantic Search & RAG - Hard Tasks

Advanced RAG techniques that are used in production systems.

**Topics:**
- Multi-vector retrieval (ColBERT-style)
- Cross-encoder reranking
- HyDE (Hypothetical Document Embeddings)
- Self-correcting RAG
- Adaptive retrieval

## Setup

Run all cells in this section to set up the environment and load necessary data.

### [Optional] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [83]:
# %%capture
# !pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1 pandas python-dotenv
# !pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# ## IMPORTANT: Make sure to restart the session after installing the packages above.

### Import Libraries and Setup API

In [84]:
import cohere
import os
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import faiss
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from sentence_transformers import SentenceTransformer, CrossEncoder
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()
api_key = os.environ.get('COHERE_API_KEY')

if api_key is None:
    print("API key not found.")
else:
    print("API key loaded")

co = cohere.Client(api_key)

API key loaded


### Load Sample Dataset

More complex technical documents for testing advanced techniques.

In [85]:
# Technical documentation corpus
documents = [
    """RAG Architecture: Retrieval-Augmented Generation combines information retrieval with 
    language generation. The retrieval component searches a knowledge base for relevant documents. 
    The generation component uses an LLM to synthesize information from retrieved documents into 
    coherent responses. This architecture reduces hallucinations by grounding generation in 
    retrieved facts.""",
    
    """Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries and documents 
    as dense vectors. These vectors capture semantic meaning beyond keyword matching. Models like 
    BERT and sentence transformers learn these representations through contrastive learning on 
    question-answer pairs. The main advantage is finding semantically similar content even when 
    exact words don't match.""",
    
    """Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separately, 
    cross-encoders process query-document pairs jointly. This allows attention mechanisms to 
    model interactions between query terms and document content. Cross-encoders are more accurate 
    but computationally expensive, making them ideal for reranking top candidates from fast 
    first-stage retrieval.""",
    
    """Embedding Model Training: High-quality embeddings require careful training. Contrastive 
    learning optimizes embeddings so relevant pairs are close in vector space while irrelevant 
    pairs are far apart. Hard negative mining selects challenging examples that are similar but 
    not relevant, improving model discrimination. Training data quality matters more than 
    quantity for specialized domains.""",
    
    """Context Window Management: LLMs have fixed context windows measured in tokens. Effective RAG 
    systems must fit relevant information within this limit. Strategies include selecting top-k 
    documents by relevance, truncating individual documents, and reranking to prioritize most 
    useful content. Some systems use sliding windows or hierarchical summarization for very long 
    documents.""",
    
    """Query Understanding: Effective search requires understanding user intent. Query expansion 
    adds related terms to improve recall. Query rewriting reformulates ambiguous queries for 
    clarity. Some systems generate multiple query variations and merge results. Intent 
    classification routes queries to specialized retrievers. These preprocessing steps 
    significantly impact retrieval quality.""",
    
    """Vector Databases: Specialized databases optimize similarity search over high-dimensional 
    vectors. FAISS uses techniques like product quantization and HNSW graphs to enable 
    sub-linear search time. Other systems like Pinecone and Weaviate add features like 
    filtering, hybrid search, and managed infrastructure. Choosing the right index type 
    balances speed, accuracy, and memory usage.""",
    
    """Evaluation Metrics: RAG systems require careful evaluation. Retrieval metrics include 
    recall@k (are relevant docs retrieved?) and MRR (how highly ranked?). Generation metrics 
    assess answer quality through accuracy, completeness, and faithfulness to sources. End-to-end 
    metrics like answer correctness and user satisfaction capture overall system performance.""",
]

print(f"Loaded {len(documents)} technical documents")
print(f"Average length: {np.mean([len(d) for d in documents]):.0f} characters")

Loaded 8 technical documents
Average length: 400 characters


### Helper Functions

In [86]:
def bm25_tokenizer(text):
    """Tokenize text for BM25"""
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)
        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

In [87]:
def print_results(query, results, title="Results"):
    """Pretty print search results"""
    print(f"\n{'='*80}")
    print(f"{title}")
    print(f"Query: {query}")
    print(f"{'='*80}")
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n{i}. Score: {score:.4f}")
        print(f"   {doc[:150]}...")
    print()

## Hard Tasks

Advanced RAG techniques used in production systems.

### Task 1: Multi-Vector Retrieval (ColBERT-style)

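Instead of one embedding per document, use multiple embeddings: each sentence gets its own vector. (ColBERT itself works at the token level and sums the best match for every query token; this task applies the same idea at sentence granularity.)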

**The idea:**
- Split docs into sentences
- Embed each sentence separately  
- At search: find best matching sentence from each doc
- Doc score = score of its best sentence

Why? A single vector compresses a whole document into one point, so specific details get diluted. Per-sentence vectors keep those details searchable.
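
The max-over-sentences scoring described above can be sketched in a few lines. This is a minimal illustration on made-up 2-D vectors, not the notebook's Cohere embeddings; `cosine`, `max_sentence_scores`, and the `toy_*` names are hypothetical helpers introduced only for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_sentence_scores(query_vec, sentence_vecs, sentence_doc_ids):
    """Score each document by its single best-matching sentence."""
    best = {}
    for vec, doc_id in zip(sentence_vecs, sentence_doc_ids):
        score = cosine(query_vec, vec)
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return best

# Toy vectors (assumed values for illustration, not real embeddings).
toy_sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_sentence_doc_ids = [0, 0, 1]  # doc 0 has two sentences, doc 1 has one
toy_query = [0.0, 1.0]

toy_scores = max_sentence_scores(toy_query, toy_sentence_vecs, toy_sentence_doc_ids)
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking)
```

Doc 0 wins here because one of its sentences matches the query exactly, even though its other sentence is unrelated. That is the behavior max aggregation is meant to give.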

In [88]:
# Split each doc into sentences
doc_sentences = []
doc_ids = []

for doc_idx, doc in enumerate(documents):
    # Simple split on '. '
    sents = [s.strip() for s in doc.split('. ') if len(s.strip()) > 20]
    doc_sentences.append(sents)
    
    for sent in sents:
        doc_ids.append(doc_idx)

print(f"Split {len(documents)} docs into {len(doc_ids)} sentences")

Split 8 docs into 33 sentences


In [89]:
# Check first doc
print(f"\nDoc 0 has {len(doc_sentences[0])} sentences:")
for i, sent in enumerate(doc_sentences[0][:2], 1):
    print(f"{i}. {sent[:80]}...")


Doc 0 has 4 sentences:
1. RAG Architecture: Retrieval-Augmented Generation combines information retrieval ...
2. The retrieval component searches a knowledge base for relevant documents...


In [90]:
# Flatten for embedding
all_sents = []
for sents in doc_sentences:
    all_sents.extend(sents)

print(f"\nTotal sentences to embed: {len(all_sents)}")


Total sentences to embed: 33


In [124]:
# Embed all sentences
sent_embs = co.embed(
    texts=all_sents,
    input_type="search_document"
).embeddings
sent_embs = np.array(sent_embs)

print(f"Created {sent_embs.shape[0]} embeddings")

Created 33 embeddings


In [92]:
# Storage cost comparison
print(f"\nSingle-vector: {len(documents)} embeddings")
print(f"Multi-vector: {len(all_sents)} embeddings")



Single-vector: 8 embeddings
Multi-vector: 33 embeddings


In [93]:
# Build a FAISS index over all sentence embeddings
# (IndexFlatL2 does exact search and returns squared L2 distances)
dim = sent_embs.shape[1]
idx_multi = faiss.IndexFlatL2(dim)
idx_multi.add(np.float32(sent_embs))

print(f"Index ready: {idx_multi.ntotal} vectors")

Index ready: 33 vectors


In [94]:
# Search across all sentences; the best sentence per document is picked in a later cell
query = "How do cross-encoders differ from bi-encoders?"

q_emb = co.embed(texts=[query], input_type="search_query").embeddings[0]
dists, idxs = idx_multi.search(np.float32([q_emb]), len(all_sents))

print(f"Query: {query}\n")

Query: How do cross-encoders differ from bi-encoders?



In [95]:
# Top matching sentences
print("Top sentences:")
for i in range(3):
    sent_idx = idxs[0][i]
    sent = all_sents[sent_idx]
    doc_id = doc_ids[sent_idx]
    print(f"\n{i+1}. dist={dists[0][i]:.4f} from doc {doc_id}")
    print(f"   {sent[:100]}...")

Top sentences:

1. dist=5495.3833 from doc 2
   Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separately, 
    cross-en...

2. dist=7663.7764 from doc 2
   Cross-encoders are more accurate 
    but computationally expensive, making them ideal for reranking...

3. dist=8085.0264 from doc 3
   Contrastive 
    learning optimizes embeddings so relevant pairs are close in vector space while irr...


In [96]:
# Aggregate sentence scores to document scores:
# each document takes the score of its best-matching sentence
doc_best_sc = {}
doc_best_sent = {}

for sent_idx, dist in zip(idxs[0], dists[0]):
    doc_id = doc_ids[sent_idx]
    sim = 1 / (1 + dist)  # convert distance to similarity
    
    if doc_id not in doc_best_sc or sim > doc_best_sc[doc_id]:
        doc_best_sc[doc_id] = sim
        doc_best_sent[doc_id] = all_sents[sent_idx]

print(f"Aggregated {len(doc_best_sc)} docs")

Aggregated 8 docs


In [97]:
# Rank by best sentence
ranked = sorted(doc_best_sc.items(), key=lambda x: x[1], reverse=True)


In [98]:
# Display top 3
print(f"\nMulti-vector results:\n")
for i, (doc_id, sc) in enumerate(ranked[:3], 1):
    print(f"{i}. Doc {doc_id} - score: {sc:.4f}")
    print(f"   Best: {doc_best_sent[doc_id][:100]}...")
    print(f"   Full: {documents[doc_id][:100]}...\n")


Multi-vector results:

1. Doc 2 - score: 0.0002
   Best: Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separately, 
    cross-en...
   Full: Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separately, 
    cross-en...

2. Doc 3 - score: 0.0001
   Best: Contrastive 
    learning optimizes embeddings so relevant pairs are close in vector space while irr...
   Full: Embedding Model Training: High-quality embeddings require careful training. Contrastive 
    learnin...

3. Doc 1 - score: 0.0001
   Best: Models like 
    BERT and sentence transformers learn these representations through contrastive lear...
   Full: Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries and documents 
    a...



**Questions:**

1. Try a query matching specific details - does multi-vector help?
2. What if you average all sentence scores instead of taking max?
3. Storage tradeoff - is it worth it?
4. When would multi-vector NOT be worth the cost?
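
The document scores of roughly 0.0002 in the output above are small because `1 / (1 + dist)` is applied to squared L2 distances in the thousands. One possible alternative, assuming cosine-style scores suit the task, is to L2-normalize the vectors before indexing. For unit vectors the two measures are tied by `||a - b||^2 = 2 - 2*cos(a, b)`, so the ranking is unchanged but the scores land in a readable range. The sketch below checks this on made-up vectors; all function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_unit_sq_l2(sq_dist):
    """For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - sq_dist / 2.0

# Made-up unnormalized vectors with large magnitudes.
raw_a = [30.0, 40.0]
raw_b = [40.0, 30.0]

raw_dist = squared_l2(raw_a, raw_b)
unit_dist = squared_l2(l2_normalize(raw_a), l2_normalize(raw_b))
cos_sim = cosine_from_unit_sq_l2(unit_dist)

print(f"raw squared distance: {raw_dist:.1f} -> 1/(1+d) = {1 / (1 + raw_dist):.4f}")
print(f"unit squared distance: {unit_dist:.2f} -> cosine = {cos_sim:.2f}")
```

In the notebook, calling `faiss.normalize_L2` on the float32 arrays before `add` and `search` would likely play the role of `l2_normalize`.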

### Task 2: Cross-Encoder Reranking

Most retrieval: embed query and docs separately, compare vectors (**bi-encoder**).

Cross-encoders: process query+doc TOGETHER. Model sees interactions between them.

**Two-stage pattern:**
1. Fast bi-encoder -> retrieve top 100 candidates
2. Slow cross-encoder -> rescore those candidates, keep top 10

Why? Cross-encoders are accurate but slow. Use them only on top candidates.
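
The two-stage pattern can be sketched independently of any model. In this minimal example the two scorers are toy stand-ins (word overlap for the bi-encoder, exact-phrase match for the cross-encoder); `two_stage_search` and both scorer names are hypothetical, and the point is only that the expensive scorer runs on the shortlisted candidates rather than the whole corpus.

```python
def two_stage_search(query, docs, cheap_score, expensive_score, n_candidates, n_final):
    """Shortlist with a cheap scorer, then rerank the shortlist with an expensive one."""
    # Stage 1: cheap scorer over every document
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: expensive scorer over the shortlist only
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:n_final]

def word_overlap(query, doc):
    """Toy stand-in for a bi-encoder: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

expensive_calls = []

def phrase_match(query, doc):
    """Toy stand-in for a cross-encoder: rewards the exact phrase. Logs each call."""
    expensive_calls.append(doc)
    return 1.0 if query.lower() in doc.lower() else 0.0

toy_docs = [
    "rerank with a cross encoder",
    "cross platform encoder settings",
    "a guide to baking bread",
    "encoder cross talk in cables",
]

top = two_stage_search("cross encoder", toy_docs, word_overlap, phrase_match,
                       n_candidates=3, n_final=1)
print(top)
print(f"expensive scorer ran on {len(expensive_calls)} of {len(toy_docs)} docs")
```

Three documents tie under the cheap scorer, and only the second stage separates the one that actually contains the phrase.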

In [99]:
# Step 1: First stage - Fast bi-encoder retrieval
# Get top candidates quickly

# Load bi-encoder
print("Loading bi-encoder...")
bi_enc = SentenceTransformer('BAAI/bge-small-en-v1.5')
print("Loaded")

# Embed documents
doc_embeds = bi_enc.encode(documents, convert_to_numpy=True)

# Build FAISS index
dim = doc_embeds.shape[1]
index_bi = faiss.IndexFlatL2(dim)
index_bi.add(np.float32(doc_embeds))

print(f"Bi-encoder index built with {index_bi.ntotal} documents")

Loading bi-encoder...
Loaded
Bi-encoder index built with 8 documents


In [102]:
# Step 2: Search with bi-encoder

query = "What are the advantages of cross-encoders over other methods?"

q_emb = bi_enc.encode([query], convert_to_numpy=True)
dists, idxs = index_bi.search(np.float32(q_emb), 5)

# Build candidates list
candidates = []
for idx, dist in zip(idxs[0], dists[0]):
    candidates.append({
        'doc_id': idx,
        'text': documents[idx],
        'bi_dist': dist
    })
print(f"Built {len(candidates)} candidates")

Built 5 candidates


In [104]:
# Display bi-encoder results
print(f"Query: {query}\n")
print("Bi-encoder results:")
for i, c in enumerate(candidates, 1):
    print(f"{i}. dist={c['bi_dist']:.4f}")
    print(f"   {c['text'][:80]}...\n")

Query: What are the advantages of cross-encoders over other methods?

Bi-encoder results:
1. dist=0.4058
   Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separ...

2. dist=0.6338
   Embedding Model Training: High-quality embeddings require careful training. Cont...

3. dist=0.6877
   Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries ...

4. dist=0.7459
   Vector Databases: Specialized databases optimize similarity search over high-dim...

5. dist=0.7490
   RAG Architecture: Retrieval-Augmented Generation combines information retrieval ...


In [106]:
# Load cross-encoder
print("Loading cross-encoder...")
cross_enc = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("Loaded")

print("\nKey difference:")
print("- Bi-encoder: Embeds query and doc separately, then compares vectors")
print("- Cross-encoder: Feeds [query, doc] together, model sees interactions")
print("\nThis means cross-encoder can understand:")
print("  - Which query words match which doc parts")
print("  - Context and relationships between terms")
print("  - But it's much slower (must process each pair)")

Loading cross-encoder...
Loaded

Key difference:
- Bi-encoder: Embeds query and doc separately, then compares vectors
- Cross-encoder: Feeds [query, doc] together, model sees interactions

This means cross-encoder can understand:
  - Which query words match which doc parts
  - Context and relationships between terms
  - But it's much slower (must process each pair)


In [107]:
# Rerank with cross-encoder
pairs = [[query, cand['text']] for cand in candidates]

# Get scores (higher = more relevant)
sc = cross_enc.predict(pairs)

for cand, score in zip(candidates, sc):
    cand['cross_sc'] = score

print(sc)

[  5.4480724 -10.7306     -1.4447366  -7.91539    -9.924572 ]


In [108]:
# Show reranked results
reranked = sorted(candidates, key=lambda x: x['cross_sc'], reverse=True)

print("\nAfter reranking:\n")
for i, c in enumerate(reranked, 1):
    print(f"{i}. cross_sc={c['cross_sc']:.4f} (bi_dist was {c['bi_dist']:.4f})")
    print(f"   {c['text'][:80]}...\n")


After reranking:

1. cross_sc=5.4481 (bi_dist was 0.4058)
   Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separ...

2. cross_sc=-1.4447 (bi_dist was 0.6877)
   Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries ...

3. cross_sc=-7.9154 (bi_dist was 0.7459)
   Vector Databases: Specialized databases optimize similarity search over high-dim...

4. cross_sc=-9.9246 (bi_dist was 0.7490)
   RAG Architecture: Retrieval-Augmented Generation combines information retrieval ...

5. cross_sc=-10.7306 (bi_dist was 0.6338)
   Embedding Model Training: High-quality embeddings require careful training. Cont...



In [109]:
# Compare rankings
print("\nRanking comparison:\n")
print("Pos | Bi-Encoder | Cross-Encoder")
print("-" * 40)

for i in range(len(candidates)):
    bi_doc = candidates[i]['doc_id']
    cross_doc = reranked[i]['doc_id']
    changed = " <- changed" if bi_doc != cross_doc else ""
    print(f" {i+1}  |    Doc {bi_doc}   |    Doc {cross_doc}{changed}")


Ranking comparison:

Pos | Bi-Encoder | Cross-Encoder
----------------------------------------
 1  |    Doc 2   |    Doc 2
 2  |    Doc 3   |    Doc 1 <- changed
 3  |    Doc 1   |    Doc 6 <- changed
 4  |    Doc 6   |    Doc 0 <- changed
 5  |    Doc 0   |    Doc 3 <- changed


**Questions:**

1. Try different queries - does reranking always help?
2. What about very specific technical queries?
3. Retrieve 10 candidates instead of 5 - does reranking improve more?
4. When is the extra cost NOT worth it?
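
The position-by-position table above flags four rows as changed, which hides what actually moved. A per-document view is sometimes easier to read. The sketch below uses the two orderings from the output above; `rank_changes` is a hypothetical helper, not part of the notebook.

```python
def rank_changes(before, after):
    """Map each doc id to (old_position, new_position), both 1-based."""
    new_pos = {doc_id: i for i, doc_id in enumerate(after, 1)}
    return {doc_id: (i, new_pos[doc_id]) for i, doc_id in enumerate(before, 1)}

# Orderings copied from the ranking comparison output.
bi_order = [2, 3, 1, 6, 0]
cross_order = [2, 1, 6, 0, 3]

changes = rank_changes(bi_order, cross_order)
for doc_id, (old, new) in changes.items():
    print(f"Doc {doc_id}: {old} -> {new} ({old - new:+d})")
```

Read this way, the reranker made one substantive decision: it pushed Doc 3 from second place to last, and Docs 1, 6, and 0 each moved up one slot as a result.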

### Task 3: Hypothetical Document Embeddings (HyDE)

Weird idea: instead of searching with your query, generate a FAKE answer first, then search for docs similar to that fake answer.

**Why it works:**
- Questions are short and abstract
- Answers are detailed and concrete  
- Real docs contain answer-like text
- A fake answer sits closer to real answer text than the question does

**Process:**
1. User: "What is RAG?"
2. LLM: generates fake answer
3. Embed fake answer, search
4. Find real docs about RAG
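
The process above can be sketched without any API calls. Everything here is a toy stand-in: `embed` is a bag of words rather than a neural embedding, `fake_llm` returns a canned string instead of calling a model, and the two documents are made up. The sketch only illustrates why an answer-shaped text can land closer to the target document than the short question does.

```python
def embed(text):
    """Toy 'embedding': the set of lowercase words (assumption, not a real model)."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b)

def fake_llm(question):
    """Hypothetical stub for the LLM call; returns a canned hypothetical answer."""
    return ("rag is an approach where retrieval supplies documents "
            "and generation writes answers grounded in them")

def search(query_text, docs):
    """Return (score, doc_index) pairs, best first."""
    q = embed(query_text)
    scored = [(similarity(q, embed(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)

toy_docs = [
    "rag combines retrieval with generation so answers are grounded in retrieved documents",
    "vector databases store embeddings and support fast similarity search",
]
question = "what is rag"

direct = search(question, toy_docs)            # step 1 only: search with the question
hyde = search(fake_llm(question), toy_docs)    # steps 2-4: search with the fake answer

print(f"direct: doc {direct[0][1]} score {direct[0][0]:.3f}")
print(f"HyDE:   doc {hyde[0][1]} score {hyde[0][0]:.3f}")
```

Both searches find the same document, but the hypothetical answer matches it with a much wider margin, which is the effect HyDE relies on.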

In [110]:
# Create doc embeddings
print("Creating doc embeddings...")
doc_embs = co.embed(texts=documents, input_type="search_document").embeddings
doc_embs = np.array(doc_embs)

print(f"Created {len(doc_embs)} embeddings")

Creating doc embeddings...
Created 8 embeddings


In [111]:
# Build index
dim = doc_embs.shape[1]
idx_hyde = faiss.IndexFlatL2(dim)
idx_hyde.add(np.float32(doc_embs))

print(f"Index ready: {idx_hyde.ntotal} docs")

Index ready: 8 docs


In [112]:
# Baseline - search with query directly
query = "What makes dense retrieval better than keyword search?"

q_emb = co.embed(texts=[query], input_type="search_query").embeddings[0]
dists, idxs = idx_hyde.search(np.float32([q_emb]), 3)

print(f"Query: {query}\n")
print("Baseline (direct query):")
for i, (idx, dist) in enumerate(zip(idxs[0], dists[0]), 1):
    print(f"{i}. dist={dist:.4f}")
    print(f"   {documents[idx][:80]}...\n")

Query: What makes dense retrieval better than keyword search?

Baseline (direct query):
1. dist=4093.2412
   Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries ...

2. dist=7746.4697
   Query Understanding: Effective search requires understanding user intent. Query ...

3. dist=7979.4609
   Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separ...



In [113]:
# Generate a hypothetical answer with the LLM (the key HyDE step)
prompt = f"""Answer this in 2-3 sentences. Be technical.\n\nQuestion: {query}\n\nAnswer:"""

resp = co.chat(message=prompt, max_tokens=150, temperature=0.3)
hypo_doc = resp.text.strip()



In [114]:
print(hypo_doc)



Dense retrieval leverages deep learning models to embed queries and documents into a continuous vector space, capturing semantic relationships that keyword search cannot. This allows dense retrieval to match queries with relevant documents based on contextual meaning rather than exact term overlap, improving recall and precision, especially for complex or ambiguous queries. Additionally, dense retrieval can handle synonyms, polysemy, and contextual nuances more effectively than traditional keyword-based methods.


In [115]:
# Embed fake answer
hyde_emb = co.embed(
    texts=[hypo_doc],
    input_type="search_query"
).embeddings[0]

dists_h, idxs_h = idx_hyde.search(np.float32([hyde_emb]), 3)


In [117]:
# Show results
print("\nWith fake answer:\n")
for i, (idx, dist) in enumerate(zip(idxs_h[0], dists_h[0]), 1):
    print(f"{i}. dist={dist:.4f}")
    print(f"   {documents[idx][:80]}...\n")


With fake answer:

1. dist=2555.9355
   Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries ...

2. dist=7315.6992
   Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separ...

3. dist=7566.8037
   Query Understanding: Effective search requires understanding user intent. Query ...



In [118]:
# Compare results
print("\nComparison:\n")
print("Direct query:")
for i, idx in enumerate(idxs[0], 1):
    print(f"  {i}. Doc {idx}")

print("\nHyDE:")
for i, idx in enumerate(idxs_h[0], 1):
    print(f"  {i}. Doc {idx}")


Comparison:

Direct query:
  1. Doc 1
  2. Doc 5
  3. Doc 2

HyDE:
  1. Doc 1
  2. Doc 2
  3. Doc 5


**Questions:**

1. Try a specific technical question - does HyDE help more?
2. Generate multiple hypothetical answers and combine results?
3. Change temperature to 0.7 - does more creative fake answer help?
4. When might HyDE give WORSE results?
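
For question 2, one common way to combine several result lists is reciprocal rank fusion (RRF), where each document earns `1 / (k + rank)` from every list it appears in. RRF is not used in this notebook; it is swapped in here as an assumption about how the merge could be done. The first two rankings come from the comparison output above, the third is made up to stand for a second hypothetical answer, and `k=60` is the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

direct_ranking = [1, 5, 2]       # from the direct-query search
hyde_ranking = [1, 2, 5]         # from the first hypothetical answer
second_hyde_ranking = [2, 1, 5]  # made-up ranking for a second hypothetical answer

fused = reciprocal_rank_fusion([direct_ranking, hyde_ranking, second_hyde_ranking])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem of comparing raw distances across different query embeddings.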

### Task 4: Self-Correcting RAG

RAG system that checks its own answers.

**Process:**
1. Retrieve docs
2. Generate answer
3. Verify: is answer supported by docs?
4. If not: retrieve more OR say "I don't know"

Self-verification catches many unsupported claims before they reach the user, though the verifier is itself an LLM and can be wrong.
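
The control flow of the process above can be sketched with plain functions. This is a minimal sketch under stated assumptions: `retrieve`, `generate`, and `verify` are hypothetical callables standing in for the vector search and the two LLM calls, and the toy versions below exist only to make the loop runnable.

```python
def self_correcting_answer(query, retrieve, generate, verify,
                           k_start=3, k_step=3, max_rounds=2):
    """Retrieve, answer, verify; widen retrieval on failure, give up after max_rounds."""
    k = k_start
    for _ in range(max_rounds):
        docs = retrieve(query, k)
        answer = generate(query, docs)
        if verify(query, docs, answer) == "SUPPORTED":
            return answer
        k += k_step  # not supported: pull in more documents and try again
    return "I don't know"

# Toy stand-ins (assumptions for illustration only).
toy_corpus = ["alpha", "beta", "gamma", "hard negatives sharpen the model", "delta"]

def toy_retrieve(query, k):
    return toy_corpus[:k]

def toy_generate(query, docs):
    hits = [d for d in docs if query in d]
    return hits[0] if hits else "made-up claim"

def toy_verify(query, docs, answer):
    return "SUPPORTED" if answer in docs else "UNSUPPORTED"

answered = self_correcting_answer("hard negatives", toy_retrieve, toy_generate, toy_verify)
unanswerable = self_correcting_answer("quantum", toy_retrieve, toy_generate, toy_verify)
print(answered)
print(unanswerable)
```

Capping the number of rounds matters: without it, a question the corpus cannot answer would keep widening retrieval indefinitely.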

In [None]:
# Step 1: Set up retrieval (reuse the index from Task 3)

print("Setting up retrieval system...")

# doc_embs and idx_hyde already exist from the previous task
print(f"Ready to retrieve from {len(documents)} documents")

Setting up retrieval system...
Ready to retrieve from 8 documents


In [120]:
# Step 2: Retrieve documents and generate answer

# Retrieve docs
query = "How does hard negative mining improve embedding quality?"

q_emb = co.embed(texts=[query], input_type="search_query").embeddings[0]
dists, idxs = idx_hyde.search(np.float32([q_emb]), 3)

ret_docs = [documents[idx] for idx in idxs[0]]

print(f"Query: {query}\n")
print("Retrieved:")
for i, doc in enumerate(ret_docs, 1):
    print(f"{i}. {doc[:80]}...\n")

Query: How does hard negative mining improve embedding quality?

Retrieved:
1. Embedding Model Training: High-quality embeddings require careful training. Cont...

2. Dense Retrieval Methods: Dense retrieval uses neural networks to encode queries ...

3. Cross-Encoder Reranking: Unlike bi-encoders that encode query and document separ...



In [121]:
# Generate initial answer
ctx = "\n\n".join([f"Doc {i+1}: {doc}" for i, doc in enumerate(ret_docs)])

prompt = f"""Answer based on these docs.\n\nDocs:\n{ctx}\n\nQuestion: {query}\n\nAnswer:"""

resp = co.chat(message=prompt, max_tokens=150, temperature=0.3)
initial_ans = resp.text.strip()

print("Generated initial answer")

Generated initial answer


In [None]:
# Verify answer
verif_prompt = f"""Is this answer supported by the docs?\n\nDocs:\n{ctx}\n\nQuestion: {query}\nAnswer: {initial_ans}\n\nRespond: SUPPORTED, UNSUPPORTED, or PARTIAL\n\nVerification:"""

verif = co.chat(message=verif_prompt, max_tokens=100, temperature=0.0)
verdict = verif.text.strip()

print(f"Verdict: {verdict}")

# Decide what to do
if "UNSUPPORTED" in verdict.upper() or "PARTIAL" in verdict.upper():
    print("\n  Not fully supported. Retrieving more...\n")
    
    dists_m, idxs_m = idx_hyde.search(np.float32([q_emb]), 6)
    add_docs = [documents[idx] for idx in idxs_m[0][3:]]
    
    print(f"Retrieved {len(add_docs)} more docs")
else:
    print("\nAnswer supported!")
    print({"Final": initial_ans})

Verdict: SUPPORTED

The answer is fully supported by the information provided in Doc 1, which explicitly states that hard negative mining selects challenging examples that are similar but not relevant, thereby improving the model's discrimination ability. The answer accurately reflects this mechanism and its impact on embedding quality, including the emphasis on training data quality in specialized domains.

Answer supported!
{'Final': 'Hard negative mining improves embedding quality by selecting challenging examples that are similar to the relevant pairs but not actually relevant. This process enhances the model\'s ability to discriminate between relevant and irrelevant pairs in the vector space. By focusing on these "hard negatives," the model learns to better distinguish subtle differences, leading to more accurate and robust embeddings. This technique is particularly effective in specialized domains where training data quality is more critical than quantity.'}


In [123]:
# Generate revised answer
if "UNSUPPORTED" in verdict.upper() or "PARTIAL" in verdict.upper():
    all_docs = ret_docs + add_docs
    exp_ctx = "\n\n".join([f"Doc {i+1}: {doc}" for i, doc in enumerate(all_docs)])
    
    rev_prompt = f"""Answer with ALL docs.\n\nDocs:\n{exp_ctx}\n\nQuestion: {query}\n\nAnswer:"""
    
    rev_resp = co.chat(message=rev_prompt, max_tokens=200, temperature=0.3)
    final_ans = rev_resp.text.strip()
    
    print("\nRevised answer:")
    print(final_ans)
else:
    # Verified on the first pass, so the initial answer stands
    final_ans = initial_ans

**Questions:**

1. Try a question that can't be answered from these docs - what happens?
2. What if verification is too strict?
3. Add a 3rd iteration?
4. How to handle cases where more docs don't help?
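
One fragile spot in the verification cell is the substring check: the model's reply contains an explanation after the label, so a test like `"PARTIAL" in verdict.upper()` can fire on words in the explanation rather than on the verdict itself. A possible hardening, sketched below, is to read only the label that opens the reply. `parse_verdict` is a hypothetical helper, and it assumes the model puts the label first, as it did in the output above.

```python
import re

def parse_verdict(text):
    """Return the label that opens the reply, or 'UNKNOWN' if none is found."""
    # UNSUPPORTED is listed first because SUPPORTED is a substring of it.
    match = re.match(r"\W*(UNSUPPORTED|SUPPORTED|PARTIAL)\b", text.strip().upper())
    return match.group(1) if match else "UNKNOWN"

print(parse_verdict("SUPPORTED\n\nThe answer is fully supported by Doc 1."))
print(parse_verdict("Unsupported. The docs never mention this."))
print(parse_verdict("SUPPORTED. Nothing in the answer is partial or unsupported."))
print(parse_verdict("It is hard to say."))
```

The third call is the case the substring check gets wrong: the reply is a clean SUPPORTED, but its explanation mentions both other labels. Treating `UNKNOWN` the same as `UNSUPPORTED` is a conservative default.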

