# Semantic Search & RAG - Medium Tasks

Trying out some more advanced search techniques. Building on the basic stuff from earlier.

**What we're doing:**
- Hybrid search (BM25 + semantic)
- Different chunking strategies 
- Query expansion
- Document selection for LLM context

## Setup

Run all cells in this section to set up the environment and load necessary data.

### [Optional] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [167]:
# %%capture
!pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1 pandas python-dotenv
!pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

## IMPORTANT: Make sure to restart the session after installing the packages above.

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu124


### Import Libraries and Setup API

In [168]:
import cohere
import os
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import faiss
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm import tqdm

# Load environment variables
load_dotenv()

# Get API key from environment variable
api_key = os.environ.get('COHERE_API_KEY')

# Create Cohere client
co = cohere.Client(api_key)

## Load Sample Dataset

For these tasks, we'll use a dataset with multiple documents.

In [169]:
# Sample technical documents for our search experiments
documents = [
    """API Gateway Architecture: A well-designed API gateway serves as a single entry point 
    for all client requests. It handles authentication, rate limiting, request routing, and 
    response transformation. Popular patterns include BFF (Backend for Frontend) where 
    different gateways serve web vs mobile clients. Circuit breakers prevent cascade 
    failures when downstream services are unavailable.""",
    
    """Microservices Security: Securing distributed systems requires defense in depth. 
    Service-to-service communication should use mutual TLS authentication. JWT tokens 
    enable stateless authorization with configurable expiration. Zero-trust architecture 
    assumes no implicit trust between services. Regular security audits identify 
    vulnerabilities in dependencies and configurations.""",
    
    """Database Scaling Strategies: Horizontal scaling distributes data across multiple 
    machines. Sharding partitions data by key ranges or hash functions. Read replicas 
    handle query workloads while write operations go to primary instances. Eventual 
    consistency allows systems to remain available during partition events. CQRS 
    separates read and write models for optimal performance.""",
    
    """Container Orchestration: Kubernetes automates deployment, scaling, and management 
    of containerized applications. Pods group related containers that share resources. 
    Services provide stable networking endpoints for pod communication. ConfigMaps and 
    Secrets manage configuration data separately from application code. Operators 
    extend Kubernetes to manage complex stateful applications.""",
    
    """Cloud Computing Performance: Auto-scaling adjusts resource allocation based on 
    demand metrics like CPU utilization or request count. Load balancers distribute 
    traffic across healthy instances. Content delivery networks cache static assets 
    at edge locations worldwide. Database connection pooling reduces overhead from 
    frequent connection establishment. Monitoring provides visibility into system health.""",
    
    """Software Development Lifecycle: Agile methodologies emphasize iterative development 
    with short feedback cycles. Continuous integration automatically builds and tests 
    code changes. Feature flags enable gradual rollout of new functionality. Code 
    reviews improve quality through peer collaboration. Retrospectives help teams 
    identify process improvements and address technical debt.""",
    
    """Machine Learning Operations: MLOps bridges development and production for ML systems. 
    Model versioning tracks changes in algorithms and training data. Automated testing 
    validates model performance against baseline metrics. Feature stores centralize 
    data preparation and serve consistent features to training and inference pipelines. 
    A/B testing compares model performance in production environments.""",
    
    """Web Application Security: Cross-site scripting (XSS) attacks inject malicious 
    scripts into web pages. Content Security Policy headers restrict resource loading 
    to prevent XSS. SQL injection exploits inadequate input validation in database 
    queries. Parameterized queries safely handle user input. HTTPS encrypts data 
    in transit while secure session management prevents unauthorized access.""",
    
    """Data Pipeline Architecture: Extract, Transform, Load (ETL) processes move data 
    between systems. Stream processing handles real-time data using tools like Apache 
    Kafka and Apache Flink. Data lakes store raw data in various formats for future 
    analysis. Data lineage tracking helps understand data flow and impact of changes. 
    Schema evolution manages changes to data structure over time.""",
    
    """Mobile Application Development: Native apps provide optimal performance and platform 
    integration. Cross-platform frameworks like React Native reduce development effort 
    across iOS and Android. Progressive Web Apps offer app-like experiences using 
    web technologies. Offline-first design ensures functionality without network 
    connectivity. Push notifications engage users with timely updates."""
]

print(f"Loaded {len(documents)} technical documents")
print(f"Average length: {np.mean([len(d) for d in documents]):.0f} characters")

Loaded 10 technical documents
Average length: 406 characters


### Helper Functions

These are reusable functions we'll use throughout the tasks.

`bm25_tokenier` prepares the text for BM25 keyword search

Prepare text for BM25 keyword search.
    
    BM25 needs clean tokens (words) to count matches. This function:
    1. Makes everything lowercase (so "Database" and "database" match)
    2. Removes punctuation (so "cloud," becomes "cloud")
    3. Removes stop words (common words like "the", "is", "a" that don't help search)
    4. Only keeps meaningful words
    
    Example: "The database is fast" becomes ["database", "fast"]

In [170]:
def bm25_tokenizer(text):
    tokenized_doc = []
    
    # Split text into words and process each one
    for token in text.lower().split():
        # Remove punctuation from beginning and end
        token = token.strip(string.punctuation)
        
        # Only keep non-empty words that aren't stop words
        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    
    return tokenized_doc

In [171]:
def print_results(query, results, show_scores=True):
    """Display search results in a readable format."""
    print(f"\nQuery: '{query}'")
    print("-" * 80)
    
    # Loop through results and print each one
    for i, (doc, score) in enumerate(results, 1):
        if show_scores:
            print(f"{i}. [Score: {score:.4f}] {doc}")
        else:
            print(f"{i}. {doc}")
    
    print("-" * 80)

## Medium Tasks

Complete the following tasks to build production-ready search systems.

### Task 1: Hybrid Search 

Let's try combining keyword search (BM25) with semantic search. Sometimes you need exact keyword matches, sometimes you want meaning-based results.

The idea: combine both and see if we get better results than either alone.

In [172]:
# Start by getting embeddings for all our docs
print("Creating embeddings...")
doc_embeddings = co.embed(
    texts=documents,
    input_type="search_document"
).embeddings
doc_embeddings = np.array(doc_embeddings)

print(f"Got embeddings: {doc_embeddings.shape}")
# Each doc is now a 4096-dim vector

Creating embeddings...
Got embeddings: (10, 4096)
Got embeddings: (10, 4096)


In [173]:
print(f"Shape looks right: {doc_embeddings.shape}")
print(f"Values look normalized: min={doc_embeddings.min():.3f}, max={doc_embeddings.max():.3f}")
# print(doc_embeddings[0][:5])  # peek at first few dims

Shape looks right: (10, 4096)
Values look normalized: min=-8.125, max=8.141


In [174]:
# Build the FAISS index for semantic search
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.float32(doc_embeddings))

print(f"FAISS index ready - {index.ntotal} vectors")

FAISS index ready - 10 vectors


In [175]:
# Now set up BM25 for keyword matching
tokenized_docs = [bm25_tokenizer(doc) for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

print("BM25 ready")
# This will find exact word matches

BM25 ready


In [176]:
# Quick test - how many docs have "security" keyword?
# print(sum(['security' in doc.lower() for doc in documents]))
# print(sum(['scale' in doc.lower() for doc in documents]))

# Just checking distribution of keywords

In [177]:
# Try semantic search first
query = "API gateway rate limiting authentication"

q_emb = co.embed(
    texts=[query],
    input_type="search_query"
).embeddings[0]

# Search
dists, idxs = index.search(np.float32([q_emb]), 3)

print(f"Query: {query}")
print("\nSemantic search results:")
for i, (idx, dist) in enumerate(zip(idxs[0], dists[0]), 1):
    print(f"{i}. Distance: {dist:.4f}")
    print(f"   {documents[idx][:100]}...\n")

Query: API gateway rate limiting authentication

Semantic search results:
1. Distance: 10498.3184
   API Gateway Architecture: A well-designed API gateway serves as a single entry point 
    for all cl...

2. Distance: 12875.8115
   Microservices Security: Securing distributed systems requires defense in depth. 
    Service-to-serv...

3. Distance: 13333.7549
   Web Application Security: Cross-site scripting (XSS) attacks inject malicious 
    scripts into web ...



In [178]:
# Now try BM25
tokenized_q = bm25_tokenizer(query)
scores = bm25.get_scores(tokenized_q)

top_idxs = np.argsort(scores)[-3:][::-1]

print(f"\nBM25 results:")
for i, idx in enumerate(top_idxs, 1):
    print(f"{i}. Score: {scores[idx]:.4f}")
    print(f"   {documents[idx][:100]}...\n")


BM25 results:
1. Score: 10.1701
   API Gateway Architecture: A well-designed API gateway serves as a single entry point 
    for all cl...

2. Score: 1.3230
   Microservices Security: Securing distributed systems requires defense in depth. 
    Service-to-serv...

3. Score: 0.0000
   Mobile Application Development: Native apps provide optimal performance and platform 
    integratio...



In [179]:
# Normalize both scores to 0-1 range so we can combine them
bm25_norm = (scores - scores.min()) / (scores.max() - scores.min())

# Get all semantic scores
all_dists, _ = index.search(np.float32([q_emb]), len(documents))
sem_scores = 1 / (1 + all_dists[0])  # convert distance to score
sem_norm = (sem_scores - sem_scores.min()) / (sem_scores.max() - sem_scores.min())

print(f"Normalized BM25: min={bm25_norm.min():.3f}, max={bm25_norm.max():.3f}")
print(f"Normalized semantic: min={sem_norm.min():.3f}, max={sem_norm.max():.3f}")

Normalized BM25: min=0.000, max=1.000
Normalized semantic: min=0.000, max=1.000


In [180]:
# Combine - 50/50 mix
alpha = 0.5
combined = {}
for i, doc in enumerate(documents):
    combined[doc] = (1 - alpha) * bm25_norm[i] + alpha * sem_norm[i]

sorted_results = sorted(combined.items(), key=lambda x: x[1], reverse=True)

print(f"\nHybrid (alpha={alpha}):")
for i, (doc, score) in enumerate(sorted_results[:3], 1):
    print(f"{i}. Score: {score:.4f}")
    print(f"   {doc[:100]}...\n")


Hybrid (alpha=0.5):
1. Score: 1.0000
   API Gateway Architecture: A well-designed API gateway serves as a single entry point 
    for all cl...

2. Score: 0.3210
   Microservices Security: Securing distributed systems requires defense in depth. 
    Service-to-serv...

3. Score: 0.2190
   Database Scaling Strategies: Horizontal scaling distributes data across multiple 
    machines. Shar...



In [181]:
# Step 7: Try different alpha values
# Let's see how changing alpha affects results

# Let's try different alpha values with a new query
query2 = "how to make software more secure"

print(f"Query: {query2}")


Query: how to make software more secure


In [182]:
# Get embeddings and scores for this query
q2_emb = co.embed(texts=[query2], input_type="search_query").embeddings[0]
q2_tok = bm25_tokenizer(query2)

# BM25 scores
bm25_sc = bm25.get_scores(q2_tok)
bm25_n = (bm25_sc - bm25_sc.min()) / (bm25_sc.max() - bm25_sc.min())

In [183]:
# Dense scores
d2, _ = index.search(np.float32([q2_emb]), len(documents))
sem_sc = 1 / (1 + d2[0])
sem_n = (sem_sc - sem_sc.min()) / (sem_sc.max() - sem_sc.min())



In [184]:
# Try pure keyword (alpha=0)
print("\nalpha=0 (keywords only):")
for i, idx in enumerate(np.argsort(bm25_n)[-3:][::-1], 1):
    print(f"{i}. {documents[idx][:80]}...")


alpha=0 (keywords only):
1. Software Development Lifecycle: Agile methodologies emphasize iterative developm...
2. Web Application Security: Cross-site scripting (XSS) attacks inject malicious 
 ...
3. Mobile Application Development: Native apps provide optimal performance and plat...


In [185]:
# Try pure semantic (alpha=1)
print("\nalpha=1 (semantic only):")
for i, idx in enumerate(np.argsort(sem_n)[-3:][::-1], 1):
    print(f"{i}. {documents[idx][:80]}...")


alpha=1 (semantic only):
1. API Gateway Architecture: A well-designed API gateway serves as a single entry p...
2. Microservices Security: Securing distributed systems requires defense in depth. ...
3. Database Scaling Strategies: Horizontal scaling distributes data across multiple...


In [186]:
# Balanced approach
print("\nalpha=0.5 (balanced):")
for i, idx in enumerate(np.argsort(0.5*bm25_n + 0.5*sem_n)[-3:][::-1], 1):
    print(f"{i}. {documents[idx][:80]}...")


alpha=0.5 (balanced):
1. Software Development Lifecycle: Agile methodologies emphasize iterative developm...
2. Web Application Security: Cross-site scripting (XSS) attacks inject malicious 
 ...
3. API Gateway Architecture: A well-designed API gateway serves as a single entry p...


In [187]:
# Try with different query type?
# q3 = "make systems faster"
# q3_emb = co.embed(texts=[q3], input_type="search_query").embeddings[0]
# q3_tok = bm25_tokenizer(q3)

# Not sure if worth running

**Questions:**

1. For the queries, does keyword matching or semantic search work better?
2. What happens if you change alpha to 0.3 or 0.7?
3. For really technical queries with specific terms, do you need higher keyword weight?
4. For conceptual questions, is semantic search better?

### Task 2: Chunking Experiments

When you have long docs, you need to split them up. But how?

**Questions:**
- Sentence-based vs word-based chunks?
- Does overlap between chunks help?
- What size works best?

In [188]:
# Let's work with a longer document to test chunking
long_document = """
Artificial intelligence is transforming how businesses operate. Machine learning algorithms can now 
analyze vast amounts of data to identify patterns that humans might miss. Natural language processing 
enables computers to understand and generate human language with remarkable accuracy.

Deep learning models, particularly neural networks with multiple layers, have achieved breakthrough 
results in image recognition, speech synthesis, and language translation. These models learn hierarchical 
representations of data, capturing both low-level features and high-level abstractions.

The deployment of AI systems in production environments requires careful consideration of several factors. 
Model inference latency must be minimized to ensure responsive user experiences. Scalability is crucial 
as request volumes can vary dramatically. Monitoring and observability help detect model drift and 
performance degradation over time."""

print(f"Doc length: {len(long_document)} chars, {len(long_document.split())} words")
# Pretty long - definitely need to chunk this

Doc length: 936 chars, 119 words


In [189]:
# Continue the document...
long_document += """
Ethical considerations are paramount when developing AI applications. Bias in training data can lead to 
unfair outcomes for certain groups. Privacy concerns arise when models are trained on sensitive personal 
information. Transparency and explainability help build trust with users and stakeholders.

The future of AI likely involves more efficient models that require less computational resources. 
Few-shot and zero-shot learning techniques enable models to adapt to new tasks with minimal examples. 
Multimodal models that process text, images, and audio together will unlock new applications and capabilities.
"""

print(f"Doc length: {len(long_document)} chars, {len(long_document.split())} words")
# Pretty long - definitely need to chunk this

Doc length: 1553 chars, 204 words


In [190]:
# Approach 1: split by sentences
sents = [s.strip() for s in long_document.replace('\n', ' ').split('.') if s.strip()]

# Group into 2-sent chunks
chunks_sent = []
for i in range(0, len(sents), 2):
    chunk = '. '.join(sents[i:i + 2]) + '.'
    chunks_sent.append(chunk)

print(f"Got {len(chunks_sent)} sentence chunks")
print("\nFirst one:")
print(chunks_sent[0])
# print(chunks_sent[1])  # uncomment to see more

Got 8 sentence chunks

First one:
Artificial intelligence is transforming how businesses operate. Machine learning algorithms can now  analyze vast amounts of data to identify patterns that humans might miss.


In [191]:
# Approach 2: word-based with overlap
words = long_document.split()
chunk_size = 40
overlap = 10

chunks_word = []
i = 0
while i < len(words):
    chunk = ' '.join(words[i:i + chunk_size])
    chunks_word.append(chunk)
    i += (chunk_size - overlap)

print(f"Got {len(chunks_word)} word chunks ({chunk_size} words, {overlap} overlap)")
print("\nFirst:", chunks_word[0][:100] + "...")
print("\nSecond:", chunks_word[1][:100] + "...")


Got 7 word chunks (40 words, 10 overlap)

First: Artificial intelligence is transforming how businesses operate. Machine learning algorithms can now ...

Second: understand and generate human language with remarkable accuracy. Deep learning models, particularly ...


In [192]:
# What if we try different overlap values?
# for ovlp in [0, 5, 10, 20]:
#     chunks_test = []
#     i = 0
#     while i < len(words):
#         chunks_test.append(' '.join(words[i:i+40]))
#         i += (40 - ovlp)
#     print(f"overlap={ovlp}: {len(chunks_test)} chunks")

# TODO: test if 20 overlap is actually better

In [193]:
# Embed both chunk types
print("Embedding sentence chunks...")
sent_embs = np.array(co.embed(texts=chunks_sent, input_type="search_document").embeddings)

print("Embedding word chunks...")
word_embs = np.array(co.embed(texts=chunks_word, input_type="search_document").embeddings)

print(f"\nSentence: {sent_embs.shape}")
print(f"Word: {word_embs.shape}")

Embedding sentence chunks...
Embedding word chunks...
Embedding word chunks...

Sentence: (8, 4096)
Word: (7, 4096)

Sentence: (8, 4096)
Word: (7, 4096)


In [194]:
# Step 4: Build FAISS indices
idx_sent = faiss.IndexFlatL2(sent_embs.shape[1])
idx_sent.add(np.float32(sent_embs))

idx_word = faiss.IndexFlatL2(word_embs.shape[1])
idx_word.add(np.float32(word_embs))

print("Indices built! Now let's search...")

Indices built! Now let's search...


In [195]:
# Try a query about ethics
test_q = "ethical concerns with AI"

# Get query embedding
q_emb = co.embed(texts=[test_q], input_type="search_query").embeddings[0]

print(f"\nQuery: {test_q}")
print("=" * 80)


Query: ethical concerns with AI


In [196]:
# Search sentence-based chunks
d_sent, i_sent = idx_sent.search(np.float32([q_emb]), 2)
print("\nSentence chunks:")
for i, (idx, d) in enumerate(zip(i_sent[0], d_sent[0]), 1):
    print(f"{i}. dist={d:.4f}")
    print(f"   {chunks_sent[idx][:120]}...\n")


Sentence chunks:
1. dist=7059.3555
   Monitoring and observability help detect model drift and  performance degradation over time. Ethical considerations are ...

2. dist=7389.8052
   Transparency and explainability help build trust with users and stakeholders. The future of AI likely involves more effi...



In [197]:
# Search word-based chunks  
d_word, i_word = idx_word.search(np.float32([q_emb]), 2)
print("Word chunks:")
for i, (idx, d) in enumerate(zip(i_word[0], d_word[0]), 1):
    print(f"{i}. dist={d:.4f}")
    print(f"   {chunks_word[idx][:120]}...\n")

Word chunks:
1. dist=6042.2891
   considerations are paramount when developing AI applications. Bias in training data can lead to unfair outcomes for cert...

2. dist=8005.3174
   Transparency and explainability help build trust with users and stakeholders. The future of AI likely involves more effi...



**Questions:**

- Which chunking strategy gave better results? 
- Try changing words_per_chunk to 20 or 60 - does it matter?
- What about overlap of 0 vs 20?
- For specific questions, are smaller chunks better?
- For broad questions, are larger chunks better?

### Task 3: Query Expansion

Sometimes you don't find what you need because you asked the "wrong" way. What if we let an LLM rewrite the query in different ways and search with all versions?

**Goal:** Use an LLM to rewrite queries and combine the results.

**What you'll learn:**
- How to expand queries automatically
- How to merge results from multiple searches
- When this helps (and when it doesn't)

In [198]:
# Let's get LLM to generate query variations
original_q = "improving software security"

prompt = f"""Generate 2 alternative ways to phrase this search query. 
Each variation should mean the same thing but use different words.

Original query: {original_q}

Return only the alternative queries, one per line."""

resp = co.chat(message=prompt, max_tokens=100, temperature=0.7)

In [199]:
# Parse the generated variations
variations = [v.strip() for v in resp.text.strip().split('\n') if v.strip()]

print(f"Original: {original_q}")
print("\nGenerated:")
for i, v in enumerate(variations, 1):
    print(f"{i}. {v}")

Original: improving software security

Generated:
1. enhancing application protection
2. strengthening program safety measures


In [200]:
# First attempt used temperature=0.3 but variations were too similar
# Bumped to 0.7 - better diversity

In [201]:
# Search with all query variations
all_qs = [original_q] + variations

print("\nSearching with each query...")
all_res = []


Searching with each query...


In [202]:
# Run search for each query variant
for q in all_qs:
    print(f"\n{q}:")
    
    qe = co.embed(texts=[q], input_type="search_query").embeddings[0]
    dists, idxs = index.search(np.float32([qe]), 5)
    
    q_results = [(documents[idx], rank) for rank, (idx, _) in enumerate(zip(idxs[0], dists[0]), 1)]
    all_res.append(q_results)
    
    # Top 2
    print("  Top 2:")
    for i, (doc, _) in enumerate(q_results[:2], 1):
        print(f"  {i}. {doc[:60]}...")


improving software security:
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Microservices Security: Securing distributed systems require...

enhancing application protection:
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Microservices Security: Securing distributed systems require...

enhancing application protection:
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Microservices Security: Securing distributed systems require...

strengthening program safety measures:
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Microservices Security: Securing distributed systems require...

strengthening program safety measures:
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Microservices Security: Securing distributed systems require...
  Top 2:
  1. Web Application Security: Cross-site scripting (XSS) attacks...
  2. Mi

In [203]:
# Merge using RRF (Reciprocal Rank Fusion)
# Docs that show up in multiple results get boosted

doc_scores = {}
k = 60

for results in all_res:
    for doc, rank in results:
        if doc not in doc_scores:
            doc_scores[doc] = 0
        doc_scores[doc] += 1 / (k + rank)  # RRF formula

merged = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
print(f"Merged {len(merged)} documents")

Merged 7 documents


In [204]:
# Display merged results
print("\n" + "=" * 80)
print("Merged results:")
print("=" * 80)
for i, (doc, sc) in enumerate(merged[:5], 1):
    print(f"{i}. RRF={sc:.4f}")
    print(f"   {doc[:80]}...\n")


Merged results:
1. RRF=0.0492
   Web Application Security: Cross-site scripting (XSS) attacks inject malicious 
 ...

2. RRF=0.0484
   Microservices Security: Securing distributed systems requires defense in depth. ...

3. RRF=0.0469
   Cloud Computing Performance: Auto-scaling adjusts resource allocation based on 
...

4. RRF=0.0315
   Machine Learning Operations: MLOps bridges development and production for ML sys...

5. RRF=0.0313
   Software Development Lifecycle: Agile methodologies emphasize iterative developm...



In [205]:
# Debug: check if docs repeated
from collections import Counter
top_docs = [doc for doc, _ in merged[:10]]
print(f"\nTop 10 slots, {len(set(top_docs))} unique docs")
if len(set(top_docs)) < 10:
    print("Some docs appeared multiple times")


Top 10 slots, 7 unique docs
Some docs appeared multiple times


**Questions:**

- Try this with a completely different query - do we get different docs?
- Does query expansion actually help or just add noise?
- When would generating variations be useful vs harmful?
- For ambiguous queries does it help, but for specific technical terms does it hurt?

### Medium Task 4: Fitting Documents Into Limited Space

LLMs have limited context windows. You might retrieve 20 documents but only fit 5. How do you choose which ones?

**Goal:** Pick the best documents that fit within a token budget.

**What you'll learn:**
- How to estimate token counts
- How to rerank documents by relevance
- Different strategies for ordering documents in the prompt

In [206]:
# Load a local embedding model
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

embedding_model = HuggingFaceEmbeddings(model_name='BAAI/bge-small-en-v1.5')


In [207]:
# Some technical documentation
tech_docs = """
Cloud Computing: Our platform provides scalable computing on demand. Virtual machines 
scale automatically based on load.

Machine Learning: Pre-trained models handle common tasks. Custom models can be trained 
using distributed infrastructure.

Databases: We offer SQL and NoSQL databases. Automatic sharding distributes data across nodes.

Security: End-to-end encryption protects data. Multi-factor authentication prevents unauthorized access.
"""

print(f"Loaded {len(tech_docs)} chars of docs")

Loaded 448 chars of docs


In [208]:
# More docs...
tech_docs += """
Developer Tools: SDK supports Python, JavaScript, Java. Documentation includes code examples.

Monitoring: Real-time metrics track performance. Distributed tracing helps debug microservices.
"""

# Split into small chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=30)
chunks = splitter.split_text(tech_docs)

print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"\nChunk {i}: {chunk[:80]}...")

Created 6 chunks

Chunk 1: Cloud Computing: Our platform provides scalable computing on demand. Virtual mac...

Chunk 2: Machine Learning: Pre-trained models handle common tasks. Custom models can be t...

Chunk 3: Databases: We offer SQL and NoSQL databases. Automatic sharding distributes data...


In [209]:
# Step 2: Create vector store and retrieve candidates
# Create vector store
vectorstore = FAISS.from_texts(chunks, embedding_model)


In [210]:
# Retrieve candidates
query = "How can I scale my application?"

# TODO: maybe try k=20 and see if reranking helps more?
candidates = vectorstore.similarity_search(query, k=10)
candidate_texts = [doc.page_content for doc in candidates]

print(f"\nQuery: {query}")
print(f"Retrieved {len(candidate_texts)} candidates")


Query: How can I scale my application?
Retrieved 6 candidates


In [211]:
# Quick look at what we got
print("\nFirst 3:")
for i, txt in enumerate(candidate_texts[:3], 1):
    print(f"{i}. {txt[:80]}...")


First 3:
1. Cloud Computing: Our platform provides scalable computing on demand. Virtual mac...
2. Monitoring: Real-time metrics track performance. Distributed tracing helps debug...
3. Developer Tools: SDK supports Python, JavaScript, Java. Documentation includes c...


In [212]:
# Step 3: Rerank using Cohere to get the BEST candidates
# Reranking scores documents based on how well they answer the question

# Rerank with Cohere
# Should score by relevance not just distance

reranked = co.rerank(
    query=query,
    documents=candidate_texts,
    top_n=10,
    return_documents=True
)

print("\nAfter reranking:")
for i, res in enumerate(reranked.results[:3], 1):
    print(f"{i}. Relevance: {res.relevance_score:.4f}")
    print(f"   {res.document.text[:80]}...\n")


After reranking:
1. Relevance: 0.2476
   Cloud Computing: Our platform provides scalable computing on demand. Virtual mac...

2. Relevance: 0.1010
   Databases: We offer SQL and NoSQL databases. Automatic sharding distributes data...

3. Relevance: 0.0649
   Machine Learning: Pre-trained models handle common tasks. Custom models can be t...



In [213]:
# Pick docs that fit in token budget
# Quick estimate: ~4 chars per token

max_tok = 300
selected = []
total_tok = 0

for res in reranked.results:
    txt = res.document.text
    tok = len(txt) // 4
    
    if total_tok + tok <= max_tok:
        selected.append({'text': txt, 'rel': res.relevance_score, 'tok': tok})
        total_tok += tok
    else:
        break

In [214]:
# What did we select?
print(f"Fit {len(selected)} docs in {total_tok} tokens (budget: {max_tok})\n")

for i, d in enumerate(selected, 1):
    print(f"{i}. rel={d['rel']:.3f}, tok={d['tok']}")
    print(f"   {d['text'][:50]}...\n")

Fit 6 docs in 155 tokens (budget: 300)

1. rel=0.248, tok=30
   Cloud Computing: Our platform provides scalable co...

2. rel=0.101, tok=23
   Databases: We offer SQL and NoSQL databases. Autom...

3. rel=0.065, tok=30
   Machine Learning: Pre-trained models handle common...

4. rel=0.043, tok=23
   Developer Tools: SDK supports Python, JavaScript, ...

5. rel=0.041, tok=23
   Monitoring: Real-time metrics track performance. D...

6. rel=0.028, tok=26
   Security: End-to-end encryption protects data. Mul...



In [215]:
# Try different orderings
print("\nBest doc first:")
ctx1 = "\n".join([f"Doc {i+1}: {d['text']}" for i, d in enumerate(selected)])
print(ctx1[:150] + "...\n")

print("Best doc last:")
ctx2 = "\n".join([f"Doc {i+1}: {d['text']}" for i, d in enumerate(reversed(selected))])
print(ctx2[:150] + "...")

# Research says end position sometimes better (recency bias)
# But need to test with actual LLM to know for sure


Best doc first:
Doc 1: Cloud Computing: Our platform provides scalable computing on demand. Virtual machines 
scale automatically based on load.
Doc 2: Databases: We ...

Best doc last:
Doc 1: Security: End-to-end encryption protects data. Multi-factor authentication prevents unauthorized access.
Doc 2: Monitoring: Real-time metrics t...


In [218]:
# Quick check: what if we ignored tokens?
print("\nTop 5 by relevance (no token limit):")
for i, res in enumerate(reranked.results[:5], 1):
    print(f"{i}. {res.relevance_score:.3f} : {res.document.text[:100]}...")
    



Top 5 by relevance (no token limit):
1. 0.248 : Cloud Computing: Our platform provides scalable computing on demand. Virtual machines 
scale automat...
2. 0.101 : Databases: We offer SQL and NoSQL databases. Automatic sharding distributes data across nodes....
3. 0.065 : Machine Learning: Pre-trained models handle common tasks. Custom models can be trained 
using distri...
4. 0.043 : Developer Tools: SDK supports Python, JavaScript, Java. Documentation includes code examples....
5. 0.041 : Monitoring: Real-time metrics track performance. Distributed tracing helps debug microservices....


In [217]:
# Step 5: Try different document orderings
# Research shows: best document LAST often works better!

# Note: tried both orderings above
# Normally you'd test with actual LLM to see which works better
# Context window positioning can affect answer quality!

**Questions:**

- What happens with max_tokens=500 instead of 300?
- Try putting best doc first vs last. Which works better?
- What if a single doc is bigger than max_tokens?
- Could we summarize long docs instead of truncating?
- Could we use doc importance scores for selection?