# RAG with Qdrant: Building a Textbook Q&A System

This notebook demonstrates a complete **Retrieval-Augmented Generation (RAG)** pipeline using Qdrant.

**What you'll learn:**
1. Loading a textbook from Azure Blob Storage
2. Chunking documents into 700-token pieces
3. Creating embeddings and storing them in Qdrant
4. How user prompts become embedded queries
5. How vector similarity lookup works
6. How context is retrieved and formatted for an LLM
7. Building your own RAG system

**Prerequisites:** Basic understanding of embeddings and vector databases (see the Part 1-3 notebooks).

**Note:** This notebook demonstrates the retrieval side of the RAG pipeline. The LLM call itself is only shown conceptually; you can integrate with OpenAI, Azure OpenAI, or another LLM provider.

## Part 1: Setup and Installation

First, let's install all required packages.

In [None]:
# Install required packages
!pip install qdrant-client sentence-transformers azure-storage-blob tiktoken -q

print("✅ All packages installed successfully!")

## Part 2: Loading the Textbook from Azure

We'll load a textbook from Azure Blob Storage, authenticating with a shared access signature (SAS) token.

**In class, you'll receive:**
- Azure Storage Account URL
- Container name
- Blob name (textbook file)
- SAS token for access

In [None]:
from azure.storage.blob import BlobServiceClient
import os

# ⚠️ REPLACE THESE WITH VALUES PROVIDED IN CLASS
AZURE_STORAGE_URL = "https://your-storage-account.blob.core.windows.net"
CONTAINER_NAME = "textbooks"
BLOB_NAME = "sample-textbook.txt"
SAS_TOKEN = "?sv=2021-06-08&ss=b&srt=sco&sp=r&se=..."  # Provided in class

# For this demo, we'll use a sample text if Azure credentials aren't available
USE_AZURE = False  # Set to True when you have credentials

if USE_AZURE:
    # Connect to Azure Blob Storage
    blob_service_client = BlobServiceClient(account_url=AZURE_STORAGE_URL, credential=SAS_TOKEN)
    blob_client = blob_service_client.get_blob_client(container=CONTAINER_NAME, blob=BLOB_NAME)
    
    # Download the textbook
    print("📥 Downloading textbook from Azure...")
    textbook_content = blob_client.download_blob().readall().decode('utf-8')
    print(f"✅ Downloaded {len(textbook_content)} characters")
else:
    # Demo textbook content about machine learning
    print("📚 Using demo textbook content...")
    textbook_content = """
Chapter 1: Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. The field has revolutionized how we approach problem-solving in computer science.

Section 1.1: What is Machine Learning?

Machine learning algorithms build mathematical models based on sample data, known as training data, to make predictions or decisions. Unlike traditional programming where rules are explicitly coded, machine learning systems discover patterns in data.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Each type serves different purposes and uses different approaches to learn from data.

Section 1.2: Supervised Learning

Supervised learning is the most common type of machine learning. In supervised learning, the algorithm learns from labeled training data. Each training example consists of an input and the desired output. The algorithm learns to map inputs to outputs.

Common supervised learning tasks include classification and regression. Classification involves predicting discrete categories, such as whether an email is spam or not spam. Regression involves predicting continuous values, such as house prices or temperature.

Popular supervised learning algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks. Each algorithm has strengths and weaknesses depending on the problem.

Section 1.3: Unsupervised Learning

Unsupervised learning works with unlabeled data. The algorithm tries to find patterns and structure in the data without being told what to look for. This is useful when you don't know what patterns exist in your data.

Common unsupervised learning tasks include clustering and dimensionality reduction. Clustering groups similar data points together, such as customer segmentation. Dimensionality reduction simplifies data while preserving important information.

K-means clustering, hierarchical clustering, and DBSCAN are popular clustering algorithms. Principal Component Analysis (PCA) and t-SNE are common dimensionality reduction techniques.

Section 1.4: Reinforcement Learning

Reinforcement learning is about learning through interaction with an environment. An agent takes actions and receives rewards or penalties. The goal is to learn a policy that maximizes cumulative reward over time.

Reinforcement learning has achieved remarkable success in game playing, robotics, and autonomous systems. Famous examples include AlphaGo, which defeated world champions in Go, and self-driving car systems.

Key concepts in reinforcement learning include states, actions, rewards, policies, and value functions. Q-learning and policy gradient methods are fundamental algorithms in this field.

Chapter 2: Neural Networks and Deep Learning

Neural networks are computing systems inspired by biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Deep learning refers to neural networks with many layers.

Section 2.1: Neural Network Architecture

A basic neural network has three types of layers: input layer, hidden layers, and output layer. The input layer receives data, hidden layers process it, and the output layer produces predictions.

Each connection between neurons has a weight that determines the strength of the signal. During training, these weights are adjusted to minimize prediction errors. This process is called backpropagation.

Activation functions introduce non-linearity into the network. Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh. Without activation functions, neural networks would only learn linear relationships.

Section 2.2: Training Neural Networks

Training a neural network involves feeding it data and adjusting weights to minimize a loss function. The loss function measures how far predictions are from actual values. Common loss functions include mean squared error for regression and cross-entropy for classification.

Gradient descent is the optimization algorithm used to minimize the loss function. It calculates gradients (derivatives) of the loss with respect to each weight and updates weights in the direction that reduces loss.

Learning rate is a crucial hyperparameter that controls how much weights change during each update. Too high a learning rate causes instability; too low makes training very slow. Adaptive learning rate methods like Adam help address this.

Section 2.3: Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are specialized for processing grid-like data such as images. They use convolutional layers that apply filters to detect features like edges, textures, and patterns.

CNNs have revolutionized computer vision tasks including image classification, object detection, and facial recognition. They achieve human-level or better performance on many visual tasks.

Key components of CNNs include convolutional layers, pooling layers, and fully connected layers. Convolutional layers detect features, pooling layers reduce dimensionality, and fully connected layers make final predictions.

Section 2.4: Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are designed for sequential data like text, speech, and time series. Unlike feedforward networks, RNNs have connections that loop back, allowing them to maintain memory of previous inputs.

Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are advanced RNN architectures that solve the vanishing gradient problem. They can learn long-term dependencies in sequences.

RNNs power many natural language processing applications including machine translation, speech recognition, and text generation. However, transformer architectures have recently surpassed RNNs for many NLP tasks.

Chapter 3: Model Evaluation and Validation

Evaluating machine learning models correctly is crucial for building reliable systems. Poor evaluation can lead to models that perform well in testing but fail in production.

Section 3.1: Train-Test Split

The most basic evaluation technique is splitting data into training and test sets. The model trains on the training set and is evaluated on the test set. This simulates how the model will perform on new, unseen data.

A common split is 80% training and 20% testing, though this varies by dataset size. The key principle is that test data must never be used during training, or evaluation will be overly optimistic.

Random splitting works for most cases, but stratified splitting ensures each split has the same proportion of each class. This is important for imbalanced datasets where some classes are rare.

Section 3.2: Cross-Validation

Cross-validation provides more robust evaluation than a single train-test split. K-fold cross-validation divides data into K subsets (folds). The model trains K times, each time using a different fold as the test set.

This gives K performance estimates that can be averaged for a more reliable assessment. Five-fold or ten-fold cross-validation are common choices. Cross-validation is especially valuable with small datasets.

Leave-one-out cross-validation is an extreme case where K equals the number of samples. Each sample serves as a test set once. This is computationally expensive but maximizes training data usage.

Section 3.3: Evaluation Metrics

Different metrics suit different problems. For classification, accuracy measures the proportion of correct predictions. However, accuracy can be misleading with imbalanced classes.

Precision measures what proportion of positive predictions are correct. Recall measures what proportion of actual positives are found. The F1 score combines precision and recall into a single metric.

For regression, mean squared error (MSE) and mean absolute error (MAE) are common. MSE penalizes large errors more heavily. R-squared measures how much variance the model explains.

Section 3.4: Overfitting and Underfitting

Overfitting occurs when a model learns training data too well, including noise and outliers. It performs excellently on training data but poorly on new data. Complex models with many parameters are prone to overfitting.

Underfitting occurs when a model is too simple to capture patterns in the data. It performs poorly on both training and test data. Finding the right model complexity is a key challenge.

Regularization techniques help prevent overfitting by penalizing model complexity. L1 and L2 regularization add penalty terms to the loss function. Dropout randomly deactivates neurons during training in neural networks.
"""
    print(f"✅ Loaded {len(textbook_content)} characters of demo content")

print("\n📖 Textbook preview (first 500 characters):")
print(textbook_content[:500] + "...")

## Part 3: Chunking the Textbook

We'll split the textbook into chunks of approximately 700 tokens each. This is important because:
- LLMs have context limits
- Smaller chunks provide more precise retrieval
- Each chunk should contain a coherent piece of information

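We'll use `tiktoken` with the `cl100k_base` encoding (the one used by GPT-3.5 and GPT-4) to count tokens. The embedding model used later has its own tokenizer, so these counts are only approximate from its point of view.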
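
The idea of overlapping chunks can be sketched independently of any tokenizer. The example below is a minimal illustration, not the chunker used in this notebook: it treats whitespace-separated words as stand-in tokens (an assumption; real `tiktoken` counts will differ), and `sliding_window_chunks` is a hypothetical helper name.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], max_tokens: int, overlap: int) -> List[List[str]]:
    """Split a token list into windows of at most max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


# Whitespace-separated words stand in for real tokens here.
words = "a b c d e f g h i j".split()
for window in sliding_window_chunks(words, max_tokens=4, overlap=1):
    print(window)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of storing some text twice.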

In [None]:
import tiktoken
from typing import Any, Dict, List

# Initialize tokenizer (using GPT-3.5/GPT-4 encoding)
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count the number of tokens in a text string."""
    return len(encoding.encode(text))

def chunk_text(text: str, max_tokens: int = 700, overlap_tokens: int = 50) -> List[Dict[str, Any]]:
    """
    Split text into chunks of approximately max_tokens size with overlap.

    Args:
        text: The text to chunk
        max_tokens: Target maximum tokens per chunk (a single paragraph
            longer than this is kept whole)
        overlap_tokens: Currently unused; overlap is achieved by repeating
            the last paragraph of the previous chunk

    Returns:
        List of dictionaries with chunk text and metadata
    """
    # Split into paragraphs (preserve document structure)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    
    chunks = []
    current_chunk = []
    current_tokens = 0
    chunk_id = 0
    
    for para in paragraphs:
        para_tokens = count_tokens(para)
        
        # If adding this paragraph exceeds max_tokens, save current chunk
        if current_tokens + para_tokens > max_tokens and current_chunk:
            # Local name differs from the function name to avoid shadowing it
            chunk_body = '\n\n'.join(current_chunk)
            chunks.append({
                'id': chunk_id,
                'text': chunk_body,
                'token_count': current_tokens,
                'char_count': len(chunk_body)
            })
            chunk_id += 1
            
            # Start new chunk with overlap (keep last paragraph)
            if len(current_chunk) > 1:
                current_chunk = [current_chunk[-1]]
                current_tokens = count_tokens(current_chunk[0])
            else:
                current_chunk = []
                current_tokens = 0
        
        current_chunk.append(para)
        current_tokens += para_tokens
    
    # Add the last chunk
    if current_chunk:
        # Local name differs from the function name to avoid shadowing it
        chunk_body = '\n\n'.join(current_chunk)
        chunks.append({
            'id': chunk_id,
            'text': chunk_body,
            'token_count': current_tokens,
            'char_count': len(chunk_body)
        })
    
    return chunks

# Chunk the textbook
print("✂️ Chunking textbook into ~700 token pieces...")
chunks = chunk_text(textbook_content, max_tokens=700, overlap_tokens=50)

print(f"\n✅ Created {len(chunks)} chunks")
print("\nChunk statistics:")
print(f"  Average tokens per chunk: {sum(c['token_count'] for c in chunks) / len(chunks):.0f}")
print(f"  Min tokens: {min(c['token_count'] for c in chunks)}")
print(f"  Max tokens: {max(c['token_count'] for c in chunks)}")

print("\n📄 Example chunk (Chunk 0):")
print(f"  Tokens: {chunks[0]['token_count']}")
print(f"  Characters: {chunks[0]['char_count']}")
print(f"  Preview: {chunks[0]['text'][:200]}...")

## Part 4: Creating Embeddings and Storing in Qdrant

Now we'll:
1. Load an embedding model
2. Create embeddings for each chunk
3. Store them in Qdrant with metadata

In [None]:
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np

# Load embedding model
print("🔄 Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional embeddings
print("✅ Model loaded!")

# Create Qdrant client (in-memory for this demo)
print("\n🔄 Creating Qdrant client...")
client = QdrantClient(":memory:")
print("✅ Qdrant client created (in-memory mode)")

# Create collection for textbook chunks
collection_name = "textbook_chunks"

print(f"\n🔄 Creating collection '{collection_name}'...")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=384,  # Dimension of our embeddings
        distance=Distance.COSINE  # Cosine similarity
    )
)
print("✅ Collection created!")

In [None]:
# Create embeddings for all chunks and insert into Qdrant
print("\n🔄 Creating embeddings for all chunks...")
print("This may take a minute...")

# Extract just the text from chunks
chunk_texts = [chunk['text'] for chunk in chunks]

# Create embeddings in batch (faster than one at a time)
embeddings = model.encode(chunk_texts, show_progress_bar=True)

print(f"✅ Created {len(embeddings)} embeddings")
print(f"   Each embedding has {len(embeddings[0])} dimensions")

In [None]:
# Prepare points for Qdrant
print("\n🔄 Preparing data for Qdrant...")

points = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    # Extract chapter/section info if available
    first_line = chunk['text'].split('\n')[0]
    is_chapter = 'Chapter' in first_line
    is_section = 'Section' in first_line

    points.append(
        PointStruct(
            id=i,
            vector=embedding.tolist(),
            payload={
                'text': chunk['text'],
                'chunk_id': chunk['id'],
                'token_count': chunk['token_count'],
                'char_count': chunk['char_count'],
                'is_chapter': is_chapter,
                'is_section': is_section,
                'preview': chunk['text'][:100] + '...'
            }
        )
    )

# Insert into Qdrant
client.upsert(
    collection_name=collection_name,
    points=points
)

print(f"✅ Inserted {len(points)} chunks into Qdrant!")
print("\n📊 Collection info:")
collection_info = client.get_collection(collection_name)
# vectors_count may be missing in newer qdrant-client releases, so read it defensively
print(f"   Vectors count: {getattr(collection_info, 'vectors_count', 'n/a')}")
print(f"   Points count: {collection_info.points_count}")

## Part 5: How RAG Works - Step by Step

Now let's see exactly how a RAG system processes a user question!

### Step 1: User asks a question (the prompt)

In [None]:
# User's question
user_question = "What is supervised learning and what are some examples?"

print("❓ User Question:")
print(f"   '{user_question}'")
print("\n📊 Question stats:")
print(f"   Tokens: {count_tokens(user_question)}")
print(f"   Characters: {len(user_question)}")

### Step 2: Convert the question to an embedding

The question is embedded with the same model that embedded the textbook chunks.
This matters: vectors produced by different models live in different embedding spaces, so similarity scores between them would be meaningless.

In [None]:
print("🔄 Converting question to embedding...")

# Embed the question
question_embedding = model.encode([user_question])[0]

print("✅ Question embedded!")
print(f"   Embedding dimensions: {len(question_embedding)}")
print(f"   First 10 values: {question_embedding[:10]}")
print(f"   Last 10 values: {question_embedding[-10:]}")
print("\n💡 This embedding represents the semantic meaning of the question!")

### Step 3: Search Qdrant for similar chunks

Qdrant compares the question embedding to all chunk embeddings using cosine similarity.
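
To make the comparison concrete, here is a minimal pure-Python sketch of cosine similarity and an exhaustive top-k lookup. It is only an illustration of the idea: the vectors are made-up 3-dimensional toys, `cosine_similarity` and `brute_force_top_k` are hypothetical helper names, and a production Qdrant server typically relies on an approximate index (HNSW) rather than scanning every vector.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; the result lies in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Tiny made-up vectors (the real embeddings in this notebook have 384 dimensions).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [-1.0, 0.0, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

for index, score in brute_force_top_k(toy_query, toy_vectors, k=2):
    print(f"vector {index}: cosine similarity = {score:.3f}")
```

The third toy vector points in the opposite direction to the query, so its score is negative and it is never returned in the top 2.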

In [None]:
print("🔍 Searching for relevant chunks in Qdrant...")

# Search for top 3 most relevant chunks
search_results = client.search(
    collection_name=collection_name,
    query_vector=question_embedding.tolist(),
    limit=3  # Retrieve top 3 most relevant chunks
)

print(f"✅ Found {len(search_results)} relevant chunks!\n")

# Display results with similarity scores
for i, result in enumerate(search_results, 1):
    print(f"{'='*80}")
    print(f"Result #{i} - Similarity Score: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"Chunk ID: {result.payload['chunk_id']}")
    print(f"Tokens: {result.payload['token_count']}")
    print(f"\nContent Preview:")
    print(result.payload['text'][:300] + "...")
    print()

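### 🤔 Understanding Similarity Scores

**Cosine similarity ranges from -1 to 1.** For text embeddings, scores for related text usually fall between 0 and 1. As a rough rule of thumb (useful cut-offs depend on the embedding model and the data):
- **1.0** = Vectors point in the same direction (e.g., identical text)
- **0.8-1.0** = Very similar (highly relevant)
- **0.6-0.8** = Somewhat similar (possibly relevant)
- **< 0.6** = Less similar (often, but not always, irrelevant)

Compare the scores of the top results with the lower-ranked ones: the chunks that are semantically closest to the question should rank first.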
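
One way to act on these scores is to drop hits below a cut-off and fall back to a "not covered" answer when nothing survives. The sketch below uses made-up `(score, text)` pairs in place of real Qdrant results; `filter_by_score` is a hypothetical helper and the cut-off of 0.5 is an arbitrary illustrative value, not a recommendation. The qdrant-client search call also accepts a `score_threshold` argument that can apply such a cut-off during the search; check the documentation for your client version.

```python
from typing import List, Tuple


def filter_by_score(hits: List[Tuple[float, str]], min_score: float) -> List[Tuple[float, str]]:
    """Keep only hits whose similarity score is at least min_score."""
    return [(score, text) for score, text in hits if score >= min_score]


# Made-up scores and texts, standing in for Qdrant search results.
toy_hits = [
    (0.62, "Supervised learning uses labeled training data."),
    (0.41, "Clustering groups similar data points together."),
    (0.08, "A common split is 80% training and 20% testing."),
]

kept = filter_by_score(toy_hits, min_score=0.5)
if kept:
    for score, text in kept:
        print(f"{score:.2f}  {text}")
else:
    print("No chunk is similar enough; tell the user the textbook does not cover this.")
```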

### Step 4: Format context for the LLM

Now we combine the retrieved chunks into a context string that will be sent to an LLM.

In [None]:
def format_context(search_results, max_chunks: int = 3) -> str:
    """
    Format search results into a context string for the LLM.

    Args:
        search_results: Results from Qdrant search
        max_chunks: Maximum number of chunks to include

    Returns:
        Formatted context string
    """
    context_parts = []

    for i, result in enumerate(search_results[:max_chunks], 1):
        context_parts.append(f"[Source {i} - Relevance: {result.score:.2f}]")
        context_parts.append(result.payload['text'])
        context_parts.append("")  # Empty line between sources

    return "\n".join(context_parts)

# Format the context
context = format_context(search_results, max_chunks=3)

print("📝 Formatted Context for LLM:")
print("="*80)
print(context)
print("="*80)
print("\n📊 Context stats:")
print(f"   Total tokens: {count_tokens(context)}")
print(f"   Total characters: {len(context)}")

### Step 5: Build the final prompt for the LLM

This is what actually gets sent to the LLM (GPT-4, Claude, etc.)

In [None]:
def build_llm_prompt(user_question: str, context: str) -> str:
    """
    Build the final prompt that will be sent to the LLM.

    Args:
        user_question: The user's original question
        context: Retrieved context from vector database

    Returns:
        Complete prompt for the LLM
    """
    prompt = f"""You are a helpful teaching assistant. Answer the student's question using ONLY the information provided in the context below. If the context doesn't contain enough information to answer the question, say so.

Context from textbook:
{context}

Student's question: {user_question}

Answer:"""

    return prompt

# Build the complete prompt
final_prompt = build_llm_prompt(user_question, context)

print("🤖 Complete Prompt for LLM:")
print("="*80)
print(final_prompt)
print("="*80)
print("\n📊 Final prompt stats:")
print(f"   Total tokens: {count_tokens(final_prompt)}")
print(f"   Total characters: {len(final_prompt)}")

### Step 6: Send to LLM (Conceptual)

In a real application, you would now send this prompt to an LLM API:

```python
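# Example with the OpenAI Python SDK v1+ (not executed in this notebook)
from openai import OpenAI

# Named llm_client so it does not overwrite the Qdrant `client` above
llm_client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = llm_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful teaching assistant."},
        {"role": "user", "content": final_prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

answer = response.choices[0].message.content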
print(answer)
```

**For this demo, here's what the LLM would likely respond:**

In [None]:
print("🤖 Simulated LLM Response:")
print("="*80)
simulated_response = """Based on the textbook context provided:

Supervised learning is the most common type of machine learning where the algorithm learns from labeled training data. Each training example consists of an input and the desired output, and the algorithm learns to map inputs to outputs.

Common examples of supervised learning include:

1. **Classification tasks**: Predicting discrete categories, such as:
   - Determining whether an email is spam or not spam

2. **Regression tasks**: Predicting continuous values, such as:
   - House prices
   - Temperature

Popular supervised learning algorithms mentioned in the textbook include:
- Linear regression
- Logistic regression
- Decision trees
- Random forests
- Support vector machines
- Neural networks

Each algorithm has its own strengths and weaknesses depending on the specific problem you're trying to solve."""

print(simulated_response)
print("="*80)

## Part 6: Interactive RAG Function

Let's create a complete RAG function you can use with any question!

In [None]:
def rag_query(question: str, top_k: int = 3, show_sources: bool = True) -> Dict:
    """
    Complete RAG pipeline: question -> retrieval -> context formatting.

    Args:
        question: User's question
        top_k: Number of chunks to retrieve
        show_sources: Whether to display source information

    Returns:
        Dictionary with question, context, prompt, and metadata
    """
    # Step 1: Embed the question
    question_emb = model.encode([question])[0]

    # Step 2: Search Qdrant
    results = client.search(
        collection_name=collection_name,
        query_vector=question_emb.tolist(),
        limit=top_k
    )

    # Step 3: Format context
    context = format_context(results, max_chunks=top_k)

    # Step 4: Build final prompt
    prompt = build_llm_prompt(question, context)

    # Prepare response
    response = {
        'question': question,
        'context': context,
        'prompt': prompt,
        'num_sources': len(results),
        'sources': results,
        'total_tokens': count_tokens(prompt)
    }

    if show_sources:
        print(f"❓ Question: {question}\n")
        print(f"📊 Retrieved {len(results)} relevant chunks:\n")
        for i, result in enumerate(results, 1):
            print(f"  {i}. [Score: {result.score:.3f}] {result.payload['preview']}")
        print(f"\n📝 Total context tokens: {count_tokens(context)}")
        print(f"📝 Total prompt tokens: {response['total_tokens']}")

    return response

# Test the RAG function
print("="*80)
print("Testing RAG Query Function")
print("="*80)
result = rag_query("What is overfitting and how can we prevent it?", top_k=3)

## Part 7: Comparing Different Questions

Let's see how the RAG system handles different types of questions!

In [None]:
# Test with different questions
test_questions = [
    "What are the three main types of machine learning?",
    "How do convolutional neural networks work?",
    "What is the difference between precision and recall?",
    "Explain gradient descent in simple terms"
]

print("🧪 Testing RAG with Multiple Questions")
print("="*80)

for i, question in enumerate(test_questions, 1):
    print(f"\n{'='*80}")
    print(f"Question {i}/{len(test_questions)}")
    print(f"{'='*80}")
    result = rag_query(question, top_k=2, show_sources=True)
    print()

### 🤔 Observations

Look at the results above and consider:
1. **How do similarity scores vary?** Some questions have higher scores than others
2. **Are the retrieved chunks relevant?** Do they actually contain information to answer the question?
3. **What happens with questions not in the textbook?** Try asking about topics not covered!

## Part 8: Advanced RAG Techniques

Let's explore some advanced features you can add to your RAG system.

### Technique 1: Filtered Search

Search only within specific sections or chapters.

In [None]:
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Search only in chapter-level content
question = "What is machine learning?"
question_emb = model.encode([question])[0]

filtered_results = client.search(
    collection_name=collection_name,
    query_vector=question_emb.tolist(),
    query_filter=Filter(
        must=[
            FieldCondition(
                key="is_chapter",
                match=MatchValue(value=True)
            )
        ]
    ),
    limit=3
)

print("🔍 Filtered Search (Chapters only):\n")
for i, result in enumerate(filtered_results, 1):
    print(f"{i}. [Score: {result.score:.3f}]")
    print(f"   Is Chapter: {result.payload['is_chapter']}")
    print(f"   Preview: {result.payload['preview']}\n")

### Technique 2: Retrieval with a Token Budget

Retrieve more candidates than needed, then keep chunks in ranked order until the context would exceed a token budget.
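
The cell that follows stops at the first chunk that does not fit. A possible variation is to skip an oversized chunk and keep looking for smaller ones further down the ranking. The sketch below compares both policies on made-up `(name, token_count)` pairs; `select_within_budget` and its `skip_oversized` flag are hypothetical names introduced only for this illustration.

```python
from typing import List, Tuple


def select_within_budget(
    chunks: List[Tuple[str, int]], max_tokens: int, skip_oversized: bool
) -> List[str]:
    """Greedily pick chunk names in ranked order without exceeding max_tokens."""
    selected, used = [], 0
    for name, tokens in chunks:
        if used + tokens <= max_tokens:
            selected.append(name)
            used += tokens
        elif not skip_oversized:
            break  # stop at the first chunk that does not fit
        # otherwise: skip this chunk and keep looking for smaller ones
    return selected


# Made-up ranking: best match first, with token counts per chunk.
ranked = [("A", 600), ("B", 500), ("C", 300)]
print(select_within_budget(ranked, max_tokens=1000, skip_oversized=False))  # ['A']
print(select_within_budget(ranked, max_tokens=1000, skip_oversized=True))   # ['A', 'C']
```

Skipping uses more of the budget, but it can pull in a lower-ranked chunk ahead of a better one that was too large, so which policy is preferable depends on the application.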

In [None]:
def rag_query_with_token_limit(question: str, max_context_tokens: int = 1500) -> Dict:
    """
    RAG query that respects a maximum token budget for context.

    Args:
        question: User's question
        max_context_tokens: Maximum tokens allowed in context

    Returns:
        Dictionary with query results
    """
    # Embed and search
    question_emb = model.encode([question])[0]
    results = client.search(
        collection_name=collection_name,
        query_vector=question_emb.tolist(),
        limit=10  # Get more candidates
    )

    # Add chunks until we hit token limit
    selected_chunks = []
    total_tokens = 0

    for result in results:
        chunk_tokens = result.payload['token_count']
        if total_tokens + chunk_tokens <= max_context_tokens:
            selected_chunks.append(result)
            total_tokens += chunk_tokens
        else:
            break  # Stop when we'd exceed limit

    print("📊 Token Budget Management:")
    print(f"   Max allowed: {max_context_tokens} tokens")
    print(f"   Actually used: {total_tokens} tokens")
    print(f"   Chunks included: {len(selected_chunks)}")
    print(f"   Chunks excluded: {len(results) - len(selected_chunks)}")

    return {
        'chunks': selected_chunks,
        'total_tokens': total_tokens,
        'chunks_used': len(selected_chunks)
    }

# Test token-limited retrieval
result = rag_query_with_token_limit(
    "Explain neural networks and how they are trained",
    max_context_tokens=1000
)

### Technique 3: Re-ranking Results

Sometimes the initial similarity scores aren't perfect. Re-ranking can improve results.
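
As one possible scheme, the vector score can be blended with the fraction of keywords that appear in the chunk, so the keyword signal stays bounded between 0 and 1. This is a sketch under stated assumptions, not the method used in the cell that follows (which adds a flat 0.1 per keyword): `blended_score` is a hypothetical helper, the weight `alpha=0.3` is an arbitrary choice, and the scores and texts are made up.

```python
from typing import List


def blended_score(similarity: float, text: str, keywords: List[str], alpha: float = 0.3) -> float:
    """Blend vector similarity with the fraction of keywords present in text."""
    if not keywords:
        return similarity
    lowered = text.lower()
    coverage = sum(kw.lower() in lowered for kw in keywords) / len(keywords)
    return (1 - alpha) * similarity + alpha * coverage


# Made-up (similarity, text) pairs standing in for Qdrant results.
toy_results = [
    (0.70, "Neural networks consist of interconnected nodes organized in layers."),
    (0.65, "Training adjusts weights with gradient descent and backpropagation."),
]
keywords = ["training", "gradient", "backpropagation"]

reranked = sorted(
    toy_results,
    key=lambda item: blended_score(item[0], item[1], keywords),
    reverse=True,
)
for similarity, text in reranked:
    print(f"{blended_score(similarity, text, keywords):.3f}  {text}")
```

In this toy data the second chunk overtakes the first after re-ranking, because it contains all three keywords while the first contains none.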

In [None]:
def rerank_by_keyword_presence(results, keywords: List[str]) -> List:
    """
    Re-rank results by boosting chunks that contain specific keywords.

    Args:
        results: Initial search results from Qdrant
        keywords: List of keywords to boost

    Returns:
        Re-ranked results
    """
    scored_results = []

    for result in results:
        text_lower = result.payload['text'].lower()

        # Count keyword matches
        keyword_score = sum(1 for kw in keywords if kw.lower() in text_lower)

        # Combine original similarity with keyword boost
        combined_score = result.score + (keyword_score * 0.1)  # Boost by 0.1 per keyword

        scored_results.append({
            'result': result,
            'original_score': result.score,
            'keyword_matches': keyword_score,
            'combined_score': combined_score
        })

    # Sort by combined score
    scored_results.sort(key=lambda x: x['combined_score'], reverse=True)

    return scored_results

# Example: Search for neural networks and boost results mentioning "training"
question = "How are neural networks trained?"
question_emb = model.encode([question])[0]

results = client.search(
    collection_name=collection_name,
    query_vector=question_emb.tolist(),
    limit=5
)

reranked = rerank_by_keyword_presence(results, keywords=["training", "gradient", "backpropagation"])

print("🔄 Re-ranked Results:\n")
for i, item in enumerate(reranked[:3], 1):
    print(f"{i}. Original Score: {item['original_score']:.3f} | "
          f"Keyword Matches: {item['keyword_matches']} | "
          f"Combined Score: {item['combined_score']:.3f}")
    print(f"   Preview: {item['result'].payload['preview']}\n")

## Part 9: Student Exercises

Now it's your turn! Complete these exercises to build your RAG skills.

### Exercise 1: Basic RAG Query

**Task:** Use the `rag_query()` function to answer: "What is reinforcement learning?"

**Expected:** You should get chunks from the reinforcement learning section with high similarity scores.

In [None]:
# YOUR CODE HERE
# Use rag_query() with the question about reinforcement learning



### Exercise 2: Filtered Search

**Task:** Search for information about "evaluation metrics" but only in chunks that are sections (not chapters).

**Hint:** Use `query_filter` with `is_section=True`

In [None]:
# YOUR CODE HERE
# 1. Create the question embedding
# 2. Use client.search() with a filter for is_section=True
# 3. Print the results



### Exercise 3: Token Budget

**Task:** Create a RAG query for "What are CNNs and RNNs?" with a maximum context of 800 tokens.

**Hint:** Use the `rag_query_with_token_limit()` function

In [None]:
# YOUR CODE HERE
# Use rag_query_with_token_limit() with max_context_tokens=800



### Exercise 4: Build a Multi-Question RAG

**Task:** Create a function that takes multiple questions and returns combined context for all of them.

This is useful when a user asks a complex question that might need information from different parts of the textbook.

In [None]:
# YOUR CODE HERE
def multi_question_rag(questions: List[str], chunks_per_question: int = 2) -> Dict:
    """
    Retrieve context for multiple related questions.

    Args:
        questions: List of questions to answer
        chunks_per_question: How many chunks to retrieve per question

    Returns:
        Dictionary with combined results
    """
    # YOUR CODE HERE
    # 1. For each question, get embeddings and search
    # 2. Combine results (remove duplicates)
    # 3. Return combined context
    pass

# Test with related questions
# questions = [
#     "What is supervised learning?",
#     "What is unsupervised learning?",
#     "What is reinforcement learning?"
# ]
# result = multi_question_rag(questions, chunks_per_question=1)
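
**Hint for step 2 (removing duplicates):** the same chunk can come back for more than one question. The sketch below is one possible way to merge result lists, keeping the highest-scoring copy of each chunk. It assumes each result is a plain dict with hypothetical `id` and `score` keys; with Qdrant results you would read the `.id` and `.score` attributes instead. It deliberately leaves the embedding and search steps to you.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_id(result_lists: Iterable[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge several result lists, keeping the highest-scoring copy of each chunk id."""
    best: Dict[Any, Dict[str, Any]] = {}
    for results in result_lists:
        for r in results:
            current = best.get(r["id"])
            # Keep this result if the id is new or it scores higher than the stored copy
            if current is None or r["score"] > current["score"]:
                best[r["id"]] = r
    # Highest score first, so the most relevant chunks lead the combined context
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Toy data: chunk 2 is returned for both questions
merged = dedupe_by_id([
    [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.5}],
    [{"id": 2, "score": 0.7}, {"id": 3, "score": 0.6}],
])
print([(r["id"], r["score"]) for r in merged])
```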

### Exercise 5: Evaluation Metrics

**Task:** Create a function to evaluate RAG quality by checking if retrieved chunks contain expected keywords.

This helps you measure if your RAG system is retrieving relevant information.

In [None]:
# YOUR CODE HERE
from typing import Dict, List  # needed for the type hints below
def evaluate_retrieval(question: str, expected_keywords: List[str], top_k: int = 3) -> Dict:
    """
    Evaluate retrieval quality by checking for expected keywords.

    Args:
        question: The question to search for
        expected_keywords: Keywords that should appear in good results
        top_k: Number of chunks to retrieve

    Returns:
        Dictionary with evaluation metrics
    """
    # YOUR CODE HERE
    # 1. Perform RAG query
    # 2. Check how many expected keywords appear in results
    # 3. Calculate keyword coverage: (keywords found / total expected keywords)
    #    and store it under the 'precision' key used in the test below
    # 4. Return metrics
    pass

# Test evaluation
# metrics = evaluate_retrieval(
#     question="What is overfitting?",
#     expected_keywords=["overfitting", "training", "test", "regularization"],
#     top_k=3
# )
# print(f"Keyword coverage: {metrics['precision']:.2%}")

## Part 10: Real-World RAG Considerations

### Important Factors for Production RAG Systems

**1. Chunk Size Selection**
- Too small: Lacks context, may miss important information
- Too large: Less precise retrieval, wastes tokens
- Sweet spot: 500-1000 tokens depending on your use case

**2. Overlap Between Chunks**
- Prevents information from being split across chunk boundaries
- Typical overlap: 10-20% of chunk size
- Trade-off: More storage vs. better retrieval
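
To make the overlap idea concrete, here is a minimal sketch of sliding-window chunking over a list of token ids. The function name `chunk_with_overlap` and the defaults (700-token chunks with 100 tokens of overlap, roughly 14%) are assumptions chosen for illustration, not the chunker used earlier in this notebook.

```python
from typing import List


def chunk_with_overlap(tokens: List[int], chunk_size: int = 700, overlap: int = 100) -> List[List[int]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Toy example: 10 "tokens", windows of 4 with 1 token of overlap
toy_chunks = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1)
print(toy_chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk. The cost is that the repeated tokens are embedded and stored twice.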

**3. Embedding Model Choice**
- Larger models (768+ dimensions): Often better accuracy, but slower to run and more storage per vector
- Smaller models (384 dimensions): Faster, cheaper, and often good enough
- Domain-specific models: Better for specialized content (medical, legal, etc.)

**4. Number of Retrieved Chunks (top_k)**
- More chunks: Better coverage, but more noise and cost
- Fewer chunks: Faster, cheaper, but might miss information
- Typical range: 3-5 chunks for most applications

**5. Metadata and Filtering**
- Add source, date, author, section, etc.
- Enables filtered search (e.g., "only recent documents")
- Improves relevance and user trust

**6. Handling No Good Matches**
- Set a minimum similarity threshold (e.g., 0.7; tune it for your embedding model and distance metric)
- If no result scores above the threshold, tell the user "I don't have information about that"
- This is better than letting the LLM hallucinate an answer!
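
A minimal sketch of the threshold check, written against plain Python objects so it runs without a database. The `Hit` class is a hypothetical stand-in for a search result with a `score` and a `payload`, and the `text` payload key and the 0.7 default are assumptions. Qdrant's search call also accepts a `score_threshold` argument that applies the same cut on the server side, if you prefer to filter there.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

NO_ANSWER = "I don't have information about that."


@dataclass
class Hit:
    """Hypothetical stand-in for a search result (score plus payload)."""
    score: float
    payload: Dict[str, Any] = field(default_factory=dict)


def context_or_fallback(hits: List[Hit], min_score: float = 0.7) -> str:
    """Join the text of hits scoring at least min_score, or return a fallback message."""
    good = [h for h in hits if h.score >= min_score]
    if not good:
        return NO_ANSWER  # skip the LLM call rather than answer from weak context
    return "\n\n".join(h.payload["text"] for h in good)


hits = [
    Hit(0.82, {"text": "Overfitting means..."}),
    Hit(0.41, {"text": "Unrelated chunk"}),
]
print(context_or_fallback(hits))                                   # only the 0.82 chunk survives
print(context_or_fallback([Hit(0.3, {"text": "weak match"})]))     # fallback message
```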

**7. Cost Considerations**
- Embedding API costs (if using OpenAI, Cohere, etc.)
- Vector database storage costs
- LLM API costs (proportional to context size)
- Balance quality vs. cost by tuning chunk size and top_k
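
As a rough illustration of how chunk size and top_k drive LLM cost, the sketch below estimates the prompt size for a query. Everything here is an assumption for illustration: the function names, the 30-token question and 50-token template allowances, and especially the price, which is a placeholder rather than any provider's real rate.

```python
def estimate_prompt_tokens(chunk_tokens: int, top_k: int,
                           question_tokens: int = 30,
                           template_tokens: int = 50) -> int:
    """Rough upper bound on prompt size: top_k full chunks plus question and template."""
    return chunk_tokens * top_k + question_tokens + template_tokens


def estimate_input_cost(prompt_tokens: int, price_per_1k_tokens: float) -> float:
    """Input-side cost only; generated (output) tokens are not included."""
    return prompt_tokens / 1000 * price_per_1k_tokens


# Placeholder price for illustration only, not a real provider rate
HYPOTHETICAL_PRICE_PER_1K = 0.01

for top_k in (3, 5):
    tokens = estimate_prompt_tokens(chunk_tokens=700, top_k=top_k)
    cost = estimate_input_cost(tokens, HYPOTHETICAL_PRICE_PER_1K)
    print(f"top_k={top_k}: ~{tokens} prompt tokens, ~{cost:.4f} per query (hypothetical units)")
```

With 700-token chunks, going from 3 to 5 retrieved chunks raises the estimated prompt from 2180 to 3580 tokens, which is why top_k is usually the first knob to tune.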

## Summary: What You've Learned

### RAG Pipeline Steps
1. **Document Preparation**: Load and chunk documents into manageable pieces
2. **Embedding Creation**: Convert chunks to vector embeddings
3. **Vector Storage**: Store embeddings in a vector database (Qdrant)
4. **Query Processing**: Convert user questions to embeddings
5. **Similarity Search**: Find most relevant chunks using vector similarity
6. **Context Assembly**: Combine retrieved chunks into context
7. **Prompt Construction**: Build final prompt with context + question
8. **LLM Generation**: Send to LLM for answer generation

### Key Concepts
- ✅ **Chunking**: Breaking documents into semantic units
- ✅ **Token counting**: Managing context limits
- ✅ **Embeddings**: Converting text to semantic vectors
- ✅ **Vector similarity**: Finding semantically related content
- ✅ **Context formatting**: Preparing information for LLMs
- ✅ **Metadata filtering**: Improving retrieval precision
- ✅ **Token budgets**: Managing costs and context limits

### Advanced Techniques
- 🔧 Filtered search by metadata
- 🔧 Token-limited retrieval
- 🔧 Result re-ranking
- 🔧 Multi-question queries
- 🔧 Retrieval evaluation

**Congratulations!** You now understand how to build a complete RAG system! 🎉

### Next Steps
1. Try with your own documents
2. Experiment with different chunk sizes
3. Test different embedding models
4. Integrate with a real LLM (OpenAI, Azure OpenAI, etc.)
5. Add a user interface (Streamlit, Gradio, etc.)
6. Deploy to production!

## Bonus: Quick Reference Code

Here's a complete minimal RAG implementation you can copy and adapt:

In [None]:
"""
MINIMAL RAG IMPLEMENTATION - QUICK REFERENCE

This is a simplified version you can use as a starting point.
"""

def minimal_rag(question: str, collection_name: str, client, model, top_k: int = 3):
    """Minimal RAG implementation in one function."""

    # 1. Embed question
    q_emb = model.encode([question])[0]

    # 2. Search vector DB
    results = client.search(
        collection_name=collection_name,
        query_vector=q_emb.tolist(),
        limit=top_k
    )

    # 3. Build context
    context = "\n\n".join([r.payload['text'] for r in results])

    # 4. Build prompt
    prompt = f"""Answer this question using the context below.

Context:
{context}

Question: {question}

Answer:"""

    return prompt

# Example usage:
# prompt = minimal_rag("What is machine learning?", collection_name, client, model)
# # Send prompt to your LLM of choice
print("✅ Minimal RAG function defined!")
print("\nYou can now use minimal_rag() for quick RAG queries!")

## Resources and Further Reading

### Documentation
- **Qdrant**: https://qdrant.tech/documentation/
- **Sentence Transformers**: https://www.sbert.net/
- **Azure Blob Storage**: https://docs.microsoft.com/azure/storage/blobs/

### Advanced Topics
- **Hybrid Search**: Combining vector search with keyword search
- **Re-ranking Models**: Using cross-encoders for better results
- **Streaming Responses**: Returning LLM responses in real-time
- **Caching**: Storing common queries to reduce costs
- **Evaluation**: Measuring RAG system quality

### Related Notebooks
- Part 1: Why Vector Databases
- Part 2: Vector Database Solutions
- Part 3: Qdrant & Advanced Features

**Happy building!** 🚀


