# 🚀 Day 2: Advanced RAG + LLM Agents + LoRA/QLoRA

**🎯 Goal:** Master cutting-edge LLM techniques demanded by industry in 2025

**⏱️ Time:** 150-180 minutes

**🌟 Why This Matters (2025 Job Market):**
- **RAG Engineers** are the hottest new role - companies hiring like crazy!
- **LLM Agents** are the future - from ChatGPT plugins to AutoGPT
- **LoRA/QLoRA** mentioned in 21% of ML job postings - essential for efficient fine-tuning
- These skills separate junior from senior AI engineers
- Real companies need: Advanced chunking, hybrid search, re-ranking, agent frameworks

**What You'll Master Today:**
1. **Advanced RAG:** Chunking strategies, hybrid search, re-ranking, evaluation
2. **LLM Agents:** ReAct pattern, tool use, function calling, LangChain agents
3. **LoRA/QLoRA:** Parameter-efficient fine-tuning, practical implementation
4. **Production Patterns:** What companies actually use in 2025

---

## 📚 Part 1: Advanced RAG Techniques

**Basic RAG (What we covered before):**
```
Documents → Embeddings → Vector DB → Retrieval → LLM → Answer
```

**Advanced RAG (What companies use in 2025):**
```
Documents → Smart Chunking → Multi-Representation Embeddings
    ↓
Vector DB + Keyword Index (Hybrid Search)
    ↓
Retrieval (Top-20) → Re-Ranking (Top-3) → Context Compression
    ↓
LLM with Chain-of-Thought → Answer + Citations
    ↓
Evaluation (Faithfulness, Relevance)
```

### 🎯 Advanced RAG Components:

**1. Smart Chunking Strategies:**
- Sentence-based chunking (semantic boundaries)
- Sliding window with overlap
- Recursive character splitting
- Markdown/code-aware splitting

**2. Hybrid Search:**
- Semantic search (vector similarity)
- Keyword search (BM25, TF-IDF)
- Combine scores with weighted fusion

**3. Re-Ranking:**
- Retrieve top-K candidates (e.g., 20)
- Re-rank with cross-encoder
- Return top-N most relevant (e.g., 3)

**4. Evaluation:**
- Faithfulness: Does answer come from context?
- Relevance: Is context relevant to query?
- Answer correctness: Is answer accurate?

Let's build each component!

In [None]:
# Install advanced RAG libraries
import sys
!{sys.executable} -m pip install langchain langchain-community rank-bm25 sentence-transformers chromadb --quiet

print("✅ Advanced RAG libraries installed!")

### 1️⃣ Smart Chunking Strategies

In [None]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter
)

# Sample document (typical blog post or documentation)
document = """
# Introduction to Large Language Models

Large Language Models (LLMs) have revolutionized AI. Models like GPT-4, Claude, and Gemini can understand and generate human-like text.

## How LLMs Work

LLMs are trained on massive amounts of text data using the transformer architecture. They learn patterns in language by predicting the next word in a sequence.

### Training Process

1. Pre-training: Models learn from internet-scale text data
2. Fine-tuning: Models are adapted to specific tasks
3. RLHF: Reinforcement Learning from Human Feedback improves quality

## Applications

LLMs power chatbots, code assistants, search engines, and more. Companies use them for customer support, content generation, and data analysis.

### Best Practices

When using LLMs, always validate outputs, use prompt engineering, and implement RAG systems for accurate information retrieval.
"""

print("📄 Original Document:")
print(f"   Length: {len(document)} characters")
print(f"   Length: {len(document.split())} words")

In [None]:
# Strategy 1: Recursive Character Splitting (tries paragraphs first, then lines, sentences, words)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  # Target chunk size
    chunk_overlap=50,  # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
    length_function=len
)

chunks_recursive = recursive_splitter.split_text(document)

print("\n🔪 Recursive Character Splitting:")
print(f"   Created {len(chunks_recursive)} chunks\n")
for i, chunk in enumerate(chunks_recursive, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:100]}...\n")

print("💡 Splitting on paragraph and line breaks first keeps headers and paragraphs together where possible!")

In [None]:
# Strategy 2: Token-Based Chunking (splits on the embedding model's tokens, not on sentence boundaries)
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=20,
    tokens_per_chunk=50  # Must not exceed the embedding model's max sequence length
)

chunks_tokens = token_splitter.split_text(document)

print("\n✂️ Token-Based Splitting:")
print(f"   Created {len(chunks_tokens)} chunks\n")
for i, chunk in enumerate(chunks_tokens[:3], 1):  # Show first 3
    print(f"Chunk {i}: {chunk}\n")

print("💡 This ensures chunks fit within embedding model's token limit!")
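The component list earlier mentions a sliding window with overlap, which neither splitter above shows directly. Below is a minimal pure-Python sketch of that idea, counting words rather than model tokens. The function name `sliding_window_chunks` and its parameters are illustrative assumptions, not part of LangChain.

```python
def sliding_window_chunks(text, window_size=40, overlap=10):
    """Split text into word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    words = text.split()
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 100-word text so the window boundaries are easy to see
sample = " ".join(f"w{i}" for i in range(100))
demo_chunks = sliding_window_chunks(sample, window_size=40, overlap=10)
for i, chunk in enumerate(demo_chunks, 1):
    words = chunk.split()
    print(f"Chunk {i}: {words[0]} ... {words[-1]} ({len(words)} words)")
```

The overlap means a sentence cut at one window's edge still appears whole in the neighboring window, at the cost of storing some text twice.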

### 2️⃣ Hybrid Search (Semantic + Keyword)

In [None]:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

# Sample knowledge base
documents = [
    "GPT-4 is a large language model by OpenAI with advanced reasoning capabilities.",
    "RAG systems combine retrieval with generation to reduce hallucinations.",
    "LoRA enables efficient fine-tuning by updating only small adapter layers.",
    "Transformers use self-attention mechanisms to process sequences in parallel.",
    "Vector databases like Pinecone store embeddings for fast semantic search.",
    "Prompt engineering involves crafting effective prompts for better LLM outputs.",
    "BERT is an encoder-only transformer model for understanding tasks.",
    "Fine-tuning adapts pre-trained models to specific domains and tasks."
]

# Initialize embedding model for semantic search
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = embedding_model.encode(documents)

# Initialize BM25 for keyword search
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

print("✅ Hybrid search system initialized!")
print(f"   {len(documents)} documents indexed")
print(f"   Semantic search: {doc_embeddings.shape[1]}-dim embeddings")
print(f"   Keyword search: BM25 algorithm")

In [None]:
def hybrid_search(query, top_k=3, alpha=0.5):
    """
    Hybrid search combining semantic and keyword search
    
    Args:
        query: Search query
        top_k: Number of results
        alpha: Weight for semantic search (0=keyword only, 1=semantic only)
    """
    # Semantic search
    query_embedding = embedding_model.encode([query])[0]
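    # Dot product equals cosine similarity when the embeddings are unit-length
    semantic_scores = np.dot(doc_embeddings, query_embedding)
    # Min-max normalize; the epsilon avoids division by zero when all scores are equal
    semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min() + 1e-10)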
    
    # Keyword search (BM25)
    tokenized_query = query.lower().split()
    keyword_scores = bm25.get_scores(tokenized_query)
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min() + 1e-10)
    
    # Combine scores
    hybrid_scores = alpha * semantic_scores + (1 - alpha) * keyword_scores
    
    # Get top-k results
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'score': hybrid_scores[idx],
            'semantic_score': semantic_scores[idx],
            'keyword_score': keyword_scores[idx]
        })
    
    return results

# Test hybrid search
query = "How can I fine-tune models efficiently?"

print(f"🔍 Query: '{query}'\n")
print("="*80)

# Compare: Semantic only vs Keyword only vs Hybrid
print("\n🎨 Semantic Search Only (alpha=1.0):")
semantic_results = hybrid_search(query, alpha=1.0)
for i, r in enumerate(semantic_results, 1):
    print(f"  {i}. [Score: {r['score']:.3f}] {r['document']}")

print("\n📝 Keyword Search Only (alpha=0.0):")
keyword_results = hybrid_search(query, alpha=0.0)
for i, r in enumerate(keyword_results, 1):
    print(f"  {i}. [Score: {r['score']:.3f}] {r['document']}")

print("\n⚖️ Hybrid Search (alpha=0.5):")
hybrid_results = hybrid_search(query, alpha=0.5)
for i, r in enumerate(hybrid_results, 1):
    print(f"  {i}. [Score: {r['score']:.3f}] {r['document']}")
    print(f"      Semantic: {r['semantic_score']:.3f}, Keyword: {r['keyword_score']:.3f}")

print("\n💡 Hybrid search combines best of both worlds!")
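To see how `alpha` shifts the ranking, here is a small worked example of the same min-max normalization and weighted fusion, in pure Python. The scores are made-up numbers, not model output, and `min_max` and `fuse` are illustrative helper names.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(semantic, keyword, alpha=0.5):
    """Weighted sum of the two normalized score lists."""
    sem, kw = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]

# Made-up raw scores for three documents (illustrative only)
toy_semantic = [0.25, 0.75, 0.5]  # cosine-like similarities
toy_keyword = [4.0, 1.0, 0.0]     # BM25-like scores on a different scale

for alpha in (0.0, 0.5, 1.0):
    fused = fuse(toy_semantic, toy_keyword, alpha)
    best = max(range(len(fused)), key=fused.__getitem__)
    print(f"alpha={alpha}: fused={[round(x, 3) for x in fused]} -> best doc index {best}")
```

Normalizing first matters because BM25 scores and cosine similarities are on different scales; without it, one signal would dominate regardless of `alpha`.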

### 3️⃣ Re-Ranking with Cross-Encoder

In [None]:
!{sys.executable} -m pip install sentence-transformers --quiet

from sentence_transformers import CrossEncoder

# Load cross-encoder for re-ranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

print("✅ Cross-encoder loaded for re-ranking!")
print("\n💡 Cross-encoders are more accurate than bi-encoders for ranking")
print("   but slower (can't pre-compute). Use them for re-ranking!")

In [None]:
def advanced_rag_search(query, initial_k=5, final_k=2):
    """
    Advanced RAG: Hybrid search + Re-ranking
    """
    # Step 1: Hybrid search (get more candidates)
    candidates = hybrid_search(query, top_k=initial_k, alpha=0.5)
    
    # Step 2: Re-rank with cross-encoder
    pairs = [[query, c['document']] for c in candidates]
    rerank_scores = reranker.predict(pairs)
    
    # Add rerank scores
    for i, score in enumerate(rerank_scores):
        candidates[i]['rerank_score'] = score
    
    # Sort by rerank score
    reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)[:final_k]
    
    return reranked

# Test advanced RAG
query = "What's the most efficient way to adapt large models?"

print(f"🔍 Query: '{query}'\n")
print("="*80)

results = advanced_rag_search(query, initial_k=5, final_k=2)

print("\n🏆 Final Re-Ranked Results:\n")
for i, r in enumerate(results, 1):
    print(f"{i}. [Re-rank Score: {r['rerank_score']:.4f}]")
    print(f"   Document: {r['document']}")
    print(f"   Original Hybrid Score: {r['score']:.3f}\n")

print("💡 Check the top result: the cross-encoder scores query and document together, so it should rank the LoRA document first!")
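The evaluation component listed earlier (faithfulness, relevance) is not implemented in the cells above. The sketch below is a crude lexical proxy using token overlap, not a standard metric. Production evaluation usually relies on LLM judges or dedicated frameworks. The function names are illustrative assumptions.

```python
import re

def _tokens(text):
    """Lowercase word tokens as a set (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness_proxy(answer, context):
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)

def relevance_proxy(query, context):
    """Fraction of query tokens that appear in the retrieved context."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & _tokens(context)) / len(query_tokens)

context = "LoRA enables efficient fine-tuning by updating only small adapter layers."
grounded = "LoRA updating small adapter layers"
ungrounded = "Quantum computers replace GPUs"

print(f"Faithfulness (grounded):   {faithfulness_proxy(grounded, context):.2f}")
print(f"Faithfulness (ungrounded): {faithfulness_proxy(ungrounded, context):.2f}")
print(f"Relevance: {relevance_proxy('efficient fine-tuning', context):.2f}")
```

Token overlap cannot detect paraphrase or contradiction, so treat these numbers as a smoke test rather than a quality score.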

## 🤖 Part 2: LLM Agents & Tool Use

**What are LLM Agents?**

Agents are LLMs that can:
- ✅ Use external tools (calculators, search engines, APIs)
- ✅ Make multi-step decisions
- ✅ Take actions in environments
- ✅ Adjust their next step based on feedback (tool observations)

**Examples:**
- ChatGPT Plugins
- AutoGPT
- LangChain Agents
- Microsoft Copilot

### 🎯 ReAct Pattern (Reasoning + Acting)

**ReAct Framework:**
```
Thought: I need to find the current weather
Action: search["weather in San Francisco"]
Observation: Currently 68°F and sunny
Thought: I have the info, I can answer now
Answer: It's 68°F and sunny in San Francisco
```

Let's build agents!

In [None]:
# Install LangChain for agents
!{sys.executable} -m pip install langchain langchain-community --quiet

print("✅ LangChain installed!")

In [None]:
# Simple ReAct Agent Implementation
import re

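import ast
import operator

# Arithmetic operators the calculator tool is allowed to apply
_ALLOWED_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

# Define tools the agent can use
def calculator(expression):
    """Evaluates basic arithmetic (+, -, *, /) without eval(), which would run arbitrary code"""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("Unsupported expression")
    try:
        return _eval(ast.parse(expression, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return "Error: Invalid expression"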

def search(query):
    """Simulates searching a knowledge base"""
    knowledge = {
        "capital of france": "Paris",
        "population of tokyo": "14 million",
        "largest ocean": "Pacific Ocean",
        "speed of light": "299,792,458 meters per second"
    }
    
    query_lower = query.lower()
    for key, value in knowledge.items():
        if key in query_lower:
            return value
    return "No information found"

# Tools registry
TOOLS = {
    "calculator": calculator,
    "search": search
}

def react_agent(question, max_steps=5):
    """
    Simple ReAct agent that can use tools
    """
    print(f"🤖 Agent Question: {question}\n")
    print("="*80)
    
    for step in range(max_steps):
        print(f"\nStep {step + 1}:")
        
        # Thought (simplified - in production an LLM chooses the tool)
        q = question.lower()
        if any(k in q for k in ("calculate", "plus", "times", "+", "*")):
            print("💭 Thought: I need to use the calculator")
            
            # Extract the operands; fall back to summing whatever numbers were found
            numbers = re.findall(r'\d+', question)
            if ("plus" in q or "+" in q) and len(numbers) >= 2:
                expr = f"{numbers[0]} + {numbers[1]}"
            elif ("times" in q or "*" in q) and len(numbers) >= 2:
                expr = f"{numbers[0]} * {numbers[1]}"
            else:
                expr = " + ".join(numbers)
            
            print(f"🔧 Action: calculator[{expr}]")
            result = calculator(expr)
            print(f"👀 Observation: {result}")
            print(f"\n✅ Answer: {result}")
            return result
        
        else:
            print("💭 Thought: I need to search for information")
            print(f"🔧 Action: search[{question}]")
            result = search(question)
            print(f"👀 Observation: {result}")
            print(f"\n✅ Answer: {result}")
            return result
    
    return "Could not answer within max steps"

# Test the agent
print("🧪 Testing ReAct Agent:\n")

# Question 1: Search
react_agent("What is the capital of France?")

print("\n" + "="*80 + "\n")

# Question 2: Calculator
react_agent("What is 25 plus 17?")
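The agent above picks a tool with keyword rules and always finishes in one step. As a rough sketch of the full Thought → Action → Observation loop, the code below parses `Action: tool["input"]` lines from model output and feeds observations back in. `fake_llm` is a scripted stand-in for a real LLM call, and every name here (`fake_llm`, `lookup`, `DEMO_TOOLS`, `run_react`) is a hypothetical chosen for illustration.

```python
import re

def fake_llm(transcript):
    """Stand-in for a real LLM call: returns the next scripted step."""
    if "Observation:" not in transcript:
        return 'Thought: I need to look this up\nAction: lookup["capital of france"]'
    return "Thought: I have the info, I can answer now\nAnswer: Paris"

def lookup(query):
    """Toy tool backed by a hard-coded dictionary."""
    return {"capital of france": "Paris"}.get(query.lower(), "No information found")

DEMO_TOOLS = {"lookup": lookup}
ACTION_RE = re.compile(r'Action:\s*(\w+)\["(.*)"\]')

def run_react(question, llm=fake_llm, max_steps=5):
    """Loop Thought -> Action -> Observation until the model emits an Answer."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += "\n" + step
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip(), transcript
        match = ACTION_RE.search(step)
        if match is None:
            break  # the model produced neither an action nor an answer
        tool_name, tool_input = match.groups()
        tool = DEMO_TOOLS.get(tool_name)
        observation = tool(tool_input) if tool else f"Unknown tool: {tool_name}"
        transcript += f"\nObservation: {observation}"
    return None, transcript

react_answer, react_trace = run_react("What is the capital of France?")
print(react_trace)
```

Swapping `fake_llm` for a real model call keeps the loop unchanged; the growing transcript is what lets the model condition each step on earlier observations.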

### 🛠️ Function Calling (OpenAI-Style)

In [None]:
# Function calling enables LLMs to use structured tools

# Example: Define tools for the LLM
tools = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_documents",
        "description": "Search internal company documents",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "category": {
                    "type": "string",
                    "enum": ["hr", "engineering", "sales"]
                }
            },
            "required": ["query"]
        }
    }
]

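print("🛠️ Function Calling Tools Defined:\n")
for tool in tools:
    print(f"  • {tool['name']}: {tool['description']}")

print("\n💡 In production, you'd send these to an LLM API. With the OpenAI SDK, each")
print("   definition is wrapped as {'type': 'function', 'function': ...};")
print("   other providers (e.g. Claude) use a similar but not identical schema:")
print("""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=[{"type": "function", "function": t} for t in tools]
)
# LLM decides to call: get_current_weather(location="San Francisco, CA")
""")

print("✅ This is the mechanism behind tool-using assistants such as ChatGPT plugins!")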
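The model only proposes a call; your code still has to run it. Here is a minimal sketch of that dispatch step, assuming the model returns a tool name plus a JSON string of arguments. `get_current_weather` is a hypothetical local function that returns canned data, and `dispatch_tool_call` is an illustrative helper, not part of any SDK.

```python
import json

def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical local implementation; returns canned data instead of calling an API."""
    return {"location": location, "temperature": 68, "unit": unit}

LOCAL_FUNCTIONS = {"get_current_weather": get_current_weather}

def dispatch_tool_call(name, arguments_json, required=("location",)):
    """Parse the model's JSON arguments, check required keys, and call the function."""
    if name not in LOCAL_FUNCTIONS:
        return {"error": f"Unknown tool: {name}"}
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"error": "Arguments were not valid JSON"}
    if not isinstance(arguments, dict):
        return {"error": "Arguments must be a JSON object"}
    missing = [key for key in required if key not in arguments]
    if missing:
        return {"error": f"Missing required arguments: {missing}"}
    return LOCAL_FUNCTIONS[name](**arguments)

# Simulated model output: a tool name plus a JSON string of arguments
tool_result = dispatch_tool_call("get_current_weather", '{"location": "San Francisco, CA"}')
print(json.dumps(tool_result))
```

Returning an error dictionary instead of raising lets you send the failure back to the model as the tool result, so it can retry with corrected arguments.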

## 🎨 Part 3: LoRA & QLoRA - Efficient Fine-Tuning

**The Problem:**
- Fine-tuning a frontier model like GPT-4 (size undisclosed, rumored to exceed 1T params) = Impossible for most
- Even Llama 70B = expensive, slow

**The Solution: LoRA (Low-Rank Adaptation)**

**How LoRA Works:**
```
Traditional Fine-Tuning:
  Update ALL 70 billion parameters ❌ Expensive!

LoRA:
  Freeze base model (70B params)
  Add small trainable adapters (~10M params, depending on rank) ✅ Much cheaper!
  Optionally merge adapters into the base weights after training
```

**QLoRA = LoRA + 4-bit Quantization**
- Load the base model in 4-bit (weights take ~75% less memory than 16-bit)
- Fine-tune with LoRA
- Fine-tune a 65-70B model on a single high-memory GPU (roughly 48GB or more)! 🚀
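The memory saving from 4-bit loading can be checked with quick arithmetic. This sketch counts only the model weights, ignoring activations, gradients, optimizer state, and quantization overhead, so real usage is higher. `weight_memory_gb` is an illustrative helper name.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)
savings = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"4-bit weights:  {int4_gb:.0f} GB")
print(f"Reduction:      {savings:.0%}")
```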

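### 📊 LoRA vs Full Fine-Tuning (approximate, 70B model):

| Metric | Full Fine-Tuning | LoRA | QLoRA |
|--------|------------------|------|-------|
| **Trainable Params** | 70B | ~10M+ (depends on rank) | ~10M+ (depends on rank) |
| **GPU Memory** | 1TB+ (weights + gradients + optimizer states) | 140GB+ (16-bit frozen weights) | ~40-48GB (4-bit frozen weights) |
| **Training Time** | Days | Hours | Hours |
| **Cost** | $$$$ | $$ | $ |
| **Performance** | 100% (baseline) | ~95-99% (task-dependent) | ~95-99% (task-dependent) |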

Let's implement LoRA!

In [None]:
# Install PEFT (Parameter-Efficient Fine-Tuning) library
!{sys.executable} -m pip install peft transformers datasets accelerate --quiet

print("✅ PEFT library installed!")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
import torch

# Load base model (using GPT-2 for demo - same principles apply to Llama, Mistral, etc.)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set padding token
tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Loaded base model: {model_name}")
print(f"   Total parameters: {model.num_parameters():,}")

In [None]:
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # LoRA rank (higher = more capacity, but more params)
    lora_alpha=32,  # LoRA scaling factor
    lora_dropout=0.1,  # Dropout for regularization
    target_modules=["c_attn"],  # Which layers to apply LoRA (GPT-2's fused attention projection)
    fan_in_fan_out=True,  # GPT-2 stores c_attn as Conv1D with transposed weights
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

print("\n💡 See the difference?")
print("   Full fine-tuning: ~124M parameters")
print("   LoRA: Only ~300K trainable parameters!")
print("   That's 400x fewer parameters to train! 🚀")
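The "~300K" figure can be reproduced by hand. LoRA adds two small matrices per adapted layer, A (rank × in_features) and B (out_features × rank). The sketch below counts them for GPT-2 small, assuming 12 transformer blocks and a `c_attn` projection from 768 to 2304 dimensions. `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Trainable parameters when one linear layer per block gets a LoRA adapter."""
    # A is (rank x in_features), B is (out_features x rank)
    per_layer = rank * in_features + out_features * rank
    return per_layer * num_layers

# GPT-2 small: 12 blocks, hidden size 768, c_attn projects 768 -> 2304 (Q, K, V fused)
gpt2_lora_params = lora_param_count(in_features=768, out_features=2304, rank=8, num_layers=12)
gpt2_base_params = 124_439_808  # parameter count of the "gpt2" checkpoint

print(f"LoRA trainable parameters: {gpt2_lora_params:,}")
print(f"Base / LoRA ratio: {gpt2_base_params / gpt2_lora_params:.0f}x")
```

Doubling `r` doubles the adapter size, which is why rank is the main knob for trading capacity against trainable parameters.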

In [None]:
# Create training dataset (custom domain-specific data)
# Example: Fine-tune for technical AI explanations
training_data = [
    "Q: What is LoRA? A: LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that adds small trainable adapters to a frozen pre-trained model.",
    "Q: Why use RAG? A: RAG (Retrieval-Augmented Generation) combines information retrieval with LLMs to provide accurate, grounded answers with source citations.",
    "Q: How do transformers work? A: Transformers use self-attention mechanisms to process sequences in parallel, enabling them to capture long-range dependencies efficiently.",
    "Q: What is prompt engineering? A: Prompt engineering is the practice of designing effective prompts to guide LLMs toward desired outputs, using techniques like few-shot learning and chain-of-thought.",
    "Q: Explain vector databases? A: Vector databases store high-dimensional embeddings and enable fast similarity search, which is essential for semantic search and RAG systems."
]

dataset = Dataset.from_dict({"text": training_data})

# Tokenize
def tokenize(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )
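    # Copy input_ids as labels, masking padding with -100 so it is ignored by the loss
    tokenized["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]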
    return tokenized

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

print(f"✅ Training dataset prepared: {len(tokenized_dataset)} examples")

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-gpt2-ai-tuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=3e-4,  # Higher LR for LoRA
    logging_steps=5,
    save_strategy="no",  # Don't save checkpoints for demo
    report_to="none"  # Disable wandb
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

print("🚀 Starting LoRA fine-tuning...\n")

# Train!
trainer.train()

print("\n✅ LoRA fine-tuning complete!")
print("\n💡 The adapter has been nudged toward AI technical explanations (a 5-example demo; real use needs far more data)")

In [None]:
# Test the fine-tuned model
from transformers import pipeline

# Create generator with LoRA model
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

# Test prompts
prompts = [
    "Q: What is LoRA? A:",
    "Q: Why use RAG? A:",
    "Q: What are vector databases? A:"
]

print("🧪 Testing LoRA Fine-Tuned Model:\n")
print("="*80)

for prompt in prompts:
    print(f"\n{prompt}")
    output = generator(prompt, num_return_sequences=1, do_sample=False)[0]['generated_text']
    answer = output[len(prompt):].strip()
    print(f"{answer[:200]}...")  # Truncate for display
    print("-"*80)

print("\n✅ Generation with the LoRA-adapted model works end to end!")
print("   (With only 5 training examples, don't expect polished answers.)")
print("\n💡 The same workflow is how companies fine-tune Llama/Mistral for their specific use cases")

## 🎓 Key Takeaways

### Advanced RAG:
✅ **Smart Chunking** - Respect document structure, use overlap  
✅ **Hybrid Search** - Combine semantic (meaning) + keyword (exact match)  
✅ **Re-Ranking** - Retrieve many, re-rank with cross-encoder, return few  
✅ **This is what production RAG looks like** in 2025  

### LLM Agents:
✅ **ReAct Pattern** - Reasoning + Acting in loops  
✅ **Tool Use** - LLMs can call functions, APIs, search engines  
✅ **Function Calling** - Structured way to give LLMs tools  
✅ **Agents are the future** - ChatGPT plugins, AutoGPT, Copilot  

### LoRA/QLoRA:
✅ **Far fewer trainable parameters** than full fine-tuning (often 100x+ fewer)  
✅ **Near-full quality** - typically ~95-99% of full fine-tuning performance (task-dependent)  
✅ **PEFT library** - Production-ready implementation  
✅ **Mentioned in 21% of jobs** - Critical skill for 2025  

---

**You now have industry-ready skills for:**
- 🔍 Building production RAG systems with hybrid search and re-ranking
- 🤖 Creating LLM agents that use tools and take actions
- 🎨 Efficiently fine-tuning large models with LoRA/QLoRA
- 📊 Understanding what companies actually use in 2025

**Next:** Day 3 - Big Data, Kubernetes, Graph Databases! 🚀