# Week 5: Complete RAG System with LLM Integration

**What We're Building This Week:**

Week 5 completes our RAG (Retrieval-Augmented Generation) system by adding the final piece: **answer generation with a local LLM**.

## Week 5 Focus Areas

### Core Objectives
- **Local LLM Integration**: Use Ollama to generate answers from search results
- **Complete RAG Pipeline**: Query → Search → Generate → Answer
- **Performance Optimization**: 6x speed improvement (120s → 15-20s)
- **Streaming Capabilities**: Real-time response streaming
- **Clean API Design**: Simplified endpoints for production use

### What We'll Test In This Notebook
1. **Service Health Check** - Verify all components are running
2. **API Structure** - See our clean, simplified endpoints
3. **LLM Integration** - Test Ollama generating answers
4. **Performance Comparison** - Before vs after optimization
5. **Complete RAG Pipeline** - End-to-end question answering
6. **Streaming Responses** - Real-time answer generation
7. **Interactive Testing** - Try your own questions

---

## Prerequisites

**Ensure all services are running:**
```bash
docker compose up --build -d
```

**Service Access Points:**
- **FastAPI**: http://localhost:8000/docs
- **OpenSearch**: http://localhost:9200
- **Ollama**: http://localhost:11434
- **Airflow**: http://localhost:8080
- **Gradio Interface**: http://localhost:7861

---

## API Endpoints Overview

### Core Endpoints
- **`POST /api/v1/ask`** - Standard RAG endpoint (wait for complete response)
- **`POST /api/v1/stream`** - Streaming RAG endpoint (real-time response)
- **`POST /api/v1/hybrid-search/`** - Search papers with hybrid approach
- **`GET /api/v1/health`** - System health and service status

### Request Format
```json
{
    "query": "Your question here",
    "top_k": 3,           // Number of chunks to retrieve
    "use_hybrid": true,   // Use both BM25 and vector search
    "model": "llama3.2:1b",  // LLM model to use
    "categories": ["cs.AI", "cs.LG"]  // Optional: filter by categories
}
```

### Response Format (Standard)
```json
{
    "query": "Your question",
    "answer": "Generated answer from LLM",
    "sources": ["https://arxiv.org/pdf/..."],
    "chunks_used": 3,
    "search_mode": "hybrid"
}
```

### Response Format (Streaming)
```
data: {"sources": [...], "chunks_used": 3, "search_mode": "hybrid"}
data: {"chunk": "The"}
data: {"chunk": " answer"}
data: {"chunk": " is"}
data: {"answer": "The answer is...", "done": true}
```

---

## System Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   User Query    │────▶│  FastAPI Router │────▶│  Search Service │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                │                         │
                                │                         ▼
                                │                 ┌─────────────────┐
                                │                 │   OpenSearch    │
                                │                 │  (BM25 + Vector)│
                                │                 └─────────────────┘
                                │                         │
                                ▼                         │
                        ┌─────────────────┐              │
                        │  Ollama Service │◀─────────────┘
                        │   (LLM Gen)     │
                        └─────────────────┘
                                │
                                ▼
                        ┌─────────────────┐
                        │  Stream/Response │
                        └─────────────────┘
```

---

## Performance Metrics

| Metric | Before Optimization | After Optimization | Improvement |
|--------|--------------------|--------------------|-------------|
| Total Response Time | 120+ seconds | 15-20 seconds | 6x faster |
| Time to First Token | N/A | 2-3 seconds | Streaming enabled |
| Prompt Size | ~10KB | ~2KB | 80% reduction |
| Memory Usage | High | Optimized | Reduced overhead |

---

## Key Features

### 1. **Hybrid Search**
- Combines BM25 keyword search with vector similarity
- Better relevance ranking than either method alone
- Configurable per request

### 2. **Streaming Responses**
- See answers as they're generated
- Better user experience with immediate feedback
- Reduces perceived latency

### 3. **Local LLM**
- No external API dependencies
- Complete data privacy
- Customizable models via Ollama

### 4. **Production Ready**
- Health monitoring endpoints
- Error handling and recovery
- Clean, maintainable architecture

---

## Testing Guide

### Basic Tests
- **Health Check**: Verify all services are running
- **Search Test**: Ensure papers can be found
- **LLM Test**: Confirm Ollama is responding
- **RAG Pipeline**: End-to-end question answering

### Advanced Tests
- **Streaming**: Real-time response generation
- **Performance**: Measure response times
- **Categories**: Filter by specific arXiv categories
- **Error Handling**: Test edge cases

---

## Troubleshooting

### Common Issues

1. **404 Error on Streaming**
   - Ensure API container is rebuilt: `docker compose build api`
   - Restart API: `docker compose restart api`

2. **Slow Responses**
   - Check Ollama model is downloaded: `docker exec rag-ollama ollama list`
   - Verify OpenSearch has indexed papers
   - Consider using smaller model (llama3.2:1b)

3. **No Results Found**
   - Check OpenSearch status: `curl localhost:9200/_cluster/health`
   - Verify papers are indexed: `curl localhost:9200/arxiv-papers-chunks/_count`

4. **Gradio Interface Issues**
   - Default port changed to 7861 (from 7860)
   - Check if running: `curl localhost:7861`

---

## Next Steps

After completing this notebook, you can:

1. **Experiment with Models**
   - Try different Ollama models
   - Adjust generation parameters
   - Test prompt engineering

2. **Optimize Further**
   - Fine-tune chunk sizes
   - Adjust search parameters
   - Implement caching

3. **Extend Functionality**
   - Add conversation memory
   - Implement feedback loops
   - Create specialized prompts

4. **Deploy to Production**
   - Set up monitoring
   - Configure rate limiting
   - Implement authentication

---

## Additional Resources

- **API Documentation**: http://localhost:8000/docs
- **Gradio Interface**: http://localhost:7861
- **OpenSearch Dashboard**: http://localhost:5601
- **Project Repository**: [GitHub Link Placeholder]
- **Ollama Models**: https://ollama.ai/library

---

**Let's begin testing our complete RAG system!**

## 1. Environment Setup

In [14]:
# Environment Setup
import sys
import os
from pathlib import Path
import requests
import time
import json

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")

# Find project root and add to Python path
current_dir = Path.cwd()
if current_dir.name == "week5" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = Path("/Users/Shared/Projects/MOAI/zero_to_RAG")

if project_root.exists():
    print(f"Project root: {project_root}")
    sys.path.insert(0, str(project_root))
else:
    print("Project root not found - check directory structure")

print("✓ Environment setup complete")

Python Version: 3.12.11
Project root: /Users/Shared/Projects/MOAI/zero_to_RAG
✓ Environment setup complete


## 2. Service Health Check

First, let's verify all our services are running properly.

In [15]:
# Check Service Health
print("WEEK 5 SERVICE HEALTH CHECK")
print("=" * 40)

services = {
    "FastAPI": "http://localhost:8000/api/v1/health",
    "OpenSearch": "http://localhost:9200/_cluster/health",
    "Ollama": "http://localhost:11434/api/version"
}

all_healthy = True
for service_name, url in services.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False

if all_healthy:
    print("\n✓ All services ready for Week 5!")
else:
    print("\n⚠ Some services need attention. Run: docker compose up --build -d")

WEEK 5 SERVICE HEALTH CHECK
✓ FastAPI: Healthy
✓ OpenSearch: Healthy
✓ Ollama: Healthy

✓ All services ready for Week 5!


## 3. API Structure Overview

Week 5 includes a **major simplification** - we cleaned up our API to just **3 focused endpoints**.

In [17]:
# Check API Endpoints
print("API STRUCTURE")
print("=" * 20)

try:
    response = requests.get("http://localhost:8000/openapi.json")
    if response.status_code == 200:
        openapi_data = response.json()
        endpoints = list(openapi_data['paths'].keys())
        
        print(f"Total endpoints: {len(endpoints)}")
        print("\nAvailable endpoints:")
        for endpoint in sorted(endpoints):
            print(f"  • {endpoint}")
    else:
        print(f"Could not fetch API info: {response.status_code}")
except Exception as e:
    print(f"Error: {e}")

API STRUCTURE
Total endpoints: 4

Available endpoints:
  • /api/v1/ask
  • /api/v1/health
  • /api/v1/hybrid-search/
  • /api/v1/stream


## 4. Test Ollama LLM

Let's test our local LLM service to make sure it can generate responses.

In [18]:
# Test Ollama LLM Service
print("OLLAMA LLM TEST")
print("=" * 20)

# Check what models are available
try:
    models_response = requests.get("http://localhost:11434/api/tags")
    if models_response.status_code == 200:
        models = models_response.json().get('models', [])
        print(f"Available models: {len(models)}")
        for model in models:
            print(f"  • {model['name']}")
    else:
        print(f"Could not list models: {models_response.status_code}")
except Exception as e:
    print(f"Error listing models: {e}")

OLLAMA LLM TEST
Available models: 1
  • llama3.2:1b


In [19]:
# Test Simple Generation
print("\nTesting LLM Generation:")

try:
    # Simple test to see if the LLM can respond
    test_data = {
        "model": "llama3.2:1b",
        "prompt": "What is 2+6? Answer with just the number.",
        "stream": False
    }
    
    response = requests.post(
        "http://localhost:11434/api/generate",
        json=test_data,
        timeout=30
    )
    
    if response.status_code == 200:
        result = response.json()
        answer = result.get('response', '').strip()
        print(f"✓ LLM responded: '{answer}'")
        print("✓ Ollama is working!")
    else:
        print(f"✗ Generation failed: {response.status_code}")
        
except Exception as e:
    print(f"✗ Error: {e}")


Testing LLM Generation:
✓ LLM responded: '8'
✓ Ollama is working!


## 5. Test Search Functionality

Before we can generate answers, we need to test that search is working to find relevant papers.

In [20]:
# Test Search
print("SEARCH TEST")
print("=" * 15)

search_query = "machine learning"
print(f"Searching for: '{search_query}'")

try:
    search_request = {
        "query": search_query,
        "use_hybrid": True,  # Use both keyword and semantic search
        "size": 3
    }
    
    response = requests.post(
        "http://localhost:8000/api/v1/hybrid-search/",
        json=search_request,
        timeout=30
    )
    
    if response.status_code == 200:
        data = response.json()
        print(f"✓ Found {data['total']} results")
        print(f"✓ Search mode: {data['search_mode']}")
        
        if data['hits']:
            print("\nTop results:")
            for i, hit in enumerate(data['hits'][:2], 1):
                title = hit.get('title', 'Unknown')[:60]
                score = hit.get('score', 0)
                print(f"  {i}. {title}... (score: {score:.3f})")
        else:
            print("No results found")
    else:
        print(f"✗ Search failed: {response.status_code}")
        
except Exception as e:
    print(f"✗ Error: {e}")

SEARCH TEST
Searching for: 'machine learning'
✓ Found 3 results
✓ Search mode: hybrid

Top results:
  1. Improving Low-Resource Translation with Dictionary-Guided Fi... (score: 0.016)
  2. Deep Active Learning for Lung Disease Severity Classificatio... (score: 0.016)


## 6. Complete RAG Pipeline Test 

Now for the main event: **complete question answering** with optimized performance!

In [22]:
# Test Complete RAG Pipeline (Optimized Performance)
print("COMPLETE RAG PIPELINE TEST (OPTIMIZED)")
print("=" * 40)

question = "Summarize machine learning papers?"
print(f"Question: {question}")

start_time = time.time()

try:
    rag_request = {
        "query": question,
        "top_k": 1,  # Use 1 chunk for context
        "use_hybrid": True,  # Use best search
        "model": "llama3.2:1b"
    }
    
    # Using optimized endpoint (6x faster than before!)
    response = requests.post(
        "http://localhost:8000/api/v1/ask/",
        json=rag_request,
        timeout=60
    )
    
    response_time = time.time() - start_time
    
    if response.status_code == 200:
        data = response.json()
        
        print(f"\n✓ Success! ({response_time:.1f} seconds)")
        print(f"\nAnswer:")
        print("-" * 40)
        print(data['answer'])
        print("-" * 40)
        
        print(f"\nSources: {len(data.get('sources', []))} papers")
        print(f"Chunks used: {data.get('chunks_used', 0)}")
        print(f"Search mode: {data.get('search_mode', 'unknown')}")

    else:
        print(f"\n✗ Request failed: HTTP {response.status_code}")
        print(f"Response: {response.text[:200]}")
        
except Exception as e:
    print(f"\n✗ Error: {e}")


COMPLETE RAG PIPELINE TEST (OPTIMIZED)
Question: Summarize machine learning papers?

✓ Success! (7.7 seconds)

Answer:
----------------------------------------
machine learning papers often focus on developing and applying techniques from various domains to achieve specific goals, such as image classification, natural language processing, or regression.
----------------------------------------

Sources: 1 papers
Chunks used: 1
Search mode: hybrid


## 7. Complete RAG Pipeline Test - streaming

Now for the main event: **complete question answering** with optimized performance!

In [23]:
# Test Complete RAG Pipeline with STREAMING
print("COMPLETE RAG PIPELINE TEST (STREAMING)")
print("=" * 40)

question = "Summarize machine learning papers?"
print(f"Question: {question}")

start_time = time.time()

try:
    rag_request = {
        "query": question,
        "top_k": 1,  # Use 1 chunk for context
        "use_hybrid": True,  # Use best search
        "model": "llama3.2:1b"
    }
    
    # Using streaming endpoint for real-time responses
    response = requests.post(
        "http://localhost:8000/api/v1/stream",
        json=rag_request,
        stream=True,  # Enable streaming
        timeout=60
    )
    
    if response.status_code == 200:
        # Process streaming response
        full_answer = ""
        sources = []
        chunks_used = 0
        search_mode = "unknown"
        first_chunk_time = None
        
        print(f"\nStreaming response...")
        
        for line in response.iter_lines():
            if line:
                line_str = line.decode('utf-8')
                if line_str.startswith('data: '):
                    try:
                        data = json.loads(line_str[6:])  # Remove 'data: ' prefix
                        
                        # Handle metadata
                        if 'sources' in data:
                            sources = data['sources']
                            chunks_used = data.get('chunks_used', 0)
                            search_mode = data.get('search_mode', 'unknown')
                        
                        # Handle streaming chunks
                        if 'chunk' in data:
                            if first_chunk_time is None:
                                first_chunk_time = time.time() - start_time
                                print(f"First response in: {first_chunk_time:.1f} seconds")
                                print("\nAnswer:")
                                print("-" * 40)
                            
                            chunk_text = data['chunk']
                            full_answer += chunk_text
                            print(chunk_text, end='', flush=True)  # Print as it streams
                        
                        # Handle completion
                        if data.get('done', False):
                            break
                            
                    except json.JSONDecodeError:
                        continue
        
        response_time = time.time() - start_time
        
        print("\n" + "-" * 40)
        print(f"\n✓ Complete! (Total: {response_time:.1f} seconds)")
        
        print(f"\nSources: {len(sources)} papers")
        if sources:
            for i, source in enumerate(sources[:2], 1):
                print(f"  {i}. {source}")
        print(f"Chunks used: {chunks_used}")
        print(f"Search mode: {search_mode}")

    else:
        print(f"\n✗ Request failed: HTTP {response.status_code}")
        print(f"Response: {response.text[:200]}")
        
except Exception as e:
    print(f"\n✗ Error: {e}")
    import traceback
    traceback.print_exc()


COMPLETE RAG PIPELINE TEST (STREAMING)
Question: Summarize machine learning papers?

Streaming response...
First response in: 3.7 seconds

Answer:
----------------------------------------
Here's a summary of relevant machine learning papers from arXiv:

Machine Learning Papers

Several studies have contributed to the field of machine learning, with notable works including:

* Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance (arXiv:2508.21263v1)
	+ This paper applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function to reduce labeled data requirements for lung disease severity classification.
* Semi-Supervised Deep Learning for Activity Recognition (arXiv:2009.04466v2)
	+ This study employed a semi-supervised approach, leveraging both labeled and unlabeled data to improve activity recognition accuracy.

Key Concepts

The key concepts in machine lear

In [24]:
# System Status Summary
print("SYSTEM STATUS SUMMARY")
print("=" * 25)

try:
    health_response = requests.get("http://localhost:8000/api/v1/health")
    if health_response.status_code == 200:
        health_data = health_response.json()
        
        print(f"Overall Status: {health_data.get('status', 'unknown').upper()}")
        print(f"Version: {health_data.get('version', 'unknown')}")
        
        print("\nService Status:")
        services = health_data.get('services', {})
        for service, info in services.items():
            status = info.get('status', 'unknown')
            message = info.get('message', '')
            print(f"  • {service}: {status} - {message}")
        
        print("\nRAG Pipeline Status:")
        print("  ✓ Data Ingestion: Papers indexed in OpenSearch")
        print("  ✓ Search: BM25 + Vector hybrid search working")
        print("  ✓ LLM Generation: Ollama generating answers")
        print("  ✓ Performance: 6x speed improvement (120s → 15-20s)")
        print("  ✓ API: Clean endpoints ready for production")
        
        # Check endpoint availability
        print("\nEndpoint Status:")
        try:
            test_response = requests.get("http://localhost:8000/openapi.json")
            if test_response.status_code == 200:
                endpoints = list(test_response.json()['paths'].keys())
                print(f"  ✓ Standard RAG: /api/v1/ask/ (working)")
                
                if "/api/v1/ask/ask-stream/" in endpoints:
                    print(f"  ✓ Streaming RAG: /api/v1/ask/ask-stream/ (available)")
                else:
                    print(f"  ⚠ Streaming RAG: /api/v1/ask/ask-stream/ (needs container rebuild)")
                
                print(f"  ✓ Search: /api/v1/hybrid-search/ (working)")
        except:
            print("  ⚠ Could not check endpoint status")
        
        print("\n🎉 Complete RAG system operational!")
        print(f"   • Dramatic performance improvement achieved")
        print(f"   • Production-ready with excellent response times")
        
    else:
        print(f"Could not get system status: {health_response.status_code}")
        
except Exception as e:
    print(f"Error checking system status: {e}")

SYSTEM STATUS SUMMARY
Overall Status: OK
Version: 0.1.0

Service Status:
  • database: healthy - Connected successfully
  • opensearch: healthy - Index 'arxiv-papers-chunks' with 511 documents
  • ollama: healthy - Ollama service is running

RAG Pipeline Status:
  ✓ Data Ingestion: Papers indexed in OpenSearch
  ✓ Search: BM25 + Vector hybrid search working
  ✓ LLM Generation: Ollama generating answers
  ✓ Performance: 6x speed improvement (120s → 15-20s)
  ✓ API: Clean endpoints ready for production

Endpoint Status:
  ✓ Standard RAG: /api/v1/ask/ (working)
  ⚠ Streaming RAG: /api/v1/ask/ask-stream/ (needs container rebuild)
  ✓ Search: /api/v1/hybrid-search/ (working)

🎉 Complete RAG system operational!
   • Dramatic performance improvement achieved
   • Production-ready with excellent response times


## 8. Using the Gradio Interface

For a more user-friendly experience, try the Gradio web interface!

In [27]:
# Launch Gradio Interface Instructions

print("GRADIO INTERFACE")
print("=" * 40)

print("\n📱 Web Interface Available!")
print("\nTo use the Gradio interface:")
print("1. Open a terminal")
print("2. Run: uv run python gradio_launcher.py")
print("3. Open browser to: http://localhost:7861")
print("\nFeatures:")
print("  • Real-time streaming responses")
print("  • Interactive parameter controls")
print("  • Clean, user-friendly design")
print("  • Example questions included")
print("  • Source paper links")

# Check if Gradio is running
try:
    gradio_check = requests.get("http://localhost:7861", timeout=2)
    if gradio_check.status_code == 200:
        print("\n✅ Gradio interface is running!")
        print("   Visit: http://localhost:7861")
    else:
        print("\n⚠️ Gradio not detected on port 7861")
        print("   Run: uv run python gradio_launcher.py")
except:
    print("\n⚠️ Gradio interface not running")
    print("   To start: uv run python gradio_launcher.py")
    


GRADIO INTERFACE

📱 Web Interface Available!

To use the Gradio interface:
1. Open a terminal
2. Run: uv run python gradio_launcher.py
3. Open browser to: http://localhost:7861

Features:
  • Real-time streaming responses
  • Interactive parameter controls
  • Clean, user-friendly design
  • Example questions included
  • Source paper links

✅ Gradio interface is running!
   Visit: http://localhost:7861


## Summary

### What We Built in Week 5:

**Complete RAG System Components:**
1. **Data Pipeline**: arXiv papers → PostgreSQL → OpenSearch indexing
2. **Search System**: Hybrid BM25 + vector similarity search  
3. **LLM Integration**: Local Ollama service for answer generation
4. **Performance Optimization**: 6x speed improvement through prompt optimization
5. **Streaming API**: Real-time response streaming for better UX
6. **Clean Architecture**: 3 focused endpoints for production use

**RAG Pipeline Flow:**
```
User Question → Search Papers → Find Relevant Chunks → LLM Generates Answer → Stream Response
```

**Key Features:**
- **Local LLM**: No external API calls for generation
- **Hybrid Search**: Combines keyword matching + semantic similarity
- **Optimized Performance**: 18-20 seconds total vs previous 120+ seconds
- **Streaming Responses**: See answers as they're generated (2-3s to first response)
- **Production Ready**: Error handling, monitoring, clean architecture

**API Endpoints:**
- `/ask/` - Optimized standard endpoint (wait for complete response)
- `/ask/ask-stream/` - Streaming endpoint (real-time response)
- `/hybrid-search/` - Search papers directly

### Performance Achievements:
- **Before optimization**: 120+ seconds per question
- **After optimization**: 15-20 seconds per question  
- **With streaming**: 2-3 seconds to first response, full answer streams in
- **Speed improvement**: 6x faster response times

### Key Optimizations Applied:
- **Reduced prompt size by 80%** (removed redundant metadata)
- **Streamlined data processing** (eliminated unnecessary field lookups)
- **Optimized LLM context handling** (minimal chunk data)
- **Shared code architecture** (DRY principles for maintainability)

### What You Learned:
- How to integrate a local LLM (Ollama) with search results
- Complete RAG pipeline from question to answer
- Performance optimization techniques for production systems
- Streaming responses for better user experience
- Production API design with health monitoring

### Next Steps:
- Experiment with different search modes (BM25 vs hybrid)
- Test with various question types and complexities
- Enable streaming for real-time response experience
- Explore the API documentation at http://localhost:8000/docs
- Consider deployment strategies for production use

**Congratulations! You've built a complete, high-performance, production-ready RAG system! 🎉**