# 📚 RAG: Technical Documentation with LLMs

Objective: build a RAG (Retrieval-Augmented Generation) system for querying data documentation, schema dictionaries, and knowledge bases.

- Duration: 120 min
- Difficulty: High
- Stack: OpenAI, ChromaDB, LangChain

### 🏗️ **RAG Architecture: Knowledge Retrieval for Data Engineering**

**Evolution of Knowledge Management:**

```
Traditional Approach (Pre-RAG):
  ├─ Static documentation (Confluence, Notion)
  ├─ Manual search (Ctrl+F)
  ├─ Outdated quickly
  ├─ Context scattered across tools
  └─ No natural language queries

RAG Approach (2023+):
  ├─ Unified vector knowledge base
  ├─ Semantic search (similarity-based)
  ├─ Real-time context retrieval
  ├─ Natural language interface
  └─ LLM-powered answers with citations
```

**RAG System Architecture:**

```
┌──────────────────────────────────────────────────────────────┐
│  LAYER 1: DATA INGESTION (Indexing)                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Sources                                             │   │
│  │  ├─ Confluence pages (API)                           │   │
│  │  ├─ GitHub README/wikis (git clone)                  │   │
│  │  ├─ DBT documentation (manifest.json)                │   │
│  │  ├─ Airflow DAG docstrings (AST parsing)            │   │
│  │  ├─ SQL queries (comments, EXPLAIN)                 │   │
│  │  ├─ Data quality reports (Great Expectations)       │   │
│  │  └─ Runbooks/SOPs (Markdown/PDF)                    │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Document Processing                                 │   │
│  │  ├─ Text extraction (PyPDF2, Unstructured)          │   │
│  │  ├─ Chunking (RecursiveCharacterTextSplitter)       │   │
│  │  │   • Chunk size: 500-1000 tokens                  │   │
│  │  │   • Overlap: 10-20% (50-200 tokens)              │   │
│  │  │   • Separators: \n\n > \n > . > space           │   │
│  │  ├─ Metadata extraction (title, source, date, owner)│   │
│  │  └─ Deduplication (hash-based)                      │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Embedding Generation                                │   │
│  │  ├─ Model: text-embedding-3-small (OpenAI)          │   │
│  │  │   • Dimensions: 1536                             │   │
│  │  │   • Cost: $0.02 per 1M tokens                    │   │
│  │  │   • Latency: ~50ms per document                  │   │
│  │  ├─ Alternatives:                                    │   │
│  │  │   • sentence-transformers/all-MiniLM-L6-v2 (OSS)│   │
│  │  │   • Cohere embed-multilingual-v3.0               │   │
│  │  │   • voyage-large-2-instruct (best accuracy)      │   │
│  │  └─ Batch processing (100 docs at a time)           │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Vector Store (ChromaDB/Pinecone/Weaviate)          │   │
│  │  ├─ Index type: HNSW (Hierarchical NSW)             │   │
│  │  ├─ Distance metric: Cosine similarity               │   │
│  │  ├─ Metadata storage: JSON                          │   │
│  │  └─ Collections: data_catalog, pipelines, runbooks  │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  LAYER 2: QUERY PROCESSING (Retrieval)                      │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  User Query (Natural Language)                       │   │
│  │  "What tables contain customer PII?"                 │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Query Understanding                                 │   │
│  │  ├─ Intent classification (search, definition, how-to)│  │
│  │  ├─ Entity extraction (table names, metrics)        │   │
│  │  ├─ Query expansion (synonyms: "PII" → "personal data")│ │
│  │  └─ Filter inference (metadata: type=schema)        │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Hybrid Search                                       │   │
│  │  ├─ Vector search (semantic similarity, top-k=20)   │   │
│  │  ├─ Keyword search (BM25, Elasticsearch)            │   │
│  │  ├─ Metadata filtering (owner, type, freshness)     │   │
│  │  └─ Fusion: Reciprocal Rank Fusion (RRF)            │   │
│  │     score = Σ(1 / (k + rank_i)) for each method    │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Re-ranking (Optional)                               │   │
│  │  ├─ Cross-encoder (ms-marco-MiniLM)                 │   │
│  │  ├─ Cohere Rerank API (best, $2 per 1K searches)    │   │
│  │  ├─ LLM-based scoring (GPT-4 rates relevance 1-10)  │   │
│  │  └─ Top-k final: 3-5 documents                      │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  LAYER 3: GENERATION (Answering)                            │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Context Construction                                │   │
│  │  ├─ Retrieved chunks (3-5 most relevant)            │   │
│  │  ├─ Source attribution (with URLs)                  │   │
│  │  ├─ Token budget management (max 4K for context)    │   │
│  │  └─ Chunk reordering (most relevant first)          │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Prompt Engineering                                  │   │
│  │  System: "You are a data engineer assistant..."     │   │
│  │  Context: [Retrieved documentation]                 │   │
│  │  Query: [User question]                             │   │
│  │  Instructions:                                       │   │
│  │    - Answer ONLY from provided context             │   │
│  │    - Cite sources with [1], [2] notation           │   │
│  │    - If no info found, say "I don't know"          │   │
│  │    - Include table/column names when relevant      │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  LLM Generation                                      │   │
│  │  ├─ Model: GPT-4o (best accuracy)                   │   │
│  │  ├─ Temperature: 0.1 (factual, deterministic)       │   │
│  │  ├─ Max tokens: 500 (concise answers)               │   │
│  │  └─ Streaming: yes (better UX)                      │   │
│  └──────────────────┬───────────────────────────────────┘   │
│                     │                                        │
│  ┌──────────────────┴───────────────────────────────────┐   │
│  │  Post-Processing                                     │   │
│  │  ├─ Citation formatting ([1] → link to source)      │   │
│  │  ├─ Code syntax highlighting                        │   │
│  │  ├─ Table formatting (Markdown tables)              │   │
│  │  └─ Confidence scoring (based on retrieval scores)  │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  LAYER 4: EVALUATION & MONITORING                           │
│  ├─ Retrieval metrics: Precision@k, Recall@k, MRR           │
│  ├─ Generation metrics: Faithfulness, Relevance             │
│  ├─ User feedback: 👍/👎, explicit ratings                  │
│  ├─ Latency tracking: p50/p95/p99 response times            │
│  └─ Cost tracking: embedding + LLM API calls                │
└──────────────────────────────────────────────────────────────┘
```
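
The retrieval metrics named in Layer 4 can be computed offline from a small labelled set of queries. The following is a minimal sketch, assuming you already have, for each query, the ranked list of retrieved chunk IDs and a hand-labelled set of relevant IDs. The function names and the sample IDs are illustrative and do not come from any library.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: List[List[str]], relevants: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 when there is none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(runs, relevants):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical labelled example: the chunk IDs are illustrative only.
retrieved = ["table_customers_chunk_0", "dag_sales_chunk_1", "table_orders_chunk_0"]
relevant = {"table_customers_chunk_0", "table_orders_chunk_0"}

print(f"Precision@3: {precision_at_k(retrieved, relevant, k=3):.2f}")  # 0.67
print(f"Recall@3:    {recall_at_k(retrieved, relevant, k=3):.2f}")     # 1.00
print(f"MRR:         {mean_reciprocal_rank([retrieved], [relevant]):.2f}")  # 1.00
```

Precision@k divides by `k` even when fewer than `k` results come back, which penalises short result lists. That is one common convention, not the only one.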

**Vector Database Comparison:**

| Feature | ChromaDB | Pinecone | Weaviate | Qdrant |
|---------|----------|----------|----------|--------|
| **Deployment** | Local/embedded | Cloud SaaS | Self-hosted/Cloud | Self-hosted/Cloud |
| **Max vectors** | Millions | Billions | Billions | Billions |
| **Metadata filtering** | ✅ Basic | ✅ Advanced | ✅ Advanced | ✅ Advanced |
| **Hybrid search** | ❌ | ✅ (sparse-dense) | ✅ (BM25+vector) | ✅ |
| **Cost (1M vectors)** | Free (local) | $70/month | $25/month (DO) | $0 (self-hosted) |
| **Latency (p95)** | 50-100ms | 20-50ms | 30-80ms | 20-60ms |
| **Best for** | Prototyping, small datasets | Production, scale | Flexibility, GraphQL | High performance, Rust |

**Embedding Model Comparison:**

| Model | Dimensions | Cost | MTEB Score | Use Case |
|-------|-----------|------|------------|----------|
| **text-embedding-3-small** (OpenAI) | 1536 | $0.02/1M tokens | 62.3 | Production, balanced |
| **text-embedding-3-large** (OpenAI) | 3072 | $0.13/1M tokens | 64.6 | Best accuracy |
| **all-MiniLM-L6-v2** (OSS) | 384 | Free | 56.3 | Budget, fast |
| **bge-large-en-v1.5** (OSS) | 1024 | Free | 64.2 | Best open source |
| **voyage-large-2-instruct** | 1024 | $0.12/1M tokens | 68.3 | Highest accuracy |
| **Cohere embed-multilingual-v3** | 1024 | $0.10/1M tokens | 64.5 | Multilingual |
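
To put the cost column in perspective, a one-off indexing cost can be estimated as `tokens / 1,000,000 × price`. The sketch below assumes the per-million-token prices listed in the table (which may be out of date) and a hypothetical corpus size. `estimate_embedding_cost` is an illustrative helper, not part of any SDK.

```python
# Prices in USD per 1M tokens, copied from the table above; they may be out of date.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def estimate_embedding_cost(total_tokens: int, model: str) -> float:
    """Return the estimated cost in USD of embedding `total_tokens` tokens once."""
    price = PRICE_PER_MILLION_TOKENS[model]
    return total_tokens / 1_000_000 * price


# Hypothetical corpus: 10,000 chunks of roughly 800 tokens each (8M tokens).
corpus_tokens = 10_000 * 800

for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${estimate_embedding_cost(corpus_tokens, model):.2f}")
# text-embedding-3-small: $0.16
# text-embedding-3-large: $1.04
```

At this scale the embedding cost is small compared with LLM generation, so for a documentation corpus the choice between models is usually driven by retrieval quality and vector size more than by price.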

**Implementation: Complete RAG System**

```python
from dataclasses import dataclass
from typing import List, Dict, Optional
import chromadb
from openai import OpenAI
import hashlib

@dataclass
class Document:
    """Document with metadata for indexing."""
    id: str
    content: str
    source: str
    metadata: Dict
    
    def to_hash(self) -> str:
        """Generate unique hash for deduplication."""
        return hashlib.md5(self.content.encode()).hexdigest()

class RAGSystem:
    """
    Reference RAG system for data documentation.
    
    Features:
    - Document ingestion with chunking and hash-based deduplication
    - Vector search with optional metadata filters
    - LLM generation with citations
    - Heuristic confidence score derived from retrieval distances
    """
    
    def __init__(
        self,
        collection_name: str = "data_docs",
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-4o"
    ):
        # Initialize ChromaDB
        self.chroma_client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Cosine similarity
        )
        
        # Initialize OpenAI
        self.openai_client = OpenAI()
        self.embedding_model = embedding_model
        self.llm_model = llm_model
        
        # Cache for embeddings (reduce API calls)
        self._embedding_cache = {}
    
    def get_embedding(self, text: str) -> List[float]:
        """Generate embedding with caching."""
        cache_key = hashlib.md5(text.encode()).hexdigest()
        
        if cache_key in self._embedding_cache:
            return self._embedding_cache[cache_key]
        
        response = self.openai_client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        
        embedding = response.data[0].embedding
        self._embedding_cache[cache_key] = embedding
        
        return embedding
    
    def chunk_document(
        self,
        text: str,
        chunk_size: int = 800,
        chunk_overlap: int = 200
    ) -> List[str]:
        """
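        Split document into overlapping chunks.
        
        Sizes are measured in characters, not tokens (for English text,
        one token is roughly four characters).
        
        Strategy:
        - Prefer a paragraph boundary (blank line) in the second half of the window
        - Fall back to a sentence boundary (". ")
        - Last resort: cut at chunk_size characters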
        """
        chunks = []
        start = 0
        
        while start < len(text):
            end = start + chunk_size
            
            # If not at end of document
            if end < len(text):
                # Try to find paragraph boundary
                newline_idx = text.rfind('\n\n', start, end)
                if newline_idx != -1 and newline_idx > start + chunk_size // 2:
                    end = newline_idx
                else:
                    # Try sentence boundary
                    period_idx = text.rfind('. ', start, end)
                    if period_idx != -1 and period_idx > start + chunk_size // 2:
                        end = period_idx + 1
            
            chunk = text[start:end].strip()
            if chunk:
                chunks.append(chunk)
            
            # Move start with overlap; always advance so a large overlap cannot loop forever
            start = max(end - chunk_overlap, start + 1) if end < len(text) else len(text)
        
        return chunks
    
    def index_document(
        self,
        doc: Document,
        chunk: bool = True
    ) -> int:
        """
        Index document into vector store.
        
        Returns: Number of chunks indexed
        """
        # Check for duplicates
        doc_hash = doc.to_hash()
        existing = self.collection.get(where={"hash": doc_hash})
        if existing['ids']:
            print(f"⚠️ Document {doc.id} already indexed (skipping)")
            return 0
        
        # Chunk if needed
        if chunk:
            chunks = self.chunk_document(doc.content)
        else:
            chunks = [doc.content]
        
        # Generate embeddings (one API call per chunk; cached by content hash)
        embeddings = [self.get_embedding(chunk_text) for chunk_text in chunks]
        
        # Prepare metadata
        chunk_ids = [f"{doc.id}_chunk_{i}" for i in range(len(chunks))]
        metadatas = [
            {
                **doc.metadata,
                "source": doc.source,
                "hash": doc_hash,
                "chunk_index": i,
                "total_chunks": len(chunks)
            }
            for i in range(len(chunks))
        ]
        
        # Add to collection
        self.collection.add(
            ids=chunk_ids,
            documents=chunks,
            embeddings=embeddings,
            metadatas=metadatas
        )
        
        print(f"✅ Indexed {doc.id}: {len(chunks)} chunks")
        return len(chunks)
    
    def search(
        self,
        query: str,
        top_k: int = 5,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Search for relevant documents.
        
        Args:
            query: Search query
            top_k: Number of results
            filters: Metadata filters (e.g., {"type": "schema"})
        
        Returns:
            List of {content, metadata, score} dicts
        """
        # Generate query embedding
        query_embedding = self.get_embedding(query)
        
        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=filters
        )
        
        # Format results
        documents = []
        for i in range(len(results['ids'][0])):
            documents.append({
                'id': results['ids'][0][i],
                'content': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'distance': results['distances'][0][i] if 'distances' in results else None
            })
        
        return documents
    
    def generate_answer(
        self,
        query: str,
        context_docs: List[Dict],
        include_citations: bool = True
    ) -> Dict:
        """
        Generate answer using LLM with retrieved context.
        
        Returns:
            {
                'answer': str,
                'sources': List[str],
                'confidence': float
            }
        """
        # Build context
        context_parts = []
        sources = []
        
        for i, doc in enumerate(context_docs, 1):
            context_parts.append(f"[{i}] {doc['content']}")
            sources.append({
                'id': doc['id'],
                'source': doc['metadata'].get('source', 'Unknown'),
                'citation': f"[{i}]"
            })
        
        context = "\n\n".join(context_parts)
        
        # Build prompt
        prompt = f"""You are an expert data engineer assistant. Answer the question using ONLY the provided context.

**Context:**
{context}

**Question:** {query}

**Instructions:**
- Answer based ONLY on the provided context
- Cite sources using [1], [2], etc. notation
- If the context doesn't contain enough information, say "I don't have enough information to answer this question"
- Be specific and technical when referencing tables, columns, pipelines
- Keep answer concise (max 3 paragraphs)

**Answer:**"""
        
        # Generate
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=500
        )
        
        answer = response.choices[0].message.content.strip()
        
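        # Heuristic confidence from retrieval distances (cosine distance lies in [0, 2]).
        # Documents without a distance (e.g. from hybrid search) are ignored.
        distances = [doc['distance'] for doc in context_docs if doc.get('distance') is not None]
        avg_distance = sum(distances) / len(distances) if distances else 1.0
        confidence = max(0.0, 1.0 - avg_distance)  # Lower distance = higher confidence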
        
        return {
            'answer': answer,
            'sources': sources,
            'confidence': confidence,
            'context_used': len(context_docs)
        }
    
    def query(
        self,
        question: str,
        top_k: int = 3,
        filters: Optional[Dict] = None
    ) -> Dict:
        """
        Complete RAG pipeline: search + generate.
        """
        # Search
        docs = self.search(question, top_k=top_k, filters=filters)
        
        # Generate
        result = self.generate_answer(question, docs)
        
        return result

# Example usage
rag = RAGSystem()

# Index documents
docs = [
    Document(
        id="table_customers",
        content="""
Table: customers
Schema: dwh.customers
Description: Customer master data with demographics and contact info
Columns:
- customer_id (BIGINT, PK): Unique customer identifier
- email (VARCHAR, UNIQUE): Email address (PII)
- phone (VARCHAR): Phone number (PII)
- created_at (TIMESTAMP): Account creation date
- country (VARCHAR): Country code (ISO 3166-1 alpha-2)
Owner: data-engineering@company.com
PII: Yes (email, phone must be encrypted at rest)
Update frequency: Real-time via CDC from PostgreSQL
        """,
        source="data_catalog.md",
        metadata={"type": "schema", "owner": "data-engineering", "pii": True}
    )
]

for doc in docs:
    rag.index_document(doc)

# Query
result = rag.query("What tables contain customer PII?")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Sources: {result['sources']}")
```
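
The architecture above lists token budget management (max 4K tokens of context), but `generate_answer` concatenates every retrieved chunk regardless of size. The following sketch shows one way to enforce a budget before building the prompt. It assumes a rough estimate of four characters per token instead of a real tokenizer, and `approx_token_count` and `pack_context` are hypothetical helper names.

```python
from typing import Dict, List


def approx_token_count(text: str) -> int:
    """Rough token estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def pack_context(docs: List[Dict], max_tokens: int = 4000) -> List[Dict]:
    """Keep documents in ranked order while the estimated token total fits the budget."""
    selected = []
    used = 0
    for doc in docs:
        cost = approx_token_count(doc["content"])
        if used + cost > max_tokens:
            continue  # Skip this chunk; a smaller one further down may still fit
        selected.append(doc)
        used += cost
    return selected


# Hypothetical search results, already sorted most relevant first.
sample_docs = [
    {"id": "a", "content": "x" * 8000},   # ~2000 tokens
    {"id": "b", "content": "x" * 12000},  # ~3000 tokens, does not fit after "a"
    {"id": "c", "content": "x" * 4000},   # ~1000 tokens
]

packed = pack_context(sample_docs, max_tokens=4000)
print([doc["id"] for doc in packed])  # ['a', 'c']
```

With the `RAGSystem` class above, this could be applied as `rag.generate_answer(question, pack_context(docs))`. For exact counts, a real tokenizer would replace the character heuristic.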

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🔍 **Advanced Retrieval: Hybrid Search & Re-ranking**

**Retrieval Methods Comparison:**

```
┌─────────────────────────────────────────────────────────────┐
│  1. Vector Search (Semantic Similarity)                     │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Query: "What tables have PII?"                        │  │
│  │ Embedding: [0.23, -0.45, 0.67, ...]                  │  │
│  │ Search: Cosine similarity in vector space            │  │
│  │ Strengths: Understands meaning, synonyms             │  │
│  │ Weaknesses: Misses exact keywords                    │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  2. Keyword Search (BM25/TF-IDF)                            │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Query: "PII tables"                                   │  │
│  │ Tokenization: ["pii", "tables"]                      │  │
│  │ Scoring: BM25(doc, query)                            │  │
│  │ Strengths: Exact matches, fast, explainable          │  │
│  │ Weaknesses: No semantic understanding                │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  3. Hybrid Search (Best of Both)                            │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Vector results: [doc1, doc3, doc5, doc7]             │  │
│  │ Keyword results: [doc2, doc1, doc4, doc5]            │  │
│  │ Fusion: Reciprocal Rank Fusion (RRF)                 │  │
│  │   score = Σ 1/(k + rank_i) for each method          │  │
│  │ Final ranking: [doc1, doc5, doc2, doc3, doc4, doc7]  │  │
│  │ Improvement: +15-25% accuracy vs single method       │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

**Reciprocal Rank Fusion (RRF) Algorithm:**

```python
def reciprocal_rank_fusion(
    rankings: List[List[str]],  # Multiple ranking lists
    k: int = 60  # Constant (typical: 60)
) -> List[str]:
    """
    Combine multiple ranking lists using RRF.
    
    Formula: score(doc) = Σ 1/(k + rank_i)
    
    Paper: "Reciprocal Rank Fusion outperforms Condorcet and
            individual Rank Learning Methods" (Cormack et al. 2009)
    """
    scores = {}
    
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (k + rank)
    
    # Sort by score descending
    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    
    return [doc_id for doc_id, score in sorted_docs]

# Example
vector_results = ['doc1', 'doc3', 'doc5', 'doc7', 'doc9']
keyword_results = ['doc2', 'doc1', 'doc4', 'doc5', 'doc8']

fused = reciprocal_rank_fusion([vector_results, keyword_results])
print(f"Fused ranking: {fused}")
# Output: ['doc1', 'doc5', 'doc2', 'doc3', 'doc4', 'doc7', 'doc9', 'doc8']
```

**Hybrid Search Implementation:**

```python
from rank_bm25 import BM25Okapi
from typing import List, Dict
import numpy as np

class HybridRAG:
    """
    RAG with hybrid search (vector + BM25).
    
    Combines:
    - Semantic search (ChromaDB)
    - Keyword search (BM25)
    - Reciprocal Rank Fusion
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
        
        # Build BM25 index
        self._build_bm25_index()
    
    def _build_bm25_index(self):
        """Create BM25 index from all documents in collection."""
        # Get all documents
        all_docs = self.rag.collection.get()
        
        self.doc_ids = all_docs['ids']
        self.documents = all_docs['documents']
        
        # Tokenize for BM25
        tokenized_corpus = [doc.lower().split() for doc in self.documents]
        
        # Create BM25 object
        self.bm25 = BM25Okapi(tokenized_corpus)
        
        print(f"✅ BM25 index built: {len(self.documents)} documents")
    
    def search_vector(
        self,
        query: str,
        top_k: int = 20,
        filters: Dict = None
    ) -> List[str]:
        """Vector search returns doc IDs."""
        results = self.rag.search(query, top_k=top_k, filters=filters)
        return [doc['id'] for doc in results]
    
    def search_bm25(
        self,
        query: str,
        top_k: int = 20
    ) -> List[str]:
        """BM25 keyword search returns doc IDs."""
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        
        # Get top-k indices
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        return [self.doc_ids[i] for i in top_indices]
    
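    def hybrid_search(
        self,
        query: str,
        top_k: int = 5,
        vector_weight: float = 0.7,
        bm25_weight: float = 0.3,
        rrf_k: int = 60
    ) -> List[Dict]:
        """
        Hybrid search with weighted Reciprocal Rank Fusion.
        
        Args:
            vector_weight: Weight for semantic search (0-1)
            bm25_weight: Weight for keyword search (0-1)
            rrf_k: RRF smoothing constant (typical: 60)
        """
        # Get rankings from both methods
        vector_ids = self.search_vector(query, top_k=20)
        bm25_ids = self.search_bm25(query, top_k=20)
        
        # Weighted RRF: score(doc) = Σ weight_m / (rrf_k + rank_m)
        scores = {}
        for weight, ranking in ((vector_weight, vector_ids), (bm25_weight, bm25_ids)):
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
        fused_ids = sorted(scores, key=scores.get, reverse=True)
        
        # Get top-k
        final_ids = fused_ids[:top_k]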
        
        # Retrieve full documents
        results = []
        for doc_id in final_ids:
            # Get from collection
            doc_data = self.rag.collection.get(ids=[doc_id])
            
            if doc_data['ids']:
                results.append({
                    'id': doc_id,
                    'content': doc_data['documents'][0],
                    'metadata': doc_data['metadatas'][0]
                })
        
        return results

# Usage
hybrid_rag = HybridRAG(rag)
results = hybrid_rag.hybrid_search("customer PII data")
print(f"Hybrid search returned {len(results)} documents")
```
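
One caveat with the whitespace tokenization used in `_build_bm25_index` and `search_bm25`: punctuation stays attached to words, so the query token `pii` does not match `(pii)` as it appears in the sample `customers` document. A regex tokenizer avoids this. The sketch below is one possible replacement, assuming ASCII-only identifiers. `tokenize_for_bm25` is a hypothetical helper and would have to be used for both the corpus and the query.

```python
import re
from typing import List

# Runs of lowercase letters, digits, and underscores; all other characters act as separators.
_TOKEN_PATTERN = re.compile(r"[a-z0-9_]+")


def tokenize_for_bm25(text: str) -> List[str]:
    """Lowercase the text and split on anything that is not a letter, digit, or underscore."""
    return _TOKEN_PATTERN.findall(text.lower())


print("email (PII)".lower().split())               # ['email', '(pii)']
print(tokenize_for_bm25("email (PII)"))            # ['email', 'pii']
print(tokenize_for_bm25("Schema: dwh.customers"))  # ['schema', 'dwh', 'customers']
print(tokenize_for_bm25("customer_id (BIGINT)"))   # ['customer_id', 'bigint']
```

Keeping underscores inside tokens preserves column names such as `customer_id` as single terms, while dotted names such as `dwh.customers` are split so that a search for `customers` still matches. Accented or non-Latin characters are dropped by this pattern and would need a broader character class.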

**Re-ranking with Cross-Encoder:**

```python
from typing import Dict, List

import numpy as np
from sentence_transformers import CrossEncoder

class RerankedRAG:
    """
    RAG with re-ranking stage.
    
    Pipeline:
    1. Initial retrieval (vector or hybrid): top-20 by default (initial_k)
    2. Re-rank with cross-encoder: top-3 by default (final_k)
    3. Generate answer with best-ranked docs
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
        
        # Load cross-encoder (more accurate but slower)
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    def rerank(
        self,
        query: str,
        documents: List[Dict],
        top_k: int = 5
    ) -> List[Dict]:
        """
        Re-rank documents using cross-encoder.
        
        Cross-encoder scores each (query, doc) pair directly.
        More accurate than bi-encoder (separate query/doc embeddings).
        """
        # Prepare pairs
        pairs = [(query, doc['content']) for doc in documents]
        
        # Score all pairs
        scores = self.reranker.predict(pairs)
        
        # Sort by score
        ranked_indices = np.argsort(scores)[::-1]
        
        # Return top-k
        reranked = [documents[i] for i in ranked_indices[:top_k]]
        
        # Attach the score to each returned document (mutates the input dicts)
        for i, doc in enumerate(reranked):
            doc['rerank_score'] = float(scores[ranked_indices[i]])
        
        return reranked
    
    def query_with_rerank(
        self,
        question: str,
        initial_k: int = 20,
        final_k: int = 3
    ) -> Dict:
        """
        Full RAG pipeline with re-ranking.
        """
        # Step 1: Initial retrieval (cast wide net)
        initial_docs = self.rag.search(question, top_k=initial_k)
        print(f"Initial retrieval: {len(initial_docs)} documents")
        
        # Step 2: Re-rank (focus on best)
        reranked_docs = self.rerank(question, initial_docs, top_k=final_k)
        print(f"After re-ranking: {len(reranked_docs)} documents")
        
        # Step 3: Generate answer
        result = self.rag.generate_answer(question, reranked_docs)
        
        return result

# Usage
reranked_rag = RerankedRAG(rag)
result = reranked_rag.query_with_rerank("Which pipelines process customer data?")
```

**Query Expansion for Better Recall:**

```python
import re

def expand_query(query: str, llm_client: OpenAI) -> List[str]:
    """
    Generate multiple query variations for better recall.
    
    Techniques:
    - Synonym expansion
    - Question reformulation
    - Keyword extraction
    """
    
    prompt = f"""Given this data engineering question, generate 3 alternative phrasings that mean the same thing.
Return one phrasing per line, with no extra commentary.

Original question: {query}"""
    
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150
    )
    
    # Parse alternatives
    alternatives = [query]  # Include original
    lines = (response.choices[0].message.content or "").strip().split('\n')
    
    for line in lines:
        # Strip optional list markers such as "1. ", "2) " or "- "
        alt = re.sub(r'^\s*(?:\d+[.)]|[-*•])\s*', '', line).strip()
        if alt and alt not in alternatives:
            alternatives.append(alt)
    
    return alternatives

# Multi-query retrieval
def multi_query_search(
    query: str,
    rag_system: RAGSystem,
    top_k_per_query: int = 10
) -> List[Dict]:
    """
    Search with multiple query variations, deduplicate results.
    """
    # Expand query
    queries = expand_query(query, rag_system.openai_client)
    print(f"Expanded to {len(queries)} queries: {queries}")
    
    # Search with each query
    all_docs = []
    seen_ids = set()
    
    for q in queries:
        docs = rag_system.search(q, top_k=top_k_per_query)
        
        for doc in docs:
            if doc['id'] not in seen_ids:
                all_docs.append(doc)
                seen_ids.add(doc['id'])
    
    return all_docs

# Usage
expanded_docs = multi_query_search("What is the update frequency of sales data?", rag)
print(f"Multi-query search found {len(expanded_docs)} unique documents")
```

**Metadata Filtering for Precision:**

```python
class FilteredRAG:
    """
    RAG with advanced metadata filtering.
    
    Use cases:
    - Search only schemas (type='schema')
    - Find docs by specific owner
    - Filter by date range (updated_at)
    - Combine multiple filters with AND/OR logic
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
    
    def search_by_type(
        self,
        query: str,
        doc_type: str,
        top_k: int = 5
    ) -> List[Dict]:
        """Search only specific document types."""
        return self.rag.search(
            query,
            top_k=top_k,
            filters={"type": doc_type}
        )
    
    def search_by_owner(
        self,
        query: str,
        owner: str,
        top_k: int = 5
    ) -> List[Dict]:
        """Search docs owned by specific team."""
        return self.rag.search(
            query,
            top_k=top_k,
            filters={"owner": owner}
        )
    
    def search_pii_only(
        self,
        query: str,
        top_k: int = 5
    ) -> List[Dict]:
        """Search only tables with PII data."""
        return self.rag.search(
            query,
            top_k=top_k,
            filters={"pii": True}
        )
    
    def search_with_complex_filters(
        self,
        query: str,
        filters: Dict,
        top_k: int = 5
    ) -> List[Dict]:
        """
        Advanced filtering with ChromaDB where clause.
        
        Examples:
        - {"$and": [{"type": "schema"}, {"pii": True}]}
        - {"$or": [{"owner": "data-eng"}, {"owner": "analytics"}]}
        """
        return self.rag.search(query, top_k=top_k, filters=filters)

# Usage examples
filtered_rag = FilteredRAG(rag)

# Only schemas
schemas = filtered_rag.search_by_type("customer data", doc_type="schema")

# Only data-eng team docs
eng_docs = filtered_rag.search_by_owner("ETL pipeline", owner="data-engineering")

# PII tables only
pii_tables = filtered_rag.search_pii_only("contact information")

# Complex: schemas with PII from analytics team
complex_results = filtered_rag.search_with_complex_filters(
    "user data",
    filters={
        "$and": [
            {"type": "schema"},
            {"pii": True},
            {"owner": "analytics"}
        ]
    }
)
```
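
The class docstring lists date-range filtering on `updated_at`, but none of the methods above build such a filter. The following is a minimal sketch, assuming `updated_at` is stored in each chunk's metadata as a Unix timestamp (an integer), because ChromaDB's comparison operators such as `$gte` and `$lte` apply to numeric values. The helper names `date_range_filter` and `build_where` are hypothetical and not part of `RAGSystem`.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def date_range_filter(
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    field: str = "updated_at",
) -> List[Dict[str, Any]]:
    """Return ChromaDB-style conditions for start <= field <= end.

    Assumes `field` holds a Unix timestamp (int) in the chunk metadata.
    """
    conditions: List[Dict[str, Any]] = []
    if start is not None:
        conditions.append({field: {"$gte": int(start.timestamp())}})
    if end is not None:
        conditions.append({field: {"$lte": int(end.timestamp())}})
    return conditions


def build_where(conditions: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Combine conditions with $and, which needs at least two operands."""
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


# Schemas updated in the first half of 2024
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 6, 30, tzinfo=timezone.utc)
where = build_where([{"type": "schema"}] + date_range_filter(start, end))
```

The resulting dictionary could be passed as `filters` to `search_with_complex_filters`.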

**Performance Optimization:**

```python
import json
import time
from functools import lru_cache
from typing import List

import numpy as np

class OptimizedRAG:
    """
    Performance-optimized RAG system.
    
    Optimizations:
    - Embedding caching
    - Query result caching
    - Batch embedding generation
    - Connection pooling
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
        self._query_cache = {}
    
    @lru_cache(maxsize=1000)
    def cached_search(self, query: str, top_k: int = 5) -> str:
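        """
        Cache search results. The JSON string return value keeps callers
        from mutating the cached list in place.
        
        lru_cache keys on (self, query, top_k) and never expires entries,
        so call cached_search.cache_clear() after re-indexing.
        
        Illustrative figures (workload-dependent):
        - Cache hit rate: 40-60% for common queries
        - Latency reduction: 95% (2000ms → 100ms)
        """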
        docs = self.rag.search(query, top_k=top_k)
        return json.dumps(docs)
    
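    def batch_embed(self, texts: List[str], batch_size: int = 2048) -> List[List[float]]:
        """
        Generate embeddings in batches (more efficient).
        
        OpenAI allows up to 2048 texts per request, so one batched call
        replaces up to 2048 single-text calls.
        """
        embeddings: List[List[float]] = []
        
        # Split the input so no request exceeds the per-request limit
        for start in range(0, len(texts), batch_size):
            response = self.rag.openai_client.embeddings.create(
                model=self.rag.embedding_model,
                input=texts[start:start + batch_size]
            )
            embeddings.extend(item.embedding for item in response.data)
        
        return embeddings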
    
    def benchmark_search(self, query: str, runs: int = 10):
        """Benchmark search latency."""
        latencies = []
        
        for _ in range(runs):
            start = time.perf_counter()  # Monotonic, high-resolution timer
            self.rag.search(query)
            latency = time.perf_counter() - start
            latencies.append(latency)
        
        return {
            'mean': np.mean(latencies),
            'p50': np.percentile(latencies, 50),
            'p95': np.percentile(latencies, 95),
            'p99': np.percentile(latencies, 99)
        }

# Benchmark
optimized = OptimizedRAG(rag)
metrics = optimized.benchmark_search("customer tables")
print(f"Search latency - Mean: {metrics['mean']*1000:.1f}ms, P95: {metrics['p95']*1000:.1f}ms")
```
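
One limitation of `lru_cache` is that entries never expire, so results can go stale once documents are re-indexed. As an alternative, here is a minimal sketch of a time-limited cache that wraps any search callable. The class name `TTLSearchCache`, the `search_fn` parameter, and the 300-second default are illustrative assumptions; `fake_search` only stands in for `rag.search`.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLSearchCache:
    """Cache search results per (normalized query, top_k) for a limited time."""

    def __init__(
        self,
        search_fn: Callable[[str, int], List[Dict[str, Any]]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._search_fn = search_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Tuple[str, int], Tuple[float, List[Dict[str, Any]]]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivial variants share one entry
        return " ".join(query.lower().split())

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        key = (self._normalize(query), top_k)
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            self.hits += 1
            return cached[1]
        self.misses += 1
        results = self._search_fn(query, top_k)
        self._store[key] = (now, results)
        return results

    def clear(self) -> None:
        """Call after re-indexing so stale results are not served."""
        self._store.clear()


# Stand-in for rag.search, used only to demonstrate the cache
calls: List[str] = []


def fake_search(query: str, top_k: int) -> List[Dict[str, Any]]:
    calls.append(query)
    return [{"id": f"doc{i}", "content": query} for i in range(top_k)]


cache = TTLSearchCache(fake_search, ttl_seconds=60)
cache.search("Customer  tables", top_k=2)
cache.search("customer tables", top_k=2)  # Same normalized key: served from cache
```

The `hits` and `misses` counters make it possible to measure the real hit rate for a workload instead of relying on an estimate.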

---
**Author:** Luis J. Raigoso V. (LJRV)

### 📊 **Evaluation: Measuring RAG Quality**

**RAG Evaluation Metrics Framework:**

```
┌──────────────────────────────────────────────────────────────┐
│  RETRIEVAL METRICS (Measure search quality)                 │
├──────────────────────────────────────────────────────────────┤
│  1. Precision@k                                              │
│     • Definition: Relevant docs in top-k / k                 │
│     • Formula: P@k = |Relevant ∩ Retrieved| / k              │
│     • Good: >0.80 (80%+ of retrieved docs are relevant)      │
│                                                              │
│  2. Recall@k                                                 │
│     • Definition: Relevant docs retrieved / total relevant   │
│     • Formula: R@k = |Relevant ∩ Retrieved| / |Relevant|     │
│     • Good: >0.70 (70%+ of relevant docs found)              │
│                                                              │
│  3. Mean Reciprocal Rank (MRR)                               │
│     • Definition: Average of 1/rank of first relevant doc    │
│     • Formula: MRR = (1/n) Σ 1/rank_i                        │
│     • Good: >0.80 (relevant doc in top 1-2 positions)        │
│                                                              │
│  4. NDCG@k (Normalized Discounted Cumulative Gain)           │
│     • Definition: Quality of ranking with relevance grades   │
│     • Formula: DCG = Σ (2^rel_i - 1) / log2(i + 1)           │
│     • Good: >0.85 (excellent ranking quality)                │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  GENERATION METRICS (Measure answer quality)                 │
├──────────────────────────────────────────────────────────────┤
│  1. Faithfulness (Hallucination Detection)                   │
│     • Definition: Answer is grounded in retrieved context    │
│     • Method: LLM judges if answer follows from context      │
│     • Target: >0.95 (minimal hallucination)                  │
│                                                              │
│  2. Answer Relevance                                         │
│     • Definition: Answer directly addresses the question     │
│     • Method: Cosine similarity(question, answer)            │
│     • Target: >0.80 (highly relevant)                        │
│                                                              │
│  3. Context Precision                                        │
│     • Definition: Retrieved context is useful for answer     │
│     • Method: LLM rates if each chunk was used               │
│     • Target: >0.70 (most context is useful)                 │
│                                                              │
│  4. Context Recall                                           │
│     • Definition: Context covers the facts needed to answer  │
│     • Method: LLM checks ground-truth facts against context  │
│     • Target: >0.85 (context covers the needed facts)        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  END-TO-END METRICS (Overall system quality)                 │
├──────────────────────────────────────────────────────────────┤
│  1. Correctness (Human or LLM-judged)                        │
│     • 5-point scale: 1 (wrong) to 5 (perfect)                │
│     • Target: >4.0 average score                             │
│                                                              │
│  2. Latency (User experience)                                │
│     • p50: <500ms, p95: <2s, p99: <5s                        │
│                                                              │
│  3. Cost per Query                                           │
│     • Embedding + LLM API costs                              │
│     • Target: <$0.01 per query                               │
│                                                              │
│  4. User Satisfaction (CSAT)                                 │
│     • Thumbs up/down feedback                                │
│     • Target: >80% positive                                  │
└──────────────────────────────────────────────────────────────┘
```

**Implementation: Comprehensive Evaluation Suite**

```python
from typing import List, Dict, Tuple
import numpy as np
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Single evaluation result."""
    query: str
    retrieved_docs: List[str]
    relevant_docs: List[str]
    generated_answer: str
    ground_truth: str
    metrics: Dict[str, float]

class RAGEvaluator:
    """
    Comprehensive RAG evaluation framework.
    
    Implements:
    - Retrieval metrics (P@k, R@k, MRR, NDCG)
    - Generation metrics (faithfulness, relevance)
    - End-to-end correctness
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
        self.openai_client = rag_system.openai_client
    
    def precision_at_k(
        self,
        retrieved: List[str],
        relevant: List[str],
        k: int = 5
    ) -> float:
        """
        Precision@k: Fraction of retrieved docs that are relevant.
        
        Example:
            retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
            relevant = ['doc1', 'doc3', 'doc7']
            P@5 = 2/5 = 0.40 (doc1 and doc3 are relevant)
        """
        retrieved_k = retrieved[:k]
        relevant_set = set(relevant)
        
        relevant_retrieved = sum(1 for doc in retrieved_k if doc in relevant_set)
        
        return relevant_retrieved / k if k > 0 else 0.0
    
    def recall_at_k(
        self,
        retrieved: List[str],
        relevant: List[str],
        k: int = 5
    ) -> float:
        """
        Recall@k: Fraction of relevant docs that were retrieved.
        
        Example:
            retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
            relevant = ['doc1', 'doc3', 'doc7']
            R@5 = 2/3 = 0.67 (found doc1 and doc3, missing doc7)
        """
        retrieved_k = set(retrieved[:k])
        relevant_set = set(relevant)
        
        relevant_retrieved = len(retrieved_k & relevant_set)
        
        return relevant_retrieved / len(relevant_set) if relevant_set else 0.0
    
    def mean_reciprocal_rank(
        self,
        retrieved: List[str],
        relevant: List[str]
    ) -> float:
        """
        Reciprocal rank of the first relevant document for one query.
        MRR is the mean of this value across queries (see evaluate_dataset).
        
        Example:
            retrieved = ['doc2', 'doc1', 'doc3']  # doc1 is relevant
            MRR = 1/2 = 0.50 (first relevant at position 2)
        """
        relevant_set = set(relevant)
        
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant_set:
                return 1.0 / rank
        
        return 0.0  # No relevant doc found
    
    def ndcg_at_k(
        self,
        retrieved: List[str],
        relevance_scores: Dict[str, int],  # doc_id -> relevance (0-3)
        k: int = 5
    ) -> float:
        """
        NDCG@k: Normalized Discounted Cumulative Gain.
        
        Accounts for:
        - Position (earlier is better)
        - Relevance grade (highly relevant > somewhat relevant)
        
        Example:
            retrieved = ['doc1', 'doc2', 'doc3']
            relevance_scores = {'doc1': 3, 'doc2': 1, 'doc3': 2}
            DCG = (2^3-1)/log2(2) + (2^1-1)/log2(3) + (2^2-1)/log2(4)
                = 7/1 + 1/1.58 + 3/2 = 7 + 0.63 + 1.5 = 9.13
            IDCG (ideal order 3, 2, 1) = 7/1 + 3/1.58 + 1/2
                = 7 + 1.89 + 0.5 = 9.39
            NDCG = DCG / IDCG = 9.13 / 9.39 ≈ 0.97 (doc2 and doc3 are swapped)
        """
        retrieved_k = retrieved[:k]
        
        # Calculate DCG
        dcg = 0.0
        for i, doc in enumerate(retrieved_k, start=1):
            rel = relevance_scores.get(doc, 0)
            dcg += (2**rel - 1) / np.log2(i + 1)
        
        # Calculate IDCG (ideal ranking)
        ideal_ranking = sorted(relevance_scores.values(), reverse=True)[:k]
        idcg = 0.0
        for i, rel in enumerate(ideal_ranking, start=1):
            idcg += (2**rel - 1) / np.log2(i + 1)
        
        return dcg / idcg if idcg > 0 else 0.0
    
    def evaluate_faithfulness(
        self,
        answer: str,
        context: List[str]
    ) -> float:
        """
        Faithfulness: Answer is grounded in context (no hallucination).
        
        Method:
        - LLM judges if each statement in answer is supported by context
        - Returns score 0-1
        """
        context_text = "\n\n".join(context)
        
        prompt = f"""Evaluate if the answer is faithful to the context (no hallucination).

Context:
{context_text}

Answer:
{answer}

Question: Is every statement in the answer supported by the context?
Answer with a score from 0 (completely unfaithful) to 10 (perfectly faithful).
Respond with ONLY a number.

Score:"""
        
        response = self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=10
        )
        
        try:
            score = float(response.choices[0].message.content.strip())
            return max(0.0, min(score / 10.0, 1.0))  # Normalize and clamp to 0-1
        except (ValueError, AttributeError):
            # Non-numeric or empty reply: fall back to a neutral score
            return 0.5
    
    def evaluate_relevance(
        self,
        query: str,
        answer: str
    ) -> float:
        """
        Answer Relevance: Does answer address the question?
        
        Method:
        - Compute embeddings for query and answer
        - Calculate cosine similarity
        """
        query_embedding = np.array(self.rag.get_embedding(query))
        answer_embedding = np.array(self.rag.get_embedding(answer))
        
        # Cosine similarity
        similarity = np.dot(query_embedding, answer_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(answer_embedding)
        )
        
        return float(similarity)
    
    def evaluate_correctness(
        self,
        answer: str,
        ground_truth: str
    ) -> float:
        """
        Correctness: LLM-judged correctness compared to ground truth.
        
        Returns: Score 0-1
        """
        prompt = f"""Compare the generated answer to the ground truth answer.

Ground Truth:
{ground_truth}

Generated Answer:
{answer}

Rate the correctness on a scale of 0-10:
- 10: Perfect match, all key information present
- 7-9: Mostly correct, minor details missing
- 4-6: Partially correct, some key info missing
- 1-3: Mostly incorrect
- 0: Completely wrong

Respond with ONLY a number.

Score:"""
        
        response = self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=10
        )
        
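        try:
            score = float(response.choices[0].message.content.strip())
            return max(0.0, min(score / 10.0, 1.0))  # Normalize and clamp to 0-1
        except (ValueError, AttributeError):
            # Non-numeric or empty reply: fall back to a neutral score
            return 0.5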
    
    def evaluate_query(
        self,
        query: str,
        relevant_doc_ids: List[str],
        ground_truth_answer: str,
        top_k: int = 5
    ) -> EvalResult:
        """
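        Complete evaluation for a single query.
        
        Returns retrieval and generation metrics. NDCG is omitted because
        it needs graded relevance scores, which the test cases do not carry.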
        """
        # Retrieve documents
        retrieved_docs = self.rag.search(query, top_k=top_k)
        retrieved_ids = [doc['id'] for doc in retrieved_docs]
        retrieved_contents = [doc['content'] for doc in retrieved_docs]
        
        # Generate answer
        result = self.rag.generate_answer(query, retrieved_docs)
        generated_answer = result['answer']
        
        # Calculate metrics
        metrics = {
            # Retrieval metrics (k follows top_k so the cutoff matches retrieval)
            f'precision@{top_k}': self.precision_at_k(retrieved_ids, relevant_doc_ids, k=top_k),
            f'recall@{top_k}': self.recall_at_k(retrieved_ids, relevant_doc_ids, k=top_k),
            'mrr': self.mean_reciprocal_rank(retrieved_ids, relevant_doc_ids),
            
            # Generation metrics
            'faithfulness': self.evaluate_faithfulness(generated_answer, retrieved_contents),
            'relevance': self.evaluate_relevance(query, generated_answer),
            'correctness': self.evaluate_correctness(generated_answer, ground_truth_answer)
        }
        
        return EvalResult(
            query=query,
            retrieved_docs=retrieved_ids,
            relevant_docs=relevant_doc_ids,
            generated_answer=generated_answer,
            ground_truth=ground_truth_answer,
            metrics=metrics
        )
    
    def evaluate_dataset(
        self,
        test_cases: List[Dict]
    ) -> Dict:
        """
        Evaluate RAG system on test dataset.
        
        Args:
            test_cases: List of {
                'query': str,
                'relevant_docs': List[str],
                'ground_truth': str
            }
        
        Returns:
            Aggregated metrics across all test cases
        """
        results = []
        
        for i, case in enumerate(test_cases, 1):
            print(f"Evaluating {i}/{len(test_cases)}: {case['query'][:50]}...")
            
            result = self.evaluate_query(
                query=case['query'],
                relevant_doc_ids=case['relevant_docs'],
                ground_truth_answer=case['ground_truth']
            )
            
            results.append(result)
        
        # Aggregate metrics
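        all_metrics = [r.metrics for r in results]
        
        aggregated = {}
        # An empty test set yields no metrics instead of raising IndexError
        metric_names = all_metrics[0].keys() if all_metrics else []
        for metric_name in metric_names: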
            values = [m[metric_name] for m in all_metrics]
            aggregated[metric_name] = {
                'mean': np.mean(values),
                'std': np.std(values),
                'min': np.min(values),
                'max': np.max(values)
            }
        
        return {
            'results': results,
            'aggregated': aggregated,
            'num_queries': len(test_cases)
        }

# Example test dataset
test_cases = [
    {
        'query': 'Which tables contain customer PII?',
        'relevant_docs': ['table_customers', 'table_orders'],
        'ground_truth': 'The customers table contains PII including email and phone. It is in the dwh.customers schema and owned by data-engineering.'
    },
    {
        'query': 'How often is sales data updated?',
        'relevant_docs': ['table_sales', 'pipeline_sales_etl'],
        'ground_truth': 'Sales data is updated daily at 3 AM via the ventas_daily_etl pipeline.'
    }
]

# Run evaluation
evaluator = RAGEvaluator(rag)
eval_results = evaluator.evaluate_dataset(test_cases)

# Print summary
print("\n=== RAG Evaluation Results ===")
for metric, stats in eval_results['aggregated'].items():
    print(f"{metric}: {stats['mean']:.3f} ± {stats['std']:.3f} (min={stats['min']:.3f}, max={stats['max']:.3f})")
```
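
The aggregated means can also serve as a regression gate, for example in CI. The sketch below compares them with the targets listed in the metrics framework above. The function name `check_targets` and the `sample` numbers are assumptions made for illustration; the input shape mirrors the `aggregated` dictionary returned by `evaluate_dataset`.

```python
from typing import Dict, List

# Targets taken from the metrics framework above; adjust per project
TARGETS: Dict[str, float] = {
    "precision@5": 0.80,
    "recall@5": 0.70,
    "mrr": 0.80,
    "faithfulness": 0.95,
    "relevance": 0.80,
}


def check_targets(
    aggregated: Dict[str, Dict[str, float]],
    targets: Dict[str, float] = TARGETS,
) -> List[str]:
    """Return one message for each metric whose mean misses its target."""
    failures = []
    for metric, target in targets.items():
        if metric not in aggregated:
            failures.append(f"{metric}: not reported")
            continue
        mean = aggregated[metric]["mean"]
        if mean < target:
            failures.append(f"{metric}: mean {mean:.3f} is below target {target:.2f}")
    return failures


# Hand-written numbers for illustration only, not real evaluation output
sample = {
    "precision@5": {"mean": 0.84},
    "recall@5": {"mean": 0.65},
    "mrr": {"mean": 0.90},
    "faithfulness": {"mean": 0.97},
    "relevance": {"mean": 0.82},
}
failures = check_targets(sample)
for line in failures:
    print(line)
```

A CI job could exit with a non-zero status whenever `failures` is not empty.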

**A/B Testing RAG Systems:**

```python
class RAGComparison:
    """
    Compare two RAG systems (e.g., different retrieval methods).
    
    Use case:
    - Test hybrid search vs pure vector search
    - Compare different embedding models
    - Evaluate LLM model upgrades (GPT-4 vs GPT-4o)
    """
    
    def __init__(
        self,
        system_a: RAGSystem,
        system_b: RAGSystem,
        system_a_name: str = "System A",
        system_b_name: str = "System B"
    ):
        self.system_a = system_a
        self.system_b = system_b
        self.system_a_name = system_a_name
        self.system_b_name = system_b_name
        
        self.evaluator_a = RAGEvaluator(system_a)
        self.evaluator_b = RAGEvaluator(system_b)
    
    def compare(
        self,
        test_cases: List[Dict]
    ) -> Dict:
        """
        Run both systems on same test set, compare metrics.
        """
        print(f"Evaluating {self.system_a_name}...")
        results_a = self.evaluator_a.evaluate_dataset(test_cases)
        
        print(f"\nEvaluating {self.system_b_name}...")
        results_b = self.evaluator_b.evaluate_dataset(test_cases)
        
        # Calculate improvements
        comparison = {}
        
        for metric in results_a['aggregated'].keys():
            mean_a = results_a['aggregated'][metric]['mean']
            mean_b = results_b['aggregated'][metric]['mean']
            
            improvement = ((mean_b - mean_a) / mean_a) * 100 if mean_a > 0 else 0
            
            comparison[metric] = {
                self.system_a_name: mean_a,
                self.system_b_name: mean_b,
                'improvement_pct': improvement,
                'better': self.system_b_name if mean_b > mean_a else self.system_a_name
            }
        
        return comparison
    
    def print_comparison(self, comparison: Dict):
        """Pretty print comparison results."""
        print("\n" + "="*70)
        print(f"  {self.system_a_name} vs {self.system_b_name}")
        print("="*70)
        
        for metric, stats in comparison.items():
            print(f"\n{metric.upper()}:")
            print(f"  {self.system_a_name}: {stats[self.system_a_name]:.3f}")
            print(f"  {self.system_b_name}: {stats[self.system_b_name]:.3f}")
            
            if stats['improvement_pct'] > 0:
                print(f"  ✅ {self.system_b_name} is {stats['improvement_pct']:.1f}% better")
            elif stats['improvement_pct'] < 0:
                print(f"  ❌ {self.system_b_name} is {abs(stats['improvement_pct']):.1f}% worse")
            else:
                print("  ➖ No difference")

# Example: Compare vector search vs hybrid search
rag_vector = RAGSystem(collection_name="docs_vector")
rag_hybrid = HybridRAG(rag_vector)

comparison = RAGComparison(
    rag_vector,
    rag_hybrid,
    system_a_name="Vector Search",
    system_b_name="Hybrid Search"
)

results = comparison.compare(test_cases)
comparison.print_comparison(results)
```
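
Comparing means alone can mislead on a small test set, because a difference of a few percent may be noise. One possible check is a paired permutation (sign-flip) test on per-query scores, sketched below with the standard library only. The function name, the default of 10,000 permutations, and the sample scores are assumptions for illustration; this test is not part of `RAGComparison`.

```python
import random
from typing import List


def paired_permutation_test(
    scores_a: List[float],
    scores_b: List[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean per-query difference between B and A.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-query difference can be flipped at random.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    if not diffs:
        raise ValueError("At least one query is required")
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point rounding
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0
    return (extreme + 1) / (n_permutations + 1)


# Illustrative per-query reciprocal ranks, not real evaluation output
mrr_a = [0.5, 0.5, 1.0, 0.33, 0.5, 1.0, 0.25, 0.5]
mrr_b = [1.0, 0.5, 1.0, 0.50, 1.0, 1.0, 0.50, 1.0]
p_value = paired_permutation_test(mrr_a, mrr_b)
print(f"p = {p_value:.4f}")
```

Per-query scores could be collected from the `evaluate_dataset` output, for example `[r.metrics['mrr'] for r in results_a['results']]`. A small p-value suggests the difference is unlikely to come from query-to-query variation alone.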

**Continuous Monitoring Dashboard:**

```python
import json
import time
from datetime import datetime, timedelta
from typing import Dict, Optional

import numpy as np

class RAGMonitor:
    """
    Production monitoring for RAG system.
    
    Tracks:
    - Query latency (p50, p95, p99)
    - Retrieval quality (avg precision/recall)
    - User feedback (thumbs up/down)
    - Cost per query
    - Error rate
    """
    
    def __init__(self):
        self.metrics_log = []
    
    def log_query(
        self,
        query: str,
        num_retrieved: int,
        latency_ms: float,
        cost_usd: float,
        user_feedback: Optional[str] = None,  # 'positive' or 'negative'
        error: Optional[str] = None
    ):
        """Log single query metrics."""
        self.metrics_log.append({
            'timestamp': datetime.utcnow().isoformat(),
            'query': query,
            'num_retrieved': num_retrieved,
            'latency_ms': latency_ms,
            'cost_usd': cost_usd,
            'user_feedback': user_feedback,
            'error': error
        })
    
    def get_summary(self, last_n_hours: int = 24) -> Dict:
        """Generate summary for last N hours."""
        cutoff = datetime.utcnow() - timedelta(hours=last_n_hours)
        
        recent_logs = [
            log for log in self.metrics_log
            if datetime.fromisoformat(log['timestamp']) > cutoff
        ]
        
        if not recent_logs:
            return {}
        
        latencies = [log['latency_ms'] for log in recent_logs]
        costs = [log['cost_usd'] for log in recent_logs]
        
        feedbacks = [log.get('user_feedback') for log in recent_logs if log.get('user_feedback')]
        positive_feedback = sum(1 for f in feedbacks if f == 'positive')
        
        errors = sum(1 for log in recent_logs if log.get('error'))
        
        return {
            'time_window': f'Last {last_n_hours} hours',
            'total_queries': len(recent_logs),
            'latency': {
                'p50': np.percentile(latencies, 50),
                'p95': np.percentile(latencies, 95),
                'p99': np.percentile(latencies, 99)
            },
            'cost': {
                'total': sum(costs),
                'avg_per_query': np.mean(costs)
            },
            'user_satisfaction': {
                'total_feedback': len(feedbacks),
                'positive_rate': positive_feedback / len(feedbacks) if feedbacks else 0
            },
            'error_rate': errors / len(recent_logs) if recent_logs else 0
        }
    
    def export_to_prometheus(self) -> str:
        """Export metrics in Prometheus format."""
        summary = self.get_summary()
        
        # get_summary() returns {} when no queries are in the window
        if not summary:
            return ""
        
        metrics = f"""
# HELP rag_queries_total Total number of queries
# TYPE rag_queries_total counter
rag_queries_total {summary['total_queries']}

# HELP rag_latency_ms Query latency percentiles
# TYPE rag_latency_ms gauge
rag_latency_ms{{quantile="0.5"}} {summary['latency']['p50']}
rag_latency_ms{{quantile="0.95"}} {summary['latency']['p95']}
rag_latency_ms{{quantile="0.99"}} {summary['latency']['p99']}

# HELP rag_cost_usd Total cost in USD
# TYPE rag_cost_usd counter
rag_cost_usd {summary['cost']['total']}

# HELP rag_user_satisfaction User satisfaction rate (0-1)
# TYPE rag_user_satisfaction gauge
rag_user_satisfaction {summary['user_satisfaction']['positive_rate']}

# HELP rag_error_rate Error rate (0-1)
# TYPE rag_error_rate gauge
rag_error_rate {summary['error_rate']}
"""
        return metrics

# Usage in production
monitor = RAGMonitor()

# In your RAG query endpoint
def handle_query(query: str):
    start = time.time()
    
    try:
        result = rag.query(query)
        latency = (time.time() - start) * 1000  # ms
        
        # Estimate cost (embedding + LLM)
        cost = 0.00002 * len(query.split()) + 0.00001 * len(result['answer'].split())
        
        monitor.log_query(
            query=query,
            num_retrieved=result['context_used'],
            latency_ms=latency,
            cost_usd=cost
        )
        
        return result
    
    except Exception as e:
        monitor.log_query(
            query=query,
            num_retrieved=0,
            latency_ms=(time.time() - start) * 1000,  # Time until failure
            cost_usd=0,
            error=str(e)
        )
        raise

# Get daily summary
summary = monitor.get_summary(last_n_hours=24)
print(json.dumps(summary, indent=2))
```
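
The word-count formula in `handle_query` is only a rough placeholder. When token counts are available (OpenAI responses report them in a `usage` field), cost could instead be derived from per-token prices. The sketch below takes prices as parameters. The embedding default mirrors the $0.02 per 1M tokens that this document quotes for `text-embedding-3-small`; the two LLM prices are hypothetical placeholders and should be replaced with the provider's current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens. The LLM defaults are placeholders, not real prices."""
    embedding_per_m: float = 0.02
    llm_input_per_m: float = 1.00   # Hypothetical
    llm_output_per_m: float = 2.00  # Hypothetical


def estimate_query_cost(
    embedding_tokens: int,
    prompt_tokens: int,
    completion_tokens: int,
    pricing: Pricing = Pricing(),
) -> float:
    """Cost in USD of one RAG query: one embedding call plus one LLM call."""
    return (
        embedding_tokens * pricing.embedding_per_m
        + prompt_tokens * pricing.llm_input_per_m
        + completion_tokens * pricing.llm_output_per_m
    ) / 1_000_000


# 20-token question, 3,000 prompt tokens (question plus chunks), 200-token answer
cost = estimate_query_cost(20, 3_000, 200)
print(f"${cost:.6f}")
```

With these placeholder prices the prompt tokens dominate the total, which is why reducing `top_k` or chunk size tends to be the first lever for meeting a per-query cost target.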

---

### 🚀 **Production RAG: Multi-Source Knowledge Base**

**Enterprise RAG Architecture:**

```
┌────────────────────────────────────────────────────────────────┐
│  DATA SOURCES (Multi-format ingestion)                        │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ Confluence  │  │   GitHub    │  │     DBT     │          │
│  │   (API)     │  │   (Git)     │  │ (manifest)  │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         │                │                │                   │
│  ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐          │
│  │   Airflow   │  │  Snowflake  │  │ Looker/     │          │
│  │    DAGs     │  │  INFORMATION│  │  Tableau    │          │
│  │ (Docstring) │  │   _SCHEMA   │  │  (Metadata) │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Slack     │  │   Jira      │  │  Runbooks   │          │
│  │  (Search)   │  │   (API)     │  │  (Markdown) │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└────────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────────────┐
│  INGESTION PIPELINE (Orchestrated by Airflow)                 │
├────────────────────────────────────────────────────────────────┤
│  DAG: knowledge_base_sync (runs every 6 hours)                │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐ │
│  │ Task 1: Extract from Sources                             │ │
│  │  ├─ Confluence: Get pages updated in last 6h            │ │
│  │  ├─ GitHub: Pull README/wiki commits                    │ │
│  │  ├─ DBT: Parse manifest.json for table docs             │ │
│  │  └─ Snowflake: Query INFORMATION_SCHEMA                 │ │
│  └──────────────────┬───────────────────────────────────────┘ │
│                     │                                          │
│  ┌──────────────────┴───────────────────────────────────────┐ │
│  │ Task 2: Transform & Enrich                               │ │
│  │  ├─ Parse Markdown/HTML to plain text                   │ │
│  │  ├─ Extract metadata (owner, last_updated, tags)        │ │
│  │  ├─ Chunk documents (800 tokens, 200 overlap)           │ │
│  │  ├─ Deduplicate by content hash                         │ │
│  │  └─ Add source lineage (URL, file path)                 │ │
│  └──────────────────┬───────────────────────────────────────┘ │
│                     │                                          │
│  ┌──────────────────┴───────────────────────────────────────┐ │
│  │ Task 3: Generate Embeddings                              │ │
│  │  ├─ Batch embed (100 docs per API call)                 │ │
│  │  ├─ Model: text-embedding-3-small (OpenAI)              │ │
│  │  ├─ Cost tracking: $0.02 per 1M tokens                  │ │
│  │  └─ Retry logic: exponential backoff                    │ │
│  └──────────────────┬───────────────────────────────────────┘ │
│                     │                                          │
│  ┌──────────────────┴───────────────────────────────────────┐ │
│  │ Task 4: Load to Vector Store                             │ │
│  │  ├─ Upsert to Pinecone (production) or ChromaDB (dev)   │ │
│  │  ├─ Organize into collections by source type            │ │
│  │  ├─ Index metadata for filtering                        │ │
│  │  └─ Validation: sample queries for smoke test           │ │
│  └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────────────┐
│  VECTOR DATABASE (Pinecone for production scale)              │
│  ├─ Collections: schemas, pipelines, metrics, runbooks        │
│  ├─ Total vectors: ~500K (50K per collection)                 │
│  ├─ Metadata filters: source, owner, tags, last_updated       │
│  └─ SLA: p95 latency <50ms, 99.9% uptime                      │
└────────────────────────────────────────────────────────────────┘
```
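
Two steps of Task 2 in the diagram, chunking with overlap and deduplication by content hash, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-split words stand in for model tokens (a production pipeline would count tokens with the embedding model's tokenizer), and the function names are hypothetical.

```python
import hashlib
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 800, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reaches the end of the document
    return chunks


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text so reformatted copies still match."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(chunks: List[str]) -> List[str]:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Whitespace-split words stand in for model tokens in this sketch
words = [f"w{i}" for i in range(2000)]
windows = chunk_tokens(words)  # 800-token windows, 200-token overlap
texts = [" ".join(w) for w in windows]
unique = deduplicate(texts + [texts[0] + "  "])  # Trailing spaces do not defeat dedup
```

Storing the hash in each chunk's metadata would also let an incremental sync skip chunks whose content has not changed.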

**Complete Production Implementation:**

```python
from typing import List, Dict, Optional
from dataclasses import dataclass
from pathlib import Path
import requests
import hashlib
from datetime import datetime, timedelta

import chromadb
from git import Repo
from openai import OpenAI
import yaml

@dataclass
class DataSource:
    """Configuration for a data source."""
    name: str
    type: str  # 'confluence', 'github', 'dbt', 'snowflake'
    config: Dict
    sync_frequency_hours: int = 6

class MultiSourceRAG:
    """
    Production RAG system with multi-source ingestion.
    
    Features:
    - Syncs from 8+ data sources
    - Incremental updates (only changed docs)
    - Metadata-rich indexing
    - Source attribution in answers
    """
    
    def __init__(
        self,
        vector_db: str = "pinecone",  # or "chromadb"
        pinecone_api_key: Optional[str] = None
    ):
        if vector_db == "pinecone":
            # Current Pinecone client; the older pinecone.init() API was removed
            from pinecone import Pinecone
            pc = Pinecone(api_key=pinecone_api_key)
            self.index = pc.Index("data-knowledge-base")
        else:
            self.chroma_client = chromadb.PersistentClient(path="./chroma_db")
            self.index = self.chroma_client.get_or_create_collection("data_docs")
        
        self.openai_client = OpenAI()
        
        # Track last sync times
        self.last_sync = {}
    
    def sync_confluence(self, base_url: str, space_key: str, api_token: str):
        """
        Sync from Confluence space.
        
        Extracts:
        - Page title, content, labels
        - Last updated timestamp
        - Page URL for source attribution
        """
        headers = {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json"
        }
        
        # Get pages updated since last sync
        last_sync_time = self.last_sync.get('confluence', datetime.utcnow() - timedelta(hours=24))
        
        # Confluence REST API
        url = f"{base_url}/rest/api/content"
        params = {
            "spaceKey": space_key,
            "expand": "body.storage,version,metadata.labels",
            "limit": 100,
            "orderby": "lastmodified"
        }
        
        response = requests.get(url, headers=headers, params=params, timeout=30)
        response.raise_for_status()  # Fail fast on auth or HTTP errors
        pages = response.json()['results']
        
        documents = []
        
        for page in pages:
            # Check if updated since last sync
            updated = datetime.fromisoformat(page['version']['when'].rstrip('Z'))
            if updated < last_sync_time:
                continue
            
            # Parse HTML content
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(page['body']['storage']['value'], 'html.parser')
            content = soup.get_text(separator='\n', strip=True)
            
            doc = Document(
                id=f"confluence_{page['id']}",
                content=f"# {page['title']}\n\n{content}",
                source=f"{base_url}/wiki{page['_links']['webui']}",
                metadata={
                    'type': 'confluence_page',
                    'title': page['title'],
                    'space': space_key,
                    'last_updated': updated.isoformat(),
                    'labels': [label['name'] for label in page['metadata']['labels']['results']]
                }
            )
            
            documents.append(doc)
        
        print(f"✅ Synced {len(documents)} Confluence pages")
        self.last_sync['confluence'] = datetime.utcnow()
        
        return documents
    
    def sync_github_repo(self, repo_url: str, branch: str = "main"):
        """
        Sync README and wiki from GitHub repo.
        
        Extracts:
        - README.md
        - All wiki pages
        - Code docstrings (Python files)
        """
        import tempfile
        import shutil
        
        # Clone repo to temp directory
        temp_dir = tempfile.mkdtemp()
        
        try:
            repo = Repo.clone_from(repo_url, temp_dir, branch=branch, depth=1)
            
            documents = []
            
            # Index README
            readme_path = Path(temp_dir) / "README.md"
            if readme_path.exists():
                content = readme_path.read_text()
                
                doc = Document(
                    id=f"github_readme_{hashlib.md5(repo_url.encode()).hexdigest()}",
                    content=content,
                    source=f"{repo_url}/blob/{branch}/README.md",
                    metadata={
                        'type': 'github_readme',
                        'repo': repo_url,
                        'branch': branch
                    }
                )
                
                documents.append(doc)
            
            # Index Python docstrings
            for py_file in Path(temp_dir).rglob("*.py"):
                # Match the .git directory exactly (a substring test would also skip .github)
                if ".git" in py_file.parts:
                    continue
                
                # Extract docstrings using AST
                docstrings = self._extract_docstrings(py_file)
                
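                # Hash the repo-relative path: temp_dir changes on every clone, so hashing
                # the absolute path would give the same file a new ID on each sync
                rel_path = py_file.relative_to(temp_dir).as_posix()
                path_hash = hashlib.md5(f"{repo_url}:{rel_path}".encode()).hexdigest()
                
                for func_name, docstring in docstrings.items():
                    doc = Document(
                        id=f"github_docstring_{path_hash}_{func_name}",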
                        content=f"Function: {func_name}\n\n{docstring}",
                        source=f"{repo_url}/blob/{branch}/{py_file.relative_to(temp_dir)}",
                        metadata={
                            'type': 'code_docstring',
                            'repo': repo_url,
                            'file': str(py_file.relative_to(temp_dir)),
                            'function': func_name
                        }
                    )
                    
                    documents.append(doc)
            
            print(f"✅ Synced {len(documents)} GitHub documents")
            return documents
        
        finally:
            shutil.rmtree(temp_dir)
    
    def _extract_docstrings(self, py_file: Path) -> Dict[str, str]:
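        """Extract function and class docstrings from a Python file."""
        import ast
        
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError, UnicodeDecodeError, OSError):
            # Skip files that cannot be read or parsed (templates, Python 2 code, etc.)
            return {}
        
        docstrings = {}
        
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                docstring = ast.get_docstring(node)
                if docstring:
                    # Note: same-named functions in one file overwrite each other
                    docstrings[node.name] = docstring
        
        return docstrings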
    
    def sync_dbt_docs(self, manifest_path: str, catalog_path: str):
        """
        Sync from DBT manifest and catalog.
        
        Extracts:
        - Table descriptions
        - Column descriptions
        - Tests
        - Lineage (upstream/downstream)
        """
        with open(manifest_path) as f:
            manifest = json.load(f)
        
        with open(catalog_path) as f:
            catalog = json.load(f)
        
        documents = []
        
        # dbt keeps models under 'nodes' and sources under a separate top-level
        # 'sources' key, in both manifest.json and catalog.json
        manifest_nodes = {**manifest['nodes'], **manifest.get('sources', {})}
        catalog_nodes = {**catalog.get('nodes', {}), **catalog.get('sources', {})}
        
        for node_id, node in manifest_nodes.items():
            if node['resource_type'] not in ['model', 'source']:
                continue
            
            # Build comprehensive documentation
            content_parts = [
                f"# {node['name']}",
                f"\n**Type:** {node['resource_type']}",
                f"**Schema:** {node['schema']}",
                f"**Database:** {node['database']}",
                f"\n## Description\n{node.get('description') or 'No description provided'}",
            ]
            
            # Add column information
            if node_id in catalog_nodes:
                cat_node = catalog_nodes[node_id]
                content_parts.append("\n## Columns")
                
                for col_name, col_info in cat_node['columns'].items():
                    content_parts.append(
                        f"\n- **{col_name}** ({col_info['type']}): {col_info.get('comment') or ''}"
                    )
            
            # Add tests
            if 'tests' in node:
                content_parts.append("\n## Tests")
                for test in node['tests']:
                    content_parts.append(f"- {test}")
            
            content = "\n".join(content_parts)
            
            doc = Document(
                id=f"dbt_{node_id}",
                content=content,
                source=f"dbt:///{node['original_file_path']}",
                metadata={
                    'type': 'dbt_model',
                    'schema': node['schema'],
                    'database': node['database'],
                    'materialization': node.get('config', {}).get('materialized'),
                    'tags': node.get('tags', [])
                }
            )
            
            documents.append(doc)
        
        print(f"✅ Synced {len(documents)} DBT models")
        return documents
    
    def sync_snowflake_schema(
        self,
        connection_params: Dict,
        databases: List[str]
    ):
        """
        Sync table/view metadata from Snowflake INFORMATION_SCHEMA.
        
        Extracts:
        - Table descriptions (COMMENT)
        - Column names and types
        - Row counts
        - Last updated timestamp
        """
        import snowflake.connector
        
        conn = snowflake.connector.connect(**connection_params)
        cursor = conn.cursor()
        
        documents = []
        
        for database in databases:
            # Query INFORMATION_SCHEMA
            query = f"""
            SELECT 
                t.table_schema,
                t.table_name,
                t.table_type,
                t.row_count,
                t.comment,
                LISTAGG(c.column_name || ' (' || c.data_type || ')', ', ') AS columns
            FROM {database}.INFORMATION_SCHEMA.TABLES t
            LEFT JOIN {database}.INFORMATION_SCHEMA.COLUMNS c
                ON t.table_schema = c.table_schema 
                AND t.table_name = c.table_name
            WHERE t.table_schema NOT IN ('INFORMATION_SCHEMA')
            GROUP BY 1,2,3,4,5
            """
            
            cursor.execute(query)
            
            for row in cursor:
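                schema, table, table_type, row_count, comment, columns = row
                # ROW_COUNT can be NULL (e.g. for views), which would break the ',' format
                row_count_text = f"{row_count:,}" if row_count is not None else "N/A"
                
                content = f"""
# {database}.{schema}.{table}

**Type:** {table_type}
**Row Count:** {row_count_text}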

## Description
{comment or 'No description'}

## Columns
{columns}
"""
                
                doc = Document(
                    id=f"snowflake_{database}_{schema}_{table}",
                    content=content,
                    source=f"snowflake://{database}/{schema}/{table}",
                    metadata={
                        'type': 'snowflake_table',
                        'database': database,
                        'schema': schema,
                        'table': table,
                        'row_count': row_count,
                        'table_type': table_type
                    }
                )
                
                documents.append(doc)
        
        cursor.close()
        conn.close()
        
        print(f"✅ Synced {len(documents)} Snowflake tables")
        return documents
    
    def full_sync(self, sources: List[DataSource]):
        """
        Sync all configured data sources.
        
        Returns: Total documents indexed
        """
        all_documents = []
        
        for source in sources:
            print(f"\n📥 Syncing {source.name} ({source.type})...")
            
            try:
                if source.type == 'confluence':
                    docs = self.sync_confluence(**source.config)
                elif source.type == 'github':
                    docs = self.sync_github_repo(**source.config)
                elif source.type == 'dbt':
                    docs = self.sync_dbt_docs(**source.config)
                elif source.type == 'snowflake':
                    docs = self.sync_snowflake_schema(**source.config)
                else:
                    print(f"⚠️ Unknown source type: {source.type}")
                    continue
                
                all_documents.extend(docs)
            
            except Exception as e:
                print(f"❌ Error syncing {source.name}: {e}")
                continue
        
        # Index all documents
        print(f"\n📊 Indexing {len(all_documents)} documents...")
        
        rag = RAGSystem()
        for doc in all_documents:
            rag.index_document(doc)
        
        print(f"\n✅ Full sync complete: {len(all_documents)} documents indexed")
        
        return len(all_documents)

# Configuration
sources = [
    DataSource(
        name="Engineering Wiki",
        type="confluence",
        config={
            "base_url": "https://company.atlassian.net",
            "space_key": "ENG",
            "api_token": os.getenv("CONFLUENCE_TOKEN")
        }
    ),
    DataSource(
        name="Data Platform Repo",
        type="github",
        config={
            "repo_url": "https://github.com/company/data-platform",
            "branch": "main"
        }
    ),
    DataSource(
        name="DBT Documentation",
        type="dbt",
        config={
            "manifest_path": "/path/to/target/manifest.json",
            "catalog_path": "/path/to/target/catalog.json"
        }
    ),
    DataSource(
        name="Snowflake Warehouse",
        type="snowflake",
        config={
            "connection_params": {
                "user": os.getenv("SNOWFLAKE_USER"),
                "password": os.getenv("SNOWFLAKE_PASSWORD"),
                "account": "company.us-east-1",
                "warehouse": "ANALYTICS_WH"
            },
            "databases": ["ANALYTICS", "RAW"]
        }
    )
]

# Run sync
multi_rag = MultiSourceRAG(vector_db="pinecone")
total_docs = multi_rag.full_sync(sources)
```
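
`MultiSourceRAG` keeps `self.last_sync` in memory, so each new process falls back to the 24-hour window in `sync_confluence`. The sketch below shows one way the timestamps could be persisted between runs. The helper names and the JSON file layout are assumptions for illustration, not part of the class above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def load_sync_state(path: Path) -> Dict[str, datetime]:
    """Read per-source sync timestamps; a missing file means 'never synced'."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {source: datetime.fromisoformat(ts) for source, ts in raw.items()}


def save_sync_state(path: Path, state: Dict[str, datetime]) -> None:
    """Write timestamps as ISO-8601 strings so the file stays human-readable."""
    serialised = {source: ts.isoformat() for source, ts in state.items()}
    path.write_text(json.dumps(serialised, indent=2), encoding="utf-8")


def mark_synced(state: Dict[str, datetime], source: str) -> Dict[str, datetime]:
    """Return a copy of the state with `source` stamped at the current UTC time."""
    updated = dict(state)
    updated[source] = datetime.now(timezone.utc)
    return updated
```

A sync job could call `load_sync_state` at startup, pass the stored value in place of the 24-hour default, and call `save_sync_state` only after indexing succeeds, so a failed run is retried from the same point.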

**Slack Bot Integration:**

```python
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

class SlackRAGBot:
    """
    Slack bot for querying data documentation.
    
    Commands:
    - @databot What tables have customer PII?
    - /ask How is monthly revenue calculated?
    """
    
    def __init__(self, rag_system: RAGSystem):
        self.rag = rag_system
        
        # Initialize Slack app
        self.app = App(token=os.getenv("SLACK_BOT_TOKEN"))
        
        # Register handlers
        # app_mention fires only when the bot is @mentioned
        # (the Slack app needs the app_mentions:read scope)
        self.app.event("app_mention")(self.handle_mention)
        self.app.command("/ask")(self.handle_command)
        
        # Track usage
        self.usage_log = []
    
    def handle_mention(self, event, say):
        """Handle @databot mentions."""
        query = event['text'].split('>', 1)[1].strip()  # Remove @mention
        user = event['user']
        
        # Post a placeholder message while the query runs
        say("🤔 Searching documentation...")
        
        # Query RAG
        result = self.rag.query(query)
        
        # Format response with sources
        response = f"*Answer:*\n{result['answer']}\n\n"
        response += "*Sources:*\n"
        
        for source in result['sources']:
            response += f"• <{source['source']}|{source['citation']}>\n"
        
        say(response)
        
        # Log usage
        self.usage_log.append({
            'user': user,
            'query': query,
            'timestamp': datetime.utcnow()
        })
    
    def handle_command(self, ack, command, respond):
        """Handle /ask slash command."""
        ack()
        
        query = command['text']
        
        # Query RAG
        result = self.rag.query(query)
        
        # Format as rich message block
        blocks = [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Question:* {query}"
                }
            },
            {
                "type": "divider"
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Answer:*\n{result['answer']}"
                }
            },
            {
                "type": "context",
                "elements": [
                    {
                        "type": "mrkdwn",
                        "text": f"Confidence: {result['confidence']:.0%}"
                    }
                ]
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "👍 Helpful"},
                        "value": "positive",
                        "action_id": "feedback_positive"
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "👎 Not helpful"},
                        "value": "negative",
                        "action_id": "feedback_negative"
                    }
                ]
            }
        ]
        
        respond(blocks=blocks)
    
    def start(self):
        """Start Slack bot."""
        handler = SocketModeHandler(self.app, os.getenv("SLACK_APP_TOKEN"))
        print("⚡ Slack bot is running!")
        handler.start()

# Deploy bot
bot = SlackRAGBot(rag)
bot.start()
```
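
Two details of the bot above are easy to get wrong: a message can contain the mention anywhere (not only at the start), and Slack limits the text of a `section` block to 3000 characters. The helpers below are a minimal sketch of how both could be handled before calling `say` or `respond`. The function names are hypothetical and the mention pattern assumes the usual `<@USERID>` token format.

```python
import re

SLACK_SECTION_LIMIT = 3000  # maximum characters allowed in a section block's text

# Mentions arrive in event text as tokens such as <@U012ABCDEF>
MENTION_RE = re.compile(r"<@[A-Z0-9]+>")


def extract_query(text: str) -> str:
    """Remove every <@USERID> token and collapse the remaining whitespace."""
    return " ".join(MENTION_RE.sub(" ", text).split())


def truncate_for_section(text: str, limit: int = SLACK_SECTION_LIMIT) -> str:
    """Cut text so it fits in a section block, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1].rstrip() + "…"
```

In `handle_mention`, `extract_query(event['text'])` could replace the `split('>', 1)` expression, and `truncate_for_section(result['answer'])` could wrap the answer placed into the Block Kit payload.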

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Setup ChromaDB

In [None]:
# pip install chromadb openai langchain
import chromadb

# Persistent client: data is stored on disk under ./chroma_db
client = chromadb.PersistentClient(path='./chroma_db')

# Create the collection, or open it if it already exists
collection = client.get_or_create_collection(
    name='data_docs',
    metadata={'description': 'Technical data documentation'}
)

print(f'Collection ready: {collection.name}')
print(f'Documents: {collection.count()}')

## 2. Documentation ingestion

In [None]:
import os
from openai import OpenAI
client_openai = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def get_embedding(text: str) -> list[float]:
    """Generate an embedding with OpenAI."""
    resp = client_openai.embeddings.create(
        model='text-embedding-ada-002',
        input=text
    )
    return resp.data[0].embedding

docs = [
    {
        'id': 'tabla_ventas',
        'text': '''
Table: ventas
Schema: dwh.ventas
Description: Daily sales transactions since 2020.
Columns:
- venta_id (BIGINT, PK): unique identifier
- fecha (DATE): transaction date
- producto_id (INT, FK): reference to dim_productos
- cantidad (INT): units sold
- total (DECIMAL): amount in USD
Frequency: refreshed daily at 3 AM
Owner: equipo-analytics
        ''',
        'metadata': {'type': 'schema', 'owner': 'analytics'}
    },
    {
        'id': 'pipeline_ventas',
        'text': '''
Pipeline: ventas_daily_etl
Description: processes the previous day's sales
Steps:
1. Extract from S3 (bucket: raw-data/ventas/)
2. Validate with Great Expectations
3. Deduplicate by venta_id
4. Enrich with product data
5. Load into Redshift
Dependencies: dim_productos must be up to date
Alerts: email to data-eng on failure
        ''',
        'metadata': {'type': 'pipeline', 'owner': 'data-eng'}
    },
    {
        'id': 'metrica_revenue',
        'text': '''
Metric: monthly_revenue
Definition: SUM(total) of sales grouped by month
Formula: SELECT DATE_TRUNC('month', fecha) mes, SUM(total) revenue FROM ventas GROUP BY 1
Business owner: CFO
Dashboards: Tableau (Revenue Overview)
        ''',
        'metadata': {'type': 'metric', 'owner': 'finance'}
    }
]

# Add to ChromaDB (upsert keeps the cell safe to re-run with the same ids)
for doc in docs:
    embedding = get_embedding(doc['text'])
    collection.upsert(
        ids=[doc['id']],
        documents=[doc['text']],
        embeddings=[embedding],
        metadatas=[doc['metadata']]
    )

print(f'✅ {len(docs)} documents indexed')
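
The loop above makes one embeddings request per document. The embeddings endpoint also accepts a list of inputs, so larger corpora are usually sent in batches. The sketch below shows the batching logic only: `embed_batch` is a hypothetical callable that you would supply, for example a small wrapper that calls `client_openai.embeddings.create(model=..., input=list(batch))` and returns `[d.embedding for d in resp.data]`. The batch size of 64 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed all texts, calling `embed_batch` once per batch and keeping input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        vectors.extend(batch_vectors)
    return vectors
```

The resulting list lines up index by index with `texts`, so it can be passed to a single `collection.upsert(ids=..., documents=..., embeddings=...)` call.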

## 3. Semantic search

In [None]:
def semantic_search(query: str, top_k: int = 3) -> dict:
    """Return the top_k most similar documents as a Chroma query result (a dict of lists)."""
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results

pregunta = 'Which table contains sales information?'
resultados = semantic_search(pregunta)

print(f'Question: {pregunta}\n')
for i, doc in enumerate(resultados['documents'][0]):
    print(f'Result {i+1}:')
    print(doc[:200] + '...')
    print()
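
To make "semantic search" less opaque, here is a minimal pure-Python sketch of what a vector store does conceptually: score every stored vector against the query vector and keep the best matches. The three-dimensional vectors are made-up toy values, not real embeddings, and a real store such as Chroma uses an approximate index and reports distances (smaller means closer) instead of scanning everything.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], index: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Brute-force search: score every document, return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index with hand-picked vectors (illustrative values only)
toy_index = {
    "tabla_ventas": [0.9, 0.1, 0.0],
    "pipeline_ventas": [0.6, 0.8, 0.0],
    "metrica_revenue": [0.0, 0.2, 0.9],
}

for doc_id, score in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{doc_id}: {score:.3f}")
```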

## 4. RAG: answering with context

In [None]:
def rag_answer(question: str) -> str:
    """Answer a question using RAG."""
    # 1. Retrieve relevant context
    results = semantic_search(question, top_k=2)
    context = '\n\n'.join(results['documents'][0])
    
    # 2. Build the prompt around the retrieved context
    prompt = f'''
You are a data engineering expert. Answer the question using ONLY the information in the context.

Context:
{context}

Question: {question}

Answer (mention the source if relevant):
'''
    
    # 3. Generate the answer
    resp = client_openai.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.1
    )
    
    return resp.choices[0].message.content.strip()

preguntas = [
    'What is the schema of the sales table?',
    'At what time is the sales data refreshed?',
    'Which pipeline processes sales?',
    'How is monthly revenue calculated?'
]

for q in preguntas:
    answer = rag_answer(q)
    print(f'❓ {q}')
    print(f'✅ {answer}\n')
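
The prompt in `rag_answer` asks the model to mention the source, but the context it receives carries no source labels. One possible refinement, sketched below, is to number each retrieved document and include its id so the model has something concrete to cite. The function name and the exact prompt wording are assumptions, not a prescribed format.

```python
from typing import List, Tuple


def build_cited_prompt(question: str, sources: List[Tuple[str, str]]) -> str:
    """Build a RAG prompt whose context blocks are numbered and labelled by id.

    `sources` is a list of (doc_id, text) pairs in retrieval order.
    """
    context_blocks = [
        f"[{number}] (id: {doc_id})\n{text.strip()}"
        for number, (doc_id, text) in enumerate(sources, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "You are a data engineering expert. Answer using ONLY the context below.\n"
        "Cite the sources you rely on by their bracketed number, e.g. [1].\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
```

With the Chroma result from `semantic_search`, the pairs could be built as `list(zip(results['ids'][0], results['documents'][0]))`.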

## 5. RAG with LangChain

In [None]:
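# These import paths match older LangChain releases. From langchain 0.1 onward the
# same classes are provided by langchain_community.vectorstores (Chroma) and
# langchain_openai (OpenAIEmbeddings, ChatOpenAI).
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Embeddings and vector store. Queries must be embedded with the same model
# that was used at ingestion time (text-embedding-ada-002 above).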
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv('OPENAI_API_KEY'))
vectorstore = Chroma(
    persist_directory='./chroma_db',
    embedding_function=embeddings,
    collection_name='data_docs'
)

# LLM
llm = ChatOpenAI(model='gpt-4', temperature=0, openai_api_key=os.getenv('OPENAI_API_KEY'))

# RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vectorstore.as_retriever(search_kwargs={'k': 2}),
    return_source_documents=True
)

result = qa_chain({'query': 'Who is the owner of the ventas table?'})
print('Answer:', result['result'])
print('\nSources:')
for doc in result['source_documents']:
    print('-', doc.metadata)

## 6. Advanced chunking

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# For long documents
long_doc = '''
# Data Warehouse - Complete Guide

## Architecture
Redshift cluster with 5 dc2.large nodes.
Schemas: raw, staging, dwh, analytics.

## Main tables
- ventas: daily transactions
- clientes: demographic information
- productos: full catalog

## Pipelines
Airflow with 15 DAGs running daily.
'''

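splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # measured in characters by default, not tokens
    chunk_overlap=50,    # characters shared between consecutive chunks
    separators=['\n\n', '\n', '. ', ' ']  # tried in order, coarsest first
)

chunks = splitter.split_text(long_doc)
print(f'Document split into {len(chunks)} chunks:\n')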
for i, chunk in enumerate(chunks):
    print(f'Chunk {i+1}: {chunk}\n')
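
To see what `chunk_size` and `chunk_overlap` mean in isolation, the sketch below implements a plain fixed-size sliding window in pure Python. It is deliberately simpler than `RecursiveCharacterTextSplitter`: it ignores separators and cuts strictly by character position, so it only illustrates how overlap repeats the tail of one chunk at the head of the next.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts with the last two characters of the previous one, which is the property that helps a retriever when a relevant sentence straddles a chunk boundary.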

## 7. Metadata filtering

In [None]:
# Search pipelines only
results_pipeline = collection.query(
    query_embeddings=[get_embedding('data automation')],
    n_results=5,
    where={'type': 'pipeline'}
)

print('Pipelines found:')
for doc in results_pipeline['documents'][0]:
    # The first non-empty line of each document holds the pipeline name
    print('-', doc.strip().splitlines()[0])
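
The filter above uses a single field. Chroma's `where` argument also accepts operator expressions, and combining several conditions requires wrapping them in `$and` (which expects at least two clauses). The helper below is a small sketch of how such a filter could be assembled from keyword arguments. The function name is hypothetical, and the `$in` operator assumes a Chroma version that supports it.

```python
from typing import Any, Dict, Optional


def build_where(**conditions: Any) -> Optional[Dict[str, Any]]:
    """Turn field=value pairs into a Chroma-style `where` filter.

    Scalars become {"$eq": value}; lists and tuples become {"$in": [...]}.
    Returns None when no condition is given, which means 'no filter'.
    """
    clauses = []
    for field, value in conditions.items():
        if value is None:
            continue  # skip unset conditions
        if isinstance(value, (list, tuple)):
            clauses.append({field: {"$in": list(value)}})
        else:
            clauses.append({field: {"$eq": value}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]  # a single clause must not be wrapped in $and
    return {"$and": clauses}


print(build_where(type="pipeline", owner="data-eng"))
```

The result can be passed directly as `collection.query(..., where=build_where(type='pipeline', owner='data-eng'))`.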

## 8. RAG best practices

- **Smart chunking**: split on logical sections (headers, paragraphs).
- **Rich metadata**: attach source, timestamp, owner, and version.
- **Hybrid search**: combine semantic search with keyword search.
- **Re-ranking**: use models such as Cohere Rerank to improve result ordering.
- **Caching**: store answers to frequent questions.
- **Freshness**: keep the vector store in sync with documentation changes.
- **Monitoring**: log queries, latency, and answer quality.
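
As an illustration of the hybrid-search item, the sketch below merges two ranked lists of document ids (for example one from semantic search and one from a keyword search) with reciprocal rank fusion. This is one common fusion technique, not the only option. The constant `k = 60` is a widely used default and the sample rankings are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists: each hit adds 1 / (k + rank) to the id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["tabla_ventas", "metrica_revenue", "pipeline_ventas"]  # toy ranking
keyword_hits = ["metrica_revenue", "pipeline_ventas"]                   # toy ranking

print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# → ['metrica_revenue', 'pipeline_ventas', 'tabla_ventas']
```

Documents that appear in both lists rise above a document that ranks first in only one of them, which is the behavior hybrid search is after.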

## 9. Exercises

1. Index your real data warehouse documentation in ChromaDB.
2. Build a Slack chatbot that answers questions about schemas.
3. Implement hybrid search (semantic + BM25) with LangChain.
4. Build a Streamlit dashboard to explore the vector store.