# Baseline RAG Implementation
## Retrieval-Augmented Generation for WordPress Documentation

This notebook implements the baseline Naive RAG architecture from:
**"Retrieval-Augmented Generation for Large Language Models: A Survey"**
Gao et al., 2023 - https://arxiv.org/abs/2312.10997

## 1. Setup and Dependencies

In [None]:
# Install required packages
!pip install sentence-transformers faiss-cpu openai numpy pandas matplotlib seaborn --quiet

In [None]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import json
import time
from typing import List, Dict, Tuple
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ All packages loaded successfully!")

## 2. Sample WordPress Documentation Dataset
In a real implementation, you would scrape the WordPress Codex. This demo uses ten hand-written sample chunks instead.

In [None]:
# Sample WordPress documentation chunks
wp_docs = [
    {
        "id": "doc_001",
        "title": "add_action Function",
        "content": "The add_action() function is used to hook a function to a specific action. Actions are triggered at specific times during WordPress execution. Syntax: add_action($hook, $function_to_add, $priority, $accepted_args);",
        "type": "code"
    },
    {
        "id": "doc_002",
        "title": "WordPress Hooks Overview",
        "content": "Hooks are a way for one piece of code to interact with another piece of code. They make up the foundation for how plugins and themes interact with WordPress Core. There are two types of hooks: Actions and Filters.",
        "type": "concept"
    },
    {
        "id": "doc_003",
        "title": "wp_enqueue_script Function",
        "content": "wp_enqueue_script() is the proper way to add JavaScript files to a WordPress site. It prevents conflicts and ensures scripts load in the correct order. Syntax: wp_enqueue_script($handle, $src, $deps, $ver, $in_footer);",
        "type": "code"
    },
    {
        "id": "doc_004",
        "title": "The Loop in WordPress",
        "content": "The Loop is PHP code used by WordPress to display posts. Using The Loop, WordPress processes each post to be displayed on the current page and formats it according to specified criteria. The Loop extracts data from each post.",
        "type": "concept"
    },
    {
        "id": "doc_005",
        "title": "get_post_meta Function",
        "content": "Retrieve post meta field for a post. Returns the value of a custom field for the specified post. Syntax: get_post_meta($post_id, $key, $single); Returns an array of values if $single is false, or the value itself if true.",
        "type": "code"
    },
    {
        "id": "doc_006",
        "title": "Custom Post Types",
        "content": "WordPress can hold and display many different types of content. A Post Type is a way to define the structure and characteristics of different content types. Custom Post Types allow you to create content types beyond posts and pages.",
        "type": "concept"
    },
    {
        "id": "doc_007",
        "title": "register_post_type Function",
        "content": "Creates a custom post type. Syntax: register_post_type($post_type, $args); The $args array can contain labels, public visibility, menu position, supports features, and more configuration options.",
        "type": "code"
    },
    {
        "id": "doc_008",
        "title": "WordPress Security Best Practices",
        "content": "Always validate and sanitize user input. Use nonces to prevent CSRF attacks. Escape output data. Use prepared statements for database queries. Keep WordPress, themes, and plugins updated.",
        "type": "concept"
    },
    {
        "id": "doc_009",
        "title": "wp_insert_post Function",
        "content": "Inserts or updates a post in the database. Syntax: wp_insert_post($postarr, $wp_error); Returns the post ID on success. The $postarr parameter is an array of post data including post_title, post_content, post_status, etc.",
        "type": "code"
    },
    {
        "id": "doc_010",
        "title": "WordPress REST API",
        "content": "The WordPress REST API provides an interface for applications to interact with WordPress sites by sending and receiving data as JSON objects. It enables developers to create, read, update, and delete WordPress content from external applications.",
        "type": "concept"
    }
]

print(f"✓ Loaded {len(wp_docs)} WordPress documentation chunks")
print(f"  - Code examples: {sum(1 for d in wp_docs if d['type'] == 'code')}")
print(f"  - Concept docs: {sum(1 for d in wp_docs if d['type'] == 'concept')}")
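
If the sample data were replaced by full documentation pages, each page would first need to be split into chunks that share the `id`/`title`/`content`/`type` schema used above. The following is a minimal sketch of one way to do that, assuming fixed-size word windows with overlap. `chunk_document`, `max_words`, and `overlap` are hypothetical names, not part of this notebook or of any library.

```python
from typing import Dict, List


def chunk_document(doc: Dict, max_words: int = 50, overlap: int = 10) -> List[Dict]:
    """Split one document into overlapping word-window chunks.

    Hypothetical helper: each chunk keeps the parent's title and type
    and gets an id of the form '<parent id>_c<chunk number>'.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = doc["content"].split()
    step = max_words - overlap  # how far the window advances each time
    chunks: List[Dict] = []
    start = 0
    while start < len(words):
        window = words[start:start + max_words]
        chunks.append({
            "id": f"{doc['id']}_c{len(chunks)}",
            "title": doc["title"],
            "content": " ".join(window),
            "type": doc["type"],
        })
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

The overlap keeps sentences that straddle a window boundary retrievable from either side. Suitable values for both parameters depend on the encoder's input limit and would need tuning.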

## 3. Baseline RAG Implementation
This section follows the Naive RAG architecture described in the survey paper: indexing, retrieval, and generation.

In [None]:
class BaselineRAG:
    """
    Baseline Naive RAG implementation following Gao et al. 2023.
    Components:
    1. Indexing: Embed documents using sentence-transformers
    2. Retrieval: FAISS vector similarity search
    3. Generation: Simulated (concat retrieved docs as context)
    """
    
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        print(f"Initializing Baseline RAG with {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.index = None
        self.documents = []
        self.embeddings = None
        
    def index_documents(self, documents: List[Dict]):
        """Create vector embeddings and FAISS index"""
        print("Creating embeddings...")
        start = time.time()
        
        self.documents = documents
        texts = [f"{doc['title']} {doc['content']}" for doc in documents]
        
        # Generate embeddings
        self.embeddings = self.encoder.encode(texts, show_progress_bar=True)
        
        # Create an exact (brute-force) FAISS index over L2 distance
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(np.array(self.embeddings).astype('float32'))
        
        elapsed = time.time() - start
        print(f"✓ Indexed {len(documents)} documents in {elapsed:.2f}s")
        
    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve the top-k nearest documents (lower score = closer match)"""
        if self.index is None:
            raise RuntimeError("Call index_documents() before retrieve()")
        
        # Encode query
        query_embedding = self.encoder.encode([query])
        
        # Search FAISS index
        distances, indices = self.index.search(
            np.array(query_embedding).astype('float32'), k
        )
        
        # Return retrieved documents with scores
        results = []
        for idx, dist in zip(indices[0], distances[0]):
            if idx < 0:  # FAISS pads with -1 when k exceeds the index size
                continue
            doc = self.documents[idx].copy()
            doc['score'] = float(dist)  # squared L2 distance from IndexFlatL2
            results.append(doc)
            
        return results
    
    def generate_answer(self, query: str, retrieved_docs: List[Dict]) -> str:
        """Simulate answer generation (a real system would call an LLM here)"""
        context = "\n\n".join([
            f"[{doc['title']}]: {doc['content']}" 
            for doc in retrieved_docs
        ])
        
        # Simulated generation (would be LLM call in production)
        answer = f"Based on the documentation:\n{context}"
        return answer

print("✓ BaselineRAG class defined")
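
`generate_answer` only concatenates the retrieved chunks. In a system with a real LLM, the same context would typically be placed in a prompt together with the question. Below is a minimal sketch of that step, assuming a simple instruction template. `build_prompt` and the template wording are illustrative assumptions, and no LLM API is called.

```python
from typing import Dict, List


def build_prompt(query: str, retrieved_docs: List[Dict]) -> str:
    """Assemble an LLM prompt from a question and retrieved chunks.

    Hypothetical helper: the template wording is an assumption,
    not something prescribed by the survey paper.
    """
    # Number each chunk so an answer can cite its sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {doc['title']}: {doc['content']}"
        for i, doc in enumerate(retrieved_docs, 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The returned string could then be passed to whichever LLM client is used. Keeping prompt assembly separate from the model call makes it possible to inspect exactly what the model would see.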

## 4. Initialize and Test Baseline

In [None]:
# Initialize baseline RAG
baseline_rag = BaselineRAG()
baseline_rag.index_documents(wp_docs)

In [None]:
# Test queries
test_queries = [
    "How do I add a JavaScript file to WordPress?",
    "What is add_action function?",
    "How to create custom post types?",
    "What are WordPress hooks?",
    "How to insert a post programmatically?"
]

print("Testing Baseline RAG:\n")
for query in test_queries[:2]:  # Test first 2
    print(f"Query: {query}")
    results = baseline_rag.retrieve(query, k=2)
    for i, doc in enumerate(results, 1):
        print(f"  {i}. {doc['title']} (squared L2 distance: {doc['score']:.3f})")
    print()
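
The values printed above are distances, so smaller means a closer match. `IndexFlatL2` reports squared Euclidean distances. If the embeddings are unit-length (an assumption worth verifying for the chosen encoder), a squared distance d² corresponds to a cosine similarity of 1 - d²/2. The pure-Python sketch below illustrates that relationship on toy vectors; the helper names are hypothetical.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2_to_cosine(d2: float) -> float:
    """Convert a squared L2 distance to cosine similarity.

    Only valid when both vectors have unit length, because then
    ||a - b||^2 = 2 - 2 * cos(a, b).
    """
    return 1.0 - d2 / 2.0


# Two unit vectors 60 degrees apart
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
d2 = squared_l2(a, b)
print(f"squared L2 = {d2:.3f}, cosine = {squared_l2_to_cosine(d2):.3f}")
```

Under the unit-length assumption, ranking by ascending L2 distance and ranking by descending cosine similarity give the same order.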

## 5. Evaluate Baseline Performance

In [None]:
# Evaluation metrics
def evaluate_retrieval(rag_system, queries_with_expected):
    """
    Evaluate retrieval quality.
    queries_with_expected: List of (query, expected_doc_ids)
    """
    results = {
        'precision_at_1': [],
        'precision_at_3': [],
        'retrieval_times': []
    }
    
    for query, expected_ids in queries_with_expected:
        start = time.perf_counter()  # monotonic, high-resolution timer
        retrieved = rag_system.retrieve(query, k=3)
        elapsed = time.perf_counter() - start
        
        retrieved_ids = [doc['id'] for doc in retrieved]
        
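        # Precision@1 (0.0 if nothing was retrieved)
        p1 = 1.0 if retrieved_ids and retrieved_ids[0] in expected_ids else 0.0
        results['precision_at_1'].append(p1)
        
        # Precision@3: fraction of the 3 retrieved slots holding a relevant doc.
        # A query with a single relevant doc can score at most 1/3.
        hits = sum(1 for rid in retrieved_ids if rid in expected_ids)
        p3 = hits / 3.0
        results['precision_at_3'].append(p3)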
        
        results['retrieval_times'].append(elapsed)
    
    return {
        'avg_precision_at_1': np.mean(results['precision_at_1']),
        'avg_precision_at_3': np.mean(results['precision_at_3']),
        'avg_retrieval_time': np.mean(results['retrieval_times'])
    }

# Test dataset with ground truth
eval_queries = [
    ("How to enqueue JavaScript in WordPress?", ["doc_003"]),
    ("What is add_action?", ["doc_001"]),
    ("How to create custom post types?", ["doc_006", "doc_007"]),
    ("What are hooks in WordPress?", ["doc_002"]),
    ("How to insert a post?", ["doc_009"]),
]

baseline_metrics = evaluate_retrieval(baseline_rag, eval_queries)
print("\nBaseline RAG Performance:")
print(f"  Precision@1: {baseline_metrics['avg_precision_at_1']:.2%}")
print(f"  Precision@3: {baseline_metrics['avg_precision_at_3']:.2%}")
print(f"  Avg Retrieval Time: {baseline_metrics['avg_retrieval_time']*1000:.1f}ms")
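
Four of the five evaluation queries have a single relevant document, so the average Precision@3 on this set cannot exceed 40% even with perfect ranking. Metrics normalized by the number of relevant documents, or based on rank, do not have this ceiling. A minimal sketch of three such metrics follows. The function names are hypothetical and are not part of the notebook's `evaluate_retrieval`.

```python
from typing import Sequence


def hit_at_k(retrieved_ids: Sequence[str], expected_ids: Sequence[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(rid in expected_ids for rid in retrieved_ids[:k]) else 0.0


def recall_at_k(retrieved_ids: Sequence[str], expected_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(expected_ids)
    if not relevant:
        return 0.0
    found = relevant.intersection(retrieved_ids[:k])
    return len(found) / len(relevant)


def reciprocal_rank(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, rid in enumerate(retrieved_ids, 1):
        if rid in expected_ids:
            return 1.0 / rank
    return 0.0


# Example: the single relevant document is ranked second
retrieved = ["doc_002", "doc_001", "doc_003"]
expected = ["doc_001"]
print(hit_at_k(retrieved, expected, 3))     # relevant doc is within the top 3
print(recall_at_k(retrieved, expected, 1))  # but not at rank 1
print(reciprocal_rank(retrieved, expected))
```

Averaging `reciprocal_rank` over all queries gives the mean reciprocal rank (MRR), which rewards placing a relevant document near the top of the list.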

## 6. Save Baseline Results

In [None]:
# Save results for comparison
baseline_results = {
    'system': 'Baseline RAG',
    'metrics': baseline_metrics,
    'timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
}

with open('baseline_results.json', 'w') as f:
    json.dump(baseline_results, f, indent=2)

print("✓ Baseline results saved to baseline_results.json")
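
The saved file is meant for later comparison against other systems. One way that comparison might look is sketched below, assuming a second system stores its results in the same JSON layout. `load_results` and `metric_deltas` are hypothetical helpers.

```python
import json
from typing import Dict


def load_results(path: str) -> Dict:
    """Read a results file written in the layout used above."""
    with open(path) as f:
        return json.load(f)


def metric_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }


# Usage sketch with made-up numbers (not measured results):
example_baseline = {"avg_precision_at_1": 0.5, "avg_precision_at_3": 0.25}
example_candidate = {"avg_precision_at_1": 0.75, "avg_precision_at_3": 0.25}
print(metric_deltas(example_baseline, example_candidate))
```

With real files, the `metrics` entries returned by `load_results` would be passed in place of the example dictionaries.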