# Minimal Local RAG: Llama 3.2 1B Implementation
**Context:** Technical Interview Assignment | **Infrastructure:** Local CPU Optimized

## 1. Executive Summary
This notebook implements a self-contained **Retrieval-Augmented Generation (RAG)** system designed to run efficiently on standard local hardware. By utilizing **Quantized Small Language Models (SLMs)** and a file-based vector store, we achieve low-latency semantic search and generation without requiring GPU resources or external cloud dependencies.

## 2. Architecture: Local vs. Production
To adhere to the assignment constraints while demonstrating readiness for enterprise scale, the system is designed with a clear separation between the current "Minimal" implementation and the standard "Production" architecture:

* **Inference Strategy:**
    * *Current:* We use **Llama-3.2-1B-Instruct** (4-bit `Q4_K_M` GGUF quantization) running on `llama.cpp`, which keeps the CPU memory footprint of the model under roughly 1 GB.
    * *Production:* This would scale to larger enterprise models (e.g., Llama 3 70B) hosted on **vLLM** or **Triton Inference Server** with GPU acceleration.

* **Vector Storage:**
    * *Current:* **ChromaDB** is configured as a local persistent client for simplicity and zero-setup.
    * *Production:* Data would migrate to a distributed vector database like **Milvus** or **Weaviate** to handle millions of vectors with high availability.

* **Orchestration & Ingestion:**
    * *Current:* A linear Python pipeline handles document processing.
    * *Production:* Automated workflows using **Kubeflow** or **Airflow** DAGs would manage continuous data ingestion and retraining pipelines.
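
As a rough sanity check on the sub-1 GB figure, the weight footprint can be estimated from the parameter count and the average number of bits per weight. The sketch below assumes about 1.24 billion parameters and about 4.8 bits per weight for a mixed 4-bit scheme. Both numbers are approximations introduced here, not values measured from the artifact, and the estimate excludes the KV cache and runtime buffers.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed figures (not read from the GGUF file):
N_PARAMS = 1.24e9       # approximate parameter count of a "1B" Llama 3.2 model
BITS_PER_WEIGHT = 4.8   # rough average for a mixed 4-bit quantization

weights_gb = estimate_weight_memory_gb(N_PARAMS, BITS_PER_WEIGHT)
print(f"Estimated weight footprint: {weights_gb:.2f} GB")
```

The same function makes the production contrast concrete: plugging in a 70B parameter count at 16 bits per weight yields a figure far beyond typical CPU memory budgets, which is why that tier is assigned to GPU-backed serving.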

---
### Phase 1: System Initialization
*Objective: Configure the runtime environment. The system automates dependency checks and provisions the model artifact from the registry if not present locally.*
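
One way the provisioning step could be hardened is to check the downloaded file against a known SHA-256 digest before loading it. The sketch below is illustrative only: `sha256_of_file` and `is_artifact_valid` are hypothetical helpers that do not exist in this project, and the expected digest would have to come from a trusted source such as the model registry.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in blocks so large model files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def is_artifact_valid(path: str, expected_sha256: str) -> bool:
    """Return True only if the file exists and its digest matches."""
    return os.path.isfile(path) and sha256_of_file(path) == expected_sha256.lower()

# Demonstration on a throwaway file (stands in for the GGUF artifact)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo artifact")
    demo_path = tmp.name

demo_digest = hashlib.sha256(b"demo artifact").hexdigest()
demo_ok = is_artifact_valid(demo_path, demo_digest)   # matching digest
demo_bad = is_artifact_valid(demo_path, "0" * 64)     # deliberately wrong digest
os.remove(demo_path)
```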

In [None]:
import sys
import os
import logging
import warnings
from huggingface_hub import hf_hub_download

# Suppress llama.cpp warnings about duplicate tokens
warnings.filterwarnings('ignore', category=RuntimeWarning, module='llama_cpp.llama')

# --- 1. CONFIGURATION ---
# Defines the specific model artifact to be used (Quantized Llama 3.2 1B)
REPO_ID = "bartowski/Llama-3.2-1B-Instruct-GGUF"
FILENAME = "Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# Define paths relative to this notebook
# Notebook is in 'notebooks/', so we go up one level ('..') to reach root
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
MODEL_DIR = os.path.join(PROJECT_ROOT, "models")
MODEL_PATH = os.path.join(MODEL_DIR, FILENAME)
DB_PATH = os.path.join(PROJECT_ROOT, "chroma_db")
DATA_DIR = os.path.join(PROJECT_ROOT, "data")

# --- 2. AUTOMATED MODEL PROVISIONING ---
# Checks for the model artifact. If missing, it downloads it automatically.
# In a real HPE environment, this would pull from a secure container registry or Artifactory.
if not os.path.exists(MODEL_PATH):
    print("\nModel artifact not found locally. Initiating download from HuggingFace...")
    print(f"   • Repo: {REPO_ID}")
    
    os.makedirs(MODEL_DIR, exist_ok=True)
    try:
        hf_hub_download(
            repo_id=REPO_ID,
            filename=FILENAME,
            local_dir=MODEL_DIR,
            local_dir_use_symlinks=False
        )
        print("Download complete.")
    except Exception as e:
        print(f"Critical Error: Failed to download model. {e}")
        raise
else:
    print("Model artifact found. Ready for inference.")

# --- 3. MODULE IMPORT SETUP ---
# Appends the project root to the system path to allow importing from 'src'
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

try:
    from src.rag_engine import RAGSystem, LocalVectorStore, ingest_file
    print("RAG Engine Core loaded successfully.")
except ImportError as e:
    print("Error: Could not import 'src.rag_engine'. Verify the 'src' folder exists in project root.")
    raise e

# --- 4. RUNTIME INITIALIZATION ---
print("\nBooting RAG Subsystems...")

# Initialize Vector Store (ChromaDB)
# Persistence ensures we don't need to re-index data on every restart
store = LocalVectorStore(persistence_path=DB_PATH)

# Initialize Inference Engine (Llama.cpp)
# Loads the GGUF model into CPU memory
rag = RAGSystem(model_path=MODEL_PATH, vector_store=store)


Model artifact found. Ready for inference.
RAG Engine Core loaded successfully.

Booting RAG Subsystems...
Initializing Vector Store...
Loading Llama 3.2 1B (Quantized)...


llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


Model Loaded.


### Phase 2: Knowledge Ingestion (ETL Simulation)
*Objective: Transform unstructured technical documentation into a semantic vector index.*

In a large-scale production context, this ingestion process is typically handled by distributed **ETL pipelines** (e.g., using **Apache Airflow**) or dedicated data curation frameworks to manage continuous updates.

For this specific assignment, we simulate this workflow using a lightweight **Recursive Chunking Strategy** (500 chars). This approach balances semantic context retention with the memory constraints of local vector search.

**Dynamic Multi-Format Ingestion:**
The system automatically discovers and processes all supported files (PDF, TXT, MD) in the data directory, providing a flexible and scalable document ingestion pipeline suitable for enterprise environments where new documents are continuously added.
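
The internals of `ingest_file` are not shown in this notebook, so the following is only a minimal sketch of what a recursive character chunker can look like. The function name `recursive_chunk` and the separator order are assumptions chosen for illustration. The idea is to split on the coarsest separator first (paragraphs), fall back to finer ones (lines, sentences, words) for pieces that are still too long, and hard-split only as a last resort.

```python
def recursive_chunk(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily pack consecutive parts into chunks that fit the limit.
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Any piece still over the limit is re-split with finer separators.
        finer = separators[index + 1:]
        result = []
        for chunk in chunks:
            result.extend(recursive_chunk(chunk, max_chars, finer))
        return result

    # No separator present at all: fall back to fixed-width slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

sample = ("Retrieval-augmented generation grounds answers in documents. " * 30).strip()
chunks = recursive_chunk(sample, max_chars=500)
print(len(chunks), max(len(c) for c in chunks))
```

This sketch produces non-overlapping chunks. Many chunkers also carry a small overlap between neighbours so that sentences cut at a boundary remain retrievable. Whether this project does so is not stated here.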

In [2]:
# Auto-discover and ingest all supported files in the data directory
print("Scanning data directory for documents...")

# Supported file extensions
SUPPORTED_EXTENSIONS = ['.pdf', '.txt', '.md']

# Discover all files
data_files = []
if os.path.exists(DATA_DIR):
    for filename in os.listdir(DATA_DIR):
        file_path = os.path.join(DATA_DIR, filename)
        if os.path.isfile(file_path):
            file_ext = os.path.splitext(filename)[1].lower()
            if file_ext in SUPPORTED_EXTENSIONS:
                data_files.append(file_path)

print(f"Found {len(data_files)} document(s) to process:")
for file_path in data_files:
    print(f"   • {os.path.basename(file_path)}")

# Ingest all discovered files
print("\nStarting ingestion pipeline...")
for file_path in data_files:
    print(f"\nProcessing: {os.path.basename(file_path)}")
    try:
        ingest_file(file_path, store)
    except Exception as e:
        print(f"   ⚠️  Error processing {os.path.basename(file_path)}: {e}")

print("\n✅ Document ingestion complete!")

Scanning data directory for documents...
Found 3 document(s) to process:
   • 2502019_AI-Governance-Dialogue-Steering-the-Future-of-AI-2025.pdf
   • deepseek_v3_specs.txt
   • GEP-June-2025.pdf

Starting ingestion pipeline...

Processing: 2502019_AI-Governance-Dialogue-Steering-the-Future-of-AI-2025.pdf
Extracting text from PDF: c:\Users\kartik.saha\Desktop\rag-takehome-hpe\data\2502019_AI-Governance-Dialogue-Steering-the-Future-of-AI-2025.pdf
Extracted 257993 characters from 100 pages.
Embedding 574 chunks...
Indexed 574 chunks.

Processing: deepseek_v3_specs.txt
Embedding 4 chunks...
Indexed 4 chunks.

Processing: GEP-June-2025.pdf
Extracting text from PDF: c:\Users\kartik.saha\Desktop\rag-takehome-hpe\data\GEP-June-2025.pdf
Extracted 977299 characters from 254 pages.
Embedding 2172 chunks...
Indexed 2172 chunks.

✅ Document ingestion complete!


### Phase 3: Telemetry & Observability
*Objective: Validate response latency and retrieval accuracy (Grounding).*

In a distributed production environment, observability is typically managed via APM tools like **Prometheus**, **Grafana**, or **OpenTelemetry** traces to monitor system health and **Service Level Indicators (SLIs)**.

For this local implementation, we inject a lightweight telemetry wrapper directly into the execution path. This provides immediate, real-time visibility into:
1.  **Retrieval Latency:** The specific time cost of the vector search operation.
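2.  **Semantic Distance:** The distance scores returned by ChromaDB (squared L2 under its default metric, so lower means closer). These let us audit the "Grounding" of the retrieved context and flag answers that rest on weakly matching chunks, which are more likely to be hallucinated.
3.  **Generation Latency:** The wall-clock time of the full `rag_system.query` call on CPU, covering prompt construction, tokenization, and generation by the quantized model (plus any retrieval the engine performs internally).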
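
To make the "lower is better" reading concrete, the sketch below compares squared L2 distance with cosine distance on small hand-made vectors. The vectors are invented for illustration and are not real embeddings. For unit-length vectors the two measures are related by `squared_l2 = 2 * cosine_distance`, so they rank neighbours identically. Whether this project's embedder returns normalized vectors is an assumption that would need checking.

```python
import math

def squared_l2(a, b):
    """Sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy vectors standing in for a query and two candidate chunks
query = normalize([0.2, 0.9, 0.4])
near = normalize([0.25, 0.85, 0.45])
far = normalize([0.9, -0.1, 0.3])

for name, vec in (("near", near), ("far", far)):
    print(f"{name}: squared L2 = {squared_l2(query, vec):.4f}, "
          f"cosine distance = {cosine_distance(query, vec):.4f}")
```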

In [3]:
import time

# Wrap a RAG query with timing and retrieval diagnostics
def query_with_telemetry(rag_system, user_query):
    print(f"❓ User Query: {user_query}")
    print("-" * 50)
    
    # 1. Measure retrieval time (wall-clock)
    start_time = time.perf_counter()
    results = rag_system.vector_store.collection.query(
        query_embeddings=rag_system.vector_store.embedder.encode([user_query]).tolist(),
        n_results=3
    )
    retrieval_time = time.perf_counter() - start_time
    
    # 2. Show the "Why": print the distance score of each retrieved chunk.
    # Chroma always returns distances (squared L2 by default; cosine distance is
    # 1 - cosine similarity), so a smaller value means a closer match either way.
    print(f"🔍 Retrieval Phase ({retrieval_time:.4f}s):")
    for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0])):
        print(f"   [Chunk {i+1}] Distance Score: {dist:.4f} | Content: {doc[:50]}...")
        
    # 3. Measure end-to-end answer time (wall-clock)
    start_gen = time.perf_counter()
    answer = rag_system.query(user_query)  # existing RAG entry point
    gen_time = time.perf_counter() - start_gen
    
    print("-" * 50)
    print(f"🤖 Model Response ({gen_time:.2f}s):")
    print(answer)
    print("=" * 50)

# Run it
query_with_telemetry(rag, "With global compute access becoming a strategic factor in 2025, how do economic trends influence scaling RAG systems for enterprises?")

❓ User Query: With global compute access becoming a strategic factor in 2025, how do economic trends influence scaling RAG systems for enterprises?
--------------------------------------------------
🔍 Retrieval Phase (0.1343s):
   [Chunk 1] Distance Score: 0.8351 | Content: tion that diminishing returns may have been reache...
   [Chunk 2] Distance Score: 0.9962 | Content: ation technologies.”
Mr Koné warned that the lack ...
   [Chunk 3] Distance Score: 1.0195 | Content: in 2000-04 to less 
than 30 percent in 2019-23. 
T...




--------------------------------------------------
🤖 Model Response (1.45s):
I do not know.


### Phase 4: Interactive Validation (Acceptance Testing)
*Objective: Execute live queries to verify retrieval precision and generation quality against the ingested knowledge base.*

We perform two distinct types of validation tests to ensure the system meets the expected **Quality of Service (QoS)**:
* **Test Case 1 (Factual Recall):** Validates the system's ability to retrieve precise quantitative data (e.g., hardware specifications) which the base model would not know.
* **Test Case 2 (Conceptual Synthesis):** Tests the system's capacity to retrieve multiple chunks and synthesize a technical explanation of a complex architectural component.
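
These checks could also be automated rather than judged by eye. The sketch below is a hypothetical harness, not part of the project. `evaluate_case` takes any callable that maps a question to an answer string (in this notebook that would be `rag.query`), flags refusals such as "I do not know", and reports which expected keywords are missing. A stub answer function with a trivial canned reply stands in for the model so the example runs on its own.

```python
REFUSAL_MARKERS = ("i do not know", "i don't know")

def evaluate_case(answer_fn, question, expected_keywords):
    """Run one acceptance case and summarize the outcome."""
    answer = answer_fn(question)
    lowered = answer.lower()
    refused = any(marker in lowered for marker in REFUSAL_MARKERS)
    missing = [kw for kw in expected_keywords if kw.lower() not in lowered]
    return {
        "question": question,
        "refused": refused,
        "missing": missing,
        "passed": not refused and not missing,
    }

def stub_answer(question):
    """Stand-in for the real model: one canned reply, otherwise a refusal."""
    canned = {"What is 2 + 2?": "The answer is 4."}
    return canned.get(question, "I do not know.")

results = [
    evaluate_case(stub_answer, "What is 2 + 2?", ["4"]),
    evaluate_case(stub_answer, "An unanswerable question?", ["anything"]),
]
for outcome in results:
    status = "PASS" if outcome["passed"] else "FAIL"
    print(f"{status}: {outcome['question']} (refused={outcome['refused']})")
```

Keyword matching is a coarse proxy for answer quality. It catches refusals and missing facts, but it cannot tell a well-grounded synthesis from a fluent guess.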

In [4]:
# Test 1: Factual Query
query_with_telemetry(rag, "How is AI governance evolving in 2025 with the rise of AI agents, and what implications does it have for building RAG systems responsibly?")

# Test 2: Architecture Query
query_with_telemetry(rag, "Explain the Multi-Head Latent Attention.")

❓ User Query: How is AI governance evolving in 2025 with the rise of AI agents, and what implications does it have for building RAG systems responsibly?
--------------------------------------------------
🔍 Retrieval Phase (0.0132s):
   [Chunk 1] Distance Score: 0.4657 | Content: and Trends in AI Governance  ........................
   [Chunk 2] Distance Score: 0.4673 | Content: 
coordination, technical standards, infrastructure...
   [Chunk 3] Distance Score: 0.4810 | Content: . (2025, April 17), Fn. 1
5 Gabriel, I., Manzini, ...
--------------------------------------------------
🤖 Model Response (2.97s):
The text does not provide detailed information about AI governance evolution, but it mentions that the field of AI has grown to encompass chat-style tools and AI agents. It suggests that AI agents are becoming more prominent, but the specific implications for building RAG (Robust Architectural Governance) systems are not detailed.
❓ User Query: Explain the Multi-Head Latent 