# Local HF Pipeline - End-to-End Test

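This notebook validates that the **fully local** KohakuRAG pipeline works
without calling any hosted API. Model weights are typically downloaded from the
Hugging Face Hub on first use unless they are already cached locally. It tests: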

1. Local embeddings (`LocalHFEmbeddingModel` via sentence-transformers)
2. Local LLM chat (`HuggingFaceLocalChatModel` via transformers)
3. Full RAG pipeline: index documents, retrieve, and answer

**Prerequisites:**
- Kernel: `kohaku-gb10` (or your project venv)
- Dependencies installed: `pip install -r local_requirements.txt`
- Vendored packages installed: `pip install -e vendor/KohakuVault && pip install -e vendor/KohakuRAG`
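
Before running the cells below, it can help to confirm that the prerequisites are importable from the active kernel. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the module list simply mirrors the imports in Step 1.

```python
from importlib.util import find_spec

# Top-level modules imported in Step 1 of this notebook.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "sentence_transformers",
    "kohakurag",
    "kohakuvault",
]


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    # find_spec returns None for a top-level module that is not installed;
    # it locates the module without executing it.
    return [name for name in module_names if find_spec(name) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed")
```

If anything is reported missing, re-running the two `pip install` commands above inside the same kernel environment is the first thing to try.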

## Step 1 - Verify imports

In [None]:
import torch
import transformers
import sentence_transformers

print(f"torch:                {torch.__version__}")
print(f"transformers:         {transformers.__version__}")
print(f"sentence-transformers: {sentence_transformers.__version__}")
print(f"CUDA available:       {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:                  {torch.cuda.get_device_name(0)}")
print()

import kohakurag
import kohakuvault
print(f"kohakurag:  {kohakurag.__file__}")
print(f"kohakuvault: {kohakuvault.__file__}")
print("\nAll imports OK")

## Step 2 - Test local embeddings

In [None]:
from kohakurag.embeddings import LocalHFEmbeddingModel

# Use a small, fast model for testing
EMBED_MODEL_ID = "BAAI/bge-base-en-v1.5"
embedder = LocalHFEmbeddingModel(model_name=EMBED_MODEL_ID)
print(f"Embedding model loaded: {EMBED_MODEL_ID}")
print(f"Embedding dimension:    {embedder.dimension}")

In [None]:
import numpy as np

test_texts = [
    "Solar panels convert sunlight into electricity.",
    "Photovoltaic cells generate power from solar radiation.",
    "The capital of France is Paris.",
]

vecs = await embedder.embed(test_texts)
print(f"Embedding shape: {vecs.shape}")
print(f"Dtype:           {vecs.dtype}")

# The vectors are already normalized, so the dot product equals cosine similarity
sim_01 = float(np.dot(vecs[0], vecs[1]))
sim_02 = float(np.dot(vecs[0], vecs[2]))
print(f"\nSimilarity (solar vs photovoltaic): {sim_01:.4f}  (should be high)")
print(f"Similarity (solar vs Paris):         {sim_02:.4f}  (should be low)")
assert sim_01 > sim_02, "Semantic similarity check failed!"
print("\nEmbedding sanity check PASSED")
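
The dot product used in the cell above equals cosine similarity only when every embedding has unit length. If an embedder returns unnormalized vectors, the norms have to be divided out explicitly. A dependency-free sketch follows; `cosine_similarity` is an illustrative helper, not part of KohakuRAG.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, without assuming unit norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Scaling a vector changes its dot product but not its cosine similarity.
u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice the length
w = [-4.0, 3.0]  # orthogonal to u
print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

The function accepts any sequence of floats, so rows of the `vecs` array could be passed to it directly.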

## Step 3 - Test local LLM chat

This loads a local HF model for generation. The default is `Qwen/Qwen2.5-7B-Instruct`.

**Note:** If this is too large for your GPU, change `LLM_MODEL_ID` to a smaller model
like `Qwen/Qwen2.5-1.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.
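
A rough way to judge whether a model fits is to multiply its parameter count by the bytes stored per parameter. The sketch below does that for the three models mentioned here, using the nominal parameter counts from their names (an assumption, not a measured value). It covers weights only; activations and the KV cache need additional memory, and the `"auto"` dtype is not modelled.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def estimate_weight_gib(num_params, dtype):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Nominal parameter counts taken from the model names (assumed, not measured).
for label, num_params in [("7B", 7e9), ("1.5B", 1.5e9), ("1.1B", 1.1e9)]:
    print(f"{label:>5} @ bf16: ~{estimate_weight_gib(num_params, 'bf16'):.1f} GiB")
```

By this estimate the 7B default needs about 13 GiB for its weights in bf16, which is why a smaller model is the safer choice on GPUs with less memory.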

In [None]:
# Configure the LLM model - adjust if needed for your hardware
LLM_MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # change to smaller model if OOM
LLM_DTYPE = "bf16"  # "bf16", "fp16", or "auto"

print(f"Will load: {LLM_MODEL_ID} ({LLM_DTYPE})")

In [None]:
from kohakurag.llm import HuggingFaceLocalChatModel

chat = HuggingFaceLocalChatModel(
    model=LLM_MODEL_ID,
    dtype=LLM_DTYPE,
    max_new_tokens=256,
    temperature=0.0,  # greedy for reproducibility
)
print(f"LLM loaded: {LLM_MODEL_ID}")

In [None]:
response = await chat.complete(
    "What is 2 + 2? Answer with just the number.",
    system_prompt="You are a helpful assistant. Be concise.",
)
print(f"LLM response: {response!r}")
assert "4" in response, f"Expected '4' in response, got: {response}"
print("LLM sanity check PASSED")
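
The cells in this notebook use top-level `await`, which Jupyter supports because the kernel already runs an event loop. In a plain Python script the same calls would need to be wrapped in a coroutine and driven with `asyncio.run`. A minimal sketch, where `fake_complete` is a hypothetical stand-in for `chat.complete`:

```python
import asyncio


async def fake_complete(prompt):
    """Stand-in coroutine used here in place of a real chat model call."""
    await asyncio.sleep(0)  # yield to the event loop once, as real async I/O would
    return f"echo: {prompt}"


async def main():
    # Inside a coroutine, `await` works the same way as in a notebook cell.
    return await fake_complete("What is 2 + 2?")


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```

Calling `asyncio.run` from inside a notebook cell raises an error because a loop is already running there, so this pattern is only for scripts.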

## Step 4 - Full RAG pipeline (in-memory)

This test creates a small document set, indexes it into an in-memory store,
then runs retrieval + answer generation. No database file needed.
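
Conceptually, an in-memory vector index only needs to hold embeddings in a Python container and rank them against a query embedding. The sketch below illustrates that idea with a hypothetical `ToyVectorStore`; it is not how `InMemoryNodeStore` is implemented, and its method names are made up for illustration.

```python
import heapq


class ToyVectorStore:
    """Toy in-memory store: a dict of vectors searched by brute-force dot product."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, node_id, vector):
        # Insert a new vector or overwrite the one stored under the same ID.
        self._vectors[node_id] = list(vector)

    def search(self, query, top_k):
        # Score every stored vector against the query and keep the best top_k.
        scored = (
            (sum(q * x for q, x in zip(query, vec)), node_id)
            for node_id, vec in self._vectors.items()
        )
        return [(node_id, score) for score, node_id in heapq.nlargest(top_k, scored)]


store_demo = ToyVectorStore()
store_demo.upsert("doc1", [1.0, 0.0])
store_demo.upsert("doc2", [0.6, 0.8])
store_demo.upsert("doc3", [0.0, 1.0])
print(store_demo.search([1.0, 0.0], top_k=2))
```

Brute-force scoring like this is adequate for a handful of test documents; nothing is written to disk, so the index disappears when the kernel restarts.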

In [None]:
# Sample documents about sustainable AI
documents = [
    {
        "id": "doc1",
        "title": "Energy Efficiency in Data Centers",
        "text": (
            "Modern data centers consume approximately 1-2% of global electricity. "
            "Liquid cooling systems can reduce energy usage by up to 40% compared to "
            "traditional air cooling. Google reported a PUE (Power Usage Effectiveness) "
            "of 1.10 across its fleet in 2023."
        ),
    },
    {
        "id": "doc2",
        "title": "Carbon Footprint of LLM Training",
        "text": (
            "Training GPT-3 (175B parameters) was estimated to emit approximately "
            "552 tonnes of CO2. Smaller models like Llama-2-7B require roughly "
            "30x less compute. Techniques such as mixed-precision training and "
            "gradient checkpointing can further reduce energy consumption by 20-30%."
        ),
    },
    {
        "id": "doc3",
        "title": "Renewable Energy for AI Workloads",
        "text": (
            "Microsoft committed to being carbon negative by 2030. Their Azure "
            "data centers in Sweden run on 100% renewable energy. Solar-powered "
            "inference clusters have shown 15% cost savings in regions with "
            "high solar irradiance."
        ),
    },
]

print(f"Prepared {len(documents)} test documents")

In [None]:
from kohakurag.types import NodeKind, StoredNode
from kohakurag.pipeline import RAGPipeline
from kohakurag.datastore import InMemoryNodeStore

# Build StoredNodes with embeddings
nodes = []
all_texts = [doc["text"] for doc in documents]
all_vecs = await embedder.embed(all_texts)

for doc, vec in zip(documents, all_vecs):
    node = StoredNode(
        node_id=f"{doc['id']}:p0:s0",
        parent_id=f"{doc['id']}:p0",
        kind=NodeKind.PARAGRAPH,
        title=doc["title"],
        text=doc["text"],
        metadata={"document_id": doc["id"]},
        embedding=vec,
        child_ids=[],
    )
    nodes.append(node)

# Create in-memory store and index
store = InMemoryNodeStore()
await store.upsert_nodes(nodes)
print(f"Indexed {len(nodes)} nodes into in-memory store")

In [None]:
# Assemble pipeline with local components
pipeline = RAGPipeline(
    store=store,
    embedder=embedder,
    chat_model=chat,
    top_k=3,
)
print("Pipeline assembled (local embedder + local LLM + in-memory store)")

In [None]:
# Test retrieval only
question = "How much CO2 does training a large language model produce?"

result = await pipeline.retrieve(question, top_k=3)
print(f"Question: {question}")
print(f"Retrieved {len(result.matches)} matches:\n")
for i, match in enumerate(result.matches, start=1):
    print(f"  [{i}] score={match.score:.4f}  {match.node.title}")
    print(f"      {match.node.text[:100]}...\n")

In [None]:
# Test full QA (retrieve + generate)
answer = await pipeline.answer(question)

print(f"Question: {answer['question']}\n")
print(f"Response:\n{answer['response']}")

## Step 5 - Structured QA (JSON output)

This tests the `run_qa` method with the same prompt templates used in production.

In [None]:
system_prompt = (
    "You must answer strictly based on the provided context snippets. "
    "Do NOT use external knowledge. Respond in JSON with keys: "
    "explanation, answer, answer_value, ref_id."
)

user_template = """Question: {question}

Context:
{context}

Additional info: {additional_info_json}

Return STRICT JSON:
- explanation: 1-2 sentences
- answer: short answer
- answer_value: numeric/categorical value or "is_blank"
- ref_id: list of document ids used

JSON Answer:"""

structured_result = await pipeline.run_qa(
    question="What PUE did Google report for its data centers?",
    system_prompt=system_prompt,
    user_template=user_template,
    additional_info={"answer_unit": "PUE ratio"},
    top_k=3,
)

print(f"Answer:       {structured_result.answer.answer}")
print(f"Answer value: {structured_result.answer.answer_value}")
print(f"Ref IDs:      {structured_result.answer.ref_id}")
print(f"Explanation:  {structured_result.answer.explanation}")
print(f"\nRaw LLM output:\n{structured_result.raw_response[:500]}")
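
Local models may not return bare JSON; the object can be wrapped in extra prose or a code fence. If the raw output needs to be inspected or re-parsed by hand, something like the following could be used. This is a sketch with a hypothetical helper, `extract_first_json_object`, and says nothing about how `run_qa` parses responses internally.

```python
import json


def extract_first_json_object(raw):
    """Return the first JSON object embedded in `raw`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # any text that follows it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = raw.find("{", start + 1)
    return None


sample = 'Here is the result:\n{"answer": "1.10", "ref_id": ["doc1"]}\nDone.'
print(extract_first_json_object(sample))
```

Applied to `structured_result.raw_response`, a `None` result would indicate that the model did not produce a parseable object at all.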

## Step 6 - Offline validation

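Remove any hosted-API keys from the environment and re-run a query. A successful
run shows that the pipeline does not depend on remote API credentials; it does not
by itself rule out other network traffic, such as model downloads.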

In [None]:
import os

# Clear any API keys so that a hosted provider cannot be used by accident
for key in ["OPENROUTER_API_KEY", "OPENAI_API_KEY", "JINA_API_KEY"]:
    os.environ.pop(key, None)

# Re-run a query - should work without any API keys
offline_answer = await pipeline.answer(
    "What percentage of global electricity do data centers use?"
)
print(f"Offline response:\n{offline_answer['response']}")
print("\nOFFLINE VALIDATION PASSED - no API keys needed!")

## Summary

If all cells above ran successfully, your local HF pipeline is working:

| Component | Provider | Model |
|-----------|----------|-------|
| Embeddings | `LocalHFEmbeddingModel` | `BAAI/bge-base-en-v1.5` |
| LLM | `HuggingFaceLocalChatModel` | `LLM_MODEL_ID` (default `Qwen/Qwen2.5-7B-Instruct`) |
| Vector store | `InMemoryNodeStore` | (in-memory, no DB needed) |

To use these components with the full production pipeline (`KVaultNodeStore` + pre-indexed docs),
set `llm_provider = "hf_local"` and `embedding_model = "hf_local"` in your config.