# LlamaIndex Enterprise RAG Workbook

This notebook shows how to build a retrieval-augmented generation (RAG) pipeline over the assets in `data/` using LlamaIndex. The focus is on enterprise-friendly chunking, configuration hygiene, and reproducibility.


## 1. Prerequisites

- Python 3.11+ with access to this repository root
- `pip install -r requirements-dev.txt`, once the packages from the install cell below have been added to that file
- Secrets stored in `.env` (never commit credentials)

> Run the install cell only when setting up a new environment or upgrading tooling.


In [None]:
%pip install --quiet llama-index==0.10.54 llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv pypdf


## 2. Configure runtime

### 2.1 Load environment configuration

Secrets and deployment toggles live in `.env`. Set `LLM_PROVIDER` (`openai` or `azure`) and the matching API keys before running the next cell.


In [None]:
from pathlib import Path
import os

from dotenv import load_dotenv

project_root = Path.cwd()
if project_root.name == "notebooks":
    project_root = project_root.parent

dotenv_path = project_root / ".env"
if dotenv_path.exists():
    load_dotenv(dotenv_path)
    print(f"Loaded environment from {dotenv_path}")
else:
    print("No .env file found; relying on existing environment variables.")

LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai").lower()
print(f"Using LLM provider: {LLM_PROVIDER}")


### 2.2 Configure LLM and embedding providers

This block wires up default settings for OpenAI and Azure OpenAI and can be extended for other providers. Failing fast when keys are missing surfaces misconfiguration before any API call is made, which also keeps security audits simple.


In [None]:
from llama_index.core import Settings

if LLM_PROVIDER == "openai":
    if not os.getenv("OPENAI_API_KEY"):
        raise EnvironmentError("Set OPENAI_API_KEY before running index builds.")
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.0)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
elif LLM_PROVIDER == "azure":
    if not os.getenv("AZURE_OPENAI_API_KEY"):
        raise EnvironmentError("Set AZURE_OPENAI_API_KEY and related settings before running index builds.")
    from llama_index.llms.azure_openai import AzureOpenAI
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

    Settings.llm = AzureOpenAI(
        deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
        api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview"),
        temperature=0.0,
    )
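    Settings.embed_model = AzureOpenAIEmbedding(
        model=os.getenv("AZURE_OPENAI_EMBED_MODEL", "text-embedding-3-large"),
        deployment_name=os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "text-embedding"),
        # Pass the same API version as the LLM so both clients target one API surface.
        api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview"),
    )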
else:
    raise ValueError(f"Unsupported provider '{LLM_PROVIDER}'. Extend this cell for your stack.")
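
The fail-fast check above can be generalized so that a single error names every missing variable for the selected provider. This is a minimal sketch: `REQUIRED_ENV`, `missing_env_vars`, and `require_env` are hypothetical helpers (not part of LlamaIndex), and the inclusion of `AZURE_OPENAI_ENDPOINT` is an assumption about what an Azure deployment needs.

```python
import os

# Hypothetical mapping of provider name -> environment variables it needs.
# The Azure entries are an assumption; adjust them to your deployment.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(provider, env=None):
    """Return the required variables that are unset or empty for `provider`."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"Unsupported provider '{provider}'.")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


def require_env(provider, env=None):
    """Raise one error naming every missing variable; return None otherwise."""
    missing = missing_env_vars(provider, env)
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
```

Calling `require_env(LLM_PROVIDER)` before the provider-specific imports would report all gaps at once instead of one per run.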


## 3. Load and profile source documents

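We target the curated corpora in `data/`. The reader handles both text and PDF assets, and the `file_name` metadata attached to each document keeps it traceable to its source. The PDF reader typically yields one document per page, so the PDF can appear on several lines in the profile output.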


In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader

DATA_DIR = project_root / "data"
assert DATA_DIR.exists(), f"Expected data directory at {DATA_DIR}"

input_files = [
    DATA_DIR / "data.txt",
    DATA_DIR / "2101.03697v3.pdf",
]

reader = SimpleDirectoryReader(
    input_files=input_files,
    file_extractor={".pdf": PDFReader()},
)

documents = reader.load_data()
for doc in documents:
    file_name = doc.metadata.get("file_name", "unknown")
    word_count = len(doc.text.split())
    print(f"{file_name:<25} -> {word_count:>6} words")
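
Because a PDF may be split into several documents, a per-file rollup can be easier to read than the per-document listing above. This is a minimal sketch: `summarize_by_file` is a hypothetical helper that relies only on the `.metadata` and `.text` attributes used above, and the demo objects are stand-ins so the block runs without loading real files.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_by_file(documents):
    """Aggregate document count and word count per source file."""
    totals = defaultdict(lambda: {"documents": 0, "words": 0})
    for doc in documents:
        file_name = doc.metadata.get("file_name", "unknown")
        totals[file_name]["documents"] += 1
        totals[file_name]["words"] += len(doc.text.split())
    return dict(totals)


# Stand-in objects with the same `.metadata` / `.text` attributes used above.
demo_docs = [
    SimpleNamespace(metadata={"file_name": "report.pdf"}, text="page one text"),
    SimpleNamespace(metadata={"file_name": "report.pdf"}, text="page two"),
    SimpleNamespace(metadata={"file_name": "notes.txt"}, text="a short note"),
]
for name, stats in summarize_by_file(demo_docs).items():
    print(f"{name:<25} -> {stats['documents']:>3} docs, {stats['words']:>6} words")
```

In the notebook, `summarize_by_file(documents)` would be the corresponding call.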


## 4. Chunk documents with enterprise guardrails

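We apply `SentenceSplitter` to produce overlapping chunks so the downstream retriever keeps context without inflating token usage. `chunk_size` and `chunk_overlap` are measured in tokens. This cell applies one policy to both asset types; text and PDF assets can use different policies if needed.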


In [None]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=600, chunk_overlap=120, paragraph_separator="\n\n")

txt_docs = [doc for doc in documents if doc.metadata.get("file_name", "").endswith(".txt")]
pdf_docs = [doc for doc in documents if doc.metadata.get("file_name", "").endswith(".pdf")]

txt_nodes = splitter.get_nodes_from_documents(txt_docs)
pdf_nodes = splitter.get_nodes_from_documents(pdf_docs)

print(f"Text nodes: {len(txt_nodes)}")
print(f"PDF nodes:  {len(pdf_nodes)}")
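
To reason about how a chunking policy affects node counts before paying for embeddings, a rough estimate helps. The sketch below assumes fixed-size sliding windows, which is only an approximation: `SentenceSplitter` respects sentence boundaries, so real counts will differ. `estimate_chunks` and the `policies` values are hypothetical and shown for illustration.

```python
import math


def estimate_chunks(total_tokens, chunk_size=600, chunk_overlap=120):
    """Estimate how many overlapping windows cover `total_tokens` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# Hypothetical per-asset policies, to compare their effect on chunk counts.
policies = {
    "txt": {"chunk_size": 600, "chunk_overlap": 120},
    "pdf": {"chunk_size": 400, "chunk_overlap": 80},
}
for kind, policy in policies.items():
    print(kind, estimate_chunks(3000, **policy))
```

With the 600/120 policy each chunk after the first adds 480 new tokens, so a larger overlap raises the chunk count (and embedding cost) for the same corpus.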


### 4.1 Inspect representative chunks

Sampling nodes helps validate that entity boundaries and compliance-sensitive data stay intact.


In [None]:
def describe_nodes(label, nodes, sample=2):
    lengths = [len(node.text.split()) for node in nodes]
    if not lengths:
        print(f"No nodes for {label}")
        return
    avg_length = sum(lengths) / len(lengths)
    print(f"{label}: min={min(lengths)}, max={max(lengths)}, avg={avg_length:.1f} words")
    for idx, node in enumerate(nodes[:sample]):
        print(f"\n--- {label} sample {idx + 1} ---")
        preview = node.text[:500]
        print(preview)
        if len(node.text) > 500:
            print("...")

describe_nodes("data.txt", txt_nodes)
describe_nodes("2101.03697v3.pdf", pdf_nodes)


## 5. Build the vector index and query engine

We combine the chunked nodes into a vector index. Persist artifacts so production services can reload without rebuilding.


In [None]:
from llama_index.core import VectorStoreIndex

all_nodes = txt_nodes + pdf_nodes
index = VectorStoreIndex(all_nodes, show_progress=True)

query_engine = index.as_query_engine(similarity_top_k=5)


In [None]:
sample_question = "Summarize the core idea discussed in the PDF and relate it to the dataset."
response = query_engine.query(sample_question)
print(str(response))
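
To see which chunks grounded an answer, the retrieved sources can be listed with their similarity scores. This is a minimal sketch, assuming each item exposes `.score` and `.node.metadata` the way the entries of `response.source_nodes` do in LlamaIndex. `format_sources` is a hypothetical helper, and the demo uses stand-in objects so it runs without an API key.

```python
from types import SimpleNamespace


def format_sources(source_nodes):
    """Return one line per retrieved chunk: rank, score, and source file."""
    lines = []
    for rank, item in enumerate(source_nodes, start=1):
        file_name = item.node.metadata.get("file_name", "unknown")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{rank}. score={score} file={file_name}")
    return lines


# Stand-ins shaped like retrieved items; the second one lacks a score and metadata.
demo_sources = [
    SimpleNamespace(score=0.8123, node=SimpleNamespace(metadata={"file_name": "data.txt"})),
    SimpleNamespace(score=None, node=SimpleNamespace(metadata={})),
]
print("\n".join(format_sources(demo_sources)))
```

In the notebook, `format_sources(response.source_nodes)` would be the corresponding call.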


### 5.1 Persist artifacts

Persisting the index lets CI, staging, and production load the same artifacts instead of re-embedding the corpus in each environment. Store large artifacts in object storage if they exceed Git limits.


In [None]:
persist_dir = project_root / "storage" / "rag_llama_index"
persist_dir.mkdir(parents=True, exist_ok=True)
index.storage_context.persist(persist_dir=persist_dir)
print(f"Persisted index to {persist_dir}")
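
For reproducibility, it can help to record which source files a persisted index was built from. This is a minimal sketch using only the standard library: `file_sha256`, `write_manifest`, and the `manifest.json` file name are hypothetical and not part of LlamaIndex.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Hash a file in blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def write_manifest(files, persist_dir, name="manifest.json"):
    """Write a JSON manifest of source file hashes into `persist_dir`."""
    manifest = {Path(f).name: file_sha256(f) for f in files}
    target = Path(persist_dir) / name
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return target
```

In the notebook, `write_manifest(input_files, persist_dir)` would run right after the persist call. A consuming service can compare the manifest against its own copy of the data before reloading the index, for example with `StorageContext.from_defaults(persist_dir=...)` and `load_index_from_storage` from `llama_index.core`.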


## 6. Quality controls and next steps

- Add regression tests in `tests/` that call the query engine with fixtures from `tests/resources/`
- Log `query_engine.retrieve` outputs during load testing to monitor drift
- Wire up feedback loops (manual grading or eval harnesses) before promoting the pipeline to production
- Rotate or encrypt `.env` secrets with your platform's secret manager
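
The regression-test idea from the list above can be sketched as follows. Everything here is hypothetical: `StubQueryEngine` stands in for the real query engine so the test runs without API calls, and the question and answer strings are placeholder fixtures rather than real corpus content.

```python
class StubQueryEngine:
    """Stand-in for a real query engine so the test runs without API calls."""

    def __init__(self, canned_answers):
        self._canned_answers = canned_answers

    def query(self, question):
        return self._canned_answers.get(question, "")


def missing_terms(engine, question, expected_terms):
    """Return the expected terms that do not appear in the engine's answer."""
    answer = str(engine.query(question)).lower()
    return [term for term in expected_terms if term.lower() not in answer]


def test_answer_mentions_expected_terms():
    # Placeholder fixture; a real test would load questions from tests/resources/.
    engine = StubQueryEngine({"example question": "Alpha and beta are covered."})
    assert missing_terms(engine, "example question", ["alpha", "beta"]) == []
```

Swapping the stub for the persisted index's query engine turns this into an integration test; keeping the stub variant as well lets the assertion logic run in CI without credentials.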
