## Quickstart: run this RAG notebook

This notebook builds a **Retrieval-Augmented Generation (RAG)** pipeline over the PDFs in `./docs/` (relative to this notebook, i.e. `langchain/docs/` from the repo root).

### What it does

- Load all `*.pdf` files from `langchain/docs/`
- Split pages into chunks
- Create embeddings and persist a local **Chroma** DB
- Retrieve relevant chunks for a question and generate an answer
- (Optional) Start a **Gradio** chat UI

### Prerequisites

- **Python + Jupyter** (local)
- **Dependencies**: install from `requirements.txt`
- **API key**: this notebook uses an **OpenAI-compatible** endpoint via OpenRouter (`base_url="https://openrouter.ai/api/v1"`).

### Setup

1) Create and activate a virtualenv, then install deps (from repo root):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

2) Create a `.env` in the repo root with your API key:

```text
LLM=sk-or-v1-...
```

- The code reads `LLM` via `load_dotenv()` + `os.getenv("LLM")`.
- `.env` is gitignored; **don't commit keys**.
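
To confirm that the key is visible to the kernel without echoing it into saved cell output, a check along these lines may help. It is only a sketch: the helper name `describe_key` is hypothetical and not part of the notebook, and it assumes the variable is called `LLM` as above and that `load_dotenv()` has already run in the same process.

```python
import os


def describe_key(name: str = "LLM") -> str:
    """Report whether an environment variable is set, without revealing it."""
    value = os.environ.get(name, "")
    if not value:
        return f"{name} is not set"
    # Show only the length so the secret never lands in saved cell output.
    return f"{name} is set ({len(value)} characters)"


print(describe_key())
```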

3) Put PDFs into `langchain/docs/`.

- Note: `docs/` is gitignored (often large and/or non-redistributable).
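
Before running the notebook, it can be useful to check which PDFs the loader cell would pick up. The following is a minimal sketch, assuming the default folder `./docs` relative to the notebook; the function name `list_pdfs` is hypothetical. It mirrors the notebook's own filter (file names ending in `.pdf`, non-recursive).

```python
from pathlib import Path


def list_pdfs(folder: str = "./docs") -> list[str]:
    """Return the sorted names of the PDF files directly inside `folder`."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder is reported as "no PDFs" rather than raising.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file() and p.name.endswith(".pdf"))


pdfs = list_pdfs()
print(f"Found {len(pdfs)} PDF(s): {pdfs}")
```

An empty result means the loading cell would produce zero pages, so the index would be empty as well.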

### Documents used by this notebook (current local folder)

If you have the same local corpus as on this machine, `langchain/docs/` contains:

...

### Running

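- Run cells top-to-bottom.
- On the first run, set `RE_INDEX_CHROMA = True` so the notebook builds embeddings and persists the vector store under `./docs/chroma_db/` (`langchain/docs/chroma_db/` from the repo root). With the default `False`, the notebook only opens an existing store.
  - To **re-index**, delete `langchain/docs/chroma_db/`, set `RE_INDEX_CHROMA = True`, and rerun the indexing cells.
- Use the "test the pipeline" cell to try a question.
- Run the last cell to start the Gradio UI.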
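
Deleting the persisted store can also be done from Python instead of the file manager. This is a sketch under the assumption that the store lives in `./docs/chroma_db` relative to the notebook; `reset_index` is a hypothetical helper, not something the notebook defines.

```python
import shutil
from pathlib import Path


def reset_index(db_dir: str = "./docs/chroma_db") -> bool:
    """Delete the persisted vector store folder; return True if it was removed."""
    path = Path(db_dir)
    if not path.is_dir():
        # Nothing to delete, e.g. before the first indexing run.
        return False
    shutil.rmtree(path)
    return True
```

After calling `reset_index()`, set `RE_INDEX_CHROMA = True` and rerun the indexing cells. If the store is already open in the running kernel, restarting the kernel first is probably the safer order.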

### Troubleshooting

- **`No such file or directory: ./docs`**: create `langchain/docs/` and add PDFs.
- **`LLM environment variable not set`**: add `LLM=...` to `.env` and restart the kernel.
- **PDF warnings** ("Ignoring wrong pointing object ..."): often harmless PDF parsing noise.
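- **Empty or irrelevant answers**: check whether `RE_INDEX_CHROMA` is set to `False` while no index exists yet under `./docs/chroma_db/`. If so, set it to `True` and rerun the indexing cell.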
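
The `RE_INDEX_CHROMA` check can also be automated. The sketch below derives a suggested flag value from whether anything has been persisted yet; `index_exists` is a hypothetical helper, and the path `./docs/chroma_db` is assumed from the notebook's defaults. A non-empty folder does not prove the index is complete, so treat the result as a hint only.

```python
import os


def index_exists(db_dir: str = "./docs/chroma_db") -> bool:
    """Return True if `db_dir` exists and contains at least one entry."""
    return os.path.isdir(db_dir) and len(os.listdir(db_dir)) > 0


# Suggested value for the flag: re-index only when nothing has been persisted.
RE_INDEX_CHROMA = not index_exists()
print(f"RE_INDEX_CHROMA = {RE_INDEX_CHROMA}")
```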


End-to-end RAG implementation using LangChain + Chroma (see Quickstart above).

In [None]:
RE_INDEX_CHROMA = False

In [None]:
"""
Complete RAG Implementation with Issue Fixes
Addresses: API configuration, embeddings setup, database persistence, and full retrieval+generation pipeline
"""

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

from dotenv import load_dotenv
import os

load_dotenv()

PATH = './docs'


In [None]:
"""clients"""

LLM_KEY = os.getenv("LLM")  # Ensure this is set in .env
if not LLM_KEY:
    raise ValueError("LLM environment variable not set. Check your .env file.")
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    api_key=LLM_KEY,
    base_url="https://openrouter.ai/api/v1",  # Remove if using OpenAI directly
    temperature=0.7,
)
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=LLM_KEY,
    base_url="https://openrouter.ai/api/v1",  # ISSUE #5 FIX: Use base_url instead of openai_api_base
)

In [None]:
"""load and split documents"""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

documents = []
for file_name in sorted(os.listdir(PATH)):  # sorted for a reproducible load order
    if file_name.endswith('.pdf'):
        loader = PyPDFLoader(os.path.join(PATH, file_name))
        documents.extend(loader.load())

print(f"Loaded {len(documents)} pages")
split_docs = text_splitter.split_documents(documents)

print(f"Split into {len(split_docs)} chunks")


In [None]:
"""create embeddings and vector store"""



chroma_db_path = os.path.join(PATH, "chroma_db")

if RE_INDEX_CHROMA:
    # Embed every chunk and persist a fresh index to disk.
    db = Chroma.from_documents(
        documents=split_docs,  # ISSUE #6 FIX: Index all chunks
        embedding=embedding_model,
        persist_directory=chroma_db_path,
    )
    print(f"Vector store created with {len(split_docs)} chunks\n")
else:
    # Open the index persisted by an earlier run; nothing is re-embedded.
    db = Chroma(
        embedding_function=embedding_model,
        persist_directory=chroma_db_path,
    )
    print(f"Loaded existing vector store from {chroma_db_path}\n")

In [None]:
"""🦮 retriever"""
retriever = db.as_retriever(search_kwargs={"k": 5})  # Retrieve the 5 most relevant chunks

In [None]:
"""rag setup"""

import json

rag_prompt = PromptTemplate(
    template="""
You are a helpful assistant that answers questions based on the provided context.

## CONTEXT:
{context}

## QUESTION:
{question}

## ANSWER:
Provide a clear, concise answer based on the context above. If the context doesn't contain the answer, say so.
Include the sources of the context you used, with specific page numbers where possible. Only include sources that are actually used in the answer.
If additional readings, concepts, or researchers are mentioned in the context, feel free to include them in the answer.
""",
    input_variables=["context", "question"],
)


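def format_docs(docs):
    """Format retrieved documents as a JSON list for the prompt."""
    context = []
    for retrieved_doc in docs:
        context.append({
            "source": retrieved_doc.metadata.get("source", "unknown"),
            # 0-based page index from PyPDFLoader, if present; the prompt asks for pages.
            "page": retrieved_doc.metadata.get("page"),
            "content": retrieved_doc.page_content,
        })
    # ensure_ascii=False keeps non-ASCII text (e.g. umlauts) readable in the prompt.
    return json.dumps(context, ensure_ascii=False)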


rag_chain = (
    {
        "context": retriever | format_docs,  # Retrieve and format documents
        "question": RunnablePassthrough(),  # Pass through the user question
    }
    | rag_prompt  # Format the prompt
    | llm  # Send to LLM
    | StrOutputParser()  # Parse the response as string
)

In [None]:
"""test the pipeline"""

query = "was ist qualitätsmanagement?"  # German for "What is quality management?"
answer = rag_chain.invoke(query)

print(answer)

In [None]:
"""simple out of the box frontend"""

import gradio as gr

def rag_pipeline(message, history):
    response = rag_chain.invoke(message)
    return response

demo = gr.ChatInterface(rag_pipeline)
demo.launch()

    