<a href="https://colab.research.google.com/github/micah-shull/RAG-LangChain/blob/main/LC_19_RAG_Langsmith.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## ✅ **`%pip` vs. `%%pip` vs. `!pip`**

1️⃣ **`!pip install ...`**

* This runs `pip` in a **shell subprocess** (like typing in a terminal).
* In Colab/Jupyter, this can sometimes install packages into a different Python environment than the one the notebook kernel uses — especially if multiple Python versions or virtual environments are involved.
* You might install a package but then `import` still fails → annoying!

2️⃣ **`%pip install ...`**

* `%pip` is an **IPython magic command** (single-line).
* It ensures that the package is installed in **the same Python environment as your notebook kernel**.
* It’s the recommended way for pip installs in Jupyter/Colab.

✅ Use `%pip` instead of `!pip` to avoid “module not found” surprises.
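
To make the difference concrete: `%pip` targets the interpreter that runs the kernel, which `sys.executable` reports. The sketch below builds a roughly equivalent explicit command without running it. The helper name `kernel_pip_command` is made up for illustration, and the exact command `%pip` runs internally is an assumption here.

```python
import sys

def kernel_pip_command(*packages: str) -> list[str]:
    """Build a pip command bound to the interpreter running this kernel."""
    # `python -m pip` uses the pip that belongs to that exact interpreter.
    return [sys.executable, "-m", "pip", "install", *packages]

cmd = kernel_pip_command("langsmith", "langchain-openai")
print(" ".join(cmd))
```

If an `import` fails right after a `!pip install`, comparing `sys.executable` with the output of `!which python` is a quick way to check whether the two point at different environments.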

3️⃣ **`%%pip install ...`**

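* In stock IPython, `pip` is registered as a **line magic** only; there is no `%%pip` cell magic, so this form may fail with a "cell magic not found" error.
* Cell magics must also be the **first line** of a cell, so `%%pip` could never follow another command in the same cell.
* If you have multiple install lines in one cell, repeat `%pip install ...` on each line.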

---

## ⚡ **Example**

```python
# Good practice in Google Colab
%pip install langchain-openai langsmith

# Several installs in one cell: repeat the line magic on each line
%pip install python-dotenv
```

---

## 🗂️ **What about `%%capture`?**

You wrote:

```python
%%capture --no-stderr
%pip install ...
```

* `%%capture` is another IPython cell magic that **captures stdout and stderr** (so you don’t see all the pip output).
* `--no-stderr` means it won’t capture errors → you’ll still see them if pip fails.
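
The same idea can be sketched in plain Python with `contextlib.redirect_stdout`: normal output is swallowed while error output still reaches the screen. This is an analogy for what `%%capture --no-stderr` does, not IPython's implementation, and `noisy_install` is a made-up stand-in for a pip run.

```python
import contextlib
import io
import sys

def noisy_install():
    print("Collecting some-package ...")           # progress output (stdout)
    print("ERROR: build failed", file=sys.stderr)  # error output (stderr)

captured_out = io.StringIO()
with contextlib.redirect_stdout(captured_out):
    noisy_install()  # stdout is captured; stderr is still shown

print("captured:", captured_out.getvalue().strip())
```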

---

## ✅ **Best practice for Colab**

* Use `%pip` instead of `!pip` for Python packages.
* Use `%%capture` if you want to hide noisy output.
* Restart your runtime (`Runtime → Restart runtime`) if you upgraded a package that was already imported in this session; otherwise the old version stays loaded.



In [5]:
# %%capture --no-stderr
# %pip install langsmith langchain-openai langchain-core langchain-community pydantic python-dotenv openai langgraph
# %pip install --upgrade langsmith

In [1]:
%pip install -Uq langsmith langchain-openai langchain-core langchain-community pydantic python-dotenv openai langgraph

## Environment Setup

In [3]:
# --- 1) Imports ---
import os
from dotenv import load_dotenv

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# --- 2) Load environment variables ---
load_dotenv("/content/API_KEYS.env", override=True)

# Confirm keys are loaded without echoing secret values
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
print("LANGCHAIN_API_KEY set:", bool(os.getenv("LANGCHAIN_API_KEY")))
print("LANGCHAIN_PROJECT:", os.getenv("LANGCHAIN_PROJECT"))
print("LANGCHAIN_TRACING_V2:", os.getenv("LANGCHAIN_TRACING_V2"))

# Alternatively set project name for this run outside .env file
os.environ["LANGCHAIN_PROJECT"] = "project_00"
os.environ["USER_AGENT"] = "MyRAGApp/0.1"

OPENAI_API_KEY set: True
LANGCHAIN_API_KEY set: False
LANGCHAIN_PROJECT: my_project_name
LANGCHAIN_TRACING_V2: true
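
The output above shows no `LANGCHAIN_API_KEY` loaded, which is easy to miss. A small sketch for checking the environment without revealing secret values; the helper names are hypothetical, and the list of variables tracing needs is an assumption based on the names used in this notebook.

```python
import os

def describe_secret(name: str, env=os.environ) -> str:
    """Report whether a secret is set without revealing its value."""
    value = env.get(name)
    if not value:
        return f"{name}: missing"
    return f"{name}: set ({len(value)} chars)"

def missing_tracing_vars(env=os.environ) -> list[str]:
    """Return the tracing-related variable names that are unset or empty."""
    # Assumption: tracing needs both of these variables.
    required = ["LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY"]
    return [name for name in required if not env.get(name)]

# Demo with a fake environment so no real key is involved.
demo_env = {"OPENAI_API_KEY": "sk-demo-not-a-real-key", "LANGCHAIN_TRACING_V2": "true"}
print(describe_secret("OPENAI_API_KEY", demo_env))
print("Missing for tracing:", missing_tracing_vars(demo_env))
```

Called without the second argument, both helpers read the real `os.environ`.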


## Model Selection

In [4]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

## Embeddings

✅ Perfect — you’re doing it exactly right for embeddings!
Let me clarify how this works so you know you’re good to go:

---

## ✅ **Using `OpenAIEmbeddings`**

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
```

**What’s happening?**

* `OpenAIEmbeddings` is the LangChain wrapper that calls OpenAI’s embedding endpoint.
* `model="text-embedding-3-large"` selects OpenAI’s larger third-generation embedding model; by default it returns 3072-dimensional vectors.
* Your `OPENAI_API_KEY` must be set (which you have in your `.env`).

---

## 📌 **How it works**

When you call:

```python
vectorstore = FAISS.from_documents(docs, embeddings)
```

or:

```python
embeddings.embed_query("My question?")
```

LangChain calls OpenAI’s embedding endpoint, gets the vector, and returns it to your retriever or vector store.
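
Once text is a vector, "similar" becomes a numeric comparison. Vector stores typically rank chunks with a measure such as cosine similarity; the sketch below shows that measure on toy 3-dimensional vectors (the vector values are invented for illustration, and real embeddings from this model have 3072 numbers).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]        # toy "question" vector
close_chunk = [0.9, 0.1, 0.8]  # points in nearly the same direction
far_chunk = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, close_chunk))
print(cosine_similarity(query, far_chunk))
```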


In [5]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# ⚡️ Tip: inspect embedding
vector = embeddings.embed_query("How does LangSmith tracing work?")
print(len(vector))  # Should match the embedding dimension

3072


## Vector Store

In [6]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [7]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

## Code Breakdown

Let’s break this block down step-by-step so you really see **what’s happening and why** — because this is a **perfect small RAG + LangGraph example** and there’s a lot to learn here! 🚀

---

## ✅ **🔹 1) Imports**

```python
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
```

* `bs4` → BeautifulSoup: you’ll use it to parse the HTML blog post.
* `hub` → LangChain Hub: stores reusable prompts/templates.
* `WebBaseLoader` → loads text from a webpage.
* `Document` → LangChain’s document wrapper (metadata, page content).
* `RecursiveCharacterTextSplitter` → splits long docs into chunks.
* `StateGraph`, `START` → LangGraph’s graph builder and its entry-point marker; nodes are plain functions that read and update a shared state.
* `TypedDict` → defines your app state type for type safety.

---

## ✅ **🔹 2) Load and parse blog post**

```python
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()
```

👉 This:

* Loads **Lilian Weng’s “Agent” blog post**.
* Uses `bs4.SoupStrainer` to keep only useful HTML parts (post content, title, header) → more focused chunks.
* Wraps the text as `Document` objects.

✅ **Why?** You don’t want your embeddings polluted by navbars, footers, or ads!
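
To see what class-based filtering amounts to, here is a standard-library sketch that keeps only text inside elements with a wanted class. It is a simplified stand-in, not how `SoupStrainer` is implemented; `ClassTextExtractor` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassTextExtractor(HTMLParser):
    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0   # > 0 while inside a wanted element
        self.parts = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())

sample_html = """
<html><body>
<nav class="menu">Home | About</nav>
<h1 class="post-title">LLM Powered Autonomous Agents</h1>
<div class="post-content"><p>Agents use <b>planning</b>.</p></div>
<footer>Copyright</footer>
</body></html>
"""
extractor = ClassTextExtractor(["post-title", "post-content"])
extractor.feed(sample_html)
extracted_text = " ".join(extractor.parts)
print(extracted_text)
```

The navbar and footer text never reach `extracted_text`, which is the effect the loader's `parse_only` filter is used for.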

---

## ✅ **🔹 3) Split into chunks**

```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
```

👉 This:

* Breaks long text into smaller, overlapping chunks (RAG best practice).
* `chunk_size=1000`: each chunk \~1000 characters.
* `chunk_overlap=200`: 200 chars overlap between chunks to preserve context.

✅ **Why?** Each chunk gets one embedding, so a smaller chunk gives a more focused vector, and retrieving a few chunks sends far fewer tokens to the LLM than sending the whole post.
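
The size/overlap arithmetic can be sketched with a plain fixed-width chunker. This is a deliberate simplification: the real `RecursiveCharacterTextSplitter` also tries to break on paragraph and line boundaries, which this hypothetical `chunk_text` helper does not.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 dummy characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With 2500 characters, windows start at 0, 800, and 1600, and the tail of each chunk repeats as the head of the next.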

---

## ✅ **🔹 4) Index chunks**

```python
_ = vector_store.add_documents(documents=all_splits)
```

👉 This:

* Adds the chunks to your vector store (here the `InMemoryVectorStore` created earlier).
* Each chunk is embedded and stored for similarity search later.

✅ **Why?** This builds your local knowledge base to support retrieval.
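
A toy version of "embed, store, search" makes the mechanics visible. Everything here is a stand-in: `toy_embed` uses word counts instead of a real embedding model, and `ToyVectorStore` only borrows the method name `similarity_search` for familiarity; it is not LangChain's class.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.entries = []  # list of (vector, original text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((toy_embed(text), text))

    def similarity_search(self, query: str, k: int = 4):
        query_vec = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add_texts([
    "agents break tasks into subgoals",
    "short-term memory is in-context learning",
    "tool use lets agents call external APIs",
])
top = store.similarity_search("how do agents plan subgoals", k=1)
print(top)
```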

---

## ✅ **🔹 5) Pull RAG prompt from LangChain Hub**

```python
prompt = hub.pull("rlm/rag-prompt")
```

👉 This:

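* Loads a reusable RAG-style prompt from the LangChain Hub.
* The template text (as printed when the prompt is inspected later in this notebook) is:

  ```
  You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
  Question: {question}
  Context: {context}
  Answer:
  ```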

✅ **Why?** Standardizes your RAG behavior without hardcoding your own prompt.

---

## ✅ **🔹 6) Define app state**

```python
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
```

👉 This:

* Defines the **input, intermediate, and output state** for your graph.
* `question` → the user’s query.
* `context` → retrieved docs.
* `answer` → final LLM output.

✅ **Why?** `StateGraph` needs a clear state schema to pass data between steps.

---

## ✅ **🔹 7) Define steps**

### `retrieve`

```python
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}
```

👉 This:

* Takes the question.
* Uses similarity search to find relevant chunks.
* Returns them as `context` for the next step.

✅ **Why?** Classic RAG step: “given a question, find matching chunks.”

---

### `generate`

```python
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
```

👉 This:

* Joins retrieved chunks into a single context block.
* Calls the prompt to format the input for the LLM.
* Calls the LLM to generate the answer.
* Returns the answer as your final state.

✅ **Why?** This is the generation half of RAG: “use the retrieved context to produce a grounded answer.”

---

## ✅ **🔹 8) Build & compile your `StateGraph`**

```python
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
```

👉 This:

* Creates a graph where:

  * **Start** → `retrieve` → `generate`
* `StateGraph` wires up your pipeline with a clear flow.

✅ **Why?** This modular graph pattern is reusable, composable, and debuggable — much better than messy nested calls!
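
Conceptually, a linear graph like this passes one state dict through the nodes in order and merges each node's returned keys back into it. The sketch below models only that behavior with a hypothetical `run_sequence` helper and placeholder nodes; it is not LangGraph's implementation and makes no vector store or LLM calls.

```python
from typing import Callable, TypedDict

class State(TypedDict, total=False):
    question: str
    context: list[str]
    answer: str

def retrieve(state: State) -> dict:
    # Placeholder retrieval: a real node would query the vector store.
    return {"context": [f"(chunk about: {state['question']})"]}

def generate(state: State) -> dict:
    # Placeholder generation: a real node would call the LLM.
    return {"answer": f"Answer based on {len(state['context'])} chunk(s)."}

def run_sequence(nodes: list[Callable[[State], dict]], initial: State) -> State:
    """Run nodes in order, merging each partial update into the shared state."""
    state: State = dict(initial)  # copy so the caller's dict is untouched
    for node in nodes:
        state.update(node(state))
    return state

result = run_sequence([retrieve, generate], {"question": "What is task decomposition?"})
print(result["answer"])
```

Each node returns only the keys it changes, which is why `retrieve` and `generate` in the notebook return small dicts rather than the whole state.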

---

## ✅ **🔹 9) Next step: Run it**

After this block, you run:

```python
result = graph.invoke({"question": "What are ReAct agents?"})
print(result)
```

* This runs `retrieve` → `generate` → gives you an answer.
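* ✅ If `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY` are both set, the run is logged to **LangSmith** (the setup cell above shows no `LANGCHAIN_API_KEY` loaded, so add the key to your `.env` first):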

  * Question
  * Retrieved docs
  * LLM output
  * Token usage

---

## ✅ **📌 What this block gives you**

✔️ A **clean RAG pipeline** with:

* **Data loading**
* **Chunking**
* **Embeddings + vector store**
* **Retrieval**
* **Prompt templating**
* **LLM generation**
* **Modular graph orchestration**

✔️ **Full observability** via LangSmith (if your API key & project are correct).

---

## 🎉 **Why this is powerful**

* You can swap the retriever, LLM, or prompt independently.
* You can add more steps — like evals, re-ranking, or feedback loops.
* You can see the entire flow in your LangSmith dashboard to debug or tune.




### Soup Strainer

In [14]:
# Inspect raw HTML (not using SoupStrainer)
import requests
html = requests.get("https://lilianweng.github.io/posts/2023-06-23-agent/").text

print(html[:1000])  # Show first 1000 chars

<!DOCTYPE html>
<html lang="en" dir="auto">

<head><meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="robots" content="index, follow">
<title>LLM Powered Autonomous Agents | Lil&#39;Log</title>
<meta name="keywords" content="nlp, language-model, agent, steerability, prompting" />
<meta name="description" content="Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview
In a LLM-powered autonomous agent system, LLM functions as the agent&rsquo;s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down lar

In [13]:
print(len(docs))
print(docs[0].page_content[:500])  # show a snippet
print(docs[0].metadata)

1


      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}


In [18]:
from bs4 import BeautifulSoup, SoupStrainer

# Define what you want to extract:
only_useful = SoupStrainer(class_=("post-content", "post-title", "post-header"))

# Parse HTML, filtering at parse time:
soup = BeautifulSoup(html, "html.parser", parse_only=only_useful)

# Check what you got:
print(soup.prettify()[:1000])

<header class="post-header">
 <h1 class="post-title">
  LLM Powered Autonomous Agents
 </h1>
 <div class="post-meta">
  Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng
 </div>
</header>
<div class="post-content">
 <p>
  Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as
  <a href="https://github.com/Significant-Gravitas/Auto-GPT">
   AutoGPT
  </a>
  ,
  <a href="https://github.com/AntonOsika/gpt-engineer">
   GPT-Engineer
  </a>
  and
  <a href="https://github.com/yoheinakajima/babyagi">
   BabyAGI
  </a>
  , serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
 </p>
 <h1 id="agent-system-overview">
  Agent System Overview
  <a aria-hidden="true" class="anchor" hidden="" href="#agent-system-overview">
   #
  </a>
 </h1>
 <p>
  In a LLM-powered au

In [22]:
import requests
from bs4 import BeautifulSoup, SoupStrainer

url = "https://lilianweng.github.io/posts/2023-06-23-agent/"

# Step 1: Get HTML
html = requests.get(url).text

# Step 2: Strain for useful parts
only_useful = SoupStrainer(class_=("post-content", "post-title", "post-header"))
soup = BeautifulSoup(html, "html.parser", parse_only=only_useful)

# Step 3: Inspect
print("EXTRACTED PARTS:")
extracted_parts = soup.find_all(True)

for idx, part in enumerate(extracted_parts[0:15]):
    print(f"\n--- PART {idx+1} ---")
    print(f"Tag: {part.name}")
    print(f"Class: {part.get('class')}")
    print(f"Content:\n{part.get_text(strip=True)[:500]}")


EXTRACTED PARTS:

--- PART 1 ---
Tag: header
Class: ['post-header']
Content:
LLM Powered Autonomous AgentsDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

--- PART 2 ---
Tag: h1
Class: ['post-title']
Content:
LLM Powered Autonomous Agents

--- PART 3 ---
Tag: div
Class: ['post-meta']
Content:
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

--- PART 4 ---
Tag: div
Class: ['post-content']
Content:
Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such asAutoGPT,GPT-EngineerandBabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.Agent System Overview#In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:PlanningSubgoal

--- PART 5 ---
Tag: p
Class: None
Co

Let’s unpack this because it’s *super important* for how the loader → splitter → vector store flow works.

---

### ✅ **What actually happens with multiple parts?**

When you do this:

```python
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4.SoupStrainer(
        class_=("post-content", "post-title", "post-header")
    )}
)
docs = loader.load()
```

👉 **`loader.load()` returns a list of `Document` objects**, *one per URL in `web_paths`*.

The `SoupStrainer` only controls which elements survive parsing. If your page has:

* `<h1 class="post-title"> ... </h1>`
* `<header class="post-header"> ... </header>`
* `<div class="post-content"> ... </div>`

the text of all matched elements is extracted into the `page_content` of a single `Document`.

✅ So you get `len(docs) == 1` for this blog, as the earlier `print(len(docs))` output shows.

---

## 🔍 **Do they automatically combine?**

**Yes.** `WebBaseLoader` parses the page once with your `SoupStrainer` and calls `get_text()` on the result, so the matched elements end up in document order inside one `Document`:

```python
[
    Document(
        metadata={"source": "https://lilianweng.github.io/posts/2023-06-23-agent/"},
        page_content="LLM Powered Autonomous Agents ... Building agents with LLM ...",
    )
]
```

---

## ✅ **What happens next?**

The single `Document` goes through your **TextSplitter**, which will:

* Take each `Document` in the list individually (here there is only one).
* Chunk it into smaller overlapping parts.
* Return one flat list of **smaller chunks**, each carrying a copy of its parent’s metadata.

---

## 📌 **Example:**

```python
# One URL in web_paths → one Document:
docs = loader.load()
print(len(docs))  # 1

# Then split:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

print(len(splits))  # 63 chunks for this post
```

So in the end:
✅ The chunks together cover the whole extracted page
✅ Every chunk comes from the same source `Document`
✅ Each chunk keeps `metadata` → so your retrieval knows which URL it came from!

---

## 🗂️ **Why this is good**

This design means:

* You can pull exactly the HTML bits you want → fine control.
* Every chunk carries the source URL in its metadata.
* The metadata does **not** record which HTML element a passage came from; if you need that, add it yourself before splitting.




In [23]:
len(extracted_parts)

401

In [24]:
# One URL in web_paths → one Document:
docs = loader.load()
print(len(docs))  # 1

# Then split:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

print(len(splits))  # 63 chunks for this post


1
63
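
The output above (1 document, 63 chunks) shows that the loader produced a single `Document` for the one URL and the splitter turned it into many chunks. How metadata travels from the parent to each chunk can be sketched with a stand-in class; `Doc`, `split_doc`, and the example URL are hypothetical and much simpler than the LangChain types.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_doc(doc: Doc, chunk_size: int) -> list[Doc]:
    """Cut one document into fixed-size pieces; each piece copies the metadata."""
    return [
        Doc(doc.page_content[i:i + chunk_size], dict(doc.metadata))
        for i in range(0, len(doc.page_content), chunk_size)
    ]

page = Doc("x" * 2500, {"source": "https://example.com/post"})
pieces = split_doc(page, chunk_size=1000)
print(len(pieces), pieces[0].metadata)
```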


In [42]:
all_splits[0]

Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refin

### Splits Inspection

✅ **Fantastic insight — you’re asking exactly the right question!**
You’re right: the *tidiness* of your final chunks **does directly affect** your RAG output quality. Let’s break down *why*, what’s happening, and how to clean it up if you want **higher-quality chunks**.

---

## 📌 **What’s in your current split?**

Your `all_splits[0]` is:

```
Document(
  metadata={...},
  page_content=(
    "LLM Powered Autonomous Agents\n\nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM..."
  )
)
```

So your chunk has:

* ✅ **Good content**: the intro, headings, paragraphs.
* ⚠️ **Extra noise**: the “Date: … | Author: …” byline from the post header. It appears once per page, so it matters more as you index more pages.
* ⚠️ Extra `\n` → doesn’t usually hurt the LLM, but lots of them can waste tokens or break flow.

---

## ✅ **Why does this matter for RAG?**

* The embeddings you generate for similarity search are only as good as the text quality.
* If many chunks carry boilerplate (for example, a date/author byline at the top of every page you index), it can:

  * Dilute your semantic similarity.
  * Waste tokens in your context window when you pass retrieved docs to the LLM.
  * Make the final answer sound repetitive or cluttered.

---

## 📌 **Where does this come from?**

* This is from the **HTML structure**:

  * The “post-header” block includes the title, date, and author.
  * The loader keeps that text because the `SoupStrainer` asks for `post-header`, and the `RecursiveCharacterTextSplitter` doesn’t “know” what’s semantically important vs. boilerplate. For a single page it lands in the first chunk only (`chunk_overlap` can repeat text only between neighboring chunks).
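
Before cleaning anything, it can help to measure whether boilerplate really repeats across chunks. A minimal sketch, assuming chunks are plain strings; `repeated_lines` and the demo data are made up for illustration.

```python
from collections import Counter

def repeated_lines(chunks: list[str], min_chunks: int = 2) -> dict[str, int]:
    """Map each line that occurs in at least `min_chunks` chunks to its chunk count."""
    counts = Counter()
    for chunk in chunks:
        # Count each distinct non-empty line once per chunk.
        unique_lines = {line.strip() for line in chunk.splitlines() if line.strip()}
        counts.update(unique_lines)
    return {line: n for line, n in counts.items() if n >= min_chunks}

demo_chunks = [
    "Date: 2023 | Author: A\nIntro paragraph.",
    "Date: 2023 | Author: A\nSecond page intro.",
    "A chunk with no byline.",
]
boilerplate = repeated_lines(demo_chunks)
print(boilerplate)
```

In this notebook the input would be `[s.page_content for s in all_splits]`; an empty result suggests there is little repeated boilerplate to remove.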

---

## ✅ **How tidy is the final product by default?**

👉 By default:

* It’s **pretty good** — you’re getting meaningful, readable text blocks.
* But you *can* get repeated headers or metadata you may not want.
* The newline characters `\n` are usually not a big deal — but for very clean retrieval, you might normalize them.

---

## ⚡️ **How to clean it up (best practices)**

### ✅ 1️⃣ Strip or filter unwanted text before splitting

**One-time removal is the best place:**

```python
# Inspect your docs before splitting:
for doc in docs:
    print("--- RAW DOC ---")
    print(doc.page_content[:500])

# Example cleanup:
for doc in docs:
    text = doc.page_content

    # Example: Remove the Date/Author line if it's always the same
    text = text.replace("Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng", "")

    # Collapse multiple newlines
    text = "\n".join([line.strip() for line in text.splitlines() if line.strip()])

    doc.page_content = text
```

Then split:

```python
all_splits = text_splitter.split_documents(docs)
```

✅ This means your **chunks will never contain that redundant metadata**, so your embeddings will be clearer.

---

### ✅ 2️⃣ Or use a more advanced cleaner

You could also:

* Use regex to strip repeated footers, headers, disclaimers, timestamps.
* Apply `strip()` to remove leading/trailing whitespace.
* Remove line breaks if you don’t need them for semantic structure.
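
A regex-based cleaner along those lines might look like the sketch below. The byline pattern is an assumption modeled on this one blog's "Date: … | Author: …" line, and both helper names are hypothetical; adapt the pattern to your own pages.

```python
import re

def strip_byline(text: str) -> str:
    """Remove a 'Date: ... | Author: ...' line (pattern assumed from this blog)."""
    return re.sub(r"^Date: .*\| *Author: .*$\n?", "", text, flags=re.MULTILINE)

def normalize_whitespace(text: str) -> str:
    """Collapse spaces and blank-line runs but keep paragraph breaks."""
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs → one space
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()

raw = (
    "LLM Powered Autonomous Agents\n    \n"
    "Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n"
    "Building agents with LLM"
)
cleaned_text = normalize_whitespace(strip_byline(raw))
print(repr(cleaned_text))
```

Keeping one blank line between paragraphs leaves the splitter natural places to break.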

---

### ✅ 3️⃣ Normalizing whitespace helps

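LLMs handle `\n` fine, but runs of blank lines and indentation spend tokens without adding meaning. Keep in mind that `RecursiveCharacterTextSplitter` prefers to break on `\n\n` and `\n`, so flattening everything to single spaces removes the paragraph boundaries it would otherwise use.
One common trick: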

```python
def clean_text(text):
    # Remove extra newlines
    lines = text.splitlines()
    lines = [line.strip() for line in lines if line.strip()]
    return " ".join(lines)

for doc in docs:
    doc.page_content = clean_text(doc.page_content)
```

---

## 📌 **Does this really improve RAG performance?**

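✅ Often, yes. In real-world projects, *cleaner* chunks tend to:

* Produce embeddings that reflect the content rather than shared boilerplate.
* Make retrieval more on-point, which lowers the chance of answers built on irrelevant context.
* Use your token budget more effectively → less wasted cost.

The size of the effect depends on your data, so compare retrieval results before and after cleaning.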

---

## 🟢 **So your instinct is 100% right**

| Noisy chunks                         | Cleaned chunks                    |
| ------------------------------------ | --------------------------------- |
| Date/author byline inside chunk text | Removed, or kept only in metadata |
| Extra whitespace                     | Normalized                        |
| Boilerplate mixed into embeddings    | More focused embeddings           |

---

## ✅ ✅ ✅ **Quick takeaway**

* The default tutorial gives you a *decent* baseline.
* But a tiny bit of **text pre-processing** can noticeably improve your results — especially for larger RAG projects.




In [43]:
for i, split in enumerate(all_splits[:10]):  # First 10 splits
    print(f"--- Chunk {i+1} ---")
    print(f"Length (chars): {len(split.page_content)}")
    print(f"Metadata: {split.metadata}")
    print("\nCONTENT:\n")
    print(split.page_content[:500])  # Preview first 500 chars
    print('\n' + '-'*80 + '\n')


--- Chunk 1 ---
Length (chars): 969
Metadata: {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}

CONTENT:

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-p

--------------------------------------------------------------------------------

--- Chunk 2 ---
Length (chars): 665
Metadata: {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}

CONTENT:

Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory:

In [55]:
def clean_text(text):
    # Split the text into lines
    lines = text.splitlines()

    # Strip each line of leading/trailing whitespace and drop empty lines
    lines = [line.strip() for line in lines if line.strip()]

    # Join the cleaned lines into one string with single spaces
    return " ".join(lines)


# Pick a doc to inspect
original = docs[0].page_content

# Clean it
cleaned = clean_text(original)

# Compare
import textwrap

# Compare the first 500 characters of each version.
# Note: textwrap.wrap() replaces newlines with spaces, so whitespace
# differences between the two versions are mostly hidden in this view.
print("--- ORIGINAL ---")
print("\n".join(textwrap.wrap(original[:500], width=80)))

print("\n--- CLEANED ---")
print("\n".join(textwrap.wrap(cleaned[:500], width=80)))


--- ORIGINAL ---
LLM Powered Autonomous Agents Building agents with LLM (large language model) as
its core controller is a cool concept. Several proof-of-concepts demos, such as
AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality
of LLM extends beyond generating well-written copies, stories, essays and
programs; it can be framed as a powerful general problem solver. Agent System
Overview# In a LLM-powered autonomous agent system, LLM functions as the agent’s
brain, complemented by se

--- CLEANED ---
LLM Powered Autonomous Agents Building agents with LLM (large language model) as
its core controller is a cool concept. Several proof-of-concepts demos, such as
AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality
of LLM extends beyond generating well-written copies, stories, essays and
programs; it can be framed as a powerful general problem solver. Agent System
Overview# In a LLM-powered autonomous agent system, LLM functions as th

## Prompt

## 📌 **What is `hub.pull("rlm/rag-prompt")` doing?**

**`hub.pull`** loads a prompt **from the LangChain Hub**, a shared place to store reusable prompts (hosted at smith.langchain.com).

```python
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")
```

So `rlm/rag-prompt` is:

* `rlm` → the username or org on the Hub.
* `rag-prompt` → the name of the prompt they published.

This lets you reuse **battle-tested** prompt templates instead of writing one from scratch.

---

## ✅ **Can you see the prompt online?**

✔️ Yes! Every public Hub asset has a **web page**.
👉 The URL pattern is:

```
https://smith.langchain.com/hub/<username>/<asset-name>
```

So in your case:

```
https://smith.langchain.com/hub/rlm/rag-prompt
```

If you open that, you’ll see:

* The full prompt template.
* Inputs it expects (e.g., `{question}` and `{context}`).
* Example usage.
* Versions, if any.

---

## 🟢 **What does this prompt contain?**

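Here’s what a generic **RAG prompt** usually looks like (this is an illustration, not the exact text of `rlm/rag-prompt`; the inspection cell further down prints the real template):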

```
Use the following pieces of context to answer the question at the end.
If you don’t know the answer, just say you don’t know — don’t try to make up an answer.

Context:
{context}

Question:
{question}

Helpful Answer:
```

✅ It is designed so that:

* The LLM grounds its answer in your retrieved chunks.
* The LLM is told not to guess, which reduces (but does not eliminate) hallucinations.
* Your `{context}` and `{question}` are plugged in dynamically when you call `prompt.invoke(...)`.
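
The placeholder filling itself is ordinary string templating. The sketch below uses plain `str.format` with the template text that the inspection cell further down prints for `rlm/rag-prompt`; `build_prompt` is a hypothetical helper, and the real `prompt.invoke(...)` returns chat messages rather than a bare string.

```python
# Template text copied from the printed `rlm/rag-prompt` output in this notebook.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks and fill both placeholders."""
    context = "\n\n".join(chunks)  # same join the `generate` step uses
    return RAG_TEMPLATE.format(question=question, context=context)

filled = build_prompt("What is task decomposition?", ["Chunk one.", "Chunk two."])
print(filled)
```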

---

## 🗂️ **How can you inspect it locally?**

After pulling it:

```python
print(prompt)
```

or:

```python
print(prompt.messages)
```

Depending on whether it’s a `PromptTemplate`, `ChatPromptTemplate`, or some other type, you’ll see its structure.

---

## ✅ **Key takeaway**

| Thing                                            | What it does                                                 |
| ------------------------------------------------ | ------------------------------------------------------------ |
| `hub.pull("rlm/rag-prompt")`                     | Loads a reusable prompt template from the Hub                |
| `https://smith.langchain.com/hub/rlm/rag-prompt` | Lets you view the exact prompt                               |
| `prompt.invoke({...})`                           | Fills in `{question}` and `{context}` when you run the graph |

---

## ✅ ✅ ✅ **Next**

* Go check it out at [smith.langchain.com/hub/rlm/rag-prompt](https://smith.langchain.com/hub/rlm/rag-prompt)


In [66]:
# The prompt is a ChatPromptTemplate with one or more messages
print(type(prompt))
# <class 'langchain_core.prompts.chat.ChatPromptTemplate'>

# The actual list of messages
print(prompt.messages)
# [HumanMessagePromptTemplate(...), ...]

# Grab the first HumanMessagePromptTemplate
human_message = prompt.messages[0]

print(type(human_message))
# <class 'langchain_core.prompts.chat.HumanMessagePromptTemplate'>

# That has a .prompt which is your underlying PromptTemplate
underlying_template = human_message.prompt

print(type(underlying_template))
# <class 'langchain_core.prompts.prompt.PromptTemplate'>

# Finally, get the raw template string!
print('\n')
print("\n".join(textwrap.wrap(underlying_template.template, width=80)))



<class 'langchain_core.prompts.chat.ChatPromptTemplate'>
[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})]
<class 'langchain_core.prompts.chat.HumanMessagePromptTemplate'>
<class 'langchain_core.prompts.prompt.PromptTemplate'>


You are an assistant for question-answering tasks. Use the following pieces of
retrieved context to answer the question. If you don't know the answer, just say
that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}  Context: {context}  Answer:
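
Note that `textwrap.wrap` replaces newlines with spaces, which is why `Question`, `Context`, and `Answer` end up on a single line in the output above. A minimal sketch that wraps each line separately, so the template's own line breaks survive, could look like this. The helper name `wrap_preserving_newlines` is made up for illustration; the template string is copied from the printed `PromptTemplate`.

```python
import textwrap

# Template text as printed in the PromptTemplate repr above.
template = (
    "You are an assistant for question-answering tasks. Use the following pieces of "
    "retrieved context to answer the question. If you don't know the answer, just say "
    "that you don't know. Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)


def wrap_preserving_newlines(text: str, width: int = 80) -> str:
    """Wrap every line on its own so existing newlines are kept."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for a blank line; keep it as an empty line.
        wrapped_lines.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(wrapped_lines)


wrapped = wrap_preserving_newlines(template)
print(wrapped)
```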


✅ **Good RAG prompts are often surprisingly simple.**

---

### 📌 **Why is it so simple?**

Your prompt:

```
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}  
Context: {context}  
Answer:
```

---

### 🔑 **What this does right**

1️⃣ **Grounds the LLM**

* It says: *“Use the context I gave you. Don’t hallucinate.”*

2️⃣ **Has a clear fallback**

* *“If you don’t know, just say you don’t know.”*
* This reduces confident nonsense when the context is irrelevant or missing.

3️⃣ **Keeps output short**

* *“Three sentences max, keep it concise.”*
* Saves tokens and keeps answers direct — especially important if you show the source passages to the user too.

4️⃣ **No unnecessary instructions**

* No fancy formatting, references, or style — just plain QA.
* The retriever does the heavy lifting: the better your chunks, the better this works.

---

## ✅ **Good RAG = simple prompt + good chunks**

You nailed it:

* The *real power* comes from **high-quality, well-chunked, relevant context**.
* A fancy prompt can’t fix irrelevant or noisy chunks.
* A simple, direct prompt is often enough: extra instructions add tokens and can pull the model's attention away from the retrieved context.

---

## 📌 **When you might expand it**

You’d only make your RAG prompt more complex when you need:

* ✅ Citations: *“Add \[1], \[2], \[3] next to facts.”*
* ✅ Structured output: JSON, bullet lists, or tables.
* ✅ A custom style or tone: e.g., summarizing in plain language for kids.
* ✅ Additional constraints: like always adding an explicit source URL.

But for baseline QA → **short, direct instructions are king**.

---

## 🟢 **Key takeaway**

| Good chunk quality          | Good retriever                                   | Simple grounding prompt                                |
| --------------------------- | ------------------------------------------------ | ------------------------------------------------------ |
| ✅ Clean text, minimal noise | ✅ Similarity search that finds relevant passages | ✅ “Use this context. Be concise. Don’t make stuff up.” |

This is what gives you reliable, traceable answers.





> *“How important is prepping and optimizing my source documentation for RAG?”*

The short answer: **It’s absolutely critical — maybe the single most important thing you control!**

---

### 📌 **Why your chunk quality matters more than a fancy prompt**

### 🗂️ **RAG is really just two big steps:**

1️⃣ **Retrieve** — Use embeddings + vector store to find relevant text chunks.
2️⃣ **Generate** — Ask the LLM to answer your question using those chunks.

---

### ⚡ **What does “garbage in, garbage out” mean for RAG?**

👉 If your source text is noisy, irrelevant, repetitive, or poorly chunked:

* The vector store will return off-topic or low-quality context.
* The LLM will either hallucinate to fill gaps **OR** parrot irrelevant details.
* No prompt in the world can fix “bad chunks in, bad answer out.”

---

### ✅ **The retrieval step is your “truth filter.”**

A clear, tight prompt just says:

> “Hey model, trust these chunks. Don’t guess.”

But if the chunks themselves are:

* Missing key facts
* Full of boilerplate (headers, footers, legal disclaimers)
* Or too big and vague

…the LLM has no good raw material to pull an accurate answer from.

---

## 🟢 **What makes a chunk “high quality”?**

✅ Clear, well-written text — free from extra noise like navbars, unrelated disclaimers.

✅ Right chunk size: small enough to stay relevant, large enough to preserve meaning (roughly 500–1000 tokens is a common starting point; tune it for your data).

✅ Overlap for context — to prevent splitting related ideas mid-paragraph.

✅ Consistent format — so your embeddings stay semantically sharp.

✅ Enriched metadata — so you can filter and display source details later.
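
As a rough illustration of chunk size and overlap, here is a minimal character-based splitter. It is a sketch under simplifying assumptions: sizes are counted in characters rather than tokens, the function name `chunk_text` is hypothetical, and real splitters usually also try to break on paragraph or sentence boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.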

---

## 🧹 **How do people prep their docs well?**

📄 **Good examples:**

* Strip repeated boilerplate: e.g., headers/footers that repeat every page.
* Remove unrelated sections: e.g., site nav, social share buttons.
* Normalize whitespace: e.g., clean up newlines, markdown, code blocks.
* Keep logical units together: e.g., don’t break a section heading from its paragraph.
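
The first three cleanup steps can be sketched in a few lines of standard-library Python. This is only an illustration: `clean_pages` and the `min_repeats` threshold are assumptions made here, and a real pipeline would need rules tuned to its own documents.

```python
import re
from collections import Counter
from typing import List


def clean_pages(pages: List[str], min_repeats: int = 2) -> List[str]:
    """Drop lines repeated on several pages and normalize whitespace."""
    # Count on how many pages each non-empty line occurs.
    line_counts = Counter()
    for page in pages:
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})
    boilerplate = {line for line, n in line_counts.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate
        ]
        text = "\n".join(kept)
        cleaned.append(re.sub(r"[ \t]+", " ", text))  # collapse runs of spaces/tabs
    return cleaned


pages = [
    "ACME Corp Handbook\nVacation policy:   20 days.\nConfidential",
    "ACME Corp Handbook\nSick leave: 10 days.\nConfidential",
]
cleaned_pages = clean_pages(pages)
print(cleaned_pages)
```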

---

## ✅ **How big is the impact?**

Practitioners building RAG systems commonly report the same pattern:

* Improving chunk quality can improve RAG accuracy dramatically — often more than tweaking your LLM or prompt.
* If your chunks are noisy, similarity search pulls bad matches.
* Better chunks = better semantic similarity = more relevant context = less hallucination.

---

## 🔑 **Your secret RAG formula**

| Part                          | How much control you have         | Impact      |
| ----------------------------- | --------------------------------- | ----------- |
| Chunk quality & text cleaning | ✅ 100%                            | 🔥 Massive  |
| Chunk size & overlap          | ✅ 100%                            | 🔥 High     |
| Prompt clarity                | ✅ 100%                            | 🔥 Medium   |
| LLM model choice              | ✅ 50% (API pricing & performance) | 🔥 Moderate |
| Vector store tech             | ✅ 100%                            | 🔥 Medium   |

---

## ✅ ✅ ✅ **Key takeaway**

> 🏆 *“RAG is mostly about smart retrieval.
> Smart retrieval is all about high-quality, clean chunks.”*

A fancy prompt is just the icing on the cake.
A good retriever *is the cake.* 🍰✨


## State

This little `State` definition is **super important** for understanding how your LangGraph `StateGraph` works behind the scenes.

---

## 📌 **What is this doing?**

```python
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
```

**This defines the “shape” of your state:**
It’s like saying:

> “At any point in this pipeline, my application will carry a `State` dictionary with exactly these keys and types.”

---

## ✅ **Why do you need a `State`?**

In a LangGraph `StateGraph`:

* Your RAG pipeline is built as a series of **steps** (nodes).
* Each step **receives** the current `State` and **returns** a modified version of it.
* The framework wires these steps together, passing the state along.

This gives you:

* **Clarity** → you always know what data flows through your app.
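* **Type hints** → editors and static type checkers can flag a misspelled key or a wrong type before you run the graph (a `TypedDict` is not validated at runtime).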
* **Composability** → each step only worries about its piece.

---

## 🟢 **What does each field mean?**

| Field      | Type             | What it holds                                                    |
| ---------- | ---------------- | ---------------------------------------------------------------- |
| `question` | `str`            | The user’s input question.                                       |
| `context`  | `List[Document]` | The retrieved chunks from your vector store (before generation). |
| `answer`   | `str`            | The final LLM answer.                                            |

So your pipeline looks like:

```
Input: {"question": "What is ReAct?"}

   ↓  retrieve step adds:
   {"question": "...", "context": [Document(...), Document(...)]}

   ↓  generate step adds:
   {"question": "...", "context": [...], "answer": "ReAct is ..."}
```

✅ The state grows as each step adds more info.
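
The merge behaviour can be imitated in plain Python. The sketch below is a simplified stand-in for what the graph does with default (overwrite) semantics, not LangGraph's actual implementation; `run_pipeline` and the canned node bodies are invented for illustration.

```python
from typing import Callable, List


def retrieve(state: dict) -> dict:
    # Hypothetical retriever: returns a canned chunk instead of querying a vector store.
    return {"context": ["ReAct interleaves reasoning and acting."]}


def generate(state: dict) -> dict:
    # Hypothetical generator: echoes the first chunk instead of calling an LLM.
    return {"answer": state["context"][0]}


def run_pipeline(initial: dict, nodes: List[Callable[[dict], dict]]) -> dict:
    """Call each node in order and merge its partial update into the state."""
    state = dict(initial)
    for node in nodes:
        update = node(state)
        state = {**state, **update}  # returned keys overwrite, the rest carry over
        print(node.__name__, "->", sorted(state))
    return state


final_state = run_pipeline({"question": "What is ReAct?"}, [retrieve, generate])
```

Each node returns only the keys it owns, and the question passes through untouched.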

---

## ✅ **Why `TypedDict`?**

👉 Here `TypedDict` is imported from `typing_extensions`, which backports it to older Python versions (on Python 3.8+ it is also available from the standard `typing` module).
It means:

* This `State` is a *dictionary* with a specific schema.
* It’s not an actual class with methods — just a *type hint* for your graph.
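
A short sketch makes the "just a type hint" point tangible. It uses `typing.TypedDict` from the standard library and `List[str]` as a stand-in for `List[Document]`, so it runs without LangChain installed.

```python
from typing import List, TypedDict


class State(TypedDict):
    question: str
    context: List[str]  # stand-in for List[Document]
    answer: str


state: State = {"question": "What is ReAct?", "context": [], "answer": ""}

# A static type checker would flag the next line (wrong type, missing keys),
# but Python itself runs it without complaint.
bad_state: State = {"question": 123}

print(type(state))            # a TypedDict instance is a plain dict at runtime
print(State.__annotations__)  # the declared schema is still available for inspection
```

If you want the schema checked, run a type checker such as mypy or pyright over the notebook code rather than relying on runtime errors.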

---

## 🧩 **Where does this show up?**

```python
graph_builder = StateGraph(State)
```

This means:

> “This graph expects every node to handle and return a `State` matching this structure.”

---

## ✅ **Key takeaway**

| Part        | What it does                                          |
| ----------- | ----------------------------------------------------- |
| `State`     | Defines the *shared data* passed step-to-step         |
| `TypedDict` | Adds static type hints and a clear contract           |
| `context`   | Carries retrieved `Document` chunks to your generator |
| `answer`    | Carries the final output back to you                  |

---

## 🚀 **So what’s next?**

After this, your `retrieve` and `generate` functions:

* **Input:** `State`
* **Output:** `State` (or a partial dict that the graph merges)

✅ It’s clean, predictable, and traceable — perfect for LangSmith to log each step.

---

## ✅ ✅ ✅ **Summary**

👉 `State` = your app’s *blueprint for passing info around* in your RAG graph.
👉 `TypedDict` makes it explicit and debuggable.
👉 You get reproducibility, clarity, and great logs!






### 📌 **Why do we care about `State`?**

When you use a **graph-based pipeline** (like LangGraph), `State` is your **single source of truth** for *what data flows through the system*.

It answers:

> “What data am I passing from step to step?”
> “What did each step add, change, or depend on?”

Without a clear `State`, your pipeline is:

* 🔥 Harder to debug
* 🔥 Easier to break if someone changes one step’s input/output
* 🔥 Less traceable in LangSmith or other observability tools

---

## ✅ **What does `State` actually do?**

Here’s exactly what it provides:

### ✔️ 1️⃣ **Clear data flow**

Every node (step) knows:

* What it will **get** (`State` input)
* What it must **return** (a `State` with the same shape, or a partial update)

So there’s no hidden “oh, where did that variable come from?”

---

### ✔️ 2️⃣ **Makes steps reusable & composable**

You can swap out steps (like a new retriever or generator) because:

* They speak the same “language” → `State`.
* Each step promises: “I’ll take in a `State` and return a valid `State`.”

---

### ✔️ 3️⃣ **Better debugging & tracing**

When you run the pipeline in LangSmith:

* You see *exactly* what each node did to the state.
* You can inspect the input, output, and intermediate results step-by-step.
* This makes finding hallucinations or bad retrieval much easier.

---

### ✔️ 4️⃣ **Scales for complex workflows**

For simple RAG, `State` might just be:

```python
{
  "question": "...",
  "context": [...],
  "answer": "..."
}
```

But for real applications, `State` can include:

* Retrieval scores
* Source docs metadata
* Chain-of-thought traces
* Evaluation results
* User feedback

So your whole system becomes:

```
START
  ↓
Retrieve chunks → store in `context`
  ↓
Rerank chunks → update `context` with top 3
  ↓
Generate → store `answer`
  ↓
Run eval → store `eval_score`
  ↓
Save to LangSmith → full `State` trace
```

All structured, no surprises.
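
As a sketch of how an extra step slots in, here is a hypothetical `rerank` node together with an extended state type. The word-overlap scoring is a toy chosen so the example runs without any model; a real reranker would typically use a cross-encoder or a similar relevance model, and `ExtendedState` is an assumption, not something defined in this notebook.

```python
from typing import List, TypedDict


class ExtendedState(TypedDict, total=False):
    question: str
    context: List[str]  # stand-in for List[Document]
    answer: str
    eval_score: float


def rerank(state: ExtendedState, top_n: int = 3) -> dict:
    """Keep the top_n chunks by a toy word-overlap score (hypothetical scoring)."""
    query_words = set(state["question"].lower().split())

    def overlap(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    ranked = sorted(state["context"], key=overlap, reverse=True)
    return {"context": ranked[:top_n]}


state: ExtendedState = {
    "question": "what is task decomposition",
    "context": [
        "task decomposition splits a task",
        "memory is stored in a vector store",
        "agents use tools",
        "what is task decomposition in planning",
    ],
}
reranked = rerank(state)
print(reranked["context"])
```

Because the node only returns `{"context": ...}`, it can be added between retrieval and generation without touching either of them.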

---

## 🗂️ **Without `State` you’d have spaghetti**

Imagine instead:

* Each step just returns some raw value.
* You’re passing random variables between steps.
* Want to add a reranker? You’re untangling function signatures and manually merging variables.

Your pipeline gets brittle and confusing really fast.

---

## ✅ **State = your contract**

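The `State` schema tells LangGraph which keys the graph tracks, and each node's return value is merged into those keys. Keep in mind that a `TypedDict` is a static type hint: Python does not check value types at runtime, so run a type checker (or use a Pydantic model as the state schema) if you want the contract enforced.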

That’s the whole power:

> “My pipeline’s data flow is explicit and reliable.”

---

## ⚡️ **Key takeaway**

| ✅ Why `State` exists                                                  | ✅ Why it matters                              |
| --------------------------------------------------------------------- | --------------------------------------------- |
| Defines what your pipeline inputs, produces, and passes between steps | Makes steps modular, reusable, and safe       |
| Enables **step-by-step tracing**                                      | You know *exactly* what happened at each node |
| Makes complex flows manageable                                        | Easy to evolve, add steps, or run tests       |

---

## ✅ ✅ ✅ **Bottom line**

> 🔑 If you want your RAG pipeline to be **debuggable**, **trustworthy**, and **easy to extend**,
> *define your `State` clearly.*

It’s one of the biggest upgrades you get when moving from “LLM spaghetti code” to **robust LCEL or LangGraph orchestration.**


✅ **This is where your LangGraph pipeline actually *does the work*!**
Let’s break each function down line by line, so you see exactly *how* they transform the `State`.

---

### 📌 **Big picture**

Each function is a **graph step (node)**:

* It **takes in** the `State` defined earlier:

  ```python
  class State(TypedDict):
      question: str
      context: List[Document]
      answer: str
  ```
* It **returns** a `dict` with one or more keys that merge back into the `State`.

LangGraph keeps the `State` flowing through these nodes step-by-step.

---

## ✅ **🔹 1️⃣ `retrieve` step**

```python
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}
```

### ✔️ What this does:

1️⃣ **Takes the user’s question:**

```python
state["question"]
```

This comes from the initial input to the graph:

```python
graph.invoke({"question": "What are ReAct agents?"})
```

2️⃣ **Calls the vector store’s `similarity_search`:**

```python
retrieved_docs = vector_store.similarity_search(...)
```

✅ This searches your embeddings for the *most semantically similar chunks*.
✅ The result is a `List[Document]` (LangChain's `similarity_search` returns `k=4` documents by default).

3️⃣ **Returns an update to the `State`:**

```python
return {"context": retrieved_docs}
```

✅ LangGraph merges this with the existing state.
So after this step:

```python
{
  "question": "What are ReAct agents?",
  "context": [Document(...), Document(...)]
}
```
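
To show what "most semantically similar" means mechanically, here is a toy top-k search over hand-made vectors using cosine similarity. It is a sketch only: the three-dimensional "embeddings" are invented, `top_k` is a hypothetical helper, and a real vector store may use a different distance metric or an approximate index.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[str]:
    """Return the texts of the k documents whose vectors are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional "embeddings", for illustration only.
toy_index = [
    ("Task decomposition splits work into subgoals.", [0.9, 0.1, 0.0]),
    ("Agents can call external tools.", [0.1, 0.9, 0.0]),
    ("Vector stores hold embeddings.", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```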

---

## ✅ **🔹 2️⃣ `generate` step**

```python
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
```

### ✔️ What this does:

1️⃣ **Combines retrieved chunks into a single string:**

```python
docs_content = "\n\n".join(doc.page_content for doc in state["context"])
```

✅ This turns multiple retrieved `Document` objects into a single `context` block to feed the prompt.

---

2️⃣ **Calls the prompt template to render the final prompt:**

```python
messages = prompt.invoke({
  "question": state["question"],
  "context": docs_content
})
```

✅ This fills in `{question}` and `{context}` → outputs messages for the chat model.

---

3️⃣ **Calls the LLM to generate the final answer:**

```python
response = llm.invoke(messages)
```

✅ This is where the actual answer comes from — using the grounded context.

---

4️⃣ **Returns an update to the `State`:**

```python
return {"answer": response.content}
```

✅ So the final `State` now has:

```python
{
  "question": "...",
  "context": [...],
  "answer": "..."
}
```

---

## ✅ **So what’s the big picture?**

| Function   | Input                               | What it adds to `State`          | Output             |
| ---------- | ----------------------------------- | -------------------------------- | ------------------ |
| `retrieve` | `State` with `question`             | Adds `context` → retrieved docs  | `{"context": ...}` |
| `generate` | `State` with `question` + `context` | Adds `answer` → final LLM answer | `{"answer": ...}`  |

---

## 🧩 **How this connects to the graph**

When you do:

```python
from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
```

You’re telling LangGraph:

> “Run these nodes in sequence:
> START → retrieve → generate.
> Keep passing the `State` along and merge the updates at each step.”

---

## 🟢 **Why is this pattern powerful?**

✅ Each step is *tiny*, reusable, and testable on its own.
✅ `State` stays explicit — no hidden variables.
✅ Everything gets traced to LangSmith → so you see:

* What was retrieved
* What was generated
* Where things went wrong if they do
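
"Testable on its own" can be taken literally. The sketch below exercises the `generate` logic with stub objects instead of a real prompt and LLM. Everything named `Fake...` is a test double invented here, and the `make_generate` factory is an assumption: the notebook's `generate` reads `prompt` and `llm` from module scope, while the factory simply makes them injectable.

```python
from types import SimpleNamespace


class FakePrompt:
    """Test double for the prompt template: renders a simple string."""

    def invoke(self, variables: dict) -> str:
        return f"Q: {variables['question']}\nC: {variables['context']}"


class FakeLLM:
    """Test double for the chat model: returns an object with a .content attribute."""

    def invoke(self, messages: str):
        return SimpleNamespace(content=f"stub answer for: {messages}")


def make_generate(prompt, llm):
    def generate(state: dict) -> dict:
        docs_content = "\n\n".join(doc.page_content for doc in state["context"])
        messages = prompt.invoke({"question": state["question"], "context": docs_content})
        response = llm.invoke(messages)
        return {"answer": response.content}

    return generate


generate = make_generate(FakePrompt(), FakeLLM())
update = generate({
    "question": "What is ReAct?",
    "context": [SimpleNamespace(page_content="chunk A"),
                SimpleNamespace(page_content="chunk B")],
})
print(update)
```

The check here is structural: the node returns only the `answer` key, and both chunks reach the prompt.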



## Response

✅ This is the **final step** where your whole **LangGraph RAG pipeline** runs!
Let’s break down exactly what’s happening when you call:

---

```python
response = graph.invoke({"question": "What is Task Decomposition?"})
print(response["answer"])
```

---

### 📌 **1️⃣ `graph.invoke(...)`**

This kicks off your entire compiled **StateGraph**:

```python
graph = graph_builder.compile()
```

Your compiled graph knows:

```
START → retrieve → generate
```

So when you call:

```python
graph.invoke({"question": "What is Task Decomposition?"})
```

you’re saying:

> “Run this graph, starting with this `State`:
> `{ "question": "What is Task Decomposition?" }`”

---

## ✅ **2️⃣ What happens under the hood**

**Step-by-step flow:**

1️⃣ `State` starts with:

```python
{
  "question": "What is Task Decomposition?"
}
```

---

2️⃣ The **`retrieve` step** runs:

```python
retrieved_docs = vector_store.similarity_search(state["question"])
```

✅ It finds the most relevant chunks in your indexed docs → updates:

```python
{
  "question": "...",
  "context": [Document(...), Document(...)]
}
```

---

3️⃣ The **`generate` step** runs:

```python
docs_content = ...join chunks...
messages = prompt.invoke({"question": ..., "context": ...})
response = llm.invoke(messages)
```

✅ It produces your final answer → updates:

```python
{
  "question": "...",
  "context": [...],
  "answer": "..."
}
```

---

4️⃣ The final `State` is returned:

```python
response = {
  "question": "...",
  "context": [...],
  "answer": "..."
}
```

---

## ✅ **3️⃣ `print(response["answer"])`**

Finally, you just print the LLM’s generated answer:

```python
print(response["answer"])
```

---

## 🟢 **What’s actually logged to LangSmith?**

Because these environment variables are set:

```python
LANGCHAIN_TRACING_V2 = "true"
LANGCHAIN_API_KEY = ...
LANGCHAIN_PROJECT = ...
```

✅ This entire run is tracked step-by-step:

* Input question
* Retrieved context chunks (which docs were returned; plain `similarity_search` does not return similarity scores)
* Final prompt used for generation
* LLM output

So you can debug:

* Did your retriever find good chunks?
* Did the LLM answer accurately?
* Did it hallucinate?
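
The tracing settings listed above are environment variables, not Python variables. A minimal sketch of setting them from a notebook cell follows; the project name `project_00` is assumed from the `list_runs` call used in the eval cells, and the API key is deliberately not hard-coded.

```python
import os

# Turn tracing on for this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# Assumed project name (taken from the list_runs call in the eval cells);
# setdefault keeps any value that is already configured.
os.environ.setdefault("LANGCHAIN_PROJECT", "project_00")

# The API key should come from your .env file or a secret store.
# This only reports whether it is present; it never prints the key itself.
tracing_ready = bool(os.environ.get("LANGCHAIN_API_KEY"))
print("LANGCHAIN_API_KEY present:", tracing_ready)
```

These need to be set before the graph runs, otherwise that run will not show up in the project.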

---

## ✅ ✅ ✅ **Key takeaway**

| What you did                            | What it means                                                   |
| --------------------------------------- | --------------------------------------------------------------- |
| `graph.invoke({...})`                   | Runs your entire pipeline                                       |
| Starts with `{question: ...}`           | Input state                                                     |
| Ends with `{question, context, answer}` | Output state                                                    |
| `print(response["answer"])`             | Just shows the final LLM answer — grounded in your vector store |

---

## 🚀 **Next**

If you want, you can:
✔️ Inspect the entire `response`:

```python
print(response)
```

✔️ Loop through `response["context"]` to see what chunks were used:

```python
for doc in response["context"]:
    print(doc.page_content[:200])
    print(doc.metadata)
```

✔️ Or log this to LangSmith and trace every step!



In [68]:
response = graph.invoke({"question": "What is Task Decomposition?"})
print("\n--- RESPONSE ---")
print("\n".join(textwrap.wrap(response["answer"], width=80)))


--- RESPONSE ---
Task decomposition is the process of breaking down a complex task into smaller,
manageable steps or subgoals. This can be done using methods like simple
prompting, task-specific instructions, or human inputs. Techniques such as Chain
of Thought and Tree of Thoughts further enhance this process by facilitating
logical reasoning and exploration of multiple possibilities.


In [69]:
for doc in response["context"]:
    print(doc.page_content[:200])
    print(doc.metadata)

Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outl
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a s
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}
The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field d
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}
Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)

The system comprises of 4 stages:
(1) Task planning: LLM works as the brain and parses the user requests into multiple tasks

## Trace

✅ **Fantastic — this is the perfect next step!**
Using **LangSmith** for tracing is how you turn “black box LLM spaghetti” into **clear, debuggable RAG workflows**. Let’s break down exactly **what to look for** when you open your graph run in LangSmith.

---

### 🔍 **📌 What your trace shows**

When you run:

```python
response = graph.invoke({"question": "What is Task Decomposition?"})
```

✅ LangSmith records:

1. The **input** `State` (just the `question` at first)
2. Each **node** in your graph:

   * `retrieve`
   * `generate`
3. The **intermediate `State`** after each node
4. All **calls inside each node** (shown as child runs when the component supports tracing):

   * Similarity search: `vector_store.similarity_search(...)`
   * Prompt rendering: `prompt.invoke(...)`
   * LLM call: `llm.invoke(...)`

So you get a **full breadcrumb trail** of how your answer was made.

---

## 🗂️ **What should you actually check?**

Here’s a quick checklist:

---

### ✅ **1️⃣ Look at the input**

* Did your question come through exactly as expected?

  * e.g., `"What is Task Decomposition?"`
* Any weird formatting? Extra spaces? Garbage?

---

### ✅ **2️⃣ Look at the `retrieve` node**

* **Did it run `similarity_search`?**

  * You should see the vector store node.
* **What documents did it retrieve?**

  * Look for `page_content` of each chunk.
  * Did they *actually* mention “Task Decomposition”?
* **How many chunks did you get?**

  * If you see irrelevant or empty results → maybe your vector store is too small, or your embeddings didn’t index well.

---

### ✅ **3️⃣ Look at the `generate` node**

* **What was the final prompt?**

  * You’ll see exactly how `{question}` and `{context}` were filled in.
* **Any redundant info or repeated noise?**

  * If your context is cluttered, the LLM might give messy answers.

---

### ✅ **4️⃣ Look at the LLM output**

* **Does the answer actually use the retrieved context?**

  * Or did the model hallucinate details you can’t find in the context?
* **Is it concise?**

  * Does it respect your prompt instructions (e.g., “Three sentences maximum”)?
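
The "three sentences maximum" instruction can also be checked in code rather than by eye. This is a rough sketch: `count_sentences` is a hypothetical helper that splits on sentence-ending punctuation, so abbreviations such as "e.g." would be miscounted. The sample text is the answer printed earlier in this notebook.

```python
import re


def count_sentences(text: str) -> int:
    """Rough sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


answer = (
    "Task decomposition is the process of breaking down a complex task into smaller, "
    "manageable steps or subgoals. This can be done using methods like simple "
    "prompting, task-specific instructions, or human inputs. Techniques such as Chain "
    "of Thought and Tree of Thoughts further enhance this process by facilitating "
    "logical reasoning and exploration of multiple possibilities."
)

n_sentences = count_sentences(answer)
print(n_sentences, n_sentences <= 3)  # prints: 3 True
```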

---

### ✅ **5️⃣ Look for time & token usage**

* For each node, LangSmith shows usage data if available:

  * How long did each step take, and how many tokens did the LLM call consume?
  * Are you paying for extra tokens because of poor chunk prep?

---

## ⚡️ **Why this matters**

This is how you debug real-world RAG systems:

* **Good chunks, poor answer?** → Maybe the prompt needs tweaking.
* **Poor chunks, decent answer?** → You’re relying on the model’s base knowledge (danger!).
* **Nothing retrieved?** → Maybe your chunks don’t cover the query or your embeddings don’t match well.

---

## ✅ **Example: questions you can answer by tracing**

| What to check                      | What it tells you                                     |
| ---------------------------------- | ----------------------------------------------------- |
| *Which chunks were retrieved?*     | Is retrieval relevant?                                |
| *Is context too big or too small?* | Do you need different chunk size/overlap?             |
| *Is the LLM ignoring context?*     | Should your prompt be stricter?                       |
| *Are costs high?*                  | Can you filter or rerank results to use fewer chunks? |

---

## ✅ ✅ ✅ **Key takeaway**

LangSmith’s graph tracing lets you:
✔️ Inspect every node in your RAG flow.
✔️ Verify that each step does what you expect.
✔️ Debug retrieval quality, prompt design, and generation output in a single place.

---

## 🏆 **Pro tip**

Once you trust your trace, you can:

* Save example traces to share with teammates.
* Run LangSmith **evaluators** that score relevance automatically.
* Compare different chunking or embedding models side by side.




## Eval

In [75]:
%pip install -Uq openevals

In [71]:
from langsmith import Client

client = Client()  # Uses your LANGCHAIN_API_KEY from your .env

# Get your most recent run (you can filter by project too)
runs = client.list_runs(project_name="project_00", limit=1)
run = list(runs)[0]
print("Run ID:", run.id)


Run ID: b4b09c21-9ac2-49bc-84a7-960d571a4b73


In [79]:
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Make a correctness judge
correctness_judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",  # Or gpt-4o
    feedback_key="correctness",
)

# Run the eval
result = correctness_judge(
    inputs={"question": "What is Task Decomposition?"},
    outputs={"answer": response["answer"]},
    reference_outputs={"answer": "Task Decomposition means breaking a complex task into smaller, manageable sub-tasks."},
)

print(f"Key: {result['key']}")
print(f"Score: {result['score']}")

print("\nComment:")
print("\n".join(textwrap.wrap(result['comment'], width=80)))

Key: correctness
Score: True

Comment:
The answer correctly explains that task decomposition involves breaking a
complex task into smaller, manageable parts. It is factually accurate and
complete, including additional details about methods and techniques that relate
to logical reasoning which are consistent with the reference output. Thus, the
score should be: true.




### 📌 **How the correctness check works**

When you run something like:

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness",
)

result = judge(
    inputs={"question": "..."},
    outputs={"answer": "..."},
    reference_outputs={"answer": "..."}
)
```

Here’s what happens behind the scenes:

---

## ✅ **1️⃣ The evaluator sets up a “grading rubric”**

* `CORRECTNESS_PROMPT` is a clear system prompt template.
* It basically says:

  > *“Here is the question.
  > Here is the model’s answer.
  > Here is the reference answer.
  > Judge if the answer is factually correct, compared to the reference.”*

---

## ✅ **2️⃣ It calls an LLM to *be the judge***

* It passes your `inputs`, `outputs`, and `reference_outputs` to the judge LLM (e.g., `o3-mini` or `gpt-4o`).

* The prompt includes instructions along these lines (paraphrased, not the literal prompt text):

  ```
  Score: true if the answer is factually correct and matches the reference.
  Provide a concise explanation.
  ```

* The judge LLM then returns a **structured JSON**:

  ```json
  {
    "score": true,
    "comment": "This answer correctly explains..."
  }
  ```

---

## ✅ **3️⃣ The `score` means pass/fail for factual correctness**

* `score: True` means the model’s answer **is aligned with the reference**.
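* `score: False` means the answer contradicts the reference, misses its key point, or is otherwise factually wrong. Extra detail that is consistent with the reference does not fail on its own, as the `True` score for the longer answer above shows.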

So it’s using *another LLM* to compare your RAG output to a trusted answer.

---

## 🔍 **So who writes the “reference\_outputs”?**

* **You do** — you provide the “gold standard” expected answer for that question.
* This is crucial: if your reference answer is wrong, the eval will be off!

---

## ✅ ✅ ✅ **So how is correctness judged?**

| Step           | What it does                                                       |
| -------------- | ------------------------------------------------------------------ |
| System prompt  | Instructs the judge LLM on how to compare                          |
| Judge LLM      | Reads your model’s answer vs. reference                            |
| Grading rubric | Checks for factual match, logical completeness, no hallucinations  |
| Output         | `score` True/False and a natural language explanation in `comment` |
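
To see the evaluator contract without calling a model, here is an offline stand-in that mimics the call shape (`inputs`, `outputs`, `reference_outputs`) and the result keys (`key`, `score`, `comment`) shown above. `keyword_judge` and its `threshold` are hypothetical and are not part of `openevals`. It scores by word overlap, which is exactly the kind of shallow comparison an LLM judge is meant to improve on.

```python
def keyword_judge(*, inputs: dict, outputs: dict, reference_outputs: dict,
                  threshold: float = 0.5) -> dict:
    """Toy judge: passes when enough reference words appear in the answer."""
    ref_words = set(reference_outputs["answer"].lower().split())
    ans_words = set(outputs["answer"].lower().split())
    coverage = len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0
    return {
        "key": "correctness",
        "score": coverage >= threshold,
        "comment": f"{coverage:.0%} of reference words appear in the answer.",
    }


reference = {"answer": "Task Decomposition means breaking a complex task into smaller sub-tasks."}

good = keyword_judge(
    inputs={"question": "What is Task Decomposition?"},
    outputs={"answer": "Task decomposition means breaking a complex task into smaller steps."},
    reference_outputs=reference,
)
bad = keyword_judge(
    inputs={"question": "What is Task Decomposition?"},
    outputs={"answer": "Bananas are yellow."},
    reference_outputs=reference,
)
print(good)
print(bad)
```

A stub like this is handy for testing the surrounding evaluation loop before spending tokens on a real judge.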

---

## ⚡️ **Why this is so powerful**

* It lets you scale up evaluation without human reviewers reading every answer.
* It can catch subtle errors that word-overlap metrics miss, because the judge LLM compares meaning rather than exact wording (its verdicts are still worth spot-checking).
* The same setup applies whichever model produced the answer: GPT-4, Claude, or open-weight models.

---

## 🏆 **Key takeaway**

✅ **Correctness = reference-based eval:** *Did the answer match what it should be?*
✅ Uses an LLM-as-a-judge: your reference answer is the source of truth.
✅ Super powerful for **automating RAG quality checks**, especially for large datasets.




In [80]:
def pretty_print_eval_result(result, width=80):
    print(f"Key: {result['key']}")
    print(f"Score: {result['score']}\n")
    print("Comment:")
    print("\n".join(textwrap.wrap(result['comment'], width=width)))

# Usage:
pretty_print_eval_result(result)


Key: correctness
Score: True

Comment:
The answer correctly explains that task decomposition involves breaking a
complex task into smaller, manageable parts. It is factually accurate and
complete, including additional details about methods and techniques that relate
to logical reasoning which are consistent with the reference output. Thus, the
score should be: true.


## Testing

In [83]:
dataset = [
    {
        "question": "What is Task Decomposition?",
        "reference_answer": "Task Decomposition means breaking a complex task into smaller, manageable subtasks."
    },
    {
        "question": "What are ReAct agents?",
        "reference_answer": "ReAct agents combine reasoning and acting, enabling an agent to reason step-by-step and take actions using tools."
    },
    {
        "question": "What is long-term memory in agents?",
        "reference_answer": "Long-term memory allows an agent to persist information across sessions, often using a vector store for retrieval."
    },
]


client = Client()

correctness_judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness"
)

for item in dataset:
    question = item["question"]
    ref_answer = item["reference_answer"]

    # ✅ Runs for each question
    response = graph.invoke({"question": question})

    print(f"Q: {question}")
    print(f"Model Answer: {response['answer']}")
    print(f"Reference: {ref_answer}")

    eval_result = correctness_judge(
        inputs={"question": question},
        outputs={"answer": response["answer"]},
        reference_outputs={"answer": ref_answer}
    )

    print(f"Score: {eval_result['score']}")
    print("\n".join(textwrap.wrap(eval_result['comment'], width=80)))
    print("-" * 80)

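    # Optionally log the score to LangSmith.
    # Caveat: this attaches feedback to the most recent root run in the project,
    # assumed here to be the graph.invoke() call above. Traces are uploaded in
    # the background, so that assumption can fail if the run has not landed yet.
    runs = client.list_runs(project_name="project_00", is_root=True, limit=1)
    run_id = list(runs)[0].id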

    client.create_feedback(
        run_id=run_id,
        key="correctness",
        score=eval_result["score"],
        comment=eval_result["comment"]
    )


Q: What is Task Decomposition?
Model Answer: Task decomposition is the process of breaking down a complex task into smaller, manageable sub-tasks or steps. This can be achieved through various methods, such as prompting a language model with specific instructions, utilizing task-specific guidelines, or incorporating human inputs. Techniques like Chain of Thought and Tree of Thoughts enhance this process by systematically exploring steps and reasoning possibilities.
Reference: Task Decomposition means breaking a complex task into smaller, manageable subtasks.
Score: True
The answer accurately explains that task decomposition involves breaking a
complex task into smaller, manageable sub-tasks. It also provides examples of
methods such as using language model prompts, task-specific guidelines, human
inputs, and references techniques like Chain of Thought and Tree of Thoughts for
exploring reasoning and steps. The response is complete, factually correct, and
logically consistent with the r
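
Once the loop runs over a larger dataset, per-question printouts get hard to scan. Here is a minimal sketch of rolling the judge results up into a pass rate. It assumes each result is a dict with a boolean `score`, like the `eval_result` above. The helper `summarize_scores` and the sample data are hypothetical and not part of LangSmith:

```python
def summarize_scores(results):
    """Return (passed, total, pass_rate) for a list of judge results.

    Assumes each result is a dict with a boolean "score" key, like the
    eval_result dicts produced in the loop above.
    """
    total = len(results)
    passed = sum(1 for r in results if r["score"] is True)
    pass_rate = passed / total if total else 0.0
    return passed, total, pass_rate


# Hypothetical results, standing in for values collected inside the loop.
sample_results = [
    {"key": "correctness", "score": True, "comment": "matches the reference"},
    {"key": "correctness", "score": True, "comment": "matches the reference"},
    {"key": "correctness", "score": False, "comment": "omits persistence"},
]

passed, total, pass_rate = summarize_scores(sample_results)
print(f"{passed}/{total} correct ({pass_rate:.0%})")
```

In the real loop you would append each `eval_result` to a list and pass that list to the helper once the loop finishes.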


> *“Why use LangSmith when I could just run my RAG pipeline and print stuff myself?”*

Let’s break down exactly **what LangSmith adds**, so you see when it’s *worth it* — and when you can roll your own.

---

## 📌 **The big idea: LangSmith is not just logging**

LangSmith is like a **debugger + observability layer + experiment tracker** for LLM apps.

It gives you things that would be painful (or impossible) to maintain by hand:

* 🔍 **Full step-by-step trace** of every chain, agent, retriever, reranker, prompt, and model call.
* 🧵 **Structured `State` tracking** → you always know what data flows through each step.
* 🗂️ **Centralized storage of runs**, so you can compare versions.
* ✅ **Automated evals** → your correctness and relevance scores live next to the run that produced them.
* 🏷️ **Filtering, searching, comparing runs** across experiments.

---

## ✅ **Here’s what you get that you *don’t* get with just print statements**

| 🔍 Feature                        | Without LangSmith                                                | With LangSmith                                                                                 |
| --------------------------------- | ---------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| 🔗 **End-to-end trace**           | You’d need to manually log each input/output of every function.  | Each node in your graph is automatically traced: input, output, tokens, latency.               |
| 📜 **Prompt versioning**          | You’d have to store your prompt templates manually.              | Prompts pulled from LangChain Hub are versioned. You can see what prompt was used for any run. |
| 🕵️‍♂️ **Chunk-level visibility** | You’d have to print retrieved chunks to see if retrieval worked. | The retrieved docs and their metadata (plus scores, if your retriever returns them) are visible in the trace. |
| ⚡ **Eval feedback**               | You’d have to run judges + store results yourself.               | Eval results (correctness, relevance) attach directly to the run for easy comparison.          |
| 🗂️ **Experiment tracking**       | You’d need spreadsheets or local logs to compare versions.       | Runs are grouped by project and identified by run ID. You can filter by tags and metadata such as model, prompt, or retriever. |
| ✅ **Shareability**                | You’d need to copy logs for teammates.                           | You get a shareable link for any run with full context.                                        |
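
To make the "Without LangSmith" column concrete, here is a rough sketch of what hand-rolled tracing might look like. It uses only the standard library. The `traced` decorator, `TRACE_LOG`, and the toy `retrieve`/`generate` functions are all hypothetical stand-ins, not LangSmith APIs:

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory store for trace entries


def traced(step_name):
    """Record inputs, output, and latency of each call to the wrapped step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            TRACE_LOG.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_s": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


# Toy stand-ins for the real retrieve/generate graph nodes.
@traced("retrieve")
def retrieve(question):
    return ["Task decomposition splits a complex task into subtasks."]


@traced("generate")
def generate(question, context):
    return f"Based on {len(context)} chunk(s): {context[0]}"


question = "What is Task Decomposition?"
answer = generate(question, retrieve(question))

for entry in TRACE_LOG:
    print(f"{entry['step']:<9} {entry['latency_s']:.6f}s -> {entry['output']!r}")
```

Even this small version leaves out nested runs, token counts, error capture, persistence, and a UI for browsing results. That is the maintenance burden the table refers to.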

---

## ✅ **Example: A real debugging moment**

Imagine you deploy your RAG app and the LLM starts hallucinating.
With plain logs:

* You see the final answer.
* Maybe you see the question.
* But do you know:

  * What chunks were retrieved?
  * Which prompt was used?
  * What the retriever scores were?
  * Whether the retriever failed or the LLM hallucinated?

👉 **In LangSmith, you click into your run:**

* Inspect `retrieve` → see exactly which chunks the vector store found.
* Inspect `generate` → see the rendered prompt with `{context}`.
* Spot: *“Ah! My chunks don’t contain the answer → it’s a retriever problem, not the LLM!”*

✅ Saves you hours of guessing.
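
The "retriever or LLM?" triage above can also be roughed out in code once you have the retrieved chunks in hand. The sketch below uses a crude keyword-overlap heuristic. The `diagnose` function, its `min_overlap` threshold, and the sample chunks are assumptions for illustration, not something LangSmith computes for you:

```python
def diagnose(reference_answer, retrieved_chunks, min_overlap=0.5):
    """Guess whether a wrong answer stems from retrieval or generation.

    Heuristic: if few of the reference answer's keywords appear in the
    retrieved chunks, the retriever probably never surfaced the answer.
    """
    words = (w.strip(".,;:!?").lower() for w in reference_answer.split())
    keywords = {w for w in words if len(w) > 3}
    if not keywords:
        return "inconclusive"

    context = " ".join(retrieved_chunks).lower()
    hits = sum(1 for w in keywords if w in context)
    if hits / len(keywords) < min_overlap:
        return "likely retriever problem"
    return "likely generation problem"


reference = (
    "Task Decomposition means breaking a complex task into smaller, "
    "manageable subtasks."
)
good_chunks = [
    "Task decomposition means breaking a complex task into smaller, "
    "manageable subtasks or steps."
]
bad_chunks = ["Long-term memory is often backed by a vector store."]

print(diagnose(reference, bad_chunks))
print(diagnose(reference, good_chunks))
```

Substring matching on keywords is deliberately simple. A real check would more likely use an LLM judge or embedding similarity, but the branching logic stays the same.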

---

## ✅ **When is LangSmith worth it?**

| If you’re doing...                     | Is LangSmith worth it?                                   |
| -------------------------------------- | -------------------------------------------------------- |
| Small test notebook                    | Eh — you can just `print()` chunks.                      |
| Multi-step RAG graphs                  | Yes! Seeing `State` at each step is huge.                |
| Lots of prompt experiments             | Yes — version control + side-by-side runs.               |
| Automated evals                        | Yes — scores + traces in one place.                      |
| Sharing with your team or stakeholders | Big yes — they can click through what actually happened. |

---

## ⚡ **What LangSmith doesn’t do**

It won’t:

* Replace your vector store — it just traces calls to it.
* Replace your chunk cleaning — you still do that in your code.
* Fix a bad retriever or a noisy corpus.

✅ It makes it much easier to see *why* your LLM did what it did.

---

## 🟢 **Bottom line**

| LangSmith’s job | What it gives you                                      |
| --------------- | ------------------------------------------------------ |
| Observability   | What data went in/out at each step                     |
| Reproducibility | What prompt, model, and retriever version you used     |
| Debuggability   | What went wrong — bad chunks? bad generation?          |
| Evaluability    | Automatic scoring and feedback that lives with the run |

---

## 🎯 **Key takeaway**

👉 For **serious, repeatable RAG**, LangSmith saves you from homegrown logging, tangled notebooks, and “What the heck happened?” moments.

For small experiments?
Print statements are fine — but you’ll outgrow them fast!


