
# LangGraph and LangSmith — Agentic RAG Powered by LangChain

In this notebook we complete the Session 5 assignment.

- 🤝 **Breakout Room #1**
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 **Breakout Room #2**
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI


# 🤝 Breakout Room #1


## Part 1: LangGraph — Building Cyclic Applications with LangChain

LangGraph leverages LCEL to build coordinated multi-actor **stateful** apps that support cycles (loops). Cycles let the agent iterate until it has a good answer or hits guardrails you define.


## Task 1: Dependencies


If needed, install dependencies in your environment (already handled in project setup). In the notebook we import the libraries directly.



## Task 2: Environment Variables

Set the OpenAI, Tavily, and LangSmith API keys, then enable LangSmith tracing and name the tracing project.


In [None]:

import os
from uuid import uuid4

# Set API keys from environment variables or use placeholder values for demo
# In production, set these as environment variables or use a .env file
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "your-openai-api-key-here")
os.environ["TAVILY_API_KEY"] = os.getenv("TAVILY_API_KEY", "your-tavily-api-key-here") 
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGCHAIN_API_KEY", "your-langsmith-api-key-here")

# LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE8 - LangGraph - {uuid4().hex[:8]}"
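
The placeholder defaults above let the cell run without real keys, but any later API call would then fail with an authentication error. As a hedged alternative, the sketch below checks up front which keys are unset or still placeholders. `missing_keys` and `REQUIRED_KEYS` are hypothetical helper names, and the `"your-"` prefix test is an assumption based on the placeholder strings used in this notebook.

```python
import os

# Names of the keys this notebook expects (taken from the cell above).
REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY", "LANGCHAIN_API_KEY")

def missing_keys(env=os.environ, required=REQUIRED_KEYS, placeholder_prefix="your-"):
    """Return the required key names that are unset, empty, or still placeholders."""
    missing = []
    for name in required:
        value = env.get(name, "")
        if not value or value.startswith(placeholder_prefix):
            missing.append(name)
    return missing

# Demo with an explicit mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": "sk-demo", "TAVILY_API_KEY": "your-tavily-api-key-here"}
print(missing_keys(demo_env))  # ['TAVILY_API_KEY', 'LANGCHAIN_API_KEY']
```

Calling `missing_keys()` with no arguments checks the real environment, so a notebook could stop early with a clear message when the list is non-empty.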



## Task 3: Creating our Tool Belt

We'll use:
- **Tavily** web search (via `langchain-tavily`)
- **ArXiv** search (via `langchain_community`)


In [None]:

# IMPORTANT: Use the non-deprecated Tavily package
from langchain_tavily import TavilySearch
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearch(max_results=5)
tool_belt = [tavily_tool, ArxivQueryRun()]



### Model

We use OpenAI's chat model and **bind** the tool belt using function-calling semantics.


In [None]:

from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)
model = model.bind_tools(tool_belt)



#### ❓ Question #1: How does the model determine which tool to use?

**Answer:** The model is provided JSON schemas for the tools via `bind_tools`. On each turn it predicts optional `tool_calls` (function name + JSON args). If `tool_calls` are present in the AI message, our graph routes to the `ToolNode`, which executes the tools and returns results to the agent. The decision of *which* tool and *with what arguments* is learned behavior guided by the tool schemas and the conversation context.
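
To make the routing concrete, here is a minimal standalone sketch of what a message carrying `tool_calls` looks like and how the routing decision reads it. The messages are `SimpleNamespace` stand-ins rather than real LangChain objects, the tool name and arguments are illustrative, and `route` and `requested_tools` are hypothetical helpers.

```python
from types import SimpleNamespace

# Stand-in for an AI message in which the model asked for one tool call.
ai_message = SimpleNamespace(
    content="",
    tool_calls=[
        {"name": "arxiv", "args": {"query": "A Comprehensive Survey of Deep Research"}, "id": "call_1"}
    ],
)

# Stand-in for an AI message that answers directly, with no tool calls.
final_message = SimpleNamespace(content="Here is the answer.", tool_calls=[])

def route(message):
    """Go to the tool node when tool calls are present, otherwise finish."""
    return "action" if getattr(message, "tool_calls", None) else "end"

def requested_tools(message):
    """List the tool names the model asked for, in order."""
    return [call["name"] for call in getattr(message, "tool_calls", [])]

print(route(ai_message), requested_tools(ai_message))  # action ['arxiv']
print(route(final_message))                            # end
```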



## Task 4: Putting the State in Stateful

We carry a shared `messages` list around the graph so nodes can read/write context as the agent iterates.


In [None]:

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
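
The `Annotated[list, add_messages]` annotation tells LangGraph to merge each node's returned messages into the existing list instead of overwriting it. The sketch below imitates that idea with plain dicts: new messages are appended, and a message whose `id` matches an existing one replaces it. `merge_messages` is a hypothetical, simplified stand-in; the real reducer does more (for example, it works with message objects rather than dicts).

```python
def merge_messages(existing, new):
    """Simplified reducer: append new messages, or replace on a matching id."""
    merged = list(existing)
    index_by_id = {m["id"]: i for i, m in enumerate(merged) if m.get("id") is not None}
    for message in new:
        message_id = message.get("id")
        if message_id is not None and message_id in index_by_id:
            # Same id as an existing message: overwrite it in place.
            merged[index_by_id[message_id]] = message
        else:
            if message_id is not None:
                index_by_id[message_id] = len(merged)
            merged.append(message)
    return merged

state = {"messages": [{"id": "1", "role": "human", "content": "What is Deep Research?"}]}
update = {"messages": [{"id": "2", "role": "ai", "content": "Let me search."}]}

state["messages"] = merge_messages(state["messages"], update["messages"])
print(len(state["messages"]))  # 2
```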



## Task 5: It's Graphing Time!

We create two nodes:
- `agent`: calls the model
- `action`: executes tool calls (if any)


In [None]:

from langgraph.prebuilt import ToolNode

def call_model(state: AgentState):
    messages = state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}

tool_node = ToolNode(tool_belt)


In [None]:

from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)
uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)
uncompiled_graph.set_entry_point("agent")

def should_continue(state: AgentState):
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "action"
    return END

# If function returns END we finish; otherwise we go to the named node
uncompiled_graph.add_conditional_edges("agent", should_continue)
uncompiled_graph.add_edge("action", "agent")

simple_agent_graph = uncompiled_graph.compile()



#### ❓ Question #2: Is there a limit to how many times we can cycle? How could we impose one?

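**Answer:** The graph will keep looping as long as your conditional edges allow it, but LangGraph does enforce a `recursion_limit` (25 super-steps by default) and raises a `GraphRecursionError` when a run exceeds it. To impose a tighter or more graceful limit, pass `config={"recursion_limit": N}` when invoking the graph, or add a guard in the conditional edge (e.g., a counter in state or a check on `len(state["messages"])`) that routes to `END` once a threshold is reached. You can also add timeouts or a separate "helpfulness" gate to exit when the answer is sufficient.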
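
A guard in the conditional edge could look like the sketch below. It is standalone and uses `SimpleNamespace` stand-ins for messages; `should_continue_capped`, `make_message`, and `MAX_AGENT_TURNS` are hypothetical names, and the string `"end"` is assumed to be mapped to `END` when the edge is registered.

```python
from types import SimpleNamespace

MAX_AGENT_TURNS = 3  # hypothetical cap on how many times the agent may respond

def make_message(kind, tool_calls=None):
    """Build a stand-in message with a `type` and optional tool calls."""
    return SimpleNamespace(type=kind, content="", tool_calls=tool_calls or [])

def should_continue_capped(state, max_turns=MAX_AGENT_TURNS):
    """Route to tools only while the agent is under its turn budget."""
    messages = state["messages"]
    agent_turns = sum(1 for m in messages if getattr(m, "type", None) == "ai")
    if agent_turns >= max_turns:
        return "end"  # budget exhausted: stop even if tools were requested
    if getattr(messages[-1], "tool_calls", None):
        return "action"
    return "end"

call = [{"name": "arxiv", "args": {"query": "deep research"}}]
under_budget = {"messages": [make_message("human"), make_message("ai", call)]}
over_budget = {"messages": [
    make_message("human"),
    make_message("ai", call), make_message("tool"),
    make_message("ai", call), make_message("tool"),
    make_message("ai", call),
]}

print(should_continue_capped(under_budget))  # action
print(should_continue_capped(over_budget))   # end
```

Counting agent turns rather than total messages keeps the budget independent of how many tool results each turn produces.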



## Using Our Graph

We can stream updates to see tool calls and iterations.


In [None]:

from langchain_core.messages import HumanMessage

inputs = {"messages": [HumanMessage(content="How are technical professionals using AI to improve their work?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: {node}")
        print(values["messages"])
        print()


In [None]:

inputs = {"messages": [HumanMessage(content="Search Arxiv for the 'A Comprehensive Survey of Deep Research' paper, then search each of the authors to find out where they work now using Tavily!")]} 

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: {node}")
        if node == "action" and values["messages"]:
            # show which tool executed
            try:
                print(f"Tool used: {values['messages'][0].name}")
            except Exception:
                pass
        print(values["messages"])
        print()



#### 🏗️ Activity #2 — Steps the agent took

1. The agent read the user message and produced `tool_calls` indicating which tools to use and with what arguments.  
2. The conditional edge detected `tool_calls` and routed to the `action` node.  
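3. The `ToolNode` executed the requested tool (ArXiv first) and appended its result to state.  
4. Control returned to the `agent`, which read the result and requested the next tool (Tavily, for the author searches); steps 2-3 repeated until the agent had what it needed and synthesized the tool outputs into a final answer.  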
5. With no further `tool_calls`, the conditional edge returned `END`, finishing the run.
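
The steps above can be replayed with a small standalone simulation. Nothing here calls a real model or tool: `scripted_model` is a hypothetical stand-in that asks for ArXiv, then Tavily, then answers, and the tool names and results are illustrative only.

```python
from types import SimpleNamespace

def scripted_model(messages):
    """Hypothetical model: request ArXiv, then Tavily, then give a final answer."""
    tool_results = [m for m in messages if m.type == "tool"]
    if len(tool_results) == 0:
        calls = [{"name": "arxiv", "args": {"query": "Deep Research survey"}}]
    elif len(tool_results) == 1:
        calls = [{"name": "tavily_search", "args": {"query": "author affiliation"}}]
    else:
        return SimpleNamespace(type="ai", content="Final answer.", tool_calls=[])
    return SimpleNamespace(type="ai", content="", tool_calls=calls)

def run_tools(message):
    """Pretend to execute each requested tool and return one result per call."""
    return [
        SimpleNamespace(type="tool", name=c["name"], content=f"result of {c['name']}", tool_calls=[])
        for c in message.tool_calls
    ]

def run(question):
    """Alternate between the agent and action nodes until no tool is requested."""
    messages = [SimpleNamespace(type="human", content=question, tool_calls=[])]
    trace, node = [], "agent"
    while node != "end":
        trace.append(node)
        if node == "agent":
            messages.append(scripted_model(messages))
            node = "action" if messages[-1].tool_calls else "end"
        else:
            messages.extend(run_tools(messages[-1]))
            node = "agent"
    return trace, messages

trace, messages = run("Where do the authors work?")
print(trace)  # ['agent', 'action', 'agent', 'action', 'agent']
```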


# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

Wrap our graph to convert inputs and outputs.

In [None]:

def convert_inputs(input_object):
    return {"messages": [HumanMessage(content=input_object["text"])]}

def parse_output(state):
    return {"answer": state["messages"][-1].content}

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output
agent_chain_with_formatting.invoke({"text": "What is Deep Research?"})



### Task 1: Creating An Evaluation Dataset

Create at least 5 examples (we provide 6) related to the cohort use-case.


In [None]:

questions = [
    {
        "inputs": {"text": "Who were the main authors of 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications'?"},
        "outputs": {"must_mention": ["Renjun Xu", "Jingwen Peng", "Xu", "Peng"]},
    },
    {
        "inputs": {"text": "When was the 'A Comprehensive Survey of Deep Research' paper published?"},
        "outputs": {"must_mention": ["2025", "June"]},
    },
    {
        "inputs": {"text": "List two commercial deep research systems mentioned recently."},
        "outputs": {"must_mention": ["OpenAI", "Perplexity"]},
    },
    {
        "inputs": {"text": "What four technical dimensions are used to categorize deep research systems?"},
        "outputs": {"must_mention": ["reasoning", "tools", "planning", "synthesis"]},
    },
    {
        "inputs": {"text": "Name two challenges highlighted for deep research systems."},
        "outputs": {"must_mention": ["accuracy", "privacy"]},
    },
    {
        "inputs": {"text": "Where do the authors of the 'Deep Research' survey currently work?"},
        "outputs": {"must_mention": ["Zhejiang", "Liberty Mutual"]},
    },
]


Add the dataset to LangSmith:

In [None]:

from langsmith import Client
from uuid import uuid4

client = Client()
dataset_name = f"Simple Search Agent - Evaluation Dataset - {uuid4().hex[:8]}"
dataset = client.create_dataset(dataset_name=dataset_name, description="Questions about Deep Research to evaluate the Simple Search Agent.")
client.create_examples(dataset_id=dataset.id, examples=questions)
dataset_name


### Task 2: Adding Evaluators

In [None]:

from openevals.prompts import CORRECTNESS_PROMPT
print(CORRECTNESS_PROMPT)


In [None]:

from openevals.llm import create_llm_as_judge

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",   # tune as needed
    feedback_key="correctness",
)



Custom **must_mention** evaluator (improved for normalization and partial credit).


In [None]:

import re

def _normalize(t: str) -> str:
    return re.sub(r"\W+", " ", (t or "").lower()).strip()

def must_mention(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    required = reference_outputs.get("must_mention") or []
    out = _normalize(outputs.get("answer", ""))
    hits = 0
    for phrase in required:
        if _normalize(phrase) in out:
            hits += 1
    return hits / max(1, len(required))
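
A worked example helps show what partial credit means here. The sketch restates the same scoring logic in a standalone form (`normalize` and `mention_score` are hypothetical names for this illustration) and runs it on made-up answers against the reference list from the first dataset example.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-word characters to one space."""
    return re.sub(r"\W+", " ", (text or "").lower()).strip()

def mention_score(answer, required):
    """Fraction of required phrases found as substrings of the normalized answer."""
    normalized = normalize(answer)
    hits = sum(1 for phrase in required if normalize(phrase) in normalized)
    return hits / max(1, len(required))

required = ["Renjun Xu", "Jingwen Peng", "Xu", "Peng"]

# Surnames only: two of four phrases match.
print(mention_score("The survey was written by Xu and Peng.", required))  # 0.5

# Case and punctuation differences are normalized away.
print(mention_score("Authors: RENJUN XU; Jingwen-Peng.", required))       # 1.0

# Caveat: substring matching also fires inside longer words.
print(mention_score("The lab is based in Xuzhou.", ["Xu"]))               # 1.0
```

The last case is a limitation of plain substring matching, which is one motivation for a stricter or token-aware metric.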


### Task 3: Evaluating

In [None]:

results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset.name,
    evaluators=[correctness_evaluator, must_mention],
    experiment_prefix="simple_agent, baseline",
    description="Testing the baseline system.",
    max_concurrency=4,
)

print("If running in a traced environment, open LangSmith to view comparison for the latest experiment.")


## Part 2: LangGraph with Helpfulness


### Task 3: Adding Helpfulness Check and "Loop" Limits

We add a conditional that either (a) executes tools, (b) ends if the answer is helpful, or (c) loops back for another refinement. We also impose a hard cap on turns.



**Explanation:** We instantiate a new `StateGraph(AgentState)` and add two nodes: `agent` (calls the LLM) and `action` (executes tool calls). We'll add a helpfulness gate next.


In [None]:

graph_with_helpfulness_check = StateGraph(AgentState)
graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)



**Explanation:** Set `"agent"` as the entry point so every run starts with the model.


In [None]:

graph_with_helpfulness_check.set_entry_point("agent")



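**Explanation:** `tool_call_or_helpful` routes to `action` if the model emitted `tool_calls`. Otherwise it ends the run if the state already holds more than 10 messages (the hard cap), and only then runs a **helpfulness** check that compares the initial query with the latest response: if helpful → end; else → continue. Because the cap is checked only when no tool call is pending, it bounds the refinement loop, while runaway tool-calling is still bounded by LangGraph's `recursion_limit`.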


In [None]:

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state: AgentState):
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "action"

    # hard loop cap
    if len(state["messages"]) > 10:
        return "end"  # use mapping below

    initial_query = state["messages"][0]
    final_response = state["messages"][-1]

    prompt_template = """Given an initial query and a final response, determine if the final response is extremely helpful or not.
Reply with 'Y' if helpful, 'N' if not.

Initial Query:
{initial_query}

Final Response:
{final_response}
"""
    helpfulness_prompt = PromptTemplate.from_template(prompt_template)
    helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")
    helpfulness_chain = helpfulness_prompt | helpfulness_check_model | StrOutputParser()
    helpfulness_response = helpfulness_chain.invoke({
        "initial_query": initial_query.content,
        "final_response": final_response.content
    })
    return "end" if "Y" in helpfulness_response else "continue"
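
One fragile spot in the function above is the check `"Y" in helpfulness_response`, which is true for any reply containing a capital Y, including something like "N. Your answer is incomplete". A stricter parse could read only the first word of the reply, as in this sketch. `parse_verdict` is a hypothetical helper, and treating an unclear reply as `None` (to be handled by the caller) is an assumption.

```python
import re

def parse_verdict(reply):
    """Return True for a Y/YES verdict, False for N/NO, None when unclear."""
    match = re.match(r"\W*([A-Za-z]+)", reply or "")
    if not match:
        return None
    first_word = match.group(1).upper()
    if first_word in ("Y", "YES"):
        return True
    if first_word in ("N", "NO"):
        return False
    return None

print(parse_verdict("Y"))                             # True
print(parse_verdict("'N'"))                           # False
print(parse_verdict("N. Your answer is incomplete"))  # False
print("Y" in "N. Your answer is incomplete")          # True (the substring check misfires)
print(parse_verdict("Not sure"))                      # None
```

A caller could map `True` to `"end"`, `False` to `"continue"`, and decide explicitly what to do with `None`.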



**Explanation:** Connect conditional outcomes: `"action"` → `action`, `"continue"` → loop back to `agent`, `"end"` → finish.


In [None]:

from langgraph.graph import END

graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue": "agent",
        "action": "action",
        "end": END,
    }
)



**Explanation:** Tool results always flow back to `agent` so the model can read them and decide next steps.


In [None]:

graph_with_helpfulness_check.add_edge("action", "agent")



**Explanation:** Compile to a runnable graph and test with a prompt.


In [None]:

agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

inputs = {"messages": [HumanMessage(content="What are Deep Research Agents?")]}
async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: {node}")
        print(values["messages"])
        print()



## Part 3: LangGraph for the "Patterns" of GenAI

### Task 4: Helpfulness Check of GenAI Pattern Descriptions
Ask the system about the 3 main patterns: Context Engineering, Fine-tuning, and Agents.


In [None]:

patterns = ["Context Engineering", "Fine-tuning", "LLM-based agents"]

for pattern in patterns:
    q = f"What is {pattern} and when did it break onto the scene?"
    inputs = {"messages": [HumanMessage(content=q)]}
    state = agent_with_helpfulness_check.invoke(inputs)
    print(state["messages"][-1].content)
    print("\n")



#### ❓ Question #4: How could we improve the `must_mention` metric?

**Answer:** The upgraded `must_mention` evaluator below improves on the original in several ways:

- Normalizes case, punctuation, and whitespace.
- Accepts aliases/synonyms for each required term.
- Allows fuzzy matching (>= similarity threshold).
- Gives partial credit (0–1) with optional weights.
- Penalizes extraneous/hallucinated entities if you specify a disallow list.
- Optional “must cite” check (URL present) per item.
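
The fuzzy-matching idea in the list above rests on a similarity ratio between two strings. As a minimal sketch using the standard library's `difflib.SequenceMatcher`, the helpers below compare whole strings against a threshold; `similarity` and `fuzzy_equal` are hypothetical names, and 0.85 is simply the default threshold assumed in this notebook.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_equal(a, b, threshold=0.85):
    """True when the two strings are at least `threshold` similar."""
    return similarity(a, b) >= threshold

# A transposition typo still clears the threshold.
print(fuzzy_equal("Zhejiang University", "Zhejiang Univeristy"))  # True

# Unrelated names do not.
print(fuzzy_equal("Liberty Mutual", "State Farm"))                # False
```

Raising the threshold makes matching stricter; lowering it tolerates more typos at the cost of more false positives.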


In [None]:

import re
from difflib import SequenceMatcher
from typing import Dict, List, Union

def _norm(t: str) -> str:
    return re.sub(r"\W+", " ", (t or "").lower()).strip()

def _contains_fuzzy(haystack: str, needle: str, thresh: float = 0.85) -> bool:
    """Return True if needle appears in haystack, exactly or approximately.

    Slides a window with as many tokens as the needle over the haystack and
    compares each window to the needle with SequenceMatcher (ratio >= thresh).
    """
    if not haystack or not needle:
        return False
    # Fast path: direct containment
    if needle in haystack:
        return True
    tokens = haystack.split()
    # Window width is measured in tokens, so derive it from the needle's
    # token count (not its character length)
    width = max(1, len(needle.split()))
    for i in range(max(1, len(tokens) - width + 1)):
        chunk = " ".join(tokens[i : i + width])
        if SequenceMatcher(None, chunk, needle).ratio() >= thresh:
            return True
    return False

def must_mention(
    inputs: dict,
    outputs: dict,
    reference_outputs: dict
) -> float:
    """
    Robust must_mention metric with:
      - synonyms/aliases per required item
      - fuzzy matching
      - optional weights
      - optional negative penalty list
      - optional 'must_cite' URLs presence check
    Contract:
      reference_outputs may include:
        {
          "must_mention": [
            "Zhejiang University",
            ["Renjun Xu", "R. Xu"],        # aliases OK as list
            {"term": "Liberty Mutual", "aliases": ["Liberty"], "weight": 2.0, "must_cite": True},
          ],
          "disallow": ["MadeUpCo", "FakeLab"],   # penalize if present
          "fuzzy_threshold": 0.88                 # override default 0.85
        }
    Returns a float in [0, 1].
    """

    required: List[Union[str, list, dict]] = reference_outputs.get("must_mention") or []
    disallow: List[str] = reference_outputs.get("disallow", [])
    thresh: float = float(reference_outputs.get("fuzzy_threshold", 0.85))

    out_text = _norm(outputs.get("answer", ""))

    # Helper to test a single candidate (string) against output
    def _hit_one(candidate: str) -> bool:
        cand = _norm(candidate)
        return _contains_fuzzy(out_text, cand, thresh=thresh)

    # Score required terms with optional weights/aliases and optional must_cite
    total_weight = 0.0
    earned = 0.0

    # naive URL presence if must_cite is required for an item
    has_url = ("http://" in outputs.get("answer", "")) or ("https://" in outputs.get("answer", ""))

    for item in required:
        weight = 1.0
        must_cite = False
        aliases: List[str] = []

        if isinstance(item, str):
            aliases = [item]
        elif isinstance(item, list):
            # list of aliases
            aliases = item
        elif isinstance(item, dict):
            # {"term": "...", "aliases": [...], "weight": 2.0, "must_cite": True}
            main = item.get("term")
            if main:
                aliases.append(main)
            aliases += item.get("aliases", [])
            weight = float(item.get("weight", 1.0))
            must_cite = bool(item.get("must_cite", False))
        else:
            continue

        total_weight += weight

        # Mark hit if ANY alias matches
        hit = any(_hit_one(a) for a in aliases if a)
                                                                  
        # If the item requires a citation, we only award if there is a URL
        if hit and must_cite and not has_url:
            hit = False

        if hit:
            earned += weight

    # Optional penalty for disallowed / hallucinated entities (light penalty)
    penalty = 0.0
    if disallow:
        for bad in disallow:
            if _hit_one(bad):
                penalty += 0.25   # tune as needed

    # Avoid division by zero
    base_score = (earned / total_weight) if total_weight > 0 else 0.0
    final = max(0.0, min(1.0, base_score - penalty))
    return final


To use it in your dataset, keep simple strings as before or switch to richer items (alias lists, or dicts with `term`, `aliases`, `weight`, and `must_cite`):

In [None]:

questions = [
    {
        "inputs": {"text": "Who were the main authors of the Deep Research survey?"},
        "outputs": {
            "must_mention": [
                ["Renjun Xu", "R. Xu"],     # aliases
                ["Jingwen Peng", "J. Peng"]
            ],
            "disallow": ["Completely Fake Author"]
        },
    },
    {
        "inputs": {"text": "Where do the authors currently work?"},
        "outputs": {
            "must_mention": [
                {"term":"Zhejiang University", "aliases":["ZJU"], "weight": 2.0},
                {"term":"Liberty Mutual", "aliases":["Liberty"], "must_cite": True}
            ],
            "fuzzy_threshold": 0.9
        },
    },
]
