# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we use cycles rather than loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively allowing us to recreate application flowcharts in code in an almost 1-to-1 fashion.
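To make the flowchart idea concrete, here is a rough plain-Python sketch of an agent/tool cycle. It does not use LangGraph at all: the node functions (`fake_agent`, `fake_tool`), the `route` function, and the stopping rule are hypothetical stand-ins chosen only to show the shape of the loop.

```python
# Toy illustration only: plain Python, not LangGraph's API.
END = "__end__"  # hypothetical sentinel meaning "stop"

def fake_agent(state):
    # Pretend the agent keeps asking for a tool until it has seen two tool results.
    return {**state, "wants_tool": state["tool_results"] < 2}

def fake_tool(state):
    # Pretend a tool ran and produced one more result.
    return {**state, "tool_results": state["tool_results"] + 1}

def route(state):
    # The "conditional edge": decide where to go after the agent node.
    return "tool" if state["wants_tool"] else END

nodes = {"agent": fake_agent, "tool": fake_tool}

def run(state):
    current, visited = "agent", []
    while current != END:
        visited.append(current)
        state = nodes[current](state)
        # agent -> (tool | END); tool -> agent, which closes the cycle
        current = route(state) if current == "agent" else "agent"
    return state, visited

final_state, path = run({"tool_results": 0, "wants_tool": False})
print(path)
```

The printed path alternates between the two nodes until the agent stops asking for a tool, which is the same flow we'll build with real nodes and edges below.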

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies


## Task 2: Environment Variables

We'll want to set our OpenAI and Tavily API keys as well as our LangSmith environment variables.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [2]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE7 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain-community/tree/main/libs/community) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/arxiv/tool.py)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [4]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]



### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example, but using OpenAI isn't *necessary* with LangGraph. LangGraph supports all models, though you might not find success with smaller models, so the recommendation is to stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [5]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)

Now that we have our model set-up, let's "put on the tool belt", which is to say: We'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [6]:
model = model.bind_tools(tool_belt)
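To get a feel for what "an OpenAI function calling format" means, here is a minimal sketch that builds a tool description of roughly that shape from a plain Python function. The helper `describe_tool`, the `JSON_TYPES` table, and the `web_search` function are hypothetical and written only for illustration; this is not how `bind_tools` is implemented.

```python
import inspect

# Hypothetical mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_tool(func):
    """Build an OpenAI-style tool description from a plain function (illustrative)."""
    params = inspect.signature(func).parameters
    properties = {
        name: {"type": JSON_TYPES.get(p.annotation, "string")}
        for name, p in params.items()
    }
    # Parameters without a default value are treated as required.
    required = [n for n, p in params.items() if p.default is inspect.Parameter.empty]
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

def web_search(query: str, max_results: int = 5) -> str:
    """Search the web and return a summary of the top results."""
    return f"results for {query}"

schema = describe_tool(web_search)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```

The name, docstring, and typed parameters are what end up in the description, which is why well-named and well-documented tools matter.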

#### ❓ Question #1:

How does the model determine which tool to use?

##### ✅ Answer:

## How LangGraph Determines Tool Selection

In LangGraph, tool selection is driven by the **LLM's decision-making process** combined with **conditional routing logic**. Here's how it works:

### 1. Tool Binding and Schema Exposure

First, tools are bound to the model, making them available for selection:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import ToolNode

@tool
def get_weather(location: str) -> str:
    """Get current weather for a location."""
    return f"Weather in {location}: 72°F, sunny"

@tool  
def search_database(query: str) -> str:
    """Search the knowledge database."""
    return f"Found results for: {query}"

# Bind tools to the model (`llm` is assumed to be a tool-calling chat model, e.g. a ChatOpenAI instance)
llm_with_tools = llm.bind_tools([get_weather, search_database])
```

### 2. Model Decision Process

The model has the freedom to choose which tool to use based on the user's input. The LLM analyzes the input and determines:
- **Whether** to call any tool
- **Which specific tool** to call  
- **What arguments** to pass

A key principle of tool calling is that the model decides when to use a tool based on the input's relevance. The model doesn't always need to call a tool.

### 3. Tool Calls Detection

When the model decides to use a tool, it generates an `AIMessage` with a `tool_calls` attribute:

```python
def call_model(state: MessagesState):
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}
```

### 4. Conditional Routing Logic

The core routing mechanism uses a `should_continue` function that inspects the model's response:

```python
def should_continue(state: MessagesState) -> str:
    messages = state["messages"]
    last_message = messages[-1]
    
    # Check if model called any tools
    if last_message.tool_calls:
        return "tools"  # Route to tool execution
    else:
        return END      # End conversation
```

### 5. Graph Structure

The graph defines the flow: START ➔ agent ➔ should_continue ➔ either tools (which loops back to agent) or END:

```python
from langgraph.graph import StateGraph, MessagesState, START, END

workflow = StateGraph(MessagesState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", ToolNode([get_weather, search_database]))

workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue, ["tools", END])
workflow.add_edge("tools", "agent")
```

### 6. Tool Selection Factors

The model's tool selection is influenced by:

- **Tool descriptions**: Tools that are well-named, correctly-documented and properly type-hinted are easier for models to use
- **Input relevance**: The model matches user intent to appropriate tools
- **Tool simplicity**: Simple, narrowly scoped tools are easier for models to use correctly

### 7. Advanced Control

You can influence tool selection through:

```python
# Force tool usage
llm_with_tools = llm.bind_tools(tools, tool_choice="any")

# Limit parallel calls  
llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=False)

# Dynamic tool selection with retrieval
selected_tools = retrieve_relevant_tools(user_query)
llm_with_selected_tools = llm.bind_tools(selected_tools)
```

### Bottom Line

The model decides whether to invoke a tool and determines the appropriate arguments by analyzing the user input against the available tool schemas. The `should_continue` function then routes the graph flow based on whether `tool_calls` are present in the model's response, creating a **ReAct (Reasoning + Acting) pattern** where the agent alternates between reasoning and tool execution until the task is complete.

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is one that is stored in a `TypedDict` with the key `messages`, whose value is a `Sequence` of `BaseMessage` objects that will be appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AIMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AIMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!

In [7]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
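As a rough mental model of the "appended to whenever the state changes" behaviour, here is a toy append-only reducer in plain Python. The names `append_messages` and `apply_update` are hypothetical, and the messages are plain strings; the real `add_messages` works on message objects and does more than this sketch (for example, it takes message IDs into account).

```python
# Toy reducer: append-only, standing in for the role `add_messages` plays.
def append_messages(existing, new):
    return list(existing) + list(new)

def apply_update(state, update):
    # Merge a node's partial update into the state using the reducer.
    return {"messages": append_messages(state["messages"], update.get("messages", []))}

state = {"messages": []}
state = apply_update(state, {"messages": ["human: Who captains the Jets?"]})
state = apply_update(state, {"messages": ["ai: (requests a web search)"]})
print(len(state["messages"]))
```

Each node only returns the new messages; the reducer is what turns those partial updates into a growing history.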

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [8]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool
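To picture what "a node which can call a tool" has to do, here is a hand-rolled sketch that looks up each requested tool by name and runs it with the supplied arguments. It is an illustration only, not `ToolNode`'s implementation: the `fake_search` and `fake_arxiv` functions, the `TOOLS` registry, and the result dictionaries are hypothetical, while the `name`/`args`/`id` keys mirror the tool-call entries visible in the agent output later in this notebook.

```python
# Illustrative only: a hand-rolled dispatcher, not ToolNode's implementation.
def fake_search(query: str) -> str:
    return f"search results for '{query}'"

def fake_arxiv(query: str) -> str:
    return f"arxiv papers about '{query}'"

TOOLS = {"fake_search": fake_search, "fake_arxiv": fake_arxiv}

def run_tool_calls(tool_calls):
    """Run each requested tool and pair the result with the id of the call."""
    results = []
    for call in tool_calls:
        tool = TOOLS[call["name"]]       # pick the tool by name
        output = tool(**call["args"])    # pass the model-chosen arguments
        results.append({"tool_call_id": call["id"], "name": call["name"], "content": output})
    return results

calls = [
    {"name": "fake_arxiv", "args": {"query": "QLoRA"}, "id": "call_1"},
    {"name": "fake_search", "args": {"query": "Winnipeg Jets captain"}, "id": "call_2"},
]
results = run_tool_calls(calls)
for result in results:
    print(result)
```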

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [9]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x111311be0>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [10]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x111311be0>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message has any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

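It's important to highlight that, because we don't pass a mapping here, the values returned by our conditional function must themselves be valid destinations. In this case `should_continue` returns either `"action"` (the name of our action node) or `END`. If we did pass a dictionary as the third parameter (the mapping), it should be created with the possible outputs of our conditional function in mind, with each possible output mapped to a destination node or to `END`.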
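Here is a small sketch of how such a mapping would work, using stub objects in place of real messages. Everything in it is an assumption made for a self-contained demo: `END` is defined locally as a stand-in for `langgraph.graph.END`, and `should_continue_labelled`, `path_map`, and `next_node` are hypothetical names.

```python
from types import SimpleNamespace

END = "__end__"  # stand-in for langgraph.graph.END, defined locally for this demo

def should_continue_labelled(state):
    # Return a label rather than a node name; the mapping translates it.
    last_message = state["messages"][-1]
    return "continue" if last_message.tool_calls else "end"

# Every label the router can return needs an entry here.
path_map = {"continue": "action", "end": END}

def next_node(state):
    return path_map[should_continue_labelled(state)]

with_tool_call = {"messages": [SimpleNamespace(tool_calls=[{"name": "arxiv"}])]}
without_tool_call = {"messages": [SimpleNamespace(tool_calls=[])]}
print(next_node(with_tool_call), next_node(without_tool_call))
```

With LangGraph, a mapping like `path_map` would be passed as the third argument to `add_conditional_edges`.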

In [11]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x111311be0>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [12]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x111311be0>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [13]:
simple_agent_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

##### ✅ Answer:

## LangGraph Cycle Limits

### Default Behavior
LangGraph has a built-in **recursion limit** that prevents infinite loops. By default this limit is 25 steps. Once the limit is reached, LangGraph raises a `GraphRecursionError`.
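As a toy model of a step budget (not LangGraph's actual implementation), the sketch below runs a step function repeatedly and raises once the budget is spent. `StepLimitReached`, `run_with_limit`, and the two step functions are hypothetical names; `StepLimitReached` merely plays the role that `GraphRecursionError` plays in LangGraph.

```python
class StepLimitReached(RuntimeError):
    """Hypothetical error, playing the role GraphRecursionError plays in LangGraph."""

def run_with_limit(step, state, limit=25):
    # `step` returns (new_state, done); stop when done or when the budget is spent.
    for _ in range(limit):
        state, done = step(state)
        if done:
            return state
    raise StepLimitReached(f"stopped after {limit} steps")

def never_finishes(state):
    return state + 1, False

def finishes_at_three(state):
    return state + 1, state + 1 >= 3

print(run_with_limit(finishes_at_three, 0))
try:
    run_with_limit(never_finishes, 0, limit=5)
except StepLimitReached as err:
    print(err)
```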

### 1. Global Recursion Limit

You can control the overall number of graph execution steps using the `recursion_limit` parameter:

```python
# Method 1: Pass at runtime
graph.invoke(inputs, config={"recursion_limit": 10})

# Method 2: Configure permanently  
graph_with_limit = graph.with_config(recursion_limit=10)

# Method 3: Handle exceptions
from langgraph.errors import GraphRecursionError

try:
    response = graph.invoke(inputs, {"recursion_limit": 5})
except GraphRecursionError:
    print("Agent stopped due to max iterations.")
```

### 2. Custom Cycle Counters in State

For more granular control, implement custom counters within your state:

```python
from typing import TypedDict, Annotated
from langgraph.graph import END
from langgraph.graph.message import add_messages

class StateWithCounter(TypedDict):
    messages: Annotated[list, add_messages]
    iteration_count: int
    step_counter: dict  # Track specific loops
    max_iterations: int

def should_continue(state: StateWithCounter) -> str:
    messages = state["messages"]
    last_message = messages[-1]
    
    # Check global iteration limit
    if state["iteration_count"] >= state["max_iterations"]:
        return END
    
    # Check tool calls
    if last_message.tool_calls:
        return "tools"
    return END

def call_model(state: StateWithCounter):
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    
    # Increment counter
    return {
        "messages": [response],
        "iteration_count": state["iteration_count"] + 1
    }
```

### 3. Node-Specific Loop Limits

For per-node limits, keep a counter for each node in the state and check it against that node's maximum before proceeding with another iteration:

```python
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    step_counter: dict[str, int]  # Per-node counters
    max_tool_calls: int
    max_analysis_steps: int

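def tool_decision_node(state: AgentState):
    # Nodes return state updates, so this node only bumps its own counter
    tool_count = state["step_counter"].get("tools", 0)
    new_counter = state["step_counter"].copy()
    new_counter["tools"] = tool_count + 1
    return {"step_counter": new_counter}

def route_after_tools(state: AgentState) -> str:
    # Routing function for a conditional edge: stop once the tool budget is spent
    if state["step_counter"].get("tools", 0) >= state["max_tool_calls"]:
        return END
    return "agent"  # assumes the model node is registered as "agent"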
```

### 4. Practical Implementation

A dataclass-based state can also carry an iteration counter and a `max_iterations` limit to control how many times the workflow can loop, enabling iterative reasoning or decision-making by the agent:

```python
from dataclasses import dataclass

@dataclass
class ComplexAgentState:
    query: str = ""
    context: str = ""
    iteration: int = 0
    max_iterations: int = 3
    next_action: str = "continue"  # written by analysis_node below
    
def analysis_node(state: ComplexAgentState):
    if state.iteration >= state.max_iterations:
        return {"next_action": "finish"}
    
    # Perform analysis
    result = analyze_query(state.query)
    
    return {
        "context": result,
        "iteration": state.iteration + 1,
        "next_action": "continue" if needs_more_analysis(result) else "finish"
    }
```

### Bottom Line

**No inherent cycle limit exists** beyond the global recursion limit (25 steps by default). You can impose limits through:

1. **Global approach**: Pass a `recursion_limit` value in the config when invoking the graph, for example `graph.invoke({...}, {"recursion_limit": 100})`
2. **Custom state counters**: Track iterations within your state and implement conditional logic in your `should_continue` functions
3. **Node-specific limits**: Use per-node counters in the state for granular control over individual loops

The custom state approach gives you the most flexibility, allowing different limits for different types of operations while maintaining full control over the agent's behavior.

## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [14]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_RSN7ADZ1fAfLArxoXf7wZUUG', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 23, 'prompt_tokens': 162, 'total_tokens': 185, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-BrCdcXMHPSWUSsTEK59SDkEjIXZZ0', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--bd0a8cb1-2576-429a-82d7-461d850cf08a-0', tool_calls=[{'name': 'tavily_search_results_json', 'args': {'query': 'current captain of the Winnipeg Jets'}, 'id': 'call_RSN7ADZ1fAfLArxoXf7wZUUG

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found `tool_calls` on the last message, and sent the state object to the action node
4. The action node ran the requested tool, added the tool's response to the state object, and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, found no `tool_calls` on the last message, and passed the state object to END, where we see it output in the cell above!
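The raw message reprs printed above are hard to scan, so here is a minimal sketch of a helper that reduces each update to one short line. `summarize_update` is a hypothetical helper, and the stub objects below only imitate the two attributes it reads (`content` and `tool_calls`); with the real stream you would pass `values["messages"]` for each node instead.

```python
from types import SimpleNamespace

def summarize_update(node, messages):
    """Return one short line per message instead of the full repr."""
    lines = []
    for message in messages:
        tool_calls = getattr(message, "tool_calls", None) or []
        if tool_calls:
            names = ", ".join(call["name"] for call in tool_calls)
            lines.append(f"{node}: requested tool(s): {names}")
        else:
            content = getattr(message, "content", "")
            lines.append(f"{node}: {str(content)[:80]}")  # truncate long answers
    return lines

# Stub objects standing in for real message objects.
request = SimpleNamespace(
    content="",
    tool_calls=[{"name": "tavily_search_results_json", "args": {}, "id": "call_1"}],
)
answer = SimpleNamespace(content="A final answer would appear here.", tool_calls=[])
print(summarize_update("agent", [request]))
print(summarize_update("agent", [answer]))
```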

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [15]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_ZiRCPFRMrk5STyAjwroPWyPi', 'function': {'arguments': '{"query": "QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_bxELJtaBCWhl2BnMAPZoR88N', 'function': {'arguments': '{"query": "latest Tweet of author"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 54, 'prompt_tokens': 178, 'total_tokens': 232, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-BrCdjlP6gVFvl3dKrlIWJo4c6r0nn', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--0d924770-5c85-4755-8e49-087aa6cce9b3-0', tool_calls=[{'name': 'arxiv', 'a

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

##### ✅ Answer:

Here are the steps the agent took to answer the query **"Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!"**:

## Step-by-Step Agent Execution

### 1. **Initial Query Analysis**
The agent received a complex multi-part request requiring two different information sources:
- Academic paper search (Arxiv)
- Social media/web search (Tavily)

### 2. **Parallel Tool Selection**
The agent requested **both tools in a single step** rather than sequentially, which meant the Tavily query was written before the authors were known:

```python
tool_calls=[
    {'name': 'arxiv', 'args': {'query': 'QLoRA'}},
    {'name': 'tavily_search_results_json', 'args': {'query': 'latest Tweet of author'}}
]
```

### 3. **Tool Execution Phase**
**Action Node** executed both tools:

- **ArxivQueryRun**: Retrieved detailed information about QLoRA papers, including:
  - Main paper: "QLoRA: Efficient Finetuning of Quantized LLMs" 
  - Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
  - Publication date: 2023-05-23
  - Technical details about the approach

- **TavilySearchResults**: Searched for "latest Tweet of author" but returned generic results about various authors rather than specific tweets from the QLoRA paper authors

### 4. **Information Integration**
The agent analyzed both tool outputs and recognized that:
- The Arxiv search was successful and provided the target paper
- The Tavily search didn't return specific tweets from the QLoRA authors
- The search query was too generic ("latest Tweet of author")

### 5. **Final Response Synthesis**
The agent provided a structured response:
- Confirmed finding the QLoRA paper with full citation details
- Listed the authors found
- Acknowledged the limitation in the tweet search results
- Provided links to various author profiles found by Tavily

## Key Insights

**Parallel Processing**: The agent efficiently used both tools simultaneously rather than waiting for sequential execution.

**Query Interpretation**: The agent understood that "latest Tweet of author" referred to the QLoRA paper authors, though the search wasn't specific enough.

**Graceful Handling**: When the tweet search didn't return the expected results, the agent still provided useful information about author profiles.

**Structured Output**: The final response clearly separated the successful Arxiv results from the less successful Twitter search results.

This demonstrates LangGraph's ability to coordinate multiple tools and handle partial success scenarios while maintaining a coherent response structure.

# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [16]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output
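Conceptually, the `|` chain above just feeds the output of each step into the next. The sketch below shows that idea with plain functions; `compose`, `to_state`, `stub_graph`, and `last_content` are hypothetical names, and `stub_graph` is a canned stand-in for the compiled graph so the example runs without any API keys.

```python
from types import SimpleNamespace

def compose(*steps):
    """Chain callables left to right, similar in spirit to `a | b | c`."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

def to_state(input_object):
    # Dataset-style input -> graph-style state
    return {"messages": [SimpleNamespace(content=input_object["question"])]}

def stub_graph(state):
    # Hypothetical stand-in for the compiled graph: appends a canned reply.
    reply = SimpleNamespace(content=f"echo: {state['messages'][-1].content}")
    return {"messages": state["messages"] + [reply]}

def last_content(state):
    # Graph-style state -> plain string
    return state["messages"][-1].content

chain = compose(to_state, stub_graph, last_content)
print(chain({"question": "What is RAG?"}))
```

This is why the wrapped chain accepts `{"question": ...}` and returns a plain string, which is the shape the evaluation dataset and evaluator below work with.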

In [17]:
agent_chain_with_formatting.invoke({"question" : "What is RAG?"})

"RAG can refer to different concepts depending on the context. Could you please specify whether you're asking about RAG in the context of project management, machine learning, or another field?"

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [18]:
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [19]:
from langsmith import Client

client = Client()

dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

{'example_ids': ['14bfed01-8f31-4365-8efa-bce818d9f6c7',
  'f932c488-9a4c-4fcd-8f55-a034a27d61ef',
  '96279627-6209-4a7f-a158-cded03fe7809',
  'a704243d-5c8c-4a8d-8a4b-358937accede',
  'b1e60149-5ea1-42df-aa35-584ece34b20e',
  '338b5881-783f-4cf4-8f10-2b922714b256'],
 'count': 6}

#### ❓ Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

##### ✅ Answer:

The correct answers are associated with questions through **positional indexing**. Here's how it works:

## Association Method

The questions and answers are linked by their **position in their respective arrays**:

```python
questions = [
    "What optimizer is used in QLoRA?",                    # Index 0
    "What data type was created in the QLoRA paper?",     # Index 1
    "What is a Retrieval Augmented Generation system?",   # Index 2
    "Who authored the QLoRA paper?",                      # Index 3
    "What is the most popular deep learning framework?",  # Index 4
    "What significant improvements does the LoRA system make?" # Index 5
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},      # Index 0 → matches question 0
    {"must_mention" : ["NF4", "NormalFloat"]},      # Index 1 → matches question 1
    {"must_mention" : ["ground", "context"]},       # Index 2 → matches question 2
    {"must_mention" : ["Tim", "Dettmers"]},         # Index 3 → matches question 3
    {"must_mention" : ["PyTorch", "TensorFlow"]},   # Index 4 → matches question 4
    {"must_mention" : ["reduce", "parameters"]},    # Index 5 → matches question 5
]
```

## LangSmith Dataset Creation

The association is maintained when creating the dataset:

```python
client.create_examples(
    inputs=[{"question" : q} for q in questions],  # Preserves order
    outputs=answers,                               # Preserves order
    dataset_id=dataset.id,
)
```

#### ✅ Is This Problematic?

**Yes, this approach has several potential issues:**

1. **Fragile Ordering**: If you accidentally reorder one array but not the other, the associations break silently
2. **Maintenance Risk**: Adding/removing questions requires careful synchronization of both arrays
3. **Error-Prone**: No explicit validation that arrays have the same length
4. **Poor Readability**: The relationship between questions and answers isn't immediately obvious

## Better Approaches

```python
# Option 1: Combined data structure
evaluation_data = [
    {
        "question": "What optimizer is used in QLoRA?",
        "must_mention": ["paged", "optimizer"]
    },
    {
        "question": "What data type was created in the QLoRA paper?",
        "must_mention": ["NF4", "NormalFloat"]
    }
]

# Option 2: Dictionary mapping
qa_pairs = {
    "What optimizer is used in QLoRA?": {"must_mention": ["paged", "optimizer"]},
    "What data type was created in the QLoRA paper?": {"must_mention": ["NF4", "NormalFloat"]}
}
```

The current positional approach works but is brittle and should be replaced with a more robust data structure for production use.
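As a minimal sketch of the combined-structure option, the helper below keeps each question next to its expected phrases and only splits them into parallel lists at the last moment, validating every row on the way. `split_examples` is a hypothetical helper written for illustration.

```python
evaluation_data = [
    {"question": "What optimizer is used in QLoRA?", "must_mention": ["paged", "optimizer"]},
    {"question": "Who authored the QLoRA paper?", "must_mention": ["Tim", "Dettmers"]},
]

def split_examples(rows):
    """Turn paired rows into parallel inputs/outputs lists, validating each row."""
    inputs, outputs = [], []
    for i, row in enumerate(rows):
        if not row.get("question") or not row.get("must_mention"):
            raise ValueError(f"row {i} is missing a question or its expected phrases")
        inputs.append({"question": row["question"]})
        outputs.append({"must_mention": row["must_mention"]})
    return inputs, outputs

inputs, outputs = split_examples(evaluation_data)
print(inputs[1], outputs[1])
```

The resulting lists have the same shape as the `inputs` and `outputs` arguments passed to `client.create_examples` earlier, so a question and its answer can no longer drift apart when rows are added, removed, or reordered.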

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive exact-match process to determine if our response contains specific strings.

In [20]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)
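To see how the check behaves on its own, here is the same substring logic applied to plain strings, outside of LangSmith. `mentions_all` is a hypothetical helper that mirrors the `all(phrase in prediction ...)` line above, and the sample sentences are made up for the demonstration.

```python
def mentions_all(prediction, required):
    """Same check as the evaluator: every phrase must be a case-sensitive substring."""
    return all(phrase in prediction for phrase in required)

required = ["Tim", "Dettmers"]
exact_case = mentions_all("The paper was written by Tim Dettmers and colleagues.", required)
lower_case = mentions_all("the paper was written by tim dettmers and colleagues.", required)
no_requirements = mentions_all("Anything at all", [])
print(exact_case, lower_case, no_requirements)
```

Note that a lowercase answer fails even though it names the right person, and an example with no required phrases always passes.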

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

##### ✅ Answer:

## Current Method Limitations

The current evaluator uses simple exact string matching:
```python
@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)
```

## Major Gaps & Improvements

### 1. **Case Sensitivity Issues**
```python
# Current: Fails if case doesn't match exactly
# "Tim Dettmers" vs "tim dettmers" → False

# Improved:
score = all(phrase.lower() in prediction.lower() for phrase in required)
```

### 2. **Partial Word Matching Problems**
```python
# Current: "paged" matches "paged optimizer" but also "rampaged"
# Need word boundary checking:
import re
score = all(re.search(r'\b' + re.escape(phrase) + r'\b', prediction, re.IGNORECASE) 
           for phrase in required)
```

### 3. **Synonyms and Semantic Equivalence**
```python
# Current: Misses semantically equivalent terms
# "NF4" required but answer says "4-bit NormalFloat" → should pass
# "reduce parameters" required but answer says "fewer parameters" → should pass

# Improved: Use semantic similarity
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_match(required_phrase, prediction, threshold=0.8):
    # Embed both texts and compare them with cosine similarity
    embeddings = model.encode([required_phrase, prediction], convert_to_tensor=True)
    # The threshold is a starting point and should be tuned on real examples
    return util.cos_sim(embeddings[0], embeddings[1]).item() > threshold
```

### 4. **Contextual Understanding**
```python
# Current: No context awareness
# "Tim" could refer to any Tim, not specifically Tim Dettmers

# Improved: Check context
def contextual_mention(required_terms, prediction):
    # Check if terms appear in meaningful context
    return all(term in prediction and 
              any(context_word in prediction.lower() 
                  for context_word in ["author", "paper", "researcher"]) 
              for term in required_terms)
```

### 5. **Scoring Granularity**
```python
# Current: Binary pass/fail
# Improved: Partial credit scoring
def partial_credit_evaluator(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    
    matches = sum(1 for phrase in required if phrase.lower() in prediction.lower())
    score = matches / len(required) if required else 0
    
    return EvaluationResult(
        key="must_mention", 
        score=score,
        comment=f"Found {matches}/{len(required)} required phrases"
    )
```

### 6. **Robustness to Variations**
```python
# Handle common variations
def robust_evaluator(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    
    # Handle variations
    variations = {
        "PyTorch": ["pytorch", "torch", "pytorch framework"],
        "TensorFlow": ["tensorflow", "tf", "tensor flow"],
        "NF4": ["nf4", "normalfloat4", "normal float 4"]
    }
    
    score = all(
        any(variant.lower() in prediction.lower() 
            for variant in variations.get(phrase, [phrase]))
        for phrase in required
    )
    
    return EvaluationResult(key="must_mention", score=score)
```

### 7. **Add Detailed Feedback**
```python
@run_evaluator
def enhanced_must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    
    found_phrases = []
    missing_phrases = []
    
    for phrase in required:
        if phrase.lower() in prediction.lower():
            found_phrases.append(phrase)
        else:
            missing_phrases.append(phrase)
    
    score = len(found_phrases) / len(required) if required else 0
    
    return EvaluationResult(
        key="must_mention",
        score=score,
        comment=f"Found: {found_phrases}, Missing: {missing_phrases}"
    )
```

## Bottom Line

The current method is too simplistic for production use. Key improvements should focus on:
1. **Case-insensitive matching**
2. **Word boundary detection**
3. **Semantic similarity for synonyms**
4. **Partial credit scoring**
5. **Detailed feedback on what was found/missing**
6. **Handling of common variations and abbreviations**

These improvements would make the evaluator much more robust and provide better insights into model performance.
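
As a rough sketch of how several of these ideas could fit together, the helper below combines case-insensitive matching, word-boundary checks, and partial credit in plain Python. The name `score_must_mention` and its return shape are hypothetical, and the LangSmith wrapper is deliberately left out so the matching logic can be checked on its own.

```python
import re


def score_must_mention(prediction: str, required: list[str]) -> dict:
    """Hypothetical helper: fraction of required phrases found as whole words."""
    found, missing = [], []
    for phrase in required:
        # \b anchors keep "paged" from matching inside "rampaged"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, prediction, re.IGNORECASE):
            found.append(phrase)
        else:
            missing.append(phrase)
    # An empty requirement list scores 0, as in the partial-credit snippet
    score = len(found) / len(required) if required else 0.0
    return {"score": score, "found": found, "missing": missing}


result = score_must_mention(
    "QLoRA uses a Paged Optimizer and the NF4 data type.",
    ["paged", "optimizer", "NormalFloat"],
)
print(result)
```

The returned values could then be passed into `EvaluationResult(key="must_mention", score=..., comment=...)` inside an evaluator, with the found and missing phrases going into the comment.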

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [21]:
experiment_results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset_name,
    evaluators=[must_mention],
    experiment_prefix=f"Search Pipeline - Evaluation - {uuid4().hex[0:4]}",
    metadata={"version": "1.0.0"},
)

View the evaluation results for experiment: 'Search Pipeline - Evaluation - 8690-6a8a9360' at:
https://smith.langchain.com/o/7ffaf126-290e-4d08-9a81-6ef0b42d5153/datasets/2e674849-8462-481f-8ec0-a285bbb02079/compare?selectedSessions=e28148c1-e997-436f-925d-b1e97d6482bd





In [22]:
experiment_results

## Part 2: LangGraph with Helpfulness:

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again. We can check the length of the `messages` list in the state, so we don't need additional state to track the number of loops.

In [23]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
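
Because `add_messages` appends new messages to `state["messages"]`, the list grows on every pass through the graph, which is what lets its length act as a loop counter. A minimal sketch of that check, using a plain dict in place of the real state and a hypothetical helper name `over_loop_limit`:

```python
def over_loop_limit(state: dict, max_messages: int = 10) -> bool:
    """Hypothetical helper: True once the message history exceeds the limit."""
    return len(state["messages"]) > max_messages


# Plain strings stand in for LangChain message objects here
short_state = {"messages": ["question", "tool call", "tool result"]}
long_state = {"messages": [f"message {i}" for i in range(11)]}

print(over_loop_limit(short_state))  # False
print(over_loop_limit(long_state))   # True
```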

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
Think of this like building a flowchart for a smart assistant. We're creating a new "graph" (like a flowchart) that will have different steps our AI assistant can take.

In this cell, we're adding two main "stations" or "nodes" to our flowchart:
1. **Agent node**: This is where our AI assistant thinks and decides what to do
2. **Action node**: This is where our AI assistant actually uses tools (like searching the internet or looking up research papers)

It's like setting up a workspace with two desks - one for thinking and one for doing tasks!

In [24]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x11183f250>

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
This is like telling our AI assistant where to start! Every flowchart needs a starting point, right?

In this cell, we're saying "When someone asks you a question, start at the 'agent' desk first." This means:
- The AI will always begin by thinking about the question
- It won't immediately jump to using tools
- It's like a student who reads the question carefully before deciding whether they need to look something up

Think of it like the "START" circle at the beginning of a flowchart - we're pointing to the agent node and saying "Begin here!"

In [25]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x11183f250>

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
This is the "brain" of our AI assistant! It's like a traffic controller that decides where to go next. Let me break it down:

**The function makes 3 important decisions:**

1. **"Do I need to use tools?"** - If the AI wants to search the internet or look up papers, it says "go to action"
2. **"Have I talked too much?"** - If there are more than 10 messages, it says "stop" (like a parent saying "enough chatting!")
3. **"Is my answer good enough?"** - It asks another AI to grade its response:
   - If the answer is helpful → "end" (we're done!)
   - If the answer isn't helpful → "continue" (try again!)

It's like a smart study buddy who knows when to look things up, when to stop talking, and when their explanation is clear enough for you to understand!

In [26]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

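  # Loop limit: stop once the history grows past 10 messages.
  # The label must be a key of the conditional-edge mapping ("end"), not "END".
  if len(state["messages"]) > 10:
    return "end"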

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  helpfulness_prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

  helpfulness_chain = helpfulness_prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

##### ✅ Answer:

**What's happening in the `tool_call_or_helpful` function:**

This function is like a smart decision-maker that looks at what the AI just did and decides what should happen next. Let me break it down step by step:

**Step 1: "Did the AI want to use tools?"**
- It checks if the AI's last message included requests to use tools (like searching the internet)
- If YES → return "action" (go use those tools!)
- If NO → keep checking...

**Step 2: "Has the conversation gotten too long?"**
- It counts how many messages have been exchanged
- If there are more than 10 messages → stop the run (this is getting too long!)
- If there are 10 or fewer → keep checking...

**Step 3: "Is the AI's answer actually helpful?"**
- It takes the original question and the AI's current answer
- It asks a second AI to be like a teacher grading the response: "Is this helpful? Y or N?"
- If the grade is "Y" (helpful) → return "end" (we're done!)
- If the grade is "N" (not helpful) → return "continue" (try again!)

**Think of it like a quality control inspector:**
- First, they check if more work needs to be done (tools needed?)
- Then, they check if we've spent too much time already (message limit)
- Finally, they check if the final product is good enough (helpfulness check)

This function is the "brain" that keeps our AI from getting stuck in endless loops while making sure it gives good answers!
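
The order of the three checks can be sketched without calling a model. In the version below, `FakeMessage`, `route`, and the `is_helpful` callback are hypothetical stand-ins: the callback replaces the second LLM call, and the sketch returns the lowercase label `"end"` at the loop limit so that every return value is one of three labels.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeMessage:
    """Stand-in for a LangChain message: text plus any requested tool calls."""
    content: str
    tool_calls: list = field(default_factory=list)


def route(messages: list, is_helpful: Callable[[str, str], bool],
          max_messages: int = 10) -> str:
    last = messages[-1]
    if last.tool_calls:                # Step 1: tools were requested
        return "action"
    if len(messages) > max_messages:   # Step 2: loop limit reached
        return "end"
    # Step 3: judge the latest response against the initial query
    return "end" if is_helpful(messages[0].content, last.content) else "continue"


never_helpful = lambda query, response: False  # judge that always answers "N"

history = [FakeMessage("What is LoRA?"),
           FakeMessage("", tool_calls=[{"name": "search"}])]
print(route(history, never_helpful))   # action

history += [FakeMessage("tool output"), FakeMessage("LoRA is ...")]
print(route(history, never_helpful))   # continue

long_history = history + [FakeMessage("retry")] * 7  # 11 messages in total
print(route(long_history, never_helpful))  # end
```

Passing the judge in as a callback keeps the routing logic testable on its own; in the notebook that role is played by the helpfulness chain.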

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
Now we're connecting the dots! Remember that traffic controller function we just made? This is where we tell our flowchart what to do with those decisions.

**We're creating 3 different paths the AI can take:**

1. **"continue"** → Goes back to the agent (like saying "think again!")
2. **"action"** → Goes to the action node (like saying "time to use your tools!")
3. **"end"** → Stops the conversation (like saying "we're done here!")

Imagine you're at a crossroads with 3 signs:
- ← "Think More" (continue)
- → "Use Tools" (action)  
- ↓ "All Done!" (end)

The traffic controller (our function) reads the situation and points you toward the right path. It's like a choose-your-own-adventure book, but the AI is making the choices!

In [27]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x11183f250>
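
One way to picture the mapping: the router returns a label, and the label is looked up to find the next node. The sketch below imitates that lookup with a plain dict. `resolve_next` and `END_SENTINEL` are illustrative stand-ins, not LangGraph internals.

```python
END_SENTINEL = "<end of graph>"  # placeholder for LangGraph's END constant

path_map = {
    "continue": "agent",
    "action": "action",
    "end": END_SENTINEL,
}


def resolve_next(label: str) -> str:
    """Hypothetical helper: translate a router label into a destination."""
    if label not in path_map:
        raise KeyError(
            f"router returned {label!r}, expected one of {sorted(path_map)}"
        )
    return path_map[label]


print(resolve_next("continue"))             # agent
print(resolve_next("end") == END_SENTINEL)  # True
```

The lookup is by exact key, so in this sketch a label such as `"END"` would not match `"end"`. It is worth keeping the router's return values and the mapping keys identical.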

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
This is like creating a "return path" in our flowchart! 

**Here's what's happening:**
- When the AI finishes using its tools (like searching the internet), it doesn't just stop
- Instead, it **always** goes back to the "thinking" desk (agent node) to process what it found
- It's like a student who goes to the library, finds information, and then comes back to their desk to write their report

**Why is this important?**
- The AI needs to think about the information it found
- It might need to search for more things
- It needs to put together a helpful answer from all the pieces it collected

Think of it like a basketball player who gets the ball, shoots, and then **always** runs back to defense. The action node always returns to the agent node - no exceptions!

In [28]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x11183f250>

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
This is the magic moment! We've been building our flowchart piece by piece, and now we're turning it into a working AI assistant.

**Think of it like this:**
- We've drawn all the boxes and arrows on paper (our flowchart)
- Now we're bringing it to life - like turning a blueprint into a real house!

**What "compile" means:**
- It takes our flowchart design, checks that the nodes and edges fit together, and creates a working AI assistant
- It's like turning a saved level design into a playable game - everything becomes real and usable
- The AI can now actually follow the paths we created

**What can our new AI do?**
- It can think, use tools, and make decisions
- It knows when to stop talking (no more endless loops!)
- It can judge if its own answers are good enough
- It's like having a study buddy who knows when to research more and when to give you the final answer

Now our AI assistant is ready to help people with questions!

In [29]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

<!-- YOUR MARKDOWN HERE -->
##### ✅ Answer:
Time for the big test! Let's see if our AI assistant actually works the way we designed it.

**We're asking a challenging question:**
"Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?"

This is like asking 3 different questions at once - perfect for testing our AI!

**What we're watching for:**
- Does the AI realize it needs to search for information? (Will it use tools?)
- Does it search for all the different topics we asked about?
- Does it put together a helpful answer from what it finds?
- Does it know when to stop and say "I'm done"?

**The "streaming" part:**
- Instead of waiting for the final answer, we get to watch the AI work step by step
- It's like watching a chef cook instead of just getting the finished meal
- We can see each decision the AI makes along the way

This is our "report card" moment - let's see how well our AI assistant performs!

In [30]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_Qm3qI2HjXYodlWduMXXfjdyI', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_1pxnN8Sp2UIm86w78C5XKTYA', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_Lsx5cldlTqCNM2LnGnLfer0z', 'function': {'arguments': '{"query": "Attention in machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 79, 'prompt_tokens': 177, 'total_tokens': 256, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'c
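
With `stream_mode="updates"`, each chunk is a dict keyed by the node that just ran, which is why the loop above unpacks `chunk.items()`. The sketch below mimics that shape with a hypothetical `fake_astream` generator and placeholder strings instead of real messages. Notebooks allow a top-level `async for`; in a regular script the loop has to live inside a coroutine that is run with `asyncio.run`.

```python
import asyncio


async def fake_astream():
    """Stand-in for astream(..., stream_mode="updates"): one dict per step."""
    yield {"agent": {"messages": ["AI message requesting a tool call"]}}
    yield {"action": {"messages": ["tool result"]}}
    yield {"agent": {"messages": ["final answer"]}}


async def collect_updates() -> list:
    seen = []
    async for chunk in fake_astream():
        # Each chunk maps the node that just ran to the update it produced
        for node, values in chunk.items():
            seen.append((node, values["messages"]))
    return seen


updates = asyncio.run(collect_updates())
for node, messages in updates:
    print(f"Receiving update from node: '{node}' -> {messages}")
```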

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [31]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [32]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Prompt engineering is the process of designing and refining prompts to effectively communicate with and elicit desired responses from AI language models like GPT-3 and GPT-4. It involves crafting clear, specific, and contextually appropriate prompts to improve the quality, relevance, and accuracy of the AI's outputs. Prompt engineering is crucial for maximizing the utility of AI models in various applications, including content creation, coding, data analysis, and more.

Prompt engineering began gaining significant attention around 2020-2021, coinciding with the rise of large language models and their increasing adoption across industries. As these models became more powerful and versatile, the need for effective prompt design became evident, leading to the emergence of prompt engineering as a specialized skill and area of study within AI and NLP communities. The term and its practices have continued to evolve rapidly as AI technology advances.



RAG, which stands for Retrieval-Augmen