[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb)

#### [LangChain Handbook](https://www.pinecone.io/learn/series/langchain/)

# Retrieval Agents

We've seen in previous chapters how powerful [retrieval augmentation](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/) and [conversational agents](https://www.pinecone.io/learn/series/langchain/langchain-agents/) can be. They become even more impressive when we begin using them together.

Conversational agents can struggle with data freshness, knowledge about specific domains, or accessing internal documentation. By coupling agents with retrieval augmentation tools we can address these problems.

On the other hand, using "naive" retrieval augmentation without an agent means we retrieve contexts with *every* query. This isn't always ideal, as not every query requires access to external knowledge.

Merging these methods gives us the best of both worlds. In this notebook we'll learn how to do this.

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [1]:
!pip install -qU \
    pinecone==7.3.0 \
    langchain==0.3.25 \
    langchain-openai==0.3.22 \
    langchain-pinecone \
    tiktoken==0.9.0 \
    datasets==3.6.0




## Building the Knowledge Base

We start by constructing our knowledge base. We'll use a mostly prepared dataset called **S**tanford **Qu**estion-**A**nswering **D**ataset (SQuAD) hosted on Hugging Face *Datasets*. We download it like so:

In [2]:
from datasets import load_dataset

data = load_dataset('squad', split='train')
data

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

The dataset contains duplicate contexts, because several questions share the same passage. To remove them, we first convert the dataset to a pandas DataFrame:

In [3]:
data = data.to_pandas()
data.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


In [4]:
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 5 | 5733bf84d058e614000b61be | University_of_Notre_Dame | As at most other universities, Notre Dame's st... | When did the Scholastic Magazine of Notre dame... | {'text': ['September 1876'], 'answer_start': [... |
| 10 | 5733bed24776f41900661188 | University_of_Notre_Dame | The university is the major seat of the Congre... | Where is the headquarters of the Congregation ... | {'text': ['Rome'], 'answer_start': [119]} |
| 15 | 5733a6424776f41900660f51 | University_of_Notre_Dame | The College of Engineering was established in ... | How many BS level degrees are offered in the C... | {'text': ['eight'], 'answer_start': [487]} |
| 20 | 5733a70c4776f41900660f64 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... | What entity provides help with the management ... | {'text': ['Learning Resource Center'], 'answer... |
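
To make the effect of `drop_duplicates(subset='context', keep='first')` concrete, here is a minimal pure-Python sketch of the same idea. The records and the `drop_duplicate_contexts` helper are hypothetical stand-ins for illustration, not part of the notebook or of pandas.

```python
# Toy records standing in for SQuAD rows (hypothetical data, not from the dataset).
records = [
    {"id": "a1", "context": "Paragraph one.", "question": "Q1?"},
    {"id": "a2", "context": "Paragraph one.", "question": "Q2?"},
    {"id": "b1", "context": "Paragraph two.", "question": "Q3?"},
]


def drop_duplicate_contexts(rows):
    """Keep the first row seen for each distinct context, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            unique.append(row)
    return unique


unique_records = drop_duplicate_contexts(records)
print([r["id"] for r in unique_records])
```

Only the first question for each passage survives, which is why the row labels in the deduplicated DataFrame above jump from 0 to 5 to 10.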


### Initialize the Embedding Model and Vector DB

We'll be using OpenAI's `text-embedding-3-small` model, initialized via LangChain, together with the Pinecone vector DB. We start by initializing the embedding model, for which we need an [OpenAI API key](https://platform.openai.com/docs/overview).

*(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)*

In [5]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") \
    or getpass("Enter your OpenAI API key: ")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [6]:
from langchain_openai import OpenAIEmbeddings

model_name = 'text-embedding-3-small'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)


Now we create our vector DB to store our vectors. For this we need a [free Pinecone API key](https://app.pinecone.io), which can be found under "API Keys" in the left navbar of the Pinecone dashboard.

In [7]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") \
    or getpass("Enter your Pinecone API key: ")

# configure client
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [8]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

When creating the index, we set `dimension` equal to the dimensionality of `text-embedding-3-small` (`1536`), and use a `metric` compatible with `text-embedding-3-small` (this can be either `cosine` or `dotproduct`). We also pass our `spec` to index initialization.

In [9]:
import time

index_name = "langchain-retrieval-agent"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is the first run)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-3-small
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}

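On a first run, the new Pinecone index should report a `total_vector_count` of `0`, as we haven't added any vectors yet. The output above shows `18891`, most likely because it was captured from an index that had already been populated.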
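
On the choice of `metric`: OpenAI describes its embeddings as normalized to length 1, and for unit-length vectors cosine similarity and dot product give the same score. The sketch below illustrates this with hypothetical 3-dimensional vectors; the helper names are ours and nothing here calls OpenAI or Pinecone.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Hypothetical 3-dimensional vectors; the real embeddings have 1536 dimensions.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit vectors both norms are 1, so the two scores agree.
print(math.isclose(dot(u, v), cosine(u, v)))
```

Under that assumption, either metric should rank results identically, so `dotproduct` is a reasonable choice here.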

## Indexing

We can perform the indexing task using the LangChain vector store object, but for now it is much faster to do it via the Pinecone Python client directly. We will do this in batches of `100` or more.

In [10]:
from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    # get end of batch
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    # build the metadata dict for each record in the batch
    metadatas = [{
        'title': record['title'],
        'text': record['context']
    } for _, record in batch.iterrows()]
    # get the list of contexts / documents (plain list, not a pandas Series)
    documents = batch['context'].tolist()
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # get IDs
    ids = batch['id'].tolist()
    # add everything to pinecone as a list of (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadatas)))



With everything indexed, we can check the number of vectors in our index like so:

In [11]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}
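
The batch arithmetic in the indexing loop (`i` and `i_end`) can be checked in isolation. The sketch below is a small stdlib-only illustration; `batch_ranges` is a hypothetical helper, not part of the Pinecone client.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in batches."""
    for start in range(0, total, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(total, start + batch_size)


ranges = list(batch_ranges(total=250, batch_size=100))
print(ranges)

# Number of upsert calls needed for the 18891 vectors reported above.
n_batches = len(list(batch_ranges(total=18891, batch_size=100)))
print(n_batches)
```

Each pair corresponds to one `data.iloc[i:i_end]` slice, so 18891 records at a batch size of 100 take 189 upsert calls.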

## Creating a Vector Store and Querying

Now that we've built our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

In [12]:
from langchain_pinecone import PineconeVectorStore

# initialize the vector store object
vectorstore = PineconeVectorStore(index=index, embedding=embed)

As in previous examples, we can use the `similarity_search` method to do a pure semantic search (without the generation component).

In [13]:
query = "when was the college of engineering in the University of Notre Dame established?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(id='57338724d058e614000b5c9f', metadata={'title': 'University_of_Notre_Dame'}, page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president."),
 Document(id='573393e1d058e614000b5dc2', metadata={'title': '

Looks like we're getting good results. 
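
Conceptually, `similarity_search` embeds the query and returns the `k` stored vectors that score highest under the index metric. The sketch below shows that ranking by brute force on hypothetical 2-dimensional vectors. It is only an illustration of the idea: a vector database such as Pinecone relies on its own index structures rather than scanning every vector, and `top_k` is not a LangChain or Pinecone function.

```python
def top_k(query_vec, vectors, k=3):
    """Return the ids of the k stored vectors with the highest dot product."""
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), vec_id)
        for vec_id, vec in vectors.items()
    ]
    # Highest score first.
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Hypothetical toy "index" mapping document ids to vectors.
store = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
    "doc-d": [-1.0, 0.0],
}
nearest = top_k([1.0, 0.1], store, k=3)
print(nearest)
```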

Let's keep this exact sentence in mind for future reference:

> In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis.


Let's take a look at how we can begin integrating this into a conversational agent.

## Initializing the Conversational Agent

Our conversational agent needs several components:

1. A chat LLM for generating responses
2. A QA chain for retrieving and processing information
3. A chat history manager for maintaining conversation context

Let's create the first two using LangChain Expression Language (LCEL):

In [14]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSerializable

llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name="gpt-4o-mini",
    temperature=0.0,
)

# First create the basic QA chain
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based on the following context:"),
    ("system", "{context}"),
    ("human", "{question}")
])

# Create the QA chain
qa: RunnableSerializable = (
    {
        "context": lambda x: x["context"],
        "question": lambda x: x["question"]
    }
    | qa_prompt
    | llm
)

Using these we can generate an answer with the `invoke` method:

In [15]:
qa_response = qa.invoke({
    "question": "When was Notre Dame's College of Engineering established?",
    "context": vectorstore.similarity_search(
        "When was Notre Dame's College of Engineering established?",
        k=3
    )
})
print("QA Chain Response:", qa_response.content)

QA Chain Response: The College of Engineering at Notre Dame was established in 1920.
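
In the call above, the list of `Document` objects is passed as `context` directly, so it is converted to a string when the prompt is formatted, which likely carries ids and metadata into the prompt along with the text. A hedged alternative is to join only the passage text first. The `Doc` dataclass below is a hypothetical stand-in for a LangChain `Document`, and `format_context` is our own helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs):
    """Join only the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("First passage.", {"title": "Example"}),
    Doc("Second passage.", {"title": "Example"}),
]
context = format_context(docs)
print(context)
```

In the notebook this would amount to passing `format_context(vectorstore.similarity_search(...))` as the `context` value, assuming the returned objects expose `page_content` as they do in the search output shown earlier.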


But this isn't yet ready for our conversational agent. For that we need to convert this retrieval chain into a tool. We do that like so:

In [17]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.runnables.base import RunnableSerializable
import json

# Define the knowledge_base tool that wraps our vector store search
@tool
def knowledge_base(query: str) -> str:
    """Search the knowledge base for information about Notre Dame University.
    Use this tool when users ask questions about Notre Dame University.
    
    Args:
        query: The search query about Notre Dame University
        
    Returns:
        Relevant information from the knowledge base
    """
    try:
        # Run a semantic search against the vector store
        results = vectorstore.similarity_search(query, k=5)
        if results:
            context = "\n".join([doc.page_content for doc in results])
            return f"Based on the knowledge base: {context}"
        else:
            return "No relevant information found in the knowledge base."
    except Exception as e:
        return f"Error searching knowledge base: {str(e)}"

class BufferWindowMessageHistory:
    def __init__(self, k: int = 5):
        self.k = k
        self.messages = []
    
    def add_message(self, message: BaseMessage):
        self.messages.append(message)
        if len(self.messages) > self.k:
            self.messages = self.messages[-self.k:]
    
    def get_messages(self) -> list[BaseMessage]:
        return self.messages

# Create conversation memory
memory = BufferWindowMessageHistory(k=5)

# Enhanced prompt that enforces memory usage and autonomous tool decisions
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant with access to a knowledge base about Notre Dame University.

CRITICAL REQUIREMENTS:
1. ALWAYS review the conversation history and use it to inform your responses
2. ALWAYS provide a complete, non-empty response to the user
3. You may choose to use the knowledge_base tool if the user asks about Notre Dame University
4. If you use the knowledge_base tool, incorporate the retrieved information into your response
5. If you don't use the knowledge_base tool, respond normally using your knowledge and conversation history
6. Remember details from previous conversations and reference them when relevant

Your response must NEVER be empty. Always provide meaningful, complete answers."""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# Create tools list (only knowledge_base, no final_answer tool)
tools = [knowledge_base]
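
The `BufferWindowMessageHistory` class above keeps only the last `k` messages by slicing the list on every insert. As a minimal sketch of the same windowing behavior, the version below uses `collections.deque` with `maxlen`, and plain strings instead of LangChain message objects. `WindowHistory` is a hypothetical name for illustration.

```python
from collections import deque


class WindowHistory:
    """Keep only the most recent k messages (plain strings here for brevity)."""

    def __init__(self, k=5):
        self.messages = deque(maxlen=k)

    def add_message(self, message):
        # When the deque is full, the oldest entry is dropped automatically.
        self.messages.append(message)

    def get_messages(self):
        return list(self.messages)


history = WindowHistory(k=5)
for i in range(1, 8):
    history.add_message(f"message {i}")
print(history.get_messages())
```

Since each turn adds two messages (one human, one AI), an odd `k` such as `5` can leave the window starting on an AI reply whose question has been dropped. An even `k` keeps the pairs together, if that matters for your use case.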




Now we can initialize the agent like so:

In [18]:
# Create the agent
agent: RunnableSerializable = (
        {
            "input": lambda x: x["input"],
            "chat_history": lambda x: x["chat_history"],
            "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
        }
        | prompt
        | llm.bind_tools(tools, tool_choice="auto")  # Let LLM decide when to use tools
    )

# Enhanced AgentExecutor with explicit memory management
class AgentExecutor:
    def __init__(self, agent, tools, memory, max_iterations: int = 3):
        self.agent = agent
        self.tools = tools
        self.memory = memory
        self.max_iterations = max_iterations
        self.name2tool = {tool.name: tool.func for tool in tools}
    
    def invoke(self, input_text: str) -> dict:
        """Execute the agent with conversation memory"""
        count = 0
        agent_scratchpad = []
        final_response = ""
        tools_used = []  # Track which tools were used

        chat_history_used = list(self.memory.get_messages()) 
        
        while count < self.max_iterations:
            # Get current chat history
            chat_history = self.memory.get_messages()
            
            # Invoke the agent
            response = self.agent.invoke({
                "input": input_text,
                "chat_history": chat_history,
                "agent_scratchpad": agent_scratchpad
            })
            
            # Check if the response contains tool calls
            if response.tool_calls:
                # Add the tool call to scratchpad
                agent_scratchpad.append(response)
                
                # Execute each tool call
                for tool_call in response.tool_calls:
                    tool_name = tool_call["name"]
                    tool_args = tool_call["args"]
                    tool_call_id = tool_call["id"]
                    
                    # Execute the tool
                    tool_output = self.name2tool[tool_name](**tool_args)
                    
                    # Track tool usage
                    tools_used.append({
                        "tool_name": tool_name,
                        "tool_args": tool_args,
                        "step": count + 1
                    })
                    
                    # Create tool message
                    tool_message = ToolMessage(
                        content=str(tool_output),
                        tool_call_id=tool_call_id
                    )
                    agent_scratchpad.append(tool_message)
                    
                    print(f"Step {count + 1}: Used {tool_name}({tool_args})")
                    print(f"\nTool result: {tool_output}\n")
                
                count += 1
                
                # Continue to get the final response after tool usage
                continue
            else:
                # No tool calls, this is the final response
                final_response = response.content
                break
        
        # If we've exhausted iterations and still have no final response, 
        # make one more call to get the final answer
        if not final_response:
            chat_history = self.memory.get_messages()
            response = self.agent.invoke({
                "input": input_text,
                "chat_history": chat_history,
                "agent_scratchpad": agent_scratchpad
            })
            final_response = response.content
        
        # Ensure response is not empty
        if not final_response or final_response.strip() == "":
            final_response = "I apologize, but I wasn't able to generate a proper response. Please try asking your question again."

        # Add messages to memory
        self.memory.add_message(HumanMessage(content=input_text))
        self.memory.add_message(AIMessage(content=final_response))
        
        return {
            "input": input_text,
            "output": final_response,
            "tools_used": tools_used,  # Include which tools were used
            "chat_history_used": chat_history_used,
            "chat_history_final": self.memory.get_messages()
        }

# Create the executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    max_iterations=3
)
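
The control flow of the executor can be easier to follow without the LLM involved. The sketch below reproduces the loop (call the model, run any requested tools, feed the results back, stop when there are no tool calls) using a scripted stand-in for the model. All names here (`run_agent`, `fake_model`, `toy_tools`) are hypothetical, and unlike the class above this simplified version returns a fixed message instead of making one extra model call when the iteration limit is reached.

```python
def run_agent(model, tools, user_input, max_iterations=3):
    """Call the model until it answers without requesting a tool."""
    scratchpad = []
    for _ in range(max_iterations):
        response = model(user_input, scratchpad)
        if not response["tool_calls"]:
            # No tool requested: this is the final answer.
            return response["content"], scratchpad
        scratchpad.append(response)
        for call in response["tool_calls"]:
            # Dispatch by tool name, passing the arguments as keywords.
            output = tools[call["name"]](**call["args"])
            scratchpad.append({"tool_output": output, "id": call["id"]})
    return "No final answer within the iteration limit.", scratchpad


def fake_model(user_input, scratchpad):
    """Hypothetical stand-in for the LLM: request one lookup, then answer."""
    if not scratchpad:
        return {"content": "", "tool_calls": [
            {"name": "lookup", "args": {"query": user_input}, "id": "call-1"}
        ]}
    return {"content": f"Answer based on: {scratchpad[-1]['tool_output']}",
            "tool_calls": []}


toy_tools = {"lookup": lambda query: f"notes about {query!r}"}
answer, steps = run_agent(fake_model, toy_tools, "opening hours")
print(answer)
```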

### Using the Conversational Agent

#### 1. Small Talk

Let's try the agent with a simple statement that shouldn't trigger the `knowledge_base` tool and which won't draw on any memory, as it's the first message sent to this agent.

In [19]:
result1 = agent_executor.invoke("Hi, my name is Sarah and I'm a prospective student.")


print('\nANSWER:', result1['output'], '\n')


ANSWER: Hi Sarah! It's great to meet you. As a prospective student, do you have any specific questions about Notre Dame University or anything in particular you're interested in learning about? 



Let's see what was used in formulating the response.

In [20]:
print('Tools used:', result1['tools_used'])
print('Chat History Used:', result1['chat_history_used'])
print('Chat History Final:', result1['chat_history_final'])

Tools used: []
Chat History Used: []
Chat History Final: [HumanMessage(content="Hi, my name is Sarah and I'm a prospective student.", additional_kwargs={}, response_metadata={}), AIMessage(content="Hi Sarah! It's great to meet you. As a prospective student, do you have any specific questions about Notre Dame University or anything in particular you're interested in learning about?", additional_kwargs={}, response_metadata={})]


As we can see, no tool was used and no chat history was available when formulating the response.

#### 2. Knowledge Base Specific Query

Now let's ask it something that should trigger the `knowledge_base`.

In [21]:
result2 = agent_executor.invoke("What programs does Notre Dame offer?")
print('\nANSWER:', result2['output'], '\n')

Step 1: Used knowledge_base({'query': 'programs offered at Notre Dame University'})

Tool result: Based on the knowledge base: Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States  and as a major global university. The undergraduate component of the university is organized into four colleges (Arts and Letters, Science, Engineering, Business) and the Architecture School. The latter is known for teaching New Classical Architecture and for awarding the globally renowned annual Driehaus Architecture Prize. Notre Dame's graduate program has more than 50 master's, doctoral and professional degree programs offered by the five schools, with the addition of the Notre Dame Law School and a MD-PhD program offered in combination with IU medical School. It maintains a system of libraries, cultural venues, artistic and scientific museums, including Hesburgh Li

In [22]:
print('Tools used:', result2['tools_used'])
print('Chat History Used:', result2['chat_history_used'])
print('Chat History Final:', result2['chat_history_final'])

Tools used: [{'tool_name': 'knowledge_base', 'tool_args': {'query': 'programs offered at Notre Dame University'}, 'step': 1}]
Chat History Used: [HumanMessage(content="Hi, my name is Sarah and I'm a prospective student.", additional_kwargs={}, response_metadata={}), AIMessage(content="Hi Sarah! It's great to meet you. As a prospective student, do you have any specific questions about Notre Dame University or anything in particular you're interested in learning about?", additional_kwargs={}, response_metadata={})]
Chat History Final: [HumanMessage(content="Hi, my name is Sarah and I'm a prospective student.", additional_kwargs={}, response_metadata={}), AIMessage(content="Hi Sarah! It's great to meet you. As a prospective student, do you have any specific questions about Notre Dame University or anything in particular you're interested in learning about?", additional_kwargs={}, response_metadata={}), HumanMessage(content='What programs does Notre Dame offer?', additional_kwargs={}, resp

This is exactly what we needed: the previous interaction was passed in as chat history, giving the agent context, and the `knowledge_base` tool was used in formulating the response.

#### 3. Testing Memory

Let's also double check that the memory is actually used by the agent.

In [23]:
result3 = agent_executor.invoke("I sent you two messages previously, what were they?")
print('\nANSWER:', result3['output'], '\n')


ANSWER: In your previous messages, you introduced yourself as Sarah, a prospective student, and then you asked about the programs offered at Notre Dame University. If you have any more questions or need further information, feel free to ask! 



Excellent, it works as expected!

#### 4. Ensuring that the Agent is Actually Using the `knowledge_base` Tool for Retrieval and Context

Let's ask the agent to quote the knowledge base, to ensure it's using it (rather than simply drawing upon the LLM's own knowledge of Notre Dame University).

In [24]:
query = """
There is a specific sentence in the knowledge_base tool that starts with: 
'In 1919 Father James Burns became president of Notre Dame, and in three years he produced...' 
Retrieve the full sentence from the knowledge_base tool, then quote the full sentence back to me.
"""

result4 = agent_executor.invoke(query)

Step 1: Used knowledge_base({'query': 'In 1919 Father James Burns became president of Notre Dame, and in three years he produced'})

Tool result: Based on the knowledge base: In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent pre

Remember, the quote we want is:

> In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis.

In [25]:
print('\nANSWER:', result4['output'], '\n')


ANSWER: The full sentence you requested from the knowledge base is: 

"In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis." 

If you have any more questions or need further information, feel free to ask! 



Amazing, this confirms that the agent is actually using the `knowledge_base` retrieval, rather than simply drawing upon its own trained knowledge.
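
Rather than comparing the two passages by eye, the check can be automated. The sketch below tests whether the expected sentence appears in an answer, ignoring whitespace differences. `contains_quote` is a hypothetical helper, and `sample_answer` is constructed here for illustration rather than taken from a live run.

```python
def contains_quote(answer, quote):
    """True if quote appears in answer, ignoring differences in whitespace."""
    def squash(text):
        return " ".join(text.split())
    return squash(quote) in squash(answer)


expected = (
    "In 1919 Father James Burns became president of Notre Dame, and in three "
    "years he produced an academic revolution that brought the school up to "
    "national standards by adopting the elective system and moving away from "
    "the university's traditional scholastic and classical emphasis."
)

# Stand-in for the agent's reply, built from the expected sentence.
sample_answer = (
    'The full sentence you requested from the knowledge base is:\n\n"'
    + expected + '"'
)
print(contains_quote(sample_answer, expected))
```

In the notebook, the same helper could be applied as `contains_quote(result4['output'], expected)`.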

## Concluding Remarks

We've successfully built an agent that:
- Will choose for itself whether the `knowledge_base` tool is relevant to the question.
    - If it is deemed relevant, semantic similarity search is used to extract text from the vector DB that should be relevant to the user's input. This text is then used by the LLM as additional context when formulating a response.
    - If it is not deemed relevant, then the `knowledge_base` tool is not used, and the LLM formulates a response as normal.
- In all cases, the chat history is used as additional context when generating a response. 

See notebook: `learn/generation/langchain/handbook/03-langchain-conversational-memory.ipynb` for other ways in which the memory could have been implemented.

That's all for this example of building a retrieval augmented conversational agent with OpenAI and Pinecone (the OP stack) and LangChain.

Once finished, we delete the Pinecone index to save resources:

In [26]:
pc.delete_index(index_name)

---