### Disclaimer
This notebook is adapted from https://www.ibm.com/think/tutorials/agentic-rag to run locally with Ollama.

## Step 1. Set up Ollama & ChromaDB

In [None]:
# Uncomment to install the packages imported below.
#!pip install chromadb langgraph langchain langchain_chroma langchain_ollama langchain_community

## Step 2. Import the necessary libraries

In [2]:
import os

from dotenv import load_dotenv

from langchain_ollama import ChatOllama
from langchain_chroma import Chroma

from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, PromptTemplate

from langchain.tools import tool
from langchain.tools.render import render_text_description_and_args
from langchain.agents.output_parsers import JSONAgentOutputParser, ReActJsonSingleInputOutputParser
from langchain.agents.format_scratchpad import format_log_to_str, format_log_to_messages
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnablePassthrough

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Step 3. Set up the LLM

In [3]:
# Temperature controls the randomness and creativity of the generated text.
# temperature=0.0 keeps the output as deterministic as possible.
model_name =  "llama3.1:8b" #"qwen2.5:7b"

llm = ChatOllama(temperature=0.0, model=model_name)
llm

ChatOllama(model='llama3.1:8b', temperature=0.0)
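
To see why a low temperature reduces randomness, here is a minimal pure-Python sketch of temperature-scaled softmax. It is only an illustration: the logits are made-up values, and treating a temperature of 0 as a greedy pick of the top score is an assumption about the limiting behavior, not a statement about how Ollama implements sampling.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding (all mass on the top score).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature drops, more of the probability mass moves to the highest-scoring token, so repeated runs are more likely to produce the same text.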

## Step 4. Set up the prompt template

In [4]:
template = "Answer the {query} accurately. If you do not know the answer, simply say you do not know."

prompt = PromptTemplate.from_template(template)

In [5]:
agent = prompt | llm

In [6]:
agent.invoke({"query": 'What sport is played at the US Open?'})

AIMessage(content='The sport played at the US Open is tennis.', additional_kwargs={}, response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-07-30T13:48:33.8306724Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1918515200, 'load_duration': 190334600, 'prompt_eval_count': 38, 'prompt_eval_duration': 417464500, 'eval_count': 11, 'eval_duration': 1306014300, 'model_name': 'llama3.1:8b'}, id='run--2c1ee129-d7d4-4597-adbc-5e7c7a50675b-0', usage_metadata={'input_tokens': 38, 'output_tokens': 11, 'total_tokens': 49})

The agent responded to the basic query with the correct answer. In the next step of this tutorial, we will create a RAG tool that gives the agent access to relevant information about IBM's involvement in the 2024 US Open. Traditional LLMs cannot obtain current information on their own. Let's verify this.

In [7]:
agent.invoke({"query": 'Where was the 2024 US Open Tennis Championship?'})

AIMessage(content="I don't have information about future events, but I can suggest checking official sources for updates on the location of the 2024 US Open Tennis Championship.", additional_kwargs={}, response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-07-30T13:48:38.9163311Z', 'done': True, 'done_reason': 'stop', 'total_duration': 5058050700, 'load_duration': 208389300, 'prompt_eval_count': 40, 'prompt_eval_duration': 207248100, 'eval_count': 32, 'eval_duration': 4637390400, 'model_name': 'llama3.1:8b'}, id='run--364b58b0-32f5-41b1-bf3a-b1eb254c1d39-0', usage_metadata={'input_tokens': 40, 'output_tokens': 32, 'total_tokens': 72})

Evidently, the LLM is unable to provide us with the relevant information. The model's training data predates the 2024 US Open, and without the appropriate tools the agent has no access to this information.

## Step 5. Establish the knowledge base and retriever

The first step in creating the knowledge base is listing the URLs we will extract content from. In this case, the data source is IBM's online content summarizing its involvement in the 2024 US Open. The relevant URLs are collected in the urls list.

In [8]:
urls = [
    'https://www.ibm.com/case-studies/us-open',
    'https://www.ibm.com/sports/usopen',
    'https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement',
    'https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms',
]

Next, load the documents from the listed URLs using LangChain's WebBaseLoader. To see how a sample document loaded, uncomment the last line of the cell.

In [9]:
docs = [WebBaseLoader(url).load() for url in urls]

docs_list = [item for sublist in docs for item in sublist]

#docs_list[0]

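Next, we initialize a text splitter and apply it to docs_list. It breaks each document into chunks of up to 1,000 characters, with up to 200 characters of overlap between neighboring chunks.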

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

doc_splits = text_splitter.split_documents(docs_list)
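
To make the interaction between chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and lines); the function name chunk_text and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.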

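Let's initialize the embeddings as well. Here the same Ollama model that serves as the LLM is reused to embed the chunks; a dedicated embedding model may retrieve more relevant passages.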

In [11]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model=model_name)

In order to store our embedded documents, we will use Chroma DB, an open source vector store.

In [12]:
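# First run: uncomment this block to embed the chunks and persist the collection to disk.
# vector_store = Chroma.from_documents(
#     documents=doc_splits,
#     collection_name="agentic-rag-chroma",
#     embedding=embeddings,
#     persist_directory="./chroma_langchain_db",
# )

# Later runs: reopen the collection already persisted in ./chroma_langchain_db.
vector_store = Chroma(
    collection_name="agentic-rag-chroma",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",
)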

To access information in the vector store, we must set up a retriever.

In [13]:
retriever = vector_store.as_retriever()

In [14]:
vector_store.similarity_search(
    query='Where was the 2024 US Open Tennis Championship?',
    k=2
)

[Document(id='3bf1e951-aab8-4faa-a03e-8271488a7015', metadata={'title': 'US Open: IBM and AI Can Help Engage Tennis Fans ', 'language': 'en-us', 'description': '', 'source': 'https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement'}, page_content='The new, immersive fan content seems likely to continue evolving in the future. Even in a regular year, USTA’s Corio said, 10 million people use the site during the U.S. Open. With the right engagement, she said, that number can grow. “If we see that any of the technology really has traction, that will provide the fuel we need.”'),
 Document(id='e7a27a93-d4eb-4624-be38-5365af86eed4', metadata={'description': '', 'title': 'US Open: IBM and AI Can Help Engage Tennis Fans ', 'language': 'en-us', 'source': 'https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement'}, page_content='Match Insights used Watson Natural Language Processing to pore over more than a hundred thousand written articles to glean the most relevant facts and insights.')]
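
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below uses cosine similarity on tiny hand-made vectors. The metric Chroma actually applies depends on how the collection is configured, and the document IDs and vectors here are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the IDs of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical 3-dimensional embeddings for three chunks
doc_vecs = {
    "venue": [0.9, 0.1, 0.0],
    "ai_features": [0.1, 0.9, 0.2],
    "fan_stats": [0.2, 0.3, 0.9],
}
query_vec = [1.0, 0.2, 0.1]
print(top_k(query_vec, doc_vecs, k=2))
# ['venue', 'fan_stats']
```

Retrieval quality therefore depends heavily on the embedding model: if related texts do not land near each other in vector space, the top results will be off-topic.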

## Step 6. Define the agent's RAG tool
Let's define the retrieve() tool our agent will be using. Its only parameter is the user query. The docstring serves as the tool description, which tells the agent what the tool does and when to call it. The agentic RAG system can use this tool to route a user query to the vector store when it pertains to IBM’s involvement in the 2024 US Open.

In [15]:
from langchain_core.tools import tool

@tool
def retrieve(query: str):
    """Retrieve information related to a query."""
    # The docstring above becomes the tool description shown to the agent.
    retrieved_docs = vector_store.similarity_search(query, k=10)

    return retrieved_docs

tools = [retrieve]

## Step 7. Establish the prompt template

Next, we will set up a new prompt template for the agent. This template is more complex than the one in Step 4. Rather than writing it by hand, we pull the ReAct prompt hwchase17/react from the LangChain Hub. It is designed for agents that have one or more tools available. In our case, the only tool is the one defined in Step 6.

The prompt instructs the agent to write out its "thought process" as repeated Thought, Action, Action Input and Observation steps, followed by a Final Answer. This gives us insight into the agent's tool calls.

In [17]:
from langchain import hub

prompt = hub.pull("hwchase17/react")

print(prompt.template)



Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
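
To illustrate how text in this format can be consumed, here is a minimal sketch of a parser that separates a tool call from a final answer. It is a hypothetical helper written for this notebook, not LangChain's own output parser, and it only handles the well-formed cases shown.

```python
import re

def parse_react_output(text):
    """Return ("final", answer) or ("action", tool_name, tool_input)."""
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return ("final", final.group(1).strip())
    action = re.search(r"Action:\s*(.*?)\s*\n\s*Action Input:\s*(.*)", text, re.DOTALL)
    if action:
        return ("action", action.group(1).strip(), action.group(2).strip())
    raise ValueError("Output does not follow the ReAct format")

step = "I should look this up.\nAction: retrieve\nAction Input: 2024 US Open location"
done = "I now know the final answer\nFinal Answer: The sport played at the US Open is tennis."

print(parse_react_output(step))
print(parse_react_output(done))
```

An agent loop would run the named tool for an "action" result, append the observation to the scratchpad, and call the model again until it receives a "final" result.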


The template ends with two placeholders: {input} carries the user question, and {agent_scratchpad} carries the intermediate steps taken by the agent.

Now, let's finalize our prompt template by adding the tool names, descriptions and arguments using a partial prompt template. This gives the agent the information pertaining to each tool, including its intended use. It also means we can add and remove tools without altering the entire prompt template.

In [18]:
agent_prompt = prompt.partial(
    tools=render_text_description_and_args(list(tools)),
    tool_names=", ".join([t.name for t in tools]),
)
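
The idea behind a partial prompt can be sketched with plain string formatting: fill in the placeholders that are known up front and leave the rest for later. The helper partial_format and the shortened template below are hypothetical and only mimic the behavior; they are not how LangChain implements partial().

```python
class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""
    def __missing__(self, key):
        return "{" + key + "}"

def partial_format(template, **known):
    """Fill only the placeholders given in known; keep the others as-is."""
    return template.format_map(_KeepMissing(known))

template = (
    "Tools:\n{tools}\n\n"
    "Action: one of [{tool_names}]\n\n"
    "Question: {input}\nThought:{agent_scratchpad}"
)

# Step 1: bind the tool information once.
partial = partial_format(
    template,
    tools="retrieve: Retrieve information related to a query.",
    tool_names="retrieve",
)

# Step 2: fill in the per-request values later.
final = partial.format(input="What sport is played at the US Open?", agent_scratchpad="")
print(final)
```

Because the tool text is bound separately from the question, changing the tool list only affects the first step.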

## Step 8. Set up the agent's memory and chain
An important feature of AI agents is their memory. Agents are able to store past conversations and past findings in their memory to improve the accuracy and relevance of their responses going forward. In our case, we will use LangGraph's MemorySaver, an in-memory checkpointer that stores the graph state for each thread_id.

And now we can wire the pieces together as a LangGraph StateGraph with two nodes. The search node calls the retrieve tool and stores the results as context. The generate node fills the prompt with the question and the retrieved context, then calls the LLM. Note that this is a fixed retrieve-then-generate pipeline: the retrieved text is placed in the agent_scratchpad slot, and the model does not itself decide whether to call the tool.

In [19]:
from langchain_core.documents import Document
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, END, StateGraph
from typing_extensions import List, TypedDict


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def search(state: State):
    """Retrieve documents related to the question."""
    print(state["question"])

    # Call the tool through .invoke(); calling it like a plain function is deprecated.
    result = retrieve.invoke({"query": state["question"]})

    print(f"Retrieved {len(result)} records")

    return {"context": result}

def generate(state: State):
    """Answer the question from the retrieved context."""
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    # The retrieved text is passed in the agent_scratchpad slot of the ReAct prompt.
    messages = agent_prompt.invoke({
        "input": state["question"],
        "agent_scratchpad": docs_content
    })
    response = llm.invoke(messages)

    return {"answer": response.content}

# Build the graph: START -> search -> generate -> END
graph_builder = StateGraph(State).add_sequence([search, generate])
graph_builder.add_edge(START, "search")
graph_builder.add_edge("generate", END)

# Keep the graph state in memory, keyed by thread_id
memory = MemorySaver()
graph = graph_builder.compile(checkpointer=memory)
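
The data flow through such a two-node sequence can be sketched without LangGraph: each node receives the current state and returns a partial update that is merged back in. The runner and the toy nodes below are hypothetical stand-ins (a keyword match instead of a vector search, string formatting instead of an LLM), and they leave out checkpointing and everything else LangGraph provides.

```python
def run_sequence(nodes, initial_state):
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state

def toy_search(state):
    # Stand-in for the vector store: keep snippets that share a word with the question.
    corpus = ["tennis is played at the US Open", "chunks overlap by 200 characters"]
    words = set(state["question"].lower().split())
    return {"context": [doc for doc in corpus if words & set(doc.lower().split())]}

def toy_generate(state):
    # Stand-in for the LLM call: report what was retrieved.
    joined = " | ".join(state["context"])
    return {"answer": f"Based on {len(state['context'])} snippet(s): {joined}"}

result = run_sequence(
    [toy_search, toy_generate],
    {"question": "What sport is played at the US Open?"},
)
print(result["answer"])
# Based on 1 snippet(s): tennis is played at the US Open
```

Keeping each node's output as a small dictionary update is what lets later nodes read earlier results (here, generate reads the context written by search).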

## Step 9. Generate responses with the agentic RAG system
We are now able to ask the agent questions. Recall the agent's previous inability to provide us with information pertaining to the 2024 US Open. Now that the agent has its RAG tool available to use, let's try asking the same questions again.

In [None]:
# Specify an ID for the thread
config = {"configurable": {"thread_id": "local"}}

query='Where was the 2024 US Open Tennis Championship?'

response = graph.invoke({"question": query}, config=config)
print(response["answer"])

Where was the 2024 US Open Tennis Championship?




Retrieved 10 records
