# Lab | Agent & Vector store

**Replace the State of the Union dataset with a different one and replicate this lab, updating the prompts accordingly.**

One option is the [sonnets.txt](https://github.com/martin-gorner/tensorflow-rnn-shakespeare/blob/master/shakespeare/sonnets.txt) dataset, or any other data of your choice from the same repository.

# Combine agents and vector stores

This notebook covers how to combine agents and vector stores. The use case is that you have ingested your data into a vector store and want to interact with it in an agentic manner.

The recommended method is to create a `RetrievalQA` chain and then use it as a tool in the overall agent. You can do this with several vector DBs and use the agent to route between them. There are two ways of doing this: let the agent use the vector stores as normal tools, or set `return_direct=True` so that the agent acts purely as a router.
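To make the "retrieve, then stuff into a prompt" idea concrete, here is a minimal standard-library sketch. It is not LangChain's implementation: the helper names (`tokenize`, `overlap_score`, `retrieve`, `build_stuff_prompt`) are hypothetical, and a toy word-overlap score stands in for embedding similarity.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query tokens that also occur in the chunk.

    Assumption: this stands in for the embedding similarity a vector store computes.
    """
    q = Counter(tokenize(query))
    c = Counter(tokenize(chunk))
    return sum(min(q[w], c[w]) for w in q)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest overlap score."""
    ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
    return ranked[:k]


def build_stuff_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """'Stuff' the retrieved chunks into a single prompt for the LLM."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "Ruff is a Python linter written in Rust.",
    "Chroma is a vector store.",
    "Sonnets are poems.",
]
print(build_stuff_prompt("What is ruff?", chunks, k=1))
```

The point of the sketch is that the whole text of every retrieved chunk ends up in the prompt, so the chunk size directly determines the prompt size.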

## Create the vector store

In [19]:
#!pip install chromadb

In [1]:
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')

In [3]:
import re

# Clean raw email text (strip headers, redactions, stray dashes, extra whitespace)
def clean_email_text(text):
    # Remove email headers (like Date:, From:, Subject:)
    text = re.sub(r"Date:.*|From:.*|Subject:.*|To:.*|Confidential.*", "", text)

    # Remove redacted sections
    text = re.sub(r"Redacted", "", text)

    # Replace every dash and its surrounding whitespace with one space
    # (note: this also splits hyphenated words such as "well-known")
    text = re.sub(r"\s*-\s*", " ", text)
    text = re.sub(r"\s+", " ", text)  # Collapses whitespace runs, including newlines, to one space

    # Remove digits embedded inside a word, e.g. "exam3ple" -> "example" (optional, depending on the context)
    text = re.sub(r"\b([a-zA-Z]+)(\d+)([a-zA-Z]+)\b", r"\1\3", text)

    # Lowercase and strip
    text = text.lower().strip()

    return text

# Function to open a file, clean it, and save it
def clean_file(input_file, output_file):
    # Open the input file and read the content
    with open(input_file, 'r', encoding='utf-8') as file:
        text = file.read()

    # Clean the text using clean_email_text function
    cleaned_text = clean_email_text(text)

    # Save the cleaned text into the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(cleaned_text)

# Example usage
input_file_path = 'API.txt'
output_file_path = 'cleaned_api.txt'

clean_file(input_file_path, output_file_path)
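One thing to keep in mind: `re.sub(r"\s+", " ", text)` also removes every newline, so a cleaned file is a single line, while the splitter used later in this notebook separates on `"\n"`. If the cleaned file is the one you intend to index, a variant that keeps line breaks may be preferable. The helper below is a hypothetical sketch, not part of the original lab.

```python
import re


def collapse_spaces_keep_newlines(text: str) -> str:
    """Normalize whitespace but keep one newline between lines (hypothetical helper)."""
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)  # drop spaces that touch a newline
    text = re.sub(r"\n{2,}", "\n", text)  # collapse blank lines into one newline
    return text.strip()


sample = "a  b\n\n c\t d "
print(repr(collapse_spaces_keep_newlines(sample)))
```

With line breaks preserved, a separator-based splitter still has boundaries to cut on.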

In [4]:
llm = OpenAI(temperature=0)

In [15]:
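from pathlib import Path

# Truncate the current working directory at a "langchain/docs/modules" ancestor
# if one exists; otherwise keep the full working directory.
relevant_parts = []
for p in Path(".").absolute().parts:
    relevant_parts.append(p)
    if relevant_parts[-3:] == ["langchain", "docs", "modules"]:
        break
# Note: this points at the raw API.txt, not the cleaned_api.txt written above.
doc_path = str(Path(*relevant_parts) / "API.txt")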

In [46]:
loader = TextLoader(doc_path)
documents = loader.load()
text_splitter = CharacterTextSplitter(separator="\n",
                                      chunk_size=1000, 
                                      chunk_overlap=10
                                      )
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings, collection_name="api_information")
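A separator-based splitter can only cut where the separator occurs. The standard-library sketch below mimics that behaviour in simplified form (no overlap handling; `split_on_separator` is a hypothetical name, not the LangChain API) to show that a text without any `"\n"` comes back as one oversized chunk regardless of `chunk_size`.

```python
def split_on_separator(text: str, separator: str = "\n", chunk_size: int = 1000) -> list[str]:
    """Split on `separator`, then merge neighbouring pieces up to `chunk_size` characters.

    Simplified sketch: a single piece longer than `chunk_size` is kept whole.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


with_newlines = "\n".join(["a" * 400] * 5)
without_newlines = "a " * 1000  # 2000 characters, no newline at all

print([len(c) for c in split_on_separator(with_newlines)])
print([len(c) for c in split_on_separator(without_newlines)])
```

It is worth checking `max(len(t.page_content) for t in texts)` after splitting. If the result is far above `chunk_size`, the separator probably does not occur often enough in the input.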

In [47]:
api_information = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)

In [48]:
from langchain_community.document_loaders import WebBaseLoader

In [49]:
loader = WebBaseLoader("https://beta.ruff.rs/docs/faq/")

In [50]:
docs = loader.load()
ruff_texts = text_splitter.split_documents(docs)
ruff_db = Chroma.from_documents(ruff_texts, embeddings, collection_name="ruff")
ruff = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=ruff_db.as_retriever()
)

## Create the Agent

In [51]:
# Import things that are needed generically
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_openai import OpenAI

In [52]:
tools = [
    Tool(
        name="Climate Desinformation",
        func=api_information.run,
        description="useful for when you need to answer questions about climate change disinformation actions by the American Petroleum Institute (API).",
    ),
    Tool(
        name="Ruff QA System",
        func=ruff.run,
        description="useful for when you need to answer questions about ruff (a python linter). Input should be a fully formed question.",
    ),
]

In [53]:
# Construct the agent. We will use the default agent type here.
# See documentation for a full list of options.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

In [54]:
agent.invoke(
    "what are the main points of the American Petroleum Institute (API) robust policy framework of industry and government actions"
)



> Entering new AgentExecutor chain...
The API robust policy framework is likely to be complex and have multiple points that need to be addressed.
Action: Climate Desinformation
Action Input: "What are the main points of the American Petroleum Institute (API) robust policy framework of industry and government actions?"
Observation: The main points of the API's policy framework include accelerating technology and innovation, further mitigating emissions from operations, endorsing a carbon price policy, advancing cleaner fuels, and driving climate reporting.
Thought: These points seem to focus on both reducing emissions and promoting innovation and technology.
Action: Climate Desinformation
Action Input: "What are the main points of the American Petroleum Institute (API) robust policy framework of industry and government actions?"
Observation: The main points of the API's policy framework include accelerating tec

{'input': 'what are the main points of the American Petroleum Institute (API) robust policy framework of industry and government actions',
 'output': 'The main points of the American Petroleum Institute (API) robust policy framework of industry and government actions include accelerating technology and innovation, further mitigating emissions from operations, endorsing a carbon price policy, advancing cleaner fuels, and driving climate reporting.'}

In [55]:
agent.invoke("Why use ruff over flake8?")



> Entering new AgentExecutor chain...
Ruff and flake8 are both python linters, so they likely have similar functions. However, there may be specific reasons why one is preferred over the other.
Action: Ruff QA System
Action Input: "Why use ruff over flake8?"
Observation: Ruff offers a larger rule set and re-implements popular Flake8 plugins natively. It also supports Python 3.7 onwards, including Python 3.13, while Flake8 only supports up to Python 3.9. Additionally, Ruff can be used as a formatter and linter independently, while Flake8 only offers linting capabilities.
Thought: This information suggests that Ruff may be a more comprehensive and versatile option compared to Flake8.
Action: Ruff QA System
Action Input: "What are the specific advantages of using Ruff?"
Observation: Ruff provides faster feedback on lint violations and can catch certain errors that a type checker would typically ignore. It is reco

{'input': 'Why use ruff over flake8?',
 'output': 'Based on the information gathered, it seems that Ruff may be a more comprehensive and efficient option compared to Flake8, but further research may be needed to fully understand any potential disadvantages.'}

## Use the Agent solely as a router

You can also set `return_direct=True` if you intend to use the agent as a router and want to return the result of the `RetrievalQA` chain directly.

Notice that in the above examples the agent did some extra work after querying the `RetrievalQA` chain. You can avoid that and return the result directly. In the cell below, only the Ruff QA System tool sets `return_direct=True`.
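The effect of the flag can be illustrated with a small standard-library sketch. This is not how LangChain implements its agent loop; `ToyTool` and `run_once` are hypothetical names, and `postprocess` stands in for the extra reasoning the agent would otherwise do on an observation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTool:
    """Hypothetical stand-in for a tool: a name, a callable, and a routing flag."""
    name: str
    func: Callable[[str], str]
    return_direct: bool = False


def run_once(tool: ToyTool, question: str, postprocess: Callable[[str], str]) -> str:
    """Call the tool; hand back its output untouched when return_direct is set."""
    observation = tool.func(question)
    if tool.return_direct:
        return observation           # router mode: the tool's answer is the final answer
    return postprocess(observation)  # normal mode: the agent keeps working on it


router_tool = ToyTool("qa", lambda q: "raw answer", return_direct=True)
normal_tool = ToyTool("qa", lambda q: "raw answer")

print(run_once(router_tool, "Why use ruff over flake8?", str.upper))
print(run_once(normal_tool, "Why use ruff over flake8?", str.upper))
```

In router mode the final answer is exactly what the retrieval chain produced, which saves LLM calls but also means the agent cannot combine results from several tools.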

In [56]:
tools = [
    Tool(
        name="Climate Desinformation",
        func=api_information.run,
        description="useful for when you need to answer questions about climate change disinformation actions by the American Petroleum Institute (API).",
    ),
    Tool(
        name="Ruff QA System",
        func=ruff.run,
        description="useful for when you need to answer questions about ruff (a python linter). Input should be a fully formed question.",
        return_direct=True,
    ),
]

In [57]:
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

In [58]:
agent.invoke(
    "What do you know about Amanda E. Eversole from API ?"
)



> Entering new AgentExecutor chain...
I should use the Climate Desinformation tool to find information about Amanda E. Eversole.
Action: Climate Desinformation
Action Input: Amanda E. Eversole
Observation: I don't know.
Thought: I should try using the Ruff QA System instead.
Action: Ruff QA System
Action Input: What do you know about Amanda E. Eversole from API?
Observation: I don't know anything about Amanda E. Eversole from API.

> Finished chain.


{'input': 'What do you know about Amanda E. Eversole from API ?',
 'output': " I don't know anything about Amanda E. Eversole from API."}

In [40]:
agent.invoke("Why use ruff over flake8?")



> Entering new AgentExecutor chain...
Ruff and flake8 are both python linters, so they have similar functions. However, there may be specific reasons why one is preferred over the other.
Action: Ruff QA System
Action Input: "Why use ruff over flake8?"
Observation: Ruff offers a larger rule set and can automatically fix its own lint violations, while also being able to be used as a formatter. It also supports linting for any Python version from 3.7 onwards, including Python 3.13. Additionally, Ruff does not require the installation of Rust, making it more accessible for users.

> Finished chain.


{'input': 'Why use ruff over flake8?',
 'output': ' Ruff offers a larger rule set and can automatically fix its own lint violations, while also being able to be used as a formatter. It also supports linting for any Python version from 3.7 onwards, including Python 3.13. Additionally, Ruff does not require the installation of Rust, making it more accessible for users.'}

## Multi-Hop vector store reasoning

Because vector stores are easily usable as tools in agents, it is easy to answer multi-hop questions that depend on vector stores using the existing agent framework.

In [59]:
tools = [
    Tool(
        name="Climate Desinformation",
        func=api_information.run,
        description="useful for when you need to answer questions about climate change disinformation actions by the American Petroleum Institute (API).",
    ),
    Tool(
        name="Ruff QA System",
        func=ruff.run,
        description="useful for when you need to answer questions about ruff (a python linter). Input should be a fully formed question, not referencing any obscure pronouns from the conversation before.",
    ),
]

In [60]:
# Construct the agent. We will use the default agent type here.
# See documentation for a full list of options.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

In [61]:
agent.invoke(
    "What tool does ruff use to run over Jupyter Notebooks? Did the APi mention it in their Emails?"
)



> Entering new AgentExecutor chain...
I should check if the APi mentioned anything about ruff in their emails.
Action: Climate Desinformation
Action Input: "ruff"

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens, however you requested 38860 tokens (38604 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
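The error says the prompt (38604 tokens) plus the completion (256 tokens) exceeds the model's 4097-token context. With the "stuff" chain type every retrieved chunk is pasted into the prompt, so a likely cause (an assumption, not verified here) is that the retrieved chunks are much larger than intended. The sketch below is a rough pre-flight check. The 4-characters-per-token ratio is only a heuristic, and `estimate_tokens` and `fits_context` are hypothetical helpers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate. Assumption: about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(
    chunks: list[str],
    question: str,
    context_limit: int = 4097,     # limit reported in the error message
    completion_tokens: int = 256,  # completion budget reported in the error message
) -> bool:
    """Check whether the question plus all chunks plausibly fit into the context."""
    prompt_tokens = estimate_tokens(question) + sum(estimate_tokens(c) for c in chunks)
    return prompt_tokens + completion_tokens <= context_limit


small_chunks = ["a" * 1000] * 4  # four chunks of about 1000 characters each
huge_chunk = ["a" * 160_000]     # one unsplit document

print(fits_context(small_chunks, "ruff"))
print(fits_context(huge_chunk, "ruff"))
```

If the check fails for the retrieved documents, options include fixing the splitting so chunks respect `chunk_size`, retrieving fewer documents per query, or using a model with a larger context window.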