## Agentic RAG System with Smolagents and Chroma DB
## Overview
This notebook builds an agentic Retrieval-Augmented Generation (RAG) system with smolagents and Chroma DB. The system gives an agent searchable access to a custom knowledge base tailored to a specialized task. The example uses a finance question-answering dataset, but the same framework can be adapted to any domain that needs retrieval over its own documents.

## Steps
- **Data Preprocessing**: We split the documents into smaller chunks and create embeddings, which are dense vector representations of text that capture semantic meaning and relationships.
- **Vector Store**: We use Chroma DB as the vector store, which holds the custom knowledge base in a vectorized format optimized for similarity search.
- **Agentic Search**: We build an AI agent that calls a semantic search tool over the vector store and composes its answer from the retrieved passages.


Together, these steps combine vector storage with agent-driven retrieval into a pipeline that can be adapted to other knowledge bases.
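
As a rough illustration of the retrieval idea behind these steps, the sketch below ranks text chunks against a query by cosine similarity. It is a toy under stated assumptions: `embed`, `cosine`, and `retrieve` are hypothetical helpers, and word counts stand in for the dense embeddings a real model produces, so it only matches on shared words rather than on meaning.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "credit migration tracks changes in credit ratings over time",
    "commodities are exposed to price volatility",
    "bond issuers can see their credit ratings change",
]
top = retrieve("credit ratings", chunks, k=2)
for chunk in top:
    print(chunk)
```

A vector store such as Chroma plays the role of `retrieve` here, with precomputed embeddings and an index so that the whole collection does not have to be rescanned for every query.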

In [1]:
import datasets

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document

In [3]:
knowledge_base = datasets.load_dataset("4DR1455/finance_questions", split="train")

In [4]:
knowledge_small = knowledge_base.select(range(500))

In [5]:
knowledge_small

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 500
})

In [6]:
source_docs = [
    Document(page_content=doc["output"], metadata={"source": "finance_questions"})
    for doc in knowledge_small
]


In [7]:
# Split the documents into smaller chunks for more efficient search
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
new_docs = text_splitter.split_documents(documents=source_docs)
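
To make `chunk_size`, `chunk_overlap`, and the recorded start index more concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on the listed separators, while this hypothetical `sliding_chunks` helper cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return (start_index, chunk) pairs from a fixed-size sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for start, chunk in pieces:
    print(start, chunk)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.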

In [8]:
len(new_docs)

4549

In [9]:
# BGE embeddings

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, encode_kwargs=encode_kwargs
)
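
The `normalize_embeddings` option scales each embedding to unit length. The sketch below, using made-up two-dimensional vectors rather than real BGE output, shows why that is convenient: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance is a simple function of it, so all three measures rank neighbors the same way.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Illustrative vectors only; real embeddings have hundreds of dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))                      # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))               # same value from a plain dot product
print(squared_distance(a_unit, b_unit))  # equals 2 - 2 * dot for unit vectors
```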


In [10]:
import chromadb

In [11]:
db = Chroma.from_documents(new_docs, embeddings)

In [12]:
retriever = db.as_retriever(search_kwargs={"k": 3})
retriever.invoke('credit migration')

[Document(metadata={'source': 'finance_questions', 'start_index': 0}, page_content="Credit migration strategy is a type of credit risk management strategy that focuses on the changes in the credit quality of individual bonds or bond issuers over time. This strategy is based on the principle that the credit quality of bonds and bond issuers can change, or migrate, over time due to various factors such as changes in the financial health of the issuer, changes in macroeconomic conditions, or changes in the issuer's industry."),
 Document(metadata={'source': 'finance_questions', 'start_index': 2440}, page_content='In conclusion, credit migration strategy is a dynamic approach to credit risk management that can help portfolio managers manage interest rate risk. However, it requires a deep understanding of credit markets and the factors that influence credit ratings, as well as the ability to accurately anticipate changes in interest rates.'),
 Document(metadata={'source': 'finance_questions

### Create an Agent that Can Use Tools

In [13]:
from smolagents import Tool

class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve the parts of the finance knowledge base that could be most relevant to answer your query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.retriever = db.as_retriever(search_kwargs={"k": 4})

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.retriever.invoke(
            query,
        )
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

retriever_tool = RetrieverTool()
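
To see what the agent receives from the tool, the sketch below reproduces the string layout built in `forward` without needing the vector store. `FakeDoc` and `format_docs` are hypothetical stand-ins introduced only for this illustration; the real tool formats LangChain `Document` objects returned by the retriever.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document; only page_content is used."""
    page_content: str

def format_docs(docs):
    """Mirror the string layout produced by RetrieverTool.forward."""
    return "\nRetrieved documents:\n" + "".join(
        f"\n\n===== Document {i} =====\n" + doc.page_content
        for i, doc in enumerate(docs)
    )

formatted = format_docs([FakeDoc("first chunk"), FakeDoc("second chunk")])
print(formatted)
```

Returning one plain string with numbered separators keeps the tool output easy for the model to read and quote from, at the cost of dropping the metadata attached to each document.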

In [14]:
import os

In [15]:
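from smolagents import HfApiModel, CodeAgent

# Initialize the agent with the retriever tool.
# HF_TOKEN must be set in the environment; max_steps caps the number of agent steps.
agent = CodeAgent(
    tools=[retriever_tool],
    model=HfApiModel(token=os.environ["HF_TOKEN"]),
    max_steps=4,
)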

Conveniently, smolagents provides a detailed log of each agent run, including the execution steps and the retrieved documents.

In [16]:
agent_output = agent.run("What is credit migration?")

print("Final output:")
print(agent_output)

Final output:
Credit migration refers to the change in the credit quality or credit rating of a borrower, bond issuer, or financial instrument over time. This migration can occur due to various factors, including changes in the financial health of the issuer, shifts in macroeconomic conditions, or changes in the issuer's industry. In the context of credit risk management, credit migration is an important concept because it helps assess potential changes in the creditworthiness of entities and the associated risks, allowing stakeholders to adjust their strategies and portfolios accordingly.


In [17]:
agent.run("Tell me about investing in commodities")

'Investing in commodities carries various risks, including price volatility and market risk, which investors should carefully assess and manage. Price volatility is due to factors such as supply and demand imbalances, geopolitical events, weather conditions, and economic indicators. Market risk is influenced by global market dynamics, including changes in interest rates, inflation rates, currency exchange rates, and overall market sentiment. Understanding these factors and diversifying investments can help mitigate these risks and potentially improve investment outcomes.'