# Retrieval-Augmented Generation (RAG) Implementation

### Author: ***Guilherme Oliveira***
### Contact: gmmoliveira1@gmail.com
### Date: August 16, 2025

#### Imports

In [None]:
from agno.agent import Agent
from agno.document.chunking.agentic import AgenticChunking
from agno.embedder.ollama import OllamaEmbedder
from agno.knowledge.pdf import PDFKnowledgeBase, PDFReader
from agno.models.ollama import Ollama
from agno.vectordb.pgvector import PgVector, SearchType
from ollama import AsyncClient
import yaml
import asyncio
from textwrap import dedent
import json


#### Constant Definitions

Defines the constants that control the overall behavior of the script.

In [2]:
KNOWLEDGE_BASE_PATH = "recursos/base_de_conhecimentos_PDFs/"
BASE_MODEL = "qwen3:32b"
DATABASE_CONFIG_PATH = "recursos/configs/database.yaml"
REQUIREMENTS_PATH = "recursos/requirements.txt"
OLLAMA_HOST = "http://localhost:54256"
QUESTIONS_PATH = "recursos/sample_questions.json"

Reads the configuration file and defines the constant that determines how to connect to the vector database.

In [3]:
with open(DATABASE_CONFIG_PATH, 'r') as file:
    database_config_aux = yaml.safe_load(file)
database_config = database_config_aux["database"]

# postgresql+psycopg://<username>:<password>@<host>:<port>/<database>
DATABASE_URL = f"postgresql+psycopg://{database_config['user']}:{database_config['password']}@{database_config['host']}:{database_config['port']}/{database_config['dbname']}"
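
The connection string above is interpolated directly from the YAML values, so a password containing reserved characters such as `@` or `/` would produce a malformed URL. A minimal sketch of a safer builder follows. The function name `build_database_url` and the sample values are illustrative assumptions; the config keys mirror those read from `database.yaml` above.

```python
from urllib.parse import quote_plus


def build_database_url(config: dict) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL from a config mapping.

    The user and password are percent-encoded so that reserved characters
    (for example '@', ':' or '/') cannot break the URL structure.
    """
    user = quote_plus(str(config["user"]))
    password = quote_plus(str(config["password"]))
    return (
        f"postgresql+psycopg://{user}:{password}"
        f"@{config['host']}:{config['port']}/{config['dbname']}"
    )


# Hypothetical values, shaped like the "database" section of database.yaml
example_config = {
    "user": "rag_user",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "dbname": "rag",
}
print(build_database_url(example_config))
```

For credentials made only of letters and digits, the result is identical to the f-string used in the cell above.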

#### Consolidating Python Requirements

Generates the `requirements.txt` file so that the results can be reproduced.

In [4]:
!pip freeze > $REQUIREMENTS_PATH

#### Downloading the Desired LLM

Downloads the chosen LLM using ollama.

In [5]:
!ollama pull $BASE_MODEL

In [None]:
# Instantiate the class that turns text into embeddings
embedder = OllamaEmbedder(
    dimensions=5120,  # adjust to match the chosen LLM
    id=BASE_MODEL,
)
# Create the PostgreSQL database with vector search enabled (pgvector)
database = PgVector(
    table_name="pdf_documents",
    db_url=DATABASE_URL,
    search_type=SearchType.hybrid,
    embedder=embedder,
)
# Instantiate the PDF reader that will ingest the article
reader = PDFReader(
    split_on_pages=False,
    chunk=True,
)
# Instantiate the strategy that determines how the PDF file is split into chunks
chunking_strategy = AgenticChunking(
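    # AgenticChunking calls the model directly, so pass a model instance rather than an id string
    model=Ollama(id=BASE_MODEL, host=OLLAMA_HOST),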
    max_chunk_size=5000,
)
# Instantiate the knowledge base
pdf_knowledge_base = PDFKnowledgeBase(
    path=KNOWLEDGE_BASE_PATH,
    vector_db=database,
    reader=reader,
    chunking_strategy=chunking_strategy,
    num_documents=15,
)
# Load the PDF data into the database
pdf_knowledge_base.load(
    recreate=True,
    upsert=True,
    skip_existing=False,
)

18 chunks were created for the article that will serve as the knowledge base.
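
The `dimensions` value passed to the embedder above has to match the length of the vectors the model actually returns, otherwise inserts into the vector table are likely to fail. A small sanity check can be sketched as below. The helper `check_embedding_dimension` and the stand-in `fake_embed` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Sequence


def check_embedding_dimension(
    embed: Callable[[str], Sequence[float]],
    expected_dimensions: int,
    probe_text: str = "dimension probe",
) -> int:
    """Embed a short probe text and verify the vector length.

    Returns the observed dimension, or raises ValueError on a mismatch.
    """
    observed = len(embed(probe_text))
    if observed != expected_dimensions:
        raise ValueError(
            f"Embedder returned {observed} dimensions, expected {expected_dimensions}"
        )
    return observed


# Stand-in embedder for illustration only; it returns a fixed-size zero vector.
def fake_embed(text: str) -> list[float]:
    return [0.0] * 8


print(check_embedding_dimension(fake_embed, expected_dimensions=8))
```

In this notebook the callable would presumably be `embedder.get_embedding`, with `expected_dimensions=5120`, assuming the installed agno version exposes that method.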

#### LLM Instantiation

In [7]:
# Create an asynchronous client for reaching the LLM through ollama
async_client = AsyncClient(
    host=OLLAMA_HOST,
)

In [8]:
# Create an LLM instance backed by a single server
model = Ollama(
    id=BASE_MODEL,
    async_client=async_client,
    # Temperature is a model option; sent as an HTTP header it has no effect on generation
    options={"temperature": 0.15},
)
# Instantiate the agent that will use the LLM
agent = Agent(
    model=model,
    knowledge=pdf_knowledge_base,
    description=dedent("""
        You are a **Search-Based Research Agent**, an expert in retrieving and synthesizing the most current,
        accurate information from trusted sources. Your core function is to answer user queries
        exclusively using data obtained through real-time search tool calls. You must never rely
        on pre-trained knowledge, assumptions, or unsourced information. Prioritize credibility,
        recency, and relevance in all responses.
    """),
    instructions=[
        dedent("""
        1. **Mandatory Search Activation**:  
            - For **every** user query, invoke the search tool immediately.  
            - Generate 1–3 optimized search queries targeting credible sources (e.g., academic journals, official reports, reputable news).  
            *Example: Querying "peer-reviewed definition of quantum entanglement" instead of "what is quantum entanglement?"* 
        """),
        dedent("""
        2. **Information Synthesis**:  
            - Extract **only** facts from the top 3–5 search results. Cross-verify overlapping information across sources.  
            - Discard conflicting/low-credibility data (e.g., unverified forums, outdated pages).  
        """),
        dedent("""
        3. **Response Structure**:  
            - **Attribution**: Cite sources for every claim. Format: `[Source: Domain/Title]`.  
            - **Conciseness**: Answer directly in ≤3 sentences.  
            - **Uncertainty Handling**: If sources are inadequate, respond:  
                > "I found no verified sources on this topic. Refine your query or ask another question."  
        """),
        dedent("""
        4. **Prohibitions**:  
            - No speculation, opinions, or unsupported statements.  
            - No use of internal knowledge without search validation.  
        """),
        dedent("""
        5. **Language**:
            - Answer in the same language the user uses in their queries. When necessary, keep technical terms in English (e.g., Retrieval-Augmented Generation, RAG).
        """),
        dedent("""
        ### Example Interaction  
            **User**: Define "neuromorphic computing."  
            **Agent**:  
            1. *Searches*: ["neuromorphic computing definition academic"], ["neuromorphic vs traditional architecture peer-reviewed"].  
            2. *Synthesizes*:  
            > "Neuromorphic computing designs hardware to mimic the brain’s neural structure for energy-efficient AI processing [Source: Nature Electronics]. It uses spiking neural networks for real-time learning [Source: IEEE Spectrum]."  
        """),
        dedent("""
        **Key Principles**:  
            - **Search-First**: All answers originate from tool-retrieved data.  
            - **Precision > Creativity**: Prioritize factual accuracy over engagement.  
            - **Source Transparency**: Always expose origins for user verification.
            - **User Language Matching**: Answer in the same language the user uses (e.g., Portuguese).
        """),
    ],
    search_knowledge=True,
    show_tool_calls=True,
    markdown=True,
)
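
With `search_knowledge=True`, the agent is expected to query the knowledge base before answering. The underlying retrieve-then-prompt flow can be sketched in plain Python, independent of agno. This is a toy illustration only: the word-overlap scoring stands in for the real vector and keyword search, and every name and sample chunk below is hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (a toy stand-in for vector search)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    # Keep only chunks sharing at least one word, highest overlap first
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical chunks, not taken from the actual knowledge base
chunks = [
    "Hybrid search combines keyword matching with vector similarity.",
    "PostgreSQL can store embeddings through the pgvector extension.",
    "Bananas are rich in potassium.",
]
question = "How does hybrid search use vector similarity?"
print(build_prompt(question, retrieve(question, chunks)))
```

Only chunks that share words with the question reach the prompt, which is the property the agent's instructions rely on when they forbid answers that are not backed by retrieved material.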

In [9]:
with open(QUESTIONS_PATH, 'r', encoding='utf-8') as f:
    questions = json.load(f)["questions"]
questions

['Qual é o principal objetivo do estudo conduzido pela Anthropic?',
 'Quais são as duas categorias de tarefas que concentram quase metade do uso de IA?',
 'Que porcentagem de ocupações usa IA para pelo menos 25% de suas tarefas associadas?',
 'Como os autores categorizam os padrões de uso entre automação e aumento (augmentation)?',
 'Quais habilidades ocupacionais são mais prevalentes nas conversas com IA?',
 'Como o uso de IA varia conforme o salário das ocupações?',
 'Qual é a principal limitação dos dados utilizados no estudo?',
 'Como os modelos Claude 3 Opus e Claude 3.5 Sonnet diferem nos padrões de uso?',
 'Qual framework teórico fundamenta a análise das tarefas econômicas?',
 'Que tipo de ocupações apresenta menor penetração de IA segundo o estudo?']
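
The cell above assumes the JSON file has a top-level `"questions"` key holding a list of strings. A minimal sketch of a loader that validates this layout and reads the file as UTF-8 (the questions contain accented characters) could look as follows. The function name `load_questions` and the sample content are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_questions(path: str) -> list[str]:
    """Load the question list and fail early on a malformed file."""
    with open(path, "r", encoding="utf-8") as f:
        payload = json.load(f)
    questions = payload.get("questions") if isinstance(payload, dict) else None
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError('Expected a top-level "questions" key holding a list of strings')
    return questions


# Write a small sample file (placeholder content) and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_questions.json"
    sample_path.write_text(
        json.dumps({"questions": ["First question?", "Second question?"]}),
        encoding="utf-8",
    )
    loaded = load_questions(str(sample_path))

print(loaded)
```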

In [10]:

for question in questions:
    # Generate one answer at a time and stream it to the screen
    agent.print_response(question, stream=True, markdown=True)
    # Asynchronous alternative (in a notebook, await the coroutine directly); less convenient for viewing
    # await agent.aprint_response(question, markdown=True)


KeyboardInterrupt: 
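
The loop above answers one question at a time. If throughput matters more than live streaming, the questions could be answered concurrently with a cap on in-flight requests. The sketch below shows the pattern with a placeholder coroutine. `answer_question`, `answer_all` and `max_concurrency` are hypothetical names, and the placeholder only simulates latency instead of calling a model.

```python
import asyncio


async def answer_question(question: str) -> str:
    """Placeholder for an async agent call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


async def answer_all(questions: list[str], max_concurrency: int = 2) -> list[str]:
    """Answer questions concurrently while capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(question: str) -> str:
        async with semaphore:
            return await answer_question(question)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(q) for q in questions))


if __name__ == "__main__":
    results = asyncio.run(answer_all(["q1", "q2", "q3"]))
    print(results)
```

Inside a notebook an event loop is already running, so `await answer_all(questions)` would be used instead of `asyncio.run`. With agno, the placeholder body would presumably become an awaited call such as `agent.arun(question)`, assuming the installed version provides it.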