# Agent RAG System - Functionality Recap

This notebook provides a comprehensive overview of the different functionalities implemented in our Agent RAG system for analyzing bank documents and PRA rulebooks.

## System Overview

Our system consists of three main components:
1. **Document Ingestion Pipeline** - Processing and storing PDF documents
2. **Facts Retrieval System** - Querying knowledge from Neo4j graph database  
3. **Agent Framework** - Interactive assistant with specialized tools and prompts



## 1. Document Ingestion Pipeline (`pdf_processor.py`)

### Core Functionality

The document ingestion system is responsible for processing PDF documents and extracting structured knowledge that can be stored and retrieved efficiently. The main entry point is the `preproc_bank_documents()` function.

### Key Components

#### Document Processing Workflow
1. **PDF Text Extraction**: Uses PyPDF2 to extract raw text from each page
2. **Image Screenshot Generation**: Uses PyMuPDF (fitz) to create high-quality screenshots of each page
3. **Best Representation Detection**: AI-powered analysis to determine whether a page is better represented as text or as an image. The image representation is especially relevant when information is positional and understanding it depends on the page layout (e.g., slides or tables)
4. **Content Summarization**: Generates summaries for both individual pages and entire documents
5. **Fact Extraction**: Extracts structured facts from document content
6. **Embedding Generation**: Creates vector embeddings for semantic search


**Note**: OCR performed by agents should be monitored carefully, especially for non-text data: while LLMs perform well at the word level, they tend to hallucinate characters in references or numbered data. A potential mitigation 

#### Key Classes and Methods

**Document Class methods**:
- `setup_from_path()`: Main processing method that handles PDF parsing and screenshot generation
- `generate_document_summaries()`: Creates AI-generated summaries of document content
- `detect_best_representation()`: Determines optimal representation (text vs image) for each page
- `generate_document_1stfacts()`: Extracts structured facts from document content
- `embed_document_facts()`: Generates embeddings for all extracted facts

**Page Class**:
- Stores text content, images, summaries, keywords, and embeddings for individual pages
- Links to extracted facts and their associated questions

**Fact Class**:
- Represents structured knowledge with associated questions
- Includes embedding vectors for semantic similarity matching
- Connected to source pages and documents
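
The relationship between these three classes can be sketched with plain dataclasses. This is only an illustration: the field names below are assumptions inferred from the descriptions above, not the actual definitions in `pdf_processor.py`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    """One extracted statement plus the questions it can answer (hypothetical fields)."""
    text: str
    questions: list = field(default_factory=list)
    embedding: list = field(default_factory=list)


@dataclass
class Page:
    """A single page: raw text, optional screenshot, summary and extracted facts."""
    number: int
    text: str = ""
    image_path: Optional[str] = None  # screenshot, used when the image representation wins
    summary: str = ""
    keywords: list = field(default_factory=list)
    embedding: list = field(default_factory=list)
    facts: list = field(default_factory=list)


@dataclass
class Document:
    """A PDF file made of pages."""
    name: str
    summary: str = ""
    pages: list = field(default_factory=list)

    def all_facts(self):
        # Flatten facts across pages, keeping page order.
        return [fact for page in self.pages for fact in page.facts]


# Placeholder content, only to show how the objects nest.
doc = Document(name="annual_report_2023.pdf")
doc.pages.append(
    Page(number=1, text="...", facts=[Fact("Placeholder fact", ["Placeholder question?"])])
)
```

Keeping facts attached to their page is what later lets retrieval report the source page of every fact.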

### AI-Powered Processing

The system leverages OpenAI's GPT models (`gpt-5-nano`) for several intelligent processing tasks:

- **Best Representation Analysis**: Determines whether pages with tables, graphs, or complex layouts should be processed as images rather than text
- **Content Summarization**: Generates concise summaries while preserving key information
- **Fact Extraction**: Identifies and structures important factual information from documents. This approach is preferred to traditional text chunking because it produces ready-to-use statements instead of incoherent chunks.
- **Question Generation**: Creates relevant questions that each fact can answer. These questions are generated to facilitate retrieval by anticipating the kinds of questions that may be asked about the document. This assumes that the main input for retrieval will be a question.

### Error Handling and Reliability

- **Exponential Backoff Retry**: Implements robust retry logic for API calls with exponential backoff
- **Async Processing**: Uses asynchronous operations for efficient concurrent processing
- **Duplicate Prevention**: Checks for existing documents in Neo4j to avoid reprocessing
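
As an illustration of the retry behaviour, here is a minimal sketch of an async exponential backoff helper. The name `retry_with_backoff` and its parameters are hypothetical and are not taken from `pdf_processor.py`; a real implementation would probably catch only the API's transient errors rather than every exception.

```python
import asyncio
import random


async def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Await `call()` and retry on failure, doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:  # assumption: narrow this to transient API errors in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps concurrent tasks from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

`call` is any zero-argument coroutine function, for example a `lambda` wrapping the API request, so that each retry creates a fresh coroutine.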

### Storage Integration

The processed documents are stored in a Neo4j graph database with the following structure:
- **CORPUS** nodes: Represent collections of documents (e.g., by bank)
- **DOCUMENT** nodes: Individual PDF files with metadata and summaries
- **PAGE** nodes: Individual pages with text, summaries, and embeddings
- **FACT** nodes: Extracted factual information with embeddings and associated questions

### Usage Example

```python
# Process documents from a folder
await preproc_bank_documents(
    folder_path="data/Barclays",
    file_list=["annual_report_2023.pdf", "risk_assessment.pdf"],
    focus="risk management",
    fact_label="FACT"
)
```

This ingestion pipeline ensures that documents are pre-processed consistently and stored for retrieval and analysis by the agent system.



---

## 2. Facts Retrieval System (Neo4j Queries)

### Core Functionality

The facts retrieval system enables semantic search and retrieval of relevant information from the knowledge graph stored in Neo4j. The main component is the `GraphProcessor` class in `process_graph.py`.

### Key Components

#### GraphProcessor Class

The `GraphProcessor` class provides the main interface for querying the knowledge graph:

**Main Methods**:
- `query_graph()`: Primary method for semantic search across the knowledge base
- `embed_input()`: Generates embeddings for user queries using OpenAI's embedding models
- `find_existing_node()`: Checks for existing nodes to prevent duplicates
- `get_document_with_descriptions()`: Retrieves document metadata and descriptions

#### Query Architecture

The retrieval system uses sophisticated Cypher queries that combine:

1. **Vector Similarity Search**: Computes cosine similarity between query embeddings and stored fact/page embeddings
2. **Hybrid Scoring**: Combines fact-level and page-level similarity scores for more accurate results
3. **Corpus Filtering**: Allows filtering by specific document collections (banks, PRA rulebooks)
4. **Relevance Thresholding**: Filters results based on minimum similarity scores
5. **Result Limiting**: Controls the number of facts returned per document and total facts

#### Semantic Search Process

```python
async def query_graph(self, question: str, threshold: float = 0.6,
                      doc_limit: int | None = None, total_fact_limit: int | None = None):
    ...  # method body omitted in this recap
```

**Step-by-step Process**:

1. **Query Embedding**: Convert user question to vector representation using `text-embedding-3-large`
2. **Graph Traversal**: Navigate through FACT → PAGE → DOCUMENT → CORPUS relationships
3. **Similarity Computation**: Calculate cosine similarity between query and stored embeddings
4. **Hybrid Scoring**: Combine fact and page similarities: `(similarity_fact + similarity_page)/2`
5. **Filtering & Ranking**: Apply threshold filtering and sort by relevance
6. **Result Structuring**: Group facts by document with metadata
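
Steps 3 to 5 can be sketched in pure Python. In the actual system this computation happens inside the Cypher query; the function names here are illustrative, and applying the threshold to the hybrid score (rather than to the fact score alone) is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_facts(query_vec, candidates, threshold=0.6):
    """Score (fact_id, fact_vec, page_vec) triples and keep those at or above the threshold."""
    scored = []
    for fact_id, fact_vec, page_vec in candidates:
        # Hybrid score: mean of fact-level and page-level similarity.
        score = (cosine_similarity(query_vec, fact_vec)
                 + cosine_similarity(query_vec, page_vec)) / 2
        if score >= threshold:
            scored.append((fact_id, score))
    # Most relevant facts first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Averaging with the page-level score means a fact is boosted when the page it comes from is also on topic, and penalised when it is an isolated match on an otherwise unrelated page.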

#### Neo4j Graph Schema

The knowledge graph follows this structure:

```
CORPUS ← [] - DOCUMENT ← [] - PAGE ← [QUESTION (relationships)] - FACT
```

**Node Types**:
- **CORPUS**: Document collections (e.g., "Barclays", "PRA_Rulebook")
- **DOCUMENT**: Individual PDF files with summaries and metadata
- **PAGE**: Document pages with text content and embeddings
- **FACT**: Extracted factual information with embeddings
- **QUESTION** (relationship type, not a node): Links facts to the questions they can answer

#### Advanced Query Features

**Corpus Filtering**:
```python
CORPUS = {
    "included": ["Barclays", "HSBC"],  # Only search these banks
    "excluded": ["PRA_Rulebook"]       # Exclude PRA documents
}
```

**Dynamic Limiting**:
- `doc_limit`: Maximum number of documents to return
- `total_fact_limit`: Maximum total facts across all documents
- Per-document fact limiting through Cypher query optimization
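
The filtering and limiting semantics can be illustrated outside the database. This is a sketch under assumptions: the helper names are hypothetical, and the exact way the real Cypher query combines `doc_limit` and `total_fact_limit` may differ.

```python
def corpus_allowed(corpus_name, corpus_filter=None):
    """Apply an {'included': [...], 'excluded': [...]} filter to one corpus name."""
    corpus_filter = corpus_filter or {}
    included = corpus_filter.get("included")
    excluded = corpus_filter.get("excluded", [])
    if included and corpus_name not in included:
        return False
    return corpus_name not in excluded


def apply_limits(ranked_docs, doc_limit=None, total_fact_limit=None):
    """Trim {document_name: [facts]} (already sorted by relevance) to the given limits."""
    trimmed, remaining = {}, total_fact_limit
    # Slicing with None keeps every document.
    for doc_name, facts in list(ranked_docs.items())[:doc_limit]:
        if remaining is not None:
            if remaining <= 0:
                break
            facts = facts[:remaining]
            remaining -= len(facts)
        trimmed[doc_name] = facts
    return trimmed
```

With this ordering, the most relevant documents consume the fact budget first, so lower-ranked documents are truncated or dropped once `total_fact_limit` is reached.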

#### Result Structure

The system returns structured results through specialized classes:

**Retrieved_fact Class**:
- `page_id`, `page_name`: Source page information
- `fact_id`, `fact`: The actual factual content
- `question`: Associated question the fact answers
- `similarity`: Relevance score (0-1)

**Retrieval_by_document Class**:
- `document_summary`: AI-generated document overview
- `relevant_facts`: List of retrieved facts from this document
- `max_similarity`: Highest similarity score in this document
- `document_name`, `corpus_name`: Source identification

**Retrieval_overall Class**:
- Aggregates results across all documents
- Provides methods for result persistence and formatting
- Enables answer storage back to the knowledge graph
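
A minimal sketch of how flat query rows might be grouped into these result classes is shown below. The attribute names follow the lists above, but the `document_name` field on each fact and the `group_by_document` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Retrieved_fact:
    page_id: str
    page_name: str
    fact_id: str
    fact: str
    question: str
    similarity: float
    document_name: str  # assumption: carried on each row so facts can be grouped


@dataclass
class Retrieval_by_document:
    document_name: str
    corpus_name: str = ""
    document_summary: str = ""
    relevant_facts: list = field(default_factory=list)

    @property
    def max_similarity(self):
        # Highest score among this document's facts (0.0 when empty).
        return max((f.similarity for f in self.relevant_facts), default=0.0)


def group_by_document(facts):
    """Group facts per document, best fact first, best document first."""
    groups = {}
    for fact in facts:
        group = groups.setdefault(fact.document_name,
                                  Retrieval_by_document(fact.document_name))
        group.relevant_facts.append(fact)
    for group in groups.values():
        group.relevant_facts.sort(key=lambda f: f.similarity, reverse=True)
    return sorted(groups.values(), key=lambda g: g.max_similarity, reverse=True)
```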

### Performance Optimizations

1. **Vector Indexing**: Neo4j vector indexes for fast similarity search
2. **Query Optimization**: Efficient Cypher queries with proper ordering and limiting
3. **Async Processing**: Non-blocking operations for better responsiveness
4. **Connection Pooling**: Reusable database connections
5. **Result Caching (not implemented yet)**: Potential for caching frequently accessed results by persisting questions that were successfully answered as a new `FACT`. The graph structure would then allow this new `FACT` to be connected to multiple pages (going beyond single-page facts)

### Usage Examples

```python
# Basic semantic search
results = await graph_processor.query_graph(
    question="What are the capital requirements for banks?",
    threshold=0.7,
    total_fact_limit=20
)

# Filtered search within specific banks
results = await graph_processor.query_graph(
    question=" a question on bank based on which employee needs information",
    CORPUS={"included": ["Barclays"]},
    doc_limit=5
)
```

This retrieval system enables the agent to access relevant factual information efficiently, supporting both broad exploratory queries and specific targeted searches across the knowledge base.



---

## 3. Agent Functionalities (Tools, Prompts & Conversation Flow)

### Core Architecture

The agent system is built using LlamaIndex's `FunctionAgent` framework and implemented in `chatbot_framework.py`. It serves as the Bank of England Docs Assistant, helping employees investigate bank documentation and PRA rulebooks.

### Agent Tools

The agent has access to three specialized tools that enable comprehensive document analysis:

#### 1. RAG_tool_banks
**Purpose**: Retrieve information from bank documents
**Parameters**:
- `question` (str): The question to search for in the RAG database
- `filter` (list, optional): Comma-separated list of corpus names to filter on
- `total_fact_limit` (int, optional): Limit the total number of facts returned

**Functionality**: Searches through processed bank documents using semantic similarity to find relevant facts that answer user questions.

#### 2. RAG_tool_pra  
**Purpose**: Retrieve PRA rules and regulations related to user questions
**Parameters**:
- `question` (str): The client question to search for in the PRA database
- `total_fact_limit` (int, optional): Limit the total number of facts returned

**Example of output**

```
🔧 Tool Call: RAG_tool_pra
   Parameters: {'question': "What PRA Rulebook guidance should be used to frame an investigation into HSBC's capital and liquidity adequacy, focusing on Pillar 1 minima, Pillar 2A, buffers, ICAAP alignment, liquidity risk management (LCR, NSFR, contingency funding planning), governance, and supervisory reporting?"}
   Result: From document: banking-approach-2023.pdf
With following executive summary: An overview of the Bank of England’s Prudential Regulation Authority framework: its objectives, risk identification and assessment, and proportionate, risk-based supervision of banks, insurers and other firms. It covers supervisory activity, resolution planning, capital and liquidity requirements, governance, and cross-border cooperation, with accompanying boxes and annexes.
Relevant facts found:
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3333): The PSM includes guidance on the adequacy of capital and liquidity, as described in Section 3 of the document.
  - From page: 046
  - Similarity: 0.820
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3063): The PRA uses a forward-looking assessment of a firm’s prospects to determine the level of capital a firm requires, and uses the complexity of the firm’s business to inform judgments about its risk management processes.
  - From page: 021
  - Similarity: 0.808
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2901): The PRA supervisory framework enumerates key topical areas including Capital, The leverage ratio framework, Liquidity, Operational resilience, and Resolvability.
  - From page: 003
  - Similarity: 0.807
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3413): The PRA’s general approach to the authorisation and supervision of international banks is anchored by an assessment of factors including the degree of equivalence of the home state regulator’s regime and the effectiveness with which the PRA can supervise the international bank.
  - From page: 054
  - Similarity: 0.807
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2952): Some prudential issues are common to all supervised firms, including maturity transformation and being levered (holding debt in their capital structure), which makes them inherently vulnerable to loss of confidence; the PRA sets out policies that firms should meet in spirit and to the letter.
  - From page: 009
  - Similarity: 0.803
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2892): The document lists 'Risk management and controls' as a topic in the PRA rulebooks.
  - From page: 002
  - Similarity: 0.801
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3421): The PRA’s approach to branch and subsidiary supervision is described in SS5/21, with a parallel FCA approach for international firms.
  - From page: 055
  - Similarity: 0.799
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2894): The document lists 'Design and effectiveness of the Board and Senior Management' as a topic in the PRA rulebooks.
  - From page: 002
  - Similarity: 0.798
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3137): The PRA forms judgments about how much capital individual firms need to maintain, given risks and uncertainties, but firms should take responsibility for determining the appropriate level of capital in the first instance; they should engage honestly and prudently in assessing capital adequacy and should not rely on regulatory minima or aggressive accounting practices.
  - From page: 028
  - Similarity: 0.796
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3023): SS5/21/10 sets out the PRA's approach to supervising international banks with either branches or subsidiaries in the UK; further detail on this approach can be found in Section 5.
  - From page: 016
  - Similarity: 0.795
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2889): The document lists 'Fundamental Rules' as a key rule category in the PRA rulebooks.
  - From page: 002
  - Similarity: 0.795
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3143): The PRA will assess whether the stresses applied are appropriately prudent.
  - From page: 029
  - Similarity: 0.794
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3272): The PRA supervisory approach is judgement-based, forward-looking, and focused on key risks, and it utilises a broad range of quantitative and qualitative data to inform supervisory judgments.
  - From page: 041
  - Similarity: 0.793
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:2896): The document lists 'Investigations into regulatory failure' as a topic in the PRA rulebooks.
  - From page: 002
  - Similarity: 0.790
  --
  - Fact (id: 4:e43eb764-a2c6-44ce-993e-d18abbf24318:3025): The intensity of PRA supervisory activity varies across firms and is principally determined by the firm’s potential impact on financial system stability, its proximity to failure (as described in the Proactive Intervention Framework), its resolvability, and statutory obligations.
  - From page: 017
  - Similarity: 0.790
  --

```

**Functionality**: Searches PRA rulebooks to understand what regulations suggest checking in bank documents, providing regulatory context for investigations.
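
The plain-text layout of the sample output above could be produced by a small formatter such as the following. This is a sketch: `format_retrieval` and the dictionary keys are hypothetical, chosen to mirror the fields visible in the sample.

```python
def format_retrieval(document_name, summary, facts):
    """Render one document's facts as plain text for the agent."""
    lines = [
        f"From document: {document_name}",
        f"With following executive summary: {summary}",
        "Relevant facts found:",
    ]
    for fact in facts:
        lines.append(f"  - Fact (id: {fact['fact_id']}): {fact['fact']}")
        lines.append(f"  - From page: {fact['page_name']}")
        # Three decimals, matching the sample output.
        lines.append(f"  - Similarity: {fact['similarity']:.3f}")
        lines.append("  --")
    return "\n".join(lines)
```

Including the fact id in the text is what allows the agent to cite the facts it used in its structured response.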

#### 3. get_docs_descriptions_tool
**Purpose**: Get overview and descriptions of available documents
**Parameters**:
- `corpus_label` (str, optional): Specific corpus to describe; if not provided, returns all documents

**Functionality**: Provides metadata and descriptions of documents in the database, helping users understand what information is available.

### System Prompt & Conversation Flow

The agent follows a structured conversation flow designed for regulatory investigation:

#### Agent Role
```
You are the Bank of England Docs Assistant, assistant employee investigating 
documentation from banks.
```

#### Structured Conversation Flow

**Step 1: Question Collection**
- Gather employee questions about bank documentation
- Use `get_docs_descriptions_tool` if employee asks about available content

**Step 2: PRA Enhancement** 
- Use `RAG_tool_pra` to retrieve relevant PRA rules and regulations
- Enhance the employee's question with regulatory context
- Present PRA information to employee before proceeding

**Step 3: Bank Document Analysis**
- Use `RAG_tool_banks` to search for information in bank documents
- Break large questions into smaller, focused queries
- Create an "audit plan" based on PRA information

**Step 4: Iterative Refinement**
- Ask "Refine or check anything else?" to continue investigation

### Key Instructions & Behaviors

#### Information Integrity
- **Answer ONLY from the corpus** using approved tools
- **Never use external facts** or make up information
- If information is not in the corpus, explicitly state so
- If tools fail, acknowledge the failure rather than fabricating responses

#### Query Processing Strategy
- **Prioritize facts by relevance**, focusing on most relevant information first
- **Separate large questions** into smaller, focused queries for better results
- **Synthesize PRA information** into a series of targeted questions for bank document searches
- **Always answer employee** after PRA enhancement before proceeding to bank document searches

#### Response Formatting
The agent returns structured JSON responses with specific formatting rules:

```json
{
    "answer": "Conversational and synthetic response",
    "relevant_fact_ids": {
        "PRA": ["fact_id_1", "fact_id_2"],
        "BANK": ["fact_id_3", "fact_id_4"]
    }
}
```

**Formatting Guidelines**:
- **Conversational tone**: Responses should be natural and synthetic
- **Focused content**: Keep answers focused on essential facts only
- **Fact ID tracking**: Store IDs of facts used to generate responses
- **Length limits**: Short answers (≤120 words) when not using RAG tools
- **Language mirroring**: Match the employee's language style
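
Because the agent's reply is expected to be JSON, the calling code may want to validate it before use. The following is a minimal sketch using only the standard library; `parse_agent_response` is a hypothetical helper, not part of `chatbot_framework.py`.

```python
import json


def parse_agent_response(raw):
    """Parse the agent's JSON reply and check it has the documented shape."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("response must be a JSON object with a string 'answer'")
    raw_ids = data.get("relevant_fact_ids") or {}
    if not isinstance(raw_ids, dict):
        raise ValueError("'relevant_fact_ids' must be an object")
    fact_ids = {}
    for source in ("PRA", "BANK"):
        ids = raw_ids.get(source, [])  # a missing source means no facts were used
        if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
            raise ValueError(f"'{source}' must be a list of fact id strings")
        fact_ids[source] = ids
    return data["answer"], fact_ids
```

Failing loudly on a malformed reply is consistent with the rule above that tool or format failures should be acknowledged rather than papered over.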

### Implementation Classes

#### AsyncConversation Class
Manages conversation state and message history:
- `Client_prompt_class`: Stores user messages with timestamps
- `Bot_answer_class`: Stores agent responses with timestamps
- `generate_ready_to_read()`: Formats conversation history for agent context
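
The idea of timestamped messages rendered into a readable transcript can be sketched as follows. `Message` and `ConversationHistory` are simplified, hypothetical stand-ins for the classes listed above; only the method name `generate_ready_to_read()` comes from the system itself, and its real output format may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Message:
    role: str  # "employee" or "assistant"
    text: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class ConversationHistory:
    messages: list = field(default_factory=list)

    def add(self, role, text):
        self.messages.append(Message(role, text))

    def generate_ready_to_read(self):
        # One line per message, oldest first, ready to prepend to the agent's context.
        return "\n".join(
            f"[{m.timestamp:%Y-%m-%d %H:%M:%S}] {m.role}: {m.text}"
            for m in self.messages
        )
```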

#### Agent Building Process
```python
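def build_agent(self, llm_model: str, verbose: bool = False) -> tuple[FunctionAgent, str]:
    # Returns both the agent and the prompt it was built with.
    llm = llama_openai(model=llm_model)
    tools = self.toolbox()
    # `system_prompt` is not defined in this excerpt; it is assumed to be
    # available in the enclosing scope.
    agent = FunctionAgent(
        llm=llm,
        tools=tools,
        system_prompt=system_prompt,
        verbose=verbose
    )
    return agent, system_prompt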
```

### User Interfaces

The system provides two interaction modes:

#### 1. Command Line Interface (`conversation_process()`)
- Interactive terminal-based conversation
- Detailed logs and reasoning steps are stored in `/chatbot_logs` for debugging (especially to check tool calling)
- Suitable for development and debugging


### Usage Example

```python
# Initialize conversation system
graph_processor = GraphProcessor()
conversation = AsyncConversation(graph_processor)
agent, system_prompt = conversation.build_agent("gpt-4")

# Process user input
user_input = "What are the capital adequacy requirements for UK banks?"
response = await agent.run(user_msg=user_input, max_iterations=10)
```

This agent framework provides a sophisticated, tool-enabled assistant that can navigate complex regulatory and financial documentation while maintaining conversation context and ensuring information accuracy.

---

## System Integration & Summary

### Complete Workflow

The three components work together to create a comprehensive document analysis system:

1. **Document Ingestion** → **Knowledge Graph Storage** → **Agent Retrieval & Analysis**

### Key Strengths

- **Intelligent Processing**: AI-powered document analysis with best representation detection
- **Semantic Search**: Vector-based similarity matching for accurate information retrieval  
- **Regulatory Context**: Integration of PRA rulebooks with bank document analysis
- **Structured Conversation**: Guided investigation flow for systematic document review
- **Scalable Architecture**: Handles large document collections with efficient querying
- **Multi-Modal Support**: Processes both text and visual document elements

### Technical Highlights

- **Async Processing**: Efficient concurrent operations throughout the pipeline
- **Error Resilience**: Comprehensive retry logic and graceful error handling
- **Graph Database**: Neo4j provides flexible, relationship-rich data storage
- **Extensible data form**: Modular architecture allows for easy `FACT` additions; for example (not implemented yet), the agent could keep a memory of previously generated questions.

This system enables Bank of England employees to efficiently investigate and analyze complex financial documentation while ensuring regulatory compliance and maintaining high standards of information accuracy.
