# Agent Bricks: Building an AI Knowledge Assistant

**Agent Bricks** is Databricks' streamlined approach for building production-grade AI agents using natural language and pre-configured templates. It automates the complex work of building, optimizing, and deploying AI agents.

## What You'll Learn

✅ Understand what Agent Bricks Knowledge Assistant does  
✅ Create a Knowledge Assistant using the UI  
✅ Connect it to PDF documents in a Unity Catalog Volume  
✅ Test the assistant with natural language queries  
✅ Use AI Functions (`ai_query`, `ai_parse_document`, `ai_extract`) to process documents  

---

## The Scenario

Your team has accumulated aircraft maintenance manuals, technical reports, and research papers as PDFs. Engineers need to quickly find information buried in these documents without manually searching through hundreds of pages.

**The solution:** Build a Knowledge Assistant that can answer questions about the content in these PDFs using natural language.

---

## What is Agent Bricks Knowledge Assistant?

**Agent Bricks Knowledge Assistant** is a pre-built template that:
- Automatically indexes documents from Unity Catalog Volumes
- Uses retrieval-augmented generation (RAG) to answer questions
- Provides source citations for transparency
- Requires no code to get started

**Why use it:**
- ⚡ **Fast setup** - Minutes, not weeks
- 🎯 **Accurate answers** - Cites sources from your documents
- 🔄 **Continuous optimization** - Databricks improves the system over time
- 🚀 **Production-ready** - Deploy as an API endpoint immediately

---

## Table of Contents

0. Configuration
1. Understanding Agent Bricks Knowledge Assistant
2. Creating Your Knowledge Assistant (UI)
3. Testing Your Knowledge Assistant
4. AI Functions Overview
5. AI Functions Example 1: Simple `ai_query`
6. AI Functions Example 2: Parse and Extract from PDFs
7. Summary

---

**References:**
- [Agent Bricks Overview](https://docs.databricks.com/aws/en/generative-ai/agent-bricks)
- [Knowledge Assistant](https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant)
- [AI Functions](https://docs.databricks.com/aws/en/large-language-models/ai-functions-example)
- [`ai_parse_document` Function](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document)


In [0]:
# Configuration
import re

CATALOG = 'dwx_express_insights_platform_dev_working'
READ_SCHEMA = 'db_crash_course'  # Shared schema (read-only)
username = spark.sql("SELECT current_user()").collect()[0][0]
username_base = username.split('@')[0]  # Extract username before @ symbol
WRITE_SCHEMA = re.sub(r'[^a-zA-Z0-9_]', '_', username_base)  # Replace special chars with _

# Create personal schema for any writes
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{WRITE_SCHEMA}")

# PDF documents volume path (created during setup)
PDF_VOLUME_PATH = f"/Volumes/{CATALOG}/{READ_SCHEMA}/pdf_documents"

print(f"✅ Using catalog: {CATALOG}")
print(f"📖 Reading from schema: {READ_SCHEMA} (shared)")
print(f"✍️  Writing to schema: {WRITE_SCHEMA} (your personal schema)")
print(f"📄 PDF documents at: {PDF_VOLUME_PATH}")
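
To preview what your personal schema name will look like, the sanitizing step from the configuration cell can be tried on its own. This is a minimal sketch that mirrors the same regex; the helper name `schema_name_for` and the sample addresses are hypothetical.

```python
import re


def schema_name_for(user_email: str) -> str:
    """Mirror the configuration cell: keep the part before '@' and
    replace every character outside [a-zA-Z0-9_] with an underscore."""
    username_base = user_email.split("@")[0]
    return re.sub(r"[^a-zA-Z0-9_]", "_", username_base)


# Hypothetical addresses, for illustration only
print(schema_name_for("jane.doe@example.com"))      # jane_doe
print(schema_name_for("j-smith+test@example.com"))  # j_smith_test
```

Note that two different addresses can map to the same schema name (for example `jane.doe` and `jane-doe`), which is worth keeping in mind in a shared catalog.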

## 1. Understanding Agent Bricks Knowledge Assistant

### What is Agent Bricks?

**Agent Bricks** is a low-code/no-code way to build AI agents on Databricks. Instead of writing complex RAG (Retrieval-Augmented Generation) pipelines yourself, Agent Bricks provides pre-built templates that automatically:

1. **Index your documents** - Automatically chunks and vectorizes content
2. **Select optimal models** - Tries different embedding and LLM models
3. **Optimize retrieval** - Fine-tunes retrieval parameters for accuracy
4. **Deploy as API** - Creates production-ready endpoints
5. **Continuous improvement** - Monitors performance and suggests improvements

### What is a Knowledge Assistant?

A **Knowledge Assistant** is a specific Agent Bricks template designed to:
- Answer questions based on your documents (PDFs, text files, etc.)
- Provide source citations for every answer
- Handle follow-up questions with conversation context
- Scale to thousands of documents

**Under the hood, it uses:**
- Vector search for semantic document retrieval
- LLMs (Large Language Models) for answer generation
- RAG (Retrieval-Augmented Generation) architecture
- Databricks-optimized infrastructure
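
The retrieval step of RAG can be illustrated without any Databricks services. The sketch below is a toy and not how Agent Bricks is implemented: it stands in for learned embeddings with simple word-count vectors, ranks a few hypothetical text chunks by cosine similarity to the question, and builds the prompt an LLM would receive.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks, for illustration only
chunks = [
    "Inspect the hydraulic pump every 500 flight hours.",
    "The cabin lighting system uses LED panels.",
    "Replace the hydraulic pump seals if leakage is detected.",
]
top = retrieve("When should the hydraulic pump be inspected?", chunks, top_k=2)

# The retrieved chunks are placed in the prompt so the LLM answers from them
prompt = "Answer using only these excerpts:\n" + "\n".join(top)
print(prompt)
```

The two hydraulic-pump chunks are retrieved and the cabin-lighting chunk is left out, which is the behavior the vector search component provides at much larger scale.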

### Benefits of Agent Bricks Knowledge Assistant

✅ **No code required** - Build through the UI in minutes  
✅ **Automatic optimization** - Databricks tunes retrieval and generation  
✅ **Source citations** - Every answer references specific documents  
✅ **Production-ready** - Deploy immediately as a REST API  
✅ **Continuous learning** - System improves over time  

### When to Use Knowledge Assistant

- **Technical documentation search** - Manuals, specs, procedures
- **Research paper analysis** - Scientific literature review
- **Policy and compliance** - Legal documents, regulations
- **Customer support** - Knowledge base Q&A
- **Internal wiki search** - Company documentation

### Requirements

- Unity Catalog enabled workspace
- Documents stored in a Unity Catalog Volume
- Supported formats: PDF, TXT, DOCX, HTML, Markdown
- Serverless compute (for optimal performance)


## 2. Creating Your Knowledge Assistant (UI)

Now let's create a Knowledge Assistant that can answer questions about the aircraft documentation PDFs.

### Step-by-Step Instructions

Follow these steps in the Databricks UI to create your Knowledge Assistant:

**Step 1: Navigate to Agent Bricks**

1. In the Databricks workspace, click **AI** in the left sidebar
2. Select **Agent Bricks** (or **AI Builder**)
3. Click **Create Agent** or **+ New Agent**

**Step 2: Select Knowledge Assistant Template**

1. Choose the **Knowledge Assistant** template
2. Give your assistant a name (e.g., "Aircraft Documentation Assistant")
3. Optionally add a description: "Answers questions about aircraft maintenance and technical documentation"

**Step 3: Configure Data Source**

1. Under **Data Source**, select **Unity Catalog Volume**
2. Browse to or enter the volume path:
   ```
   /Volumes/dwx_express_insights_platform_dev_working/db_crash_course/pdf_documents
   ```
3. The system will automatically detect the PDF files in the volume

**Step 4: Configure Settings (Optional)**

Keep the defaults. If your workspace exposes these settings, you can optionally adjust:
- **Chunk size**: How documents are split (default: 1000 tokens)
- **Overlap**: Overlap between chunks (default: 200 tokens)
- **Top K**: Number of relevant chunks to retrieve (default: 5)
- **Model**: LLM used for answer generation (default: optimized by Databricks)
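
To make chunk size and overlap concrete, here is a minimal sketch of sliding-window chunking. It is an illustration only: it splits on whitespace-separated words rather than model tokens, and the function name `chunk_words` is hypothetical, not part of Agent Bricks.

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size words, each sharing
    `overlap` words with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk. Larger overlap improves that coverage at the cost of more chunks to index.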

**Step 5: Create and Deploy**

1. Click **Create** to build the agent
2. Agent Bricks will:
   - Index and chunk all PDF documents
   - Create vector embeddings
   - Set up the RAG pipeline
   - Deploy a serverless endpoint
3. Wait 2-5 minutes for indexing to complete (check status indicator)

**What happens behind the scenes:**
- Documents are parsed and chunked into manageable pieces
- Each chunk is converted to a vector embedding for semantic search
- A vector index is created in Unity Catalog
- A serving endpoint is deployed for real-time queries


## 3. Testing Your Knowledge Assistant

Once your Knowledge Assistant is deployed, it's time to test it with real queries!

### Testing in the UI

**Step 1: Open the Chat Interface**

1. After deployment completes, you'll see a **Chat** interface in the Agent Bricks UI
2. Or click on your agent name to open the chat window

**Step 2: Ask Questions About Your PDFs**

Try these example questions (adjust based on the actual PDF content):

**Question 1: General Information**
```
What maintenance procedures are described in the aircraft documentation?
```

**Question 2: Specific Details**
```
What are the key failure modes mentioned in the run-to-failure simulation dataset?
```

**Question 3: Technical Specifications**
```
What ATM (Air Traffic Management) concepts are defined in the NASA ontology?
```

**Question 4: Complex Query**
```
What are the main differences between the maintenance approaches described across the documents?
```

### Understanding the Results

For each answer, the Knowledge Assistant provides:

✅ **Answer text** - Generated response based on retrieved documents  
✅ **Source citations** - Which PDFs and pages were used  
✅ **Confidence score** - How confident the system is in the answer  
✅ **Retrieved chunks** - The actual text excerpts used  

### What Makes a Good Answer?

- **Accurate**: Information matches the source documents
- **Cited**: References specific documents/pages
- **Concise**: Answers the question directly
- **Contextual**: Uses relevant information from multiple chunks if needed

### Quick Evaluation

As you test, check:
- Does the answer make sense?
- Are the sources relevant?
- Does it handle questions outside the document scope appropriately?

**Note:** For this training, we're keeping it simple. In production, you'd run more extensive evaluations using test question sets, but that's beyond our scope today.
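
A test question set can start very small. The sketch below is one possible shape for such a check, not a Databricks evaluation API: each test case lists keywords a good answer should mention, and `ask_assistant` is a hypothetical stand-in that you would replace with a real call to your deployed assistant.

```python
from typing import Callable


def keyword_coverage(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def evaluate(ask: Callable[[str], str], test_set: list[dict], threshold: float = 0.5) -> list[dict]:
    """Ask every test question and record whether the answer covers enough keywords."""
    results = []
    for case in test_set:
        answer = ask(case["question"])
        score = keyword_coverage(answer, case["keywords"])
        results.append({"question": case["question"], "score": score, "passed": score >= threshold})
    return results


# Hypothetical stand-in for the deployed assistant; always returns the same answer
def ask_assistant(question: str) -> str:
    return "Inspect the hydraulic pump every 500 flight hours."


# Hypothetical test cases, for illustration only
test_set = [
    {"question": "How often is the hydraulic pump inspected?", "keywords": ["hydraulic", "500"]},
    {"question": "Which engine oil is approved?", "keywords": ["oil", "viscosity"]},
]
results = evaluate(ask_assistant, test_set)
for r in results:
    print(r)
```

Keyword matching is a crude proxy for answer quality, but even a handful of such cases makes regressions visible when documents or settings change.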


## 4. AI Functions Overview

Beyond Agent Bricks, Databricks provides **AI Functions** - SQL functions that let you call generative AI models directly in your queries. These functions are powerful tools for document processing and information extraction.

### Three Key AI Functions

**1. `ai_query()`** - Call LLMs directly
- Ask questions or generate text
- Simplest way to use LLMs in SQL
- Great for one-off queries

**2. `ai_parse_document()`** - Extract structured content from documents
- Parses PDFs, images, Office docs
- Extracts text, tables, figures
- Returns structured JSON with layout information

**3. `ai_extract()`** - Extract entities from text
- Pull out names, dates, organizations, etc.
- Schema-on-read for unstructured text
- Returns structured data as columns

### Why Use AI Functions?

✅ **Scalable** - Process thousands of documents in parallel  
✅ **Integrated** - Works directly in SQL and PySpark  
✅ **Cost-effective** - Only pay for what you use  
✅ **No infrastructure** - Serverless execution  

Let's see these in action!


## 5. AI Functions Example 1: Simple `ai_query`

The simplest AI Function is `ai_query()` - it lets you call an LLM directly from SQL or Python.

### Basic Example

Let's ask a simple question to verify the function works:

In [0]:
%sql
SELECT ai_query(
  'databricks-meta-llama-3-1-70b-instruct',
  'What is the capital of France? Answer in one sentence.'
) AS answer
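
To call `ai_query()` from Python, one option is to build the SQL text and pass it to `spark.sql`. The sketch below shows only the string-building part, so it runs anywhere; the helper name `build_ai_query_sql` is hypothetical, and the escaping assumes Spark SQL's default backslash escapes inside string literals.

```python
def sql_string_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal.
    Assumption: backslash escaping, as in Spark SQL's default parser settings."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_ai_query_sql(endpoint: str, prompt: str) -> str:
    """Build a SELECT statement that sends `prompt` to a model serving endpoint."""
    return f"SELECT ai_query({sql_string_literal(endpoint)}, {sql_string_literal(prompt)}) AS answer"


query = build_ai_query_sql(
    "databricks-meta-llama-3-1-70b-instruct",
    "What's the capital of France? Answer in one sentence.",
)
print(query)
```

In a notebook you would then run `spark.sql(query).display()`. Escaping matters as soon as prompts contain apostrophes or come from user input.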


## 6. AI Functions Example 2: Parse and Extract from PDFs

Now let's do something more powerful - parse PDF documents and extract structured information!

### What We'll Do

1. Read PDFs from the Unity Catalog Volume as binary files
2. Use `ai_parse_document()` to extract text and structure
3. Use `ai_extract()` to pull out specific entities (organizations, dates, etc.)
4. Save the results to a table for further analysis

### Step 1: Read and Parse PDFs


from pyspark.sql.functions import expr, col

# Read PDFs from the volume as binary files
pdf_df = spark.read.format("binaryFile").load(PDF_VOLUME_PATH)

print(f"Found {pdf_df.count()} PDF files")
print("\nPDF files:")
pdf_df.select("path").show(truncate=False)

# Parse one document as an example (using the first PDF)
# ai_parse_document extracts text, tables, and structure
parsed_df = (
    pdf_df
    .limit(1)  # Just process one PDF for this example
    .withColumn("parsed", expr("ai_parse_document(content)"))
    .select(
        "path",
        "parsed"
    )
)

print("\n✅ Document parsed successfully!")
print("The parsed output contains structured information about the document.")

# Display the parsed structure
parsed_df.display()
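
To get a feel for the parsed output before extracting text from it in Spark, the same flattening can be sketched in plain Python. The dictionary below is a hand-written, hypothetical example shaped like the `document.elements[].content` path used in this notebook; real output contains more fields and may differ by function version.

```python
# Hypothetical, simplified stand-in for one parsed document
parsed = {
    "document": {
        "elements": [
            {"type": "title", "content": "Hydraulic System Maintenance"},
            {"type": "text", "content": "Inspect the pump every 500 flight hours."},
            {"type": "figure", "content": None},  # some elements may carry no text
        ]
    }
}


def elements_to_text(parsed_doc: dict, separator: str = " ") -> str:
    """Join the text of all elements, skipping elements without content."""
    elements = parsed_doc.get("document", {}).get("elements", [])
    return separator.join(e["content"] for e in elements if e.get("content"))


text_content = elements_to_text(parsed)
print(text_content)
```

Inspecting one real parsed row the same way is a quick check that the element path matches what your workspace returns.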


### Step 2: Extract Structured Information

Now let's extract specific entities from the parsed text using `ai_extract()`:

**What `ai_extract()` does:**
- Takes text and a list of entity types to extract
- Returns structured data (names, dates, organizations, etc.)
- Works great for turning unstructured documents into structured tables

In [None]:
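# Extract text content from parsed elements
from pyspark.sql.functions import col, expr

# ai_parse_document returns a VARIANT (per the function reference), so nested
# fields are read with the `:` path syntax and cast explicitly instead of
# using struct-style dot access on the column.
text_extracted_df = (
    parsed_df
    .withColumn(
        "text_content",
        expr("""
            concat_ws(
                ' ',
                transform(
                    try_cast(parsed:document.elements AS ARRAY<VARIANT>),
                    element -> try_cast(element:content AS STRING)
                )
            )
        """)
    )
    .select("path", "text_content")
)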

# Use ai_extract to pull out specific entities
# We'll look for: organizations, dates, locations, and technical terms
extracted_df = (
    text_extracted_df
    .withColumn(
        "extracted_entities",
        expr("""
            ai_extract(
                substring(text_content, 1, 5000),
                array('organization', 'date', 'location', 'technical_term')
            )
        """)
    )
    .select(
        "path",
        col("extracted_entities.organization").alias("organizations"),
        col("extracted_entities.date").alias("dates"),
        col("extracted_entities.location").alias("locations"),
        col("extracted_entities.technical_term").alias("technical_terms")
    )
)

# Write results to a table in your personal schema
output_table = f"{CATALOG}.{WRITE_SCHEMA}.parsed_pdf_entities"
extracted_df.write.mode("overwrite").saveAsTable(output_table)

print(f"✅ Extracted entities saved to: {output_table}")
print(f"   Table contains structured information extracted from PDF")

# Display the results
spark.table(output_table).display()

### What We Just Did

In just a few lines of code, we:

1. ✅ **Read PDF documents** from Unity Catalog Volume
2. ✅ **Parsed document structure** using `ai_parse_document()`
3. ✅ **Extracted entities** (organizations, dates, locations) using `ai_extract()`
4. ✅ **Saved structured data** to a Delta table for analysis

**Real-world applications:**
- **Invoice processing**: Extract vendor names, amounts, dates
- **Contract analysis**: Pull out parties, terms, dates
- **Research paper mining**: Extract authors, institutions, findings
- **Resume parsing**: Get names, companies, dates, skills

**Scaling this up:**
- Remove the `.limit(1)` to process all PDFs
- Use Databricks Jobs to run on a schedule
- Process thousands of documents in parallel

## Summary

Congratulations! You've learned how to build AI agents and use AI Functions on Databricks.

### What You Accomplished

✅ **Understood Agent Bricks** - Low-code AI agent building platform  
✅ **Created a Knowledge Assistant** - Built through UI in minutes  
✅ **Indexed PDF documents** - Connected assistant to Unity Catalog Volume  
✅ **Tested with queries** - Asked natural language questions about documents  
✅ **Used `ai_query()`** - Called LLMs directly from SQL  
✅ **Used `ai_parse_document()`** - Extracted structure from PDFs  
✅ **Used `ai_extract()`** - Pulled entities into structured tables  

### Key Takeaways

**Agent Bricks Knowledge Assistant:**
1. **No-code RAG** - Build document Q&A systems through the UI
2. **Automatic optimization** - Databricks tunes retrieval and generation
3. **Source citations** - Every answer references specific documents
4. **Production-ready** - Deploy immediately as REST API
5. **Use case**: Technical documentation, research papers, policies

**AI Functions:**
1. **`ai_query()`** - Direct LLM access from SQL/Python
2. **`ai_parse_document()`** - Extract text, tables, layout from documents
3. **`ai_extract()`** - Pull structured entities from unstructured text
4. **Scalable** - Process thousands of documents in parallel
5. **Integrated** - Works seamlessly with Delta tables and Unity Catalog

### Architecture Pattern

```
PDFs in Volume → ai_parse_document() → Structured Text
                          ↓
                   ai_extract() → Entities
                          ↓
                    Delta Table → Analytics/ML
```

### Real-World Applications

**Knowledge Assistant:**
- Technical support documentation search
- Legal and compliance document Q&A
- Research paper analysis
- Internal wiki search
- Policy and procedure lookup

**AI Functions:**
- Invoice and receipt processing (extract amounts, vendors, dates)
- Contract analysis (extract parties, terms, obligations)
- Resume parsing (extract names, companies, skills)
- Scientific literature review (extract findings, authors)
- Medical records processing (extract diagnoses, medications)

---

### What's Next?

**Next Notebook (5 MLflow and MLOps)**: Learn to train custom models, track experiments, and manage the ML lifecycle with MLflow.

**Next Notebook (6 ML and AI Inference)**: Deploy models for batch, streaming, and real-time predictions.

---

## Try This Out (Optional Extensions)

Want more practice? Try these exercises:

### 1. Process All PDFs

Remove the `.limit(1)` and process all PDFs in the volume:

```python
# Process ALL PDFs instead of just one
parsed_df = (
    pdf_df  # Remove .limit(1)
    .withColumn("parsed", expr("ai_parse_document(content)"))
)
```

### 2. Extract Different Entities

Try extracting different types of information:

```python
# Extract technical specifications
expr("""
    ai_extract(
        text_content,
        array('model_number', 'specification', 'measurement', 'requirement')
    )
""")
```

### 3. Build a Custom Query

Use `ai_query()` to summarize the extracted information:

```sql
SELECT ai_query(
  'databricks-meta-llama-3-1-70b-instruct',
  CONCAT('Summarize these key points: ', organizations)
) AS summary
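-- Replace <your_schema> with the personal schema printed by the configuration cell
FROM dwx_express_insights_platform_dev_working.<your_schema>.parsed_pdf_entities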
```

### 4. Chain AI Functions Together

Combine multiple AI functions for complex processing:

```python
# Parse → Query pattern
result_df = (
    pdf_df
    .withColumn("parsed", expr("ai_parse_document(content)"))
    # parsed is a VARIANT: read the first element's text with `:` path syntax and cast it
    .withColumn("text", expr("CAST(parsed:document.elements[0].content AS STRING)"))
    .withColumn(
        "summary",
        expr("""ai_query(
            'databricks-meta-llama-3-1-70b-instruct',
            CONCAT('Summarize in 2 sentences: ', text)
        )""")
    )
)
```

### 5. Add More Documents

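Upload your own PDFs to the volume and query them. The shared `db_crash_course` schema is read-only in this course, so the copy below only works if you have write access to the target volume: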

```python
# Upload a new PDF to the volume
dbutils.fs.cp(
    "file:/path/to/your/document.pdf",
    f"{PDF_VOLUME_PATH}/your_document.pdf"
)

# Re-index in your Knowledge Assistant to include the new document
```

---

**Additional Resources:**
- [Agent Bricks Documentation](https://docs.databricks.com/aws/en/generative-ai/agent-bricks)
- [Knowledge Assistant Guide](https://docs.databricks.com/aws/en/generative-ai/agent-bricks/knowledge-assistant)
- [AI Functions Overview](https://docs.databricks.com/aws/en/large-language-models/ai-functions)
- [`ai_parse_document` Reference](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document)
- [Mosaic AI Agent Framework](https://docs.databricks.com/aws/en/generative-ai/agent-framework/)

**Great job!** You now have the skills to build AI agents and process documents with AI Functions. 🚀