# Graph Augmentation Agent - Interactive Exploration

This notebook demonstrates the Graph Augmentation Agent step-by-step,
running each analysis type in a separate cell so you can see exactly
what happens at each stage.

## What This Agent Does

The agent analyzes unstructured documents and suggests graph augmentations for Neo4j:

| Analysis | Description | Output |
|----------|-------------|--------|
| **Investment Themes** | Identifies emerging investment trends | Themes with market data |
| **New Entities** | Suggests new node types for the graph | Node definitions |
| **Missing Attributes** | Finds attributes not captured in schema | Property suggestions |
| **Implied Relationships** | Discovers hidden connections | Relationship types |

## Key Features

- **Native Structured Output** - Uses `ChatDatabricks.with_structured_output()` for validated Pydantic models
- **LangGraph Workflow** - StateGraph orchestration with memory persistence
- **Modular Architecture** - Clean separation of concerns in `core/` module
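
To make "validated" concrete, here is a minimal standard-library sketch of the kind of check a structured suggestion might pass through. The `NodeSuggestion` class, its fields, and the 0 to 1 confidence range are illustrative assumptions; the agent itself uses Pydantic models from the `core/` module, which this notebook does not show.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSuggestion:
    """Hypothetical shape of one suggested node (not the agent's real schema)."""

    label: str
    confidence: float  # assumed here to be a score in [0, 1]
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject malformed output early, as a schema validator would.
        if not self.label.strip():
            raise ValueError("label must be a non-empty string")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


suggestion = NodeSuggestion(label="InvestmentTheme", confidence=0.9)
print(suggestion.label, suggestion.confidence)
```

The point of validating at construction time is that downstream cells can rely on every field being present and well-formed instead of re-checking raw model output.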

---

## 1. Configuration

Configure the Multi-Agent Supervisor endpoint below. This is **required** - the Lab 6 agent must use the Lab 5 Multi-Agent Supervisor endpoint.

In [None]:
# =============================================================================
# CONFIGURATION - Edit these values before running
# =============================================================================

# Multi-Agent Supervisor endpoint (REQUIRED - created in Lab 5)
# This must be set to the endpoint name from Lab 5's Multi-Agent Supervisor
# Example: "agents_retail-investment-intelligence-system_agent"
MAS_ENDPOINT_NAME = "mas-3ae5a347-endpoint"

# Databricks Secrets scope for credentials
SECRETS_SCOPE = "neo4j-creds"

print("Configuration:")
print(f"  MAS Endpoint:    {MAS_ENDPOINT_NAME}")
print(f"  Secrets Scope:   {SECRETS_SCOPE}")

---

## 2. Environment Setup

Load credentials from Databricks Secrets and configure the environment.
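
The lookup-with-fallback pattern used in the setup cell can be sketched in isolation as follows. `get_secret_or_none` and `fake_getter` are hypothetical helpers written for this illustration; `fake_getter` only stands in for `dbutils.secrets.get` so the snippet runs outside Databricks, and the host value is a placeholder.

```python
from typing import Callable, Optional


def get_secret_or_none(
    getter: Callable[..., str], scope: str, key: str
) -> Optional[str]:
    """Return the secret value, or None if the lookup fails for any reason."""
    try:
        return getter(scope=scope, key=key)
    except Exception:
        return None


def fake_getter(scope: str, key: str) -> str:
    """Hypothetical stand-in for dbutils.secrets.get, for demonstration only."""
    store = {
        ("neo4j-creds", "databricks_host"): "https://example.cloud.databricks.com",
    }
    return store[(scope, key)]  # raises KeyError when the secret is missing


host = get_secret_or_none(fake_getter, "neo4j-creds", "databricks_host")
token = get_secret_or_none(fake_getter, "neo4j-creds", "databricks_token")
print(host, token)
```

In the notebook itself the same helper could be called as `get_secret_or_none(dbutils.secrets.get, SECRETS_SCOPE, "databricks_host")`, with `None` meaning "fall back to the notebook context".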

In [None]:
# Install required packages (run once per cluster).
# Uncomment if the packages are not already installed on the cluster.
# The version specifiers are quoted so a shell cannot treat ">" as output redirection.
# %pip install "databricks-langchain>=0.11.0" "langgraph>=1.0.5" "langchain-core>=1.2.0" "pydantic>=2.12.5"

In [None]:
import os
import time

print("=" * 60)
print("ENVIRONMENT SETUP")
print("=" * 60)
print(f"Timestamp: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("")

# Validate and set the Multi-Agent Supervisor endpoint (REQUIRED)
if not MAS_ENDPOINT_NAME:
    print("=" * 70)
    print("[FATAL] MAS_ENDPOINT_NAME is not configured!")
    print("=" * 70)
    print("  The Lab 6 Augmentation Agent requires the Lab 5 Multi-Agent")
    print("  Supervisor endpoint. Please set MAS_ENDPOINT_NAME in the")
    print("  configuration cell above.")
    print("")
    print("  Example:")
    print('    MAS_ENDPOINT_NAME = "agents_retail-investment-intelligence-system_agent"')
    print("=" * 70)
    raise SystemExit("MAS_ENDPOINT_NAME is required but not set")

os.environ["MAS_ENDPOINT_NAME"] = MAS_ENDPOINT_NAME
print(f"[OK] MAS Endpoint: {MAS_ENDPOINT_NAME}")
print("")

# Load Databricks credentials from secrets
print(f"[DEBUG] Retrieving secrets from scope '{SECRETS_SCOPE}'...")

try:
    DATABRICKS_HOST = dbutils.secrets.get(scope=SECRETS_SCOPE, key="databricks_host")
    print(f"  [OK] databricks_host: {DATABRICKS_HOST[:30]}...")
except Exception:
    # Fallback: derive the host from the workspace URL in the Spark conf
    try:
        DATABRICKS_HOST = spark.conf.get("spark.databricks.workspaceUrl")
        if not DATABRICKS_HOST.startswith("https://"):
            DATABRICKS_HOST = f"https://{DATABRICKS_HOST}"
        print(f"  [OK] databricks_host (from workspace): {DATABRICKS_HOST[:30]}...")
    except Exception:
        print("  [WARN] databricks_host: Not found in secrets or Spark conf, using notebook context")
        DATABRICKS_HOST = None

try:
    DATABRICKS_TOKEN = dbutils.secrets.get(scope=SECRETS_SCOPE, key="databricks_token")
    print(f"  [OK] databricks_token: {'*' * 10}... ({len(DATABRICKS_TOKEN)} chars)")
except Exception:
    print("  [WARN] databricks_token: Not found in secrets, using notebook context")
    DATABRICKS_TOKEN = None

# Set environment variables for the MAS client
if DATABRICKS_HOST:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
if DATABRICKS_TOKEN:
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

# Clear any conflicting auth methods
for var in ["DATABRICKS_CONFIG_PROFILE", "DATABRICKS_CLIENT_ID", 
            "DATABRICKS_CLIENT_SECRET", "DATABRICKS_ACCOUNT_ID"]:
    os.environ.pop(var, None)

print("")
print("=" * 60)
print("CONFIGURATION SUMMARY")
print("=" * 60)
host_display = f"{DATABRICKS_HOST[:40]}..." if DATABRICKS_HOST else "Using notebook context"
print(f"  Databricks Host: {host_display}")
print("  Mode:            Multi-Agent Supervisor (Lab 5)")
print(f"  MAS Endpoint:    {MAS_ENDPOINT_NAME}")
print(f"  Auth Method:     {'Token from secrets' if DATABRICKS_TOKEN else 'Notebook context'}")
print("=" * 60)

In [None]:
# Import the agent modules
from lab_6_augmentation_agent.core import (
    AnalysisType,
    ANALYSIS_CONFIGS,
    get_endpoint_info,
    run_single_analysis,
    format_analysis_result,
    display_suggestions,
    get_high_confidence_items,
)

# Display endpoint configuration
endpoint_info = get_endpoint_info()

print("Agent Modules Loaded")
print("=" * 60)
print(f"  Endpoint: {endpoint_info['endpoint']}")
print(f"  Mode:     {endpoint_info['mode']}")
print(f"  Method:   {endpoint_info['method']}")
print(f"  Docs:     {endpoint_info['docs']}")
print("=" * 60)

---

## 3. Available Analysis Types

Let's examine the four analysis types and their configurations.
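
For orientation, the enum-plus-registry pattern that the next cell iterates over might look roughly like this. Everything prefixed with `Sketch` is hypothetical: the member names mirror those used in this notebook, but the string values, the placeholder `dict` schema, and the query text are assumptions, not the contents of `core/`.

```python
from dataclasses import dataclass
from enum import Enum


class SketchAnalysisType(Enum):
    # Member names mirror the notebook; the string values are assumptions.
    INVESTMENT_THEMES = "investment_themes"
    NEW_ENTITIES = "new_entities"
    MISSING_ATTRIBUTES = "missing_attributes"
    IMPLIED_RELATIONSHIPS = "implied_relationships"


@dataclass(frozen=True)
class SketchAnalysisConfig:
    display_name: str
    schema: type  # the output model class; `dict` is only a placeholder here
    query: str


# One config per analysis type, keyed by the enum member.
SKETCH_CONFIGS = {
    t: SketchAnalysisConfig(
        display_name=t.name.replace("_", " ").title(),
        schema=dict,
        query=f"Placeholder prompt for {t.value}",
    )
    for t in SketchAnalysisType
}

for analysis_type, config in SKETCH_CONFIGS.items():
    print(analysis_type.value, "->", config.display_name, config.schema.__name__)
```

Keying the registry by enum member keeps the prompt, the output schema, and the display name for each analysis in one place, so adding a fifth analysis would mean adding one member and one config entry.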

In [None]:
# Show all available analysis types
print("Available Analysis Types:")
print("=" * 60)

for analysis_type in AnalysisType:
    config = ANALYSIS_CONFIGS[analysis_type]
    print(f"\n{config.display_name}")
    print(f"  Type: {analysis_type.value}")
    print(f"  Schema: {config.schema.__name__}")
    print(f"  Query: {config.query[:60]}...")

---

## 4. Run Individual Analyses

Now we'll run each analysis type in a separate cell. This allows you to:
- See the timing for each analysis
- Examine results individually
- Re-run specific analyses if needed

We'll use the `run_single_analysis()` utility function, which provides
a simplified interface for running one analysis at a time.
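
As a rough mental model, a single-analysis runner can be thought of as a timed call that captures either the structured data or the error. The sketch below is not the implementation of `run_single_analysis()`; `SketchResult` and `timed_run` are hypothetical names, chosen so that the fields match the attributes read later in this notebook (`success`, `duration_seconds`, `structured_data`).

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SketchResult:
    success: bool
    duration_seconds: float
    structured_data: Optional[Any] = None
    error: Optional[str] = None


def timed_run(analysis: Callable[[], Any]) -> SketchResult:
    """Run one analysis callable, recording its duration and outcome."""
    start = time.perf_counter()
    try:
        data = analysis()
        return SketchResult(True, time.perf_counter() - start, data)
    except Exception as exc:
        # A failure is reported in the result instead of stopping the notebook.
        return SketchResult(False, time.perf_counter() - start, error=str(exc))


def failing_analysis() -> None:
    raise RuntimeError("endpoint unavailable")


ok = timed_run(lambda: {"suggestions": []})
failed = timed_run(failing_analysis)
print(ok.success, failed.success, failed.error)
```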

In [None]:
# Store results for later comparison (re-running this cell clears them)
results = {}

### 4.1 Investment Themes Analysis

Identifies emerging investment trends from market research documents.

In [None]:
# Run Investment Themes analysis
print("=" * 60)
results['themes'] = run_single_analysis(AnalysisType.INVESTMENT_THEMES)
print("=" * 60)

In [None]:
# Display Investment Themes results in detail
display_suggestions(results['themes'], show_evidence=True)

### 4.2 New Entities Analysis

Suggests new node types that should be added to the Neo4j graph.

In [None]:
# Run New Entities analysis
print("=" * 60)
results['entities'] = run_single_analysis(AnalysisType.NEW_ENTITIES)
print("=" * 60)

In [None]:
# Display New Entities results with examples
display_suggestions(results['entities'], show_evidence=True, show_examples=True)

### 4.3 Missing Attributes Analysis

Finds attributes mentioned in profiles but missing from the database schema.

In [None]:
# Run Missing Attributes analysis
print("=" * 60)
results['attributes'] = run_single_analysis(AnalysisType.MISSING_ATTRIBUTES)
print("=" * 60)

In [None]:
# Display Missing Attributes results
display_suggestions(results['attributes'], show_evidence=True, show_examples=True)

### 4.4 Implied Relationships Analysis

Discovers relationships that are implied but not explicitly captured in the graph.

In [None]:
# Run Implied Relationships analysis
print("=" * 60)
results['relationships'] = run_single_analysis(AnalysisType.IMPLIED_RELATIONSHIPS)
print("=" * 60)

In [None]:
# Display Implied Relationships results
display_suggestions(results['relationships'], show_evidence=True, show_examples=True)

---

## 5. Results Summary

Let's summarize all results and identify high-confidence suggestions.
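
How `get_high_confidence_items()` decides what counts as high confidence is not shown in this notebook. One plausible approach, assuming each suggestion is a dict with a string `confidence` field such as `"high"`, `"medium"`, or `"low"`, is a simple filter. The function name and the sample items below are made up for illustration.

```python
from typing import Any


def filter_by_confidence(
    items: list[dict[str, Any]], level: str = "high"
) -> list[dict[str, Any]]:
    """Keep items whose 'confidence' field matches `level` (case-insensitive)."""
    return [
        item
        for item in items
        if str(item.get("confidence", "")).lower() == level.lower()
    ]


# Made-up sample data; real suggestions come from the analyses above.
sample_items = [
    {"name": "Renewable Energy", "confidence": "high"},
    {"name": "Retail Banking", "confidence": "medium"},
    {"label": "FundManager", "confidence": "HIGH"},
]

high = filter_by_confidence(sample_items)
print(len(high))
```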

In [None]:
# Summary statistics
print("Results Summary")
print("=" * 60)

total_duration = 0
total_high_conf = 0

for name, result in results.items():
    high_conf = get_high_confidence_items(result)
    total_high_conf += len(high_conf)
    total_duration += result.duration_seconds
    
    status = "SUCCESS" if result.success else "FAILED"
    print(f"\n{name.upper()}:")
    print(f"  Status: {status}")
    print(f"  Duration: {result.duration_seconds:.1f}s")
    print(f"  High confidence items: {len(high_conf)}")

print(f"\n{'=' * 60}")
print(f"Total duration: {total_duration:.1f}s")
print(f"Total high-confidence items: {total_high_conf}")

In [None]:
# Show all high-confidence suggestions
print("High-Confidence Suggestions")
print("=" * 60)

for name, result in results.items():
    high_conf = get_high_confidence_items(result)
    if high_conf:
        print(f"\n{name.upper()}:")
        for item in high_conf:
            # Fall back to 'Unknown' if no naming field is present or all are empty
            item_name = (
                item.get('name')
                or item.get('label')
                or item.get('property_name')
                or item.get('relationship_type')
                or 'Unknown'
            )
            print(f"  - {item_name}")

---

## 6. Using the Full Agent API

For production use, the `GraphAugmentationAgent` class provides
LangGraph workflow orchestration with memory persistence.
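
The role of the `thread_id` argument can be illustrated without LangGraph. The class below is a plain-Python stand-in, not LangGraph's checkpointer API: it only shows the idea that state is stored per thread, so a later call with the same `thread_id` sees what earlier calls recorded. `InMemoryThreadStore` and its methods are hypothetical.

```python
from collections import defaultdict


class InMemoryThreadStore:
    """Conceptual stand-in for a checkpointer: one state dict per thread_id."""

    def __init__(self) -> None:
        # Each new thread starts with an empty list of completed analyses.
        self._states = defaultdict(lambda: {"completed_analyses": []})

    def record(self, thread_id: str, analysis_name: str) -> dict:
        state = self._states[thread_id]
        state["completed_analyses"].append(analysis_name)
        return state

    def get(self, thread_id: str) -> dict:
        return self._states[thread_id]


store = InMemoryThreadStore()
store.record("notebook-demo", "new_entities")
store.record("notebook-demo", "investment_themes")
print(store.get("notebook-demo")["completed_analyses"])
print(store.get("another-thread")["completed_analyses"])
```

Like any in-memory store, this state lives only as long as the Python process, so it is lost when the notebook is detached or the cluster restarts.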

In [None]:
from lab_6_augmentation_agent.core import GraphAugmentationAgent

# Create agent with memory persistence
agent = GraphAugmentationAgent()
print("Agent created with LangGraph workflow")
print("Memory persistence enabled via MemorySaver")

In [None]:
# Run a single analysis through the agent
# This uses the full LangGraph workflow
result = agent.run_single_analysis(
    AnalysisType.NEW_ENTITIES,
    thread_id="notebook-demo"
)

print("Analysis complete")
print(f"Completed analyses: {result.get('completed_analyses', [])}")

In [None]:
# Access structured response
response = agent.get_structured_response("notebook-demo")
if response:
    print(f"Total suggestions: {response.total_suggestions}")
    print(f"High confidence: {response.high_confidence_count}")

# Get specific suggestion types
nodes = agent.get_suggested_nodes("notebook-demo")
print(f"\nSuggested nodes: {len(nodes)}")
for node in nodes:
    print(f"  - :{node.label}")

---

## 7. Export Results

Export results to JSON for further processing or Neo4j import.
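
The export cell relies on `json.dumps(..., default=str)`. The short example below shows what that argument does: any value the encoder cannot serialize natively is converted with `str()`. The `Status` enum and the payload are made up for the demonstration.

```python
import json
from datetime import date
from enum import Enum


class Status(Enum):
    DONE = "done"


payload = {"run_date": date(2025, 1, 15), "status": Status.DONE, "count": 3}

# Without default=..., json.dumps raises TypeError on the date and Enum values.
text = json.dumps(payload, indent=2, default=str)
restored = json.loads(text)
print(restored)
```

The conversion is one-way: after a round trip the date and the enum come back as plain strings, so a consumer that needs the original types has to parse them again.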

In [None]:
import json

# Export individual analysis results
export_data = {
    'endpoint': endpoint_info,
    'analyses': {}
}

for name, result in results.items():
    export_data['analyses'][name] = {
        'success': result.success,
        'duration_seconds': result.duration_seconds,
        'data': result.structured_data,
        'high_confidence_count': len(get_high_confidence_items(result))
    }

# Preview the export
print(json.dumps(export_data, indent=2, default=str)[:2000])
print("\n... (truncated)")

In [None]:
# Save to file (uncomment to save)
# with open('notebook_results.json', 'w') as f:
#     json.dump(export_data, f, indent=2, default=str)
# print("Results saved to notebook_results.json")

---

## Next Steps

After identifying augmentation opportunities:

1. **Review suggestions** - Examine the high-confidence items
2. **Update Neo4j schema** - Add new node labels and relationship types
3. **Extract new entities** - Parse documents to create new nodes
4. **Write back to Neo4j** - Use the structured output for graph updates
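
As a sketch of the write-back step, a reviewed suggestion could be turned into a parameterized Cypher `MERGE` statement. The helper `build_merge_statement`, the example label, and the property names are assumptions for illustration; the snippet only builds the query text and parameters and does not connect to Neo4j. Because the label and key property are interpolated into the query text here, the sketch validates them against a conservative identifier pattern first.

```python
import re

# Conservative pattern for identifiers that are interpolated into the query.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_merge_statement(
    label: str, key_property: str, properties: dict
) -> tuple[str, dict]:
    """Return (query, params) that upsert one node by its key property."""
    if not _IDENTIFIER.match(label):
        raise ValueError(f"Unsafe node label: {label!r}")
    if not _IDENTIFIER.match(key_property):
        raise ValueError(f"Unsafe property name: {key_property!r}")
    if key_property not in properties:
        raise ValueError(f"Missing key property: {key_property!r}")

    # Property values travel as parameters, never as query text.
    query = f"MERGE (n:{label} {{{key_property}: $key}}) SET n += $props"
    params = {"key": properties[key_property], "props": properties}
    return query, params


query, params = build_merge_statement(
    "InvestmentTheme", "name", {"name": "Renewable Energy", "source": "agent"}
)
print(query)
print(params)
```

The returned pair could then be passed to whichever Neo4j client the project uses, after a human has reviewed the suggestion.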

### Documentation

- [ChatDatabricks API](https://api-docs.databricks.com/python/databricks-ai-bridge/latest/databricks_langchain.html)
- [Databricks Structured Outputs](https://docs.databricks.com/aws/en/machine-learning/model-serving/structured-outputs)
- [LangGraph StateGraph](https://langchain-ai.github.io/langgraph/concepts/low_level/)
- [LangGraph Checkpointing](https://langchain-ai.github.io/langgraph/concepts/persistence/)