# Graph Augmentation Agent - DSPy Implementation

This notebook demonstrates the DSPy-based Graph Augmentation Agent that analyzes
unstructured documents and suggests graph schema improvements for Neo4j.

## What This Agent Does

The agent performs four types of analysis:
1. **Investment Themes** - Identifies emerging investment themes from market research
2. **New Entities** - Suggests new node types to add to the graph
3. **Missing Attributes** - Identifies attributes missing from existing nodes
4. **Implied Relationships** - Discovers relationships that the documents imply but the graph does not capture

Each analysis runs as a DSPy `ChainOfThought` module, which prompts the model to reason step by step before producing its answer.

## Prerequisites

- Multi-Agent Supervisor (MAS) from Lab 5 deployed as a Databricks serving endpoint
- DSPy and MLflow packages installed on the cluster
- Authentication:
  - On Databricks: Automatic (runtime provides credentials)
  - Locally: DATABRICKS_HOST and DATABRICKS_TOKEN in .env file
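
For local runs, a quick preflight check can confirm that both variables are visible to the Python process before any Databricks call is made. The sketch below uses only the standard library. The helper name `missing_databricks_env` is hypothetical and not part of the lab code, and it assumes the `.env` file has already been loaded into the process environment.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_env(environ=None):
    """Return the names of required variables that are unset or blank.

    Hypothetical helper for local runs; not part of the lab code.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_databricks_env()
if missing:
    print(f"Missing for local authentication: {', '.join(missing)}")
else:
    print("Local Databricks credentials found in the environment")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise without touching real credentials.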

---

## Configuration

Update the MAS endpoint name before running the notebook.

In [None]:
# =============================================================================
# CONFIGURATION - Update these values before running
# =============================================================================

# Multi-Agent Supervisor endpoint name
# This MUST be the MAS endpoint created in Lab 5 (lab_5_multi_agent)
# The agent relies on the MAS to route queries to Genie + Knowledge Agent
# Get your endpoint name from the MAS UI by clicking the cloud icon
MAS_ENDPOINT_NAME = "mas-3ae5a347-endpoint"

# Model parameters
TEMPERATURE = 0.1      # Lower = more deterministic responses
MAX_TOKENS = 4000      # Maximum tokens in response

print("Configuration:")
print(f"  MAS Endpoint:    {MAS_ENDPOINT_NAME}")
print(f"  Temperature:     {TEMPERATURE}")
print(f"  Max Tokens:      {MAX_TOKENS}")

---

## Verify Databricks Connection

Authentication is handled automatically by WorkspaceClient:
- On Databricks: Uses runtime's built-in authentication
- Locally: Uses the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (for example, loaded from a .env file)

In [None]:
# =============================================================================
# AUTHENTICATION - Verify Databricks connection
# =============================================================================
from databricks.sdk import WorkspaceClient

print("=" * 60)
print("AUTHENTICATION - Verifying Databricks connection")
print("=" * 60)

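try:
    client = WorkspaceClient()
    # Constructing the client only resolves credentials. Make one lightweight
    # API call so that a bad host or token fails here rather than later.
    current_user = client.current_user.me()
    print(f"  [OK] Connected to: {client.config.host}")
    print(f"  [OK] Authenticated as: {current_user.user_name}")
    print("=" * 60)
except Exception as e:
    print(f"  [FAIL] Connection failed: {e}")
    print("")
    print("  On Databricks: Authentication should be automatic")
    print("  Locally: Set DATABRICKS_HOST and DATABRICKS_TOKEN in .env")
    print("=" * 60)
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"Failed to connect to Databricks: {e}") from e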

---

## Setup DSPy

Configure DSPy with the Databricks Multi-Agent Supervisor endpoint.

In [None]:
# Imports
from lab_6_augmentation_agent.dspy_modules.config import (
    configure_dspy,
    setup_mlflow_tracing,
)
from lab_6_augmentation_agent.dspy_modules.analyzers import (
    GraphAugmentationAnalyzer,
    InvestmentThemesAnalyzer,
    NewEntitiesAnalyzer,
    MissingAttributesAnalyzer,
    ImpliedRelationshipsAnalyzer,
)
from lab_6_augmentation_agent.utils import (
    SAMPLE_DOCUMENT_CONTEXT,
    print_investment_themes,
    print_new_entities,
    print_missing_attributes,
    print_implied_relationships,
    print_response_summary,
)

print("[OK] Imports loaded")

In [None]:
# Configure DSPy with the MAS endpoint
# (MAS endpoints use Responses API format and require ChatAdapter)
lm = configure_dspy(
    model_name=MAS_ENDPOINT_NAME,
    temperature=TEMPERATURE,
    max_tokens=MAX_TOKENS,
)

# Enable MLflow tracing for observability
setup_mlflow_tracing()

---

## Document Context

The cell below prints the sample document that the analyzers will process. In
production, this context would come from your Multi-Agent Supervisor or other
data sources.

In [None]:
# Display the sample document context
print("=" * 60)
print("DOCUMENT CONTEXT FOR ANALYSIS")
print("=" * 60)
print(SAMPLE_DOCUMENT_CONTEXT)

---

## Initialize Analyzers

Each analyzer is a DSPy module that wraps a `ChainOfThought` predictor with
a typed signature. This encourages step-by-step reasoning before producing
structured output.
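
The cells that follow read `success`, `data`, `error`, and `reasoning` from each analyzer's return value. The actual result type lives in `lab_6_augmentation_agent` and is not shown in this notebook. As a rough sketch of the pattern only, a wrapper with those four fields could look like the following. The names `AnalysisResult`, `run_safely`, and `fake_predictor` are hypothetical, and a plain function stands in for the DSPy predictor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class AnalysisResult:
    """Hypothetical stand-in for the result object used in this notebook."""
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None
    reasoning: Optional[str] = None


def run_safely(predict: Callable[[str], dict], document: str) -> AnalysisResult:
    """Call a predictor and convert any exception into a failed result."""
    try:
        output = predict(document)
    except Exception as exc:  # report the failure as data instead of raising
        return AnalysisResult(success=False, error=f"{type(exc).__name__}: {exc}")
    return AnalysisResult(
        success=True,
        data=output.get("analysis"),
        reasoning=output.get("reasoning"),
    )


def fake_predictor(document: str) -> dict:
    # Placeholder for a ChainOfThought call; returns canned output.
    if not document:
        raise ValueError("empty document")
    return {"analysis": {"themes": []}, "reasoning": "step-by-step notes"}


ok = run_safely(fake_predictor, "sample text")
failed = run_safely(fake_predictor, "")
print(ok.success, failed.error)
```

Returning a result object rather than raising lets a notebook cell branch on `success` and print `error`, which matches how the analysis cells are written.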

In [None]:
# Initialize individual analyzers for step-by-step demonstration
investment_themes_analyzer = InvestmentThemesAnalyzer()
new_entities_analyzer = NewEntitiesAnalyzer()
missing_attributes_analyzer = MissingAttributesAnalyzer()
implied_relationships_analyzer = ImpliedRelationshipsAnalyzer()

print("[OK] All analyzers initialized")

---

## Analysis 1: Investment Themes

This analysis identifies emerging investment themes from market research documents.
It extracts:
- Theme names and descriptions
- Market size and growth projections
- Key sectors and companies
- Investment recommendations

In [None]:
# Run Investment Themes Analysis
print("Running Investment Themes analysis...")
print("(This may take 10-30 seconds)\n")

investment_result = investment_themes_analyzer(SAMPLE_DOCUMENT_CONTEXT)

if investment_result.success and investment_result.data:
    print_investment_themes(investment_result.data)
else:
    print(f"[FAILED] {investment_result.error}")

In [None]:
# Show the reasoning process (if available)
if investment_result.reasoning:
    print("\n" + "=" * 60)
    print("CHAIN OF THOUGHT REASONING")
    print("=" * 60)
    print(investment_result.reasoning)

---

## Analysis 2: New Entities

This analysis suggests new node types (entities) that should be added to the
graph based on information in the documents. Each suggestion includes:
- Node label and description
- Key property for uniqueness
- Property definitions with types
- Confidence level and evidence
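
To illustrate how a suggestion of this shape might be turned into schema, the sketch below builds a Neo4j uniqueness constraint statement from a node label and key property. The `SuggestedNode` dataclass and its field names are assumptions made for this example, so the real Pydantic model in `lab_6_augmentation_agent` may differ. The generated Cypher uses the Neo4j 5 `CREATE CONSTRAINT ... IF NOT EXISTS ... REQUIRE ... IS UNIQUE` form.

```python
from dataclasses import dataclass, field


@dataclass
class SuggestedNode:
    """Hypothetical shape of a new-entity suggestion (field names assumed)."""
    label: str
    key_property: str
    properties: dict = field(default_factory=dict)  # property name -> type name
    confidence: str = "medium"


def uniqueness_constraint(node: SuggestedNode) -> str:
    """Build a Cypher statement enforcing uniqueness on the key property."""
    for name in (node.label, node.key_property):
        # Labels and property names are interpolated into the query text,
        # so reject anything that is not a plain identifier.
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    constraint_name = f"{node.label.lower()}_{node.key_property.lower()}_unique"
    return (
        f"CREATE CONSTRAINT {constraint_name} IF NOT EXISTS "
        f"FOR (n:{node.label}) REQUIRE n.{node.key_property} IS UNIQUE"
    )


# Invented example values, for illustration only.
example_node = SuggestedNode(
    label="InvestmentTheme",
    key_property="theme_id",
    properties={"theme_id": "string", "name": "string"},
)
print(uniqueness_constraint(example_node))
```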

In [None]:
# Run New Entities Analysis
print("Running New Entities analysis...")
print("(This may take 10-30 seconds)\n")

entities_result = new_entities_analyzer(SAMPLE_DOCUMENT_CONTEXT)

if entities_result.success and entities_result.data:
    print_new_entities(entities_result.data)
else:
    print(f"[FAILED] {entities_result.error}")

In [None]:
# Show the reasoning process (if available)
if entities_result.reasoning:
    print("\n" + "=" * 60)
    print("CHAIN OF THOUGHT REASONING")
    print("=" * 60)
    print(entities_result.reasoning)

---

## Analysis 3: Missing Attributes

This analysis identifies attributes (properties) that are missing from existing
node types but are present in the documents. Each suggestion includes:
- Target node label
- Property name and type
- Example values
- Rationale for adding
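
Before applying attribute suggestions, it can help to check them against the properties the graph already defines, since a model may occasionally propose something that exists. The sketch below is one possible post-processing step, not part of the agent. The dictionary keys (`node_label`, `property_name`, `property_type`) and the sample schema are assumptions made for illustration.

```python
def drop_existing_properties(existing_schema, suggestions):
    """Return only suggestions whose property is not yet defined on the label."""
    return [
        s for s in suggestions
        if s["property_name"] not in existing_schema.get(s["node_label"], set())
    ]


# Invented schema and suggestions, for illustration only.
existing_schema = {"Customer": {"customer_id", "name"}}
suggestions = [
    {"node_label": "Customer", "property_name": "name", "property_type": "string"},
    {"node_label": "Customer", "property_name": "risk_tolerance", "property_type": "string"},
    {"node_label": "Company", "property_name": "sector", "property_type": "string"},
]
new_only = drop_existing_properties(existing_schema, suggestions)
print([s["property_name"] for s in new_only])
```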

In [None]:
# Run Missing Attributes Analysis
print("Running Missing Attributes analysis...")
print("(This may take 10-30 seconds)\n")

attributes_result = missing_attributes_analyzer(SAMPLE_DOCUMENT_CONTEXT)

if attributes_result.success and attributes_result.data:
    print_missing_attributes(attributes_result.data)
else:
    print(f"[FAILED] {attributes_result.error}")

In [None]:
# Show the reasoning process (if available)
if attributes_result.reasoning:
    print("\n" + "=" * 60)
    print("CHAIN OF THOUGHT REASONING")
    print("=" * 60)
    print(attributes_result.reasoning)

---

## Analysis 4: Implied Relationships

This analysis discovers relationships that are implied in the documents but
not currently captured in the graph schema. Each suggestion includes:
- Relationship type
- Source and target node labels
- Relationship properties
- Evidence and rationale
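
As a sketch of how one relationship suggestion could later be written to Neo4j, the function below builds a parameterized Cypher `MERGE` statement. The `SuggestedRelationship` dataclass, its field names, and the assumption that both nodes are matched on a property called `id` are hypothetical choices for this example. Node keys and relationship properties are left as query parameters (`$source_id`, `$target_id`, `$props`) to be supplied at execution time.

```python
from dataclasses import dataclass


@dataclass
class SuggestedRelationship:
    """Hypothetical shape of a relationship suggestion (field names assumed)."""
    rel_type: str
    source_label: str
    target_label: str


def merge_relationship_query(rel: SuggestedRelationship, key: str = "id") -> str:
    """Build a parameterized Cypher MERGE for one suggested relationship."""
    for name in (rel.rel_type, rel.source_label, rel.target_label, key):
        # Identifiers are interpolated into the query text, so reject
        # anything that is not a plain identifier.
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MATCH (a:{rel.source_label} {{{key}: $source_id}}), "
        f"(b:{rel.target_label} {{{key}: $target_id}}) "
        f"MERGE (a)-[r:{rel.rel_type}]->(b) "
        "SET r += $props"
    )


# Invented example values, for illustration only.
example_rel = SuggestedRelationship(
    rel_type="INTERESTED_IN",
    source_label="Customer",
    target_label="InvestmentTheme",
)
print(merge_relationship_query(example_rel))
```

Using `MERGE` rather than `CREATE` keeps the statement idempotent, so re-running a writeback does not duplicate the relationship.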

In [None]:
# Run Implied Relationships Analysis
print("Running Implied Relationships analysis...")
print("(This may take 10-30 seconds)\n")

relationships_result = implied_relationships_analyzer(SAMPLE_DOCUMENT_CONTEXT)

if relationships_result.success and relationships_result.data:
    print_implied_relationships(relationships_result.data)
else:
    print(f"[FAILED] {relationships_result.error}")

In [None]:
# Show the reasoning process (if available)
if relationships_result.reasoning:
    print("\n" + "=" * 60)
    print("CHAIN OF THOUGHT REASONING")
    print("=" * 60)
    print(relationships_result.reasoning)

---

## Running All Analyses Together

The `GraphAugmentationAnalyzer` orchestrates all four analyses and consolidates
the results into a single response with statistics.
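
The orchestration pattern can be sketched without DSPy: call each analyzer, collect its suggestions, and compute counts over the combined list. How `GraphAugmentationAnalyzer` does this internally is not shown in this notebook, so the function below is only an approximation. The name `run_all`, the stub analyzers, and the choice to continue after a failed analysis are assumptions. The statistic names `total_suggestions` and `high_confidence_count` mirror the fields read from the response later in this notebook.

```python
from typing import Callable, Dict, List


def run_all(analyzers: Dict[str, Callable[[str], List[dict]]], document: str) -> dict:
    """Run every analyzer on the document and consolidate the suggestions."""
    per_analysis = {}
    for name, analyze in analyzers.items():
        try:
            per_analysis[name] = {"success": True, "suggestions": analyze(document)}
        except Exception as exc:  # in this sketch, one failure does not stop the rest
            per_analysis[name] = {"success": False, "suggestions": [], "error": str(exc)}

    all_suggestions = [s for r in per_analysis.values() for s in r["suggestions"]]
    return {
        "per_analysis": per_analysis,
        "total_suggestions": len(all_suggestions),
        "high_confidence_count": sum(
            s.get("confidence") == "high" for s in all_suggestions
        ),
    }


def broken_analyzer(document: str) -> List[dict]:
    raise RuntimeError("endpoint timeout")


# Stub analyzers returning canned suggestions in place of LLM calls.
stubs = {
    "new_entities": lambda doc: [{"label": "Theme", "confidence": "high"}],
    "missing_attributes": lambda doc: [{"property": "risk_score", "confidence": "low"}],
    "implied_relationships": broken_analyzer,
}
summary = run_all(stubs, "sample text")
print(summary["total_suggestions"], summary["high_confidence_count"])
```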

In [None]:
# Run all analyses with the composite analyzer
print("Running ALL analyses with GraphAugmentationAnalyzer...")
print("(This may take 1-2 minutes)\n")

composite_analyzer = GraphAugmentationAnalyzer()
full_response = composite_analyzer(SAMPLE_DOCUMENT_CONTEXT)

# Print consolidated summary
print_response_summary(full_response)

---

## Accessing Raw Data

All results are available as typed Pydantic models for further processing.
You can serialize them to JSON or use them directly in your application.
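
As a small illustration of the JSON step, the sketch below serializes a plain dictionary shaped like a dumped model. The `default=str` argument tells `json.dumps` to fall back to `str()` for values it cannot encode natively, such as `datetime` objects. The dictionary contents, including the `generated_at` field, are invented for the example.

```python
import json
from datetime import datetime, timezone

# Invented data standing in for the output of model_dump().
response_like = {
    "total_suggestions": 2,
    "high_confidence_count": 1,
    "generated_at": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
}

# Without default=str, json.dumps would raise TypeError on the datetime value.
json_text = json.dumps(response_like, indent=2, default=str)
round_tripped = json.loads(json_text)
print(round_tripped["generated_at"])
```

Note that values converted this way come back as strings after `json.loads`, so any consumer that needs real timestamps has to parse them again.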

In [None]:
# Export results as JSON for Neo4j integration
import json

# Get the Pydantic model as a dict
response_dict = full_response.model_dump()

# Show summary statistics
print("Response Statistics:")
print(f"  - Total Suggestions: {response_dict['total_suggestions']}")
print(f"  - High Confidence: {response_dict['high_confidence_count']}")
print(f"  - Suggested Nodes: {len(response_dict['all_suggested_nodes'])}")
print(f"  - Suggested Relationships: {len(response_dict['all_suggested_relationships'])}")
print(f"  - Suggested Attributes: {len(response_dict['all_suggested_attributes'])}")

# Example: serialize to JSON for storage
# (default=str converts values json cannot encode natively, such as datetimes)
# json_output = json.dumps(response_dict, indent=2, default=str)
# with open("augmentation_results.json", "w", encoding="utf-8") as f:
#     f.write(json_output)

---

## Next Steps

The suggestions from this agent can be used to:

1. **Update Neo4j Schema** - Add new node labels and relationship types
2. **Enrich Existing Nodes** - Add missing properties to existing entities
3. **Create New Relationships** - Connect entities based on discovered patterns
4. **Inform Investment Analysis** - Use identified themes for portfolio recommendations

See `PROPOSAL_structured_output_neo4j_writeback.md` for details on implementing
Neo4j writeback functionality.