# Extraction Agent Development Notebook

This notebook facilitates building and refining extraction agents with emphasis on **resolving import issues**.

## Purpose
- Test and verify all imports before development
- Build and iterate on extraction patterns
- Evaluate extraction performance
- Debug common extraction issues

## Key Features
- Environment setup with error handling
- Import verification and troubleshooting
- Iterative refinement workflow
- Performance metrics and debugging tools

## Section 1: Setup Environment and Verify Imports

**Goal:** Load environment variables and verify all imports work correctly.

This section helps catch import errors early before any development work.

In [2]:
# 1.1 Load Environment Variables
import sys
import os
from pathlib import Path

print(f"✅ Python environment ready")
print(f"   Python version: {sys.version}")
print(f"   Working directory: {Path.cwd()}")

# Load .env file
from dotenv import load_dotenv
load_dotenv()

print(f"\n✅ Environment variables loaded")
print(f"   OPENAI_API_KEY: {'✓ Set' if os.getenv('OPENAI_API_KEY') else '✗ Missing'}")
print(f"   DISCORD_TOKEN: {'✓ Set' if os.getenv('DISCORD_BOT_TOKEN') else '✗ Missing'}")
print(f"   NOTION_TOKEN: {'✓ Set' if os.getenv('NOTION_TOKEN') else '✗ Missing'}")

# Verify editable install worked
try:
    import mcp_ce
    print(f"\n✅ Package 'mcp_ce' installed correctly")
    print(f"   Location: {mcp_ce.__file__}")
except ImportError:
    print("\n⚠️ Package not found. Run: uv pip install -e .")

✅ Python environment ready
   Python version: 3.13.3 (main, Apr  9 2025, 04:04:49) [MSC v.1943 64 bit (AMD64)]
   Working directory: d:\GitHub\mcp-discord

✅ Environment variables loaded
   OPENAI_API_KEY: ✓ Set
   DISCORD_TOKEN: ✓ Set
   NOTION_TOKEN: ✓ Set

✅ Package 'mcp_ce' installed correctly
   Location: D:\GitHub\mcp-discord\src\mcp_ce\__init__.py

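The environment check in cell 1.1 can be wrapped in a small helper, which is handy when more variables are added later. This is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical name, and the variable list mirrors the names passed to `os.getenv` in cell 1.1.

```python
import os
from typing import List


def missing_env_vars(required: List[str]) -> List[str]:
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not os.getenv(name)]


# Names as read by os.getenv in cell 1.1.
REQUIRED_VARS = ["OPENAI_API_KEY", "DISCORD_BOT_TOKEN", "NOTION_TOKEN"]

missing = missing_env_vars(REQUIRED_VARS)
print("All required variables set" if not missing else f"Missing: {', '.join(missing)}")
```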


In [3]:
# 1.2 Test Standard Library Imports
import asyncio
import json
import re
from typing import Dict, List, Optional, Any
from dataclasses import dataclass, field
from datetime import datetime

print("✅ Standard library imports successful")
print(f"   Python version: {sys.version}")
print(f"   asyncio: {asyncio.__version__ if hasattr(asyncio, '__version__') else 'available'}")

✅ Standard library imports successful
   Python version: 3.13.3 (main, Apr  9 2025, 04:04:49) [MSC v.1943 64 bit (AMD64)]
   asyncio: available


In [4]:
# 1.3 Test Agent Framework Imports
try:
    import logfire
    from pydantic import BaseModel, Field
    from pydantic_ai import Agent, RunContext
    
    print("✅ Pydantic-AI imports successful")
    print(f"   logfire: {logfire.__version__ if hasattr(logfire, '__version__') else 'available'}")
    print(f"   pydantic: {BaseModel.__module__}")
except ImportError as e:
    print(f"❌ Agent framework import error: {e}")
    print("   Fix: Run 'uv pip install pydantic-ai logfire'")

✅ Pydantic-AI imports successful
   logfire: 4.14.2
   pydantic: pydantic.main


In [5]:
# 1.4 Test Project-Specific Imports
try:
    from mcp_ce.tools.crawl4ai.crawl_website import crawl_website
    from mcp_ce.models.events import EventDetails
    
    print("✅ Project-specific imports successful")
    print("   Available tools: crawl_website")
    print("   Available models: EventDetails")
    print("\n💡 Note: Full agent graph requires additional setup")
    print("   This notebook demonstrates basic web scraping + manual extraction")
except ImportError as e:
    print(f"❌ Project import error: {e}")
    print("   Fix: Run 'uv pip install -e .' in terminal")
    print("   Fix: Verify mcp_ce package structure exists")
    import traceback
    traceback.print_exc()

✅ Project-specific imports successful
   Available tools: crawl_website
   Available models: EventDetails

💡 Note: Full agent graph requires additional setup
   This notebook demonstrates basic web scraping + manual extraction


D:\GitHub\mcp-discord\src\mcp_ce\models\article.py:8: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  class Article(BaseModel):


### Import Troubleshooting Tips

**✅ Editable install completed** - The project is installed with `uv pip install -e .` so imports should work reliably.

If you encounter import errors:

1. **ModuleNotFoundError: No module named 'mcp_ce'**
   - Solution: Re-run `uv pip install -e .` in terminal
   - Verify: Cell 1.1 should show package location

2. **ImportError: cannot import name**: Missing module or circular dependency
   - Solution: Check file exists in expected location
   - Use: `import importlib.util; print(importlib.util.find_spec('module_name'))`

3. **AttributeError after import**: Module exists but missing expected attributes
   - Solution: Check module's `__init__.py` exports
   - Verify: `dir(module)` to see available attributes

4. **Pydantic-AI/Logfire not found**: Framework dependencies missing
   - Solution: Run `uv pip install pydantic-ai logfire crawl4ai`

**Quick Import Check:** Run all cells in Section 1 sequentially. All should show ✅ before proceeding.
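
The `importlib.util.find_spec` check from tip 2 can be applied to several modules at once. The sketch below is one way to do that with the standard library; `check_imports` is a hypothetical helper, and the module list is simply the set of packages this notebook imports.

```python
import importlib.util
from typing import Dict, List, Optional


def check_imports(module_names: List[str]) -> Dict[str, Optional[str]]:
    """Map each module name to where it was found, or None if it cannot be located."""
    report: Dict[str, Optional[str]] = {}
    for name in module_names:
        try:
            # For dotted names, find_spec imports the parent package first.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing or broken.
            spec = None
        report[name] = None if spec is None else (spec.origin or "namespace package")
    return report


for name, location in check_imports(["json", "pydantic_ai", "logfire", "mcp_ce"]).items():
    print(f"{'OK     ' if location else 'MISSING'} {name}: {location}")
```

A `MISSING` line for `mcp_ce` points at the editable install (tip 1); a `MISSING` line for `pydantic_ai` or `logfire` points at tip 4.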

## Section 2: Define Extraction Configuration

**Goal:** Configure the extraction agent with model selection, extraction schema, and iteration limits.

This section sets up:
- Dynamic model selection (using .env variables like MCP tools)
- EventDetails schema for structured extraction
- Validation and iteration parameters

In [6]:
# 2.1 Define Extraction Configuration
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    """Configuration for event extraction workflow."""
    
    # Model configuration (from .env or defaults)
    scraper_model: str = os.getenv("SCRAPER_AGENT_MODEL", "openai:gpt-4o")
    extraction_model: str = os.getenv("EXTRACTION_AGENT_MODEL", "openai:gpt-4o")
    validation_model: str = os.getenv("VALIDATION_AGENT_MODEL", "openai:gpt-4o")
    workflow_model: str = os.getenv("WORKFLOW_AGENT_MODEL", "openai:gpt-4o")
    
    # Iteration limits
    max_validation_cycles: int = 2
    max_retries_per_agent: int = 3
    
    # Logfire configuration
    enable_logfire: bool = True
    logfire_project: str = "mcp-discord-extraction"

# Create configuration instance
config = ExtractionConfig()

print("✅ Configuration loaded:")
print(f"   Scraper model: {config.scraper_model}")
print(f"   Extraction model: {config.extraction_model}")
print(f"   Validation model: {config.validation_model}")
print(f"   Workflow model: {config.workflow_model}")
print(f"   Max validation cycles: {config.max_validation_cycles}")
print(f"   Logfire enabled: {config.enable_logfire}")

✅ Configuration loaded:
   Scraper model: openai:gpt-4o
   Extraction model: openai:gpt-4o
   Validation model: openai:gpt-4o
   Workflow model: openai:gpt-4o
   Max validation cycles: 2
   Logfire enabled: True
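
One detail of the dataclass above: `os.getenv(...)` used directly as a field default is evaluated once, when the class body runs, so later changes to the environment (for example a second `load_dotenv()` call) do not affect new `ExtractionConfig()` instances. If that matters, a variant using `field(default_factory=...)` reads the environment each time an instance is created. A minimal sketch, assuming the same variable names; `LazyExtractionConfig` and `_env` are hypothetical names, not part of `mcp_ce`.

```python
import os
from dataclasses import dataclass, field
from typing import Callable


def _env(name: str, default: str) -> Callable[[], str]:
    """Build a default_factory that reads `name` from the environment when called."""
    return lambda: os.getenv(name, default)


@dataclass
class LazyExtractionConfig:
    """Variant of ExtractionConfig whose model names are resolved per instance."""

    extraction_model: str = field(default_factory=_env("EXTRACTION_AGENT_MODEL", "openai:gpt-4o"))
    validation_model: str = field(default_factory=_env("VALIDATION_AGENT_MODEL", "openai:gpt-4o"))
    max_validation_cycles: int = 2


lazy_config = LazyExtractionConfig()
print(f"Extraction model: {lazy_config.extraction_model}")
```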


In [7]:
# 2.2 Initialize Logfire (if enabled)
if config.enable_logfire:
    try:
        logfire.configure(project_name=config.logfire_project)
        logfire.instrument_pydantic_ai()
        print("✅ Logfire configured successfully")
        print(f"   Project: {config.logfire_project}")
        print(f"   Instrumentation: pydantic-ai")
    except Exception as e:
        print(f"⚠️ Logfire configuration failed: {e}")
        print("   Continuing without logfire...")
        config.enable_logfire = False
else:
    print("ℹ️ Logfire disabled in configuration")

✅ Logfire configured successfully
   Project: mcp-discord-extraction
   Instrumentation: pydantic-ai




## Section 3: Web Scraping with Crawl4AI

**Goal:** Use the crawl4ai tool to scrape web content.

The `crawl_website` tool:
- Fetches and processes web pages
- Converts HTML to clean markdown
- Caches results to avoid re-scraping
- Returns structured `CrawlResult` with content and metadata

We'll use this as the first step in event extraction.

In [None]:
# 3.1 Test Web Scraping (with file export)
# Define a sample URL
TEST_URL = "https://seattlebluesdance.com/"

print(f"🌐 Scraping: {TEST_URL}")
print()

try:
    result = await crawl_website(url=TEST_URL, override_cache=True)
    
    if result.is_success:
        print("✅ Scraping successful!")
        print(f"   Title: {result.result.title}")
        print(f"   URL: {result.result.url}")
        print(f"   Content length: {result.result.content_length} characters")
        print(f"   Images: {len(result.result.images)}")
        print(f"   Internal links: {len(result.result.links.get('internal', []))}")
        print(f"   External links: {len(result.result.links.get('external', []))}")
        print(f"   Extracted at: {result.result.extracted_at}")
        
        # Export to file for inspection
        from pathlib import Path
        output_file = Path("TEMP/notebook_scraped_content.md")
        output_file.parent.mkdir(exist_ok=True)
        
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(f"# Scraped Content from {TEST_URL}\n\n")
            f.write(f"**Title:** {result.result.title}\n\n")
            f.write(f"**Description:** {result.result.description}\n\n")
            f.write(f"**URL:** {result.result.url}\n\n")
            f.write(f"**Content Length:** {result.result.content_length} characters\n\n")
            f.write(f"**Images:** {len(result.result.images)}\n\n")
            f.write(f"**Internal Links:** {len(result.result.links.get('internal', []))}\n\n")
            f.write(f"**External Links:** {len(result.result.links.get('external', []))}\n\n")
            f.write(f"**Extracted At:** {result.result.extracted_at}\n\n")
            f.write("---\n\n")
            f.write("## Full Content\n\n")
            f.write(result.result.content_markdown)
        
        print(f"\n💾 Content exported to: {output_file.absolute()}")
        print(f"   File size: {output_file.stat().st_size:,} bytes")
        
        print(f"\n📄 Content preview (first 500 chars):")
        print(result.result.content_markdown[:500])
        print("...")
        
    else:
        print(f"❌ Scraping failed")
        print(f"   Error: '{result.error}'")
        
except Exception as e:
    print(f"⚠️ Exception occurred: {type(e).__name__}")
    print(f"   Message: {str(e)}")
    import traceback
    traceback.print_exc()

🌐 Scraping: https://seattlebluesdance.com/

🔍 Debug Info:
   is_success: False
   error: 
   result type: <class 'dict'>
   result: {'url': 'https://seattlebluesdance.com/'}

❌ Scraping failed

❗ Error Details:
   Error message: ''
   Error is empty: True

💡 Possible causes:
   - Crawl4AI returned success=False but no error message
   - Network/connectivity issues
   - URL requires authentication or JavaScript rendering
   - Playwright browser not properly initialized

   To diagnose:
   1. Check if playwright browsers are installed:
      playwright install
   2. Try with a simpler URL like https://example.com
   3. Check crawl4ai logs for more details

In [10]:
# 3.3 Inspect Exported Content
# Read and display the exported markdown file
from pathlib import Path

output_file = Path("TEMP/notebook_scraped_content.md")

if output_file.exists():
    print(f"📄 Reading exported content from: {output_file}")
    print(f"   File size: {output_file.stat().st_size:,} bytes")
    print()
    
    content = output_file.read_text(encoding="utf-8")
    
    # Show first 1000 characters
    print("📝 Content preview (first 1000 chars):")
    print("=" * 70)
    print(content[:1000])
    print("...")
    print("=" * 70)
    print()
    print(f"💡 Full content available in: {output_file.absolute()}")
else:
    print(f"❌ File not found: {output_file}")
    print("   Run cell 3.1 first to scrape and export content")

🌐 Testing with simple URL: https://example.com
   (This helps verify if the issue is site-specific or setup-related)

❌ Simple URL also failed: 

💡 This indicates a setup issue:
   Playwright browsers likely not installed

   Run in terminal:
   playwright install chromium
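
Before reinstalling anything, it can help to confirm whether the Playwright package and its command-line tool are visible from the notebook's environment. The sketch below uses only the standard library; `playwright_status` is a hypothetical helper. It cannot tell whether browser binaries have been downloaded, so `playwright install chromium` may still be needed even when both checks pass.

```python
import importlib.util
import shutil
from typing import Dict


def playwright_status() -> Dict[str, bool]:
    """Report whether the playwright package and CLI are reachable from this environment."""
    return {
        "package_importable": importlib.util.find_spec("playwright") is not None,
        "cli_on_path": shutil.which("playwright") is not None,
    }


for check, ok in playwright_status().items():
    print(f"{check}: {'yes' if ok else 'no'}")
```

If `package_importable` is `no`, the notebook kernel is probably running in a different environment from the one where crawl4ai was installed.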


## Section 4: Manual Event Extraction

**Goal:** Extract event information from scraped content and build `EventDetails` objects from it.

The full agent graph requires additional setup, so this section does not yet run the complete extraction pipeline. Instead it demonstrates:
1. Scraping content with crawl4ai
2. Manually extracting event details
3. Creating EventDetails objects

**For production use**, the full agent workflow in `mcp_ce.agentic_tools.graphs.extract_event` provides:
- Automated extraction with LLMs
- Validation and refinement cycles
- Multiple event detection
- Confidence scoring

Logfire project URL: https://logfire-us.pydantic.dev/jaewilson07/ai-rag


In [None]:
# 4.1 Create Sample EventDetails
# For demonstration, we'll create an EventDetails object manually
# In production, the agent workflow would extract this automatically

sample_event = EventDetails(
    name="Sample Blues Dance Event",
    description="A sample event extracted from web content",
    start_date="2025-12-01",
    start_time="19:00",
    end_date="2025-12-01",
    end_time="23:00",
    location="Seattle, WA",
    organizer="Seattle Blues Dance",
    ticket_url="https://example.com/tickets",
    image_urls=[]
)

print("✅ Sample EventDetails created:")
print(f"   Name: {sample_event.name}")
print(f"   Date: {sample_event.start_date} {sample_event.start_time}")
print(f"   Location: {sample_event.location}")
print(f"   Description: {sample_event.description}")

print("\n💡 Next steps:")
print("   1. Use the full agent workflow for automated extraction")
print("   2. Install missing dependencies if needed")
print("   3. See ARCHITECTURE.md for agent graph setup")
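
For the manual extraction step, a few regular expressions over the scraped markdown are often enough to surface candidate values to copy into `EventDetails`. The following is a rough heuristic sketch, not the project's extraction logic: `find_event_candidates` is a hypothetical helper, and the sample markdown is made up for illustration.

```python
import re
from typing import Dict

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_24H = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")
H1_HEADING = re.compile(r"^#[ \t]+(.+)$", re.MULTILINE)


def find_event_candidates(markdown: str) -> Dict[str, object]:
    """Collect the first top-level heading plus any ISO dates and 24-hour times."""
    heading = H1_HEADING.search(markdown)
    return {
        "name": heading.group(1).strip() if heading else None,
        "dates": ISO_DATE.findall(markdown),
        "times": TIME_24H.findall(markdown),
    }


sample_markdown = "# Sample Blues Dance Event\n\nDoors open 2025-12-01 at 19:00, music until 23:00.\n"
print(find_event_candidates(sample_markdown))
```

The candidates still need a human to decide which date is the start and which time is the end; that judgment is what the agent workflow is meant to automate.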

## Section 5: Refine Extraction Patterns

**Goal:** Iterate on extraction patterns to improve completeness and accuracy.

Use this section to:
- Adjust extraction instructions in the agent
- Test different prompt strategies
- Compare results across iterations

In [None]:
# 5.1 Experiment with Custom Instructions
# This cell allows you to modify agent instructions without changing source code

custom_extraction_instructions = """
You are extracting event information from scraped web content.

Focus on:
1. Event name (title, heading)
2. Complete description (avoid truncation)
3. Date/time information (be precise with formats)
4. Location details (venue name, address if available)
5. Organizer information
6. Ticket/registration URLs
7. Image URLs (event posters, venue photos)

Common patterns to look for:
- Dates: "January 15, 2024", "2024-01-15", "15/01/2024"
- Times: "7:00 PM", "19:00", "7pm"
- Locations: "123 Main St", "Downtown Theater", "Virtual Event"

If information is unclear or missing, mark as None rather than guessing.
"""

print("✅ Custom extraction instructions defined")
print("\n📝 Instructions preview:")
print(custom_extraction_instructions[:200] + "...")
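
To compare results across iterations, one simple approach is to diff two extraction results field by field. A minimal sketch, assuming each result has already been converted to a plain dict (for a Pydantic v2 model, `model_dump()` does that); `diff_extractions` is a hypothetical helper and the two runs below are invented examples.

```python
from typing import Any, Dict, Tuple


def diff_extractions(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old, new)} for every field whose value changed between runs."""
    fields = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in fields
        if before.get(name) != after.get(name)
    }


run_1 = {"name": "Blues Night", "start_time": None, "location": "Seattle, WA"}
run_2 = {"name": "Blues Night", "start_time": "19:00", "location": "Seattle, WA"}
for name, (old, new) in diff_extractions(run_1, run_2).items():
    print(f"{name}: {old!r} -> {new!r}")
```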

## Section 6: Evaluate Performance

**Goal:** Analyze extraction performance across multiple test cases.

This section helps:
- Track completeness scores
- Identify common failure patterns
- Compare agent configurations

In [None]:
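# 6.1 Run Batch Evaluation
# NOTE: extract_events_from_url is not imported anywhere in this notebook. Import it from
# the full agent workflow (see mcp_ce.agentic_tools.graphs.extract_event) before running
# this cell; otherwise every URL below is reported as failed with a NameError.
# Define multiple test URLs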
test_urls = [
    "https://seattlebluesdance.com/",
    "https://example.com/event2",
    "https://example.com/event3",
]

results = []

print("🔬 Starting batch evaluation...")
print(f"   Test URLs: {len(test_urls)}")
print()

for i, url in enumerate(test_urls, 1):
    print(f"[{i}/{len(test_urls)}] Testing: {url}")
    try:
        extracted = await extract_events_from_url(
            url=url,
            max_iterations=config.max_validation_cycles,
            confidence_threshold=0.85,
        )
        
        results.append({
            "url": url,
            "success": True,
            "events_found": len(extracted.events),
            "confidence": extracted.overall_confidence,
            "iterations": extracted.iterations_used,
        })
        print(f"   ✅ Success - Found {len(extracted.events)} events (confidence: {extracted.overall_confidence:.1%})")
    except Exception as e:
        results.append({
            "url": url,
            "success": False,
            "error": str(e),
        })
        print(f"   ❌ Failed: {e}")
    print()

# Calculate statistics
successful = [r for r in results if r["success"]]
if successful:
    total_events = sum(r["events_found"] for r in successful)
    avg_confidence = sum(r["confidence"] for r in successful) / len(successful)
    avg_iterations = sum(r["iterations"] for r in successful) / len(successful)
    
    print(f"📊 Evaluation Summary:")
    print(f"   Success rate: {len(successful)}/{len(test_urls)} ({len(successful)/len(test_urls)*100:.1f}%)")
    print(f"   Total events found: {total_events}")
    print(f"   Average confidence: {avg_confidence:.1%}")
    print(f"   Average iterations: {avg_iterations:.1f}")
else:
    print("❌ No successful extractions")
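
The "completeness score" tracked in this section is not defined anywhere in the notebook. One plausible definition, shown purely as an assumption, is the share of expected fields that hold a non-empty value. The field list below is taken from the sample `EventDetails` in cell 4.1; `completeness_score` is a hypothetical helper and may not match the project's validation logic.

```python
from typing import Any, Dict, Sequence

EVENT_FIELDS = (
    "name", "description", "start_date", "start_time", "end_date",
    "end_time", "location", "organizer", "ticket_url", "image_urls",
)


def completeness_score(event: Dict[str, Any], fields: Sequence[str] = EVENT_FIELDS) -> float:
    """Fraction of `fields` with a non-empty value (None, '', [] and {} count as empty)."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if event.get(name) not in (None, "", [], {}))
    return filled / len(fields)


sample = {
    "name": "Sample Blues Dance Event",
    "description": "A sample event extracted from web content",
    "start_date": "2025-12-01", "start_time": "19:00",
    "end_date": "2025-12-01", "end_time": "23:00",
    "location": "Seattle, WA", "organizer": "Seattle Blues Dance",
    "ticket_url": "https://example.com/tickets", "image_urls": [],
}
print(f"Completeness: {completeness_score(sample):.0%}")
```

Weighting required fields more heavily than optional ones would be a natural refinement, but that choice belongs with whatever the validation agent actually enforces.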

## Section 7: Debug and Troubleshoot

**Goal:** Diagnose and fix extraction issues.

This section provides:
- Logfire trace inspection (if enabled)
- Step-by-step agent execution visualization
- Common error patterns and solutions

In [None]:
# 7.1 Inspect Logfire Traces
if config.enable_logfire:
    print("🔍 Logfire traces available")
    print(f"   Project: {config.logfire_project}")
    print("   View traces at: https://logfire.pydantic.dev/")
    print()
    print("💡 To inspect traces:")
    print("   1. Visit logfire.pydantic.dev")
    print("   2. Select project: " + config.logfire_project)
    print("   3. Filter by agent names: scraper, extraction, validation")
    print("   4. Check execution times, errors, and agent outputs")
else:
    print("ℹ️ Logfire not enabled. To enable:")
    print("   1. Set config.enable_logfire = True")
    print("   2. Re-run cell 2.2 to configure logfire")
    print("   3. Re-run extraction workflow")

### Common Extraction Issues and Solutions

#### Issue 1: Missing Required Fields
**Symptoms:** `missing_required` contains fields like "name" or "start_date"

**Possible Causes:**
- Scraper not capturing full page content
- Event information in non-standard format
- JavaScript-rendered content not accessible

**Solutions:**
1. Check scraped content: run `await crawl_website(url=url)` as in cell 3.1 and inspect `result.result.content_markdown`
2. Try `deep_crawl` instead of `crawl_website` for JavaScript-heavy pages
3. Adjust extraction instructions to handle alternative formats

#### Issue 2: Low Completeness Score
**Symptoms:** Completeness score below 80%

**Possible Causes:**
- Optional fields genuinely missing from source
- Extraction agent not identifying optional data
- Validation logic too strict

**Solutions:**
1. Review scraped content for missing information
2. Refine extraction instructions (cell 5.1)
3. Adjust validation thresholds if appropriate

#### Issue 3: Incorrect Date Formats
**Symptoms:** Dates not extracted or in wrong format

**Possible Causes:**
- Unusual date formatting on source page
- Timezone or locale-specific formats
- Relative dates ("next Friday") not parsed

**Solutions:**
1. Add date format examples to extraction instructions
2. Use dateparser library for flexible parsing
3. Handle relative dates in scraper preprocessing
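
As a standard-library alternative to solution 2, the explicit formats listed in cell 5.1 can be tried in order with `datetime.strptime`. This is a sketch under assumptions: `normalize_date` is a hypothetical helper, `15/01/2024` is read as day/month/year, month names are parsed in the current locale (English assumed), and relative dates such as "next Friday" are deliberately left unparsed.

```python
from datetime import datetime
from typing import Optional

# Formats from the extraction instructions: ISO, "January 15, 2024", and day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y")


def normalize_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ["January 15, 2024", "2024-01-15", "15/01/2024", "next Friday"]:
    print(f"{raw!r} -> {normalize_date(raw)}")
```

Returning `None` for anything unrecognized matches the instruction in cell 5.1 to mark unclear values as None rather than guess.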

#### Issue 4: Import Errors
**Symptoms:** `ModuleNotFoundError`, `ImportError`

**Possible Causes:**
- Editable install missing or stale, so `mcp_ce` is not on `sys.path`
- Missing dependencies
- Circular imports

**Solutions:**
1. Re-run `uv pip install -e .` in a terminal, then re-run cell 1.1 to confirm the package location
2. Verify all ✅ checkmarks in Section 1
3. Check for circular dependencies in agent files

## ✅ Notebook Ready to Run!

**What This Notebook Demonstrates:**
- ✅ Editable install: `mcp-discord` package installed  
- ✅ Web scraping: Using `crawl_website` tool from mcp_ce
- ✅ Data models: EventDetails structure for events
- ✅ Import verification: All project modules accessible

**Cells to Run:**
1. **Cells 1-6**: Setup environment and verify imports
2. **Cell 12 (Section 3)**: Test web scraping with crawl4ai
3. **Cell 14 (Section 4)**: Create sample EventDetails object

**For Full Agent Workflow:**
The complete multi-agent extraction system is in `mcp_ce.agentic_tools.graphs.extract_event` but requires:
- Additional setup for event_extraction_graph module
- See `src/mcp_ce/agentic_tools/ARCHITECTURE.md` for details
- Uses scraper → extraction → validation agent pipeline

**Current Status:**
This notebook demonstrates the core tools (scraping + models) that the agent graph uses. The full automated extraction workflow requires fixing the event_extraction_graph imports.