# Real-World Multi-Source Integration

This cookbook demonstrates a production-grade workflow for integrating data from distinct sources using Semantica.

We will simulate a complex Enterprise Knowledge Graph construction scenario for **Nexus AI** by aggregating data from:

1.  **Corporate Database (SQLite)**: Financial records and employee counts.
2.  **Public Web (HTML)**: News articles and press releases.
3.  **Code Repository (Markdown)**: Project README with maintainer and language metadata.
4.  **Market Data API (JSON)**: Stock ticker, share price, and a reported employee count (simulated here as a JSON file).
5.  **Web Search MCP (Tool)**: Live competitor analysis from a search agent.

**Key Semantica Modules Used:**
*   `ingest`: For loading data from disparate sources (including MCP).
*   `kg.GraphBuilder`: For constructing the graph and merging entities.
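*   `conflicts.ConflictResolver`: For resolving data discrepancies. This notebook does not call it directly; conflict resolution is requested through `GraphBuilder` with `resolve_conflicts=True`.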
*   `visualization.KGVisualizer`: For visualizing the final network.

In [None]:
# Installation & Setup
!pip install semantica mcp fastmcp

In [None]:
import os
import json
import sqlite3
import tempfile

# Import Semantica Modules
from semantica.ingest import DBIngestor, WebIngestor, FileIngestor, MCPIngestor
from semantica.kg import GraphBuilder
from semantica.visualization import KGVisualizer

# Create a temporary workspace
WORKSPACE_DIR = tempfile.mkdtemp()
print(f"Workspace created at: {WORKSPACE_DIR}")

## Phase 1: Creating Data Sources

We generate files on disk to simulate the disparate enterprise systems.
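
Because these files stand in for independent systems, it can help to read them back with plain standard-library calls before handing them to an ingestor. The sketch below is self-contained and makes a few assumptions: it writes a throwaway copy of the `financials` table and the market JSON payload into its own temporary directory, and the helper name `read_employee_counts` is hypothetical. Reading the employee count from each source surfaces the 45 vs 50 discrepancy that the later phases resolve.

```python
import json
import os
import sqlite3
import tempfile


def read_employee_counts(db_file, json_file, company="Nexus AI"):
    """Return the employee count reported by each source (hypothetical helper)."""
    conn = sqlite3.connect(db_file)
    try:
        row = conn.execute(
            "SELECT employees FROM financials WHERE company_name = ?", (company,)
        ).fetchone()
    finally:
        conn.close()

    with open(json_file, encoding="utf-8") as f:
        payload = json.load(f)

    return {
        "corporate_db": row[0] if row else None,
        "market_api": payload.get("employees"),
    }


# Build a throwaway copy of the two structured sources and read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "corporate.db")
    conn = sqlite3.connect(demo_db)
    conn.execute(
        "CREATE TABLE financials (company_name TEXT, revenue REAL, employees INTEGER)"
    )
    conn.execute("INSERT INTO financials VALUES ('Nexus AI', 5500000.00, 45)")
    conn.commit()
    conn.close()

    demo_json = os.path.join(tmp, "market.json")
    with open(demo_json, "w", encoding="utf-8") as f:
        json.dump({"ticker": "NXAI", "price": 124.50, "employees": 50}, f)

    employee_counts = read_employee_counts(demo_db, demo_json)

print(employee_counts)
```

Unlike `tempfile.mkdtemp()`, the `TemporaryDirectory` context manager removes the directory on exit, which suits a short-lived check like this one.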

In [None]:
# 1. SQLite Database (Financials)
db_path = os.path.join(WORKSPACE_DIR, "corporate.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE financials (company_name TEXT, revenue REAL, employees INTEGER)")
conn.execute("INSERT INTO financials VALUES ('Nexus AI', 5500000.00, 45)")
conn.commit()
conn.close()

# 2. Public Web (HTML)
html_path = os.path.join(WORKSPACE_DIR, "news.html")
with open(html_path, "w") as f:
    f.write("""
    <html><body>
        <h1>Nexus AI Raises Series B</h1>
        <p>Nexus AI (San Francisco) valued at $100M.</p>
        <p>CEO Jane Doe announces expansion.</p>
    </body></html>
    """)

# 3. Code Repository (Markdown)
repo_path = os.path.join(WORKSPACE_DIR, "README.md")
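with open(repo_path, "w") as f:
    # Written without leading indentation: Markdown treats lines indented
    # by four spaces as a code block, which would hide the heading.
    f.write(
        "# Nexus AI Core\n"
        "Maintained by: engineering@nexus.ai\n"
        "Language: Python\n"
    )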

# 4. Market Data API (JSON)
api_path = os.path.join(WORKSPACE_DIR, "market.json")
with open(api_path, "w") as f:
    json.dump({
        "ticker": "NXAI", 
        "price": 124.50, 
        "employees": 50  # Conflict with DB (45)
    }, f)

print("Data sources created successfully.")

## Phase 2: Ingestion & Extraction

We use Semantica's ingestors to load data. In a real pipeline, we would attach an extractor (like an LLM) to parse the raw content into entities. Here, we simulate the extraction output for clarity.
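
To make the extraction step concrete, here is a minimal rule-based sketch that turns the sample news page into the same dictionary shape as the simulated sources. It is not a Semantica API: `TextCollector` and `extract_entities` are hypothetical names, the two regular expressions only cover the sentences in this sample, and a real pipeline would more likely rely on an NER model or an LLM.

```python
import re
from html.parser import HTMLParser

SAMPLE_HTML = """
<html><body>
    <h1>Nexus AI Raises Series B</h1>
    <p>Nexus AI (San Francisco) valued at $100M.</p>
    <p>CEO Jane Doe announces expansion.</p>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_entities(html, source="public_web"):
    """Toy rule-based extractor (hypothetical helper, not a Semantica API)."""
    collector = TextCollector()
    collector.feed(html)
    text = " ".join(collector.chunks)

    result = {"name": "Web News", "type": "unstructured",
              "entities": [], "relationships": []}

    # Rule 1: "<Organization> (<Location>) valued at <amount>"
    org = re.search(
        r"([A-Z]\w+(?: [A-Z]\w+)*) \(([^)]+)\) valued at (\$\d+[MB]?)", text
    )
    if org:
        result["entities"].append({
            "name": org.group(1),
            "type": "Organization",
            "properties": {"valuation": org.group(3), "location": org.group(2)},
            "source": source,
        })

    # Rule 2: "CEO <First> <Last>"
    ceo = re.search(r"\bCEO ([A-Z][a-z]+ [A-Z][a-z]+)", text)
    if ceo:
        result["entities"].append({
            "name": ceo.group(1),
            "type": "Person",
            "properties": {"role": "CEO"},
            "source": source,
        })

    # Link the two only when both rules fired.
    if org and ceo:
        result["relationships"].append(
            {"source": ceo.group(1), "target": org.group(1), "type": "is_ceo_of"}
        )
    return result


extracted = extract_entities(SAMPLE_HTML)
print(extracted)
```

Rules like these are brittle by design; the point is only to show how raw text maps onto the `entities` and `relationships` lists used in the rest of the notebook.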

In [None]:
# --- Demonstration of Real Ingestion (Raw Data) ---
print("--- Ingesting Raw Data from Sources (Demonstration) ---")

# 1. Ingest from SQLite using DBIngestor
# We connect to the local SQLite database we just created
try:
    db_ingestor = DBIngestor()
    # SQLAlchemy connection string for SQLite
    db_connection = f"sqlite:///{db_path}"
    raw_db_data = db_ingestor.ingest_database(db_connection)
    print(f"DBIngestor: Successfully connected to {db_path}")
    print(f"DBIngestor: Found tables: {list(raw_db_data.get('tables', {}).keys())}")
except Exception as e:
    print(f"DBIngestor Warning: {e}")

# 2. Ingest from Files using FileIngestor
# We load the Markdown and JSON files directly
try:
    file_ingestor = FileIngestor()
    
    # Ingest Markdown
    readme_file = file_ingestor.ingest_file(repo_path)
    print(f"FileIngestor: Read {readme_file.name} ({readme_file.size} bytes)")
    
    # Ingest JSON
    market_file = file_ingestor.ingest_file(api_path)
    print(f"FileIngestor: Read {market_file.name} ({market_file.size} bytes)")
except Exception as e:
    print(f"FileIngestor Warning: {e}")

# 3. Ingest from Web using WebIngestor
# Since we created a local HTML file, we could use FileIngestor, 
# but here we demonstrate WebIngestor initialization for URL-based sources.
try:
    web_ingestor = WebIngestor()
    # In a real scenario, we would call: 
    # web_content = web_ingestor.ingest_url("https://nexus.ai/news")
    print("WebIngestor: Initialized and ready to crawl URLs.")
except Exception as e:
    print(f"WebIngestor Warning: {e}")

print("-" * 50 + "\n")

# Simulate extracted entities from our sources

source_db = {
    "name": "Corporate Database",
    "type": "structured",
    "entities": [
        {
            "name": "Nexus AI",
            "type": "Organization",
            "properties": {"revenue": 5500000.00, "employees": 45},
            "source": "corporate_db"
        }
    ]
}

source_web = {
    "name": "Web News",
    "type": "unstructured",
    "entities": [
        {
            "name": "Nexus AI",
            "type": "Organization",
            "properties": {"valuation": "$100M", "location": "San Francisco"},
            "source": "public_web"
        },
        {
            "name": "Jane Doe",
            "type": "Person",
            "properties": {"role": "CEO"},
            "source": "public_web"
        }
    ],
    "relationships": [
        {"source": "Jane Doe", "target": "Nexus AI", "type": "is_ceo_of"}
    ]
}

source_api = {
    "name": "Market API",
    "type": "structured",
    "entities": [
        {
            "name": "Nexus AI",
            "type": "Organization",
            "properties": {"ticker": "NXAI", "employees": 50}, # Note conflict: 50 vs 45
            "source": "market_api"
        }
    ]
}

source_repo = {
    "name": "GitHub Repo",
    "type": "semi-structured",
    "entities": [
        {
            "name": "Nexus AI Core",
            "type": "Software",
            "properties": {"language": "Python"},
            "source": "github"
        }
    ],
    "relationships": [
        {"source": "Nexus AI Core", "target": "Nexus AI", "type": "owned_by"}
    ]
}

# 5. Web Search MCP
# We use Semantica's MCPIngestor to connect to a Web Search MCP server.
# This allows us to fetch live competitor data (e.g., from Brave Search).

try:
    # Initialize MCP Ingestor
    mcp = MCPIngestor()
    
    # Attempt to connect to a local MCP server (e.g., running on port 8000)
    # Example: `fastmcp run search_server.py`
    mcp.connect("web_search", url="http://localhost:8000/sse")
    
    print("Connected to MCP Server. Ingesting live data...")
    
    # Ingest data using the search tool
    mcp_data = mcp.ingest_tool_output(
        "web_search", 
        "search", 
        {"query": "Nexus AI competitors valuation"}
    )
    
    # Process the output (assuming the tool returns structured entities)
    # In a real app, you might need an LLM to extract entities from search results.
    # Here we assume the MCP server returns ready-to-use entities.
    source_mcp = {
        "name": "Web Search MCP",
        "type": "agent-tool",
        "entities": mcp_data.get("entities", []),
        "source": "mcp_search"
    }

except Exception as e:
    print(f"MCP Server connection failed ({e}). Using simulated data.")
    print("To enable live data, ensure an MCP server is running at http://localhost:8000/sse")
    
    # Fallback to simulated data
    source_mcp = {
        "name": "Web Search MCP",
        "type": "agent-tool",
        "entities": [
            {
                "name": "Nexus AI",
                "type": "Organization",
                "properties": {
                    "competitors": ["Cyberdyne Systems", "Massive Dynamic"],
                    "valuation": "$120M"  # Note conflict: $120M (newer) vs $100M (older web)
                },
                "source": "mcp_search_agent"
            }
        ],
        "relationships": [
            {"source": "Nexus AI", "target": "Cyberdyne Systems", "type": "competes_with"}
        ]
    }

all_sources = [source_db, source_web, source_api, source_repo, source_mcp]

## Phase 3: Graph Construction & Resolution

We use `GraphBuilder` to:
1.  **Merge Entities**: Combine the 4 "Nexus AI" records into one canonical node.
2.  **Resolve Conflicts**: Handle discrepancies (Employee count: 45 vs 50; Valuation: $100M vs $120M).
3.  **Build Graph**: Link related entities (CEO, Software, Competitors).
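
The merge and resolve steps above can be sketched in plain Python. This is an illustration of one possible policy, not Semantica's algorithm: the `SOURCE_PRIORITY` ranking and the `merge_and_resolve` helper are hypothetical, entities are matched on exact name and type rather than fuzzily, and a conflict is settled by keeping the value from the highest-ranked source.

```python
from collections import defaultdict

# Hypothetical trust ranking (higher wins); not part of Semantica.
SOURCE_PRIORITY = {
    "corporate_db": 3,      # assumed system of record for headcount
    "market_api": 2,
    "mcp_search_agent": 2,  # assumed fresher than static web pages
    "public_web": 1,
}


def merge_and_resolve(sources, priority=SOURCE_PRIORITY):
    """Merge entities by (name, type) and resolve conflicts by source rank."""
    nodes = {}
    observed = defaultdict(list)  # (name, type, prop) -> [(rank, origin, value)]

    for source in sources:
        for entity in source.get("entities", []):
            key = (entity["name"], entity["type"])  # exact match only
            nodes.setdefault(
                key,
                {"name": entity["name"], "type": entity["type"], "properties": {}},
            )
            origin = entity.get("source", "unknown")
            for prop, value in entity.get("properties", {}).items():
                observed[key + (prop,)].append(
                    (priority.get(origin, 0), origin, value)
                )

    conflicts = {}
    for (name, etype, prop), candidates in observed.items():
        # max() keeps the first maximal item, so ties go to the earliest source.
        _, origin, value = max(candidates, key=lambda c: c[0])
        nodes[(name, etype)]["properties"][prop] = value
        # repr() lets unhashable values such as lists be compared for distinctness.
        if len({repr(c[2]) for c in candidates}) > 1:
            conflicts[(name, prop)] = {"chosen": value, "from": origin}

    return list(nodes.values()), conflicts


demo_sources = [
    {"entities": [{"name": "Nexus AI", "type": "Organization",
                   "properties": {"revenue": 5500000.00, "employees": 45},
                   "source": "corporate_db"}]},
    {"entities": [{"name": "Nexus AI", "type": "Organization",
                   "properties": {"valuation": "$100M", "location": "San Francisco"},
                   "source": "public_web"}]},
    {"entities": [{"name": "Nexus AI", "type": "Organization",
                   "properties": {"ticker": "NXAI", "employees": 50},
                   "source": "market_api"}]},
    {"entities": [{"name": "Nexus AI", "type": "Organization",
                   "properties": {"valuation": "$120M"},
                   "source": "mcp_search_agent"}]},
]

merged_nodes, conflicts = merge_and_resolve(demo_sources)
print(merged_nodes[0]["properties"])
print(conflicts)
```

Under this ranking the database wins the employee count and the search agent wins the valuation. Other policies, such as most-recent-wins or majority vote, would slot into the same loop by changing the `max` key.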

In [None]:
# Initialize GraphBuilder with resolution enabled
builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy"
)

# Build the graph
print("Building Knowledge Graph...")
kg = builder.build(sources=all_sources)

print(f"Graph built with {len(kg['nodes'])} nodes and {len(kg['edges'])} edges.")

# Verify conflict resolution results
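# Passing a default to next() avoids a bare StopIteration if the node is missing
nexus_node = next((n for n in kg['nodes'] if n['name'] == "Nexus AI"), None)
if nexus_node is None:
    raise ValueError("No 'Nexus AI' node found; check that entity merging ran as expected.")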
print("\nResolved Properties for Nexus AI:")
print(f"  Employees: {nexus_node['properties'].get('employees')} (Resolved from DB/API)")
print(f"  Valuation: {nexus_node['properties'].get('valuation')} (Resolved from Web/MCP)")
print(f"  Competitors: {nexus_node['properties'].get('competitors')}")

## Phase 4: Visualization

We use `KGVisualizer` to render the interactive graph.
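
When an interactive renderer is not available (for example in a headless job), a static export can serve as a fallback. The sketch below converts a graph dictionary into Graphviz DOT text using only the standard library. It assumes the graph has `nodes` with a `name` key and `edges` with `source`, `target`, and `type` keys, mirroring the relationship dictionaries used earlier; the actual structure returned by `GraphBuilder` may differ, and `kg_to_dot` is a hypothetical helper.

```python
def kg_to_dot(kg, graph_name="kg"):
    """Render a {'nodes': [...], 'edges': [...]} dict as Graphviz DOT text."""

    def quote(value):
        # Escape backslashes and double quotes so labels stay valid DOT strings.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = [f"digraph {graph_name} {{"]
    for node in kg.get("nodes", []):
        lines.append(f"  {quote(node['name'])};")
    for edge in kg.get("edges", []):
        lines.append(
            f"  {quote(edge['source'])} -> {quote(edge['target'])} "
            f"[label={quote(edge['type'])}];"
        )
    lines.append("}")
    return "\n".join(lines)


# Assumed graph shape, for illustration only.
demo_kg = {
    "nodes": [{"name": "Nexus AI"}, {"name": "Jane Doe"}, {"name": "Nexus AI Core"}],
    "edges": [
        {"source": "Jane Doe", "target": "Nexus AI", "type": "is_ceo_of"},
        {"source": "Nexus AI Core", "target": "Nexus AI", "type": "owned_by"},
    ],
}

dot_text = kg_to_dot(demo_kg)
print(dot_text)
```

Saving `dot_text` to a file such as `kg.dot` and running `dot -Tsvg kg.dot -o kg.svg` with Graphviz installed would produce a static image of the graph.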

In [None]:
print("Visualizing Graph...")
visualizer = KGVisualizer(layout="force", color_scheme="vibrant")

# Render the network
# This supports interactive output in Jupyter
fig = visualizer.visualize_network(kg, output="interactive")
fig.show()

## Conclusion

In this notebook, we used `semantica`'s high-level modules to:
1.  Load data from five sources, falling back to simulated output where a live service (such as the **MCP search server**) is unavailable.
2.  Merge the four "Nexus AI" records into a single canonical node.
3.  Resolve conflicting property values (valuation: $100M vs $120M; employee count: 45 vs 50).
4.  Visualize the unified knowledge graph, including competitor relationships.
