# SuperScan Demo: End-to-End Workflow

This notebook walks through the complete SuperScan flow:

1. **Initialize Database** - Create tables in Snowflake
2. **Create Project** - Set up a new project container
3. **Upload PDF** - Ingest a document and store its metadata
4. **Sparse Scan** - Generate an ontology proposal from PDF text
5. **Save Proposal** - Persist the proposal to Snowflake
6. **Review Proposal** - Inspect the LLM-generated schema suggestions
7. **Finalize Schemas** - Convert the proposal into versioned Schema records
8. **Verify Schemas** - List the schemas stored for the project

**Requirements**: Snowflake credentials in `.env`, plus an OpenAI API key (optional, used for the LLM step)
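
To catch a missing credential before any service call fails, a small check like the one below can be run once `.env` has been loaded. This is only a sketch: the `SNOWFLAKE_*` names are assumptions about how the `.env` file is laid out, and `missing_vars` is a hypothetical helper, not part of SuperScan. `OPENAI_API_KEY` is the variable read later in this notebook.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names; adjust them to the keys your .env actually defines.
REQUIRED_VARS = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
OPTIONAL_VARS = ("OPENAI_API_KEY",)


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]
```

Calling `missing_vars(REQUIRED_VARS)` returns an empty list when everything is set; a non-empty result for `OPTIONAL_VARS` is only worth a warning.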

## Step 0: Setup and Imports

In [None]:
import sys
import os
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Load environment
from dotenv import load_dotenv
load_dotenv(project_root / ".env")

print("✓ Environment loaded")

In [None]:
# Import core services
from code.graph_rag.db import get_db, init_database
from code.superscan.project_service import ProjectService
from code.superscan.file_service import FileService
from code.superscan.schema_service import SchemaService
from code.superscan.proposal_service import ProposalService
from code.superscan.fast_scan import FastScan
from code.superscan.pdf_parser import extract_text_from_file_path

print("✓ Services imported")

## Step 1: Initialize Database

Create all required tables in Snowflake: `projects`, `schemas`, `nodes`, `edges`, `files`, and `ontology_proposals`.
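
The "create if not exists" behaviour is what makes this step safe to re-run. The sketch below illustrates the idea with the standard-library `sqlite3` module as a local stand-in for Snowflake. The single `id` column is a placeholder, and `init_tables` and `list_tables` are hypothetical helpers; the real `init_database()` DDL is not shown in this notebook.

```python
import sqlite3
from typing import List

# Table names taken from the description above; the columns are placeholders.
TABLES = ("projects", "schemas", "nodes", "edges", "files", "ontology_proposals")


def init_tables(conn: sqlite3.Connection) -> None:
    """Create each table only if it does not already exist (idempotent)."""
    for table in TABLES:
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT PRIMARY KEY)")
    conn.commit()


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Running `init_tables` twice on the same connection leaves the table set unchanged, which is the property the cell below relies on.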

In [None]:
# Initialize database (creates tables if not exist)
init_database()
print("✓ Database initialized")

## Step 2: Create Project

Create a project container for our knowledge graph

In [None]:
# Initialize services
db = get_db()
project_svc = ProjectService(db)
file_svc = FileService(db)
schema_svc = SchemaService(db)
proposal_svc = ProposalService(db)

# Create project
project_payload = {
    "project_name": "demo-superscan",
    "display_name": "SuperScan Demo Project",
    "owner_id": "demo-user",
    "tags": ["demo", "superscan"],
}

project = project_svc.create_project(project_payload)
project_id = project["project_id"]
print(f"✓ Project created: {project_id}")
print(f"  Name: {project['project_name']}")
print(f"  Status: {project['status']}")

## Step 3: Upload PDF

Upload a sample PDF and store its metadata in Snowflake

In [None]:
# For demo: use a sample PDF or create a mock one
# In production, this would be a real file upload

sample_pdf_path = "sample.pdf"  # Replace with actual path

# If no real PDF available, create mock metadata
if not os.path.exists(sample_pdf_path):
    print("⚠️ No sample PDF found. Using mock file metadata.")
    file_record = file_svc.upload_pdf(
        project_id=project_id,
        filename="sample_research_paper.pdf",
        size_bytes=1024000,  # 1 MB
        pages=12,
        metadata={"source": "demo", "topic": "knowledge graphs"},
    )
else:
    # Real file upload
    file_info = os.stat(sample_pdf_path)
    extracted = extract_text_from_file_path(sample_pdf_path, max_pages=10)
    
    file_record = file_svc.upload_pdf(
        project_id=project_id,
        filename=os.path.basename(sample_pdf_path),
        size_bytes=file_info.st_size,
        pages=extracted.get("total_pages", 0),
        metadata={"extracted_pages": extracted.get("pages", 0)},
    )

file_id = file_record["file_id"]
print(f"✓ File uploaded: {file_id}")
print(f"  Filename: {file_record['filename']}")
print(f"  Pages: {file_record['pages']}")
print(f"  Status: {file_record['status']}")

## Step 4: Sparse Scan & Proposal Generation

Extract sparse text from the PDF and generate an ontology proposal using an LLM.
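
One way to read "sparse" here is that only a short snippet from each of the first few pages is sent to the LLM, rather than the full text. The sketch below shows that idea under stated assumptions: `sparse_snippets` and its `max_chars` limit are hypothetical, and the actual sampling done by `extract_text_from_file_path` may differ.

```python
from typing import List


def sparse_snippets(pages: List[str], max_pages: int = 10, max_chars: int = 300) -> List[str]:
    """Return one whitespace-normalized, truncated snippet per non-empty page.

    Hypothetical illustration of sparse sampling; not the SuperScan implementation.
    """
    snippets = []
    for page_text in pages[:max_pages]:
        cleaned = " ".join(page_text.split())  # collapse newlines and repeated spaces
        if cleaned:  # skip pages with no extractable text
            snippets.append(cleaned[:max_chars])
    return snippets
```

Capping both the page count and the snippet length keeps the prompt small, which is the main reason to scan sparsely before a full deep scan.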

In [None]:
# Extract sparse text from PDF (or use mock data)
if os.path.exists(sample_pdf_path):
    extracted = extract_text_from_file_path(sample_pdf_path, max_pages=10)
    text_snippets = extracted["text_snippets"]
else:
    # Mock text snippets
    text_snippets = [
        "This paper presents a novel approach to knowledge graph construction using multimodal databases.",
        "We propose a schema-driven architecture that supports relational, graph, and vector representations.",
        "The system enables entity resolution and deduplication across multiple data sources.",
    ]

print(f"✓ Extracted {len(text_snippets)} text snippets")
if text_snippets:  # guard against PDFs that yield no extractable text
    print(f"  Sample: {text_snippets[0][:100]}...")

In [None]:
# Generate ontology proposal using LLM
openai_key = os.getenv("OPENAI_API_KEY")  # Optional
scanner = FastScan(openai_api_key=openai_key)

proposal_dict = scanner.generate_proposal(
    snippets=text_snippets,
    hints={"domain": "knowledge graphs and databases"},
)

print("\n✓ Ontology proposal generated")
print(f"  Summary: {proposal_dict.get('summary', 'N/A')}")
print(f"  Nodes: {len(proposal_dict.get('nodes', []))}")
print(f"  Edges: {len(proposal_dict.get('edges', []))}")

## Step 5: Save Proposal to Snowflake
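
Before persisting, it can be worth checking that every edge in the proposal points at a node defined in the same proposal. The sketch below is one way to do that. The key names `schema_name`, `from_node`, and `to_node` are assumptions about the proposal's shape, and `dangling_edges` is a hypothetical helper, so adjust them to the structure `FastScan` actually returns.

```python
from typing import Any, Dict, List


def dangling_edges(
    proposal: Dict[str, Any],
    name_key: str = "schema_name",  # assumed key for a node's name
    from_key: str = "from_node",    # assumed key for an edge's source node
    to_key: str = "to_node",        # assumed key for an edge's target node
) -> List[Dict[str, Any]]:
    """Return edges whose source or target is not among the proposal's nodes."""
    known = {node.get(name_key) for node in proposal.get("nodes", [])}
    return [
        edge
        for edge in proposal.get("edges", [])
        if edge.get(from_key) not in known or edge.get(to_key) not in known
    ]
```

An empty result from `dangling_edges(proposal_dict)` means the graph is self-consistent under these assumptions.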

In [None]:
# Save proposal to database
proposal = proposal_svc.create_proposal(
    project_id=project_id,
    nodes=proposal_dict.get("nodes", []),
    edges=proposal_dict.get("edges", []),
    source_files=[file_id],
    summary=proposal_dict.get("summary", "Ontology from sparse scan"),
)

proposal_id = proposal["proposal_id"]
print(f"✓ Proposal saved: {proposal_id}")
print(f"  Status: {proposal['status']}")

## Step 6: Review Proposal

Inspect the generated ontology

In [None]:
import json

print("\n📋 Proposal Details:\n")
# default=str keeps non-JSON values (e.g. timestamps) printable
print(json.dumps(proposal, indent=2, default=str))

## Step 7: Finalize Proposal → Create Schemas

Convert the proposal into versioned Schema records.
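
Conceptually, finalization turns each proposed node and edge into a schema record with a version number. The sketch below illustrates that mapping. It is not the `finalize_proposal` implementation: the integer versioning scheme, the `schema_name` key on proposal items, and the `to_schema_records` helper are all assumptions. The record fields mirror the ones printed in the cells that follow.

```python
from typing import Any, Dict, List, Optional


def to_schema_records(
    proposal: Dict[str, Any],
    existing_versions: Optional[Dict[str, int]] = None,
    name_key: str = "schema_name",  # assumed key for an item's name
) -> List[Dict[str, Any]]:
    """Build one schema record per proposed node and edge, bumping known versions."""
    existing_versions = existing_versions or {}
    records = []
    for entity_type, key in (("node", "nodes"), ("edge", "edges")):
        for item in proposal.get(key, []):
            name = item[name_key]
            records.append(
                {
                    "schema_name": name,
                    "entity_type": entity_type,
                    # Assumed scheme: new schemas start at 1, known ones increment.
                    "version": existing_versions.get(name, 0) + 1,
                    "is_active": True,
                }
            )
    return records
```

With this scheme, re-finalizing a schema that already exists at version 1 would yield version 2, while new schema names start at 1.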

In [None]:
# Finalize proposal (creates Schema records)
result = proposal_svc.finalize_proposal(proposal_id)

print("✓ Proposal finalized. Schemas created:")
for schema in result["schemas"]:
    print(f"  - {schema['schema_name']} (v{schema['version']}, {schema['entity_type']})")

## Step 8: Verify Schemas in Database

In [None]:
# List all schemas for the project
schemas = schema_svc.list_schemas(project_id)

print(f"\n✓ Project has {schemas['total']} schema(s):")
for s in schemas["items"]:
    print(f"  - {s['schema_name']} v{s['version']} ({s['entity_type']}) [active={s['is_active']}]")

## Summary

**SuperScan workflow completed successfully!**

✅ Project created  
✅ PDF uploaded and metadata stored  
✅ Sparse scan generated ontology proposal  
✅ Proposal saved to Snowflake  
✅ Schemas finalized and versioned  

**Next steps (SuperKB)**:
- Deep scan with chunking
- Entity extraction and resolution
- Embedding generation
- Export to Postgres/Neo4j/Pinecone