# Metadata Sync: External Metadata API (Approach 2)

This notebook registers Neo4j node labels and relationship types as **external metadata**
objects in Unity Catalog using the [External Metadata API](https://docs.databricks.com/api/workspace/externalmetadata).
No data is copied — this is metadata-only registration for discoverability and lineage tracking.

**What this proves:** Neo4j schema metadata can be pushed into Unity Catalog's lineage
tracking system without materializing any data, enabling governance teams to see what
graph data exists and how it connects to downstream assets.

### Steps

Configuration (secrets plus the auto-discovered workspace URL and token) is loaded first. The numbered steps below match the section headings:

1. Verify Neo4j connectivity
2. Discover Neo4j schema (labels, relationships, properties, types)
3. Register a single label via External Metadata API (test)
4. Register all node labels
5. Register relationship types
6. List all registered metadata

An optional cleanup cell at the end deletes the registered objects.

### Prerequisites

- `neo4j-uc-creds` secret scope configured via `setup.sh`
- Neo4j Python driver installed (`neo4j`)
- Current user has `CREATE_EXTERNAL_METADATA` privilege on the metastore

### About the External Metadata API

The API (Public Preview) registers metadata about external systems in Unity Catalog:
- **Endpoint:** `POST /api/2.0/lineage-tracking/external-metadata`
- **`system_type`:** We use `OTHER` (Neo4j is not in the enum)
- **`entity_type`:** Free-form string — we use `NodeLabel` and `RelationshipType`
- **`columns`:** List of property names (string array, no type info)
- **`properties`:** Key-value map where we encode property types and constraints
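
As a rough illustration of the fields listed above, a request body for a single node label might look like the sketch below. The `Movie` label and its properties are hypothetical placeholders, and the full set of accepted fields should be checked against the API reference.

```python
import json

# Hypothetical example values; nothing here is read from a real Neo4j instance.
example_payload = {
    "name": "Movie",
    "system_type": "OTHER",            # Neo4j is not in the system_type enum
    "entity_type": "NodeLabel",        # free-form string
    "columns": ["title", "released"],  # property names only, no type info
    "properties": {                    # types and constraints encoded as strings
        "neo4j.property.title.type": "STRING",
        "neo4j.property.title.mandatory": "true",
        "neo4j.property.released.type": "BIGINT",
    },
}

print(json.dumps(example_payload, indent=2))
```

Every value in `properties` is kept as a string here, which is why flags such as `mandatory` are written as `"true"` rather than a boolean.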

---

## Configuration

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================
# Neo4j credentials from Databricks Secrets.
# Workspace URL and auth token are auto-discovered from the notebook context.

import time
import re
import json
import requests
from collections import defaultdict
from neo4j import GraphDatabase

SCOPE_NAME = "neo4j-uc-creds"

# Neo4j credentials
NEO4J_HOST = dbutils.secrets.get(SCOPE_NAME, "host")
NEO4J_USER = dbutils.secrets.get(SCOPE_NAME, "user")
NEO4J_PASSWORD = dbutils.secrets.get(SCOPE_NAME, "password")

try:
    NEO4J_DATABASE = dbutils.secrets.get(SCOPE_NAME, "database")
except Exception:
    NEO4J_DATABASE = "neo4j"

NEO4J_BOLT_URI = f"neo4j+s://{NEO4J_HOST}"

# Auto-discover Databricks workspace URL and auth token
# Works on both Single-user and Shared clusters
try:
    # Try the modern approach first (works on Shared clusters)
    WORKSPACE_URL = spark.conf.get("spark.databricks.workspaceUrl")
    if not WORKSPACE_URL.startswith("https://"):
        WORKSPACE_URL = f"https://{WORKSPACE_URL}"
except Exception:
    # Fall back to notebook context (Single-user clusters)
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    WORKSPACE_URL = ctx.apiUrl().get()

try:
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    AUTH_TOKEN = ctx.apiToken().get()
except Exception as e:
    print(f"[FAIL] Could not auto-discover auth token: {e}")
    print("[INFO] Ensure you are running on a cluster with notebook context access.")
    raise

# API configuration
API_BASE = f"{WORKSPACE_URL}/api/2.0/lineage-tracking/external-metadata"
HEADERS = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json"
}

# Neo4j internal type → UC/SQL type mapping (for properties map)
TYPE_MAP = {
    "String": "STRING",
    "Long": "BIGINT",
    "Double": "DOUBLE",
    "Boolean": "BOOLEAN",
    "Date": "DATE",
    "LocalDateTime": "TIMESTAMP_NTZ",
    "DateTime": "TIMESTAMP",
    "StringArray": "ARRAY<STRING>",
    "LongArray": "ARRAY<BIGINT>",
    "DoubleArray": "ARRAY<DOUBLE>",
}

# Initialize cross-cell variables
discovered_labels = defaultdict(list)
discovered_relationships = defaultdict(lambda: {"properties": [], "patterns": []})
registered_ids = []

print("Configuration loaded:")
print(f"  Secret Scope: {SCOPE_NAME}")
print(f"  Neo4j Host: {NEO4J_HOST}")
print(f"  Bolt URI: {NEO4J_BOLT_URI}")
print(f"  Database: {NEO4J_DATABASE}")
print(f"  Workspace URL: {WORKSPACE_URL}")
print(f"  Auth Token: {'*' * 8} (auto-discovered from notebook context)")
print(f"  API Base: {API_BASE}")
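
The `TYPE_MAP` lookup used in later cells could be wrapped in a small helper. The sketch below is one possible shape, not the notebook's own logic: it assumes that a property reporting several different Neo4j types, or a type missing from the map, should fall back to `STRING`, whereas the registration cells simply take the first reported type. `to_uc_type` and `EXAMPLE_TYPE_MAP` are hypothetical names.

```python
# Subset of the notebook's TYPE_MAP, repeated so the sketch runs on its own.
EXAMPLE_TYPE_MAP = {
    "String": "STRING",
    "Long": "BIGINT",
    "Double": "DOUBLE",
    "Boolean": "BOOLEAN",
    "StringArray": "ARRAY<STRING>",
}


def to_uc_type(neo4j_types, default="STRING"):
    """Map a list of Neo4j type names to one UC/SQL type name (hypothetical helper)."""
    # Assumption: an empty or mixed type list is most safely represented as a string.
    if not neo4j_types or len(set(neo4j_types)) != 1:
        return default
    return EXAMPLE_TYPE_MAP.get(neo4j_types[0], default)


print(to_uc_type(["Long"]))            # BIGINT
print(to_uc_type(["Long", "String"]))  # STRING (mixed types)
print(to_uc_type(["Point"]))           # STRING (not in the map)
```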

---

## Step 1: Verify Neo4j Connectivity

In [None]:
# =============================================================================
# VERIFY NEO4J CONNECTIVITY
# =============================================================================

print("=" * 60)
print("VERIFY NEO4J CONNECTIVITY")
print("=" * 60)

try:
    with GraphDatabase.driver(NEO4J_BOLT_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
        driver.verify_connectivity()
        print("\n[PASS] Driver connectivity verified")

        with driver.session(database=NEO4J_DATABASE) as session:
            result = session.run("RETURN 1 AS test")
            record = result.single()
            print(f"[PASS] Query executed: RETURN 1 = {record['test']}")

            result = session.run("CALL dbms.components() YIELD name, versions RETURN name, versions")
            for record in result:
                print(f"[INFO] Connected to: {record['name']} {record['versions']}")

    print("\nStatus: PASS")

except Exception as e:
    print(f"\n[FAIL] Connection failed: {e}")
    print("\nStatus: FAIL")

---

## Step 2: Discover Neo4j Schema

In [None]:
# =============================================================================
# DISCOVER NEO4J SCHEMA
# =============================================================================
# Uses db.schema.nodeTypeProperties() and db.schema.relTypeProperties()
# Built-in procedures — no APOC required.
#
# Note: db.schema.relTypeProperties() only yields:
#   relType, propertyName, propertyTypes, mandatory
# sourceNodeLabels/targetNodeLabels are NOT available in Neo4j 5.x Aura.

print("=" * 60)
print("DISCOVER NEO4J SCHEMA")
print("=" * 60)

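# Reset state so that re-running this cell does not append duplicate properties
discovered_labels.clear()
discovered_relationships.clear()
multi_label_skipped = 0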

try:
    with GraphDatabase.driver(NEO4J_BOLT_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
        with driver.session(database=NEO4J_DATABASE) as session:
            # Node label properties
            print("\n[INFO] Running CALL db.schema.nodeTypeProperties()...")
            result = session.run("CALL db.schema.nodeTypeProperties()")
            for record in result:
                labels = record["nodeLabels"]
                if len(labels) != 1:
                    multi_label_skipped += 1
                    continue
                label = labels[0]
                # Ensure the label entry exists even if it has no properties
                # (mirrors the relationship handling below)
                discovered_labels[label]
                if record["propertyName"] is None:
                    continue
                discovered_labels[label].append({
                    "name": record["propertyName"],
                    "types": record["propertyTypes"],
                    "mandatory": record["mandatory"]
                })

            # Relationship type properties
            print("[INFO] Running CALL db.schema.relTypeProperties()...")
            result = session.run("CALL db.schema.relTypeProperties()")
            for record in result:
                raw = record["relType"]
                rel_type = re.sub(r'^:`|`$', '', raw)
                if record["propertyName"]:
                    discovered_relationships[rel_type]["properties"].append({
                        "name": record["propertyName"],
                        "types": record["propertyTypes"],
                        "mandatory": record["mandatory"]
                    })
                # Ensure the rel_type entry exists even if no properties
                discovered_relationships[rel_type]

    # Display
    print(f"\nNODE LABELS ({len(discovered_labels)} discovered):")
    print("-" * 50)
    for label, props in sorted(discovered_labels.items()):
        print(f"  {label}: {len(props)} properties")
        for p in props[:5]:
            types_str = ", ".join(p["types"])
            print(f"    - {p['name']}: {types_str}")
        if len(props) > 5:
            print(f"    ... and {len(props) - 5} more")

    if multi_label_skipped > 0:
        print(f"\n[WARN] Skipped {multi_label_skipped} multi-label node type entries")

    print(f"\nRELATIONSHIP TYPES ({len(discovered_relationships)} discovered):")
    print("-" * 50)
    for rel_type, info in sorted(discovered_relationships.items()):
        prop_count = len(info["properties"])
        print(f"  [:{rel_type}]  ({prop_count} properties)")

    print(f"\n[PASS] Schema discovery complete")

except Exception as e:
    print(f"\n[FAIL] Schema discovery failed: {e}")
    import traceback
    traceback.print_exc()
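
To see what the relationship-type cleanup in the cell above does, the sketch below applies the same regular expression to a few sample strings. It assumes `relType` values arrive wrapped as a colon followed by a backtick-quoted name, which is how the discovery cell treats them; the sample values are made up.

```python
import re


def clean_rel_type(raw):
    """Strip the leading ':`' and the trailing '`' from a relType value."""
    return re.sub(r'^:`|`$', '', raw)


# Made-up sample values in the format the discovery cell expects,
# plus one unwrapped value to show that it passes through unchanged.
for raw in [":`ACTED_IN`", ":`HAS PART`", "DIRECTED"]:
    print(f"{raw!r} -> {clean_rel_type(raw)!r}")
```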

---

## Step 3: Register One Label (Single Test)

Test the External Metadata API with a single label before registering everything.

In [None]:
# =============================================================================
# REGISTER ONE LABEL VIA EXTERNAL METADATA API
# =============================================================================

print("=" * 60)
print("REGISTER SINGLE LABEL (TEST)")
print("=" * 60)


def build_label_payload(label_name, properties):
    """Build External Metadata API payload for a node label."""
    columns = [p["name"] for p in properties if p["name"]]

    # Encode type info in the properties map
    props_map = {
        "neo4j.database": NEO4J_DATABASE,
        "neo4j.label": label_name,
        "neo4j.host": NEO4J_HOST,
        "neo4j.property_count": str(len(columns)),
    }
    for p in properties:
        if p["name"]:
            neo4j_type = p["types"][0] if p["types"] else "String"
            uc_type = TYPE_MAP.get(neo4j_type, "STRING")
            props_map[f"neo4j.property.{p['name']}.type"] = uc_type
            props_map[f"neo4j.property.{p['name']}.neo4j_type"] = neo4j_type
            if p["mandatory"]:
                props_map[f"neo4j.property.{p['name']}.mandatory"] = "true"

    return {
        "name": label_name,
        "system_type": "OTHER",
        "entity_type": "NodeLabel",
        "description": f"Neo4j :{label_name} node label ({len(columns)} properties)",
        "columns": columns,
        "url": NEO4J_BOLT_URI,
        "properties": props_map
    }


if not discovered_labels:
    print("\n[FAIL] No labels discovered — cannot proceed.")
    print("[INFO] Check the schema discovery cell above for errors.")
    test_label = None
else:
    test_label = sorted(discovered_labels.keys())[0]
    test_props = discovered_labels[test_label]

    payload = build_label_payload(test_label, test_props)
    print(f"\n[INFO] Registering label: {test_label}")
    print(f"[INFO] Payload:")
    print(json.dumps(payload, indent=2))

    # POST to External Metadata API
    resp = None
    try:
        resp = requests.post(API_BASE, headers=HEADERS, json=payload)
        resp.raise_for_status()
        result = resp.json()

        print(f"\n[PASS] Registered successfully")
        print(f"  ID: {result.get('id')}")
        print(f"  Name: {result.get('name')}")
        print(f"  Entity Type: {result.get('entity_type')}")
        print(f"  Created By: {result.get('created_by')}")
        print(f"  Created At: {result.get('create_time')}")

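        # Track the ID as soon as the POST succeeds so cleanup can find it
        # even if the verification GET below fails
        registered_ids.append(result["id"])

        # Verify with a GET
        verify_resp = requests.get(f"{API_BASE}/{result['id']}", headers=HEADERS)
        verify_resp.raise_for_status()
        print("\n[PASS] Verified via GET: object exists in UC")

    except requests.exceptions.HTTPError as e:
        print(f"\n[FAIL] API error: {e}")
        # e.response is whichever request failed (the POST or the verifying GET)
        if e.response is not None:
            print(f"  Response: {e.response.text[:200]}")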
    except Exception as e:
        print(f"\n[FAIL] {e}")

---

## Step 4: Register All Node Labels

In [None]:
# =============================================================================
# REGISTER ALL NODE LABELS
# =============================================================================

print("=" * 60)
print(f"REGISTER ALL NODE LABELS ({len(discovered_labels)} labels)")
print("=" * 60)

label_results = []

for label in sorted(discovered_labels.keys()):
    # Skip the test label we already registered in this session
    if label == test_label and registered_ids:
        label_results.append({"label": label, "status": "PASS (already registered)"})
        print(f"  [SKIP] {label} (already registered in test step)")
        continue

    props = discovered_labels[label]
    payload = build_label_payload(label, props)

    resp = None
    start = time.time()
    try:
        resp = requests.post(API_BASE, headers=HEADERS, json=payload)
        resp.raise_for_status()
        result = resp.json()
        elapsed = time.time() - start

        registered_ids.append(result["id"])
        label_results.append({
            "label": label,
            "id": result["id"],
            "columns": len(payload["columns"]),
            "time_s": round(elapsed, 2),
            "status": "PASS"
        })
        print(f"  [PASS] {label} ({len(payload['columns'])} properties, {elapsed:.2f}s)")

    except requests.exceptions.HTTPError as e:
        error_msg = resp.text[:80] if resp is not None else str(e)[:80]
        label_results.append({"label": label, "status": f"FAIL: {error_msg}"})
        print(f"  [FAIL] {label} — {error_msg}")
    except Exception as e:
        label_results.append({"label": label, "status": f"FAIL: {str(e)[:80]}"})
        print(f"  [FAIL] {label} — {e}")

# Summary
passed = [r for r in label_results if r["status"].startswith("PASS")]
failed = [r for r in label_results if not r["status"].startswith("PASS")]

print(f"\n" + "=" * 60)
print(f"NODE LABELS SUMMARY")
print(f"  Registered: {len(passed)}/{len(label_results)}")
if failed:
    print(f"  Failed: {', '.join(r['label'] for r in failed)}")
print(f"  Total external metadata IDs: {len(registered_ids)}")
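
The loop above issues one POST per label and records a failure on the first error. If transient errors turn out to be a problem, a retry wrapper along these lines could be used. This is a sketch, not part of the notebook: `post_with_retry`, the choice of retryable status codes (429 and 5xx), and the backoff values are all assumptions. The `send` argument stands in for a call such as `lambda: requests.post(API_BASE, headers=HEADERS, json=payload, timeout=30)`.

```python
import time


def post_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable response or attempts run out.

    send: zero-argument callable returning an object with a .status_code attribute.
    """
    for attempt in range(1, max_attempts + 1):
        resp = send()
        # Assumption: 429 and 5xx are transient; every other status is final.
        retryable = resp.status_code == 429 or resp.status_code >= 500
        if not retryable or attempt == max_attempts:
            return resp
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


class FakeResponse:
    """Stand-in for requests.Response so the sketch runs without a network."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two transient failures followed by a success.
responses = iter([FakeResponse(429), FakeResponse(503), FakeResponse(200)])
result = post_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result.status_code)
```

The caller still decides what to do with the final response, so the existing `raise_for_status()` handling can stay as it is.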

---

## Step 5: Register Relationship Types

In [None]:
# =============================================================================
# REGISTER RELATIONSHIP TYPES
# =============================================================================

print("=" * 60)
print(f"REGISTER RELATIONSHIP TYPES ({len(discovered_relationships)} types)")
print("=" * 60)

rel_results = []

for rel_type, info in sorted(discovered_relationships.items()):
    properties = info["properties"]

    columns = [p["name"] for p in properties if p["name"]]

    # Build properties map
    props_map = {
        "neo4j.database": NEO4J_DATABASE,
        "neo4j.relationship_type": rel_type,
        "neo4j.host": NEO4J_HOST,
        "neo4j.property_count": str(len(columns)),
    }

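    # Add property types and constraints (same keys as build_label_payload)
    for p in properties:
        if p["name"]:
            neo4j_type = p["types"][0] if p["types"] else "String"
            uc_type = TYPE_MAP.get(neo4j_type, "STRING")
            props_map[f"neo4j.property.{p['name']}.type"] = uc_type
            props_map[f"neo4j.property.{p['name']}.neo4j_type"] = neo4j_type
            if p["mandatory"]:
                props_map[f"neo4j.property.{p['name']}.mandatory"] = "true"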

    desc = f"Neo4j [:{rel_type}] relationship type ({len(columns)} properties)"

    payload = {
        "name": rel_type,
        "system_type": "OTHER",
        "entity_type": "RelationshipType",
        "description": desc,
        "columns": columns,
        "url": NEO4J_BOLT_URI,
        "properties": props_map
    }

    resp = None
    start = time.time()
    try:
        resp = requests.post(API_BASE, headers=HEADERS, json=payload)
        resp.raise_for_status()
        result = resp.json()
        elapsed = time.time() - start

        registered_ids.append(result["id"])
        rel_results.append({
            "type": rel_type,
            "id": result["id"],
            "columns": len(columns),
            "time_s": round(elapsed, 2),
            "status": "PASS"
        })
        print(f"  [PASS] {rel_type} ({len(columns)} properties, {elapsed:.2f}s)")

    except requests.exceptions.HTTPError as e:
        error_msg = resp.text[:80] if resp is not None else str(e)[:80]
        rel_results.append({"type": rel_type, "status": f"FAIL: {error_msg}"})
        print(f"  [FAIL] {rel_type} — {error_msg}")
    except Exception as e:
        rel_results.append({"type": rel_type, "status": f"FAIL: {str(e)[:80]}"})
        print(f"  [FAIL] {rel_type} — {e}")

# Summary
passed_rels = [r for r in rel_results if r["status"].startswith("PASS")]
failed_rels = [r for r in rel_results if not r["status"].startswith("PASS")]

print(f"\n" + "=" * 60)
print(f"RELATIONSHIP TYPES SUMMARY")
print(f"  Registered: {len(passed_rels)}/{len(rel_results)}")
if failed_rels:
    print(f"  Failed: {', '.join(r['type'] for r in failed_rels)}")

---

## Step 6: List All Registered Metadata

In [None]:
# =============================================================================
# LIST ALL REGISTERED EXTERNAL METADATA
# =============================================================================

print("=" * 60)
print("ALL REGISTERED EXTERNAL METADATA")
print("=" * 60)

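resp = None  # initialized here so the HTTPError handler never reads a stale response from an earlier cell
try:
    # Fetch all pages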
    all_items = []
    page_token = None

    while True:
        params = {"page_size": 100}
        if page_token:
            params["page_token"] = page_token

        resp = requests.get(API_BASE, headers=HEADERS, params=params)
        resp.raise_for_status()
        data = resp.json()

        items = data.get("external_metadata", [])
        all_items.extend(items)

        page_token = data.get("next_page_token")
        if not page_token:
            break

    # Filter to our Neo4j entries
    neo4j_items = [m for m in all_items if m.get("system_type") == "OTHER" and
                   m.get("entity_type") in ("NodeLabel", "RelationshipType")]

    if neo4j_items:
        display_rows = []
        for item in neo4j_items:
            display_rows.append({
                "name": item.get("name", ""),
                "entity_type": item.get("entity_type", ""),
                "columns": len(item.get("columns", [])),
                "id": item.get("id", "")[:12] + "...",
                "created_by": item.get("created_by", ""),
            })

        display_df = spark.createDataFrame(display_rows)
        display_df.show(50, truncate=False)

        node_count = len([i for i in neo4j_items if i.get("entity_type") == "NodeLabel"])
        rel_count = len([i for i in neo4j_items if i.get("entity_type") == "RelationshipType"])
        print(f"\n[INFO] Found {node_count} NodeLabel + {rel_count} RelationshipType entries")
    else:
        print("\n[WARN] No Neo4j external metadata objects found")
        print("[INFO] They may not have been registered yet, or may use different entity_type values.")

    print(f"\n[PASS] External Metadata API query successful")

except requests.exceptions.HTTPError as e:
    print(f"\n[FAIL] API error: {e}")
    if resp is not None:
        print(f"  Response: {resp.text[:200]}")
except Exception as e:
    print(f"\n[FAIL] {e}")
    import traceback
    traceback.print_exc()
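
The page-token loop above can be factored into a generator so the listing logic is testable without a workspace. The sketch below assumes the same response shape the cell relies on (an `external_metadata` list plus an optional `next_page_token`). `iter_external_metadata` and `fetch_page` are hypothetical names, and the fake pages are invented.

```python
def iter_external_metadata(fetch_page):
    """Yield items across all pages.

    fetch_page: callable taking a page token (or None for the first page)
    and returning the decoded JSON body of one list response.
    """
    page_token = None
    while True:
        data = fetch_page(page_token)
        yield from data.get("external_metadata", [])
        page_token = data.get("next_page_token")
        if not page_token:
            break


# Invented two-page response used in place of a real API call.
fake_pages = {
    None: {"external_metadata": [{"name": "Movie"}], "next_page_token": "t1"},
    "t1": {"external_metadata": [{"name": "Person"}]},
}

names = [item["name"] for item in iter_external_metadata(fake_pages.get)]
print(names)
```

In the notebook, `fetch_page` would wrap the `requests.get(API_BASE, headers=HEADERS, params=...)` call and return `resp.json()`.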

---

## Cleanup (Optional)

Uncomment and run the cell below to delete the external metadata objects created in the current session (the IDs collected in `registered_ids`).
These objects persist until explicitly deleted.

In [None]:
# =============================================================================
# CLEANUP — Uncomment to delete registered metadata
# =============================================================================

# print("=" * 60)
# print(f"DELETING {len(registered_ids)} EXTERNAL METADATA OBJECTS")
# print("=" * 60)
#
# deleted = 0
# failed_count = 0
# for obj_id in registered_ids:
#     try:
#         resp = requests.delete(f"{API_BASE}/{obj_id}", headers=HEADERS)
#         resp.raise_for_status()
#         deleted += 1
#         print(f"  [OK] Deleted {obj_id}")
#     except Exception as e:
#         failed_count += 1
#         print(f"  [FAIL] {obj_id} — {e}")
#
# print(f"\nDeleted: {deleted}, Failed: {failed_count}")

print("[INFO] Cleanup code is commented out. Uncomment and run to delete.")
print(f"[INFO] {len(registered_ids)} object IDs stored for cleanup")
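
`registered_ids` lives only in the notebook session, so it is empty again after a cluster restart. One possible workaround is to rebuild the list from a listing response, as sketched below. This assumes list responses include the `properties` map that was sent at registration, which should be verified against a real response. `select_ids_for_cleanup` is a hypothetical helper and the sample items are invented.

```python
def select_ids_for_cleanup(items, neo4j_host):
    """Return IDs of objects that look like they were registered for neo4j_host."""
    ids = []
    for item in items:
        if item.get("system_type") != "OTHER":
            continue
        if item.get("entity_type") not in ("NodeLabel", "RelationshipType"):
            continue
        # Match on the host recorded at registration to avoid deleting
        # objects that describe a different Neo4j instance.
        if item.get("properties", {}).get("neo4j.host") != neo4j_host:
            continue
        ids.append(item["id"])
    return ids


# Invented listing items; only the first one matches every condition.
sample_items = [
    {"id": "a1", "system_type": "OTHER", "entity_type": "NodeLabel",
     "properties": {"neo4j.host": "example-host"}},
    {"id": "b2", "system_type": "OTHER", "entity_type": "NodeLabel",
     "properties": {"neo4j.host": "another-host"}},
    {"id": "c3", "system_type": "OTHER", "entity_type": "Dashboard"},
]

print(select_ids_for_cleanup(sample_items, "example-host"))
```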

---

## Summary

### What We Did

1. **Discovered** Neo4j schema (labels, relationship types, properties, types) using
   built-in `db.schema.nodeTypeProperties()` and `db.schema.relTypeProperties()`

2. **Registered** each label and relationship type as an external metadata object in
   Unity Catalog via the External Metadata API

3. **Verified** all objects are visible via the API

### Approach 2 vs Approach 3 Comparison

| Aspect | Approach 2 (This Notebook) | Approach 3 (metadata_sync_delta) |
|--------|---------------------------|----------------------------------|
| Data copied | No | Yes (full materialization) |
| INFORMATION_SCHEMA visible | No (REST API only) | Yes |
| Catalog Explorer visible | No | Yes |
| Column types in UC | In properties map only | Full native types |
| SQL queryable | No | Yes |
| Access control | Object-level only | Table/column-level |
| Storage cost | None | Delta storage |
| Lineage tracking | Yes | Yes |
| Setup complexity | Lower | Higher (Spark Connector needed) |
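
Because Approach 2 keeps column types only in the properties map, a consumer has to decode them. A minimal sketch of that decoding follows, assuming the `neo4j.property.<name>.<attribute>` key convention used by `build_label_payload`. `decode_property_schema` is a hypothetical helper and the sample map is invented.

```python
def decode_property_schema(props_map):
    """Rebuild {property: {attribute: value}} from the flat properties map."""
    prefix = "neo4j.property."
    schema = {}
    for key, value in props_map.items():
        if not key.startswith(prefix):
            continue  # skips neo4j.database, neo4j.property_count, and similar keys
        # Split on the last dot so property names that contain dots survive.
        name, _, attribute = key[len(prefix):].rpartition(".")
        if name:
            schema.setdefault(name, {})[attribute] = value
    return schema


# Invented properties map in the shape produced by build_label_payload.
sample = {
    "neo4j.database": "neo4j",
    "neo4j.property_count": "2",
    "neo4j.property.title.type": "STRING",
    "neo4j.property.title.neo4j_type": "String",
    "neo4j.property.title.mandatory": "true",
    "neo4j.property.released.type": "BIGINT",
}

for name, attrs in decode_property_schema(sample).items():
    print(name, attrs)
```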

### Recommendation

Use **both approaches together**:
- **Approach 3** for high-value labels that need SQL access and Catalog Explorer visibility
- **Approach 2** for comprehensive metadata coverage of all labels and relationships