# Neo4j JDBC Unity Catalog Connection - Support Ticket

## Issue Summary

**Problem**: The Unity Catalog JDBC connection to Neo4j fails with a `Connection was closed before the operation completed` error, even though:
- Network connectivity works (TCP test passes)
- The Neo4j Python driver works
- The Neo4j Spark Connector works

**Error Location**: `com.databricks.safespark.jdbc.grpc_client.JdbcConnectClient.awaitWhileConnected`

This notebook provides a systematic test progression to isolate the failure point.
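
The idea of a test progression can be sketched as a small runner that executes checks in order and reports the first one that fails. This is only an illustration: `first_failure` is a hypothetical helper, not something the notebook defines, and the cells in this notebook print their own PASS/FAIL status instead.

```python
from typing import Callable, List, Optional, Tuple


def first_failure(tests: List[Tuple[str, Callable[[], None]]]) -> Optional[str]:
    """Run tests in order; return the name of the first that raises, else None."""
    for name, test in tests:
        try:
            test()
            print(f"[PASS] {name}")
        except Exception as exc:
            # Stop at the first failing layer: later layers depend on it.
            print(f"[FAIL] {name}: {exc}")
            return name
    return None
```

Ordering the checks from the lowest layer (TCP) to the highest (Unity Catalog) means the returned name points at the first layer that breaks.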

---

## Configuration

**Prerequisites**: Run `setup.sh` to configure Databricks secrets before running this notebook.

The setup script reads credentials from `.env` and stores them in the `neo4j-uc-creds` secret scope:
- `host` - Neo4j host
- `user` - Neo4j username
- `password` - Neo4j password
- `connection_name` - Unity Catalog connection name
- `jdbc_jar_path` - Path to JDBC JAR in UC Volume
- `database` - Neo4j database (optional, defaults to "neo4j")
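
The optional-secret fallback and the URL derivation can be sketched as two small functions. This is a minimal sketch under assumptions: `get_secret_or_default` and `build_jdbc_url` are hypothetical helpers (not part of `dbutils`), and the secret getter is passed in as an argument so the logic can be exercised outside Databricks.

```python
from typing import Callable


def get_secret_or_default(
    getter: Callable[[str, str], str],
    scope: str,
    key: str,
    default: str,
) -> str:
    """Return the secret value, or `default` if the lookup raises or is empty.

    `getter` is expected to behave like dbutils.secrets.get(scope, key).
    This helper is hypothetical and only illustrates the fallback logic.
    """
    try:
        value = getter(scope, key)
    except Exception:
        # A missing optional secret raises; fall back to the default.
        return default
    return value or default


def build_jdbc_url(host: str, database: str, sql_translation: bool = False) -> str:
    """Assemble the JDBC URL in the same shape the notebook uses."""
    url = f"jdbc:neo4j+s://{host}:7687/{database}"
    return f"{url}?enableSQLTranslation=true" if sql_translation else url
```

In the notebook this would be called as `get_secret_or_default(dbutils.secrets.get, SCOPE_NAME, "database", "neo4j")`.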

In [None]:
# =============================================================================
# CONFIGURATION - Loaded from Databricks Secrets
# =============================================================================
# Secrets are configured using setup.sh which creates scope "neo4j-uc-creds"
# with secrets: host, user, password, connection_name, jdbc_jar_path, database

SCOPE_NAME = "neo4j-uc-creds"

# Aura Connection Details (from secrets)
NEO4J_HOST = dbutils.secrets.get(SCOPE_NAME, "host")
NEO4J_USER = dbutils.secrets.get(SCOPE_NAME, "user")
NEO4J_PASSWORD = dbutils.secrets.get(SCOPE_NAME, "password")

# Database defaults to "neo4j" if the optional secret is not set
try:
    NEO4J_DATABASE = dbutils.secrets.get(SCOPE_NAME, "database")
except Exception:
    NEO4J_DATABASE = "neo4j"

# Unity Catalog Resources (from secrets)
JDBC_JAR_PATH = dbutils.secrets.get(SCOPE_NAME, "jdbc_jar_path")
UC_CONNECTION_NAME = dbutils.secrets.get(SCOPE_NAME, "connection_name")

# Derived URLs (no need to edit)
NEO4J_BOLT_URI = f"neo4j+s://{NEO4J_HOST}"
NEO4J_JDBC_URL = f"jdbc:neo4j+s://{NEO4J_HOST}:7687/{NEO4J_DATABASE}"
NEO4J_JDBC_URL_SQL = f"{NEO4J_JDBC_URL}?enableSQLTranslation=true"

print("Configuration loaded from Databricks Secrets:")
print(f"  Secret Scope: {SCOPE_NAME}")
print(f"  Neo4j Host: {NEO4J_HOST}")
print(f"  Bolt URI: {NEO4J_BOLT_URI}")
print(f"  JDBC URL: {NEO4J_JDBC_URL}")
print(f"  Connection Name: {UC_CONNECTION_NAME}")
print(f"  JAR Path: {JDBC_JAR_PATH}")

---

## Section 1: Environment Information

Capture cluster and runtime details for support context.

In [None]:
# Collect environment information
print("=" * 60)
print("ENVIRONMENT INFORMATION")
print("=" * 60)

# Spark version
print(f"\nSpark Version: {spark.version}")

# Databricks Runtime
try:
    dbr_version = spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
    print(f"Databricks Runtime: {dbr_version}")
except Exception:
    print("Databricks Runtime: Unable to determine")

# Python version
import sys
print(f"Python Version: {sys.version}")

# Check neo4j package
try:
    import neo4j
    print(f"Neo4j Python Driver: {neo4j.__version__}")
except ImportError:
    print("Neo4j Python Driver: NOT INSTALLED")

# Check that the JAR file exists in its parent directory
print(f"\nJDBC JAR Path: {JDBC_JAR_PATH}")
try:
    jar_dir, jar_name = JDBC_JAR_PATH.rsplit('/', 1)
    files = dbutils.fs.ls(jar_dir)
    # Compare names exactly so a similarly named file does not count as a match
    jar_found = any(f.name == jar_name for f in files)
    print(f"JAR File Exists: {jar_found}")
except Exception as e:
    print(f"JAR File Check Error: {e}")

---

## Section 2: Network Connectivity Test (TCP Layer)

**Expected Result**: PASS - Proves network path is open.
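
If `nc` is not available, the same TCP-level check can be approximated with Python's standard `socket` module. This is a minimal sketch, not the notebook's method: `tcp_port_open` is a hypothetical helper, and it checks reachability from wherever the code runs, which may differ from the environment in which the SQL UDF executes.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For the Neo4j host in this notebook the call would look like `tcp_port_open(NEO4J_HOST, 7687)`.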

In [None]:
# TCP connectivity test using netcat
print("=" * 60)
print("TEST: Network Connectivity (TCP)")
print("=" * 60)

spark.sql("""
CREATE OR REPLACE TEMPORARY FUNCTION connectionTest(host STRING, port STRING)
RETURNS STRING
LANGUAGE PYTHON AS $$
import subprocess
try:
    command = ['nc', '-zv', host, str(port)]
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = result.stdout.decode() + result.stderr.decode()
    if result.returncode == 0:
        status = "SUCCESS"
        message = f"Network connectivity to {host}:{port} is OPEN"
    else:
        status = "FAILURE"
        message = f"Cannot reach {host}:{port} - check firewall rules"
    return f"{status} (return_code={result.returncode}) | {message} | Details: {output.strip()}"
except Exception as e:
    return f"FAILURE (exception) | Error: {str(e)}"
$$
""")

result = spark.sql(f"SELECT connectionTest('{NEO4J_HOST}', '7687') AS result").collect()[0]['result']
print(f"\nResult: {result}")
print(f"\nStatus: {'PASS' if 'SUCCESS' in result else 'FAIL'}")

---

## Section 3: Neo4j Python Driver Test

**Expected Result**: PASS - Proves credentials work and Neo4j is accessible.

In [None]:
# Test Neo4j Python driver connectivity
print("=" * 60)
print("TEST: Neo4j Python Driver")
print("=" * 60)

from neo4j import GraphDatabase

try:
    # The context manager closes the driver even if a query below fails
    with GraphDatabase.driver(NEO4J_BOLT_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
        # Verify connectivity
        driver.verify_connectivity()
        print("\n[PASS] Driver connectivity verified")

        # Test simple query
        with driver.session() as session:
            result = session.run("RETURN 1 AS test")
            record = result.single()
            print(f"[PASS] Query executed: RETURN 1 = {record['test']}")

            # Get Neo4j version
            result = session.run("CALL dbms.components() YIELD name, versions RETURN name, versions")
            for record in result:
                print(f"[INFO] Connected to: {record['name']} {record['versions']}")

    print("\nStatus: PASS")

except Exception as e:
    print(f"\n[FAIL] Connection failed: {e}")
    print("\nStatus: FAIL")

---

## Section 4: Neo4j Spark Connector (Working Baseline)

**Expected Result**: PASS - This is our working baseline that proves Spark can communicate with Neo4j.

In [None]:
# Test Neo4j Spark Connector (known working method)
print("=" * 60)
print("TEST: Neo4j Spark Connector (org.neo4j.spark.DataSource)")
print("=" * 60)

try:
    df = spark.read.format("org.neo4j.spark.DataSource") \
        .option("url", NEO4J_BOLT_URI) \
        .option("authentication.type", "basic") \
        .option("authentication.basic.username", NEO4J_USER) \
        .option("authentication.basic.password", NEO4J_PASSWORD) \
        .option("query", "RETURN 'Spark Connector Works!' AS message, 1 AS value") \
        .load()
    
    print("\n[PASS] Spark Connector query executed successfully:")
    df.show(truncate=False)
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Spark Connector failed: {e}")
    print("\nStatus: FAIL")

---

## Section 5: Direct JDBC Tests (Bypassing Unity Catalog)

These tests use the Neo4j JDBC driver directly with Spark, **without** Unity Catalog's SafeSpark wrapper.

**Note**: Requires the JDBC JAR to be installed as a cluster library (not just in a UC Volume).

**Limitation Discovered**: Spark's JDBC data source wraps the `query` option in a subquery to infer the schema:
```sql
SELECT * FROM (your_query) SPARK_GEN_SUBQ_N WHERE 1=0
```
This breaks native Cypher even with the `FORCE_CYPHER` hint: the hint ends up inside the subquery, and the outer wrapper is still SQL.
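
The wrapping can be reproduced with plain string formatting to see where the hint ends up. This is an illustrative sketch only: `spark_schema_probe` is a hypothetical function that mimics the wrapper shown above, and the alias suffix is a placeholder for whatever Spark generates.

```python
def spark_schema_probe(user_query: str, n: int = 0) -> str:
    """Build the wrapper described above around a `query` option value.

    The alias suffix `n` is illustrative; Spark generates its own.
    """
    return f"SELECT * FROM ({user_query}) SPARK_GEN_SUBQ_{n} WHERE 1=0"


cypher = "/*+ NEO4J FORCE_CYPHER */ RETURN 1 AS test"
probe = spark_schema_probe(cypher)
print(probe)
# The statement the driver receives starts with SQL, not with the hint.
print(probe.startswith("SELECT"))
```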

**Schema Inference Issue**: With the `dbtable` option, Spark's schema inference returns `NullType()` for every column from Neo4j JDBC, which causes a `No column has been read prior to this call` error when the data is read. **Fix**: Use the `customSchema` option to specify column types explicitly.

**Workarounds**:
1. Use `dbtable` option with `customSchema` (required to avoid NullType inference)
2. Use `query` option with `customSchema` for SQL queries
3. Use Neo4j Spark Connector instead of JDBC (Section 4 - works without customSchema)
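
A `customSchema` string is a comma-separated list of `name TYPE` pairs, so it can be assembled from a mapping instead of being typed by hand. This is a minimal sketch: `build_custom_schema` is a hypothetical helper, and the column names and types in the usage note are examples that must be adjusted to the actual node properties.

```python
from typing import Dict


def build_custom_schema(columns: Dict[str, str]) -> str:
    """Join {column_name: spark_sql_type} into a customSchema string.

    Every name is backtick-quoted so names such as v$id are accepted;
    embedded backticks are doubled.
    """
    parts = []
    for name, spark_type in columns.items():
        quoted = "`" + name.replace("`", "``") + "`"
        parts.append(f"{quoted} {spark_type}")
    return ", ".join(parts)
```

For example, `build_custom_schema({"v$id": "STRING", "model": "STRING"})` produces a string that can be passed to `.option("customSchema", ...)`.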

In [None]:
# Direct JDBC - Using dbtable (reads Neo4j label as table, no subquery wrapping)
print("=" * 60)
print("TEST: Direct JDBC - dbtable option (reads label as table)")
print("=" * 60)
print(f"URL: {NEO4J_JDBC_URL_SQL}")

# Use dbtable to read a Neo4j label directly (no subquery wrapper)
# Replace 'Aircraft' with any label that exists in your Neo4j database
TEST_LABEL = "Aircraft"  # Change this to a label in your database

# IMPORTANT: customSchema is REQUIRED when using dbtable with Neo4j JDBC
# Without it, Spark schema inference returns NullType() for all columns,
# causing "No column has been read prior to this call" error when reading data.
# Adjust column names and types to match your actual Neo4j node properties.
# NOTE: Use backticks around column names with special characters (like $)
AIRCRAFT_SCHEMA = "`v$id` STRING, aircraft_id STRING, tail_number STRING, icao24 STRING, model STRING, operator STRING, manufacturer STRING"

try:
    df = spark.read.format("jdbc") \
        .option("url", NEO4J_JDBC_URL_SQL) \
        .option("driver", "org.neo4j.jdbc.Neo4jDriver") \
        .option("user", NEO4J_USER) \
        .option("password", NEO4J_PASSWORD) \
        .option("dbtable", TEST_LABEL) \
        .option("customSchema", AIRCRAFT_SCHEMA) \
        .load()
    
    print(f"\n[PASS] Direct JDBC dbtable '{TEST_LABEL}' read successfully:")
    print(f"Schema: {df.schema}")
    df.show(5, truncate=False)
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Direct JDBC dbtable failed: {e}")
    print("\nStatus: FAIL")
    print("\nNote: Ensure the label exists in Neo4j and JAR is installed as cluster library.")
    print("Also verify customSchema column names match your Neo4j node properties.")

In [None]:
# Direct JDBC - SQL Translation (SQL automatically converted to Cypher)
print("=" * 60)
print("TEST: Direct JDBC - SQL Translation")
print("=" * 60)
print(f"URL: {NEO4J_JDBC_URL_SQL}")

# customSchema supplies the column type explicitly instead of relying on Spark's schema inference
try:
    df = spark.read.format("jdbc") \
        .option("url", NEO4J_JDBC_URL_SQL) \
        .option("driver", "org.neo4j.jdbc.Neo4jDriver") \
        .option("user", NEO4J_USER) \
        .option("password", NEO4J_PASSWORD) \
        .option("query", "SELECT 1 AS value") \
        .option("customSchema", "value INT") \
        .load()
    
    print("\n[PASS] Direct JDBC (SQL translation) query executed:")
    df.show(truncate=False)
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Direct JDBC with SQL translation failed: {e}")
    print("\nStatus: FAIL")

In [None]:
# Direct JDBC - SQL Aggregate Query (COUNT)
print("=" * 60)
print("TEST: Direct JDBC - SQL Aggregate (COUNT)")
print("=" * 60)
print(f"URL: {NEO4J_JDBC_URL_SQL}")

# Aggregate functions work reliably with SQL translation
# SQL: SELECT COUNT(*) AS flight_count FROM Flight
# Cypher: MATCH (n:Flight) RETURN count(n) AS flight_count
try:
    df = spark.read.format("jdbc") \
        .option("url", NEO4J_JDBC_URL_SQL) \
        .option("driver", "org.neo4j.jdbc.Neo4jDriver") \
        .option("user", NEO4J_USER) \
        .option("password", NEO4J_PASSWORD) \
        .option("query", "SELECT COUNT(*) AS flight_count FROM Flight") \
        .option("customSchema", "flight_count LONG") \
        .load()

    print("\n[PASS] Direct JDBC SQL aggregate query executed:")
    df.show(truncate=False)
    print("\nStatus: PASS")

except Exception as e:
    print(f"\n[FAIL] Direct JDBC aggregate query failed: {e}")
    print("\nStatus: FAIL")
    print("\nNote: Ensure 'Flight' label exists in your Neo4j database, or change to a label that exists.")

In [None]:
# Direct JDBC - SQL JOIN Translation (NATURAL JOIN -> Cypher relationship)
print("=" * 60)
print("TEST: Direct JDBC - SQL JOIN Translation")
print("=" * 60)
print(f"URL: {NEO4J_JDBC_URL_SQL}")

# Neo4j JDBC translates SQL JOINs to Cypher relationship patterns:
# SQL:    SELECT COUNT(*) FROM Flight f NATURAL JOIN DEPARTS_FROM r NATURAL JOIN Airport a
# Cypher: MATCH (f:Flight)-[:DEPARTS_FROM]->(a:Airport) RETURN count(*) AS cnt
#
# See: https://neo4j.com/docs/jdbc-manual/current/sql2cypher/
try:
    df = spark.read.format("jdbc") \
        .option("url", NEO4J_JDBC_URL_SQL) \
        .option("driver", "org.neo4j.jdbc.Neo4jDriver") \
        .option("user", NEO4J_USER) \
        .option("password", NEO4J_PASSWORD) \
        .option("query", """SELECT COUNT(*) AS cnt
                           FROM Flight f
                           NATURAL JOIN DEPARTS_FROM r
                           NATURAL JOIN Airport a""") \
        .option("customSchema", "cnt LONG") \
        .load()

    print("\n[PASS] Direct JDBC SQL JOIN translation executed:")
    print("SQL JOINs translated to Cypher relationship pattern!")
    df.show(truncate=False)
    print("\nStatus: PASS")

except Exception as e:
    print(f"\n[FAIL] Direct JDBC JOIN translation failed: {e}")
    print("\nStatus: FAIL")
    print("\nNote: Requires Flight-[:DEPARTS_FROM]->Airport pattern in Neo4j.")
    print("Adjust labels/relationship types to match your graph model.")

---

## Section 6: Unity Catalog JDBC Connection

This section creates and tests the Unity Catalog JDBC connection, which uses the SafeSpark wrapper.

In [None]:
# Create Unity Catalog JDBC Connection
print("=" * 60)
print("SETUP: Create Unity Catalog JDBC Connection")
print("=" * 60)

# Drop existing connection
spark.sql(f"DROP CONNECTION IF EXISTS {UC_CONNECTION_NAME}")
print(f"Dropped existing connection (if any): {UC_CONNECTION_NAME}")

# Create connection with explicit driver class
# NOTE: customSchema must be in externalOptionsAllowList so that readers can pass it
# and set column types explicitly instead of relying on Spark schema inference
create_sql = f"""
CREATE CONNECTION {UC_CONNECTION_NAME} TYPE JDBC
ENVIRONMENT (
  java_dependencies '["{JDBC_JAR_PATH}"]'
)
OPTIONS (
  url '{NEO4J_JDBC_URL_SQL}',
  user '{NEO4J_USER}',
  password '{NEO4J_PASSWORD}',
  driver 'org.neo4j.jdbc.Neo4jDriver',
  externalOptionsAllowList 'dbtable,query,partitionColumn,lowerBound,upperBound,numPartitions,fetchSize,customSchema'
)
"""

try:
    spark.sql(create_sql)
    print(f"\n[PASS] Connection created: {UC_CONNECTION_NAME}")
except Exception as e:
    print(f"\n[FAIL] Failed to create connection: {e}")

In [None]:
# Verify connection configuration
print("=" * 60)
print("VERIFY: Connection Configuration")
print("=" * 60)

try:
    df = spark.sql(f"DESCRIBE CONNECTION {UC_CONNECTION_NAME}")
    print("\nConnection details:")
    df.show(truncate=False)
except Exception as e:
    print(f"\n[FAIL] Cannot describe connection: {e}")

---

## Section 7: Unity Catalog JDBC Tests

These tests use the Unity Catalog connection through the SafeSpark JDBC wrapper.

In [None]:
# Test UC Connection via Spark DataFrame API
print("=" * 60)
print("TEST: Unity Catalog - Spark DataFrame API")
print("=" * 60)

try:
    df = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("query", "SELECT 1 AS test") \
        .load()
    
    print("\n[PASS] Unity Catalog Spark DataFrame API:")
    df.show()
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Unity Catalog Spark DataFrame API failed:")
    print(f"\nError: {e}")
    print("\nStatus: FAIL")

In [None]:
# Test UC Connection with native Cypher (FORCE_CYPHER hint)
print("=" * 60)
print("TEST: Unity Catalog - Native Cypher (FORCE_CYPHER)")
print("=" * 60)

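# NOTE: Spark wraps the query option in a subquery for schema inference:
#   SELECT * FROM (your_query) SPARK_GEN_SUBQ_N WHERE 1=0
# This breaks native Cypher (see Section 5). customSchema is supplied to set the
# column type explicitly, but the outer wrapper is still SQL, so this test may still fail.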

try:
    df = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("query", "/*+ NEO4J FORCE_CYPHER */ RETURN 1 AS test") \
        .option("customSchema", "test INT") \
        .load()
    
    print("\n[PASS] Unity Catalog with FORCE_CYPHER:")
    df.show()
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Unity Catalog with FORCE_CYPHER failed:")
    print(f"\nError: {e}")
    print("\nStatus: FAIL")

In [None]:
# Test UC Connection via remote_query() function
print("=" * 60)
print("TEST: Unity Catalog - remote_query() Function")
print("=" * 60)

try:
    df = spark.sql(f"""
        SELECT * FROM remote_query(
            '{UC_CONNECTION_NAME}',
            query => 'SELECT 1 AS test'
        )
    """)
    
    print("\n[PASS] Unity Catalog remote_query():")
    df.show()
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Unity Catalog remote_query() failed:")
    print(f"\nError: {e}")
    print("\nStatus: FAIL")

In [None]:
# Test UC Connection with SQL Aggregate Query using Custom Schema
print("=" * 60)
print("TEST: Unity Catalog - SQL Aggregate with Custom Schema")
print("=" * 60)

# customSchema for Neo4j JDBC
# ============================================
# Spark's automatic schema inference wraps queries in a subquery:
#   SELECT * FROM (your_query) SPARK_GEN_SUBQ_N WHERE 1=0
# Neo4j JDBC returns NullType() for all columns during inference,
# causing "No column has been read prior to this call" errors when reading data.
#
# Attempted workaround: use customSchema to define column types explicitly:
# - Column names MUST match query result aliases exactly
# - Use Spark SQL types: STRING, LONG, INT, DOUBLE, BOOLEAN, DECIMAL(p,s), etc.
# - Partial schemas are allowed: unspecified columns use default inference
#
# Result so far: this workaround also failed when run through the Unity Catalog connection.
#
# Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

# Define schema for aggregate query result
FLIGHT_COUNT_SCHEMA = "flight_count LONG"

try:
    df = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("query", "SELECT COUNT(*) AS flight_count FROM Flight") \
        .option("customSchema", FLIGHT_COUNT_SCHEMA) \
        .load()
    
    print("\n[PASS] Unity Catalog SQL Aggregate with customSchema:")
    print(f"Schema applied: {FLIGHT_COUNT_SCHEMA}")
    print(f"DataFrame schema: {df.schema}")
    df.show(truncate=False)
    print("\nStatus: PASS")
    
except Exception as e:
    print(f"\n[FAIL] Unity Catalog SQL aggregate query failed:")
    print(f"\nError: {e}")
    print("\nStatus: FAIL")
    print("\nNote: Ensure 'Flight' label exists in your Neo4j database.")
    print("Adjust the label name to match your graph model if needed.")

---

## Section 8: Neo4j Schema Synchronization with Unity Catalog (Proof of Concept)

**Purpose**: Demonstrate how Neo4j graph schema can be discovered via JDBC DatabaseMetaData and mapped to Unity Catalog objects.

**Approach**:
1. **Phase 1**: Discover Neo4j schema using JDBC `DatabaseMetaData` API via `spark._jvm`
2. **Phase 2**: Test three approaches for creating Unity Catalog objects from discovered schema
3. **Phase 3**: Verify queries work through UC governance

**Key Insight**: Unity Catalog Foreign Catalogs support only specific databases (PostgreSQL, MySQL, etc.), not generic JDBC. We must therefore create UC views/tables backed by the JDBC connection manually.

**Options Tested**:
- **Option A**: Views with inferred schema (schema discovered at query time)
- **Option B**: Tables with explicit schema (schema from in-memory metadata)
- **Option C**: Hybrid approach (schema registry table + views)

**Reference**: 
- [Neo4j JDBC Manual](https://neo4j.com/docs/jdbc-manual/current/)
- [META.md](../META.md) - Full proposal document
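
The relationship discovery in Phase 1 relies on parsing the `REMARKS` column, which the discovery cell assumes has the form `FromLabel\nREL_TYPE\nToLabel`. That parsing step can be isolated and checked without a database connection. This is a sketch under that assumption: `parse_relationship` is a hypothetical helper that mirrors the logic of the discovery cell, and the remarks format should be verified against the installed driver version.

```python
from typing import Dict, Optional


def parse_relationship(table_name: str, remarks: Optional[str]) -> Dict[str, Optional[str]]:
    """Split REMARKS into from/type/to, falling back to the table name.

    Assumes the "FromLabel\\nREL_TYPE\\nToLabel" layout described in the
    discovery cell; anything else yields a pattern without endpoint labels.
    """
    parts = remarks.split("\n") if remarks else []
    if len(parts) >= 3:
        return {
            "from_label": parts[0],
            "type": parts[1],
            "to_label": parts[2],
            "table_name": table_name,
        }
    return {
        "from_label": None,
        "type": table_name,
        "to_label": None,
        "table_name": table_name,
    }
```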

In [None]:
# =============================================================================
# PHASE 1: Schema Discovery via JDBC DatabaseMetaData
# =============================================================================
# Use spark._jvm (Py4J gateway) to access JDBC DatabaseMetaData directly.
# This discovers all labels, properties, and relationships without hardcoding.
print("=" * 70)
print("PHASE 1: JDBC DatabaseMetaData Schema Discovery")
print("=" * 70)

def get_neo4j_schema_via_jdbc(spark, jdbc_url, user, password):
    """
    Discover Neo4j schema using JDBC DatabaseMetaData API.
    
    Returns dict with 'labels' and 'relationships' containing full schema info.
    """
    schema = {"labels": {}, "relationships": []}
    
    # Access JVM gateway
    jvm = spark._jvm
    gateway = spark._sc._gateway
    
    # Create JDBC connection
    props = jvm.java.util.Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    
    connection = None
    try:
        connection = jvm.java.sql.DriverManager.getConnection(jdbc_url, props)
        metadata = connection.getMetaData()
        
        print(f"\n[INFO] Connected to: {metadata.getDatabaseProductName()} {metadata.getDatabaseProductVersion()}")
        print(f"[INFO] JDBC Driver: {metadata.getDriverName()} {metadata.getDriverVersion()}")
        
        # --- Discover Labels (Tables) ---
        print("\n[INFO] Discovering node labels via getTables(TABLE)...")
        types_array = gateway.new_array(gateway.jvm.java.lang.String, 1)
        types_array[0] = "TABLE"
        
        rs = metadata.getTables(None, None, None, types_array)
        while rs.next():
            label_name = rs.getString("TABLE_NAME")
            schema["labels"][label_name] = {"columns": [], "primary_key": None}
        rs.close()
        
        print(f"[INFO] Found {len(schema['labels'])} labels")
        
        # --- Discover Columns for each Label ---
        print("\n[INFO] Discovering properties via getColumns()...")
        for label_name in schema["labels"]:
            rs = metadata.getColumns(None, None, label_name, None)
            while rs.next():
                col_info = {
                    "name": rs.getString("COLUMN_NAME"),
                    "type_name": rs.getString("TYPE_NAME"),
                    "sql_type": rs.getInt("DATA_TYPE"),
                    "nullable": rs.getString("IS_NULLABLE") == "YES",
                    "is_generated": rs.getString("IS_GENERATEDCOLUMN") == "YES"
                }
                schema["labels"][label_name]["columns"].append(col_info)
            rs.close()
            
            # Get primary key
            rs = metadata.getPrimaryKeys(None, None, label_name)
            while rs.next():
                schema["labels"][label_name]["primary_key"] = rs.getString("COLUMN_NAME")
            rs.close()
        
        # --- Discover Relationships ---
        print("\n[INFO] Discovering relationships via getTables(RELATIONSHIP)...")
        types_array[0] = "RELATIONSHIP"
        rs = metadata.getTables(None, None, None, types_array)
        while rs.next():
            rel_name = rs.getString("TABLE_NAME")
            remarks = rs.getString("REMARKS") or ""
            # Parse remarks to get from/to labels (format: "FromLabel\nREL_TYPE\nToLabel")
            parts = remarks.split("\n") if remarks else []
            if len(parts) >= 3:
                schema["relationships"].append({
                    "from_label": parts[0],
                    "type": parts[1],
                    "to_label": parts[2],
                    "table_name": rel_name
                })
            else:
                schema["relationships"].append({
                    "from_label": None,
                    "type": rel_name,
                    "to_label": None,
                    "table_name": rel_name
                })
        rs.close()
        
        print(f"[INFO] Found {len(schema['relationships'])} relationship patterns")
        
    finally:
        if connection:
            connection.close()
    
    return schema

# Execute schema discovery
try:
    # First, trigger driver loading with a minimal query
    _ = spark.read.format("jdbc") \
        .option("url", NEO4J_JDBC_URL_SQL) \
        .option("driver", "org.neo4j.jdbc.Neo4jDriver") \
        .option("user", NEO4J_USER) \
        .option("password", NEO4J_PASSWORD) \
        .option("query", "SELECT 1 AS result") \
        .option("customSchema", "result INT") \
        .load().take(1)
    
    # Now discover schema
    NEO4J_SCHEMA = get_neo4j_schema_via_jdbc(spark, NEO4J_JDBC_URL_SQL, NEO4J_USER, NEO4J_PASSWORD)
    print("\n[PASS] Schema discovery completed successfully")
    
except Exception as e:
    print(f"\n[FAIL] Schema discovery failed: {e}")
    NEO4J_SCHEMA = {"labels": {}, "relationships": []}

In [None]:
# =============================================================================
# Display In-Memory Schema Model
# =============================================================================
print("=" * 70)
print("IN-MEMORY SCHEMA MODEL")
print("=" * 70)

if NEO4J_SCHEMA and NEO4J_SCHEMA["labels"]:
    # Display labels with columns
    print(f"\nNODE LABELS ({len(NEO4J_SCHEMA['labels'])} discovered):")
    print("-" * 60)
    
    for label_name, label_info in NEO4J_SCHEMA["labels"].items():
        columns = label_info["columns"]
        pk = label_info["primary_key"]
        
        print(f"\n  {label_name}:")
        print(f"    Primary Key: {pk or '(none)'}")
        print(f"    Columns ({len(columns)}):")
        
        for col in columns[:8]:  # Show first 8 columns
            gen_marker = " [generated]" if col["is_generated"] else ""
            null_marker = " (nullable)" if col["nullable"] else ""
            print(f"      - {col['name']}: {col['type_name']}{null_marker}{gen_marker}")
        
        if len(columns) > 8:
            print(f"      ... and {len(columns) - 8} more columns")
    
    # Display relationships
    print(f"\nRELATIONSHIP PATTERNS ({len(NEO4J_SCHEMA['relationships'])} discovered):")
    print("-" * 60)
    
    for rel in NEO4J_SCHEMA["relationships"][:10]:  # Show first 10
        if rel["from_label"] and rel["to_label"]:
            print(f"  (:{rel['from_label']})-[:{rel['type']}]->(:{rel['to_label']})")
        else:
            print(f"  [:{rel['type']}] (pattern details unavailable)")
    
    if len(NEO4J_SCHEMA["relationships"]) > 10:
        print(f"  ... and {len(NEO4J_SCHEMA['relationships']) - 10} more patterns")
    
    # Summary statistics
    total_columns = sum(len(l["columns"]) for l in NEO4J_SCHEMA["labels"].values())
    print("\n" + "=" * 60)
    print(f"SUMMARY: {len(NEO4J_SCHEMA['labels'])} labels, {total_columns} total columns, {len(NEO4J_SCHEMA['relationships'])} relationships")
    
else:
    print("\n[WARN] No schema discovered. Run the schema discovery cell first.")

In [None]:
# =============================================================================
# OPTION A: Create Views with Inferred Schema
# =============================================================================
# Create Unity Catalog views that query through JDBC connection.
# Schema is inferred from JDBC ResultSetMetaData at query time.
print("=" * 70)
print("OPTION A: Views with Inferred Schema")
print("=" * 70)

# Select first label for testing
if NEO4J_SCHEMA and NEO4J_SCHEMA["labels"]:
    TEST_LABEL_A = list(NEO4J_SCHEMA["labels"].keys())[0]
else:
    TEST_LABEL_A = "Aircraft"  # Fallback

VIEW_NAME_A = f"neo4j_view_{TEST_LABEL_A.lower()}"

print(f"\n[INFO] Creating view for label: {TEST_LABEL_A}")
print(f"[INFO] View name: {VIEW_NAME_A}")

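# Generate and display the DDL.
# NOTE: read_files() reads files from storage and does not accept a JDBC
# connection, so the DDL uses remote_query() (as in Section 7) instead.
# It is displayed only; the DataFrame approach in the next cell is what runs.
view_ddl_a = f"""
CREATE OR REPLACE VIEW {VIEW_NAME_A} AS
SELECT * FROM remote_query(
    '{UC_CONNECTION_NAME}',
    query => 'SELECT * FROM {TEST_LABEL_A}'
)
"""

print("\n[INFO] Generated DDL (not executed):")
print(view_ddl_a)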

In [None]:
# Alternative approach using temp view + DataFrame
print("\n[INFO] Approach: Create view via DataFrame registration")

try:
    # Read data through UC connection
    df_a = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("dbtable", TEST_LABEL_A) \
        .load()
    
    # Register as temp view (for testing - in production use permanent view)
    df_a.createOrReplaceTempView(VIEW_NAME_A)
    
    print(f"\n[PASS] Temp view '{VIEW_NAME_A}' created")
    
    # Show inferred schema
    print("\n[INFO] Inferred Schema:")
    df_a.printSchema()
    
    # Test query
    print("\n[INFO] Sample data (LIMIT 3):")
    spark.sql(f"SELECT * FROM {VIEW_NAME_A} LIMIT 3").show(truncate=False)
    
    # Verify via DESCRIBE
    print("\n[INFO] DESCRIBE output:")
    spark.sql(f"DESCRIBE {VIEW_NAME_A}").show(truncate=False)
    
    OPTION_A_SUCCESS = True
    print("\n[PASS] Option A completed successfully")
    
except Exception as e:
    print(f"\n[FAIL] Option A failed: {e}")
    OPTION_A_SUCCESS = False

In [None]:
# =============================================================================
# OPTION B: Tables with Explicit Schema
# =============================================================================
# Create Unity Catalog table with explicit column definitions from discovered schema.
print("=" * 70)
print("OPTION B: Tables with Explicit Schema")
print("=" * 70)

# Select second label (or same if only one)
labels_list = list(NEO4J_SCHEMA["labels"].keys()) if NEO4J_SCHEMA else []
TEST_LABEL_B = labels_list[1] if len(labels_list) > 1 else (labels_list[0] if labels_list else "Airport")

TABLE_NAME_B = f"neo4j_table_{TEST_LABEL_B.lower()}"

print(f"\n[INFO] Creating table for label: {TEST_LABEL_B}")
print(f"[INFO] Table name: {TABLE_NAME_B}")

# Build explicit schema from discovered metadata
def build_spark_schema_string(label_name, schema_dict):
    """Convert discovered schema to Spark SQL schema string."""
    if label_name not in schema_dict.get("labels", {}):
        return None
    
    columns = schema_dict["labels"][label_name]["columns"]
    
    # Map JDBC types to Spark SQL types
    type_mapping = {
        "STRING": "STRING",
        "VARCHAR": "STRING",
        "INTEGER": "INT",
        "BIGINT": "LONG",
        "LONG": "LONG",
        "DOUBLE": "DOUBLE",
        "FLOAT": "FLOAT",
        "BOOLEAN": "BOOLEAN",
        "DATE": "DATE",
        "TIMESTAMP": "TIMESTAMP",
    }
    
    col_defs = []
    for col in columns:
        col_name = col["name"]
        # Backtick-quote any name that is not a plain identifier (e.g. v$id)
        if not col_name.isidentifier():
            col_name = f"`{col_name}`"
        spark_type = type_mapping.get(col["type_name"].upper(), "STRING")
        col_defs.append(f"{col_name} {spark_type}")
    
    return ", ".join(col_defs)

# Generate schema string
schema_string_b = build_spark_schema_string(TEST_LABEL_B, NEO4J_SCHEMA)

if schema_string_b:
    print(f"\n[INFO] Generated customSchema:")
    print(f"  {schema_string_b[:100]}..." if len(schema_string_b) > 100 else f"  {schema_string_b}")
else:
    print(f"\n[WARN] Could not build schema for {TEST_LABEL_B}, using fallback")
    schema_string_b = "`v$id` STRING"  # Minimal fallback

try:
    # Read with explicit schema
    df_b = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("dbtable", TEST_LABEL_B) \
        .option("customSchema", schema_string_b) \
        .load()
    
    # Register as temp view (simulating table creation)
    df_b.createOrReplaceTempView(TABLE_NAME_B)
    
    print(f"\n[PASS] Temp view '{TABLE_NAME_B}' created with explicit schema")
    
    # Show schema (should match our definition)
    print("\n[INFO] Applied Schema:")
    df_b.printSchema()
    
    # Test query
    print("\n[INFO] Sample data (LIMIT 3):")
    spark.sql(f"SELECT * FROM {TABLE_NAME_B} LIMIT 3").show(truncate=False)
    
    OPTION_B_SUCCESS = True
    print("\n[PASS] Option B completed successfully")
    
except Exception as e:
    print(f"\n[FAIL] Option B failed: {e}")
    OPTION_B_SUCCESS = False

In [None]:
# =============================================================================
# OPTION C: Hybrid Approach with Schema Registry
# =============================================================================
# Create a schema registry table storing discovered metadata + views for data access.
print("=" * 70)
print("OPTION C: Hybrid Approach with Schema Registry")
print("=" * 70)

REGISTRY_TABLE = "neo4j_schema_registry"

# Select the third label, falling back to the first label, then to "Flight"
labels_list = list((NEO4J_SCHEMA or {}).get("labels", {}))
if len(labels_list) > 2:
    TEST_LABEL_C = labels_list[2]
elif labels_list:
    TEST_LABEL_C = labels_list[0]
else:
    TEST_LABEL_C = "Flight"
VIEW_NAME_C = f"neo4j_hybrid_{TEST_LABEL_C.lower()}"

print(f"\n[INFO] Creating schema registry: {REGISTRY_TABLE}")
print(f"[INFO] Creating view for label: {TEST_LABEL_C}")

try:
    # --- Step 1: Create Schema Registry Table ---
    # Build registry data from discovered schema
    registry_rows = []
    
    from datetime import datetime
    discovered_at = datetime.now().isoformat()
    
    for label_name, label_info in (NEO4J_SCHEMA or {}).get("labels", {}).items():
        for col in label_info["columns"]:
            registry_rows.append({
                "label_name": label_name,
                "column_name": col["name"],
                "column_type": col["type_name"],
                "is_nullable": col["nullable"],
                "is_generated": col["is_generated"],
                "is_primary_key": col["name"] == label_info.get("primary_key"),
                "discovered_at": discovered_at
            })
    
    # Add relationship patterns
    for rel in (NEO4J_SCHEMA or {}).get("relationships", []):
        registry_rows.append({
            "label_name": f"[REL] {rel['type']}",
            "column_name": f"({rel.get('from_label', '?')})->({rel.get('to_label', '?')})",
            "column_type": "RELATIONSHIP",
            "is_nullable": False,
            "is_generated": False,
            "is_primary_key": False,
            "discovered_at": discovered_at
        })
    
    # Create DataFrame and register as table
    if registry_rows:
        df_registry = spark.createDataFrame(registry_rows)
        df_registry.createOrReplaceTempView(REGISTRY_TABLE)
        
        print(f"\n[PASS] Schema registry '{REGISTRY_TABLE}' created with {len(registry_rows)} entries")
        
        # Show registry contents
        print("\n[INFO] Schema Registry Contents (sample):")
        spark.sql(f"""
            SELECT label_name, column_name, column_type, is_primary_key 
            FROM {REGISTRY_TABLE} 
            WHERE column_type != 'RELATIONSHIP'
            LIMIT 10
        """).show(truncate=False)
        
        # Show relationship patterns
        print("\n[INFO] Relationship Patterns in Registry:")
        spark.sql(f"""
            SELECT label_name as relationship, column_name as pattern
            FROM {REGISTRY_TABLE} 
            WHERE column_type = 'RELATIONSHIP'
            LIMIT 5
        """).show(truncate=False)
    else:
        print("\n[WARN] No schema data to populate registry")
    
    # --- Step 2: Create View for Data Access ---
    df_c = spark.read.format("jdbc") \
        .option("databricks.connection", UC_CONNECTION_NAME) \
        .option("dbtable", TEST_LABEL_C) \
        .load()
    
    df_c.createOrReplaceTempView(VIEW_NAME_C)
    
    print(f"\n[PASS] View '{VIEW_NAME_C}' created")
    
    # Test query
    print("\n[INFO] Sample data from view (LIMIT 3):")
    spark.sql(f"SELECT * FROM {VIEW_NAME_C} LIMIT 3").show(truncate=False)
    
    # --- Step 3: Demonstrate combined usage ---
    print("\n[INFO] Combined Query - Schema + Data:")
    print(f"  Registry shows {TEST_LABEL_C} has these columns:")
    spark.sql(f"""
        SELECT column_name, column_type 
        FROM {REGISTRY_TABLE} 
        WHERE label_name = '{TEST_LABEL_C}'
        LIMIT 5
    """).show(truncate=False)
    
    OPTION_C_SUCCESS = True
    print("\n[PASS] Option C completed successfully")
    
except Exception as e:
    print(f"\n[FAIL] Option C failed: {e}")
    import traceback
    traceback.print_exc()
    OPTION_C_SUCCESS = False
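
Building the registry DataFrame from a list of dicts leaves column types and column order to inference, which can fail when a column's values are all `None`. A sketch of an alternative that emits tuples in a fixed column order so an explicit DDL schema string can be supplied. `REGISTRY_COLUMNS`, `REGISTRY_DDL`, and `to_registry_tuples` are hypothetical names, the Spark types are assumptions, and the sample row is illustrative:

```python
# Column order and Spark types for the registry (assumed; mirrors the dict keys used above).
REGISTRY_COLUMNS = [
    ("label_name", "STRING"),
    ("column_name", "STRING"),
    ("column_type", "STRING"),
    ("is_nullable", "BOOLEAN"),
    ("is_generated", "BOOLEAN"),
    ("is_primary_key", "BOOLEAN"),
    ("discovered_at", "STRING"),
]
REGISTRY_DDL = ", ".join(f"{name} {spark_type}" for name, spark_type in REGISTRY_COLUMNS)


def to_registry_tuples(rows):
    """Convert registry dicts to tuples in REGISTRY_COLUMNS order (missing keys become None)."""
    return [tuple(row.get(name) for name, _ in REGISTRY_COLUMNS) for row in rows]


# Illustrative row only; real rows come from the discovered schema.
sample_rows = [{
    "label_name": "Flight", "column_name": "v$id", "column_type": "VARCHAR",
    "is_nullable": False, "is_generated": True, "is_primary_key": True,
    "discovered_at": "2024-01-01T00:00:00",
}]
registry_tuples = to_registry_tuples(sample_rows)
print(REGISTRY_DDL)
print(registry_tuples)
```

In the notebook this could replace the inferred-schema call with `spark.createDataFrame(to_registry_tuples(registry_rows), schema=REGISTRY_DDL)`.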

In [None]:
# =============================================================================
# VERIFICATION AND COMPARISON
# =============================================================================
print("=" * 70)
print("VERIFICATION AND COMPARISON")
print("=" * 70)

print("\n" + "=" * 70)
print("APPROACH COMPARISON SUMMARY")
print("=" * 70)

comparison_data = []

# Option A results
if 'OPTION_A_SUCCESS' in dir() and OPTION_A_SUCCESS:
    row_count_a = spark.sql(f"SELECT COUNT(*) as cnt FROM {VIEW_NAME_A}").collect()[0]["cnt"]
    comparison_data.append({
        "Approach": "Option A: Inferred Schema View",
        "Object": VIEW_NAME_A,
        "Status": "SUCCESS",
        "Row Count": row_count_a,
        "Schema Source": "Inferred at query time"
    })
else:
    comparison_data.append({
        "Approach": "Option A: Inferred Schema View",
        "Object": "N/A",
        "Status": "FAILED",
        "Row Count": 0,
        "Schema Source": "N/A"
    })

# Option B results
if 'OPTION_B_SUCCESS' in dir() and OPTION_B_SUCCESS:
    row_count_b = spark.sql(f"SELECT COUNT(*) as cnt FROM {TABLE_NAME_B}").collect()[0]["cnt"]
    comparison_data.append({
        "Approach": "Option B: Explicit Schema Table",
        "Object": TABLE_NAME_B,
        "Status": "SUCCESS",
        "Row Count": row_count_b,
        "Schema Source": "From DatabaseMetaData"
    })
else:
    comparison_data.append({
        "Approach": "Option B: Explicit Schema Table",
        "Object": "N/A",
        "Status": "FAILED",
        "Row Count": 0,
        "Schema Source": "N/A"
    })

# Option C results
if 'OPTION_C_SUCCESS' in dir() and OPTION_C_SUCCESS:
    row_count_c = spark.sql(f"SELECT COUNT(*) as cnt FROM {VIEW_NAME_C}").collect()[0]["cnt"]
    registry_count = spark.sql(f"SELECT COUNT(*) as cnt FROM {REGISTRY_TABLE}").collect()[0]["cnt"]
    comparison_data.append({
        "Approach": "Option C: Hybrid (Registry + View)",
        "Object": f"{REGISTRY_TABLE} + {VIEW_NAME_C}",
        "Status": "SUCCESS",
        "Row Count": row_count_c,
        "Schema Source": f"Registry ({registry_count} entries)"
    })
else:
    comparison_data.append({
        "Approach": "Option C: Hybrid (Registry + View)",
        "Object": "N/A",
        "Status": "FAILED",
        "Row Count": 0,
        "Schema Source": "N/A"
    })

# Display comparison
df_comparison = spark.createDataFrame(comparison_data)
df_comparison.show(truncate=False)

# Pros and Cons
print("\n" + "-" * 70)
print("PROS AND CONS")
print("-" * 70)

print("""
OPTION A (Inferred Schema Views):
  Pros:
    - Simplest to implement
    - Always reflects current Neo4j schema
    - No schema maintenance needed
  Cons:
    - Schema inference may fail for some types
    - Less control over column types
    - Schema not visible until query time

OPTION B (Explicit Schema Tables):
  Pros:
    - Full control over column types
    - Predictable schema
    - Avoids inference issues
  Cons:
    - Must regenerate when Neo4j schema changes
    - Requires schema discovery step
    - More code to maintain

OPTION C (Hybrid with Registry):
  Pros:
    - Schema metadata is queryable
    - Good for documentation/discovery
    - Views provide live data access
    - Combines live views with a queryable record of the discovered schema
  Cons:
    - Most complex to implement
    - Registry needs refresh mechanism
    - Two objects to manage per label
""")

print("\n[INFO] All approaches use the UC JDBC connection for governance")
print("[INFO] Choose based on your needs: simplicity (A), control (B), or visibility (C)")
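
The three result blocks above repeat the same success/failure shape. A sketch of one way to factor that into a helper, assuming the row count is supplied by a callable so the helper itself stays independent of Spark. `summarize_option` and `fake_count` are hypothetical names, and `neo4j_view_a` is a placeholder view name:

```python
def summarize_option(approach, succeeded, obj_name, schema_source, count_rows):
    """Return one comparison row; count_rows(obj_name) is called only on success."""
    if not succeeded:
        return {"Approach": approach, "Object": "N/A", "Status": "FAILED",
                "Row Count": 0, "Schema Source": "N/A"}
    return {"Approach": approach, "Object": obj_name, "Status": "SUCCESS",
            "Row Count": count_rows(obj_name), "Schema Source": schema_source}


def fake_count(name):
    """Stand-in for a Spark COUNT(*) query so the sketch runs without a cluster."""
    return {"neo4j_view_a": 42}.get(name, 0)


rows = [
    summarize_option("Option A: Inferred Schema View", True, "neo4j_view_a",
                     "Inferred at query time", fake_count),
    summarize_option("Option B: Explicit Schema Table", False, None,
                     "From DatabaseMetaData", fake_count),
]
for row in rows:
    print(row)
```

In the notebook, `count_rows` could be a small function wrapping the `SELECT COUNT(*)` query already used above.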

In [None]:
# =============================================================================
# OPTIONAL: Relationship Traversal View Test
# =============================================================================
# Test creating a view that traverses relationships using SQL JOINs.
print("=" * 70)
print("OPTIONAL: Relationship Traversal View Test")
print("=" * 70)

# Find a relationship pattern to test
test_pattern = None
for rel in (NEO4J_SCHEMA or {}).get("relationships", []):
    if rel.get("from_label") and rel.get("to_label"):
        test_pattern = rel
        break

if test_pattern:
    from_label = test_pattern["from_label"]
    rel_type = test_pattern["type"]
    to_label = test_pattern["to_label"]
    
    print(f"\n[INFO] Testing relationship pattern:")
    print(f"  (:{from_label})-[:{rel_type}]->(:{to_label})")
    
    # SQL JOIN that translates to Cypher relationship traversal
    join_sql = f"""
        SELECT COUNT(*) AS relationship_count
        FROM {from_label} f
        NATURAL JOIN {rel_type} r
        NATURAL JOIN {to_label} t
    """
    
    print(f"\n[INFO] SQL Query:")
    print(f"  {join_sql.strip()}")
    
    try:
        df_rel = spark.read.format("jdbc") \
            .option("databricks.connection", UC_CONNECTION_NAME) \
            .option("query", join_sql) \
            .option("customSchema", "relationship_count LONG") \
            .load()
        
        count = df_rel.collect()[0]["relationship_count"]
        print(f"\n[PASS] Relationship traversal works!")
        print(f"[INFO] Found {count} relationships of type {rel_type}")
        
        # Create a traversal view
        TRAVERSAL_VIEW = f"neo4j_rel_{rel_type.lower()}"
        
        # For the view, get actual data (limited)
        data_sql = f"""
            SELECT * FROM {from_label} f
            NATURAL JOIN {rel_type} r
            NATURAL JOIN {to_label} t
            LIMIT 100
        """
        
        df_traversal = spark.read.format("jdbc") \
            .option("databricks.connection", UC_CONNECTION_NAME) \
            .option("query", data_sql) \
            .load()
        
        df_traversal.createOrReplaceTempView(TRAVERSAL_VIEW)
        print(f"\n[PASS] Traversal view '{TRAVERSAL_VIEW}' created")
        
        print("\n[INFO] Sample traversal data:")
        spark.sql(f"SELECT * FROM {TRAVERSAL_VIEW} LIMIT 3").show(truncate=False)
        
    except Exception as e:
        print(f"\n[FAIL] Relationship traversal failed: {e}")
        print("[INFO] This may be because the relationship pattern does not exist in the data")
else:
    print("\n[INFO] No relationship patterns discovered to test")
    print("[INFO] Skipping relationship traversal test")
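
The traversal test interpolates label and relationship names directly into SQL. A sketch of a builder that validates those names first and produces the same `NATURAL JOIN` shape used above for a pattern `(:from_label)-[:rel_type]->(:to_label)`. `build_traversal_sql` and the identifier rule are assumptions for illustration; names with other characters would need quoting rather than rejection, and the example pattern is hypothetical:

```python
import re

# Assumed rule: plain identifiers only (letters, digits, underscore; no leading digit).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_traversal_sql(from_label, rel_type, to_label,
                        select="COUNT(*) AS relationship_count", limit=None):
    """Build the NATURAL JOIN query for one relationship pattern."""
    for name in (from_label, rel_type, to_label):
        if not _IDENT.match(name):
            raise ValueError(f"Unsupported identifier: {name!r}")
    sql = (
        f"SELECT {select} FROM {from_label} f "
        f"NATURAL JOIN {rel_type} r "
        f"NATURAL JOIN {to_label} t"
    )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


# Hypothetical pattern; real values come from the discovered relationships.
count_sql = build_traversal_sql("Flight", "DEPARTS_FROM", "Airport")
data_sql = build_traversal_sql("Flight", "DEPARTS_FROM", "Airport", select="*", limit=100)
print(count_sql)
print(data_sql)
```

Rejecting unexpected names up front keeps a malformed label from producing a confusing driver error later in the read.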

In [None]:
# =============================================================================
# CLEANUP
# =============================================================================
# Lists the temporary views created during testing and prints the
# DROP statements for them. This cell does not drop anything itself.
print("=" * 70)
print("CLEANUP")
print("=" * 70)

print("""
Temporary objects created during this demo:

Views/Tables:
""")

# List objects that may have been created
objects_to_clean = []
if 'VIEW_NAME_A' in dir():
    objects_to_clean.append(f"  - {VIEW_NAME_A} (Option A view)")
if 'TABLE_NAME_B' in dir():
    objects_to_clean.append(f"  - {TABLE_NAME_B} (Option B table)")
if 'REGISTRY_TABLE' in dir():
    objects_to_clean.append(f"  - {REGISTRY_TABLE} (Option C registry)")
if 'VIEW_NAME_C' in dir():
    objects_to_clean.append(f"  - {VIEW_NAME_C} (Option C view)")
if 'TRAVERSAL_VIEW' in dir():
    objects_to_clean.append(f"  - {TRAVERSAL_VIEW} (Relationship view)")

for obj in objects_to_clean:
    print(obj)

print("""
To clean up, copy the lines below into a new cell, uncomment them, and run:
""")

cleanup_commands = []
if 'VIEW_NAME_A' in dir():
    cleanup_commands.append(f"# spark.sql('DROP VIEW IF EXISTS {VIEW_NAME_A}')")
if 'TABLE_NAME_B' in dir():
    cleanup_commands.append(f"# spark.sql('DROP VIEW IF EXISTS {TABLE_NAME_B}')")
if 'REGISTRY_TABLE' in dir():
    cleanup_commands.append(f"# spark.sql('DROP VIEW IF EXISTS {REGISTRY_TABLE}')")
if 'VIEW_NAME_C' in dir():
    cleanup_commands.append(f"# spark.sql('DROP VIEW IF EXISTS {VIEW_NAME_C}')")
if 'TRAVERSAL_VIEW' in dir():
    cleanup_commands.append(f"# spark.sql('DROP VIEW IF EXISTS {TRAVERSAL_VIEW}')")

for cmd in cleanup_commands:
    print(cmd)

print("""
Note: These are temporary views that will be automatically dropped
when the Spark session ends. Run the statements above only to drop them sooner.
""")

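Because every object created in this demo is a temp view, cleanup reduces to one `DROP VIEW IF EXISTS` per name. A sketch that collects the view-name variables defined in a namespace and either returns or executes the statements. `drop_temp_views` is a hypothetical helper, `run_sql` stands in for `spark.sql`, and `neo4j_view_a` is a placeholder value:

```python
def drop_temp_views(namespace, variable_names, run_sql=None):
    """Build DROP statements for the view-name variables that exist in namespace.

    With run_sql=None the statements are only returned (dry run);
    otherwise each statement is also passed to run_sql.
    """
    statements = [
        f"DROP VIEW IF EXISTS {namespace[var]}"
        for var in variable_names
        if var in namespace
    ]
    if run_sql is not None:
        for stmt in statements:
            run_sql(stmt)
    return statements


# Stand-in namespace; in the notebook this would be globals().
demo_namespace = {"VIEW_NAME_A": "neo4j_view_a", "REGISTRY_TABLE": "neo4j_schema_registry"}
tracked = ["VIEW_NAME_A", "TABLE_NAME_B", "REGISTRY_TABLE", "VIEW_NAME_C", "TRAVERSAL_VIEW"]

executed = []
planned = drop_temp_views(demo_namespace, tracked)                 # dry run
drop_temp_views(demo_namespace, tracked, run_sql=executed.append)  # records instead of executing
print(planned)
```

In the notebook, the equivalent call would pass `globals()` as the namespace and `spark.sql` as `run_sql`.
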
### Section 8 Summary

This proof of concept demonstrated three approaches for synchronizing Neo4j graph schema with Unity Catalog:

| Approach | Method | Best For |
|----------|--------|----------|
| **Option A** | Views with inferred schema | Simple use cases, always-current schema |
| **Option B** | Tables with explicit schema | Predictable types, avoiding inference issues |
| **Option C** | Hybrid (registry + views) | Schema visibility, documentation, governance |

**Key Findings**:

1. **Schema Discovery Works**: The JDBC `DatabaseMetaData` API discovers Neo4j labels, properties, and relationship patterns, so none of them need to be hardcoded

2. **UC Integration Works**: All three approaches create queryable views backed by the Unity Catalog JDBC connection (temp views in this demo)

3. **Governance Applies**: Queries go through the UC connection, so they inherit its permissions and audit logging

4. **SQL-to-Cypher Translation**: SQL `NATURAL JOIN`s across a label, a relationship type, and a second label translate to Cypher relationship traversals

**Limitations**:

- Unity Catalog Foreign Catalogs don't support generic JDBC (only specific databases)
- Must manually create UC objects (no automatic schema sync)
- Temp views used for demo; production would use permanent tables/views

**Next Steps** (out of scope for this POC):

- Automated schema refresh mechanism
- Production table/view creation with proper catalog.schema paths
- Performance optimization for large graphs
- Error handling and recovery
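
Placing production objects under proper `catalog.schema` paths largely comes down to building three-level names safely. A sketch, assuming placeholder catalog and schema names (`main`, `neo4j_sync`) and a hypothetical helper `qualified_name`; none of these names come from this POC:

```python
def qualified_name(catalog, schema, obj):
    """Return a backtick-quoted three-level name: catalog.schema.object."""
    parts = (catalog, schema, obj)
    if not all(parts):
        raise ValueError("catalog, schema, and object name are all required")
    return ".".join("`" + p.replace("`", "``") + "`" for p in parts)


# Hypothetical targets; the catalog and schema names are placeholders.
labels = ["Flight", "Airport"]
targets = {
    label: qualified_name("main", "neo4j_sync", f"neo4j_{label.lower()}")
    for label in labels
}
for label, target in targets.items():
    print(f"{label} -> {target}")
```

Quoting each part separately keeps a label with unusual characters from being split into extra name levels.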

**References**:
- [META.md](../META.md) - Full proposal document
- [Neo4j JDBC Manual](https://neo4j.com/docs/jdbc-manual/current/)
- [Databricks JDBC Connection](https://docs.databricks.com/aws/en/connect/jdbc-connection)