# PySpark + Unity Catalog Local Demo

This notebook demonstrates the **databricks-local** shim — a 100% open-source
local development environment that provides Unity Catalog–style and DBUtils-compatible APIs.

> **Disclaimer:** This project is NOT affiliated with Databricks, Inc. See the [NOTICE](../NOTICE) file.

## 1. Setup — Inject Notebook Context

A single call to `inject_notebook_context` injects `spark`, `dbutils`, `display`, `sc`, and `uc` into the notebook globals,
so later cells can use these names without importing them, much as they would in a cloud workspace.

In [1]:
import sys, os
sys.path.insert(0, os.path.abspath(".."))

from databricks_shim import inject_notebook_context
inject_notebook_context("AnalysisDemo")

print(f"Spark version : {spark.version}")
print(f"Current catalog: {uc.get_current_catalog()}")

⚡ Initializing Local Spark — Databricks 16.4 LTS + Unity Catalog Emulator


26/02/18 12:28:55 WARN Utils: Your hostname, omar-Katana-GF76-12UE resolves to a loopback address: 127.0.1.1; using 192.168.1.40 instead (on interface wlo1)
26/02/18 12:28:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/omar/Documentos/DatabricksLocal/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/omar/.ivy2/cache
The jars for the packages stored in: /home/omar/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-10065207-5507-42ba-afa5-a99251fb148d;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.3.2 in central
	found io.delta#delta-storage;3.3.2 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 182ms :: artifacts dl 4ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.3.2 from central in [default]
	io.delta#delta-storage;3.3.2 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   | 

Spark version : 3.5.3
Current catalog: main
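
One plausible way for a helper like `inject_notebook_context` to make names available without imports is to write them into a namespace dictionary such as the caller's globals. The sketch below is an assumption about the mechanism, not the shim's actual code; `inject_names` and the stand-in objects are hypothetical.

```python
import inspect
from types import SimpleNamespace


def inject_names(objects, namespace=None):
    """Copy each name/object pair into a namespace (the caller's globals by default)."""
    if namespace is None:
        # Frame 1 is whoever called inject_names.
        namespace = inspect.stack()[1].frame.f_globals
    namespace.update(objects)
    return sorted(objects)


# Hypothetical stand-ins for the real session objects.
fake_session = {
    "spark": SimpleNamespace(version="3.5.3"),
    "uc": SimpleNamespace(get_current_catalog=lambda: "main"),
}

notebook_globals = {}
injected = inject_names(fake_session, notebook_globals)
print(injected)
print(notebook_globals["spark"].version)
```

Passing an explicit dictionary, as here, keeps the example testable; a notebook helper would more likely target the interactive namespace.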


## 2. Unity Catalog — Catalogs, Schemas & Volumes

In [2]:
# Create a catalog and schema
uc.sql("CREATE CATALOG IF NOT EXISTS analytics")
uc.sql("CREATE SCHEMA IF NOT EXISTS analytics.bronze")
uc.sql("CREATE SCHEMA IF NOT EXISTS analytics.silver")
uc.sql("CREATE SCHEMA IF NOT EXISTS analytics.gold")

# List schemas
print("Schemas in analytics:")
uc.sql("SHOW SCHEMAS IN analytics")

[Unity] Catalog 'analytics' created → /home/omar/Documentos/DatabricksLocal/notebooks/.warehouse/analytics


26/02/18 12:29:04 WARN SharedState: Cannot qualify the warehouse path, leaving it unqualified.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.internal.SharedState$.qualifyWarehousePath(SharedState.scala:288)
	at org.apache.spark.sql.internal.SharedState.liftedTree1$1(SharedState.scala:80)
	at org.apache.spark.sql.internal.SharedState.<init>(Sha

Schemas in analytics:


26/02/18 12:29:06 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
26/02/18 12:29:06 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
26/02/18 12:29:06 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
	at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214)
	at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:162)
	at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:285)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:75)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstru

DataFrame[databaseName: string]

In [3]:
# Create volumes
uc.sql("CREATE VOLUME IF NOT EXISTS analytics.bronze.raw_data")
uc.sql("SHOW VOLUMES IN analytics.bronze")

[Unity] Volume 'analytics.bronze.raw_data' (MANAGED) → /home/omar/Documentos/DatabricksLocal/notebooks/.volumes/analytics/bronze/raw_data


DataFrame[catalog_name: string, schema_name: string, name: string, volume_type: string, storage_location: string]

## 3. DBUtils — Secrets, Widgets & Filesystem

In [4]:
import os
os.environ["MY_SCOPE_API_KEY"] = "demo-secret-value-12345"

# Secrets
secret = dbutils.secrets.get("my_scope", "api_key")
print(f"Secret retrieved: {secret[:5]}***")
print("Scopes:", dbutils.secrets.listScopes())

Secret retrieved: demo-***
Scopes: [SecretScope(name=''), SecretScope(name='app'), SecretScope(name='application'), SecretScope(name='aws'), SecretScope(name='bucket'), SecretScope(name='chrome'), SecretScope(name='clicolor'), SecretScope(name='conda'), SecretScope(name='databricks'), SecretScope(name='dbus'), SecretScope(name='debuginfod'), SecretScope(name='desktop'), SecretScope(name='electron'), SecretScope(name='fc'), SecretScope(name='fontconfig'), SecretScope(name='force'), SecretScope(name='gdk'), SecretScope(name='gio'), SecretScope(name='git'), SecretScope(name='gjs'), SecretScope(name='gnome'), SecretScope(name='gpg'), SecretScope(name='gsettings'), SecretScope(name='gtk'), SecretScope(name='im'), SecretScope(name='invocation'), SecretScope(name='journal'), SecretScope(name='lc'), SecretScope(name='ls'), SecretScope(name='memory'), SecretScope(name='minio'), SecretScope(name='my'), SecretScope(name='postgres'), SecretScope(name='pydevd'), SecretScope(name='python'), SecretSc
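
The cell above sets `MY_SCOPE_API_KEY` and then reads scope `my_scope`, key `api_key`, which suggests an environment-variable naming convention of `<SCOPE>_<KEY>` in upper case. The following is a minimal sketch of that lookup under this assumption; `env_var_name`, `get_secret`, and `mask` are hypothetical helpers, not part of the shim.

```python
import os


def env_var_name(scope: str, key: str) -> str:
    """Build the environment variable name for a scope/key pair (assumed convention)."""
    return f"{scope}_{key}".upper()


def get_secret(scope: str, key: str) -> str:
    """Look up a secret in the environment, failing loudly if it is missing."""
    name = env_var_name(scope, key)
    try:
        return os.environ[name]
    except KeyError:
        raise KeyError(f"Secret {scope}/{key} not found; set {name}") from None


def mask(secret: str, visible: int = 5) -> str:
    """Show only the first few characters, as the demo cell does."""
    return secret[:visible] + "***"


os.environ["MY_SCOPE_API_KEY"] = "demo-secret-value-12345"
masked = mask(get_secret("my_scope", "api_key"))
print(masked)
```

The `listScopes()` output above appears to derive scope names from the prefixes of every environment variable, so unrelated variables show up as scopes; a dedicated prefix for secrets would avoid that.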

In [5]:
# Widgets
dbutils.widgets.text("environment", "dev", "Environment")
dbutils.widgets.dropdown("region", "us-east-1", ["us-east-1", "eu-west-1", "ap-south-1"], "Region")

print(f"Environment: {dbutils.widgets.get('environment')}")
print(f"Region     : {dbutils.widgets.get('region')}")

Environment: dev
Region     : us-east-1


In [6]:
# Filesystem operations
dbutils.fs.mkdirs("/Volumes/analytics/bronze/raw_data/files")
dbutils.fs.put("/Volumes/analytics/bronze/raw_data/files/hello.txt", "Hello from databricks-local!", True)
print(dbutils.fs.head("/Volumes/analytics/bronze/raw_data/files/hello.txt"))
dbutils.fs.ls("/Volumes/analytics/bronze/raw_data/files/")

Hello from databricks-local!


[FileInfo(path='/Volumes/analytics/bronze/raw_data/files/hello.txt', name='hello.txt', size=28, modificationTime=1771435798939)]
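
The volume created in section 2 is stored under `.volumes/<catalog>/<schema>/<volume>`, so a `/Volumes/...` path can be translated to a local path by dropping the `/Volumes` prefix and joining the rest onto a root directory. This is a sketch of that mapping; `resolve_volume_path` and the root used here are hypothetical.

```python
from pathlib import Path, PurePosixPath


def resolve_volume_path(uc_path: str, volumes_root: Path) -> Path:
    """Translate /Volumes/<catalog>/<schema>/<volume>/... into a local path."""
    parts = PurePosixPath(uc_path).parts
    # parts looks like ('/', 'Volumes', catalog, schema, volume, *rest)
    if len(parts) < 5 or parts[0] != "/" or parts[1] != "Volumes":
        raise ValueError(f"Not a volume path: {uc_path}")
    return volumes_root.joinpath(*parts[2:])


root = Path("notebooks/.volumes")  # hypothetical local root
local = resolve_volume_path(
    "/Volumes/analytics/bronze/raw_data/files/hello.txt", root
)
print(local.as_posix())
```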

## 4. Delta Lake — Create & Query Tables

In [7]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("product", StringType()),
    StructField("price", DoubleType()),
    StructField("category", StringType()),
])

data = [
    (1, "Laptop",     999.99, "electronics"),
    (2, "Headphones", 149.99, "electronics"),
    (3, "T-Shirt",     29.99, "clothing"),
    (4, "Sneakers",    89.99, "clothing"),
    (5, "Blender",     59.99, "home"),
]

df = spark.createDataFrame(data, schema)
display(df)

                                                                                

+---+----------+------+-----------+
| id|   product| price|   category|
+---+----------+------+-----------+
|  1|    Laptop|999.99|electronics|
|  2|Headphones|149.99|electronics|
|  3|   T-Shirt| 29.99|   clothing|
|  4|  Sneakers| 89.99|   clothing|
|  5|   Blender| 59.99|       home|
+---+----------+------+-----------+



In [8]:
import tempfile, os
delta_path = os.path.join(tempfile.mkdtemp(), "products_delta")

# Write as Delta
df.write.format("delta").mode("overwrite").save(delta_path)
print(f"Delta table written to: {delta_path}")

# Read back
products = spark.read.format("delta").load(delta_path)
display(products)

26/02/18 12:30:36 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Delta table written to: /tmp/tmppnxjtl9z/products_delta
+---+----------+------+-----------+
| id|   product| price|   category|
+---+----------+------+-----------+
|  2|Headphones|149.99|electronics|
|  1|    Laptop|999.99|electronics|
|  4|  Sneakers| 89.99|   clothing|
|  3|   T-Shirt| 29.99|   clothing|
|  5|   Blender| 59.99|       home|
+---+----------+------+-----------+



## 5. Delta Lake — Time Travel

In [9]:
# Append new data to create version 1
new_data = [
    (6, "Tablet",  399.99, "electronics"),
    (7, "Jacket",  129.99, "clothing"),
]
spark.createDataFrame(new_data, schema).write.format("delta").mode("append").save(delta_path)

print("=== Version 1 (current) — 7 rows ===")
spark.read.format("delta").load(delta_path).show()

print("=== Version 0 (time travel) — 5 rows ===")
spark.read.format("delta").option("versionAsOf", 0).load(delta_path).show()

                                                                                

=== Version 1 (current) — 7 rows ===
+---+----------+------+-----------+
| id|   product| price|   category|
+---+----------+------+-----------+
|  2|Headphones|149.99|electronics|
|  6|    Tablet|399.99|electronics|
|  1|    Laptop|999.99|electronics|
|  4|  Sneakers| 89.99|   clothing|
|  3|   T-Shirt| 29.99|   clothing|
|  7|    Jacket|129.99|   clothing|
|  5|   Blender| 59.99|       home|
+---+----------+------+-----------+

=== Version 0 (time travel) — 5 rows ===
+---+----------+------+-----------+
| id|   product| price|   category|
+---+----------+------+-----------+
|  2|Headphones|149.99|electronics|
|  1|    Laptop|999.99|electronics|
|  4|  Sneakers| 89.99|   clothing|
|  3|   T-Shirt| 29.99|   clothing|
|  5|   Blender| 59.99|       home|
+---+----------+------+-----------+
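
Time travel works because Delta Lake keeps a transaction log and can rebuild the table as of any committed version. The toy model below covers appends only and uses plain Python lists in place of data files; `VersionedTable` is a hypothetical class for illustration, not how Delta stores data.

```python
class VersionedTable:
    """Toy append-only table: each commit adds rows and creates a new version."""

    def __init__(self):
        self._commits = []  # one list of rows per version

    def append(self, rows):
        self._commits.append(list(rows))
        return len(self._commits) - 1  # version number just written

    def read(self, version_as_of=None):
        latest = len(self._commits) - 1
        version = latest if version_as_of is None else version_as_of
        if not 0 <= version <= latest:
            raise ValueError(f"Version {version} does not exist")
        # A snapshot is everything committed up to and including that version.
        return [row for commit in self._commits[: version + 1] for row in commit]


table = VersionedTable()
v0 = table.append(
    [(1, "Laptop"), (2, "Headphones"), (3, "T-Shirt"), (4, "Sneakers"), (5, "Blender")]
)
v1 = table.append([(6, "Tablet"), (7, "Jacket")])
print(len(table.read()), len(table.read(version_as_of=0)))
```

Reading without a version returns the latest snapshot (7 rows), while `version_as_of=0` returns the original 5 rows, mirroring the two `show()` calls above.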



## 6. Delta Lake — MERGE (Upsert)

In [10]:
from delta.tables import DeltaTable

# Updates + inserts
upsert_data = [
    (1, "Laptop Pro", 1299.99, "electronics"),  # update
    (8, "Candle",       14.99, "home"),          # insert
]
upsert_df = spark.createDataFrame(upsert_data, schema)

delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.alias("target").merge(
    upsert_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

print("After MERGE:")
spark.read.format("delta").load(delta_path).orderBy("id").show()

                                                                                

After MERGE:
+---+----------+-------+-----------+
| id|   product|  price|   category|
+---+----------+-------+-----------+
|  1|Laptop Pro|1299.99|electronics|
|  2|Headphones| 149.99|electronics|
|  3|   T-Shirt|  29.99|   clothing|
|  4|  Sneakers|  89.99|   clothing|
|  5|   Blender|  59.99|       home|
|  6|    Tablet| 399.99|electronics|
|  7|    Jacket| 129.99|   clothing|
|  8|    Candle|  14.99|       home|
+---+----------+-------+-----------+
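
The semantics of `whenMatchedUpdateAll().whenNotMatchedInsertAll()` can be sketched in plain Python as a keyed upsert: a source row replaces the target row with the same `id`, and any other source row is inserted. `merge_upsert` is a hypothetical helper for illustration only.

```python
def merge_upsert(target_rows, source_rows, key_index=0):
    """Upsert: source rows replace target rows with the same key, others are inserted."""
    merged = {row[key_index]: row for row in target_rows}
    for row in source_rows:
        merged[row[key_index]] = row  # matched: update all columns; otherwise insert
    return [merged[key] for key in sorted(merged)]


target = [
    (1, "Laptop", 999.99, "electronics"),
    (2, "Headphones", 149.99, "electronics"),
    (3, "T-Shirt", 29.99, "clothing"),
    (4, "Sneakers", 89.99, "clothing"),
    (5, "Blender", 59.99, "home"),
    (6, "Tablet", 399.99, "electronics"),
    (7, "Jacket", 129.99, "clothing"),
]
source = [
    (1, "Laptop Pro", 1299.99, "electronics"),  # update
    (8, "Candle", 14.99, "home"),               # insert
]

result = merge_upsert(target, source)
print(result[0], len(result))
```

One difference worth noting: this dictionary version silently keeps the last source row when two source rows share a key, whereas Delta Lake rejects a MERGE in which several source rows match the same target row.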



## 7. Unity Catalog — Grants, Tags & Lineage

In [11]:
# Grants on a UC-tracked table
uc.sql("GRANT SELECT ON TABLE analytics.bronze.products TO analyst@company.com")
uc.sql("GRANT INSERT ON TABLE analytics.bronze.products TO etl_service")
uc.sql("DENY DELETE ON TABLE analytics.bronze.products TO intern")

print("Grants:")
uc.sql("SHOW GRANTS ON TABLE analytics.bronze.products")

[Unity] GRANT SELECT ON TABLE analytics.bronze.products TO analyst@company.com
[Unity] GRANT INSERT ON TABLE analytics.bronze.products TO etl_service
[Unity] DENY DELETE ON TABLE analytics.bronze.products TO intern
Grants:


DataFrame[principal: string, privilege: string, object_type: string, object_key: string]

In [12]:
# Tags
uc.sql("ALTER TABLE analytics.bronze.products SET TAGS ('pii' = 'false', 'team' = 'data-eng', 'env' = 'dev')")
print("Tags:", uc.get_tags("analytics.bronze.products"))

Tags: {'pii': 'false', 'team': 'data-eng', 'env': 'dev'}
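
A local emulator has to turn the `SET TAGS (...)` clause into the dictionary printed above. One way to do that is a small regular expression over the quoted pairs; this is a sketch under that assumption, and `parse_set_tags` is a hypothetical function that does not handle escaped quotes.

```python
import re

TAG_PAIR = re.compile(r"'([^']*)'\s*=\s*'([^']*)'")


def parse_set_tags(statement: str) -> dict:
    """Extract the key/value pairs from an ALTER ... SET TAGS (...) statement."""
    match = re.search(
        r"SET\s+TAGS\s*\((.*)\)", statement, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("No SET TAGS clause found")
    return dict(TAG_PAIR.findall(match.group(1)))


tags = parse_set_tags(
    "ALTER TABLE analytics.bronze.products "
    "SET TAGS ('pii' = 'false', 'team' = 'data-eng', 'env' = 'dev')"
)
print(tags)
```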


In [13]:
# Lineage
uc.track_lineage("analytics.bronze.products", "analytics.silver.products_clean", "TABLE")
uc.track_lineage("analytics.silver.products_clean", "analytics.gold.category_summary", "TABLE")

print("Lineage graph:")
uc.lineage_as_dataframe().show(truncate=False)

Lineage graph:
+-------------------------------+-------------------------------+------------+--------------------------------+
|source_table                   |target_table                   |lineage_type|tracked_at                      |
+-------------------------------+-------------------------------+------------+--------------------------------+
|analytics.bronze.products      |analytics.silver.products_clean|TABLE       |2026-02-18T17:32:06.641992+00:00|
|analytics.silver.products_clean|analytics.gold.category_summary|TABLE       |2026-02-18T17:32:06.642016+00:00|
+-------------------------------+-------------------------------+------------+--------------------------------+
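
Once lineage is recorded as source/target pairs, impact analysis is a graph traversal. The sketch below walks the two edges from the table above breadth-first; `downstream` is a hypothetical helper that works on plain tuples rather than on the DataFrame returned by `uc.lineage_as_dataframe()`.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every table reachable from `start`, in breadth-first order."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:  # guard against cycles and duplicates
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


edges = [
    ("analytics.bronze.products", "analytics.silver.products_clean"),
    ("analytics.silver.products_clean", "analytics.gold.category_summary"),
]
affected = downstream(edges, "analytics.bronze.products")
print(affected)
```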



## 8. Unity Catalog — Functions & Groups

In [14]:
# User-defined functions
uc.create_function("analytics", "bronze", "clean_text",
                   definition="TRIM(LOWER(input))",
                   description="Normalize text: trim + lowercase")

uc.sql("SHOW FUNCTIONS IN analytics.bronze")

[Unity] Function 'analytics.bronze.clean_text' registered.


DataFrame[catalog: string, schema: string, function: string]

In [15]:
# Groups
uc.sql("CREATE GROUP data_engineers")
uc.sql("ALTER GROUP data_engineers ADD USER alice@company.com")
uc.sql("ALTER GROUP data_engineers ADD USER bob@company.com")
uc.sql("SHOW GROUPS")

[Unity] Group 'data_engineers' created.
[Unity] 'alice@company.com' added to group 'data_engineers'.
[Unity] 'bob@company.com' added to group 'data_engineers'.


DataFrame[name: string]

## 9. Information Schema

In [16]:
print("=== Catalogs ===")
uc.information_schema.catalogs().show(truncate=False)

print("=== Schemas ===")
uc.information_schema.schemata().show(truncate=False)

print("=== Tables ===")
uc.information_schema.tables().show(truncate=False)

print("=== Volumes ===")
uc.information_schema.volumes().show(truncate=False)

=== Catalogs ===
+--------------+-------+
|catalog_name  |comment|
+--------------+-------+
|analytics     |       |
|hive_metastore|       |
|main          |       |
+--------------+-------+

=== Schemas ===


26/02/18 12:32:28 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
26/02/18 12:32:28 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
26/02/18 12:32:28 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
	at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214)
	at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:162)
	at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:285)
	at jdk.internal.reflect.GeneratedConstructorAccessor97.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:53)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.ba

+------------+-----------+-------+
|catalog_name|schema_name|comment|
+------------+-----------+-------+
+------------+-----------+-------+

=== Tables ===



+-------------+------------+----------+----------+
|table_catalog|table_schema|table_name|table_type|
+-------------+------------+----------+----------+
+-------------+------------+----------+----------+

=== Volumes ===
+--------------+-------------+-----------+-----------+----------------+
|volume_catalog|volume_schema|volume_name|volume_type|storage_location|
+--------------+-------------+-----------+-----------+----------------+
+--------------+-------------+-----------+-----------+----------------+



## 10. Aggregation — Gold Layer Example

In [17]:
from pyspark.sql.functions import avg, count, round as spark_round, col

gold_df = (
    spark.read.format("delta").load(delta_path)
    .groupBy("category")
    .agg(
        count("*").alias("total_products"),
        spark_round(avg("price"), 2).alias("avg_price"),
    )
    .orderBy(col("avg_price").desc())
)

print("Category Summary (Gold):")
display(gold_df)

Category Summary (Gold):
+-----------+--------------+---------+
|   category|total_products|avg_price|
+-----------+--------------+---------+
|electronics|             3|   616.66|
|   clothing|             3|    83.32|
|       home|             2|    37.49|
+-----------+--------------+---------+
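
The averages can be checked by hand against the eight rows left after the MERGE in section 6. The sketch below repeats the group-and-average step in plain Python, using `Decimal` with half-up rounding (Spark's `round` also rounds half up); the row list is copied from that table and the variable names are illustrative.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (category, price) pairs from the post-MERGE table.
rows = [
    ("electronics", "1299.99"),
    ("electronics", "149.99"),
    ("clothing", "29.99"),
    ("clothing", "89.99"),
    ("home", "59.99"),
    ("electronics", "399.99"),
    ("clothing", "129.99"),
    ("home", "14.99"),
]

prices = defaultdict(list)
for category, price in rows:
    prices[category].append(Decimal(price))

summary = {
    category: (
        len(values),
        (sum(values) / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP),
    )
    for category, values in prices.items()
}

for category, (total, avg_price) in sorted(
    summary.items(), key=lambda item: item[1][1], reverse=True
):
    print(category, total, avg_price)
```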



## 11. Audit Log — Track All Grants & Denials

In [18]:
# Full audit log of all grants, revokes, and denials
print("=== Audit Log ===")
uc.audit_log().show(truncate=False)

=== Audit Log ===
+-------------------+-----------+-------------------+-------------------------+--------------------------------+
|user_identity      |action_name|request_object_type|request_object_name      |event_time                      |
+-------------------+-----------+-------------------+-------------------------+--------------------------------+
|analyst@company.com|SELECT     |TABLE              |analytics.bronze.products|2026-02-18T17:31:47.327034+00:00|
|etl_service        |INSERT     |TABLE              |analytics.bronze.products|2026-02-18T17:31:47.327137+00:00|
|intern             |DENY:DELETE|TABLE              |analytics.bronze.products|2026-02-18T17:31:47.327196+00:00|
+-------------------+-----------+-------------------+-------------------------+--------------------------------+
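
The audit rows above pair each principal with an action name, with denials prefixed by `DENY:`. A toy registry that produces the same shape of record might look like the following; `GrantRegistry` is hypothetical, and the `REVOKE:` label in particular is an assumption, since the output above does not show how revokes are logged.

```python
from datetime import datetime, timezone


class GrantRegistry:
    """Toy privilege store that logs every change to an audit list."""

    def __init__(self):
        self.grants = set()   # (principal, privilege, object_name)
        self.denials = set()
        self.audit = []

    def _record(self, principal, action, object_name):
        self.audit.append({
            "user_identity": principal,
            "action_name": action,
            "request_object_name": object_name,
            "event_time": datetime.now(timezone.utc).isoformat(),
        })

    def grant(self, principal, privilege, object_name):
        self.grants.add((principal, privilege, object_name))
        self._record(principal, privilege, object_name)

    def deny(self, principal, privilege, object_name):
        self.denials.add((principal, privilege, object_name))
        self._record(principal, f"DENY:{privilege}", object_name)

    def revoke(self, principal, privilege, object_name):
        self.grants.discard((principal, privilege, object_name))
        # Assumption: this action label is invented for the sketch.
        self._record(principal, f"REVOKE:{privilege}", object_name)

    def is_allowed(self, principal, privilege, object_name):
        key = (principal, privilege, object_name)
        return key in self.grants and key not in self.denials


registry = GrantRegistry()
table = "analytics.bronze.products"
registry.grant("analyst@company.com", "SELECT", table)
registry.grant("etl_service", "INSERT", table)
registry.deny("intern", "DELETE", table)
actions = [entry["action_name"] for entry in registry.audit]
print(actions)
```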



## 12. UNDROP TABLE — Table Recovery

In [19]:
# Track a table drop and then recover it
uc.track_drop_table("analytics.bronze.old_events")
uc.track_drop_table("analytics.silver.stale_data")

print("=== Dropped Tables ===")
for t in uc.list_dropped_tables():
    print(f"  {t.catalog_name}.{t.schema_name}.{t.name} (dropped at {t.dropped_at})")

# Recover one
uc.undrop_table("analytics.bronze.old_events")

print("\n=== After UNDROP ===")
for t in uc.list_dropped_tables():
    print(f"  {t.catalog_name}.{t.schema_name}.{t.name}")

=== Dropped Tables ===
  analytics.bronze.old_events (dropped at 2026-02-18T17:33:38.113128+00:00)
  analytics.silver.stale_data (dropped at 2026-02-18T17:33:38.113156+00:00)
[Unity] UNDROP TABLE 'analytics.bronze.old_events': restored.

=== After UNDROP ===
  analytics.silver.stale_data
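
The drop-then-recover flow above amounts to keeping a registry of dropped table names and removing an entry when it is restored. Here is a minimal sketch of that bookkeeping; `DroppedTables` is a hypothetical class and tracks names only, not table data.

```python
from datetime import datetime, timezone


class DroppedTables:
    """Toy registry of dropped tables keyed by fully qualified name."""

    def __init__(self):
        self._dropped = {}

    def track_drop(self, full_name):
        if full_name.count(".") != 2:
            raise ValueError("Expected <catalog>.<schema>.<table>")
        self._dropped[full_name] = datetime.now(timezone.utc)

    def undrop(self, full_name):
        if full_name not in self._dropped:
            raise KeyError(f"{full_name} is not in the dropped list")
        del self._dropped[full_name]

    def list_dropped(self):
        return sorted(self._dropped)


dropped = DroppedTables()
dropped.track_drop("analytics.bronze.old_events")
dropped.track_drop("analytics.silver.stale_data")
dropped.undrop("analytics.bronze.old_events")
print(dropped.list_dropped())
```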


## 13. Delta Lake — UPDATE & DELETE Operations

In [20]:
import tempfile, os
delta_path2 = os.path.join(tempfile.mkdtemp(), "update_delete_demo")

# Create a fresh Delta table
spark.createDataFrame(data, schema).write.format("delta").mode("overwrite").save(delta_path2)
dt2 = DeltaTable.forPath(spark, delta_path2)

# UPDATE: increase price of all electronics by 10%
dt2.update(
    condition="category = 'electronics'",
    set={"price": "price * 1.10"}
)
print("=== After UPDATE (electronics +10%) ===")
spark.read.format("delta").load(delta_path2).orderBy("id").show()

# DELETE: remove items under $50
dt2.delete("price < 50")
print("=== After DELETE (price < 50) ===")
spark.read.format("delta").load(delta_path2).orderBy("id").show()

# History
print("=== Delta History ===")
dt2.history().select("version", "timestamp", "operation").show(truncate=False)

                                                                                

=== After UPDATE (electronics +10%) ===
+---+----------+------------------+-----------+
| id|   product|             price|   category|
+---+----------+------------------+-----------+
|  1|    Laptop|          1099.989|electronics|
|  2|Headphones|164.98900000000003|electronics|
|  3|   T-Shirt|             29.99|   clothing|
|  4|  Sneakers|             89.99|   clothing|
|  5|   Blender|             59.99|       home|
+---+----------+------------------+-----------+

=== After DELETE (price < 50) ===
+---+----------+------------------+-----------+
| id|   product|             price|   category|
+---+----------+------------------+-----------+
|  1|    Laptop|          1099.989|electronics|
|  2|Headphones|164.98900000000003|electronics|
|  4|  Sneakers|             89.99|   clothing|
|  5|   Blender|             59.99|       home|
+---+----------+------------------+-----------+

=== Delta History ===
+-------+-----------------------+---------+
|version|timestamp              |operation
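
The long fractional digits in the updated prices (for example `164.98900000000003`) come from binary floating-point arithmetic on a `DoubleType` column. As a plain-Python sketch of the same UPDATE and DELETE logic with exact decimals, the example below rounds to whole cents; in Spark, a `DecimalType` column or an explicit `round` would be the analogous choice. The variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

rows = [
    {"id": 1, "product": "Laptop", "price": Decimal("999.99"), "category": "electronics"},
    {"id": 2, "product": "Headphones", "price": Decimal("149.99"), "category": "electronics"},
    {"id": 3, "product": "T-Shirt", "price": Decimal("29.99"), "category": "clothing"},
    {"id": 4, "product": "Sneakers", "price": Decimal("89.99"), "category": "clothing"},
    {"id": 5, "product": "Blender", "price": Decimal("59.99"), "category": "home"},
]

# UPDATE: raise electronics prices by 10%, rounding to whole cents.
for row in rows:
    if row["category"] == "electronics":
        row["price"] = (row["price"] * Decimal("1.10")).quantize(
            CENT, rounding=ROUND_HALF_UP
        )

# DELETE: drop rows priced under 50.
rows = [row for row in rows if row["price"] >= 50]

for row in rows:
    print(row["id"], row["product"], row["price"])
```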

## 14. DBUtils — Notebook Context & Task Values

In [21]:
import json

# Notebook context — same API as in Databricks
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
print("=== Notebook Context ===")
print(json.dumps(json.loads(ctx.toJson()), indent=2))

# Task Values — inter-task communication
dbutils.jobs.taskValues.set("etl_rows_processed", 1500)
dbutils.jobs.taskValues.set("etl_status", "success")

val = dbutils.jobs.taskValues.get("etl_task", "etl_rows_processed", debugValue=0)
status = dbutils.jobs.taskValues.get("etl_task", "etl_status", debugValue="unknown")
print(f"\nTask values — rows: {val}, status: {status}")

=== Notebook Context ===
{
  "tags": {
    "orgId": "0",
    "clusterId": "local-cluster",
    "clusterName": "databricks-local-uc",
    "notebookPath": "/local/notebook",
    "user": "omar",
    "notebookId": "0",
    "currentCatalog": "main"
  }
}

Task values — rows: 1500, status: success
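
In the cell above, `set` takes only a key and a value while `get` also takes a task key, yet the stored values come back. That is consistent with a single in-memory store that ignores the task key locally. The sketch below illustrates that reading; `LocalTaskValues` is hypothetical and its method names are simplified.

```python
class LocalTaskValues:
    """Toy task-value store: one dictionary shared by all tasks."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def get(self, task_key, key, debug_value=None):
        # task_key is accepted for API compatibility but ignored in this sketch.
        return self._values.get(key, debug_value)


task_values = LocalTaskValues()
task_values.set("etl_rows_processed", 1500)
task_values.set("etl_status", "success")

rows_processed = task_values.get("etl_task", "etl_rows_processed", debug_value=0)
missing = task_values.get("etl_task", "not_set", debug_value="unknown")
print(rows_processed, missing)
```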


## 15. DBUtils — Credentials & Data Summarize

In [22]:
# Credentials (no-op locally, same API as Databricks)
result = dbutils.credentials.assumeRole("arn:aws:iam::123456789:role/my-role")
print(f"assumeRole result: {result}")
print(f"showCurrentRole: {dbutils.credentials.showCurrentRole()}")
print(f"showRoles: {dbutils.credentials.showRoles()}")

# Data summarize — DataFrame profiling
print("\n=== DataFrame Profile (data.summarize) ===")
sample_df = spark.createDataFrame(data, schema)
dbutils.data.summarize(sample_df)

[Mock] assumeRole arn:aws:iam::123456789:role/my-role (no-op locally)
assumeRole result: True
showCurrentRole: []
showRoles: []

=== DataFrame Profile (data.summarize) ===
+-------+------------------+-------+------------------+--------+
|summary|                id|product|             price|category|
+-------+------------------+-------+------------------+--------+
|  count|                 5|      5|                 5|       5|
|   mean|               3.0|   NULL|            265.99|    NULL|
| stddev|1.5811388300841898|   NULL|412.71055232450743|    NULL|
|    min|                 1|Blender|             29.99|clothing|
|    max|                 5|T-Shirt|            999.99|    home|
+-------+------------------+-------+------------------+--------+



## 16. DBUtils — Advanced Widgets

In [23]:
# Combobox and multiselect widgets
dbutils.widgets.combobox("output_format", "parquet", ["parquet", "delta", "csv"], "Output Format")
dbutils.widgets.multiselect("layers", "bronze", ["bronze", "silver", "gold"], "Layers")

print("=== All Widgets ===")
all_widgets = dbutils.widgets.getAll()
for name, value in all_widgets.items():
    print(f"  {name}: {value}")

# getArgument (alias for get with default)
arg = dbutils.widgets.getArgument("missing_widget", "fallback_value")
print(f"\ngetArgument('missing_widget'): {arg}")

# Remove a widget
dbutils.widgets.remove("output_format")
print(f"After remove: {list(dbutils.widgets.getAll().keys())}")

=== All Widgets ===
  environment: dev
  region: us-east-1
  output_format: parquet
  layers: bronze

getArgument('missing_widget'): fallback_value
After remove: ['environment', 'region', 'layers']
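
Without a UI, widgets reduce to a name-to-value mapping: creating a widget stores its default, `getArgument` falls back when the name is missing, and `remove` deletes the entry. The following sketch models that behaviour; `LocalWidgets` and its snake_case method names are hypothetical.

```python
class LocalWidgets:
    """Toy widget store that keeps only the current value of each widget."""

    def __init__(self):
        self._values = {}

    def text(self, name, default, label=None):
        self._values[name] = default

    def dropdown(self, name, default, choices, label=None):
        if default not in choices:
            raise ValueError(f"Default {default!r} is not one of {choices}")
        self._values[name] = default

    def get(self, name):
        return self._values[name]  # raises KeyError for unknown widgets

    def get_argument(self, name, fallback):
        return self._values.get(name, fallback)

    def get_all(self):
        return dict(self._values)

    def remove(self, name):
        self._values.pop(name, None)


widgets = LocalWidgets()
widgets.text("environment", "dev", "Environment")
widgets.dropdown("region", "us-east-1", ["us-east-1", "eu-west-1"], "Region")
widgets.dropdown("output_format", "parquet", ["parquet", "delta", "csv"])
widgets.dropdown("layers", "bronze", ["bronze", "silver", "gold"])
widgets.remove("output_format")
print(list(widgets.get_all()))
```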


## 17. DBUtils — Advanced Filesystem (cp, mv, rm)

In [24]:
# Write files to DBFS
dbutils.fs.put("dbfs:/tmp/demo/file1.txt", "First file content", True)
dbutils.fs.put("dbfs:/tmp/demo/file2.txt", "Second file content", True)

# List
print("=== Before operations ===")
for f in dbutils.fs.ls("dbfs:/tmp/demo/"):
    print(f"  {f.name} ({f.size} bytes)")

# Copy
dbutils.fs.cp("dbfs:/tmp/demo/file1.txt", "dbfs:/tmp/demo/file1_copy.txt")
print("\nAfter cp:")
for f in dbutils.fs.ls("dbfs:/tmp/demo/"):
    print(f"  {f.name}")

# Move (rename)
dbutils.fs.mv("dbfs:/tmp/demo/file1_copy.txt", "dbfs:/tmp/demo/file1_renamed.txt")
print("\nAfter mv:")
for f in dbutils.fs.ls("dbfs:/tmp/demo/"):
    print(f"  {f.name}")

# Remove
dbutils.fs.rm("dbfs:/tmp/demo/", recurse=True)
print("\nAfter rm (recurse): cleaned up")

=== Before operations ===
  file1.txt (18 bytes)
  file2.txt (19 bytes)

After cp:
  file1.txt
  file1_copy.txt
  file2.txt

After mv:
  file1.txt
  file1_renamed.txt
  file2.txt

After rm (recurse): cleaned up


## 18. Unity Catalog — Describe Operations

In [25]:
# Describe Catalog
print("=== DESCRIBE CATALOG analytics ===")
uc.sql("DESCRIBE CATALOG analytics")

# Describe Schema
print("\n=== DESCRIBE SCHEMA analytics.bronze ===")
uc.describe_schema("analytics", "bronze")

# Describe Volume
print("\n=== DESCRIBE VOLUME analytics.bronze.raw_data ===")
uc.sql("DESCRIBE VOLUME analytics.bronze.raw_data")

# Describe Function
print("\n=== DESCRIBE FUNCTION analytics.bronze.clean_text ===")
uc.sql("DESCRIBE FUNCTION analytics.bronze.clean_text")

=== DESCRIBE CATALOG analytics ===

=== DESCRIBE SCHEMA analytics.bronze ===

=== DESCRIBE VOLUME analytics.bronze.raw_data ===

=== DESCRIBE FUNCTION analytics.bronze.clean_text ===


DataFrame[info_name: string, info_value: string]

## 19. Unity Catalog — Python API Direct Calls

In [26]:
# Python API (alternative to SQL)
# List catalogs
print("=== Catalogs (Python API) ===")
for c in uc.list_catalogs():
    print(f"  {c.name}: {c.comment or '(no comment)'}")

# List schemas
print("\n=== Schemas in 'analytics' (Python API) ===")
for s in uc.list_schemas("analytics"):
    print(f"  {s.name}: {s.comment or '(no comment)'}")

# List volumes
print("\n=== Volumes in analytics.bronze (Python API) ===")
for v in uc.list_volumes("analytics", "bronze"):
    print(f"  {v.name} ({v.volume_type}) → {v.storage_location}")

# Volume path helper
path = uc.volume_path("analytics", "bronze", "raw_data", "data.csv")
print(f"\nVolume path: {path}")

# Current catalog/schema state
print(f"\nCurrent catalog: {uc.get_current_catalog()}")
print(f"Current schema : {uc.get_current_schema()}")

# Switch context
uc.set_current_catalog("analytics")
uc.set_current_schema("silver")
print(f"After switch → catalog: {uc.get_current_catalog()}, schema: {uc.get_current_schema()}")

=== Catalogs (Python API) ===
  analytics: (no comment)
  hive_metastore: (no comment)
  main: (no comment)

=== Schemas in 'analytics' (Python API) ===




=== Volumes in analytics.bronze (Python API) ===
  raw_data (MANAGED) → /home/omar/Documentos/DatabricksLocal/notebooks/.volumes/analytics/bronze/raw_data

Volume path: /home/omar/Documentos/DatabricksLocal/notebooks/.volumes/analytics/bronze/raw_data/data.csv

Current catalog: main
Current schema : default



After switch → catalog: analytics, schema: silver


## 20. Unity Catalog — No-Op Cloud Commands (Delta Sharing, External Locations)

In [27]:
# These commands are accepted as no-ops (each one logs a short "no-op locally" notice),
# so the same notebook runs without errors both locally and in a cloud workspace.

noop_commands = [
    "CREATE SHARE IF NOT EXISTS my_share",
    "CREATE RECIPIENT IF NOT EXISTS partner_org",
    "CREATE EXTERNAL LOCATION my_loc URL 's3://bucket/path' WITH (STORAGE CREDENTIAL my_cred)",
    "CREATE STORAGE CREDENTIAL my_cred",
    "ALTER STORAGE CREDENTIAL my_cred SET OWNER TO admin",
    "CREATE CONNECTION my_conn TYPE 'POSTGRESQL'",
    "CREATE CLEAN ROOM my_room",
    "REFRESH MATERIALIZED VIEW my_mv",
    "REFRESH STREAMING TABLE my_st",
    "SYNC SCHEMA analytics",
    "MSCK REPAIR PRIVILEGES",
]

print("=== No-Op Cloud Commands (all accepted silently) ===")
for cmd in noop_commands:
    result = uc.sql(cmd)
    status = "✓" if result is None or result == "NO-OP" else f"→ {result}"
    print(f"  {status}  {cmd}")

=== No-Op Cloud Commands (all accepted silently) ===
[Unity] CREATE SHARE — no-op locally
  → DataFrame[result: string]  CREATE SHARE IF NOT EXISTS my_share
[Unity] CREATE RECIPIENT — no-op locally
  → DataFrame[result: string]  CREATE RECIPIENT IF NOT EXISTS partner_org
[Unity] CREATE EXTERNAL LOCATION — no-op locally
  → DataFrame[result: string]  CREATE EXTERNAL LOCATION my_loc URL 's3://bucket/path' WITH (STORAGE CREDENTIAL my_cred)
[Unity] CREATE STORAGE CREDENTIAL — no-op locally
  → DataFrame[result: string]  CREATE STORAGE CREDENTIAL my_cred
[Unity] ALTER STORAGE CREDENTIAL — no-op locally
  → DataFrame[result: string]  ALTER STORAGE CREDENTIAL my_cred SET OWNER TO admin
[Unity] CREATE CONNECTION — no-op locally
  → DataFrame[result: string]  CREATE CONNECTION my_conn TYPE 'POSTGRESQL'
[Unity] CREATE CLEAN ROOM — no-op locally
  → DataFrame[result: string]  CREATE CLEAN ROOM my_room
[Unity] REFRESH MATERIALIZED VIEW — no-op locally
  → DataFrame[result: string]  REFRESH MATERIA
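
One simple way to recognise cloud-only statements is to normalise the SQL text and compare it against a list of known prefixes, returning the matched prefix for the log message. This is a sketch of that idea, not the shim's parser; `NOOP_PREFIXES` and `match_noop` are hypothetical names, and the list is taken from the commands in the cell above.

```python
NOOP_PREFIXES = [
    "CREATE SHARE",
    "CREATE RECIPIENT",
    "CREATE EXTERNAL LOCATION",
    "CREATE STORAGE CREDENTIAL",
    "ALTER STORAGE CREDENTIAL",
    "CREATE CONNECTION",
    "CREATE CLEAN ROOM",
    "REFRESH MATERIALIZED VIEW",
    "REFRESH STREAMING TABLE",
    "SYNC SCHEMA",
    "MSCK REPAIR PRIVILEGES",
]


def match_noop(statement):
    """Return the no-op prefix a statement starts with, or None if it is real SQL."""
    normalized = " ".join(statement.upper().split())  # collapse whitespace, ignore case
    for prefix in NOOP_PREFIXES:
        if normalized.startswith(prefix):
            return prefix
    return None


for sql in ["create share if not exists my_share", "SELECT 1"]:
    print(sql, "->", match_noop(sql))
```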

## 21. Unity Catalog — SQL via REVOKE, SHOW GRANTS, COMMENT ON

In [28]:
# REVOKE a grant
uc.sql("REVOKE INSERT ON TABLE analytics.bronze.products FROM etl_service")

# Show grants after revoke
print("=== Grants after REVOKE ===")
uc.sql("SHOW GRANTS ON TABLE analytics.bronze.products")

# COMMENT ON
uc.sql("COMMENT ON TABLE analytics.bronze.products IS 'Product catalog — source of truth'")

# SHOW CATALOGS / SHOW SCHEMAS via SQL
print("\n=== SHOW CATALOGS ===")
uc.sql("SHOW CATALOGS")

print("\n=== SHOW SCHEMAS IN analytics ===")
uc.sql("SHOW SCHEMAS IN analytics")

[Unity] REVOKE INSERT ON TABLE analytics.bronze.products FROM etl_service
=== Grants after REVOKE ===
[Unity] COMMENT ON TABLE analytics.bronze.products: Product catalog — source of truth

=== SHOW CATALOGS ===

=== SHOW SCHEMAS IN analytics ===



DataFrame[databaseName: string]

## 22. Cleanup

In [29]:
# Clean up Delta paths and temp files
import shutil
for p in [delta_path, delta_path2]:
    shutil.rmtree(p, ignore_errors=True)

# Remove all widgets
dbutils.widgets.removeAll()

print("Cleanup complete!")
print(f"Demo finished — PySpark {spark.version} + Delta Lake running 100% locally.")
print("Total sections demonstrated: 22")
print("Features covered: Unity Catalog, DBUtils, Delta Lake, Grants, Tags, Lineage, Groups, Functions, No-Ops")

Cleanup complete!
Demo finished — PySpark 3.5.3 + Delta Lake running 100% locally.
Total sections demonstrated: 22
Features covered: Unity Catalog, DBUtils, Delta Lake, Grants, Tags, Lineage, Groups, Functions, No-Ops
