# Unity Catalog: Managed vs. External Volumes

**Objective:**
In this notebook, we will explore **Databricks Volumes**, a Unity Catalog feature for governing, managing, and accessing non-tabular data: files of any format, whether structured, semi-structured, or unstructured.

We will cover:
1.  **Managed Volumes:** Storage governed entirely by Unity Catalog in the managed storage location of the schema.
2.  **External Volumes:** Storage governed by Unity Catalog but located in an external cloud storage path (ADLS/S3) that you control.
3.  **File Operations:** Copying, reading, and querying files within volumes.

**Prerequisites:**
*   A Unity Catalog-enabled Databricks workspace.
*   `USE CATALOG` on the catalog, plus `USE SCHEMA` and `CREATE VOLUME` on the schema.
*   For External Volumes: An `EXTERNAL LOCATION` already configured (covered in a previous notebook), with the `CREATE EXTERNAL VOLUME` privilege on it.

## 1. Setup Environment
We will use the catalog `dev` and schema `bronze` that we created in previous sessions.

In [None]:
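# Select catalog and schema
spark.sql("USE CATALOG dev")
spark.sql("USE SCHEMA bronze")

# Confirm the active context by querying it rather than printing a hardcoded string
display(spark.sql("SELECT current_catalog(), current_schema()"))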

## 2. Managed Volumes

### What is a Managed Volume?
A managed volume is a storage volume created within the default managed storage location of the containing schema. You do not need to specify a path when creating it. Databricks manages the lifecycle of the data files.

*   **Dropping a managed volume deletes both the metadata and the actual files.**

In [None]:
# Create a Managed Volume
# Syntax: CREATE VOLUME volume_name
spark.sql("""
    CREATE VOLUME IF NOT EXISTS dev.bronze.managed_vol
    COMMENT 'This is a managed volume for storing raw files'
""")

print("Managed Volume 'managed_vol' created successfully.")

### Inspecting the Volume
Let's look at the metadata of the volume we just created. Note that `storage_location` points to a path under the Unity Catalog managed storage location.

In [None]:
display(spark.sql("DESCRIBE VOLUME dev.bronze.managed_vol"))

### Working with Files in Managed Volumes
We can use standard filesystem commands (via `%fs` or `dbutils`) to interact with volumes. The path format is:
`/Volumes/<catalog>/<schema>/<volume_name>/<path_to_file>`
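
The path format above can be assembled programmatically instead of by hand. The following is a minimal sketch, not part of any Databricks API: `volume_path` is a hypothetical helper name, and the only validation it does is to reject empty names or names containing a slash.

```python
import posixpath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/<path> string (hypothetical helper)."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid name component: {name!r}")
    # Strip surrounding slashes so a part like "/files" cannot reset the join
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return posixpath.join("/Volumes", catalog, schema, volume, *cleaned)


print(volume_path("dev", "bronze", "managed_vol", "files", "emp.csv"))
# /Volumes/dev/bronze/managed_vol/files/emp.csv
```

Building paths in one place keeps the catalog, schema, and volume names consistent across cells.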

Let's download a sample CSV file and copy it into our managed volume.

In [None]:
# 1. Create a sub-directory inside the volume
import urllib.request

volume_path = "/Volumes/dev/bronze/managed_vol"
files_dir = f"{volume_path}/files"

dbutils.fs.mkdirs(files_dir)

# 2. Download sample data to the local driver disk, then copy it into the volume
# (the employee CSV used in previous demos)
url = "https://media.githubusercontent.com/media/subhamkharwal/pyspark-zero-to-hero/refs/heads/master/datasets/emp.csv"
local_path = "/tmp/emp.csv"

# urlretrieve raises on HTTP or network errors, so a failed download
# cannot pass silently the way an unchecked shell command can
urllib.request.urlretrieve(url, local_path)

# Copy from the local driver to the volume ("file:" marks a driver-local path)
dbutils.fs.cp(f"file:{local_path}", f"{files_dir}/emp.csv")

# 3. List files in the volume
display(dbutils.fs.ls(files_dir))

### Querying Data from Volumes
We can query files directly from volumes without creating a table first.

In [None]:
# Read CSV directly from volume path
df = spark.read.format("csv").option("header", "true").load(f"{files_dir}/emp.csv")
display(df)

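# Or using SQL. Note: the csv.`<path>` shorthand applies default reader options,
# so the header line comes back as a data row and columns are named _c0, _c1, ...
# (the read_files table function accepts options such as header => true)
display(spark.sql(f"SELECT * FROM csv.`{files_dir}/emp.csv`"))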
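
Volume paths are also visible to ordinary Python file APIs on the cluster, so a small file can be inspected without Spark. The sketch below is a minimal illustration using only the standard library; `preview_csv` is a hypothetical helper name, and the commented call assumes the `emp.csv` file copied earlier in this notebook.

```python
import csv
from itertools import islice


def preview_csv(path: str, n: int = 5) -> list[dict]:
    """Return the first n rows of a CSV file as dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        # DictReader consumes the first line as the header
        return list(islice(csv.DictReader(fh), n))


# On a cluster this could be called with a volume path, for example:
# rows = preview_csv("/Volumes/dev/bronze/managed_vol/files/emp.csv", n=3)
```

This is convenient for a quick look at a few rows; Spark remains the better choice for anything large.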

## 3. External Volumes

### What is an External Volume?
An external volume is a storage volume registered against a directory within an external location (ADLS/S3/GCS) that you manage. It allows you to bring existing data under Unity Catalog governance without moving it.

*   **Dropping an external volume removes the metadata from Unity Catalog but LEAVES the files in your cloud storage untouched.**

*Prerequisite Check: Ensure you have an External Location created (e.g., `ext_volume_loc`). If not, refer to the notebook on External Locations.*
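
Since an external volume's path must sit inside an external location, a quick string-level check can catch a forgotten placeholder or a mismatched bucket before `CREATE EXTERNAL VOLUME` runs. This is only a sketch: `is_within_location` is a hypothetical helper, the URLs in the usage lines are made-up examples, and Unity Catalog remains the authority on whether a path is actually covered.

```python
from urllib.parse import urlparse

# Placeholder text used in this notebook before a real path is filled in
PLACEHOLDER = "YOUR_EXTERNAL_LOCATION_PATH_HERE"


def is_within_location(volume_url: str, location_url: str) -> bool:
    """True if volume_url equals location_url or is a sub-path of it."""
    if PLACEHOLDER in volume_url:
        return False  # the notebook placeholder was never replaced
    vol, loc = urlparse(volume_url), urlparse(location_url)
    # Scheme and container/bucket must match exactly
    if (vol.scheme, vol.netloc) != (loc.scheme, loc.netloc):
        return False
    # Compare with trailing slashes so "/landing2" is not mistaken for "/landing"
    vol_path = vol.path.rstrip("/") + "/"
    loc_path = loc.path.rstrip("/") + "/"
    return vol_path.startswith(loc_path)


# Hypothetical URLs, for illustration only
print(is_within_location("s3://example-bucket/landing/ext_vol_folder", "s3://example-bucket/landing"))
print(is_within_location("s3://example-bucket/landing2/x", "s3://example-bucket/landing"))
```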

In [None]:
# Define your External Location path (Update this with your actual external location path from previous lessons)
# Example: "abfss://data@<storage_account>.dfs.core.windows.net/ext_vol_path"
external_location_path = "YOUR_EXTERNAL_LOCATION_PATH_HERE/ext_vol_folder"

# Create External Volume
# We must specify the LOCATION
try:
    spark.sql(f"""
        CREATE EXTERNAL VOLUME IF NOT EXISTS dev.bronze.external_vol
        LOCATION '{external_location_path}'
        COMMENT 'External volume pointing to my ADLS container'
    """)
    print("External Volume 'external_vol' created.")
except Exception as e:
    print(f"Error creating volume. Ensure the external location exists and covers this path. Details: {e}")
    raise  # stop here so later cells do not run against a missing volume

In [None]:
# Describe to verify it is EXTERNAL
display(spark.sql("DESCRIBE VOLUME dev.bronze.external_vol"))

### File Operations in External Volume
Just like managed volumes, we use the unified path: `/Volumes/dev/bronze/external_vol/...`

In [None]:
ext_vol_path = "/Volumes/dev/bronze/external_vol"

# Copy the same emp.csv to external volume
dbutils.fs.cp(f"file:{local_path}", f"{ext_vol_path}/emp_ext.csv")

# Verify file exists
display(dbutils.fs.ls(ext_vol_path))

## 4. Cleanup & Behavior Test (DROP)

### Dropping Managed Volume
This will delete the data.

In [None]:
spark.sql("DROP VOLUME IF EXISTS dev.bronze.managed_vol")

# Verify the path is gone (listing should raise once the volume is dropped)
try:
    dbutils.fs.ls(volume_path)
except Exception:
    print("Success: Managed volume path is no longer accessible.")
else:
    print("Unexpected: managed volume path is still listable.")

### Dropping External Volume
This will only remove the catalog object. The file `emp_ext.csv` will remain in your ADLS/S3 bucket.

In [None]:
spark.sql("DROP VOLUME IF EXISTS dev.bronze.external_vol")
print("External volume dropped from Unity Catalog metadata.")

# Note: The /Volumes/... path no longer resolves, but the files remain reachable through
# the cloud storage URI for anyone with access to that location (for example,
# READ FILES on the covering external location, or direct cloud credentials).

## Summary
*   **Volumes** unify file access under `/Volumes/<catalog>/<schema>/<volume_name>`.
*   **Managed Volumes:** Easy setup, UC manages the file lifecycle, `DROP` deletes the data.
*   **External Volumes:** Connect to existing cloud storage, `DROP` removes only the metadata and leaves the files in place.
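
The syntactic difference between the two volume types comes down to the `EXTERNAL` keyword and the `LOCATION` clause. As a hedged illustration, the hypothetical helper below builds either statement as a string. It is not a Databricks API. To stay simple it accepts only plain identifiers and rejects values containing single quotes rather than attempting to escape them, and the `s3://` URL in the usage lines is a made-up example.

```python
import re
from typing import Optional

# Plain, unquoted identifiers only (an assumption made to keep the sketch simple)
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_volume_sql(catalog: str, schema: str, volume: str,
                      location: Optional[str] = None,
                      comment: Optional[str] = None) -> str:
    """Build a CREATE VOLUME statement; passing a location makes it EXTERNAL."""
    for name in (catalog, schema, volume):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    for text in (location, comment):
        if text is not None and "'" in text:
            raise ValueError("single quotes are not supported in this sketch")
    kind = "EXTERNAL VOLUME" if location else "VOLUME"
    parts = [f"CREATE {kind} IF NOT EXISTS {catalog}.{schema}.{volume}"]
    if location:
        parts.append(f"LOCATION '{location}'")
    if comment:
        parts.append(f"COMMENT '{comment}'")
    return "\n".join(parts)


print(create_volume_sql("dev", "bronze", "managed_vol", comment="raw files"))
print(create_volume_sql("dev", "bronze", "external_vol", location="s3://example-bucket/ext_vol_folder"))
```

The resulting string could then be passed to `spark.sql(...)` in the same way as the statements used in this notebook.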