In [0]:
df_json = (
    spark.read.format('json')
    # JSON schemas are inferred by default; 'header' and 'inferSchema' are CSV options
    .option('multiLine', False)  # each line of the file holds one JSON record
    .load('/Volumes/pyspark_practice/default/files/Practice/drivers.json')
)
df_json.display()

# 🧠 Unity Catalog Assets — Volumes, Tables, and Models Explained

---

## 📦 What Is a Volume?

A **volume** in Unity Catalog is a governed storage container for **non-tabular data**—think files like images, logs, PDFs, JSON, or ML datasets. It’s designed to complement tables by handling unstructured and semi-structured data.

### 🔹 Is a Volume Created Inside a Schema?

Yes! In Unity Catalog, volumes are **siblings** to tables, views, and models. They live inside a **schema**, which itself belongs to a **catalog**.

> 📁 Path format: `/Volumes/<catalog>/<schema>/<volume>/<path>/<file>`
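
As a minimal sketch of how that path is assembled, the helper below joins the pieces with POSIX separators. The name `volume_path` is hypothetical (it is not a Databricks API); the example values are the ones used in the JSON read at the top of this notebook.

```python
import posixpath

def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/... path (hypothetical helper, not a Databricks API)."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)

path = volume_path("pyspark_practice", "default", "files", "Practice", "drivers.json")
print(path)
```

The resulting string can be passed straight to `spark.read...load(path)` in a notebook.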

---

## 🧪 Types of Volumes

### 1. **Managed Volume**
- Created and fully governed by Unity Catalog.
- No need to specify a location—Databricks handles storage.
- File access is only through Unity Catalog paths.
- When a managed volume is dropped, its files are removed from cloud storage within 7 days.

✅ Use when:
- You want simple governance.
- You don’t need external access to the files.

⚠️ Avoid when:
- You need direct cloud URI access or external system integration.

---

### 2. **External Volume**
- Points to a directory in your cloud storage (e.g., S3, ADLS).
- You specify the location during creation.
- Unity Catalog governs access, but external systems can still read/write directly.

✅ Use when:
- You already have data in cloud storage.
- You want to apply governance without migrating files.

⚠️ Avoid when:
- You want Unity Catalog to fully manage lifecycle and cleanup.

---

## 📊 What Is a Table?

A **table** is a structured dataset with rows and columns. In Unity Catalog, tables are governed objects inside schemas.

### 🔹 Types of Tables

| Type           | Description                              | Use Case                          |
|----------------|------------------------------------------|-----------------------------------|
| Managed Table  | Databricks manages both data & metadata  | Internal-only workflows           |
| External Table | Data lives in external storage           | Shared or multi-tool environments |
| Delta Table    | Transactional, versioned, scalable       | Production-grade pipelines        |

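> 📌 Delta is a storage format rather than a third ownership type: a managed or external table can also be a Delta table. Delta tables support ACID transactions, schema evolution, and time travel.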
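
To make time travel concrete, here is a small sketch that reads an earlier version of a Delta table through SQL. The function name `read_table_version` is hypothetical, and the commented usage assumes `big_mart_sales` exists as a Delta table with a version 0.

```python
def read_table_version(spark, table_name: str, version: int):
    """Return the table as it was at a given Delta version (time travel)."""
    # VERSION AS OF is Delta Lake SQL syntax; int() keeps arbitrary text out of the query.
    return spark.sql(f"SELECT * FROM {table_name} VERSION AS OF {int(version)}")

# In a Databricks notebook, where `spark` is predefined (assumed table name):
# df_v0 = read_table_version(spark, "pyspark_practice.default.big_mart_sales", 0)
# df_v0.display()
```

`DESCRIBE HISTORY <table>` lists the versions available to query.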

---

## 🧠 What Is a Model?

A **model** in Unity Catalog is a registered machine learning asset. It includes:

- Model files (e.g., `.pkl`, `.onnx`, `.pt`)
- Metadata (version, creator, tags)
- Permissions and lineage

Models are stored inside schemas and can be versioned, served, and shared.

✅ Use for:
- ML model governance
- Reproducibility and auditability
- Serving models via Databricks Model Serving

⚠️ Avoid storing models outside Unity Catalog if you need traceability or sharing.

---

## 🛠️ How to Create These Assets

### 🔹 Create a Volume

```sql
CREATE VOLUME my_volume
COMMENT 'Volume for storing raw images'
```

- For managed: no location needed.
- For external: use `CREATE EXTERNAL VOLUME` and add `LOCATION 's3://my-bucket/path/'`
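
The two variants can be sketched from Python as a statement builder. `create_volume_sql` and the volume name `my_ext_volume` are hypothetical, and the sketch assumes the comment text contains no single quotes. It only builds the SQL string; in a notebook you would pass the result to `spark.sql(...)`.

```python
from typing import Optional

def create_volume_sql(name: str, comment: Optional[str] = None,
                      location: Optional[str] = None) -> str:
    """Build a CREATE VOLUME statement (hypothetical helper)."""
    # Supplying a location makes the volume external, which needs the EXTERNAL keyword.
    kind = "EXTERNAL VOLUME" if location else "VOLUME"
    parts = [f"CREATE {kind} {name}"]
    if location:
        parts.append(f"LOCATION '{location}'")
    if comment:
        parts.append(f"COMMENT '{comment}'")  # assumes no single quotes in comment
    return "\n".join(parts)

print(create_volume_sql("my_volume", comment="Volume for storing raw images"))
print(create_volume_sql("my_ext_volume", location="s3://my-bucket/path/"))
```

For an external volume, the path generally has to fall under an external location already registered in Unity Catalog.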

### 🔹 Create a Table

```sql
CREATE TABLE my_table (
  id INT,
  name STRING
)
USING DELTA
```

- You can also use `USING CSV`, `PARQUET`, etc.
- For external tables, add `LOCATION '

In [0]:
df_csv = spark.read.format('csv').option('inferSchema',True).option('header',True).load('/Volumes/pyspark_practice/default/files/Practice/BigMart Sales.csv')
df_csv.display()

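Note: the CSV and JSON files were uploaded from my local machine and are stored in a volume. Instead of uploading to a volume, a CSV can also be imported as a table. If it was imported as a table, the file-reading code above does not work and throws an error, because there is no file at that volume path; load the table by name with `spark.table(...)` instead.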

In [0]:
# Detect available catalogs
catalogs = [row.catalog for row in spark.sql("SHOW CATALOGS").collect()]
print(catalogs)

# Define your table name
table_name = "big_mart_sales"
schema_name = "default"
catalog_name = "pyspark_practice"  # or "myorganization" if that's your top-level catalog

# Try Unity Catalog format if available
if catalog_name in catalogs:
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
else:
    full_table_name = f"{schema_name}.{table_name}"

# Try loading the table
try:
    df = spark.table(full_table_name)
    print(f"✅ Loaded table: {full_table_name}")
    df.display()
except Exception as e:
    print(f"❌ Failed to load table: {full_table_name}")
    print(f"Error: {e}")