In [0]:
# JSON schemas are inferred by default; 'header' and 'inferSchema' are CSV options.
df_json = (
    spark.read.format('json')
    .option('multiLine', False)  # one JSON object per line (JSON Lines)
    .load('/Volumes/pyspark_practice/default/files/Practice/drivers.json')
)
df_json.display()

# 🧠 Unity Catalog Assets — Volumes, Tables, and Models Explained

---

## 📦 What Is a Volume?

A **volume** in Unity Catalog is a governed storage container for **non-tabular data**—think files like images, logs, PDFs, JSON, or ML datasets. It’s designed to complement tables by handling unstructured and semi-structured data.

### 🔹 Is a Volume Created Inside a Schema?

Yes! In Unity Catalog, volumes are **siblings** to tables, views, and models. They live inside a **schema**, which itself belongs to a **catalog**.

> 📁 Path format: `/Volumes/<catalog>/<schema>/<volume>/<path>/<file>`
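
As a small illustration of the path format, the sketch below assembles a volume path from its parts. The helper name `volume_path` is hypothetical (not a Databricks API); the component values come from the read path used at the top of this notebook.

```python
from posixpath import join


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/... path from its Unity Catalog components."""
    return join("/Volumes", catalog, schema, volume, *parts)


# Components taken from the JSON read path used earlier in this notebook.
drivers_json = volume_path("pyspark_practice", "default", "files", "Practice", "drivers.json")
print(drivers_json)
```

Keeping the catalog, schema, and volume as separate arguments makes it easy to point the same code at a different catalog.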

---

## 🧪 Types of Volumes

### 1. **Managed Volume**
- Created and fully governed by Unity Catalog.
- No need to specify a location—Databricks handles storage.
- File access is only through Unity Catalog paths.
- When a managed volume is dropped, its files are removed from cloud storage within 7 days.

✅ Use when:
- You want simple governance.
- You don’t need external access to the files.

⚠️ Avoid when:
- You need direct cloud URI access or external system integration.

---

### 2. **External Volume**
- Points to a directory in your cloud storage (e.g., S3, ADLS).
- You specify the location during creation.
- Unity Catalog governs access, but external systems can still read/write directly.

✅ Use when:
- You already have data in cloud storage.
- You want to apply governance without migrating files.

⚠️ Avoid when:
- You want Unity Catalog to fully manage lifecycle and cleanup.

---

## 📊 What Is a Table?

A **table** is a structured dataset with rows and columns. In Unity Catalog, tables are governed objects inside schemas.

### 🔹 Types of Tables

| Type           | Description                              | Use Case                          |
|----------------|------------------------------------------|-----------------------------------|
| Managed Table  | Databricks manages both data & metadata  | Internal-only workflows           |
| External Table | Data lives in external storage           | Shared or multi-tool environments |
| Delta Table    | Transactional, versioned, scalable       | Production-grade pipelines        |

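> 📌 Delta is a storage format rather than a third ownership type: a Delta table can be managed or external. Delta tables support ACID transactions, schema evolution, and time travel.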

---

## 🧠 What Is a Model?

A **model** in Unity Catalog is a registered machine learning asset. It includes:

- Model files (e.g., `.pkl`, `.onnx`, `.pt`)
- Metadata (version, creator, tags)
- Permissions and lineage

Models are stored inside schemas and can be versioned, served, and shared.

✅ Use for:
- ML model governance
- Reproducibility and auditability
- Serving models via Databricks Model Serving

⚠️ Avoid storing models outside Unity Catalog if you need traceability or sharing.

---

## 🛠️ How to Create These Assets

### 🔹 Create a Volume

```sql
CREATE VOLUME my_volume
COMMENT 'Volume for storing raw images'
```

- For managed: no location needed.
- For external: use `CREATE EXTERNAL VOLUME` and add `LOCATION 's3://my-bucket/path/'`
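
The difference between the two statements can be sketched from Python. The helper `create_volume_sql` is hypothetical and only builds the statement text; it assumes the external form is `CREATE EXTERNAL VOLUME ... LOCATION '...'` and does no escaping of quotes in its inputs.

```python
from typing import Optional


def create_volume_sql(name: str, comment: Optional[str] = None,
                      location: Optional[str] = None) -> str:
    """Return a CREATE VOLUME statement; passing `location` makes it external."""
    head = f"CREATE EXTERNAL VOLUME {name}" if location else f"CREATE VOLUME {name}"
    parts = [head]
    if location:
        parts.append(f"LOCATION '{location}'")
    if comment:
        # Note: quotes inside the comment are not escaped in this sketch.
        parts.append(f"COMMENT '{comment}'")
    return "\n".join(parts)


managed = create_volume_sql("my_volume", comment="Volume for storing raw images")
external = create_volume_sql("my_volume", location="s3://my-bucket/path/")
print(managed)
print(external)
```

In a notebook, either string could then be passed to `spark.sql(...)`.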

### 🔹 Create a Table

```sql
CREATE TABLE my_table (
  id INT,
  name STRING
)
USING DELTA
```

- You can also use `USING CSV`, `PARQUET`, etc.
- For external tables, add `LOCATION '

In [0]:
df_csv = spark.read.format('csv').option('inferSchema',True).option('header',True).load('/Volumes/pyspark_practice/default/files/Practice/BigMart Sales.csv')
df_csv.display()

Note: the CSV and JSON files were uploaded from my local machine and are stored in a volume. A CSV can also be imported as a table instead of a volume file. If it is imported as a table, the file-based read above cannot be used and throws an error; load it by table name with `spark.table` instead.

In [0]:
# Detect available catalogs
catalogs = [row.catalog for row in spark.sql("SHOW CATALOGS").collect()]
print(catalogs)

# Define your table name
table_name = "big_mart_sales"
schema_name = "default"
catalog_name = "pyspark_practice"  # or "myorganization" if that's your top-level catalog

# Try Unity Catalog format if available
if catalog_name in catalogs:
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
else:
    full_table_name = f"{schema_name}.{table_name}"

# Try loading the table
try:
    df = spark.table(full_table_name)
    print(f"✅ Loaded table: {full_table_name}")
    df.display()
except Exception as e:
    print(f"❌ Failed to load table: {full_table_name}")
    print(f"Error: {e}")
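
The catalog fallback in the cell above can be isolated into a small function so it can be checked without a Spark session. This is a sketch; `qualify_table_name` is a hypothetical helper, and the catalog list passed in stands in for the result of `SHOW CATALOGS`.

```python
from typing import Iterable


def qualify_table_name(catalogs: Iterable[str], catalog: str, schema: str, table: str) -> str:
    """Use catalog.schema.table when the catalog exists, else fall back to schema.table."""
    if catalog in set(catalogs):
        return f"{catalog}.{schema}.{table}"
    return f"{schema}.{table}"


# Example catalog lists (illustrative values, not real SHOW CATALOGS output).
with_uc = qualify_table_name(["pyspark_practice", "system"], "pyspark_practice", "default", "big_mart_sales")
without_uc = qualify_table_name(["hive_metastore"], "pyspark_practice", "default", "big_mart_sales")
print(with_uc)
print(without_uc)
```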

# 🧠 PySpark Data Types — DDL vs StructType Cheat Sheet

Let’s break down PySpark data types into two intuitive categories: **DDL-style** and **Struct-based**. Think of it as SQL-like string schemas vs. Pythonic object schemas—each with its own strengths depending on your use case.

---

## 🔹 1. DDL-style (String-based schema)

This is the SQL-inspired way to define schemas using strings. It’s compact and often used when reading data from external sources like CSV or JSON.

### ✅ Best for:
- Quick schema definitions
- Config-driven pipelines
- Lightweight ingestion scripts

### 📌 Syntax

```python
ddl_schema = "name STRING, age INT, salary DOUBLE"
```

### 🧬 Supported Types

| DDL Type     |
|--------------|
| STRING       |
| INT          |
| BIGINT       |
| DOUBLE       |
| BOOLEAN      |
| DATE         |
| TIMESTAMP    |
| ARRAY<...>   |
| MAP<...>     |
| STRUCT<...>  |

### 🧪 Example with Nesting

```python
ddl_schema = "user STRUCT<id: INT, name: STRING>, scores ARRAY<DOUBLE>"
```
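
Because a DDL schema is just a string, it can be assembled from plain data, which is what makes it convenient for config-driven pipelines. The sketch below assumes a hypothetical list of `(column, type)` pairs, for example loaded from a config file; `build_ddl` is an illustrative helper, not part of PySpark.

```python
def build_ddl(columns):
    """Join (name, ddl_type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {ddl_type}" for name, ddl_type in columns)


# Hypothetical column config, e.g. parsed from a YAML or JSON file.
column_config = [
    ("name", "STRING"),
    ("age", "INT"),
    ("salary", "DOUBLE"),
    ("scores", "ARRAY<DOUBLE>"),
]

ddl_schema = build_ddl(column_config)
print(ddl_schema)
```

The resulting string can be used anywhere a DDL schema is accepted, such as the `schema` argument of `spark.createDataFrame`.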

---

## 🔸 2. Struct-based (Programmatic schema)

This is the object-oriented way using `StructType` and `StructField`. It’s verbose but gives you full control—ideal for dynamic schema generation and validation.

### ✅ Best for:
- Complex pipelines
- Dynamic schema manipulation
- UDFs and typed transformations

### 📌 Syntax

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

struct_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True)
])
```

### 🧬 Supported Types

| Struct Type       |
|-------------------|
| StringType()      |
| IntegerType()     |
| LongType()        |
| DoubleType()      |
| BooleanType()     |
| DateType()        |
| TimestampType()   |
| ArrayType(...)    |
| MapType(...)      |
| StructType([...]) |

### 🧪 Example with Nesting

```python
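from pyspark.sql.types import (
    ArrayType, DoubleType, IntegerType, StringType, StructField, StructType
)

nested_schema = StructType([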
    StructField("user", StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True)
    ]), True),
    StructField("scores", ArrayType(DoubleType()), True)
])
```

---

## 🧭 Comparison Table

| Feature               | DDL-style (String) | Struct-based (Object) |
|------------------------|--------------------|------------------------|
| Syntax Style           | SQL-like string     | Pythonic object        |
| Verbosity              | Compact             | Verbose                |
| Nesting Support        | ✅ Yes              | ✅ Yes                 |
| Dynamic Generation     | ⚠️ Limited         | ✅ Full control        |
| Best For               | Ingestion, configs  | UDFs, transformations  |
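| Type Safety            | ❌ Type names are plain text until Spark parses them | ✅ Types are Python objects |
| Validation             | ❌ Manual; typos surface only when the string is parsed | ✅ Invalid types fail when the schema object is built |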

---



In [0]:
from datetime import datetime, date

data = [
    (
        "Alice", 29, 1001, 2500.75, True, date(1996, 5, 14), datetime(2025, 9, 19, 10, 30, 0),
        ["Python", "SQL"],
        {"theme": "dark", "language": "en"},
        {"city": "Mumbai", "zip": "400601"}
    ),
    (
        "Bob", 35, 1002, 1800.50, False, date(1989, 11, 2), datetime(2025, 9, 19, 11, 15, 0),
        ["Java", "Scala"],
        {"theme": "light", "language": "fr"},
        {"city": "Pune", "zip": "411001"}
    )
]

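from pyspark.sql.types import (
    ArrayType, BooleanType, DateType, DoubleType, IntegerType, LongType,
    MapType, StringType, StructField, StructType, TimestampType,
)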

struct_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("user_id", LongType(), True),
    StructField("price", DoubleType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("birthdate", DateType(), True),
    StructField("event_time", TimestampType(), True),
    StructField("skills", ArrayType(StringType()), True),
    StructField("preferences", MapType(StringType(), StringType()), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("zip", StringType(), True)
    ]), True)
])
df_struct = spark.createDataFrame(data, schema=struct_schema)
df_struct.show(truncate=False)
df_struct.printSchema()

In [0]:
from datetime import datetime, date

data = [
    (
        "Alice",                      # STRING
        29,                           # INT
        1001,                         # BIGINT
        2500.75,                      # DOUBLE
        True,                         # BOOLEAN
        date(1996, 5, 14),            # DATE
        datetime(2025, 9, 19, 10, 30),# TIMESTAMP
        ["Python", "SQL"],            # ARRAY<STRING>
        {"theme": "dark", "lang": "en"}, # MAP<STRING, STRING>
        {"city": "Mumbai", "zip": "400601"} # STRUCT<city, zip>
    ),
    (
        "Bob",
        35,
        1002,
        1800.50,
        False,
        date(1989, 11, 2),
        datetime(2025, 9, 19, 11, 15),
        ["Java", "Scala"],
        {"theme": "light", "lang": "fr"},
        {"city": "Pune", "zip": "411001"}
    )
]

In [0]:
ddl_schema = """
    name STRING,
    age INT,
    user_id BIGINT,
    price DOUBLE,
    is_active BOOLEAN,
    birthdate DATE,
    event_time TIMESTAMP,
    skills ARRAY<STRING>,
    preferences MAP<STRING, STRING>,
    address STRUCT<city: STRING, zip: STRING>
"""

In [0]:
df = spark.createDataFrame(data, schema=ddl_schema)
df.show(truncate=False)
df.printSchema()
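
Before handing rows to `createDataFrame`, it can help to check that each scalar value has the Python type its DDL column expects. The sketch below is a manual pre-check in plain Python, not PySpark's own validation; the mapping `PYTHON_TYPE_FOR_DDL` and the helper `mismatches` are assumptions made for illustration, and nested types (`ARRAY`, `MAP`, `STRUCT`) are skipped.

```python
from datetime import date, datetime

# Assumed mapping from scalar DDL type names to the Python types used in the sample rows.
PYTHON_TYPE_FOR_DDL = {
    "STRING": str,
    "INT": int,
    "BIGINT": int,
    "DOUBLE": float,
    "BOOLEAN": bool,
    "DATE": date,
    "TIMESTAMP": datetime,
}


def mismatches(row, ddl_types):
    """Return (position, ddl_type, value) for each scalar whose Python type does not match."""
    problems = []
    for position, (value, ddl_type) in enumerate(zip(row, ddl_types)):
        expected = PYTHON_TYPE_FOR_DDL.get(ddl_type)
        if expected is None or value is None:
            continue  # nested types and nulls are not checked in this sketch
        # Exact type comparison: bool is a subclass of int, and datetime of date.
        if type(value) is not expected:
            problems.append((position, ddl_type, value))
    return problems


scalar_types = ["STRING", "INT", "BIGINT", "DOUBLE", "BOOLEAN", "DATE", "TIMESTAMP"]
good_row = ("Alice", 29, 1001, 2500.75, True, date(1996, 5, 14), datetime(2025, 9, 19, 10, 30))
bad_row = ("Bob", "35", 1002, 1800, False, date(1989, 11, 2), datetime(2025, 9, 19, 11, 15))

print(mismatches(good_row, scalar_types))
print(mismatches(bad_row, scalar_types))
```

Here `bad_row` is a deliberately broken variant of the sample data: the age is a string and the price is an integer rather than a float.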