# 🔥 PySpark SparkSession Initialization — Explained for Databricks Users

## 🧠 What This Code Does

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.some.config.option", "value") \
    .getOrCreate()
```

This snippet manually creates a `SparkSession`, which is the **gateway to Spark's features** in PySpark. Here's a quick breakdown:

- `SparkSession.builder`: Starts the builder pattern to configure the session.
- `.appName("MyApp")`: Sets the name of the Spark application (visible in Spark UI).
- `.master("local[*]")`: Runs Spark locally using all available cores. This is used **outside clusters**, like on your laptop.
- `.config(...)`: Adds custom Spark configuration. Replace `"spark.some.config.option"` with actual config keys.
- `.getOrCreate()`: Returns an existing SparkSession or creates a new one.

> ✅ This is **essential** when running PySpark scripts in standalone environments (e.g., VS Code, terminal, or Jupyter).
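
For a standalone script, the builder steps above can be wrapped in a small helper. This is a minimal sketch rather than a prescribed pattern: `configure_builder` is a hypothetical helper name, and `spark.sql.shuffle.partitions` is used only as an example of a real config key.

```python
def configure_builder(builder, app_name, configs=None, master=None):
    """Apply an app name, an optional master URL, and config pairs to a builder.

    `builder` is expected to be `SparkSession.builder`. This hypothetical helper
    relies only on the appName/master/config methods shown above.
    """
    builder = builder.appName(app_name)
    if master is not None:
        # Set a master only for local runs; managed clusters provide their own.
        builder = builder.master(master)
    for key, value in (configs or {}).items():
        builder = builder.config(key, value)
    return builder


# Usage in a local script (requires a PySpark installation):
#
#   from pyspark.sql import SparkSession
#   spark = configure_builder(
#       SparkSession.builder,
#       app_name="MyApp",
#       configs={"spark.sql.shuffle.partitions": "8"},
#       master="local[*]",
#   ).getOrCreate()
```

Keeping `master` optional means the same helper can be reused in environments where the master is set for you.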

---

## 🚀 Why This Is *Not Required* in Databricks

Databricks notebooks run inside a **fully managed Spark cluster**, which automatically provisions a `SparkSession` named `spark`.

### What Databricks Already Handles:
- ✅ SparkSession creation
- ✅ Cluster resource management
- ✅ Application naming and logging
- ✅ Context-aware configuration

### What You Don’t Need to Do:
- ❌ Call `.getOrCreate()` — it's already done.
- ❌ Set `.master("local[*]")` — Databricks uses cluster mode.
- ❌ Manually configure basic session settings — many are managed by the platform.

> 🎬 Think of it like walking into a cinema with your own projector—redundant, but not disruptive.

---

## ⚠️ What Happens If You Run This in Databricks

If you paste this code into a Databricks notebook:

- ✅ It **won’t throw an error**.
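- 🔁 It will **return the existing SparkSession** instead of creating a new one.
- ⚠️ `.master("local[*]")` will be **ignored**, because the cluster's Spark context is already running.
- 🧩 `.config(...)` values are applied to the existing session only where the setting can change at runtime; Databricks-managed or cluster-level settings stay as they are.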

> 🧼 Best practice: Avoid redefining `spark` unless absolutely necessary.
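
Instead of rebuilding the session, runtime settings can usually be changed on the existing `spark` object through `spark.conf`. The sketch below assumes the setting is modifiable at runtime; `set_runtime_conf` is a hypothetical helper, and the config keys are examples only.

```python
def set_runtime_conf(spark, overrides):
    """Set runtime configs on an existing session and return the previous values.

    `spark` is the session the notebook already provides; `overrides` maps
    config keys to new string values. Hypothetical helper for illustration.
    """
    previous = {}
    for key, value in overrides.items():
        # The second argument to conf.get is the default returned when unset.
        previous[key] = spark.conf.get(key, None)
        spark.conf.set(key, value)
    return previous


# Usage in a Databricks notebook, where `spark` already exists:
#
#   old = set_runtime_conf(spark, {"spark.sql.shuffle.partitions": "64"})
#   ...run the workload...
#   set_runtime_conf(spark, {k: v for k, v in old.items() if v is not None})
```

Returning the previous values makes it easy to restore the session afterwards, which matters when several notebooks share one cluster.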

---

## 🧪 Summary Comparison

| Environment     | SparkSession Needed? | `.master(...)` Valid? | Default `spark` Available? |
|----------------|----------------------|------------------------|-----------------------------|
| PySpark Script | ✅ Yes               | ✅ Yes                 | ❌ No                       |
| Databricks     | ❌ Not Required      | ❌ Ignored             | ✅ Yes                      |

---

## 💡 Pro Tip for Reusability

If you're writing notebooks that may run both inside and outside Databricks, use a conditional check:

```python
from pyspark.sql import SparkSession

# Databricks notebooks predefine `spark`; build a session only when it is missing
if "spark" not in globals():
    spark = SparkSession.builder \
        .appName("MyApp") \
        .master("local[*]") \
        .getOrCreate()
```
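
Another way to decide is to look at the environment rather than at variable names. The sketch below assumes that Databricks clusters set the `DATABRICKS_RUNTIME_VERSION` environment variable, which is worth verifying on your runtime. Both function names are hypothetical.

```python
import os


def running_on_databricks(environ=None):
    """Return True when the Databricks runtime environment variable is present."""
    environ = os.environ if environ is None else environ
    return "DATABRICKS_RUNTIME_VERSION" in environ


def session_master(environ=None):
    """Pick a master URL for local runs, or None to keep the cluster's setting."""
    return None if running_on_databricks(environ) else "local[*]"


# Usage: only call .master(...) when session_master() returns a value.
#
#   builder = SparkSession.builder.appName("MyApp")
#   master = session_master()
#   if master is not None:
#       builder = builder.master(master)
#   spark = builder.getOrCreate()
```

Passing `environ` explicitly keeps the check easy to unit test without touching the real process environment.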

---

📘 *This explanation is part of DataGym’s onboarding series for PySpark learners. For more annotated examples and reusable notebook templates, check out the repository structure and contribution guide.*


To confirm that the predefined session is available, run `spark.version` in a Python cell, or `SELECT current_date()` in a `%sql` cell.

# 🧠 Catalog vs Unity Catalog vs Hive Metastore — Explained for Databricks Users

This section compares **catalogs**, **Unity Catalog**, and the **Hive Metastore**: the basics first, then architecture, governance, and production strategy, including what Databricks offers in its free edition and how the pieces connect.

---

## 📁 What Is a Catalog?

In Spark and Databricks, a **catalog** is a top-level container that organizes **schemas (databases)** and **tables**. Think of it as a folder system:

```
catalog.schema.table
```

- **Catalog**: Logical grouping (e.g., `main`, `hive_metastore`, `my_catalog`)
- **Schema**: Like a database (e.g., `sales`, `marketing`)
- **Table**: Actual data (e.g., `transactions`, `customers`)
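
The three-level naming above can be made concrete with a small parser. This is an illustrative sketch: `TableName` and `parse_table_name` are hypothetical names, the defaults are assumptions, and backtick-quoted identifiers that contain dots are not handled.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(name, default_catalog="hive_metastore", default_schema="default"):
    """Split 'catalog.schema.table' (or a shorter form) into its three parts."""
    parts = name.split(".")
    if len(parts) == 3:
        return TableName(*parts)
    if len(parts) == 2:
        # 'schema.table': fill in the catalog.
        return TableName(default_catalog, *parts)
    if len(parts) == 1:
        # 'table': fill in both catalog and schema.
        return TableName(default_catalog, default_schema, parts[0])
    raise ValueError(f"Expected at most three dot-separated parts, got: {name!r}")


# Usage:
#
#   name = parse_table_name("main.sales.transactions")
#   df = spark.table(".".join(name))
```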

---

## 🐝 What Is Hive Metastore?

The **Hive Metastore** is the traditional metadata store used in Hadoop and Spark ecosystems. It stores:

- Table definitions
- Schema info
- Data locations (e.g., file paths in HDFS or S3)

### 🔗 How It’s Related:
- Spark uses Hive Metastore to resolve table names and schemas.
- In Databricks, the workspace's legacy Hive Metastore appears as a catalog named `hive_metastore`; it is the **default catalog** in workspaces without Unity Catalog.

> 📌 In Databricks, you can query like:  
> `SELECT * FROM hive_metastore.sales.transactions`

---

## 🧭 What Is Unity Catalog?

**Unity Catalog** is Databricks’ modern, cloud-native metadata and governance layer. It replaces Hive Metastore with:

- ✅ Fine-grained access control (column-level, row-level)
- ✅ Cross-workspace governance
- ✅ Built-in auditing and lineage tracking
- ✅ Multi-cloud support (Azure, AWS, GCP)
- ✅ REST APIs and integration with identity providers (like Entra ID)

> 🧠 Unity Catalog introduces a **three-level namespace**:  
> `catalog.schema.table` — making it easier to manage data across teams and environments.
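
Because privileges follow the same three levels, reading a single table typically needs access at each one. The sketch below builds the corresponding `GRANT` statements. The function name and the `analysts` group are hypothetical, and the exact privileges your workspace needs may differ.

```python
def grant_read_access(catalog, schema, table, principal):
    """Build the GRANT statements a principal needs to read one table."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Usage in a Unity Catalog workspace (group name is hypothetical):
#
#   for statement in grant_read_access("main", "sales", "transactions", "analysts"):
#       spark.sql(statement)
```

Generating the statements as plain strings makes them easy to review or log before anything is executed.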

---

## 🧪 What Is V1 Catalog (aka Hive Metastore in Databricks)?

- The **V1 Catalog** refers to the legacy Hive Metastore implementation in Databricks.
- It’s **workspace-scoped** (each workspace has its own metastore).
- Limited to **basic table-level permissions**.
- No built-in lineage, audit logs, or cross-workspace sharing.

---

## 🆓 Databricks Free Edition (Community Edition)

| Feature              | Available in Free Edition | Notes |
|----------------------|---------------------------|-------|
| Hive Metastore (V1)  | ✅ Yes                    | Default catalog (`hive_metastore`) |
| Unity Catalog        | ❌ No                     | Requires premium tier or enterprise workspace |
| Custom Catalogs      | ❌ No                     | Only `hive_metastore` is available |

---

## 🏭 Which Should Be Used in Production?

| Feature              | Hive Metastore (V1) | Unity Catalog |
|----------------------|---------------------|----------------|
| Governance           | ❌ Basic            | ✅ Advanced (column-level, row-level) |
| Auditing             | ❌ Manual           | ✅ Built-in |
| Lineage Tracking     | ❌ External tools   | ✅ Native |
| Multi-workspace      | ❌ No               | ✅ Yes |
| Cloud Integration    | ⚠️ Limited         | ✅ Native |
| Security             | ❌ Workspace-local | ✅ Identity-based |
| Scalability          | ⚠️ Bottlenecks     | ✅ Petabyte-scale |
| Recommended for Prod | ❌ No               | ✅ Yes |

> 🧩 Unity Catalog is the clear choice for production environments, especially when data governance, compliance, and collaboration are key.

---

## 🧬 Similarities & Differences

| Aspect               | Hive Metastore (V1)         | Unity Catalog                   |
|----------------------|-----------------------------|----------------------------------|
| Metadata Storage     | RDBMS (MySQL, Postgres)     | Cloud-native, managed by Databricks |
| Namespace            | `database.table`            | `catalog.schema.table`          |
| Access Control       | Table-level (basic)         | Fine-grained (column, row)      |
| Integration          | Spark, Hive                 | Spark, Delta Lake, MLflow       |
| Governance           | External tools (e.g., Ranger) | Built-in                        |
| Multitenancy         | Single workspace            | Cross-workspace                 |

---

## 🛠️ Migration Strategy

If you're using Hive Metastore and want to move to Unity Catalog:

1. **Upgrade tables** to Unity Catalog.
2. **Federate Hive Metastore** into Unity Catalog as a foreign catalog (gradual migration).
3. **Disable direct access** to Hive Metastore once migrated.
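
As one possible illustration of step 1, a table can be copied into a Unity Catalog catalog with a plain `CREATE TABLE ... AS SELECT`. This is a sketch under stated assumptions, not the official upgrade path: it copies the data (so table history is not carried over), the helper name is hypothetical, and `main` is assumed as the target catalog.

```python
def ctas_upgrade_statement(schema, table, target_catalog="main"):
    """Build a CREATE TABLE AS SELECT that copies one Hive Metastore table."""
    source = f"hive_metastore.{schema}.{table}"
    target = f"{target_catalog}.{schema}.{table}"
    return f"CREATE TABLE IF NOT EXISTS {target} AS SELECT * FROM {source}"


# Usage (the target schema must already exist):
#
#   spark.sql(ctas_upgrade_statement("sales", "transactions"))
```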

---

## 🧵 TL;DR Summary

- **Hive Metastore (V1)**: Legacy, workspace-scoped, basic governance. Default in free edition.
- **Unity Catalog**: Modern, secure, scalable, cross-workspace. Recommended for production.
- **Catalogs**: Logical containers for organizing schemas and tables. Unity Catalog supports multiple catalogs.

---



# 🧠 Databricks Catalog Concepts — Tables, Volumes, Models, Delta Shares & More

---

## 🆓 What Does Databricks Free Edition Provide?

| Feature              | Available in Free Edition | Notes |
|----------------------|---------------------------|-------|
| Hive Metastore (V1)  | ✅ Yes                    | Default catalog: `hive_metastore` |
| Unity Catalog        | ❌ No                     | Requires premium/enterprise tier |
| Delta Sharing        | ❌ No                     | Only available with Unity Catalog |
| Volumes & Models     | ❌ No                     | Unity Catalog features only |

> 📌 In Free Edition, you're limited to the legacy `hive_metastore` catalog and basic table-level access control.

---

## 📁 Tables vs Volumes vs Models in Unity Catalog

Unity Catalog introduces **data and AI asset types** that live inside catalogs and schemas:

### 🧮 Tables
- Structured tabular data (rows & columns)
- Can be **managed** (Databricks stores the files) or **external** (you manage the files)
- Types:
  - **Delta Tables**: Transactional, versioned, scalable
  - **Parquet/CSV Tables**: Non-transactional, static
- ✅ Use for: SQL queries, BI dashboards, ML training
- ⚠️ Avoid: CSV tables for production—no ACID guarantees

### 📦 Volumes
- Unstructured or semi-structured file storage (like a folder)
- Store images, PDFs, logs, JSON, etc.
- Access via `dbutils.fs` or Spark APIs
- ✅ Use for: ML datasets, raw ingestion, file-based workflows
- ⚠️ Avoid: Using volumes for structured tabular data—use tables instead
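
Files in a volume are addressed with paths of the form `/Volumes/<catalog>/<schema>/<volume>/...`. The helper below is a small sketch for building such paths; `volume_path` and the catalog, schema, and volume names in the usage comments are hypothetical.

```python
import posixpath


def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path for a file or folder."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Usage in a Unity Catalog workspace:
#
#   landing = volume_path("main", "raw", "landing")
#   dbutils.fs.ls(landing)
#   df = spark.read.json(volume_path("main", "raw", "landing", "logs", "events.json"))
```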

### 🤖 Models
- Registered ML models (e.g., sklearn, XGBoost, PyTorch)
- Stored with metadata, versioning, and permissions
- Can be shared via Delta Sharing
- ✅ Use for: Model serving, governance, reproducibility
- ⚠️ Avoid: Storing models outside Unity Catalog if you need auditability

---

## 🏢 My Organization vs 🌐 Delta Shares

### 🏢 My Organization
- Refers to **data assets accessible within your Databricks account**
- Includes all catalogs, schemas, tables, volumes, models you own
- Governed by Unity Catalog (if enabled)

### 🌐 Delta Shares
- Mechanism to **share data across organizations**
- Uses Unity Catalog to define **shares**, **recipients**, and **providers**
- Supports sharing:
  - Tables
  - Views
  - Volumes
  - Models
  - Notebooks

> 📌 Delta Shares are **not available in Free Edition**. They require Unity Catalog and a premium workspace.

---

## 📊 CSV Table vs Delta Table

| Feature            | CSV Table (via `CREATE TABLE ... USING CSV`) | Delta Table (via `CREATE TABLE ... USING DELTA`) |
|--------------------|-------------------------------|----------------------------------------------|
| Format             | CSV                           | Delta (Parquet + transaction log)            |
| ACID Transactions  | ❌ No                          | ✅ Yes                                        |
| Schema Evolution   | ❌ Manual                     | ✅ Supported (opt-in, e.g., `mergeSchema`)    |
| Time Travel        | ❌ No                          | ✅ Yes                                        |
| Performance        | ⚠️ Slower                     | ✅ Optimized for big data                     |
| Recommended for    | Prototyping, small datasets    | Production, scalable pipelines               |

> ✅ Always prefer **Delta Tables** for production workloads.
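
Time travel is the easiest of these differences to try out. The sketch below builds the SQL for reading an earlier version of a Delta table and for listing its history. The helper names and the table name in the usage comments are hypothetical.

```python
def time_travel_query(table_name, version):
    """Build a query that reads a Delta table as of a given version number."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"SELECT * FROM {table_name} VERSION AS OF {int(version)}"


def history_query(table_name):
    """Build a query that lists the table's commit history."""
    return f"DESCRIBE HISTORY {table_name}"


# Usage, assuming a CSV file has first been loaded into a Delta table:
#
#   (spark.read.option("header", "true").csv("/path/to/file.csv")
#        .write.format("delta").saveAsTable("main.sales.transactions"))
#   spark.sql(history_query("main.sales.transactions")).show()
#   spark.sql(time_travel_query("main.sales.transactions", 0)).show()
```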

---

## 🧱 Is Everything Inside "My Organization" a Database?

Not quite. Here's how Databricks organizes data:

### 🔹 Catalog
- Top-level container (e.g., `main`, `hive_metastore`, `my_catalog`)
- Unity Catalog supports multiple catalogs

### 🔹 Schema (aka Database)
- Logical grouping of tables, views, volumes, models
- Examples: `default`, `sales`, `ml_models`

### 🔹 Table / Volume / Model
- Actual data or AI asset

### 🔹 Special Schemas
- `default`: Default schema inside a catalog
- `information_schema`: System schema with metadata tables (e.g., `tables`, `columns`, `views`)
  - ✅ Use for: Auditing, introspection, governance

> 📌 So yes, `default` and `information_schema` are schemas (not catalogs). `default` lives inside a catalog like `main` or `hive_metastore`, while `information_schema` is provided in Unity Catalog catalogs such as `main`.
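
For introspection, `information_schema` can be queried like any other schema. The sketch below builds a query that lists the tables in one schema. It assumes a Unity Catalog catalog; `list_tables_query` is a hypothetical helper, and the schema name is interpolated directly, so it should come from trusted input only.

```python
def list_tables_query(catalog, schema):
    """Build a query that lists tables in one schema via information_schema."""
    return (
        "SELECT table_name, table_type "
        f"FROM {catalog}.information_schema.tables "
        f"WHERE table_schema = '{schema}' "
        "ORDER BY table_name"
    )


# Usage in a Unity Catalog workspace:
#
#   spark.sql(list_tables_query("main", "sales")).show()
```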

---

## 🧵 TL;DR Summary

- **Free Edition** gives you `hive_metastore` with basic tables.
- **Unity Catalog** adds volumes, models, Delta Sharing, and governance.
- **Tables** = structured data, **Volumes** = files, **Models** = ML assets.
- **Delta Tables** > CSV Tables for production.
- **My Organization** = internal assets; **Delta Shares** = external sharing.
- `default` and `information_schema` are schemas inside a catalog.

---

