# How Databricks Works Internally with Azure

**Objective:** Understand what actually happens in the background when you create a Databricks Workspace and spin up a Cluster. We will explore the "Managed Resource Group" in Azure to see the underlying Virtual Machines (VMs), Storage Accounts, and Networking components.

---

## 1. The "Managed Resource Group"

When we created our Azure Databricks workspace, two resource groups came into play:
1.  **Resource Group:** The one we chose, where the Azure Databricks Service object lives.
2.  **Managed Resource Group:** A locked resource group that Azure creates for the workspace (you can optionally name it at creation time), where Databricks deploys the actual compute and storage resources.

### What's inside the Managed Resource Group?
If you navigate to the **Managed Resource Group** (e.g., `self-adb-managed-rg`) in the Azure Portal, you will find:

*   **Storage Account (DBFS Root):** Stores default data, logs, and libraries.
*   **Network Security Group (NSG):** Manages inbound/outbound traffic rules.
*   **Virtual Network (VNet):** (If not using VNet injection) Default network for clusters.
*   **Disks:** Managed disks attached to the cluster VMs (present while those VMs exist).
*   **Virtual Machines:** (Only when a cluster is running) The actual compute nodes.
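
To make the distinction between always-present and cluster-only resources concrete, here is a minimal sketch that groups a resource listing by lifecycle. The resource type strings are standard Azure Resource Manager types, but the `LIFECYCLE_BY_TYPE` mapping, the `group_by_lifecycle` helper, and every resource name in the sample are assumptions made for illustration; the labels simply follow the description in the list above.

```python
from collections import defaultdict

# Lifecycle labels follow the description in this section. They are an
# illustrative assumption, not an official Azure or Databricks classification.
LIFECYCLE_BY_TYPE = {
    "Microsoft.Storage/storageAccounts": "persistent",
    "Microsoft.Network/networkSecurityGroups": "persistent",
    "Microsoft.Network/virtualNetworks": "persistent",
    "Microsoft.Compute/disks": "cluster-scoped",
    "Microsoft.Compute/virtualMachines": "cluster-scoped",
}


def group_by_lifecycle(resources):
    """Group resource names by lifecycle label, keyed on the ARM resource type."""
    groups = defaultdict(list)
    for res in resources:
        lifecycle = LIFECYCLE_BY_TYPE.get(res["type"], "unknown")
        groups[lifecycle].append(res["name"])
    return dict(groups)


# Hypothetical listing: all names below are invented for this example.
sample_resources = [
    {"name": "dbstorage-example", "type": "Microsoft.Storage/storageAccounts"},
    {"name": "example-nsg", "type": "Microsoft.Network/networkSecurityGroups"},
    {"name": "example-vnet", "type": "Microsoft.Network/virtualNetworks"},
    {"name": "example-worker-vm", "type": "Microsoft.Compute/virtualMachines"},
    {"name": "example-worker-osdisk", "type": "Microsoft.Compute/disks"},
]

grouped = group_by_lifecycle(sample_resources)
for lifecycle, names in grouped.items():
    print(f"{lifecycle}: {', '.join(names)}")
```

In practice you could feed this function the name and type fields of a real resource listing for the Managed Resource Group and compare the output before and after starting a cluster.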

---

## 2. Cluster Lifecycle & Azure Resources

Let's simulate the lifecycle of a cluster and observe the changes in Azure.

### A. Cluster Creation (Start)
When you click **Start** on a cluster:
1.  Databricks Control Plane sends a request to Azure Resource Manager.
2.  Azure provisions **Virtual Machines (VMs)** inside the **Managed Resource Group**.
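3.  Azure allocates **Network Interfaces** and, unless the workspace uses secure cluster connectivity (No Public IP), **Public IPs**.
4.  The VMs boot the Databricks Runtime image; the Spark driver runs on one VM and the executors run on the worker VMs.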

### B. Cluster Running
*   You will see Virtual Machine resources in the Managed Resource Group, each with the size chosen for the cluster (e.g., `Standard_DS3_v2`).
*   **Cost:** Azure bills you for these VMs for the whole time they run, whether or not any work is executing on them.

### C. Cluster Termination (Stop)
When you click **Terminate** (or Auto-Termination kicks in):
1.  Databricks sends a de-provisioning request.
2.  Azure **deletes** the Virtual Machines.
3.  Azure **releases** the Public IPs and Network Interfaces.
4.  **Cost:** Billing for the VMs stops once they are deleted.

*Note: The Storage Account and Network Security Group persist even when clusters are off, so data kept in the Storage Account continues to incur storage charges.*

```python
# Simulation: Checking Cluster Status via API (Mock)
# In a real scenario, you can use the Databricks REST API to check status.

def check_cluster_status(cluster_id, status="TERMINATED"):
    """Print what to expect in Azure for a mocked cluster state."""
    # Mock response: pass status="RUNNING" to simulate an active cluster.
    if status == "RUNNING":
        print(f"Cluster {cluster_id} is RUNNING.")
        print(" -> Action: Check Azure Portal for 'Virtual Machine' resources.")
        print(" -> Billing: Active ($$$)")
    elif status == "TERMINATED":
        print(f"Cluster {cluster_id} is TERMINATED.")
        print(" -> Action: Virtual Machines should be deleted from Azure.")
        print(" -> Billing: Paused")
    else:
        # States other than the two above are not covered by this mock.
        print(f"Cluster {cluster_id} is in state {status} (not handled by this mock).")

# Run check
check_cluster_status("0928-152342-cluster123")
```
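
For comparison with the mock above, here is a rough sketch of how the real status lookup could be prepared with the Databricks REST API's `clusters/get` endpoint, using only the standard library. The workspace URL and token are placeholders, and the helper names `build_cluster_get_request` and `extract_state` are made up for this example. The request is built but deliberately not sent, so the snippet runs without a workspace.

```python
import json
import urllib.parse
import urllib.request


def build_cluster_get_request(workspace_url, token, cluster_id):
    """Build (but do not send) a GET request for /api/2.0/clusters/get."""
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/get?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def extract_state(response_body):
    """Read the 'state' field from a clusters/get JSON response body."""
    return json.loads(response_body).get("state", "UNKNOWN")


# Placeholder values: substitute your own workspace URL and access token.
request = build_cluster_get_request(
    "https://adb-0000000000000000.0.azuredatabricks.net/",
    "dapi-PLACEHOLDER",
    "0928-152342-cluster123",
)
print(request.full_url)

# To send it for real: body = urllib.request.urlopen(request).read()
# Here we parse a hand-written sample body instead of a live response.
sample_body = '{"cluster_id": "0928-152342-cluster123", "state": "TERMINATED"}'
print(extract_state(sample_body))
```

The returned state string could then be passed to `check_cluster_status` in place of the hard-coded mock value.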

## 3. Storage Account Deep Dive (DBFS)

Inside the Managed Resource Group, there is a Storage Account (randomly named, e.g., `dbstorage...`). This account backs the **DBFS Root**, the default storage location of the Databricks File System (DBFS).

It contains containers like:
*   `access-logs`: Audit logs.
*   `libraries`: Python/Jar libraries installed on clusters.
*   `root`: The default location for `dbfs:/`.

*Warning: Never delete or modify this Storage Account directly from Azure. Doing so can break your Databricks workspace.*

---

## 4. Key Takeaways regarding Cost

1.  **Azure Billing:** You pay for the VMs, Disks, and Networking in the Managed Resource Group.
2.  **Databricks Billing:** You pay for DBUs (Databricks Units) based on the workload type (Jobs vs. All-Purpose Compute).
3.  **Optimization:** Always set **Auto Termination** (e.g., after 10-20 minutes of inactivity) so that idle clusters are terminated and their VMs deleted.
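
The two billing components can be combined into a back-of-the-envelope estimate. The sketch below is only illustrative: the function name `estimate_cluster_cost` and all prices and DBU rates are hypothetical round numbers, not real Azure or Databricks pricing, and disk and networking charges are left out for simplicity.

```python
def estimate_cluster_cost(nodes, hours, vm_price_per_hour, dbu_per_node_hour, dbu_price):
    """Estimate cost as an Azure VM charge plus a Databricks DBU charge.

    Simplified model: ignores disks, networking, and storage.
    """
    azure_vm = nodes * hours * vm_price_per_hour
    databricks_dbu = nodes * hours * dbu_per_node_hour * dbu_price
    return {
        "azure_vm": azure_vm,
        "databricks_dbu": databricks_dbu,
        "total": azure_vm + databricks_dbu,
    }


# Hypothetical rates chosen for easy arithmetic, not actual prices.
RATES = {"vm_price_per_hour": 0.5, "dbu_per_node_hour": 0.75, "dbu_price": 0.5}

# Two hours of actual work on a 3-node cluster.
work = estimate_cluster_cost(nodes=3, hours=2, **RATES)

# The same cluster left idle for 12 hours because auto termination was off.
idle = estimate_cluster_cost(nodes=3, hours=12, **RATES)

print(f"Useful work: {work['total']:.2f}")
print(f"Idle time:   {idle['total']:.2f}")
```

Even with made-up rates, the comparison shows why auto termination matters: under this model the idle hours cost several times more than the work itself.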

## Next Steps
Now that we understand the infrastructure, the next logical step is to set up **Data Governance**. In the next video, we will dive into **Unity Catalog**, the unified governance solution for data and AI on the Lakehouse.