# Azure Medallion Data Pipeline Setup

This notebook sets up the Azure infrastructure for a medallion-style data pipeline. It runs Azure CLI commands from Python to create the following resources:

- **Resource Group**
- **Storage Account** with containers for **Bronze**, **Silver**, and **Gold** layers
- **Azure Data Factory**
- **Azure Databricks Workspace**
- *(Optional)* **Azure Key Vault**

**Prerequisites:**

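- Azure CLI installed and configured, including the `datafactory` and `databricks` extensions (the CLI may offer to install them the first time they are used)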
- Azure account with sufficient permissions
- Jupyter extension installed in VS Code

**Note:** This notebook assumes you're running it in a Unix-like environment (e.g., Linux, macOS, or Windows Subsystem for Linux).

In [None]:
import os
import subprocess
import json

# Function to check if Azure CLI is installed
def check_azure_cli():
    try:
        subprocess.run(['az', '--version'], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print("Azure CLI is installed.")
    except (FileNotFoundError, subprocess.CalledProcessError):
        # FileNotFoundError: 'az' is not on PATH.
        # CalledProcessError: 'az' was found but exited with an error.
        print("Azure CLI is not installed or not working. Please install it before running this notebook.")
        raise

# Check if Azure CLI is installed
check_azure_cli()

## User Inputs

Please provide the required information in the following cell.

In [None]:
# Replace the placeholder values with your actual Azure resource details.

# Resource Group Name
RESOURCE_GROUP = 'MedallionDataPipelineRG'  # e.g., 'MyResourceGroup'

# Azure Region
LOCATION = 'eastus'  # e.g., 'eastus', 'westus2'

# Storage Account Name (must be globally unique, 3-24 lowercase letters and numbers)
STORAGE_ACCOUNT_NAME = 'medallionstorage123'  # e.g., 'mystorageaccount'

# Data Factory Name (must be globally unique)
DATA_FACTORY_NAME = 'MedallionADF'  # e.g., 'MyDataFactory'

# Databricks Workspace Name
DATABRICKS_WORKSPACE = 'MedallionDatabricksWS'  # e.g., 'MyDatabricksWorkspace'

# Create Azure Key Vault?
CREATE_KEY_VAULT = True  # Set to False if you don't want to create a Key Vault

# Key Vault Name (if CREATE_KEY_VAULT is True; must be globally unique, 3-24 letters, numbers, and hyphens)
KEY_VAULT_NAME = 'MedallionKeyVault123'  # e.g., 'MyKeyVault'

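# Ensure storage account name is lowercase, alphanumeric, and between 3-24 characters
STORAGE_ACCOUNT_NAME = STORAGE_ACCOUNT_NAME.lower()
if not 3 <= len(STORAGE_ACCOUNT_NAME) <= 24:
    raise ValueError("Storage account name must be between 3 and 24 characters.")
if not (STORAGE_ACCOUNT_NAME.isascii() and STORAGE_ACCOUNT_NAME.isalnum()):
    raise ValueError("Storage account name may contain only lowercase letters and numbers.")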

print("You have provided the following information:")
print("--------------------------------------------")
print(f"Resource Group Name       : {RESOURCE_GROUP}")
print(f"Location                  : {LOCATION}")
print(f"Storage Account Name      : {STORAGE_ACCOUNT_NAME}")
print(f"Data Factory Name         : {DATA_FACTORY_NAME}")
print(f"Databricks Workspace Name : {DATABRICKS_WORKSPACE}")
if CREATE_KEY_VAULT:
    print(f"Key Vault Name            : {KEY_VAULT_NAME}")
print("--------------------------------------------")

In [None]:
# Confirm the inputs (accepts 'yes' or 'y', ignoring case and surrounding whitespace)
confirm = input('Is this information correct? (yes/no): ')
if confirm.strip().lower() not in ('yes', 'y'):
    raise RuntimeError('Setup aborted by the user.')

In [None]:
# Check Azure login status
print("Checking Azure login status...")
try:
    subprocess.run(['az', 'account', 'show'], check=True, stdout=subprocess.PIPE)
    print("Already logged in to Azure.")
except subprocess.CalledProcessError:
    print("You are not logged in to Azure CLI. Please log in.")
    subprocess.run(['az', 'login'], check=True)

## Create Resource Group

In [None]:
print("Creating Resource Group...")
subprocess.run([
    'az', 'group', 'create',
    '--name', RESOURCE_GROUP,
    '--location', LOCATION
], check=True)
print("Resource Group created.")

## Create Storage Account

In [None]:
print("Creating Storage Account...")
subprocess.run([
    'az', 'storage', 'account', 'create',
    '--name', STORAGE_ACCOUNT_NAME,
    '--resource-group', RESOURCE_GROUP,
    '--location', LOCATION,
    '--sku', 'Standard_LRS',
    '--kind', 'StorageV2',
    '--enable-hierarchical-namespace', 'true'  # enables Data Lake Storage Gen2
], check=True)
print("Storage Account created.")

## Create Blob Containers for Bronze, Silver, and Gold Layers

In [None]:
print("Creating Blob Containers...")

# Get storage account key
result = subprocess.run([
    'az', 'storage', 'account', 'keys', 'list',
    '--resource-group', RESOURCE_GROUP,
    '--account-name', STORAGE_ACCOUNT_NAME,
    '--query', '[0].value',
    '-o', 'tsv'
], check=True, stdout=subprocess.PIPE)
ACCOUNT_KEY = result.stdout.decode('utf-8').strip()

for container in ['bronze', 'silver', 'gold']:
    print(f"Creating container '{container}'...")
    subprocess.run([
        'az', 'storage', 'container', 'create',
        '--name', container,
        '--account-name', STORAGE_ACCOUNT_NAME,
        '--account-key', ACCOUNT_KEY
    ], check=True)
print("Blob Containers created.")

## Create Azure Data Factory

In [None]:
print("Creating Azure Data Factory...")
subprocess.run([
    'az', 'datafactory', 'create',
    '--resource-group', RESOURCE_GROUP,
    '--factory-name', DATA_FACTORY_NAME,
    '--location', LOCATION
], check=True)
print("Azure Data Factory created.")

## Create Azure Databricks Workspace

In [None]:
print("Creating Azure Databricks Workspace...")
subprocess.run([
    'az', 'databricks', 'workspace', 'create',
    '--resource-group', RESOURCE_GROUP,
    '--name', DATABRICKS_WORKSPACE,
    '--location', LOCATION,
    '--sku', 'standard'
], check=True)
print("Azure Databricks Workspace created.")

## (Optional) Create Azure Key Vault

In [None]:
if CREATE_KEY_VAULT:
    print("Creating Azure Key Vault...")
    subprocess.run([
        'az', 'keyvault', 'create',
        '--name', KEY_VAULT_NAME,
        '--resource-group', RESOURCE_GROUP,
        '--location', LOCATION
    ], check=True)
    print("Azure Key Vault created.")

## Next Steps

The basic Azure infrastructure for your medallion-style data pipeline has been set up. Here are the next steps to complete your pipeline configuration:

### 1. Configure Databricks Workspace

- Obtain the fully qualified domain name (FQDN) of your Databricks workspace from the Azure Portal.
- Generate a Personal Access Token (PAT) in the Databricks UI under **User Settings** > **Access Tokens**.
- Store the PAT securely, preferably in Azure Key Vault.
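
The first and last steps above can also be scripted instead of using the Portal. The sketch below only builds the Azure CLI argument lists and leaves running them to you. It assumes the workspace resource exposes its URL as the `workspaceUrl` property, and the secret name `databricks-pat` and all function names are hypothetical, chosen for illustration.

```python
import subprocess

def workspace_url_command(resource_group, workspace_name):
    """Build the az command that prints the workspace URL (its FQDN)."""
    return [
        'az', 'databricks', 'workspace', 'show',
        '--resource-group', resource_group,
        '--name', workspace_name,
        '--query', 'workspaceUrl',  # assumed property name on the workspace resource
        '-o', 'tsv',
    ]

def store_pat_command(key_vault_name, pat, secret_name='databricks-pat'):
    """Build the az command that stores the PAT as a Key Vault secret.

    'databricks-pat' is a hypothetical secret name.
    """
    return [
        'az', 'keyvault', 'secret', 'set',
        '--vault-name', key_vault_name,
        '--name', secret_name,
        '--value', pat,
    ]

def run_and_capture(command):
    """Run a command and return its standard output as a stripped string."""
    result = subprocess.run(command, check=True, stdout=subprocess.PIPE)
    return result.stdout.decode('utf-8').strip()
```

In a later cell this could be used as `run_and_capture(workspace_url_command(RESOURCE_GROUP, DATABRICKS_WORKSPACE))`. Reading the PAT with `getpass.getpass()` keeps it out of the saved notebook.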

### 2. Set Up Linked Services in Azure Data Factory

- Create linked services for your Storage Account and Databricks workspace.
- Authenticate to the Storage Account with its account key, preferably referenced from Key Vault rather than pasted into the linked service.
- Configure the Databricks linked service using the workspace URL and PAT.
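
As a rough illustration of the linked services described above, the following sketch builds their JSON definitions as Python dictionaries. It assumes the Data Lake Storage Gen2 connector type `AzureBlobFS` and an existing Databricks cluster. The linked service name `KeyVaultLS`, the secret names, and the helper functions are all hypothetical. The resulting dictionaries could be serialized and supplied as the linked service properties when creating them through the CLI or the Data Factory UI.

```python
import json

def key_vault_linked_service(key_vault_name):
    """Properties for a linked service that points at the Key Vault."""
    return {
        'type': 'AzureKeyVault',
        'typeProperties': {'baseUrl': f'https://{key_vault_name}.vault.azure.net/'},
    }

def key_vault_secret(key_vault_ls_name, secret_name):
    """Reference to a Key Vault secret, resolved by Data Factory at run time."""
    return {
        'type': 'AzureKeyVaultSecret',
        'store': {'referenceName': key_vault_ls_name, 'type': 'LinkedServiceReference'},
        'secretName': secret_name,
    }

def storage_linked_service(storage_account, account_key_ref):
    """Properties for a Data Lake Storage Gen2 linked service using an account key."""
    return {
        'type': 'AzureBlobFS',
        'typeProperties': {
            'url': f'https://{storage_account}.dfs.core.windows.net',
            'accountKey': account_key_ref,
        },
    }

def databricks_linked_service(workspace_url, access_token_ref, cluster_id):
    """Properties for a Databricks linked service that reuses an existing cluster."""
    return {
        'type': 'AzureDatabricks',
        'typeProperties': {
            'domain': f'https://{workspace_url}',
            'accessToken': access_token_ref,
            'existingClusterId': cluster_id,
        },
    }

# Hypothetical names: 'KeyVaultLS' and 'storage-account-key'.
storage_ls = storage_linked_service(
    'medallionstorage123',
    key_vault_secret('KeyVaultLS', 'storage-account-key'),
)
print(json.dumps(storage_ls, indent=2))
```

Keeping the secret reference in its own helper means the account key and the PAT never appear in the definitions themselves.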

### 3. Create Datasets in Azure Data Factory

- Define datasets for the Bronze, Silver, and Gold layers.
- Specify the container names and paths.
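
One way to picture the dataset definitions is the sketch below, which generates one per layer. It assumes the data is stored as Parquet and that the storage linked service is named `MedallionStorageLS`. Both are assumptions for illustration, as are the `sales` folder and the helper name `layer_dataset`.

```python
import json

LAYERS = ('bronze', 'silver', 'gold')

def layer_dataset(layer, linked_service_name, folder_path):
    """Build dataset properties for one medallion layer.

    The container (file system) name is the layer name, matching the
    containers created earlier in this notebook.
    """
    if layer not in LAYERS:
        raise ValueError(f"Unknown layer '{layer}', expected one of {LAYERS}")
    return {
        'type': 'Parquet',  # assumed file format
        'linkedServiceName': {
            'referenceName': linked_service_name,
            'type': 'LinkedServiceReference',
        },
        'typeProperties': {
            'location': {
                'type': 'AzureBlobFSLocation',
                'fileSystem': layer,
                'folderPath': folder_path,
            },
        },
    }

# 'MedallionStorageLS' and 'sales' are hypothetical names.
datasets = {layer: layer_dataset(layer, 'MedallionStorageLS', 'sales') for layer in LAYERS}
print(json.dumps(datasets['bronze'], indent=2))
```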

### 4. Develop Databricks Notebooks

- Create notebooks for data transformations:
  - **BronzeToSilver**: Read from Bronze, transform, write to Silver.
  - **SilverToGold**: Read from Silver, transform, write to Gold.
- Use Spark to read and write data, applying necessary transformations.
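
A minimal sketch of what the **BronzeToSilver** notebook might contain is shown below. It assumes the data is Parquet, that the cluster is already configured with access to the storage account, and that `spark` is the session Databricks predefines in every notebook. The cleaning steps (dropping duplicates and fully empty rows) are placeholders for your own transformations, and the function names are hypothetical.

```python
def abfss_path(container, storage_account, relative_path=''):
    """Build an abfss:// URI for a path inside a Data Lake Storage Gen2 container."""
    base = f'abfss://{container}@{storage_account}.dfs.core.windows.net'
    relative_path = relative_path.strip('/')
    return f'{base}/{relative_path}' if relative_path else base

def bronze_to_silver(spark, storage_account, table):
    """Read a table from Bronze, apply placeholder cleaning, write it to Silver.

    Intended to run inside a Databricks notebook, where `spark` already exists.
    """
    raw = spark.read.parquet(abfss_path('bronze', storage_account, table))
    # Placeholder transformations: remove exact duplicates and all-null rows.
    cleaned = raw.dropDuplicates().na.drop(how='all')
    cleaned.write.mode('overwrite').parquet(abfss_path('silver', storage_account, table))
    return cleaned

# Path construction can be checked anywhere, without a Spark session.
print(abfss_path('bronze', 'medallionstorage123', 'sales'))
```

A **SilverToGold** notebook would follow the same shape, reading from `silver` and writing aggregated results to `gold`.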

### 5. Create Pipelines in Azure Data Factory

- Define activities to orchestrate data movement and transformation.
- Configure parameters and variables as needed.
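
To make the orchestration concrete, here is a sketch of a pipeline definition with two Databricks Notebook activities, where the second runs only after the first succeeds. The notebook paths under `/Shared`, the linked service name `DatabricksLS`, and the `table` parameter are hypothetical. Adapt them to your workspace.

```python
import json

def notebook_activity(name, notebook_path, linked_service, depends_on=()):
    """Build a Databricks Notebook activity that forwards the pipeline's 'table' parameter."""
    return {
        'name': name,
        'type': 'DatabricksNotebook',
        'linkedServiceName': {'referenceName': linked_service, 'type': 'LinkedServiceReference'},
        'typeProperties': {
            'notebookPath': notebook_path,
            'baseParameters': {
                'table': {'value': '@pipeline().parameters.table', 'type': 'Expression'},
            },
        },
        # Each upstream activity must succeed before this one starts.
        'dependsOn': [
            {'activity': upstream, 'dependencyConditions': ['Succeeded']}
            for upstream in depends_on
        ],
    }

def medallion_pipeline(linked_service):
    """Build pipeline properties: BronzeToSilver followed by SilverToGold."""
    return {
        'parameters': {'table': {'type': 'String'}},
        'activities': [
            notebook_activity('BronzeToSilver', '/Shared/BronzeToSilver', linked_service),
            notebook_activity('SilverToGold', '/Shared/SilverToGold', linked_service,
                              depends_on=['BronzeToSilver']),
        ],
    }

pipeline = medallion_pipeline('DatabricksLS')  # hypothetical linked service name
print(json.dumps(pipeline, indent=2))
```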

### 6. Set Up Scheduling and Triggers

- Determine the frequency and timing of pipeline runs.
- Create triggers in Azure Data Factory.
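
As an example of one possible schedule, the sketch below defines a trigger that runs a pipeline once a day at a fixed hour. The pipeline name `MedallionPipeline`, the 02:00 UTC run time, and the helper name `daily_trigger` are assumptions for illustration.

```python
import json
from datetime import datetime

def daily_trigger(pipeline_name, start, hour=2, time_zone='UTC'):
    """Build schedule trigger properties that run a pipeline once a day."""
    return {
        'type': 'ScheduleTrigger',
        'typeProperties': {
            'recurrence': {
                'frequency': 'Day',
                'interval': 1,
                'startTime': start.strftime('%Y-%m-%dT%H:%M:%SZ'),
                'timeZone': time_zone,
                'schedule': {'hours': [hour], 'minutes': [0]},
            },
        },
        'pipelines': [
            {'pipelineReference': {'referenceName': pipeline_name, 'type': 'PipelineReference'}},
        ],
    }

# Hypothetical pipeline name and start date.
trigger = daily_trigger('MedallionPipeline', datetime(2025, 1, 1))
print(json.dumps(trigger, indent=2))
```

A newly created trigger is stopped by default, so it has to be started before it fires.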

### 7. Security and Compliance

- Use Azure Key Vault to store secrets.
- Implement role-based access control (RBAC).
- Ensure compliance with any regulatory requirements.
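
For the RBAC step, a role assignment can be scoped to the storage account alone instead of the whole subscription. The sketch below builds the corresponding `az role assignment create` command without running it. The choice of the built-in `Storage Blob Data Contributor` role is an assumption about what the pipeline's identity needs, and the function names and placeholder IDs are hypothetical.

```python
def storage_scope(subscription_id, resource_group, storage_account):
    """Build the resource ID of a storage account, used as the assignment scope."""
    return (
        f'/subscriptions/{subscription_id}'
        f'/resourceGroups/{resource_group}'
        f'/providers/Microsoft.Storage/storageAccounts/{storage_account}'
    )

def role_assignment_command(principal_id, scope, role='Storage Blob Data Contributor'):
    """Build the az command that grants a role to a principal at the given scope."""
    return [
        'az', 'role', 'assignment', 'create',
        '--assignee', principal_id,
        '--role', role,
        '--scope', scope,
    ]

# Placeholder IDs: substitute your subscription ID and the principal's object ID.
scope = storage_scope('<subscription-id>', 'MedallionDataPipelineRG', 'medallionstorage123')
command = role_assignment_command('<principal-object-id>', scope)
print(' '.join(command))
```

The principal could be, for example, the managed identity of the Data Factory, which would let it reach the storage account without an account key.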

## Clean Up Resources

To avoid incurring unnecessary charges, delete the resource group when it's no longer needed. This removes every resource created in this notebook.

In [None]:
# Uncomment the following lines to delete the resource group
# print("Deleting Resource Group...")
# subprocess.run([
#     'az', 'group', 'delete',
#     '--name', RESOURCE_GROUP,
#     '--yes', '--no-wait'
# ], check=True)
# print("Resource Group deletion started.")  # --no-wait returns before the deletion finishes

## Conclusion

The Azure infrastructure for your medallion-style data pipeline is now in place. Continue by configuring the remaining components as outlined in the Next Steps section above.