# 🚀 99_Setup_Workflow Notebook

%md
### 🛠 99_Setup_Workflow.ipynb
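This notebook creates or replaces a Databricks Workflow for the AeroDemo project.
It chains the 02_ series notebooks in sequence, followed by the DLT pipeline and a final summary notebook. The intended compute is an ML-enabled, autoscaling job cluster (1–6 workers); in `DEV` mode the script runs every task on an existing cluster instead.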

✅ Uses Databricks Runtime ML (`14.3.x-cpu-ml-scala2.12` at the time of writing)  
✅ Configures autoscale (1–6 workers)  
✅ Auto-terminates after 60 minutes idle

---

### 📋 Current notebooks included:
<strike>✅ `01_Table_Creation.ipynb`</strike>  
✅ `02_01_Synthetic_Data_Generation_v2.ipynb`  
✅ `02_02_Engine_Data_Generation.ipynb`  
✅ `02_03_CabinPressurization_Data_Generation.ipynb`  
✅ `02_04_Airframe_Synthetic_Data_Generation.ipynb`  
✅ `02_05_LandingGear_Data_Generation.ipynb`  
✅ `02_06_Avionics_Data_Generation.ipynb`  
✅ `02_07_ElectricalSystems_Data_Generation.ipynb`  
✅ `02_08_FuelSystems_Data_Generation.ipynb`  
✅ `02_09_HydraulicSystems_Data_Generation.ipynb`  
✅ `02_10_EnvironmentalSystems_Data_Generation.ipynb`

---

### 🔗 Future additions:
We’ll expand this workflow to include:
- `03_` series (DLT pipelines)
- `04_` series (ML models + scoring)
- `05_` series (dashboarding + alerts)

### ⚙️ Important Setup Notes

✅ **Cluster setup**
- Either set `EXISTING_CLUSTER_ID` in the script to your existing Databricks cluster ID, **or**
- Replace `existing_cluster_id` with a `new_cluster` configuration if you want the workflow to create its own cluster
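
As a rough sketch of the two options above, the helper below returns the compute portion of one task as a plain dictionary in the shape used by Jobs API payloads: either a reference to an existing cluster, or a `new_cluster` block with 1–6 autoscaling workers. The function name `compute_settings`, the placeholder cluster ID, and the `node_type_id` value are assumptions for illustration; pick a node type that exists in your workspace.

```python
def compute_settings(existing_cluster_id=None,
                     spark_version="14.3.x-cpu-ml-scala2.12",
                     node_type_id="i3.xlarge"):  # assumed node type, adjust for your cloud
    """Return the compute fields for one task (hypothetical helper)."""
    if existing_cluster_id:
        # Reuse a cluster that already exists in the workspace.
        return {"existing_cluster_id": existing_cluster_id}
    # Otherwise let the workflow create its own autoscaling job cluster.
    return {
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "autoscale": {"min_workers": 1, "max_workers": 6},
        }
    }


print(compute_settings("1234-567890-abcde123"))  # placeholder cluster ID
print(compute_settings())
```

The returned dictionary can be merged into a task definition, so switching between the two modes only changes one argument.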

✅ **Repo + notebook paths**
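- Make sure all notebook paths align with:
  `/Repos/honnuanand/databricks-aerodemo/<NOTEBOOK_NAME>`
- Leave off the `.ipynb` extension in job task paths; a later cell in this notebook strips that suffix from an existing job for this reason.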

✅ **Databricks SDK**
- This script uses the `databricks-sdk` (Python client).
- Run it from:
  - A Databricks notebook, **or**
  - A local Python environment with `databricks-sdk` installed and configured
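
When running locally, one common way to configure the SDK is through the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, which `WorkspaceClient()` can pick up without arguments. The sketch below only checks that those variables are set; the helper name `missing_sdk_env` is hypothetical. Other authentication methods (for example a profile in `~/.databrickscfg`) would not be detected by this check, and inside a Databricks notebook it is usually unnecessary.

```python
import os


def missing_sdk_env(env=None):
    """Return which of DATABRICKS_HOST / DATABRICKS_TOKEN are unset (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    return [name for name in required if not env.get(name)]


missing = missing_sdk_env()
if missing:
    print(f"Set these before creating WorkspaceClient(): {', '.join(missing)}")
else:
    print("Environment looks ready for token-based authentication.")
```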


---

Once you run this, you’ll have a fresh **AeroDemo_DataPipeline** workflow  
that orchestrates all current synthetic data generation steps!

### 🛠 Workflow Setup Helper Notes

This script sets up a Databricks Workflow that runs the AeroDemo synthetic data pipeline notebooks.

---

✅ **Configurable Parameters**
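- `NOTEBOOK_BASE_PATH` → The full workspace path where your notebooks live. The script derives it from this notebook's own location (the sibling `02_SyntheticData` folder); override it if your notebooks live elsewhere.  
  Example: `/Workspace/Users/anand.rao@databricks.com/databricks-aerodemo`

- `EXISTING_CLUSTER_ID` → The ID of an **existing cluster** you want the workflow to run on. To have the workflow create its own cluster instead, replace `existing_cluster_id` in the task definitions with a `new_cluster` configuration block.

- `MODE` → When set to `"DEV"`, notebook tasks run on `EXISTING_CLUSTER_ID`.

- `DLT_PIPELINE_ID` → The ID of the DLT pipeline that runs after the last data generation notebook.

- `WORKFLOW_NAME` → Choose a descriptive name for your Databricks Job (Workflow).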

---

✅ **Notebook Path Pattern**
- Each task points to:  
  `{NOTEBOOK_BASE_PATH}/{NOTEBOOK_NAME}` (without the `.ipynb` extension)

Make sure your notebook filenames in the workspace exactly match those listed in the `notebooks` array.
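
A minimal sketch of how a task path could be assembled from the base path and a notebook name, assuming job task paths should carry no `.ipynb` suffix and no `..` segments. The helper name `notebook_task_path` and the example user folder are hypothetical.

```python
import posixpath
import re


def notebook_task_path(base_path, notebook_name):
    """Join base path and notebook name, dropping '.ipynb' and resolving '..'."""
    name = re.sub(r"\.ipynb$", "", notebook_name)
    return posixpath.normpath(posixpath.join(base_path, name))


print(notebook_task_path(
    "/Workspace/Users/someone@example.com/databricks-aerodemo",
    "02_02_Engine_Data_Generation.ipynb",
))
```

Using `posixpath` rather than `os.path` keeps the forward-slash workspace convention even if the script runs on Windows.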

---

✅ **Expansion**
- You can later add:
  - `03_` series (DLT ingestion pipelines)
  - `04_` series (ML model training + scoring)
  - `05_` series (visualization + alert notebooks)

Just extend the `notebooks` list in the code and rerun the script to update the workflow.
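
To illustrate why extending the list is enough, here is a sketch of the chaining logic using plain dictionaries instead of SDK objects: each task depends on the one before it, so a new entry appended to the list automatically runs last. The function name `build_chain`, the base path, and the `04_01_Model_Training` notebook name are placeholders, not part of the project.

```python
def build_chain(notebooks, base_path):
    """Build task dicts in which every task depends on the previous one."""
    tasks = []
    for idx, name in enumerate(notebooks):
        task = {
            "task_key": name,
            "notebook_task": {"notebook_path": f"{base_path}/{name}"},
        }
        if idx > 0:
            # Link to the preceding notebook to enforce sequential execution.
            task["depends_on"] = [{"task_key": notebooks[idx - 1]}]
        tasks.append(task)
    return tasks


notebooks = [
    "02_09_HydraulicSystems_Data_Generation",
    "02_10_EnvironmentalSystems_Data_Generation",
    "04_01_Model_Training",  # hypothetical future notebook
]
for task in build_chain(notebooks, "/Workspace/path/to/notebooks"):
    print(task["task_key"], "<-", task.get("depends_on", []))
```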

---

✅ **Execution**
- Run this script inside a Databricks notebook or from a local Python environment with the `databricks-sdk` properly configured.

In [0]:
import os

# Get current notebook path and folder
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
folder_path = os.path.dirname(notebook_path)
workspace_path = f"/Workspace{folder_path}"

print(f"✅ Notebook path: {notebook_path}")
print(f"✅ Notebook folder (user-level): {folder_path}")
print(f"✅ Notebook folder (workspace-level): {workspace_path}")

In [0]:
import os
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, PipelineTask, TaskDependency

# Get current notebook context
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
folder_path = os.path.dirname(notebook_path)
# Normalise the path so the job stores a clean absolute path without "..".
NOTEBOOK_BASE_PATH = os.path.normpath(f"/Workspace{folder_path}/../02_SyntheticData")

print(f"✅ Using notebook base path: {NOTEBOOK_BASE_PATH}")

# ---------- CONFIG ----------
WORKFLOW_NAME = "AeroDemo_DataPipeline"
MODE = "DEV"
EXISTING_CLUSTER_ID = "0527-220936-f3oreeiv"
DLT_PIPELINE_ID = "a2ccd850-4b28-4f30-9a53-0fd5f5499713"

# Notebooks list
notebooks = [
    "02_01_Synthetic_Data_Generation_v2",
    "02_02_Engine_Data_Generation",
    "02_03_CabinPressurization_Data_Generation",
    "02_04_Airframe_Synthetic_Data_Generation",
    "02_05_LandingGear_Data_Generation",
    "02_06_Avionics_Data_Generation",
    "02_07_ElectricalSystems_Data_Generation",
    "02_08_FuelSystems_Data_Generation",
    "02_09_HydraulicSystems_Data_Generation",
    "02_10_EnvironmentalSystems_Data_Generation"
]

w = WorkspaceClient()
tasks = []

# ✅ Add notebook tasks
for idx, notebook in enumerate(notebooks):
    notebook_path_full = f"{NOTEBOOK_BASE_PATH}/{notebook}"
    print(f"📍 Adding notebook task: {notebook_path_full}")
    task_args = {
        "task_key": notebook,
        "notebook_task": NotebookTask(notebook_path=notebook_path_full),
        "existing_cluster_id": EXISTING_CLUSTER_ID if MODE == "DEV" else None
    }
    if idx > 0:
        task_args["depends_on"] = [TaskDependency(task_key=notebooks[idx - 1])]
    tasks.append(Task(**task_args))

# ✅ Add DLT pipeline task (depends on last notebook)
dlt_task_key = "Run_DLT_Pipeline"
dlt_task = Task(
    task_key=dlt_task_key,
    pipeline_task=PipelineTask(pipeline_id=DLT_PIPELINE_ID),
    depends_on=[TaskDependency(task_key=notebooks[-1])]
)
print(f"📍 Adding DLT pipeline task: Pipeline ID {DLT_PIPELINE_ID}")
tasks.append(dlt_task)

# ✅ Add final summary notebook task (depends on DLT)
final_task_key = "101_Final_Summary_Task"
final_task_path = f"{NOTEBOOK_BASE_PATH}/{final_task_key}"
final_task = Task(
    task_key=final_task_key,
    notebook_task=NotebookTask(notebook_path=final_task_path),
    existing_cluster_id=EXISTING_CLUSTER_ID if MODE == "DEV" else None,
    depends_on=[TaskDependency(task_key=dlt_task_key)]
)
print(f"📍 Adding final notebook task: {final_task_path}")
tasks.append(final_task)

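# ✅ Create the job, or replace its settings if a job with this name already exists.
# jobs.create() always creates a new job, so look the name up first to avoid duplicates.
from databricks.sdk.service.jobs import JobSettings

existing = next(iter(w.jobs.list(name=WORKFLOW_NAME)), None)
if existing:
    w.jobs.reset(
        job_id=existing.job_id,
        new_settings=JobSettings(name=WORKFLOW_NAME, tasks=tasks)
    )
    job_id = existing.job_id
    action = "updated"
else:
    job_id = w.jobs.create(name=WORKFLOW_NAME, tasks=tasks).job_id
    action = "created"

print(f"✅ Workflow '{WORKFLOW_NAME}' {action} with Job ID: {job_id}")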

In [0]:
import requests
import json
import re

# ---------- CONFIG ----------
DATABRICKS_INSTANCE = "https://e2-demo-field-eng.cloud.databricks.com"
TOKEN = "YOUR_PAT"
JOB_ID = "173822373344591"
# ----------------------------

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

# ✅ Step 1: Get the current job definition
get_url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/get?job_id={JOB_ID}"
response = requests.get(get_url, headers=headers)

if response.status_code != 200:
    print(f"❌ Failed to fetch job: {response.text}")
    # exit() is meant for interactive shells; raise SystemExit to stop the cell instead.
    raise SystemExit(1)

job_data = response.json()
print(f"✅ Fetched job '{job_data['settings']['name']}'")

# ✅ Step 2: Strip '.ipynb' suffixes in notebook tasks
for task in job_data['settings']['tasks']:
    if 'notebook_task' in task and 'notebook_path' in task['notebook_task']:
        original_path = task['notebook_task']['notebook_path']
        cleaned_path = re.sub(r"\.ipynb$", "", original_path)
        if original_path != cleaned_path:
            print(f"🔧 Fixing: {original_path} → {cleaned_path}")
            task['notebook_task']['notebook_path'] = cleaned_path

# ✅ Step 3: Re-submit (reset) the job definition
reset_url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/reset"
payload = {
    "job_id": JOB_ID,
    "new_settings": job_data['settings']
}

reset_response = requests.post(reset_url, headers=headers, data=json.dumps(payload))

if reset_response.status_code != 200:
    print(f"❌ Failed to reset job: {reset_response.text}")
else:
    print(f"✅ Job '{job_data['settings']['name']}' successfully patched!")

In [0]:
import requests
import json

# ---------- CONFIG ----------
DATABRICKS_INSTANCE = "https://e2-demo-field-eng.cloud.databricks.com"
TOKEN = "YOUR_PAT"
JOB_ID = "864722071013094"
DLT_PIPELINE_ID = "a2ccd850-4b28-4f30-9a53-0fd5f5499713"
# ----------------------------

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

# ✅ Step 1: Get current job definition
get_url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/get?job_id={JOB_ID}"
response = requests.get(get_url, headers=headers)
if response.status_code != 200:
    print(f"❌ Failed to fetch job: {response.text}")
    raise SystemExit

job_data = response.json()
print(f"✅ Fetched job '{job_data['settings']['name']}'")

# ✅ Step 2: Check if DLT task already exists
task_keys = [t['task_key'] for t in job_data['settings']['tasks']]
if "Run_DLT_Pipeline" in task_keys:
    print("⚠️ DLT task already exists in the workflow. Skipping add.")
else:
    # ✅ Step 3: Add DLT task at the end
    job_data['settings']['tasks'].append({
        "task_key": "Run_DLT_Pipeline",
        "depends_on": [{"task_key": "02_10_EnvironmentalSystems_Data_Generation"}],
        "pipeline_task": {
            "pipeline_id": DLT_PIPELINE_ID
        }
    })
    print("✅ DLT task added to workflow payload.")

    # ✅ Step 4: Patch the updated workflow
    reset_url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/reset"
    payload = {
        "job_id": JOB_ID,
        "new_settings": job_data['settings']
    }

    reset_response = requests.post(reset_url, headers=headers, data=json.dumps(payload))
    if reset_response.status_code != 200:
        print(f"❌ Failed to patch job: {reset_response.text}")
    else:
        print(f"✅ Job '{job_data['settings']['name']}' successfully updated with DLT task!")

In [0]:
import requests

# ---------- CONFIG ----------
DATABRICKS_INSTANCE = "https://e2-demo-field-eng.cloud.databricks.com"
TOKEN = "YOUR_PAT"
JOB_ID = "864722071013094"
# ----------------------------

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

# ✅ Fetch the updated job definition
get_url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/get?job_id={JOB_ID}"
response = requests.get(get_url, headers=headers)
if response.status_code != 200:
    print(f"❌ Failed to fetch job: {response.text}")
    raise SystemExit

job_data = response.json()
print(f"✅ Job '{job_data['settings']['name']}' has the following tasks:")
for task in job_data['settings']['tasks']:
    if 'notebook_task' in task:
        print(f" - Notebook task: {task['task_key']} → {task['notebook_task']['notebook_path']}")
    if 'pipeline_task' in task:
        print(f" - DLT pipeline task: {task['task_key']} → Pipeline ID: {task['pipeline_task']['pipeline_id']}")