
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Lab: Building a CI-CD Pipeline with Databricks CLI

In this lab, you will create and execute a CI/CD pipeline for Databricks notebooks. This pipeline automates notebook execution, validation, version control, and email notifications. You'll also handle error scenarios and re-run the pipeline after resolving issues.


**Lab Outline:**

_In this lab, you will complete the following tasks:_
- **Task 1 - Environment Setup: Git-Integrated Databricks Workspace**
  - 1.1. Configure Git Integration in Databricks  
  - 1.2. Clone Repository in Databricks  
  - 1.3. Execute Notebooks and Commit to Git Repository  
  - 1.4. Display Git Folder Structure  
  - 1.5. Pass Variables for Hyperparameter Tuning  

- **Task 2 - Pipeline Validation Workflow with Email Notifications**
  - 2.1. Create Folder Structure for Workflow Configuration  
  - 2.2. Define Workflow Configuration  
  - 2.3. Save Workflow Configuration to File  

- **Task 3 - Pipeline Execution and Version Update**
  - 3.1. Pipeline Execution Workflow Overview  
  - 3.2. Check Pipeline Status  
  - 3.3. Fix Errors and Re-Run the Pipeline  

- **Task 4 - Displaying the Final Git Folder Structure**

📝 **Your task:** Complete the **`<FILL_IN>`** sections in the code blocks and follow the other steps as instructed.

---

**Requirements**:
- Access to a **Databricks workspace** with admin rights.
- **GitHub repository** integrated with the workspace.
- Databricks CLI installed and authenticated.
- A basic understanding of CI/CD pipelines and Databricks workflows.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.
Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:
   - In the drop-down, select **More**.
   - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:
1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.
1. Find the triangle icon to the right of your compute cluster name and click it.
1. Wait a few minutes for the cluster to start.
1. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **17.3.x-cpu-ml-scala2.13**



## Classroom Setup

Before starting the lab, run the provided classroom setup script. This script will define configuration variables necessary for the lab. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-01

**Other Conventions:**

Throughout this lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

### Authentication

In this Lab environment, setting up authentication for both the Databricks CLI and GitHub integration has been simplified. Follow the instructions below to ensure proper setup:

**Databricks CLI Authentication**

The CLI authentication process has been pre-configured for this environment. 

Usually, you would have to set up authentication for the CLI yourself. In this lab environment, that is already taken care of if you ran through the accompanying **'Generate Tokens'** notebook: your credentials will already be loaded into the **`DATABRICKS_HOST`** and **`DATABRICKS_TOKEN`** environment variables.

##### *If you did not, run through it now, then restart this notebook.*

In [0]:
DA.get_credentials()
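
As a quick sanity check after the cell above, a minimal sketch like the following can confirm that both environment variables are populated without printing the secret itself. The helper name `missing_env_vars` is hypothetical and is not part of the `DA` object.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

required = ["DATABRICKS_HOST", "DATABRICKS_TOKEN"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {missing}")
else:
    print("Databricks CLI credentials are present in the environment.")
```

If anything is reported as missing, rerun the **'Generate Tokens'** notebook before continuing.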

**GitHub Authentication for CI/CD Integration**

To enable CI/CD functionality, such as interacting with GitHub repositories, you need to provide your GitHub credentials, including:
- **GitHub Username:** Your GitHub account username.
- **Repository Name:** The name of the repository you want to interact with.
- **GitHub Token:** Your personal access token (PAT) from GitHub.

To set or update these credentials, execute the following command:

In [0]:
DA.get_git_credentials()

## Install and Configure the Databricks CLI
Install the Databricks CLI
- Use the following command to install the Databricks CLI:

In [0]:
%sh rm -f $(which databricks); curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/v0.211.0/install.sh | sh

Verify CLI installation:

In [0]:
%sh databricks --version

### Notebook Path Setup Continued
This code cell performs the following setup tasks:

- Retrieves the current Databricks cluster ID and displays it.
- Identifies the path of the currently running notebook.
- Constructs paths to the related pipeline notebooks (data cleaning, feature engineering and model training, failure handling, and model evaluation) and prints them so you can confirm they are correct.

In [0]:
## Retrieve the current cluster ID
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(f"Cluster ID: {cluster_id}")

## Get the current notebook path
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
print(f"Current Notebook Path: {notebook_path}")

## Define paths to related notebooks
base_path = notebook_path.rsplit('/', 1)[0] + "/1.2 Lab - Pipeline workflow notebooks"
notebook_paths = {
    "data_cleaning": f"{base_path}/01 - Data Cleaning and Transformation",
    "feature_engineering": f"{base_path}/02 - Feature Engineering and Model Training",
    "failure_handling": f"{base_path}/03 - failure_handling.py",
    "model_evaluation": f"{base_path}/04 - Model Evaluation and Testing with Accuracy Check"
}
print("Notebook Paths:")
print(notebook_paths)
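
The base path above is derived by dropping the last segment of the current notebook path and appending the folder name. A minimal sketch of that string manipulation, using a made-up notebook path and a hypothetical helper name `sibling_folder_path`, looks like this:

```python
def sibling_folder_path(notebook_path, folder_name):
    """Replace the last path segment (the notebook name) with folder_name."""
    parent = notebook_path.rsplit("/", 1)[0]
    return f"{parent}/{folder_name}"

# Hypothetical notebook path, used only for illustration.
example_path = "/Users/someone@example.com/course/1.2 Lab - CI-CD Pipeline"
base = sibling_folder_path(example_path, "1.2 Lab - Pipeline workflow notebooks")
print(base)
```

Because the folder is resolved relative to the running notebook, the paths stay valid wherever the course folder is imported in the workspace.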

## Task 1 - Environment Setup - Git-Integrated Databricks Workspace
In this section, you will set up a Git-integrated Databricks workspace and clone a GitHub repository for use in this lab. Follow the steps below to ensure proper setup and integration.

#### Prerequisites for Git Integration with Databricks
Before starting, ensure you meet the following prerequisites:

- Access to a Databricks workspace.
- A **GitHub** (or similar) account with a repository, such as `Adv_mlops_lab`, to integrate with Databricks.
- A **Personal Access Token (PAT)** from GitHub with the necessary permissions (e.g., `repo` and `workflow` scopes) to interact with your repository.


### 1.1. Configure Git Integration in Databricks
To link your GitHub account with Databricks using a Personal Access Token:
1. **Navigate to User Settings:**
   - In the top-right corner, click your **profile icon** and select **Settings** from the dropdown menu.

2. **Link GitHub with Personal Access Token:**
   - On the User Settings page, go to the **Linked Accounts** tab.
   - Under **Git Integration**, follow these steps:
      - Click on **Add Git credential**.
      - Select your Git provider (e.g., GitHub, Bitbucket Cloud) from the dropdown.
      - Add a **Nickname** (optional).
      - Choose **Personal Access Token** as the authentication method.
      - Enter your **Git provider email**.
      - Enter your **Git provider username**.
      - Paste your **Personal Access Token (PAT)** into the token field.
      - Click **Save** to complete the integration.

3. **Verify Integration:**
   - Once complete, your Git provider will appear under the **Linked Accounts** section in the Databricks settings.

---

> **Note**: A Personal Access Token provides a secure and straightforward way to connect your Git provider to Databricks.

### 1.2. Clone Repository in Databricks
The provided code automates the process of cloning your GitHub repository into the Databricks workspace. It reads Git credentials from a configuration file, sets up the local repository, and ensures that the latest changes are pulled.

Instructions:
- **Prepare a GitHub Repository:**
  - Ensure you have a GitHub repository ready for use in this lab.
  - Make sure your PAT has the necessary scopes to interact with the repository.

- **Verify the Git Credentials Configuration File:**
  - A configuration file named `git_credentials.cfg` stores your GitHub credentials. Ensure the file includes the following fields under the `[DEFAULT]` section:
  `[DEFAULT]
  github_username = <your_github_username>
  repo_name = <your_repository_name>
  github_token = <your_personal_access_token>`

- **Run the Provided Code:**

  - The code will read your GitHub credentials, clone the repository into the Databricks workspace, and prepare it for further lab operations.
  - The process includes:
    - Reading and validating credentials from `git_credentials.cfg`.
    - Cloning the repository into `/Shared/<repo_name>`.
    - Configuring Git settings (e.g., username and email).
    - Pulling the latest changes from the main branch.
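
To illustrate the read-and-validate step, here is a minimal sketch that parses the same `[DEFAULT]` layout from an in-memory string instead of a file. The function name `parse_git_credentials` and all sample values are placeholders chosen for this example; the lab's own helper is `read_git_credentials()`.

```python
import configparser

# Hypothetical placeholder values; never commit a real token.
sample_cfg = """
[DEFAULT]
github_username = example-user
repo_name = example-repo
github_token = ghp_exampletoken123
"""

def parse_git_credentials(cfg_text):
    """Parse credentials from config text and fail fast on missing fields."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    fields = ("github_username", "repo_name", "github_token")
    values = {name: parser.get("DEFAULT", name, fallback="").strip() for name in fields}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing credential fields: {missing}")
    return values

creds = parse_git_credentials(sample_cfg)
print(f"Username: {creds['github_username']}, Repo: {creds['repo_name']}")
```

Passing `fallback=""` turns a missing key into an empty value, so every absent field is reported in a single error rather than one at a time.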

**Steps:**

**Step 1: Read GitHub Credentials**
- The `read_git_credentials()` function reads your **GitHub username**, **repository name**, and **PAT** from `git_credentials.cfg`. If the file or credentials are missing, it raises an error.

**Step 2: Clone the Git Repository**
- The `setup_git_repo()` function:

  - Reads credentials using `read_git_credentials()`.
  - Defines the local path for cloning the repository (`/Shared/<repo_name>`).
  - Clones the repository if it doesn't exist locally.
  - Sets up global Git configurations (username and email).
  - Pulls the latest changes from the `main` branch.
- **Error Handling**

  The code provides detailed error messages for missing files, incorrect credentials, or issues during Git operations.
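
The error-handling pattern can be sketched in isolation as follows. This is an illustrative variant, not the lab's code: it passes arguments as a list rather than using `shell=True`, and the helper name `run_command` is hypothetical. The Python interpreter stands in for `git` so the sketch runs anywhere.

```python
import subprocess
import sys

def run_command(args):
    """Run a command without a shell; return (ok, output_or_error_message)."""
    try:
        result = subprocess.run(args, check=True, capture_output=True, text=True)
        return True, result.stdout.strip()
    except FileNotFoundError as err:
        return False, f"Executable not found: {err}"
    except subprocess.CalledProcessError as err:
        return False, f"Exit code {err.returncode}: {err.stderr.strip()}"

# Stand-in commands; in the lab these would be git commands
# such as ["git", "pull", "origin", "main"].
ok, output = run_command([sys.executable, "-c", "print('hello')"])
ok_fail, message = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, output)
print(ok_fail, message)
```

Capturing `stderr` keeps Git's own error text available for the message, which is usually more informative than the exit code alone.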

In [0]:
import os
import subprocess
import configparser

def read_git_credentials(config_path="var/git_credentials.cfg"):
    """
    Reads GitHub credentials from a configuration file.

    Args:
        config_path (str): Path to the configuration file.

    Returns:
        tuple: GitHub username, repository name, and GitHub token.
    """
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Git credentials file not found: {config_path}")
    
    config = configparser.ConfigParser()
    config.read(config_path)
    
    github_username = config.get("DEFAULT", "github_username")
    repo_name = config.get("DEFAULT", "repo_name")
    github_token = config.get("DEFAULT", "github_token")
    
    ## Validate credentials
    if not github_username or not repo_name or not github_token:
        raise ValueError("GitHub credentials are incomplete. Please provide username, repo name, and token.")
    
    ## Debugging: Ensure credentials are read correctly
    print(f"[INFO] Read Credentials -> Username: {github_username}, Repo Name: {repo_name}, Token: {github_token[:6]}... (hidden)")
    return github_username, repo_name, github_token


def setup_git_repo(config_path="var/git_credentials.cfg"):
    """
    Sets up a GitHub repository locally by cloning it and preparing it for lab operations.

    Args:
        config_path (str): Path to the configuration file containing GitHub credentials.

    Returns:
        tuple: Final username, repository name, and GitHub token used for the setup.
    """
    ## Step 1: Load credentials
    github_username, repo_name, github_token = read_git_credentials(config_path)

    try:
        ## Step 2: Define paths and repository URL
        git_repo_path = f"/Shared/{repo_name}"  # Lab-specific local repo path
        repo_url = <FILL_IN>

        ## Debugging: Print repository details
        print(f"[INFO] Repo Path: {git_repo_path}")
        print(f"[INFO] Repo URL: https://github.com/{github_username}/{repo_name}.git")  # Token omitted so it is not written to the cell output

        ## Step 3: Clone or pull the latest repository updates
        if not os.path.exists(git_repo_path):
            print(f"[ACTION] Cloning the repository '{repo_name}'...")
            subprocess.run(<FILL_IN>, shell=True, check=True)
        os.chdir(git_repo_path)  # Change directory to the local repo

        ## Step 4: Set up Git configuration
        print("[ACTION] Setting Git configuration...")
        subprocess.run(<FILL_IN>, shell=True, check=True)
        subprocess.run(<FILL_IN>, shell=True, check=True)

        ## Step 5: Pull the latest changes
        print("[ACTION] Pulling latest changes from the repository...")
        subprocess.run(<FILL_IN>, shell=True, check=True)
        
        print("[SUCCESS] Git setup complete.")

    except FileNotFoundError as fnfe:
        print(f"[ERROR] {fnfe}")
    except subprocess.CalledProcessError as cpe:
        print(f"[ERROR] Git command error: {cpe}")
    except Exception as e:
        print(f"[ERROR] An error occurred while setting up Git: {e}")

    ## Step 6: Debug final values to ensure correctness
    print(f"[DEBUG] Final Values -> Username: {github_username}, Repo Name: {repo_name}, Token: {github_token[:6]}... (hidden)")

    return github_username, repo_name, github_token

## Lab-specific usage
if __name__ == "__main__":
    # Call the function to set up the Git repository and store the final values
    <FILL_IN>

    ## Print final values to verify correctness
    print(f"[FINAL] GitHub Username: {final_username}")
    print(f"[FINAL] Repository Name: {final_repo_name}")
    print(f"[FINAL] GitHub Token: {final_git_token[:6]}... (hidden)")

In [0]:
%skip
import os
import subprocess
import configparser

def read_git_credentials(config_path="var/git_credentials.cfg"):
    """
    Reads GitHub credentials from a configuration file.

    Args:
        config_path (str): Path to the configuration file.

    Returns:
        tuple: GitHub username, repository name, and GitHub token.
    """
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Git credentials file not found: {config_path}")
    
    config = configparser.ConfigParser()
    config.read(config_path)
    
    github_username = config.get("DEFAULT", "github_username")
    repo_name = config.get("DEFAULT", "repo_name")
    github_token = config.get("DEFAULT", "github_token")
    
    ## Validate credentials
    if not github_username or not repo_name or not github_token:
        raise ValueError("GitHub credentials are incomplete. Please provide username, repo name, and token.")
    
    ## Debugging: Ensure credentials are read correctly
    print(f"[INFO] Read Credentials -> Username: {github_username}, Repo Name: {repo_name}, Token: {github_token[:6]}... (hidden)")
    return github_username, repo_name, github_token


def setup_git_repo(config_path="var/git_credentials.cfg"):
    """
    Sets up a GitHub repository locally by cloning it and preparing it for lab operations.

    Args:
        config_path (str): Path to the configuration file containing GitHub credentials.

    Returns:
        tuple: Final username, repository name, and GitHub token used for the setup.
    """
    ## Step 1: Load credentials
    github_username, repo_name, github_token = read_git_credentials(config_path)

    try:
        ## Step 2: Define paths and repository URL
        git_repo_path = f"/Shared/{repo_name}"  # Lab-specific local repo path
        repo_url = f"https://{github_username}:{github_token}@github.com/{github_username}/{repo_name}.git"

        ## Debugging: Print repository details
        print(f"[INFO] Repo Path: {git_repo_path}")
        print(f"[INFO] Repo URL: https://github.com/{github_username}/{repo_name}.git")  # Token omitted so it is not written to the cell output

        ## Step 3: Clone or pull the latest repository updates
        if not os.path.exists(git_repo_path):
            print(f"[ACTION] Cloning the repository '{repo_name}'...")
            subprocess.run(f"git clone {repo_url} {git_repo_path}", shell=True, check=True)
        os.chdir(git_repo_path)  # Change directory to the local repo

        ## Step 4: Set up Git configuration
        print("[ACTION] Setting Git configuration...")
        subprocess.run('git config --global user.name "Databricks Lab User"', shell=True, check=True)
        subprocess.run('git config --global user.email "databricks-lab@example.com"', shell=True, check=True)

        ## Step 5: Pull the latest changes
        print("[ACTION] Pulling latest changes from the repository...")
        subprocess.run("git pull origin main --allow-unrelated-histories", shell=True, check=True)
        
        print("[SUCCESS] Git setup complete.")

    except FileNotFoundError as fnfe:
        print(f"[ERROR] {fnfe}")
    except subprocess.CalledProcessError as cpe:
        print(f"[ERROR] Git command error: {cpe}")
    except Exception as e:
        print(f"[ERROR] An error occurred while setting up Git: {e}")

    ## Step 6: Debug final values to ensure correctness
    print(f"[DEBUG] Final Values -> Username: {github_username}, Repo Name: {repo_name}, Token: {github_token[:6]}... (hidden)")

    return github_username, repo_name, github_token

## Lab-specific usage
if __name__ == "__main__":
    ## Call the function to set up the Git repository and store the final values
    final_username, final_repo_name, final_git_token = setup_git_repo()

    ## Print final values to verify correctness
    print(f"[FINAL] GitHub Username: {final_username}")
    print(f"[FINAL] Repository Name: {final_repo_name}")
    print(f"[FINAL] GitHub Token: {final_git_token[:6]}... (hidden)")

### 1.3. Execute Notebooks and Commit to Git Repository
In this section, you will set up a local clone of your GitHub repository for CI/CD operations and commit any pending changes back to it. The provided code clones the repository if needed, pulls the latest changes, and then commits and pushes whatever has changed.

**Instructions to Execute Notebooks and Commit Changes:**

1. **Define Repository Paths and Credentials:**
   - The code reads GitHub credentials (username, repository name, and personal access token) from a configuration file (`git_credentials.cfg`).
   - Constructs the URL and local path for the repository.

2. **Clone the Repository:**
   - The `setup_git_repo()` function clones your GitHub repository into the Databricks workspace (if not already cloned) and ensures it is up to date by pulling the latest changes.

3. **Pull Latest Changes:**
   - Ensures the latest changes from the `main` branch are pulled into the local repository.

4. **Set Up Git Configuration:**
   - Configures the Git username and email for the local repository to ensure smooth commit operations.

5. **Check for Changes**:
   - Use `git status` to check if there are any changes to be committed.
   - If changes are detected, add them to the staging area with `git add .`.
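
To make the change check concrete, the sketch below parses text in the format produced by `git status --porcelain` (version 1), where each line holds a two-character status code, a space, and a path. The sample output and the helper name `parse_porcelain` are made up for illustration.

```python
def parse_porcelain(status_text):
    """Split `git status --porcelain` (v1) output into (status_code, path) pairs."""
    entries = []
    for line in status_text.splitlines():
        if not line.strip():
            continue
        # The first two characters are the status code, then one space, then the path.
        entries.append((line[:2], line[3:]))
    return entries

# Hypothetical output for a repo with one modified and one untracked file.
sample_output = " M notebooks/train.py\n?? lab_pipeline_config/workflow.json\n"
changes = parse_porcelain(sample_output)
print(f"Has changes: {bool(changes)}")
for code, path in changes:
    print(f"{code!r} -> {path}")
```

An empty result means the working tree is clean, which is the case in which the lab code skips the commit and push.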


In [0]:
import os
import shutil
import subprocess


## Define the local repository path and Git configuration
repo_url = <FILL_IN>
local_git_repo_path = f"/Users/{DA.username}/{final_repo_name}.git"

## Function to clone the GitHub repository and set up the local environment
def setup_git_repo():
    try:
        if not os.path.exists(local_git_repo_path):
            print(f"Cloning the repository from {repo_url}...")
            subprocess.run(<FILL_IN>, shell=True, check=True)
        os.chdir(local_git_repo_path)
        subprocess.run(<FILL_IN>)
        print("Git setup complete.")
    except subprocess.CalledProcessError as e:
        print(f"Error setting up Git repository: {e}")
    ## Commit changes to Git if there are any changes
    os.chdir(local_git_repo_path)
    if has_changes_to_commit(local_git_repo_path):
        subprocess.run(<FILL_IN>)
        subprocess.run(<FILL_IN> "Added exported files to Git"', shell=True, check=True)
        subprocess.run(<FILL_IN>, shell=True, check=True)
        print("Notebooks committed and pushed to Git.")
    else:
        print("No changes to commit. Working tree is clean.")

## Function to check if there are changes to commit
def has_changes_to_commit(repo_path):
    result = subprocess.run(
        <FILL_IN>
    )
    return bool(result.stdout.strip())


## Setup Git repository
<FILL_IN>

In [0]:
%skip
import os
import shutil
import subprocess


## Define the local repository path and Git configuration
repo_url = f"https://github.com/{final_username}/{final_repo_name}.git"
local_git_repo_path = f"/Users/{DA.username}/{final_repo_name}.git"

## Function to clone the GitHub repository and set up the local environment
def setup_git_repo():
    try:
        if not os.path.exists(local_git_repo_path):
            print(f"Cloning the repository from {repo_url}...")
            subprocess.run(f"git clone {repo_url} {local_git_repo_path}", shell=True, check=True)
        os.chdir(local_git_repo_path)
        subprocess.run("git pull origin main", shell=True, check=True)
        print("Git setup complete.")
    except subprocess.CalledProcessError as e:
        print(f"Error setting up Git repository: {e}")
    ## Commit changes to Git if there are any changes
    os.chdir(local_git_repo_path)
    if has_changes_to_commit(local_git_repo_path):
        subprocess.run("git add .", shell=True, check=True)
        subprocess.run('git commit -m "Added exported files to Git"', shell=True, check=True)
        subprocess.run("git push origin main", shell=True, check=True)
        print("Notebooks committed and pushed to Git.")
    else:
        print("No changes to commit. Working tree is clean.")

## Function to check if there are changes to commit
def has_changes_to_commit(repo_path):
    result = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain"],
        capture_output=True,
        text=True
    )
    return bool(result.stdout.strip())


## Setup Git repository
setup_git_repo()

### 1.4. Display Git Folder Structure
This section provides a function that prints the hierarchical folder structure of a Git repository. Understanding the structure of your repository is important for locating files, ensuring proper organization, and verifying that the repository is set up correctly.

**Instructions:**

- **Run the Function:**

  - Execute the provided code after cloning your Git repository.
  - The code uses the path `local_git_repo_path`, which should point to the root directory of the cloned repository.
- **Analyze the Output:**
  - The printed folder structure should match the organization of your repository on GitHub.
  - Verify that all required files and directories (e.g., configuration files, notebooks, workflows) are present.
- **Debugging Tip:**
  - If the folder structure is incorrect or empty:
    - Ensure that the repository was successfully cloned.
    - Check the value of `local_git_repo_path` to confirm it points to the correct directory.
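
The idea behind the indentation can be tried on a throwaway directory first. The sketch below is one possible variant that computes the nesting level with `os.path.relpath` and returns the lines instead of printing them; the function name `folder_tree_lines` and the demo folder layout are assumptions for this example.

```python
import os
import tempfile

def folder_tree_lines(root_path):
    """Return the folder tree under root_path as a list of indented lines."""
    lines = []
    for current, dirs, files in os.walk(root_path):
        dirs.sort()  # make the traversal order deterministic
        rel = os.path.relpath(current, root_path)
        level = 0 if rel == "." else rel.count(os.sep) + 1
        lines.append(f"{' ' * 4 * level}{os.path.basename(current)}/")
        for name in sorted(files):
            lines.append(f"{' ' * 4 * (level + 1)}{name}")
    return lines

# Build a tiny stand-in repository in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "demo_repo")
    os.makedirs(os.path.join(repo, "notebooks"))
    open(os.path.join(repo, "README.md"), "w").close()
    open(os.path.join(repo, "notebooks", "train.py"), "w").close()
    demo_lines = folder_tree_lines(repo)

print("\n".join(demo_lines))
```

Returning a list rather than printing directly makes the output easy to compare against an expected layout when debugging.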

In [0]:
import os
def print_git_folder_structure(local_git_repo_path):
    for root, dirs, files in os.walk(local_git_repo_path):
        level = <FILL_IN>
        indent = ' ' * 4 * (level)
        print(f"{indent}{os.path.basename(root)}/")
        sub_indent = ' ' * 4 * (level + 1)
        for f in files:
            print(f"{sub_indent}{f}")

## Print the folder structure
print_git_folder_structure(local_git_repo_path)

In [0]:
%skip
import os
def print_git_folder_structure(local_git_repo_path):
    for root, dirs, files in os.walk(local_git_repo_path):
        level = root.replace(local_git_repo_path, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print(f"{indent}{os.path.basename(root)}/")
        sub_indent = ' ' * 4 * (level + 1)
        for f in files:
            print(f"{sub_indent}{f}")

## Print the folder structure
print_git_folder_structure(local_git_repo_path)

### 1.5. Passing Variables for Hyperparameter Tuning
This section focuses on dynamically passing hyperparameters to notebooks in the CI/CD pipeline. These variables influence model training and are set as widgets, making them easy to modify without changing the code. You will learn how to define, retrieve, and use these variables for experimentation and model optimization.

**Instructions:**
- **Purpose of Widgets:**
  - Widgets allow you to pass parameters into a notebook dynamically.
  - By using widgets, you can easily modify hyperparameter values without altering the code.
- **Define Widgets:**

    Use the `dbutils.widgets.text()` function to define the following widgets:
    - **max_depth:** Specifies the maximum depth of the decision tree or model.
    - **n_estimators:** Defines the number of estimators (e.g., trees in a forest) for the model.
    - **subsample:** Determines the fraction of samples to use for training.
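
Text widgets always return strings, so the values need to be converted before use. The following sketch shows one way to convert and sanity-check them outside Databricks. A plain dictionary stands in for `dbutils.widgets.get(...)`, and both the helper name `parse_hyperparameters` and the validation ranges are assumptions, not requirements of the lab.

```python
def parse_hyperparameters(raw):
    """Convert widget-style string values into typed, validated hyperparameters."""
    params = {
        "max_depth": int(raw["max_depth"]),
        "n_estimators": int(raw["n_estimators"]),
        "subsample": float(raw["subsample"]),
    }
    # Assumed ranges for illustration only.
    if params["max_depth"] < 1 or params["n_estimators"] < 1:
        raise ValueError("max_depth and n_estimators must be positive integers")
    if not 0.0 < params["subsample"] <= 1.0:
        raise ValueError("subsample must be in the interval (0, 1]")
    return params

# These string literals stand in for the values returned by the widgets.
widget_values = {"max_depth": "5", "n_estimators": "100", "subsample": "0.8"}
hyperparams = parse_hyperparameters(widget_values)
print(hyperparams)
```

Validating early means a mistyped widget value fails in this notebook rather than partway through a pipeline run.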

In [0]:
## Define Widgets for Other Hyperparameters
dbutils.widgets.text("max_depth", <FILL_IN>, "Maximum Depth")
dbutils.widgets.text("n_estimators", <FILL_IN>, "Number of Estimators")
dbutils.widgets.text("subsample", <FILL_IN>, "Subsample Fraction")

## Retrieve Widget Values
max_depth = <FILL_IN>
n_estimators = <FILL_IN>
subsample = <FILL_IN>

## Display the Retrieved Values
print(f"Using maximum depth: {max_depth}")
print(f"Using number of estimators: {n_estimators}")
print(f"Using subsample fraction: {subsample}")

In [0]:
%skip
## Define Widgets for Other Hyperparameters
dbutils.widgets.text("max_depth", "5", "Maximum Depth")
dbutils.widgets.text("n_estimators", "100", "Number of Estimators")
dbutils.widgets.text("subsample", "1.0", "Subsample Fraction")

## Retrieve Widget Values
max_depth = int(dbutils.widgets.get("max_depth"))
n_estimators = int(dbutils.widgets.get("n_estimators"))
subsample = float(dbutils.widgets.get("subsample"))

## Display the Retrieved Values
print(f"Using maximum depth: {max_depth}")
print(f"Using number of estimators: {n_estimators}")
print(f"Using subsample fraction: {subsample}")

## Task 2 - Pipeline Validation Workflow with Email Notifications
In this section, you will set up a Databricks workflow configuration for pipeline validation. The configuration includes tasks such as data cleaning, feature engineering, conditional execution, and model evaluation. You'll also implement email notifications to monitor the success or failure of tasks.

### 2.1. Create Folder Structure for Workflow Configuration
  - A dedicated folder is needed to store the workflow configuration `JSON` file.
  - The `os.makedirs()` function creates a folder at the specified path.
  - The `pipeline_config_folder` is the directory path where the JSON configuration file will be stored.
  - The `pipeline_config_file` is the full path, including the filename, for the workflow configuration.
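
As a small illustration of how `os.path.join()` and `os.makedirs(..., exist_ok=True)` behave, the sketch below uses a temporary directory in place of the cloned repository. The folder and file names are hypothetical.

```python
import os
import tempfile

# A temporary directory stands in for the cloned repository root.
with tempfile.TemporaryDirectory() as repo_root:
    config_folder = os.path.join(repo_root, "config_demo")           # hypothetical folder name
    config_file = os.path.join(config_folder, "workflow_demo.json")  # hypothetical file name

    os.makedirs(config_folder, exist_ok=True)
    os.makedirs(config_folder, exist_ok=True)  # second call is a no-op, not an error

    folder_created = os.path.isdir(config_folder)
    file_exists_yet = os.path.exists(config_file)

print(f"Folder created: {folder_created}, file exists yet: {file_exists_yet}")
```

Note that creating the folder does not create the file; the file only appears once the configuration is written to it.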

In [0]:
import os
## Define folder and file paths
pipeline_config_folder = <FILL_IN>
pipeline_config_file = <FILL_IN>

## Create the folder structure
os.makedirs(pipeline_config_folder, exist_ok=True)

In [0]:
%skip
import os
## Define folder and file paths
pipeline_config_folder = os.path.join(local_git_repo_path, "lab_pipeline_config")
pipeline_config_file = os.path.join(pipeline_config_folder, "lab_pipeline-validation-workflow.json")

## Create the folder structure
os.makedirs(pipeline_config_folder, exist_ok=True)

### 2.2. Define Workflow Configuration

- The workflow configuration describes the sequence of tasks and their dependencies in the pipeline.
- Includes **email notifications** to alert you when tasks succeed or fail.
- **Tasks include:**
    - Data cleaning
    - Feature engineering
    - Conditional execution based on the success or failure of previous tasks
    - Model evaluation
- Each task is defined in the **`tasks`** array in JSON format.
- **`depends_on`** specifies task dependencies.
- **`base_parameters`** allows passing hyperparameters to notebooks.
- Email notifications are set for both success and failure.
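
Because the configuration is built with a Python f-string, literal JSON braces must be doubled (`{{` and `}}`), and a task-value reference that should appear as `{{...}}` in the final JSON needs four braces on each side. The sketch below demonstrates this on a single cut-down task and checks the result with `json.loads`. The variable names are chosen for this example only.

```python
import json

task_key = "feature_engineering"  # upstream task name used for illustration

snippet = f"""
{{
  "task_key": "conditional_execution",
  "depends_on": [{{"task_key": "{task_key}"}}],
  "condition_task": {{
    "op": "EQUAL_TO",
    "left": "{{{{tasks.{task_key}.values.feature_engineering_status}}}}",
    "right": "SUCCESS"
  }}
}}
"""

# json.loads raises json.JSONDecodeError if the rendered text is not valid JSON.
parsed = json.loads(snippet)
print(parsed["depends_on"])
print(parsed["condition_task"]["left"])
```

Running the full configuration string through `json.loads` before saving it is a cheap way to catch a missing comma or an unescaped brace.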

In [0]:
from datetime import datetime
## Define the workflow configuration
workflow_config_pipeline = f"""
{{
  "name": "Lab Pipeline Validation Workflow with Conditional Execution - {datetime.now().strftime('%Y-%m-%d')}",
  "email_notifications": {{
    "on_failure": ["{DA.username}"],
    "on_success": ["{DA.username}"]
  }},
  "tasks": [
    {{
      "task_key": "data_cleaning",
      "notebook_task": {{
        "notebook_path": "<FILL_IN>",
        "source": "WORKSPACE"
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600,
      "run_if": <FILL_IN>
    }},
    {{
      "task_key": "feature_engineering",
      "depends_on": [<FILL_IN>],
      "notebook_task": {{
        "notebook_path": "<FILL_IN>",
        "source": "WORKSPACE",
        "base_parameters": {{
          <FILL_IN>
        }}
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600,
      "run_if": "ALL_SUCCESS"
    }},
    {{
      "task_key": "conditional_execution",
      "depends_on": [<FILL_IN>],
      "condition_task": {{
        "op": "EQUAL_TO",
        "left": "{{{{tasks.feature_engineering.values.feature_engineering_status}}}}",
        "right": "SUCCESS"
      }},
      "timeout_seconds": 0
    }},
    {{
      "task_key": "failure_handling",
      "depends_on": [{{"task_key": "conditional_execution", "outcome": "false"}}],
      "spark_python_task": {{
        "python_file": "<FILL_IN>",
        "parameters": [
          "-e",
          "NonExistentColumn: Column not found in the dataset"
        ]
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600
    }},
    {{
      "task_key": "model_evaluation",
      "depends_on": [{{"task_key": "conditional_execution", "outcome": "true"}}],
      "notebook_task": {{
        "notebook_path": "<FILL_IN>",
        "source": "WORKSPACE",
        "base_parameters": {{
          <FILL_IN>
        }}
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600
    }}
  ]
}}
"""

In [0]:
%skip
from datetime import datetime
## Define the workflow configuration
workflow_config_pipeline = f"""
{{
  "name": "Lab Pipeline Validation Workflow with Conditional Execution - {datetime.now().strftime('%Y-%m-%d')}",
  "email_notifications": {{
    "on_failure": ["{DA.username}"],
    "on_success": ["{DA.username}"]
  }},
  "tasks": [
    {{
      "task_key": "data_cleaning",
      "notebook_task": {{
        "notebook_path": "{notebook_paths['data_cleaning']}",
        "source": "WORKSPACE"
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600,
      "run_if": "ALL_SUCCESS"
    }},
    {{
      "task_key": "feature_engineering",
      "depends_on": [{{"task_key": "data_cleaning"}}],
      "notebook_task": {{
        "notebook_path": "{notebook_paths['feature_engineering']}",
        "source": "WORKSPACE",
        "base_parameters": {{
          "max_depth": "{max_depth}",
          "n_estimators": "{n_estimators}",
          "subsample": "{subsample}"
        }}
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600,
      "run_if": "ALL_SUCCESS"
    }},
    {{
      "task_key": "conditional_execution",
      "depends_on": [{{"task_key": "feature_engineering"}}],
      "condition_task": {{
        "op": "EQUAL_TO",
        "left": "{{{{tasks.feature_engineering.values.feature_engineering_status}}}}",
        "right": "SUCCESS"
      }},
      "timeout_seconds": 0
    }},
    {{
      "task_key": "failure_handling",
      "depends_on": [{{"task_key": "conditional_execution", "outcome": "false"}}],
      "spark_python_task": {{
        "python_file": "{notebook_paths['failure_handling']}",
        "parameters": [
          "-e",
          "NonExistentColumn: Column not found in the dataset"
        ]
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600
    }},
    {{
      "task_key": "model_evaluation",
      "depends_on": [{{"task_key": "conditional_execution", "outcome": "true"}}],
      "notebook_task": {{
        "notebook_path": "{notebook_paths['model_evaluation']}",
        "source": "WORKSPACE",
        "base_parameters": {{
          "max_depth": "{max_depth}",
          "n_estimators": "{n_estimators}",
          "subsample": "{subsample}"
        }}
      }},
      "existing_cluster_id": "{cluster_id}",
      "timeout_seconds": 600
    }}
  ]
}}
"""

### 2.3. Save Workflow Configuration to File
- Save the workflow configuration to a JSON file for later use in the pipeline execution.
- If the file already exists, the cell prompts you for confirmation and overwrites the file only when you answer `yes`.
- If the file does not exist, the cell writes the configuration to the specified file path.
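
Because the configuration is assembled with an f-string, a stray brace or quote produces a file that is only rejected later, when the job is created. A minimal sketch of a pre-save check follows. The helper name `validate_workflow_config` and the small sample configuration are illustrative assumptions, not part of the lab's code.

```python
import json

def validate_workflow_config(config_text):
    """Parse the rendered configuration and return it as a dict.

    Raises json.JSONDecodeError if the text is not valid JSON and
    ValueError if two tasks share the same task_key.
    """
    config = json.loads(config_text)
    task_keys = [task["task_key"] for task in config.get("tasks", [])]
    if len(task_keys) != len(set(task_keys)):
        raise ValueError(f"Duplicate task keys: {task_keys}")
    return config

# Illustrative stand-in for the lab's f-string; the values are made up.
cluster_id = "example-cluster-id"
sample_config_text = f"""
{{
  "name": "Example workflow",
  "tasks": [
    {{
      "task_key": "conditional_execution",
      "condition_task": {{
        "op": "EQUAL_TO",
        "left": "{{{{tasks.feature_engineering.values.feature_engineering_status}}}}",
        "right": "SUCCESS"
      }}
    }},
    {{
      "task_key": "data_cleaning",
      "existing_cluster_id": "{cluster_id}"
    }}
  ]
}}
"""

parsed = validate_workflow_config(sample_config_text)
left_operand = parsed["tasks"][0]["condition_task"]["left"]
print(left_operand)
```

Inside an f-string, `{{` renders as a single `{`, so the quadruple braces collapse to the double braces used for the task value reference.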

In [0]:
## Write the workflow configuration to a file
if os.path.exists(pipeline_config_file):
    user_input = input(f"The file {pipeline_config_file} already exists. Overwrite? (yes/no): ").strip().lower()
    if user_input != "yes":
        print("Operation canceled.")
    else:
        with open(pipeline_config_file, "w") as file:
            <FILL_IN>
        print(f"Workflow configuration overwritten: <FILL_IN>")
else:
    with open(pipeline_config_file, "w") as file:
        <FILL_IN>
    print(f"Workflow configuration saved: <FILL_IN>")

In [0]:
%skip
## Write the workflow configuration to a file
if os.path.exists(pipeline_config_file):
    user_input = input(f"The file {pipeline_config_file} already exists. Overwrite? (yes/no): ").strip().lower()
    if user_input != "yes":
        print("Operation canceled.")
    else:
        with open(pipeline_config_file, "w") as file:
            file.write(workflow_config_pipeline)
        print(f"Workflow configuration overwritten: {pipeline_config_file}")
else:
    with open(pipeline_config_file, "w") as file:
        file.write(workflow_config_pipeline)
    print(f"Workflow configuration saved: {pipeline_config_file}")

## Task 3 - Pipeline Execution and Version Update
In this section, you will execute a Databricks pipeline using a pre-defined workflow configuration file. You will validate the pipeline's results, update its version, and commit any changes to the Git repository. This task helps automate pipeline execution and ensures proper versioning and logging of outputs.
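
The version update in this lab increments the minor component of a `major.minor.patch` string and resets the patch component to zero. A minimal sketch of that rule in isolation, with `bump_minor` as a hypothetical helper name:

```python
def bump_minor(version):
    """Return the next minor version, resetting the patch number to 0."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"

print(bump_minor("1.0.0"))  # 1.1.0
print(bump_minor("2.4.7"))  # 2.5.0
```

Keeping the rule in a small function makes it easy to check before it is wired into the file read and write steps.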



### 3.1. Pipeline Execution Workflow Overview
1. **Defining Paths**

    - **Git Repository Path:** Specifies the local location of the Git repository.
    - **Version File Path:** Stores pipeline version information.
    - **Workflow Configuration File Path:** Points to the JSON configuration file for the pipeline workflow.
    - **Failure Output File Path:** Captures detailed error information if tasks fail.

In [0]:
import os
import json
import subprocess

## Define paths
base_folder = f"/Workspace{base_path}"
version_file = <FILL_IN>
workflow_config_file = <FILL_IN>
failure_output_file = os.path.join(base_folder, "failure_output.json")

In [0]:
%skip
import os
import json
import subprocess

## Define paths
base_folder = f"/Workspace{base_path}"
version_file = os.path.join(local_git_repo_path, "lab_pipeline_config_version_info.json")
workflow_config_file = os.path.join(local_git_repo_path, "lab_pipeline_config", "lab_pipeline-validation-workflow.json")
failure_output_file = os.path.join(base_folder, "failure_output.json")

2. **Committing and Pushing Changes to Git**

    The `commit_and_push_changes()` function performs the following actions:

    - Stages all changes in the repository using `git add .`.
    - Checks for uncommitted changes.
    - Commits changes with a message if any changes are detected.
    - Pushes committed changes to the `main` branch.
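
The four steps above can be sketched without `shell=True` or `os.chdir`. The helper names `run_git`, `has_uncommitted_changes`, and `commit_and_push` are hypothetical; this is one possible shape for the logic, not the lab's reference solution.

```python
import subprocess

def run_git(args, repo_path):
    """Run one git command in repo_path and return its standard output.

    capture_output=True fills in stdout and stderr, so a CalledProcessError
    raised by check=True carries the git error text in e.stderr.
    """
    completed = subprocess.run(
        ["git", *args],
        cwd=repo_path,
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout

def has_uncommitted_changes(porcelain_output):
    """`git status --porcelain` prints one line per changed path, or nothing."""
    return bool(porcelain_output.strip())

def commit_and_push(repo_path, message, branch="main"):
    """Stage, commit when needed, and push. Returns True if a commit was made."""
    run_git(["add", "."], repo_path)
    if not has_uncommitted_changes(run_git(["status", "--porcelain"], repo_path)):
        return False
    run_git(["commit", "-m", message], repo_path)
    run_git(["push", "origin", branch], repo_path)
    return True
```

Passing the command as a list avoids shell quoting problems in the commit message, and `cwd=` leaves the notebook's working directory unchanged.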

In [0]:
## Function to commit and push changes to Git
def commit_and_push_changes():
    try:
        os.chdir(local_git_repo_path)
        subprocess.run(<FILL_IN>)

        ## Check for changes before committing
        result = <FILL_IN>
        if result.strip():  # If there are changes to commit
            subprocess.run(<FILL_IN>)
            
            ## Push changes to the 'main' branch (or any valid branch)
            branch_name = <FILL_IN>  # Change this to your target branch if it's not 'main'
            push_result = <FILL_IN>
            if "error" in push_result.lower():
                raise Exception(push_result)
            
            print(f"Changes committed and pushed to Git branch: {branch_name} successfully.")
        else:
            print("No changes to commit. Working tree is clean.")
    except subprocess.CalledProcessError as e:
        print(f"Git error: {e.stderr or e}")
    except Exception as e:
        print(f"Error during Git operations: {e}")

In [0]:
%skip
## Function to commit and push changes to Git
def commit_and_push_changes():
    try:
        os.chdir(local_git_repo_path)
        subprocess.run("git add .", shell=True, check=True)

        ## Check for changes before committing
        result = subprocess.getoutput("git status --porcelain")
        if result.strip():  # If there are changes to commit
            subprocess.run('git commit -m "Updated version and pipeline results"', shell=True, check=True)
            
            ## Push changes to the 'main' branch (or any valid branch)
            branch_name = "main"  # Change this to your target branch if it's not 'main'
            push_result = subprocess.getoutput(f"git push origin {branch_name}")
            if "error" in push_result.lower():
                raise Exception(push_result)
            
            print(f"Changes committed and pushed to Git branch: {branch_name} successfully.")
        else:
            print("No changes to commit. Working tree is clean.")
    except subprocess.CalledProcessError as e:
        print(f"Git error: {e.stderr or e}")
    except Exception as e:
        print(f"Error during Git operations: {e}")

3. **Running the Pipeline**

    1. **Execute the Pipeline**:
        - The function `run_pipeline_and_update_version()` will read the workflow configuration file, create a Databricks job, and execute it.
        - The pipeline's tasks will run in the order set by their `depends_on` entries in the configuration.

    2. **Monitor the Pipeline Status**:
        - Once the pipeline finishes executing, a **`run_page_url`** link will be generated.
        - This link redirects you to the Databricks job run page, where you can see the result of the pipeline execution.

    3. **Review Task Outputs**:
        - The function will display a summary of all tasks in the pipeline.
        - For each task, it will include:
            - **Task Key**: The name of the task.
            - **Notebook Path**: The path of the notebook executed for the task.
            - **State**: Indicates whether the task succeeded, failed, or was skipped.
            - **Error Message**: If applicable, displays error details for failed tasks.

    4. **Handle Failures**:
        - If any task fails, detailed error messages will be printed to help you identify the root cause.
        - The output will include failure details from the **failure_output.json** file if available.
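
The per-task summary described above can be sketched as a small function over the parsed run JSON. The field names mirror those read by the lab's helper functions. The `summarize_tasks` name and the sample run document are illustrative assumptions, not real CLI output.

```python
def summarize_tasks(run_data):
    """Return one summary dict per task in a job-run JSON document."""
    summaries = []
    for task in run_data.get("tasks", []):
        state = task.get("state", {})
        summaries.append({
            "task_key": task.get("task_key", "Unknown Task"),
            # Tasks that are not notebook tasks have no notebook path.
            "notebook_path": task.get("notebook_task", {}).get("notebook_path"),
            "state": state.get("result_state", "UNKNOWN"),
            "error_message": state.get("state_message", ""),
        })
    return summaries

# Made-up run document for illustration only.
sample_run = {
    "run_page_url": "https://example.invalid/run/1",
    "tasks": [
        {"task_key": "data_cleaning",
         "notebook_task": {"notebook_path": "/example/data_cleaning"},
         "state": {"result_state": "SUCCESS"}},
        {"task_key": "failure_handling",
         "state": {"result_state": "EXCLUDED"}},
    ],
}

for row in summarize_tasks(sample_run):
    print(f"{row['task_key']}: {row['state']} ({row['notebook_path']})")
```

Returning plain dictionaries keeps the parsing separate from the printing, so the same summary can be logged or written to a file.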

In [0]:
## Function to extract and print the failure output file contents
def print_failure_output(failure_output_file):
    """
    Reads and prints the contents of the failure_output.json file if it exists.
    """
    if os.path.exists(failure_output_file):
        print("\nReading failure output file...\n")
        try:
            with open(failure_output_file, "r") as f:
                failure_output = json.load(f)
                print("\n=======\nOutput of Final Task (Failure Details):\n")
                print(json.dumps(failure_output, indent=4))
        except Exception as e:
            print(f"Error reading failure output file: {e}")
    else:
        print("No failure output file found.")


## Function to extract and print task failure details
def extract_failed_task_details(run_job_output, failure_output_file):
    """
    Parses the job output JSON to locate and print details about failed tasks, including 'failure_handling'.
    """
    try:
        run_output_json = json.loads(run_job_output)
        tasks = run_output_json.get("tasks", [])
        for task in tasks:
            task_key = task.get("task_key")
            state = task.get("state", {})
            state_message = state.get("state_message", "No state message available.")
            result_state = state.get("result_state", "Unknown")

            print(f"\nTask Key: {task_key}")
            print(f"Result State: {result_state}")
            print(f"State Message: {state_message}")

            if task_key == "failure_handling":
                if result_state == "SUCCESS":
                    print("\n=== Failure Handling Task Output ===")
                    notebook_output = task.get("notebook_output", "No output available.")
                    print(f"Notebook Output:\n{notebook_output}")
                    print_failure_output(failure_output_file)
                elif result_state == "EXCLUDED":
                    print("Task 'failure_handling' was excluded. Skipping failure output file reading.")
    except Exception as e:
        print(f"Error parsing failed task details: {e}")


## Function to extract and print final task output
def extract_and_print_final_task_output(run_page_url, tasks, failure_output_file):
    try:
        print("\n=== Final Task Details ===")
        print(f"Run Page URL: {run_page_url}\n")
        for task in tasks:
            task_key = task.get("task_key", "Unknown Task")
            state = task.get("state", {}).get("result_state", "Unknown State")
            notebook_path = task.get("notebook_task", {}).get("notebook_path", "No Notebook Path")
            error_message = task.get("state", {}).get("state_message", "")

            print(f"Task Key: {task_key}")
            print(f"Notebook Path: {notebook_path}")
            print(f"State: {state}")
            if error_message:
                print(f"Error Message: {error_message}")
            print("====================\n")

            if task_key == "failure_handling" and notebook_path == "No Notebook Path" and state == "EXCLUDED":
                print("Feature Engineering and Model Training was successful.\n")
            elif task_key == "failure_handling" and state == "SUCCESS":
                print_failure_output(failure_output_file)
    except Exception as e:
        print(f"Error extracting task output: {e}")

## Function to run the pipeline and update the version
def run_pipeline_and_update_version():
    """
    Create and run the job with the Databricks CLI, then bump the minor version (the bump happens whether or not the run succeeded).
    """
    try:
        print(f"Running pipeline using workflow config: {pipeline_config_file}")

        ## Create the Databricks job
        create_job_cmd = f"databricks jobs create --json @{pipeline_config_file}"
        job_creation_output = subprocess.getoutput(create_job_cmd)
        print(f"Job creation output: {job_creation_output}")

        ## Parse job creation output
        job_data = json.loads(job_creation_output)
        job_id = job_data.get("job_id")
        if not job_id:
            raise ValueError(f"Failed to create job. Output: {job_creation_output}")
        print(f"Job ID: {job_id}")

        ## Run the created job
        run_job_cmd = f"databricks jobs run-now {job_id}"
        run_job_output = subprocess.getoutput(run_job_cmd)
        print(f"Job run output: {run_job_output}")

        ## Parse the job run output
        job_run_data = json.loads(run_job_output)
        result_state = job_run_data.get("state", {}).get("result_state", "UNKNOWN")
        run_page_url = job_run_data.get("run_page_url", "No Run Page URL")
        tasks = job_run_data.get("tasks", [])
        print(f"Run Page URL: {run_page_url}")

        if result_state == "SUCCESS":
            print("Pipeline ran successfully.")
        else:
            print("Pipeline run failed.")
            extract_failed_task_details(run_job_output, failure_output_file)

        ## Extract final task output
        extract_and_print_final_task_output(run_page_url, tasks, failure_output_file)

        ## Update the version file
        version_data = {"version": "1.0.0"}
        if os.path.exists(version_file):
            with open(version_file, "r") as f:
                version_data = json.load(f)

        old_version = version_data["version"]
        major, minor, patch = map(int, old_version.split("."))
        version_data["version"] = f"{major}.{minor + 1}.0"

        with open(version_file, "w") as f:
            json.dump(version_data, f, indent=4)
        print(f"Version updated: {old_version} -> {version_data['version']}")
    
    except json.JSONDecodeError:
        print("Failed to parse job creation or run output. The response is not a valid JSON.")
    except Exception as e:
        print(f"Error during pipeline execution or version update: {e}")

    try:
        commit_and_push_changes()
    except Exception as e:
        print(f"Error during commit and push changes: {e}")


## Run the pipeline
run_pipeline_and_update_version()

In [0]:
%skip
## Function to extract and print the failure output file contents
def print_failure_output(failure_output_file):
    """
    Reads and prints the contents of the failure_output.json file if it exists.
    """
    if os.path.exists(failure_output_file):
        print("\nReading failure output file...\n")
        try:
            with open(failure_output_file, "r") as f:
                failure_output = json.load(f)
                print("\n=======\nOutput of Final Task (Failure Details):\n")
                print(json.dumps(failure_output, indent=4))
        except Exception as e:
            print(f"Error reading failure output file: {e}")
    else:
        print("No failure output file found.")
## Function to extract and print task failure details
def extract_failed_task_details(run_job_output, failure_output_file):
    """
    Parses the job output JSON to locate and print details about failed tasks, including 'failure_handling'.
    """
    try:
        run_output_json = json.loads(run_job_output)
        tasks = run_output_json.get("tasks", [])
        for task in tasks:
            task_key = task.get("task_key")
            state = task.get("state", {})
            state_message = state.get("state_message", "No state message available.")
            result_state = state.get("result_state", "Unknown")

            print(f"\nTask Key: {task_key}")
            print(f"Result State: {result_state}")
            print(f"State Message: {state_message}")

            if task_key == "failure_handling":
                if result_state == "SUCCESS":
                    print("\n=== Failure Handling Task Output ===")
                    notebook_output = task.get("notebook_output", "No output available.")
                    print(f"Notebook Output:\n{notebook_output}")
                    print_failure_output(failure_output_file)
                elif result_state == "EXCLUDED":
                    print("Task 'failure_handling' was excluded. Skipping failure output file reading.")
    except Exception as e:
        print(f"Error parsing failed task details: {e}")


## Function to extract and print final task output
def extract_and_print_final_task_output(run_page_url, tasks, failure_output_file):
    try:
        print("\n=== Final Task Details ===")
        print(f"Run Page URL: {run_page_url}\n")
        for task in tasks:
            task_key = task.get("task_key", "Unknown Task")
            state = task.get("state", {}).get("result_state", "Unknown State")
            notebook_path = task.get("notebook_task", {}).get("notebook_path", "No Notebook Path")
            error_message = task.get("state", {}).get("state_message", "")

            print(f"Task Key: {task_key}")
            print(f"Notebook Path: {notebook_path}")
            print(f"State: {state}")
            if error_message:
                print(f"Error Message: {error_message}")
            print("====================\n")

            if task_key == "failure_handling" and notebook_path == "No Notebook Path" and state == "EXCLUDED":
                print("Feature Engineering and Model Training was successful.\n")
            elif task_key == "failure_handling" and state == "SUCCESS":
                print_failure_output(failure_output_file)
    except Exception as e:
        print(f"Error extracting task output: {e}")

## Function to run the pipeline and update the version
def run_pipeline_and_update_version():
    """
    Create and run the job with the Databricks CLI, then bump the minor version (the bump happens whether or not the run succeeded).
    """
    try:
        print(f"Running pipeline using workflow config: {pipeline_config_file}")

        ## Create the Databricks job
        create_job_cmd = f"databricks jobs create --json @{pipeline_config_file}"
        job_creation_output = subprocess.getoutput(create_job_cmd)
        print(f"Job creation output: {job_creation_output}")

        ## Parse job creation output
        job_data = json.loads(job_creation_output)
        job_id = job_data.get("job_id")
        if not job_id:
            raise ValueError(f"Failed to create job. Output: {job_creation_output}")
        print(f"Job ID: {job_id}")

        ## Run the created job
        run_job_cmd = f"databricks jobs run-now {job_id}"
        run_job_output = subprocess.getoutput(run_job_cmd)
        print(f"Job run output: {run_job_output}")

        ## Parse the job run output
        job_run_data = json.loads(run_job_output)
        result_state = job_run_data.get("state", {}).get("result_state", "UNKNOWN")
        run_page_url = job_run_data.get("run_page_url", "No Run Page URL")
        tasks = job_run_data.get("tasks", [])
        print(f"Run Page URL: {run_page_url}")

        if result_state == "SUCCESS":
            print("Pipeline ran successfully.")
        else:
            print("Pipeline run failed.")
            extract_failed_task_details(run_job_output, failure_output_file)

        ## Extract final task output
        extract_and_print_final_task_output(run_page_url, tasks, failure_output_file)

        ## Update the version file
        version_data = {"version": "1.0.0"}
        if os.path.exists(version_file):
            with open(version_file, "r") as f:
                version_data = json.load(f)

        old_version = version_data["version"]
        major, minor, patch = map(int, old_version.split("."))
        version_data["version"] = f"{major}.{minor + 1}.0"

        with open(version_file, "w") as f:
            json.dump(version_data, f, indent=4)
        print(f"Version updated: {old_version} -> {version_data['version']}")
    
    except json.JSONDecodeError:
        print("Failed to parse job creation or run output. The response is not a valid JSON.")
    except Exception as e:
        print(f"Error during pipeline execution or version update: {e}")

    try:
        commit_and_push_changes()
    except Exception as e:
        print(f"Error during commit and push changes: {e}")


## Run the pipeline
run_pipeline_and_update_version()

### 3.2. Check Pipeline Status
- Click the generated `run_page_url` link after executing the pipeline function.
- Review the status of each task in the pipeline:
  - If all tasks succeeded, the pipeline has executed successfully.
  - If any task failed, follow the troubleshooting steps in the `task_failed.py` output.
- **Check for Failures:**
  - If a task fails during execution:
      - Locate the failed task in the output or on the Databricks job run page.
      - Investigate the error by reviewing the logs available in the Databricks job run page for the failed task.
      - Use the provided error message or logs to identify the root cause of the failure.
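
Locating failed tasks in the run output can be sketched as a filter over the task list. This sketch assumes that `SUCCESS` and `EXCLUDED` are the only result states that need no attention. The `find_failed_tasks` name and the sample data are hypothetical.

```python
# Assumption: any result state outside this set deserves a closer look.
NON_FAILURE_STATES = {"SUCCESS", "EXCLUDED"}

def find_failed_tasks(run_data):
    """Return (task_key, result_state, state_message) for tasks that did not succeed."""
    failed = []
    for task in run_data.get("tasks", []):
        state = task.get("state", {})
        result_state = state.get("result_state", "UNKNOWN")
        if result_state not in NON_FAILURE_STATES:
            failed.append(
                (task.get("task_key"), result_state, state.get("state_message", ""))
            )
    return failed

# Made-up run document for illustration only.
sample_run = {
    "tasks": [
        {"task_key": "data_cleaning", "state": {"result_state": "SUCCESS"}},
        {"task_key": "feature_engineering",
         "state": {"result_state": "FAILED", "state_message": "Example error text"}},
        {"task_key": "model_evaluation", "state": {"result_state": "UPSTREAM_FAILED"}},
        {"task_key": "failure_handling", "state": {"result_state": "EXCLUDED"}},
    ]
}

for task_key, result_state, message in find_failed_tasks(sample_run):
    print(f"{task_key}: {result_state} {message}".rstrip())
```

The first entry in the returned list is usually the best starting point, since later tasks often fail only because an upstream task did.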

### 3.3. Fix Errors and Re-Run the Pipeline

If the pipeline fails during execution, follow these steps to troubleshoot, apply fixes, and re-run the pipeline.

**Steps to Fix and Re-Run the Pipeline:**

1. **Identify the Issue**:
    - Review the **Output of Final Task (Failure Details)** section printed in the output.
    - Look for error messages or troubleshooting options associated with the failed task.
    - Navigate to **Jobs & Pipelines** in the **left-side menu bar** and look for your recently created job to review the job run details.
    - Use the job logs on the Databricks job run page to identify the cause of the failure.

2. **Apply Fixes**:
    - Open the notebook associated with the failed task.
    - Use the error details and logs to debug the issue.
    - Correct the error in the notebook or the workflow configuration file.

3. **Re-Run the Task**:
    - After applying fixes, use the provided function to re-run the pipeline:

      `rerun_pipeline()`

    - This function will re-trigger the pipeline execution, ensuring the latest changes are included.

4. **Commit Changes to Git Repository**:
    - Once the pipeline executes successfully:
        - Commit the updated notebooks or workflow configuration files to the Git repository for version control.
        - Use the `commit_and_push_changes()` function to push the changes to the repository.

---

**Tips for Troubleshooting**

- **Check the Failure Details**: 
  - Look for the **failure_output.json** file referenced in the logs for additional details.
  - The output of the `failure_handling` task will provide insights into specific errors.

- **Validate Changes**:
  - Ensure all dependent tasks are updated and consistent with the applied fixes.
  - Verify notebook paths and parameters in the workflow configuration.

- **Monitor Progress**:
  - Use the `run_page_url` generated after completion of the pipeline to see the status of each task.
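
To verify notebook paths and parameters before re-running, the saved configuration can be loaded and listed task by task. A minimal sketch, assuming a hypothetical helper `list_task_targets` and a made-up sample configuration:

```python
import json

def list_task_targets(config_text):
    """Map each notebook task's task_key to its notebook path and parameters."""
    config = json.loads(config_text)
    targets = {}
    for task in config.get("tasks", []):
        notebook_task = task.get("notebook_task")
        if notebook_task is None:
            continue  # skip condition_task and spark_python_task entries
        targets[task["task_key"]] = {
            "notebook_path": notebook_task.get("notebook_path"),
            "base_parameters": notebook_task.get("base_parameters", {}),
        }
    return targets

# Made-up configuration for illustration only.
sample_config_text = json.dumps({
    "tasks": [
        {"task_key": "data_cleaning",
         "notebook_task": {"notebook_path": "/example/data_cleaning"}},
        {"task_key": "conditional_execution",
         "condition_task": {"op": "EQUAL_TO", "left": "a", "right": "a"}},
        {"task_key": "model_evaluation",
         "notebook_task": {"notebook_path": "/example/model_evaluation",
                           "base_parameters": {"max_depth": "5"}}},
    ]
})

for task_key, target in list_task_targets(sample_config_text).items():
    print(task_key, target["notebook_path"], target["base_parameters"])
```

In the lab, the same function could be applied to the text read from the saved workflow configuration file.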

In [0]:
## Function to commit changes to Git
def commit_and_push_changes():
    try:
        os.chdir(local_git_repo_path)
        <FILL_IN>

        ## Check for changes before committing
        result = subprocess.getoutput("git status --porcelain")
        if result.strip():  # If there are changes to commit
            <FILL_IN>
            
            ## Push changes to the 'main' branch (or any valid branch)
            branch_name = <FILL_IN>  # Change this to your target branch if it's not 'main'
            push_result = <FILL_IN>
            if "error" in push_result.lower():
                raise Exception(push_result)
            
            print(f"Changes committed and pushed to Git branch: {branch_name} successfully.")
        else:
            print("No changes to commit. Working tree is clean.")
    except subprocess.CalledProcessError as e:
        print(f"Git error: {e.stderr or e}")
    except Exception as e:
        print(f"Error during Git operations: {e}")
## Function to re-run the pipeline
def rerun_pipeline():
    print("Re-running the pipeline...")
    try:
        ## Run the pipeline and update the version
        <FILL_IN>

        ## Commit and push changes
        <FILL_IN>
    except Exception as e:
        print(f"Error during pipeline re-run: {e}")

## Re-run the pipeline
<FILL_IN>

In [0]:
%skip
## Function to commit changes to Git
def commit_and_push_changes():
    try:
        os.chdir(local_git_repo_path)
        subprocess.run("git add .", shell=True, check=True)

        ## Check for changes before committing
        result = subprocess.getoutput("git status --porcelain")
        if result.strip():  # If there are changes to commit
            subprocess.run('git commit -m "Fixed errors and updated notebooks"', shell=True, check=True)
            
            ## Push changes to the 'main' branch (or any valid branch)
            branch_name = "main"  # Change this to your target branch if it's not 'main'
            push_result = subprocess.getoutput(f"git push origin {branch_name}")
            if "error" in push_result.lower():
                raise Exception(push_result)
            
            print(f"Changes committed and pushed to Git branch: {branch_name} successfully.")
        else:
            print("No changes to commit. Working tree is clean.")
    except subprocess.CalledProcessError as e:
        print(f"Git error: {e.stderr or e}")
    except Exception as e:
        print(f"Error during Git operations: {e}")
## Function to re-run the pipeline
def rerun_pipeline():
    print("Re-running the pipeline...")
    try:
        ## Run the pipeline and update the version
        run_pipeline_and_update_version()

        ## Commit and push changes
        commit_and_push_changes()
    except Exception as e:
        print(f"Error during pipeline re-run: {e}")

## Re-run the pipeline
rerun_pipeline()

## Task 4 - Displaying the Final Git Folder Structure

This task helps you visualize the structure of your Git repository after executing the pipeline. By inspecting the folder hierarchy, you can confirm the organization of files and directories, ensuring that all outputs are saved correctly.

**Instructions:**

1. **Repository Path Setup**:
   - The function `print_git_folder_structure()` accepts `local_git_repo_path` as an input, which points to the root directory of your local Git repository.
   - Ensure that the repository has been cloned to your Databricks environment.

2. **Traverse the Repository**:
   - The function uses the `os.walk()` method to traverse through all directories and subdirectories.
   - Files and folders are identified at each level.

3. **Print the Hierarchical Structure**:
   - Each directory name is displayed with an appropriate indentation to represent its level in the hierarchy.
   - Files within each directory are listed with further indentation.

4. **Run the Function**:
   - Execute the provided Python function to print the folder structure of the Git repository in a clear, hierarchical format.
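
The traversal described above can also be written to return lines instead of printing them, and to skip Git's internal `.git` folder. The sketch below is a variation, not the lab's solution. The function name `folder_structure_lines`, the `skip_dirs` parameter, and the temporary example tree are all assumptions made for illustration.

```python
import os
import tempfile

def folder_structure_lines(repo_path, skip_dirs=(".git",)):
    """Return the folder tree under repo_path as a list of indented lines."""
    lines = []
    for root, dirs, files in os.walk(repo_path):
        # Editing dirs in place tells os.walk not to descend into skipped folders.
        dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
        rel = os.path.relpath(root, repo_path)
        level = 0 if rel == "." else rel.count(os.sep) + 1
        name = os.path.basename(os.path.normpath(root))
        lines.append(f"{'    ' * level}{name}/")
        for file_name in sorted(files):
            lines.append(f"{'    ' * (level + 1)}{file_name}")
    return lines

# Build a throwaway tree to demonstrate; the names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "example_repo")
    os.makedirs(os.path.join(repo, "config"))
    os.makedirs(os.path.join(repo, ".git"))
    open(os.path.join(repo, "README.md"), "w").close()
    open(os.path.join(repo, "config", "workflow.json"), "w").close()
    tree_lines = folder_structure_lines(repo)

print("\n".join(tree_lines))
```

Using `os.path.relpath` to compute the depth avoids miscounting when the repository path ends with a trailing separator.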

In [0]:
## Display Git folder structure
def print_git_folder_structure(<FILL_IN>):
    for root, dirs, files in os.walk(<FILL_IN>):
        <FILL_IN>
        for f in files:
            <FILL_IN>

print_git_folder_structure(<FILL_IN>)

In [0]:
%skip
## Display Git folder structure
def print_git_folder_structure(local_git_repo_path):
    for root, dirs, files in os.walk(local_git_repo_path):
        level = root.replace(local_git_repo_path, '').count(os.sep)
        indent = ' ' * 4 * level
        print(f"{indent}{os.path.basename(root)}/")
        sub_indent = ' ' * 4 * (level + 1)
        for f in files:
            print(f"{sub_indent}{f}")

print_git_folder_structure(local_git_repo_path)

## Conclusion

This lab provided hands-on experience in setting up and executing a CI/CD pipeline for Databricks notebooks. You combined automated validation with version control by connecting a Git repository to Databricks workflows, and you practiced diagnosing a failed run, fixing it, and re-running the pipeline.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>