mysticrenji/databricks

ML Data Pipeline with Databricks Asset Bundles and Azure DevOps

This project provides a CI/CD framework for managing Machine Learning workloads on Databricks. It uses Databricks Asset Bundles (DABs) for resource management, Azure Pipelines for automated testing and deployment, and Apache Airflow for orchestration.

🏗 Architecture Overview

  • Compute: Azure Databricks.
  • Resource Management: Databricks Asset Bundles (DABs).
  • CI/CD: Azure Pipelines with ephemeral Kubernetes-based agents.
  • Orchestration: Apache Airflow (running on Kubernetes).
  • Quality Gate: SonarQube.
  • Artifacts: Python Wheels hosted on Azure Artifacts.

📁 Project Structure

.
├── azure-pipelines.yml      # CI/CD pipeline definition
├── databricks.yml           # DAB configuration (Dev/Prod targets)
├── setup.py                 # Python package configuration
├── dags/                    # Airflow DAGs
│   └── train_model_dag.py   # Model training orchestration
├── infra/                   # Kubernetes deployment manifests
│   ├── azure-devops-agents/ # KEDA-scaled ephemeral build agents
│   │   ├── deploy.sh
│   │   ├── 01-namespace.yaml
│   │   ├── 02-secret.yaml
│   │   ├── 03-trigger-auth.yaml
│   │   └── 04-scaledjob.yaml
│   ├── airflow/             # Airflow Helm deployment with git-sync
│   │   ├── deploy.sh
│   │   ├── 01-namespace.yaml
│   │   ├── 02-secrets.yaml
│   │   └── 03-helm-values.yaml
│   └── README.md            # Detailed deployment guide
├── src/                     # Core logic (Python package)
│   └── example_package/
└── tests/                   # Unit and integration tests

🚀 Getting Started

Prerequisites

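  1. Databricks CLI: Installed and configured. The bundle commands used below require the current Databricks CLI; the legacy Python-based databricks-cli package does not provide them.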
  2. Python 3.10+: For local development and testing.
  3. Azure DevOps: Access to the project and agent pools.

Local Development

  1. Clone the repository:

    git clone <repo-url>
    cd databricks
  2. Install dependencies:

    pip install -e .
    pip install pytest ruff
  3. Run tests:

    pytest tests/
  4. Validate DAB Configuration:

    databricks configure --host "https://<Instance>" --token
    databricks bundle validate -t dev
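The validation step can also be scripted so that both targets are checked in one pass, which is what the CI stage is described as doing. The following is a minimal sketch, not the pipeline's actual implementation: the function names and the injectable `runner` parameter are hypothetical, and only the `databricks bundle validate -t <target>` command itself comes from this README.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence


def build_validate_command(target: str) -> List[str]:
    """Return the CLI invocation that validates one bundle target."""
    return ["databricks", "bundle", "validate", "-t", target]


def validate_targets(
    targets: Iterable[str] = ("dev", "prod"),
    runner: Callable[[Sequence[str]], object] = subprocess.run,
) -> List[str]:
    """Validate each target and return the ones whose validation failed.

    `runner` defaults to subprocess.run; it is a parameter (an assumption of
    this sketch) so the logic can be exercised without the CLI installed.
    """
    failed = []
    for target in targets:
        result = runner(build_validate_command(target))
        # A non-zero exit code from the CLI is treated as a failed validation.
        if result.returncode != 0:
            failed.append(target)
    return failed
```

Calling `validate_targets()` with the Databricks CLI on the PATH runs the same command as above for dev and then prod; an empty list means both targets validated.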

🔄 CI/CD Workflow

The project follows a branch-based deployment strategy:

  1. Continuous Integration (CI):

    • Triggered on develop and main branches.
    • Runs linting (ruff) and unit tests (pytest).
    • Performs SonarQube code analysis.
    • Validates Databricks Bundles for both dev and prod targets.
    • Packages the code as a Python Wheel and uploads it to Azure Artifacts.
  2. Continuous Deployment (CD):

    • Development: Deployed to Databricks automatically when code is merged into the develop branch.
    • Production: Deployment to Databricks is triggered when code is merged into the main branch, but the stage waits for manual approval in Azure DevOps Environments before it runs.
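The branch rules above amount to a small lookup from Git ref to bundle target. The sketch below illustrates that mapping only; the actual logic lives in azure-pipelines.yml, and the `Deployment` type and `deployment_for_ref` function are hypothetical names introduced here. It assumes the ref arrives in the `refs/heads/<branch>` form that Azure Pipelines exposes as `Build.SourceBranch`.

```python
from typing import NamedTuple, Optional


class Deployment(NamedTuple):
    target: str           # DAB target, as passed to `databricks bundle deploy -t`
    needs_approval: bool  # True when a manual approval precedes the deploy


# Branch rules as described in this README.
BRANCH_RULES = {
    "develop": Deployment(target="dev", needs_approval=False),
    "main": Deployment(target="prod", needs_approval=True),
}


def deployment_for_ref(source_ref: str) -> Optional[Deployment]:
    """Map a Git ref such as 'refs/heads/main' to its deployment rule.

    Returns None for branches that are not deployed anywhere.
    """
    prefix = "refs/heads/"
    if source_ref.startswith(prefix):
        branch = source_ref[len(prefix):]
    else:
        branch = source_ref
    return BRANCH_RULES.get(branch)
```

Any other branch, for example a feature branch, maps to `None`, so it is never deployed.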

🎼 Orchestration (Airflow)

The train_model_orchestration DAG in dags/ is environment-aware. It reads an Airflow Variable named env (defaulting to dev) to determine which Databricks Job to trigger.

  • Dev: Triggers "[DEV] Train Model"
  • Prod: Triggers "[PROD] Train Model"

It uses the DatabricksRunNowOperator and references jobs by name rather than by job ID, so the DAG stays stable across bundle redeployments.
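The environment switch can be sketched as a plain lookup that produces the operator's arguments. This is an illustration and not the contents of train_model_dag.py: the two job names come from this README, while the helper names, the `train_model` task id, and the `databricks_default` connection id are assumptions. The block avoids importing Airflow so that it runs on its own.

```python
# Job names as listed in this README, keyed by the value of the `env` Variable.
JOB_NAMES = {
    "dev": "[DEV] Train Model",
    "prod": "[PROD] Train Model",
}


def job_name_for_env(env: str = "dev") -> str:
    """Return the Databricks job name for an environment label."""
    key = env.strip().lower()
    if key not in JOB_NAMES:
        raise ValueError(f"Unknown env {env!r}; expected one of {sorted(JOB_NAMES)}")
    return JOB_NAMES[key]


def run_now_kwargs(env: str = "dev") -> dict:
    """Build keyword arguments intended for DatabricksRunNowOperator.

    The job is referenced by `job_name`, not `job_id`, matching the approach
    described above. The task id and connection id are assumed values.
    """
    return {
        "task_id": "train_model",                    # hypothetical task id
        "databricks_conn_id": "databricks_default",  # assumed connection id
        "job_name": job_name_for_env(env),
    }
```

Inside a DAG file, one way to use this would be to read the variable with `Variable.get("env", default_var="dev")` and pass the result through as `DatabricksRunNowOperator(**run_now_kwargs(env))`. Failing on an unknown value is a deliberate choice in this sketch: a mistyped variable raises an error instead of silently triggering the dev job.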

☸️ Infrastructure

The infrastructure components are designed to run on Kubernetes. See infra/README.md for detailed deployment instructions.

  • Airflow: Configured with KubernetesExecutor for scalable task execution. DAGs are synced via git-sync.
  • Build Agents: Uses KEDA (Kubernetes Event-driven Autoscaling) to spin up Azure Pipelines agents on demand in the k8s-ephemeral-pool agent pool.

Deploy both components:

./infra/azure-devops-agents/deploy.sh
./infra/airflow/deploy.sh
