ML Data Pipeline with Databricks Asset Bundles and Azure DevOps

This project provides a CI/CD framework for managing machine learning workloads on Databricks. It uses Databricks Asset Bundles (DABs) for resource management, Azure Pipelines for automated testing and deployment, and Apache Airflow for orchestration.

🏗 Architecture Overview

Compute: Azure Databricks.
Resource Management: Databricks Asset Bundles (DABs).
CI/CD: Azure Pipelines with ephemeral Kubernetes-based agents.
Orchestration: Apache Airflow (running on Kubernetes).
Quality Gate: SonarQube.
Artifacts: Python Wheels hosted on Azure Artifacts.

📁 Project Structure

.
├── azure-pipelines.yml      # CI/CD pipeline definition
├── databricks.yml           # DAB configuration (Dev/Prod targets)
├── setup.py                 # Python package configuration
├── dags/                    # Airflow DAGs
│   └── train_model_dag.py   # Model training orchestration
├── infra/                   # Kubernetes deployment manifests
│   ├── azure-devops-agents/ # KEDA-scaled ephemeral build agents
│   │   ├── deploy.sh
│   │   ├── 01-namespace.yaml
│   │   ├── 02-secret.yaml
│   │   ├── 03-trigger-auth.yaml
│   │   └── 04-scaledjob.yaml
│   ├── airflow/             # Airflow Helm deployment with git-sync
│   │   ├── deploy.sh
│   │   ├── 01-namespace.yaml
│   │   ├── 02-secrets.yaml
│   │   └── 03-helm-values.yaml
│   └── README.md            # Detailed deployment guide
├── src/                     # Core logic (Python package)
│   └── example_package/
└── tests/                   # Unit and integration tests

🚀 Getting Started

Prerequisites

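Databricks CLI: Installed and configured. The bundle commands require the newer Databricks CLI; the legacy databricks-cli pip package does not support them.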
Python 3.10+: For local development and testing.
Azure DevOps: Access to the project and agent pools.

Local Development

Clone the repository:
```
git clone <repo-url>
cd databricks-cicd
```

Install dependencies:

pip install -e .
pip install pytest ruff

Run tests:
```
pytest tests/
```

Validate DAB Configuration:

databricks configure --host "https://<Instance>" --token   
databricks bundle validate -t dev

🔄 CI/CD Workflow

The project follows a branch-based deployment strategy:

Continuous Integration (CI):
- Triggered on the develop and main branches.
- Runs linting (ruff) and unit tests (pytest).
- Performs SonarQube code analysis.
- Validates the Databricks bundle for both the dev and prod targets.
- Packages the code as a Python Wheel and uploads it to Azure Artifacts.

Continuous Deployment (CD):
- Development: Deploys automatically to Databricks when code is merged into the develop branch.
- Production: Deploys to Databricks when code is merged into the main branch, but only after manual approval in Azure DevOps Environments.
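The branch-to-target rule above can be sketched in a few lines of Python. This is an illustration only: the actual azure-pipelines.yml most likely expresses the same rule with stage conditions, and the names BRANCH_TARGETS, target_for_branch, and deploy_command are hypothetical. The sketch assumes the script runs inside an Azure Pipelines job, where Build.SourceBranch is exposed as the BUILD_SOURCEBRANCH environment variable. The manual approval for production is enforced by the Azure DevOps Environment, not by this script.

```python
import os
from typing import Optional

# Branch-to-target mapping taken from the workflow described above.
# Keys are full Git refs, which is how Azure Pipelines reports the branch.
BRANCH_TARGETS = {
    "refs/heads/develop": "dev",
    "refs/heads/main": "prod",
}


def target_for_branch(source_branch: str) -> Optional[str]:
    """Return the bundle target for a Git ref, or None if the branch is not deployed."""
    return BRANCH_TARGETS.get(source_branch)


def deploy_command(target: str) -> list[str]:
    """Build the Databricks CLI invocation that deploys the bundle to one target."""
    return ["databricks", "bundle", "deploy", "-t", target]


if __name__ == "__main__":
    # Falls back to an empty string when run outside a pipeline.
    branch = os.environ.get("BUILD_SOURCEBRANCH", "")
    target = target_for_branch(branch)
    if target is None:
        print(f"No deployment target for {branch!r}; skipping deploy.")
    else:
        # Print the command instead of running it, so the sketch is safe to try locally.
        print(" ".join(deploy_command(target)))
```

Keeping the mapping in one dictionary means that any branch not listed is simply never deployed, which matches the CI-only behavior for other branches.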

🎼 Orchestration (Airflow)

The train_model_orchestration DAG in dags/ is environment-aware. It uses an Airflow Variable env (defaulting to dev) to determine which Databricks Job to trigger.

Dev: Triggers "[DEV] Train Model"
Prod: Triggers "[PROD] Train Model"

The DAG uses the DatabricksRunNowOperator and references jobs by name rather than by job ID, so the reference stays valid if a bundle redeployment recreates a job under a new ID.
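A minimal sketch of how such an environment-aware DAG might look is shown below. It is not the contents of dags/train_model_dag.py. It assumes Airflow 2.4 or later with the apache-airflow-providers-databricks package installed, and a connection named databricks_default. The helper names job_name_for and build_dag, the task id, and the start date are hypothetical. The Airflow imports are deferred into build_dag so that the name mapping can be unit-tested without Airflow installed.

```python
from datetime import datetime

# Job names as listed above, keyed by the value of the Airflow Variable "env".
JOB_NAMES = {
    "dev": "[DEV] Train Model",
    "prod": "[PROD] Train Model",
}


def job_name_for(env: str) -> str:
    """Map an environment name to the Databricks job name to trigger."""
    try:
        return JOB_NAMES[env]
    except KeyError:
        raise ValueError(
            f"Unknown env {env!r}; expected one of {sorted(JOB_NAMES)}"
        ) from None


def build_dag():
    """Build the DAG; Airflow is imported here so the module loads without it."""
    from airflow import DAG
    from airflow.models import Variable
    from airflow.providers.databricks.operators.databricks import (
        DatabricksRunNowOperator,
    )

    # Airflow 2.x signature; the Variable defaults to "dev" when unset.
    env = Variable.get("env", default_var="dev")

    with DAG(
        dag_id="train_model_orchestration",
        start_date=datetime(2024, 1, 1),  # placeholder date
        schedule=None,  # assumption: triggered manually or by an external system
        catchup=False,
    ) as dag:
        DatabricksRunNowOperator(
            task_id="train_model",  # hypothetical task id
            databricks_conn_id="databricks_default",
            job_name=job_name_for(env),  # resolved by name, not by job ID
        )
    return dag
```

In a real DAG file, a module-level line such as dag = build_dag() would be needed so that the scheduler discovers the DAG. Raising on an unknown env value, rather than falling back silently, avoids triggering the wrong job when the Variable is mistyped.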

☸️ Infrastructure

The infrastructure components are designed to run on Kubernetes. See infra/README.md for detailed deployment instructions.

Airflow: Configured with KubernetesExecutor for scalable task execution. DAGs are synced via git-sync.
Build Agents: KEDA (Kubernetes Event-driven Autoscaling) spins up Azure Pipelines agents on demand in the k8s-ephemeral-pool.

Deploy both components:

./infra/azure-devops-agents/deploy.sh
./infra/airflow/deploy.sh
