# CI/CD and DevOps for Databricks

Implement version control, automated testing, and deployment automation for data pipelines.

## What You'll Learn

✅ Integrate Git with Databricks Repos  
✅ Create Databricks Asset Bundles  
✅ Set up CI/CD pipelines  
✅ Deploy across dev/staging/prod  
✅ Manage configurations and secrets  

**References:**
- [CI/CD Best Practices](https://docs.databricks.com/aws/en/dev-tools/ci-cd/)
- [Asset Bundles](https://docs.databricks.com/aws/en/dev-tools/bundles/)
- [Repos](https://docs.databricks.com/aws/en/repos/)

## 1. DevOps Best Practices

### Environment Strategy

```
Development → Staging → Production
```

**Dev Environment:**
- Individual developer workspaces
- Rapid iteration
- Sample datasets

**Staging:**
- Pre-production testing
- Full dataset
- Performance validation

**Production:**
- Live workloads
- Monitoring and alerts
- Change control

### Git Workflow

**Branching Strategy:**
```
main (production)
  ↑
staging
  ↑
develop
  ↑
feature/add-new-pipeline
```

**Process:**
1. Create feature branch
2. Develop and test locally
3. Create PR to develop
4. Review and merge
5. Deploy to staging
6. Validate
7. Promote to production
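
The promotion flow above amounts to a mapping from branch to deployment target. A minimal sketch of that mapping follows; `BRANCH_TO_TARGET` and `target_for_ref` are hypothetical names, and pairing `develop` with a `dev` target is an assumption rather than something this guide prescribes.

```python
from typing import Optional

# Assumed branch-to-target mapping; adjust it to your own branching model.
BRANCH_TO_TARGET = {
    "develop": "dev",
    "staging": "staging",
    "main": "prod",
}


def target_for_ref(git_ref: str) -> Optional[str]:
    """Return the target for a ref such as 'refs/heads/main', or None if it should not deploy."""
    prefix = "refs/heads/"
    branch = git_ref[len(prefix):] if git_ref.startswith(prefix) else git_ref
    return BRANCH_TO_TARGET.get(branch)


if __name__ == "__main__":
    for ref in ("refs/heads/main", "staging", "refs/heads/feature/add-new-pipeline"):
        print(ref, "->", target_for_ref(ref))
```

Feature branches map to `None`, so they are tested but never deployed automatically.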

---

## 2. Databricks Repos

### Setup Git Integration

**Step 1: Connect Git Provider**
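1. Settings → Linked accounts (shown as User Settings → Git Integration in older workspaces)
2. Add a GitHub/GitLab personal access token
3. Use HTTPS repository URLs; Databricks Repos does not support SSH authentication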

**Step 2: Clone Repository**
```
Repos → Add Repo
URL: https://github.com/your-org/iot-pipeline
Branch: main
Path: /Repos/your-user/iot-pipeline
```

**Step 3: Development Workflow**
```bash
# Create feature branch
git checkout -b feature/new-transformation

# Make changes in Databricks
# Save notebooks

# Commit from UI or CLI
git add .
git commit -m "Add temperature normalization"
git push origin feature/new-transformation

# Create PR on GitHub/GitLab
```

---

## 3. Databricks Asset Bundles

### What are Bundles?

**Asset Bundles** let you define workspace resources as code, including:
- Notebooks
- Jobs
- Pipelines
- Dashboards
- ML models
- Permissions

### Bundle Structure

```
iot-pipeline/
├── databricks.yml       # Main configuration
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── experiments.yml
├── src/
│   ├── pipelines/
│   └── notebooks/
└── tests/
    └── integration/
```

### Example databricks.yml

```yaml
bundle:
  name: iot-pipeline
  
targets:
  dev:
    mode: development
    workspace:
      host: https://dev.cloud.databricks.com
    
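  # Needed by `databricks bundle deploy -t staging`; the host is a placeholder
  staging:
    workspace:
      host: https://staging.cloud.databricks.com

  prod:
    mode: production
    workspace:
      host: https://prod.cloud.databricks.com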

resources:
  jobs:
    iot_daily_pipeline:
      name: "IoT Daily Pipeline - ${bundle.target}"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ${workspace.file_path}/src/pipelines/ingest.py
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 2
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "America/Los_Angeles"
```
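
The `${bundle.target}` reference in the job name is expanded by the CLI at deploy time. The toy sketch below illustrates the idea only; it is not the CLI's implementation, and `resolve` with its flat `values` dict is a hypothetical helper. The real CLI does more, for example prefixing resource names in development mode.

```python
import re

# Matches placeholders such as ${bundle.target}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve(template: str, values: dict) -> str:
    """Replace ${dotted.name} placeholders from a flat dict; unknown names are left untouched."""
    def _substitute(match):
        return str(values.get(match.group(1), match.group(0)))

    return _PLACEHOLDER.sub(_substitute, template)


if __name__ == "__main__":
    template = "IoT Daily Pipeline - ${bundle.target}"
    for target in ("dev", "prod"):
        print(resolve(template, {"bundle.target": target}))
```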

### Deploy with Bundles

```bash
# Validate configuration
databricks bundle validate

# Deploy to dev
databricks bundle deploy -t dev

# Deploy to prod
databricks bundle deploy -t prod

# Run deployed job
databricks bundle run iot_daily_pipeline -t prod
```
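
If you prefer to drive these commands from a script, one possible shape is sketched below. It assumes the `databricks` CLI is on `PATH` and already authenticated; `bundle_commands` and `deploy` are hypothetical helper names, and `dry_run` only prints the commands.

```python
import subprocess
from typing import List


def bundle_commands(target: str) -> List[List[str]]:
    """Build the validate-then-deploy command sequence for one target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]


def deploy(target: str, dry_run: bool = False) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for cmd in bundle_commands(target):
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    deploy("dev", dry_run=True)
```

Validating against the same target before deploying means a broken configuration fails before any resource is changed.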

---

## 4. CI/CD Pipeline

### GitHub Actions Example

**.github/workflows/deploy.yml:**
```yaml
name: Deploy Databricks Pipeline

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          # Databricks Runtime 13.3 LTS (used in databricks.yml) runs Python 3.10
          python-version: '3.10'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest
      
      - name: Run tests
        run: pytest tests/
  
  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    
    steps:
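      - uses: actions/checkout@v4
      
      # The legacy `databricks-cli` pip package has no `bundle` command;
      # bundles require the newer Databricks CLI.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main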
      
      - name: Deploy to Staging
        if: github.ref == 'refs/heads/staging'
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_STAGING_TOKEN }}
        run: |
          databricks bundle deploy -t staging
      
      - name: Deploy to Production
        if: github.ref == 'refs/heads/main'
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
        run: |
          databricks bundle deploy -t prod
```

### Automated Testing

**Unit Tests:**
```python
# tests/test_transformations.py
import pytest
from src.pipelines import transformations

def test_temperature_conversion():
    result = transformations.fahrenheit_to_celsius(32)
    # Compare floats with a tolerance rather than exact equality
    assert result == pytest.approx(0.0)

def test_anomaly_detection():
    data = [60, 65, 70, 95, 68]  # 95 is anomaly
    anomalies = transformations.detect_anomalies(data, threshold=2)
    assert 95 in anomalies
```
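
The guide does not show the module under test, so here is a minimal sketch of what a hypothetical `src/pipelines/transformations.py` might contain to satisfy those tests. Treating `threshold` as a cutoff on a median-based (modified) z-score is an assumption. A plain mean and standard deviation z-score would give 95 a score of about 1.9 in that five-value sample, so it would not be flagged at `threshold=2`.

```python
from statistics import median
from typing import List, Sequence


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def detect_anomalies(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Return values whose modified z-score (based on median and MAD) exceeds threshold."""
    if not values:
        return []
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely influence
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


if __name__ == "__main__":
    print(fahrenheit_to_celsius(212))
    print(detect_anomalies([60, 65, 70, 95, 68], threshold=2))
```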

**Integration Tests:**
```python
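# tests/integration/test_pipeline.py
import pytest
from pyspark.sql import SparkSession

# Hypothetical module path; import run_pipeline from wherever your pipeline lives
from src.pipelines.pipeline import run_pipeline


@pytest.fixture(scope="session")
def spark():
    # Local Spark session shared by all integration tests
    session = SparkSession.builder.master("local[2]").appName("integration-tests").getOrCreate()
    yield session
    session.stop()


def test_full_pipeline(spark):
    # Load sample data; the fixture file is assumed to have a header row
    input_df = spark.read.csv("tests/fixtures/sample_sensors.csv", header=True, inferSchema=True)

    # Run pipeline
    result_df = run_pipeline(input_df)

    # Assertions
    assert result_df.count() > 0
    assert "temperature_celsius" in result_df.columns
    assert result_df.filter("temperature_celsius < -50").count() == 0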
```

---

## 5. Configuration Management

### Environment-Specific Config

**config/dev.yml:**
```yaml
catalog: dev_catalog
schema: iot_dev
warehouse_id: dev_warehouse
cluster_size: 2
data_path: /Volumes/dev_catalog/iot_dev/data
```

**config/prod.yml:**
```yaml
catalog: prod_catalog
schema: iot_prod
warehouse_id: prod_warehouse
cluster_size: 10
data_path: /Volumes/prod_catalog/iot_prod/data
```
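
Once one of these files is parsed into a dictionary (for example with PyYAML's `yaml.safe_load`, assumed to be installed), validating it early catches missing or misspelled keys before a job runs. The sketch below is one way to do that; `PipelineConfig`, `from_mapping`, and `table_prefix` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class PipelineConfig:
    catalog: str
    schema: str
    warehouse_id: str
    cluster_size: int
    data_path: str

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "PipelineConfig":
        """Build a config from a parsed YAML mapping, rejecting missing or unknown keys."""
        expected = {f.name for f in fields(cls)}
        missing = expected - set(raw)
        unknown = set(raw) - expected
        if missing or unknown:
            raise ValueError(f"missing keys: {sorted(missing)}, unknown keys: {sorted(unknown)}")
        return cls(**raw)

    @property
    def table_prefix(self) -> str:
        """Catalog and schema joined for use in fully qualified table names."""
        return f"{self.catalog}.{self.schema}"


if __name__ == "__main__":
    dev = PipelineConfig.from_mapping({
        "catalog": "dev_catalog",
        "schema": "iot_dev",
        "warehouse_id": "dev_warehouse",
        "cluster_size": 2,
        "data_path": "/Volumes/dev_catalog/iot_dev/data",
    })
    print(dev.table_prefix)
```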

### Secrets Management

```python
# Access secrets in notebooks
db_password = dbutils.secrets.get(scope="prod-secrets", key="database-password")
api_key = dbutils.secrets.get(scope="prod-secrets", key="api-key")

# Never hardcode secrets!
# ❌ password = "my-secret-password"
# ✅ password = dbutils.secrets.get("scope", "key")
```
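
Because `dbutils` exists only inside Databricks, code that reads secrets is awkward to exercise in local tests or CI. One possible workaround, sketched here under stated assumptions, is a small wrapper that falls back to environment variables; `get_secret` and the `SCOPE_KEY` naming convention for the variables are hypothetical, not a Databricks feature.

```python
import os
from typing import Any, Optional


def get_secret(scope: str, key: str, dbutils: Optional[Any] = None) -> str:
    """Read a secret from Databricks when dbutils is passed, otherwise from an environment variable."""
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    # Assumed convention: scope "prod-secrets" and key "api-key" -> PROD_SECRETS_API_KEY
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"secret not found: set {env_name} or pass dbutils")
    return value
```

In a notebook you would call `get_secret("prod-secrets", "api-key", dbutils)`; in CI the same call without `dbutils` reads `PROD_SECRETS_API_KEY`, which can be populated from the CI system's secret store.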

---

## Summary

✅ **Git Repos** - Version control for notebooks and code  
✅ **Asset Bundles** - Infrastructure as code  
✅ **CI/CD Pipelines** - Automated testing and deployment  
✅ **Multi-environment** - Dev, staging, production  
✅ **Configuration** - Environment-specific settings  

### Key Takeaways

1. **Use Git** for all code and notebooks
2. **Asset Bundles** for deployments
3. **Automated testing** before production
4. **Separate environments** with proper promotion
5. **Never commit secrets** to Git

---

**Additional Resources:**
- [CI/CD Guide](https://docs.databricks.com/aws/en/dev-tools/ci-cd/)
- [Bundles Tutorial](https://docs.databricks.com/aws/en/dev-tools/bundles/jobs-tutorial)
- [Repos Documentation](https://docs.databricks.com/aws/en/repos/)