# Practical 7: Kubernetes for Scalable Data Pipelines

## Goals

This practical session introduces Kubernetes (K8s), the industry-standard container orchestration platform. You will learn how to deploy, scale, and manage containerized data processing applications in a production-like environment.

### Learning Objectives
* Understand Kubernetes architecture and core concepts
* Deploy applications using Pods, Deployments, and Services
* Manage configuration with ConfigMaps and Secrets
* Implement persistent storage with PersistentVolumes
* Run batch and scheduled jobs with Jobs and CronJobs
* Scale applications automatically with Horizontal Pod Autoscaler
* Deploy Spark applications on Kubernetes
* Monitor and troubleshoot Kubernetes workloads

### Prerequisites
* Completion of Practical 6 (Docker)
* A local cluster, either:
  * Docker Desktop with Kubernetes enabled, or
  * Minikube ([Installation Guide](https://minikube.sigs.k8s.io/docs/start/))
* kubectl CLI installed ([Installation Guide](https://kubernetes.io/docs/tasks/tools/))

### Installation Verification

```bash
# Check kubectl version
kubectl version --client

# Check cluster status
kubectl cluster-info

# List nodes
kubectl get nodes
```

### Exercises Overview

| Exercise | Topic | Difficulty |
|----------|-------|------------|
| 1 | Kubernetes Architecture and kubectl Basics | ★ |
| 2 | Pods and Deployments | ★ |
| 3 | Services and Networking | ★★ |
| 4 | ConfigMaps and Secrets | ★★ |
| 5 | Persistent Storage | ★★ |
| 6 | Jobs and CronJobs for Batch Processing | ★★ |
| 7 | Horizontal Pod Autoscaling | ★★★ |
| 8 | Deploying Data Processing Pipelines | ★★★ |

---

## Exercise 1: Kubernetes Architecture and kubectl Basics [★]

### Kubernetes Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                        Control Plane                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │
│  │  API Server  │  │  Scheduler   │  │  Controller  │             │
│  │              │  │              │  │   Manager    │             │
│  └──────────────┘  └──────────────┘  └──────────────┘             │
│                           │                                        │
│                    ┌──────┴──────┐                                │
│                    │    etcd     │                                │
│                    │  (Storage)  │                                │
│                    └─────────────┘                                │
└────────────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Worker Node   │  │   Worker Node   │  │   Worker Node   │
│  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │
│  │  kubelet  │  │  │  │  kubelet  │  │  │  │  kubelet  │  │
│  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │
│  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │
│  │kube-proxy │  │  │  │kube-proxy │  │  │  │kube-proxy │  │
│  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │
│  ┌───┐ ┌───┐    │  │  ┌───┐ ┌───┐    │  │  ┌───┐ ┌───┐    │
│  │Pod│ │Pod│    │  │  │Pod│ │Pod│    │  │  │Pod│ │Pod│    │
│  └───┘ └───┘    │  │  └───┘ └───┘    │  │  └───┘ └───┘    │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

### Key Components

**Control Plane:**
- **API Server**: Entry point for all REST commands
- **etcd**: Distributed key-value store for cluster state
- **Scheduler**: Assigns pods to nodes
- **Controller Manager**: Runs control loops (ReplicaSet, Deployment, etc.)

**Worker Nodes:**
- **kubelet**: Agent that runs on each node
- **kube-proxy**: Network proxy for service networking
- **Container Runtime**: containerd or CRI-O (since Kubernetes 1.24, Docker Engine needs the cri-dockerd adapter)

### kubectl Basic Commands

```bash
# Get cluster information
kubectl cluster-info

# List all nodes
kubectl get nodes

# List all namespaces
kubectl get namespaces

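# List common resources (pods, services, deployments, ...) in the current namespace
# Note: "all" does not include ConfigMaps, Secrets, or PVCs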
kubectl get all

# List pods with more details
kubectl get pods -o wide

# Describe a resource
kubectl describe pod <pod-name>

# View logs
kubectl logs <pod-name>

# Execute command in a pod
kubectl exec -it <pod-name> -- /bin/bash

# Apply a configuration
kubectl apply -f <file.yaml>

# Delete a resource
kubectl delete -f <file.yaml>
```

### Working with Namespaces

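Namespaces partition a cluster into named scopes for resources. They keep names from colliding and give access control and quotas something to attach to; they do not isolate network traffic on their own.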

```bash
# Create a namespace
kubectl create namespace data-processing

# List pods in a specific namespace
kubectl get pods -n data-processing

# Set default namespace for current context
kubectl config set-context --current --namespace=data-processing

# List pods across all namespaces
kubectl get pods --all-namespaces
```

### YAML Manifests

Kubernetes uses YAML files to define resources. Basic structure:

```yaml
apiVersion: v1              # API version
kind: Pod                   # Resource type
metadata:
  name: my-pod              # Resource name
  namespace: default        # Namespace
  labels:                   # Key-value labels
    app: myapp
spec:                       # Resource specification
  # ... resource-specific fields
```

### Questions - Exercise 1

**Q1.1** Explore your Kubernetes cluster:
- List all nodes and their status
- Describe one node to see its capacity and allocatable resources
- List all namespaces and pods across the cluster

**Q1.2** Create a namespace called `tdm-practicals` and set it as your default namespace.

**Q1.3** Use `kubectl explain` to explore the Pod resource:
- What fields are available in `spec.containers`?
- What is the difference between `resources.limits` and `resources.requests`?

---

## Exercise 2: Pods and Deployments [★]

### Pods

A Pod is the smallest deployable unit in Kubernetes. It holds one or more containers that share a network namespace (one IP address per pod) and can share storage volumes.

**Simple Pod (pod.yaml):**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: python-pod
  labels:
    app: python-demo
spec:
  containers:
  - name: python
    image: python:3.10-slim
    command: ["python", "-u", "-c"]  # -u disables output buffering so logs show up immediately
    args:
    - |
      import time
      while True:
          print(f"Hello from Kubernetes at {time.ctime()}")
          time.sleep(5)
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"
```

```bash
# Create the pod
kubectl apply -f pod.yaml

# View logs
kubectl logs python-pod -f

# Delete the pod
kubectl delete pod python-pod
```
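
The `requests` and `limits` values in `pod.yaml` use Kubernetes quantity notation: `100m` is 100 millicores (0.1 CPU) and `64Mi` is 64 × 2^20 bytes. As a rough illustration, the sketch below converts a few common forms to plain numbers. The helper names `parse_cpu` and `parse_memory` are hypothetical, and only a subset of the suffixes is handled.

```python
# Minimal sketch: convert a subset of Kubernetes resource quantities to numbers.
# parse_cpu and parse_memory are illustrative helpers, not part of any Kubernetes client.

BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu(quantity: str) -> float:
    """Return CPU cores as a float, e.g. '100m' is a tenth of a core."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return a memory quantity in bytes."""
    # Check the two-letter binary suffixes first so "Mi" is not mistaken for "M".
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    for suffix, factor in DECIMAL_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


print(parse_cpu("100m"))      # CPU request from pod.yaml
print(parse_memory("128Mi"))  # memory limit from pod.yaml
```

The scheduler places a pod based on its requests, while limits cap what the container may use at runtime, so the two numbers answer different questions.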

### Deployments

Deployments manage ReplicaSets and provide declarative updates for Pods.

**Deployment (deployment.yaml):**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-processor
  labels:
    app: data-processor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: data-processor
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
      - name: processor
        image: python:3.10-slim
        command: ["python", "-u", "-c"]  # -u disables output buffering so logs show up immediately
        args:
        - |
          import socket
          import time
          hostname = socket.gethostname()
          while True:
              print(f"Processing on {hostname}")
              time.sleep(10)
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
```

```bash
# Create deployment
kubectl apply -f deployment.yaml

# View deployment status
kubectl get deployments
kubectl get pods

# Scale deployment
kubectl scale deployment data-processor --replicas=5

# View deployment history
kubectl rollout history deployment/data-processor

# Update deployment (change image, etc.)
kubectl set image deployment/data-processor processor=python:3.11-slim

# Rollback to previous version
kubectl rollout undo deployment/data-processor
```

### Pod Lifecycle and Health Checks

A liveness probe tells the kubelet when to restart a container that has stopped responding. A readiness probe controls whether the pod receives traffic from Services: a pod that fails it stays running but is removed from the Service endpoints until it passes again.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: python:3.10-slim
        command: ["python", "-c"]
        args:
        - |
          from http.server import HTTPServer, SimpleHTTPRequestHandler
          print('Starting server on port 8080')
          HTTPServer(('0.0.0.0', 8080), SimpleHTTPRequestHandler).serve_forever()
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
```
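
The manifest above points both probes at `/`. Applications often expose separate endpoints so that "alive" and "ready for traffic" can differ. The following is a minimal sketch of that idea using only the standard library; the `/healthz` and `/ready` paths and the `READY` flag are illustrative choices, not something Kubernetes requires.

```python
# Minimal sketch: separate liveness and readiness endpoints with the standard library.
# The /healthz and /ready paths and the READY flag are illustrative choices.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = threading.Event()  # set once the application has finished warming up


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200  # the process is alive and able to answer
        elif self.path == "/ready":
            # 503 makes the readiness probe fail, which keeps traffic away from this pod
            status = 200 if READY.is_set() else 503
        else:
            status = 404
        body = str(status).encode()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe requests out of the container logs


def make_server(port: int = 8080) -> HTTPServer:
    """Create the probe server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)
```

To use it, a container would call `make_server().serve_forever()` and set `READY` once startup work (loading data, opening connections) is done; the probes would then use `path: /healthz` and `path: /ready` respectively.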

### Questions - Exercise 2

**Q2.1** Create a Deployment for a data processing application that:
- Runs 3 replicas
- Uses the Python image
- Processes data in a loop
- Has proper resource limits
- Includes liveness and readiness probes

**Q2.2** Experiment with scaling:
- Scale the deployment to 5 replicas
- Observe how Kubernetes distributes pods across nodes
- Scale down to 2 replicas and observe pod termination

**Q2.3** Perform a rolling update:
- Update the image version
- Watch the rollout progress
- Simulate a failed deployment and rollback

---

## Exercise 3: Services and Networking [★★]

### Service Types

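A Service gives a set of pods (selected by label) a stable virtual IP and DNS name, and load-balances traffic across them. Pods come and go; the Service address stays the same.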

```
┌────────────────────────────────────────────────────────────┐
│                        Service Types                       │
├────────────────┬───────────────────────────────────────────┤
│ ClusterIP      │ Internal cluster IP (default)             │
│ NodePort       │ Exposes on each node's IP at static port  │
│ LoadBalancer   │ External load balancer (cloud provider)   │
│ ExternalName   │ Maps to external DNS name                 │
└────────────────┴───────────────────────────────────────────┘
```

### ClusterIP Service

**service-clusterip.yaml:**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: data-processor-service
spec:
  type: ClusterIP
  selector:
    app: data-processor
  ports:
  - port: 80          # Service port
    targetPort: 8080  # Container port
    protocol: TCP
```

### NodePort Service

**service-nodeport.yaml:**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-nodeport
spec:
  type: NodePort
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080   # Optional: 30000-32767
```

```bash
# Access the service
# http://<node-ip>:30080

# With minikube
minikube service web-app-nodeport --url
```

### Complete Web Application Example

**webapp.yaml:**

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
    spec:
      containers:
      - name: flask
        image: python:3.10-slim
        command: ["/bin/bash", "-c"]
        args:
        - |
          pip install flask && python -c "
          from flask import Flask
          import socket
          app = Flask(__name__)
          @app.route('/')
          def hello():
              return f'Hello from {socket.gethostname()}'
          app.run(host='0.0.0.0', port=5000)
          "
        ports:
        - containerPort: 5000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: flask-app-service
spec:
  type: NodePort
  selector:
    app: flask-app
  ports:
  - port: 80
    targetPort: 5000
    nodePort: 30500
```

```bash
# Apply both resources
kubectl apply -f webapp.yaml

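# Test load balancing (multiple requests show different hostnames)
# localhost works with Docker Desktop; with minikube, use the URL printed by
# `minikube service flask-app-service --url` instead
for i in {1..10}; do curl http://localhost:30500; echo; done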
```

### DNS and Service Discovery

Kubernetes provides DNS-based service discovery. Services can be accessed by:
- `<service-name>` (same namespace)
- `<service-name>.<namespace>` (cross-namespace)
- `<service-name>.<namespace>.svc.cluster.local` (FQDN)

```bash
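# Test DNS from within a pod
# busybox:1.28 is pinned because nslookup in newer busybox images is known to misbehave
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup flask-app-service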
```
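
The three name forms listed above can be generated mechanically. The sketch below assumes the default cluster domain `cluster.local` (clusters can be configured with a different one); the function names are hypothetical.

```python
# Minimal sketch: the DNS names under which a Service is reachable.
# Assumes the default cluster domain "cluster.local"; clusters can be configured differently.

def service_dns_names(service: str, namespace: str = "default",
                      cluster_domain: str = "cluster.local") -> list[str]:
    return [
        service,                                        # resolves only within the same namespace
        f"{service}.{namespace}",                       # resolves from any namespace
        f"{service}.{namespace}.svc.{cluster_domain}",  # fully qualified domain name
    ]


def service_url(service: str, namespace: str = "default", port: int = 80) -> str:
    """Build an HTTP URL from the fully qualified name, which works from any namespace."""
    return f"http://{service_dns_names(service, namespace)[-1]}:{port}"


print(service_dns_names("flask-app-service"))
print(service_url("flask-app-service", port=80))
```

Application code that calls another service in the same namespace usually uses the short name; the longer forms matter once the caller lives in a different namespace.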

### Questions - Exercise 3

**Q3.1** Create a multi-tier application:
- Frontend Deployment and Service (NodePort)
- Backend Deployment and Service (ClusterIP)
- Frontend communicates with backend via service name

**Q3.2** Test service discovery:
- Create a debug pod
- Use `nslookup` and `curl` to verify service connectivity
- Document the DNS resolution process

**Q3.3** Implement load balancing:
- Deploy 5 replicas of a web application
- Make multiple requests and track which pod handles each
- Analyze the load distribution

---

## Exercise 4: ConfigMaps and Secrets [★★]

### ConfigMaps

ConfigMaps store non-sensitive configuration data.

**configmap.yaml:**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  # Key-value pairs
  DATABASE_HOST: "postgres-service"
  DATABASE_PORT: "5432"
  LOG_LEVEL: "INFO"
  
  # File-like keys
  config.json: |
    {
      "processing": {
        "batch_size": 100,
        "timeout": 30
      }
    }
```

**Using ConfigMaps in Pods:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
  - name: demo
    image: python:3.10-slim
    
    # Method 1: Environment variables from specific keys
    env:
    - name: DB_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: DATABASE_HOST
    
    # Method 2: All keys as environment variables
    envFrom:
    - configMapRef:
        name: app-config
    
    # Method 3: Mount as volume
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
      
  volumes:
  - name: config-volume
    configMap:
      name: app-config
```
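
Inside the container, the three methods above surface in two ways: as environment variables and as files under the mount path, one file per ConfigMap key. The sketch below shows how an application might read both. It assumes the `app-config` ConfigMap from above; the function name and the fallback defaults are illustrative.

```python
# Minimal sketch: read ConfigMap-provided settings inside the container.
# Assumes the app-config ConfigMap is injected via envFrom and mounted at /etc/config.
# load_settings and its fallback defaults are illustrative, not part of the practical.
import json
import os
from pathlib import Path


def load_settings(env=os.environ, config_dir="/etc/config"):
    settings = {
        "db_host": env.get("DATABASE_HOST", "localhost"),
        "db_port": int(env.get("DATABASE_PORT", "5432")),  # env values are always strings
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
    # Each ConfigMap key becomes one file in the mounted directory.
    config_file = Path(config_dir) / "config.json"
    if config_file.exists():
        settings["processing"] = json.loads(config_file.read_text())["processing"]
    return settings


print(load_settings())
```

One practical difference: files in a mounted ConfigMap volume are refreshed after the ConfigMap changes, whereas environment variables keep the values they had when the container started.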

### Secrets

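Secrets store sensitive data like passwords and API keys. Their values are base64-encoded, not encrypted, so anyone who can read the Secret can recover the plaintext; restrict access accordingly.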

```bash
# Create secret from literal values
kubectl create secret generic db-credentials \
  --from-literal=username=admin \
  --from-literal=password=secretpass123

# Create secret from file
kubectl create secret generic tls-certs \
  --from-file=cert.pem \
  --from-file=key.pem

# View secret (base64 encoded)
kubectl get secret db-credentials -o yaml
```

**secret.yaml:**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  # Values must be base64 encoded
  # echo -n 'admin' | base64
  username: YWRtaW4=
  password: c2VjcmV0cGFzczEyMw==
```
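
The `data` values in `secret.yaml` can be produced and checked with a few lines of Python instead of the `echo -n ... | base64` pipeline. This is a small sketch; the helper names are hypothetical.

```python
# Minimal sketch: produce and inspect the base64 values used in a Secret's data field.
# encode_secret_value / decode_secret_value are illustrative helper names.
import base64


def encode_secret_value(plaintext: str) -> str:
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    return base64.b64decode(encoded).decode("utf-8")


print(encode_secret_value("admin"))                 # value for the username key
print(decode_secret_value("c2VjcmV0cGFzczEyMw=="))  # decoding needs no key or password
```

The second call makes the point that base64 is an encoding only: the password in `secret.yaml` is readable by anyone who can see the manifest.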

**Using Secrets in Pods:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo
spec:
  containers:
  - name: demo
    image: python:3.10-slim
    env:
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: username
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password
```

### Complete Example: Application with Configuration

**data-processor-config.yaml:**

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: processor-config
data:
  BATCH_SIZE: "100"
  PROCESSING_MODE: "parallel"
  LOG_LEVEL: "DEBUG"
---
apiVersion: v1
kind: Secret
metadata:
  name: processor-secrets
type: Opaque
stringData:  # Use stringData for unencoded values
  API_KEY: "your-secret-api-key"
  DATABASE_URL: "postgresql://user:pass@host:5432/db"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-processor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: data-processor
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
      - name: processor
        image: python:3.10-slim
        command: ["python", "-u", "-c"]  # -u disables output buffering so logs show up immediately
        args:
        - |
          import os
          import time
          print(f"Batch Size: {os.environ.get('BATCH_SIZE')}")
          print(f"Mode: {os.environ.get('PROCESSING_MODE')}")
          print(f"API Key: {os.environ.get('API_KEY')[:5]}...")
          while True:
              print("Processing...")
              time.sleep(10)
        envFrom:
        - configMapRef:
            name: processor-config
        - secretRef:
            name: processor-secrets
```

### Questions - Exercise 4

**Q4.1** Create configuration for a data processing application:
- ConfigMap with processing parameters (batch size, timeout, input/output paths)
- Secret with database credentials
- Deployment that uses both

**Q4.2** Implement configuration hot-reloading:
- Mount ConfigMap as a volume
- Write a Python script that watches for config file changes
- Update the ConfigMap and verify the application detects changes

**Q4.3** Create a production-ready configuration setup:
- Separate configs for dev/staging/prod environments
- Use Kustomize to manage environment-specific overrides
- Document the configuration management strategy

---

## Exercise 5: Persistent Storage [★★]

### Storage Concepts

```
┌─────────────────────────────────────────────────────────────┐
│                     Storage Architecture                     │
│                                                              │
│  ┌──────────────────┐                                       │
│  │       Pod        │                                       │
│  │  ┌────────────┐  │                                       │
│  │  │  Volume    │  │◄──── PersistentVolumeClaim (PVC)     │
│  │  │  Mount     │  │            │                          │
│  │  └────────────┘  │            │ binds                    │
│  └──────────────────┘            ▼                          │
│                         ┌──────────────────┐                │
│                         │ PersistentVolume │                │
│                         │      (PV)        │                │
│                         └────────┬─────────┘                │
│                                  │                          │
│                                  ▼                          │
│                         ┌──────────────────┐                │
│                         │ Storage Backend  │                │
│                         │ (NFS, EBS, etc.) │                │
│                         └──────────────────┘                │
└─────────────────────────────────────────────────────────────┘
```

### PersistentVolume and PersistentVolumeClaim

**pv-pvc.yaml:**

```yaml
---
# PersistentVolume (usually created by admin)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce      # RWO: single node
    # - ReadWriteMany    # RWX: multiple nodes
    # - ReadOnlyMany     # ROX: read-only multiple nodes
  persistentVolumeReclaimPolicy: Retain  # or Delete
  storageClassName: manual
  hostPath:
    path: /data/pv-data
---
# PersistentVolumeClaim (created by user)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: manual
```
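
Whether `data-pvc` binds to `data-pv` depends on a handful of checks. The sketch below models a simplified version of them: same storage class, requested access modes offered by the volume, and enough capacity. It deliberately ignores label selectors, `volumeMode`, and volumes that are already bound, and the class and function names are hypothetical.

```python
# Minimal sketch of the checks involved when a claim binds to a pre-created volume.
# Simplified on purpose: ignores label selectors, volumeMode, and already-bound volumes.
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str


@dataclass
class Claim:
    name: str
    request_gi: int
    access_modes: frozenset
    storage_class: str


def can_bind(pv: Volume, pvc: Claim) -> bool:
    return (
        pv.storage_class == pvc.storage_class
        and pvc.access_modes <= pv.access_modes  # the PV must offer every requested mode
        and pv.capacity_gi >= pvc.request_gi     # the PV may be larger than the request
    )


data_pv = Volume("data-pv", 5, frozenset({"ReadWriteOnce"}), "manual")
data_pvc = Claim("data-pvc", 2, frozenset({"ReadWriteOnce"}), "manual")
print(can_bind(data_pv, data_pvc))
```

Binding is one-to-one, so in the manifest above the 2Gi claim takes the whole 5Gi volume; the remaining capacity is not available to other claims.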

### Using PVC in Pods

**pod-with-storage.yaml:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-writer
spec:
  containers:
  - name: writer
    image: python:3.10-slim
    command: ["python", "-u", "-c"]  # -u disables output buffering so logs show up immediately
    args:
    - |
      import time
      from datetime import datetime
      
      counter = 0
      while True:
          with open('/data/output.txt', 'a') as f:
              f.write(f"{datetime.now()}: Record {counter}\n")
          print(f"Written record {counter}")
          counter += 1
          time.sleep(5)
    volumeMounts:
    - name: data-volume
      mountPath: /data
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: data-pvc
```

### StorageClass for Dynamic Provisioning

**storageclass.yaml:**

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/gce-pd  # legacy in-tree name; current clusters use CSI drivers such as pd.csi.storage.gke.io or ebs.csi.aws.com
parameters:
  type: pd-ssd
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

**Dynamic PVC:**

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast-storage  # Uses StorageClass
```

### Questions - Exercise 5

**Q5.1** Create a persistent storage setup for a data pipeline:
- PersistentVolume for input data
- PersistentVolume for output data
- Pod that reads from input, processes, writes to output

**Q5.2** Implement data sharing between pods:
- Create a PVC with ReadWriteMany access mode
- Deploy a writer pod and multiple reader pods
- Verify all pods can access the shared data

**Q5.3** Test data persistence:
- Deploy a database (PostgreSQL) with persistent storage
- Insert data, delete the pod, verify data persists
- Document the backup and restore process

---

## Exercise 6: Jobs and CronJobs for Batch Processing [★★]

### Jobs

Jobs run one or more pods to completion.

**job.yaml:**

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  completions: 5       # Number of successful completions required
  parallelism: 2       # Pods running in parallel
  backoffLimit: 3      # Retries before marking as failed
  activeDeadlineSeconds: 600  # Timeout
  template:
    spec:
      restartPolicy: Never  # or OnFailure
      containers:
      - name: processor
        image: python:3.10-slim
        command: ["python", "-c"]
        args:
        - |
          import random
          import time
          import socket
          
          hostname = socket.gethostname()
          work_time = random.randint(5, 15)
          
          print(f"Job {hostname} starting, will run for {work_time} seconds")
          time.sleep(work_time)
          print(f"Job {hostname} completed successfully")
```

```bash
# Create job
kubectl apply -f job.yaml

# Watch job progress
kubectl get jobs -w

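# View logs from one of the job's pods
kubectl logs job/data-processing-job

# View logs from all pods of the job (selected by the job-name label)
kubectl logs -l job-name=data-processing-job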

# Delete job and its pods
kubectl delete job data-processing-job
```

### CronJobs

CronJobs run jobs on a schedule.

**cronjob.yaml:**

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-aggregator
spec:
  schedule: "*/5 * * * *"  # Every 5 minutes
  # Cron format: minute hour day-of-month month day-of-week
  # "0 * * * *"     - Every hour
  # "0 0 * * *"     - Every day at midnight
  # "0 0 * * 0"     - Every Sunday at midnight
  
  concurrencyPolicy: Forbid  # Allow, Forbid, Replace
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 200
  
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: aggregator
            image: python:3.10-slim
            command: ["python", "-c"]
            args:
            - |
              from datetime import datetime
              print(f"Running aggregation at {datetime.now()}")
              # Simulate aggregation work
              import time
              time.sleep(30)
              print("Aggregation complete")
```

```bash
# Create cronjob
kubectl apply -f cronjob.yaml

# List cronjobs
kubectl get cronjobs

# View jobs created by cronjob
kubectl get jobs

# Manually trigger a job from cronjob
kubectl create job --from=cronjob/data-aggregator manual-run

# Suspend a cronjob
kubectl patch cronjob data-aggregator -p '{"spec": {"suspend": true}}'
```
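
To get a feel for the schedule strings in `cronjob.yaml`, the sketch below checks whether a given time matches a five-field expression and lists the next few run times. It is intentionally limited: it supports only `*`, `*/n`, and single integers, treats `*/n` as "divisible by n" (correct for the minute and hour fields, which start at 0), and requires all five fields to match. The function names are hypothetical.

```python
# Minimal sketch: check whether a datetime matches a five-field cron schedule.
# Supports only "*", "*/n", and single integers; real cron also has ranges, lists, and names,
# and treats day-of-month/day-of-week specially when both are restricted.
from datetime import datetime, timedelta


def field_matches(field: str, value: int) -> bool:
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0  # valid for fields that start at 0 (minute, hour)
    return value == int(field)


def matches(schedule: str, when: datetime) -> bool:
    minute, hour, dom, month, dow = schedule.split()
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        and field_matches(dow, when.isoweekday() % 7)  # cron uses 0 for Sunday
    )


def next_runs(schedule: str, start: datetime, count: int = 3) -> list[datetime]:
    """Step minute by minute from start and collect the next matching times."""
    runs = []
    current = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while len(runs) < count:
        if matches(schedule, current):
            runs.append(current)
        current += timedelta(minutes=1)
    return runs


for run in next_runs("*/5 * * * *", datetime(2024, 1, 1, 12, 2)):
    print(run)
```

Stepping one minute at a time is slow for sparse schedules but keeps the logic easy to follow, which is the goal here.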

### Data Processing Pipeline with Jobs

**etl-pipeline.yaml:**

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: etl-config
data:
  INPUT_PATH: "/data/input"
  OUTPUT_PATH: "/data/output"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-extract
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: extract
        image: python:3.10-slim
        command: ["python", "-c"]
        args:
        - |
          import os
          import json
          
          output_path = os.environ.get('OUTPUT_PATH', '/data/output')
          os.makedirs(output_path, exist_ok=True)
          
          # Simulate extraction
          data = [{'id': i, 'value': i * 10} for i in range(100)]
          
          with open(f'{output_path}/extracted.json', 'w') as f:
              json.dump(data, f)
          
          print(f"Extracted {len(data)} records")
        envFrom:
        - configMapRef:
            name: etl-config
        volumeMounts:
        - name: data-volume
          mountPath: /data
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: etl-data-pvc  # not defined in this file; create this PVC before applying the Job
```

### Questions - Exercise 6

**Q6.1** Create a batch processing job that:
- Reads data from a ConfigMap
- Processes data in parallel (3 pods)
- Writes results to a PersistentVolume
- Handles failures with retries

**Q6.2** Implement a scheduled data pipeline:
- CronJob that runs hourly
- Fetches data from an external API (simulated)
- Processes and stores results
- Sends notification on completion (simulated)

**Q6.3** Create an ETL pipeline with multiple jobs:
- Extract job that fetches raw data
- Transform job that cleans and enriches data
- Load job that writes to final destination
- Use initContainers to ensure proper sequencing

---

## Exercise 7: Horizontal Pod Autoscaling [★★★]

### HPA Basics

The Horizontal Pod Autoscaler automatically scales the number of pods based on observed CPU/memory utilization or custom metrics.

```bash
# Enable metrics-server (required for HPA)
# For minikube:
minikube addons enable metrics-server

# Verify metrics are available
kubectl top nodes
kubectl top pods
```

### HPA Configuration

**deployment-for-hpa.yaml:**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-intensive-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-intensive
  template:
    metadata:
      labels:
        app: cpu-intensive
    spec:
      containers:
      - name: app
        image: python:3.10-slim
        command: ["/bin/bash", "-c"]
        args:
        - |
          pip install flask && python -c "
          from flask import Flask
          import math
          app = Flask(__name__)
          @app.route('/')
          def compute():
              x = 0
              for i in range(1000000):
                  x += math.sqrt(i)
              return f'Computed: {x}'
          app.run(host='0.0.0.0', port=5000)
          "
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: cpu-intensive-service
spec:
  type: NodePort
  selector:
    app: cpu-intensive
  ports:
  - port: 80
    targetPort: 5000
    nodePort: 30600
```

**hpa.yaml:**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-intensive-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-intensive-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
```

```bash
# Apply configurations
kubectl apply -f deployment-for-hpa.yaml
kubectl apply -f hpa.yaml

# Alternative to hpa.yaml (use one or the other, not both): create the HPA from the command line
kubectl autoscale deployment cpu-intensive-app --cpu-percent=50 --min=1 --max=10

# Watch HPA status
kubectl get hpa -w

# Generate load
# In another terminal:
kubectl run -it load-generator --rm --image=busybox -- /bin/sh -c \
  "while true; do wget -q -O- http://cpu-intensive-service; done"
```
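
While watching `kubectl get hpa -w`, it helps to know how the replica count is derived. The Kubernetes documentation gives the core rule as `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`, with changes skipped when the ratio is within a tolerance of 1.0 (10% by default). The sketch below implements just that rule plus min/max clamping; it ignores the stabilization windows and scaling policies from `hpa.yaml`, and the function name is hypothetical.

```python
# Minimal sketch of the replica calculation described in the Kubernetes HPA documentation:
#   desired = ceil(current_replicas * current_metric / target_metric)
# Stabilization windows and scaling policies are ignored; the 10% tolerance is the documented default.
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: leave the count alone
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(1, 200, 50))  # one pod at 200% of its CPU request, target 50%
print(desired_replicas(4, 20, 50))   # load has dropped to 20% average
```

Utilization is measured relative to the container's CPU *request*, not its limit. With a request of `100m` and a limit of `500m`, as in `deployment-for-hpa.yaml`, a busy pod can report up to 500% utilization.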

### Memory-based HPA

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memory-intensive-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
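
When an HPA lists several metrics, as above, it computes a proposed replica count for each metric and uses the largest one. The sketch below shows only that selection step; tolerance, stabilization, and min/max clamping are left out, and the observed values are made-up numbers for illustration.

```python
# Minimal sketch: with several metrics, the HPA computes a proposal per metric and uses the largest.
# Tolerance, stabilization, and min/max clamping are left out to keep the idea visible.
import math


def proposal(current_replicas: int, current_value: float, target_value: float) -> int:
    return math.ceil(current_replicas * current_value / target_value)


def combined_proposal(current_replicas: int, metrics: dict) -> int:
    """metrics maps a metric name to a (current_value, target_value) pair."""
    return max(proposal(current_replicas, cur, tgt) for cur, tgt in metrics.values())


# Hypothetical observations for memory-hpa with 2 replicas running
observed = {
    "memory_mi": (650, 500),   # average 650Mi per pod against the 500Mi target
    "cpu_percent": (35, 70),   # average 35% against the 70% target
}
print(combined_proposal(2, observed))
```

Taking the maximum means the most demanding metric wins: low CPU usage does not cancel out memory pressure.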

### Questions - Exercise 7

**Q7.1** Configure autoscaling for a data processing service:
- Deploy a CPU-intensive processing application
- Configure HPA with min=2, max=10 replicas
- Target CPU utilization at 60%
- Test with varying load levels

**Q7.2** Implement multi-metric scaling:
- Scale based on both CPU and memory
- Add custom metrics (if using Prometheus)
- Document the scaling behavior under different conditions

**Q7.3** Create a complete auto-scaling demo:
- Deploy application with HPA
- Create load generator
- Visualize scaling events
- Measure response times during scaling

---

## Exercise 8: Deploying Data Processing Pipelines [★★★]

### Complete Data Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                           │
│                                                                 │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │
│  │   Data      │     │   Message   │     │   Worker    │       │
│  │  Ingester   │────►│   Queue     │────►│   Pods      │       │
│  │  (Deploy)   │     │ (RabbitMQ)  │     │  (Deploy)   │       │
│  └─────────────┘     └─────────────┘     └──────┬──────┘       │
│                                                  │              │
│                                                  ▼              │
│                                           ┌─────────────┐       │
│                                           │  Database   │       │
│                                           │ (PostgreSQL)│       │
│                                           └─────────────┘       │
│                                                  │              │
│                                                  ▼              │
│                                           ┌─────────────┐       │
│                                           │    API      │       │
│                                           │  (Deploy)   │       │
│                                           └─────────────┘       │
└─────────────────────────────────────────────────────────────────┘
```

### RabbitMQ Deployment

**rabbitmq.yaml:**

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  # Demo only: in a real deployment these credentials belong in a Secret, not a ConfigMap
  RABBITMQ_DEFAULT_USER: "admin"
  RABBITMQ_DEFAULT_PASS: "rabbitmq123"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:3-management
        ports:
        - containerPort: 5672
        - containerPort: 15672
        envFrom:
        - configMapRef:
            name: rabbitmq-config
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  selector:
    app: rabbitmq
  ports:
  - name: amqp
    port: 5672
  - name: management
    port: 15672
```

### PostgreSQL Deployment

**postgres.yaml:**

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
type: Opaque
stringData:
  POSTGRES_USER: "datauser"
  POSTGRES_PASSWORD: "datapass123"
  POSTGRES_DB: "dataprocessing"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
        envFrom:
        - secretRef:
            name: postgres-secret
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: postgres-storage
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
```
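Inside the cluster, clients reach this database through the Service name `postgres` on port 5432, using the values stored in `postgres-secret`. The sketch below shows one way a worker might assemble its connection settings. It assumes the worker Pod also loads `postgres-secret` through `envFrom`, which the worker manifest in this exercise does not do. `POSTGRES_HOST` and the function names `connection_settings` and `connect` are hypothetical.

```python
import os


def connection_settings(env=os.environ):
    """Build psycopg2 connection keyword arguments from environment variables.

    POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB match the keys in
    postgres-secret. POSTGRES_HOST is a hypothetical override; the default
    'postgres' is the Service name, which cluster DNS resolves.
    """
    return {
        "host": env.get("POSTGRES_HOST", "postgres"),
        "port": 5432,
        "user": env["POSTGRES_USER"],
        "password": env["POSTGRES_PASSWORD"],
        "dbname": env["POSTGRES_DB"],
    }


def connect(env=os.environ):
    """Open a connection to PostgreSQL using the settings above."""
    # Imported here so connection_settings() works without psycopg2 installed
    import psycopg2

    return psycopg2.connect(**connection_settings(env))
```

Reading the values from the environment keeps the password out of the container image and out of the worker's source code.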

### Worker Deployment with HPA

**worker.yaml:**

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: data-worker
  template:
    metadata:
      labels:
        app: data-worker
    spec:
      containers:
      - name: worker
        image: python:3.10-slim
        command: ["/bin/bash", "-c"]
        args:
        - |
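          pip install pika psycopg2-binary && python -u -c "
          import pika
          import psycopg2
          import time
          import os
          import json
          
          # Connect to RabbitMQ, retrying while the broker starts up
          connection = None
          for i in range(10):
              try:
                  connection = pika.BlockingConnection(
                      pika.ConnectionParameters('rabbitmq', credentials=pika.PlainCredentials('admin', 'rabbitmq123'))
                  )
                  break
              except pika.exceptions.AMQPConnectionError:
                  print('Waiting for RabbitMQ...')
                  time.sleep(5)
          if connection is None:
              raise SystemExit('Could not connect to RabbitMQ after 10 attempts')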
          
          channel = connection.channel()
          channel.queue_declare(queue='data_queue', durable=True)
          
          def callback(ch, method, props, body):
              data = json.loads(body)
              print(f'Processing: {data}')
              # Process data and save to database
              time.sleep(1)
              ch.basic_ack(delivery_tag=method.delivery_tag)
          
          channel.basic_qos(prefetch_count=1)
          channel.basic_consume(queue='data_queue', on_message_callback=callback)
          print('Worker started, waiting for messages...')
          channel.start_consuming()
          "
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "300m"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
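The HPA above targets 60% average CPU utilization, measured against each Pod's CPU request (100m here, so roughly 60m per Pod). The Kubernetes documentation gives the core scaling rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below applies that rule to this HPA's settings. It is a simplification: the function name is hypothetical, and the real controller also accounts for unready Pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the replica count the HPA would request.

    Utilization values are percentages of the Pods' CPU request.
    """
    ratio = current_utilization / target_utilization
    # No scaling when the metric is close enough to the target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured in the HPA spec
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))   # 2 Pods at 90% of request
print(desired_replicas(8, 120))  # would exceed maxReplicas, so it is clamped
```

Because utilization is relative to the request rather than the limit, a Pod using its full 300m limit reports 300% utilization.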

### Deploying the Complete Pipeline

```bash
# Create namespace
kubectl create namespace data-pipeline

# Deploy components
kubectl apply -f rabbitmq.yaml -n data-pipeline
kubectl apply -f postgres.yaml -n data-pipeline
kubectl apply -f worker.yaml -n data-pipeline

# Verify deployments
kubectl get all -n data-pipeline

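# View logs (Ctrl+C to stop following)
kubectl logs -f deployment/data-worker -n data-pipeline

# Port forward for debugging (the command blocks, so run it in a separate terminal)
kubectl port-forward svc/rabbitmq 15672:15672 -n data-pipeline
# Access the RabbitMQ UI at http://localhost:15672 (log in with admin / rabbitmq123)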
```
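The workers stay idle until something publishes to `data_queue`. The following is a minimal producer sketch for testing from your own machine. It assumes `pika` is installed locally and that the AMQP port has been forwarded with `kubectl port-forward svc/rabbitmq 5672:5672 -n data-pipeline`. The message fields (`id`, `value`, `created_at`) and the function names are hypothetical.

```python
import json
import random
import time


def make_message(i):
    """Build one JSON-encoded test message; the field names are illustrative."""
    payload = {
        "id": i,
        "value": round(random.uniform(0, 100), 2),
        "created_at": time.time(),
    }
    return json.dumps(payload).encode("utf-8")


def publish(count=20, host="localhost", user="admin", password="rabbitmq123"):
    """Publish `count` persistent messages to the durable queue the worker reads."""
    # Imported here so make_message() can be used without pika installed
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host, credentials=pika.PlainCredentials(user, password))
    )
    channel = connection.channel()
    # Must match the worker's declaration (same name, durable=True)
    channel.queue_declare(queue="data_queue", durable=True)
    for i in range(count):
        channel.basic_publish(
            exchange="",               # default exchange routes by queue name
            routing_key="data_queue",
            body=make_message(i),
            properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent
        )
    connection.close()
```

Calling `publish(100)` while following the worker logs should show each message being processed, spread across the worker Pods.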

### Questions - Exercise 8

**Q8.1** Deploy a complete data processing pipeline:
- RabbitMQ for message queuing
- PostgreSQL for data storage
- Producer service that generates data
- Worker service with auto-scaling
- API service to query results

**Q8.2** Add monitoring and observability:
- Deploy Prometheus for metrics collection
- Configure scraping for all services
- Create Grafana dashboards
- Set up alerts for key metrics

**Q8.3** Implement a Spark on Kubernetes deployment:
- Deploy Spark operator (or use spark-submit with Kubernetes)
- Submit a Spark job to process data from PostgreSQL
- Configure executor scaling
- Monitor job progress and resource usage

---

## Summary

In this practical, you learned:

1. **Kubernetes Architecture**: Control plane, worker nodes, and core components
2. **Pods and Deployments**: Creating and managing containerized applications
3. **Services**: Exposing applications and enabling service discovery
4. **ConfigMaps and Secrets**: Managing application configuration and sensitive data
5. **Persistent Storage**: Implementing data persistence with PVs and PVCs
6. **Jobs and CronJobs**: Running batch and scheduled workloads
7. **Horizontal Pod Autoscaling**: Automatically scaling based on metrics
8. **Complete Pipelines**: Deploying multi-component data processing systems

### Key Takeaways

- Use Deployments for stateless applications and StatefulSets for stateful ones (the single-replica PostgreSQL Deployment in Exercise 8 is a simplification)
- Always define resource requests and limits
- Use ConfigMaps for configuration, Secrets for sensitive data
- Implement health checks (liveness and readiness probes)
- Design for horizontal scaling from the start
- Use namespaces for resource isolation and organization

### Production Considerations

- **Security**: Use RBAC, Network Policies, and Pod Security Standards (enforced through Pod Security Admission; PodSecurityPolicy was removed in Kubernetes 1.25)
- **Monitoring**: Implement comprehensive observability
- **High Availability**: Deploy across multiple availability zones
- **Disaster Recovery**: Regular backups and tested recovery procedures
- **GitOps**: Use tools like ArgoCD or Flux for declarative deployments

### Further Reading

- [Kubernetes Documentation](https://kubernetes.io/docs/home/)
- [Kubernetes Patterns](https://www.oreilly.com/library/view/kubernetes-patterns/9781492050278/)
- [The Kubernetes Book](https://nigelpoulton.com/books/)
- [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)