# Model Deployment on Kubernetes
This demo walks through different ways to run model deployments on Kubernetes.

The HPA example needs metrics-server installed in the cluster. For install notes, see:
https://github.com/kubernetes-sigs/metrics-server/issues/196

## Having a Look Around
The `manifests` directory contains the following files:

```bash
manifests
├── deployment.yaml
├── hpa.yaml
└── service.yaml
```

Deploy it!
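
One way to deploy everything is to apply the whole directory in a single command. The sketch below wraps that in a small shell function. The name `deploy_manifests` is hypothetical, and it assumes `kubectl` is already configured for your cluster.

```bash
# Hypothetical helper: apply every manifest in a directory, then list the
# resource types this demo creates. Assumes kubectl points at your cluster.
deploy_manifests() {
  local dir="${1:-manifests}"
  kubectl apply -f "$dir" || return 1
  kubectl get deployments,services,hpa
}
```

Run it from the repository root with `deploy_manifests manifests`.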

## Horizontal Pod Autoscaler
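The Horizontal Pod Autoscaler (HPA) automatically scales the number of pod replicas in a workload up and down based on observed metrics such as CPU utilization.

This is especially useful for ML/AI workloads, where inference traffic can vary widely.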

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

`scaleTargetRef` identifies the resource to scale, here the `ml-model-deployment` Deployment.

`minReplicas` and `maxReplicas` set the lower and upper bounds on the number of running pods, whether the load is light or heavy.

`metrics` defines which metric to watch in order to trigger scaling events. Here the target is an average CPU utilization of 70%, measured relative to each pod's CPU request.
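
To get a feel for how these fields interact, here is a rough sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × observed metric / target metric), clamped to the min/max bounds. The function name is hypothetical, the defaults mirror the manifest above, and the real controller also applies a tolerance and stabilization windows that this sketch ignores.

```bash
# Sketch of the HPA scaling rule: desired = ceil(current * observed / target),
# clamped to [min, max]. Ignores the controller's tolerance and stabilization.
hpa_desired_replicas() {
  local current="$1" observed="$2" target="${3:-70}" min="${4:-3}" max="${5:-10}"
  # Integer ceiling division.
  local desired=$(( (current * observed + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired="$min"
  [ "$desired" -gt "$max" ] && desired="$max"
  echo "$desired"
}

hpa_desired_replicas 3 140   # 3 pods at 140% average CPU -> 6
hpa_desired_replicas 5 200   # would be 15, capped at maxReplicas -> 10
hpa_desired_replicas 3 35    # would be 2, raised to minReplicas -> 3
```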

## Node Affinity
Node affinity ensures that pods are scheduled on specific nodes by matching labels on the nodes with rules defined in the pod's configuration. 

It provides control over where workloads run, enabling optimization for hardware or specialized requirements, such as scheduling ML workloads on GPU-enabled nodes.


```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - "node02"
      containers:
      - name: ml-model-container
        image: wbassler/mobilenetv3lg-flask:v1.0
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "200m"
            memory: "250Mi"
          limits:
            cpu: "200m"
            memory: "250Mi"
```
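
After applying a Deployment with a node affinity rule, it is worth checking where the pods actually landed. A minimal sketch, assuming the `app=ml-model` label from the manifest above; the function name `pods_by_node` is hypothetical.

```bash
# Hypothetical helper: show each pod matching a label selector together with
# the node it was scheduled on and its current phase.
pods_by_node() {
  local selector="${1:-app=ml-model}"
  kubectl get pods -l "$selector" \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
}
```

Because the rule is `requiredDuringSchedulingIgnoredDuringExecution`, pods that cannot be placed on a matching node stay `Pending` rather than falling back to another node.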

## Node Selector
Another way to schedule pods to specific nodes is through the use of `nodeSelector`. 

`nodeSelector` is less flexible: it is a simple key-value match against node labels, whereas node affinity supports richer rules such as set-based operators (`In`, `NotIn`, `Exists`) and soft preferences.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      nodeSelector:
        kubernetes.io/hostname: "node02"
      containers:
      - name: ml-model-container
        image: wbassler/mobilenetv3lg-flask:v1.0
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "200m"
            memory: "250Mi"
          limits:
            cpu: "200m"
            memory: "250Mi"
```

## Taints
Taints prevent pods from being scheduled on specific nodes unless the pods explicitly tolerate the taint. 

They are used to reserve nodes for specialized workloads or to isolate certain nodes.

```bash
kubectl taint nodes node02 role=pytorch:NoSchedule
```

```bash
kubectl describe node node02 | grep Taints
```
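
To undo the taint later, the same `kubectl taint` command accepts the taint with a trailing `-`. A small sketch wrapping both directions; the function names are hypothetical.

```bash
# Hypothetical helpers around the taint used in this demo.
taint_node()   { kubectl taint nodes "$1" role=pytorch:NoSchedule; }
# A trailing "-" removes the matching taint from the node.
untaint_node() { kubectl taint nodes "$1" role=pytorch:NoSchedule-; }
```

For example, `untaint_node node02` removes the taint added above.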

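## Tolerations
A toleration in the pod spec lets the pod be scheduled onto a node with a matching taint. It does not attract the pod to that node, so this example also uses `nodeSelector` to pin the pod to `node02`.
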
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      tolerations:
      - key: "role"
        operator: "Equal"
        value: "pytorch"
        effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/hostname: "node02"
      containers:
      - name: ml-model-container
        image: wbassler/mobilenetv3lg-flask:v1.0
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "200m"
            memory: "250Mi"
          limits:
            cpu: "200m"
            memory: "250Mi"
```


## Assigning GPUs
Using GPU-enabled nodes requires some additional configuration on your cluster.

For NVIDIA:

1) The cluster has GPU-enabled nodes, and the NVIDIA drivers are installed on those nodes.

2) The NVIDIA device plugin is installed. More information: [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).

3) Update your deployment as shown below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      tolerations:
      - key: "role"
        operator: "Equal"
        value: "pytorch"
        effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/hostname: "node02"
      containers:
      - name: ml-model-container
        image: wbassler/mobilenetv3lg-flask:v1.0
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "200m"
            memory: "250Mi"
            nvidia.com/gpu: 1    # Request 1 GPU
          limits:
            cpu: "200m"
            memory: "250Mi"
            nvidia.com/gpu: 1    # Limit of 1 GPU; must equal the request
```

NOTE: Ensure that your application is configured to leverage GPU acceleration, e.g. that it has access to the NVIDIA drivers and libraries it needs.
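
Before deploying, you can check whether the device plugin is advertising GPUs to the scheduler. A minimal sketch, assuming the `nvidia.com/gpu` resource name used above; the function name `gpu_capacity` is hypothetical.

```bash
# Hypothetical helper: list each node with its allocatable NVIDIA GPU count.
# Nodes that do not advertise the resource show <none>.
gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

If `node02` shows `<none>` in the GPU column, a pod requesting `nvidia.com/gpu: 1` cannot be scheduled there and will stay `Pending`.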