# Verify Elastic PyTorch training fault tolerance
In this notebook we will:
- Simulate a Spot VM node eviction
- Verify ElasticJob fault tolerance
- Scale the number of nodes and verify training job autoscaling


In [1]:
resource_group = "elastic-lab"   # replace with values from Step1 or use this default
aks_name = "elasticaks"          # replace with values from Step1 or use this default
aks_spot_nodepool = "spotgpu"    # replace with values from Step1 or use this default

In [2]:
# List the cluster nodes and the pods currently running on the Spot VM nodes
!kubectl get nodes
!kubectl get pods -n elastic-job -o wide

NAME                                 STATUS   ROLES   AGE   VERSION
aks-cpuworkers-40607851-vmss000000   Ready    agent   45h   v1.18.17
aks-cpuworkers-40607851-vmss000003   Ready    agent   45h   v1.18.17
aks-nodepool1-40607851-vmss000000    Ready    agent   46h   v1.18.17
aks-spotgpu-40607851-vmss000000      Ready    agent   38h   v1.18.17
aks-spotgpu-40607851-vmss000002      Ready    agent   38h   v1.18.17
NAME                                          READY   STATUS    RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          42h     10.244.13.2   aks-cpuworkers-40607851-vmss000003   <none>           <none>
etcd                                          1/1     Running   0          43h     10.244.12.2   aks-cpuworkers-40607851-vmss000000   <none>           <none>
imagenet-worker-0                             1/1     Running   0          2m35s   10.244.16.9   aks-spo

In [17]:
# Collect the variables needed for the REST API call; az prints JSON strings, so strip the surrounding quotes
subscription_id = !az account show --query id
subscription_id = str(subscription_id[0][1:-1])
node_rg = !az aks show --resource-group {resource_group} --name {aks_name} --query nodeResourceGroup --only-show-errors
node_rg = str(node_rg[0][1:-1])
# Look up the Spot scale set in the node resource group resolved above instead of a hard-coded name
vmss_name = !az vmss list -g {node_rg} --query '[].name' -o tsv | grep spot
vmss_name = vmss_name[0]
instance_id = 1  # Change this if you want to evict a different node instance
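
The quote stripping in the cell above relies on `az` printing a JSON string on a single line. A slightly more defensive alternative, sketched here with a hypothetical helper name (`decode_az_json`) and a placeholder value, is to decode the captured lines with `json.loads` instead of slicing characters.

```python
import json

def decode_az_json(lines):
    """Decode the lines captured from an `az ... -o json` call into a Python value.

    `lines` is the list of output lines that IPython returns for `x = !command`.
    """
    return json.loads("\n".join(lines))

# Placeholder input that mimics the single quoted line az prints for a string query
example = decode_az_json(['"MC_example_rg"'])
print(example)
```

This also works when the command returns a JSON list or object spread over several lines.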

For information on Spot VM eviction, see the [Simulate an eviction docs](https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/use-spot#simulate-an-eviction).
Use the REST API to simulate a Spot VM eviction (you can also test it from the Azure API console: https://docs.microsoft.com/en-us/rest/api/compute/virtualmachinescalesetvms/simulateeviction#code-try-0).

The HTTP response code should be 204, indicating that the eviction event was accepted.
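The request URI can also be assembled in plain Python first, which makes it easy to inspect before it is passed to `az rest`. The following is a minimal sketch: the helper name `build_simulate_eviction_uri` is hypothetical, and the argument values are placeholders rather than real resource names.

```python
from urllib.parse import quote

def build_simulate_eviction_uri(subscription_id, node_rg, vmss_name, instance_id,
                                api_version="2020-12-01"):
    """Return the management URI of the simulateEviction action for one scale set instance."""
    path = (
        f"/subscriptions/{quote(subscription_id)}"
        f"/resourceGroups/{quote(node_rg)}"
        f"/providers/Microsoft.Compute/virtualMachineScaleSets/{quote(vmss_name)}"
        f"/virtualMachines/{quote(str(instance_id))}/simulateEviction"
    )
    return f"https://management.azure.com{path}?api-version={api_version}"

# Placeholder values for illustration only
uri = build_simulate_eviction_uri(
    "00000000-0000-0000-0000-000000000000",
    "MC_example_rg",
    "aks-spotgpu-example-vmss",
    1,
)
print(uri)
```

Printing the URI before sending the request is a cheap way to confirm that the instance ID points at the node you intend to evict.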

In [18]:
!az rest --verbose --method POST \
   --uri 'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{node_rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmss_name}/virtualMachines/{instance_id}/simulateEviction?api-version=2020-12-01'

Request URL: 'https://management.azure.com/subscriptions/f869415f-5cff-46a3-b728-20659d14d62d/resourceGroups/MC_elastic-lab_elasticaks_eastus2/providers/Microsoft.Compute/virtualMachineScaleSets/aks-spotgpu-40607851-vmss/virtualMachines/2/simulateEviction?api-version=2020-12-01'
Request method: 'POST'
Request headers:
    'User-Agent': 'python/3.6.10 (Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid) AZURECLI/2.19.1 (DEB)'
    'Accept-Encoding': 'gzip, deflate'
    'Accept': '*/*'
    'Connection': 'keep-alive'
    'x-ms-client-request-id': 'd2d5b243-406e-446b-949b-58f94664cc1e'
    'CommandName': 'rest'
    'ParameterSetName': '--verbose --method --uri'
    'Authorization': 'Bearer eyJ0eXAiOiJKV...'
    'Content-Length': '0'
Request body:
None
Response status: 204
Response headers:
    'Cache-Control': 'no-cache'


The eviction event has been sent, and it might take a minute or so for the node to be removed. Run the next cell a few times until you see that the node has been removed and some of the worker Pods are in the `Pending` state.
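Instead of re-running the cell by hand, the check can be polled from Python. This is a rough sketch, assuming `kubectl` is configured for the cluster and that the workers carry the `job-name=imagenet` label used elsewhere in this notebook; the helper names, timeout, and interval are illustrative choices, not part of the lab.

```python
import json
import subprocess
import time
from collections import Counter

def pod_phases(pods_json):
    """Count pods per phase from the output of `kubectl get pods -o json`."""
    items = json.loads(pods_json).get("items", [])
    return Counter(item["status"]["phase"] for item in items)

def fetch_pods_json(namespace="elastic-job", selector="job-name=imagenet"):
    """Return the raw JSON for the selected pods (requires a configured kubectl)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "json"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout

def wait_for_pending(fetch=fetch_pods_json, timeout_s=300, interval_s=15):
    """Poll until at least one pod is Pending; raise TimeoutError after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        phases = pod_phases(fetch())
        if phases.get("Pending", 0) > 0:
            return phases
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no Pending pods after {timeout_s}s: {dict(phases)}")
        time.sleep(interval_s)
```

Calling `wait_for_pending()` in a cell blocks until a worker shows up as `Pending` or the timeout expires. The `fetch` parameter exists so the loop can be exercised with canned JSON without touching a cluster.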

In [20]:
# Verify the node was removed from the pool and the workers on that node stopped
!kubectl get nodes
!kubectl get pods -n elastic-job -o wide

NAME                                 STATUS     ROLES   AGE   VERSION
aks-cpuworkers-40607851-vmss000000   Ready      agent   46h   v1.18.17
aks-cpuworkers-40607851-vmss000003   Ready      agent   45h   v1.18.17
aks-nodepool1-40607851-vmss000000    Ready      agent   46h   v1.18.17
aks-spotgpu-40607851-vmss000002      NotReady   agent   38h   v1.18.17
NAME                                          READY   STATUS    RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          43h     10.244.13.2   aks-cpuworkers-40607851-vmss000003   <none>           <none>
etcd                                          1/1     Running   0          43h     10.244.12.2   aks-cpuworkers-40607851-vmss000000   <none>           <none>
imagenet-worker-0                             0/1     Pending   0          3m33s   <none>        <none>                               <none>           <none>
image

Once the node is evicted, the workers on the removed node are deleted. The remaining workers detect that the process group has changed, and the rendezvous server adjusts the training accordingly.

Notice restart_count, group_rank, and the epoch being processed in the logs:

In [57]:
!kubectl logs -l job-name=imagenet -n elastic-job --since 10m

=> using checkpoint from rank: 1, max_epoch: 0
=> checkpoint broadcast size is: 93588276
=> done broadcasting checkpoint
=> done restoring from previous checkpoint
=> start_epoch: 1, best_acc1: 0.6399999856948853
Epoch: [1][  0/782]	Time  4.830 ( 4.830)	Data  1.852 ( 1.852)	Loss 4.6512e+00 (4.6512e+00)	Acc@1   6.25 (  6.25)	Acc@5  17.19 ( 17.19)
Epoch: [1][ 10/782]	Time  2.197 ( 2.201)	Data  1.704 ( 1.494)	Loss 4.1309e+00 (4.3900e+00)	Acc@1  12.50 (  8.95)	Acc@5  34.38 ( 25.85)
Epoch: [1][ 20/782]	Time  1.499 ( 2.043)	Data  1.100 ( 1.457)	Loss 4.3086e+00 (4.4244e+00)	Acc@1   6.25 (  8.71)	Acc@5  25.00 ( 24.48)
Epoch: [1][ 30/782]	Time  1.606 ( 1.939)	Data  1.210 ( 1.380)	Loss 4.4815e+00 (4.4541e+00)	Acc@1   4.69 (  8.11)	Acc@5  23.44 ( 23.94)
Epoch: [1][ 40/782]	Time  1.820 ( 1.890)	Data  1.234 ( 1.336)	Loss 4.4205e+00 (4.4695e+00)	Acc@1  12.50 (  8.08)	Acc@5  25.00 ( 23.86)
=> checkpoint broadcast size is: 93588276
=> done broadcasting checkpoint
=> done restoring from previous checkp
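
To see at a glance where training resumed after a restart, the log text can be summarized with a couple of regular expressions. The sketch below is illustrative only: it assumes the line formats shown in the output above, and the function name `summarize_training_log` is hypothetical.

```python
import re

# Patterns based on the log lines shown in this notebook
EPOCH_RE = re.compile(r"^Epoch: \[(\d+)\]\[\s*(\d+)/(\d+)\]")
START_RE = re.compile(r"^=> start_epoch: (\d+), best_acc1: ([\d.eE+-]+)")

def summarize_training_log(log_text):
    """Return the checkpoint restores and the last (epoch, step, total_steps) seen."""
    restores = []
    latest_step = None
    for line in log_text.splitlines():
        match = START_RE.match(line)
        if match:
            restores.append((int(match.group(1)), float(match.group(2))))
            continue
        match = EPOCH_RE.match(line)
        if match:
            latest_step = tuple(int(group) for group in match.groups())
    return {"restores": restores, "latest_step": latest_step}

sample = """=> done restoring from previous checkpoint
=> start_epoch: 1, best_acc1: 0.6399999856948853
Epoch: [1][  0/782]\tTime  4.830 ( 4.830)
Epoch: [1][ 40/782]\tTime  1.820 ( 1.890)
"""
summary = summarize_training_log(sample)
print(summary)
```

Each entry in `restores` corresponds to one recovery from a checkpoint, so more than one entry in a worker's log suggests the worker group was re-formed more than once.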

Scale up the Spot node pool and note the new workers starting on the newly added nodes, with the rendezvous process adjusting accordingly.

In [21]:
!az aks nodepool scale --cluster-name {aks_name} \
                      --name {aks_spot_nodepool} \
                      --resource-group {resource_group} \
                      --node-count 2

The behavior of this command has been altered by the following extension: aks-preview
{
  "agentPoolType": "VirtualMachineScaleSets",
  "availabilityZones": null,
  "count": 2,
  "enableAutoScaling": null,
  "enableEncryptionAtHost": false,
  "enableFips": false,
  "enableNodePublicIp": false,
  "gpuInstanceProfile": null,
  "id": "/subscriptions/f869415f-5cff-xxxx-xxx-20659d14d62d/resourcegroups/elastic-lab/providers/Microsoft.ContainerService/managedClusters/elasticaks/agentPools/spotgpu",
  "kubeletConfig": null,
  "kubeletDiskType": "OS",
  "linuxOsConfig": null,
  "maxCount": null,
  "maxPods": 110,
  "minCount": null,
  "mode": "User",
  "name": "spotgpu",
  "nodeImageVersion": "AKSUbuntu-1804gpu-2021.05.01",
  "nodeLabels": {
    "kubernetes.azure.com/scalesetpriority": "spot"
  },
  "nodePublicIpPrefixId": null,
  "nodeTaints": [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ],
  "orchestratorVersion": "1.18.17",
  "osDiskSizeGb": 128,
  "osDiskType": "

In [23]:
# Wait for the node to be added and all pods to become ready
!kubectl wait -l job-name=imagenet pods -n elastic-job --for condition=ready
# Verify the node was added to the pool and all workers are running
!kubectl get nodes
!kubectl get pods -n elastic-job -o wide

pod/imagenet-worker-0 condition met
pod/imagenet-worker-1 condition met
pod/imagenet-worker-2 condition met
NAME                                 STATUS   ROLES   AGE   VERSION
aks-cpuworkers-40607851-vmss000000   Ready    agent   46h   v1.18.17
aks-cpuworkers-40607851-vmss000003   Ready    agent   46h   v1.18.17
aks-nodepool1-40607851-vmss000000    Ready    agent   47h   v1.18.17
aks-spotgpu-40607851-vmss000004      Ready    agent   12m   v1.18.17
aks-spotgpu-40607851-vmss000005      Ready    agent   12m   v1.18.17
NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                                 NOMINATED NODE   READINESS GATES
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          43h   10.244.13.2   aks-cpuworkers-40607851-vmss000003   <none>           <none>
etcd                                          1/1     Running   0          43h   10.244.12.2   aks-cpuworkers-40607851-vmss000000   <none>           <none>
im

In [63]:
# Verify a new rendezvous group was created and the workers readjusted
!kubectl logs imagenet-worker-0 -n elastic-job 

[INFO] 2021-05-24 05:27:12,911 launch: Running torchelastic.distributed.launch with args: ['/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py', '--rdzv_backend=etcd', '--rdzv_endpoint=etcd-service:2379', '--rdzv_id=imagenet', '--nnodes=1:3', '--nproc_per_node=1', '/workspace/examples/imagenet/main.py', '--arch=resnet18', '--epochs=3', '--batch-size=64', '--workers=0', '/workspace/data/tiny-imagenet-200', '--checkpoint-file=/mnt/blob/data/checkpoint.pth.tar']
INFO 2021-05-24 05:27:12,921 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2021-05-24 05:27:12,937 launch: Using nproc_per_node=1.
[INFO] 2021-05-24 05:27:13,674 api: [default] starting workers for function: wrapper_fn
[INFO] 2021-05-24 05:27:13,674 api: [default] Rendezvous'ing worker group
INFO 2021-05-24 05:27:13,674 Attempting to join next rendezvous
INFO 2021-05-24 05:27:13,678 Observed existing rendezvous state: {'status': 'final', 'version': '9', 'participants': [0, 1], 'keep_alives': ['/torchelastic/

## Summary

This lab demonstrated TorchElastic's ability to run fault-tolerant elastic training.

Great job completing it.