# Verify Elastic PyTorch training fault tolerance
In this notebook we will:
- Simulate a Spot VM node eviction
- Verify ElasticJob fault tolerance
- Scale the number of nodes and verify that the training job autoscales


In [41]:
resource_group = "elastic-lab"   # replace with values from Step1 or use this default
aks_name = "elasticaks"          # replace with values from Step1 or use this default
aks_spot_nodepool = "spotgpu"    # replace with values from Step1 or use this default

In [45]:
# List the cluster nodes and the pods currently running in the elastic-job namespace
!kubectl get nodes
!kubectl get pods -n elastic-job -o wide

NAME                                 STATUS   ROLES   AGE     VERSION
aks-cpuworkers-40607851-vmss000000   Ready    agent   7h13m   v1.18.17
aks-cpuworkers-40607851-vmss000003   Ready    agent   6h54m   v1.18.17
aks-nodepool1-40607851-vmss000000    Ready    agent   7h57m   v1.18.17
aks-spotgpu-40607851-vmss000000      Ready    agent   27m     v1.18.17
aks-spotgpu-40607851-vmss000001      Ready    agent   27m     v1.18.17
NAME                                          READY   STATUS    RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          4h25m   10.244.13.2   aks-cpuworkers-40607851-vmss000003   <none>           <none>
etcd                                          1/1     Running   0          4h31m   10.244.12.2   aks-cpuworkers-40607851-vmss000000   <none>           <none>
imagenet-worker-0                             1/1     Running   0          23m     10.244.17

In [46]:
# Collect the variables needed for the REST API call; az returns JSON strings, so strip the surrounding quotes
subscription_id = !az account show --query id
subscription_id = str(subscription_id[0][1:-1])
node_rg = !az aks show --resource-group {resource_group} --name {aks_name} --query nodeResourceGroup --only-show-errors
node_rg = str(node_rg[0][1:-1])
# Use the node resource group looked up above instead of a hard-coded name
vmss_name = !az vmss list -g {node_rg} --query '[].name' -o tsv | grep spot
vmss_name = vmss_name[0]
instance_id = 1  # change this if you want to evict a different node instance

For background on Spot VM eviction, see the [Simulate an eviction docs](https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/use-spot#simulate-an-eviction).
Use the REST API to simulate a Spot VM eviction (it can also be tested from the Azure API console: https://docs.microsoft.com/en-us/rest/api/compute/virtualmachinescalesetvms/simulateeviction#code-try-0).

The HTTP response status code should be 204, which means the eviction request was accepted.
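
The request URI used in the eviction call has a fixed shape, so it can help to assemble and check it in Python before sending anything. The sketch below is only an illustration: the helper name `build_simulate_eviction_uri` is hypothetical, and the `api_version` default simply mirrors the value used in this notebook.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"


def build_simulate_eviction_uri(subscription_id, node_rg, vmss_name,
                                instance_id, api_version="2020-12-01"):
    """Assemble the simulateEviction URI for one scale set instance.

    Hypothetical helper; it only formats the URI and sends no request.
    """
    parts = {
        "subscription_id": subscription_id,
        "node_rg": node_rg,
        "vmss_name": vmss_name,
        "instance_id": instance_id,
    }
    for name, value in parts.items():
        # An empty value usually means an earlier az command returned nothing.
        if value is None or str(value).strip() == "":
            raise ValueError(f"{name} is empty; re-run the variable cell")

    def enc(value):
        # Percent-encode each path segment so stray characters cannot break the URI.
        return quote(str(value), safe="")

    path = (
        f"/subscriptions/{enc(subscription_id)}"
        f"/resourceGroups/{enc(node_rg)}"
        f"/providers/Microsoft.Compute/virtualMachineScaleSets/{enc(vmss_name)}"
        f"/virtualMachines/{enc(instance_id)}/simulateEviction"
    )
    return f"{ARM_ENDPOINT}{path}?api-version={api_version}"
```

Printing the result before running the `az rest` cell makes it easy to spot an empty `vmss_name` or a wrong `instance_id`.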

In [47]:
!az rest --verbose --method POST \
   --uri 'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{node_rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmss_name}/virtualMachines/{instance_id}/simulateEviction?api-version=2020-12-01'

Request URL: 'https://management.azure.com/subscriptions/f869415f-5cff-46a3-b728-20659d14d62d/resourceGroups/MC_elastic-lab_elasticaks_eastus2/providers/Microsoft.Compute/virtualMachineScaleSets/aks-spotgpu-40607851-vmss/virtualMachines/1/simulateEviction?api-version=2020-12-01'
Request method: 'POST'
Request headers:
    'User-Agent': 'python/3.6.10 (Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid) AZURECLI/2.19.1 (DEB)'
    'Accept-Encoding': 'gzip, deflate'
    'Accept': '*/*'
    'Connection': 'keep-alive'
    'x-ms-client-request-id': '786e7e09-9026-41c6-9e4f-58c9eb0eaa88'
    'CommandName': 'rest'
    'ParameterSetName': '--verbose --method --uri'
    'Authorization': 'Bearer eyJ0eXAiOiJKV...'
    'Content-Length': '0'
Request body:
None
Response status: 204
Response headers:
    'Cache-Control': 'no-cache'


The eviction event has been sent, and it may take a minute or so for the node to be removed. Re-run the cell below a few times until the node is gone and some of the worker pods are in the `Pending` state.

In [52]:
# Verify the node was removed from the pool and the workers on that node stopped
!kubectl get nodes
!kubectl get pods -n elastic-job

NAME                                 STATUS   ROLES   AGE     VERSION
aks-cpuworkers-40607851-vmss000000   Ready    agent   7h15m   v1.18.17
aks-cpuworkers-40607851-vmss000003   Ready    agent   6h56m   v1.18.17
aks-nodepool1-40607851-vmss000000    Ready    agent   7h59m   v1.18.17
aks-spotgpu-40607851-vmss000000      Ready    agent   29m     v1.18.17
NAME                                          READY   STATUS    RESTARTS   AGE
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          4h27m
etcd                                          1/1     Running   0          4h33m
imagenet-worker-0                             0/1     Pending   0          11s
imagenet-worker-1                             1/1     Running   0          11s
imagenet-worker-2                             1/1     Running   0          25m
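
Instead of reading the table by eye on every re-run, the pod listing can be parsed in Python. This is a minimal sketch that assumes the default `kubectl get pods` column layout shown above; the function names and the `imagenet-worker-` prefix default are illustrative choices, not part of kubectl or ElasticJob.

```python
def parse_pod_phases(kubectl_output):
    """Map pod name -> STATUS column from `kubectl get pods` table output."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    # Locate the columns by header name rather than by fixed position.
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    phases = {}
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) > max(name_idx, status_idx):
            phases[cols[name_idx]] = cols[status_idx]
    return phases


def pending_workers(kubectl_output, prefix="imagenet-worker-"):
    """Return the sorted names of worker pods whose status is Pending."""
    phases = parse_pod_phases(kubectl_output)
    return sorted(
        name for name, status in phases.items()
        if name.startswith(prefix) and status == "Pending"
    )


# In a notebook cell (assumes kubectl is configured for the cluster):
#   out = !kubectl get pods -n elastic-job
#   print(pending_workers("\n".join(out)))
```

A non-empty result means at least one worker lost its node and is waiting to be rescheduled.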


Once the node is evicted, the workers on the removed node are deleted. The remaining workers detect that the process group has changed, and the rendezvous server adjusts the training accordingly.

Notice the restart_count, group_rank and the epoch being processed in the logs:

In [57]:
!kubectl logs -l job-name=imagenet -n elastic-job --since 10m

=> using checkpoint from rank: 1, max_epoch: 0
=> checkpoint broadcast size is: 93588276
=> done broadcasting checkpoint
=> done restoring from previous checkpoint
=> start_epoch: 1, best_acc1: 0.6399999856948853
Epoch: [1][  0/782]	Time  4.830 ( 4.830)	Data  1.852 ( 1.852)	Loss 4.6512e+00 (4.6512e+00)	Acc@1   6.25 (  6.25)	Acc@5  17.19 ( 17.19)
Epoch: [1][ 10/782]	Time  2.197 ( 2.201)	Data  1.704 ( 1.494)	Loss 4.1309e+00 (4.3900e+00)	Acc@1  12.50 (  8.95)	Acc@5  34.38 ( 25.85)
Epoch: [1][ 20/782]	Time  1.499 ( 2.043)	Data  1.100 ( 1.457)	Loss 4.3086e+00 (4.4244e+00)	Acc@1   6.25 (  8.71)	Acc@5  25.00 ( 24.48)
Epoch: [1][ 30/782]	Time  1.606 ( 1.939)	Data  1.210 ( 1.380)	Loss 4.4815e+00 (4.4541e+00)	Acc@1   4.69 (  8.11)	Acc@5  23.44 ( 23.94)
Epoch: [1][ 40/782]	Time  1.820 ( 1.890)	Data  1.234 ( 1.336)	Loss 4.4205e+00 (4.4695e+00)	Acc@1  12.50 (  8.08)	Acc@5  25.00 ( 23.86)
=> checkpoint broadcast size is: 93588276
=> done broadcasting checkpoint
=> done restoring from previous checkp
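
To see at a glance where training resumed after the eviction, the log lines above can be summarized with a couple of regular expressions. This is a rough sketch that assumes the line formats shown in this output (`=> start_epoch: ...` and `Epoch: [e][ i/n]`); the function name `summarize_log` is hypothetical, and because logs selected by label can contain output from several pods, treat the result as an approximate view.

```python
import re

# Matches e.g. "Epoch: [1][ 10/782]  Time ..."
EPOCH_RE = re.compile(r"^Epoch: \[(\d+)\]\[\s*(\d+)/(\d+)\]")
# Matches e.g. "=> start_epoch: 1, best_acc1: 0.6399999856948853"
START_RE = re.compile(r"^=> start_epoch: (\d+), best_acc1: ([0-9.eE+-]+)")


def summarize_log(log_text):
    """Collect checkpoint restores and the last progress line seen."""
    restores = []   # (start_epoch, best_acc1) for every restore found
    latest = None   # (epoch, iteration, iterations_per_epoch)
    for line in log_text.splitlines():
        m = START_RE.match(line)
        if m:
            restores.append((int(m.group(1)), float(m.group(2))))
            continue
        m = EPOCH_RE.match(line)
        if m:
            latest = tuple(int(g) for g in m.groups())
    return {"restores": restores, "latest_progress": latest}
```

Each entry in `restores` corresponds to a worker reloading the checkpoint, which is what you would expect to see after the rendezvous group changes.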

Scale up the Spot node pool and note the new workers starting on the newly added nodes, with the rendezvous process adjusting accordingly.

In [58]:
!az aks nodepool scale --cluster-name {aks_name} \
                      --name {aks_spot_nodepool} \
                      --resource-group {resource_group} \
                      --node-count 2

The behavior of this command has been altered by the following extension: aks-preview
The new node count is the same as the current node count.

In [None]:
# Verify the node was added to the pool and all workers are running
!kubectl get nodes
!kubectl get pods -n elastic-job

In [None]:
# Verify a new rendezvous group was created and the workers readjusted
!kubectl logs -l job-name=imagenet -n elastic-job

## Summary

This lab demonstrated Torch Elastic capabilities and the ability to run fault-tolerant elastic training.

Great job completing it.