# Verify Elastic PyTorch training fault tolerance
In this notebook we will:
- Simulate a SpotVM node eviction
- Verify ElasticJob fault tolerance
- Scale the number of nodes and verify training job autoscaling


In [36]:
resource_group = "elastic-lab"   # replace with values from Step1 or use this default
aks_name = "elasticaks"          # replace with values from Step1 or use this default
aks_spot_nodepool = "spotgpu"    # replace with values from Step1 or use this default

In [37]:
# List the pods in the elastic-job namespace and the nodes they are running on
!kubectl get pods -n elastic-job -o wide

NAME                                          READY   STATUS    RESTARTS   AGE    IP            NODE                                 NOMINATED NODE   READINESS GATES
elastic-job-k8s-controller-5b9bc6b79c-xvdsw   1/1     Running   0          4h4m   10.244.13.2   aks-cpuworkers-40607851-vmss000003   <none>           <none>
etcd                                          1/1     Running   0          4h9m   10.244.12.2   aks-cpuworkers-40607851-vmss000000   <none>           <none>
imagenet-worker-0                             1/1     Running   0          76s    10.244.17.4   aks-spotgpu-40607851-vmss000001      <none>           <none>
imagenet-worker-1                             1/1     Running   0          76s    10.244.17.3   aks-spotgpu-40607851-vmss000001      <none>           <none>
imagenet-worker-2                             1/1     Running   0          76s    10.244.16.4   aks-spotgpu-40607851-vmss000000      <none>           <none>
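
To see at a glance which workers would be lost if a given spot node is evicted, the listing above can be grouped by node. The following is a minimal sketch, not part of the original lab: `pods_by_node` is a hypothetical helper, and it assumes the plain-text `-o wide` layout shown above, where no column value contains spaces. For anything beyond a quick check, parsing `kubectl get pods -o json` is likely more robust.

```python
from collections import defaultdict

def pods_by_node(kubectl_output):
    """Group pod names by node from `kubectl get pods -o wide` text output.

    Assumes whitespace-separated columns with no spaces inside values.
    """
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    node_col = header.index("NODE")  # first match, i.e. not "NOMINATED NODE"
    grouped = defaultdict(list)
    for ln in lines[1:]:
        fields = ln.split()
        grouped[fields[node_col]].append(fields[name_col])
    return dict(grouped)

# Sample rows copied from the output above
sample = """
NAME                READY   STATUS    RESTARTS   AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
imagenet-worker-0   1/1     Running   0          76s   10.244.17.4   aks-spotgpu-40607851-vmss000001   <none>           <none>
imagenet-worker-1   1/1     Running   0          76s   10.244.17.3   aks-spotgpu-40607851-vmss000001   <none>           <none>
imagenet-worker-2   1/1     Running   0          76s   10.244.16.4   aks-spotgpu-40607851-vmss000000   <none>           <none>
"""

for node, pods in pods_by_node(sample).items():
    print(node, pods)
```

In this run, evicting instance 1 (`vmss000001`) would remove `imagenet-worker-0` and `imagenet-worker-1`, leaving `imagenet-worker-2` running.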


In [38]:
# Run commands to get all variables for REST API and strip beginning and end quotes
subscription_id = !az account show --query id 
subscription_id = str(subscription_id[0][1:-1])
node_rg = !az aks show --resource-group {resource_group} --name {aks_name} --query nodeResourceGroup --only-show-errors
node_rg = str(node_rg[0][1:-1])
vmss_name = !az vmss list -g {node_rg} --query '[].name' -o tsv | grep {aks_spot_nodepool}
vmss_name = vmss_name[0]
instance_id = 1  # VMSS instance to evict; change it to target a different node
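
The cell above removes the surrounding quotes from the `az` output by slicing off the first and last characters. As an alternative, the default `az` output is JSON, so it can be decoded rather than sliced. A minimal sketch follows; `parse_az_scalar` is a hypothetical helper name, and it assumes the command prints a single JSON value.

```python
import json

def parse_az_scalar(cli_lines):
    """Decode a single JSON value printed by `az ... --query <field>`.

    cli_lines is the list of output lines that IPython returns for `!az ...`.
    """
    return json.loads("\n".join(cli_lines))

# Example with the node resource group name used in this lab
print(parse_az_scalar(['"MC_elastic-lab_elasticaks_eastus2"']))
```

In the notebook this could be used as `node_rg = parse_az_scalar(node_rg)` right after the `!az aks show` line. Passing `-o tsv` to `az`, as the `az vmss list` line already does, is another way to get unquoted output.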

For more information on SpotVM eviction, see the [Simulate an eviction docs](https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/use-spot#simulate-an-eviction).
Use the REST API to simulate a SpotVM eviction (it can also be tested using the Azure API console: https://docs.microsoft.com/en-us/rest/api/compute/virtualmachinescalesetvms/simulateeviction#code-try-0).

The HTTP response code should be 204, which means the eviction request was accepted.

In [33]:
!az rest --verbose --method POST \
   --uri 'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{node_rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmss_name}/virtualMachines/{instance_id}/simulateEviction?api-version=2020-12-01'

Request URL: 'https://management.azure.com/subscriptions/f869415f-5cff-46a3-b728-20659d14d62d/resourceGroups/MC_elastic-lab_elasticaks_eastus2/providers/Microsoft.Compute/virtualMachineScaleSets/aks-spotgpu-40607851-vmss/virtualMachines/1/simulateEviction?api-version=2020-12-01'
Request method: 'POST'
Request headers:
    'User-Agent': 'python/3.6.10 (Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid) AZURECLI/2.19.1 (DEB)'
    'Accept-Encoding': 'gzip, deflate'
    'Accept': '*/*'
    'Connection': 'keep-alive'
    'x-ms-client-request-id': 'b216b8aa-1c4b-4e13-89c1-46f88017015b'
    'CommandName': 'rest'
    'ParameterSetName': '--verbose --method --uri'
    'Authorization': 'Bearer eyJ0eXAiOiJKV...'
    'Content-Length': '0'
Request body:
None
Response status: 204
Response headers:
    'Cache-Control': 'no-cache'
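
The request URI used above follows a fixed pattern, so it can be assembled and checked before it is sent. The sketch below is illustrative only: `simulate_eviction_uri` and `eviction_accepted` are hypothetical helper names, the all-zero subscription ID is a placeholder, and the API version is the one used in the cell above.

```python
def simulate_eviction_uri(subscription_id, node_rg, vmss_name, instance_id,
                          api_version="2020-12-01"):
    """Build the simulateEviction URI for one VMSS instance."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{node_rg}"
        "/providers/Microsoft.Compute"
        f"/virtualMachineScaleSets/{vmss_name}"
        f"/virtualMachines/{instance_id}"
        f"/simulateEviction?api-version={api_version}"
    )

def eviction_accepted(status_code):
    """The lab expects HTTP 204 when the eviction request is accepted."""
    return status_code == 204

uri = simulate_eviction_uri(
    "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
    "MC_elastic-lab_elasticaks_eastus2",
    "aks-spotgpu-40607851-vmss",
    1,
)
print(uri)
print(eviction_accepted(204))
```

Printing the URI first makes it easier to spot an empty variable, for example a missing `vmss_name`, before calling `az rest`.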


In [None]:
# Verify that the node was removed from the pool and that the workers on it stopped
!kubectl get nodes
!kubectl get pods -n elastic-job

Once the node is evicted, the workers on it are deleted. The remaining workers detect that the process group has changed, and the rendezvous server adjusts the training accordingly.

Notice the restart_count, group_rank and process epoch in the logs:

In [None]:
!kubectl logs -ljob-name=imagenet -n elastic-job
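
When the logs are long, the fields mentioned above can be pulled out programmatically. This is a rough sketch under a stated assumption: it expects the values to appear as `key=value` pairs such as `restart_count=1`, and the sample text below is made up for illustration rather than captured from this job. `latest_rendezvous_fields` is a hypothetical helper name.

```python
import re

# Assumed log format: fields appear as key=<integer> pairs somewhere in the text
FIELD_RE = re.compile(r"\b(restart_count|group_rank|group_world_size)=(\d+)")

def latest_rendezvous_fields(log_text):
    """Return the most recent value seen for each field, in log order."""
    latest = {}
    for key, value in FIELD_RE.findall(log_text):
        latest[key] = int(value)
    return latest

# Illustrative sample only, not real output from this lab
sample_log = """
restart_count=0
group_rank=2
group_world_size=3
restart_count=1
group_rank=1
group_world_size=2
"""

print(latest_rendezvous_fields(sample_log))
```

In the notebook, the input could come from `logs = !kubectl logs -ljob-name=imagenet -n elastic-job` followed by `latest_rendezvous_fields("\n".join(logs))`. If the actual log format differs, the pattern would need adjusting.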

Scale up the spot node pool and note the new workers starting on the newly added nodes, with the rendezvous process adjusting accordingly.

In [35]:
!az aks nodepool scale --cluster-name {aks_name} \
                      --name {aks_spot_nodepool} \
                      --resource-group {resource_group} \
                      --node-count 1

The behavior of this command has been altered by the following extension: aks-preview
Cannot scale cluster autoscaler enabled node pool.

In [None]:
!kubectl logs -ljob-name=imagenet -n elastic-job 

## Summary

This lab has demonstrated Torch Elastic's capabilities and its ability to run fault-tolerant elastic training.

Great job completing it.