This is particularly important on Nimbus IaaS, where termination is synchronous.
We observed a list_nodes query to Amazon EC2 hang for 60 seconds and then report InternalError. Such queries can slow down launches, so catch them earlier with a default timeout of 10 seconds.
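A minimal sketch of the timeout idea, not the actual provisioner code: it runs an arbitrary query callable in a worker thread and gives up after a deadline. The names query_with_timeout, QueryTimeout and DEFAULT_QUERY_TIMEOUT are hypothetical; only the 10-second default comes from the message above.

```python
import concurrent.futures

# Default taken from the commit message; the constant name is an assumption.
DEFAULT_QUERY_TIMEOUT = 10.0  # seconds


class QueryTimeout(Exception):
    """Raised when an IaaS query does not return before the deadline."""


def query_with_timeout(query, timeout=DEFAULT_QUERY_TIMEOUT):
    """Run query() in a worker thread and stop waiting after `timeout` seconds.

    `query` is any zero-argument callable, for example a wrapper around a
    driver's list_nodes call (the wrapper itself is not shown here).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(query)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise QueryTimeout("IaaS query exceeded %.1f seconds" % timeout)
    finally:
        # Do not block on a hung query; the worker thread is left to finish
        # (or fail) on its own.
        executor.shutdown(wait=False)
```

Note that this only stops the caller from waiting; the underlying request keeps running in its thread until the remote side answers or the connection drops.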
During large runs, a few EC2 instances change status to running (in the EC2 sense; the EPU equivalent is STARTED) long after they were requested: up to 15 minutes, compared to about 30 seconds normally. These instances booted successfully long before their state change and are already fully contextualized. However, we currently don't check the context of instances that are still PENDING. This commit lets us move these instances to RUNNING when the context broker reports OK for them.
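The state change described above could look roughly like the following sketch. It assumes node records are plain dicts with "node_id" and "state" keys and that the context broker result has already been reduced to a set of node IDs reported OK; both the record layout and the function name are illustrative assumptions, not the real data model.

```python
# State names follow the commit message; the record layout is an assumption.
PENDING = "PENDING"
RUNNING = "RUNNING"


def promote_contextualized_nodes(nodes, context_ok_ids):
    """Move PENDING nodes to RUNNING when the context broker reported OK.

    nodes: list of dicts with "node_id" and "state" keys (hypothetical layout).
    context_ok_ids: set of node IDs the context broker reported OK for.
    Returns the IDs of the nodes that were promoted.
    """
    promoted = []
    for node in nodes:
        # Only nodes still waiting on the IaaS state change are affected;
        # nodes in any other state are left untouched.
        if node["state"] == PENDING and node["node_id"] in context_ok_ids:
            node["state"] = RUNNING
            promoted.append(node["node_id"])
    return promoted
```

With this shape, a node that the IaaS still lists as pending no longer has to wait for the delayed status change once its contextualization is known to have succeeded.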
We assume that the termination happened and mark the node as TERMINATED.
- leader continually respawned termination threads
- conflicts adding node and launch records were not handled correctly
lacks participant tracking (for disabled_all_agreed) and canceling the leader on connection failure.
Instances will jump straight to RUNNING as soon as the IaaS reports them running.