
Handle errors from AWS on creating instance #464

Merged: 16 commits into paris-saclay-cds:master, Nov 16, 2020

Conversation

@maikia (Contributor) commented Nov 13, 2020

Add handling of errors thrown by AWS when creating a spot instance

  • tests


@codecov bot commented Nov 13, 2020

Codecov Report

Merging #464 (aecf82a) into master (9565765) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
+ Coverage   93.37%   93.40%   +0.02%     
==========================================
  Files          99       99              
  Lines        8337     8367      +30     
==========================================
+ Hits         7785     7815      +30     
  Misses        552      552              
Impacted Files Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py 59.75% <100.00%> (+23.21%) ⬆️


@maikia (Contributor, Author) commented Nov 13, 2020

I still need to handle/log some errors at the dispatcher level, but if you could already review this one it would move things forward a bit faster.

Thanks in advance.
@agramfort @lucyleeow @glemaitre @tomMoral

@maikia changed the title from "WIP Handle erros from AWS on creating instance" to "Handle erros from AWS on creating instance" on Nov 13, 2020
@maikia changed the title from "Handle erros from AWS on creating instance" to "Handle errors from AWS on creating instance" on Nov 13, 2020
@tomMoral (Collaborator) left a comment


I don't know the code well enough to tell you if the test is good, but it seems nice to have a retry mechanism.

Just a small comment on the while loop, which I suggest replacing with a for loop for more readable code, but it might be personal taste :)

ramp-engine/ramp_engine/aws/api.py (outdated)
@@ -55,6 +55,10 @@
TRAIN_LOOP_INTERVAL_SECS_FIELD = 'train_loop_interval_secs'
MEMORY_PROFILING_FIELD = 'memory_profiling'

# how long to wait for connections
WAIT_MINUTES = 10
MAX_TRIES_TO_CONNECT = 6
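For context, these constants allow up to MAX_TRIES_TO_CONNECT × WAIT_MINUTES = 60 minutes of waiting in total. A minimal sketch of the for-loop retry suggested in the review follows; the function name `launch_with_retries`, the `request_spot_instance` callable, and the stand-in `ClientError` class are hypothetical (the real code deals with `botocore.exceptions.ClientError`):

```python
import logging
import time

logger = logging.getLogger(__name__)

WAIT_MINUTES = 10
MAX_TRIES_TO_CONNECT = 6


class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError so this sketch
    runs without boto3 installed."""


def launch_with_retries(request_spot_instance, sleep=time.sleep):
    """Try to create a spot instance, retrying on AWS client errors.

    ``request_spot_instance`` is a hypothetical zero-argument callable
    that returns the instance, or raises ClientError on failure.
    """
    for n_try in range(MAX_TRIES_TO_CONNECT):
        try:
            return request_spot_instance()
        except ClientError as e:
            logger.error(f"AWS error on try {n_try + 1}: {e}")
            if n_try == MAX_TRIES_TO_CONNECT - 1:
                raise  # give up after the last attempt
            sleep(WAIT_MINUTES * 60)
```

A for loop over `range(MAX_TRIES_TO_CONNECT)` makes the retry budget explicit and avoids a manually incremented counter, which is the readability gain the review points at.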
Collaborator:

A 1-hour waiting time seems like a lot, no? But I have no idea how frequent the failures are.

Contributor (Author):

Thanks for your comments :-) @tomMoral
Yes, 1 h seems like a lot, but I think in my particular case (when there are only 3 instances available) it's good to give the submission a bit of a chance to get submitted; otherwise it will fail, and the error will come from not having instances, not from a wrong submission.
I think it would be better to put those two params in the event's config file in the future, but for now I prefer to keep it as is, so the events (especially the locomotion challenge) would not have to be redeployed.

@lucyleeow (Contributor) commented Nov 13, 2020

> when there are only 3 instances available

It might be worth following up with Nicolas about this. Also, requesting an increase in the spot instance limit seems to be much easier than requesting an increase in the on-demand limit; I think if you requested to increase the spot instance limit it would go through.

Contributor (Author):

I requested an increase in the spot instance limit; the use of on-demand is now broken on the current release of RAMP anyway (I made the PR to fix it: #462).

@lucyleeow (Contributor) commented Nov 13, 2020

Can you give a log of the errors from creating the AWS instance? If the problem is that the dispatcher thinks we have more instances than we actually have, maybe the problem would be better fixed by correcting that rather than waiting so long between requests...

Edit: thinking about it more, I think this is not a good idea. It keeps the dispatcher requesting spot instances for up to 1 h, so other tasks (e.g. collecting results, training, etc.) are not performed until after you have waited 1 h for your spot instance request. What exactly is the error/cause of the error?

if self.instance:
    logger.info("Instance launched for submission '{}'".format(
        self.submission))
else:
    logger.error("Unable to launch instance for submission "
                 "'{}'".format(self.submission))
Contributor:

nitpick, change this to f-string?
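For reference, the two logging styles side by side; the submission name below is a made-up example:

```python
submission = "submission_42"  # hypothetical submission name

# .format style, as in the diff above
msg_format = "Unable to launch instance for submission '{}'".format(submission)

# equivalent f-string, as suggested in the review
msg_fstring = f"Unable to launch instance for submission '{submission}'"

# both produce the same message
assert msg_format == msg_fstring
```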

@maikia (Contributor, Author) commented Nov 13, 2020

The error is this one: botocore.exceptions.ClientError, thrown by Amazon and caught by neither the dispatcher nor the worker.

@lucyleeow I agree that keeping the dispatcher waiting for just a single worker would not be a good idea. We can shorten the waiting time for now.

@lucyleeow (Contributor):

> botocore.exceptions.ClientError

Is there an error message with that?

@lucyleeow (Contributor):

> AWS service exceptions are caught with the underlying botocore exception, ClientError. After you catch this exception, you can parse through the response for specifics around that error, including the service-specific exception. Exceptions and errors from AWS services vary widely. You can quickly get a list of an AWS service’s exceptions using Boto3.

Sounds like this is a relatively generic error and the message gives more details?

@maikia (Contributor, Author) commented Nov 13, 2020

botocore.exceptions.ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded
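A botocore ClientError carries a structured response dict from which the error code can be read, so handling can be restricted to this specific failure. A minimal sketch follows; the dict mirrors the shape of `ClientError.response` so it runs without boto3 installed, and the `should_retry` helper and `RETRYABLE_CODES` policy are hypothetical:

```python
# Stand-in for the botocore.exceptions.ClientError.response dict
# produced by the failure quoted above (so this runs without boto3).
error_response = {
    "Error": {
        "Code": "MaxSpotInstanceCountExceeded",
        "Message": "Max spot instance count exceeded",
    }
}

# Hypothetical policy: only this code is treated as transient.
RETRYABLE_CODES = {"MaxSpotInstanceCountExceeded"}


def should_retry(response):
    """Retry only when the AWS error code is considered transient."""
    return response["Error"]["Code"] in RETRYABLE_CODES
```

Filtering on the code avoids retrying on unrelated client errors (bad credentials, malformed requests) that would never succeed no matter how long the dispatcher waits.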

@agramfort (Collaborator) left a comment

Tests are green, so it's good from my end. Let's try it and see what it gives. If we break things it's no big deal; we can fix/revert quickly as long as we do not touch the DB.

@lucyleeow (Contributor):

> tests are green so it's good from my end.

@agramfort FYI, I think the AWS tests are skipped on CI.

@agramfort (Collaborator) commented Nov 14, 2020 via email

@rth (Collaborator) commented Nov 14, 2020

FYI, there is https://github.com/spulec/moto for mocking AWS services in tests (though I have no experience using it)

@maikia (Contributor, Author) commented Nov 16, 2020

There are two AWS tests which are skipped. There are virtually no other tests except what I have added here, which should not have been skipped by CI, unless I am not looking at the right settings (?)

I am merging

Thanks @tomMoral @rth @lucyleeow and @agramfort

@maikia maikia merged commit 4ac7ce0 into paris-saclay-cds:master Nov 16, 2020
5 participants