Handle errors from AWS on creating instance #464
Conversation
Codecov Report
@@ Coverage Diff @@
## master #464 +/- ##
==========================================
+ Coverage 93.37% 93.40% +0.02%
==========================================
Files 99 99
Lines 8337 8367 +30
==========================================
+ Hits 7785 7815 +30
Misses 552 552
Continue to review full report at Codecov.
I still need to handle/log some errors at the dispatcher level, but if you could already review this one it would move things forward a bit faster. Thanks in advance!
I don't know the code well enough to tell you whether the test is good, but it seems nice to have a retry mechanism.
Just a small comment on the while loop, which I suggest replacing with a for loop to get more readable code, but it might be personal taste :)
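The suggested for-loop version of the retry could look like this sketch (the function and its arguments are hypothetical illustrations, not the actual ramp-engine code):

```python
import time

def launch_with_retry(launch, max_tries=6, wait_secs=600):
    """Call ``launch()`` up to ``max_tries`` times, sleeping between tries.

    A ``for`` loop over ``range`` makes the bounded number of attempts
    explicit, which is arguably easier to read than a ``while`` loop
    with a manually incremented counter.
    """
    for attempt in range(1, max_tries + 1):
        instance = launch()
        if instance is not None:
            return instance
        if attempt < max_tries:
            time.sleep(wait_secs)
    return None
```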
ramp-engine/ramp_engine/aws/api.py
Outdated
@@ -55,6 +55,10 @@
 TRAIN_LOOP_INTERVAL_SECS_FIELD = 'train_loop_interval_secs'
 MEMORY_PROFILING_FIELD = 'memory_profiling'

+# how long to wait for connections
+WAIT_MINUTES = 10
+MAX_TRIES_TO_CONNECT = 6
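For reference, these two constants are what the "1 hour" in the review comments refers to — the worst-case total waiting time works out as:

```python
WAIT_MINUTES = 10          # sleep between connection attempts
MAX_TRIES_TO_CONNECT = 6   # give up after this many attempts

# up to 10 * 6 = 60 minutes of waiting in total
total_wait_minutes = WAIT_MINUTES * MAX_TRIES_TO_CONNECT
```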
1 hour of waiting time seems like a lot, no?
But I have no idea how frequent the failures are.
Thanks for your comments :-) @tomMoral
Yes, 1h seems like a lot, but I think in my particular case (when there are only 3 instances available) it's good to give the submission a bit of a chance to get submitted; otherwise it will fail, and the error will come from not having instances, not from the submission being wrong.
I think it would be better to put those two params in the config file of the event in the future, but for now I prefer to keep it as is, so the events (especially the locomotion challenge) would not have to be redeployed.
> when there are only 3 instances available
It might be worth following up with Nicolas about this. Also, requesting an increase in the spot instance limit seems to be much easier than requesting an increase in the on-demand limit; I think if you requested a spot instance limit increase, it would go through.
I requested an increase in the spot instance limit. The use of on-demand instances is broken on the current release of RAMP anyway (I made the PR to fix it: #462).
Co-authored-by: Thomas Moreau <thomas.moreau.2010@gmail.com>
Can you give a log of the errors from creating the AWS instance? If the problem is that the dispatcher thinks we have more instances than we actually have, maybe the problem would be better fixed by correcting that rather than waiting so long between requests...

Edit: thinking about it more, I think this is not a good idea. It keeps the dispatcher requesting spot instances for up to 1h, so other tasks (e.g., collecting results, training) are not performed until after you have waited 1h for your spot instance request. What exactly is the error / the cause of the error?
 if self.instance:
     logger.info("Instance launched for submission '{}'".format(
         self.submission))
 else:
-    logger.info("Unable to launch instance for submission "
-                "'{}'".format(self.submission))
+    logger.error("Unable to launch instance for submission "
+                 "'{}'".format(self.submission))
nitpick, change this to f-string?
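For reference, the suggested f-string version next to the `str.format` call from the diff (using a placeholder value in place of the real submission attribute):

```python
import logging

logger = logging.getLogger(__name__)
submission = "submission_42"  # placeholder, not a real submission name

# str.format style, as in the diff:
logger.error("Unable to launch instance for submission "
             "'{}'".format(submission))

# equivalent f-string:
logger.error(f"Unable to launch instance for submission '{submission}'")
```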
The error is this one: botocore.exceptions.ClientError, thrown by Amazon and not caught by the dispatcher nor by the worker. @lucyleeow I agree that keeping the dispatcher waiting for just a single worker would not be a good idea; we can shorten the waiting time for now.
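A minimal sketch of catching that exception around instance creation (the function names are hypothetical; the fallback class only exists so the sketch runs even without botocore installed):

```python
import logging
import time

try:
    from botocore.exceptions import ClientError
except ImportError:  # let the sketch run without botocore installed
    class ClientError(Exception):
        pass

logger = logging.getLogger(__name__)

def create_instance_safely(create, max_tries=3, wait_secs=0.0):
    """Call ``create()``, logging and retrying when AWS rejects it."""
    for attempt in range(1, max_tries + 1):
        try:
            return create()
        except ClientError as exc:
            logger.error("AWS refused to create the instance "
                         "(attempt %d/%d): %s", attempt, max_tries, exc)
            if attempt < max_tries:
                time.sleep(wait_secs)
    return None
```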
Is there an error message with that?
Sounds like this is a relatively generic error and the message gives more details?
Tests are green, so it's good from my end. Let's try and see what it gives. If we break things it's no big deal; we can fix / revert quickly as long as we do not touch the DB.
@agramfort FYI, I think the AWS tests are skipped on CI.
arfff :(
Any chance to fix this? Using free micro spot instances?
FYI, there is https://github.com/spulec/moto for mocking AWS services in tests (though I have no experience using it).
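moto spins up fake AWS endpoints behind boto3; for a single unit test, a lighter-weight alternative is stubbing the EC2 client with the standard library's unittest.mock. Everything below is a hypothetical sketch, not ramp-engine's test code:

```python
from unittest import mock

# A fake EC2 resource whose create_instances call always fails,
# standing in for botocore's ClientError without touching AWS.
fake_ec2 = mock.Mock()
fake_ec2.create_instances.side_effect = RuntimeError(
    "MaxSpotInstanceCountExceeded")

def launch(ec2):
    """Return instances on success, None when AWS rejects the request."""
    try:
        return ec2.create_instances(ImageId="ami-12345678",
                                    MinCount=1, MaxCount=1)
    except RuntimeError:
        return None
```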
There are two AWS tests which are skipped. There are virtually no other tests except what I have added here, which should not have been skipped by CI unless I am not looking at the right settings (?). I am merging. Thanks @tomMoral @rth @lucyleeow and @agramfort!
What does this implement/fix? Explain your changes.
Add handling of errors thrown by AWS when creating a spot instance.
Any other comments?