
Handle errors from AWS on creating instance #464

Merged: 16 commits into paris-saclay-cds:master, Nov 16, 2020

Conversation

@maikia (Contributor) commented Nov 13, 2020

Add handling of errors thrown by AWS when creating a spot instance

  • tests


@codecov bot commented Nov 13, 2020

Codecov Report

Merging #464 (aecf82a) into master (9565765) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
+ Coverage   93.37%   93.40%   +0.02%     
==========================================
  Files          99       99              
  Lines        8337     8367      +30     
==========================================
+ Hits         7785     7815      +30     
  Misses        552      552              
Impacted Files Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py 59.75% <100.00%> (+23.21%) ⬆️


@maikia (Contributor, Author) commented Nov 13, 2020

I still need to handle/log some errors at the dispatcher level, but if you could already review this one it would move things forward a bit faster.

Thanks in advance.
@agramfort @lucyleeow @glemaitre @tomMoral

@maikia changed the title from "WIP Handle erros from AWS on creating instance" to "Handle erros from AWS on creating instance" on Nov 13, 2020
@maikia changed the title from "Handle erros from AWS on creating instance" to "Handle errors from AWS on creating instance" on Nov 13, 2020
@tomMoral (Collaborator) left a comment


I don't know the code well enough to tell you if the test is good, but it seems nice to have a retry mechanism.

Just a small comment on the while loop, which I suggest replacing with a for loop for more readable code, but it might be personal taste :)

ramp-engine/ramp_engine/aws/api.py (outdated)
@@ -55,6 +55,10 @@
TRAIN_LOOP_INTERVAL_SECS_FIELD = 'train_loop_interval_secs'
MEMORY_PROFILING_FIELD = 'memory_profiling'

# how long to wait for connections
WAIT_MINUTES = 10
MAX_TRIES_TO_CONNECT = 6
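For context, these constants allow up to MAX_TRIES_TO_CONNECT × WAIT_MINUTES = 60 minutes of waiting in total. A minimal sketch of the for-loop retry suggested in the review follows; the function name `launch_with_retries`, the `request_spot_instance` callable, and the stand-in `ClientError` class are hypothetical (the real code deals with `botocore.exceptions.ClientError`):

```python
import logging
import time

logger = logging.getLogger(__name__)

WAIT_MINUTES = 10
MAX_TRIES_TO_CONNECT = 6


class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError so this sketch
    runs without boto3 installed."""


def launch_with_retries(request_spot_instance, sleep=time.sleep):
    """Try to create a spot instance, retrying on AWS client errors.

    ``request_spot_instance`` is a hypothetical zero-argument callable
    that returns the instance, or raises ClientError on failure.
    """
    for n_try in range(MAX_TRIES_TO_CONNECT):
        try:
            return request_spot_instance()
        except ClientError as e:
            logger.error(f"AWS error on try {n_try + 1}: {e}")
            if n_try == MAX_TRIES_TO_CONNECT - 1:
                raise  # give up after the last attempt
            sleep(WAIT_MINUTES * 60)
```

A for loop over `range(MAX_TRIES_TO_CONNECT)` makes the retry budget explicit and avoids a manually incremented counter, which is the readability gain the review points at.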
Collaborator:

A 1-hour waiting time seems like a lot, no? But I have no idea how frequent the failures are.

Contributor (Author):

Thanks for your comments :-) @tomMoral
Yes, 1 h seems like a lot, but I think in my particular case (when there are only 3 instances available) it's good to give the submission a bit of a chance to get submitted; otherwise it will fail, and the error will come from not having instances, not from a wrong submission.
I think it would be better to put those two params in the event's config file in the future, but for now I prefer to keep it as is, so the events (especially the locomotion challenge) would not have to be redeployed.

@lucyleeow (Contributor) commented Nov 13, 2020

> when there are only 3 instances available

It might be worth following up with Nicolas about this. Also, requesting an increase in the spot instance limit seems to be much easier than requesting an increase in the on-demand limit; I think if you requested to increase the spot instance limit it would go through.

Contributor (Author):

I requested an increase in the spot instance limit; the use of on-demand is now broken on the current release of RAMP anyway (I made the PR to fix it: #462).

@lucyleeow (Contributor) commented Nov 13, 2020

Can you give a log of the errors from creating the AWS instance? If the problem is that the dispatcher thinks we have more instances than we actually have, maybe the problem would be better fixed by correcting that rather than waiting so long between requests...

Edit: thinking about it more, I think this is not a good idea. It keeps the dispatcher requesting spot instances for up to 1 h, so other tasks (e.g. collecting results, training, etc.) are not performed until after you have waited 1 h for your spot instance request. What exactly is the error/cause of the error?

if self.instance:
    logger.info("Instance launched for submission '{}'".format(
        self.submission))
else:
    logger.error("Unable to launch instance for submission "
                 "'{}'".format(self.submission))
Contributor:

nitpick, change this to f-string?
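For reference, the two logging styles side by side; the submission name below is a made-up example:

```python
submission = "submission_42"  # hypothetical submission name

# .format style, as in the diff above
msg_format = "Unable to launch instance for submission '{}'".format(submission)

# equivalent f-string, as suggested in the review
msg_fstring = f"Unable to launch instance for submission '{submission}'"

# both produce the same message
assert msg_format == msg_fstring
```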

@maikia (Contributor, Author) commented Nov 13, 2020

The error is this one: botocore.exceptions.ClientError, thrown by Amazon and caught by neither the dispatcher nor the worker.

@lucyleeow I agree that keeping the dispatcher waiting for just a single worker would not be a good idea. We can shorten the waiting time for now.

@lucyleeow (Contributor):

> botocore.exceptions.ClientError

Is there an error message with that?

@lucyleeow (Contributor):

> AWS service exceptions are caught with the underlying botocore exception, ClientError. After you catch this exception, you can parse through the response for specifics around that error, including the service-specific exception. Exceptions and errors from AWS services vary widely. You can quickly get a list of an AWS service’s exceptions using Boto3.

Sounds like this is a relatively generic error and the message gives more details?

@maikia (Contributor, Author) commented Nov 13, 2020

botocore.exceptions.ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded
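A botocore ClientError carries a structured response dict from which the error code can be read, so handling can be restricted to this specific failure. A minimal sketch follows; the dict mirrors the shape of `ClientError.response` so it runs without boto3 installed, and the `should_retry` helper and `RETRYABLE_CODES` policy are hypothetical:

```python
# Stand-in for the botocore.exceptions.ClientError.response dict
# produced by the failure quoted above (so this runs without boto3).
error_response = {
    "Error": {
        "Code": "MaxSpotInstanceCountExceeded",
        "Message": "Max spot instance count exceeded",
    }
}

# Hypothetical policy: only this code is treated as transient.
RETRYABLE_CODES = {"MaxSpotInstanceCountExceeded"}


def should_retry(response):
    """Retry only when the AWS error code is considered transient."""
    return response["Error"]["Code"] in RETRYABLE_CODES
```

Filtering on the code avoids retrying on unrelated client errors (bad credentials, malformed requests) that would never succeed no matter how long the dispatcher waits.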

@agramfort (Collaborator) left a comment

Tests are green, so it's good from my end. Let's try it and see what it gives. If we break things it's no big deal; we can fix/revert quickly as long as we do not touch the DB.

@lucyleeow (Contributor):

> tests are green so it's good from my end.

@agramfort FYI, I think the AWS tests are skipped on CI.

@agramfort (Collaborator) commented Nov 14, 2020 via email

@rth (Collaborator) commented Nov 14, 2020

FYI, there is https://github.com/spulec/moto for mocking AWS services in tests (though I have no experience using it)

@maikia (Contributor, Author) commented Nov 16, 2020

There are two AWS tests which are skipped. There are virtually no other tests except what I have added here, which should not have been skipped by CI, unless I am not looking at the right settings (?)

I am merging

Thanks @tomMoral @rth @lucyleeow and @agramfort

@maikia maikia merged commit 4ac7ce0 into paris-saclay-cds:master Nov 16, 2020
5 participants