Catch suddenly terminated instances #491

Merged

maikia
Contributor

@maikia maikia commented Dec 8, 2020

closes #487

AWS may terminate spot instances at any point in time. The worker therefore needs to catch any problem with connecting to the instance and, if a connection problem occurs, put the submission back in the queue (i.e. set its status to 'retry').
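
The behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this pull request: `SketchWorker`, `check_command`, and `check_finished` are hypothetical names, and a local command that exits with a non-zero code stands in for a failed connection to a terminated instance.

```python
import subprocess
import sys


class SketchWorker:
    """Hypothetical stand-in for a worker; not the ramp-engine class."""

    def __init__(self, check_command):
        # Command (list of str) that would query the remote instance.
        self.check_command = check_command
        self.status = "running"

    def check_finished(self):
        """Return True if the check succeeds; mark 'retry' if the call fails."""
        try:
            subprocess.check_output(self.check_command, stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError:
            # The instance may have been terminated: requeue the submission.
            self.status = "retry"
            return False
        self.status = "finished"
        return True


# A command that exits non-zero simulates a lost connection.
failing = SketchWorker([sys.executable, "-c", "import sys; sys.exit(255)"])
failing.check_finished()
print(failing.status)  # prints "retry"
```

Setting the status rather than re-raising lets whatever schedules the submissions decide when to launch the submission again on a fresh instance.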

Collaborator

@agramfort agramfort left a comment

LGTM

@codecov

codecov bot commented Dec 8, 2020

Codecov Report

Merging #491 (f22e038) into master (7fcb27d) will increase coverage by 0.01%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##           master     #491      +/-   ##
==========================================
+ Coverage   93.59%   93.61%   +0.01%     
==========================================
  Files          99       99              
  Lines        8527     8552      +25     
==========================================
+ Hits         7981     8006      +25     
  Misses        546      546              
Impacted Files                               Coverage Δ
ramp-engine/ramp_engine/base.py              93.90% <90.00%> (+0.31%) ⬆️
ramp-engine/ramp_engine/tests/test_aws.py    86.44% <100.00%> (+1.32%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7fcb27d...f22e038.

@agramfort agramfort merged commit 2e12b8d into paris-saclay-cds:master Dec 8, 2020
@agramfort
Collaborator

Thx @maikia

maikia added a commit that referenced this pull request Dec 16, 2020
* catch CalledProcessError and, if caught, restart the submission

* catches CalledProcessError

* add a test that the status is set to retry if a CalledProcessError is raised while checking whether the training has finished
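
A test of the kind named in the last bullet might look like the sketch below. It is not the test added to `test_aws.py`: `poll_status` is a hypothetical helper written only for this example, and `unittest.mock` replaces `subprocess.check_output` so that no real instance or process is involved.

```python
import subprocess
from unittest import mock


def poll_status(command):
    """Hypothetical helper: 'finished' on success, 'retry' on CalledProcessError."""
    try:
        subprocess.check_output(command)
    except subprocess.CalledProcessError:
        return "retry"
    return "finished"


def test_status_retry_on_called_process_error():
    error = subprocess.CalledProcessError(returncode=255, cmd="ssh")
    # Patch check_output so the call fails without touching the network.
    with mock.patch("subprocess.check_output", side_effect=error):
        assert poll_status(["ssh", "host", "true"]) == "retry"


test_status_retry_on_called_process_error()
```

Raising the error through `side_effect` exercises the same code path as a terminated instance, which would be hard to reproduce reliably in a test run.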
Development

Successfully merging this pull request may close these issues.

BUG uncaught error when getting the output [AWS worker]