Skip to content

Conversation

@evansiroky
Copy link
Contributor

@evansiroky evansiroky commented Jul 24, 2020

Checklist

  • Appropriate branch selected (all PRs must first be merged to dev before they can be merged to master)
  • Any modified or new methods or classes have helpful JavaDoc and code is thoroughly commented
  • The description lists all applicable issues this PR seeks to resolve
  • The description lists any configuration setting(s) that differ from the default settings
  • All tests and CI builds passing

Description

This improves the reliability of the MonitorServerStatusJob by doing the following:

  • Catching exceptions that can occur when making a AWS request to get the instance health and then retrying up to a max number of times (hardcoded as 5).
  • Catching all possible exceptions that can occur in the MonitorServerStatusJob and then failing the job gracefully in order for the MonitorServerStatusJob to be able to also recover and not leave any hanging instances.
  • Refactor a helper in MonitorServerStatusJob to include exceptions in some job status failures.
  • Using fresh ec2 clients to try to avoid some expired request errors (Refactor AWS client creation #332 has code that should avoid all request timeouts related to role session timeouts)
  • Always checking for exceptions when terminating instances and adding to the job output when instance termination failed.
  • Corrects the graph build timeout time
  • Introduces helper class TimeTracker
  • various otp-runner updates
    • Using a nonce to verify otp-runner status files
    • Uploads the otp-runner manifest, build-config.json and router-config.json to S3
    • Updates the otp-runner version and corresponding code to be compatible with changes to otp-runner that refactor file downloads
    • creates and uploads separate otp-runner manifest files for graph building and server running

@codecov-commenter
Copy link

codecov-commenter commented Jul 24, 2020

Codecov Report

Merging #331 into dev will decrease coverage by 0.05%.
The diff coverage is 19.23%.

Impacted file tree graph

@@             Coverage Diff              @@
##                dev     #331      +/-   ##
============================================
- Coverage     24.10%   24.04%   -0.06%     
- Complexity      539      542       +3     
============================================
  Files           168      169       +1     
  Lines          8821     8895      +74     
  Branches       1222     1230       +8     
============================================
+ Hits           2126     2139      +13     
- Misses         6486     6543      +57     
- Partials        209      213       +4     
Flag Coverage Δ Complexity Δ
#unit_tests 24.04% <19.23%> (-0.06%) 542.00 <5.00> (+3.00) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
.../com/conveyal/datatools/common/utils/AWSUtils.java 7.69% <0.00%> (ø) 4.00 <0.00> (ø)
...datatools/manager/jobs/MonitorServerStatusJob.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...eyal/datatools/manager/jobs/OtpRunnerManifest.java 100.00% <ø> (ø) 1.00 <0.00> (ø)
...nveyal/datatools/manager/jobs/OtpRunnerStatus.java 0.00% <ø> (ø) 0.00 <0.00> (ø)
.../datatools/manager/jobs/RecreateBuildImageJob.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
.../conveyal/datatools/manager/utils/TimeTracker.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...com/conveyal/datatools/manager/jobs/DeployJob.java 19.09% <31.11%> (-0.35%) 16.00 <4.00> (+2.00) ⬇️
...veyal/datatools/manager/persistence/FeedStore.java 35.23% <100.00%> (+4.76%) 15.00 <1.00> (+1.00)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e85446...6223b9b. Read the comment docs.

@evansiroky evansiroky mentioned this pull request Jul 28, 2020
5 tasks
return null;
}
// download otp-runner manifest
lines.add(String.format("aws s3 cp %s %s", otpRunnerManifestS3FilePath, otpRunnerManifestFileOnInstance));
Copy link
Contributor

@landonreed landonreed Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yea. This seems much better than echoing the manifest from the user-data. Also, makes things more repeatable. For some reason, in my head I thought we were already uploading the otp-runner file to s3. In any case, nice work.

@evansiroky
Copy link
Contributor Author

Ready for review now that #330 has been merged into dev.

"%s During job cleanup, the graph building instance was not properly terminated!",
status.message
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than surrounding each call to terminateInstances in a try/catch, we should just have the terminateInstances method itself contain the try/catch, check the TerminateInstancesResult and return a success boolean. I don't think the current return type is really used anywhere.

Copy link
Contributor

@landonreed landonreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments describing some changes I think would help.

@landonreed landonreed assigned evansiroky and unassigned landonreed Aug 18, 2020
@evansiroky evansiroky mentioned this pull request Aug 19, 2020
8 tasks
Copy link
Contributor

@binh-dam-ibigroup binh-dam-ibigroup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional approval on what Landon said and some minor comments.

);
} catch (AmazonEC2Exception e) {
status.fail(
String.format("%s. During job cleanup, the instance was not properly terminated!", status.message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string appears four times in this file and can become a constant.

@evansiroky
Copy link
Contributor Author

See various changes that should address the comments.

@landonreed landonreed merged commit d068810 into dev Aug 20, 2020
@landonreed landonreed deleted the improve-monitorable-server-job-exception-handling branch August 20, 2020 17:58
@landonreed landonreed mentioned this pull request Oct 30, 2020
8 tasks
@landonreed
Copy link
Contributor

🎉 This PR is included in version 3.8.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants