New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce 12 hour maxJobDuration #275
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just may be worth spelling out that we're setting the default to 3 minutes short of 12 hours.
@@ -63,9 +63,10 @@ const dashboard = require(path.resolve(__dirname, 'dashboard.js')); | |||
* number or a reference, i.e. `{"Ref": "..."}`. | |||
* @param {Boolean} [options.privileged=false] - give the container elevated | |||
* privileges on the host container instance | |||
* @param {Number|ref} [options.maxJobDuration] - the maximum number of seconds | |||
* @param {Number|ref} [options.maxJobDuration] - the number of seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say that the default is 12 hours?
@@ -103,6 +103,14 @@ In writableFilesystem mode, the whole file system is writable and containers are | |||
|
|||
writableFilesystem mode has no restrictions to the file system: workers can write anywhere and read from anywhere, their files being instantly deleted after the job finishes and the container dies. | |||
|
|||
### maxJobDuration explained | |||
|
|||
When maxJobDuration is exceeded by a worker, the worker will stop all processes and return the processing message to the queue. Typically, the maxJobDuration should be set marginally above the maximum time that it takes your application to process and delete a message from the queue. This ensures all work can be completed; while providing protection against erroneous processing continuing to the maximum of twelve hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm likely confused but it looks like if maxJobDuration
is exceeded, the behavior is the same as when SQS hard limit of 12 hours is exceeded, because in both cases the message is returned to the queue, is that right?
Or will a worker keep working when the 12-hour visibility timeout is breached?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or will a worker keep working when the 12-hour visibility timeout is breached?
Yes, I believe the worker will keep working regardless of hitting the timeout. We could implement a hard cutoff, but that may or may not be desirable depending on the application context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense, thanks @rclark!
Opening a pr to close the loop on timeout handling.
As it stands right now watchbot will duplicate work if the message takes longer than 12 hours to process and no timeout is set. Quick solution is to acknowledge the current 12 hour limitation by enforcing maxJobDuration on workers.
I've chosen 43020 as the max (12 hours minus three minutes) to avoid edge cases that could occur if the heartbeat tries to extend the visibility timeout 3 minutes and the total would be over 43200 seconds. Also attempted some docs to explain this limitation.
cc: @jakepruitt @rclark
closes #203