Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce 12 hour maxJobDuration #275

Closed
wants to merge 5 commits into from
Closed

Conversation

brendanmcfarland
Copy link
Contributor

Opening a pr to close the loop on timeout handling.

  • we can't extend message visibility past 12 hours
  • we don't have a single source of truth to avoid duplicate work

As it stands right now watchbot will duplicate work if the message takes longer than 12 hours to process and no timeout is set. Quick solution is to acknowledge the current 12 hour limitation by enforcing maxJobDuration on workers.

I've chosen 43020 as the max (12 hours minus three minutes) to avoid edge cases that could occur if the heartbeat tries to extend the visibility timeout 3 minutes and the total would be over 43200 seconds. Also attempted some docs to explain this limitation.

cc: @jakepruitt @rclark

closes #203

Copy link

@jakepruitt jakepruitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just may be worth spelling out that we're setting the default to 3 minutes short of 12 hours.

@@ -63,9 +63,10 @@ const dashboard = require(path.resolve(__dirname, 'dashboard.js'));
* number or a reference, i.e. `{"Ref": "..."}`.
* @param {Boolean} [options.privileged=false] - give the container elevated
* privileges on the host container instance
* @param {Number|ref} [options.maxJobDuration] - the maximum number of seconds
* @param {Number|ref} [options.maxJobDuration] - the number of seconds

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say that the default is 12 hours?

@@ -103,6 +103,14 @@ In writableFilesystem mode, the whole file system is writable and containers are

writableFilesystem mode has no restrictions to the file system: workers can write anywhere and read from anywhere, their files being instantly deleted after the job finishes and the container dies.

### maxJobDuration explained

When maxJobDuration is exceeded by a worker, the worker will stop all processes and return the processing message to the queue. Typically, the maxJobDuration should be set marginally above the maximum time that it takes your application to process and delete a message from the queue. This ensures all work can be completed; while providing protection against erroneous processing continuing to the maximum of twelve hours.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm likely confused but it looks like if maxJobDuration is exceeded, the behavior is the same as when SQS hard limit of 12 hours is exceeded, because in both cases the message is returned to the queue, is that right?

Or will a worker keep working when the 12-hour visibility timeout is breached?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or will a worker keep working when the 12-hour visibility timeout is breached?

Yes, I believe the worker will keep working regardless of hitting the timeout. We could implement a hard cutoff, but that may or may not be desirable depending on the application context.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense, thanks @rclark!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The timeout problem in watchbot 4
5 participants