
task: Fix checkCatalyst loop when the queues are long #148

Merged
merged 3 commits into main from vg/fix/long-queue-catalyst-result-handling on Feb 14, 2023

Conversation

victorges
Member

@victorges victorges commented Feb 13, 2023

A bug that is currently happening (sketched right after this list):

  • catalyst finishes processing a task
  • sends a final callback to the task runner
  • the task runner enqueues that success so it finalizes the processing with the appropriate load balancing
  • if the queues are long, this message might end up staying in the queue for a couple of minutes
  • during that time, there will be no update to the updatedAt of the task, as callbacks have stopped
  • when the checkCatalyst loop finally gets to it, it will see a task that has had no updates for several minutes, treat it as lost, and mark it as errored on the API
  • the error only gets worse because the API will then retry the task, which hits the same error again
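
To illustrate the failure mode, here is a minimal, self-contained Go sketch of the kind of staleness check described above. All names (`Task`, `maxTaskIdleTime`, `isLost`) are hypothetical and not taken from the actual task-runner code; the point is only that a success message sitting in a long queue looks identical to a lost task.

```go
package main

import (
	"fmt"
	"time"
)

// Task is a hypothetical, trimmed-down view of the API task object.
type Task struct {
	ID        string
	UpdatedAt time.Time // bumped on every catalyst callback
}

// maxTaskIdleTime is an assumed threshold, not the real production value.
const maxTaskIdleTime = 2 * time.Minute

// isLost mirrors the old exit condition: a task whose updatedAt is stale is
// treated as lost, even if its final success message is merely waiting in a
// long queue.
func isLost(t Task, now time.Time) bool {
	return now.Sub(t.UpdatedAt) > maxTaskIdleTime
}

func main() {
	// Callbacks stopped 5 minutes ago because catalyst already finished,
	// but the success event has not been consumed from the queue yet.
	t := Task{ID: "task-1", UpdatedAt: time.Now().Add(-5 * time.Minute)}
	fmt.Println("treated as lost:", isLost(t, time.Now())) // true -> marked errored
}
```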

This fixes it by improving the exit condition of the checkCatalyst step. Instead of only finishing it once the task has stopped running, we now update the task step on the API and make sure to stop the checkCatalyst loop as soon as the task reaches the finalize step. We update the step from the catalyst callback handler already, so it happens early and stops the checkCatalyst loop, rather than only when the finalization event gets processed.
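
A rough sketch of the improved exit condition, again with hypothetical names (not the actual task-runner code): the callback handler records the finalize step as soon as the final callback arrives, and the checkCatalyst loop exits on that step instead of relying only on updatedAt staleness.

```go
package main

import (
	"fmt"
	"time"
)

// Task is a hypothetical, trimmed-down view of the API task object.
type Task struct {
	ID        string
	Step      string // "" while running, "finalize" once the final callback arrived
	UpdatedAt time.Time
}

const maxTaskIdleTime = 2 * time.Minute // assumed threshold

// onFinalCallback sketches the catalyst callback handler: it updates the task
// step early, before the success event is consumed from the (possibly long) queue.
func onFinalCallback(t *Task) {
	t.Step = "finalize"
	t.UpdatedAt = time.Now()
}

// shouldStopChecking is the new exit condition for the checkCatalyst loop:
// a task that already reached the finalize step is no longer monitored, so a
// stale updatedAt can no longer get it marked as errored.
func shouldStopChecking(t Task) bool {
	return t.Step == "finalize"
}

// looksLost only applies while the task has not reached finalize yet.
func looksLost(t Task, now time.Time) bool {
	return !shouldStopChecking(t) && now.Sub(t.UpdatedAt) > maxTaskIdleTime
}

func main() {
	t := Task{ID: "task-1", UpdatedAt: time.Now().Add(-5 * time.Minute)}
	onFinalCallback(&t)
	fmt.Println("stop checkCatalyst:", shouldStopChecking(t)) // true
	fmt.Println("looks lost:", looksLost(t, time.Now()))      // false
}
```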

Fixes STU-29

@victorges victorges requested a review from a team as a code owner February 13, 2023 19:42
@victorges victorges changed the title task: FIx checkCatalyst loop when the queues are long task: Fix checkCatalyst loop when the queues are long Feb 13, 2023
Contributor

@thomshutt thomshutt left a comment


LGTM, is there a similar situation where Catalyst API reports a terminal error?

@victorges
Member Author

@thomshutt in case of errors we don't need the finalization step, so we just publish a task result, which goes to a different queue and is processed by the API instead.

While debugging this I did notice one interesting scenario though: currently the delay we enforce between API retries is literally just a sleep in the API code. It holds a message for a while without acking it until it schedules the retry. If too many tasks are retrying, it might be the case that all API consumers are busy sleeping until a retry. This can cause a delay in processing task results and retries. It shouldn't be that bad anyway, as we have 12 API instances per region, with maybe 2 consumers per instance. Either way, it's something to keep in mind; we might want to refactor our retry logic or increase the number of consumers, etc. cc @gioelecerati
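
To make the concern concrete, here is a toy Go model (no real AMQP client; all names and numbers are assumptions) of why sleep-before-retry ties up consumers: a consumer that is sleeping on a retry delay cannot pick up other messages, so a handful of concurrent retries stall everything else for the duration of the delay.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	numConsumers = 2                      // assumed consumers per API instance
	retryDelay   = 200 * time.Millisecond // stands in for the real retry delay
)

// message is a toy stand-in for a queued task result or retry request.
type message struct {
	id         int
	needsRetry bool
}

func main() {
	msgs := make(chan message, 10)
	for i := 0; i < 6; i++ {
		// The first two messages need a retry; the rest are ordinary results.
		msgs <- message{id: i, needsRetry: i < 2}
	}
	close(msgs)

	var wg sync.WaitGroup
	start := time.Now()
	for c := 0; c < numConsumers; c++ {
		wg.Add(1)
		go func(consumer int) {
			defer wg.Done()
			for m := range msgs {
				if m.needsRetry {
					// Sleeping here holds the consumer (and the unacked
					// message) for the whole delay, mirroring the current
					// sleep-in-the-API retry behavior.
					time.Sleep(retryDelay)
				}
				fmt.Printf("consumer %d handled msg %d at %v\n",
					consumer, m.id, time.Since(start).Round(10*time.Millisecond))
			}
		}(c)
	}
	wg.Wait()
}
```

With both consumers stuck sleeping on the first two messages, the remaining results only start flowing after the delay elapses, which is the delay in processing task results and retries described above.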

@victorges victorges merged commit af4b3e0 into main Feb 14, 2023
@victorges victorges deleted the vg/fix/long-queue-catalyst-result-handling branch February 14, 2023 15:07