use last instead of first bacalhau execution #913
Queued up 100 labsay jobs. 100/100 bacalhau jobs succeeded, however only 94/100 initially succeeded on the app frontend. The other 6 were perpetually in a state of Received the following error, similar to @thetechnocrat-dev for 2 of the 6 stalled jobs.
Marking the 2 jobs as
The 6 stalled jobs seem to have coincided with a scale-up from 1 CPU node to 3. Unexpected behavior of the jobs' NodeIDs seems to contribute. See one of the 2 problematic "stalled" jobs despite a successful Bacalhau run:
The
This seems to suggest that when autoscaling up, the NodeId values sometimes change, which stalls the queue. Anecdotally, similar behavior seems to have occurred during previous scale-ups.
Closing this PR as it is no longer relevant.
What type of PR is this?
Description
I noticed
in the logs, and jobs were not processing. I realized this was because some jobs had multiple Bacalhau executions, and the first was always a bid-rejected capacity error. For these cases we should always look at the most recent execution.
Example