
Implement deployments for system jobs #3644

Open
preetapan opened this issue Dec 11, 2017 · 18 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/scheduling type/enhancement

Comments

@preetapan
Contributor

preetapan commented Dec 11, 2017

As of 0.7.0, deployments - update stanzas with auto_revert, canary, etc. - are only implemented for service-type jobs. We need to implement auto-revert and other reconciliation features for system jobs. There should also be a way to stop a rolling upgrade of a bad system job across the fleet.
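
For context, a minimal sketch of the update stanza that service jobs already support (job name and values are illustrative); the request here is for the same deployment machinery to apply when type = "system".

```hcl
# Illustrative only: deployment settings honored for service jobs as of 0.7.0.
# None of these fields currently have any effect on a system job.
job "web" {
  type = "service"

  update {
    max_parallel     = 1        # allocations updated at a time
    health_check     = "checks" # gate each allocation on its health checks
    min_healthy_time = "30s"
    healthy_deadline = "5m"
    canary           = 1        # place one canary before promoting the rest
    auto_revert      = true     # roll back automatically if the deploy fails
  }

  # group/task definitions omitted for brevity
}
```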

@schmichael schmichael changed the title from "Auto reverts and other reconciliation features for system jobs" to "Implement deployments for system jobs" on Dec 11, 2017
@mihasya

mihasya commented Dec 12, 2017

/me clicks on issue linked by co-worker, reads issue, sees @schmichael in the activity feed, strokes beard while nodding with approval

👋

@alxark

alxark commented Feb 15, 2018

I'm so tired of non-transparent updates for system jobs. This feature is really needed.

@jrasell
Member

jrasell commented Jun 8, 2018

@preetapan @schmichael @dadgar this is something I really want to see, and I'm happy to have a crack at it unless it's already being worked on internally. If it isn't, any thoughts, ideas, or tips would be greatly appreciated.

@preetapan
Contributor Author

@jrasell We want to make several improvements to the system scheduler, including implementing deployments, as well as bringing in other improvements that are in the reconciler. This is a fairly large-scoped project, and implementing it will involve a set of non-trivial changes. We are currently targeting this for a future release, likely after Nomad 0.9.0.

@jpasichnyk

jpasichnyk commented Jan 19, 2019

@preetapan any update on a timeline for this? I specifically am looking for canary support for system jobs.

@mgeggie

mgeggie commented Jul 11, 2019

@preetapan we've just launched into the world of Nomad, and found this issue when deploying our first system-level job to our cluster. Any update on when we can expect healthchecking for system jobs?

@taer

taer commented Nov 22, 2019

@preetapan I see Nomad 0.10 was recently released. Any updates on this feature?

@burdandrei
Contributor

@jrasell will you take this into your hands now? ;)

@dpn

dpn commented Apr 30, 2020

We'd also love to see this functionality, so +1 from our end! 👍

@xsikor

xsikor commented Jun 15, 2020

Any updates on the status of this functionality?
Or maybe it's already done, but only for the Enterprise version?

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline commitment though. label Aug 24, 2020
@schmichael
Member

Or maybe it's already done, but only for the Enterprise version?

Nope, it will be OSS!

This is roadmapped. As everyone can probably guess, there's a lot going on in everyone's lives, so a timeline has been very tricky. We're very excited to see the initial PR #8841 from @dubadub and hope to have someone dig into it with them. There are some very tricky aspects to deployments for system jobs that we need to get right to maximize usability and minimize complexity.

For example if we spun up canaries concurrently with the stable version's allocation on the same node, there would likely be resource conflicts (static ports, host volumes) that block placement or prevent proper functioning. Therefore it seems like system deployments should diverge from service deployments in that canaries should act as replacements instead of additional capacity.

To further complicate matters: as @dubadub discovered in #8841 the code in question could use some refactoring. The layout of Nomad's scheduler has basically never changed, so as you can imagine there are some opportunities for cleanup.

So please keep the use cases coming! The more detailed you can be about the desired behaviors, the better! I know it seems like we're silent sometimes, but we definitely parse, discuss, and rehash every word of GitHub comments to ensure we're meeting the desired use case.
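
To make the replacement-canary idea above concrete, here is a purely hypothetical sketch; nothing in it is supported today, and the field meanings shown are assumptions about one possible design, not a settled plan.

```hcl
# Hypothetical sketch (not implemented): a system-job deployment in which
# canaries replace the stable allocation on the chosen nodes rather than
# running alongside it, avoiding conflicts over static ports and host volumes.
job "node-exporter" {
  type = "system"

  update {
    canary       = 2    # assumed meaning: replace the alloc on 2 nodes first
    max_parallel = 5    # assumed meaning: after promotion, roll 5 nodes at a time
    auto_revert  = true # assumed meaning: restore the old version on nodes where the new one fails
  }
}
```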

@josegonzalez
Contributor

This is something we'd love to have. We built a system on top of Nomad that allows our developers to know how far along a job has been rolled out. At the moment, we can sort of fudge it by parsing the annotations in a nomad plan and stringing that along into our allocation-tailing process, though this makes the code much more complicated than it would be if we could just use the same method we use to roll out service jobs.

@weargoggles
Contributor

I'd love to see this arrive. I would be completely satisfied with the design as I understand it in #8841, which allows a fixed number of allocations to be replaced at one time.

@johnnyplaydrums

johnnyplaydrums commented Mar 2, 2022

Clarifying a use case: de-risking system job updates with the auto_revert flag. Currently, if an update to a system job is deployed, and the new allocations fail due to health checks or task state, that system job will be down across the cluster. Allowing the system job to be auto reverted would make rolling out updates to system jobs much less stressful 😅 I think this use case was mentioned in the initial comment but just wanted to spell it out a bit more. Thanks y'all!
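
As a concrete illustration of this use case, here is a hypothetical jobspec fragment; the update stanza shown is not honored for system jobs at the time of writing, and the job name is made up.

```hcl
# Desired behavior, not current behavior: if the new version fails its health
# checks, each node reverts to the last known-good version instead of the
# system job being down across the whole cluster.
job "log-shipper" {
  type = "system"

  update {
    health_check     = "checks"
    healthy_deadline = "5m"
    auto_revert      = true
  }
}
```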

@m1keil

m1keil commented Jun 25, 2022

Even without auto_revert, one of the big problems is that there is no feedback from the Nomad CLI about deployment issues, as there is for regular service jobs. The monitor returns 0 no matter what. This confuses both CI/CD and the user because it makes it seem as if the job started successfully. A word of warning would be nice, to make clear that the user shouldn't expect the same guarantees as with a service job.

@axsuul
Contributor

axsuul commented Aug 5, 2022

Supporting this would be valuable for us since we use system jobs as a way to scale up our app server and background worker Nomad jobs. That is, all we need to do is launch new nodes and Nomad automatically schedules new allocations to run on these nodes. But with system jobs, we have to deal with the drawback of not being able to use canaries, etc. even though these aren't typical system jobs.

@robloxrob

Supporting this would be valuable for us since we use system jobs as a way to scale up our app server and background worker Nomad jobs. That is, all we need to do is launch new nodes and Nomad automatically schedules new allocations to run on these nodes. But with system jobs, we have to deal with the drawback of not being able to use canaries, etc. even though these aren't typical system jobs.

Great point here. This is the same case for how we use system jobs.

@komapa

komapa commented Nov 22, 2022

Another strong +1 for this feature. Everything we face was already mentioned by other comments so I will not repeat those great points, just wanted to bump this issue once again. Thank you!

Projects
Status: 1.9 & 1.10 Shortlist (uncommitted)