
feat(proposals): support scheduled builds #772

Merged
6 commits merged into main on Oct 11, 2023

Conversation

jbrockopp
Contributor

@jbrockopp jbrockopp commented Mar 16, 2023

xref: #538

This proposes adding support to schedule builds on a periodic cadence for a repo in Vela.

@jbrockopp jbrockopp requested a review from a team as a code owner March 16, 2023 15:05
@cognifloyd
Member

The implementation for this needs some kind of lock or synchronization method in the server so that if there are multiple servers (HA: Active-Active), they do not kick off the same job at the same time. Either a single server needs to own schedule management, or the jobs need to be partitioned somehow.

@jbrockopp
Contributor Author

jbrockopp commented Mar 17, 2023

@cognifloyd the plan is to leverage the scheduled_at field within the database for this.

Similar to how we check the DB for things like build.number and repo.counter before taking further actions.

Each time a build is pushed to the queue from a schedule, the server will update the scheduled_at field.

For each server being used, it'll compare the entry for the schedule against the scheduled_at field.

To provide an example scenario, let's say I have two servers (X & Y) with a repo (foo) that has a schedule (bar).

Let's assume all defaults are being used so the entry for bar is 1hr and the servers scan the table every 30m.

Server X scans the table, sees the schedule for foo and finds the scheduled_at field is empty.

So, X pushes a build to the queue and updates the scheduled_at field for bar.

Server Y scans the table, sees the schedule for foo and finds the scheduled_at field is NOT empty.

So, Y crunches the data and notices a build for the schedule was just run 10m ago.

Since 10m < 1hr, Y will skip doing anything else with foo and move on to the next schedule.


## Questions

1. Should we create a new field (i.e. `schedule`/`schedule_target`) or reuse existing `target` field in the `ruleset`?
Member

reuse for now, and add another field later if that causes any problems.



2. Should we prevent running multiple builds for the same schedule for a repo?
Member

Possibly. This might need a new YAML pipeline key so the pipeline user can declare whether overlapping workflows are expected or safe.

@cognifloyd
Member

> @cognifloyd the plan is to leverage the scheduled_at field within the database for this.
>
> Similar to how we check the DB for things like build.number and repo.counter before taking further actions.
>
> Each time a build is pushed to the queue from a schedule, the server will update the scheduled_at field.
>
> For each server being used, it'll compare the entry for the schedule against the scheduled_at field.
>
> To provide an example scenario, let's say I have two servers (X & Y) with a repo (foo) that has a schedule (bar).
>
> Let's assume all defaults are being used so the entry for bar is 1hr and the servers scan the table every 30m.
>
> Server X scans the table, sees the schedule for foo and finds the scheduled_at field is empty.
>
> So, X pushes a build to the queue and updates the scheduled_at field for bar.
>
> Server Y scans the table, sees the schedule for foo and finds the scheduled_at field is NOT empty.
>
> So, Y crunches the data and notices a build for the schedule was just run 10m ago.
>
> Since 10m < 1hr, Y will skip doing anything else with foo and move on to the next schedule.

What if X and Y end up scanning for and starting the build at the same time?

Hmm. I guess this could work if both queueing and updating scheduled_at happen within a transaction that aborts if the schedule row has changed:

> So, X pushes a build to the queue and updates the scheduled_at field for bar.

In which case, the db transaction would serve as the locking mechanism. Is that what you're suggesting?


The thread will be set up to scan the table every half of the minimum supported cadence i.e. `1hr` / 2 = `30m`

The server will use [jitter](https://en.wikipedia.org/wiki/Jitter) to minimize the amount of builds created at the same time for all repos.
Contributor

When there's an outage (i.e. upgrades, disasters, etc.), how would you anticipate Scheduled Builds working? Does it skip any "schedules" it missed or would it attempt to catch up? If we play "catch up", that may spawn a lot of builds immediately following an outage. Alternatively, if we skip the "schedules", how much scope creep would it take to add an on-demand re-run of the pipeline without a PR?

Contributor Author

> When there's an outage (i.e. upgrades, disasters, etc.), how would you anticipate Scheduled Builds working?

I think the idea is to adopt the same behavior we have today for other features/functionality of the server.

i.e. the server doesn't push builds to the queue for webhooks delivered while it was unavailable

Regardless of the reason why the server(s) were down, no attempts will be made to catch up.

Once the server(s) comes back online, they'll start processing the schedules as expected.

Contributor

@jbrockopp How about an on-demand re-run? We've seen this request in the past and, depending on how we approach Scheduled Builds, this could be implemented very similarly. Do you think a "Run Now" option could fit into this proposal?

Contributor Author

Yeah, I do think there is value in providing such a feature and we're hoping to leverage deployments for this:

https://github.com/go-vela/community/blob/feature/proposal/scheduled_builds/proposals/2023/03-16_scheduled-builds.md#yaml

@jbrockopp
Contributor Author

> What if X and Y end up scanning for and starting the build at the same time?

@cognifloyd this shouldn't be possible since we plan to use jitter with some level of randomness.

In an ideal scenario, this means two or more servers can't push builds to the queue for the same schedule.

> In which case, the db transaction would serve as the locking mechanism. Is that what you're suggesting?

Yes, the plan is to adopt the same logic we have in other parts of the codebase that depend on data from the DB.

i.e. build number, hook number, repo counter etc.

@plyr4
Contributor

plyr4 commented Mar 20, 2023

overall looks great

+1 to wondering how we'll do the database transaction locking (hoping to use that same logic in future functionality, so this will be good inspiration)

@plyr4
Contributor

plyr4 commented Mar 20, 2023

have you had any thoughts on if we should do clean up on stale schedules? perhaps a "scheduled_until" field that defaults to 6 months, that gets renewed with some sort of engagement/interaction from the user?
i fear the "setup-and-forget" scenario where users set up schedules for testing and never clean them up.

@plyr4
Contributor

plyr4 commented Mar 20, 2023

any thoughts on having a limit to the amount of schedules a repo can have? or is the plan to have the repo-build-limit be the gatekeeper for the excessive build traffic this might cause?
i think overall our build rate limiting could use some work.. so maybe it's scope creep to think about it in this feature 🤷
and i suppose it should be up to the pipeline designer to be wary of hitting the rate limit

@jbrockopp
Contributor Author

jbrockopp commented Mar 21, 2023

> have you had any thoughts on if we should do clean up on stale schedules? perhaps a "scheduled_until" field that defaults to 6 months, that gets renewed with some sort of engagement/interaction from the user?
> i fear the "setup-and-forget" scenario where users set up schedules for testing and never clean them up.

@plyr4 I understand the concern and approach.. what if we based it off the repo.active field?

i.e. if a repo's active field is false, then we don't run any schedules for that repo

My concern with scheduled_until is the extra complexity and logistics required to deliver on that.

For example, do we expect Vela to send reminders to users to notify them their schedule will stop running?

If so, then we'll need to build out that notification component within the system.

Also, do we only notify the repo owner? Or do we have to notify all repo admins?

I can already imagine users complaining that their scheduled builds didn't run and they weren't notified about it.

> any thoughts on having a limit to the amount of schedules a repo can have? or is the plan to have the repo-build-limit be the gatekeeper for the excessive build traffic this might cause?
> i think overall our build rate limiting could use some work.. so maybe it's scope creep to think about it in this feature 🤷
> and i suppose it should be up to the pipeline designer to be wary of hitting the rate limit

We have no opposition to setting a limit on the number of schedules a repo can have.

What did you have in mind as a limit for the number of schedules per repo?

To answer your question, the plan was to leverage the build limits per repo to prevent excessive build traffic.

The value here is that the build limit is configurable per Vela installation, granting admins some level of control.

Ideally that combined with configurable minimum schedule cadence would grant enough control for admins.

@jbrockopp jbrockopp added feature Indicates a new feature area/api Indicates a change to the API area/cli Indicates a change to the CLI area/server Indicates a change to the server area/ui Indicates a change to the UI labels Mar 21, 2023
JordanSussman
JordanSussman previously approved these changes Mar 22, 2023
@plyr4
Contributor

plyr4 commented Mar 29, 2023

@jbrockopp awesome thank you for typing that up

> i.e. if a repo's active field is false, then we don't run any schedules for that repo

good idea

> notification component

noo thank you 😅

> We have no opposition to setting a limit on the number of schedules a repo can have.

👍

@plyr4
Contributor

plyr4 commented Mar 29, 2023

@jbrockopp i've still got one concern with your explanation on not needing a database locking mechanism

> Yes, the plan is to adopt the same logic we have in other parts of the codebase that depend on data from the DB.

we already see a number of issues with concurrency using database constraints (see #213). it's arguably easy to reproduce. maybe i'm not following, but would not having a database locking mechanism just compound the problem?
i know you mentioned jitter, which sounds like it would help in most scenarios, but is it still possible (albeit unlikely) that two servers could sync up and pick up the same schedule?

could you clarify a bit on where the jitter would be implemented? the server scanning interval, the actual cron scheduling time, both?

@plyr4
Contributor

plyr4 commented Mar 29, 2023

also, any thoughts on making this feature-flaggable?
it might work on a per-repo basis, gated somehow by the platform admins. (speaking for some ideas i heard, totally stolen)

@plyr4
Contributor

plyr4 commented Mar 29, 2023

another idea i heard some mention, thoughts on having a field to control if the schedule is "active" or "enabled"
enabled would be easier to discern from the repo.active field but maybe active would be more consistent 🤷

@jbrockopp
Contributor Author

@plyr4

> we already see a number of issues with concurrency using database constraints (see #213).

Are you saying you're actively facing issues with concurrency today? 😮

AFAIK, the issue for #213 is only present for questionable use-cases i.e. spamming builds via restart or webhooks

Ever since go-vela/server#191 was completed, I haven't seen or heard customers complaining

To be clear, the underlying issue still exists and you'll see that reflected in the server logs but customers don't 😅

> but would not having a database locking mechanism just compound the problem?

I'm not entirely sure but that kinda seems out of scope for the initial implementation?

Based off my limited exposure to database locking, I believe it should be a separate proposal 😅

Database locks come with their own set of caveats/tradeoffs and can be harmful to a system i.e. "excessive blocking"

Happy to write up a feature issue for this once the initial code is implemented 😃

> i know you mentioned jitter which sounds like it would help in most scenarios but is it still possible (albeit unlikely) that two servers could sync up and pick up the same schedule?

In an ideal world, this shouldn't be possible since we'll use jitter with a variation of randomness 😄

If implemented correctly, the chances of two+ servers scanning at the same time should be a fraction of a percent

> could you clarify a bit on where the jitter would be implemented? the server scanning interval, the actual cron scheduling time, both?

The jitter will be involved on both sides of the coin to prevent duplicate workloads when using multiple servers 👍

> also, any thoughts on making this feature-flaggable?
> it might work on a per-repo basis, gated somehow by the platform admins. (speaking for some ideas i heard, totally stolen)

No objections to that approach but that kinda seems out of scope for the initial implementation?

At least, we don't have a need for such functionality for the first attempt at rolling this out

Happy to write up an enhancement issue for this once the initial code is implemented 😃

> another idea i heard some mention, thoughts on having a field to control if the schedule is "active" or "enabled"
> enabled would be easier to discern from the repo.active field but maybe active would be more consistent

Could you elaborate on the value for this?

I assume the idea is a user wants to stop a schedule from running anymore.

There isn't a ton of information required to create a schedule so why wouldn't they simply remove the schedule?

@wass3rw3rk
Member

> > we already see a number of issues with concurrency using database constraints (see #213).
>
> Are you saying you're actively facing issues with concurrency today? 😮

Yes. Affected folks have put workarounds in place, so it's not pressing, but certainly not ideal.


> > also, any thoughts on making this feature-flaggable?
> > it might work on a per-repo basis, gated somehow by the platform admins. (speaking for some ideas i heard, totally stolen)
>
> No objections to that approach but that kinda seems out of scope for the initial implementation?

The hope is to allow for the flexibility of a controlled rollout, similar to the allow repos flag (though that is not changeable post server launch at the moment).


> > another idea i heard some mention, thoughts on having a field to control if the schedule is "active" or "enabled"
> > enabled would be easier to discern from the repo.active field but maybe active would be more consistent
>
> Could you elaborate on the value for this?

The idea is to have something in place that users and platform admins can leverage to pause schedules if needed without having to recall the schedule that may have been in place previously. A toggle would be pretty user friendly.

@jbrockopp
Contributor Author

jbrockopp commented Apr 5, 2023

@wass3rw3rk

> Yes. Affected folks have put workarounds in place, so it's not pressing, but certainly not ideal.

Agreed 😃 certainly not ideal

> The hope is to allow for the flexibility of a controlled rollout, similar to the allow repos flag (though that is not changeable post server launch at the moment).

Makes sense and sounds like a great enhancement once that capability is available 👍

Until then, sounds like we want a VELA_SCHEDULE_ALLOWLIST that replicates VELA_REPO_ALLOWLIST?

https://go-vela.github.io/docs/installation/server/reference/#vela_repo_allowlist

> The idea is to have something in place that users and platform admins can leverage to pause schedules if needed without having to recall the schedule that may have been in place previously. A toggle would be pretty user friendly.

Makes sense so are we good with a schedule.active field that replicates the behavior of repo.active?

If so, any other things we need to account for besides the below?

- `active` field for all schedules attached to a repo mirrors its `active` field
  - disabling a repo will also disable all schedules for that repo
  - enabling a repo will also enable all schedules for that repo (unsure if we actually want this part)
- servers scanning the schedules table will only check for schedules where `active == true`
- API/CLI/UI need to support modifying the `active` field for a schedule

EDIT: Thinking about this more, do we want to enable all schedules for a repo if it has been reenabled?

@cognifloyd
Member

cognifloyd commented Apr 21, 2023

> > also, any thoughts on making this feature-flaggable?
> > it might work on a per-repo basis, gated somehow by the platform admins. (speaking for some ideas i heard, totally stolen)
>
> No objections to that approach but that kinda seems out of scope for the initial implementation?

I would like a CLI flag on the server to completely disable support for any scheduled builds in Vela. At least for now, we're trying to push all scheduled runs of things into a different service so that there is a central view of all scheduled jobs (and no more ad-hoc cron on random servers). So, when this feature is implemented, I need a way to just turn it off and prevent people from using it.

@jbrockopp
Contributor Author

@cognifloyd sounds good 👍

The ability to disable this feature will be available via the VELA_SCHEDULE_ALLOWLIST environment variable

@jbrockopp jbrockopp changed the title feat(proposals): start scheduled builds feat(proposals): support scheduled builds May 3, 2023
plyr4
plyr4 previously approved these changes Jul 18, 2023
wass3r
wass3r previously approved these changes Jul 18, 2023
@plyr4 plyr4 dismissed stale reviews from wass3r and themself via fb6ad9a October 11, 2023 14:25
Contributor

@KellyMerrick KellyMerrick left a comment

lgtm

@ecrupper ecrupper merged commit 9408b4b into main Oct 11, 2023
@ecrupper ecrupper deleted the feature/proposal/scheduled_builds branch October 11, 2023 14:25