
RFC - AR Cost Reduction Plan #79

Merged 2 commits on Mar 23, 2023

Conversation

@brianherrera (Contributor)

No description provided.

@Kadino Kadino left a comment


It's a bit difficult to comment on this RFC as there are many trailheads proposed to investigate, but not many specific changes proposed. Data on where existing costs are spent may highlight where/how to optimize, as would estimates of how much each proposed change is expected to reduce costs. Potential downsides such as reduced execution speed are not discussed, likely because investigation has not yet yielded data.

Comment on lines +24 to +28
### Improve our AR Configuration

SIG-Build has recently started the effort to reduce our build times across all platforms and is currently investigating where optimizations can be made.

We will address the stability of our AR pipeline to reduce the number of times a maintainer has to repeat the AR run for flaky tests or any other AR failure that is not directly related to the incoming code change.

While this section highlights areas of investigation, there do not appear to be any specific changes proposed. It is fairly unclear what will be done to improve configuration/build-times/stability. The merge queue section below is adequately specific.


We will address the stability of our AR pipeline to reduce the number of times a maintainer has to repeat the AR run for flaky tests or any other AR failure that is not directly related to the incoming code change.

We will also investigate using a merge queue in order to significantly reduce the total number of AR runs we need to execute. The purpose of the merge queue will be to batch together incoming PRs and run the AR on the combined changes. If the AR succeeds, the PRs are then merged in together. If the AR fails, the PR that caused the failure is removed from the queue and the AR is retried. SIG-Build is currently working on a [mechanism](https://github.com/o3de/sig-build/blob/main/rfcs/rfc-bld-20220623-1-build-failure-rca.md) to identify owners of AR build failures, which would enable this automation in a merge queue.

@Kadino Kadino Feb 28, 2023


There are likely to be implementation and UX costs to using a build queue. There may also be new inefficiencies introduced when a combined build of changes A+B+C+D+E contains new bugs and still requires a dozen build attempts before the five changes get approval. However, it seems like this needs investigation before a specific change is proposed.

It is also unclear how ownership information relates to build queuing. While a notification can prompt action after a change is rejected by the queue, the same action could be prompted regardless of build cadence or queuing. This seems like extraneous information, since it does not clarify why using a build queue is appropriate or how a build queue is expected to improve the pipeline.
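
To make the batching behavior concrete, here is a minimal Python sketch of the merge/evict loop described in the quoted section. `process_merge_queue` and its `run_ar` callback are hypothetical stand-ins for whatever actually executes the AR against a combined set of changes; this illustrates the intended flow, not GitHub's merge queue implementation or the RFC's eventual design.

```python
from typing import Callable, List, Tuple

def process_merge_queue(
    queue: List[str],
    run_ar: Callable[[List[str]], bool],
    batch_size: int = 5,
) -> Tuple[List[str], List[str]]:
    """Batch queued PRs, run the AR once per combined batch, and on failure
    evict PRs from the batch until the remainder passes. Returns (merged, rejected)."""
    merged: List[str] = []
    rejected: List[str] = []
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        while batch:
            if run_ar(batch):
                merged.extend(batch)  # combined run passed: merge the whole batch
                batch = []
            else:
                # A real queue would rely on failure attribution (e.g. the RCA
                # mechanism linked above) to pick the culprit; this sketch just
                # evicts the newest PR and retries the rest.
                rejected.append(batch.pop())
    return merged, rejected

# With five clean PRs queued, one combined AR run replaces five separate runs:
print(process_merge_queue(["pr1", "pr2", "pr3", "pr4", "pr5"], run_ar=lambda prs: True))
```

The happy path is where the savings come from: one combined run covers a whole batch, while the failure path degrades back toward per-change runs, which is the inefficiency raised above.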


### Utilize lower cost build resources

Addressing the runtimes and stability of the AR will allow us to migrate some of the jobs to lower cost build resources. This will include utilizing lower spec instances in our current infrastructure and also hosted platforms like GitHub Actions.

It is unclear how runtime and stability affect the ability to migrate to different hardware. Long builds and intermittent failures should cause inefficiencies at similar rates on either class of hardware. The question seems to be about optimizing cost per compute-unit. Is this trying to highlight time constraints, which are in tension with moving to the slowest, cheapest hardware available?

minor: May be more specific to state "pipeline duration" instead of "runtime" which has multiple meanings
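
To frame "cost per compute-unit" a bit more concretely, here is a hedged sketch of a per-run cost model. It assumes, as the author notes later in the thread, that each AR job runs on its own host; every rate and duration below is an illustrative placeholder, not a measured O3DE figure.

```python
# Per-run cost model: one pipeline run costs the sum over jobs of
# duration * hourly rate. All rates and durations are hypothetical placeholders.
HOURLY_RATE_USD = {
    "large_self_hosted": 1.50,  # placeholder rate for a large EC2-backed builder
    "small_self_hosted": 0.40,  # placeholder rate for a lower-spec instance
    "github_hosted": 0.25,      # placeholder rate for a GitHub-hosted runner
}

def run_cost(jobs: list[tuple[str, float]]) -> float:
    """jobs is a list of (instance_class, duration_hours) tuples."""
    return sum(HOURLY_RATE_USD[kind] * hours for kind, hours in jobs)

# Cheaper hardware only wins if the slowdown does not eat the rate difference
# (and does not blow the pipeline-duration budget):
before = run_cost([("large_self_hosted", 1.0)])  # 1 h on the large instance
after = run_cost([("small_self_hosted", 2.0)])   # same job, 2 h on the small one
print(before, after)  # 1.5 0.8 -> cheaper per run, but the pipeline takes longer
```

Whether that trade is acceptable depends on the pipeline-duration constraint, which is exactly the tension called out above.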

Comment on lines 42 to 44
### Revisit our testing approach in the PR workflow

A very importance aspect of reducing our costs is revisiting our approach to setting up our build and test targets in the AR. This investigation will be driven by SIG-Build in collaboration with the other SIGs to determine how to get the best value from the AR.

This section is not making a specific suggested change, nor is it clarifying important metrics. I recommend defining a specific budget cap in terms of total execution-duration per pipeline run, and then we can hold discussion with other SIGs about slicing up where that budget gets spent.

minor: important

_Phase 3 Cost Improvements:_

* Migrate PR checks that are compatible with GitHub’s hosted runners.
* Separate longer running tasks that are not suitable to gate PRs or are incompatible with GitHub actions to the merge queue checks, post-merge checks, or nightly/weekly checks.

@Kadino Kadino Feb 28, 2023


Longer running, lower-priority tasks are already separated into the nightly checks. I don't see a necessary change unless there is a new heuristic of what "too long" means, such as a specific execution time budget.
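
If the SIGs did settle on an explicit execution-time budget, the routing heuristic could be as simple as the sketch below. The 120-minute budget, the check names and durations, and the shortest-first ordering are all placeholder assumptions; a real policy would rank checks by the value they add to PR gating, not by duration alone.

```python
# Hypothetical routing rule: fill the PR gate up to a total compute-minute
# budget; anything that does not fit is deferred to the merge queue,
# post-merge, or nightly runs. Budget and durations are placeholders.
PR_GATE_BUDGET_MINUTES = 120.0

def route_checks(checks: dict[str, float]) -> dict[str, list]:
    """checks maps a check name to its typical duration in minutes."""
    routed = {"pr_gate": [], "deferred": []}
    spent = 0.0
    for name, minutes in sorted(checks.items(), key=lambda item: item[1]):
        if spent + minutes <= PR_GATE_BUDGET_MINUTES:
            routed["pr_gate"].append(name)
            spent += minutes
        else:
            routed["deferred"].append(name)
    return routed

print(route_checks({"unit_tests": 25.0, "editor_smoke": 40.0, "asset_bundle": 90.0}))
# {'pr_gate': ['unit_tests', 'editor_smoke'], 'deferred': ['asset_bundle']}
```

A budget like this would also give the "too long" heuristic a concrete, reviewable number instead of a case-by-case judgment.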


### Example: 22.10 Release AR Metrics

For the 22.10 release we saw a high AR failure rate (48%) and total runs (2.2 avg per PR) for PRs targeting the stabilization branch. There was an [issue](https://github.com/o3de/o3de/pull/12346) in the stabilization branch that made AR runs intermittently fail. This went undetected for a few days while developers attempted to re-run their ARs to get a successful build. To prevent issues like this, we need to set up mechanisms that raise awareness of these failures and escalate them so developers address the issue.

@Kadino Kadino Feb 28, 2023


While this data does include intermittent failures, pull requests also contain legitimate failures being correctly caught by tests (or legitimate build failures, asset failures, etc.). And while there is currently no way for contributors to claim or indicate which failures were intermittent in this dataset, the data from branch-update runs should be the subset containing only intermittent failures, since those changes initially passed the AR.

SIG-Testing has a metrics-based proposal to improve detection of intermittent failures in branch update runs and the periodic runs: o3de/sig-testing#64
...but the onus will always be on SIGs to address instability they own in their product and its tests.
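
As a back-of-the-envelope check on the 22.10 numbers quoted above, assuming a simple retry-until-pass model (an assumption, not something stated in the RFC): if every run failed independently, a 48% per-run failure rate would predict roughly 1.9 runs per PR, slightly below the observed 2.2 average.

```python
# Retry model for the quoted 22.10 metrics. Assumes each AR run fails
# independently with probability p and authors re-run until it passes, so the
# number of runs per PR follows a geometric distribution.
failure_rate = 0.48                     # per-run failure rate reported above
expected_runs = 1 / (1 - failure_rate)  # E[runs] = 1 / (1 - p)
print(f"{expected_runs:.2f} expected runs per PR")  # ~1.92

# The observed 2.2 runs per PR is higher, consistent with the point above:
# some failures were legitimate and needed follow-up pushes (each triggering
# another AR run) rather than a plain re-run.
```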

Signed-off-by: Brian Herrera <briher@amazon.com>
@brianherrera (Contributor, Author)

> It's a bit difficult to comment on this RFC as there are many trailheads proposed to investigate, but not many specific changes proposed. Data on where existing costs are spent may highlight where/how to optimize, as would estimates of how much each proposed change is expected to reduce costs. Potential downsides such as reduced execution speed are not discussed, likely because investigation has not yet yielded data.

Yes, the purpose of this doc is to provide, at a high level, the direction SIG-Build is taking to reduce costs for the AR: primarily changes to the workflow like integrating merge queues, allocating engineering resources to redesign our jobs to run with GitHub Actions, and moving some tests outside the AR.

It's very likely this doc will spawn other more technically detailed RFCs for the components discussed here, like integrating S3SIS into our pipeline to improve our caching. There would be a lot to cover if it were all included in this doc. And like you mentioned, there are still outstanding investigations that need to be performed to validate some of the assumptions in this doc.

I can provide more data on our cost structure related to running the AR. I'll highlight that each job in the AR is executed on its own host, which incurs EC2/EBS costs, provide details like the instance type, etc., and explain how the proposals in this doc plan to address it.

Signed-off-by: Brian Herrera <briher@amazon.com>

@dshmz dshmz left a comment


Reviewed the AR change plan and it looks good. Approved.
