
RFC - AR Cost Reduction Plan #79

Merged 2 commits on Mar 23, 2023

Conversation

@brianherrera (Contributor)

No description provided.

@Kadino Kadino left a comment


It's a bit difficult to comment on this RFC as there are many trailheads proposed to investigate, but not many specific changes proposed. Data on where existing costs are spent may highlight where/how to optimize, as would estimates of how much each proposed change is expected to reduce costs. Potential downsides such as reduced execution speed are not discussed, likely because investigation has not yet yielded data.

Comment on lines +24 to +28
### Improve our AR Configuration

SIG-Build has recently started the effort to reduce our build times across all platforms and is currently investigating where optimizations can be made.

We will address the stability of our AR pipeline to reduce the number of times a maintainer has to repeat the AR run for flaky tests or any other AR failure that is not directly related to the incoming code change.

While this section highlights areas of investigation, there do not appear to be any specific changes proposed. It is fairly unclear what will be done to improve configuration/build-times/stability. The merge queue section below is adequately specific.


We will address the stability of our AR pipeline to reduce the number of times a maintainer has to repeat the AR run for flaky tests or any other AR failure that is not directly related to the incoming code change.

We will also investigate using a merge queue in order to significantly reduce the total number of AR runs we need to execute. The purpose of the merge queue will be to batch together incoming PRs and run the AR on the combined changes. If the AR succeeds, the PRs are then merged in together. If the AR fails, the PR that caused the failure is removed from the queue and the AR is retried. SIG-Build is currently working on a [mechanism](https://github.com/o3de/sig-build/blob/main/rfcs/rfc-bld-20220623-1-build-failure-rca.md) to identify owners of AR build failures, which would enable this automation in a merge queue.

@Kadino Kadino Feb 28, 2023


There are likely to be implementation and UX costs to using a build queue. There may also be new inefficiencies introduced when a combined build of changes A+B+C+D+E contains new bugs and still requires a dozen build attempts before the five changes get approval. However, it seems like this needs investigation before a specific change is proposed.

It is also unclear how ownership information relates to build queuing. While a notification can prompt action after a change is rejected by the queue, the same action could be prompted regardless of build cadence or queuing. This seems like extraneous information, since it does not clarify why using a build queue is appropriate or how a build queue is expected to improve the pipeline.
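
To make the batching behavior concrete, here is a minimal Python sketch of the merge/evict loop described in the quoted section. `process_merge_queue` and its `run_ar` callback are hypothetical stand-ins for whatever actually executes the AR against a combined set of changes; this illustrates the intended flow, not GitHub's merge queue implementation or the RFC's eventual design.

```python
from typing import Callable, List, Tuple

def process_merge_queue(
    queue: List[str],
    run_ar: Callable[[List[str]], bool],
    batch_size: int = 5,
) -> Tuple[List[str], List[str]]:
    """Batch queued PRs, run the AR once per combined batch, and on failure
    evict PRs from the batch until the remainder passes. Returns (merged, rejected)."""
    merged: List[str] = []
    rejected: List[str] = []
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        while batch:
            if run_ar(batch):
                merged.extend(batch)  # combined run passed: merge the whole batch
                batch = []
            else:
                # A real queue would rely on failure attribution (e.g. the RCA
                # mechanism linked above) to pick the culprit; this sketch just
                # evicts the newest PR and retries the rest.
                rejected.append(batch.pop())
    return merged, rejected

# With five clean PRs queued, one combined AR run replaces five separate runs:
print(process_merge_queue(["pr1", "pr2", "pr3", "pr4", "pr5"], run_ar=lambda prs: True))
```

The happy path is where the savings come from: one combined run covers a whole batch, while the failure path degrades back toward per-change runs, which is the inefficiency raised above.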


### Utilize lower cost build resources

Addressing the runtimes and stability of the AR will allow us to migrate some of the jobs to lower cost build resources. This will include utilizing lower spec instances in our current infrastructure and also hosted platforms like GitHub Actions.

It is unclear how runtime and stability affect the ability to migrate to different hardware. Long builds and intermittent failures should cause inefficiencies at similar rates on either class of hardware. The question seems to be about optimizing cost per compute-unit. Is this trying to highlight time constraints, which are in tension with moving to the slowest, cheapest hardware available?

minor: May be more specific to state "pipeline duration" instead of "runtime" which has multiple meanings
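
To frame "cost per compute-unit" a bit more concretely, here is a hedged sketch of a per-run cost model. It assumes, as the author notes later in the thread, that each AR job runs on its own host; every rate and duration below is an illustrative placeholder, not a measured O3DE figure.

```python
# Per-run cost model: one pipeline run costs the sum over jobs of
# duration * hourly rate. All rates and durations are hypothetical placeholders.
HOURLY_RATE_USD = {
    "large_self_hosted": 1.50,  # placeholder rate for a large EC2-backed builder
    "small_self_hosted": 0.40,  # placeholder rate for a lower-spec instance
    "github_hosted": 0.25,      # placeholder rate for a GitHub-hosted runner
}

def run_cost(jobs: list[tuple[str, float]]) -> float:
    """jobs is a list of (instance_class, duration_hours) tuples."""
    return sum(HOURLY_RATE_USD[kind] * hours for kind, hours in jobs)

# Cheaper hardware only wins if the slowdown does not eat the rate difference
# (and does not blow the pipeline-duration budget):
before = run_cost([("large_self_hosted", 1.0)])  # 1 h on the large instance
after = run_cost([("small_self_hosted", 2.0)])   # same job, 2 h on the small one
print(before, after)  # 1.5 0.8 -> cheaper per run, but the pipeline takes longer
```

Whether that trade is acceptable depends on the pipeline-duration constraint, which is exactly the tension called out above.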

Comment on lines 42 to 44
### Revisit our testing approach in the PR workflow

A very importance aspect of reducing our costs is revisiting our approach to setting up our build and test targets in the AR. This investigation will be driven by SIG-Build in collaboration with the other SIGs to determine how to get the best value from the AR.

This section is not making a specific suggested change, nor is it clarifying important metrics. I recommend defining a specific budget cap in terms of total execution-duration per pipeline run, and then we can hold discussion with other SIGs about slicing up where that budget gets spent.

minor: important

_Phase 3 Cost Improvements:_

* Migrate PR checks that are compatible with GitHub’s hosted runners.
* Separate longer running tasks that are not suitable to gate PRs or are incompatible with GitHub actions to the merge queue checks, post-merge checks, or nightly/weekly checks.

@Kadino Kadino Feb 28, 2023


Longer running, lower-priority tasks are already separated into the nightly checks. I don't see a necessary change unless there is a new heuristic of what "too long" means, such as a specific execution time budget.
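
If the SIGs did settle on an explicit execution-time budget, the routing heuristic could be as simple as the sketch below. The 120-minute budget, the check names and durations, and the shortest-first ordering are all placeholder assumptions; a real policy would rank checks by the value they add to PR gating, not by duration alone.

```python
# Hypothetical routing rule: fill the PR gate up to a total compute-minute
# budget; anything that does not fit is deferred to the merge queue,
# post-merge, or nightly runs. Budget and durations are placeholders.
PR_GATE_BUDGET_MINUTES = 120.0

def route_checks(checks: dict[str, float]) -> dict[str, list]:
    """checks maps a check name to its typical duration in minutes."""
    routed = {"pr_gate": [], "deferred": []}
    spent = 0.0
    for name, minutes in sorted(checks.items(), key=lambda item: item[1]):
        if spent + minutes <= PR_GATE_BUDGET_MINUTES:
            routed["pr_gate"].append(name)
            spent += minutes
        else:
            routed["deferred"].append(name)
    return routed

print(route_checks({"unit_tests": 25.0, "editor_smoke": 40.0, "asset_bundle": 90.0}))
# {'pr_gate': ['unit_tests', 'editor_smoke'], 'deferred': ['asset_bundle']}
```

A budget like this would also give the "too long" heuristic a concrete, reviewable number instead of a case-by-case judgment.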


### Example: 22.10 Release AR Metrics

For the 22.10 release we saw a high AR failure rate (48%) and total runs (2.2 avg per PR) for PRs targeting the stabilization branch. There was an [issue](https://github.com/o3de/o3de/pull/12346) in the stabilization branch that made AR runs intermittently fail. This went undetected for a few days while developers attempted to re-run their ARs to get a successful build. To prevent issues like this, we need to set up mechanisms that raise awareness of these failures and escalate them so developers address the issue.

@Kadino Kadino Feb 28, 2023


While this data does include intermittent failures, pull requests also contain legitimate failures being correctly caught by tests (or legitimate build failures, asset failures, etc.). And while there is currently no way for contributors to claim or indicate which failures were intermittent in this dataset, the data from branch-update runs should be the subset containing only intermittent failures, since those changes initially passed the AR.

SIG-Testing has a metrics-based proposal to improve detection of intermittent failures in branch update runs and the periodic runs: o3de/sig-testing#64
...but the onus will always be on SIGs to address instability they own in their product and its tests.
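
As a back-of-the-envelope check on the 22.10 numbers quoted above, assuming a simple retry-until-pass model (an assumption, not something stated in the RFC): if every run failed independently, a 48% per-run failure rate would predict roughly 1.9 runs per PR, slightly below the observed 2.2 average.

```python
# Retry model for the quoted 22.10 metrics. Assumes each AR run fails
# independently with probability p and authors re-run until it passes, so the
# number of runs per PR follows a geometric distribution.
failure_rate = 0.48                     # per-run failure rate reported above
expected_runs = 1 / (1 - failure_rate)  # E[runs] = 1 / (1 - p)
print(f"{expected_runs:.2f} expected runs per PR")  # ~1.92

# The observed 2.2 runs per PR is higher, consistent with the point above:
# some failures were legitimate and needed follow-up pushes (each triggering
# another AR run) rather than a plain re-run.
```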

Signed-off-by: Brian Herrera <briher@amazon.com>
@brianherrera (Contributor, Author)

> It's a bit difficult to comment on this RFC as there are many trailheads proposed to investigate, but not many specific changes proposed. Data on where existing costs are spent may highlight where/how to optimize, as would estimates of how much each proposed change is expected to reduce costs. Potential downsides such as reduced execution speed are not discussed, likely because investigation has not yet yielded data.

Yes, the purpose of this doc is to provide, at a high level, the direction SIG-Build is taking to reduce costs for the AR: primarily changes to the workflow like integrating merge queues, allocating engineering resources to redesign our jobs to run with GitHub Actions, and moving some tests outside the AR.

It's very likely this doc will spawn other more technically detailed RFCs for the components discussed here, like integrating S3SIS into our pipeline to improve our caching. There would be a lot to cover if it were all included in this doc. And like you mentioned, there are still outstanding investigations that need to be performed to validate some of the assumptions in this doc.

I can provide more data on our cost structure related to running the AR. I'll highlight that each job in the AR is executed on its own host, which incurs EC2/EBS costs, provide details like the instance type, etc., and explain how the proposals in this doc plan to address it.

Signed-off-by: Brian Herrera <briher@amazon.com>

@dshmz dshmz left a comment


Reviewed the AR change plan and it looks good. Approved.
