Improve retry workflow for release automation #606

Closed
dagood opened this issue Jun 24, 2022 · 1 comment · Fixed by microsoft/go-infra#51

dagood commented Jun 24, 2022

Retrying a build requires copying a value from the build log and pasting it into the correct field in the "Run build" form. The dev needs to remember which field to paste the value into while navigating the AzDO UI to get to the right place. Not a huge problem, and a mistake would normally result in a failed build at worst (because the values aren't interchangeable), but it's worth improving.

Ideally, the build would generate a link that the dev can click to run a new build with the retry parameter filled in. This may not be possible with a simple link: the AzDO API expects POST, the build parameters are passed in the request body, and I suspect AzDO has cross-site limitations that could also block us from simple workarounds like posting a static HTML page to blob storage.

An alternative could be a releasego command that the dev runs on their machine to run the build. This adds some complexity and potentially secret management to the dev environment.

A simple mitigation would be to include the retry step number inside the value the dev copies. For example, instead of copying abc12342 and remembering it goes into field 3, the dev would copy 3 abc12342. When pasting the value, it's obvious if the number doesn't match the prompt, and it's easy to ctrl-z and paste into the right field. The build process then checks that the 3 is present in the value and fails immediately on a mismatch, so if the dev misses the mismatch and hits "Run build" anyway, the build gives very quick feedback (rather than, for example, polling indefinitely for a value that will never exist).
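
A minimal sketch of that check, assuming the pasted value has the form "<step> <value>" separated by a space. The function name parseRetryValue and the exact error wording are hypothetical, not taken from releasego:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRetryValue (hypothetical name) splits a pasted value such as
// "3 abc12342" into its step prefix and payload, and fails fast if the
// prefix does not match the step whose field the value was pasted into.
func parseRetryValue(pasted string, wantStep int) (string, error) {
	stepText, payload, ok := strings.Cut(strings.TrimSpace(pasted), " ")
	if !ok {
		return "", fmt.Errorf("retry value %q has no step prefix; expected \"%d <value>\"", pasted, wantStep)
	}
	step, err := strconv.Atoi(stepText)
	if err != nil {
		return "", fmt.Errorf("retry value %q: step prefix %q is not a number", pasted, stepText)
	}
	if step != wantStep {
		return "", fmt.Errorf("retry value is for step %d but was pasted into the field for step %d", step, wantStep)
	}
	payload = strings.TrimSpace(payload)
	if payload == "" {
		return "", fmt.Errorf("retry value %q has a step prefix but no value", pasted)
	}
	return payload, nil
}

func main() {
	// The same pasted value, checked against the matching field and a wrong one.
	for _, field := range []int{3, 2} {
		value, err := parseRetryValue("3 abc12342", field)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", value)
	}
}
```

Running this check before any polling starts is what turns a mispaste into an immediate failure instead of a long wait.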

dagood commented Jul 5, 2022

Some approaches I tried out along the way:

Self-rerun

Add an approval ("manual intervention") step, and if approved, have the pipeline run itself again with the same parameters it was run with. I got it roughly working: microsoft/go-infra@46d549f. However, I don't think it actually fits very well:

  • This only lets you retry with the exact same parameters. You can "resume" polling if the job times out, but that's it. If the situation requires any tweaks/fixes between retries, you're stuck back with the old way of doing the retry. (Read the logs, copy and paste.) I don't think resuming polling will be needed often enough to justify a feature dedicated to it.
  • The code itself is complicated, mostly because we have to work within the constraints and complications of AzDO Pipeline YML. Every poll1Foo parameter is now copy-pasted more times (making it harder to change the flow in the future), and all the non-polling parameters need to be copy-pasted, too. This is necessary so we can assemble the full set of build parameters/variables for the next run. (Unlike with the "Run new" button, existing parameters aren't automatically transferred, so we need to do it ourselves.)
  • The build needs to reserve an agent in order to send the re-run API call. This means there's some delay before the retry even starts (and requests its agents). Retrying with "Run new" goes through immediately. This is a minor point, but worth mentioning.

URL that runs a retry build

There are a few issues and I didn't end up figuring this out:

  • The API accepts POST (not GET): https://docs.microsoft.com/en-us/rest/api/azure/devops/build/builds/queue?view=azure-devops-rest-7.1. It also expects a request body. AFAIK this means a plain link (which sends a GET) can't queue a build without some external contrivance.
  • The API requires a personal access token. Normally, devs don't need to maintain their own PAT.
  • I expect AzDO to prevent CSRF, so setting up a static page that takes retry info and uses the user's auth to queue a new build (perhaps using the AzDO frontend API rather than PAT API) seems unlikely to work.
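
To make the first two points concrete, here is a rough sketch of what a queue request looks like when built by hand. It only constructs the request and does not send it. The assumptions are: the URL shape and `api-version=7.1` follow the linked REST docs, the PAT is sent as the password of HTTP basic auth with an empty username, and retry values travel in the body's `templateParameters` field. The organization, project, definition ID, and the parameter name `poll3Value` are hypothetical placeholders:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// queueRequest is the small subset of the Build resource used here.
// Treat the field choice as an assumption to verify against the REST docs.
type queueRequest struct {
	Definition struct {
		ID int `json:"id"`
	} `json:"definition"`
	TemplateParameters map[string]string `json:"templateParameters,omitempty"`
}

// newQueueBuildRequest (hypothetical helper) builds the POST request that
// queues a build. A plain link cannot express this: it needs a POST method,
// a JSON body, and an Authorization header.
func newQueueBuildRequest(org, project, pat string, definitionID int, params map[string]string) (*http.Request, error) {
	var body queueRequest
	body.Definition.ID = definitionID
	body.TemplateParameters = params

	data, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}

	endpoint := fmt.Sprintf("https://dev.azure.com/%s/%s/_apis/build/builds?api-version=7.1",
		url.PathEscape(org), url.PathEscape(project))

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	// PAT auth: basic auth with an empty username and the token as password.
	req.SetBasicAuth("", pat)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newQueueBuildRequest("example-org", "example-project", "fake-pat", 42,
		map[string]string{"poll3Value": "abc12342"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Every piece here (method, body, token) is something a clickable link in the build log can't carry on its own, which is why the options above all involve either a local command or a separate service.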

Our own service

We could set up our own service (e.g. Azure functions) that has its own authentication to AzDO and has the authority to queue builds. This would be very flexible and could even help with retry scenarios that need something more complex than a straightforward retry.

This can be taken a step further: we could have a service do our release instead of having a set of AzDO pipelines doing it. (Or, at least, the service could call out to AzDO only for steps that involve manipulating artifacts.)

Reusing AzDO does have some benefits: it's relatively easy to see what release actions are going on and diagnose issues, and the security scope is fairly clear. It would take some time to get there with our own system.
