-
Notifications
You must be signed in to change notification settings - Fork 36
add landoscript #1135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add landoscript #1135
Conversation
f65011c to
58a14c5
Compare
6cd34a0 to
f880a85
Compare
…key names A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
…key names A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
…key names A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
…key names A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
…key names A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
…key names (#1147) A few things going on here, mostly boring: * Move the existing GithubClient to scriptworker_client * Move the associated test fixtures to a new `pytest-scriptworker-client` package * Necessary adjustments in treescript to let it be able to find and use the new packages Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in #1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
4c7c77b to
78d629c
Compare
ce7cc7a to
2e0da34
Compare
a73c334 to
8d4bc25
Compare
Goes with the landoscript work from mozilla-releng/scriptworker-scripts#1135
c93d3ae to
39fdb66
Compare
ahal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Did a first pass
|
|
||
| def log_file_contents(contents): | ||
| for line in contents.splitlines(): | ||
| log.info(line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why this is better than just logging contents directly. Having the logger preamble in front of each line just makes it harder to copy/paste imo.
Feel free to ignore this if this was just copy/pasted and you want to keep it as similar as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall now...my only goal is to make sure newlines are rendered. That can probably be done without the splitting though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I recall why we have this now: it's to ensure that we get one log line per line in the file, while ensuring each line is prefixed. (Arguably we could simply log with a leading newline to achieve a similar effect...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarity: I plan to leave this as is unless you feel it's preferable to log with a preceding newline.
|
|
||
|
|
||
| async def poll_until_complete(session: ClientSession, poll_time: int, status_url: str): | ||
| while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess the idea is that we poll until we hit the Taskcluster timeout?
I think we might be losing some of the idempotent-ness of the old approach here. It will certainly be possible for this task to fail due to timeout, but Lando could still land it afterwards. This means a failed task doesn't necessarily indicate a failed landing.
Do you know what would happen if we re-ran the task in such a case? For version bump, it might be worth ensuring that if the requested version is already what's committed, that we exit successfully. I'm not really sure what to do in the general case..
Another approach might be to timeout within the task, and abort the landing prior to failing. That way there's more of a guarantee that a failed landoscript task means the push didn't get landed. I don't think this needs to be fully solved for the MVP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, polling until we timeout is what is happening here, and that's a really good point you've made.
And to be honest, I'm not actually sure what the best thing to do here is. I think in the short term, attempting to cancel the request at shutdown for any non-success makes sense. But this is also less than ideal if, eg: we get spot terminated and have to redo the work because of it. There's also no guarantees that we'll be able to cancel the request in time in the event of a spot termination.
Another obvious option (for the future) would be to take the notarization approach, and do the polling in a separate task. But I also wonder if we even need to bother to poll in some cases? For example, l10n bumps don't have anything downstream of them...maybe we could just spit out the status URL, and leave it at that for cases like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spot termination is not a concern for scriptworkers. We might get killed if we exceed terminationGracePeriodSeconds though, in which case we should still aim to preserve idempotence. It looks like we only ever have a single lando request, which hopefully is handled atomically at the other end (TIL git push has a --atomic option, nice, I'm only 9 years late to notice)? (I was wondering how this worked for merge day. Currently we have 2 hg pushes, e.g. once to tag central and another to merge to beta. I guess that goes away because we know they're the same underlying git repo and so it doesn't matter where the tag goes?)
We could try and cancel the submission on shutdown (bitrisescript seems to do something similar:
| loop.add_signal_handler(signal.SIGTERM, handle_sigterm, future_group) |
So there's a few scenarios that could happen AFAICT:
- maybe the initial request didn't go through, we can rerun everything, all good
- maybe the initial request failed, or was canceled, we can safely retry.
- maybe the initial request succeeded before the rerun, we should detect that and not do anything
- maybe the initial request is still in flight. We could potentially find it and poll on it, or cancel it and retry? If we can't find it, we may have 2 requests in flight, hopefully only one can succeed?
In treescript we also had the case where one push went through but not the other, and the case where when the next run checked version numbers, the first push had gone through to hgssh but was not yet mirrored out to hgweb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional thoughts and strategies, @jcristau! Not needing to worry about spot termination certainly simplifies things, as we can make some safer assumptions.
I'm going to give this a bit more thought - but clearly we need some sort of special handling for the rerun scenario, if not separating polling altogether.
In treescript we also had the case where one push went through but not the other, and the case where when the next run checked version numbers, the first push had gone through to hgssh but was not yet mirrored out to hgweb.
If we're talking about two pushes to the same repository, this case will go away AFAIK - all changes to a given repository (which in the new world means branch within the new github repo) will either happen or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we have 2 hg pushes, e.g. once to tag central and another to merge to beta. I guess that goes away because we know they're the same underlying git repo and so it doesn't matter where the tag goes?)
Yeah, that's right. Although that does bring to mind something a bit odd about the API: tag actions are applied to a repository+branch - not to a repository overall. @cgsheeh - is this something you've thought about? (I don't think we necessarily need to change anything here, I'm just noting it because it's a bit odd.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about how to handle termination, the more it seems like moving polling out of the landoscript tasks might be the best thing to do. Something like:
- Return immediately from landoscript after a successful submission; attaching the status url in a well enough artifact
- Add a new task downstream of each landoscript task which does the polling. This doesn't need to be a scriptworker task AFAIK, because polling doesn't require credentials.
It would also avoid the need for any cancellation in landoscript, and thus any race conditions associated with that. If we trust that a landoscript task would never be rerun unless a human has inspected the state of things first, this would eliminate the need for special rerun handling in landoscript, I think. (Although perhaps it's too much a leap to assume it would never be errantly rerun, and we'd want at least some of the rerun safety checks?)
The main downside here is that if the polling task finds that the landoscript submission has failed, we'd need to rerun the upstream landoscript task, and then force a new polling task afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For posterity, I discovered after writing this that all Lando endpoints (including status polling) require authentication. So at least in the immediate term, moving polling outside of landoscript is not an option. (I guess we could poll in a separate landoscript task, but that probably doesn't provide any benefits...)
| diff = "" | ||
| for l10n_file in l10n_files: | ||
| if l10n_file["dst_name"] not in orig_files: | ||
| log.warning(f"WEIRD: {l10n_file['dst_name']} not in dst_files, continuing anyways...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, I approve of this new log level.
| fi | ||
| } | ||
|
|
||
| # TODO: real URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment just to help remember this.
ahal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, lgtm!
…bump action This adds the rough structure for landoscript as well as implementing the `version_bump` action (necessary to make it practical to test the initial code).
Most notably, this moves common set-up into conftest, and tests that aren't testing action-specific logic (eg: lando submission) into test_script.
The helper functions here are copied out of treescript (which will soon be EOL'ed). Also included here is some minor refactoring to avoid duplication of common of `create-commit` logic.
Most of the helpers here are copied out of treescript, with some tweaks and simplications where it was possible (mostly to get rid of now-unnecessary logic). Some refactoring of other actions was done here as well, to make it possible to call them from the `merge_day` action. Most notably: the `version_bump` action has been updated to support multiple version bumps in one run, which allows us to do all the merge day version bumps in a single commit, as we do now with treescript. Additional test refactoring/movement was also done to make many of the helpers available to merge day tests.
…of the happy path tests This new helper allows a huge simplication for most of the action-specific tests.
The implementation of this is a fairly significant departure from the treescript one in two ways: * It supports these actions without having a Gecko or `android-l10n` tree available. This necessitated fetching the remote file listings of these repositories, and using `moz.l10n` instead of `compare-locales`. * I've fully separated out implementation of the actions. Although at a very high level they look similar, the details are different enough that IMO it's much easier and better to duplicate some of the code rather than add the indirection to avoid it. (I found it very difficult to read the treescript implementation because of this, and I didn't want to do the same here). A necessary part of doing this was some enhancements to `diff_contents` to support added and removed files properly.
…ation outside of l10n bump As it turns out, these are static for all other types of actions.
…uests Also ensure that LANDO_API and LANDO_TOKEN are set up during startup.
This required a little bit of massaging for places where there's nested dataclasses, but it was otherwise straightforward. This change also uncovered some places where tests were including unnecessary data, which have been fixed. It also replaced some existing null checks (because the dataclass will throw if a required member is not present during construction).
|
To help make additional changes easier to review I'm going to merge this initial patch. There are still a few things to address from comments here (most notably - how to handle unfinished lando submissions) that will come in follow-up PRs next week. Thanks to everyone who reviewed this giant patch so far! |
…ocessing any of them This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
As pointed out by ahal in mozilla-releng#1135, we're using a new enough Python that we can use the built in parser.
…ocessing any of them This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
As pointed out by ahal in mozilla-releng#1135, we're using a new enough Python that we can use the built in parser.
As pointed out by ahal in #1135, we're using a new enough Python that we can use the built in parser.
…ocessing any of them This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
…ocessing any of them This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
…ocessing any of them (#1165) This is a follow to address something [@ahal called out in the initial review of landoscript](#1135 (comment)).
This (fairly hefty) pull request adds a new scriptworker: landoscript. This new scriptworker will be replacing treescript when writes to hg.mozilla.org for Gecko repositories cease to happen.
Because it ultimately fulfills the same use cases as treescript, I've chosen to keep the payload as similar as possible. There are some necessary changes (mostly around using lando repository names instead of references to hg.mozilla.org repositories). I've also removed support for some things that are not used (eg:
ignore_closed_treeanddontbuildare either not present for most actions, or always have the same value).Some notes on the tests included here:
merge_dayandversion_bumptasks, which are often slightly different depending on which repository they are running from).async_main, rather than deeply unit testing individual components. This does end up making the tests notable bigger, which can be intimidating, but I've found that once written they need little to no touching, and has made refactoring as I go along significantly easier. There is probably an argument to be made that certain helper pure functions (likediff_contents) could do with some separate unit testing - but I've not gone to that length yet.aioresponsescalls is important. In an ideal world we would be able to key responses off of parts of the requests, but I haven't had time to make this happen.At the time of writing I have not yet been able to test it in a dev environment - I've merely been working off of the spec at https://docs.google.com/document/d/1F3xYp6v4YLLHsX7zFkYoygDyvIUPzujxeMW7W4qte0o/edit?tab=t.0#heading=h.c29tth48xps4. Seeing as this is completely unused so far, I'm hoping to get what's here merged, and then follow up with more targeted pull requests as I fix issues that I encounter when testing in the real world.