Skip to content

Conversation

@bhearsum
Copy link
Contributor

@bhearsum bhearsum commented Feb 13, 2025

This (fairly hefty) pull request adds a new scriptworker: landoscript. This new scriptworker will be replacing treescript when writes to hg.mozilla.org for Gecko repositories cease to happen.

Because it ultimately fulfills the same use cases as treescript, I've chosen to keep the payload as similar as possible. There are some necessary changes (mostly around using lando repository names instead of references to hg.mozilla.org repositories). I've also removed support for some things that are not used (eg: ignore_closed_tree and dontbuild are either not present for most actions, or always have the same value).

Some notes on the tests included here:

  • Most of the success tests have payloads taken directly from real world treescript tasks (with some small tweaks, per the above note). I fished around and tried to find all variations of these tasks that exist (most notably for merge_day and version_bump tasks, which are often slightly different depending on which repository they are running from).
  • The approach I've taken to testing here is the same as what I've been attempting to do for signingscript in WIP: add tests that operate at the API level #1097: execute tests by running async_main, rather than deeply unit testing individual components. This does end up making the tests notable bigger, which can be intimidating, but I've found that once written they need little to no touching, and has made refactoring as I go along significantly easier. There is probably an argument to be made that certain helper pure functions (like diff_contents) could do with some separate unit testing - but I've not gone to that length yet.
  • I'm not terribly happy with how the responses for the github client are set-up. Because the GraphQL API uses a single endpoint for all requests, and I'm simply returning responses in a fixed order, the ordering of the aioresponses calls is important. In an ideal world we would be able to key responses off of parts of the requests, but I haven't had time to make this happen.

At the time of writing I have not yet been able to test it in a dev environment - I've merely been working off of the spec at https://docs.google.com/document/d/1F3xYp6v4YLLHsX7zFkYoygDyvIUPzujxeMW7W4qte0o/edit?tab=t.0#heading=h.c29tth48xps4. Seeing as this is completely unused so far, I'm hoping to get what's here merged, and then follow up with more targeted pull requests as I fix issues that I encounter when testing in the real world.

bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Mar 18, 2025
…key names

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Mar 18, 2025
…key names

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Mar 18, 2025
…key names

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Mar 18, 2025
…key names

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Mar 18, 2025
…key names

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in mozilla-releng#1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
bhearsum added a commit that referenced this pull request Mar 18, 2025
…key names (#1147)

A few things going on here, mostly boring:
* Move the existing GithubClient to scriptworker_client
* Move the associated test fixtures to a new `pytest-scriptworker-client` package
* Necessary adjustments in treescript to let it be able to find and use the new packages

Although not ideal, this also includes a minor fix to allow slashes in GraphQL key names that I had already incorporated into the version of this code in #1135. I can pull it out if necessary, but I think it's probably small and safe enough to just keep in here?
@bhearsum bhearsum force-pushed the lando branch 3 times, most recently from 4c7c77b to 78d629c Compare March 20, 2025 01:22
@bhearsum bhearsum force-pushed the lando branch 4 times, most recently from ce7cc7a to 2e0da34 Compare April 2, 2025 17:19
@bhearsum bhearsum force-pushed the lando branch 2 times, most recently from a73c334 to 8d4bc25 Compare April 3, 2025 20:48
bhearsum added a commit to bhearsum/scriptworker that referenced this pull request Apr 14, 2025
@bhearsum bhearsum force-pushed the lando branch 2 times, most recently from c93d3ae to 39fdb66 Compare April 15, 2025 16:51
Copy link
Contributor

@ahal ahal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Did a first pass


def log_file_contents(contents):
for line in contents.splitlines():
log.info(line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why this is better than just logging contents directly. Having the logger preamble in front of each line just makes it harder to copy/paste imo.

Feel free to ignore this if this was just copy/pasted and you want to keep it as similar as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall now...my only goal is to make sure newlines are rendered. That can probably be done without the splitting though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I recall why we have this now: it's to ensure that we get one log line per line in the file, while ensuring each line is prefixed. (Arguably we could simply log with a leading newline to achieve a similar effect...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for clarity: I plan to leave this as is unless you feel it's preferable to log with a preceding newline.



async def poll_until_complete(session: ClientSession, poll_time: int, status_url: str):
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess the idea is that we poll until we hit the Taskcluster timeout?

I think we might be losing some of the idempotent-ness of the old approach here. It will certainly be possible for this task to fail due to timeout, but Lando could still land it afterwards. This means a failed task doesn't necessarily indicate a failed landing.

Do you know what would happen if we re-ran the task in such a case? For version bump, it might be worth ensuring that if the requested version is already what's committed, that we exit successfully. I'm not really sure what to do in the general case..

Another approach might be to timeout within the task, and abort the landing prior to failing. That way there's more of a guarantee that a failed landoscript task means the push didn't get landed. I don't think this needs to be fully solved for the MVP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, polling until we timeout is what is happening here, and that's a really good point you've made.

And to be honest, I'm not actually sure what the best thing to do here is. I think in the short term, attempting to cancel the request at shutdown for any non-success makes sense. But this is also less than ideal if, eg: we get spot terminated and have to redo the work because of it. There's also no guarantees that we'll be able to cancel the request in time in the event of a spot termination.

Another obvious option (for the future) would be to take the notarization approach, and do the polling in a separate task. But I also wonder if we even need to bother to poll in some cases? For example, l10n bumps don't have anything downstream of them...maybe we could just spit out the status URL, and leave it at that for cases like that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spot termination is not a concern for scriptworkers. We might get killed if we exceed terminationGracePeriodSeconds though, in which case we should still aim to preserve idempotence. It looks like we only ever have a single lando request, which hopefully is handled atomically at the other end (TIL git push has a --atomic option, nice, I'm only 9 years late to notice)? (I was wondering how this worked for merge day. Currently we have 2 hg pushes, e.g. once to tag central and another to merge to beta. I guess that goes away because we know they're the same underlying git repo and so it doesn't matter where the tag goes?)

We could try and cancel the submission on shutdown (bitrisescript seems to do something similar:

loop.add_signal_handler(signal.SIGTERM, handle_sigterm, future_group)
), and there should be a couple of minutes between SIGTERM and SIGKILL. I would be a bit worried about races between the attempt to cancel and the request landing (or maybe the cancel attempt itself times out, or ...).

So there's a few scenarios that could happen AFAICT:

  1. maybe the initial request didn't go through, we can rerun everything, all good
  2. maybe the initial request failed, or was canceled, we can safely retry.
  3. maybe the initial request succeeded before the rerun, we should detect that and not do anything
  4. maybe the initial request is still in flight. We could potentially find it and poll on it, or cancel it and retry? If we can't find it, we may have 2 requests in flight, hopefully only one can succeed?

In treescript we also had the case where one push went through but not the other, and the case where when the next run checked version numbers, the first push had gone through to hgssh but was not yet mirrored out to hgweb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional thoughts and strategies, @jcristau! Not needing to worry about spot termination certainly simplifies things, as we can make some safer assumptions.

I'm going to give this a bit more thought - but clearly we need some sort of special handling for the rerun scenario, if not separating polling altogether.

In treescript we also had the case where one push went through but not the other, and the case where when the next run checked version numbers, the first push had gone through to hgssh but was not yet mirrored out to hgweb.

If we're talking about two pushes to the same repository, this case will go away AFAIK - all changes to a given repository (which in the new world means branch within the new github repo) will either happen or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we have 2 hg pushes, e.g. once to tag central and another to merge to beta. I guess that goes away because we know they're the same underlying git repo and so it doesn't matter where the tag goes?)

Yeah, that's right. Although that does bring to mind something a bit odd about the API: tag actions are applied to a repository+branch - not to a repository overall. @cgsheeh - is this something you've thought about? (I don't think we necessarily need to change anything here, I'm just noting it because it's a bit odd.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about how to handle termination, the more it seems like moving polling out of the landoscript tasks might be the best thing to do. Something like:

  • Return immediately from landoscript after a successful submission; attaching the status url in a well enough artifact
  • Add a new task downstream of each landoscript task which does the polling. This doesn't need to be a scriptworker task AFAIK, because polling doesn't require credentials.

It would also avoid the need for any cancellation in landoscript, and thus any race conditions associated with that. If we trust that a landoscript task would never be rerun unless a human has inspected the state of things first, this would eliminate the need for special rerun handling in landoscript, I think. (Although perhaps it's too much a leap to assume it would never be errantly rerun, and we'd want at least some of the rerun safety checks?)

The main downside here is that if the polling task finds that the landoscript submission has failed, we'd need to rerun the upstream landoscript task, and then force a new polling task afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, I discovered after writing this that all Lando endpoints (including status polling) require authentication. So at least in the immediate term, moving polling outside of landoscript is not an option. (I guess we could poll in a separate landoscript task, but that probably doesn't provide any benefits...)

diff = ""
for l10n_file in l10n_files:
if l10n_file["dst_name"] not in orig_files:
log.warning(f"WEIRD: {l10n_file['dst_name']} not in dst_files, continuing anyways...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, I approve of this new log level.

fi
}

# TODO: real URLs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment just to help remember this.

Copy link
Contributor

@ahal ahal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm!

bhearsum added 13 commits April 17, 2025 19:57
…bump action

This adds the rough structure for landoscript as well as implementing the `version_bump` action (necessary to make it practical to test the initial code).
Most notably, this moves common set-up into conftest, and tests that aren't testing action-specific logic (eg: lando submission) into test_script.
The helper functions here are copied out of treescript (which will soon be EOL'ed).

Also included here is some minor refactoring to avoid duplication of common of `create-commit` logic.
Most of the helpers here are copied out of treescript, with some tweaks and simplications where it was possible (mostly to get rid of now-unnecessary logic).

Some refactoring of other actions was done here as well, to make it possible to call them from the `merge_day` action. Most notably: the `version_bump` action has been updated to support multiple version bumps in one run, which allows us to do all the merge day version bumps in a single commit, as we do now with treescript.

Additional test refactoring/movement was also done to make many of the helpers available to merge day tests.
…of the happy path tests

This new helper allows a huge simplication for most of the action-specific tests.
The implementation of this is a fairly significant departure from the treescript one in two ways:
* It supports these actions without having a Gecko or `android-l10n` tree available. This necessitated fetching the remote file listings of these repositories, and using `moz.l10n` instead of `compare-locales`.
* I've fully separated out implementation of the actions. Although at a very high level they look similar, the details are different enough that IMO it's much easier and better to duplicate some of the code rather than add the indirection to avoid it. (I found it very difficult to read the treescript implementation because of this, and I didn't want to do the same here).

A necessary part of doing this was some enhancements to `diff_contents` to support added and removed files properly.
…ation outside of l10n bump

As it turns out, these are static for all other types of actions.
…uests

Also ensure that LANDO_API and LANDO_TOKEN are set up during startup.
This required a little bit of massaging for places where there's nested dataclasses, but it was otherwise straightforward.

This change also uncovered some places where tests were including unnecessary data, which have been fixed. It also replaced some existing null checks (because the dataclass will throw if a required member is not present during construction).
@bhearsum
Copy link
Contributor Author

To help make additional changes easier to review I'm going to merge this initial patch. There are still a few things to address from comments here (most notably - how to handle unfinished lando submissions) that will come in follow-up PRs next week. Thanks to everyone who reviewed this giant patch so far!

@bhearsum bhearsum merged commit 118b709 into mozilla-releng:master Apr 18, 2025
30 checks passed
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 18, 2025
…ocessing any of them

This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 18, 2025
As pointed out by ahal in mozilla-releng#1135, we're using a new enough Python that we can use the built in parser.
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 21, 2025
…ocessing any of them

This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 21, 2025
As pointed out by ahal in mozilla-releng#1135, we're using a new enough Python that we can use the built in parser.
bhearsum added a commit that referenced this pull request Apr 21, 2025
As pointed out by ahal in #1135, we're using a new enough Python that we can use the built in parser.
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 22, 2025
…ocessing any of them

This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
bhearsum added a commit to bhearsum/scriptworker-scripts that referenced this pull request Apr 22, 2025
…ocessing any of them

This is a follow to address something [@ahal called out in the initial review of landoscript](mozilla-releng#1135 (comment)).
bhearsum added a commit that referenced this pull request Apr 22, 2025
…ocessing any of them (#1165)

This is a follow to address something [@ahal called out in the initial review of landoscript](#1135 (comment)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants