Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port all binary build tooling to s5cmd, add sync tests, simplify removals #716

Merged
merged 11 commits into from
May 29, 2024

Conversation

dzuelke
Copy link
Contributor

@dzuelke dzuelke commented May 23, 2024

s5cmd can execute operations in parallel, which will massively speed up our repo generation, syncing, and removal scripts:

Before (actual mkrepo step took 2 minutes 15 seconds) and after (actual mkrepo step took 3 seconds):
Screenshot 2024-05-28 at 13 05 46       ➡️  Screenshot 2024-05-28 at 13 06 03

Before (actual sync dry-run step took 5 minutes 4 seconds) and after (actual sync dry-run step took 4 seconds):
Screenshot 2024-05-23 at 16 34 29  ➡️  Screenshot 2024-05-23 at 16 34 39

GUS-W-15794884


For syncing, we now generate a "sync plan", from which operations that can be grouped together are then extracted, and executed in parallel using s5cmd run, in three groups:

  1. all copies of new/changed dist archives - if anything fails during these, a re-run of the sync will perform them again;
  2. all copies of new/changed manifests, as well as removals - if anything fails during these, only the failed ones will be performed again, including their dists;
  3. all dist archives of removed manifests - worst case, there will be some stray dist archives left behind, but nothing picks them up because their manifests are gone.

GUS-W-15794884


As the "sync plan" is now generated separately by a script, we can have some tests that assert the operations are correct, dist archive URL rewrites from source to destination bucket happen properly, etc.

GUS-W-15794893


There are now local source and destination dir options for sync.sh, which will use the given directory of .composer.json manifests, instead of downloading them from the source or destination bucket.

What we can now do is, upon removal, fetch all manifests from the repo, except the given list/wildcards of manifests to remove, and then tell sync.sh to use that filtered directory of .composer.json files as the local source directory for syncing.

This moves all of the sanity checks, file deletion, error handling, confirmation etc logic into sync.sh, and lets us drop most of the logic in remove.sh. In addition, the core "sync plan" code of sync.sh is also covered by tests, which now extend to remove.sh.

GUS-W-15794899


Finally, there is currently a bit of a bug in remove.sh: when removing all files from a repository (e.g. for certain cleanups, testing, etc), there are no more .composer.json files left behind that the code can match, and the subsequent call to mkrepo.sh to re-generate the updated repository fails (as there are no files to pass to it), instead of either producing an empty repository file, or removing the repository file altogether.

Instead, the tooling now detects this situation, and removes packages.json as well, so that repositories can be fully removed.

GUS-W-15838870

@dzuelke dzuelke requested a review from a team as a code owner May 23, 2024 18:50
@dzuelke dzuelke force-pushed the s5cmd branch 2 times, most recently from 8916896 to 0033853 Compare May 23, 2024 20:16
It can execute operations in parallel, which will massively speed up our repo generation, syncing, and removal scripts.

CI needs it, too, since we do not run that inside a build Dockerfile, but there are tests for the build tooling.

GUS-W-15794884
New bash quoting for echo-ing the command, too.

GUS-W-15794884
The args a weird - other scripts, like deploy.sh, use only the env vars. It's more consistent this way.

The bucket region is always returned by the "central" endpoint, even on a 403 response, so we can use that to auto-detect it if necessary.
We now first generate a "sync plan", from which operations that can be grouped together are then extracted, and executed in parallel using `s5cmd run`.

First, all copies of new/changed dist archives - if anything fails during these, a re-run of the sync will perform them again.

Second, all copies of new/changed manifests, as well as removals - if anything fails during these, only the failed ones will be performed again, including their dists.

Finally, all dist archives of removed manifests - worst case, there will be some stray dist archives left behind, but nothing picks them up because their manifests are gone.

GUS-W-15794884
Now that we're generating a "sync plan" for execution via `s5cmd run`, we can have some tests that assert the operations are correct, dist archive URL rewrites from source to destination bucket happen properly, etc.

GUS-W-15794893
There are now local source and destination dir options for `sync.sh`, which will use the given directory of `.composer.json` manifests, instead of downloading them from the source or destination bucket.

What we can now do is, upon removal, fetch all manifests from the repo, except the given list/wildcards of manifests to remove, and then tell `sync.sh` to use that filtered directory of `.composer.json` files as the local source directory for syncing.

This moves all of the sanity checks, file deletion, error handling, confirmation etc logic into `sync.sh`, and lets us drop most of the logic in `remove.sh`. In addition, the core "sync plan" code of `sync.sh` is also covered by tests, which now extend to `remove.sh`.

GUS-W-15794899
All scripts moved to s5cmd now

GUS-W-15794884
Now show the full sync output when really syncing, and only the sync.sh job summary (without the new-ish 'POTENTIALLY DESTRUCTIVE ACTION' warning at the end) when performing a dry-run, or when printing for info purposes at the end of a build
Not critical, but needs wheel installed so the bdist_wheel command during docopt (a bob-builder dependency) install works in the venv (making it a setup_requires dependency in bob-builder doesn't cut it).
@dzuelke dzuelke merged commit 2485a3e into main May 29, 2024
3 checks passed
@dzuelke dzuelke deleted the s5cmd branch May 29, 2024 16:10
@heroku-linguist heroku-linguist bot mentioned this pull request May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants