
buildmaster needs to leverage object storage #258

Open
kallisti5 opened this issue Jun 15, 2023 · 23 comments

@kallisti5
Member

Constraints:

Considerations:

  • Persistent volumes are expensive. Object storage is cheap

Volumes today:

  • buildmaster-sources (shared by every buildmaster)
  • buildmaster-packages (shared by every buildmaster and frontend)
  • buildmaster-x86-gcc2
  • buildmaster-x86-64

We might be able to combine the buildmaster work volumes into one big shared pool of disk... but it really starts to shift this stuff back to a big pile of spaghetti. (However, this doesn't solve the high cost of directly attached storage either.)

@kallisti5
Member Author

This is becoming a cost and sizing difficulty as we start to look at builders for arm64 and riscv64.

@kallisti5
Member Author

kallisti5 commented Jun 15, 2023

Solutions I've looked at before:

  • buildmaster-packages to object storage

  • buildmaster relies heavily on local files and symlinks. I've looked at adding additional abstraction layers (referencing file:///, s3:///, etc.); however, it's almost a rewrite. A rough sketch of such a layer follows this list.
  • buildmaster-sources to object storage

    • buildmaster just runs in a loop and does a git clone, pull, etc. to monitor for changes. Not really a candidate for s3.
  • buildmaster-ARCH

    • Lots of local file references throughout buildmaster; a heavy refactor/rewrite.
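
For illustration, the abstraction layer mentioned in the first bullet might look roughly like this. This is a sketch only, not HaikuPorter code: open_package is a made-up name, boto3 is an assumed dependency, and an s3://bucket/key layout is assumed.

```python
# Hypothetical sketch of a URI-based file abstraction; not HaikuPorter code.
from urllib.parse import urlparse

def open_package(uri):
    """Return a readable file-like object for a file:// or s3:// package URI."""
    parsed = urlparse(uri)
    if parsed.scheme in ("", "file"):
        return open(parsed.path, "rb")
    if parsed.scheme == "s3":
        import boto3  # assumed client library
        response = boto3.client("s3").get_object(
            Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        return response["Body"]  # streaming body, read() like a file
    raise ValueError("unsupported scheme: " + parsed.scheme)
```

The hard part is not this helper but the many call sites that expect real local paths, symlinks, and directory listings, which is why it approaches a rewrite.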

Nothing is easy here. buildmaster is haikuporter running in a loop with multiple layers of shell scripts. It's a complete mess. I'm trying to stay positive and look for solutions rather than be grumpy about this... but the issues we suffer today are all direct results of the popular choices we made historically. We assumed a LOT of tech-debt going with buildmaster.

@mmlr
Member

mmlr commented Jun 15, 2023

Look, I have a bit of an issue with the tone here. With that kind of attitude it is pretty hard to bring up the motivation to do anything about the actual issues. I will still try to lay out my view of what is currently going on and how this could be migrated to an object storage based setup.

To clear some things up first: There is a grand total of 2 bash scripts that run the buildmaster setup. The buildmaster script itself is a wrapper of HaikuPorter that handles all the tasks that encompass an individual buildrun that go beyond just package building, for example making the logs available for the current web frontend. The loop script manually polls the HaikuPorts git repository and triggers buildruns by calling the buildmaster wrapper script. Could both things be merged into a single script or could it be inlined into the HaikuPorter Python code? Sure; but it was actually built this way for a reason. It separates out the individual tasks so that they can be replaced by other methods of handling these things. The loop was meant to eventually be replaced by some form of CI or hook system that would trigger buildruns on repo push instead of just blindly polling the git repo. And the buildmaster script could be replaced with some other frontend that does more clever stuff with the outputs of HaikuPorter. As it turns out, these scripts have been good enough for a long time now and no one bothered to actually write something better that uses the modularity of the setup. While the shell scripts may not be the most elegant solution, they are reasonably well structured and do get the job done. Calling the entire solution "a complete mess" because of two shell scripts that call each other is just uncalled for IMO.

As I see it, the main issue is block storage and its associated cost. So where does the need for that storage actually come from?

What HaikuPorter needs to resolve dependencies are the package infos of the available packages and the DependencyInfo files generated from the recipe repository. To get the package infos, the native Haiku package tool is run on the package files to extract the needed info. As it turns out, calling a tool on a lot of files over and over is pretty slow, so a cache for that info has helpfully been implemented. This means that usually there is only a transient need for the package files to be available locally, namely when they are initially built and added to the repo. After that, HaikuPorter could absolutely run purely from the package info cache. Right now that would not work, because the cache is also pruned based on the package files becoming unavailable, but that can easily be changed to instead be based on the existing package obsoletion mechanism that also manages package removal for the repository.
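
To illustrate the proposed change in a few lines (a sketch under assumed names: hpkgInfoCache here is treated as a plain dict and the storage backend is hypothetical):

```python
# Sketch: prune the package info cache from the obsoletion pass instead of
# when a package file disappears. Names are illustrative, not HaikuPorter API.
def obsoletePackages(obsoletedPackageNames, hpkgInfoCache, packageStorage):
    for name in obsoletedPackageNames:
        packageStorage.remove(name)    # drop the package from the repository
        hpkgInfoCache.pop(name, None)  # prune the cached info here, not on
                                       # package file disappearance
```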

There is also another user of the package files, the native package_repo tool that is used to create the package repository file. Since it is very inefficient to redo all of the repo every time a package is updated, that tool has also been built so that it can be run incrementally. So repo building only strictly needs access to a set of packages when initially creating the repo. After that, HaikuPorter calls it with the update command that takes an existing repo file (including all the package checksums and package infos) and writes a new repo with that info plus updated packages based on a package list file. That means that, for this step too, the package files only really need to be there when adding freshly built ones. Right now the package list is built based on the local package repository, so that would need to be rewritten to either work off of a cached package list or some other package enumeration method.
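
Based on the behaviour described above, the incremental update boils down to an invocation along these lines (a hedged sketch; the exact command line of the native package_repo tool may differ):

```python
# Sketch of the incremental repo update step; the package_repo syntax is
# inferred from the description above and may differ in detail.
import subprocess

def update_repo(oldRepoFile, newRepoFile, packageListFile, packagesDir):
    # "update" reuses checksums and package infos from the old repo file
    # and only processes packages named in the package list file
    subprocess.run(
        ["package_repo", "update", oldRepoFile, newRepoFile, packageListFile],
        cwd=packagesDir, check=True)
```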

Then there is the need for providing build dependencies to the builders. Right now the BuildMaster simply pushes package files from its local package storage to the builders when they aren't already present in the package cache. Obviously this could trivially be changed to pull the packages from an object storage URL instead.
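
On the builder side, that change could be as small as this sketch (the base URL and cache layout are assumptions):

```python
# Sketch: a builder pulling a missing dependency package from object
# storage instead of having it pushed over sftp. URL layout is assumed.
import os
import urllib.request

def ensure_package(baseURL, packageFileName, packageCacheDir):
    target = os.path.join(packageCacheDir, packageFileName)
    if not os.path.exists(target):
        urllib.request.urlretrieve(baseURL + "/" + packageFileName, target)
    return target
```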

So overall, pretty much all of this could work with very little local block storage, just enough to keep the repository state and caches and (transiently) for the freshly built packages. I do not see a need for fundamental changes to how this setup works just to make this possible. The above mentioned places need to be adjusted to not assume a locally attached package file directory and the current method of making package files available, namely them being hardlinked into that directory, needs to be replaced by an upload to object storage.

BTW: The current web frontend was specifically built so that it could run purely from object storage. Pushing the logs there instead of writing them out to local files simply has never been implemented. Also, the buildrun architecture itself has been set up this way to allow for dynamically spinning up and down builders instead of requiring them to stay online while idle. That also has never been implemented, mainly due to using a VM hoster where non-running VMs still incur cost. But it's not exactly fair to just assume that it was all built without any forethought.

I do feel a lot of hostility towards the current solution and if that is enough to warrant throwing everything out and building something new, then so be it. But that is another discussion and has nothing to do with the technical problems that are presented here. As the person who built much of the current setup and having sponsored, maintained and even recently upgraded (doubling the cost) the only builders we've used for over 4 years now, the hostility is hurtful.

@kallisti5
Member Author

Tone

> Look, I have a bit of an issue with the tone here. With that kind of attitude it is pretty hard to bring up the motivation to do anything about the actual issues. I will still try to lay out my view of what is currently going on and how this could be migrated to an object storage based setup.

None of this was directed to you as an individual. There are no owners here (me included). I've been an asshole in the past given the history; however, I'm really just trying to solve the biggest thorn in our infrastructure. I'm pretty much constantly frustrated at this stuff because I'm the only one dealing with the maintenance difficulty it causes and can't attract anyone to look at it. You're constantly hearing from me about it because I'm the only person maintaining it.

For the last 1-2 years I've been trying to stay positive about this stuff to attract people to work on it... that's gone nowhere.

With all of that said, I do greatly appreciate you maintaining the two worker nodes over the last few years for this stuff. Your workers have built tens of thousands of HaikuPorts packages.

Solutions

> As I see it, the main issue is block storage and its associated cost. So where does the need for that storage actually come from?

Correct. This is the most pressing issue. Locally attached storage is the biggest problem we run into. buildmaster is the last thing we have that drags around a bunch of complex filesystem attachments shared between multiple pods. Getting these repos onto s3 storage like the rest of our infrastructure would save substantial costs and reduce a lot of maintenance time and risk.

I might be able to consolidate the individual architecture volume attachments to a single shared attachment. This will solve some of the drama around the number of volume attachments.

> What HaikuPorter needs to resolve dependencies are the package infos of the available packages and the DependencyInfo files generated from the recipe repository. To get the package infos, the native Haiku package tool is run on the package files to extract the needed info. As it turns out, calling a tool on a lot of files over and over is pretty slow, so a cache for that info has helpfully been implemented. This means that usually there is only a transient need for the package files to be available locally, namely when they are initially built and added to the repo. After that, HaikuPorter could absolutely run purely from the package info cache. Right now that would not work, because the cache is also pruned based on the package files becoming unavailable, but that can easily be changed to instead be based on the existing package obsoletion mechanism that also manages package removal for the repository.

Correct. I started looking at implementing various abstraction layers (mostly passing around file references as URIs, e.g. s3:///repo/file/etc, file:///local/file/etc); however, I quickly found myself looking at core haikuporter code changes adding s3 etc. support (which got weird quickly).

Heck, here's a quick snippet of a diff from a local branch I have:
https://gist.github.com/kallisti5/e0c90960c073b597ff46ba39b0f956e3

> There is also another user of the package files, the native package_repo tool that is used to create the package repository file. Since it is very inefficient to redo all of the repo every time a package is updated, that tool has also been built so that it can be run incrementally. So repo building only strictly needs access to a set of packages when initially creating the repo. After that, HaikuPorter calls it with the update command that takes an existing repo file (including all the package checksums and package infos) and writes a new repo with that info plus updated packages based on a package list file. That means that, for this step too, the package files only really need to be there when adding freshly built ones. Right now the package list is built based on the local package repository, so that would need to be rewritten to either work off of a cached package list or some other package enumeration method.

Correct. package_repo and the limited "outside of Haiku" code support mean we need all the packages locally present to generate a new package repo. This is a bigger problem (unless the ability to "update" (and re-sign) a repo file is added, like you said).

> Then there is the need for providing build dependencies to the builders. Right now the BuildMaster simply pushes package files from its local package storage to the builders when they aren't already present in the package cache. Obviously this could trivially be changed to pull the packages from an object storage URL instead.

Correct. This would be a nice-to-have and would further reduce the stuff hidden away on persistent volumes.

> So overall, pretty much all of this could work with very little local block storage, just enough to keep the repository state and caches and (transiently) for the freshly built packages. I do not see a need for fundamental changes to how this setup works just to make this possible. The above mentioned places need to be adjusted to not assume a locally attached package file directory and the current method of making package files available, namely them being hardlinked into that directory, needs to be replaced by an upload to object storage.

All correct. There are a lot of blockers to doing any of this work, and a substantial amount of potentially risky changes to core haikuporter logic.

@mmlr
Member

mmlr commented Jun 16, 2023

> > As I see it, the main issue is block storage and its associated cost. So where does the need for that storage actually come from?

> Correct. This is the most pressing issue. Locally attached storage is the biggest problem we run into. buildmaster is the last thing we have that drags around a bunch of complex filesystem attachments shared between multiple pods. Getting these repos onto s3 storage like the rest of our infrastructure would save substantial costs and reduce a lot of maintenance time and risk.

The shared volumes were purely done to reduce duplication. There was no need to have the source git repositories duplicated, so they were shared. Back then, the most trivial way to get everything up and running in a buildmaster instance was to simply build a Haiku image for the target architecture. All the initial packages are built or downloaded and the package tools are built as a side effect. That is what the bootstrap script did. In a container setup, this can all be done as part of the container build. You already did implement that, at least partially. The only thing HaikuPorter uses from the source volume at runtime should be the licenses shipped with Haiku, and those can easily be placed in the container image at build time as well. So the Haiku and HaikuPorter sources are not needed and that probably gets rid of the source volume.

> I might be able to consolidate the individual architecture volume attachments to a single shared attachment. This will solve some of the drama around the number of volume attachments.

You mean the architecture specific volumes? But once there is only local HaikuPorts state and no actual package files, these can simply be kept in a normal persistent volume. The amount of data should be in the megabytes.

> > What HaikuPorter needs to resolve dependencies are the package infos of the available packages and the DependencyInfo files generated from the recipe repository. To get the package infos, the native Haiku package tool is run on the package files to extract the needed info. As it turns out, calling a tool on a lot of files over and over is pretty slow, so a cache for that info has helpfully been implemented. This means that usually there is only a transient need for the package files to be available locally, namely when they are initially built and added to the repo. After that, HaikuPorter could absolutely run purely from the package info cache. Right now that would not work, because the cache is also pruned based on the package files becoming unavailable, but that can easily be changed to instead be based on the existing package obsoletion mechanism that also manages package removal for the repository.

> Correct. I started looking at implementing various abstraction layers (mostly passing around file references as URIs, e.g. s3:///repo/file/etc, file:///local/file/etc); however, I quickly found myself looking at core haikuporter code changes adding s3 etc. support (which got weird quickly).

> Heck, here's a quick snippet of a diff from a local branch I have: https://gist.github.com/kallisti5/e0c90960c073b597ff46ba39b0f956e3

I don't yet understand why this would be needed. The Repository class you reference here is the HaikuPorter internal repository, where the DependencyInfos and caches reside. This should be relatively little data and should just use a normal volume. The package repository, where all the package files are managed, is handled in PackageRepository.

> Correct. package_repo and the limited "outside of Haiku" code support mean we need all the packages locally present to generate a new package repo. This is a bigger problem (unless the ability to "update" (and re-sign) a repo file is added, like you said).

The native package_repo tool already supports this form of updating in principle; it currently just reads the package info from all package files to check if they have been updated compared to the package info versions stored in the repo file. This can be solved easily by allowing a list of "to be kept" package names to be provided, which then simply get re-added to the new repo file without checking the version.

Actually, it can be made to work right now, without changes to the package_repo tool, by creating a stub build package which only retains the .PackageInfo of the actual package file and using that as a placeholder. Repo updates work this way without having the actual packages present; I just tested that.

> > So overall, pretty much all of this could work with very little local block storage, just enough to keep the repository state and caches and (transiently) for the freshly built packages. I do not see a need for fundamental changes to how this setup works just to make this possible. The above mentioned places need to be adjusted to not assume a locally attached package file directory and the current method of making package files available, namely them being hardlinked into that directory, needs to be replaced by an upload to object storage.

> All correct. There are a lot of blockers to doing any of this work, and a substantial amount of potentially risky changes to core haikuporter logic.

I am apparently missing something here, because I don't see any blockers to work being done and no core HaikuPorter logic changes. My plan would be:

  • Let the hpkgInfoCache be pruned by the package obsoletion process instead of inline and based on package file presence
  • Have the package file list for package_repo update be a cached file instead of building it from a local directory listing
  • Add a "to be kept packages list" option to the package_repo tool and feed it from the above
  • Have the PackageRepository class push files to object storage instead of linking them to a directory
  • Have the PackageRepository class delete obsolete files from object storage instead of unlinking local files

Once the hpkgInfoCache does not automatically prune infos when a package file goes away, it can be used instead of the actual package files from the HaikuPorter side and they are free to move to object storage. This change is neutral, as the package obsoletion and the current inline hpkgInfoCache pruning essentially run at the same time anyway. And once package_repo can be told to simply keep a list of packages without looking for updates, there is no need for local packages on the repo side of things either. The rest is factoring out the storage backend in PackageRepository to a local one that manages a package directory and a new remote one managing objects in object storage (really only two operations, put stuff there and remove stuff from there again based on an object name).
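
A rough sketch of that factoring (class and method names are illustrative, and an S3-style client is assumed for the remote backend):

```python
# Illustrative storage backend split; not existing HaikuPorter code.
import os
import shutil

class LocalPackageStorage:
    """Current behaviour: packages live in a local repository directory."""
    def __init__(self, directory):
        self.directory = directory

    def put(self, sourcePath, name):
        shutil.copy(sourcePath, os.path.join(self.directory, name))

    def remove(self, name):
        os.unlink(os.path.join(self.directory, name))

class ObjectPackageStorage:
    """New behaviour: packages live in an object storage bucket."""
    def __init__(self, bucket):
        import boto3  # assumed client library
        self.client = boto3.client("s3")
        self.bucket = bucket

    def put(self, sourcePath, name):
        self.client.upload_file(sourcePath, self.bucket, name)

    def remove(self, name):
        self.client.delete_object(Bucket=self.bucket, Key=name)
```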

All of this can be implemented right now without breaking anything as far as I'm concerned.

@SamuraiCrow

SamuraiCrow commented Jun 16, 2023

Would a Docker image containing the necessary cross-development infrastructure be sufficient to implement the CI packaging? Also, would the packages be hosted on individual GitHub project releases, or, if possible, could other non-GitHub repos also be used? Or is it more work to mirror packages on other hosting sites like GitLab or Atlassian?

I can only assume I'm missing something due to the terminology used. Please spell it out plainly.

@pulkomandy
Member

> None of this was directed to you as an individual.

It is still counter productive.

From reading your messages, the code is "a giant mess of spaghetti" and half of haikuporter needs to be rewritten. Surely no one will feel motivated to dig into that (and then be yelled at by the grumpy sysadmin).

From reading mmlr's messages, it's "just two bash scripts" and the haikuporter core does not need to be touched at all. This seems a lot more reasonable to me. And we also get an outline of what needs to be changed.

Sure, the reality is probably somewhere in the middle, but attacking it from the optimistic side will surely mean people will want to hack on it and see what they can do :)

> Would a Docker image containing the necessary cross-development infrastructure be sufficient to implement the CI packaging?

The infrastructure is already up and running. It is not on GitHub but on Haiku's own servers, and there is no intention to move it to GitHub or anywhere else. This discussion is only about the internals of that architecture: how we store the hpkg files, how we get the package repositories to serve them, and how the haikuports buildmaster (a shell script that decides what needs to be rebuilt from haikuports recipes and which builder should do it) keeps up to date on it (it needs to know which packages are there, and then upload the newly built packages to that space).

kallisti5 has been very unhappy with the current approach that needs a shell script that's continuously running (it's a loop that waits for new commits in haikuports and runs builds for all the updated recipes). As mmlr said, the intent was to replace this with some continuous integration, but no one has done it. Maybe it could be done on GitHub (since haikuports is there), or maybe it could be done on Haiku's Concourse-CI (since Haiku is managing all the other parts of this infrastructure). The bash script can be replaced by something that is triggered by git pushes to the haikuports repository and runs the same steps that the bash loop currently does.

The other script is the one that actually schedules the builds and updates the packages in the repositories. Currently it needs access to several files to compute what needs to be built and schedule things in the right order (as updated packages can depend on other updated packages). This is, if I understand correctly, the part that currently needs access to an archive of all hpkg files, and the question is how to reduce that access to a more limited/smaller set of files (what mmlr suggests), or to use S3 storage (accessible through REST APIs) instead of having the script expect to find the hpkg files in some directory on the server (what kallisti5 suggests).

@SamuraiCrow

SamuraiCrow commented Jun 16, 2023

Ok! I know and understand that polling loops are not efficient. I also know a few things about GitHub CI. The second script may need to be put on hold until the CI conversion is complete so the number of variables can be kept at a sane level.

Note: I may not get to this until tomorrow afternoon. I'm at my sister's and away from keyboard at the moment.

@kallisti5
Member Author

kallisti5 commented Jun 16, 2023

> kallisti5 has been very unhappy with the current approach that needs a shell script that's continuously running (it's a loop that waits for new commits in haikuports and runs builds for all the updated recipes). As mmlr said, the intent was to replace this with some continuous integration, but no one has done it. Maybe it could be done on GitHub (since haikuports is there), or maybe it could be done on Haiku's Concourse-CI (since Haiku is managing all the other parts of this infrastructure). The bash script can be replaced by something that is triggered by git pushes to the haikuports repository and runs the same steps that the bash loop currently does.

Correct. The difficulty with this approach, though, is that wrapping haikuporter buildmaster's invocation in some larger CI/CD system means you're designing a system which directly depends on locally executing the haikuporter buildmaster Python scripts on our infrastructure, which then reach out to remote builders and run more Python.

You could start "grabbing local artifacts and pushing them places" from a larger wrapper CI/CD system and it may work... however, it honestly makes the tech-debt ball bigger. Chaining together shell scripts and Python is not building robust infrastructure. I'm not "just being grumpy here"; we deserve nice things. We have a large pool of talented developers (present company included).

This is my opinion on an ideal architecture. It's the standard microservice architecture. Same opinions I had 5 years ago.

  • Buildmaster runs on our infrastructure as a daemon
  • It offers API mechanisms for developers to trigger builds, copy haikuports packages into build-packages etc.
  • Live status reporting
  • It monitors git
    • It reaches out to Haiku workers, runs haikuporter (or whatever)
  • Supports storing resulting artifacts and repos in places other than local files.

That would solve a huge portion of our issues. We could authenticate users for API access, etc.
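
Purely as an illustration of the shape such a daemon could take (Flask is an arbitrary choice here; the endpoints, payloads, and missing authentication are assumptions, not a design):

```python
# Illustrative only: rough shape of a buildmaster daemon with a web API.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/buildruns")
def trigger_buildrun():
    # e.g. {"architecture": "x86_64", "ports": ["llvm"]}
    job = request.get_json()
    # ...authenticate, validate, enqueue a buildrun...
    return jsonify({"status": "queued", "job": job}), 202

@app.get("/status")
def status():
    # live status reporting would aggregate builder/buildrun state here
    return jsonify({"builders": [], "currentBuildrun": None})

if __name__ == "__main__":
    app.run()
```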

Y'all know I hate Java too, but I haven't complained about HaikuDepotServer once. The reason is that it is reasonably architected, easy to maintain, doesn't eat up a massive amount of infrastructure, and has been stable.

@SamuraiCrow I appreciate you stepping up. Let me know if I can help you in any way.

@kallisti5
Member Author

> mmlr commented 13 hours ago
>
> I am apparently missing something here, because I don't see any blockers to work being done and no core HaikuPorter logic changes. My plan would be:
>
> […]
>
> Once the hpkgInfoCache does not automatically prune infos when a package file goes away, it can be used instead of the actual package files from the HaikuPorter side and they are free to move to object storage. This change is neutral, as the package obsoletion and the current inline hpkgInfoCache pruning essentially run at the same time anyway. And once package_repo can be told to simply keep a list of packages without looking for updates, there is no need for local packages on the repo side of things either. The rest is factoring out the storage backend in PackageRepository to a local one that manages a package directory and a new remote one managing objects in object storage (really only two operations, put stuff there and remove stuff from there again based on an object name).
>
> All of this can be implemented right now without breaking anything as far as I'm concerned.

This all sounds great. I'll help in any way I can.

@pulkomandy
Member

> This is my opinion on an ideal architecture.

I am not at all a "microservice" developer, and while this may be ideal, I think the way the system behaves currently (ignoring how it is implemented) is quite sufficient. I don't need an API. I just want something that builds packages when recipes are pushed to a git repository.

If you ask me, this is what a CI/CD system would do. That usually does not run microservices, it's just some kind of script (bash or otherwise, doesn't matter) that is run whenever something is pushed to the git repository.

So, to me, your microservices approach is "let's reinvent CI/CD, but more complicated". Just like the current approach is a bit "let's reinvent CI/CD, but in bash".

The way I see things, the job is quite simple:

  • We have builders running Haiku, to which we must connect using ssh to run haikuporter. I don't think we need to change this
  • We have a git repository which should trigger actions whenever something is pushed to it (nothing new or very unusual here)
  • We have a storage space for our haiku packages that is served over HTTP (that is also a quite common thing to do)
  • Currently we have two shell scripts to glue these things together. The main part of this is: when a recipe is changed on the git repository, send it to the builders, get the resulting hpkg file from the builder, and upload that to the http storage space. How hard can that be? It sounds exactly like the kind of thing I would use a simple script for. As mmlr already said, it would make sense for said script to be triggered by some CI/CD instead of the "loop" script, and indeed that loop was intentionally written as a separate script to make that easy.

From this point of view it is all quite simple and familiar tools, and I don't see what S3 and microservices and APIs and all these things have to do with it.

I may very well be missing a lot of details on how the internals are implemented (the build-packages stuff, how populating the haikuports package repository is handled, and the caching that mmlr mentioned). Maybe there are good reasons for doing a thing with microservices and S3 and all that, but really, I have no idea about anything in that tech space. Whereas "let's replace a bash script with a proper CI/CD system" (say, buildbot or gitlab actions or whatever) is something I can understand.

So, what am I missing here?

@kallisti5
Member Author

kallisti5 commented Jun 16, 2023

> So, what am I missing here?

Here's the basics:

  • If multiple applications need access to a single, locally mounted volume, all applications need to be scheduled on the same physical node unless an RWX storage provider like NFS is used.
    • We ensure all haikuports builders are scheduled on the same physical node by keeping them in a pod (easy, no big deal)
    • Now, something has to serve up these artifacts via http... we have to make sure it also runs on the same physical node as the haikuports buildmaster.
  • Now, we have various processes which also want access to haikuports packages (the loading dock, currently not deployed, which collects new build-packages from developers, etc.). These are all disabled for the reason below.

Given the above, we have the following restricted to one node:

  • haikuports frontend
  • haikuporter buildmaster x86_64
  • haikuporter buildmaster x86_gcc2
  • haikuporter hpkg web server

Now, we also have 4 unique volume attachments:

  • haikuporter sources
  • haikuporter x86_64
  • haikuporter x86_gcc2
  • haikuporter packages

We're limited to 7 volume attachments per physical node.

So, this all means:

  • We have rather large nodes under-utilized since they're becoming dedicated to buildmaster's volumes
    • We have 3 volume mounts left available. (mount riscv64 and it's now 2 mounts. Mount a volume to receive uploads from the loading dock thing to receive build packages and it's now 0 mounts)
  • We can't schedule any other workloads on these nodes since they'll also push us over the mount limit.
  • We can't spread these out across our infrastructure since they have to be grouped together due to the shared access.
  • We need to grow our directly attached storage, which is $1 per GiB per month (vs. our s3, which is ~$6 per TiB per month). At 1 TiB, that's roughly $1024/month for block storage against ~$6/month for object storage.

Solutions:

  • I can work to refactor the buildmaster to consume fewer volumes. That will help reduce the 4+ mounts to ~1-2.
  • However, at $1 per GiB per month... if we start wanting to do haikuports mainline, release branches, multiple architectures... the costs begin to spiral upward.

@mmlr
Member

mmlr commented Jun 16, 2023

> kallisti5 has been very unhappy with the current approach that needs a shell script that's continuously running (it's a loop that waits for new commits in haikuports and runs builds for all the updated recipes). As mmlr said, the intent was to replace this with some continuous integration, but no one has done it. Maybe it could be done on GitHub (since haikuports is there), or maybe it could be done on Haiku's Concourse-CI (since Haiku is managing all the other parts of this infrastructure). The bash script can be replaced by something that is triggered by git pushes to the haikuports repository and runs the same steps that the bash loop currently does.

Yes, that can totally be done, but that is not the actual pain point or what this issue is about. Also, TBH, I don't think it makes much of a difference. Blindly pulling a git repository once a minute and looking for changes isn't exactly great, but it's also not really a problem. It does not cost us anything more than a microservice that listens for an API call would, there are not any more or fewer containers running because of it, and the CPU load it produces is negligible. The only thing that would change here is the signaling, i.e. it would wait for an external trigger instead of polling. But again, this is somewhat off topic for the discussion about leveraging object storage for the package file repositories. It only came up as part of clarifying the current setup.

> Correct. The difficulty with this approach, though, is that wrapping haikuporter buildmaster's invocation in some larger CI/CD system means you're designing a system which directly depends on locally executing the haikuporter buildmaster Python scripts on our infrastructure, which then reach out to remote builders and run more Python.

And how else would it work? At some point something that knows about recipes, HPKGs and dependencies needs to run somewhere and needs to instruct something else that is necessarily on separate instances, i.e. VMs actually running Haiku for the target architecture, to do something. Buildbot or some other tool cannot replace the middle part, it can only replace the invocation of it and possibly handle its output differently.

> You could start "grabbing local artifacts and pushing them places" from a larger wrapper CI/CD system and it may work... however, it honestly makes the tech-debt ball bigger. Chaining together shell scripts and Python is not building robust infrastructure. I'm not "just being grumpy here"; we deserve nice things. We have a large pool of talented developers (present company included).

Why does it bother you so much that the wrappers are written as a bash script? It does not, at all, matter. It could just as well have been some more Python code as part of HaikuPorter. I explicitly decided against that because to me, moving around files and invoking commands is more trivial to do in a shell script than by using subprocess. Go ahead and rewrite it in Python, C++ or Rust, but the code will most definitely not be any cleaner, because the task it executes is just well suited for a shell script.

Many things are shell scripts, especially many things that wrap other things. I just looked at the file types in my /usr/bin and a good 10% of the 3k commands in there are indeed shell scripts. They range from one-liner convenience wrappers to things like xdg-open, which is 1k lines of shell script and comes packaged with a man page and everything. Are these tools now suddenly all inferior just because they are written in a script language?

For context, these are the two scripts we are talking about, a total of less than 300 lines of code:

https://github.com/haikuports/haikuporter/blob/master/buildmaster/backend/assets/loop
https://github.com/haikuports/haikuporter/blob/master/buildmaster/backend/assets/bin/buildmaster

> This is my opinion on an ideal architecture. It's the standard microservice architecture. Same opinions I had 5 years ago.
>
>   • Buildmaster runs on our infrastructure as a daemon
>   • It offers API mechanisms for developers to trigger builds, copy haikuports packages into build-packages etc.
>   • Live status reporting
>   • It monitors git

So basically the same as it does now, just with a web-based API instead of a file-based one. I'm not against that, and the separation that it has with the loop and buildmaster scripts actually allows writing such a new solution without having to first rip something out of HaikuPorter.

>   • It reaches out to Haiku workers, runs haikuporter (or whatever)

That is something I would leave with HaikuPorter itself, as part of the buildrun logic as it currently is. Because HaikuPorter running HaikuPorter remotely allows for reusing specific knowledge about how HaikuPorter works and what it needs.

>   • Supports storing resulting artifacts and repos in places other than local files.

And that's what this issue is actually about. But this part is IMHO completely separate from the parts above. We can change how and where the package repo is managed and solve the biggest pain point without having to start building completely new tools or integrations (and without impeding any future such work). Since probably no one has too much time to do these new tools anyway, let's just do something that needs relatively little work and gets the big issue solved.

> Y'all know I hate Java too, but I haven't complained about HaikuDepotServer once. The reason is that it is reasonably architected, easy to maintain, doesn't eat up a massive amount of infrastructure, and has been stable.

And here it is again, the line that kills all motivation by belittling something that someone else has put work in, just because it uses a different approach. How have the buildmasters not been stable in the last 4 years? There are issues in Haiku that keep the builders from being stable, yes, but that has nothing to do with the architecture. Have the shell scripts somehow crashed in these years? Have they broken the repos? No, they built packages, updated the repos and made everything available to be served. I totally understand that the model it uses right now is more suited to a traditional dedicated server with local disks and doesn't fit so well with a containerized setup, but it is not a somehow fundamentally flawed hack either.

And just to clarify, I am not some "old fashioned solution" die-hard who shuns how modern infrastructure is run. In my day job I am the one to containerize everything (some even claim I am too strict about it; heck, I don't even have node or npm installed on my host system even though I have to maintain and work on projects that use it all the time). I am just pragmatic about it. If the solution solves the problem without resorting to unstructured hacks, then I don't really care what it is written in (unless it pulls in a ton of dependencies for no reason).

I had just about recovered from the last motivational hit and was about to invest some more time into this to make a PoC of how the changes I outlined would work. But this discussion now starting to encompass all the personal gripes with choice of approach and technologies really puts a damper on it for me. To some degree I take it personally, because I was the one who invested the time to build the existing setup, sure. But what is actually holding me up more is the feeling that whatever solution I would provide would just be dismissed as "an unworthy hack" and then be complained about for the next 4 years, regardless of whether it solves the problem at hand or not. Besides family and the workload of my day job, I really just don't have a lot of spare time. So being motivated and getting a "positive vibe" from spending that time is quite important to me. If we can look past personal preferences of approach and script language and just work on improving the existing solution to solve the actual problem, I can get behind that. Otherwise it just doesn't make sense for me to invest more energy into it, and I'll leave it at the outlined suggestions above for someone else to actually act on (or throw the whole thing out for something more shiny than a shell script).

@mmlr
Member

mmlr commented Jun 16, 2023

> Here's the basics:

Just to make sure: The solution I outlined above would solve all of these issues, right?

  • It would get rid of the source volume because that can all be done at container image build time
  • It would get rid of the shared packages volume because they would reside in object storage
  • It would only require normal persistent volumes for the HaikuPorts repository state and its caches, which are unproblematic

What is the "haikuports frontend" you listed? The webserver serving the static outputs like build logs and status, i.e. the buildmaster-frontend container? These are all completely static and so can be served by any webserver; the only thing that container image specializes on top of nginx is the no-cache and gzip handling for the logs. It also doesn't have to be a single frontend to serve all the architectures; each instance could share its persistent volume with a dedicated frontend, so that only a buildmaster instance and a frontend need to be scheduled on the same node and not all buildmaster instances.

@kallisti5
Member Author

kallisti5 commented Jun 16, 2023

> > Here's the basics:
>
> Just to make sure: The solution I outlined above would solve all of these issues, right?

>   • It would get rid of the source volume because that can all be done at container image build time

Correct.

>   • It would get rid of the shared packages volume because they would reside in object storage

Correct. If we were ever able to get the entire haikuports repo into an s3 bucket, we already have infrastructure to handle serving repository artifacts. (hpkgbouncer is used today for haiku's repos; it does http 302 redirects to the s3 object storage providers of our choice, keeping an inventory cache and managing branches, the latest "current", etc.) It also offers us the future ability to do geographic redirects based on source IPs.

s3 object storage lets us pick and choose the cheapest provider, and lets us pragmatically mirror repositories, try things like storj.io, etc. If Wasabi decides they hate us, we just mirror the repos somewhere else and update a few environment variables to send eu.hpkg.haiku-os.org traffic elsewhere.

>   • It would only require normal persistent volumes for the HaikuPorts repository state and its caches, which are unproblematic

Correct. We can definitely have shared volumes, etc.

> What is the "haikuports frontend" you listed? The webserver serving the static outputs like build logs and status, i.e. the buildmaster-frontend container? These are all completely static and so can be served by any webserver; the only thing that container image specializes on top of nginx is the no-cache and gzip handling for the logs. It also doesn't have to be a single frontend to serve all the architectures; each instance could share its persistent volume with a dedicated frontend, so that only a buildmaster instance and a frontend need to be scheduled on the same node and not all buildmaster instances.

Correct: the frontend container in the haikuporter buildmaster pod, aka https://build.haiku-os.org/buildmaster/master/x86_gcc2/ aka https://github.com/haikuports/haikuporter/tree/master/buildmaster/frontend

@SamuraiCrow

I could recode the loop script as a GitHub action (whenever a new push is detected) easily enough, but if the issue is having many volumes mounted at once, would it be wiser to refactor the actions to go one architecture at a time with multiple jobs running in parallel and then move on to the next architecture? That doesn't sound too bad to me, but it reduces the parallelism a bit. Am I misunderstanding the problem?

@SamuraiCrow

If we can build 3 architectures at a time, as long as the result of the queued architecture isn't asynchronous with the result returns, I could just convert the bash file into a makefile and call it with -j3 or -j2 to limit the pool of architectures while allowing the maximum number at once.

@mmlr
Member

mmlr commented Jun 17, 2023

I have further investigated the current places where the package repository is used or its form assumed in some way throughout HaikuPorter. One solution presents itself as a relatively simple way to address most of these cases without having to separate the existing logic: Shipping the packages to object storage while keeping local placeholders/caches in the form of build packages that only contain the package metadata. These are also HPKGs, but only contain the .PackageInfo. For all operations relevant to BuildMaster, this is sufficient and completely transparent to all logic in HaikuPorter. Having these stub packages in the local state provides a few benefits:

  • HaikuPorter does not need to have separate code to handle normal local port building and running as BuildMaster with object storage
  • The package_repo command is happy with having these metadata-only packages and does not need any changes
  • Any currently used metadata caches can continue to be built lazily as they are now (and they can also be rebuilt when needed)

The main challenge really boils down to timing. When the buildrun produces packages that are dependencies of other builds of the same buildrun there is a chicken and egg problem. The upload of packages to object storage, or at least the removal and replacement of the local packages with stubs, can only happen once the repo has been updated, because package_repo needs to build a checksum of the actual package file, not the stub. But the repo building only happens once after the buildrun is complete, because that is a lot more efficient. That means that either:

  • The freshly built dependency packages continue to be pushed to the builders the way it is done now (via sftp) and not via object storage, resulting in there being two different distribution mechanisms to maintain
  • The freshly built dependency packages are pushed to object storage as they arrive, but are only made available once the repo update happens

The first option is not great, because it means the current package distribution to builders needs to be kept around instead of being completely replaced with builder side object storage downloads, so that's a no-go. The second option is ok in principle, but allows for there to be unreferenced packages floating around in object storage in the timeframe between individual builds completing and the repo being updated after the buildrun completes. This may or may not be problematic if something goes wrong later on in the buildrun that causes it to be aborted and later repeated. Right now I can't think of a case where it would break anything, but I haven't thought it through completely.

In either case, the package files would be kept local for the repo update step and be removed by that step as well, so after the buildrun completes. This means that there is a temporary increase in local storage need, depending on how many packages are being produced by the buildrun and their size. If we deem this to be too much of a risk of running out of container volume storage, then we would need to extend package_repo to allow for being supplied with checksums instead of calculating them itself. This way, the built packages can always be checksummed, shipped to object storage and replaced by the metadata-only stub package whenever they become available. The repo build step would then work with the stub package for the metadata and take a checksum mapping as a new input. If the checksum and stub package creation is done directly on the builder, that would mean the buildmaster instance never even has to have the full packages locally; it could stream the downloading package directly to object storage. The builder could also do the object storage upload as well, but intuitively I wouldn't want to have the required access credentials spread there.
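
The streaming variant could look something like this sketch (the wrapper class and the boto3 upload are assumptions, not a worked-out design):

```python
# Sketch: checksum a package while streaming it through to object storage,
# so the full file never has to rest on the buildmaster's local disk.
import hashlib

class HashingReader:
    """File-like wrapper that updates a SHA256 digest as data is read."""
    def __init__(self, stream):
        self.stream = stream
        self.sha256 = hashlib.sha256()

    def read(self, size=-1):
        data = self.stream.read(size)
        self.sha256.update(data)
        return data

def stream_package_to_storage(packageStream, bucket, key):
    import boto3  # assumed client library
    reader = HashingReader(packageStream)
    boto3.client("s3").upload_fileobj(reader, bucket, key)
    return reader.sha256.hexdigest()  # checksum to feed to package_repo later
```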

With this approach, the changes needed to make this work are even more minimal than I anticipated earlier. The only two places to be substantially adapted would be:

  • The PackageRepository that now needs to manage a remote repository in object storage and needs to handle repo building with package_repo slightly differently
  • The RemoteBuilderSSH that needs to provide object storage URLs for packages to the builder instead of pushing local packages and delegate handling of downloaded packages to the PackageRepository for object storage upload

Some of the places that deal with package obsoletion may also need to be changed to delegate this to the PackageRepository. I have not yet fully analysed which places are affected, as there are two different levels of package obsoletion. One of them deals with pruning the local repository state as part of build preparation; this does not need to change. The other is where the local state is then applied to the package repository, which may or may not need changing.

> I could recode the loop script as a GitHub action (whenever a new push is detected) easily enough, but if the issue is having many volumes mounted at once, would it be wiser to refactor the actions to go one architecture at a time with multiple jobs running in parallel and then move on to the next architecture? That doesn't sound too bad to me, but it reduces the parallelism a bit. Am I misunderstanding the problem?

The problem addressed in this issue is the current storage situation for the package repository. The triggering in the loop is suboptimal, but not really in pressing need of being addressed. The volume attachments are not dynamic, so running architectures in sequence does not change the situation. It also doesn't really simplify the situation container-wise, as the current loop would need to be replaced by a daemon. That daemon needs to listen for API calls with instructions for what to do and then trigger the corresponding commands. The currently available commands in the filesystem based API are "update" (pull the repo and build everything that changed), "everything" (attempt to build every recipe) and "build" (building a list of specific ports). The GitHub action would then call this API and trigger the "update" command on every push. The new API will also need some form of authentication. The actual work cannot be done as part of the GitHub action, because it needs the buildmaster state (basically the database of all available package metadata) and access to the builders. This is really a separate task from the one discussed here and should therefore be moved to a new issue if it shall be worked on in earnest.

@SamuraiCrow

Ok.

@pulkomandy
Member

> The package_repo command is happy with having these metadata-only packages and does not need any changes

This might be a problem for package and repository signing when we get to implementing that? package_repo will need the checksum or possibly signature of each package then, and the checksum would be different for the metadata-only packages.

> keeping local placeholders/caches in the form of build packages that only contain the package metadata. These are also HPKGs, but only contain the .PackageInfo.

It seems this can work and requires fewer changes, but also introduces some room for confusion about which package is metadata-only and which is complete. Maybe storing only the packageinfo file, not wrapped in an hpkg file, would reduce that risk? But it is more work, so maybe a thing to do as a later cleanup once the issue with object storage is fixed.

@mmlr
Member

mmlr commented Jun 18, 2023

> > The package_repo command is happy with having these metadata-only packages and does not need any changes

> This might be a problem for package and repository signing when we get to implementing that? package_repo will need the checksum or possibly signature of each package then, and the checksum would be different for the metadata-only packages.

It is not a problem in how it is used in buildmaster. The package_repo command is used with the update command that explicitly re-uses the checksums from the existing repo file and only needs the (stub) package files to check the metadata for version changes to know whether or not to update them. That command exists because it would be rather slow and wasteful to redo the checksums of the entire package repository on each update.

As for signing, the packages themselves are not signed, only the repo file is. And that one would be kept locally in any case for the updates to work. If we start signing the packages, then indeed we'd need to keep the signatures around somehow. Then the approach of extending package_repo to be able to take not only a list of packages, but a list of checksums and signatures as well can be implemented.

> > keeping local placeholders/caches in the form of build packages that only contain the package metadata. These are also HPKGs, but only contain the .PackageInfo.

> It seems this can work and requires fewer changes, but also introduces some room for confusion about which package is metadata-only and which is complete. Maybe storing only the packageinfo file, not wrapped in an hpkg file, would reduce that risk? But it is more work, so maybe a thing to do as a later cleanup once the issue with object storage is fixed.

In the new setup, the freshly built packages from the buildrun would be held in a separate directory, just for the purpose of adding them to the repo. They would then simply be deleted once package_repo has run, as they were already uploaded to object storage beforehand to make them available for later builds in the buildrun. So the local packages directory in a buildmaster instance would always only be stub packages and never have full packages, which reduces the possible confusion.

Keeping the .PackageInfo file would be more obvious indeed, and to HaikuPorter this would be equally transparent, because the PackageRepository already treats HPKGs and .PackageInfo files the same. But it would not work for package_repo update as that one needs them to be HPKGs (I checked, it doesn't read .PackageInfo files, unlike the package tool). Obviously we could also just change package_repo to be able to use .PackageInfos, or to take a list of packages that shall be kept without looking for updates at all. The latter would be a nice addition because it would remove the need to parse many HPKG files just to figure out that they haven't changed, which is very slow.

But since HaikuPorter needs to keep the metadata around anyway, having them as stub packages can fit both needs for now and further optimizations can be done in a later step.

@waddlesplash
Member

The "build_packages" system needs to make new repo files from scratch for a variety of reasons, so having a way to pass in .PackageInfos with SHA256 already computed to package_repo would simplify that usecase a lot under this new model, as well.

@kallisti5
Member Author

As a reminder, a lot of implementations compute sha256 sums for you and present them as metadata: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html (this isn't standardized, however).

However, having local copies of packages cached seems fine too. If we need the checksum of a package, we could just pull it back from the object storage buckets (bandwidth is cheap at the scale of "several MiB" packages).

If you want to get really advanced, most object storage providers support versioning of individual files, so in theory (plus some code to actually do it) we could "roll back" a repository if it were corrupted.
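
As a sketch of that rollback idea (assumes versioning is enabled on the bucket; bucket and key names are placeholders):

```python
# Placeholder sketch: restore the previous version of a repo file from a
# version-enabled bucket. Bucket/key names are made up for illustration.
import boto3

def rollback_repo(bucket="haikuports-repo", key="master/x86_64/repo"):
    s3 = boto3.client("s3")
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
    previous = versions[1]  # versions are listed newest first
    s3.copy_object(
        Bucket=bucket, Key=key,
        CopySource={"Bucket": bucket, "Key": key,
                    "VersionId": previous["VersionId"]})
```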
