
Switch to Bazel to build the AGW docker containers #14313

Closed
wolfseb opened this issue Oct 28, 2022 · 4 comments
@wolfseb (Contributor) commented Oct 28, 2022

Switch to Bazel to build the AGW docker containers

Similar to previous efforts towards a unified Bazel build system in #8338 and #11293, we would like to bring forward a proposal for a next step towards that goal: to switch to Bazel for building the AGW containers.

TLDR

  • Rebuilding the AGW containers after a code change takes one to five minutes.
  • With Bazel it takes around four seconds.

Problem

The current state of the AGW container build is problematic:

  • When spinning up a new magma-dev VM, it takes roughly 30 minutes to build the containers. This is problematic for development purposes, as frequent rebuilding causes a lot of waiting time.
  • The container images are very large (~1 GB) because all services of a given language are installed in the same image; e.g., the pipelined container also contains all of the AGW Python services. This is at best against Docker best practices and at worst a security risk.
  • The previous point also means that an update to any service in a container causes multiple services to be rebuilt and restarted instead of just the affected one.

Solution

We propose to build the AGW containers using Bazel. This includes the following goals:

  • Use Bazel instead of Docker to build the AGW containers. This reduces rebuild times to potentially under 5 minutes, and to just a few seconds for single services (for details, see below).
  • Separate the container images so there is one image per service, instead of one per language.
  • Check that all integration tests still run using the bazel-ified containers.

Notes:

  • We will only move towards building the container images with Bazel. The deployment and orchestration will still be handled by docker-compose.
  • Building with Bazel is not limited to any OS. However, the rules_docker package has some convenience tools for building images based on Ubuntu.
  • Bazel builds the images outside the containers, which makes repeated builds after small changes much faster due to improved caching.
  • Separating the container images into one per service has the following features, advantages, and disadvantages (a sketch of such a per-service image target follows this list):
    • Minimal build context: only what is needed by one service is installed on its image.
    • Drastically reduces build times (especially due to Bazel's caching) and image size.
    • Rebuild times after changes to a single service are further reduced.
    • Changes to a single service only cause that single service to be restarted.
    • More code and files to maintain, spread across different locations (one per service, in each service's directory), instead of three centralized Dockerfiles for everything.
  • A PoC branch is available here
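
To make the per-service image idea concrete, here is a rough sketch of what such a target could look like with rules_docker. This is illustrative only: the target, source, and dependency names are hypothetical placeholders, not taken from the PoC branch.

```python
# BUILD.bazel -- a minimal sketch, assuming rules_docker is set up in the
# workspace; target and file names are hypothetical placeholders.
load("@io_bazel_rules_docker//python3:image.bzl", "py3_image")

py3_image(
    name = "mobilityd_image",
    srcs = ["mobilityd.py"],    # hypothetical entry point for the service
    main = "mobilityd.py",
    # Only this service's dependencies end up in the image, which is what
    # keeps the per-service images small and the rebuilds incremental.
    deps = [":mobilityd_lib"],  # hypothetical py_library with the service code
)
```

Because Bazel knows the full dependency graph, a change to one service only invalidates that service's targets and image layers, which is where the few-second rebuild times come from.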

Non-goals

  • Changes to the networking of the containers, which is currently set to host networking
  • Any functional changes of the applications
  • Better isolation of the containers
  • Specific improvements regarding security
  • Changes in the way the containers are deployed

Technical details on Bazel caching and rebuild times

The way Bazel handles its cache (which lives in part outside the dev VM, where the containers are currently built) drastically speeds up the process of rebuilding the containers after code changes or after spinning up a new dev VM.
In our testing, we built several containers with Bazel and ran the s1ap integ tests to ensure their functionality.
Even a first build (i.e. without any Bazel cache) is already faster than with docker-compose, and the build times become significantly shorter once the cache is available. Even after destroying and spinning up a new dev VM, this cuts the container build time from 30 minutes to under 5.
The build times from our testing are summarized in the table below. Services tested are:

  • python: mobilityd, enodebd
  • C++: mme, sctpd, sessiond, li_agent, connectiond
  • Go: envoy_controller

| Container build | Docker | Bazel (per service) | Bazel (no VM cache) | Bazel (total, no cache) |
| --- | --- | --- | --- | --- |
| Python, from scratch | ~4.5 min | ~15 s | +35 s (once) | ~5.5 min |
| Python, after code change | ~1 min | ~4 s | | |
| C++, from scratch | ~15 min | ~15 s | ~2 min (total) | ~11.5 min |
| C++, after code change | ~5.5 min | ~4 s | | |
| Go, from scratch | ~9 min | ~15 s | ~1.5 min | ~2.5 min |
| Go, after code change | ~1 min | ~4 s | | |

Note that the total build times in the last two columns should not be added up; there are common tools and dependencies which are cached and shared between those builds. In our testing, rebuilding all listed containers after setting up a new dev VM took less than 3 minutes.
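
As a rough illustration of the cache setup described above, a .bazelrc along these lines could be used; the paths and the remote cache endpoint are assumptions for the sake of the example, not Magma's actual configuration.

```
# .bazelrc (illustrative sketch)

# Persist build outputs on a host-mounted path outside the dev VM, so a
# freshly provisioned VM starts with a warm cache.
build --disk_cache=~/.cache/bazel-disk

# Optionally, share artifacts across machines and CI via a remote cache.
# build --remote_cache=grpc://cache.example.com:9092
```

With the cache surviving VM teardown, only the actions whose inputs changed are re-executed, which is what brings the post-reprovisioning build down from ~30 minutes to a few minutes.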

wolfseb added the type: proposal (Proposals and design documents) label on Oct 28, 2022
@jordanvrtanoski (Contributor) commented:

General comment

The proposal contains several separate and independent actions:

  • Move the container build to the Bazel build system
  • Restructure the containers so there is a separate container for each service
  • Change the containers from the host network to a dedicated network

There is no dependency between the first two actions, meaning it is still possible to separate the services into dedicated containers without changing the build system.

Build the containers with Bazel

The problems with this part of the proposal are:

  • Moving the container build from the Dockerfiles to Bazel will not provide any benefit to the end users of the project.
  • It will create confusion for contributors, especially newcomers, since it is natural to expect to find a Dockerfile for each container, and less expected that the build is performed by an external build system.
  • The containers are not built that frequently. In a typical development cycle, one builds the code in the development environment, and only once the code is verified and ready for testing are the Docker containers built. The benefit of speeding up the build is therefore overemphasized.
  • The existing containers removed the dependency on the VMs and enabled the AGW to be deployed on multiple platforms (AMD64 and AARCH64). Introducing a dependency on the VM to build the containers is a step backwards from having a build system that is independent of the VM and in control of its dependencies.

Using the VM for the build effectively locked the project to the AMD64 architecture. Although it could be argued that the VM does not intrinsically lock the project to any architecture, in reality the way the VMs are constructed and used in the build and development process was (and still remains) the main reason the project does not yet produce bare-metal packages for AARCH64.

Restructure the containers to one dedicated service per container

This actually sounds like it would bring some value to the end user; however, the concern about the size of the containers is invalid. Thanks to the overlay filesystem, Docker stores only one physical copy of each image layer, shared by all containers spawned from images that contain it, so the net effect on the space used remains almost unchanged (the sum of all files across the multiple images will roughly equal the files in the single image).
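
(For reference, this layer sharing can be observed directly with Docker's disk-usage tooling; the sketch below is illustrative and assumes two service images built from the same base image.)

```sh
# "docker images" reports each image's full size, counting shared base
# layers once per image ...
docker images
# ... while "docker system df -v" breaks the usage down into SHARED SIZE
# and UNIQUE SIZE per image, showing that shared layers exist only once
# on disk.
docker system df -v
```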

The only real benefit would be shorter build times; however, since the containers are not built that often, this benefit alone does not justify the effort.

All other objectives can be achieved with the containers in their current state.

Moving from the host network to a dedicated network

This is a good proposal with the potential to bring value to the end user. I suggest creating it as a new proposal so we can discuss it independently of the decision on the previous two parts.

@jheidbrink (Contributor) commented:

Thank you @jordanvrtanoski for posting your analysis and opinion on the proposed changes.

Build the containers with Bazel

> Moving the container build from the Dockerfiles to Bazel will not provide any benefit to the end users of the project.

End user benefits are not the only valuable thing for the project. A reliable, easy, fast, and efficient build setup and infrastructure is a major cost saver for the community and helps ensure that the project is able to deliver the product. These aspects are well elaborated in #8338 and #11293.

> It will create confusion for contributors, especially newcomers, since it is natural to expect to find a Dockerfile for each container, and less expected that the build is performed by an external build system.

If the build system is clearly documented (which we intend), the particular approach might be surprising but hopefully not confusing. In particular, the Bazel models are well structured and easy to work with. Large parts of the AGW are already modeled and built with Bazel anyway, so a new contributor has to work with Bazel in any case.

> The containers are not built that frequently. In a typical development cycle, one builds the code in the development environment, and only once the code is verified and ready for testing are the Docker containers built. The benefit of speeding up the build is therefore overemphasized.

Actually, one main motivation for this proposal was that developers got frustrated because the build times during development took so long. There is no alternative development environment for Docker, and the only way to get quick feedback at the moment is to use the systemd build.
What you describe is only one possible development workflow. With the community moving toward containerized components (e.g. the AGW), more and more developers rebuild the Docker containers very frequently, since those are the testable build artifacts.

> The existing containers removed the dependency on the VMs and enabled the AGW to be deployed on multiple platforms (AMD64 and AARCH64). Introducing a dependency on the VM to build the containers is a step backwards from having a build system that is independent of the VM and in control of its dependencies.

I think this might be a misunderstanding. We do not plan to introduce a dependency on either the VM or on amd64.
You can run Bazel builds on amd64 and arm (also in CI) via .devcontainer/bazel-base/Dockerfile; the VM is not needed. This is also how many developers build locally (using either the bazel-base container or the derived devcontainer .devcontainer/Dockerfile), and .devcontainer/bazel-base/Dockerfile can serve as a blueprint for finding the needed dependencies; a rough example invocation is sketched below.
Would you have a use-case in mind where the AGW containers are built on a host directly?
We will test the arm build as well. Thanks for bringing this up.
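
For illustration, a VM-free container-based build could look roughly like the following; the image tag, mount path, and target pattern are assumptions for the sake of the example, not documented commands.

```sh
# Build the Bazel base image and run a Bazel build inside it (the tag
# "magma-bazel-base" and the target pattern are placeholders).
docker build -t magma-bazel-base -f .devcontainer/bazel-base/Dockerfile .
docker run --rm -v "$(pwd):/workspace" -w /workspace magma-bazel-base \
    bazel build //lte/gateway/...
```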

Restructure the containers to one dedicated service per container

"Restructure the containers to a dedicated services per container" was not so much a goal of the proposal as it is a side effect. If there is a decision to work on "Build the containers with Bazel" then we automatically get dedicated containers for each service. But I agree that there is likely no overall size benefit if the sum over all containers is considered (although this is not sure as the current containers might have some unnecessary dependencies that will no longer be there). I also agree that this is probably no size improvement for the end-user because of the overlay system. However, having dedicated images per service with only the necessary dependencies and nothing else still seems like a good idea, and cleaner, safer and more resilient than the current setup. Since we have Bazel to manage dependencies, using that to construct the containers and not keeping a separate dependency list within Dockerfiles makes things simpler. Especially, with the larger movement of the community towards a containerized AGW.

Moving from the host network to a dedicated network

The item "Moving from the host network to a dedicated network" is listed as a non-goal in the proposal. I agree it is a valuable goal, but it is not at all trivial and should be discussed separately. It is also independent of whether the containers are built with Bazel.

@jheidbrink (Contributor) commented Nov 17, 2022

This proposal is potentially relevant for the Plan towards a fully community-based development mode, as it eases the maintenance of the containerized AGW and thus might make it feasible to maintain both the AGW Ubuntu package and the AGW container images.

@maxhbr (Member) commented Dec 18, 2022

There was a vote on 2022-12-15, but not enough TSC members participated.
