CI Support for aarch64 (AWS graviton2) #78

directionless · 2021-02-24T14:23:02Z

Problem

osquery has had aarch64 support (osquery/osquery#6612) for a bit. (Huge shoutouts to the contributors on that.) The big sticking point in declaring it stable is adding it to CI.

Our last CI was Azure Pipelines, our current CI is GitHub Actions. Unfortunately, neither of these hosts aarch64 runners. But they both distribute runners for that platform, so you can run your own. (GitHub Actions is a fork of Azure Pipelines, so it's unsurprising they look similar.)

Possible Solutions

A short link dump, and discussion, about possible solutions

Self Hosted Runner with an Auto Scaling Group

Envoy uses an AWS autoscaling group to manage workers. These workers have some tooling to run a single job, and then detach themselves. This feels very clean, in that it uses a simple AWS tool to handle availability.

References:

https://github.com/envoyproxy/ci-infra

Self Hosted Runner in Kubernetes (EKS)

We could host runners as pods in a Kubernetes cluster. This is appealing in its simplicity, at least once you accept Kubernetes.

I think this has some potential drawbacks around security. I don't think pods are as isolated as we might like them to be.

There's also a drawback in that we have to bring in Kubernetes. I have some experience there (Kolide runs several clusters), but it would be new to the osquery project.

References:

Self Hosted Runner with Lambda Scaling

Philips uses a pile of Terraform to create Lambdas that manage spinning spot instances up and down as workers. This looks pretty well formed, and has some discussion of security. I think it trades the complexity of the Auto Scaling Group for a Lambda function.

While I think this is a strong contender, I think it will be simpler for us to use auto scaling groups.

References:

https://github.com/philips-labs/terraform-aws-github-runner

Moving CI

There may be some CI vendors that have native support for aarch64, for example Amazon's various offerings or Travis CI.

However, moving CI has a significant complexity cost to us. We are currently primarily invested in GitHub.

Still, if Amazon CodeBuild works well enough, it might be okay to maintain both? Worth at least a little experimenting.

directionless · 2021-02-24T14:25:51Z

I spent a while reading through the code on these. My current bias is towards simplicity. I have to recognize I'm not finding a lot of time, and some of these have a lot of complexity. While the complexity is hidden by Terraform, we don't have a good Terraform story (yet), and it's still complexity to manage/debug/fix.

Given that, I am currently strongly biased towards the Envoy-style AWS ASG approach. It is, by far, the simplest approach here.

Last night I ported the AMI generation from envoyproxy/ci-infra to build a GitHub runner -- osquery/infrastructure#7

mike-myers-tob · 2021-02-25T18:59:10Z

What if we use one of our existing available CI runners (Linux/x86), but cross-compile for ARM and then use cross-execution to run the osquery tests (using qemu-user and binfmt-misc so that any non-native binaries get executed as if they're native)? Because osquery is statically linked this might be more feasible than it sounds.
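As a rough illustration of the cross-execution idea: binfmt-misc picks an interpreter such as qemu-user by matching bytes in a binary's header, and for ELF files the target architecture is recorded in the `e_machine` field. The sketch below reads that field to tell whether a binary would need emulation on an x86-64 runner. The function names are hypothetical and this is not how osquery's CI works; only the ELF constants (`EM_X86_64 = 62`, `EM_AARCH64 = 183`) come from the ELF specification.

```python
import struct

# e_machine values from the ELF specification.
EM_X86_64 = 62
EM_AARCH64 = 183


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # Byte 5 (EI_DATA) is 1 for little-endian and 2 for big-endian files.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return machine


def needs_emulation(header: bytes, host_machine: int = EM_X86_64) -> bool:
    """Hypothetical helper: True if the binary targets a different CPU than the host."""
    return elf_machine(header) != host_machine
```

To check a real binary, pass the first 20 bytes of the file, for example `needs_emulation(open(path, "rb").read(20))`, where `path` is whatever cross-compiled test binary you want to inspect.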

directionless · 2021-02-25T19:22:05Z

> What if we use one of our existing available CI runners (Linux/x86), but cross-compile for ARM and then use cross-execution to run the osquery tests (using qemu-user and binfmt-misc so that any non-native binaries get executed as if they're native)? Because osquery is statically linked this might be more feasible than it sounds.

On Slack a bit ago, Stefano said that was unacceptably slow. But maybe he was compiling under qemu.

mike-myers-tob · 2021-02-25T21:11:24Z

> On Slack a bit ago, Stefano said that was unacceptably slow. But maybe he was compiling under qemu.

Ah, I didn't see that conversation but I think he told me today that the ARM-based AWS instance was unacceptably slow. Cross-compiling shouldn't be slow, and qemu overhead for cross-execution should be acceptable.

AGSaidi · 2021-02-25T21:53:10Z

I'm not sure how fast you're expecting, but building on a Graviton2 instance on AWS, it's about 6m15s to build without tests and 6m43s with them.

mike-myers-tob · 2021-02-26T00:21:25Z

> I'm not sure how fast you're expecting, but building on a Graviton2 instance on AWS, it's about 6m15s to build without tests and 6m43s with them.

That's plenty fast. He must've been talking about something else then.

Regardless of speed, my suggestion was just about a possible way to build and test ARM without having to provision our own ARM-based CI runners on another cloud, until GitHub Actions gets an ARM CI runner, since it seems like we don't have the time to learn Terraform/Ansible, set up another cloud account, maintain it, pay for it, etc.

directionless · 2021-02-26T00:25:51Z

> Regardless of speed, my suggestion was just about a possible way to build and test ARM without having to provision our own ARM-based CI runners on another cloud, until GitHub Actions gets an ARM CI runner, since it seems like we don't have the time to learn Terraform/Ansible, set up another cloud account, maintain it, pay for it, etc.

https://osquery.slack.com/archives/C019GR05SAH/p1599466550051900 (Alessandro, not Stefano)

Time and money are a bit funny. We do have an AWS presence, and I'm ignoring the Terraform side and configuring things manually. I'm currently testing CodeBuild and slowly trying to get a native runner up.

Of course, I haven't yet broached trailofbits/osquery:ubuntu-18.04-toolchain-v9

directionless · 2021-03-08T01:52:26Z

I tried spinning up AWS CodeBuild (this is the AWS CI thing). I used an incredibly simple buildspec.yml, having created a multiplatform trailofbits/osquery:ubuntu-18.04-toolchain-v9 image.

The build went smoothly. It took 1,123 seconds (about 4 minutes in cmake and submodules, and 15 minutes in the build), which is quite a bit more than the 7ish minutes cited earlier.

The CodeBuild tooling is nice, with a good display of things, but it has fewer platforms and options than GitHub. Still, if I can't get another strategy to work, we can probably figure out how to use this as a fallback.
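For a sense of scale, the two timings in this thread can be put side by side. This is only back-of-the-envelope arithmetic on the numbers quoted here (1,123 seconds on CodeBuild, 6m43s for the earlier Graviton2 build with tests); the helper name is made up for the example.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds pair into seconds."""
    return minutes * 60 + seconds


codebuild_total = 1123                    # seconds, CodeBuild run
graviton_with_tests = to_seconds(6, 43)   # seconds, earlier Graviton2 figure

minutes, seconds = divmod(codebuild_total, 60)
ratio = codebuild_total / graviton_with_tests

print(f"CodeBuild: {minutes}m{seconds:02d}s")                    # CodeBuild: 18m43s
print(f"Relative to the Graviton2 build with tests: {ratio:.1f}x")  # 2.8x
```

The two runs used different machines and setups, so the ratio is only indicative.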

AGSaidi · 2021-03-08T03:25:36Z

I used a VM that had more than the 8 vcpus the CodeBuild VMs have, so that makes sense.

directionless · 2021-03-08T12:49:31Z

osquery/osquery-toolchain#23 is the Dockerfile I'm using to build the builders

fkorotkov · 2021-08-18T15:37:20Z

Hey everyone,

I'm the founder of Cirrus CI. We are collaborating with AWS folks to bring free managed Graviton2 CI for OSS projects, which we are about to announce. Would you like to try it out? It's as simple as configuring the Cirrus CI GitHub App and adding the following .cirrus.yml config. No need to manage your own infrastructure.

# .cirrus.yml
task:
  arm_container:
    image: ubuntu:latest
  script: uname -a

Cirrus CI will run such a CI task on an EKS cluster of Graviton2 instances. You can run containers of any size up to 8 CPUs, with up to 16 CPUs in total concurrently (for example, 8 concurrent tasks with 2 CPUs each).
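The limits described above can be worked through with a small calculation. This is a sketch that assumes whole-number CPU requests and that CPU is the only constraint; the constant and function names are made up for the example and are not part of Cirrus CI.

```python
PER_TASK_CPU_LIMIT = 8   # largest single container, per the comment above
TOTAL_CPU_LIMIT = 16     # CPUs available across all concurrent tasks


def max_concurrent_tasks(cpus_per_task: int) -> int:
    """How many equally sized tasks fit under the total CPU limit."""
    if not 1 <= cpus_per_task <= PER_TASK_CPU_LIMIT:
        raise ValueError(f"cpus_per_task must be between 1 and {PER_TASK_CPU_LIMIT}")
    return TOTAL_CPU_LIMIT // cpus_per_task


for cpus in (2, 4, 8):
    print(f"{cpus} CPUs per task -> up to {max_concurrent_tasks(cpus)} concurrent tasks")
```

With 2 CPUs per task this gives 8 concurrent tasks, matching the example in the comment; at the 8 CPU maximum, only 2 tasks run at once.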

directionless · 2021-08-18T15:44:04Z

Hi @fkorotkov. Coincidentally, I've been reading about Cirrus CI, and am overjoyed you found this. I'd love to chat!

I'd love a cleaner solution for aarch64, and we're starting to think about Apple's M1 as well. Does it make sense for us to find some time to chat, or should I just try this first?

fkorotkov · 2021-08-18T15:54:59Z

Will be happy to chat! You can email me at fedor@cirruslabs.org and we'll figure something out.

For future researchers, there is a problem with Apple M1 because none of the existing virtualization technologies support it yet, and therefore it's impossible for CIs to provide ephemeral VMs. But if you have your own M1 hardware, Cirrus CI natively supports it via Persistent Workers. @directionless you probably read about them because of this comment actions/runner#805 (comment)

fkorotkov · 2021-08-19T12:33:38Z

Forgot to mention that if you are planning to experiment with Cirrus CI, I highly recommend checking out Cirrus CLI, which can run Cirrus tasks locally. It's a great way to iterate quickly over config.

fkorotkov · 2021-08-23T14:00:15Z

FYI arm_containers are GA now and you can try them out. https://cirrus-ci.org/guide/linux/

directionless added the "moving parts" label (This involved infra, accounts, or services we need to manage) on Feb 24, 2021
