mkosi and GA CI support for boot test on Ubuntus #61

Merged: 2 commits into lkrg-org:main on Apr 3, 2021


vt-alt
Contributor

@vt-alt vt-alt commented Mar 29, 2021

There are two commits:

Author: Vitaly Chikunov <vt@altlinux.org>
Date:   Mon Mar 29 06:37:51 2021 +0300

    Add mkosi support

    mkosi is systemd's boot test tool. This support is mainly for GA CI
    to implement full boot tests (on Ubuntu). But, this would be useful
    on its own for experiments with lkrg on all mkosi supported distros.

    I support only bios (grub) BootProtocol without unified kernel.

    - .gitignore updated to exclude mkosi artifacts (otherwise they could
      recursively go into created image causing disk full error).
    - mkosi.default is mkosi config pre-configured for ubuntu focal, you can
      overwrite this with command line options.
    - mkosi.build is script to build lkrg and install it into DESTDIR.
    - mkosi.postinst hook updates initrd to include and insmod lkrg and
      grub to remove 'quiet' cmdline option.

    Signed-off-by: Vitaly Chikunov <vt@altlinux.org>

I tested this only on Ubuntu focal (in Vagrant). But, this may work on other distributions where mkosi is supported (many, but not ALT).
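
The commit message says the mkosi.postinst hook edits the grub configuration to remove the 'quiet' cmdline option. A minimal sketch of that one step follows. It is not the hook from this PR: the helper name `remove_quiet` is hypothetical, and it assumes the Ubuntu convention of a `GRUB_CMDLINE_LINUX_DEFAULT="..."` line in `/etc/default/grub`.

```shell
#!/bin/sh
# remove_quiet FILE (hypothetical helper, not part of this PR):
# print FILE with the word "quiet" dropped from the
# GRUB_CMDLINE_LINUX_DEFAULT line. Assumes the value is double-quoted
# and options are separated by single spaces.
remove_quiet() {
    sed '/^GRUB_CMDLINE_LINUX_DEFAULT=/ {
        s/"quiet"/""/
        s/"quiet /"/
        s/ quiet"/"/
        s/ quiet / /
    }' "$1"
}

# Demo on a temporary copy rather than the real /etc/default/grub.
demo=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=0' \
              'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$demo"
remove_quiet "$demo"
rm -f "$demo"
```

On a real image the edited file would still have to be written back and the grub configuration regenerated; that part is left out here.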

Author: Vitaly Chikunov <vt@altlinux.org>
Date:   Mon Mar 29 06:46:52 2021 +0300

    CI: Boot test on GA using mkosi

    Use mkosi to test full system boot with LKRG module loaded early in
    initrd. mkosi creates system disk image (quite slow, 5 minutes for
    ubuntu focal in my tests, and size is 1.3G), builds lkrg there (using
    systemd-nspawn), and finally boots it in qemu. Then we grep boot.log
    for possible problems.

    Ubuntu is chosen, because it's native to GA. Only successful (for the
    test) releases are 'focal' and 'groovy'. It seems mkosi does not support
    'hirsute' (yet, failure installing packages into image). Also, 'xenial'
    does not build lkrg properly, but mkosi works good. Older Ubuntus seems
    to not have systemd, which is a hard mkosi requirement.

    Signed-off-by: Vitaly Chikunov <vt@altlinux.org>

Tested on different Ubuntu releases; the only working ones are focal (with Linux v5.4), groovy (Linux v5.8), and bionic (with Linux v4.15, except that LKRG does not build for it). I hope hirsute will work too as soon as it's released.
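
The "grep boot.log for possible problems" step could look roughly like the sketch below. The function name `check_boot_log` is hypothetical. The success marker and the rcu_sched stall message are taken from this thread; the other problem patterns are assumptions about typical kernel trouble markers, not what the workflow in this PR actually checks.

```shell
#!/bin/sh
# check_boot_log FILE (hypothetical helper): succeed only if LKRG
# reported a successful init and no known problem pattern appears.
check_boot_log() {
    log=$1
    # Success marker as quoted from the boot output in this thread.
    if ! grep -q 'LKRG initialized successfully!' "$log"; then
        echo "LKRG success message not found" >&2
        return 1
    fi
    # "rcu_sched detected stalls" is from this thread; the remaining
    # patterns are assumed examples. Matching lines go to stderr.
    if grep -E 'rcu_sched detected stalls|Kernel panic|BUG:|Oops' "$log" >&2; then
        return 1
    fi
    return 0
}
```

A CI step would then call `check_boot_log boot.log` and fail the job on a non-zero exit status.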

@solardiz
Contributor

@vt-alt Thank you for the contribution!

Should we possibly see it in action already on this PR (we don't yet) or will it activate automatically upon merging (I guess so) or do we need to make any change in the repo's settings on GitHub?

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

You can see it in action at https://github.com/vt-alt/lkrg/actions: the top two builds with the mkosi tag (numbered 21 and 22).
I did not enable anything for it to work.

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

After some thought, I think they run the version of the workflow that is already committed to the branch where a new commit or PR appears.

(You can limit the events that trigger a workflow, but I don't limit anything here.)

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

Possible small improvement to this PR: as you can see there https://github.com/vt-alt/lkrg/runs/2215564904?check_suite_focus=true#step:7:774 there are empty lines between console output, I think this is a possible "\r\n" artifact, I may try to add "\r" stripping between qemu | tee calls.

About rcu_sched detected stalls on CPUs/tasks message there (that caused test #22 failure) this seems intermittent and the cause is unknown to me. It never happened on my local tests nor other GA builds.

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

Possible small improvement to this PR: as you can see there https://github.com/vt-alt/lkrg/runs/2215564904?check_suite_focus=true#step:7:774 there are empty lines between console output, I think this is a possible "\r\n" artifact, I may try to add "\r" stripping between qemu | tee calls.

Yes, tr -d '\r' worked well. (Force pushed the change.) Also, you can see both builds in #23 are green.
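
For illustration, the carriage-return stripping can be sketched as a filter placed between the console producer and `tee`. Here `printf` stands in for the qemu serial console output, and the helper name `strip_cr` is hypothetical.

```shell
#!/bin/sh
# strip_cr (hypothetical helper): delete every carriage return from
# stdin, turning "\r\n" line endings into plain "\n".
strip_cr() {
    tr -d '\r'
}

# Demo: printf emulates console output with CRLF line endings; the
# filter sits in the pipeline before tee, as described above.
log=$(mktemp)
printf 'first line\r\nsecond line\r\n' | strip_cr | tee "$log"
rm -f "$log"
```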

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

FYI. Also, I decided to use expect (to boot qemu), which is similar to how systemd uses pexpect, instead of just adding some unit or /etc/rc.local to the created image that would run and shut down on every boot. This lets people use raw mkosi for experiments with an LKRG'ed system, and not just for the boot/shutdown test.

P.S. One more small improvement I remembered: add a sleep of more than 20 seconds so that the kernel softlockup check is certain to trigger. That could be done later. For now there is still a delay of more than 20 seconds, but only by chance of a slow boot: [ 14.500018] [p_lkrg] LKRG initialized successfully! ... and the shutdown comes somewhere after the 40th second.
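
Whether a given boot actually stayed up for more than 20 seconds after LKRG loaded can be read off the kernel timestamps in the log. The sketch below assumes every relevant line starts with a `[ seconds.micros]` timestamp as in the excerpt above; the helper name `seconds_after_lkrg_init` is hypothetical.

```shell
#!/bin/sh
# seconds_after_lkrg_init FILE (hypothetical helper): print the whole
# seconds between the LKRG success message and the last timestamped
# line. Exits non-zero if the success message is missing.
seconds_after_lkrg_init() {
    awk '
        # Kernel log lines are assumed to start with "[ seconds.micros]".
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2) + 0
            if (index($0, "LKRG initialized successfully!")) init = ts
            last = ts
        }
        END {
            if (init == "") exit 1
            printf "%d\n", last - init
        }
    ' "$1"
}
```

A test could then require the printed value to be at least 20 before trusting that the softlockup detector had a chance to fire.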

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

Some additional thoughts on possible future tests additions and improvements:

  1. It's possible to use a Docker image with another distribution (like Fedora) and still use mkosi for the boot test there. Example: https://github.com/dracutdevs/dracut/blob/master/.github/workflows/fedora-33.yml - except they try to use KVM, which is not supported on GA.
  2. Maybe we could download cloud images for other distros and nspawn install and qemu boot them (in a way like mkosi does).
  3. FYI. systemd/mkosi github action is quite slow, because it tries to build all required tools (to possibly boot all supported distros) from the sources.
  4. In theory, it's possible to boot like virtme, with the host filesystem over 9p fs. This should be the fastest method, but I need to verify whether systemd supports this sort of booting. This would be just one Ubuntu kernel though. But it is also possible to extend it: create an additional repo with kernels [from different distros], check it out there, and boot each of them. Or maybe even Docker image(s)?
  5. I did not optimize the number of packages installed on the system (by mkosi), it's possible that installing linux-virtual is not optimal and there could be a smaller package just to build the module.

@vt-alt
Contributor Author

vt-alt commented Mar 29, 2021

Force-pushed again to change the commit message: I noticed that I referenced xenial where I meant bionic.

@Adam-pi3
Collaborator

About rcu_sched detected stalls on CPUs/tasks message there (that caused test #22 failure) this seems intermittent and the cause is unknown to me. It never happened on my local tests nor other GA builds.

It could be related to this bug:
http://lkml.iu.edu/hypermail/linux/kernel/1609.2/03265.html

@vt-alt thanks for that PR. I have no knowledge of mkosi, so it's difficult for me to evaluate it. I will try to catch up on it. However, I rely on @solardiz here.

@vt-alt
Contributor Author

vt-alt commented Mar 30, 2021

FYI http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html
It's just a tool to build bootable OS images they use to test systemd, and it supports many OS including Ubuntu (that's on GA). I thought the full boot would be quite a good test for LKRG, so it looked like a good fit. I spent more days on this PR than I wanted, though, because of many peculiarities to it (I tried to document them in comments and commit messages).

mkosi is systemd's boot test tool. This support is mainly for GA CI
to implement full boot tests (on Ubuntu). But, this would be useful
on its own for experiments with lkrg on all mkosi supported distros.

I support only bios (grub) BootProtocol without unified kernel.

- .gitignore updated to exclude mkosi artifacts (otherwise they could
  recursively go into created image causing disk full error).
- mkosi.default is mkosi config pre-configured for ubuntu focal, you can
  overwrite this with command line options.
- mkosi.build is script to build lkrg and install it into DESTDIR.
- mkosi.postinst hook updates initrd to include and insmod lkrg and
  grub to remove 'quiet' cmdline option.

Signed-off-by: Vitaly Chikunov <vt@altlinux.org>
Use mkosi to test full system boot with LKRG module loaded early in
initrd. mkosi creates system disk image (quite slow, 5 minutes for
ubuntu focal in my tests, and size is 1.3G), builds lkrg there (using
systemd-nspawn), and finally boots it in qemu. Then we grep boot.log
for possible problems.

Ubuntu is chosen, because it's native to GA. Only successful (for the
test) releases are 'focal' and 'groovy'. It seems mkosi does not support
'hirsute' (yet, failure installing packages into image). Also, 'bionic'
does not build lkrg properly, but mkosi works good. Older Ubuntu seems
to not have systemd, which is a hard mkosi requirement.

Signed-off-by: Vitaly Chikunov <vt@altlinux.org>
@vt-alt
Contributor Author

vt-alt commented Apr 2, 2021

Ping?

@Adam-pi3
Collaborator

Adam-pi3 commented Apr 3, 2021

@solardiz what's your opinion?

@vt-alt
Contributor Author

vt-alt commented Apr 3, 2021

@Adam-pi3 Why rely on @solardiz's opinion? Since you know all the internals, what do you think: would such a full system boot (with systemd) be an effective test or not?

@vt-alt
Contributor Author

vt-alt commented Apr 3, 2021

Maybe I am wrong with the test approach. There have been zero opinions on this PR, so it looked like you are not interested at all.

@solardiz
Contributor

solardiz commented Apr 3, 2021

@vt-alt I'm sorry, I just didn't have enough time for everything needing my attention. Please don't feel discouraged. Your contribution is greatly appreciated and we'll proceed to try it out. For this, I intend to go ahead and merge it now, and then we'll see what's next. Thank you!

@solardiz solardiz merged commit 55d538d into lkrg-org:main Apr 3, 2021
@solardiz
Contributor

solardiz commented Apr 3, 2021

The CI tests added here were triggered right upon merging this PR and have completed in 7 minutes each (for two versions of Ubuntu). Thanks again, @vt-alt!

@Adam-pi3
Collaborator

Adam-pi3 commented Apr 3, 2021

@Adam-pi3 Why rely on @solardiz's opinion? Since you know all the internals, what do you think: would such a full system boot (with systemd) be an effective test or not?

I do agree that the full system boot tests are necessary and important - thanks for working on such a feature! However, I was overwhelmed with urgent work in my daily job for the last 2 weeks and didn't have time to follow up on mkosi, analyze how it works and what the caveats are. Moreover, my knowledge of systemd is weak and it sounds like mkosi interacts with systemd, so I felt more comfortable relying on @solardiz's knowledge here (regarding technical details, not the idea itself).

Maybe I am wrong with the test approach. There have been zero opinions on this PR, so it looked like you are not interested at all.

I do agree that the full system boot tests are necessary and important - thanks for working on such feature!

@solardiz
Contributor

solardiz commented Apr 3, 2021

my knowledge of systemd is weak and it sounds like mkosi interacts with systemd so I felt more comfortable to rely on @solardiz knowledge here (regarding technical details not the idea itself).

Unfortunately, I also lack knowledge on systemd and mkosi. So I think @vt-alt is the primary maintainer of this feature now.

In my past discussions with @vt-alt (via private e-mail), I had encouraged him or others at ALT to contribute something like this (after having heard @wladmis had implemented something "like this" for/in ALT's package of LKRG). Our thinking was that @Adam-pi3 could then proceed to add (regression) tests (for known past issues) to run in the system booted in the temporary VM. So, Adam, we'd appreciate it if you do proceed to add such tests if/when any become relevant. I guess this shouldn't be too hard to do even without full knowledge over how that system is created and booted up.

@Adam-pi3
Collaborator

Adam-pi3 commented Apr 3, 2021

I think the most valuable would be to run a sequence of commands including insmod of the vulnerable driver and exploits against some of the primitives which are available. Compare the LKRG execution logic before the exploit and later after the exploit and detection itself. Is that possible?

@vt-alt
Contributor Author

vt-alt commented Apr 3, 2021

I do agree that the full system boot tests are necessary and important - thanks for working on such feature!

That's great, thanks for your opinion!

In my past discussions ...

I took a different (but more flexible) approach than that small init.c style test. There is no single init.c now, but we can execute any binary or script in the newly booted system, so it turned out not bad.

I think the most valuable would be to run a sequence of commands including insmod of the vulnerable driver and exploits against some of the primitives which are available.

If you show some example module and a test binary/script I could try to add them into CI workflow.

But the mkosi qemu boot cycle is around 1 minute. If there are to be many tests with different attack scenarios, each crashing the system, I could try to work on a faster boot sequence (virtme-style boot of the current system over 9pfs) without systemd, but still with bash scripts. That should be about 5-10 seconds per boot cycle.

Compare the LKRG execution logic before the exploit and later after the exploit and detection itself. Is that possible?

That part I don't understand.

@vt-alt
Contributor Author

vt-alt commented Apr 3, 2021

But the mkosi qemu boot cycle is around 1 minute. If there are to be many tests with different attack scenarios, each crashing the system, I could try to work on a faster boot sequence (virtme-style boot of the current system over 9pfs) without systemd, but still with bash scripts. That should be about 5-10 seconds per boot cycle.

To add. In that case, we can leave mkosi boot test as is just as a full system boot test, which is beneficial in itself. And create an additional workflow for regression tests.

  1. Also, I want to know. Is it interesting to test on the latest mainline kernels or older kernels are better for lkrg? (Ubuntu has daily mainline kernel builds, I could (no promises) try to script downloading the last one for a boot test.) (This could be an additional workflow too.)
  2. What is the most popular distro for LKRG users? Maybe it's better to boot test not (or not just) on Ubuntu?

@wladmis
Contributor

wladmis commented Apr 3, 2021

I think the most valuable would be to run a sequence of commands including insmod of the vulnerable driver and exploits against some of the primitives which are available. Compare the LKRG execution logic before the exploit and later after the exploit and detection itself. Is that possible?

Do I understand you right: you mean to make a special kernel module for test purposes with a vulnerability, and check whether LKRG catches an exploit of it?

@solardiz
Contributor

solardiz commented Apr 4, 2021

In that case, we can leave mkosi boot test as is just as a full system boot test, which is beneficial in itself. And create an additional workflow for regression tests.

We could, but if we disable LKRG causing kernel panic on detection violations (and are OK with not testing that aspect) then we could as well use one workflow for both successful boot testing (make sure there are no unexpected messages from LKRG, say, in the first 10 seconds after its success message) and vulnerable module loading and exploit testing (make sure every exploitation attempt produces the expected detection messages from LKRG).
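
The "no unexpected messages in the first 10 seconds" half of that check could be sketched as below. It assumes that every LKRG message carries the `[p_lkrg]` prefix seen earlier in this thread and that any such message other than the success line counts as unexpected; the helper name `unexpected_lkrg_messages` is hypothetical. The exploit half is not sketched because the expected detection messages are not given here.

```shell
#!/bin/sh
# unexpected_lkrg_messages FILE [WINDOW] (hypothetical helper): print
# every [p_lkrg] line logged within WINDOW seconds (default 10) after
# the LKRG success message. Empty output means the boot looks clean.
unexpected_lkrg_messages() {
    awk -v window="${2:-10}" '
        # Kernel log lines are assumed to start with "[ seconds.micros]".
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2) + 0
            if (index($0, "LKRG initialized successfully!")) {
                init = ts
                seen = 1
                next
            }
            if (seen && ts - init <= window && index($0, "[p_lkrg]"))
                print
        }
    ' "$1"
}
```

A workflow step could fail when the output is non-empty, for example with `[ -z "$(unexpected_lkrg_messages boot.log)" ]`.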

Is it interesting to test on the latest mainline kernels or older kernels are better for lkrg?

It is relevant to test any kernel versions we support, so in your question it would be both latest mainline and older distro kernels. I don't know which are "better" to include in the testing - we'd want to detect and look into any issues with either kind.

However, if some breakage is caused by a change in latest mainline rather than in LKRG, it could take us a longer while to address it, and it would be unfortunate to have sort of non-working CI during this period (that would keep reporting the would-be-already-known issue with mainline and making it harder for us to see if there's also some LKRG regression). So perhaps we should have an easy way to temporarily disable and re-enable testing with latest mainline kernels?

What is the most popular distro for LKRG users?

We don't know. We generally only hear from a user when there's a problem, and we've heard about all sorts of distros. Ubuntu is a popular one, but other popular choices seem to be CentOS, Arch, Debian/Whonix, and perhaps ALT?

Maybe it's better to boot test not (or not just) on Ubuntu?

I think we should keep Ubuntu, but also add perhaps CentOS 7 and 8. Instead of CentOS 8, it can be (and will eventually have to be) AlmaLinux or (later) RockyLinux. Can also have Fedora or CentOS Stream - would probably let us catch issues (typically back-ports that LKRG needs to adapt to) before they appear in a RHEL/CentOS/AlmaLinux/RockyLinux release.

@solardiz
Contributor

solardiz commented Apr 4, 2021

you mean to make a special kernel module for test purposes with a vulnerability, and check whether LKRG catches an exploit of it?

Yes, that's what Adam meant, and he already has that intentionally vulnerable kernel module - but perhaps it would need some cleanups and maybe some kind of safety feature (against inadvertent misuse) before being committed to somewhere in this repo.

@vt-alt
Contributor Author

vt-alt commented Apr 4, 2021

@solardiz Thanks for your answer. Here are a few thoughts on this:

However, if some breakage is caused by a change in latest mainline rather than in LKRG, it could take us a longer while to address it, and it would be unfortunate to have sort of non-working CI during this period (that would keep reporting the would-be-already-known issue with mainline and making it harder for us to see if there's also some LKRG regression). So perhaps we should have an easy way to temporarily disable and re-enable testing with latest mainline kernels?

I think tests do not need to be perfect and all green all the time. When looking at why tests reported a non-green status, we (well, you) can ignore tests that are temporarily not producing correct results, or know that the 'latest kernel' test isn't relevant today and doesn't even need to be looked at in detail.

GA has an 'experimental' job flag, but I don't know yet how such jobs are reported (whether they affect the color of the workflow or not).

@Adam-pi3
Collaborator

Adam-pi3 commented Apr 5, 2021

you mean to make a special kernel module for test purposes with a vulnerability, and check whether LKRG catches an exploit of it?

Yes, that's what Adam meant, and he already has that intentionally vulnerable kernel module - but perhaps it would need some cleanups and maybe some kind of safety feature (against inadvertent misuse) before being committed to somewhere in this repo.

Correct. We need to clean it up before publishing. From my understanding, similar functionality can be achieved with out-of-tree (https://github.com/jollheef/out-of-tree) by @jollheef. I'm using it for my local tests (booting Linux with LKRG) but didn't add integration with the vulnerable driver.

@Adam-pi3
Collaborator

Adam-pi3 commented Apr 5, 2021

Is it interesting to test on the latest mainline kernels or older kernels are better for lkrg?

It is relevant to test any kernel versions we support, so in your question it would be both latest mainline and older distro kernels. I don't know which are "better" to include in the testing - we'd want to detect and look into any issues with either kind.

Agreed. We should have both tests including latest mainline. @vt-alt I assume you are thinking about integration of https://kernel.ubuntu.com/~kernel-ppa/mainline/ compilation, correct?
Such tests could give us an early-warning about the "breakable" changes.

@vt-alt
Contributor Author

vt-alt commented Apr 5, 2021

correct?

Yes. I am already (I think) done with the download script, now thinking about better integration with mkosi and GA.

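# Find the newest Ubuntu mainline kernel build that has amd64 packages
# and download its .deb files into the current directory.
baseurl="https://kernel.ubuntu.com/~kernel-ppa"
# Directory index sort order: by name (C=N), descending (O=D).
sortmode="C=N;O=D"
for listurl in \
        "$baseurl/mainline/daily/" \
        "$baseurl/mainline/"
do
        echo >&2 "List $listurl"
        # Extract subdirectory names (versions or dates) from the hrefs.
        curl -s "$listurl?$sortmode" \
        | grep -Eio 'href="[^"]+"'   \
        | grep -Eo '"[^"]+"'         \
        | grep -Po '[v2][rc\.\d-]+'  \
        | while read -r subdir; do
                url="$listurl$subdir/amd64/"
                echo >&2 "Trying $url"
                # --fail: non-zero exit on HTTP errors, so a missing
                # amd64 directory moves on to the next candidate.
                if page=$(curl -s --fail "$url"); then
                        echo "$page" \
                        | grep -Eio 'href="[^"]+"' \
                        | grep -o "linux.*deb"     \
                        | grep -v "lowlatency"     \
                        | while read -r deb; do
                                echo >&2 "Download $url$deb"
                                # -O: keep the remote name; -C-: resume.
                                curl -O -C- "$url$deb"
                        done
                        # Signal success to upper shell.
                        exit 22
                fi
        done
        # Exit if subshell succeeded.
        [ $? -eq 22 ] && exit
done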
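
The href-extraction stage of the script can be tried offline on a sample listing. In the sketch below the helper name `list_subdirs` and the sample HTML are hypothetical, and the last grep uses `-E` with `[0-9]` in place of the script's `-P` with `\d` so that it does not depend on PCRE support.

```shell
#!/bin/sh
# list_subdirs (hypothetical helper): read an HTML directory index on
# stdin and print the subdirectory names that look like kernel
# versions (v5.12-rc5) or dates (2021-04-05), in listing order.
list_subdirs() {
    grep -Eio 'href="[^"]+"' \
    | grep -Eo '"[^"]+"'     \
    | grep -Eo '[v2][rc.0-9-]+'
}

# Demo with a made-up index page; sort links and the parent link
# contain no matching name and are dropped.
cat <<'EOF' | list_subdirs
<a href="?C=N;O=D">Name</a>
<a href="/~kernel-ppa/">Parent Directory</a>
<a href="v5.12-rc5/">v5.12-rc5/</a>
<a href="v5.11.11/">v5.11.11/</a>
EOF
```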

4 participants