Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cilium fix bpffs check #8599

Merged
merged 4 commits into from
Feb 22, 2020

Conversation

olemarkus
Copy link
Member

The check introduced in #7832 is unfortunately too naïve to detect if bpffs has been mounted leading to the unit file never being installed. This is quite disastrous for cilium.

This check in this PR should be more robust.

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @olemarkus. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 20, 2020
@hakman
Copy link
Member

hakman commented Feb 20, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2020
@rifelpet
Copy link
Member

@olemarkus I see the original PR was backported all the way to 1.15. Should this be considered a blocker for 1.16.0 stable?

@olemarkus
Copy link
Member Author

I don't understand why, but the 1.16 and 1.15 branch is not affected. The /sys/fs/bpf path doesn't exist at the time of the check on those branches even if I use the same ami. Another change added to kops after 1.16 branch was cut must be causing this.

@hakman
Copy link
Member

hakman commented Feb 21, 2020

Just a quick update, the Cilium periodic e2e is working again:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-cni-cilium/1230702775738830848

@olemarkus
Copy link
Member Author

@rifelpet unfortunately, this does indeed break cilium back to 1.15.2. It should probably block 1.16.

@hakman
Copy link
Member

hakman commented Feb 21, 2020

@olemarkus the sorting in vendor/modules.txt seems off can you check that you are using go v1.13.x?
https://go-review.googlesource.com/c/go/+/174527/

@olemarkus
Copy link
Member Author

I am using go1.13.4 and just running make gomod. When I run it again, there are no changes to the vendor/modules.txt file.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 22, 2020
@olemarkus
Copy link
Member Author

For some reason make gomod used another binary than expected. Things looks a lot more sane now.

@hakman
Copy link
Member

hakman commented Feb 22, 2020

Thanks for the update @olemarkus. I was just about to start checking why it worked differently on my end.

@hakman
Copy link
Member

hakman commented Feb 22, 2020

Me and @rifelpet did some research yesterday on this topic and will add some questions.

Comment on lines -75 to -79
if os.IsNotExist(err) {
alreadyMounted = false
} else {
return fmt.Errorf("error checking for /sys/fs/bpf: %v", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still valid. If there is an unexpected error when reading from /sys/fs/bp, the systemd mount would not help, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably still valid. Most likely this would be due to the path being absent and if the unit tries to mount it will just fail. Result will be quite similar and I suppose equally easy to detect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

} else {
alreadyMounted = true
alreadyMounted = int32(magic) == int32(fsdata.Type)
Copy link
Member

@hakman hakman Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the fs type is not BPF? Would it be better to return with an error also?

The Cilium code describes the logic for mounting the BPF fs and there is also a mention of a fallback location /run/cilium/bpffs:
https://github.com/cilium/cilium/blob/cbd63be46569751ac45060e54f6787f1286b75cd/pkg/bpf/bpffs_linux.go#L213-L224

// checkOrMountDefaultLocations tries to check or mount the BPF filesystem in
// standard locations, which are:
// - /sys/fs/bpf
// - /run/cilium/bpffs
// There is a procedure of determining which directory is going to be used:
// 1. Checking whether BPFFS filesystem is mounted in /sys/fs/bpf.
// 2. If there is no mount, then mount BPFFS in /sys/fs/bpf and finish there.
// 3. If there is a BPFFS mount, finish there.
// 4. If there is a mount, but with the other filesystem, then it means that most
// probably Cilium is running inside container which has mounted /sys/fs/bpf
// from host, but host doesn't have proper BPFFS mount, so that mount is just
// the empty directory. In that case, mount BPFFS under /run/cilium/bpffs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BPF is not mounted on /sys/fs/bpf, the path will still exist, which is why the old check fails. So most likly the fs type will not be bpf before we install the unit.

/run/cilium/bpffs is a fallback if the host is not mounting bpf. This will cause network disruptions every time cilium pods restart though since the bpf mounts are lost.

Copy link
Member

@hakman hakman Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, more clear now. Thanks for clarifying the fallback part.

Regarding the check, if the path exists, unix.Statfs would succeed and will return SYSFS_MAGIC=0x62656572 or something similar. If there is an error other than the "no such file or directory", maybe it's not a good thing to continue. Same goes for a magic number that is not SYSFS_MAGIC or BPF_FS_MAGIC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you be able to mount bpffs on whatever path that exists?

I will add back the return on error shortly. unix.Statfs("/sys/fs/bpf", &fsdata) would return error on file not found too, shouldn't it? So the os.Stat check would be redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked and you are right, if the mount path doesn't exist, the unit won't help. Probably for any type of error BPF mounting will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is mostly ok. There may be a rare case where the magic number is different from SYSFS_MAGIC=0x62656572 or BPF_FS_MAGIC= 0xCAFE4A11 and mounting will fail. I don't think it's worth adding it for now.

} else {
return fmt.Errorf("error checking for /sys/fs/bpf: %v", err)
}
return fmt.Errorf("error checking for /sys/fs/bpf: %v", err)
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need this else statement anymore.

@@ -70,15 +70,15 @@ func (b *NetworkBuilder) Build(c *fi.ModelBuilderContext) error {
if networking.Cilium != nil {
// systemd v238 includes the bpffs mount by default; and gives an error "has a bad unit file setting" if we try to mount it again (see mount_point_is_api)
var alreadyMounted bool
_, err := os.Stat("/sys/fs/bpf")
// bpffs magic number
magic := uint32(0xCAFE4A11)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth adding this link as a comment: https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/magic.h#L80

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would be nice to move the var and comment just before it's used.

@rifelpet
Copy link
Member

/lgtm
/approve

We increased the frequency of the cilium periodic job to run hourly, we'll keep an eye on it to confirm things are still stable before we merge cherry-picks of this 👍

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@justinsb
Copy link
Member

I am a little confused ... https://k8s-testgrid.appspot.com/google-aws#kops-aws-cni-cilium seems to suggest this broke two days ago, but my original (insufficient) PR is from a month or two ago. Do we know what caused it to break? (Was it a newer version of cilium that makes the directory?)

@olemarkus
Copy link
Member Author

Just bad timing. The e2e tests broke because of the typo fixed here #8591
The bpf mount bug this PR fixed is unrelated to the e2e tests.

@justinsb
Copy link
Member

Ah - thanks @olemarkus (and thanks also for fixing!)

k8s-ci-robot added a commit that referenced this pull request Feb 22, 2020
…9-origin-release-1.15

Automated cherry pick of #8599: Properly detect that bpffs has been mounted
k8s-ci-robot added a commit that referenced this pull request Feb 22, 2020
…9-origin-release-1.17

Automated cherry pick of #8599: Properly detect that bpffs has been mounted
k8s-ci-robot added a commit that referenced this pull request Feb 22, 2020
…9-origin-release-1.16

Automated cherry pick of #8599: Properly detect that bpffs has been mounted
@mazzy89
Copy link
Contributor

mazzy89 commented Feb 23, 2020

@olemarkus I'm running k8s 1.15.9 with cilium from kops 1.15.2 but I'm not having this issue.Is it clear for you why this version is not affected?

@olemarkus
Copy link
Member Author

It is affected too. If you see the cilium agent logs, you should see a warning about bpf not already mounted on the hair

@mazzy89
Copy link
Contributor

mazzy89 commented Feb 23, 2020

I see. I'll take a look at that. Which are the effects of this? I did not notice anything weird in the cluster. Everything is operative and healthy in term of networking.

@olemarkus olemarkus deleted the cilium-fix-bpffs-check branch March 25, 2020 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. blocks-next cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants