retina-agent pod initialization failed to reconcile plugin dropreason: field NfConntrackConfirm: program nf_conntrack_confirm #246

Closed
einnse opened this issue Apr 9, 2024 · 12 comments
Assignees
Labels
area/ebpf priority/0 P0 type/bug Something isn't working type/question Further information is requested

einnse commented Apr 9, 2024

Describe the bug
Installation command: helm-install-with-operator

The retina-agent pod status is as follows:

kubectl get pods -n kube-system |grep retina-agent
retina-agent-7q7ls                           0/1     CrashLoopBackOff   72 (2m59s ago)   5h49m
retina-agent-9m272                           0/1     CrashLoopBackOff   72 (2m58s ago)   5h49m
retina-agent-nd2qg                           0/1     CrashLoopBackOff   72 (95s ago)     5h49m
retina-agent-wg44m                           0/1     CrashLoopBackOff   72 (2m5s ago)    5h49m

The container error logs are:

retina ts=2024-04-09T03:47:28.299Z level=debug caller=loader/compile.go:22 msg=Running goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser command="/bin/clang -target bpf -Wall -D__TARGET_ARCH_x86 -g -O2 -c /go/src/github.com/microsoft/retina/pkg/plugin/dropreason/_cprog/drop_reason.c -o /go/src/github.com/microsoft/retina/pkg/plugin/dropreason/kprobe_bpf.o -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../lib/_amd64 -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../lib/common/libbpf/_src -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../filter/_cprog/"
retina ts=2024-04-09T03:47:29.030Z level=debug caller=loader/compile.go:29 msg="Output running command" goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser command="/bin/clang -target bpf -Wall -D__TARGET_ARCH_x86 -g -O2 -c /go/src/github.com/microsoft/retina/pkg/plugin/dropreason/_cprog/drop_reason.c -o /go/src/github.com/microsoft/retina/pkg/plugin/dropreason/kprobe_bpf.o -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../lib/_amd64 -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../lib/common/libbpf/_src -I/go/src/github.com/microsoft/retina/pkg/plugin/dropreason/../filter/_cprog/" stdout=
retina ts=2024-04-09T03:47:29.030Z level=info caller=dropreason/dropreason_linux.go:120 msg="DropReason metric compiled" goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
retina ts=2024-04-09T03:47:29.333Z level=error caller=dropreason/dropreason_linux.go:155 msg="Error loading objects: %w" goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="field NfConntrackConfirm: program nf_conntrack_confirm: apply CO-RE relocations: load kernel module spec: open /sys/kernel/btf/nf_conntrack: no such file or directory"
retina ts=2024-04-09T03:47:29.333Z level=info caller=server/server.go:79 msg="gracefully shutting down HTTP server..." goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
retina ts=2024-04-09T03:47:29.333Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
retina ts=2024-04-09T03:47:29.333Z level=panic caller=controllermanager/controllermanager.go:119 msg="Error running controller manager" goversion=go1.21.9 os=linux arch=amd64 numcores=8 hostname=cce-bpf-master podname=retina-agent-wg44m version=ef779b6 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to reconcile plugin dropreason: field NfConntrackConfirm: program nf_conntrack_confirm: apply CO-RE relocations: load kernel module spec: open /sys/kernel/btf/nf_conntrack: no such file or directory" errorVerbose="field NfConntrackConfirm: program nf_conntrack_confirm: apply CO-RE relocations: load kernel module spec: open /sys/kernel/btf/nf_conntrack: no such file or directory\nfailed to reconcile plugin dropreason\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:169\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:109\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
retina panic: Error running controller manager

Expected behavior
retina-agent pod status is normal.

OS: CentOS Linux 8.2 (Core)
Kernel Version: 5.4.273-1.el8.elrepo.x86_64
Kubernetes Version: v1.26.12
Host: local Kubernetes
Retina Version: v0.0.5
Image Tag: ef779b6-linux-amd64

Additional context
Contents of the host /sys/kernel/btf/ directory:

tree /sys/kernel/btf/
/sys/kernel/btf/
└── vmlinux

0 directories, 1 file

The reported error suggests that a necessary dependency is missing. Which dependencies need to be installed? I don't know much about BPF and BTF, so I would appreciate help solving this problem.

@rbtr
Collaborator

rbtr commented Apr 9, 2024

It's possible that kernel 5.4 does not have the features we need. @vakalapa do we have any idea what our kernel backcompat is?

rbtr added the type/bug, type/question, area/ebpf, and priority/1 labels on Apr 9, 2024
@neaggarwMS
Contributor

Shouldn't we check the kernel version at Retina init, and update the docs too?
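
A minimal sketch of what such an init-time check could look like, assuming the agent can obtain the kernel release string (for example from `uname -r`). The helper names `parseKernelRelease` and `atLeast` are hypothetical, and the 5.8 minimum is an arbitrary placeholder, not a documented Retina requirement. The two release strings are the ones reported in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelRelease extracts the major and minor numbers from a kernel
// release string such as "5.4.273-1.el8.elrepo.x86_64".
// Hypothetical helper; not part of Retina.
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err = strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("parse major in %q: %w", release, err)
	}
	// The minor field may carry a suffix (for example "10-rc1"),
	// so keep only its leading digits.
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	minor, err = strconv.Atoi(digits)
	if err != nil {
		return 0, 0, fmt.Errorf("parse minor in %q: %w", release, err)
	}
	return major, minor, nil
}

// atLeast reports whether major.minor is at or above wantMajor.wantMinor.
func atLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	// Release strings reported in this thread.
	releases := []string{"5.4.273-1.el8.elrepo.x86_64", "5.10.87-051087-generic"}
	// Placeholder minimum for illustration only (assumption).
	const minMajor, minMinor = 5, 8
	for _, r := range releases {
		major, minor, err := parseKernelRelease(r)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %d.%d, meets %d.%d: %v\n",
			r, major, minor, minMajor, minMinor,
			atLeast(major, minor, minMajor, minMinor))
	}
}
```

A version check alone would not have caught this particular failure, since what was missing was a per-module BTF file rather than a kernel feature tied to a single version number.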

@wenhuwang
Contributor

I recently encountered the same problem: v0.0.2 ran normally, but v0.0.5 fails with this error.
I looked at the source code and found no relevant code changes. Does anyone know what caused it?

@rbtr
Collaborator

rbtr commented Apr 10, 2024

@wenhuwang which distro and kernel version?

rbtr added the priority/0 label and removed the priority/1 label on Apr 10, 2024
@wenhuwang
Contributor

wenhuwang commented Apr 10, 2024

@wenhuwang which distro and kernel version?

@rbtr Env
OS: Ubuntu 18.04.5 LTS
Kernel Version: 5.10.87-051087-generic
Kubernetes Version: 1.22.2

@wenhuwang
Contributor

wenhuwang commented Apr 19, 2024

After a long period of troubleshooting, I found that this issue was caused by the cilium/ebpf package upgrade.

The verification steps are as follows:
The same problem occurs when I build the image from the main branch and run it. When I downgrade the cilium/ebpf package to v0.13.2, the built image runs normally.

The related change in the cilium/ebpf package is PR #1300.

@rbtr
Collaborator

rbtr commented Apr 22, 2024

@wenhuwang I don't understand why this change, which added CO-RE (and is supposed to improve kernel compatibility), would cause this issue. I wonder if it may be fixed by the changes in the latest cilium/ebpf.

rbtr unassigned rbtr and vakalapa on Apr 22, 2024
@wenhuwang
Contributor

wenhuwang commented Apr 23, 2024

@rbtr My guess is that this commit introduced the change, and that the error originates in the loadKernelModuleSpec function. The commit determines the kernel module from the eBPF program type and attach point, then looks up the BTF file for that kernel module. However, some older kernels expose only the vmlinux file.

The eBPF program in the dropreason plugin attaches to a function in the nf_conntrack kernel module, but there is no nf_conntrack file in the /sys/kernel/btf directory on my node.
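
To illustrate the lookup described above, here is a minimal sketch of one way a loader could tolerate a missing per-module BTF file by falling back to vmlinux. This is not the actual cilium/ebpf code or fix; `resolveBTFPath` is a hypothetical helper, and falling back to vmlinux is an assumption that only helps when the needed types are available there.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// resolveBTFPath returns the BTF file to use for a kernel module.
// It prefers btfDir/<module> and falls back to btfDir/vmlinux when the
// module-specific file does not exist.
// Hypothetical helper for illustration; not the cilium/ebpf implementation.
func resolveBTFPath(btfDir, module string) (string, error) {
	modulePath := filepath.Join(btfDir, module)
	_, err := os.Stat(modulePath)
	if err == nil {
		return modulePath, nil
	}
	if !errors.Is(err, fs.ErrNotExist) {
		// Permission problems and similar errors are reported, not hidden.
		return "", fmt.Errorf("stat %s: %w", modulePath, err)
	}
	// No per-module BTF: fall back to the base kernel BTF (assumption).
	vmlinuxPath := filepath.Join(btfDir, "vmlinux")
	if _, err := os.Stat(vmlinuxPath); err != nil {
		return "", fmt.Errorf("no BTF available in %s: %w", btfDir, err)
	}
	return vmlinuxPath, nil
}

func main() {
	path, err := resolveBTFPath("/sys/kernel/btf", "nf_conntrack")
	if err != nil {
		fmt.Println("BTF lookup failed:", err)
		return
	}
	fmt.Println("using BTF from", path)
}
```

On a node like the one in this report, where /sys/kernel/btf contains only vmlinux, the sketch would select the vmlinux path instead of failing with "no such file or directory".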

@wenhuwang
Contributor

wenhuwang commented Apr 23, 2024

This commit seems to fix the issues

@rbtr
Collaborator

rbtr commented Apr 23, 2024

This commit seems to fix the issues

Great, I'm queueing #300 so that we have that fix in our next release. Thanks for investigating this issue!

@einnse
Author

einnse commented Apr 30, 2024

This commit seems to fix the issues

Great, I'm queueing #300 so that we have that fix in our next release. Thanks for investigating this issue!

Thank you to all Retina developers. Retina version 0.0.9 is currently running normally!

retina ts=2024-04-30T06:16:00.902Z level=debug caller=linuxutil/ethtool_stats_linux.go:81 msg="Processed ethtool Stats " goversion=go1.22.2 os=linux arch=amd64 numcores=4 hostname=cce-bpf-slave3 podname=retina-agent-72gg6 version=v0.0.9 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ifacename=enp4s3
retina ts=2024-04-30T06:16:00.902Z level=error caller=linuxutil/ethtool_stats_linux.go:73 msg="Error while getting ethtool:" goversion=go1.22.2 os=linux arch=amd64 numcores=4 hostname=cce-bpf-slave3 podname=retina-agent-72gg6 version=v0.0.9 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ifacename=kube-ipvs0 error="operation not supported"
retina ts=2024-04-30T06:16:00.902Z level=error caller=linuxutil/ethtool_stats_linux.go:73 msg="Error while getting ethtool:" goversion=go1.22.2 os=linux arch=amd64 numcores=4 hostname=cce-bpf-slave3 podname=retina-agent-72gg6 version=v0.0.9 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ifacename=tunl0 error="operation not supported"

Another question: can the "operation not supported" errors for these interfaces be safely ignored?

@rbtr
Collaborator

rbtr commented Apr 30, 2024

Thanks @einnse for letting us know this is fixed 🙂

That error can probably be ignored. #296 is open re customizing interfaces to skip in ethtool

rbtr closed this as completed on Apr 30, 2024