retina-agent pod initialization failed #153

Closed
wenhuwang opened this issue Mar 27, 2024 · 8 comments · Fixed by #165
Labels: area/plugins, priority/0 P0, type/bug
Milestone: v0.0.3

@wenhuwang (Contributor) commented Mar 27, 2024

Describe the bug

Installation command: make helm-install-with-operator

The retina-agent pod status is as follows:

# k -n kube-system get pods retina-agent-5lwhj
NAME                 READY   STATUS                  RESTARTS        AGE
retina-agent-5lwhj   0/1     Init:CrashLoopBackOff   6 (3m37s ago)   11m

The init-retina container's logs are:

# k -n kube-system logs retina-agent-5lwhj init-retina
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T01:19:37.004Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T01:19:37.005Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
	panic: Failed to initialize filter map

goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
	/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000219130?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc00013bb60?})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00024e0d0, {0xc00023a9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00023a9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000593ec8)
	/go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
	/go/src/github.com/microsoft/retina/init/retina/main_linux.go:33 +0x214

Expected behavior
retina-agent pod status is normal.

Platform (please complete the following information):

  • OS: CentOS Linux 7 (Core)
  • Kernel Version: 5.4.207-1.el7.elrepo.x86_64
  • Kubernetes Version: 1.22.2
  • Host: local Kubernetes
  • Retina Version: v0.0.1
@wenhuwang (Contributor, Author)

I guess we need to allow the current process to lock memory for eBPF resources.
If there is no problem with this solution, I can take care of this issue; a rough sketch of the idea is below.
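
For illustration only, a minimal, self-contained sketch of that idea (not a patch against Retina; it assumes golang.org/x/sys/unix and a privileged or CAP_SYS_RESOURCE context, which the init container already has):

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// raiseMemlockLimit lifts RLIMIT_MEMLOCK for the current process so that
// eBPF map creation is not rejected with "operation not permitted".
func raiseMemlockLimit() error {
	limit := unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}
	return unix.Setrlimit(unix.RLIMIT_MEMLOCK, &limit)
}

func main() {
	if err := raiseMemlockLimit(); err != nil {
		log.Fatalf("failed to raise RLIMIT_MEMLOCK: %v", err)
	}
	log.Println("RLIMIT_MEMLOCK raised; eBPF maps can now be created")
}

The error message itself also points at rlimit.RemoveMemlock from github.com/cilium/ebpf, which does essentially the same thing when it detects that the limit actually needs lifting.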

@rbtr added the type/bug, area/plugins, and priority/0 P0 labels on Mar 27, 2024
@rbtr added this to the v0.0.3 milestone on Mar 27, 2024
@vakalapa (Contributor)

@wenhuwang is it possible to paste the DaemonSet you applied as YAML here? I want to see the permissions applied on the init container.

@parkjeongryul (Contributor)

Same here.
I installed in basic mode with make helm-install.

$ k logs retina-agent-7n7xc -n kube-system -c init-retina

ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T16:04:56.841Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T16:04:56.841Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
	panic: Failed to initialize filter map

goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
	/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000231180?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb80?})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000252340, {0xc00024c9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00024c9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000291ec8)
	/go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
$ k get ds -n kube-system retina-agent -o yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-03-27T16:00:51Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  name: retina-agent
  namespace: kube-system
  resourceVersion: "8080926"
  uid: ee956376-f701-4d2f-bfc2-0055c5c48a0b
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      annotations:
        checksum/config: 1aa5dfa2b1c3bc86cd80d7e983d27ffc4668458df1a51541f906e4827abc2e62
        prometheus.io/port: "10093"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
    spec:
      containers:
      - args:
        - --health-probe-bind-address=:18081
        - --metrics-bind-address=:18080
        - --config
        - /retina/config/config.yaml
        command:
        - /retina/controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: NODE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /metrics
            port: 10093
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: retina
        ports:
        - containerPort: 10093
          hostPort: 10093
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_RESOURCE
            - NET_ADMIN
            - IPC_LOCK
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /sys/fs/bpf
          name: bpf
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /retina/config
          name: config
        - mountPath: /sys/kernel/debug
          name: debug
        - mountPath: /tmp
          name: tmp
        - mountPath: /sys/kernel/tracing
          name: trace
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - image: ghcr.io/microsoft/retina/retina-init:v0.0.1
        imagePullPolicy: Always
        name: init-retina
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /sys/fs/bpf
          mountPropagation: Bidirectional
          name: bpf
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: retina-agent
      serviceAccountName: retina-agent
      terminationGracePeriodSeconds: 90
      volumes:
      - hostPath:
          path: /sys/fs/bpf
          type: ""
        name: bpf
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - configMap:
          defaultMode: 420
          name: retina-config
        name: config
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug
      - emptyDir: {}
        name: tmp
      - hostPath:
          path: /sys/kernel/tracing
          type: ""
        name: trace
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 10
  desiredNumberScheduled: 10
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 10
  observedGeneration: 1
  updatedNumberScheduled: 10

Platform (please complete the following information):

  • OS: Ubuntu 20.04.3 LTS
  • Kernel Version: 5.4.0-99-generic
  • Kubernetes Version: v1.27.9
  • Retina Version: v0.0.1
  • Host: Managed Kubernetes

@jimassa (Contributor) commented Mar 27, 2024

This bug is caused by the same issue as #115.
It is happening on arm64 nodes and is caused by the error fork/exec /bin/clang: no such file or directory when trying to reconcile the dropreason plugin.

@rbtr (Collaborator) commented Mar 27, 2024

This bug is caused by the same issue as #115. It is happening on arm64 nodes and is caused by the error fork/exec /bin/clang: no such file or directory when trying to reconcile the dropreason plugin.

The clang issue was fixed in #133 before v0.0.2. Is the solution here just to upgrade to v0.0.2?

@parkjeongryul (Contributor)

Architecture of our node is amd64.

$ k get nodes jrpark-w-4hb7 -o yaml | grep architecture
    architecture: amd64

The clang issue was fixed in #133 before v0.0.2. Is the solution here just to upgrade to v0.0.2?

I just tried upgrading to v0.0.2 and it didn't fix the issue.

$ k get ds retina-agent -n kube-system -o yaml | grep image:
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.2
      - image: ghcr.io/microsoft/retina/retina-init:v0.0.2
$ k logs retina-agent-2jnf6 -n kube-system -c init-retina

ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf
ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T23:47:36.411Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T23:47:36.411Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map

goroutine 1 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb60?})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000264000, {0xc00024c9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6ba00?, {0xc69630?, 0xd6b920?}, {0xc00024c9c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc0001e9ec0)
	/go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
	/go/src/github.com/microsoft/retina/init/retina/main_linux.go:40 +0x24a

@wenhuwang (Contributor, Author) commented Mar 28, 2024

@wenhuwang is it possible to paste the DaemonSet you applied as YAML here? I want to see the permissions applied on the init container.

@vakalapa here is the retina-agent DaemonSet YAML:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: retina-agent
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  annotations:
    deprecated.daemonset.template.generation: '20'
    field.cattle.io/publicEndpoints: 'null'
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
spec:
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
      annotations:
        checksum/config: 48f843d88ced90f531a61ed0ee4f1e0f9bf256a47ac281655788542bf0f520fb
        kubesphere.io/restartedAt: '2024-03-27T09:42:56.531Z'
        prometheus.io/port: '10093'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
            type: ''
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
            type: ''
        - name: config
          configMap:
            name: retina-config
            defaultMode: 420
        - name: debug
          hostPath:
            path: /sys/kernel/debug
            type: ''
        - name: tmp
          emptyDir: {}
        - name: trace
          hostPath:
            path: /sys/kernel/tracing
            type: ''
      initContainers:
        - name: init-retina
          image: '*****/retina-init:v0.0.1'
          resources: {}
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          imagePullPolicy: Always
          securityContext:
            privileged: true
      containers:
        - name: retina
          image: '***/retina-agent:v0.0.1'
          command:
            - /retina/controller
          args:
            - '--health-probe-bind-address=:18081'
            - '--metrics-bind-address=:18080'
            - '--config'
            - /retina/config/config.yaml
          ports:
            - hostPort: 10093
              containerPort: 10093
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
          resources:
            limits:
              cpu: 500m
              memory: 300Mi
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: cgroup
              mountPath: /sys/fs/cgroup
            - name: config
              mountPath: /retina/config
            - name: debug
              mountPath: /sys/kernel/debug
            - name: tmp
              mountPath: /tmp
            - name: trace
              mountPath: /sys/kernel/tracing
          livenessProbe:
            httpGet:
              path: /metrics
              port: 10093
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
                - SYS_RESOURCE
                - NET_ADMIN
                - IPC_LOCK
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 90
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: retina-agent
      serviceAccount: retina-agent
      hostNetwork: true
      securityContext: {}

I have solved this issue by removing the locked-memory limit for eBPF resources.
Add the following code at https://github.com/microsoft/retina/blob/main/pkg/plugin/filter/filter_map_linux.go#L46:

	// Lift RLIMIT_MEMLOCK so the eBPF filter map can be created on kernels
	// where BPF memory is charged against the locked-memory limit.
	if err := rlimit.RemoveMemlock(); err != nil {
		f.l.Error("remove memlock failed", zap.Error(err))
		return f, err
	}
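
(If I am reading the github.com/cilium/ebpf/rlimit docs correctly, RemoveMemlock only lifts the limit when it is actually required; on 5.11+ kernels, where BPF memory is accounted through cgroups rather than RLIMIT_MEMLOCK, it is effectively a no-op, so calling it unconditionally here should be safe.)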

Could you please assign this issue to me?

@snguyen64 (Contributor)

We would most likely see this error on kernel versions < 5.11 and should be able to reproduce it there.
https://pkg.go.dev/github.com/cilium/ebpf/rlimit
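
As a quick sanity check, a self-contained sketch along these lines (assuming only github.com/cilium/ebpf and github.com/cilium/ebpf/rlimit; the map name retina_test_map is just a placeholder) should fail with the same "operation not permitted" error on a < 5.11 kernel if the RemoveMemlock call is dropped, and succeed with it in place:

package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// On kernels older than 5.11, BPF map memory is charged against
	// RLIMIT_MEMLOCK; without this call, map creation fails with EPERM
	// when the default limit is low.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatalf("remove memlock: %v", err)
	}

	m, err := ebpf.NewMap(&ebpf.MapSpec{
		Name:       "retina_test_map",
		Type:       ebpf.Hash,
		KeySize:    4,
		ValueSize:  4,
		MaxEntries: 1024,
	})
	if err != nil {
		log.Fatalf("map create: %v", err)
	}
	defer m.Close()
	log.Println("map created successfully")
}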

github-merge-queue bot pushed a commit that referenced this issue Mar 31, 2024
# Description

fixes: #153 

## Testing Done

1. Built amd64 image and deployed them in a K8s cluster
2. agent pod is up and running

Signed-off-by: wenhuwang <wang15691700816@gmail.com>
@rbtr closed this as completed in #165 on Mar 31, 2024
hainenber pushed a commit to hainenber/retina that referenced this issue Apr 1, 2024
# Description

fixes: microsoft#153

## Testing Done

1. Built amd64 image and deployed them in a K8s cluster
2. agent pod is up and running

Signed-off-by: wenhuwang <wang15691700816@gmail.com>