
Configure COS to use NPD in daemonset mode and align kubeup NPD manifests with the manifests in the NPD repo #121007

Merged: 1 commit into kubernetes:master on Oct 23, 2023

Conversation

upodroid (Member) commented Oct 5, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

NodeProblemDetector tests are currently failing on kops clusters because this test tries to SSH to the API server IP.

Host local exec was introduced some time ago to address this problem.

Which issue(s) this PR fixes:

Part of #120989

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Part of kubernetes/enhancements#4224

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 5, 2023
@k8s-ci-robot k8s-ci-robot added area/test do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 5, 2023
upodroid (Member Author) commented Oct 5, 2023

/test npd

k8s-ci-robot (Contributor):

@upodroid: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cadvisor-e2e-kubernetes
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-cos
  • /test pull-kubernetes-e2e-gce-cos-canary
  • /test pull-kubernetes-e2e-gce-cos-no-stage
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-scale-performance-manual
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-ci-kubernetes-unit-windows
  • /test pull-crio-cgroupv1-node-e2e-eviction
  • /test pull-crio-cgroupv1-node-e2e-features
  • /test pull-crio-cgroupv1-node-e2e-hugepages
  • /test pull-crio-cgroupv1-node-e2e-resource-managers
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
  • /test pull-kubernetes-crio-node-memoryqos-cgrpv2
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-autoscaling-hpa-cm
  • /test pull-kubernetes-e2e-autoscaling-hpa-cpu
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-windows-alpha-feature-vpa
  • /test pull-kubernetes-e2e-capz-windows-alpha-features
  • /test pull-kubernetes-e2e-capz-windows-master
  • /test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-ec2
  • /test pull-kubernetes-e2e-ec2-conformance
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-cos-alpha-features
  • /test pull-kubernetes-e2e-gce-cos-kubetest2
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-kubelet-credential-provider
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-serial
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • /test pull-kubernetes-e2e-kind-alpha-features
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-kms
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-e2e-storage-kind-disruptive
  • /test pull-kubernetes-e2e-ubuntu-gce-network-policies
  • /test pull-kubernetes-integration-eks
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kind-json-logging
  • /test pull-kubernetes-kind-text-logging
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-linter-hints
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-arm64-e2e-containerd-ec2
  • /test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-arm64-ubuntu-serial-gce
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-1-7-dra
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-ec2
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-e2e-containerd-sidecar-containers
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
  • /test pull-kubernetes-node-e2e-crio-dra
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-pod-disruption-conditions
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-kubernetes-verify-strict-lint
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-conformance-kind-ipv6-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-ec2
  • pull-kubernetes-e2e-ec2-conformance
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-linter-hints
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-strict-lint

In response to this:

/test npd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 5, 2023
upodroid (Member Author) commented Oct 5, 2023

/retest

My changes are successful: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/121007/pull-kubernetes-e2e-gce/1709933808054177792 (look for NodeProblemDetector in the Passed tab).

/cc @pohly @SergeyKanzhelev

upodroid (Member Author) commented Oct 5, 2023

/test pull-kubernetes-node-e2e-containerd-standalone-mode

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 5, 2023
@@ -380,7 +378,7 @@ func getNpdPodStat(ctx context.Context, f *framework.Framework, nodeName string)

 	hasNpdPod := false
 	for _, pod := range summary.Pods {
-		if !strings.HasPrefix(pod.PodRef.Name, "npd") {
+		if !strings.HasPrefix(pod.PodRef.Name, "node-problem-detector") {
Member Author:

npd.yaml used a DaemonSet name that didn't match the values in the upstream manifests: https://github.com/kubernetes/node-problem-detector/blob/master/deployment/node-problem-detector.yaml

kops deploys NPD's DaemonSet with the name node-problem-detector, so it is expected that all Kubernetes clusters are bootstrapped using the correct manifests provided by the component/project maintainers.
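For illustration, one quick way to confirm the DaemonSet name and pod-name prefix a cluster actually uses (a hedged sketch; assumes kubectl access to the cluster under test):

    # The upstream manifest names the DaemonSet node-problem-detector in kube-system,
    # and its pods inherit that prefix, which is what getNpdPodStat now matches on.
    kubectl -n kube-system get daemonset node-problem-detector
    kubectl -n kube-system get pods -l app.kubernetes.io/name=node-problem-detector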

Member Author:

/cc @dims or @justinsb

k8s-ci-robot (Contributor):

@upodroid: GitHub didn't allow me to request PR reviews from the following users: or.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @dims or @justinsb

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@upodroid upodroid force-pushed the npd-host-exec-rewrite branch 2 times, most recently from 355000a to 841a861 on October 5, 2023 at 16:18
upodroid (Member Author):

/test pull-kubernetes-e2e-gce-correctness

upodroid (Member Author):

This PR is ready to be merged.

Notes:

@upodroid upodroid changed the title Rewrite NodeProblemDetector test to support host local exec Configure COS to use NPD in daemonset mode and align NPD manifests with upstream NPD Oct 15, 2023
@upodroid upodroid changed the title Configure COS to use NPD in daemonset mode and align NPD manifests with upstream NPD Configure COS to use NPD in daemonset mode and align kubeup NPD manifests with the manifests in the NPD repo Oct 15, 2023
upodroid (Member Author):

/test pull-kubernetes-e2e-gce-correctness

upodroid (Member Author):

/retest

upodroid (Member Author):

/retest

dims (Member) commented Oct 17, 2023

/approve

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@@ -287,12 +287,7 @@ export ENABLE_DNS_HORIZONTAL_AUTOSCALER="${KUBE_ENABLE_DNS_HORIZONTAL_AUTOSCALER
# none - Not run node problem detector.
# daemonset - Run node problem detector as daemonset.
# standalone - Run node problem detector as standalone system daemon.
if [[ "${NODE_OS_DISTRIBUTION}" == "gci" ]]; then
Member:

So we are losing the test coverage for the standalone mode? I think the hidden logic of defaulting it to standalone on COS is wrong, but I worry that we are losing test coverage.

Member Author:

We aren't because standalone tests are currently launched using the node e2e runner, which doesn't use the cluster/* scripts.
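As a rough illustration of that split (invocation details here are assumptions, not taken from this PR), standalone-mode coverage comes from the node e2e runner rather than cluster/kube-up.sh:

    # Hedged sketch: run the node e2e suite, which exercises NPD in standalone mode
    # without going through the cluster/* scripts touched by this PR.
    # The FOCUS value is a placeholder, not an exact test name.
    make test-e2e-node REMOTE=true FOCUS="NodeProblemDetector"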

Member:

For NPD, I think we lost test coverage in standalone mode due to this PR. In NPD standalone mode, we currently pull the tar files from gs://kubernetes-release/node-problem-detector/. See https://github.com/kubernetes/kubernetes/blob/d3d06c3c7e07c7c79ff46c0fc3b9f081ce6b0226/cluster/gce/gci/configure.sh#L299C99-L299C117.

But running gsutil ls gs://kubernetes-release/node-problem-detector/ shows there is no NPD v0.8.13, the version this PR bumps to; it only has versions up to v0.8.10. Yet none of the release-blocking tests failed.
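To make the coverage gap concrete, the check described above can be reproduced with the command already quoted in the comment:

    # List the NPD tarballs published for standalone mode; at the time of this comment
    # the newest available version was v0.8.10, so the v0.8.13 bump had no tarball to pull.
    gsutil ls gs://kubernetes-release/node-problem-detector/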

upodroid (Member Author):

This is ready to be merged. Can I get an LGTM, please?

dims (Member) commented Oct 23, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2023
k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: 6f5204869bed8298c3d21764ff3bf89cb6f4d8dc

@k8s-ci-robot k8s-ci-robot merged commit 604e9e0 into kubernetes:master Oct 23, 2023
15 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done Oct 23, 2023
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 23, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 23, 2023
aojea (Member) commented Nov 24, 2023

This is causing all these DNS jobs to fail because the NPD pod cannot be scheduled:
https://testgrid.k8s.io/sig-network-gce#gci-gce-kube-dns-nodecache
https://testgrid.k8s.io/sig-network-gce#gci-gce-coredns-nodecache
...

[FAILED] Error waiting for all pods to be running and ready: Timed out after 600.000s.
  Expected all pods (need at least 0) in namespace "kube-system" to be running and ready (except for 0).
  32 / 33 pods were running and ready.
  Expected 5 pod replicas, 5 are Running and Ready.
  Pods that were neither completed nor running:
      <[]v1.Pod | len:1, cap:1>: 
          - metadata:
              creationTimestamp: "2023-11-24T05:00:49Z"
              generateName: node-problem-detector-
              labels:
                app.kubernetes.io/name: node-problem-detector
                app.kubernetes.io/version: v0.8.13
                controller-revision-hash: 77d7676dcb
                pod-template-generation: "1"
              managedFields:
              - apiVersion: v1
                fieldsType: FieldsV1
                fieldsV1:
                  f:metadata:
                    f:generateName: {}
                    f:labels:
                      .: {}
                      f:app.kubernetes.io/name: {}
                      f:app.kubernetes.io/version: {}
                      f:controller-revision-hash: {}
                      f:pod-template-generation: {}
                    f:ownerReferences:
                      .: {}
                      k:{"uid":"05359bbf-d6c3-4b2a-bb96-f48f6f20aea8"}: {}
                  f:spec:
                    f:affinity:
                      .: {}
                      f:nodeAffinity:
                        .: {}
                        f:requiredDuringSchedulingIgnoredDuringExecution: {}
                    f:containers:
                      k:{"name":"node-problem-detector"}:
                        .: {}
                        f:command: {}
                        f:env:
                          .: {}
                          k:{"name":"NODE_NAME"}:
                            .: {}
                            f:name: {}
                            f:valueFrom:
                              .: {}
                              f:fieldRef: {}
                        f:image: {}
                        f:imagePullPolicy: {}
                        f:name: {}
                        f:resources:
                          .: {}
                          f:limits:
                            .: {}
                            f:cpu: {}
                            f:memory: {}
                          f:requests:
                            .: {}
                            f:cpu: {}
                            f:memory: {}
                        f:securityContext:
                          .: {}
                          f:privileged: {}
                        f:terminationMessagePath: {}
                        f:terminationMessagePolicy: {}
                        f:volumeMounts:
                          .: {}
                          k:{"mountPath":"/dev/kmsg"}:
                            .: {}
                            f:mountPath: {}
                            f:name: {}
                            f:readOnly: {}
                          k:{"mountPath":"/etc/localtime"}:
                            .: {}
                            f:mountPath: {}
                            f:name: {}
                            f:readOnly: {}
                          k:{"mountPath":"/var/log"}:
                            .: {}
                            f:mountPath: {}
                            f:name: {}
                    f:dnsPolicy: {}
                    f:enableServiceLinks: {}
                    f:restartPolicy: {}
                    f:schedulerName: {}
                    f:securityContext: {}
                    f:serviceAccount: {}
                    f:serviceAccountName: {}
                    f:terminationGracePeriodSeconds: {}
                    f:tolerations: {}
                    f:volumes:
                      .: {}
                      k:{"name":"kmsg"}:
                        .: {}
                        f:hostPath:
                          .: {}
                          f:path: {}
                          f:type: {}
                        f:name: {}
                      k:{"name":"localtime"}:
                        .: {}
                        f:hostPath:
                          .: {}
                          f:path: {}
                          f:type: {}
                        f:name: {}
                      k:{"name":"log"}:
                        .: {}
                        f:hostPath:
                          .: {}
                          f:path: {}
                          f:type: {}
                        f:name: {}
                manager: kube-controller-manager
                operation: Update
                time: "2023-11-24T05:00:49Z"
              - apiVersion: v1
                fieldsType: FieldsV1
                fieldsV1:
                  f:status:
                    f:conditions:
                      .: {}
                      k:{"type":"PodScheduled"}:
                        .: {}
                        f:lastProbeTime: {}
                        f:lastTransitionTime: {}
                        f:message: {}
                        f:reason: {}
                        f:status: {}
                        f:type: {}
                manager: kube-scheduler
                operation: Update
                subresource: status
                time: "2023-11-24T05:00:49Z"
              name: node-problem-detector-g9h6s
              namespace: kube-system
              ownerReferences:
              - apiVersion: apps/v1
                blockOwnerDeletion: true
                controller: true
                kind: DaemonSet
                name: node-problem-detector
                uid: 05359bbf-d6c3-4b2a-bb96-f48f6f20aea8
              resourceVersion: "984"
              uid: 6bf2ba29-68df-4544-8df5-a83dbf31bb7a
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchFields:
                      - key: metadata.name
                        operator: In
                        values:
                        - gce-coredns-perf-cache-master
              containers:
              - command:
                - /bin/sh
                - -c
                - exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json,/config/systemd-monitor.json
                  --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json
                  --config.system-stats-monitor=/config/system-stats-monitor.json >>/var/log/node-problem-detector.log
                  2>&1
                env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: spec.nodeName
                image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13
                imagePullPolicy: IfNotPresent
                name: node-problem-detector
                resources:
                  limits:
                    cpu: 200m
                    memory: 100Mi
                  requests:
                    cpu: 20m
                    memory: 20Mi
                securityContext:
                  privileged: true
                terminationMessagePath: /dev/termination-log
                terminationMessagePolicy: File
                volumeMounts:
                - mountPath: /var/log
                  name: log
                - mountPath: /dev/kmsg
                  name: kmsg
                  readOnly: true
                - mountPath: /etc/localtime
                  name: localtime
                  readOnly: true
                - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                  name: kube-api-access-6xftj
                  readOnly: true
              dnsPolicy: ClusterFirst
              enableServiceLinks: true
              preemptionPolicy: PreemptLowerPriority
              priority: 0
              restartPolicy: Always
              schedulerName: default-scheduler
              securityContext: {}
              serviceAccount: node-problem-detector
              serviceAccountName: node-problem-detector
              terminationGracePeriodSeconds: 30
              tolerations:
              - effect: NoExecute
                operator: Exists
              - effect: NoSchedule
                operator: Exists
              - key: CriticalAddonsOnly
                operator: Exists
              - effect: NoExecute
                key: node.kubernetes.io/not-ready
                operator: Exists
              - effect: NoExecute
                key: node.kubernetes.io/unreachable
                operator: Exists
              - effect: NoSchedule
                key: node.kubernetes.io/disk-pressure
                operator: Exists
              - effect: NoSchedule
                key: node.kubernetes.io/memory-pressure
                operator: Exists
              - effect: NoSchedule
                key: node.kubernetes.io/pid-pressure
                operator: Exists
              - effect: NoSchedule
                key: node.kubernetes.io/unschedulable
                operator: Exists
              volumes:
              - hostPath:
                  path: /var/log/
                  type: ""
                name: log
              - hostPath:
                  path: /dev/kmsg
                  type: ""
                name: kmsg
              - hostPath:
                  path: /etc/localtime
                  type: FileOrCreate
                name: localtime
              - name: kube-api-access-6xftj
                projected:
                  defaultMode: 420
                  sources:
                  - serviceAccountToken:
                      expirationSeconds: 3607
                      path: token
                  - configMap:
                      items:
                      - key: ca.crt
                        path: ca.crt
                      name: kube-root-ca.crt
                  - downwardAPI:
                      items:
                      - fieldRef:
                          apiVersion: v1
                          fieldPath: metadata.namespace
                        path: namespace
            status:
              conditions:
              - lastProbeTime: null
                lastTransitionTime: "2023-11-24T05:00:49Z"
                message: '0/4 nodes are available: 1 Insufficient cpu. preemption: 0/4 nodes
                  are available: 4 No preemption victims found for incoming pod.'
                reason: Unschedulable
                status: "False"
                type: PodScheduled
              phase: Pending
              qosClass: Burstable

upodroid (Member Author):

This is the problem:

                message: '0/4 nodes are available: 1 Insufficient cpu. preemption: 0/4 nodes
                  are available: 4 No preemption victims found for incoming pod.'

npd requests the following:

        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"

Bumping the control plane machine type from n1-standard-1 to n1-standard-2 can fix it.
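A minimal sketch of what that bump could look like for a kube-up GCE cluster (MASTER_SIZE is assumed to be the relevant cluster/gce variable; treat this as an illustration, not the exact job change):

    # Hypothetical example: give the control-plane node a larger machine type so the
    # node-problem-detector pod's 20m CPU request can still be scheduled there.
    export MASTER_SIZE=n1-standard-2   # assumed kube-up/GCE knob; default here was n1-standard-1
    ./cluster/kube-up.sh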

aojea (Member) commented Nov 24, 2023

This is the problem

                message: '0/4 nodes are available: 1 Insufficient cpu. preemption: 0/4 nodes
                  are available: 4 No preemption victims found for incoming pod.'

npd requests the following:

        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"

Bumping the control plane machine type from n1-standard-1 to n1-standard-2 can fix it.

I prefer this: https://github.com/kubernetes/test-infra/pull/31312/files

This change is ok.

aojea (Member) commented Nov 24, 2023

Bumping the control plane machine type from n1-standard-1 to n1-standard-2 can fix it.

It is a daemonset, so you'd need to bump all nodes, but there is no need to waste resources. This change is ok; the DNS jobs don't need to install NPD: kubernetes/test-infra#31312
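A rough sketch of that alternative for the DNS jobs (the KUBE_ENABLE_NODE_PROBLEM_DETECTOR variable name is an assumption based on the cluster/gce config shown earlier; the linked test-infra PR is the authoritative change):

    # Hypothetical job env: skip installing NPD entirely instead of resizing every node.
    export KUBE_ENABLE_NODE_PROBLEM_DETECTOR=none   # assumed knob; "none" disables NPD in kube-up
    ./cluster/kube-up.sh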
