CrashLoopBackOff 0.29.0 #1403

Closed
kaykhan opened this issue May 17, 2024 · 5 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

kaykhan commented May 17, 2024

What version of descheduler are you using?

descheduler version: 0.29.0

Does this issue reproduce with the latest release?

yes

Which descheduler CLI options are you using?

Please provide a copy of your descheduler policy config file

# Default values for descheduler.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# CronJob or Deployment
kind: Deployment
namespace: kube-system
nodeSelector:
  geeiq/node-type: ops

image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent

imagePullSecrets:
#   - name: container-registry-secret

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  # limits:
  #   cpu: 100m
  #   memory: 128Mi

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

nameOverride: ""
fullnameOverride: ""

# labels that'll be applied to all resources
commonLabels: {}

cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished 600

# Required when running as a Deployment
deschedulingInterval: 5m

# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1

# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#  enabled: true
#  leaseDuration: 15s
#  renewDeadline: 10s
#  retryPeriod: 2s
#  resourceLock: "leases"
#  resourceName: "descheduler"
#  resourceNamescape: "kube-system"

command:
- "/bin/descheduler"

cmdOptions:
  v: 3

# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha1"

deschedulerPolicy:
  nodeSelector: "geeiq/node-type=worker"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # ignorePvcPods: true
  # evictLocalStoragePods: true
  strategies:
    RemoveDuplicates:
      enabled: true
    RemovePodsHavingTooManyRestarts:
      enabled: false
      params:
        podsHavingTooManyRestarts:
          podRestartThreshold: 100
          includingInitContainers: true
    RemovePodsViolatingNodeTaints:
      enabled: false 
    RemovePodsViolatingNodeAffinity:
      enabled: false 
      params:
        nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
    RemovePodsViolatingInterPodAntiAffinity:
      enabled: false
    RemovePodsViolatingTopologySpreadConstraint:
      enabled: true 
      params:
        includeSoftConstraints: true
    LowNodeUtilization:
      enabled: false
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50

priorityClassName: system-cluster-critical

nodeSelector: {}
#  foo: bar

affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#     - matchExpressions:
#       - key: kubernetes.io/e2e-az-name
#         operator: In
#         values:
#         - e2e-az1
#         - e2e-az2
#  podAntiAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      - labelSelector:
#          matchExpressions:
#            - key: app.kubernetes.io/name
#              operator: In
#              values:
#                - descheduler
#        topologyKey: "kubernetes.io/hostname"
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'

rbac:
  # Specifies whether RBAC resources should be created
  create: true

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}

podAnnotations: {}

podLabels: {}

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10

service:
  enabled: false

serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
    # prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
    # - action: keep
    #   regex: 'descheduler_(build_info|pods_evicted)'
    #   sourceLabels: [__name__]
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace
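
For context, the deschedulerPolicy values above are what the chart renders into the descheduler policy ConfigMap. A rough sketch of the resulting v1alpha1 policy, showing only the enabled strategies (the exact rendering depends on the chart templates, so treat this as an approximation):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# only nodes carrying this label are considered by the descheduler
nodeSelector: "geeiq/node-type=worker"
strategies:
  RemoveDuplicates:
    enabled: true
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
    params:
      includeSoftConstraints: true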

What k8s version are you using (kubectl version)?

1.29 AWS EKS

kubectl version Output
$ kubectl version

--
What did you do?

--

What did you expect to see?

Pods to be running

What did you see instead?

Pods stuck in CrashLoopBackOff

kubectl get nodes

ip-10-1-133-59.eu-west-2.compute.internal   Ready    <none>   35m     v1.29.0-eks-5e0fdde
ip-10-1-134-41.eu-west-2.compute.internal   Ready    <none>   2d22h   v1.29.0-eks-5e0fdde
ip-10-1-15-126.eu-west-2.compute.internal   Ready    <none>   2d22h   v1.29.0-eks-5e0fdde
ip-10-1-15-170.eu-west-2.compute.internal   Ready    <none>   2d22h   v1.29.0-eks-5e0fdde

kubectl logs -f pod/descheduler-ddcd5b7b8-g694d -n kube-system

I0517 08:31:19.342817       1 secure_serving.go:57] Forcing use of http/1.1 only
I0517 08:31:19.343250       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1715934679\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1715934679\" (2024-05-17 07:31:18 +0000 UTC to 2025-05-17 07:31:18 +0000 UTC (now=2024-05-17 08:31:19.343215161 +0000 UTC))"
I0517 08:31:19.343275       1 secure_serving.go:213] Serving securely on [::]:10258
I0517 08:31:19.343306       1 tracing.go:87] Did not find a trace collector endpoint defined. Switching to NoopTraceProvider
I0517 08:31:19.343337       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0517 08:31:19.344232       1 conversion.go:248] converting Balance plugin: RemovePodsViolatingTopologySpreadConstraint
I0517 08:31:19.344255       1 conversion.go:248] converting Balance plugin: RemoveDuplicates
W0517 08:31:19.351740       1 descheduler.go:246] failed to convert Descheduler minor version to float
I0517 08:31:19.362334       1 reflector.go:289] Starting reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362464       1 reflector.go:325] Listing and watching *v1.PriorityClass from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362353       1 reflector.go:289] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362589       1 reflector.go:325] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362372       1 reflector.go:289] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362802       1 reflector.go:325] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362388       1 reflector.go:289] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.362873       1 reflector.go:325] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.364298       1 reflector.go:351] Caches populated for *v1.PriorityClass from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.365771       1 reflector.go:351] Caches populated for *v1.Namespace from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.370446       1 reflector.go:351] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.390904       1 reflector.go:351] Caches populated for *v1.Pod from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.463161       1 descheduler.go:120] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
E0517 08:31:19.463192       1 descheduler.go:430] the cluster size is 0 or 1
I0517 08:31:19.463276       1 reflector.go:295] Stopping reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.463316       1 reflector.go:295] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.463341       1 reflector.go:295] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.463362       1 reflector.go:295] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:159
I0517 08:31:19.468462       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0517 08:31:19.468509       1 secure_serving.go:258] Stopped listening on [::]:10258

Installed using Helm + Terraform: https://artifacthub.io/packages/helm/descheduler/descheduler

resource "helm_release" "descheduler" {
  name = "descheduler"

  repository = "https://kubernetes-sigs.github.io/descheduler/"
  chart      = "descheduler"
  namespace  = "kube-system"
  version    = "0.29.0"

  values = [
    "${file("utils/values.yaml")}"
  ]
}
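
As an aside on the Terraform snippet: a bare relative path passed to file() is resolved from the directory Terraform is invoked in, so anchoring it to the module directory is the more robust form. A minimal variant, assuming utils/values.yaml sits inside the module directory:

resource "helm_release" "descheduler" {
  name       = "descheduler"
  repository = "https://kubernetes-sigs.github.io/descheduler/"
  chart      = "descheduler"
  namespace  = "kube-system"
  version    = "0.29.0"

  # path.module keeps the values lookup independent of the working directory
  values = [
    file("${path.module}/utils/values.yaml")
  ]
}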
kaykhan added the kind/bug label (Categorizes issue or PR as related to a bug) on May 17, 2024
kaykhan changed the title from "CrashLookupBackOff Fresh Install" to "CrashLoopBackOff 0.29.0" on May 17, 2024
a7i (Contributor) commented May 18, 2024

Related #1350

kaykhan (Author) commented May 21, 2024

Related #1350

You can see I have multiple nodes in my cluster; is this the same problem?

a7i (Contributor) commented Jun 30, 2024

I0517 08:31:19.463161       1 descheduler.go:120] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
E0517 08:31:19.463192       1 descheduler.go:430] the cluster size is 0 or 1

Indicates that's not the case. Can you show the output of this?

kubectl get nodes -l geeiq/node-type=worker

a7i (Contributor) commented Jun 30, 2024

/kind support
/remove-kind bug

k8s-ci-robot added the kind/support label (Categorizes issue or PR as a support question) and removed the kind/bug label on Jun 30, 2024
kaykhan (Author) commented Jul 1, 2024

I0517 08:31:19.463161       1 descheduler.go:120] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
E0517 08:31:19.463192       1 descheduler.go:430] the cluster size is 0 or 1

Indicates that's not the case. Can you show the output of this?

kubectl get nodes -l geeiq/node-type=worker
ip-10-1-11-56.eu-west-2.compute.internal    Ready    <none>   12d   v1.30.0-eks-036c24b
ip-10-1-33-176.eu-west-2.compute.internal   Ready    <none>   25d   v1.30.0-eks-036c24b

I'm not facing this problem anymore, so I'm going to close this.

kaykhan closed this as completed on Jul 1, 2024