
cluster creating hundreds of pods because node is down #6303

Closed
migs35323 opened this issue Oct 20, 2022 · 4 comments

@migs35323

Not sure if this is the right place to ask this, or if it belongs in Rancher.
I have a k3s cluster with a few nodes, and a deployment of an app that I configured to schedule only on one particular node. I used Rancher for this (Deployment > Config > Node Scheduling > Run on specific node).
The thing is, that particular node went down, and when it came back up the cluster had attempted to create hundreds of pods. The cluster was overflowing with terminating pods, hundreds or thousands of them at one point.

Is there a way to make the deployment simply not try to create pods while the only eligible machine is down?

@brandond
Contributor

You didn't fill out the issue template, so I'm not sure what version of K3s you're working with or what your cluster configuration is.

The deployment controller is not expected to keep creating pods when there is no node available to schedule them on, nor to keep creating them once a node does become available again. Are you using an autoscaler that scaled up the deployment replica count in an attempt to create pods? Can you post more information on what specifically you're seeing, including `kubectl get deployment -o yaml` for your deployment and `kubectl get pod -o yaml` showing the pods in question?
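
Something along these lines would capture what I'm after (the deployment name and namespace below are placeholders):

```sh
# Placeholders: replace "gitlab-runner" and <namespace> with the real names.
kubectl get deployment gitlab-runner -n <namespace> -o yaml
kubectl get pod -n <namespace> -o yaml
# Node status at the time would also help:
kubectl get nodes -o wide
```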

@migs35323
Author

migs35323 commented Oct 21, 2022

k3s v1.24.4+k3s1, cluster managed with Rancher v2.6.8

- 1 master with a no-schedule taint
- 2 normal nodes
- 1 extra node with PreferNoSchedule

The extra node (the last one) is where the workload in question was running; that machine went down for a day.

The most recent app where I hit this is a standard GitLab Runner deployment; I had the same situation with another deployment before, where I basically followed the same steps.

Here's the deployment (I had to redact most info). Rancher applied the `spec.nodeName` field. I believe that when the cluster tries to schedule the pod it doesn't check or care whether the machine is down: it sees the pod is not up and tries again.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-runner-
  namespace:
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  ...
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    ...
    spec:
      affinity: {}
      containers:
        - command:
          ...
          image: gitlab/gitlab-runner
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              ...
          livenessProbe:
            exec:
              command:
              ...
            failureThreshold: 3
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: gitlab-runner
          ...
          readinessProbe:
            ...
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          ...
      initContainers:
        - command:
            - sh
          ...
          image: gitlab/gitlab-runner
          imagePullPolicy: IfNotPresent
          name: configure
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
          ...
      nodeName: ${EXTRA_NODE}
      restartPolicy: Always
      schedulerName: default-scheduler
      ...
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime:
      lastUpdateTime:
      message: ReplicaSet "gitlab-runner" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    - lastTransitionTime:
      lastUpdateTime:
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```
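
For reference, a minimal sketch of pinning the workload through a `nodeSelector` on the node's default `kubernetes.io/hostname` label instead of `spec.nodeName`, so placement goes through the default scheduler and replacement pods should just sit Pending while the node is down rather than piling up. Names and the label value below are placeholders:

```yaml
# Sketch only: pin to a node by label instead of spec.nodeName.
# "extra-node" is a placeholder for the real node's hostname label value.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitlab-runner
  template:
    metadata:
      labels:
        app: gitlab-runner
    spec:
      nodeSelector:
        kubernetes.io/hostname: extra-node
      containers:
        - name: gitlab-runner
          image: gitlab/gitlab-runner
```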

@brandond
Contributor

That doesn't sound like a problem with K3s or Rancher then, but rather just the behavior of Kubernetes itself when you configure a Deployment this way.

@caroline-suse-rancher
Contributor

I'm going to convert this to a discussion as this seems like more of a question than a bug report.

@k3s-io k3s-io locked and limited conversation to collaborators Apr 19, 2023
@caroline-suse-rancher caroline-suse-rancher converted this issue into discussion #7318 Apr 19, 2023
