Pods and headless Services don't get DNS A records without at least one service port #174

Closed
seh opened this issue Nov 29, 2017 · 16 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@seh

seh commented Nov 29, 2017

Observation

Per earlier discussion in #70, the documentation for kube-dns says that it will publish A records for pods so long as a headless Service whose name matches the pods' subdomain exists in the same namespace. However, if that Service has no ports, kube-dns publishes no such records.

What follows is an example to demonstrate this discrepancy.

Example

We create the following objects in a given namespace:

  • A headless Service named "sub"
    Initially, the Service exposes one port with value 80, named "nonexistent" because the image used here doesn't run any server that listens on it. This shows that it doesn't matter whether the advertised port actually allows connecting to anything inside the container.
  • Some number of pods.
    Though a single pod is sufficient, here we create three pods selected by the aforementioned Service, each with a single container running the busybox image, each also situated within the subdomain "sub," matching the name of the Service:
    • busybox-1
    • busybox-2
    • busybox-3
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: Pod
  metadata: &metadata
    name: busybox-1
    labels:
      app: &app busybox
  spec: &spec
    hostname: host-1
    subdomain: &subdomain sub
    containers:
    - name: busybox
      image: busybox
      command:
      - sleep
      - "3600"
- apiVersion: v1
  kind: Pod
  metadata:
    <<: *metadata
    name: busybox-2
  spec:
    <<: *spec
    hostname: host-2
- apiVersion: v1
  kind: Pod
  metadata:
    <<: *metadata
    name: busybox-3
  spec:
    <<: *spec
    hostname: host-3
- apiVersion: v1
  kind: Service
  metadata:
    name: *subdomain
  spec:
    clusterIP: None
    selector:
      app: *app
    ports:
    # NB: Here the Service has at least one port.
    - name: nonexistent
      port: 80

Assuming that YAML document is available in a file called manifests.yaml, create these objects in some namespace:

kubectl apply -f manifests.yaml

Now, in that same namespace, run a container using an image that has dig available, probing first for DNS A records for our subdomain "sub":

kubectl run dig --image=tutum/dnsutils \
  --restart=Never --rm=true --tty --stdin --command -- \
  dig sub a +search +noall +answer
; <<>> DiG 9.10.2 <<>> sub a +search +noall +answer
;; global options: +cmd
sub.my-ns.svc.cluster.local. 30	IN A	172.30.48.142
sub.my-ns.svc.cluster.local. 30	IN A	172.30.48.83
sub.my-ns.svc.cluster.local. 30	IN A	172.30.98.82

Next, confirm that records exist for all three of our pod host names:

for i in $(seq 3); do
  kubectl run dig --image=tutum/dnsutils \
    --restart=Never --rm=true --tty --stdin --command -- \
    dig "host-${i}.sub" a +search +noall +answer
done
; <<>> DiG 9.10.2 <<>> host-1.sub a +search +noall +answer
;; global options: +cmd
host-1.sub.my-ns.svc.cluster.local. 30 IN A 172.30.98.82
; <<>> DiG 9.10.2 <<>> host-2.sub a +search +noall +answer
;; global options: +cmd
host-2.sub.my-ns.svc.cluster.local. 30 IN A 172.30.48.83
; <<>> DiG 9.10.2 <<>> host-3.sub a +search +noall +answer
;; global options: +cmd
host-3.sub.my-ns.svc.cluster.local. 30 IN A 172.30.48.142

Next, amend the Service "sub" to remove all of its service ports:

apiVersion: v1
kind: Service
metadata:
  name: *subdomain
spec:
  clusterIP: None
  selector:
    app: *app
  ports:
  # NB: Here the Service has no ports.

With that change applied, we repeat our earlier invocations of dig:

kubectl run dig --image=tutum/dnsutils \
  --restart=Never --rm=true --tty --stdin --command -- \
  dig sub a +search +noall +answer
; <<>> DiG 9.10.2 <<>> sub a +search +noall +answer
;; global options: +cmd
for i in $(seq 3); do
  kubectl run dig --image=tutum/dnsutils \
    --restart=Never --rm=true --tty --stdin --command -- \
    dig "host-${i}.sub" a +search +noall +answer
done
; <<>> DiG 9.10.2 <<>> host-1.sub a +search +noall +answer
;; global options: +cmd
; <<>> DiG 9.10.2 <<>> host-2.sub a +search +noall +answer
;; global options: +cmd
; <<>> DiG 9.10.2 <<>> host-3.sub a +search +noall +answer
;; global options: +cmd

Note how there are now no DNS A records available for the Service or for any of the pods it selects.

Cause

Why is this so? The method (*KubeDNS).generateRecordsForHeadlessService iterates over the Endpoints.Subsets sequence, which the endpoints controller populates only for ports that exist on the related Service object. If the Service defines no ports, the Endpoints object has no subsets for any of the selected pods.

kubectl get endpoints sub --output=jsonpath='{.subsets}'
[]

So long as kube-dns is implemented this way, the Service needs at least one port for kube-dns to notice the pods backing it. Hence, we should either adjust the documentation to match the implementation's constraints, or reconsider the implementation (much harder, as it would duplicate some of the watching and filtering work done by the endpoints controller).
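
For illustration, here is a minimal, self-contained Go sketch of that behavior. It uses local stand-in types rather than the real kube-dns or client-go packages, so all names below are hypothetical; only the shape of the loop mirrors the description above.

```go
package main

import "fmt"

// Local stand-ins for the relevant Endpoints fields (hypothetical types,
// mirroring only the shape of the real API objects).
type EndpointAddress struct {
	IP       string
	Hostname string
}

type EndpointSubset struct {
	Addresses []EndpointAddress
}

type Endpoints struct {
	Subsets []EndpointSubset
}

// generateRecords mimics the structure of the record-generation loop: it can
// only emit records for addresses that appear inside some subset.
func generateRecords(ep Endpoints) []string {
	var records []string
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			records = append(records,
				fmt.Sprintf("%s.sub.my-ns.svc.cluster.local. IN A %s", addr.Hostname, addr.IP))
		}
	}
	return records
}

func main() {
	// With no service ports, the Endpoints object has no subsets at all,
	// so the loop body never runs and no A records are produced.
	portless := Endpoints{Subsets: nil}
	fmt.Println(len(generateRecords(portless)), "records") // prints: 0 records

	// With at least one port, each selected pod appears in a subset.
	withPort := Endpoints{Subsets: []EndpointSubset{{
		Addresses: []EndpointAddress{
			{IP: "172.30.98.82", Hostname: "host-1"},
			{IP: "172.30.48.83", Hostname: "host-2"},
			{IP: "172.30.48.142", Hostname: "host-3"},
		},
	}}}
	for _, r := range generateRecords(withPort) {
		fmt.Println(r)
	}
}
```

Running it prints zero records for the port-less case and one A record per pod once a subset exists, matching the dig output above.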

Environment

kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T19:11:02Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.4-30+6c97db85c5ab05", GitCommit:"6c97db85c5ab0586c15be39b3e88c7a425b96947", GitTreeState:"clean", BuildDate:"2017-11-21T09:07:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2018
@seh
Author

seh commented Feb 27, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2018
@thockin
Member

thockin commented Feb 28, 2018

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 28, 2018
@seh
Author

seh commented Aug 16, 2018

Related issues:

Related fix:

I thought those would be sufficient to resolve this problem, but testing this again today with Kubernetes version 1.11.2 and CoreDNS still leaves us without A records for the pods, because the Endpoints object has no subsets for the pods.

@xiangpengzhao, do you think this case should work now?

@chrisohaver
Contributor

What's the use case here?
A headless Service that defines the open ports is kind of the "Kubernetes" way to solve this...
In your case, is it difficult or complex to create and maintain a Service definition?

@seh
Author

seh commented Aug 16, 2018

It's not that the Service is hard to write; it's that the documentation about how to do so is either wrong, or there's still a defect here. Please note that I was not complaining about why the Service is necessary; I was demonstrating that, despite the documentation, the Service requires at least one port to be defined, even if nothing will ever use that port, and even if no container actually listens on it.

@chrisohaver
Contributor

Ah OK - So updating the documentation is also an option...

@seh
Author

seh commented Aug 16, 2018

Right. I'm hoping someone can weigh in and say, "That was never supposed to work. Why did we claim it would?" Alternatively, "Yes, we want this to work, but we haven't been able to fix the problem yet."

Note, though, that @thockin did weigh in here:

I think it should work without ports.

@chrisohaver
Contributor

OK. According to the DNS Spec, this should work, as long as the endpoints are still deemed "ready" by the API if they have no open ports.

@seh
Author

seh commented Aug 16, 2018

Yes, and my reading of kubernetes/kubernetes#47250 says that it should have fixed this problem, but when I run kubectl get endpoints sub --output=yaml for the second phase of my test case above, the Endpoints object has no "subsets" field.

@thockin
Member

thockin commented Aug 16, 2018 via email

@seh
Author

seh commented Aug 16, 2018

No, that's not bad, Tim. We can change the documentation to match the implementation.

What's confusing, though, is why the recent efforts (the aforementioned merged PRs) to accommodate Services without ports don't cover this case. The documentation overpromises, but the implementation is trying to make it true. It seems that we're close.

@thockin
Member

thockin commented Aug 20, 2018

Looking at this now. I feel like it SHOULD work as documented.

@thockin
Member

thockin commented Aug 20, 2018

OK, I found the bug. It's not pretty to fix, but it's not terrible. Let me pull a PR together and see about tests.

thockin added a commit to thockin/kubernetes that referenced this issue Aug 21, 2018
As cited in
kubernetes/dns#174 - this is documented to
work, and I don't see why it shouldn't work.  We allowed the definition
of headless services without ports, but apparently nobody tested it very
well.

Manually tested clusterIP services with no ports - validation error.

Manually tested services with negative ports - validation error.

New tests failed, output inspected and verified.  Now pass.
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Aug 21, 2018

Automatic merge from submit-queue (batch tested with PRs 67661, 67497, 66523, 67622, 67632). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Allow headless svc without ports to have endpoints

As cited in
kubernetes/dns#174 - this is documented to
work, and I don't see why it shouldn't work.  We allowed the definition
of headless services without ports, but apparently nobody tested it very
well.

Manually tested clusterIP services with no ports - validation error.

Manually tested services with negative ports - validation error.

New tests failed, output inspected and verified.  Now pass.

xref kubernetes/dns#174

**Release note**:
```release-note
Headless Services with no ports defined will now create Endpoints correctly, and appear in DNS.
```
@thockin
Member

thockin commented Aug 21, 2018

Merged.

@seh
Author

seh commented Aug 22, 2018

Thank you, Tim. I'm happy to see one more rough edge—however obscure—sanded away.

@seh seh closed this as completed Aug 22, 2018
aniket-s-kulkarni pushed a commit to aniket-s-kulkarni/kubernetes that referenced this issue Aug 24, 2018
MrHohn pushed a commit to MrHohn/kubernetes that referenced this issue Feb 21, 2019
rjaini added a commit to msazurestackworkloads/kubernetes that referenced this issue Apr 14, 2019
* Allow headless svc without ports to have endpoints (the remainder of this large squashed commit is unrelated to this issue)
rjaini added a commit to msazurestackworkloads/kubernetes that referenced this issue May 6, 2019