Volumes fail to clean up when kubelet restarts due to race between actual and desired state #75345

jingxu97 · 2019-03-13T21:35:08Z

We have tests that delete pods while kubelet restarts. These tests are a little flaky because of the following race condition:

1. Stop kubelet, delete the pod gracefully, and restart kubelet.
2. The pod's volumes are first put into the volume desired state. The populator then tries to remove them because the pod is deleted. However, this removal normally does not go through at first, because we check whether actual state already has a record of the volume.
3. Actual state verifies that the volume is attached and has a record of it, but it does not record it as mounted yet.
4. Desired state removes the volume right after step 3 happens.
5. The reconciler never has a chance to unmount the volume, because actual state does not have it as mounted (it only has it as attached).

I think that for step 3, we should check whether actual state has this volume as mounted, instead of just checking whether a record of it exists (attached).
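To make the window concrete, here is a minimal sketch of the two checks side by side. It is an illustration only: `volumeState`, `canRemoveOld`, and `canRemoveNew` are hypothetical names invented for this example and are not the kubelet's actual types or functions.

```go
package main

import "fmt"

// volumeState is a hypothetical, simplified view of what actual state
// records for a volume. It is not the kubelet's real data structure.
type volumeState int

const (
	// unknown: actual state has no record of the volume.
	unknown volumeState = iota
	// attached: actual state has a record, but not as mounted (step 3).
	attached
	// mounted: actual state records the volume as mounted for the pod.
	mounted
)

// canRemoveOld models the existing check: any record in actual state
// is enough to let desired state drop the volume.
func canRemoveOld(s volumeState) bool { return s != unknown }

// canRemoveNew models the proposed check: desired state may drop the
// volume only once actual state records it as mounted.
func canRemoveNew(s volumeState) bool { return s == mounted }

func main() {
	for _, s := range []volumeState{unknown, attached, mounted} {
		fmt.Printf("state=%d old=%v new=%v\n", s, canRemoveOld(s), canRemoveNew(s))
	}
}
```

The `attached` row is the race: the old check allows removal from desired state while the reconciler still has nothing recorded as mounted to unmount, whereas the proposed check keeps the volume in desired state until the mount is recorded.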

mariantalla · 2019-03-15T22:11:58Z

Hey @jingxu97, just checking in - any updates on this one?

athenabot · 2019-03-16T03:05:34Z

/sig node

These SIGs are my best guesses for this issue. Please comment /remove-sig <name> if I am incorrect about one.
🤖 I am an (alpha) bot run by @vllry. 👩‍🔬

mariantalla · 2019-03-18T13:41:32Z

/remove-sig node

mariantalla · 2019-03-18T17:17:06Z

Hi @jingxu97 / @msau42 - any news on this one? #75328 is currently flaking about 15% of the time in release-blocking boards.

jingxu97 · 2019-03-18T17:42:07Z

@mariantalla sorry for the delay. I will work on a fix today.

spiffxp · 2019-03-18T21:33:25Z

/milestone v1.14

spiffxp · 2019-03-18T21:41:15Z

/priority important-soon

Fix race condition between actual and desired state in kubelet volume manager

This PR fixes issue kubernetes#75345. The fix changes how the volume is checked in actual state when validating whether it can be removed from desired state: the volume can be removed from desired state only if actual state already records it as mounted. The case where mounting always fails still works, because the check also validates whether the pod still exists in the pod manager. When mounting fails, the pod can still be removed from the pod manager, and the volume can then be removed from desired state as well.
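The removal rule in this commit message can be read as a small decision function. The sketch below assumes that reading; `canRemoveFromDesiredState` and its two boolean parameters are hypothetical and do not reproduce the actual code in the PR.

```go
package main

import "fmt"

// canRemoveFromDesiredState is a hypothetical reduction of the rule in
// the commit message. The volume stays in desired state only while the
// mount has not been recorded in actual state AND the pod is still known
// to the pod manager. Once the pod is gone from the pod manager, removal
// is allowed even if mounting never succeeded.
func canRemoveFromDesiredState(mountedInActual, podInPodManager bool) bool {
	if !mountedInActual && podInPodManager {
		return false
	}
	return true
}

func main() {
	for _, isMounted := range []bool{false, true} {
		for _, podKnown := range []bool{false, true} {
			fmt.Printf("mounted=%v podInPodManager=%v remove=%v\n",
				isMounted, podKnown, canRemoveFromDesiredState(isMounted, podKnown))
		}
	}
}
```

The only combination that blocks removal is "not mounted, pod still in pod manager", which is why a mount that always fails does not pin the volume in desired state forever.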

jingxu97 · 2019-03-19T01:53:36Z

Opened PR #75458 to fix it.

nikopen · 2019-03-21T15:04:22Z

/milestone clear

The PR can merge to master and can be cherry-picked to 1.14.1 if it doesn't make 1.14.0.


fejta-bot · 2019-06-19T15:41:08Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

* kubeadm: fix "upgrade plan" not working without k8s version If the k8s version argument passed to "upgrade plan" is missing, the logic should perform the following actions: - fetch a "stable" version from the internet. - if that fails, fall back to the local client version. Currently the logic fails because cfg.KubernetesVersion is defaulted to the version of the existing cluster, which then causes an early exit without any upgrade suggestions. See app/cmd/upgrade/common.go::enforceRequirements(): configutil.FetchInitConfigurationFromCluster(..) Fix that by passing the explicit user value, which can also be "". This will then make the "offline getter" treat it as an explicit desired upgrade target. In the future it might be best to invert this logic: - if no user k8s version argument is passed - default to the kubeadm version. - if labels are passed (e.g. "stable"), fetch a version from the internet. * Disable GCE agent address management on Windows nodes. With this metadata key set, "GCEWindowsAgent: GCE address manager status: disabled" will appear in the VM's serial port output during boot. 
Tested: PROJECT=${CLOUDSDK_CORE_PROJECT} KUBE_GCE_ENABLE_IP_ALIASES=true NUM_WINDOWS_NODES=2 NUM_NODES=2 KUBERNETES_NODE_PLATFORM=windows go run ./hack/e2e.go -- --up cluster/gce/windows/smoke-test.sh cat > iis.yaml <<EOF apiVersion: v1 kind: Pod metadata: name: iis labels: app: iis spec: containers: - image: mcr.microsoft.com/windows/servercore/iis imagePullPolicy: IfNotPresent name: iis-server ports: - containerPort: 80 protocol: TCP nodeSelector: beta.kubernetes.io/os: windows tolerations: - effect: NoSchedule key: node.kubernetes.io/os operator: Equal value: windows1809 EOF kubectl create -f iis.yaml kubectl expose pod iis --type=LoadBalancer --name=iis kubectl get services curl http://<service external IP address> * kube-aggregator: bump openapi aggregation log level * Explicitly flush headers when proxying * fix-kubeadm-upgrade-12-13-14 * GCE/Windows: disable stackdriver logging agent The logging service could not be stopped at times, causing node startup failures. Disable it until the issue is fixed. * Finish saving test results on failure The conformance image should be saving its results regardless of the results of the tests. However, with errexit set, when ginkgo gets test failures it exits 1 which prevents saving the results for Sonobuoy to pick up. Fixes: kubernetes#76036 * Avoid panic in cronjob sorting This change handles the case where the ith cronjob may have its start time set to nil. Previously, the Less method could cause a panic in case the ith cronjob had its start time set to nil, but the jth cronjob did not. It would panic when calling Before on a nil StartTime. * Removed cleanup for non-current kube-proxy modes in newProxyServer() * Depricated --cleanup-ipvs flag in kube-proxy * Fixed old function signature in kube-proxy tests. * Revert "Deprecated --cleanup-ipvs flag in kube-proxy" This reverts commit 4f1bb2b. * Revert "Fixed old function signature in kube-proxy tests." This reverts commit 29ba1b0. 
* Fixed --cleanup-ipvs help text * Check for required name parameter in dynamic client The Create, Delete, Get, Patch, Update and UpdateStatus methods in the dynamic client all expect the name parameter to be non-empty, but did not validate this requirement, which could lead to a panic. Add explicit checks to these methods. * Fix empty array expansion error in cluster/gce/util.sh Empty array expansion causes "unbound variable" error in bash 4.2 and bash 4.3. * Improve volume operation metrics * Add e2e tests * ensuring that logic is checking for differences in listener * Kubernetes version v1.14.2-beta.0 openapi-spec file updates * Delete only unscheduled pods if node doesn't exist anymore. * Add/Update CHANGELOG-1.14.md for v1.14.1. * Use Node-Problem-Detector v0.6.3 on GCI * proxy: Take into account exclude CIDRs while deleting legacy real servers * kubeadm: Don't error out on join with --cri-socket override In the case where newControlPlane is true we don't go through getNodeRegistration() and initcfg.NodeRegistration.CRISocket is empty. This forces DetectCRISocket() to be called later on, and if there is more than one CRI installed on the system, it will error out, while asking for the user to provide an override for the CRI socket. Even if the user provides an override, the call to DetectCRISocket() can happen too early and thus ignore it (while still erroring out). However, if newControlPlane == true, initcfg.NodeRegistration is not used at all and it's overwritten later on. Thus it's necessary to supply some default value, that will avoid the call to DetectCRISocket() and as initcfg.NodeRegistration is discarded, setting whatever value here is harmless. Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com> * Bump coreos/go-semver The https://github.com/coreos/go-semver/ dependency has formally release v0.3.0 at commit e214231b295a8ea9479f11b70b35d5acf3556d9b. 
This is the commit point we've been using, but the hack/verify-godeps.sh script notices the discrepancy and causes the ci-kubernetes-verify job to fail. Fixes: kubernetes#76526 Signed-off-by: Tim Pepper <tpepper@vmware.com> * Fix Azure SLB support for multiple backend pools Azure VM and vmssVM support multiple backend pools for the same SLB, but not for different LBs. * Restore metrics-server using of IP addresses This preference list is used to pick the preferred field from the k8s node object. It was introduced in metrics-server 0.3 and changed the default behaviour to use DNS instead of IP addresses. It was merged into k8s 1.12 and caused a breaking change by introducing a dependency on DNS configuration. * refactor detach azure disk retry operation * move disk lock process to azure cloud provider fix comments fix import keymux check error add unit test for attach/detach disk funcs * Fix concurrent map access in Portworx create volume call Fixes kubernetes#76340 Signed-off-by: Harsh Desai <harsh@portworx.com> * Fix race condition between actual and desired state in kubelet volume manager This PR fixes the issue kubernetes#75345. The fix changes how the volume is checked in the actual state when validating whether it can be removed from the desired state: a volume can be removed from the desired state only if it is already recorded as mounted in the actual state. The case where mounting always fails still works, because the check also validates whether the pod still exists in the pod manager. If the mount fails, the pod can be removed from the pod manager, so the volume can then also be removed from the desired state. * fix validation message: apiServerEndpoints -> apiServerEndpoint * add shareName param in azure file storage class skip create azure file if it exists * Update Cluster Autoscaler to 1.14.2 * Create the "internal" firewall rule for kubemark master. This is equivalent to the "internal" firewall rule that is created for the regular masters. 
The main reason for doing it is to allow prometheus scraping metrics from various kubemark master components, e.g. kubelet. Ref. kubernetes/perf-tests#503 * fix disk list corruption issue * Restrict builds to officially supported platforms Prior to this change, including windows/amd64 in KUBE_BUILD_PLATFORMS would, for example, attempt to build the server binaries/tars/images for Windows, which is not supported. This can break downstream build steps. * Fix verify godeps failure github.com/evanphx/json-patch added a new tag at the same sha this morning: https://github.com/evanphx/json-patch/releases/tag/v4.2.0 This confused godeps. This PR updates our file to match godeps expectation. Fixes issue 77238 * Upgrade Stackdriver Logging Agent addon image from 1.6.0 to 1.6.8. * Test kubectl cp escape * Properly handle links in tar * Bump debian-iptables versions to v11.0.2. * os exit when option is true * Pin GCE Windows node image to 1809 v20190312. This is to work around kubernetes#76666. * Update the dynamic volume limit in GCE PD Currently GCE PD support 128 maximum disks attached to a node for all machines types except shared-core. This PR updates the limit number to date. Change-Id: Id9dfdbd24763b6b4138935842c246b1803838b78 * Use consistent imageRef during container startup * Replace vmss update API with instance-level update API commit * Cleanup codes that not required any more * Add unit tests * Upgrade compute API to version 2019-03-01 * Update vendors * Fix issues because of rebase * Pick up security patches for fluentd-gcp-scaler by upgrading to version 0.5.2 * Short-circuit quota admission rejection on zero-delta updates * Accept admission request if resource is being deleted * Error when etcd3 watch finds delete event with nil prevKV * Bump addon-manager to v9.0.1 - Rebase image on debian-base:v1.0.0. * Remove terminated pod from summary api. 
Signed-off-by: Lantao Liu <lantaol@google.com> * Expect the correct object type to be removed * check if Memory is not nil for container stats * Fix eviction dry-run * Update k8s-dns-node-cache image version This revised image resolves kubernetes dns#292 by updating the image from `k8s-dns-node-cache:1.15.2` to `k8s-dns-node-cache:1.15.2` * Update to go 1.12.4 * Update to go 1.12.5 * fix incorrect prometheus metrics fix left incorrect metrics * In GuaranteedUpdate, retry on any error if we are working with stale data * BoundServiceAccountTokenVolume: fix InClusterConfig * Don't create a RuntimeClassManager without a KubeClient * Kubernetes version v1.14.3-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.14.md for v1.14.2. * fix CVE-2019-11244: `kubectl --http-cache=<world-accessible dir>` creates world-writeable cached schema files * Upgrade Azure network API version to 2018-07-01 * Update godeps * Terminate watchers when watch cache is destroyed * honor overridden tokenfile, add InClusterConfig override tests * Don't use mapfile as it isn't bash 3 compatible * fix unbound array variable * fix unbound variable release.sh * Don't use declare -g in build * Check KUBE_SERVER_PLATFORMS existence when compile kubectl on platform other than linux/amd64, we need to check the KUBE_SERVER_PLATFORMS array emptiness before assign it. the example command is: make WHAT=cmd/kubectl KUBE_BUILD_PLATFORMS="darwin/amd64 windows/amd64" * Backport of kubernetes#78137: godeps: update vmware/govmomi to v0.20.1 Cannot cherry-pick kubernetes#78137 (go mod vs godep) Includes fix for SAML token auth with vSphere and zones API Issue kubernetes#77360 See also: kubernetes#75742 * fix: failed to close kubelet->API connections on heartbeat failure * Revert "Use consistent imageRef during container startup" This reverts commit 26e3c86. * fix azure retry issue when return 2XX with error fix comments * Disable graceful termination for udp
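
The volume manager fix for kubernetes#75345 listed above boils down to a change in one removal check. The following is only a minimal sketch of that decision, not the kubelet's actual code: `volumeRecord`, `oldCanRemove`, and `canRemoveFromDesiredState` are hypothetical names invented for illustration, and the real populator works with richer state than two booleans.

```go
package main

import "fmt"

// volumeRecord is a hypothetical, simplified view of what the actual state
// knows about one pod volume. The real kubelet types are more detailed.
type volumeRecord struct {
	Attached bool // actual state has a record of the volume
	Mounted  bool // actual state has the volume mounted for the pod
}

// oldCanRemove models the pre-fix behavior described in the issue: the
// volume may leave the desired state as soon as the actual state has any
// record of it, even if it is only attached.
func oldCanRemove(rec volumeRecord) bool {
	return rec.Attached
}

// canRemoveFromDesiredState models the fixed behavior: keep the volume in
// the desired state while it is not yet mounted and the pod is still known
// to the pod manager. Once the pod is gone from the pod manager, removal is
// allowed even if the mount never succeeded.
func canRemoveFromDesiredState(rec volumeRecord, podInPodManager bool) bool {
	if !rec.Mounted && podInPodManager {
		return false
	}
	return true
}

func main() {
	// The racy window: attached, but not yet recorded as mounted.
	racing := volumeRecord{Attached: true, Mounted: false}

	fmt.Println("old check allows removal:", oldCanRemove(racing))
	fmt.Println("new check allows removal:", canRemoveFromDesiredState(racing, true))
	fmt.Println("new check, pod gone:", canRemoveFromDesiredState(racing, false))
}
```

In the racy window the old check lets the volume drop out of the desired state before the reconciler ever sees it as mounted, so it is never unmounted. The new check holds it back until the mount is recorded or the pod disappears from the pod manager.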

* Fix kubernetes#73479 AWS NLB target groups missing tags `elbv2.AddTags` doesn't seem to support assigning the same set of tags to multiple resources at once leading to the following error: Error adding tags after modifying load balancer targets: "ValidationError: Only one resource can be tagged at a time" This can happen when using AWS NLB with multiple listeners pointing to different node ports. When k8s creates a NLB it creates a target group per listener along with installing security group ingress rules allowing the traffic to reach the k8s nodes. Unfortunately if those target groups are not tagged, k8s will not manage them, thinking it is not the owner. This small changes assigns tags one resource at a time instead of batching them as before. Signed-off-by: Brice Figureau <brice@daysofwonder.com> * record event on endpoint update failure * Fix scanning of failed targets If a iSCSI target is down while a volume is attached, reading from /sys/class/iscsi_host/host415/device/session383/connection383:0/iscsi_connection/connection383:0/address fails with an error. Kubelet should assume that such target is not available / logged in and try to relogin. Eventually, if such error persists, it should continue mounting the volume if the other paths are healthy instead of failing whole WaitForAttach(). * Applies zone labels to newly created vsphere volumes * Provision vsphere volume honoring zones * Explicitly set GVK when sending objects to webhooks * Remove reflector metrics as they currently cause a memory leak * add health plugin in the DNS tests * add more logging in azure disk attach/detach * Kubernetes version v1.13.5-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.13.md for v1.13.4. * add Azure Container Registry anonymous repo support apply fix for msi and fix test failure * DaemonSet e2e: Update image and rolling upgrade test timeout Use Nginx as the DaemonSet image instead of the ServeHostname image. 
This was changed because the ServeHostname has a sleep after terminating which makes it incompatible with the DaemonSet Rolling Upgrade e2e test. In addition, make the DaemonSet Rolling Upgrade e2e test timeout a function of the number of nodes that make up the cluster. This is required because the more nodes there are, the longer the time it will take to complete a rolling upgrade. Signed-off-by: Alexander Brand <alexbrand09@gmail.com> * Revert kubelet to default to ttl cache secret/configmap behavior * cri_stats_provider: overload nil as 0 for exited containers stats Always report 0 cpu/memory usage for exited containers to make metrics-server work as expect. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> * flush iptable chains first and then remove them while cleaning up ipvs mode. flushing iptable chains first and then remove the chains. this avoids trying to remove chains that are still referenced by rules in other chains. fixes kubernetes#70615 * Checks whether we have cached runtime state before starting a container that requests any device plugin resource. If not, re-issue Allocate grpc calls. This allows us to handle the edge case that a pod got assigned to a node even before it populates its extended resource capacity. * Fix panic in kubectl cp command * Bump debian-iptables to v11.0.1 Rebase docker image on debian-base:0.4.1 * Adding a check to make sure UseInstanceMetadata flag is true to get data from metadata. * GetMountRefs fixed to handle corrupted mounts by treating it like an unmounted volume * Fix the network policy tests. This is a cherrypick of the following commit https://github.com/kubernetes/kubernetes/pull/74290/commits * Update Cluster Autoscaler version to 1.13.2 * Ensure Azure load balancer cleaned up on 404 or 403 * Allow disable outbound snat when Azure standard load balancer is used * Allow session affinity a period of time to setup for new services. This is to deal with the flaky session affinity test. 
* Distinguish volume path with mount path * Delay CSI client initialization * kubelet: updated logic of verifying a static critical pod - check if a pod is static by its static pod info - meanwhile, check if a pod is critical by its corresponding mirror pod info * Restore username and password kubectl flags * build/gci: bump CNI version to 0.7.5 * fix smb unmount issue on Windows fix log warning use IsCorruptedMnt in GetMountRefs on Windows use errorno in IsCorruptedMnt check fix comments: add more error code add more error no checking change year fix comments * fix race condition issue for smb mount on windows change var name * Fix aad support in kubectl for sovereign cloud * make describers of different versions work properly when autoscaling/v2beta2 is not supported * allows configuring NPD release and flags on GCI and add cluster e2e test * allows configuring NPD image version in node e2e test and fix the test * bump repd min size in e2es * Kubernetes version v1.13.6-beta.0 openapi-spec file updates * stop vsphere cloud provider from spamming logs with `failed to patch IP` Fixes: kubernetes#75236 * Add/Update CHANGELOG-1.13.md for v1.13.5. * Add flag to enable strict ARP * Do not delete existing VS and RS when starting * Fix updating 'currentMetrics' field for HPA with 'AverageValue' target * Update config tests * Bump go-openapi/jsonpointer and go-openapi/jsonreference versions xref: kubernetes#75653 Signed-off-by: Jorge Alarcon Ochoa <alarcj137@gmail.com> * Fix nil pointer dereference panic in attachDetachController add check `attachableVolumePlugin == nil` to operationGenerator.GenerateDetachVolumeFunc() * if ephemeral-storage not exist in initialCapacity, don't upgrade ephemeral-storage in node status * Update gcp images with security patches [stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes. 
[fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes. [fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [metatada-proxy addon] Bump prometheus-to-sd v0.5.0 to pick up security fixes. * Fix AWS driver fails to provision specified fsType * Bump debian-iptables to v11.0.2. * Avoid panic in cronjob sorting This change handles the case where the ith cronjob may have its start time set to nil. Previously, the Less method could cause a panic in case the ith cronjob had its start time set to nil, but the jth cronjob did not. It would panic when calling Before on a nil StartTime. * Updated regional PD minimum size; changed regional PD failover test to use StorageClassTest to generate PVC template * Check for required name parameter in dynamic client The Create, Delete, Get, Patch, Update and UpdateStatus methods in the dynamic client all expect the name parameter to be non-empty, but did not validate this requirement, which could lead to a panic. Add explicit checks to these methods. * disable HTTP2 ingress test * ensuring that logic is checking for differences in listener * Delete only unscheduled pods if node doesn't exist anymore. * Use Node-Problem-Detector v0.6.3 on GCI * proxy: Take into account exclude CIDRs while deleting legacy real servers * Update addon-manager to use debian-base:v1.0.0 * Increase default maximumLoadBalancerRuleCount to 250 * Set CPU metrics for init containers under containerd metrics-server doesn't return metrics for pods with init containers under containerd because they have incomplete CPU metrics returned by the kubelet /stats/summary API. This problem has been fixed in 1.14 (kubernetes#74336), but the cherry-picks dropped the `usageNanoCores` metric. This change adds the missing `usageNanoCores` metric for init containers. 
Fixes kubernetes#76292 * kube-proxy: rename internal field for clarity * kube-proxy: rename vars for clarity, fix err str * kube-proxy: rename field for congruence * kube-proxy: reject 0 endpoints on forward Previously we only REJECTed on OUTPUT which works for packets from the node but not for packets from pods on the node. * kube-proxy: remove old cleanup rules * Kube-proxy: REJECT LB IPs with no endpoints We REJECT every other case. Close this FIXME. To get this to work in all cases, we have to process service in filter.INPUT, since LB IPS might be manged as local addresses. * Retool HTTP and UDP e2e utils This is a prefactoring for followup changes that need to use very similar but subtly different test. Now it is more generic, though it pushes a little logic up the stack. That makes sense to me. * Fix small race in e2e Occasionally we get spurious errors about "no route to host" when we race with kube-proxy. This should reduce that. It's mostly just log noise. * Bump coreos/go-semver The https://github.com/coreos/go-semver/ dependency has formally release v0.3.0 at commit e214231b295a8ea9479f11b70b35d5acf3556d9b. This is the commit point we've been using, but the hack/verify-godeps.sh script notices the discrepancy and causes ci-kubernetes-verify job to fail. Fixes: kubernetes#76526 Signed-off-by: Tim Pepper <tpepper@vmware.com> * Fix Azure SLB support for multiple backend pools Azure VM and vmssVM support multiple backend pools for the same SLB, but not for different LBs. * Kubelet: add usageNanoCores from CRI stats provider * Fix computing of cpu nano core usage CRI runtimes do not supply cpu nano core usage as it is not part of CRI stats. However, there are upstream components that still rely on such stats to function. The previous fix was faulty because the multiple callers could compete and update the stats, causing inconsistent/incoherent metrics. 
This change, instead, creates a separate call for updating the usage, and rely on eviction manager, which runs periodically, to trigger the updates. The caveat is that if eviction manager is completley turned off, no one would compute the usage. * Restore metrics-server using of IP addresses This preference list matches is used to pick prefered field from k8s node object. It was introduced in metrics-server 0.3 and changed default behaviour to use DNS instead of IP addresses. It was merged into k8s 1.12 and caused breaking change by introducing dependency on DNS configuration. * refactor detach azure disk retry operation * move disk lock process to azure cloud provider fix comments fix import keymux check error add unit test for attach/detach disk funcs fix build error fix build error * e2e-node-tests: fix path to system specs e2e-node tests may use custom system specs for validating nodes to conform the specs. The functionality is switched on when the tests are run with this command: make SYSTEM_SPEC_NAME=gke test-e2e-node Currently the command fails with the error: F1228 16:12:41.568836 34514 e2e_node_suite_test.go:106] Failed to load system spec: open /home/rojkov/go/src/k8s.io/kubernetes/k8s.io/kubernetes/cmd/kubeadm/app/util/system/specs/gke.yaml: no such file or directory Move the spec file under `test/e2e_node/system/specs` and introduce a single public constant referring the file to use instead of multiple private constants. * Fix concurrent map access in Portworx create volume call Fixes kubernetes#76340 Signed-off-by: Harsh Desai <harsh@portworx.com> * add shareName param in azure file storage class skip create azure file if it exists * Update Cluster Autoscaler to 1.13.4 * Create the "internal" firewall rule for kubemark master. This is equivalent to the "internal" firewall rule that is created for the regular masters. The main reason for doing it is to allow prometheus scraping metrics from various kubemark master components, e.g. kubelet. Ref. 
kubernetes/perf-tests#503 * fix disk list corruption issue * Fix verify godeps failure for 1.13 github.com/evanphx/json-patch added a new tag at the same sha this morning: https://github.com/evanphx/json-patch/releases/tag/v4.2.0 This confused godeps. This PR updates our file to match godeps expectation. Fixes issue 77238 * Upgrade Stackdriver Logging Agent addon image from 1.6.0 to 1.6.8. * Test kubectl cp escape * Properly handle links in tar * Update the dynamic volume limit in GCE PD Currently GCE PD supports a maximum of 128 disks attached to a node for all machine types except shared-core. This PR brings the limit up to date. Change-Id: Id9dfdbd24763b6b4138935842c246b1803838b78 * Use consistent imageRef during container startup * Replace vmss update API with instance-level update API * Cleanup codes that not required any more * Add unit tests * Upgrade compute API to version 2019-03-01 * Update vendors * Fix issues because of rebase * Pick up security patches for fluentd-gcp-scaler by upgrading to version 0.5.2 * Fix race condition between actual and desired state in kubelet volume manager This PR fixes the issue kubernetes#75345. The fix changes how the volume is checked in the actual state when validating whether it can be removed from the desired state: a volume can be removed from the desired state only if it is already recorded as mounted in the actual state. The case where mounting always fails still works, because the check also validates whether the pod still exists in the pod manager. If the mount fails, the pod can be removed from the pod manager, so the volume can then also be removed from the desired state. * Error when etcd3 watch finds delete event with nil prevKV * Kubernetes version v1.13.7-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.13.md for v1.13.6. 
* check if Memory is not nil for container stats * Update k8s-dns-node-cache image version This revised image resolves kubernetes dns#292 by updating the image from `k8s-dns-node-cache:1.15.2` to `k8s-dns-node-cache:1.15.2` * In GuaranteedUpdate, retry on any error if we are working with stale data * BoundServiceAccountTokenVolume: fix InClusterConfig * fix CVE-2019-11244: `kubectl --http-cache=<world-accessible dir>` creates world-writeable cached schema files * Upgrade Azure network API version to 2018-07-01 * Terminate watchers when watch cache is destroyed * Update godeps * honor overridden tokenfile, add InClusterConfig override tests * Remove terminated pod from summary api. Signed-off-by: Lantao Liu <lantaol@google.com> * fix incorrect prometheus metrics little code refactor * Fix eviction dry-run * Revert "Use consistent imageRef during container startup" This reverts commit 26e3c86. * fix azure retry issue when return 2XX with error fix comments * Disable graceful termination for udp

* test: remove k8s.io/apiextensions-apiserver from framework There are two reason why this is useful: 1. less code to vendor into external users of the framework The following dependencies become obsolete due to this change (from `dep`): (8/23) Removed unused project github.com/grpc-ecosystem/go-grpc-prometheus (9/23) Removed unused project github.com/coreos/etcd (10/23) Removed unused project github.com/globalsign/mgo (11/23) Removed unused project github.com/go-openapi/strfmt (12/23) Removed unused project github.com/asaskevich/govalidator (13/23) Removed unused project github.com/mitchellh/mapstructure (14/23) Removed unused project github.com/NYTimes/gziphandler (15/23) Removed unused project gopkg.in/natefinch/lumberjack.v2 (16/23) Removed unused project github.com/go-openapi/errors (17/23) Removed unused project github.com/go-openapi/analysis (18/23) Removed unused project github.com/go-openapi/runtime (19/23) Removed unused project sigs.k8s.io/structured-merge-diff (20/23) Removed unused project github.com/go-openapi/validate (21/23) Removed unused project github.com/coreos/go-systemd (22/23) Removed unused project github.com/go-openapi/loads (23/23) Removed unused project github.com/munnerz/goautoneg 2. works around kubernetes#75338 which currently breaks vendoring Some recent changes to crd_util.go must now be pulling in the broken k8s.io/apiextensions-apiserver packages, because it was still working in revision 2e90d92 (as demonstrated by https://github.com/intel/pmem-CSI/tree/586ae281ac2810cb4da6f1e160cf165c7daf0d80). * update Bazel files * test: fix golint warnings in crd_util.go Because the code was moved, golint is now active. Because users of the code must adapt to the new location of the code, it makes sense to also change the API at the same time to address the style comments from golint ("struct field ApiGroup should be APIGroup", same for ApiExtensionClient). 
* fix race condition issue for smb mount on windows change var name * stop vsphere cloud provider from spamming logs with `failed to patch IP` Fixes: kubernetes#75236 * Remove reference to USE_RELEASE_NODE_BINARIES. This variable was used for development purposes and was accidentally introduced in kubernetes@f0f7829. This is its only use in the tree: https://github.com/kubernetes/kubernetes/search?q=USE_RELEASE_NODE_BINARIES&unscoped_q=USE_RELEASE_NODE_BINARIES * Clear conntrack entries on 0 -> 1 endpoint transition with externalIPs As part of the endpoint creation process when going from 0 -> 1 conntrack entries are cleared. This is to prevent an existing conntrack entry from preventing traffic to the service. Currently the system ignores the existance of the services external IP addresses, which exposes that errant behavior This adds the externalIP addresses of udp services to the list of conntrack entries that get cleared. Allowing traffic to flow Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com> * Move to golang 1.12.1 official image We used 1.12.0 + hack to download 1.12.1 binaries as we were in a rush on friday since the images were not published at that time. Let's remove the hack now and republish the kube-cross image Change-Id: I3ffff3283b6ca755320adfca3c8f4a36dc1c2b9e * fix-kubeadm-init-output * Mark audit e2e tests as flaky * Bump kube-cross image to 1.12.1-2 * Restore username and password kubectl flags * build/gci: bump CNI version to 0.7.5 * Add/Update CHANGELOG-1.14.md for v1.14.0-rc.1. * Restore machine readability to the print-join-command output The output of `kubeadm token create --print-join-command` should be usable by batch scripts. 
This issue was pointed out in: kubernetes/kubeadm#1454 * bump required minimum go version to 1.12.1 (strings package compatibility) * Bump go-openapi/jsonpointer and go-openapi/jsonreference versions xref: kubernetes#75653 Signed-off-by: Jorge Alarcon Ochoa <alarcj137@gmail.com> * Kubernetes version v1.14.1-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.14.md for v1.14.0. * 1.14 release notes fixes * Add flag to enable strict ARP * Do not delete existing VS and RS when starting * Update Cluster Autscaler version to 1.14.0 No changes since 1.14.0-beta.2 Changelog: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.14.0 * Fix Windows to read VM UUIDs from serial numbers Certain versions of vSphere do not have the same value for product_uuid and product_serial. This mimics the change in kubernetes#59519. Fixes kubernetes#74888 * godeps: update vmware/govmomi to v0.20 release * vSphere: add token auth support for tags client SAML auth support for the vCenter rest API endpoint came to govmomi a bit after Zone support came to vSphere Cloud Provider. Fixes kubernetes#75511 * vsphere: govmomi rest API simulator requires authentication * gce: configure: validate SA has storage scope If the VM SA doesn't have storage scope associated, don't use the token in the curl request or the request will fail with 403. * fix-external-etcd * Update gcp images with security patches [stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes. [fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes. [fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [metatada-proxy addon] Bump prometheus-to-sd v0.5.0 to pick up security fixes. 
* kubeadm: fix "upgrade plan" not working without k8s version
If the k8s version argument passed to "upgrade plan" is missing, the logic should perform the following actions:
- fetch a "stable" version from the internet.
- if that fails, fall back to the local client version.
Currently the logic fails because cfg.KubernetesVersion is defaulted to the version of the existing cluster, which then causes an early exit without any upgrade suggestions. See app/cmd/upgrade/common.go::enforceRequirements(): configutil.FetchInitConfigurationFromCluster(..)
Fix that by passing the explicit user value, which can also be "". This will then make the "offline getter" treat it as an explicit desired upgrade target.
In the future it might be best to invert this logic:
- if no user k8s version argument is passed, default to the kubeadm version.
- if labels are passed (e.g. "stable"), fetch a version from the internet.
* Disable GCE agent address management on Windows nodes. With this metadata key set, "GCEWindowsAgent: GCE address manager status: disabled" will appear in the VM's serial port output during boot.
Tested: PROJECT=${CLOUDSDK_CORE_PROJECT} KUBE_GCE_ENABLE_IP_ALIASES=true NUM_WINDOWS_NODES=2 NUM_NODES=2 KUBERNETES_NODE_PLATFORM=windows go run ./hack/e2e.go -- --up cluster/gce/windows/smoke-test.sh cat > iis.yaml <<EOF apiVersion: v1 kind: Pod metadata: name: iis labels: app: iis spec: containers: - image: mcr.microsoft.com/windows/servercore/iis imagePullPolicy: IfNotPresent name: iis-server ports: - containerPort: 80 protocol: TCP nodeSelector: beta.kubernetes.io/os: windows tolerations: - effect: NoSchedule key: node.kubernetes.io/os operator: Equal value: windows1809 EOF kubectl create -f iis.yaml kubectl expose pod iis --type=LoadBalancer --name=iis kubectl get services curl http://<service external IP address> * kube-aggregator: bump openapi aggregation log level * Explicitly flush headers when proxying * fix-kubeadm-upgrade-12-13-14 * GCE/Windows: disable stackdriver logging agent The logging service could not be stopped at times, causing node startup failures. Disable it until the issue is fixed. * Finish saving test results on failure The conformance image should be saving its results regardless of the results of the tests. However, with errexit set, when ginkgo gets test failures it exits 1 which prevents saving the results for Sonobuoy to pick up. Fixes: kubernetes#76036 * Avoid panic in cronjob sorting This change handles the case where the ith cronjob may have its start time set to nil. Previously, the Less method could cause a panic in case the ith cronjob had its start time set to nil, but the jth cronjob did not. It would panic when calling Before on a nil StartTime. * Removed cleanup for non-current kube-proxy modes in newProxyServer() * Depricated --cleanup-ipvs flag in kube-proxy * Fixed old function signature in kube-proxy tests. * Revert "Deprecated --cleanup-ipvs flag in kube-proxy" This reverts commit 4f1bb2b. * Revert "Fixed old function signature in kube-proxy tests." This reverts commit 29ba1b0. 
* Fixed --cleanup-ipvs help text * Check for required name parameter in dynamic client The Create, Delete, Get, Patch, Update and UpdateStatus methods in the dynamic client all expect the name parameter to be non-empty, but did not validate this requirement, which could lead to a panic. Add explicit checks to these methods. * Fix empty array expansion error in cluster/gce/util.sh Empty array expansion causes "unbound variable" error in bash 4.2 and bash 4.3. * Improve volume operation metrics * Add e2e tests * ensuring that logic is checking for differences in listener * Kubernetes version v1.14.2-beta.0 openapi-spec file updates * Delete only unscheduled pods if node doesn't exist anymore. * Add/Update CHANGELOG-1.14.md for v1.14.1. * Use Node-Problem-Detector v0.6.3 on GCI * proxy: Take into account exclude CIDRs while deleting legacy real servers * kubeadm: Don't error out on join with --cri-socket override In the case where newControlPlane is true we don't go through getNodeRegistration() and initcfg.NodeRegistration.CRISocket is empty. This forces DetectCRISocket() to be called later on, and if there is more than one CRI installed on the system, it will error out, while asking for the user to provide an override for the CRI socket. Even if the user provides an override, the call to DetectCRISocket() can happen too early and thus ignore it (while still erroring out). However, if newControlPlane == true, initcfg.NodeRegistration is not used at all and it's overwritten later on. Thus it's necessary to supply some default value, that will avoid the call to DetectCRISocket() and as initcfg.NodeRegistration is discarded, setting whatever value here is harmless. Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com> * Bump coreos/go-semver The https://github.com/coreos/go-semver/ dependency has formally release v0.3.0 at commit e214231b295a8ea9479f11b70b35d5acf3556d9b. 
This is the commit point we've been using, but the hack/verify-godeps.sh script notices the discrepancy and causes the ci-kubernetes-verify job to fail. Fixes: kubernetes#76526 Signed-off-by: Tim Pepper <tpepper@vmware.com>
* Fix Azure SLB support for multiple backend pools
Azure VM and vmssVM support multiple backend pools for the same SLB, but not for different LBs.
* Restore metrics-server using of IP addresses
This preference list is used to pick the preferred field from the k8s node object. It was introduced in metrics-server 0.3 and changed the default behaviour to use DNS instead of IP addresses. It was merged into k8s 1.12 and caused a breaking change by introducing a dependency on DNS configuration.
* refactor detach azure disk retry operation
* move disk lock process to azure cloud provider
fix comments
fix import keymux check error
add unit test for attach/detach disk funcs
* Fix concurrent map access in Portworx create volume call
Fixes kubernetes#76340 Signed-off-by: Harsh Desai <harsh@portworx.com>
* Fix race condition between actual and desired state in kubelet volume manager
This PR fixes issue kubernetes#75345. The fix changes how the actual state is checked when validating whether a volume can be removed from the desired state: the volume can be removed only if the actual state already records it as mounted. The case where mounting always fails still works, because the check also validates whether the pod still exists in the pod manager. If the mount fails, the pod should still be removable from the pod manager, so that the volume can also be removed from the desired state.
* fix validation message: apiServerEndpoints -> apiServerEndpoint
* add shareName param in azure file storage class
skip create azure file if it exists
* Update Cluster Autoscaler to 1.14.2
* Create the "internal" firewall rule for kubemark master. This is equivalent to the "internal" firewall rule that is created for the regular masters.
The main reason for doing it is to allow prometheus scraping metrics from various kubemark master components, e.g. kubelet. Ref. kubernetes/perf-tests#503 * fix disk list corruption issue * Restrict builds to officially supported platforms Prior to this change, including windows/amd64 in KUBE_BUILD_PLATFORMS would, for example, attempt to build the server binaries/tars/images for Windows, which is not supported. This can break downstream build steps. * Fix verify godeps failure github.com/evanphx/json-patch added a new tag at the same sha this morning: https://github.com/evanphx/json-patch/releases/tag/v4.2.0 This confused godeps. This PR updates our file to match godeps expectation. Fixes issue 77238 * Upgrade Stackdriver Logging Agent addon image from 1.6.0 to 1.6.8. * Test kubectl cp escape * Properly handle links in tar * Bump debian-iptables versions to v11.0.2. * os exit when option is true * Pin GCE Windows node image to 1809 v20190312. This is to work around kubernetes#76666. * Update the dynamic volume limit in GCE PD Currently GCE PD support 128 maximum disks attached to a node for all machines types except shared-core. This PR updates the limit number to date. Change-Id: Id9dfdbd24763b6b4138935842c246b1803838b78 * Use consistent imageRef during container startup * Replace vmss update API with instance-level update API commit * Cleanup codes that not required any more * Add unit tests * Upgrade compute API to version 2019-03-01 * Update vendors * Fix issues because of rebase * Pick up security patches for fluentd-gcp-scaler by upgrading to version 0.5.2 * Short-circuit quota admission rejection on zero-delta updates * Accept admission request if resource is being deleted * Error when etcd3 watch finds delete event with nil prevKV * Bump addon-manager to v9.0.1 - Rebase image on debian-base:v1.0.0. * Remove terminated pod from summary api. 
Signed-off-by: Lantao Liu <lantaol@google.com> * Expect the correct object type to be removed * check if Memory is not nil for container stats * Fix eviction dry-run * Update k8s-dns-node-cache image version This revised image resolves kubernetes dns#292 by updating the image from `k8s-dns-node-cache:1.15.2` to `k8s-dns-node-cache:1.15.2` * Update to go 1.12.4 * Update to go 1.12.5 * Bump ip-masq-agent version to v2.3.0 * fix incorrect prometheus metrics fix left incorrect metrics * In GuaranteedUpdate, retry on any error if we are working with stale data * BoundServiceAccountTokenVolume: fix InClusterConfig * Don't create a RuntimeClassManager without a KubeClient * Kubernetes version v1.14.3-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.14.md for v1.14.2. * fix CVE-2019-11244: `kubectl --http-cache=<world-accessible dir>` creates world-writeable cached schema files * Upgrade Azure network API version to 2018-07-01 * Update godeps * Terminate watchers when watch cache is destroyed * honor overridden tokenfile, add InClusterConfig override tests * Don't use mapfile as it isn't bash 3 compatible * fix unbound array variable * fix unbound variable release.sh * Don't use declare -g in build * Check KUBE_SERVER_PLATFORMS existence when compile kubectl on platform other than linux/amd64, we need to check the KUBE_SERVER_PLATFORMS array emptiness before assign it. the example command is: make WHAT=cmd/kubectl KUBE_BUILD_PLATFORMS="darwin/amd64 windows/amd64" * Backport of kubernetes#78137: godeps: update vmware/govmomi to v0.20.1 Cannot cherry-pick kubernetes#78137 (go mod vs godep) Includes fix for SAML token auth with vSphere and zones API Issue kubernetes#77360 See also: kubernetes#75742 * fix: failed to close kubelet->API connections on heartbeat failure * Revert "Use consistent imageRef during container startup" This reverts commit 26e3c86. 
* fix azure retry issue when return 2XX with error fix comments * Disable graceful termination for udp * cherry pick of 017f57a, had to do a very simple merge of BUILD * Fix memory leak from not closing hcs container handles * Fix volume mount tests issue for windows For windows node, security context is disabled. This PR fixes a bug so that fsGroup will not be applied to pods that run on windows node. Change-Id: Id9870416d2ad8ef791b3b4896d6747a2adbada2f * Kubernetes version v1.14.4-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.14.md for v1.14.3. * Fix kubectl apply skew test with extra properties * fix: update vm if detach a non-existing disk fix gofmt issue * picked up extra unnecessary dep in merge at least verify build thinks its unnecessary * Move CSIDriver Lister to the controller * Fix incorrect procMount defaulting * vSphere: allow SAML token delegation Issue kubernetes#77360 * Use any host that mounts the datastore to create Volume Also, This change makes zone to work per datacenter and cleans up dummy vms. There can be multiple datastores found for a given name. The datastore name is unique only within a datacenter. So this commit returns a list of datastores for a given datastore name in FindDatastoreByName() method. The calles are responsible to handle or find the right datastore to use among those returned. * ipvs: fix string check for IPVS protocol during graceful termination Signed-off-by: Andrew Sy Kim <kiman@vmware.com> * fix flexvol stuck issue due to corrupted mnt point fix comments about PathExists fix comments revert change in PathExists func * Avoid the default server mux * Ignore cgroup pid support if related feature gates are disabled * kubelet: retry pod sandbox creation when containers were never created If kubelet never gets past sandbox creation (i.e., never attempted to create containers for a pod), it should retry the sandbox creation on failure, regardless of the restart policy of the pod. 
* Default resourceGroup should be used when value of annotation azure-load-balancer-resource-group is empty string * fix kubelet can not delete orphaned pod directory when the kubelet's root directory symbolically links to another device's directory * Allow unit test to pass on machines without ipv6 * Fix AWS DHCP option set domain names causing garbled InternalDNS or Hostname addresses on Node * Fix closing of dirs in doSafeMakeDir This fixes the issue where "childFD" from syscall.Openat is assigned to a local variable inside the for loop, instead of the correct one in the function scope. This results in that when trying to close the "childFD" in the function scope, it will be equal to "-1", instead of the correct value. * There are various reasons that the HPA will decide not the change the current scale. Two important ones are when missing metrics might change the direction of scaling, and when the recommended scale is within tolerance of the current scale. The way that ReplicaCalculator signals it's desire to not change the current scale is by returning the current scale. However the current scale is from scale.Status.Replicas and can be larger than scale.Spec.Replicas (e.g. during Deployment rollout with configured surge). This causes a positive feedback loop because scale.Status.Replicas is written back into scale.Spec.Replicas, further increasing the current scale. This PR fixes the feedback loop by plumbing the replica count from spec through horizontal.go and replica_calculator.go so the calculator can punt with the right value. * edit google dns hostname
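
The doSafeMakeDir entry above describes a variable-shadowing bug: a value assigned with `:=` inside a loop goes to a new loop-local variable, so the function-scope variable keeps its initial value of -1. The following is a minimal sketch of that pattern only, not the kubelet code. `openChild` is a hypothetical stand-in for syscall.Openat, and the fd numbers are made up.

```go
package main

import "fmt"

// openChild is a hypothetical stand-in for syscall.Openat. It returns a fake
// file descriptor so the example runs without touching the file system.
func openChild(i int) (int, error) {
	return 100 + i, nil
}

// lastFDShadowed reproduces the bug pattern: ":=" inside the loop declares a
// NEW childFD scoped to the loop body, so the outer childFD is never updated.
func lastFDShadowed(n int) int {
	childFD := -1
	for i := 0; i < n; i++ {
		childFD, err := openChild(i) // shadows the outer childFD
		if err != nil {
			return -1
		}
		_ = childFD // the inner variable is used, so this compiles
	}
	return childFD // still -1: a later close would act on the wrong value
}

// lastFDFixed assigns with "=" so the function-scope childFD is updated.
func lastFDFixed(n int) int {
	childFD := -1
	for i := 0; i < n; i++ {
		var err error
		childFD, err = openChild(i) // plain assignment to the outer variable
		if err != nil {
			return -1
		}
	}
	return childFD
}

func main() {
	fmt.Println(lastFDShadowed(3)) // prints -1
	fmt.Println(lastFDFixed(3))    // prints 102
}
```

Declaring `err` separately and using plain assignment is one common way to avoid the shadowing; the actual patch may have been structured differently.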

* Fix kubernetes#73479 AWS NLB target groups missing tags `elbv2.AddTags` doesn't seem to support assigning the same set of tags to multiple resources at once leading to the following error: Error adding tags after modifying load balancer targets: "ValidationError: Only one resource can be tagged at a time" This can happen when using AWS NLB with multiple listeners pointing to different node ports. When k8s creates a NLB it creates a target group per listener along with installing security group ingress rules allowing the traffic to reach the k8s nodes. Unfortunately if those target groups are not tagged, k8s will not manage them, thinking it is not the owner. This small changes assigns tags one resource at a time instead of batching them as before. Signed-off-by: Brice Figureau <brice@daysofwonder.com> * record event on endpoint update failure * Fix scanning of failed targets If a iSCSI target is down while a volume is attached, reading from /sys/class/iscsi_host/host415/device/session383/connection383:0/iscsi_connection/connection383:0/address fails with an error. Kubelet should assume that such target is not available / logged in and try to relogin. Eventually, if such error persists, it should continue mounting the volume if the other paths are healthy instead of failing whole WaitForAttach(). * Applies zone labels to newly created vsphere volumes * Provision vsphere volume honoring zones * Explicitly set GVK when sending objects to webhooks * Remove reflector metrics as they currently cause a memory leak * add health plugin in the DNS tests * add more logging in azure disk attach/detach * Kubernetes version v1.13.5-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.13.md for v1.13.4. * add Azure Container Registry anonymous repo support apply fix for msi and fix test failure * DaemonSet e2e: Update image and rolling upgrade test timeout Use Nginx as the DaemonSet image instead of the ServeHostname image. 
This was changed because the ServeHostname has a sleep after terminating which makes it incompatible with the DaemonSet Rolling Upgrade e2e test. In addition, make the DaemonSet Rolling Upgrade e2e test timeout a function of the number of nodes that make up the cluster. This is required because the more nodes there are, the longer the time it will take to complete a rolling upgrade. Signed-off-by: Alexander Brand <alexbrand09@gmail.com> * Revert kubelet to default to ttl cache secret/configmap behavior * cri_stats_provider: overload nil as 0 for exited containers stats Always report 0 cpu/memory usage for exited containers to make metrics-server work as expect. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> * flush iptable chains first and then remove them while cleaning up ipvs mode. flushing iptable chains first and then remove the chains. this avoids trying to remove chains that are still referenced by rules in other chains. fixes kubernetes#70615 * Checks whether we have cached runtime state before starting a container that requests any device plugin resource. If not, re-issue Allocate grpc calls. This allows us to handle the edge case that a pod got assigned to a node even before it populates its extended resource capacity. * Fix panic in kubectl cp command * Bump debian-iptables to v11.0.1 Rebase docker image on debian-base:0.4.1 * Adding a check to make sure UseInstanceMetadata flag is true to get data from metadata. * GetMountRefs fixed to handle corrupted mounts by treating it like an unmounted volume * Fix the network policy tests. This is a cherrypick of the following commit https://github.com/kubernetes/kubernetes/pull/74290/commits * Update Cluster Autoscaler version to 1.13.2 * Ensure Azure load balancer cleaned up on 404 or 403 * Allow disable outbound snat when Azure standard load balancer is used * Allow session affinity a period of time to setup for new services. This is to deal with the flaky session affinity test. 
* Distinguish volume path with mount path * Delay CSI client initialization * kubelet: updated logic of verifying a static critical pod - check if a pod is static by its static pod info - meanwhile, check if a pod is critical by its corresponding mirror pod info * Restore username and password kubectl flags * build/gci: bump CNI version to 0.7.5 * fix smb unmount issue on Windows fix log warning use IsCorruptedMnt in GetMountRefs on Windows use errorno in IsCorruptedMnt check fix comments: add more error code add more error no checking change year fix comments * fix race condition issue for smb mount on windows change var name * Fix aad support in kubectl for sovereign cloud * make describers of different versions work properly when autoscaling/v2beta2 is not supported * allows configuring NPD release and flags on GCI and add cluster e2e test * allows configuring NPD image version in node e2e test and fix the test * bump repd min size in e2es * Kubernetes version v1.13.6-beta.0 openapi-spec file updates * stop vsphere cloud provider from spamming logs with `failed to patch IP` Fixes: kubernetes#75236 * Add/Update CHANGELOG-1.13.md for v1.13.5. * Add flag to enable strict ARP * Do not delete existing VS and RS when starting * Fix updating 'currentMetrics' field for HPA with 'AverageValue' target * Update config tests * Bump go-openapi/jsonpointer and go-openapi/jsonreference versions xref: kubernetes#75653 Signed-off-by: Jorge Alarcon Ochoa <alarcj137@gmail.com> * Fix nil pointer dereference panic in attachDetachController add check `attachableVolumePlugin == nil` to operationGenerator.GenerateDetachVolumeFunc() * if ephemeral-storage not exist in initialCapacity, don't upgrade ephemeral-storage in node status * Update gcp images with security patches [stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes. 
[fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes. [fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes. [metatada-proxy addon] Bump prometheus-to-sd v0.5.0 to pick up security fixes. * Fix AWS driver fails to provision specified fsType * Bump debian-iptables to v11.0.2. * Avoid panic in cronjob sorting This change handles the case where the ith cronjob may have its start time set to nil. Previously, the Less method could cause a panic in case the ith cronjob had its start time set to nil, but the jth cronjob did not. It would panic when calling Before on a nil StartTime. * Updated regional PD minimum size; changed regional PD failover test to use StorageClassTest to generate PVC template * Check for required name parameter in dynamic client The Create, Delete, Get, Patch, Update and UpdateStatus methods in the dynamic client all expect the name parameter to be non-empty, but did not validate this requirement, which could lead to a panic. Add explicit checks to these methods. * disable HTTP2 ingress test * ensuring that logic is checking for differences in listener * Delete only unscheduled pods if node doesn't exist anymore. * Use Node-Problem-Detector v0.6.3 on GCI * proxy: Take into account exclude CIDRs while deleting legacy real servers * Update addon-manager to use debian-base:v1.0.0 * Increase default maximumLoadBalancerRuleCount to 250 * Set CPU metrics for init containers under containerd metrics-server doesn't return metrics for pods with init containers under containerd because they have incomplete CPU metrics returned by the kubelet /stats/summary API. This problem has been fixed in 1.14 (kubernetes#74336), but the cherry-picks dropped the `usageNanoCores` metric. This change adds the missing `usageNanoCores` metric for init containers. 
Fixes kubernetes#76292 * kube-proxy: rename internal field for clarity * kube-proxy: rename vars for clarity, fix err str * kube-proxy: rename field for congruence * kube-proxy: reject 0 endpoints on forward Previously we only REJECTed on OUTPUT which works for packets from the node but not for packets from pods on the node. * kube-proxy: remove old cleanup rules * Kube-proxy: REJECT LB IPs with no endpoints We REJECT every other case. Close this FIXME. To get this to work in all cases, we have to process service in filter.INPUT, since LB IPS might be manged as local addresses. * Retool HTTP and UDP e2e utils This is a prefactoring for followup changes that need to use very similar but subtly different test. Now it is more generic, though it pushes a little logic up the stack. That makes sense to me. * Fix small race in e2e Occasionally we get spurious errors about "no route to host" when we race with kube-proxy. This should reduce that. It's mostly just log noise. * Bump coreos/go-semver The https://github.com/coreos/go-semver/ dependency has formally release v0.3.0 at commit e214231b295a8ea9479f11b70b35d5acf3556d9b. This is the commit point we've been using, but the hack/verify-godeps.sh script notices the discrepancy and causes ci-kubernetes-verify job to fail. Fixes: kubernetes#76526 Signed-off-by: Tim Pepper <tpepper@vmware.com> * Fix Azure SLB support for multiple backend pools Azure VM and vmssVM support multiple backend pools for the same SLB, but not for different LBs. * Kubelet: add usageNanoCores from CRI stats provider * Fix computing of cpu nano core usage CRI runtimes do not supply cpu nano core usage as it is not part of CRI stats. However, there are upstream components that still rely on such stats to function. The previous fix was faulty because the multiple callers could compete and update the stats, causing inconsistent/incoherent metrics. 
This change, instead, creates a separate call for updating the usage, and rely on eviction manager, which runs periodically, to trigger the updates. The caveat is that if eviction manager is completley turned off, no one would compute the usage. * Restore metrics-server using of IP addresses This preference list matches is used to pick prefered field from k8s node object. It was introduced in metrics-server 0.3 and changed default behaviour to use DNS instead of IP addresses. It was merged into k8s 1.12 and caused breaking change by introducing dependency on DNS configuration. * refactor detach azure disk retry operation * move disk lock process to azure cloud provider fix comments fix import keymux check error add unit test for attach/detach disk funcs fix build error fix build error * e2e-node-tests: fix path to system specs e2e-node tests may use custom system specs for validating nodes to conform the specs. The functionality is switched on when the tests are run with this command: make SYSTEM_SPEC_NAME=gke test-e2e-node Currently the command fails with the error: F1228 16:12:41.568836 34514 e2e_node_suite_test.go:106] Failed to load system spec: open /home/rojkov/go/src/k8s.io/kubernetes/k8s.io/kubernetes/cmd/kubeadm/app/util/system/specs/gke.yaml: no such file or directory Move the spec file under `test/e2e_node/system/specs` and introduce a single public constant referring the file to use instead of multiple private constants. * Fix concurrent map access in Portworx create volume call Fixes kubernetes#76340 Signed-off-by: Harsh Desai <harsh@portworx.com> * add shareName param in azure file storage class skip create azure file if it exists * Update Cluster Autoscaler to 1.13.4 * Create the "internal" firewall rule for kubemark master. This is equivalent to the "internal" firewall rule that is created for the regular masters. The main reason for doing it is to allow prometheus scraping metrics from various kubemark master components, e.g. kubelet. Ref. 
kubernetes/perf-tests#503
* fix disk list corruption issue
* Fix verify godeps failure for 1.13
github.com/evanphx/json-patch added a new tag at the same sha this morning: https://github.com/evanphx/json-patch/releases/tag/v4.2.0 This confused godeps. This PR updates our file to match godeps expectation. Fixes issue 77238
* Upgrade Stackdriver Logging Agent addon image from 1.6.0 to 1.6.8.
* Test kubectl cp escape
* Properly handle links in tar
* Update the dynamic volume limit in GCE PD
Currently GCE PD supports a maximum of 128 disks attached to a node for all machine types except shared-core. This PR brings the limit up to date. Change-Id: Id9dfdbd24763b6b4138935842c246b1803838b78
* Use consistent imageRef during container startup
* Replace vmss update API with instance-level update API
* Clean up code that is not required any more
* Add unit tests
* Upgrade compute API to version 2019-03-01
* Update vendors
* Fix issues because of rebase
* Pick up security patches for fluentd-gcp-scaler by upgrading to version 0.5.2
* Fix race condition between actual and desired state in kubelet volume manager
This PR fixes issue kubernetes#75345. The fix changes how the actual state is checked when validating whether a volume can be removed from the desired state: the volume can be removed only if the actual state already records it as mounted. The case where mounting always fails still works, because the check also validates whether the pod still exists in the pod manager. If the mount fails, the pod should still be removable from the pod manager, so that the volume can also be removed from the desired state.
* Short-circuit quota admission rejection on zero-delta updates
* Error when etcd3 watch finds delete event with nil prevKV
* Accept admission request if resource is being deleted
* Kubernetes version v1.13.7-beta.0 openapi-spec file updates
* Add/Update CHANGELOG-1.13.md for v1.13.6.
* Bump addon-manager to v8.9.1 - Rebase image on debian-base:v1.0.0 * check if Memory is not nil for container stats * Update k8s-dns-node-cache image version This revised image resolves kubernetes dns#292 by updating the image from `k8s-dns-node-cache:1.15.2` to `k8s-dns-node-cache:1.15.2` * Bump ip-masq-agent version to v2.3.0 * In GuaranteedUpdate, retry on any error if we are working with stale data * BoundServiceAccountTokenVolume: fix InClusterConfig * fix CVE-2019-11244: `kubectl --http-cache=<world-accessible dir>` creates world-writeable cached schema files * Upgrade Azure network API version to 2018-07-01 * Terminate watchers when watch cache is destroyed * Update godeps * honor overridden tokenfile, add InClusterConfig override tests * Remove terminated pod from summary api. Signed-off-by: Lantao Liu <lantaol@google.com> * fix incorrect prometheus metrics little code refactor * Fix eviction dry-run * Revert "Use consistent imageRef during container startup" This reverts commit 26e3c86. * fix azure retry issue when return 2XX with error fix comments * Disable graceful termination for udp * Kubernetes version v1.13.8-beta.0 openapi-spec file updates * Add/Update CHANGELOG-1.13.md for v1.13.7. * fix: update vm if detach a non-existing disk fix gofmt issue * Fix incorrect procMount defaulting * ipvs: fix string check for IPVS protocol during graceful termination Signed-off-by: Andrew Sy Kim <kiman@vmware.com> * fix flexvol stuck issue due to corrupted mnt point fix comments about PathExists fix comments revert change in PathExists func * Avoid the default server mux * kubelet: retry pod sandbox creation when containers were never created If kubelet never gets past sandbox creation (i.e., never attempted to create containers for a pod), it should retry the sandbox creation on failure, regardless of the restart policy of the pod. 
* Default resourceGroup should be used when value of annotation azure-load-balancer-resource-group is empty string * Replace bitbucket with github This commit has the following changes: - Replace `bitbucket.org/ww/goautoneg` with `github.com/munnerz/goautoneg`. - Replace `bitbucket.org/bertimus9/systemstat` with `github.com/nikhita/systemstat`. - Bump kube-openapi to remove so that it's dependency on `bitbucket.org/ww/goautoneg` moves to `github.com/munnerz/goautoneg`. - Generate `swagger.json` generated from the above change. - Update `BUILD` files. Bitbucket is replaced with GitHub because: Atlassian finally pulled the plug on their 1.0 api and forces everyone to use 2.0 now: https://developer.atlassian.com/cloud/bitbucket/deprecation-notice-v1-apis/ This leads to an error like: ``` godep: error downloading dep (bitbucket.org/ww/goautoneg): https://api.bitbucket.org/1.0/repositories/ww/goautoneg: 410 Gone ``` This was fixed in upstream go in golang/tools@13ba8ad. To fix this in k/k: 1) We'll need to either bump our vendored version https://github.com/kubernetes/kubernetes/blob/release-1.13/vendor/golang.org/x/tools/go/vcs/vcs.go#L676. However, this bump brings in _lots_ of changes. 2) We can entirely remove our dependency on bitbucket. The second point is better because: 1) godep itself vendors in an older version: https://github.com/tools/godep/blob/master/vendor/golang.org/x/tools/go/vcs/vcs.go#L667. This means that anyone who installs godep directly, without forking it, will not be able to use it with Kubernetes if we stick to bitbucket. 2) Bumping `golang/x/tools` requires running `godep restore`, which doesn't work because that uses the 1.0 api...leading to a catch-22 like situation. 
* Allow unit test to pass on machines without ipv6
* fix kubelet can not delete orphaned pod directory when the kubelet's root directory symbolically links to another device's directory
* Fix AWS DHCP option set domain names causing garbled InternalDNS or Hostname addresses on Node
* Fix closing of dirs in doSafeMakeDir
  This fixes the issue where "childFD" from syscall.Openat is assigned to a local variable inside the for loop, instead of the correct one in the function scope. As a result, when the function tries to close the "childFD" in the function scope, it is equal to "-1" instead of the correct value.
* There are various reasons that the HPA will decide not to change the current scale. Two important ones are when missing metrics might change the direction of scaling, and when the recommended scale is within tolerance of the current scale.
  The way that ReplicaCalculator signals its desire to not change the current scale is by returning the current scale. However, the current scale is taken from scale.Status.Replicas and can be larger than scale.Spec.Replicas (e.g. during a Deployment rollout with configured surge). This causes a positive feedback loop because scale.Status.Replicas is written back into scale.Spec.Replicas, further increasing the current scale.
  This PR fixes the feedback loop by plumbing the replica count from spec through horizontal.go and replica_calculator.go so the calculator can punt with the right value.
* edit google dns hostname
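
The HPA feedback loop described in the commit list above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the actual controller code: the `reconcile` function, its parameters, and the constant surge on every sync are all hypothetical stand-ins for what horizontal.go and replica_calculator.go do.

```python
def reconcile(spec_replicas, surge, use_spec, rounds):
    """Simulate `rounds` HPA syncs in which the calculator decides not to scale.

    Hypothetical model: a rollout keeps `surge` extra pods alive, so the
    observed (status) replica count is always spec_replicas + surge.
    """
    for _ in range(rounds):
        # During a rollout with surge, observed pods exceed the desired count.
        status_replicas = spec_replicas + surge
        # "No change" is signalled by returning the current scale; the bug is
        # in which value is treated as the current scale.
        current = spec_replicas if use_spec else status_replicas
        # The controller writes the returned value back into the spec.
        spec_replicas = current
    return spec_replicas


# Reading the current scale from status inflates the spec on every sync.
buggy = reconcile(spec_replicas=10, surge=2, use_spec=False, rounds=3)
# Reading it from spec leaves the desired count unchanged.
fixed = reconcile(spec_replicas=10, surge=2, use_spec=True, rounds=3)
print(buggy, fixed)
```

With these made-up numbers the status-based variant grows the spec by the surge on every sync, while the spec-based variant stays where it started.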

fejta-bot · 2019-07-19T16:28:02Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-08-18T17:23:58Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-08-18T17:24:06Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jingxu97 added the kind/bug label Mar 13, 2019

k8s-ci-robot added the needs-sig label Mar 13, 2019

jingxu97 self-assigned this Mar 13, 2019

jingxu97 added the sig/storage label Mar 13, 2019

k8s-ci-robot removed the needs-sig label Mar 13, 2019

jingxu97 mentioned this issue Mar 13, 2019

[Flaky test] When kubelet restarts Should test that a volume mounted to a pod that is deleted while the kubelet is down unmounts when the kubelet returns #75328

Closed

k8s-ci-robot added the sig/node label Mar 16, 2019

k8s-ci-robot removed the sig/node label Mar 18, 2019

k8s-ci-robot added this to the v1.14 milestone Mar 18, 2019

k8s-ci-robot added the priority/important-soon label Mar 18, 2019

jingxu97 mentioned this issue Mar 19, 2019

Fix race condition between actual and desired state in kublet volume manager #75458

Merged

k8s-ci-robot removed this from the v1.14 milestone Mar 21, 2019

nikopen mentioned this issue Mar 24, 2019

[Failing Test] [sig-storage] subPath should unmount if pod is gracefully deleted while kubelet is down #75643

Closed

cofyc mentioned this issue Apr 20, 2019

[Flaking test] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down #75326

Closed

jingxu97 mentioned this issue Apr 24, 2019

Fix race condition between actual and desired state in kublet volume manager #76980

Merged

jingxu97 mentioned this issue May 2, 2019

Fix race condition between actual and desired state in kublet volume … #77351

Merged

k8s-ci-robot added the lifecycle/stale label Jun 19, 2019

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 19, 2019

k8s-ci-robot closed this as completed Aug 18, 2019
