Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066 #68058

Merged (1 commit) on Sep 10, 2018

@nikopen
Contributor

nikopen commented Aug 30, 2018

Cherry pick of #67825 on release-1.11.

#67825: Fix VMWare VM freezing bug by reverting #51066

@k8s-ci-robot added the labels do-not-merge/cherry-pick-not-approved, size/XS, needs-ok-to-test, and cncf-cla: yes on Aug 30, 2018
@nikopen
Contributor Author

nikopen commented Aug 30, 2018

/assign @saad-ali @foxish

@davidz627
Contributor

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Aug 30, 2018
@nikopen
Contributor Author

nikopen commented Aug 30, 2018

/retest

@nikopen
Contributor Author

nikopen commented Aug 31, 2018

/retest

@divyenpatel
Member

@nikopen @gnufied @SandeepPissay

This is the behavior on the 1.11 branch with this change applied.

Created a deployment.
The Pod was scheduled on kubernetes-node4, and its volumes were attached to that node.

# kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
wordpress-mysql-56647d4597-nvq46   1/1       Running   0          1m
# kubectl describe pod wordpress-mysql-56647d4597-nvq46 | grep Node:
Node:           kubernetes-node4/10.192.91.105
# kubectl get nodes     
NAME                STATUS                     ROLES     AGE       VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node1    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node2    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node3    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node4    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17

Powered off the node kubernetes-node4, on which the Pod was running.
A new Pod was spawned on the node kubernetes-node2.

# kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
wordpress-mysql-56647d4597-hp4rv   1/1       Running   0          7m
# kubectl describe pod wordpress-mysql-56647d4597-hp4rv | grep Node:
Node:           kubernetes-node2/10.192.71.62

The powered-off node was removed from the API server.

# kubectl get nodes
NAME                STATUS                     ROLES     AGE       VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node1    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node2    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node3    Ready                      <none>    2h        v1.11.3-beta.0.57+28b2e1799fea17

After 6 minutes, the volumes were successfully attached to the new node kubernetes-node2 but were not detached from the powered-off node.

An attempt to power on node kubernetes-node4 fails with the error: File system specific implementation of OpenFile[file] failed.

Stopped the kubelet on the node kubernetes-node2.

root@kubernetes-node2 [ ~ ]# systemctl stop kubelet

Node status changed to NotReady.

# kubectl get nodes
NAME                STATUS                     ROLES     AGE       VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>    3h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node1    Ready                      <none>    3h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node2    NotReady                   <none>    3h        v1.11.3-beta.0.57+28b2e1799fea17
kubernetes-node3    Ready                      <none>    3h        v1.11.3-beta.0.57+28b2e1799fea17
# kubectl get pods
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-56647d4597-hp4rv   1/1       Unknown             0          31m
wordpress-mysql-56647d4597-lmppg   0/1       ContainerCreating   0          45s

A new Pod was spawned on node kubernetes-node1.

# kubectl describe pod wordpress-mysql-56647d4597-lmppg | grep Node:
Node:           kubernetes-node1/10.192.93.102

The new Pod wordpress-mysql-56647d4597-lmppg remains in the ContainerCreating state. The volumes could not be detached from kubernetes-node2 and attached to the new node kubernetes-node1.

# kubectl get pods 
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-56647d4597-hp4rv   1/1       Unknown             0          40m
wordpress-mysql-56647d4597-lmppg   0/1       ContainerCreating   0          10m

This is the same behavior as reported for the 1.9 branch in #68056 (comment) and #68056 (comment).
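
The two stuck states in the transcripts above (a NotReady node and a Pod that stays in ContainerCreating) can be picked out of the table output mechanically. The following is a minimal sketch and not part of this PR: the function names `not_ready_nodes` and `stuck_pods` are hypothetical, and the column positions assume the default `kubectl get nodes` and `kubectl get pods` table layout shown above.

```shell
# Hypothetical helpers (not from the PR). Both read kubectl's default
# table output on stdin and skip the header row.

# Print the names of nodes whose STATUS column starts with NotReady.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1 }'
}

# Print the names of pods whose STATUS column is ContainerCreating.
stuck_pods() {
  awk 'NR > 1 && $3 == "ContainerCreating" { print $1 }'
}
```

Assuming a working kubectl context, these could be used as `kubectl get nodes | not_ready_nodes` and `kubectl get pods | stuck_pods`. Matching on column position keeps the sketch short, but it would need adjusting for `-o wide` or custom-column output.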

@nikopen
Contributor Author

nikopen commented Sep 4, 2018

/retest

@nikopen
Contributor Author

nikopen commented Sep 4, 2018

#68056 (comment)

cc @saad-ali @foxish for approval

@foxish added the labels cherrypick-candidate, cherry-pick-approved, sig/storage, and area/provider/vmware, and removed the do-not-merge/cherry-pick-not-approved label, on Sep 4, 2018
@foxish added this to the v1.11 milestone on Sep 4, 2018
@foxish
Contributor

foxish commented Sep 4, 2018

Cherrypick approved. cc/ @saad-ali for LGTM and approval.

@foxish removed the cherry-pick-approved label on Sep 6, 2018
@k8s-ci-robot added the do-not-merge/cherry-pick-not-approved label on Sep 6, 2018
@childsb
Contributor

childsb commented Sep 6, 2018

/kind bug
/approve
/lgtm

@k8s-ci-robot added the kind/bug label on Sep 6, 2018
@k8s-ci-robot added the lgtm label on Sep 6, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: childsb, nikopen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Sep 6, 2018
@childsb added the cherry-pick-approved label and removed the do-not-merge/cherry-pick-not-approved and cherrypick-candidate labels on Sep 6, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

/test all

Tests are more than 96 hours old. Re-running tests.

@k8s-ci-robot merged commit f30a74e into kubernetes:release-1.11 on Sep 10, 2018
@redbaron
Contributor

@foxish, is this regression serious enough to justify a new 1.11.4 release? We've been hit by it, but would also benefit from the other bugfixes in 1.11.3.

@nikopen
Contributor Author

nikopen commented Sep 24, 2018

@redbaron it didn't make it into 1.11.3 in time; I suppose 1.11.4 will be out in 2-3 weeks or so.

As a workaround, you can compile a custom kube-controller-manager with the related lines removed. This is a good guide: https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/compiling-kubernetes-binaries

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
area/provider/vmware (Issues or PRs related to vmware provider)
cherry-pick-approved (Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/bug (Categorizes issue or PR as related to a bug.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
sig/storage (Categorizes an issue or PR as relevant to SIG Storage.)
size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.)