RePD failover test flaking #69005

Open
msau42 opened this Issue Sep 24, 2018 · 8 comments

Comments

Member

msau42 commented Sep 24, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
@kubernetes/sig-storage-bugs

What happened:
Testgrid: https://k8s-testgrid.appspot.com/sig-storage#gke-regional-serial

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/regional_pd.go:81
Error getting instance group gke-bootstrap-e2e-default-pool-b9ea428e-grp in zone us-central1-c
Expected error:
    <*googleapi.Error | 0xc422a16b90>: {
        Code: 404,
        Message: "The resource 'projects/gke-up-g1-3-clat-up-mas/zones/us-central1-c/instanceGroups/gke-bootstrap-e2e-default-pool-b9ea428e-grp' was not found",
        Body: "{\n \"error\": {\n  \"errors\": [\n   {\n    \"domain\": \"global\",\n    \"reason\": \"notFound\",\n    \"message\": \"The resource 'projects/gke-up-g1-3-clat-up-mas/zones/us-central1-c/instanceGroups/gke-bootstrap-e2e-default-pool-b9ea428e-grp' was not found\"\n   }\n  ],\n  \"code\": 404,\n  \"message\": \"The resource 'projects/gke-up-g1-3-clat-up-mas/zones/us-central1-c/instanceGroups/gke-bootstrap-e2e-default-pool-b9ea428e-grp' was not found\"\n }\n}\n",
        Header: nil,
        Errors: [
            {
                Reason: "notFound",
                Message: "The resource 'projects/gke-up-g1-3-clat-up-mas/zones/us-central1-c/instanceGroups/gke-bootstrap-e2e-default-pool-b9ea428e-grp' was not found",
            },
        ],
    }
    googleapi: Error 404: The resource 'projects/gke-up-g1-3-clat-up-mas/zones/us-central1-c/instanceGroups/gke-bootstrap-e2e-default-pool-b9ea428e-grp' was not found, notFound
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/regional_pd.go:219
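For context, a rough Go sketch (not the test's actual code) of the kind of per-zone instance-group lookup that produces a 404 like the one above, assuming the google.golang.org/api/compute/v1 client; the project and group name are copied from the log, and the zone list is only illustrative:

```go
// Sketch: querying the same instance group name in every zone of the region.
// In a regional cluster the group only exists under one zone-specific name,
// so the other zones come back 404 exactly as in the log above.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
	"google.golang.org/api/googleapi"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	project := "gke-up-g1-3-clat-up-mas"                   // from the log above
	group := "gke-bootstrap-e2e-default-pool-b9ea428e-grp" // from the log above
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c"}

	for _, zone := range zones {
		_, err := svc.InstanceGroups.Get(project, zone, group).Do()
		if gerr, ok := err.(*googleapi.Error); ok && gerr.Code == 404 {
			fmt.Printf("instance group %q not found in zone %s\n", group, zone)
		}
	}
}
```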

Member

msau42 commented Sep 24, 2018

/assign @verult

Contributor

verult commented Oct 3, 2018

The test's assumption about the NodeInstanceGroup test parameter is wrong: it assumes the group name is the same across all zones, but the names differ in regional clusters.

Looking into using node taints as the permanent fix, as suggested by the TODO comment in the test.

Member

msau42 commented Oct 3, 2018

Hm, I think taint-based evictions are still an alpha feature.

Contributor

verult commented Oct 4, 2018

OK. I propose the following solution:

  • Get the zone of NodeInstanceGroup
  • Schedule the StatefulSet pod to that zone using node affinity; wait for the pod to run
  • Remove the node affinity from StatefulSet
  • Delete NodeInstanceGroup
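A minimal sketch of the node-affinity piece of that plan (the second step), assuming the k8s.io/api types; pinToZone is a hypothetical helper and the zone label key shown is the one in use at the time of this issue, not anything taken from the test itself:

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// pinToZone adds a required node-affinity term so the StatefulSet's pods can
// only be scheduled into the given zone. Newer clusters use the
// topology.kubernetes.io/zone label instead.
func pinToZone(sts *appsv1.StatefulSet, zone string) {
	sts.Spec.Template.Spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "failure-domain.beta.kubernetes.io/zone",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{zone},
					}},
				}},
			},
		},
	}
}
```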
Member

msau42 commented Oct 4, 2018

Can it be simplified?

  • Schedule the pod
  • Figure out which zone that pod got scheduled in
  • Delete the instance group of that zone
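A minimal sketch of the "figure out which zone" step, assuming client-go; podZone is a hypothetical helper:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podZone looks up the node the pod was scheduled onto and reads its zone
// label (the pre-1.17 label key; newer clusters use topology.kubernetes.io/zone).
func podZone(ctx context.Context, cs kubernetes.Interface, namespace, podName string) (string, error) {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	node, err := cs.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return node.Labels["failure-domain.beta.kubernetes.io/zone"], nil
}
```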

Contributor

verult commented Oct 4, 2018

Unfortunately, I don't know of a good way to look up the instance group given the zone, since there could be instance groups from multiple clusters in the same project.
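For reference, a sketch of what a zone-scoped listing looks like with the compute API, assuming google.golang.org/api/compute/v1; the result mixes groups from every cluster in the project, with nothing that cleanly ties a group back to the test's cluster:

```go
package sketch

import (
	"context"
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// listGroupsInZone prints every instance group in the zone. With several
// clusters in one project there is one node-pool group per cluster here,
// and no field on compute.InstanceGroup that names the owning cluster.
func listGroupsInZone(ctx context.Context, project, zone string) error {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return err
	}
	list, err := svc.InstanceGroups.List(project, zone).Do()
	if err != nil {
		return err
	}
	for _, g := range list.Items {
		fmt.Println(g.Name) // e.g. gke-<cluster>-default-pool-<hash>-grp
	}
	return nil
}
```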

Member

msau42 commented Oct 4, 2018

How about:

  • Schedule the pod
  • Figure out the zone
  • Taint all the nodes in the zone
  • Kill the pod
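A sketch of the tainting step, assuming client-go; taintZone and the taint key are hypothetical, and the zone label is again the pre-1.17 one:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintZone puts a NoSchedule taint on every node in the given zone, so the
// replacement pod is forced to schedule into a different zone.
func taintZone(ctx context.Context, cs kubernetes.Interface, zone string) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "failure-domain.beta.kubernetes.io/zone=" + zone,
	})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    "e2e-regional-pd-failover", // hypothetical taint key
			Effect: corev1.TaintEffectNoSchedule,
		})
		if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```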

Contributor

verult commented Oct 5, 2018

Taint with NoSchedule? Yeah, that sounds good.
