
refactor and improve CRD publishing e2e tests in an HA setup #90452

Closed

Conversation

p0lyn0mial
Contributor

What type of PR is this?


/kind flake

What this PR does / why we need it: makes the CRD publishing e2e tests more stable, especially in an HA setup, by checking all instances of the API server before running the actual test (sketched below).

Which issue(s) this PR fixes: In our environment with multiple masters, these tests are very unstable. It turned out that the tests didn't guarantee that all instances had observed the same version of the spec before running the actual tests.

Fixes #

Special notes for your reviewer: we need to build an image with socat, and I need to refactor the portforward.go tests to reuse the RunKubectlPortForward function.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
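For readers of this description, a minimal sketch of the stabilization idea follows. It is not code from this PR: it checks discovery rather than the published OpenAPI document to stay short, and it assumes the caller already has one rest.Config per API server instance (for example via the proxy pod and port-forward discussed in the review below).

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// waitForGroupVersionOnAllServers polls every API server instance until each
// one serves the given group/version in discovery, i.e. every replica has
// observed the CRD before the actual assertions run. Illustrative sketch only.
func waitForGroupVersionOnAllServers(configs []*rest.Config, groupVersion string) error {
	for _, cfg := range configs {
		dc, err := discovery.NewDiscoveryClientForConfig(cfg)
		if err != nil {
			return err
		}
		// Retry until this particular instance has caught up.
		if err := wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
			if _, err := dc.ServerResourcesForGroupVersion(groupVersion); err != nil {
				return false, nil // not published on this instance yet
			}
			return true, nil
		}); err != nil {
			return err
		}
	}
	return nil
}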


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 24, 2020
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 24, 2020
@p0lyn0mial
Contributor Author

/assign @roycaihw @liggitt @sttts


// setupAPIServersProxyPodAndPortForward is a convenience helper that creates and runs a pod that proxies connections to the API servers.
// It also uses kubectl port-forward to route local connections to that pod.
func setupAPIServersProxyPodAndPortForward(f *framework.Framework) ([]int, func(), error) {
Member

it's unclear to me why running this code inside a pod would improve the stability of this test... something running in a pod can have the same issue reaching an arbitrary API server in an HA environment, and this seems to introduce additional ways the test can fail (issues with the pod, with the kubectl port-forward, etc.)

Contributor Author

it will improve stability because the pod exposes a proxy to all instances (the list is taken from the default kubernetes endpoints resource), and we check the spec from every replica before running the actual tests.

before, we checked a public endpoint that was behind a load balancer, so we didn't know whether we had reached all replicas.

}
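As an illustration of the proxy-pod idea described above (not code from this PR), the local ports returned by a helper like setupAPIServersProxyPodAndPortForward could be turned into per-replica clients roughly like this; rest.CopyConfig is real client-go API, but the TLS handling is only noted as an assumption.

import (
	"fmt"

	"k8s.io/client-go/rest"
)

// configsForLocalPorts builds one rest.Config per locally forwarded port,
// each pointing at a single API server replica behind the proxy pod.
func configsForLocalPorts(base *rest.Config, localPorts []int) []*rest.Config {
	configs := make([]*rest.Config, 0, len(localPorts))
	for _, port := range localPorts {
		cfg := rest.CopyConfig(base)
		cfg.Host = fmt.Sprintf("https://127.0.0.1:%d", port)
		// Assumption: the serving certificate is valid for the forwarded
		// address, or the caller adjusts cfg.TLSClientConfig.ServerName.
		configs = append(configs, cfg)
	}
	return configs
}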

func getAllAPIServersEndpoint(c k8sclientset.Interface) ([]string, error) {
	eps, err := c.CoreV1().Endpoints(metav1.NamespaceDefault).Get(context.TODO(), "kubernetes", metav1.GetOptions{})
Member

is going directly to the kubernetes endpoints like this guaranteed to be valid in all conformant environments (remember, this is modifying a conformance test)? I vaguely remember there being different configurations for the kubernetes service (e.g. #13978)

Contributor Author

It seems that the endpoints for the kubernetes service are populated by the EndpointReconciler. From what I understand, it should be okay, since the endpoints are filled in by the kube API servers unless the endpoint reconciler was turned off. Could you please double-check that?
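For context, a guess at how the rest of getAllAPIServersEndpoint could look under that assumption (the endpoint reconciler keeps the kubernetes endpoints in the default namespace up to date); the actual implementation in this PR may differ.

import (
	"context"
	"net"
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	k8sclientset "k8s.io/client-go/kubernetes"
)

func getAllAPIServersEndpoint(c k8sclientset.Interface) ([]string, error) {
	eps, err := c.CoreV1().Endpoints(metav1.NamespaceDefault).Get(context.TODO(), "kubernetes", metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	// Collect every advertised address:port pair, one per API server instance.
	var servers []string
	for _, subset := range eps.Subsets {
		for _, port := range subset.Ports {
			for _, addr := range subset.Addresses {
				servers = append(servers, net.JoinHostPort(addr.IP, strconv.Itoa(int(port.Port))))
			}
		}
	}
	return servers, nil
}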

Contributor Author

@liggitt have you had time to verify that?

Member

Running without the endpoint reconciler is a supported option (--endpoint-reconciler-type=none), as is using a node-port for the kubernetes master service (--kubernetes-service-node-port=...). Switching a conformance test to go directly to endpoints instead of through the kubernetes service doesn't seem proper.

Contributor

I would expect there to be a shim in our test framework that correctly waits for HA configurations (an HA configuration should be able to be shown to be conformant) by getting API endpoints appropriately depending on what the config is, i.e. something like h := framework.NewAPIEndpointHelper(client); h.WaitForAll(func(<gets a client config>) (bool, error) { do a test with the client to see if x is true }) that handles that.

Łukasz, do you think you would be willing to articulate a version of the helper that works in each case Jordan mentioned and hides this behavior when it's not possible (if no endpoints are set, the test may also need a way to clarify that)?

Everyone is instead doing this poorly, incorrectly, and inconsistently in the existing code - having a correct way to at least do this for the apiserver is good. I don't think it's acceptable for us to punt on HA apiservers, which are the recommended prod configuration of kube, when it comes to proving they are conformant in a reasonable way (and medium-length sleeps may not be sufficient). We definitely paper over this in other components (stuff running in deployments sized two) - so it would be nice if the underlying code could be adapted to something like the OpenShift ingress tests, or the upgrade tests that verify the data. We still have a need for probabilistic checking (i.e. if we're trying to verify an SLB directly).
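Purely as a sketch of the shape being asked for here: every name below (APIEndpointHelper, WaitForAll) is hypothetical and does not exist in the e2e framework, and the constructor that discovers per-instance configs (kubernetes endpoints, node-port, or falling back to a single LB address) is deliberately left out, since it would have to handle each of the cases Jordan listed.

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/rest"
)

// APIEndpointHelper hides how individual API server instances are discovered
// and lets a test wait for a condition to hold on every one of them.
type APIEndpointHelper struct {
	configs []*rest.Config // one per reachable API server instance
}

// WaitForAll polls the condition against each instance until it returns true
// everywhere, or the timeout expires.
func (h *APIEndpointHelper) WaitForAll(timeout time.Duration, condition func(*rest.Config) (bool, error)) error {
	for _, cfg := range h.configs {
		cfg := cfg
		if err := wait.PollImmediate(time.Second, timeout, func() (bool, error) {
			return condition(cfg)
		}); err != nil {
			return fmt.Errorf("condition not met on %s: %v", cfg.Host, err)
		}
	}
	return nil
}

A test would then call something like h.WaitForAll(2*time.Minute, func(cfg *rest.Config) (bool, error) { /* check the published spec via cfg */ }) instead of hand-rolling the loop.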

Contributor Author

Łukasz, do you think you would be willing to articulate a version of the helper that works in each case Jordan mentioned and hides this behavior when it's not possible (if no endpoints are set, the test may also need a way to clarify that)?

@smarterclayton sure, I can start with a version that understands endpoints. Many thanks.

Contributor Author

xref: #100585

p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 27, 2020
p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 28, 2020
p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 28, 2020
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Apr 28, 2020
Origin-commit: b7f9cc11026358f9eb96e08aa287d44373148d7c
p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 29, 2020
p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 29, 2020
p0lyn0mial added a commit to p0lyn0mial/origin that referenced this pull request Apr 29, 2020
p0lyn0mial added a commit to p0lyn0mial/kubernetes that referenced this pull request Apr 29, 2020
Labels
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/flake - Categorizes issue or PR as related to a flaky test.
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
release-note-none - Denotes a PR that doesn't merit a release note.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
triage/accepted - Indicates an issue or PR is ready to be actively worked on.