
KEP-4224: Replace kube-up e2e clusters with kops clusters #4250

Open · wants to merge 5 commits into base: master

Conversation

@upodroid (Member)

  • One-line PR description: Replace kube-up e2e clusters with kops clusters

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 28, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 28, 2023
@upodroid (Member Author)

information to express the idea and why it was not acceptable.
-->

We don't have one.
Member

Was using Cluster API considered? I remember there was an issue open in k/k (can't seem to find the link) about deprecating the cluster/ directory and using cluster-api for tests.

kOps doesn't support as many infrastructure providers (everything besides AWS and GCE is alpha or beta), so it may be less versatile for running k/k tests across different clouds in the future.


@upodroid (Member Author) · Sep 29, 2023

I didn't articulate this, but CAPI requires a Kubernetes cluster to create the test cluster, which adds an extra complication. Also, fixing cluster bootstrap business logic isn't trivial; for example, this is a list of bugs I fixed or am fixing in kops: https://github.com/kubernetes/kops/pulls?q=is%3Apr+author%3Aupodroid

Also, CAPG isn't well maintained. This is what I found when I took CAPG for a spin at the end of 2022 vmware-tanzu/crash-diagnostics#243


Relevant historical slack thread:

https://kubernetes.slack.com/archives/C2C40FMNF/p1657180208598469

Google docs link to previous discussion:

https://docs.google.com/document/d/1n1Znf-SY85Bjo_B_ReKsqAvEbU4Z8Vw-AvgeiSGVXBA/edit#heading=h.xk6zt7zfht6s

Looks like this never got an owner, lemme know if I can help with that.

Member

I didn't articulate this, but CAPI requires a kubernetes cluster to create the test cluster which adds an extra complication

A common way to get around this is to use a kind cluster as bootstrap cluster. All the k/k cloud-provider Azure tests create test k8s clusters using CAPZ currently: https://testgrid.k8s.io/provider-azure-master-signal
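For illustration only, the kind-as-bootstrap flow described above looks roughly like this; the cluster names, infrastructure provider, and Kubernetes version are placeholders, not the actual CI configuration:

```shell
# Sketch of the CAPI bootstrap pattern, with illustrative names only.

# 1. Start a throwaway local kind cluster to act as the CAPI management cluster.
kind create cluster --name capi-bootstrap

# 2. Install the Cluster API core components plus an infrastructure provider
#    (Azure here, matching the CAPZ example above).
clusterctl init --infrastructure azure

# 3. Generate and apply a workload cluster manifest; this is the cluster the
#    e2e tests would actually run against.
clusterctl generate cluster test-cluster --kubernetes-version v1.28.0 \
  | kubectl apply -f -
```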

Also, CAPG isn't well maintained

@dims @richardcase @cpanato are the maintainers of CAPG, they can comment on the project maturity.

An alternative would be to run most tests using Docker, which doesn't require spinning up any cloud infrastructure (less $$ and faster); this is how core Cluster API runs all its tests: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md

Member

There is one more aspect to using kOps for testing purposes, which is less complexity. kOps creates the cloud resources, starts the K8s cluster, and then gets out of the way. There are no controllers trying to reconcile things in the background. This makes it easy to figure out what is happening with failing tests or broken clusters.

Member

@upodroid please consider recording the points above in the "alternatives" section of this KEP. "We don't have one" doesn't seem factual given the discussion above.

@upodroid (Member Author)

I'll add a summary of this thread to the alternatives section and join the CAPI meeting this afternoon to say hello and answer some questions.

Member

Summarizing the feedback at the Cluster API office hours:

  • It'd be great to repurpose the KEP to be focused on improving the current e2e test suites, and not rely on a particular tool to run them (e.g. kube-up or kops).
  • Cluster API and the major providers like AWS, GCP, Azure, and vSphere are all currently running conformance test suites. @fabriziopandini will coordinate and open issues with the different providers to start running the other e2e suites, like the ones shown by @upodroid in https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops
  • The Cluster API bootstrap cluster problem can potentially be solved in a number of different ways, which the CAPI community and I can help with, with the end goal of reducing costs.

@upodroid @CecileRobertMichon @fabriziopandini did I miss anything from the summary?

Member

Related note for CAPG maintainers (@dims @cpanato @richardcase): it'd be good to understand the delta of features we'd need to run the e2e suites on CAPG. It seems the provider has made lots of progress in the past few months, and I can carve out some time to help as well.

- The shell scripts are very fragile and no new features are being accepted
- arm64 testing needs to be done on GCE
- Python and Debian upgrades always break the Kubernetes e2e tests, particularly at a bad time (cutting releases, reverting/patching a critical bug). This will no longer be the case once we move to kops.
- We have some tests in kubernetes that make assumptions about specific pieces of cloud infrastructure and are not a good fit for kubernetes e2e tests. Tests that rely on cloud infra that is not reachable will be removed.
Member

Tests that rely on cloud-infra that are not reachable will be removed.

can you please expand on this one?

Member

I think he means things like these kubernetes/kubernetes#120968

@dims (Member) commented Sep 29, 2023

@upodroid i'd like us to run these jobs on both GCE and EC2 in parallel as well

@CecileRobertMichon (Member)

i'd like us to run these jobs on both GCE and EC2 in parallel as well

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

cc @lachie83

@BenTheElder (Member) left a comment

Thanks @upodroid

information to express the idea and why it was not acceptable.
-->

We don't have one.
Member

I think this thread is missing the point.

This KEP is about eliminating kube-up and the kube-up jobs by shifting to existing production-grade coverage.

Other CI coverage will continue to exist alongside this and folks are welcome to continue to invest in that.

kOps has already, for years, provided reliable CI coverage for the vendors that actually provide us credits.

Future cloud vendors are not the problem; we're many, many years into the credits program and moving CI out of google.com, and yet GCP+AWS are the only vendors that actually provide credits. Of those two, with apologies, CAPG support is just not there.

Bootstrap is not an issue, you can use kind or even lighter options like https://github.com/fabriziopandini/kBB-8 (8s to start provisioning a cluster)

As a kind maintainer ... that's actually unacceptably expensive. KIND is only cheaper when we don't have to run the cloud cluster. If we have to run a kind cluster for every cloud e2e cluster the CI costs are going to get out of hand.

e2e tests can run for a long time and generally require almost no resources in the CI cluster, only in the cluster under test.

An alternative would be to run most tests using Docker which doesn't require spinning up any cloud infrastructure (less $$ and faster), this how core Cluster API runs all its tests https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md

Again, actually much more expensive than kOps. The kind clusters are not free either, just cheaper than a cloud cluster. But one cloud cluster is cheaper than two cloud clusters or a cloud cluster + a kind cluster.

@upodroid (Member Author) commented Sep 29, 2023

i'd like us to run these jobs on both GCE and EC2 in parallel as well

what would it take for us to run these tests on Azure as well?

kops has an Azure implementation, but it is alpha and I haven't used it.

For EC2, I would duplicate these jobs https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops to run on EC2 and Ubuntu

I'm expecting fewer failing tests, as the aws cloud-provider is better maintained and the kubernetes test suite runs fewer tests against aws clusters.

We should probably resolve the kops vs CAPI discussion

@BenTheElder (Member) commented Sep 29, 2023

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

Azure cannot be used for release blocking because those accounts are a mystery with no budget and are not owned by the project.

We've been bitten badly by this in the past with PR blocking kops-aws running out of money and shutting down before we had community controlled accounts.

kubernetes/k8s.io#1637

EDIT: Note that this sub-topic is a very old discussion we've been having for years now. So we shouldn't block phasing out kube-up on revisiting this for the Nth time, though I'd be happy to see this change anyhow :-)

@dims (Member) commented Sep 29, 2023

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

@CecileRobertMichon we'll need to set up a credits program where infra is run by volunteers, similar to what we do with GCP and AWS.

@aojea (Member) commented Sep 30, 2023

Some context on the origin of this KEP to try to focus the discussion

There is a group of developers committed to supporting arm CI (kubernetes/test-infra#29693), and there was a PR to add arm support to the existing cluster scripts (kubernetes/kubernetes#120144).
This was brought up at the last sig-testing meeting, and I told them that we should not keep developing the cluster scripts or adding new features to them (and I think that we all agree here).

I asked them to look for some existing tool that runs kubernetes on arm, and it turned out kops already had a CI job for running on arm kubernetes/kubernetes#78995 (comment)

On another note, there was another effort to migrate scalability jobs to aws, and those developers chose kops too: kubernetes/test-infra#29139

Since the migration of jobs to kops was getting traction and showing results, I asked @upodroid to open a KEP for visibility and for defining the criteria and the plan to make this effort sustainable in the long term and how we can do a smooth transition.

One important thing I want to highlight is that the CI of kubernetes is for testing kubernetes, not for testing the tool that tests kubernetes at the same time. If these new jobs start to flake or become unstable because of errors or incompatibilities in the installer, and these errors are not promptly fixed or filed against the kubernetes developers, we'll revisit this decision.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: upodroid
Once this PR has been reviewed and has the lgtm label, please assign michelle192837, wojtek-t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Scalability

kops is already used to run scale tests on AWS. We can use it to replace the kube-up scale tests.
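As a rough, hypothetical sketch of what creating such a scale-test cluster could look like with the kops CLI (in CI the jobs drive kops through a test harness rather than by hand, and every name, zone, and size below is a placeholder):

```shell
# Illustrative only; not the actual scale-job configuration.
# kops keeps cluster state in an object store, e.g. an S3 bucket on AWS.
export KOPS_STATE_STORE=s3://example-kops-state-store

kops create cluster scale-test.k8s.local \
  --cloud aws \
  --zones us-east-1a \
  --node-count 100 \
  --node-size c5.large \
  --master-size c5.xlarge \
  --yes
```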
Member

The test on AWS is still visibly less stable than the GCE one:
https://testgrid.k8s.io/kops-misc#ec2-master-scale-performance
vs
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance

It's far from clear to me whether that's because, e.g., some settings that are critical to achieving reasonable performance at scale are not set correctly in kops.

@shyamjvs - FYI

Note - I'm supportive of the effort itself, I'm just saying that it may not be as straightforward as you think...
In general, what I would like to see is a diff of the flags [fortunately, all our components log them on startup] between the existing and new jobs, and a demonstration that this diff is zero :)
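To illustrate the kind of flag diff being asked for here, a minimal sketch, assuming the flag lists have already been pulled out of the components' startup logs in the job artifacts (the flag values below are invented for illustration):

```shell
# Hypothetical kube-apiserver flag lists from a kube-up job and a kops job.
kube_up_flags='--allow-privileged=true
--max-requests-inflight=800
--profiling=false'

kops_flags='--allow-privileged=true
--max-requests-inflight=400
--profiling=false'

# Show only the flags that differ between the two clusters; an empty result
# means the jobs run the component with identical configuration.
flag_diff=$(diff <(printf '%s\n' "$kube_up_flags" | sort) \
                 <(printf '%s\n' "$kops_flags" | sort) || true)
echo "$flag_diff"
```

Run against the real startup logs, the goal stated above is for this diff to come out empty.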

@upodroid (Member Author) · Oct 2, 2023

kops does expose most of the settings, but some of them have values that differ from what is set in kube-up clusters.

kubernetes/kops#15982 tries to close the apiserver differences

I'll look at the scale tests once the serial, disruptive and alpha tests are stabilised.


Member

this is an example of a blocker

@hakman (Member) commented Oct 2, 2023

i'd like us to run these jobs on both GCE and EC2 in parallel as well

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

cc @lachie83

@CecileRobertMichon kOps has support for Azure, not perfect, but quite good as of recently.
I guess the biggest blocker would be getting @lachie83's approval to use some Azure account & credits for this. I know you told me to ping him about it, but I haven't had the time.

@jprzychodzen

What's the future of Kubemark in such a setting? AFAIK it's only deployable through the kube-up scripts.

/cc @marseel

@elmiko (Contributor) commented Oct 2, 2023

@jprzychodzen fwiw, we do have an actively maintained kubemark provider for cluster-api: https://github.com/kubernetes-sigs/cluster-api-provider-kubemark. That is another possibility for creating kubemark nodes, since cluster api has come up in the discussion here.

@dims (Member) commented Oct 2, 2023

Here's an old write-up from when we were discussing options, for the record: https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ

@upodroid (Member Author) commented Oct 4, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 4, 2023
proposal will be implemented, this is the place to discuss them.
-->

We'll create new prowjobs that use kops clusters to run cluster e2e testing. kops has been used for e2e testing for a long time, but it runs a narrower set of tests, designed for testing Kubernetes distributions with various components.
Member

I'd like to see a diff of the components installed by kube-up and by kops. What CNI is being used, for example? kube-up is pretty neat; I don't want to end up with calico or cilium in the critical path, for example.

@BenTheElder (Member)

What's the future of Kubemark in such a setting? AFAIK it's only deployable through the kube-up scripts.

We should migrate it to this, however it's worth pointing out that kube-up.sh has long since been disowned, deprecated, and removed as a subproject by SIG Cluster Lifecycle, and is now ad-hoc maintained by a handful of ~SIG Testing folks because so much CI uses it.

We need to phase out the bulk of CI using it, and eventually the rest. So I would say kubemark jobs are already at risk by not migrating off of it.

@k8s-triage-robot
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@upodroid (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot

(same stale-lifecycle notice from the triage bot as above)

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2024
@upodroid (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2024