
KEP-4224: Replace kube-up e2e clusters with kops clusters #4250

Open · wants to merge 5 commits into base: master

Conversation

@upodroid (Member)

  • One-line PR description: Replace kube-up e2e clusters with kops clusters

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 28, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 28, 2023
@upodroid (Member Author)

information to express the idea and why it was not acceptable.
-->

We don't have one.
Member

Was using Cluster API considered? I remember there was an issue open in k/k (can't seem to find the link) about deprecating the cluster/ directory and using cluster-api for tests.

kOps doesn't support as many infrastructure providers (everything besides AWS and GCE is alpha or beta), so it may be less versatile for running k/k tests across different clouds in the future.


@upodroid (Member Author) · Sep 29, 2023

I didn't articulate this, but CAPI requires a Kubernetes cluster to create the test cluster, which adds an extra complication. Also, fixing cluster bootstrap business logic isn't trivial; for example, this is a list of bugs I fixed or am fixing in kops: https://github.com/kubernetes/kops/pulls?q=is%3Apr+author%3Aupodroid

Also, CAPG isn't well maintained. This is what I found when I took CAPG for a spin at the end of 2022 vmware-tanzu/crash-diagnostics#243


Relevant historical slack thread:

https://kubernetes.slack.com/archives/C2C40FMNF/p1657180208598469

Google docs link to previous discussion:

https://docs.google.com/document/d/1n1Znf-SY85Bjo_B_ReKsqAvEbU4Z8Vw-AvgeiSGVXBA/edit#heading=h.xk6zt7zfht6s

Looks like this never got an owner, lemme know if I can help with that.

Member

I didn't articulate this, but CAPI requires a kubernetes cluster to create the test cluster which adds an extra complication

A common way to get around this is to use a kind cluster as bootstrap cluster. All the k/k cloud-provider Azure tests create test k8s clusters using CAPZ currently: https://testgrid.k8s.io/provider-azure-master-signal
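For illustration only, the kind-as-bootstrap flow described above looks roughly like this; the cluster names, infrastructure provider, and Kubernetes version are placeholders, not the actual CI configuration:

```shell
# Sketch of the CAPI bootstrap pattern, with illustrative names only.

# 1. Start a throwaway local kind cluster to act as the CAPI management cluster.
kind create cluster --name capi-bootstrap

# 2. Install the Cluster API core components plus an infrastructure provider
#    (Azure here, matching the CAPZ example above).
clusterctl init --infrastructure azure

# 3. Generate and apply a workload cluster manifest; this is the cluster the
#    e2e tests would actually run against.
clusterctl generate cluster test-cluster --kubernetes-version v1.28.0 \
  | kubectl apply -f -
```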

Also, CAPG isn't well maintained

@dims @richardcase @cpanato are the maintainers of CAPG, they can comment on the project maturity.

An alternative would be to run most tests using Docker, which doesn't require spinning up any cloud infrastructure (less $$ and faster); this is how core Cluster API runs all its tests: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md

Member

There is one more aspect to using kOps for testing purposes, which is less complexity. kOps creates the cloud resources, starts the K8s cluster, and then gets out of the way. There are no controllers trying to reconcile things in the background. This makes it easy to figure out what is happening with failing tests or broken clusters.

Member

@upodroid please consider recording the points above in the "alternatives" section of this KEP. "We don't have one" doesn't seem factual given the discussion above.

@upodroid (Member Author)

I'll add a summary of this thread to the alternatives section and join the CAPI meeting this afternoon to say hello and answer some questions.

Member

Summarizing the feedback at the Cluster API office hours:

  • It'd be great to repurpose the KEP to be focused on improving the current e2e test suites, and not rely on a particular tool to run them (e.g. kube-up or kops).
  • Cluster API and the major providers like AWS, GCP, Azure, and vSphere are all currently running conformance test suites. @fabriziopandini will coordinate and open issues with the different providers to start running the other e2e suites, like the ones shown by @upodroid in https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops
  • The Cluster API bootstrap cluster problem can potentially be solved in a number of different ways, which the CAPI community and I can help with, with the end goal of reducing costs.

@upodroid @CecileRobertMichon @fabriziopandini did I miss anything from the summary?

Member

Related note for CAPG maintainers (@dims @cpanato @richardcase): it'd be good to understand the delta of features we'd need to run the e2e suites on CAPG. It seems the provider has made lots of progress in the past few months, and I can carve out some time to help as well.

- The shell scripts are very fragile and no new features are being accepted
- arm64 testing needs to be done on GCE
- Python and Debian upgrades always break the Kubernetes e2e tests, particularly at a bad time (cutting releases, reverting/patching a critical bug). This will no longer be the case once we move to kops.
- We have some tests in kubernetes that make assumptions about specific pieces of cloud infrastructure and are not a good fit for kubernetes e2e tests. Tests that rely on cloud infra that is not reachable will be removed.
Member

Tests that rely on cloud-infra that are not reachable will be removed.

can you please expand on this one?

Member

I think he means things like these kubernetes/kubernetes#120968

@dims (Member) commented Sep 29, 2023

@upodroid i'd like us to run these jobs on both GCE and EC2 in parallel as well

@CecileRobertMichon (Member)

i'd like us to run these jobs on both GCE and EC2 in parallel as well

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

cc @lachie83

@BenTheElder (Member) left a comment

Thanks @upodroid

information to express the idea and why it was not acceptable.
-->

We don't have one.
Member

I think this thread is missing the point.

This KEP is about eliminating kube-up and the kube-up jobs by shifting to existing production-grade coverage.

Other CI coverage will continue to exist alongside this and folks are welcome to continue to invest in that.

kOps has already, for years, provided reliable CI coverage for the vendors that actually provide us credits.

Future cloud vendors are not the problem; we're many, many years into the credits program and moving CI out of google.com, and yet GCP+AWS are the only vendors that actually provide credits. Of those two, with apologies, CAPG support is just not there.

Bootstrap is not an issue, you can use kind or even lighter options like https://github.com/fabriziopandini/kBB-8 (8s to start provisioning a cluster)

As a kind maintainer ... that's actually unacceptably expensive. KIND is only cheaper when we don't have to run the cloud cluster. If we have to run a kind cluster for every cloud e2e cluster the CI costs are going to get out of hand.

e2e tests can run for a long time and generally require almost no resources in the CI cluster, only in the cluster under test.

An alternative would be to run most tests using Docker which doesn't require spinning up any cloud infrastructure (less $$ and faster), this how core Cluster API runs all its tests https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md

Again, actually much more expensive than kOps. The kind clusters are not free either, just cheaper than a cloud cluster. But one cloud cluster is cheaper than two cloud clusters or a cloud cluster + a kind cluster.

@upodroid (Member Author) commented Sep 29, 2023

i'd like us to run these jobs on both GCE and EC2 in parallel as well

what would it take for us to run these tests on Azure as well?

kops has an Azure implementation, but it is alpha and I haven't used it.

For EC2, I would duplicate these jobs https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops to run on EC2 and Ubuntu

I'm expecting fewer failing tests, as the aws cloud-provider is better maintained and the kubernetes test suite runs fewer tests against aws clusters.

We should probably resolve the kops vs CAPI discussion

@BenTheElder (Member) commented Sep 29, 2023

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

Azure cannot be used for release blocking because those accounts are a mystery with no budget and are not owned by the project.

We've been bitten badly by this in the past with PR blocking kops-aws running out of money and shutting down before we had community controlled accounts.

kubernetes/k8s.io#1637

EDIT: Note that this sub-topic is a very old discussion we've been having for years now. So we shouldn't block phasing out kube-up on revisiting this for the Nth time, though I'd be happy to see this change anyhow :-)

@dims (Member) commented Sep 29, 2023

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

@CecileRobertMichon we'll need to set up a credits program where infra is run by volunteers, similar to what we do with GCP and AWS.

@aojea (Member) commented Sep 30, 2023

Some context on the origin of this KEP to try to focus the discussion

There is a group of developers committed to supporting arm CI (kubernetes/test-infra#29693), and there was a PR to add arm support to the existing cluster scripts (kubernetes/kubernetes#120144).
This was brought up at the last sig-testing meeting, and I told them that we should not keep developing the cluster scripts or adding new features to them (and I think that we all agree here).

I asked them to look for some existing tool that runs kubernetes on arm, and it turned out kops already had a CI job for running on arm kubernetes/kubernetes#78995 (comment)

On another note, there was another effort to migrate scalability jobs to aws, and those developers chose kops too: kubernetes/test-infra#29139

Since the migration of jobs to kops was getting traction and showing results, I asked @upodroid to open a KEP for visibility and for defining the criteria and the plan to make this effort sustainable in the long term and how we can do a smooth transition.

One important thing I want to highlight is that the CI of kubernetes is for testing kubernetes, not for testing the tool that tests kubernetes at the same time. If these new jobs start to flake or become unstable because of errors or incompatibilities in the installer, and these errors are not promptly fixed or filed against the kubernetes developers, we'll revisit this decision.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: upodroid
Once this PR has been reviewed and has the lgtm label, please assign michelle192837, wojtek-t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Scalability

kops is already used to run scale tests on AWS. We can use it to replace the kube-up scale tests.
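As a rough, hypothetical sketch of what creating such a scale-test cluster could look like with the kops CLI (in CI the jobs drive kops through a test harness rather than by hand, and every name, zone, and size below is a placeholder):

```shell
# Illustrative only; not the actual scale-job configuration.
# kops keeps cluster state in an object store, e.g. an S3 bucket on AWS.
export KOPS_STATE_STORE=s3://example-kops-state-store

kops create cluster scale-test.k8s.local \
  --cloud aws \
  --zones us-east-1a \
  --node-count 100 \
  --node-size c5.large \
  --master-size c5.xlarge \
  --yes
```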
Member

The test on AWS is still visibly less stable than the GCE one:
https://testgrid.k8s.io/kops-misc#ec2-master-scale-performance
vs
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance

It's far from clear to me whether that's because, e.g., some settings that are critical to achieving reasonable performance at scale are not set correctly in kops.

@shyamjvs - FYI

Note - I'm supportive of the effort itself, I'm just saying that it may not be as straightforward as you think...
In general, what I would like to see is a diff of the flags [fortunately, all our components log them on startup] between the existing and new jobs, and a demonstration that this diff is zero :)
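To illustrate the kind of flag diff being asked for here, a minimal sketch, assuming the flag lists have already been pulled out of the components' startup logs in the job artifacts (the flag values below are invented for illustration):

```shell
# Hypothetical kube-apiserver flag lists from a kube-up job and a kops job.
kube_up_flags='--allow-privileged=true
--max-requests-inflight=800
--profiling=false'

kops_flags='--allow-privileged=true
--max-requests-inflight=400
--profiling=false'

# Show only the flags that differ between the two clusters; an empty result
# means the jobs run the component with identical configuration.
flag_diff=$(diff <(printf '%s\n' "$kube_up_flags" | sort) \
                 <(printf '%s\n' "$kops_flags" | sort) || true)
echo "$flag_diff"
```

Run against the real startup logs, the goal stated above is for this diff to come out empty.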

@upodroid (Member Author) · Oct 2, 2023

kops does expose most of the settings, but some of them have values that differ from what is set in kube-up clusters.

kubernetes/kops#15982 tries to close the apiserver differences

I'll look at the scale tests once the serial, disruptive and alpha tests are stabilised.


Member

this is an example of a blocker

@hakman (Member) commented Oct 2, 2023

i'd like us to run these jobs on both GCE and EC2 in parallel as well

@dims @BenTheElder what would it take for us to run these tests on Azure as well?

cc @lachie83

@CecileRobertMichon kOps has support for Azure, not perfect, but quite good as of recently.
I guess the biggest blocker would be getting @lachie83's approval to use some Azure account & credits for this. I know you told me to ping him about it, but I haven't had the time.

@jprzychodzen

What's the future of Kubemark in such a setting? AFAIK it's only deployable through the kube-up scripts.

/cc @marseel

@elmiko (Contributor) commented Oct 2, 2023

@jprzychodzen fwiw, we do have an actively maintained kubemark provider for cluster-api: https://github.com/kubernetes-sigs/cluster-api-provider-kubemark. That is another possibility for creating kubemark nodes, since cluster api has come up in the discussion here.

@dims (Member) commented Oct 2, 2023

Here's an old write-up from when we were discussing options, for the record: https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ

@upodroid (Member Author) commented Oct 4, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 4, 2023
proposal will be implemented, this is the place to discuss them.
-->

We'll create new prowjobs that use kops clusters to run cluster e2e testing. kops has been used for e2e testing for a long time, but it runs a narrower set of tests, designed for testing Kubernetes distributions with various components.
Member

I'd like to see a diff of the components installed by kube-up and by kops. What CNI is being used, for example? kube-up is pretty neat; I don't want to end up with calico or cilium in the critical path, for example.

@BenTheElder (Member)

What's the future of Kubemark in such a setting? AFAIK it's only deployable through the kube-up scripts.

We should migrate it to this, however it's worth pointing out that kube-up.sh has long since been disowned, deprecated, and removed as a subproject by SIG Cluster Lifecycle, and is now ad-hoc maintained by a handful of ~SIG Testing folks because so much CI uses it.

We need to phase out the bulk of CI using it, and eventually the rest. So I would say kubemark jobs are already at risk by not migrating off of it.

@k8s-triage-robot
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@upodroid (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot

(same stale-lifecycle notice from the triage bot as above)

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2024
@upodroid (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2024