Remove the /cluster directory #78995

Open · timothysc opened this issue Jun 13, 2019 · 44 comments

Labels: area/code-organization, area/provider/gcp, area/test, kind/feature, lifecycle/frozen, needs-triage, priority/important-longterm, sig/cloud-provider, sig/cluster-lifecycle, sig/scalability, sig/testing

Comments

@timothysc (Member) commented Jun 13, 2019

For years we have publicly stated that the /cluster directory is deprecated and not maintained. However, it still gets updated every cycle, and bugs in it are found and fixed by sig-cluster-lifecycle.

I'd like to enumerate what needs to get done in order for us to remove the /cluster directory wholesale.

/assign @dims @spiffxp @justinsb @timothysc
/cc @liggitt @neolit123

timothysc added the area/test, sig/cluster-lifecycle, and kind/feature labels on Jun 13, 2019
@dims (Member) commented Jun 13, 2019

/area code-organization

k8s-ci-robot added the area/code-organization label on Jun 13, 2019
timothysc added the priority/important-longterm label on Jun 13, 2019
@timothysc (Member, Author) commented:

/cc @andrewsykim

@neolit123 (Member) commented Jun 13, 2019

Or potentially break it down and move certain sub-folders out of tree.
Currently it contains a collection of items:

  • image files (conformance image, etcd)
  • kube-up
  • random bash scripts used by test-infra/kubetest deployers as helpers
  • addons and the addon manager
  • other?

@liggitt (Member) commented Jun 13, 2019

looking at the references to it...

/sig testing
for e2e bringup

/sig scalability
for kubemark bringup

k8s-ci-robot added the sig/testing and sig/scalability labels on Jun 13, 2019
@sftim (Contributor) commented Jun 16, 2019

Is this also relevant to SIG Docs?

@timothysc (Member, Author) commented:

xref #78543
^ This is an example of the continued technical debt that we see and have to pay for in different ways across SCL.

@jaypipes (Contributor) commented Jul 2, 2019

Is this also relevant to SIG Docs?

Yes, I think this is a good example of where /cluster scripts are referenced from the docs: kubernetes/website#14929

dims removed their assignment on Jul 8, 2019
@andrewsykim (Member) commented:

For v1.16: investigate whether Cluster API is a potential replacement and meets the same level of coverage as /cluster, and enumerate what Cluster API is missing.

/assign @alejandrox1

@dims (Member) commented Jul 12, 2019

Step 1, which we talked about, was: look at all the CI jobs that use kube-up from the cluster directory, inventory the knobs/settings/configurations they set up or use, and cross-check whether cluster-api/kubeadm allows us to do the same.
Step 2: mirror the GCE e2e CI job (pull-kubernetes-e2e-gce, to be specific) using cluster-api for AWS.

Both steps can be done in parallel and will need help/effort/coordination from the wg-k8s-infra and sig-testing folks.
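One rough way to start the step 1 inventory is a quick grep, assuming local checkouts of kubernetes/test-infra and kubernetes/kubernetes sitting side by side; the paths and variable prefixes below are illustrative assumptions, not an agreed-upon tool:

```sh
# Knobs the CI jobs set (from prow job configs in test-infra):
grep -rhoE '\b(KUBE|NODE|MASTER|NUM)_[A-Z0-9_]+=' test-infra/config/jobs \
  | sort | uniq -c | sort -rn > knobs-set-by-jobs.txt

# Knobs the cluster/ scripts actually read:
grep -rhoE '\$\{(KUBE|NODE|MASTER|NUM)_[A-Z0-9_]+' kubernetes/cluster \
  | tr -d '${' | sort -u > knobs-read-by-cluster.txt
```

Cross-checking the two lists against what cluster-api/kubeadm expose would give the coverage picture described above.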

@alejandrox1 (Contributor) commented:

I'll start with step 1 right away.
So step 2 is free if anyone wants to work on this as well 😃

@dims (Member) commented Jul 12, 2019

@alejandrox1 sounds good. Please start a Google Doc or something that we can use to compile notes on the various flags/options.

@alejandrox1 (Contributor) commented:

Here's the Google Doc: https://docs.google.com/document/d/1p3c_sOALbEzg2VH2OPz3w9yKwlwaU4jATyS0lqxFzt4/edit?usp=sharing

Still got a lot to do for step 1, but I will ping when it's ready.
/cc @mariantalla

@mariantalla (Contributor) commented:

Could I start work on step 2 (i.e. investigating/starting the refactor of pull-kubernetes-e2e-gce to use Cluster API)?

I'll assign myself, but please feel free to unassign me if someone else is already working on it!

/assign

@neolit123 (Member) commented:

/sig cloud-provider
/area provider/gcp

k8s-ci-robot added the sig/cloud-provider and area/provider/gcp labels on Sep 3, 2020
@cheftako (Member) commented Feb 3, 2021

/assign
/triage accepted
cc @jpbetz

k8s-ci-robot added the triage/accepted label on Feb 3, 2021
@jpbetz (Contributor) commented Mar 24, 2021

The etcd images are not kubernetes specific. The main thing they do is automatically upgrade etcd one minor at a time, per etcd administration guidelines, when the cluster administrator upgrades etcd to a new version. E.g. if the cluster administrator upgrades to etcd 3.4 and the cluster is currently on 3.1, it upgrades first to 3.2 and then to 3.3 before upgrading to 3.4.

So I'm tempted to ask if the etcd community would be willing to own this. If the etcd community was okay with this, the main issue to solve is that the images are published to the k8s.io container repo.
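To make the stepwise upgrade described above concrete, here is a minimal sketch of the idea (not the actual image's script): it assumes per-version etcd binaries named `etcd-<version>` on the PATH, a local data directory, and default client endpoints, all of which are illustrative assumptions.

```sh
#!/usr/bin/env bash
# Sketch only: walk the data dir through each intermediate minor release so the
# on-disk state is upgraded one step at a time (3.1 -> 3.2 -> 3.3 -> 3.4).
set -euo pipefail

DATA_DIR="${DATA_DIR:-/var/lib/etcd}"   # assumed location of the etcd data dir

for version in 3.2 3.3 3.4; do
  echo "stepping data dir through etcd ${version}"
  # Run the intermediate release against the existing data dir...
  "etcd-${version}" --data-dir="${DATA_DIR}" &
  pid=$!
  # ...wait until it reports healthy (default endpoint 127.0.0.1:2379)...
  until etcdctl endpoint health >/dev/null 2>&1; do sleep 1; done
  # ...then stop it before moving on to the next minor version.
  kill "${pid}"
  wait "${pid}" || true
done
```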

cc @gyuho, @ptabor, @wenjiaswe

@gyuho (Member) commented Mar 24, 2021

The etcd images are not kubernetes specific

Yeah, I think we can implement some mechanism to update the docs and the container images in the registry for Kubernetes as part of the etcd release process.

@ptabor @wenjiaswe Any thoughts?

@justinsb (Member) commented:

We could also do this in the etcdadm project, as that is a kubernetes-sigs project and is thus set up to push to k8s repos / follow k8s governance etc.

@ptabor (Contributor) commented Mar 24, 2021

The north star, IMHO, is that we should get rid of the process of 'updating' etcd by running consecutive minor versions of etcd. Instead, etcd should have a dedicated tool, e.g. etcdstoragectl migrate (I would call it etcdadm, but the name is taken ;) ), that knows the DB changes between different versions and explicitly 'fixes' the database, instead of depending on running full 'historical' etcd servers for an undefined duration.

There are multiple benefits to this:

  • Such code is 'hands-on' documentation of the storage-format differences.
  • The tool/library should be incorporated into the backup/restore process.
  • No maintenance of fragile multi-version generation code.
  • If changes are related to the WAL log, the tool should create a new snapshot and truncate history; etcd doesn't do this on boot.

Maybe it's naive, but within the scope of etcd v3 minor versions, during work on the etcd storage-format documentation I haven't spotted any significant differences in the format.
It's more like 'set a default if a given field is missing', on the order of five such rules.
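For illustration only, this is roughly how such a tool might be invoked. The tool does not exist today; the command name comes from the comment above, and the flags and paths are purely hypothetical.

```sh
# Hypothetical: rewrite the data dir directly from the 3.1 to the 3.4 format by
# applying the known per-version rules, instead of booting 3.2 and 3.3 servers
# in sequence just to migrate the storage.
etcdstoragectl migrate \
  --data-dir=/var/lib/etcd \
  --from-version=3.1 \
  --to-version=3.4
```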

@jpbetz (Contributor) commented Mar 24, 2021

The north star, IMHO, is that we should get rid of the process of 'updating' etcd by running consecutive minor versions of etcd

I'm a big fan of this approach. If we can get support for this on the etcd side, then the solution w.r.t. the /cluster directory might just be to drop these images and use upstream etcd.

Curious what others think.

@gyuho (Member) commented Mar 24, 2021

w.r.t. the /cluster directory might just be to drop these images and use upstream etcd

Yeah, I always wonder why we even need two separate etcd registries. I am open to dropping container support from our release process entirely and letting downstream projects build their own, or merging onto the Kubernetes-community-managed registry.

@aojea (Member) commented Jun 10, 2021

I think what has to be done is to have an alternative as reliable as /cluster or better. I'm not saying that /cluster is great, but I haven't seen anything better so far... and this is very easy to measure with testgrid.

@k8s-triage-robot commented:

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot added the needs-triage label and removed the triage/accepted label on Feb 8, 2023
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims (Member) commented Feb 27, 2023

Sketching out requirements:
https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ

Here's a snapshot as of Feb 27, just to be safe; please see all the comments on the hackmd.

# Requirements for cluster/kube-up.sh replacement

We have a lot of choices in the community like kOps/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a [long-standing issue](https://github.com/kubernetes/kubernetes/issues/78995) and one that is not amenable to "easy" fixes.

- Must support 80% of jobs today (revisit all the environment flags we use to control different aspects of the cluster to verify)
- All nodes must run on a VM to replicate how it is being done today (we already have `kind` to replicate things run inside a container)
- Must be able to deploy the cluster built directly from either a PR or the tip of a branch (to cover both presubmit and periodic jobs; see the sketch after this snapshot)
- Must use `kubeadm` to bootstrap both the control plane node and the worker nodes
- `kubeadm` needs systemd for running `kubelet`, so the images deployed should use systemd
- Must have a mechanical way to translate existing jobs to this new harness
- Should have a minimum of moving parts to ensure we are not chasing flakes and digging into things we don't need to
- Should have a clean path (UX) to debug things like we have today (logs from VM/cloudinit/systemd/kubelet/containers should tell the whole story)
- Should work on GCP and AWS, with Azure close behind (so should be pluggable for other clouds)
- Should not be a hack; we need this to sustain us for at least 5-8 years.
- Should work with things we already have, like prow+boskos
- Adding a new service that is always on (like prow/boskos) should be well thought out, as it is not trivial to debug yet another thing. Any such solution must be rock solid.
- We should be able to test kubeadm, external cloud providers (CPI), and storage drivers (CSI).
- Should support switching between containerd and CRI-O, as well as cgroup v1/v2

We would need this solution to run in parallel for a full release cycle and offer equal or better results. This will need active owners who can take care of it for the long haul; worst case, we will drop it like a hot potato, however elegant or technologically superior the solution is.

Forward looking ideas that can be explored:
- Making whatever changes are needed to those tools to boot faster / allow injection of code.
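To make the "deploy the cluster built from a PR or branch tip" and "works with prow+boskos" requirements concrete, kubetest2 already expresses roughly this shape today. A sketch follows; the flag names are recalled from the kubetest2 README and may have changed, so treat them as assumptions rather than a verified invocation:

```sh
# Build Kubernetes from the current checkout, bring up a cluster on GCE, run
# the conformance suite, then tear everything down. A replacement harness
# would need an equivalent one-liner that CI jobs can call.
kubetest2 gce --build --up --down --test=ginkgo -- --focus-regex='\[Conformance\]'
```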

@BenTheElder (Member) commented:

  • Must use kubeadm to bootstrap both the control plane node and the worker nodes

I'm not sure this should be a hard requirement. As much appreciation as I have for kubeadm, it does a relatively small portion of cluster bootstrapping, and we often need to test things it does not do or reconfigure things it does not directly support configuring anyhow. I'd suggest using kubeadm if we staffed writing something new from scratch, but I don't think we should preclude pre-existing options based on it.

We have a lot of choices in the community like kOps/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a long-standing issue and one that is not amenable to "easy" fixes.

I'm also not sure the statement about kOps in particular is accurate; we actually used to run kOps on Kubernetes PRs as recently as 2019, and it worked fine. We lost this because the AWS bill went unpaid somewhere between Amazon and the CNCF, leading to the account being terminated and no ability to run the job. #73444 (comment)

We didn't spin up kops-GCE instead because we already had cluster-up and there was a strong push for Cluster-API, but kops-GCE works and passes tests. It's also relatively mature.
https://testgrid.k8s.io/kops-gce#kops-gce-latest

kubespray does not support kubernetes @ HEAD and CAPA is not passing tests / adds overhead like the local cluster, AIUI.

kops is already supported by kubetest + kubetest2, boskos, etc. and is passing tests on GCE + AWS. In theory it supports other providers but I don't think we have any CI there yet.

  • Must have a mechanical way to translate existing jobs to this new harness

This seems impractical. Most of the tricky part here is jobs setting kube-up environment variables that reconfigure the cluster. A human will have to dive into the scripts, see what each env var ultimately does, and then remap it to whatever other tool is used. Unless that tool is also written in bash, we won't have drop-in identical behavior; a lot of it is expanding variables inside bash-generated configuration files.
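For context, a minimal example of the pattern being described, using a few real kube-up knobs (the values are made up):

```sh
# A typical job reconfigures the cluster purely through environment variables
# that the bash scripts under cluster/ expand into generated config files.
export NUM_NODES=3
export KUBE_GCE_ZONE=us-central1-b
export KUBE_NODE_OS_DISTRIBUTION=gci
./cluster/kube-up.sh
```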

We could target the most important jobs and work from there.

@aojea (Member) commented Sep 5, 2023

@upodroid brought up a related topic today: he is working on adding ARM64 support to the CI and modifying the cluster scripts to do that (#120144).

I can see kops is already running arm64 (@justinsb): https://testgrid.k8s.io/google-aws#kops-aws-arm64-ci
I would rather these new CI jobs use a new tool instead of continuing to build on the cluster folder, if possible, or we'll never break this loop.

@upodroid (Member) commented Sep 5, 2023

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

@neolit123 (Member) commented:

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

For kops-related questions we have #kops-dev on k8s Slack.

@dims (Member) commented Jan 4, 2024

@upodroid how close are we to doing this? 1.31 perhaps?
