Remove the /cluster directory #78995

Open · timothysc opened this issue Jun 13, 2019 · 44 comments

Labels: area/code-organization, area/provider/gcp, area/test, kind/feature, lifecycle/frozen, needs-triage, priority/important-longterm, sig/cloud-provider, sig/cluster-lifecycle, sig/scalability, sig/testing

Comments

@timothysc (Member) commented Jun 13, 2019

For years we have publicly stated that the /cluster directory is deprecated and not maintained. However, it still gets updated every cycle, and bugs in it are found and fixed by sig-cluster-lifecycle.

I'd like to enumerate what needs to get done in order for us to remove the /cluster directory wholesale.

/assign @dims @spiffxp @justinsb @timothysc
/cc @liggitt @neolit123

timothysc added the area/test, sig/cluster-lifecycle, and kind/feature labels on Jun 13, 2019
@dims (Member) commented Jun 13, 2019

/area code-organization

k8s-ci-robot added the area/code-organization label on Jun 13, 2019
timothysc added the priority/important-longterm label on Jun 13, 2019
@timothysc (Member, Author) commented:

/cc @andrewsykim

@neolit123 (Member) commented Jun 13, 2019

Or potentially break it down and move certain sub-folders out of tree.
Currently it contains a collection of items:

  • image files (conformance image, etcd)
  • kube-up
  • random bash scripts used by test-infra/kubetest deployers as helpers
  • addons and the addon manager
  • other?

@liggitt (Member) commented Jun 13, 2019

looking at the references to it...

/sig testing
for e2e bringup

/sig scalability
for kubemark bringup

k8s-ci-robot added the sig/testing and sig/scalability labels on Jun 13, 2019
@sftim (Contributor) commented Jun 16, 2019

Is this also relevant to SIG Docs?

@timothysc (Member, Author) commented:

xref #78543
^ This is an example of the continued technical debt that we see and have to pay for in different ways across SCL.

@jaypipes (Contributor) commented Jul 2, 2019

Is this also relevant to SIG Docs?

Yes, I think this is a good example of where /cluster scripts are referenced from the docs: kubernetes/website#14929

dims removed their assignment on Jul 8, 2019
@andrewsykim (Member) commented:

For v1.16: investigate whether Cluster API is a potential replacement and meets the same level of coverage as /cluster, and enumerate what Cluster API is missing.

/assign @alejandrox1

@dims (Member) commented Jul 12, 2019

Step 1, which we talked about, was: look at all the CI jobs that use kube-up from the cluster directory, inventory the knobs/settings/configurations they set up or use, and cross-check whether cluster-api/kubeadm allows us to do the same.
Step 2: mirror the GCE e2e CI job (pull-kubernetes-e2e-gce, to be specific) using cluster-api for AWS.

Both steps can be done in parallel and will need help/effort/coordination from the wg-k8s-infra and sig-testing folks.
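One rough way to start the step 1 inventory is a quick grep, assuming local checkouts of kubernetes/test-infra and kubernetes/kubernetes sitting side by side; the paths and variable prefixes below are illustrative assumptions, not an agreed-upon tool:

```sh
# Knobs the CI jobs set (from prow job configs in test-infra):
grep -rhoE '\b(KUBE|NODE|MASTER|NUM)_[A-Z0-9_]+=' test-infra/config/jobs \
  | sort | uniq -c | sort -rn > knobs-set-by-jobs.txt

# Knobs the cluster/ scripts actually read:
grep -rhoE '\$\{(KUBE|NODE|MASTER|NUM)_[A-Z0-9_]+' kubernetes/cluster \
  | tr -d '${' | sort -u > knobs-read-by-cluster.txt
```

Cross-checking the two lists against what cluster-api/kubeadm expose would give the coverage picture described above.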

@alejandrox1 (Contributor) commented:

I'll start with step 1 right away.
So step 2 is free if anyone wants to work on this as well 😃

@dims (Member) commented Jul 12, 2019

@alejandrox1 sounds good. Please start a Google Doc or something that we can use to compile notes on the various flags/options.

@alejandrox1 (Contributor) commented:

Here's the Google Doc: https://docs.google.com/document/d/1p3c_sOALbEzg2VH2OPz3w9yKwlwaU4jATyS0lqxFzt4/edit?usp=sharing

Still got a lot to do for step 1, but I will ping when it's ready.
/cc @mariantalla

@mariantalla (Contributor) commented:

Could I start work on step 2 (i.e. investigating/starting the refactor of pull-kubernetes-e2e-gce to use Cluster API)?

I'll assign myself, but please feel free to unassign me if someone else is already working on it!

/assign

@neolit123 (Member) commented:

/sig cloud-provider
/area provider/gcp

k8s-ci-robot added the sig/cloud-provider and area/provider/gcp labels on Sep 3, 2020
@cheftako (Member) commented Feb 3, 2021

/assign
/triage accepted
cc @jpbetz

k8s-ci-robot added the triage/accepted label on Feb 3, 2021
@jpbetz (Contributor) commented Mar 24, 2021

The etcd images are not kubernetes specific. The main thing they do is automatically upgrade etcd one minor at a time, per etcd administration guidelines, when the cluster administrator upgrades etcd to a new version. E.g. if the cluster administrator upgrades to etcd 3.4 and the cluster is currently on 3.1, it upgrades first to 3.2 and then to 3.3 before upgrading to 3.4.

So I'm tempted to ask if the etcd community would be willing to own this. If the etcd community was okay with this, the main issue to solve is that the images are published to the k8s.io container repo.
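To make the stepwise upgrade described above concrete, here is a minimal sketch of the idea (not the actual image's script): it assumes per-version etcd binaries named `etcd-<version>` on the PATH, a local data directory, and default client endpoints, all of which are illustrative assumptions.

```sh
#!/usr/bin/env bash
# Sketch only: walk the data dir through each intermediate minor release so the
# on-disk state is upgraded one step at a time (3.1 -> 3.2 -> 3.3 -> 3.4).
set -euo pipefail

DATA_DIR="${DATA_DIR:-/var/lib/etcd}"   # assumed location of the etcd data dir

for version in 3.2 3.3 3.4; do
  echo "stepping data dir through etcd ${version}"
  # Run the intermediate release against the existing data dir...
  "etcd-${version}" --data-dir="${DATA_DIR}" &
  pid=$!
  # ...wait until it reports healthy (default endpoint 127.0.0.1:2379)...
  until etcdctl endpoint health >/dev/null 2>&1; do sleep 1; done
  # ...then stop it before moving on to the next minor version.
  kill "${pid}"
  wait "${pid}" || true
done
```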

cc @gyuho, @ptabor, @wenjiaswe

@gyuho (Member) commented Mar 24, 2021

The etcd images are not kubernetes specific

Yeah, I think we can implement some mechanism to update the docs and the container images in the registry for Kubernetes as part of the etcd release process.

@ptabor @wenjiaswe Any thoughts?

@justinsb (Member) commented:

We could also do this in the etcdadm project, as that is a kubernetes-sigs project and is thus set up to push to k8s repos / follow k8s governance etc.

@ptabor (Contributor) commented Mar 24, 2021

The north star, IMHO, is that we should get rid of the process of 'updating' etcd by running consecutive minor versions of etcd. Instead, etcd should have a dedicated tool, e.g. etcdstoragectl migrate (I would call it etcdadm, but the name is taken ;) ), that knows the DB changes between different versions and explicitly 'fixes' the database, instead of depending on running full 'historical' etcd servers for an undefined duration.

There are multiple benefits to this:

  • Such code is 'hands-on' documentation of the storage-format differences.
  • The tool/library should be incorporated into the backup/restore process.
  • No maintenance of fragile multi-version generation code.
  • If changes are related to the WAL log, the tool should create a new snapshot and truncate history; etcd doesn't do this on boot.

Maybe it's naive, but within the scope of etcd v3 minor versions, during work on the etcd storage-format documentation I haven't spotted any significant differences in the format.
It's more like 'set a default if a given field is missing', on the order of five such rules.
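For illustration only, this is roughly how such a tool might be invoked. The tool does not exist today; the command name comes from the comment above, and the flags and paths are purely hypothetical.

```sh
# Hypothetical: rewrite the data dir directly from the 3.1 to the 3.4 format by
# applying the known per-version rules, instead of booting 3.2 and 3.3 servers
# in sequence just to migrate the storage.
etcdstoragectl migrate \
  --data-dir=/var/lib/etcd \
  --from-version=3.1 \
  --to-version=3.4
```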

@jpbetz (Contributor) commented Mar 24, 2021

The north star, IMHO, is that we should get rid of the process of 'updating' etcd by running consecutive minor versions of etcd

I'm a big fan of this approach. If we can get support for this on the etcd side, then the solution w.r.t. the /cluster directory might just be to drop these images and use upstream etcd.

Curious what others think.

@gyuho (Member) commented Mar 24, 2021

w.r.t. the /cluster directory might just be to drop these images and use upstream etcd

Yeah, I always wonder why we even need two separate etcd registries. I am open to dropping container support from our release process entirely and letting downstream projects build their own, or merging onto the Kubernetes-community-managed registry.

@aojea (Member) commented Jun 10, 2021

I think what has to be done is to have an alternative as reliable as /cluster or better. I'm not saying that /cluster is great, but I haven't seen anything better so far... and this is very easy to measure with testgrid.

@k8s-triage-robot commented:

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot added the needs-triage label and removed the triage/accepted label on Feb 8, 2023
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims (Member) commented Feb 27, 2023

Sketching out requirements:
https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ

Here's a snapshot as of Feb 27, just to be safe; please see all the comments on the hackmd.

# Requirements for cluster/kube-up.sh replacement

We have a lot of choices in the community like kOps/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a [long-standing issue](https://github.com/kubernetes/kubernetes/issues/78995) and one that is not amenable to "easy" fixes.

- Must support 80% of jobs today (revisit all the environment flags we use to control different aspects of the cluster to verify)
- All nodes must run on a VM to replicate how it is being done today (we already have `kind` to replicate things run inside a container)
- Must be able to deploy the cluster built directly from either a PR or the tip of a branch (to cover both presubmit and periodic jobs; see the sketch after this snapshot)
- Must use `kubeadm` to bootstrap both the control plane node and the worker nodes
- `kubeadm` needs systemd for running `kubelet`, so the images deployed should use systemd
- Must have a mechanical way to translate existing jobs to this new harness
- Should have a minimum of moving parts to ensure we are not chasing flakes and digging into things we don't need to
- Should have a clean path (UX) to debug things like we have today (logs from VM/cloudinit/systemd/kubelet/containers should tell the whole story)
- Should work on GCP and AWS, with Azure close behind (so should be pluggable for other clouds)
- Should not be a hack; we need this to sustain us for at least 5-8 years.
- Should work with things we already have, like prow+boskos
- Adding a new service that is always on (like prow/boskos) should be well thought out, as it is not trivial to debug yet another thing. Any such solution must be rock solid.
- We should be able to test kubeadm, external cloud providers (CPI), and storage drivers (CSI).
- Should support switching between containerd and CRI-O, as well as cgroup v1/v2

We would need this solution to run in parallel for a full release cycle and offer equal or better results. This will need active owners who can take care of it for the long haul; worst case, we will drop it like a hot potato, however elegant or technologically superior the solution is.

Forward looking ideas that can be explored:
- Making whatever changes are needed to those tools to boot faster / allow injection of code.
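To make the "deploy the cluster built from a PR or branch tip" and "works with prow+boskos" requirements concrete, kubetest2 already expresses roughly this shape today. A sketch follows; the flag names are recalled from the kubetest2 README and may have changed, so treat them as assumptions rather than a verified invocation:

```sh
# Build Kubernetes from the current checkout, bring up a cluster on GCE, run
# the conformance suite, then tear everything down. A replacement harness
# would need an equivalent one-liner that CI jobs can call.
kubetest2 gce --build --up --down --test=ginkgo -- --focus-regex='\[Conformance\]'
```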

@BenTheElder (Member) commented:

  • Must use kubeadm to bootstrap both the control plane node and the worker nodes

I'm not sure this should be a hard requirement. As much appreciation as I have for kubeadm, it does a relatively small portion of cluster bootstrapping, and we often need to test things it does not do or reconfigure things it does not directly support configuring anyhow. I'd suggest using kubeadm if we staffed writing something new from scratch, but I don't think we should preclude pre-existing options based on it.

We have a lot of choices in the community like kOps/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a long-standing issue and one that is not amenable to "easy" fixes.

I'm also not sure the statement about kOps in particular is accurate; we actually used to run kOps on Kubernetes PRs as recently as 2019, and it worked fine. We lost this because the AWS bill went unpaid somewhere between Amazon and the CNCF, leading to the account being terminated and no ability to run the job. #73444 (comment)

We didn't spin up kops-GCE instead because we already had cluster-up and there was a strong push for Cluster-API, but kops-GCE works and passes tests. It's also relatively mature.
https://testgrid.k8s.io/kops-gce#kops-gce-latest

kubespray does not support kubernetes @ HEAD and CAPA is not passing tests / adds overhead like the local cluster, AIUI.

kops is already supported by kubetest + kubetest2, boskos, etc. and is passing tests on GCE + AWS. In theory it supports other providers but I don't think we have any CI there yet.

  • Must have a mechanical way to translate existing jobs to this new harness

This seems impractical. Most of the tricky part here is jobs setting kube-up environment variables that reconfigure the cluster. A human will have to dive into the scripts, see what each env var ultimately does, and then remap it to whatever other tool is used. Unless that tool is also written in bash, we won't have drop-in identical behavior; a lot of it is expanding variables inside bash-generated configuration files.
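For context, a minimal example of the pattern being described, using a few real kube-up knobs (the values are made up):

```sh
# A typical job reconfigures the cluster purely through environment variables
# that the bash scripts under cluster/ expand into generated config files.
export NUM_NODES=3
export KUBE_GCE_ZONE=us-central1-b
export KUBE_NODE_OS_DISTRIBUTION=gci
./cluster/kube-up.sh
```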

We could target the most important jobs and work from there.

@aojea (Member) commented Sep 5, 2023

@upodroid brought up a related topic today: he is working on adding ARM64 support to the CI and modifying the cluster scripts to do that (#120144).

I can see kops is already running arm64 (@justinsb): https://testgrid.k8s.io/google-aws#kops-aws-arm64-ci
I would rather these new CI jobs use a new tool instead of continuing to build on the cluster folder, if possible, or we'll never break this loop.

@upodroid (Member) commented Sep 5, 2023

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

@neolit123 (Member) commented:

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

For kops-related questions we have #kops-dev on k8s Slack.

@dims (Member) commented Jan 4, 2024

@upodroid how close are we to doing this? 1.31 perhaps?
