Update to latest /cluster directory from kubernetes/kubernetes #216
Conversation
Force-pushed 7d700ba to e521822
/assign @cici37 @jiahuif @DangerOnTheRanger
/retest Error was:

edit: the actual problem was the --terminated-pod-gc-threshold flag being set, as @DangerOnTheRanger found. Adding a temporary fix commit to this PR until @DangerOnTheRanger merges a fix.
Force-pushed e6b4b8c to 893432f
/lgtm
Force-pushed fde83b6 to 1e31d3d
Sanity check that pulling a private image works as expected: https://gist.github.com/jpbetz/31f4f720ebac8ad2a0c652eaeeb2640f
/lgtm

/test cloud-provider-gcp-verify-all
Force-pushed 538c0ce to e892dd4
cloud-provider-gcp-verify-all will be fixed by #225

/test cloud-provider-gcp-verify-all
/hold

The last two cloud-provider-gcp-e2e-create failures show the kube-controller-manager failed to start on one run and the scheduler failed to start on the next. It appears to be a master CPU resource issue: cloud-provider-gcp-e2e-create uses an n1-standard-1 for the master (1 vCPU). CPU requests for the controllers are unchanged in this PR: https://github.com/kubernetes/cloud-provider-gcp/pull/216/files#diff-ada96bec4aba13195248d6dd21ee67e333bead2bd898b81a1b089d53bd06b4d8R3403-R3405 so I'm unclear about what caused this to start happening. I don't see any other obvious changes to requested CPU in this PR.
I checked the master (I set it to an n1-standard-2 to make sure it runs everything):

```
$ kubectl describe node kubernetes-master
Namespace    Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
---------    ----                                        ------------  ----------  ---------------  -------------  ---
kube-system  cloud-controller-manager-kubernetes-master  200m (10%)    0 (0%)      0 (0%)           0 (0%)         23m
kube-system  etcd-server-events-kubernetes-master        100m (5%)     0 (0%)      0 (0%)           0 (0%)         23m
kube-system  etcd-server-kubernetes-master               200m (10%)    0 (0%)      0 (0%)           0 (0%)         23m
kube-system  fluentd-gcp-v3.2.0-gj25w                    100m (5%)     1 (50%)     200Mi (2%)       500Mi (6%)     23m
kube-system  konnectivity-server-kubernetes-master       25m (1%)      0 (0%)      0 (0%)           0 (0%)         22m
kube-system  kube-addon-manager-kubernetes-master        5m (0%)       0 (0%)      50Mi (0%)        0 (0%)         22m
kube-system  kube-apiserver-kubernetes-master            250m (12%)    0 (0%)      0 (0%)           0 (0%)         23m
kube-system  kube-controller-manager-kubernetes-master   200m (10%)    0 (0%)      0 (0%)           0 (0%)         23m
kube-system  kube-scheduler-kubernetes-master            75m (3%)      0 (0%)      0 (0%)           0 (0%)         23m
kube-system  l7-lb-controller-kubernetes-master          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         22m

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests     Limits
  --------                   --------     ------
  cpu                        1165m (58%)  1 (50%)
  memory                     300Mi (4%)   500Mi (6%)
  ephemeral-storage          0 (0%)       0 (0%)
  hugepages-2Mi              0 (0%)       0 (0%)
  attachable-volumes-gce-pd  0            0
```

edit: We're 165m over the n1-standard-1 limit. I'll decrease cloud-controller-manager to 50m (after discussing briefly with @cheftako) and fluentd-gcp to 75m (fluentd-gcp's request was not explicitly set, and I suspect it could be decreased further).
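The budget arithmetic above can be sanity-checked with a quick script. This is just an illustration: the per-pod numbers are copied from the `kubectl describe` output, and treating the full 1000m of an n1-standard-1 vCPU as allocatable is an assumption (the real allocatable budget is slightly lower after system reservations).

```shell
#!/usr/bin/env bash
# CPU requests on the master, in millicores, from the kubectl describe output:
# ccm, etcd-events, etcd, fluentd, konnectivity, addon-mgr, apiserver, kcm, scheduler, l7-lb
requests=(200 100 200 100 25 5 250 200 75 10)

total=0
for r in "${requests[@]}"; do
  total=$((total + r))
done
echo "total requests: ${total}m"    # 165m over a 1000m (1 vCPU) budget

# Proposed reductions: cloud-controller-manager 200m -> 50m (saves 150m),
# fluentd-gcp 100m -> 75m (saves 25m).
after=$((total - 150 - 25))
echo "after reductions: ${after}m"  # fits under 1000m
```

With those two reductions the total drops from 1165m to 990m, just under the 1 vCPU of an n1-standard-1.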
Force-pushed 4e1cf93 to b1a71bd
/hold cancel

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, jpbetz

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
```shell
NODE_SIZE=${NODE_SIZE:-n1-standard-2}
NUM_NODES=${NUM_NODES:-3}
NUM_WINDOWS_NODES=${NUM_WINDOWS_NODES:-0}
# TODO: Migrate to e2-standard machine family.
MASTER_SIZE=${MASTER_SIZE:-n1-standard-$(get-master-size)}
```
So that may turn out to be n1-standard-2 or -1.
Good find. Looks like for 3 node tests we should expect n1-standard-1:
```shell
function get-master-size {
```
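For context, here is a sketch of what a `get-master-size` helper looks like, modeled on the node-count-based sizing helpers in the kubernetes/kubernetes GCE cluster scripts. The exact thresholds below are an assumption for illustration; the relevant point is that NUM_NODES=3 maps to a size-1 master.

```shell
#!/usr/bin/env bash
# Sketch: pick a master size suffix based on cluster node count.
# Thresholds are illustrative, not the exact upstream values.
function get-master-size {
  local suggested_master_size=1
  if [[ "${NUM_NODES}" -gt 5 ]]; then
    suggested_master_size=2
  fi
  if [[ "${NUM_NODES}" -gt 10 ]]; then
    suggested_master_size=4
  fi
  if [[ "${NUM_NODES}" -gt 100 ]]; then
    suggested_master_size=8
  fi
  echo "${suggested_master_size}"
}

# The 3-node e2e configuration therefore resolves to n1-standard-1:
NUM_NODES=3
echo "n1-standard-$(get-master-size)"
```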
Figured out why this PR was triggering CPU requested resource limit exceeded problems. I hadn't merged in this change to default to an n1-standard-2 for masters:

We are better off, I think, establishing some lower limits and allowing n1-standard-1 masters to continue to be used, so I'm not going to reapply that change, but I wanted to make a record of it on this PR.
Fixes #214. I'll be writing up a design for how to keep the /cluster directory up-to-date more regularly through automation. Until then, this PR helps "catch up" with the kubernetes/kubernetes /cluster directory. In addition to the immediate benefits of having a fresh /cluster directory, I expect this will help us with long-term automation by eliminating merge conflicts caused by some of the ad-hoc cherry-picked commits from kubernetes/kubernetes.
Commits in this PR:

- `hack/lib/util.sh` -> `cluster/util.sh` fixes that had been overlooked, and some executable file permission bits that were out-of-sync with kubernetes/kubernetes.

Notes for reviewers
Alternative considered
Individually merge all 600+ commits from kubernetes/kubernetes into cloud-provider-gcp. I actually tried to introduce some automation for this (https://github.com/jpbetz/cloud-provider-gcp/blob/merge-cluster-changes/tools/merge-cluster-changes.sh). The problem with using the automation is that there have been enough cherry-picks of more recent kubernetes/kubernetes /cluster directory changes into cloud-provider-gcp for files like configure-helper.sh that the merges don't cleanly apply, and manually resolving 600+ merge conflicts would be required, which is incredibly tedious and error-prone. Since there are only a handful of PRs that really changed the /cluster directory in cloud-provider-gcp, it is much easier and safer to go with the rebase/reapply route. Once this PR merges and we're rebased, I can circle back and see if the automation will work better, which I hope it might given that the problematic cherry-picks should be out of the way.
Selectively re-applied changes in 2nd commit from previous cloud-provider-gcp PRs
Deploy Kubernetes from cloud-provider-gcp. #143
Add basic cluster up/down e2e test. #144
Add logdump for e2e create. #148
Fix CCM image. #151
Fix shellcheck failure in cluster/gce/config-default.sh #152
Create the bucket for tars based on $PROJECT #154
Add auth-provider-gcp for out-of-tree credential provision #168
Disable local loopback for volume host #181
Bump cloud-provider-gcp to v1.21 #204