create HA cluster is flaky #588
Comments
seems to work with 3e5f299. EDIT: never mind, was able to reproduce it with v0.3.0 (flakes?). /assign @BenTheElder
@BenTheElder I'm going to switch the kubeadm jobs to use a stable kind version (kubetest/kind needs an update to pin a newer stable than 0.1.0); possibly the sig-testing kind jobs can remain on kind from master. |
I can't reproduce it, nor can I find a similar error in the jobs. Can you give me a link to some logs? I'm not familiar with testgrid and prow 😅
The error is 401 |
this:
happened here: it's flaky and I'm able to reproduce it locally. Some of the other failures here, however, are different: the e2e suite passes:
but then
|
@aojea
|
also: gives me some CRI errors:
EDIT: I think the cause for:
is a bad node-image. |
"CRI errors" during kind build node-image are normal.
|
a couple of problems:
looks like
|
#593 fixed the log export and image load issues; not sure what the remaining flakiness is about. |
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master the latest green run passes with kind from master. The one before that had the timeout problem that #593 fixed. A run that has kubernetes/test-infra#12883 should be in about 1 hour. |
I narrowed it down to this snippet; what could be the problem? What does this mean:
|
bootstrap tokens are technically bearer tokens: not sure how a token can flake though... on a side note, kubeadm does have concurrent CP join in k/k master now: |
also the token in kind is hardcoded: |
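For context, a minimal sketch of how such a token is presented to the API server as a bearer token; the token value and the address below are placeholders, not values from this thread. If the API server that happens to receive the request does not yet know about the bootstrap token (for example because its etcd member is lagging behind), a request like this is answered with 401:

TOKEN="abcdef.0123456789abcdef"      # placeholder bootstrap token (id.secret format)
APISERVER="https://172.17.0.2:6443"  # placeholder control-plane / load balancer address
# -k skips TLS verification; the Authorization header carries the bootstrap token as a bearer token
curl -k -H "Authorization: Bearer ${TOKEN}" "${APISERVER}/version"
|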
@neolit123 the job seems stable now https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master , doesn't it? |
yes, but we switched it to kind v0.2.0, because it's in the release-informing dashboard and the release team wants it green. |
Seems that the etcd cluster wasn't ready; in this case, kind-control-plane3 is the one that wasn't able to join the cluster at the moment of authentication. I think that
|
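One way to inspect that by hand is to list the etcd members from inside the etcd static pod; a sketch assuming kubeadm's default certificate paths and pod naming (etcd-<node-name>), adjust to your cluster:

# list etcd members as seen by the first control plane's etcd instance
kubectl -n kube-system exec etcd-kind-control-plane -- sh -c \
  'ETCDCTL_API=3 etcdctl \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
     member list'
# a member that has been added but has not started yet shows up as "unstarted"
|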
This is a possibility, yes, but the kinder HA jobs are consistently green in this regard. I think we should bisect kind v0.2.0 -> latest and find what happened. |
That's a fantastic idea. I think I know what's happening; let me do some tests. |
/assign @aojea |
I am experiencing this flakiness as well. Let me know if I can help test any fixes! |
One additional question: is it possible that the health checks on the load balancer are not working properly (the load balancer lets traffic pass before the API server is fully operative with etcd)? And some answers (hopefully helpful):
Kinder has wait loops after each action and it is less aggressive than kind in terms of parallelism; e.g. it is possible to pass
AFAIK kubeadm join implements a wait loop, but as soon as one API server answers it proceeds with joining. |
@fabriziopandini sorry for not updating the ticket, you are absolutely right. The problem is that nginx is only doing health checks at the TCP level, whereas haproxy was doing health checks at the TLS level with the option
Platforms/Architectures images: haproxy and nginx seem to have Docker images for almost all platforms, but traefik and envoy are starting to support more platforms |
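For reference, the TCP-level load balancing described above looks roughly like the following nginx stream configuration (a sketch with placeholder addresses, not kind's actual generated config); the open source stream module only marks a backend as failed when plain TCP connections fail, it never looks at /healthz:

cat <<EOF > /etc/nginx/nginx.conf
# L4 passthrough only: connections are forwarded as-is and backend health is
# judged purely by whether the TCP connection succeeds
events {}
stream {
  upstream kube_apiservers {
    server 172.17.0.2:6443 max_fails=1 fail_timeout=10s;
    server 172.17.0.3:6443 max_fails=1 fail_timeout=10s;
    server 172.17.0.4:6443 max_fails=1 fail_timeout=10s;
  }
  server {
    listen 6443;
    proxy_pass kube_apiservers;
  }
}
EOF
|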
Alternatively, we could avoid load balancing to nodes that have not joined yet. |
@aojea thanks for your great work trying to address this issue. Before answering your question, let me recap the options that, in my opinion, should be considered for solving this issue:
IMO 1 is the best option, but it will take some time. I'm in contact with the etcd maintainers about implementing 2 in a smart way (I don't like the idea of adding arbitrary sleeps); ATM there is a promising feature still only on master. Let's see if there is also something immediately actionable. You are looking at 2/3. With regards to timers, they are the result of a painful trade-off between different types of kubeadm users. We should be really careful in touching those values unless we find something badly wrong or we can assume that the change doesn't hurt users with slower devices (Raspberry Pi is the corner case in this area). AFAIK, the liveness probe isn't the best candidate for this use case (detecting things that have not started); I'm also trying to understand if readiness/liveness can actually block traffic for a static pod using host networking... |
Immediate answer: no, we cannot, as these are well balanced. We have found cases where the values do not match a use case, but normally there is another issue at play there. |
I've looked at envoy and I'm afraid it cannot solve the problem; it seems it doesn't have (at least I couldn't find it) the option to do https health checks against the API /healthz endpoint (it does http only). If anybody wants to give it a shot, this is the configuration I've tried: envoy config
docker run -d envoyproxy/envoy-alpine:v1.10.0
docker exec -it ENVOYCONTAINER sh
cat <<EOF > /etc/envoy/envoy.yaml
admin:
access_log_path: "/dev/null"
address:
socket_address:
protocol: TCP
address: 127.0.0.1
port_value: 9901
static_resources:
listeners:
- name: listener_0
address:
socket_address:
protocol: TCP
address: 0.0.0.0
port_value: 10000
filter_chains:
- filters:
- name: envoy.tcp_proxy
typed_config:
"@type": type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
stat_prefix: kind_tcp
cluster: kind_cluster
clusters:
- name: kind_cluster
connect_timeout: 0.25s
type: STATIC
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: kind_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 172.16.0.4
port_value: 6443
health_checks:
- timeout: 1s
interval: 2s
interval_jitter: 1s
unhealthy_threshold: 1
healthy_threshold: 3
http_health_check:
path: "/healthz"
exit
docker restart ENVOYCONTAINER

I think we should go with options 1-3 mentioned by Fabrizio and, if we want to work around the problem until we find the definitive solution, we can go for the active loop in #598. |
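For completeness, a generic sketch of what such an active wait loop could look like (an illustration only, not necessarily what #598 implements): poll /healthz through the load balancer and only proceed once it answers 200:

LB_ADDR="https://172.17.0.5:6443"   # placeholder load balancer endpoint
until [ "$(curl -sk -o /dev/null -w '%{http_code}' "${LB_ADDR}/healthz")" = "200" ]; do
  echo "waiting for the API server behind the load balancer to become healthy..."
  sleep 2
done
|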
/unassign @aojea |
can't seem to reproduce the problem after the workaround in kubernetes/kubernetes#78915. The issue of CM retries is now tracked in kubernetes/kubeadm#1606 |
Thanks @neolit123 and @ereslibre , that's awesome |
I'll defer this decision to others. However, in my opinion, the workaround included in kubeadm was a temporary measure to deal with the flakiness, and it is meant to be reverted, as far as we have discussed. Ideally, during the 1.16 cycle we should be able to understand the source(s) of the problems with concurrent joins, and as a result of those investigations it could happen that they circle back to kind (e.g. the LB would need to perform health checks, so we only route traffic to initialized apiservers). |
NOTE: kubespray also has a mode with an internal nginx LB... |
/assign @aojea I was talking today about this with @ereslibre. Despite all the issues, one fact is that the API server is advertising the correct status in the
After trying different LBs, it seems that the only open source LBs that allow doing health checks on an HTTPS endpoint and HTTPS passthrough load balancing are:
|
The standard docs show adding your config to the image. You can put a dummy config. If we boot the container with a sleep as its command, then restart doesn't work properly.
we actually had this working at one point already :-) |
#645 implemented haproxy with healthchecks per #588 (comment) |
so far I haven't gotten it to flake, still "soak testing" this locally 🙃 |
if using k/k from master our workaround in kubernetes/kubernetes#78915 might be kicking in. |
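For reference, the general shape of an haproxy config that does TLS passthrough load balancing while health checking the API servers over HTTPS is sketched below, with placeholder addresses and the official haproxy image's default config path; see #645 for what kind actually ships:

cat <<EOF > /usr/local/etc/haproxy/haproxy.cfg
defaults
  mode tcp
  timeout connect 5s
  timeout client  50s
  timeout server  50s

frontend control-plane
  bind *:6443
  default_backend kube-apiservers

backend kube-apiservers
  # health checks are HTTPS GETs against /healthz; client traffic is passed through untouched
  option httpchk GET /healthz
  server control-plane1 172.17.0.2:6443 check check-ssl verify none
  server control-plane2 172.17.0.3:6443 check check-ssl verify none
  server control-plane3 172.17.0.4:6443 check check-ssl verify none
EOF
|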
given #645 merged, we can locally revert kubernetes/kubernetes#78915 and test again for flakes. but this issue can also be closed as kind is now doing what is needed (e.g. use ha-proxy)? |
@neolit123 absolutely, I will start investigating in my spare time. Will ping you back with my findings. |
I'm testing with v1.14.2 currently. (the default image until we bump it... about to do that) |
|
I think we can close now, please re-open or file a new issue if you see further problems! |
What happened:
I started seeing odd failures in the kind-master and -1.14 kubeadm jobs:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-1.14
after switching to this HA config:
What you expected to happen:
no errors.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
--loglevel=debug
Environment:
- kind version (use kind version): master at 43bf0e2
- Docker version (use docker info):
- OS (e.g. from /etc/os-release):

/kind bug
/priority important-soon (?)