
OpenStack: HA proxy on service vm switches back and forth from/to bootstrap and production cluster #1854

Closed
jomeier opened this issue Jun 15, 2019 · 8 comments

Comments

@jomeier
Contributor

jomeier commented Jun 15, 2019

Hi,

What happened?

On the API VM with HAProxy and CoreDNS, HAProxy starts restarting continuously after around 10-15 minutes.

It looks like the haproxy-watcher triggers a switchover from the bootstrap node to the production cluster too early.

In my case the production control plane wasn't ready yet: the API server hadn't started, which led to a broken install.

If I turn off the haproxy-watcher, I get almost all the way through the installation procedure.

What you expected to happen?

HAProxy should only switch over to the production cluster once the control plane is really stable.

Greetings,

Josef

Version

$ openshift-install version
./openshift-install unreleased-master-1112-g7a54a6b120a0044cf9c07b14331760af8e3c216b-dirty
built from commit 7a54a6b120a0044cf9c07b14331760af8e3c216b
release image registry.svc.ci.openshift.org/origin/release:4.2

Platform (aws|libvirt|openstack):

OpenStack (Stein), 4GByte RAM, 2 VCPUs, 60 GByte Disk

@jomeier
Contributor Author

jomeier commented Jun 15, 2019

[core@test-cluster-njrtc-api ~]$ sudo podman run --rm -ti --net=host -v /etc/haproxy:/usr/local/etc/haproxy:ro docker.io/library/haproxy:1.7
<7>haproxy-systemd-wrapper: executing /usr/local/sbin/haproxy -p /run/haproxy.pid -db -f /usr/local/etc/haproxy/haproxy.cfg -Ds 
[WARNING] 165/051754 (9) : config : missing timeouts for proxy 'test-cluster-njrtc-api-masters'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[WARNING] 165/051754 (9) : config : missing timeouts for proxy 'test-cluster-njrtc-api-workers'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
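
For reference, those warnings should go away once explicit timeouts are set for the proxies. A minimal sketch of a defaults block (the values here are illustrative assumptions, not taken from the generated haproxy.cfg):

defaults
    mode tcp
    timeout connect 10s
    timeout client  1m
    timeout server  1m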

@jomeier
Contributor Author

jomeier commented Jun 15, 2019

Maybe it has something to do with the haproxy-watcher service. The watcher script might get some nonsense back from the unhealthy cluster, which then triggers actions (a new haproxy config, a restart of the haproxy service, ...) that are not wanted at that point in time.
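
A minimal sketch of the kind of conservative switchover gate I have in mind (purely hypothetical, not the actual haproxy-watcher code; the master API address, the /readyz endpoint and the reload step are assumptions for illustration):

#!/bin/bash
# Hypothetical gate: only point HAProxy at the production control plane after
# the API server has answered readiness probes several times in a row.
MASTER_API="https://192.0.2.10:6443"   # illustrative master address
REQUIRED_SUCCESSES=5
successes=0

while [ "$successes" -lt "$REQUIRED_SUCCESSES" ]; do
    if curl -ksf "${MASTER_API}/readyz" >/dev/null; then
        successes=$((successes + 1))
    else
        successes=0   # any failure resets the streak
    fi
    sleep 10
done

# Only now regenerate the HAProxy backend config and reload the service
# (config generation omitted here).
systemctl reload haproxy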

@jomeier jomeier changed the title api-vm on OpenStack: HA proxy fails after around 10 minutes and continuously crashes OpenStack: HA proxy on service vm switches to early to the production cluster Jun 15, 2019
@jomeier jomeier changed the title OpenStack: HA proxy on service vm switches to early to the production cluster OpenStack: HA proxy on service vm switches back and forth from/to bootstrap and production cluster Jun 16, 2019
@mandre
Member

mandre commented Jun 18, 2019

Yes, I also noticed that the haproxy-watcher service does weird things that could cause bouncing back and forth between the bootstrap and the production control plane. In my experience, though, the LB setup will eventually converge to the production control plane once it is stable.
FYI, we're in the process of removing the API VM (since it's a single point of failure) and moving the LB service to the master nodes. This part is going to be reworked.

@bjhuangr

@mandre I guess the DNS service will also be moved to the master nodes once the API VM is removed?

@mandre
Member

mandre commented Jul 10, 2019

@mandre I guess the DNS service will also be moved to the master nodes once the API VM is removed?

That's correct, we'll be running haproxy, keepalived and coredns as static pods on the master nodes.
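
A rough sketch of what such a static pod could look like on a master (a hypothetical manifest; the file path, namespace, image and mounts are illustrative assumptions, not the final implementation):

# /etc/kubernetes/manifests/haproxy.yaml -- illustrative static pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: haproxy
  namespace: openshift-openstack-infra   # assumed namespace
spec:
  hostNetwork: true
  containers:
  - name: haproxy
    image: docker.io/library/haproxy:1.7   # image reused from the reproduction above
    volumeMounts:
    - name: conf
      mountPath: /usr/local/etc/haproxy
      readOnly: true
  volumes:
  - name: conf
    hostPath:
      path: /etc/haproxy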

@mandre
Member

mandre commented Aug 1, 2019

/label platform/openstack

@mandre
Member

mandre commented Aug 20, 2019

/close

With #1959 we've removed the service VM that was causing the back and forth.

@openshift-ci-robot
Contributor

@mandre: Closing this issue.

In response to this:

/close

With #1959 we've removed the service VM that was causing the back and forth.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
