How to do HA catalog-sync #58

Closed
jacksontj opened this issue Jan 25, 2019 · 14 comments
Labels
area/sync (Related to catalog sync), theme/health-checks (About Consul health checking), type/question (Question about product, ideally should be pointed to discuss.hashicorp.com)

Comments

@jacksontj

While looking into using consul-k8s for my cluster, I was unable to find any docs, code, comments, or issues regarding an HA setup for consul-k8s. In addition to the concerns in #57 regarding liveness/readiness checks: how do I run more than one of these?

I have basically 2 HA concerns: (1) failure duration and (2) failure impact.

  1. Failure duration
    Assuming Liveliness/Readiness checks? #57 is resolved, we could in theory have more than one pod running behind a lock (presumably in Consul) so that when the first pod fails, the second can take over fairly quickly. This would help limit the time during which no consul-k8s is working in the cluster.

  2. Failure impact
    As it stands today, consul-k8s is a single worker which syncs the entire state of consul/k8s -- I'm interested in ways of reducing the amount of work it needs to do, specifically to reduce the impact of a failure. As an extreme example (which I realize doesn't exactly work given the current design, but it illustrates the point): if consul-k8s ran on each node in a k8s cluster and was responsible only for syncing the state of things local to that node (as I said, this doesn't quite work for some things like LBs, etc.), then the failure of a single process would only impact the sync state of the pods on that node.


tsmgeek commented Jan 31, 2019

K8s should handle this if you set it up as a "Deployment", but only in the case that the executable within the pod crashes or stops. If for any reason it just stops syncing but stays running, it will cause an issue.
I guess what they need is an open port that responds OK if the last sync was within X; then you can monitor it with a liveness check and Kubernetes will kill the pod.
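
A rough sketch of what such a freshness endpoint could look like (this is not part of consul-k8s; the /healthz path, port, and staleness window are made-up values for illustration):

```go
// Hypothetical sketch: expose an HTTP endpoint that reports healthy only if the
// last successful sync happened within a configurable window. A Kubernetes
// liveness probe pointed at /healthz would then restart a pod that is still
// running but no longer syncing.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

var lastSyncUnix int64 // updated by the sync loop after each successful sync

func recordSync() {
	atomic.StoreInt64(&lastSyncUnix, time.Now().Unix())
}

func healthz(maxStaleness time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		last := time.Unix(atomic.LoadInt64(&lastSyncUnix), 0)
		if time.Since(last) > maxStaleness {
			http.Error(w, fmt.Sprintf("last sync at %s exceeds %s", last, maxStaleness),
				http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	}
}

func main() {
	recordSync() // pretend an initial sync just completed
	http.HandleFunc("/healthz", healthz(2*time.Minute))
	http.ListenAndServe(":8080", nil)
}
```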

@jacksontj (Author)

After a lot of investigation it seems that fundamentally our needs won't be met by consul-k8s (a combination of this issue and #57) -- which led me to create katalog-sync. The highlights are:

  • syncing directly to a consul-agent: this means syncing is scoped per-node (when deployed as a DaemonSet)
  • agent services in consul: meaning the health of services is tied to node health
  • syncing readiness state from k8s as a check in consul: meaning you only have to define the check in k8s and consul will reflect that state
  • an (optional) sidecar to run alongside your services that only goes "ready" once the pod is registered in consul

IMO these questions should still be answered by consul-k8s, but some of them (specifically the failure impact and readiness checks during pod startup) don't seem tractable with the current design (although that design does allow for syncing cluster-wide resources such as services). So it seems like maybe a combination is the best way to go?
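
As context for the per-node approach described above, registering a service directly with the node-local Consul agent looks roughly like this using the official github.com/hashicorp/consul/api client (the service name, ID, port, check ID, and TTL are purely illustrative):

```go
// Illustrative sketch of the per-node model: register a pod's service with the
// node-local Consul agent so its health is tied to that node's agent.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// The default config talks to the local agent (127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &api.AgentServiceRegistration{
		ID:   "my-app-abc123", // e.g. derived from the pod name (hypothetical)
		Name: "my-app",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			// A TTL check that the sync process keeps updated from the pod's
			// readiness state in Kubernetes.
			TTL: "30s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}

	// Later, pass/fail the TTL check as the pod's readiness changes.
	if err := client.Agent().UpdateTTL("service:my-app-abc123", "pod ready", api.HealthPassing); err != nil {
		log.Fatal(err)
	}
}
```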


tsmgeek commented Jan 31, 2019

Agreed, there seems to be key functionality missing from consul-k8s.
I'm still working on another problem related to services playing hide and seek.

lkysow added the type/question and theme/health-checks labels Oct 15, 2019

hmlkao commented Jan 4, 2021

Any best practice or progress on how to run consul-k8s in HA mode?

We hit an issue where rescheduling to another node takes a long time (5:40 min with default K8s component values) when the node running the consul-k8s pod goes down. Even after tuning the K8s component values, it still takes longer (~30 s) than would be acceptable.

lkysow (Member) commented Jan 5, 2021

> Any best practice or progress on how to run consul-k8s in HA mode?
>
> We hit an issue where rescheduling to another node takes a long time (5:40 min with default K8s component values) when the node running the consul-k8s pod goes down. Even after tuning the K8s component values, it still takes longer (~30 s) than would be acceptable.

Hi, which consul-k8s pod got rescheduled? Are you using service mesh or catalog sync?


hmlkao commented Jan 6, 2021

Hi @lkysow, we are using catalog sync without service mesh, deployed via consul-helm (v0.8.1), in one-way K8s > Consul sync mode.

lkysow (Member) commented Jan 6, 2021

Do you know why the rescheduling of the catalog sync pod took 5:40?


hmlkao commented Jan 7, 2021

It is the standard behaviour of K8s with the default values of the controller-manager's node-monitor-grace-period (40 s) and pod-eviction-timeout (5 min); details are in an article on Medium.

When the node on which the catalog-sync pod is running fails (dies, is reset, or whatever), it takes 5:40 min until the pods are evicted. During this time no changes are synced to Consul, which can cause problems with access to apps.

Is it dangerous to run consul-k8s catalog-sync with more replicas?

lkysow (Member) commented Jan 7, 2021

Ahh I see, sorry, I missed that it was because the node died.

> Is it dangerous to run consul-k8s catalog-sync with more replicas?

Currently that is unfortunately not supported, but it looks like we need to fix this so you can run >1 replica, perhaps with some sort of leader election to swap between them.
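
This is not something consul-k8s does today, but one common pattern for it is client-go's leader election: all replicas run, and only the current lease holder performs the sync. A minimal sketch, assuming an in-cluster config; the lease name, namespace, and runCatalogSync loop are placeholders:

```go
// Sketch: run several replicas but let only the elected leader perform catalog
// sync, using a coordination.k8s.io Lease via client-go's leaderelection package.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // the pod name makes a reasonable identity

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "catalog-sync-leader", // placeholder lease name
			Namespace: "consul",              // placeholder namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader runs the sync loop; it stops when ctx is cancelled.
				runCatalogSync(ctx) // hypothetical sync loop
			},
			OnStoppedLeading: func() {
				klog.Info("lost leadership, standing by")
			},
		},
	})
}

func runCatalogSync(ctx context.Context) { <-ctx.Done() } // placeholder
```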


hmlkao commented Jan 7, 2021

It would be great if it were supported.

Meanwhile, I did some tests with more replicas running on our dev cluster, and it looks like it works as expected.
It may cause some race conditions, but that is more acceptable than an outage caused by a failed node.

lkysow (Member) commented Apr 8, 2021

Note: #479 will make the service mesh injector HA, but it won't address catalog sync HA.

ndhanushkodi pushed a commit to ndhanushkodi/consul-k8s that referenced this issue Jul 9, 2021: "Provide a valid maxUnavailable value when using a single replica"
david-yu changed the title from "How to do HA consul-k8s" to "How to do HA catalog-sync" Jul 16, 2021
lkysow added the area/sync label Nov 4, 2021
@david-yu (Contributor)

Hi there, Consul K8s PM here. I'm going to close this issue since at this time we likely won't be making changes to support running catalog sync with multiple replicas. I have to acknowledge that it's been a long time since this issue was filed; since then, our priorities have shifted towards building a robust service mesh which enables service discovery on K8s. If you have a PR that enables multiple replicas and that you would like us to review, please go ahead and file it and we can review. Thank you.


Dentrax commented Dec 20, 2022

Hey @david-yu, is there any reason why consul-k8s does not support leader election as of now? It'd be great to learn some of the background. I can file a proposal for this if it seems reasonable, then a follow-up PR to add leader election, since I have previous experience adding this to some open-source projects. Our use case is to run consul-k8s in HA mode (podAntiAffinity + min 3 replicas). Awaiting your thoughts.

@david-yu (Contributor)

Hi @Dentrax, if you're interested in working on a PR, could you file a new issue with a proposal for how you'd like to see this problem solved? We would probably like to review it before you go down this path. As said previously, this likely involves deeper changes to catalog sync than just deploying multiple replicas.
