Set pod conditions from a container probe #28658

Closed
bprashanth opened this issue Jul 8, 2016 · 20 comments
Labels
area/stateful-apps, lifecycle/rotten, sig/network

Comments

@bprashanth
Contributor

I'd like a way to export some key pieces of information from a pod, periodically, into the apiserver, without invoking kubectl or linking against the Kubernetes API. One way to do this is through a new type of HTTP "condition" probe that returns a JSON struct of fields to set either in the PodCondition or in Pod annotations under a special key.

One can use this to implement a PetSet of "type=master/slave" by having a probe on all the slaves return "isLeader=false". The PetSet controller can create a private master Service with just a single endpoint that matches the one pet in the set that returns "isLeader=true". Users would hand out the DNS name/IP of this master Service to clients, knowing that it will always redirect to the master.
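For concreteness, here is a rough sketch of what the pod side of such a "condition" probe could serve. Nothing in it is an existing Kubernetes API: the endpoint path, the response shape, and the `amLeader` helper are assumptions made up for illustration. The kubelet (or a sidecar) would then copy the returned fields into the pod's conditions or into annotations under a reserved key.

```go
// Hypothetical sketch of what a pod-side "condition probe" endpoint could
// return. Neither the endpoint path nor the response shape is part of any
// existing Kubernetes API; they are assumptions made up for illustration.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// conditionProbeResponse is an assumed shape for the JSON the kubelet would
// copy into PodConditions, or into annotations under a reserved key.
type conditionProbeResponse struct {
	Conditions  map[string]bool   `json:"conditions"`  // e.g. {"isLeader": false}
	Annotations map[string]string `json:"annotations"` // optional extra metadata
}

// amLeader is a placeholder for the application's own leadership check.
func amLeader() bool { return false }

func main() {
	http.HandleFunc("/probe/conditions", func(w http.ResponseWriter, r *http.Request) {
		resp := conditionProbeResponse{
			Conditions: map[string]bool{"isLeader": amLeader()},
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```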

@kubernetes/sig-apps @thockin

@bprashanth added the sig/network, team/cluster, and area/stateful-apps labels on Jul 8, 2016
@bprashanth added this to the next-candidate milestone on Jul 8, 2016
@smarterclayton
Contributor

This is not uncommon in enterprise load balancers, although typically it's done with different endpoints. I thought both nginx and haproxy could also do something like this.

On the other hand, should PetSets always make pod-0 the leader (the programmed leader)? I assume this is for PetSets that have no innate failover, as per your earlier issue?


@bprashanth
Contributor Author

This is for pets with innate failover. Making pet-0 the leader is easy: we can give pet-0 to all the clients of the db. But if it gets partitioned, we need to either:

  1. redirect requests to pet-0 to the new master instead
  2. don't give pet-0 to clients, but give pet-master instead, and keep updating what pet-master resolves to

I'm going for 2 with this.

Yeah, we could force people to set up a load balancer in front that would fail over to one of the backups, effectively achieving the same thing. The only caveat is that we can't start sending writes to the old master as soon as it starts responding to health checks again. Nginx and haproxy take a list of servers; we can mark the first as primary and all the others as backup. When a new master is elected, we need to rewrite the config (I don't think there's an automatic way).
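One way to realize option 2 without rewriting a proxy config on every failover is for a controller to move a label such as `role=master` onto whichever pod currently reports leadership, and front the set with a Service whose selector is `role=master`. A minimal sketch using a recent client-go; the namespace, pod names, label values, and the fact that something else has already decided who the new master is are all assumptions.

```go
// Rough sketch: relabel the current master so a Service with selector
// role=master always resolves to it. Label keys/values, the namespace, and
// how the new master is discovered are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func promote(ctx context.Context, client kubernetes.Interface, ns, oldMaster, newMaster string) error {
	demote := []byte(`{"metadata":{"labels":{"role":"replica"}}}`)
	elect := []byte(`{"metadata":{"labels":{"role":"master"}}}`)

	// Demote the old master first so the Service never has two endpoints.
	if oldMaster != "" {
		if _, err := client.CoreV1().Pods(ns).Patch(ctx, oldMaster,
			types.StrategicMergePatchType, demote, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("demoting %s: %w", oldMaster, err)
		}
	}
	_, err := client.CoreV1().Pods(ns).Patch(ctx, newMaster,
		types.StrategicMergePatchType, elect, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "web-0"/"web-1" are placeholder pod names.
	if err := promote(context.Background(), client, "default", "web-0", "web-1"); err != nil {
		panic(err)
	}
}
```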

@bprashanth
Contributor Author

This is for pets with innate failover.

I mean, pets with innate failover for master/slave. For example, we don't need this for ZooKeeper, because you can write to any voting member and it waits for an ack from a majority. With MySQL master/slave you must write to the current master. I think the usual way to handle this is keepalived and a floating VIP.

@smarterclayton
Contributor

A Service cluster IP can manage this pretty effectively if we could figure out a way to have the endpoints controller select on the 'leader' signal. It's somewhat like readiness, except the Service would only ever mark a single pod as ready. Could we do this today with readiness checks and Services as-is, i.e. a readiness check on each member that only returns true if it is the leader?
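To make that concrete: if each member's readiness endpoint only returns 200 when it believes it is the leader, a plain Service selecting every member ends up with at most one ready endpoint. A minimal sketch; the `/ready` path and the `isLeader` check are placeholders for whatever the application actually knows about its own replication state.

```go
// Sketch of a readiness endpoint that only reports ready on the leader, so a
// plain Service selecting every member ends up with a single ready endpoint.
package main

import (
	"log"
	"net/http"
)

// isLeader is a placeholder: a real implementation would consult the
// database's replication / election state.
func isLeader() bool {
	return false
}

func main() {
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if isLeader() {
			w.WriteHeader(http.StatusOK) // leader: passes the readinessProbe
			return
		}
		// Non-leaders fail readiness and are dropped from the Service endpoints.
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```

The catch, as the next comment notes, is that readiness is pod-wide, so the non-leaders would also look unready to anything else that consumes readiness.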

@smarterclayton
Contributor

smarterclayton commented Aug 5, 2016

What's interesting is that with active/passive there should only ever be one "ready" member. But readiness is a pod condition, not a port condition, so we can't necessarily apply different readiness rules per container/port per Service.

Another thought - most active/passive sets are for things that are not innately HA and need fencing / safety guarantees in place. If we assume fencing is solved at the volume level (either by innate behavior in the storage provider like attach/detach or locks), then the guarantee still needs some sort of "terminate old, make sure data is flushed, then do something to the new". That problem sounds a lot like pet set reconfiguration, but the decision to do the failover is triggered externally (probably by a deletion of the pod by the node controller).

@smarterclayton
Contributor

Let me sketch out two active/passive use cases that are related to this and to pet sets:

  1. I want to make a VM highly available on a Kubernetes cluster by running QEMU inside the pod. Whenever that pod gets deleted by the node controller, I want to scale up a new pod, keep the new pod running, run a migration step that connects the two pods, and then allow the old pod to terminate. If the new pod is deleted in the middle of a migration, bad things happen - I may need to create a third pod.
    a. Variant - I want a hot failover VM that is receiving memory state deltas from the master - both pods are active at once, and a background process is feeding hypervisor state from active to passive.
  2. I want to make a classic Postgres database HA on a Kubernetes cluster with an active/passive setup. If the active pod is deleted, I want to move leadership to the passive pod. If the active pod fails a health check, I want to move leadership to the passive pod (based on a forgiveness value). Both pods are running at once. Leadership transitions typically require the other pod to be "caught up", and once leadership transfers the roles should be reversed. I want my load balancer to be able to easily move traffic to the passive pod only when the leadership transition occurs.

Leadership coming from the pods seems strange here, if only because, as the PetSet controller, I don't necessarily know which one to trust.

@bprashanth
Contributor Author

We'd have to come up with a way to take a majority (i.e. pods respond with leader=pod-1), which gets complicated because of epochs. The simple answer is to assume a partitioned master knows it is partitioned and responds negatively. I'm not sure we can solve this problem completely without doing something really complicated.

I think the best we can do is describe a simple protocol and guarantee that if pods obey it, the Service will point at the new master. The Service will probably go through some period where it has 0 endpoints; clients should retry (or we can offer the normal CP vs AP choice).

An example of such a protocol might be to put the PetSet controller in the failover loop. The current master failing a probe leads to the PetSet controller delivering a "failover" event to each non-master pod, one at a time, and waiting for its leader probe result before sending the event to the next pod. The potential masters need to yield to the most advanced slave by responding negatively.
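A hedged sketch of that controller-side loop follows. The `/failover` and `/leader` endpoints, the pod addresses, and the "true"/"false" response body are all hypothetical; no such protocol exists today.

```go
// Hypothetical sketch of the serial failover loop described above: the
// controller sends a "failover" event to each non-master pod in turn and
// stops at the first one that reports itself as leader. The endpoints and
// the response body ("true"/"false") are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// offerLeadership delivers a failover event to one pod, then asks whether it
// took the leader role. Pods that are behind are expected to answer "false".
func offerLeadership(client *http.Client, podAddr string) (bool, error) {
	resp, err := client.Post("http://"+podAddr+"/failover", "text/plain", nil)
	if err != nil {
		return false, err
	}
	resp.Body.Close()

	resp, err = client.Get("http://" + podAddr + "/leader")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(body)) == "true", nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	replicas := []string{"web-1.web:8080", "web-2.web:8080"} // non-master pods, in order
	for _, addr := range replicas {
		tookOver, err := offerLeadership(client, addr)
		if err != nil || !tookOver {
			continue // this candidate declined or is unreachable; try the next one
		}
		fmt.Println("new master:", addr)
		break
	}
}
```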

I guess the difference between your two cases is clustering. It does sound like simple readiness is enough for the VM case. As you noted, I don't know if volume fencing is enough to completely repurpose readiness in the clustered case.

@bprashanth
Contributor Author

The other way to solve this is to avoid a probe altogether and go with TTL'd leases. Pet-0 will get the master lease nine times out of ten; every other pet sets a watch on the lease, and if they ever see a TTL expiration they all try to grab it. The PetSet controller just applies the label to the pet named in the lease.

This feels like the lease API, though, and requires all pods to understand the apiserver, not to mention correctly handling a watch, etc. I feel like such complications shouldn't be necessary just to receive a failover event, and we can resolve the lease-taking race by just delivering this event serially.
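For reference, the lease pattern described here is roughly what client-go's leaderelection package (which postdates this discussion) now provides, with exactly the caveat raised above: every pod has to talk to the apiserver. A minimal sketch; the lease name, namespace, identity, and timings are illustrative.

```go
// Minimal sketch of lease-based leader election using client-go's
// leaderelection package. Lease name, namespace, identity, and timings are
// illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // e.g. the pod name, web-0

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "pet-master", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				fmt.Println(id, "became master")
				<-ctx.Done() // hold leadership until the context is cancelled
			},
			OnStoppedLeading: func() { fmt.Println(id, "lost the lease") },
		},
	})
}
```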

@bgrant0607
Member

The conditions idea sounds very similar to custom metrics.

@foxish
Contributor

foxish commented May 2, 2017

/cc @crimsonfaith91

@cmluciano

Is this still relevant? Do we think that custom metrics could support this?

@cmluciano

/assign

@thockin
Member

thockin commented May 25, 2017 via email

@cmluciano

/unassign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 28, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 27, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@rfer

rfer commented May 5, 2018

Has anyone figured out a way to handle the initial use case? It still seems relevant.

/reopen

@k8s-ci-robot
Contributor

@rfer: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

Has anyone figured out a way to handle the initial use case? It still seems relevant.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@goutamtadi1

Yes, this seems unresolved. We are also looking for a solution to a similar problem. Has anyone been able to solve this?
