Set pod conditions from a container probe #28658

Closed
bprashanth opened this issue Jul 8, 2016 · 20 comments
Labels
area/stateful-apps, lifecycle/rotten, sig/network

Comments

@bprashanth
Contributor

I'd like a way to export some key pieces of information from a pod, periodically, into the apiserver, without invoking kubectl or linking against the Kubernetes API. One way to do this is through a new type of HTTP "condition" probe that returns a JSON struct of fields to set either in the PodCondition or in Pod annotations under a special key.

One can use this to implement a PetSet of "type=master/slave" by having a probe on all the slaves return "isLeader=false". The PetSet controller can create a private master Service with just a single endpoint that matches the one pet in the set that returns "isLeader=true". Users would hand out the DNS name/IP of this master Service to clients, knowing that it will always redirect to the master.
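For concreteness, here is a rough sketch of what the pod side of such a "condition" probe could serve. Nothing in it is an existing Kubernetes API: the endpoint path, the response shape, and the `amLeader` helper are assumptions made up for illustration. The kubelet (or a sidecar) would then copy the returned fields into the pod's conditions or into annotations under a reserved key.

```go
// Hypothetical sketch of what a pod-side "condition probe" endpoint could
// return. Neither the endpoint path nor the response shape is part of any
// existing Kubernetes API; they are assumptions made up for illustration.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// conditionProbeResponse is an assumed shape for the JSON the kubelet would
// copy into PodConditions, or into annotations under a reserved key.
type conditionProbeResponse struct {
	Conditions  map[string]bool   `json:"conditions"`  // e.g. {"isLeader": false}
	Annotations map[string]string `json:"annotations"` // optional extra metadata
}

// amLeader is a placeholder for the application's own leadership check.
func amLeader() bool { return false }

func main() {
	http.HandleFunc("/probe/conditions", func(w http.ResponseWriter, r *http.Request) {
		resp := conditionProbeResponse{
			Conditions: map[string]bool{"isLeader": amLeader()},
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```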

@kubernetes/sig-apps @thockin

@bprashanth added the sig/network, team/cluster, and area/stateful-apps labels on Jul 8, 2016
@bprashanth added this to the next-candidate milestone on Jul 8, 2016
@smarterclayton
Contributor

This is not uncommon in enterprise load balancers, although typically it's done with different endpoints. I thought both nginx and haproxy could also do something like this.

On the other hand, should PetSets always make pod-0 the leader (the programmed leader)? I assume this is for PetSets that have no innate failover, as per your earlier issue?


@bprashanth
Contributor Author

This is for pets with innate failover. Making pet-0 the leader is easy: we can give pet-0 to all the clients of the db. But if it gets partitioned, we need to either:

  1. redirect requests to pet-0 to the new master instead
  2. don't give pet-0 to clients, but give pet-master instead, and keep updating what pet-master resolves to

I'm going for 2 with this.

Yeah, we could force people to set up a load balancer in front that would fail over to one of the backups, effectively achieving the same thing. The only caveat is that we can't start sending writes to the old master as soon as it starts responding to health checks again. Nginx and haproxy take a list of servers; we can mark the first as primary and all the others as backup. When a new master is elected, we need to rewrite the config (I don't think there's an automatic way).
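One way to realize option 2 without rewriting a proxy config on every failover is for a controller to move a label such as `role=master` onto whichever pod currently reports leadership, and front the set with a Service whose selector is `role=master`. A minimal sketch using a recent client-go; the namespace, pod names, label values, and the fact that something else has already decided who the new master is are all assumptions.

```go
// Rough sketch: relabel the current master so a Service with selector
// role=master always resolves to it. Label keys/values, the namespace, and
// how the new master is discovered are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func promote(ctx context.Context, client kubernetes.Interface, ns, oldMaster, newMaster string) error {
	demote := []byte(`{"metadata":{"labels":{"role":"replica"}}}`)
	elect := []byte(`{"metadata":{"labels":{"role":"master"}}}`)

	// Demote the old master first so the Service never has two endpoints.
	if oldMaster != "" {
		if _, err := client.CoreV1().Pods(ns).Patch(ctx, oldMaster,
			types.StrategicMergePatchType, demote, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("demoting %s: %w", oldMaster, err)
		}
	}
	_, err := client.CoreV1().Pods(ns).Patch(ctx, newMaster,
		types.StrategicMergePatchType, elect, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "web-0"/"web-1" are placeholder pod names.
	if err := promote(context.Background(), client, "default", "web-0", "web-1"); err != nil {
		panic(err)
	}
}
```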

@bprashanth
Contributor Author

This is for pets with innate failover.

I mean, pets with innate failover for master/slave. For example, we don't need this for ZooKeeper, because you can write to any voting member and it waits for an ack from a majority. With MySQL master/slave you must write to the current master. I think the usual way to handle this is keepalived and a floating VIP.

@smarterclayton
Contributor

A Service cluster IP can manage this pretty effectively if we could figure out a way to have the endpoints controller select on the 'leader' signal. It's somewhat like readiness, except the Service would only ever mark a single pod as ready. Could we do this today with readiness checks and Services as-is, i.e. a readiness check on each member that only returns true if it is the leader?
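To make that concrete: if each member's readiness endpoint only returns 200 when it believes it is the leader, a plain Service selecting every member ends up with at most one ready endpoint. A minimal sketch; the `/ready` path and the `isLeader` check are placeholders for whatever the application actually knows about its own replication state.

```go
// Sketch of a readiness endpoint that only reports ready on the leader, so a
// plain Service selecting every member ends up with a single ready endpoint.
package main

import (
	"log"
	"net/http"
)

// isLeader is a placeholder: a real implementation would consult the
// database's replication / election state.
func isLeader() bool {
	return false
}

func main() {
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if isLeader() {
			w.WriteHeader(http.StatusOK) // leader: passes the readinessProbe
			return
		}
		// Non-leaders fail readiness and are dropped from the Service endpoints.
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```

The catch, as the next comment notes, is that readiness is pod-wide, so the non-leaders would also look unready to anything else that consumes readiness.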

@smarterclayton
Contributor

smarterclayton commented Aug 5, 2016

What's interesting is that with active/passive there should only ever be one "ready" member. But readiness is a pod condition, not a port condition, so we can't necessarily apply different readiness rules per container/port per Service.

Another thought - most active/passive sets are for things that are not innately HA and need fencing / safety guarantees in place. If we assume fencing is solved at the volume level (either by innate behavior in the storage provider like attach/detach or locks), then the guarantee still needs some sort of "terminate old, make sure data is flushed, then do something to the new". That problem sounds a lot like pet set reconfiguration, but the decision to do the failover is triggered externally (probably by a deletion of the pod by the node controller).

@smarterclayton
Contributor

Let me sketch out two active/passive use cases that are related to this and to pet sets:

  1. I want to make a VM highly available on a Kubernetes cluster by running QEMU inside the pod. Whenever that pod gets deleted by the node controller, I want to scale up a new pod, keep the new pod running, run a migration step that connects the two pods, and then allow the old pod to terminate. If the new pod is deleted in the middle of a migration, bad things happen - I may need to create a third pod.
    a. Variant - I want a hot failover VM that is receiving memory state deltas from the master - both pods are active at once, and a background process is feeding hypervisor state from active to passive.
  2. I want to make a classic Postgres database HA on a Kubernetes cluster with an active/passive setup. If the active pod is deleted, I want to move leadership to the passive pod. If the active pod fails a health check, I want to move leadership to the passive pod (based on a forgiveness value). Both pods are running at once. Leadership transitions typically require the other pod to be "caught up", and once leadership transfers the roles should be reversed. I want my load balancer to be able to easily move traffic to the passive pod only when the leadership transition occurs.

Leadership coming from the pods seems strange here, if only because, as the PetSet controller, I don't necessarily know which one to trust.

@bprashanth
Contributor Author

We'd have to come up with a way to take a majority (i.e. pods respond with leader=pod-1), which gets complicated because of epochs. The simple answer is to assume a partitioned master knows it is partitioned and responds negatively. I'm not sure we can solve this problem completely without doing something really complicated.

I think the best we can do is describe a simple protocol and guarantee that if pods obey it, the Service will point at the new master. The Service will probably go through some period where it has 0 endpoints; clients should retry (or we can offer the normal CP vs AP choice).

An example of such a protocol might be to put the PetSet controller in the failover loop. The current master failing a probe leads to the PetSet controller delivering a "failover" event to each non-master pod, one at a time, and waiting for its leader probe result before sending the event to the next pod. The potential masters need to yield to the most advanced slave by responding negatively.
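A hedged sketch of that controller-side loop follows. The `/failover` and `/leader` endpoints, the pod addresses, and the "true"/"false" response body are all hypothetical; no such protocol exists today.

```go
// Hypothetical sketch of the serial failover loop described above: the
// controller sends a "failover" event to each non-master pod in turn and
// stops at the first one that reports itself as leader. The endpoints and
// the response body ("true"/"false") are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// offerLeadership delivers a failover event to one pod, then asks whether it
// took the leader role. Pods that are behind are expected to answer "false".
func offerLeadership(client *http.Client, podAddr string) (bool, error) {
	resp, err := client.Post("http://"+podAddr+"/failover", "text/plain", nil)
	if err != nil {
		return false, err
	}
	resp.Body.Close()

	resp, err = client.Get("http://" + podAddr + "/leader")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(body)) == "true", nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	replicas := []string{"web-1.web:8080", "web-2.web:8080"} // non-master pods, in order
	for _, addr := range replicas {
		tookOver, err := offerLeadership(client, addr)
		if err != nil || !tookOver {
			continue // this candidate declined or is unreachable; try the next one
		}
		fmt.Println("new master:", addr)
		break
	}
}
```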

I guess the difference between your two cases is clustering. It does sound like simple readiness is enough for the VM case. As you noted, I don't know if volume fencing is enough to completely repurpose readiness in the clustered case.

@bprashanth
Contributor Author

The other way to solve this is to avoid a probe altogether and go with TTL'd leases. Pet-0 will get the master lease nine times out of ten; every other pet sets a watch on the lease, and if they ever see a TTL expiration they all try to grab it. The PetSet controller just applies the label to the pet named in the lease.

This feels like the lease API, though, and requires all pods to understand the apiserver, not to mention correctly handling a watch, etc. I feel like such complications shouldn't be necessary just to receive a failover event, and we can resolve the lease-taking race by just delivering this event serially.
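For reference, the lease pattern described here is roughly what client-go's leaderelection package (which postdates this discussion) now provides, with exactly the caveat raised above: every pod has to talk to the apiserver. A minimal sketch; the lease name, namespace, identity, and timings are illustrative.

```go
// Minimal sketch of lease-based leader election using client-go's
// leaderelection package. Lease name, namespace, identity, and timings are
// illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // e.g. the pod name, web-0

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "pet-master", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				fmt.Println(id, "became master")
				<-ctx.Done() // hold leadership until the context is cancelled
			},
			OnStoppedLeading: func() { fmt.Println(id, "lost the lease") },
		},
	})
}
```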

@bgrant0607
Member

The conditions idea sounds very similar to custom metrics.

@foxish
Contributor

foxish commented May 2, 2017

/cc @crimsonfaith91

@cmluciano

Is this still relevant? Do we think that custom metrics could support this?

@cmluciano

/assign

@thockin
Member

thockin commented May 25, 2017 via email

@cmluciano

/unassign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 28, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 27, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@rfer

rfer commented May 5, 2018

Has anyone figured out a way to handle the initial use case? It still seems relevant.

/reopen

@k8s-ci-robot
Contributor

@rfer: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

Has anyone figured out a way to handle the initial use case? It still seems relevant.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@goutamtadi1

Yes, this seems unresolved. We are also looking for a solution to a similar problem. Has anyone been able to solve this?
