
PetSet (was nominal services) #260

Closed
bgrant0607 opened this issue Jun 26, 2014 · 160 comments
Labels: area/api area/downward-api area/stateful-apps kind/design kind/documentation priority/important-soon sig/network

@bgrant0607 (Member) commented Jun 26, 2014

@smarterclayton raised this issue in #199: how should Kubernetes support non-load-balanced and/or stateful services? Specifically, Zookeeper was the example.

Zookeeper (or etcd) exhibits 3 common problems:

  1. Identification of the instance(s) clients should contact
  2. Identification of peers
  3. Stateful instances

And it enables master election for other replicated services, which typically share the same problems, and probably need to advertise the elected master to clients.

@bgrant0607 (Member, Author) commented Jun 27, 2014

Note that we should probably also rename service to lbservice or some such, to distinguish them from other types of services.

@bgrant0607 (Member, Author) commented Jul 9, 2014

As part of this, I'd remove service objects from the core apiserver and facilitate the use of other load balancers, such as HAProxy and nginx.

@smarterclayton (Contributor) commented Jul 9, 2014

It would be nice if the logical definition of a service (the query and/or global name) could be used or specialized in multiple ways: as a simple load balancer installed via the infrastructure; as a more feature-complete load balancer like nginx or haproxy, also offered by the infrastructure; as a queryable endpoint an integrator could poll/wait on (GET /services/foo -> { endpoints: [{host, port}, ...] }); or as information available to hosts to expose local load balancers. Obviously these could be multiple different use cases, and as such split into their own resources, but having some flexibility to specify intent (unify under a load balancer) distinct from mechanism makes it easier to satisfy a wide range of requirements.
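The queryable-endpoint variant can be sketched minimally. The response shape ({ endpoints: [{host, port}, ...] }) comes from the comment above; the function name and the handling of a missing field are assumptions, not an actual API:

```python
import json

def parse_endpoints(body):
    """Turn a hypothetical GET /services/foo JSON response into host:port strings."""
    doc = json.loads(body)
    return ["%s:%s" % (e["host"], e["port"]) for e in doc.get("endpoints", [])]

# Example response an integrator might poll/wait on:
response = '{"endpoints": [{"host": "10.0.0.1", "port": 8080}, {"host": "10.0.0.2", "port": 8080}]}'
print(parse_endpoints(response))  # ['10.0.0.1:8080', '10.0.0.2:8080']
```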

@bgrant0607 (Member, Author) commented Jul 9, 2014

@smarterclayton I agree with separating policy and mechanism.

Primitives we need:

  1. The ability to poll/watch a set identified by a label selector. Not sure if there is an issue filed yet.
  2. The ability to query pod IP addresses (#385).

This would be enough to compose with other naming/discovery mechanisms and/or load balancers. We could then build a higher-level layer on top of the core that bundles common patterns with a simple API.
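A hypothetical sketch of composing on primitive 1: folding a poll/watch event stream for a label-selected set into a name-to-IP map. The (action, name, ip) tuple shape is an assumption for illustration, not an actual Kubernetes API:

```python
def apply_events(pod_ips, events):
    """Fold a stream of (action, pod_name, ip) tuples into a name -> IP map,
    mimicking a poll/watch over a set identified by a label selector."""
    for action, name, ip in events:
        if action == "ADDED":
            pod_ips[name] = ip
        elif action == "DELETED":
            pod_ips.pop(name, None)
    return pod_ips

# Hypothetical event stream for pods matching some label selector:
events = [
    ("ADDED", "zk-a", "10.0.0.1"),
    ("ADDED", "zk-b", "10.0.0.2"),
    ("DELETED", "zk-a", None),
]
print(apply_events({}, events))  # {'zk-b': '10.0.0.2'}
```

A naming/discovery layer or external load balancer would consume the resulting map rather than the raw events.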

@brendandburns (Contributor) commented Jul 13, 2014

Given the two primitives described by @bgrant0607, is it worth keeping this issue open? Or are there more specific issues we can file?

@smarterclayton (Contributor) commented Jul 14, 2014

I don't think the Zookeeper case is solved, since you need the unique identifier in each container. I think you could do this with 3 separate replication controllers (one per instance) or a mode on the replication controller.

@smarterclayton (Contributor) commented Jul 22, 2014

Service design, I think, deserves some discussion, as Brian notes. Currently it couples an infrastructure abstraction (local proxy) with a mechanism for exposure (environment variables in all containers) with a label query. There is an equally valid use case for an edge proxy that takes L7 hosts/paths and balances them to a label query, as well as supporting protocols like http(s) and web sockets. In addition, services have a hard scale limit today of 60k backends, shared across the entire cluster (the number of IPs allocated). It should be possible to run a local proxy on a minion that proxies only the services the containers on that host need, and also to avoid containers having to know about the external port. We can move this discussion to #494 if necessary.

@bgrant0607 bgrant0607 added area/downward-api sig/network kind/documentation labels Sep 30, 2014
@bgrant0607 bgrant0607 changed the title Support/document how to run other types of services Proposal: cardinal services Oct 2, 2014
@bgrant0607 bgrant0607 added the area/api label Oct 2, 2014
@bgrant0607 (Member, Author) commented Oct 2, 2014

Tackling the problem of singleton services and non-auto-scaled services with fixed replication, such as master-slave replicated databases, key-value stores with fixed-size peer groups (e.g., etcd, zookeeper), etc.

The fixed-replication cases require predictable array-like behavior. Peers need to be able to discover and individually address each other. These services generally have their own client libraries and/or protocols, so we don't need to solve the problem of determining which instance a client should connect to, other than to make the instances individually addressable.

Proposal: We should create a new flavor of service, called Cardinal services, which map N IP addresses instead of just one. Cardinal services would perform a stable assignment of these IP addresses to N instances targeted by their label selector (i.e., a specified N, not just however many targets happen to exist). Once we have DNS ( #1261, #146 ), it would assign predictable DNS names based on a provided prefix, with suffixes 0 to N-1. The assignments could be recorded in annotations or labels of the targeted pods.

This would preserve the decoupling of role assignment from the identities of pods and replication controllers, while providing stable names and IP addresses, which could be used in standard application configuration mechanisms.
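A minimal sketch of the stable-assignment idea, assuming each pod's previously recorded index (e.g., from an annotation) is fed back in on the next reconciliation. Pod names, the None-for-unassigned convention, and tie-breaking by name are illustrative assumptions; recorded indices are assumed unique:

```python
def assign_indices(pods, n):
    """pods maps pod name -> previously recorded index (or None if unassigned).
    Preserve recorded indices in 0..n-1, hand unassigned pods the lowest free
    slots, and leave any surplus pods beyond the target cardinality unassigned."""
    taken = {i for i in pods.values() if i is not None and i < n}
    free = iter(i for i in range(n) if i not in taken)
    out = {}
    for name, idx in sorted(pods.items()):
        if idx is not None and idx < n:
            out[name] = idx          # stable: keep the recorded assignment
        else:
            try:
                out[name] = next(free)  # fill the lowest free ordinal
            except StopIteration:
                pass                 # more pods than slots
    return out

# etcd-y is a replacement pod with no recorded index; it takes the free slot 1,
# while etcd-x and etcd-z keep their indices (and hence their DNS names/IPs).
current = {"etcd-x": 0, "etcd-y": None, "etcd-z": 2}
print(assign_indices(current, 3))  # {'etcd-x': 0, 'etcd-y': 1, 'etcd-z': 2}
```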

Some of the discussion around different types of load balancing happened in the services v2 design: #1107.

I'll file a separate issue for master election.

/cc @smarterclayton @thockin

@bgrant0607 bgrant0607 removed the kind/documentation label Oct 2, 2014
@smarterclayton (Contributor) commented Oct 2, 2014

The assignments would have to carry through into the pods via some environment parameterization mechanism (almost certainly).

For the etcd example, I would create:

  • replication controller cardinality 1: 1 pod, pointing to stable storage volume A
  • replication controller cardinality 2: 1 pod, pointing to stable storage volume B
  • replication controller cardinality 3: 1 pod, pointing to stable storage volume C
  • cardinal service 'etcd' pointing to the pods

If pod 2 dies, replication controller 2 creates a new copy of it and reattaches it to volume B. Cardinal service 'etcd' knows that that pod is new, but how does it know that it should be cardinality 2 (which comes from data stored on volume B)?

@thockin (Member) commented Oct 2, 2014

Rather than 3 replication controllers, why not a sharding controller, which looks at a label like "kubernetes.io/ShardIndex" when making decisions? If you want 3-way sharding, it makes 3 pods with indices 0, 1, 2. I feel like this was shot down before, but I can't reconstruct the trouble it caused in my head.

It just seems wrong to place that burden on users if this is a relatively common scenario.

Do you think it matters if the nominal IP for a given pod changes due to unrelated changes in the set? For example:

at time 0, pods (A, B, C) make up a cardinal service, with IPs 10.0.0.{1-3} respectively

at time 1, the node which hosts pod B dies

at time 2, the replication controller driving B creates a new pod D

at time 3, the cardinal service changes to (A, C, D) with IPs 10.0.0.{1-3} respectively

NB: pod C's "stable IP" changed from 10.0.0.3 to 10.0.0.2 when the set membership changed. I expect this will do bad things to running connections.

To circumvent this, we would need to have the ordinal values specified outside of the service, or something else clever. Maybe that is OK, but it seems fragile and easy to get wrong if people have to deal with it.
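The reshuffle described above can be demonstrated with a toy sketch of naive positional assignment; the pod names and IPs follow the example in the comment, and the assignment function itself is hypothetical:

```python
def naive_assign(pods, ips):
    """Assign IPs positionally to the sorted member set. The index a pod gets
    depends on who else is in the set, so membership changes shift IPs."""
    return dict(zip(sorted(pods), ips))

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
t0 = naive_assign({"A", "B", "C"}, ips)  # time 0: A, B, C
t3 = naive_assign({"A", "C", "D"}, ips)  # time 3: B died, D replaced it
print(t0["C"], "->", t3["C"])  # 10.0.0.3 -> 10.0.0.2: C's "stable" IP moved
```

Recording each pod's ordinal outside the membership set (as proposed above) avoids this, at the cost of tracking that extra state.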


@smarterclayton (Contributor) commented Oct 2, 2014

I think a sharding controller makes sense and is probably more useful in context of a cardinal service.

I do think that IP changes based on membership are scary and I can think of a bunch of degenerate edge cases. However, if the cardinality is stored with the pods, the decision is less difficult.

@bgrant0607 (Member, Author) commented Oct 2, 2014

First of all, I didn't intend this to be about sharding -- that's #1064. Let's move sharding discussions to there. We've seen many cases of trying to use an analogous mechanism for sharding, and we concluded that it's not the best way to implement sharding.

@bgrant0607 (Member, Author) commented Oct 2, 2014

Second, my intention is that it shouldn't be necessary to run N replication controllers. It should be possible to use only one, though the number required depends on deployment details (canaries, multiple release tracks, rolling updates, etc.).

@bgrant0607 (Member, Author) commented Oct 2, 2014

Third, I agree we need to consider how this would interact with the durable data proposal (#1515) -- @erictune .

@bgrant0607 (Member, Author) commented Oct 2, 2014

Fourth, I agree we probably need to reflect the identity into the pod. As per #386, ideally a standard mechanism would be used to make the IP and DNS name assignments visible to the pod. How would IP and host aliases normally be surfaced in Linux?

@bgrant0607 (Member, Author) commented Oct 2, 2014

Fifth, I suggested that we ensure assignment stability by recording assignments in the pods via labels or annotations.

@paralin (Contributor) commented May 31, 2016

Embedded YAML in annotation strings? Oof, ouch :(. Thanks though, will investigate making a Cassandra set.

@bprashanth (Contributor) commented May 31, 2016

That's JSON. It's an alpha feature added to a GA object (init containers in pods).
@chrislovecnm is working on Cassandra; you might just want to wait him out.

@chrislovecnm (Member) commented May 31, 2016

@paralin here is what I am working on. No time to document it and get it into the k8s repo now, but that is the long-term plan: https://github.com/k8s-for-greeks/gpmr/tree/master/pet-race-devops/k8s/cassandra. It is working for me locally, on HEAD.

Latest C* image in the demo works well.

We do have an issue open for more documentation. Wink wink, nudge @bprashanth.

@matchstick matchstick added the kind/documentation label Jun 15, 2016
@ingvagabund (Contributor) commented Jun 30, 2016

PetSets example with etcd cluster [1].

[1] kubernetes-retired/contrib#1295

@smarterclayton (Contributor) commented Jun 30, 2016

Be sure to capture design asks on the proposal doc after you finish review.


@bprashanth (Contributor) commented Jul 6, 2016

The petset docs are https://github.com/kubernetes/kubernetes.github.io/blob/release-1.3/docs/user-guide/petset.md and https://github.com/kubernetes/kubernetes.github.io/tree/release-1.3/docs/user-guide/petset/bootstrapping. I plan to close this issue and open a new one that addresses moving petset to beta, unless anyone objects.

@bprashanth (Contributor) commented Jul 8, 2016

#28718

k8s-github-robot pushed a commit that referenced this issue Oct 27, 2016
Automatic merge from submit-queue

Proposal for implementing nominal services AKA StatefulSets AKA The-Proposal-Formerly-Known-As-PetSets

This is the draft proposal for #260.
xingzhou pushed a commit to xingzhou/kubernetes that referenced this issue Dec 15, 2016
Automatic merge from submit-queue

Proposal for implementing nominal services AKA StatefulSets AKA The-Proposal-Formerly-Known-As-PetSets

This is the draft proposal for kubernetes#260.
wking pushed a commit to wking/kubernetes that referenced this issue Jul 21, 2020
Change command setprefixname to setnameprefix
b3atlesfan pushed a commit to b3atlesfan/kubernetes that referenced this issue Feb 5, 2021
Bug fix where nil IP (byte slice) was dereferenced
and caused the goroutine to hang.

Fixes kubernetes#260 and kubernetes#267
avalluri added a commit to avalluri/kubernetes that referenced this issue May 2, 2021
This is a first step towards removing the mock CSI driver completely from
e2e testing in favor of hostpath plugin. With the recent hostpath plugin
changes(PR kubernetes#260, kubernetes#269), it supports all the features supported by the mock
csi driver.

Using hostpath-plugin for testing also covers CSI persistent feature
usecases.
pohly pushed a commit to pohly/kubernetes that referenced this issue Dec 2, 2021
This is a first step towards removing the mock CSI driver completely from
e2e testing in favor of hostpath plugin. With the recent hostpath plugin
changes(PR kubernetes#260, kubernetes#269), it supports all the features supported by the mock
csi driver.

Using hostpath-plugin for testing also covers CSI persistent feature
usecases.
pjh pushed a commit to pjh/kubernetes that referenced this issue Jan 31, 2022