
Shared Pod Storage (e.g. Config Storage) #6923

Closed
invino4 opened this issue Apr 16, 2015 · 11 comments
Labels
area/extensibility, priority/backlog, sig/api-machinery

Comments

@invino4
Contributor

invino4 commented Apr 16, 2015

I am working on a Kubernetes extension that would run in its own pod but needs to store some information (e.g. configuration) with requirements similar to the data stored in the API Server. In particular, it would be nice if the storage had the following properties (a sketch of how a client might use such a service follows the list):

  • Durable: Survives pod restarts, migrations, and upgrades.
  • Consistent: Even if there are multiple pods writing the same resource concurrently, consistency is maintained.
  • Addressable: When a pod starts it can find the storage allocated for its application.
  • Isolated: Different applications don't accidentally write over each other's data even if they use the same names, paths, keys, etc.
  • Semi-structured: Something like JSON-store would be fine.
  • Service Oriented: Ideally this is provided as a simple RESTful service such that it is accessible from any operating environment and doesn't require linking any specific client library.
  • Timestamped: Each resource has a logical-clock timestamp (e.g. an ETag) that can be used for optimistic concurrency control, ideally without imposing any particular versioning scheme or strategy on the resources themselves.
  • Watch Support: It is possible to watch or be notified of updates to resources (or even a small set of resources, e.g. subfolders).
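
To make the requirements concrete, here is a minimal sketch of how an extension might talk to such a service. The endpoint, paths, and payload are entirely hypothetical (no such Kubernetes API exists); the point is only to show the Timestamped/Consistent requirements expressed as an ETag plus an If-Match conditional write.

```go
// Hypothetical client for the storage service described above.
// The endpoint and paths are illustrative only.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const base = "http://config-store.example/v1/apps/my-extension" // hypothetical

func main() {
	// Read the current config and remember its logical-clock timestamp (ETag).
	resp, err := http.Get(base + "/settings")
	if err != nil {
		panic(err)
	}
	etag := resp.Header.Get("ETag")
	resp.Body.Close()

	// Write back conditionally: the server rejects the update (412) if
	// someone else modified the resource since we read it.
	req, _ := http.NewRequest(http.MethodPut, base+"/settings",
		bytes.NewBufferString(`{"logLevel":"debug"}`))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("If-Match", etag)

	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusPreconditionFailed {
		fmt.Println("conflict: re-read and retry")
		return
	}
	fmt.Println("updated, status:", resp.Status)
}
```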

Possibilities:

Use Volumes

Applications can always mount a volume and write their own storage to files (see the sketch after this list). This is not ideal for a few reasons:

  • It requires a lot of reimplementation for each application.
  • Disks that survive cluster and pod restarts are not a guaranteed part of all Kubernetes deployments, making apps more difficult to deploy and move between Kubernetes environments.
  • Persistent disks (PDs) can have undesirable limitations (e.g. limits on how many pods can mount them simultaneously).
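
As a concrete (hypothetical) example of the reimplementation burden, here is a minimal Go sketch of an app persisting its own JSON config to a volume mounted at an assumed path. The atomic write is easy enough; durability of the disk and cross-pod consistency are still unsolved.

```go
// Minimal sketch of the "bring your own storage on a volume" approach:
// the app serializes its config as JSON under a mounted path and writes
// it atomically (write to a temp file, then rename). The mount path is
// an assumption for illustration.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type Config struct {
	LogLevel string `json:"logLevel"`
	Replicas int    `json:"replicas"`
}

const dir = "/mnt/my-extension-state" // hypothetical volume mount

func save(cfg Config) error {
	data, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, ".config.json.tmp")
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	// Rename is atomic on the same filesystem, so readers never see a
	// half-written file. Concurrent writers in other pods are not handled.
	return os.Rename(tmp, filepath.Join(dir, "config.json"))
}

func main() {
	if err := save(Config{LogLevel: "info", Replicas: 3}); err != nil {
		panic(err)
	}
}
```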

Have applications run their own storage as a pod

Running your own etcd cluster is an obvious choice. However, running storage as a pod is actually quite tricky. Durability and consistency are difficult to provide and are often better maintained by a dedicated team that knows about backup and recovery. Durability itself requires some off-pod storage, which brings you back to volumes. Storage pods that are restarted need to rediscover and reattach the persistent storage used by a previous instance even though their identities are not related in any way (see Nominal Services).
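
For illustration, the application side of this option might look roughly like the following. It uses the etcd v3 Go client (the v2 HTTP API was current when this issue was filed); the service name and key prefix are assumptions, and none of the durability concerns above are solved by the client code.

```go
// Sketch of an extension talking to its own etcd pod via the etcd v3 Go
// client (go.etcd.io/etcd/client/v3). The service name is hypothetical.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://my-extension-etcd:2379"}, // hypothetical service
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	if _, err := cli.Put(ctx, "/my-extension/config/logLevel", "debug"); err != nil {
		panic(err)
	}

	// Watch the extension's key prefix for changes.
	for resp := range cli.Watch(ctx, "/my-extension/config/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```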

API Server's Etcd Cluster

The etcd cluster used by the API Server provides almost all of these features, with the exception of isolation. One possibility would be to expose a thin shim service at the API Server that wraps its underlying etcd cluster and re-exposes scoped portions of its namespace to pods. The shim would enforce scoping and authorization. The shim could also adapt the etcd watch interface to be more consistent with the semantics exposed by the API Server itself.
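
A rough sketch of what the scoping part of such a shim could look like follows, with a placeholder store interface and a placeholder authentication helper. This is not an actual Kubernetes component; it only illustrates forcing every key under a per-namespace prefix so one application cannot name its way into another's data.

```go
// Rough sketch of the "thin shim" idea: an HTTP layer that rewrites each
// request onto a namespace-scoped key prefix before it reaches the
// backing store. The store interface and auth helper are placeholders.
package main

import (
	"errors"
	"io"
	"log"
	"net/http"
	"path"
)

// KV is a placeholder for whatever backs the shim (e.g. the apiserver's
// etcd); only the scoping logic is the point of this sketch.
type KV interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
}

type shim struct {
	store KV
}

// namespaceFor stands in for real authentication/authorization; it would
// map the caller's credentials to the namespace they may use.
func namespaceFor(r *http.Request) (string, bool) {
	ns := r.Header.Get("X-Namespace") // placeholder credential
	return ns, ns != ""
}

func (s *shim) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ns, ok := namespaceFor(r)
	if !ok {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	// Every key is forced under /extensions/<namespace>/..., regardless of
	// what the client asked for: this is the isolation property.
	key := path.Join("/extensions", ns, path.Clean("/"+r.URL.Path))

	switch r.Method {
	case http.MethodGet:
		val, err := s.store.Get(key)
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		w.Write(val)
	case http.MethodPut:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := s.store.Put(key, body); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// memKV is a throwaway in-memory store so the sketch runs; it is not
// safe for concurrent use.
type memKV map[string][]byte

func (m memKV) Get(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

func (m memKV) Put(key string, value []byte) error {
	m[key] = value
	return nil
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", &shim{store: memKV{}}))
}
```

Timestamps and watch support would ride on whatever the backing store provides, which is part of why reusing the API Server's machinery is attractive.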

Pros: This possibility unifies the problems of durability, consistency, and addressability with the API Server's nearly identical requirements. Since any deployment must already solve these issues for the API Server, it doesn't create any additional burden. In hosted environments (e.g. GKE) the host provides etcd and its durable storage in a way that is independent of the Kubernetes model, allowing backup, restore, and survivability to be implemented in ways that would be difficult if run directly as a pod. The API Server already implements a sophisticated RESTful web service endpoint with authentication, authorization, timestamping, and watch support. This logic could be shared by the shim without additional complexity or duplication.

Cons: Opening up the API Server's etcd cluster to third-party applications (even if the shim correctly implements isolation) will create additional load on both the API Server and etcd. This could affect the API Server's responsiveness and scalability in ways that may be difficult to predict. Bugs in the shim or etcd might expose data from the master or from other applications to corruption or deletion.

@erictune
Member

@smarterclayton can you comment on this?

@smarterclayton
Contributor

Will do


@smarterclayton
Contributor

We've talked about exposing "Etcd-as-a-service" - allowing clients to request an endpoint with which they can interact using etcd client tools directly, but secured and provisioned dynamically. A subset of the keyspace would be carved up for each use case and offered on a distinct endpoint. One advantage would be the ability to scale that out - for simple systems it could reuse the apiserver, for larger systems it could be sharded and decoupled.
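
For illustration, the "carve up a subset of the keyspace per use case" part of this could be expressed with etcd role-based access control. The sketch below uses the current etcd v3 Go client and its auth API (which postdate this discussion); the endpoint, tenant name, and password handling are all made up, and it assumes auth is enabled on the cluster.

```go
// Sketch of provisioning a scoped etcd endpoint: one role and one user
// per tenant, limited to that tenant's key prefix, created by a broker
// with admin access. All names are hypothetical.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Admin connection used by the provisioner.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-broker:2379"}, // hypothetical
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	prefix := "/tenants/my-extension/"

	// One role per tenant, limited to that tenant's key prefix.
	if _, err := cli.RoleAdd(ctx, "my-extension"); err != nil {
		panic(err)
	}
	if _, err := cli.RoleGrantPermission(ctx, "my-extension",
		prefix, clientv3.GetPrefixRangeEnd(prefix),
		clientv3.PermissionType(clientv3.PermReadWrite)); err != nil {
		panic(err)
	}

	// One user per tenant; its credentials would be handed to the pod
	// (e.g. via a secret) along with the endpoint.
	if _, err := cli.UserAdd(ctx, "my-extension", "generated-password"); err != nil {
		panic(err)
	}
	if _, err := cli.UserGrantRole(ctx, "my-extension", "my-extension"); err != nil {
		panic(err)
	}
}
```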

Once volume-as-a-service lands (persistent volumes), it should be progressively easier to do things like run Gluster and use shared, secured mounts (service accounts and security contexts will let you allocate unique Unix UIDs for each namespace as needed, for sharing remote storage).

Single-server etcd should become much easier when PDs are in place - you should easily be able to run small etcd servers that are resilient to failure (perhaps not clustered) at low cost (40-50 MB each?)

Trying to think of other ideas we've covered...

@thockin
Member

thockin commented Apr 17, 2015

The nascent config API object?


@smarterclayton
Contributor

I think it would be an exact match for the etcd API, and be targeted at etcd. I feel something is wrong about mutating the etcd API to be even more generic, mostly because then people need custom clients in every language.


@bgrant0607
Member

Lots of related threads: #991, #1553, #1627, #2068, #6477. If it's truly an extension of Kubernetes, we do plan to provide generic storage for API plugins (#991).

Exposing etcd directly would be problematic if we ever were to support another storage backend: #1957.

@thockin
Member

thockin commented Apr 17, 2015

Yeah, if we expose etcd it should be as a service, not because it happens to already be present.


@smarterclayton
Contributor

Yeah, I'm thinking of an optional API endpoint, either provisioned automatically for clients or available on request.

In the long term, this is effectively the service broker pattern - request an instance of X (where X can be anything) provisioned and attached to your namespace (external service created, environment set on an RC, secrets set in the namespace that grant access). You can then talk to your local service to access a remote resource without having to know anything about the implementation of that resource. This is common in PaaS environments.


@bgrant0607
Member

PaaSes give me the impression that there's a hard distinction between Apps and Services -- the former are run on the PaaS while the latter are run using an underlying IaaS orchestration layer. I think it's great if we can support that, but even better if applications that would be run as Services in a PaaS environment could be run on Kubernetes. In this case, we definitely want to be able to run etcd clusters on Kubernetes, using features such as nominal services (#260).

That said, I'd like extensions/plugins to be on equal footing with core objects. I don't see a compelling reason to not allow them to use the same etcd instance (via the apiserver) so long as we apply proper per-user (not per-plugin) storage quotas and request rate limits. In my experience, users can't really do more damage than they'd do with just pods alone: out-of-control replication, crash loops, enormous (multi-megabyte) command line args or env var lists, ...

As for watch scaling, I fully expect that eventually virtually every container running in the cluster will have multiple watches active simultaneously. We're going to need a replicated fanout layer to handle that.

@smarterclayton
Contributor


Agree - it should be easy to bridge the distinction, and we should have concepts that blur the line between:

  • Allocate me an app/service/instance outside the cluster
  • Allocate me an app/service/instance inside the cluster (but I don't want to know about the details)
  • I want to allocate a new thing inside the cluster

The tools and interaction for something running in the cluster to do all of the above should be identical - only the provisioning changes.


@davidopp davidopp self-assigned this Apr 21, 2015
@thockin thockin added kind/support and removed kind/support, priority/support labels May 19, 2015
@bgrant0607 bgrant0607 added area/app-lifecycle, area/extensibility, sig/api-machinery, kind/enhancement, priority/backlog and removed kind/support labels Jun 30, 2015
@ghost ghost removed the team/master label Aug 20, 2015
@bgrant0607 bgrant0607 added this to the v1.2-candidate milestone Sep 12, 2015
@thockin
Member

thockin commented Apr 25, 2016

Closing for lack of activity and because we have ConfigMap now.
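
For readers landing here later, a rough sketch of the ConfigMap route using client-go: an extension reads and watches its own ConfigMap through the API server, which covers the durability, addressability, and watch requirements from the original request. The namespace and ConfigMap name are made up.

```go
// Sketch of the ConfigMap approach that closed this issue: read and
// watch a ConfigMap via the apiserver with client-go. Names are made up.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config: uses the pod's service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	cm, err := client.CoreV1().ConfigMaps("my-extension").Get(ctx, "settings", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("logLevel:", cm.Data["logLevel"])

	// Watch for changes; resourceVersion provides the optimistic-concurrency
	// timestamp asked for in the original request.
	w, err := client.CoreV1().ConfigMaps("my-extension").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=settings",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
}
```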
