<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.3/docs/proposals/job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>

--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->
# Ingress HA, Scheduling, and Provisioning Proposal
----
## Overview
Ingress can be used to expose a service in the Kubernetes cluster:

* usually the cluster admin deploys one Ingress Pod
* a user creates an Ingress Resource
* the Ingress Pod lists and watches all Ingress Resources in the cluster
* a user outside the cluster can then access a service in the cluster through
  the IP of the node on which the Ingress Pod is running; the Ingress Pod
  forwards the request into the cluster based on the rules defined in the
  Ingress Resource
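For concreteness, here is what such an Ingress Resource might look like as a
Go struct literal. This is a sketch using the `extensions` API group of the
time; the host, service name, and port are made up:

```
package example

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/apis/extensions"
	"k8s.io/kubernetes/pkg/util/intstr"
)

// echoIngress asks the Ingress Pod to forward requests for
// echo.example.com/ to port 8080 of the "echo" Service.
var echoIngress = extensions.Ingress{
	ObjectMeta: api.ObjectMeta{Name: "echo", Namespace: "default"},
	Spec: extensions.IngressSpec{
		Rules: []extensions.IngressRule{{
			Host: "echo.example.com",
			IngressRuleValue: extensions.IngressRuleValue{
				HTTP: &extensions.HTTPIngressRuleValue{
					Paths: []extensions.HTTPIngressPath{{
						Path: "/",
						Backend: extensions.IngressBackend{
							ServiceName: "echo",
							ServicePort: intstr.FromInt(8080),
						},
					}},
				},
			},
		}},
	},
}
```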
This just works. What are the issues, then?

The issues are:

* It does not provide high availability, because the client needs to know
  the IP address of the node where the Ingress Pod is running. In case of a
  failure the Ingress Pod can be moved to a different node.
* How many Ingress Pods should run in a cluster? Should all Ingress Pods
  list and watch all Ingress Resources without distinction? There is no way
  to bind or schedule an Ingress Resource to an Ingress Pod/ReplicaSet (or a
  set of Ingress Pods/ReplicaSets), which results in insufficient or excessive
  use of resources.
## Goal
This proposal aims to address the above issues with the following mechanisms:

* Ingress HA: use keepalived and a VIP to provide high availability (mainly
  for the nginx/haproxy implementations; cloud implementations usually already
  provide HA)

  > **Review:** Clarify how much this gives us over just running a Service
  > over the ingress controller(s)?
  >
  > **Reply:** Do you mean running a NodePort-type Service over the ingress
  > controller(s) and using keepalived-vip to provide HA for the Service? I
  > just want to simplify Ingress creation: that is not a "happy path" and
  > hinders automation (especially considering Ingress auto-provisioning).
  > Instead, it would be much more helpful to just create an Ingress
  > ReplicaSet and implement the HA logic in the Ingress Pod.

* Ingress Scheduling: schedule Ingress Resources to Ingress Pods/ReplicaSets,
  just like the relationship between PV and PVC

  > **Review:** Do you really want scheduling, or is that taking it too far?
  >
  > **Reply:** I think there should be three mechanisms for choosing: …
  >
  > **Review:** What's involved in scheduling, QoS? I agree with @bprashanth
  > that this is a little far; maybe you just mean bound?

* Ingress Provisioning: allow users to dynamically add Ingress
  Pods/ReplicaSets on demand

  > **Review:** This is a good goal, one we thought to solve with ingress
  > claims: #30151. We haven't fleshed out the model yet.
  >
  > **Review:** This is a pretty important feature. Is anyone actively
  > working on it?
  >
  > **Reply:** We've already started working on some of it; @mqliang will
  > update the proposal in a few days.
## NonGoal

* Ingress ReplicaSet rolling update
## Ingress HA
(AKA: Ingress Virtual IP using keepalived)

> **Review:** I think we need to separate the keepalived details from the
> API. We need a way to get a VIP, public or private. That may be keepalived,
> or an iptables proxy, or something new.
>
> **Reply:** I agree.

#### High level design
* use keepalived to provide HA
* the cluster admin chooses a group of nodes that can be accessed from outside
  the cluster and are in the same L2 broadcast domain to run the Ingress Pods
* deploy the Ingress Pods using a ReplicaSet (at least 2 replicas for HA)
* use the AntiAffinity feature so that Ingress Pods created by the same
  Ingress ReplicaSet are scheduled to different nodes
* the cluster admin chooses a CIDR for Ingress VIPs (AKA IngressVIPCIDR)
* each Ingress ReplicaSet is allocated a VIP from IngressVIPCIDR (allocated by
  the cluster admin or the API server); a sketch of such an allocator follows
  this list

  > **Review:** In the claims proposal, each ingress claim would get a VIP.
  >
  > **Reply:** I believe by "Ingress ReplicaSet" @mqliang means something
  > that actually holds the VIP, kind of like the ingress controller in the
  > current setup. However, by "each ingress claim would get a VIP", you seem
  > to suggest that the life cycle of the VIP is bound to the claim? I'm
  > trying to figure out the difference and whether we want to apply the
  > PV/PVC model to ingress claims. If we do want to use the PV/PVC model,
  > then there needs to be such an "Ingress ReplicaSet" which actually holds
  > the VIP, not the claim holding the VIP. Then users can create ingress
  > claims to claim VIPs. It is the VIP, not the claim, that has attributes
  > on it, like QoS. If not, then the only new type we need is the ingress
  > claim; in that case, how do we add more ingress resources to an existing
  > claim? For example, if a user creates an ingress claim for "*.foo.com", a
  > VIP is allocated for it, but it is not yet useful; next, the user can
  > create ingress resources to consume the DNS and VIP. Now suppose some
  > more ingress resources are needed for the DNS "*.bar.com" and the user
  > wants to use the same VIP; how do they achieve this? Do they have to edit
  > their claim?

* Ingress Pods use the host network
* Ingress Pods created by the same Ingress ReplicaSet run keepalived; only
  one Ingress Pod holds the VIP at a time
* users outside the cluster access in-cluster services through the Ingress VIP
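To make the VIP allocation concrete, here is a minimal sketch of an allocator
that hands out unused addresses from IngressVIPCIDR. It is illustrative only:
whether the allocator lives in admin tooling or in the API server, a real
implementation would need persistence and concurrency control, and would skip
the network and broadcast addresses.

```
package vipalloc

import (
	"fmt"
	"net"
)

// Allocator hands out unused VIPs from the admin-chosen IngressVIPCIDR.
type Allocator struct {
	cidr  *net.IPNet
	inUse map[string]bool // VIPs already assigned to Ingress ReplicaSets
}

// New builds an Allocator for a CIDR such as "10.1.2.0/24".
func New(cidr string) (*Allocator, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	return &Allocator{cidr: ipnet, inUse: map[string]bool{}}, nil
}

// AllocateNext returns the first free VIP in the CIDR.
func (a *Allocator) AllocateNext() (net.IP, error) {
	for ip := a.cidr.IP.Mask(a.cidr.Mask); a.cidr.Contains(ip); ip = next(ip) {
		if !a.inUse[ip.String()] {
			a.inUse[ip.String()] = true
			return ip, nil
		}
	}
	return nil, fmt.Errorf("IngressVIPCIDR %s is exhausted", a.cidr)
}

// next returns ip+1, copying because net.IP is a slice.
func next(ip net.IP) net.IP {
	out := make(net.IP, len(ip))
	copy(out, ip)
	for i := len(out) - 1; i >= 0; i-- {
		out[i]++
		if out[i] != 0 {
			break
		}
	}
	return out
}
```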
#### Why VIP instead of round-robin DNS
A question that pops up every now and then is why we do all this stuff
with virtual IPs rather than just use standard round-robin DNS.
There are a few reasons:

* There is a long history of DNS libraries not respecting DNS TTLs and
  caching the results of name lookups.
* Many apps do DNS lookups once and cache the results.
* Even if apps and libraries did proper re-resolution, the load of every
  client re-resolving DNS over and over would be difficult to manage.
#### Challenge
* The VIP is bound to the Ingress ReplicaSet; how do we expose it to the
  Ingress Pods? One approach is to use a ConfigMap, but then the cluster
  admin needs to allocate the VIP and write it to the ConfigMap, which makes
  automatic deployment harder.
* All Ingress Pods created by the same Ingress ReplicaSet need to know the
  others' RIPs (real IPs).
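Assuming the ConfigMap approach, here is a sketch of how an Ingress Pod might
render its keepalived configuration from the VIP and peer RIPs. The
`INGRESS_VIP` and `INGRESS_PEER_RIPS` environment variables are hypothetical
names for values the admin (or a controller) would have to populate, e.g.
from a ConfigMap:

```
package main

import (
	"log"
	"os"
	"strings"
	"text/template"
)

// A minimal vrrp_instance stanza; every Ingress Pod in the ReplicaSet runs
// keepalived with the same VIP and its peers' real IPs.
const keepalivedConf = `vrrp_instance ingress {
  state BACKUP
  interface eth0
  virtual_router_id 50
  priority 100
  virtual_ipaddress {
    {{.VIP}}
  }
  unicast_peer {
{{- range .PeerRIPs}}
    {{.}}
{{- end}}
  }
}
`

func main() {
	data := struct {
		VIP      string
		PeerRIPs []string
	}{
		// Hypothetical env vars, e.g. populated from a ConfigMap.
		VIP:      os.Getenv("INGRESS_VIP"),
		PeerRIPs: strings.Split(os.Getenv("INGRESS_PEER_RIPS"), ","),
	}
	tmpl := template.Must(template.New("keepalived").Parse(keepalivedConf))
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		log.Fatal(err)
	}
}
```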
## Ingress Scheduling

#### High level design

* Ingress ReplicaSets are created by the cluster admin in advance
* If all Ingress Pods are saturated, it is the cluster admin's duty to create
  more Ingress ReplicaSets
* There is an Ingress Scheduler which schedules Ingress Resources to Ingress
  ReplicaSets

  > **Review:** IMO this might not be necessary. Users pick an Ingress claim
  > based on QoS needs. The Ingress claim has a VIP. The VIP is backed by
  > pods. If the pods are saturated, the admin or an autoscaler needs to
  > scale them, but the VIP doesn't change.
  >
  > **Reply:** It seems more reasonable that: …
  >
  > **Review:** Scale up or scale out?

* An Ingress Pod will only list and watch the Ingress Resources which are
  scheduled to its Ingress ReplicaSet.
* The Ingress Scheduler makes the scheduling decision based on
  labels/selectors, the number of Ingress Resources already bound, and some
  metrics (for example, the mem/cpu/bandwidth load of the Ingress Pods); a
  sketch of this decision follows the list
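A minimal sketch of that decision, assuming hypothetical bookkeeping types
(`ReplicaSetInfo` and its load field are made up; only `labels.Selector` is a
real Kubernetes type):

```
package ingressscheduler

import (
	"errors"

	"k8s.io/kubernetes/pkg/labels"
)

// ReplicaSetInfo is what the scheduler knows about one Ingress ReplicaSet.
type ReplicaSetInfo struct {
	Name       string
	Labels     labels.Set // matched against the Ingress Resource's selector
	BoundCount int        // Ingress Resources already bound to it
	Load       float64    // e.g. aggregated mem/cpu/bandwidth load, 0..1
}

// Schedule picks the Ingress ReplicaSet for one Ingress Resource: filter by
// selector, then prefer the least-loaded, least-bound candidate.
func Schedule(selector labels.Selector, rss []ReplicaSetInfo) (string, error) {
	var best *ReplicaSetInfo
	for i := range rss {
		rs := &rss[i]
		if !selector.Matches(rs.Labels) {
			continue // selector filtering, analogous to PV/PVC matching
		}
		if best == nil || rs.Load < best.Load ||
			(rs.Load == best.Load && rs.BoundCount < best.BoundCount) {
			best = rs
		}
	}
	if best == nil {
		return "", errors.New("no Ingress ReplicaSet matches the selector")
	}
	return best.Name, nil
}
```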
#### Implementation
* add IngressReplicaSetName and Selector fields to IngressSpec
  (added as annotations during incubation)

```
type IngressSpec struct {
	/*
	   ... existing fields ...
	*/

	// Selector restricts which Ingress ReplicaSets this Ingress Resource
	// may be scheduled to.
	Selector labels.Selector

	// IngressReplicaSetName is set by the Ingress Scheduler once the
	// Ingress Resource is bound to an Ingress ReplicaSet.
	IngressReplicaSetName string
}
```
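During incubation the two fields could ride along as annotations on the
Ingress object; the keys below are hypothetical placeholders, not an agreed
API:

```
// Hypothetical annotation keys used while the fields incubate.
const (
	// Value: a label selector string, e.g. "ingress-tier=frontend".
	IngressSelectorAnnotation = "ingress.alpha.kubernetes.io/selector"

	// Value: the name of the bound Ingress ReplicaSet, written by the
	// Ingress Scheduler.
	IngressReplicaSetAnnotation = "ingress.alpha.kubernetes.io/replica-set"
)
```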

* The Ingress Scheduler will bind an Ingress Resource to an Ingress ReplicaSet
* An Ingress Pod will only list and watch the Ingress Resources which are
  scheduled to its Ingress ReplicaSet
* As a first step, implement the Ingress Scheduler so that it respects the
  Selector
* In the long run, make the Ingress Scheduler base its scheduling decisions on
  monitoring metrics (e.g. mem/cpu/bandwidth load).
#### Challenge
* An Ingress Resource is bound to an Ingress ReplicaSet; how do we expose the
  binding to the Ingress Pods?

#### TBD
* Should the scheduler bind an Ingress Resource to only one Ingress
  ReplicaSet, or to multiple Ingress ReplicaSets?
## Ingress Provisioning

#### High level design
* Ingress ReplicaSets can be dynamically provisioned on demand, instead of
  being created by the cluster admin in advance
* If users want an Ingress ReplicaSet to serve their Ingress Resources, they
  can create an IngressClaim resource:
```
type IngressClaim struct {
	unversioned.TypeMeta `json:",inline"`
	ObjectMeta           `json:"metadata,omitempty"`

	Spec   IngressClaimSpec
	Status IngressClaimStatus
}

type IngressClaimSpec struct {
	// Ingresses references the Ingress Resources to be served by the
	// auto-provisioned Ingress ReplicaSet.
	Ingresses []LocalObjectReference

	// IngressReplicaSetSpec describes the Ingress ReplicaSet to provision.
	IngressReplicaSetSpec ReplicaSetSpec
}

type IngressClaimStatus struct {
	// IngressReplicaSetName names the auto-provisioned Ingress ReplicaSet.
	IngressReplicaSetName string
}
```

* No Ingress scheduling process is involved; the Ingress Resources in an
  IngressClaim are directly bound to the auto-provisioned Ingress ReplicaSet.
* If all Ingress Resources in an IngressClaim are deleted, the IngressClaim is
  retained/recycled/deleted based on a policy specified by the user
* Add an IngressClaimController to the ControllerManager to sync IngressClaim
  resources. It works similarly to the PersistentVolumeClaimController: it
  auto-provisions an Ingress ReplicaSet based on the IngressClaim, and
  retains/recycles/deletes the IngressClaim when its referenced Ingress
  Resources are deleted. A sketch of the sync loop follows.
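A hedged sketch of that sync loop, by analogy with the persistent volume
claim binder. It reuses the types proposed above; the `client` interface, the
`ReclaimPolicy` values, and all helper names are hypothetical, and
informers/error handling are omitted:

```
package ingressclaim

import "fmt"

// ReclaimPolicy mirrors the retain/recycle/delete behavior described above.
type ReclaimPolicy string

const (
	Retain  ReclaimPolicy = "Retain"
	Recycle ReclaimPolicy = "Recycle"
	Delete  ReclaimPolicy = "Delete"
)

// client abstracts the few operations the sync loop needs.
type client interface {
	ProvisionReplicaSet(spec *ReplicaSetSpec) (name string, err error)
	DeleteReplicaSet(name string) error
	RecycleReplicaSet(name string) error
	UpdateClaimStatus(claim *IngressClaim) error
	IngressExists(ref LocalObjectReference) bool
}

// syncClaim reconciles one IngressClaim against the cluster state.
func syncClaim(c client, claim *IngressClaim, policy ReclaimPolicy) error {
	// Apply the reclaim policy once every referenced Ingress Resource is gone.
	live := 0
	for _, ref := range claim.Spec.Ingresses {
		if c.IngressExists(ref) {
			live++
		}
	}
	if live == 0 && claim.Status.IngressReplicaSetName != "" {
		switch policy {
		case Delete:
			return c.DeleteReplicaSet(claim.Status.IngressReplicaSetName)
		case Recycle:
			// Scrub the ReplicaSet and return it to a pool; details TBD.
			return c.RecycleReplicaSet(claim.Status.IngressReplicaSetName)
		default: // Retain: leave the ReplicaSet for the admin to reclaim.
			return nil
		}
	}

	// Auto-provision the Ingress ReplicaSet on first sight of the claim; the
	// claim's Ingress Resources are bound to it directly, so no Ingress
	// scheduling is involved.
	if claim.Status.IngressReplicaSetName == "" {
		name, err := c.ProvisionReplicaSet(&claim.Spec.IngressReplicaSetSpec)
		if err != nil {
			return fmt.Errorf("provisioning Ingress ReplicaSet for claim: %v", err)
		}
		claim.Status.IngressReplicaSetName = name
		return c.UpdateClaimStatus(claim)
	}
	return nil
}
```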
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->