Ingress HA, Scheduling, and Provisioning Proposal #34013

docs/proposals/ingress-ha-scheduling-provisioning.md (194 additions, 0 deletions)
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.3/docs/proposals/job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Ingress HA, Scheduling, and Provisioning Proposal
----


## Overview
Ingress can be used to expose a service in the Kubernetes cluster:

* usually the cluster admin deploys one Ingress Pod
* a user creates an Ingress resource
* the Ingress Pod will list&watch all Ingress Resources in the cluster
* a user outside the cluster can then access a service in the cluster through
the IP of the node on which the Ingress Pod is running; the Ingress Pod forwards
the request into the cluster based on the rules defined in the Ingress Resource
(see the sketch below)
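
To make the flow concrete, here is a minimal, self-contained sketch (not the
real Kubernetes API types) of the kind of host/path rules an Ingress Resource
carries and how an Ingress Pod could use them to pick a backend service for an
incoming request; all names below are illustrative only.

```
// A simplified sketch of Ingress routing: the rule shape below mirrors the idea
// of the Ingress API (host + path -> backend service) but is NOT the real
// Kubernetes type; it only illustrates how an Ingress Pod forwards requests.
package main

import (
	"fmt"
	"strings"
)

type ingressRule struct {
	Host    string // e.g. "foo.example.com"
	Path    string // path prefix, e.g. "/web"
	Backend string // in-cluster service, e.g. "web-svc:80"
}

// route returns the backend service for a request, or false if no rule matches.
func route(rules []ingressRule, host, path string) (string, bool) {
	for _, r := range rules {
		if r.Host == host && strings.HasPrefix(path, r.Path) {
			return r.Backend, true
		}
	}
	return "", false
}

func main() {
	// Rules the Ingress Pod learned by list&watching Ingress Resources.
	rules := []ingressRule{
		{Host: "foo.example.com", Path: "/web", Backend: "web-svc:80"},
		{Host: "foo.example.com", Path: "/api", Backend: "api-svc:8080"},
	}
	if backend, ok := route(rules, "foo.example.com", "/api/v1/users"); ok {
		fmt.Println("forward to", backend) // forward to api-svc:8080
	}
}
```
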

This just works. What are the issues then?

The issues are:

* It does not provide High Availability, because the client needs to know
the IP address of the node where the Ingress Pod is running. In case of a
failure the Ingress Pod can be moved to a different node.
* How many Ingress Pods should run in a cluster? Should all Ingress Pods
list&watch all Ingress Resources without distinction? There is no way
to bind or schedule an Ingress Resource to an Ingress Pod/ReplicaSet (or a
set of Ingress Pods/ReplicaSets), resulting in insufficient or excessive
use of resources.

> **Contributor:** s/with out/without

## Goal
This proposal aims to address the above issues with the following mechanisms:

* Ingress HA: use keepalived and a VIP to provide High Availability (mainly
for the nginx/haproxy implementations; cloud implementations usually already
provide HA)

> **Contributor:** clarify how much this gives us over just running a Service over the ingress controller(s)?
>
> **Contributor Author:** Do you mean running a NodePort type Service over the ingress controller(s) and using keepalived-vip to provide HA for the Service?
>
> I just want to simplify Ingress creation:
>
> * create an Ingress ReplicaSet
> * create a NodePort type Service over the Ingress ReplicaSet
> * create a keepalived-vip DaemonSet to provide HA for the Ingress Service
>
> That is not a "happy path" and it hinders automation (especially considering Ingress auto provisioning). Instead, it would be much more helpful to just create an Ingress ReplicaSet and implement the HA logic in the Ingress Pod.
* Ingress Scheduling: schedule Ingress Resources to Ingress Pods/ReplicaSets

> **Contributor:** Do you really want scheduling, or is that taking it too far?
> With claims you have different QoS classes, and a user picks a class.
> The ingress is satisfied by whatever's behind that class, be it a single pod, a group of pods, a cloud LB, etc.
>
> **Contributor Author (@mqliang, Oct 8, 2016):** I think there should be three mechanisms to choose from:
>
> * if the user knows exactly which Ingress Pod/RS to use, just use it.
> * if the user doesn't know exactly which Ingress Pod/RS to use, but knows he wants an Ingress Pod with certain properties (cpu/mem/bandwidth available, nginx/haproxy/cloud lb implementation, etc.), he can claim one, and Kubernetes will iterate over all existing Ingress Pods/RSes to find the best match for him (in other words, it's scheduling).
> * the user can also claim an Ingress Pod/RS with an "auto-provision" annotation; in that case Kubernetes will dynamically provision one.
>
> Just like the relationship between PV and PVC:
>
> * if the user wants to use a specific cloud disk (and knows the cloud disk id), just use it.
> * if the user just wants a PV of a certain size with certain properties, he can claim a PV by creating a PVC: Kubernetes will find the best matching PV (scheduling) for him.
> * the user can also create a PV with an auto-provision annotation; in that case Kubernetes will call the cloud provider API to create a cloud disk.
>
> **Contributor:** What's involved in scheduling, QoS? I agree with @bprashanth that this is a little far; maybe you just mean bound?
>
> > if the user doesn't know exactly which Ingress Pod/RS to use, but knows he wants an Ingress Pod with certain properties (cpu/mem/bandwidth available, nginx/haproxy/cloud lb implementation, etc.), he can claim one, and Kubernetes will iterate over all existing Ingress Pods/RSes to find the best match for him (in other words, it's scheduling)

* Ingress Provisioning: allow users to dynamically add Ingress Pods/ReplicaSets
on demand

> **Contributor:** This is a good goal, one we thought to solve with ingress claims: #30151. We haven't fleshed out the model yet.
>
> **Contributor:** This is a pretty important feature. Is anyone actively working on it? We've already started working on some of it; @mqliang will update the proposal in a few days.

## Non-Goals

* Ingress ReplicaSet rolling update

## Ingress HA
(AKA: Ingress Virtual IP using keepalived)

> **Contributor:** I think we need to separate the keepalived details from the API. We need a way to get a VIP, public or private. That may be keepalived, or the iptables proxy, or something new.
>
> **Contributor Author:** I agree


#### High level design
* use keepalived to provide HA
* the cluster admin chooses a group of nodes which can be accessed from outside
the cluster and are in the same L2 broadcast domain to run Ingress Pods
* deploy Ingress Pods using a ReplicaSet (at least 2 replicas for HA)
* use the AntiAffinity feature so that Ingress Pods created by the same Ingress
ReplicaSet are scheduled to different nodes
* the cluster admin chooses a CIDR for Ingress VIPs (AKA IngressVIPCIDR)
* each Ingress ReplicaSet will be allocated a VIP from the IngressVIPCIDR (allocated
by the cluster admin or the API server); see the sketch after this list

> **Contributor:** In the claims proposal, each ingress claim would get a vip.
> "Ingress ReplicaSet" is a term which doesn't make sense to me; today an Ingress points to Services which may point to ReplicaSets.
>
> **Contributor (@ddysher, Oct 12, 2016):** I believe by "Ingress ReplicaSet", @mqliang means something that actually holds the VIP, kind of like the ingress controller in the current setup. However, by "each ingress claim would get a vip", you seem to suggest that the lifecycle of the VIP is bound to the claim? I'm trying to figure out the difference and whether we want to apply the PV/PVC model to ingress claims.
>
> If we do want to use the PV/PVC model, then there needs to be such an "Ingress ReplicaSet" which actually holds the VIP, rather than the claim holding the VIP. Then, users can create ingress claims to claim a VIP. It is the VIP, not the claim, that has attributes on it, like QoS.
>
> If not, then the only new type we need is the ingress claim; in that case, how do we add more ingress resources to an existing claim? For example, if a user creates an ingress claim for ".foo.com", a VIP is allocated for it, but not yet useful; next, the user can create ingress resources to consume the DNS and the VIP. Now some more ingress resources are needed for DNS ".bar.com" and the user wants to use the same VIP; how do they achieve this? Do they have to edit their claim?

* Ingress Pods use the host network
* Ingress Pods created by the same Ingress ReplicaSet will run keepalived; only
one Ingress Pod will hold the VIP
* users outside the cluster access in-cluster services via the Ingress VIP
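
As an illustration of the VIP allocation step above, here is a minimal sketch of
handing out addresses from an IngressVIPCIDR; the allocator type and where it
would live (an admin tool or the API server) are assumptions for illustration
only, not part of this proposal's API.

```
// A minimal sketch of allocating Ingress VIPs from the admin-chosen
// IngressVIPCIDR. The allocator below is hypothetical; in practice this
// bookkeeping could live in the API server or be done by the cluster admin.
package main

import (
	"fmt"
	"net"
)

type vipAllocator struct {
	cidr      *net.IPNet
	allocated map[string]bool // VIPs already handed out, keyed by string form
}

func newVIPAllocator(cidr string) (*vipAllocator, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	return &vipAllocator{cidr: ipnet, allocated: map[string]bool{}}, nil
}

// allocate returns the next free VIP in the CIDR, skipping the network address.
func (a *vipAllocator) allocate() (net.IP, error) {
	ip := a.cidr.IP.Mask(a.cidr.Mask)
	for next(ip); a.cidr.Contains(ip); next(ip) {
		if !a.allocated[ip.String()] {
			a.allocated[ip.String()] = true
			return append(net.IP(nil), ip...), nil
		}
	}
	return nil, fmt.Errorf("no free VIP left in %s", a.cidr)
}

// next increments an IP address in place.
func next(ip net.IP) {
	for i := len(ip) - 1; i >= 0; i-- {
		ip[i]++
		if ip[i] != 0 {
			return
		}
	}
}

func main() {
	alloc, err := newVIPAllocator("10.10.0.0/28")
	if err != nil {
		panic(err)
	}
	vip, _ := alloc.allocate() // e.g. assign this VIP to one Ingress ReplicaSet
	fmt.Println("allocated VIP:", vip)
}
```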

#### Why VIP instead of round-robin DNS
A question that pops up every now and then is why we do all this stuff
with virtual IPs rather than just use standard round-robin DNS.
There are a few reasons:

* There is a long history of DNS libraries not respecting DNS TTLs and
caching the results of name lookups.
* Many apps do DNS lookups once and cache the results.
* Even if apps and libraries did proper re-resolution, the load of every
client re-resolving DNS over and over would be difficult to manage.

#### Challenge
* The VIP is bound to the Ingress ReplicaSet; how do we expose it to the Ingress
Pods? One approach is using a ConfigMap, but then the cluster admin needs to
allocate the VIP and write it to the ConfigMap, which makes automatic deployment
harder.
* All Ingress Pods created by the same Ingress ReplicaSet need to know each
other's RIPs (real IPs); see the sketch below.
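
The sketch below illustrates one way these two pieces of information could be
consumed once they are exposed to the Pod; the environment variables and the way
they are populated (a ConfigMap for the VIP, some peer discovery for the RIPs)
are assumptions for illustration, not part of this proposal.

```
// A hypothetical sketch of how an Ingress Pod could consume the VIP and the
// peer RIPs once they are exposed to it (e.g. via a ConfigMap or env vars):
// render a keepalived config for the VRRP instance that holds the VIP.
package main

import (
	"os"
	"strings"
	"text/template"
)

const keepalivedTmpl = `vrrp_instance ingress_vip {
  state BACKUP
  interface {{ .Interface }}
  virtual_router_id 50
  priority 100
  unicast_src_ip {{ .SelfRIP }}
  unicast_peer {
{{- range .PeerRIPs }}
    {{ . }}
{{- end }}
  }
  virtual_ipaddress {
    {{ .VIP }}
  }
}
`

type keepalivedConfig struct {
	Interface string
	SelfRIP   string
	PeerRIPs  []string
	VIP       string
}

func main() {
	// The environment variables below are assumptions: they would be populated
	// from a ConfigMap (VIP) and from peer discovery / the pod's status (RIPs).
	cfg := keepalivedConfig{
		Interface: "eth0",
		SelfRIP:   os.Getenv("POD_IP"),
		PeerRIPs:  strings.Split(os.Getenv("PEER_RIPS"), ","), // e.g. "10.0.0.2,10.0.0.3"
		VIP:       os.Getenv("INGRESS_VIP"),                   // allocated from the IngressVIPCIDR
	}
	tmpl := template.Must(template.New("keepalived").Parse(keepalivedTmpl))
	if err := tmpl.Execute(os.Stdout, cfg); err != nil {
		panic(err)
	}
}
```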

## Ingress Scheduling

#### High level design

* Ingress ReplicaSets are created by the cluster admin in advance
* If all Ingress Pods are saturated, it's the cluster admin's duty to create
more Ingress ReplicaSets
* There is an Ingress Scheduler which schedules Ingress Resources to Ingress
ReplicaSets

> **Contributor:** IMO this might not be necessary. Users pick an Ingress claim based on QoS needs. The Ingress claim has a vip. The vip is backed by pods. If the pods are saturated, the admin or an autoscaler needs to scale them, but the vip doesn't change.
>
> **Contributor Author (@mqliang, Oct 12, 2016):**
>
> > The Ingress claim has a vip
>
> It seems more reasonable that:
>
> * an Ingress Service has a VIP, external or internal (if the user wants in-cluster L7 load balancing), and the Ingress Service is backed by several Ingress Pods (maybe created by a ReplicaSet)
> * the user can specify an Ingress Service for an Ingress Resource
> * if the user doesn't know which Ingress Service to specify, he can use an IngressClaim; then an ingress-claim-controller will iterate through all Ingress Services, find the best matching one, and bind that Ingress Service to the IngressClaim.
>
> **Contributor Author (@mqliang, Oct 12, 2016):**
>
> > the admin or an autoscaler needs to scale them
>
> scale up or scale out?

* An Ingress Pod will only list&watch the Ingress Resources which are scheduled
to its Ingress ReplicaSet
* The Ingress Scheduler makes the scheduling decision based on label/selector,
the number of Ingress Resources already bound, and some metrics (for example,
the mem/cpu/bandwidth load of the Ingress Pods); see the sketch after this list
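
A rough sketch of the kind of filtering and scoring the Ingress Scheduler could
perform is shown below; the types, fields, and weights are illustrative
assumptions, not part of the proposed API.

```
// A rough sketch of how the Ingress Scheduler could pick an Ingress ReplicaSet
// for an Ingress Resource: filter by label selector first, then score the
// remaining candidates by how many Ingress Resources they already serve and by
// their reported load. Types, fields, and weights here are illustrative only.
package main

import "fmt"

type ingressReplicaSet struct {
	Name           string
	Labels         map[string]string
	BoundIngresses int     // number of Ingress Resources already bound
	CPULoad        float64 // 0.0 - 1.0, from monitoring metrics
	BandwidthLoad  float64 // 0.0 - 1.0, from monitoring metrics
}

// matches reports whether the candidate carries every label the Ingress asked for.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// score prefers candidates with few bound Ingresses and low load; higher is better.
func score(rs ingressReplicaSet) float64 {
	return -float64(rs.BoundIngresses) - 10*rs.CPULoad - 10*rs.BandwidthLoad
}

// schedule returns the best matching Ingress ReplicaSet, or "" if none matches.
func schedule(selector map[string]string, candidates []ingressReplicaSet) string {
	best, bestScore := "", 0.0
	for _, rs := range candidates {
		if !matches(selector, rs.Labels) {
			continue
		}
		if s := score(rs); best == "" || s > bestScore {
			best, bestScore = rs.Name, s
		}
	}
	return best
}

func main() {
	candidates := []ingressReplicaSet{
		{Name: "ingress-rs-a", Labels: map[string]string{"tier": "edge"}, BoundIngresses: 12, CPULoad: 0.7},
		{Name: "ingress-rs-b", Labels: map[string]string{"tier": "edge"}, BoundIngresses: 3, CPULoad: 0.2},
	}
	fmt.Println(schedule(map[string]string{"tier": "edge"}, candidates)) // ingress-rs-b
}
```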

#### Implementation
* add IngressReplicasetName and Selector fields to IngressSpec
(added as annotations during incubation)

```
type IngressSpec struct {
	// ... existing IngressSpec fields ...

	// Selector selects the Ingress ReplicaSets this Ingress Resource may be
	// scheduled to.
	Selector labels.Selector
	// IngressReplicasetName is set once the Ingress Resource is bound to an
	// Ingress ReplicaSet.
	IngressReplicasetName string
}
```

* The Ingress Scheduler will bind an Ingress Resource to an Ingress ReplicaSet
* An Ingress Pod will only list&watch the Ingress Resources which are scheduled
to its Ingress ReplicaSet (see the sketch below)
* Implement the Ingress Scheduler so that it respects Selector as a first step
* In the long run, the Ingress Scheduler will make scheduling decisions based on
monitoring metrics (e.g. mem/cpu/bandwidth load).
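
A minimal sketch of that per-Pod filtering follows, assuming the binding is
recorded as an annotation during incubation; the annotation key below is
hypothetical.

```
// A minimal sketch of how an Ingress Pod could filter the Ingress Resources it
// receives from its list&watch down to the ones bound to its own Ingress
// ReplicaSet. The annotation key below is hypothetical (the binding could also
// live in the IngressReplicasetName field once it graduates from an annotation).
package main

import "fmt"

// boundReplicaSetAnnotation is a hypothetical annotation key used during incubation.
const boundReplicaSetAnnotation = "ingress.alpha.kubernetes.io/ingress-replicaset"

type ingressResource struct {
	Name        string
	Annotations map[string]string
}

// boundToMe reports whether the Ingress Resource was scheduled to this Pod's ReplicaSet.
func boundToMe(ing ingressResource, myReplicaSet string) bool {
	return ing.Annotations[boundReplicaSetAnnotation] == myReplicaSet
}

func main() {
	myReplicaSet := "ingress-rs-b"
	// In a real controller these objects would arrive via a list&watch on
	// Ingress Resources; here they are just literals.
	incoming := []ingressResource{
		{Name: "web", Annotations: map[string]string{boundReplicaSetAnnotation: "ingress-rs-a"}},
		{Name: "api", Annotations: map[string]string{boundReplicaSetAnnotation: "ingress-rs-b"}},
	}
	for _, ing := range incoming {
		if boundToMe(ing, myReplicaSet) {
			fmt.Println("serving ingress:", ing.Name) // only "api" is served
		}
	}
}
```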

#### Challenge
* An Ingress Resource is bound to an Ingress ReplicaSet; how do we expose the
binding to the Ingress Pods?

#### TBD
* Should the scheduler bind an Ingress Resource to only one Ingress ReplicaSet,
or to multiple Ingress ReplicaSets?

## Ingress Provisioning

#### High level design
* Ingress ReplicaSets can be dynamically provisioned on demand, instead of
being created by the cluster admin in advance

> **Contributor:** s/deman/demand

* If a user wants an Ingress ReplicaSet to serve his Ingress Resources, he can
create an IngressClaim resource:

```
type IngressClaim struct {
	unversioned.TypeMeta `json:",inline"`
	ObjectMeta           `json:"metadata,omitempty"`

	Spec   IngressClaimSpec
	Status IngressClaimStatus
}

type IngressClaimSpec struct {
	// Ingresses references the Ingress Resources that the auto-provisioned
	// Ingress ReplicaSet will serve.
	Ingresses []LocalObjectReference
	// IngressReplicaSetSpec describes the Ingress ReplicaSet to provision.
	IngressReplicaSetSpec ReplicaSetSpec
}

type IngressClaimStatus struct {
	// IngressReplicaSetName is the name of the auto-provisioned Ingress ReplicaSet.
	IngressReplicaSetName string
}
```

* No Ingress scheduling process is involved; the Ingress Resources in the
IngressClaim are directly bound to the auto-provisioned Ingress ReplicaSet.
* If all Ingress Resources in the IngressClaim are deleted, the IngressClaim will
be retained/recycled/deleted based on a policy specified by the user
* Add an IngressClaimController to the ControllerManager to sync IngressClaim
resources; it works in a way similar to the PersistentVolumeClaimController:
auto provision an Ingress ReplicaSet based on the IngressClaim, and
retain/recycle/delete the IngressClaim when its referenced Ingress Resources
are deleted (see the sketch below).
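
The following is a rough, self-contained sketch of the sync logic such an
IngressClaimController might run; the simplified types, names, and the
reclaim-policy handling are assumptions for illustration, not the final API.

```
// A rough, self-contained sketch of the IngressClaimController sync logic
// described above. The types are simplified stand-ins for the IngressClaim API
// and the provisioning/reclaim steps are hypothetical.
package main

import "fmt"

type reclaimPolicy string

const (
	retain  reclaimPolicy = "Retain"
	recycle reclaimPolicy = "Recycle"
	remove  reclaimPolicy = "Delete"
)

type ingressClaim struct {
	Name                  string
	Ingresses             []string      // names of referenced Ingress Resources
	Policy                reclaimPolicy // what to do when all Ingresses are gone
	IngressReplicaSetName string        // filled in once provisioning succeeds
}

// syncClaim provisions an Ingress ReplicaSet for a new claim, and applies the
// reclaim policy once all referenced Ingress Resources have been deleted.
func syncClaim(claim *ingressClaim) {
	if claim.IngressReplicaSetName == "" {
		// Auto provision: a real controller would create a ReplicaSet from the
		// claim's IngressReplicaSetSpec via the API server.
		claim.IngressReplicaSetName = "ingress-rs-for-" + claim.Name
		fmt.Println("provisioned", claim.IngressReplicaSetName)
		return
	}
	if len(claim.Ingresses) == 0 {
		switch claim.Policy {
		case retain:
			fmt.Println("retaining", claim.IngressReplicaSetName)
		case recycle:
			fmt.Println("recycling", claim.IngressReplicaSetName, "for reuse by other claims")
		case remove:
			fmt.Println("deleting", claim.IngressReplicaSetName)
		}
	}
}

func main() {
	claim := &ingressClaim{Name: "web-claim", Ingresses: []string{"web"}, Policy: remove}
	syncClaim(claim) // first pass: provision the Ingress ReplicaSet

	claim.Ingresses = nil // the referenced Ingress Resource was deleted
	syncClaim(claim)      // second pass: apply the reclaim policy
}
```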


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->