
KEP-1880: Multiple Service CIDRs

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Allow users to dynamically expand the number of IPs available for Services.

Motivation

Services are an abstract way to expose an application running on a set of Pods. Some types of Services (ClusterIP, NodePort and LoadBalancer) use a cluster-scoped virtual IP address, the ClusterIP. Every Service ClusterIP must be unique across the whole cluster. Trying to create a Service with a specific ClusterIP that has already been allocated will return an error.

The current implementation of the Service IP allocation logic has several limitations:

  • users can not resize or increase the ranges of IPs assigned to Services, causing problems when there are overlapping networks or the cluster runs out of available IPs.
  • the Service IP range is not exposed, so it is not possible for other components in the cluster to consume it.
  • the configuration is per apiserver; there is no consensus, and different apiservers can fight over and delete each other's IPs.
  • the allocation logic is racy, with the possibility of leaking ClusterIPs kubernetes/kubernetes#87603
  • the size of the Service Cluster CIDR is limited: for IPv4 the prefix can be no larger than /12, and for IPv6 no larger than /112. This restriction causes issues for IPv6 users, since /64 is the standard and minimum recommended prefix length.
  • it is only possible to reserve a range of Service IP addresses by using the feature gate from keps/sig-network/3070-reserved-service-ip-range kubernetes/kubernetes#95570

Goals

Implement a new allocation logic for Service IPs that:

  • scales well with the number of reservations
  • allows the number of available allocations to be tuned
  • is completely backwards compatible

Non-Goals

  • Any change unrelated to Services
  • Any generalization of the API model that could evolve into something different, like an IPAM API, or collide with existing APIs like Ingress and Gateway API.
  • NodePorts use the same allocation model as ClusterIPs, but they are totally out of scope for this KEP. However, this KEP can serve as proof that a similar model would work for NodePorts too.
  • Changing the default IPFamily used for Service IP allocation; the defaulting depends on services.spec.IPFamily and services.spec.IPFamilyPolicy, and a simple webhook or an admission plugin can set these fields to the desired default, so the allocation logic doesn't have to handle it.
  • Removing the apiserver flags that define the service IP CIDRs, though that may be possible in the future.
  • Any admin or cluster-wide process related to Services, like automating the default Service CIDR range; however, this KEP will implement the behaviors and primitives necessary to perform those kinds of operations automatically.

Proposal

The proposal is to implement a new allocation logic that uses 2 new API objects, ServiceCIDR and IPAddress, and allows users to dynamically increase the number of available Service IPs by creating new ServiceCIDRs.

The new allocator will be able to "automagically" consume IPs from any ServiceCIDR available; we can think of this model as analogous to adding more disks to a storage system to increase its capacity.

To simplify the model, keep it backwards compatible, and avoid that it evolves into something different that collides with other APIs, like Gateway API, we are adding the following constraints:

  • a ServiceCIDR will be immutable after creation (to be revisited before Beta).
  • a ServiceCIDR can only be deleted if there are no Service IPs associated with it (enforced by a finalizer).
  • there can be overlapping ServiceCIDRs.
  • the apiserver will periodically ensure that a "default" ServiceCIDR exists to cover the service CIDR flags and the "kubernetes.default" Service.
  • any IPAddress existing in a cluster has to belong to a Service CIDR defined by a ServiceCIDR object.
  • any Service with a ClusterIP assigned is expected to always have an associated IPAddress object.
  • a ServiceCIDR which is being deleted can not be used to allocate new IPs

This creates a 1-to-1 relation between Services and IPAddresses, and a 1-to-N relation between ServiceCIDRs and IPAddresses. It is important to clarify that overlapping ServiceCIDRs are merged in memory: an IPAddress doesn't come from a specific ServiceCIDR object, but from "any ServiceCIDR that includes that IP".
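
To make the "merged in memory" semantics concrete, below is a minimal sketch (standard library only, hypothetical helper name) of how a consumer decides whether an IP is covered by the configured ServiceCIDRs; overlap between ranges needs no special handling, only membership in at least one range matters:

package main

import (
	"fmt"
	"net/netip"
)

// coveredByAny reports whether ip falls inside at least one of the
// configured Service CIDRs. Overlapping CIDRs don't need special
// handling: an IPAddress is valid as long as any ServiceCIDR contains it.
func coveredByAny(ip netip.Addr, cidrs []netip.Prefix) bool {
	for _, prefix := range cidrs {
		if prefix.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	cidrs := []netip.Prefix{
		netip.MustParsePrefix("10.0.0.0/24"),
		netip.MustParsePrefix("10.0.0.0/23"), // overlaps the /24, which is allowed
	}
	fmt.Println(coveredByAny(netip.MustParseAddr("10.0.1.10"), cidrs)) // true, covered by the /23
	fmt.Println(coveredByAny(netip.MustParseAddr("10.1.0.1"), cidrs))  // false, not covered
}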

The new allocator logic can be used by other APIs, like Gateway API.

The new well defined behaviors and objects implemented will allow future developments to perform admin and cluster wide operations on the Service ranges.

User Stories

Story 1

As a Kubernetes user I want to be able to dynamically increase the number of IPs available for Services.

Story 2

As a Kubernetes admin I want to have a process that allows me to renumber my Services IPs.

Story 3

As a Kubernetes developer I want to be able to know the current Service IP range and the IP addresses allocated.

Story 4

As a Kubernetes admin that wants to use IPv6, I want to be able to follow the IPv6 address planning best practices, being able to assign /64 prefixes to my end subnets.

Notes/Constraints/Caveats (Optional)

Current API machinery doesn't support transactions. The Services API is special in this regard since it already performs allocations inside the registry pipeline, using the storage to keep consistency on an opaque API object that stores the bitmap with the allocations. This proposal maintains this behavior, but instead of modifying the shared bitmap, it will create a new IPAddress object.

Changing service IP allocation to be async with regards to the service creation would be a MAJOR semantic change and would almost certainly break clients.

Risks and Mitigations

Cross-validation of objects is not common in the Kubernetes API; SIG API Machinery should verify that a bad precedent is not introduced.

Current Kubernetes clusters have a very static network configuration. Allowing the Service IP ranges to be expanded gives users more flexibility, with the risk of problematic or inconsistent network configurations with overlaps. This is not really a new problem, though: users need to do proper network planning before adding new ranges to the Service IP pool.

Service implementations, like kube-proxy, can be impacted if they were making assumptions about the ranges assigned to Services. Those implementations should implement logic to watch the configured Service CIDRs.

Kubernetes DNS implementations, like CoreDNS, need to know the Service CIDRs for PTR lookups (and to know which PTR lookups they are NOT authoritative for). Those implementations may be impacted, but they can also benefit from the new API to automatically configure the Service CIDRs.

Design Details

Current implementation details

Kubernetes Services need to be able to dynamically allocate resources from a predefined range/pool for ClusterIPs.

ClusterIP is the IP address of the service and is usually assigned randomly. If an address is specified manually, is in-range, and is not in use, it will be allocated to the service; otherwise creation of the service will fail. The Service ClusterIP range is defined in the kube-apiserver with the following flags:

--service-cluster-ip-range string
A CIDR notation IP range from which to assign service cluster IPs. This must not overlap with any IP ranges assigned to nodes or pods. Max of two dual-stack CIDRs is allowed.

And in the controller-manager (this is required for the node-ipam controller to avoid overlapping with the podCIDR ranges):

--service-cluster-ip-range string
CIDR Range for Services in cluster. Requires --allocate-node-cidrs to be true

The allocator is implemented using a bitmap that is serialized and stored in an "opaque" API object.

To avoid leaking resources, the apiservers run a repair loop that keeps the bitmaps in sync with the IPs currently allocated to Services.

The order of the IP families defined in the service-cluster-ip-range flag defines the primary IP family used by the allocator. This default IP family is used in cases where a Service creation doesn't provide the necessary information, defaulting the Service to Single Stack with an IP chosen from the default IP family defined.
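
As an illustration of that defaulting rule (a sketch with a hypothetical helper, not the actual kube-apiserver code), the primary IP family is simply the family of the first CIDR listed in the flag:

package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// primaryIPFamily returns the IP family of the first CIDR in a
// comma-separated --service-cluster-ip-range value; that family is the
// default used for Services that don't specify one.
func primaryIPFamily(flagValue string) (string, error) {
	first := strings.TrimSpace(strings.Split(flagValue, ",")[0])
	prefix, err := netip.ParsePrefix(first)
	if err != nil {
		return "", err
	}
	if prefix.Addr().Is4() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

func main() {
	family, _ := primaryIPFamily("fd00:10:96::/112,10.96.0.0/16")
	fmt.Println(family) // IPv6, because the IPv6 CIDR is listed first
}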

The current implementation doesn't guarantee consistency for Service IP ranges and default IP families across multiple apiservers, see kubernetes/kubernetes#114743.

New allocation model

The new allocation mode requires:

  • 2 new API objects, ServiceCIDR and IPAddress, in the group networking.k8s.io, see https://groups.google.com/g/kubernetes-sig-api-machinery/c/S0KuN_PJYXY/m/573BLOo4EAAJ. The ServiceCIDR will be protected with a finalizer; the IPAddress object doesn't need a finalizer because the apiserver always releases and deletes the IPAddress after the Service has been deleted.
  • 1 new allocator implementing the current allocator.Interface, that runs in each apiserver and uses the new ServiceCIDR objects to allocate IPs for Services.
  • 1 new repair loop that runs in the apiserver and reconciles Services with IPAddresses: repairing Services, garbage collecting orphan IPAddresses and handling the upgrade from the old allocators.
  • 1 new controller that handles the bootstrap process and the ServiceCIDR objects. This controller participates in the ServiceCIDR deletion, guaranteeing that existing IPAddresses always have an associated ServiceCIDR.

The kube-apiserver bootstrap process and the service-cidr flags

Currently, the Service CIDRs are configured independently in each kube-apiserver using flags. During the bootstrap process, the apiserver uses the first IP of each range to create the special "kubernetes.default" Service. It also starts a reconcile loop that synchronizes the state of the bitmap used by the internal allocators with the IPs assigned to Services. This "kubernetes.default" Service is never updated: the first apiserver wins and assigns the ClusterIP from its configured service-ip-range, and other apiservers with different ranges will not try to change the IP. If the apiserver that created the Service no longer works, the admin has to delete the Service so other apiservers can create it with their own ranges.

With the current implementation, each kube-apiserver can boot with different ranges configured without errors, but the cluster will not work correctly, see kubernetes/kubernetes#114743. There is no conflict resolution; each apiserver keeps overwriting and deleting the other apiservers' allocator bitmaps and Services.

In order to be completely backwards compatible, the bootstrap process will remain the same; the difference is that, instead of creating a bitmap based on the flags, it will create a new ServiceCIDR object from the flags (removing the flags is out of scope of this KEP) with the special well-known name kubernetes.

The new bootstrap process will be:

at startup:
 read_flags
 if invalid flags
  exit
 run default-service-ip-range controller
 run kubernetes.default service loop (it uses the first ip from the subnet defined in the flags)
 run service-repair loop (reconcile services, ipaddresses)
 run apiserver

controller:
  if ServiceCIDR `kubernetes` does not exist
    create it and create the kubernetes.default service (avoid races)
  else
    keep watching to handle finalizers and recreate if needed

All the apiservers will be synchronized on the ServiceCIDR and default Service created by the first one to win. Changes to the configuration imply manual removal of the ServiceCIDR and default Service; the rest of the apiservers will then race, and the winner will set the configuration of the cluster.

This behavior aligns with the current behavior of kubernetes.default, which makes it consistent and easier to reason about, allowing future developments to use it to implement more complex operations at the cluster-admin level.

The special "default" ServiceCIDR

The kubernetes.default Service is expected to be covered by a valid range. Each apiserver ensures that a ServiceCIDR object exists to cover its own flag-defined ranges. If someone were to force-delete the ServiceCIDR covering kubernetes.default, it would be treated the same as before: any apiserver will try to recreate the Service from its flag-defined default Service CIDR range.

This well-known and established behavior allows administrators to replace the kubernetes.default Service with a series of operations, for example:

  1. Initial state: 2 kube-apiservers with the default ServiceCIDR 10.0.0.0/24.
  2. The apiservers create the kubernetes.default Service with ClusterIP 10.0.0.1.
  3. Upgrade the kube-apiservers, replacing the service-cidr flag value with 192.168.0.0/24.
  4. Delete the ServiceCIDR objects and the kubernetes.default Service.
  5. The kube-apiservers will recreate kubernetes.default with IP 192.168.0.1.

Note this can also be used to switch the IP family of the cluster.

Service IP Allocation

When a new Service is going to be created and already has its ClusterIPs defined, the allocator will just try to create the corresponding IPAddress objects; any error creating an IPAddress object will cause an error on the Service creation. The allocator will also set the reference to the Service on the IPAddress object.

The new allocator will have a local list of all the Service CIDRs and the IPs already allocated, so it only has to find one free IP in any of these ranges and try to create the object. There can be races with 2 allocators trying to allocate the same IP, but the storage guarantees consistency: the first one wins, while the other has to retry.

Another racy situation is when the allocator is full and an IPAddress is deleted, but the allocator takes some time to receive the notification. One solution could be to perform a List to obtain the current state, but it is simpler to just fail the Service creation and ask the user to try again.
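
A minimal sketch of this allocation flow (hypothetical names, no real client-go wiring): pick a candidate from the local view of free IPs, try to create the IPAddress object, and move on to the next candidate if another apiserver won the race:

package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// errAlreadyExists stands in for the apiserver's AlreadyExists error;
// uniqueness of the IPAddress name in storage is what resolves races.
var errAlreadyExists = errors.New("ipaddress already exists")

// createIPAddress is a hypothetical stand-in for the POST of an
// IPAddress object named after the canonical form of the IP.
type createIPAddress func(ip netip.Addr) error

// allocateNext walks the locally cached free candidates and returns the
// first IP whose IPAddress object it manages to create. Another
// apiserver may win the race for a given IP, in which case we simply
// move on to the next candidate.
func allocateNext(candidates []netip.Addr, create createIPAddress) (netip.Addr, error) {
	for _, ip := range candidates {
		err := create(ip)
		if err == nil {
			return ip, nil
		}
		if errors.Is(err, errAlreadyExists) {
			continue // lost the race, try the next free IP
		}
		return netip.Addr{}, err
	}
	return netip.Addr{}, errors.New("range is full")
}

func main() {
	taken := map[netip.Addr]bool{netip.MustParseAddr("10.0.0.10"): true}
	create := func(ip netip.Addr) error {
		if taken[ip] {
			return errAlreadyExists
		}
		taken[ip] = true
		return nil
	}
	ip, err := allocateNext([]netip.Addr{
		netip.MustParseAddr("10.0.0.10"),
		netip.MustParseAddr("10.0.0.11"),
	}, create)
	fmt.Println(ip, err) // 10.0.0.11 <nil>
}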

If the apiserver crashes during a Service create operation after the IPAddress has been allocated but before the Service is created, the IPAddress will be orphaned. To avoid leaks, a controller will use the metadata.creationTimestamp field to detect orphaned objects and delete them.

There is going to be a controller to avoid leaking resources (a minimal sketch follows this list):

  • checking that the parentReference on each IPAddress matches the corresponding Service
  • cleaning up any IPAddress without a valid reference if the time since it was created is greater than 60 seconds (the default timeout value on the kube-apiserver)
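
Below is a minimal sketch of that orphan check (a trimmed-down, hypothetical view of the IPAddress object; the 60-second grace period avoids collecting allocations for Service creations that are still in flight):

package main

import (
	"fmt"
	"time"
)

// ipAddressView is a hypothetical, trimmed-down view of an IPAddress
// object: just the fields the repair loop needs.
type ipAddressView struct {
	creationTimestamp time.Time
	hasValidParent    bool // parentReference set and matching an existing Service
}

// isOrphan reports whether the IPAddress should be garbage collected:
// it has no valid parent and it is older than 60 seconds (the default
// kube-apiserver request timeout), so it can not belong to a Service
// creation that is still in progress.
func isOrphan(ip ipAddressView, now time.Time) bool {
	if ip.hasValidParent {
		return false
	}
	return now.Sub(ip.creationTimestamp) > 60*time.Second
}

func main() {
	now := time.Now()
	fresh := ipAddressView{creationTimestamp: now.Add(-5 * time.Second)}
	stale := ipAddressView{creationTimestamp: now.Add(-5 * time.Minute)}
	fmt.Println(isOrphan(fresh, now)) // false: may belong to an in-flight Service creation
	fmt.Println(isOrphan(stale, now)) // true: safe to delete
}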

Service IP Reservation

In keps/sig-network/3070-reserved-service-ip-range a new feature was introduced that allows prioritizing one part of the Service CIDR for dynamic allocation; the range size for dynamic allocation depends on the size of the CIDR.

The new allocation logic has to be compatible with the current implementation.
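
For context, that KEP splits each range into a lower band preferred for static allocation and an upper band preferred for dynamic allocation, with the lower-band size derived from the range size, roughly min(max(16, size/16), 256). A sketch of that derivation follows (an approximation for illustration, not the authoritative implementation):

package main

import "fmt"

// reservedOffset approximates the size of the lower band that KEP-3070
// prefers for static allocation; dynamic allocation is preferred after
// this offset. Approximate formula: min(max(16, size/16), 256).
func reservedOffset(rangeSize int64) int64 {
	offset := rangeSize / 16
	if offset < 16 {
		offset = 16
	}
	if offset > 256 {
		offset = 256
	}
	return offset
}

func main() {
	fmt.Println(reservedOffset(256))   // 16  (e.g. a /24)
	fmt.Println(reservedOffset(65536)) // 256 (e.g. a /16, capped)
}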

Edge cases

Since we have to maintain the relationships between three objects (Services, ServiceCIDRs and IPAddresses), we must be able to handle the following edge cases and race conditions.

  • Valid Service and IPAddress without ServiceCIDR:

This situation can happen if a user forcefully deletes a ServiceCIDR. It can be recreated for the "default" ServiceCIDR because the information is in the apiserver flags, but for other ServiceCIDRs that information is no longer available.

Another possible situation is when a ServiceCIDR has been deleted but the information takes too long to reach one of the apiservers: its allocator will still consider the range valid and may allocate an IP from it. To mitigate this problem, the servicecidrconfig controller will wait for a grace period of 60 seconds before removing the finalizer; if an IP address is created during this time we'll be able to block the deletion and inform the user.

For any Service and IPAddress that doesn't belong to a ServiceCIDR, the controller will raise an event informing the user, keeping the previous behavior:

// cluster IP is out of range
c.recorder.Eventf(&svc, nil, v1.EventTypeWarning, "ClusterIPOutOfRange", "ClusterIPAllocation", "Cluster IP [%v]:%s is not within the service CIDR %s; please recreate service",
  • Valid Service and ServiceCIDR but not IPAddress

It can happen that a user forcefully deletes an IPAddress; in this case, the controller will regenerate the IPAddress, as long as a valid ServiceCIDR covers it.

During this time, there is a chance that an apiserver tries to allocate this IPAddress, which could result in 2 Services having the same IPAddress. To avoid this, the allocator will not delete an IP from its local cache until it verifies that the consumer associated with that IP has been deleted too.

  • Valid IPAddress and ServiceCIDR but no Service

The IPAddress will be deleted and an event generated if the controller determines that the IPAddress is orphaned (see the Allocator section).

  • IPAddress referencing recreated Object (different UID)
  1. User created Gateway "foo"
  2. Gateway controller allocated IP and ref -> "foo"
  3. User deleted gateway "foo"
  4. Gateway controller doesn't delete the IP (leave it for GC)
  5. User creates a new Gateway "foo"
  6. apiserver repair loop finds the IP with a valid ref to "foo"

If the new Gateway is created before the apiserver observes the delete, the apiserver will find that Gateway "foo" still exists and can't release the IP. It can't peek inside "foo" to see if that is the right IP, because it is a type it does not know. If it knew the UID, it could see that the "foo" UID was different and release the IP. The IPAddress will therefore include the UID in its parent reference to avoid problems in this scenario.
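
A sketch of that UID check (hypothetical helper name): an IPAddress whose parent reference carries a UID can be released when the live object with the same name has a different UID, or when no such object exists at all:

package main

import "fmt"

// canRelease reports whether an IPAddress can be released when its
// parent is looked up by name: either the parent no longer exists, or
// it exists with a different UID, meaning it was deleted and recreated
// and the IP belongs to the old instance.
func canRelease(refUID string, liveUID string, liveExists bool) bool {
	if !liveExists {
		return true
	}
	return refUID != liveUID
}

func main() {
	// Gateway "foo" was deleted and recreated with a new UID; the IP
	// still referencing the old UID can be safely released.
	fmt.Println(canRelease("uid-old", "uid-new", true)) // true
	fmt.Println(canRelease("uid-old", "uid-old", true)) // false: same object, keep the IP
}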

Resizing Service IP Ranges

One of the most common problems users may have is how to scale up or scale down their Service CIDRs.

Let's see an example of how these operations will be performed.

Assume we have configured the Service CIDR 10.0.0.0/24 and it is fully used; we can either:

  1. Add another /24 Service CIDR 10.0.1.0/24 and keep the previous one
  2. Add an overlapping larger Service CIDR 10.0.0.0/23

After option 2, the user can remove the first /24 Service CIDR, since the new Service CIDR covers all the existing IPAddresses.

The same applies for scaling down: as long as the new Service CIDR contains all the existing IPAddresses, the old Service CIDR can be safely deleted.

However, there can be a race condition when scaling down: since the propagation of the deletion can take some time, one allocator can successfully allocate an IP address outside of the new Service CIDR (but still inside the old Service CIDR).

A controller will periodically check that the 1-to-1 relation between IPAddresses and Services is correct, and will send events to warn the user that they have to fix/recreate the corresponding Service, keeping the same behavior that exists today.
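
The safety condition behind scaling down can be sketched as follows (standard library only, hypothetical helper name): the old ServiceCIDR may be released only when every allocated IP it covers is also covered by a remaining ServiceCIDR that is not itself being deleted:

package main

import (
	"fmt"
	"net/netip"
)

// canRemoveFinalizer reports whether it is safe to let a ServiceCIDR go:
// every allocated IP covered by the range being deleted must still be
// covered by some remaining ServiceCIDR that is not itself being deleted.
func canRemoveFinalizer(deleted netip.Prefix, remaining []netip.Prefix, allocated []netip.Addr) bool {
	for _, ip := range allocated {
		if !deleted.Contains(ip) {
			continue
		}
		covered := false
		for _, p := range remaining {
			if p.Contains(ip) {
				covered = true
				break
			}
		}
		if !covered {
			return false // this IP would be left uncovered, block the deletion
		}
	}
	return true
}

func main() {
	old := netip.MustParsePrefix("10.0.0.0/24")
	wider := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/23")}
	ips := []netip.Addr{netip.MustParseAddr("10.0.0.100")}
	fmt.Println(canRemoveFinalizer(old, wider, ips)) // true: the /23 still covers the IP
	fmt.Println(canRemoveFinalizer(old, nil, ips))   // false: the IP would be left orphaned
}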

API

// ServiceCIDR defines a range of IPs using CIDR format (192.168.0.0/24 or 2001:db2::0/64).
type ServiceCIDR struct {
 metav1.TypeMeta   `json:",inline"`
 metav1.ObjectMeta `json:"metadata,omitempty"`

 Spec   ServiceCIDRSpec   `json:"spec,omitempty"`
}


// ServiceCIDRSpec describes the ServiceCIDR's specification.
type ServiceCIDRSpec struct {
 // IPv4 is an IPv4 block in CIDR notation "10.0.0.0/8" 
 IPv4 string `json:"ipv4"`
 // IPv6 is an IPv6 block in CIDR notation "fd12:3456:789a:1::/64" 
 IPv6 string `json:"ipv6"`
}

// IPAddress represents an IP used by Kubernetes associated to an ServiceCIDR.
// The name of the object is the IP address in canonical format.
type IPAddress struct {
 metav1.TypeMeta   `json:",inline"`
 metav1.ObjectMeta `json:"metadata,omitempty"`

 Spec   IPAddressSpec   `json:"spec,omitempty"`
}

// IPAddressSpec describes the attributes of an IPAddress.
type IPAddressSpec struct {
  // ParentRef references the resource (usually a Service) that an IPAddress is attached to.
  ParentRef ParentReference
}

type ParentReference struct {
  // Group is the group of the referent.
  Group string
  // Resource is resource of the referent.
  Resource string
  // Namespace is the namespace of the referent
  Namespace string
  // Name is the name of the referent
  Name string
  // UID is the uid of the referent
  UID string
}
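
To make the parent reference concrete, the IPAddress backing the kubernetes.default Service from the bootstrap example would look roughly like this when built with the types above (illustrative values only, placeholder UID, metav1 import assumed):

// IPAddress objects are named after the canonical textual form of the
// IP and point back at their consumer through Spec.ParentRef.
ip := IPAddress{
	ObjectMeta: metav1.ObjectMeta{Name: "10.0.0.1"},
	Spec: IPAddressSpec{
		ParentRef: ParentReference{
			Group:     "",         // empty group means the core API group
			Resource:  "services",
			Namespace: "default",
			Name:      "kubernetes",
			UID:       "0a1b2c3d-0000-0000-0000-000000000000", // placeholder UID of the Service
		},
	},
}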

Allocator

A new allocator implementing the current allocator Interface will be added to the apiserver.

// Interface manages the allocation of IP addresses out of a range. Interface
// should be threadsafe.
type Interface interface {
 Allocate(net.IP) error
 AllocateNext() (net.IP, error)
 Release(net.IP) error
 ForEach(func(net.IP))
 CIDR() net.IPNet
 IPFamily() api.IPFamily
 Has(ip net.IP) bool
 Destroy()
 EnableMetrics()

 // DryRun offers a way to try operations without persisting them.
 DryRun() Interface
}

This allocator will have informers watching Services, ServiceCIDRs and IPAddresses, so it has local access to the information needed to assign new IP addresses to Services.

IPAddresses can only be allocated from ServiceCIDRs that are available and not being deleted.

The uniqueness of an IPAddress is guaranteed by the apiserver, since trying to create an IP address that already exists will fail.

The allocator will set a finalizer on the IPAddresses it creates, to avoid leaving Services without their corresponding allocated IP.

It will also add a reference to the Service with which the IPAddress is associated.

Test Plan

This is a core and critical change; it has to be thoroughly tested at different layers:

API objects must have unit tests for defaulting and validation, and integration tests exercising the different operations and field permutations, with both positive and negative cases. Special attention must be paid to cross-reference validation problems, such as an IPAddress referencing a wrong, invalid or missing ServiceCIDR.

Controllers must have unit tests and integration tests covering all possible race conditions.

E2E test must be added covering the user stories defined in the KEP.

In addition to testing, this feature will require a lot of user feedback, and therefore multiple releases to graduate.

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • cmd/kube-apiserver/app/options/validation_test.go: 06/02/23 - 99.1
  • pkg/apis/networking/validation/validation_test.go: 06/02/23 - 91.7
  • pkg/controlplane/instance_test.go: 06/02/23 - 49.7
  • pkg/printers/internalversion/printers_test.go: 06/02/23 - 49.7
  • pkg/registry/core/service/ipallocator/bitmap_test.go: 06/02/23 - 86.9
  • pkg/registry/core/service/ipallocator/controller/repairip_test.go: 06/02/23 - 0 (new)
  • pkg/registry/core/service/ipallocator/ipallocator_test.go: 06/02/23 - 0 (new)
  • pkg/registry/networking/ipaddress/strategy_test.go: 06/02/23 - 0 (new)
  • staging/src/k8s.io/kubectl/pkg/describe/describe_test.go: 06/02/23 - 49.7
Integration tests

Tests will be added to verify:

  • API servers using the old and new allocators at the same time
  • API server upgrade from the old to the new allocator
  • ServiceCIDR resizing
  • ServiceCIDRs without the IPv6 prefix size limitation

Files:

  • test/integration/controlplane/synthetic_controlplane_test.go
  • test/integration/servicecidr/allocator_test.go
  • test/integration/servicecidr/main_test.go
e2e tests

e2e tests will cover all the user stories defined in the KEP

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Initial unit, integration and e2e tests completed and enabled
  • Only basic functionality e2e tests implemented
  • Define "Service IP Reservation" implementation

Beta

  • API stability, no changes on new types and behaviors:
    • ServiceCIDR immutability
    • default IPFamily
    • two or one IP families per ServiceCIDR
  • Gather feedback from developers and users
  • Document and add more complex testing scenarios: scaling out ServiceCIDRs, ...
  • Additional tests are in Testgrid and linked in KEP
  • Scale test to O(10K) services and O(1K) ranges
  • Improve performance on the allocation logic, O(1) for allocating a free address
  • Allowing time for feedback

GA

  • 2 examples of real-world usage
  • More rigorous forms of testing—e.g., downgrade tests and scalability tests
  • Allowing time for feedback

Note: Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Deprecation

Upgrade / Downgrade / Version Skew Strategy

The source of truth is the set of IPs assigned to the Services. Both the old and new methods have reconcile loops that rebuild the state of the allocators based on the IPs assigned to Services; this allows supporting upgrades and skewed clusters.

Clusters running with the new allocation model will not keep running the reconcile loops that maintain the bitmaps used by the old allocators.

Since the new allocation model will remove some of the limitations of the current model, skewed versions and downgrades can only work if the configurations are fully compatible. For example, current CIDRs are limited to a maximum of /112 for IPv6; if a user configures a /64 for their IPv6 subnets in the new model, and IPs are assigned outside the first /112 block, the old bitmap-based allocator will not be able to use those IPs, creating an inconsistency in the cluster.

It will be required that those Services are recreated to get IP addresses inside the configured ranges, for consistency, but there should not be any functional problem in the cluster if the Service implementation (kube-proxy, ...) is able to handle those IPs.

Example:

  • flags set to 10.0.0.0/20
  • upgrade to N+1 with the alpha gate enabled
  • the apiserver creates a default ServiceCIDR object for 10.0.0.0/20
  • the user creates a new ServiceCIDR for 192.168.1.0/24
  • the user creates a Service which gets 192.168.1.1
  • rollback or disable the gate
  • the apiserver repair loops will generate periodic events informing the user that the Service has an allocated IP that is not within the configured range

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: ServiceCIDR
    • Components depending on the feature gate: kube-apiserver, kube-controller-manager
Does enabling the feature change any default behavior?

The time before an already allocated ClusterIP can be reused for another Service will be longer, since ClusterIPs now depend on an IPAddress object that is protected by a finalizer.

The bootstrap logic has been updated to deal with the problem of multiple apiservers with different configurations, making it more flexible and resilient.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If the feature is disabled, the old allocator logic will reconcile all the currently allocated IPs based on the Services existing in the cluster.

If there are IPAddresses allocated outside of the configured Service IP Ranges in the apiserver flags, the apiserver will generate events referencing the Services using IPs outside of the configured range. The user must delete and recreate these Services to obtain new IPs within the range.

What happens if we reenable the feature if it was previously rolled back?

There are controllers that will reconcile the state based on the currently existing Services, restoring the cluster to a working state.

Are there any tests for feature enablement/disablement?

Tests for feature enablement/disablement will be implemented. TBD later, during the alpha implementation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

See Drawbacks section

Will enabling / using this feature result in any new API calls?

Creating a Service will require creating an IPAddress object. Previously we updated a bitmap in etcd, so the number of requests stays the same, but the size of the objects written is considerably reduced.

Will enabling / using this feature result in introducing new API types?

See API

Will enabling / using this feature result in any new calls to the cloud provider?

N/A

Will enabling / using this feature result in increasing size or count of the existing API objects?

See Drawbacks

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

The apiservers will increase their memory usage because they have to keep local informer caches for the new ServiceCIDR and IPAddress objects.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature depends on the API server to work.

What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

The number of etcd objects required scales as O(N), where N is the number of Services. That seems reasonable (the largest number of Services today is limited by the bitmap size, ~32k). etcd memory use is proportional to the number of keys, but other objects like Pods and Secrets use a much higher number of objects. 1-2 million keys in etcd is a reasonable upper bound in a cluster, and this does not have a big impact (<10%). The objects described here are smaller than most other Kube objects (especially Pods), so the etcd storage size is still reasonable.

xref: Clayton Coleman https://github.com/kubernetes/enhancements/pull/1881/files#r669732012

Alternatives

Several alternatives were proposed in the original PR but discarded for different reasons:

Alternative 1

#1881 (comment)

Alternative 2

#1881 (comment)

Alternative 3

#1881 (comment)

Alternative 4

#1881 (comment)

Infrastructure Needed (Optional)