Start exporting the in-cluster network programming latency metric. #71999

mm4tt · 2018-12-12T17:12:35Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
This is the final step in implementing the first version of the in-cluster network programming latency metric that was proposed here: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md
The latency computation is based on the EndpointsLastChangeTriggerTime annotation, whose implementation can be found in #71998.

Does this PR introduce a user-facing change?:
NONE

k8s-ci-robot · 2018-12-12T17:12:36Z

@MateuszMatejczyk: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2018-12-12T17:12:43Z

Hi @MateuszMatejczyk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

mm4tt · 2018-12-12T17:12:51Z

/assign @wojtek-t

mm4tt · 2018-12-12T17:16:23Z

/assign @wojtek-t

mm4tt · 2018-12-12T17:18:30Z

/uncc @lavalamp
/uncc @freehan

Unassigning @lavalamp and @freehan until @wojtek-t takes a look.

wojtek-t · 2018-12-12T17:38:55Z

I will do a first pass later this week and then add more reviewers from the networking team.

wojtek-t

This looks reasonable to me. However, it doesn't make sense to submit this until #71998 is at least ready for merging.
So holding for now, but assigning someone else to also take a look.

/assign @freehan

/hold
/ok-to-test

wojtek-t · 2018-12-13T13:17:59Z

pkg/proxy/endpoints.go

 	}
 	return len(ect.items) > 0
 }

+func getLastChangeTriggerTime(endpoints *v1.Endpoints) time.Time {
+	val, _ := time.Parse(time.RFC3339Nano, endpoints.Annotations[v1.EndpointsLastChangeTriggerTime])


Don't silently ignore errors - if nothing more, at least log an error.

Done. Added log statement and a comment explaining why we can ignore the error.
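
For readers following this thread, here is a minimal sketch of one way to parse the annotation without silently dropping the error. It is not the code from this PR: `parseTriggerTime` is a hypothetical helper, the standard `log` package stands in for kube-proxy's logger, and the annotation key string is an assumption standing in for the `v1.EndpointsLastChangeTriggerTime` constant.

```go
package main

import (
	"log"
	"time"
)

// lastChangeTriggerTimeKey is assumed here for illustration; the PR reads the
// key from the v1.EndpointsLastChangeTriggerTime constant instead.
const lastChangeTriggerTimeKey = "endpoints.kubernetes.io/last-change-trigger-time"

// parseTriggerTime (hypothetical helper) returns the trigger time stored in
// the annotations, or the zero time.Time when the annotation is missing or
// malformed.
func parseTriggerTime(annotations map[string]string) time.Time {
	raw, ok := annotations[lastChangeTriggerTimeKey]
	if !ok {
		// No annotation: nothing to measure, so return the zero value quietly.
		return time.Time{}
	}
	t, err := time.Parse(time.RFC3339Nano, raw)
	if err != nil {
		// Log instead of dropping the error; callers treat the zero value
		// as "no trigger time available".
		log.Printf("parsing %s=%q: %v", lastChangeTriggerTimeKey, raw, err)
		return time.Time{}
	}
	return t
}

func main() {
	good := map[string]string{lastChangeTriggerTimeKey: "2019-02-06T14:12:38.123456789Z"}
	bad := map[string]string{lastChangeTriggerTimeKey: "not-a-timestamp"}
	log.Println(parseTriggerTime(good).IsZero()) // false
	log.Println(parseTriggerTime(bad).IsZero())  // true
	log.Println(parseTriggerTime(nil).IsZero())  // true
}
```

Logging only when the annotation is present but unparsable is a design choice of this sketch; it keeps objects that simply lack the annotation from producing a log line on every update.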

wojtek-t · 2019-02-06T12:09:03Z

/hold cancel

With the endpoint controller changes already merged, we are ready to resurrect this change.
@mm4tt - can you please rebase?

mm4tt

Thanks, PTAL

mm4tt · 2019-02-06T14:12:38Z

pkg/proxy/endpoints.go

@@ -30,6 +30,7 @@ import (
 	"k8s.io/client-go/tools/record"
 	utilproxy "k8s.io/kubernetes/pkg/proxy/util"
 	utilnet "k8s.io/kubernetes/pkg/util/net"
+	"time"


mm4tt · 2019-02-06T14:13:16Z

pkg/proxy/endpoints_test.go

@@ -26,6 +26,8 @@ import (
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/apimachinery/pkg/util/sets"
+	"time"
+	"sort"


mm4tt · 2019-02-06T14:21:14Z

pkg/proxy/endpoints.go

@@ -92,6 +93,9 @@ type EndpointChangeTracker struct {
 	// isIPv6Mode indicates if change tracker is under IPv6/IPv4 mode. Nil means not applicable.
 	isIPv6Mode *bool
 	recorder   record.EventRecorder
+	// Map from the Endpoints namespaced-name to the time of the trigger that caused the endpoints
+	// object to change. Used to calculate the network-programming-latency.
+	lastChangeTriggerTimes map[types.NamespacedName]time.Time


Good point.
Added to the documentation of the metric that exports the Network Programming Latency.

mm4tt · 2019-02-06T14:33:38Z

pkg/proxy/endpoints.go

 	}
 	change.current = ect.endpointsToEndpointsMap(current)
 	// if change.previous equal to change.current, it means no change
 	if reflect.DeepEqual(change.previous, change.current) {
 		delete(ect.items, namespacedName)
+		delete(ect.lastChangeTriggerTimes, namespacedName)


IIUC, the situation you described is something like this:

T0: proxier.Sync()
T1: proxier observes Endpoints E1 change, E1.EndpointsLastChangeTriggerTime = t0
T2: proxier observes Endpoints E1 change, E1.EndpointsLastChangeTriggerTime = t1 (t1>=t0)
T3: proxier.Sync()

In such a case, the implementation ignores the second timestamp and uses t0 (which is guaranteed to be <= t1) to measure the latency.

Let me know if it makes sense.

wojtek-t · 2019-02-11T09:08:05Z

pkg/proxy/metrics/metrics.go

+			Name:      "network_programming_latency_seconds",
+			Help:      "In Cluster Network Programming Latency in seconds",
+			// The last bucket will be [0.001s*2^20 ~= 17min, +inf)
+			Buckets: prometheus.ExponentialBuckets(0.001, 2, 20),


I'm not convinced that the buckets are correct.
I'm fine with leaving it as is for now, but please leave a TODO to reevaluate it before 1.14 release.
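
When the buckets are re-evaluated, it may help to print the bounds directly. The sketch below re-implements the documented behaviour of `prometheus.ExponentialBuckets` (the first upper bound equals `start`, each following bound is `factor` times the previous one, and the implicit `+Inf` bucket is not part of the returned slice) using only the standard library; `exponentialBuckets` is a hypothetical local helper, not the library function.

```go
package main

import "fmt"

// exponentialBuckets mirrors the documented behaviour of
// prometheus.ExponentialBuckets: it returns count upper bounds, the first
// equal to start and each subsequent one factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.001, 2, 20)
	last := b[len(b)-1]
	fmt.Printf("buckets=%d first=%gs last=%gs (about %.1f min)\n",
		len(b), b[0], last, last/60)
}
```

By this arithmetic the largest finite bound is 0.001 × 2^19 ≈ 524 s, so the overflow bucket would begin at roughly 8.7 minutes; this may be worth checking against the code comment when the TODO is addressed.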

wojtek-t

Two more comments - other than that lgtm.

wojtek-t · 2019-02-11T10:54:55Z

That LGTM.

@freehan - can you please take another look?

freehan

just a nit, LGTM overall

freehan · 2019-02-12T01:20:40Z

pkg/proxy/endpoints.go

+		// Reset the lastChangeTriggerTimes for the Endpoints object. Given that the network programming
+		// SLI is defined as the duration between a time of an event and a time when the network was
+		// programmed to incorporate that event, if there are events that happened between two
+		// consecutive syncs syncs and that canceled each other out, e.g. pod A added -> pod A deleted,


There are 2 syncs

Good catch, done.

freehan · 2019-02-12T01:25:41Z

pkg/proxy/endpoints.go

 	}
 	change.current = ect.endpointsToEndpointsMap(current)
 	// if change.previous equal to change.current, it means no change
 	if reflect.DeepEqual(change.previous, change.current) {
 		delete(ect.items, namespacedName)
+		delete(ect.lastChangeTriggerTimes, namespacedName)


Okay. Noop sounds good.

wojtek-t · 2019-02-12T08:07:12Z

/lgtm
/approve

k8s-ci-robot · 2019-02-12T08:07:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mm4tt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/proxy/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mm4tt · 2019-02-12T09:32:07Z

/retest

k8s-ci-robot · 2019-02-12T10:39:05Z

@mm4tt: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-kops-aws	c116d4f77daa048f89bf8d97a3b6e0e9cea63b58	link	`/test pull-kubernetes-e2e-kops-aws`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

krzysied · 2019-02-12T10:43:05Z

/retest

The DNS Programming Latency definition can be found [here](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/dns_programming_latency.md). This PR covers only "headless with selector" services; other service kinds are blocked on the impossibility of determining the LastUpdateTime of a service object. The PR bears some similarities to kubernetes/kubernetes#71999, which introduced In-Cluster Network Programming Latency. The main difference is that there is no actual "programming" happening in CoreDNS (for comparison, in kube-proxy the network programming consists of writing IPTables/IPVS rules). CoreDNS serves the content directly from the endpoints/service/pod cache, creating DNS records on the fly. Thus, we assume that the programming of DNS ends at the moment when the endpoints/service/pod change reaches CoreDNS via the Watch mechanism.

k8s-ci-robot requested review from freehan and lavalamp December 12, 2018 17:15

k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 12, 2018

k8s-ci-robot assigned wojtek-t Dec 12, 2018

k8s-ci-robot removed request for lavalamp and freehan December 12, 2018 17:18

wojtek-t reviewed Dec 13, 2018

k8s-ci-robot assigned freehan Dec 13, 2018

mm4tt mentioned this pull request Dec 13, 2018

Export EndpointsLastChangeTriggerTime annotation in endpoints_controler. #71998

Closed

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 13, 2018

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2019

mm4tt force-pushed the kube-proxy branch from c116d4f to bfd4ee1 Compare February 6, 2019 14:38

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2019

mm4tt commented Feb 6, 2019

mm4tt force-pushed the kube-proxy branch from bfd4ee1 to 0c74ca6 Compare February 6, 2019 15:52

wojtek-t reviewed Feb 11, 2019

mm4tt force-pushed the kube-proxy branch from 0c74ca6 to c0d750a Compare February 11, 2019 10:33

freehan reviewed Feb 12, 2019

Start exporting the in-cluster network programming latency metric.

7141ece

mm4tt force-pushed the kube-proxy branch from c0d750a to 7141ece Compare February 12, 2019 07:10

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2019

k8s-ci-robot merged commit 41d2445 into kubernetes:master Feb 12, 2019

mm4tt deleted the kube-proxy branch February 12, 2019 14:01

This was referenced Feb 13, 2019

kube-proxy.log is spammed with 'Error while parsing EndpointsLastChangeTriggerTimeAnnotation' logs #74003

Closed

REQUEST: New membership for mm4tt kubernetes/org#504

Closed

mm4tt mentioned this pull request Mar 13, 2019

Metric for measuring DNS Programming Latency SLI. coredns/coredns#2690

Closed
