
Pass enable-udp-aggregation=true to ovn-kubernetes #1533

Merged
openshift-merge-robot merged 2 commits into openshift:master from udp-gro-again on Sep 29, 2022

Conversation

danwinship
Contributor

Re-push of #1489, which was reverted (#1510) because QE had reported that it caused performance regressions. However:

  1. Most of the problem turned out to be that unlabeled numbers in the perf results had been interpreted as milliseconds when they were actually microseconds; there was an increase in latency, but it was tiny and expected, not large and problematic.
  2. The rest of the problem is with a dubious test scenario, which tests how fast the network is when using a protocol that is, essentially, maximally "optimized" for slowness: fully serialized request / response / request / response / request / response... with tiny packets (see the sketch below).
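To make that concrete, here is a minimal sketch of such a fully serialized UDP ping-pong loop (the address and echo server are hypothetical; the 64-byte size matches the test's smallest packet size):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical echo server in a pod on another node.
	conn, err := net.Dial("udp", "10.128.2.15:5001")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	req := make([]byte, 64)  // tiny request
	resp := make([]byte, 64) // tiny response
	start := time.Now()
	rounds := 0
	for time.Since(start) < time.Second {
		// Each iteration waits for the reply before sending the next
		// request, so throughput is bounded entirely by round-trip
		// latency; UDP aggregation can't help a workload like this.
		if _, err := conn.Write(req); err != nil {
			panic(err)
		}
		if _, err := conn.Read(resp); err != nil {
			panic(err)
		}
		rounds++
	}
	fmt.Printf("%d serialized round trips in 1s\n", rounds)
}
```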

No one has been able to suggest a real-world use case that would see the sort of drastic slowdown seen in the "Request/Response operations" test:

  1. RTP/RTSP/WebRTC and other streaming protocols don't look like that, because they just keep sending packets rather than sending one response packet for each request packet. (And of course, streaming protocols are what the UDP aggregation feature was intended to speed up, and it did speed them up; streaming throughput with 64-byte packets nearly doubled.)
  2. DNS doesn't look like that, because (a) no one ever does 10,000 DNS requests all at once, (b) if they did, they'd parallelize them rather than doing them one by one, and (c) the DNS protocol has specific design features to minimize the need for serialized lookups (e.g., when you get a CNAME record, it automatically includes the corresponding A record as well, so you don't have to make a second request), because people figured out in the 1980s that you shouldn't design protocols that require lots of small serialized requests and responses.
  3. HTTP/3 doesn't look like that because (a) it uses large packets when sending large data, (b) as in the streaming case, a single request can result in many response packets, rather than being 1-to-1, (c) a single HTTP/3 connection can have multiple concurrent request/response streams.

etc.

Additionally, testing a protocol like this inside a Kubernetes cluster is an implausibly good best-case scenario; in most normal environments, the client and server would be farther apart from each other (in terms of network topology), and thus a protocol as extremely latency-sensitive as the "Request/Response operations" test would behave much worse, making it even more likely that the developers would change the way it worked.

So, this PR re-enables the feature.
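For reference, the rendered ovnkube.conf fragment ends up looking roughly like this (a sketch: the option name comes from the PR title, but the exact section placement is an assumption):

```ini
# Sketch of the relevant ovnkube.conf fragment; section placement assumed.
[default]
enable-udp-aggregation=true
```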

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Aug 23, 2022
@danwinship
Contributor Author

/retest-required

1 similar comment
@danwinship
Contributor Author

/retest-required

@trozet
Contributor

trozet commented Aug 29, 2022

@danwinship, about the request/response test you mentioned in your 2nd point: it looks like the latency has increased for handling single UDP packets. I'm looking at https://bugzilla.redhat.com/show_bug.cgi?id=2085089#c37

However, I think that is the comment where the units are wrong, so now we are focusing on the number of requests/responses per minute. What I don't understand is: if the latency is lower now that we know the real unit is microseconds, how is it possible that the number of requests/responses is so much worse?

@mffiedler do you agree with re-enabling this now?

@danwinship
Contributor Author

> What I don't understand is: if the latency is lower now that we know the real unit is microseconds, how is it possible that the number of requests/responses is so much worse?

Because "the number of requests/responses" is just 1 second divided by the latency of a single request/response. The units don't matter; if you make a single request/response take twice as long, then the total number of serialized requests/responses you can do in a fixed amount of time will be halved.

(E.g., for 64-byte packets, the test reports 76us latency before and 137us latency after, with 14,447 round trips before and 7,574 after, which is within 10% of what you get if you just divide 1s/76us and 1s/137us.)
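A throwaway sketch checking that arithmetic against the numbers quoted above:

```go
package main

import "fmt"

func main() {
	cases := []struct {
		name      string
		latencyUS float64 // measured latency per round trip, in microseconds
		reported  float64 // round trips the test actually reported
	}{
		{"before (76us)", 76, 14447},
		{"after (137us)", 137, 7574},
	}
	for _, c := range cases {
		// A fully serialized loop can do at most 1s / latency round trips.
		predicted := 1e6 / c.latencyUS
		fmt.Printf("%s: predicted %.0f rt/s, reported %.0f (%+.1f%%)\n",
			c.name, predicted, c.reported,
			100*(c.reported-predicted)/predicted)
	}
}
```

Both cases come out within 10% of the simple 1s-divided-by-latency prediction.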

But the relevant questions are:

  1. Do we think the 76us latency for cross-node pod-to-pod UDP packets measured in the (before) perfscale test is actually a good estimate of real-world use cases? (The latency increase from this PR is fixed at 50us, so if customers have much higher baseline latency than in our test case, then +50us might just be noise. But if they had much lower baseline latency, then +50us would be much more of a performance killer.)
  2. Assuming the answer to question 1 is "yes, it's realistic", are we OK with 137us latency rather than 76us latency for a single small UDP request? (If not, then why? It's not reasonable to say "they might just need the absolute lowest latency possible" because in that case they probably wouldn't be using the pod network at all.)
  3. Assuming the answer to question 2 is "yes, it's fine for a single request", then what is the N such that 76us * N was acceptable, but 137us * N is too slow, and is there any plausible scenario where a customer would actually be sending N serialized UDP request/response pairs? (We talked in the meeting about how DNS might end up sending 5 serialized requests because of ndots, but presumably if your application can handle spending 1/3 of a millisecond doing DNS, then you can also handle spending 2/3 of a millisecond doing DNS; and that's assuming an instantaneous response from the DNS server anyway. The arithmetic is worked out below.)
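(For concreteness, the ndots arithmetic from question 3: 5 serialized lookups cost 5 × 76us = 380us, roughly 1/3 of a millisecond, before this change, versus 5 × 137us = 685us, roughly 2/3 of a millisecond, after.)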

@danwinship
Contributor Author

/hold

Ben suggests adding a "chicken flag" to disable this in case of emergency.

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Aug 30, 2022
@danwinship
Contributor Author

/retest

@danwinship
Contributor Author

/hold cancel
/retest-required

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Sep 26, 2022
@trozet
Contributor

trozet commented Sep 27, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Sep 27, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f4018b0 and 2 for PR HEAD f1dbb51 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b84ba70 and 1 for PR HEAD f1dbb51 in total

Always return a nil OVNConfigBootstrapResult on error (since the
caller ignores it anyway).

If there is no dpu-mode-config override configmap, don't log a
message; that's the normal expected case. If dpu-mode-config parsing
fails, don't bail out of bootstrapOVNConfig completely. Just skip
overriding the NodeMode.
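The two commit messages above describe an error-handling cleanup; here is a minimal sketch of the resulting control flow (bootstrapOVNConfig, OVNConfigBootstrapResult, and NodeMode are named in the messages, while the lookup helper and field layout are assumptions for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// OVNConfigBootstrapResult is named in the commit message; the single
// NodeMode field here is an assumption for illustration.
type OVNConfigBootstrapResult struct {
	NodeMode string
}

// errNotFound and lookupDPUModeConfig are hypothetical stand-ins for
// the real dpu-mode-config configmap lookup.
var errNotFound = errors.New("not found")

func lookupDPUModeConfig() (map[string]string, error) {
	return nil, errNotFound // pretend the override configmap is absent
}

func bootstrapOVNConfig() (*OVNConfigBootstrapResult, error) {
	result := &OVNConfigBootstrapResult{NodeMode: "full"}

	data, err := lookupDPUModeConfig()
	if errors.Is(err, errNotFound) {
		// No dpu-mode-config override is the normal, expected case,
		// so don't log a message.
		return result, nil
	} else if err != nil {
		// On a real error, return a nil result (the caller ignores
		// it anyway).
		return nil, fmt.Errorf("failed to get dpu-mode-config: %w", err)
	}

	mode, ok := data["mode"]
	if !ok {
		// If parsing fails, just skip overriding NodeMode instead of
		// bailing out of bootstrapOVNConfig completely.
		fmt.Println("ignoring malformed dpu-mode-config")
		return result, nil
	}
	result.NodeMode = mode
	return result, nil
}

func main() {
	res, err := bootstrapOVNConfig()
	fmt.Println(res, err)
}
```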
@danwinship
Contributor Author

Rebased, and updated the new unit test to include egressip-node-healthcheck-port=9107.

@openshift-ci openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged) on Sep 28, 2022
@trozet
Contributor

trozet commented Sep 28, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Sep 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b84ba70 and 2 for PR HEAD b485074 in total

@danwinship
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Sep 28, 2022

@danwinship: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn | b485074 | link | false | /test e2e-azure-ovn |
| ci/prow/e2e-network-mtu-migration-sdn-ipv4 | b485074 | link | false | /test e2e-network-mtu-migration-sdn-ipv4 |
| ci/prow/e2e-openstack-ovn | b485074 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-vsphere-ovn | b485074 | link | false | /test e2e-vsphere-ovn |
| ci/prow/e2e-openstack-sdn | b485074 | link | false | /test e2e-openstack-sdn |
| ci/prow/e2e-hypershift-ovn | b485074 | link | false | /test e2e-hypershift-ovn |
| ci/prow/e2e-aws-sdn-upgrade | b485074 | link | false | /test e2e-aws-sdn-upgrade |
| ci/prow/e2e-ovn-ipsec-step-registry | b485074 | link | false | /test e2e-ovn-ipsec-step-registry |
| ci/prow/e2e-aws-ovn-serial | b485074 | link | false | /test e2e-aws-ovn-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@danwinship
Contributor Author

/retest-required

@openshift-merge-robot openshift-merge-robot merged commit f09940f into openshift:master Sep 29, 2022
@danwinship danwinship deleted the udp-gro-again branch October 11, 2022 14:34