OCPBUGS-27397: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 #109

gcs278 · 2024-01-23T18:59:11Z

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.
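
To illustrate why a non-EDNS upstream query makes overflow more likely, here is a minimal sketch of how the acceptable UDP response size depends on whether the query carried an OPT record. The `query` type and `maxUDPResponseSize` function are hypothetical stand-ins invented for this example, not CoreDNS or miekg/dns APIs. The 512-byte floor matches the value miekg/dns exposes as `dns.MinMsgSize`.

```go
package main

import "fmt"

// minUDPSize is the classic DNS-over-UDP limit of 512 bytes, the same
// value miekg/dns exposes as dns.MinMsgSize.
const minUDPSize = 512

// query is a hypothetical stand-in for a DNS request (assumption for
// illustration only; it is not a miekg/dns type).
type query struct {
	hasOPT     bool   // true when the sender included an EDNS(0) OPT record
	udpPayload uint16 // UDP payload size advertised in that OPT record
}

// maxUDPResponseSize returns the largest UDP response the sender of q
// has said it can accept.
func maxUDPResponseSize(q query) int {
	if !q.hasOPT {
		// Without EDNS the sender can only be assumed to accept 512 bytes.
		return minUDPSize
	}
	if int(q.udpPayload) < minUDPSize {
		// Advertised sizes below 512 are treated as 512.
		return minUDPSize
	}
	return int(q.udpPayload)
}

func main() {
	fmt.Println(maxUDPResponseSize(query{}))
	fmt.Println(maxUDPResponseSize(query{hasOPT: true, udpPayload: 1232}))
	fmt.Println(maxUDPResponseSize(query{hasOPT: true, udpPayload: 100}))
}
```

An upstream resolver that answers a non-EDNS query with more than 512 bytes over UDP, rather than truncating it itself, is the kind of non-compliant behavior that would surface as an overflow on the receiving side.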

Handle UDP responses that overflow with TC bit with test case (coredns#6277) Signed-off-by: SriHarshaBS001 <SriHarshaBS009@gmail.com>

openshift-ci-robot · 2024-01-23T19:01:02Z

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

gcs278 · 2024-01-23T19:02:05Z

/jira refresh

openshift-ci-robot · 2024-01-23T19:02:12Z

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-01-23T19:02:47Z

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: melvinjoseph86.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)

bug target version (4.16.0) matches configured target version for branch (4.16.0)

bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gcs278 · 2024-01-23T19:04:49Z

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

openshift-cherrypick-robot · 2024-01-23T19:04:51Z

@gcs278: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you.

In response to this:

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gcs278 · 2024-01-23T19:05:34Z

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

openshift-cherrypick-robot · 2024-01-23T19:05:36Z

@gcs278: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Miciah · 2024-01-23T21:40:12Z

plugin/pkg/proxy/connect.go

+// Function to return an empty response with TC (truncated) bit set.
+func truncateResponse(response *dns.Msg) *dns.Msg {
+	// Clear out Answer, Extra, and Ns sections
+	response.Answer = nil
+	response.Extra = nil
+	response.Ns = nil
+
+	// Set TC bit to indicate truncation.
+	response.Truncated = true
+	return response
+}


Any idea why the existing Truncate method isn't used for this?

coredns/vendor/github.com/miekg/dns/msg_truncate.go

Lines 3 to 97 in 37a9afe

// Truncate ensures the reply message will fit into the requested buffer
// size by removing records that exceed the requested size.
//
// It will first check if the reply fits without compression and then with
// compression. If it won't fit with compression, Truncate then walks the
// record adding as many records as possible without exceeding the
// requested buffer size.
//
// If the message fits within the requested size without compression,
// Truncate will set the message's Compress attribute to false. It is
// the caller's responsibility to set it back to true if they wish to
// compress the payload regardless of size.
//
// The TC bit will be set if any records were excluded from the message.
// If the TC bit is already set on the message it will be retained.
// TC indicates that the client should retry over TCP.
//
// According to RFC 2181, the TC bit should only be set if not all of the
// "required" RRs can be included in the response. Unfortunately, we have
// no way of knowing which RRs are required so we set the TC bit if any RR
// had to be omitted from the response.
//
// The appropriate buffer size can be retrieved from the requests OPT
// record, if present, and is transport specific otherwise. dns.MinMsgSize
// should be used for UDP requests without an OPT record, and
// dns.MaxMsgSize for TCP requests without an OPT record.
func (dns *Msg) Truncate(size int) {
	if dns.IsTsig() != nil {
		// To simplify this implementation, we don't perform
		// truncation on responses with a TSIG record.
		return
	}

	// RFC 6891 mandates that the payload size in an OPT record
	// less than 512 (MinMsgSize) bytes must be treated as equal to 512 bytes.
	//
	// For ease of use, we impose that restriction here.
	if size < MinMsgSize {
		size = MinMsgSize
	}

	l := msgLenWithCompressionMap(dns, nil) // uncompressed length
	if l <= size {
		// Don't waste effort compressing this message.
		dns.Compress = false
		return
	}

	dns.Compress = true

	edns0 := dns.popEdns0()
	if edns0 != nil {
		// Account for the OPT record that gets added at the end,
		// by subtracting that length from our budget.
		//
		// The EDNS(0) OPT record must have the root domain and
		// it's length is thus unaffected by compression.
		size -= Len(edns0)
	}

	compression := make(map[string]struct{})

	l = headerSize
	for _, r := range dns.Question {
		l += r.len(l, compression)
	}

	var numAnswer int
	if l < size {
		l, numAnswer = truncateLoop(dns.Answer, size, l, compression)
	}

	var numNS int
	if l < size {
		l, numNS = truncateLoop(dns.Ns, size, l, compression)
	}

	var numExtra int
	if l < size {
		_, numExtra = truncateLoop(dns.Extra, size, l, compression)
	}

	// See the function documentation for when we set this.
	dns.Truncated = dns.Truncated || len(dns.Answer) > numAnswer ||
		len(dns.Ns) > numNS || len(dns.Extra) > numExtra

	dns.Answer = dns.Answer[:numAnswer]
	dns.Ns = dns.Ns[:numNS]
	dns.Extra = dns.Extra[:numExtra]

	if edns0 != nil {
		// Add the OPT record back onto the additional section.
		dns.Extra = append(dns.Extra, edns0)
	}
}
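
For readers who want the gist of the section-by-section walk in the listing above, here is a simplified, self-contained sketch. It is an illustration only: records are reduced to plain byte sizes, name compression and the OPT record are ignored, and `fitRecords` and `truncateSketch` are hypothetical names rather than miekg/dns functions. In this sketch, once a record does not fit, the running total is pinned to the budget so that later sections are skipped.

```go
package main

import "fmt"

// fitRecords walks record sizes in order and reports how many leading
// records fit in budget, given that used bytes are already consumed.
// On overflow it returns budget as the new total so callers stop adding.
// (Hypothetical helper for illustration; sizes ignore name compression.)
func fitRecords(sizes []int, budget, used int) (int, int) {
	for i, s := range sizes {
		used += s
		if used > budget {
			return budget, i
		}
	}
	return used, len(sizes)
}

// truncateSketch applies the walk to the answer, authority, and
// additional sections in that order. It returns how many records of
// each section are kept and whether the TC bit should be set.
func truncateSketch(answer, ns, extra []int, used, budget int) (nAns, nNS, nExtra int, tc bool) {
	if used < budget {
		used, nAns = fitRecords(answer, budget, used)
	}
	if used < budget {
		used, nNS = fitRecords(ns, budget, used)
	}
	if used < budget {
		_, nExtra = fitRecords(extra, budget, used)
	}
	// TC is set if any record from any section had to be dropped.
	tc = len(answer) > nAns || len(ns) > nNS || len(extra) > nExtra
	return nAns, nNS, nExtra, tc
}

func main() {
	// 50 bytes of header and question, 512-byte budget, and three
	// 200-byte answers: only two answers fit, so TC is set.
	a, n, e, tc := truncateSketch([]int{200, 200, 200}, []int{30}, []int{30}, 50, 512)
	fmt.Println(a, n, e, tc)
}
```

The point of the walk is that a partially filled response still carries useful answers, which is the behavior the question below is asking about.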

Good question. I looked around, and the best I could find was coredns#5953 (comment):

I did a quick PR on the simplest handling case - just responding with an empty truncated message.

So it's possible they wanted the simplest solution, or they wanted to indicate that something was unusual by stripping the answer off completely. I think it's worth asking.

Using Truncate may benefit Linux distros that aren't capable of using DNS over TCP, because the superfluous information is often the ADDITIONAL records, which a client may not care about. Returning an answer is arguably better than returning nothing at all.

However, we don't support distros like Alpine that lack DNS over TCP, so I believe the client is going to retry anyway and it's not much of a net gain for us.

I'll follow up in CoreDNS slack.

https://cloud-native.slack.com/archives/C4DF7FP71/p1706048479458799

Okay, I followed up and also tested using Truncate. The issue is that when the upstream DNS library encounters an overflow error, it stops processing the message and leaves the message's Answer field empty (since it overflowed while unpacking it), so there are no records left for Truncate to keep.

In theory, I think the upstream library could still unpack what it could while returning an overflow error, which would allow some information to be returned, but I don't think this is a trivial effort and probably not worth delaying this fix over.

Let me know if this makes sense.

Miciah · 2024-01-23T21:42:44Z

Thanks! If you're confident that #109 (review) isn't an issue, I'm fine with this as is; feel free to release the hold. (If you don't know, maybe it's worth a quick question on upstream's Slack channel.)
/approve
/lgtm
/hold

openshift-ci · 2024-01-23T21:42:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Miciah]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gcs278 · 2024-01-23T22:25:15Z

@Miciah thanks - I don't see it causing issues (this fix is still a net positive), but arguably it may be more effective to try to return something. I think it's worth waiting a day for a response from upstream in case they change their mind and update the fix.

gcs278 · 2024-01-23T22:43:33Z

/test e2e-gcp-serial

openshift-ci · 2024-01-24T01:23:34Z

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-ipi-ovn	`37a9afe`	link	false	`/test e2e-metal-ipi-ovn`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

gcs278 · 2024-01-24T14:53:26Z

I followed up with upstream and the way they approached this seems reasonable given their constraints.
/unhold

openshift-ci-robot · 2024-01-24T14:55:53Z

@gcs278: Jira Issue OCPBUGS-27397: All pull requests linked via external trackers have merged:

openshift/coredns#109

Jira Issue OCPBUGS-27397 has been moved to the MODIFIED state.

In response to this:

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-cherrypick-robot · 2024-01-24T14:56:48Z

@gcs278: #109 failed to apply on top of branch "release-4.12":

Applying: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
Using index info to reconstruct a base tree...
A	plugin/pkg/proxy/connect.go
A	plugin/pkg/proxy/proxy_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): plugin/pkg/proxy/proxy_test.go deleted in HEAD and modified in UPSTREAM: 6277: openshift: Fix OCPBUGS-27397. Version UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 of plugin/pkg/proxy/proxy_test.go left in tree.
Auto-merging plugin/forward/connect.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot · 2024-01-24T14:57:29Z

@gcs278: new pull request created: #110

In response to this:

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-bot · 2024-01-25T05:55:48Z

[ART PR BUILD NOTIFIER]

This PR has been included in build coredns-container-v4.16.0-202401242108.p0.gb5a4844.assembly.stream for distgit coredns.
All builds following this will include this PR.

sreber84 · 2024-01-25T07:02:27Z

/cherry-pick release-4.14

openshift-cherrypick-robot · 2024-01-25T07:03:07Z

@sreber84: #109 failed to apply on top of branch "release-4.14":

Applying: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
Using index info to reconstruct a base tree...
A	plugin/pkg/proxy/connect.go
A	plugin/pkg/proxy/proxy_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): plugin/pkg/proxy/proxy_test.go deleted in HEAD and modified in UPSTREAM: 6277: openshift: Fix OCPBUGS-27397. Version UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 of plugin/pkg/proxy/proxy_test.go left in tree.
Auto-merging plugin/forward/connect.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

UPSTREAM: 6277: openshift: Fix OCPBUGS-27397

37a9afe

Handle UDP responses that overflow with TC bit with test case (coredns#6277) Signed-off-by: SriHarshaBS001 <SriHarshaBS009@gmail.com>

openshift-ci bot requested review from candita and rfredette January 23, 2024 18:59

gcs278 changed the title ~~UPSTREAM: 6277: openshift: Fix OCPBUGS-27397~~ OCPBUGS-27397: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 Jan 23, 2024

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 23, 2024

Miciah reviewed Jan 23, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2024

openshift-ci bot assigned Miciah Jan 23, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024

openshift-merge-bot bot merged commit b5a4844 into openshift:master Jan 24, 2024
6 of 7 checks passed

openshift-cherrypick-robot mentioned this pull request Jan 24, 2024

[release-4.15] OCPBUGS-27904: UPSTREAM: 6277: openshift: Fix OCPBUGS-27904 #110

Merged
