Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-27397: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 #109

Merged
merged 1 commit into from Jan 24, 2024

Conversation

gcs278
Copy link

@gcs278 gcs278 commented Jan 23, 2024

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.

Handle UDP responses that overflow with TC bit with test case (coredns#6277)

Signed-off-by: SriHarshaBS001 <SriHarshaBS009@gmail.com>
@gcs278 gcs278 changed the title UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 OCPBUGS-27397: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 Jan 23, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 23, 2024
@openshift-ci-robot
Copy link

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@gcs278
Copy link
Author

gcs278 commented Jan 23, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 23, 2024
@openshift-ci-robot
Copy link

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

openshift-ci bot commented Jan 23, 2024

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: melvinjoseph86.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@gcs278: This pull request references Jira Issue OCPBUGS-27397, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gcs278
Copy link
Author

gcs278 commented Jan 23, 2024

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

@openshift-cherrypick-robot

@gcs278: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you.

In response to this:

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gcs278
Copy link
Author

gcs278 commented Jan 23, 2024

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

@openshift-cherrypick-robot

@gcs278: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +178 to +188
// Function to return an empty response with TC (truncated) bit set.
func truncateResponse(response *dns.Msg) *dns.Msg {
// Clear out Answer, Extra, and Ns sections
response.Answer = nil
response.Extra = nil
response.Ns = nil

// Set TC bit to indicate truncation.
response.Truncated = true
return response
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why the existing Truncate method isn't used for this?

// Truncate ensures the reply message will fit into the requested buffer
// size by removing records that exceed the requested size.
//
// It will first check if the reply fits without compression and then with
// compression. If it won't fit with compression, Truncate then walks the
// record adding as many records as possible without exceeding the
// requested buffer size.
//
// If the message fits within the requested size without compression,
// Truncate will set the message's Compress attribute to false. It is
// the caller's responsibility to set it back to true if they wish to
// compress the payload regardless of size.
//
// The TC bit will be set if any records were excluded from the message.
// If the TC bit is already set on the message it will be retained.
// TC indicates that the client should retry over TCP.
//
// According to RFC 2181, the TC bit should only be set if not all of the
// "required" RRs can be included in the response. Unfortunately, we have
// no way of knowing which RRs are required so we set the TC bit if any RR
// had to be omitted from the response.
//
// The appropriate buffer size can be retrieved from the requests OPT
// record, if present, and is transport specific otherwise. dns.MinMsgSize
// should be used for UDP requests without an OPT record, and
// dns.MaxMsgSize for TCP requests without an OPT record.
func (dns *Msg) Truncate(size int) {
if dns.IsTsig() != nil {
// To simplify this implementation, we don't perform
// truncation on responses with a TSIG record.
return
}
// RFC 6891 mandates that the payload size in an OPT record
// less than 512 (MinMsgSize) bytes must be treated as equal to 512 bytes.
//
// For ease of use, we impose that restriction here.
if size < MinMsgSize {
size = MinMsgSize
}
l := msgLenWithCompressionMap(dns, nil) // uncompressed length
if l <= size {
// Don't waste effort compressing this message.
dns.Compress = false
return
}
dns.Compress = true
edns0 := dns.popEdns0()
if edns0 != nil {
// Account for the OPT record that gets added at the end,
// by subtracting that length from our budget.
//
// The EDNS(0) OPT record must have the root domain and
// it's length is thus unaffected by compression.
size -= Len(edns0)
}
compression := make(map[string]struct{})
l = headerSize
for _, r := range dns.Question {
l += r.len(l, compression)
}
var numAnswer int
if l < size {
l, numAnswer = truncateLoop(dns.Answer, size, l, compression)
}
var numNS int
if l < size {
l, numNS = truncateLoop(dns.Ns, size, l, compression)
}
var numExtra int
if l < size {
_, numExtra = truncateLoop(dns.Extra, size, l, compression)
}
// See the function documentation for when we set this.
dns.Truncated = dns.Truncated || len(dns.Answer) > numAnswer ||
len(dns.Ns) > numNS || len(dns.Extra) > numExtra
dns.Answer = dns.Answer[:numAnswer]
dns.Ns = dns.Ns[:numNS]
dns.Extra = dns.Extra[:numExtra]
if edns0 != nil {
// Add the OPT record back onto the additional section.
dns.Extra = append(dns.Extra, edns0)
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I looked around, the best I could find was: coredns#5953 (comment):

I did a quick PR on the simplest handling case - just responding with an empty truncated message.

So it's possible they wanted the simplest solution or they wanted to indicate something was unusually by stripping the answer off completely. I think it's worth asking.

This may benefit Linux distros that aren't capable of using DNS over TCP, because often the superfluous information are the ADDITIONAL records, which a client may not care about. Returning an answer is argubly better than returning nothing at all.

However, for our users, we don't support distros like Alpine that don't support DNS over TCP, so, I believe the client is going to retry anyways and it's not much a net gain for us.

I'll follow up in CoreDNS slack.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I followed up and also tested out using Truncate. The issue is that when the upstream DNS library encounters an overflow error, it stops processing the message and leaves the msg answer field empty (since it overflowed unpacking it).

In theory, I think the upstream library could still unpack what it could while returning an overflow error, which would allow some information to be returned, but I don't think this is a trivial effort and probably not worth delaying this fix over.

Let me know if this makes sense.

@Miciah
Copy link

Miciah commented Jan 23, 2024

Thanks! If you're confident that #109 (review) isn't an issue, I'm fine with this as is; feel free to release the hold. (If you don't know, maybe it's worth a quick question on upstream's Slack channel.)
/approve
/lgtm
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2024
Copy link

openshift-ci bot commented Jan 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2024
@gcs278
Copy link
Author

gcs278 commented Jan 23, 2024

@Miciah thanks - I don't see it causing issues (this fix is still a net positive), but arguably it may be more effective to try to return something. I think it's worth waiting a day for a response from upstream in case they change their mind and update the fix.

@gcs278
Copy link
Author

gcs278 commented Jan 23, 2024

/test e2e-gcp-serial

Copy link

openshift-ci bot commented Jan 24, 2024

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn 37a9afe link false /test e2e-metal-ipi-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gcs278
Copy link
Author

gcs278 commented Jan 24, 2024

I followed up with upstream and the way they approached this seems reasonable given their constraints.
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit b5a4844 into openshift:master Jan 24, 2024
6 of 7 checks passed
@openshift-ci-robot
Copy link

@gcs278: Jira Issue OCPBUGS-27397: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-27397 has been moved to the MODIFIED state.

In response to this:

Cherry-Pick of coredns#6277 which fixes OCPBUGS-27397 by adding logic that more gracefully handles overflow errors by setting the Truncated Bit and clearing the response. This indicates to DNS clients to retry with TCP, whereas previously they would have always gotten a SERVFAIL.

This becomes more important/exposed with CoreDNS 1.10.1 (OCP 4.13) onwards, given the changes introduced by coredns#5671. This modification only uses EDNS on the upstream DNS request when the client query used EDNS. It appears that certain DNS resolvers may not handle non-EDNS queries in a compliant manner.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@gcs278: #109 failed to apply on top of branch "release-4.12":

Applying: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
Using index info to reconstruct a base tree...
A	plugin/pkg/proxy/connect.go
A	plugin/pkg/proxy/proxy_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): plugin/pkg/proxy/proxy_test.go deleted in HEAD and modified in UPSTREAM: 6277: openshift: Fix OCPBUGS-27397. Version UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 of plugin/pkg/proxy/proxy_test.go left in tree.
Auto-merging plugin/forward/connect.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I can't remember if this will work before it's merged, but let me try:
/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@gcs278: new pull request created: #110

In response to this:

Whoops, I didn't mean to do 4.12...we aren't backporting that far.

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

This PR has been included in build coredns-container-v4.16.0-202401242108.p0.gb5a4844.assembly.stream for distgit coredns.
All builds following this will include this PR.

@sreber84
Copy link

/cherry-pick release-4.14

@openshift-cherrypick-robot

@sreber84: #109 failed to apply on top of branch "release-4.14":

Applying: UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
Using index info to reconstruct a base tree...
A	plugin/pkg/proxy/connect.go
A	plugin/pkg/proxy/proxy_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): plugin/pkg/proxy/proxy_test.go deleted in HEAD and modified in UPSTREAM: 6277: openshift: Fix OCPBUGS-27397. Version UPSTREAM: 6277: openshift: Fix OCPBUGS-27397 of plugin/pkg/proxy/proxy_test.go left in tree.
Auto-merging plugin/forward/connect.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 UPSTREAM: 6277: openshift: Fix OCPBUGS-27397
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants