Bug 1849432: [baremetal] verify resolv.conf in HAProxy static pod synced with host resolv.conf #1872
@yboaron: This pull request references Bugzilla bug 1849432, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
I'll first wait for feedback on the suggested fix for baremetal (and for the results of the e2e-metal-ipi CI job), and after that I'll apply this change to the other platforms.
If we want the container resolv.conf to match the host resolv.conf, wouldn't it be easier to just mount it in the container directly?
Force-pushed: bd0b944 to 9fd76c1
/retest
/test e2e-metal-ipi
Still hard to believe there isn't a simpler way to do this, but if there is I haven't found it either. As it turns out, even if the bind mount worked it probably wouldn't help here because we replace resolv.conf completely in the dispatcher script, which would break the bind mount.
Just waiting for metal ci to pass before lgtm.
The NM prepend script is also used by the rest of the platforms (vSphere, oVirt and OpenStack), so I guess I should update the PR to cover them too.
Force-pushed: aa05913 to 36b1959
- "-c"
- |
  #/bin/bash
  cp /host/etc/resolv.conf /etc/resolv.conf
what about this non-atomic copy? that's another race
@rgolangh I don't think an atomic cp is required here. The monitor code runs periodically, so even if it fails once or twice (until the cp operation completes) it should be OK.
I'm not worried about the monitor, but about the DNS resolver, which could read a resolv.conf whose bits are still in flight. I'm quite sure this is what happened with the old prepender.
Do you mean the resolv.conf in the host or in the pod?
better to do it atomically
@celebdor @rgolangh
I tried to make the cp atomic (using the same trick as [1]), but the mv command fails with an EBUSY error because /etc/resolv.conf is a mount point.
Do you have any idea how to make cp atomic?
[1] https://github.com/openshift/machine-config-operator/pull/1763/files
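For readers following along, the usual write-then-rename trick looks roughly like the sketch below. This is only an illustration, not the code from [1]: the function name `atomic_replace` is hypothetical, and it assumes the destination is a regular file in a writable directory. It relies on `mv` being a plain rename within one filesystem, which is what stops working when the destination is itself a mount point, as described above.

```shell
# Hypothetical helper (not taken from the PR): replace DST with a copy of SRC
# so that readers see either the old file or the new one, never a partial write.
# Assumes DST is a regular file, not a mount point.
atomic_replace() (
    src="$1"
    dst="$2"
    # Create the temporary file next to DST so the final mv stays on one filesystem.
    tmp="$(mktemp "${dst}.XXXXXX")" || exit 1
    # mktemp creates the file with mode 0600; resolv.conf is normally world-readable.
    if cp "$src" "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$dst"; then
        exit 0
    fi
    # Clean up the temporary file if any step failed.
    rm -f "$tmp"
    exit 1
)
```

Usage would be `atomic_replace /host/etc/resolv.conf /some/regular/file`; pointing it at a bind-mounted file is expected to fail at the `mv` step.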
I looked into this a bit more. First, I don't think we can make operations on resolv.conf atomic because it is its own tmpfs filesystem inside the container. Even mv is not atomic across filesystems (it degrades into a cp+unlink).
However, I don't think this is a concern inside the container. As Yossi pointed out, if we get an incomplete resolv.conf from the copy then the liveness probe will immediately fail because of the mismatch with the host resolv.conf. Furthermore, nothing is running in the container at the point where the copy is made, so there's no possibility of in-flight DNS resolutions using a bad resolv.conf. The monitor doesn't start until after the file has been copied. The important thing here is that we're not modifying the system-wide resolv.conf, just this one container's, so there are no potential problems with concurrent access to the file.
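The mismatch check mentioned above can be pictured as a byte-for-byte comparison, so a truncated copy shows up as a failure on the next probe. This is a sketch under assumptions: the function name `resolv_in_sync` is hypothetical and the actual probe in the PR may be written differently. The default paths are the ones from the snippet under review.

```shell
# Hypothetical check (not taken from the PR): succeed only if the container's
# resolv.conf is byte-for-byte identical to the host copy.
# Arguments are optional; the defaults follow the paths in the reviewed snippet.
resolv_in_sync() {
    cmp -s "${1:-/host/etc/resolv.conf}" "${2:-/etc/resolv.conf}"
}
```

A probe command could then be as small as `resolv_in_sync || exit 1`.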
One issue, one comment.
- "/bin/bash"
- "-c"
- |
  #/bin/bash
This is just a comment. I think you meant to have #!
- "/bin/bash"
- "-c"
- |
  #/bin/bash
Ditto on #!
- "/bin/bash"
- "-c"
- |
  #/bin/bash
Ditto
- "/bin/bash"
- "-c"
- |
  #/bin/bash
Ditto.
I wouldn't think you even need to specify the interpreter when you're already calling /bin/bash -c. It should be fine to exclude that completely.
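A quick way to convince yourself: when the script text is passed as the argument to `bash -c`, bash is already the interpreter, so a first line beginning with `#` is parsed as an ordinary comment whether or not it has the `!`. The sketch below uses a hypothetical function name and a placeholder `echo` instead of the real `cp`.

```shell
# Run a script body the same way the pod spec does: as the argument to `bash -c`.
# The first line of the body starts with '#', so bash skips it as a comment.
run_like_pod_command() {
    bash -c '#/bin/bash
echo "copied"'
}
```

Calling `run_like_pod_command` prints `copied`, with or without the first line of the body.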
Force-pushed: 36b1959 to 465f19a
Definitely.
/test e2e-openstack
… resolv.conf
In some cases, we noticed that the HAProxy static pod starts running before the NM resolv prepend script [1] was applied; as a result, the pod's resolv.conf file doesn't point to the local CoreDNS instance. In this case, the HAProxy pod (actually its haproxy-monitor container) will fail to retrieve information from api-int:kube-apiserver (because the local CoreDNS instance is the one that resolves api-int). This PR updates the haproxy-monitor container so that its resolv.conf stays in sync with the node's resolv.conf.
[1] https://github.com/openshift/machine-config-operator/blob/master/templates/common/baremetal/files/NetworkManager-resolv-prepender.yaml
Force-pushed: 465f19a to 74e5062
/retest
Looks like the unnecessary shebangs are gone, and as I noted inline I don't think the resolv.conf race is a concern because it's a one process race. Assuming this passes metal ci (which I expect it will since it works for me locally), it should be good to go.
/test e2e-ovirt
@yboaron oVirt CI should be green now. Let's see.
/test e2e-ovirt
@patrickdillon from the vSphere perspective, do you think we can merge this fix?
/test e2e-gcp-upgrade
@mandre Do you think we can remove the hold?
/hold cancel
/test e2e-gcp-upgrade
/lgtm
@runcom @ericavonb Can we merge this PR? It seems we have a green light from all platforms.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bcrochet, cybertron, ericavonb, patrickdillon, yboaron
/retest
Please review the full test history for this PR and help us cut down flakes.
/skip
/retest
Please review the full test history for this PR and help us cut down flakes.
@yboaron: Some pull requests linked via external trackers have merged: openshift/machine-config-operator#1872. The following pull requests linked via external trackers have not merged:
/cherry-pick release-4.5
@yboaron: new pull request created: #1974
In some cases, we noticed that the HAProxy static pod starts running before the NM resolv prepend script [1] has been applied;
as a result, the pod's resolv.conf file doesn't point to the local CoreDNS instance.
In this case, the HAProxy pod (actually its haproxy-monitor container) will fail to retrieve information
from api-int:kube-apiserver (because the local CoreDNS instance is the one that resolves api-int).
With this PR, the resolv.conf used by haproxy-monitor should always be synced with the node's resolv.conf.
[1] https://github.com/openshift/machine-config-operator/blob/master/templates/common/baremetal/files/NetworkManager-resolv-prepender.yaml
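Putting the discussion together, the startup order (copy first, then start the monitor) can be sketched as below. This is an illustration rather than the merged template: the function name `start_with_synced_resolv` is hypothetical, and the monitor command is passed in as an argument because its real invocation is not shown in this thread.

```shell
# Hypothetical wrapper (not the merged code): sync resolv.conf, then run the
# monitor command, so nothing resolves names before the copy has finished.
# Arguments: HOST_FILE POD_FILE COMMAND [ARGS...]
start_with_synced_resolv() (
    host_file="$1"
    pod_file="$2"
    shift 2
    # Plain cp overwrites the existing destination in place, so it also works
    # when the destination is a mount point that cannot be renamed over.
    cp "$host_file" "$pod_file" || exit 1
    "$@"
)
```

In the pod this would be called with the paths from the reviewed snippet, for example `start_with_synced_resolv /host/etc/resolv.conf /etc/resolv.conf <monitor command>`, where the monitor command is whatever the container normally runs.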