
Bug 1849432: [baremetal] verify resolv.conf in HAProxy static pod synced with host resolv.conf #1872

Merged: 1 commit merged into openshift:master from yboaron:haproxy_resolv_sync on Jul 22, 2020

Conversation

@yboaron (Contributor) commented Jun 25, 2020

In some cases, we noticed that the HAProxy static pod starts running before the NM resolv prepend script [1] is applied; as a result, the pod's resolv.conf file doesn't point to the local CoreDNS instance.

In this case, the HAProxy pod (actually its haproxy-monitor container) will fail to retrieve information from api-int:kube-apiserver (because the local CoreDNS instance is the one that resolves api-int).

With this PR, the resolv.conf used by haproxy-monitor should always be synced with the node's resolv.conf.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/common/baremetal/files/NetworkManager-resolv-prepender.yaml
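For context, a minimal sketch of the idea, assuming the host's /etc is bind-mounted at /host inside the static pod (paths here are assumptions, not copied from the template):

    #!/bin/bash
    # Hypothetical startup step for the haproxy-monitor container: copy the
    # node's resolv.conf into the container, so lookups of api-int go through
    # the local CoreDNS entry that the NM prepender put first on the host.
    cp /host/etc/resolv.conf /etc/resolv.conf

    # A later health check can verify the two files still match; cmp -s exits
    # non-zero on any difference, signalling that the host file changed after
    # the copy was made.
    if ! cmp -s /host/etc/resolv.conf /etc/resolv.conf; then
        echo "container resolv.conf out of sync with host" >&2
        exit 1
    fi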

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 25, 2020
@openshift-ci-robot (Contributor)

@yboaron: This pull request references Bugzilla bug 1849432, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

@yboaron (Contributor, Author) commented Jun 25, 2020

I'll wait for feedback on the suggested fix for baremetal first (and check the e2e-metal-ipi CI results), and after that I'll adapt this change to the other platforms.

@cybertron (Member) left a comment

If we want the container resolv.conf to match the host resolv.conf, wouldn't it be easier to just mount it in the container directly?
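For illustration, a direct mount along those lines might look like this hypothetical pod-spec fragment (container and volume names are assumptions):

    # Bind-mount the host's resolv.conf read-only instead of copying it.
    containers:
    - name: haproxy-monitor
      volumeMounts:
      - name: resolv-conf
        mountPath: /etc/resolv.conf
        readOnly: true
    volumes:
    - name: resolv-conf
      hostPath:
        path: /etc/resolv.conf
        type: File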

@ashcrow ashcrow requested review from runcom and removed request for ashcrow June 29, 2020 15:28

@yboaron (Contributor, Author) commented Jun 30, 2020

/retest

@cybertron (Member) left a comment

/test e2e-metal-ipi

Still hard to believe there isn't a simpler way to do this, but if there is I haven't found it either. As it turns out, even if the bind mount worked it probably wouldn't help here because we replace resolv.conf completely in the dispatcher script, which would break the bind mount.

Just waiting for metal ci to pass before lgtm.
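To illustrate why a whole-file replacement defeats a bind mount (a sketch, not taken from the dispatcher script): a file bind mount pins an inode, not a path, so a rename-style rewrite of the source leaves the mount pointing at stale content.

    #!/bin/bash
    # Run as root. Bind-mount one file over another, then replace the source
    # the way a dispatcher script might: write a temp file and rename it over
    # the original (atomic within one filesystem).
    echo "nameserver 1.1.1.1" > /tmp/source.conf
    touch /tmp/target.conf
    mount --bind /tmp/source.conf /tmp/target.conf

    echo "nameserver 127.0.0.1" > /tmp/source.conf.tmp
    mv /tmp/source.conf.tmp /tmp/source.conf

    cat /tmp/target.conf   # still prints "nameserver 1.1.1.1" (the old inode)
    umount /tmp/target.conf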


@yboaron (Contributor, Author) commented Jul 1, 2020

The NM prepend script is also used by the rest of the platforms (vSphere, oVirt, and OpenStack), so I guess I should update the PR.

@yboaron yboaron force-pushed the haproxy_resolv_sync branch 2 times, most recently from aa05913 to 36b1959, July 1, 2020 08:22
@yboaron (Contributor, Author) commented Jul 1, 2020

/cc @mandre @rgolangh @patrickdillon

- "-c"
- |
#/bin/bash
cp /host/etc/resolv.conf /etc/resolv.conf
@rgolangh (Contributor):

what about this non-atomic copy? that's another race

@yboaron (Contributor, Author):

@rgolangh I don't think an atomic cp is required here; the monitor code runs periodically, so even if it fails once or twice (until the copy completes) it should be OK.

@rgolangh (Contributor):

I'm not worried about the monitor, but about the DNS resolver, which could have resolv.conf bits in flight. I'm quite sure this is what happened with the old prepender.

@yboaron (Contributor, Author):

Do you mean the resolv.conf in the host or in the pod?

Contributor:

better to do it atomically

@yboaron (Contributor, Author):

@celebdor @rgolangh
I tried to change the cp to be atomic (using the same trick as [1]), but the mv command fails with EBUSY because /etc/resolv.conf is a mount point.

Do you have any idea how to make cp atomic?

[1] https://github.com/openshift/machine-config-operator/pull/1763/files
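For reference, the write-then-rename trick from [1] looks roughly like this sketch; the final rename is the step that returns EBUSY when the destination is itself a mount point, as /etc/resolv.conf is inside the container:

    #!/bin/bash
    # Classic atomic update: write a temp file on the same filesystem, then
    # rename(2) it over the destination in a single step.
    tmp=$(mktemp /etc/resolv.conf.XXXXXX)
    cat /host/etc/resolv.conf > "$tmp"
    # rename(2) fails with EBUSY if the target is a mount point, which is
    # exactly the failure described above.
    mv -f "$tmp" /etc/resolv.conf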

@cybertron (Member):

I looked into this a bit more. First, I don't think we can make operations on resolv.conf atomic because it is its own tmpfs filesystem inside the container. Even mv is not atomic across filesystems (it degrades into a cp+unlink).

However, I don't think this is a concern inside the container. As Yossi pointed out, if we get an incomplete resolv.conf from the copy then the liveness probe will immediately fail because of the mismatch with the host resolv.conf. Furthermore, nothing is running in the container at the point where the copy is made so there's no possibility of in-flight DNS resolutions using a bad resolv.conf. The monitor doesn't start until after the file has been copied. The important thing here is that we're not modifying the system-wide resolv.conf, just this one container's, so there's no potential problems with concurrent access to the file.
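The mismatch check mentioned here could be expressed as a probe along these lines (a hypothetical sketch; the actual probe in the template may differ):

    # Fail the container whenever its resolv.conf no longer matches the
    # host's, so kubelet restarts it and the startup copy runs again.
    livenessProbe:
      exec:
        command:
        - /bin/bash
        - -c
        - cmp -s /host/etc/resolv.conf /etc/resolv.conf
      periodSeconds: 10
      failureThreshold: 1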

@bcrochet (Member) left a comment

One issue, one comment.

- "/bin/bash"
- "-c"
- |
#/bin/bash
@bcrochet (Member):

This is just a comment. I think you meant to have #!

- "/bin/bash"
- "-c"
- |
#/bin/bash
@bcrochet (Member):

Ditto on #!

- "/bin/bash"
- "-c"
- |
#/bin/bash
@bcrochet (Member):

Ditto

- "/bin/bash"
- "-c"
- |
#/bin/bash
@bcrochet (Member):

Ditto.

@cybertron (Member)

I wouldn't think you even need to specify the interpreter when you're already calling /bin/bash -c. It should be fine to exclude that completely.

@mandre (Member) commented Jul 1, 2020

> I wouldn't think you even need to specify the interpreter when you're already calling /bin/bash -c. It should be fine to exclude that completely.

Definitely.
The #/bin/bash mistake is copied in a few places in our templates and should probably be cleaned up at some point.
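For illustration, the cleaned-up inline script would simply drop the bogus shebang line (the surrounding keys are assumed, not copied from the template):

    command:
    - "/bin/bash"
    - "-c"
    - |
      cp /host/etc/resolv.conf /etc/resolv.conf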

@mandre (Member) commented Jul 1, 2020

/test e2e-openstack

… resolv.conf

In some cases, we noticed that the HAProxy static pod starts running before the NM resolv prepend script [1] is applied; as a result, the pod's resolv.conf file doesn't point to the local CoreDNS instance.

In this case, the HAProxy pod (actually its haproxy-monitor container) will fail to retrieve information from api-int:kube-apiserver (because the local CoreDNS instance is the one that resolves api-int).

This PR updates the haproxy-monitor container to stay in sync with the node's resolv.conf.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/common/baremetal/files/NetworkManager-resolv-prepender.yaml
@cybertron (Member) left a comment

/retest

Looks like the unnecessary shebangs are gone, and as I noted inline I don't think the resolv.conf race is a concern because only one process is involved. Assuming this passes metal CI (which I expect it will, since it works for me locally), it should be good to go.

- "-c"
- |
#/bin/bash
cp /host/etc/resolv.conf /etc/resolv.conf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a bit more. First, I don't think we can make operations on resolv.conf atomic because it is its own tmpfs filesystem inside the container. Even mv is not atomic across filesystems (it degrades into a cp+unlink).

However, I don't think this is a concern inside the container. As Yossi pointed out, if we get an incomplete resolv.conf from the copy then the liveness probe will immediately fail because of the mismatch with the host resolv.conf. Furthermore, nothing is running in the container at the point where the copy is made so there's no possibility of in-flight DNS resolutions using a bad resolv.conf. The monitor doesn't start until after the file has been copied. The important thing here is that we're not modifying the system-wide resolv.conf, just this one container's, so there's no potential problems with concurrent access to the file.

@rgolangh (Contributor)

/test e2e-ovirt

@yboaron oVirt CI should be green now. Let's see.

@rgolangh (Contributor)

/test e2e-ovirt

@yboaron (Contributor, Author) commented Jul 16, 2020

@patrickdillon from a vSphere perspective, do you think we can merge this fix?

@yboaron (Contributor, Author) commented Jul 16, 2020

/test e2e-gcp-upgrade
/test e2e-vsphere
/test e2e-openstack

@yboaron (Contributor, Author) commented Jul 16, 2020

@mandre Do you think we can remove the hold?

@rgolangh (Contributor)

@yboaron @mandre LGTM for oVirt; no hold from our side.

@mandre (Member) commented Jul 16, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2020
@yboaron (Contributor, Author) commented Jul 20, 2020

/test e2e-gcp-upgrade
/test e2e-vsphere

@patrickdillon (Contributor)

/lgtm
for vsphere

@yboaron (Contributor, Author) commented Jul 20, 2020

@runcom @ericavonb Can we merge this PR? It seems we have a green light from all platforms.

@ericavonb (Contributor)

/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bcrochet, cybertron, ericavonb, patrickdillon, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@kikisdeliveryservice (Contributor)

/skip

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-ci-robot (Contributor) commented Jul 22, 2020

@yboaron: The following tests failed, say /retest to rerun all failed tests:

Test name              Commit   Rerun command
ci/prow/e2e-vsphere    74e5062  /test e2e-vsphere
ci/prow/e2e-aws-proxy  74e5062  /test e2e-aws-proxy


@openshift-merge-robot openshift-merge-robot merged commit 9f13391 into openshift:master Jul 22, 2020
@openshift-ci-robot (Contributor)

@yboaron: Some pull requests linked via external trackers have merged: openshift/machine-config-operator#1872. The following pull requests linked via external trackers have not merged:


@yboaron (Contributor, Author) commented Aug 3, 2020

/cherry-pick release-4.5

@openshift-cherrypick-robot

@yboaron: new pull request created: #1974


Labels: approved, bugzilla/severity-high, bugzilla/valid-bug, lgtm