
OCPBUGS-25753,OCPBUGS-22721: Run resolv-prepender entirely async #4102

Merged

Conversation

cybertron
Member

Currently the resolv-prepender dispatcher script starts the systemd service and then waits for it to complete. This can cause the dispatcher script to time out if the runtimecfg image pull is slow or if resolv.conf does not get populated in a timely fashion (it's not entirely clear to me why the latter happens, but it does). That, in turn, can cause configure-ovs to time out if a large number of interfaces on the system trigger the dispatcher script, such as when many VLANs are configured.

To avoid this, we can stop waiting for the systemd service in the dispatcher script. In fact, there's an argument that we shouldn't wait at all, since we need to be able to handle asynchronous execution anyway for the slow image pull case (which was the entire reason the script was split into a separate service to begin with).
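As a rough illustration of what "not waiting" looks like, the dispatcher script can simply enqueue the start job and return. This is a minimal sketch only; the unit name, the handled actions, and the use of `--no-block` are my assumptions, not the actual diff:

```bash
# Hypothetical sketch of a fire-and-forget start from a NetworkManager
# dispatcher script; the unit name and handled actions are assumptions.
INTERFACE="$1"
ACTION="$2"

case "$ACTION" in
    up|dhcp4-change|dhcp6-change)
        # Enqueue the start job and return immediately instead of blocking
        # until the (potentially slow) runtimecfg image pull completes.
        systemctl start --no-block resolv-prepender.service
        ;;
esac
```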

I have found a few possible issues with async execution, however:

  • If we start the service with an empty $DHCP6_FQDN_FQDN value and then later get a new value for it, we may not correctly apply the new value while the service is still running, because we only ever "systemctl start" the service, which is a no-op if the service is already running (see the short illustration after this list).
  • Similarly, if new IP4/6_DOMAINS values come in on a later connection, they may not be reflected in the service either.
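To make the first issue concrete, here is a tiny illustration (unit name assumed) of why a plain start is not enough once the service is already running:

```bash
# Illustration only: "start" on an already-active unit is a no-op, so the
# service keeps whatever environment values it was launched with.
systemctl start resolv-prepender.service   # starts with the old (empty) values
# ...a later dispatcher event delivers new DHCP6_FQDN_FQDN / IP4/6_DOMAINS values...
systemctl start resolv-prepender.service   # no-op: unit still active, new values never applied
```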

Even though these may sound like the same problem, I mention them separately on purpose because the solutions are different:

  • For the DHCP6 case, we can move that logic back into the dispatcher script so we will always set the hostname no matter what happens with the prepender code. One could argue that this should be in its own script anyway since it's largely unrelated to resolv.conf.

  • For the domains case, we do need to restart the service, since the domains are involved in resolv.conf generation. However, we do not want to restart the service on every event: that may be unnecessary, and restarting in the middle of the image pull could result in a corrupt image (the very thing we were trying to avoid by running this as a service in the first place).

    To avoid restarting the service when we don't want to, I've added logic that only restarts it if there are changed env values AND the runtimecfg image has already been pulled (a minimal sketch of this check follows after this list). This means the worst-case scenario is that we don't properly set the domains and resolv.conf is temporarily generated with an incorrect search line. That should be resolved the next time any event triggers the dispatcher script.
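Here is a minimal sketch of that check, assuming hypothetical variable names, a state file under /run, and podman as the container tool; none of these are necessarily what the real script uses:

```bash
# Hypothetical sketch of the "restart only when safe and useful" logic.
# The state file, variable names, and image reference are illustrative only.
ENV_STATE=/run/resolv-prepender-env
NEW_ENV="IP4_DOMAINS=${IP4_DOMAINS} IP6_DOMAINS=${IP6_DOMAINS}"

env_changed=false
if [ ! -e "$ENV_STATE" ] || [ "$(cat "$ENV_STATE")" != "$NEW_ENV" ]; then
    env_changed=true
    echo "$NEW_ENV" > "$ENV_STATE"
fi

# Restart only if the values changed AND the runtimecfg image is already
# present locally, so an in-progress image pull is never interrupted.
if $env_changed && podman image exists "$RUNTIMECFG_IMAGE"; then
    systemctl restart --no-block resolv-prepender.service
else
    systemctl start --no-block resolv-prepender.service
fi
```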

- What I did

- How to verify it

- Description for the changelog

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Jan 5, 2024
Contributor

openshift-ci bot commented Jan 5, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 5, 2024
@openshift-ci-robot
Contributor

@cybertron: This pull request references Jira Issue OCPBUGS-25753, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

(PR description quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 5, 2024
@cybertron
Member Author

I'm still testing this to confirm it actually fixes the problem, but this version at least seems to work as expected, so I thought I'd get it up for initial review.

Currently the resolv-prepender dispatcher script starts the systemd
service and then waits for it to complete. This can cause the
dispatcher script to time out if the runtimecfg image pull is slow
or if resolv.conf does not get populated in a timely fashion (it's
not entirely clear to me why the latter happens, but it does). This
can cause configure-ovs to time out if there are a large number of
interfaces on the system triggering the dispatcher script, such as
when there are many VLANs configured.

To avoid this, we can stop waiting for the systemd service in the
dispatcher script. In fact, there's an argument that we shouldn't
wait since we need to be able to handle asynchronous execution
anyway for the slow image pull case (which was the entire reason the
script was split into a service the way it is).

I have found a few possible issues with async execution however:
* If we start the service with an empty $DHCP6_FQDN_FQDN value and
  then later get a new value for that, we may not correctly apply
  the new value if the service is still running because we only
  ever "systemd start" the service, which is a noop if the service
  is already running.
* Similarly, if new IP4/6_DOMAINS values come in on a later
  connection that may not be reflected in the service either.

Even though these may sound like the same problem, I mention them
separately on purpose because the solutions are different:
* For the DHCP6 case, we can move that logic back into the dispatcher
  script so we will always set the hostname no matter what happens
  with the prepender code. One could argue that this should be in
  its own script anyway since it's largely unrelated to resolv.conf.
* For the domains case, we do need to restart the service since the
  domains are involved in resolv.conf generation. However, we do not
  want to restart the service every time since that may be unnecessary
  and if we restart in the middle of the image pull it could result
  in a corrupt image (the whole thing we were trying to avoid by
  running this as a service in the first place).

  To avoid problems with restarting the service when we don't want to,
  I've added logic that only restarts the service if there are
  changed env values AND the runtimecfg image has already been pulled.
  This should mean the worst case scenario is that we don't properly
  set the domains and resolv.conf is temporarily generated with an
  incorrect search line. This should be resolved the next time any
  event that triggers the dispatcher script happens.
@cybertron
Member Author

/jira refresh

@openshift-ci-robot
Contributor

@cybertron: This pull request references Jira Issue OCPBUGS-25753, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cybertron cybertron changed the title OCPBUGS-25753: Run resolv-prepender entirely async OCPBUGS-25753,OCPBUGS-22721: Run resolv-prepender entirely async Jan 16, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jan 16, 2024
@openshift-ci-robot
Contributor

@cybertron: This pull request references Jira Issue OCPBUGS-25753, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-22721, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

(PR description quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 16, 2024
@cybertron cybertron marked this pull request as ready for review January 18, 2024 17:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 18, 2024
@cybertron
Member Author

We have confirmation that this fixes the problem in at least one environment, so I think we can move forward with it.

/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-ovn-dualstack

These two jobs are affected by the hostname logic change, so we should confirm that they still pass.

@cybertron
Member Author

/cc @mkowalski

It looks like this should fix the issues we've had with timeouts when a large number of VLANs are configured.

@openshift-ci openshift-ci bot requested a review from mkowalski January 18, 2024 20:55
Contributor

@mkowalski mkowalski left a comment


/lgtm

@cybertron
Member Author

/test e2e-metal-ipi-ovn-ipv6

No changes in the logic, just added quotes around the image references. As long as that didn't somehow break the script completely, this should still be good to go.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2024
@cybertron
Member Author

/retest-required

Unrelated failures. The platform jobs affected by this change have all passed.

@zhaozhanqi

Pre-merge testing of this feature on ipi-vsphere with 40 VLANs on the worker passed.
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jan 23, 2024
@openshift-ci-robot
Contributor

@cybertron: This pull request references Jira Issue OCPBUGS-25753, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

This pull request references Jira Issue OCPBUGS-22721, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

In response to this:

(PR description quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cybertron
Member Author

/assign @cdoern

@cybertron
Member Author

/retest-required
/assign @yuqi-zhang

I think the SNO job should be passing now that we merged the fix.

@yuqi-zhang
Contributor

@cybertron would you say this is a critical fix?

Contributor

openshift-ci bot commented Jan 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, mkowalski, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2024
@cybertron
Member Author

We discussed this elsewhere, but for posterity I think this can wait until the critical fixes label is no longer needed. It doesn't sound like it should be too long until that goes away.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD aa1496f and 2 for PR HEAD 4609fda in total

Contributor

openshift-ci bot commented Jan 31, 2024

@cybertron: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-images | 4609fda | link | false | /test okd-images |
| ci/prow/e2e-gcp-op-layering | 4609fda | link | false | /test e2e-gcp-op-layering |
| ci/prow/okd-scos-e2e-aws-ovn | 4609fda | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 4609fda | link | false | /test e2e-azure-ovn-upgrade-out-of-change |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mkowalski
Contributor

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit 2da0539 into openshift:master Feb 1, 2024
14 of 18 checks passed
@openshift-ci-robot
Contributor

@cybertron: Jira Issue OCPBUGS-25753: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25753 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-22721: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-22721 has been moved to the MODIFIED state.

In response to this:

(PR description quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-proxy-pull-test-container-v4.16.0-202402011241.p0.g2da0539.assembly.stream for distgit openshift-proxy-pull-test.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-02-02-002725

@cybertron
Member Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@cybertron: new pull request created: #4161

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
