Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-23022: Installer should not complain when API and API-INT are resolved. #7701

Merged
merged 1 commit into from
May 13, 2024

Conversation

barbacbd
Copy link
Contributor

@barbacbd barbacbd commented Nov 9, 2023

** Removed the warning that API and/or API-Int cannot be resolved.
** Used the last errors to be alerted instead during failures.
** Used Warning rather than Info to alert users of errors.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 9, 2023
@openshift-ci-robot
Copy link
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-23022, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

** Removed the warning that API and/or API-Int cannot be resolved.
** Used the last errors to be alerted instead during failures.
** Used Warning rather than Info to alert users of errors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@barbacbd
Copy link
Contributor Author

barbacbd commented Nov 9, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 9, 2023
@openshift-ci-robot
Copy link
Contributor

@barbacbd: This pull request references Jira Issue OCPBUGS-23022, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from gpei November 9, 2023 17:12
@patrickdillon
Copy link
Contributor

Awesome: we really need to fix this bug, so it's great to see progress on this front.

That said, I think we need to go a little deeper to solve the problem. There are two things I would recommend checking out:

@wking left some suggestions regarding handling the analysis in the Additional Info section of the bug. It would be good to assess that.

In addition, I have another take as to what may be the problem:

The checks are run as part of the bootkube script:

# Check if the API and API_INT Server URLs can be reached.
echo "Check if API and API-Int URLs are reachable during bootstrap"
check_url "API_URL" "${API_SERVER_URL}"
check_url "API_INT_URL" "${API_INT_SERVER_URL}"

The bootkube script is run as a loop, so these checks will be rerun over and over. What the checks do is curl the api server:
https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootstrap-verify-api-server-urls.sh#L26

One problem with this approach is that it is normal during the bootstrapping process for the apiserver to become periodically unavailable. These restarts are expected and not symptomatic of a problem, but it will cause the check to fail--even if the check has previously succeeded. So during a normal install, you will see a random sampling of PASS FAILs throughout the install.

The intention of the check is just to establish that DNS is available, so a single PASS should be good enough to verify DNS availability, then we don't need to test anymore. The way we generally handle this in bootkube is by wrapping the code to check a filemarker:

So in this case, I'm wondering would it fix the probelm if we wrapped these checks with:

if [ ! -f api-dns-check.done ]
    # Check if the API and API_INT Server URLs can be reached.
    echo "Check if API and API-Int URLs are reachable during bootstrap"

    check_url "API_URL" "${API_SERVER_URL}"
    check_url "API_INT_URL" "${API_INT_SERVER_URL}"
fi

api-dns-check.done would be touched when the checks succeed.

@@ -118,7 +118,6 @@ func checkAPIURLs(a analysis) bool {
}
// Note: Even when there is a stage failure, we are not returning false here. That is
// intentional because we donot want to report this as an error in the "analyze" output.
logrus.Warn("The bootstrap machine is unable to resolve API and/or API-Int Server URLs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see anything in this function that's specific to the API resolution steps, but the function name is still checkAPIURLs. If the goal is to report the last error in the bootkube steps, regardless of whether that is an API-resolution error or not, can we rename the function to checkBootkubeService or something, and adjust the function's Godocs to match? Also I'm unclear about why checkReleaseImageDownload returns false on error while this function returns true. Perhaps there should be a generic checkService function? Or maybe I'm missing the nuance of the Note: Even when there is a stage failure... comment?

Copy link
Contributor

@sadasu sadasu Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. None of the functions in https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootstrap-verify-api-server-urls.sh actually return an error even when the URL is detected as invalid or it is not resolvable. Hence, that comment.
    But, that also means that we will not be able to detect the right time to display the logrus.Warn(). +1 for removing it.

  2. +1 to renaming the function to checkBootkubeService() with corresponding changes to the Godocs. That way, there is no assumptions about the checks happening in the bootkube service.

  3. +1 to have api-dns-check.done so that we run these checks only once. (In the custom-dns case, these checks may not succeed right away. So, we skip this test in that case, or allow multiple checks.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to have api-dns-check.done so that we run these checks only once. (In the custom-dns case, these checks may not succeed right away. So, we skip this test in that case, or allow multiple checks.)

The check would be run on each iteration until it passes, once it passes it would be skipped thereafter. Agreed that it would not work for the api record in custom DNS because it is expected that the record is created post install.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since analyze just looks at the logs generated during bootkube service, this should still work for custom DNS because both API and API-Int entries are created within the CoreDNS pod.
I am not aware of all the reasons why DNS resolution can fluctuate during install (causing this erroneous reporting we are seeing here), but I suspect the same reasons would be in play even when custom dns is enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should still work for custom DNS because both API and API-Int entries are created within the CoreDNS pod.

Ah, that’s right. So we should be good. 👍

I am not aware of all the reasons why DNS resolution can fluctuate during install

it’s not dns resolution that’s failing—the dns check curls the api server which becomes unavailable periodically during an install. Happens during all installs.

@sadasu
Copy link
Contributor

sadasu commented Nov 14, 2023

api-dns-check.done would be touched when the checks succeed.

Once the check is performed once in the bootkube service during bootstrap, will it be run again during analyze?

@patrickdillon
Copy link
Contributor

Once the check is performed once in the bootkube service during bootstrap,

Hopefully this comment clarifies "performed once."

will it be run again during analyze?

No: analyze just parses the logs gathered from the bootstrap node, so the checks wouldn't be run again as part of that.

@barbacbd
Copy link
Contributor Author

barbacbd commented Apr 4, 2024

/retest-required

… resolved.

** Removed the warning that API and/or API-Int cannot be resolved.
** Used the last errors to be alerted instead during failures.
** Used Warning rather than Info to alert users of errors.
@barbacbd
Copy link
Contributor Author

/retest-required

Copy link
Contributor

@sadasu sadasu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes to the checks if API and API-Int URLs are resolvable and reachable need to be run only one during bootstrap. Adding the <url_type>-resolve.done and -dns-check.done ensures that.

Copy link
Contributor

openshift-ci bot commented Apr 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
if [ ! -f api-int-dns-check.done ]; then
echo "Check if API-Int URL is reachable during bootstrap"
if check_url "API_INT_URL" "${API_INT_SERVER_URL}"; then
touch api-int-dns-check.done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this file could be called -reachable.done.

Will not hold PR for this nit.

@bfournie
Copy link
Contributor

bfournie commented May 9, 2024

looks good to me, but will let @patrickdillon have a look before adding label

Copy link
Contributor

@r4f4 r4f4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just introducing status markers in the bootkube process, it seems harmless to merge.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2024
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 3420c22 and 2 for PR HEAD a1f0905 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 18b9aa2 and 1 for PR HEAD a1f0905 in total

@bfournie
Copy link
Contributor

/override ci/prow/shellcheck

Copy link
Contributor

openshift-ci bot commented May 10, 2024

@bfournie: Overrode contexts on behalf of bfournie: ci/prow/shellcheck

In response to this:

/override ci/prow/shellcheck

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD a424a7f and 0 for PR HEAD a1f0905 in total

@openshift-ci-robot
Copy link
Contributor

/hold

Revision a1f0905 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2024
@bfournie
Copy link
Contributor

/unhold
/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2024
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD a424a7f and 2 for PR HEAD a1f0905 in total

@r4f4
Copy link
Contributor

r4f4 commented May 11, 2024

/override ci/prow/shellcheck
Known issue and not related to the changes in this PR

Copy link
Contributor

openshift-ci bot commented May 11, 2024

@r4f4: Overrode contexts on behalf of r4f4: ci/prow/shellcheck

In response to this:

/override ci/prow/shellcheck
Known issue and not related to the changes in this PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7d51e80 and 1 for PR HEAD a1f0905 in total

@barbacbd
Copy link
Contributor Author

/retest-required

Copy link
Contributor

openshift-ci bot commented May 13, 2024

@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-edge-zones-manifest-validation a044f91 link true /test e2e-aws-ovn-edge-zones-manifest-validation
ci/prow/okd-e2e-aws-ovn-upgrade a1f0905 link false /test okd-e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bfournie
Copy link
Contributor

/override ci/prow/shellcheck

Copy link
Contributor

openshift-ci bot commented May 13, 2024

@bfournie: Overrode contexts on behalf of bfournie: ci/prow/shellcheck

In response to this:

/override ci/prow/shellcheck

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 7f7c54c into openshift:master May 13, 2024
24 of 25 checks passed
@openshift-ci-robot
Copy link
Contributor

@barbacbd: Jira Issue OCPBUGS-23022: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-23022 has been moved to the MODIFIED state.

In response to this:

** Removed the warning that API and/or API-Int cannot be resolved.
** Used the last errors to be alerted instead during failures.
** Used Warning rather than Info to alert users of errors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants