-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-23022: Installer should not complain when API and API-INT are resolved. #7701
Conversation
@barbacbd: This pull request references Jira Issue OCPBUGS-23022, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@barbacbd: This pull request references Jira Issue OCPBUGS-23022, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Awesome: we really need to fix this bug, so it's great to see progress on this front. That said, I think we need to go a little deeper to solve the problem. There are two things I would recommend checking out: @wking left some suggestions regarding handling the analysis in the In addition, I have another take as to what may be the problem: The checks are run as part of the bootkube script: installer/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template Lines 510 to 514 in f87f13a
The bootkube script is run as a loop, so these checks will be rerun over and over. What the checks do is One problem with this approach is that it is normal during the bootstrapping process for the apiserver to become periodically unavailable. These restarts are expected and not symptomatic of a problem, but it will cause the check to fail--even if the check has previously succeeded. So during a normal install, you will see a random sampling of PASS FAILs throughout the install. The intention of the check is just to establish that DNS is available, so a single PASS should be good enough to verify DNS availability, then we don't need to test anymore. The way we generally handle this in bootkube is by wrapping the code to check a filemarker:
So in this case, I'm wondering would it fix the probelm if we wrapped these checks with: if [ ! -f api-dns-check.done ]
# Check if the API and API_INT Server URLs can be reached.
echo "Check if API and API-Int URLs are reachable during bootstrap"
check_url "API_URL" "${API_SERVER_URL}"
check_url "API_INT_URL" "${API_INT_SERVER_URL}"
fi
|
@@ -118,7 +118,6 @@ func checkAPIURLs(a analysis) bool { | |||
} | |||
// Note: Even when there is a stage failure, we are not returning false here. That is | |||
// intentional because we donot want to report this as an error in the "analyze" output. | |||
logrus.Warn("The bootstrap machine is unable to resolve API and/or API-Int Server URLs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't see anything in this function that's specific to the API resolution steps, but the function name is still checkAPIURLs
. If the goal is to report the last error in the bootkube
steps, regardless of whether that is an API-resolution error or not, can we rename the function to checkBootkubeService
or something, and adjust the function's Godocs to match? Also I'm unclear about why checkReleaseImageDownload
returns false
on error while this function returns true
. Perhaps there should be a generic checkService
function? Or maybe I'm missing the nuance of the Note: Even when there is a stage failure...
comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
None of the functions in https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootstrap-verify-api-server-urls.sh actually return an error even when the URL is detected as invalid or it is not resolvable. Hence, that comment.
But, that also means that we will not be able to detect the right time to display the logrus.Warn(). +1 for removing it. -
+1 to renaming the function to checkBootkubeService() with corresponding changes to the Godocs. That way, there is no assumptions about the checks happening in the bootkube service.
-
+1 to have
api-dns-check.done
so that we run these checks only once. (In the custom-dns case, these checks may not succeed right away. So, we skip this test in that case, or allow multiple checks.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to have api-dns-check.done so that we run these checks only once. (In the custom-dns case, these checks may not succeed right away. So, we skip this test in that case, or allow multiple checks.)
The check would be run on each iteration until it passes, once it passes it would be skipped thereafter. Agreed that it would not work for the api
record in custom DNS because it is expected that the record is created post install.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since analyze just looks at the logs generated during bootkube service, this should still work for custom DNS because both API and API-Int entries are created within the CoreDNS pod.
I am not aware of all the reasons why DNS resolution can fluctuate during install (causing this erroneous reporting we are seeing here), but I suspect the same reasons would be in play even when custom dns is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should still work for custom DNS because both API and API-Int entries are created within the CoreDNS pod.
Ah, that’s right. So we should be good. 👍
I am not aware of all the reasons why DNS resolution can fluctuate during install
it’s not dns resolution that’s failing—the dns check curls the api server which becomes unavailable periodically during an install. Happens during all installs.
Once the check is performed once in the bootkube service during bootstrap, will it be run again during analyze? |
Hopefully this comment clarifies "performed once."
No: analyze just parses the logs gathered from the bootstrap node, so the checks wouldn't be run again as part of that. |
6b6b517
to
a044f91
Compare
5a45722
to
f2de33e
Compare
/retest-required |
… resolved. ** Removed the warning that API and/or API-Int cannot be resolved. ** Used the last errors to be alerted instead during failures. ** Used Warning rather than Info to alert users of errors.
/retest-required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes to the checks if API and API-Int URLs are resolvable and reachable need to be run only one during bootstrap. Adding the <url_type>-resolve.done and -dns-check.done ensures that.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sadasu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
if [ ! -f api-int-dns-check.done ]; then | ||
echo "Check if API-Int URL is reachable during bootstrap" | ||
if check_url "API_INT_URL" "${API_INT_SERVER_URL}"; then | ||
touch api-int-dns-check.done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this file could be called -reachable.done.
Will not hold PR for this nit.
looks good to me, but will let @patrickdillon have a look before adding label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is just introducing status markers in the bootkube process, it seems harmless to merge.
/lgtm
/override ci/prow/shellcheck |
@bfournie: Overrode contexts on behalf of bfournie: ci/prow/shellcheck In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/hold Revision a1f0905 was retested 3 times: holding |
/unhold |
/override ci/prow/shellcheck |
@r4f4: Overrode contexts on behalf of r4f4: ci/prow/shellcheck In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/retest-required |
@barbacbd: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/override ci/prow/shellcheck |
@bfournie: Overrode contexts on behalf of bfournie: ci/prow/shellcheck In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
7f7c54c
into
openshift:master
@barbacbd: Jira Issue OCPBUGS-23022: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-23022 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
** Removed the warning that API and/or API-Int cannot be resolved.
** Used the last errors to be alerted instead during failures.
** Used Warning rather than Info to alert users of errors.