AGENT-455: Check registry and rendezvous host access at startup #6767
Users would like the ability to change their network configuration at the console if network connectivity problems are detected. To achieve this goal, this patch adds a new service called agent-interactive-console.service to block the login prompt and the agent services that pull an image from the registry. The service will execute the agent TUI to allow users to update their network configuration. The TUI will check that there is connectivity to the registry and to the rendezvous host. If the connectivity checks pass, the TUI exits, which also leads the interactive console service to exit. This unblocks the login prompt and the agent services waiting to pull from the registry, allowing the agent-based installer to proceed.

The agent TUI will be added in a future patch. For now, the service executes a script that logs its presence, sleeps for 60 seconds, and exits. This should not block the automated flow.

Added ConditionPathExists=/usr/local/bin/agent-tui, which means the service neither starts nor becomes active until the agent-tui binary is present at that path.

Most of the service definition was lifted from celebdor's POC: openshift#6560

Signed-off-by: Richard Su <rwsu@redhat.com>
The interactive console service will not start or be active until the path exists.
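For readers following along, the placeholder behavior described in the commit message (log, sleep, exit) could look roughly like the sketch below. This is not the script from the patch: the function name `run_console_placeholder` and the `PLACEHOLDER_SLEEP_SECONDS` override are assumptions made for illustration, with the default taken from the 60 seconds mentioned above.

```shell
#!/bin/bash
# Minimal sketch of a placeholder for the interactive console step.
# run_console_placeholder and PLACEHOLDER_SLEEP_SECONDS are hypothetical
# names, not taken from the PR. The default delay mirrors the 60 seconds
# stated in the commit message.

run_console_placeholder() {
    local delay="${PLACEHOLDER_SLEEP_SECONDS:-60}"

    # Log to stderr so the message lands in the journal when run by systemd.
    >&2 echo "INFO: agent-interactive-console placeholder started; sleeping ${delay}s."
    sleep "$delay"
    >&2 echo "INFO: agent-interactive-console placeholder finished."

    # Always succeed so the automated flow is never blocked by the placeholder.
    return 0
}
```

Calling `run_console_placeholder` with no override waits the full 60 seconds; setting `PLACEHOLDER_SLEEP_SECONDS=0` is handy when trying the flow out locally.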
/cc @andfasano @bfournie
}

function has_rendezvous_host_connectivity() {
    if (>&2 ping -c 4 "$NODE_ZERO_IP"); then
Q: Isn't this expected to be running on the node zero host as well? I guess this check should run only on non-rendezvous hosts, even though it should be harmless when running on the rendezvous host.
Agree, it's a bit confusing to see "Successfully pinged rendezvous host ..." when on the rendezvous host.
Looks good. I tested the normal, success path and I also changed the node's IP to force a ping failure and verified it entered the path to call the agent-tui.
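One way to address the concern raised in this thread is to skip the ping when the script is already running on the rendezvous host. The sketch below is only an illustration of that idea, not the code that was merged: `is_rendezvous_host`, the `LOCAL_ADDRS` override, the use of `hostname -I` to list local addresses, and the exact log wording are all assumptions.

```shell
#!/bin/bash
# Sketch only: is_rendezvous_host, LOCAL_ADDRS, and the log messages are
# hypothetical and not taken from the PR.

# Return 0 if the first argument (the rendezvous IP) matches any of the
# remaining arguments (addresses configured on this host).
is_rendezvous_host() {
    local target="$1"
    shift
    local addr
    for addr in "$@"; do
        [[ "$addr" == "$target" ]] && return 0
    done
    return 1
}

function has_rendezvous_host_connectivity() {
    local local_addrs
    # LOCAL_ADDRS lets a caller inject the address list; otherwise fall back
    # to `hostname -I`, which prints the host's configured addresses on Linux.
    read -r -a local_addrs <<< "${LOCAL_ADDRS:-$(hostname -I 2>/dev/null)}"

    if is_rendezvous_host "$NODE_ZERO_IP" "${local_addrs[@]}"; then
        >&2 echo "INFO: This host is the rendezvous host; skipping ping."
        return 0
    fi

    if (>&2 ping -c 4 "$NODE_ZERO_IP"); then
        >&2 echo "INFO: Successfully pinged rendezvous host at $NODE_ZERO_IP."
        return 0
    fi

    >&2 echo "WARNING: Unable to ping rendezvous host at $NODE_ZERO_IP."
    return 1
}
```

Keeping the address comparison in its own function means the same check can be shared by any other script that needs to know whether it is running on node zero.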
(>&2 podman pull "$RELEASE_IMAGE")
registry_check_status_code=$?
if [[ "$registry_check_status_code" -ne 0 ]]; then
    (>&2 echo "WARNING: Unable to pull release image.")
A possible (likely?) issue when failing to pull the release image in a disconnected environment is misconfiguration of the ICSP info in install-config. Since we can't change this in the tui it would be useful to log some additional info about that.
I guess we should plan more sophisticated checks in agent-tui only. There was a discussion to keep these initial checks in bash for the moment, but to gradually move them into the agent-tui code.
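Until such checks move into agent-tui, the extra logging suggested above could be as small as one more line on the failure path. The following is a rough sketch under that assumption: the structure follows the hunk under review, but the HINT text, the success message, and the return codes are illustrative and not part of the PR.

```shell
#!/bin/bash
# Sketch only: the HINT wording, the success message, and the return codes
# are assumptions; the pull and status check mirror the hunk under review.

function has_registry_connectivity() {
    >&2 echo "INFO: Checking that OpenShift release image at $RELEASE_IMAGE is pullable."

    local registry_check_status_code
    (>&2 podman pull "$RELEASE_IMAGE")
    registry_check_status_code=$?

    if [[ "$registry_check_status_code" -ne 0 ]]; then
        >&2 echo "WARNING: Unable to pull release image."
        # Point at a cause that cannot be fixed from the TUI.
        >&2 echo "HINT: In a disconnected environment, verify the ICSP (mirror) settings in install-config."
        return 1
    fi

    >&2 echo "INFO: Release image is pullable."
    return 0
}
```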
source /etc/assisted/agent-installer.env

function has_registry_connectivity() {
    (>&2 echo "INFO: Checking OpenShift release image at $RELEASE_IMAGE is pullable.")
Nit - maybe "Checking that OpenShift release ..."
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andfasano The full list of commands accepted by this bot can be found here. The pull request process is described here
Create a .path to track existence of the agent-tui. If the agent-tui binary exists at the specified path, PathExists, then the unit agent-interactive-console.service will be enabled. If the agent-tui binary does not exist, then agent-interactive-console.service is not enabled.
Update the interactive console service to have it check that the current host can access the release image and the rendezvous host. Display this connectivity information to the user before prompting to run the network configuration tui.

If there is connectivity, then the prompt times out after 60s and the automation flow is allowed to continue.

If there is a connectivity issue, then the network configuration tui will be executed to allow users to update their configuration. The tui, however, has not been integrated with the interactive console service yet, so in the meantime there is a 60s sleep in place of executing the tui. This will be changed in a future patch.

Depends on: openshift#6756

Signed-off-by: Richard Su <rwsu@redhat.com>
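The decision flow in this commit message can be sketched as below. It is a simplified illustration rather than the script from the patch: `console_flow`, the `PROMPT_TIMEOUT_SECONDS` override, the prompt text, and the `ACTION:` output lines are all hypothetical, and the two connectivity checks are passed in as command names so the sketch stays self-contained.

```shell
#!/bin/bash
# Sketch only: console_flow, PROMPT_TIMEOUT_SECONDS, and the ACTION output
# format are hypothetical names used for illustration.
#
# $1: command that succeeds when the release image is reachable
# $2: command that succeeds when the rendezvous host is reachable
console_flow() {
    local registry_check="$1" rendezvous_check="$2"
    local timeout="${PROMPT_TIMEOUT_SECONDS:-60}"
    local connectivity_ok=true
    local answer=""

    # Show the connectivity status to the user before prompting.
    if "$registry_check"; then
        echo "Release image: reachable"
    else
        echo "Release image: NOT reachable"
        connectivity_ok=false
    fi
    if "$rendezvous_check"; then
        echo "Rendezvous host: reachable"
    else
        echo "Rendezvous host: NOT reachable"
        connectivity_ok=false
    fi

    if [[ "$connectivity_ok" == true ]]; then
        # read -t fails on timeout (or EOF), in which case automation continues.
        if read -r -t "$timeout" -p "Configure network? [y/N] " answer \
            && [[ "$answer" == [yY]* ]]; then
            echo "ACTION: configure-network"
        else
            echo "ACTION: continue"
        fi
    else
        # With a connectivity problem, go straight to network configuration.
        echo "ACTION: configure-network"
    fi
}
```

For example, `console_flow has_registry_connectivity has_rendezvous_host_connectivity` would wire in the two checks from this PR, with the caller launching the tui (or, for now, the 60s sleep) when the action is `configure-network`.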
Refactor the node zero IP check into its own function so that it can be reused by both set-node-zero.sh and agent-interactive-console.sh. Fix indenting in agent-interactive-console.sh.
/lgtm
/retest-required
/hold Revision 157fd0d was retested 3 times: holding
/retest-required
/unhold
@rwsu: The following tests failed, say