AGENT-455: Check registry and rendezvous host access at startup #6767

Merged: 7 commits into openshift:master on Jan 27, 2023

@rwsu rwsu commented Jan 12, 2023

Update the interactive console service so that it checks whether the
current host can access the release image and the rendezvous host.
Display this connectivity information to the user before prompting to
run the network configuration TUI.

If there is connectivity, the prompt times out after 60s and
the automation flow is allowed to continue.

If there is a connectivity issue, the network configuration
TUI will be executed to allow users to update their network
configuration. The TUI, however, has not yet been integrated with the
interactive console service, so in the meantime there is a 60s sleep
in place of executing the TUI. This will be changed in a future patch.

Depends on: #6756
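
The flow described above can be sketched in bash roughly as follows. This is a minimal illustration, not the script from this PR: the function names `decide_next_step` and `prompt_with_timeout`, the `PROMPT_TIMEOUT_SECONDS` variable, and the prompt wording are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the startup flow; names are illustrative assumptions.

PROMPT_TIMEOUT_SECONDS="${PROMPT_TIMEOUT_SECONDS:-60}"

# decide_next_step REGISTRY_OK RENDEZVOUS_OK
# Prints "prompt" when both checks passed (offer the TUI, then time out),
# or "configure" when either check failed (go to network configuration).
decide_next_step() {
    local registry_ok="$1" rendezvous_ok="$2"
    if [[ "$registry_ok" == "yes" && "$rendezvous_ok" == "yes" ]]; then
        echo "prompt"
    else
        echo "configure"
    fi
}

# prompt_with_timeout
# Returns 0 if the user answers y/Y before the timeout expires,
# and 1 otherwise (any other answer, timeout, or end of input).
prompt_with_timeout() {
    local answer=""
    if read -r -t "$PROMPT_TIMEOUT_SECONDS" -p "Configure network? [y/N] " answer; then
        [[ "$answer" == [yY]* ]]
    else
        return 1
    fi
}
```

With both checks passing, a caller would run `prompt_with_timeout` and continue the automated flow when it returns non-zero. With either check failing, it would go straight to network configuration (currently the 60s sleep).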

Users would like the ability to change their network
configuration at the console if network connectivity problems
are detected.

To achieve this goal, this patch adds a new service called
agent-interactive-console.service to block the login prompt
and the agent services that pull an image from the registry.

The service will execute the agent TUI to allow users to update
their network configuration. The TUI will check that there is
connectivity to the registry and to the rendezvous host. If
the connectivity checks pass, the TUI exits, which also causes
the interactive console service to exit. This unblocks the
login prompt and the agent services waiting to pull from the registry,
allowing the agent-based installer to proceed.

The agent TUI will be added in a future patch.

For now, the service executes a script that logs its
presence, sleeps for 60 seconds, and exits. This should not
block the automated flow.

Added ConditionPathExists=/usr/local/bin/agent-tui, which means
the service neither starts nor becomes active until the agent-tui
binary is present at that path.

Most of the service definition was lifted from celebdor's POC:
openshift#6560

Signed-off-by: Richard Su <rwsu@redhat.com>
The interactive console service will not start or be active until
the path exists.

rwsu commented Jan 13, 2023

/cc @andfasano @bfournie

}

function has_rendezvous_host_connectivity() {
    if (>&2 ping -c 4 "$NODE_ZERO_IP"); then

@andfasano andfasano Jan 13, 2023

Q: Isn't this expected to be running on the node zero host? I guess this check should run on non-rendezvous hosts, even though it should be harmless when running on the rendezvous host.

Agree, it's a bit confusing to see "Successfully pinged rendezvous host ..." when on the rendezvous host.

@bfournie bfournie left a comment

Looks good. I tested the normal, success path and I also changed the node's IP to force a ping failure and verified it entered the path to call the agent-tui.

    (>&2 podman pull "$RELEASE_IMAGE")
    registry_check_status_code=$?
    if [[ "$registry_check_status_code" -ne 0 ]]; then
        (>&2 echo "WARNING: Unable to pull release image.")

A possible (likely?) issue when failing to pull the release image in a disconnected environment is misconfiguration of the ICSP info in install-config. Since we can't change this in the TUI, it would be useful to log some additional info about that.

I guess we should plan more sophisticated checks in agent-tui only. There was a discussion to keep these initial checks in bash here for the moment, but to gradually move them into the agent-tui code.
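
For reference, the registry check discussed in this thread could look roughly like the sketch below, assembled from the fragments quoted in the review. It is an illustration only: the `PULL_CMD` override, the placeholder default for `RELEASE_IMAGE`, the success message, and the extra `HINT` line about mirror configuration are assumptions that follow the reviewer's suggestion, not code from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch based on the quoted fragments; not the PR's script.
# PULL_CMD and the HINT message are assumptions added for illustration.

RELEASE_IMAGE="${RELEASE_IMAGE:-registry.example.com/ocp/release:example}"
PULL_CMD="${PULL_CMD:-podman pull}"

has_registry_connectivity() {
    local status
    >&2 echo "INFO: Checking that OpenShift release image at $RELEASE_IMAGE is pullable."
    # Pull output goes to stderr so stdout stays clean for callers.
    # PULL_CMD is intentionally unquoted so "podman pull" splits into words.
    >&2 $PULL_CMD "$RELEASE_IMAGE"
    status=$?
    if [[ "$status" -ne 0 ]]; then
        >&2 echo "WARNING: Unable to pull release image."
        >&2 echo "HINT: In a disconnected environment, verify the registry mirror (ICSP) settings in install-config."
        return 1
    fi
    >&2 echo "INFO: Release image is pullable."
    return 0
}
```

Making the pull command overridable lets the logic be exercised without a registry, for example with `PULL_CMD=false has_registry_connectivity` to force the failure path.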

source /etc/assisted/agent-installer.env

function has_registry_connectivity() {
    (>&2 echo "INFO: Checking OpenShift release image at $RELEASE_IMAGE is pullable.")

Nit - maybe "Checking that OpenShift release ..."

@andfasano

/approve

openshift-ci bot commented Jan 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2023
Create a .path unit to track the existence of the agent-tui binary.

If the agent-tui binary exists at the path specified by PathExists,
then the unit agent-interactive-console.service will be enabled.

If the agent-tui binary does not exist, then
agent-interactive-console.service is not enabled.
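
As an illustration of how a .path unit and the service condition fit together, the sketch below writes two example unit files into a scratch directory. The unit contents are assumptions for the example: only the unit names, `PathExists`, `ConditionPathExists`, and the /usr/local/bin/agent-tui path come from this PR's description, while the descriptions, `WantedBy` target, service type, and `ExecStart` path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: write example unit files into a scratch directory.
# Unit contents are illustrative assumptions, not the files from this PR.
set -euo pipefail

UNIT_DIR="$(mktemp -d)"

# A .path unit that activates the service once the binary is present.
cat > "$UNIT_DIR/agent-interactive-console.path" <<'EOF'
[Unit]
Description=Watch for the agent-tui binary (example)

[Path]
PathExists=/usr/local/bin/agent-tui
Unit=agent-interactive-console.service

[Install]
WantedBy=multi-user.target
EOF

# The service can additionally guard on the same path (assumed ExecStart).
cat > "$UNIT_DIR/agent-interactive-console.service" <<'EOF'
[Unit]
Description=Agent interactive console (example)
ConditionPathExists=/usr/local/bin/agent-tui

[Service]
Type=oneshot
ExecStart=/usr/local/bin/agent-interactive-console.sh
EOF

echo "Wrote example units to $UNIT_DIR"
```

Writing to a temporary directory keeps the example safe to run on any machine. Nothing is installed or enabled.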
Refactor the node zero IP check into its own function so that it
can be reused by both set-node-zero.sh and
agent-interactive-console.sh.

Fix indenting in agent-interactive-console.sh.
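
A shared node zero check of the kind described here, combined with the review suggestion to avoid pinging the rendezvous host from itself, might be sketched as follows. The function names, the `LOCAL_ADDRESSES` override, and the use of `hostname -I` are assumptions for illustration and may differ from the helper actually added in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and the LOCAL_ADDRESSES override are assumptions.

# list_local_addresses: print one local IP address per line.
# Honors LOCAL_ADDRESSES (space-separated) so the logic can be tested offline.
list_local_addresses() {
    if [[ -n "${LOCAL_ADDRESSES:-}" ]]; then
        tr ' ' '\n' <<< "$LOCAL_ADDRESSES"
    else
        hostname -I | tr ' ' '\n'
    fi
}

# is_node_zero IP: succeed when IP is assigned to this host.
# Uses an exact string match, so differently formatted IPv6 addresses
# for the same host would not compare equal.
is_node_zero() {
    local ip="$1"
    list_local_addresses | grep -Fxq -- "$ip"
}

# check_rendezvous_host IP: skip the ping when this host is node zero.
check_rendezvous_host() {
    local ip="$1"
    if is_node_zero "$ip"; then
        >&2 echo "INFO: This host is the rendezvous host; skipping ping."
        return 0
    fi
    >&2 ping -c 4 "$ip"
}
```

Keeping the comparison in one function means both callers agree on what counts as "this host is node zero".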
@bfournie

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 20, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 26ce7ce and 2 for PR HEAD 157fd0d in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 950ed9f and 1 for PR HEAD 157fd0d in total

rwsu commented Jan 23, 2023

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 8c032f1 and 0 for PR HEAD 157fd0d in total

@openshift-ci-robot

/hold

Revision 157fd0d was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2023

rwsu commented Jan 26, 2023

/retest-required

rwsu commented Jan 27, 2023

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2023

openshift-ci bot commented Jan 27, 2023

@rwsu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-disruptive | 157fd0d | link | false | /test e2e-aws-ovn-disruptive |
| ci/prow/e2e-agent-sno-ipv6 | 157fd0d | link | false | /test e2e-agent-sno-ipv6 |
| ci/prow/e2e-vsphere-upi-zones | 157fd0d | link | false | /test e2e-vsphere-upi-zones |

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD fca4137 and 2 for PR HEAD 157fd0d in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5ad7ab9 and 1 for PR HEAD 157fd0d in total

@openshift-merge-robot openshift-merge-robot merged commit 601963c into openshift:master Jan 27, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

5 participants