AGENT-864: New make target to add day2 node with agent #1654

Open · rwsu wants to merge 3 commits into master
Conversation

rwsu (Contributor) commented Apr 20, 2024

The new target is named "agent_add_node".

Requires at least one extra worker node to be defined.

export NUM_EXTRA_WORKERS=1

The extra worker disk must be at least 100 GB to pass assisted-service validations.

export EXTRA_WORKER_DISK=100

If multiple extraworker nodes were provisioned, you can specify which node to add to the cluster through a new
AGENT_EXTRAWORKER_NODE_TO_ADD environment variable.

For example, if your cluster was created with NUM_EXTRA_WORKERS=2, you can add both nodes to the cluster using these commands:
AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node

The target calls "node-joiner add-nodes" to create the add-nodes ISO. It then attaches the ISO to extraworker-0 and boots it. Finally, it calls "node-joiner monitor-add-nodes" to monitor progress (a sketch of this flow follows below). In the background, dev-scripts approves any CSRs so that the node joins the cluster and becomes Ready.
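
For reference, a rough sketch of the flow the target automates. This is illustrative only: the node-joiner flags, the libvirt device/VM names, and the ISO path are assumptions, not the PR's literal code.

# Generate the add-nodes ISO (flags assumed for illustration):
node-joiner add-nodes --dir "${OCP_DIR}/add-node" --kubeconfig "${KUBECONFIG}"

# Attach the generated ISO to the extra worker VM and boot it (libvirt example):
sudo virsh change-media "${CLUSTER_NAME}_extraworker_0" sda "${OCP_DIR}/add-node/node.iso" --insert --config
sudo virsh start "${CLUSTER_NAME}_extraworker_0"

# Monitor the join while CSRs are approved in the background:
node-joiner monitor-add-nodes "${EXTRAWORKER_IP}" --dir "${OCP_DIR}/add-node" --kubeconfig "${KUBECONFIG}"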

Depends on: openshift/installer#8171

openshift-ci bot commented Apr 20, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekhiggins for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rwsu (Contributor, Author) commented Apr 20, 2024

openshift-ci bot requested review from andfasano, bfournie and pawanpinjarkar and removed request for cybertron and lranjbar on April 20, 2024 02:09
rwsu (Contributor, Author) commented Apr 22, 2024

/hold

Cannot add the --auto-approve-csrs flag to the monitor-add-nodes command. Need to approve CSRs within dev-scripts (see the sketch below).
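
For context, the common OpenShift idiom for approving pending CSRs from a script looks roughly like this (a sketch, not the PR's code; the go-template filter selects CSRs that have no status yet):

while true; do
  # Approve any CSRs that have not been acted on yet.
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
  sleep 30
done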

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 22, 2024
openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 23, 2024
The new target is named "agent_add_node".

Requires at least one extra worker node to be defined.

export NUM_EXTRA_WORKERS=1

Currently only one extra worker node is added to the
cluster even if NUM_EXTRA_WORKERS > 1.

Extra worker disk must be at least 100GB to pass assisted
service validations.

export EXTRA_WORKER_DISK=100

Depends on the release image having the node-joiner binary.
node-joiner is included in the baremetal-installer image.

The target calls "node-joiner add-nodes" to create the
addnodes ISO. It then attaches the ISO to extraworker-0
and boots it. It finally calls "node-joiner monitor-add-nodes"
to monitor progress. In the background, dev-scripts approves
any CSRs so that the node joins the cluster and becomes
Ready.
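
As a side note, one way to obtain the node-joiner binary from a release payload would be roughly the following sketch; the in-image path /usr/bin/node-joiner and the release-image variable name are assumptions:

# Extract node-joiner from the baremetal-installer image of the release payload.
image=$(oc adm release info --image-for=baremetal-installer "${OPENSHIFT_RELEASE_IMAGE}")
oc image extract "${image}" --path /usr/bin/node-joiner:"${OCP_DIR}"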
openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 1, 2024
rwsu (Contributor, Author) commented May 2, 2024

/unhold

openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 2, 2024
@@ -33,6 +33,9 @@ agent_gather:
agent_tests:
./agent/agent_tests.sh

agent_add_node:
Member commented:

I like the idea of having a separate step, since it will be a useful way to develop/test the node-joiner locally. From a high-level point of view, we should aim to support two different scenarios:

  1. Dev/Local testing
    In this scenario, the dev first runs a make agent to deploy a cluster, then a make agent_add_node multiple times while developing. For this reason, the step should:
  • If there are extra workers already added as nodes, drain/cordon them
  • Stop the extra worker VMs and clean up their disks
  • Rebuild the baremetal-installer image (step 4) and update it in the local registry
  • Re-run the portion of code that adds the extraworker
  2. CI jobs
    From a CI point of view, we should have a way to invoke the agent_add_node step when required. We could either override the DEVSCRIPTS_TARGET in an agent workflow (i.e. here), or always add this step in the make agent and exclude it via a config var. Personally I'd be more inclined toward the former option (see the sketch after this comment).

Now, since the reference card is targeting the second scenario (CI jobs), and since the first one may require additional work not strictly needed within this sprint, I think it would be better to have a new card for scenario 1 in the carry-over epic AGENT-896 and limit this one to just the second scenario. This should also help simplify the code.
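
A rough sketch of the former option; whether the CI step consumes DEVSCRIPTS_TARGET exactly this way is an assumption:

# In the agent workflow's step, override the default target.
export DEVSCRIPTS_TARGET=agent_add_node
make "${DEVSCRIPTS_TARGET}"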


early_deploy_validation

function build_node_joiner() {
Member commented:

As per the previous comment, the build function for the node-joiner can be removed for now and addressed in a separate card.

Note for future development: the build_local_release function in step 4 already contains the logic for properly building/pushing a local image; it would be useful to extract it and reuse it when a single image update is required.

}

function approve_csrs() {
while true; do
Member commented:

Since this will be run in a CI job, it would also be useful to have a timeout condition (i.e. 10 mins), to avoid relying on the CI timeout (usually 4 h) in case of an issue during the approval. A sketch follows after this snippet.

done
}
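
A minimal sketch of the suggested timeout, assuming a 10-minute budget and a 30-second poll interval (both illustrative):

# Bound the approval loop instead of relying on the 4h CI job timeout.
deadline=$(( $(date +%s) + 600 ))
while (( $(date +%s) < deadline )); do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
  sleep 30
done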

if [ -z "$KNI_INSTALL_FROM_GIT" ]; then
Member commented:

As suggested before, we don't need this part now for the local rebuild, so it could be removed.

cp "$OPENSHIFT_INSTALL_PATH/bin/node-joiner" "$OCP_DIR"
fi

rm "$OCP_DIR/add-node/.openshift_install_state.json" || true
Member commented:

This should never be required. When running transient commands, the ${WORKING_DIR} should be used as a base dir (and cleaned up at the end)
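
A sketch of that pattern (directory name illustrative):

# Keep transient node-joiner output under ${WORKING_DIR} and clean it up on exit.
add_node_dir=$(mktemp -d --tmpdir="${WORKING_DIR}" add-node.XXXXXX)
trap 'rm -rf "${add_node_dir}"' EXIT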

if [[ "$AGENT_STATIC_IP_NODE0_ONLY" = "true" ]]; then
static_ips=1
else
static_ips=$NUM_MASTERS+$NUM_WORKERS
Member commented:

Suggested change:
-    static_ips=$NUM_MASTERS+$NUM_WORKERS
+    static_ips=$NUM_MASTERS+$NUM_WORKERS+$NUM_EXTRA_WORKERS
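
Worth noting for either variant: bash only evaluates this as arithmetic if the variable is later used in an arithmetic context; otherwise the assignment stores the literal string (e.g. "3+2"). If a plain number is needed, a sketch:

# Force numeric evaluation rather than string concatenation.
static_ips=$(( NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS ))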

Member commented:

See comment below

else
worker_num=$((${i}-$NUM_MASTERS))
AGENT_NODES_HOSTNAMES+=($(printf ${WORKER_HOSTNAME_FORMAT} ${worker_num}))
cluster_name=${CLUSTER_NAME}_worker_${worker_num}
Member commented:

Another point for managing the extra worker.

fi
done

AGENT_EXTRA_WORKERS_IPS=()
Member commented:

That's really a lot of duplication that I'd prefer not to have. An extra worker is really just like an ordinary node, only not effectively deployed, and it should be managed in an agnostic way in the code where possible.

The previous suggestions are meant to accommodate that vision by incorporating the extra workers in the loop that is responsible for managing the nodes (see the sketch below). In general I'm not against having dedicated AGENT_EXTRA_WORKERS_* vars, in addition to the AGENT_NODES_* ones, since I understand it may be useful to discriminate between the two sets for the sake of template generation, but without the code duplication.
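
A sketch of what folding the extra workers into the existing loop could look like; the hostname formats and variable names here are illustrative, not the PR's code:

total_nodes=$(( NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS ))
for (( i=0; i<total_nodes; i++ )); do
  if (( i < NUM_MASTERS )); then
    AGENT_NODES_HOSTNAMES+=( "$(printf "${MASTER_HOSTNAME_FORMAT}" "${i}")" )
  elif (( i < NUM_MASTERS + NUM_WORKERS )); then
    worker_num=$(( i - NUM_MASTERS ))
    AGENT_NODES_HOSTNAMES+=( "$(printf "${WORKER_HOSTNAME_FORMAT}" "${worker_num}")" )
  else
    # Extra workers ride the same loop; only the hostname format differs.
    extra_num=$(( i - NUM_MASTERS - NUM_WORKERS ))
    AGENT_NODES_HOSTNAMES+=( "$(printf "extraworker-%d" "${extra_num}")" )
  fi
done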

# External load balancer configuration.
# The following ports are opened in firewalld so that libvirt VMs can communicate with haproxy.
export MACHINE_CONFIG_SERVER_PORT=22623
export KUBE_API_PORT=6443
export INGRESS_ROUTER_PORT=443
export AGENT_NODE0_IPSV6=${AGENT_NODE0_IPSV6:-}

export AGENT_EXTRAWORKER_NODE_TO_ADD=${AGENT_EXTRAWORKER_NODE_TO_ADD:-"0"}
Member commented:

As per the previous comment, this should not be required

@@ -32,3 +32,15 @@
src: "agent-config_bond_yaml.j2"
dest: "{{ install_path }}/agent-config.yaml"
when: agent_bond_config != 'none'

- name: create add-node directory
Member commented:

That's not really part of the install-config; it must be placed into its own task file related to adding nodes.

@@ -822,6 +822,11 @@ set -x
#
# export APPLIANCE_IMAGE="quay.io/edge-infrastructure/openshift-appliance:latest"

# Sets which extraworker node to add to the cluster using the agent add-node workflow.
Contributor commented:

I think it will be helpful to have a writeup/help showing the steps required before running node-joiner. Per my understanding:

  1. Build the openshift-install binary
  2. Use it to generate the agent ISO as per your config
  3. Boot up the ISO and create a cluster
  4. Now, if you want to add extra nodes, do the following
  5. Build the node-joiner binary by doing xyz
  6. Use this new binary as described below to add nodes
    1. details
    2. additional details
    3. etc

rwsu added 2 commits May 9, 2024 21:24
environment variable to specify which node to add to the cluster.

If multiple extraworker nodes were provisioned, you can specify
which node to add to the cluster through the new
AGENT_EXTRAWORKER_NODE_TO_ADD environment variable.

For example, if your cluster was created with NUM_EXTRA_WORKERS=2,
you can add both nodes to the cluster using these commands:

AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node
openshift-ci bot commented May 10, 2024

@rwsu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                           Commit   Details  Required  Rerun command
ci/prow/e2e-agent-compact-ipv4      d61ab55  link     true      /test e2e-agent-compact-ipv4
ci/prow/e2e-agent-ha-dualstack      d61ab55  link     false     /test e2e-agent-ha-dualstack
ci/prow/e2e-metal-ipi-virtualmedia  d61ab55  link     false     /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-metal-ipi-serial-ipv4   d61ab55  link     true      /test e2e-metal-ipi-serial-ipv4

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
