AGENT-864: New make target to add day2 node with agent #1654

Open · rwsu wants to merge 3 commits into master
Conversation

rwsu (Contributor) commented Apr 20, 2024

The new target is named "agent_add_node".

Requires at least one extra worker node to be defined.

export NUM_EXTRA_WORKERS=1

The extra worker disk must be at least 100 GB to pass assisted-service validations.

export EXTRA_WORKER_DISK=100

If multiple extraworker nodes were provisioned, you can specify which node to add to the cluster through a new
AGENT_EXTRAWORKER_NODE_TO_ADD environment variable.

For example, if your cluster was created with NUM_EXTRA_WORKERS=2, you can add both nodes to the cluster using these commands:
AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node

The target calls "node-joiner add-nodes" to create the add-nodes ISO. It then attaches the ISO to extraworker-0 and boots it. Finally, it calls "node-joiner monitor-add-nodes" to monitor progress (a sketch of this flow follows below). In the background, dev-scripts approves any CSRs so that the node joins the cluster and becomes Ready.
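
For reference, a rough sketch of the flow the target automates. This is illustrative only: the node-joiner flags, the libvirt device/VM names, and the ISO path are assumptions, not the PR's literal code.

# Generate the add-nodes ISO (flags assumed for illustration):
node-joiner add-nodes --dir "${OCP_DIR}/add-node" --kubeconfig "${KUBECONFIG}"

# Attach the generated ISO to the extra worker VM and boot it (libvirt example):
sudo virsh change-media "${CLUSTER_NAME}_extraworker_0" sda "${OCP_DIR}/add-node/node.iso" --insert --config
sudo virsh start "${CLUSTER_NAME}_extraworker_0"

# Monitor the join while CSRs are approved in the background:
node-joiner monitor-add-nodes "${EXTRAWORKER_IP}" --dir "${OCP_DIR}/add-node" --kubeconfig "${KUBECONFIG}"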

Depends on: openshift/installer#8171

openshift-ci bot commented Apr 20, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekhiggins for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rwsu (Contributor, Author) commented Apr 20, 2024

openshift-ci bot requested review from andfasano, bfournie and pawanpinjarkar and removed request for cybertron and lranjbar on April 20, 2024 02:09
rwsu (Contributor, Author) commented Apr 22, 2024

/hold

Cannot add the --auto-approve-csrs flag to the monitor-add-nodes command. Need to approve CSRs within dev-scripts (see the sketch below).
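
For context, the common OpenShift idiom for approving pending CSRs from a script looks roughly like this (a sketch, not the PR's code; the go-template filter selects CSRs that have no status yet):

while true; do
  # Approve any CSRs that have not been acted on yet.
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
  sleep 30
done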

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 22, 2024
openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 23, 2024
The new target is named "agent_add_node".

Requires at least one extra worker node to be defined.

export NUM_EXTRA_WORKERS=1

Currently only one extra worker node is added to the
cluster even if NUM_EXTRA_WORKERS > 1.

Extra worker disk must be at least 100GB to pass assisted
service validations.

export EXTRA_WORKER_DISK=100

Depends on the release image having the node-joiner binary.
node-joiner is included in the baremetal-installer image.

The target calls "node-joiner add-nodes" to create the
addnodes ISO. It then attaches the ISO to extraworker-0
and boots it. It finally calls "node-joiner monitor-add-nodes"
to monitor progress. In the background, dev-scripts approves
any CSRs so that the node joins the cluster and becomes
Ready.
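
As a side note, one way to obtain the node-joiner binary from a release payload would be roughly the following sketch; the in-image path /usr/bin/node-joiner and the release-image variable name are assumptions:

# Extract node-joiner from the baremetal-installer image of the release payload.
image=$(oc adm release info --image-for=baremetal-installer "${OPENSHIFT_RELEASE_IMAGE}")
oc image extract "${image}" --path /usr/bin/node-joiner:"${OCP_DIR}"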
openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 1, 2024
rwsu (Contributor, Author) commented May 2, 2024

/unhold

openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 2, 2024
@@ -33,6 +33,9 @@ agent_gather:
agent_tests:
./agent/agent_tests.sh

agent_add_node:
Member commented:

I like the idea of having a separate step, since it will be a useful way to develop/test the node-joiner locally. From a high-level point of view, we should aim to support two different scenarios:

  1. Dev/Local testing
    In this scenario, the dev first runs a make agent to deploy a cluster, then a make agent_add_node multiple times while developing. For this reason, the step should:
  • If there are extra workers already added as nodes, drain/cordon them
  • Stop the extra worker VMs and clean up their disks
  • Rebuild the baremetal-installer image (step 4) and update it in the local registry
  • Re-run the portion of code that adds the extraworker
  2. CI jobs
    From a CI point of view, we should have a way to invoke the agent_add_node step when required. We could either override the DEVSCRIPTS_TARGET in an agent workflow (i.e. here), or always add this step in the make agent and exclude it via a config var. Personally I'd be more inclined toward the former option (see the sketch after this comment).

Now, since the reference card is targeting the second scenario (CI jobs), and since the first one may require additional work not strictly needed within this sprint, I think it would be better to have a new card for scenario 1 in the carry-over epic AGENT-896 and limit this one to just the second scenario. This should also help simplify the code.
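
A rough sketch of the former option; whether the CI step consumes DEVSCRIPTS_TARGET exactly this way is an assumption:

# In the agent workflow's step, override the default target.
export DEVSCRIPTS_TARGET=agent_add_node
make "${DEVSCRIPTS_TARGET}"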


early_deploy_validation

function build_node_joiner() {
Member commented:

As per the previous comment, the build function for the node-joiner can be removed for now and addressed in a separate card.

Note for future development: the build_local_release function in step 4 already contains the logic for properly building/pushing a local image; it would be useful to extract it and reuse it when a single image update is required.

}

function approve_csrs() {
while true; do
Member commented:

Since this will be run in a CI job, it would also be useful to have a timeout condition (i.e. 10 mins), to avoid relying on the CI timeout (usually 4 h) in case of an issue during the approval. A sketch follows after this snippet.

done
}
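
A minimal sketch of the suggested timeout, assuming a 10-minute budget and a 30-second poll interval (both illustrative):

# Bound the approval loop instead of relying on the 4h CI job timeout.
deadline=$(( $(date +%s) + 600 ))
while (( $(date +%s) < deadline )); do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
  sleep 30
done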

if [ -z "$KNI_INSTALL_FROM_GIT" ]; then
Member commented:

As suggested before, we don't need this part now for the local rebuild, so it could be removed.

cp "$OPENSHIFT_INSTALL_PATH/bin/node-joiner" "$OCP_DIR"
fi

rm "$OCP_DIR/add-node/.openshift_install_state.json" || true
Member commented:

This should never be required. When running transient commands, the ${WORKING_DIR} should be used as a base dir (and cleaned up at the end)
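
A sketch of that pattern (directory name illustrative):

# Keep transient node-joiner output under ${WORKING_DIR} and clean it up on exit.
add_node_dir=$(mktemp -d --tmpdir="${WORKING_DIR}" add-node.XXXXXX)
trap 'rm -rf "${add_node_dir}"' EXIT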

if [[ "$AGENT_STATIC_IP_NODE0_ONLY" = "true" ]]; then
static_ips=1
else
static_ips=$NUM_MASTERS+$NUM_WORKERS
Member commented:

Suggested change:
-    static_ips=$NUM_MASTERS+$NUM_WORKERS
+    static_ips=$NUM_MASTERS+$NUM_WORKERS+$NUM_EXTRA_WORKERS
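
Worth noting for either variant: bash only evaluates this as arithmetic if the variable is later used in an arithmetic context; otherwise the assignment stores the literal string (e.g. "3+2"). If a plain number is needed, a sketch:

# Force numeric evaluation rather than string concatenation.
static_ips=$(( NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS ))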

Member commented:

See comment below

else
worker_num=$((${i}-$NUM_MASTERS))
AGENT_NODES_HOSTNAMES+=($(printf ${WORKER_HOSTNAME_FORMAT} ${worker_num}))
cluster_name=${CLUSTER_NAME}_worker_${worker_num}
Member commented:

Another point for managing the extra worker.

fi
done

AGENT_EXTRA_WORKERS_IPS=()
Member commented:

That's really a lot of duplication that I'd prefer not to have. An extra worker is really just like an ordinary node, only not effectively deployed, and it should be managed in an agnostic way in the code where possible.

The previous suggestions are meant to accommodate that vision by incorporating the extra workers in the loop that is responsible for managing the nodes (see the sketch below). In general I'm not against having dedicated AGENT_EXTRA_WORKERS_* vars, in addition to the AGENT_NODES_* ones, since I understand it may be useful to discriminate between the two sets for the sake of template generation, but without the code duplication.
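
A sketch of what folding the extra workers into the existing loop could look like; the hostname formats and variable names here are illustrative, not the PR's code:

total_nodes=$(( NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS ))
for (( i=0; i<total_nodes; i++ )); do
  if (( i < NUM_MASTERS )); then
    AGENT_NODES_HOSTNAMES+=( "$(printf "${MASTER_HOSTNAME_FORMAT}" "${i}")" )
  elif (( i < NUM_MASTERS + NUM_WORKERS )); then
    worker_num=$(( i - NUM_MASTERS ))
    AGENT_NODES_HOSTNAMES+=( "$(printf "${WORKER_HOSTNAME_FORMAT}" "${worker_num}")" )
  else
    # Extra workers ride the same loop; only the hostname format differs.
    extra_num=$(( i - NUM_MASTERS - NUM_WORKERS ))
    AGENT_NODES_HOSTNAMES+=( "$(printf "extraworker-%d" "${extra_num}")" )
  fi
done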

# External load balancer configuration.
# The following ports are opened in firewalld so that libvirt VMs can communicate with haproxy.
export MACHINE_CONFIG_SERVER_PORT=22623
export KUBE_API_PORT=6443
export INGRESS_ROUTER_PORT=443
export AGENT_NODE0_IPSV6=${AGENT_NODE0_IPSV6:-}

export AGENT_EXTRAWORKER_NODE_TO_ADD=${AGENT_EXTRAWORKER_NODE_TO_ADD:-"0"}
Member commented:

As per the previous comment, this should not be required

@@ -32,3 +32,15 @@
src: "agent-config_bond_yaml.j2"
dest: "{{ install_path }}/agent-config.yaml"
when: agent_bond_config != 'none'

- name: create add-node directory
Member commented:

That's not really part of the install-config; it must be placed into its own task file related to adding nodes.

@@ -822,6 +822,11 @@ set -x
#
# export APPLIANCE_IMAGE="quay.io/edge-infrastructure/openshift-appliance:latest"

# Sets which extraworker node to add to the cluster using the agent add-node workflow.
Contributor commented:

I think it will be helpful to have a writeup/help showing the steps required before running node-joiner. Per my understanding:

  1. Build the openshift-install binary
  2. Use it to generate the agent ISO as per your config
  3. Boot up the ISO and create a cluster
  4. Now, if you want to add extra nodes, do the following
  5. Build the node-joiner binary by doing xyz
  6. Use this new binary as described below to add nodes
    1. details
    2. additional details
    3. etc

rwsu added 2 commits May 9, 2024 21:24
environment variable to specify which node to add to the cluster.

If multiple extraworker nodes were provisioned, you can specify
which node to add to the cluster through the new
AGENT_EXTRAWORKER_NODE_TO_ADD environment variable.

For example, if your cluster was created with NUM_EXTRA_WORKERS=2,
you can add both nodes to the cluster using these commands:

AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node
openshift-ci bot commented May 10, 2024

@rwsu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                           Commit   Details  Required  Rerun command
ci/prow/e2e-agent-compact-ipv4      d61ab55  link     true      /test e2e-agent-compact-ipv4
ci/prow/e2e-agent-ha-dualstack      d61ab55  link     false     /test e2e-agent-ha-dualstack
ci/prow/e2e-metal-ipi-virtualmedia  d61ab55  link     false     /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-metal-ipi-serial-ipv4   d61ab55  link     true      /test e2e-metal-ipi-serial-ipv4

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
