AGENT-864: New make target to add day2 node with agent #1654
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment. The full list of commands accepted by this bot can be found here.
/cc @andfasano @bfournie @pawanpinjarkar
/hold Cannot add --auto-approve-csrs flag to monitor-add-nodes command. Need to approve CSRs within dev-scripts.
The new target is named "agent_add_node".
Requires at least one extra worker node to be defined:
export NUM_EXTRA_WORKERS=1
Currently only one extra worker node is added to the cluster even if NUM_EXTRA_WORKERS > 1.
The extra worker disk must be at least 100GB to pass assisted-service validations:
export EXTRA_WORKER_DISK=100
Depends on the release image having the node-joiner binary; node-joiner is included in the baremetal-installer image.
The target calls "node-joiner add-nodes" to create the add-nodes ISO. It then attaches the ISO to extraworker-0 and boots it. Finally it calls "node-joiner monitor-add-nodes" to monitor progress. In the background, dev-scripts approves any CSRs so that the node joins the cluster and becomes Ready.
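Put together, a typical invocation looks like this (a usage sketch assembled from the description above; the variables and targets are the ones this PR introduces):

    # Define at least one extra worker before deploying the cluster,
    # with a disk large enough to pass the assisted-service validations.
    export NUM_EXTRA_WORKERS=1
    export EXTRA_WORKER_DISK=100

    # Deploy the initial cluster, then add the extra worker as a day-2 node.
    make agent
    make agent_add_node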
/unhold
@@ -33,6 +33,9 @@ agent_gather:
agent_tests:
	./agent/agent_tests.sh

agent_add_node:
I like the idea of having a separate step, since it will be a useful way to develop/test the node-joiner locally. From a high-level point of view, we should aim to support two different scenarios:
- Dev/local testing
In this scenario, the dev first runs a make agent to deploy a cluster, then a make agent_add_node multiple times while developing. For this reason, the step should be (a rough sketch of this loop follows after this comment):
  - If there are extra workers already added as nodes, drain/cordon them
  - Stop the extra worker VMs and clean up their disks
  - Rebuild the baremetal-installer image (step 4) and update it in the local registry
  - Re-run the portion of code that adds the extraworker
- CI jobs
From a CI point of view, we should have a way to invoke the agent_add_node step when required. We could either override the DEVSCRIPTS_TARGET in an agent workflow (ie here), or always add this step in the make agent and exclude it via a config var. Personally I'd be more inclined toward the former option.
Now, since the reference card is targeting the second scenario (CI jobs), and since the first one may require additional work not strictly needed within this sprint, I think it would be better to have a new card for scenario 1 in the carry-over epic AGENT-896 and limit this one to just the second scenario. This should also help simplify the code.
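For illustration, the dev/local re-run loop described above might look roughly like this (a sketch only; the VM naming pattern and node hostnames are assumptions to verify against your environment, not dev-scripts API):

    # Hypothetical sketch of the dev-loop: remove previously joined
    # extra workers so agent_add_node can be re-run cleanly.
    for i in $(seq 0 $((NUM_EXTRA_WORKERS - 1))); do
      node="extraworker-${i}"
      # If the extra worker already joined, drain and remove it first.
      if oc get node "${node}" &>/dev/null; then
        oc adm cordon "${node}"
        oc adm drain "${node}" --ignore-daemonsets --delete-emptydir-data
        oc delete node "${node}"
      fi
      # Stop the VM so its disk can be cleaned and rebooted from a new ISO.
      sudo virsh destroy "${CLUSTER_NAME}_${node}" || true
    done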
early_deploy_validation

function build_node_joiner() {
As per the previous comment, the build function for the node-joiner can be removed for now and addressed in a separate card.
Note for future development: the build_local_release function in step 4 already contains the logic for properly building/pushing a local image; it would be useful to extract it and reuse it when a single image update is required.
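As an illustration of that extraction, a reusable helper might look like this (the function name is hypothetical, and the registry variables are assumed to match the existing dev-scripts configuration; the real logic lives in build_local_release in step 4):

    # Hypothetical helper extracted from build_local_release (step 4).
    function build_and_push_local_image() {
        local image_name="$1"
        local src_dir="$2"
        local local_image="${LOCAL_REGISTRY_DNS_NAME}:${LOCAL_REGISTRY_PORT}/localimages/${image_name}"

        # Build the single image and push it to the local registry so the
        # release payload can be updated without a full rebuild.
        sudo podman build -t "${local_image}" "${src_dir}"
        sudo podman push --tls-verify=false "${local_image}"
    }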
}

function approve_csrs() {
	while true; do
Since this will be run in a CI job, it would also be useful to have a timeout condition (e.g. 10 mins), to avoid relying on the CI timeout (usually 4h) in case of an issue during the approval.
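A minimal sketch of such a bounded loop (the 10-minute value and the function shape are illustrative, not the PR's code):

    # Illustrative approve_csrs with an explicit timeout, so a stuck
    # approval fails fast instead of hitting the 4h CI timeout.
    function approve_csrs() {
        local deadline=$((SECONDS + 600))  # 10 minutes
        while (( SECONDS < deadline )); do
            # Approve any pending CSRs; ignore errors if none are pending.
            oc get csr -o name | xargs --no-run-if-empty oc adm certificate approve || true
            sleep 30
        done
    }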
	done
}
if [ -z "$KNI_INSTALL_FROM_GIT" ]; then
As suggested before, we don't need this part now for the local rebuild, so it could be removed.
cp "$OPENSHIFT_INSTALL_PATH/bin/node-joiner" "$OCP_DIR" | ||
fi | ||
|
||
rm $OCP_DIR/add-node/.openshift_install_state.json || true
This should never be required. When running transient commands, the ${WORKING_DIR} should be used as a base dir (and cleaned up at the end).
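For example, a transient working area could be handled like this (a sketch, assuming ${WORKING_DIR} is already exported by dev-scripts; the directory name is illustrative):

    # Sketch: run transient node-joiner commands from a throwaway dir
    # under ${WORKING_DIR}, cleaned up on exit, so no state files leak
    # into $OCP_DIR.
    add_node_dir=$(mktemp -d "${WORKING_DIR}/add-node.XXXXXX")
    trap 'rm -rf "${add_node_dir}"' EXIT

    # ... generate assets and run node-joiner from ${add_node_dir} ...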
if [[ "$AGENT_STATIC_IP_NODE0_ONLY" = "true" ]]; then | ||
static_ips=1 | ||
else | ||
static_ips=$NUM_MASTERS+$NUM_WORKERS |
Suggested change:
-    static_ips=$NUM_MASTERS+$NUM_WORKERS
+    static_ips=$NUM_MASTERS+$NUM_WORKERS+$NUM_EXTRA_WORKERS
See comment below
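Worth noting while touching this line: plain var=$A+$B in bash performs string concatenation, not addition, so the value only works if every later use evaluates it arithmetically. An arithmetic expansion avoids that pitfall (a sketch, not the PR's code):

    # Arithmetic expansion yields a numeric value immediately, instead of
    # the literal string "3+2+1" that plain assignment would produce.
    static_ips=$((NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS))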
else
    worker_num=$((${i}-$NUM_MASTERS))
    AGENT_NODES_HOSTNAMES+=($(printf ${WORKER_HOSTNAME_FORMAT} ${worker_num}))
    cluster_name=${CLUSTER_NAME}_worker_${worker_num}
Another point for managing the extra worker.
fi
done

AGENT_EXTRA_WORKERS_IPS=()
That's really a lot of duplication that I'd prefer not to have. An extra worker is really just like an ordinary node, only not effectively deployed, and it should be managed in an agnostic way in the code where possible.
The previous suggestions are meant to accommodate that vision, by incorporating the extra workers in the loop that is responsible for managing the nodes. In general I'm not against having dedicated AGENT_EXTRA_WORKERS_* vars, in addition to the AGENT_NODES_* ones, since I understand it may be useful to distinguish the two sets for the sake of template generation, but without having the code duplication.
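One way to read that suggestion is a single loop covering all node kinds (a hypothetical sketch; MASTER_HOSTNAME_FORMAT and EXTRA_WORKER_HOSTNAME_FORMAT are assumed names, only WORKER_HOSTNAME_FORMAT appears in the diff):

    # Sketch: one loop over masters + workers + extra workers, filling the
    # shared AGENT_NODES_* arrays; extra workers additionally land in their
    # own arrays for template generation, without duplicating the loop.
    total=$((NUM_MASTERS + NUM_WORKERS + NUM_EXTRA_WORKERS))
    for (( i = 0; i < total; i++ )); do
        if (( i >= NUM_MASTERS + NUM_WORKERS )); then
            extra_num=$((i - NUM_MASTERS - NUM_WORKERS))
            hostname=$(printf "${EXTRA_WORKER_HOSTNAME_FORMAT}" "${extra_num}")
            AGENT_EXTRA_WORKERS_HOSTNAMES+=("${hostname}")
        elif (( i >= NUM_MASTERS )); then
            hostname=$(printf "${WORKER_HOSTNAME_FORMAT}" "$((i - NUM_MASTERS))")
        else
            hostname=$(printf "${MASTER_HOSTNAME_FORMAT}" "${i}")
        fi
        AGENT_NODES_HOSTNAMES+=("${hostname}")
    done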
# External load balancer configuration.
# The following ports are opened in firewalld so that libvirt VMs can communicate with haproxy.
export MACHINE_CONFIG_SERVER_PORT=22623
export KUBE_API_PORT=6443
export INGRESS_ROUTER_PORT=443
export AGENT_NODE0_IPSV6=${AGENT_NODE0_IPSV6:-}

export AGENT_EXTRAWORKER_NODE_TO_ADD=${AGENT_EXTRAWORKER_NODE_TO_ADD:-"0"}
As per the previous comment, this should not be required
@@ -32,3 +32,15 @@
  src: "agent-config_bond_yaml.j2"
  dest: "{{ install_path }}/agent-config.yaml"
  when: agent_bond_config != 'none'

- name: create add-node directory
That's not really part of the install-config; it must be placed into its own task file related to adding nodes.
@@ -822,6 +822,11 @@ set -x
#
# export APPLIANCE_IMAGE="quay.io/edge-infrastructure/openshift-appliance:latest"

# Sets which extraworker node to add to the cluster using the agent add-node workflow.
I think it will be helpful to have a write-up/help showing the steps required before running node-joiner. Per my understanding (a sketch of the day-2 portion follows after this list):
- Build the openshift-install binary
- Use it to generate the agent ISO as per your config
- Boot up the ISO and create a cluster
- Now, if you want to add extra nodes, do the following:
  - Build the node-joiner binary by doing xyz
  - Use this new binary as described below to add nodes
    - details
    - additional details
    - etc
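A rough sketch of that day-2 portion, using the subcommand names from this PR's description; the kubeconfig path, the nodes-config.yaml schema, and running node-joiner without flags are assumptions, not confirmed by this PR:

    # Sketch of the day-2 flow the new target automates.
    export KUBECONFIG=./ocp/ostest/auth/kubeconfig   # hypothetical path

    # 1. Describe the node(s) to add (schema assumed here).
    cat > nodes-config.yaml <<EOF
    hosts:
      - hostname: extraworker-0
    EOF

    # 2. Generate the add-nodes ISO for the running cluster.
    node-joiner add-nodes

    # 3. Boot extraworker-0 from the generated ISO, then monitor the join.
    node-joiner monitor-add-nodes

    # 4. Approve pending CSRs so the node becomes Ready.
    oc get csr -o name | xargs --no-run-if-empty oc adm certificate approve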
If multiple extraworker nodes were provisioned, you can specify which node to add to the cluster through the new AGENT_EXTRAWORKER_NODE_TO_ADD environment variable. For example, if your cluster was created with NUM_EXTRA_WORKERS=2, you can add both nodes to the cluster using these commands:
AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node
@rwsu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The new target is named "agent_add_node".
Requires at least one extra worker node to be defined:
export NUM_EXTRA_WORKERS=1
The extra worker disk must be at least 100GB to pass assisted-service validations:
export EXTRA_WORKER_DISK=100
If multiple extraworker nodes were provisioned, you can specify which node to add to the cluster through a new AGENT_EXTRAWORKER_NODE_TO_ADD environment variable.
For example, if your cluster was created with NUM_EXTRA_WORKERS=2, you can add both nodes to the cluster using these commands:
AGENT_EXTRAWORKER_NODE_TO_ADD=0 make agent_add_node
AGENT_EXTRAWORKER_NODE_TO_ADD=1 make agent_add_node
The target calls "node-joiner add-nodes" to create the add-nodes ISO. It then attaches the ISO to the selected extraworker node (extraworker-0 by default) and boots it. Finally it calls "node-joiner monitor-add-nodes" to monitor progress. In the background, dev-scripts approves any CSRs so that the node joins the cluster and becomes Ready.
Depends on: openshift/installer#8171