MIXEDARCH-392: agent-installer support for multi-arch #48148
Conversation
@zniu1011: This pull request references MIXEDARCH-392 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.16." or "openshift-4.16.", but it targets "openshift-4.15" instead. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@zniu1011,
Interacting with pj-rehearse: Once you are satisfied with the results of the rehearsals, comment:
Force-pushed from bce57f3 to da1dfac (Compare)
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-multi-nightly-baremetal-agent-ipv4-static-amd-mixarch-f14
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-baremetal-sno-agent-ipv4-static-connected-f7
/retest
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-multi-nightly-baremetal-agent-ipv4-static-amd-mixarch-f14
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-baremetal-sno-agent-ipv4-static-connected-f7
/retest
Force-pushed from da1dfac to 4a8eb03 (Compare)
Force-pushed from 4a8eb03 to 204a9e8 (Compare)
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-multi-nightly-baremetal-agent-ipv4-static-amd-mixarch-f14
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-baremetal-sno-agent-ipv4-static-connected-f7
report_template: '{{if eq .Status.State "success"}} :rainbow: Job *{{.Spec.Job}}*
  ended with *{{.Status.State}}*. <{{.Status.URL}}|View logs> :rainbow: {{else}}
  :volcano: Job *{{.Spec.Job}}* ended with *{{.Status.State}}*. <{{.Status.URL}}|View
  logs> :volcano: {{end}}'
spec:
The PR probably needs a rebase; it removes some unrelated code.
The rest looks good to me.
Okay, let me try rebasing it.
Hi @zniu1011, here are some comments (if some of them require more time than usual, we can look at them after your holidays).
@@ -1313,6 +1313,52 @@ tests:
test:
- chain: openshift-e2e-test-qe
workflow: cucushift-installer-rehearse-azure-ipi-publish-mixed-ingress-internal
- as: baremetal-agent-ipv4-static-amd-mixarch-f14
Hi @zniu1011, if I understand correctly, this should be considered a compact cluster (3 masters only) with a day2 scale-up operation. Should we explicitly clarify that, as in the other profiles?
Hi @aleskandro, the name "baremetal-compact-agent-ipv4-dhcp-disconnected-day2-amd-mixarch-f14" exceeds the maximum length ("67 characters long, maximum length is 61"), so we may need to cut at least one of the elements here.
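As a quick sanity check (the 61-character limit and both candidate names are taken from the comment above; running this locally is just a hypothetical aid), the lengths can be verified in shell before pushing a config:

```shell
#!/usr/bin/env bash
# Candidate test names from the discussion; the CI tooling rejects
# "as:" names longer than 61 characters.
max_len=61
names=(
  "baremetal-compact-agent-ipv4-dhcp-disconnected-day2-amd-mixarch-f14"
  "baremetal-compact-agent-ipv4-dhcp-disc-day2-amd-mixarch-f14"
)
for name in "${names[@]}"; do
  if (( ${#name} > max_len )); then
    echo "TOO LONG (${#name} chars): ${name}"
  else
    echo "OK (${#name} chars): ${name}"
  fi
done
```

Swapping "disconnected" for "disc" drops the name from 67 to 59 characters, which fits under the limit.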
Yeah, let's reduce some chars maybe? Not sure what we usually do in these cases, but "compact"
looks important to state, as the 3 master nodes will also get the worker role (regardless of the scale-up).
There are some cases using "disc" to replace "disconnected".
Thanks @lwan-wanglin, let me try 'disc' instead of 'disconnected'.
test:
- chain: openshift-e2e-test-qe
workflow: baremetal-lab-agent-install-network-static-day2
- as: baremetal-agent-ipv4-static-disconnected-amd-mixarch-f14
Can we use dhcp instead of static, and ADDITIONAL_WORKER_ARCHITECTURE_DAY2: false?
OK with dhcp instead of static.
What do you mean by "ADDITIONAL_WORKER_ARCHITECTURE_DAY2: false"?
Can the agent installer generate two ISOs, one per architecture?
@aleskandro @zniu1011 and I have discussed this question: we can't use the agent ISO for additional-node scale-up. The agent ISO contains not only boot information but also the network configuration, the number of nodes, etc.
If we create an agent ISO for arm, it stands for a fresh install rather than a node scale-up. #47637 (comment)
@zniu1011 please correct me if I misunderstand.
@lwan-wanglin Fully correct.
@aleskandro The agent installer only supports a single boot ISO per install, with a fixed node count and fixed configuration, and it doesn't support mixed-arch. Any installation that doesn't meet these fixed requirements will fail the checks and not proceed to install. The agent team has been working on the day2 install design too; it may become possible to install day2 nodes using a different way/command to create an additional image, but the design is still pending and not settled yet. https://issues.redhat.com/browse/AGENT-682
What I was wondering is: can we set all the nodes (even the additional ones) in the same agent-install config (day0) and generate one ISO per arch with the exact same content?
If we want to install an arm-based CP, how does the agent installer understand that it should generate an ISO for arm? From the controlPlane.architecture field in the install-config?
Yes, that's right: we use the controlPlane.architecture and compute[].architecture fields in the install-config to define the arch.
For now, controlPlane.architecture and compute[].architecture must be exactly the same; if there are two different archs in the install-config, the agent ISO image can't be created and errors out.
We can run the "create image" command twice to generate two sets, agent.x86_64.iso and agent.aarch64.iso, but they do not boot successfully to install a mixed-arch cluster.
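For illustration, a sketch of the relevant install-config.yaml fragment (field names are those discussed above; replica counts and values are hypothetical). Today the agent installer requires both architecture fields to match:

```yaml
# Hypothetical install-config.yaml fragment for an agent-based install.
# controlPlane.architecture and compute[].architecture must currently be
# identical, or the agent ISO creation fails with an error.
controlPlane:
  name: master
  replicas: 3
  architecture: amd64
compute:
- name: worker
  replicas: 0
  architecture: amd64  # must match controlPlane.architecture today
```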
Thanks for clarifying. Can you fix the documentation of the day2 parameter for the agent steps by specifying that only "true" is supported, because day0 is not implemented yet?
Sure, I will update the documentation to clarify this. Thanks @aleskandro
@@ -29,6 +29,11 @@ EOF
for bmhost in $(yq e -o=j -I=0 '.[]' "${SHARED_DIR}/hosts.yaml"); do
  # shellcheck disable=SC1090
  . <(echo "$bmhost" | yq e 'to_entries | .[] | (.key + "=\"" + .value + "\"")')
  if [[ "${name}" == *-a-* ]] && [ "${ADDITIONAL_WORKERS_DAY2}" == "true" ]; then
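The condition in this hunk can be sketched as a small predicate (the `*-a-*` name pattern marks additional workers in these steps; treating a match as "skip at day0" is an assumption about the elided loop body):

```shell
#!/usr/bin/env bash
# Sketch of the day2 skip logic from the diff above. Returns 0 (skip)
# when the host is an additional worker (its name contains "-a-") and
# the day2 flag is enabled, so the host is provisioned later, at day2.
is_day2_host() {
  local name="$1" day2="$2"
  if [[ "${name}" == *-a-* ]] && [ "${day2}" == "true" ]; then
    return 0
  fi
  return 1
}

is_day2_host "worker-a-00" "true" && echo "worker-a-00: skipped at day0"
is_day2_host "master-00" "true" || echo "master-00: provisioned at day0"
```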
@zniu1011 can you double-check the other steps and synchronize any similar loops that are left, so that we have them all on the same page? Even if no test configs are defined for multi-arch compute nodes, we can at least verify the change against regressions.
Yes, I have run the regressions; single-arch cases are not affected.
@zniu1011 there are more instances of blocks that loop through the hosts where we want to provision the ABI cluster. We should update them too while working on this enablement.
Line 30 in 04545ae:

. <(echo "$bmhost" | yq e 'to_entries | .[] | (.key + "=\"" + .value + "\"")')

Lines 53 to 67 in 343ccf9:

"iso")
  echo -e "\nMounting the unconfigured agent ISO image in the hosts via virtual media and powering on the hosts..."
  # shellcheck disable=SC2154
  for bmhost in $(yq e -o=j -I=0 '.[]' "${SHARED_DIR}/hosts.yaml"); do
    # shellcheck disable=SC1090
    . <(echo "$bmhost" | yq e 'to_entries | .[] | (.key + "=\"" + .value + "\"")')
    if [ "${transfer_protocol_type}" == "cifs" ]; then
      IP_ADDRESS="$(dig +short "${AUX_HOST}")"
      iso_path="${IP_ADDRESS}/isos/${CLUSTER_NAME}/${UNCONFIGURED_AGENT_IMAGE_FILENAME}"
    else
      # Assuming HTTP or HTTPS
      iso_path="${transfer_protocol_type:-http}://${AUX_HOST}/${CLUSTER_NAME}/${UNCONFIGURED_AGENT_IMAGE_FILENAME}"
    fi
    mount_virtual_media "${host}" "${iso_path}" &
  done

Lines 95 to 106 in 04545ae:

"
echo "[INFO] Processing the platform.baremetal.hosts list in the install-config.yaml..."
# shellcheck disable=SC2154
for bmhost in $(yq e -o=j -I=0 '.[]' "${SHARED_DIR}/hosts.yaml"); do
  # shellcheck disable=SC1090
  . <(echo "$bmhost" | yq e 'to_entries | .[] | (.key + "=\"" + .value + "\"")')
  ADAPTED_YAML="
    name: ${name}
    role: ${name%%-[0-9]*}
    bootMACAddress: ${mac}
  "
  ...
@aleskandro Okay, I will update those blocks too. We only run the dynamic flow for the day2 cases, so the other changes won't be used for now, since we don't reserve day2 nodes in the other flows.
Thanks @zniu1011 ... We don't use them now, but, for example, if we later want to do ephemeral installations, the only change required to allow day2 scale-up is probably just that. So it's worth making that change wherever it could be required, as part of a single PR that targets full enablement of the current agent steps to support multi-arch compute nodes at day2.
@@ -248,3 +258,15 @@ if ! wait "$!"; then
  # TODO: gather logs??
  exit 1
fi

# Exit normally if there is no day2 jobs
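A guard like the one added in this hunk can be sketched as follows (ADDITIONAL_WORKERS comes from the step's env discussed later in this PR; the exact variable tested in the real script is not shown here, so this is an assumption):

```shell
#!/usr/bin/env bash
# Hypothetical early-exit guard: when no additional (day2) workers are
# requested, the step has nothing left to do and should exit successfully
# instead of failing. Sketched as a predicate so it is easy to exercise.
has_day2_work() {
  [ "${1:-0}" -gt 0 ]
}

# In the real step this would be followed by "exit 0" on the no-work path.
if has_day2_work "${ADDITIONAL_WORKERS:-0}"; then
  echo "Proceeding with day2 provisioning"
else
  echo "No day2 jobs; exiting normally"
fi
```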
We don't need this, as it's the same content as the other worker.ign file you're overwriting. You find this procedure in the docs because users might do this many times, long after the cluster installation.
@aleskandro I think there is no other worker.ign file for agent, unless you create it manually.
Right, thanks for clarifying.
Force-pushed from 204a9e8 to f94ed0d (Compare)
Force-pushed from 5fdfc36 to 2b1528c (Compare)
Force-pushed from 2b1528c to 8ffeed3 (Compare)
Whether to add (power on) the additional workers after the cluster is installed. And the default value is "true"
when ADDITIONAL_WORKERS > 0, "false" is not supported in ABI. This is useful for testing multi-arch compute node
clusters as a real day2 operation and is especially important for testing single-architectures clusters based on
single-architecture payloads migrating to a multi-arch payload. Also, at time of writing, the above path is the
only one possible for multi-arch clusters with arm64 control plane and additional amd64 workers (because the
release image is not a manifest list when installing from Prow and this check would fail at installation time if
installing an arm64 CP cluster from the multi payload:
https://github.com/openshift/installer/blob/4daff6117fce1d73774b02caf605d9e1b836dd7c/data/data/bootstrap/files/usr/local/bin/release-image-download.sh.template#L36-L50).
Suggested change:

Whether to add additional workers after the cluster is installed. And the default value is "true"
when ADDITIONAL_WORKERS > 0, "false" is not supported in ABI. This is useful for testing multi-arch compute node
clusters as a real day2 operation and is especially important for testing single-architectures clusters based on
single-architecture payloads migrating to a multi-arch payload.
That last sentence doesn't apply anymore
Got it!
@aleskandro Should I remove this sentence from 'baremetal-lab-upi-install-ref' and 'upi-install-heterogeneous-ref' too?
If you could do that, it'd be good too!
Okay, done!
Force-pushed from af23222 to 20931dc (Compare)
# Extract the ignition file for additional workers if additional workers count > 0
oc extract -n openshift-machine-api secret/worker-user-data-managed --keys=userData --to=- > "${SHARED_DIR}"/worker.ign

echo -e "\nCopying ignition files into bastion host..."
chmod 644 "${SHARED_DIR}"/*.ign
scp "${SSHOPTS[@]}" "${SHARED_DIR}"/*.ign "root@${AUX_HOST}:/opt/html/${CLUSTER_NAME}/"
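The copy step above can be exercised locally with a throwaway directory (the AUX_HOST and CLUSTER_NAME values below are placeholders, and the scp command is printed rather than executed so the sketch runs without a bastion):

```shell
#!/usr/bin/env bash
# Sketch of the ignition-copy step: stage *.ign files, make them readable
# by the bastion's web server, and show the scp command that would publish
# them. Hostname and cluster name are placeholder assumptions.
set -euo pipefail
SHARED_DIR="$(mktemp -d)"
AUX_HOST="bastion.example.com"
CLUSTER_NAME="demo"
touch "${SHARED_DIR}/master.ign" "${SHARED_DIR}/worker.ign"
chmod 644 "${SHARED_DIR}"/*.ign  # world-readable for the HTTP server
echo "would run: scp ${SHARED_DIR}/*.ign root@${AUX_HOST}:/opt/html/${CLUSTER_NAME}/"
stat -c '%a' "${SHARED_DIR}/worker.ign"
```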
Just a final thing @zniu1011: can you put this snippet in https://github.com/openshift/release/blob/master/ci-operator/step-registry/upi/install/heterogeneous/upi-install-heterogeneous-commands.sh#L141?
I don't like this logic staying here, and since it applies well to UPI on bare metal too, we can put it there for everyone (otherwise, some other agent workflows might need this change too).
Hi @aleskandro, if it's OK with you that the "*.ign" files are copied and overwritten a second time in the UPI day2 install, I will update it.
Yes, it's harmless.
@aleskandro, the update is done, and the job runs fine after the updates too.
Force-pushed from 20931dc to bc4c641 (Compare)
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.15-multi-nightly-baremetal-compact-agent-ipv4-dhcp-disc-day2-amd-mixarch-f14
/lgtm Thanks @zniu1011, great job!
cc @liangxia to review, thanks
@zniu1011: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@@ -1313,6 +1313,52 @@ tests:
test:
- chain: openshift-e2e-test-qe
workflow: cucushift-installer-rehearse-azure-ipi-publish-mixed-ingress-internal
- as: baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14
Do we need the test in 4.16?
I think so. @aleskandro, @lwan-wanglin, should we move the jobs to 4.16?
Considering we will postpone the feature to 4.16, I think we can move the jobs to 4.16
Instead of moving the job to 4.16, perhaps we can add a new job for 4.16 (and also keep the 4.15 one). WDYT?
Yes, that's OK. Do we need to reduce the frequency to f28 for 4.15, because of the limited baremetal infra? @aleskandro
Yes, let's set f28 for 4.15 to get an early testing signal, but also open a story to remember to delete it after GA on 4.16.
Force-pushed from bc4c641 to 65ae23e (Compare)
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14
env:
  ADDITIONAL_WORKER_ARCHITECTURE: aarch64
  ADDITIONAL_WORKERS: "1"
  ADDITIONAL_WORKERS_DAY2: "true"
Looks like ADDITIONAL_WORKERS_DAY2: "true" is the default value. IMO, there is no need to add it here (the same goes for the other env vars; otherwise we will end up with a long list of env vars, which is hard to maintain).
OK, let me remove it.
Force-pushed from 65ae23e to 049930c (Compare)
[REHEARSALNOTIFIER]
A total of 87 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here. Interacting with pj-rehearse: Once you are satisfied with the results of the rehearsals, comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aleskandro, liangxia, zniu1011. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/pj-rehearse ack
This is the PR that implements the CI for agent-based installer day2 installs with the multi payload; the deployed cases run amd64-based multi payloads.
We are missing arm64-based multi-payload cases for now, as they are pending a fix. There will be another PR for them.