ipi-install-install-aws: Fix ADD_INGRESS_RECORDS_MANUALLY=yes timeout #77955
Conversation
Walkthrough

`prepare_next_steps()` now accepts an exit-code argument and writes it to install-status.txt. Synchronous router LB polling and `wait_router_lb_provision()` were removed. A background DNS watcher (`create_ingress_dns_record()`) and a CFN template writer (`write_ingress_cfn_template()`) were added to create Route53 records; trap/cleanup logic was reworked to terminate background jobs and call `prepare_next_steps(exit_code)`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Installer as Installer Process
    participant DNS as DNS Watcher
    participant K8s as Kubernetes API
    participant LB as Router Service
    participant Route53 as Route53
    participant CFN as CloudFormation
    Installer->>Installer: Launch installer (create cluster)
    Installer->>DNS: Start create_ingress_dns_record() (background)
    par Installer continues / DNS watcher runs
        Installer->>Installer: Installer proceeds (install completes later)
        DNS->>K8s: Poll for kubeconfig until available
        DNS->>LB: Query router-default service for LB hostname
        DNS->>Route53: Resolve private hosted zone ID
        DNS->>DNS: write_ingress_cfn_template(cfn_template)
        DNS->>CFN: Create ${CLUSTER_NAME}-apps-dns CloudFormation stack
        CFN-->>DNS: Stack creation completes
    end
    Installer->>DNS: Wait for DNS watcher PID to exit
    DNS-->>Installer: Exit status
    Installer->>Installer: Call prepare_next_steps(exit_code) and finish
```
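The trap/cleanup wiring described in the walkthrough can be sketched as a minimal, runnable script. Function names (`prepare_next_steps`, `cleanup`) come from the walkthrough and review comments; the temp-file status path and the `sleep`-based stand-in for `create_ingress_dns_record()` are illustrative only, not the real script's code:

```shell
#!/bin/bash
# Sketch of the reworked flow: a background watcher plus a cleanup trap
# that reaps children and records the original exit code.
set -o errexit

status_file="$(mktemp)"   # stand-in for install-status.txt

(
  prepare_next_steps() {
    # Persist the step's exit code for later pipeline steps.
    echo "$1" > "${status_file}"
  }

  cleanup() {
    local exit_code=$?              # capture before any command clobbers $?
    CHILDREN=$(jobs -p)
    if test -n "${CHILDREN}"; then
      # shellcheck disable=SC2086  # word splitting of the PID list is intentional
      kill ${CHILDREN} 2>/dev/null || true
      wait || true                  # tolerate children that already exited
    fi
    prepare_next_steps "${exit_code}"
    trap - EXIT                     # prevent a second invocation via the other trap
  }
  trap cleanup EXIT TERM

  sleep 30 &                        # stand-in for create_ingress_dns_record &
  true                              # installer work would happen here
)

echo "recorded exit code: $(cat "${status_file}")"
```

When the subshell exits normally, the EXIT trap kills the watcher, reaps it, and writes `0` to the status file.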
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 9 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh`:
- Around line 115-199: The if/elif currently checks CLUSTER_TYPE only for exact
values "aws" and "aws-arm64", which excludes other AWS partitions like
"aws-eusc"; update the conditional logic around the CloudFormation template
generation (the if that writes to "${cfn_template}" and the corresponding elif
for "aws-usgov") to use a pattern match for any AWS partition (e.g., test
CLUSTER_TYPE starts with "aws") for the general AWS template path and keep the
explicit "aws-usgov" branch for GovCloud; apply the same change to the other
manual-ingress CFN-template branch in the file that duplicates this logic so all
AWS partitions (including aws-eusc) are supported when
ADD_INGRESS_RECORDS_MANUALLY is used.
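The partition check suggested in this comment can be expressed as a small `case` dispatcher. The helper name `template_branch` is hypothetical, but the matching order (explicit GovCloud branch first, then a catch-all `aws*` pattern so partitions like `aws-eusc` are covered) is what the comment proposes:

```shell
#!/bin/bash
# Hypothetical helper: pick the CFN template branch for a CLUSTER_TYPE.
template_branch() {
  case "$1" in
    aws-usgov) echo "govcloud" ;;  # keep the explicit GovCloud branch first
    aws*)      echo "aws" ;;       # any AWS partition: aws, aws-arm64, aws-eusc, ...
    *)         echo "unsupported"; return 1 ;;
  esac
}

template_branch "aws-eusc"   # matches the general AWS path
```

Because `case` patterns are tried in order, `aws-usgov` must precede `aws*`, otherwise GovCloud would fall into the general branch.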
- Around line 358-359: The TERM trap currently calls prepare_next_steps which
causes cleanup to run twice and can overwrite the original exit code; change the
TERM trap to only reap and kill background children (keep the jobs -p; kill ...;
wait logic) and remove prepare_next_steps from the TERM handler so that
prepare_next_steps is invoked only from the EXIT trap; look for the trap lines
and update the TERM trap to not call prepare_next_steps while leaving the
existing EXIT trap that calls prepare_next_steps intact.
- Around line 241-248: The current lookup for the private Route53 hosted zone
can return the wrong zone; update the logic around private_route53_hostzone_name
and private_route53_hostzone_id to perform an exact name match and require
PrivateZone=true: call aws route53 list-hosted-zones (or
list-hosted-zones-by-name but compare HostedZones[].Name ==
"${private_route53_hostzone_name}." with the trailing dot) and filter
HostedZones[] where .PrivateZone == true, then extract the .Id (strip the
/hostedzone/ prefix) into private_route53_hostzone_id and keep the existing
error return if empty; update any jq/awk pipeline around
private_route53_hostzone_id to reflect this exact-match+PrivateZone filtering.
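The exact-match + PrivateZone filtering asked for above might look like the following pipeline. The `aws` CLI is mocked here with canned tab-separated output so the sketch is self-contained; the `--query` projection and the exact field names are an assumption about how the real script would fetch the data:

```shell
#!/bin/bash
# Mocked `aws` so the sketch runs offline; the real command would be:
#   aws route53 list-hosted-zones \
#     --query 'HostedZones[].[Name,Config.PrivateZone,Id]' --output text
aws() {
  printf '%s\t%s\t%s\n' \
    "example.com."          "False" "/hostedzone/ZPUBLIC" \
    "internal.example.com." "True"  "/hostedzone/ZPRIVATE" \
    "example.com."          "True"  "/hostedzone/ZPRIV2"
}

private_route53_hostzone_name="example.com"

# Exact name match (note the trailing dot) AND PrivateZone == true,
# then strip the /hostedzone/ prefix from the Id.
private_route53_hostzone_id=$(
  aws route53 list-hosted-zones \
    --query 'HostedZones[].[Name,Config.PrivateZone,Id]' --output text |
  awk -v name="${private_route53_hostzone_name}." \
    '$1 == name && $2 == "True" { sub("/hostedzone/", "", $3); print $3; exit }'
)

if [ -z "${private_route53_hostzone_id}" ]; then
  echo "ERROR: no private hosted zone found for ${private_route53_hostzone_name}" >&2
  exit 1
fi
echo "${private_route53_hostzone_id}"
```

The public `example.com.` zone is skipped despite the matching name, because `Config.PrivateZone` is false.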
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a6f271fb-88f1-49bf-af5e-0906e03bd151
📒 Files selected for processing (1)
ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh
Force-pushed 5730608 to a64fda5.
🧹 Nitpick comments (2)
ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh (2)
261-285: Consider adding an `else` clause for defensive coding.

Currently, if `CLUSTER_TYPE` is not one of `aws`, `aws-arm64`, or `aws-usgov`, the stack creation is silently skipped (though `write_ingress_cfn_template` would have already failed earlier at line 254). Adding an explicit `else` clause would make the code more maintainable and catch any future logic drift.

Suggested addition:

```diff
 elif [ "${CLUSTER_TYPE}" == "aws-usgov" ]; then
   aws --region "${REGION}" cloudformation create-stack --stack-name "${APPS_DNS_STACK_NAME}" \
     ... --capabilities CAPABILITY_NAMED_IAM || return 1
+else
+  echo "DNS watcher: ERROR - unsupported CLUSTER_TYPE for stack creation: ${CLUSTER_TYPE}"
+  return 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh` around lines 261 - 285, The code path handling CLUSTER_TYPE covers "aws"/"aws-arm64" and "aws-usgov" but silently skips stack creation for any other value; add an explicit else branch after the existing if/elif to fail fast and log/return an error. Update the conditional block that sets router_lb_hostzone_id and calls aws cloudformation (referencing CLUSTER_TYPE, router_lb_hostzone_id, APPS_DNS_STACK_NAME and write_ingress_cfn_template) to include a final else that prints a clear error (via echo or process logger) mentioning the unexpected CLUSTER_TYPE and returns a non-zero status (e.g., return 1) so the caller immediately sees the misconfiguration.
359-375: Good cleanup implementation with proper exit code handling.

The trap flow correctly:
- Captures the exit code before any commands modify `$?`
- Reaps background children
- Passes the original exit code to `prepare_next_steps`
- Removes the EXIT trap to prevent double invocation

One minor robustness note: consider adding `|| true` to `wait` in case some children have already exited, though this is not strictly necessary since we're in cleanup.

Optional: more defensive error suppression

```diff
 if test -n "${CHILDREN}"; then
-  kill ${CHILDREN} 2>/dev/null
-  wait
+  kill ${CHILDREN} 2>/dev/null || true
+  wait || true
 fi
```

Note: The static analysis hint SC2329 about `cleanup` never being invoked is a false positive: it's invoked via the trap at line 375. The SC2086 hint about quoting `${CHILDREN}` is moot: the variable is intentionally unquoted to allow word splitting of multiple PIDs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh` around lines 359 - 375, The cleanup function should tolerate "wait" returning non-zero if children already exited; update the wait call in function cleanup (where CHILDREN is populated and kill is invoked) to suppress errors (e.g., change the standalone wait to a tolerant form like wait || true) so cleanup still proceeds to call prepare_next_steps with the captured exit_code and remove the EXIT trap; keep the rest of the logic in cleanup and the trap cleanup EXIT TERM unchanged.
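The intentional word splitting of `${CHILDREN}` noted above can be seen in a self-contained demo: `$(jobs -p)` produces a newline-separated PID list, and `kill` needs each PID as a separate argument, which quoting would prevent:

```shell
#!/bin/bash
# Demo: why ${CHILDREN} stays unquoted when passed to kill.
sleep 30 &
sleep 30 &
CHILDREN=$(jobs -p)   # newline-separated list of the two sleep PIDs

# shellcheck disable=SC2086  # word splitting of the PID list is intentional
kill ${CHILDREN} 2>/dev/null
wait ${CHILDREN} 2>/dev/null || true   # reap; killed children exit non-zero

echo "children terminated and reaped"
```

With `"${CHILDREN}"` quoted, kill would receive one argument containing an embedded newline and fail to signal either process.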
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh`:
- Around line 261-285: The code path handling CLUSTER_TYPE covers
"aws"/"aws-arm64" and "aws-usgov" but silently skips stack creation for any
other value; add an explicit else branch after the existing if/elif to fail fast
and log/return an error. Update the conditional block that sets
router_lb_hostzone_id and calls aws cloudformation (referencing CLUSTER_TYPE,
router_lb_hostzone_id, APPS_DNS_STACK_NAME and write_ingress_cfn_template) to
include a final else that prints a clear error (via echo or process logger)
mentioning the unexpected CLUSTER_TYPE and returns a non-zero status (e.g.,
return 1) so the caller immediately sees the misconfiguration.
- Around line 359-375: The cleanup function should tolerate "wait" returning
non-zero if children already exited; update the wait call in function cleanup
(where CHILDREN is populated and kill is invoked) to suppress errors (e.g.,
change the standalone wait to a tolerant form like wait || true) so cleanup
still proceeds to call prepare_next_steps with the captured exit_code and remove
the EXIT trap; keep the rest of the logic in cleanup and the trap cleanup EXIT
TERM unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 2ce4721c-01c1-42b9-a697-8184651d6420
📒 Files selected for processing (1)
ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh
Force-pushed a64fda5 to ff58de6.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh`:
- Around line 369-373: In cleanup(), the 'wait' after killing background
CHILDREN can return non-zero and, under errexit, abort the function before
prepare_next_steps runs; change the pattern to tolerate non-zero job exits
(e.g., replace 'wait' with 'wait || true' or capture and ignore the wait status)
so that cleanup() always proceeds to call prepare_next_steps and persist
artifacts/status; locate the block that uses CHILDREN=$(jobs -p) and adjust the
wait handling in the cleanup() function accordingly.
- Around line 250-256: The DNS watcher references unset variables (match_count,
zone_fqdn, hz_json) which will break under set -o nounset; update the
hosted-zone selection block (the code around the private_route53_hostzone_id
assignment and the conditional using match_count) to first validate/initialize
these variables: test that hz_json is non-empty before running jq, ensure
zone_fqdn is defined (or fail with a clear error log and return 1), and treat
match_count safely (use default 0 or compute it only after hz_json is verified)
so the echo/return logic and the jq + awk invocation do not run on unset values;
add explicit error logging and return codes when any required input is missing.
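The nounset-safe defaulting this comment asks for can be sketched as follows. Variable names (`hz_json`, `zone_fqdn`, `match_count`) come from the comment; the JSON value is an illustrative stand-in for real `aws route53` output:

```shell
#!/bin/bash
# Sketch: make the hosted-zone selection block safe under `set -o nounset`.
set -o nounset

hz_json='{"HostedZones":[]}'   # real script: captured aws CLI output
zone_fqdn="${zone_fqdn:-}"     # default to empty rather than unbound

# Validate required inputs before using them.
if [ -z "${hz_json}" ] || [ -z "${zone_fqdn}" ]; then
  echo "DNS watcher: ERROR - hosted-zone inputs missing" >&2
  # the real function would log more context and `return 1` here
fi

# Compute match_count only after hz_json is verified; default to 0 so
# comparisons like [ "${match_count}" -eq 1 ] never hit an unset variable.
match_count="${match_count:-0}"
echo "match_count=${match_count}"
```

The `${var:-default}` expansion is the standard way to read a possibly-unset variable without tripping `nounset`.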
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0a6c2d49-d459-4baf-af42-db49b6fc6077
📒 Files selected for processing (1)
ci-operator/step-registry/ipi/install/install/aws/ipi-install-install-aws-commands.sh
Force-pushed ff58de6 to e17256e.
The first 2 should be without ADD_INGRESS_RECORDS_MANUALLY, the second 2 should set it.

/pj-rehearse cucushift-installer-rehearse-aws-ipi-edge-zone-rootvolume cucushift-installer-rehearse-aws-ipi-edge-zone-byo-subnet-ovn-day2 cucushift-installer-rehearse-aws-ipi-disconnected-private-cco-manual-security-token-service-private-s3-with-ep-sts-ec2-elb cucushift-installer-rehearse-aws-ipi-disconnected-private-edge-zone
@mdbooth: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/test ci-operator-config ci-operator-registry owners step-registry-metadata step-registry-shellcheck

/test step-registry-metadata
@mdbooth: job(s): cucushift-installer-rehearse-aws-ipi-edge-zone-rootvolume, cucushift-installer-rehearse-aws-ipi-edge-zone-byo-subnet-ovn-day2, cucushift-installer-rehearse-aws-ipi-disconnected-private-cco-manual-security-token-service-private-s3-with-ep-sts-ec2-elb, cucushift-installer-rehearse-aws-ipi-disconnected-private-edge-zone either don't exist or were not found to be affected, and cannot be rehearsed

/pj-rehearse
@mdbooth: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-disc-priv-sts-fips-dualstack-ipv6-mini-perm-tp-f7 periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-localzone-byo-subnet-ovn-day2-f7

@mdbooth: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@mdbooth: The following tests failed, say

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
When ADD_INGRESS_RECORDS_MANUALLY=yes we explicitly remove zone creation from the manifests. This prevents the ingress operator from coming up, which means that 'create cluster' always times out. We were previously creating the DNS records after cluster create timed out and then doing 'wait-for install-complete', so it still succeeded eventually.

Ideally we would create the ingress records in advance, but because the LB is created during installation we have no way of knowing what they should be. There is also currently no mechanism to precreate a loadbalancer and pass it to the installer. To avoid the timeout we need a workaround.

With this change, instead of running after create cluster and assuming that router-default has been created, we kick off a process that polls for its creation. We create the DNS records immediately, during cluster create, which means we don't have to wait for the timeout. This saves 20-30 minutes per run in testing.
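The polling approach described above could be sketched like this. Here `oc` is mocked with a file-backed counter so the loop terminates offline; the real `oc` invocation shown in the comment (namespace, service name, jsonpath) is an assumption about how the watcher would query the router service:

```shell
#!/bin/bash
# Sketch: poll until the router-default LB hostname appears, then act on it.
# Real watcher would run, roughly:
#   oc -n openshift-ingress get service router-default \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
count_file="$(mktemp)"
echo 0 > "${count_file}"

oc() {  # mock: the LB hostname "appears" on the third poll
  local n
  n=$(( $(cat "${count_file}") + 1 ))
  echo "${n}" > "${count_file}"
  if [ "${n}" -ge 3 ]; then
    echo "abcd1234.elb.amazonaws.com"   # hypothetical LB hostname
  fi
}

router_lb_hostname=""
while [ -z "${router_lb_hostname}" ]; do
  router_lb_hostname="$(oc -n openshift-ingress get service router-default \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
  [ -n "${router_lb_hostname}" ] || sleep 1   # real script would poll less often
done
echo "router LB hostname: ${router_lb_hostname}"
```

Because this runs in the background during `create cluster`, the DNS records can be created the moment the LB exists rather than after the installer times out.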
Force-pushed e17256e to 10ad89a.
[REHEARSALNOTIFIER]

A total of 854 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here.

Interacting with pj-rehearse

Once you are satisfied with the results of the rehearsals, comment:
periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.18-aws-ipi-disc-priv-f28 uses
periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-aws-ipi-localzone-byo-subnet-ovn-day2-f7 does not set

/pj-rehearse ack

@mdbooth: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
/assign xueqzhan

Are you able to approve this? This looks big, but it's almost entirely code motion.

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo, mdbooth, petr-muller

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
When ADD_INGRESS_RECORDS_MANUALLY=yes we explicitly remove zone creation
from the manifests. This prevents the ingress operator from coming up,
which means that 'create cluster' always times out.
Ideally we would create the ingress records in advance, but because the
LB is created during installation we have no way of knowing what they
should be. There is also currently no mechanism to precreate a
loadbalancer and pass it to the installer. To avoid the timeout we need
a workaround.
With this change, instead of running after create cluster and assuming
that router-default has been created, we kick off a process that polls
for its creation. We create the DNS records immediately, during cluster
create, which means we don't have to wait for the timeout. This saves
20-30 minutes per run in testing.
Previous flow:
- `create cluster` -> times out after 40 mins due to missing DNS records

New flow:
- `create cluster` creates the cluster as before
- `create_ingress_dns_record` waits for the default ingress router to exist and creates DNS records for it as soon as it is observed
- `create cluster` succeeds within the initial timeout
- `wait-for install-complete` completes immediately

Summary by CodeRabbit