Skip to content

CORS-4058: Migrate AWS Destroy to SDK v2#9939

Merged
openshift-merge-bot[bot] merged 12 commits intoopenshift:mainfrom
barbacbd:CORS-4058-release-4.21
Sep 19, 2025
Merged

CORS-4058: Migrate AWS Destroy to SDK v2#9939
openshift-merge-bot[bot] merged 12 commits intoopenshift:mainfrom
barbacbd:CORS-4058-release-4.21

Conversation

@barbacbd
Copy link
Contributor

Reintroduce the changes for the AWS Destroy code.

The Big changes between this and the original PR #9736:

  • The Max Retry Attempts is default set (this was the major issue and reason for removing the changes before).
  • The default config and client creation happens through functions in the installconfig. Take advantage of these and use them where we can.
  • Cleanup some of the fixups from before.
  • Cleaup potential merge conflicts with packages being imported

** the bulk of the changes are to the ec2helpers file. All of the sdk v1 imports
are removed except for session as this one is engrained too many files currently.

pkg/destroy/aws/aws.go

** Add a client for ELB ELBV2 and IAM to the Cluster Removal Struct. Even though
these changes are mainly to ec2helpers, the other clients were required in for
certain operations.

** The rest of the file updates are alter ARN import to come from aws sdk v2.
** Remove/Change all imports from AWS sdk v1 to v2.

pkg/destroy/aws/errors.go
pkg/destroy/aws/ec2helpers.go

** Remove the Error checking/formatting function from ec2helpers and put the function
in the errors.go file.
** Remove all SDK v1 imports from elb helpers.
** Update Route53, s3, and efs services to sdk v2. This is slowly removing the
requirement for aws session.
** This caused updates to other packages such as aws/config, credentials, stscreds, and
a list of aws internal packages.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 11, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Sep 11, 2025

@barbacbd: This pull request references CORS-4058 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.21." or "openshift-4.21.", but it targets "4.20.0" instead.

Details

In response to this:

Reintroduce the changes for the AWS Destroy code.

The Big changes between this and the original PR #9736:

  • The Max Retry Attempts is default set (this was the major issue and reason for removing the changes before).
  • The default config and client creation happens through functions in the installconfig. Take advantage of these and use them where we can.
  • Cleanup some of the fixups from before.
  • Cleaup potential merge conflicts with packages being imported

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@barbacbd
Copy link
Contributor Author

/cc @patrickdillon
/cc @tthvo

Comment on lines 226 to 229
cfg, err := configv2.LoadDefaultConfig(context.TODO(), configv2.WithRegion(region))
if err != nil {
return nil, fmt.Errorf("failed loading default config: %w", err)
}
return createResourceTaggingClientWithConfig(cfg, region, endpoints), nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the v2 Config, we should get them from: GetConfigWithOptions(ctx, configv2.WithRegion(region).

This has the common settings (i.e. maxRetries) that we probably want to have.

@barbacbd barbacbd force-pushed the CORS-4058-release-4.21 branch from 13ecc48 to 831afca Compare September 16, 2025 14:53
pkg/destroy/aws:

** Alter the function name from HandleErrorCode to handleErrorCode. The initial thought was that
this function could be used in other areas of the code, but it will remain in destroy for now.

pkg/destroy/aws/shared.go:

** Remove the session import and uses in the file.
** Remove session from the imports. Added the agent handler to the configurations.
@barbacbd barbacbd force-pushed the CORS-4058-release-4.21 branch from 831afca to 4fcea32 Compare September 16, 2025 17:27
@tthvo
Copy link
Member

tthvo commented Sep 16, 2025

/test gofmt

@barbacbd barbacbd requested a review from tthvo September 16, 2025 19:04
Copy link
Member

@tthvo tthvo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall to me 🚀 !

I had some more questions and suggestions 🤔 I think there are a few more places where we can refactor:

  1. Move client constructs to the common client file pkg/asset/installconfig/aws/clients.go
  2. Move error handling and error constants to a common file pkg/asset/installconfig/aws/awserrors.go

These can be handled in CORS-4078 as a final touch :D

var awsErr awserr.Error
ok := errors.As(err, &awsErr)
if ok && awsErr.Code() == resourcegroupstaggingapi.ErrorCodeInvalidParameterException {
if strings.Contains(HandleErrorCode(err), "InvalidParameter") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this way of error checking (i.e. parsing error code) and the one above (i.e. errors.As to a specific error type) is equivalent.

Though, should we use the same way for consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems previously we were checking to the error code in both places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tthvo I believe that this is already the equivalent. HandleErrorCode will run errors.As and return the code.

route53Client, err := awssession.NewRoute53Client(ctx, awssession.EndpointOptions{
Region: region,
Endpoints: metadata.AWS.ServiceEndpoints,
}, "") // FIXME: Do we need an ARN here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, we don't need to set the hostedZoneRole arn here for this common route53 client as it is used for deleting "owned" hosted zone.

However, we do need to recreate the client with the role arn in file: pkg/destroy/aws/shared.go for "shared" hosted zone. See lines below:

https://github.com/openshift/installer/pull/9939/files#diff-fd89b44e69dab26b843e9e0a31cd945164e7aab3c8d15780fb5eafcfb1d9b836L209-L217

That block was deleted and we need to add that logic back.

Comment on lines 211 to 217
publicZoneClient := route53.New(session)
privateZoneClient := route53.New(session)
if o.HostedZoneRole != "" {
creds := stscreds.NewCredentials(session, o.HostedZoneRole)
privateZoneClient = route53.New(session, &aws.Config{Credentials: creds})
logger.Infof("Assuming role %s to destroy records in private hosted zone", o.HostedZoneRole)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this block is still needed. If not, the handler will use the common Route53 and completely ignore the hosted zone role.

Thus, we need to recreate the route53 client here:

  1. A publicZoneClient: use the common Route53 client.
  2. A privateZoneClient: if hostZoneRole is specified, create a separate client with AWS STS.

@tthvo
Copy link
Member

tthvo commented Sep 17, 2025

/test e2e-aws-ovn-custom-iam-profile e2e-aws-ovn-public-ipv4-pool e2e-aws-ovn-upi e2e-aws-ovn-heterogeneous

Set a Destroy User Agent.
Cleanup pointer references to use the aws sdk.
Copy link
Member

@tthvo tthvo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Nice, we have still have a discussion for #9939 (comment). But this looks good!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 17, 2025
@patrickdillon
Copy link
Contributor

I'm not sure about the other CI jobs--we can check back when they finish running--but I have some concerns about the destroy in the aws custom dns job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/9939/pull-ci-openshift-installer-main-e2e-aws-custom-dns-techpreview/1968339873618202624/artifacts/e2e-aws-custom-dns-techpreview/ipi-deprovision-deprovision/artifacts/

  1. The endpoint log messages are overwhelming
  2. It looks like the VPC is not deleting because of a dependency

@tthvo
Copy link
Member

tthvo commented Sep 17, 2025

  1. The endpoint log messages are overwhelming

The destroy finished just now I think:

time="2025-09-17T18:08:54Z" level=info msg=Deleted id=vpc-0d44eceb3e9433e2f resourceType=vpc
  1. It looks like the VPC is not deleting because of a dependency
time="2025-09-17T18:08:53Z" level=debug msg="Custom EC2 endpoint not found, using default endpoint"
time="2025-09-17T18:08:53Z" level=debug msg="Custom EC2 endpoint not found, using default endpoint"

These logs come from endpoint resolver for AWS SDK v2. They have been so noisy. Let me remove them in #9907

@patrickdillon
Copy link
Contributor

The destroy finished just now I think:

time="2025-09-17T18:08:54Z" level=info msg=Deleted id=vpc-0d44eceb3e9433e2f resourceType=vpc

Oh, yeah I think I actually just lost it in the error messages. You're right, no issues with destroy!

  1. It looks like the VPC is not deleting because of a dependency
time="2025-09-17T18:08:53Z" level=debug msg="Custom EC2 endpoint not found, using default endpoint"
time="2025-09-17T18:08:53Z" level=debug msg="Custom EC2 endpoint not found, using default endpoint"

These logs come from endpoint resolver for AWS SDK v2. They have been so noisy. Let me remove them in #9907

Ah, that will be a separate PR, so this LGTM.

Thanks!

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2025
@tthvo
Copy link
Member

tthvo commented Sep 17, 2025

/cc @yunjiang29

@openshift-ci openshift-ci bot requested a review from yunjiang29 September 17, 2025 19:36
@tthvo
Copy link
Member

tthvo commented Sep 17, 2025

/test e2e-aws-ovn

@tthvo
Copy link
Member

tthvo commented Sep 17, 2025

/test e2e-aws-ovn-custom-iam-profile e2e-aws-ovn-public-ipv4-pool e2e-aws-ovn-upi e2e-aws-ovn-heterogeneous

@barbacbd
Copy link
Contributor Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 18, 2025

@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn c4f0a7a link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-vsphere-static-ovn c4f0a7a link false /test e2e-vsphere-static-ovn
ci/prow/e2e-aws-ovn-heterogeneous c4f0a7a link false /test e2e-aws-ovn-heterogeneous
ci/prow/e2e-vsphere-ovn-multi-network c4f0a7a link false /test e2e-vsphere-ovn-multi-network
ci/prow/e2e-aws-custom-dns-techpreview c4f0a7a link false /test e2e-aws-custom-dns-techpreview
ci/prow/e2e-aws-ovn-single-node c4f0a7a link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@barbacbd
Copy link
Contributor Author

/verified by ci/prow/e2e-aws-ovn

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 19, 2025
@openshift-ci-robot
Copy link
Contributor

@barbacbd: This PR has been marked as verified by ci/prow/e2e-aws-ovn.

Details

In response to this:

/verified by ci/prow/e2e-aws-ovn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot bot merged commit bad1b52 into openshift:main Sep 19, 2025
28 of 34 checks passed
patrickdillon added a commit to patrickdillon/installer that referenced this pull request Sep 29, 2025
…ease-4.21"

This reverts commit bad1b52, reversing
changes made to beebc39.
patrickdillon added a commit that referenced this pull request Sep 30, 2025
no-jira: Revert "Merge pull request #9939 from barbacbd/CORS-4058-release-4.21"
resourcegroupstaggingapi.New(awsSession, aws.NewConfig().WithRegion(endpoints.UsGovWest1RegionID)))
case endpointUSGovEast1, endpointUSGovWest1:
if o.Region != endpointUSGovWest1 {
tagClient, err := createResourceTaggingClient(endpointUSGovWest1, o.endpoints)
Copy link
Contributor

@sadasu sadasu Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? When the region is not endpointUSGovWest1 , we create a resource tagging client for the same? If it is, could you please add a comment why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have little context, but that block has always been there. For example, in release 4.19:

case endpoints.UsGovEast1RegionID, endpoints.UsGovWest1RegionID:
if o.Region != endpoints.UsGovWest1RegionID {
tagClients = append(tagClients,
resourcegroupstaggingapi.New(awsSession, aws.NewConfig().WithRegion(endpoints.UsGovWest1RegionID)))
}

Git Blame returns PR #4042. IIUC, for government cloud partition, there are route53 resources that are only available in us-west-gov-1, that will have to be cleaned up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants