CNTRLPLANE-1857: feat(hypershift): add explicit DNS cleanup for AKS test deprovision #71319
Update: Added conformance workflow fix
Good catch on the main AKS workflows! I've added a commit to also update the
Changes in latest commit:
Note on hypershift-azure-aks-e2e workflow:
The e2e workflow is structured differently - it installs the HyperShift operator and then runs test suites that create/destroy their own clusters internally. It doesn't create a single cluster in the pre steps, so there's no specific cluster-name available in the post steps for cleanup. For the e2e workflow, we may need a different cleanup approach (like cleaning up all records older than X hours), but that's riskier as it could interfere with concurrent tests. Will monitor to see if this workflow is also leaking records or if the test suite handles cleanup properly.
Total workflows now protected from DNS leaks: 9 (8 specialized deprovision chains + conformance workflow)
Update: Added e2e workflow fix + ran make jobs
Added another commit to update the
All AKS workflows now protected (10 total):
Specialized Deprovision Chains (8):
Main Workflows (2):
For the e2e workflow, the cleanup will gracefully skip if there's no cluster-name file (since the e2e tests manage multiple ephemeral clusters internally). But this provides protection if the framework does write cluster names.
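The graceful skip described above could look roughly like the sketch below. It assumes the step reads the name from a file called cluster-name inside the directory it is given (for example ci-operator's SHARED_DIR); the helper name read_cluster_name and the exact file location are illustrative assumptions, not taken from the actual step script.

```shell
# Sketch only: the file name "cluster-name" and this helper are assumptions.

# Print the cluster name stored in <dir>/cluster-name.
# Returns 1 and prints nothing on stdout when the file is missing or empty,
# so the caller can skip cleanup instead of failing the job.
read_cluster_name() {
  file="$1/cluster-name"
  if [ ! -s "$file" ]; then
    echo "No cluster-name file in $1; skipping DNS cleanup" >&2
    return 1
  fi
  # Drop the trailing newline and any stray whitespace.
  tr -d '[:space:]' < "$file"
}

# Possible use in a step script:
#   CLUSTER_NAME="$(read_cluster_name "${SHARED_DIR}")" || exit 0
```

Exiting 0 on a missing file keeps the post step from turning an otherwise green e2e run red.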
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@bryan-cox: This pull request references CNTRLPLANE-1857 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Fix Pushed: DNS Cleanup Now Works for E2E Tests
Issue Found
After analyzing the test artifacts from the previous run, I discovered why the cleanup wasn't working:
Root Cause: The script was searching for DNS records matching
The Fix
Updated the cleanup script to detect when it has an AKS cluster name and switch to "all-records" mode:
Before:
After:
Why All-Records Mode is Safe
Testing
The next test run should show:
And successfully clean up all the leaked records.
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
3f10468 to 3fa3b94
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Add explicit DNS cleanup step to prevent DNS record leaks in Azure AKS
test workflows. The issue was identified when the DNS zone hit the 10,000
record limit due to leaked records from external-dns.
Root cause:
- HyperShift cluster deletion triggers external-dns cleanup
- AKS management cluster gets deleted immediately after
- external-dns controller is killed before it can process deletion events
- DNS records accumulate over time in shared DNS zones
Solution:
Query the management cluster for all HostedClusters and extract their
infraIDs, then clean up DNS records matching those infraIDs. This approach:
- Works for e2e tests (multiple clusters with different infraIDs)
- Works for conformance tests (single cluster)
- Only cleans up DNS records from THIS test run
- Safe for shared DNS zones (won't delete concurrent test records)
Implementation:
1. Query: kubectl get hostedclusters --all-namespaces -o jsonpath='{.items[*].spec.infraID}'
2. Find DNS records containing any of those infraIDs
3. Delete only those matching records
Example:
- HostedCluster has infraID: autoscaling-9hpz5
- DNS records: api-autoscaling-9hpz5, a-api-autoscaling-9hpz5-external-dns
- Cleanup: Deletes records containing 'autoscaling-9hpz5'
- Preserves: Records from other tests (different infraIDs)
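The three implementation steps and the example above can be sketched in shell as follows. This is a minimal illustration, not the step's actual script: the function names and the resource-group and zone arguments are assumptions, and it relies on `az network dns record-set list` reporting each record's type as a path ending in the record type (for example Microsoft.Network/dnszones/TXT). cleanup_dns_records is only defined here, not called.

```shell
# Sketch only: function names and arguments are illustrative assumptions.

# Read "name<TAB>type" lines on stdin and print the lines whose record name
# contains any of the infraIDs passed as arguments.
filter_records_by_infra_id() {
  while read -r name rtype; do
    for id in "$@"; do
      [ -n "$id" ] || continue
      case "$name" in
        *"$id"*) printf '%s\t%s\n' "$name" "$rtype"; break ;;
      esac
    done
  done
}

# Wire the filter to kubectl and az: query infraIDs, list record sets,
# delete only the matching ones.
cleanup_dns_records() {
  resource_group="$1"
  zone="$2"
  # Iterate over the returned list: one infraID per line.
  infra_ids="$(kubectl get hostedclusters --all-namespaces \
    -o jsonpath='{range .items[*]}{.spec.infraID}{"\n"}{end}')"
  if [ -z "$infra_ids" ]; then
    echo "No HostedClusters found; nothing to clean up" >&2
    return 0
  fi
  # Word splitting of $infra_ids is intentional: one argument per infraID.
  az network dns record-set list -g "$resource_group" -z "$zone" \
    --query '[].[name, type]' -o tsv |
    filter_records_by_infra_id $infra_ids |
    while read -r name rtype; do
      # Keep the last path segment of the type and lowercase it (TXT -> txt).
      rtype="$(printf '%s' "${rtype##*/}" | tr '[:upper:]' '[:lower:]')"
      az network dns record-set "$rtype" delete -g "$resource_group" \
        -z "$zone" -n "$name" --yes </dev/null
    done
}
```

Matching is by substring, as in the example above, so it assumes no infraID is itself a substring of a record name belonging to a different cluster.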
Updated workflows:
- hypershift-azure-aks-conformance (single cluster)
- hypershift-azure-aks-e2e (multiple clusters)
- 8 specialized deprovision chains
Fixes: CNTRLPLANE-1857
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
3fa3b94 to 5a4b330
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Add dedicated e2e-aks-override job to test Control Plane Operator (CPO)
overrides on Azure AKS platform. This complements the existing
e2e-aws-override job.
Key features:
- Triggered only when overrides.yaml is modified (run_if_changed)
- Sets TEST_CPO_OVERRIDE=1 to enable override testing
- Uses same workflow as regular AKS tests (hypershift-azure-aks-e2e)
- Paired with runTests field in overrides.yaml for granular control
This allows testing Azure CPO overrides independently from AWS overrides,
saving CI resources by skipping tests for platforms not being modified.
Related:
- hypershift PR openshift#7206 (runTests field implementation)
- CNTRLPLANE-1893
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bryan-cox
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
[REHEARSALNOTIFIER]
A total of 120 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here
Interacting with pj-rehearse
Comment:
Once you are satisfied with the results of the rehearsals, comment:
/pj-rehearse pull-ci-openshift-hypershift-main-e2e-aks
@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@bryan-cox: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary
Fixes DNS record leaks in Azure DNS zones by adding explicit cleanup before AKS cluster deletion.
Problem
We recently hit the maximum DNS zone record limit of 10,000 records in the aks-e2e.hypershift.azure.devcluster.openshift.com zone. Investigation revealed DNS records were leaking during AKS e2e and conformance test runs.
Root Cause: Race condition in the deprovision workflow
Reference: /contrib/ci/Azure/Manage Azure Cloud Resources/Deleting-DNS-Zone-Recordsets.md in the hypershift repo documents this known issue.
Solution
Created a new step-registry step that explicitly cleans up DNS records after the HostedCluster is destroyed but before the AKS cluster is deleted.
Implementation
New step: hypershift-azure-cleanup-external-dns
Updated Deprovision Flow:
Changes
New Files
- ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.yaml
- ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-commands.sh
- ci-operator/step-registry/hypershift/azure/cleanup-external-dns/hypershift-azure-cleanup-external-dns-ref.metadata.json
- ci-operator/step-registry/hypershift/azure/cleanup-external-dns/OWNERS
Modified Files
Updated 8 AKS HyperShift deprovision chains to include the cleanup step:
- cucushift-installer-rehearse-azure-aks-hypershift-base-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-cilium-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-byo-vnet-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-ephemeral-creds-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-etcd-disk-encryption-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-registry-overrides-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-heterogeneous-deprovision-chain.yaml
- cucushift-installer-rehearse-azure-aks-hypershift-disaster-recovery-infra-deprovision-chain.yaml
Test Plan
Related
Jira: CNTRLPLANE-1857
🤖 Generated with Claude Code