fix: EKS cluster provisioning and team-operator deployment #114
Merged
ian-flores merged 12 commits into main on Feb 10, 2026
Fixes race conditions during fresh cluster creation where resources were created before their dependencies were ready:
- Node groups now depend on Tigera/Calico CNI being installed first
- EBS CSI addon now depends on node groups being ready
- StorageClassPatch now depends on EBS CSI addon being healthy
This ensures the correct creation order: Tigera CNI → Node Groups → EBS CSI Addon → StorageClassPatch
The previous commit incorrectly made node groups depend on Tigera, causing a deadlock during fresh cluster creation:
- Tigera operator needs nodes to schedule on
- Node groups waited for Tigera to complete first
This restores the correct parallel execution:
- Node groups and Tigera run concurrently
- Tigera operator uses hostNetwork, so it schedules on NotReady nodes
- Once Tigera deploys CNI, nodes become Ready
The StorageClassPatch dependency on EBS CSI addon (from the previous commit in aws_eks_cluster.py) is still in place and correct.
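The deadlock described in this commit message can be illustrated with a small dependency-graph sketch. This is not the Pulumi program from the PR: the resource names are plain strings chosen for illustration, and the graph mixes the explicit dependency added by the earlier commit with the runtime requirement that the Tigera operator needs nodes to schedule on. It uses Python's standard `graphlib` module, which raises `CycleError` when no valid creation order exists.

```python
from graphlib import CycleError, TopologicalSorter


def creation_order(needs):
    """Return one valid creation order for a {resource: prerequisites} graph.

    Raises CycleError if the graph contains a cycle (that is, a deadlock).
    """
    return list(TopologicalSorter(needs).static_order())


def has_deadlock(needs):
    """Report whether the dependency graph can never be fully created."""
    try:
        creation_order(needs)
    except CycleError:
        return True
    return False


# Earlier commit (illustrative names): node groups wait for Tigera, while
# the Tigera operator itself needs nodes to schedule on.
deadlocked = {
    "node_groups": {"tigera"},
    "tigera": {"node_groups"},
}

# Restored graph: node groups and Tigera have no edge between them, so they
# can be created concurrently. Only the StorageClassPatch -> EBS CSI addon
# dependency, which the commit message says is still in place, is modeled.
restored = {
    "node_groups": set(),
    "tigera": set(),
    "ebs_csi_addon": set(),
    "storage_class_patch": {"ebs_csi_addon"},
}

if __name__ == "__main__":
    print("deadlocked graph has a cycle:", has_deadlock(deadlocked))
    print("restored graph has a cycle:", has_deadlock(restored))
    print("one valid order:", creation_order(restored))
```

The point of the sketch is that the cycle only becomes visible once the runtime scheduling requirement is written down as an edge; a dependency declared in code cannot reveal it on its own.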
…on resources
The helm-migration ServiceAccount was failing to create because the posit-team-system namespace didn't exist yet. The namespace was expected to be created by the Helm chart, but migration resources run before it.
- Add explicit creation of posit-team-system namespace in Pulumi
- Add dependency from migration ServiceAccount to the namespace
Updating default team-operator Helm chart version for partners01-staging clean redeploy after migration issues with v1.5.0.
Adding skip_await=True allows the Helm release to complete even if pods aren't fully ready. This lets us see actual failures in the cluster instead of misleading "service not found" errors from Pulumi's wait logic.
Control rooms don't output mimir_password - only workloads do. The persistent step was failing with "mimir_password not found in outputs" when running on control rooms because the check wasn't guarded.
The deployment succeeded - the skip_await was a debug measure that's no longer needed. Reverting to default behavior.
The old Calico chart repository (docs.projectcalico.org/charts) is returning 500 errors. Updating to the new Tigera repository URL (docs.tigera.io/calico/charts).
timtalbot reviewed Feb 6, 2026
    if newMimirPassword != currentMimirPassword {
        slog.Info("Updating control room mimir password", "target", s.DstTarget.Name())
        return updateControlRoomMimirPassword(ctx, s.SrcTarget, s.DstTarget.Name(), newMimirPassword)
    if !s.DstTarget.ControlRoom() && s.DstTarget.CloudProvider() == types.AWS {
Contributor
I don't think this is right, Azure workloads also run Mimir and will need to do this
Contributor
Author
Fixed in 9e70212 — removed the CloudProvider() == types.AWS condition.
        ),
    )

    # Create posit-team-system namespace for the operator and migration resources
Contributor
Will this require importing the existing namespace in all current workloads?
Contributor
Author
I tested it in ganso01-staging and you were right. I removed the explicit namespace resource in 7b64328 and let Helm continue managing it.
Azure workloads also run Mimir and need password sync with the control room. The AWS guard was incorrectly applied to this block.
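The resulting guard can be sketched as follows. The reviewed code is Go; this Python version is only an illustration of the condition after the AWS check was removed, and the `Target` class, its fields, and the function name are hypothetical stand-ins rather than the project's types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a deployment target."""
    name: str
    control_room: bool
    cloud_provider: str  # illustrative values: "aws" or "azure"


def needs_mimir_password_sync(dst: Target) -> bool:
    """Decide whether the Mimir password sync should run for this target.

    Control rooms do not output mimir_password, so they are skipped.
    There is deliberately no cloud-provider check: Azure workloads run
    Mimir too and need the same sync as AWS workloads.
    """
    return not dst.control_room


if __name__ == "__main__":
    for target in (
        Target("aws-workload", control_room=False, cloud_provider="aws"),
        Target("azure-workload", control_room=False, cloud_provider="azure"),
        Target("control-room", control_room=True, cloud_provider="aws"),
    ):
        print(target.name, needs_mimir_password_sync(target))
```

Keeping the decision in one small predicate makes it easy to test both cases at once: control rooms are skipped, and every workload is covered regardless of provider.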
The Helm release already creates this namespace via create_namespace=True. Adding it as a separate Pulumi resource conflicts with the existing namespace on running workloads.
timtalbot approved these changes Feb 6, 2026
Summary
This PR bundles several fixes discovered during partners01-staging cluster provisioning:
EKS Access Entries
- … (eks_access_entries.enabled = True)
Resource Dependencies
- StorageClassPatch dependency on EBS CSI addon (fixes race condition)
Team Operator
- posit-team-system namespace resource (conflicts with existing namespace created by Helm's create_namespace=True)
Mimir
Tigera/Calico
- docs.projectcalico.org to docs.tigera.io (old URL returning 500 errors)
Test plan
Fixes #6