fix: EKS cluster provisioning and team-operator deployment #114

Merged
ian-flores merged 12 commits into main from fix-eks-resource-dependencies on Feb 10, 2026
ian-flores (Contributor) commented Feb 5, 2026

Summary

This PR bundles several fixes discovered during partners01-staging cluster provisioning:

EKS Access Entries

  • Enable access entries by default (eks_access_entries.enabled = True)

Resource Dependencies

  • Add StorageClassPatch dependency on EBS CSI addon (fixes race condition)
  • Restore parallel execution for Tigera and node groups (Tigera uses hostNetwork)

Team Operator

  • Bump default chart version to v1.6.0
  • Remove explicit posit-team-system namespace resource (conflicts with existing namespace created by Helm's create_namespace=True)

Mimir

  • Remove AWS-only guard from mimir password sync (Azure workloads also run Mimir)

Tigera/Calico

  • Update Helm chart repository URL from docs.projectcalico.org to docs.tigera.io (old URL returning 500 errors)

Test plan

  • Tested on partners01-staging fresh cluster provisioning
  • Tested on ganso01-staging (existing cluster)
  • Team operator running successfully
  • All controllers started (Site, Flightdeck, Connect, Workbench, etc.)

Fixes #6

Fixes race conditions during fresh cluster creation where resources
were created before their dependencies were ready:

- Node groups now depend on Tigera/Calico CNI being installed first
- EBS CSI addon now depends on node groups being ready
- StorageClassPatch now depends on EBS CSI addon being healthy

This ensures the correct creation order:
Tigera CNI → Node Groups → EBS CSI Addon → StorageClassPatch

The previous commit incorrectly made node groups depend on Tigera,
causing a deadlock during fresh cluster creation:
- Tigera operator needs nodes to schedule on
- Node groups waited for Tigera to complete first

This restores the correct parallel execution:
- Node groups and Tigera run concurrently
- Tigera operator uses hostNetwork, so it schedules on NotReady nodes
- Once Tigera deploys CNI, nodes become Ready

The StorageClassPatch dependency on EBS CSI addon (from the previous
commit in aws_eks_cluster.py) is still in place and correct.
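
The ordering argument in the two commit messages above can be sketched with Python's standard-library `graphlib`. This is only an illustration, not the Pulumi code from this PR: the resource names are hypothetical labels, the `ebs_csi_addon` → `node_groups` edge is assumed to carry over from the first commit (the PR summary does not confirm it), and Tigera's runtime need for nodes is modeled as an ordinary graph edge so that the deadlock shows up as a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical model: each resource maps to the set of resources it waits for.
PARALLEL_GRAPH = {
    "tigera": set(),                      # runs alongside node groups
    "node_groups": set(),                 # no longer waits for Tigera
    "ebs_csi_addon": {"node_groups"},     # assumption: kept from the first commit
    "storage_class_patch": {"ebs_csi_addon"},
}

# The earlier commit made node groups wait for Tigera, while the Tigera
# operator itself needs nodes to schedule on (modeled here as an edge).
DEADLOCKED_GRAPH = {
    **PARALLEL_GRAPH,
    "node_groups": {"tigera"},
    "tigera": {"node_groups"},
}


def creation_waves(graph):
    """Group resources into waves that could be created concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def has_deadlock(graph):
    """Return True if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False
```

In this model, `creation_waves(PARALLEL_GRAPH)` places `node_groups` and `tigera` in the same first wave, while `has_deadlock(DEADLOCKED_GRAPH)` reports the mutual wait that the second commit removes.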
…on resources

The helm-migration ServiceAccount was failing to create because the
posit-team-system namespace didn't exist yet. The namespace was expected
to be created by the Helm chart, but migration resources run before it.

- Add explicit creation of posit-team-system namespace in Pulumi
- Add dependency from migration ServiceAccount to the namespace

ian-flores requested a review from timtalbot February 5, 2026 22:03
ian-flores marked this pull request as ready for review February 5, 2026 22:04
ian-flores requested a review from a team as a code owner February 5, 2026 22:04

Updating default team-operator Helm chart version for partners01-staging
clean redeploy after migration issues with v1.5.0.

Adding skip_await=True allows the Helm release to complete even if pods
aren't fully ready. This lets us see actual failures in the cluster instead
of misleading "service not found" errors from Pulumi's wait logic.

Control rooms don't output mimir_password - only workloads do. The
persistent step was failing with "mimir_password not found in outputs"
when running on control rooms because the check wasn't guarded.

The deployment succeeded - the skip_await was a debug measure that's no
longer needed. Reverting to default behavior.

ian-flores marked this pull request as draft February 5, 2026 22:34

The old Calico chart repository (docs.projectcalico.org/charts) is
returning 500 errors. Updating to the new Tigera repository URL
(docs.tigera.io/calico/charts).

ian-flores marked this pull request as ready for review February 5, 2026 22:50
ian-flores changed the title from "fix(eks): add StorageClassPatch dependency on EBS CSI addon" to "fix: EKS cluster provisioning and team-operator deployment" Feb 5, 2026

Comment thread lib/steps/persistent.go Outdated
if newMimirPassword != currentMimirPassword {
slog.Info("Updating control room mimir password", "target", s.DstTarget.Name())
return updateControlRoomMimirPassword(ctx, s.SrcTarget, s.DstTarget.Name(), newMimirPassword)
if !s.DstTarget.ControlRoom() && s.DstTarget.CloudProvider() == types.AWS {
Contributor

I don't think this is right, Azure workloads also run Mimir and will need to do this

Contributor Author

Fixed in 9e70212: removed the CloudProvider() == types.AWS condition.
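
For readers following the thread, the effect of dropping the cloud-provider condition can be sketched as below. The real change is in Go in lib/steps/persistent.go; this is a Python paraphrase under stated assumptions, and every name here (`CloudProvider`, `should_sync_before`, `should_sync_after`, `mimir_password_for_sync`) is hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Mapping, Optional


class CloudProvider(Enum):
    AWS = "aws"
    AZURE = "azure"


def should_sync_before(is_control_room: bool, provider: CloudProvider) -> bool:
    """Guard as first proposed: only AWS workloads sync the password."""
    return not is_control_room and provider is CloudProvider.AWS


def should_sync_after(is_control_room: bool, provider: CloudProvider) -> bool:
    """Guard after the fix: every workload syncs; the provider is ignored."""
    return not is_control_room


def mimir_password_for_sync(
    outputs: Mapping[str, str],
    is_control_room: bool,
    provider: CloudProvider,
) -> Optional[str]:
    """Return the password to sync, or None when the target is a control room."""
    if not should_sync_after(is_control_room, provider):
        # Control rooms do not output mimir_password, so skip the lookup.
        return None
    if "mimir_password" not in outputs:
        raise KeyError("mimir_password not found in outputs")
    return outputs["mimir_password"]
```

Under the first guard an Azure workload would be skipped; under the revised guard it is synced, while control rooms are still skipped before the output lookup can fail.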

),
)

# Create posit-team-system namespace for the operator and migration resources
Contributor

Will this require importing the existing namespace in all current workloads?

Contributor Author

I tested it in ganso01-staging and you were right. I removed the explicit namespace resource in 7b64328 and let Helm continue managing it.

Base automatically changed from feat-eks-access-entries-default to main February 6, 2026 15:45

Azure workloads also run Mimir and need password sync with the control
room. The AWS guard was incorrectly applied to this block.

The Helm release already creates this namespace via create_namespace=True.
Adding it as a separate Pulumi resource conflicts with the existing
namespace on running workloads.

ian-flores requested a review from timtalbot February 6, 2026 18:44
ian-flores merged commit 10ad6d7 into main Feb 10, 2026
6 checks passed
ian-flores deleted the fix-eks-resource-dependencies branch February 10, 2026 14:53
Development

Successfully merging this pull request may close these issues.

EKS Addon Create Failing - "Addon Already Exists"

2 participants