Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure deplyoment await logic uses the latest deployment object #2943

Merged
merged 3 commits into from
Apr 11, 2024

Conversation

rquitales
Copy link
Contributor

@rquitales rquitales commented Apr 11, 2024

Proposed changes

This PR ensures that we compare deployment revisions based on the old live object, and the most current deployment object.

This PR does the following:

  • Add a repro test that fails before the change is applied
  • Only watch for deployment events until the deployment object has been reconciled by the deployment controller
  • Update existing unit tests to account for blocking unbuffered channels, and faked objects that did not have .status.observedGeneration

Related issues (optional)

Fixes: #2941

Copy link

Does the PR have any schema changes?

Looking good! No breaking changes found.
No new resources/functions.

Copy link

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 18 lines in your changes are missing coverage. Please review.

Project coverage is 29.16%. Comparing base (78ab387) to head (2ef5470).

Files Patch % Lines
provider/pkg/await/deployment.go 33.33% 14 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2943      +/-   ##
==========================================
+ Coverage   27.63%   29.16%   +1.52%     
==========================================
  Files          54       61       +7     
  Lines        7862     8194     +332     
==========================================
+ Hits         2173     2390     +217     
- Misses       5504     5600      +96     
- Partials      185      204      +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rquitales rquitales requested review from blampe, EronWright and a team April 11, 2024 20:20
Comment on lines 347 to 349
// We run the event handlers in goroutines to avoid blocking the main event loop.
// We also need to ensure that only one of each event processes can run at a time.
var depMutex, rsMutex, podMutex, pvcMutex sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried this introduces room for more race conditions on dia, because we can still be handling multiple events at once. Re: not blocking the main loop, do you have an idea of where we're getting bottlenecked?

I wonder if there's an alternative where we simply ignore replicaset events until we see a deployment event with a generation change. Something like:

select {
case event := <-deploymentEvents:
    deploying := dia.processDeploymentEvent(event)
case event := <-replicaSetEvents:
    if deploying {
        dia.processReplicaSetEvent(event)
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the initial approach in this PR may be risky. We also shouldn't be discarding replicaset events since we need those events to inform us of status updates so I'm hesitant to go down the approach you suggested.

I've reworked this PR so that we continuously poll for deployment events first, and ensure that the deployment controller has informed us if a rollout was triggered or not before we proceed onto processing other events. This ensures that when we check the replicaset, the deployment controller would have informed us if a roll-out has been triggered.

provider/pkg/await/deployment.go Outdated Show resolved Hide resolved
@rquitales rquitales marked this pull request as draft April 11, 2024 21:25
@rquitales rquitales force-pushed the rquitales/fix-deployment-await-error branch from 173f462 to 89affa7 Compare April 11, 2024 22:19
@@ -1537,7 +1540,7 @@ func deploymentUpdated(namespace, name, revision string) *unstructured.Unstructu
}
},
"status": {
"observedGeneration": 1,
"observedGeneration": 2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: the unit test had the incorrect observedGeneration for this object.

observedGeneration=generation before processing
@rquitales rquitales force-pushed the rquitales/fix-deployment-await-error branch from 89affa7 to 2ef5470 Compare April 11, 2024 22:42
@@ -664,12 +665,12 @@ func Test_Apps_Deployment_Without_PersistentVolumeClaims(t *testing.T) {
// The Deployment specifically does not reference the PVC in
// its spec. Therefore, the Deployment should succeed no
// matter what phase the PVC is in as it does not have to wait on it.
deployments <- watchAddedEvent(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The event is reordered since we're using unbuffered channels which will block until there is a value receiver.

@rquitales rquitales marked this pull request as ready for review April 11, 2024 22:45
@rquitales rquitales requested a review from blampe April 11, 2024 22:46
Copy link
Contributor

@EronWright EronWright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

continue
}

if deployment.GetGeneration() == observedGeneration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would generation <= observedGeneration be technically safer here? I don't know if that's actually observable in practice.

Copy link
Contributor Author

@rquitales rquitales Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the observedGeneration to match exactly the generation of the deployment object so that the status matches the right manifest spec. generation != observedGeneration is currently why this bug exists, though it would be a bug in k8s if generation < observeredGeneration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too stared at the comparison for a few minutes. Rollout code will sometimes have a 'planned' generation that they got from the call to Apply, and await an observed generation that is greater-or-equal to the planned generation. This code is not doing that; it is simply comparing generations within a given object and has no planned generation.

Comment on lines +339 to +343
// Before we start processing any ReplicaSet, PVC or Pod events, we need to wait until the Deployment controller
// has seen and updated the status of the Deployment object.
if err := dia.waitUntilDeploymentControllerReconciles(deploymentEvents, timeout); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@rquitales rquitales enabled auto-merge (squash) April 11, 2024 23:04
@rquitales rquitales merged commit c8150eb into master Apr 11, 2024
18 checks passed
@rquitales rquitales deleted the rquitales/fix-deployment-await-error branch April 11, 2024 23:13
lumiere-bot bot added a commit to coolguy1771/home-ops that referenced this pull request Apr 12, 2024
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [@pulumi/kubernetes](https://pulumi.com)
([source](https://togithub.com/pulumi/pulumi-kubernetes)) | dependencies
| minor | [`4.9.1` ->
`4.10.0`](https://renovatebot.com/diffs/npm/@pulumi%2fkubernetes/4.9.1/4.10.0)
|

---

> [!WARNING]
> Some dependencies could not be looked up. Check the Dependency
Dashboard for more information.

---

### Release Notes

<details>
<summary>pulumi/pulumi-kubernetes (@&#8203;pulumi/kubernetes)</summary>

###
[`v4.10.0`](https://togithub.com/pulumi/pulumi-kubernetes/blob/HEAD/CHANGELOG.md#4100-April-11-2024)

[Compare
Source](https://togithub.com/pulumi/pulumi-kubernetes/compare/v4.9.1...v4.10.0)

- ConfigGroup V2
([pulumi/pulumi-kubernetes#2844)
- ConfigFile V2
([pulumi/pulumi-kubernetes#2862)
- Bugfix for ambiguous kinds
([pulumi/pulumi-kubernetes#2889)
- \[yaml/v2] Support for resource ordering
[pulumi/pulumi-kubernetes#2894)
- Bugfix for deployment await logic not referencing the correct
deployment status
([pulumi/pulumi-kubernetes#2943)

##### New Features

A new MLC-based implementation of `ConfigGroup` and of `ConfigFile` is
now available in the "yaml/v2" package. These resources are
usable in all Pulumi languages, including Pulumi YAML and in the Java
Pulumi SDK.

Note that transformations aren't supported in this release (see
[pulumi/pulumi#12996).

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Renovate
Bot](https://togithub.com/renovatebot/renovate).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4yODcuMSIsInVwZGF0ZWRJblZlciI6IjM3LjI4Ny4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZS9ucG0iLCJ0eXBlL21pbm9yIl19-->

Co-authored-by: lumiere-bot[bot] <98047013+lumiere-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Faulty resource update doesn't result in error of pulumi up
3 participants