
Fix hanging updates for deployments #1596

Merged · 7 commits · Jun 15, 2021

Conversation

viveklak (Contributor) commented on May 28, 2021

Fix for #1502

In my investigation of the above issue, I determined that there is a TOCTTOU-style race in the state maintained by the await logic for deployments. Specifically, the initial state is seeded during Read(), where information on ReplicaSet generations, Pods, and PVCs is populated. The watch is then started, but from the most recent resource version, and the current await logic is quite brittle against missed events. In my repro, the await logic would hang forever waiting to see updates for an older generation of the ReplicaSet (as referenced during the initial Read()), while the watch began from a state in which that expected (old) ReplicaSet had already been deleted. This change seeds the watches to begin at the resource versions we initially read.

In some ways I am not thrilled with this approach, since it further perpetuates the highly stateful nature of the await logic. As part of the refactor/rearchitecture of the await logic for complex resources such as deployments, we should definitely consider an approach that does not require strong consistency of the event stream. The biggest caveats with the current approach:

  1. By pinning the watch to a resource version, there is a possibility that the version we want to read from is no longer retained by the API server (e.g. if Await() is significantly delayed after Read()), in which case we would get a 410 Gone response. I am not sure how we should recover in that situation.
  2. Setting the resource version might incur an additional performance penalty.
  3. I am not entirely sure how best to test this. My repro environment was a bit convoluted, though I will separately share the preliminary performance/load-testing scaffold I used to track down the issue (and verify the fix).
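The race described in the PR body can be illustrated with a small self-contained simulation (hypothetical types and names; the real implementation uses client-go's watch machinery): a watch started "from latest" misses a deletion that happened between Read() and Watch(), while a watch seeded at the read resource version replays the gap.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a watch event: a change to a named
// object at a monotonically increasing resource version.
type event struct {
	version int
	kind    string // "ADDED", "MODIFIED", "DELETED"
	name    string
}

// store simulates the API server's retained event history.
type store struct{ events []event }

// watchFrom returns all events with version > rv, mimicking a watch
// started at resource version rv.
func (s *store) watchFrom(rv int) []event {
	var out []event
	for _, e := range s.events {
		if e.version > rv {
			out = append(out, e)
		}
	}
	return out
}

// latest returns the most recent resource version in the store.
func (s *store) latest() int { return s.events[len(s.events)-1].version }

// sawDeletion reports whether the event stream contains a DELETED event
// for the named object.
func sawDeletion(events []event, name string) bool {
	for _, e := range events {
		if e.kind == "DELETED" && e.name == name {
			return true
		}
	}
	return false
}

func main() {
	// Read() happens at version 5, when the old ReplicaSet still exists.
	readVersion := 5
	s := &store{events: []event{
		{5, "ADDED", "rs-old"},
		{6, "DELETED", "rs-old"}, // happens between Read() and Watch()
		{7, "ADDED", "rs-new"},
	}}

	// Starting "from latest" misses the deletion: the awaiter would wait
	// forever for rs-old to go away, even though it is already gone.
	fmt.Println("from latest, saw deletion:", sawDeletion(s.watchFrom(s.latest()), "rs-old"))

	// Seeding the watch at the version we read from replays the gap.
	fmt.Println("from read version, saw deletion:", sawDeletion(s.watchFrom(readVersion), "rs-old"))
}
```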

@viveklak viveklak requested a review from lblackstone May 28, 2021 08:38
@viveklak added the impact/no-changelog-required label May 28, 2021
if dia.deployment != nil {
	depListOptions.ResourceVersion = dia.deployment.GetResourceVersion()
}
deploymentWatcher, err := deploymentClient.Watch(context.TODO(), depListOptions)
viveklak (author) commented on May 28, 2021:

TODO: handle a potential 410 Gone error response here (and in other places): https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses

A reviewer (Member) commented:

It seems like the fallback option could be grabbing the latest like it was doing before. Condition 1 is the only one that is comparing generations, and the generation just has to be later. If the specified resource version no longer exists, that implies that condition 1 is true.

viveklak (author) replied:

I have updated this to restart at the current latest and reset our references to the latest deployment state.
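The fallback described in that reply can be sketched as a stdlib-only retry loop (names like errGone, server, and startWatch are hypothetical; the real logic lives in provider/pkg/await/deployment.go): on a 410 Gone, restart the watch at the server's current latest version and treat that as the point where cached references must be refreshed.

```go
package main

import (
	"errors"
	"fmt"
)

// errGone simulates the API server's "410 Gone" response for a resource
// version that has been compacted away.
var errGone = errors.New("410: resource version too old")

// server simulates an API server that only retains versions >= oldest.
type server struct{ oldest, latest int }

// watch returns the version a watch would start from, or errGone if the
// requested version is no longer retained.
func (s *server) watch(rv int) (int, error) {
	if rv < s.oldest {
		return 0, errGone
	}
	return rv, nil
}

// startWatch tries to watch from the version we read during Read(); on a
// 410 Gone it falls back to the server's current latest version, mirroring
// "restart at the current latest and reset our references".
func startWatch(s *server, readVersion int) (int, error) {
	start, err := s.watch(readVersion)
	if errors.Is(err, errGone) {
		// Reset to the latest state. At this point the awaiter must also
		// refresh its cached deployment/ReplicaSet references, since the
		// intermediate history is unrecoverable.
		return s.watch(s.latest)
	}
	return start, err
}

func main() {
	s := &server{oldest: 40, latest: 57}

	v, _ := startWatch(s, 12) // version 12 compacted away -> 410 -> fall back
	fmt.Println("watch starts at:", v) // 57

	v, _ = startWatch(s, 45) // still retained -> use the read version
	fmt.Println("watch starts at:", v) // 45
}
```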

@viveklak viveklak marked this pull request as draft May 28, 2021 08:42
@github-actions commented:
Does the PR have any schema changes?

Looking good! No breaking changes found.
No new resources/functions.

@mikhailshilkov (Member) left a comment:

Makes sense to me directionally, and I agree with your concerns. A couple of questions to improve my understanding.

(Two review threads on provider/pkg/await/deployment.go — both resolved)

@lblackstone (Member) left a comment:

I think these changes make sense, but would want to verify that this fixes the issue before merging.

@viveklak force-pushed the vl/FixDepStuck branch 2 times, most recently from 46b4932 to 2fa873d on June 10, 2021 08:22

@viveklak viveklak marked this pull request as ready for review June 11, 2021 00:18
@viveklak added the area/await-logic label and removed the impact/no-changelog-required label Jun 11, 2021

viveklak (author) commented:

@lblackstone this is ready for another review. I am looking into the test failure; I can't reproduce it on my own test cluster on EKS, while it reproduces consistently in CI. I expect it to be unrelated to the core of these changes, but I will get to the bottom of it today.

	return
}

currentGeneration := dia.deployment.GetAnnotations()[revision]
A reviewer (Member) commented:

Is this annotation guaranteed to be present? If it's not set, the return value will be "". (I noticed that this was already present in other logic in this awaiter, but it might be good to be defensive.)
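A defensive version of that lookup could use Go's comma-ok idiom, so a missing annotation is handled explicitly rather than silently comparing against "". This is a sketch with hypothetical names; `revision` here stands for the key used in the awaiter, assumed to be the standard `deployment.kubernetes.io/revision` annotation.

```go
package main

import "fmt"

// revision is assumed to be the annotation key the awaiter reads; the
// Deployment controller records the rollout revision under this key.
const revision = "deployment.kubernetes.io/revision"

// currentGeneration looks up the revision annotation defensively: the
// second return value reports whether the annotation was actually present,
// so callers can distinguish "revision recorded" from "not recorded yet".
func currentGeneration(annotations map[string]string) (string, bool) {
	if annotations == nil {
		return "", false
	}
	gen, ok := annotations[revision]
	return gen, ok
}

func main() {
	with := map[string]string{revision: "3"}
	var without map[string]string // nil map: lookups return zero values

	if gen, ok := currentGeneration(with); ok {
		fmt.Println("generation:", gen) // generation: 3
	}
	if _, ok := currentGeneration(without); !ok {
		fmt.Println("annotation missing; treat the rollout as not yet observed")
	}
}
```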


@lblackstone (Member) commented:

@viveklak Is this ready to merge?

viveklak (author) replied:

Yup just rebased. Merging.

Should fix #1502 and #1512
