Fix hanging updates for deployments #1596
provider/pkg/await/deployment.go
Outdated
if dia.deployment != nil {
    depListOptions.ResourceVersion = dia.deployment.GetResourceVersion()
}
deploymentWatcher, err := deploymentClient.Watch(context.TODO(), depListOptions)
TODO: handle a potential 410 Gone error response here (and in other places): https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses
It seems like the fallback option could be grabbing the latest like it was doing before. Condition 1 is the only one that is comparing generations, and the generation just has to be later. If the specified resource version no longer exists, that implies that condition 1 is true.
I have updated this to restart at the current latest and reset our references to the latest deployment state.
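The fallback discussed in this thread could be sketched roughly as follows. This is a dependency-free illustration, not the code in the PR: `errGone`, `watchFunc`, and `startWatch` are hypothetical names, and the sentinel error stands in for whatever check the real awaiter performs on the error returned by `deploymentClient.Watch`.

```go
package main

import (
	"errors"
	"fmt"
)

// errGone stands in for the API server's "410 Gone" response. It is a plain
// sentinel so that this sketch has no external dependencies.
var errGone = errors.New("410 Gone: resource version too old")

// watchFunc is a hypothetical stand-in for deploymentClient.Watch: it starts
// a watch at the given resource version ("" meaning "latest").
type watchFunc func(resourceVersion string) error

// startWatch first tries to resume from the resource version captured during
// the initial read. If the server no longer has that version, it restarts the
// watch at the latest state and reports that cached state must be reset.
func startWatch(watch watchFunc, readVersion string) (usedVersion string, resetState bool, err error) {
	err = watch(readVersion)
	if err == nil {
		return readVersion, false, nil
	}
	if !errors.Is(err, errGone) {
		return "", false, fmt.Errorf("starting watch at %q: %w", readVersion, err)
	}
	// The requested version has expired: restart at the current latest.
	if err := watch(""); err != nil {
		return "", false, fmt.Errorf("restarting watch at latest: %w", err)
	}
	return "", true, nil
}
```

When `resetState` is true, the caller would also need to re-read the deployment and replace its cached references, which is what the comment above describes as resetting to the latest deployment state.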
Does the PR have any schema changes? Looking good! No breaking changes found.
Makes sense to me directionally, and I agree with your concerns. A couple of questions to improve my understanding.
I think these changes make sense, but would want to verify that this fixes the issue before merging.
46b4932 to 2fa873d
@lblackstone this is ready for another review. I am looking into the test failure. I can't seem to reproduce it on my own EKS test cluster, although it reproduces consistently in CI. I expect this to be unrelated to the core of the changes here, but I will get to the bottom of it today.
    return
}

currentGeneration := dia.deployment.GetAnnotations()[revision]
Is this annotation guaranteed to be present? If it's not set, the return value will be "". (I noticed that this was already present in other logic in this awaiter, but it might be good to be defensive.)
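One defensive shape for this lookup, as a minimal sketch: `currentRevision` is a hypothetical helper, and `revisionAnnotation` is assumed to hold the same key as the awaiter's `revision` constant (the standard Deployment revision annotation).

```go
package main

// revisionAnnotation is assumed to match the key behind the awaiter's
// `revision` constant; the name used in the provider may differ.
const revisionAnnotation = "deployment.kubernetes.io/revision"

// currentRevision returns the revision annotation and whether it was present
// and non-empty, so callers can tell "not set yet" apart from a real value.
// Reading from a nil map is safe in Go, so no nil check is needed.
func currentRevision(annotations map[string]string) (string, bool) {
	rev, ok := annotations[revisionAnnotation]
	if !ok || rev == "" {
		return "", false
	}
	return rev, true
}
```

A caller could then skip the generation comparison (or keep waiting) when the second return value is false, instead of comparing against an empty string.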
@viveklak Is this ready to merge?
Fix for #1502
In my investigation of the above issue, I was able to determine that there is a TOCTTOU-style race in the state maintained by the await logic for deployments. Specifically, the initial state is seeded during Read(), where information on replicaset generations, pods and PVCs is populated. The watch is kicked off subsequently, but based on the most recent resource version. The current await logic seems quite brittle against the potential for missed events. Indeed, in my repro, I was able to determine that the await logic would hang forever waiting to see updates for an older generation of the replicaset (as referenced during the initial Read()) while the watch began with a state where the expected (old) replicaset was already deleted. This change seeds the watches to begin at the resource versions we initially read.

In some ways I am not thrilled with this approach, since we are further perpetuating the current highly stateful nature of the await logic here. As part of the refactor/rearchitecture of the await logic for complex resources such as deployments, we should definitely consider an approach where the strong consistency of the event stream is not necessary. The current biggest caveats with this approach:
Await() is significantly delayed from Read()) in which case we would get a 410 from the API server. I am not sure what we should do in that situation to recover.
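The race described above can be illustrated with a small in-memory simulation. This is only a sketch under simplifying assumptions: `event` and `eventsAfter` are hypothetical, and resource versions are modelled as integers purely for readability (Kubernetes treats them as opaque strings).

```go
package main

// event is a simplified change record; resourceVersion is an int here only
// to keep the simulation readable.
type event struct {
	resourceVersion int
	description     string
}

// eventsAfter models a watch started at rv: it delivers every change with a
// resource version strictly greater than rv.
func eventsAfter(log []event, rv int) []event {
	var seen []event
	for _, e := range log {
		if e.resourceVersion > rv {
			seen = append(seen, e)
		}
	}
	return seen
}

// changesSinceRead is what happened between Read() (at version 5 in this
// scenario) and the moment the watch was opened (latest version 7).
var changesSinceRead = []event{
	{resourceVersion: 6, description: "old replicaset deleted"},
	{resourceVersion: 7, description: "new replicaset ready"},
}
```

In this scenario `eventsAfter(changesSinceRead, 7)` (a watch opened at the latest version) returns nothing, so an awaiter still holding the old replicaset from Read() would wait indefinitely, while `eventsAfter(changesSinceRead, 5)` (a watch seeded with the version seen by Read()) delivers both changes.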