Bug 1824203: Make egressVXLANMonitor updates channel buffered #132
The egressIPTracker has methods that lock eit.mutex and then call evm functions that lock evm.mutex. The problem is that the code holding evm.mutex has to write to the evm.updates channel, which is unbuffered, so the write blocks until the reader can run eit.setNodeOffline, which also locks eit.mutex. This causes a deadlock. Making the channel buffered with 300 elements should avoid the locking problem because "640k ought to be enough for anybody".
@juanluisvaladas: This pull request references Bugzilla bug 1824203, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
BTW I'm fully aware this is a horrible hack.
/retest
/test e2e-aws
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: juanluisvaladas, rcarrillocruz
/retest Please review the full test history for this PR and help us cut down flakes.
@danwinship perhaps you should have a look before actually merging this.
yeah, I did mean to look at this.... Any channel size other than 0 or 1 is a code smell. (They talk about "massive" egress IP migrations in the bug... Are you sure 300 is enough?) The real fix would be to reorganize the locking to put the channel write outside the lock. Or alternatively, rather than pushing the actual updates over the channel, do like BoundedFrequencyRunner does, and have a length 1
I guess I'd be OK approving this as long as you at least filed an issue about fixing things better and left a FIXME in the code pointing to that (so anyone looking at the current code when thinking about ovn-kubernetes egress IPs will understand the problem).
@juanluisvaladas: All pull requests linked via external trackers have merged: openshift/sdn#132. Bugzilla bug 1824203 has been moved to the MODIFIED state.