Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1824203: Make egressVXLANMonitor updates channel buffered #132

Conversation

juanluisvaladas
Copy link
Contributor

The egressIPTracker has methods that lock eit.mutex and that call
evm functions that lock evm.mutex.

The problem with this is that evm.mutex has to write to the evm.updates
channel which isn't buffered and becomes blocked until
eit.setNodeOffline, which also locks eit.mutex, is running.

This causes a deadlock. By adding a buffered channel with 300 elements
we should avoid the locking problem because "640k ought to be enough
for anybody"

The egressIPTracker has methods that lock eit.mutex and that call
evm functions that lock evm.mutex.

The problem with this is that evm.mutex has to write to the evm.updates
channel which isn't buffered and becomes blocked until
eit.setNodeOffline, which also locks eit.mutex, is running.

This causes a deadlock. By adding a buffered channel with 300 elements
we should avoid the locking problem because "640k ought to be enough
for anybody"
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Apr 16, 2020
@openshift-ci-robot
Copy link
Contributor

@juanluisvaladas: This pull request references Bugzilla bug 1824203, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1824203: Make egressVXLANMonitor updates channel buffered

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juanluisvaladas
Copy link
Contributor Author

BTW I'm fully aware this is a horrible hack.

@juanluisvaladas
Copy link
Contributor Author

/retest

@juanluisvaladas
Copy link
Contributor Author

/test e2e-aws

@juanluisvaladas
Copy link
Contributor Author

/retest

@rcarrillocruz
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juanluisvaladas, rcarrillocruz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@juanluisvaladas
Copy link
Contributor Author

@danwinship perhaps you should have a look before actually merging this.

@danwinship
Copy link
Contributor

yeah, I did mean to look at this....

Any channel size other than 0 or 1 is a code smell. (They talk about "massive" egress IP migrations in the bug... Are you sure 300 is enough?) The real fix would be to reorganize the locking to put the channel write outside the lock.

Or alternatively, rather than pushing the actual updates over the channel, do like BoundedFrequencyRunner does, and have a length 1 chan struct{} where the channel is used purely to signal the other thread that it needs to act, but not to pass the actual data, and so if there's already an item in the channel, you don't need to signal it again, and when the reader thread finally reads from the channel, it would call some evm.GetUpdatedNodes() method to get the list of pending updates.

@juanluisvaladas
Copy link
Contributor Author

Any channel size other than 0 or 1 is a code smell.
Agree.

(They talk about "massive" egress IP migrations in the bug... Are you sure 300 is enough?)
Their largest node has 92 egress IPs so this buys us their 3 largest full nodes plus some margin, and that'd be without executing eit.setNodeOffline a single time in the meantime.
I'm not entirely certain this will be enough

The real fix would be to reorganize the locking to put the channel write outside the lock.
Agree, should I put this on hold and do that instead?

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@danwinship
Copy link
Contributor

I guess I'd be OK approving this as long as you at least filed an issue about fixing things better and left a FIXME in the code pointing to that (so anyone looking at the current code when thinking about ovn-kubernetes egress IPs will understand the problem)

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 3bc4715 into openshift:master Apr 22, 2020
@openshift-ci-robot
Copy link
Contributor

@juanluisvaladas: All pull requests linked via external trackers have merged: openshift/sdn#132. Bugzilla bug 1824203 has been moved to the MODIFIED state.

In response to this:

Bug 1824203: Make egressVXLANMonitor updates channel buffered

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants