[release 3.11]Bug 1824243: Fix egressVXLANMonitor and egressIPTracker deadlock #25027

juanluisvaladas
Contributor

The egressIPTracker (eit) has methods that lock eit.mutex and then call
egressVXLANMonitor (evm) functions that lock evm.mutex.

The problem is that the evm, while holding evm.mutex, has to write to the
evm.updates channel, which is unbuffered, so the write blocks until
eit.setNodeOffline, which also locks eit.mutex, is running.

This causes a deadlock. Initially I tried a horrible hack, making the
updates channel huge, but that wasn't enough.

Instead, this fix adds a nodes list shared between the eit and the evm, and
the evm uses the updates channel only to notify the eit that there are
updates in the shared list.
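
For readers who want to see the general shape of "shared list plus notification channel", here is a minimal sketch. It is not the code from this PR: the names sharedUpdates, nodeUpdate, publish, and drain are hypothetical, and the capacity-1 channel with a non-blocking send is an assumption chosen for illustration. The point it shows is that the producer never waits on the consumer while holding a lock.

```go
package main

import "sync"

// nodeUpdate is a hypothetical record of a node changing state.
type nodeUpdate struct {
	nodeIP  string
	offline bool
}

// sharedUpdates is a hypothetical stand-in for a list shared between a
// monitor (producer) and a tracker (consumer).
type sharedUpdates struct {
	mu      sync.Mutex
	pending []nodeUpdate
	// notify only signals "the list changed"; it carries no data.
	// Capacity 1 is an assumption made for this sketch.
	notify chan struct{}
}

func newSharedUpdates() *sharedUpdates {
	return &sharedUpdates{notify: make(chan struct{}, 1)}
}

// publish appends an update under the lock, releases the lock, and then
// signals the consumer without blocking.
func (s *sharedUpdates) publish(u nodeUpdate) {
	s.mu.Lock()
	s.pending = append(s.pending, u)
	s.mu.Unlock()

	select {
	case s.notify <- struct{}{}:
	default:
		// A notification is already queued; the consumer will see
		// this update when it drains the list.
	}
}

// drain returns all pending updates and empties the list. The consumer
// would call it after receiving from notify.
func (s *sharedUpdates) drain() []nodeUpdate {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := s.pending
	s.pending = nil
	return out
}
```

In this sketch a consumer goroutine would loop over `<-s.notify` and call `drain()` each time. Because several updates can be covered by a single notification, the consumer has to process the whole drained slice rather than assume one update per signal.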

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 26, 2020
@openshift-ci-robot

@juanluisvaladas: This pull request references Bugzilla bug 1824243, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (3.11.z) matches configured target release for branch (3.11.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1824243: Fix egressVXLANMonitor and egressIPTracker deadlock

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juanluisvaladas
Contributor Author

/hold
The BZ has a few blockers on 4.2-4.4 that are still pending merge. I'll cancel the hold as soon as I get those merged.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2020
@juanluisvaladas juanluisvaladas changed the title Bug 1824243: Fix egressVXLANMonitor and egressIPTracker deadlock [release 3.11]Bug 1824243: Fix egressVXLANMonitor and egressIPTracker deadlock May 27, 2020
@juanluisvaladas
Contributor Author

/retest

@juanluisvaladas
Contributor Author

/test e2e-gcp

@knobunc
Contributor

knobunc commented May 27, 2020

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2020
@knobunc knobunc added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. and removed lgtm Indicates that a PR is ready to be merged. labels May 27, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2020
@juanluisvaladas
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2020
@juanluisvaladas
Contributor Author

/test e2e-gcp

@juanluisvaladas
Contributor Author

e2e-gcp seems to have been down for a few days, so I'm not retrying it for a while

@tsmetana
Member

tsmetana commented Jun 1, 2020

/test e2e-gcp

@juanluisvaladas
Contributor Author

/retest

1 similar comment
@sferich888
Contributor

/retest

@vikaslaad

@knobunc could you please skip the e2e-gcp tests? They are broken.

@knobunc
Contributor

knobunc commented Jun 30, 2020

/lgtm
/override ci/prow/e2e-gcp

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 30, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juanluisvaladas, knobunc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-gcp

In response to this:

/lgtm
/override ci/prow/e2e-gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

25 similar comments

@knobunc
Contributor

knobunc commented Jul 1, 2020

/override ci/openshift-jenkins/extended_conformance_install

This is failing due to a mirror problem. We had a clean test run prior to that. I'll work on working out the test problem separately.

@openshift-ci-robot

@knobunc: Overrode contexts on behalf of knobunc: ci/openshift-jenkins/extended_conformance_install

In response to this:

/override ci/openshift-jenkins/extended_conformance_install

This is failing due to a mirror problem. We had a clean test run prior to that. I'll work on working out the test problem separately.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 88feed3 into openshift:release-3.11 Jul 1, 2020
@openshift-ci-robot

@juanluisvaladas: All pull requests linked via external trackers have merged: openshift/origin#25027. Bugzilla bug 1824243 has been moved to the MODIFIED state.

In response to this:

[release 3.11]Bug 1824243: Fix egressVXLANMonitor and egressIPTracker deadlock

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
8 participants