Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing backoff for XDP failures and eliminate panic behaviour for non-xdp non-responding kernels #2165

Merged
merged 2 commits into from Oct 23, 2019

Conversation

@jayunit100
Copy link
Contributor

jayunit100 commented Oct 19, 2019

See projectcalico/calico#2904 for details on how I stumbled upon this. Also please not I'm new to project calico so feel free to scrutinize my logic if I'm missing something :)

Description

The xdp_state.go function has some very important logic around reconcilation around BPF, and as I was looking at the code for the above issue, I found a that errors (1) were not explicitly returned, (2) not backedoff during retries . In looking at exactly what functionality might be failing under the hood i found that it was tricky to see exactly what was going on because (3) there was some coupling in the code that was using the common struct as a mule for accessing the kernel's ip and bpf states. So , this PR does two things.

  1. Solves the concrete problem of the missing backoff with a simple 100 ms wait time (1 second total) for disablement of XDP, meaning this will be much easier to reason about or report in bugs if we see it again.

  2. Removes the "stamp coupling" thats happening with the common object, so that the function that builds the datastructures which drive reconciliation isnt aware of the 'common' object or the 'needsRetry' (although good hygiene, the real reason for this is that it improves the overall readability of a complex synchronization loop)

  3. Renames isSource with ipsSource again for readability in the called wrapper function.

Release Note

Added backoff to the shutdown functionality when BPF dataplane doesnt seem to be working properly (i.e. due to security or other restrictions).
@jayunit100 jayunit100 requested a review from projectcalico/core-maintainers as a code owner Oct 19, 2019
@jayunit100 jayunit100 force-pushed the jayunit100:2904-xdpResync branch 3 times, most recently from 1618226 to ab9be83 Oct 19, 2019
Copy link
Member

fasaxc left a comment

Thanks for the contribution, the fix looks good and the little refactoring too. Afraid I have to push back on the comment rewrites though; I don't think they are improvements as they stand.

The CI failed, you need to fix up xdp_state_test.go after changing the signature of that method.

dataplane/linux/xdp_state.go Outdated Show resolved Hide resolved
dataplane/linux/int_dataplane.go Show resolved Hide resolved
dataplane/linux/xdp_state.go Outdated Show resolved Hide resolved
dataplane/linux/xdp_state.go Outdated Show resolved Hide resolved
dataplane/linux/xdp_state.go Outdated Show resolved Hide resolved
@jayunit100 jayunit100 force-pushed the jayunit100:2904-xdpResync branch from ab9be83 to 41ac785 Oct 21, 2019
Copy link
Member

fasaxc left a comment

Couple of minor comments

dataplane/linux/xdp_state.go Outdated Show resolved Hide resolved
dataplane/linux/xdp_state.go Show resolved Hide resolved
@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 21, 2019

ill fix the unit test now...

@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 21, 2019

pushed up unit test fixes, im assuming CI will be Green now.

@fasaxc

This comment has been minimized.

Copy link
Member

fasaxc commented Oct 22, 2019

@jayunit100 it's still red, please can you take a look?

@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 22, 2019

Ok, Will check shortly .

@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 22, 2019

Ah I see it’s the static checker . It wants me to deal with the error, I’ll restore the original panic semantics (won’t be as bad bc it will only happen if the error persists for one whole second/10 tries).

@jayunit100 jayunit100 changed the title fix coupling in xdp_state.go, plumbing of BPF errors, and implement 1s of backoff. Implementing backoff for BPS failures and eliminate panic behaviour for non-xdp non-responding kernels Oct 22, 2019
@jayunit100 jayunit100 changed the title Implementing backoff for BPS failures and eliminate panic behaviour for non-xdp non-responding kernels Implementing backoff for XDP failures and eliminate panic behaviour for non-xdp non-responding kernels Oct 22, 2019
@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 23, 2019

Updated, am thinking were probably all set now. thanks for the reviews.

@fasaxc
fasaxc approved these changes Oct 23, 2019
Copy link
Member

fasaxc left a comment

LGTM. Please can you squash your commits?

@jayunit100

This comment has been minimized.

Copy link
Contributor Author

jayunit100 commented Oct 23, 2019

gotcha, will do that now .

@jayunit100 jayunit100 force-pushed the jayunit100:2904-xdpResync branch from 6663cf8 to 9b30893 Oct 23, 2019
@CLAassistant

This comment has been minimized.

Copy link

CLAassistant commented Oct 23, 2019

CLA assistant check
All committers have signed the CLA.

@jayunit100 jayunit100 force-pushed the jayunit100:2904-xdpResync branch from 9b30893 to a6b9440 Oct 23, 2019
@fasaxc fasaxc merged commit 2330119 into projectcalico:master Oct 23, 2019
2 checks passed
2 checks passed
license/cla Contributor License Agreement is signed.
Details
semaphoreci The build passed on Semaphore.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.