fix(xdp): propagate XDP rule plumbing failures to caller#5948
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5948 +/- ##
==========================================
- Coverage 85.93% 84.21% -1.73%
==========================================
Files 60 60
Lines 18726 18730 +4
==========================================
- Hits 16093 15774 -319
- Misses 2633 2956 +323 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
2cc67b3 to
f69672d
Compare
51ce23f to
5d238f6
Compare
|
Simplified the conditional — a success on a later interface no longer overwrites an earlier failure. Also fixed a potential use-after-free in the non-Wildcard branch: the PortSet free is now guarded by |
|
My biggest concern (when I created this issue) is partial instantiation...... and the non-determinism of it all. This isn't an issue with your PR, I am just braindumping the current problem we have and hopefully go for a more complete + deterministic solution. Today, MsQuic creates 1 program per interface per queue, and each program has its own rules engine. So, if a system has N interfaces, and each interface has K queues, MsQuic will create N * K XDP programs, and tries to replicate the same ruleset to those N * K programs. Now if we fail to plumb rules on some programs, the flawed logic is we continue on failure and ignore it. It's good that your PR at least adds an error status, but that doesn't solve the partial instantiation, and the necessary cleanup. So, continuing on failure ideally should be avoided. We have to have a coherent cleanup strategy on failure. Ideally, we have something atomic. Either all programs get their rules plumbed, or none do and the application can try it again later. |
|
I can take care of the two concrete items — adding the best-effort comment and warning log in |
5d238f6 to
cc683ba
Compare
That's fair, sorry for the scope creep. But the gist of it is, we should try to clean up at least on a best effort basis partially plumbed interfaces/queues. Maybe "guarantee" is a strong word :) |
Done — pushed the best-effort cleanup. On failure in the Wildcard branch the loop now breaks at the first failure and rolls back already-configured interfaces. Also added the warning log in |
cc683ba to
6499658
Compare
|
There are still a couple of CLOG failures - you might need to clear auto-generated sidecar files then run update-sidecar.ps1 again. |
|
There is currently no coverage for the cleanup routines and your error handling code. If even as a sanity check, can you verify that there is no crash or undefined behavior when your failure path runs? You can hard-code a conditional to go into that path through the usage of MsQuic Test Hooks. It would be ideal if this can be added to our automated CI. But at a bare minimum, can you verify manually the cleanup code works as expected + update PR description with how you tested this? |
6499658 to
d842c4e
Compare
I looked into using |
I see. If it's do-able for you to add those hook callbacks to inject those failures in the Datapath, that would be great. If it's too much, then we can save it for another PR. But at least having a manual pass through the codepath to add some coverage is needed. BTW, you said "ran 30 DataPath tests on Linux" --- I am not sure how this relates to the current changes that are only in the WinUser xdp path. Did you mean windows? |
Yes, I did mean Windows, sorry about that. I dug a bit deeper and found that the layering, Option A: Add a static function pointer Option B: Use the existing Which approach would you prefer? I'm happy to consider other approaches as well. |
Yeah you can just trigger the OOM branch with CxPlatSetAllocFailDenominator. The specific failure doesn't matter, I just want to make sure that the cleanup code runs and no egregious errors. |
74bb3e8 to
f0f3b7a
Compare
e04889b to
b35d0da
Compare
|
A bunch of BVT CI failures @MarkedMuichiro For the CLOG errors, have you tried just cloning the repo to your local environment instead of using Github codespaces? |
|
For the CLOG failures, if running the script causes issues, the CI failure should report the diff and you can fix it by hand. |
b35d0da to
5e30067
Compare
|
Pushed the trailing-space fix for the CLOG diff. Workflows are waiting on maintainer approval to run — could either of you approve when you get a chance @guhetier @ProjectsByJackHe ? |
|
|
A test that depends on the order / number of memory allocations will be brittle, I don't think we should merge it. A manual validation, modifying the datapath code to pretend an allocation fails, and checking the error path execute as expected would be welcome. @ProjectsByJackHe since it can be difficult to setup XDP, maybe you can assist with running this validation? |
|
Hey @MarkedMuichiro, I created MarkedMuichiro#1 to fix up the BVT errors and confirmed from a local debugger that the added cleanup path has code coverage from my fixed up DataPathTest. Once we get that merged, and clean up some nits from @guhetier, this PR should be good to go! |
547aceb to
b9637a4
Compare
|
Pushed. Three changes:
Also includes Jack's PR #1 changes for the simplified cleanup approach — |
I'm confused by that part, why? Am I missing a problem with still returning an error from CxPlatDpRawInterfaceUpdateRules and keeping the rollback as it is done now? |
@MarkedMuichiro we still need to propagate this up. You can just set the HEAD of this PR to equal the HEAD of #5968. To clear up the objective, our goal is to notify the app whenever any problems occur when setting up XDP for data paths that requires it (i.e. does not have a fallback option (QTIP, CIBIR)) We can have the cleanup be best-effort and non-deterministic. Surprisingly, |
9299089 to
90d0346
Compare
|
Updated branch to align with #5968 — Also addressed the two remaining nits:
Sorry for the earlier confusion @guhetier — I misread your previous comment about |
Fixes microsoft#5935 CxPlatDpRawPlumbRulesOnSocket and CxPlatDpRawInterfaceAddRules now return QUIC_STATUS to propagate XDP rule plumbing failures (XdpCreateProgram, allocation, rule overflow) to the caller. On install failure, CxPlatDpRawPlumbRulesOnSocket stops at the first failed interface and returns the error. The caller (RawSocketCreateUdp) performs best-effort cleanup by re-invoking PlumbRulesOnSocket with IsCreated=FALSE before returning failure. RawSocketCreateUdp checks the return value and rolls back the socket registration on failure. RawSocketDelete deletion is best-effort: logs a warning on failure. Co-authored-by: Jack He <jackhe@microsoft.com>
90d0346 to
6023876
Compare
Simplified condition to check for non-null PortSet in rules.
Fixes #5935
Problem
CxPlatDpRawPlumbRulesOnSocket,CxPlatDpRawInterfaceAddRules, andCxPlatDpRawInterfaceUpdateRulesall returnedvoid, silently dropping failures fromXdpCreateProgram, allocation failures, and rule overflow. When XDP is required without OS socket fallback (e.g. CIBIR + XDP server sockets as introduced in #5798), a silent failure means the socket accepts no traffic with no indication to the application.Fix
Changed all three functions to return
QUIC_STATUSand propagate errors up through the call chain to the caller.CxPlatDpRawInterfaceUpdateRules: Tracks the firstXdpCreateProgramfailure across all queues and returns it after all queues are attempted.CxPlatDpRawInterfaceAddRules: ReturnsQUIC_STATUS_BUFFER_TOO_SMALLon rule overflow,QUIC_STATUS_OUT_OF_MEMORYon alloc failure, and propagates the return value fromCxPlatDpRawInterfaceUpdateRules. Fixed a pre-existing PortSet memory leak — the buffer is now freed whenAddRulesfails before copying the rule intoInterface->Rules.CxPlatDpRawPlumbRulesOnSocket: Propagates failures fromCxPlatDpRawInterfaceAddRules. On failure, both Wildcard and non-Wildcard branches now break at the first failure and perform best-effort rollback of already-configured interfaces before returning. The Wildcard branch callsCxPlatDpRawInterfaceRemoveRuleson interfaces that were already configured. The non-Wildcard branch clears the port bit on interfaces that were already configured.datapath_raw.h: Updated declaration fromvoidtoQUIC_STATUS.datapath_raw.c(RawSocketDelete): Deletion is best-effort — logs a warning on failure rather than silently discarding the return value.datapath_raw_win.c: Creation path now checks the return value ofCxPlatDpRawPlumbRulesOnSocketand callsCxPlatRemoveSocketto roll back on failure.Note:
CxPlatDpRawInterfaceRemoveRulesis intentionally left asvoid— rule removal failures are out of scope for this fix.Testing
All 30
DataPathTestcases pass on Linux (msquicplatformtest --gtest_filter="DataPathTest*").The XDP rule plumbing cleanup paths in
CxPlatDpRawPlumbRulesOnSocketare Windows-only —datapath_raw_xdp_win.cis excluded from the Linux build entirely. The existingQUIC_TEST_DATAPATH_HOOKSinfrastructure operates at the packet layer inbinding.cand cannot reachCxPlatDpRawInterfaceAddRulesin the XDP platform layer. Wiring failure injection there would require adding a new callback to the hook struct — happy to do this if the team considers it worth the effort.