
Conversation

@ovsrobot ovsrobot commented Oct 28, 2025

Auto-submission for "http://patchwork.dpdk.org/project/dpdk/list/?series=36480"

Summary by Sourcery

Enable AVX2-accelerated split queue Rx/Tx in the Intel idpf PMD by adding new vectorized receive and transmit routines, consolidating rearm logic, and registering the new paths in the driver.

New Features:

  • Add AVX2-based split queue receive function for idpf driver
  • Add AVX2-based split queue transmit function and supporting routines
  • Enable driver selection of Split AVX2 Vector Rx/Tx based on SIMD width

Enhancements:

  • Extract common split queue rearm logic into idpf_splitq_rearm_common
  • Refactor single queue AVX2 transmit burst function variable declaration

Summary by CodeRabbit

  • Documentation

    • Clarified split queue mode completion queue configuration constraints.
  • New Features

    • Enabled AVX2 vector optimizations for split-queue network packet reception and transmission on compatible systems.

Some CPUs don't support AVX512, so enable AVX2 for them to
get better per-core performance.

In the single queue model, the same descriptor queue is used by SW
to post descriptors to the device and by the device to report completed
descriptors back to SW, whereas the split queue model separates them into
different queues for parallel processing and improved performance.
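The contrast between the two models can be sketched in a few lines of C. This is an illustrative model only — the struct and function names are invented for the example and do not come from the driver:

```c
#include <stdint.h>

/* Illustrative sketch of the split queue model: SW posts descriptors on
 * one ring while the device reports completions on a separate ring, so
 * the two sides advance independent indices instead of serializing on
 * one shared descriptor queue as in the single queue model.
 */
struct split_queue_model {
	uint16_t post_tail;   /* SW-owned index on the descriptor ring */
	uint16_t compl_head;  /* device-owned index on the compl ring  */
};

static void sw_post_descriptor(struct split_queue_model *q, uint16_t ring_size)
{
	q->post_tail = (uint16_t)((q->post_tail + 1) % ring_size);
}

static void hw_report_completion(struct split_queue_model *q, uint16_t ring_size)
{
	q->compl_head = (uint16_t)((q->compl_head + 1) % ring_size);
}
```

Because neither function touches the other's index, posting and completion harvesting can proceed on different cores without contending on one ring.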

Signed-off-by: Shaiq Wani <shaiq.wani@intel.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
Some CPUs don't support AVX512, so enable AVX2 for them to
get better per-core performance.

In the single queue model, the same descriptor queue is used by SW
to post descriptors to the device and by the device to report completed
descriptors back to SW, whereas the split queue model separates them into
different queues for parallel processing and improved performance.

Signed-off-by: Shaiq Wani <shaiq.wani@intel.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
Added a note in the IDPF Poll Mode Driver documentation to clarify
that sharing a completion queue among multiple TX queues serviced
by different CPU cores is not supported in split queue mode.

Signed-off-by: Shaiq Wani <shaiq.wani@intel.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
sourcery-ai bot commented Oct 28, 2025

Reviewer's Guide

This PR adds AVX2-accelerated split-queue Rx and Tx paths in the Intel idpf driver by implementing vectorized packet processing routines, extracting common rearm logic, and updating device registration and header declarations accordingly.

Class diagram for new and updated split queue AVX2 Rx/Tx functions

classDiagram
    class idpf_rx_queue {
        +adapter
        +bufq2
        +rx_tail
        +expected_gen_id
        +nb_rx_desc
        +mbuf_initializer
        +rx_ring
        +rxrearm_nb
        +rxrearm_start
        +mp
        +sw_ring
        +fake_mbuf
        +rx_stats
        +qrx_tail
    }
    class ci_tx_queue {
        +tx_tail
        +nb_tx_desc
        +expected_gen_id
        +compl_ring
        +txqs
        +tx_start_qid
        +rs_compl_count
        +tx_free_thresh
        +tx_rs_thresh
        +desc_ring
        +sw_ring_vec
        +nb_tx_free
        +qtx_tail
        +complq
    }
    class idpf_flex_tx_sched_desc {
    }
    class idpf_splitq_tx_compl_desc {
    }
    class rte_mbuf {
        +data_len
        +buf_iova
        +data_off
        +rearm_data
    }
    idpf_rx_queue --> ci_tx_queue : uses in Tx
    ci_tx_queue --> idpf_flex_tx_sched_desc : uses for Tx desc
    ci_tx_queue --> idpf_splitq_tx_compl_desc : uses for completion
    idpf_rx_queue --> rte_mbuf : uses for Rx
    ci_tx_queue --> rte_mbuf : uses for Tx

    class idpf_common_rxtx_avx2 {
        +idpf_dp_splitq_recv_pkts_avx2()
        +idpf_splitq_rearm_common()
        +idpf_splitq_scan_cq_ring()
        +idpf_splitq_vtx1_avx2()
        +idpf_splitq_vtx_avx2()
        +idpf_splitq_xmit_fixed_burst_vec_avx2()
        +idpf_dp_splitq_xmit_pkts_avx2()
    }
    idpf_common_rxtx_avx2 --> idpf_rx_queue
    idpf_common_rxtx_avx2 --> ci_tx_queue
    idpf_common_rxtx_avx2 --> idpf_flex_tx_sched_desc
    idpf_common_rxtx_avx2 --> idpf_splitq_tx_compl_desc
    idpf_common_rxtx_avx2 --> rte_mbuf

Flow diagram for AVX2 split queue Rx packet processing

flowchart TD
    Start(["Start Rx Burst"])
    RearmCheck["Check if rearm needed"]
    RearmCall["Call idpf_splitq_rearm_common if needed"]
    HeadGenCheck["Check head generation"]
    Loop["For each batch of 4 descriptors"]
    DDCheck["Check DD bits"]
    CopyMbuf["Copy mbuf pointers"]
    BuildMbuf["Build mbuf rearm data"]
    UpdateTail["Update rx_tail and expected_gen_id"]
    ReturnPkts["Return received packets"]
    Start --> RearmCheck
    RearmCheck -->|Yes| RearmCall
    RearmCheck -->|No| HeadGenCheck
    RearmCall --> HeadGenCheck
    HeadGenCheck -->|OK| Loop
    HeadGenCheck -->|Fail| ReturnPkts
    Loop --> DDCheck
    DDCheck -->|All set| CopyMbuf
    DDCheck -->|Not all| UpdateTail
    CopyMbuf --> BuildMbuf
    BuildMbuf --> Loop
    Loop --> UpdateTail
    UpdateTail --> ReturnPkts
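The "Update rx_tail and expected_gen_id" step at the end of the Rx flow above can be modeled in scalar C. A minimal sketch assuming a power-of-two ring size; the function name is chosen for the example, not taken from the driver:

```c
#include <stdint.h>
#include <stdbool.h>

/* Advance the ring tail by 'received' entries and toggle the expected
 * generation bit each time the tail wraps past the end of the ring.
 * Requires nb_desc to be a power of two, so that (tail & nb_desc)
 * detects the wrap and (nb_desc - 1) works as an index mask.
 */
static void rx_tail_advance(uint16_t *tail, bool *expected_gen,
			    uint16_t received, uint16_t nb_desc)
{
	*tail = (uint16_t)(*tail + received);
	*expected_gen ^= (*tail & nb_desc) != 0;  /* flip on wrap */
	*tail &= (uint16_t)(nb_desc - 1);         /* wrap the index */
}
```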

Flow diagram for AVX2 split queue Tx packet processing

flowchart TD
    Start(["Start Tx Burst"])
    ScanCQ["Scan completion ring"]
    FreeBufs["Free buffers if rs_compl_count > tx_free_thresh"]
    Loop["While nb_pkts > 0"]
    FixedBurst["Call idpf_splitq_xmit_fixed_burst_vec_avx2"]
    UpdateTail["Update tx_tail"]
    WriteTail["Write to NIC tail register"]
    ReturnTx["Return nb_tx"]
    Start --> ScanCQ
    ScanCQ --> FreeBufs
    FreeBufs --> Loop
    Loop --> FixedBurst
    FixedBurst --> UpdateTail
    UpdateTail --> WriteTail
    WriteTail --> Loop
    Loop -->|Done| ReturnTx

File-Level Changes

Change Details Files
Implement split-queue Rx using AVX2 intrinsics
  • Added idpf_dp_splitq_recv_pkts_avx2 with AVX2-based descriptor processing loop
  • Defined 256-bit shuffle, length mask, and ptype mask vectors for mbuf rearm_data assembly
  • Integrated generation-bit checks, tail pointer updates, and rearm threshold logic
  • Exported the new Rx function and registered it in rx_path_infos for SIMD width 256
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
Implement split-queue Tx using AVX2 intrinsics
  • Introduced idpf_splitq_scan_cq_ring for scanning completion queue with generation toggling
  • Added idpf_splitq_vtx1_avx2 and idpf_splitq_vtx_avx2 for vectorized descriptor builds
  • Created idpf_splitq_xmit_fixed_burst_vec_avx2 and idpf_dp_splitq_xmit_pkts_avx2 for burst transmission
  • Exported the new Tx functions for internal symbol visibility
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
Extract and unify split-queue rearm logic
  • Moved idpf_splitq_rearm_common implementation into idpf_common_rxtx.c
  • Removed duplicate inline rearm in idpf_common_rxtx_avx512.c
  • Added EXPORT and header declaration for idpf_splitq_rearm_common
drivers/net/intel/idpf/idpf_common_rxtx.c
drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
drivers/net/intel/idpf/idpf_common_rxtx.h
Update device setup and header enums for AVX2 support
  • Extended idpf_set_tx_function to select split-queue AVX2 vector Tx path for SIMD width 256
  • Added IDPF_RX_AVX2 enum and prototype exports in idpf_common_device.h and idpf_common_rxtx.h
  • Registered the new AVX2 Rx function in the device RX path info
drivers/net/intel/idpf/idpf_rxtx.c
drivers/net/intel/idpf/idpf_common_device.h
drivers/net/intel/idpf/idpf_common_rxtx.h


coderabbitai bot commented Oct 28, 2025

Walkthrough

The changes add AVX2 support for split-queue RX/TX data paths in the Intel IDPF driver. A common rearm function is extracted as a public symbol for shared use. The RX/TX path selection logic is updated to recognize AVX2-optimized paths. Documentation notes a split-queue completion queue sharing limitation.

Changes

Cohort / File(s) Summary
Documentation
doc/guides/nics/idpf.rst
Added runtime configuration note stating that completion queue sharing among multiple TX queues serviced by different CPU cores is unsupported in split queue mode.
Enumeration and declarations
drivers/net/intel/idpf/idpf_common_device.h, drivers/net/intel/idpf/idpf_common_rxtx.h
Added IDPF_RX_AVX2 enum member to idpf_rx_func_type. Declared new public functions: idpf_splitq_rearm_common(), idpf_dp_splitq_recv_pkts_avx2(), and idpf_dp_splitq_xmit_pkts_avx2() for AVX2 split-queue support.
Common RX/TX implementation
drivers/net/intel/idpf/idpf_common_rxtx.c
Implemented idpf_splitq_rearm_common() for bulk MBUF allocation and RX descriptor initialization. Added new AVX2 split RX path entry to idpf_rx_path_infos array with 256-bit SIMD metadata.
AVX2 split-queue paths
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
Implemented idpf_dp_splitq_recv_pkts_avx2() for vectorized split-queue RX packet reception with head-gen checks and descriptor prefetching. Implemented idpf_dp_splitq_xmit_pkts_avx2() for vectorized split-queue TX transmission. Added static helpers: idpf_splitq_scan_cq_ring(), idpf_splitq_vtx1_avx2(), idpf_splitq_vtx_avx2(), and idpf_splitq_xmit_fixed_burst_vec_avx2().
AVX512 cleanup
drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
Removed static idpf_splitq_rearm_common() function, consolidating rearm logic into the shared public implementation.
Path selection logic
drivers/net/intel/idpf/idpf_rxtx.c
Extended idpf_set_tx_function() to recognize AVX2 SIMD width (256-bit) and assign split-queue AVX2 transmit path, inserting AVX2 handling between AVX512 and scalar fallback paths.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Driver as IDPF Driver
    participant PathSel as Path Selection<br/>(idpf_set_tx_function)
    participant AVX512Path as AVX512 Path
    participant AVX2Path as AVX2 Path
    participant ScalarPath as Scalar Path

    App->>Driver: Initialize TX
    Driver->>PathSel: Select TX path
    
    alt SIMD width = 512-bit
        PathSel->>AVX512Path: Use AVX512
        AVX512Path->>App: TX burst function
    else SIMD width = 256-bit
        PathSel->>AVX2Path: Use AVX2 (new)
        AVX2Path->>App: TX burst function
    else
        PathSel->>ScalarPath: Use Scalar
        ScalarPath->>App: TX burst function
    end
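The precedence in the diagram above (AVX512 first, then AVX2, then scalar) can be sketched as a plain C selector. The enum and function names here are illustrative, not the driver's actual symbols:

```c
enum tx_path { TX_SCALAR, TX_AVX2, TX_AVX512 };

/* Select the widest supported SIMD path first; the ordering of the
 * checks is what guarantees AVX512 wins over AVX2, and AVX2 over the
 * scalar fallback. */
static enum tx_path select_tx_path(int simd_width_bits)
{
	if (simd_width_bits >= 512)
		return TX_AVX512;
	if (simd_width_bits >= 256)
		return TX_AVX2;  /* the path this series adds */
	return TX_SCALAR;
}
```

Inserting the AVX2 check between the AVX512 and scalar branches, as the PR does, preserves this widest-first precedence.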
sequenceDiagram
    participant RXQ as RX Queue
    participant Rearm as Rearm<br/>(idpf_splitq_rearm_common)
    participant Burst as RX Burst<br/>(idpf_dp_splitq_recv_pkts_avx2)
    participant Desc as RX Descriptors
    participant MBUF as MBUF Pool

    Burst->>RXQ: Scan completion queue
    Burst->>Desc: Read 4 descriptors/iteration
    Burst->>Desc: Check head-gen state
    
    loop Process descriptors
        Burst->>Desc: Extract packet info
        Burst->>RXQ: Update tail & GEN
    end

    RXQ->>Rearm: Trigger rearm when needed
    Rearm->>MBUF: Allocate MBUFs in bulk
    Rearm->>Desc: Initialize RX descriptors
    Rearm->>RXQ: Update tail pointer

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • AVX2 vectorized RX/TX implementation (idpf_common_rxtx_avx2.c): Dense logic with descriptor processing loops, GEN state management, prefetching, and helper functions—requires careful verification of SIMD semantics and descriptor ring handling.
  • Code relocation and deduplication (idpf_common_rxtx_avx512.c and idpf_common_rxtx.c): Extracted idpf_splitq_rearm_common() from AVX512 to shared common code—verify functional equivalence and that both AVX2 and AVX512 paths use it correctly.
  • Path selection precedence (idpf_rxtx.c): AVX512 → AVX2 → Scalar ordering must be correct to avoid incorrect path selection.
  • Integration across multiple files: Changes span headers, implementations, and selection logic; cross-file consistency requires attention.

Poem

🐰 AVX2 wings sprout on split queues bright,
Rearm extracted, shared in morning light,
Vectorized bursts now dance 256-bit wide,
Old AVX512 twins, side by side,
Fast frames fly where the rabbit once tied! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "[PWCI] 'net/idpf: enable AVX2 for split queue Rx/Tx'" directly and accurately describes the primary objective of the changeset. The changes consistently add AVX2-based implementations for split-queue receive and transmit paths, including new enum members, internal functions, and path selection logic to support AVX2 acceleration. The title is concise, specific, and clearly conveys the main enhancement being introduced, which allows developers scanning the history to immediately understand the purpose of the changeset.


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The header declaration of idpf_splitq_rearm_common is missing the __rte_internal attribute – please add it to keep internal API visibility consistent.
  • There's significant duplication between the AVX2 TX vector code for single-queue and split-queue paths; consider extracting shared loops or helpers to reduce maintenance overhead.
  • The wrap-around and generation-bit logic in idpf_dp_splitq_recv_pkts_avx2 assumes nb_rx_desc is a power-of-two; please add an assert or explanatory comment to make this requirement explicit.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The header declaration of idpf_splitq_rearm_common is missing the __rte_internal attribute – please add it to keep internal API visibility consistent.
- There's significant duplication between the AVX2 TX vector code for single-queue and split-queue paths; consider extracting shared loops or helpers to reduce maintenance overhead.
- The wrap-around and generation-bit logic in idpf_dp_splitq_recv_pkts_avx2 assumes nb_rx_desc is a power-of-two; please add an assert or explanatory comment to make this requirement explicit.
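The power-of-two requirement flagged above can be made explicit with a check at queue setup. DPDK provides `rte_is_power_of_2()` in `rte_common.h`; the standalone sketch below uses the equivalent bit trick so it compiles without DPDK headers:

```c
#include <stdint.h>
#include <stdbool.h>

/* A ring size n is a power of two iff n != 0 and (n & (n - 1)) == 0;
 * only then does (tail & (n - 1)) behave as a modulo-n wrap and
 * (tail & n) as a wrap detector for the generation bit. */
static bool ring_size_ok(uint16_t nb_rx_desc)
{
	return nb_rx_desc != 0 && (nb_rx_desc & (nb_rx_desc - 1)) == 0;
}
```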

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
doc/guides/nics/idpf.rst (1)

82-86: Clarify the CQ-sharing limitation with guidance.

Suggest adding why it’s unsupported (completion ownership/coherency) and what to do instead (dedicated CQ per TX queue or pin queues sharing a CQ to the same lcore) to reduce operator confusion.

drivers/net/intel/idpf/idpf_common_rxtx.h (2)

59-61: Macro prefix typo: IDPD → IDPF.

Rename IDPD_TXQ_SCAN_CQ_THRESH to IDPF_TXQ_SCAN_CQ_THRESH for consistency with the codebase.


32-38: Duplicate constant definition.

IDPF_TX_MAX_MTU_SEG is defined twice; remove one to avoid drift.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39b54f2 and 1a68515.

📒 Files selected for processing (7)
  • doc/guides/nics/idpf.rst (1 hunks)
  • drivers/net/intel/idpf/idpf_common_device.h (1 hunks)
  • drivers/net/intel/idpf/idpf_common_rxtx.c (2 hunks)
  • drivers/net/intel/idpf/idpf_common_rxtx.h (2 hunks)
  • drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (4 hunks)
  • drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (0 hunks)
  • drivers/net/intel/idpf/idpf_rxtx.c (1 hunks)
💤 Files with no reviewable changes (1)
  • drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
🧰 Additional context used
🧬 Code graph analysis (4)
drivers/net/intel/idpf/idpf_common_rxtx.h (2)
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (5)
  • void (12-82)
  • void (765-802)
  • void (804-816)
  • idpf_dp_splitq_recv_pkts_avx2 (485-602)
  • idpf_dp_splitq_xmit_pkts_avx2 (910-935)
drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (14)
  • void (14-124)
  • void (126-240)
  • void (543-648)
  • void (946-958)
  • void (962-1012)
  • void (1114-1145)
  • void (1149-1161)
  • void (1163-1213)
  • uint16_t (243-529)
  • uint16_t (650-934)
  • uint16_t (1014-1082)
  • uint16_t (1084-1104)
  • uint16_t (1215-1277)
  • uint16_t (1279-1306)
drivers/net/intel/idpf/idpf_common_rxtx.c (2)
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (4)
  • void (12-82)
  • void (765-802)
  • void (804-816)
  • idpf_dp_splitq_recv_pkts_avx2 (485-602)
drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (8)
  • void (14-124)
  • void (126-240)
  • void (543-648)
  • void (946-958)
  • void (962-1012)
  • void (1114-1145)
  • void (1149-1161)
  • void (1163-1213)
drivers/net/intel/idpf/idpf_rxtx.c (1)
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (1)
  • idpf_dp_splitq_xmit_pkts_avx2 (910-935)
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (2)
drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (8)
  • void (14-124)
  • void (126-240)
  • void (543-648)
  • void (946-958)
  • void (962-1012)
  • void (1114-1145)
  • void (1149-1161)
  • void (1163-1213)
drivers/net/intel/idpf/idpf_rxtx_vec_common.h (1)
  • idpf_tx_desc_done (34-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (5)
drivers/net/intel/idpf/idpf_common_device.h (1)

68-77: Enum addition looks correct; confirm table alignment.

Adding IDPF_RX_AVX2 here matches the new entry in idpf_rx_path_infos. Please verify array indexes remain aligned with enum order and that unknown/default paths still resolve to IDPF_RX_DEFAULT.

drivers/net/intel/idpf/idpf_common_rxtx.h (1)

206-207: New internal APIs are fine.

Signatures for common rearm and AVX2 split Rx/Tx are consistent with call sites.

Also applies to: 257-259, 267-269

drivers/net/intel/idpf/idpf_rxtx.c (1)

853-860: Split AVX2 TX selection LGTM.

Correct path selection and early return; prep function set.

Please confirm idpf_dp_splitq_xmit_pkts_avx2 is compiled in the same build configs guarded by RTE_ARCH_X86.

drivers/net/intel/idpf/idpf_common_rxtx.c (2)

253-304: Common split-queue rearm path looks sound.

Bulk allocation, descriptor init, tail update follow existing patterns; fallback matches prior logic.


1739-1745: New RX path table entry consistent with enum and features.

“Split AVX2 Vector” entry uses SIMD_256 and correct burst function.

Comment on lines +485 to +602
uint16_t
idpf_dp_splitq_recv_pkts_avx2(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	struct idpf_rx_queue *queue = (struct idpf_rx_queue *)rxq;
	const uint32_t *ptype_tbl = queue->adapter->ptype_tbl;
	struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[queue->rx_tail];
	volatile union virtchnl2_rx_desc *rxdp =
		(volatile union virtchnl2_rx_desc *)queue->rx_ring + queue->rx_tail;
	const __m256i mbuf_init = _mm256_set_epi64x(0, 0, 0, queue->mbuf_initializer);

	rte_prefetch0(rxdp);
	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, 4); /* 4 desc per AVX2 iteration */

	if (queue->bufq2->rxrearm_nb > IDPF_RXQ_REARM_THRESH)
		idpf_splitq_rearm_common(queue->bufq2);

	/* head gen check */
	uint64_t head_gen = rxdp->flex_adv_nic_3_wb.pktlen_gen_bufq_id;
	if (((head_gen >> VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_S) &
	     VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_M) != queue->expected_gen_id)
		return 0;

	uint16_t received = 0;

	/* Shuffle mask: picks fields from each 16-byte descriptor pair into the
	 * layout that will be merged into mbuf->rearm_data candidates.
	 */
	const __m256i shuf = _mm256_set_epi8(
		/* high 128 bits (desc 3 then desc 2 lanes) */
		0xFF, 0xFF, 0xFF, 0xFF, 11, 10, 5, 4,
		0xFF, 0xFF, 5, 4, 0xFF, 0xFF, 0xFF, 0xFF,
		/* low 128 bits (desc 1 then desc 0 lanes) */
		0xFF, 0xFF, 0xFF, 0xFF, 11, 10, 5, 4,
		0xFF, 0xFF, 5, 4, 0xFF, 0xFF, 0xFF, 0xFF
	);

	/* mask that clears bits 14 and 15 of the packet length word */
	const __m256i len_mask = _mm256_set_epi32(
		0xffffffff, 0xffffffff, 0xffff3fff, 0xffffffff,
		0xffffffff, 0xffffffff, 0xffff3fff, 0xffffffff
	);

	const __m256i ptype_mask = _mm256_set1_epi16(VIRTCHNL2_RX_FLEX_DESC_PTYPE_M);

	for (int i = nb_pkts; i >= IDPF_VPMD_DESCS_PER_LOOP; i -= IDPF_VPMD_DESCS_PER_LOOP) {
		rxdp -= IDPF_VPMD_DESCS_PER_LOOP;

		/* Check DD bits */
		bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
			    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
		bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
			    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
		bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
			    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
		bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
			    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;

		if (!(dd0 && dd1 && dd2 && dd3))
			break;

		/* copy mbuf pointers */
		memcpy(&rx_pkts[i - IDPF_VPMD_DESCS_PER_LOOP],
		       &sw_ring[i - IDPF_VPMD_DESCS_PER_LOOP],
		       sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);

		__m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
		__m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
		__m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
		__m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));

		__m256i d23 = _mm256_set_m128i(d3, d2);
		__m256i d01 = _mm256_set_m128i(d1, d0);

		/* mask length and shuffle to build mbuf rearm data */
		__m256i desc01 = _mm256_and_si256(d01, len_mask);
		__m256i desc23 = _mm256_and_si256(d23, len_mask);
		__m256i mb01 = _mm256_shuffle_epi8(desc01, shuf);
		__m256i mb23 = _mm256_shuffle_epi8(desc23, shuf);

		/* ptype extraction */
		__m256i pt01 = _mm256_and_si256(d01, ptype_mask);
		__m256i pt23 = _mm256_and_si256(d23, ptype_mask);

		uint16_t ptype0 = (uint16_t)_mm256_extract_epi16(pt01, 1);
		uint16_t ptype1 = (uint16_t)_mm256_extract_epi16(pt01, 9);
		uint16_t ptype2 = (uint16_t)_mm256_extract_epi16(pt23, 1);
		uint16_t ptype3 = (uint16_t)_mm256_extract_epi16(pt23, 9);

		mb01 = _mm256_insert_epi32(mb01, (int)ptype_tbl[ptype1], 2);
		mb01 = _mm256_insert_epi32(mb01, (int)ptype_tbl[ptype0], 0);
		mb23 = _mm256_insert_epi32(mb23, (int)ptype_tbl[ptype3], 2);
		mb23 = _mm256_insert_epi32(mb23, (int)ptype_tbl[ptype2], 0);

		/* build rearm data for each mbuf */
		__m256i rearm0 = _mm256_permute2f128_si256(mbuf_init, mb01, 0x20);
		__m256i rearm1 = _mm256_blend_epi32(mbuf_init, mb01, 0xF0);
		__m256i rearm2 = _mm256_permute2f128_si256(mbuf_init, mb23, 0x20);
		__m256i rearm3 = _mm256_blend_epi32(mbuf_init, mb23, 0xF0);

		_mm256_storeu_si256((__m256i *)&rx_pkts[i - 4]->rearm_data, rearm0);
		_mm256_storeu_si256((__m256i *)&rx_pkts[i - 3]->rearm_data, rearm1);
		_mm256_storeu_si256((__m256i *)&rx_pkts[i - 2]->rearm_data, rearm2);
		_mm256_storeu_si256((__m256i *)&rx_pkts[i - 1]->rearm_data, rearm3);

		received += IDPF_VPMD_DESCS_PER_LOOP;
	}

	queue->rx_tail += received;
	queue->expected_gen_id ^= ((queue->rx_tail & queue->nb_rx_desc) != 0);
	queue->rx_tail &= (queue->nb_rx_desc - 1);
	if ((queue->rx_tail & 1) == 1 && received > 1) {
		queue->rx_tail--;
		received--;
	}
	queue->bufq2->rxrearm_nb += received;
	return received;
}

⚠️ Potential issue | 🔴 Critical

Fix split RX AVX2: wrong ring traversal and out-of-order rx_pkts fill.

Current loop walks rxdp backwards and fills rx_pkts from the end. On early DD break, returned count N does not populate rx_pkts[0..N-1]. Also rxdp -= 4 risks reading before rx_tail. Iterate forward from rx_tail and fill rx_pkts contiguously.

Apply this diff:

@@
- struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[queue->rx_tail];
- volatile union virtchnl2_rx_desc *rxdp =
-   (volatile union virtchnl2_rx_desc *)queue->rx_ring + queue->rx_tail;
+ uint16_t idx = queue->rx_tail;
+ struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[idx];
+ volatile union virtchnl2_rx_desc *rxdp =
+   (volatile union virtchnl2_rx_desc *)queue->rx_ring + idx;
@@
- for (int i = nb_pkts; i >= IDPF_VPMD_DESCS_PER_LOOP; i -= IDPF_VPMD_DESCS_PER_LOOP) {
-   rxdp -= IDPF_VPMD_DESCS_PER_LOOP;
+ for (uint16_t i = 0; i + IDPF_VPMD_DESCS_PER_LOOP <= nb_pkts; i += IDPF_VPMD_DESCS_PER_LOOP) {
@@
-   /* Check DD bits */
-   bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
+   /* Check DD bits */
+   bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
                (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
-   bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
+   bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
                (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
-   bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
+   bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
                (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
-   bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
+   bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
                (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
    if (!(dd0 && dd1 && dd2 && dd3))
        break;
@@
-   /* copy mbuf pointers */
-   memcpy(&rx_pkts[i - IDPF_VPMD_DESCS_PER_LOOP],
-          &sw_ring[i - IDPF_VPMD_DESCS_PER_LOOP],
-          sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);
+   /* copy mbuf pointers contiguously */
+   memcpy(&rx_pkts[received],
+          &sw_ring[received],
+          sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);
@@
-   __m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
-   __m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
-   __m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
-   __m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));
+   __m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));
+   __m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
+   __m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
+   __m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
@@
-   _mm256_storeu_si256((__m256i *)&rx_pkts[i - 4]->rearm_data, rearm0);
-   _mm256_storeu_si256((__m256i *)&rx_pkts[i - 3]->rearm_data, rearm1);
-   _mm256_storeu_si256((__m256i *)&rx_pkts[i - 2]->rearm_data, rearm2);
-   _mm256_storeu_si256((__m256i *)&rx_pkts[i - 1]->rearm_data, rearm3);
+   _mm256_storeu_si256((__m256i *)&rx_pkts[received + 0]->rearm_data, rearm0);
+   _mm256_storeu_si256((__m256i *)&rx_pkts[received + 1]->rearm_data, rearm1);
+   _mm256_storeu_si256((__m256i *)&rx_pkts[received + 2]->rearm_data, rearm2);
+   _mm256_storeu_si256((__m256i *)&rx_pkts[received + 3]->rearm_data, rearm3);
@@
-   received += IDPF_VPMD_DESCS_PER_LOOP;
+   received += IDPF_VPMD_DESCS_PER_LOOP;
+   rxdp += IDPF_VPMD_DESCS_PER_LOOP;
 }
@@
- queue->rx_tail += received;
+ queue->rx_tail += received;
🤖 Prompt for AI Agents
In drivers/net/intel/idpf/idpf_common_rxtx_avx2.c around lines 485-602: the AVX2
receive loop walks rxdp backwards and copies mbuf pointers into rx_pkts from the
end, which causes out-of-order/holes on early DD break and can underflow the
ring; change the loop to iterate forward from queue->rx_tail (increment rxdp and
index), fill rx_pkts sequentially from 0 upward, perform DD checks on the
upcoming 4 descriptors without decrementing rxdp, use modulo masking when
advancing rxdp/index to avoid reading before rx_tail, increment received per 4
processed descriptors, and then update queue->rx_tail, expected_gen_id and
bufq2->rxrearm_nb using the new forward-advanced tail; ensure memcpy, loads, and
stores use the forward indices so rx_pkts[0..received-1] are populated in order.
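The fix direction described above — iterate forward and fill the output array contiguously so an early DD break still leaves a valid prefix — can be modeled without SIMD. The DD flags and names below are stand-ins for the example, not driver symbols:

```c
#include <stdint.h>
#include <string.h>

#define BATCH 4

/* Scalar model of the corrected loop: scan descriptors forward in
 * batches of BATCH, stop at the first batch whose DD flags are not all
 * set, and copy ring entries into out[0..received-1] in order, so an
 * early break still leaves a contiguous prefix populated. */
static uint16_t recv_forward(const uint8_t *dd, const int *ring,
			     int *out, uint16_t nb)
{
	uint16_t received = 0;

	for (uint16_t i = 0; i + BATCH <= nb; i += BATCH) {
		int all_done = 1;
		for (int j = 0; j < BATCH; j++)
			all_done &= dd[i + j];
		if (!all_done)
			break;
		memcpy(&out[received], &ring[i], sizeof(int) * BATCH);
		received += BATCH;
	}
	return received;
}
```

With the original backwards traversal, a break in the middle would leave holes at the start of `out`; the forward version cannot.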

Comment on lines +861 to +909
static inline uint16_t
idpf_splitq_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
	struct ci_tx_queue *txq = (struct ci_tx_queue *)tx_queue;
	struct idpf_flex_tx_sched_desc *txdp;
	struct ci_tx_entry_vec *txep;
	uint16_t n, nb_commit;
	uint64_t cmd_dtype = IDPF_TXD_FLEX_FLOW_CMD_EOP;
	uint16_t tx_id = txq->tx_tail;

	nb_commit = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
	nb_pkts = nb_commit;
	if (unlikely(nb_pkts == 0))
		return 0;

	txdp = (struct idpf_flex_tx_sched_desc *)&txq->desc_ring[tx_id];
	txep = &txq->sw_ring_vec[tx_id];

	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);

	n = (uint16_t)(txq->nb_tx_desc - tx_id);
	if (nb_commit >= n) {
		ci_tx_backlog_entry_vec(txep, tx_pkts, n);

		idpf_splitq_vtx_avx2(txdp, tx_pkts, n - 1, cmd_dtype);
		tx_pkts += (n - 1);
		txdp += (n - 1);

		idpf_splitq_vtx1_avx2(txdp, *tx_pkts++, cmd_dtype);

		nb_commit = (uint16_t)(nb_commit - n);
		tx_id = 0;

		txdp = &txq->desc_ring[tx_id];
		txep = (void *)txq->sw_ring;
	}

	ci_tx_backlog_entry_vec(txep, tx_pkts, nb_commit);

	idpf_splitq_vtx_avx2(txdp, tx_pkts, nb_commit, cmd_dtype);

	tx_id = (uint16_t)(tx_id + nb_commit);
	txq->tx_tail = tx_id;

	IDPF_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);

	return nb_pkts;
}


⚠️ Potential issue | 🔴 Critical

Fix wrong TX backlog pointer after wrap.

After wrap, txep must point into sw_ring_vec, not sw_ring. Current cast breaks vector backlog bookkeeping.

Apply this diff:

@@
- txdp = &txq->desc_ring[tx_id];
- txep = (void *)txq->sw_ring;
+ txdp = &txq->desc_ring[tx_id];
+ txep = &txq->sw_ring_vec[tx_id];
🤖 Prompt for AI Agents
In drivers/net/intel/idpf/idpf_common_rxtx_avx2.c around lines 861-909 the code
incorrectly resets the software backlog pointer to sw_ring after a wrap; replace
that cast with a pointer into sw_ring_vec at the wrapped index (tx_id, which is
0 here) so txep points to the sw_ring_vec entry (e.g. set txep =
&txq->sw_ring_vec[tx_id]) to preserve correct vector backlog bookkeeping.
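The wrap handling the fix targets — splitting a burst at the end of the ring and re-basing both the descriptor array and the software-entry array to index 0 — can be sketched in scalar C. Names and types are illustrative:

```c
#include <stdint.h>

/* Commit nb entries starting at tail into rings of size nb_desc,
 * splitting the copy at the wrap point.  Both destination arrays must
 * be re-based to index 0 after the wrap -- re-basing only one of them
 * is exactly the bug the review comment describes. */
static uint16_t tx_commit(int *desc_ring, int *sw_ring_vec,
			  const int *pkts, uint16_t nb,
			  uint16_t tail, uint16_t nb_desc)
{
	uint16_t n = (uint16_t)(nb_desc - tail);

	if (nb >= n) {
		for (uint16_t i = 0; i < n; i++) {
			desc_ring[tail + i] = pkts[i];
			sw_ring_vec[tail + i] = pkts[i];
		}
		pkts += n;
		nb = (uint16_t)(nb - n);
		tail = 0; /* both arrays restart from the ring head */
	}
	for (uint16_t i = 0; i < nb; i++) {
		desc_ring[tail + i] = pkts[i];
		sw_ring_vec[tail + i] = pkts[i];
	}
	return (uint16_t)(tail + nb);
}
```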
