[PWCI] "net/idpf: enable AVX2 for split queue Rx/Tx" #254
Conversation
In case some CPUs don't support AVX512, enable AVX2 for them to get better per-core performance. In the single queue model, the same descriptor queue is used by SW to post descriptors to the device and by the device to report completed descriptors to SW, whereas the split queue model separates them into different queues for parallel processing and improved performance. Signed-off-by: Shaiq Wani <shaiq.wani@intel.com> Signed-off-by: 0-day Robot <robot@bytheb.org>
Added a note in the IDPF Poll Mode Driver documentation to clarify that sharing a completion queue among multiple TX queues serviced by different CPU cores is not supported in split queue mode. Signed-off-by: Shaiq Wani <shaiq.wani@intel.com> Signed-off-by: 0-day Robot <robot@bytheb.org>
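As background for the commit messages above, here is a minimal sketch of the two descriptor-queue models; the struct and field names are hypothetical illustrations, not the driver's actual types:

#include <stdint.h>

struct desc;    /* opaque descriptor type, for illustration only */

/* Single queue model: one ring serves both directions. SW posts
 * buffers into it and the device writes completions back into the
 * same descriptors. */
struct single_queue_model {
    volatile struct desc *ring;   /* posted by SW, written back by HW */
    uint16_t tail;                /* SW post position */
};

/* Split queue model: posting and completion are decoupled, so SW and
 * the device can work on separate rings in parallel. */
struct split_queue_model {
    volatile struct desc *buf_ring;    /* SW posts buffers here */
    volatile struct desc *compl_ring;  /* device reports completions here */
    uint16_t buf_tail;                 /* SW post position */
    uint16_t compl_head;               /* SW consume position */
};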
Reviewer's Guide

This PR adds AVX2-accelerated split-queue Rx and Tx paths in the Intel idpf driver by implementing vectorized packet processing routines, extracting common rearm logic, and updating device registration and header declarations accordingly.

Class diagram for new and updated split queue AVX2 Rx/Tx functions

classDiagram
class idpf_rx_queue {
+adapter
+bufq2
+rx_tail
+expected_gen_id
+nb_rx_desc
+mbuf_initializer
+rx_ring
+rxrearm_nb
+rxrearm_start
+mp
+sw_ring
+fake_mbuf
+rx_stats
+qrx_tail
}
class ci_tx_queue {
+tx_tail
+nb_tx_desc
+expected_gen_id
+compl_ring
+txqs
+tx_start_qid
+rs_compl_count
+tx_free_thresh
+tx_rs_thresh
+desc_ring
+sw_ring_vec
+nb_tx_free
+qtx_tail
+complq
}
class idpf_flex_tx_sched_desc {
}
class idpf_splitq_tx_compl_desc {
}
class rte_mbuf {
+data_len
+buf_iova
+data_off
+rearm_data
}
idpf_rx_queue --> ci_tx_queue : uses in Tx
ci_tx_queue --> idpf_flex_tx_sched_desc : uses for Tx desc
ci_tx_queue --> idpf_splitq_tx_compl_desc : uses for completion
idpf_rx_queue --> rte_mbuf : uses for Rx
ci_tx_queue --> rte_mbuf : uses for Tx
class idpf_common_rxtx_avx2 {
+idpf_dp_splitq_recv_pkts_avx2()
+idpf_splitq_rearm_common()
+idpf_splitq_scan_cq_ring()
+idpf_splitq_vtx1_avx2()
+idpf_splitq_vtx_avx2()
+idpf_splitq_xmit_fixed_burst_vec_avx2()
+idpf_dp_splitq_xmit_pkts_avx2()
}
idpf_common_rxtx_avx2 --> idpf_rx_queue
idpf_common_rxtx_avx2 --> ci_tx_queue
idpf_common_rxtx_avx2 --> idpf_flex_tx_sched_desc
idpf_common_rxtx_avx2 --> idpf_splitq_tx_compl_desc
idpf_common_rxtx_avx2 --> rte_mbuf
Flow diagram for AVX2 split queue Rx packet processing

flowchart TD
Start(["Start Rx Burst"])
RearmCheck["Check if rearm needed"]
RearmCall["Call idpf_splitq_rearm_common if needed"]
HeadGenCheck["Check head generation"]
Loop["For each batch of 4 descriptors"]
DDCheck["Check DD bits"]
CopyMbuf["Copy mbuf pointers"]
BuildMbuf["Build mbuf rearm data"]
UpdateTail["Update rx_tail and expected_gen_id"]
ReturnPkts["Return received packets"]
Start --> RearmCheck
RearmCheck -->|Yes| RearmCall
RearmCheck -->|No| HeadGenCheck
RearmCall --> HeadGenCheck
HeadGenCheck -->|OK| Loop
HeadGenCheck -->|Fail| ReturnPkts
Loop --> DDCheck
DDCheck -->|All set| CopyMbuf
DDCheck -->|Not all| UpdateTail
CopyMbuf --> BuildMbuf
BuildMbuf --> Loop
Loop --> UpdateTail
UpdateTail --> ReturnPkts
Flow diagram for AVX2 split queue Tx packet processing

flowchart TD
Start(["Start Tx Burst"])
ScanCQ["Scan completion ring"]
FreeBufs["Free buffers if rs_compl_count > tx_free_thresh"]
Loop["While nb_pkts > 0"]
FixedBurst["Call idpf_splitq_xmit_fixed_burst_vec_avx2"]
UpdateTail["Update tx_tail"]
WriteTail["Write to NIC tail register"]
ReturnTx["Return nb_tx"]
Start --> ScanCQ
ScanCQ --> FreeBufs
FreeBufs --> Loop
Loop --> FixedBurst
FixedBurst --> UpdateTail
UpdateTail --> WriteTail
WriteTail --> Loop
Loop -->|Done| ReturnTx
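To make the Tx flow above easier to follow, here is a compilable sketch of the same control flow; all types and helpers are illustrative stand-ins, not the driver's internal API (the real entry point is idpf_dp_splitq_xmit_pkts_avx2):

#include <stdint.h>

struct mbuf;                       /* stand-in for rte_mbuf */
struct txq {
    uint16_t rs_compl_count;
    uint16_t tx_free_thresh;
};

/* Stubs standing in for the driver's internals. */
static void scan_cq_ring(struct txq *q) { (void)q; }
static void free_completed_bufs(struct txq *q) { q->rs_compl_count = 0; }
static uint16_t xmit_fixed_burst(struct txq *q, struct mbuf **p, uint16_t n)
{ (void)q; (void)p; return n; }

/* Mirrors the diagram: scan completions, free buffers when over the
 * threshold, then transmit in bounded bursts until done or the ring
 * backs up. */
static uint16_t
splitq_xmit_sketch(struct txq *q, struct mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t nb_tx = 0;

    scan_cq_ring(q);
    if (q->rs_compl_count > q->tx_free_thresh)
        free_completed_bufs(q);

    while (nb_pkts > 0) {
        uint16_t n = xmit_fixed_burst(q, &pkts[nb_tx], nb_pkts);
        nb_tx += n;
        nb_pkts -= n;
        if (n == 0)        /* ring full: stop early */
            break;
    }
    return nb_tx;
}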
Walkthrough

The changes add AVX2 support for split-queue RX/TX data paths in the Intel IDPF driver. A common rearm function is extracted as a public symbol for shared use. The RX/TX path selection logic is updated to recognize AVX2-optimized paths. Documentation notes a split-queue completion queue sharing limitation.
Sequence Diagram(s)

sequenceDiagram
participant App as Application
participant Driver as IDPF Driver
participant PathSel as Path Selection<br/>(idpf_set_tx_function)
participant AVX512Path as AVX512 Path
participant AVX2Path as AVX2 Path
participant ScalarPath as Scalar Path
App->>Driver: Initialize TX
Driver->>PathSel: Select TX path
alt SIMD width = 512-bit
PathSel->>AVX512Path: Use AVX512
AVX512Path->>App: TX burst function
else SIMD width = 256-bit
PathSel->>AVX2Path: Use AVX2 (new)
AVX2Path->>App: TX burst function
else
PathSel->>ScalarPath: Use Scalar
ScalarPath->>App: TX burst function
end
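The alt branches above map to a SIMD-width check at path-selection time. Below is a minimal sketch using DPDK's real rte_vect_get_max_simd_bitwidth() helper; the burst-function names are placeholders, not the driver's actual symbols:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_vect.h>

/* Placeholder burst functions standing in for the driver's real ones
 * (e.g. idpf_dp_splitq_xmit_pkts_avx2 for the 256-bit path). */
uint16_t tx_burst_avx512(void *txq, struct rte_mbuf **pkts, uint16_t n);
uint16_t tx_burst_avx2(void *txq, struct rte_mbuf **pkts, uint16_t n);
uint16_t tx_burst_scalar(void *txq, struct rte_mbuf **pkts, uint16_t n);

/* Pick the widest permitted vector path, falling back to scalar. */
static eth_tx_burst_t
select_tx_burst(void)
{
    uint16_t simd = rte_vect_get_max_simd_bitwidth();

    if (simd >= RTE_VECT_SIMD_512)
        return tx_burst_avx512;
    if (simd >= RTE_VECT_SIMD_256)
        return tx_burst_avx2;   /* new AVX2 path from this PR */
    return tx_burst_scalar;
}

The real selection logic also gates on CPU flags and queue configuration; this sketch shows only the width check the diagram describes.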
sequenceDiagram
participant RXQ as RX Queue
participant Rearm as Rearm<br/>(idpf_splitq_rearm_common)
participant Burst as RX Burst<br/>(idpf_dp_splitq_recv_pkts_avx2)
participant Desc as RX Descriptors
participant MBUF as MBUF Pool
Burst->>RXQ: Scan completion queue
Burst->>Desc: Read 4 descriptors/iteration
Burst->>Desc: Check head-gen state
loop Process descriptors
Burst->>Desc: Extract packet info
Burst->>RXQ: Update tail & GEN
end
RXQ->>Rearm: Trigger rearm when needed
Rearm->>MBUF: Allocate MBUFs in bulk
Rearm->>Desc: Initialize RX descriptors
Rearm->>RXQ: Update tail pointer
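The rearm leg of this diagram maps to a bulk-allocate-then-initialize loop. A minimal sketch, assuming DPDK's real rte_mempool_get_bulk() but simplified, illustrative queue fields:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define REARM_BURST 64   /* stand-in for IDPF_RXQ_REARM_THRESH */

/* Simplified buffer-queue view; field names are illustrative. */
struct bufq_sketch {
    struct rte_mempool *mp;
    struct rte_mbuf **sw_ring;
    uint16_t rxrearm_start;
    uint16_t rxrearm_nb;
    uint16_t nb_desc;       /* assumed power of two */
};

/* Bulk-allocate mbufs, park them in the SW ring (descriptor writes
 * omitted), and advance the rearm window, as in the diagram above. */
static void
splitq_rearm_sketch(struct bufq_sketch *q)
{
    struct rte_mbuf *mbufs[REARM_BURST];
    uint16_t i, idx = q->rxrearm_start;

    if (rte_mempool_get_bulk(q->mp, (void **)mbufs, REARM_BURST) < 0)
        return;  /* allocation failed; retry on a later burst */

    for (i = 0; i < REARM_BURST; i++)
        q->sw_ring[(idx + i) & (q->nb_desc - 1)] = mbufs[i];

    q->rxrearm_start = (idx + REARM_BURST) & (q->nb_desc - 1);
    q->rxrearm_nb -= REARM_BURST;
    /* the real code also writes buffer addresses into the RX
     * descriptors and updates the HW tail register */
}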
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Hey there - I've reviewed your changes - here's some feedback:
- The header declaration of idpf_splitq_rearm_common is missing the __rte_internal attribute – please add it to keep internal API visibility consistent.
- There's significant duplication between the AVX2 TX vector code for single-queue and split-queue paths; consider extracting shared loops or helpers to reduce maintenance overhead.
- The wrap-around and generation-bit logic in idpf_dp_splitq_recv_pkts_avx2 assumes nb_rx_desc is a power-of-two; please add an assert or explanatory comment to make this requirement explicit (a sketch of such a guard follows below).
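For that third point, a minimal sketch of the kind of guard being requested, using DPDK's real rte_is_power_of_2() helper; the function name and its placement at queue setup are assumptions for illustration:

#include <errno.h>
#include <stdint.h>
#include <rte_common.h>

/* Hypothetical setup-time check: the burst path's
 * `rx_tail &= (nb_rx_desc - 1)` wrap and generation-bit flip are only
 * correct when the ring size is a power of two. */
static int
validate_ring_size(uint16_t nb_rx_desc)
{
    if (!rte_is_power_of_2(nb_rx_desc))
        return -EINVAL;   /* reject at queue setup */
    return 0;
}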
Actionable comments posted: 2
🧹 Nitpick comments (3)
doc/guides/nics/idpf.rst (1)
82-86: Clarify the CQ-sharing limitation with guidance. Suggest adding why it's unsupported (completion ownership/coherency) and what to do instead (dedicated CQ per TX queue, or pin queues sharing a CQ to the same lcore) to reduce operator confusion.
drivers/net/intel/idpf/idpf_common_rxtx.h (2)
59-61: Macro prefix typo: IDPD → IDPF. Rename IDPD_TXQ_SCAN_CQ_THRESH to IDPF_TXQ_SCAN_CQ_THRESH for consistency with the codebase.
32-38: Duplicate constant definition. IDPF_TX_MAX_MTU_SEG is defined twice; remove one to avoid drift.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- doc/guides/nics/idpf.rst (1 hunks)
- drivers/net/intel/idpf/idpf_common_device.h (1 hunks)
- drivers/net/intel/idpf/idpf_common_rxtx.c (2 hunks)
- drivers/net/intel/idpf/idpf_common_rxtx.h (2 hunks)
- drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (4 hunks)
- drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (0 hunks)
- drivers/net/intel/idpf/idpf_rxtx.c (1 hunks)
💤 Files with no reviewable changes (1)
- drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
🧰 Additional context used
🧬 Code graph analysis (4)
drivers/net/intel/idpf/idpf_common_rxtx.h (2)
- drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (5): void (12-82), void (765-802), void (804-816), idpf_dp_splitq_recv_pkts_avx2 (485-602), idpf_dp_splitq_xmit_pkts_avx2 (910-935)
- drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (14): void (14-124), void (126-240), void (543-648), void (946-958), void (962-1012), void (1114-1145), void (1149-1161), void (1163-1213), uint16_t (243-529), uint16_t (650-934), uint16_t (1014-1082), uint16_t (1084-1104), uint16_t (1215-1277), uint16_t (1279-1306)
drivers/net/intel/idpf/idpf_common_rxtx.c (2)
- drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (4): void (12-82), void (765-802), void (804-816), idpf_dp_splitq_recv_pkts_avx2 (485-602)
- drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (8): void (14-124), void (126-240), void (543-648), void (946-958), void (962-1012), void (1114-1145), void (1149-1161), void (1163-1213)
drivers/net/intel/idpf/idpf_rxtx.c (1)
- drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (1): idpf_dp_splitq_xmit_pkts_avx2 (910-935)
drivers/net/intel/idpf/idpf_common_rxtx_avx2.c (2)
- drivers/net/intel/idpf/idpf_common_rxtx_avx512.c (8): void (14-124), void (126-240), void (543-648), void (946-958), void (962-1012), void (1114-1145), void (1149-1161), void (1163-1213)
- drivers/net/intel/idpf/idpf_rxtx_vec_common.h (1): idpf_tx_desc_done (34-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Sourcery review
🔇 Additional comments (5)
drivers/net/intel/idpf/idpf_common_device.h (1)
68-77: Enum addition looks correct; confirm table alignment. Adding IDPF_RX_AVX2 here matches the new entry in idpf_rx_path_infos. Please verify array indexes remain aligned with enum order and that unknown/default paths still resolve to IDPF_RX_DEFAULT.
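As an illustration of the alignment concern (a sketch with hypothetical names; the real enum and idpf_rx_path_infos table live in the driver headers), designated initializers are one way to keep each table row bound to its enum value:

/* Hypothetical mirror of the pattern under review. */
enum rx_func_type {
    RX_DEFAULT,
    RX_SINGLEQ,
    RX_AVX2,      /* newly added entry */
    RX_AVX512,
    RX_MAX
};

struct rx_path_info {
    const char *name;
    unsigned int simd_width;
};

/* Designated initializers keep each row tied to its enum value even
 * if entries are reordered, which addresses the alignment concern. */
static const struct rx_path_info rx_path_infos[RX_MAX] = {
    [RX_DEFAULT] = { "Scalar",              0 },
    [RX_SINGLEQ] = { "Single Scalar",       0 },
    [RX_AVX2]    = { "Split AVX2 Vector",   256 },
    [RX_AVX512]  = { "Split AVX512 Vector", 512 },
};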
drivers/net/intel/idpf/idpf_common_rxtx.h (1)
206-207: New internal APIs are fine. Signatures for common rearm and AVX2 split Rx/Tx are consistent with call sites.
Also applies to: 257-259, 267-269
drivers/net/intel/idpf/idpf_rxtx.c (1)
853-860: Split AVX2 TX selection LGTM. Correct path selection and early return; prep function set.
Please confirm idpf_dp_splitq_xmit_pkts_avx2 is compiled in the same build configs guarded by RTE_ARCH_X86.
drivers/net/intel/idpf/idpf_common_rxtx.c (2)
253-304: Common split-queue rearm path looks sound. Bulk allocation, descriptor init, tail update follow existing patterns; fallback matches prior logic.
1739-1745: New RX path table entry consistent with enum and features. “Split AVX2 Vector” entry uses SIMD_256 and correct burst function.
uint16_t
idpf_dp_splitq_recv_pkts_avx2(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
    struct idpf_rx_queue *queue = (struct idpf_rx_queue *)rxq;
    const uint32_t *ptype_tbl = queue->adapter->ptype_tbl;
    struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[queue->rx_tail];
    volatile union virtchnl2_rx_desc *rxdp =
        (volatile union virtchnl2_rx_desc *)queue->rx_ring + queue->rx_tail;
    const __m256i mbuf_init = _mm256_set_epi64x(0, 0, 0, queue->mbuf_initializer);

    rte_prefetch0(rxdp);
    nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, 4); /* 4 desc per AVX2 iteration */

    if (queue->bufq2->rxrearm_nb > IDPF_RXQ_REARM_THRESH)
        idpf_splitq_rearm_common(queue->bufq2);

    /* head gen check */
    uint64_t head_gen = rxdp->flex_adv_nic_3_wb.pktlen_gen_bufq_id;
    if (((head_gen >> VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_S) &
         VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_M) != queue->expected_gen_id)
        return 0;

    uint16_t received = 0;

    /* Shuffle mask: picks fields from each 16-byte descriptor pair into the
     * layout that will be merged into mbuf->rearm_data candidates.
     */
    const __m256i shuf = _mm256_set_epi8(
        /* high 128 bits (desc 3 then desc 2 lanes) */
        0xFF, 0xFF, 0xFF, 0xFF, 11, 10, 5, 4,
        0xFF, 0xFF, 5, 4, 0xFF, 0xFF, 0xFF, 0xFF,
        /* low 128 bits (desc 1 then desc 0 lanes) */
        0xFF, 0xFF, 0xFF, 0xFF, 11, 10, 5, 4,
        0xFF, 0xFF, 5, 4, 0xFF, 0xFF, 0xFF, 0xFF
    );

    /* mask that clears bits 14 and 15 of the packet length word */
    const __m256i len_mask = _mm256_set_epi32(
        0xffffffff, 0xffffffff, 0xffff3fff, 0xffffffff,
        0xffffffff, 0xffffffff, 0xffff3fff, 0xffffffff
    );

    const __m256i ptype_mask = _mm256_set1_epi16(VIRTCHNL2_RX_FLEX_DESC_PTYPE_M);

    for (int i = nb_pkts; i >= IDPF_VPMD_DESCS_PER_LOOP; i -= IDPF_VPMD_DESCS_PER_LOOP) {
        rxdp -= IDPF_VPMD_DESCS_PER_LOOP;

        /* Check DD bits */
        bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
                    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
        bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
                    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
        bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
                    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
        bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
                    (1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;

        if (!(dd0 && dd1 && dd2 && dd3))
            break;

        /* copy mbuf pointers */
        memcpy(&rx_pkts[i - IDPF_VPMD_DESCS_PER_LOOP],
               &sw_ring[i - IDPF_VPMD_DESCS_PER_LOOP],
               sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);

        __m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
        __m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
        __m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
        __m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));

        __m256i d23 = _mm256_set_m128i(d3, d2);
        __m256i d01 = _mm256_set_m128i(d1, d0);

        /* mask length and shuffle to build mbuf rearm data */
        __m256i desc01 = _mm256_and_si256(d01, len_mask);
        __m256i desc23 = _mm256_and_si256(d23, len_mask);
        __m256i mb01 = _mm256_shuffle_epi8(desc01, shuf);
        __m256i mb23 = _mm256_shuffle_epi8(desc23, shuf);

        /* ptype extraction */
        __m256i pt01 = _mm256_and_si256(d01, ptype_mask);
        __m256i pt23 = _mm256_and_si256(d23, ptype_mask);

        uint16_t ptype0 = (uint16_t)_mm256_extract_epi16(pt01, 1);
        uint16_t ptype1 = (uint16_t)_mm256_extract_epi16(pt01, 9);
        uint16_t ptype2 = (uint16_t)_mm256_extract_epi16(pt23, 1);
        uint16_t ptype3 = (uint16_t)_mm256_extract_epi16(pt23, 9);

        mb01 = _mm256_insert_epi32(mb01, (int)ptype_tbl[ptype1], 2);
        mb01 = _mm256_insert_epi32(mb01, (int)ptype_tbl[ptype0], 0);
        mb23 = _mm256_insert_epi32(mb23, (int)ptype_tbl[ptype3], 2);
        mb23 = _mm256_insert_epi32(mb23, (int)ptype_tbl[ptype2], 0);

        /* build rearm data for each mbuf */
        __m256i rearm0 = _mm256_permute2f128_si256(mbuf_init, mb01, 0x20);
        __m256i rearm1 = _mm256_blend_epi32(mbuf_init, mb01, 0xF0);
        __m256i rearm2 = _mm256_permute2f128_si256(mbuf_init, mb23, 0x20);
        __m256i rearm3 = _mm256_blend_epi32(mbuf_init, mb23, 0xF0);

        _mm256_storeu_si256((__m256i *)&rx_pkts[i - 4]->rearm_data, rearm0);
        _mm256_storeu_si256((__m256i *)&rx_pkts[i - 3]->rearm_data, rearm1);
        _mm256_storeu_si256((__m256i *)&rx_pkts[i - 2]->rearm_data, rearm2);
        _mm256_storeu_si256((__m256i *)&rx_pkts[i - 1]->rearm_data, rearm3);

        received += IDPF_VPMD_DESCS_PER_LOOP;
    }

    queue->rx_tail += received;
    queue->expected_gen_id ^= ((queue->rx_tail & queue->nb_rx_desc) != 0);
    queue->rx_tail &= (queue->nb_rx_desc - 1);
    if ((queue->rx_tail & 1) == 1 && received > 1) {
        queue->rx_tail--;
        received--;
    }
    queue->bufq2->rxrearm_nb += received;
    return received;
}
Fix split RX AVX2: wrong ring traversal and out-of-order rx_pkts fill.
Current loop walks rxdp backwards and fills rx_pkts from the end. On early DD break, returned count N does not populate rx_pkts[0..N-1]. Also rxdp -= 4 risks reading before rx_tail. Iterate forward from rx_tail and fill rx_pkts contiguously.
Apply this diff:
@@
- struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[queue->rx_tail];
- volatile union virtchnl2_rx_desc *rxdp =
- (volatile union virtchnl2_rx_desc *)queue->rx_ring + queue->rx_tail;
+ uint16_t idx = queue->rx_tail;
+ struct rte_mbuf **sw_ring = &queue->bufq2->sw_ring[idx];
+ volatile union virtchnl2_rx_desc *rxdp =
+ (volatile union virtchnl2_rx_desc *)queue->rx_ring + idx;
@@
- for (int i = nb_pkts; i >= IDPF_VPMD_DESCS_PER_LOOP; i -= IDPF_VPMD_DESCS_PER_LOOP) {
- rxdp -= IDPF_VPMD_DESCS_PER_LOOP;
+ for (uint16_t i = 0; i + IDPF_VPMD_DESCS_PER_LOOP <= nb_pkts; i += IDPF_VPMD_DESCS_PER_LOOP) {
@@
- /* Check DD bits */
- bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
+ /* Check DD bits */
+ bool dd0 = (rxdp[0].flex_adv_nic_3_wb.status_err0_qw1 &
(1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
- bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
+ bool dd1 = (rxdp[1].flex_adv_nic_3_wb.status_err0_qw1 &
(1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
- bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
+ bool dd2 = (rxdp[2].flex_adv_nic_3_wb.status_err0_qw1 &
(1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
- bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
+ bool dd3 = (rxdp[3].flex_adv_nic_3_wb.status_err0_qw1 &
(1U << VIRTCHNL2_RX_FLEX_DESC_ADV_STATUS0_DD_S)) != 0;
if (!(dd0 && dd1 && dd2 && dd3))
break;
@@
- /* copy mbuf pointers */
- memcpy(&rx_pkts[i - IDPF_VPMD_DESCS_PER_LOOP],
- &sw_ring[i - IDPF_VPMD_DESCS_PER_LOOP],
- sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);
+ /* copy mbuf pointers contiguously */
+ memcpy(&rx_pkts[received],
+ &sw_ring[received],
+ sizeof(rx_pkts[0]) * IDPF_VPMD_DESCS_PER_LOOP);
@@
- __m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
- __m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
- __m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
- __m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));
+ __m128i d0 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[0]));
+ __m128i d1 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[1]));
+ __m128i d2 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[2]));
+ __m128i d3 = _mm_load_si128(RTE_CAST_PTR(const __m128i *, &rxdp[3]));
@@
- _mm256_storeu_si256((__m256i *)&rx_pkts[i - 4]->rearm_data, rearm0);
- _mm256_storeu_si256((__m256i *)&rx_pkts[i - 3]->rearm_data, rearm1);
- _mm256_storeu_si256((__m256i *)&rx_pkts[i - 2]->rearm_data, rearm2);
- _mm256_storeu_si256((__m256i *)&rx_pkts[i - 1]->rearm_data, rearm3);
+ _mm256_storeu_si256((__m256i *)&rx_pkts[received + 0]->rearm_data, rearm0);
+ _mm256_storeu_si256((__m256i *)&rx_pkts[received + 1]->rearm_data, rearm1);
+ _mm256_storeu_si256((__m256i *)&rx_pkts[received + 2]->rearm_data, rearm2);
+ _mm256_storeu_si256((__m256i *)&rx_pkts[received + 3]->rearm_data, rearm3);
@@
- received += IDPF_VPMD_DESCS_PER_LOOP;
+ received += IDPF_VPMD_DESCS_PER_LOOP;
+ rxdp += IDPF_VPMD_DESCS_PER_LOOP;
}
@@
- queue->rx_tail += received;
+ queue->rx_tail += received;

🤖 Prompt for AI Agents
In drivers/net/intel/idpf/idpf_common_rxtx_avx2.c around lines 485-602: the AVX2
receive loop walks rxdp backwards and copies mbuf pointers into rx_pkts from the
end, which causes out-of-order/holes on early DD break and can underflow the
ring; change the loop to iterate forward from queue->rx_tail (increment rxdp and
index), fill rx_pkts sequentially from 0 upward, perform DD checks on the
upcoming 4 descriptors without decrementing rxdp, use modulo masking when
advancing rxdp/index to avoid reading before rx_tail, increment received per 4
processed descriptors, and then update queue->rx_tail, expected_gen_id and
bufq2->rxrearm_nb using the new forward-advanced tail; ensure memcpy, loads, and
stores use the forward indices so rx_pkts[0..received-1] are populated in order.
static inline uint16_t
idpf_splitq_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
    struct ci_tx_queue *txq = (struct ci_tx_queue *)tx_queue;
    struct idpf_flex_tx_sched_desc *txdp;
    struct ci_tx_entry_vec *txep;
    uint16_t n, nb_commit;
    uint64_t cmd_dtype = IDPF_TXD_FLEX_FLOW_CMD_EOP;
    uint16_t tx_id = txq->tx_tail;

    nb_commit = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
    nb_pkts = nb_commit;
    if (unlikely(nb_pkts == 0))
        return 0;

    txdp = (struct idpf_flex_tx_sched_desc *)&txq->desc_ring[tx_id];
    txep = &txq->sw_ring_vec[tx_id];

    txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);

    n = (uint16_t)(txq->nb_tx_desc - tx_id);
    if (nb_commit >= n) {
        ci_tx_backlog_entry_vec(txep, tx_pkts, n);

        idpf_splitq_vtx_avx2(txdp, tx_pkts, n - 1, cmd_dtype);
        tx_pkts += (n - 1);
        txdp += (n - 1);

        idpf_splitq_vtx1_avx2(txdp, *tx_pkts++, cmd_dtype);

        nb_commit = (uint16_t)(nb_commit - n);
        tx_id = 0;

        txdp = &txq->desc_ring[tx_id];
        txep = (void *)txq->sw_ring;
    }

    ci_tx_backlog_entry_vec(txep, tx_pkts, nb_commit);

    idpf_splitq_vtx_avx2(txdp, tx_pkts, nb_commit, cmd_dtype);

    tx_id = (uint16_t)(tx_id + nb_commit);
    txq->tx_tail = tx_id;

    IDPF_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);

    return nb_pkts;
}
|
Fix wrong TX backlog pointer after wrap.
After wrap, txep must point into sw_ring_vec, not sw_ring. Current cast breaks vector backlog bookkeeping.
Apply this diff:
@@
- txdp = &txq->desc_ring[tx_id];
- txep = (void *)txq->sw_ring;
+ txdp = &txq->desc_ring[tx_id];
+ txep = &txq->sw_ring_vec[tx_id];

🤖 Prompt for AI Agents
In drivers/net/intel/idpf/idpf_common_rxtx_avx2.c around lines 861-909 the code
incorrectly resets the software backlog pointer to sw_ring after a wrap; replace
that cast with a pointer into sw_ring_vec at the wrapped index (tx_id, which is
0 here) so txep points to the sw_ring_vec entry (e.g. set txep =
&txq->sw_ring_vec[tx_id]) to preserve correct vector backlog bookkeeping.
Auto-submission for "http://patchwork.dpdk.org/project/dpdk/list/?series=36480"
Summary by Sourcery
Enable AVX2-accelerated split queue Rx/Tx in the Intel idpf PMD by adding new vectorized receive and transmit routines, consolidating rearm logic, and registering the new paths in the driver
New Features:
- AVX2-accelerated split queue Rx and Tx burst routines for the idpf driver.
Enhancements:
- Common split-queue rearm logic consolidated for reuse, and the new paths registered in the driver's Rx/Tx path selection.
Summary by CodeRabbit
Documentation
- Noted that sharing a completion queue among multiple TX queues serviced by different CPU cores is unsupported in split queue mode.
New Features
- AVX2-optimized data paths for split queue Rx/Tx.