
Conversation

@ovsrobot ovsrobot commented Nov 19, 2025

NOTE: This is an auto submission for "net/mlx5: fix IP tunnel detection for IPIP and IPv6 ENCAP".

See "http://patchwork.dpdk.org/project/dpdk/list/?series=36752" for details.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added DCB forwarding commands for traffic class configuration and per-TC core allocation
    • Introduced BPF ELF-based load testing
    • Added new AES-CCM test vectors for cryptodev testing
    • Introduced performance metrics collection capability
  • Bug Fixes

    • Fixed memory allocation size calculation for flow lists
    • Corrected IPv6 address test expectation
    • Enhanced IPv4/IPv6 timestamp validation in flexible RX paths
    • Improved MLX5 driver capability detection and defaults
  • Documentation

    • Updated testpmd guidance with new DCB forwarding commands
    • Enhanced network driver documentation for raw patterns and representor syntax
    • Clarified DTS configuration paths and examples

bruce-richardson and others added 30 commits November 12, 2025 17:21
The capabilities flags for the vector offload path include the QinQ
offload capability, but in fact the offload path lacks any ability to
create context descriptors. This means that it cannot insert multiple
vlan tags for QinQ support, so move the offload from the VECTOR_OFFLOAD
list to the NO_VECTOR list. Similarly, remove any check for the QinQ
mbuf flag in any packets being transmitted, since that offload is
invalid to request if the feature is not enabled.

Fixes: 808a17b ("net/ice: add Rx AVX512 offload path")
Cc: stable@dpdk.org

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
The common Rx path selection logic iterates through an array of
candidate paths and selects the best fit for the requested features.
Currently, in the event that two potential candidates are identified,
the one with the fewer offloads (and thus less complex path) is
selected. However this is not correct, because if the path with more
offloads has a greater SIMD width, that should be chosen. This commit
reworks the logic so that the number of offloads is only taken into
consideration when choosing between two paths with the same SIMD width.

Since the paths arrays are ordered from lowest SIMD width to highest,
and vector paths tend to have fewer offloads enabled than scalar paths,
"new" candidate paths with greater SIMDs widths tended to have fewer or
equal offloads than the "current" candidate paths and thus were
correctly accepted as the best candidate. For this reason the incorrect
logic did not cause any incorrect path selections in practise.

Fixes: 9d99641 ("net/intel: introduce infrastructure for Rx path selection")

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Add methods to Linux Session class for setting the link
of an interface up and deleting an interface.

Signed-off-by: Dean Marx <dmarx@iol.unh.edu>
Reviewed-by: Patrick Robb <probb@iol.unh.edu>
Add a command to the testpmd shell for setting the portlist
(list of forwarding ports) within a testpmd session. This
allows for changing the forwarding order between ports.

Signed-off-by: Dean Marx <dmarx@iol.unh.edu>
Reviewed-by: Patrick Robb <probb@iol.unh.edu>
Add test suite covering virtio-user/vhost-user
server/client forwarding scenarios with
testpmd packet validation.

Signed-off-by: Dean Marx <dmarx@iol.unh.edu>
Reviewed-by: Patrick Robb <probb@iol.unh.edu>
Add a section to the dts rst under How to Write a Test Suite
which provides an example for how to write a test case
docstring, including a steps and verify section.

Signed-off-by: Dean Marx <dmarx@iol.unh.edu>
Reviewed-by: Luca Vizzarro <luca.vizzarro@arm.com>
Current documentation for protocol agnostic filtering for the ICE driver is a
bit terse and relies on a lot of assumed knowledge. Document the feature
better and make all of the assumptions explicit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>
The statistics contain 40 bits. The lower 32 bits are read first, followed
by the upper 8 bits.

In some cases, after the lower 32 bits are read, a carry propagates
out of the lower bits before the upper bits are read, which causes
the final statistics to be incorrect.

This commit fixes this issue.
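
A carry-safe sequence re-reads the upper bits until they are stable
across the lower read; a minimal sketch, with hypothetical register
accessors:

#include <stdint.h>

static uint64_t
read_stat40(uint32_t (*rd_lo32)(void), uint32_t (*rd_hi8)(void))
{
	uint32_t hi, lo;

	do {
		hi = rd_hi8();            /* upper 8 bits */
		lo = rd_lo32();           /* lower 32 bits */
	} while (hi != rd_hi8());         /* upper bits moved: carry, retry */

	return ((uint64_t)hi << 32) | lo;
}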

Fixes: a37bde5 ("net/ice: support statistics")
Cc: stable@dpdk.org

Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
As per recent change by the following commit:

commit 066f3d9 ("ethdev: remove callback checks from fast path")

the framework unconditionally invokes dev->tx_pkt_prepare.
Due to this, the ICE driver crashes because tx_pkt_prepare
was set to NULL during initialization.

Ensure dev->tx_pkt_prepare is not NULL when vector or simple
Tx paths are selected, by assigning rte_eth_tx_pkt_prepare_dummy.

This aligns with the expectations of the above-mentioned commit.

Bugzilla ID: 1795
Fixes: 6eac0b7 ("net/ice: support advance Rx/Tx")
Cc: stable@dpdk.org

Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Tested-by: Hailin Xu <hailinx.xu@intel.com>
As per recent change by the following commit:

commit 066f3d9 ("ethdev: remove callback checks from fast path")

the framework unconditionally invokes dev->tx_pkt_prepare. Ensure
dev->tx_pkt_prepare is not NULL when vector or simple TX paths are
selected, by assigning rte_eth_tx_pkt_prepare_dummy.

This aligns with the expectations of the above-mentioned commit.

Fixes: 7829b8d ("net/ixgbe: add Tx preparation")
Cc: stable@dpdk.org

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
As per recent change by the following commit:

commit 066f3d9 ("ethdev: remove callback checks from fast path")

the framework unconditionally invokes dev->tx_pkt_prepare. Ensure
dev->tx_pkt_prepare is not NULL when vector or simple TX paths are
selected, by assigning rte_eth_tx_pkt_prepare_dummy.

This aligns with the expectations of the above-mentioned commit.

Fixes: 9b134aa ("net/fm10k: add Tx preparation")
Cc: stable@dpdk.org

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
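
A minimal sketch of the fix pattern shared by the three commits above
(the selector function shape is hypothetical; the dummy callback is
the ethdev helper named in the commits):

#include <stdbool.h>
#include <ethdev_driver.h>

static void
select_tx_function(struct rte_eth_dev *dev, eth_tx_burst_t burst,
		   bool has_prepare_stage)
{
	dev->tx_pkt_burst = burst;
	/* Ethdev now calls tx_pkt_prepare unconditionally, so it
	 * must never be left NULL. */
	if (!has_prepare_stage)
		dev->tx_pkt_prepare = rte_eth_tx_pkt_prepare_dummy;
}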
Create a new default tree for the SDP interface if more than one Tx
queue is requested. This helps to back pressure each queue independently
when they are created with separate channels.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Drain all CQ buffers and close the CQ when an SQ with completion
enabled is about to stop.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Update the scatter check to be a warning for SDP interfaces instead
of an error, to support cases where the host application is already
aware of the max buffer size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
This patch fixes the inline device functions to work
when roc_nix is NULL.

Fixes: f81ee71 ("common/cnxk: support inline SA context invalidate")
Cc: stable@dpdk.org

Signed-off-by: Monendra Singh Kushwaha <kmonendra@marvell.com>
LSO enhanced to support flags modification. Added new mbox to enable
this feature.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Extend LSO offload to support IPv4 fragmentation.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
While performing IPv4 fragmentation, consider the DF flag from the
original packet header instead of setting it to zero.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
The SQ context is extended with a new feature: if enabled, the
counter is updated when a packet is processed, whether it is
transmitted or dropped.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Added a function to check whether the board supports
16B alignment.

Signed-off-by: Rakesh Kudurumalla <rkudurumalla@marvell.com>
On CN20K SoC, back pressure can be configured for eight
different traffic classes per pool along with threshold and
BPIDs.

RoC API is added to configure the same.

Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
By default, the SQB pool is created with max (512 buffers) +
extra threshold buffers, and the aura limit is set to 512 + thr.

But during cleanup, the aura limit is reset to MAX (512 buffers)
only before destroying the pool.

Hence while destroying the pool, only 512 buffers are drained
from the aura and the extra threshold buffers are left as they are.

If the same SQB pool is created at a later stage, the hardware
reports an error for the extra threshold buffers, saying they are
already in the pool.
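
A sketch of the intended teardown order, assuming ROC NPA helpers by
these names (the parameter names and values are illustrative):

static int
sqb_pool_teardown(uint64_t aura_handle, uint16_t max_bufs, uint16_t thr)
{
	/* Restore the aura limit to everything that was allocated
	 * (max + extra threshold buffers), not just max, so the
	 * destroy path drains every buffer. */
	roc_npa_aura_limit_modify(aura_handle, max_bufs + thr);
	return roc_npa_pool_destroy(aura_handle);
}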

Fixes: 780f90e ("common/cnxk: restore NIX SQB pool limit before destroy")
Cc: stable@dpdk.org

Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Add support for SQ resize by allocating SQB memory
in chunks of SQB size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
CN10K platform supports Tx schedulers up to 2K.

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Fix Klocwork issue for NULL pointer dereference.

Signed-off-by: Rakesh Kudurumalla <rkudurumalla@marvell.com>
Add a condition to check that the SQ is non-NULL before access. Also, pktio locks
are simplified while doing threshold_profile config.

Fixes: 90a903f ("common/cnxk: split NIX TM hierarchy enable API")
Cc: stable@dpdk.org

Signed-off-by: Satha Rao <skoteshwar@marvell.com>
Aura field width has changed from 20 bits to 17 bits for
cn20k.

Signed-off-by: Rahul Bhansali <rbhansali@marvell.com>
Fix issue reported by Klocwork.

Fixes: f410059 ("common/cnxk: support inline inbound queue")
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Fix format specifier for bandwidth profile ID.

Fixes: db5744d ("common/cnxk: support NIX debug for CN20K")
Cc: stable@dpdk.org

Signed-off-by: Aarnav JP <ajp@marvell.com>
Fix CPT res address config logic to avoid garbage values
and trigger only when inline dev is present.

Fixes: 3c31a74 ("common/cnxk: config CPT result address for CN20K")
Cc: stable@dpdk.org

Signed-off-by: Aarnav JP <ajp@marvell.com>
roidayan and others added 27 commits November 18, 2025 14:19
The cited commit removed the representor interrupt
handler cleanup by mistake.

Fixes: 5cf0707 ("net/mlx5: remove Rx queue data list from device")

Signed-off-by: Roi Dayan <roid@nvidia.com>
Acked-by: Suanming Mou <suanmingm@nvidia.com>
When creating a new mempool but assigning shared entries from a
different mempool, the newly allocated but now unused entries need
to be released. Fix it.

Fixes: 8947eeb ("common/mlx5: fix shared memory region ranges allocation")

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Suanming Mou <suanmingm@nvidia.com>
`buddy` was erroneously declared as static. When multiple
threads call this routine, they set the same static variable,
corrupting pool data, and can cause a potential double free when
releasing resources.
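
A reduced illustration of this bug class (not the mlx5dr code itself):

#include <stdlib.h>

struct buddy { unsigned int order; };

struct buddy *
get_buddy_racy(void)
{
	static struct buddy *buddy;      /* BUG: one slot shared by all threads */

	buddy = malloc(sizeof(*buddy));  /* concurrent callers clobber this */
	return buddy;                    /* may return, and later double free,
					  * another thread's allocation */
}

struct buddy *
get_buddy_fixed(void)
{
	struct buddy *buddy = malloc(sizeof(*buddy)); /* private per call */

	return buddy;
}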

Fixes: b4dd7bc ("net/mlx5/hws: add pool and buddy")

Signed-off-by: Nupur Uttarwar <nuttarwar@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
When mlx5_dev_start() fails partway through initialization, the error
cleanup code unconditionally calls cleanup functions for all steps,
including those that were never successfully initialized. This causes
state corruption leading to incorrect behavior on subsequent start
attempts.

The issue manifests as:
1. First start attempt fails with -ENOMEM (expected)
2. Second start attempt returns -EINVAL instead of -ENOMEM
3. With flow isolated mode, second attempt incorrectly succeeds,
   leading to segfault in rte_eth_rx_burst()

Root cause: The single error label cleanup path calls functions like
mlx5_traffic_disable() and mlx5_flow_stop_default() even when their
corresponding initialization functions (mlx5_traffic_enable() and
mlx5_flow_start_default()) were never called due to earlier failure.

For example, when mlx5_rxq_start() fails:
- mlx5_traffic_enable() at line 1403 never executes
- mlx5_flow_start_default() at line 1420 never executes
- But cleanup unconditionally calls:
  * mlx5_traffic_disable() - destroys control flows list
  * mlx5_flow_stop_default() - corrupts flow metadata state

This corrupts the device state, causing subsequent start attempts to
fail with different errors or, in isolated mode, to incorrectly succeed
with an improperly initialized device.

Fix by replacing the single error label with cascading error labels
(Linux kernel style). Each label cleans up only its corresponding step,
then falls through to clean up earlier steps.
This ensures only successfully initialized steps are cleaned up,
maintaining device state consistency across failed start attempts.
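
A generic sketch of the cascading-error-label pattern adopted by the
fix (the step/undo helpers are placeholders, not the mlx5 functions):

int step_rxq_start(void);
int step_traffic_enable(void);
int step_flow_start_default(void);
void undo_rxq_start(void);
void undo_traffic_enable(void);

static int
dev_start_sketch(void)
{
	int ret;

	ret = step_rxq_start();
	if (ret)
		goto error_rxq;
	ret = step_traffic_enable();
	if (ret)
		goto error_traffic;
	ret = step_flow_start_default();
	if (ret)
		goto error_flow;
	return 0;

error_flow:
	undo_traffic_enable();   /* reached only if traffic was enabled */
error_traffic:
	undo_rxq_start();        /* reached only if Rx queues started */
error_rxq:
	return ret;
}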

Bugzilla ID: 1419
Fixes: 8db7e3b ("net/mlx5: change operations for non-cached flows")
Cc: stable@dpdk.org

Signed-off-by: Maayan Kashani <mkashani@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
GCC 16 detects use of an uninitialized variable.
If the retry loop exits, the code would do a memcmp against an
uninitialized stack value. Resolve by initializing to zero.

Bugzilla ID: 1823
Fixes: 1256805 ("net/mlx5: move Linux-specific functions")
Fixes: cfee947 ("net/mlx5: fix link status to use wait to complete")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Raslan Darawsheh <rasland@nvidia.com>
If flow template table was created with resizable attribute,
then completing table resize could fail for 2 user-related reasons:

- not all flow rules were yet updated to use the resized table,
- resize was not started.

Both of these were reported with the same error message
i.e., "cannot complete table resize".

Since PMD can distinguish these 2 cases, this patch improves the error
reporting to report these 2 errors separately.

Also, this patch removes redundant __rte_unused on device parameter.

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Bing Zhao <bingz@nvidia.com>
The offending patch introduced support for additional flow tag indexes
with HW Steering flow engine. New tag indexes will be mapped
to HW registers REG_C_8 to REG_C_11, depending on HW capabilities.

That patch only handled tables created on group > 0 (non-root table),
where mlx5 PMD directly configures the HW.
Tables and flow rules on group 0 (root table) are handled through
kernel driver, and new registers were not addressed for that case.

Because of that, usage of unsupported tag index in group 0
triggered an assertion in flow_dv_match_meta_reg().

This patch adds necessary definitions for REG_C_8 to REG_C_11 to make
these registers usable for flow tag indexes in root table.
Validation of flow tag to HW register translation is also amended
to report invalid cases to the user, instead of relying on assertions.

Fixes: 7e3a144 ("net/mlx5/hws: support 4 additional C registers")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Some cases are not supported for rule hash calculation, for example
when the matcher is defined as an FW matcher, or when the hash type
is different from CRC32. For the case when the distribute mode is
not by hash, the previous condition checked a wrong capability;
this commit fixes it to perform the correct check.

Fixes: 7f5e6de ("net/mlx5/hws: query flow rule hash")
Cc: stable@dpdk.org

Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Acked-by: Bing Zhao <bingz@nvidia.com>
GCC analyzer identified a code path where acts->mhdr could be
NULL when dereferenced.
When modify header validation fails in mlx5_tbl_translate_modify_header(),
 __flow_hw_action_template_destroy() sets acts->mhdr to NULL.
Add defensive NULL check in mlx5_tbl_ensure_shared_modify_header()
to prevent the dereference.

Bugzilla ID: 1521
Fixes: 12f2ed3 ("net/mlx5: set modify header as shared flow action")
Cc: stable@dpdk.org

Signed-off-by: Shani Peretz <shperetz@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
If flow isolation is enabled, then skip flow_hw_create_ctrl_rx_tables
because these are not used with flow isolation. This saves
the unneeded resource allocation and also speeds up the device
startup time.

Fixes: 9fa7c1c ("net/mlx5: create control flow rules with HWS")
Cc: stable@dpdk.org

Signed-off-by: Nupur Uttarwar <nuttarwar@nvidia.com>
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
The mlx5_ipool_free() function was called with a NULL pool pointer
during HW flow destruction, causing a segmentation fault. This occurred
when flow creation failed and the cleanup path attempted to free
resources from an uninitialized flow pool.

The crash happened in the following scenario:
1. During device start, a default NTA copy action flow is created
2. If the flow creation fails, mlx5_flow_hw_list_destroy() is called
3. In hw_cmpl_flow_update_or_destroy(), table->flow pool could be NULL
4. mlx5_ipool_free(table->flow, flow->idx) was called without checking
   if table->flow is NULL
5. Inside mlx5_ipool_free(), accessing pool->cfg.per_core_cache caused
   a segmentation fault due to NULL pointer dereference

The fix adds two layers of protection,
1. Add NULL check for table->flow before calling mlx5_ipool_free() in
   hw_cmpl_flow_update_or_destroy(), consistent with the existing check
   for table->resource on the previous line
2. Add NULL check for pool parameter in mlx5_ipool_free() as a defensive
   measure to prevent similar crashes in other code paths

The fix also renames the ‘flow’ field in rte_flow_template_table
to ‘flow_pool’ for better code readability.
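
A minimal sketch of the first guard described above (the pool type
and body are elided; only the NULL check is the point):

#include <stdint.h>
#include <stddef.h>

struct mlx5_indexed_pool { int cfg; };

static void
mlx5_ipool_free_sketch(struct mlx5_indexed_pool *pool, uint32_t idx)
{
	if (pool == NULL)
		return;            /* flow pool may never have been created */
	(void)idx;
	/* ... existing release path touching pool->cfg ... */
}

On the caller side, hw_cmpl_flow_update_or_destroy() would test
table->flow_pool before calling the free routine, mirroring the
existing table->resource check.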

Stack trace of the fault:
  mlx5_ipool_free (pool=0x0) at mlx5_utils.c:753
  hw_cmpl_flow_update_or_destroy at mlx5_flow_hw.c:4481
  mlx5_flow_hw_destroy at mlx5_flow_hw.c:14219
  mlx5_flow_hw_list_destroy at mlx5_flow_hw.c:14279
  flow_hw_list_create at mlx5_flow_hw.c:14415
  mlx5_flow_start_default at mlx5_flow.c:8263
  mlx5_dev_start at mlx5_trigger.c:1420

Fixes: 27d171b ("net/mlx5: abstract flow action and enable reconfigure")
Cc: stable@dpdk.org

Signed-off-by: Maayan Kashani <mkashani@nvidia.com>
Acked-by: Bing Zhao <bingz@nvidia.com>
Since the auxiliary structure is associated with each rule, it can be
allocated in the same ipool allocation to save the extra overhead
of the *alloc header and the unneeded CPU cycles.

Fixes: 27d171b ("net/mlx5: abstract flow action and enable reconfigure")
Cc: stable@dpdk.org

Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Currently the Windows implementation hardcodes match criteria to
MLX5_MATCH_OUTER_HEADERS when creating flow rules, which prevents
matching on inner headers and other criteria types like NVGRE.

The fix uses the matcher's match_criteria_enable attribute instead
of hardcoding OUTER_HEADERS, and moves the assignment outside the
action switch block to apply to all cases.

NVGRE item type is also added to the supported items list.

Fixes: 1d19449 ("net/mlx5: create flow rule on Windows")
Cc: stable@dpdk.org

Signed-off-by: Itai Sharoni <isharoni@nvidia.com>
Acked-by: Bing Zhao <bingz@nvidia.com>
Update the build configuration for Microsoft Azure Cobalt 100 SoC to
use a CPU-specific mcpu value supported by GCC 14+.

Signed-off-by: Doug Foster <doug.foster@arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
The array used to save the flow rule pointers was allocated with an
incorrect length: space for one more rule should be appended, not
one more byte.
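
A reduced illustration of the sizing bug (names are illustrative):

#include <stdlib.h>

struct rte_flow;   /* opaque rule handle */

static struct rte_flow **
alloc_rule_array(size_t rules_count_per_core)
{
	/* before: malloc(rules_count_per_core + 1) reserved bytes;
	 * after: one extra rule slot, sized in pointers. */
	return malloc((rules_count_per_core + 1) *
		      sizeof(struct rte_flow *));
}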

Fixes: 070316d ("app/flow-perf: add multi-core rule insertion and deletion")
Cc: stable@dpdk.org

Signed-off-by: Bing Zhao <bingz@nvidia.com>
Reviewed-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Wisam Jaddo <wisamm@nvidia.com>
On a ConnectX-7 multi-host setup, the PF index is not contiguous;
the user needs to query the PF index based on the output of the
related sysfs entries:
    cat /sys/class/net/*/phys_port_name

Example output is 0 and 2 for PF1 and PF2; then use PF indexes 0 and 2:
   -a <Primary_PCI_BDF>,representor=pf[0,2]vf[0-2]

Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
As specified in RFC 4291 section 2.5.1, link local addresses must be
generated based on a modified EUI-64 interface identifier:

> Modified EUI-64 format interface identifiers are formed by inverting
> the "u" bit (universal/local bit in IEEE EUI-64 terminology) when
> forming the interface identifier from IEEE EUI-64 identifiers.

This translates to 'mac->addr_bytes[0] ^= 0x02'.
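
A standalone sketch of the whole RFC 4291 construction (fe80::/64
prefix plus the modified EUI-64 interface identifier):

#include <stdint.h>
#include <string.h>

static void
mac_to_ipv6_link_local(const uint8_t mac[6], uint8_t ip[16])
{
	memset(ip, 0, 16);
	ip[0] = 0xfe;
	ip[1] = 0x80;              /* fe80::/64 prefix */
	ip[8] = mac[0] ^ 0x02;     /* invert the "u" bit (the fix) */
	ip[9] = mac[1];
	ip[10] = mac[2];
	ip[11] = 0xff;             /* EUI-64 filler bytes */
	ip[12] = 0xfe;
	ip[13] = mac[3];
	ip[14] = mac[4];
	ip[15] = mac[5];
}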

Fixes: 3d6d85f ("net: add utilities for well known IPv6 address types")
Cc: stable@dpdk.org

Signed-off-by: Christophe Fontaine <cfontain@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add start tx_first method to testpmd shell, which sends
a specified number of burst packets prior to starting
packet forwarding.

Signed-off-by: Dean Marx <dmarx@iol.unh.edu>
Reviewed-by: Luca Vizzarro <luca.vizzarro@arm.com>
Reviewed-by: Patrick Robb <probb@iol.unh.edu>
Mock the Pydantic import so that even when Pydantic
is available on the system, it is not loaded by
Sphinx, ensuring we perform the doc build without
Pydantic regardless of the environment.

Signed-off-by: Patrick Robb <probb@iol.unh.edu>
Some maintainers known to be inactive for at least a year
are removed to make it clearer where help is needed.

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
There are warnings from Coverity and other tools that the handling
of intermediate bursts may wrap around. This shouldn't be possible,
but use unsigned int to avoid any issues; it is just as fast.

Coverity issue: 499471
Fixes: 0dea03e ("pdump: remove use of VLA")

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
There is a race where the request to disable pdump may get ahead
of the handling of pdump requests in dumpcap. The fix is to do
local removal of callbacks before forwarding the same to the secondary process.

To reproduce:
   1. Start testpmd and start traffic
   2. Start dumpcap to capture
   3. Interrupt dumpcap with ^C

Testpmd will show missing response and dumpcap will show error:
EAL: Cannot find action: mp_pdump

Only reproducible if additional logging is not enabled.

Fixes: c3ceb87 ("pdump: forward callback enable to secondary process")

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Create an ELF file to load, built using clang.
Repackage the object into an array using xxd.
Write a test to load and run the BPF.

If libelf library is not available, then DPDK bpf
will return -ENOTSUP to the test and the test will be skipped.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Marat Khalili <marat.khalili@huawei.com>
Tested-by: Marat Khalili <marat.khalili@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
New test using null device to test filtering with BPF.

If libelf library is not available, then DPDK bpf
will return -ENOTSUP to the test and the test will be skipped.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Marat Khalili <marat.khalili@huawei.com>
Tested-by: Marat Khalili <marat.khalili@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
This adds the BlueField-4 device ID to the list of
NVIDIA devices that run the mlx5 drivers.
The device is still in the development stage.

Signed-off-by: Raslan Darawsheh <rasland@nvidia.com>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
The HWS (Hardware Steering) IP tunnel detection implementation was not
recognizing IPIP tunnel types based on the next protocol value in the
outer IP header as specified by RFC 2003 (IP-in-IP) and RFC 2473
(IPv6 Encapsulation).

The issue was that the code treated all combinations of [IPv4|6] / [IPv4|6]
as IPIP tunnels, setting MLX5_FLOW_LAYER_IPIP flag regardless of the actual
protocol. While this happened to work for simple cases due to the fallback
classification, it violated RFC compliance and could cause issues with
protocol validation.

The fix adds RFC-compliant tunnel detection by:
1. Checking the protocol field (IPPROTO_IPIP=4 or IPPROTO_IPV6=41) in the
   outer IP header to determine tunnel type
2. Adding protocol mismatch validation to ensure outer protocol matches
   inner header type
3. Correctly setting MLX5_FLOW_LAYER_IPV6_ENCAP for IPv6 encapsulation

The expected tunnel patterns are now:
- [ipv4 | ipv6] proto is 4 / ipv4 -> MLX5_FLOW_LAYER_IPIP
- [ipv4 | ipv6] proto is 41 / ipv6 -> MLX5_FLOW_LAYER_IPV6_ENCAP

For cases without protocol specification, fallback classification based on
inner header type is still supported.
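
An illustrative decision table for the patterns above (the flag
macros and helper shape are hypothetical, mirroring the mlx5 layer
flags by name only):

#include <stdint.h>
#include <netinet/in.h>   /* IPPROTO_IPIP (4), IPPROTO_IPV6 (41) */

#define FLOW_LAYER_IPIP        (1u << 0)
#define FLOW_LAYER_IPV6_ENCAP  (1u << 1)

static int
classify_ip_tunnel(uint8_t outer_next_proto, int inner_is_ipv6,
		   unsigned int *layers)
{
	switch (outer_next_proto) {
	case IPPROTO_IPIP:            /* RFC 2003: inner must be IPv4 */
		if (inner_is_ipv6)
			return -1;    /* protocol mismatch */
		*layers |= FLOW_LAYER_IPIP;
		return 0;
	case IPPROTO_IPV6:            /* RFC 2473: inner must be IPv6 */
		if (!inner_is_ipv6)
			return -1;
		*layers |= FLOW_LAYER_IPV6_ENCAP;
		return 0;
	default:                      /* unspecified: fall back to inner type */
		*layers |= inner_is_ipv6 ? FLOW_LAYER_IPV6_ENCAP :
					   FLOW_LAYER_IPIP;
		return 0;
	}
}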

Fixes: f66c7c3 ("net/mlx5/hws: recognize IP-in-IP tunnel in definer layer")
Fixes: 4b70445 ("net/mlx5: support IP-in-IP tunnel for all combinations")
Cc: stable@dpdk.org

Signed-off-by: Junfeng Guo <junfengg@nvidia.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>

@sourcery-ai sourcery-ai bot left a comment


Sorry @ovsrobot, your pull request is larger than the review limit of 150000 diff characters


coderabbitai bot commented Nov 19, 2025

Walkthrough

This pull request introduces DCB forwarding enhancements to testpmd, adds BPF ELF-based testing support, expands CNXK driver capabilities with dynamic SQB resizing and LSO optimizations, refactors MLX5 flow handling and representor management, and improves multiple network drivers with targeted fixes and new features.

Changes

Cohort / File(s) Summary
Configuration and Maintainers
.mailmap, MAINTAINERS, VERSION
Added mailmap entries for Itai Sharoni and Nupur Uttarwar; updated Maxime Leroy mapping; removed maintainers from stable branches, public CI, EAL, and Windows support; bumped version to 25.11.0-rc3.
TestPMD DCB Forwarding
app/test-pmd/cmdline.c, app/test-pmd/config.c, app/test-pmd/testpmd.c, app/test-pmd/testpmd.h
Added two new CLI commands for DCB forwarding: set dcb fwd_tc and set dcb fwd_tc_cores. Introduced globals dcb_fwd_tc_mask and dcb_fwd_tc_cores with validation, compression logic, and multi-core per-TC queue distribution.
TestPMD Flow and Memory Fixes
app/test-flow-perf/main.c, app/test-pmd/cmd_flex_item.c
Fixed memory allocation sizing in insert_flows (allocate space for (rules_count_per_core + 1) rule pointers instead of rules_count_per_core + 1 bytes); reworked flex item parsing to use rte_flow_conv result for conditional, sized memcpy operations.
BPF ELF Testing Framework
app/test/bpf/filter.c, app/test/bpf/load.c, app/test/bpf/meson.build, app/test/test_bpf.c, app/test/meson.build
Added BPF-based test programs (test_filter, test_drop, test_allow, test_call1), Meson build configuration for clang-based BPF compilation with ELF linking, and ELF-based BPF load tests.
Cryptodev Test Extensions
app/test/test_cryptodev.c, app/test/test_cryptodev_aead_test_vectors.h
Added AES-CCM 128-bit authenticated encryption/decryption test cases with new AAD vector.
IPv6 Test Fix
app/test/test_net_ip6.c
Corrected IPv6 address constant in local IP test from 0x047b to 0x067b.
Arm Build Configuration
config/arm/meson.build
Updated Cobalt-100 mcpu configuration: changed part number config value to cobalt-100; added crypto feature to Cobalt-100 SoC; removed custom mcpu_cobalt100 definition.
CNXK NIX/NPA Driver Core
drivers/common/cnxk/hw/nix.h, drivers/common/cnxk/roc_nix.h, drivers/common/cnxk/roc_nix_priv.h
Extended NIX SQ context structures with update counter and SEB/IOVA fields; expanded public API with LSO format/SQ resize/count-update functions; added lso_ipv4_idx to nix struct; updated TM max hardware TX scheduler queues (1024→2048).
CNXK Queue and LSO Management
drivers/common/cnxk/roc_nix_queue.c, drivers/common/cnxk/roc_nix_ops.c, drivers/common/cnxk/roc_nix_tm.c
Introduced dynamic SQB resizing with helpers (sqes_per_sqb_calc, sqb_pool_dyn_populate, sqb_aura_dyn_expand); added runtime SQ count control via roc_nix_sq_cnt_update; implemented IPv4 LSO fragmentation with alt-flags profile setup.
CNXK Flow and Backpressure
drivers/common/cnxk/roc_nix_fc.c, drivers/common/cnxk/roc_npa.c, drivers/common/cnxk/roc_npa.h
Added multi-BPID backpressure configuration for CN20K; introduced roc_npa_pool_bp_configure API; replaced scalar BP field with union for threshold array (CN20K support).
CNXK Utility Updates
drivers/common/cnxk/roc_cpt.c, drivers/common/cnxk/roc_cpt.h, drivers/common/cnxk/roc_features.h, drivers/common/cnxk/roc_nix_debug.c, drivers/common/cnxk/roc_nix_inl.c, drivers/common/cnxk/roc_nix_inl_dev.c, drivers/common/cnxk/roc_nix_tm_ops.c, drivers/common/cnxk/roc_nix_tm_utils.c, drivers/common/cnxk/roc_npa_priv.h, drivers/common/cnxk/roc_npa_type.c, drivers/common/cnxk/roc_platform.h, drivers/common/cnxk/roc_platform_base_symbols.c
Introduced feature detection functions for SQ count and 16b alignment; added CPT queue entry size macro; expanded atomic operation abstractions; added NULL checks for inline device handling; exported new ROC API symbols.
MLX5 PCI and Memory
drivers/common/mlx5/mlx5_common.h, drivers/common/mlx5/mlx5_common_mr.c
Added PCI_DEVICE_ID_MELLANOX_BLUEFIELD4 constant; improved mempool MR registration by deferring old MR list cleanup to avoid deadlock.
MLX5 DevX and HCA
drivers/common/mlx5/mlx5_devx_cmds.c, drivers/common/mlx5/mlx5_devx_cmds.h
Added six new 1-bit bitfields to mlx5_hca_attr for software ownership flags (rx/tx/esw, v2 variants).
MLX5 Hardware Registers
drivers/common/mlx5/mlx5_prm.h
Replaced steering format macros with enum; expanded metadata registers (C8–C11); restructured ESW flow table capabilities layout.
MLX5 Flow Infrastructure
drivers/common/mlx5/mlx5.c, drivers/common/mlx5/mlx5.h, drivers/common/mlx5/mlx5_flow.c, drivers/common/mlx5/mlx5_flow.h, drivers/common/mlx5/mlx5_flow_dv.c, drivers/common/mlx5/mlx5_flow_hw.c
Removed repr_matching_en parameter; added capability-aware DV flow mode selection (HWS vs DV); introduced pool-based flow storage (flow_pool); added metadata register C8–C11 support; refactored tunnel detection and TAG translation error handling; cleaned up TX metadata copy flow templates.
MLX5 Advanced Features
drivers/common/mlx5/mlx5_nta_sample.c, drivers/common/mlx5/mlx5_rx.c, drivers/common/mlx5/mlx5_trigger.c, drivers/common/mlx5/mlx5_txq.c, drivers/common/mlx5/mlx5_utils.c
Added terminal action validation for sample flows; introduced CQE ownership monitoring; refactored port-start validation (removed repr_matching dependency); unconditional TX matching flow creation; added pool NULL checks.
MLX5 Windows Driver
drivers/common/mlx5/windows/mlx5_flow_os.c, drivers/common/mlx5/windows/mlx5_flow_os.h
Made match-criteria-enable dynamic per matcher; added NVGRE as supported item type.
CNXK Crypto Operations
drivers/crypto/cnxk/cn20k_cryptodev_ops.c, drivers/crypto/cnxk/cnxk_cryptodev_ops.c, drivers/crypto/cnxk/cnxk_cryptodev_ops.h
Set completion-queue flag for CN20K instructions; introduced per-LF CQ handling and CN20K-specific paths; restructured inflight request metadata layout.
Other Crypto Drivers
drivers/crypto/dpaa2_sec/dpaa2_sec_dpseci.c, drivers/crypto/mlx5/mlx5_crypto.c, drivers/crypto/qat/qat_sym_session.c
Removed stored cryptodev reference in DPAA2; added BLUEFIELD4 to MLX5 crypto PCI map; changed CCM hash-state sizing from digest size to aligned AAD length.
DMA Driver
drivers/dma/dpaa2/dpaa2_qdma.c
Added forward declaration and uninit call for DPDMAI on close; removed dmadev assignment in probe.
Network Drivers: Axgbe
drivers/net/axgbe/axgbe_ethdev.c
Implemented read-modify-write sequence in timesync disable for MAC_TSCR.
Network Drivers: CNXK Ethernet
drivers/net/cnxk/cn10k_ethdev.c, drivers/net/cnxk/cn10k_ethdev_sec.c, drivers/net/cnxk/cn20k_ethdev.c, drivers/net/cnxk/cnxk_ethdev.c, drivers/net/cnxk/cnxk_ethdev_mtr.c, drivers/net/cnxk/cnxk_ethdev_ops.c
Added TX completion handling in queue release; guarded inline LF checks; enabled ROC CQ cleanup for TX; relaxed scatter-offload enforcement on SDP; added meter color index bounds checking.
Network Drivers: Bus and Memory
drivers/bus/fslmc/bus_fslmc_driver.h, drivers/bus/fslmc/fslmc_bus.c
Removed anonymous union from rte_dpaa2_device (deprecated pointer members); implemented active plug/unplug with driver probing and removal callbacks.
Network Drivers: DPAA2
drivers/net/dpaa2/dpaa2_ethdev.c, drivers/net/dpaa2/dpaa2_ethdev.h, drivers/net/dpaa2/dpaa2_recycle.c
Added multi-queue active DPS cleanup helper; removed dpaa2_dev_recycle_qp_setup function; improved probe/remove lifecycle with named device lookup.
Network Drivers: Intel Common and E1000
drivers/net/intel/common/rx.h, drivers/net/intel/e1000/igb_ethdev.c
Renamed RX path tracking from "current" to "chosen"; added flex filter mask array boundary check.
Network Drivers: FM10K and iAVF
drivers/net/intel/fm10k/fm10k_ethdev.c, drivers/net/intel/iavf/iavf_ethdev.c, drivers/net/intel/iavf/iavf_rxtx.c, drivers/net/intel/iavf/iavf_rxtx.h
Set dummy TX packet prepare for vector paths; added RX timestamp capability check; gated timestamp processing on TS_VALID bit.
Network Drivers: Ice
drivers/net/intel/ice/ice_ethdev.c, drivers/net/intel/ice/ice_rxtx.c, drivers/net/intel/ice/ice_rxtx_vec_common.h
Hardened 40-bit counter wraparound handling; set dummy TX prepare for vector; added QINQ_INSERT to non-vector flags; narrowed VLAN offload condition.
Network Drivers: IDPF and iXGBE
drivers/net/intel/idpf/idpf_rxtx.c, drivers/net/intel/ixgbe/base/ixgbe_vf.c, drivers/net/intel/ixgbe/ixgbe_rxtx.c
Added TCP TSO offload handling; reduced PF link state mailbox size; assigned dummy TX prepare for simple path.
Network Drivers: MLX5 HWS and Utilities
drivers/net/mlx5/hws/mlx5dr_buddy.c, drivers/net/mlx5/hws/mlx5dr_definer.c, drivers/net/mlx5/hws/mlx5dr_pool.c, drivers/net/mlx5/hws/mlx5dr_rule.c
Added buddy structure freeing; updated tunnel encapsulation detection (IPIP/IPV6_ENCAP); removed persistent buddy state; changed matcher support condition from hash to distribute mode.
Network Drivers: MLX5 OS and Linux
drivers/net/mlx5/linux/mlx5_ethdev_os.c, drivers/net/mlx5/linux/mlx5_os.c
Zero-initialized rte_eth_link in link update; added PF representor ignore logic; introduced send-to-kernel action cleanup helper.
Network Drivers: QEDE (Removed)
drivers/net/qede/base/ecore_mng_tlv.c
Removed entire TLV (type-length-value) processing module for MFW communication.
Documentation Updates
doc/api/dts/tests.*.rst, doc/guides/conf.py, doc/guides/nics/ice.rst, doc/guides/nics/mlx5.rst, doc/guides/prog_guide/ethdev/ethdev.rst, doc/guides/rel_notes/deprecation.rst, doc/guides/rel_notes/release_25_11.rst, doc/guides/testpmd_app_ug/testpmd_funcs.rst, doc/guides/tools/dts.rst
Added DTS test suite documentation; updated autodoc mock imports; expanded raw pattern and representor documentation; revised MLX5 DV flow defaults and metadata documentation; documented new testpmd DCB commands; updated DTS configuration references.
DTS Framework
dts/api/packet.py, dts/api/test.py
Added assess_performance_by_packet function for performance testing; wired traffic-generator references to func_tg; introduced write_performance_json for test metrics export.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • CNXK Dynamic SQ Resizing (drivers/common/cnxk/roc_nix_queue.c): Complex multi-helper refactoring for SQB pool management, dynamic sizing, and per-TC distribution logic with intricate state transitions.
  • MLX5 Flow Representor Refactoring (drivers/net/mlx5/mlx5.c, mlx5_flow_hw.c): Major capability-driven logic replacing repr_matching_en parameter; verify all representor detection and proxy port handling paths.
  • MLX5 Metadata Registers (drivers/net/mlx5/mlx5_flow_dv.c): Extended metadata support (C8–C11) with new register mapping; ensure correct field placement in misc5.
  • CNXK LSO and Multi-BPID (drivers/common/cnxk/roc_nix_ops.c, roc_nix_fc.c): New IPv4 fragmentation and CN20K-specific multi-BPID backpressure paths; verify model-specific branching correctness.
  • CNXK Crypto Per-LF CQ (drivers/crypto/cnxk/cnxk_cryptodev_ops.c): Significant per-LF handling refactoring affecting queue lifecycle; confirm null-checks and model-conditional paths.
  • MLX5 Port-Start Validation (drivers/net/mlx5/mlx5_trigger.c): Simplified representor dependency logic; ensure all cleanup paths properly revert on error with the new macro-based flow.
  • Testpmd DCB Configuration (app/test-pmd/config.c): Complex queue distribution logic across multiple cores per TC with DCB compression; verify per-core stream count increments and queue mapping.

Poem

🐰 A rabbit hops through code so grand,
New features bloom across the land—
DCB queues and LSO dance free,
MLX5 flows and SQ count spree!
Metadata registers multiply with grace,
CNXK speeds up the network race!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 9.65%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The PR title clearly identifies the main change as fixing IP tunnel detection for IPIP and IPv6 ENCAP in the net/mlx5 driver, which aligns with the file summary showing tunnel classification updates in mlx5dr_definer.c.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (13)
drivers/net/axgbe/axgbe_ethdev.c (1)

1745-1759: Critical: Incorrect bit manipulation - OR operation cannot clear bits.

The disable logic is fundamentally broken. The code uses value |= mac_tscr (line 1757) to apply disable settings, but OR operations can only set bits to 1, never clear them to 0. Since AXGMAC_SET_BITS(..., 0) creates fields with 0 values to disable features, ORing these with the current register value will fail to clear the corresponding enable bits that are currently set to 1.

Additionally, line 1750 already directly modifies the MAC_TSCR register via AXGMAC_IOWRITE_BITS, making the subsequent read at line 1756 redundant and the approach inconsistent. The enable function axgbe_timesync_enable uses a direct write pattern without read-modify-write.

To fix this, either:

  1. Use AND with inverted mask to clear bits, or
  2. Follow the enable function's pattern with direct write:
 	struct axgbe_port *pdata = dev->data->dev_private;
 	unsigned int mac_tscr = 0;
-	unsigned int value = 0;
 
 	/*disable timestamp for all pkts*/
 	AXGMAC_SET_BITS(mac_tscr, MAC_TSCR, TSENALL, 0);
-	/*disable the addened register*/
-	AXGMAC_IOWRITE_BITS(pdata, MAC_TSCR, TSADDREG, 0);
+	/*disable the addend register*/
+	AXGMAC_SET_BITS(mac_tscr, MAC_TSCR, TSADDREG, 0);
 	/* disable timestamp update */
 	AXGMAC_SET_BITS(mac_tscr, MAC_TSCR, TSCFUPDT, 0);
 	/*disable time stamp*/
 	AXGMAC_SET_BITS(mac_tscr, MAC_TSCR, TSENA, 0);
-
-	value = AXGMAC_IOREAD(pdata, MAC_TSCR);
-	value |= mac_tscr;
-	AXGMAC_IOWRITE(pdata, MAC_TSCR, value);
-
+	AXGMAC_IOWRITE(pdata, MAC_TSCR, mac_tscr);
+ 
 	return 0;
drivers/net/intel/ice/ice_rxtx_vec_common.h (1)

1-207: PR title and description do not match the actual file changes.

The PR is titled "net/mlx5: fix IP tunnel detection for IPIP and IPv6 ENCAP" and references the MLX5 driver, but this file is part of the Intel ICE driver (drivers/net/intel/ice/) and the changes relate to QINQ offload handling, not IP tunnel detection. Please verify:

  1. Is this the correct file for this PR?
  2. Should the PR title and description be updated to reflect the actual changes?
  3. Are there additional MLX5-related files that should be in this PR?
drivers/net/cnxk/cnxk_ethdev_ops.c (1)

1-1: PR title doesn't match the changes in this file.

The PR title references "net/mlx5: fix IP tunnel detection for IPIP and IPv6 ENCAP", but this file is part of the CNXK driver (drivers/net/cnxk/). The AI summary also indicates this PR contains changes across multiple drivers (DCB, BPF, CNXK, MLX5) and components.

Consider organizing related changes into separate PRs to improve reviewability and change tracking.

drivers/net/mlx5/mlx5_flow.c (1)

8810-8818: Missing closing brace for if (dev->data->dev_started) causes syntax error

As written, the if (dev->data->dev_started) { block is never closed before priv->isolated = !!enable;, which makes the function invalid C and changes the scope of subsequent statements. You likely intended an early-return guard with the rest of the function outside the if.

Add the missing } after the return -rte_errno;:

 	if (dev->data->dev_started) {
 		rte_flow_error_set(error, EBUSY,
 				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 				   NULL,
 				   "port must be stopped first");
 		return -rte_errno;
+	}
 
 	priv->isolated = !!enable;
drivers/net/mlx5/mlx5_trigger.c (1)

59-67: Fix potential null dereference in mlx5_txq_start()

txq_ctrl is dereferenced to initialize txq_data before you check if (!txq_ctrl), so a NULL returned from mlx5_txq_get() would cause undefined behaviour. The if (!txq_ctrl) branch below suggests NULL is expected.

Consider reordering:

-		for (i = 0; i != priv->txqs_n; ++i) {
-			struct mlx5_txq_ctrl *txq_ctrl = mlx5_txq_get(dev, i);
-			struct mlx5_txq_data *txq_data = &txq_ctrl->txq;
-
-			if (!txq_ctrl)
-				continue;
+		for (i = 0; i != priv->txqs_n; ++i) {
+			struct mlx5_txq_ctrl *txq_ctrl = mlx5_txq_get(dev, i);
+			struct mlx5_txq_data *txq_data;
+
+			if (!txq_ctrl)
+				continue;
+			txq_data = &txq_ctrl->txq;

This avoids dereferencing a possible NULL txq_ctrl.

drivers/net/cnxk/cn10k_ethdev_sec.c (1)

1333-1352: Avoid leaving rx_inject flag enabled when inline LF is missing

The new inl_lf NULL guard is good, but roc_idev_nix_rx_inject_set(port_id, enable); has already been called. If roc_nix_inl_inb_inj_lf_get(nix) returns NULL, the function now exits with -ENOTSUP while the global inject flag stays enabled and inj_cfg is never initialized, which can lead to inconsistent behavior later.

Consider reverting the flag (or moving the setter below the guard), e.g.:

	inl_lf = roc_nix_inl_inb_inj_lf_get(nix);
-	if (!inl_lf)
-		return -ENOTSUP;
+	if (!inl_lf) {
+		roc_idev_nix_rx_inject_set(port_id, false);
+		return -ENOTSUP;
+	}
drivers/common/cnxk/roc_nix_inl_dev.c (1)

245-252: Error cleanup for CPT inline queue missing slot/queue ID

In nix_cfg_fail, the cleanup now sends a cpt_rx_inl_queue_cfg message with only enable = 0 set:

cpt_req = mbox_alloc_msg_cpt_rx_inl_queue_cfg(mbox);
if (!cpt_req) {
	rc |= -ENOSPC;
} else {
	cpt_req->enable = 0;
	rc |= mbox_process(mbox);
}

Unlike the normal release path in nix_inl_inb_queue_release(), this does not populate slot or rx_queue_id, so the FW may not correctly identify which CPT LF/queue to disable (or might operate on defaults).

To mirror the release path and avoid leaking or mis-disabling queues, populate these fields as well:

nix_cfg_fail:
	cpt_req = mbox_alloc_msg_cpt_rx_inl_queue_cfg(mbox);
	if (!cpt_req) {
		rc |= -ENOSPC;
	} else {
-		cpt_req->enable = 0;
+		cpt_req->enable = 0;
+		cpt_req->slot = slot_id;
+		cpt_req->rx_queue_id = qid;
		rc |= mbox_process(mbox);
	}
drivers/common/cnxk/roc_nix_queue.c (1)

2069-2149: SQ init: consider cleaning up SQB aura on failures after pool population

roc_nix_sq_init() now branches between sqb_pool_dyn_populate() and sqb_pool_populate(), but on subsequent failures (TM leaf fetch, SQ context init, mbox_process) the nomem path only frees sq->fc:

if (rc)
	goto nomem;
/* ... later ... */
nomem:
	plt_free(sq->fc);
fail:
	return rc;

At that point the SQB aura and backing memory (static or dynamic) have already been created and are leaked if init fails.

It would be safer to also destroy the pool in this path, e.g. along the lines of:

nomem:
-	plt_free(sq->fc);
+	if (sq->aura_handle)
+		roc_npa_pool_destroy(sq->aura_handle);
+	if (!roc_nix->sq_resize_ena && sq->sqe_mem)
+		plt_free(sq->sqe_mem);
+	plt_free(sq->fc);

This is a pre‑existing pattern but becomes more important with dynamic SQB resizing, where failures are more likely during bring‑up.

app/test/test_bpf.c (1)

5-11: Add <errno.h> for system errno usage in create_temp_bpf_file()

The code uses system errno with strerror(errno) at lines 3299, 3307, and 3318 within the create_temp_bpf_file() helper, which calls standard C library functions (asprintf(), mkstemps(), write()) that set system errno. However, the file only includes <rte_errno.h> (DPDK-specific), not the required system header <errno.h>. This will cause portability issues and build failures on strict toolchains.

Add #include <errno.h> after the other standard C headers (before the DPDK headers):

-#include <stdint.h>
-#include <inttypes.h>
-#include <unistd.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <errno.h>
app/test-pmd/cmd_flex_item.c (1)

141-165: Fix incorrect rte_flow_conv() source parameter and add error/bounds checks

In flex_link_item_parse() (lines 145–165), the code passes item to rte_flow_conv(RTE_FLOW_CONV_OP_ITEM_MASK, NULL, 0, item, NULL). However, item is the destination buffer with only type set, while pattern is the source item with spec/mask/last fully populated by flow_parse(). The conversion should query from pattern, not the partially-initialized item.

This causes three problems:

  1. Wrong conversion source: rte_flow_conv() returns the mask size for the item, which should be based on the parsed pattern, not the destination item.
  2. Unhandled errors: Negative ret values (conversion failures) are never checked; many similar uses in the codebase check ret <= 0.
  3. Buffer overflow: Pre-allocated spec/mask buffers are FLEX_MAX_FLOW_PATTERN_LENGTH (64 bytes). If ret > 64, the subsequent memcpy() calls overflow these fixed buffers.

Additionally, the current logic sets item->spec/mask to NULL when ret <= 0, which causes valid JSON to be rejected by the downstream validation in flex_link_parse() (line 184: if (!link->item.spec || !link->item.mask) return -EINVAL;).

Fix:

 	item->type = pattern->type;
-	ret = rte_flow_conv(RTE_FLOW_CONV_OP_ITEM_MASK, NULL, 0, item, NULL);
-	if ((ret > 0) && pattern->spec) {
+	/* Query mask size from the parsed pattern item. */
+	ret = rte_flow_conv(RTE_FLOW_CONV_OP_ITEM_MASK, NULL, 0, pattern, NULL);
+	if (ret < 0)
+		return ret;
+	if (ret > FLEX_MAX_FLOW_PATTERN_LENGTH)
+		return -EINVAL;
+	if (pattern->spec) {
 		ptr = (void *)(uintptr_t)item->spec;
 		memcpy(ptr, pattern->spec, ret);
 	} else {
 		item->spec = NULL;
 	}
-	if ((ret > 0) && pattern->mask) {
+	if (pattern->mask) {
 		ptr = (void *)(uintptr_t)item->mask;
 		memcpy(ptr, pattern->mask, ret);
 	} else {
 		item->mask = NULL;
 	}
-	if ((ret > 0) && pattern->last) {
+	if (pattern->last) {
 		ptr = (void *)(uintptr_t)item->last;
 		memcpy(ptr, pattern->last, ret);
 	} else {
 		item->last = NULL;
 	}
drivers/net/mlx5/mlx5_flow_hw.c (3)

11305-11333: ctrl-rx create/cleanup: partial-failure leak and differing responsibilities

  • mlx5_flow_hw_create_ctrl_rx_tables() allocates priv->hw_ctrl_rx and then creates pattern templates per (eth_pattern_type, rss_type). On failure it just returns -rte_errno without cleaning up already-created templates or hw_ctrl_rx. This will leak templates and the container struct on early failure; consider calling mlx5_flow_hw_cleanup_ctrl_rx_tables(dev) or an internal rollback helper from the err: path.
  • mlx5_flow_hw_cleanup_ctrl_rx_tables() fully destroys tables, pattern templates, RSS action templates, and frees hw_ctrl_rx.
  • mlx5_flow_hw_cleanup_ctrl_rx_templates() only tears down the tables and RSS templates (and nulls their pointers) but leaves pattern templates and hw_ctrl_rx intact, presumably for reuse on reconfigure.

The split responsibilities make sense, but it would help to document when each should be used (e.g., “templates” for port stop/reconfigure, “tables” for final teardown) and to ensure the create‑path error cleanup calls the stronger helper.

Also applies to: 11592-11629, 11632-11661


15278-15315: flow_hw_table_resize_multi_pattern_actions uses undefined table identifier

Inside flow_hw_table_resize_multi_pattern_actions() the first line uses:

struct mlx5_multi_pattern_segment *segment = table->mpctx.segments;

but the parameter is named tbl. This won’t compile and should be:

struct mlx5_multi_pattern_segment *segment = tbl->mpctx.segments;

and any other table-> references in this helper should be adjusted to tbl->.

-static int
-flow_hw_table_resize_multi_pattern_actions(struct rte_eth_dev *dev,
-					   struct rte_flow_template_table *tbl,
-					   uint32_t nb_flows,
-					   struct rte_flow_error *error)
-{
-	struct mlx5_multi_pattern_segment *segment = table->mpctx.segments;
+static int
+flow_hw_table_resize_multi_pattern_actions(struct rte_eth_dev *dev,
+					   struct rte_flow_template_table *tbl,
+					   uint32_t nb_flows,
+					   struct rte_flow_error *error)
+{
+	struct mlx5_multi_pattern_segment *segment = tbl->mpctx.segments;

13620-13627: Verify Windows platform buffer overflow and update comment

The review concern is partially verified and requires attention:

Confirmed issues:

  1. Platform-specific buffer overflow risk: On Windows (mlx5_os.c:550), priv->flows[i] is allocated with size sizeof(rte_flow_hw) + sizeof(rte_flow_nt2hws) only—without rte_flow_hw_aux. However, flow_hw_prepare() at line 13622 writes to memory beyond the allocated chunk by computing: (uintptr_t)((*flow)->nt2hws) + sizeof(struct rte_flow_nt2hws). On Linux with HWS support (mlx5_os.c:1654-1655), the allocation correctly includes the aux structure, but Windows lacks this.

  2. Misleading comment: Lines 13609-13610 claim the pool size is only "sizeof(rte_flow_hw) + sizeof(rte_flow_nt2hws)," omitting any reference to rte_flow_hw_aux. This is accurate for Windows but misleading for Linux HWS configurations where aux is allocated.

  3. Dead code confirmed: The null check at lines 13624-13627 cannot fire for a valid allocation returned by mlx5_ipool_zmalloc(), since pointer arithmetic on a valid pointer cannot yield NULL.

Required actions:

  • Ensure Windows allocation (mlx5_os.c:550) includes space for sizeof(struct rte_flow_hw_aux) to match the code's expectations.
  • Update the misleading comment at lines 13609-13610 to clarify the full layout.
  • Remove the unreachable null check at lines 13624-13627.
🧹 Nitpick comments (43)
doc/guides/rel_notes/release_25_11.rst (1)

247-254: Add "previously announced" language for consistency with other removed items.

This entry breaks the established pattern in the Removed Items section—all other removals reference prior deprecation announcements (e.g., lines 230, 237, 243). Add a parallel introductory phrase for consistency.

Additionally, clarify the grammatical scope of "Applications which disabled this option." The intent appears to be applications that disabled repr_matching_en (set it to false), but the phrasing is ambiguous.

Apply this diff to improve consistency and clarity:

-* net/mlx5: ``repr_matching_en`` device argument has been removed.
-  Applications which disabled this option were able to receive traffic
-  from any physical port/VF/SF on any representor port.
-  Specifically, in most cases, this was used to process all traffic on representor port
-  which is a transfer proxy port.
-  Similar behavior in mlx5 PMD can be achieved without this device argument,
-  by using ``RTE_FLOW_ACTION_TYPE_RSS`` in transfer flow rules.
+* net/mlx5: As previously announced, the ``repr_matching_en`` device argument has been removed.
+  Applications that previously disabled this option (repr_matching_en=false) were able to receive traffic
+  from any physical port/VF/SF on any representor port.
+  This option was commonly used to process all traffic on a representor port, which acts as a transfer proxy.
+  Equivalent behavior can be achieved using ``RTE_FLOW_ACTION_TYPE_RSS`` in transfer flow rules.
drivers/net/cnxk/cnxk_ethdev_mtr.c (1)

1264-1270: Good defensive bounds checking added.

The index validation prevents out-of-bounds access on color_map and safely defaults to GREEN for invalid inputs. This improves robustness.

Consider these optional enhancements:

  1. Add logging when defaulting occurs to aid debugging:
 	/* Index validation */
-	if (idx >= RTE_COLORS)
+	if (idx >= RTE_COLORS) {
+		plt_warn("Invalid default_input_color %d, defaulting to GREEN", idx);
 		cfg->icolor = ROC_NIX_BPF_COLOR_GREEN;
-	else
+	} else {
 		cfg->icolor = color_map[idx];
+	}
  2. Apply consistent validation to other color_map indexing sites for completeness:
    • Line 765: color_map[dscp_table[i]]
    • Line 837: color_map[vlan_table[i]]
    • Lines 1395, 1411: color_map[mtr->params.dscp_table[i]] and color_map[mtr->params.vlan_table[i]]
drivers/common/mlx5/mlx5_devx_cmds.h (1)

339-344: Consider adding documentation for the new software ownership fields.

These six new bitfields lack documentation explaining their purpose, when they're populated, and how they should be used. Consider adding a comment block describing:

  • What "software ownership" means in this context
  • The difference between v2 and non-v2 versions
  • When each field is expected to be set

Example:

+	/* Software ownership flags for flow tables:
+	 * - rx_sw_owner/rx_sw_owner_v2: RX flow table ownership
+	 * - tx_sw_owner/tx_sw_owner_v2: TX flow table ownership  
+	 * - esw_sw_owner/esw_sw_owner_v2: E-Switch flow table ownership
+	 * v2 variants are preferred when available.
+	 */
 	uint8_t rx_sw_owner:1;
 	uint8_t rx_sw_owner_v2:1;
drivers/bus/fslmc/fslmc_bus.c (1)

617-628: Clarify return value when no matching driver is found.

When the driver iteration completes without finding a match, the function returns 0 (success). This could be misleading if the caller expects an error when a device cannot be probed. Consider returning -ENOENT or a similar error code when no matching driver is found.

Apply this pattern:

+		break;
+	}
+
+	/* If loop completed without break, no driver was found */
+	if (drv == NULL)
+		return -ENOENT;
+
 	return ret;
drivers/crypto/qat/qat_sym_session.c (1)

2777-2780: CCM hash_state_sz change looks correct; you can simplify the alignment.

Basing auth_param->hash_state_sz on the (aligned) AAD size instead of digestsize is consistent with the GCM path and looks like the right fix for CCM. Since auth_param->u2.aad_sz is already aligned to ICP_QAT_HW_CCM_AAD_ALIGNMENT when aad_length > 0, you could compute this as:

auth_param->hash_state_sz = auth_param->u2.aad_sz >> 3;

and, if needed, ensure u2.aad_sz is aligned in the aad_length == 0 branch as well. That would remove the redundant RTE_ALIGN_CEIL call and make the intent a bit clearer without changing behavior.

dts/api/packet.py (2)

91-93: Remove redundant cast after isinstance assertion.

The cast on line 93 is unnecessary because the isinstance assertion on line 91 already serves as a type guard for static type checkers.

Apply this diff to simplify:

-    assert isinstance(
-        get_ctx().func_tg, CapturingTrafficGenerator
-    ), "Cannot capture with a non-capturing traffic generator"
-    tg: CapturingTrafficGenerator = cast(CapturingTrafficGenerator, get_ctx().func_tg)
+    assert isinstance(
+        get_ctx().func_tg, CapturingTrafficGenerator
+    ), "Cannot capture with a non-capturing traffic generator"
+    tg = get_ctx().func_tg

347-350: Remove redundant cast after isinstance assertion.

Similar to line 93, the cast on line 350 is unnecessary after the isinstance assertion.

Apply this diff:

-    assert isinstance(
-        get_ctx().perf_tg, PerformanceTrafficGenerator
-    ), "Cannot send performance traffic with non-performance traffic generator"
-    tg: PerformanceTrafficGenerator = cast(PerformanceTrafficGenerator, get_ctx().perf_tg)
+    assert isinstance(
+        get_ctx().perf_tg, PerformanceTrafficGenerator
+    ), "Cannot send performance traffic with non-performance traffic generator"
+    tg = get_ctx().perf_tg
dts/api/test.py (1)

152-153: Refine timestamp handling to avoid overrides and include timezone info.

Right now, a timestamp key in performance_data will override the generated timestamp, and the ISO string is naive (no timezone). If you want this helper to always stamp when it runs and log unambiguous times, consider:

-    perf_data = {"timestamp": datetime.now().isoformat(), **performance_data}
+    perf_data = {
+        **performance_data,
+        "timestamp": datetime.now().astimezone().isoformat(),
+    }

This ensures the helper’s timestamp always wins and includes timezone information.

drivers/net/mlx5/mlx5_nta_sample.c (2)

528-562: Clarify which action list is actually being validated

validate_sample_terminal_actions() walks the actions[] array from the passed-in pointer and enforces constraints on the last non‑VOID action. At the call site you pass sample (the RTE_FLOW_ACTION_TYPE_SAMPLE action in the outer action list), so this function currently validates “last non‑VOID action from SAMPLE onward” (i.e. SAMPLE + suffix actions), not the inner sample_conf->actions list.

If the goal is to validate the terminal action list executed by the SAMPLE action, this should instead receive sample_conf->actions (or sample_actions) and be called after retrieving sample_conf. If the intent is to constrain the terminal action of the outer flow for sample flows, it would help to document that explicitly or rename the parameter to make this clear.


866-876: Make the SAMPLE precondition explicit at the call site

validate_sample_terminal_actions(dev, attr, sample) assumes sample points to a valid actions array terminated by RTE_FLOW_ACTION_TYPE_END. Immediately after, sample->conf is dereferenced as well. If mlx5_nta_sample_flow_list_create() is ever invoked on an actions list without a SAMPLE action, both uses will access invalid memory.

If the higher-level design guarantees that this helper is only called for flows that contain SAMPLE, consider adding an assertion or an early error when sample == NULL to make the precondition explicit and fail predictably.
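
A minimal sketch of such a precondition check (assuming the driver's MLX5_ASSERT debug macro is acceptable here):

/* Fail loudly in debug builds if no SAMPLE action was located */
MLX5_ASSERT(sample != NULL &&
	    sample->type == RTE_FLOW_ACTION_TYPE_SAMPLE);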

drivers/dma/dpaa2/dpaa2_qdma.c (1)

1511-1512: Propagate dpaa2_dpdmai_dev_uninit() errors and confirm close semantics

Wiring dpaa2_dpdmai_dev_uninit() into dpaa2_qdma_close() nicely centralizes DPDMAI teardown, and the qdma_dev == NULL guard in dpaa2_qdma_close() keeps double-close safe.

Two follow-ups are worth considering:

  • dpaa2_qdma_close() always returns 0, even if dpdmai_close() fails inside dpaa2_dpdmai_dev_uninit(). Propagating that error would make failures visible to callers:
-    dpaa2_dpdmai_dev_uninit(dev);
-
-    return 0;
+    return dpaa2_dpdmai_dev_uninit(dev);
  • On the primary process, dpaa2_dpdmai_dev_uninit() frees qdma_dev, so after dpaa2_qdma_close() the device cannot be reconfigured or started again without a fresh probe / init. Please double-check that this matches the intended DPDK dmadev lifecycle (i.e., that dev_close is terminal for this PMD’s usage model).
drivers/net/mlx5/hws/mlx5dr_definer.c (1)

3381-3399: IPIP / IPv6 ENCAP tunnel detection fix looks right; consider extension‑header edge case

The new logic that, for [IPv4|IPv6] / IPv4 and [IPv4|IPv6] / IPv6, sets cd.tunnel = true and tags item_flags with MLX5_FLOW_LAYER_IPIP / MLX5_FLOW_LAYER_IPV6_ENCAP before calling the IPv4/IPv6 converters correctly ensures the inner IP header is treated as tunneled, fixing the earlier misclassification.

Two things worth double‑checking:

  • That MLX5_FLOW_LAYER_IPIP and MLX5_FLOW_LAYER_IPV6_ENCAP both carry the MLX5_FLOW_LAYER_TUNNEL bit so mlx5dr_definer_set_conv_tunnel() continues to work for deeper inner layers.
  • Whether sequences like IPv6 / IPv6_ROUTING_EXT / IPv6 are expected to be recognized as IPv6 ENCAP as well; those won’t satisfy cd.last_item == RTE_FLOW_ITEM_TYPE_IPV4/6 and will still be classified as non‑tunnel under this logic.

If either of these should be handled, a small follow‑up tweak around this condition might be needed.

drivers/net/mlx5/mlx5.c (1)

1438-1475: HWS/SWS capability helpers look correct; consider build‑time gating for HWS.

The new helpers cleanly encapsulate runtime checks for HWS (devx && wqe_based_flow_table_sup) and SWS (Rx/Tx/esw SW owner flags), and mlx5_kvargs_is_used() correctly leverages the is_used bitmap from mlx5_kvargs_ctrl.

One thing to double‑check: on builds where HAVE_MLX5_HWS_SUPPORT is not enabled but the FW still reports wqe_based_flow_table_sup, mlx5_hws_is_supported() will still return true and can drive dv_flow_en to 2. If any HWS code paths are compiled out under #ifdef HAVE_MLX5_HWS_SUPPORT, it may be safer to either:

  • gate mlx5_hws_is_supported() behind that macro, or
  • ensure all runtime uses of dv_flow_en == 2 are already fully guarded by the same macro.

If this is already guaranteed elsewhere, current code is fine; otherwise, consider adding such a guard.

drivers/net/mlx5/mlx5_flow_dv.c (1)

10838-10872: flow_dv_translate_item_tag now correctly surfaces register mapping errors

Switching flow_dv_translate_item_tag() to return int and explicitly handling:

  • reg < 0: unexpected mapping failures, now logged and returned as -EINVAL.
  • reg == REG_NON: unsupported registers, now logged and returned as -ENOTSUP.

prevents silent misuse of invalid reg values in flow_dv_match_meta_reg() and aligns with the rest of the error-handling strategy. The behavior for valid mappings is unchanged.

If you want slightly richer diagnostics later, you could propagate a struct rte_flow_error * into this helper and set it directly instead of relying solely on the caller, but this is not required for correctness.

drivers/net/mlx5/mlx5_flow.h (3)

24-25: Proxy-port detection macro looks correct

Expression matches the usual “E-Switch mode && master port” notion of an uplink/proxy; !! normalization to boolean is fine. If you expect this to be used heavily, a static inline bool helper could give type checking, but the macro is acceptable as-is.
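
If the helper route is taken, a minimal sketch could be (the field accesses are an assumption about what the macro expands to):

/* Hypothetical typed equivalent of the proxy-port macro */
static inline bool
mlx5_is_proxy_port(const struct mlx5_priv *priv)
{
	return priv->sh->config.dv_esw_en && priv->master;
}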


1810-1843: META/TAG register selection changes align with expected behavior

  • Lines 1813-1814: Extending the META case to use REG_C_1 when either dv_xmeta_en == MLX5_XMETA_MODE_META32_HWS or mlx5_esw_metadata_passing_enabled(sh) holds keeps the existing behavior and additionally covers the “metadata passing” capability path. Because mlx5_esw_metadata_passing_enabled() already implies dv_esw_en, the extra sh->config.dv_esw_en && is redundant but harmless.
  • Lines 1841-1842: Returning REG_NON when id >= MLX5_FLOW_HW_TAGS_MAX avoids out-of-bounds access on reg->hw_avl_tags[id] and is consistent with how unsupported fields are indicated elsewhere in this function. Callers already treat REG_NON as “unavailable register”, so this is a safe hardening.

These changes look correct and should address the tunnel/metadata corner cases without regressing NIC paths.


3715-3719: IPv6 TC capability helper is minimal and readable

mlx5_dv_modify_ipv6_traffic_class_supported() provides a clear, single-source check for IPv6 TC modify support, which should simplify checks in call sites. If there are configurations where priv->sh->phdev may be NULL, consider adding an MLX5_ASSERT(priv->sh->phdev) here, but otherwise this looks fine.

drivers/net/dpaa2/dpaa2_ethdev.c (1)

634-653: Helper correctly clears active DQS; consider softening the busy‑wait

The helper looks logically correct: it walks all per‑lcore queue storage entries, waits for any outstanding command to complete, then clears the SWP’s global active_dqs and the per‑queue pointer. That should make teardown safer before freeing queue storage.

The while (!qbman_check_command_complete(...)) continue; loop is a hard spin with no pause or timeout, though. If HW ever misbehaves, this could spin indefinitely and burn a full core during device close. Consider adding a small backoff (rte_pause() or similar) and/or a bounded retry with a warning log to avoid pathological hangs during teardown.
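
A sketch of such a bounded wait (the retry limit and warning text are hypothetical):

unsigned int retries = 0;

while (!qbman_check_command_complete(dq)) {
	rte_pause();
	if (++retries > DPAA2_DQS_DRAIN_RETRIES) {	/* hypothetical bound */
		DPAA2_PMD_WARN("DQS drain timed out, continuing teardown");
		break;
	}
}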

drivers/crypto/cnxk/cnxk_cryptodev_ops.h (1)

55-59: cpt_inflight_req layout + static asserts look consistent, but may break debug builds

The new rsvd[16], relocated meta[META_LEN] (Line 58) and added qp field (Line 64) give a clean, cache‑line–sized layout, and the offsetof(meta) == 32 assertion (Line 77) matches the preceding res + rsvd fields. This aligns with using cpt_inflight_req alongside CN20K CQ entries and with the new use of qp in CN20K adapter dequeue paths.

One caveat: PLT_STATIC_ASSERT(sizeof(struct cpt_inflight_req) == ROC_CACHE_LINE_SZ) (Line 76) will fail as soon as CPT_INST_DEBUG_ENABLE is defined, because the debug-only fields below increase the struct size. If you expect to build with CPT_INST_DEBUG_ENABLE enabled, consider wrapping this assert in #ifndef CPT_INST_DEBUG_ENABLE (or adjusting it to tolerate the debug fields) so debug builds continue to compile.
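
A minimal way to express that gating:

/* Keep the strict size check only when the debug fields are absent */
#ifndef CPT_INST_DEBUG_ENABLE
PLT_STATIC_ASSERT(sizeof(struct cpt_inflight_req) == ROC_CACHE_LINE_SZ);
#endif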

Also applies to: 64-64, 76-78

drivers/crypto/cnxk/cn20k_cryptodev_ops.c (1)

263-266: CQ enable + atomic result init for CN20K path are reasonable; consider harmonizing other NOT_DONE inits

Setting inst[0].cq_ena = 1 (Line 264) together with the atomic initialization of infl_req->res.u64[0] to CPT_COMP_NOT_DONE (Lines 265-266) matches how the various dequeue paths poll this word and is consistent with enabling the per‑LF CQ for CN20K elsewhere.

As a minor follow‑up, you might consider switching other initializations of the same result word (e.g., plain infl_req->res.cn20k.compcode = CPT_COMP_NOT_DONE; in the adapter path) to use the same rte_atomic_store_explicit helper for uniform atomic semantics, but the current change is already an improvement.
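
For example, the adapter-path init could mirror the store above (a sketch; the cast form is illustrative):

rte_atomic_store_explicit((RTE_ATOMIC(uint64_t) *)&infl_req->res.u64[0],
			  CPT_COMP_NOT_DONE, rte_memory_order_relaxed);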

drivers/crypto/cnxk/cnxk_cryptodev_ops.c (1)

220-238: Symmetric CQ disable on stop; minor nit on using consistent lf pointer

The stop path mirrors start: it skips NULL LFs (Lines 230-232) and only disables CQ when lf->cpt_cq_ena is set (Lines 235-237). Behaviourally this is correct and symmetric.

As a minor consistency nit, you could pass lf instead of roc_cpt->lf[qp_id] to roc_cpt_iq_disable() (Line 234) since lf has already been looked up and NULL‑checked.

drivers/net/mlx5/mlx5_trigger.c (1)

1229-1232: Centralized errno preservation and cleanup in mlx5_dev_start() looks correct

The SAVE_RTE_ERRNO_AND_STOP(ret, dev) macro plus the new label structure gives a consistent pattern: on each post‑dev_started failure, you (a) stash rte_errno in ret, (b) clear dev_started, then (c) unwind resources in the right order before restoring rte_errno at the error: label. All gotos into the shared cleanup chain now go through this macro, so there are no paths that would leave dev_started == 1 on failure or lose the errno value. Just keep this invariant in mind for any future gotos into these labels.

Also applies to: 1320-1497

drivers/common/cnxk/roc_npa.h (1)

817-820: Clarify first parameter semantics for roc_npa_pool_bp_configure

The header names the first argument pool_id, while the implementation in roc_npa.c uses and treats it as an aura handle (then derives pool_id via roc_npa_aura_handle_to_aura). ABI is fine (types/order match), but this discrepancy is easy to misread.

Consider either:

  • Renaming the parameter in the .c implementation to pool_id and passing a raw pool index, or
  • Renaming it here to aura_handle and documenting that a handle is expected and internally mapped to a pool.

Also, for readability, you may want to mention in a comment that this API is pool‑oriented CN20K multi‑BP configuration to distinguish it from roc_npa_aura_bp_configure.

drivers/common/cnxk/roc_nix_inl.c (1)

831-858: Behavior change in roc_nix_inl_inb_rx_inject_enable for inline‑dev‑only queries

The new guards avoid NULL dereferences, which is good, but they also change semantics:

  • Previously, for inb_inl_dev == true, a non‑NULL idev->nix_inl_dev with attach_cptlf && rx_inj_ena was enough to return true, even when roc_nix was NULL (though the final return roc_nix->rx_inj_ena; would have been UB if reached).
  • Now the inline‑dev branch additionally requires roc_nix && roc_nix->rx_inj_ena, and if roc_nix is NULL this function will always return false, regardless of inl_dev->rx_inj_ena.

Other helpers like roc_nix_inl_inb_sa_base_get() and profile‑ID getters explicitly support roc_nix == NULL when inb_inl_dev == true, so this stricter requirement may be unintended.

If you want to keep the old “inline‑dev only” semantics while fixing the NULL deref, consider something like:

-    if (inl_dev && inl_dev->attach_cptlf && inl_dev->rx_inj_ena && roc_nix &&
-        roc_nix->rx_inj_ena)
-        return true;
+    if (inl_dev && inl_dev->attach_cptlf && inl_dev->rx_inj_ena) {
+        if (!roc_nix)
+            return true; /* inline-dev only case */
+        if (roc_nix->rx_inj_ena)
+            return true;
+    }

Please double‑check call sites to confirm which behavior is desired for inb_inl_dev == true and roc_nix == NULL.

drivers/common/cnxk/roc_nix_tm_utils.c (1)

583-593: Optionally sanity-check TL4 SDP relchan against tx_chan_cnt

For SDP links, relchan is now computed as (nix->tx_chan_base & 0xff) + node->rel_chan, which seems intentional to support per-node relative channels. To harden against misconfiguration (e.g., node->rel_chan >= nix->tx_chan_cnt), consider adding a debug check/log before programming NIX_AF_TL4X_SDP_LINK_CFG, e.g. warning if relchan >= nix->tx_chan_cnt or exceeds the HW width.

This would not change behavior in the valid case but would make TM misconfigurations easier to diagnose.
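
A sketch of such a guard (placement and message are illustrative):

/* Hypothetical debug check before writing NIX_AF_TL4X_SDP_LINK_CFG */
relchan = (nix->tx_chan_base & 0xff) + node->rel_chan;
if (node->rel_chan >= nix->tx_chan_cnt)
	plt_warn("TL4 SDP rel_chan %u >= tx_chan_cnt %u",
		 node->rel_chan, nix->tx_chan_cnt);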

drivers/net/cnxk/cn20k_ethdev.c (1)

193-205: Guard tx completion cleanup in cn20k_nix_tx_queue_release

The added completion-drain logic is correct in intent (drain CQ, free tx_compl.ptr, then delegate to cnxk_nix_tx_queue_release), but it currently assumes txq and the completion context are always valid.

To make this more robust against repeated releases or partial setup failures, consider:

 static void
 cn20k_nix_tx_queue_release(struct rte_eth_dev *eth_dev, uint16_t qid)
 {
 	struct cn20k_eth_txq *txq = eth_dev->data->tx_queues[qid];
 	struct cnxk_eth_dev *dev = cnxk_eth_pmd_priv(eth_dev);
 	struct roc_nix *nix = &dev->nix;
 
-	if (nix->tx_compl_ena) {
+	if (txq && nix->tx_compl_ena && txq->tx_compl.ena && txq->tx_compl.ptr) {
 		/* First process all CQ entries */
 		handle_tx_completion_pkts(txq, 0);
 		plt_free(txq->tx_compl.ptr);
 	}
 	cnxk_nix_tx_queue_release(eth_dev, qid);
 }

This keeps behavior unchanged in the normal case while avoiding potential NULL dereferences.

drivers/net/cnxk/cn10k_ethdev.c (1)

199-211: Guard tx completion cleanup in cn10k_nix_tx_queue_release

Draining TX completions before delegating to cnxk_nix_tx_queue_release() is the right ordering, but the code assumes txq is always valid and its completion context initialized:

struct cn10k_eth_txq *txq = eth_dev->data->tx_queues[qid];
...
if (nix->tx_compl_ena) {
	/* First process all CQ entries */
	handle_tx_completion_pkts(txq, 0);
	plt_free(txq->tx_compl.ptr);
}

To be safer on repeated releases or partial setups, mirror the CNXK helper’s NULL checks and the completion enable flag:

 	struct cn10k_eth_txq *txq = eth_dev->data->tx_queues[qid];
 ...
-	if (nix->tx_compl_ena) {
+	if (txq && nix->tx_compl_ena && txq->tx_compl.ena && txq->tx_compl.ptr) {
 		/* First process all CQ entries */
 		handle_tx_completion_pkts(txq, 0);
 		plt_free(txq->tx_compl.ptr);
 	}
 	cnxk_nix_tx_queue_release(eth_dev, qid);

This preserves the intended behavior while hardening against edge cases.

drivers/common/cnxk/roc_nix_queue.c (3)

1438-1482: SQE‑per‑SQB and SQB slack helpers look sound, but consider minimal validation

  • sqes_per_sqb_calc() correctly derives SQEs per SQB for W16/W8 max SQE sizes and reserves one SQE for the next‑SQB pointer.
  • sq_desc_to_sqb() and sqb_slack_adjust() provide clear separation of concerns (translate descriptors to SQBs, then apply slack/max clamps).

Given these helpers are now central to static and dynamic SQB sizing, you might want a defensive assert to catch impossible configurations (e.g., PLT_ASSERT(sqes_per_sqb);) before divisions, though it’s not strictly required if nix->sqb_size and max_sqe_sz are always sane.


1563-1651: Dynamic SQB pool population is coherent; reclaim path is robust

sqb_pool_dyn_populate() sets up a dynamic SQB pool sized to max_sqb_count (+ slack) and populates only the initial nb_sqb_bufs buffers:

  • Per‑SQ state (sqes_per_sqb, nb_sqb_bufs[_adj], aura_sqb_bufs) is initialized appropriately.
  • Per‑SQB allocations are individually plt_zmalloc’d and freed into the aura, with full reclaim on failure in nomem/npa_fail paths before destroying the pool.
  • Aura limit is set to the initial count and the range is broadened to [0, UINT64_MAX], which is consistent with dynamically allocated SQBs living anywhere in IOVA space.

Given the dynamic nature, this looks correct. Just be aware that roc_npa_aura_op_range_set(..., 0, UINT64_MAX) effectively disables range checking; that’s intentional but worth keeping in mind if future code assumes tighter ranges.


2168-2225: SQ fini: dynamic SQB teardown flow is coherent

  • Restoring the aura limit to sq->aura_sqb_bufs before draining improves the chances of successful aura drain.
  • For resize‑enabled SQs, calling nix_sqb_mem_dyn_free() before roc_npa_pool_destroy() ensures dynamically allocated SQBs are freed back to the host.
  • Only freeing sq->sqe_mem for non‑resize SQs correctly reflects that dynamic SQBs are individually allocated and tracked via the aura.

Only minor nit: you may want to check and log the return value of roc_npa_aura_limit_modify() in case restoring the limit fails, but this is non‑critical.

drivers/common/cnxk/roc_nix.h (1)

32-36: Public API and struct extensions align with implementation changes

  • The new ROC_NIX_LSO_FORMAT_IDX_* macros mirror the kernel’s NIX_LSO_FORMAT_IDX_TSOV4/TSOV6 and reserve IDX_IPV4 for the new IPv4 fragmentation format.
  • struct roc_nix_sq gains sqes_per_sqb, sq_cnt_ptr, and update_sq_cnt, which are used by the new SQB sizing and SQ count update logic in roc_nix_queue.c.
  • struct roc_nix gains sq_resize_ena and use_multi_bpids, matching dynamic SQ resizing and multi‑BPID usage in queue/TM and NPA code.
  • ROC_NIX_TM_SDP and roc_nix_tm_sdp_prepare_tree() are consistent with the SDP TM tree implementation.
  • New LSO and SQ APIs (roc_nix_lso_fmt_ipv4_frag_get, roc_nix_lso_alt_flags_profile_setup, roc_nix_sq_cnt_update, roc_nix_sq_resize) have corresponding definitions and usages elsewhere.

These interface changes look coherent with the rest of the PR. Just be aware that adding fields in public structs (roc_nix, roc_nix_sq) is an ABI change and should align with the project’s versioning/ABI policy.

Also applies to: 475-501, 559-606, 698-705, 875-876, 975-993, 1073-1093

drivers/common/cnxk/hw/nix.h (1)

669-675: HW structs and masks for ALT_FLAGS and CN20K SQ count look internally consistent

  • struct nix_lso_format’s new fields (alt_flags, alt_flags_index, shift) and the associated masks (NIX_LSO_FIELD_SHIFT_MASK, *_ALT_FLG_*, updated NIX_LSO_FIELD_MASK) line up with their bit positions and how they are programmed in nix_lso_ipv4().
  • The ALT_FLAGS format union nix_lso_alt_flg_format_t cleanly encodes the two 64‑bit ALT_FLAGS config registers and is used as such by the ROC layer.
  • CN20K SQ context structs now expose:
    • update_sq_count and seb_count in nix_cn20k_sq_ctx_hw_s.
    • A 46‑bit sq_count_iova and seb_count in nix_cn20k_sq_ctx_s, matching the high/low split fields in the HW view.
  • NIX_LSOALG_ALT_FLAGS and NIX_LSO_FORMAT_IDX_TSOV4/TSOV6 complete the LSO enum and index set in a way that matches the ROC‑level macros.

Only minor observation: alt_flags_index is 2 bits, so only 4 ALT_FLAGS profiles are representable in nix_lso_format. If HW supports more, callers (like nix_lso_ipv4()) should validate that the returned profile index fits in 0–3 before assigning it; otherwise behaviour will silently truncate.
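
A sketch of such a caller-side check (variable names are illustrative):

/* ALT_FLAGS profile index must fit the 2-bit alt_flags_index field */
if (profile_idx < 0 || profile_idx > 3)
	return NIX_ERR_PARAM;	/* would otherwise be silently truncated */
field.alt_flags = 1;
field.alt_flags_index = profile_idx;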

Also applies to: 2509-2553, 2073-2202, 2700-2703

doc/guides/conf.py (1)

107-110: Pydantic mocking logic is fine; consider iterable-unpacking form

The explicit set+concatenation works and correctly guarantees pydantic/pydantic_core are mocked, but Ruff flags this pattern. You can keep semantics and address RUF005 with a slightly cleaner expression:

autodoc_mock_imports = list({*autodoc_mock_imports, "pydantic", "pydantic_core"})

This avoids list concatenation, keeps entries unique, and matches Ruff’s preferred style.

app/test/bpf/filter.c (1)

21-37: TCP/IPv4 filter logic is sound; consider naming header offsets

The filter correctly assumes an Ethernet header of 14 bytes, validates IPv4 via the version nibble, and checks the protocol field for TCP (6), returning 1 only in that case. For readability, you might consider defining named constants for the IP header offset and protocol field offset instead of bare 14 and 14 + 9, but this is purely cosmetic in a small test program.
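
One possible spelling of those constants (the macro names are suggestions, not existing code; the values restate the bare offsets above):

#define ETH_HDR_LEN	14			/* Ethernet header size */
#define IP_HDR_OFFSET	ETH_HDR_LEN		/* start of the IPv4 header */
#define IP_PROTO_OFFSET	(IP_HDR_OFFSET + 9)	/* IPv4 protocol field */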

doc/guides/testpmd_app_ug/testpmd_funcs.rst (1)

1490-1505: Polish wording for new DCB forwarding commands

The semantics are clear, but the English can be tightened a bit to match surrounding documentation style. For example:

-Config DCB forwarding on specify TCs, if bit-n in tc-mask is 1, then TC-n's
-forwarding is enabled, and vice versa::
+Configure DCB forwarding on specified traffic classes (TCs). If bit n in
+tc_mask is 1, then TC n forwarding is enabled; otherwise it is disabled::
@@
-Config DCB forwarding cores per-TC, 1-means one core process all queues of a TC,
-2-means two cores process all queues of a TC, and so on::
+Configure the number of forwarding cores per TC.
+For example, 1 means one core processes all queues of a TC,
+2 means two cores process all queues of a TC, and so on::

This keeps the meaning but improves readability.

app/test/bpf/load.c (1)

11-51: Ensure struct layout stays in sync with test_bpf.c and clarify dummy_func1 call

This ELF BPF program redefines struct dummy_offset/struct dummy_vect8 to “match” the host-side structs in app/test/test_bpf.c, but:

  • The host version of struct dummy_offset uses RTE_ATOMIC(...) fields, while this file uses plain integers plus __attribute__((packed)).
  • Any change to the host structs (field order, padding, atomics) will silently desynchronize the layout assumed by this BPF object.
  • test_call1() calls dummy_func1(arg, &v32, &v64) where arg is a struct dummy_vect8 *, while dummy_func1 conceptually operates on a single struct dummy_offset. Today this works only because dummy_vect8 starts with in[0] and you rely on that aliasing.

To make this more robust and self-documenting:

  • Move the dummy_offset/dummy_vect8 definitions into a small shared header used by both test_bpf.c and this BPF source, so the layouts cannot diverge.
  • Consider passing &arg->in[0] to dummy_func1 instead of arg, which makes the intent explicit and avoids relying on the first-member trick:
-	/* Call external function */
-	dummy_func1(arg, &v32, &v64);
+	/* Call external function on first input element */
+	dummy_func1(&arg->in[0], &v32, &v64);

[recommendation only — behavior is currently consistent as long as the layouts remain identical.]
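
A sketch of such a shared header (the path is hypothetical and the field list restates the layout both sides currently assume):

/* e.g. app/test/bpf/bpf_test_structs.h, included by both sides */
#include <stdint.h>

struct dummy_offset {
	uint64_t u64;
	uint32_t u32;
	uint16_t u16;
	uint8_t  u8;
} __attribute__((packed));

struct dummy_vect8 {
	struct dummy_offset in[8];
	struct dummy_offset out[8];
};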

app/test/bpf/meson.build (1)

4-5: Meson BPF setup looks fine; consider removing unused bpf_test_hdrs

The Meson logic for building BPF objects with clang and converting them to headers via xxd is consistent with typical DPDK patterns, and wiring them into resources makes sense for the tests.

One minor nit: bpf_test_hdrs is initialized but never used; all headers are tracked via resources instead. You can either drop bpf_test_hdrs or, if you intended to expose these headers separately, append to it in the loop:

 bpf_test_hdrs = []
@@
     bpf_test_h = custom_target(bpf_src + '_h',
@@
     resources += bpf_test_h
+    bpf_test_hdrs += bpf_test_h

Or simply remove the bpf_test_hdrs variable to avoid confusion.

Also applies to: 26-53

app/test-pmd/cmdline.c (2)

514-516: Help text is accurate but can be slightly clarified

The new set dcb fwd_tc (tc_mask) entry is wired correctly to the new command, but the description text could be tightened for readability and consistency with the CLI syntax, e.g. use tc_mask instead of tc-mask and “specific TCs” instead of “specify TCs”. This is purely cosmetic and can be deferred.


6230-6328: New DCB forwarding TC and cores commands look correct; consider minor robustness and help improvements

Functionality-wise this block looks sound:

  • Parsing and token wiring for set dcb fwd_tc <tc_mask> and set dcb fwd_tc_cores <tc_cores> are consistent with the rest of the file.
  • tc_mask is validated as non‑zero and printed cleanly by iterating i < RTE_ETH_8_TCS.
  • tc_cores is validated as non‑zero before updating dcb_fwd_tc_cores.

A few non‑blocking points you may want to polish:

  1. Help string mismatch / clarity (Line 6273)
    .help_str starts with “config DCB forwarding…”, while the actual command is set dcb fwd_tc <tc_mask>. Aligning help text with the exact syntax improves discoverability:
  • .help_str = "config DCB forwarding on specify TCs, if bit-n in tc-mask is 1, then TC-n's forwarding is enabled, and vice versa.",
  • .help_str = "set dcb fwd_tc <tc_mask>: configure DCB forwarding on specific TCs; if bit n in tc_mask is 1 then TC n forwarding is enabled.",
Similarly, consider starting the cores help with `set dcb fwd_tc_cores <tc_cores>:`.

  2. Missing help entry for fwd_tc_cores in the long help section
    There is a help line for set dcb fwd_tc (tc_mask) but none for set dcb fwd_tc_cores. Adding a line beside 514–516 would keep the config help complete:

 			"set dcb fwd_tc (tc_mask)\n"
 			"    Set DCB forwarding on specify TCs, if bit-n in tc-mask is 1, then TC-n's forwarding is enabled\n\n"
+			"set dcb fwd_tc_cores (tc_cores)\n"
+			"    Set number of forwarding cores per enabled TC; 1 means a single core processes all queues in a TC\n\n"

  3. Optional: sanity‑check tc_cores against available forwarding cores
    Right now tc_cores is only checked for non‑zero. If dcb_fwd_tc_cores is later used to derive core assignments, you may want to clamp or warn when tc_cores > nb_fwd_lcores to avoid surprising runtime behavior. This can be added later if the usage path justifies it.
app/test-pmd/config.c (2)

5160-5194: dcb_fwd_tc_update_dcb_info() only preserves TC queue mapping; other DCB fields are zeroed.

This helper compresses org_dcb_info down to the masked TCs by rebuilding a fresh dcb_info with only tc_queue and nb_tcs populated, then assigning *org_dcb_info = dcb_info;. That’s fine for the current usage in dcb_fwd_config_setup(), where only tc_queue and nb_tcs are read afterwards.

However, this makes the function fragile for future reuse: any later code that expects other fields in struct rte_eth_dcb_info (e.g., prio_tc, tc_bws) will silently see zeros.

If you want to future‑proof it a bit, consider:

  • Starting from a copy of *org_dcb_info and only rewriting tc_queue and nb_tcs, or
  • Adding a brief comment that this function intentionally only preserves TC queue layout and nothing else, and is not a general “mask dcb_info” helper.

Given it’s static and only used in this DCB fwd path, this is more of a maintainability concern than a correctness bug.
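
A minimal sketch of the copy-based variant (helper name and mask handling are assumptions; only the TC queue mapping is rewritten, so fields such as prio_tc and tc_bws survive via the initial copy):

static void
mask_dcb_info(struct rte_eth_dcb_info *org, uint8_t tc_mask)
{
	struct rte_eth_dcb_info info = *org;	/* preserves all other fields */
	uint8_t nb_tcs = 0;
	uint8_t tc;

	for (tc = 0; tc < org->nb_tcs; tc++) {
		if (!(tc_mask & (1u << tc)))
			continue;
		/* VMDQ pool 0 only, for brevity */
		info.tc_queue.tc_rxq[0][nb_tcs] = org->tc_queue.tc_rxq[0][tc];
		info.tc_queue.tc_txq[0][nb_tcs] = org->tc_queue.tc_txq[0][tc];
		nb_tcs++;
	}
	info.nb_tcs = nb_tcs;
	*org = info;
}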


5215-5339: DCB fwd config multi‑core logic is sound; minor robustness nits.

The new dcb_fwd_config_setup() logic generally looks consistent:

  • Validates per‑TC queue counts via dcb_fwd_check_cores_per_tc() before configuring.
  • Computes total_tc_num based on the mask and enforces nb_fwd_lcores >= total_tc_num * dcb_fwd_tc_cores.
  • Uses sub_core_idx and per‑TC queue partitioning to distribute each TC’s queues across dcb_fwd_tc_cores lcores, in a way that matches the comments and respects VMDQ pools.

A couple of small improvements to consider:

  1. Type/overflow safety for target_lcores:

    target_lcores is lcoreid_t, and you compute it via:

    target_lcores = (lcoreid_t)total_tc_num * (lcoreid_t)dcb_fwd_tc_cores;

    If lcoreid_t is narrower than int, the operands are promoted to int for the multiplication anyway, and the product is then silently truncated when stored back into the lcoreid_t variable before the < nb_fwd_lcores check. Safer to keep the result in a wider type:

-	lcoreid_t lc_id, target_lcores;
+	lcoreid_t lc_id;
+	uint32_t target_lcores;
 	...
-	target_lcores = (lcoreid_t)total_tc_num * (lcoreid_t)dcb_fwd_tc_cores;
+	target_lcores = (uint32_t)total_tc_num * (uint32_t)dcb_fwd_tc_cores;
 	if (nb_fwd_lcores < target_lcores) {
 		...
 	}
-	cur_fwd_config.nb_fwd_lcores = target_lcores;
+	cur_fwd_config.nb_fwd_lcores = (lcoreid_t)target_lcores;
    
    
  2. nb_fwd_streams vs masked TCs:

    When dcb_fwd_tc_mask doesn’t cover all TCs, the code will only populate streams for the masked TCs, but cur_fwd_config.nb_fwd_streams is still set to nb_rxq * nb_fwd_ports. That’s okay because forwarding uses fwd_lcores[lc_id]->stream_idx/stream_nb rather than nb_fwd_streams, but it does make the streams count in pkt_fwd_config_display() potentially misleading. If you want the display to reflect reality, you could accumulate the total number of assigned streams while building the mapping and assign that to cur_fwd_config.nb_fwd_streams.

Overall, the algorithm and indexing (sm_id, k + j) look safe, and the mask + dcb_fwd_tc_cores behaviour matches the new feature description.

drivers/net/mlx5/mlx5_flow_hw.c (2)

2339-2343: Modify-header length validation is good; avoid double action teardown

Adding flow_hw_validate_compiled_modify_field() before creating the DR modify-header action is a solid guard against exceeding max_header_modify_pattern_length, and the new acts->mhdr == NULL check in mlx5_tbl_ensure_shared_modify_header() prevents NULL derefs.

You currently call __flow_hw_action_template_destroy(dev, acts) inside mlx5_tbl_translate_modify_header() on error and again in the caller’s err: path. The destroy helper is idempotent, so this is safe but redundant; consider letting the caller own cleanup to avoid confusion.

Also applies to: 2375-2383


8387-8513: Tunnel detection logic: behaviorally sound, but IPv6 FRAG helper unused

The new tunnel helpers:

  • Read outer protocol from the previous L3 item only when that field is actually masked in the template.
  • Prefer RFC‑compliant classification (IPIP vs IPv6‑in‑IP) based on IPPROTO_IPIP / IPPROTO_IPV6, with a fallback to inner header type when proto isn’t constrained.
  • Enforce a single tunnel layer by checking *item_flags & MLX5_FLOW_LAYER_TUNNEL.
  • Treat IPv6 routing extension (OUTER_IPV6_ROUTING_EXT) as an IPv6 tunnel anchor.

This matches the intent to fix IPIP/IPv6 ENCAP detection and should avoid misclassifying non‑tunnel flows when proto is set.

One nit: the RTE_FLOW_ITEM_TYPE_IPV6_FRAG_EXT case in mlx5_hw_flow_get_next_protocol() is currently not used, since last_l3_item is only ever set to IPv4/IPv6 items in __flow_hw_pattern_validate(). Either wire last_l3_item to the frag-ext item when relevant or drop that branch to avoid dead code.

Also applies to: 8791-8805, 8807-8812, 8982-8985

Comment on lines +5115 to +5143
static int
dcb_fwd_check_cores_per_tc(void)
{
	struct rte_eth_dcb_info dcb_info = {0};
	uint32_t port, tc, vmdq_idx;

	if (dcb_fwd_tc_cores == 1)
		return 0;

	for (port = 0; port < nb_fwd_ports; port++) {
		(void)rte_eth_dev_get_dcb_info(fwd_ports_ids[port], &dcb_info);
		for (tc = 0; tc < dcb_info.nb_tcs; tc++) {
			for (vmdq_idx = 0; vmdq_idx < RTE_ETH_MAX_VMDQ_POOL; vmdq_idx++) {
				if (dcb_info.tc_queue.tc_rxq[vmdq_idx][tc].nb_queue == 0)
					break;
				/* make sure nb_rx_queue can be divisible. */
				if (dcb_info.tc_queue.tc_rxq[vmdq_idx][tc].nb_queue %
				    dcb_fwd_tc_cores)
					return -1;
				/* make sure nb_tx_queue can be divisible. */
				if (dcb_info.tc_queue.tc_txq[vmdq_idx][tc].nb_queue %
				    dcb_fwd_tc_cores)
					return -1;
			}
		}
	}

	return 0;
}

⚠️ Potential issue | 🟠 Major

Guard against dcb_fwd_tc_cores == 0 and align check with dcb_fwd_tc_mask.

Two things to tighten here:

  1. dcb_fwd_tc_cores == 0 will currently hit the modulus operations on Lines 5132 and 5136, which is undefined behaviour. Even if CLI validation is supposed to prevent 0, it’s safer to handle it here:
static int
dcb_fwd_check_cores_per_tc(void)
{
    struct rte_eth_dcb_info dcb_info = {0};
    uint32_t port, tc, vmdq_idx;

-   if (dcb_fwd_tc_cores == 1)
-       return 0;
+   if (dcb_fwd_tc_cores == 0)
+       return -EINVAL;
+   if (dcb_fwd_tc_cores == 1)
+       return 0;
  2. The divisibility check currently applies to all TCs, but the rest of the DCB forwarding logic (including get_fwd_port_total_tc_num() and dcb_fwd_tc_update_dcb_info()) only cares about TCs enabled in dcb_fwd_tc_mask. As written, a port with a “non-divisible” TC that is not selected in the mask will still cause this check to fail unnecessarily. You likely want to ignore masked‑out TCs here:
        for (tc = 0; tc < dcb_info.nb_tcs; tc++) {
+           if (!(dcb_fwd_tc_mask & (1u << tc)))
+               continue;
            for (vmdq_idx = 0; vmdq_idx < RTE_ETH_MAX_VMDQ_POOL; vmdq_idx++) {
                if (dcb_info.tc_queue.tc_rxq[vmdq_idx][tc].nb_queue == 0)
                    break;
                /* make sure nb_rx_queue can be divisible. */
                if (dcb_info.tc_queue.tc_rxq[vmdq_idx][tc].nb_queue %
                    dcb_fwd_tc_cores)
                    return -1;
                /* make sure nb_tx_queue can be divisible. */
                if (dcb_info.tc_queue.tc_txq[vmdq_idx][tc].nb_queue %
                    dcb_fwd_tc_cores)
                    return -1;
            }
        }

This keeps the pre‑check consistent with the mask-driven behaviour of the rest of the DCB forwarding path.

🤖 Prompt for AI Agents
In app/test-pmd/config.c around lines 5115 to 5143, guard against
dcb_fwd_tc_cores being zero and only perform the divisibility checks for traffic
classes enabled in dcb_fwd_tc_mask: first if dcb_fwd_tc_cores == 0 return -1 (or
appropriate error) to avoid division/modulus by zero, then inside the loop skip
any TC where the corresponding bit in dcb_fwd_tc_mask is not set so you only
validate enabled TCs, keeping the pre-check consistent with the rest of the DCB
forwarding logic.

Comment on lines +3468 to +3524
static unsigned int
setup_mbufs(struct rte_mbuf *burst[], unsigned int n)
{
	struct rte_ether_hdr eh = {
		.ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4),
	};
	const struct rte_ipv4_hdr iph = {
		.version_ihl = RTE_IPV4_VHL_DEF,
		.total_length = rte_cpu_to_be_16(BPF_TEST_PKT_LEN - sizeof(eh)),
		.time_to_live = IPDEFTTL,
		.src_addr = rte_cpu_to_be_32(ip_src_addr),
		.dst_addr = rte_cpu_to_be_32(ip_dst_addr),
	};
	unsigned int tcp_count = 0;

	rte_eth_random_addr(eh.dst_addr.addr_bytes);

	for (unsigned int i = 0; i < n; i++) {
		struct rte_mbuf *mb = burst[i];

		/* Setup Ethernet header */
		*rte_pktmbuf_mtod(mb, struct rte_ether_hdr *) = eh;

		/* Setup IP header */
		struct rte_ipv4_hdr *ip
			= rte_pktmbuf_mtod_offset(mb, struct rte_ipv4_hdr *, sizeof(eh));
		*ip = iph;

		if (rte_rand() & 1) {
			struct rte_udp_hdr *udp
				= rte_pktmbuf_mtod_offset(mb, struct rte_udp_hdr *,
							  sizeof(eh) + sizeof(iph));

			ip->next_proto_id = IPPROTO_UDP;
			*udp = (struct rte_udp_hdr) {
				.src_port = rte_cpu_to_be_16(9), /* discard */
				.dst_port = rte_cpu_to_be_16(9), /* discard */
				.dgram_len = BPF_TEST_PKT_LEN - sizeof(eh) - sizeof(iph),
			};

		} else {
			struct rte_tcp_hdr *tcp
				= rte_pktmbuf_mtod_offset(mb, struct rte_tcp_hdr *,
							  sizeof(eh) + sizeof(iph));

			ip->next_proto_id = IPPROTO_TCP;
			*tcp = (struct rte_tcp_hdr) {
				.src_port = rte_cpu_to_be_16(9), /* discard */
				.dst_port = rte_cpu_to_be_16(9), /* discard */
				.tcp_flags = RTE_TCP_RST_FLAG,
			};
			++tcp_count;
		}
	}

	return tcp_count;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Initialize mbuf lengths in setup_mbufs

setup_mbufs() writes Ethernet/IP/UDP/TCP headers into the mbuf data area but never updates mb->data_len/mb->pkt_len. After rte_pktmbuf_alloc_bulk(), these are zero, so you end up transmitting zero‑length packets, which is incorrect even if the null PMD ignores it.

A minimal fix is to set the packet length after populating headers:

 static unsigned int
 setup_mbufs(struct rte_mbuf *burst[], unsigned int n)
 {
@@
 	for (unsigned int i = 0; i < n; i++) {
 		struct rte_mbuf *mb = burst[i];
@@
 		if (rte_rand() & 1) {
 			struct rte_udp_hdr *udp
@@
 			++tcp_count;
 		}
+
+		/* Full packet length: L2 + L3 + L4 (no payload). */
+		mb->data_len = BPF_TEST_PKT_LEN;
+		mb->pkt_len = BPF_TEST_PKT_LEN;
 	}

(If you prefer, you can instead use rte_pktmbuf_append(mb, BPF_TEST_PKT_LEN) and write via that pointer.)



🏁 Script executed:

#!/bin/bash
# Verify the issue by examining setup_mbufs usage and related code
set -e

# 1. Search for calls to setup_mbufs
echo "=== Calls to setup_mbufs ==="
rg -n "setup_mbufs" --type=c

# 2. Find BPF_TEST_PKT_LEN definition
echo -e "\n=== BPF_TEST_PKT_LEN definition ==="
rg -n "BPF_TEST_PKT_LEN" --type=c

# 3. Check for rte_pktmbuf_alloc_bulk calls in test_bpf.c
echo -e "\n=== rte_pktmbuf_alloc_bulk calls ==="
rg -n "rte_pktmbuf_alloc_bulk" --type=c -A 3 -B 3

# 4. Check if data_len or pkt_len are set anywhere in setup_mbufs or nearby
echo -e "\n=== Search for data_len/pkt_len modifications ==="
rg -n "data_len|pkt_len" app/test/test_bpf.c

Length of output: 33045


🌐 Web query:

DPDK mbuf data_len pkt_len initialization behavior after rte_pktmbuf_alloc_bulk

💡 Result:

Short answer: rte_pktmbuf_alloc_bulk returns mbufs with segment data_len = 0 and packet pkt_len = 0 (and data_off initialized to headroom), and other mbuf fields reset to their pktmbuf defaults (refcnt = 1, nb_segs = 1, next = NULL). In contrast rte_mbuf_raw_alloc[_bulk] returns uninitialized mbufs (only buf_addr/buf_len/refcnt/nb_segs/next/pool/priv_size are set) and the caller must initialize data_len/pkt_len. [1][2]

Sources:

  • DPDK rte_pktmbuf_alloc_bulk / rte_pktmbuf_alloc behavior: “initialize … and reset the fields to default values” (pkt/segment length zero, headroom set). [1]
  • DPDK rte_mbuf_raw_alloc/_alloc_bulk behavior: returns uninitialized mbufs — caller must initialize other fields (data_len/pkt_len etc.). [2]

Set mbuf lengths in setup_mbufs function

The issue is confirmed. After rte_pktmbuf_alloc_bulk(), mbufs are allocated with data_len and pkt_len initialized to 0. The setup_mbufs() function writes Ethernet/IP/UDP/TCP headers into the mbuf data area but never updates these fields, leaving them at zero. This causes packets to appear empty despite containing valid header data, breaking test correctness.

Add the following before the loop closes (after the increment of tcp_count):

+		mb->data_len = BPF_TEST_PKT_LEN;
+		mb->pkt_len = BPF_TEST_PKT_LEN;

This sets the packet length to 64 bytes, matching the full header size (Ether + IP + L4).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In app/test/test_bpf.c around lines 3468 to 3524, after writing the
Ethernet/IP/L4 headers for each mbuf inside the loop, set the mbuf length fields
so the packet is not treated as empty: assign mb->data_len = mb->pkt_len =
BPF_TEST_PKT_LEN (or 64 if you prefer the literal) for each mbuf before
finishing the loop iteration.

Comment on lines +3662 to +3725
test_bpf_elf_rx_load(void)
{
	static const char null_dev[] = "net_null_bpf0";
	struct rte_mempool *pool = NULL;
	char *tmpfile = NULL;
	uint16_t port;
	int ret;

	printf("%s start\n", __func__);

	/* Make a pool for packets */
	pool = rte_pktmbuf_pool_create("bpf_rx_test_pool", 2 * BPF_TEST_BURST,
				       0, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
				       SOCKET_ID_ANY);
	TEST_ASSERT(pool != NULL, "failed to create mempool");

	ret = null_vdev_setup(null_dev, &port, pool);
	if (ret != 0)
		goto fail;

	/* Create temp file from embedded BPF object */
	tmpfile = create_temp_bpf_file(app_test_bpf_filter_o, app_test_bpf_filter_o_len, "rx");
	if (tmpfile == NULL)
		goto fail;

	/* Do test with VM */
	ret = bpf_rx_test(port, tmpfile, pool, "drop", 0, 0);
	if (ret != 0)
		goto fail;

	/* Repeat with JIT */
	ret = bpf_rx_test(port, tmpfile, pool, "drop", RTE_BPF_ETH_F_JIT, 0);
	if (ret != 0)
		goto fail;

	/* Repeat with allow all */
	ret = bpf_rx_test(port, tmpfile, pool, "allow", 0, BPF_TEST_BURST);
	if (ret != 0)
		goto fail;

	/* Repeat with JIT */
	ret = bpf_rx_test(port, tmpfile, pool, "allow", RTE_BPF_ETH_F_JIT, BPF_TEST_BURST);
	if (ret != 0)
		goto fail;

	printf("%s: RX ELF load test passed\n", __func__);

	/* The filter should free the mbufs */
	unsigned int avail = rte_mempool_avail_count(pool);
	TEST_ASSERT_EQUAL(avail, BPF_TEST_POOLSIZE,
			  "Mempool available %u != %u leaks?", avail, BPF_TEST_POOLSIZE);

fail:
	if (tmpfile) {
		unlink(tmpfile);
		free(tmpfile);
	}

	if (port != UINT16_MAX)
		rte_vdev_uninit(null_dev);

	rte_mempool_free(pool);

	return ret == 0 ? TEST_SUCCESS : TEST_FAILED;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Uninitialized port in test_bpf_elf_rx_load cleanup

test_bpf_elf_rx_load() declares uint16_t port; without initialization, but in the fail: path you check if (port != UINT16_MAX) before calling rte_vdev_uninit(null_dev);. If null_vdev_setup() fails before assigning port, this comparison reads an indeterminate value (undefined behavior).

Initialize port as in the TX test:

 static int
 test_bpf_elf_rx_load(void)
 {
 	static const char null_dev[] = "net_null_bpf0";
 	struct rte_mempool *pool = NULL;
 	char *tmpfile = NULL;
-	uint16_t port;
+	uint16_t port = UINT16_MAX;

🏁 Script executed:

#!/bin/bash
# Check current state of test_bpf_elf_rx_load function and find test_bpf_elf_tx_load
rg -n "test_bpf_elf_rx_load|test_bpf_elf_tx_load" app/test/test_bpf.c -A 15 | head -80

Length of output: 1364


🏁 Script executed:

#!/bin/bash
# Search for null_vdev_setup function definition to understand port assignment
rg -n "null_vdev_setup" app/test/test_bpf.c -B 2 -A 15

Length of output: 1858


🏁 Script executed:

#!/bin/bash
# Verify the exact initialization patterns in both test functions
rg -n "uint16_t port" app/test/test_bpf.c -B 2 -A 2

Length of output: 617


Initialize port to UINT16_MAX in test_bpf_elf_rx_load

The RX test declares uint16_t port; without initialization at line 3667, but the cleanup path checks if (port != UINT16_MAX) before calling rte_vdev_uninit(). If null_vdev_setup() fails before assigning port, this reads an indeterminate value (undefined behavior). The TX test correctly initializes port = UINT16_MAX at line 3574; apply the same fix to the RX test.

 static int
 test_bpf_elf_rx_load(void)
 {
 	static const char null_dev[] = "net_null_bpf0";
 	struct rte_mempool *pool = NULL;
 	char *tmpfile = NULL;
-	uint16_t port;
+	uint16_t port = UINT16_MAX;
 	int ret;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 test_bpf_elf_rx_load(void)
 {
 	static const char null_dev[] = "net_null_bpf0";
 	struct rte_mempool *pool = NULL;
 	char *tmpfile = NULL;
-	uint16_t port;
+	uint16_t port = UINT16_MAX;
 	int ret;
 ...
 }
🤖 Prompt for AI Agents
In app/test/test_bpf.c around lines 3662 to 3725, the local variable "uint16_t
port;" is used in the cleanup check "if (port != UINT16_MAX)" but is not
initialized, causing undefined behavior if null_vdev_setup() fails before
assignment; initialize port to UINT16_MAX at declaration (same as the TX test)
so the cleanup check is safe and rte_vdev_uninit() is only called when port was
successfully assigned.

};
struct rte_flow_error error;
struct rte_flow *flow = flow = rte_flow_create(port_id, &attr, pattern, actions, &error);

⚠️ Potential issue | 🟡 Minor

Fix double assignment in C code example.

Line 844 contains a double assignment: struct rte_flow *flow = flow = rte_flow_create(...). Even though an assignment expression is technically legal as an initializer, assigning the variable twice in its own declaration is clearly a copy-paste error and will confuse readers. The declaration should not include the intermediate flow =.

Apply this diff to fix the syntax error:

-   struct rte_flow *flow = flow = rte_flow_create(port_id, &attr, pattern, actions, &error);
+   struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &error);
🤖 Prompt for AI Agents
In doc/guides/nics/ice.rst around line 844, the C example contains a double
assignment "struct rte_flow *flow = flow = rte_flow_create(...)"; change it to
declare and assign once by removing the extra "flow =" so it reads as a single
declaration with the call result assigned to flow.

Comment on lines +592 to 629
 fslmc_bus_plug(struct rte_device *rte_dev)
 {
-	/* No operation is performed while plugging the device */
-	return 0;
+	int ret = 0;
+	struct rte_dpaa2_device *dev = container_of(rte_dev,
+		struct rte_dpaa2_device, device);
+	struct rte_dpaa2_driver *drv;
+
+	TAILQ_FOREACH(drv, &rte_fslmc_bus.driver_list, next) {
+		ret = rte_fslmc_match(drv, dev);
+		if (ret)
+			continue;
+
+		if (!drv->probe)
+			continue;
+
+		if (rte_dev_is_probed(&dev->device))
+			continue;
+
+		if (dev->device.devargs &&
+		    dev->device.devargs->policy == RTE_DEV_BLOCKED) {
+			DPAA2_BUS_DEBUG("%s Blocked, skipping",
+				dev->device.name);
+			continue;
+		}
+
+		ret = drv->probe(drv, dev);
+		if (ret) {
+			DPAA2_BUS_ERR("Unable to probe");
+		} else {
+			dev->driver = drv;
+			dev->device.driver = &drv->driver;
+			DPAA2_BUS_INFO("%s Plugged", dev->device.name);
+		}
+		break;
+	}
+
+	return ret;
 }

⚠️ Potential issue | 🟠 Major

Verify hot-plug behavior aligns with scan mode policy.

The fslmc_bus_plug implementation closely mirrors rte_fslmc_probe (lines 462-494) but omits the allowlist check. In rte_fslmc_probe, devices are only probed if either probe_all is true OR the device has an explicit RTE_DEV_ALLOWED policy (lines 481-483). However, fslmc_bus_plug probes any device that is not blocked and not already probed, potentially violating the scan mode configuration.

Additionally, there is significant code duplication between these two functions. The matching, blocking, and probing logic is nearly identical.

Consider:

  1. Adding the allowlist check to maintain policy consistency:
+		if (dev->device.devargs &&
+		    dev->device.devargs->policy == RTE_DEV_BLOCKED) {
+			DPAA2_BUS_DEBUG("%s Blocked, skipping",
+				      dev->device.name);
+			continue;
+		}
+
+		/* Only probe if explicitly allowed during hot-plug */
+		if (!dev->device.devargs ||
+		    dev->device.devargs->policy != RTE_DEV_ALLOWED) {
+			continue;
+		}
  2. Extracting the common probe logic into a helper function like fslmc_probe_device(dev, drv) to eliminate duplication.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 		if (dev->device.devargs &&
 		    dev->device.devargs->policy == RTE_DEV_BLOCKED) {
 			DPAA2_BUS_DEBUG("%s Blocked, skipping",
 				dev->device.name);
 			continue;
 		}
 
+		/* Only probe if explicitly allowed during hot-plug */
+		if (!dev->device.devargs ||
+		    dev->device.devargs->policy != RTE_DEV_ALLOWED) {
+			continue;
+		}
+
 		ret = drv->probe(drv, dev);
 ...
🤖 Prompt for AI Agents
In drivers/bus/fslmc/fslmc_bus.c around lines 592 to 629, hot-plug probing
ignores the scan-mode allowlist check and duplicates logic from rte_fslmc_probe;
update fslmc_bus_plug to enforce the same policy as rte_fslmc_probe by skipping
probes unless probe_all is true or dev->device.devargs->policy ==
RTE_DEV_ALLOWED, and refactor the common match/block/probe sequence into a
shared helper (e.g., fslmc_probe_device(dev, drv)) that performs the
rte_fslmc_match call, devargs policy checks (blocked/allowed), rte_dev_is_probed
check, calls drv->probe and sets dev->driver and dev->device.driver on success,
then have both rte_fslmc_probe and fslmc_bus_plug call that helper to remove
duplication.

Comment on lines 667 to +694
	for (i = 0; i < priv->nb_rx_queues; i++) {
		dpaa2_q = priv->rx_vq[i];
		dpaa2_clear_queue_active_dps(dpaa2_q,
			RTE_MAX_LCORE);
		dpaa2_queue_storage_free(dpaa2_q,
			RTE_MAX_LCORE);
		priv->rx_vq[i] = NULL;
	}
	/* cleanup tx queue cscn */
	for (i = 0; i < priv->nb_tx_queues; i++) {
		dpaa2_q = priv->tx_vq[i];
		rte_free(dpaa2_q->cscn);
		priv->tx_vq[i] = NULL;
	}
	if (priv->flags & DPAA2_TX_CONF_ENABLE) {
		/* cleanup tx conf queue storage */
		for (i = 0; i < priv->nb_tx_queues; i++) {
			dpaa2_q = priv->tx_conf_vq[i];
			dpaa2_queue_storage_free(dpaa2_q,
				RTE_MAX_LCORE);
			priv->tx_conf_vq[i] = NULL;
		}
	}
	if (priv->flags & DPAAX_RX_ERROR_QUEUE_FLAG) {
		dpaa2_q = priv->rx_err_vq;
		dpaa2_queue_storage_free(dpaa2_q, RTE_MAX_LCORE);
	}


⚠️ Potential issue | 🟠 Major

Free path should also release rx_err_vq and (optionally) clear its active DQS

The added cleanup for RX/TX queues is good: you now clear active DQS before freeing RX queue storage and NULL out rx_vq[i], tx_vq[i], and tx_conf_vq[i] entries.

Two issues around the error RX queue remain:

  1. rx_err_vq is allocated with rte_zmalloc("dpni_rx_err", sizeof(struct dpaa2_queue), ...) in dpaa2_alloc_rx_tx_queues, but here you only call dpaa2_queue_storage_free() on it and never rte_free() the struct dpaa2_queue itself or NULL the pointer. That struct will leak for devices using the error queue.
  2. Because this block is nested under if (priv->rx_vq[0]), the error‑queue struct is not freed in the allocation‑failure path where priv->rx_vq[0] has been set to NULL but rx_err_vq was already allocated and its storage freed in the fail path.

You can address both by unconditionally cleaning up rx_err_vq outside the rx_vq[0] block, and fully releasing the struct as well. For example:

 static void
 dpaa2_free_rx_tx_queues(struct rte_eth_dev *dev)
 {
@@
-	/* Queue allocation base */
-	if (priv->rx_vq[0]) {
+	/* Queue allocation base */
+	if (priv->rx_vq[0]) {
@@
 			dpaa2_queue_storage_free(dpaa2_q,
 				RTE_MAX_LCORE);
 			priv->rx_vq[i] = NULL;
@@
 			rte_free(dpaa2_q->cscn);
 			priv->tx_vq[i] = NULL;
@@
 				dpaa2_queue_storage_free(dpaa2_q,
 					RTE_MAX_LCORE);
 				priv->tx_conf_vq[i] = NULL;
 			}
 		}
-		if (priv->flags & DPAAX_RX_ERROR_QUEUE_FLAG) {
-			dpaa2_q = priv->rx_err_vq;
-			dpaa2_queue_storage_free(dpaa2_q, RTE_MAX_LCORE);
-		}
-
 		/*free memory for all queues (RX+TX) */
 		rte_free(priv->rx_vq[0]);
 		priv->rx_vq[0] = NULL;
 	}
+
+	if (priv->flags & DPAAX_RX_ERROR_QUEUE_FLAG && priv->rx_err_vq) {
+		dpaa2_clear_queue_active_dps(priv->rx_err_vq, RTE_MAX_LCORE);
+		dpaa2_queue_storage_free(priv->rx_err_vq, RTE_MAX_LCORE);
+		rte_free(priv->rx_err_vq);
+		priv->rx_err_vq = NULL;
+	}
 }

This ensures the error queue’s per‑lcore state and struct are always reclaimed, including when queue allocation failed earlier.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In drivers/net/dpaa2/dpaa2_ethdev.c around lines 667 to 694, the rx_err_vq
allocated in dpaa2_alloc_rx_tx_queues is only having its per-lcore storage freed
but the dpaa2_queue struct itself is never rte_free()'d nor its pointer NULLed,
and because current cleanup is nested under the rx_vq[0] block it can leak when
rx_vq[0] is cleared on earlier failure; move unconditional cleanup of
priv->rx_err_vq outside the rx_vq[0] conditional, call
dpaa2_clear_queue_active_dps(priv->rx_err_vq, RTE_MAX_LCORE) (optional but
recommended) then dpaa2_queue_storage_free(priv->rx_err_vq, RTE_MAX_LCORE),
followed by rte_free(priv->rx_err_vq) and set priv->rx_err_vq = NULL so the
struct and pointer are always reclaimed even on partial allocation failure.

Comment on lines 3379 to 3398
 	struct rte_eth_dev *eth_dev;
-	int ret;
+	int ret = 0;
 
+	eth_dev = rte_eth_dev_allocated(dpaa2_dev->device.name);
+	if (eth_dev) {
+		ret = dpaa2_dev_close(eth_dev);
+		if (ret)
+			DPAA2_PMD_ERR("dpaa2_dev_close ret= %d", ret);
+
+		ret = rte_eth_dev_release_port(eth_dev);
+	}
+
-	eth_dev = dpaa2_dev->eth_dev;
-	dpaa2_dev_close(eth_dev);
 	dpaa2_valid_dev--;
-	if (!dpaa2_valid_dev)
+	if (!dpaa2_valid_dev) {
 		rte_mempool_free(dpaa2_tx_sg_pool);
-	ret = rte_eth_dev_release_port(eth_dev);
+		dpaa2_tx_sg_pool = NULL;
+	}
 
 	return ret;
 }

⚠️ Potential issue | 🟠 Major

Prevent dpaa2_valid_dev underflow and only refcount when an eth_dev exists

In rte_dpaa2_remove, dpaa2_valid_dev is decremented unconditionally, even when rte_eth_dev_allocated() returns NULL. If a probe failed earlier and no rte_eth_dev was ever registered for this dpaa2_device, you’ll still decrement from 0 to −1. That desynchronizes the reference count from the actual number of live ports and can prevent dpaa2_tx_sg_pool from being freed when the last successfully‑probed device is removed.

You can tighten the logic by only adjusting the refcount (and freeing the pool) when an eth_dev was found:

 static int
 rte_dpaa2_remove(struct rte_dpaa2_device *dpaa2_dev)
 {
-	struct rte_eth_dev *eth_dev;
-	int ret = 0;
-
-	eth_dev = rte_eth_dev_allocated(dpaa2_dev->device.name);
-	if (eth_dev) {
-		ret = dpaa2_dev_close(eth_dev);
-		if (ret)
-			DPAA2_PMD_ERR("dpaa2_dev_close ret= %d", ret);
-
-		ret = rte_eth_dev_release_port(eth_dev);
-	}
-
-	dpaa2_valid_dev--;
-	if (!dpaa2_valid_dev) {
-		rte_mempool_free(dpaa2_tx_sg_pool);
-		dpaa2_tx_sg_pool = NULL;
-	}
-
-	return ret;
+	struct rte_eth_dev *eth_dev;
+	int ret = 0;
+
+	eth_dev = rte_eth_dev_allocated(dpaa2_dev->device.name);
+	if (eth_dev) {
+		ret = dpaa2_dev_close(eth_dev);
+		if (ret)
+			DPAA2_PMD_ERR("dpaa2_dev_close ret= %d", ret);
+
+		ret = rte_eth_dev_release_port(eth_dev);
+
+		dpaa2_valid_dev--;
+		if (dpaa2_valid_dev == 0 && dpaa2_tx_sg_pool) {
+			rte_mempool_free(dpaa2_tx_sg_pool);
+			dpaa2_tx_sg_pool = NULL;
+		}
+	}
+
+	return ret;
 }

This keeps the global device count consistent with the actual set of allocated rte_eth_dev instances and avoids leaking or prematurely freeing the shared TX SG mempool.

🤖 Prompt for AI Agents
In drivers/net/dpaa2/dpaa2_ethdev.c around lines 3379 to 3398, dpaa2_valid_dev
is decremented unconditionally even when rte_eth_dev_allocated() returns NULL;
change the logic so the refcount and the pool-freeing happen only when eth_dev
is non-NULL: move the dpaa2_valid_dev-- and the check/free of dpaa2_tx_sg_pool
inside the if (eth_dev) block (so they run only when an eth_dev was found and
release_port was attempted), and leave ret as-is (0) when no eth_dev exists to
avoid underflow and keep the global count consistent with actual allocated
ports.

Comment on lines +1180 to 1182
	if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_PTP &&
	    vf->ptp_caps & VIRTCHNL_1588_PTP_CAP_RX_TSTAMP)
		dev_info->rx_offload_capa |= RTE_ETH_RX_OFFLOAD_TIMESTAMP;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Stricter RX timestamp capability check looks right; please confirm ptp_caps init timing

Requiring both VIRTCHNL_VF_CAP_PTP and vf->ptp_caps & VIRTCHNL_1588_PTP_CAP_RX_TSTAMP ensures we only advertise RTE_ETH_RX_OFFLOAD_TIMESTAMP when the PF explicitly reports RX timestamp support, which is a sensible tightening.

One subtle point to double‑check: rte_eth_dev_info_get() can be called before iavf_get_ptp_cap() runs. If vf->ptp_caps is still zero in that phase (or if older PFs don’t populate it), RX timestamp offload might no longer be advertised even though the VF can timestamp packets. If that’s possible, either initialising ptp_caps earlier or falling back to VF_CAP_PTP when ptp_caps == 0 may be needed.


🏁 Script executed:

#!/bin/bash
# Verify where and when vf->ptp_caps is set relative to dev_info_get().

# Show all references to ptp_caps in the iavf driver
rg -n "ptp_caps" drivers/net/intel/iavf -C3

# Also list calls to iavf_get_ptp_cap to see if it’s only invoked from dev_start
rg -n "iavf_get_ptp_cap" drivers/net/intel/iavf -C3

Length of output: 5791


🏁 Script executed:

#!/bin/bash
# Find function names and full context around the two key locations

# Context around line 1023 (iavf_get_ptp_cap call)
echo "=== Context around iavf_get_ptp_cap call (line 1023) ==="
sed -n '1000,1040p' drivers/net/intel/iavf/iavf_ethdev.c | head -50

echo ""
echo "=== Context around dev_info check (line 1180) ==="
sed -n '1160,1200p' drivers/net/intel/iavf/iavf_ethdev.c

echo ""
echo "=== Find where ptp_caps might be initialized ==="
rg -n "ptp_caps\s*=" drivers/net/intel/iavf -C2

Length of output: 2985



🏁 Script executed:

#!/bin/bash
# Find exact function names

echo "=== Function containing line 1023 ==="
awk 'NR<=1023 {if (/^[a-zA-Z_].*\(/) last=$0} NR==1023 {print last; exit}' drivers/net/intel/iavf/iavf_ethdev.c

echo ""
echo "=== Function containing line 1180 ==="
awk 'NR<=1180 {if (/^[a-zA-Z_].*\(/) last=$0} NR==1180 {print last; exit}' drivers/net/intel/iavf/iavf_ethdev.c

echo ""
echo "=== Search for function signatures ==="
rg -n "^[a-z_]+.*dev_start|^[a-z_]+.*dev_info_get|^[a-z_]+.*iavf_dev_start" drivers/net/intel/iavf/iavf_ethdev.c | head -10

echo ""
echo "=== Check if vf structure is memset/zeroed ==="
rg -n "memset.*vf\)|calloc.*iavf_info\)|malloc.*iavf_info" drivers/net/intel/iavf -C2 | head -30

echo ""
echo "=== Check where vf struct is allocated ==="
rg -n "IAVF_DEV_PRIVATE_TO_VF|iavf_info.*alloc" drivers/net/intel/iavf -C2 | head -20

Length of output: 1959



🏁 Script executed:

#!/bin/bash
# Find device private allocation

echo "=== Search for dev_private allocation ==="
rg -n "dev_private.*calloc|dev_private.*malloc|rte_zmalloc.*dev_private|rte_malloc.*dev_private" drivers/net/intel/iavf -C3

echo ""
echo "=== Check probe/init function ==="
rg -n "eth_dev_alloc|eth_dev_pci_allocate|rte_eth_dev_allocate" drivers/net/intel/iavf -C3 | head -30

echo ""
echo "=== Search for device initialization sequence ==="
rg -n "static struct eth_dev_ops|.dev_info_get\s*=|.dev_start\s*=" drivers/net/intel/iavf/iavf_ethdev.c -C1 | head -20

Length of output: 283



🏁 Script executed:

#!/bin/bash
# Find probe/attach function and device allocation

echo "=== Search for eth_dev allocation in probe ==="
rg -n "eth_dev_pci_probe|eth_dev_pci_generic_probe|rte_eth_dev_pci_allocate" drivers/net/intel/iavf -C5 | head -40

echo ""
echo "=== Check for secondary/primary function ==="
rg -n "secondary_init|iavf_adapter_init|iavf_init_vf" drivers/net/intel/iavf/iavf_ethdev.c -C3 | head -30

echo ""
echo "=== Look for struct initialization in probe context ==="
rg -n "pci_probe|pci_attach|dev_probe" drivers/net/intel/iavf/iavf_ethdev.c -A20 | head -50

echo ""
echo "=== Check if ptp_caps has default/fallback behavior ==="
rg -n "vf->ptp_caps\s*=\s*0|ptp_caps.*init|ptp_caps.*reset" drivers/net/intel/iavf -C2

Length of output: 2757



🏁 Script executed:

#!/bin/bash
# Verify the device private allocation uses calloc (zero-init)

echo "=== Check DPDK rte_eth_dev_pci_generic_probe behavior ==="
echo "Documentation: rte_eth_dev_pci_generic_probe typically uses rte_zmalloc or calloc"
echo "This means dev_private is zero-initialized"

echo ""
echo "=== Verify iavf_init_vf context (called during probe) ==="
sed -n '2535,2700p' drivers/net/intel/iavf/iavf_ethdev.c | head -100

echo ""
echo "=== Check if dev_info_get is ever called before dev_start in DPDK ==="
echo "Standard DPDK flow: probe -> configure -> (can call dev_info_get) -> start"
echo "dev_info_get() is a query that can be called anytime for device capabilities"

echo ""
echo "=== Look for what changed in the condition (git diff or comments) ==="
rg -n "VIRTCHNL_VF_CAP_PTP|VF_CAP_PTP" drivers/net/intel/iavf/iavf_ethdev.c -B2 -A2

Length of output: 3699


RX timestamp capability check has a real ordering issue—ptp_caps may be zero when dev_info_get() is called

The verification confirms the original concern is valid. Here's the problem:

  1. vf->ptp_caps is zero-initialized when the device is probed (via rte_eth_dev_pci_generic_probe)
  2. iavf_get_ptp_cap() is called only in iavf_dev_start() (line 1023), which sets vf->ptp_caps
  3. In DPDK, dev_info_get() can be called before dev_start() (during configure phase)
  4. The new condition at lines 1180–1182 requires both flags to be set; if ptp_caps is still 0, the condition fails and RTE_ETH_RX_OFFLOAD_TIMESTAMP is not advertised, even though the VF supports it

Fix: Either query ptp_caps during probe/init (not just dev_start), or add fallback logic to dev_info_get() to handle the case where ptp_caps == 0 but VIRTCHNL_VF_CAP_PTP is set.
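
If the fallback route is taken, the change can stay local to dev_info_get(). A minimal sketch (unverified), assuming a zero vf->ptp_caps only ever means "not yet queried" and that the iavf_get_ptp_cap() call in dev_start() later refines the answer:

 	if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_PTP &&
-	    vf->ptp_caps & VIRTCHNL_1588_PTP_CAP_RX_TSTAMP)
+	    (vf->ptp_caps == 0 ||	/* PF not queried yet: fall back to coarse cap */
+	     vf->ptp_caps & VIRTCHNL_1588_PTP_CAP_RX_TSTAMP))
 		dev_info->rx_offload_capa |= RTE_ETH_RX_OFFLOAD_TIMESTAMP;

Querying the capability during probe/init remains the cleaner option, since dev_info would then always reflect what the PF actually reported.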

🤖 Prompt for AI Agents
In drivers/net/intel/iavf/iavf_ethdev.c around lines 1180–1182, dev_info_get()
currently checks both vf->vf_res->vf_cap_flags and vf->ptp_caps to decide
whether to advertise RTE_ETH_RX_OFFLOAD_TIMESTAMP, but vf->ptp_caps may still be
zero during probe/configure (dev_info_get can be called before dev_start),
causing missed advertisement; fix by either populating vf->ptp_caps earlier
(call the existing iavf_get_ptp_cap() during probe/init so ptp_caps is set
before dev_info_get is called) or add fallback logic in dev_info_get(): if
vf_cap_flags has VIRTCHNL_VF_CAP_PTP and vf->ptp_caps is zero, attempt to
query/set ptp_caps (via the same helper) or conservatively treat the VF as
supporting RX timestamps (set RTE_ETH_RX_OFFLOAD_TIMESTAMP) only when confirmed,
ensuring the code avoids false positives and keeps behavior consistent with how
capabilities are determined elsewhere.

Comment on lines +114 to +116
    tg = get_ctx().func_tg
    if tg:
        tg.send_packets(packets, get_ctx().topology.tg_port_egress)

⚠️ Potential issue | 🟠 Major

Silent failure hides configuration errors.

If func_tg is None, this function silently does nothing, which could hide bugs or misconfiguration. Consider either:

  1. Raising an error if func_tg is None (consistent with send_packets_and_capture), or
  2. Documenting when func_tg can be None and why silent failure is appropriate.

If option 1 is preferred:

     packets = adjust_addresses(packets)
     tg = get_ctx().func_tg
-    if tg:
-        tg.send_packets(packets, get_ctx().topology.tg_port_egress)
+    assert tg is not None, "No functional traffic generator available"
+    tg.send_packets(packets, get_ctx().topology.tg_port_egress)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    tg = get_ctx().func_tg
-    if tg:
-        tg.send_packets(packets, get_ctx().topology.tg_port_egress)
+    tg = get_ctx().func_tg
+    assert tg is not None, "No functional traffic generator available"
+    tg.send_packets(packets, get_ctx().topology.tg_port_egress)
🤖 Prompt for AI Agents
In dts/api/packet.py around lines 114-116, the current code silently skips
sending when get_ctx().func_tg is None; update it to mirror
send_packets_and_capture by explicitly checking for func_tg and raising a clear
exception (e.g., RuntimeError or ValueError) when it is None, including
contextual info (like topology name or tg_port_egress) in the message so
misconfiguration is visible; replace the silent if-guard with a check that
raises when missing, otherwise call tg.send_packets(...) as before.

Comment on lines +327 to +352
def assess_performance_by_packet(
    packet: Packet, duration: float, send_mpps: int | None = None
) -> PerformanceTrafficStats:
    """Send a given packet for a given duration and assess basic performance statistics.

    Send `packet` and assess NIC performance for a given duration, corresponding to the test
    suite's given topology.

    Args:
        packet: The packet to send.
        duration: Performance test duration (in seconds).
        send_mpps: The millions packets per second send rate.

    Returns:
        Performance statistics of the generated test.
    """
    from framework.testbed_model.traffic_generator.performance_traffic_generator import (
        PerformanceTrafficGenerator,
    )

    assert isinstance(
        get_ctx().perf_tg, PerformanceTrafficGenerator
    ), "Cannot send performance traffic with non-performance traffic generator"
    tg: PerformanceTrafficGenerator = cast(PerformanceTrafficGenerator, get_ctx().perf_tg)
    # TODO: implement @requires for types of traffic generator
    return tg.calculate_traffic_and_stats(packet, duration, send_mpps)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Missing address adjustment for packet.

Unlike send_packets_and_capture (line 95) and send_packets (line 113), this function doesn't call adjust_addresses(packet) before sending. This inconsistency could result in incorrect L2/L3 addresses in performance tests.

Verify whether performance traffic requires address adjustment by checking if the underlying calculate_traffic_and_stats handles addressing or if callers are expected to provide properly addressed packets:


🏁 Script executed:

#!/bin/bash
# Search for usages of assess_performance_by_packet to see if callers pre-adjust packets
rg -n -C5 'assess_performance_by_packet'

# Search for how calculate_traffic_and_stats is used in other contexts
rg -n -C5 'calculate_traffic_and_stats'

Length of output: 6054


🏁 Script executed:

#!/bin/bash
# Get broader context around the assess_performance_by_packet call in the test
head -80 dts/tests/TestSuite_single_core_forward_perf.py | tail -50

# Search for _generate_traffic implementation to see if it handles address adjustment
rg -n -A10 'def _generate_traffic' dts/framework/testbed_model/traffic_generator/

Length of output: 193




🏁 Script executed:

#!/bin/bash
# Find the actual paths using fd
fd -t f 'packet.py' | head -5
fd -t f 'TestSuite_single_core_forward_perf.py' | head -5
fd -t f 'performance_traffic_generator.py' | head -5

# Search for _generate_traffic implementation across codebase
rg -n 'def _generate_traffic' | head -20

Length of output: 229




🏁 Script executed:

#!/bin/bash
# Get context around the assess_performance_by_packet call
sed -n '40,80p' dts/tests/TestSuite_single_core_forward_perf.py

# Get _generate_traffic implementation
rg -n -A20 'def _generate_traffic' dts/framework/testbed_model/traffic_generator/trex.py

# Compare with how send_packets and send_packets_and_capture use adjust_addresses
sed -n '90,120p' dts/api/packet.py

Length of output: 3070




🏁 Script executed:

#!/bin/bash
# Find and show adjust_addresses implementation
rg -n -A15 'def adjust_addresses' dts/api/packet.py

# Check if there are any comments or patterns explaining why perf_tg differs from func_tg
rg -n -B5 -A5 'perf_tg|func_tg' dts/api/packet.py | head -50

Length of output: 2215




🏁 Script executed:

#!/bin/bash
# Look for other usages of assess_performance_by_packet or perf_tg.calculate_traffic_and_stats
rg -n 'perf_tg|assess_performance_by_packet' --type py -C3 | head -100

# Check _create_packet_stream implementation to see if it handles addressing
rg -n -A20 'def _create_packet_stream' dts/framework/testbed_model/traffic_generator/trex.py

Length of output: 6301


Missing address adjustment for packet.

Unlike send_packets_and_capture (line 95) and send_packets (line 113), this function doesn't call adjust_addresses(packet) before sending. The adjust_addresses function performs L2 and L3 address adjustments for the two-link topology, which the performance test also requires. Without this adjustment, performance tests will send packets with placeholder addresses instead of topology-aware addresses, potentially invalidating performance measurements.
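
A minimal sketch of that fix against the function above, assuming adjust_addresses() accepts a single Packet (if it only handles packet lists, wrap and unwrap accordingly):

     tg: PerformanceTrafficGenerator = cast(PerformanceTrafficGenerator, get_ctx().perf_tg)
     # TODO: implement @requires for types of traffic generator
+    # Mirror send_packets()/send_packets_and_capture(): resolve topology-aware
+    # L2/L3 addresses before the packet reaches the traffic generator.
+    packet = adjust_addresses(packet)
     return tg.calculate_traffic_and_stats(packet, duration, send_mpps)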

🤖 Prompt for AI Agents
In dts/api/packet.py around lines 327 to 352, the function
assess_performance_by_packet omits the topology address adjustment done in other
send helpers; call adjust_addresses(packet) immediately before sending so the
packet gets L2/L3 addresses for the two-link topology (same as
send_packets_and_capture/send_packets). Add the call (adjust_addresses(packet))
just before invoking tg.calculate_traffic_and_stats; ensure adjust_addresses is
available in scope (import or use the existing helper in this module) and that
the function uses the adjusted packet passed to the traffic generator.
