dpif-netdev: Introduce hash-based Tx packet steering mode.
This patch adds a new hash Tx steering mode that
distributes traffic across all the Tx queues, regardless
of the number of PMD threads. It can be useful for guests
expecting traffic to be distributed across all the vCPUs.

The idea here is to re-use the 5-tuple hash of the
packets, already computed to build the flow batches (so it
does not provide flexibility on which fields are part of
the hash).

There is also no user-configurable indirection table,
given the feature is transparent to the guest. The queue
selection is simply a modulo operation between the packet
hash and the number of Tx queues.

There are no (at least intentional) functional changes
for the existing XPS and static modes. There should be no
noticeable performance change for these modes (only one
more branch in the hot path).

For the hash mode, performance could be impacted due to
locking when multiple PMD threads are in use (same as
XPS mode) and also because of the second level of batching.

Regarding the batching, the existing Tx port output_pkts
is not modified, which means that at most NETDEV_MAX_BURST
packets can be batched across all the Tx queues. A second
level of batching is done in
dp_netdev_pmd_flush_output_on_port(), only for this hash
mode.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
mcoquelin authored and igsilya committed Jan 17, 2022
1 parent 31e67c9 commit c18e707
Showing 5 changed files with 190 additions and 23 deletions.
1 change: 1 addition & 0 deletions Documentation/automake.mk
@@ -58,6 +58,7 @@ DOC_SOURCE = \
Documentation/topics/record-replay.rst \
Documentation/topics/tracing.rst \
Documentation/topics/userspace-tso.rst \
Documentation/topics/userspace-tx-steering.rst \
Documentation/topics/windows.rst \
Documentation/howto/index.rst \
Documentation/howto/dpdk.rst \
1 change: 1 addition & 0 deletions Documentation/topics/index.rst
@@ -55,3 +55,4 @@ OVS
userspace-tso
idl-compound-indexes
ovs-extensions
userspace-tx-steering
74 changes: 74 additions & 0 deletions Documentation/topics/userspace-tx-steering.rst
@@ -0,0 +1,74 @@
..
Licensed under the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License. You may obtain
a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.

Convention for heading levels in Open vSwitch documentation:

======= Heading 0 (reserved for the title in a document)
------- Heading 1
~~~~~~~ Heading 2
+++++++ Heading 3
''''''' Heading 4

Avoid deeper levels because they do not render well.

============================
Userspace Tx packet steering
============================

The userspace datapath supports two transmit packet steering modes.

Thread mode
~~~~~~~~~~~

This mode is automatically selected when the port's ``tx-steering`` option is
set to ``thread`` or unset.

If the port's number of Tx queues is greater than or equal to the number of
PMD threads, static txq mapping is used; otherwise, XPS is used.

This is the recommended mode for performance reasons if the number of Tx
queues is greater than or equal to the number of PMD threads, because the Tx
lock is not acquired.

If the number of Tx queues is greater than the number of threads, the remaining
Tx queues will not be used.

This mode is enabled by default.

Hash mode
~~~~~~~~~

Hash-based Tx packet steering mode distributes packets across all the port's
transmit queues, regardless of the number of PMD threads. The queue selection
is based on the 5-tuple hash already computed to build the flow batches; the
selected queue is the packet hash modulo the number of Tx queues of the port.
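
In pseudo-C, the per-packet selection amounts to the following sketch, which
mirrors the logic of ``dp_netdev_pmd_flush_output_on_port()`` in
``lib/dpif-netdev.c`` (variable names are illustrative)::

    uint32_t hash;
    int txq;

    if (dp_packet_rss_valid(packet)) {
        /* Re-use the RSS hash computed when the packet was received. */
        hash = dp_packet_get_rss_hash(packet);
    } else {
        /* Otherwise extract the flow and hash its 5-tuple. */
        struct flow flow;

        flow_extract(packet, &flow);
        hash = flow_hash_5tuple(&flow, 0);
    }
    txq = hash % netdev_n_txq(netdev);    /* Selected transmit queue. */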

Hash mode may be used, for example, with vhost-user ports, when the guest's
number of vCPUs and queues is greater than the number of PMD threads. Without
hash mode, the number of Tx queues used would be limited to the number of
threads.

Hash-based Tx packet steering may have an impact on performance, given that
Tx lock acquisition is always required and a second level of batching is
performed.

Usage
~~~~~

To enable hash mode::

$ ovs-vsctl set Interface <iface> other_config:tx-steering=hash

To disable hash mode::

$ ovs-vsctl set Interface <iface> other_config:tx-steering=thread
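
When the steering mode changes, ``ovs-vswitchd`` logs a confirmation (the
message below comes from the implementation in ``lib/dpif-netdev.c``; the log
timestamp and module prefix will vary)::

    <iface>: Tx packet steering mode has been set to 'hash'.
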
3 changes: 3 additions & 0 deletions NEWS
@@ -3,6 +3,9 @@ Post-v2.16.0
- Userspace datapath:
* Optimized flow lookups for datapath flows with simple match criteria.
See 'Simple Match Lookup' in Documentation/topics/dpdk/bridge.rst.
* New per-interface configuration knob 'other_config:tx-steering'. If set
to 'hash', enables hash-based Tx packet steering mode to utilize all the
Tx queues of the interface regardless of the number of PMD threads.
- DPDK:
* EAL argument --socket-mem is no longer configured by default upon
start-up. If dpdk-socket-mem and dpdk-alloc-mem are not specified,
134 changes: 111 additions & 23 deletions lib/dpif-netdev.c
@@ -387,15 +387,21 @@ struct dp_netdev_rxq {
atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
};

enum txq_req_mode {
TXQ_REQ_MODE_THREAD,
TXQ_REQ_MODE_HASH,
};

enum txq_mode {
TXQ_MODE_STATIC,
TXQ_MODE_XPS,
TXQ_MODE_XPS_HASH,
};

/* A port in a netdev-based datapath. */
struct dp_netdev_port {
odp_port_t port_no;
enum txq_mode txq_mode; /* static, XPS */
enum txq_mode txq_mode; /* static, XPS, XPS_HASH. */
bool need_reconfigure; /* True if we should reconfigure netdev. */
struct netdev *netdev;
struct hmap_node node; /* Node in dp_netdev's 'ports'. */
@@ -407,6 +413,7 @@ struct dp_netdev_port {
bool emc_enabled; /* If true EMC will be used. */
char *type; /* Port type as requested by user. */
char *rxq_affinity_list; /* Requested affinity of rx queues. */
enum txq_req_mode txq_requested_mode;
};

static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -442,6 +449,7 @@ struct tx_port {
struct hmap_node node;
long long flush_time;
struct dp_packet_batch output_pkts;
struct dp_packet_batch *txq_pkts; /* Only for hash mode. */
struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
};

@@ -4634,6 +4642,8 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
int error = 0;
const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
const char *tx_steering_mode = smap_get(cfg, "tx-steering");
enum txq_req_mode txq_mode;

ovs_mutex_lock(&dp->port_mutex);
error = get_port_by_number(dp, port_no, &port);
@@ -4675,19 +4685,33 @@ }
}

/* Checking for RXq affinity changes. */
if (!netdev_is_pmd(port->netdev)
|| nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
goto unlock;
if (netdev_is_pmd(port->netdev)
&& !nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {

error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
if (error) {
goto unlock;
}
free(port->rxq_affinity_list);
port->rxq_affinity_list = nullable_xstrdup(affinity_list);

dp_netdev_request_reconfigure(dp);
}

error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
if (error) {
goto unlock;
if (nullable_string_is_equal(tx_steering_mode, "hash")) {
txq_mode = TXQ_REQ_MODE_HASH;
} else {
txq_mode = TXQ_REQ_MODE_THREAD;
}

if (txq_mode != port->txq_requested_mode) {
port->txq_requested_mode = txq_mode;
VLOG_INFO("%s: Tx packet steering mode has been set to '%s'.",
netdev_get_name(port->netdev),
(txq_mode == TXQ_REQ_MODE_THREAD) ? "thread" : "hash");
dp_netdev_request_reconfigure(dp);
}
free(port->rxq_affinity_list);
port->rxq_affinity_list = nullable_xstrdup(affinity_list);

dp_netdev_request_reconfigure(dp);
unlock:
ovs_mutex_unlock(&dp->port_mutex);
return error;
@@ -4802,18 +4826,46 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,

cycle_timer_start(&pmd->perf_stats, &timer);

if (p->port->txq_mode == TXQ_MODE_XPS) {
tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
concurrent_txqs = true;
} else {
tx_qid = pmd->static_tx_qid;
concurrent_txqs = false;
}

output_cnt = dp_packet_batch_size(&p->output_pkts);
ovs_assert(output_cnt > 0);

netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
if (p->port->txq_mode == TXQ_MODE_XPS_HASH) {
int n_txq = netdev_n_txq(p->port->netdev);

/* Re-batch per txq based on packet hash. */
for (i = 0; i < output_cnt; i++) {
struct dp_packet *packet = p->output_pkts.packets[i];
uint32_t hash;

if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
hash = dp_packet_get_rss_hash(packet);
} else {
struct flow flow;

flow_extract(packet, &flow);
hash = flow_hash_5tuple(&flow, 0);
}
dp_packet_batch_add(&p->txq_pkts[hash % n_txq], packet);
}

/* Flush the batch of each Tx queue. */
for (i = 0; i < n_txq; i++) {
if (dp_packet_batch_is_empty(&p->txq_pkts[i])) {
continue;
}
netdev_send(p->port->netdev, i, &p->txq_pkts[i], true);
dp_packet_batch_init(&p->txq_pkts[i]);
}
} else {
if (p->port->txq_mode == TXQ_MODE_XPS) {
tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
concurrent_txqs = true;
} else {
tx_qid = pmd->static_tx_qid;
concurrent_txqs = false;
}
netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
}
dp_packet_batch_init(&p->output_pkts);

/* Update time of the next flush. */
@@ -5975,7 +6027,10 @@ reconfigure_datapath(struct dp_netdev *dp)
HMAP_FOR_EACH (port, node, &dp->ports) {
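/* Reconfigure when the netdev requests it, or when the effective txq
 * mode (static/XPS/XPS_HASH) no longer matches the mode implied by the
 * requested tx-steering setting and the current Tx queue count. */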
if (netdev_is_reconf_required(port->netdev)
|| ((port->txq_mode == TXQ_MODE_XPS)
!= (netdev_n_txq(port->netdev) < wanted_txqs))) {
!= (netdev_n_txq(port->netdev) < wanted_txqs))
|| ((port->txq_mode == TXQ_MODE_XPS_HASH)
!= (port->txq_requested_mode == TXQ_REQ_MODE_HASH
&& netdev_n_txq(port->netdev) > 1))) {
port->need_reconfigure = true;
}
}
@@ -6010,8 +6065,15 @@ reconfigure_datapath(struct dp_netdev *dp)
seq_change(dp->port_seq);
port_destroy(port);
} else {
port->txq_mode = (netdev_n_txq(port->netdev) < wanted_txqs)
? TXQ_MODE_XPS : TXQ_MODE_STATIC;
/* With a single queue, there is no point in using hash mode. */
if (port->txq_requested_mode == TXQ_REQ_MODE_HASH &&
netdev_n_txq(port->netdev) > 1) {
port->txq_mode = TXQ_MODE_XPS_HASH;
} else if (netdev_n_txq(port->netdev) < wanted_txqs) {
port->txq_mode = TXQ_MODE_XPS;
} else {
port->txq_mode = TXQ_MODE_STATIC;
}
}
}

@@ -6300,9 +6362,11 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
dpif_netdev_xps_revalidate_pmd(pmd, true);

HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->tnl_port_cache) {
free(tx_port_cached->txq_pkts);
free(tx_port_cached);
}
HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
free(tx_port_cached->txq_pkts);
free(tx_port_cached);
}
}
@@ -6322,14 +6386,27 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
hmap_shrink(&pmd->tnl_port_cache);

HMAP_FOR_EACH (tx_port, node, &pmd->tx_ports) {
int n_txq = netdev_n_txq(tx_port->port->netdev);
struct dp_packet_batch *txq_pkts_cached;

if (netdev_has_tunnel_push_pop(tx_port->port->netdev)) {
tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
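/* In hash mode, 'txq_pkts' points to a per-txq batch array; deep-copy
 * it so the cached tx_port owns its own copy ('txq_pkts' is NULL in
 * the other modes). */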
if (tx_port->txq_pkts) {
txq_pkts_cached = xmemdup(tx_port->txq_pkts,
n_txq * sizeof *tx_port->txq_pkts);
tx_port_cached->txq_pkts = txq_pkts_cached;
}
hmap_insert(&pmd->tnl_port_cache, &tx_port_cached->node,
hash_port_no(tx_port_cached->port->port_no));
}

if (netdev_n_txq(tx_port->port->netdev)) {
if (n_txq) {
tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
if (tx_port->txq_pkts) {
txq_pkts_cached = xmemdup(tx_port->txq_pkts,
n_txq * sizeof *tx_port->txq_pkts);
tx_port_cached->txq_pkts = txq_pkts_cached;
}
hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
hash_port_no(tx_port_cached->port->port_no));
}
@@ -7125,6 +7202,7 @@ dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
free(poll);
}
HMAP_FOR_EACH_POP (port, node, &pmd->tx_ports) {
free(port->txq_pkts);
free(port);
}
ovs_mutex_unlock(&pmd->port_mutex);
@@ -7195,6 +7273,15 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
tx->flush_time = 0LL;
dp_packet_batch_init(&tx->output_pkts);

if (tx->port->txq_mode == TXQ_MODE_XPS_HASH) {
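/* Hash mode: allocate one packet batch per Tx queue for the second
 * level of batching. */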
int i, n_txq = netdev_n_txq(tx->port->netdev);

tx->txq_pkts = xzalloc(n_txq * sizeof *tx->txq_pkts);
for (i = 0; i < n_txq; i++) {
dp_packet_batch_init(&tx->txq_pkts[i]);
}
}

hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
pmd->need_reload = true;
}
@@ -7207,6 +7294,7 @@ dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
OVS_REQUIRES(pmd->port_mutex)
{
hmap_remove(&pmd->tx_ports, &tx->node);
free(tx->txq_pkts);
free(tx);
pmd->need_reload = true;
}
