Skip to content

Commit

Permalink
bpf: Add fd-based tcx multi-prog infra with link support
Browse files Browse the repository at this point in the history
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  • Loading branch information
borkmann authored and Alexei Starovoitov committed Jul 19, 2023
1 parent 053c8e1 commit e420bed
Show file tree
Hide file tree
Showing 17 changed files with 935 additions and 144 deletions.
4 changes: 3 additions & 1 deletion MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -3778,13 +3778,15 @@ L: netdev@vger.kernel.org
S: Maintained
F: kernel/bpf/bpf_struct*

BPF [NETWORKING] (tc BPF, sock_addr)
BPF [NETWORKING] (tcx & tc BPF, sock_addr)
M: Martin KaFai Lau <martin.lau@linux.dev>
M: Daniel Borkmann <daniel@iogearbox.net>
R: John Fastabend <john.fastabend@gmail.com>
L: bpf@vger.kernel.org
L: netdev@vger.kernel.org
S: Maintained
F: include/net/tcx.h
F: kernel/bpf/tcx.c
F: net/core/filter.c
F: net/sched/act_bpf.c
F: net/sched/cls_bpf.c
Expand Down
9 changes: 9 additions & 0 deletions include/linux/bpf_mprog.h
Original file line number Diff line number Diff line change
Expand Up @@ -315,4 +315,13 @@ int bpf_mprog_detach(struct bpf_mprog_entry *entry,
int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
struct bpf_mprog_entry *entry);

static inline bool bpf_mprog_supported(enum bpf_prog_type type)
{
switch (type) {
case BPF_PROG_TYPE_SCHED_CLS:
return true;
default:
return false;
}
}
#endif /* __BPF_MPROG_H */
15 changes: 6 additions & 9 deletions include/linux/netdevice.h
Original file line number Diff line number Diff line change
Expand Up @@ -1930,8 +1930,7 @@ enum netdev_ml_priv_type {
*
* @rx_handler: handler for received packets
* @rx_handler_data: XXX: need comments on this one
* @miniq_ingress: ingress/clsact qdisc specific data for
* ingress processing
* @tcx_ingress: BPF & clsact qdisc specific data for ingress processing
* @ingress_queue: XXX: need comments on this one
* @nf_hooks_ingress: netfilter hooks executed for ingress packets
* @broadcast: hw bcast address
Expand All @@ -1952,8 +1951,7 @@ enum netdev_ml_priv_type {
* @xps_maps: all CPUs/RXQs maps for XPS device
*
* @xps_maps: XXX: need comments on this one
* @miniq_egress: clsact qdisc specific data for
* egress processing
* @tcx_egress: BPF & clsact qdisc specific data for egress processing
* @nf_hooks_egress: netfilter hooks executed for egress packets
* @qdisc_hash: qdisc hash table
* @watchdog_timeo: Represents the timeout that is used by
Expand Down Expand Up @@ -2253,9 +2251,8 @@ struct net_device {
unsigned int xdp_zc_max_segs;
rx_handler_func_t __rcu *rx_handler;
void __rcu *rx_handler_data;

#ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_ingress;
#ifdef CONFIG_NET_XGRESS
struct bpf_mprog_entry __rcu *tcx_ingress;
#endif
struct netdev_queue __rcu *ingress_queue;
#ifdef CONFIG_NETFILTER_INGRESS
Expand Down Expand Up @@ -2283,8 +2280,8 @@ struct net_device {
#ifdef CONFIG_XPS
struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
#endif
#ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
#ifdef CONFIG_NET_XGRESS
struct bpf_mprog_entry __rcu *tcx_egress;
#endif
#ifdef CONFIG_NETFILTER_EGRESS
struct nf_hook_entries __rcu *nf_hooks_egress;
Expand Down
4 changes: 2 additions & 2 deletions include/linux/skbuff.h
Original file line number Diff line number Diff line change
Expand Up @@ -944,7 +944,7 @@ struct sk_buff {
__u8 __mono_tc_offset[0];
/* public: */
__u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */
#ifdef CONFIG_NET_CLS_ACT
#ifdef CONFIG_NET_XGRESS
__u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */
__u8 tc_skip_classify:1;
#endif
Expand Down Expand Up @@ -993,7 +993,7 @@ struct sk_buff {
__u8 csum_not_inet:1;
#endif

#ifdef CONFIG_NET_SCHED
#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
__u16 tc_index; /* traffic control index */
#endif

Expand Down
2 changes: 1 addition & 1 deletion include/net/sch_generic.h
Original file line number Diff line number Diff line change
Expand Up @@ -703,7 +703,7 @@ int skb_do_redirect(struct sk_buff *);

static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
{
#ifdef CONFIG_NET_CLS_ACT
#ifdef CONFIG_NET_XGRESS
return skb->tc_at_ingress;
#else
return false;
Expand Down
206 changes: 206 additions & 0 deletions include/net/tcx.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,206 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2023 Isovalent */
#ifndef __NET_TCX_H
#define __NET_TCX_H

#include <linux/bpf.h>
#include <linux/bpf_mprog.h>

#include <net/sch_generic.h>

struct mini_Qdisc;

struct tcx_entry {
struct mini_Qdisc __rcu *miniq;
struct bpf_mprog_bundle bundle;
bool miniq_active;
struct rcu_head rcu;
};

struct tcx_link {
struct bpf_link link;
struct net_device *dev;
u32 location;
};

static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
{
#ifdef CONFIG_NET_XGRESS
skb->tc_at_ingress = ingress;
#endif
}

#ifdef CONFIG_NET_XGRESS
static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
{
struct bpf_mprog_bundle *bundle = entry->parent;

return container_of(bundle, struct tcx_entry, bundle);
}

static inline struct tcx_link *tcx_link(struct bpf_link *link)
{
return container_of(link, struct tcx_link, link);
}

static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
{
return tcx_link((struct bpf_link *)link);
}

void tcx_inc(void);
void tcx_dec(void);

static inline void tcx_entry_sync(void)
{
/* bpf_mprog_entry got a/b swapped, therefore ensure that
* there are no inflight users on the old one anymore.
*/
synchronize_rcu();
}

static inline void
tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry,
bool ingress)
{
ASSERT_RTNL();
if (ingress)
rcu_assign_pointer(dev->tcx_ingress, entry);
else
rcu_assign_pointer(dev->tcx_egress, entry);
}

static inline struct bpf_mprog_entry *
tcx_entry_fetch(struct net_device *dev, bool ingress)
{
ASSERT_RTNL();
if (ingress)
return rcu_dereference_rtnl(dev->tcx_ingress);
else
return rcu_dereference_rtnl(dev->tcx_egress);
}

static inline struct bpf_mprog_entry *tcx_entry_create(void)
{
struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL);

if (tcx) {
bpf_mprog_bundle_init(&tcx->bundle);
return &tcx->bundle.a;
}
return NULL;
}

static inline void tcx_entry_free(struct bpf_mprog_entry *entry)
{
kfree_rcu(tcx_entry(entry), rcu);
}

static inline struct bpf_mprog_entry *
tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
{
struct bpf_mprog_entry *entry = tcx_entry_fetch(dev, ingress);

*created = false;
if (!entry) {
entry = tcx_entry_create();
if (!entry)
return NULL;
*created = true;
}
return entry;
}

static inline void tcx_skeys_inc(bool ingress)
{
tcx_inc();
if (ingress)
net_inc_ingress_queue();
else
net_inc_egress_queue();
}

static inline void tcx_skeys_dec(bool ingress)
{
if (ingress)
net_dec_ingress_queue();
else
net_dec_egress_queue();
tcx_dec();
}

static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
const bool active)
{
ASSERT_RTNL();
tcx_entry(entry)->miniq_active = active;
}

static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
{
ASSERT_RTNL();
return bpf_mprog_total(entry) || tcx_entry(entry)->miniq_active;
}

static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb,
int code)
{
switch (code) {
case TCX_PASS:
skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
fallthrough;
case TCX_DROP:
case TCX_REDIRECT:
return code;
case TCX_NEXT:
default:
return TCX_NEXT;
}
}
#endif /* CONFIG_NET_XGRESS */

#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
void tcx_uninstall(struct net_device *dev, bool ingress);

int tcx_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr);

static inline void dev_tcx_uninstall(struct net_device *dev)
{
ASSERT_RTNL();
tcx_uninstall(dev, true);
tcx_uninstall(dev, false);
}
#else
static inline int tcx_prog_attach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}

static inline int tcx_link_attach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}

static inline int tcx_prog_detach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
return -EINVAL;
}

static inline int tcx_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
return -EINVAL;
}

static inline void dev_tcx_uninstall(struct net_device *dev)
{
}
#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
#endif /* __NET_TCX_H */
34 changes: 30 additions & 4 deletions include/uapi/linux/bpf.h
Original file line number Diff line number Diff line change
Expand Up @@ -1036,6 +1036,8 @@ enum bpf_attach_type {
BPF_LSM_CGROUP,
BPF_STRUCT_OPS,
BPF_NETFILTER,
BPF_TCX_INGRESS,
BPF_TCX_EGRESS,
__MAX_BPF_ATTACH_TYPE
};

Expand All @@ -1053,7 +1055,7 @@ enum bpf_link_type {
BPF_LINK_TYPE_KPROBE_MULTI = 8,
BPF_LINK_TYPE_STRUCT_OPS = 9,
BPF_LINK_TYPE_NETFILTER = 10,

BPF_LINK_TYPE_TCX = 11,
MAX_BPF_LINK_TYPE,
};

Expand Down Expand Up @@ -1569,13 +1571,13 @@ union bpf_attr {
__u32 map_fd; /* struct_ops to attach */
};
union {
__u32 target_fd; /* object to attach to */
__u32 target_ifindex; /* target ifindex */
__u32 target_fd; /* target object to attach to or ... */
__u32 target_ifindex; /* target ifindex */
};
__u32 attach_type; /* attach type */
__u32 flags; /* extra flags */
union {
__u32 target_btf_id; /* btf_id of target to attach to */
__u32 target_btf_id; /* btf_id of target to attach to */
struct {
__aligned_u64 iter_info; /* extra bpf_iter_link_info */
__u32 iter_info_len; /* iter_info length */
Expand Down Expand Up @@ -1609,6 +1611,13 @@ union bpf_attr {
__s32 priority;
__u32 flags;
} netfilter;
struct {
union {
__u32 relative_fd;
__u32 relative_id;
};
__u64 expected_revision;
} tcx;
};
} link_create;

Expand Down Expand Up @@ -6217,6 +6226,19 @@ struct bpf_sock_tuple {
};
};

/* (Simplified) user return codes for tcx prog type.
* A valid tcx program must return one of these defined values. All other
* return codes are reserved for future use. Must remain compatible with
* their TC_ACT_* counter-parts. For compatibility in behavior, unknown
* return codes are mapped to TCX_NEXT.
*/
enum tcx_action_base {
TCX_NEXT = -1,
TCX_PASS = 0,
TCX_DROP = 2,
TCX_REDIRECT = 7,
};

struct bpf_xdp_sock {
__u32 queue_id;
};
Expand Down Expand Up @@ -6499,6 +6521,10 @@ struct bpf_link_info {
} event; /* BPF_PERF_EVENT_EVENT */
};
} perf_event;
struct {
__u32 ifindex;
__u32 attach_type;
} tcx;
};
} __attribute__((aligned(8)));

Expand Down
1 change: 1 addition & 0 deletions kernel/bpf/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ config BPF_SYSCALL
select TASKS_TRACE_RCU
select BINARY_PRINTF
select NET_SOCK_MSG if NET
select NET_XGRESS if NET
select PAGE_POOL if NET
default n
help
Expand Down
Loading

0 comments on commit e420bed

Please sign in to comment.