From 7a15be69b00fe8f66a3f3929434b39676f325a7a Mon Sep 17 00:00:00 2001 From: Gurucharan Shetty Date: Wed, 15 Jun 2016 06:24:30 -0700 Subject: [PATCH] ovn: Add support for Load balancers. This commit adds schema changes to the OVN_Northbound database to support Load balancers. In ovn-northd, it adds two logical tables to program logical flows. It adds a 'pre_lb' table that sits before 'pre_stateful' table. For packets that need to be load balanced, this table sets reg0[0] to act as a hint for the pre-stateful table to send the packet to the conntrack table for defragmentation. It also adds a 'lb' table that sits before 'stateful' table. For packets from established connections, this table sets reg0[2] to indicate to the 'stateful' table that the packet needs to be sent to connection tracking table to just do NAT. In stateful table, packet for a new connection that needs to be load balanced is given a ct_lb($IP_LIST) action. Signed-off-by: Gurucharan Shetty Acked-by: Ben Pfaff --- ovn/northd/ovn-northd.8.xml | 104 +++++++++++++++--- ovn/northd/ovn-northd.c | 206 ++++++++++++++++++++++++++++++++++-- ovn/ovn-nb.ovsschema | 22 +++- ovn/ovn-nb.xml | 43 ++++++++ ovn/utilities/ovn-nbctl.c | 4 + 5 files changed, 351 insertions(+), 28 deletions(-) diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml index b8ee1067fed..6bc83ea50ca 100644 --- a/ovn/northd/ovn-northd.8.xml +++ b/ovn/northd/ovn-northd.8.xml @@ -252,7 +252,23 @@ before eventually advancing to ingress table ACLs.

-

Ingress Table 4: Pre-stateful

+

Ingress Table 4: Pre-LB

+ +

+ This table prepares flows for possible stateful load balancing processing + in ingress table LB and Stateful. It contains + a priority-0 flow that simply moves traffic to the next table. If load + balancing rules with virtual IP addresses (and ports) are configured in + the OVN_Northbound database for a logical datapath, a + priority-100 flow is added for each configured virtual IP address + VIP with a match ip && ip4.dst == VIP + that sets an action reg0[0] = 1; next; to act as a + hint for table Pre-stateful to send IP packets to the + connection tracker for packet de-fragmentation before eventually + advancing to ingress table LB. +

+ +

Ingress Table 5: Pre-stateful

This table prepares flows for all possible stateful processing @@ -263,7 +279,7 @@ ct_next; action.

-

Ingress table 5: from-lport ACLs

+

Ingress table 6: from-lport ACLs

Logical flows in this table closely reproduce those in the @@ -312,16 +328,57 @@ -

Ingress Table 6: Stateful

+

Ingress Table 7: LB

It contains a priority-0 flow that simply moves traffic to the next - table. A priority-100 flow commits packets to connection tracker using - ct_commit; next; action based on a hint provided by - the previous tables (with a match for reg0[1] == 1). + table. For established connections, a priority-100 flow matches on + ct.est && !ct.rel && !ct.new && + !ct.inv and sets an action reg0[2] = 1; next; to act + as a hint for table Stateful to send packets through + the connection tracker to NAT the packets. (The packet will automatically + get DNATed to the same IP address as the first packet in that + connection.)

-

Ingress Table 7: ARP responder

+

Ingress Table 8: Stateful

+ +
    +
• + For all the configured load balancing rules in + the OVN_Northbound database that include an L4 port + PORT of protocol P and IPv4 address + VIP, a priority-120 flow that matches on + ct.new && ip && ip4.dst == VIP + && P && P.dst == PORT + with an action of ct_lb(args), + where args contains comma-separated IPv4 addresses (and + optional port numbers) to load balance to.
  • +
• + For all the configured load balancing rules in + the OVN_Northbound database that include just an IP address + VIP to match on, a priority-110 flow that matches on + ct.new && ip && ip4.dst == VIP + with an action of ct_lb(args), where + args contains comma-separated IPv4 addresses.
  • +
• + A priority-100 flow commits packets to the connection tracker using + the ct_commit; next; action based on a hint provided by + the previous tables (with a match for reg0[1] == 1).
  • +
• + A priority-100 flow sends the packets to the connection tracker using + ct_lb; as the action based on a hint provided by the + previous tables (with a match for reg0[2] == 1).
  • +
  • + A priority-0 flow that simply moves traffic to the next table. +
  • +
+ +

Ingress Table 9: ARP responder

This table implements ARP responder for known IPs. It contains these @@ -366,7 +423,7 @@ output; -

Ingress Table 8: Destination Lookup

+

Ingress Table 10: Destination Lookup

This table implements switching behavior. It contains these logical @@ -397,33 +454,50 @@ output; -

Egress Table 0: to-lport Pre-ACLs

+

Egress Table 0: Pre-LB

+ +

+ This table is similar to ingress table Pre-LB. It + contains a priority-0 flow that simply moves traffic to the next table. + If any load balancing rules exist for the datapath, a priority-100 flow + is added with a match of ip and action of reg0[0] = 1; + next; to act as a hint for table Pre-stateful to + send IP packets to the connection tracker for packet de-fragmentation. +

+ +

Egress Table 1: to-lport Pre-ACLs

This is similar to ingress table Pre-ACLs except for to-lport traffic.

-

Egress Table 1: Pre-stateful

+

Egress Table 2: Pre-stateful

This is similar to ingress table Pre-stateful.

-

Egress Table 2: to-lport ACLs

+

Egress Table 3: LB

+

+ This is similar to ingress table LB. +

+ +

Egress Table 4: to-lport ACLs

This is similar to ingress table ACLs except for to-lport ACLs.

-

Egress Table 3: Stateful

+

Egress Table 5: Stateful

- This is similar to ingress table Stateful. + This is similar to ingress table Stateful except that + there are no rules added for load balancing new connections.

-

Egress Table 4: Egress Port Security - IP

+

Egress Table 6: Egress Port Security - IP

This is similar to the port security logic in table @@ -433,7 +507,7 @@ output; ip4.src and ip6.src

-

Egress Table 5: Egress Port Security - L2

+

Egress Table 7: Egress Port Security - L2

This is similar to the ingress port security logic in ingress table diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c index 5149f01c862..f4b4435d217 100644 --- a/ovn/northd/ovn-northd.c +++ b/ovn/northd/ovn-northd.c @@ -33,6 +33,7 @@ #include "packets.h" #include "poll-loop.h" #include "smap.h" +#include "sset.h" #include "stream.h" #include "stream-ssl.h" #include "unixctl.h" @@ -92,19 +93,23 @@ enum ovn_stage { PIPELINE_STAGE(SWITCH, IN, PORT_SEC_IP, 1, "ls_in_port_sec_ip") \ PIPELINE_STAGE(SWITCH, IN, PORT_SEC_ND, 2, "ls_in_port_sec_nd") \ PIPELINE_STAGE(SWITCH, IN, PRE_ACL, 3, "ls_in_pre_acl") \ - PIPELINE_STAGE(SWITCH, IN, PRE_STATEFUL, 4, "ls_in_pre_stateful") \ - PIPELINE_STAGE(SWITCH, IN, ACL, 5, "ls_in_acl") \ - PIPELINE_STAGE(SWITCH, IN, STATEFUL, 6, "ls_in_stateful") \ - PIPELINE_STAGE(SWITCH, IN, ARP_ND_RSP, 7, "ls_in_arp_nd_rsp") \ - PIPELINE_STAGE(SWITCH, IN, L2_LKUP, 8, "ls_in_l2_lkup") \ + PIPELINE_STAGE(SWITCH, IN, PRE_LB, 4, "ls_in_pre_lb") \ + PIPELINE_STAGE(SWITCH, IN, PRE_STATEFUL, 5, "ls_in_pre_stateful") \ + PIPELINE_STAGE(SWITCH, IN, ACL, 6, "ls_in_acl") \ + PIPELINE_STAGE(SWITCH, IN, LB, 7, "ls_in_lb") \ + PIPELINE_STAGE(SWITCH, IN, STATEFUL, 8, "ls_in_stateful") \ + PIPELINE_STAGE(SWITCH, IN, ARP_ND_RSP, 9, "ls_in_arp_rsp") \ + PIPELINE_STAGE(SWITCH, IN, L2_LKUP, 10, "ls_in_l2_lkup") \ \ /* Logical switch egress stages. 
*/ \ - PIPELINE_STAGE(SWITCH, OUT, PRE_ACL, 0, "ls_out_pre_acl") \ - PIPELINE_STAGE(SWITCH, OUT, PRE_STATEFUL, 1, "ls_out_pre_stateful") \ - PIPELINE_STAGE(SWITCH, OUT, ACL, 2, "ls_out_acl") \ - PIPELINE_STAGE(SWITCH, OUT, STATEFUL, 3, "ls_out_stateful") \ - PIPELINE_STAGE(SWITCH, OUT, PORT_SEC_IP, 4, "ls_out_port_sec_ip") \ - PIPELINE_STAGE(SWITCH, OUT, PORT_SEC_L2, 5, "ls_out_port_sec_l2") \ + PIPELINE_STAGE(SWITCH, OUT, PRE_LB, 0, "ls_out_pre_lb") \ + PIPELINE_STAGE(SWITCH, OUT, PRE_ACL, 1, "ls_out_pre_acl") \ + PIPELINE_STAGE(SWITCH, OUT, PRE_STATEFUL, 2, "ls_out_pre_stateful") \ + PIPELINE_STAGE(SWITCH, OUT, LB, 3, "ls_out_lb") \ + PIPELINE_STAGE(SWITCH, OUT, ACL, 4, "ls_out_acl") \ + PIPELINE_STAGE(SWITCH, OUT, STATEFUL, 5, "ls_out_stateful") \ + PIPELINE_STAGE(SWITCH, OUT, PORT_SEC_IP, 6, "ls_out_port_sec_ip") \ + PIPELINE_STAGE(SWITCH, OUT, PORT_SEC_L2, 7, "ls_out_port_sec_l2") \ \ /* Logical router ingress stages. */ \ PIPELINE_STAGE(ROUTER, IN, ADMISSION, 0, "lr_in_admission") \ @@ -134,6 +139,7 @@ enum ovn_stage { #define REGBIT_CONNTRACK_DEFRAG "reg0[0]" #define REGBIT_CONNTRACK_COMMIT "reg0[1]" +#define REGBIT_CONNTRACK_NAT "reg0[2]" /* Returns an "enum ovn_stage" built from the arguments. */ static enum ovn_stage @@ -1400,6 +1406,107 @@ build_pre_acls(struct ovn_datapath *od, struct hmap *lflows, } } +/* For a 'key' of the form "IP:port" or just "IP", sets 'port' and + * 'ip_address'. The caller must free() the memory allocated for + * 'ip_address'. 
*/ +static void +ip_address_and_port_from_lb_key(const char *key, char **ip_address, + uint16_t *port) +{ + char *ip_str, *start, *next; + *ip_address = NULL; + *port = 0; + + next = start = xstrdup(key); + ip_str = strsep(&next, ":"); + if (!ip_str || !ip_str[0]) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); + VLOG_WARN_RL(&rl, "bad ip address for load balancer key %s", key); + free(start); + return; + } + + ovs_be32 ip, mask; + char *error = ip_parse_masked(ip_str, &ip, &mask); + if (error || mask != OVS_BE32_MAX) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); + VLOG_WARN_RL(&rl, "bad ip address for load balancer key %s", key); + free(start); + free(error); + return; + } + + int l4_port = 0; + if (next && next[0]) { + if (!str_to_int(next, 0, &l4_port) || l4_port < 0 || l4_port > 65535) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); + VLOG_WARN_RL(&rl, "bad ip port for load balancer key %s", key); + free(start); + return; + } + } + + *port = l4_port; + *ip_address = strdup(ip_str); + free(start); +} + +static void +build_pre_lb(struct ovn_datapath *od, struct hmap *lflows) +{ + /* Allow all packets to go to next tables by default. */ + ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_LB, 0, "1", "next;"); + ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_LB, 0, "1", "next;"); + + struct sset all_ips = SSET_INITIALIZER(&all_ips); + if (od->nbs->load_balancer) { + struct nbrec_load_balancer *lb = od->nbs->load_balancer; + struct smap *vips = &lb->vips; + struct smap_node *node; + bool vip_configured = false; + + SMAP_FOR_EACH (node, vips) { + vip_configured = true; + + /* node->key contains IP:port or just IP. 
*/ + char *ip_address = NULL; + uint16_t port; + ip_address_and_port_from_lb_key(node->key, &ip_address, &port); + if (!ip_address) { + continue; + } + + if (!sset_contains(&all_ips, ip_address)) { + sset_add(&all_ips, ip_address); + } + + free(ip_address); + + /* Ignore L4 port information in the key because fragmented packets + * may not have L4 information. The pre-stateful table will send + * the packet through ct() action to de-fragment. In stateful + * table, we will eventually look at L4 information. */ + } + + /* 'REGBIT_CONNTRACK_DEFRAG' is set to let the pre-stateful table send + * packet to conntrack for defragmentation. */ + const char *ip_address; + SSET_FOR_EACH(ip_address, &all_ips) { + char *match = xasprintf("ip && ip4.dst == %s", ip_address); + ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_LB, + 100, match, REGBIT_CONNTRACK_DEFRAG" = 1; next;"); + free(match); + } + + sset_destroy(&all_ips); + + if (vip_configured) { + ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_LB, + 100, "ip", REGBIT_CONNTRACK_DEFRAG" = 1; next;"); + } + } +} + static void build_pre_stateful(struct ovn_datapath *od, struct hmap *lflows) { @@ -1531,6 +1638,27 @@ build_acls(struct ovn_datapath *od, struct hmap *lflows) } } +static void +build_lb(struct ovn_datapath *od, struct hmap *lflows) +{ + /* Ingress and Egress LB Table (Priority 0): Packets are allowed by + * default. */ + ovn_lflow_add(lflows, od, S_SWITCH_IN_LB, 0, "1", "next;"); + ovn_lflow_add(lflows, od, S_SWITCH_OUT_LB, 0, "1", "next;"); + + if (od->nbs->load_balancer) { + /* Ingress and Egress LB Table (Priority 65535). + * + * Send established traffic through conntrack for just NAT. 
*/ + ovn_lflow_add(lflows, od, S_SWITCH_IN_LB, UINT16_MAX, + "ct.est && !ct.rel && !ct.new && !ct.inv", + REGBIT_CONNTRACK_NAT" = 1; next;"); + ovn_lflow_add(lflows, od, S_SWITCH_OUT_LB, UINT16_MAX, + "ct.est && !ct.rel && !ct.new && !ct.inv", + REGBIT_CONNTRACK_NAT" = 1; next;"); + } +} + static void build_stateful(struct ovn_datapath *od, struct hmap *lflows) { @@ -1545,6 +1673,60 @@ build_stateful(struct ovn_datapath *od, struct hmap *lflows) REGBIT_CONNTRACK_COMMIT" == 1", "ct_commit; next;"); ovn_lflow_add(lflows, od, S_SWITCH_OUT_STATEFUL, 100, REGBIT_CONNTRACK_COMMIT" == 1", "ct_commit; next;"); + + /* If REGBIT_CONNTRACK_NAT is set as 1, then packets should just be sent + * through nat (without committing). + * + * REGBIT_CONNTRACK_COMMIT is set for new connections and + * REGBIT_CONNTRACK_NAT is set for established connections. So they + * don't overlap. + */ + ovn_lflow_add(lflows, od, S_SWITCH_IN_STATEFUL, 100, + REGBIT_CONNTRACK_NAT" == 1", "ct_lb;"); + ovn_lflow_add(lflows, od, S_SWITCH_OUT_STATEFUL, 100, + REGBIT_CONNTRACK_NAT" == 1", "ct_lb;"); + + /* Load balancing rules for new connections get committed to conntrack + * table. So even if REGBIT_CONNTRACK_COMMIT is set in a previous table + * a higher priority rule for load balancing below also commits the + * connection, so it is okay if we do not hit the above match on + * REGBIT_CONNTRACK_COMMIT. */ + if (od->nbs->load_balancer) { + struct nbrec_load_balancer *lb = od->nbs->load_balancer; + struct smap *vips = &lb->vips; + struct smap_node *node; + + SMAP_FOR_EACH (node, vips) { + uint16_t port = 0; + + /* node->key contains IP:port or just IP. */ + char *ip_address = NULL; + ip_address_and_port_from_lb_key(node->key, &ip_address, &port); + if (!ip_address) { + continue; + } + + /* New connections in Ingress table. 
*/ + char *action = xasprintf("ct_lb(%s);", node->value); + struct ds match = DS_EMPTY_INITIALIZER; + ds_put_format(&match, "ct.new && ip && ip4.dst == %s", ip_address); + if (port) { + if (lb->protocol && !strcmp(lb->protocol, "udp")) { + ds_put_format(&match, "&& udp && udp.dst == %d", port); + } else { + ds_put_format(&match, "&& tcp && tcp.dst == %d", port); + } + ovn_lflow_add(lflows, od, S_SWITCH_IN_STATEFUL, + 120, ds_cstr(&match), action); + } else { + ovn_lflow_add(lflows, od, S_SWITCH_IN_STATEFUL, + 110, ds_cstr(&match), action); + } + + ds_destroy(&match); + free(action); + } + } } static void @@ -1563,8 +1745,10 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports, } build_pre_acls(od, lflows, ports); + build_pre_lb(od, lflows); build_pre_stateful(od, lflows); build_acls(od, lflows); + build_lb(od, lflows); build_stateful(od, lflows); } diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema index 2a8e127ff71..ee7c2c6ebf5 100644 --- a/ovn/ovn-nb.ovsschema +++ b/ovn/ovn-nb.ovsschema @@ -1,7 +1,7 @@ { "name": "OVN_Northbound", - "version": "3.1.0", - "cksum": "4200094286 6620", + "version": "3.2.0", + "cksum": "1784604034 7539", "tables": { "Logical_Switch": { "columns": { @@ -16,6 +16,11 @@ "refType": "strong"}, "min": 0, "max": "unlimited"}}, + "load_balancer": {"type": {"key": {"type": "uuid", + "refTable": "Load_Balancer", + "refType": "strong"}, + "min": 0, + "max": 1}}, "external_ids": { "type": {"key": "string", "value": "string", "min": 0, "max": "unlimited"}}}, @@ -59,6 +64,19 @@ "min": 0, "max": "unlimited"}}}, "indexes": [["name"]], "isRoot": true}, + "Load_Balancer": { + "columns": { + "vips": { + "type": {"key": "string", "value": "string", + "min": 0, "max": "unlimited"}}, + "protocol": { + "type": {"key": {"type": "string", + "enum": ["set", ["tcp", "udp"]]}, + "min": 0, "max": 1}}, + "external_ids": { + "type": {"key": "string", "value": "string", + "min": 0, "max": "unlimited"}}}, + "isRoot": true}, "ACL": { "columns": { 
"priority": {"type": {"key": {"type": "integer", diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml index 2469dc20ac0..ff2e69576ab 100644 --- a/ovn/ovn-nb.xml +++ b/ovn/ovn-nb.xml @@ -69,6 +69,11 @@

+ + Load balance a virtual IPv4 address to a set of logical port endpoint + IPv4 addresses. + + Access control rules that apply to packets within the logical switch. @@ -543,6 +548,44 @@ + +

+ Each row represents one load balancer. +

+ + +

+ A map of virtual IPv4 addresses (and an optional port number with + : as a separator) associated with this load balancer and + their corresponding endpoint IPv4 addresses (and optional port numbers + with : as separators) separated by commas. If + the destination IP address (and port number) of a packet leaving a + container or a VM matches the virtual IPv4 address (and port number) + provided here as a key, then OVN will statefully replace the + destination IP address with one of the endpoint IPv4 addresses (and + port numbers) provided in this map as the value. Examples of keys are + "192.168.1.4" and "172.16.1.8:80". Examples of values are "10.0.0.1, + 10.0.0.2" and "20.0.0.10:8800, 20.0.0.11:8800".

+
+ + +

+ Valid protocols are tcp or udp. This column + is useful when a port number is provided as part of the + vips column. If this column is empty and a port number + is provided as part of the vips column, OVN assumes the + protocol to be tcp.

+
+ + + + See External IDs at the beginning of this document. + + +
+

Each row in this table represents one ACL rule for a logical switch diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c index 78172ab5cdc..ad70a05ec95 100644 --- a/ovn/utilities/ovn-nbctl.c +++ b/ovn/utilities/ovn-nbctl.c @@ -1852,6 +1852,10 @@ static const struct ctl_table_class tables[] = { {{NULL, NULL, NULL}, {NULL, NULL, NULL}}}, + {&nbrec_table_load_balancer, + {{NULL, NULL, NULL}, + {NULL, NULL, NULL}}}, + {&nbrec_table_logical_router, {{&nbrec_table_logical_router, &nbrec_logical_router_col_name, NULL}, {NULL, NULL, NULL}}},