
Flush: fix error leak when flushing multiple messages in a lasting connection #239

Merged: 1 commit merged into google:main from fix/flush on Oct 1, 2023

Conversation

@jronak (Contributor) commented Sep 12, 2023

This issue is related to reusing a lasting connection after a failed flush.
You open a lasting connection, buffer two or more messages that will fail, and then flush them; the flush fails as expected. When you then reuse the same connection for an operation that should succeed, Flush still returns an error, even though checking directly with nft shows the operation was applied. It turns out that the errors from the previous batch were not fully drained from the connection and leak into the next operation. See the repro code below:

repro
package main

import (
	"fmt"

	"github.com/google/nftables"
)

func main() {
	conn, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer conn.CloseLasting()

	fmt.Println("Flushing and updating a non-existent set which must fail")
	if err := flushAndUpdateSet(conn); err != nil {
		fmt.Printf("[expected] Failed to flushAndUpdateSet: %v\n", err)
	}

	fmt.Println("Create a table must pass")
	if err := createTable(conn); err != nil {
		fmt.Printf("[unexpected] Failed to createTable: %v\n", err)
	}
}

func createTable(conn *nftables.Conn) error {
	table := &nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "test-table",
	}
	conn.AddTable(table)
	return conn.Flush()
}

func flushAndUpdateSet(conn *nftables.Conn) error {
	set := &nftables.Set{
		Name: "non-existent-set",
		Table: &nftables.Table{
			Name:   "non-existent-table",
			Family: nftables.TableFamilyIPv4,
		},
		KeyType: nftables.TypeInteger,
	}

	conn.FlushSet(set)
	if err := conn.SetAddElements(set, []nftables.SetElement{{Key: []byte{0x01, 0x00, 0x00, 0x00}}}); err != nil {
		return err
	}

	return conn.Flush()
}

Repro code does the following:

  1. Creates a lasting connection.
  2. Flush and Update a non-existent set which must fail.
  3. Create a new table using the same existing lasting connection which must pass.

By running the repro code, you will notice that both operations report failure:

> nft list ruleset
# empty

> ./repro
Flushing and updating a non-existent set which must fail
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
[unexpected] Failed to createTable: conn.Receive: netlink receive: no such file or directory

> nft list ruleset 
table ip test-table {
}

# table creation from the repro actually worked, but the repro still reported an error

So the repro code reports an error for both (2) and (3). When you check the nft ruleset, (3) actually succeeded, yet the repro code returned an error. This happens because the errors from (2) were not fully drained from the connection.
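
To make the leak mechanism concrete, here is a minimal, self-contained sketch (not the library's actual code; the slice below merely stands in for the replies buffered on the lasting netlink socket). Each failed command produces one error reply; returning on the first one leaves the second reply buffered, where the next flush then picks it up:

package main

import (
	"errors"
	"fmt"
)

// replies stands in for the lasting connection's receive buffer:
// one reply per command sent, nil meaning success.
var replies []error

func send(results ...error) { replies = append(replies, results...) }

// flush models the old behaviour for a batch of n commands: read replies
// one by one and return as soon as an error is seen, leaving any remaining
// replies buffered for the next caller.
func flush(n int) error {
	for i := 0; i < n; i++ {
		err := replies[0]
		replies = replies[1:]
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Batch of two commands that both fail (flush set + add elements):
	// the kernel answers with one error reply per command.
	send(errors.New("no such file or directory"), errors.New("no such file or directory"))
	fmt.Println("first Flush:", flush(2)) // expected error; one error reply left buffered

	// Next batch: one command that succeeds.
	send(nil)
	fmt.Println("second Flush:", flush(1)) // reads the stale error instead of its own reply
}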

@stapelberg (Collaborator)

This seems to be a duplicate of #237 (sorry, didn’t get a chance to review that one yet). Can you figure out which fix is the better one? cc @turekt

@turekt (Contributor) commented Sep 23, 2023

Hi @stapelberg and @jronak,

please note that this PR's description demonstrates an issue that is neither caused by nor related to
the buffer overflow mentioned in PR #237, so the two are not related.

First of all, check the output shown after running the repro code:

go run .
# Assuming the table & set do not exist, the output is
failed to flush: conn.Receive: netlink receive: no such file or directory
# In parallel, make the table and set available; you will see an error again, although the nftables set element was added
failed to flush: conn.Receive: netlink receive: no such file or directory

The error is not the same as the one seen in buffer overflow:

panic: conn.Receive: netlink receive: recvmsg: no buffer space available

The error in this PR relates to the set or table not being present in nftables.

I do not see from the description how exactly the table and set were created before the repro code was executed, but my current assumption is that there was an error in the creation of either the table or the set. The repro code works for me if I properly create the table and set:

# go run main.go
failed to flush: conn.Receive: netlink receive: no such file or directory
netlink receive: no such file or directory
^Csignal: interrupt
# nft list ruleset
# nft add table non-existent-table
# nft list ruleset
table ip non-existent-table {
}
# go run main.go
failed to flush: conn.Receive: netlink receive: no such file or directory
netlink receive: no such file or directory
^Csignal: interrupt
# nft add set non-existent-table non-existent-set { type ipv4_addr \; }
# go run main.go
flushed
^Csignal: interrupt

I assume that there is no bug in the library, just that the code using it is not written properly.

I think it is best for @jronak to give more details about the issue he is experiencing.

@jronak (Contributor, Author) commented Oct 1, 2023

Sorry for the late reply. This issue isn't related to the buffer overflow issue from @turekt.

This issue is related to reusing a lasting connection after a failed flush.
You open a lasting connection, buffer two or more messages that will fail, and then flush them; the flush fails as expected. When you then reuse the same connection for an operation that should succeed, Flush still returns an error, even though checking directly with nft shows the operation was applied. It turns out that the errors from the previous batch were not fully drained from the connection and leak into the next operation. The repro code below demonstrates this.

The previous repro code wasn't clear enough, so I created something much simpler:

repro
package main

import (
	"fmt"

	"github.com/google/nftables"
)

func main() {
	conn, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer conn.CloseLasting()

	fmt.Println("Flushing and updating a non-existent set which must fail")
	if err := flushAndUpdateSet(conn); err != nil {
		fmt.Printf("[expected] Failed to flushAndUpdateSet: %v\n", err)
	}

	fmt.Println("Create a table must pass")
	if err := createTable(conn); err != nil {
		fmt.Printf("[unexpected] Failed to createTable: %v\n", err)
	}
}

func createTable(conn *nftables.Conn) error {
	table := &nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "test-table",
	}
	conn.AddTable(table)
	return conn.Flush()
}

func flushAndUpdateSet(conn *nftables.Conn) error {
	set := &nftables.Set{
		Name: "non-existent-set",
		Table: &nftables.Table{
			Name:   "non-existent-table",
			Family: nftables.TableFamilyIPv4,
		},
		KeyType: nftables.TypeInteger,
	}

	conn.FlushSet(set)
	if err := conn.SetAddElements(set, []nftables.SetElement{{Key: []byte{0x01, 0x00, 0x00, 0x00}}}); err != nil {
		return err
	}

	return conn.Flush()
}

Repro code does the following:

  1. Creates a lasting connection.
  2. Flush and Update a non-existent set which must fail.
  3. Create a new table using the same existing lasting connection which must pass.

By running the repro code, you will notice that both operations report failure:

> nft list ruleset
# empty

> ./repro
Flushing and updating a non-existent set which must fail
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
[unexpected] Failed to createTable: conn.Receive: netlink receive: no such file or directory

> nft list ruleset
table ip test-table {
}

So the repro code reports an error for both (2) and (3). When you check the nft ruleset, (3) actually succeeded, yet the repro code returned an error. This happens because the errors from (2) were not fully drained from the connection.

When you flush multiple messages/operations on a connection and the flush fails
to apply, the netlink connection returns an error per command. Since we
return on noticing the first error, the rest of the errors remain
buffered and leak into the result of the next flush.

This pull request invokes `conn.Receive()` once per sent message to drain
any buffered errors from the connection.
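
The shape of the fix can be sketched as follows (a minimal illustration only; the ackReceiver interface is an assumed stand-in for the underlying netlink connection, not the library's real types). The idea is to receive once per sent message so every per-command error is consumed, while still reporting the first failure to the caller:

// ackReceiver is an assumed stand-in for the underlying netlink connection;
// each Receive call consumes the kernel's reply to one previously sent message
// and returns the error carried by that reply, if any.
type ackReceiver interface {
	Receive() error
}

// drainAndReturnFirstError mirrors the idea of this PR: after flushing a batch
// of sentMessages commands, read one reply per command so that no error is
// left buffered on a lasting connection, but still report the first failure.
func drainAndReturnFirstError(nl ackReceiver, sentMessages int) error {
	var first error
	for i := 0; i < sentMessages; i++ {
		if err := nl.Receive(); err != nil && first == nil {
			first = err
		}
	}
	return first
}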

@turekt (Contributor) commented Oct 1, 2023

Hi @jronak,

thanks for the additional reproduction code. Now it all makes sense.

Your repro basically accumulates two netlink messages in flushAndUpdateSet: one for flushing set elements (type 2574) and a second for adding set elements (type 2572).
This can be observed with NLDEBUG (messages of type 2574 and 2572):

Flushing and updating a non-existent set which must fail
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:2511835559 PID:3857} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:68 Type:unknown(2574) Flags:request|acknowledge Sequence:2511835560 PID:3857} Data:[2 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0]}
nl: send msgs: {Header:{Length:96 Type:unknown(2572) Flags:request|acknowledge|0x400 Sequence:2511835561 PID:3857} Data:[2 0 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0 8 0 4 0 0 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 20 0 3 128 16 0 1 128 12 0 1 128 8 0 1 0 1 0 0 0]}
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:2511835562 PID:3857} Data:[0 0 0 10]}
cc.messages: [{{0 unknown(2574) request|acknowledge 0 0} [2 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0]} {{0 unknown(2572) request|acknowledge|0x400 0 0} [2 0 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0 8 0 4 0 0 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 20 0 3 128 16 0 1 128 12 0 1 128 8 0 1 0 1 0 0 0]}]
nl: recv: err: netlink receive: no such file or directory
nl: recv: err: netlink receive: no such file or directory
additional nlconn.Receive returns: netlink receive: no such file or directory
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:2511835563 PID:3857} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:44 Type:unknown(2560) Flags:request|acknowledge|0x400 Sequence:2511835564 PID:3857} Data:[2 0 0 0 15 0 1 0 116 101 115 116 45 116 97 98 108 101 0 0 8 0 2 0 0 0 0 0]}
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:2511835565 PID:3857} Data:[0 0 0 10]}
cc.messages: [{{0 unknown(2560) request|acknowledge|0x400 0 0} [2 0 0 0 15 0 1 0 116 101 115 116 45 116 97 98 108 101 0 0 8 0 2 0 0 0 0 0]}]
nl: recv: {Header:{Length:36 Type:error Flags:0x100 Sequence:2511835564 PID:3857} Data:[0 0 0 0 44 0 0 0 0 10 5 4 172 145 183 149 17 15 0 0]}

Since both messages (2574 and 2572) are sent with one Flush call, netlink will respond with two errors, one for each message. The current code returns on the first error, for 2574, and, since it is a lasting connection, the second error, for 2572, stays inside the buffer and gets picked up by the next Flush call initiated in the createTable function as the response to the message with type 2560. Since your code exits afterwards, the actual success response for the table creation is never picked up, but the table gets created anyway (creation happens before the response is delivered; we just collect the wrong response).

In a non-lasting connection this was not a problem, since the second error for 2572 would drop from the buffer as soon as the netlink request messages were sent and the connection closed, so your PR makes perfect sense.
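
As a side note, such a stale reply is also detectable by its netlink sequence number, which never matches the requests of the current batch (compare the Sequence fields in the NLDEBUG output above). A minimal sketch of that check, using illustrative types rather than the library's real ones:

// reply is an illustrative stand-in for a received netlink message: the
// sequence number of the request it answers plus the error it carries.
type reply struct {
	Sequence uint32
	Err      error
}

// errorsForBatch ignores replies leaked from earlier batches and collects the
// errors whose sequence numbers belong to the current batch [firstSeq, lastSeq].
func errorsForBatch(replies []reply, firstSeq, lastSeq uint32) []error {
	var errs []error
	for _, r := range replies {
		if r.Sequence < firstSeq || r.Sequence > lastSeq {
			continue // stale reply from a previous Flush
		}
		if r.Err != nil {
			errs = append(errs, r.Err)
		}
	}
	return errs
}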

LGTM

@stapelberg merged commit 7879d7e into google:main on Oct 1, 2023
2 checks passed
@stapelberg (Collaborator)

Thanks everyone!

@jronak deleted the fix/flush branch on October 1, 2023 at 21:10