
Flush: fix error leak when flushing multiple messages in a lasting connection #239

Merged: 1 commit merged into google:main from fix/flush on Oct 1, 2023

Conversation

@jronak (Contributor) commented Sep 12, 2023

This issue is related to reusing a lasting connection after a failed flush.
You open a lasting connection, buffer two or more messages that will fail, and then flush them; the flush fails as expected. When you then reuse the same connection for an operation that should succeed, Flush still returns an error, even though checking directly with nft shows the operation was applied. It turns out that the errors from the previous batch were not fully drained from the connection and leak into the next operation. See the repro code below:

repro
package main

import (
	"fmt"

	"github.com/google/nftables"
)

func main() {
	conn, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer conn.CloseLasting()

	fmt.Println("Flushing and updating a non-existent set which must fail")
	if err := flushAndUpdateSet(conn); err != nil {
		fmt.Printf("[expected] Failed to flushAndUpdateSet: %v\n", err)
	}

	fmt.Println("Create a table must pass")
	if err := createTable(conn); err != nil {
		fmt.Printf("[unexpected] Failed to createTable: %v\n", err)
	}
}

func createTable(conn *nftables.Conn) error {
	table := &nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "test-table",
	}
	conn.AddTable(table)
	return conn.Flush()
}

func flushAndUpdateSet(conn *nftables.Conn) error {
	set := &nftables.Set{
		Name: "non-existent-set",
		Table: &nftables.Table{
			Name:   "non-existent-table",
			Family: nftables.TableFamilyIPv4,
		},
		KeyType: nftables.TypeInteger,
	}

	conn.FlushSet(set)
	if err := conn.SetAddElements(set, []nftables.SetElement{{Key: []byte{0x01, 0x00, 0x00, 0x00}}}); err != nil {
		return err
	}

	return conn.Flush()
}

Repro code does the following:

  1. Creates a lasting connection.
  2. Flush and Update a non-existent set which must fail.
  3. Create a new table using the same existing lasting connection which must pass.

By running the repro code, you will notice that both operations report failure:

> nft list ruleset
# empty

> ./repro
Flushing and updating a non-existent set which must fail
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
[unexpected] Failed to createTable: conn.Receive: netlink receive: no such file or directory

> nft list ruleset 
table ip test-table {
}

# table creation from the repro actually worked, but the repro still reported an error

So the repro code reports an error for both (2) and (3). When you check the nft ruleset, (3) actually succeeded, yet the repro code returned an error. This happens because the errors from (2) were not fully drained from the connection.
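
To make the leak mechanism concrete, here is a minimal, self-contained sketch (not the library's actual code; the slice below merely stands in for the replies buffered on the lasting netlink socket). Each failed command produces one error reply; returning on the first one leaves the second reply buffered, where the next flush then picks it up:

package main

import (
	"errors"
	"fmt"
)

// replies stands in for the lasting connection's receive buffer:
// one reply per command sent, nil meaning success.
var replies []error

func send(results ...error) { replies = append(replies, results...) }

// flush models the old behaviour for a batch of n commands: read replies
// one by one and return as soon as an error is seen, leaving any remaining
// replies buffered for the next caller.
func flush(n int) error {
	for i := 0; i < n; i++ {
		err := replies[0]
		replies = replies[1:]
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Batch of two commands that both fail (flush set + add elements):
	// the kernel answers with one error reply per command.
	send(errors.New("no such file or directory"), errors.New("no such file or directory"))
	fmt.Println("first Flush:", flush(2)) // expected error; one error reply left buffered

	// Next batch: one command that succeeds.
	send(nil)
	fmt.Println("second Flush:", flush(1)) // reads the stale error instead of its own reply
}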

@stapelberg (Collaborator)

This seems to be a duplicate of #237 (sorry, didn’t get a chance to review that one yet). Can you figure out which fix is the better one? cc @turekt

@turekt (Contributor) commented Sep 23, 2023

Hi @stapelberg and @jronak,

please note that this PR's description demonstrates an issue that is neither caused by nor related to
the buffer overflow mentioned in PR #237, so the two are not related.

First of all, check the output shown after running the repro code:

go run .
# Assuming the table & set do not exist, the output is
failed to flush: conn.Receive: netlink receive: no such file or directory
# In parallel, make the table and set available; you will see an error again, although the nftables set element was added
failed to flush: conn.Receive: netlink receive: no such file or directory

The error is not the same as the one seen in buffer overflow:

panic: conn.Receive: netlink receive: recvmsg: no buffer space available

The error in this PR relates to the set or table not being present in nftables.

I do not see from the description how exactly the table and set were created before the repro code was executed, but my current assumption is that there was an error in the creation of either the table or the set. The repro code works for me if I properly create the table and set:

# go run main.go
failed to flush: conn.Receive: netlink receive: no such file or directory
netlink receive: no such file or directory
^Csignal: interrupt
# nft list ruleset
# nft add table non-existent-table
# nft list ruleset
table ip non-existent-table {
}
# go run main.go
failed to flush: conn.Receive: netlink receive: no such file or directory
netlink receive: no such file or directory
^Csignal: interrupt
# nft add set non-existent-table non-existent-set { type ipv4_addr \; }
# go run main.go
flushed
^Csignal: interrupt

I assume that there is no bug in the library, just that the code using it is not written properly.

I think it is best for @jronak to give more details about the issue he is experiencing.

@jronak (Contributor, Author) commented Oct 1, 2023

Sorry for the late reply. This issue isn't related to the buffer overflow issue from @turekt.

This issue is related to reusing a lasting connection after a failed flush.
You open a lasting connection, buffer two or more messages that will fail, and then flush them; the flush fails as expected. When you then reuse the same connection for an operation that should succeed, Flush still returns an error, even though checking directly with nft shows the operation was applied. It turns out that the errors from the previous batch were not fully drained from the connection and leak into the next operation. The repro code below demonstrates this.

The previous repro code wasn't clear enough, so I created something much simpler:

repro
package main

import (
	"fmt"

	"github.com/google/nftables"
)

func main() {
	conn, err := nftables.New(nftables.AsLasting())
	if err != nil {
		panic(err)
	}
	defer conn.CloseLasting()

	fmt.Println("Flushing and updating a non-existent set which must fail")
	if err := flushAndUpdateSet(conn); err != nil {
		fmt.Printf("[expected] Failed to flushAndUpdateSet: %v\n", err)
	}

	fmt.Println("Create a table must pass")
	if err := createTable(conn); err != nil {
		fmt.Printf("[unexpected] Failed to createTable: %v\n", err)
	}
}

func createTable(conn *nftables.Conn) error {
	table := &nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "test-table",
	}
	conn.AddTable(table)
	return conn.Flush()
}

func flushAndUpdateSet(conn *nftables.Conn) error {
	set := &nftables.Set{
		Name: "non-existent-set",
		Table: &nftables.Table{
			Name:   "non-existent-table",
			Family: nftables.TableFamilyIPv4,
		},
		KeyType: nftables.TypeInteger,
	}

	conn.FlushSet(set)
	if err := conn.SetAddElements(set, []nftables.SetElement{{Key: []byte{0x01, 0x00, 0x00, 0x00}}}); err != nil {
		return err
	}

	return conn.Flush()
}

Repro code does the following:

  1. Creates a lasting connection.
  2. Flush and Update a non-existent set which must fail.
  3. Create a new table using the same existing lasting connection which must pass.

By running the repro code, you will notice that both operations report failure:

> nft list ruleset
# empty

> ./repro
Flushing and updating a non-existent set which must fail
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
[unexpected] Failed to createTable: conn.Receive: netlink receive: no such file or directory

> nft list ruleset
table ip test-table {
}

So the repro code reports an error for both (2) and (3). When you check the nft ruleset, (3) actually succeeded, yet the repro code returned an error. This happens because the errors from (2) were not fully drained from the connection.

When you flush multiple messages/operations on a connection and the flush fails
to apply, the netlink connection returns an error per command. Since we
return on noticing the first error, the rest of the errors remain
buffered and leak into the result of the next flush.

This pull request invokes `conn.Receive()` once per sent message to drain
any buffered errors from the connection.
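
The shape of the fix can be sketched as follows (a minimal illustration only; the ackReceiver interface is an assumed stand-in for the underlying netlink connection, not the library's real types). The idea is to receive once per sent message so every per-command error is consumed, while still reporting the first failure to the caller:

// ackReceiver is an assumed stand-in for the underlying netlink connection;
// each Receive call consumes the kernel's reply to one previously sent message
// and returns the error carried by that reply, if any.
type ackReceiver interface {
	Receive() error
}

// drainAndReturnFirstError mirrors the idea of this PR: after flushing a batch
// of sentMessages commands, read one reply per command so that no error is
// left buffered on a lasting connection, but still report the first failure.
func drainAndReturnFirstError(nl ackReceiver, sentMessages int) error {
	var first error
	for i := 0; i < sentMessages; i++ {
		if err := nl.Receive(); err != nil && first == nil {
			first = err
		}
	}
	return first
}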

@turekt (Contributor) commented Oct 1, 2023

Hi @jronak,

thanks for the additional reproduction code. Now it all makes sense.

Your repro basically accumulates two netlink messages in flushAndUpdateSet: one for flushing set elements (type 2574) and a second for adding set elements (type 2572).
This can be observed with NLDEBUG (messages of type 2574 and 2572):

Flushing and updating a non-existent set which must fail
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:2511835559 PID:3857} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:68 Type:unknown(2574) Flags:request|acknowledge Sequence:2511835560 PID:3857} Data:[2 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0]}
nl: send msgs: {Header:{Length:96 Type:unknown(2572) Flags:request|acknowledge|0x400 Sequence:2511835561 PID:3857} Data:[2 0 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0 8 0 4 0 0 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 20 0 3 128 16 0 1 128 12 0 1 128 8 0 1 0 1 0 0 0]}
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:2511835562 PID:3857} Data:[0 0 0 10]}
cc.messages: [{{0 unknown(2574) request|acknowledge 0 0} [2 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0]} {{0 unknown(2572) request|acknowledge|0x400 0 0} [2 0 0 0 21 0 2 0 110 111 110 45 101 120 105 115 116 101 110 116 45 115 101 116 0 0 0 0 8 0 4 0 0 0 0 0 23 0 1 0 110 111 110 45 101 120 105 115 116 101 110 116 45 116 97 98 108 101 0 0 20 0 3 128 16 0 1 128 12 0 1 128 8 0 1 0 1 0 0 0]}]
nl: recv: err: netlink receive: no such file or directory
nl: recv: err: netlink receive: no such file or directory
additional nlconn.Receive returns: netlink receive: no such file or directory
[expected] Failed to flushAndUpdateSet: conn.Receive: netlink receive: no such file or directory
Create a table must pass
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:2511835563 PID:3857} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:44 Type:unknown(2560) Flags:request|acknowledge|0x400 Sequence:2511835564 PID:3857} Data:[2 0 0 0 15 0 1 0 116 101 115 116 45 116 97 98 108 101 0 0 8 0 2 0 0 0 0 0]}
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:2511835565 PID:3857} Data:[0 0 0 10]}
cc.messages: [{{0 unknown(2560) request|acknowledge|0x400 0 0} [2 0 0 0 15 0 1 0 116 101 115 116 45 116 97 98 108 101 0 0 8 0 2 0 0 0 0 0]}]
nl: recv: {Header:{Length:36 Type:error Flags:0x100 Sequence:2511835564 PID:3857} Data:[0 0 0 0 44 0 0 0 0 10 5 4 172 145 183 149 17 15 0 0]}

Since both messages (2574 and 2572) are sent with one Flush call, netlink will respond with two errors, one for each message. The current code returns on the first error, for 2574, and, since it is a lasting connection, the second error, for 2572, stays inside the buffer and gets picked up by the next Flush call initiated in the createTable function as the response to the message with type 2560. Since your code exits afterwards, the actual success response for the table creation is never picked up, but the table gets created anyway (creation happens before the response is delivered; we just collect the wrong response).

In a non-lasting connection this was not a problem, since the second error for 2572 would drop from the buffer as soon as the netlink request messages were sent and the connection closed, so your PR makes perfect sense.
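
As a side note, such a stale reply is also detectable by its netlink sequence number, which never matches the requests of the current batch (compare the Sequence fields in the NLDEBUG output above). A minimal sketch of that check, using illustrative types rather than the library's real ones:

// reply is an illustrative stand-in for a received netlink message: the
// sequence number of the request it answers plus the error it carries.
type reply struct {
	Sequence uint32
	Err      error
}

// errorsForBatch ignores replies leaked from earlier batches and collects the
// errors whose sequence numbers belong to the current batch [firstSeq, lastSeq].
func errorsForBatch(replies []reply, firstSeq, lastSeq uint32) []error {
	var errs []error
	for _, r := range replies {
		if r.Sequence < firstSeq || r.Sequence > lastSeq {
			continue // stale reply from a previous Flush
		}
		if r.Err != nil {
			errs = append(errs, r.Err)
		}
	}
	return errs
}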

LGTM

@stapelberg merged commit 7879d7e into google:main on Oct 1, 2023
2 checks passed
@stapelberg (Collaborator)

Thanks everyone!

@jronak deleted the fix/flush branch on October 1, 2023 at 21:10