Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Interfaces on Linux should check for rtnetlink dump interrupted #52137

Open
PerSundstr8m opened this issue Apr 4, 2022 · 4 comments
Open

net: Interfaces on Linux should check for rtnetlink dump interrupted #52137

PerSundstr8m opened this issue Apr 4, 2022 · 4 comments
Labels
NeedsInvestigation
Milestone

Comments

@PerSundstr8m
Copy link

@PerSundstr8m PerSundstr8m commented Apr 4, 2022

What version of Go are you using (go version)?

go version go1.17.6 linux/amd64

Does this issue reproduce with the latest release?

As far as I can tell Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ $ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/qabpesu/.cache/go-build"
GOENV="/home/qabpesu/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/qabpesu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/qabpesu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/qabpesu/kubernetes/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3928162490=/tmp/go-build -gno-record-gcc-switches"

What did you do?

See #51934 for the background.

I have helped analyze the issue, mainly from a Linux kernel perspective, and I have found two issues:

  1. The 'golang' library should check for the 'NLM_F_DUMP_INTR'.
  2. The linux kernel should not lose the ‘NLM_F_DUMP_INTR’ flag between address families.

We see a 'golang' program (kube-proxy) somethimes receive a truncated list of address when calling "net.InterfaceAddrs()".
The library issues an 'RTM_GETADDR' request to the kernel.

The 'RTM_GETADDR' request ends up in the kernel routine "inet_dump_ifaddr()"
https://elixir.bootlin.com/linux/v4.15.18/source/net/ipv4/devinet.c#L1638

The kernel keeps two 512 position hash tables to store IP addresses.
One hashed by interface name, and one hashed by the interface index.
The one used to retrieve addresses with 'netlink' is the index hash
table. Each hack bucket contains a liked list of interfaces, and
each interface has a linked list of addresses.

The linux kernel will fill as many addresses as will fit in one
SKB buffer (max size 32k) by iterating over the buckets/devices/addresses.
Depending on how many per-address attributes exists, approx 400
addresses are included in one buffer.

When a buffer is full, the position of the bucket/interface/address
iterators are saved until the next buffer is to be filled.
This will only happen after the user mode program has read the
previous one.

To visualize the issue, lets instead assume that each buffer
can only contain 6 addresses.

So, after having filled the buffer with the addresses from
interface 'a', 'b', 'c', and the first three addresses from
interface 'd' the buffer is full, and the we save the indexes
A=4, B=2, C=3 as the iterator bucket/interface/address indexes.

                A=4
                │
                ▼
┌───────────────────────────────────────────────----------───┐
│                 256 hash buckets                           │
└──┬─────────────┬──────────┬─────────────────┬─----------───┘
   │             │          │                 │
┌─▼┐          ┌─▼┐       ┌─▼┐              ┌─▼┐
│a │►┌─┐      │b │►┌─┐   │f │              │g │►┌─┐
└──┘ └─┘      └──┘ └─┘   └──┘              └──┘ └─┘
                 ▼
               ┌──┐
               │c │►┌─┐
               └──┘ └─┘
                 ▼
               ┌──┐
    B=2 ─────► │d │►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐
               └──┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘
                 ▼
               ┌──┐              ▲
               │e │►┌─┐►┌─┐      │
               └──┘ └─┘ └─┘      │

                                 C=3
   Interfaces   Adresses
     ┌──┐        ┌─┐
     │x │        └─┘
     └──┘

Now, interface 'c' gets deleted, so the state changes like the picture
below.
Notice that when we continue, the iterator interface index will now
select interface 'e' instead of 'd'
Furthermore, the iterator address index is now greater than the number
of addresses, so the next available address to store will be the one for
interface 'g'.
The result is that all the remaining addresses from interface 'd' and
the first three (in this case all) addresses of interface 'e' will
not be part of the reported set of addresses.

                A=4
                │
                ▼
┌───────────────────────────────────────────────----------───┐
│                 256 hash buckets                           │
└──┬─────────────┬──────────┬─────────────────┬─----------───┘
   │             │          │                 │
┌─▼┐          ┌─▼┐       ┌─▼┐              ┌─▼┐
│a │►┌─┐      │b │►┌─┐   │f │              │g │►┌─┐
└──┘ └─┘      └──┘ └─┘   └──┘              └──┘ └─┘
                 ▼
               ┌──┐
               │d │►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐
               └──┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘
                 ▼
               ┌──┐              
    B=2 ─────► │e │►┌─┐►┌─┐      
               └──┘ └─┘ └─┘      
                                 ▲
                                 |
   Interfaces   Adresses         |
    ┌──┐        ┌─┐
    │x │        └─┘             C=3
    └──┘

To help the user become aware that the returned data may be inconsistent,
a flag 'NLM_F_DUMP_INTR' is set in the header of the message when something
like this is discovered (routine 'nlm_check_consistent()'). Unfortunately,
the 'golang' routine 'interfaceTable()' that has the possibility to detect
this, does not. (https://go.dev/src/net/interface_linux.go).

Also, if the request is for AF_UNSPEC (as in 'golang'/net/interfaceTable()'),
then the 'NLM_F_DUMP_INTR' flag may get wiped out by the IPv6 addresses
that follow if there are no IPv4 interfaces with addresses after the
interruption. (in our example, if the 'g' interface did not exist).

Another unfortunate thing is that new interfaces are added at the head
of the hash table list, so if there are more than 256 interfaces (one in
each bucket), the issue may happen on every subsequent create/delete cycle.

So, to reproduce the issue:

  1. create an interface 'kube-proxy0' with many (>500) addresses
  2. Run the Python program below:
    # ./list_addresses.py > A
  3. create and delete 511 interfaces
  4. create and keep another interface 'transient0'. The interface index (as shown
    in the first column by 'ip link') masked by 0xff should now be the same
    for both 'kube-proxy0' and 'transient0'.
  5. Run the program again, now like this:
    # (./list_addresses.py >B)& sleep .5; ip link del transient0
  6. Compare 'A' and 'B'
  7. Try the same but change 'AF_INET' to 'AF_UNSPEC', now you may miss
    the '* DUMP INTERESTTED *' message depending on what index your devices
    have and if they have IPv6 addresses only.

So, what is needed to fix the problem:

  1. The 'golang' library should check for the 'NLM_F_DUMP_INTR'.
  2. The linux kernel should not lose the ‘NLM_F_DUMP_INTR’ flag between address families.
  3. A suggestion that the kernel adds new devices a the tail instead at the head.

A short term solution for our issue would be to instead of 2) have
the 'golang' library not use 'AF_UNSPEC'

/Per

--- list_addresses.py ---
#!/usr/bin/env python3

import time
from socket import AF_INET, AF_UNSPEC
from pyroute2 import IPRSocket
from pyroute2.netlink import NLM_F_REQUEST, NLM_F_DUMP, NLMSG_DONE
from pyroute2.iproute import RTM_GETADDR
from pyroute2.netlink.rtnl.ifaddrmsg import ifaddrmsg
NLM_F_DUMP_INTR = 0x10

msg = ifaddrmsg()
msg['header']['type'] = RTM_GETADDR
msg['header']['flags'] = NLM_F_REQUEST | NLM_F_DUMP
msg['family'] = AF_INET
msg.encode()

nlsock=IPRSocket()
nlsock.sendto(msg.data, (0, 0))

done = False
while not done:
    time.sleep(2)
    data=nlsock.recv(1024*1024)
    msg=nlsock.marshal.parse(data)
    print("--------------------------------------")
    for m in msg:
        if m['header']['flags'] & NLM_F_DUMP_INTR:
            print("**************** DUMP INTERRUPTED ******************")
        if m['event'] == 'NLMSG_DONE':
            done = True
            break
        print(f"MSG:({m['event']}(flags:{m['header']['flags']:x})", end="")
        for k,v in m['attrs']:
            if k in ('IFA_LABEL', 'IFA_ADDRESS'):
                print(f" {k:11} {v:16}", end="")
        print("")
@mdlayher mdlayher changed the title net: interfaceTable() net: Interfaces on Linux should check for rtnetlink dump interrupted Apr 4, 2022
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Apr 5, 2022

Thanks for looking into it. Would it be possible to provide a standalone reproducer? Thanks.

@cherrymui cherrymui added the NeedsInvestigation label Apr 5, 2022
@cherrymui cherrymui added this to the Backlog milestone Apr 5, 2022
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Apr 5, 2022

@PerSundstr8m
Copy link
Author

@PerSundstr8m PerSundstr8m commented Apr 6, 2022

This script demostrates the issue.
You need to have 'go' and 'gdb' installed

~# cat reproduce.sh
#!/bin/bash

HERE=$PWD

dev_idx()
{
  ip link show $1 | awk -F: "/$1/"'{print $1}'
}

create_dummy_dev()
{
  ip link add "my-dummy0" type dummy
  dev_idx "my-dummy0"
}

idx2bucket()
{
  echo $(( "$1" & 255 ))
}

create_dummy_dev_in_same_bucket()
{
  target_bucket=$(idx2bucket $(dev_idx $1))
  while (( "$(idx2bucket $(create_dummy_dev))" != "$target_bucket" ))
  do
    printf "."
    ip link del my-dummy0
  done
  printf "\n"
}

add_500_addresses()
{
 for i in 1 2
 do
  for j in {1..250}
  do
   ip addr add 10.10.$i.$j dev "$1"
   printf "."
  done
 done
 printf "\n"
}

echo 'Creating device "my-device0"'
ip link show dev my-device0 >/dev/null 2>&1 && ip link del dev my-device0
ip link add my-device0 type dummy

echo 'Creating device "my-device1" (May be needed to detect interruption)'
ip link show dev my-device1 >/dev/null 2>&1 && ip link del dev my-device1
ip link add my-device1 type dummy
ip addr add 2.2.2.2/24 dev my-device1

echo 'Adding 500 addresses to "my-device0"'
add_500_addresses my-device0

echo 'Creating a dummy device in same bucket as "my-device0"'
create_dummy_dev_in_same_bucket my-device0

echo 'Create and build "go" program'

BUILD_DIR=gotest/src/addr_list
mkdir -p "$BUILD_DIR"
cd "$BUILD_DIR"
cat > main.go << EOF
package main

import (
    "fmt"
    "net"
)

func main() {
        addr, err := net.InterfaceAddrs()
        if err == nil {
                for _, a := range addr {
                        fmt.Println(a)
                }
        }
}
EOF

go build


echo 'Run program deleting dummy interface at the first "recvfrom"'
cat > go.cmd << EOF
 b syscall.recvfrom
 command 1
 !ip link del my-dummy0
 disable 1
 cont
 end
 run
 quit
EOF

gdb -x go.cmd addr_list 1> "$HERE"/bad_result.out 2>/dev/null

echo 'Run program again - normal case (device already deleted)"'
gdb -x go.cmd addr_list 1> "$HERE"/ok_result.out 2>/dev/null

cd "$HERE"
echo 'Result: # of addresses:'
wc -l {bad,ok}_result.out

The result should look something like this:

# ./reproduce.sh
Creating device "my-device0"
Creating device "my-device1" (May be needed to detect interruption)
Adding 500 addresses to "my-device0"
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Creating a dummy device in same bucket as "my-device0"
........................................................................................................................................................................................................................
......................................
Create and build "go" program
Run program deleting dummy interface at the first "recvfrom"
Run program again - normal case (device already deleted)"
Result: # of addresses:
  101 bad_result.out
  564 ok_result.out
  665 total

Then you can compare the result files.

@PerSundstr8m
Copy link
Author

@PerSundstr8m PerSundstr8m commented May 3, 2022

Given that this issue is very unlikely to happen, and that an API change is unwanted, my suggestion is that the library does a small number (one or two) of retries when detecting a 'NLM_F_DUMP_INTR', and only return an error in the extremely unlikely event of the retries also failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation
Projects
None yet
Development

No branches or pull requests

2 participants