-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Description
What version of Go are you using (go version)?
go version go1.17.6 linux/amd64
Does this issue reproduce with the latest release?
As far as I can tell Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ $ go env GO111MODULE="" GOARCH="amd64" GOBIN="" GOCACHE="/home/qabpesu/.cache/go-build" GOENV="/home/qabpesu/.config/go/env" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOINSECURE="" GOMODCACHE="/home/qabpesu/go/pkg/mod" GONOPROXY="" GONOSUMDB="" GOOS="linux" GOPATH="/home/qabpesu/go" GOPRIVATE="" GOPROXY="https://proxy.golang.org,direct" GOROOT="/usr/lib/golang" GOSUMDB="sum.golang.org" GOTMPDIR="" GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64" GOVCS="" GOVERSION="go1.17.6" GCCGO="gccgo" AR="ar" CC="gcc" CXX="g++" CGO_ENABLED="1" GOMOD="/home/qabpesu/kubernetes/go.mod" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3928162490=/tmp/go-build -gno-record-gcc-switches"
What did you do?
See #51934 for the background.
I have helped analyze the issue, mainly from a Linux kernel perspective, and I have found two issues:
- The 'golang' library should check for the 'NLM_F_DUMP_INTR'.
- The linux kernel should not lose the ‘NLM_F_DUMP_INTR’ flag between address families.
We see a 'golang' program (kube-proxy) somethimes receive a truncated list of address when calling "net.InterfaceAddrs()".
The library issues an 'RTM_GETADDR' request to the kernel.
The 'RTM_GETADDR' request ends up in the kernel routine "inet_dump_ifaddr()"
https://elixir.bootlin.com/linux/v4.15.18/source/net/ipv4/devinet.c#L1638
The kernel keeps two 512 position hash tables to store IP addresses.
One hashed by interface name, and one hashed by the interface index.
The one used to retrieve addresses with 'netlink' is the index hash
table. Each hack bucket contains a liked list of interfaces, and
each interface has a linked list of addresses.
The linux kernel will fill as many addresses as will fit in one
SKB buffer (max size 32k) by iterating over the buckets/devices/addresses.
Depending on how many per-address attributes exists, approx 400
addresses are included in one buffer.
When a buffer is full, the position of the bucket/interface/address
iterators are saved until the next buffer is to be filled.
This will only happen after the user mode program has read the
previous one.
To visualize the issue, lets instead assume that each buffer
can only contain 6 addresses.
So, after having filled the buffer with the addresses from
interface 'a', 'b', 'c', and the first three addresses from
interface 'd' the buffer is full, and the we save the indexes
A=4, B=2, C=3 as the iterator bucket/interface/address indexes.
A=4
│
▼
┌───────────────────────────────────────────────----------───┐
│ 256 hash buckets │
└──┬─────────────┬──────────┬─────────────────┬─----------───┘
│ │ │ │
┌─▼┐ ┌─▼┐ ┌─▼┐ ┌─▼┐
│a │►┌─┐ │b │►┌─┐ │f │ │g │►┌─┐
└──┘ └─┘ └──┘ └─┘ └──┘ └──┘ └─┘
▼
┌──┐
│c │►┌─┐
└──┘ └─┘
▼
┌──┐
B=2 ─────► │d │►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐
└──┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘
▼
┌──┐ ▲
│e │►┌─┐►┌─┐ │
└──┘ └─┘ └─┘ │
C=3
Interfaces Adresses
┌──┐ ┌─┐
│x │ └─┘
└──┘
Now, interface 'c' gets deleted, so the state changes like the picture
below.
Notice that when we continue, the iterator interface index will now
select interface 'e' instead of 'd'
Furthermore, the iterator address index is now greater than the number
of addresses, so the next available address to store will be the one for
interface 'g'.
The result is that all the remaining addresses from interface 'd' and
the first three (in this case all) addresses of interface 'e' will
not be part of the reported set of addresses.
A=4
│
▼
┌───────────────────────────────────────────────----------───┐
│ 256 hash buckets │
└──┬─────────────┬──────────┬─────────────────┬─----------───┘
│ │ │ │
┌─▼┐ ┌─▼┐ ┌─▼┐ ┌─▼┐
│a │►┌─┐ │b │►┌─┐ │f │ │g │►┌─┐
└──┘ └─┘ └──┘ └─┘ └──┘ └──┘ └─┘
▼
┌──┐
│d │►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐►┌─┐
└──┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘
▼
┌──┐
B=2 ─────► │e │►┌─┐►┌─┐
└──┘ └─┘ └─┘
▲
|
Interfaces Adresses |
┌──┐ ┌─┐
│x │ └─┘ C=3
└──┘
To help the user become aware that the returned data may be inconsistent,
a flag 'NLM_F_DUMP_INTR' is set in the header of the message when something
like this is discovered (routine 'nlm_check_consistent()'). Unfortunately,
the 'golang' routine 'interfaceTable()' that has the possibility to detect
this, does not. (https://go.dev/src/net/interface_linux.go).
Also, if the request is for AF_UNSPEC (as in 'golang'/net/interfaceTable()'),
then the 'NLM_F_DUMP_INTR' flag may get wiped out by the IPv6 addresses
that follow if there are no IPv4 interfaces with addresses after the
interruption. (in our example, if the 'g' interface did not exist).
Another unfortunate thing is that new interfaces are added at the head
of the hash table list, so if there are more than 256 interfaces (one in
each bucket), the issue may happen on every subsequent create/delete cycle.
So, to reproduce the issue:
- create an interface 'kube-proxy0' with many (>500) addresses
- Run the Python program below:
# ./list_addresses.py > A - create and delete 511 interfaces
- create and keep another interface 'transient0'. The interface index (as shown
in the first column by 'ip link') masked by 0xff should now be the same
for both 'kube-proxy0' and 'transient0'. - Run the program again, now like this:
# (./list_addresses.py >B)& sleep .5; ip link del transient0 - Compare 'A' and 'B'
- Try the same but change 'AF_INET' to 'AF_UNSPEC', now you may miss
the '* DUMP INTERESTTED *' message depending on what index your devices
have and if they have IPv6 addresses only.
So, what is needed to fix the problem:
- The 'golang' library should check for the 'NLM_F_DUMP_INTR'.
- The linux kernel should not lose the ‘NLM_F_DUMP_INTR’ flag between address families.
- A suggestion that the kernel adds new devices a the tail instead at the head.
A short term solution for our issue would be to instead of 2) have
the 'golang' library not use 'AF_UNSPEC'
/Per
--- list_addresses.py ---
#!/usr/bin/env python3
import time
from socket import AF_INET, AF_UNSPEC
from pyroute2 import IPRSocket
from pyroute2.netlink import NLM_F_REQUEST, NLM_F_DUMP, NLMSG_DONE
from pyroute2.iproute import RTM_GETADDR
from pyroute2.netlink.rtnl.ifaddrmsg import ifaddrmsg
NLM_F_DUMP_INTR = 0x10
msg = ifaddrmsg()
msg['header']['type'] = RTM_GETADDR
msg['header']['flags'] = NLM_F_REQUEST | NLM_F_DUMP
msg['family'] = AF_INET
msg.encode()
nlsock=IPRSocket()
nlsock.sendto(msg.data, (0, 0))
done = False
while not done:
time.sleep(2)
data=nlsock.recv(1024*1024)
msg=nlsock.marshal.parse(data)
print("--------------------------------------")
for m in msg:
if m['header']['flags'] & NLM_F_DUMP_INTR:
print("**************** DUMP INTERRUPTED ******************")
if m['event'] == 'NLMSG_DONE':
done = True
break
print(f"MSG:({m['event']}(flags:{m['header']['flags']:x})", end="")
for k,v in m['attrs']:
if k in ('IFA_LABEL', 'IFA_ADDRESS'):
print(f" {k:11} {v:16}", end="")
print("")