iptables initialization code has a race condition that can lead to some iptables calls failing #1677

@maxvt

Description

The suggested fix for this issue is #1676.

We have a Docker network driver that uses the iptables module. If the driver is restarted, it gets a thundering herd of requests, so multiple goroutines kick off and start, among other things, doing iptables calls. Here's what we see in the log in a few rare cases:

time="2017-03-03T15:35:44Z" level=debug msg="/sbin/iptables, [-t nat -A PREROUTING -i pdnet -p udp --dport 8125 -j REDIRECT --to-port 8125]"
time="2017-03-03T15:35:44Z" level=debug msg="/sbin/iptables, [-t nat -A PREROUTING -i pdnet -p udp --dport 8125 -j REDIRECT --to-port 8125]"
time="2017-03-03T15:35:45Z" level=debug msg="/sbin/iptables, [--wait --version]"
time="2017-03-03T15:35:45Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -i pdnet -p udp --dport 53 -j REDIRECT --to-port 53]"
time="2017-03-03T15:35:45Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -i pdnet -p tcp --dport 53 -j REDIRECT --to-port 53]"

Some of the iptables calls in this sequence eventually fail with errors similar to this:

iptables failed: iptables -t nat -A PREROUTING -i pdnet -p tcp --dport 53 -j REDIRECT --to-port 53: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n (exit status 4))

You can see that the earlier calls (which fail) start without the --wait flag, then there is a version check, and then all subsequent calls add the --wait flag. What sorcery is this?

My guess is that every call has to pass through initCheck(). The first call sets iptablesPath, and some of the calls behind it see iptablesPath as already set, so they bypass most of initCheck() and go straight to invoking iptables. In fact, the rest of initCheck() in the first goroutine has not finished yet (in particular the slow execs that test for the --wait flag and determine the iptables version). The startup value for --wait availability (false) is therefore used for those early calls, and because they run concurrently without --wait, some of them fail.
