Unable to retrieve user's IP address in docker swarm mode #25526

Open
PanJ opened this issue Aug 9, 2016 · 331 comments

Comments

@PanJ commented Aug 9, 2016

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:00:36 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:00:36 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 155
 Running: 65
 Paused: 0
 Stopped: 90
Images: 57
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 868
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host overlay null bridge
Swarm: active
 NodeID: 0ddz27v59pwh2g5rr1k32d9bv
 Is Manager: true
 ClusterID: 32c5sn0lgxoq9gsl1er0aucsr
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot interval: 10000
  Heartbeat tick: 1
  Election tick: 3
 Dispatcher:
  Heartbeat period: 5 seconds
 CA configuration:
  Expiry duration: 3 months
 Node Address: 172.31.24.209
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-92-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.42 GiB
Name: ip-172-31-24-209
ID: 4LDN:RTAI:5KG5:KHR2:RD4D:MV5P:DEXQ:G5RE:AZBQ:OPQJ:N4DK:WCQQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: panj
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Steps to reproduce the issue:

  1. Run the following service, which publishes port 80:
docker service create \
--name debugging-simple-server \
--publish 80:3000 \
panj/debugging-simple-server
  2. Try connecting with http://<public-ip>/.

Describe the results you received:
Neither ip nor header.x-forwarded-for is the correct user's IP address.

Describe the results you expected:
ip or header.x-forwarded-for should be the user's IP address. The expected result can be achieved using a standalone docker container: docker run -d -p 80:3000 panj/debugging-simple-server. You can see both results via the following links:
http://swarm.issue-25526.docker.takemetour.com:81/
http://container.issue-25526.docker.takemetour.com:82/

Additional information you deem important (e.g. issue happens only occasionally):
This happens on both global mode and replicated mode.

I am not sure if I missed anything that should solve this issue easily.

In the meantime, I think I have to use a workaround: running a proxy container outside of swarm mode and letting it forward to the published port in swarm mode (SSL termination would have to be done on this container too), which defeats the purpose of swarm mode's self-healing and orchestration.
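
For illustration, the kind of standalone edge proxy being described might be launched like this (the image, container name, config path and ports here are assumptions for the sketch, not taken from this issue):

# Standalone proxy container run outside swarm mode; it sees the client's real IP
# and forwards requests to the port published by the swarm service (see the nginx
# config further down this thread).
docker run -d --name edge-proxy \
  -p 80:80 -p 443:443 \
  -v /srv/edge-proxy/nginx.conf:/etc/nginx/nginx.conf:ro \
  nginx:alpine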

@thaJeztah (Member) commented Aug 9, 2016

/cc @aluzzardi @mrjana ptal

@mavenugo (Contributor) commented Aug 9, 2016

@PanJ can you please share some details on how debugging-simple-server determines the ip? Also, what is the expectation if a service is scaled to more than 1 replica across multiple hosts (or global mode)?

@PanJ (Author) commented Aug 9, 2016

@mavenugo it's Koa's request object, which uses Node's remoteAddress from the net module. The result should be the same for any other library that can retrieve the remote address.

The expectation is that the ip field should always be the remote address, regardless of any configuration.

@marech commented Sep 19, 2016

@PanJ are you still using your workaround, or have you found a better solution?

@sanimej commented Sep 19, 2016

@PanJ When I run your app as a standalone container..

docker run -it --rm -p 80:3000 --name test panj/debugging-simple-server

and access the published port from another host I get this

vagrant@net-1:~$ curl 192.168.33.12
{"method":"GET","url":"/","header":{"user-agent":"curl/7.38.0","host":"192.168.33.12","accept":"*/*"},"ip":"::ffff:192.168.33.11","ips":[]}
vagrant@net-1:~$

192.168.33.11 is the IP of the host from which I am running curl. Is this the expected behavior?

@PanJ (Author) commented Sep 19, 2016

@sanimej Yes, that is the expected behavior, and it should be the case in swarm mode as well.

@PanJ (Author) commented Sep 19, 2016

@marech I am still using the standalone container as a workaround, which works fine.

In my case, there are 2 nginx instances: a standalone instance and a swarm instance. SSL termination and reverse proxying are done on the standalone nginx. The swarm instance is used to route to other services based on the request host.

@sanimej commented Sep 19, 2016

@PanJ The way the published port of a container is accessed is different in swarm mode. In swarm mode a service can be reached from any node in the cluster. To facilitate this we route through an ingress network; 10.255.0.x is the address of the ingress network interface on the host in the cluster from which you try to reach the published port.
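
For anyone who wants to see this on their own host, the ingress subnet and the NAT rules being described can be inspected roughly as follows (the netns path matches the one used by the workarounds later in this thread; it may vary by setup):

# Show the ingress network's subnet
docker network inspect ingress --format '{{(index .IPAM.Config 0).Subnet}}'

# Show the NAT rules applied inside the ingress namespace (run as root)
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -S POSTROUTING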

@PanJ (Author) commented Sep 19, 2016

@sanimej I kind of saw how it works when I dug into the issue. But the use case (the ability to retrieve the user's IP) is quite common.

I have limited knowledge of how the fix should be implemented. Maybe a special type of network that does not alter the source IP address?

Rancher is similar to Docker swarm mode and it seems to have the expected behavior. Maybe it is a good place to start.

@marech commented Sep 20, 2016

@sanimej a good idea could be to add all IPs to the X-Forwarded-For header, if that's possible; then we could see the whole chain.

@PanJ hmm, and how does your standalone nginx container communicate with the swarm instance, via service name or IP? Maybe you can share the part of the nginx config where you pass traffic to the swarm instance.

@PanJ (Author) commented Sep 20, 2016

@marech the standalone container listens on port 80 and then proxies to localhost:8181:

server {
  listen 80 default_server;
  location / {
    proxy_set_header        Host $host;
    proxy_set_header        X-Real-IP $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Forwarded-Proto $scheme;
    proxy_pass          http://localhost:8181;
    proxy_read_timeout  90;
  }
}

If you have to do SSL termination, add another server block that listens on port 443, does the SSL termination, and proxies to localhost:8181 as well.

Swarm mode's nginx publishes 8181:80 and routes to another service based on the request host:

server {
  listen 80;
  server_name your.domain.com;
  location / {
    proxy_pass          http://your-service:80;
    proxy_set_header Host $host;
    proxy_read_timeout  90;
  }
}

server {
  listen 80;
  server_name another.domain.com;
  location / {
    proxy_pass          http://another-service:80;
    proxy_set_header Host $host;
    proxy_read_timeout  90;
  }
}

@o3o3o commented Oct 24, 2016

In our case, our API rate limiting and other functions depend on the user's IP address. Is there any way to work around this problem in swarm mode?

@darrellenns commented Nov 1, 2016

I've also run into the issue when trying to run logstash in swarm mode (for collecting syslog messages from various hosts). The logstash "host" field always appears as 10.255.0.x, instead of the actual IP of the connecting host. This makes it totally unusable, as you can't tell which host the log messages are coming from. Is there some way we can avoid translating the source IP?

@vfarcic commented Nov 2, 2016

+1 for a solution for this issue.

The inability to retrieve the user's IP prevents us from using monitoring solutions like Prometheus.

@darrellenns commented Nov 2, 2016

Perhaps the linux kernel IPVS capabilities would be of some use here. I'm guessing that the IP change is taking place because the connections are being proxied in user space. IPVS, on the other hand, can redirect and load balance requests in kernel space without changing the source IP address. IPVS could also be good down the road for building in more advanced functionality, such as different load balancing algorithms, floating IP addresses, and direct routing.
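
As a side note, the swarm ingress already programs IPVS inside the ingress namespace; assuming ipvsadm is installed and the usual netns path, the table can be inspected with something like:

# List the IPVS virtual services and real servers programmed for the ingress network
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L -n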

@vfarcic commented Nov 2, 2016

For me, it would be enough if I could somehow find out the relation between the virtual IP and the IP of the server the endpoint belongs to. That way, when Prometheus sends an alert related to some virtual IP, I could find out which server is affected. It would not be a good solution, but it would be better than nothing.

@darrellenns commented Nov 2, 2016

@vfarcic I don't think that's possible with the way it works now. All client connections come from the same IP, so you can't translate it back. The only way that would work is if whatever is doing the proxy/nat of the connections saved a connection log with timestamp, source ip, and source port. Even then, it wouldn't be much help in most use cases where the source IP is needed.

@vfarcic commented Nov 2, 2016

I probably did not explain the use case well.

I use Prometheus, configured to scrape exporters that are running as Swarm global services. It uses tasks.<SERVICE_NAME> to get the IPs of all replicas. So it's not using the service but the replica endpoints (no load balancing). What I need is to somehow figure out the IP of the node each of those replica IPs comes from.
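
For context, tasks.<SERVICE_NAME> is the swarm DNS name that resolves to the individual task IPs rather than the service VIP; from inside any container attached to the same overlay network it can be checked with something like the following (service name illustrative, and assuming a DNS lookup tool is available in the image):

# Resolve all replica IPs of a service, bypassing the VIP
nslookup tasks.node-exporter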

@vfarcic commented Nov 3, 2016

I just realized that "docker network inspect <NETWORK_NAME>" provides information about containers and IPv4 addresses for a single node only. Can this be extended so that there is cluster-wide information about a network, together with the nodes?

Something like:

       "Containers": {
            "57bc4f3d826d4955deb32c3b71550473e55139a86bef7d5e584786a3a5fa6f37": {
                "Name": "cadvisor.0.8d1s6qb63xdir22xyhrcjhgsa",
                "EndpointID": "084a032fcd404ae1b51f33f07ffb2df9c1f9ec18276d2f414c2b453fc8e85576",
                "MacAddress": "02:42:0a:00:00:1e",
                "IPv4Address": "10.0.0.30/24",
                "IPv6Address": "",
                "Node": "swarm-4"
            },
...

Note the addition of the "Node".

If such information were available for the whole cluster, not only a single node, together with a --filter argument, I'd have everything I need to figure out the relation between a container's IPv4 address and the node. It would not be a great solution, but still better than nothing. Right now, when Prometheus detects a problem, I need to execute "docker network inspect" on each node until I find the location of the address.
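
A rough sketch of that per-node lookup (the network name is illustrative; the target address is the one from the example above):

# On each node, list container names and their IPv4 addresses on the overlay network,
# then grep for the address Prometheus reported
docker network inspect my-overlay \
  --format '{{range .Containers}}{{.IPv4Address}}  {{.Name}}{{"\n"}}{{end}}' | grep '10.0.0.30'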

@tlvenn commented Nov 3, 2016

I agree with @dack: given that the ingress network is using IPVS, we should solve this issue using IPVS, so that the source IP is preserved and presented to the service correctly and transparently.

The solution needs to work at the IP level, so that any service that is not based on HTTP can still work properly as well (we can't rely on HTTP headers...).

And I can't stress enough how important this is; without it, there are many services that simply can't operate at all in swarm mode.

@tlvenn commented Nov 3, 2016

@kobolog might be able to shed some light on this matter given his talk on IPVS at DockerCon.

@Damidara16 commented Jul 6, 2020

2020 and still not fixed, what a drag. It seems like a very important feature.

@sebastianfelipe commented Jul 7, 2020

This is very much needed. Using host mode is just a patch; sometimes it is necessary to run NGINX behind the network (depending on the use case and the setup). Please fix this.

@Damidara16 commented Jul 7, 2020

I think a workaround for this, so that a docker swarm can run without setting host mode, is to get the IP on the client side, e.g. using JS for web and mobile clients, and only accept requests from trusted sources: JS gets the IP, and the backend only accepts IPs that come with a user token, etc. The IP can be set in a header and encrypted through HTTPS. However, I don't know about the performance impact.

@sebastianfelipe commented Jul 7, 2020

@Damidara16 that's exactly what we don't want to do. It is really insecure to do that; it can be bypassed at will.

@aduzsardi commented Jul 25, 2020

Too bad this is still an open issue, sadly... it doesn't look like it's going to be fixed soon.

@vipcxj commented Jul 28, 2020

Too bad this is still an open issue, sadly... it doesn't look like it's going to be fixed soon.

I think it will be closed by the bot soon. Since GitHub launched this feature, many bugs can be ignored.

@vicary commented Jul 28, 2020

Too bad this is still an open issue, sadly... it doesn't look like it's going to be fixed soon.

I think it will be closed by the bot soon. Since GitHub launched this feature, many bugs can be ignored.

This is the best feature for enterprises' bloated teams to gain control of the community.

@ni-ajardan commented Jul 28, 2020

There is very little chance this is ever going to be fixed. AFAIK everyone considers that k8s won the "race" and swarm is not needed, but I would say both can co-exist and be used properly depending on the needs and skills of the team using them. RIP swarm :)

@visualex commented Jul 28, 2020

I use a managed HAIP, but you could use something else in front of the swarm, a standalone nginx load balancer that points to the IPs of your swarm.
https://docs.nginx.com/nginx/admin-guide/load-balancer/http-load-balancer/

In your swarm, the reverse proxy needs this:

server {
        listen 443 ssl proxy_protocol;
        location / {
        proxy_set_header   X-Real-IP $proxy_protocol_addr;  # this is the real IP address 

If you are running a swarm, you will need a load balancer to round-robin the requests to your swarm (or sticky, etc).

So far, this architectural decision may seem like a "missing piece"; however, it adds flexibility by providing options and removing the need to disable inbuilt functionality in order to replace it with something more suitable to the application's needs.

@struanb commented Sep 6, 2020

I believe I may have found a workaround for this issue, with the current limitation that service container replicas must all be deployed to a single node, for example with --constraint-add='node.hostname==mynode', or with a set of swarms each consisting of a single node.

The problem

The underlying problem is caused by the SNAT rule in the iptables nat table in the ingress_sbox namespace, which causes all incoming requests to be seen by containers to have the node's IP address in the ingress network (e.g. 10.0.0.2, 10.0.0.3, ..., in the default ingress network configuration), e.g.:

iptables -t nat -A POSTROUTING -d 10.0.0.0/24 -m ipvs --ipvs -j SNAT --to-source 10.0.0.2

However, removing this SNAT rule means that while containers still receive incoming packets - now originating from the original source IP - outgoing packets sent back to the original source IP are sent via the container's default gateway, which is not on the same ingress network but on the docker_gwbridge network (e.g. 172.31.0.1), and those packets are then lost.

The workaround

So the workaround comprises: 1. removing (in fact, inhibiting) this SNAT rule in the ingress_sbox namespace; 2. creating a policy routing rule for swarm service containers that forces outgoing packets back via the node's ingress network IP address (e.g. 10.0.0.2); and 3. automating the addition of the policy routing rules, so that every new service container has them promptly installed upon creation.

  1. To inhibit the SNAT rule, we create a rule earlier in the table that prevents the usual SNAT being reached:
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT

(We do it this way, rather than just deleting the existing SNAT rule, as docker seems to recreate the SNAT rule several times during the course of creating a service. This approach just supersedes that rule, which makes it more resilient).

  2. To create the container policy routing rule, determine the container's PID and apply the rule inside its network namespace:
NID=$(docker inspect -f '{{.State.Pid}}' <container-id>)
nsenter -n -t $NID bash -c "ip route add table 1 default via 10.0.0.2 && ip rule add from 10.0.0.0/24 lookup 1 priority 32761"
  3. Finally, putting the above together with docker events, we automate the process of modifying the SNAT rules, watching for newly started containers, and adding the policy routing rules, via this ingress-routing-daemon script:
#!/bin/bash

# Ingress Routing Daemon
# Copyright © 2020 Struan Bartlett
# --------------------------------------------------------------------
# Permission is hereby granted, free of charge, to any person 
# obtaining a copy of this software and associated documentation files 
# (the "Software"), to deal in the Software without restriction, 
# including without limitation the rights to use, copy, modify, merge, 
# publish, distribute, sublicense, and/or sell copies of the Software, 
# and to permit persons to whom the Software is furnished to do so, 
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be 
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS 
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
# SOFTWARE.
# --------------------------------------------------------------------
# Workaround for https://github.com/moby/moby/issues/25526

echo "Ingress Routing Daemon starting ..."

read INGRESS_SUBNET INGRESS_DEFAULT_GATEWAY \
  < <(docker inspect ingress --format '{{(index .IPAM.Config 0).Subnet}} {{index (split (index .Containers "ingress-sbox").IPv4Address "/") 0}}')

echo INGRESS_SUBNET=$INGRESS_SUBNET
echo INGRESS_DEFAULT_GATEWAY=$INGRESS_DEFAULT_GATEWAY

# Add a rule ahead of the ingress network SNAT rule, that will cause the SNAT rule to be skipped.
echo "Adding ingress_sbox iptables nat rule: iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT"
while nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -D POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT; do true; done 2>/dev/null
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT

# Watch for container start events, and configure policy routing rules on each container
# to ensure return path traffic from incoming connections is routed back via the correct interface.
docker events \
  --format '{{.ID}} {{index .Actor.Attributes "com.docker.swarm.service.name"}}' \
  --filter 'event=start' \
  --filter 'type=container' | \
  while read ID SERVICE
  do
    if [ -n "$SERVICE" ]; then
    
      NID=$(docker inspect -f '{{.State.Pid}}' $ID)
      echo "Container ID=$ID, NID=$NID, SERVICE=$SERVICE started: applying policy route."
      nsenter -n -t $NID bash -c "ip route add table 1 default via $INGRESS_DEFAULT_GATEWAY && ip rule add from $INGRESS_SUBNET lookup 1 priority 32761"
    fi
  done

Now, when requests arrive at the published ports for the single node, its containers will see the original IP address of the machine making the request.

Usage

Run the above ingress-routing-daemon as root on each and every one of your swarm nodes before creating your service. (If your service is already created, then ensure you scale it to 0 before scaling it back to a positive number of replicas.) The daemon will initialise iptables, detect when docker creates new containers, and apply new routing rules to each new container.
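
For example (service name illustrative):

# Restart all replicas so the daemon can apply the policy routes to fresh containers
docker service scale myservice=0
docker service scale myservice=3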

Testing, use-cases and limitations

The above has been tested using multiple replicas constrained to a single node, on a service running in a multi-node swarm.

It has also been tested using multiple nodes, each with a separate per-node service constrained to that node, but this comes with the limitation that different published ports must be used for each per-node service. Still, that might work for some use-cases.

The method should also work using multiple nodes, if each were configured as a single node in its own swarm. This carries the limitation that the docker swarms can no longer be used to distribute containers across nodes; however, there could still be other administrative benefits of using docker services, such as container replica and lifecycle management.

Improving the workaround to address further use-cases

With further development, this method should be capable of scaling to multiple nodes without the need for separate per-node services or splitting the swarm. I can think of two possible approaches: 1. arranging for Docker, or a bespoke daemon, to remove all non-local IPs from each node's ipvsadm table; 2. extending the policy routing rules to accommodate routing outgoing packets back to the correct node.

For 1, we could poll ipvsadm -S -n to look for new IPs added to any service, check whether each is local, and remove any that aren't. This would allow each node to function as a load balancer for its own containers within the overall service, but requests reaching one node could no longer be forwarded to another. This would certainly satisfy my own use-case, where we have our own IPVS load balancer sitting in front of a set of servers, each running a web application, which we would like to replace with several load-balanced containerised instances of the same application, so we can roll out updates without losing a whole server.
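
An untested sketch of what approach 1 might look like, assuming ipvsadm's save-format output ("-a <-t|-u|-f> <service> -r <ip:port> ...") and the usual ingress_sbox netns path; treat it as a starting point only:

#!/bin/bash
# Sketch: periodically remove IPVS real servers that do not correspond to containers
# running on this node, so each node only load-balances to its own replicas.
NSENTER="nsenter --net=/var/run/docker/netns/ingress_sbox"

while true; do
  # IPs of containers currently running on this node, across all their networks
  LOCAL_IPS=$(docker ps -q | xargs -r docker inspect \
    --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{"\n"}}{{end}}')

  $NSENTER ipvsadm -S -n | awk '$1 == "-a" { print $2, $3, $5 }' |
    while read -r SVC_FLAG SVC DEST; do
      DEST_IP=${DEST%:*}
      if ! printf '%s\n' "$LOCAL_IPS" | grep -qx "$DEST_IP"; then
        echo "Removing non-local real server $DEST from service $SVC_FLAG $SVC"
        $NSENTER ipvsadm -d "$SVC_FLAG" "$SVC" -r "$DEST"
      fi
    done

  sleep 5
done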

For 2, we could use iptables to assign a per-node TOS value in each node's ingress_sbox iptables (for example, the final byte of the node's ingress network IP); then, in the container, arrange to map the TOS value to a connection mark, and then map the connection mark to a firewall mark for outgoing packets, and for each firewall mark select a different routing table that routes the packets back to the originating node. The rules for this will be a bit clunky, but I imagine they should scale fine to 2-16 nodes.

I hope the above comes in useful. I will also have a go at (2), and if I make progress will post a further update.

@struanb commented Sep 7, 2020

Below is an improved version of the ingress routing daemon, ingress-routing-daemon-v2, which extends the policy routing rule model to allow each container to route its output packets back to the correct node, without the need for SNAT.

The improved model

In addition to inhibiting the SNAT rule as per the previous model, the new model requires an iptables rule in the ingress_sbox namespace on each node you intend to use as an IPVS load-balancer endpoint (so normally your manager nodes, or a subset of those manager nodes), that assigns a per-node TOS value to all packets destined for any node in the ingress network. (We use the final byte of the node's ingress network IP.)

As the TOS value is stored within the packet, it can be read by the destination node to which the incoming request (and hence the packet) has been directed.

Then in the container on the destination node, we arrange to map the TOS value on any incoming packets to a connection mark, using the same value.

Now, since outgoing packets on the same connection will have the same connection mark, we map the connection mark on any outgoing packets to a firewall mark, again using the same value.

Finally, a set of policy routing rules selects a different routing table, designed to route the outgoing packets back to the required load-balancer endpoint node, according to the firewall mark value.

Now, when client requests arrive at the published ports of any node in the swarm, the container (whether on the same or another node) to which the request is directed will see the original IP address of the client making the request, and will be able to route the response back to the originating load-balancer node, which will, in turn, be able to route the response back to the client.

Usage

Setting up

Generate a value for INGRESS_NODE_GATEWAY_IPS specific to your swarm, by running ingress-routing-daemon-v2 as root on every one of your swarm's nodes that you'd like to use as a load-balancer endpoint (normally only your manager nodes, or a subset of your manager nodes), noting the values shown for INGRESS_DEFAULT_GATEWAY. You only have to do this once, or whenever you add or remove nodes. Your INGRESS_NODE_GATEWAY_IPS should look like 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 (according to the subnet defined for the ingress network, and the number of nodes).
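
The value the daemon prints for INGRESS_DEFAULT_GATEWAY can also be read directly, using the same query the daemon itself runs:

# Prints this node's IP on the ingress network (its INGRESS_DEFAULT_GATEWAY value)
docker inspect ingress --format '{{index (split (index .Containers "ingress-sbox").IPv4Address "/") 0}}'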

Running the daemon

Run INGRESS_NODE_GATEWAY_IPS="<Node Ingress IP List>" ingress-routing-daemon-v2 --install as root on each and every one of your swarm's nodes (managers and workers) before creating your service. (If your service is already created, then ensure you scale it to 0 before scaling it back to a positive number of replicas.) The daemon will initialise iptables, detect when docker creates new containers, and apply new routing rules to each new container.

If you need to restrict the daemon’s activities to a particular service, then modify [ -n "$SERVICE" ] to [ "$SERVICE" = "myservice" ].

Uninstalling iptables rules

Run ingress-routing-daemon-v2 --uninstall on each node.

Testing

The ingress-routing-daemon-v2 script has been tested with 8 replicas of a web service deployed to a four-node swarm.

Curl requests for the service, directed to any of the specified load-balanced endpoint node IPs, returned successful responses, and examination of the container logs showed the application saw the incoming requests as originating from the Curl client’s IP.

Limitations

As the TOS value can store an 8-bit number, this model can in principle support up to 256 load-balancer endpoint nodes.

However, as the model requires every container to be installed with one iptables mangle rule + one policy routing rule + one policy routing table per manager endpoint node, there might possibly be some performance degradation as the number of such endpoint nodes increases (although experience suggests this is unlikely to be noticeable with <= 16 load-balancer endpoint nodes on modern hardware).

If you add load-balancer endpoint nodes to your swarm - or want to start using existing manager nodes as load-balancer endpoints - you will need to tread carefully, as existing containers will not be able to route traffic back to the new endpoint nodes. Try restarting INGRESS_NODE_GATEWAY_IPS="<Node Ingress IP List>" ingress-routing-daemon-v2 with the updated value for INGRESS_NODE_GATEWAY_IPS, then perform a rolling update of all containers, before using the new load-balancer endpoint.
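
A forced rolling update can be triggered per service, for example (service name illustrative):

# Recreate all tasks so new containers pick up routes for the updated endpoint list
docker service update --force myservice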

Scope for native Docker integration

I'm not familiar with the Docker codebase, but I can't see anything that ingress-routing-daemon-v2 does that couldn't, in principle, be implemented by Docker natively. I'll leave that for the Docker team to consider, or as an exercise for someone familiar with the Docker code.

The ingress routing daemon v2 script

Here is the new ingress-routing-daemon-v2 script.

#!/bin/bash

# Ingress Routing Daemon v2
# Copyright © 2020 Struan Bartlett
# ----------------------------------------------------------------------
# Permission is hereby granted, free of charge, to any person 
# obtaining a copy of this software and associated documentation files 
# (the "Software"), to deal in the Software without restriction, 
# including without limitation the rights to use, copy, modify, merge, 
# publish, distribute, sublicense, and/or sell copies of the Software, 
# and to permit persons to whom the Software is furnished to do so, 
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be 
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS 
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
# SOFTWARE.
# ----------------------------------------------------------------------
# Workaround for https://github.com/moby/moby/issues/25526

if [ "$1" = "--install" ]; then
  INSTALL=1
elif [ "$1" = "--uninstall" ]; then
  INSTALL=0
else
  echo "Usage: $0 [--install|--uninstall]"
fi

echo
echo "  Dumping key variables..."

if [ "$INSTALL" = "1" ] && [ -z "$INGRESS_NODE_GATEWAY_IPS" ]; then
  echo "!!! ----------------------------------------------------------------------"
  echo "!!! WARNING: Using default INGRESS_NODE_GATEWAY_IPS"
  echo "!!! Please generate a list by noting the values shown"
  echo "!!! for INGRESS_DEFAULT_GATEWAY on each of your swarm nodes."
  echo "!!!"
  echo "!!! You only have to do this once, or whenever you add or remove nodes."
  echo "!!!"
  echo "!!! Then relaunch using:"
  echo "!!! INGRESS_NODE_GATEWAY_IPS=\"<Node Ingress IP List>\" $0 --install"
  echo "!!! ----------------------------------------------------------------------"
fi

read INGRESS_SUBNET INGRESS_DEFAULT_GATEWAY \
  < <(docker inspect ingress --format '{{(index .IPAM.Config 0).Subnet}} {{index (split (index .Containers "ingress-sbox").IPv4Address "/") 0}}')

echo "  - INGRESS_SUBNET=$INGRESS_SUBNET"
echo "  - INGRESS_DEFAULT_GATEWAY=$INGRESS_DEFAULT_GATEWAY"

# We need the final bytes of the IP addresses on the ingress network of every node
# i.e. We need the final byte of $INGRESS_DEFAULT_GATEWAY for every node in the swarm
# This shouldn't change except when nodes are added or removed from the swarm, so should be reasonably stable.
# You should configure this yourself, but for now let's assume we have 8 nodes with IPs in the INGRESS_SUBNET numbered x.x.x.2 ... x.x.x.9
if [ -z "$INGRESS_NODE_GATEWAY_IPS" ]; then
  INGRESS_NET=$(echo $INGRESS_DEFAULT_GATEWAY | cut -d'.' -f1,2,3)
  INGRESS_NODE_GATEWAY_IPS="$INGRESS_NET.2 $INGRESS_NET.3 $INGRESS_NET.4 $INGRESS_NET.5 $INGRESS_NET.6 $INGRESS_NET.7 $INGRESS_NET.8 $INGRESS_NET.9"
fi

echo "  - INGRESS_NODE_GATEWAY_IPS=\"$INGRESS_NODE_GATEWAY_IPS\""

# Create node ID from INGRESS_DEFAULT_GATEWAY final byte
NODE_ID=$(echo $INGRESS_DEFAULT_GATEWAY | cut -d'.' -f4)
echo "  - NODE_ID=$NODE_ID"

if [ -z "$INSTALL" ]; then
  echo
  echo "Ingress Routing Daemon v2 exiting."
  exit 0
fi

# Add a rule ahead of the ingress network SNAT rule, that will cause the SNAT rule to be skipped.
[ "$INSTALL" = "1" ] && echo "Adding ingress_sbox iptables nat rule: iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT"
while nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -D POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT; do true; done 2>/dev/null
[ "$INSTALL" = "1" ] && nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT

# 1. Set TOS to NODE_ID in all outgoing packets to INGRESS_SUBNET
[ "$INSTALL" = "1" ] && echo "Adding ingress_sbox iptables mangle rule: iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -j TOS --set-tos $NODE_ID/0xff"
while nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -D POSTROUTING -d $INGRESS_SUBNET -j TOS --set-tos $NODE_ID/0xff; do true; done 2>/dev/null
[ "$INSTALL" = "1" ] && nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -j TOS --set-tos $NODE_ID/0xff

if [ "$INSTALL" = "0" ]; then
  echo
  echo "Ingress Routing Daemon v2 iptables rules uninstalled, exiting."
  exit 0
fi

echo "Ingress Routing Daemon v2 starting ..."

# Watch for container start events, and configure policy routing rules on each container
# to ensure return path traffic for incoming connections is routed back via the correct interface
# and to the correct node from which the incoming connection was received.
docker events \
  --format '{{.ID}} {{index .Actor.Attributes "com.docker.swarm.service.name"}}' \
  --filter 'event=start' \
  --filter 'type=container' | \
  while read ID SERVICE
  do
    if [ -n "$SERVICE" ]; then
    
      NID=$(docker inspect -f '{{.State.Pid}}' $ID)
      echo "Container ID=$ID, NID=$NID, SERVICE=$SERVICE started: applying policy routes."
      
      # 3. Map any connection mark on outgoing traffic to a firewall mark on the individual packets.
      nsenter -n -t $NID iptables -t mangle -A OUTPUT -p tcp -j CONNMARK --restore-mark

      for NODE_IP in $INGRESS_NODE_GATEWAY_IPS
      do
        NODE_ID=$(echo $NODE_IP | cut -d'.' -f4)

        # 2. Map the TOS value on any incoming packets to a connection mark, using the same value.
        nsenter -n -t $NID iptables -t mangle -A PREROUTING -m tos --tos $NODE_ID/0xff -j CONNMARK --set-xmark $NODE_ID/0xffffffff

        # 4. Select the correct routing table to use, according to the firewall mark on the outgoing packet.
        nsenter -n -t $NID ip rule add from $INGRESS_SUBNET fwmark $NODE_ID lookup $NODE_ID prio 32700

        # 5. Route outgoing traffic to the correct node's ingress network IP, according to its firewall mark
        #    (which in turn came from its connection mark, its TOS value, and ultimately its IP).
        nsenter -n -t $NID ip route add table $NODE_ID default via $NODE_IP dev eth0

      done

    fi
  done

@jrbecart commented Nov 10, 2020

Hello @struanb, I don't understand how the uninstall section works in your v2 script, is there something missing?

@struanb commented Nov 18, 2020

Hello @jrbecart. I hope not. Before iptables rules are installed, you'll see there are two while loops that delete any pre-existing rules, using iptables -D. This is a safety measure, in case the script is run with --install multiple times successively, without any intervening call with --uninstall.

As such, when the script is called with --uninstall, by the time the script exits those rules will have been removed, and new rules not yet added.

Hope this answers your question.

@sebastianfelipe commented Nov 19, 2020

Hi everyone, I want to tell you that I discovered a fix for this issue, without installing or configuring anything other than getting the NGINX config right. I know that all of us have tried different approaches. This one was discovered by accident. To be honest, I gave up on this a long time ago. Well, until today. While I was implementing a monitoring system, I was able to get the real source IP using the NGINX log, so I began to debug how that was possible.

Here is an example of that kind of log

10.0.0.2 - - [19/Nov/2020:04:56:31 +0000] "GET / HTTP/1.1" 200 58 "-" req_t=0.003 upstream_t=0.004 "<browser-info>" "<source-ip-1,source-ip2,....>"

Note: There are multiple source IPs if you're using proxies (e.g. Cloudflare and others).

The info was there, my real IP was there. Then I reviewed the NGINX logging format to understand how the magic was possible, and I found this:

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      'req_t=$request_time upstream_t=$upstream_response_time '
                      '"$http_user_agent" "$http_x_forwarded_for"';

That means the magic is here -> $http_x_forwarded_for

After this, I changed the proxy headers like proxy_set_header X-Real-IP $http_x_forwarded_for;.

And finally, the last test: using that information in a NodeJS project, inside a production-like system, using Docker Swarm with an overlay network and about 4 VMs, and guess what, it worked! I could finally get the real IP address.

I'm so happy, because this issue has been open for a long, long time, but I think this is the answer. The versions I used are:

Docker version: 19.03.8
NGINX version: nginx/1.14.2

I will wait for your feedback. I hope you can have the same results as me.

Cheers!
Sebastián.

P.S.: Try this from another network interface, that is, not from localhost, because you will find a "-" in the log instead of your real IP address. Try to test it over the internet, completely outside your home network.

Bonus: I could also map the IP addresses to geolocations using a lookup table, count them and put them on a map, so the answer is yes, this is what we were looking for, guys :)

@vicary commented Nov 19, 2020

@sebastianfelipe that's a big claim after all these years. Are you sure you're not using host mode or one of the other workarounds in this thread?

@sebastianfelipe commented Nov 19, 2020

@sebastianfelipe that's a big claim after all these years. Are you sure you're not using host mode or one of the other workarounds in this thread?

I'm sure. I'm not using host networking on any of those connected services. I just deployed a stack with an overlay network in a production-like environment, including a Digital Ocean load balancer, and it worked. I mean, I can't test it better than this. It's 100% real.

@beornf commented Nov 19, 2020

@sebastianfelipe I'm guessing the Digital Ocean load balancer is appending the user's IP address to the X-Forwarded-For header. This is a known workaround which doesn't solve the issue of retrieving the user's IP in standalone Docker Swarm mode.

@sebastianfelipe commented Nov 19, 2020

@beornf I was trying to sleep and then I read your notification, so I had to wake up and try an approach without a Digital Ocean load balancer, and it failed. You're right: Digital Ocean adds some magic there when a load balancer is added, and that is what populates the $http_x_forwarded_for variable. The Digital Ocean load balancer adds info to that NGINX variable which is not added by Docker Swarm directly. Probably this could lead to a "dummy-like" approach towards a real solution for every case. At least Digital Ocean customers can be happy to know how to deal with this for the moment.

@vicary commented Nov 19, 2020

@beornf @sebastianfelipe Adding to the context, CloudFlare also adds X-Forwarded-For and is largely free.

@sebastianfelipe commented Nov 19, 2020

@beornf @sebastianfelipe Adding to the context, CloudFlare also adds X-Forwarded-For and is largely free.

I think this could work for a lot of us who need a way to get the real IP. Cloudflare can be set up as a proxy or just DNS only. It fits perfectly for non-Digital Ocean customers. It is the cleanest workaround so far. But I agree with @beornf, we need a real solution, without depending on Digital Ocean or Cloudflare to get this done.

Thanks!

@hinorashi commented Jan 12, 2021

It seems like everyone leveling up from docker-compose to docker swarm encounters this issue. Happy new year 2021, guys; I hope I won't see it in 2022 🙈

@cpuguy83 (Contributor) commented Jan 12, 2021

As long as people post about it and don't work to fix it, we'll see it.
There is very little time currently going into swarm from anyone.

@olafbuitelaar commented Jan 12, 2021

I wonder, since the whole load balancing is based on IPVS, has anybody ever tried to allow creating an ingress network with IPIP tunneling mode instead of NAT mode?
https://github.com/moby/ipvs/blob/d5d89bed3438c8b85db5e5868bb23846ee755ff4/constants.go#L122
Sure the return path might be a bit difficult, but not impossible.

@zerthimon commented Jan 15, 2021

The ingress-sbox container needs to be taught the PROXY protocol; then an LB such as nginx or traefik could be used as a front end to the services in the swarm and could set the XFF header with the IP obtained from the PROXY protocol.

@olafbuitelaar commented Jan 15, 2021

@zelahi I think this should be solved at the network level, not at the protocol level. Not everything speaks the PROXY protocol.

For example, the solution proposed by @struanb, or making use of packet encapsulation using IPIP.

@zerthimon commented Jan 15, 2021

@olafbuitelaar Agreed that not everything speaks the PROXY protocol. But this could be a config option for whatever does speak it, right?

@struanb commented Jan 17, 2021

Just to advise, we are now running docker swarm, in conjunction with the docker ingress-routing-daemon (documented above), in production on www.newsnow.co.uk, currently handling some 1,000 requests per second.

We run the daemon on all 10 nodes of our swarm, of which currently only two serve as load balancers for incoming web traffic, which direct traffic to containers running on a selection of 4 of the remaining nodes (the other nodes currently being used for backend processes).

Using the daemon, we have been able to avoid significant changes to our tech stack (no need for cloudflare or nginx) or to our application's internals (which relied upon identifying the requesting client's IP address for geolocation and security purposes).
