Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Windows-specific DefaultHTTPHost (127.0.0.1 instead of localhost) #41616

Open
thaJeztah opened this issue Nov 1, 2020 · 6 comments
Open
Labels

Comments

@thaJeztah
Copy link
Member

Pull request #17777 added a workaround for a performance issue identified in Windows TP4 (technical preview for Windows Server 2016 "RS1").

When connecting to the docker API using TCP, there was a 2-second delay when using localhost, which did not occur when using the IP address (127.0.0.1) instead;

Our server perf team reported a significant regression with the latest build of docker. Myself and @swernli narrowed it down to either a bug in GOLang 1.5.1 and/or a bug in the Windows networking stack in TP4. It is only evident after upgrading from 1.4.3 to 1.5.1. There is a huge comment in the PR which explains what the mitigation is in far more detail. I have deliberately not changed the behaviour of any non-Windows client to avoid any possible other side effects (slim though that might be).

The 2-second delay is because we hit this twice during docker run.

A long description can be found as a comment in source:

// TODO Windows. Identify bug in GOLang 1.5.1+ and/or Windows Server 2016 TP5.

On Windows, this mitigates a problem with the default options of running
a docker client against a local docker daemon on TP5.

What was found that if the default host is "localhost", even if the client
(and daemon as this is local) is not physically on a network, and the DNS
cache is flushed (ipconfig /flushdns), then the client will pause for
exactly one second when connecting to the daemon for calls. For example
using docker run windowsservercore cmd, the CLI will send a create followed
by an attach. You see the delay between the attach finishing and the attach
being seen by the daemon.

Here's some daemon debug logs with additional debug spew put in. The
AfterWriteJSON log is the very last thing the daemon does as part of the
create call. The POST /attach is the second CLI call. Notice the second
time gap.

time="2015-11-06T13:38:37.259627400-08:00" level=debug msg="After createRootfs"
time="2015-11-06T13:38:37.263626300-08:00" level=debug msg="After setHostConfig"
time="2015-11-06T13:38:37.267631200-08:00" level=debug msg="before createContainerPl...."
time="2015-11-06T13:38:37.271629500-08:00" level=debug msg=ToDiskLocking....
time="2015-11-06T13:38:37.275643200-08:00" level=debug msg="loggin event...."
time="2015-11-06T13:38:37.277627600-08:00" level=debug msg="logged event...."
time="2015-11-06T13:38:37.279631800-08:00" level=debug msg="In defer func"
time="2015-11-06T13:38:37.282628100-08:00" level=debug msg="After daemon.create"
time="2015-11-06T13:38:37.286651700-08:00" level=debug msg="return 2"
time="2015-11-06T13:38:37.289629500-08:00" level=debug msg="Returned from daemon.ContainerCreate"
time="2015-11-06T13:38:37.311629100-08:00" level=debug msg="After WriteJSON"
... 1 second gap here....
time="2015-11-06T13:38:38.317866200-08:00" level=debug msg="Calling POST /v1.22/containers/984758282b842f779e805664b2c95d563adc9a979c8a3973e68c807843ee4757/attach"
time="2015-11-06T13:38:38.326882500-08:00" level=info msg="POST /v1.22/containers/984758282b842f779e805664b2c95d563adc9a979c8a3973e68c807843ee4757/attach?stderr=1&stdin=1&stdout=1&stream=1"

We suspect this is either a bug introduced in GOLang 1.5.1, or that a change
in GOLang 1.5.1 (from 1.4.3) is exposing a bug in Windows. In theory,
the Windows networking stack is supposed to resolve "localhost" internally,
without hitting DNS, or even reading the hosts file (which is why localhost
is commented out in the hosts file on Windows).

We have validated that working around this using the actual IPv4 localhost
address does not cause the delay.

This does not occur with the docker client built with 1.4.3 on the same
Windows build, regardless of whether the daemon is built using 1.5.1
or 1.4.3. It does not occur on Linux. We also verified we see the same thing
on a cross-compiled Windows binary (from Linux).

Final note: This is a mitigation, not a 'real' fix. It is still susceptible
to the delay if a user were to do docker run -H=tcp://localhost:2375...
explicitly.

We should investigate if this problem is still there, or if it's safe to remove the workaround.

@thaJeztah
Copy link
Member Author

I'd like to get rid of that ugly workaround to cleanup and simplify the code; so perhaps someone who's running on Windows could check? @olljanat @StefanScherer 🤗

Also see #41611 and #41617

@StefanScherer
Copy link
Contributor

StefanScherer commented Nov 4, 2020

@thaJeztah I've tested it on a Windows Server 2019 and a Docker Engine - Enterprise 19.03.12 configured like this:

{
    "hosts":  [
                  "tcp://localhost:2375",
                  "npipe://"
              ],
    "experimental":  false
}

Running Docker CLI with localhost:2375 connection is indeed a bit slower, but not a whole second:

PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://127.0.0.1:2375 images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 104
Ticks             : 1043621
TotalDays         : 1.20789467592593E-06
TotalHours        : 2.89894722222222E-05
TotalMinutes      : 0.00173936833333333
TotalSeconds      : 0.1043621
TotalMilliseconds : 104.3621



PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://localhost:2375 images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 428
Ticks             : 4285078
TotalDays         : 4.95958101851852E-06
TotalHours        : 0.000119029944444444
TotalMinutes      : 0.00714179666666667
TotalSeconds      : 0.4285078
TotalMilliseconds : 428.5078

@StefanScherer
Copy link
Contributor

Now I ran a comparison with this insecure setting (0.0.0.0:2375 instead of localhost:2375 or 127.0.0.1:2375) in the daemon.json:

{
    "hosts":  [
                  "tcp://0.0.0.0:2375",
                  "npipe://"
              ],
    "experimental":  false
}

and then both Docker CLI commands are both fast.

PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://127.0.0.1:2375 images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 109
Ticks             : 1090040
TotalDays         : 1.26162037037037E-06
TotalHours        : 3.02788888888889E-05
TotalMinutes      : 0.00181673333333333
TotalSeconds      : 0.109004
TotalMilliseconds : 109.004



PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://localhost:2375 images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 112
Ticks             : 1126574
TotalDays         : 1.30390509259259E-06
TotalHours        : 3.12937222222222E-05
TotalMinutes      : 0.00187762333333333
TotalSeconds      : 0.1126574
TotalMilliseconds : 112.6574

But anyway, the default on Windows is the named pipe and not using a TCP port. So I don't really would bother if it's a bit slower.

@StefanScherer
Copy link
Contributor

StefanScherer commented Nov 4, 2020

And a final test with TLS activated when someone wants to use TCP connection:

{
    "hosts":  [
                  "tcp://0.0.0.0:2376",
                  "npipe://"
              ],
    "tlscert":  "C:\\ProgramData\\docker\\certs.d\\server-cert.pem",
    "experimental":  false,
    "tlsverify":  true,
    "tlskey":  "C:\\ProgramData\\docker\\certs.d\\server-key.pem",
    "tlscacert":  "C:\\ProgramData\\docker\\certs.d\\ca.pem"
}

Both commands are fast:

PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://127.0.0.1:2376 --tls images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 118
Ticks             : 1184388
TotalDays         : 1.37081944444444E-06
TotalHours        : 3.28996666666667E-05
TotalMinutes      : 0.00197398
TotalSeconds      : 0.1184388
TotalMilliseconds : 118.4388



PS C:\ProgramData\docker\config> measure-command { docker -H=tcp://localhost:2376 --tls images }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 131
Ticks             : 1312670
TotalDays         : 1.51929398148148E-06
TotalHours        : 3.64630555555556E-05
TotalMinutes      : 0.00218778333333333
TotalSeconds      : 0.131267
TotalMilliseconds : 131.267

@thaJeztah
Copy link
Member Author

Thanks for checking!

Running Docker CLI with localhost:2375 connection is indeed a bit slower, but not a whole second:

Interesting, so in that case it's still 4 times as slow as the IP-address (and if I look at the description, that same delay would happen twice on a docker run because it's making multiple API calls.

And a final test with TLS activated when someone wants to use TCP connection

I guess by setting 0.0.0.0, the daemon wouldn't hit the localhost case, so some difference there

But anyway, the default on Windows is the named pipe and not using a TCP port. So I don't really would bother if it's a bit slower.

Correct; looking at when that default is used, and looks to be in

func parseDaemonHost(addr string) (string, error) {

Which looks to use that address if someone specified the scheme without any additional info (-H tcp://, -H nipipe://, -H unix://), or only specified the port-number -H :6666 (?)

Also looking at the original description;

We suspect this is either a bug introduced in GOLang 1.5.1, or that a change in GOLang 1.5.1 (from 1.4.3) is exposing a bug in Windows. In theory, the Windows networking stack is supposed to resolve "localhost" internally, without hitting DNS, or even reading the hosts file (which is why localhost is commented out in the hosts file on Windows).

perhaps @daschott, @madhanrm, or @dineshgovindasamy knows if this is a known issue and/or if there's a ticket inside Microsoft tracking this problem.

@olljanat
Copy link
Contributor

olljanat commented Nov 4, 2020

Running Docker CLI with localhost:2375 connection is indeed a bit slower, but not a whole second:

Interesting, so in that case it's still 4 times as slow as the IP-address (and if I look at the description, that same delay would happen twice on a docker run because it's making multiple API calls.

I suspect that it is because Windows prefer IPv6 over IPv4 so it will first try connect to ::1 and only after if that fails then to 127.0.0.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants