
lcow: Slower pull times with newer kernels #3120

Closed

rn opened this issue Jul 16, 2018 · 7 comments


rn commented Jul 16, 2018

This was reported by @jhowardmsft on Slack. He noticed that with newer kernels, specifically 4.14.53, the first pull on a fresh data root is a lot slower than on older kernels, specifically 4.14.23. Apparently the mkfs.ext4 step takes a lot longer. One suspicion is the inclusion of 0020-scsi-storvsc-Allow-only-one-remove-lun-work-item-to-.patch and 0021-scsi-storvsc-Avoid-excessive-host-scan-on-controller.patch, which were introduced with 4.14.30 as part of another debugging effort around LCOW and disk hotplug.

This issue is to track the debugging.
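
As a cheap first check of that suspicion, mkfs.ext4 can be timed in isolation inside an LCOW container, since containers run on the same utility-VM kernel. A sketch (the alpine image, the 256MB scratch file, and installing e2fsprogs via apk are arbitrary choices for illustration, and this may not exercise exactly the same code path as the SCSI-attached layer disks):

docker run --rm --platform linux alpine sh -c 'apk add --no-cache e2fsprogs && dd if=/dev/zero of=/tmp/scratch.img bs=1M count=256 && time mkfs.ext4 -F -q /tmp/scratch.img'

Comparing the reported mkfs time across the two kernels should show whether mkfs.ext4 itself is the slow part.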


rn commented Jul 16, 2018

Trying to repro the issue with:

  • Windows Build: 17713
  • OpenGCS commit: f4421ed6d06d1583e4625634a8cc4f5f864d77fe
  • Docker CLI commit: 2b221d8f
  • Docker Daemon commit: 905811f

The daemon is started on a fresh data-root:

Remove-Item C:\lcow-test -Force -Recurse
dockerd -D --experimental --data-root C:\lcow-test

A quick test with docker run busybox uname -a shows:

  • 4.14.23-linuxkit: First run takes ~13s; subsequent rmi + run cycles take ~10s
  • 4.14.54-linuxkit: First run takes ~23s; subsequent rmi + run cycles take ~17s

Let's look at it more systematically and use both docker image pull and docker image load of a small image (busybox, 2 layers) and an image with more layers (linuxkit/test-lcow:10layers, 10 small layers). We pull/load an image, time it, rmi the image, and pull/load again, e.g.:

$t = Measure-Command { docker pull --platform linux busybox }; $t.TotalSeconds
docker rmi busybox
$t = Measure-Command { docker pull --platform linux busybox }; $t.TotalSeconds
docker rmi busybox
...

By repeating the rmi + pull (or load) we see if there is a difference between first pull and subsequent pulls.
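
A small loop makes the repetition less error-prone; a sketch (the run count of 5 is arbitrary):

$image = "busybox"
1..5 | ForEach-Object {
    $t = Measure-Command { docker pull --platform linux $image }
    "run {0}: {1:N1}s" -f $_, $t.TotalSeconds
    docker rmi $image | Out-Null
}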

For docker pull busybox:

  • 4.14.23-linuxkit: First pull takes ~7.2s, subsequent pulls (after rmi) take ~6.5s
  • 4.14.54-linuxkit: First pull takes ~15.7s, subsequent pulls (after rmi) take ~13.9s

For docker pull linuxkit/test-lcow:10layers:

  • 4.14.23-linuxkit: First pull takes ~25.9s, subsequent pulls (after rmi) take ~24.3s
  • 4.14.54-linuxkit: Pulls take between 102.2s and 111.9s with no significant difference between first and subsequent (after rmi) pulls.

For docker image load -i 10layers.tar (generated with docker image save -o 10layers.tar linuxkit/test-lcow:10layers; see the sketch after the timings):

  • 4.14.23-linuxkit: Loads take between 21.1s and 21.5s with no significant difference between first and subsequent (after rmi) loads.
  • 4.14.54-linuxkit: Loads take between 96.9s and 112.9s with no significant difference between first and subsequent (after rmi) loads.
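
For reference, the load variant of the measurement follows the same pattern, with the tarball produced once up front:

docker image save -o 10layers.tar linuxkit/test-lcow:10layers
$t = Measure-Command { docker image load -i 10layers.tar }; $t.TotalSeconds
docker rmi linuxkit/test-lcow:10layers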

So, there is indeed a significant difference between the two kernels. Let's bisect...
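
Each bisect step swaps the utility-VM kernel and re-runs the load test. A sketch of one step, assuming kernel/initrd pairs for each candidate version have been built into a local kernels\<version> directory (a hypothetical layout) and that Docker picks them up from C:\Program Files\Linux Containers, the default LCOW location; adjust if your setup differs:

$lcow = "C:\Program Files\Linux Containers"
Copy-Item .\kernels\4.14.36\kernel "$lcow\kernel" -Force
Copy-Item .\kernels\4.14.36\initrd.img "$lcow\initrd.img" -Force
# restart the manually started dockerd on the fresh data-root, then:
$t = Measure-Command { docker image load -i 10layers.tar }; $t.TotalSeconds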


rn commented Jul 16, 2018

We use the docker image load -i 10layers.tar test as it has the fewest moving parts (ranges are from three runs):

4.14.23: 21.1s - 21.5s
...
4.14.30: 21.2s - 21.7s
...
4.14.34: 21.0s - 21.8s
4.14.35: 21.1s - 21.5s
4.14.36: 100.2s - 107.7s
...
4.14.38: 99.6s - 103.6s
...
4.14.54: 96.9s - 112.9s

So the regression seems to have happened between 4.14.35-linuxkit and 4.14.36-linuxkit. We introduced the patches mentioned above with 4.14.30-linuxkit, so they are unlikely to be the culprit. Just to make sure, I removed the patches, rebuilt 4.14.54-linuxkit, and still see 100+s for the docker image load.

@rn changed the title from "lcow: Slower pull times wit newer kernels" to "lcow: Slower pull times with newer kernels" on Jul 16, 2018

rn commented Jul 17, 2018

As a quick check I looked at boot times of the 4.14.35 and 4.14.36 kernels using a quick-hack hcsvm tool. Boot times of both kernels are comparable with 4.14.46, actually slightly faster: 5.8s vs 6.2s (note these are quite slow overall, as serial console output seems to slow things down a little).

rn added a commit to rn/lcow that referenced this issue Jul 26, 2018
Between 4.14.35 and 4.14.36 some patches were introduced
which slow down image load/pull times considerably.

For now force downgrade the kernel

See: linuxkit/linuxkit#3120

Signed-off-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
kmjohansen (Contributor) commented:

May be coincidental, but 4.14.36 contained some ext4 correctness fixes as well as a change to the locking in writeback. I only mention this because I had to do some looking through that same changelog in order to figure out what was going on with random number generation: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=v4.14.36
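
For anyone who wants to narrow that down, the candidate commits can be listed straight from a clone of the stable tree; a sketch, with paths chosen to cover the ext4 and writeback changes mentioned above:

git log --oneline v4.14.35..v4.14.36 -- fs/ext4 fs/fs-writeback.c mm/page-writeback.c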


rn commented Aug 8, 2018

Yeah, I saw that too. Haven't gotten around to bisecting this one yet.

lowenna commented Apr 4, 2019

Closing now that the kernel is bumped to 4.19.x. The issue doesn't seem to repro any more.

lowenna commented Apr 4, 2019

Oh, I don't seem to have permission to close my own issue in this repo... but @rn or @justincormack probably can.

Edit: Aha, Rolf opened it on my behalf. Should have looked at the top :)
