Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlay multiple lower directory support #22126

Merged
merged 6 commits into from
Jun 13, 2016

Conversation

dmcgowan
Copy link
Member

@dmcgowan dmcgowan commented Apr 18, 2016

Summary

The overlay driver is quickly becoming more popular as users are looking for alternatives to devicemapper. Even though in many cases overlay is the best graphdriver option, it is still lacking optimization and has not been updated to take advantage of the 4.0 kernel feature to mount multiple lower directories. In addition, its use of the naive diff driver for generating diffs has led to at least one serious storage bug (#21555). Users often complain about inode exhaustion using overlay which is caused by the way upper directories are combined for each layer. For users not yet updated to 4.0 there is not much improvement we can make, however for 4.0 users, we can.

Design (updated May 17th)

Use a "diff" directory as the upper directory for a layer and have its contents be pure for the layer.
The previous "upper" directory was a combination of all layers on top of a single root. Having a pure
"diff" directory allows directly archiving a single directory to get an exportable diff. This also
allows apply to be done on an empty directory and changes to be calculated against a single directory.
The naive driver is no longer used and replaced by archiving the "diff" directory with rewrites of
whiteouts to use the aufs style expected by Docker diff tars.

Each layer has a unique identifier generated for it and a symbolic link in a link directory to the
"diff" directory for the layer. This shortened identifier allows referencing more layers as lower
directories for the overlay mount without hitting the page size limitation on mount arguments. This
unique identifier is put in a "link" file.

A "lower" file is created which stores a ':' separated list of lower directories to be used on mount.
Each item in the lower file is a reference to the symbolic link for each diff directory. The "lower"
file should always be less than or equal to the page size limit - 512 bytes to avoid getting an
error on mount. This is enforce inside the graph driver as a limit of 128 lower layers.

Performance Results

Testing shows a significant reduction in the number of inodes used, especially
for images with many layers. This is likely to solve many user issues around
running out of inodes while using overlay.

Full test log which generated inode/size information can be found here https://gist.github.com/dmcgowan/46d96ba48a38ace4ce03572efbf25697

Inode usage

overlay2 overlay %
dockercore/docker 76908 460576 17%
wordpress 27784 145609 19%
golang 24795 74487 33%
ubuntu 8154 17534 47%

Size (1K blocks)

overlay2 overlay %
dockercore/docker 2536356 3212196 79%
wordpress 463460 682796 68%
golang 787848 878640 90%
ubuntu 135700 147804 92%

Benchmark Results (between addition of commit with overlay change)

benchmark                         old ns/op     new ns/op     delta
BenchmarkExists-8                 1749          1549          -11.44%
BenchmarkGetEmpty-8               4224          6538          +54.78%
BenchmarkDiffBase-8               108961        101431        -6.91%
BenchmarkDiffSmallUpper-8         13801616      207843        -98.49%
BenchmarkDiff10KFileUpper-8       239367502     174130292     -27.25%
BenchmarkDiff10KFilesBottom-8     240473098     213452        -99.91%
BenchmarkDiffApply100-8           23074450      5931350       -74.29%
BenchmarkDiff20Layers-8           14447874      99199         -99.31%
BenchmarkRead20Layers-8           6555          6425          -1.98%

benchmark                         old allocs     new allocs     delta
BenchmarkExists-8                 4              4              +0.00%
BenchmarkGetEmpty-8               8              10             +25.00%
BenchmarkDiffBase-8               307            304            -0.98%
BenchmarkDiffSmallUpper-8         787            744            -5.46%
BenchmarkDiff10KFileUpper-8       644928         641897         -0.47%
BenchmarkDiff10KFilesBottom-8     644918         750            -99.88%
BenchmarkDiffApply100-8           9860           10330          +4.77%
BenchmarkDiff20Layers-8           4532           304            -93.29%
BenchmarkRead20Layers-8           7              7              +0.00%

benchmark                         old bytes     new bytes     delta
BenchmarkExists-8                 544           544           +0.00%
BenchmarkGetEmpty-8               1088          1152          +5.88%
BenchmarkDiffBase-8               24534         23976         -2.27%
BenchmarkDiffSmallUpper-8         44574         38142         -14.43%
BenchmarkDiff10KFileUpper-8       22451091      22057174      -1.75%
BenchmarkDiff10KFilesBottom-8     22433038      38901         -99.83%
BenchmarkDiffApply100-8           561485        631334        +12.44%
BenchmarkDiff20Layers-8           304053        24018         -92.10%
BenchmarkRead20Layers-8           1220          1212          -0.66%

The biggest change involves BenchmarkDiff10KFilesBottom which is intended
to represent steps during a docker build which add few files but
are built on bases with a large number of files (i.e. FROM ubuntu).

@dmcgowan dmcgowan changed the title Overlay native diff Overlay multiple lower directory support Apr 18, 2016

<<<<<<< 894c1f08cd933023e9277eb61be28f6dd42db276
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this conflict, looks like my after my rebase need to add CreateReadWrite as well

@dmcgowan dmcgowan force-pushed the overlay-native-diff branch 2 times, most recently from ed57af6 to 4f1b5ca Compare April 18, 2016 21:33
@dmcgowan
Copy link
Member Author

One additional tweak to the design I am making now is to always have a lower-id and exclude the root from the lower. This will ensure that the combination of the diff and lowers is equivalent to the prior upper directory. It should allow cleaning up some logic and ease some of the cleanup when there is a squash failure.

@unclejack
Copy link
Contributor

@dmcgowan Does this fix some other issues we were encountering when using Overlay?

@dmcgowan
Copy link
Member Author

dmcgowan commented Apr 22, 2016

@unclejack the 2 big issues I know this addresses is inode exhaustion and the missing updated file during diff #21555. There are other possible solutions to 21555 but inode exhaustion is directly related to the way each layer is flattened in the current version. My hope is this will also speed up builds on overlay as the amount of processing to produce a diff has greatly been reduced. What are the issues you had in mind?

@dmcgowan
Copy link
Member Author

dmcgowan commented Apr 23, 2016

Rebased to account for #22168. Might be worth discussing how this will impact this PR. The fsdiff is still used in some cases but not always.

@dmcgowan dmcgowan force-pushed the overlay-native-diff branch 2 times, most recently from 06b8410 to 8e67834 Compare April 27, 2016 05:12
@cpuguy83
Copy link
Member

Pretty awesome!
Verified this works and seems to work well. I have not done any performance comparisons or stress testing, but functionally it seems pretty solid.

The only problem with the current approach is that it's not backwards compatible.
So a user using overlay can upgrade Docker get this changed out under the hood without noticing any changes, but then if they need to downgrade anything new will not work under the old driver.

It seems like this either needs to be a new driver (overlay2?) so there are no surprises here, or the newer version needs to setup new layers in a backwards compatible way.

@runcom
Copy link
Member

runcom commented May 10, 2016

awesome, running with patch w/o any issues so far

overlay2 sounds good to me to maintain back compatibility

@dmcgowan
Copy link
Member Author

@cpuguy83 the current approach uses the new method by default with the option to turn it off using an option "nomultilower". It would be simple to make it an opt-in option rather than opt-out option. My concerns with making it a separate graph driver is related to the amount of code that is shared, it would requiring making almost all the internal functions of the currently overlay driver exported. Also having another driver is likely to make it more obscure and less used. If it is using the same directory and both methods are always supported, then in the future turning it on would allow backward compatibility with the next release and everything after.

@tonistiigi
Copy link
Member

@dmcgowan How much code duplication would there be if the second driver would not keep the current hardlink based implementation at all and would only use upper overlay dirs, like aufs driver does(ignoring the problem with reaching the mount options length limit).

@dmcgowan
Copy link
Member Author

@tonistiigi: just some of the initialization checks. That is fair to point out, the duplication is because the current approach can be used on top of an existing overlay directory and uses the current hard link method to "squash" layers past the options limit into upper directories as used by the current driver.

I found that squashing is required when an image gets about 35 layers deep when using the default "/var/lib/docker" location.

@unclejack
Copy link
Contributor

@dmcgowan I'm sorry for the late reply. I was just hoping this change would get us some fixes for some of the other issues as well.

@dmcgowan
Copy link
Member Author

@unclejack not late at all. Did you have any specific issues in mind? I am going back through the issue list to see which are reproducible and not identified as Kernel bugs.

These are the issues I know are addressed, would love to add to the list

@unclejack
Copy link
Contributor

@dmcgowan #9572 #19647 #12080

These are not necessarily issues anyone would expect this PR to fix, just something I'd hoped this change would fix through multiple lower directories.

@thaJeztah
Copy link
Member

Can someone make a list of pros/cons of having a separate driver vs a hybrid driver, so that we can move this forward?

@dmcgowan
Copy link
Member Author

@thaJeztah added a section on the bottom. Please point out any bias or suggested updates.

@crosbymichael
Copy link
Contributor

@dmcgowan does overlay work with MS_REMOUNT for specifying more layers after it is already mounted or do we have no choice but to use hardlinks to squash?

@dmcgowan
Copy link
Member Author

@crosbymichael tried doing remount and did not work for me. Although my suspicion is that even if it did, it would not solve the problem since all the lowers would still need to be given as mount options.

@crosbymichael
Copy link
Contributor

Would something like this work to get away from all of the hard linking or would the multiple mounts be worse?

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"

    "github.com/Sirupsen/logrus"
)

func main() {
    if err := run(); err != nil {
        logrus.Fatal(err)
    }
}

func run() error {
    cwd, err := os.Getwd()
    if err != nil {
        return err
    }
    for _, name := range []string{
        "lower1",
        "lower2",
        "merged",
        "upper",
        "work",
    } {
        if err := os.MkdirAll(filepath.Join(cwd, name), 0755); err != nil {
            return err
        }
    }
    // mount the RO layers first so we don't overload the mount syscall with too much data
    if err := syscall.Mount(
        "overlay",
        filepath.Join(cwd, "merged"),
        "overlay",
        0,
        fmt.Sprintf(
            "lowerdir=%s:%s,workdir=%s",
            filepath.Join(cwd, "lower1"),
            filepath.Join(cwd, "lower2"),
            filepath.Join(cwd, "work"))); err != nil {
        return err
    }
    // now mount the rw layer with the already mounted lower's merge dir as the lower dir here
    if err := syscall.Mount(
        "overlay",
        filepath.Join(cwd, "merged"),
        "overlay",
        0,
        fmt.Sprintf(
            "lowerdir=%s,workdir=%s,upperdir=%s",
            filepath.Join(cwd, "merged"),
            filepath.Join(cwd, "work"),
            filepath.Join(cwd, "upper"))); err != nil {
        return err
    }
    return nil
}

@tonistiigi
Copy link
Member

@crosbymichael @vikstrous tried something like that in #18560 (comment) but hit some blockers.

There are other ways to get over the options limit. For example symlinks and relative paths.

@crosbymichael
Copy link
Contributor

boooo

@crosbymichael
Copy link
Contributor

I think my preference would be to make this change a new driver. It would be nice to have the new "pure" implementation and not have it messed up by the old one. Then as time passes we can eventually fade out the old implementation.

I'm fine whipping out my var/lib/docker for the new implementation as its much better and cleaner. This is just my initial thought and I don't feel too strongly about same or different drivers, whatever is cleaner and easier to maintain is better for the long run.

@vdemeester
Copy link
Member

@thaJeztah sounds good to me 👍

@thaJeztah
Copy link
Member

LGTM, thanks @dmcgowan

ping @vdemeester for review/merge

@thaJeztah thaJeztah added impact/changelog docs/revisit and removed status/needs-attention Calls for a collective discussion during a review session labels Jun 13, 2016
Add mention in dockerd command line and storage driver selection documentation.

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
@vdemeester
Copy link
Member

Oh boy 🎉
LGTM 🐹

@crosbymichael
Copy link
Contributor

LGTM

@crosbymichael crosbymichael merged commit 8a2f9a2 into moby:master Jun 13, 2016
AkihiroSuda added a commit to AkihiroSuda/issues-docker that referenced this pull request Jun 14, 2016
@dmcgowan dmcgowan deleted the overlay-native-diff branch June 15, 2016 17:15
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
cherry-pick from moby#22126

Signed-off-by: Viktor Stanchev <me@viktorstanchev.com>
SIgned-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit b03d323)
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
cherry-pick from moby#22126

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Signed-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit 8222c86)
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
cherry-pick from: moby#22126

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Signed-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit 246e993)
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
Adds a new overlay driver which uses multiple lower directories to create the union fs.
Additionally it uses symlinks and relative mount paths to allow a depth of 128 and stay within the mount page size limit.
Diffs and done directly over a single directory allowing diffs to be done efficiently and without the need fo the naive diff driver.

cherry-pick from moby#22126

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Signed-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit 23e5c94)
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
Add mention in docker daemon command line and storage driver selection documentation.
cherry-pick from moby#22126

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Signed-off-by: Lei Jitang <leijitang@huawei.com>
liusdu pushed a commit to liusdu/moby that referenced this pull request Oct 30, 2017
cherry-pick from: moby#22126

To fix the conflicts when cherry-pick this commis, cherry-pick
the following prs:
    moby#23133
    moby#20525
    moby#23193

Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Signed-off-by: Lei Jitang <leijitang@huawei.com>
(cherry picked from commit 8b0441d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants