Overlay multiple lower directory support #22126
<<<<<<< 894c1f08cd933023e9277eb61be28f6dd42db276 |
Missed this conflict; looks like after my rebase I need to add CreateReadWrite
as well
ed57af6
to
4f1b5ca
Compare
One additional tweak to the design I am making now is to always have a |
@dmcgowan Does this fix some other issues we were encountering when using Overlay? |
@unclejack the 2 big issues I know this addresses are inode exhaustion and the missing updated file during diff (#21555). There are other possible solutions to #21555, but inode exhaustion is directly related to the way each layer is flattened in the current version. My hope is this will also speed up builds on overlay, as the amount of processing to produce a diff has been greatly reduced. What are the issues you had in mind? |
5a318d2
to
c62975f
Compare
Rebased to account for #22168. Might be worth discussing how this will impact this PR. The fsdiff is still used in some cases but not always. |
06b8410
to
8e67834
Compare
Pretty awesome! The only problem with the current approach is that it's not backwards compatible. It seems like this either needs to be a new driver ( |
awesome, running with patch w/o any issues so far
|
@cpuguy83 the current approach uses the new method by default, with an option, "nomultilower", to turn it off. It would be simple to make it opt-in rather than opt-out. My concern with making it a separate graph driver is the amount of code that is shared: it would require exporting almost all the internal functions of the current overlay driver. Also, having another driver is likely to make it more obscure and less used. If it uses the same directory and both methods are always supported, then turning it on in the future would allow backward compatibility with the next release and everything after. |
@dmcgowan How much code duplication would there be if the second driver did not keep the current hardlink-based implementation at all and only used upper overlay dirs, like the aufs driver does (ignoring the problem of reaching the mount options length limit)? |
@tonistiigi: just some of the initialization checks. That is fair to point out, the duplication is because the current approach can be used on top of an existing overlay directory and uses the current hard link method to "squash" layers past the options limit into upper directories as used by the current driver. I found that squashing is required when an image gets about 35 layers deep when using the default "/var/lib/docker" location. |
@dmcgowan I'm sorry for the late reply. I was just hoping this change would get us some fixes for some of the other issues as well. |
@unclejack not late at all. Did you have any specific issues in mind? I am going back through the issue list to see which are reproducible and not identified as Kernel bugs. These are the issues I know are addressed, would love to add to the list
|
Can someone make a list of pros/cons of having a separate driver vs a hybrid driver, so that we can move this forward? |
@thaJeztah added a section on the bottom. Please point out any bias or suggested updates. |
@dmcgowan does overlay work with |
@crosbymichael tried doing remount and did not work for me. Although my suspicion is that even if it did, it would not solve the problem since all the lowers would still need to be given as mount options. |
Would something like this work to get away from all of the hard linking or would the multiple mounts be worse?

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"

	"github.com/Sirupsen/logrus"
)

func main() {
	if err := run(); err != nil {
		logrus.Fatal(err)
	}
}

func run() error {
	cwd, err := os.Getwd()
	if err != nil {
		return err
	}
	for _, name := range []string{
		"lower1",
		"lower2",
		"merged",
		"upper",
		"work",
	} {
		if err := os.MkdirAll(filepath.Join(cwd, name), 0755); err != nil {
			return err
		}
	}
	// mount the RO layers first so we don't overload the mount syscall with too much data
	if err := syscall.Mount(
		"overlay",
		filepath.Join(cwd, "merged"),
		"overlay",
		0,
		fmt.Sprintf(
			"lowerdir=%s:%s,workdir=%s",
			filepath.Join(cwd, "lower1"),
			filepath.Join(cwd, "lower2"),
			filepath.Join(cwd, "work"))); err != nil {
		return err
	}
	// now mount the rw layer with the already mounted lower's merge dir as the lower dir here
	if err := syscall.Mount(
		"overlay",
		filepath.Join(cwd, "merged"),
		"overlay",
		0,
		fmt.Sprintf(
			"lowerdir=%s,workdir=%s,upperdir=%s",
			filepath.Join(cwd, "merged"),
			filepath.Join(cwd, "work"),
			filepath.Join(cwd, "upper"))); err != nil {
		return err
	}
	return nil
}
```
@crosbymichael @vikstrous tried something like that in #18560 (comment) but hit some blockers. There are other ways to get over the options limit. For example symlinks and relative paths. |
boooo |
I think my preference would be to make this change a new driver. It would be nice to have the new "pure" implementation and not have it messed up by the old one. Then as time passes we can eventually fade out the old implementation. I'm fine wiping out my /var/lib/docker for the new implementation as it's much better and cleaner. This is just my initial thought and I don't feel too strongly about same or different drivers; whatever is cleaner and easier to maintain is better for the long run. |
@thaJeztah sounds good to me 👍 |
LGTM, thanks @dmcgowan ping @vdemeester for review/merge |
Add mention in dockerd command line and storage driver selection documentation. Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
96ac5f5
to
a546042
Compare
Oh boy 🎉 |
LGTM |
cherry-pick from moby#22126 Signed-off-by: Viktor Stanchev <me@viktorstanchev.com> Signed-off-by: Lei Jitang <leijitang@huawei.com> (cherry picked from commit b03d323)
cherry-pick from moby#22126 Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan) Signed-off-by: Lei Jitang <leijitang@huawei.com> (cherry picked from commit 8222c86)
cherry-pick from: moby#22126 Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan) Signed-off-by: Lei Jitang <leijitang@huawei.com> (cherry picked from commit 246e993)
Adds a new overlay driver which uses multiple lower directories to create the union fs. Additionally it uses symlinks and relative mount paths to allow a depth of 128 and stay within the mount page size limit. Diffs are done directly over a single directory, allowing diffs to be done efficiently and without the need for the naive diff driver. cherry-pick from moby#22126 Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan) Signed-off-by: Lei Jitang <leijitang@huawei.com> (cherry picked from commit 23e5c94)
Add mention in docker daemon command line and storage driver selection documentation. cherry-pick from moby#22126 Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan) Signed-off-by: Lei Jitang <leijitang@huawei.com>
cherry-pick from: moby#22126 To fix the conflicts when cherry-picking this commit, cherry-pick the following PRs: moby#23133 moby#20525 moby#23193 Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan) Signed-off-by: Lei Jitang <leijitang@huawei.com> (cherry picked from commit 8b0441d)
Summary
The overlay driver is quickly becoming more popular as users are looking for alternatives to devicemapper. Even though in many cases overlay is the best graphdriver option, it is still lacking optimization and has not been updated to take advantage of the 4.0 kernel feature to mount multiple lower directories. In addition, its use of the naive diff driver for generating diffs has led to at least one serious storage bug (#21555). Users often complain about inode exhaustion using overlay which is caused by the way upper directories are combined for each layer. For users not yet updated to 4.0 there is not much improvement we can make, however for 4.0 users, we can.
Design (updated May 17th)
Use a "diff" directory as the upper directory for a layer and have its contents be pure for the layer.
The previous "upper" directory was a combination of all layers on top of a single root. Having a pure
"diff" directory allows directly archiving a single directory to get an exportable diff. This also
allows apply to be done on an empty directory and changes to be calculated against a single directory.
The naive diff driver is no longer used; it is replaced by archiving the "diff" directory, with
whiteouts rewritten to the aufs style expected by Docker diff tars.
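The whiteout rewrite can be pictured as a mapping on tar entry names. The following is a minimal sketch, not the driver's code: it assumes the usual conventions that aufs-style tars mark a deleted file with a `.wh.`-prefixed entry and an opaque directory with a `.wh..wh..opq` entry inside it. The function names are hypothetical, and detecting overlay's own markers on disk is left out.

```go
package main

import (
	"fmt"
	"path"
)

const (
	// Assumed aufs-style markers as they appear in layer tars.
	whiteoutPrefix = ".wh."
	opaqueMarker   = ".wh..wh..opq"
)

// aufsWhiteoutName (hypothetical helper) returns the tar entry name that
// records the deletion of the file at p.
func aufsWhiteoutName(p string) string {
	dir, base := path.Split(p)
	return path.Join(dir, whiteoutPrefix+base)
}

// aufsOpaqueName (hypothetical helper) returns the tar entry name that
// marks dir as opaque, hiding everything below it in lower layers.
func aufsOpaqueName(dir string) string {
	return path.Join(dir, opaqueMarker)
}

func main() {
	fmt.Println(aufsWhiteoutName("etc/removed.conf"))
	fmt.Println(aufsOpaqueName("var/cache"))
}
```

Because only entry names and headers change, the archive can be produced in one pass over the "diff" directory without comparing against the parent layer.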
Each layer has a unique identifier generated for it and a symbolic link in a link directory to the
"diff" directory for the layer. This shortened identifier allows referencing more layers as lower
directories for the overlay mount without hitting the page size limitation on mount arguments. This
unique identifier is put in a "link" file.
A "lower" file is created which stores a ':' separated list of lower directories to be used on mount.
Each item in the lower file is a reference to the symbolic link for each diff directory. The "lower"
file should never exceed the page size limit minus 512 bytes, to avoid getting an
error on mount. This is enforced inside the graph driver as a limit of 128 lower layers.
Performance Results
Testing shows a significant reduction in the number of inodes used, especially
for images with many layers. This is likely to solve many user issues around
running out of inodes while using overlay.
Full test log which generated inode/size information can be found here https://gist.github.com/dmcgowan/46d96ba48a38ace4ce03572efbf25697
Inode usage
Size (1K blocks)
Benchmark Results (between addition of commit with overlay change)
The biggest change involves BenchmarkDiff10KFilesBottom, which is intended to represent steps during a docker build which add few files but are built on bases with a large number of files (i.e. FROM ubuntu).