Unaddressed issues #18
Something I don't think I've ever noticed rubrics for: producing tests for container builds.
@psychemedia Trying to catch up on all the issues :-)
Re: optimisation - out of scope: agreed, though is it worth mentioning that it is a thing, but a thing that is out of scope?

Re: multi-stage builds: this relates to memory/size efficiency/optimisation. Would it be worth mentioning, perhaps alongside unaddressed optimisation issues, that in some contexts it makes sense to leave artefacts in, but if you did have to optimise, or even obfuscate, or even keep secret, some things in a final container, there are (out of scope) approaches available for this, such as multi-stage builds?

Just on the point of leaving artefacts in: 1) if you did do a multi-stage build, the evidence for that is still in the Dockerfile; 2) the presence of build artefacts may add confusion to the working environment (e.g. if you used a complex build environment to build one thing, it could leave bits around that you might mistake as part of the runtime for something else in the final environment).
@psychemedia Can you please see into which rule multi-stage builds fit best? I agree it's worth noting them, if only so as not to confuse researchers when they come across them in the wild.
Maybe in step 2, where mention is made of using FROM? My understanding of multi-stage builds is that you can essentially just extract a layer containing an application and not have to retain all the scaffolding layers used to construct it. The original Dockerfile will contain the reproducible script for generating the layer, but the final container will only have the functional layer, and not the build tools required to build it.

It might also work in rule 1, if rule 1 were broadened to something like Consider tools to assist with Dockerfile generation or examples from pre-existing Dockerfiles. The idea being that if folk build "compartmentalised" Dockerfiles, with clear sections identifying how to build/configure particular applications (e.g. a Jupyter notebook chunk, a database chunk, a chunk that sets things up for working with a GPU, etc.), you might be able to reuse them. When it comes to the multi-stage build, your chunk sets up the scaffolding tools and builds the application you're interested in, and then the final image just contains the application layer and not the scaffolding tools required to build it.
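To make the pattern described above concrete, here is a minimal sketch of a multi-stage Dockerfile (all image tags, paths, and the application name are hypothetical examples, assuming a compiled Go application): the first stage holds the scaffolding, and only the built artifact is copied into the final image.

```dockerfile
# Stage 1: build environment with all the scaffolding tools
# (image tags and paths here are illustrative, not prescriptive)
FROM golang:1.21 AS builder
WORKDIR /src
COPY . .
RUN go build -o /out/myapp ./cmd/myapp

# Stage 2: the final image keeps only the built artifact,
# not the compiler or build dependencies from the first stage
FROM debian:bookworm-slim
COPY --from=builder /out/myapp /usr/local/bin/myapp
ENTRYPOINT ["myapp"]
```

Note that, as mentioned above, the full build recipe remains visible in the Dockerfile even though the scaffolding stage is absent from the final image.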
@vsoch What do you think about the issues raised here?
I think that multi-stage builds are useful enough to warrant a mention, e.g.,
But for the purposes of this paper, we should leave the detail at that. For some data scientists, the idea of a multi-stage build will directly contradict one of their goals (reproducibility), and for others it might be essential simply to use the container at runtime (if it's otherwise too big). It's also not correct to match a multi-stage build with a specific layer: from the user's perspective, they are targeting a file or folder from some layer (Docker refers to these as artifacts in the docs) in a previous container that they want to retain for their updated image. A few comments on the points:
The linter is nice because it suggests a change and also tells you why. The user could learn something in real practice, vs. having their eyes glaze over reading a paper that tries to list them all in one swoop (which we couldn't even do). So my $0.02 is to add one or two lines mentioning multi-stage builds, and link the interested user to resources to learn more.
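As an illustration of the suggest-and-explain behaviour described above (the snippet itself is a hypothetical example; hadolint is one widely used Dockerfile linter, and DL3008 is one of its rule codes):

```dockerfile
# A linter such as hadolint would flag this line (DL3008),
# explaining that unpinned package versions make builds non-reproducible:
RUN apt-get update && apt-get install -y curl

# A suggested, more reproducible form (the pinned version number
# is an illustrative placeholder, not a recommendation):
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl=7.88.1-10 \
    && rm -rf /var/lib/apt/lists/*
```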
@vsoch Re. multi-stage builds, IMO this latest change covers that adequately: https://github.com/nuest/ten-simple-rules-dockerfiles/blob/master/ten-simple-rules-dockerfiles.Rmd#L286 I've added linters, so will close this.
Re:
How about: Where image size is a concern, consider using multi-stage builds [@docker_multi-stage_2020] to separate and remove packages only required to build, and not run, the final application. To manage container complexity, consider separating applications into their own containers created from a common root, then composing them as a set of linked containers (cf. @gruening_recommendations_2019). I also note the preceding
How about: In general, do not worry about image size when building images for use in data science, because (a) the images are unlikely to be pulled frequently or used to launch large numbers of containers over short periods of time, (b) your data is likely to have much larger storage requirements than the software, and (c) transparency and inspectability of the image configuration outweigh storage concerns.

Apols for my not being more engaged with this paper; I tend to be an outlier in the way I appropriate and use a lot of emerging tech in ways it's often not really meant/intended for...
I think the way it is phrased now is clean and direct, and we don't need to add to it. Specifically:
is also sort of off, because a multi-stage build isn't removing packages; it's adding files/folders from previous layers and whiting out the rest. It's a selection to add to another base, and not a removal, technically. I also don't think this is the right advice:
Many times it's "better" to have multiple co-dependent software installs kept together in the same container. Other times you can use the scientific filesystem. I've seen more successful containers (meaning they work as expected after some time) that don't try to separate and link. I think we are better off not making such a biased statement here, because it really depends.
We can only wish! This led to near disaster for Singularity Hub before I put pull / interaction limits on it :) |
@vsoch I remember the original Docker advice was one thing per container, using things like docker-compose to create more complex environments. Elsewhere, it would be interesting to trace the evolution of that working practice compared to stuffing everything into one container. A couple of advantages of compose:
Decomposing things into separate containers means they are lighter elements for use in a pipeline. This sort of feeds into the question of whether the reproducible environment should be a single container, or a reproducible combination of composed or pipelined containers. In that case, you need good practice both for the individual container definitions and for the composition/pipeline definition. But if you are building containers for use in pipelines/compositions, then best practice may well be to isolate, minimise size, etc.?

As to the frequency with which images are pulled: if containers being pulled a lot is an issue, then shouldn't there be a recommendation about minimising image size to limit resource usage? Reproducibility should also be mindful of sustainability. If your reproducible workflow is built around an unsustainable practice, it won't be reproducible for long...
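A minimal sketch of the composition pattern being argued for here (service names, images, and the password value are all hypothetical placeholders), linking a notebook container to a database container with docker-compose:

```yaml
# docker-compose.yml -- illustrative two-container environment
version: "3.8"
services:
  notebook:
    image: jupyter/base-notebook   # analysis environment
    ports:
      - "8888:8888"
    depends_on:
      - db
  db:
    image: postgres:15             # data store kept in its own container
    environment:
      POSTGRES_PASSWORD: example   # placeholder; do not use in practice
```

Each service stays an independently reusable element, at the cost of the user having to learn the compose tooling as well as plain Docker.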
That's (compose) really more appropriate for services (e.g., database, web server, regular application/server) and not containers that have scientific purposes. This is an argument that people can go back and forth on for a long time. For most data science use cases, in which large data is kept outside and we are talking about different scripts/software for an analysis, it's better for reproducibility to put everything into one container. The single data scientist who is just getting used to learning how to use one container should not be forced to learn orchestration. Take a look at bids-apps: each is a separate container, yes, but you don't find different tools for mriqa / quality checking scattered among different containers. It's separated based on 1) maintainership, 2) function, and 3) logical grouping based on the software provided. Building a single container is not an "unsustainable practice." You must live in this idealistic perfect world where people care about resource usage. As someone who is in research computing and has managed a container registry for almost 4 years now, say hello from the real world for me! lol :)
@psychemedia I think I'm with @vsoch here; in my words: the target audience and use case do not benefit from the decomposition enough to include this topic in these ten rules. Maybe we should write an addendum of Five Rules for writing composable containers and multi-image environments with docker-compose. @psychemedia @vsoch If either of you feels like phrasing your perspective within the text, please go ahead and open a PR. I won't hold up publication of the preprint for it, though.
So do you need a PR for the other issue or not? |
Some issues that maybe aren't addressed, or not addressed in detail?
- repo2docker
- issue on the need for speed