Unaddressed issues #18
Something I don't think I've ever noticed rubrics for: producing tests for container builds.
@psychemedia Trying to catch up on all the issues :-)
Re: optimisation - out of scope: agreed, though is it worth mentioning that it is a thing, but a thing that is out of scope?

Re: multi-stage builds: this relates to memory/size efficiency/optimisation. Would it be worth mentioning, perhaps alongside unaddressed optimisation issues, that in some contexts it makes sense to leave artefacts in, but if you did have to optimise, or even obfuscate, or even keep secret, some things in a final container, there are (out of scope) approaches available for this, such as multi-stage builds?

Just on the point of leaving artefacts in: 1) if you did do a multi-stage build, the evidence for that is still in the Dockerfile; 2) the presence of build artefacts may add confusion to the working environment (e.g. if you used a complex build environment to build one thing, it could leave bits around that you might mistake as part of the runtime for something else in the final environment).
@psychemedia Can you please see into which rule multi-stage builds fit best? I agree it's worth noting them, if only so as not to confuse researchers when they come across them in the wild.
Maybe in step 2, where mention is made of using FROM? My understanding of multi-stage builds is that you can essentially just extract a layer containing an application and not have to retain all the scaffolding layers used to construct it. The original Dockerfile will contain the reproducible script for generating the layer, but the final container will only have the functional layer, and not the build tools required to build it.

It might also work in rule 1, if rule 1 were broadened to something like Consider tools to assist with Dockerfile generation or examples from pre-existing Dockerfiles. The idea being that if folk build "compartmentalised" Dockerfiles, with clear sections identifying how to build/configure particular applications (e.g. a Jupyter notebook chunk, a database chunk, a chunk that sets things up for working with a GPU, etc.), you might be able to reuse them. When it comes to the multi-stage build, your chunk sets up the scaffolding tools and builds the application you're interested in, and then the final image just contains the application layer and not the scaffolding tools required to build it.
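To make the pattern described above concrete, here is a minimal sketch of a multi-stage Dockerfile (all image tags, paths, and the application name are hypothetical examples, assuming a compiled Go application): the first stage holds the scaffolding, and only the built artifact is copied into the final image.

```dockerfile
# Stage 1: build environment with all the scaffolding tools
# (image tags and paths here are illustrative, not prescriptive)
FROM golang:1.21 AS builder
WORKDIR /src
COPY . .
RUN go build -o /out/myapp ./cmd/myapp

# Stage 2: the final image keeps only the built artifact,
# not the compiler or build dependencies from the first stage
FROM debian:bookworm-slim
COPY --from=builder /out/myapp /usr/local/bin/myapp
ENTRYPOINT ["myapp"]
```

Note that, as mentioned above, the full build recipe remains visible in the Dockerfile even though the scaffolding stage is absent from the final image.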
@vsoch What do you think about the issues raised here?
I think that multi-stage builds are useful enough to warrant a mention, e.g.,
But for the purposes of this paper, we should leave the detail at that. For some data scientists, the idea of a multi-stage build will directly contradict one of their goals (reproducibility), and for others it might be essential simply to use the container at runtime (if it's otherwise too big). It's also not correct to match a multi-stage build with a specific layer: from the user's perspective, they are targeting a file or folder from some layer (Docker refers to these as artifacts in the docs) in a previous container that they want to retain for their updated image. A few comments on the points:
The linter is nice because it suggests a change and also tells you why. The user could learn something in real practice, vs. having their eyes glaze over reading a paper that tries to list them all in one swoop (which we couldn't even do). So my $0.02 is to add one or two lines mentioning multi-stage builds, and link the interested user to resources to learn more.
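As an illustration of the suggest-and-explain behaviour described above (the snippet itself is a hypothetical example; hadolint is one widely used Dockerfile linter, and DL3008 is one of its rule codes):

```dockerfile
# A linter such as hadolint would flag this line (DL3008),
# explaining that unpinned package versions make builds non-reproducible:
RUN apt-get update && apt-get install -y curl

# A suggested, more reproducible form (the pinned version number
# is an illustrative placeholder, not a recommendation):
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl=7.88.1-10 \
    && rm -rf /var/lib/apt/lists/*
```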
@vsoch Re. multi-stage builds, IMO this latest change covers that adequately: https://github.com/nuest/ten-simple-rules-dockerfiles/blob/master/ten-simple-rules-dockerfiles.Rmd#L286 I've added linters, so will close this.
Re:
How about: Where image size is a concern, consider using multi-stage builds [@docker_multi-stage_2020] to separate and remove packages only required to build, and not run, the final application. To manage container complexity, consider separating applications into their own containers created from a common root, then composing them as a set of linked containers (cf. @gruening_recommendations_2019). I also note the preceding
How about: In general, do not worry about image size when building images for use in data science, because (a) the images are unlikely to be pulled frequently or used to launch large numbers of containers over short periods of time, (b) your data is likely to have much larger storage requirements than the software, and (c) transparency and inspectability of the image configuration outweigh storage concerns.

Apols for my not being more engaged with this paper; I tend to be an outlier in the way I appropriate and use a lot of emerging tech in ways it's often not really meant/intended for...
I think the way it is phrased now is clean and direct, and we don't need to add to it. Specifically:
is also sort of off, because a multi-stage build isn't removing packages; it's adding files/folders from previous layers and whiting out the rest. It's a selection to add to another base, and not a removal, technically. I also don't think this is the right advice:
Many times it's "better" to have multiple co-dependent software installs kept together in the same container. Other times you can use the scientific filesystem. I've seen more successful containers (meaning they work as expected after some time) that don't try to separate and link. I think we are better off not making such a biased statement here, because it really depends.
We can only wish! This led to near disaster for Singularity Hub before I put pull / interaction limits on it :) |
@vsoch I remember the original Docker advice was one thing per container, using things like docker-compose to create more complex environments. Elsewhere, it would be interesting to trace the evolution of that working practice compared to stuffing everything into one container. A couple of advantages of compose:
Decomposing things into separate containers means they are lighter elements for use in a pipeline. This sort of feeds into the question of whether the reproducible environment should be a single container, or a reproducible combination of composed or pipelined containers. In that case, you need good practice both for the individual container definitions and for the composition/pipeline definition. But if you are building containers for use in pipelines/compositions, then best practice may well be to isolate, minimise size, etc.?

As to the frequency with which images are pulled: if containers being pulled a lot is an issue, then shouldn't there be a recommendation about minimising image size to limit resource usage? Reproducibility should also be mindful of sustainability. If your reproducible workflow is built around an unsustainable practice, it won't be reproducible for long...
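A minimal sketch of the composition pattern being argued for here (service names, images, and the password value are all hypothetical placeholders), linking a notebook container to a database container with docker-compose:

```yaml
# docker-compose.yml -- illustrative two-container environment
version: "3.8"
services:
  notebook:
    image: jupyter/base-notebook   # analysis environment
    ports:
      - "8888:8888"
    depends_on:
      - db
  db:
    image: postgres:15             # data store kept in its own container
    environment:
      POSTGRES_PASSWORD: example   # placeholder; do not use in practice
```

Each service stays an independently reusable element, at the cost of the user having to learn the compose tooling as well as plain Docker.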
That's (compose) really more appropriate for services (e.g., database, web server, regular application/server) and not containers that have scientific purposes. This is an argument that people can go back and forth on for a long time. For most data science use cases, in which large data is kept outside and we are talking about different scripts/software for an analysis, it's better for reproducibility to put everything into one container. The single data scientist who is just getting used to learning how to use one container should not be forced to learn orchestration. Take a look at bids-apps: each is a separate container, yes, but you don't find different tools for mriqa / quality checking scattered among different containers. It's separated based on 1) maintainership, 2) function, and 3) logical grouping based on the software provided. Building a single container is not an "unsustainable practice." You must live in this idealistic perfect world where people care about resource usage. As someone who is in research computing and has managed a container registry for almost 4 years now, say hello from the real world for me! lol :)
@psychemedia I think I'm with @vsoch here; in my words: the target audience and use case do not benefit from the decomposition enough to include this topic in these ten rules. Maybe we should write an addendum of Five Rules for writing composable containers and multi-image environments with docker-compose. @psychemedia @vsoch If either of you feels like phrasing your perspective within the text, please go ahead and open a PR. I won't hold up publication of the preprint for it, though.
So do you need a PR for the other issue or not? |
Some issues that maybe aren't addressed, or not addressed in detail?
- repo2docker
- issue on the need for speed